3:25-cv-05666

Sanas AI Inc v. Krisp Technologies Inc

I. Executive Summary and Procedural Information

  • Parties & Counsel:
  • Case Identification: 3:25-cv-05666, N.D. Cal., 09/22/2025
  • Venue Allegations: Venue is alleged as proper in the Northern District of California because Defendant Krisp Technologies, Inc. maintains a permanent and continuous presence and a regular and established place of business in the district.
  • Core Dispute: Plaintiff alleges that Defendant’s AI-powered accent conversion software infringes six patents related to real-time speech modification technology.
  • Technical Context: The technology at issue involves using machine learning models to alter a speaker's accent in real-time during voice communications, a field with significant applications in global customer support and communications.
  • Key Procedural History: The complaint alleges that Defendant engaged in partnership discussions with Plaintiff under a non-disclosure agreement, gained access to Plaintiff's confidential and proprietary accent conversion technology, secretly developed a competing product during those discussions, and subsequently launched a "copycat" product.

Case Timeline

Date Event
2021-05-06 ’550 Patent Priority Date
2021-09-07 Krisp initiates contact with Sanas
2021-11-17 Parties execute a Non-Disclosure Agreement
2022-01-10 ’457 and ’561 Patents Priority Date
2022-11-04 Krisp terminates partnership discussions with Sanas
2023-04-27 Krisp announces "Krisp AI Accent Conversion" product
2023-05-05 ’496 Patent Priority Date
2023-06-27 ’745 Patent Priority Date
2023-07-21 Krisp files provisional patent applications that led to its ’609 and ’979 patents
2023-08-01 U.S. Patent No. 11,715,457 issues
2024-04-02 U.S. Patent No. 11,948,550 issues
2024-08-01 ’756 Patent Priority Date
2024-10-22 U.S. Patent No. 12,125,496 issues
2024-10-29 U.S. Patent No. 12,131,745 issues
2025-03-25 Krisp launches "Krisp AI Accent Conversion v3"
2025-09-09 U.S. Patent No. 12,412,561 issues
2025-09-16 U.S. Patent No. 12,417,756 issues
2025-09-22 Complaint Filing Date

II. Technology and Patent(s)-in-Suit Analysis

U.S. Patent No. 11,948,550 - "REAL-TIME ACCENT CONVERSION MODEL," issued April 2, 2024

The Invention Explained

  • Problem Addressed: The patent identifies two key deficiencies in prior art accent modification techniques. First, "voice conversion" methods that only adjust audio characteristics like pitch and intonation fail to address differences in pronunciation (e.g., "th-stopping"). Second, methods that rely on speech-to-text followed by text-to-speech (STT-TTS) introduce significant latency, making them unsuitable for real-time conversations ('550 Patent, col. 1:45-68).
  • The Patented Solution: The invention proposes a system that uses a first machine-learning algorithm (akin to an Automatic Speech Recognition engine) to derive a "non-text linguistic representation" from the source speech. This representation, which captures the phonetic content without converting it to text, is then fed to a second machine-learning algorithm (a voice conversion engine) that synthesizes new audio data in the target accent. This approach is designed to be low-latency and capable of changing phoneme pronunciation ('550 Patent, Abstract; col. 2:15-32; col. 7:1-31).
  • Technical Importance: The described solution aims to make real-time, low-latency accent conversion practical for live communications by avoiding the high-latency STT-TTS pipeline while still addressing fundamental pronunciation differences between accents ('550 Patent, col. 2:8-14).
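
For orientation, the two-stage pipeline described above can be sketched in Python. This is an illustrative toy only: random matrices stand in for the trained models, and nothing here reconstructs either party's actual implementation. The point is the data flow the '550 patent describes, in which a phoneme-level (but non-text) representation bridges the two models with no transcription step.

```python
import numpy as np

rng = np.random.default_rng(0)

PHONEMES = ["th", "t", "ah", "s"]  # toy phoneme inventory

def extract_linguistic_representation(frames: np.ndarray) -> np.ndarray:
    """Stage 1 (stand-in for the trained ASR-like model): map each
    acoustic frame to a phoneme posterior vector -- a non-text,
    phoneme-level representation; no text transcription is produced."""
    logits = frames @ rng.normal(size=(frames.shape[1], len(PHONEMES)))
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)  # per-frame posteriors

def synthesize_target_accent(posteriors: np.ndarray) -> np.ndarray:
    """Stage 2 (stand-in for the voice-conversion model): generate
    audio features in the target accent directly from the posteriors,
    skipping the text intermediate that makes STT-TTS high-latency."""
    return posteriors @ rng.normal(size=(len(PHONEMES), 16))

frames = rng.normal(size=(10, 16))  # 10 fake acoustic frames
rep = extract_linguistic_representation(frames)
out = synthesize_target_accent(rep)
```

Because stage 2 consumes the posteriors frame by frame, the sketch also illustrates why such a design can run with low latency relative to a full speech-to-text round trip.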

Key Claims at a Glance

  • The complaint asserts independent claims 1 and 19. The essential elements of Claim 1 include:
    • Training a first machine-learning algorithm with audio data from multiple speakers having a first accent.
    • Applying the trained first algorithm to received speech content to derive a "non-text linguistic representation" of the phonemes.
    • Synthesizing new audio data in a second accent using a second machine-learning algorithm, based on the derived linguistic representation.
    • Wherein synthesizing involves "mapping at least a first non-text linguistic representation of a first phoneme... to a second non-text linguistic representation of a second phoneme... wherein the first and second phonemes are different."
    • Converting the synthesized audio data into the final output speech.
  • The complaint reserves the right to assert other claims, including dependent claims (Compl. ¶165).
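
The "different phonemes" limitation in the fourth element above can be made concrete with a toy mapping such as "th-stopping," where /th/ in the source accent is realized as /t/ in the target accent. The mapping table below is hypothetical, invented for illustration, and is not drawn from either party's system or from the claim itself.

```python
# Hypothetical source-phoneme -> target-phoneme table illustrating the
# claim's requirement that a representation of one phoneme be mapped to
# a representation of a *different* phoneme (e.g. "th-stopping").
ACCENT_PHONEME_MAP = {"th": "t", "v": "b"}

def map_phonemes(phoneme_sequence):
    """Replace each source-accent phoneme with its target-accent
    counterpart; phonemes without an entry pass through unchanged."""
    return [ACCENT_PHONEME_MAP.get(p, p) for p in phoneme_sequence]

print(map_phonemes(["th", "ih", "s"]))  # ['t', 'ih', 's']
```

In the claimed system the mapping operates on non-text linguistic representations rather than phoneme symbols, but the identity-changing substitution is the same idea.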

U.S. Patent No. 12,125,496 - "METHODS FOR NEURAL NETWORK-BASED VOICE ENHANCEMENT AND SYSTEMS THEREOF," issued October 22, 2024

The Invention Explained

  • Problem Addressed: The patent background describes that while many systems focus on removing background noise, they are less effective when the speech itself is unclear due to characteristics like "slurring, mumbling, or being too quiet." Existing methods can also distort speech features necessary for comprehension ('496 Patent, col. 1:29-41).
  • The Patented Solution: The invention describes a two-network system for voice enhancement. A first neural network processes input audio frames and converts them into a "low-dimensional representation" that specifically "omit[s] one or more of the non-content elements" (e.g., background noise). A second neural network then takes this cleaned, low-dimensional representation and generates "target speech frames," which are combined to create the enhanced output audio ('496 Patent, Abstract).
  • Technical Importance: This approach seeks to improve speech clarity by explicitly separating and omitting "non-content elements" like noise from the core speech signal before regenerating the final audio, aiming to preserve the integrity of the desired speech ('496 Patent, col. 2:49-57).
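
The '496 patent's fragment/convert/generate/combine sequence can be sketched as follows. This is a minimal toy in which random matrices stand in for the two trained neural networks; it shows only the claimed data flow through a low-dimensional bottleneck, not any actual trained behavior or either party's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
FRAME, LOWDIM = 32, 8  # frame and bottleneck sizes (arbitrary)

# Random matrices stand in for the two trained neural networks.
ENCODER = rng.normal(size=(FRAME, LOWDIM))  # first network
DECODER = rng.normal(size=(LOWDIM, FRAME))  # second network

def enhance(audio: np.ndarray) -> np.ndarray:
    # 1. Fragment the input audio into input speech frames.
    frames = audio.reshape(-1, FRAME)
    # 2. First network: convert each frame to a low-dimensional
    #    representation; the bottleneck is where a trained system
    #    would omit non-content elements such as background noise.
    low_dim = np.tanh(frames @ ENCODER)
    # 3. Second network: generate target speech frames from the
    #    low-dimensional representations.
    target = low_dim @ DECODER
    # 4. Combine the target frames into the output audio data.
    return target.reshape(-1)

out = enhance(rng.normal(size=FRAME * 4))
```

The bottleneck structure is why the claim language speaks of a representation that "omits" elements: information discarded at step 2 simply cannot reappear in the frames generated at step 3.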

Key Claims at a Glance

  • The complaint asserts independent claim 1. Its essential elements include:
    • Fragmenting input audio data (comprising foreground speech content, non-content elements, and speech characteristics) into a plurality of input speech frames.
    • Converting the input speech frames to "low-dimensional representations" using a first neural network, where these representations "omit one or more of the non-content elements."
    • Applying a second neural network to these low-dimensional representations to generate "target speech frames."
    • Combining the target speech frames to generate output audio data that includes portions of the original foreground speech content and speech characteristics.
  • The complaint reserves the right to assert other claims (Compl. ¶179).

U.S. Patent No. 12,131,745 - "SYSTEM AND METHOD FOR AUTOMATIC ALIGNMENT OF PHONETIC CONTENT FOR REAL-TIME ACCENT CONVERSION," issued October 29, 2024

  • Technology Synopsis: This patent addresses the technical challenge of aligning phonetically dissimilar audio between two different accents. The invention claims a method for determining an alignment by maximizing a cosine distance between phonetic embedding vectors of a source accent and transformed vectors representing a target accent, enabling more accurate real-time conversion ('745 Patent, Abstract).
  • Asserted Claims: The complaint asserts claims 1-20 (Compl. ¶192).
  • Accused Features: The complaint alleges that Krisp's marketing materials and technical diagrams, which describe and depict a step of "utterance alignment," practice the claimed invention (Compl. ¶193).
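
The alignment task the '745 patent addresses can be illustrated with a toy embedding-matching routine. This sketch simply scores source-accent embeddings against target-accent embeddings by cosine similarity and picks the best match per frame; the patent's actual claims recite a specific optimization over transformed vectors (the synopsis describes maximizing a cosine distance), which is not reproduced here.

```python
import numpy as np

def cosine_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_n @ b_n.T

def align(source_emb: np.ndarray, target_emb: np.ndarray) -> np.ndarray:
    """For each source-accent embedding, pick the index of the
    best-scoring target-accent embedding."""
    return cosine_matrix(source_emb, target_emb).argmax(axis=1)

rng = np.random.default_rng(2)
src = rng.normal(size=(5, 12))  # 5 source-accent phonetic embeddings
tgt = rng.normal(size=(7, 12))  # 7 target-accent phonetic embeddings
idx = align(src, tgt)           # one target index per source frame
```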

U.S. Patent No. 11,715,457 - "REAL TIME CORRECTION OF ACCENT IN SPEECH AUDIO SIGNALS," issued August 1, 2023

  • Technology Synopsis: This patent describes a system for real-time accent correction that processes a speech signal in "chunks," extracts acoustic and linguistic features, and uses a synthesis module with a speaker embedding to generate an output with a reduced accent ('457 Patent, Abstract). Claim 14 recites an apparatus with a latency between 40 and 300 milliseconds.
  • Asserted Claims: The complaint asserts claims 1-20 (Compl. ¶205).
  • Accused Features: The complaint points to Krisp's "Voice Profiles" feature and a blog post claiming an "audio latency of 220ms" as evidence of infringement (Compl. ¶¶206-208). A screenshot from a Krisp user guide shows the "Voice Profiles mode" which replaces a user's voice with a pre-configured one (Compl. p. 43).
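
To see why chunked processing and latency figures are linked, consider the floor that chunking alone places on output delay: a sample cannot be emitted until its whole chunk (plus any lookahead the model uses) has arrived. The chunk and lookahead sizes below are hypothetical, chosen only so the arithmetic lands on the 220 ms figure the complaint cites; they are not taken from either party's product or from the '457 claims.

```python
# Hypothetical parameters for illustrating chunk-induced latency.
SAMPLE_RATE = 16_000  # samples per second
CHUNK = 2_048         # samples processed per step
LOOKAHEAD = 1_472     # future samples the model may peek at

def algorithmic_latency_ms(chunk: int, lookahead: int, rate: int) -> float:
    """Minimum output delay imposed by chunking: the time it takes for
    a chunk plus its lookahead to arrive at the given sample rate."""
    return 1000.0 * (chunk + lookahead) / rate

latency = algorithmic_latency_ms(CHUNK, LOOKAHEAD, SAMPLE_RATE)
print(f"{latency:.0f} ms")
```

Any compute time adds on top of this floor, which is why claim ranges such as 40-300 ms constrain both chunk size and model design.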

U.S. Patent No. 12,412,561 - "REAL TIME CORRECTION OF ACCENT IN SPEECH AUDIO SIGNALS," issued September 9, 2025

  • Technology Synopsis: This patent is related to the '457 Patent and similarly describes a system for real-time accent correction. It includes claims directed to an apparatus that processes speech signals and offers different output sample rates (e.g., 8 kHz, 16 kHz, 32 kHz) ('561 Patent, col. 23:1-24:67, Claim 16).
  • Asserted Claims: The complaint asserts claims 1-18 (Compl. ¶245).
  • Accused Features: The complaint alleges that Krisp's "Voice Profiles" feature and its advertised offering of various voice quality outputs (8kHz, 16kHz, and 32kHz) infringe the patent (Compl. ¶¶246-248).
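
The three advertised output rates correspond to standard narrowband (8 kHz), wideband (16 kHz), and super-wideband (32 kHz) audio. A minimal sketch of producing lower-rate outputs from a 32 kHz signal by integer decimation follows; this illustrates only the notion of multiple output sample rates, and says nothing about how the accused product actually resamples (a real system would low-pass filter before decimating to avoid aliasing).

```python
import numpy as np

def downsample(audio: np.ndarray, src_hz: int, dst_hz: int) -> np.ndarray:
    """Naive decimation by an integer factor (no anti-aliasing filter,
    so this is illustrative only)."""
    assert src_hz % dst_hz == 0
    return audio[:: src_hz // dst_hz]

one_second = np.zeros(32_000)                       # 1 s of audio at 32 kHz
wideband = downsample(one_second, 32_000, 16_000)   # 16 kHz output
narrowband = downsample(one_second, 32_000, 8_000)  # 8 kHz output
```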

U.S. Patent No. 12,417,756 - "SYSTEMS AND METHODS FOR REAL-TIME ACCENT MIMICKING," issued September 16, 2025

  • Technology Synopsis: This patent describes a system for modifying a second user's speech to mimic the accent of a first user while preserving the second user's "natural voice." The system involves extracting accent features from the first user's speech and applying them to the second user's speech ('756 Patent, Abstract).
  • Asserted Claims: The complaint asserts claims 1-20 (Compl. ¶257).
  • Accused Features: The complaint alleges that Krisp's user interface, which offers both a "Voice Preservation mode" and "Voice Profiles," infringes the claims of the patent (Compl. ¶259).

III. The Accused Instrumentality

Product Identification

  • The accused products are software marketed by Krisp as "AI Accent Conversion" and "Accent Localization," with a specific version identified as "Krisp AI Accent Conversion v3" (Compl. ¶¶7, 157).

Functionality and Market Context

  • The complaint alleges the accused products are designed for real-time accent conversion, primarily for business-process outsourcing (BPO) and call centers (Compl. ¶166). The product is described as "softening accents while preserving the speaker's voice for authenticity" (Compl. ¶166). Technical descriptions cited in the complaint state the product uses "high-quality parallel data to directly map input accented speech to target native speech" and employs processes of "feature extraction" and "utterance alignment" (Compl. ¶¶167-168). A screenshot of Krisp's website shows the product's marketing claims and user interface (Compl. p. 35). The product also allegedly offers user-selectable "Voice Profiles" to replace a speaker's voice with a pre-configured one and a "Voice Preservation mode" that retains the user's natural voice (Compl. ¶259).

IV. Analysis of Infringement Allegations

'550 Patent Infringement Allegations

Claim Element (from Independent Claim 1) | Alleged Infringing Functionality | Complaint Citation | Patent Citation
apply the first machine-learning algorithm to speech content received via at least one microphone... to derive a non-text linguistic representation of the set of phonemes... | The accused product performs "Feature extraction" on source accented speech to create an intermediate representation for further processing. | ¶168 | col. 7:20-31
synthesize, using a second machine-learning algorithm... audio data representative of the received speech content having the second accent... | The accused product's technical approach uses "high-quality parallel data" to map input speech to a target accent, employing a "Speech generation NN" for synthesis. | ¶¶167, 168 | col. 7:48-56
wherein synthesizing the fourth audio data comprises mapping at least a first non-text linguistic representation of a first phoneme... to a second non-text linguistic representation of a second phoneme... wherein the first and second phonemes are different... | The accused product's function of "softening accents" and "accent conversion" is alleged to inherently require changing phoneme pronunciations, which constitutes the claimed mapping of different phonemes. | ¶166 | col. 8:12-25
  • Identified Points of Contention:
    • Scope Questions: A central question may be whether Krisp’s "Feature extraction" process, as depicted in a technical diagram provided in the complaint (Compl. p. 36), results in a "non-text linguistic representation" as required by the claim. The dispute may focus on whether Krisp's intermediate data structure is merely acoustic or truly "linguistic" (i.e., containing phoneme-level information).
    • Technical Questions: The analysis may turn on evidence of the specific mechanism Krisp uses for conversion. It raises the question of whether the accused system performs a direct phoneme-to-different-phoneme mapping, or if it achieves accent modification through a different transformation that may not read on the claim's specific "mapping" limitation.

'496 Patent Infringement Allegations

Claim Element (from Independent Claim 1) | Alleged Infringing Functionality | Complaint Citation | Patent Citation
convert the input speech frames to low-dimensional representations... wherein the low-dimensional representations of the input speech frames omit one or more of the non-content elements... | The accused product provides "Background noise and voice cancellation robustness," which is alleged to function by creating a representation of speech that omits noise, a "non-content element." | ¶182 | col. 8:5-10
apply a second neural network to the low-dimensional representations of the input speech frames to generate target speech frames... | Krisp's blog describes a "speech synthesis part of the model" that is "robust against noise and background voices," suggesting a synthesis step that operates on a data representation from which noise has already been removed or omitted. | ¶181 | col. 8:11-18
combine the target speech frames to generate output audio data... includ[ing] one or more portions of the foreground speech content and one or more of the speech characteristics. | The accused product is alleged to generate enhanced speech that preserves the speaker's voice, thereby including the "foreground speech content" and "speech characteristics" as required. | ¶166 | col. 8:19-25
  • Identified Points of Contention:
    • Scope Questions: The case may hinge on the definition of "omit." The question is whether Krisp's noise cancellation technology creates a new, smaller "representation" from which noise information is absent, or if it uses a more traditional filtering or subtraction method that might not be considered "omitting" an element from a representation.
    • Technical Questions: What evidence does the complaint provide that Krisp’s system uses a two-network architecture where the first creates a low-dimensional representation and the second generates new frames? The infringement theory suggests this architecture, but the dispute will require technical evidence of the accused product's actual implementation.

V. Key Claim Terms for Construction

For the ’550 Patent

  • The Term: "non-text linguistic representation"
  • Context and Importance: This term defines the crucial intermediate data format that allows the system to be low-latency (by avoiding full text conversion) while still being capable of phoneme-level modification. Whether Krisp's "feature extraction" output meets this definition will be a primary point of contention.
  • Intrinsic Evidence for Interpretation:
    • Evidence for a Broader Interpretation: The specification suggests the Automatic Speech Recognition (ASR) engine "may break down the received speech content... and classify each frame according to the sounds (e.g., monophones and triphones) that are detected," implying the representation is rooted in phonetic units ('550 Patent, col. 7:20-27). This may support a construction covering any data structure that encodes phoneme-level information, even if not a complete transcription.
    • Evidence for a Narrower Interpretation: A party could argue that the term requires a structured, symbolic representation of phonemes, as opposed to a mere vector of acoustic features that implicitly contains phonetic information. The patent's abstract distinguishes its derived "linguistic representation" from purely acoustic adjustments, which may support an argument for a more constrained definition.

For the ’496 Patent

  • The Term: "low-dimensional representation... [that] omit[s] one or more of the non-content elements"
  • Context and Importance: This limitation defines the mechanism for noise removal. The infringement analysis depends on whether the accused product's process of handling noise can be characterized as creating a new data representation from which noise is "omitted."
  • Intrinsic Evidence for Interpretation:
    • Evidence for a Broader Interpretation: The abstract states the system converts frames to representations that "omit one or more of the non-content elements," which could be interpreted to cover any process that results in a "clean" data representation fed to the next stage, regardless of the specific mathematical operation used ('496 Patent, Abstract).
    • Evidence for a Narrower Interpretation: Claim 1 recites a sequence of "fragment[ing]," then "convert[ing]" to a representation that "omits." This language may support a narrower construction requiring a distinct conversion step that produces a new data object lacking the "non-content elements," as opposed to an in-place filtering or signal subtraction process.

VI. Other Allegations

  • Indirect Infringement: The complaint alleges that Krisp induces infringement by "encouraging, instructing, and aiding" customers and end-users to purchase and use the accused products in a manner that infringes the asserted patents (Compl. ¶¶171, 172, 184, 185).
  • Willful Infringement: Willfulness is a central allegation, based on Krisp's alleged pre-suit knowledge of Sanas's technology and patents. The complaint alleges that Krisp obtained deep technical knowledge of Sanas's products under an NDA during partnership discussions, which it then used to build a competing product (Compl. ¶¶79, 174). Further, it alleges that Krisp's own patents cite Sanas's published patent applications, suggesting direct knowledge of the patented technology (Compl. ¶160).

VII. Analyst’s Conclusion: Key Questions for the Case

  • A key evidentiary question will be one of technological provenance: Does discovery show that Krisp's accent conversion technology was developed independently, or will evidence from the parties' prior discussions under NDA demonstrate that Krisp's product is based on confidential information and patented methods disclosed by Sanas? The answer will be critical for both infringement and willfulness.
  • A core issue will be one of technical mechanism: Does Krisp's "feature extraction" and noise cancellation architecture operate in substantially the same way as claimed in the patents-in-suit? Specifically, does it create a "non-text linguistic representation" for phoneme mapping as claimed in the '550 patent, and does it "omit" non-content elements in a "low-dimensional representation" as claimed in the '496 patent?
  • The case may also turn on a question of definitional scope: Can the term "mapping... a first phoneme... to a... second phoneme... wherein the... phonemes are different" be construed to cover the specific transformation algorithm used in Krisp's accent conversion software, or does the accused product achieve its result through a method that falls outside this claimed functionality?