
1:24-cv-10008

Singular Computing LLC v. Google LLC

I. Executive Summary and Procedural Information

  • Parties: Plaintiff Singular Computing LLC; Defendant Google LLC
  • Case Identification: 1:24-cv-10008, D. Mass., 01/02/2024
  • Venue Allegations: Plaintiff alleges venue is proper in the District of Massachusetts because Defendant Google maintains a regular and established place of business in Cambridge, Massachusetts, with over 1,500 employees.
  • Core Dispute: Plaintiff alleges that Defendant’s Tensor Processing Unit (TPU) computer chips, specifically the TPUv4 and TPUv5 generations, infringe six patents related to novel computer architectures that utilize large numbers of low-precision processors to accelerate artificial intelligence workloads.
  • Technical Context: The technology involves specialized processors (ASICs) designed to improve computational efficiency for artificial intelligence and machine learning by trading high numerical precision for a massive increase in parallel operations.
  • Key Procedural History: The complaint notes that Plaintiff previously sued Google in the same district (Case No. 1:21-cv-12110) over two of the patents-in-suit ('616 and '775), a case that was dismissed without prejudice. It also alleges that Google's counsel filed Inter Partes Review (IPR) petitions against the '616 patent, a fact used to support allegations of pre-suit knowledge of the patent family and willfulness.

Case Timeline

Date Event
2009-06-19 Priority Date for all Patents-in-Suit (Provisional App. No. 61/218,691)
2020-Q1 Accused Product TPUv4i Deployed
2020-08-25 U.S. Patent No. 10,754,616 Issues
2021-11-09 U.S. Patent No. 11,169,775 Issues
2021-12-22 Prior Lawsuit (1:21-cv-12110) Filed
2022-05-10 U.S. Patent No. 11,327,714 Issues
2023-04-20 Prior Lawsuit (1:21-cv-12110) Dismissed
2023-09-26 U.S. Patent No. 11,768,659 Issues
2023-09-26 U.S. Patent No. 11,768,660 Issues
2023-12-12 U.S. Patent No. 11,842,166 Issues
2024-01-02 Complaint Filed

II. Technology and Patent(s)-in-Suit Analysis

U.S. Patent No. 11,327,714 - PROCESSING WITH COMPACT ARITHMETIC PROCESSING ELEMENT

The Invention Explained

  • Problem Addressed: The patent's background section describes the inefficiency of conventional computer architectures, which use billions of transistors to perform only a handful of high-precision operations per clock cycle, leaving most of the hardware's theoretical computing power inaccessible to software (Compl. ¶8; ’714 Patent, col. 1:30-62).
  • The Patented Solution: The invention proposes a heterogeneous computing architecture that solves this problem by integrating a massive number of small, power-efficient "low precision high dynamic range" (LPHDR) processing units with a much smaller number of traditional high-precision units. By trading unnecessary precision for massive parallelism, the architecture can perform significantly more calculations per second for workloads, such as AI, that can tolerate minor numerical errors. (’714 Patent, col. 2:1-15).
  • Technical Importance: This architectural approach provides a method for dramatically accelerating AI and other computationally intensive tasks that were becoming bottlenecked by the performance-per-watt limitations of conventional CPUs and GPUs (Compl. ¶10, ¶15).

Key Claims at a Glance

  • The complaint asserts at least independent claim 1 (Compl. ¶48).
  • Essential elements of claim 1 include:
    • A device with instruction memory and a silicon chip comprising a "plurality of first execution units" and a "second execution unit."
    • The "first execution units" perform low-precision multiplication, producing results that differ from an exact calculation by at least 0.05% for a defined portion of inputs, but operate over a high dynamic range.
    • The "second execution unit" performs traditional high-precision (at least 32-bit) multiplication.
    • The total number of "first execution units" exceeds the total number of "second execution units" by a wide margin (the claim recites "at least 100 more than five times").
    • Each of the "first execution units" is smaller than the "second execution unit."
    • The "first execution units" collectively perform at least "tens of thousands" of operations per clock cycle.
  • The complaint does not explicitly reserve the right to assert dependent claims for this patent.
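The X=10%/Y=0.05% limitation can be illustrated with a small simulation. This is an illustrative sketch only, not a reproduction of Singular's testing (whose methodology the complaint does not describe): it models bfloat16 conversion as simple truncation of the float32 bit pattern and samples inputs from an arbitrary interval.

```python
import random
import struct

def to_bf16(x: float) -> float:
    """Model bfloat16 conversion as truncation: keep the top 16 bits of
    the float32 encoding (1 sign, 8 exponent, 7 mantissa bits)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

def fraction_exceeding(threshold: float = 0.0005,
                       trials: int = 10_000, seed: int = 0) -> float:
    """Estimate the share of random input pairs whose low-precision
    product differs from the exact product by at least `threshold`
    (0.05%, the claim's Y value)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a, b = rng.uniform(1.0, 2.0), rng.uniform(1.0, 2.0)
        exact = a * b
        approx = to_bf16(a) * to_bf16(b)
        if abs(approx - exact) / abs(exact) >= threshold:
            hits += 1
    return hits / trials
```

Because truncating to a 7-bit mantissa introduces up to roughly 0.78% relative error per operand, most sampled pairs in this toy model differ from the exact product by more than 0.05%, comfortably above the claim's X=10% floor.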

U.S. Patent No. 10,754,616 - PROCESSING WITH COMPACT ARITHMETIC PROCESSING ELEMENT

The Invention Explained

  • Problem Addressed: The patent addresses the challenge of efficiently connecting a massively parallel processing array to a conventional host computer without creating a data input/output (I/O) bottleneck that would nullify the performance gains of the parallel architecture (Compl. ¶46; ’616 Patent, col. 8:45-60).
  • The Patented Solution: The invention describes a system comprising a host computer and a specialized computing chip. The chip contains a "processing element array" (PEA) of at least 5000 elements. An input/output unit connects the host to processing elements located only at the edge of this array. This structure allows the host to control the massively parallel array without needing to interface with every element directly, preventing the system from becoming I/O bound while minimizing long-distance data transfers within the chip. (’616 Patent, FIG. 1; col. 9:7-22).
  • Technical Importance: This architecture provides a practical blueprint for integrating a high-throughput, specialized co-processor with a general-purpose host, a foundational design for modern hardware accelerators (Compl. ¶46).

Key Claims at a Glance

  • The complaint asserts at least independent claim 7 (Compl. ¶64).
  • Essential elements of claim 7 include:
    • A computing system with a "host computer" and a "computing chip."
    • The chip comprises a "processing element array" of no less than 5000 "first processing elements."
    • A subset of these elements is positioned at an "edge" of the array, while another subset is in the "interior."
    • An "input-output unit" is connected to the subset of elements at the edge.
    • A "host connection" links the input-output unit with the host computer.
    • The processing elements contain multiplier circuits adapted to receive floating point values with a mantissa of no more than 11 bits and an exponent of at least 6 bits.
  • The complaint does not explicitly reserve the right to assert dependent claims for this patent.
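The bit-width limitation in the last element above can be checked against the standard layouts of common floating-point formats. The layout figures below are standard IEEE/bfloat16 facts; the claim paraphrase in the docstring is a summary, not the actual claim language.

```python
# (sign bits, exponent bits, explicitly stored mantissa bits) -- standard layouts
FORMATS = {
    "float32":  (1, 8, 23),
    "float16":  (1, 5, 10),   # IEEE half precision
    "bfloat16": (1, 8, 7),
}

def meets_claim_limits(fmt: str) -> bool:
    """Claim 7 as summarized above: mantissa width no more than 11 bits
    and exponent width at least 6 bits."""
    _, exp_bits, mantissa_bits = FORMATS[fmt]
    return mantissa_bits <= 11 and exp_bits >= 6
```

On these figures bfloat16 satisfies both limits, IEEE float16 fails the 6-bit exponent floor, and float32 fails the 11-bit mantissa ceiling, which is consistent with the complaint's focus on the bfloat16 data path.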

U.S. Patent No. 11,169,775 - PROCESSING WITH COMPACT ARITHMETIC PROCESSING ELEMENT

  • Technology Synopsis: This patent is similar to the ’616 Patent but further specifies the heterogeneous nature of the architecture. It explicitly claims both "first arithmetic units" (low-precision) and "second processing elements" containing "second arithmetic units" (high-precision) and requires that the transistors of the second multiplier circuits exceed in number those of the first. (’775 Patent, col. 33:55-34:52).
  • Asserted Claims: At least independent claim 7 is asserted (Compl. ¶79).
  • Accused Features: The complaint alleges that Google's TPUs, part of the Accused TPU Computing Systems, infringe by having Matrix Multiply Units (MXUs) as the claimed "first arithmetic units" and Vector Processing Units (VPUs) as the "second processing elements" and "second arithmetic units" (Compl. ¶81-82, ¶88).

U.S. Patent No. 11,768,659 - PROCESSING WITH COMPACT ARITHMETIC PROCESSING ELEMENT

  • Technology Synopsis: This patent claims a method for performing massive numbers of low-precision multiplications in a single clock cycle. It requires completing "at least tens of thousands of first multiplication operations" where the number of such operations is at least 1000 more than three times the maximum number of high-precision operations the chip can perform in a cycle. (’659 Patent, col. 31:1-24).
  • Asserted Claims: At least independent claim 1 is asserted (Compl. ¶93).
  • Accused Features: The complaint accuses the TPUv4 and TPUv5 chips, alleging they perform approximately 131,000 and 65,000 low-precision bfloat16 multiplication operations per clock cycle, respectively, far exceeding their high-precision capabilities (Compl. ¶96, ¶98).
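The '659 threshold is simple arithmetic and can be sketched as follows. The function is one reading of the claim as summarized above, and the 2,048 figure is the VPU ALU count the complaint alleges elsewhere (Compl. ¶56), used here only as a stand-in for the maximum high-precision operations per cycle.

```python
def meets_659_threshold(low_precision_ops_per_cycle: int,
                        max_high_precision_ops_per_cycle: int) -> bool:
    """One reading of '659 claim 1: at least tens of thousands of
    low-precision multiplications per cycle, AND at least 1000 more than
    three times the chip's maximum high-precision operations per cycle."""
    return (low_precision_ops_per_cycle >= 10_000 and
            low_precision_ops_per_cycle >= 3 * max_high_precision_ops_per_cycle + 1000)

# Alleged TPUv4 figure: ~131,000 bfloat16 multiplications per cycle.
# The threshold would be 3*2,048 + 1000 = 7,144.
tpuv4_meets = meets_659_threshold(131_000, 2_048)
```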

U.S. Patent No. 11,768,660 - PROCESSING ELEMENT WITH COMPACT ARITHMETIC PROCESSING ELEMENT

  • Technology Synopsis: This patent claims a device comprising a silicon chip with a plurality of execution units that jointly contain a "first plurality of custom silicon arithmetic elements" for low-precision multiplication. It requires that the total number of these low-precision elements vastly exceeds the number of "second custom silicon arithmetic elements" for high-precision multiplication. (’660 Patent, col. 31:1-32:2).
  • Asserted Claims: At least independent claim 1 is asserted (Compl. ¶104).
  • Accused Features: The complaint accuses the TPU devices, identifying the MXU Multiplier Circuits as the "first plurality of custom silicon arithmetic elements" and a subset of the VPU ALUs as the "second custom silicon arithmetic elements" (Compl. ¶108-109).

U.S. Patent No. 11,842,166 - PROCESSING ELEMENT WITH COMPACT ARITHMETIC PROCESSING ELEMENT

  • Technology Synopsis: This patent claims a device with an architecture highly similar to that claimed in the '714 patent. It requires a silicon chip with a large number of small, low-precision "first execution units" and a smaller number of larger, high-precision "second execution units." Unlike the '714 patent, it does not include the specific numerical error rate limitation. (’166 Patent, col. 29:30-30:48).
  • Asserted Claims: At least independent claim 1 is asserted (Compl. ¶114).
  • Accused Features: The infringement allegations are substantially similar to those for the '714 patent, mapping the MXU cells to the "first execution units" and VPU ALUs to the "second execution unit" (Compl. ¶116-121).

III. The Accused Instrumentality

Product Identification

  • The accused instrumentalities are Google's "Accused TPU Devices" (specifically, TPUv4 and TPUv5 chips), "Accused TPU Pods" (groups of TPU devices), and "Accused TPU Computing Systems" (Cloud TPU systems incorporating the accused chips) (Compl. ¶17, ¶48).

Functionality and Market Context

  • The complaint alleges that Google's TPUs are custom-designed integrated circuits (ASICs) that accelerate machine learning workloads for services like Google Translate, Photos, Search, and Assistant (Compl. ¶19, ¶48). A block diagram in the complaint illustrates the TPUv4i chip's architecture, showing components such as the TensorCore, Matrix Multiply Unit (MXU), and Vector Processing Unit (VPU) (Compl. p. 28). The core of the infringement allegation is that the TPUs employ a massively parallel architecture using numerous low-precision "multiply-accumulators" in their MXUs (which perform calculations using bfloat16 format) alongside a smaller number of high-precision vector units (VPUs) that handle float32 operations (Compl. ¶49, ¶52, ¶55). The complaint asserts Google adopted this architecture to overcome the limitations of conventional computers for its rapidly growing AI services (Compl. ¶15, ¶18-19).

IV. Analysis of Infringement Allegations

U.S. Patent No. 11,327,714 Infringement Allegations

Claim Element (from Independent Claim 1) | Alleged Infringing Functionality | Complaint Citation | Patent Citation
a silicon chip comprising a plurality of first execution units... adapted to execute a first operation of multiplication... | Each accused TPU is a silicon chip containing a plurality of "MXU Reduced Precision Multiply Cells," which are alleged to be the "first execution units" that perform multiplication at reduced bfloat16 precision. | ¶52 | col. 2:1-15
wherein the dynamic range... is at least as wide as from 1/1,000,000,000 through 1,000,000,000 and for each of at least X=10% of the possible valid inputs... the numerical value... differs by at least Y=0.05% from the result of an exact mathematical calculation... | The accused devices allegedly operate on float32 inputs, which have a dynamic range far exceeding the claim requirement. The conversion to bfloat16 for multiplication introduces numerical differences that Singular's testing allegedly shows meet the claimed error distribution. | ¶54 | col. 5:6-20
a second execution unit adapted to execute a second operation of traditional high-precision multiplication on floating point numbers that are at least 32 bits wide | Each TensorCore in an Accused TPU Device allegedly contains a Vector Processing Unit (VPU) with ALUs that handle float32 computations, which are asserted to be the "second execution units." | ¶55 | col. 2:10-15
wherein a total number of first execution units... exceeds, by at least 100 more than five times, a total number of execution units in the silicon chip adapted to execute the operation of traditional high-precision multiplication... | A TPUv4 chip allegedly has 131,072 "first execution units" (MXU cells) and 2,048 "second execution units" (VPU ALUs), a ratio far exceeding the claim's requirement. A TPUv5 chip is alleged to have a similarly high ratio. | ¶56 | col. 2:6-10
wherein each of the plurality of first execution units is smaller than the second execution unit | The bfloat16 multipliers in the MXU cells allegedly require "so much less circuitry" and consume less power than the FP32 multipliers in the VPU ALUs, citing publications by a Google engineer. | ¶57 | col. 6:49-54
wherein the plurality of first execution units are adapted to collectively perform, per cycle, at least tens of thousands of the first operation. | The TPUv4 is alleged to perform ~131,000 bfloat16 multiplication operations per clock cycle, and the TPUv5 ~65,000, both meeting the "tens of thousands" requirement. | ¶59 | col. 2:1-5
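The ratio limitation in the chart above reduces to simple arithmetic. The function below is one reading of "exceeds, by at least 100 more than five times"; the counts are those alleged in the complaint (Compl. ¶56), not independently verified figures.

```python
def meets_714_ratio(first_units: int, second_units: int) -> bool:
    """One reading of the '714 ratio limitation: the first-execution-unit
    count must exceed five times the second-unit count by at least 100."""
    return first_units > 5 * second_units + 100

# Alleged TPUv4 counts: threshold is 5*2,048 + 100 = 10,340,
# far below the alleged 131,072 MXU cells.
tpuv4_ok = meets_714_ratio(131_072, 2_048)
```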

Identified Points of Contention

  • Technical Questions: A primary technical question is how the "first operation of multiplication" is defined. The complaint alleges the accused TPUs take 32-bit float32 inputs, convert them internally to 16-bit bfloat16, perform the multiplication, and then accumulate results. The court will need to determine whether this entire process constitutes an "operation... on... first input signals" that are float32, as Singular alleges, or whether the operation is properly defined as only the bfloat16 multiplication step, which acts on different, internal signals.
  • Evidentiary Questions: The infringement case for the specific error rate ("differs by at least Y=0.05%") relies on "Singular test results" presented in the complaint (Compl. p. 39). A point of contention may be the source, methodology, and admissibility of this testing data to prove that the accused products meet this quantitative limitation.
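The conversion question above can be made concrete with a toy model of the alleged data path. The function names and the truncation-based conversion are illustrative assumptions, not a description of Google's actual circuits: float32 inputs are converted to bfloat16, multiplied at reduced precision, and accumulated at higher precision.

```python
import struct

def truncate_to_bf16(x: float) -> float:
    """Keep the top 16 bits of the float32 encoding (1 sign, 8 exponent,
    7 mantissa bits) -- a simple truncation model of bfloat16 conversion."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

def modeled_mxu_mac(a_inputs, b_inputs) -> float:
    """Hypothetical model of the accused data path: the multiply step acts
    on internally converted bfloat16 values, not on the original float32
    input signals -- the crux of the 'operation on' dispute."""
    acc = 0.0
    for a, b in zip(a_inputs, b_inputs):
        acc += truncate_to_bf16(a) * truncate_to_bf16(b)
    return acc
```

In this model, inputs that are exact powers of two pass through unchanged, while other inputs are altered before the multiplier ever sees them, which is why the parties dispute whether the claimed "operation" acts "on" the float32 signals at all.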

U.S. Patent No. 10,754,616 Infringement Allegations

Claim Element (from Independent Claim 7) | Alleged Infringing Functionality | Complaint Citation | Patent Citation
A computing system, comprising: a host computer; a computing chip comprising: | The "Accused TPU Computing Systems" are alleged to be the computing system, with the "TPU Host" (a physical computer) as the host computer and the Accused TPU Device as the computing chip. | ¶65-67 | col. 8:14-16
a processing element array comprising a plurality of first processing elements, wherein the plurality of first processing elements is no less than 5000 in number | Each MXU in an accused TPU chip is alleged to contain a 128x128 systolic array of "MXU Multiply Add Cells." With 16,384 such cells per MXU, this is alleged to meet the "no less than 5000" requirement. | ¶68 | col. 7:1-9
wherein each of a first subset of the plurality of first processing elements is positioned at a first edge of the processing element array, and wherein each of a second subset... is positioned in the interior... | The systolic array structure of the MXU is alleged to have interconnected processing elements arranged with some at the edge and others in the interior. The complaint references a visual of a systolic array to illustrate this structure. | ¶68 | col. 8:45-56
an input-output unit connected to each of the first subset of the plurality of first processing elements | Each TPU chip allegedly includes input-output units connected to the edge of the MXU's systolic array of processing elements, based on figures from a Google patent application. | ¶69 | col. 9:7-12
a plurality of memory units... wherein each of the plurality of memory units is local to its associated one of the plurality of first processing elements | Each "MXU Multiply Add Cell" allegedly has an associated local memory unit used to store weights or parameters, as depicted in a figure from a Google patent application. A diagram from this application showing a "weight shift" register (350a) feeding a multiplier is referenced. | ¶71, p. 36 | col. 9:40-44
wherein the plurality of arithmetic units each comprises a first corresponding multiplier circuit adapted to receive as a first input... a first floating point value having a first binary mantissa of width no more than 11 bits and a first binary exponent of width at least 6 bits... | The multiplier circuits within the MXUs allegedly operate on the bfloat16 format, which uses an 8-bit exponent (at least the claimed 6 bits) and a 7-bit stored mantissa (8 significand bits counting the implicit leading bit, within the claimed 11-bit limit). | ¶74 | col. 5:6-20

Identified Points of Contention

  • Evidentiary Questions: The complaint's allegations regarding the internal architecture of the MXU (e.g., edge vs. interior elements, local memory, connections) rely heavily on figures and descriptions from Google's own patent application ('165 application) which describes the TPUv2/v3 architecture (Compl. ¶52b, ¶68b, ¶69, ¶71). A key point of contention will be whether these disclosures about older products constitute admissions or accurate descriptions of the accused TPUv4 and TPUv5 products.
  • Scope Questions: The term "edge of the processing element array" may require construction. The dispute may turn on what constitutes an "edge" in a complex, three-dimensional systolic array and whether the accused product's I/O connections meet the specific arrangement required by the claim.

V. Key Claim Terms for Construction

"first execution unit" ('714 Patent) / "first processing element" ('616 Patent)

  • Context and Importance: The entire infringement theory depends on mapping these terms to Google's "MXU Reduced Precision Multiply Cells" or "MXU Multiply Add Cells." Practitioners may focus on this term because the patents contrast these numerous "first" units with a smaller number of "second" high-precision units. The definition will determine if Google's heterogeneous architecture reads on the claims.
  • Intrinsic Evidence for Interpretation:
    • Evidence for a Broader Interpretation: The specification describes these elements functionally as performing LPHDR arithmetic without being limited to one specific circuit design, suggesting the term could cover any circuit that performs the claimed low-precision, high-dynamic-range function (’616 Patent, col. 2:1-6).
    • Evidence for a Narrower Interpretation: The detailed description provides a specific example of a processing element including a particular LPHDR arithmetic unit, logic unit, and registers (’616 Patent, FIG. 4). A defendant might argue the term should be limited to structures possessing these disclosed components.

"a first operation of multiplication: on one or more first input signals" ('714 Patent)

  • Context and Importance: Practitioners may focus on this term because the accused TPUs allegedly receive 32-bit inputs but perform the core multiplication on internally-converted 16-bit values. The interpretation of whether the "operation" is defined by its external inputs or its internal mechanism is central to the infringement analysis.
  • Intrinsic Evidence for Interpretation:
    • Evidence for a Broader Interpretation: The patent's summary describes processing elements designed to "perform arithmetic operations ... on numerical values of low precision but high dynamic range," which could be read to encompass a high-level operation whose internal steps involve format conversion (’714 Patent, col. 2:1-6).
    • Evidence for a Narrower Interpretation: The claim language recites an operation "on one or more first input signals." A defendant may argue that if the "first input signals" are 32-bit, but the multiplier circuit itself acts only on 16-bit values, then the operation is not performed "on" the claimed input signals.

VI. Other Allegations

Indirect Infringement

  • The complaint focuses on allegations of direct infringement under 35 U.S.C. § 271(a) for each asserted patent and does not include separate counts for induced or contributory infringement.

Willful Infringement

  • The complaint alleges Google’s infringement is willful (Compl. ¶25). This allegation rests on alleged pre-suit knowledge stemming from extensive meetings, calls, and non-disclosure agreements between Singular's founder and Google representatives from 2010 to 2017, during which the patented technology was allegedly disclosed (Compl. ¶16). The complaint further grounds willfulness in the prior lawsuit involving two of the asserted patents and in Google's filing of IPR petitions against the patent family, both of which allegedly demonstrate knowledge (Compl. ¶26, ¶61).

VII. Analyst’s Conclusion: Key Questions for the Case

  • A core issue will be one of operational definition: does Google's practice of receiving 32-bit floating-point values and internally converting them to a 16-bit format for the multiplication step constitute an "operation on" the original 32-bit signals as claimed, or is this internal conversion a legally dispositive difference in technical operation?
  • A key evidentiary question will be one of product versus publication: to what extent can technical descriptions and figures from Google's own patent applications and academic papers, which describe prior-generation technology, be used to establish the specific internal architecture of the accused commercial TPUv4 and TPUv5 chips?
  • The case will also turn on a question of structural equivalence: do the distinct functional blocks within Google's TensorCore—specifically the massively parallel "MXU" systolic arrays and the more traditional "VPU" vector units—map cleanly onto the patents' claimed "first" (low-precision) and "second" (high-precision) execution units, or is there a fundamental architectural mismatch?