DeepSeek, a Chinese artificial intelligence firm, has developed an optical character recognition (OCR) system that compresses image-based text documents, allowing AI models to handle significantly longer contexts without running into memory limitations. The core premise is that processing text in image form can be more efficient than handling digital text tokens directly. DeepSeek's technical paper reports that the system can compress text roughly tenfold while retaining about 97% decoding accuracy.
The DeepSeek OCR system has two primary components: DeepEncoder, which analyzes the image, and a text generator built on the DeepSeek3B-MoE architecture, which has 570 million active parameters. Within this setup, DeepEncoder processes each image with its own 380 million parameters to produce a compressed output.
The encoder integrates Meta's Segment Anything Model (SAM) for image segmentation and OpenAI's CLIP model, which links image and text representations. A 16x token compressor then cuts the vision-token count dramatically, from thousands of tokens down to as few as 256 for a 1,024 by 1,024 pixel image.
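The token arithmetic behind that reduction can be sketched in a few lines of Python. Note that the 16-pixel patch size is an assumption typical of ViT-style encoders, not a figure from the article; the 16x compression factor and the 256-token result are quoted above.

```python
# Back-of-the-envelope token arithmetic (illustrative, not DeepSeek's code).
image_side = 1024          # image is 1,024 x 1,024 pixels
patch_side = 16            # assumed ViT-style patch size (not from the article)
compression_factor = 16    # the 16x token compressor described above

patches_per_side = image_side // patch_side           # 64 patches per side
raw_tokens = patches_per_side ** 2                    # 64 * 64 = 4,096 tokens
compressed_tokens = raw_tokens // compression_factor  # 4,096 / 16 = 256 tokens

print(raw_tokens, compressed_tokens)  # 4096 256
```

Under these assumptions, the compressor is what brings a thousands-of-tokens image representation down to the 256 figure cited in the paper.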
DeepSeek OCR adapts to a range of image resolutions, needing as few as 64 tokens for simple images and up to 400 for high-resolution inputs. Conventional OCR systems typically demand far more. In benchmark tests, DeepSeek OCR outperformed competing systems while using only about 100 vision tokens per page; MinerU 2.0, for example, requires more than 6,000 tokens for similar tasks.
The system handles not just plain text but also a range of complex document types, including diagrams, chemical formulas, and geometric figures, across roughly 100 languages. It can output files in various formats, preserve the original layout, and provide general image descriptions. To train the model, DeepSeek used a corpus comprising 30 million PDF pages in multiple languages, with a large share of Chinese and English documents.
In practical terms, DeepSeek OCR can process more than 200,000 pages per day on a single Nvidia A100 GPU, scaling to about 33 million pages daily on a cluster of 20 servers with eight GPUs each. This throughput could enable the rapid construction of training datasets for other AI models that need large amounts of text.
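Those throughput numbers hang together, as a quick sanity check shows (illustrative arithmetic only, using the figures quoted above):

```python
# Sanity-check the cluster throughput claimed in the article.
pages_per_gpu_per_day = 200_000   # "more than 200,000 pages" on one A100
servers = 20
gpus_per_server = 8

total_gpus = servers * gpus_per_server              # 160 GPUs in the cluster
pages_per_day = pages_per_gpu_per_day * total_gpus  # 32,000,000 pages

# 32M+ pages/day is consistent with the ~33 million figure, since the
# per-GPU rate is quoted as a lower bound ("more than 200,000").
print(pages_per_day)  # 32000000
```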
The researchers also propose using DeepSeek OCR to compress chatbot conversation histories, storing older exchanges at reduced resolutions, a method inspired by how human memory fades over time. Both the code and the model weights for DeepSeek OCR are publicly available, encouraging wider adoption and integration into other applications.
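As a rough illustration of what such tiered-resolution memory might look like, the sketch below is hypothetical: the function `tokens_for_turn` and its age thresholds are invented for this example, and only the 64-to-400 token range comes from the article.

```python
# Hypothetical sketch (not DeepSeek's published code) of tiered "optical"
# context compression: older conversation turns are rendered as images at
# progressively lower resolutions, so they cost fewer vision tokens.

def tokens_for_turn(age: int) -> int:
    """Assumed budget: recent turns get high resolution, old ones decay."""
    if age < 5:
        return 400   # recent turns: high-resolution rendering
    if age < 20:
        return 256   # mid-range turns: base resolution
    return 64        # old turns: low-resolution, lossy "faded memory"

history_ages = range(30)  # 30 turns in the history, 0 = newest
budget = sum(tokens_for_turn(a) for a in history_ages)
print(budget)  # 6480
```

The point of the scheme is that the total budget grows far more slowly than storing every turn at full resolution would (30 turns at 400 tokens each would cost 12,000).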