What We Learned Processing 10 Million Pages of User Documents

Michal Raczy

May 28, 2026 • 13 min read

Suparse has now processed more than 10 million pages of user documents into structured data. That volume includes clean digital PDFs, phone photos, scanned paperwork, long mixed PDFs, and document types that no standard OCR template was designed to understand.

The findings below come from aggregated 2026 Suparse internal telemetry, reviewed at the workflow level rather than drawn from customer document content. The lessons therefore reflect real usage patterns, not a controlled demo set.

The milestone matters because document extraction fails in patterns. A single invoice demo tells you very little. Ten million real pages tell you where users get stuck, which features become non-negotiable, and why "OCR accuracy" is only one part of a working automation system.

If you process documents into spreadsheets, APIs, accounting tools, or AI workflows, these are the lessons that should shape your buying criteria. For a hands-on accuracy test, start with our guide to testing extraction accuracy with free pages.

Key Takeaways

Suparse processed 10M+ user document pages into structured data.

The top categories were invoices, receipts, bank statements, custom documents, purchase orders, checks, and delivery notes.

Auto splitting, schema assignment, validation, API access, and pricing matter as much as OCR.

What Document Types Show Up Most Often at Scale?

In 2026, Suparse crossed 10 million processed pages, with the largest categories being invoices, bills, receipts, bank statements, custom document types, purchase orders, bank checks, and delivery notes. The practical answer is simple: users don't automate one tidy document type. They automate messy back-office work.

Top document types processed	Primary business use case
Invoices, bills, and receipts	AP automation, expense tracking, bookkeeping, and tax preparation
Bank statements	Reconciliation, cash-flow analysis, transaction categorization, and audit support
Custom document types	Internal operations, specialized logs, donation records, and niche business workflows
Purchase orders	Procurement control, supplier matching, and order-to-invoice checks
Bank checks	Payment processing, deposit records, and finance operations
Delivery notes	Logistics tracking, receiving checks, and proof-of-delivery workflows

The strongest signal was the breadth of documents. Invoices and receipts were expected. Bank statements were expected. The surprise was how quickly users moved into custom extraction: handwritten logs, donation records, specialized operational forms, and internal documents that don't fit a public template.

That matters for product design. A document automation tool that only handles common categories will solve the first workflow, then stall. Real teams need prebuilt support for common documents and a way to define their own extraction logic when the next process appears.

Suparse's top document categories after more than 10 million processed pages show that document automation demand is broader than invoice OCR. Finance documents dominate early usage, but custom document types become important once teams trust extraction enough to automate internal workflows.

For standard financial workflows, see the dedicated guides for invoice line item extraction, receipt scanning to spreadsheets, and PDF bank statement conversion.

Why Are Custom Extraction Schemas a Power-User Feature?

In 2026, the intelligent document processing market was already measured in the billions. SNS Insider's Intelligent Document Processing Market Report estimated a 2024 market size of about $2.6 billion and a 2032 forecast above $24 billion. Growth is driven by repeatable workflows, and repeatable workflows eventually need custom schemas.

Custom extraction schemas are not the first feature most users need. A new user usually wants to upload a bank statement or receipt and export it to Excel. But once a team has a working process, they start asking for fields that match their business vocabulary.

The best custom schemas don't only extract what is visible. They also create derived fields. A bank transaction can include a raw description, but the user often needs a category such as payroll, rent, tax, travel, utilities, or vendor payment. A schema can ask: "Based on the transaction description, provide a transaction category from this list."

That turns extraction into enrichment. The user doesn't just get a row copied from a PDF. They get a row that is closer to analysis, bookkeeping, approval, reconciliation, or import into another system.
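As a rough illustration, a schema with one derived field could be modeled like this. The `FieldSpec` structure, the field names, and the `check_row` helper are assumptions made up for this sketch, not Suparse's actual schema format.

```python
from dataclasses import dataclass, field

# Hypothetical category list, taken from the examples in the text above.
CATEGORIES = ["payroll", "rent", "tax", "travel", "utilities", "vendor payment"]


@dataclass
class FieldSpec:
    name: str
    instruction: str            # what the extractor is asked to return
    derived: bool = False       # True when the value is inferred, not copied
    allowed: list[str] = field(default_factory=list)


TRANSACTION_SCHEMA = [
    FieldSpec("date", "Transaction date as shown on the statement"),
    FieldSpec("description", "Raw transaction description"),
    FieldSpec("amount", "Signed transaction amount"),
    FieldSpec(
        "category",
        "Based on the transaction description, provide a transaction "
        "category from this list.",
        derived=True,
        allowed=CATEGORIES,
    ),
]


def check_row(row: dict, schema: list[FieldSpec]) -> list[str]:
    """Return problems: missing fields or values outside the allowed list."""
    problems = []
    for spec in schema:
        value = row.get(spec.name)
        if value is None:
            problems.append(f"missing: {spec.name}")
        elif spec.allowed and value not in spec.allowed:
            problems.append(f"unexpected {spec.name}: {value!r}")
    return problems
```

Restricting a derived field to a fixed list keeps the enriched column usable downstream: a row whose category falls outside the list is flagged instead of silently imported.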

Custom schemas are valuable because they shift document processing from OCR to operational data modeling. The extraction result should match the system that receives it, not the visual layout of the source document.

This is why Suparse supports ready-made document types and adjustable schemas. You can start with financial document extraction, then move into custom document parsing when a workflow becomes specific.

Why Is Auto Splitting Now an Expected Feature?

In 2025, users in a Microsoft Azure Document Intelligence Q&A thread still reported unstable table extraction when merged headers and uneven alignment appeared in PDFs. Layout complexity gets worse when one PDF contains many documents, which is why automatic splitting is no longer optional.

A user may upload a 120-page PDF that contains supplier invoices, delivery notes, purchase orders, bank statements, and cover sheets. They don't think of this as a special case. To them, it is one file from email, a scanner, a portal, or an accountant.

If the product asks them to split it manually before extraction, the automation has already failed. Manual pre-processing is just another form of data entry.

After seeing high-volume uploads, our view changed: auto splitting is not a convenience feature. It is table stakes. Users expect the system to identify page boundaries, separate individual documents, and preserve the relationship between pages that belong together.
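To make the idea concrete, here is a minimal sketch of the grouping step. It assumes a hypothetical upstream page classifier that returns a document type and a "first page" guess for every page; the `PageGuess` structure and `split_into_documents` function are illustrative names, not a description of how Suparse implements splitting.

```python
from dataclasses import dataclass


@dataclass
class PageGuess:
    page: int            # 1-based page number in the uploaded PDF
    doc_type: str        # label from a hypothetical page classifier
    is_first_page: bool  # classifier's guess that a new document starts here


def split_into_documents(pages: list[PageGuess]) -> list[dict]:
    """Group consecutive pages into documents, keeping related pages together."""
    documents: list[dict] = []
    for guess in pages:
        starts_new = (
            not documents
            or guess.is_first_page
            or guess.doc_type != documents[-1]["doc_type"]
        )
        if starts_new:
            documents.append({"doc_type": guess.doc_type, "pages": [guess.page]})
        else:
            # Continuation page: attach it to the document already in progress.
            documents[-1]["pages"].append(guess.page)
    return documents
```

A change of document type also forces a boundary, so a missed "first page" guess between an invoice and a delivery note does not merge the two.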

Mixed PDFs reveal a basic truth about document automation: users bring files as they exist in the business, not as the software vendor wishes they existed. Automatic splitting reduces pre-processing work before extraction even begins.

For a practical example, see the guide on how to split one PDF containing multiple invoices, or the broader workflow guide for high-volume document processing.

Why Does Schema Assignment Matter for Mixed PDFs?

Across the 10 million pages we processed, mixed long PDFs became one of the clearest product lessons. Splitting pages is only half the answer. The system also needs to assign the right extraction schema to each separated document.

This is where document automation becomes a workflow engine. An invoice needs supplier, tax, totals, and line items. A bank statement needs transactions and balances. A purchase order needs buyer, vendor, items, quantities, and agreed prices. A check needs payee, amount, date, and routing details.

If the system uses the wrong schema, accuracy drops even when OCR is good. The extracted text may be readable, but the data structure is wrong.

Auto schema assignment is powerful because it lets a user process a mixed PDF in one go. The user uploads one file, and the system splits it into documents, classifies each one, applies the correct extraction schema, and exports a unified dataset.
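The assignment step can be sketched as a lookup from document type to schema. The `SCHEMAS` registry and the `assign_schemas` function below are hypothetical; the field lists simply mirror the examples given earlier in this section.

```python
# Hypothetical registry: field lists follow the examples in the text above.
SCHEMAS = {
    "invoice": ["supplier", "tax", "totals", "line_items"],
    "bank_statement": ["transactions", "balances"],
    "purchase_order": ["buyer", "vendor", "items", "quantities", "agreed_prices"],
    "check": ["payee", "amount", "date", "routing_details"],
}


def assign_schemas(documents: list[dict]) -> list[dict]:
    """Attach a schema to each split document; unknown types go to review."""
    assigned = []
    for doc in documents:
        fields = SCHEMAS.get(doc["doc_type"])
        assigned.append({
            **doc,
            "schema": fields,
            # No matching schema means a person should decide what to extract.
            "needs_review": fields is None,
        })
    return assigned
```

Routing unrecognized types to review, rather than forcing the closest schema, avoids the failure described above where the text is readable but the data structure is wrong.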

Automatic schema assignment turns one mixed document pack into multiple structured outputs without asking the user to sort pages first. That is the difference between a tool that helps with conversion and a platform that can run a back-office document workflow.

This is especially useful for teams handling purchase order extraction, delivery note OCR, bank check extraction, and invoice matching in the same process.

Why Is Human-in-the-Loop Still Essential?

The EU AI Act, which entered into force in 2024, makes human oversight a formal requirement for high-risk AI systems. Article 14 states that oversight should aim to prevent or minimize risks to health, safety, or fundamental rights. Even when document extraction is not a high-risk AI use case, the principle is right for finance data.

Over 99% accuracy sounds complete until you process enough documents. At scale, the remaining exceptions become real. A blurred digit, a missing minus sign, a merged table header, or a handwritten correction can matter when the output feeds accounting, payroll, procurement, or compliance.

That doesn't mean automation failed. It means the interface must make review fast.
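A quick back-of-the-envelope calculation shows why. The function below is a sketch that assumes errors are independent per field; the page count, fields per page, and accuracy used in the example are illustrative inputs, not Suparse measurements.

```python
def expected_exceptions(pages: int, fields_per_page: int, field_accuracy: float) -> int:
    """Expected number of incorrect fields, assuming independent per-field errors."""
    total_fields = pages * fields_per_page
    return round(total_fields * (1.0 - field_accuracy))


# Illustrative inputs: 10,000 pages, 20 fields each, 99% field accuracy.
print(expected_exceptions(10_000, 20, 0.99))  # prints 2000
```

Two thousand wrong fields is too many to ignore and too few to justify re-keying everything, which is why the review step has to be fast and targeted.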

Human-in-the-loop works when the user can see the source document next to the extracted field, correct values directly, and rely on validation checks to catch suspicious output. For bank statements, balance checks are useful. For invoices, totals and line items need consistency checks. For custom schemas, required fields should be obvious.
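The two consistency checks mentioned above can be sketched in a few lines. The function names and the string-based inputs are assumptions for illustration; `Decimal` is used so that currency amounts compare exactly.

```python
from decimal import Decimal


def balance_check(opening: str, closing: str, amounts: list[str]) -> bool:
    """True when opening balance plus signed transactions equals closing balance."""
    total = sum((Decimal(a) for a in amounts), Decimal(opening))
    return total == Decimal(closing)


def invoice_total_check(line_totals: list[str], tax: str, total: str) -> bool:
    """True when line totals plus tax add up to the stated invoice total."""
    computed = sum((Decimal(x) for x in line_totals), Decimal(tax))
    return computed == Decimal(total)
```

A missing minus sign on a single transaction makes `balance_check` fail, which tells the reviewer which statement to open without reading every row.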

High extraction accuracy does not remove the need for review. It changes the review job from manual entry to exception handling, where people check the small set of fields that matter most before data leaves the system.

For related guidance, read our article on data validation for bank statements and the security-focused overview of financial data privacy.

What Did We Learn About Scan Quality?

In 2026, Thomson Reuters guidance for OCR scanning recommends 300 or 600 DPI for documents processed with OCR technology. Our experience matches the direction of that advice: image quality still matters, even when AI extraction is much better than legacy OCR.

The hard cases are predictable. Low-resolution scans blur small digits. Phone photos add perspective distortion. Faxes and compressed scans lose contrast. Handwritten notes add ambiguity. Tables with merged headers confuse row and column structure.

The lesson is not "reject imperfect documents." Users have imperfect documents. The product has to accept PDFs, scans, JPGs, PNGs, and phone photos because that is how paperwork arrives.

The better lesson is to expose uncertainty clearly. When the system is less confident, the user should know where to look. A clean review workflow can turn a weak scan into usable data because the user checks only the uncertain parts.
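One simple way to do that is to sort fields by confidence and show only the ones below a threshold. This sketch assumes the extractor returns a confidence score between 0 and 1 for each field; the `fields_to_review` name and the 0.9 default are arbitrary choices for illustration.

```python
def fields_to_review(
    fields: dict[str, tuple[str, float]], threshold: float = 0.9
) -> list[str]:
    """Return names of fields below the confidence threshold, least confident first."""
    flagged = [(conf, name) for name, (_value, conf) in fields.items() if conf < threshold]
    return [name for _conf, name in sorted(flagged)]


# Hypothetical extraction result: field name -> (value, confidence).
extracted = {
    "total": ("118.40", 0.62),
    "date": ("2026-03-01", 0.99),
    "supplier": ("ACME", 0.85),
}
print(fields_to_review(extracted))  # prints ['total', 'supplier']
```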

Scan quality remains one of the largest practical drivers of document extraction review time. Better OCR helps, but low DPI, skew, poor contrast, handwriting, and complex tables still create exceptions that need visible validation.

Suparse is built for scanned documents and image files, but the best results still come from readable inputs. The same principle applies whether you are using receipt OCR, invoice OCR, or bank statement extraction.

Why Do API and MCP Integrations Matter Now?

In 2026, Anthropic's Claude Code MCP documentation describes MCP as a way to connect Claude Code to hundreds of external tools and data sources. That reflects a broader shift: document extraction is moving from "upload and download" into connected AI and software workflows.

Excel exports still matter. CSV still matters. Google Sheets still matters. But teams processing documents every week eventually want extraction inside their existing process: a CRM, ERP, accounting app, case management system, procurement workflow, or AI assistant.

API access is the baseline for that. It lets developers send documents, check status, retrieve JSON, and push clean data downstream without manual export.
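The "check status, retrieve JSON" part usually comes down to a polling loop. The sketch below is deliberately client-agnostic: `get_status` and `get_json` are placeholder callables you would wire to whatever extraction API you use, and the `"done"` and `"failed"` status strings are assumptions, not values taken from the Suparse API.

```python
import time
from typing import Callable


def wait_for_result(
    get_status: Callable[[str], str],
    get_json: Callable[[str], dict],
    job_id: str,
    poll_seconds: float = 2.0,
    max_polls: int = 30,
) -> dict:
    """Poll a job until it finishes, then return its structured JSON result."""
    for _ in range(max_polls):
        status = get_status(job_id)
        if status == "done":
            return get_json(job_id)
        if status == "failed":
            raise RuntimeError(f"extraction failed for job {job_id}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Passing the two callables in keeps the loop testable with fakes and independent of any particular HTTP library. Where an API offers webhooks, those usually replace polling altogether.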

MCP is becoming important for a different reason. AI coworkers and coding agents increasingly need controlled access to business tools. A user may want Claude Code, Cursor, or another assistant to process a document, inspect structured output, or connect extraction results to a workflow.

API integration automates document extraction for software systems, while MCP makes extraction available to AI-assisted work environments. Together, they move document processing from a separate web app into the tools where operations and developers already work.

Suparse supports both directions: direct exports for business users and integration paths for technical teams. For developer workflows, start with the document extraction API guide and the extraction API page.

What Makes the Business Case Easy to Justify?

In 2026, APQC tracks invoice-entry cycle time as the hours from invoice receipt until data is entered into an accounts payable system. That is the right economic frame: document extraction pays off when it removes repeated handling time.

The business case is not only about OCR. It is about fewer copied fields, fewer spreadsheet cleanups, fewer manual splits, fewer misclassified documents, faster review, and cleaner exports.

Pricing matters because document automation often starts with small teams. If the cost per page is too high, users ration automation and keep manual work for "small" jobs. That defeats the purpose. The product should make the automated path easier to justify than another hour of manual entry.

Suparse pricing is designed around that reality. Competitive pricing helps customers automate the whole workflow, not just the most painful subset. That changes adoption. Users test with one document type, then expand to invoices, receipts, bank statements, checks, delivery notes, purchase orders, and custom internal forms.

The clearest ROI case for document automation is not a single accuracy number. It is the combined reduction in pre-processing, extraction, review, correction, and export work across every document type a team handles.
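A simple model makes that comparison explicit. The function and every input in the example are hypothetical placeholders; substitute your own page volume, handling times, labor cost, and per-page price.

```python
def monthly_savings(
    pages: int,
    manual_min_per_page: float,
    review_min_per_page: float,
    hourly_cost: float,
    price_per_page: float,
) -> float:
    """Manual handling cost minus (review cost + extraction fees) for one month."""
    manual_cost = pages * manual_min_per_page / 60 * hourly_cost
    automated_cost = (
        pages * review_min_per_page / 60 * hourly_cost + pages * price_per_page
    )
    return round(manual_cost - automated_cost, 2)


# Illustrative inputs only: 1,000 pages, 3 min manual entry vs 0.5 min review
# per page, 30 per hour labor cost, 0.10 per page extraction price.
print(monthly_savings(1000, 3, 0.5, 30, 0.10))  # prints 1150.0
```

The review minutes belong in the automated cost on purpose: a tool with a low page price but a slow review workflow can still lose this comparison.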

That is why we recommend testing your real files, not vendor demos. Upload your own documents with 50 free pages, compare the exported data to your manual workflow, and calculate the time saved on the whole process.

Methodology and Privacy

This article uses aggregated 2026 internal telemetry from more than 10 million pages processed through Suparse production workflows. The analysis focuses on document categories, workflow patterns, and product lessons. It does not require exposing customer document content.

We intentionally avoid customer-level examples, private field values, or small-sample claims that could reveal sensitive information. The point is to explain patterns across a large volume of usage, not to publish customer data.

The privacy model matters because document extraction often handles financial and operational records. GDPR Article 25 requires data protection by design and by default, and the European Commission's guidance on that article highlights the obligation.

Suparse's product approach follows the same practical direction: collect only what is needed, protect documents in transit and at rest, keep users in control, and avoid using customer documents for model training without explicit agreement.

Privacy-safe document automation should report aggregate metadata and workflow outcomes, not customer content. That makes it possible to learn from usage patterns while keeping sensitive business records out of public analysis.

For more detail on security and retention practices, see secure financial data privacy and the Suparse privacy policy.

Final Thoughts

Processing more than 10 million pages of user documents changed how we think about document automation. OCR is necessary, but it is not enough. The real product has to handle mixed files, custom schemas, derived fields, review, validation, APIs, MCP, exports, and pricing that makes regular use sensible.

The lesson is clear: document processing should adapt to the way work arrives. Users should not have to pre-sort every PDF, redraw every table, rebuild every schema, or manually check every field.

If you want to see how Suparse performs on your own paperwork, start with Suparse for free. Upload the documents you actually use, then judge the output by how much manual work disappears.

Test Suparse on your own documents

Upload invoices, receipts, statements, purchase orders, checks, delivery notes, or custom documents. Start with free pages and export clean data to Excel, CSV, Google Sheets, JSON, API, or MCP.

Processing 10 Million Document Pages FAQ

What document types does Suparse process most often?

Across more than 10 million processed pages, the most common categories are invoices, bills, receipts, bank statements, custom document types, purchase orders, bank checks, and delivery notes. The mix shows that users want one extraction platform for finance, operations, procurement, and custom internal workflows.

Can Suparse process one PDF that contains many document types?

Yes. Suparse can split long PDFs into separate documents and assign the right extraction schema automatically. This is useful when one scan contains invoices, receipts, statements, delivery notes, checks, and purchase orders in the same file.

Do custom extraction schemas improve results?

Yes. Custom schemas are most useful for power users with repeatable workflows. They work best when paired with derived fields that classify, enrich, or normalize extracted data during processing, such as transaction categories, tax mapping, approval flags, or normalized vendor names.

Is human review still needed with high OCR accuracy?

Yes. Even 99%+ extraction accuracy leaves exceptions. Suparse provides a review interface and validation checks so users can correct important fields before exporting data to Excel, CSV, Google Sheets, JSON, API, or MCP workflows.

Does Suparse support API and MCP workflows?

Yes. Suparse supports API integration and MCP workflows for teams that want document extraction inside AI-assisted tools, coding environments, and automated operations.

How should I evaluate document extraction ROI?

Test your own documents and measure the full workflow. Include splitting, classification, extraction, validation, corrections, export, and downstream cleanup. The cheapest tool is not always the lowest-cost workflow, but competitive pricing makes broad automation easier to justify.

Michal Raczy

Michal is the founder of Suparse.com. He has over 15 years of experience in delivering projects in data analysis, automation, and document processing. Michal solves complex automation and AI implementation challenges for both SMEs and large corporations, with a particular focus on document processing. Contact at michal@suparse.com.