OCR a PDF: Make a Scanned PDF Searchable Without Uploading

A scanned PDF looks like a document, but to your computer it is just a photograph of one. You can see the words on screen, yet you cannot select them, search for a phrase, or copy a sentence — because there is no text in the file at all, only pixels arranged in the shape of letters. OCR (optical character recognition) is the technology that bridges that gap. It reads the image, recognises the characters, and gives you back real, machine-readable text. This guide explains exactly how OCR works, why a scan behaves so differently from a normal PDF, and how to run the whole process in your browser so a sensitive scan never leaves your device.

What a scanned PDF really is

When you scan a paper document — or snap a photo of it with your phone and "save as PDF" — the scanner captures an image. That image gets wrapped in a PDF container so it opens like any other PDF, but inside there is no text layer. Every page is a single picture.

This is why a scanned PDF behaves so strangely compared with one exported from a word processor:

Ctrl+F finds nothing, even for words you can clearly see on the page.
You cannot highlight a line to copy it — your cursor selects the whole image or nothing.
Converting it to Word or text produces a blank or image-only file, because there is no text to extract.
The file is often large, since images take far more space than the equivalent text.

A "born-digital" PDF, by contrast, stores the actual characters plus their fonts and positions. That hidden text layer is what makes searching, selecting and copying possible. OCR's job is to recreate that missing layer for a scan.

What OCR actually does

Optical character recognition analyses the picture of a page and works out which shapes are letters, numbers and punctuation. Under the hood the engine typically:

Cleans the image — adjusts contrast and removes noise so characters stand out from the background.
Finds the layout — detects lines, words and the spaces between them.
Recognises each glyph — matches the shape of every character against a trained model.
Assembles the text — stitches the recognised characters back into words and lines, often with a dictionary check to fix near-misses.
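
To make the first of those steps concrete, here is a minimal sketch of one common way to clean an image: Otsu's method, which picks the single gray level that best separates dark ink from light paper. It is an illustration under stated assumptions, not the engine's actual code; the function names and the flat list of 0-255 gray values are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that best splits pixels into dark and light."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold = 0
    best_variance = -1.0
    weight_dark = 0  # pixels at or below the candidate threshold
    sum_dark = 0     # sum of their gray levels

    for level in range(256):
        weight_dark += hist[level]
        if weight_dark == 0:
            continue
        weight_light = total - weight_dark
        if weight_light == 0:
            break
        sum_dark += level * hist[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance: large when the two groups are well separated.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = level
    return best_threshold


def binarise(pixels):
    """Map every pixel to pure black (0) or pure white (255)."""
    threshold = otsu_threshold(pixels)
    return [0 if p <= threshold else 255 for p in pixels]
```

Calling binarise on a row of gray values returns the same row with ink forced to 0 and paper forced to 255, which is the kind of high-contrast input the later steps work best on.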

The result is a text layer aligned with the image. Once that exists you can:

Search the document for any word or number.
Select and copy passages straight out of it.
Feed it into a PDF-to-Word or PDF-to-text conversion that finally contains real words instead of a flat picture.

That single step is what turns a drawer full of scans into a searchable, reusable archive.

Why run OCR locally instead of uploading

Scanned documents are frequently the most sensitive things people own: signed contracts, passports and IDs, tax returns, bank statements, medical records, and old family letters. Those are precisely the files you should think twice about before handing to a stranger.

Most online OCR services require you to upload the entire document to their servers, where it is processed and, in many cases, retained for some period under terms few people read. For a casual scan that might be fine. For a divorce settlement or a copy of your driver's licence, it is exactly the wrong trade.

Running OCR in the browser removes that risk. The recognition happens locally using a WebAssembly build of the open-source Tesseract engine — the same well-regarded engine that powers countless OCR projects, compiled to run on the page you already have open. Your scan is read by code executing on your own machine, not sent anywhere.

You do not have to take that on faith. Open DevTools (press F12), switch to the Network tab, then run OCR on a page. You will see the engine and language data load, but you will not see your PDF being uploaded — because it never is. The same check works for every tool on the site.

How to OCR a PDF in your browser

The process is deliberately simple:

Open the OCR PDF tool.
Add your scanned PDF — drag it in or pick it from your device.
Choose the document's language if prompted, so the engine loads the right model.
Let it process. Recognition runs on your hardware, so the time depends on your machine and the number of pages — a few pages are quick, a long report takes longer.
Use the recognised text directly, or export it for the next step.

Because the work is local, there is no queue, no file-size cap imposed by a server, and no upload progress bar. Your own CPU does the recognition, so the practical limits are your device's memory and processing speed.

Tips for the most accurate results

OCR accuracy depends far more on the quality of the input than on the engine. A clean scan can reach accuracy in the high 90s percent; a crumpled phone photo in poor light might come out barely usable. A few habits make a large difference:

Scan at 300 DPI. This is the sweet spot for text. Below about 200 DPI, characters blur together and accuracy drops sharply; far above 300 DPI mostly adds file size without helping recognition.
Keep the page straight. Skew is one of the biggest accuracy killers. If a scan is tilted even a few degrees, lines drift and the engine mis-reads them. Straighten or rotate crooked pages before running OCR.
Maximise contrast. Crisp black text on a white background reads best. Faint photocopies, grey backgrounds and coloured paper all reduce accuracy. Many scanners have a "document" or "text" mode that boosts contrast automatically.
Flatten and light evenly. For phone captures, lay the page flat, avoid shadows, and fill the frame with the page so the text is as large and sharp as possible.
Prefer printed text. OCR is designed for typeset characters. Handwriting, decorative fonts and text printed over images or stamps are much harder and should be expected to need correction.
Pick the right language. Loading the model that matches the document's language, especially for accented or non-Latin scripts, noticeably improves results.
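
If you are unsure whether an existing image meets the resolution guideline, the arithmetic is simple: divide the pixel dimensions by the physical page size in inches. The sketch below does that and applies the thresholds from the tips above. The function names are hypothetical, and the 400 DPI upper cut-off is an assumed stand-in for "far above 300".

```python
def effective_dpi(width_px, height_px, page_width_in, page_height_in):
    """Estimate resolution from image size in pixels and page size in inches."""
    # Use the smaller of the two axes so the estimate is conservative.
    return min(width_px / page_width_in, height_px / page_height_in)


def dpi_advice(dpi):
    """Turn a DPI estimate into a rough recommendation."""
    if dpi < 200:
        return "too low: rescan at a higher resolution"
    if dpi <= 400:  # assumed cut-off, not a hard rule
        return "good for text"
    return "higher than needed: mostly adds file size"


# A US Letter page (8.5 x 11 in) captured at 2550 x 3300 pixels.
print(effective_dpi(2550, 3300, 8.5, 11))  # 300.0
```

The same check is useful for phone photos: an A4 page (about 8.27 x 11.69 in) captured at 1240 x 1754 pixels works out to roughly 150 DPI, which falls below the range where recognition is reliable.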

What accuracy to realistically expect

OCR is excellent, but it is not magic, and no engine — local or cloud — is perfect. On a clean, high-contrast, 300 DPI scan of ordinary printed text, you can expect very high accuracy with only the occasional slip. On a faded fax, a tightly packed table, or a low-resolution photo, errors climb.
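
If you want to put a number on accuracy for your own scans, one common approach is to type out a short passage by hand and compare it with the OCR output character by character. The sketch below uses edit distance (the number of single-character insertions, deletions and substitutions) for that comparison. The function names are hypothetical and this is only one of several ways to score OCR output.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def character_accuracy(truth, ocr_output):
    """Share of the reference text that was recognised correctly (0.0 to 1.0)."""
    if not truth:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(truth, ocr_output)
    return max(0.0, 1 - errors / len(truth))
```

For example, comparing "Invoice total: 1050.00" with an output of "Invoice total: 1O5O.00" gives two errors in 22 characters, or about 91 percent. The score looks respectable, yet the one figure that matters is wrong, which is why a headline accuracy number is no substitute for checking the critical values.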

The characters most often confused are the ones that look alike: the digit 0 and the letter O; the digit 1, lowercase l and capital I; 5 and S; 8 and B. That matters because those are exactly the characters that appear in the data you most care about.

So the rule is simple: trust the bulk text, but proofread the numbers that matter — dates, totals, account numbers, IDs and reference codes. A two-minute check of the critical figures is worth far more than assuming a perfect result. And because OCR is computationally intensive and runs on your device, large documents take time; that patience is the price of keeping the scan private.
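
A small script can help you decide where to look first. The sketch below flags any token in the recognised text that mixes digits with one of the lookalike letters named above, since that mix is a typical sign of a misread figure. The function name and the flagging rule are assumptions made for illustration. It will also flag legitimate codes, and it cannot catch a number misread entirely as letters, so it narrows the manual check rather than replacing it.

```python
# Letters that OCR commonly confuses with the digits 0, 1, 5 and 8.
LOOKALIKE_LETTERS = set("OlISB")


def tokens_to_proofread(text):
    """Return whitespace-separated tokens that mix digits with lookalike letters."""
    flagged = []
    for token in text.split():
        has_digit = any(ch.isdigit() for ch in token)
        has_lookalike = any(ch in LOOKALIKE_LETTERS for ch in token)
        if has_digit and has_lookalike:
            flagged.append(token)
    return flagged


sample = "Total due 1O5O.00 by 2024-03-01 ref AB12S"
print(tokens_to_proofread(sample))  # ['1O5O.00', 'AB12S']
```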

What to do after OCR

Once your scan carries a real text layer, it becomes a normal, workable document — and a whole set of next steps opens up:

Edit it. Convert it to Word to reshape the content, update a recovered letter, or reuse an old report as a template.
Grab just the words. If you only need the raw text — to paste into an email, a note or a spreadsheet — extract the plain text instead of a full conversion.
Remove sensitive details. Now that names, numbers and addresses are recognisable, you can redact them properly. True redaction deletes the underlying content rather than covering it, which is essential once a scan is searchable — a black box over text that OCR just made selectable would leak immediately.

In other words, OCR is rarely the destination. It is the step that makes every other PDF task possible on a document that started life as an image.

A note on languages

Tesseract supports a long list of languages, including non-Latin scripts and accented characters, and the in-browser tool loads the model for the language you select. Choosing the right one is not a nicety; it is one of the biggest accuracy levers available. Running an English model over a French or German page tends to drop or mangle accented characters, and running a Latin-script model over Cyrillic or Greek text produces predictable nonsense. If your document mixes languages, recognise it with the dominant one and proofread the rest. Matching the model to the page is the difference between clean output and a transcript full of stray symbols.
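
One quick way to spot a mismatched model after the fact is to check which script the recognised letters belong to. The sketch below is a hypothetical helper that uses Python's standard unicodedata module to measure what share of the letters in a piece of text come from an expected script, such as "LATIN", "CYRILLIC" or "GREEK". It is a rough sanity check under the assumption that you know which script the page uses, not a language detector.

```python
import unicodedata


def script_share(text, script):
    """Fraction of letters in text whose Unicode name starts with script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    matching = sum(
        1 for ch in letters
        # name() returns "" for characters that have no Unicode name.
        if unicodedata.name(ch, "").startswith(script)
    )
    return matching / len(letters)
```

If you expected a Cyrillic page and script_share(output, "CYRILLIC") comes back close to zero, the wrong model was probably loaded and the page is worth re-running with the correct language.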

The bottom line

OCR is what turns a flat scan into a genuinely useful document — searchable, selectable and ready to convert or redact. And there is no reason to upload private contracts, IDs or medical paperwork to get there. Feed your tool a clean, straight, high-contrast scan, pick the right language, proofread the numbers that matter, and let the recognition run on your own machine. Run OCR on your PDF entirely in your browser, and keep the scan where it belongs.