Why scanned PDFs are harder to redact than digital ones
Text in a digital PDF is a 'text object': selectable, copyable, parseable. Text in a scan is 'image pixels': readable to your eyes but recognizable to a machine only after OCR. That one difference changes where the difficulty lies: redacting gets simpler, while detecting PII gets harder.
Redacting a scan: actually more straightforward
Since the whole page is already an image, drawing a black bar and burning it into the pixels genuinely overwrites the covered pixels — there's no 'text object still hiding underneath' problem.
That makes a scan hard to get wrong. The more dangerous case is the digital PDF with a 'fake' redaction, where a black box is drawn over text that remains in the file as a text object. One caveat for scans: don't leave the bar on a separate layer that can still be removed; burn it into the bitmap.
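As a rough illustration of what 'burning in' means, here is a minimal sketch that treats a page as a grid of grayscale pixel values and overwrites a rectangle with black. The `Bitmap` type, the `burn_bar` function, and the tiny sample page are all hypothetical names made up for this example; a real tool would work on the decoded page image through an imaging library.

```python
from typing import List

# Hypothetical page representation: rows of 8-bit grayscale values
# (0 = black, 255 = white).
Bitmap = List[List[int]]


def burn_bar(bitmap: Bitmap, left: int, top: int, right: int, bottom: int) -> Bitmap:
    """Return a copy of `bitmap` with the box [left, right) x [top, bottom) set to black."""
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    # Clamp the box to the page so an oversized box cannot raise IndexError.
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)

    redacted = [row[:] for row in bitmap]  # copy; the input stays untouched
    for y in range(top, bottom):
        for x in range(left, right):
            # The original value is replaced, not hidden under another layer.
            redacted[y][x] = 0
    return redacted


# A tiny 4x6 "page": white background with a few darker "ink" pixels.
page = [
    [255, 255, 255, 255, 255, 255],
    [255,  40, 200,  35, 255, 255],
    [255,  60,  30, 180, 255, 255],
    [255, 255, 255, 255, 255, 255],
]

redacted = burn_bar(page, left=1, top=1, right=4, bottom=3)
covered = [redacted[y][x] for y in range(1, 3) for x in range(1, 4)]
print(covered)
```

Every covered pixel ends up with the same value, so nothing of the original ink remains in the output for OCR to work with. The safety comes from saving only the flattened result, not the original page plus a separate bar.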
PII detection in a scan: OCR is required first
To automatically find ID or card numbers in a scan, the machine must first OCR the pixels into text; only then can it match number patterns and run checksum validation.
OCR introduces recognition errors (0/O, 1/l confusion), so automated detection on scans is less reliable than on digital text and needs more human review.
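The sketch below shows one way the OCR confusions above could be handled when looking for payment card numbers, which use the Luhn checksum. It is an assumption-laden example, not a complete detector: the substitution table `OCR_DIGIT_FIXES`, the function names, and the sample string are hypothetical, and the card number used is a widely published test number, not a real one.

```python
import re
from typing import List, Tuple

# Assumed OCR confusions (letter read in place of a digit). Extend this
# for the error profile of whichever OCR engine is in use.
OCR_DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})

# 13 to 19 digit-like characters, optionally separated by spaces or hyphens,
# and not embedded in a longer word or number.
CANDIDATE = re.compile(
    r"(?<![0-9A-Za-z])[0-9OolI](?:[ -]?[0-9OolI]){12,18}(?![0-9A-Za-z])"
)


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_candidates(ocr_text: str) -> List[Tuple[str, str, bool]]:
    """Return (raw match, normalized digits, needed_ocr_fix) for Luhn-valid candidates."""
    results = []
    for match in CANDIDATE.finditer(ocr_text):
        raw = match.group()
        compact = re.sub(r"[ -]", "", raw)
        normalized = compact.translate(OCR_DIGIT_FIXES)
        if luhn_valid(normalized):
            # needed_ocr_fix=True means letters were swapped for digits,
            # so this hit deserves a human look.
            results.append((raw, normalized, normalized != compact))
    return results


# Hypothetical OCR output in which '0' was read as 'O' and '1' as 'l'.
sample = "Card no: 4O12 8888 8888 l88l exp 12/27"
hits = find_card_candidates(sample)
print(hits)
```

A passing checksum after substitution does not prove the OCR read was correct, and a number garbled in some other way will be missed entirely, which is why the flagged hits and the rest of the page still need a manual pass.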
A hidden risk: QR/barcodes in scans
Many scanned reports and receipts carry a QR or barcode that often encodes a sensitive identifier like a visit or order number.
They exist as images, so text detection can't see them — remember to cover these graphics during redaction too.
Practical guidance
To redact a scan: use image/region masking, burn solid black or pixelation into the pixels, and cover any QR codes too.
To detect PII in a scan: OCR to text first, then run PII detection — or read it through manually. Always re-check yourself before sharing.
FAQ
- Can a redacted scan still be reversed by OCR?
- As long as the black bar is truly burned into the bitmap and overwrites the original pixels, the covered area is just solid color and OCR has nothing to recover. The risk only comes from a bar that's still a removable layer.
- How do I tell whether a PDF is a scan or digital?
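- Try Ctrl/⌘+A to select all. If you can select and copy text, the PDF has a text layer and is most likely digital; if nothing selects and the page behaves like a picture, it's a scan. Note that a scan that has already been run through OCR can also carry a selectable text layer on top of the image.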