How Chinese national IDs and card numbers get detected in a document

The hard part of finding a national ID or card number in a document is not locating digits but avoiding false alarms. A random 18-digit string isn't necessarily an ID, and a 16-digit string isn't necessarily a card number. Reliable PII detection validates each candidate against the checksum built into the number itself, which separates plausible IDs and card numbers from order IDs and transaction references, all locally in your browser.

National ID: the mod-11-2 check digit

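China's second-generation national ID is 18 characters long, and the last character is a check digit. Each of the first 17 digits is multiplied by a fixed weight (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2), the products are summed, and the sum mod 11 is looked up in a fixed table (remainders 0 through 10 map to 1, 0, X, 9, 8, 7, 6, 5, 4, 3, 2) to give the expected check character, which is a digit or X.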

The detector computes that check digit and compares it to the last character; only a match flags a suspected ID. Because the remainder can take 11 values, roughly 10 in 11 arbitrary 18-digit strings fail the check and are filtered out.
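The check described above can be sketched in a few lines of Python. This is a minimal illustration, not the detector's actual code: the function names are hypothetical, and the sample number is only a test value whose check character works out to X.

```python
# Weights for the first 17 digits and the remainder-to-character table
# used by the mod-11-2 check on 18-character national ID numbers.
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"  # index = weighted sum mod 11


def expected_check_char(first17: str) -> str:
    """Return the check character implied by the first 17 digits."""
    total = sum(int(d) * w for d, w in zip(first17, WEIGHTS))
    return CHECK_CHARS[total % 11]


def looks_like_national_id(candidate: str) -> bool:
    """True if the candidate is 18 characters and its check character matches."""
    if len(candidate) != 18:
        return False
    body, last = candidate[:17], candidate[17].upper()
    if not (body.isascii() and body.isdigit()):
        return False
    return expected_check_char(body) == last


print(looks_like_national_id("11010519491231002X"))  # True
print(looks_like_national_id("202401150000123456"))  # False (an order-number-like string)
```

A match only means the string is internally consistent; it says nothing about whether the number was ever issued.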

Cards: the Luhn algorithm

Most card numbers satisfy the Luhn check: starting from the rightmost digit and moving left, double every second digit (subtracting 9 from any result over 9), then sum all the digits. The number passes if the total is divisible by 10.

One Luhn pass rejects about nine in ten random 16-digit strings, leaving only numbers that are structurally valid card numbers.
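As a sketch of the Luhn steps above (the function name is hypothetical, and 4111111111111111 is a widely used test number, not a real card):

```python
def luhn_valid(number: str) -> bool:
    """True if the digit string passes the Luhn check."""
    if not (number.isascii() and number.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit; index 0 is the check digit itself.
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:      # every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9      # same as adding the two digits of the product
        total += d
    return total % 10 == 0


print(luhn_valid("4111111111111111"))  # True
print(luhn_valid("4111111111111112"))  # False
```

Changing any single digit breaks the check, which is why a mistyped or arbitrary number usually fails.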

Why a checksum beats a plain regex

A plain regex (matching '18 digits') flags order numbers, tracking numbers, and random IDs as national IDs. The false positives pile up and train people to ignore real alerts.

With check-digit math, only about one in ten arbitrary digit strings still passes, so the false-positive rate drops sharply and the scan result is trustworthy enough to use as a pre-redaction gate.
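To make the contrast concrete, here is a small sketch that runs a regex alone and then the regex followed by the mod-11-2 check on the same text. The pattern, the helper names, and the sample sentence are all assumptions made for illustration.

```python
import re

WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"

# 17 digits plus a final digit or X, not embedded in a longer run.
CANDIDATE = re.compile(r"(?<!\d)\d{17}[\dXx](?![\dXx])", re.ASCII)


def passes_check(candidate: str) -> bool:
    """Compare the computed mod-11-2 check character with the last character."""
    total = sum(int(d) * w for d, w in zip(candidate[:17], WEIGHTS))
    return CHECK_CHARS[total % 11] == candidate[17].upper()


def scan(text: str) -> tuple[list[str], list[str]]:
    """Return (regex-only hits, hits that also pass the checksum)."""
    regex_hits = CANDIDATE.findall(text)
    validated = [hit for hit in regex_hits if passes_check(hit)]
    return regex_hits, validated


text = "Order 202401150000123456 was paid by the holder of ID 11010519491231002X."
regex_hits, validated = scan(text)
print(regex_hits)  # both strings: the order number and the ID
print(validated)   # only the string whose check character matches
```

The regex still does the cheap work of finding candidates; the checksum decides which of them are worth an alert.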

After detection

Once the PII scanner flags suspected sensitive fields, you can go back into the document and properly redact those areas (rasterized deletion), rather than merely knowing that something is there.

The whole chain — detect, validate, redact — runs locally in the browser; numbers are never uploaded.

FAQ

Does passing the checksum mean it's a real, valid ID/card?
Not necessarily. The checksum only proves the number is internally consistent in format and check digit, not that it really exists or belongs to someone. But that's enough to flag it as 'suspected PII to redact.'

Could it miss some real numbers?
Possibly — e.g. numbers split by spaces/hyphens or misread by OCR. So detection is an aid; still read through the document yourself before sharing.
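One way to narrow the separator gap, shown here only as a sketch, is to let the candidate pattern tolerate single spaces or hyphens between digits and strip them before running the Luhn check. The 13 to 19 digit range, the pattern, and the function names are assumptions; OCR misreads are not addressed by this.

```python
import re

# 13 to 19 digits, each optionally preceded by one space or hyphen (assumed range).
CARD_CANDIDATE = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)", re.ASCII)


def luhn_valid(digits: str) -> bool:
    """Luhn check on a plain digit string."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0


def find_cards(text: str) -> list[str]:
    """Return separator-free digit strings that pass the Luhn check."""
    found = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            found.append(digits)
    return found


print(find_cards("Card: 4111 1111 1111 1111, ref 4111-1111-1111-1112"))
# ['4111111111111111']
```

Loosening the pattern this way widens the net, so the checksum step matters even more for keeping false alarms down.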
