Orbis ML Classifier

Summary

1.18M

Entities

65.3%

Classified

98.75%

XLM-R Macro Prec.

91.2%

TF-IDF Coverage

Model Workflow

Three-stage cascade. Each stage runs only when the previous cannot classify with confidence.

Rule-Based

Keywords, suffixes (GMBH, INC, LTD), government patterns. 56.5% coverage.

LLM

Claude Haiku 3.5 via Anthropic Batches. 8.8% coverage, mean confidence 0.928.

XLM-RoBERTa

Trained on Steps 1 and 2. Zero-cost inference. Target coverage 89%.

Classification Results

Category	Count	%	Examples
company	705,158	60.0%	APPLE INC, SIEMENS AG
individual	26,449	2.2%	MR. AHMED AL-RASHID, JOHN SERVOS
family_firm	21,435	1.8%	GARCIA & HIJOS SL, SMITH & SONS LTD
government	14,592	1.2%	COMMUNE DE PARIS, DEPT OF TRANSPORT
Classified	767,635	65.3%
unknown	408,605	34.7%	Queued for Step 3

Method Comparison

Method	Precision	Coverage	Speed	Cost
Rule-based	~100%	56.5%	< 1 min	$0
LLM API	0.928	8.8%	~2 hr	~$32
XLM-RoBERTa	0.9875	~24%	~45 min	$0
TF-IDF baseline	0.782	91.2%	< 5 min	$0

XLM-R best precision, zero marginal cost.

XLM-RoBERTa Training

Epoch	Train Loss	Macro Prec.	Macro Recall	Macro F1
1	0.3225	0.9379	0.9680	0.9524
2	0.1351	0.9781	0.9662	0.9717
3	0.0895	0.9772	0.9736	0.9754
4	0.0549	0.9875	0.9707	0.9789	Best

Epoch 4. Abstains when max softmax < 0.80.

Precision Design

Precision over recall. Misclassification biases regression. Model abstains when uncertain.

THRESHOLD = 0.90 for entity in entities: probabilities = model(entity) if max(probabilities) >= THRESHOLD: label = argmax(probabilities) # assign else: label = "unknown" # abstain

Metric	Value
Accuracy (1K hand-labeled)	97.0%
Individual prec. / recall	0.75 / 0.91
Coverage	88.7%
Abstention	11.3%

Outputs

File	Description	Rows
`CLASSIFIED_ALL_PARENTS_FINAL.csv`	Steps 1 and 2, 65.3% coverage	1,176,240
`CLASSIFIED_ALL_PARENTS_TFIDF.csv`	With TF-IDF baseline, 91.2% coverage	1,176,240
`01_validation_report.txt`	Hand-labeled validation
`scripts/05_train_xlmr_colab.ipynb`	Training notebook (Colab)

Parent_name,parent_ID,parent_cty,category,confidence,method APPLE INC,US123456,US,company,0.99,rule JOHN SMITH,GB789012,GB,individual,0.95,llm_api FAMILIA GARCIA & HIJOS SL,ES345678,ES,family_firm,0.91,xlmr DEPARTMENT OF DEFENSE,US000001,US,government,1.00,rule QUINNELL SEPTIC & WELL SERVICE,US148650,US,unknown,0.72,unknown

Status

#	Task	Status
1	XLM-R inference on 408K unknowns	In progress
2	`FINAL_XLMR.csv` (89% coverage)	Pending
3	Archive model weights	Pending
4	Methodology section

Simulation

Click an entity. See which step classifies it. Switch tabs to view code from each stage.

APPLE INC MR. AHMED KHAN & SONS LTD COMMUNE DE PARIS JOHN SMITH QUINNELL SEPTIC

Entity APPLE INC

Step 1 (Rule)

Output company

      # Keywords, suffixes, government patterns
      if name in GOVERNMENT_KEYWORDS: return 'government'
      if '& SONS' in name: return 'family_firm'
      if suffix in ['INC', 'LTD', 'GMBH']: return 'company'
      if title in ['MR.', 'DR.'] and word_count <= 3: return 'individual'
      return None  # pass to Step 2
    

      # Anthropic Batches API
      prompt = f"Classify entity: {name}"
      response = client.messages.create(model="claude-3-5-haiku", ...)
      if response.confidence >= 0.85: return response.category
      return None  # pass to Step 3
    

      text = f'{name} {country}'
      enc = tokenizer(text, max_length=64, padding='max_length')
      logits = model(input_ids=enc['input_ids']).logits
      probs = torch.softmax(logits, dim=-1)
      max_p, pred_id = probs.max(dim=-1)
      return ID2LABEL[pred_id] if max_p >= 0.80 else 'unknown'
    

Summary

Model Workflow

Rule-Based

LLM

XLM-RoBERTa

Classification Results

Method Comparison

XLM-RoBERTa Training

Precision Design

Outputs

Status

Simulation

Data by Country

Country

Visualization

Data

Statistics Summary

Recent Updates

Summary

Model Workflow

Rule-Based

LLM

XLM-RoBERTa

Classification Results

Method Comparison

XLM-RoBERTa Training

Precision Design

Outputs

Status

Simulation

Data by Country

Country

Visualization

Data

Statistics Summary