Summary

Entities: 1.18M
Classified: 65.3%
XLM-R macro precision: 98.75%
TF-IDF coverage: 91.2%

Model Workflow

Three-stage cascade. Each stage runs only when the previous cannot classify with confidence.

Input: 1,176,240 entities
Step 1: Rule-Based · 664,167 (56.5%) · ~100% precision
Step 2: LLM API · 103,468 (8.8%) · ~2 hr · ~$32
Step 3: XLM-RoBERTa · ~280K (~24%) · ~45 min · $0
Output: FINAL.csv · ~89% coverage
1. Rule-Based

Keywords, suffixes (GMBH, INC, LTD), and government patterns. 56.5% coverage.

2. LLM

Claude Haiku 3.5 via Anthropic Batches. 8.8% coverage, mean confidence 0.928.

3. XLM-RoBERTa

Trained on the labels from Steps 1 and 2. Inference has no API cost. Target cumulative coverage: 89%.
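The cascade logic described above can be sketched as follows. This is a minimal illustration, not the project's code: `run_cascade`, the `Stage` type, and the three stand-in stage functions are hypothetical names, and the fallback confidence of 0.0 for unclassified entities is an assumption.

```python
from typing import Callable, Optional, Sequence, Tuple

# A stage takes an entity name and returns (label, confidence),
# or None when it cannot classify with confidence.
Stage = Callable[[str], Optional[Tuple[str, float]]]


def run_cascade(name: str, stages: Sequence[Tuple[str, Stage]]) -> Tuple[str, float, str]:
    """Try each stage in order and stop at the first one that returns a label."""
    for method, stage in stages:
        result = stage(name)
        if result is not None:
            label, confidence = result
            return label, confidence, method
    # No stage was confident enough: leave the entity unclassified.
    return "unknown", 0.0, "unknown"


# Hypothetical stand-ins for the three real stages.
def rule_stage(name: str) -> Optional[Tuple[str, float]]:
    return ("company", 1.0) if name.endswith((" INC", " LTD", " GMBH")) else None


def llm_stage(name: str) -> Optional[Tuple[str, float]]:
    return None  # placeholder: the real stage calls an LLM API


def xlmr_stage(name: str) -> Optional[Tuple[str, float]]:
    return None  # placeholder: the real stage runs the trained model


STAGES = [("rule", rule_stage), ("llm_api", llm_stage), ("xlmr", xlmr_stage)]

print(run_cascade("APPLE INC", STAGES))
print(run_cascade("QUINNELL SEPTIC & WELL SERVICE", STAGES))
```

Ordering the stages from cheapest to most expensive means the paid LLM stage only sees entities the rules could not resolve.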

Classification Results

Category | Count | % | Examples
company | 705,158 | 60.0% | APPLE INC, SIEMENS AG
individual | 26,449 | 2.2% | MR. AHMED AL-RASHID, JOHN SERVOS
family_firm | 21,435 | 1.8% | GARCIA & HIJOS SL, SMITH & SONS LTD
government | 14,592 | 1.2% | COMMUNE DE PARIS, DEPT OF TRANSPORT
Classified | 767,635 | 65.3% |
unknown | 408,605 | 34.7% | Queued for Step 3
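The coverage figures follow directly from the step counts. The short calculation below reproduces them; the Step 3 count is the approximate ~280K reported in the workflow, so the projected coverage is an estimate, and the variable names are illustrative.

```python
TOTAL = 1_176_240        # all parent entities
RULE_BASED = 664_167     # classified in Step 1
LLM_API = 103_468        # classified in Step 2
XLMR_EXPECTED = 280_000  # Step 3, approximate (reported as ~280K)


def share(count: int) -> float:
    """Percentage of all entities, rounded to one decimal place."""
    return round(100 * count / TOTAL, 1)


classified = RULE_BASED + LLM_API
unknown = TOTAL - classified

print(classified, share(classified))       # 767635 65.3
print(unknown, share(unknown))             # 408605 34.7
print(share(classified + XLMR_EXPECTED))   # 89.1, projected coverage after Step 3
```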

Method Comparison

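Method | Precision | Coverage | Speed | Cost
Rule-based | ~100% | 56.5% | < 1 min | $0
LLM API | 0.928 (mean confidence) | 8.8% | ~2 hr | ~$32
XLM-RoBERTa | 0.9875 | ~24% | ~45 min | $0
TF-IDF baseline | 0.782 | 91.2% | < 5 min | $0
XLM-R has the highest precision of the learned methods (0.9875) at zero marginal cost.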

XLM-RoBERTa Training

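Epoch | Train Loss | Macro Prec. | Macro Recall | Macro F1
1 | 0.3225 | 0.9379 | 0.9680 | 0.9524
2 | 0.1351 | 0.9781 | 0.9662 | 0.9717
3 | 0.0895 | 0.9772 | 0.9736 | 0.9754
4 | 0.0549 | 0.9875 | 0.9707 | 0.9789 (Best)
Epoch 4 is the selected checkpoint. The model abstains when its maximum softmax probability is below 0.80.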
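For readers unfamiliar with macro averaging, the sketch below shows how macro precision, recall, and F1 are computed: each metric is calculated per class and then averaged with equal weight, so small classes such as government count as much as company. The function name `macro_scores` and the toy labels are illustrative; the source does not say which library produced the figures in the table.

```python
from typing import Dict, Sequence


def macro_scores(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Unweighted mean of per-class precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy data: one company is mislabeled as an individual.
y_true = ["company", "company", "company", "individual", "government", "family_firm"]
y_pred = ["company", "company", "individual", "individual", "government", "family_firm"]
scores = macro_scores(y_true, y_pred)
print(scores)
```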

Precision Design

The pipeline favors precision over recall: a misclassified entity biases the downstream regression, so the model abstains when it is uncertain.

    THRESHOLD = 0.90

    for entity in entities:
        probabilities = model(entity)
        if max(probabilities) >= THRESHOLD:
            label = argmax(probabilities)  # assign
        else:
            label = "unknown"  # abstain

Metric | Value
Accuracy (1K hand-labeled) | 97.0%
Individual prec. / recall | 0.75 / 0.91
Coverage | 88.7%
Abstention | 11.3%
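The abstention rule and the coverage metric can be illustrated with a self-contained sketch. The logits below are made up, `predict_or_abstain` is a hypothetical helper, and the order of `LABELS` is an assumption; the threshold is a parameter whose default mirrors the snippet above.

```python
import math
from typing import List, Sequence, Tuple

# Assumed label order; the real model's class indices are not shown in the source.
LABELS = ["company", "individual", "family_firm", "government"]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_or_abstain(logits: Sequence[float], threshold: float = 0.90) -> Tuple[str, float]:
    """Return (label, confidence); the label is 'unknown' below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] >= threshold:
        return LABELS[best], probs[best]
    return "unknown", probs[best]


# Made-up logits: two confident predictions and one ambiguous one.
batch = [
    [6.0, 0.5, 0.2, 0.1],
    [1.2, 1.0, 0.9, 0.8],
    [0.1, 5.5, 0.3, 0.2],
]
predictions = [predict_or_abstain(logits) for logits in batch]
coverage = sum(label != "unknown" for label, _ in predictions) / len(predictions)
abstention = 1 - coverage

print([label for label, _ in predictions])
print(f"coverage={coverage:.3f} abstention={abstention:.3f}")
```

Raising the threshold trades coverage for precision: fewer entities receive a label, but the labels that are assigned are more reliable.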

Outputs

File | Description | Rows
CLASSIFIED_ALL_PARENTS_FINAL.csv | Steps 1 and 2, 65.3% coverage | 1,176,240
CLASSIFIED_ALL_PARENTS_TFIDF.csv | With TF-IDF baseline, 91.2% coverage | 1,176,240
01_validation_report.txt | Hand-labeled validation |
scripts/05_train_xlmr_colab.ipynb | Training notebook (Colab) |

Parent_name,parent_ID,parent_cty,category,confidence,method
APPLE INC,US123456,US,company,0.99,rule
JOHN SMITH,GB789012,GB,individual,0.95,llm_api
FAMILIA GARCIA & HIJOS SL,ES345678,ES,family_firm,0.91,xlmr
DEPARTMENT OF DEFENSE,US000001,US,government,1.00,rule
QUINNELL SEPTIC & WELL SERVICE,US148650,US,unknown,0.72,unknown
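A file in this layout can be summarized with the standard library alone. The sketch below parses the five sample rows from an in-memory string rather than from disk; for the real output, `io.StringIO(SAMPLE)` would be replaced by an open file handle.

```python
import csv
import io
from collections import Counter

SAMPLE = """Parent_name,parent_ID,parent_cty,category,confidence,method
APPLE INC,US123456,US,company,0.99,rule
JOHN SMITH,GB789012,GB,individual,0.95,llm_api
FAMILIA GARCIA & HIJOS SL,ES345678,ES,family_firm,0.91,xlmr
DEPARTMENT OF DEFENSE,US000001,US,government,1.00,rule
QUINNELL SEPTIC & WELL SERVICE,US148650,US,unknown,0.72,unknown
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# How many entities each stage labeled, and how many remain unclassified.
by_method = Counter(row["method"] for row in rows)
classified = [row for row in rows if row["category"] != "unknown"]
coverage = len(classified) / len(rows)

print(dict(by_method))
print(f"coverage={coverage:.1%}")
```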

Status

# | Task | Status
1 | XLM-R inference on 408K unknowns | In progress
2 | FINAL_XLMR.csv (89% coverage) | Pending
3 | Archive model weights | Pending
4 | Methodology section |

Simulation

Entity: APPLE INC
Classified by: Step 1 (Rule)
Output: company

    # Keywords, suffixes, government patterns
    if name in GOVERNMENT_KEYWORDS:
        return 'government'
    if '& SONS' in name:
        return 'family_firm'
    if suffix in ['INC', 'LTD', 'GMBH']:
        return 'company'
    if title in ['MR.', 'DR.'] and word_count <= 3:
        return 'individual'
    return None  # pass to Step 2
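
A runnable version of the Step 1 excerpt might look like the sketch below. Several details are assumptions made for illustration, not taken from the project: the contents of `GOVERNMENT_KEYWORDS`, substring matching for those keywords, and deriving the suffix and title from the last and first tokens of the name.

```python
from typing import Optional

# Hypothetical keyword list; the real GOVERNMENT_KEYWORDS is not shown in the source.
GOVERNMENT_KEYWORDS = ("COMMUNE DE", "DEPT OF", "DEPARTMENT OF")
COMPANY_SUFFIXES = {"INC", "LTD", "GMBH"}
TITLES = {"MR.", "DR."}


def classify_by_rules(name: str) -> Optional[str]:
    """Return a category, or None to pass the entity on to Step 2."""
    name = name.upper().strip()
    tokens = name.split()
    if not tokens:
        return None
    # Assumption: a keyword may appear anywhere in the name.
    if any(keyword in name for keyword in GOVERNMENT_KEYWORDS):
        return "government"
    # Checked before the suffix rule so that "X & SONS LTD" is a family firm.
    if "& SONS" in name:
        return "family_firm"
    # Assumption: the legal suffix is the last token.
    if tokens[-1] in COMPANY_SUFFIXES:
        return "company"
    # Assumption: the title is the first token.
    if tokens[0] in TITLES and len(tokens) <= 3:
        return "individual"
    return None


for example in ["APPLE INC", "KHAN & SONS LTD", "COMMUNE DE PARIS", "JOHN SMITH"]:
    print(example, "->", classify_by_rules(example))
```

Names without a title or suffix, such as JOHN SMITH, fall through to the later stages, which matches the sample output where that entity is labeled by `llm_api`.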


Puneeth Kotha · Prof. Belen Villalonga (Supervisor)
Parent entity classification. Orbis. 4 categories.