ML Job Scoring
Pseudo-labeling + ablation study — precision 94%, recall 87%, F1 0.90 on 200+ jobs/day
What doesn't work
Manually reviewing 200+ job postings a day is not feasible. Simple keyword filters miss 40% of matches and pass 30% of irrelevant postings. The task needs a model that understands context, not just keywords.
Architectural approach
3-model pseudo-labeling (majority vote) to create training data without manual labeling. Ablation study for feature selection. Grid search for threshold optimization. Weak supervision (Snorkel-style) for scaling. Cohen's Kappa for agreement control.
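The majority-vote step can be sketched in a few lines. This is a minimal illustration, not the project's code: the job IDs and votes are made up, and flagging non-unanimous votes for review is an assumed way to surface disagreements.

```python
from collections import Counter

def majority_vote(votes):
    """Return (label, unanimous) for an odd number of binary votes."""
    if len(votes) % 2 == 0:
        raise ValueError("need an odd number of votes to avoid ties")
    counts = Counter(votes)
    label, _ = counts.most_common(1)[0]
    return label, len(counts) == 1

# Hypothetical votes from three labelers for four jobs (True = match).
votes_per_job = {
    "job-1": (True, True, True),
    "job-2": (True, False, True),
    "job-3": (False, False, False),
    "job-4": (False, True, False),
}

pseudo_labels = {}
needs_review = []  # jobs where the labelers were not unanimous
for job_id, votes in votes_per_job.items():
    label, unanimous = majority_vote(votes)
    pseudo_labels[job_id] = label
    if not unanimous:
        needs_review.append(job_id)

# Share of jobs with at least one dissenting vote.
disagreement_rate = len(needs_review) / len(votes_per_job)
```

Tracking the non-unanimous jobs separately gives a direct way to measure the disagreement rate and to decide which pseudo-labels deserve a closer look.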
What made it hard
Pseudo-labeling introduces noise: the three LLMs disagreed in 18% of cases, so each disagreement had to be investigated and tie-breaking rules built. Upwork aggressively blocks scraping, which required anti-detection measures, proxy rotation, and fingerprint randomization. A grid search over thresholds (50-95) on real data revealed a non-obvious optimum of 65, lower than intuitively expected.
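A threshold grid search of this kind can be sketched as follows. The scores and labels are toy values chosen so that the best threshold lands at 65 for illustration; they are not the project's data. The step of 5 and the function names are assumptions.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 when every job with score >= threshold is predicted as a match."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def grid_search_threshold(scores, labels, candidates):
    """Return (best_threshold, best_f1); ties go to the lower threshold."""
    best = max(candidates, key=lambda t: (f1_at_threshold(scores, labels, t), -t))
    return best, f1_at_threshold(scores, labels, best)

# Toy model scores (0-100) and pseudo-labels (True = match).
scores = [92, 88, 81, 74, 69, 66, 63, 58, 52, 40]
labels = [True, True, True, True, False, True, False, False, False, False]

# Candidate thresholds 50, 55, ..., 95.
best_threshold, best_f1 = grid_search_threshold(scores, labels, range(50, 100, 5))
```

In this toy set, raising the threshold from 65 to 70 removes one false positive but also drops a true match scored 66, which is the kind of trade-off that can push the optimum lower than intuition suggests.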
My role & contribution
Architect & sole developer
Built the entire product from scratch: Upwork/HH scraper, ML pipeline (3-LLM pseudo-labeling, ablation study over 5 feature groups, threshold grid search), Telegram bot for digests, dedicated server deployment. Personally designed and ran the ablation study and tuned the thresholds via grid search.
How it looks
System architecture
How it works
Three LLMs independently label each job (match/no match), and a majority vote produces the pseudo-labels. Ablation study: 5 feature groups, each disabled in turn, with F1 measured after each run. Grid search on the score threshold (50-95). IR metrics: precision, recall, and F1 per iteration.
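A leave-one-group-out ablation loop might look like the sketch below. The five group names, the toy jobs, and the simple voting rule inside `evaluate` are all hypothetical stand-ins. In a real pipeline `evaluate` would retrain and score the model on the enabled feature groups.

```python
# Hypothetical group names; the source does not list the five groups.
GROUPS = ("title", "skills", "budget", "client", "description")

# Toy jobs: one binary signal per feature group plus a pseudo-label.
JOBS = [
    ({"title": 1, "skills": 1, "budget": 1, "client": 0, "description": 1}, True),
    ({"title": 1, "skills": 1, "budget": 0, "client": 1, "description": 0}, True),
    ({"title": 0, "skills": 1, "budget": 1, "client": 0, "description": 1}, True),
    ({"title": 1, "skills": 0, "budget": 0, "client": 0, "description": 0}, False),
    ({"title": 0, "skills": 0, "budget": 1, "client": 1, "description": 0}, False),
    ({"title": 0, "skills": 1, "budget": 0, "client": 0, "description": 1}, False),
]

def f1_score(predictions, labels):
    """F1 from boolean predictions and labels."""
    tp = sum(p and y for p, y in zip(predictions, labels))
    fp = sum(p and not y for p, y in zip(predictions, labels))
    fn = sum(not p and y for p, y in zip(predictions, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def evaluate(enabled_groups):
    """Stand-in model: predict a match when at least half of the enabled signals fire."""
    predictions = [
        sum(signals[g] for g in enabled_groups) / len(enabled_groups) >= 0.5
        for signals, _ in JOBS
    ]
    labels = [label for _, label in JOBS]
    return f1_score(predictions, labels)

def ablation_study(groups, evaluate_fn):
    """Disable each group in turn and record the F1 drop against the baseline."""
    baseline = evaluate_fn(tuple(groups))
    drops = {}
    for disabled in groups:
        enabled = tuple(g for g in groups if g != disabled)
        drops[disabled] = baseline - evaluate_fn(enabled)
    return baseline, drops

baseline_f1, f1_drops = ablation_study(GROUPS, evaluate)
```

A large positive drop suggests the disabled group carries signal, while a drop near zero (or negative) marks a candidate for removal.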
Why this way
Pseudo-labeling instead of manual annotation
Rejected alternative: manually label 500+ jobs for training data.
Manual labeling: 2-3 days of work, subjective, doesn't scale. 3 models × majority vote: more objective than one person, scales to any volume, Cohen's Kappa confirms agreement.
Quality labels in minutes. Kappa of 0.82: agreement higher than between two human annotators.
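Cohen's Kappa is defined for a pair of raters, so with three labelers one common option is to average it over the three pairs (Fleiss' kappa is another). The source does not say which variant was used. The sketch below assumes the pairwise average and uses made-up labels. With scikit-learn available, `sklearn.metrics.cohen_kappa_score` computes the two-rater value directly.

```python
from itertools import combinations

def cohen_kappa(a, b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(a)
    # Observed agreement: share of items both raters labeled identically.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    categories = set(a) | set(b)
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)

def mean_pairwise_kappa(raters):
    """Average Cohen's kappa over every pair of raters."""
    pairs = list(combinations(raters, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)

# Hypothetical labels from three models for ten jobs (1 = match).
model_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
model_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
model_c = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

kappa_ab = cohen_kappa(model_a, model_b)
kappa_mean = mean_pairwise_kappa([model_a, model_b, model_c])
```

Because kappa subtracts chance agreement, it is a stricter check than raw percent agreement, which matters when one class dominates the job feed.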
Results
- Precision 94%, Recall 87%, F1 0.90
- Cohen's Kappa: 0.82 (strong agreement between 3 LLMs)
- Ablation study: 5 feature groups, each disabled in turn
- Grid search on threshold: optimum 65 from range 50-95
- 200+ jobs/day → hourly digest in Telegram
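As a quick consistency check, the reported F1 follows from the reported precision and recall, since F1 is their harmonic mean. The helper name below is illustrative only.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Reported precision 94% and recall 87%.
f1 = f1_from_precision_recall(0.94, 0.87)
print(round(f1, 2))  # prints 0.9
```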
Impact on business
Automatic scoring of 200+ jobs/day instead of manual review. Job search time: from 2-3 hours/day to 5 minutes (Telegram digest review). 94% precision means almost every auto-application goes to a matching position.
Algorithms & patterns
Technologies
- Python
- scikit-learn
- DeepSeek API
- Telegram Bot API
- PostgreSQL