ML Job Scoring

Pseudo-labeling + ablation study — precision 94%, recall 87%, F1 0.90 on 200+ jobs/day

Problem

What doesn't work

Manually reviewing 200+ job postings a day is not sustainable. Simple keyword filters miss 40% of relevant postings and let through 30% of irrelevant ones. The task needs a model that understands context, not just keywords.

Solution

Architectural approach

3-model pseudo-labeling (majority vote) to create training data without manual labeling. Ablation study for feature selection. Grid search for threshold optimization. Weak supervision (Snorkel-style) for scaling. Cohen's Kappa for agreement control.
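
A minimal sketch of the majority-vote step, assuming binary labels (1 = match, 0 = no match) and an odd number of labelers. The function names, job ids, and votes are hypothetical and only illustrate the idea:

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common label among an odd number of binary votes."""
    if len(votes) % 2 == 0:
        raise ValueError("use an odd number of labelers to avoid ties")
    return Counter(votes).most_common(1)[0][0]


def pseudo_label(votes_per_job):
    """Map job id -> (pseudo-label, unanimous flag) from per-model votes."""
    result = {}
    for job_id, votes in votes_per_job.items():
        label = majority_vote(votes)
        unanimous = len(set(votes)) == 1
        result[job_id] = (label, unanimous)
    return result


# Hypothetical votes from three models for three jobs.
votes = {
    "job-1": [1, 1, 1],
    "job-2": [1, 0, 1],
    "job-3": [0, 0, 1],
}
labels = pseudo_label(votes)
```

Keeping the unanimous flag next to each label makes it easy to pull out the disagreement cases for separate review.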

Challenges

What made it hard

Pseudo-labeling introduces noise: 3 LLMs disagree in 18% of cases — had to investigate each disagreement and build tie-breaking rules. Upwork aggressively blocks scraping — anti-detection, proxy rotation, fingerprint randomization. Grid search on thresholds (50-95) on real data revealed a non-obvious optimum of 65 — lower than intuitively expected.
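
The threshold sweep mentioned above could look roughly like this. It is a sketch under assumptions: a step of 5 across the 50-95 range, F1 as the selection criterion, and made-up scores and labels used only to exercise the function. The project's actual step size and objective are not stated here.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = match)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def grid_search_threshold(scores, y_true, thresholds=range(50, 96, 5)):
    """Return (threshold, f1, precision, recall) with the highest F1.

    Ties keep the lowest threshold because only a strictly better F1 wins.
    """
    best = None
    for thr in thresholds:
        y_pred = [1 if s >= thr else 0 for s in scores]
        p, r, f1 = precision_recall_f1(y_true, y_pred)
        if best is None or f1 > best[1]:
            best = (thr, f1, p, r)
    return best


# Toy data, invented for illustration only.
scores = [92, 81, 70, 66, 64, 58, 55, 40]
y_true = [1, 1, 1, 1, 0, 1, 0, 0]
best_thr, best_f1, best_p, best_r = grid_search_threshold(scores, y_true)
```

Sweeping on held-out labeled data rather than picking a threshold by intuition is what surfaces an optimum that sits lower than expected.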

Role

My role & contribution

Architect & sole developer

Entire product from scratch: Upwork/HH scraper, ML pipeline (3-LLM pseudo-labeling, 5 feature group ablation study, threshold grid search), Telegram bot for digests, dedicated server deployment. Personally designed and conducted ablation study, tuned thresholds via grid search.

Demo

How it looks

Architecture

System architecture

Upwork Scraper + HH Scraper → Raw Jobs (200+/day) → 3-LLM Pseudo-label (majority, Kappa 0.82) → Feature Extract (5 groups, ablation) → Scoring (grid, thr=65) → Filter (P94 R87 F1=.90) → TG Bot (hourly digest)

Precision 94% | Recall 87% | F1 0.90 | Cohen's Kappa 0.82 | Threshold 65

Legend: AI/LLM | Data | Infra | Eval

Implementation

How it works

Three LLMs independently label each job (match/no match). Majority vote → pseudo-labels. Ablation study: 5 feature groups, each disabled in turn, F1 measured. Grid search on score threshold (50-95). IR metrics: precision, recall, F1 per iteration.
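
A leave-one-group-out ablation can be sketched as follows. Everything specific here is an assumption for illustration: the three group names (the study above used five), the toy scorer that averages the enabled groups' sub-scores, the fixed threshold of 65, and the four invented jobs. A real study would typically retrain or re-tune the model for each configuration.

```python
def f1(y_true, y_pred):
    """F1 for binary labels, written as 2TP / (2TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def evaluate(jobs, enabled, threshold=65):
    """Score each job as the mean of its enabled groups, then measure F1."""
    y_true = [label for _, label in jobs]
    y_pred = []
    for features, _ in jobs:
        score = sum(features[g] for g in enabled) / len(enabled)
        y_pred.append(1 if score >= threshold else 0)
    return f1(y_true, y_pred)


def ablation_study(jobs, groups):
    """Return the baseline F1 and the F1 drop when each group is disabled."""
    baseline = evaluate(jobs, groups)
    drops = {}
    for g in groups:
        reduced = [x for x in groups if x != g]
        drops[g] = baseline - evaluate(jobs, reduced)
    return baseline, drops


# Hypothetical feature groups and toy jobs: (sub-scores, pseudo-label).
groups = ["skills", "budget", "client"]
jobs = [
    ({"skills": 90, "budget": 70, "client": 60}, 1),
    ({"skills": 80, "budget": 40, "client": 80}, 1),
    ({"skills": 30, "budget": 90, "client": 70}, 0),
    ({"skills": 40, "budget": 60, "client": 50}, 0),
]
baseline, drops = ablation_study(jobs, groups)
most_important = max(drops, key=drops.get)
```

A group whose removal barely changes F1 is a candidate for dropping, while a large drop marks a group the scorer depends on.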

Architecture Decision

Why this way

Pseudo-labeling instead of manual annotation

Alternative

Manually label 500+ jobs for training data

Why it didn't fit

Manual labeling: 2-3 days of work, subjective, doesn't scale. 3 models × majority vote: more objective than one person, scales to any volume, Cohen's Kappa confirms agreement.
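
Cohen's Kappa is defined for two raters, so one way to get a single agreement figure for three models is to average the pairwise values. Whether the 0.82 figure was computed this way is not stated, so treat the averaging as an assumption. The model names and labels below are invented. For the pairwise step, scikit-learn's `sklearn.metrics.cohen_kappa_score` could replace the hand-written function.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    categories = set(a) | set(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(ratings):
    """Average Cohen's kappa over all pairs of raters in a dict."""
    pairs = list(combinations(ratings, 2))
    total = sum(cohen_kappa(ratings[i], ratings[j]) for i, j in pairs)
    return total / len(pairs)


# Hypothetical labels from three models on the same eight jobs.
ratings = {
    "model_a": [1, 1, 0, 0, 1, 0, 1, 0],
    "model_b": [1, 1, 0, 0, 1, 0, 0, 0],
    "model_c": [1, 0, 0, 0, 1, 0, 1, 0],
}
kappa_ab = cohen_kappa(ratings["model_a"], ratings["model_b"])
kappa_mean = mean_pairwise_kappa(ratings)
```

Unlike raw percent agreement, kappa discounts the agreement expected by chance, which matters when one class dominates.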

Result

Quality labels in minutes. Kappa 0.82: agreement higher than two human annotators typically reach

Metrics

Results

01. Precision 94%, Recall 87%, F1 0.90
02. Cohen's Kappa: 0.82 (strong agreement between 3 LLMs)
03. Ablation study: 5 feature groups, each disabled in turn
04. Grid search on threshold: optimum 65 from range 50-95
05. 200+ jobs/day → hourly digest in Telegram
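
As a quick consistency check on the headline numbers, F1 is the harmonic mean of precision and recall. The snippet only reuses the reported precision and recall, and the variable names are illustrative:

```python
precision = 0.94
recall = 0.87

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.3f}")
```

The result is about 0.904, which rounds to the reported 0.90.
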
Business Impact

Impact on business

Automatic scoring of 200+ jobs/day instead of manual review. Job search time: from 2-3 hours/day to 5 minutes (Telegram digest review). 94% precision means almost every auto-application goes to a matching position.

Methods

Algorithms & patterns

  • Pseudo-labeling (3 models)
  • Ablation Study
  • Cohen's Kappa
  • Grid Search
  • Weak Supervision (Snorkel-style)
  • F1/Precision/Recall
  • Bias detection

Stack

Technologies

  • Python
  • scikit-learn
  • DeepSeek API
  • Telegram Bot API
  • PostgreSQL

Ready to discuss?

If you need an architect who builds autonomous AI systems — reach out.

Serbia-based · CET/CEST timezone · EU-aligned working hours · International contracts experience