Lijie Li · Data Scientist

Aalto University · KTH Royal Institute of Technology

MSc Data Science candidate at Aalto and KTH. I work on speech AI, retrieval systems, and practical ML pipelines, with current research on universal speech enhancement for speech-based health biomarkers.

Programming

Python · PyTorch · scikit-learn · LangChain · Pydantic · SQL · JavaScript · Java · LaTeX

Machine Learning

Speech Enhancement · ASR · RAG · Agents · Embeddings · Mamba · Conformal Prediction · Model Evaluation

Systems & MLOps

Linux · Slurm · Triton HPC · GPU Training · Docker · Git · CI/CD · WandB · Async Programming

Data & Retrieval

MongoDB · Qdrant · BM25 · HDBSCAN · Spark · Tableau · Web Scraping

Core Areas

Deep Learning · NLP · Information Retrieval · Distributed Computing · Mathematical Optimization · Data Mining

Focus

Speech enhancement, retrieval, and applied ML

My current research starts from universal speech enhancement (USE) baselines and their evaluation. The goal is to design or adapt enhancement models that preserve biomarker-relevant speech cues when recording conditions shift in practice.

Practice #1

Speech and health-biomarker research

The thesis direction is to adapt universal speech enhancement for biomarker robustness, not only to improve perceptual speech quality.

  • Distorted speech simulation and cleaning
  • Public SE baseline benchmarking
  • AVQI (Acoustic Voice Quality Index) and downstream classification evaluation
  • PyTorch experiments on Triton with Slurm/WandB

Practice #2

Retrieval and applied ML systems

Past projects cover legal RAG, knowledge-graph search, NLP moderation, ETA uncertainty, and small product-facing ML systems.

  • Agentic RAG and source-grounded QA
  • Hybrid retrieval and reranking
  • Model evaluation beyond a single score
  • Data pipelines and reproducible reports

Projects

Project notes with the implementation details left in

The resume keeps these short. This page keeps more context: input data, modeling choice, evaluation, and what was actually implemented.

Speech Research · Aalto University · Master's thesis / research assistant work

Universal Speech Enhancement for Health Biomarkers

Testing whether enhancement can reduce recording-condition drift before biomarker classification.

  • Building distorted-speech simulation and cleaning pipelines for noisy, reverberant, codec-degraded, and microphone-mismatched speech.
  • Benchmarking public speech-enhancement baselines and connecting outputs to the lab's AVQI biomarker evaluation pipeline.
  • Running PyTorch experiments on Aalto Triton with Slurm job scripts and WandB records.
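
One building block of a distorted-speech simulation like the one described above is mixing noise into clean speech at a chosen signal-to-noise ratio. The sketch below is a minimal pure-Python illustration of that single step, not the project code: the names `power` and `mix_at_snr` are hypothetical, and signals are plain lists of floats rather than tensors.

```python
import math


def power(signal):
    """Mean squared amplitude of a signal given as a list of floats."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean/noise power ratio equals `snr_db`, then add it.

    Hypothetical helper for illustration. Assumes both signals have the same
    length and that the noise is not all zeros.
    """
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0.0:
        raise ValueError("noise must have non-zero power")
    # Target noise power follows from SNR_dB = 10 * log10(P_clean / P_noise).
    target_noise_power = power(clean) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [c + gain * n for c, n in zip(clean, noise)]
```

Sweeping `snr_db` over a range of values gives a simple axis of controlled degradation; reverberation, codec, and microphone effects would need their own simulation steps.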

Retrieval Systems · Lexembed · Sweden

Legal QA with Agentic RAG

Built components for source-grounded legal QA over uploaded document collections.

  • Implemented query decomposition, entity extraction, document retrieval, and source-grounded answer generation.
  • Used RAGAS-style checks to compare retrieval relevance, answer grounding, and citation quality.
  • Kept the workflow explicit because legal QA needs traceability more than fluent unconstrained generation.

AI Systems · VTT · Finland · 3rd place · 2025 AaltoAI Hackathon

Knowledge Graph Challenge on Heterogeneous Sources

Built a pipeline for merging innovation records from company sources and graph files.

  • Flattened graph relationships into a unified relation table before entity resolution and graph reconstruction.
  • Used embeddings and HDBSCAN for semantic duplicate detection while preserving source IDs, names, descriptions, and lineage.
  • Built hybrid retrieval with Qdrant ANN + BM25, RRF fusion, and Cross-Encoder reranking; evaluated with Hit Rate and MRR.
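
The fusion and evaluation steps in the last bullet can be sketched in a few lines. This is an illustrative version under stated assumptions, not the hackathon code: the function names are hypothetical, rankings are plain lists of document IDs, and `k=60` is a commonly used RRF default rather than a value taken from the project.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hit_rate_at_k(results, relevant, k):
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = sum(1 for q, docs in results.items() if set(docs[:k]) & relevant[q])
    return hits / len(results)


def mean_reciprocal_rank(results, relevant):
    """Average of 1 / rank of the first relevant doc (0 if none is retrieved)."""
    total = 0.0
    for q, docs in results.items():
        for rank, doc_id in enumerate(docs, start=1):
            if doc_id in relevant[q]:
                total += 1.0 / rank
                break
    return total / len(results)


# Example: fuse a dense (ANN) ranking with a sparse (BM25) ranking.
dense = ["a", "b", "c"]
sparse = ["b", "d", "a"]
fused = rrf_fuse([dense, sparse])  # ['b', 'a', 'd', 'c']
```

RRF only needs ranks, so dense and sparse scores never have to be put on a common scale before fusion; a reranker can then reorder the top of the fused list.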

AI Research · Aalto University · 2nd place

SNLP Challenge: Multilingual Speech + Toxicity

WER 0.0664 / CER 0.0123 with Wav2Vec2-BERT + SpecAugment.

  • Fine-tuned Wav2Vec2-BERT with SpecAugment and regularization for low-resource Esperanto ASR.
  • Benchmarked multilingual toxicity models across English, German, and Finnish.
  • Used Triton GPU resources and WandB to track model comparisons and error analysis.
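
For reference, WER and CER are edit distances normalized by reference length, computed over words and characters respectively. The sketch below is a minimal per-utterance version with hypothetical function names; the challenge's official scorer and its text normalization are not shown here and may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate for one utterance (reference must be non-empty)."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate for one utterance, spaces included."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Corpus-level scores are usually computed by summing edit distances and reference lengths over all utterances before dividing, rather than by averaging per-utterance rates.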

Computer Vision · Aalto Computer Vision Challenge

Reference-Based AI Image Tampering Localization

Localized AI-edited regions by comparing a reference image with its modified version, treating the task as supervised change segmentation.

  • Used a Siamese encoder-decoder setup with shared visual encoders and feature-difference fusion, rather than simple RGB subtraction.
  • Trained the segmentation head with mask-aware losses such as Dice/Focal-style objectives to handle small edited regions.
  • Applied asymmetric augmentation on the edited branch, including compression, color shifts, resizing artifacts, and slight misalignment, to avoid learning only pixel noise.
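
A small example shows why a Dice-style objective suits small edited regions. The function below is a minimal soft Dice loss over flattened pixel probabilities in plain Python; the name `soft_dice_loss` and the `eps` smoothing value are assumptions for illustration, and the actual training code would operate on PyTorch tensors.

```python
def soft_dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss over flattened per-pixel probabilities and a binary mask."""
    intersection = sum(p * t for p, t in zip(probs, target))
    denom = sum(probs) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


# A mask where only 1 of 100 pixels is edited.
target = [1.0] + [0.0] * 99
all_background = [0.0] * 100

# Predicting "nothing edited" is 99% pixel-accurate, yet the Dice loss is close to 1.
loss = soft_dice_loss(all_background, target)
```

Because the loss is driven by overlap with the edited region rather than by per-pixel accuracy, a model cannot score well by predicting an empty mask.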

Predictive Modeling · Wolt Data Science Case

Delivery Time Estimation with Calibrated Uncertainty

Built ETA point models and calibrated prediction intervals for skewed delivery-time errors.

  • Compared target transformations such as raw minutes versus log1p minutes to reduce the effect of long-tail delays.
  • Tested tree-based regression baselines and inspected residual distributions to separate systematic bias from random delay variance.
  • Applied asymmetric conformal prediction on calibration residuals so ETA ranges can allocate more uncertainty to late deliveries than early arrivals.
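
One common way to get asymmetric intervals from calibration residuals is to calibrate the early and late tails separately in a split-conformal setup. The sketch below assumes that approach; the function names and the default `alpha_lo` / `alpha_hi` values are hypothetical, and the case solution may have differed in detail.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample (1 - alpha) quantile used in split conformal prediction."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def asymmetric_interval(pred, cal_true, cal_pred, alpha_lo=0.05, alpha_hi=0.05):
    """Interval [pred - q_lo, pred + q_hi] with separately calibrated tails."""
    early = [p - y for y, p in zip(cal_true, cal_pred)]  # arrived before the ETA
    late = [y - p for y, p in zip(cal_true, cal_pred)]   # arrived after the ETA
    q_lo = conformal_quantile(early, alpha_lo)
    q_hi = conformal_quantile(late, alpha_hi)
    return pred - q_lo, pred + q_hi
```

With a long right tail in the calibration residuals, `q_hi` comes out larger than `q_lo`, so the interval extends further on the late side. By a union bound, the two tails together target coverage of at least 1 - (`alpha_lo` + `alpha_hi`), assuming calibration and test data are exchangeable.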

Data Products · Kunshan Yuanpai Trading · China

Recommendation & Uni-cloud Platform

Implemented a small recommendation and data-service stack for order, inventory, and customer behavior data.

  • Combined DBSCAN customer segmentation, matrix-factorization candidate generation, and multi-armed-bandit (MAB) re-ranking.
  • Optimized MongoDB schema and indexes for common order, inventory, and recommendation queries.
  • Built SQL interfaces and Tableau dashboards for operational reporting.
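
The bandit re-ranking idea can be illustrated with UCB1, which is one standard bandit rule; the source does not say which rule was used, so this is a stand-in. The function name, the click/impression dictionaries, and the `exploration` constant are all hypothetical.

```python
import math


def ucb_rerank(candidates, clicks, impressions, exploration=2.0):
    """Re-rank candidate items by UCB1 score: mean reward plus exploration bonus."""
    total = sum(impressions.get(c, 0) for c in candidates)

    def score(item):
        n = impressions.get(item, 0)
        if n == 0:
            return float("inf")  # never-shown items are explored first
        mean = clicks.get(item, 0) / n
        return mean + math.sqrt(exploration * math.log(total) / n)

    return sorted(candidates, key=score, reverse=True)


# Candidates would come from the matrix-factorization stage.
ranked = ucb_rerank(
    ["a", "b", "c"],
    clicks={"a": 10, "b": 30},
    impressions={"a": 100, "b": 100, "c": 0},
)  # ['c', 'b', 'a']
```

Items with few impressions get a larger bonus, so the re-ranker keeps testing new products instead of locking onto early winners.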

AI Engineering · P&G · Finland · 3rd place · 2025 Junction

Agent Challenge on Automated Personalized Marketing

Built a campaign-generation workflow that converts a brief into localized SMS/email assets with review checks.

  • Used n8n to orchestrate brief parsing, audience/language adaptation, channel-specific copy generation, and asset handoff.
  • Added a self-review step to check brand constraints, safety rules, and SMS/email length limits before final output.

Document AI · Personal tooling for paper reading

LLM-Based Academic Paper Translation Pipeline

Built a personal research workflow around Codex skills, Zotero MCP, Obsidian notes, and LLM-assisted paper reading.

  • Created and used Codex skills for paper analysis, PDF translation, image extraction, paper recommendation, and Obsidian-formatted note generation.
  • Connected Zotero/MCP-style metadata lookup with arXiv/PDF parsing so paper notes include source links, bibliographic context, and extracted figures.
  • Optimized OCR, context-window chunking, and section-aware prompts for long academic PDFs, then saved structured bilingual notes into Obsidian.
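
Section-aware chunking for long PDFs can be sketched as follows. This is an assumed, simplified version rather than the actual pipeline: `chunk_sections` is a hypothetical name, sections are `(title, text)` pairs produced by some earlier parsing step, and word count stands in for a real token count.

```python
def chunk_sections(sections, max_words=400):
    """Split (title, text) sections into chunks of at most `max_words` words.

    Chunks never cross a section boundary, and each keeps its section title so
    a prompt can state where the text came from. A single paragraph longer
    than `max_words` is kept in one piece.
    """
    chunks = []
    for title, text in sections:
        current, count = [], 0
        for para in text.split("\n\n"):
            n = len(para.split())
            if n == 0:
                continue  # skip empty paragraphs
            if current and count + n > max_words:
                chunks.append({"section": title, "text": "\n\n".join(current)})
                current, count = [], 0
            current.append(para)
            count += n
        if current:
            chunks.append({"section": title, "text": "\n\n".join(current)})
    return chunks
```

Keeping the section title on every chunk lets each prompt say which part of the paper it is looking at, which helps when the notes are assembled back in paper order.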

Experience

Professional experience

AI / Data Roles

Feb 2026 — Present

Research Assistant / Master's Thesis Worker

Aalto University · Espoo, Finland

Working on universal speech enhancement for speech-based health biomarkers, with emphasis on data drift, benchmark setup, and downstream evaluation.

  • Researching and building distorted-data simulation for USE experiments, including noisy, reverberant, codec-degraded, and microphone-mismatched speech.
  • Connecting public SE baselines to the lab's AVQI biomarker evaluation pipeline to test whether enhancement reduces drift-induced classification failures.

Aug 2025 — Feb 2026

Data Scientist

Lexembed · Sweden

Developed legal QA components around Agentic RAG, document retrieval, and source-grounded generation.

  • Built query decomposition, entity extraction, knowledge-graph context, and case-based retrieval steps for uploaded legal documents.
  • Used RAGAS-based checks to compare citation grounding, answer relevance, and retrieval quality during iteration.

Aug 2023 — Mar 2024

Data Specialist (Intern)

International Digital Economy Academy · Shenzhen, China

Owned the end-to-end lifecycle for policy moderation models, from generative data augmentation to adversarial hardening and deployment.

  • Fine-tuned DeBERTaV3 with QLoRA and TPE (Tree-structured Parzen Estimator) hyperparameter search, cutting VRAM usage by 80% and improving F1 by 5 points.
  • Used TextAttack adversarial suites to harden classifiers and validated robustness with macro-F1 and MCC dashboards.

Availability

Open to data science roles

Based in Espoo. Open to onsite or remote ML, speech, retrieval, and AI infrastructure roles across Europe, and to internship opportunities in China.

Based in Espoo · Europe roles · China internships · English / 中文