Manish Saraan
Apr 15, 2024

From 2 Seconds to 200ms: How I Fixed a Slow Resume Matching System

Recently, I worked with a startup whose SaaS platform helps fresh graduates find internships. With thousands of users on the platform, they faced a significant challenge: their resume matching pipeline was inefficient, causing long wait times for recruiters trying to find suitable candidates. As the user base grew, this performance issue became critical to address.

Initial Assessment

When I first looked at their MERN stack application, I found that each resume took approximately 2 seconds to process. That might seem acceptable for a single resume, but the system needed to process thousands of resumes per job posting, so the delays added up quickly. Recruiters often had to wait several minutes before they could see relevant matches for their job postings.

The application was built with Express.js on the backend and MongoDB as the database. It let users upload their latest resumes, add previous experience, and list their skills. In my initial testing, evaluating a single potentially relevant resume against a job took around 2 seconds - far too long for an efficient matching system.

The Core Problem

After deeper investigation, I identified the root cause of the performance issues. The system architecture had several inefficiencies in how it handled resume data:

The resumes were stored as PDFs in AWS S3 buckets, which is a reasonable choice for document storage. However, the matching algorithm needed both the job description and the content of these PDFs to process matches. For each matching operation, the system would download the PDF from S3, parse it to extract information, and then match it with job requirements. This process was repeated for every single resume, every time a match was needed.
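To make the bottleneck concrete, here is a rough sketch of what that per-match flow amounts to in a Node.js codebase. The bucket name and helper functions are illustrative assumptions, not the client's actual code, and it presumes the AWS SDK and the pdf-parse package:

    // Illustrative sketch of the original per-match flow (names are assumptions).
    // Every match request re-downloads and re-parses every candidate's PDF.
    const AWS = require('aws-sdk');
    const pdfParse = require('pdf-parse');

    const s3 = new AWS.S3();

    async function scoreResumeAgainstJob(jobDescription, resumeKey) {
      // 1. Download the PDF from S3 -- a network round trip per resume, per match
      const object = await s3
        .getObject({ Bucket: 'resume-pdfs', Key: resumeKey })
        .promise();

      // 2. Parse the PDF into plain text -- CPU-heavy work repeated for the same file
      const { text } = await pdfParse(object.Body);

      // 3. Match the extracted text against the job requirements
      return computeMatchScore(jobDescription, text); // hypothetical scoring helper
    }

Doing all three steps inside every match request is what pushed each resume to roughly 2 seconds.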

Solution Implementation

Based on my previous experience with Apache Solr in similar scenarios, I knew it would be perfect for this use case. Here's how we implemented the solution:

I decided to parse the resume information once, during the initial upload, instead of processing it repeatedly. The parsed data would be stored in Apache Solr, which I deployed on an AWS EC2 instance. The choice of Solr was deliberate: its text analysis capabilities and fast query performance make it well suited to this kind of matching workload.

We created utility methods to handle various scenarios:

  • When a user uploads a new resume, it's immediately parsed and indexed in Solr
  • If a user updates their resume, the system automatically re-indexes the new information
  • Previous experience and skills are also indexed in Solr for comprehensive matching

This approach eliminated the need for repeated parsing and provided fast access to structured data.
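As a rough illustration, an upload-time indexing utility along these lines could look like the following. The core name (`resumes`), field names, and Solr host are assumptions for the sketch; it posts documents to Solr's standard JSON update endpoint:

    // Sketch of an upload-time indexing utility (core, host, and field names are assumptions).
    const axios = require('axios');
    const pdfParse = require('pdf-parse');

    const SOLR_UPDATE_URL = 'http://localhost:8983/solr/resumes/update?commit=true';

    async function indexResume(user, pdfBuffer) {
      // Parse the PDF once, at upload time, instead of on every match
      const { text } = await pdfParse(pdfBuffer);

      const doc = {
        id: user._id.toString(),           // reuse the MongoDB user id as the Solr document id
        resume_text: text,                 // full extracted resume text
        skills: user.skills || [],         // skills listed on the user's profile
        experience: user.experience || []  // previous experience entries
      };

      // Posting a document with an existing id replaces it, so the same call
      // also covers re-indexing when a user updates their resume.
      await axios.post(SOLR_UPDATE_URL, [doc]);
    }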

Migration Strategy

With the new system in place, we needed to handle existing data. I developed a migration script that:

  1. Fetched user information from the MongoDB database
  2. Retrieved and parsed their stored resumes from S3
  3. Indexed all the extracted information into Solr

The migration process was straightforward since we could reuse the same utility methods we'd built for new uploads. This ensured consistency across all data, whether it was newly uploaded or migrated from the old system.
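A rough sketch of such a script, assuming a Mongoose `User` model with an `s3Key` field pointing at the stored PDF, and reusing the hypothetical `indexResume` utility from the previous sketch:

    // One-off migration sketch (model and field names are assumptions).
    const AWS = require('aws-sdk');
    const mongoose = require('mongoose');

    const s3 = new AWS.S3();
    // Placeholder model for the sketch; the real schema lives in the application code.
    const User = mongoose.model('User', new mongoose.Schema({}, { strict: false }));

    async function migrateExistingResumes() {
      const users = await User.find({ s3Key: { $exists: true } });

      for (const user of users) {
        // Pull the stored PDF from S3, then feed it through the same upload path
        const object = await s3
          .getObject({ Bucket: 'resume-pdfs', Key: user.s3Key })
          .promise();

        await indexResume(user, object.Body);
      }
    }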

Results

The impact of these optimizations was immediate and significant. The processing time improved by a factor of 10, dropping from 2 seconds to approximately 200ms per resume. This improvement was achieved by eliminating the need to download and parse PDFs during the matching process.

Thanks to Solr's query API, we could fetch all the required resume data in a single operation, further improving efficiency. The system could handle thousands of resumes without significant delay, giving recruiters quick access to relevant candidates.
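For illustration, a match-time lookup against that index might resemble the sketch below. The query fields and ranking are assumptions rather than the client's actual algorithm, and a production version would escape the query text properly:

    // Sketch of a match-time query (field names and ranking are illustrative).
    const axios = require('axios');

    async function findCandidates(jobDescription, rows = 50) {
      const { data } = await axios.get('http://localhost:8983/solr/resumes/select', {
        params: {
          q: `resume_text:(${jobDescription}) OR skills:(${jobDescription})`,
          fl: 'id,skills,experience,score',  // return only what the matcher needs
          rows                               // cap results per job posting
        }
      });

      // A single HTTP call returns pre-parsed, relevance-ranked candidates --
      // no S3 download and no PDF parsing on the hot path.
      return data.response.docs;
    }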

Technical Stack

The final solution utilized:

  • Express.js for the backend server
  • MongoDB for user data storage
  • AWS S3 for storing original resume PDFs
  • Apache Solr for processed resume data
  • AWS EC2 for hosting the Solr instance

Conclusion

This optimization project demonstrated how proper architectural decisions and tool selection can dramatically improve application performance. By moving from real-time processing to pre-processing with Apache Solr, we not only solved the immediate performance issues but also created a more scalable foundation for future improvements.

The client was extremely satisfied with the results, and we could then move on to optimizing the matching algorithm itself, knowing we had a solid infrastructure in place to support it.
