Abstract
Organizations now use AI-assisted resume screeners to manage applicant pools that human recruiters cannot review one by one. That change matters most in entry-level hiring, where internships and first jobs shape later access to technical careers. A student may upload a resume expecting a recruiter to read it, while an automated ranking system has already decided whether that resume appears near the top of the shortlist. My thesis portfolio studies this problem from two sides: how to audit the ranking behavior of an LLM-assisted screening pipeline, and how institutions govern commercial screening tools after purchase. The technical project measures where demographic gaps appear in ranked outputs and tests mitigation strategies. The STS research paper examines how procurement terms, recruiter workflows, and vendor contracts decide who can inspect, challenge, or override those outputs. The projects are coupled because both focus on the same failure point: a ranked list can look efficient to the organization using it while hiding the reasons some qualified candidates never receive attention. Together, the projects ask how hiring systems can preserve efficiency without making fairness invisible to applicants, universities, and recruiters.
The technical report, Auditing the Shortlist: Measuring and Mitigating Demographic Bias in LLM-Assisted, Early-Career Resume Ranking, builds a transparent screening pipeline for software-engineering internship applications. The system scores 112 synthetic resumes across eight demographic groups against a standardized job description and a seven-part qualification rubric. Gemma4-26B-A4B, served locally through an MLX runtime, produces structured scores and decision logs for each candidate. I tested four conditions: a baseline, demographic masking, post-hoc reranking, and a hybrid of masking plus reranking. Every condition reached 1.000 top-15 precision, so every shortlisted candidate met the ground-truth qualification standard. That utility result hid large fairness gaps. The baseline produced a 0.250 false negative rate gap across groups, and masking increased the gap to 0.500. Post-hoc reranking reduced the gap to 0.119 without reducing precision. The result shows that fairness must be measured at the shortlist boundary, where recruiter attention gets allocated. It also shows that a familiar mitigation strategy can fail in a specific model and data setting. In this experiment, masking names, universities, and locations removed some signals while leaving content-level proxies the model still used.
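As a minimal sketch of why a utility metric and a shortlist-boundary fairness metric can disagree, the following computes top-k precision and a false negative rate gap on a toy ranking. Everything here is an illustrative assumption: the `Candidate` fields, the function names, the definition of a group's false negative rate as the share of its qualified candidates left off the top-k shortlist, the definition of the gap as the highest group rate minus the lowest, and the six toy candidates. None of it is taken from the report's code or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    cid: str
    group: str        # demographic group label (hypothetical)
    qualified: bool   # ground-truth label from the qualification rubric
    score: float      # score assigned by the screening model


def shortlist(cands, k):
    """Return the k highest-scoring candidates."""
    return sorted(cands, key=lambda c: c.score, reverse=True)[:k]


def top_k_precision(cands, k):
    """Share of shortlisted candidates who are truly qualified."""
    top = shortlist(cands, k)
    return sum(c.qualified for c in top) / len(top)


def fnr_by_group(cands, k):
    """Per group: share of qualified candidates missing from the shortlist."""
    selected = {c.cid for c in shortlist(cands, k)}
    qualified = defaultdict(int)
    missed = defaultdict(int)
    for c in cands:
        if c.qualified:
            qualified[c.group] += 1
            if c.cid not in selected:
                missed[c.group] += 1
    return {g: missed[g] / n for g, n in qualified.items()}


def fnr_gap(cands, k):
    """Assumed gap definition: highest group FNR minus lowest group FNR."""
    rates = fnr_by_group(cands, k)
    return max(rates.values()) - min(rates.values())


# Toy data, not from the thesis: two groups, two qualified candidates each.
toy = [
    Candidate("a1", "A", True, 0.9),
    Candidate("a2", "A", True, 0.8),
    Candidate("b1", "B", True, 0.7),
    Candidate("b2", "B", True, 0.4),
    Candidate("a3", "A", False, 0.3),
    Candidate("b3", "B", False, 0.2),
]

print(top_k_precision(toy, k=3))  # every shortlisted candidate is qualified
print(fnr_by_group(toy, k=3))     # group B loses one of two qualified candidates
print(fnr_gap(toy, k=3))
```

In this toy case the shortlist is perfectly precise, yet one group loses half of its qualified candidates at the cutoff while the other loses none. Precision alone cannot surface that pattern, because it only looks at who made the shortlist and not at who was left off it.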
The STS research paper, A Sociotechnical Analysis of AI-Assisted Resume Screening in Entry-Level Hiring, studies the governance system around the same kind of tool. I analyzed 23 public documents: 11 university requests for proposals (RFPs) and 12 vendor terms, agreements, and white papers from major hiring-technology providers. I coded each document for merit definitions, audit rights, candidate appeal mechanisms, and liability language. The findings show that universities often ask for speed without demanding accountability. Nine of 11 RFPs requested automated resume parsing, skills matching, or keyword scoring, but none required vendors to validate those scoring criteria against job performance. Only two RFPs requested any audit access, and none named applicants as stakeholders. On the vendor side, nine of 12 documents limited or prohibited reverse engineering, and none provided a meaningful applicant appeal path. Ten of the 12 capped liability, and seven disclaimed responsibility for discriminatory outcomes caused by client use. The paper argues that procurement turns the screener into a black box by giving vendors control over decision logic while leaving universities with legal exposure and applicants with no practical way to challenge a ranking. That allocation of control matters because transparency by itself would not solve the problem. A university also needs the right to demand data, require changes, and explain decisions to affected applicants.
The portfolio's main contribution is the connection between measurement and authority. The technical project shows that a glass-box audit can reveal bias patterns that ordinary accuracy and precision metrics miss. The STS project shows why that evidence often fails to matter in real hiring systems: institutions may lack the contractual rights to inspect scoring logic, request candidate-level data, or act on audit findings. Future work should join those two requirements at procurement. Universities should define merit in testable terms, require audit access before adoption, provide applicants with explanations and appeal channels, and share liability with vendors whose systems shape hiring outcomes. The next version of this work should scale the technical audit across more job descriptions, model families, and resume datasets, then connect those audit logs to a recruiter interface with appeal and override paths. AI resume screening can help recruiters manage volume, but only if the institutions using it keep authority over the decisions it influences.