Olga Kononykhina

Where Data, AI and Society meet

Olga Kononykhina - Quantitative sociology and data science in the fields of art, culture, diversity, governance, development, civil society and non-profits

 

Occupational Coding &
AI-Supported NLP Classification

 

My work on occupational coding focuses on what happens before and beyond the algorithm itself. Across six research papers developed during my PhD, I examined the full occupational classification pipeline: from survey design and job-description collection to model bias, model training, and evaluation. This work identified four critical weak points where data and design decisions can lead to accuracy losses of up to 30%.
These insights are now informing the development of the next generation of automatic occupational coding tools.

My research has been published at venues such as ACL and the Journal of Open Source Software (JOSS), with additional papers currently under revision in leading journals in the field.

 

I also serve as Chair of the International Partnership on Automatic Occupational Coding (IPAJC), where I lead an international community of researchers and tool developers to support the development of new coding and classification tools and to promote occupational data collection in social science, epidemiology, occupational safety, and official statistics.

 
 

Mind the Gap: Gender-based Differences in Occupational Embeddings (ACL)
Olga Kononykhina, Anna-Carolina Haensch, Frauke Kreuter

Large Language Models (LLMs) offer promising alternatives to traditional occupational coding approaches in survey research. Using a German dataset, we examine the extent to which LLM-based occupational coding differs by gender. Our findings reveal systematic disparities: gendered job titles (e.g., “Autor” vs. “Autorin”, meaning “male author” vs. “female author”) frequently result in diverging occupation codes, even when semantically identical. Across all models, 54%–82% of gendered input pairs yield different Top-5 suggestions. The practical impact, however, depends on the model. GPT includes the correct code most often (62%) but demonstrates a female bias (up to +18 pp). IBM is less accurate (51%) but largely balanced. Alibaba, Gemini, and MiniLM achieve about 50% correct-code inclusion, and their small (< 10 pp) and direction-flipping gaps could indicate sampling noise rather than gender bias. We discuss these findings in the context of fairness and reproducibility in NLP applications for social data.
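To make the Top-5 comparison concrete, the sketch below shows how a check of this kind can be set up with an off-the-shelf multilingual MiniLM encoder via sentence-transformers. The candidate categories and their codes are toy placeholders rather than the actual KldB index used in the paper, and the model name is simply one publicly available MiniLM variant, not necessarily the one evaluated in the study.

```python
# Sketch: do "Autor" and "Autorin" receive the same Top-5 category suggestions?
# Candidate categories below are illustrative placeholders, not real KldB entries.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

candidates = [
    ("A1", "Autoren/Autorinnen und Schriftsteller/Schriftstellerinnen"),
    ("A2", "Journalisten/Journalistinnen und Redakteure/Redakteurinnen"),
    ("A3", "Lektoren/Lektorinnen"),
    ("A4", "Übersetzer/Übersetzerinnen und Dolmetscher/Dolmetscherinnen"),
    ("A5", "Lehrkräfte an allgemeinbildenden Schulen"),
    ("A6", "Berufe in der Softwareentwicklung"),
]
cand_emb = model.encode([label for _, label in candidates], convert_to_tensor=True)

def top5(job_title):
    """Rank candidate categories by cosine similarity to the free-text title."""
    query_emb = model.encode(job_title, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, cand_emb)[0]
    ranked = sorted(zip(candidates, scores.tolist()), key=lambda pair: -pair[1])
    return [code for (code, _label), _score in ranked[:5]]

print("Top-5 ('Autor'):  ", top5("Autor"))
print("Top-5 ('Autorin'):", top5("Autorin"))
print("Identical Top-5:  ", top5("Autor") == top5("Autorin"))
```

In the study itself, the same kind of comparison is run over the full classification index and repeated across several embedding providers before aggregating the gender gaps.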

 

The Impact of Question Framing on the Performance of Automatic Occupation Coding
Olga Kononykhina, Frauke Kreuter, Malte Schierholz

Occupational data play a vital role in research, official statistics, and policymaking, yet their collection and accurate classification remain a challenge. This study investigates the effects of occupational question wording on data variability and the performance of automatic coding tools. We conducted and replicated a split-ballot survey experiment in Germany using two common occupational question formats: one focusing on “job title” and another on “occupational tasks”. Our analysis reveals that automatic coding tools, such as CASCOT and OccuCoDe, exhibit sensitivity to the form and origin of the data. Specifically, these tools were more efficient when coding responses to the job title question format compared with the occupational task format, suggesting a potential way to improve the respective questions for many German surveys. In a subsequent “detailed tasks and duties” question, providing a guiding example prompted respondents to give longer answers without broadening the range of unique words they used. These findings highlight the importance of harmonising survey questions and of ensuring that automatic coding tools are robust to differences in question wording. We emphasise the need for further research to optimise question design and coding tools for greater accuracy and applicability in occupational data collection.
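The descriptive finding that a guiding example produces longer answers without a broader vocabulary boils down to comparing mean answer length with the number of unique words per question version. A minimal sketch of that comparison, with invented toy responses standing in for the split-ballot survey data:

```python
# Sketch: mean answer length vs. vocabulary breadth per question version.
# The responses below are invented for illustration only.
from collections import Counter

responses = {
    "job_title": ["Bäcker", "Softwareentwicklerin", "Krankenpfleger"],
    "tasks": [
        "Ich backe Brot und Brötchen und bediene Kunden",
        "Ich entwickle und teste Webanwendungen",
        "Ich pflege Patienten und dokumentiere die Behandlung",
    ],
}

def describe(texts):
    """Mean answer length in tokens and the number of unique words used."""
    tokens = [tok.lower() for text in texts for tok in text.split()]
    return {
        "mean_tokens": round(len(tokens) / len(texts), 1),
        "unique_words": len(Counter(tokens)),
    }

for version, texts in responses.items():
    print(version, describe(texts))
```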

 

People earn a living in a multitude of ways, which is why the occupations they pursue are almost as diverse as people themselves. This makes quantitative analysis of free-text occupational responses from surveys difficult or impossible, especially since people may refer to the same occupation with different terms. To address this problem, a variety of classifications have been developed, such as the International Standard Classification of Occupations 2008 (ISCO) (ILO, 2012) and the German Klassifikation der Berufe 2010 (KldB) (Bundesagentur für Arbeit, 2011), narrowing the number of occupation categories down to more manageable counts in the mid hundreds to low thousands and introducing a hierarchical ordering of categories. This leads to a different problem, however: coding occupations into these standardized categories is usually expensive, time-intensive, and plagued by issues of reliability.

Here we present a new instrument that implements a faster, more convenient, and interactive occupation coding workflow in which respondents are included in the coding process. Based on the respondent’s answer, a novel machine learning algorithm generates a list of suggested occupational categories from the Auxiliary Classification of Occupations (Schierholz, 2018), from which one is chosen by the respondent (see Figure 1). Issues of ambiguity within occupational categories are addressed through clarifying follow-up questions. We provide a comprehensive toolbox, including anonymized German training data and pre-trained models, without raising privacy issues, something not yet possible with other algorithms due to the difficulties of anonymizing free-text data.
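From the respondent's point of view the workflow is deliberately simple: a free-text answer goes in, a short list of ranked category suggestions comes back, the respondent picks one, and an ambiguous category triggers one clarifying follow-up question. The sketch below outlines that loop in schematic Python; the suggestion function, the categories, and the ask callback are placeholders for illustration, not the instrument's actual model or API.

```python
# Schematic sketch of the respondent-in-the-loop coding workflow.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Suggestion:
    code: str                        # category code in the auxiliary classification
    label: str                       # human-readable category title
    follow_up: Optional[str] = None  # clarifying question for ambiguous categories

def suggest_categories(free_text: str) -> List[Suggestion]:
    """Placeholder for the trained model: return ranked candidate categories."""
    # Toy output; the real instrument is backed by pre-trained models and
    # the Auxiliary Classification of Occupations, which are not reproduced here.
    return [
        Suggestion("0001", "Example category A"),
        Suggestion("0002", "Example category B",
                   follow_up="Do you mainly do X or Y?"),
    ]

def interactive_coding(free_text: str,
                       ask: Callable[[str, List[Suggestion]], Suggestion]) -> Suggestion:
    """Suggest categories, let the respondent choose, and ask one clarifying
    follow-up question if the chosen category is ambiguous."""
    candidates = suggest_categories(free_text)
    choice = ask("Which of these best matches your job?", candidates)
    if choice.follow_up is not None:
        choice = ask(choice.follow_up, [choice])
    return choice
```

The key design choice this mirrors is that the respondent, not a post-hoc coder, resolves the final category, which is what keeps the workflow fast and avoids storing identifiable free text for later manual coding.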