Machine Learning for legal interpretations of annual reports regarding gender equality

A blog post by former Junior Researchers and ITU alumni Niels Helsø, Per Rådberg Nagbøl and Benjamin Olsen, who in their Master Thesis investigated the use of Machine Learning in checking if companies fulfilled the legal requirements of the Danish Financial Statements Act §99b. Here, they supplement the outline of the project with some thoughts that have arisen post-graduation.

Roughly summarized, the law states that a company shall report on the status of, and progress towards, an equal gender balance in top management. If the company has an unequal gender distribution in top management, it must declare this and explain how it will achieve equality between the genders. This declaration has to include target figures, time frames, policies, and initiatives that can support gender equality. Machine Learning was applied to check whether companies reported according to the law. We believe technology is not a goal, but rather a means to an end. Therefore, we included the following sub-themes in the Master Thesis: Machine Learning, Governmentality, linguistics, participatory design, reflections on truth and data, and categorization. We did this in order to gain a better understanding of the context surrounding the tool.

Human Learning as a necessity for Machine Learning

One does not simply build a tool that can “interpret” whether a company fulfills legal requirements. One must first understand the domain experts’ practice of interpreting the law, the governmental perspective under which the law is put into practice, and the different purposes of the law. In our case, the law was put into practice in two settings: regulation and policymaking. Establishing a mutual learning process was therefore a key component of the project. We told the different domain experts about the technical possibilities, while they told us about their areas of expertise and the practice of their work.

We facilitated the mutual learning process by combining techniques such as thinking-aloud experiments, in-situ interviews, interviews, observations, document analysis, and a workshop/presentation event. The learning outcome was then translated into a tagging guide, which was used to create the training data. This was meant to ensure an accurate and consistent interpretation of the law: a difficult process, because the law had to be split into questions with binary answers so that §99b could be “interpreted” by a computer. The tagged data was then used to train supervised machine learning algorithms.
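The idea of splitting the law into binary questions can be sketched as follows. Note that the question wording and field names below are our own illustrative paraphrases, not the actual tagging guide from the thesis:

```python
# Illustrative sketch: §99b rephrased as binary questions, each becoming
# one label on a tagged sentence. The questions are hypothetical examples.
QUESTIONS = {
    "mentions_target_figure": "Does the sentence state a target figure for the gender balance?",
    "mentions_time_frame": "Does the sentence state a time frame for reaching the target?",
    "mentions_policy": "Does the sentence describe a policy or initiative supporting gender equality?",
}

def tag_sentence(sentence, answers):
    """Attach one binary label per question to a sentence, defaulting to False."""
    return {
        "text": sentence,
        "labels": {q: bool(answers.get(q, False)) for q in QUESTIONS},
    }

example = tag_sentence(
    "The board aims for 40 % women in top management by 2020.",
    {"mentions_target_figure": True, "mentions_time_frame": True},
)
```

Each tagged sentence then yields one training example per question, so every question can be learned by its own binary classifier.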

Finding a fitting category

One of our main problems was the categorization of §99b, a law with many exceptions. This made it hard to divide into smaller boxes by assigning labels: many boxes overlapped, which caused considerable frustration and made it difficult to categorize the output of our models. We realized that classical categories would be hard to use, which forced us to look elsewhere. This led us to George Lakoff’s categorization approach called radial categories.

The approach is built on the idea of a main category with which all subcategories share an attribute. The main category is defined by all the attributes that also define the subcategories. Each subcategory is defined by a subset of the attributes of the main category, making the subcategories more general. However, all subcategories share one common attribute with the main category: the point from which everything else generalizes. In our case, this was whether the company was required to work in accordance with §99b, which is decided by the reporting class of the company.

To us, this approach was very helpful because it enabled us to break the law down into attributes, which were assigned to the main category and distributed across our subcategories. Our subcategories were the possible ways a company could uphold the law, which took care of all the exemptions in the law. It also gave an overview of how many aspects our tool should handle; in practice, we made a model for each attribute we located in our scheme.
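The structure of a radial category can be expressed directly as sets of attributes. The attribute names here are hypothetical stand-ins (the real scheme had 9 attributes across four subcategories); the point is the shape, where every subcategory is a subset of the main category and shares one common attribute with it:

```python
# Hypothetical attribute names, used only to illustrate the radial structure.
MAIN_CATEGORY = {"covered_by_99b", "target_figure", "time_frame", "policy", "status_report"}
SHARED = "covered_by_99b"  # the attribute every subcategory shares with the main category

SUBCATEGORIES = {
    "sub1": {SHARED},  # the most general subcategory
    "sub2": {SHARED, "status_report"},
    "sub3": {SHARED, "target_figure", "time_frame"},
    "sub4": {SHARED, "target_figure", "time_frame", "policy"},
}

def is_radial(main, shared, subs):
    """Each subcategory must be a subset of the main category and contain the shared attribute."""
    return all(s <= main and shared in s for s in subs.values())
```

Modelling the scheme this way also makes it easy to enumerate the attributes, and thereby the classifiers the tool needs.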

As can be seen in the model above, we had 9 attributes across four subcategories. Subcategory 1 is the most general, whereas subcategory 4 has the most attributes besides the primary category. In our case, the central case was almost unachievable: a company would never need to have all of these attributes, because it would then have upheld the law on multiple levels. To us, however, it represented all the attributes our tool should handle, and the subcategories are the possible ways the law can be upheld with those attributes: yet again, patterns the tool should learn.

Building the tool

The tool consisted of three main parts: data preparation, classification, and evaluation.

  1. The preparation part is mainly concerned with data formatting and selection. First, language detection is applied to divide the data into Danish and English (companies must report in at least one of these languages). We then split each report into sentences, which were sent to the machine learning layer.
  2. The machine learning part within the dotted box is divided into three layers. First, an algorithm finds all sentences related to gender in the annual report. Next, each of those sentences is classified according to the law using several different classification models. Lastly, we extract quantitative information in the form of target figures from the relevant sentences.
  3. The evaluation part is concerned with the categorization of the report based on the classification of the individual sentences.
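The preparation step above can be sketched in a few lines. This is a deliberately simplified stand-in: the character-counting language detector and the regex sentence splitter are toy assumptions (the real pipeline used proper language detection and NLTK tokenization):

```python
import re

def detect_language(text):
    """Toy language detector: the Danish letters æ, ø, å are a crude but
    usable signal for separating Danish from English reports."""
    danish_chars = sum(text.lower().count(c) for c in "æøå")
    return "da" if danish_chars > 0 else "en"

def split_sentences(text):
    """Naive sentence splitter on end-of-sentence punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def prepare(report_text):
    """Route a report by language and break it into sentences for the ML layer."""
    return detect_language(report_text), split_sentences(report_text)

lang, sents = prepare("Bestyrelsen består af fem mænd. Målet er 40 % kvinder.")
```

In practice each sentence would then be passed on to the relevance classifier in the machine learning layer.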

The primary models were all built using the open source machine learning libraries scikit-learn, gensim, and NLTK. We further experimented with the FastText algorithm made by Facebook researchers. To find the relevant sentences, we also used the sentences immediately surrounding the target sentence, since the context turned out to be important for detecting the right topics. Each relevant sentence was then classified on its own, according to the categories found in the law. Lastly, for the relevant sentences, we extracted target figures by sliding a window across the sentence and classifying each word/token. We could not simply extract every numeral, since an annual report contains a range of different numbers, and these numbers are expressed in multiple ways, in both textual and numeric form.
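The sliding-window idea can be illustrated as follows. The window mechanics match what the text describes, but the centre-token rule below is a toy stand-in for the trained per-token classifier, and the keyword list is our own assumption:

```python
def windows(tokens, size=5):
    """Slide a fixed-size window over the tokens, padding at the edges so
    every token appears once at the centre."""
    pad = ["<pad>"] * (size // 2)
    padded = pad + tokens + pad
    for i in range(len(tokens)):
        yield padded[i:i + size]

def is_target_figure(window):
    """Toy rule in place of the trained classifier: the centre token is a
    number and the surrounding context mentions a target/gender word."""
    centre = window[len(window) // 2]
    context = " ".join(window).lower()
    return centre.rstrip("%").isdigit() and any(
        word in context for word in ("women", "kvinder", "target", "måltal")
    )

tokens = "the target is 40 % women in management".split()
figures = [w[len(w) // 2] for w in windows(tokens) if is_target_figure(w)]
```

Classifying each token in context, rather than grabbing every number, is what lets the extractor ignore the many other figures an annual report contains.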


In general, our models performed very well when looking at the ROC AUC score and overall accuracy. However, these results hide the fact that all the models struggled with recall. This is because our dataset was unbalanced, even though we tried to balance it with a layer that found all relevant sentences, which pushed the balance from 1/99 to 10/90 on the core classification tasks.

Among the models that performed reasonably on accuracy and ROC AUC, the model that found the relevant sentences had the best recall. Even though its AUC was .98 and its accuracy 97.2 %, its recall was only 83.8 %, meaning that quite a few sentences (16.2 %) about gender were ignored and thrown out. For other models, this was much worse. For example, a model that had to find companies reporting an equal distribution on the board of directors had an accuracy of 98 % and a ROC AUC of .88, so at first glance the model seemed good. On closer inspection, however, its recall was only 14 %, meaning that 86 % of the sentences in which companies reported an equal distribution were ignored by the model. The reason for this issue turned out to be an extremely skewed dataset: only very few companies have an equal distribution in the eyes of the law. This happens even though an equal distribution can, in some cases, be achieved with as little as 25 % women in top management according to the law. We find this to be a justification for the law in its own right. In general, this pattern can be seen in several of our models.
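Why high accuracy can coexist with terrible recall is easy to demonstrate with synthetic numbers (these are illustrative, not our actual data). With 20 positive sentences out of 1000, a classifier that finds only 3 of them still scores above 98 % accuracy:

```python
# Synthetic imbalanced dataset: 20 positive sentences out of 1000.
y_true = [1] * 20 + [0] * 980
# A classifier that catches only 3 of the 20 positives and raises no false alarms.
y_pred = [1] * 3 + [0] * 17 + [0] * 980

# Accuracy: fraction of all predictions that are correct.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Recall: fraction of true positives that the classifier actually found.
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)
```

Here accuracy is 98.3 % while recall is only 15 %, which is why we report recall (and ROC AUC) alongside accuracy for every model.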