Your face is average. According to that guy.

This is a blog post written by Dana Marie Yu. Dana was a Junior Researcher in ETHOS Lab in the spring semester of 2017. She is on the MSc in Software Development programme at the ITU, and during the fall of 2017 she is studying abroad in Japan.

Your adolescent self can kindly dismiss the service of your mirror because the time spent examining your face – analyzing the size of your forehead or the shape of your nose – is over. Now there is face attribute recognition software to do it for you.

In recent years, the use of deep learning techniques has significantly improved classification accuracy in widespread image recognition applications. Deep learning models are now also being applied to face attribute recognition.

How does it work?

In supervised machine learning (supervised meaning the classification outputs are known in advance), a labelled training data set is key. This data set contains sample data points representing the classes the model should recognize, and each data point carries a corresponding class label. The data is fed into a machine learning model that predicts a class for each data point, compares the predicted class label to the actual class label, and adjusts its internal parameters to improve its accuracy. The resulting “trained” model is then used to predict classes for new, unlabelled data points.
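As a toy sketch of this training-then-prediction cycle (the single “glasses-likeness” feature score and all data points are invented for illustration; real deep learning models operate on raw pixels with millions of parameters):

```python
# Toy supervised learning: learn a decision threshold on one invented
# feature score from labelled examples, then predict new points.

def train_threshold(samples):
    """Pick the threshold that maximises accuracy on the training set."""
    best_t, best_acc = None, -1.0
    for t, _ in samples:  # candidate thresholds: the observed scores
        acc = sum((x >= t) == label for x, label in samples) / len(samples)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Labelled training set: (feature score, "wears glasses" class label)
training = [(0.1, False), (0.2, False), (0.7, True), (0.9, True)]
threshold = train_threshold(training)

def predict(score):
    """The 'trained' model classifies a new, unlabelled data point."""
    return score >= threshold

print(predict(0.8))   # True
print(predict(0.15))  # False
```

The “learning” here is just picking the threshold that best separates the labelled examples; a deep network does the same thing in principle, only over vastly more parameters.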

The key to a classification model lies in the training data set it learns from. In face attribute recognition, learning from sample inputs and class labels makes sense for attributes like “wears glasses” and “teeth visible”, since these can be universally defined. But what about other attribute labels such as “chin size”, “nose width”, and “face shape”?

One publicly available face image data set for training classification models, the CelebA data set [1], is labelled with 40 attributes that include “attractive”, “young”, “big nose”, and “chubby” [2].

Building a sizable, high-quality training data set comes with well-known costs in money and time. But where do the basic annotations in face attribute training data sets come from for the many attributes that appear fairly subjective?

One research team that recently developed a face attribute recognition model acknowledged the lack of labelled face attributes in existing face image data sets for training [3].

Therefore, to build their own training data set, the team generated face attribute annotations automatically: they used an off-the-shelf face attribute predictor model [2], itself trained on the CelebA data set, to label the attributes of each image in the CASIA-WebFace data set [4], and took the majority-voted label for each attribute as the “ground-truth” attribute label for each image [3].

To evaluate the accuracy of their proposed network model, the team needed new face images and attribute annotations independent of the training data set. To generate these attribute annotations, they asked three annotators unrelated to their project to label the selected facial attributes on a randomly sampled subset of images from the FaceScrub data set [3, 5]. The majority-voted labels among the three annotators became the “ground-truth” face attribute labels for each image used to evaluate their model’s attribute classification accuracy.
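Both of these labelling schemes reduce to a majority vote per attribute. A minimal sketch, with invented attribute names and votes:

```python
from collections import Counter

def majority_label(votes):
    """Return the most frequent label among the annotators' votes."""
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical annotators label one image's attributes;
# the majority vote becomes that image's "ground-truth" label.
annotations = {
    "big nose": ["yes", "no", "yes"],
    "chubby":   ["no", "no", "yes"],
}
ground_truth = {attr: majority_label(v) for attr, v in annotations.items()}
print(ground_truth)  # {'big nose': 'yes', 'chubby': 'no'}
```

With only three voters, a single annotator's opinion is enough to tip any 2–1 split, which is exactly the fragility the rest of this post is about.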

One publicly available face image data set that includes labelled attributes, PubFig [6], was created by outsourcing the attribute annotation task to individuals via Amazon Mechanical Turk [7, 8], an online platform that matches online tasks that require human intelligence to online workers. A few manually labelled images annotated by the data set creators were submitted as examples, and the online workers were asked to select face images that exhibited a specified attribute. 

Each submitted labelling job was performed by three different workers, and only labels agreed upon by all three workers were used in the final data set. Via this process, “ground-truth” face attribute annotations were collected for 65 attributes including “Asian”, “Indian”, “attractive woman”, “chubby”, “eyebrow shape”, and “middle-aged” for the images in the PubFig data set [8].
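Note that this scheme is stricter than a majority vote: a label survives only if all three workers agree. Sketched with invented image names and responses:

```python
# Hypothetical worker responses per image for one attribute: did the
# image exhibit the attribute? Only unanimous labels reach the data set.
responses = {
    "img_001": [True, True, True],    # all three agree: kept
    "img_002": [True, False, True],   # any disagreement: discarded
}

kept = {img: votes[0] for img, votes in responses.items()
        if len(set(votes)) == 1}
print(kept)  # {'img_001': True}
```

Unanimity filters out the most contested images, but it also means the final data set only reflects cases where those particular three workers happened to see things the same way.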

Why is this concerning?

So far, the training data sets we have encountered were annotated by taking the majority opinion of seemingly random individuals. In the examples above, at most three individuals gave their opinion on each attribute.

In June 2017, a survey was conducted asking individuals to label the “eyes distance” face attribute in eight face images. The individuals labeled “eyes distance” using five categories that are currently being used by a commercial face recognition web service, Betaface API [9]: “extra close”, “close”, “average”, “far”, and “extra far”. For each image, 200 human annotations were collected and compared to the face attribute label from Betaface API.

In Images 1, 2, 3, and 4, the majority of the 200 labels matched the attribute label from Betaface API. However, in Image 4 the majority led by a margin of only 1%.

Image 1
Image 2
Image 3
Image 4
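That 1% margin can be read directly off the vote tallies. A sketch with invented counts (the survey’s actual tallies are not reproduced here):

```python
from collections import Counter

# Hypothetical tallies for one image's 200 "eyes distance" responses.
tally = Counter({"average": 82, "close": 80, "far": 20,
                 "extra far": 10, "extra close": 8})

(top, top_n), (_, second_n) = tally.most_common(2)
margin = (top_n - second_n) / sum(tally.values()) * 100

api_label = "average"  # hypothetical Betaface API label for this image
print(top == api_label)         # True: majority matches the API label
print(f"{margin:.0f}% margin")  # 1% margin
```

A label that “wins” by two votes out of 200 is a very thin basis for calling anything ground truth.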

In Images 5, 6, 7, and 8, the majority label did not match the attribute label from Betaface API.

Image 5
Image 6
Image 7
Image 8

Only one image, Image 5, resulted in an over-75% majority, and that majority mismatched the label from Betaface API. Overall, the survey results show a general spread of opinions across all “eyes distance” categories.

The survey also showed a face image and its corresponding “eyes distance” label as classified by Betaface API for each of the five categories, and asked individuals whether they agreed or disagreed with the software’s measurement. Only two of the five images resulted in an over-70% majority; the rest were relatively split:

Figure 1

When asked how the “average” category was determined across attributes (and if there was any scientific research factored into this), the Betaface API CTO replied, “we have our own general average model where we fit faces to figure out relative sizes.”

Instead of looking at AI advancements in terms of “humans vs computers”, perhaps we should consider comparisons in terms of “humans vs another set of humans who curate the training data set that prescribes a software system’s predictive behavior”. However, only the perspectives of certain humans are being translated into how predictor models, now entering widespread use, classify face attributes for any given input face.

Automated face attribute recognition systems would certainly measure more consistently and save immense amounts of time. But by whose definitions of how our faces should be measured are our attributes being labelled? How will this labelled data be used? For the time being, I might prefer to go back to my mirror.


Links and References

[1] CelebA data set

[2] “MOON: A Mixed Objective Optimization Network for the Recognition of Facial Attributes”

[3] “Multi-task Deep Neural Network for Joint Face Recognition and Facial Attribute Prediction”

[4] CASIA-WebFace data set

[5] FaceScrub data set

[6] PubFig data set

[7] Amazon Mechanical Turk

[8] “Attributes and Simile Classifiers for Face Verification”

[9] Betaface API