Person Recognition: A Criminal Can Be Found By Linguistic Analysis Of Their Social Network Profile

The story began in 2011, when a hacker using the nickname ‘Hell’ broke into the email of the Russian opposition leader Alexei Navalny and made his correspondence public.

‘Hell’ had a personal blog and wrote in deliberately warped language to make de-anonymization harder. Although machine NLP (natural language processing) was not widely known at the time, it made sense to guard against future technologies that could give the police a clue.

Today, ten years later, that caution looks justified.

Linguistic And Semantic Files

Nowadays criminals are forced to avoid areas with facial recognition cameras in order to hide from law enforcement agencies. The most cautious even have to stop using messengers in order to stay in the shadows.

In a sense, the situation for criminals is getting worse: law enforcement agencies do not have video surveillance data from, say, the 2000s, but the text content of hacker forums from a decade ago is still stored on the net. Warped language is not a fundamental obstacle, since an NLP model can operate at the level of the words’ meaning regardless of spelling.

Recognizing Reddit.com Users With 87% Accuracy

Recognition accuracy surpassed our initial expectations, reaching 87%, more than enough for a proof of concept. Moreover, this figure can be raised further if desired, either extensively, with more data and additional model training, or intensively, with more advanced techniques.

To reach 80% accuracy, 600 sentences per person were sufficient. This is a drop in the ocean compared to the amount of text a single user generates in social networks and messengers.

Identifying a person becomes much easier because there is no need to search for a particular user among millions of social network users; the search can be narrowed to a topic of interest or to a particular part of the social graph.

Troll Factories, Fake News, And Total Loss Of Anonymity

Like any other technology, this one can be used for both good and evil. A great deal of text content is personalized and publicly available. Even more is available to the owners of social networks and messengers. In a nutshell, with NLP algorithms, anonymous stakeholder blogs can stop being anonymous.

The Technical Side. Speech2Vec And Cascade Models

The most common approaches represent text as vectors at the word level (word2vec) or at the sentence level (sentence2vec).

However, at this level it is still impossible to distinguish the unique traits of a person. To achieve significant accuracy, we had to widen the scale of the representation and build vector representations for very long sequences of text. In effect, it turned out to be ‘speech to vector’, so let’s call it speech2vec.
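The article does not say how its speech2vec vectors are computed. As a rough illustration of the general idea only, the sketch below pools word vectors into sentence vectors and sentence vectors into one vector per author. The hashed one-hot word_vector is a toy stand-in for a trained embedding, and DIM, sentence_vector, speech_vector and cosine are hypothetical names, not the team's code.

```python
import hashlib
import math
from typing import List

DIM = 64  # illustrative vector size, not taken from the article


def word_vector(word: str) -> List[float]:
    """Toy stand-in for a trained word embedding: a hashed one-hot vector."""
    digest = hashlib.md5(word.lower().encode("utf-8")).digest()
    vec = [0.0] * DIM
    vec[int.from_bytes(digest[:4], "big") % DIM] = 1.0
    return vec


def mean(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean; an empty input yields the zero vector."""
    if not vectors:
        return [0.0] * DIM
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def sentence_vector(sentence: str) -> List[float]:
    """Sentence level: average the vectors of the words in one sentence."""
    return mean([word_vector(w) for w in sentence.split()])


def speech_vector(sentences: List[str]) -> List[float]:
    """Author level: average the sentence vectors of everything one person wrote."""
    return mean([sentence_vector(s) for s in sentences])


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Comparing an unknown author's speech_vector against those of known accounts with cosine would then give a crude similarity ranking. A real system would presumably replace the toy embedding and the plain averaging with learned components.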

At the final stage of the classification task, a cascade of small models was used. In this case, the cascade is more efficient and more accurate than a single large model.
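The article does not describe how its cascade is wired. One common arrangement, shown here purely as an assumption, is an early-exit cascade: a cheap model answers first, and the input is passed to the next model only when confidence falls below a threshold. The two toy stages, their labels and the thresholds are all hypothetical.

```python
from typing import Callable, Sequence, Tuple

Prediction = Tuple[str, float]  # (author label, confidence in [0, 1])
Model = Callable[[str], Prediction]


def cascade_predict(text: str,
                    stages: Sequence[Tuple[Model, float]]) -> Tuple[str, float, int]:
    """Run stages in order and stop at the first confident one.

    Returns (label, confidence, index of the stage that answered).
    The last stage always answers, whatever its confidence.
    """
    if not stages:
        raise ValueError("cascade needs at least one stage")
    last = len(stages) - 1
    for index, (model, threshold) in enumerate(stages):
        label, confidence = model(text)
        if confidence >= threshold or index == last:
            return label, confidence, index
    raise AssertionError("unreachable: the last stage always returns")


def keyword_model(text: str) -> Prediction:
    """Hypothetical cheap first stage: confident only on a telltale word."""
    if "segfault" in text.lower():
        return ("user_a", 0.95)
    return ("user_b", 0.55)


def length_model(text: str) -> Prediction:
    """Hypothetical second stage: decides on message length."""
    label = "user_a" if len(text.split()) > 5 else "user_b"
    return (label, 0.8)


# Stage 1 must reach 0.9 confidence to answer; stage 2 is the fallback.
stages = [(keyword_model, 0.9), (length_model, 0.0)]
```

With this wiring, the later stages run only on inputs the earlier ones find hard, which is where the efficiency gain of a cascade comes from.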

The advantages of this approach have recently been confirmed by Google researchers.

The idea that the absence of a face on an avatar guarantees anonymity proves to be wrong. Weak AI eliminates the word ‘anon’ from this canon.

As proof, our team has prepared a simplified version of the dataset, code, and models so that anyone can reproduce the result and understand how it works.

Acrux Cyber Services, Data Science Team.

Lithuanian Cyber Security and AI solutions company