Steganalysis and Machine Learning: a European answer

Cybersecurity 08 July 2020

By Igino Corona, Matteo Mauri
Heading image edited and modified starting from "Tree in green wheat field" by Johann Siemens on Unsplash

Steganography is a mechanism for secretly encoding information within any medium of transmission. Its use has been known since ancient Greece, and the term was defined in glossaries towards the end of the fifteenth century. Both the encoding and the medium of transmission are secret, that is, known only to the parties who intend to communicate covertly. Steganography is therefore an ideal tool for creating secret communication channels that can be used in sophisticated espionage scenarios, computer crime, and data breaches in the public and private sectors.

Steganography differs from cryptography, in which the encoding of information and the medium of transmission are generally known (think, for example, of the HTTPS protocol used by this site). In cryptography, the encoding mechanism makes extracting the information extremely difficult without knowledge of additional data, known as encryption/decryption keys. These keys are known only to the parties authorized to communicate (for example, your browser and our web server).

The process of analyzing steganography is known as steganalysis. In its simplest form, this process aims to detect the presence of steganography in one or more transmission media; only at a later stage may it extract the hidden message.
The effectiveness of steganalysis techniques depends strictly on the degree of sophistication and "personalization" of the steganography techniques used by an opponent.

The simplest case is an opponent with little or no knowledge of steganography, who simply uses tools implemented and made available by others (off-the-shelf tools): in computer security such an opponent is often called a script kiddie.
In the digital field, many software tools implement steganography, and most of them combine it with cryptographic techniques. Examples of open-source software using both techniques are shown in Table 1.

Table 1: Examples of open-source steganography tools

Of course, off-the-shelf tools are also available to those who intend to perform steganalysis. When implementing steganography, each tool typically leaves (more or less implicitly) characteristic artifacts in the files it manipulates, which can be studied to build signatures (fingerprinting). These signatures can be used in the steganalysis phase to identify not only the presence of steganography but also the specific tool used, and to successfully extract the hidden contents [7,8]. Most steganalysis systems employ this approach [9].
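As a minimal sketch of the fingerprinting idea, the Python snippet below scans a file's bytes for tool-specific markers. The signature table is entirely hypothetical: the tool names `ToolA` and `ToolB` and their byte patterns are invented for illustration and are not real artifacts of the tools in Table 1. Real signatures come from studies such as [7,8] and are usually more subtle than a fixed byte string.

```python
# Hypothetical signature database: tool name -> byte pattern left in output files.
# These patterns are invented for illustration; they are NOT real tool artifacts.
SIGNATURES = {
    "ToolA": b"\x00STEGA\x00",
    "ToolB": b"\xde\xad\xbe\xef\x01",
}


def fingerprint(data: bytes) -> list:
    """Return the names of all tools whose signature appears in the data."""
    return sorted(name for name, pattern in SIGNATURES.items() if pattern in data)


clean = b"\xff\xd8\xff\xe0" + b"ordinary image payload"
tagged = clean + b"\x00STEGA\x00" + b"hidden"

print(fingerprint(clean))   # []
print(fingerprint(tagged))  # ['ToolA']
```

A match reveals both that steganography is likely present and which tool produced it, which is why an opponent who changes the implementation can escape this kind of check.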

It is easy to see that we are in a vicious circle (an "arms race") that points to increasing sophistication in the techniques and tools used both by those who intend to use steganography and by those who intend to unmask it and reveal its hidden contents. Of the two parties, the former generally has the advantage, since it can at any time change the means of transmission and/or the encoding of information to escape detection.

For example, an opponent could modify the implementation of a steganography tool to escape fingerprinting, or even implement totally new steganographic techniques. This of course has a cost (we are no longer dealing with script kiddies), but the cost can be reasonable given the motivations (e.g., the strategic/economic benefits for a cyber-espionage organization).
This situation is well known in the field of computer security: it is generally much easier to attack computer systems than to defend them. Malware instances constantly appear in "polymorphic" variants precisely to evade the detection mechanisms put in place by defenders (e.g., antimalware signatures).

In this scenario, machine learning may represent a sophisticated weapon at the service of those who intend to unmask steganography. Through machine learning techniques it is in fact possible to automatically develop a steganalysis model, starting from a set of samples with and/or without steganography.
Most of the proposed approaches use so-called two-class supervised learning (steganography present/absent), which requires samples both with and without steganography in order to automatically determine the statistical differences between them. This method is particularly useful for detecting variants of known steganographic techniques (e.g., implemented in new software) for which there are no signatures.
Examples of various algorithms based on supervised learning for the detection of steganography in images have been implemented in an open-source library called Aletheia [10].
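To make the two-class idea concrete, here is a deliberately tiny Python sketch. It is not Aletheia's API or any of its algorithms: the data is synthetic, the embedding is plain LSB replacement at full capacity, the single feature is a chi-square-style measure of imbalance between the counts of values 2k and 2k+1, and the "classifier" is just a threshold learned from labelled samples. All function names are assumptions made for this example.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible


def make_cover(n=20000):
    """Synthetic 'cover': 8-bit samples with a peaked (non-flat) histogram."""
    return [min(255, max(0, round(random.gauss(128, 4)))) for _ in range(n)]


def embed_lsb(samples):
    """Synthetic 'stego': overwrite every least significant bit with a random bit."""
    return [(v & ~1) | random.getrandbits(1) for v in samples]


def pair_imbalance(samples):
    """Feature: how unequal are the counts of values 2k and 2k+1?"""
    hist = Counter(samples)
    stat = 0.0
    for even in range(0, 256, 2):
        a, b = hist[even], hist[even + 1]
        if a + b:
            stat += (a - b) ** 2 / (a + b)
    return stat


def train(cover_feats, stego_feats):
    """Learn a 1-D decision threshold halfway between the two class means."""
    mean_cover = sum(cover_feats) / len(cover_feats)
    mean_stego = sum(stego_feats) / len(stego_feats)
    return (mean_cover + mean_stego) / 2


def predict(threshold, sample):
    """LSB replacement evens out each pair, so a low imbalance suggests stego."""
    return "stego" if pair_imbalance(sample) < threshold else "cover"


# Training phase: labelled samples with and without steganography.
train_covers = [make_cover() for _ in range(10)]
train_stegos = [embed_lsb(make_cover()) for _ in range(10)]
threshold = train([pair_imbalance(s) for s in train_covers],
                  [pair_imbalance(s) for s in train_stegos])

# Detection phase: unseen samples.
test_cover = make_cover()
test_stego = embed_lsb(make_cover())
print(predict(threshold, test_cover), predict(threshold, test_stego))
```

Real detectors use many features and far stronger classifiers, but the workflow is the same: the model learns the statistical difference from examples instead of relying on a hand-written signature.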

Signatures and supervised learning can provide good accuracy when it comes to detecting known steganography techniques and their variants, but they are subject to evasion by totally new techniques, for example techniques whose statistical profile differs significantly from that observed in the samples used in the learning phase.
For this reason, other studies [11,12] have instead proposed the use of unsupervised, anomaly-based learning techniques. This approach employs only samples in which steganography is absent, in order to automatically build a profile of normality. Anomalies ("outliers"), that is, deviations from this profile, can then be used to detect totally unknown steganographic techniques. To offer good accuracy, however, this approach must focus on features whose deviation from the norm is a reliable indication of steganography. Think, for example, of comparing the size declared in a file's header with the file's actual size.
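The header-size example can be sketched in a few lines of Python. The sketch assumes BMP-like files and models only the 14-byte BMP file header, which stores the declared file size as a little-endian 32-bit integer at byte offset 2. The helper names (`build_bmp`, `size_gap`, `fit_profile`, `is_anomalous`) are hypothetical, and the "normal profile" is reduced to the largest gap seen on clean samples; it illustrates the one-class workflow and is not a method taken from [11,12].

```python
import struct


def build_bmp(payload: bytes) -> bytes:
    """Build a minimal BMP-like blob: 14-byte file header followed by raw payload."""
    offset = 14
    size = offset + len(payload)
    # Signature, declared file size, two reserved fields, offset to the payload.
    header = b"BM" + struct.pack("<IHHI", size, 0, 0, offset)
    return header + payload


def size_gap(data: bytes) -> int:
    """Feature: actual length minus the file size declared in the header."""
    if len(data) < 14 or data[:2] != b"BM":
        raise ValueError("not a BMP file header")
    (declared,) = struct.unpack_from("<I", data, 2)
    return len(data) - declared


def fit_profile(clean_samples):
    """Normal profile learned only from clean files: the largest gap observed."""
    return max(abs(size_gap(s)) for s in clean_samples)


def is_anomalous(data: bytes, profile: int, margin: int = 0) -> bool:
    """Flag files whose gap deviates from the profile by more than the margin."""
    return abs(size_gap(data)) > profile + margin


clean = [build_bmp(bytes(n)) for n in (16, 64, 256)]
profile = fit_profile(clean)                 # 0 for well-formed files
suspicious = clean[0] + b"appended secret"   # data hidden after the declared end

print(is_anomalous(clean[1], profile))    # False
print(is_anomalous(suspicious, profile))  # True
```

No stego sample is needed to build the profile, which is the point of the anomaly-based approach. The check only catches techniques that disturb this particular feature, so the choice of features remains the critical step.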

Since each steganalysis technique has its own strengths, a combination of signatures, supervised learning and unsupervised learning is often useful [12]. This is exactly one of the objectives of a strategic project funded by the European Commission, called SIMARGL - Secure intelligent methods for advanced recognition of malware, stegomalware & information hiding methods (Grant Agreement n° 833042 - www.simargl.eu).

The project, with a total budget of 6 million euros, aims to create advanced steganalysis systems for the detection of (stego)malware, malicious software increasingly used by cybercriminals and nation states in espionage operations. In this project, major international players such as Airbus, Siveco, Thales, Orange Cert and FernUniversität (project coordinator) work alongside three "Italians" to fight stegomalware. Pluribus One participates as software provider and developer, making two solutions available: Attack Prophecy, an advanced system for the detection and protection of web applications based on (adversarial) machine learning algorithms, and AIsafe DNS, a comprehensive solution for the prevention and detection of endpoint threats, covering a wide range of threats from malware to phishing. CNR, Genoa Unit, contributes energy-aware detection algorithms based on artificial intelligence. Numera, an ICT company based in Sassari, will submit some of its systems for the credit market to the "screening" of SIMARGL.
In total, 14 international partners from 7 countries (Netzfactor, ITTI, Warsaw University, IIR, RoEduNet and the Stichting CUIng Foundation also participate in the consortium) will field artificial intelligence, sophisticated products already available, and machine learning algorithms currently being improved, in order to propose an integrated solution capable of handling different scenarios and acting at different levels: from monitoring network traffic to detecting bits concealed within images.
The challenge of the SIMARGL project has just begun and will provide concrete answers to the problem of stegomalware in the next two years: the project will end in April 2022.

It is important to emphasize that machine learning (and, more generally, artificial intelligence) is a neutral technology, like many others. Specifically, it is dual-use [13] and is not inherently a force for good. In principle, machine learning can also be used to develop more sophisticated, polymorphic, data-driven steganographic techniques.
Let's get ready, because this scenario could represent the future of cyber threats (and perhaps a piece of the future is already present).

[1] Xiao Steganography, https://www.softpedia.com/get/Security/Encrypting/Xiao-Steganography.shtml

[2] Image Steganography, https://archive.codeplex.com/?p=imagesteganography

[3] Steghide, http://steghide.sourceforge.net/download.php

[4] SSuite Picsel, https://www.ssuitesoft.com/ssuitepicselsecurity.htm

[5] Stego Magic, https://www.gohacking.com/hide-data-in-image-audio-video-files-steganography/

[6] Open Puff, https://embeddedsw.net/OpenPuff_Steganography_Home.html

[7] Pengjie Cao, Xiaolei He, Xianfeng Zhao, Jimin Zhang, Approaches to obtaining fingerprints of steganography tools which embed message in fixed positions, Forensic Science International: Reports, Volume 1, 2019, 100019, ISSN 2665-9107, https://doi.org/10.1016/j.fsir.2019.100019

[8] Chen Gong, Jinghong Zhang, Yunzhao Yang, Xiaowei Yi, Xianfeng Zhao, Yi Ma, Detecting fingerprints of audio steganography software, Forensic Science International: Reports, Volume 2, 2020, 100075, ISSN 2665-9107, https://doi.org/10.1016/j.fsir.2020.100075

[9] Gary C. Kessler, An Overview of Steganography for the Computer Forensics Examiner, https://www.garykessler.net/library/fsc_stego.html

[10] Aletheia, https://github.com/daniellerch/aletheia

[11] Jacob T. Jackson, Gregg H. Gunsch, Roger L. Claypoole, Jr., Gary B. Lamont, Blind Steganography Detection Using a Computational Immune System: A Work in Progress, International Journal of Digital Evidence, Winter 2003, Issue 1, Volume 4

[12] Brent T. McBride, Gilbert L. Peterson, Steven C. Gustafson, A new blind method for detecting novel steganography, Digital Investigation, Volume 2, Issue 1, 2005, Pages 50-70, ISSN 1742-2876, https://doi.org/10.1016/j.diin.2005.01.003

[13] Fabio Roli, Matteo Mauri, Artificial Intelligence: past, present and future. Part II - The Good, the Bad and the Ugly, AI & Cybersecurity insights: Pluribus One Blog, September 2019, https://www.pluribus-one.it/company/blog/81-artificial-intelligence/76-good-bad-ugly-in-ai

[14] Matteo Mauri, Igino Corona, Davide Ariu, What is Stegomalware? Information hiding-capable malware and the European answer: the SIMARGL project, AI & Cybersecurity insights: Pluribus One Blog, March 2020, https://www.pluribus-one.it/company/blog/84-cybersecurity/83-stegomalware
