Machine learning for network intrusion detection is an area of ongoing and active research (see the references in [1] for a representative selection), but nearly all results in this area are empirical in nature. Despite the significant amount of work that has been performed in this area, very few such systems have received anything approaching the widespread support or adoption of manually configured systems such as Bro [2] or Snort [3]. As discussed in [1], several differences between more conventional applications of machine learning and machine learning for network intrusion detection make intrusion detection a challenging domain: the overwhelming class imbalance (see [4] for a detailed discussion of this issue), the high asymmetry in misclassification costs, the difficulty of evaluating the performance of an intrusion detection system, and the constantly changing nature of network attacks.
These arguments, however, largely stem from empirical observations about the distribution and impact of attacks; they do not address a more fundamental question: whether it is possible, from a theoretical perspective, to use machine learning to construct a general and effective intrusion detection system. This is the sort of fundamental question that we seek to answer with a “fundamental science of cybersecurity” [5], which has so far proved somewhat elusive. In this article, we collect results from the field of grammatical inference to show that the use of machine learning for intrusion detection via deep packet inspection is inherently limited and cannot be expected to generalize effectively, even if the proposed algorithm can be shown to perform well on some particular task.
Network inputs, file formats, and other automatically generated input structures (henceforth: ‘messages’) are by construction governed by formal grammars: each message is both produced and interpreted according to a fixed set of rules. If we wish to identify a particular class of message as either “safe” or “malicious”, we are implicitly attempting to learn a recognizer for the (perhaps merely implicit) formal grammar underlying that class of message format or protocol. This places the problem firmly within the realm of grammatical inference, a field with a deep theoretical literature. And indeed, we can find strong parallels, if not outright reductions, between many commonly encountered problems in machine learning for intrusion detection and problems in grammatical inference.
In the following, we present brief background on grammatical inference, and examine several common approaches to machine learning for network intrusion detection through the lens of computational complexity. In particular, these results suggest that general-purpose machine learning approaches to network intrusion detection are – from first principles – not computationally tractable, and in fact may be attempting to solve problems that range from NP-complete to cryptographically hard.
Key Concepts from Formal Language Theory and Grammatical Inference
The key concepts from formal language theory that we will require are “languages”, “grammars”, and “recognizers”.
Grammars are defined as tuples G=(N,Σ,R,i), where N is a set of nonterminal symbols, Σ is a set of terminal symbols (disjoint from N), R is a set of production rules indicating which symbols in a string can be replaced with which other symbols (the ‘left’ and ‘right’ sides of the rule, respectively), and i∈N is an initial symbol. A “production” of a grammar is a sequence of rule applications to a string, beginning from i, until it consists of only terminal symbols; the (typically infinite) set of all productions of a grammar is the language L produced by the grammar, often denoted L(G). A recognizer for a grammar is an algorithm which takes a sequence s as input, and either returns “accept” if s∈L(G), or “do not accept” if s∉L(G).
The various restrictions that are placed on the production rules place the grammar into one class or another. The most commonly encountered classification of grammars – the Chomsky hierarchy [6] – classifies grammars in order of increasing complexity as regular, context-free, context-sensitive, and unrestricted, with each class contained in the following one, having successively fewer restrictions on the production rules, and requiring a successively more powerful model of computation to recognize. Regular grammars, for instance, may only have rules in which the left side contains a single nonterminal symbol, and the right side contains an empty string, a terminal symbol, or a terminal symbol followed by a nonterminal symbol. The Chomsky hierarchy is of particular interest because each class within it corresponds precisely to the model of computation required to build a recognizer for a language in that class. Regular grammars, for instance, can be recognized by finite automata, while context-free grammars require the addition of a stack before a recognizer can be constructed for them. Context-free grammars are of particular importance because Backus–Naur form, which is frequently used to describe network protocols and occasionally file formats, is itself a notation for context-free grammars.
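To make the correspondence between regular grammars and finite automata concrete, the following sketch implements a recognizer for a toy regular language – binary strings ending in “01”, an illustrative choice not drawn from any real protocol:

```python
# A finite automaton recognizing the (toy) regular language of binary
# strings that end in "01".
# States: 0 = start, 1 = last symbol was '0', 2 = string ends in "01".
TRANSITIONS = {
    (0, '0'): 1, (0, '1'): 0,
    (1, '0'): 1, (1, '1'): 2,
    (2, '0'): 1, (2, '1'): 0,
}
ACCEPTING = {2}

def recognize(s: str) -> bool:
    """Return True iff s is in the language ("accept")."""
    state = 0
    for symbol in s:
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING
```

The recognizer runs in a single left-to-right pass with constant memory, which is exactly the computational power that regular grammars require.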
Grammars and recognizers then fix a set of production/recognition rules, and produce decisions about strings. Grammatical inference (see [7] for a broad review) tackles the inverse problem: given some set of data relating to an unknown grammar, can we produce decisions about properties of that grammar?
The data available for training may include only strings {s : s ∈ L} in the language generated by the grammar – a “positive presentation” – or a set of strings S = {s+ : s+ ∈ L} ∪ {s− : s− ∈ Σ* \ L} combined with labels ℓ : S → {0,1} which indicate whether s ∈ L (a “complete presentation”). The available data can in some cases also be obtained dynamically: the learning algorithm may have access to additional information, such as an oracle that, when given a conjectured grammar G′, will either confirm its correctness if L(G′) = L(G), or provide a counterexample s ∈ L(G) △ L(G′) (the symmetric difference of the two languages) otherwise (“learning from a teacher”).
One of the earliest formalizations of what it means to “learn” a grammar is “learning in the limit” [8], in which (informally) the learning algorithm requires only a finite amount of data to produce a correct answer, from which it never thereafter deviates. Note that this is not a particularly strong model of learning: the amount of data required can be arbitrarily large so long as it is finite, and it places no restriction on the size of the learned automaton, which need not be minimal or even bounded. More realistic models of learning include the “Probably Approximately Correct” (PAC) framework [9]: rather than requiring L(G′) = L(G) always, we require only that, over some distribution D of strings and some distribution T of training data, P_T(P_D(s ∈ L(G′) △ L(G)) < ε) ≥ 1 − δ; in other words, the two grammars disagree on at most some bounded proportion of strings (i.e., our proposed grammar is “approximately correct”), and this “probably” occurs, in that such an approximately correct grammar is learned on at least a 1 − δ fraction of all possible training sets.
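The PAC disagreement term P_D(s ∈ L(G′) △ L(G)) can be estimated empirically by sampling; the sketch below does this for a pair of toy languages over a toy distribution (both languages and the sampling distribution are illustrative assumptions, not real grammars):

```python
import random

# Estimating the PAC disagreement rate P_D(s in L(G') Δ L(G)) for two
# toy languages, with D taken as uniform over binary strings of length
# at most 8 (an illustrative assumption).

def in_target(s):      # L(G): strings containing "11"
    return "11" in s

def in_hypothesis(s):  # L(G'): strings ending in "1"
    return s.endswith("1")

def sample_string(rng, max_len=8):
    n = rng.randint(0, max_len)
    return "".join(rng.choice("01") for _ in range(n))

def estimated_error(n_samples=10000, seed=0):
    rng = random.Random(seed)
    disagreements = sum(
        in_target(s) != in_hypothesis(s)
        for s in (sample_string(rng) for _ in range(n_samples))
    )
    return disagreements / n_samples
```

A hypothesis is ε-approximately correct under this distribution when the estimated rate (with enough samples) falls below ε; the PAC guarantee additionally demands that this happen for at least a 1 − δ fraction of training sets.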
Regardless of the learning model, the general results are not encouraging. One of the earliest cases considered was examined by Gold [8]: that of learning in the limit from “presentations” as described above, where the learner is provided with data, cannot form its own queries, and must eventually converge to an equivalent grammar (it may make only finitely many mistakes). In this setting of learning from examples, Gold’s Theorem [8] demonstrates that even regular languages cannot be learned from only positive examples (indeed, any class that contains all finite languages and at least one infinite language cannot be inferred from a positive presentation). Given a finite set of both positive and negative examples, the work of [10] shows that deciding whether there is an n-state deterministic finite automaton (DFA) that agrees with the finite data set is NP-complete. Through a reduction to Boolean formulas, the work of [11] shows that PAC learning of DFAs from both positive and negative examples – the standard “supervised learning” setting perhaps more familiar to those with a background in machine learning – is at least as hard as factoring Blum integers, and so under current cryptographic assumptions does not appear to be tractable in polynomial time. As DFAs occupy one of the lowest levels of the Chomsky hierarchy, it immediately follows that any grammar requiring more power than a DFA offers must be at least as hard as a DFA to learn.
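A naive exact approach to the DFA-consistency decision problem is exhaustive search: over a binary alphabet there are n^(2n) transition functions and 2^n accepting sets for an n-state machine. The sketch below (start state fixed at 0) makes the combinatorial blow-up concrete:

```python
from itertools import product

# Brute-force search for an n-state DFA (start state 0, binary alphabet)
# consistent with a labeled sample. The n^(2n) * 2^n candidate space
# makes this approach collapse quickly as n grows.

def run(delta, accepting, s):
    state = 0
    for ch in s:
        state = delta[(state, ch)]
    return state in accepting

def consistent_dfa_exists(sample, n):
    """sample: list of (string, label) pairs; n: number of states."""
    states = range(n)
    keys = [(q, ch) for q in states for ch in "01"]
    for targets in product(states, repeat=len(keys)):      # n^(2n) transition functions
        delta = dict(zip(keys, targets))
        for acc_bits in product([False, True], repeat=n):  # 2^n accepting sets
            accepting = {q for q in states if acc_bits[q]}
            if all(run(delta, accepting, s) == lab for s, lab in sample):
                return True
    return False
```

For a sample labeled by “ends in 1”, no one-state machine can separate the labels, while a two-state machine tracking the last symbol suffices; the NP-completeness result of [10] says that no algorithm is known to do fundamentally better than such search in the worst case.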
Active learning – in which the learning algorithm can form some kind of query to narrow its search – provides slightly more positive results. The closest counterpart to the notion of active learning encountered in (non-grammatical-inference-oriented) machine learning is the model of an “informant”: the algorithm is allowed to construct query strings, and has access to an oracle which answers membership queries for proposed strings. When learning with an informant, context-sensitive grammars may be learned in the limit [8]. The more powerful model of a “teacher” as described above (an informant who will also provide counterexamples to proposed grammars) allows for yet more powerful inference: under the model of a teacher, some context-free grammars may be learned in polynomial time with respect to the number of states in the underlying automaton [12]. Finally, the use of “simple examples” as training data allows minimal DFAs to be learned in polynomial time [13]; however, simple examples are – informally – training items carefully chosen to guide the learning algorithm to the correct answer, and so, in addition to being specific to the learning algorithm and the grammar, they do not correspond to a more conventional learning setting in which the distribution of the input data cannot be controlled.
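The teacher model can be sketched as elimination over a finite hypothesis class: the learner conjectures a language, and the teacher either confirms it or returns a string on which conjecture and target disagree, after which every hypothesis misclassifying that counterexample is discarded. In the sketch below, hypotheses are simple string predicates standing in for grammars, and the teacher checks equivalence only on a finite probe set – both simplifying assumptions of this illustration:

```python
# A sketch of "learning from a teacher": the learner conjectures a
# language; the teacher either confirms it or returns a counterexample
# from the symmetric difference of conjecture and target.

TARGET = lambda s: s.endswith("1")   # the unknown language L(G)

HYPOTHESES = {
    "contains_1": lambda s: "1" in s,
    "ends_in_1": lambda s: s.endswith("1"),
    "all_strings": lambda s: True,
}

def teacher(conjecture, probe_strings):
    """Return None if the conjecture agrees with the target on all
    probes, else a counterexample. (A real teacher decides exact
    language equivalence, not agreement on a finite probe set.)"""
    for s in probe_strings:
        if conjecture(s) != TARGET(s):
            return s
    return None

def learn(probe_strings):
    candidates = dict(HYPOTHESES)
    while candidates:
        name, h = next(iter(candidates.items()))
        cex = teacher(h, probe_strings)
        if cex is None:
            return name
        # Discard every hypothesis that misclassifies the counterexample.
        candidates = {n: g for n, g in candidates.items()
                      if g(cex) == TARGET(cex)}
    return None
```

Each counterexample can eliminate many hypotheses at once, which is the essential reason the teacher model admits polynomial-time results where passive learning does not.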
In addition to variations in the input available to the algorithm, various restrictions on the class of grammars to which the unknown language may belong have also been considered. The class of pattern languages [14] can be shown to be learnable in the limit from positive data due to the property of “finite thickness” [15] – loosely, that no string is found in infinitely many members of the class, and therefore for any given subset of grammars in the class there must be some production which is unique to it – which is shown to be sufficient for identification in the limit. While pattern languages themselves are difficult to apply to many empirical problems in intrusion detection [16], the notion of finite thickness may in many cases be applicable. A related notion is that of elasticity [17]: a class of languages has “infinite elasticity” if and only if one can find an infinite sequence of productions such that, for all n, the first n productions are found in some grammar in the class while production n+1 is not; if a class does not have infinite elasticity, it is said to have finite elasticity, and it can again be inferred in the limit from positive data. Finally, if a class of grammars has a family of “characteristic sets” (‘condition 1’ in [15]) such that each language in the class has an associated finite characteristic set, and showing that the characteristic set for language i exists within language j is sufficient to show that language i is a subset of language j, then languages from that class are learnable in the limit.
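Membership in a pattern language can be decided with a short backtracking matcher. In the sketch below, lowercase characters are constants and uppercase characters are variables that must be replaced by nonempty strings, with repeated variables receiving the same substitution; this notation is an assumption of the illustration, not part of [14]:

```python
# Membership test for a simple pattern language: a pattern mixes
# constants (lowercase) with variables (uppercase), and a string is in
# the language iff some substitution of nonempty strings for the
# variables (identical for repeated variables) yields it.

def matches(pattern, s, bindings=None):
    bindings = dict(bindings or {})
    if not pattern:
        return not s
    head, rest = pattern[0], pattern[1:]
    if head.islower():                      # constant: must match literally
        return bool(s) and s[0] == head and matches(rest, s[1:], bindings)
    if head in bindings:                    # bound variable: must repeat its value
        v = bindings[head]
        return s.startswith(v) and matches(rest, s[len(v):], bindings)
    for i in range(1, len(s) + 1):          # unbound variable: try every nonempty prefix
        bindings[head] = s[:i]
        if matches(rest, s[i:], bindings):
            return True
        del bindings[head]
    return False
```

The repeated-variable constraint is what gives the class its structure: for example, the pattern "XaX" accepts "bab" but rejects "bac", since both occurrences of X must denote the same string.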
Taken as a whole, these results do not paint a particularly positive picture of grammatical inference. Even the simplest classes of grammars we might encounter have extremely poor “learnability,” and rendering them easier to learn often requires some degree of interactivity or side information. Significant restrictions on the class of grammars over which we wish to perform inference allow for substantial reductions in complexity; however, for novel protocols we have little guarantee that we may rely on such restrictions. In the following section, we briefly examine major classes of machine learning that have been proposed for intrusion detection, and relate the results in this section to the practical problems posed by such learning systems.