Machine Learning and Intrusion Detection
The literature on machine learning and intrusion detection is vast (see the references in [1] for a partial overview, and the short reviews [18] and [19] for more detail on the specific machine learning methods that have been attempted); however, it divides broadly into two main categories: “anomaly detection” and “signature inspired” [20] (in machine learning terminology, “supervised learning”). Anomaly detection, including such systems as Anagram [21] and McPAD [22], focuses on constructing a model of “normal” traffic and producing alerts when traffic that does not fit this model is observed. Supervised learning systems (see [23], [24], and [25] for representative examples) are provided with both malicious and benign traffic, and attempt to learn rules to distinguish them. While less common in the domain of intrusion detection, active learning (i.e., interactive) approaches to outlier detection have been presented as well, as in [26].
In all cases, the general formulation of the problem is approximately the same. Network messages are, by construction, designed to be parsed and interpreted by a machine, and hence can be characterized by formal grammars that accept or reject specific strings. Within the space M of all possible strings received by the network service, there is the subset A⊆M of messages that are accepted by the network endpoint; this subset itself contains the subset S⊆A of “safe” messages that represent normal use of the endpoint. In learning a function that identifies which subset a given message lies within (in particular, for some string s, whether or not s∈S), we are constructing a recognizer for S, thus placing the problem exactly within the domain of grammatical inference.
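As a minimal sketch of this nesting, consider a toy protocol invented purely for illustration (it is not taken from any of the cited systems): A is assumed to be the set of strings of the form "GET /<path>", and S the subset whose path contains no ".." segment. The names `in_A`, `in_S`, and `region` are hypothetical helpers.

```python
import re

# Hypothetical toy protocol (an assumption for illustration only):
#   A = strings of the form "GET /<path>"
#   S = members of A whose path has no ".." path segment
ACCEPTED = re.compile(r"GET /[A-Za-z0-9./_-]*")
UNSAFE_SEGMENT = re.compile(r"(^|/)\.\.(/|$)")


def in_A(message: str) -> bool:
    """True if the toy endpoint would accept the message."""
    return ACCEPTED.fullmatch(message) is not None


def in_S(message: str) -> bool:
    """True if the message is accepted and also counts as 'safe'."""
    if not in_A(message):
        return False
    path = message[len("GET "):]
    return UNSAFE_SEGMENT.search(path) is None


def region(message: str) -> str:
    """Name the innermost region of M that the message lies in."""
    if in_S(message):
        return "S"
    if in_A(message):
        return "A\\S"   # accepted by the endpoint, but not safe
    return "M\\A"       # not accepted by the endpoint at all


for msg in ["GET /index.html", "GET /../etc/passwd", "HELO"]:
    print(repr(msg), "->", region(msg))
```

Here the recognizer for S is written by hand; the learning problem discussed in this section is that of recovering such a recognizer from examples alone.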
Consider, for example, the case of a standard machine learning approach in which we are presented with samples of normal traffic from S and hostile traffic from the set A\S, each appropriately labeled, and we wish to train a function to determine whether a candidate string lies in one set or the other. This is precisely the problem of learning a grammar from a complete presentation, and as such we may readily apply existing results. Even if we make the simplifying assumption that the protocol under consideration (or the union of protocols, if deployed on a multi-protocol endpoint) is regular, we are still attempting to learn a Deterministic Finite Automaton (DFA) from a complete presentation. If we wish to learn it precisely (in the limit), then the problem is NP-complete [10]. If we wish to learn it in a more practical sense (i.e., PAC), then the identification problem is “merely” cryptographically hard [11]. This forces us to accept the conclusion that even if we obtain empirically good performance for a particular algorithm in a particular setting, we cannot be sure that it will generalize to a new domain.
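The gap between fitting a labeled sample and generalizing from it can be illustrated with a prefix tree acceptor, a standard construction in grammatical inference. The sketch below is an illustration only, not a method from the cited works; `build_pta` and `accepts` are hypothetical names, and the sample strings are invented. The resulting automaton is consistent with every labeled example, yet it simply memorizes the positive sample and rejects every unseen string.

```python
def build_pta(positive, negative):
    """Build a prefix tree acceptor consistent with a labeled sample.

    Returns (transitions, accepting), where transitions maps
    (state, symbol) -> state and state 0 is the root.
    """
    if set(positive) & set(negative):
        raise ValueError("sample is contradictory")
    transitions = {}
    accepting = set()
    next_state = 1
    for word in positive:
        state = 0
        for symbol in word:
            key = (state, symbol)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        accepting.add(state)
    # Negative examples need no states of their own: every string that
    # is not a positive example is rejected by construction.
    return transitions, accepting


def accepts(pta, word):
    """Run the acceptor on a word; missing transitions mean rejection."""
    transitions, accepting = pta
    state = 0
    for symbol in word:
        state = transitions.get((state, symbol))
        if state is None:
            return False
    return state in accepting


# Invented sample: positives drawn from (ab)+, negatives from outside it.
pta = build_pta(positive=["ab", "abab"], negative=["ba", "a"])
print(accepts(pta, "abab"))    # a memorized positive example
print(accepts(pta, "ababab"))  # an unseen member of (ab)+
```

The second call returns False even though "ababab" belongs to the language the sample was drawn from: consistency with a finite sample is easy to achieve, but it says nothing about behavior on new strings.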
If we consider (as in McPAD [22]) that only positive (‘normal’) data from S is available, and continue to assume that we are observing a regular language, then we are attempting to learn a grammar from its positive presentation, with all associated complexities. While specific examples of grammars are clearly at least PAC-learnable in this setting, as shown by the results of [22], it follows immediately from the difficulty of learning from a positive presentation that McPAD must fail to generalize to at least some classes of grammars; whether or not those grammars are of practical relevance to intrusion detection can only be decided empirically. We are thus left with no foundational guarantee of correctness, only empirical observations.
When we consider a more realistic scenario in which several protocols may be present in the same set of network traffic, the problem often becomes significantly more difficult: learning the mixture of grammars is at least as hard as learning the most complex one and, depending upon the closure properties of that class, may in fact be more difficult. While most language classes in the Chomsky hierarchy are closed under union, it is often not clear whether restricted classes (such as those that have finite thickness) are.
While somewhat outside the realm of network intrusion detection, more powerful inference models such as learning from an informant have been shown to generate positive results in areas such as developing usable models of program execution (stateful typestates) [27]. This approach obtains a direct generative model of program outputs which can be examined for various security properties. Standard fuzzing techniques [28] are perhaps a more direct application within the security domain: a subset C⊆A of inputs that lead to crashes is learned from interaction with a program, frequently combined with additional information about the execution of code paths [29]. However, these methods do not typically produce formal descriptions or generative models of incorrect inputs; rather, they seek to enumerate a useful subset of them, typically (in defensive settings) with the aim of reducing the size of the set A. The work of [30] explores methods to leverage attacker knowledge in constructing fuzzing inputs via a descriptive language, which could be used in an iterative fashion to eventually describe a subset of the target grammar.
In many cases, we do not even require the results of grammatical inference to show that a particular classifier provably cannot learn a sharp distinction between malicious and benign traffic. A key step in any machine learning process is that of ‘feature extraction’, in which the raw data to be classified is converted into some numerical representation that can then be operated on by the learning algorithm. N-grams (and minor variations on the concept, such as skip-grams) are the core feature representation used in a number of anomaly-based intrusion detection systems, including Anagram [21] and McPAD [22]: every n-byte substring within the payload is tabulated (for instance, the string “learning” would have 3-grams of “lea”, “ear”, “arn”, “rni”, and so on).
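The tabulation step can be sketched in a few lines. This is a generic sliding-window count, not the implementation used by Anagram or McPAD; the function name `ngrams` is a hypothetical choice.

```python
from collections import Counter


def ngrams(payload: bytes, n: int) -> Counter:
    """Count every contiguous n-byte substring of the payload."""
    # A payload shorter than n yields an empty range, hence no n-grams.
    return Counter(payload[i:i + n] for i in range(len(payload) - n + 1))


counts = ngrams(b"learning", 3)
print(list(counts))
# [b'lea', b'ear', b'arn', b'rni', b'nin', b'ing']
```

A payload of length k yields k − n + 1 overlapping n-grams, so the eight-byte string above produces six 3-grams.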
However, such representations can be shown to be insufficiently powerful to distinguish between many members of the class of regular languages. For example, the rather trivial regular languages (ab)^x(ba)^y(ab)^z and (ba)^x(ab)^y(ba)^z (with x, y, z ≥ 1) cannot be distinguished from each other on the basis of 2-grams (note that the 2-grams bb and aa both appear exactly once in each, with variable numbers of ab and ba tokens), while constructing a recognizing DFA for each is trivial. Similar counterexamples can be constructed for n-grams of arbitrary length. This immediately implies that any learning algorithm that first reduces a sequence to n-gram counts is a priori incapable of learning large subsets of the class of regular grammars, and as a consequence we may expect that, even if empirically good performance is shown on some data sets, this performance cannot be relied upon to generalize.
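The counterexample can be checked directly. The sketch below takes the two languages to be (ab)^x(ba)^y(ab)^z and (ba)^x(ab)^y(ba)^z with x, y, z ≥ 1, which is the reading under which aa and bb each occur exactly once, and it compares the set of distinct 2-grams that occur (presence or absence); it does not examine exact counts. Regular expressions stand in for the recognizing DFAs, and all names are hypothetical.

```python
import re


def bigram_set(s: str) -> set:
    """Set of distinct 2-grams occurring in s (presence only, no counts)."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


# Regular expressions standing in for the two recognizing DFAs.
L1 = re.compile(r"(ab)+(ba)+(ab)+")
L2 = re.compile(r"(ba)+(ab)+(ba)+")

u = "ab" * 2 + "ba" * 1 + "ab" * 3   # a member of the first language
v = "ba" * 1 + "ab" * 2 + "ba" * 1   # a member of the second language

# Both strings contain exactly the same distinct 2-grams.
print(sorted(bigram_set(u)), sorted(bigram_set(v)))

# The recognizers nevertheless separate them.
print(bool(L1.fullmatch(u)), bool(L1.fullmatch(v)))
print(bool(L2.fullmatch(u)), bool(L2.fullmatch(v)))
```

Every string in the first language begins with a and every string in the second with b, so the languages are disjoint even though their members share the same 2-gram vocabulary.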
Other feature representations are also used. Perhaps most widely known are those of the (now severely out-of-date but still regularly studied) KDD’99 data set [31], which parses the network traffic to extract a number of features suggested by expert knowledge to be of use. In addition to “metadata” describing flow-based properties such as the duration of the connection and the number of concurrent connections to the same host, a number of content-based features are extracted from both the payload of the packets (e.g., the presence of a shell prompt, extracting FTP commands to obtain a tally of outbound ones, etc.) and the headers of the packets (identifying the network protocol, various ports and addresses, protocol-specific flags, and so on). These features obviously make no attempt to model any significant portion of the content of the packets, and so inferring a grammar from them is infeasible; at best, some of the manually extracted features act as “telltales” for specific attacks, and thus allow what is effectively signature-based detection.
Indeed, the most effective current approach in intrusion detection remains (anecdotally at least) signature-based solutions [1] such as Snort [3]. The effectiveness of such solutions can be explained precisely within the context of grammatical inference: a well-written content-based signature is equivalent to a production that is not (or is very rarely) a production of the grammar underlying “good” traffic, and hence forms a telltale for the set A\S. Snort and Bro [2] both contain sophisticated pattern-matching rules that are capable of recognizing a wide range of malicious traffic, in effect acting as small recognizers for subsets of the malicious grammars under consideration. It is worth noting, however, that the rule generation in this case is often done by hand, and even when automated it typically attempts to match a finite subset of malicious traffic for a specific attack and is then tested on a larger set of normal traffic to assess false positives; this is equivalent to the bounded version of the problem posed in [10], which can be solved in polynomial time. A key distinction here is that, rather than attempting to model all ‘safe’ or ‘malicious’ productions, any method that produces some form of signature is attempting to model a finite number of productions of a single protocol under heavily supervised conditions, and so does not address the question of novel attacks that machine learning-based solutions are often attempting to address [1].
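The workflow described above (match a finite set of attack samples, then check a larger body of normal traffic for false positives) can be sketched as follows. The signature, the sample messages, and the function name `evaluate` are all invented for illustration; this is not a Snort or Bro rule, and the data are not drawn from any real capture.

```python
import re

# Hypothetical signature: a directory-traversal telltale in the payload.
SIGNATURE = re.compile(rb"\.\./")

# Invented samples, for illustration only.
attack_samples = [
    b"GET /../../etc/passwd",
    b"GET /cgi-bin/../../bin/sh",
]
normal_traffic = [
    b"GET /index.html",
    b"GET /img/logo.png",
    b"GET /docs/v1.2/readme.txt",
]


def evaluate(signature, attacks, normal):
    """Return (attacks matched, false positives on normal traffic)."""
    detected = sum(1 for msg in attacks if signature.search(msg))
    false_positives = sum(1 for msg in normal if signature.search(msg))
    return detected, false_positives


detected, false_positives = evaluate(SIGNATURE, attack_samples, normal_traffic)
print(f"matched {detected}/{len(attack_samples)} attacks, "
      f"{false_positives} false positives")
```

The check involves only a fixed, finite sample on each side, which is what distinguishes it from the open-ended learning problems discussed earlier in this section.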