Cybersecurity Data Gap
Network and host-based sensors collect data that are foundational for current-day cybersecurity technologies such as intrusion detection and prevention systems. However, for cybersecurity incidents, these data tell only part of the story. Lacking are the data from the inside view (or attacker perspective), including specific attacker actions, tools used, and strategies. Availability of such data will lead to technologies that provide decision support, perform automated security testing, and strengthen intrusion detection systems.
Technologies for Data Acquisition
Cybersecurity-related incidents are an unfortunate, yet common, occurrence today. Networks typically collect activity traces during such incidents. As examples, Microsoft provides an application programming interface (API) for Windows Event Log and Windows Event Tracing, as well as a suite of tools to collect and view these data [1]. Linux and Macintosh operating systems (Mac OS) have similar mechanisms with, for example, syslog, logger, and Snoopy [2]. Wireshark, tshark, and tcpdump are widely used sniffers for collecting network traffic. Analysis engines such as the Bro, Snort, Open Source HIDS Security (OSSEC), and HBSS Intrusion Detection Systems (IDS) monitor data and issue alerts for potentially malicious activities [3]. Very often, however, such data are not shared among the cybersecurity community due to their sensitivity. Most cyber datasets available to the public are collected during Capture the Flag (CTF) competitions. DEFCON [4] has hosted yearly CTF events for over 20 years. After each event, the tools, data, write-ups, and source code for the challenges and CTF engine are released to the public. The International CTF (iCTF) [5] has held CTF events at an international scope over the past 12 years. However, datasets collected through these events have limitations: they mostly consist of network data alone, the data are mixed (participants are on the same network), and in many of these events the objectives are not necessarily representative of real-world scenarios and instead focus on the competitive, game aspect of the assigned tasks.
For this reason, researchers tend to use emulation and simulation engines to design realistic and mock-up cybersecurity scenarios for collecting data and testing new concepts. For example, the Common Open Research Emulator (CORE) [6] is capable of emulating hosts and network devices at the network layer and above. These nodes can also install and execute services and applications. CORE supports Hardware-In-the-Loop (HIL) and is easily configurable and extensible, which makes it a good platform for creating scenarios that can be migrated to other systems. The Extendable Mobile Ad-hoc Network Emulator (EMANE) [7] can be used separately or alongside CORE to provide emulation for the physical and data link layer. These technologies are very well suited for providing flexible, efficient, and simultaneous experimentation environments. However, some additional key features must be implemented for comprehensive and large-scale data acquisition (including the inside, attacker, view).
Driving Facilities
The Center for Cyber Analysis and Assessment (CCAA), located at the University of Texas at El Paso (UTEP), an Army Research Lab (ARL)-South Satellite campus, was established to tackle these issues. The center brings together government, industry, and academic partners to partake in and develop hands-on workshops that help to understand and develop solutions for real-world technology gaps and research questions. The Cybersecurity through Workshops, Analysis, and Research (CyWAR) laboratory is a collaborative working area that offers shared office space for collaborators.
The successful collaboration between ARL and UTEP is primarily fueled through in-kind contributions. The Software Engineering courses (at both the undergraduate and graduate levels) build tools and techniques to tackle real-world, current-day problems. UTEP Scholarship for Service (SFS) students at the master's and doctoral levels refine and tailor the developed software for use in cybersecurity research. Collaboratively developed coursework and workshops help to attract students to UTEP’s cybersecurity programs and to satisfy engagement requirements for University designations such as the National Security Agency (NSA) Center of Academic Excellence (CAE) in both Cyber Defense and Cyber Operations.
Collaboration Pipeline
ARL-UTEP collaborative interactions form a pipeline where critical emphasis lies on empirical research and tools that coincide with the fast-paced field of cybersecurity (see Figure 1).
Design Basis and Workshop Development
The development of hands-on cybersecurity workshops involves students, ARL researchers, and other experts investigating publicly known cybersecurity incidents, tools, and vulnerabilities. This involves consulting experts in specific technical areas and understanding current technologies and their weaknesses. A scenario outline that describes the steps an attacker may take to compromise the system is documented as an exercise for workshop participants. This document includes the goals and outcomes for each phase in the exercise. After the outline is reviewed by the ARL-South group, the network topology is developed using CORE and VirtualBox. Any custom Virtual Machines (VMs), e.g., a Windows 7 machine vulnerable to the WannaCry ransomware, are configured separately and connected to CORE through its HIL feature.
A typical workshop’s duration is between one and three hours. To accommodate multiple skill levels, each workshop consists of a regular challenge and an advanced challenge. The advanced challenge encourages participants to research external resources and leverage the knowledge gained during the regular challenge. The following are the sample steps involved with two such workshops.
Workshop: Pivoting and Exploitation
In this workshop, participants are located on the Internet and must gain access to an email server that resides in an Intranet. The Intranet has a host that is running a publicly accessible and vulnerable JBoss service. Participants are given a document that was apparently found while dumpster diving. It describes several subnetworks in the Intranet, including IP addresses and subnet masks.
Participants complete the following tasks: 1) find the IP address of the node serving the JBoss service by scanning the Intranet, 2) use Metasploit to identify and exploit a vulnerability in the JBoss service and to run a Meterpreter session, 3) configure the compromised node as a pivot by configuring routing and using a socks4a proxy, and 4) access the internal email server using the browser.
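The first task starts from the subnet descriptions in the handout. As a minimal sketch of that preparation step, the following Python expands an address and subnet mask into the list of host addresses a participant would then scan. The subnets shown are hypothetical placeholders, not the ones used in the workshop, and the function name is an assumption made for illustration.

```python
import ipaddress


def candidate_hosts(address, netmask):
    """Return the usable host addresses of the subnet that contains `address`."""
    # strict=False accepts any address inside the subnet, not only the network address.
    network = ipaddress.ip_network(f"{address}/{netmask}", strict=False)
    return [str(host) for host in network.hosts()]


# Hypothetical subnets standing in for the ones listed in the handout.
subnets = [("10.0.1.0", "255.255.255.248"), ("10.0.2.17", "255.255.255.252")]
targets = [host for address, mask in subnets for host in candidate_hosts(address, mask)]

if __name__ == "__main__":
    for target in targets:
        print(target)
```

The resulting list can be handed to whichever scanner the participant prefers; the sketch itself generates no network traffic.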
Workshop: Route Hijacking
In this workshop, participants lack prior knowledge about the network. They are connected to a routing gateway that is using the Routing Information Protocol (RIP) for dynamic routing.
Participants complete the following tasks: 1) use Wireshark to view the routing network packets and identify all subnetworks, 2) spoof a plaintext authentication web page running on a remote host, 3) host the spoofed web page, 4) use the Loki.py tool to advertise a false route to the web server, and 5) use Wireshark to retrieve user credentials.
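Task 1 relies on Wireshark, but the same information can be read programmatically. The sketch below is an illustration only, assuming RIP version 2 responses (the UDP payload on port 520) and using a hand-built sample payload rather than workshop traffic. It decodes the advertised routes to list the subnetworks a gateway announces.

```python
import ipaddress
import struct

RIP_HEADER = struct.Struct("!BBH")       # command, version, must-be-zero
RIP_ENTRY = struct.Struct("!HH4s4s4sI")  # family, route tag, address, mask, next hop, metric


def parse_rip_response(payload):
    """Return (network, metric) pairs advertised in a RIPv2 response payload."""
    command, version, _ = RIP_HEADER.unpack_from(payload, 0)
    if command != 2 or version != 2:  # command 2 is a response
        return []
    routes = []
    last_start = len(payload) - RIP_ENTRY.size
    for offset in range(RIP_HEADER.size, last_start + 1, RIP_ENTRY.size):
        family, _tag, addr, mask, _next_hop, metric = RIP_ENTRY.unpack_from(payload, offset)
        if family != 2:  # 2 is IP; other values (e.g., authentication entries) are skipped
            continue
        network = ipaddress.ip_network(
            f"{ipaddress.ip_address(addr)}/{ipaddress.ip_address(mask)}", strict=False
        )
        routes.append((str(network), metric))
    return routes


def build_entry(network, metric):
    """Build one route entry; used here only to create sample data."""
    net = ipaddress.ip_network(network)
    return RIP_ENTRY.pack(2, 0, net.network_address.packed, net.netmask.packed, bytes(4), metric)


# Hand-built sample payload with two hypothetical routes.
sample = RIP_HEADER.pack(2, 2, 0) + build_entry("10.0.1.0/24", 1) + build_entry("10.0.2.0/24", 2)

if __name__ == "__main__":
    for network, metric in parse_rip_response(sample):
        print(network, "metric", metric)
```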
Table 1 lists additional workshops that have been developed collaboratively between academia, government, and industry.
Table 1. Collaboratively developed workshops – Source: Author
Workshop Name | Description |
---|---|
WannaCry ransomware | Infect a Windows 7 machine, observe traffic, find and implement the kill switch so that the malware will no longer spread. |
DEFCON challenges | Recreated qualifying challenges for the DEFCON capture-the-flag events. |
Slow HTTP POST Denial of Service (DoS) | Understand the Slowloris DoS tool, recreate effects, and configure a web server for prevention. |
Cross-site scripting | Identify weak JavaScript code and use it to obtain a victim’s information and then fix the code vulnerability. |
Bot malware forensics | Use Volatility to identify and reverse engineer an infected process. Decode communication and obtain additional information from the bot master. |
Watering hole | Scan and find a vulnerable HTTP File System service on a web server and then replace a legitimate file with a reverse shell. Afterwards, apply defense in depth to harden the system. |
Reverse engineering | Use IDA Pro to find the password for an encrypted malware file. |
Buffer overflow | Identify a weakness in an FTP server program and use Metasploit to generate shell code. Afterwards, apply defense in depth to harden the system. |
ARP spoof | Understand the arpspoof tool in Kali Linux OS and then use it to eavesdrop on traffic. Afterwards, propose potential fixes. |
Workshop Delivery
Over the past two years, we have hosted more than 15 workshops with over 800 participants. While we primarily use the CCAA to host workshops for students and professionals, we have also conducted workshops at external venues including the Hispanic Engineer National Achievement Awards Corporation (HENAAC) Conference and the White Sands Missile Range Leaders New Mexico (LMN) event.
There are several ways to deliver workshops. If a workshop is held outside of the Center, participants use their own computers; if it is held in the Center, they may use their own computers or furnished laptops. When participants bring their own laptops, the only requirement is that the laptops have remote desktop client software, such as Microsoft Remote Desktop or rdesktop (Linux). Workshops start with a presentation that describes background knowledge related to the security issue or incident. While most of the workshops target freshman to sophomore-level college students (e.g., WannaCry ransomware), a few are designed for cybersecurity professionals (e.g., the DEFCON challenge). To start the exercise, participants navigate to a web server and then download and open a Remote Desktop Protocol (RDP) connection file.
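A connection file of this kind can be very small. The sketch below is an assumption about what such a file might contain, not a copy of the files served in our workshops: it writes the standard `full address` setting of the .rdp format, and the host address and ports are hypothetical.

```python
def rdp_file_contents(host, port):
    """Return the text of a minimal .rdp file pointing at host:port."""
    # "full address:s:<host>:<port>" is the standard .rdp setting for the target.
    return f"full address:s:{host}:{port}\r\n"


# Hypothetical server address and one port per participant seat.
seat_files = {
    f"seat{seat}.rdp": rdp_file_contents("192.0.2.10", 1010 + seat)
    for seat in range(1, 4)
}

if __name__ == "__main__":
    for filename, contents in seat_files.items():
        print(filename, contents.strip())
```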
If the workshop is used for training or awareness, participants are given an exercise handout that consists of fill-in-the-blank questions mixed with short explanations. During the exercise, the workshop developers and other aides answer questions and offer guidance. If the workshop is used for testing, participants are given an objective and offered little or no guidance.
The Backend: Execution and Instrumentation Tools
Our pipeline uses two collaboratively developed tools. The Emulation Sandbox (EmuBox) is used to serve multiple simultaneous scenarios, and the Evaluator-Centric and Extensible Logger (ECEL) is used to collect data. Each tool is described below.
EmuBox
The EmuBox is a lightweight, open source testbed. It is written in Python and has been tested on Windows 7+, Kali Linux 2016.1 and 2016.2, and Ubuntu 14.04 LTS (32 and 64 bit). The EmuBox leverages VirtualBox and CORE to support mixed virtual/physical systems, VirtualBox Remote Desktop Protocol (VRDP) connections, and heterogeneous (e.g., mixed MANET and wired) networks.
The EmuBox can host up to 8 simultaneous participants on a laptop with an Intel i7 processor and 16GB of memory. Figure 2 shows a setup using four computers to run the EmuBox and a network switch to connect participants to the internal virtual machines.
Scenario VMs are grouped into Workshop Units and Workshop Groups. Workshop Units contain the set of VMs that make up a single scenario. At least one of these VMs must have VRDP enabled (a feature of the VirtualBox Extension Pack).
VirtualBox consumes VRDP data, meaning that the traffic associated with the remote desktop connection is not visible within the VMs. A VM is used with CORE to construct the network topology. The topology may consist of Linux containers and Docker containers. Additionally, external hardware, such as a Controller Area Network (CAN) bus, and other non-IP-based systems may be connected using HIL. Vulnerable systems, scripted actors (e.g., operators/defenders), and instrumentation may also be incorporated into the scenarios.
Network isolation is implemented using VirtualBox internal network adapters. After a Workshop Unit is configured, the machines are started and a snapshot is taken; this snapshot acts as a frozen image that preserves state and can be restored at a later time. For example, the snapshot may be taken after a user logs in and starts the sshd service or after all routes converge in the network topology. The EmuBox can also clone Workshop Units (or scenarios), adjusting VRDP ports and internal network adapter names so that each clone is isolated and uniquely accessible by participants. Additionally, the EmuBox has a backend subsystem that provides a web frontend to show all available workshops and to restore VMs from snapshots once participants disconnect. See [8] for performance analysis.
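The cloning step can be pictured as simple bookkeeping: every clone receives its own VRDP port and its own internal network name. The following sketch is not EmuBox code. The naming scheme, the starting port, and the helper names are assumptions. It only builds `VBoxManage` argument lists (`clonevm`, `modifyvm --vrdeport`, `modifyvm --intnet1`) without running them, and exact flag syntax may vary between VirtualBox versions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClonePlan:
    vm_name: str
    vrdp_port: int
    internal_network: str


def plan_clones(base_vm, base_network, count, first_port=1011):
    """Give each clone a unique name, VRDP port, and internal network name."""
    return [
        ClonePlan(f"{base_vm}-{i}", first_port + i - 1, f"{base_network}-{i}")
        for i in range(1, count + 1)
    ]


def vboxmanage_commands(base_vm, plan):
    """Return (but do not execute) the VBoxManage calls for one clone."""
    return [
        ["VBoxManage", "clonevm", base_vm, "--name", plan.vm_name, "--register"],
        ["VBoxManage", "modifyvm", plan.vm_name, "--vrde", "on",
         "--vrdeport", str(plan.vrdp_port)],
        ["VBoxManage", "modifyvm", plan.vm_name, "--nic1", "intnet",
         "--intnet1", plan.internal_network],
    ]


if __name__ == "__main__":
    # "victim" and "ws_net" are hypothetical names for a VM and its network.
    for plan in plan_clones("victim", "ws_net", 3):
        for command in vboxmanage_commands("victim", plan):
            print(" ".join(command))
```

Because each clone attaches to a differently named internal network, traffic from one participant's scenario never reaches another's, which is the isolation property described above.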
ECEL
The ECEL is open source, written in Python, and designed using a plugin architecture (see Figure 3). While the ECEL itself is cross-platform, some plugins are not, such as Snoopy, which is used for collecting system calls on Linux systems.
The ECEL’s execution engine runs as a service that interfaces with backend functions for collection and parsing. Users may interact with the engine using the Graphical User Interface (GUI) or through a terminal window. The ECEL's capabilities are easily extended through the implementation of collector and parser plugins. Collector plugins capture data from a resource such as tool output, system logs, or operating system hooks. Parser plugins read the captured data and transform it into a structured format. We have built plugins for network traffic (Dumpcap/Multi-Dumpcap), system calls (Snoopy), screenshots (Manual Screenshot), and keystrokes and mouse clicks (Pykeylogger). Our parsers format data into JavaScript Object Notation (JSON), an open-standard, human-readable text format that represents data objects as attribute-value pairs and arrays; the resulting files are used during analysis. See [8] for an in-depth description of plugins. We walked through and captured a small dataset for the route hijacking and pivoting and exploitation workshops described earlier.
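To make the collector/parser split concrete, here is a minimal sketch of a parser plugin in Python. It is not the ECEL's actual plugin interface: the class names, the registry, and the `timestamp|user|command` log layout are assumptions chosen for illustration (Snoopy's real log format differs).

```python
import json
from abc import ABC, abstractmethod


class ParserPlugin(ABC):
    """Hypothetical base class: turn raw collector output into records."""

    @abstractmethod
    def parse(self, raw_lines):
        """Return a list of dictionaries, one per captured event."""


class CommandLogParser(ParserPlugin):
    """Parse lines in an assumed 'timestamp|user|command' layout."""

    def parse(self, raw_lines):
        records = []
        for line in raw_lines:
            line = line.strip()
            if not line:
                continue
            # Split at most twice so that '|' inside the command is preserved.
            timestamp, user, command = line.split("|", 2)
            records.append({"timestamp": timestamp, "user": user, "command": command})
        return records


# Hypothetical registry mapping a plugin name to its parser class.
PLUGINS = {"commands": CommandLogParser}


def run_parser(name, raw_lines):
    """Run the named parser and serialize its records as JSON."""
    return json.dumps(PLUGINS[name]().parse(raw_lines), indent=2)


if __name__ == "__main__":
    captured = [
        "2017-06-01T10:00:00|student|nmap -sn 10.0.1.0/24",
        "2017-06-01T10:02:13|student|cat notes.txt | grep pass",
    ]
    print(run_parser("commands", captured))
```

Adding support for a new data source then amounts to registering another class, which is the main appeal of a plugin design.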
Data Analytics
Several visualization tools provide an efficient way to analyze the data. The timeline viewer, shown in Figure 4, is used for editing, annotating, and extracting portions of workshop-related data. This helps to map attacker actions to network traffic and to build models for decision support and attacker profiles.
The heat map viewer, shown in Figure 5, is used to identify similarity in network traffic across traffic captures and scenarios. This is used to improve intrusion detection systems and also to aid during security assessments (such as penetration testing) and to fine-tune and prune attack graphs, e.g., by assigning confidence metrics based on attacker profiles [9]. The heat map in Figure 5 shows two captures with high occurrences of the tftp lexicon (and, hence, the protocol).
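One simple way to compute the values behind such a heat map is to compare token counts between captures. The sketch below uses cosine similarity over word counts. This is an assumed stand-in chosen for illustration, not necessarily the measure used by the viewer, and the capture names and tokens are made up.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two token-count mappings."""
    dot = sum(a[token] * b[token] for token in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(captures):
    """Return capture names and the pairwise similarity matrix (rows of floats)."""
    names = sorted(captures)
    counts = {name: Counter(captures[name]) for name in names}
    matrix = [[cosine_similarity(counts[r], counts[c]) for c in names] for r in names]
    return names, matrix


if __name__ == "__main__":
    # Made-up token lists standing in for lexicons extracted from captures.
    captures = {
        "capture_a": ["tftp", "tftp", "arp", "dns"],
        "capture_b": ["tftp", "tftp", "tftp", "arp"],
        "capture_c": ["http", "dns", "http"],
    }
    names, matrix = similarity_matrix(captures)
    for name, row in zip(names, matrix):
        print(name, [round(value, 2) for value in row])
```

Each cell of the matrix maps directly to one colored cell in a heat map, with values near 1 indicating captures dominated by the same tokens.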
Ongoing Research
The pipeline feeds into several research efforts that focus on the defensive and the testing aspect of security. The following are some examples.
Related to attack analysis and profiling, we employ temporal pattern mining and motif mining techniques to investigate the workshop data and detect suspicious activities that frequently appear together in attackers' data streams. We study the correlation between network characteristics, network traffic, and system commands. We also conduct further studies to identify the best approach for modeling an attacker's profile based on pre-intrusion and post-intrusion activities at the network and system level. In short, we are adding another dimension of training data (the inside view) to improve intrusion detection systems.
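As a toy illustration of the idea of activities that frequently appear together, the sketch below counts pairs of event labels that occur within a time window of each other and keeps the pairs seen in enough sessions. It is a deliberately simplified stand-in for temporal pattern mining, and the sessions, labels, window, and threshold are invented for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(sessions, window_seconds, min_support):
    """Return label pairs that co-occur within the window in >= min_support sessions.

    Each session is a list of (timestamp_in_seconds, label) tuples.
    """
    support = Counter()
    for events in sessions:
        pairs_in_session = set()
        for (t1, a), (t2, b) in combinations(sorted(events), 2):
            if a != b and t2 - t1 <= window_seconds:
                pairs_in_session.add(tuple(sorted((a, b))))
        # Count each pair once per session so that long sessions do not dominate.
        support.update(pairs_in_session)
    return {pair: count for pair, count in support.items() if count >= min_support}


if __name__ == "__main__":
    # Invented sessions; labels name the tool observed at each timestamp.
    sessions = [
        [(0, "nmap"), (5, "arp-scan"), (120, "ssh")],
        [(0, "arp-scan"), (8, "nmap"), (20, "ssh")],
        [(0, "nmap"), (300, "ssh")],
    ]
    print(frequent_pairs(sessions, window_seconds=30, min_support=2))
```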
Another effort is attempting to extrapolate relationships between attacker actions and personality traits in relation to the dark triad (Machiavellianism, narcissism, and psychopathy). We have developed a system (see Figure 6) that leverages the EmuBox and the ECEL to analyze user network scanning and probing. Users complete a personality questionnaire and a workshop. We are attempting to identify correlations between the answers to the questionnaires and metrics related to actions, timing, and stealth. This work will help to predict attacker behavior in the early stages of an attack; probing and scanning are usually the first steps in an attack.
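A first pass at this kind of analysis could be a plain Pearson correlation between a questionnaire score and a behavioral metric. The sketch below is only an illustration: the choice of Pearson correlation is an assumption, and the numbers are made-up placeholders, not study data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / (spread_x * spread_y)


if __name__ == "__main__":
    # Placeholder values: one trait score and one scan-rate metric per participant.
    trait_scores = [2.0, 3.5, 1.0, 4.0, 2.5]
    packets_per_second = [40.0, 90.0, 25.0, 110.0, 60.0]
    print(round(pearson(trait_scores, packets_per_second), 3))
```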
Regarding security testing, our work focuses on automated methodologies in the realm of protocol analysis and cybersecurity assessments. Using machine learning, we are developing algorithms for automatic extraction of network protocol structures into a standardized format. We then use these structures to generate software templates that can communicate with non-IP protocols. Currently, the automated software generates ns-3 models and Scapy code; Scapy is a powerful interactive packet manipulation tool, packet generator, network scanner, network discovery tool, and packet sniffer [10].
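The template-generation step can be sketched as a small code generator that turns a list of field names and sizes into the source of a Scapy layer. This is an illustrative assumption about what such output might look like, not the output of our software. The protocol name and fields are hypothetical; `Packet`, `ByteField`, `ShortField`, and `IntField` are existing Scapy classes, and Scapy is only needed to run the generated source, not the generator.

```python
# Map a field size in bytes to the Scapy field class used in the template.
FIELD_TYPES = {1: "ByteField", 2: "ShortField", 4: "IntField"}


def scapy_layer_source(layer_name, fields):
    """Return Python source for a Scapy layer with the given (name, size) fields."""
    lines = [
        "from scapy.packet import Packet",
        "from scapy.fields import ByteField, ShortField, IntField",
        "",
        "",
        f"class {layer_name}(Packet):",
        f'    name = "{layer_name}"',
        "    fields_desc = [",
    ]
    for field_name, size in fields:
        lines.append(f'        {FIELD_TYPES[size]}("{field_name}", 0),')
    lines.append("    ]")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical structure, as might be inferred from a traffic capture.
    inferred = [("msg_type", 1), ("length", 2), ("sequence", 4)]
    print(scapy_layer_source("SensorProto", inferred))
```

Generating source text rather than objects keeps the result easy to inspect and hand-edit, which matters when the inferred structure is only partly correct.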
We are also creating a decision support system for use by penetration testers (see Figure 7). Testers will be able to efficiently identify low-hanging fruit (i.e., findings that have been identified previously and are still unfixed) and to allocate more time and resources to test other, more complex, systems.
This system uses data collected during workshops with the ECEL and can also be trained to leverage in-house tools and techniques specific to an organization. Eventually, we will investigate the possibility of creating automated agents that execute a set of automated tasks, selected according to their likelihood of success and collateral risk.
Conclusion
The relationship between ARL and UTEP has yielded many fruitful results. ARL has benefited by leveraging subject matter experts to cooperatively design and develop tools, conduct next-generation cybersecurity research, expand its overall capabilities, and also to attract and retain talent in the workforce. The University has strengthened its security program and outreach activities that have led to joint proposals and research grants among others. Students graduate with a firm understanding of cybersecurity concepts and issues augmented with practical experiences gained from working alongside experts in the field.
In the short term, we plan to make workshops accessible across the Internet by using load balancing technologies and a virtual private network (VPN), a technology that creates a safe and encrypted connection over a less secure network such as the Internet. We will continue to expand our collaborative relationship and plan to reach out to other partners to develop and broaden our research focus.
References
- Schauland, D., & Jacobs, D. (2016). Managing the Windows Event Log. In Troubleshooting Windows Server with PowerShell (pp. 17–33). Springer.
- Eriksen, M. A., & Skufca, B. Snoopy logger. [Online]. Available: https://github.com/a2o/snoopy
- Milenkoski, A., Vieira, M., Kounev, S., Avritzer, A., & Payne, B. D. (2015). Evaluating computer intrusion detection systems: A survey of common practices. ACM Computing Surveys (CSUR), 48(1), 12.
- Nunes, E., Kulkarni, N., Shakarian, P., Ruef, A., & Little, J. (2016). Cyber-deception and attribution in capture-the-flag exercises. In Cyber Deception (pp. 151-167). Springer International Publishing.
- Vigna, G., Borgolte, K., Corbetta, J., Doupe, A., Fratantonio, Y., Invernizzi, L., Kirat, D., & Shoshitaishvili, Y. (2014). Ten years of iCTF: The good, the bad, and the ugly. In 2014 USENIX Summit on Gaming, Games, and Gamification in Security Education (3GSE 14). San Diego, CA: USENIX Association. [Online]. Available: https://www.usenix.org/conference/3gse14/summitprogram/presentation/vigna
- Ahrenholz, J., Danilov, C., Henderson, T. R., & Kim, J. H. (2008, November). CORE: A real-time network emulator. In Military Communications Conference, 2008. MILCOM 2008. IEEE (pp. 1-7). IEEE.
- Scott, L., Marcus, K., Hardy, R., & Chan, K. (2016, November). Exploring dependencies of networks of multi-genre network experiments. In Military Communications Conference, MILCOM 2016-2016 IEEE (pp. 576-581). IEEE.
- Acosta, J. C., McKee, J., Fielder, A., & Salamah, S. (2017, October). A Platform for Evaluator-Centric Cybersecurity Training and Data Acquisition. In Military Communications Conference, MILCOM 2017-2017 IEEE. IEEE.
- Acosta, J. C., Padilla, E., & Homer, J. (2016, November). Augmenting attack graphs to represent data link and network layer vulnerabilities. In Military Communications Conference, MILCOM 2016-2016 IEEE (pp. 1010-1015). IEEE.
- Acosta, J. C., & Estrada, P. (2017, May). A preliminary architecture for building communication software from traffic captures. In SPIE Defense+ Security (pp. 102060T-102060T). International Society for Optics and Photonics.