III. Methodology
In this research, we used the Java Optimized Processor (JOP), a soft-core CPU written in VHDL. A key benefit of the JOP is that it can be implemented on most commercially available FPGAs without modification. Internally, the JOP is a 4-stage pipelined CPU with separate Bytecode Fetch, Fetch, Decode, and Execute/Stack stages. These stages control the operation of the memory controller, the cache, the ALU, and the I/O interface contained within the JOP. The JOP was designed primarily to provide an efficient Java interpreter for embedded systems. We chose the JOP because it is open source (i.e., freely available) and uses a modern pipelined CPU architecture. Using a soft-core JOP allowed us to implement different hardware protection schemes by modifying the underlying JOP architecture. We implemented the AES algorithm in the Java programming language and executed the Java code on three different variations of the JOP. A secret key was randomly generated, and the same secret key was used for all of the experiments. To implement the hardware countermeasures, changes were made to the underlying VHDL architecture of the JOP [13]. These countermeasures were then evaluated using side channel analysis.
A. Side Channel Analysis (SCA)
Side Channel Analysis (SCA) is used to gain information about the secret key by measuring EM emissions while the AES algorithm is executing. The emissions are considered the “signal” of interest for an attacker and arise from the movement of data between the processor and memory as the AES algorithm executes. Specifically, as operations are performed within the JOP, the data needed for the operations is stored in and retrieved from a local cache to improve processor performance. Notably, the cache in the JOP is implemented in the form of a stack. As data is moved between the registers and the cache, information about the key is leaked in the form of EM emissions that are correlated with the key. Worse, when these cached values are written back to main system memory, the EM emissions often have greater magnitude because of the larger capacitances of the external data buses that connect the processor to main system memory. Thus, for circuits running AES cryptography, protection methods center on reducing the emissions that result from data transfers between the processor and the cache and from RAM write-back operations. With this in mind, our hypothesis is that through the addition of memory units using both masking and elimination techniques, the correlation between the processed data and the EM emissions can be significantly reduced.
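To make the idea of key-correlated leakage concrete, the following is a minimal, noise-free sketch of a single-bit difference-of-means DPA on one key byte, targeting the first-round SubBytes output. It is illustrative only and is not the analysis performed in this work: the Hamming-weight leakage model, the function names, and the key byte 0x2B are assumptions, and real EM traces contain noise and many samples per encryption.

```python
def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF

def build_aes_sbox():
    """Generate the AES S-box (field inverse followed by the affine map)."""
    sbox = [0] * 256
    p = q = 1
    while True:
        # Multiply p by 3 in GF(2^8); p visits every non-zero element.
        p = (p ^ (p << 1) ^ (0x1B if p & 0x80 else 0)) & 0xFF
        # Divide q by 3, so that q stays the multiplicative inverse of p.
        q ^= (q << 1) & 0xFF
        q ^= (q << 2) & 0xFF
        q ^= (q << 4) & 0xFF
        if q & 0x80:
            q ^= 0x09
        # Affine transformation applied to the inverse.
        sbox[p] = q ^ rotl8(q, 1) ^ rotl8(q, 2) ^ rotl8(q, 3) ^ rotl8(q, 4) ^ 0x63
        if p == 1:
            break
    sbox[0] = 0x63  # zero has no inverse and is handled separately
    return sbox

def hamming_weight(x):
    return bin(x).count("1")

def simulate_traces(key_byte, plaintexts, sbox):
    """Assumed leakage model: one noise-free sample per encryption,
    equal to the Hamming weight of the first-round S-box output."""
    return [hamming_weight(sbox[p ^ key_byte]) for p in plaintexts]

def dpa_scores(plaintexts, traces, sbox, bit=0):
    """For each key guess, split the traces by the predicted S-box output
    bit and return the absolute difference of the two group means."""
    scores = []
    for guess in range(256):
        sums = [0.0, 0.0]
        counts = [0, 0]
        for p, t in zip(plaintexts, traces):
            b = (sbox[p ^ guess] >> bit) & 1
            sums[b] += t
            counts[b] += 1
        if 0 in counts:
            scores.append(0.0)  # one group is empty; no difference defined
        else:
            scores.append(abs(sums[1] / counts[1] - sums[0] / counts[0]))
    return scores

SBOX = build_aes_sbox()
plaintexts = list(range(256))   # every plaintext byte once (toy input)
secret = 0x2B                   # hypothetical key byte, chosen arbitrarily
traces = simulate_traces(secret, plaintexts, SBOX)
scores = dpa_scores(plaintexts, traces, SBOX)
best_guess = max(range(256), key=scores.__getitem__)
```

Under this idealized model the correct guess yields a mean difference of exactly 1.0, and it is expected to be the largest score. With measured traces, noise lowers the separation between the correct and incorrect guesses, which is why many traces are needed in practice.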
B. A Protection Method for a Java Optimized Processor
In the protected version of the JOP, each value is split into two values before it is saved. The first part of the mask, D0, holds the odd-numbered bits of the original value, D, each followed by its inverse in the adjacent even-numbered position. Thus, the first bit of D (b0) is stored in the first bit of D0, and the second bit of D0 is the inverse of that bit (b0’). Likewise, the third bit of D (b2) is stored in the third bit of D0, and the fourth bit of D0 is its inverse (b2’). The second part of the mask, D1, holds the even-numbered bits of D in its even-numbered positions, and its odd-numbered positions hold the inverses of those bits. See Figure 2 below.
Fig. 2. The separation of a 32-bit integer into two values of equal Hamming weight
This masking method splits each value into two masked values. For the 32-bit JOP, each masked value always has a Hamming Weight (HW) of 16, regardless of the original value, because every stored bit is paired with its inverse. The HW is the number of bits in a value that are set to 1. Giving every stored value the same HW helps eliminate value-dependent differences in power usage. The masking is applied when values are stored to the JOP cache and when values are stored in RAM.
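The split can be sketched in a few lines of software. This is a minimal model of the bit arrangement only, not the VHDL implementation: the function names are hypothetical, b0 is assumed to be the least significant bit, and each inverse in D1 is assumed to sit in the position directly below the bit it complements.

```python
ODD_NUMBERED = 0x55555555    # b0, b2, ... (the 1st, 3rd, ... bits)
EVEN_NUMBERED = 0xAAAAAAAA   # b1, b3, ... (the 2nd, 4th, ... bits)

def split_balanced(d):
    """Split a 32-bit value D into (D0, D1), each with Hamming weight 16."""
    d &= 0xFFFFFFFF
    # D0: keep b0, b2, ... in place; store each inverse one position higher.
    kept0 = d & ODD_NUMBERED
    d0 = kept0 | ((~kept0 & ODD_NUMBERED) << 1)
    # D1: keep b1, b3, ... in place; store each inverse one position lower
    # (the placement of the inverse is an assumption of this sketch).
    kept1 = d & EVEN_NUMBERED
    d1 = kept1 | ((~kept1 & EVEN_NUMBERED) >> 1)
    return d0, d1

def recombine(d0, d1):
    """Recover D by taking the non-inverted bits from each half."""
    return (d0 & ODD_NUMBERED) | (d1 & EVEN_NUMBERED)

def hamming_weight(x):
    return bin(x).count("1")

# Every (bit, inverse) pair contributes exactly one set bit, so each half
# always has 16 set bits out of 32.
for value in (0x00000000, 0xFFFFFFFF, 0xDEADBEEF):
    d0, d1 = split_balanced(value)
    print(hex(value), hex(d0), hex(d1), hamming_weight(d0), hamming_weight(d1))
```

The sketch also makes the storage cost visible: each 32-bit value occupies two 32-bit words once it is split.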
C. Testing Setup
The testing setup for this experiment used an ML506 Virtex 5 FPGA running the AES algorithm on a soft-core JOP. Electromagnetic emissions, hereafter called “traces,” were collected using a RISCure EM probe connected to a Lecroy Wavemaster 8zi oscilloscope and analyzed using RISCure’s “Inspector” software package (see http://www.riscure.com/). The physical test setup is shown in Figure 3.
Fig. 3. Testing setup showing EM probe centered over Virtex 5 with shield removed
To find the best location for the EM probe before collecting data, the chip surface was divided into a 10×10 grid. EM measurements were collected at each of the 100 locations to determine the location with the best signal-to-noise ratio. Figure 4 shows an example color graph depicting the magnitude of the power levels recorded over the surface of the chip. In this graph, light green marks the area of greatest signal-to-noise ratio.
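The probe placement step amounts to selecting the maximum over the scan grid. A minimal sketch, assuming the per-location signal-to-noise estimates are already available as a 10×10 list of lists; the function name and the synthetic values are hypothetical and do not come from the measurements:

```python
def best_probe_location(snr_grid):
    """Return (row, col, snr) for the grid cell with the highest SNR."""
    best = None
    for r, row in enumerate(snr_grid):
        for c, snr in enumerate(row):
            if best is None or snr > best[2]:
                best = (r, c, snr)
    return best

# Synthetic placeholder data: a flat grid with one strong cell.
grid = [[0.0] * 10 for _ in range(10)]
grid[3][7] = 5.0
row, col, snr = best_probe_location(grid)
```

Ties resolve to the first cell encountered in row-major order; a real scan might instead average over neighboring cells before choosing.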
Fig. 4. Physical surface of the chip showing locations of high EM radiation
Once the optimal location for the EM probe was selected, 1000 traces were collected during the AES encryption of random plaintexts, one trace per encryption. Because the VHDL code differed for each of the three variations of the JOP (i.e., the baseline JOP architecture with no countermeasures, the JOP with the masked cache architecture, and the JOP with the masked RAM architecture), the probe-location search described above was repeated for each architecture to maximize its signal-to-noise ratio. From these 1000 traces, a first-order DPA was performed considering only the portion of the trace that occurred in SubBytes during the first round of AES. The “Inspector” software tool was used to perform the DPA. This software requires the user to provide the plaintext and identify the relevant portions of the collected trace to analyze. The software generates a testable statistical model in which the collected data is used to test a set of key hypotheses until the statistically most probable key emerges. In every case of the 1000-trace set, the correct key was found. Subsequently, the process was repeated using progressively fewer traces until the correct key was no longer found. This procedure yielded the minimum number of traces needed to deduce the correct key. The whole process was repeated thirty times to obtain the average minimum number of traces required to deduce the correct key. These thirty data points, derived from the 30,000 collected traces, represented the number of traces needed to arrive at the correct key using DPA for a given JOP architecture.
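The trace-reduction procedure can be sketched as a simple search. This is a hedged illustration rather than the tooling used here: `key_recovered` is a hypothetical callable that runs the attack on the first n traces of a set and reports whether the correct key was found, and the step size and the threshold of 200 in the stand-in are arbitrary.

```python
def minimum_traces(key_recovered, start=1000, step=1):
    """Shrink the trace count from `start` until the attack first fails.

    Returns the smallest count that still succeeded, or None if the
    attack already fails with `start` traces.
    """
    if not key_recovered(start):
        return None
    n = start
    while n - step > 0 and key_recovered(n - step):
        n -= step
    return n

def average_minimum(minimums):
    """Average the per-set minimums (thirty sets in the procedure above)."""
    return sum(minimums) / len(minimums)

# Stand-in attack for illustration: succeeds with 200 or more traces.
def toy_attack(n):
    return n >= 200

print(minimum_traces(toy_attack))           # search in steps of one trace
print(minimum_traces(toy_attack, step=30))  # coarser, faster search
```

The search stops at the first failure, as in the procedure described above. Because attack success is not guaranteed to be monotonic in the number of traces, a smaller count could occasionally succeed by chance, which is one reason to average over many sets.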
Figure 5 below shows an example trace, recorded from the EM probe during one encryption of AES in the baseline JOP with no protections. The periodic nature of the 10 rounds of AES can be easily seen in the collected trace.
Fig. 5. A trace of AES showing 10 rounds in the unprotected baseline JOP
Figure 6 shows the areas of interest when the SubBytes step of the first AES round occurs. Specifically, Figure 6(a) highlights the trace when the cache is being accessed, and Figure 6(b) highlights the trace when RAM write-backs occur. These areas represent the targets of opportunity for reducing the signal available to an attacker during DPA.
Fig. 6. (a) The execution portion of a SubBytes substitution and (b) the RAM write back portion
IV. Results
In summary, the DPA presented in this research used traces collected during the SubBytes phase of the first AES round for three different versions of the Java Optimized Processor (JOP): 1) an unprotected baseline JOP, 2) a JOP with masked cache, and 3) a JOP with masked RAM. Figure 7 shows the traces collected from each of the three JOPs during the SubBytes phase of the first AES round. Figure 7(a) shows the trace collected from the baseline unprotected JOP; Figure 7(b) shows the trace collected from the JOP with the masked cache; and Figure 7(c) shows the trace collected from the JOP with the masked RAM.
Fig. 7. AES SubBytes for unprotected JOP (a), masked cache (b), masked RAM (c)
When attacking the JOP without any countermeasures, the execution portion of the trace (with heavy cache usage) required an average of 308.3 traces to extract the correct key, while the RAM write-back portion of the trace required an average of 154.8 traces. The execution portion required more traces than the RAM write-back portion because the JOP uses less power to interface with the on-chip cache than with the off-chip RAM, which results in less leakage and weaker signals.
When using the masked cache countermeasure, a t-test at the 95% confidence level showed that the change in the number of traces needed during the execution stage was between -95 and 14 traces. Because this interval includes zero, there is no statistically significant increase in security at the 95% confidence level. This was found to be because the JOP contains many registers that pass the values, and communication between registers was not protected by the masked cache, so information was still leaked.
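The reasoning behind "the interval includes zero" can be sketched as follows. This is a minimal illustration under stated assumptions and not the exact test that was run: it uses a two-sample t interval on the difference of means, the critical value must be supplied by the caller (for two groups of thirty with pooled variance, 58 degrees of freedom give a two-sided 95% value of approximately 2.00), and the sample values below are placeholders rather than measured data.

```python
import math
import statistics

def mean_difference_ci(baseline, protected, t_crit):
    """Confidence interval for mean(protected) - mean(baseline).

    t_crit is the two-sided critical value of the t distribution for the
    chosen confidence level and degrees of freedom (caller-supplied).
    """
    diff = statistics.mean(protected) - statistics.mean(baseline)
    std_err = math.sqrt(
        statistics.variance(baseline) / len(baseline)
        + statistics.variance(protected) / len(protected)
    )
    return diff - t_crit * std_err, diff + t_crit * std_err

def is_significant(interval):
    """The change is significant only if the interval excludes zero."""
    low, high = interval
    return low > 0 or high < 0

# Placeholder samples, not experimental data.
interval = mean_difference_ci([1, 2, 3], [2, 3, 4], t_crit=2.0)
print(interval, is_significant(interval))
```

Applied to the interval reported for the masked cache, `is_significant((-95, 14))` evaluates to False, which matches the conclusion that no increase in security could be shown.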
When using the masked RAM countermeasure, a t-test at the 95% confidence level found that the average number of traces needed to find the correct key increased by between 43 and 137 traces. This means that, with 95% confidence, the masked RAM substantially improved security for the RAM write-back portion of the trace: the number of traces needed to derive the correct key increased by between 31% and 87%. As expected, significantly more information leaks during the RAM write-back than during the transfer of data from the registers to the cache; in the baseline JOP, the write-back portion required about half as many traces as the portion dominated by cache accesses. This indicates that efforts to reduce the leakage of the RAM write-back module will yield the best return on investment when protecting a processor.
However, it is important to note that masking schemes incur costs in terms of die area and processor speed. The cost of the cache protection was negligible: it increased the total area of the CPU by less than 1% and had no impact on processor performance. In contrast, the RAM masking scheme increased the execution time by a factor of 2.5 and required twice the RAM. Whether these costs are acceptable depends on the specific application context.