Introduction
Security is an ever-increasing risk. As we add more components to our infrastructure, the number of attack vectors increases. When building an authorization system such as Udaru, it is just as important to consider the security of Udaru itself as the security of the systems that Udaru protects.
Creating an intrusion-detection system for Udaru poses some challenges. First, Udaru itself places no restrictions on the type of application or domain it can be used for, which makes it very difficult to create feature- or application-specific rules. Secondly, there is often little or no data about known intrusion attempts, especially when we want to detect intrusion patterns that haven't been considered before.
Therefore, our intrusion-detection system dynamically creates rules based on a training dataset of previously seen valid requests. This kind of data modelling is known as zero-negative classification. Typical modern approaches such as machine learning are not well suited to this kind of problem. Instead, we use frequentist and Bayesian statistics for the models, as those approaches are a better fit.
Udaru
Udaru is a Policy Based Access Control (PBAC) authorization system. Consider a user in an organisation who asks to perform an action on a certain resource. Udaru policies control whether this action is allowed. For example, we can validate whether a user can write to a specific file on a server.
Udaru information flow: Udaru decouples the user login from the access control itself. This means that when Udaru receives an authorization request, it assumes the user login has already been validated.
It is important to understand that Udaru itself is not responsible for the user login. Intrusion detection for attack vectors that are meant to obtain user credentials, such as brute-force attacks, timing attacks, or SQL injections, is therefore not within the scope of intrusion detection for Udaru.
Intrusion Strategies
The intrusion-detection system looks at IP addresses and timestamps to see if a user moves too fast. This may indicate that an intruder has gained another person’s user credentials.
Secondly, the intrusion-detection system looks at the resource string to see if it is similar to previously seen resource strings. If not, this might indicate that an intruder is trying to gain access to resources through obscure means, such as by using special characters (`resource:server:/public/server/../../../etc/passwd`) or unusual syntax (`resource:server:/public/server:/etc/passwd`). In these cases, a poorly written policy may allow access because `resource:server:/public/server` matches the poorly written policy. Even if the policies are fine, it is still important to know that someone is attempting to break the authorization system.
Udaru poses some challenges in the second case because it doesn't assume anything about the application. By default, it does not have any restrictions on how resource strings are supposed to look. In the example above, the resources are directories on servers, but they could just as well be physical doors in a corporate building or personal information used by human resources.
Transaction audit log
To facilitate logging the authorization data, nearForm has recently developed a new audit log service called Trail. Trail lets you index the authorization data and make it easily accessible to the intrusion-detection system. To use Trail with Udaru there is also an integration module called Udaru-Trail.
Geolocation Anomalies
For detecting users who “move” too fast, the IP-address is translated to longitude and latitude using the MaxMind GeoLite2 Database . Using the longitude and latitude of the IP-address of the last login and the new login, the distance can be calculated using the Haversine formula :
[katex display=true] d = r \cdot \mathrm{archav}\left(\sqrt{ \mathrm{hav}(\phi_2 - \phi_1) + \cos(\phi_1)\cos(\phi_2)\mathrm{hav}(\lambda_2 - \lambda_1) }\right) [/katex]
where [katex]r[/katex] is the Earth's radius, approximately 6371 km.
With the traveled distance calculated, it is straightforward to compute the velocity assuming infinite acceleration:
[katex display=true] v = \frac{d}{t_2 - t_1} [/katex]
These equations require the following parameters:
- [katex](\lambda_1, \phi_1, t_1)[/katex] longitude, latitude, and timestamp for the last login
- [katex](\lambda_2, \phi_2, t_2)[/katex] longitude, latitude, and timestamp for the new login
For the Udaru intrusion-detection, a velocity above the speed of sound (343 m/s) is considered an intrusion.
The Haversine functions are not part of the standard Python `math` library, but they can easily be implemented using the following trigonometric identities:
[katex display=true] \begin{aligned} \mathrm{hav}(\theta) &= \sin^2\left(\frac{\theta}{2}\right) \\ \mathrm{archav}(h) &= 2 \arcsin\left(\sqrt{h}\right) = 2 \mathrm{arctan2}\left(\sqrt{h}, \sqrt{1 - h}\right)\end{aligned} [/katex]
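Putting the identities above together with the Haversine formula and the velocity check, the whole geolocation anomaly test can be sketched in a few lines of Python. This is an illustrative sketch, not the tool's actual implementation; the function names are assumptions, while the Earth radius and the speed-of-sound threshold follow the values in the text:

```python
import math

EARTH_RADIUS_M = 6371000  # approximate Earth radius in metres
SPEED_OF_SOUND = 343      # m/s, the intrusion threshold from the text

def hav(theta):
    """Haversine: hav(theta) = sin^2(theta / 2)."""
    return math.sin(theta / 2) ** 2

def archav(h):
    """Inverse haversine: archav(h) = 2 * arctan2(sqrt(h), sqrt(1 - h))."""
    h = min(1.0, max(0.0, h))  # clamp to guard against floating-point drift
    return 2 * math.atan2(math.sqrt(h), math.sqrt(1 - h))

def distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    lam1, lam2 = math.radians(lon1), math.radians(lon2)
    h = hav(phi2 - phi1) + math.cos(phi1) * math.cos(phi2) * hav(lam2 - lam1)
    return EARTH_RADIUS_M * archav(h)

def is_travel_anomaly(lat1, lon1, t1, lat2, lon2, t2):
    """True if moving between the two logins would require supersonic travel."""
    d = distance(lat1, lon1, lat2, lon2)
    velocity = d / (t2 - t1)  # assumes infinite acceleration, as in the text
    return velocity > SPEED_OF_SOUND
```

For example, two logins an hour apart from Dublin and Sydney imply a velocity of several kilometres per second and would be flagged, while two logins an hour apart from nearby addresses would not.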
Resource Anomalies
To detect anomalies in the resource strings, three different models are used:
- Check if the string length matches what was previously seen.
- Check if the characters used match what was previously seen.
- Check if the grammatical structure matches what was previously seen.
These models are based on the paper " Anomaly Detection of Web-based Attacks " but with some modifications especially in the case of the grammatical structure model.
When validating a resource string, the fastest approach is to run the length model first, as it is very cheap while the other two models have a computational complexity that scales with the string length. Next, one should use the character model. However, for the purpose of demonstration, the results of all three models are presented here.
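The cheapest-check-first cascade can be sketched as follows. The three `check_*` callables stand in for the models described below; they are hypothetical placeholders, not the tool's actual API:

```python
def validate_resource(resource, check_length, check_characters, check_grammar):
    """Run the cheap checks first and stop at the first failure.

    Each check is a callable taking the resource string and returning
    True (looks valid) or False (anomaly)."""
    for check in (check_length, check_characters, check_grammar):
        if not check(resource):
            return False  # flag as a possible intrusion
    return True
```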
1. String Length Model
A big problem in creating a statistical model based on string length is that it is very hard to assume anything about the length. The same instance of Udaru can be used for handling resources of different types, such as door numbers and file paths. Therefore the empirical distribution of the resource string length can easily look something like this:
Hypothetical string length histogram: bimodal distributions or worse can be expected. In general, it is very hard to assume anything about the distribution of the resource string length.
Because it is very hard to assume anything about the distribution, a good approach is to use Chebyshev's inequality, as it makes no assumptions about the distribution:
[katex display=true] P(|x - \mu| > t) < \frac{\sigma^2}{t^2} [/katex]

Chebyshev's inequality says that the probability that some value ([katex]x[/katex]) deviates from the mean by more than a threshold ([katex]t[/katex]) is less than [katex]\frac{\sigma^2}{t^2}[/katex]. Setting the threshold to the currently observed deviation [katex]|l - \mu|[/katex], where [katex]l[/katex] is the length of the current resource string, gives:

[katex display=true]P(|x - \mu| > |l - \mu|) < \frac{\sigma^2}{\left(l - \mu\right)^2}[/katex]

To validate a string length, the bound on the probability [katex]P(|x - \mu| > |l - \mu|)[/katex] is then thresholded by some fixed value, in this case 10%. If the bound is greater than 10%, the string length is considered valid:
[katex display=true]10\% \le P(|x - \mu| > |l - \mu|) < \frac{\sigma^2}{\left(l - \mu\right)^2}[/katex]
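A minimal sketch of this model, where the function names are assumptions and the 10% threshold follows the text:

```python
import statistics

def train_length_model(lengths):
    """Estimate mean and (population) variance of the training string lengths."""
    return statistics.mean(lengths), statistics.pvariance(lengths)

def length_is_valid(mu, var, length, threshold=0.10):
    """Accept `length` unless the Chebyshev bound rules it out.

    Chebyshev gives P(|x - mu| > |l - mu|) < var / (l - mu)^2; the
    bound itself is thresholded, so a bound of at least 10% means the
    string length is considered valid."""
    deviation_sq = (length - mu) ** 2
    if deviation_sq == 0:
        return True  # the mean length itself is always valid
    return var / deviation_sq >= threshold
```

Note that for lengths close to the mean, the bound exceeds 1 and is vacuous, which is exactly the desired behaviour: typical lengths always pass.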
2. Character Distribution Model
The character distribution model computes the empirical character distribution from the training dataset. For validation, it then compares this with the character distribution of the new resource string using a [katex]\chi^2[/katex]-test. If the p-value is greater than 10%, the string is considered valid.
To increase the degrees of freedom, some characters are grouped into the same symbol. The character groupings are 0-9, a-f, g-z, A-F, and G-Z. The letters are split on F and G to allow the model to easily treat hexadecimal strings differently from strings that use all English letters.
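A sketch of this model using only the standard library: instead of computing an exact p-value, the [katex]\chi^2[/katex] statistic is compared against the standard critical value for 5 degrees of freedom at the 10% level (a library such as SciPy could compute the p-value directly). The function names and the catch-all "other" group are assumptions:

```python
from collections import Counter

GROUPS = ["0-9", "a-f", "g-z", "A-F", "G-Z", "other"]
CHI2_CRITICAL_10PCT = 9.236  # chi-squared critical value, 5 dof, alpha = 0.10

def group(ch):
    """Map a character to one of the groups from the text."""
    if ch.isdigit():
        return "0-9"
    if "a" <= ch <= "f":
        return "a-f"
    if "g" <= ch <= "z":
        return "g-z"
    if "A" <= ch <= "F":
        return "A-F"
    if "G" <= ch <= "Z":
        return "G-Z"
    return "other"

def train_character_model(strings):
    """Relative frequency of each character group over the training set."""
    counts = Counter(group(ch) for s in strings for ch in s)
    total = sum(counts.values())
    return {g: counts[g] / total for g in GROUPS}

def characters_are_valid(model, s):
    """Chi-squared goodness-of-fit of `s` against the trained distribution."""
    observed = Counter(group(ch) for ch in s)
    statistic = 0.0
    for g in GROUPS:
        expected = model[g] * len(s)
        if expected == 0:
            if observed[g] > 0:
                return False  # characters from a never-seen group
            continue
        statistic += (observed[g] - expected) ** 2 / expected
    # statistic below the critical value corresponds to a p-value above 10%
    return statistic < CHI2_CRITICAL_10PCT
```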
3. Grammatical Model
The Grammatical model is by far the most complex model. The idea is to dynamically build a finite automaton that represents all valid resource strings. The difficult part is creating a finite automaton that doesn’t just fit the training dataset but generalizes to the actual pattern. However, the finite automaton must not end up being so general that it fits everything.
Grammatical model: this describes resource strings such as `resource:server:/public/server` and rejects resource strings such as `resource:server:/public/server:/etc/passwd`.
This balance between fitting the training dataset and generalizing is accomplished using Bayesian statistics.
[katex display=true]\begin{aligned} P(Model|Dataset) &= \frac{P(Dataset|Model)P(Model)}{P(Dataset)} \\ &\propto P(Dataset|Model)P(Model) \end{aligned}[/katex]
When building the model the finite automaton is augmented with emission and transition probabilities, thereby making it a Markov Chain. The process of dynamically building this Markov Chain is called Bayesian Model Merging (see Inducing Probabilistic Grammars by Bayesian Model Merging ).
Markov Chains inherently make it theoretically trivial to compute the probability of a sequence, and thereby [katex]P(Dataset|Model)[/katex]. For the prior [katex]P(Model)[/katex], however, there is no universally correct choice. In the `udaru-anomaly-detection` tool the following prior is used:
[katex display=true]P(Model) = \prod_{s \in States} N^{-3 \cdot (e_s - 1) + 1} N^{-t_s}[/katex]
Where [katex]e_s[/katex] and [katex]t_s[/katex] are the numbers of emissions and transitions for the state [katex]s[/katex].
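Assuming the prior factorizes over states, its logarithm is a cheap sum over per-state emission and transition counts. A minimal sketch, where the function name, the `(emissions, transitions)` tuple representation, and passing the constant [katex]N[/katex] explicitly are all assumptions about the tool's internals:

```python
import math

def log_prior(states, N):
    """Unnormalized log P(Model) for the per-state prior from the text.

    `states` is a list of (e_s, t_s) tuples: the number of emissions and
    transitions for each state. Each state contributes
    -(3 * (e_s - 1) + 1 + t_s) * log(N) to the log-prior, so more states,
    emissions, or transitions are penalized."""
    log_n = math.log(N)
    return sum(-(3 * (e - 1) + 1 + t) * log_n for e, t in states)
```

Working in log-space like this is also what makes the numerical stability and cheap-prior observations below practical.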
Now that the unnormalized [katex]P(Model|Dataset)[/katex] can be calculated, the model merging strategy can be described.
For each sequence in the dataset, do the following greedily, sequence by sequence:
- Extend the Markov Model by creating a linked list between the start and the end. Mark each of these states as unmergeable. Meaning, another state cannot be merged into an unmergeable state. However, an unmergeable state can be merged into a mergeable state.
- Attempt to merge each unmergeable state into all other mergeable states, do this greedily from start to end.
- If any of these merges caused [katex]P(Model|Dataset)[/katex] to increase, adopt the merged model as the new best model. If no merge increased [katex]P(Model|Dataset)[/katex], mark the unmergeable state as mergeable.
To make Bayesian Model Merging work in practice, you have to go well beyond what the original paper describes. Some key observations:
- Use beam search to approximate [katex]P(Dataset|Model)[/katex]. The final grammatical model often has polynomial complexity. However, when exploring possible models, it is almost unavoidable to encounter Markov Models containing cycles that cause exponential computational complexity when calculating [katex]P(Dataset|Model)[/katex].
- If a string already fits the model, simply increase the probabilities along the most likely path.
- Compute all probabilities in log-space. This is much more numerically stable and it makes the Bayesian prior very cheap to compute.
- Consider checking for model pruning opportunities. If two states are self-linking and link to each other, attempt to merge those two states.
- Before merging the new sequence into the model, merge equivalent subsequent emissions together. This dramatically shortens the new sequence and therefore the number of merges to check.
Together these optimizations bring the training time down from a couple of days to less than 5 minutes on 100 resource strings.
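One of the optimizations above, merging equivalent subsequent emissions before inserting a new sequence, amounts to collapsing runs of identical symbols. A minimal sketch (the function name and the `(symbol, run_length)` representation are assumptions):

```python
from itertools import groupby

def collapse_runs(sequence):
    """Collapse consecutive identical emissions into single states.

    Returns (symbol, run_length) pairs, so 'aaab' becomes one 'a' state
    emitted three times followed by one 'b' state, shortening the
    sequence and thereby the number of merge candidates to check."""
    return [(symbol, len(list(run))) for symbol, run in groupby(sequence)]
```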
Running The Tool
The `udaru-anomaly-detection` tool is an experiment, which is why there is no UI and the CLI is limited in terms of configuration.
To install the `udaru-anomaly-detection` tool, run:
You can then set up Udaru with Trail integration using the following code:
Alternatively, if you just want to try it out, you can insert some data into the Trail server. In this case, you still need the Trail server but not the Udaru server. To start a standalone Trail server, use `npx run trail-hapi-server`. With the Trail server running, you can insert some fake data with:
It is then possible to train and test the model with:
Demo
Training: shows training on all observations between 2017-01-01 and 2017-12-31; there are 100 resource strings in this interval. The trained models are stored in ./cli_model. Training the length model and the character distribution model is very fast. The grammatical model is much more expensive, but still takes only about 3 minutes. It appears to hang a few times, especially at resource string number 47, because that string contains a "-" character, which is not part of the existing grammatical model. The training algorithm therefore has to do a lot of work to extend the grammatical model appropriately.