
Saturday, 17 March 2018

Network Forensics with Windows DNS Analytical Logging

Overview


DNS queries and responses are a key data source used by network defenders in support of incident response and intrusion discovery. If these transactions are collected for processing and analytics in a big data system, they can enable a number of valuable security analytic scenarios. An exercise to this end was conducted with Microsoft internal DNS systems. This document outlines the approach and results so that Windows DNS customers can reproduce the outcome.

Motivation


The at-scale processing and analysis of DNS data in a big data system is a powerful capability that can be used to support analyst investigations and the discovery of intrusions. Below is a selection of the scenarios it enables –

IOC Detection

Domain names and IP addresses are among the most common indicators of compromise (IOCs), often referring to command and control servers in attacker infrastructure. Collecting, processing and storing DNS data allows the queried domains, and the resource record response data returned to hosts on the network, to be searched for these IOCs, providing quick and accurate detection of whether the network has been impacted by an intrusion. The ongoing collection of this data also allows for powerful retrospective IOC searches across the network.
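To make the scenario concrete, here is a minimal Python sketch of the kind of IOC match that could be run over collected query and response records; the event field names (qname, answers) and the IOC lists are hypothetical placeholders, and a production deployment would run equivalent logic at scale in the big data system rather than in a script –

# Minimal sketch of retrospective IOC matching over parsed DNS events.
ioc_domains = {"baddomain.example", "c2.evil.example"}   # hypothetical domain IOCs
ioc_ips = {"203.0.113.10", "198.51.100.7"}               # hypothetical IP IOCs

def matches_ioc(event):
    """Return True if a DNS query/response record hits a domain or IP IOC."""
    qname = event["qname"].rstrip(".").lower()
    # Hit if the queried name is an IOC domain or any subdomain of one.
    if any(qname == d or qname.endswith("." + d) for d in ioc_domains):
        return True
    # Hit if any answer (resource record data) is an IOC IP address.
    return any(ip in ioc_ips for ip in event.get("answers", []))

events = [
    {"qname": "update.c2.evil.example.", "answers": ["198.51.100.7"]},
    {"qname": "www.microsoft.com.", "answers": ["23.50.60.70"]},
]
print([e["qname"] for e in events if matches_ioc(e)])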

Protocol Agnostic Detection

Network defenders often have access to other data sources that can be searched for IP and domain IOCs, such as web proxy and firewall logs. DNS collection provides higher-fidelity detection in cases where the protocol implemented by attacker command and control infrastructure does not involve HTTP, or where DNS itself is being used as a covert channel.

Covert Channel Detection

DNS can be used by adversaries as a covert channel to provide remote configuration or data transfer capability to malware inside a computer network. At-scale analysis of abnormal response packets can be used to identify such covert channels.

Adversary Tracking

Historic logging of query and response data, and its associated analysis, enables the tracking of command and control infrastructure used by adversaries over time, where multiple domains and IP addresses are used and infrastructure is transitioned following the discovery of activity.

Analytical Logging in Windows DNS


Windows Server DNS (2012 R2 onwards) implements enhanced logging of various DNS server actions, including the logging of query and response data, with a focus on negligible performance impact.

Negligible Performance Impact When Enabled

A DNS server running on modern hardware that receives 100,000 queries per second (QPS) can experience a performance degradation of 5% when analytic logs are enabled. There is no apparent performance impact at query rates of 50,000 QPS and lower.

Details of Logged Data


The Analytic log type implemented through this feature contains much of the day-to-day operational detail of the DNS server. Although many types of data are recorded, including zone transfer requests, responses and dynamic updates, for forensics and threat analytics we focus on the QUERY_RECEIVED and RESPONSE_SUCCESS event types in this example. These form the core of our current internal collection and, because of their volume, pose the biggest collection challenge.

QUERY_RECEIVED and RESPONSE_SUCCESS events contain a number of the fields that make up a query and response, but crucially they also contain the full packet data received, enabling the processing of any aspect of these objects. Here is an example response event from the Applications and Services Logs\Microsoft\Windows\DNS-Server analytic log –

[Screenshot: example RESPONSE_SUCCESS event in the DNS-Server analytic log]
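Because each event carries the raw packet bytes, any field of the query or response can be recovered by parsing the DNS message directly. The following is a small Python sketch of that idea, assuming the raw bytes have already been extracted from the event's packet data field (a hypothetical step here); name-compression pointers are not handled, which is sufficient for reading the question section –

import struct

def parse_dns_question(packet: bytes):
    """Parse the DNS header and first question name from raw message bytes."""
    # Header is 12 bytes: transaction ID, flags, and four section counts.
    txid, flags, qdcount, ancount, nscount, arcount = struct.unpack("!6H", packet[:12])
    labels, offset = [], 12
    while True:
        length = packet[offset]
        offset += 1
        if length == 0:                       # zero-length label ends the name
            break
        labels.append(packet[offset:offset + length].decode("ascii"))
        offset += length
    qtype, qclass = struct.unpack("!2H", packet[offset:offset + 4])
    return {
        "id": txid,
        "is_response": bool(flags & 0x8000),  # QR bit
        "qname": ".".join(labels),
        "qtype": qtype,
        "ancount": ancount,
    }

# Example: a query for "example.com" (type A, class IN).
sample = (b"\x12\x34\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00"
          b"\x07example\x03com\x00\x00\x01\x00\x01")
print(parse_dns_question(sample))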

Implementation


The logging of this data is implemented as an ETW (Event Tracing for Windows) provider. Many event types in Windows are exposed via this mechanism, which allows for high-performance logging of data and subscription to providers via their unique GUIDs. The Microsoft-Windows-DNSServer provider uses the GUID {EB79061A-A566-4698-9119-3ED2807060E7} as its identity. Windows ships with tools to record samples of this data, such as tracelog, which records the events offered by the ETW channel and writes them to a file. The Windows Event Viewer essentially replicates this subscription model when presenting the administrator with a view of these events, writing the collected sample to a temporary file location. As an example, here is the location of the data underlying the enhanced DNS logging feature –

%SystemRoot%\System32\Winevt\Logs\Microsoft-Windows-DNSServer%4Analytical.etl


The event viewer acts as a browser for this file.

Collection of Data


One method of collecting events from Windows servers is Windows Event Collection (WEC). WEC is a mechanism built into Windows that forwards an XML representation of an event to a configured collection server, based upon a filter specifying an event identifier and selection criteria. WEC, however, can only be configured for log types of ‘Operational’. Operational events are stored in a rolled permanent location inside an .evtx file on the host. When these events are created, they are also written via the Windows Event Collector service, which performs forwarding off the host. For more information on Windows Event Collection, see the following article on MSDN –

https://msdn.microsoft.com/en-us/library/windows/desktop/bb427443(v=vs.85).aspx

There is an inherent overhead in logging an event in this way, which is the reason DNS query and response logging was implemented as an ‘Analytic’ type. Analytic log types do not write events via the WEC service and as such have a lower performance impact. This is what gives us the negligible performance impact mentioned earlier at high query rates on the order of 100,000 QPS.

For servers that do not have such high QPS needs, using WEC from an operational channel becomes a more viable option. Internal DNS servers that serve a dedicated enterprise network may have significantly lower QPS requirements (around 10,000 QPS). At these levels, collection over WEC becomes a more realistic scenario. Further, from a security analytics standpoint, the majority of queries and responses for reputable domains such as *.microsoft.com are less valuable to us and can be dropped, further reducing the effective QPS logged for WEC.

For the internal Microsoft project, a high-performance event collector was implemented that consumes the QUERY_RECEIVED and RESPONSE_SUCCESS events from the ETW channel on DNS servers. This consumer filters out high-reputation domains that are less valuable for security analytics and writes the remaining events to an operational log equivalent, ready for collection over normal WEC channels. The following diagram gives a high-level overview of the functionality of this tool –

[Diagram: high-level overview of the ETW consumer/filtering tool]
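The tool itself is a native ETW consumer; purely to illustrate the filtering step, the Python sketch below drops events whose second-level domain (SLD) is on a high-reputation allowlist before handing them to the operational-log writer. The event shape, the example allowlist and the forward_event sink are hypothetical –

# Illustrative filtering step: suppress queries to high-reputation SLDs so that
# only the remaining, more interesting events are written out for WEC collection.
HIGH_REPUTE_SLDS = {"microsoft.com", "windows.com", "office.com"}   # example allowlist

def sld(qname: str) -> str:
    """Naive SLD extraction (last two labels); registrations under country-code
    suffixes such as .co.uk would need a public-suffix-aware implementation."""
    labels = qname.rstrip(".").lower().split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else qname.lower()

def should_forward(event) -> bool:
    return sld(event["qname"]) not in HIGH_REPUTE_SLDS

def forward_event(event):
    print("forwarding", event["qname"])      # stand-in for the operational-log write

for event in [{"qname": "sharepoint.microsoft.com."}, {"qname": "a1b2c3.badsite.example."}]:
    if should_forward(event):
        forward_event(event)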

Query Volume Modelling


Selecting domain names for filtering is a balancing act that requires modelling using sample data. Front-loading too many domain filters in the tool can cause unnecessary processing, whilst letting high-volume domains through to the event writer can result in unnecessary volume and associated storage costs.

Customers can collect a sample of query and response data from the analytic logging in an .etl file from a DNS server on their network. This file can be analyzed for the top queried domains by volume, in terms of Second Level Domain (SLD). Taking the top SLDs, and excluding country-code registrations such as .co.uk, customers can model the query volume reduction achieved by applying various SLD filters and extrapolate this to full enterprise network coverage. These figures can then be used to calculate the approximate storage costs of implementing such a solution. Alternatively, data such as the Alexa top 100 SLDs can be taken as a starting point and refined as required to fit the needs of the enterprise.
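As a rough illustration of this modelling step (not the tooling used internally), the Python sketch below counts queries per SLD in a sample and estimates how much volume a top-N SLD filter would remove; the sample names and the naive SLD extraction are placeholders –

from collections import Counter

def sld(qname: str) -> str:
    labels = qname.rstrip(".").lower().split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else qname.lower()

def model_reduction(sample_qnames, top_n=100):
    """Estimate the volume removed by filtering the top_n SLDs in a sample."""
    counts = Counter(sld(q) for q in sample_qnames)
    total = sum(counts.values())
    filtered = sum(c for _, c in counts.most_common(top_n))
    remaining = total - filtered
    return {"total": total, "filtered_out": filtered,
            "remaining": remaining, "remaining_pct": 100.0 * remaining / total}

# sample_qnames stands in for names extracted from a collected .etl sample.
sample_qnames = ["www.microsoft.com", "login.microsoftonline.com",
                 "a1b2c3.badsite.example", "www.microsoft.com"]
print(model_reduction(sample_qnames, top_n=2))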

In our prototype, a value of 100 SLDs was chosen for implementation based upon sample data logged on the network. This resulted in a reduction in volume from 3,286 QPS in our sample data set to 136 QPS, 4.13% of the total. The extrapolated effective QPS rate for the whole network drops significantly in this scenario and is easily manageable by big data and WEC infrastructure.

The Pilot Deployment and Results


We worked with Microsoft IT to pilot the analytic logging and the ETW consumer/filtering tool on the Microsoft corporate DNS infrastructure. The pilot was rolled out to 29 DNS caching servers. A snapshot of the query volume the project was dealing with is shown below:

[Figure: snapshot of query volume across the pilot deployment]

The total size of raw data storage post-filtering for all 29 Microsoft corporate DNS servers peaks at approximately 100 GB/day at the busiest times, and drops to around 15 GB/day during quieter periods such as weekends.

This enables a whole new way of using information related to compromised domains and of identifying malicious transactions and infected machines, enabling us to monitor and fortify our network.

Thursday, 15 March 2018

Heuristic DNS detections in Azure Security Center

Today, we are discussing some of our more complex, heuristic techniques to detect malicious use of DNS, and how these detections catch key components of common real-world attacks.

These analytics focus on behavior that is common to a variety of attacks, ranging from advanced targeted intrusions to the more mundane worms, botnets and ransomware. Such techniques are designed to complement more concrete signature-based detection, giving the opportunity to identify this behavior prior to the deployment of analyst-driven rules. This is especially important in the case of targeted attacks, where time to detection is typically measured in months. The longer an attacker has access to a network, the more expensive the eventual clean-up and removal process becomes. Similarly, while rule-based detection of ransomware is normally available within a few days of an outbreak, this is often too late to avoid significant brand and financial damage for many organizations.

These analytics, along with many more, are enabled through Azure Security Center upon enabling the collection of DNS logs on Azure based servers. While this logging requires Windows DNS servers, the detections themselves are largely platform agnostic, so they can run across any client operating system configured to use an enabled server.

A typical attack scenario


A bad guy seeking to gain access to a cloud server starts a script that attempts to log in by brute-force guessing of the local administrator password. With no limit on the number of incorrect login attempts, after several days of effort the attacker eventually guesses the seemingly strong password St@1w@rt.

Upon successful login, the intruder immediately proceeds to download and install a malicious remote administration tool. This enables a raft of useful functions, such as the automated stealing of user passwords, detection of credit card or banking details, and assistance in subsequent brute force or Denial-of-Service attacks. Once running, this tool begins periodically beaconing over HTTP to a pre-configured command and control server, awaiting further instruction.

This type of attack, while seemingly trivial to detect, is not always easy to prevent. For instance, limiting incorrect login attempts appears to be a sensible precaution, but doing so introduces a severe risk of denial of service through lockouts. Likewise, although it is simple to detect large numbers of failed logins, it is not always easy to differentiate legitimate user activity from the almost continual background noise of often distributed brute force attempts.

Detection opportunities


For many of our analytics, we are not specifically looking for the initial infection vector. While our example above could potentially have been detected from its brute-force activity, in practice it could just as easily have been a single malicious login using a known password, as might follow the exploitation of a legitimate administrator’s desktop or a successful social engineering effort. The following techniques therefore look to detect the subsequent behavior: the downloading and running of the malicious service.

Network artifacts


Attacks, such as the one outlined above, have many possible avenues of detection over the network, but a consistent feature of almost all attacks is their usage of DNS. Regardless of transport protocol used, the odds are that a given server will be contacted by its domain name. This necessitates usage of DNS to resolve this hostname to an IP address. Therefore, by analyzing only DNS interactions, you get a useful view of outbound communication channels from a given network. An additional benefit to running analytics over DNS, rather than the underlying protocols, is local caching of common domains. This reduces their prevalence on the network, reducing both storage and computational expense of any analytic framework.

[Figure: WannaCry ransomware detected by the Random Domain analytic.]

[Figure: malware report listing hard-coded domains enumerated by WannaCry ransomware.]

Random domains


Malicious software has a tendency towards randomly generated domains. This may be for many reasons: simple language issues (avoiding the need to tailor domains to each victim’s native language), assisting in automating the registration of large numbers of names, and helping to reduce the chances of accidental reuse or collision. This is highlighted by techniques such as Domain Generation Algorithms (DGAs), but randomly generated names are also frequently used for static download sites and command and control servers, as in the WannaCry example above.

Detecting these “random” names is not always straightforward. Standard tests tend to work only on relatively large amounts of data: entropy estimation, for instance, requires a sample at least several times the size of the character set, or at least hundreds of bytes, while domain names are short (each label is at most 63 characters). To address this issue, we use basic language modelling, calculating the probabilities of various n-grams occurring in legitimate domain names, and use these to detect highly unlikely combinations of characters in a given name.
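As an illustration of the idea (not the production model), the sketch below trains character bigram probabilities on a tiny set of names assumed legitimate and scores new labels by their average bigram log-probability; the training data, the smoothing and any alerting threshold are illustrative only, and lower scores indicate more random-looking names –

import math
from collections import Counter

def train_bigrams(names):
    """Count character bigrams (with start/end markers) over legitimate names."""
    counts, context = Counter(), Counter()
    for name in names:
        s = "^" + name.lower() + "$"
        for a, b in zip(s, s[1:]):
            counts[(a, b)] += 1
            context[a] += 1
    return counts, context

def score(label, counts, context, vocab_size=40):
    """Average bigram log-probability of a label, with add-one smoothing."""
    s = "^" + label.lower() + "$"
    logp = sum(math.log((counts[(a, b)] + 1) / (context[a] + vocab_size))
               for a, b in zip(s, s[1:]))
    return logp / (len(s) - 1)

legit = ["google", "microsoft", "facebook", "wikipedia", "amazon", "outlook"]
counts, context = train_bigrams(legit)
for label in ["micros0ft", "iuqerfsodp9ifjaposdfjhgosurijfaewrwergwea"]:
    print(label, round(score(label, counts, context), 2))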

[Figure: malware report detailing the use of randomly generated domain names by the ShadowPad trojan.]

Periodicity


As mentioned, this attack involved periodic beaconing to a command and control server. For the sake of argument, let’s assume this is an hourly HTTP request. When attempting to make this request, the HTTP client will first attempt to resolve the server’s domain name through the local DNS resolver. The resolver will tend to keep a local cache of such resolutions, meaning that you cannot guarantee you will see a DNS request on every beacon; the requests you do see, however, will fall on some multiple of an hour.

In attempting to find such periodic activity, we use a version of Euclid’s algorithm to keep track of an approximate greatest common divisor (GCD) of the time between lookups of each specific domain. Once a domain’s GCD falls within the permitted error (i.e. to one, in the exact case), it is added to a Bloom filter of domains to be ignored in further calculations. Assuming a GCD greater than this error, we take the current GCD, our estimate of the beacon period, together with the number of observations, and calculate the probability of observing that many consecutive lookups on multiples of this period by chance. For example, the chance of randomly seeing three consecutive lookups to some domain, all on multiples of two seconds, is 1/2^3, or 1 in 8. On the other hand, as with our example, the probability of seeing three random lookups, precise to the nearest second, on multiples of one hour is 1/3600^3, or 1 in 46,656,000,000. Thus, the longer the period, the fewer observations we need before we are confident the activity is periodic.
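A minimal Python sketch of this heuristic is shown below, assuming per-domain lookup timestamps in seconds; the tolerance, the example timestamps and any alerting threshold are illustrative rather than production values –

def approx_gcd(a, b, tol=2):
    """Euclid's algorithm, stopping once the remainder falls within the tolerance."""
    while b > tol:
        a, b = b, a % b
    return a

def beacon_score(timestamps, tol=2):
    """Estimate a domain's beacon period and how unlikely the pattern is by chance."""
    gaps = [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]
    period = gaps[0]
    for gap in gaps[1:]:
        period = approx_gcd(max(period, gap), min(period, gap), tol)
    if period <= tol:
        return period, None          # effectively aperiodic; ignore this domain
    # Chance that this many gaps all land on multiples of 'period' at
    # one-second resolution: (1 / period) ** number_of_gaps.
    return period, (1.0 / period) ** len(gaps)

# Hourly beacon observed four times (lookups jittered by a second or two):
lookups = [0, 3600, 7201, 10799]
period, p_random = beacon_score(lookups)
print(period, p_random)              # period near 3600 s, probability near 1/3600**3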

Conclusion


As demonstrated in the above scenario, analyzing network artifacts can be extremely useful in detecting malicious activity on endpoints. While the ideal situation is the analysis of all protocols from every machine on a network, in practice this is too expensive to collect and process. Choosing a single protocol that gives the highest chance of detecting malicious communications while minimizing the volume of data collected comes down to a choice between HTTP and DNS. By choosing DNS, you lose the ability to detect direct IP connections; in practice these are rare, due to the relative scarcity of static IP addresses and the potential to block such connections at firewalls. The benefit of examining DNS is the ability to observe connections across all possible network protocols from all client operating systems in a relatively small dataset, whose compactness is further aided by the default on-host caching of common domains.