July 06, 2020 • Jason Trost

All your SPF are belong to us: Exploring trust relationships through global scale SPF Mining

In this post we explore a large collection of Sender Policy Framework (SPF) records to see what they might tell us about global email sending trust relationships and how they relate to email security providers. This is a fast follow-up to my previous post on Mining DNS MX Records for Fun and Profit.

Here is the methodology I devised for this (very similar to the previous post, but with new custom built tools):

Collect a large sample of SPF records via DNS TXT lookups of popular domain names (and recursively resolving SPF “include” domains).
Enrich SPF records with IP intelligence and useful metadata (including email security provider mappings)
Analyze the enriched results.

Intro to Sender Policy Framework (SPF)

The Sender Policy Framework (SPF) enables domain name administrators to authorize hosts to use their domain names when sending email (i.e. in the “MAIL FROM” or “HELO” identities in SMTP). One of the goals of SPF is to limit spammers’ ability to spoof email messages. SPF on its own is limited, so it is usually used together with DKIM and DMARC. SPF records are published using DNS TXT records. SPF-compliant mail receivers use the published SPF records to test the authorization of sending Mail Transfer Agents (MTAs). SPF can be used to build complex policies around who can send email on whose behalf. Below is an example SPF record for Florida State University.

According to this SPF record 146.201.58.212, 146.201.58.213, 146.201.107.145, 146.201.107.249, 192.12.121.23, and 199.188.157.80 are allowed to send email purporting to be from fsu.edu. Also, the SPF records from spf.protection.outlook.com, _spf.qualtrics.com, spf.blackboardconnect.com, servers.mcsv.net, and _spf.mlsend.com should be retrieved and their policies applied as well. Below are the SPF records for each of these domains. As you can see they include more and more IPs/CIDRs as well as additional SPF includes.

As you can see, SPF forms a chain of trust between the domain owner and all the SPF policies included recursively (potentially crossing several different administrative boundaries). In this post I was hoping to explore this chain of trust at a large scale by collecting a large sample of SPF records and mining them.
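To make the record structure concrete, here is a minimal Python sketch that splits an SPF record into the ip4/ip6 networks and include domains discussed above. The record string is reconstructed for illustration from a few of the IPs and includes listed earlier; it is not a verbatim copy of the published fsu.edu record, and the trailing ~all is an assumption. parse_spf is a hypothetical helper name, and the sketch ignores mechanisms such as "a", "mx", and "redirect".

```python
def parse_spf(record):
    """Split an SPF record into its ip4/ip6 networks and include domains."""
    networks, includes = [], []
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return networks, includes  # not an SPF record
    for term in terms[1:]:
        # Drop an optional qualifier (+, -, ~, ?) in front of the mechanism.
        term = term.lstrip("+-~?")
        name, _, value = term.partition(":")
        name = name.lower()
        if name in ("ip4", "ip6") and value:
            networks.append(value)
        elif name == "include" and value:
            includes.append(value.lower())
    return networks, includes


# Illustrative record only (assumed shape, not the real published record).
EXAMPLE_RECORD = (
    "v=spf1 ip4:146.201.58.212 ip4:146.201.58.213 "
    "include:spf.protection.outlook.com include:_spf.qualtrics.com ~all"
)
networks, includes = parse_spf(EXAMPLE_RECORD)
print(networks)
print(includes)
```

Each include domain returned here would then need its own TXT lookup, which is what makes the trust chain recursive.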

Below are some useful resources for understanding SPF:

RFC7208: Sender Policy Framework (SPF) for Authorizing Use of Domains in Email
SPF Syntax Table - really useful guide for understanding SPF “mechanisms”.

Step One: Collection

For step one, I built a very ~~crude~~ useful SPF crawler that uses dig (optionally adnshost) to perform DNS TXT requests, parse out SPF records found, and then recursively follow the trail of SPF include records and perform TXT lookups against the included domains.

In order to seed the SPF crawler, I used the same domains I used in my previous blog post on mining MX records. I downloaded the Alexa top 1M domains, Quantcast top 1m domains (from WaybackMachine), Domcop Top 10m domains, Majestic Million Domains and Cisco Umbrella top 1m domains. I identified the registered domain using tldextract for each of these and then combined them into a single de-duplicated list. This resulted in ~8.3M unique domain names.

These domains were fed into my SPF crawler, and the results were collected, parsed, and assembled. I ended up backing the SPF crawler with “dig” instead of “adnshost” this time because I found dig more reliable: it completed 23% more DNS requests in an experiment against the Fortune 1000 domains. Dig is single threaded, but I easily parallelized it using split files and xargs, and its performance ended up being good enough. See parallel_dig.sh for more details.
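As a rough illustration of this collection step (not the crawler's actual code, which drives dig through split files and xargs), the sketch below shells out to "dig +short TXT", joins the quoted chunks that dig prints for long TXT records, and keeps only the SPF strings. The thread pool stands in for the xargs fan-out, and all function names are hypothetical.

```python
import re
import subprocess
from concurrent.futures import ThreadPoolExecutor


def extract_spf(dig_output):
    """Return SPF strings found in `dig +short TXT` output."""
    records = []
    for line in dig_output.splitlines():
        # dig prints each TXT record as one or more quoted chunks on a line;
        # a long record is split into chunks that must be concatenated.
        chunks = re.findall(r'"((?:[^"\\]|\\.)*)"', line)
        text = "".join(chunks)
        lowered = text.lower()
        if lowered == "v=spf1" or lowered.startswith("v=spf1 "):
            records.append(text)
    return records


def lookup_spf(domain, timeout=10):
    """Run dig for one domain; return (domain, list of SPF records)."""
    try:
        result = subprocess.run(
            ["dig", "+short", "TXT", domain],
            capture_output=True, text=True, timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return domain, []  # dig is missing or the lookup timed out
    return domain, extract_spf(result.stdout)


def crawl(domains, workers=32):
    """Look up many domains concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lookup_spf, domains))
```

Calling crawl(["example.com"]) would return a dict keyed by domain. Nothing here retries failed lookups, which matters at this scale.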

Below are a few simple commands as well as example output data collected with my SPF crawler applied to just one domain. As you can see, the assembled output for fsu.edu includes all the IPs and Netblocks from all the SPF includes that it links to, recursively.
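The assembly step can be sketched as a recursive walk over records that have already been collected. This is a hypothetical illustration rather than the crawler's code: the records dict stands in for the parsed TXT results, the domains and documentation-range networks are made up, and a seen set guards against include loops.

```python
def assemble(domain, records, seen=None):
    """Collect every network reachable from `domain` through SPF includes."""
    seen = set() if seen is None else seen
    if domain in seen:
        return set()  # already visited: avoids include loops
    seen.add(domain)
    networks = set()
    # Skip the leading "v=spf1" term, then inspect each mechanism.
    for term in records.get(domain, "").split()[1:]:
        name, _, value = term.lstrip("+-~?").partition(":")
        if name in ("ip4", "ip6") and value:
            networks.add(value)
        elif name == "include" and value:
            networks |= assemble(value.lower(), records, seen)
    return networks


# Made-up records using documentation address ranges; note the include loop.
records = {
    "example.com": "v=spf1 ip4:192.0.2.10 include:_spf.example.net ~all",
    "_spf.example.net": "v=spf1 ip4:198.51.100.0/24 include:_spf.example.org -all",
    "_spf.example.org": "v=spf1 ip4:203.0.113.0/24 include:example.com -all",
}
print(sorted(assemble("example.com", records)))
```

The result for example.com contains the networks from all three records, even though only one of them is published on example.com itself.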

Below is the same information, visualized as a network (and enriched with ASN info from Maxmind).

Step Two: Enrichment

For this step, I reused a lot of the code from my previous blog post on Mining MX records and performed the following enrichments:

Maxmind ASN
Maxmind Country
Cloud Provider IP Lookups for AWS, Azure, and GCP
Alexa Ranking
Email Security Provider mapping

netaddr, tldextract, and cidr-trie were useful during this stage.
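A cloud provider lookup of this kind can be approximated with Python's standard ipaddress module. The sketch below is a stand-in for the netaddr and cidr-trie code used in the project, under stated assumptions: the provider ranges are placeholder documentation networks, not real AWS, Azure, or GCP ranges, and a linear scan replaces the trie.

```python
import ipaddress

# Placeholder ranges only; real ranges would be loaded from the lists that
# each cloud provider publishes.
CLOUD_RANGES = {
    "cloud-a": [ipaddress.ip_network("198.51.100.0/24")],
    "cloud-b": [ipaddress.ip_network("203.0.113.0/24")],
}


def cloud_providers(network):
    """Return the providers whose ranges overlap the given SPF network."""
    # strict=False tolerates entries with host bits set, e.g. 192.0.2.1/24.
    net = ipaddress.ip_network(network, strict=False)
    hits = []
    for provider, ranges in CLOUD_RANGES.items():
        if any(net.version == r.version and net.overlaps(r) for r in ranges):
            hits.append(provider)
    return hits


print(cloud_providers("198.51.100.128/25"))
print(cloud_providers("192.0.2.1"))
print(cloud_providers("0.0.0.0/0"))
```

Checking for overlap rather than containment means a very broad SPF network is flagged for every provider it covers.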

Step Three: Analysis

Through this analysis, I hoped to answer the following questions:

What is the largest trusted network size (both single CIDR and aggregate network space)? … HUGE
Could I find any blatantly misconfigured SPF records? … YES
What does SPF data show about email security providers? … A lot that MX doesn’t
What are the most “included” SPF includes? … Not many surprises here
Does SPF augment the MX record mining (give more coverage? reveal things previously hidden? or 100% redundant?) … YES!
Are domains trusting IP space from cloud providers that may be re-usable (i.e. AWS EC2)? … YES!

Below are some outputs and commentary from this project’s Jupyter notebook that answer the questions above.

Network Graphs

These networkx visualizations of the Fortune 100 and Alexa 100 are a bit of a mess, but they should get the point across of how interconnected the SPF trust relationships are.

Fortune 100 SPF Trusted Networks Graph

Alexa 100 SPF Trusted Networks Graph

Heatmaps

As you can see from the next several heatmaps, as we go beyond the Alexa top 1,000 domains the number of networks trusted drastically increases, and as we hit the Alexa 1m, the entire Internet is trusted (likely due to SPF misconfigurations).

These heatmaps were generated with the awesome ipv4-heatmap tool provided by the Measurement Factory. The code to automate this can be found in my Jupyter Notebook here.

Fortune 1,000 SPF Trusted Networks Heatmap

Alexa 1,000 SPF Trusted Networks Heatmap

Alexa 10,000 SPF Trusted Networks Heatmap

Alexa 100,000 SPF Trusted Networks Heatmap

Alexa 1,000,000 SPF Trusted Networks Heatmap

Alexa Top 1M Domains Trusting /7 or larger networks

As you can see from this list, there are quite a few domains that trust very large networks. Several of these seem like likely misconfigurations. For example, these four domains trust the entire Internet:

hitadouble[.]com: 208.67.207.0/0
payukraine[.]com: 0.0.0.0/0
angliss[.]edu[.]au: 0.0.0.0/0
hutkigrosh[.]by: 0.0.0.0/0

This domain trusts half of the Internet - salaam[.]af: 175.106.32.0/1

And these five domains trust 1/4 of the Internet. cfe[.]fr appears to have since fixed this apparent misconfiguration, as their TXT record has changed.

creativecircle[.]com: 64.4.22.64/2
gevestor[.]de: 91.241.72.0/2
debeersgroup[.]com: 10.47.149.168/2
cfe[.]fr: 82.97.62.0/2
adecco[.]com: 148.105.8.0/2
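The fractions quoted above follow directly from the prefix lengths. Here is a quick check with Python's standard ipaddress module, using entries from the lists above. strict=False is needed because these records have host bits set, and ipv4_fraction is a hypothetical helper name.

```python
import ipaddress


def ipv4_fraction(cidr):
    """Fraction of the IPv4 address space covered by a CIDR block."""
    net = ipaddress.ip_network(cidr, strict=False)
    return net.num_addresses / 2 ** 32


for cidr in ("208.67.207.0/0", "175.106.32.0/1", "64.4.22.64/2"):
    # Normalizing shows which block the record actually authorizes.
    net = ipaddress.ip_network(cidr, strict=False)
    print(cidr, "->", net, ipv4_fraction(cidr))
```

Normalization also shows what is really being trusted: 208.67.207.0/0 is simply 0.0.0.0/0, 175.106.32.0/1 is 128.0.0.0/1, and 64.4.22.64/2 is 64.0.0.0/2. In each case the prefix length was probably meant to be much longer.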

Top SPF Includes from all top domain lists (via SPF)

Using all the popular domain names, here is a summary of the top 10 SPF includes.

Major Cloud Email Providers:

Microsoft: spf.protection.outlook.com
Google: _spf.google.com

Hosting Providers:

HostGator: websitewelcome.com
OVH: mx.ovh.com
Bluehost: bluehost.com

Commercial Email Marketing companies:

MailChimp: servers.mcsv.net
Mandrill: spf.mandrillapp.com (MailChimp add-on)
Sendgrid: sendgrid.net

Email Security company:

MailChannels: mailchannels.net (more on this later)
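A ranking like this amounts to counting, for each include domain, how many seed domains reference it. A minimal sketch, assuming the parsed results are available as a mapping from each domain to its list of include domains (the sample data is invented, not taken from the dataset):

```python
from collections import Counter


def top_includes(includes_by_domain, n=10):
    """Count how many domains reference each SPF include domain."""
    counts = Counter()
    for includes in includes_by_domain.values():
        # set() so a domain listing the same include twice counts once.
        counts.update(set(includes))
    return counts.most_common(n)


sample = {
    "a.example": ["_spf.example.net", "mail.example.org"],
    "b.example": ["_spf.example.net"],
    "c.example": ["_spf.example.net", "_spf.example.net"],
}
print(top_includes(sample))
```

Whether to count only direct includes or also those reached recursively is a design choice; this sketch counts whatever lists it is given.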

Top SPF Includes from Fortune 1000 (via SPF)

Top SPF Includes from Alexa top1m

Email Security Providers

If you read my previous blog post on Mining DNS MX Records for Fun and Profit, you might notice that these top lists look significantly different from the top email providers identified from MX records. The top 5 providers identified in the SPF data are MailChannels, Mimecast, Proofpoint, Solarwinds, and Barracuda. In the MX post, the top 5 were Proofpoint, Mimecast, Deteque, Barracuda, and Solarwinds, and MailChannels was #48 on that list. These top lists use all the popular domains data, which is likely not an accurate reflection of the actual email security market. For the Fortune 1000 the story is less surprising: the top 4 email security providers were nearly identical across SPF and MX records, with only the order differing. I suspect that MailChannels shows up as popular in SPF because it is either the default setting on newly registered domains or the default setting for domains that are parked with certain hosting providers, but I haven’t spent the time to prove or disprove this.

(Update 7/7/2020) I received this message from Ken Simpson, CEO of MailChannels, that helps explain why there is a mismatch between the MX and SPF counts.

“You were wondering why MailChannels shows up in a lot of SPF records (actually, we’re number one), but relatively few MX records. MailChannels delivers email for the web hosting industry, with over 700 service provider customers worldwide. To deliver email reliably, they have to add us to their customers’ SPF records. Those same customers often host their inbound email with someone else - GSuite, Microsoft 365, or another provider. Hence the mismatch in SPF and MX records.”

One other interesting aspect with SPF is it (potentially) reveals relationships with multiple email security providers. See the “Fortune 100 Email Security Providers Listing (via SPF)” and “Domains with 4 or more Email Security Providers (via SPF)” gists below. In the Fortune 100 list, there are 3 domains with SPF relationships with more than one provider. If you look across all the top domains data you can see there are many. For anyone who has worked in the cyber security department at a large company before, this is not surprising, but it was cool to be able to see this in the data.

Domains with 2 SPF relationships with Email Security Providers: 11,393
Domains with 3 SPF relationships with Email Security Providers: 468
Domains with 4 SPF relationships with Email Security Providers: 35
Domains with 5 SPF relationships with Email Security Providers: 1
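Counts like these can be derived by mapping each domain's SPF includes to providers and tallying how many distinct providers each domain reaches. In the sketch below, only the mailchannels.net entry comes from this post; the second mapping entry, the sample domains, and the function names are made-up placeholders, and the project's real mapping is much larger.

```python
from collections import Counter

# Include-domain suffix -> provider. Only the MailChannels entry comes from
# the post; the other entry is a made-up placeholder.
PROVIDER_SUFFIXES = {
    "mailchannels.net": "MailChannels",
    "filter.example": "ExampleFilter",
}


def providers_for(includes):
    """Return the set of providers matched by a domain's SPF includes."""
    found = set()
    for include in includes:
        for suffix, provider in PROVIDER_SUFFIXES.items():
            # Match the suffix itself or any subdomain of it.
            if include == suffix or include.endswith("." + suffix):
                found.add(provider)
    return found


def provider_count_histogram(includes_by_domain):
    """Map 'number of providers' -> 'number of domains with that many'."""
    return Counter(len(providers_for(i)) for i in includes_by_domain.values())


sample = {
    "a.example": ["relay.mailchannels.net", "spf.filter.example"],
    "b.example": ["relay.mailchannels.net"],
    "c.example": ["_spf.example.net"],
}
print(provider_count_histogram(sample))
```

Matching on a dot-delimited suffix avoids false hits from lookalike names such as notmailchannels.net.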

Top Email Security Provider from all top domain lists (via SPF)

Top Email Security Provider from Alexa 1m (via SPF)

Top Email Security Provider from Fortune 1000 (via SPF)

Top Email Security Provider from Fortune 100 (via SPF)

Fortune 100 Email Security Providers Listing (via SPF)

Domains with 4 or more Email Security Providers (via SPF)

Trusting Cloud Provider Networks

As you can see from the next few tables, many domains transitively trust a lot of cloud provider IP space for SPF. For some of the larger trusted networks this seems to carry risk, since it may be possible for the cloud IP space to be reused; see Fishing the AWS IP Pool for Dangling Domains for a practical example of this. As I mentioned earlier, SPF is usually used with DKIM and DMARC, so this data doesn’t paint the whole picture. I am hoping to dive into DMARC/DKIM next.

Alexa 1000 Trusting AWS Networks

Alexa 1000 Trusting Azure Networks

Alexa 1000 Trusting GCP Networks

Fortune 1000 Trusting AWS Networks

Fortune 1000 Trusting Azure Networks

Fortune 1000 Trusting GCP Networks

Some other potentially interesting results, not worth dumping here:

Alexa top1m domains trusting AWS Networks
Alexa top1m domains trusting Azure Networks
Alexa top1m domains trusting GCP Networks
Top Maxmind ASNs of SPF Trusted Networks from Fortune 1000 (via SPF)
Top Maxmind ASNs of SPF Trusted Networks from all top domain lists (via SPF)
Top Maxmind ASNs of SPF Trusted Networks from Alexa top1m (via SPF)
Graph analytics applied to Fortune 1000 and Alexa 1000: degree centrality, edge betweenness centrality, pagerank, closeness centrality, triangle counts, and connected components stats, see the notebook and search for “print_graph_metrics”.

Future Work

SPF Crawler enhancements: As you can see from the SPF guide I shared above for “a” and “mx”, SPF supports some fairly complex policies for allowing certain IPs to send email (esp. the prefix operators on these SPF mechanisms). I did not provide support for these mechanisms in the first version of my SPF crawler, mainly due to the complexity involved. Because of this, my results will underrepresent the trust relationships where these are used. I hope to add support for these operators to expand what can be found in this data.
Try some more graph analytics on the entire dataset. In the Jupyter notebook I ran several graph algorithms on subsets of the entire graph (Fortune 100 and Alexa 100). These showed some mildly interesting results, but testing against larger graphs caused graphviz to fail due to some data format issues that I have not had a chance to research.
Perform another study measuring DMARC and DKIM usage across popular domains.

Resources

As usual all notebooks, code, and summary results can be found in Github: https://github.com/covert-labs/mx-intel.

And all data can be found at the links below:

all-registered-domains.txt.gz - base domains extracted from combining several popular domains lists together and then uniqued.
all-registered-domains-outputs-combined.txt.gz - raw dig output for all the TXT requests.
spf-results-all-registered-domains.json.gz - the parsed results from running the SPF Crawler against all-registered-domains.txt.gz.
spf-linked-all-registered-domains.json.gz - the assembled results from processing spf-results-all-registered-domains.json.gz. This is the collapsed/combined data that shows all the SPF domains and networks included recursively.

–Jason
@jason_trost

June 27, 2020 • Jason Trost

Mining DNS MX Records for Fun and Profit

If you have read my blog before, you may realize that I really love DNS data and DNS analytics. In this post, I share some experiences in using mostly DNS data for identifying the visible footprint of popular email security providers.

This may not be terribly novel, but it was an interesting exploration during a time of boredom for me. This work was initially motivated by two events:

When the Proofpoint email protection machine learning vulnerability (CVE-2019-20634) was announced by Will Pearce and Nick Landers I got to wondering about how large their deployment footprint was and how one could figure this out, and
A friend at another company mentioned that they were using a specific startup email security provider and I wondered whether I could determine what other companies were also using this same provider.

Here is the methodology I devised for this:

Collect a large sample of MX records
Enrich MX records with IP intelligence and useful metadata
Sift through the enriched records and identify recognizable email providers’ domains through OSINT (whois, PDNS, Google) and market research.
Profit?!?!?

For step one, I downloaded the Alexa top 1M domains, Quantcast top 1m domains (from WaybackMachine), Domcop Top 10m domains, Majestic Million Domains and Cisco Umbrella top 1m domains. I identified the registered domain using tldextract for each of these and then combined them into a single de-duplicated list. This resulted in ~8.3M unique domain names. I then performed bulk MX lookups using adnshost against my own bind9 recursive nameserver. In my experience, adnshost works pretty well for bulk DNS resolution at this scale, and it will perform both the lookup requested (MX) as well as a domain resolution (A-lookup). When performing bulk DNS lookups at this scale it is important to add retry logic for failed resolutions as this tends to happen enough to be a problem. I did this using a simple bash script that retried failed lookups up to three times.
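The retry logic itself was a bash script; the same idea can be sketched in Python. This is only an illustration under assumptions: lookup is any caller-supplied function that returns an answer, or None on failure (a hypothetical interface, not adnshost's), and failed domains are re-run for up to three additional passes.

```python
def resolve_with_retries(domains, lookup, max_retries=3):
    """Resolve domains, re-running only the failures up to max_retries times."""
    results = {}
    pending = list(domains)
    for _ in range(1 + max_retries):  # first pass plus retries
        failed = []
        for domain in pending:
            answer = lookup(domain)
            if answer is None:
                failed.append(domain)
            else:
                results[domain] = answer
        pending = failed
        if not pending:
            break
    return results, pending  # pending holds domains that never resolved


# Fake resolver for demonstration: "flaky.example" fails twice, then succeeds.
attempts = {}


def fake_lookup(domain):
    attempts[domain] = attempts.get(domain, 0) + 1
    if domain == "dead.example":
        return None
    if domain == "flaky.example" and attempts[domain] < 3:
        return None
    return ["10 mx." + domain]


resolved, unresolved = resolve_with_retries(
    ["ok.example", "flaky.example", "dead.example"], fake_lookup
)
print(sorted(resolved), unresolved)
```

Retrying only the failures keeps the extra load proportional to the failure rate rather than to the size of the full domain list.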

For step two, I developed a simple Jupyter notebook to parse the adnshost logs and perform the enrichments using tldextract, PTR lookups (also using adnshost), Maxmind ASN, Maxmind City, Alexa ranking, and Cloud provider IP Ranges for AWS, Azure, and GCP. Side note: I also attempted to perform SOA lookups on the /24 networks of each IP after noticing some useful patterns with failed PTR lookups. This appears potentially useful for identifying some uses of the cloud providers’ IP space, but it turned into a rabbit hole since adnshost appears to crash when trying to handle some of the results it received.

For step three, I did the following:

Performed market research on the top email security providers as well as emerging and niche providers. This site was helpful as well as just googling around and exploring PDNS/Whois data from PassiveTotal and SecurityTrails.
Scrutinized the top MX server registered domains and ASNs and tried to identify potential security providers.
Sifted through the remaining results trying to identify any obvious providers with “malware”, “phish”, “spam”, or “security” in their domain names.
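The keyword sift in the last step is easy to automate as a first pass. A minimal sketch, assuming the MX base domains are available as a list of strings (the sample domains are invented and flag_candidates is a hypothetical name):

```python
KEYWORDS = ("malware", "phish", "spam", "security")


def flag_candidates(mx_base_domains):
    """Return MX base domains whose name contains a security-related keyword."""
    flagged = {}
    for domain in mx_base_domains:
        hits = [k for k in KEYWORDS if k in domain.lower()]
        if hits:
            flagged[domain] = hits
    return flagged


print(flag_candidates(["antispam.example", "mail.example.org", "PhishGuard.example"]))
```

Substring matching only produces candidates, so each flagged domain still needs manual review before it is added to a provider mapping.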

I used this to build two mappings to email security providers: MX server base domains and ASN names. The mappings can be found here. Then I summarized the overall dataset, and those results are presented below. Elephants in the room: I purposefully did not include Microsoft, Google, and some of the bigger tech companies that provide email service as part of these mappings since I don’t consider them email security companies. This may be debatable since these companies do provide security features through their offerings.

Brief Intro to MX records

For those of you who may not be familiar with DNS MX records, these are DNS Resource Records (RRs) used to map a domain name to the Mail Exchange (MX) servers responsible for accepting email for that domain. MX records are used by Mail Transfer Agents (MTA) in order to identify where email should be sent for a given recipient email address. Below we use the command line utility “dig” to perform an MX lookup on gmail.com to find its Mail Exchange servers. As you can see, at the time of this writing, there are five MX domains that can accept email for gmail.com.

Besides being critical for identifying where email should be sent, MX records are also useful for mapping out infrastructure and can sometimes be used to identify which email security providers are being used by a company of interest. Below is an example for Florida State University (go Noles!) that reveals that, at the time of this writing, they are using Proofpoint to receive their email. How do we know this? Their mail exchanges are hosted on sub domains of pphosted.com which is owned by Proofpoint.

Some companies obscure their security providers by first receiving their email to other mail exchanges such as ones hosted in their own data center or ones hosted by Google or Microsoft. In this blog, we explore a large DNS dataset to identify interesting info about the visible footprint / market share of email security companies.

All code and data for this study can be found in this Github Repo: https://github.com/covert-labs/mx-intel.

Observations:

Email security provider OPSEC is remarkably bad in a lot of cases and it is often easy to determine which provider is being used. Anyone who works in cyber security knows it is generally not a good idea to broadcast which cyber security products you are using since it may provide information that can be exploited by the adversary. This is especially true when vulnerabilities are announced in security products.
Since email exchanges can be chained together, only the outermost layer is visible in DNS MX records. For this reason, this research will underestimate the size of each provider’s market share.
Some security providers supply very specialized services (like anti-phishing only) and because of this they are often not the first layer in the email exchange chain. They will be dramatically underrepresented in this study.

Results:

Summary:

8,395,595 domains (derived from several top domain lists)
12,910,550 unique MX records (from 5,994,452 unique domains)
2,901,843 Unique Mail server domains
1,940,993 Unique Mail server base domains
25,733 Unique Mail server ASNs
56 Unique Security Providers identified

Analytics:

Here are the questions I was hoping to answer with the tables presented below:

Who are the market leading security companies reflected in the data?
What is the visible market share of email security providers as reflected in DNS records?
What can be inferred from publicly available MX records about email security?
Which email security providers are leveraging cloud hosting? And which cloud hosting environments are used most?
Who are the visible customers of provider X?

Note: All tables below show the count of domains hosted, NOT companies; companies can own many domains. Fortune 1000 domains are from 2015 and are based on this file created by Bob Rudis.

Top Email Security Providers Overall

Fortune 1000 Email security providers

Fortune 100 domain, MX base domain, email security provider

Note: I ended up adding Google and Microsoft to this table since they were very well represented. As you can see, Proofpoint and self-hosting dominate the Fortune 100.

Alexa 1000 Email security providers

Alexa 100 domain, MX base domain, email security provider

Note: I ended up adding Google and Microsoft to this table since they were very well represented. As you can see, self-hosting, Google and Microsoft dominate the Alexa 100. Almost all of these domains are from large technology / web companies so this isn’t so surprising, but it is interesting as compared to the Fortune 100.

Top Email Security Providers Hosted in AWS

Many large email security companies are operating from AWS.

Top Email Security Providers Hosted in Azure

Only a small number of identifiable email security companies were operating from Azure.

Top Self-hosted Email Security Providers

Misc Findings

When mining this data I discovered a few interesting items.

Linode / CSC Digital Brand Services

One of the more popular email security providers, “CSC Digital Brand Services” (which services multiple Fortune 100 companies), uses Linode for their hosting. This was surprising since Linode seems like a much smaller player in the Cloud hosting market.

googlemial[.]com

When I initially collected this data, freecodecamp.org had a misconfigured MX domain pointing to googlemial[.]com. This sketchy domain is not owned by Google, and it resolved to a GCP IP. Upon further inspection, this IP appears to be hosting a parking page for unregistered domains owned by GoDaddy. A quick PDNS check of other domains resolving to this IP reveals ~4.2M+ domains, and a quick DNS resolution on those domains with any subdomain shows that they all resolve to the same IP.

adnshost logs for freecodecamp.org

Future Work

I am not sure if I will return to this research or not, but I had some ideas that may be worth pursuing at some point, maybe during the next pandemic :)

Perform similar work against a much larger scale - using all major zone files (COM, NET, ORG) and ICANN’s CZDS as the inputs.
Or perform similar work using the Rapid7 Opendata DNS data sets.
Determine if port scans against MX servers could be useful to augment this.
Automate PDNS queries and analysis against the MX records found to identify other domains not found in the top domain lists.
Perform similar work, but collect SPF records and see what interesting insights could be gleaned about email sending trust (and whether vulns could be identified – like AWS IPs in the SPF that are stale and potentially obtainable).
Completely automate this entire process and use it to generate weekly reports.
Identify providers hidden by the first layer mail exchange. It may be possible to do this at scale (but only for some companies) if the companies send bounce notifications to external email senders for non-existent recipients. These bounce messages often contain all the SMTP headers of the original message sent, and these headers can reveal security products. This technique was used on a targeted basis by Will Pearce and Nick Landers in their DerbyCon research on Proofpoint. Trying to do this at scale may draw a lot of attention or get my research box put on some blacklists. It would also likely be a lot more effort to identify the SMTP headers associated with different security providers.

Resources

Data:

all-registered-domains.txt.gz - base domains extracted from combining several popular domains lists together and then uniqued.
all-popular-domains-MX-20200620.txt.unique.gz - adnshost logs from performing MX lookups on domains from all-registered-domains.txt.gz.
mailserver_registered_domain-NS-20200620.txt.gz - adnshost logs from performing NS lookups on all the MX base domains; used for enrichment.
mx-intel-enriched.csv.gz - the final enriched output from this work.

Notebooks, Code, and summary results: https://github.com/covert-labs/mx-intel.

–Jason
@jason_trost

May 23, 2020 • Jason Trost

Seven Short Links on Cyber Security Alert Triage Automation

A short listing of research papers I’ve discovered recently that aim to automate or speed up cyber security alert triage (alert prioritization/ranking, causal event correlation, and enrichment).

–Jason
@jason_trost

The “short links” format was inspired by O’Reilly’s Four Short Links series.

May 23, 2020 • Jason Trost

Eight Short Links on Provenance Analytics for Cyber Security

A short listing of research papers I’ve discovered recently that use Provenance Analytics for various Cyber Security usecases from EDR data analysis to malware analysis to threat hunting and IR.

–Jason
@jason_trost

The “short links” format was inspired by O’Reilly’s Four Short Links series.

February 01, 2020 • Jason Trost

3 Short Links on Popular Domain Lists for Threat Intelligence

A short listing of research papers I’ve read that analyze popular domain lists. These papers analyze Alexa, Quantcast, Cisco Umbrella, and Majestic top websites/domains.

–Jason
@jason_trost

The “short links” format was inspired by O’Reilly’s Four Short Links series.

covert.io

security + big data + machine learning
