Domain-name lookups only reveal websites visited, not individual pages viewed, right?
Wrong: the interaction between a user and the DNS is more revealing than previously believed, according to a paper from German postdoc researcher Dominik Herrmann. In work published at pre-print server Arxiv (in German – thank you, Google Translate), Herrmann writes that behavioural tracking using recursive name servers is a genuine privacy risk.
DNS – the infrastructure that converts into the IP address 188.8.131.52 – does, of course, reveal which sites a user visits. However, as Herrmann writes, that is an association between the user's public-facing IP address and the requests they make. Since ISPs have to use dynamic IP addresses to cope with the IPv4 address shortage, a user's address changes, making it harder to track them over time.
However, Herrmann writes, someone with access to the infrastructure can easily watch a user's behaviour while they have one IP address, create a classifier for that user, and look for behaviour that matches that classifier when the IP address changes.
“Each user pursues his interests and preferences while surfing, and ... each user has a unique combination of interests and preferences,” the paper states. Visits from one IP to Google followed by favourite newspapers, shopping sites, government services or transport are enough to identify a user when they pop up under a different IP, Herrmann reckons, and this “behavioural chaining” doesn't have to rely on tracking cookies.
To put this idea to the test, Herrmann ran a naive Bayes classifier over five months of anonymised DNS data from the University of Regensburg, covering thousands of users. In a sample of 3,800 students over two months, behavioural chaining correctly identified 86 per cent of individuals from one IP address to the next; and when the experiment was run for 12,000 students the accuracy remained high, at 76 per cent.
Herrmann offers two observations about why this is more worrying than it may appear first sight. While people will correctly point out that DNS resolves only as far as (for example) www.wikipedia.org – a DNS record doesn't show law enforcement that someone read en.wikipedia.org/wiki/Alcoholism, so their privacy is intact.
Not so, he responds: “Many websites produce a so distinctive DNS retrieval pattern” that requests can be recognised “more or less unequivocally.” An analysis of retrieval of 5,000 Wikipedia entries, 6,200 news posts on Heise, and the top 100,000 websites, most pages showed unique demand patterns, he writes.
*For example, most pages you visit don't involve a single DNS resolution. With images, trackers, third-party links, CDN links and the like, the paper notes that “typically between ten and 20 domains are resolved” – and these are frequently page-specific (because, for example, a news site includes particular ads associated with particular topics).
Whoever is carrying out DNS resolution doesn’t only see the DNS request for www.example.com/page — they see requests for anything else that page depends on. In many countries' data retention regimes, the IP addresses a user visits are recorded, but browser histories are off limits. Herrmann asserts law enforcement to use DNS records, IP address records, and behavioural chaining to reconstruct a more detailed browsing history than most users expect.
It can, however, be disrupted by ISPs, should they wish, by refreshing users' IP addresses more frequently. With an hourly change to IP address, Herrmann writes, the reconstruction fails 45 per cent of the time, and at five-minute changes, accuracy drops to 31 per cent – and if the user is inactive for enough intervals, “the trail disappears.”