Online Census? Think Twice!

Completing the decennial census online probably seems like an easy task, but convenience comes at a price—a price that I am unwilling to pay.

Census 2020

To understand what is at stake in completing a census online, one first needs to understand a little about web technology and how that technology has evolved. Web servers consist of two discrete elements with the same name. The first is the physical hardware; the second is the software. Logically, the hardware provides the circuitry that enables the software to run. Since one does not function without the other, it is common to speak of them as one thing, but it is occasionally useful to regard them separately. Pretty much any “box” can run web server software, but web server hardware is usually tailored to this one specific purpose. So when we speak of storage capacity, memory, and processor speed we are speaking of hardware. When we speak of GUI, CGI, platform, protocol, and capability we are speaking of software. In this post I am speaking almost entirely about web servers as software. By some estimates, Apache accounts for 40% of web servers and Nginx powers another 32%. Apache and Nginx in turn have modules for PHP, Perl, Python, Ruby, SQL, and so on. Web servers are responsible for everything from plain HTML to the immersive PHP+JS on Facebook to the streaming video on YouTube. In many cases, even smartphone apps are just shells that interact with a web server on the back end.

The thing about web servers is that every one logs everything. While different web servers log more or less information, at a minimum all of them log the date, time, protocol, response code, referrer, IP (internet protocol) address, URI (uniform resource identifier, a.k.a. “file”), and UA (user-agent, a.k.a. “browser”). Most people already know that IPs are unique addresses on a given network. The internet has IPs which are assigned to client nodes by various cascaded routers. Routers do exactly what the name suggests (routing internet data) all along the internet, from every backbone node to every data center and every user. Many of us have used a VPN, which itself incorporates a router, and most of us have wifi routers at home. Routers exist on fiber optic lines, coaxial lines, copper lines, and cellular networks. Whether on an internet backbone, within a university or company, or within a local ISP, routers assign IPs to computers, printers, cellphones, and, yes, to other routers. Backbone IPs remain static down to the ISP (internet service provider) level. ISPs are assigned fixed ranges of IPs which are then dynamically assigned via DHCP (dynamic host configuration protocol). Ironically, while dynamic IPs were once truly dynamic, for more than a decade now most ISPs have configured their DHCP servers to always assign the same IP address to each customer. This practice makes it easier for ISPs to identify naughty customers who abuse the network, to say nothing of subpoena compliance. The statically allocated ranges also make it possible for internet databases to correlate IPs to general geographies in pretty much the same manner that area codes and exchanges once identified states, cities, and communities. Over time, a number of databases have come into existence that offer even more finely tuned geolocation. Google itself can resolve IPs to specific addresses, though it does not make such fine-grained resolution available to the public. Still, there is enough public data available from real estate records and prior censuses to infer a great deal about the demographics associated with IP ranges.
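To make the list of logged fields concrete, here is a minimal Python sketch that pulls them out of a single log line. It assumes the widely used Apache/Nginx "combined" log format; the sample entry, its IP (taken from a range reserved for documentation), and the function name are made up for illustration, and a real server may be configured to log more or less than this.

```python
import re

# Apache/Nginx "combined" log format:
# IP ident user [timestamp] "METHOD URI PROTOCOL" status bytes "referrer" "user-agent"
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_log_line(line):
    """Return the logged fields as a dict, or None if the line does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None

# Made-up entry; 203.0.113.7 comes from a range reserved for documentation.
SAMPLE = (
    '203.0.113.7 - - [01/Apr/2020:09:15:02 -0500] '
    '"GET /questionnaire HTTP/1.1" 200 5120 '
    '"https://example.com/start" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/74.0"'
)

entry = parse_log_line(SAMPLE)
for field in ("ip", "time", "protocol", "status", "referrer", "uri", "user_agent"):
    print(f"{field:>10}: {entry[field]}")
```

Every field named in the paragraph above shows up in that one line of text, and a busy server writes one such line for every single request.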

Server logs are pretty boring in and of themselves, but they are essential to load balancing and troubleshooting. The original premise of server logs, though, was that the data contained in the logs was functionally anonymized. Over time, however, myriad programs came along for analyzing server logs. Originally, log analysis was useful for understanding peak traffic times and resource allocation. Then log analyzers began correlating geography and browser (because in the 1990s and 2000s there was an ongoing browser war; each browser wanted to deliver a compelling niche experience and therefore interpreted the HTML standards differently than its competitors). By the time 2020 rolled around, usage analytics had evolved to reach far beyond mere server logs; websites now link invisible resources and embed hidden “pixel” images from third parties like Google and Microsoft and Amazon in order to track a user’s activity across countless sites. Except for Tor, there is no such thing as private or anonymous browsing (and as good as Tor is, it isn’t 100%).
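The kind of log analysis described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the range-to-region table is hypothetical (real geolocation databases are vastly larger), the IP ranges are ones reserved for documentation, and the hits are invented.

```python
import ipaddress
from collections import Counter
from datetime import datetime

# Hypothetical range-to-region table. The three public-looking ranges are
# reserved for documentation (RFC 5737) and belong to no real ISP.
REGIONS = {
    ipaddress.ip_network("192.0.2.0/24"): "Town A",
    ipaddress.ip_network("198.51.100.0/24"): "Town B",
    ipaddress.ip_network("203.0.113.0/24"): "Town C",
}

def region_for(ip):
    """Return the region whose range contains ip, or 'unknown'."""
    address = ipaddress.ip_address(ip)
    for network, region in REGIONS.items():
        if address in network:
            return region
    return "unknown"

# Invented (ip, timestamp) pairs as they might come out of a parsed access log.
HITS = [
    ("192.0.2.10", "01/Apr/2020:09:15:02 -0500"),
    ("192.0.2.77", "01/Apr/2020:09:48:30 -0500"),
    ("203.0.113.7", "01/Apr/2020:20:05:11 -0500"),
    ("10.1.2.3", "01/Apr/2020:20:30:00 -0500"),
]

# Traffic by geography and by hour of day: the two oldest log-analysis questions.
by_region = Counter(region_for(ip) for ip, _ in HITS)
by_hour = Counter(
    datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z").hour for _, stamp in HITS
)
print(by_region)
print(by_hour)
```

Nothing here needs a name or an account. The address alone is enough to place a visitor in a neighborhood-sized bucket.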

Suppose Citizen Doe visits website X which is hosted on Amazon AWS. Then Citizen Doe visits website Y hosted on Cloudflare. Citizen Doe doesn’t know and doesn’t care whose physical servers power those websites. Citizen Doe also doesn’t know and doesn’t care that both sites embedded a 1-pixel image or 1-line stylesheet from Google. But when Citizen Doe loads a page from those sites, Google gets to log the hidden request to its servers. Google then knows that the same UA on the same IP visited both sites. To see how detailed this information can get, consider that even a less-sophisticated site can display, in rudimentary form, your UA, OS, IP, and location, among other metadata. Taken as a whole, this data comprises a shockingly reliable virtual fingerprint. Over the last twenty years, analytics purveyors have mined enough data to very accurately identify the race, gender, approximate age, approximate income, and approximate net worth of an otherwise “anonymous” user. This information is what powers targeted advertising and how a user magically sees ads for Izod polos on Poshmark after looking at Ralph Lauren polos on eBay. This is also how Google knows which automobiles would be of greatest interest to a particular user and whether to advertise Netflix (paid) or Pluto (free). It’s simple socioeconomic demographics. And it gets worse!
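Here is a rough Python sketch of how the Citizen Doe scenario looks from the third party's side of the hidden request. Everything in it is hypothetical: the log entries, the site names, and the idea of hashing IP plus UA into an identifier are stand-ins for illustration. Real analytics fingerprints draw on far more signals than these two.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical entries from the third party's own log: each hidden pixel
# request carries the visitor's IP, UA, and the page that embedded the pixel.
PIXEL_LOG = [
    {"ip": "203.0.113.7", "ua": "Firefox/74.0 (Windows NT 10.0)",
     "referrer": "https://site-x.example/polos"},
    {"ip": "203.0.113.7", "ua": "Firefox/74.0 (Windows NT 10.0)",
     "referrer": "https://site-y.example/checkout"},
    {"ip": "198.51.100.20", "ua": "Safari/13.1 (Macintosh)",
     "referrer": "https://site-x.example/polos"},
]

def fingerprint(ip, ua):
    """Collapse IP + UA into a short, stable identifier."""
    return hashlib.sha256(f"{ip}|{ua}".encode("utf-8")).hexdigest()[:12]

def sites_by_visitor(log):
    """Map each fingerprint to the set of sites where it was seen."""
    seen = defaultdict(set)
    for entry in log:
        site = urlparse(entry["referrer"]).hostname
        seen[fingerprint(entry["ip"], entry["ua"])].add(site)
    return dict(seen)

for visitor, sites in sites_by_visitor(PIXEL_LOG).items():
    print(visitor, sorted(sites))
```

Neither site X nor site Y shared anything with the other. The third party connects the two visits simply because both pages asked its server for the same invisible file.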

Suppose a user also has a Google email account. Web standards are supposed to restrict cookie-reading access to the cookie-setting site so that Google can’t read Microsoft’s cookies, and vice versa. Even so, when Citizen Doe checks her Gmail via her web browser, Google sets a cookie. When Citizen Doe receives an email from her friend about a particular pair of chunky-heel booties, Google sets a cookie. When Citizen Doe searches for formal dresses, Google sets a cookie. Since Google set all those cookies, Google can read all those cookies. Not only does Google add all this information to its database, but Google positively knows Citizen Doe’s identity, age, gender, address, educational history, friends, relatives, favorite vacation destinations, language fluency, and probably even height, weight, hair color, eye color, complexion, and race. All of it is extracted from her Gmail contacts, emails, attachments, and photos (not to mention the geolocated IP of every place she has ever checked her Gmail). Google knows what she uploads to Pinterest, Facebook, Tumblr, and Twitter. All of these activities are fully (though obfuscatorily) disclosed in the TOS (terms of service) and EULA (end-user licensing agreement) that are so prohibitively long that no one reads them. (FYI, do read up on privacy within Google Docs and Google Drive!)
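The cookie rule mentioned above (only the site that set a cookie gets it back) can be sketched in Python as follows. This is a deliberately simplified model, not how any particular browser is implemented: it checks only the domain and ignores paths, expiry, the Secure flag, and the other attributes real browsers enforce. The cookie names and values are invented.

```python
def domain_matches(request_host, cookie_domain):
    """Simplified domain match: exact host, or a subdomain of the cookie's domain."""
    host = request_host.lower()
    domain = cookie_domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)

# Hypothetical cookie jar: (domain the cookie was set for, name, value).
JAR = [
    ("google.com", "visitor_id", "abc123"),
    ("microsoft.com", "visitor_id", "xyz789"),
]

def cookies_sent_to(request_host):
    """Return the cookies a browser would attach to a request for request_host."""
    return {name: value for domain, name, value in JAR
            if domain_matches(request_host, domain)}

print(cookies_sent_to("mail.google.com"))    # only the google.com cookie
print(cookies_sent_to("www.microsoft.com"))  # only the microsoft.com cookie
print(cookies_sent_to("notgoogle.com"))      # nothing: a similar name is not a match
```

The rule keeps Google out of Microsoft's cookies, but it does nothing to stop one company from reading its own cookie on every request that reaches its servers, including the hidden ones embedded in other people's pages.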

Everything I describe is perfectly legal. It is also perfectly helpful. I really don’t care that Google knows such things, and I am appropriately conscientious in keeping sensitive matters away from Google (which is largely why I don’t use Google Docs or Google Drive). Email from healthcare providers goes elsewhere, and email from colleagues stays within company-controlled servers. There is nothing overly personal in my Gmail either, just social media, mailing lists, receipts, and clutter. I kind of like that Google knows that I ordered a Domino’s pizza yesterday; perhaps tomorrow Google will present me with an incredible deal at Pizza Hut! I kind of like that Google knows my gender, race, and age because showing me ads for boner pills would be an annoyance and a waste of time. And I like that Google has learned what interests me because I very well might like what I see in an Old Navy advertisement and that could be a win-win-win (a win for me because I got something I really like that I didn’t know was on sale, a win for Old Navy who got a sale and hopefully will use analytics to continue producing the very garments that most appeal to me, and a win for Google who made $1 off the initial click-through and another $2 from the final click-through-purchase).

Google is not evil for doing what it does, nor is it the only player in the marketplace (it’s just the largest and most convenient example). For all the information that Google gathers, Google has an intrinsic interest in keeping its shrewdly gathered data secure lest a competitor profit from Google’s hard work. Google’s interests lie in capitalistic gain and I can trust Google to the extent that my interests align with Google’s interests. This is not true of Uncle Sam’s CIA, NSA, FBI, ATF, IRS, CDC, FDA, DOL, DOT, and all the other alphabet agencies. September 11 was terrible, but the surveillance measures enacted thereafter are so intolerable to a democracy that Uncle Sam needs to go fuck himself in the ass with a railroad tie. No one can know what analytics he’s using or what deals he’s struck with analytics purveyors, and I’m not willing to open the door for him to probe more deeply into my life. Google certainly won’t open the front door for Uncle Sam to explore its database, but Uncle Sam is still a deep-pocket customer who might just use demographic data to fact-check responses and to investigate any discrepancies he might find. Then too, maybe the USCB just stores the analytics data until ten years from now, or maybe the USCB shares its analytics information (which is outside the narrow scope of confidential census data) with the IRS, which might want to see if my metaprofile comports with my tax return. Is anyone out there eager to increase audit probability?

Fill out the census questionnaire online? Hell no! Not from my home, not from my laptop, not from my office, not from my university, and not from my cell phone. I might consider completing it online from my public library (that is, once this corona pandelerium subsides and the county reopens). In fact, I like the idea of using a library computer because it would confuse the hell out of any analytics. On the other hand, it might also give the impression that I am an 8-year-old child with some form of cognitive dysfunction. Either way, one thing is for certain: the information on a paper form is constrained to the four corners of the document and nothing more.
