How to act like a data scientist 1

Today, I’m starting a series of posts that describe how data scientists may/should react to the use of data and information during this year’s U.S. Presidential election cycle. If you are not a data scientist, I’m hoping you’ll become smarter at detecting nonsense that appears online or on air. My head often explodes when I hear arguments that violate “numbersense”.

For budding data scientists, one of the greatest skills you need is to make convincing data arguments. These short essays are case studies of applying best practices and spotting mistakes.

Last night’s debate between Democratic primary candidates in Las Vegas was messy.

But it was also a true debate, in which the candidates were allowed to talk to each other. Democracy is messy. (I wish the questions asked by the hosts could be more uplifting. Instead, many of the questions were mean-spirited and disguised personal attacks. I prefer questions that lead candidates to explain how their policies help Americans.)


Sanders Schooled Host on Reading Polls

The most data-relevant moment in the debate was when the host attacked Bernie Sanders with a poll finding: the host cited this just-released NBC News/WSJ poll, saying that 67% of respondents are uncomfortable with a candidate being a “socialist”. And Sanders’s response was perfect:

“Who won the poll?” he mused.

He was referring to the headline announcing the poll findings: “Sanders opens up double-digit national lead in primary race”. The same poll found that Sanders had 27% support, which is significantly ahead of the next four candidates, all of whom have around 15%. The gap of 12 percentage points is larger than twice the margin of error, so this is a clear lead.
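
For readers who like to check this kind of claim themselves, here is a minimal sketch of the "gap versus twice the margin of error" rule of thumb. The sample size is a made-up placeholder, since I am not quoting the poll's actual sample size here, and the margin-of-error formula assumes a simple random sample, which real polls only approximate. The function names are mine, not from any library.

```python
import math


def worst_case_margin_of_error(n, z=1.96):
    """Approximate 95% margin of error for a single proportion.

    Assumes a simple random sample of size n and uses p = 0.5,
    which gives the widest (most conservative) margin.
    """
    return z * math.sqrt(0.25 / n)


def is_clear_lead(leader_share, runner_up_share, moe):
    """Rule of thumb: the gap must exceed twice the margin of error."""
    gap = leader_share - runner_up_share
    return gap > 2 * moe


# Hypothetical sample size, for illustration only (not the poll's real n).
n_hypothetical = 400
moe = worst_case_margin_of_error(n_hypothetical)

# Shares taken from the text: 27% for Sanders, around 15% for the next candidates.
print(f"margin of error: {moe:.3f}")
print("clear lead:", is_clear_lead(0.27, 0.15, moe))
```

With a smaller sample the margin widens, so the same 12-point gap could fail the test. That is why the sample size matters as much as the headline numbers.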


If a data scientist on my team said what Sanders did, I would be so proud!

When Sanders asked “who won the poll”, the host literally went silent for a few seconds.

One of the ways people misuse survey results is to cherry-pick the answers they like and suppress the answers they don’t. The data scientist sees through this right away.

The argument is not that the overall preference data is correct and the socialist label data is incorrect. That would make the same error in reverse. It’s that they are either both correct or both incorrect.

The two results can both be correct… if the host thought they were conflicting, that’s only because he assumed that the most important factor affecting one’s vote preference is the label of a socialist.

In fact, this interpretation of the world is invalidated within the same poll because these same respondents picked Sanders as the front-runner.

If you have numbersense, you may notice a small flaw in the above sentence. Since there are so many candidates, even though Sanders had a clear lead, he only has 27% vote share. It is possible that all of the 27% came from those who are not uncomfortable with “socialist” (after all, 33% is more than 27%). Thumbs up if you saw that.

But… to complete my argument, I can use another pertinent question from the same poll: in a hypothetical head-to-head with Michael Bloomberg, these same respondents chose Sanders 57% to 37%.
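
The overlap argument can be made concrete with a few lines of arithmetic. This is a sketch, assuming (as the argument above does) that both percentages refer to the same pool of respondents. The function name is my own. It computes the smallest and largest possible share of respondents who belong to both groups, given only the two separate percentages.

```python
def overlap_bounds(share_a, share_b):
    """Smallest and largest possible share of people in both groups.

    Only the two marginal shares are known, so the true overlap can
    be anywhere between these bounds.
    """
    lower = max(0.0, share_a + share_b - 1.0)
    upper = min(share_a, share_b)
    return lower, upper


uncomfortable = 0.67  # uncomfortable with a "socialist" candidate

# 27% primary support: the overlap could be zero, which is the small flaw.
print(overlap_bounds(0.27, uncomfortable))

# 57% in the head-to-head: some overlap is now unavoidable.
print(overlap_bounds(0.57, uncomfortable))
```

In the first case the lower bound is zero, so the 27% could all come from the 33% who are not uncomfortable. In the second case the lower bound is about 24%, so at least roughly a quarter of respondents both said they were uncomfortable with the label and still chose Sanders.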

The larger lesson here is that polls (and surveys) are wonderful statistical toys. The power of polls is not just in the headline results but the interplay between different results.

[Also note this: Using my methodology outlined here, the vote shares expressed in this NBC News/WSJ poll translate to just a 58%-43% vote split in a 2-person race, which is very close to the head-to-head result they found! Nice validation for the methodology. By the way, a 60/40 split in a 2-person race is not a landslide no matter how many times the media mischaracterizes it.]

Warren Hammered Bloomberg on NDAs

The most memorable moment in the debate was around Michael Bloomberg’s use of NDAs to silence women who complained they were sexually harassed. Elizabeth Warren wanted to know how many such NDAs have been signed, and whether Bloomberg would release the women from these agreements so they could tell the public about what happened.

This attack reminded me of similar attacks against Trump. And that’s because these NDAs are standard instruments used to prevent unfavorable information from leaking out. It’s, dare we say it, a quid pro quo: the women must sign these documents and keep their mouths shut in order to receive a settlement.

The data angle here is that information (data) is very powerful. Withholding information leaves the data scientist cold.

Politicians have been at the forefront of suppressing information – ranging from Hillary Clinton’s servers and deleted emails to Donald Trump’s obstruction of investigations, from the Senate Republicans’ suppression of witness testimony to these NDAs, from redacting documents to declaring state secrets.

All of these actions shield information from public scrutiny.

This is why as citizens, we must demand more transparency – but also we must demand the right to protect our own information. Right now, if you are using any cloud services (which include most mobile apps), you are entrusting technology companies with private information.

Imagine that during the drafting of the NDAs, Bloomberg’s staff used cloud services like Google Docs. If so, Google (plus possibly third-party vendors used by Google) has copies of those documents in the cloud. The cloud consists of a whole network of servers; the cloud architecture requires everything to be duplicated lots of times so that if one copy is lost, there is another copy available for use.

I don’t know whether the Bloomberg organization uses Google; since Bloomberg is a tech company, it might not. But we do know that in other situations, like Gmail, Google deploys algorithms that “read” our emails and extract data out of them. Google even allows third parties (i.e. advertisers) to “read” our emails, for example, to look at electronic receipts in Gmail.

These Big Tech companies hold the key to a lot of potentially valuable data. Can they be trusted to do the right thing?
