We’ve spent the past couple of weeks, on again and off again, having fun fixing a “legacy” problem, if that term can be applied to a four-month-old site: IDing when Trump himself tweets from @realdonaldtrump, vs. when his staff does.
Trumpologists have known for a while that he used a Samsung Galaxy S3, aka an Android phone. His staff all used iPhones. Further, multiple analyses of the language style, time of day, etc. (nerd out here and here) validated the connection between the Android and tweets from the thumbs of Trump.
But things change. His Android phone was last seen the morning of March 8th:
Then… nothing. Android showed up briefly for two tweets on March 25th within four minutes of each other (I’m picturing a brief wrestling match with the Secret Service as he pulled the phone from its hiding spot in the limo on the way to Trump National in Potomac Falls, VA… the two tweets were during his ride over, 20 minutes before arrival… maybe while on the Toll Road?) and disappeared again.
Meanwhile, there were 139 other tweets, mostly from the iPhone. Including lots of FAKE NEWS references and other tweets that are universally agreed to have come from him.
So, here in the Fact Cave, we’ve had an algorithm for months that looked at his text. The trouble: it essentially biased the heck out of Android as an indicator. When the Android disappeared, the algo rolled over, showed its belly and promptly failed miserably. Back to the drawing board.
So we got to work. And iterated, tested, iterated and iterated some more.
Meanwhile, Andrew McGill at The Atlantic remembered the golden rule we forgot: Always. Be. Shipping. His excellent code can be found here.
We agreed with most of his approach but, now properly motivated, we blew the dust off perfection, hunkered down and reclassified all the tweets from March 8th forward.
The logic? We kept it simple:
- Control. In 2016, removing retweets, the Android phone tweeted 1,357 times. Other devices tweeted 2,264 times. For our purposes, this was treated as gospel, where Android = Trump, not Android = staff
- Words. Trump’s vocabulary is very distinctive. We generated a deep count of commonly used words for both groups (Android and other devices). Simple word frequency
- Hashtags. Trump almost never uses hashtags. His staff uses them frequently. Appearance of hashtags biases heavily to Staff
- URLs and Photos. Outside of retweets, Trump used either a URL or photo a total of 10 times out of 1,357 tweets. Another heavy bias
- Others. What we tested and ignored: sentiment (Trump tweets bias negative, but not consistently enough to be a factor), user mentions, use of capitalization, use of exclamation points. None of these were as clear as hashtags and URLs.
- New platforms. Tweets sent from Twitter Ads and Media Studio, both products unlikely to be in Trump’s hands on his phone, are automatically classified as staff.
- Then we guessed…
… well, not really. We took those three factors (words, hashtags, and URLs/photos), computed a log odds ratio for each, and threw them into a test that automatically adjusted the weighting and scores for each factor, comparing the results against a random sample of the control (1,000 items) until it settled on the best outcome.
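For the curious, here is a rough sketch of what that pipeline could look like. It is not our production code. The function names, the 0.5 smoothing constant, the "http" check standing in for URLs and photos, the choice to let a missing hashtag or URL contribute nothing, and the brute-force grid search (a stand-in for the automatic weight adjustment) are all assumptions made for illustration. The toy tweets are invented strings, not real data.

```python
import math
from collections import Counter
from itertools import product

STAFF_ONLY_SOURCES = {"Twitter Ads", "Media Studio"}  # automatic staff


def log_odds(k_trump, n_trump, k_staff, n_staff, smooth=0.5):
    """Smoothed log odds ratio: > 0 leans Trump, < 0 leans staff."""
    odds_trump = (k_trump + smooth) / (n_trump - k_trump + smooth)
    odds_staff = (k_staff + smooth) / (n_staff - k_staff + smooth)
    return math.log(odds_trump / odds_staff)


def words(text):
    """Lowercased words, minus hashtags and links (scored separately)."""
    stripped = (t.strip(".,!?:;\"'()").lower() for t in text.split())
    return [w for w in stripped if w and not w.startswith(("#", "http"))]


def fit(trump_tweets, staff_tweets):
    """Build one log odds table per factor from the labeled control set."""
    n_t, n_s = len(trump_tweets), len(staff_tweets)
    # Count each word once per tweet so counts never exceed tweet totals.
    wc_t = Counter(w for t in trump_tweets for w in set(words(t)))
    wc_s = Counter(w for t in staff_tweets for w in set(words(t)))
    model = {"word": {w: log_odds(wc_t[w], n_t, wc_s[w], n_s)
                      for w in set(wc_t) | set(wc_s)}}
    for name, marker in (("hashtag", "#"), ("url", "http")):
        k_t = sum(marker in t for t in trump_tweets)
        k_s = sum(marker in t for t in staff_tweets)
        model[name] = log_odds(k_t, n_t, k_s, n_s)
    return model


def features(model, text):
    """Three scores: mean word log odds, hashtag log odds, URL log odds."""
    ws = words(text)
    word_score = sum(model["word"].get(w, 0.0) for w in ws) / max(len(ws), 1)
    return (word_score,
            model["hashtag"] if "#" in text else 0.0,   # absence adds nothing
            model["url"] if "http" in text else 0.0)


def predict(model, weights, text, source=None):
    if source in STAFF_ONLY_SOURCES:
        return "staff"
    score = sum(w * f for w, f in zip(weights, features(model, text)))
    return "trump" if score > 0 else "staff"


def tune(model, sample, grid=(0.0, 0.5, 1.0, 2.0)):
    """Try every weight combination; keep the most accurate on the sample."""
    def accuracy(weights):
        hits = sum(predict(model, weights, t) == label for t, label in sample)
        return hits / len(sample)
    return max(product(grid, repeat=3), key=accuracy)


# Invented toy control set, for illustration only.
trump = ["fake news is sad", "the fake media is dishonest", "sad fake news"]
staff = ["join us tonight #rally http://example.com", "thank you #team",
         "watch live http://example.com #live"]
model = fit(trump, staff)
sample = [(t, "trump") for t in trump] + [(t, "staff") for t in staff]
weights = tune(model, sample)
print(weights, predict(model, weights, "more fake news"))
```

In practice the tuning sample would be the random 1,000-item draw from the 2016 control set, and the grid would probably be finer than four values per factor.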
The best result we could get without relying on the device as a major indicator? The robot can correctly identify Trump tweets 91% of the time, and staff tweets 85% of the time.
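One way to read those two numbers is as per-class accuracy: of the tweets the control says are Trump's, what share did the robot call Trump, and likewise for staff. A minimal sketch of that calculation follows, assuming predictions arrive as (true label, predicted label) pairs. The helper name and the toy pairs are made up for illustration.

```python
from collections import Counter


def per_class_accuracy(pairs):
    """pairs: iterable of (true_label, predicted_label) tuples."""
    totals, correct = Counter(), Counter()
    for truth, guess in pairs:
        totals[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return {label: correct[label] / totals[label] for label in totals}


# Invented toy results: 4 Trump tweets (3 caught), 2 staff tweets (1 caught).
toy = [("trump", "trump"), ("trump", "trump"), ("trump", "trump"),
       ("trump", "staff"), ("staff", "staff"), ("staff", "trump")]
rates = per_class_accuracy(toy)
```

Reporting the two classes separately matters here because the control set is lopsided (1,357 Android tweets vs 2,264 from other devices), so a single overall accuracy figure could hide a weak spot on one side.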
Perfect? No. Better than nothing? Yes.
We’re going to keep working on it and getting that number up. And we’ll be faster about it next time.