Feed Me (Transcripts), Seymour…

If there’s one thing about statistical models that’s generally true: they need to be fed.

For about six months now, I’ve been living most waking moments in the words of Donald Trump. I love algorithms, but I check them. And check them again. And again. It’s not even borderline compulsive. We blew past borderline around January. It is compulsive.

Probably the single biggest challenge I face in shaping the models: access to raw materials. We check, and check again, every word. Yes, he was on Oprah in 1988, but we need more than 3:11… we need the whole show for context.

We are constantly updating our backlog of material, with volunteers generously sending in links (I’m looking at you, CJ, in particular), text and videos that in turn need to be checked and then fed into Margaret, our pseudo-AI that ravenously consumes every word spoken, analyzing the audio, video and text to build her model. That model in turn analyzes tweets, transcribes better, and does lots of other cool things.

The single best source of this information is interviews. As opposed to speeches, they are generally unscripted. As opposed to tweets, you get more than 19.6 words at a time (1 year moving average, 3,203 tweets, 62,871 words). Sometime, I’ll have enough time to do a separate post explaining how differently the models view speeches vs. interviews… it’s almost two different people in the output.

However, as Chris Cillizza at CNN pointed out in a recent tweet, these are often unshared, even after the news cycle. Some organizations publish transcripts simultaneously. Most publish just excerpts, noting they’ve been edited. Some share audio and video, but with cuts and jumps. Others… nothing.

I’m not naming names, but given that the messages coming from The White House can at times appear to contradict each other, this raw material is crucial, both for the historical record, and for building a base of research that others can analyze.

Also, the full, unedited interview can remove potential questions as to whether comments are in context. Personally, I think that is in nearly every case a ridiculous argument, but the argument can’t be made if there are no edits.

So I’m making both a public plea, and an offer: please, in the name of all that is good in the world, once you’ve run your stories and pieces, please publish and share the raw materials. Pull any off-the-record comments, but otherwise, share the raw audio, video and text.

Since everyone has a few things to do nowadays, here’s what we’ll offer for any interview with the President, if time or resources constrain a full transcript or sharing raw video and/or audio.

  1. Factba.se will happily, and freely, transcribe in full any video or audio provided, both via Margaret, and with a human editor to verify.
  2. Factba.se will provide, via a spreadsheet or any other medium, ALL metadata developed. This is the stuff that is behind the scenes (not for long) on our site. If audio, you’ll get back second-by-second audio analysis of voice stress and emotion, which is keyed to Trump (the sotto voce whisper). If video, it will include facial expressions, smile/frown, gestures and other analysis (clothing identification, colors, the two-handed punctuation I myself have as a third-generation bridge-and-tunnel child, etc.). It is even learning to pick up when he flushes (complexion change). It will be a lot. But it will be everything.
  3. We will provide the full keyword and entity extraction, by three-sentence pair, section and overall, both for the entire interview, and specifically on just when Trump is speaking.
  4. We will provide the full range of analysis. Grade-level models, sentiment, emotion… all of it.
  5. We will respect any and all embargoes given. We are not meant to be a news organization. If you’d like us to hold until a day, two days, three days after the stories run before integrating and sharing the information, fine. You’re the boss. It’s your interview. You get it back first and control the story.
  6. If a human is in the mix editing, figure two hours per hour of video/audio for transcript. If you don’t mind raw from Margaret (she’s close to 95% dead on now), 90 seconds per hour. We just need a little notice to plan our day to be ready for it if you want a quick turnaround.
  7. We will, of course, link out to your pieces from the text.
  8. If there are any other requests… fine. Our interest is the record, and sharing the resulting analysis.

We’re not looking to create a hippie commune. We are looking, however, to unleash the data that is contained in your excellent work, in a way that does not conflict with your job.

Also, on the off chance Margaret becomes sentient again, you’ll be in her good graces.

— Bill Frischling

When Occam’s Razor Cuts You

The past couple of weeks have been a fun on-again, off-again exercise in fixing a “legacy” problem, if that term can be applied to a four-month-old site: IDing when Trump himself uses @realdonaldtrump, vs. his staff.

Trumpologists have known for a while that he used a Samsung Galaxy S3, aka an Android phone. His staff all used iPhones. Further, multiple analyses of the language style, time of day, etc. (nerd out here and here) validated the connection between the Android and tweets from the thumbs of Trump.

But things change. His Android phone was last seen the morning of March 8th:

Then… nothing. Android showed up briefly for two tweets on March 25th within four minutes of each other (I’m picturing a brief wrestling match with the Secret Service as he pulled the phone from its hiding spot in the limo on the way to Trump National in Potomac Falls, VA… the two tweets were during his ride over, 20 minutes before arrival… maybe while on the Toll Road?) and disappeared again.

Meanwhile, there were 139 other tweets, mostly from the iPhone. Including lots of FAKE NEWS references and other tweets that are universally agreed to have come from him.

So, here in the Fact Cave, we’ve had an algorithm for months that looked at his text. The trouble: it essentially biased the heck out of Android as an indicator. When the Android disappeared, the algo rolled over, showed its belly and promptly failed miserably. Back to the drawing board.

So we got to work. And iterated, tested, iterated and iterated some more.

Meanwhile, Andrew McGill at The Atlantic remembered the golden rule we forgot: Always. Be. Shipping. His excellent code can be found here.

We agreed with most of his approach, but now properly motivated, we blew the dust off perfection, hunkered down and reclassified all the tweets from March 8th forward.

The logic? We kept it simple:

  • Control. In 2016, removing retweets, the Android phone tweeted 1,357 times. Other devices tweeted 2,264 times. For our purposes, this was treated as gospel, where Android = Trump, not Android = staff
  • Words. Trump is very distinctive. We generated a deep count of commonly used words on both accounts. Simple word frequency
  • Hashtags. Trump almost never uses hashtags. His staff uses them frequently. Appearance of hashtags biases heavily to Staff
  • URLs and Photos. Outside of retweets, Trump used either a URL or photo a total of 10 times out of 1,357 tweets. Another heavy bias
  • Others. What we tested and ignored: sentiment (Trump tweets bias negative, but not consistently enough to be a factor), user mentions, use of capitalization, use of exclamation points. None of these were as clear as hashtags and URLs.
  • New platforms. Twitter Ads and Media Studio, both social platforms / products unlikely to be in Trump’s hands on his phone, are automatically classified as staff.
  • Then we guessed…

… well, not really. We took those three factors (for each we did a log odds ratio) and threw them into a test that automatically adjusted the weighting and scores for each of the factors and compared the results against a random sample of the control (1,000 items) until it settled on the best outcome.
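For the curious, here is a rough sketch of what a log odds scorer along those lines could look like. It is not our production code: the function names, the add-one smoothing, the averaging of word scores and the fixed 1.0 weights are all assumptions made for illustration, and it leaves out the step where the weights are adjusted automatically against the control sample.

```python
import math
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")
LINK = re.compile(r"https?://\S+")
WORD = re.compile(r"[a-z']+")


def words(text):
    """Lowercased word tokens with links, hashtags and mentions stripped."""
    stripped = re.sub(r"https?://\S+|[#@]\w+", " ", text.lower())
    return WORD.findall(stripped)


def log_odds_ratio(hits_a, total_a, hits_b, total_b):
    """Add-one smoothed log odds ratio (assumed smoothing); positive leans to group A."""
    p_a = (hits_a + 1) / (total_a + 2)
    p_b = (hits_b + 1) / (total_b + 2)
    return math.log(p_a / (1 - p_a)) - math.log(p_b / (1 - p_b))


def train(trump_tweets, staff_tweets):
    """Build per-word and per-feature log odds from the labelled control set."""
    t_words = Counter(w for t in trump_tweets for w in words(t))
    s_words = Counter(w for t in staff_tweets for w in words(t))
    t_total, s_total = sum(t_words.values()), sum(s_words.values())
    model = {
        "words": {
            w: log_odds_ratio(t_words[w], t_total, s_words[w], s_total)
            for w in set(t_words) | set(s_words)
        }
    }
    # Hashtags and links are scored per tweet rather than per word.
    for name, pattern in (("hashtag", HASHTAG), ("link", LINK)):
        t_hits = sum(1 for t in trump_tweets if pattern.search(t))
        s_hits = sum(1 for t in staff_tweets if pattern.search(t))
        model[name] = log_odds_ratio(
            t_hits, len(trump_tweets), s_hits, len(staff_tweets)
        )
    return model


def classify(model, tweet, weights=(1.0, 1.0, 1.0)):
    """Return 'trump' when the weighted score is positive, else 'staff'."""
    # Words never seen in the control set are skipped, not scored.
    known = [model["words"][w] for w in words(tweet) if w in model["words"]]
    word_score = sum(known) / len(known) if known else 0.0
    hashtag_score = model["hashtag"] if HASHTAG.search(tweet) else 0.0
    link_score = model["link"] if LINK.search(tweet) else 0.0
    w_words, w_hashtag, w_link = weights  # placeholder weights, not tuned
    score = w_words * word_score + w_hashtag * hashtag_score + w_link * link_score
    return "trump" if score > 0 else "staff"
```

Train it on the Android tweets as one group and everything else as the other, then call classify on each new tweet. The tuning pass would amount to searching over the weights tuple and keeping whichever combination scores best on the held-out sample.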

The best result we could get without relying on the device as a major indicator? The robot can correctly identify Trump tweets 91% of the time, and staff tweets 85% of the time.

Perfect? No. Better than nothing? Yes.

We’re going to keep working on it and getting that number up. And we’ll be faster about it next time.

 

Factba.se 2.0. Now with 0.2 More!

Okay, been focusing on quite a few things, but we just pushed out a fairly large update, and we’ve got some news as well. But first, the updates from today and the past two weeks:

  • Full Access to transcripts. We’ve been asked about this repeatedly. Now, you can browse through everything Trump has said that we have in the system in a handy timeline here: https://factba.se/transcripts. In addition, it surfaces some of the behind-the-scenes analytics, like emotion analysis, sentiment analysis, keywords, entities and more. Just click on an item to see a detailed breakout (for example: https://factba.se/transcript/donald-trump-remarks-greek-ceo-march-24-2017).
  • White House Schedule. A simple little doodad. It lists the President’s Schedule (public schedule), broken out as appointments. As analysis comes in, it is linked into the schedule. It’s also available in JSON, CSV, and of course iCal format, as well as in a public Google calendar. https://factba.se/topic/calendar
  • iOS App. This grew out of the consolidated White House feed we did, so everyone can monitor all the White House’s social feeds, website and email list to the press in one spot. We were asked for realtime alerts. Then we thought about an app. Then I said “Hey, how hard can it be to learn to code an iOS app?” Seven days later, with about four hours of sleep total and three cases of Diet Coke, the keyword-friendly-named “Trump White House Consolidated News Release Feed” app was born. A whopping $0.99, which, after a year of use, means we lose money on the push alert costs. But it needed to be done. http://apple.co/2nEVN7Y

Whew, that’s quite a bit for an update. One more piece of news…

Open Data Access. After a fair amount of discussion, we’ve decided to pursue freely distributing the entire Trump dataset via APIs. This will provide data access to:

  • Complete Transcript Library (3MM+ words) + Meta Data
  • The live Trump Twitter Archive
  • The complete screenshot library of his @realdonaldtrump feed
  • Financial records in data form and mapped to company holdings
  • H1B Filings
  • Court Records

We’ve already started doing that with our live feeds and calendars. Anyone who wants to data mine or come up with new ways of using the data will be free to do so.

We need to get the infrastructure in place, and that may take a couple of weeks, but we’ll have managed public APIs that let you get some, or all, of the data for public use, on the condition the work product is shared publicly as well.

The live White House feed is available freely now as:

The President’s Schedule is available similarly as:

That’s enough of an update for today. Onward.

Factba.se v1.8 – We’re almost to v2

We’ll get to the v2 release. But first, a couple of notes:

1. We’ve been a bit overwhelmed by the requests to assist — voluntarily — with the site and information collection. If we’ve been slow to follow up, see #3 below.

2. We pushed live an internal tool that we think would be useful. One of our unexpected challenges was the lack of a centralized place to gather new information. Updates can appear on Twitter, Facebook, YouTube or the White House site. To that end, we centralized a realtime feed that pulls together:

  • Whitehouse.gov
  • Facebook (DonaldTrump, POTUS, Whitehouse)
  • Instagram (Whitehouse)
  • YouTube (WhiteHouse)
  • Twitter (realDonaldTrump, WhiteHouse, POTUS, VP, Mike_Pence, SeanSpicer)
  • The White House press distribution list (immediate release only)

…and puts it here: https://factba.se/topic/latest . This is the same feed we use to monitor and add all new public statements. The social feeds update realtime. The White House and email are every 60 seconds. You can also plug it in to RSS, hit the JSON directly, or follow it live on the robotic @FactbaseFeed
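If you want to do something similar with your own sources, the core of it is just merging several per-source lists into one newest-first stream. Here is a minimal sketch, assuming each item is a dict with source, id and published keys. Those field names are made up for illustration and are not the actual shape of our JSON.

```python
import heapq
from datetime import datetime, timezone


def merge_feeds(*feeds):
    """Merge per-source item lists into one newest-first list, dropping repeats."""
    # heapq.merge needs every input already sorted in the requested order.
    ordered = [
        sorted(feed, key=lambda item: item["published"], reverse=True)
        for feed in feeds
    ]
    seen = set()
    merged = []
    for item in heapq.merge(
        *ordered, key=lambda item: item["published"], reverse=True
    ):
        key = (item["source"], item["id"])  # hypothetical identity fields
        if key not in seen:
            seen.add(key)
            merged.append(item)
    return merged


def make_item(source, item_id, hour):
    """Small helper for building example items with a UTC timestamp."""
    published = datetime(2017, 3, 1, hour, 0, tzinfo=timezone.utc)
    return {"source": source, "id": item_id, "published": published}
```

Polling the slower sources every 60 seconds and feeding each batch through merge_feeds would keep the combined list in order without re-listing items you have already seen.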

If you see a source missing, just let us know. As near as we can tell, it’s the only source that monitors everything coming out of the WHPO from all sources.

3. A bit of personal news. Since the election, Factba.se has become an increasing focus of my life in particular. To that end, I left my day job last week as Vice President / Entrepreneur-in-Residence at U.S. News and World Report to dedicate more time and focus on the platform and content. Based on the traffic, it’s getting regular, repeat use in newsrooms. We hope to become only more valuable as time goes on. This includes getting back to the folks in regard to what we need (mostly: tracking down video for documents).

And if you know a good project manager who could take some time to yell at me daily to stop obsessing on minor details, please send them my way… or just randomly call and yell at me. The 120 Jira tickets aren’t going down fast enough :-).

Onward.

@BillFrisch

 

v1.7 Data Is Easy. Messy Data is Hard

“Hey, I wonder if the President ever filed for H-1Bs?” sayeth I this morning. How hard can that be to track down?

Tracking down? Not so bad.

Gathering records that were originally OCR’ed from fax (2001-2006)? A helpful Department of Labor uploading neat data… in Excel… 500,000 rows at a time?

So, a simple data mining exercise ended up being about six hours of glorious frustration dealing with 200+MB XLSX files to free the data into a database. Oh, and please change the field names every two years. And rearrange the columns while you’re at it.
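One way to survive renamed fields and shuffled columns is to map every header variant to a canonical name before loading a row. The sketch below is only an illustration: the header variants are hypothetical stand-ins, not the Department of Labor’s actual field names, and reading the XLSX itself is left out.

```python
import re

# Hypothetical header variants; the real field names are not listed here.
CANONICAL = {
    "employer_name": ("employer_name", "employer", "name_of_employer"),
    "job_title": ("job_title", "title", "position_title"),
    "wage_from": ("wage_from", "wage_rate_from", "salary_min"),
}
LOOKUP = {
    alias: field for field, aliases in CANONICAL.items() for alias in aliases
}


def normalize(header):
    """Lowercase a header and squash punctuation and spaces to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", header.strip().lower()).strip("_")


def remap_row(headers, row):
    """Return a dict keyed by canonical field names, whatever the column order."""
    record = {}
    for header, value in zip(headers, row):
        field = LOOKUP.get(normalize(header))
        if field is not None:  # unknown columns are dropped
            record[field] = value
    return record
```

Because rows are keyed by name rather than position, a file that rearranges its columns loads the same way as one that does not.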

So, at 4:50 am ET, the data is free, in a searchable database… all 6.1MM records since the start of the millennium (h/t to FLC for archiving the older records pre-2008) and all 564 companies checked, including for rough typos, then hand checked. And the answer, it turns out, is 162 H-1Bs filed for 288 open positions.
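The “rough typos” pass can be approximated with fuzzy string matching from Python’s standard library. This is a sketch under assumptions, not the exact check we ran: the suffix list and the 0.85 cutoff are arbitrary choices, and anything it returns is only a candidate that still needs the hand check.

```python
import difflib
import re


def clean(name):
    """Uppercase, drop punctuation and common suffixes so variants compare equal."""
    name = re.sub(r"[^A-Z0-9 ]+", " ", name.upper())
    name = re.sub(r"\b(LLC|INC|CORP|LP|LTD)\b", " ", name)
    return " ".join(name.split())


def candidates(employer, known_companies, cutoff=0.85):
    """Known company names whose cleaned form is close to the employer field."""
    cleaned = {clean(company): company for company in known_companies}
    hits = difflib.get_close_matches(
        clean(employer), list(cleaned), n=5, cutoff=cutoff
    )
    return [cleaned[hit] for hit in hits]
```

Running each employer name in the filings through candidates against the company list narrows millions of rows down to a shortlist a human can actually read.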

Note we included Eric Trump’s winery, since it appears President Trump was involved in the organization. Also included are some hotels that license President Trump’s name, as standard licenses from the Trump Organization involve upholding standards set by the organization.

Expect a flurry of releases in the coming days. The last week was spent on the back end refining the composite model for better transcription. Also, we’re archiving and internalizing video as a backup, as videos of Trump campaign speeches have been disappearing from YouTube.

Also, as a side note, we’d love more information on the property at 2265 Aragon St, Sebring, Florida. It’s a 0.25 acre parcel of undeveloped swampland, owned by President Trump since July, 2005. It doesn’t appear on either of his Form 278e financial disclosure forms, but taxes are current as of November, 2016.

Within the next seven days, we should be back on track with speeches and statements automatically appearing on the site. We wanted to make sure that whatever we did, we wouldn’t run into video links dying on us again.

Until then, the salary ranges and positions for the H-1Bs make for an interesting read.

v1.6 – This is Vacation?

Yeesh, okay. Got it in just under the wire before inauguration. Quick note but the basics:

  • Added about 50 more hours of interviews, including interviews from the past two weeks, press conferences, 18 new stump speeches, and three paid speaking transcripts, including Australia.
  • Up to date as of the Lincoln Memorial concerts on 19 January
  • Margaret is better at transcribing. You’ll see this in a lot of the new uploads.
  • We think we have the feed set for the press office transcripts and videos… we’ll know in nine hours
  • Smarter search: searches for things like “deleted tweets” now surface the right topic pages.

Thanks for your patience. Back from vacation Sunday, and full steam on 1.7.