Hacking Oregon's Hidden Political Connections
A TotalGood project
v0.0.4
Material
- Data
- Code: bit.ly/hackor-notebooks
- RFP
- Hack Oregon by Cat
- Behind the Curtain by Ken
- Force Directed Graph
Agenda:
For Hack Oregon we explored the data in unusual ways
- Pandas as a DB
- Find Connections (FKs, PKs, other DBs)
- TFIDF on a DB table
- TFIDF similarity
- Similarity Similarity
Intro: 1
Pandas as a relational DB
- Identify foreign keys automatically
- Use FKs to do join SQL-like queries
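A minimal sketch of both bullets, assuming two toy tables. The table names, column names, and the `find_foreign_keys` helper are hypothetical and are not taken from the Hack Oregon schema. The heuristic: a column is a foreign-key candidate if all of its values appear in a unique column of another table.

```python
import pandas as pd

# Hypothetical toy tables; the real Hack Oregon schema is not shown in these slides.
committees = pd.DataFrame({
    "committee_id": [1, 2, 3],
    "name": ["PAC A", "PAC B", "PAC C"],
})
transactions = pd.DataFrame({
    "tran_id": [10, 11, 12, 13],
    "filer_id": [1, 1, 3, 2],
    "amount": [100.0, 250.0, 75.0, 20.0],
})


def find_foreign_keys(child, parent):
    """Guess (child_col, parent_col) pairs where parent_col is unique
    and every non-null child value appears in it."""
    pairs = []
    for pcol in parent.columns:
        if not parent[pcol].is_unique:
            continue  # a primary-key candidate must have unique values
        pk_values = set(parent[pcol].dropna())
        for ccol in child.columns:
            values = set(child[ccol].dropna())
            if values and values <= pk_values:
                pairs.append((ccol, pcol))
    return pairs


fks = find_foreign_keys(transactions, committees)

# SQL-like LEFT JOIN on the detected key pair.
joined = transactions.merge(
    committees, left_on="filer_id", right_on="committee_id", how="left"
)
```

On real data this heuristic will produce false positives (small integer columns often look like keys), so the candidates still need a human review.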
Intro: 2
Intersect large sets
- AM emails in BehindTheCurtain DB?
- 10 GB MySQL dump → dozens of CSVs
- Load 50M emails efficiently
- Intersect emails with public records
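One way to sketch the intersection step, assuming the dump has already been split into CSVs with an `email` column (the column name, the sample rows, and the `normalized_emails` helper are hypothetical). The idea is to hold only the smaller set in memory and stream the larger file past it one row at a time.

```python
import csv
import io


def normalized_emails(lines, column="email"):
    """Yield lower-cased, stripped emails one row at a time (never the whole file)."""
    for row in csv.DictReader(lines):
        email = (row.get(column) or "").strip().lower()
        if "@" in email:
            yield email


# Hypothetical stand-ins for one dumped CSV and a public-records table.
dump_csv = io.StringIO(
    "id,email\n1,Alice@Example.com\n2,bob@example.org\n3,not-an-email\n"
)
public_records = {"alice@example.com", "carol@example.net"}

# Keep the smaller set in memory and stream the larger one past it.
matches = {email for email in normalized_emails(dump_csv) if email in public_records}
```

With real files, `dump_csv` would be an open file handle per CSV, and the loop would run once per file.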
Intro: 3
Restructure a DB
- Why?
- How?
- Restructure (TFIDF)
- Raw python
- Sklearn
Intro: 4
TFIDF to detect similarity between records
- cluster Oregon PACs by their “mission”
- d3 force-directed graph of PAC similarity
- compare to the directed graph (DG) of financial transactions
Intro: 5
Similarity between similarity matrices
SAY
(TFIDF)
vs.
DO
(Transactions)
3. Restructure DB
Why?
- Squish fields into a string?
- Vectorizing later anyway, right?
Because
- Dimensions are vaguely defined/understood
- Information “smear” across fields/dimensions
3. Restructure DB: How?
- Ignore numbers/dates
- Stringify each field
- Stem words
- Ignore stop words (are you sure?)
- Concatenate
- Split
- Vectorize/Count
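The steps above can be sketched in raw Python as follows. The stop-word list, the sample row, and `crude_stem` are hypothetical; the suffix stripper is only a stand-in for a real stemmer such as Porter's.

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "for", "and", "a", "to"}  # hypothetical ignore list


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def record_to_counts(record):
    """Turn one DB row (a dict) into a bag-of-words Counter."""
    # Stringify each field and concatenate into one string.
    text = " ".join(str(value) for value in record.values())
    # Split into alphabetic tokens; numbers and dates drop out here.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Ignore stop words, stem the rest, and count.
    return Counter(crude_stem(t) for t in tokens if t not in STOPWORDS)


row = {
    "name": "Friends of Oregon Schools",
    "purpose": "Supporting schools",
    "filed": "2014-05-01",
    "total": 1250.0,
}
counts = record_to_counts(row)
```

Note that the field boundaries are discarded on purpose: "Schools" in the name and "schools" in the purpose both count toward the same term.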
3. Restructure DB: TFIDF
- Must be sparse to fit in memory
- Explicit Python builtins: Counter, defaultdict
- sklearn
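A sparse TFIDF sketch using only `Counter` and `defaultdict`. The documents are hypothetical, and the weighting used here (term frequency times `log(N / df)`, unsmoothed) is one common variant, not necessarily the one used in the project notebooks.

```python
import math
from collections import Counter, defaultdict

# Hypothetical tokenized documents (e.g. one per PAC record).
docs = {
    "pac_a": ["school", "fund", "school"],
    "pac_b": ["timber", "fund"],
    "pac_c": ["school", "teacher"],
}

# Document frequency: number of documents containing each term.
doc_freq = defaultdict(int)
for tokens in docs.values():
    for term in set(tokens):
        doc_freq[term] += 1

n_docs = len(docs)

# Sparse TFIDF: each document stores only the terms it actually contains.
tfidf = {}
for name, tokens in docs.items():
    tf = Counter(tokens)
    tfidf[name] = {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in tf.items()
    }
```

sklearn's `TfidfVectorizer` does the same job with a SciPy sparse matrix; by default it smooths the IDF and L2-normalizes each row, so its numbers will differ from this sketch.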
4. TFIDF Similarity
Large dimensions are scary
- Everything is far apart
- Euclidean distance is meaningless
- Our brains fail
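A small experiment that illustrates "everything is far apart", assuming uniformly random points in the unit cube (a simplification; TFIDF vectors are sparse and not uniform). The `relative_contrast` helper is hypothetical. It measures how much farther the farthest point is than the nearest one, relative to the nearest; that ratio tends to shrink as the dimension grows.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest Euclidean distance from a random
    query point to random points in the unit cube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


contrast = {dim: relative_contrast(dim) for dim in (2, 100, 1000)}
for dim, value in contrast.items():
    print(f"dim={dim:5d}  relative contrast={value:.3f}")
```

When the contrast is close to zero, "nearest neighbor" by Euclidean distance carries very little information.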
4. TFIDF Similarity
Vector distances
- L_1, L_2, [L_0, L_inf, L_sup]
- Fractional Norm
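For reference, the norms listed above are all instances of one formula. A minimal sketch (the function name and the example vectors are illustrative): for `p < 1` the result is a "fractional norm", which is not a true metric but is sometimes suggested for high-dimensional data.

```python
def minkowski_distance(u, v, p):
    """L_p distance between two equal-length vectors.

    p = 1 is Manhattan, p = 2 is Euclidean, p = inf is the max norm,
    p = 0 counts differing coordinates, 0 < p < 1 is a fractional norm.
    """
    diffs = [abs(a - b) for a, b in zip(u, v)]
    if p == float("inf"):
        return max(diffs)
    if p == 0:
        return sum(1 for d in diffs if d != 0)
    return sum(d ** p for d in diffs) ** (1.0 / p)


u, v = [0.0, 3.0, 0.0], [4.0, 0.0, 0.0]
l0 = minkowski_distance(u, v, 0)
l1 = minkowski_distance(u, v, 1)
l2 = minkowski_distance(u, v, 2)
l_half = minkowski_distance(u, v, 0.5)
l_inf = minkowski_distance(u, v, float("inf"))
```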
4. TFIDF Similarity
Cosine Similarity
(similarity is the inverse notion of distance; cosine distance is commonly defined as 1 - cosine similarity)
- Equivalent:
- Pearson correlation (when the vectors are mean-centered)
- v_1 dot v_2 (projection) of the length-normalized vectors
- Cosine of the angle between v_1 and v_2
- Bounded: [-1, +1]
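A minimal sketch of cosine similarity on sparse vectors stored as `{term: weight}` dicts. The example vectors are hypothetical, and returning 0.0 for an empty vector is a convention chosen here. Since TFIDF weights are non-negative, the values in practice fall in [0, 1].

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors ({term: weight} dicts)."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


pac_a = {"school": 2.0, "fund": 1.0}
pac_b = {"school": 1.0, "teacher": 1.0}
pac_c = {"timber": 3.0}

sim_ab = cosine_similarity(pac_a, pac_b)  # one shared term
sim_ac = cosine_similarity(pac_a, pac_c)  # no shared terms
sim_aa = cosine_similarity(pac_a, pac_a)  # identical vectors
```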
5. Similarity Similarity
Cluster Oregon PACs by their “mission”
- d3 force-directed graph of PAC similarity
- compare to the directed graph (DG) of financial transactions
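The slides do not say how the two similarity matrices were compared. One simple option, sketched here as an assumption, is to flatten the off-diagonal entries of each matrix and take their Pearson correlation. Both matrices below are made-up 3-PAC examples: "say" stands for TFIDF similarity and "do" for a transaction-based similarity.

```python
import math


def upper_triangle(matrix):
    """Entries above the diagonal of a square matrix, flattened row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical symmetric similarity matrices over the same three PACs.
say = [[1.0, 0.8, 0.1],
       [0.8, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
do = [[1.0, 0.6, 0.0],
      [0.6, 1.0, 0.3],
      [0.0, 0.3, 1.0]]

agreement = pearson(upper_triangle(say), upper_triangle(do))
```

A value near +1 would suggest that PACs with similar stated missions also behave similarly in their transactions. Because matrix entries are not independent, a permutation test (the Mantel test) is the usual way to judge whether such a correlation is significant.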
Thank You!
- Thunder
- Grimm
- Cat and Hack Oregon
- Pizza, data, and a cause!
- Jeremey Tanner
- PyDx talk “Python for Evil”
- Total Good
- Summer 2015 grant
- Open RFP!
Written on October 27, 2015