February 8, 2016 · 5 min read
By Alexis Perrier
#####appliqué aux fils twitters.
Alexis Perrier @alexip
Data & Software, Berklee College of Music, Boston @BerkleeOnline
Data Science contributor @ODSC
Part I: Topic Modeling
Part II: Projet: followers sur twitter
Technique non-supervisée
1 document \(\Leftrightarrow\) plusieurs topics
1 topic \(\Leftrightarrow\) un ensemble de mots
La proportion des topics varie entre les documents
Approche vectorielle
Approche probabiliste, Bayésienne
Approche Neural Networks

aka Latent Semantic Indexing (LSI)
Un topic est une liste des probabilités des mots dans un vocabulaire donné.
LDA: La distribution des topics suit une loi de Dirichlet.
Details: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
Inférence bayésienne, Gibbs sampling, Chinese restaurant process
Python libraries
R packages
Java libraries: S-Space Package, MALLET
C/C++ libraries: lda-c, hlda c, ctm-c d, hdp
3 articles

Résultats pour le moins difficiles a interpreter

Franchement mieux
u'0.055*app + 0.045*team + 0.043*contact + 0.043*idea + 0.029*quote + 0.022*free + 0.020*development + 0.019*looking + 0.017*startup + 0.017*build',
u'0.033*socialmedia + 0.022*python + 0.015*collaborative + 0.014*economy + 0.010*apple + 0.007*conda + 0.007*pydata + 0.007*talk + 0.007*check + 0.006*anaconda',
u'0.053*week + 0.041*followers + 0.033*community + 0.030*insight + 0.010*follow + 0.007*world + 0.007*stats + 0.007*sharing + 0.006*unfollowers + 0.006*blog',
u'0.014*thx + 0.010*event + 0.008*app + 0.007*travel + 0.006*social + 0.006*check + 0.006*marketing + 0.005*follow + 0.005*also + 0.005*time',
u'0.044*docker + 0.036*prodmgmt + 0.029*product + 0.018*productmanagement + 0.017*programming + 0.012*tipoftheday + 0.010*security + 0.009*javascript + 0.009*manager + 0.009*containers',
u'0.089*love + 0.035*john + 0.026*update + 0.022*heart + 0.015*peace + 0.014*beautiful + 0.012*beauty + 0.010*life + 0.010*shanti + 0.009*stories',
u'0.033*geek + 0.009*architecture + 0.007*code + 0.007*products + 0.007*parts + 0.007*charts + 0.007*software + 0.006*cryptrader + 0.006*moombo + 0.006*book',
u'0.049*stories + 0.046*network + 0.044*virginia + 0.044*entrepreneur + 0.039*etmchat + 0.025*etmooc + 0.021*etm + 0.015*join + 0.014*deis + 0.010*today',
u'0.056*slots + 0.053*bonus + 0.052*fsiug + 0.039*casino + 0.031*slot + 0.024*online + 0.014*free + 0.013*hootchat + 0.010*win + 0.009*bonuses',
u'0.056*video + 0.043*add + 0.042*message + 0.032*blog + 0.027*posts + 0.027*media + 0.025*training + 0.017*check + 0.013*gotta + 0.010*insider'
245 timelines
Slides: alexisperrier.com
- https://github.com/alexisperrier/datatalks/tree/master/twitter
- https://github.com/alexisperrier/datatalks/tree/master/debates
- https://nbviewer.jupyter.org/github/alexisperrier/datatalks/blob/master/twitter/LDAvis_V2.ipynb
- https://alexisperrier.com/stm-visualization/index.html
Ref:
- topic modeling https://thesai.org/Downloads/Volume6No1/Paper_21-A_Survey_of_Topic_Modeling_in_Text_Mining.pdf
- lda: https://ai.stanford.edu/~ang/papers/nips01-lda.pdf
- pyLDAvis: https://github.com/bmabey/pyLDAvis
- stm: https://scholar.princeton.edu/files/bstewart/files/stmnips2013.pdf
- stm R: https://structuraltopicmodel.com/
- stmBrowser: https://github.com/mroberts/stmBrowser