A topic analysis of the documents below, using techniques from Latent Dirichlet Allocation (LDA).


The data (timestamp: \~ 2014-04-05 19:23 CET) is publicly available from the American Civil Liberties Union (ACLU):

For the time being, the following 213 documents (PDF) are provided: (raw text)


So far, after converting the documents to raw text with pdftotext and removing stop words, the following topics emerge:

rel usa secret top fvey data target comint si analytic<br />
nsa br top metadata fisa number compliance court order analysts<br />
nsa intelligence national information al security activities classified communications declaration<br />
information court order nsa application authorized metadata records intelligence states<br />
government information section collection intelligence security states united privacy data<br />
games game gaming world virtual influence online trends fouo activities<br />
si ts nf noforn top secret metadata ras nsa id<br />
court judge access review committee issues report act rules house<br />
ll ii infonnation telephone se data el en ed es<br />
intelligence states united foreign communications person information general activities persons

The output shows ten topics with ten words each, one topic per line.
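
The pipeline described above (pdftotext, stop-word removal, LDA) could be sketched roughly as follows. This is a minimal illustration, not the code used for the results shown here: the original does not say which LDA implementation or stop-word list was used, so the sketch assumes a plain collapsed Gibbs sampler in pure Python. The stop-word list, the helper names (`pdf_to_text`, `tokenize`, `lda_gibbs`, `top_words`), and the values of `alpha`, `beta`, and `iterations` are all hypothetical.

```python
import random
import re
import subprocess
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the list actually used is not given).
STOP_WORDS = {"the", "of", "and", "to", "in", "a", "is", "for", "that", "on", "by", "or", "be", "as", "with"}


def pdf_to_text(pdf_path):
    """Convert one PDF to raw text; the '-' argument makes pdftotext write to stdout."""
    result = subprocess.run(["pdftotext", pdf_path, "-"],
                            capture_output=True, text=True, check=True)
    return result.stdout


def tokenize(text):
    """Lowercase, keep alphabetic tokens of two or more letters, drop stop words."""
    return [w for w in re.findall(r"[a-z]{2,}", text.lower()) if w not in STOP_WORDS]


def lda_gibbs(docs, num_topics=10, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns one word-count Counter per topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # n(d, k): tokens in doc d with topic k
    topic_word = [Counter() for _ in range(num_topics)]  # n(k, w): times word w has topic k
    topic_total = [0] * num_topics                       # n(k): tokens with topic k
    assignments = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Take the token's current assignment out of the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n(d,t) + alpha) * (n(t,w) + beta) / (n(t) + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word


def top_words(topic_word, n=10):
    """Return up to n most frequent words per topic, skipping words whose count fell to zero."""
    return [[w for w, c in counts.most_common() if c > 0][:n] for counts in topic_word]
```

With a list of PDF paths in a hypothetical variable `pdf_paths`, the steps would chain as `docs = [tokenize(pdf_to_text(p)) for p in pdf_paths]`, then `top_words(lda_gibbs(docs, num_topics=10), n=10)` to get ten words for each of ten topics. A pure-Python sampler is slow on 213 documents; an established LDA library would be the practical choice, and the sampler here is only meant to make the steps concrete.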