Topic-Analysis-NSA-Archive

This project runs a topic analysis on the NSA document archive using Latent Dirichlet Allocation (LDA).

Data

The data (retrieved around 2014-04-05 19:23 CET) is publicly available from the American Civil Liberties Union (ACLU):

https://www.aclu.org/nsa-documents-search

At the time of retrieval, the archive provided the following 213 documents (PDF):

http://paste.the-compiler.org/index.php/view/8526681e (raw text)

Results

So far, after converting the documents to raw text with pdftotext and removing stop words, the following topics emerge:

rel usa secret top fvey data target comint si analytic<br /> nsa br top metadata fisa number compliance court order analysts<br /> nsa intelligence national information al security activities classified communications declaration<br /> information court order nsa application authorized metadata records intelligence states<br /> government information section collection intelligence security states united privacy data<br /> games game gaming world virtual influence online trends fouo activities<br /> si ts nf noforn top secret metadata ras nsa id<br /> court judge access review committee issues report act rules house<br /> ll ii infonnation telephone se data el en ed es<br /> intelligence states united foreign communications person information general activities persons

The list shows ten topics with ten words each, one topic per line.
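
The source does not say which LDA implementation produced these topics, so the following is only a minimal sketch of the general approach, not the code behind the results above. It assumes the PDFs have already been converted to raw text, and it uses a tiny hypothetical stop-word list, toy example texts that are not taken from the archive, and a plain collapsed Gibbs sampler. The names `tokenize`, `lda_gibbs`, and `STOP_WORDS`, along with all parameter values, are illustrative assumptions.

```python
import random
import re

# Tiny illustrative stop-word list (assumption); a real run would use a larger one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "by", "with"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def lda_gibbs(docs, num_topics, num_words=10, iterations=200,
              alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling and return the top words per topic."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)
    corpus = [[word_id[w] for w in doc] for doc in docs]

    doc_topic = [[0] * num_topics for _ in corpus]            # counts per (document, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # counts per (topic, word)
    topic_total = [0] * num_topics                             # tokens assigned to each topic

    # Start from a random topic assignment for every token.
    assign = []
    for d, doc in enumerate(corpus):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(corpus):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Rank words by their count in each topic; on tiny corpora the tail of a
    # list may contain words that were never assigned to that topic.
    topics = []
    for k in range(num_topics):
        ranked = sorted(range(vocab_size), key=lambda w: topic_word[k][w], reverse=True)
        topics.append([vocab[w] for w in ranked[:num_words]])
    return topics


if __name__ == "__main__":
    # Toy texts for illustration only.
    texts = [
        "The court order authorized the collection of metadata records",
        "The judge reviewed the court order and the application",
        "Online games and virtual worlds in gaming",
        "Gaming trends in virtual online worlds",
    ]
    docs = [tokenize(t) for t in texts]
    for words in lda_gibbs(docs, num_topics=2, num_words=5):
        print(" ".join(words))
```

For a corpus of 213 documents, a maintained library implementation would likely be a better choice than this pure-Python sampler. The sketch is meant only to show how stop-word removal and per-topic word ranking lead to a listing of the kind shown above.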