Wednesday, August 20, 2014

The alleged "missing" program 4 spy keyword search

Strolling around the Twitter landscape, we stumbled upon an argument regarding mass keyword searches on data hoovered up by spies.

The tweet in question says:
I have not come across a US program for massive keyword search.

We would have overlooked the sentence, were it not for the person who wrote it. The man is something of a mass surveillance denier who follows the narrative that collection is not surveillance and that spies spy, period.

Against this background, the tweet takes on real meaning. First, the tweet author has a good sense of limits: "I have not come across" clearly leaves open the possibility that such a program might exist. Avoiding a statement of utter certainty one way or the other deserves credit, especially when it comes from an engineer.

Now, we can infer that he really means to say a "US program name or description".

What reasons might exist for not having come across a US program [name or description] for massive keyword search?

The easy one: It exists, but we have neither a name nor any confirmation of its existence.

The other one: There is no such program despite the fact that the activity (massive search for keywords) is performed millions of times an hour.

Keyword search is as fundamental as, say, inserting a record into a database, and no engineer on the planet will try to sell "insert into table" as an activity that merits its own program name. Well, except maybe in North Korea.

Keyword search is so fundamental that commands like grep have been part of the operating system since the iron age* of computing. You can safely assume that the first "keyword search" was nothing but a glorified grep.
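To illustrate how little machinery a "glorified grep" needs, here is a minimal sketch in Python. The function name `grep_keywords` and the sample lines are our own inventions for illustration; nothing here is taken from any actual system.

```python
import re

def grep_keywords(lines, keywords):
    """Return (line_number, line) pairs for lines that contain any keyword."""
    if not keywords:
        return []  # an empty pattern would match every line
    # One case-insensitive alternation; re.escape keeps the keywords literal.
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [(n, line) for n, line in enumerate(lines, start=1) if pattern.search(line)]

# Hypothetical sample data, made up for this post.
sample = [
    "Lunch at noon?",
    "The conference starts on Monday.",
    "Bring the slides.",
]
hits = grep_keywords(sample, ["conference", "bomb"])
for number, line in hits:
    print(number, line)
```

A dozen lines, no program name required.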

So, this covers "plain text". What about more complex documents? Their text content gets extracted into more or less "plain text", then searched. No "program" needed; it is an integral part of processing. This applies to more complex "new" formats like HTML and XML as well.
Optical character recognition has been used for decades to extract text from paper documents after scanning.
Audio captures (telephone conversations, for instance) follow the same pattern: "speech to text" is a consumer-level technology today, not worth a program name.
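The "extract, then search" pattern for HTML can be sketched with nothing but Python's standard library. The class `TextExtractor`, the helper `html_to_text` and the sample page are hypothetical names and data of our own; a real pipeline would presumably handle far messier input.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_text(markup):
    """Reduce an HTML document to more or less plain text."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)

# Hypothetical sample page, made up for this post.
page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Agenda</h1><p>The <b>conference</b> starts Monday.</p></body></html>"
)
text = html_to_text(page)
found = "conference" in text.lower()
print(text, found)
```

Once the markup is gone, the search itself is the same plain-text search as before.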

Advanced language processing beyond keywords would include recognizing common spelling errors, handling misplaced punctuation, and dealing with language-specific constructs such as vowel harmony in Hungarian, the six cases of Russian, or the different writing systems of Japanese.
There are certainly efforts under way to gauge the sentiment of text and speech, and just as certainly they don't work as well as you would want them to when you try to be funny and write an email saying "on va exploser la conference", as one Canadian famously did recently.
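Of the items above, tolerance for spelling errors is the easiest to sketch. The following uses `difflib` from Python's standard library as a stand-in technique; the function name `fuzzy_keyword_hits`, the cutoff of 0.8 and the sample sentence are our assumptions, and it makes no attempt at case endings, vowel harmony or sentiment.

```python
import difflib
import re

def fuzzy_keyword_hits(text, keywords, cutoff=0.8):
    """Map each keyword to the words in text that closely resemble it."""
    words = re.findall(r"\w+", text.lower())
    hits = {}
    for keyword in keywords:
        # get_close_matches ranks candidates by similarity ratio (0.0 to 1.0).
        matches = difflib.get_close_matches(keyword.lower(), words, n=3, cutoff=cutoff)
        if matches:
            hits[keyword] = matches
    return hits

# Hypothetical sample sentence with a deliberate misspelling.
hits = fuzzy_keyword_hits("See you at the conferense on Monday", ["conference"])
print(hits)
```

Lowering the cutoff catches more typos and more false alarms, which is roughly the trade-off any such matching faces.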

It is anybody's guess if there is a program for advanced linguistic analysis with a name as catchy as BLOCKBUSTER or YODEL.

Who knows what the North Koreans or the Japanese are up to in this regard, but our best guess for most Western services would be: no separate program deemed worthy of CATCHY NAME TO IMPRESS BOSSES**.




* Iron age of computing as used in this post refers to the days when you took a hammer to your computer after it failed to boot for the tenth time in a row.

** Because spies are subject to the needs of marketing to their hierarchy and to the intelligence community.

[parsing note] We took great care in formulating the first sentence of the post and expect any parsing algorithm to acknowledge this. The finesse of "strolling" is derived from "strolling" containing "trolling". "Hoovered up" is a straightforward description of the technique, but the discerning algorithm might detect a slight note of cross dressing, which may or may not be intentional - we cannot ask the writer of the sentence, sorry.
