Saturday, April 13, 2013

1st degree web search intent

25 seconds to life in Search Hell. Without the possibility of parole, of course.

You use web searches all the time, and your results vary. Sometimes, you find exactly what you want, and you find it in the first few items of the "query results" list.

Sometimes, you spend hours and give up in a state of homicidal frustration.

Given that running a search engine costs a lot of money and that the results are worth so much money that the only correct term is "a shitload of money", there is a big industry of people who try to get "the best" search results.

They are the search engine optimization folks. Only a decade ago, they were really only glorified typists entering keywords. In the early web days, the more keywords, the more likely your page was listed higher up on the search results page. Much of the then small industry engaged in very shady practices, stuffing unrelated keywords everywhere.

Then scientists started to take note of the problem and came up with something called "user intent".

The seminal paper "Determining the user intent of web search engine queries" says:
Some navigational queries were quite easy to identify, especially
those queries containing portions of URLs or even complete URLs.
We also classified company and organizational names as
navigation queries, assuming that the user intended to go to the
Website of that company or organization. We also noted that most
navigation queries were short in length and occurred at the
beginning of the user session. Identification of transactional
queries was primarily via term and content analysis, with
identification of key terms related to transactional domains such as
entertainment and ecommerce. With the relatively clear
characteristics of navigational and transactional queries,
information queries became the catch-all by default.

Whatever you search for, they will most likely classify it as "informational" (80% of the time), because that is the bucket for everything that does not look navigational or transactional.
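To make the catch-all visible, here is a minimal sketch of the rules the quote describes. It is not the paper's actual classifier: the term lists, the URL pattern, and the function name classify_query are made up for illustration.

```python
import re

# Hypothetical term lists, for illustration only. The paper's real
# lists of key terms are not reproduced here.
TRANSACTIONAL_TERMS = {"buy", "download", "order", "tickets"}
KNOWN_ORGANIZATIONS = {"ebay", "amazon", "ibm"}

# Rough guess at "portions of URLs or even complete URLs".
URL_PATTERN = re.compile(r"(https?://|www\.|\.(com|org|net|edu|gov)\b)", re.IGNORECASE)


def classify_query(query: str) -> str:
    """Label a query as navigational, transactional, or informational."""
    text = query.strip().lower()
    words = set(re.findall(r"[a-z0-9]+", text))

    # URL fragments and company or organization names count as navigational.
    if URL_PATTERN.search(text) or words & KNOWN_ORGANIZATIONS:
        return "navigational"

    # Key terms from transactional domains count as transactional.
    if words & TRANSACTIONAL_TERMS:
        return "transactional"

    # Everything else falls into the catch-all.
    return "informational"


for q in ["www.ebay.com", "buy hot dog", "symptoms of flu"]:
    print(q, "->", classify_query(q))
```

Anything the first two rules do not catch ends up as "informational", which is one plausible reason that bucket is so large.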

I personally like to use the example of a county fair to try and explain some of this, although a county fair does not offer the variety of the internet (a county fair with a large porn offering would be a problem for some folks).

The idea is: there is lots of different stuff.
You arrive at the county fair's gate (the search engine) and need to figure out where to go and what to do.

If you know that the hot tub company X is at the county fair, you get directions, and off you go. That would be a navigational query.

If you want to eat something, you can ask where you can buy a hotdog or deep fried twinkies. The answer should be a list of food vendors. That's a transactional query.

If you want to look at the music schedule for the main stage, you'll get a pamphlet or directions to the location, maybe a signpost, where it is posted. If you are interested in crafts in general, you'll get directions to the crafts area. That's an informational query.

Now imagine that the gatekeeper has to read new pamphlets and brochures during his break because the fair changes a lot. So, he is not always up to date, and he sometimes has difficulty understanding you and sends you to the wrong place. He may give you a huge list, and you get frustrated.

Imagine now that right next to your county fair, there is a Spanish county fair, but they have fewer personnel at the gate, and the fair visitors may be inclined to do more looking around during a hot afternoon. While the gatekeeper speaks pretty good Spanish (let's say the county fair is in Colorado), some of his grammar can be off, sending you, for example, to a dog show instead of a hot dog vendor.

Back now to a more technical discussion. The SEO companies, the search engine people, and the researchers are outright obsessed with your "intent". 
They process all these search query logs and try to make sense of them in terms of what you want to achieve when you enter a certain phrase.

Somehow, somebody has managed to convince some of the folks that you, dear user, may want to see an image of a person or thing in the search results list even if you do not enter anything that specifies this.

This somebody has come up with a list of things for which "the majority of users would prefer an image". Our post "Crap in the crowd" shows how badly conceived that list is.

There are some very limited categories for which "prefer images" would be useful or even outright perfect. But the category list we have seen has so many "prefer images" items that it is utter nonsense. Someone has totally missed that the idiom "a picture is worth a thousand words" needs to be applied in moderation.

The problem with using query logs to figure out the "intent" is that you do not know if a user visited a certain site that has images which would match the query, and even then, you don't know if the user wanted the images more than text on the page.

Of course, you can be sneaky and give the world a browser that records the websites users visit from a query result link. Even then, except for certain narrow categories, like, yes, porn, "image intent" or "non-image intent" is way unclear.

Ask users directly?

If you ask a user directly, "would you prefer an image or only text in the search result for an illness", I will bet the famous Susan B. Anthony dollar (minus shipping and handling) that the answer will be "an image". If you ask for clarification, you will find that the users first want the symptoms and then an image.

We tried it.

Saying "the majority would like to see an image" in the search results list also ignores the fact that the major search engines have a separate "images" option, and maybe even "videos" and "phone numbers" options.

The money spent on tagging way too much stuff as "prefer image" would be so much better spent on building up better language processing capabilities for the "non-English" part of the Web. See our post "German web searches suck" for some more information on this.

The term "user intent" is here to stay and will continue to be hotly debated as long as mind reading is not implemented.


Bonus for reading all this
Imagine there are many humans, call them "users", who want to search for something called "cars". In order to give them the best results, you want to find the "number of wheels intent".
Instead of asking all the users, why not go through the brochures of a bunch of car makers and determine how many wheels most cars have?
You can use this value to make a pretty educated guess about the "number of wheels intent" of your users.
Same for "image intent" and websites for specific categories.
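For the code-minded, a minimal sketch of that brochure count. The brochure data and the function name most_common_value are invented for illustration; nothing here comes from a real data set.

```python
from collections import Counter


def most_common_value(observations):
    """Return the most frequent value and its share of all observations."""
    if not observations:
        raise ValueError("need at least one observation")
    value, count = Counter(observations).most_common(1)[0]
    return value, count / len(observations)


# Made-up numbers: wheels counted in seven hypothetical car brochures.
wheels_per_brochure = [4, 4, 4, 3, 4, 6, 4]
wheels, share = most_common_value(wheels_per_brochure)
print(f"guess: {wheels} wheels ({share:.0%} of brochures)")

# Same idea for "image intent": label what dominates each website in a
# category (labels are hypothetical) and take the most common one.
illness_sites = ["text", "text", "image", "text"]
print(most_common_value(illness_sites))
```

The share tells you how confident the guess is. A category where the winning label barely beats the runner-up is probably not one to tag as "prefer image".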

One more thing:
I know, the irony is we spent decades getting to "what does the user want", and in this specific instance, we may for once know better than the users.


