Today Google announced that it has developed an algorithm to index Flash content (text only, not video or images). The news is dominating Flash news sources around the web, and the reaction is mixed.
Many bloggers are criticizing Google, claiming that no algorithm could possibly extract the text from SWF files that load their text from external sources (such as XML), because there's no way to know the format of the XML documents being transferred. But who cares? I still don't get why people assume that's how the algorithm works. Google has stated that they can crawl externally loaded SWFs (although they don't associate those with the original SWF when indexing, which is a significant problem for sites that load multiple SWFs for navigation), so they must already be monitoring the HTTP requests a SWF makes, and they can do the same with XML files. Google doesn't need to know how the XML file is parsed… the Flash document will do that for them. The crawler can simply let the Flash load the XML file and then read the values of the resulting text fields. That's probably why they say, "To protect your material from being crawled, convert your text to images when possible".
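To make that concrete, here's a hypothetical sketch of the pattern everyone is arguing about (the file name menu.xml, its layout, and the class name MenuLoader are all made up for illustration). The point is that the parsing logic lives in the SWF itself; once it runs, the indexable text is just sitting in a text field.

```actionscript
package {
    import flash.display.Sprite;
    import flash.events.Event;
    import flash.net.URLLoader;
    import flash.net.URLRequest;
    import flash.text.TextField;

    public class MenuLoader extends Sprite {
        private var label:TextField = new TextField();

        public function MenuLoader() {
            addChild(label);

            // A crawler can observe this HTTP request and fetch menu.xml itself.
            var loader:URLLoader = new URLLoader();
            loader.addEventListener(Event.COMPLETE, onLoaded);
            loader.load(new URLRequest("menu.xml"));
        }

        private function onLoaded(e:Event):void {
            // The SWF does the parsing. Assuming menu.xml looks like
            // <menu><item>About Us</item></menu>, the text field now holds
            // "About Us" -- plain text that an indexer could read after the fact.
            var xml:XML = new XML(URLLoader(e.target).data);
            label.text = String(xml.item[0]);
        }
    }
}
```

No knowledge of menu.xml's schema is needed anywhere on the crawler's side; it only has to let the movie run and look at what the text fields end up containing.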
The only problem I see is with text fields that are highly dynamic. Maybe the algorithm only goes through static text fields? I see no way for a text field that displays random letters (as a visual effect) to be indexed meaningfully by any algorithm.
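Here's a contrived timeline-script sketch of the kind of effect I mean; the field's contents change every frame, so any snapshot an indexer takes is just noise:

```actionscript
import flash.events.Event;
import flash.text.TextField;

var scramble:TextField = new TextField();
addChild(scramble);

addEventListener(Event.ENTER_FRAME, function(e:Event):void {
    var s:String = "";
    for (var i:int = 0; i < 8; i++) {
        // Pick a random uppercase letter (char codes 65-90).
        s += String.fromCharCode(65 + Math.floor(Math.random() * 26));
    }
    // Whatever an indexer reads here is meaningless decoration, not content.
    scramble.text = s;
});
```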
Here’s my prediction: community tagging. Just like the Google Image Labeler game, Google will ask users to tag/label Flash documents that its parser can’t index correctly. Humans would be the perfect computational tool for this kind of problem. Yes, there are millions of SWFs out there that need to be indexed, but we really just want the major ones parsed.