Sunday, June 10, 2007

National Internet Safety Month: How well do Internet content filters work?

June is National Internet Safety Month. To show support for this cause, I will post a series of blogs addressing issues of Internet safety. Today's post discusses Internet content filters: we will get under the hood and perform a little testing. Finally, I will talk about Piffany's Internet Content and Safety Rating System. We have just begun beta testing it, but we are hopeful the service will be of immense use to families, schools, companies, and governments. First, a follow-up on my last post.

Follow-up: What does your IP address reveal about your location?

My first post (National Internet Safety Month: What does your IP address reveal about your location?) discussed how your IP address can be used to determine your geographic location. Larry Magid, founder of SafeKids.com, CBS News On-Air Technology Analyst, and co-author of MySpace Unraveled, asked me whether this is really a problem, i.e., are there any cases on record of a predator meeting up offline, nonconsensually, with a minor they met online? I couldn't find any, and it does look like most cases involve consent from the minor. If you know of any cases, leave a comment or drop me a line.

Internet Content Filters
I won't attempt to review all the options (there is a good review in Adam Thierer's blog), but I will discuss how, and how well, they work. The basic premise: before jumping to some location on the web, your browser or proxy checks a database, either local or online, to determine whether the site you requested is prohibited according to the parental or administrative controls you configured. For example, you might choose not to view web sites related to alcohol or pornography; your browser then checks whether the site you are requesting is on the list of known alcohol- or pornography-related web sites. Conceptually, these filters are fairly simple.
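
To make that concrete, here is a minimal sketch of the core lookup in Python. The blocklist entries and category names are made up for illustration; a real filter normalizes URLs far more carefully and ships a database with millions of entries.

```python
from urllib.parse import urlparse

# Hypothetical blocklist mapping hostnames to content categories.
BLOCKLIST = {
    "example-beer-reviews.com": {"alcohol"},
    "example-adult-site.com": {"pornography"},
}

# Categories the parent or administrator chose to block.
BLOCKED_CATEGORIES = {"alcohol", "pornography"}

def is_allowed(url: str) -> bool:
    """Return True if the browser should load this URL."""
    host = urlparse(url).hostname or ""
    # Strip a leading "www." so www.foo.com and foo.com match.
    if host.startswith("www."):
        host = host[4:]
    categories = BLOCKLIST.get(host, set())
    return not (categories & BLOCKED_CATEGORIES)

print(is_allowed("http://www.example-beer-reviews.com/stout"))  # False
print(is_allowed("http://example.org/homework"))                # True
```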

So how does a site get on the list? There are three major methods by which a web site might get on one of these "bad" lists:
  1. The owner or administrator for the web site added it (yeah right)
  2. A human not affiliated with the web site added it
  3. A web robot added it

Self-Evaluation
There are a lot of web pages on the Internet; call it 300 billion (an estimate that includes the deep web). This staggering number is the central challenge a content filter designer faces. A first guess at how to rate all these sites is simply to ask the people who own them to do a self-evaluation. As a first approximation this isn't bad, but if it's your only line of defense, you will likely be disappointed in the results. The leading organization using this approach is the Internet Content Rating Association (ICRA). ICRA's filter technology depends on site owners self-evaluating their sites through a questionnaire called a label generator. An example is the language tab, which offers four categories that can be checked off: abusive or vulgar terms; profanity or swearing; mild expletives; or none of the above. Because the questionnaire describes categories rather than specific words, the label generator intrinsically takes care of the multilingual, multicultural problem. ICRA even offers a customized Google search using their filter. I tried a popular word often associated with felines (I had to turn Google's filtering off to get the word through).
The results reveal that 1 of the 7 results shown (all pornography, by the way) is correctly labeled Surf:Adult. 3 of the 7 are not labeled at all. The disappointing part is that the remaining 3 of the 7 are incorrectly labeled Surf:All Ages, meaning the sites are labeled, under ICRA's system, as OK for all ages to view. Casual inspection of the description text reveals that they are not even close to OK for all ages; they are all blatant pornography. My word choice is not unrealistic for something a kid might type in, though I agree I could hardly have picked a more dangerously ambiguous word.

So that's only seven results analyzed. What happens with more sites? Google will only return 1,000 results for a query, so we have to restrict the analysis to those. Doing so, I found that 332 of the results were labeled at all, and of those, 67% were incorrectly labeled as safe for all ages. I verified the results by eye, without actually clicking through.

IMPORTANT: None of the results I have seen or discussed here were marked as checked by ICRA. The labels are voluntary self-evaluations by the content providers, not evaluations by ICRA employees. I am sure the ICRA folks have the best intentions; my only point is that, much of the time, self-evaluation is inaccurate as to content and safety. ICRA-checked means the content owner paid ICRA a fee ($32/URL last time I checked) to verify their self-evaluation. Presumably, had these content providers paid for that service, their labels would have been corrected. The customized Google search provided by ICRA is not intended as family-safe filtered search; it is just a way to check whether a site has been labeled or verified under ICRA's system.
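
As an aside, it is easy to check mechanically whether a page is self-labeled at all. ICRA labels were published as PICS metadata in a page's HTML head; the sketch below just detects a label, and the fetch-and-regex approach is my own crude shortcut (a real checker would parse the label's vocabulary, and the meta-tag attribute order here is an assumption):

```python
import re
import urllib.request

# Matches a PICS label meta tag, e.g.:
#   <meta http-equiv="PICS-Label" content='(PICS-1.1 ... )'>
PICS_META = re.compile(
    r'<meta\s+http-equiv=["\']PICS-Label["\']\s+content=["\'](.*?)["\']',
    re.IGNORECASE | re.DOTALL,
)

def get_self_label(url: str) -> str | None:
    """Fetch a page and return its PICS self-rating label, if any."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        head = resp.read(65536).decode("utf-8", errors="replace")
    match = PICS_META.search(head)
    return match.group(1) if match else None

print(get_self_label("http://example.com/") or "no self-rating label found")
```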

Conclusion: Self-evaluation is not accurate (less than 40% accurate in this test)

Third Party Human Evaluation
I don't have any really good data for evaluating third-party human review, so I used a close approximation: the Open Directory Project (ODP). Strictly speaking, I used the Google Kids Directory, but that is just the ODP plus Google's PageRank ordering. What makes this test inaccurate is the probably-high rate of self-submission to the directory. If it were 100% third-party submitted it would be a perfect test, but it probably isn't, and I have no way of knowing what percentage is self-submission. I myself once submitted a link to a paper I wrote on how school kids can use the Internet to learn physics better. I think I was honest in placing the paper where I did, though I am not so sure about the description I entered ("the best paper in the history of the world...").

This time there are far too many web pages for me to evaluate by eye (I crawled approximately 20,000 pages in the kids directory, skipping the International section). Instead, I used a context-evaluating robot to look for "bad" results, then verified those by eye, clicking through to the pages when necessary. Here's what I found:

The table gives the percentages of results in categories related to offensive or inappropriate material. A large percentage of the URLs were flagged as suspicious by the robot; however, a smaller fraction, on the order of 1%, definitively used unambiguous profane language. Is 1% too much for a kids' directory? That's a couple hundred sites. The categories are relatively straightforward. Strong profanity consists of those five or so words you absolutely could not get away with saying on TV. Moderate profanity is the usual stuff you hear on cable channels and occasionally on late-night TV. Mild profanity is the really light stuff that you wouldn't completely flip out over if your ten-year-old blurted one out. Racist language is, well, you know what it is. Pornography is, well, you know what that is, too. Suspicious words are things like the feline word from the previous section, or a common first name that also refers to a sexual organ. Overall the results look good. The catch is that it is hard to get humans to rate a large number of web sites: the ODP covers 4.8 million web pages, less than 0.01% of the pages accessible through a commercial search engine like Yahoo or Google (both appear to index somewhere on the order of 60 billion pages).
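
For flavor, here is the essence of what my context-evaluating robot does, stripped down to a few lines. The word lists below are tiny stand-ins (the sensitive lists are deliberately left as placeholders); the real robot uses long lists and scores the surrounding context before flagging a page:

```python
# A drastically simplified keyword-based content robot.
CATEGORIES = {
    "strong_profanity":   {"..."},   # placeholder for the unprintables
    "moderate_profanity": {"..."},   # placeholder
    "mild_profanity":     {"darn", "heck"},
    "suspicious":         {"..."},   # e.g., the feline word from above
}

def classify(page_text: str) -> set[str]:
    """Return the set of categories whose words appear in the page."""
    words = set(page_text.lower().split())
    return {cat for cat, vocab in CATEGORIES.items() if words & vocab}

hits = classify("Adopt a cute cat today!")
print(hits or "clean")  # clean; flagged pages go to a human for review
```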

Conclusion: Third-party independent evaluation of content is much better than self-evaluation (better than 95% accuracy), but it covers only a very, very small fraction of web sites.

Web Robots
The last category I am considering is content rating via robots. Several big projects use robots to classify large numbers of web sites; in fact, any large-scale content rating system must rely on web robots to some degree.

For families, there are some small filter applications available in a variety of forms. Companies like NetNanny and McAfee make some of the better ones. The S4F filter has been around for several years now and is used by many of the family-oriented products. A good performance review of the top ten affordable, small-scale filters is available on toptenreviews.com. In general, these filters tend to reach around 98% accuracy, counting both failures to block and false positives. NetNanny is noteworthy for its dynamic contextual analysis: CNN.com might be blocked in the morning because of a particularly violent story, then unblocked later when the story is pushed off the front page. I don't know how well this works, and I doubt they can do it for very many sites, but it is a nice feature to note.
The really good stuff is not really affordable for families; it is tailored to corporations and governments. Much of the Muslim world is very much into censorship and blocking of inappropriate web content, and those governments reportedly use SmartFilter from Secure Computing. (Note: I recently learned that McAfee uses SmartFilter, but I don't know if it is the same version used by corporations and governments.) The gradient of content types SmartFilter identifies is much finer than the coarse-grained categories used by the more inexpensive filters (70 categories compared with S4F's 23). The following is from Secure Computing's website:
How the best database in the industry is built and maintained
Secure Computing relies on a combination of advanced technology and highly skilled Web analysts – the ideal combination for high accuracy and coverage. Secure Computing utilizes a number of technologies and artificial intelligence techniques to gather and rate potential Web sites including: link crawlers, security forensics, honeypot networks, sophisticated auto-rating tools and customer logs. Candidate sites are added to the SmartFilter database after being reviewed by our multi-lingual Web Analysts who focus their full attention on ensuring the continued quality of the database.

In other words, it appears that they use all three techniques discussed here. Applying a little statistics: suppose self-evaluation blocks a generous 60% of inappropriate sites, a robot blocks 98%, and third-party human review blocks 95%. Then three independent filters in series would block 1 - (0.4)(0.02)(0.05) = 99.96% of inappropriate sites. That still means as many as 24 million bad sites could slip through on an index as large as Google's (60 billion pages × 0.04%). To guarantee that fewer than one site slips through, you would need a combined accuracy better than 99.9999999983% (a miss rate under 1 in 60 billion). Note that I haven't actually tested SmartFilter; I am basing its accuracy and effectiveness on reports available on the Web.
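
The arithmetic is easy to check (the three blocking rates are the estimates from the paragraph above):

```python
# Blocking rates assumed above for the three independent stages.
block_self, block_robot, block_human = 0.60, 0.98, 0.95
index_size = 60e9  # pages indexed by a large search engine

# A bad site slips through only if all three stages miss it.
combined_miss = (1 - block_self) * (1 - block_robot) * (1 - block_human)
print(f"combined accuracy: {1 - combined_miss:.2%}")                 # 99.96%
print(f"expected slip-throughs: {combined_miss * index_size:,.0f}")  # 24,000,000

# Accuracy needed so fewer than one bad page slips through.
print(f"needed: better than {1 - 1 / index_size:.10%}")  # ~99.9999999983%
```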

One final note on what robots are capable of: they can spot pornographic images too, though not with the same effectiveness. I don't know, offhand, exactly how well they work, but I'll guess around 90% effective (please dispute this claim if you know it to be untrue by leaving a comment below). How do they do it? Transform the image (a Fourier transform is one common tool) and apply a simple learning algorithm to recognize the skin-tone distributions associated with nude images (training sets are available). Sounds easy enough, right? If you want to look at actual code that does both textual and image filtering, check out POESIA, a European open-source content filtering project. The word lists are no good for Americans, but the basic code elements are there. Of course, most large-scale commercial systems won't let you near their source code, but POESIA is open source, so feel free to take a peek.
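
To give a feel for the image side, here is a toy skin-tone detector in Python with NumPy. The RGB thresholds are a widely cited rule of thumb and the 30% cutoff is my own arbitrary choice; a production system would combine color evidence like this with texture and shape features (that is where transforms such as the Fourier transform come in) and a trained classifier, rather than a single threshold:

```python
import numpy as np

def skin_fraction(rgb: np.ndarray) -> float:
    """Fraction of pixels matching a crude RGB skin-tone rule.

    `rgb` is an (H, W, 3) uint8 array. The thresholds below are a
    widely cited rule of thumb, not a production-quality model.
    """
    r = rgb[..., 0].astype(int)
    g = rgb[..., 1].astype(int)
    b = rgb[..., 2].astype(int)
    skin = (
        (r > 95) & (g > 40) & (b > 20)
        & (r > g) & (r > b)
        & (np.abs(r - g) > 15)
        & ((rgb.max(axis=-1).astype(int) - rgb.min(axis=-1)) > 15)
    )
    return float(skin.mean())

def looks_pornographic(rgb: np.ndarray, threshold: float = 0.30) -> bool:
    # Arbitrary cutoff: flag images that are mostly skin-colored pixels.
    return skin_fraction(rgb) > threshold

image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
print(looks_pornographic(image))
```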

Conclusion: Robots seem to beat humans (98% effective at blocking inappropriate sites), mainly because of the sheer scale of the Internet. An ideal solution would combine all three evaluation methods, yielding an impressive 99.96% effectiveness at blocking inappropriate material.

What about Google's strict SafeSearch filter?
It doesn't appear to be a safety filter at all; rather, it is a simple censoring program that screens the words a user types into the search box. A proper filter would remove sites that contain inappropriate material regardless of the user's query. Try it: on Google, go to Preferences, which should appear adjacent to the search button. On the Preferences page you will see the following:
SafeSearch Filtering
Google's SafeSearch blocks web pages containing explicit sexual content from appearing in search results.




Select "Use strict filtering." Click Save Preferences and you should be returned to the main search page. Type in your favorite obscenity, but misspell it. I used the same feline example from before (with only one 's'). Here is an example description from a search result I got with strict filtering enabled (with the bad stuff removed):
**** hardcore sexy lesbian **** free pics big black **** ******* white ***** *** **** with sex toy young pink **** *** filled ****** mpeg objects in ****

It does look like they use a statistical language model to find alternative spellings of my token and apply the NOT operator to any alternate spellings that appear on the bad-words list. But, as you will see if you try it, there are clearly words in the search results that even the simplest of filters should catch. I am a big fan of Google, but I have no way to defend them on this issue, other than to say that they are an adult search engine. This is exactly why we founded Piffany: to offer comparable quality and services, but safely and specifically for kids, tweens, and teens.
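
Here is a minimal sketch of the distinction, under my assumption (based on the behavior above) that strict filtering acts mostly on the query. A query-side filter can catch misspellings of listed words with a little edit-distance fuzziness, but it never inspects what comes back; a proper result-side filter checks each result description too. The bad-words list and the fuzziness cutoff are stand-ins:

```python
from difflib import SequenceMatcher

BAD_WORDS = {"badword"}  # stand-in for a real bad-words list

def near_bad_word(token: str, cutoff: float = 0.8) -> bool:
    """Catch misspellings: 'badwrd' is similar enough to 'badword'."""
    return any(
        SequenceMatcher(None, token.lower(), bad).ratio() >= cutoff
        for bad in BAD_WORDS
    )

def query_side_filter(query: str) -> bool:
    """What strict filtering appears to do: veto the *query* only."""
    return not any(near_bad_word(tok) for tok in query.split())

def result_side_filter(results: list[str]) -> list[str]:
    """What a proper filter would do: inspect each *result* as well."""
    return [r for r in results
            if not any(near_bad_word(tok) for tok in r.split())]

print(query_side_filter("badwrd pics"))  # False: the misspelling is caught
print(result_side_filter([
    "family recipes and cooking tips",
    "badword badword badword",           # slips past a query-only filter
]))
```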


Limitations of Traditional Filtering Technologies
It might sound, from what I've said, like the Internet safety problem can be licked by a robust multi-level filter like the one offered by Secure Computing. It can't. There are three primary issues that undermine the effectiveness of any filter:

1. Dynamic content
Web sites change all the time; some are updated hourly, some daily, some monthly. Fortunately, sites tend not to completely remake themselves when they change, so whatever the content was yesterday, it is probably nearly the same today. Exceptions include cases like the CNN example earlier, or a domain that expires and is bought up by an entirely different entity (I don't have any good statistics on this; do you?). By the way, it is common for companies to actively hunt for expiring domains with good PageRanks. A good content and safety rating system must come with an expiration date (see the sketch after this list).
2. Link Structure
Let's say you type some word into a search engine and click on a filter-approved web site. The site itself doesn't look too bad: no nude pictures or profanity. But it has links. Where do they go? How many clicks before you arrive at an inappropriate site? According to Albert-László Barabási, there are only about 19 degrees of separation between web pages; that's 19 clicks and you can get anywhere. (The human social network allegedly has 6 degrees of separation.) None of the safety filters presently available guard against sites that are a few links away from inappropriate material (the sketch after this list shows how such a check might work). This is one reason that 1 in 3 children inadvertently winds up on a site with pornography; that, plus the fact that about 13% of web sites are pornography, most parents never turn safety filters on, and many kids can easily turn off parental controls.

3. Evaluation is not 100% Objective
What does inappropriate mean, anyway? A lot of art would be considered pornography by the robots. Similarly, teen health sites would be considered inappropriate by many conservative groups, as would alternative-lifestyle sites. I have no stated opinion on this, except to say that like-minded people travel in common circles, and this needs to be taken into account when evaluating what is appropriate. Can it be done? Yes, it can, but I can't say how, because that would violate the NDA I signed with Piffany Inc.
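
As promised in points 1 and 2, here is a sketch of both ideas in Python: a rating record that expires when it goes stale or the page content changes, and a breadth-first crawl that measures how few clicks separate an approved page from known bad ones. The field names, the 30-day lifetime, and the two-click cutoff are all my own illustrative choices, not any existing system's design:

```python
import hashlib
import time
from collections import deque

MAX_AGE_SECONDS = 30 * 24 * 3600  # illustrative 30-day expiration

class Rating:
    """Point 1: a content rating that expires (dynamic content)."""

    def __init__(self, url: str, page_html: str, label: str):
        self.url = url
        self.label = label
        self.rated_at = time.time()
        self.content_hash = hashlib.sha256(page_html.encode()).hexdigest()

    def is_valid(self, current_html: str) -> bool:
        # Invalidate if the rating is stale or the page content changed
        # (e.g., the domain expired and was bought by someone else).
        fresh = time.time() - self.rated_at < MAX_AGE_SECONDS
        unchanged = (hashlib.sha256(current_html.encode()).hexdigest()
                     == self.content_hash)
        return fresh and unchanged

def clicks_to_bad(start: str, links: dict[str, list[str]],
                  bad_sites: set[str], max_depth: int = 2) -> int | None:
    """Point 2: BFS over the link graph; how many clicks to a bad site?"""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        url, depth = queue.popleft()
        if url in bad_sites:
            return depth
        if depth < max_depth:
            for nxt in links.get(url, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return None  # nothing bad within max_depth clicks

links = {"kids-site": ["forum"], "forum": ["bad-site"]}
print(clicks_to_bad("kids-site", links, {"bad-site"}))  # 2 clicks away
```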

What does Piffany plan on doing about it?

The easiest answer is: wait and see for yourself. We are currently beta testing a new content and site rating system that I believe could become an international standard within a couple of years, if it all works out. It takes into account the dynamic nature of the Internet, link structure, and subjective measurements. How confident am I that it will work well? Fairly confident, considering that I quit my job at Harvard University to work on it. We anticipate that the rating system will have an effect similar to movie ratings, where an R rating is the kiss of death. Are we secret agents of a united conservative front, pushing an agenda of world censorship and domination? Um, no. But we do believe people should have a choice about what kind of content they receive on the Internet. Presently, Piffany uses standard safety filtering technology, but stay tuned for SafetyRank to emerge soon. OK, those are just words; do I have more scientific evidence that we are working on a great filter? Given a list of 1 million URLs with an even mixture of kids' and adult sites, I asked the algorithm to pick out some kids' sites and some adult sites. Here are the results:

[Figure: classification results]

Stay tuned to this blog for our next topic: Chat Rooms, Social Networking, and Age Verification.
Suggest a topic to me at david(at)piffany(dot)com


Final Words: With safety filters, you get what you pay for. If you let your kids use adult search engines like Google, Yahoo, AOL, and MSN, make sure you enable safety filters or parental controls where available, but don't rest assured that they will work well. Still, it's better than nothing.
Filtering controls for the major adult search engines:
If you are using Windows Vista, be sure to check out the built-in parental controls.

About Piffany
Why should a 9-year-old and a 29-year-old get exactly the same search results? Piffany is a search engine that strives to safely bring the full potential of the Internet to kids by allowing them to adjust the difficulty level of their search results. To ensure the safety of our search results, we are researching a new Internet content and safety rating system that is similar to ratings given for TV, movies, and video games. Visit us at Piffany.com.