Piffany Search Engine for Kids: 2007

Sunday, June 10, 2007

National Internet Safety Month: How well do Internet content filters work?

June is National Internet Safety Month. To show support for this cause, I am posting a series of blog entries addressing issues of Internet safety. Today's post discusses Internet content filters: we will get under the hood and perform a little testing. Finally, I will talk about Piffany's Internet Content and Safety Rating System. We have just begun beta testing it, and we are hopeful that the service will be of immense use to families, schools, companies, and governments. First, a follow-up on my last post.

Follow-up: What does your IP address reveal about your location?

My first post (National Internet Safety Month: What does your IP address reveal about your location?) discussed how your IP address can be used to determine your geographic location. Larry Magid, founder of SafeKids.org, CBS News On-Air Technology Analyst, and co-author of MySpace Unraveled, asked me if this is really a problem, i.e., are there any cases on record of a predator meeting up offline nonconsensually with a minor they met online? I couldn't find any such cases, and it does look like most reported cases involve consent from the minor. If you know of any cases, then leave a comment or drop me a line.

Internet Content Filters
I won't attempt to review all the options (there is a good review in Adam Thierer's blog), but I will discuss how and how well they work. The basic premise is that before jumping to some location on the web, your browser or proxy checks with a database, either locally or online, to determine if the site you have requested is prohibited according to the options you set when you set up your parental or administrative controls. For example, you might select not to view web sites that are related to alcohol or pornography. Your browser checks to see if the site you are requesting is on the list of known alcohol or pornography related web sites. Conceptually, these filters are fairly simple.
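To make the lookup step concrete, here is a minimal sketch in Python. The category database, the host names, and the function name `is_blocked` are all made up for illustration; real products use much larger databases, often queried online, and I am not claiming any of them work exactly this way.

```python
from urllib.parse import urlparse

# Hypothetical category database: hostname -> set of category labels.
CATEGORY_DB = {
    "beer.example": {"alcohol"},
    "adult.example": {"pornography"},
    "news.example": {"news"},
}

# Categories the parent or administrator chose to block.
BLOCKED_CATEGORIES = {"alcohol", "pornography"}


def is_blocked(url, db=CATEGORY_DB, blocked=BLOCKED_CATEGORIES):
    """Return True if the URL's host is listed under any blocked category."""
    host = (urlparse(url).hostname or "").lower()
    categories = db.get(host, set())  # unlisted sites have no categories
    return bool(categories & blocked)
```

Note that a site missing from the database passes straight through in this sketch, which is why the question of how sites get on the list matters so much.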

So how does a site get on the list? There are three major methods by which a web site might get on one of these "bad" lists:

The owner or administrator for the web site added it (yeah right)
A human not affiliated with the web site added it
A web robot added it

Self-Evaluation
There are a lot of web pages on the Internet. Let's call the relevant number of web pages 300 billion (estimate including the dark web). The central challenge a web content filter designer faces is this staggering number. A first guess at how to content rate all these sites is to simply ask the people that own the sites to do a self-evaluation. As a first approximation, this isn't bad, but if it's your only line of defense then you will likely be disappointed in the results. The leading organization that uses this approach is the Internet Content Rating Association (ICRA). ICRA filter technology depends on site owners to self-evaluate their site through a questionnaire, called a label generator. An example is the language tab, which offers four categories that can be checked off: abusive or vulgar terms, profanity or swearing, mild expletives, or none of the above. The label generator intrinsically takes care of the multi-lingual or multi-cultural problem. They even offer a customized Google search using the ICRA filter. I tried a popular word often associated with felines (I had to turn Google's filtering off to get the word through).

The results, shown on the left, reveal that 1 of the 7 results shown (all pornography, by the way) is correctly labeled Surf:Adult (red arrow). 3 of the 7 results are not labeled at all (gray arrow). The disappointing result is that 3 of the 7 are incorrectly labeled as Surf:All Ages (green arrow), meaning that the sites are labeled under the ICRA system as OK for all ages to view. Casual inspection of the description text reveals that they are not even close to OK for all ages. They are all blatant pornography. My word choice is not unrealistic for something a kid might type in, though I agree that I couldn't have picked a more dangerously ambiguous word.

So that's only seven results analyzed. What if we do the analysis for more sites? Google will only return 1000 results at a time, so we have to restrict it to that. Doing so, I found that only 332 results were labeled at all, and of those results, 67% were incorrectly labeled as safe for all ages. I verified the results by eye, without actually clicking through.

IMPORTANT: None of the results I have seen or discussed here were marked as checked by ICRA. The labels are voluntary self-evaluations by the content provider and not an evaluation from ICRA employees. I am sure the ICRA folks have the best intentions. My only point here is that much of the time, self-evaluation is inaccurate as to content and safety. ICRA-checked means that the content owners paid ICRA a fee ($32/url last time I checked) to verify their self-evaluation. Surely, had these content providers paid for this service, their labels would have changed. The customized Google search provided by ICRA is not intended as family-safe filtered search results, but rather just a way to check if a site has been labeled or verified using ICRA's system.

Conclusion: Self-evaluation is not accurate (less than 40% accurate)

Third Party Human Evaluation
I don't have any really good data to evaluate third parties as a tool to evaluate content, so what I used instead is a close approximation. I used the Open Directory Project (ODP). Actually, I used the Google Kids Directory, but it is just ODP + Google's PageRank algorithm. What makes this test inaccurate is the probably relatively high rate of self-submission to this directory. If it were 100% third party submitted then it would be a perfect test, but it probably isn't, and I have no way of knowing what percentage is self-submission. I actually submitted a link to a paper I wrote on how school kids can use the Internet to learn physics better. However, I think I was honest in placing the paper where I did, though I am not so sure about the description I entered ("the best paper in the history of the world...").

This time there are far too many web pages for me to evaluate by eye (approximately 20,000 pages in the kids directory that I crawled, skipping the International section). Instead I used a context-evaluating robot to look for "bad results" and verified the bad results by eye and by clicking through to the pages when necessary. Here's what I found:

The table at the right gives the percentages for results in categories related to offensive or inappropriate materials. As you can see, a large percentage of the URLs were labeled as suspicious by the robot; however, a smaller fraction, on the order of 1%, were definitively found to use unambiguous profane language. Is 1% too much for a kids directory? That's a couple hundred sites. The categories are relatively straightforward. Strong profanity consists of those five or so words that you absolutely could not get away with saying on TV. Moderate profanity is the usual stuff you hear on the cable channels and occasionally on late night TV. Mild profanity is the really light stuff that you wouldn't completely flip out over if your ten year old blurted one of them out. Racist language is, well, you know what it is. Pornography is, well, you know what that is, too. Suspicious words are like the feline word from the previous section or a common first name that also refers to a sexual organ. Overall the results look good. The catch is that it is hard to get humans to rate a large number of web sites. The ODP consists of 4.8 million web pages, which is less than 0.01% of the web pages accessible through a commercial search engine like Yahoo or Google (both appear to index somewhere on the order of 60B web pages).
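For readers who want a feel for what a word-list robot does, here is a rough sketch in Python. It is not the robot I used: the category names mirror the ones described above, but the word lists are harmless placeholders and the function names are invented for this example.

```python
import re

# Placeholder word lists; a real robot would use much larger curated lists.
WORD_LISTS = {
    "strong_profanity": {"strongword"},
    "mild_profanity": {"mildword"},
    "suspicious": {"ambiguousword"},
}


def categorize(text, word_lists=WORD_LISTS):
    """Return the sorted list of categories whose words appear in the text."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return sorted(cat for cat, words in word_lists.items() if tokens & words)


def needs_human_review(categories):
    """Pages flagged only as 'suspicious' are ambiguous and go to a person."""
    return categories == ["suspicious"]
```

The second function reflects the workflow described above: the robot casts a wide net, and a human confirms the ambiguous hits by eye.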

Conclusion: Third party independent evaluation for content is much better than self-evaluation (better than 95% accuracy), but it is limited to a very very very small number of web sites.

Web Robots
The last category I am considering is content rating via robots. There are several big projects out there that use these things to classify a large number of web sites. In fact, any large-scale content rating system must rely on web robots to some degree.

For families, there are some small filter applications available in a variety of forms. Companies like NetNanny and McAfee make some of the better ones. The S4F filter has been around for several years now and is used by many of the family-oriented products. A good performance review of the top ten small-scale affordable filters is available on toptenreviews.com. In general, these filters tend to push around 98% accuracy, both in blocking inappropriate sites and in not blocking acceptable ones (avoiding false positives). NetNanny is noteworthy in that it has a dynamic contextual analysis. For example, CNN.com may be blocked in the morning due to a particularly violent story, but unblocked later on when the story is pushed off the front page. I don't know the extent to which this works, and I doubt they can do it for very many sites, but it is a nice feature to note.
The really good stuff is not really affordable to families; it is tailored to corporations and governments. Governments in much of the Muslim world are very active in censoring and blocking web content they consider inappropriate, and they reportedly use SmartFilter from Secure Computing. (Note: I recently learned that McAfee uses SmartFilter, but I don't know if it is the same version used by corporations and governments.) The gradient of content types that it identifies is much finer than the coarse-grained categories used by the less expensive filters (70 categories compared with S4F's 23). The following is from Secure Computing's website:

How the best database in the industry is built and maintained
Secure Computing relies on a combination of advanced technology and highly skilled Web analysts – the ideal combination for high accuracy and coverage. Secure Computing utilizes a number of technologies and artificial intelligence techniques to gather and rate potential Web sites including: link crawlers, security forensics, honeypot networks, sophisticated auto-rating tools and customer logs. Candidate sites are added to the SmartFilter database after being reviewed by our multi-lingual Web Analysts who focus their full attention on ensuring the continued quality of the database.

In other words, it appears that they use all three techniques discussed here. Applying a little statistics, if we followed self-evaluation, at a generous 60% effectiveness in blocking, with a robot at 98%, and finally finished up with third party human evaluation at 95%, then the fraction of bad sites missed by all three independent filters would be 0.40 x 0.02 x 0.05 = 0.0004, for an overall accuracy of 99.96% in blocking inappropriate sites. That means that at most 24 million bad sites (0.04% of roughly 60 billion indexed pages) could slip through on a search engine as large as Google. In order to reduce this to fewer than one site slipping through, you would need a miss rate below 1 in 60 billion, i.e., a filter with accuracy better than about 99.999999998%. I haven't actually tested SmartFilter and am basing its accuracy or effectiveness on reports available on the Web.
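The arithmetic is easy to check with a few lines of Python. This is only a sketch of the calculation: it assumes the three filters fail independently, which is a convenient simplification rather than something I have measured, and the 60 billion figure is the rough index size mentioned earlier in this post.

```python
def combined_miss_rate(accuracies):
    """Probability that a bad site slips past every independent filter."""
    miss = 1.0
    for accuracy in accuracies:
        miss *= 1.0 - accuracy
    return miss


INDEXED_PAGES = 60e9  # rough index size quoted earlier in this post

miss = combined_miss_rate([0.60, 0.98, 0.95])
slipped = miss * INDEXED_PAGES  # upper bound: assumes every page is bad

print(f"combined accuracy: {100 * (1 - miss):.2f}%")
print(f"pages that could slip through: {slipped:,.0f}")
```

If the filters tend to miss the same sites (which seems likely in practice), the real combined miss rate would be higher than this product suggests.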

One final note on what robots are capable of. They can spot pornographic images too, though not with the same effectiveness. I don't know, offhand, exactly how well they work, but I'll guess around 90% effective (please dispute this claim if you know it to be untrue by leaving a comment below). How do they do it? Simply Fourier transform the image and apply a simple learning algorithm to recognize skin tone histograms associated with nude images (training sets are available). Sounds easy enough, right? If you want to have a look at some actual code to do both the textual filtering and image filtering, check out POESIA, which is a European open source content filtering project. The word lists are no good for Americans, but the basic code elements are there. Of course, most large-scale commercial systems wouldn't let you get near their source code, but the POESIA code is open source, so feel free to take a peek.
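To give a flavor of the skin-tone idea, here is a deliberately crude Python sketch. It is not the Fourier-plus-learning approach described above and it is not taken from any filter's source code: it swaps in a simple hand-written RGB rule of thumb and a made-up threshold, just to show the "what fraction of the image is skin colored?" step. A real system would learn its model from training images.

```python
def looks_like_skin(r, g, b):
    """Crude RGB skin-tone rule (an assumed heuristic, not a trained model)."""
    return (
        r > 95 and g > 40 and b > 20
        and max(r, g, b) - min(r, g, b) > 15
        and abs(r - g) > 15
        and r > g and r > b
    )


def skin_fraction(pixels):
    """Fraction of (r, g, b) pixels that match the skin-tone rule."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    return sum(looks_like_skin(*p) for p in pixels) / len(pixels)


def is_suspicious_image(pixels, threshold=0.5):
    """Flag images dominated by skin-toned pixels (threshold is a guess)."""
    return skin_fraction(pixels) >= threshold
```

A rule this simple will flag plenty of innocent portraits and beach photos, which is one reason image filters lag behind text filters.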

Conclusion: Robots seem to beat humans (98% effective in blocking inappropriate sites), but mainly due to the large scale of the Internet. An ideal solution would be a combination of all three evaluation methods, yielding an impressive 99.96% effectiveness in blocking inappropriate material.

What about Google's Strict Safety Filter?
It doesn't appear to be a safety filter at all; rather, it is a simple censoring program that screens the words a user inputs into the search box. A proper filter would remove sites that contain inappropriate material regardless of the user query. Try it. On Google, go to preferences, which should appear adjacent to the search button. On the preferences page you will see the following options:

SafeSearch Filtering: Google's SafeSearch blocks web pages containing explicit sexual content from appearing in search results.
Use strict filtering (Filter both explicit text and explicit images)
Use moderate filtering (Filter explicit images only - default behavior)
Do not filter my search results.

Select Use strict filtering, as shown above. Click on save preferences and you should be returned to the main search page. Type in your favorite obscenity, but misspell it. I used the same feline example from before (only one 's'). Here is an example description from a search result I got with strict filtering enabled (with the bad stuff removed):
**** hardcore sexy lesbian **** free pics big black **** ******* white ***** *** **** with sex toy young pink **** *** filled ****** mpeg objects in ****

It does look like they are using a statistical language model to find alternative spellings of my token and apply the NOT operator to any alternate spellings that are on the bad words list. But as you will see if you try it, there are clearly words in the search results that should get picked up by even the simplest of filters. I am a big fan of Google, but I have no way to defend them on this issue other than to say they are an adult search engine. That is exactly why we founded Piffany: to offer comparable quality and services, but safely and specifically for kids, tweens, and teens.
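Here is a small Python sketch of what I mean by filtering the results rather than the query. It is my own illustration, not Google's method: the word list is a placeholder, the function names are invented, and the fuzzy matching simply uses the standard library's `difflib.get_close_matches` with a cutoff I picked by hand.

```python
import difflib
import re

BAD_WORDS = {"badword", "rudeword"}  # placeholders for a real word list


def is_bad_token(token, bad_words=BAD_WORDS, cutoff=0.8):
    """True if the token is, or is a near-misspelling of, a listed word."""
    return bool(difflib.get_close_matches(token, bad_words, n=1, cutoff=cutoff))


def filter_results(results, bad_words=BAD_WORDS):
    """Keep only results whose title and snippet contain no bad tokens."""
    kept = []
    for title, snippet in results:
        tokens = re.findall(r"[a-z]+", f"{title} {snippet}".lower())
        if not any(is_bad_token(t, bad_words) for t in tokens):
            kept.append((title, snippet))
    return kept
```

Because the check runs on the text of each result, a misspelled query makes no difference: a result whose description contains a listed word (or a near miss) is dropped no matter what the user typed.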

Limitations of Traditional Filtering Technologies:
It might sound, from what I've said, like the Internet safety problem can be licked with a robust multi-level filter like that offered by Secure Computing. It can't. There are three primary issues that undermine the effectiveness of any filter:

1. Dynamic content

Web sites change all the time. Some are updated hourly, daily, monthly, etc. Fortunately, sites tend not to completely remake themselves when they do change, so whatever the content was yesterday, it is probably nearly the same today. Exceptions are cases like the CNN example earlier, or when a domain expires and is bought up by an entirely different entity (I don't have any good statistics on this, do you?). BTW, it is common for companies to actively search for expiring sites with good PageRank. A good content and safety rating system must come with an expiration date.

2. Link Structure

Let's say you type some word into a search engine and click on a filter-approved web site. The site itself doesn't look too bad. No nude pictures or profanity, but it has links on the page. Where do they go? How many clicks before you arrive at an inappropriate site? Well, according to Albert-László Barabási, whose group published the estimate while at the University of Notre Dame, there are only 19 degrees of separation between web sites. That's 19 clicks and you can get anywhere. The human social network is allegedly 6 degrees of separation. None of the safety filters presently available guard against sites that are a few links away from inappropriate material. This is one of the reasons that 1 in 3 children inadvertently winds up on a site with pornography; that and the fact that about 13% of the web sites out there are pornography, most parents never turn safety filters on, AND many kids can easily turn off parental controls.

3. Evaluation is not 100% Objective

What does inappropriate mean anyway? A lot of art would be considered pornography according to the robots. Similarly, teen health sites would be considered inappropriate by many conservative groups, as would alternative lifestyle sites. I have no stated opinion on this except to say that like-minded people travel in common circles, and this needs to be taken into account when making evaluations as to what is appropriate. Can it be done? Well, yes it can, but I can't say how because that would violate the NDA I signed with Piffany Inc.
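To make the link-structure issue (item 2 above) a little more concrete, here is a Python sketch that counts the fewest clicks from a starting page to any flagged page using a breadth-first search. The tiny link graph and the page names are invented for this example, and this is not a description of how Piffany's rating system works.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "kids-homework": ["science-fair", "fan-forum"],
    "science-fair": ["kids-homework"],
    "fan-forum": ["ad-network"],
    "ad-network": ["adult-site"],
    "adult-site": [],
}


def clicks_to_flagged(start, links, flagged):
    """Breadth-first search: fewest clicks from start to any flagged page.

    Returns None if no flagged page is reachable.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, clicks = queue.popleft()
        if page in flagged:
            return clicks
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, clicks + 1))
    return None
```

In this toy graph a filter that only inspects the starting page would approve "kids-homework", even though a flagged page is three clicks away.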

What does Piffany plan on doing about it?

The easiest answer to this is, wait and see for yourself. We are currently beta testing a new content and site rating system that I believe will be an international standard in a couple of years' time, if it all works out in the end. It takes into account the dynamic nature of the Internet, link structure, and subjective measurements. How confident am I that it will work well? Fairly confident, considering that I quit my job at Harvard University in order to work on it. We anticipate that the rating system will have a similar effect as in movies, where an R rating is the kiss of death for a movie. Are we secret agents in the united conservative front, pushing an agenda of world censorship and domination? Uhm, no. But we do believe that people should have the choice as to what kind of content they receive on the Internet. Presently, Piffany uses standard safety filtering technology, but stay tuned for SafetyRank to emerge soon. OK, those are just words. Do I have more scientific proof that we are working on a great filter? Given a list of 1M URLs with an even mixture of kids and adult sites, I asked the algorithm to pick out some kids sites and some adult sites. Here are the results:

Stay tuned to this blog for our next topic: Chat Rooms, Social Networking, and Age Verification.
Suggest a topic to me at david(at}piffany[dot]com

Final Words: With safety filters you get what you pay for. If you let your kids use adult search engines like Google, Yahoo, AOL, and MSN, make sure you enable safety filters or parental controls if they are available. Don't assume that they will work well, though; they are merely better than nothing.
Filtering controls for the major adult search engines:

Google: http://www.google.com/intl/en/help/customize.html#safe
Yahoo: http://search.yahoo.com/preferences/preferences?page=filters
Ask: Just click on Options>Content Filtering in the upper right hand corner of ask.com
MSN: http://search.msn.com/settings.aspx, click on filtering offensive sites

If you are using Windows Vista, be sure to check out the built in parental controls.

About Piffany
Why should a 9-year-old and a 29-year-old get exactly the same search results? Piffany is a search engine that strives to safely bring the full potential of the Internet to kids by allowing them to adjust the difficulty level of their search results. To ensure the safety of our search results, we are researching a new Internet content and safety rating system that is similar to ratings given for TV, movies, and video games. Visit us at Piffany.com.

Wednesday, June 6, 2007

National Internet Safety Month: What does your IP address reveal about your location?

June is National Internet Safety Month. To show support for this cause I will post a series of blog entries addressing issues of Internet safety. Some postings will be to promote our new safe search engine for kids, Piffany, but many will be much more general. My first post is regarding the cardinal rule of Internet safety:

Never give out personal information that might give away where you live

This is obviously a very good rule to remember, but how safe are you really? Did you know that your geographical location can be determined from your IP address? Scary, huh? It's true. An IP address is a unique identifying number used by computers to route content on the Internet from computer to computer. Do not confuse them with URLs. IP addresses look like 24.91.135.203, whereas URLs look like www.piffany.com. An IP address is like your computer's phone number or street address (some cell phones have them too). Every computer that accesses the Internet is assigned an IP address.
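If you are ever unsure which of the two you are looking at, Python's standard `ipaddress` module can tell you. The helper name `is_ip_address` is just something I made up for this sketch.

```python
import ipaddress


def is_ip_address(text):
    """True if text parses as an IPv4 or IPv6 address, False otherwise."""
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False
```

With this helper, the string 24.91.135.203 is recognized as an IP address, while www.piffany.com is not, because it is a name that has to be looked up to find the address behind it.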

One free service I found returned the image seen below when I visited it.

Yup, that's me (red ball slightly above and to the right of Cambridge). It also listed detailed information about my city, and nearby cities and towns, including latitude and longitude. If I were, say, a high school student, how hard would it be to track down my high school? Not hard. If you knew my last name and my geographical location, an Internet phone directory would give you my phone number, from which my address easily follows using a reverse lookup service. The reverse lookup I tried even offered a service for unlisted numbers. My number is unlisted, and it returned my old address in nearby Cambridge. I guess I'm safe for now.

Will these sites show me the location for an IP address other than my own? Yup, I tried a friend's to confirm that the location was correct. By the way, I am intentionally leaving out information about the sites I used for this blog, but they are easy to find on the Internet.

How Hard is it for someone to get my IP address?

Fortunately, not just anyone can get your IP address. If you host a website on your own server then you can easily obtain the IP addresses of people that visit your site. There are also many CGI scripts available that can be installed on hosted sites. A potential predator could lure you to their server with the malicious intention of retrieving your IP address. This requires some sophistication and judging by the characters caught in the act by Chris Hansen on Dateline's To Catch a Predator, most Internet predators are not savvy enough to pull something like this off, or at least, let's hope not. The most difficult step in locating someone on the Internet is getting their IP address, and that's not too difficult. After that, relatively little work is required before a predator is knocking on your door.
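To show how little a site owner has to do, here is a minimal sketch of a web application written against Python's standard WSGI interface. It simply echoes the visitor's address back to them. The function name `app` is my own choice; `REMOTE_ADDR` is the standard CGI/WSGI variable a server fills in with the address of the connecting computer.

```python
def app(environ, start_response):
    """Minimal WSGI app that echoes the visitor's IP address back to them."""
    # REMOTE_ADDR is the standard CGI/WSGI variable for the client's address.
    visitor_ip = environ.get("REMOTE_ADDR", "unknown")
    body = f"You connected from {visitor_ip}\n".encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Something like this could be served for local experimentation with the standard library's `wsgiref.simple_server.make_server`. The point is that the address arrives with every request; the site does not have to do anything clever to see it.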

What can be done to safeguard against this?

You can't usefully spoof (fake) your IP address when browsing. You can spoof just about anything else, but a faked IP address would send the content you request to the wrong computer, because the IP address is the identifier by which computers on the Internet ensure content is delivered to the right place. One possibility for hiding your IP address is to use a proxy server, but I am not aware of any services that offer proxy servers for this purpose. It would be expensive for Piffany to offer such a service, though we have discussed it in the past. In general, before committing to a chat room or any website that is social, make sure it doesn't display the IP addresses of its users. You have probably seen entries like 'David is logged in from 24.91.135.203'.

What does Piffany plan to do about this problem?

For starters, we won't display a user's IP address to the public. Piffany's CeSAR algorithm does make it unlikely that a website set up with malicious intent will appear in our search results, because, like PageRank, HITS, and other authority-based algorithms, CeSAR uses link structure to determine a site's rank. Friendly communities of sites that you trust won't link to those malicious sites. CeSAR ranks web pages by their proximity to clusters of sites on a particular topic, of interest to a particular group, or that are frequented by a particular age group. Hence, if a website is not acknowledged by an established cluster frequented by 8-10 year old kids, for example, then it will not likely be listed highly in a search performed by 8-10 year old kids. In order to make this trait more effective we only return the top 100 search results. Also, we are experimenting with systems and methods to verify that users who register as kids are actually kids; any suggestions on how to do this are welcome. Initially, we were thinking that we might ask the potential registrant a question about their school. With a lot of users, statistics can help identify legitimate answers, but initially, we will just have to verify them ourselves. The best hope we have is that responsible users will report to us when they find a suspicious user or site.

That's it for now. I hope to see some stimulating comments, so feel free to respond through the link below, and remember,

beware of web sites that publicly display your IP address.

Stay tuned to this blog for our next topic: Internet content rating systems.
Suggest a topic to me at david(at}piffany[dot]com


Tuesday, June 5, 2007

18 Bugs and Climbing

As we prepare to invite beta testers to our site, the list of bugs found by the Piffany team and some of our friends (thanks to those of you who have been helping) is at 18 and climbing at last count. We have fixed six of these, and are working our way down the list. This is really a lot of work!

Most annoying bug: Browser compatibility, hands down. Since this is obviously going to be a problem, it would be great if we could see what our users are seeing. I found a tool online that looks promising. TapeFailure is a nice Web 2.0 analytics application that records screencasts (videos) of users interacting with your web page. I don't know if it will work for us, since we do not serve static pages only, but I think we will try it out. If anyone knows of a free, or cheaper, version of this then let me know.

Most bizarre bug: I implemented the advanced logic modules over the weekend and they worked fine through my administrative portal (a separate server available only from my computer that allows me to explore the databases behind Piffany in more detail), but some of the features didn't work at all through the public front-end even though the snippets of code were the same. I fixed the problem by deleting the old back-end search server and copying the administrative server in its place. Advanced logic allows you to input logical operators like AND, OR, and NOT (and quotation marks for exact phrase matching) in order to refine search results. Users can use them by typing uppercase operators alongside search tokens, e.g., Harry NOT Potter, but we are seeking ways to implement them automatically or at least in an easier fashion so kids can use them.
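For the curious, here is a toy Python sketch of how uppercase operators like these can be evaluated against a single document. It is not Piffany's code: the function name `matches` is invented, operators are applied strictly left to right with no parentheses, terms with no operator between them are treated as AND, and quoted phrases are not handled.

```python
import re


def matches(query, text):
    """Evaluate a flat query such as 'Harry NOT Potter' against a document.

    Operators are uppercase AND, OR, NOT and are applied left to right;
    adjacent terms with no operator between them are treated as AND.
    """
    words = set(re.findall(r"\w+", text.lower()))
    result = None        # running truth value
    op = "AND"           # operator waiting to be applied
    negate = False       # whether the next term is negated
    for token in query.split():
        if token in ("AND", "OR"):
            op = token
        elif token == "NOT":
            negate = True
        else:
            hit = (token.lower() in words) != negate
            if result is None:
                result = hit
            elif op == "AND":
                result = result and hit
            else:
                result = result or hit
            op, negate = "AND", False
    return bool(result)
```

With this sketch, the query Harry NOT Potter matches a page about Prince Harry but rejects one that mentions Harry Potter.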

Well, there are probably 19 bugs by now, so I had better get back to work.

Wednesday, May 23, 2007

Parents, teachers, and kids needed to beta test our search engine for kids

A recent comment on this blog prompted me to advertise that we need beta testers for the Piffany search engine. Though we intend to have an offering for tweens and teens in the near future, and believe it or not, we have something in mind for pre-schoolers too, for now our search engine will be tailored for kids ages 5-12. Unlike other search engines, we are just for kids and we are a real search engine. In fact, some of the technology under the hood exceeds that found in adult search engines. If you follow this blog, then you will learn more about that, but for now I would like to call on anyone interested in beta testing our product. If you're interested, let me know by clicking on one of the following:

kids: mailto:betatester@piffany.com?subject=kids,
parents: mailto:betatester@piffany.com?subject=parents,
teachers: mailto:betatester@piffany.com?subject=teachers,
or other: mailto:betatester@piffany.com?subject=others.

We should be online for testing around June 1, 2007. In the meantime feel free to snoop around the website or watch our demonstration video (very amateurish, I admit, but it gets the message across).

Tuesday, May 22, 2007

You do plan on having a search engine on your search engine, right?

It’s approximately two weeks from our official launch date, but we were supposed to have everything online and ready to go on May 15, 2007 so that the press and investors could have a look at it. It’s a week after the deadline and still no search engine online. The reasons, of course, are the usual suspects: technical issues and uncertainty. It makes me wonder how anything ever gets done anywhere.

I understand that there are a few technical issues regarding the installation of the search engine front end, that portion of the search engine that users interact with. Only Ilan, who is in charge of this component, can fully appreciate those issues, but I have issues of my own on the back end, that portion of the search engine where search results are generated from inputs taken from the front end. I still haven’t put in the code to allow for multi-token search, or half a dozen other things that will ultimately be needed for the official launch. I gave up long ago on seeking perfection in any single component of the project before moving on to the next component. Instead, I have adopted a strategy that is similar to learning. First cover a topic broadly with little depth on any single sub-topic (breadth first), and then revisit the same material, several times if necessary, each time going deeper (depth first). This works fine for the most part. Where it fails is when new knowledge gives rise to doubt about previous assumptions. A recent example of this process and its shortcomings is taken from my work over the past few days.

Yesterday, I completed the first objective evaluation of Piffany. The analysis compared the top search result for about two hundred different keywords (some general and some specific) on Google, Google Kids Directory, and Piffany. Scores for kids, teens, and adults were assigned based on the relevancy to topic and age. We totally dominated the kids and teens searches for keywords that were more school/homework related. However, we did not fare as well with keywords that were derived from pop culture. A search for Aguilera by a 16-year-old returned The Christina Connection, an unofficial website about Christina Aguilera, as the top result, rather than the official Christina Aguilera website, which was number one on both Google and Google Kids Directory search but only number three on Piffany. The official Christina website is entirely Flash based (animation-based web pages that work more like an interactive video than a conventional web page). We haven’t put in a parser for Flash sites yet. Flash tends to be used by the kind of sites where presentation is more important than content, and it also gets heavy use by commercial sites, so parsing it has been low on our list of priorities. But should it be? Many of the top results on Google for pop culture are done in Flash. I can’t really say whether the lack of Flash parsing is the reason Piffany scored lower on pop culture without actually putting in a parser for Flash and trying again.

Keeping with the conventional wisdom of not changing horses midstream, I will add the Flash parser on the next iteration of version upgrades to the search back end. I think today I will put in the code to allow for multi-token searches. And before you know it, our search engine will have a search engine as part of its offerings.