Content filtering refers to an automatic system put in place to process large volumes of data and take action on any content that meets certain criteria. Publishers often use text and media-filtering solutions to handle the bulk of the user-generated content on their site. These systems are often put in place to filter content such as adult material and illegal filesharing, as well as the sale of firearms, drugs, alcohol and tobacco.
Developing an in-house solution
Many publishers choose to develop their own filtering system. This decision can have the following benefits:
- Text-based filtering can be relatively easy to code
- It is often significantly cheaper than commercial solutions
- The publisher knows their site and users best and can anticipate policy issues better than anyone else
Creating a list of keywords
- Compile your own list of words and phrases that you wish to filter. You can use your own intuition or get some help:
  - Ask your employees to contribute
  - Reach out to your users for help
  - Use Google Ads: Keywords tool
- For additional inspiration take a look at websites that host undesirable content (adult and/or filesharing sites for example), and find out which keywords show up frequently on these.
- Code your own automatic keyword scraping tool:
  - Use search engine data to go through all pages on a site
  - Retrieve a list of the unique words and word combinations found on it
  - Keep the most commonly used keywords and discard the rest. Don’t forget to eliminate common articles and words like ‘a’, ‘and’ or ‘the’.
  - Output the result as a text file
  - Repeat the above for any number of sites until you're satisfied with your list.
- Important: Scraping other sites and using their content as your own is against the Google Publisher Policies and the Spam policies for Google web search and might also be illegal and/or unethical.
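The counting part of the steps above can be sketched in Python roughly as follows. This is a minimal sketch, assuming the text has already been collected as plain strings from sources you are allowed to process (for example, submissions your own moderators have previously rejected). The names `top_keywords`, `write_keywords` and `STOP_WORDS` are illustrative assumptions, not part of any Google tool.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list; extend it for real use.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "for", "is"}


def top_keywords(documents, limit=10):
    """Return the most common words and two-word phrases across documents."""
    counts = Counter()
    for text in documents:
        # Lowercase, then keep runs of letters (with an optional apostrophe part).
        words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
        words = [w for w in words if w not in STOP_WORDS]
        counts.update(words)
        # Adjacent pairs (after stop-word removal) stand in for word combinations.
        counts.update(" ".join(pair) for pair in zip(words, words[1:]))
    return counts.most_common(limit)


def write_keywords(keywords, path):
    """Write one term per line with its count, most frequent first."""
    with open(path, "w", encoding="utf-8") as handle:
        for term, count in keywords:
            handle.write(f"{term}\t{count}\n")
```

Calling `write_keywords(top_keywords(texts, limit=200), "keywords.txt")` would produce a candidate list that still needs a manual pass, since frequent words are not necessarily policy-relevant ones.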
Not all words are created equal: some keywords are worse than others. You should therefore consider assigning different weights to different terms.
For example, adult filters in English should give ‘porno’ a higher weight than ‘sex’. While ‘porno’ is almost exclusively related to content that is not family-safe, ‘sex’ may also mean ‘gender’, depending on the context it is used in.
Also consider words that are safe on their own but, when combined with another word, might indicate something else entirely. The word ‘pictures’, for example, is innocent enough, but ‘teen pictures’ would often refer to pornography.
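One way to picture weighted terms and multi-word phrases is a simple scoring function. The sketch below is only an illustration: the weights, the threshold and the names `TERM_WEIGHTS`, `score_text` and `is_flagged` are assumptions chosen for the example, and real values would need tuning against manually reviewed content.

```python
import re

# Hypothetical weights: a higher number means a stronger signal.
TERM_WEIGHTS = {
    "porno": 10,
    "sex": 3,
    "teen pictures": 8,  # a phrase made of individually harmless words
}
FLAG_THRESHOLD = 8  # assumed cut-off, not a recommended value


def score_text(text, weights=TERM_WEIGHTS):
    """Sum the weight of every listed term or phrase found in the text."""
    # Normalize to lowercase words separated by single spaces.
    normalized = " ".join(re.findall(r"[a-z0-9']+", text.lower()))
    score = 0
    for term, weight in weights.items():
        # \b stops 'sex' from matching inside unrelated words such as 'Essex'.
        hits = re.findall(r"\b" + re.escape(term) + r"\b", normalized)
        score += weight * len(hits)
    return score


def is_flagged(text, threshold=FLAG_THRESHOLD):
    """True when the combined weight reaches the threshold."""
    return score_text(text) >= threshold
```

With these example values, a form field reading "Sex: female" scores 3 and is not flagged, while a single occurrence of a high-weight term is enough to cross the threshold.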
Method 1 – User-generated content is scanned after it's displayed on a page:
- Scan the content
- Flag if it meets filtering criteria
- Disable ad serving on the page hosting said content
- Manually review content:
  - If it is safe, enable ad serving and adjust filters
  - If it isn't, make sure that the content isn't displayed on pages that include ad code
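The Method 1 flow can be sketched as a small state change on a page record. This is a hedged illustration only: the `Page` class, the substring-based `matches_filter` check and the function names are hypothetical, and a real site would plug in its own filter and ad-serving controls.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical model of a published page with user-generated content."""
    content: str
    ads_enabled: bool = True
    needs_review: bool = False


def matches_filter(text, blocked_terms):
    """Placeholder check: True if any blocked term appears in the text."""
    lowered = text.lower()
    return any(term in lowered for term in blocked_terms)


def scan_published_page(page, blocked_terms):
    """Scan a live page; on a match, stop ads and flag it for a human."""
    if matches_filter(page.content, blocked_terms):
        page.ads_enabled = False
        page.needs_review = True
    return page


def apply_review(page, is_safe):
    """Record the reviewer's decision for a flagged page."""
    page.needs_review = False
    # Safe content gets ads back; unsafe content stays on a page without ads.
    page.ads_enabled = is_safe
    return page
```

The point of the design is that ads are switched off as soon as the filter matches, so the page never serves ads while it waits for a human decision.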
Method 2 – User-generated content is scanned before it's made available to users:
- Scan the content
- Flag if it meets filtering criteria
- Queue it for review or reject it outright
- Manually review content:
  - If it is safe, show it on ad-serving pages and adjust filters
  - If it isn't, either show it on a page with ad serving disabled or reject it
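The Method 2 flow differs in that nothing is published until it has passed the scan or a reviewer. A minimal sketch, assuming an in-memory queue and a simple substring filter; `Status`, `review_queue`, `submit` and `review_next` are hypothetical names introduced for the example.

```python
from collections import deque
from enum import Enum


class Status(Enum):
    PUBLISHED_WITH_ADS = "published_with_ads"
    PUBLISHED_WITHOUT_ADS = "published_without_ads"
    PENDING_REVIEW = "pending_review"
    REJECTED = "rejected"


review_queue = deque()  # flagged submissions waiting for a human reviewer


def submit(text, blocked_terms, reject_outright=False):
    """Scan a submission before anyone sees it and decide its first status."""
    flagged = any(term in text.lower() for term in blocked_terms)
    if not flagged:
        return Status.PUBLISHED_WITH_ADS
    if reject_outright:
        return Status.REJECTED
    review_queue.append(text)
    return Status.PENDING_REVIEW


def review_next(is_safe, show_unsafe_without_ads=False):
    """Apply a reviewer's verdict to the oldest queued submission."""
    text = review_queue.popleft()
    if is_safe:
        return text, Status.PUBLISHED_WITH_ADS
    if show_unsafe_without_ads:
        return text, Status.PUBLISHED_WITHOUT_ADS
    return text, Status.REJECTED
```

Whether flagged items are queued or rejected outright is a policy choice; queuing costs reviewer time but gives you the false positives you need to adjust the filters.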
Commercial solutions in a nutshell
A number of services provide content filtering, and a few specialise in filtering specific types such as adult or copyrighted content. There are also crowdsourcing platforms that connect publishers with users looking to make easy money on the Internet. The best approach is to do some market research: look for sites that review software and see which filtering systems for user-generated content they recommend. Once you have this information, choose the solution that best fits the service you provide, based on the product’s score, its unique features and its pricing model.