Technical aside: identifying news articles

How can you tell if a given web page is a news article or not?

The browser extensions need to know this. When you go to a news article, the extensions need to perform a lookup in the database so they can tell you about any attached sources or warning labels.

Currently the extensions have a built-in whitelist of news web sites - if you're reading a page on one of those sites, it's assumed to be a news article. The extension makes this decision based solely on the URL, so it can kick off the request to the database at exactly the same time it starts loading the news article itself. By the time the article is ready to view, the extension should have any sources, warning labels or other information all ready to display.
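As a rough illustration, here's a minimal sketch of that kind of URL-only check. It's written in Python for brevity rather than in the extensions' own code, and the site names in `NEWS_SITES` and the function name `is_whitelisted` are made up for the example, not taken from the real extensions.

```python
from urllib.parse import urlsplit

# Hypothetical whitelist for illustration; the real list ships with the extension.
NEWS_SITES = {"example-news.com", "dailyexample.co.uk"}

def is_whitelisted(url: str) -> bool:
    """Return True if the URL's host is a whitelisted site or a subdomain of one."""
    # hostname is None for URLs without a network location (e.g. "about:blank").
    host = (urlsplit(url).hostname or "").lower()
    return any(host == site or host.endswith("." + site) for site in NEWS_SITES)
```

Because the check only needs the URL, it can run before a single byte of the page has arrived, which is what lets the database request start alongside the page load.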

But a whitelist is pretty unfair to sites not on the list. And it's _really_ hard keeping a comprehensive list up to date. I'm not sure there's a perfect solution, but here are some options:

  1. Let the user add or remove sites from the whitelist
  2. Use more information on the page itself to try and tell if it's a news article. For example, a lot of sites contain <meta> tags for Facebook's Open Graph protocol, so you'd see something like this in the raw HTML:
    <meta property="og:type" content="article" />
  3. Try and guess news articles based solely on their URL. Most sites use URLs which follow a few general forms. Use of slugs is very common. So is the use of numeric identifiers (eg ".../article-12345.html"). Most such URLs could be identified by some simple pattern-matching rules.

Option 1 is problematic for two reasons. Firstly, it places a burden upon the user to maintain their own whitelist of news sites. Secondly, the whitelist is not shared between users. So if one user adds a site to their whitelist, it'll still be missing from another user's whitelist. The second user will never see any sources or labels which might be attached to articles on such sites.

Option 2 could correctly identify a lot of news articles (but not all). However, the identification can only be done once the page has loaded, so the extra information from the database would take longer to become available.
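To give a feel for the page-content check, here's a small sketch that scans HTML for an `og:type` of `article`. It uses Python's standard `html.parser` purely for illustration; a real extension would presumably query the page's DOM instead. The names `OpenGraphTypeParser` and `looks_like_article` are my own inventions.

```python
from html.parser import HTMLParser

class OpenGraphTypeParser(HTMLParser):
    """Records the content of the first <meta property="og:type"> tag seen."""

    def __init__(self):
        super().__init__()
        self.og_type = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are routed here as well.
        if tag != "meta" or self.og_type is not None:
            return
        attr = dict(attrs)
        if (attr.get("property") or "").lower() == "og:type":
            self.og_type = (attr.get("content") or "").strip().lower()

def looks_like_article(html: str) -> bool:
    """Return True if the page declares itself as an Open Graph article."""
    parser = OpenGraphTypeParser()
    parser.feed(html)
    return parser.og_type == "article"
```

Note that this needs the HTML in hand, so it can only run after at least the page's head has been fetched.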

URL matching as in option 3 would correctly trigger for most news article pages. But it would also trigger on a lot of non-news sites which just happen to use similar URL schemes.
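Here's a sketch of what such pattern-matching rules might look like. The two patterns and their thresholds (three or more hyphen-separated words for a slug, five or more digits for a numeric ID) are guesses for the sake of the example, not tested rules, and `url_looks_like_article` is a hypothetical name.

```python
import re
from urllib.parse import urlsplit

# Illustrative patterns only, applied to the path part of the URL.
ARTICLE_PATH_PATTERNS = [
    # A slug: three or more hyphen-separated words in the final path segment,
    # with an optional file extension and optional trailing slash.
    re.compile(r"/[a-z0-9]+(?:-[a-z0-9]+){2,}(?:\.[a-z]+)?/?$", re.IGNORECASE),
    # A numeric identifier of five or more digits in the final path segment,
    # e.g. ".../article-12345.html".
    re.compile(r"/[^/]*\d{5,}[^/]*$"),
]

def url_looks_like_article(url: str) -> bool:
    """Return True if the URL path matches any of the article-like patterns."""
    path = urlsplit(url).path
    return any(pattern.search(path) for pattern in ARTICLE_PATH_PATTERNS)
```

With these rules, a made-up URL like "https://example.com/news/article-12345.html" matches, but so does a shop page like "https://shop.example.com/products/blue-widget-deluxe-98765", which is exactly the false-positive problem.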


So I'm not yet sure what the proper solution is, but I rather suspect that it'll involve some sort of combination of methods.

Suggestions and ideas welcome!