How Semalt.com Inflates Your Site Traffic and How to Stop It

Published Oct 2, 2014
SEO

Strong web traffic is hard-earned and encouraging to see, but your site traffic may not be quite what it seems. Many websites are seeing inflated traffic counts “thanks” to robot web crawlers originating from Semalt.

Google Analytics is all about helping you make informed decisions to improve the user experience on your site, so it doesn’t do you much good to include non-human traffic in your analysis. In this post we discuss the steps to prevent crawler bots from skewing your data. New bots and other interloping techniques will no doubt pop up, so vigilance is essential, but this is a great start.

What is Semalt.com?

Semalt purports to use web crawlers to gather information about websites, and then resells that information as competitive industry analysis. However, there is reason to believe that not everything they do is benign. As this excellent post at Info Security Magazine details, Semalt even creates massive “referral spam” attacks to try to manipulate search engine rankings.

Google Analytics typically does a great job excluding robots from showing up as visits to your site, but Semalt’s crawlers manage to break through the human/non-human traffic barrier and subsequently inflate your traffic counts.

Is my traffic inflated by Semalt?

To find out whether your traffic is inflated by Semalt crawlers, log into Google Analytics, click “Acquisition” in the left column, then “All Referrals.” If your referral report is littered with Semalt domains, you know that your traffic count is inflated.
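If you export the referral report, a few lines of Python can put a number on the inflation. This is only a minimal sketch: the function name `semalt_share`, the (source, sessions) row layout, and the sample figures are all invented for illustration and do not come from any Google Analytics API.

```python
from typing import Iterable, Tuple


def semalt_share(rows: Iterable[Tuple[str, int]]) -> float:
    """Return the fraction of referral sessions whose source contains 'semalt.com'."""
    rows = list(rows)
    total = sum(sessions for _, sessions in rows)
    if total == 0:
        return 0.0
    # Case-insensitive substring check, so subdomains of semalt.com are counted too.
    spam = sum(
        sessions for source, sessions in rows if "semalt.com" in source.lower()
    )
    return spam / total


# Invented sample rows in a hypothetical (source, sessions) layout.
referrals = [
    ("google.com", 600),
    ("semalt.semalt.com", 150),
    ("t.co", 200),
    ("12.semalt.com", 50),
]

print(f"Share of referral sessions from Semalt: {semalt_share(referrals):.1%}")
```

With the made-up rows above, 200 of 1,000 referral sessions come from Semalt domains, so the script reports 20.0%.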


What do I do if Semalt is crawling my site?

Semalt’s website offers to remove your site from the list of sites they crawl. Go ahead and do that using this link, and consider doing it for some friends, too:

semalt.com/project_crawler.php

Of course, there’s no telling how long it will take Semalt to omit your site, if it ever does, so you’ll want to take matters into your own hands to prevent your data from being inflated.

[UPDATE] Don’t submit your site through the removal link above. There’s evidence that doing so can make things worse.

Below are a few methods to address the data-inflating issue. The first option is the most technical and straightforward, but if you don’t want to get your web team involved, we have a few Analytics-based suggestions, as well.

1) Add the following to your “.htaccess” file in your root html folder.

This file is typically accessible via FTP. We strongly encourage you to make a backup copy of your current “.htaccess” file first so you can revert quickly in case anything goes wrong.

# block visitors referred from semalt.com
RewriteEngine on
RewriteCond %{HTTP_REFERER} semalt\.com [NC]
RewriteRule .* - [F]
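Before deploying the rule, it can help to check which referrers the pattern would actually catch. The Python sketch below mimics the unanchored, case-insensitive match that the RewriteCond performs, with the dot escaped so it matches a literal period. The function name `would_be_blocked` and the sample URLs are hypothetical, and Python's `re` module only approximates Apache's regex engine.

```python
import re

# The referrer pattern from the rule above; re.IGNORECASE stands in for the [NC] flag.
SEMALT_PATTERN = re.compile(r"semalt\.com", re.IGNORECASE)


def would_be_blocked(referer: str) -> bool:
    """Return True if a request carrying this Referer header would match the rule."""
    # RewriteCond patterns are unanchored, so use search() rather than match().
    return SEMALT_PATTERN.search(referer) is not None


if __name__ == "__main__":
    # Made-up referrer values for illustration only.
    samples = [
        "http://semalt.semalt.com/crawler.php?u=http://example.com",
        "HTTP://SEMALT.COM/",
        "https://www.google.com/",
        "",
    ]
    for referer in samples:
        verdict = "blocked (403)" if would_be_blocked(referer) else "allowed"
        print(f"{referer!r}: {verdict}")
```

Because the match is a substring search, any subdomain of semalt.com is caught as well, while requests with an empty referrer pass through untouched.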

2) Create a filter in Google Analytics to prevent future Semalt traffic from being logged.

You can also add a filter on an existing Google Analytics property view, but we recommend creating a new property view for this purpose. Once a filter is in place on a view, the data moving forward is permanently changed, so we generally suggest playing it safe.

When logged into Google Analytics:

Admin > View column > Dropdown: “Create new view”

[Screenshot: Create new view]

Now that you’re up and running with your dedicated view, you can create a new filter to exclude Semalt traffic:

Admin > Property > Filters (under “View” column) > + New Filter

Use a predefined filter to exclude traffic from the Semalt domain, click “Save,” and you’re good to go!

[Screenshot: Create a new filter]

3) Create a segment to view historical data without Semalt traffic.

Keep in mind, both new views and new filters only apply to data moving forward; you cannot retroactively filter your data. So what should you do to see unadulterated past data? You can create a segment to view your historical data in an existing view.

On any page when logged into Google Analytics:

Add Segment > Traffic Sources > Filter Sessions. Under “Source,” switch the drop-down to “does not contain” and add “semalt.com.” You can then save and apply the segment.

This segment will flow through to your other reports in Google Analytics as long as you keep it in place. As you can see below, we’re now viewing 96.63% of our data, which is all of the non-Semalt data.

[Screenshot: Exclude traffic from semalt.com]
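The logic behind the segment can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Google Analytics is implemented: the `Session` record, the helper names, and the sample sessions are invented, and the case-insensitive comparison is a choice made for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Session:
    source: str
    pageviews: int


def exclude_source(sessions: List[Session], needle: str = "semalt.com") -> List[Session]:
    """Keep only sessions whose source does not contain `needle`."""
    needle = needle.lower()
    # Build a new list; the stored sessions themselves are left untouched,
    # which is why a segment can be applied to historical data.
    return [s for s in sessions if needle not in s.source.lower()]


def percent_retained(all_sessions: List[Session], kept: List[Session]) -> float:
    """Percentage of sessions that remain after the segment is applied."""
    if not all_sessions:
        return 100.0
    return 100.0 * len(kept) / len(all_sessions)


# Invented historical sessions for illustration.
history = [
    Session("google.com", 3),
    Session("semalt.semalt.com", 1),
    Session("(direct)", 5),
    Session("t.co", 2),
]

kept = exclude_source(history)
print(f"Segment shows {percent_retained(history, kept):.2f}% of sessions")
```

Here one of four invented sessions is dropped, so the segment covers 75.00% of sessions. The original `history` list is unchanged, so removing the segment brings the full data back.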

After you apply the segment to your data, you can delete the default “All Sessions” segment to reduce noise in your reports. There’s not much value in the comparative data, and you can always add the segment back if you want.

[Screenshot: Remove "All Sessions" segment]

That’s it!

You are now enjoying the fruits of undistorted data. Not only do you get a clearer, more useful picture of visitor behavior on your site, but you can also feel good knowing you’ve donned your White Hat and done your part to beat back the tide of potential bad actors.
