How does changing an HTTP referrer header help circumvent crawler blocking
I have been researching different ways a web crawler might be blacklisted or blocked by a web server, and how to potentially circumvent that. One of those ways is to change the referrer header on the request. I have looked in various places trying to figure out the benefit of doing this, but I believe I am overthinking it and have tunnel vision.
A couple of other ways to disguise yourself from the web servers you are crawling are changing the User-Agent header on the request, or proxying your requests through other servers so that each call comes from a new public IP. That makes sense to me: the server can't tell the requests are all coming from the same machine or the same client, so for all it knows the traffic is coming from thousands of machines and 10-20 different browsers, all belonging to unique users. Is the benefit of changing the referrer header the same? I'm getting hung up on how it would be implemented. Would you just cycle through hundreds of randomly generated URLs and add a new one to the request headers each time?
For example: ref1 = www.random.com, ref2 = www.random2.com, ref3 = random3.com
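In code, I imagine that would look roughly like this sketch with the Python requests library (the domains are just the placeholders above, not real sites):

```python
import random
import requests

# Placeholder referrer pool taken from the example above
REFERRERS = ["http://www.random.com", "http://www.random2.com", "http://random3.com"]

def fetch(url):
    # Note: the HTTP header is spelled "Referer" (a single r in the middle)
    headers = {"Referer": random.choice(REFERRERS)}
    return requests.get(url, headers=headers, timeout=10)

response = fetch("https://example.com/some-page")
print(response.status_code)
```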
Tags: web-crawlers, referrer, python, request
asked Jan 4 at 6:51 by JBT
As there are many reasons for the Referer header not to be sent, I don't think there are really that many sites that would block you based on its absence. The more common triggers are keeping the default user agent of an HTTP library (e.g. libcurl) instead of a regular browser, and sending excessive traffic from a single IP. But it really depends on whether you plan to crawl the same site repeatedly or many different sites, and in the former case, on how much effort the site puts into preventing crawling.
– jcaron, Jan 4 at 12:21
1 Answer
The idea is to make your requests look as much like a real browser's as possible. Real browsers send referrer headers, so you would want to send referrer headers that look as much as possible like the ones a real browser would send.
A real browser never sends random referrer headers; it sends the URL of the previous page, so most referrers end up being pages on the same site.
The ideal strategy would be to fetch the home page without a referrer header, mimicking a user who types in the home page URL, which is very common. As your crawler visits pages on the site, it would keep track not only of the URLs it finds but also of which pages it found them on. When it fetches a new page, it would use one of the pages where it found the link as the referrer.
No referrer - Gives away that you are a bot.
Random referrers - Gives away that you are a bot and probably pollutes the site's analytics. That type of bot is likely to be blocked even faster than a no-referrer bot.
Home page referrer - Always using the home page as the referrer can sometimes get around checks for a missing referrer and looks somewhat legitimate.
Linking page as referrer - The strategy described above is the closest to a real browser, but even then the order in which you visit pages is likely to differ from that of a real visitor (a rough sketch follows below).
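A minimal sketch of that last strategy, assuming the Python requests library and a deliberately naive link extractor; the start URL is a placeholder:

```python
from collections import deque
from urllib.parse import urljoin
import re

import requests

START_URL = "https://example.com/"  # placeholder for the site being crawled

def extract_links(base_url, html):
    # Deliberately naive href extraction; a real crawler would use an HTML parser
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def crawl(start_url, max_pages=50):
    found_on = {start_url: None}   # URL -> page on which the link to it was found
    queue = deque([start_url])
    fetched = set()

    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        if url in fetched:
            continue
        fetched.add(url)

        headers = {}
        referrer = found_on.get(url)
        if referrer:               # the entry page is fetched with no Referer at all
            headers["Referer"] = referrer

        response = requests.get(url, headers=headers, timeout=10)

        for link in extract_links(url, response.text):
            if link.startswith(start_url) and link not in found_on:
                found_on[link] = url   # remember where this link was first seen
                queue.append(link)

    return fetched

crawl(START_URL)
```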
answered Jan 4 at 10:31 by Stephen Ostermiller♦
As @jcaron notes in his comment on the question, a missing referrer is not conclusive, since user agents don't always send one. A home page referrer is a dead give-away that you are a bot if the home page doesn't actually link to the page being requested. It's usually obvious to my bot filters what's a bot and what's not, using a combination of IP-based information (hostname, org name, geography, etc.) and client-supplied information (headers, etc.). Hits from server farms are almost always bots, though they could be VPNs; headers usually give bots away. The vast majority of bots I see make tell-tale mistakes in individual HTTP headers or in the way they combine headers.
– pseudon, Jan 4 at 17:28
Even without more complex behavioral analysis to detect bots (like tracking the path a visitor takes through the site), other signals that contribute to bot determination include: lack of response (or an incorrect response) to cookies, redirects of various kinds, JavaScript/AJAX, websockets, and other interactive client-server behaviors.
– pseudon, Jan 4 at 17:32
It is very hard to build a bot that isn't detected by some trivial heuristic. Many bots don't download CSS and images either.
– Stephen Ostermiller♦, Jan 4 at 17:57
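Taking those comments together, a crawler that wants to avoid the most obvious tells would send a complete, internally consistent header set and keep cookies across requests. A rough sketch under those assumptions; the header values are illustrative, will date quickly, and are not a guaranteed-safe combination:

```python
import requests

# Illustrative browser-like header set; the exact values are assumptions and would
# need to be kept in sync with a current browser to stay plausible.
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
}

session = requests.Session()          # a Session also carries cookies between requests
session.headers.update(BROWSER_HEADERS)

response = session.get("https://example.com/", timeout=10)
print(response.status_code)
```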