Curl 404 but Fine in Browser + Solution
May 19, 2009 · 2 minute read · Category: scraping
This post is now quite old and the information it contains may be out of date or inaccurate.
If you find any errors or have any suggestions to update the information, please let us know or create a pull request on GitHub.
I scratched my head and messed around testing the page in a variety of online proxy services and local web browsers. I even started messing about with telnet and manually typing headers. My conclusion was that the simpler systems, such as text-based browsers, were not able to see the page and were instead given a 404 message.
However, better, more modern browsers could see the page. Likewise, the page was visible in the Google cache and also in Google Translate.
In the end I downloaded a neat little Firefox add-on called Tamper Data. This allows you to tweak your request headers before they are submitted. Five minutes later I realised that it was the Gzip compatibility which was causing the issue.
Curl (being the awesome tool that it is) can handle Gzip compression, but I wasn’t using it. I have now added the following line to my curl function and I am pulling pages fine.
if (!empty($compression)) {
    // Tell curl to request gzip-compressed responses and decode them automatically
    curl_setopt($go, CURLOPT_ENCODING, $compression);
}
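For context, here is a minimal sketch of how the whole thing might fit together. The fetch_page() helper, the user agent string and the extra options are my own illustration rather than the original function, but the important part is the CURLOPT_ENCODING line, which makes curl advertise and transparently decompress gzip responses.

&lt;?php
// Hypothetical helper showing CURLOPT_ENCODING in a complete request.
function fetch_page($url, $compression = 'gzip')
{
    $go = curl_init($url);
    curl_setopt($go, CURLOPT_RETURNTRANSFER, true);      // return the body instead of printing it
    curl_setopt($go, CURLOPT_FOLLOWLOCATION, true);      // follow any redirects
    curl_setopt($go, CURLOPT_USERAGENT, 'Mozilla/5.0');  // some sites also check the user agent

    if (!empty($compression)) {
        // An empty string here would accept every encoding curl supports;
        // passing 'gzip' asks for gzip explicitly.
        curl_setopt($go, CURLOPT_ENCODING, $compression);
    }

    $page = curl_exec($go);
    curl_close($go);
    return $page;
}

echo fetch_page('http://example.com/');

Note that passing an empty string to CURLOPT_ENCODING is also valid: curl will then offer all the encodings it was built with, which is a reasonable default when you do not know what the server expects.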