Python crawler does not work properly - python

I'd just written a Python crawler to download midi files from freemidi.org. Looking at the request headers in Chrome, I found that the "Referer" attribute had to be https://freemidi.org/download-20225 (referred to as "download-20225" later) if the download page was https://freemidi.org/getter-20225 (referred to as "getter-20225" later) in order to download the midi file properly. I did so in Python, setting the header like this:
headers = {
    'Referer': 'https://freemidi.org/download-20225',
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}
which was exactly the same as the request header I had viewed in Chrome, and I tried to download the file using this line of code.
midi = requests.get(url, headers=headers).content
However, it did not work properly. Instead of downloading the midi file, it downloaded an HTML file of the page "download-20225". I later found that if I tried to access the page "getter-20225" directly, it took me to "download-20225" as well. I think this probably indicates that the header was wrong, so the site sent me to the other page instead of starting the download.
I'm quite new to writing Python crawlers, so could someone help me find what went wrong with the program?

It looks like the problem here is that the page with the midi file (e.g. "getter-20225") wants to redirect you back to the song page (e.g. "download-20225") after downloading the song. However, requests is only returning the content from the final page in the redirect.
You can set the allow_redirects parameter to False to have requests return the content from the "getter" page (i.e. the midi file):
midi = requests.get(url, headers=headers, allow_redirects=False)
Note that if you want to write the midi file to disk, you will need to open your target file in binary mode (since the midi file is written in bytes).
with open('example.mid', 'wb') as ex:
    ex.write(midi.content)
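Before writing the file, it can help to confirm that the response really is a MIDI file rather than a redirect or an HTML page. The sketch below is one possible check, not part of the original answer: the helper names `looks_like_midi` and `explain_response` are hypothetical, and the check relies on Standard MIDI Files starting with the bytes `MThd`. The helpers take plain values, so they can be tried without making a request.

```python
MIDI_MAGIC = b"MThd"  # chunk ID that opens every Standard MIDI File


def looks_like_midi(data: bytes) -> bool:
    """Return True if the payload starts with the MIDI header chunk ID."""
    return data.startswith(MIDI_MAGIC)


def explain_response(status_code, headers, body: bytes) -> str:
    """Summarize what a response fetched with allow_redirects=False contains.

    `headers` is any mapping with a .get() method, such as a dict or
    the headers attribute of a requests response.
    """
    if looks_like_midi(body):
        return "midi"
    if 300 <= status_code < 400:
        # A redirect without a MIDI body: report where the server points.
        return "redirect to " + headers.get("Location", "<no Location header>")
    return "unexpected content (" + headers.get("Content-Type", "unknown type") + ")"


# Intended use with requests (not executed here; url and headers as in the question):
#   resp = requests.get(url, headers=headers, allow_redirects=False)
#   if explain_response(resp.status_code, resp.headers, resp.content) == "midi":
#       with open("example.mid", "wb") as f:
#           f.write(resp.content)
```

If the summary is a redirect or HTML instead of "midi", the Referer or other headers are probably still not what the site expects.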

Related

Download image with python requests

I'm new to Python. I have to download some images from the web and save them to my local file system. I've noticed that the response content does not contain any image data.
The problem only occurs with this specific URL; with every other image URL the code works fine.
I know the easiest solution would be to just use another URL, but I'd still like to ask whether someone has had a similar problem.
import requests
url = 'https://assets.coingecko.com/coins/images/1/large/bitcoin.png'
filename = "bitcoin.png"
response = requests.get(url, stream = True)
response.raw.decode_content = True
with open(f'images/{filename}', 'wb') as outfile:
    outfile.write(response.content)
First, look at the content of the response with response.text. You'll see that the website blocked your request and returned this message instead of the image:
Please turn JavaScript on and reload the page.
Then check whether changing the User-Agent of your request fixes the issue.
response = requests.get(
    url,
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
    },
    stream = True
)
If it doesn't, you may need to fetch the data with a tool that can execute JavaScript, such as Selenium or Puppeteer.

Unable to download file from URL using python

I am trying to download the file from the URL:
https://www.cmegroup.com/content/dam/cmegroup/notices/clearing/2020/08/Chadv20-239.pdf
I tried using the python requests library, but the request just timed out. I tried specifying the 'User-Agent' from my browser as a header, but it still just timed out, including when I copied across every single header from my browser into my python script. I tried setting allow_redirects=True, this did not help. I've also tried wget and curl, everything fails apart from actually opening the browser, visiting the URL and downloading the file.
I'm wondering what the actual difference is between the requests in my browser and the python requests where I set the headers to match those in my browser - is there any way I can download this file using python?
Code snippet:
import requests
requests.get("https://www.cmegroup.com/content/dam/cmegroup/notices/clearing/2020/08/Chadv20-239.pdf") # hangs
Try this; it worked for me.
import requests
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36'}
response = requests.get(
    "https://www.cmegroup.com/content/dam/cmegroup/notices/clearing/2020/08/Chadv20-239.pdf", headers=headers)
# Write the PDF in binary mode; the with block closes the file automatically
with open("Chadv20-239.pdf", 'wb') as pdf:
    pdf.write(response.content)
It is difficult to tell what might be going wrong without a code snippet. How is the file being downloaded? Are you getting the raw response content and saving that as a PDF? The official docs (https://docs.python-requests.org/en/latest/user/quickstart/#raw-response-content) suggest a chunk-based approach for saving streamed/raw content. Did you try that approach?
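The chunk-based approach mentioned above could look roughly like the following. This is a minimal sketch rather than the code from the docs: `copy_chunks` is a hypothetical helper name, and the chunk size and timeout in the commented usage are arbitrary choices. The demonstration uses an in-memory buffer so it runs without network access.

```python
import io


def copy_chunks(chunks, fileobj) -> int:
    """Write an iterable of byte chunks to a binary file object.

    Returns the total number of bytes written.
    """
    total = 0
    for chunk in chunks:
        if chunk:  # skip empty chunks
            fileobj.write(chunk)
            total += len(chunk)
    return total


# Offline demonstration with an in-memory buffer instead of a real download.
buffer = io.BytesIO()
written = copy_chunks([b"%PDF-1.4\n", b"", b"rest of file"], buffer)

# Intended use with requests (not executed here; url and headers as above):
#   with requests.get(url, headers=headers, stream=True, timeout=30) as resp:
#       resp.raise_for_status()
#       with open("Chadv20-239.pdf", "wb") as pdf:
#           copy_chunks(resp.iter_content(chunk_size=8192), pdf)
```

Passing a timeout makes a stalled request raise an exception instead of hanging indefinitely, which is useful when diagnosing a server that never answers.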

Html in browser different than the one requested in Python

import requests
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
headers = {'User-Agent': user_agent}
page = requests.get("https://sky.lea.moe/stats/PapaGordsmack/", headers=headers)
html_contents = page.text
print(html_contents)
I am trying to scrape the sky.lea.moe website for a specific user, but when I request the HTML and print it, it is different from the one shown in the browser (in Chrome, viewing the page source).
The one I get is: https://pastebin.com/91zRw3vP
Analyzing this one, it is something about checking browser and redirecting. Any ideas what I should do?
This is Cloudflare's anti-DDoS protection, and it is effective at stopping scraping. A JS script will usually redirect you after a few seconds.
Something like Selenium is probably your best option for getting around it, though you might be able to scrape the JS file and extract the URL it redirects to. You could also try spoofing your referrer to be this page, so that you land on the correct one.
Browsers do more than download a single HTML document: they also fetch additional resources, run scripts, and apply styles. For scraping, a dedicated library such as Scrapy provides tooling to crawl pages and extract information from them, though Scrapy does not execute JavaScript on its own.

How do I convert Python crawled Bing web page content to human-readable?

I'm playing with crawling the Bing web search page using Python.
The raw content I receive looks like bytes, but my attempt to decompress it failed.
Does anyone have a clue what kind of data this is, and how I should extract readable text from this raw content? Thanks!
My code displays the raw content and then tries to gunzip it, so you can see the raw content as well as the error from the decompression.
Because the raw content is too long, I have pasted only the first few lines below.
Code:
import urllib.request as Request
import gzip
req = Request.Request('http://www.bing.com')  # the URL needs a scheme
req.add_header('upgrade-insecure-requests', '1')
ResPage = Request.urlopen(req).read()
print("RAW Content: %s" % ResPage) # show raw content of web
print("Try decompression:")
print(gzip.decompress(ResPage)) # try decompression
Result:
RAW Content: b'+p\xe70\x0bi{)\xee!\xea\x88\x9c\xd4z\x00Tgb\x8c\x1b\xfa\xe3\xd7\x9f\x7f\x7f\x1d8\xb8\xfeaZ\xb6\xe3z\xbe\'\x7fj\xfd\xff+\x1f\xff\x1a\xbc\xc5N\x00\xab\x00\xa6l\xb2\xc5N\xb2\xdek\xb9V5\x02\t\xd0D \x1d\x92m%\x0c#\xb9>\xfbN\xd7\xa7\x9d\xa5\xa8\x926\xf0\xcc\'\x13\x97\x01/-\x03... ...
Try decompression:
Traceback (most recent call last):
OSError: Not a gzipped file (b'+p')
Process finished with exit code 1
It's much easier to get started with the requests library. Plus, this is also the most commonly used lib for http requests nowadays.
Install requests in your python environment:
pip install requests
In your .py file:
import requests
r = requests.get("http://www.bing.com")
print(r.text)
OSError: Not a gzipped file (b'+p')
You need to do one of three things: add an accept-encoding header (for example "gzip" or "br") to the request headers, read content-encoding from the response and choose the matching decompressor, or use the requests library instead, which handles decompression for you.
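The second option, choosing a decompressor from the response's content-encoding, could be sketched like this using only the standard library. The function name `decode_body` is hypothetical. Brotli ("br") is not covered by the standard library, so this sketch raises an error for it rather than guessing; a third-party package would be needed for that case.

```python
import gzip
import zlib


def decode_body(raw: bytes, content_encoding: str = "") -> bytes:
    """Decompress a response body according to its Content-Encoding value."""
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("", "identity"):
        return raw  # body was sent uncompressed
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            return zlib.decompress(raw)  # zlib-wrapped deflate
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib header.
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    raise ValueError(f"unsupported Content-Encoding: {encoding!r}")


# Intended use with urllib (not executed here; req as in the question):
#   res = Request.urlopen(req)
#   body = decode_body(res.read(), res.headers.get("Content-Encoding", ""))
#   print(body.decode("utf-8", errors="replace"))
```

Calling gzip.decompress on a body that was compressed with a different encoding fails with an error like the one in the question, which is why the header is checked first.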
A second problem may appear: you need to pass a user-agent in the request headers so that the request looks like a "real" user visit.
If no user-agent is passed in the request headers, the requests library defaults to python-requests, so Bing or another search engine can tell that the request comes from a bot/script and may block it. Check what your user-agent is.
Pass user-agent using requests library:
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
How to reduce the chance of being blocked while web scraping search engines.
Alternatively, you can achieve the same thing by using Bing Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to spend time trying to bypass blocks from Bing or other search engines or figure out different tedious problems such as picking the correct CSS selector if the HTML layout is not the best out there.
Instead, focus on the data that needs to be extracted from the structured JSON. Check out the playground.
Disclaimer, I work for SerpApi.

Download ZIP file from the web (Python)

I am trying to download a ZIP file from this website. I have looked at other questions like this and tried using requests and urllib, but I get the same error:
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop. The last 30x error message was: Found
Any ideas on how to open the file straight from the web?
Here is some sample code
from urllib.request import urlopen
response = urlopen('http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip')
The linked url will redirect indefinitely, that's why you get the 302 error.
You can examine this yourself over here. As you can see the linked url immediately redirects to itself creating a single-url loop.
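To examine a redirect chain from Python rather than in a browser, one option is to follow the redirects one hop at a time and stop as soon as a URL repeats. The sketch below assumes a hypothetical `fetch` callable that returns the status code and the Location header for a URL; `trace_redirects` and `fake_fetch` are illustrative names, and the demonstration uses a fake server so it runs offline.

```python
from urllib.parse import urljoin


def trace_redirects(fetch, url, max_hops=10):
    """Follow redirects one hop at a time.

    `fetch` is a hypothetical callable: fetch(url) -> (status_code, location or None).
    Returns (chain, looped): the URLs visited in order, and whether a URL repeated.
    If max_hops is exhausted without a repeat, looped is False.
    """
    chain = [url]
    seen = {url}
    for _ in range(max_hops):
        status, location = fetch(chain[-1])
        if not (300 <= status < 400) or not location:
            return chain, False  # reached a non-redirect response
        next_url = urljoin(chain[-1], location)  # Location may be relative
        chain.append(next_url)
        if next_url in seen:
            return chain, True  # the same URL came up again: a loop
        seen.add(next_url)
    return chain, False


# Offline demonstration: a fake server that always redirects a URL to itself.
def fake_fetch(u):
    return 302, u


chain, looped = trace_redirects(fake_fetch, "http://example.com/file.zip")

# A real fetch could be built on requests (not executed here):
#   def fetch(u):
#       r = requests.get(u, allow_redirects=False, timeout=10)
#       return r.status_code, r.headers.get("Location")
```

A chain that ends by repeating its own first URL corresponds to the single-URL loop described above.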
Works for me using the Requests library
import requests
url = 'http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip'
response = requests.get(url)
# Unzip it into a local directory if you want
import io
import zipfile
# Named "archive" to avoid shadowing the built-in zip()
archive = zipfile.ZipFile(io.BytesIO(response.content))
archive.extractall("/path/to/your/directory")
Note that sometimes trying to access web pages programmatically leads to 302 responses because they only want you to access the page via a web browser.
If you need to fake this (don't be abusive), just set the 'User-Agent' header to be like a browser. Here's an example of making a request look like it's coming from a Chrome browser.
user_agent = 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36'
headers = {'User-Agent': user_agent}
requests.get(url, headers=headers)
There are several libraries (e.g. https://pypi.org/project/fake-useragent/) to help with this for more extensive scraping projects.
