automating download of files with urllib2 and/or wget - python

I was trying to figure out how to download files from a web hosting site like Zippyshare. I saw this post How to download in bash from zippyshare? that shows how to use wget, manually adding the cookie from the browser to the wget header. That works. But I want to use Python to get the cookie and then execute wget, so that I can do this programmatically (for example, scraping a bunch of download links).
I came up with this hacky script to get the cookie and execute the wget command but it seems that the cookie is not good because I get a 302 redirect:
import urllib2, os

class HeadRequest(urllib2.Request):
    def get_method(self):
        return "HEAD"

url = "http://www67.zippyshare.com/d/64003087/2432/Alex%20Henning%2c%20Laurie%20Webb%20-%20In%20Your%20Arms%20%28Joy%20Kitikonti%20Remix%29%20%5bquality-dance-music.com%5d.mp3"
referer = "http://www67.zippyshare.com/v/64003087/file.html"

response = urllib2.urlopen(HeadRequest(url))
headers = response.info()
jcookieString = headers['Set-Cookie'].split(';')[0]
# print headers
print "jcookie string " + jcookieString

# note the space before --user-agent so wget sees it as a separate option
wgetString = "wget " + url + " --referer=" + referer + " --cookies=off --header \"Cookie: " + jcookieString + "\"" + " --user-agent=\"Mozilla/5.0 (Windows NT 6.0) Gecko/20100101 Firefox/14.0.1\""
os.system(wgetString)
I also tried using python's cookielib, but got the same behavior of the 302 redirect. Thanks.
EDIT: Here is the code now using requests. The cookie that comes from the referer request is persisted because I am using a session to make the requests, yet it is still a no-go:
Looking at response.history shows that the 302 redirect is still happening for some reason.
import requests

downloadUrl = "http://www67.zippyshare.com/d/3278160/42939/Andre%20Nazareth%20-%20Bella%20Notte%20%28Original%20Mix%29%20%5bquality-dance-music.com%5d.mp3"
referer = "http://www67.zippyshare.com/v/3278160/file.html"
header = {"user-agent": "Mozilla/5.0 (Windows NT 6.0) Gecko/20100101 Firefox/14.0.1",
          'referer': referer}

refererSession = requests.Session()
refererSession.get(referer)

downloadResponse = refererSession.get(downloadUrl, headers=header)
print downloadResponse.request.headers
print downloadResponse.status_code

if downloadResponse.status_code == 200:
    mp3Name = "song2.mp3"
    song = open(mp3Name, "wb")
    song.write(downloadResponse.content)
    song.close()

Using a system call from within Python should really be left for situations where there is no other choice. Use the requests library, like so:
import requests
header={"user-agent":\"Mozilla/5.0 (Windows NT 6.0) Gecko/20100101 Firefox/14.0.1\"",
'referer': referer}
cookies = dict(cookie_name='cookie_text')
r = requests.get(url, header=header, cookies=cookies)
If it doesn't work, maybe the settings themselves aren't suitable for what you are trying to do. I am also perplexed as to why you both set the cookie and pass --cookies=off in the wget command.
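For completeness, here is a sketch of how that could look end to end with a Session: visit the referer page so whatever cookie it sets is kept, then fetch the file with browser-like headers. The URLs are the ones from the question; the cookie name and whether this alone avoids the 302 are assumptions.
import requests

referer = "http://www67.zippyshare.com/v/3278160/file.html"
download_url = "http://www67.zippyshare.com/d/3278160/42939/Andre%20Nazareth%20-%20Bella%20Notte%20%28Original%20Mix%29%20%5bquality-dance-music.com%5d.mp3"

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 6.0) Gecko/20100101 Firefox/14.0.1",
    "Referer": referer,
})

# Visit the referer page first; any Set-Cookie it returns stays in the session.
session.get(referer)
print(session.cookies.get_dict())

# Fetch the file without following redirects so a 302 is visible immediately.
response = session.get(download_url, allow_redirects=False)
print(response.status_code)
print(response.headers.get("Location"))

if response.status_code == 200:
    with open("song.mp3", "wb") as f:
        f.write(response.content)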

get a lot of HTML pages with python url requests

I am trying to download a large number of HTML pages from a certain website with the following Python code, using the requests package:
FROM = 547495
TO = 570000

for page_number in range(FROM, TO):
    url = DEFAULT_URL + str(page_number)
    response = requests.get(url)
    if response.status_code == 200:
        with open(str(page_number) + ".html", "wb") as file:
            file.write(response.content)
    time.sleep(0.5)
I put in a sleep(0.5) call so that the web server will not think it is a DDoS attack.
After about 20,000 pages I started getting only 403 Forbidden HTTP status codes, and I can't download pages anymore.
But if I try to open the same pages in my browser they open fine, so I guess the web server did not block me.
Does someone have an idea what caused this, and how I can handle it?
Thank you
Make the request look like it's coming from your browser by setting headers, and set a cookie ID if the site requires a session; here is an example. You can retrieve the header values by inspecting the "Network" tab in your browser's developer tools while visiting the pages.
with requests.Session() as sess:
    sess.headers["User-Agent"] = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0"
    sess.get(url)   # initial request so the session cookie gets set
    sess.headers["Cookie"] = "eZSESSID={}".format(sess.cookies.get("eZSESSID"))

    for page_number in range(FROM, TO):
        response = sess.get(DEFAULT_URL + str(page_number))
        if response.status_code == 200:
            with open(str(page_number) + ".html", "wb") as file:
                file.write(response.content)
        time.sleep(0.5)
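If the 403s come back mid-run, one option is to re-visit the entry page so the session picks up a fresh cookie and retry the page once. A rough sketch along those lines follows; DEFAULT_URL and ENTRY_URL are placeholders, since the real site isn't shown in the question.
import time
import requests

DEFAULT_URL = "https://www.example.com/page/"   # placeholder -- the real base URL is not shown in the question
ENTRY_URL = "https://www.example.com/"          # page that sets the session cookie (assumption)
FROM, TO = 547495, 570000

with requests.Session() as sess:
    sess.headers["User-Agent"] = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0"
    sess.get(ENTRY_URL)
    sess.headers["Cookie"] = "eZSESSID={}".format(sess.cookies.get("eZSESSID"))

    for page_number in range(FROM, TO):
        response = sess.get(DEFAULT_URL + str(page_number))
        if response.status_code == 403:
            # the cookie may have gone stale -- refresh it and retry the page once
            sess.get(ENTRY_URL)
            sess.headers["Cookie"] = "eZSESSID={}".format(sess.cookies.get("eZSESSID"))
            response = sess.get(DEFAULT_URL + str(page_number))
        if response.status_code == 200:
            with open(str(page_number) + ".html", "wb") as f:
                f.write(response.content)
        time.sleep(0.5)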

Scraping JSON from AJAX calls

Background
Considering this url:
base_url = "https://www.olx.bg/ad/sobstvenik-tristaen-kamenitsa-1-CID368-ID81i3H.html"
I want to make the ajax call for the telephone number:
ajax_url = "https://www.olx.bg/ajax/misc/contact/phone/7XarI/?pt=e3375d9a134f05bbef9e4ad4f2f6d2f3ad704a55f7955c8e3193a1acde6ca02197caf76ffb56977ce61976790a940332147d11808f5f8d9271015c318a9ae729"
Wanted results
If I press the button on the site in my Chrome browser, in the console I get the wanted result:
{"value":"088 *****"}
Debugging
If I open a new tab and paste the ajax_url, I always get empty values:
{"value":"000 000 000"}
If I try something like:
Bash:
wget $ajax_url
Python:
import requests
json_response= requests.get(ajax_url)
I just receive the HTML of the site's error-handling page.
Ideas
The browser must be sending something extra when it makes the request. What more does it have? Maybe a cookie?
How do I get the wanted result with Bash/Python ?
Edit
The status code of the HTML response is 200.
I have tried with curl and I get the same HTML problem.
Kind of a fix.
I have noticed that if I copy the cookie of the browser, and make a request with all the headers INCLUDING the cookie from the browser, I get the correct result
# I think the most important header is the cookie
headers = DICT_WITH_HEADERS_FROM_BROWSER
json_response = requests.get(next_url, headers=headers)
Final question
The only question left is how can I generate a cookie through a Python script?
First you should create a requests Session to store cookies.
Then send an HTTP GET request to the page that actually makes the ajax call. If any cookie is created by the website, it is sent in the GET response and your session stores the cookie.
Then you can easily use the session to call the ajax API.
Important Note 1:
The ajax url the original website calls expects an HTTP POST request! You should not send a GET request to that url.
Important Note 2:
You also must extract phoneToken from the website's JS code, where it is stored in a variable like var phoneToken = 'here is the pt';
Sample code:
import re
import requests
my_session = requests.Session()
# call html website
base_url = "https://www.olx.bg/ad/sobstvenik-tristaen-kamenitsa-1-CID368-ID81i3H.html"
base_response = my_session.get(url=base_url)
assert base_response.status_code == 200
# extract phone token from base url response
phone_token = re.findall(r'phoneToken\s=\s\'(.+)\';', base_response.text)[0]
# call ajax api
ajax_path = "/ajax/misc/contact/phone/81i3H/?pt=" + phone_token
ajax_url = "https://www.olx.bg" + ajax_path
ajax_headers = {
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,fa;q=0.8',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'Referer': 'https://www.olx.bg/ad/sobstvenik-tristaen-kamenitsa-1-CID368-ID81i3H.html',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}
ajax_response = my_session.post(url=ajax_url, headers=ajax_headers)
print(ajax_response.text)
When you run the code above, the result below is displayed:
{"value":"088 558 9937"}
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup
import time

options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)

driver.get('https://www.olx.bg/ad/sobstvenik-tristaen-kamenitsa-1-CID368-ID81i3H.html')

# click the phone number element so the full number is loaded into the page
driver.find_element_by_xpath(
    "/html/body/div[3]/section/div[3]/div/div[1]/div[2]/div/ul[1]/li[2]/div/strong").click()
time.sleep(2)

source = driver.page_source
soup = BeautifulSoup(source, 'html.parser')
phone = soup.find("strong", {'class': 'xx-large'}).text
print(phone)
Output:
088 558 9937

Download ZIP file from the web (Python)

I am trying to download a ZIP file from this website. I have looked at other questions like this one and tried using requests and urllib, but I get the same error:
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop. The last 30x error message was: Found
Any ideas on how to open the file straight from the web?
Here is some sample code
from urllib.request import urlopen
response = urlopen('http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip')
The linked url will redirect indefinitely, that's why you get the 302 error.
You can examine this yourself over here. As you can see, the linked url immediately redirects to itself, creating a single-url loop.
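You can also observe the loop from Python by turning redirect-following off. A small diagnostic sketch; what exactly ends up in the Location header is an assumption based on the behaviour described above.
import requests

url = 'http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip'

# Don't follow the redirect -- just report where the server wants to send us.
response = requests.get(url, allow_redirects=False)
print(response.status_code)               # expected: 302
print(response.headers.get('Location'))   # the redirect target, looping back on itself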
Works for me using the Requests library
import requests
url = 'http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip'
response = requests.get(url)
# Unzip it into a local directory if you want
import zipfile, io
zip = zipfile.ZipFile(io.BytesIO(response.content))
zip.extractall("/path/to/your/directory")
Note that sometimes trying to access web pages programmatically leads to 302 responses because they only want you to access the page via a web browser.
If you need to fake this (don't be abusive), just set the 'User-Agent' header to be like a browser. Here's an example of making a request look like it's coming from a Chrome browser.
user_agent = 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36'
headers = {'User-Agent': user_agent}
requests.get(url, headers=headers)
There are several libraries (e.g. https://pypi.org/project/fake-useragent/) to help with this for more extensive scraping projects.
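For example, fake-useragent can supply a plausible browser string so you don't have to hard-code one. A minimal sketch, assuming the package is installed (pip install fake-useragent):
import requests
from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-Agent': ua.chrome}   # or ua.random for a random browser string

url = 'http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip'
response = requests.get(url, headers=headers)
print(response.status_code)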

How to login phone.ipkall.com with python requests library?

I am trying to learn Python, but I have no knowledge about HTTP. I read some posts here about how to use requests to log in to a web site, but it doesn't work. My simple code is here (not the real number and password):
#!/usr/bin/env python3
import requests

login_data = {'txtDID': '111111111',
              'txtPswd': 'mypassword'}

with requests.Session() as c:
    c.post('http://phone.ipkall.com/login.asp', data=login_data)
    r = c.get('http://phone.ipkall.com/update.asp')
    print(r.text)
    print("Done")
But I can't get my personal information, which should be shown after login. Can anyone give me a hint, or point me in a direction? I have no idea what's going wrong.
Servers don't like bots (scripts), for security reasons, so your script has to behave like a human using a real browser. First use get() to obtain the session cookies, and set the user-agent header to a real one. Use http://httpbin.org/headers to see what user-agent is sent by your browser.
Always check the results: r.status_code and r.url.
So you can start with this:
(I don't have an account on this server so I can't test it.)
#!/usr/bin/env python3
import requests

s = requests.Session()
s.headers.update({
    'User-agent': "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:30.0) Gecko/20100101 Firefox/30.0",
})

# --------
# to get cookies, session ID, etc.
r = s.get('http://phone.ipkall.com/login.asp')
print(r.status_code, r.url)

# --------
login_data = {
    'txtDID': '111111111',
    'txtPswd': 'mypassword',
    'submit1': 'Submit'
}
r = s.post('http://phone.ipkall.com/process.asp?action=verify', data=login_data)
print(r.status_code, r.url)

# --------
BTW: If the page uses JavaScript, you have a problem, because requests can't run the JavaScript on the page.
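If that turns out to be the case here, a browser-automation tool like Selenium (used in an answer further up) can submit the form instead. A rough sketch, assuming geckodriver is installed and that the form fields really are named txtDID, txtPswd and submit1:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)

driver.get('http://phone.ipkall.com/login.asp')

# field names taken from the login_data in the question -- verify them in the page source
driver.find_element(By.NAME, 'txtDID').send_keys('111111111')
driver.find_element(By.NAME, 'txtPswd').send_keys('mypassword')
driver.find_element(By.NAME, 'submit1').click()

print(driver.current_url)   # see where the login form actually took us
driver.quit()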

How can I change a user agent string programmatically?

I would like to write a program that changes my user agent string.
How can I do this in Python?
I assume you mean a user-agent string in an HTTP request? This is just an HTTP header that gets sent along with your request.
Using Python's urllib2:
import urllib2

url = 'http://foo.com/'

# add a header to define a custom User-Agent
headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' }
req = urllib2.Request(url, headers=headers)   # no data argument, so this stays a GET request
response = urllib2.urlopen(req).read()
In urllib, it's done like this:
import urllib

class AppURLopener(urllib.FancyURLopener):
    version = "MyStrangeUserAgent"

urllib._urlopener = AppURLopener()
Then just use urllib.urlopen normally. In urllib2, pass headers=somedict to urllib2.Request(...) to set all the headers you want (including the user agent) on the new request object req, and then call urllib2.urlopen(req).
Other ways of sending HTTP requests have other ways of specifying headers, of course.
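With the requests library, for instance, the same header just goes into a dict. A minimal sketch, reusing the example User-Agent string from above:
import requests

headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
response = requests.get('http://foo.com/', headers=headers)
print(response.status_code)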
Using Python's urllib you can download webpages and override the version attribute to change the user-agent.
There is a very good example on http://wolfprojects.altervista.org/changeua.php
Here is an example copied from that page:
>>> from urllib import FancyURLopener
>>> class MyOpener(FancyURLopener):
...     version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
>>> myopener = MyOpener()
>>> myopener = MyOpener()
>>> page = myopener.open('http://www.google.com/search?q=python')
>>> page.read()
[…]Results <b>1</b> - <b>10</b> of about <b>81,800,000</b> for <b>python</b>[…]
urllib2 is nice because it's built in, but I tend to use mechanize when I have the choice. It extends a lot of urllib2's functionality (though much of it has been added to python in recent years). Anyhow, if it's what you're using, here's an example from their docs on how you'd change the user-agent string:
import mechanize
cookies = mechanize.CookieJar()
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))
opener.addheaders = [("User-agent", "Mozilla/5.0 (compatible; MyProgram/0.1)"),
("From", "responsible.person#example.com")]
Best of luck.
As mentioned in the answers above, the user-agent field in the HTTP request header can be changed using built-in modules in Python such as urllib2. At the same time, it is also important to analyze what exactly the web server sees. A recent post on user agent detection gives sample code and output describing what the web server sees when a programmatic request is sent.
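One quick way to check what the server actually receives is to point the request at an echo service such as httpbin. A small sketch in the same urllib2 style as the examples above:
import urllib2

headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
req = urllib2.Request('http://httpbin.org/user-agent', headers=headers)

# httpbin echoes back the User-Agent it saw, e.g. {"user-agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"}
print(urllib2.urlopen(req).read())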
If you want to change the user agent string you send when opening web pages, google around for a Firefox plugin. ;) For example, I found this one. Or you could write a proxy server in Python, which changes all your requests independent of the browser.
My point is, changing the string is going to be the easy part; your first question should be, where do I need to change it? If you already know that (at the browser? proxy server? on the router between you and the web servers you're hitting?), we can probably be more helpful. Or, if you're just doing this inside a script, go with any of the urllib answers. ;)
Updated for Python 3.2 (py3k):
import urllib.request
headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' }
url = 'http://www.google.com'
request = urllib.request.Request(url, headers=headers)   # again, omitting the data argument keeps it a GET
response = urllib.request.urlopen(request).read()
