Python urllib.urlretrieve and user agent

I am retrieving an XML file from a network device. It returns the file in a different format, without HTML tags, if I do not specify a user agent.
import urllib
urllib.urlretrieve (url, file_save_name)
How do I specify a user agent when retrieving?

urlretrieve() itself takes no header argument (its signature is urlretrieve(url, filename, reporthook, data)), but urllib sends the User-Agent of its module-level opener, so you can subclass FancyURLopener and install your own:
import urllib

# User-Agent strings for several browsers and OSes.
user_agents = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
    'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
]

# The opener's version attribute becomes the User-Agent header.
class UserAgentOpener(urllib.FancyURLopener):
    version = user_agents[0]

urllib._urlopener = UserAgentOpener()
urllib.urlretrieve(url, file_save_name)
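On Python 3 (not what the question uses; urlretrieve lives in urllib.request there), a minimal sketch of the same idea is to install a global opener that carries the header:
import urllib.request

# The installed opener's addheaders are sent with every urlretrieve() call.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', user_agents[0])]
urllib.request.install_opener(opener)

urllib.request.urlretrieve(url, file_save_name)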

Related

Instagram story scraper: What would the process be?

I'm trying to write a web-scraping Python program that gets stories from users via your own login. I thought it would be fun to see if I could get it working, since 4K Stogram costs money just for more functionality.
I logged in successfully, but I don't know where to go from here.
from bs4 import BeautifulSoup
import json, random, re, requests, urllib.request
USERNAME = '*****'
PASSWD = '****'
account_purging = '****'
BASE_URL = 'https://www.instagram.com/accounts/login/'
LOGIN_URL = BASE_URL + 'ajax/'
headers_list = [
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; FSL 7.0.6.01001)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; FSL 7.0.7.01001)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; FSL 7.0.5.01003)",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0",
"Mozilla/5.0 (X11; U; Linux x86_64; de; rv:1.9.2.8) Gecko/20100723 Ubuntu/10.04 (lucid) Firefox/3.6.8",
"Mozilla/5.0 (Windows NT 5.1; rv:13.0) Gecko/20100101 Firefox/13.0.1",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:11.0) Gecko/20100101 Firefox/11.0",
"Mozilla/5.0 (X11; U; Linux x86_64; de; rv:1.9.2.8) Gecko/20100723 Ubuntu/10.04 (lucid) Firefox/3.6.8",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.0.3705)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)",
"Opera/9.80 (Windows NT 5.1; U; en) Presto/2.10.289 Version/12.01",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)",
"Mozilla/5.0 (Windows NT 5.1; rv:5.0.1) Gecko/20100101 Firefox/5.0.1",
"Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.02",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1",
"Mozilla/4.0 (compatible; MSIE 6.0; MSIE 5.5; Windows NT 5.0) Opera 7.02 Bork-edition [en]"
]
USER_AGENT = random.choice(headers_list)
session = requests.Session()
session.headers = {'user-agent': USER_AGENT}
session.headers.update({'Referer': BASE_URL})
req = session.get(BASE_URL)
soup = BeautifulSoup(req.content, 'html.parser')
body = soup.find('body')
pattern = re.compile('window._sharedData')
script = body.find("script", text=pattern)
script = script.get_text().replace('window._sharedData = ', '')[:-1]
data = json.loads(script)
csrf = data['config'].get('csrf_token')
login_data = {'username': USERNAME, 'password': PASSWD}
session.headers.update({'X-CSRFToken': csrf})
login = session.post(LOGIN_URL, data=login_data, allow_redirects=True)
# Build the stories URL for the account being scraped.
story_page = "https://www.instagram.com/stories/" + account_purging
request_headers_story = {
"Accept:" : "video/webm,video/ogg,video/*;q…q=0.7,audio/*;q=0.6,*/*;q=0.5",
"Accept-Language" : "en-US,en;q=0.5",
"Connection" : "keep-alive",
"DNT" : "1",
"Host" : "scontent-ort2-1.cdninstagram.com",
"Range" : "bytes=0-",
"Referer" : story_page,
"TE" : "Trailers",
"User-Agent" : USER_AGENT
}
resp = session.get(story_page, headers=request_headers_story, allow_redirects=True)
print(BeautifulSoup(resp.content, 'html.parser'))
I'm trying to get the mp4 and jpg links so I can download them later into an array or something. If there's anything you could point me towards, I'd appreciate it.
I'm also trying to avoid using the API, because that just makes it boring.
The easier solution to this problem, which avoids using an API, is to use Selenium. With Selenium you can log in much faster and more efficiently, as well as grab the images and videos you need.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
#or driver = webdriver.Chrome()
Note: To grab the image, you need to find the id or name of the image and do something like:
driver.find_element_by_id("image_id")
or
driver.find_element_by_name("image_name")
If you need more information or clarification, check https://selenium-python.readthedocs.io/.
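For example, here is a rough sketch of the whole flow. The field names, waits, and story URL are assumptions, so inspect the login page to confirm the selectors before relying on them:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Firefox()
driver.get('https://www.instagram.com/accounts/login/')
time.sleep(3)  # crude wait for the login form to render

# 'username' and 'password' are assumed field names; verify them in the page source.
driver.find_element_by_name('username').send_keys(USERNAME)
driver.find_element_by_name('password').send_keys(PASSWD + Keys.RETURN)
time.sleep(5)

driver.get('https://www.instagram.com/stories/' + account_purging + '/')
time.sleep(3)

# Collect candidate media URLs from <img> and <video> tags.
elements = (driver.find_elements_by_tag_name('img') +
            driver.find_elements_by_tag_name('video'))
media_urls = [el.get_attribute('src') for el in elements]
print(media_urls)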
Let me know if that helped you!

Avoid being detected during scraping

I am trying to scrape the website lacentrale.fr with Scrapy, but even though I rotate my user agents and IP addresses (via Tor), the website detects my bot and sends me false values.
Please can you check my code used in the middlewares and settings and tell me if something is wrong?
Code in middlewares:
from tutorial.settings import *  # USER_AGENT_LIST
import random
from stem.control import Controller
from toripchanger import TorIpChanger
from stem import Signal

class RandomUserAgentMiddleware(object):
    # Attach a randomly chosen User-Agent to every outgoing request.
    def process_request(self, request, spider):
        ua = random.choice(USER_AGENT_LIST)
        if ua:
            request.headers.setdefault('User-Agent', ua)

def _set_new_ip():
    # Ask the local Tor controller for a fresh circuit (and exit IP).
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password='')
        controller.signal(Signal.NEWNYM)

ip_changer = TorIpChanger(reuse_threshold=10)

class ProxyMiddleware(object):
    # Route requests through the local proxy, rotating the Tor IP every 10 requests.
    _requests_count = 0

    def process_request(self, request, spider):
        self._requests_count += 1
        if self._requests_count > 10:
            self._requests_count = 0
            ip_changer.get_new_ip()
            print("New Tor connection processed")
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])
Code used in settings:
BOT_NAME = 'tutorial'
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1
RANDOMIZE_DOWNLOAD_DELAY = True
DOWNLOADER_MIDDLEWARES = {
'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
'tutorial.middlewares.RandomUserAgentMiddleware': 400,
'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware':None, # to avoid the raise IOError, 'Not a gzipped file' exceptions.IOError: Not a gzipped file
'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
'tutorial.middlewares.ProxyMiddleware': 100
}
# Plain strings, so random.choice() can feed them straight into the User-Agent header.
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML like Gecko) Chrome/28.0.1469.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML like Gecko) Chrome/28.0.1469.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:35.0) Gecko/20100101 Firefox/35.0',
    'Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0',
]
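For what it's worth, on Scrapy 1.x the scrapy.contrib paths above are deprecated; assuming the renamed modules, the equivalent DOWNLOADER_MIDDLEWARES would be:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'tutorial.middlewares.RandomUserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': None,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'tutorial.middlewares.ProxyMiddleware': 100,
}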
EDIT II
It seems that Tor uses the same IP each time; there is no rotation of the IP address. I don't know what to change in my middlewares file to resolve this. Any ideas?
You may be detected by several factors, including whether your scraper downloads and runs the site's JavaScript files. If that's the case, you may need to use a tool like Selenium in conjunction with Python/Scrapy to further mimic a normal human user.
This Stack Overflow post offers some help in getting started:
https://stackoverflow.com/a/17979285/9693088
I don't think I can offer much guidance on what may be going wrong with your Tor setup.
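As a quick sanity check for the EDIT II problem (a sketch, not from the original post): fetch an IP-echo service through the local proxy before and after requesting a new identity, assuming Privoxy or similar listens on 127.0.0.1:8118 as in the middleware above. Note that Tor rate-limits NEWNYM, typically to one request every ten seconds.
import requests
from stem import Signal
from stem.control import Controller

PROXIES = {'http': 'http://127.0.0.1:8118',
           'https': 'http://127.0.0.1:8118'}

def current_ip():
    # api.ipify.org simply echoes the caller's public IP.
    return requests.get('https://api.ipify.org', proxies=PROXIES).text

print('Before NEWNYM:', current_ip())
with Controller.from_port(port=9051) as controller:
    controller.authenticate(password='')
    controller.signal(Signal.NEWNYM)
print('After NEWNYM:', current_ip())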

Provide network data from Firebug to Python

Is there a way to copy the network data from Firebug (for example, POST headers) and put it into Python code, so I don't need to write each header myself?
There is a Copy Request Headers option, but its output is not in the right format for Python.
So the thing I want is not to obtain this:
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:37.0) Gecko/20100101 Firefox/37.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
because I have to change the format to dictionary or something else, but this:
"User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:37.0) Gecko/20100101 Firefox/37.0"
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
It is not necessary to get it in Python's dictionary format. The only thing I want is to automatically use this data in Python.
Post-process the headers you've copied from Firefox: split each line of the input string on the first ": " and build a dictionary. Example:
In [1]: headers = """
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:37.0) Gecko/20100101 Firefox/37.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
"""
In [2]: dict(item.split(": ", 1) for item in headers.splitlines() if item)
Out[2]:
{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:37.0) Gecko/20100101 Firefox/37.0'}
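Wrapped up as a small helper (the function name is just an illustration), ready to hand to requests:
def headers_to_dict(raw):
    # Turn copied 'Name: value' lines into a dict usable as request headers.
    return dict(line.split(': ', 1) for line in raw.splitlines() if line.strip())

# Usage with requests:
# requests.get(url, headers=headers_to_dict(headers))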

Google search returns None 302 on AppEngine

I am querying Google Search and it works fine locally, returning the expected results. When the same code is deployed on App Engine, the request comes back as a 302 (logged as "302 None").
The following program returns the links returned in Google Search results.
# The first two imports will be slightly different when deployed on App Engine
from pyquery import PyQuery as pq
import requests
import random
try:
    from urllib.parse import quote as url_quote
except ImportError:
    from urllib import quote as url_quote

USER_AGENTS = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0',
               'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100 101 Firefox/22.0',
               'Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0',
               'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5',
               'Mozilla/5.0 (Windows; Windows NT 6.1) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5',)

SEARCH_URL = 'https://www.google.com/search?q=site:foobar.com%20{0}'

def get_result(url):
    return requests.get(url, headers={'User-Agent': random.choice(USER_AGENTS)}).text

def get_links(query):
    result = get_result(SEARCH_URL.format(url_quote(query)))
    html = pq(result)
    return [a.attrib['href'] for a in html('.l')] or \
           [a.attrib['href'] for a in html('.r')('a')]

print get_links('foo bar')
Code deployed on AppEngine:
import sys
sys.path[0:0] = ['distlibs']
import lxml
import webapp2
import json
from requests import api
from pyquery.pyquery import PyQuery as pq
import random
try:
    from urllib.parse import quote as url_quote
except ImportError:
    from urllib import quote as url_quote

USER_AGENTS = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0',
               'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100 101 Firefox/22.0',
               'Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0',
               'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5',
               'Mozilla/5.0 (Windows; Windows NT 6.1) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.46 Safari/536.5',)

SEARCH_URL = 'https://www.google.com/search?q=site:foobar.com%20{0}'

def get_result(url):
    return api.get(url, headers={'User-Agent': random.choice(USER_AGENTS)}).text

def get_links(query):
    result = get_result(SEARCH_URL.format(url_quote(query)))
    html = pq(result)
    return [a.attrib['href'] for a in html('.l')] or \
           [a.attrib['href'] for a in html('.r')('a')]
form="""
<form action="/process">
<input name="q">
<input type="submit">
</form>
"""
class MainHandler(webapp2.RequestHandler):
    def get(self):
        self.response.out.write("<h3>Write something.</h3><br>")
        self.response.out.write(form)

class ProcessHandler(webapp2.RequestHandler):
    def get(self):
        query = self.request.get("q")
        self.response.out.write("Your query : " + query)
        results = get_links(query)
        self.response.out.write(results[0])

app = webapp2.WSGIApplication([('/', MainHandler),
                               ('/process', ProcessHandler)],
                              debug=True)
I have tried querying with both the http and https protocols. The following is the AppEngine log for a request.
Starting new HTTP connection (1): www.google.com
D 2013-12-21 13:13:37.217
"GET /search?q=site:foobar.com%20foo%20bar HTTP/1.1" 302 None
I 2013-12-21 13:13:37.218
Starting new HTTP connection (1): ipv4.google.com
D 2013-12-21 13:13:37.508
"GET /sorry/IndexRedirect?continue=http://www.google.com/search%3Fq%3Dsite:foobar.com%20foo%20bar HTTP/1.1" 403 None
E 2013-12-21 20:51:32.090
list index out of range
I'm puzzled as to why you're trying to spoof the User-Agent header, but if it makes you happy, go for it. Just note that if requests.get is using urlfetch under the covers, App Engine appends a string to the User-Agent header your app supplies, identifying your app. (See https://developers.google.com/appengine/docs/python/urlfetch/#Python_Request_headers.)
Try passing follow_redirects = False to urlfetch. That's how you make requests to other App Engine Apps. For completely non-obvious reasons, it might help you in this case.
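A minimal sketch of that suggestion, calling urlfetch directly rather than going through requests (SEARCH_URL, url_quote, and USER_AGENTS as defined above; follow_redirects is a documented urlfetch.fetch parameter):
from google.appengine.api import urlfetch

result = urlfetch.fetch(
    url=SEARCH_URL.format(url_quote('foo bar')),
    headers={'User-Agent': random.choice(USER_AGENTS)},
    follow_redirects=False)
# Inspect the raw status instead of silently following the 302.
print result.status_code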

Decomposing a Series in Python

I have a series in Python that looks as follows:
I want to split the series into users with Windows operating systems and those not using Windows. Is there a way to do this in Python 2.7.3? Thank you in advance.
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11
GoogleMaps/RochesterNY
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; InfoPath.3)
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.52.7 (KHTML, like Gecko) Version/5.1.2 Safari/534.52.7
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.79 Safari/535.11
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.79 Safari/535.11
Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.79 Safari/535.11
Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Opera/9.80 (X11; Linux zbov; U; en) Presto/2.10.254 Version/12.00
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.79 Safari/535.11
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20100101 Firefox/10.0.2
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.27) Gecko/20120216 Firefox/3.6.27
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20100101 Firefox/10.0.2
User agent strings are not really standardized; browser vendors just follow each other's example, so the safest bet for recognizing Windows clients is simply to check whether the string "Windows" appears in them. You can do that with the in operator in Python.
For example to count the number of Windows clients, you could do something like this (where lines is a list of all those user agent strings):
numWindows = 0
for line in lines:
    if 'Windows' in line:
        numWindows += 1

print('{0} of {1} users are using Windows.'.format(numWindows, len(lines)))
How about this? It is similar to poke's answer, but since Tarek mentioned separating, here the lines are split into two lists:
windows = []
others = []

for line in lines:
    if 'Windows' in line:
        windows.append(line)
    else:
        others.append(line)

print('Windows: {} Others: {}'.format(len(windows), len(others)))
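If the data really is a pandas Series, as the title suggests (an assumption; the question only shows the raw strings), str.contains produces the same split with a boolean mask:
import pandas as pd

s = pd.Series(lines)  # the user agent strings from above
mask = s.str.contains('Windows')
windows, others = s[mask], s[~mask]
print('Windows: {} Others: {}'.format(len(windows), len(others)))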
