Simple Python web crawler causes high memory usage - python

I am a beginner of Python. I just wrote a very simple web crawler and caused high memory usage when I was running the crawler.Not sure what's wrong in my codes, I spent quite some time but can't resolve it.
I intend to use it to capture some job info from following link: http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=070200%2C00&district=000000&funtype=0000&industrytype=00&issuedate=9&providesalary=06%2C07%2C08%2C09%2C10&keywordtype=2&curr_page=1&lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&list_type=0&dibiaoid=0&confirmdate=9
The crawler extracts the links of each job,and generates the id of each job from the links. Then it reads the job title from the link through xpath, print all the info out in the end. Even the link number is only 50, but it caused my computer nearly unresponsive every time before printing out all the info. Below is my codes.
I just added the header, this is needed to parse the link of each job. My environment is Ubuntu16.04, Python3.5,Pycharm.
import requests
from lxml import etree
import re
headers = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate",
"Accept-Language": "en-US,en;q=0.5",
"Connection": "keep-alive",
"Host": "jobs.51job.com",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
def generate_info(url):
html = requests.get(url, headers=headers)
html.encoding = 'GBK'
select = etree.HTML(html.text.encode('utf-8'))
job_id = re.sub('[^0-9]', '', url)
job_title=select.xpath('/html/body//h1/text()')
print(job_id,job_title)
sum_page='http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=070200%2C00&district=000000&funtype=0000&industrytype=00&issuedate=9&providesalary=06%2C07%2C08%2C09%2C10&keywordtype=2&curr_page=1&lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&list_type=0&dibiaoid=0&confirmdate=9'
sum_html=requests.get(sum_page)
sum_select=etree.HTML(sum_html.text.encode('utf-8'))
urls= sum_select.xpath('//*[#id="resultList"]/div/p/span/a/#href')
for url in urls:
generate_info(url)

Related

How to use referer in Python Selenium? [duplicate]

I'm writing some tests with Selenium and noticed, that Referer is missing from the headers. I wrote the following minimal example to test this with https://httpbin.org/headers:
import selenium.webdriver
options = selenium.webdriver.FirefoxOptions()
options.add_argument('--headless')
profile = selenium.webdriver.FirefoxProfile()
profile.set_preference('devtools.jsonview.enabled', False)
driver = selenium.webdriver.Firefox(firefox_options=options, firefox_profile=profile)
wait = selenium.webdriver.support.ui.WebDriverWait(driver, 10)
driver.get('http://www.python.org')
assert 'Python' in driver.title
url = 'https://httpbin.org/headers'
driver.execute_script('window.location.href = "{}";'.format(url))
wait.until(lambda driver: driver.current_url == url)
print(driver.page_source)
driver.close()
Which prints:
<html><head><link rel="alternate stylesheet" type="text/css" href="resource://content-accessible/plaintext.css" title="Wrap Long Lines"></head><body><pre>{
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.5",
"Connection": "close",
"Host": "httpbin.org",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0"
}
}
</pre></body></html>
So there is no Referer. However, if I browse to any page and manually execute
window.location.href = "https://httpbin.org/headers"
in the Firefox console, Referer does appear as expected.
As pointed out in the comments below, when using
driver.get("javascript: window.location.href = '{}'".format(url))
instead of
driver.execute_script("window.location.href = '{}';".format(url))
the request does include Referer. Also, when using Chrome instead of Firefox, both methods include Referer.
So the main question still stands: Why is Referer missing in the request when sent with Firefox as described above?
Referer as per the MDN documentation
The Referer request header contains the address of the previous web page from which a link to the currently requested page was followed. The Referer header allows servers to identify where people are visiting them from and may use that data for analytics, logging, or optimized caching, for example.
Important: Although this header has many innocent uses it can have undesirable consequences for user security and privacy.
Source: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Referer
However:
A Referer header is not sent by browsers if:
The referring resource is a local "file" or "data" URI.
An unsecured HTTP request is used and the referring page was received with a secure protocol (HTTPS).
Source: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Referer
Privacy and security concerns
There are some privacy and security risks associated with the Referer HTTP header:
The Referer header contains the address of the previous web page from which a link to the currently requested page was followed, which can be further used for analytics, logging, or optimized caching.
Source: https://developer.mozilla.org/en-US/docs/Web/Security/Referer_header:_privacy_and_security_concerns#The_referrer_problem
Addressing the security concerns
From the Referer header perspective majority of security risks can be mitigated following the steps:
Referrer-Policy: Using the Referrer-Policy header on your server to control what information is sent through the Referer header. Again, a directive of no-referrer would omit the Referer header entirely.
The referrerpolicy attribute on HTML elements that are in danger of leaking such information (such as <img> and <a>). This can for example be set to no-referrer to stop the Referer header being sent altogether.
The rel attribute set to noreferrer on HTML elements that are in danger of leaking such information (such as <img> and <a>).
The Exit Page Redirect technique: This is the only method that should work at the moment without flaw is to have an exit page that you don’t mind having inside of the referer header. Many websites implement this method, including Google and Facebook. Instead of having the referrer data show private information, it only shows the website that the user came from, if implemented correctly. Instead of the referrer data appearing as http://example.com/user/foobar the new referrer data will appear as http://example.com/exit?url=http%3A%2F%2Fexample.com. The way the method works is by having all external links on your website go to a intermediary page that then redirects to the final page. Below we have a link to the website example.com and we URL encode the full URL and add it to the url parameter of our exit page.
Sources:
https://developer.mozilla.org/en-US/docs/Web/Security/Referer_header:_privacy_and_security_concerns#How_can_we_fix_this
https://geekthis.net/post/hide-http-referer-headers/#exit-page-redirect
This usecase
I have executed your code through both through GeckoDriver/Firefox and ChromeDriver/Chrome combination:
Code Block:
driver.get('http://www.python.org')
assert 'Python' in driver.title
url = 'https://httpbin.org/headers'
driver.execute_script('window.location.href = "{}";'.format(url))
WebDriverWait(driver, 10).until(lambda driver: driver.current_url == url)
print(driver.page_source)
Observation:
Using GeckoDriver/Firefox Referer: "https://www.python.org/" header was missing as follows:
{
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.5",
"Host": "httpbin.org",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0"
}
}
Using ChromeDriver/Chrome Referer: "https://www.python.org/" header was present as follows:
{
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.9",
"Host": "httpbin.org",
"Referer": "https://www.python.org/",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36"
}
}
Conclusion:
It seems to be an issue with GeckoDriver/Firefox in handling the Referer header.
Outro
Referrer Policy

Downloading filesfrom website that requires login and file search - using python

I am trying to download satellite imagery from 'https://gportal.jaxa.jp'. However, to download the files I need to first login, then enter information in the searchbox of the webpage which includes time period, satellite selection etc. At present I am manually entering this search information into the website. Then from the disaplayed webpage, I am manually copying the html code, and parsing the weblinks to store in a txt file (for example datalinks3.txt in the code below). These links are iteratively fed to the code for downloading and saving. However I am facing the following problems.
After certain time period the logout is happening, and empty files are being downloaded.
Every time manually searching, and creating a text file of links is troublesome, as I am looking to download data for over 200+ different conditions.
Is there anyway, I can address the problems above.
for your information, I am trying to download AMSR-2 level 3 data from the given website, between November 1 to July 31 of each year.
import pandas as pd
import h5py
import numpy as np
import itertools
import sys
import math
import wget
import requests
headers = {
"Host": "gportal.jaxa.jp",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Language": "en-GB,en;q=0.5",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Referer": "https://gportal.jaxa.jp/gpr/search?tab=0",
"Cookie": "_ga_KTLVC524X4=GS1.1.1639056543.2.0.1639056543.0; _ga=GA1.2.746153558.1639032860; iPlanetDirectoryPro=AQIC5wM2LY4SfcwbESrShtDqRRwMkgqUFEyNKt2GZrsautw.%2AAAJTSQACMDEAAlNLABM2ODg2MzQzOTkzMzMxNjg1MDM0AAJTMQAA%2A; _gid=GA1.2.1074489217.1644328782",
"Upgrade-Insecure-Requests": "1"
# Sec-Fetch-Dest: document
# Sec-Fetch-Mode: navigate
# Sec-Fetch-Site: same-origin
# Sec-Fetch-User: ?1
}
import sys
import time
f = open("datalinks3.txt", "r")
lnks = f.readlines()
for i,l in enumerate(lnks) :
l = l.strip()
print("file is:",l)
time.sleep(5)
r=requests.get(l, headers=headers)
# print(r)
with open(l.split("/")[-1],"wb") as fd:
fd.write(r.content)
print(F"completed {i}")

Why can't I access Digikey's website through urllib?

I'm following the guide here:
Python3 Urllib Tutorial
Everything works fine for those first few examples:
import urllib.request
html = urllib.request.urlopen('https://arstechnica.com').read()
print(html)
and
import urllib.request
headers = {}
headers['User-Agent'] = "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:48.0) Gecko/20100101 Firefox/48.0"
req = urllib.request.Request('https://arstechnica.com', headers = headers)
html = urllib.request.urlopen(req).read()
print(html)
But if I replace "arstechnica" with "digikey", that urllib request always times out. But the website is easily accessible through a browser. What's going on?
Most websites will try to defend themselves against unwanted bots. If they detect suspicious traffic, they may decide to stop responding without properly closing the connection (leaving you hanging). Some sites are more sophisticated at detecting bots than than others.
Firefox 48.0 was released back in 2016, so it will be pretty obvious to Digikey that you are probably spoofing the header information. There are also additional headers that browsers typically send, that your script doesn't.
In Firefox, if you open the Developer Tools and go to the Network Monitor tab, you can inspect a request to see what headers it sends, then copy these to better mimic the behaviour of a typical browser.
import urllib.request
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5",
"Upgrade-Insecure-Requests": "1"
}
req = urllib.request.Request('https://www.digikey.com', headers = headers)
html = urllib.request.urlopen(req).read()
print(html)

POST request table Python3

I am trying to scrape a table of https://www.domeinquarantaine.nl/, however, for some reason, it does not give a response of the table
#The parameters
baseURL = "https://www.domeinquarantaine.nl/tabel.php"
PARAMS = {"qdate": "2019-04-21", "pagina": "2", "order": "karakter"}
DATA = {"qdate=2019-04-21&pagina=3&order="}
HEADERS = {"Host": "www.domeinquarantaine.nl",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0",
"Accept": "*/*",
"Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate, br",
"Referer": "https://www.domeinquarantaine.nl/",
"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
"X-Requested-With": "XMLHttpRequest",
"Content-Length": "41",
"Connection": "keep-alive",
"Cookie": "_ga=GA1.2.1612813080.1548179877; PHPSESSID=5694f8e2e4f0b10e53ec2b54310c02cb; _gid=GA1.2.1715527396.1555747200"}
#POST request
r = requests.post(baseURL, headers = HEADERS, data = PARAMS)
#Checking the response
r.text
The response consists of strange tokens and question marks
So my question is why it is returning this response? And how to fix it to eventually end up with the scraped table?
Open web browser, turn off JavaScript and you will see what requests can get.
But using DevTools in Chrome/Firefox (tab Network, filter XHR requests) you should see POST request to url https://www.domeinquarantaine.nl/tabel.php and it sends back HTML with table.
If you open this url in browser then you see table - so you can get it event with GET but using POST you probably can filter data.
After writing this explanation I saw you already has this url in code - you didn't mention it in description.
You have different problem - you set
"Accept-Encoding": "gzip, deflate, br"
so server sends compressed response and you should uncompress it.
Or use
"Accept-Encoding": "deflate"
and server will send uncompressed data and you will see HTML with table
So there are a couple of reasons why you're getting what you're getting:
Your headers don't look correct
The data that you are sending contains some extra variables
The website requires cookies in order to display the table
This can be easily fixed by changing the data and headers variables and adding requests.session() to your code (which will automatically collect and inject cookies)
All in all your code should look like this:
import requests
session = requests.session()
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept": "*/*", "Accept-Language": "en-US,en;q=0.5", "Accept-Encoding": "gzip, deflate", "Referer": "https://www.domeinquarantaine.nl/", "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8", "X-Requested-With": "XMLHttpRequest", "DNT": "1", "Connection": "close"}
data={"qdate": "2019-04-20"}
session.get("https://www.domeinquarantaine.nl", headers=headers)
r = session.post("https://www.domeinquarantaine.nl/tabel.php", headers=headers, data=data)
r.text
Hope this helps!

Google blocking Selenium Webdriver

I have a script where I am trying to search a google page via selenium to test something. Whenever I open up Webdriver, I get a captcha form:
fp = webdriver.FirefoxProfile()
driver = webdriver.Firefox(firefox_profile=fp)
driver.get('https://www.google.com/search?q=asdf')
However, if I open the exact same page, https://www.google.com/search?q=asdf, in a browser, it works fine. Why does Google raise the captcha, and what parameters can I send with webdriver such that it 'looks' like a normal browser and the captcha isn't raised?
Note, I have tried adding my user agent, and it still raises the same error:
fp = webdriver.FirefoxProfile()
fp.set_preference("general.useragent.override","Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:32.0) Gecko/20100101 Firefox/32.0")
driver = webdriver.Firefox(firefox_profile=fp)
Here is an example of my Request headers from the normal browser:
you need to set the user agent.
See this SO ANSWER
on using set_preference.
Pass all the headers using requests:
headers = {
"Host": "www.google.com",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:32.0) Gecko/20100101 Firefox/32.0",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate",
"Cookie": "PREF=ID=0df7e6fbda0c09d3:U=bfc47b624b57a0e9:FF=0:TM=1414961297:LM=1414961298:S=2FtJad1BEeJ0M5XS; NID=67=t5zTrFVtG4cLZH2kVmsQEbqDRFJisM86z1s27zx0A6vTR0MWqg69DaY39muso6fIEgqnli7IaEv1Rge1ZxBG0Nr1_3KH1aLu_z1-Ar48oiVDFFSVX4KDRgWnHQWjUfHC",
"Connection": "keep-alive",
"Cache-Control": "max-age=0",
}

Categories

Resources