python user-agent xpath amazon - python

I am trying to scan the HTML of two pages with two requests.
The first one works, but with the second one the HTML I am trying to locate is not there; I get the wrong page back.
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
page = requests.get('https://www.amazon.com/gp/aw/ol/B00DZKQSRQ/ref=mw_dp_olp?ie=UTF8&condition=new', headers=headers)
newHeader = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36'}
pagePrice = requests.get('https://www.amazon.com/gp/aw/ol/B01EQJU8AW/ref=mw_dp_olp?ie=UTF8&condition=new', headers=newHeader)
The first request works fine and returns the HTML I expect.
The second request returns the wrong HTML.
I tried this package, but without success:
https://pypi.python.org/pypi/fake-useragent
I also saw this topic, which was not answered:
Double user-agent tag, "user-agent: user-agent: Mozilla/"
Thank you very much!
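For reference, the usual fake-useragent pattern is to pull a random User-Agent string for each request; a minimal sketch, assuming the package is installed and using the two offer-listing URLs from the question:

import requests
from fake_useragent import UserAgent

ua = UserAgent()
urls = [
    'https://www.amazon.com/gp/aw/ol/B00DZKQSRQ/ref=mw_dp_olp?ie=UTF8&condition=new',
    'https://www.amazon.com/gp/aw/ol/B01EQJU8AW/ref=mw_dp_olp?ie=UTF8&condition=new',
]
for url in urls:
    # send each request with a freshly randomized User-Agent string
    headers = {'User-Agent': ua.random}
    page = requests.get(url, headers=headers)
    print(page.status_code, len(page.text))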

Related

Why is BeautifulSoup returning the same information over and over again

When I try to scrape a website over multiple pages, BeautifulSoup returns the first page's content for the whole page range. It keeps getting repeated again and again.
import pandas as pd
import requests
from bs4 import BeautifulSoup

data = pd.DataFrame()
for i in range(1, 10):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    url = "https://www.collegesearch.in/engineering-colleges-india".format(i)
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html5lib')
    # clg url and name
    clg = soup.find_all('h2', class_='media-heading mg-0')
    # other details
    details = soup.find_all('dl', class_='dl-horizontal mg-0')
    _dict = {'clg': clg, 'details': details}
    df = pd.DataFrame(_dict)
    data = data.append(df, ignore_index=True)
This is not an issue with BeautifulSoup. Check your loop: you never change the page, because the url is always the same:
https://www.collegesearch.in/engineering-colleges-india
So change your code and pass your counter as the value of the page parameter:
for i in range(1, 10):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    url = f"https://www.collegesearch.in/engineering-colleges-india?page={i}"
    print(url)
You may also want to take a short read: https://docs.python.org/3/tutorial/inputoutput.html
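Putting that together with the rest of the original loop, the corrected scraper might look like this; a sketch that assumes the site really does paginate via a page query parameter as shown above (it also uses pd.concat, since DataFrame.append was removed in recent pandas):

import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
data = pd.DataFrame()
for i in range(1, 10):
    # the page number now changes on every iteration
    url = f"https://www.collegesearch.in/engineering-colleges-india?page={i}"
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html5lib')
    clg = soup.find_all('h2', class_='media-heading mg-0')
    details = soup.find_all('dl', class_='dl-horizontal mg-0')
    df = pd.DataFrame({'clg': clg, 'details': details})
    data = pd.concat([data, df], ignore_index=True)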

Python request yields status code 500 even though the website is available

I'm trying to use Python to check whether or not a list of websites is online. However, on several sites, requests yields the wrong status code. For example, the status code I get for https://signaturehound.com/ is 500 even though the website is online and in the Chrome developer tools the response code 200 is shown.
My code looks as follows:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

def url_ok(url):
    r = requests.head(url, timeout=5, allow_redirects=True, headers=headers)
    status_code = r.status_code
    return status_code

print(url_ok("https://signaturehound.com/"))
As suggested by @CaptainDaVinci in the comments, the solution is to replace head with get in the code; some servers do not handle HEAD requests properly and return an error status even though a normal GET succeeds:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

def url_ok(url):
    r = requests.get(url, timeout=5, allow_redirects=True, headers=headers)
    status_code = r.status_code
    return status_code

print(url_ok("https://signaturehound.com/"))
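If you still want the cheaper HEAD request where it works, one option (just a sketch) is to fall back to GET whenever HEAD reports an error:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

def url_ok(url):
    # try the lightweight HEAD request first
    r = requests.head(url, timeout=5, allow_redirects=True, headers=headers)
    if r.status_code >= 400:
        # some servers answer HEAD incorrectly, so retry with a full GET
        r = requests.get(url, timeout=5, allow_redirects=True, headers=headers)
    return r.status_code

print(url_ok("https://signaturehound.com/"))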

Read HTTPS URLs in R, like LinkedIn

I am trying to read a LinkedIn company page, for example https://www.linkedin.com/company/facebook,
to get the company name, location, type of industry, etc.
This is my code below:
library(RCurl)
library(rvest)

urlCreate1 <- "https://www.linkedin.com/company/facebook"
parse_rvest <- getURL(urlCreate1, 'useragent' = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36")
nameRest <- content %>% html_nodes(".industry") %>% html_text()
nameRest
and the output I get for this is character(0), which from previous posts I understand means it is not finding the .industry tag because I am reading the HTTPS source.
I have also tried this
parse_rvest <- content(GET(urlCreate1), encoding = 'UTF-8')
but it doesn't help
I have Python code that works, but I need this to be done in R.
This is part of the Python code I found online:
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
response = requests.get(url, headers=headers)
formatted_response = response.content.replace('<!--', '').replace('-->', '')
doc = html.fromstring(formatted_response)
datafrom_xpath = doc.xpath('//code[@id="stream-promo-top-bar-embed-id-content"]//text()')
if datafrom_xpath:
    try:
        json_formatted_data = json.loads(datafrom_xpath[0])
        company_name = json_formatted_data['companyName'] if 'companyName' in json_formatted_data.keys() else None
        size = json_formatted_data['size'] if 'size' in json_formatted_data.keys() else None
Please help me read the page. I am using SelectorGadget to get the selector (.industry).
Have a look at the LinkedIn API via the Rlinkedin package:
https://cran.r-project.org/web/packages/Rlinkedin/Rlinkedin.pdf
Then you should be able to do what you want easily, and legally.
Here are some ideas to get you started.
http://thinktostart.com/analyze-linkedin-with-r/
https://github.com/hadley/httr/issues/200
https://www.reddit.com/r/datascience/comments/3rufk5/pulling_data_from_linkedin_api/

How can I recreate a urllib.request call in Python 2.7?

I'm crawling some web pages and parsing some data on them, but one of the sites seems to be blocking my requests. The version of the code using Python 3 with urllib.request works fine. My problem is that I need to use Python 2.7, and I can't get a response using urllib2.
Shouldn't these requests be identical?
Python 3 version:
import urllib.request

def fetch_title(url):
    req = urllib.request.Request(
        url,
        data=None,
        headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
        }
    )
    html = urllib.request.urlopen(req).read().encode('unicode-escape').decode('ascii')
    return html
Python 2.7 version:
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [(
    'User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
)]
response = opener.open('http://website.com')
print response.read()
The following code should work. Essentially, with Python 2.7 you can create a dictionary with your desired headers and build the request with urllib2.Request so that it works properly with urllib2.urlopen.
import urllib2

def fetch_title(url):
    my_headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36"
    }
    return urllib2.urlopen(urllib2.Request(url, headers=my_headers)).read()
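A quick usage sketch on top of that function, catching urllib2.HTTPError so you can see which status code the site actually returns if it still blocks the request (the URL is the same placeholder as above):

try:
    html = fetch_title('http://website.com')
    print html[:200]
except urllib2.HTTPError as e:
    # the server answered, but with an error status such as 403
    print 'Blocked with status', e.code
except urllib2.URLError as e:
    # the request never received a response at all
    print 'Request failed:', e.reason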

Image url does not return an image. Using Python requests

I use Python requests to get images, but in some cases it doesn't work, and it seems to be happening more often. An example is
http://recipes.thetasteofaussie.netdna-cdn.com/wp-content/uploads/2015/07/Leek-and-Sweet-Potato-Gratin.jpg
It loads fine in my browser, but with requests it returns HTML that says "403 forbidden" and "nginx/1.7.11".
import requests
image_url = "<the_url>"
headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36', 'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8','Accept-Encoding':'gzip,deflate,sdch'}
r = requests.get(image_url, headers=headers)
# r.content is html '403 forbidden', not an image
I have also tried with this header, which has been necessary in some cases. Same result.
headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36', 'Accept':'image/webp,*/*;q=0.8','Accept-Encoding':'gzip,deflate,sdch'}
(I had a similar question a few weeks ago, but this was answered by the particular image file types not being supported by PIL. This is different.)
EDIT: Based on comments:
It seems the link only works if you have already visited the original site http://aussietaste.recipes/vegetables/leek-vegetables/leek-and-sweet-potato-gratin/ with the image. I suppose the browser then uses the cached version. Any workaround?
The site is validating the Referer header. This prevents other sites from including the image in their web pages and using the image host's bandwidth. Set it to the site you mentioned in your post, and it will work.
More info:
https://en.wikipedia.org/wiki/HTTP_referer
import requests

image_url = "http://recipes.thetasteofaussie.netdna-cdn.com/wp-content/uploads/2015/07/Leek-and-Sweet-Potato-Gratin.jpg"
headers = {
    'User-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Referer': 'http://aussietaste.recipes/vegetables/leek-vegetables/leek-and-sweet-potato-gratin/'
}
r = requests.get(image_url, headers=headers)
print r
For me, this prints
<Response [200]>
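If you then want the file itself, the image bytes are in r.content and can be written straight to disk (the filename below is arbitrary):

# save the downloaded image bytes to a local file
with open('Leek-and-Sweet-Potato-Gratin.jpg', 'wb') as f:
    f.write(r.content)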
