I am trying to do a web scraping with yahoo finance.
"https://finance.yahoo.com/quote/AUDUSD%3DX/history?p=AUDUSD%3DX"
I finish the code and it returns the response code 404.
I notice that I need to add the user-agent header before I can scrape the website e.g.
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
}
But I was just wondering how can I get the above header information via python. Any code I should enter to get the user-agent header? Thank you.
Why dont you check this package, maybe you might find it easier and less confusing:
Download market data from Yahoo! Finance's API Python
Related
I'm new to web scraping and i am trying to use basic skills on Amazon. I want to make a code for finding top 10 'Today's Greatest Deals' with prices and rating and other information.
Every time I try to find a specific tag using find() and specifying class it keeps saying 'None'. However the actual HTML has that tag.
On manual scanning i found out half the code of isn't being displayed in the output terminal. The code displayed is half but then the body and html tag do close. Just a huge chunk of code in body tag is missing.
The last line of code displayed is:
<!--[endif]---->
then body tag closes.
Here is the code that i'm trying:
from bs4 import BeautifulSoup as bs
import requests
source = requests.get('https://www.amazon.in/gp/goldbox?ref_=nav_topnav_deals')
soup = bs(source.text, 'html.parser')
print(soup.prettify())
#On printing this it misses some portion of html
article = soup.find('div', class_ = 'a-row dealContainer dealTile')
print(article)
#On printing this it shows 'None'
Ideally, this should give me the code within the div tag, so that i can continue further to get the name of the product. However the output just shows 'None'. And on printing the whole code without tags it is missing a huge chunk of html inside.
And of course the information needed is in the missing html code.
Is Amazon blocking my request? Please help.
The User-Agent request header contains a characteristic string that allows the network protocol peers to identify the application type, operating system, software vendor or software version of the requesting software user agent. Validating User-Agent header on server side is a common operation so be sure to use valid browser’s User-Agent string to avoid getting blocked.
(Source: http://go-colly.org/articles/scraping_related_http_headers/)
The only thing you need to do is to set a legitimate user-agent. Therefore add headers to emulate a browser. :
# This is a standard user-agent of Chrome browser running on Windows 10
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }
Example:
from bs4 import BeautifulSoup
import requests
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get('https://www.amazon.com', headers=headers).text
soup = BeautifulSoup(resp, 'html.parser')
...
<your code here>
Additionally, you can add another set of headers to pretend like a legitimate browser. Add some more headers like this:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language' : 'en-US,en;q=0.5',
'Accept-Encoding' : 'gzip',
'DNT' : '1', # Do Not Track Request Header
'Connection' : 'close'
}
For some reason I cannot get my head around using the requests.Session() function to save login credentials. My current code isn't working and I cannot figure out why, can someone help correct my code and explain the changes that were needed? I don't want to keep asking for help on a website by website basis.
headers = {
'Content-type': 'text/html',
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36',
'origin':'https://stockinvest.us',
'referer':'https://stockinvest.us'
}
data = {'email':'user-email', 'password: 'user-password'}
with requests.Session() as s:
login = s.post('https://stockinvest.us/login', data=data, headers=headers)
s.cookies
r = s.get('https://stockinvest.us/profile/edit')
even though i'm passing headers like below, i'm getting 416 error: http is not handled or not allowed.
headers = {'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding':'gzip, deflate, br,sdch',
'Accept-Language':'en-US,en;q=0.8',
'AlexaToolbar-ALX_NS_PH':'AlexaToolbar/alx-4.0.1',
'Cache-Control':'max-age=0',
'Connection':'keep-alive',
'Host':'www.links.com',
'Referer':'https://www.links.com/',
'Upgrade-Insecure-Requests':'1',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
Try limiting the headers to below and see if it helps
headers = {
'Accept-Language':'en-US,en;q=0.8',
'Cache-Control':'max-age=0',
'Host':'www.mdlinx.com',
'Referer':'https://www.mdlinx.com/',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
Basically your site is throwing 416 and of course scrapy won't handle that. So you need to workout which headers are causing the issue. Best thing is to use chrome dev tools, copy the request as curl and see if that works. Also this may be related to no cookies being there.
You need to figure what works and what doesn't and then work this out
Edit-1
Im using the requests module in python to try and make a search on the following webiste http://musicpleer.audio/, however this website appears to be blocking me as it issues nothing but a 403 when i attempt to access it, im wondering how i can get around this, ive tried sending it the user agent of my web browser(chrome) and it still returns error 403. any suggestions on how i could get around this an example of downloading a song from the site would be very helpful. Thanks in advance
My code:
import requests, os
def funGetList:
start_path = 'C:/Users/Jordan/Music/' # current directory
list = []
for path,dirs,files in os.walk(start_path):
for filename in files:
temp = (os.path.join(path,filename))
tempLen = len(temp)
"print(tempLen)"
iterate = 0
list.append(temp[22:(len(temp))-4])
def funDownloadMP3:
for i in list:
print(i)
payload = {'searchQuery': 'meme', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
url = 'http://musicpleer.audio/'
print(requests.post(url, data=payload))
Putting the User-Agent in the headers seems to work:
In []:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
url = 'http://musicpleer.audio/'
r = requests.get('{}#!{}'.format(url, 'meme'), headers=headers)
r.status_code
Out[]:
200
Note: It looks like the search url is simple '#!<search-term>'
HTML 403 Forbidden error code.
The server might be expecting some more request headers like Host or Cookies etc.
You might want to use Postman to debug it with ease
I am trying to get some data from a page. I open Chrome's development tools and successfully find the data I wanted. It's in XHR with GET method (sorry I don't know how to descript it).Then I copy the params, headers, and put all these to requests.get() method. The response I get is totally different to what I saw on the development tools.
Here is my code
import requests
queryList={
"category":"summary",
"subcategory":"all",
"statsAccumulationType":"0",
"isCurrent":"true",
"playerId":None,
"teamIds":"825",
"matchId":"1103063",
"stageId":None,
"tournamentOptions":None,
"sortBy":None,
"sortAscending":None,
"age":None,
"ageComparisonType":None,
"appearances":None,
"appearancesComparisonType":None,
"field":None,
"nationality":None,
"positionOptions":None,
"timeOfTheGameEnd":None,
"timeOfTheGameStart":None,
"isMinApp":None,
"page":None,
"includeZeroValues":None,
"numberOfPlayersToPick":None,
}
header={
'modei-last-mode':'JL7BrhwmeqKfQpbWy6CpG/eDlC0gPRS2BCvKvImVEts=',
'Referer':'https://www.whoscored.com/Matches/1103063/LiveStatistics/Spain-La-Liga-2016-2017-Leganes-Real-Madrid',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
"x-requested-with":"XMLHttpRequest",
}
url='https://www.whoscored.com/StatisticsFeed/1/GetMatchCentrePlayerStatistics'
test=requests.get(url=url,params=queryList,headers=header)
print(test.text)
I follow this post below but it's already 2 years ago and I believe the structure is changed.
XHR request URL says does not exist when attempting to parse it's content