I wrote a script that retrieves stock data from Google Finance and prints it out, nice and simple. It always worked, but since this morning, instead of the stock data, I only get a page telling me that I'm probably an automated script. Of course, being a script, it can't pass the CAPTCHA. What can I do?
Well, you've finally reached a challenging realm: decoding CAPTCHAs.
There are OCR approaches that can decode simple CAPTCHAs, but they don't seem to work on Google's.
I've heard there are companies that provide manual CAPTCHA-decoding services; you could try one of those. ^_^ LOL
OK, to be serious: if Google doesn't want you to do it that way, it isn't easy to decode those CAPTCHAs. After all, why scrape Google for finance data? There are plenty of other providers; try scraping those sites instead.
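For example, here's a minimal sketch of pulling quote data from an alternative source; it assumes the third-party yfinance package (not mentioned above), which wraps Yahoo Finance's data API:
import yfinance as yf  # assumption: pip install yfinance

# Fetch recent daily prices for a ticker and print the latest close
ticker = yf.Ticker("GOOG")
history = ticker.history(period="5d")
print(history["Close"].iloc[-1])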
You can try to get around the blocking by adding headers that specify a user-agent. This helps Google treat the request as coming from a browser rather than a bot, so it is less likely to block it:
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
An additional step could be to rotate user-agents:
import requests, random
user_agent_list = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
]

for _ in user_agent_list:
    # Pick a random user agent
    user_agent = random.choice(user_agent_list)
    # Set the headers
    headers = {'User-Agent': user_agent}
    requests.get('URL', headers=headers)
In addition to rotating user-agents, you can rotate proxies (ideally residential), which can be combined with a CAPTCHA solver to get past CAPTCHAs.
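As a rough illustration, here's a minimal sketch of proxy rotation with requests; the proxy addresses and the target URL are placeholders (hypothetical), and the CAPTCHA-solver step is left out:
import random
import requests

# Hypothetical proxy pool; replace with your own (ideally residential) proxies
proxy_pool = [
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
]

proxy = random.choice(proxy_pool)
response = requests.get(
    'URL',
    headers={'User-Agent': 'Mozilla/5.0'},  # ideally a rotated user-agent, as above
    proxies={'http': proxy, 'https': proxy},  # route the request through the chosen proxy
    timeout=10,
)
print(response.status_code)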
To parse dynamic websites with browser automation, you can use curl-impersonate or selenium-stealth, which can get past most CAPTCHAs; however, browser automation is CPU- and RAM-expensive and can be difficult to run in parallel.
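If you go the browser-automation route, a minimal selenium-stealth sketch might look like the following; the target URL and the Chrome options are assumptions for illustration:
from selenium import webdriver
from selenium_stealth import stealth  # assumption: pip install selenium-stealth

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')  # run without a visible browser window
driver = webdriver.Chrome(options=options)

# Patch common fingerprinting signals so the browser looks less like automation
stealth(driver,
        languages=['en-US', 'en'],
        vendor='Google Inc.',
        platform='Win32',
        webgl_vendor='Intel Inc.',
        renderer='Intel Iris OpenGL Engine',
        fix_hairline=True)

driver.get('https://www.google.com/finance/')
print(driver.page_source[:500])  # first part of the rendered HTML
driver.quit()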
There's a "reducing the chance of being blocked while web scraping" blog post if you need a bit more info.
As an alternative, you can use the Google Finance Markets API from SerpApi. It's a paid API with a free plan.
The difference is that it bypasses blocks (including CAPTCHAs) from Google, so there's no need to create and maintain a parser.
Example of SerpApi code for extracting "Most active" stocks on the main page of Google Finance (runnable in the online IDE):
from serpapi import GoogleSearch
import json

params = {
    "engine": "google_finance_markets",  # serpapi parser engine. Or the engine to parse a ticker page: https://serpapi.com/google-finance-api
    "trend": "most-active",              # parameter used for retrieving different market trends: https://serpapi.com/google-finance-markets#api-parameters-search-query-trend
    "api_key": "..."                     # serpapi key, https://serpapi.com/manage-api-key
}

search = GoogleSearch(params)
results = search.get_dict()

most_active = results["markets"]

print(json.dumps(most_active, indent=2, ensure_ascii=False))
Example output:
[
  {
    "stock": ".DJI:INDEXDJX",
    "link": "https://www.google.com/finance/quote/.DJI:INDEXDJX",
    "serpapi_link": "https://serpapi.com/search.json?engine=google_finance&hl=en&q=.DJI%3AINDEXDJX",
    "name": "Dow Jones",
    "price": 34089.27,
    "price_movement": {
      "percentage": 0.4574563,
      "value": 156.66016,
      "movement": "Down"
    }
  },
  {
    "stock": ".INX:INDEXSP",
    "link": "https://www.google.com/finance/quote/.INX:INDEXSP",
    "serpapi_link": "https://serpapi.com/search.json?engine=google_finance&hl=en&q=.INX%3AINDEXSP",
    "name": "S&P 500",
    "price": 4136.13,
    "price_movement": {
      "percentage": 0.028041454,
      "value": 1.1601563,
      "movement": "Down"
    }
  },
  other results ...
]
Related
I am trying to do some web scraping with Yahoo Finance.
"https://finance.yahoo.com/quote/AUDUSD%3DX/history?p=AUDUSD%3DX"
I finished the code and it returns response code 404.
I noticed that I need to add the user-agent header before I can scrape the website, e.g.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
}
But I was just wondering how I can get the above header information via Python. Is there any code I should use to get the user-agent header? Thank you.
Why don't you check out this package instead? You might find it easier and less confusing:
Download market data from Yahoo! Finance's API (Python)
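Assuming the package referred to above is yfinance (my assumption), a minimal sketch for the AUDUSD=X history from the question might look like this:
import yfinance as yf  # assumption: pip install yfinance

# Download daily history for the AUD/USD pair; yfinance sets its own
# request headers, so no manual User-Agent is needed here.
data = yf.download("AUDUSD=X", period="1mo", interval="1d")
print(data.head())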
I have a web scraping script that has recently run into a 403 error.
It worked for a while with just the basic code, but now it keeps hitting 403 errors.
I've tried using user agents to circumvent this, and it very briefly worked, but those are now getting a 403 error too.
Does anyone have any idea how to get this script running again?
If it helps, here is some context:
The purpose of the script is to find out which artists are on which Tidal playlists; for the purposes of this question, I have only included the snippet of code that fetches the site, as that is where the error occurs.
Thanks in advance!
The basic code looks like this:
import bs4
import requests

baseurl = 'https://tidal.com/browse'

for i in platformlist:
    url = baseurl + str(i[0])
    tidal = requests.get(url)
    tidal.raise_for_status()
    if tidal.status_code != 200:
        print("Website Error: ", url)
        pass
    else:
        soup = bs4.BeautifulSoup(tidal.text, "lxml")
        text = str(soup)
        text2 = text.lower()
With user-agents:
import random
import bs4
import requests

user_agent_list = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
]

url = 'https://tidal.com/playlist/1b418bb8-90a7-4f87-901d-707993838346'

for i in range(1, 4):
    # Pick a random user agent
    user_agent = random.choice(user_agent_list)
    # Set the headers
    headers = {'User-Agent': user_agent}
    # Make the request
    tidal = requests.get(url, headers=headers)
    print("Request #%d\nUser-Agent Sent:%s\n\nHeaders Received by HTTPBin:" % (i, user_agent))
    print(tidal.status_code)
    print("-------------------")
    # tidal = requests.get(webpage)
    tidal.raise_for_status()
    print(tidal.status_code)
    # make webpage content legible
    soup = bs4.BeautifulSoup(tidal.text, "lxml")
    print(soup)
    # turn bs4 type content into text
    text = str(soup)
    text2 = text.lower()
I'd like to suggest an alternative solution - one that doesn't involve BeautifulSoup.
I visited the main page and clicked on an album while logging my network traffic. I noticed that my browser made an HTTP POST request to a GraphQL API, which accepts a custom query string as part of the POST payload that dictates the shape of the response data. The response is JSON, and it contains all the information we requested with the original query string (in this case, all artists for every track of a playlist). Normally, this API is used by the page to populate itself asynchronously via JavaScript when it's viewed in a browser, as it's meant to be.
Since we have the API endpoint, request headers and POST payload, we can imitate that request in Python to get a JSON response:
def main():

    import requests

    url = "https://tidal.com/browse/api"

    headers = {
        "accept": "application/json",
        "accept-encoding": "gzip, deflate",
        "content-type": "application/json",
        "user-agent": "Mozilla/5.0"
    }

    query = """
    query ($playlistId: String!) {
        playlist(uuid: $playlistId) {
            creator {
                name
            }
            title
            tracks {
                albumID
                albumTitle
                artists {
                    id
                    name
                }
                id
                title
            }
        }
    }
    """

    payload = {
        "operationName": None,
        "query": query,
        "variables": {
            "playlistId": "1b418bb8-90a7-4f87-901d-707993838346"
        }
    }

    response = requests.post(url, headers=headers, json=payload)
    response.raise_for_status()

    playlist = response.json()["data"]["playlist"]

    print("Artists in playlist \"{}\":".format(playlist["title"]))
    for track_number, track in enumerate(playlist["tracks"], start=1):
        artists = ", ".join(artist["name"] for artist in track["artists"])
        print("Track #{} [{}]: {}".format(track_number, track["title"], artists))

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
Artists in playlist "New Arrivals":
Track #1 [Higher Power]: Coldplay
Track #2 [i n t e r l u d e]: J. Cole
Track #3 [Fast (Motion)]: Saweetie
Track #4 [Miss The Rage]: Trippie Redd, Playboi Carti
Track #5 [In My Feelings (feat. Quavo & Young Dolph)]: Tee Grizzley, Quavo, Young Dolph
Track #6 [Thumbin]: Kash Doll
Track #7 [Tiempo]: Ozuna
...
You can change the playlistId key-value pair in the payload dictionary to get the artist information for any playlist.
Take a look at this other answer I posted, where I go more in depth on how to log your network traffic, find API endpoints, and imitate requests.
I am trying to learn how to use BS4, but I ran into this problem. I'm trying to find the text on the Google Search results page that shows the number of results for the search, but I can't find the text 'results' either in html_page or in the soup HTML parser. This is the code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.google.com/search?q=stack'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
print(b'results' in html_page)
print('results' in soup)
Both prints return False. What am I doing wrong, and how do I fix it?
EDIT:
It turns out the language of the webpage was a problem; adding &hl=en to the URL almost fixed it.
url = 'https://www.google.com/search?q=stack&hl=en'
The first print is now True but the second is still False.
The requests library returns the raw (bytes) response body in response.content. So, to answer your second question, replace res.content with res.text.
from bs4 import BeautifulSoup
import requests
url = 'https://www.google.com/search?q=stack'
res = requests.get(url)
html_page = res.text
soup = BeautifulSoup(html_page, 'html.parser')
print('results' in soup)
Output: True
Keep in mind that Google is usually very active in handling scrapers. To avoid getting blocked or CAPTCHA'd, you can add a user agent to emulate a browser:
# This is a standard user-agent of Chrome browser running on Windows 10
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }
Example:
from bs4 import BeautifulSoup
import requests
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get('https://www.amazon.com', headers=headers).text
soup = BeautifulSoup(resp, 'html.parser')
...
<your code here>
Additionally, you can add another set of headers to pretend like a legitimate browser. Add some more headers like this:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip',
    'DNT': '1',  # Do Not Track Request Header
    'Connection': 'close'
}
It's not because res.content should be changed to res.text, as 0xInfection mentioned; that would still return a result.
However, in some cases it will return bytes content if the transfer encoding isn't gzip or deflate, which requests automatically decodes to a readable format (correct me in the comments or edit this answer if I'm wrong).
The real reason is that no user-agent is specified, so Google will eventually block the request, because the default requests user-agent is python-requests and Google understands that it's a bot/script. Learn more about request headers.
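You can check the default user-agent that requests sends (and that Google flags) with a quick sketch like this:
import requests

# Prints the User-Agent header requests sends by default, e.g. "python-requests/2.x.x"
print(requests.utils.default_user_agent())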
Pass a user-agent into the request headers:
headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get('YOUR_URL', headers=headers)
Code and example in the online IDE:
import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "fus ro dah definition",  # query
    "gl": "us",                    # country to make the request from
    "hl": "en"                     # language
}

response = requests.get('https://www.google.com/search',
                        headers=headers,
                        params=params).content

soup = BeautifulSoup(response, 'lxml')

number_of_results = soup.select_one('#result-stats nobr').previous_sibling
print(number_of_results)
# About 114,000 results
Alternatively, you can achieve the same thing using the Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to extract the data you want, without figuring out how to parse it or how to bypass blocks from Google or other search engines, since that's already done for the end user.
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "fus ro dah definition",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

result = results["search_information"]['total_results']
print(result)
# 112000
Disclaimer: I work for SerpApi.
I'm trying to scrape information from the Facebook game "Coin Master".
Inspect Element > Network > XHR
This brings up "Balance", which I need to access since it contains the information I need to track.
Picture example
Coin Master FB Link to Test
But I do not know what module I need to achieve this. I've used BeautifulSoup and Requests in the past, but this isn't as straightforward for me.
Any help/insight into my issue would be much appreciated!
Thanks & kind regards
You need to inspect the request and, under Form Data, find the data to send with your own request.
import requests
import json
data = {
    "Device[udid]": "",
    "API_KEY": "",
    "API_SECRET": "",
    "Device[change]": "",
    "fbToken": ""
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"
}
url = "https://vik-game.moonactive.net/api/v1/users/rof4__cjsvfw2s604xrw1lg5ex42qwc/balance"
r = requests.post(url, data=data, headers=headers)
data = r.json()
print(data["coins"])
I use Python requests to get images, but in some cases it doesn't work. It seems to be happening more often. An example is
http://recipes.thetasteofaussie.netdna-cdn.com/wp-content/uploads/2015/07/Leek-and-Sweet-Potato-Gratin.jpg
It loads fine in my browser, but using requests, it returns HTML that says "403 forbidden" and "nginx/1.7.11".
import requests
image_url = "<the_url>"
headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36', 'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8','Accept-Encoding':'gzip,deflate,sdch'}
r = requests.get(image_url, headers=headers)
# r.content is html '403 forbidden', not an image
I have also tried with this header, which has been necessary in some cases. Same result.
headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36', 'Accept':'image/webp,*/*;q=0.8','Accept-Encoding':'gzip,deflate,sdch'}
(I had a similar question a few weeks ago, but this was answered by the particular image file types not being supported by PIL. This is different.)
EDIT: Based on comments:
It seems the link only works if you have already visited the original site http://aussietaste.recipes/vegetables/leek-vegetables/leek-and-sweet-potato-gratin/ with the image. I suppose the browser then uses the cached version. Any workaround?
The site is validating the Referer header. This prevents other sites from including the image in their web pages and using the image host's bandwidth. Set it to the site you mentioned in your post, and it will work.
More info:
https://en.wikipedia.org/wiki/HTTP_referer
import requests

image_url = "http://recipes.thetasteofaussie.netdna-cdn.com/wp-content/uploads/2015/07/Leek-and-Sweet-Potato-Gratin.jpg"

headers = {
    'User-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Referer': 'http://aussietaste.recipes/vegetables/leek-vegetables/leek-and-sweet-potato-gratin/'
}

r = requests.get(image_url, headers=headers)
print(r)
For me, this prints
<Response [200]>