Scraping the low fare calendar from an airline site: problem with unusual headers - Python

I'm using Python to collect price data from southwest.com, specifically their Low Fare Calendar prices.
Example page: https://www.southwest.com/air/low-fare-calendar/select-dates.html?adultPassengersCount=1&currencyCode=USD&departureDate=2019-04-25&destinationAirportCode=MCO&originationAirportCode=MDW&passengerType=ADULT&returnAirportCode=&returnDate=&tripType=oneway
The data I need comes from a POST request to https://www.southwest.com/api/air-booking/v1/air-booking/page/air/low-fare-calendar/select-dates.
The problem is with the headers that start with x-:
x-5ku220jw-a: I_ftfpDjDyifDufjizDcDs7teKDhlBY7UN6xgaX7hf8wp7=ZfV=ZiagmgMX1lgT1Yf8tIziO2PrYhLk6DprKAQChh-kEK2=QDyutU3E-gz6wgBpYQrUf8KlMjN6c0aP_95jAIV9BYr35sb66QM06CZP3IZrgF1XwfgNR0akQIc5tIaOIYagjTf1yAyP3louw57s3iZ_m07huHuaxFJI7lU_IFSvNg91IMb8riZ_RIcUrp7rNlpUffIUfqa8aIrEPkaSxiI=wPSP3k9=c1yvM8ikxhZ6h9t1kjSgmibYJInPU8hPfp-3uezOfeKLZIiQypKSQl14WSN_BPJURUaP1qa9R87EBFb8AqSv-ia7IiB0jlzSZ5IktfFfu2asJBII5lh5ss-XagNiZflSxYf8ZPVCljvSzl36bldUMFhj18zkI07=tgkXuU0UcZ7BPpb8tI3=3jr8-I-5QpfLBKWk3iq0bpK7KIsh7C3Ke1cB7t_S32kXhi_fBYN6VpIfu0-kyi753uzv0I-Q5=UsBYNP3DskijalfqWYP2kf5ly6mDILRqLimi7=AE3=tDcsM8-06fr7zDaLcc9I72zBRfIrLqN6Ch-b8D_f7fNuM__MCgaJLDkfhOBSwk-8yI-lusLPFvrN4fejQR79fsLmxFhrRS7jeEbItf_gBPKYS93w6YNimDrfaXR4uICfPKNP6fZP6ldEpciwKqNu7qMBRm3=mDavQ2szxqr=Cf9Yti-YuEU9c3I7t5WkQDs=LlDID2zSMp-gZI-BSf-0bkfURp3Y5fiRuirrB0aFBqf=SNVjLmBY5jakHlzURuu8cTM5wXKv3gaitlD8M3-vMIcYufiDrE4_fi-RhivW51bi7hY7jkhrhfbUBFh=6IaUbXhWTiu8R0Z_T37SQqW=jMLfA5sEBqrDARzQTqU=Nfkqyia7RTLrMUUgCFJbR2TIhl80I9BSLfaijYM6M5XYrCzvjIN6mU_j3jM-sRNiufLPTjUPm8K16Ua5tfr=b3=pfBacjF46fX7uMCujTsW=j0LS4lK8yWTIKlzQMYNhjfqF8WaPcIbIOsSIyp=8cqy6fvKS5de_Bg7v_IzhtYMf7FI8bluf7kJkZ8UFfqifSiSXBlc7ukXDkFJ=LiKs5pZP_f7jBFzvy5y_rRJ6_qM6Nd7PnL46xItbRu3PMUOEjIbSwgKtsHc9MFz9RI-BepBaLqi7A5zqZFapf0N6PYSiuFtj6iZ6LTsX3pbNSWL=w2USZpbPM3a=RI77fI-DnwiUMeSEcDjm0dyfZPb8Mi6jbE1folpkyf-vNMbUcUV9WfoXu5zEMrrj-pur7jr=MprvMYfjZdp5yqapmYrP3D4lteJ8ZphgMI9Y5W2WhjahM5rPRF-RfIILB2zXrDejnFzStjScMY7RurgpMIbsHDU=CIbYjI9ifsSxFfakif-9RU7sMgrmZ9XihWzkZPA8MI-9cR7sB8bPZYfY90asf5ch0davzIB7ulqrKu-8_PZubgL-ApbujDBfZjf8QDUrRjbfhpbThpJ=3iZ6_lkI1DUPsib8NR2pSfasVDw3uvLTqFBiuFbU7WU=jlD8Blbi7hu7MFq8hj30MmDR1jN6VDyStD4Q1jSP-FpMfgr7k2IjM0f83lkzklKhtFw7qlz=nI=ISqrS1g37flKPAizmAfS4ffvrqehI5fTD3fi=sI_YjtZ6Y2eVtj7=t0SNM9KURiakypa0_ib6-sKOcONujgKQ_dejhi7=4IcPQ5chjiNj3IXIrpbIhUjcZDrBfpas74JiuFV=rjf8UUH=3DsPckzkQIkgMuavhIZrZ21ycDpOag7vNLKXhq7PfDNPQCKuilKRulhSy3ij9DDIZI99cqi9fp7OcFJiB2zfjFAenj_PBq78MMI0ZDLSQ29mmDfrzhZPNFK1t93w3i9WZgIjQDeQBUq80qavQDevID-s7iy6Aj7STFJ63FsY5qTmUPO6MiVjsXBjMPfSmaLlbjaSnp7khFBYSg46ZqrbBT78tf1vmQa5yUyPwE-UR4S4=zN6_TZ669KB4fL=6iy6ZEBSmIa5miefM27LBqNPMDzSQqfS3UAUQhip7gs=322kt2ZPm8XPmixikacSCYMMxpBPQiZitiMjIfcEsp-9Rrz833LYus7k9IakRiVfYiQ0epu7ufLi9DaX5pa0UYiLRDd8jp-9SPsaUDIUR37Sm9ccrDd8_IrPV0JIZImiuTbsagaujULPs9hIjphgjYLi7eUSBezQb2zvjU3=mkzu1Ec=TqfI1m7s3gakhDz5AkUf7I-3BXL0hzKIupak_ffY5qSNfii0t3fj9kbUBL4uwF7kUi_8AjaXhiSgEFurL5svxI3_BF7v1dzPMDfjtTNujFq8Ty37tDc=MlU5Al-N6h7E883SM2I86Cc=r0AUR3I7ug7yRF1kygdzvP1XhgejrdL8Ah7=Ki-hujfirP_7tfuU=oWYfgaYIiaIMfJkifzszkU8PPMjADgpZpc9micIWp46BFM0flQ4Rse0tirZBU-lCWLm3YifrUJshfkXjY7P10_MjkaQMU4cukM_RI-S0lIyBXK5VEKchkZZRFd8TFQXuDZ_rezP3k_ZRlJEfT4utq7XBgf7jUfYmDSy-pabrIz9cFAFRFujhQSNR2ffjjNN=j3EBFBYZq7gwj7j0fL=MII=yhw6BkLY5gs=pgq8Z4mkmtevIjJUBIzk3qM6Q9trSY7uSF7=tIrEBUa8yhIrVYakAOQXSihj5DaR4FJEfU-7sDxg_rKpW9K0rlzvthSi7ptRRULj8WIjj8qURdPIRPT8losP3sLIfhLujp-bjb16bgWYS=A8mDbItmu8kjTISpuL7fqi55ZNcqrjwgavmq=gklkXuDsjti9lZ94D9I-hf9h8Qh3kQJasYqrQaqryRp373iZu7O-khX3YufUjQezmm3L9cFJs7gasuz9YjIZ_jisYugs6cf9kNUMr-DpjNqfjAfssuFAIAIcj3qMctFc7EqLNojTIhCxufBQEfSevQXhTxIz4RgL=eM_1Bh_8tiVM1Uhl7DIIJicELfILfdSXCjB8epcrIjr6mjPFDIsRZpbrBXKS68I7hkMfu3ejsq7kjIdTsCAe9i-vhscEToKJcfII7DdrTFzQL_4Tf8yf75J83JLDjEh8Qp3YjWpsA5zkoic=QFLIjq7E7Spjh8zSZXfEfTIYhjrjtgViKdLiu8swwq7SsNZ6P2Z6ZU3Yfkq8MPbrts9kRlpjLDI5AgUky4KPmfJk2q7=UCAm3DNgEFZ66I-==sqw18pkMfsbjp1zx8BjAMZ6iipfpDyvrJcsLjNNRh1vyi_rRqjkmfS4WSSPf3uaxFyPcjfsjiui3bhXjQWUID9uM61=Ajtjt2aDxFBPMta379s=iqLN=XKSMfushq75Ppu9t0aveDN6MFr1jlbUBqaOcIQkjIchhqf9RuVftLbDcYSptj4QLXh=ti6gA9bYhhauA9KQ382=UizLRIZNf8zRgiuncgLjyFX=iuNZ38s=tFZY7Ib8_DzS-DFDBpcvmSfI7i
ICaI_fhDsjU2y_LF38mInRKFuf7gIjcfbbBf7gQfBS_gMHvKuLRq9sujUD6lsPzF3SwI3YLh9IjIVj1fz6NFBDiFuI1iL=3qf4jleRtIN6yIN6j83BcIpj9IchAX1uu24XSsM=QiumNsJ-RfQkyI-7jCZu75kqxlhNzKcfhHs7ffN6MM9=jia5wu7pAi-FrIbFfgfgAWaQJfrI4gLrRpZfh6Z6ZgSv3iVFBWrlJIijRg=ZcFKkyYMvyoYFu24iuU15ME3EfY7Yr8z0QF-9D5VQbpKS6pNkpfa0QDrP0hfdx9tv0sKY75z06fLPiFJ8oYf5AIbsPU77ut7zuSa7t=3gEF3=qT7=61P=yrhsBFISmM-8xXKSpD1g6dMlff-RbDz_l2aI7I_9_XKvbi-1NfxjjWsrEfTPtL7LzjMffg1ufdrkEIb69FifjPh8AiwvMWsPc0agrDbbRoU7MhLksDy6x5N_fqLPhUu731_WLdrWdpK0aqaVhpruxlI83FcYYiIBEhP8SvUPmF-_Riue4t-ijSsv3qMfMiN_StL8m2aR5qrYtIx013a=tki83iIf57-kNI3SyYj6VXqLcFhS3jbFR3LjSiTUM3_S=Exj6f4u5FQP0kPky2LyRf98y3s0mvaNBPT8j5JEfeasjjrYujakb9BYwDdeujPfwFAmefIUM=3yIF-JRYaaxp_SL21vBF74BpMjrFh1tUS=z0T8bPQgtls=NDsa4l46N5Q86qfL_FLeVEQXrIXutRz1MphpAFTURUaTJg_gQpakcIuEAF1yBqa7f8yiticYuqNjSh7NBX46EE_lbgsER9JYt=hLRYNuWEJUR0auhjilth9ERU3=SX01w2_yRI6mMi_=QEp=ADzvLhq80gLWxidptsB96p-QVlIrokg1mI3tLEKkRfy6NIFpjDev3Y7rZlk=t2z9fsZ_lU3Yupxf5Hk8sl-g6YLItIaX7jSjtUt0niU=LjBEBBb030ujMsa9Rtr=o9XS3FXwwkL=mpx0yeJkjoJ=je-0csLfgFLgCIch58IfhiuaSfUrR9hDQFhS3Io3uYLy8Nyurp-Xm8Bf1fSvfpLEcDUfRldL3gSgcZyNzIqPZpHWf9JPygaS3IUPZ4Jcfx_fjfIjtIiYtaNP35u=wRJ=hWBYfRB=jiU8hvLFUOsj9Pzp9DIDTMa98YNPVDk8eiKeJlzSDs-w32UL-iz1Q59kQDBYMf4NVXv00jsB0iBjHWKh3ix66D-5AR9zbpJeJOvgtkZ_rDz=UtJ-sgaPSCQrfg9URf-srhNi7E3gQIuCjp7Px4K9Rf7YKq7LRFBSQ3gjSDz=3pZvQlXkZkq53jKupjWYtjaOBFZuBqakbDzPfpJI7UbrBlII7PQUZ93jyhf8tq39jXqIfP46LUBYjqifAXJsMfavTpKkVFtjMU-5zEaHxK6jVI-kCFJk6XMrMoSvm5hP9Qf=jPqj3ichIar=ZI-BYUMrM0a3tFxrbi-ktuh8QYa5Rg_bRpKQB2zXlhSjyFM6hIejYR3pMFK6mqim-DUNRYLFUi377pWIuIzkQkqYJXqrBhumME-v1=ZijCcst2JERieyJiaS_iIDVqSXjrKLoFZ696Mjmg2=_ib8rptj9LejZiI8bkMRtKK3rYf=wXJ=Bl4uJFsQRpK_cgLQ6kxffqzERD_D3qLItfKSAlU=V2ITxUK1k5IwiscpJfkp7qUERIL_cFXrRq7XBkz0siuMbhkvQUJixisEajISkp-OYjL_VdLIjq=IjhffA298jnqejsPhf27o4fLQk8RsuD9QQizpMHc=3iBY7pbEcIzDxMyu=p3j9FR1rgNFWqYcOlK66XWjii9EpI9s4lkIHpaX5Fq5Bk4PmqN65R4N4KN6Ak34fpcEM9tfugzPAI3Y7jTUoYTuht7zsIcLcdxf7F3BBXJ=6kIaNp_fJq7XKUSv3DMjmgNPZEhSTiLULgav=XKvjj7jZDzOI0NutX1uuhT8-D1ge8UD70LIDFa4fgSvyF986fA67H-ORiuDCMakQXq8bqffViUjQ94PxpN6t0akQkaJBi7P9kxru3=VaUzS39tjBKeft3a0SqLPjizRh91vQpZ_Ohb8tfhPfF9=3RaSefilRzaEhYavwkrjJYacjqffIqa9rpZNBDsSQqL8LwpjcDIjQiaPTif8tqaOMkXfu8K6329IBFX8BpAUBXKVufU=tFh8Mf3jcIyP4grS6FKiZIV81UbiWyz=3l9IficLjC16MFZNR93WsfzSjf7=fkSX3rK6Zf2=QUBE7RlgVh_=tkiWjYrPZFNih2e5tiLPKXJPMWsvykfiI3SvMiSjTe_Yjlpshgs7Bib=ZlUPtUKXIF-=_6vySKFwLIuj1s4fjiN6HDf8DeXpt9J1bIoW9KIzLt_E7MJIuIr=8qaFcV-kL2zQ6FzCKhLctgbP6q98tUrSQF-1VlKXrDLMfhasCXhS9WzOR63=VFzvxg7v1UL6Vf_PLqfjZdurTFt6fDzRhlcLLrcRrkFvZ
x-5ku220jw-b: 3k9rur
x-5ku220jw-c: A__57lBqAQAACgDSJrJqFbzAqmJVRyggwSLzliwMgtTT0dRA3TtiKWkLmSk8AawUHfD6K-8VrJMAAOfvAAAAAA==
x-5ku220jw-d: 0
x-5ku220jw-uniquestatekey: A-gE71BqAQAAbcLSHXYa-TmPShEFZzL7WRKJL2s2ZrEiksXMK8pm-YimkgmLAawUHfGuclcCwH8AABszAAAAAA==
I don't know where they come from or how to get them, but the code doesn't work without them (or at least one of them). I'm sure about that because I do get a response when I send requests to other URLs on the site that don't require these headers.
So, the questions are:
1. What are these headers?
2. Is there a way (if it's possible at all) to get values for them?
My code:
import requests

data_url = 'https://www.southwest.com/api/air-booking/v1/air-booking/page/air/low-fare-calendar/select-dates'

payload = {
    "adultPassengersCount": "1",
    "currencyCode": "USD",
    "departureDate": "2019-04-25",
    "destinationAirportCode": "MCO",
    "originationAirportCode": "MDW",
    "passengerType": "ADULT",
    "returnAirportCode": "",
    "returnDate": "",
    "tripType": "oneway",
    "application": "air-low-fare-calendar",
    "site": "southwest"
}

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
    'x-api-key': 'l7xx944d175ea25f4b9c903a583ea82a1c4c'
}

r = requests.post(data_url, json=payload, headers=headers)
print(r.text)
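Headers with randomized names like these are typically injected by an anti-bot JavaScript layer when the page loads, so their values are session-bound and usually can't be computed outside a browser. One workaround is to drive the real page in a browser and capture the API call it makes, instead of replaying the headers yourself. A minimal sketch, assuming the third-party selenium-wire package and a matching ChromeDriver are installed (both are assumptions, not part of the original code):

# Sketch only: pip install selenium-wire, ChromeDriver on PATH assumed.
from seleniumwire import webdriver

page_url = ('https://www.southwest.com/air/low-fare-calendar/select-dates.html'
            '?adultPassengersCount=1&currencyCode=USD&departureDate=2019-04-25'
            '&destinationAirportCode=MCO&originationAirportCode=MDW'
            '&passengerType=ADULT&returnAirportCode=&returnDate=&tripType=oneway')

driver = webdriver.Chrome()
driver.get(page_url)  # the page's own JS generates the x- headers and calls the API

# Inspect the captured traffic for the low-fare-calendar API call.
for request in driver.requests:
    if request.response and 'air-booking/page/air/low-fare-calendar' in request.url:
        # The generated anti-bot headers, for inspection:
        print({k: v for k, v in request.headers.items() if k.lower().startswith('x-')})
        # The JSON payload itself (raw bytes; may still be compressed):
        print(request.response.body)

driver.quit()

Because the values appear to be tied to the browser session (and may be single-use), capturing the response this way is usually more reliable than copying the headers into requests.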

Related

How to bypass Cloudflare with Python on GET requests?

I want to bypass Cloudflare on a GET request. I have tried using Cloudscraper, which worked for me in the past but now seems deprecated.
I tried:
import cloudscraper
import requests

ses = requests.Session()
ses.headers = {
    'referer': 'https://magiceden.io/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36',
    'accept': 'application/json'
}

scraper = cloudscraper.create_scraper(sess=ses)
hookLink = f"https://magiceden.io/launchpad/planetarians"
meG = scraper.get("https://api-mainnet.magiceden.io/launchpads/planetarians")
print(meG.status_code)
print(meG.text)
The issue seems to be that I'm getting a CAPTCHA on the request.
The Python library works well (I never knew about it); the issue is your user agent. Cloudflare uses some sort of extra check to determine whether you're faking it.
For me, any of the following works:
ses.headers = {
    'referer': 'https://magiceden.io/',
    'accept': 'application/json'
}

ses.headers = {
    'accept': 'application/json'
}
And also just:
scraper = cloudscraper.create_scraper()
meG = scraper.get("https://api-mainnet.magiceden.io/launchpads/planetarians")
EDIT:
You can use this dict syntax instead to fake the user agent (as per the manual):
scraper = cloudscraper.create_scraper(
    browser={
        'browser': 'chrome',
        'platform': 'windows',
        'desktop': True
    }
)
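Combining the variants this answer says work into one runnable sketch (same endpoint as above; the browser dict lets cloudscraper generate a self-consistent fingerprint rather than contradicting it with a hand-set user-agent):

import cloudscraper

# Let cloudscraper pick the user-agent and matching fingerprint itself.
scraper = cloudscraper.create_scraper(
    browser={
        'browser': 'chrome',
        'platform': 'windows',
        'desktop': True
    }
)
# The referer/accept headers from the working variants; note we do not
# override the user-agent here.
scraper.headers.update({
    'referer': 'https://magiceden.io/',
    'accept': 'application/json'
})

meG = scraper.get("https://api-mainnet.magiceden.io/launchpads/planetarians")
print(meG.status_code)
print(meG.text)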

Request from Python gets a different response than Node.js

I am trying to make the same request from Node.js that I make from Python.
The Python code is:
import requests
r = requests.post(
    url,
    data=data,
    headers={
        'User-Agent': self.ua,
        'Content-Type': 'application/x-www-form-urlencoded'
    }
)
In Node I tried node-fetch, Axios, and request, but I don't get the same response. I also tried curl from bash and got the same response as Node. I printed the Python headers with print(r.request.headers) and copy-pasted them into Node, but I still got a different response.
Axios.post(url, {
    data,
    headers: {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
        "Content-Type": "application/x-www-form-urlencoded",
    }
})
    .then(text => console.log(text.data))
    .catch(err => {
        console.log(err);
    });
I am getting different results: in Python I get what I expect, but in Node I get an HTML response:
Sorry, could not complete request because: <div class="tk-intro" style="font-size: 14px;color:#ff090f;">application information was not supplied.</div>
but in Python it works fine.
I printed the request headers, URL, and data, and found that I should convert the data to a query string like this:
"appleId=email#gmail.com&accountPassword=xxxxxx"
instead of passing it as JSON:
{
    "appleID": "email#gmail.com",
    "accountPassword": "xxxx"
}
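For comparison, requests itself makes the same distinction: data= form-encodes the body into exactly that query-string shape, while json= serializes it to JSON. A minimal sketch (the url below is a placeholder, not the real endpoint):

import requests

url = 'https://example.com/endpoint'  # placeholder
creds = {"appleId": "email#gmail.com", "accountPassword": "xxxx"}

# data= sends Content-Type: application/x-www-form-urlencoded with body
#   appleId=email%23gmail.com&accountPassword=xxxx
requests.post(url, data=creds)

# json= sends Content-Type: application/json with body
#   {"appleId": "email#gmail.com", "accountPassword": "xxxx"}
requests.post(url, json=creds)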

Can't find text from page using Python BS4

I am trying to learn how to use BS4, but I ran into a problem. I'm trying to find the text on the Google Search results page that shows the number of results for the search, but I can't find the text 'results' in either html_page or the soup HTML parser. This is the code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.google.com/search?q=stack'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
print(b'results' in html_page)
print('results' in soup)
Both prints return False. What am I doing wrong, and how do I fix it?
EDIT:
Turns out the language of the webpage was the problem; adding &hl=en to the URL almost fixed it.
url = 'https://www.google.com/search?q=stack&hl=en'
The first print is now True, but the second is still False.
The requests library returns response.content in raw (bytes) format. So, to answer your second question, replace res.content with res.text.
from bs4 import BeautifulSoup
import requests
url = 'https://www.google.com/search?q=stack'
res = requests.get(url)
html_page = res.text
soup = BeautifulSoup(html_page, 'html.parser')
print('results' in soup)
Output: True
Keep in mind that Google is usually very active in handling scrapers. To avoid getting blocked or CAPTCHA'd, you can add a user agent to emulate a browser:
# This is a standard user-agent of Chrome browser running on Windows 10
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }
Example:
from bs4 import BeautifulSoup
import requests
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get('https://www.amazon.com', headers=headers).text
soup = BeautifulSoup(resp, 'html.parser')
...
<your code here>
Additionally, you can add another set of headers to look like a legitimate browser. Add some more headers like this:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip',
    'DNT': '1',  # Do Not Track request header
    'Connection': 'close'
}
It's not because res.content should be changed to res.text as 0xInfection mentioned; that would still return the result. However, in some cases it will return bytes content if the transfer-encoding is not gzip or deflate, which requests automatically decodes into a readable format (correct me in the comments or edit this answer if I'm wrong).
It's because no user-agent is specified, so Google will eventually block the request: the default requests user-agent is python-requests, and Google understands that it's a bot/script. Learn more about request headers.
Pass a user-agent into the request headers:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get('YOUR_URL', headers=headers)
Code and example in the online IDE:
import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "fus ro dah definition",  # query
    "gl": "us",                    # country to make request from
    "hl": "en"                     # language
}

response = requests.get('https://www.google.com/search',
                        headers=headers,
                        params=params).content

soup = BeautifulSoup(response, 'lxml')

number_of_results = soup.select_one('#result-stats nobr').previous_sibling
print(number_of_results)
# About 114,000 results
Alternatively, you can achieve the same thing using the Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only extract the data you want, without having to figure out how to extract things or bypass blocks from Google or other search engines, since that's already done for the end user.
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "fus ro dah definition",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

result = results["search_information"]['total_results']
print(result)
# 112000
Disclaimer: I work for SerpApi.

How to scrape from XHR in Chrome (Python)

I'm trying to scrape information from the Facebook game "Coin Master".
Inspect Element > Network > XHR brings up "Balance", which I need to access since it contains the information I need to track.
Picture example
Coin Master FB Link to Test
But I do not know which module I need to achieve this. I've used BeautifulSoup and Requests in the past, but this isn't as straightforward for me.
Any help/insight into my issue would be much appreciated!
Thanks & kind regards
You need to inspect the request and find your data for the request under Form Data.
import requests
import json

data = {
    "Device[udid]": "",
    "API_KEY": "",
    "API_SECRET": "",
    "Device[change]": "",
    "fbToken": ""
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"
}

url = "https://vik-game.moonactive.net/api/v1/users/rof4__cjsvfw2s604xrw1lg5ex42qwc/balance"
r = requests.post(url, data=data, headers=headers)
data = r.json()
print(data["coins"])

Google Finance recognizes my Python script as a bot and blocks it

I wrote a script that retrieves stock data from Google Finance and prints it out, nice and simple. It always worked, but since this morning I only get a page telling me that I'm probably an automated script, instead of the stock data. Of course, being a script, it can't pass the CAPTCHA. What can I do?
Well, you've finally reached a quite challenging realm: decoding the CAPTCHA.
There are OCR approaches that can decode simple CAPTCHAs into text, but they don't seem to work on Google's CAPTCHA.
I've heard there are companies that provide manual CAPTCHA-decoding services; you could try one. ^_^ LOL
OK, to be serious: if Google doesn't want you to do it that way, then it's not easy to decode those CAPTCHAs. After all, why scrape Google for finance data? There are plenty of other providers; try scraping those websites instead.
You can try to solve the blocking issue by adding headers with your user-agent specified. This is necessary for Google to recognize the request as coming from a user rather than a bot, and not block it:
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
An additional step could be to rotate user-agents:
import requests, random
user_agent_list = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
]

for _ in user_agent_list:
    # Pick a random user agent
    user_agent = random.choice(user_agent_list)
    # Set the headers
    headers = {'User-Agent': user_agent}
    requests.get('URL', headers=headers)
In addition to rotating user-agents, you can rotate proxies (ideally residential), which can be used in combination with a CAPTCHA solver to bypass CAPTCHAs, as in the sketch below.
To parse dynamic websites with browser automation, you can use curl-impersonate or selenium-stealth, which can bypass most CAPTCHAs, but browser automation can be CPU- and RAM-expensive and difficult to run in parallel.
There's a reducing the chance of being blocked while web scraping blog post if you need a little more info.
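A minimal sketch of proxy rotation with requests (the proxies parameter is standard requests; the proxy endpoints and credentials below are placeholders, not real servers):

import requests, random

# Placeholder endpoints; substitute real (ideally residential) proxies.
proxy_list = [
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
]

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'}

proxy = random.choice(proxy_list)
# Route both HTTP and HTTPS traffic through the chosen proxy.
requests.get('URL', headers=headers, proxies={'http': proxy, 'https': proxy})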
As an alternative, you can use the Google Finance Markets API from SerpApi. It's a paid API with a free plan.
The difference is that it bypasses blocks (including CAPTCHA) from Google, with no need to create and maintain a parser.
Example of SerpApi code for extracting Most Active from the main page of Google Finance in the online IDE:
from serpapi import GoogleSearch
import json, os
params = {
    "engine": "google_finance_markets",  # serpapi parser engine. Or engine to parse a ticker page: https://serpapi.com/google-finance-api
    "trend": "most-active",  # parameter used for retrieving different market trends: https://serpapi.com/google-finance-markets#api-parameters-search-query-trend
    "api_key": "..."  # serpapi key, https://serpapi.com/manage-api-key
}

market_trends_data = []

search = GoogleSearch(params)
results = search.get_dict()

most_active = results["markets"]
print(json.dumps(most_active, indent=2, ensure_ascii=False))
Example output:
[
  {
    "stock": ".DJI:INDEXDJX",
    "link": "https://www.google.com/finance/quote/.DJI:INDEXDJX",
    "serpapi_link": "https://serpapi.com/search.json?engine=google_finance&hl=en&q=.DJI%3AINDEXDJX",
    "name": "Dow Jones",
    "price": 34089.27,
    "price_movement": {
      "percentage": 0.4574563,
      "value": 156.66016,
      "movement": "Down"
    }
  },
  {
    "stock": ".INX:INDEXSP",
    "link": "https://www.google.com/finance/quote/.INX:INDEXSP",
    "serpapi_link": "https://serpapi.com/search.json?engine=google_finance&hl=en&q=.INX%3AINDEXSP",
    "name": "S&P 500",
    "price": 4136.13,
    "price_movement": {
      "percentage": 0.028041454,
      "value": 1.1601563,
      "movement": "Down"
    }
  },
  ... other results
]
