I'm trying to scrape information from the Facebook game "Coin Master".
Inspect Element > Network > XHR
This brings up "Balance", which I need to access since it contains the information I need to track.
Picture example
Coin Master FB Link to Test
But I do not know what module I need to achieve this. I've used BeautifulSoup and Requests in the past, but this isn't as straightforward for me.
Any help/insight into my issue would be much appreciated!
Thanks & kind regards
You need to inspect the request and, under Form Data, find the data that is sent with it.
import requests
import json

# Form data copied from the "Balance" XHR request (fill in your own values)
data = {
    "Device[udid]": "",
    "API_KEY": "",
    "API_SECRET": "",
    "Device[change]": "",
    "fbToken": ""
}

# Send a browser User-Agent so the request doesn't look like a script
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"
}

url = "https://vik-game.moonactive.net/api/v1/users/rof4__cjsvfw2s604xrw1lg5ex42qwc/balance"

# POST the form data and parse the JSON response
r = requests.post(url, data=data, headers=headers)
data = r.json()
print(data["coins"])
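If you're not sure which fields the balance response actually contains ("coins" may not be the only one you need), one quick check, continuing from the snippet above where data = r.json(), is to pretty-print the whole payload:

import json

# Pretty-print every field the endpoint returned so you can pick out what to track
print(json.dumps(data, indent=2, sort_keys=True))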
Related
I'm trying to get how many results a Google search returns, using the bs4 library in Python, but it returns empty brackets.
Here is my code:
import requests
from bs4 import BeautifulSoup
url_page = 'https://www.google.com/search?q=covid&oq=covid&aqs=chrome.0.0i433l2j0i131i433j0i433j0i131i433l2j0j0i131i433j0i433j0i131i433.691j0j7&sourceid=chrome&ie=UTF-8'
page = requests.get(url_page).text
soup = BeautifulSoup(page, "lxml")
elTexto = soup.find_all(attrs ={'class': 'LHJvCe'})
print(elTexto)
I have an extension in Google that checks whether the HTML class is correct, and it gives me what I'm looking for, so I guess that is not the problem... Maybe it's something related to the format of the 'text' I'm trying to get...
Thanks!
It is better to use the gsearch package to accomplish your task rather than scraping the web page manually.
Google is not randomizing classes as baduker mentioned. They could change some class names over time but they're not randomizing them.
One of the reasons why you get an empty result is that you haven't specified an HTTP user-agent (headers), so Google might block your request; adding headers helps to avoid that. You can check what your user-agent is here. Headers will look like this:
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get('YOUR URL', headers=headers)
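If you want to confirm what your script is actually sending, a quick sketch using the public httpbin.org echo service (my own suggestion, not part of the answer itself) looks like this:

import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# httpbin echoes back the headers it received, so you can verify the User-Agent was applied
print(requests.get('https://httpbin.org/headers', headers=headers).json())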
Also, you don't need to use find_all()/findAll() or select() since you're trying to get only one occurrence, not all of them. Use instead:
find('ELEMENT NAME', class_='CLASS NAME')
select_one('.CSS_SELECTORs')
select()/select_one() is usually faster.
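For example, both of these return a single element (a minimal sketch on a made-up HTML fragment, just to show the two call styles):

from bs4 import BeautifulSoup

html = '<div id="result-stats">About 104,000 results<nobr> (0.35 seconds)</nobr></div>'
soup = BeautifulSoup(html, 'lxml')

# find() takes the tag name plus class_/id keyword arguments
print(soup.find('div', id='result-stats').text)

# select_one() takes a CSS selector and returns only the first match
print(soup.select_one('#result-stats').text)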
Code and example in the online IDE (note: the number of results will always differ. It just works this way.):
import requests, lxml
from bs4 import BeautifulSoup
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "fus ro dah defenition",
"gl": "us",
"hl": "en"
}
response = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(response.text, 'lxml')
number_of_results = soup.select_one('#result-stats nobr').previous_sibling
print(number_of_results)
# About 104,000 results
Alternatively, you can achieve the same thing using the Google Organic Results API from SerpApi, except you don't need to figure out why certain things don't work; instead, you iterate over a structured JSON string and get the data you want.
Code:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "fus ro dah defenition",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
result = results["search_information"]['total_results']
print(result)
# 104000
Disclaimer, I work for SerpApi.
I am trying to learn how to use BS4, but I ran into this problem. I try to find the text in the Google Search results page showing the number of results for the search, but I can't find the text 'results' either in the html_page or in the soup HTML parser. This is the code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.google.com/search?q=stack'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
print(b'results' in html_page)
print('results' in soup)
Both prints return False, what am I doing wrong? How to fix that?
EDIT:
Turns out the language of the webpage was a problem, adding &hl=en to the URL almost fixed it.
url = 'https://www.google.com/search?q=stack&hl=en'
The first print is now True but the second is still False.
The requests library, when you access the response via response.content, returns it in a raw (bytes) format. So, to answer your second question, replace res.content with res.text.
from bs4 import BeautifulSoup
import requests
url = 'https://www.google.com/search?q=stack'
res = requests.get(url)
html_page = res.text
soup = BeautifulSoup(html_page, 'html.parser')
print('results' in soup)
Output: True
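To see the difference between the two attributes, a quick sketch using the same request:

import requests

res = requests.get('https://www.google.com/search?q=stack')

print(type(res.content))  # <class 'bytes'> - the raw, undecoded body
print(type(res.text))     # <class 'str'>   - the body decoded with the detected encoding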
Keep in mind, Google is usually very active in handling scrapers. To avoid getting blocked/captcha'd, you can add a user agent to emulate a browser:
# This is a standard user-agent of Chrome browser running on Windows 10
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }
Example:
from bs4 import BeautifulSoup
import requests
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get('https://www.amazon.com', headers=headers).text
soup = BeautifulSoup(resp, 'html.parser')
...
<your code here>
Additionally, you can add another set of headers to pretend like a legitimate browser. Add some more headers like this:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language' : 'en-US,en;q=0.5',
'Accept-Encoding' : 'gzip',
'DNT' : '1', # Do Not Track Request Header
'Connection' : 'close'
}
It's not that res.content should be changed to res.text, as 0xInfection mentioned; it would still return the result.
However, in some cases it will only return bytes content if the transfer-encoding is not gzip or deflate, which are automatically decoded by requests into a readable format (correct me in the comments or edit this answer if I'm wrong).
It's because there's no user-agent specified, so Google will eventually block the request: the default requests user-agent is python-requests and Google understands that it's a bot/script. Learn more about request headers.
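You can see the default user-agent that requests sends with a quick sketch like this:

import requests

# Prints something like "python-requests/2.25.1", which Google can easily recognize as a script
print(requests.utils.default_user_agent())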
Pass user-agent into request headers:
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get('YOUR_URL', headers=headers)
Code and example in the online IDE:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "fus ro dah definition", # query
"gl": "us", # country to make request from
"hl": "en" # language
}
response = requests.get('https://www.google.com/search',
headers=headers,
params=params).content
soup = BeautifulSoup(response, 'lxml')
number_of_results = soup.select_one('#result-stats nobr').previous_sibling
print(number_of_results)
# About 114,000 results
Alternatively, you can achieve the same thing by using the Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to extract the data you want, without thinking about how to extract it or how to bypass blocks from Google or other search engines, since that's already done for the end user.
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "fus ro dah definition",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
result = results["search_information"]['total_results']
print(result)
# 112000
Disclaimer, I work for SerpApi.
I'm trying to automatically fetch the MAC addresses for some vendors in Python. I found a website that is really helpful, but I'm not able to access its information from Python. When I run this:
import grequests
rs = (grequests.get(u) for u in ['https://aruljohn.com/mac/000000'])
requests = grequests.map(rs)
for response in requests:
    print(response)
It prints None. Does anyone know how to solve this?
It looks like the issue is just not setting a user-agent in the headers. I was able to request the website without any issues. I used plain Python requests, but it should work fine with grequests. I do think you might want to find a more active library, though. You could check out aiohttp; it's very active and I have had a wonderful experience using it.
import requests
from lxml import html

def request_website(mac):
    url = 'https://aruljohn.com/mac/' + mac
    # A browser User-Agent so the request isn't flagged as a script
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    }
    r = requests.get(url, headers=headers)
    return r.text

response = request_website('000000')

# Parse the HTML and pull the first paragraph of the results block
tree = html.fromstring(response)
results = tree.cssselect('.results p')[0].text
print(results)
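If you do go with aiohttp, a minimal sketch of the same lookup might look like this (the URL, headers, and '.results p' selector are taken from the code above; the async structure is just my assumption of how you'd typically wire it up):

import asyncio

import aiohttp
from lxml import html

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
}

async def request_website(mac):
    url = 'https://aruljohn.com/mac/' + mac
    # Reuse one session per batch of lookups; the headers are sent with every request
    async with aiohttp.ClientSession(headers=HEADERS) as session:
        async with session.get(url) as resp:
            return await resp.text()

async def main():
    response = await request_website('000000')
    tree = html.fromstring(response)
    print(tree.cssselect('.results p')[0].text)

asyncio.run(main())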
I'm using Python to collect prices data from southwest.com.
Trying to collect their Low Fare Calendar prices.
Example page: https://www.southwest.com/air/low-fare-calendar/select-dates.html?adultPassengersCount=1&currencyCode=USD&departureDate=2019-04-25&destinationAirportCode=MCO&originationAirportCode=MDW&passengerType=ADULT&returnAirportCode=&returnDate=&tripType=oneway
The needed data comes out from post request to https://www.southwest.com/api/air-booking/v1/air-booking/page/air/low-fare-calendar/select-dates.
The problem is with the headers that start with x-:
x-5ku220jw-a: I_ftfpDjDyifDufjizDcDs7teKDhlBY7UN6xgaX7hf8wp7=ZfV=ZiagmgMX1lgT1Yf8tIziO2PrYhLk6DprKAQChh-kEK2=QDyutU3E-gz6wgBpYQrUf8KlMjN6c0aP_95jAIV9BYr35sb66QM06CZP3IZrgF1XwfgNR0akQIc5tIaOIYagjTf1yAyP3louw57s3iZ_m07huHuaxFJI7lU_IFSvNg91IMb8riZ_RIcUrp7rNlpUffIUfqa8aIrEPkaSxiI=wPSP3k9=c1yvM8ikxhZ6h9t1kjSgmibYJInPU8hPfp-3uezOfeKLZIiQypKSQl14WSN_BPJURUaP1qa9R87EBFb8AqSv-ia7IiB0jlzSZ5IktfFfu2asJBII5lh5ss-XagNiZflSxYf8ZPVCljvSzl36bldUMFhj18zkI07=tgkXuU0UcZ7BPpb8tI3=3jr8-I-5QpfLBKWk3iq0bpK7KIsh7C3Ke1cB7t_S32kXhi_fBYN6VpIfu0-kyi753uzv0I-Q5=UsBYNP3DskijalfqWYP2kf5ly6mDILRqLimi7=AE3=tDcsM8-06fr7zDaLcc9I72zBRfIrLqN6Ch-b8D_f7fNuM__MCgaJLDkfhOBSwk-8yI-lusLPFvrN4fejQR79fsLmxFhrRS7jeEbItf_gBPKYS93w6YNimDrfaXR4uICfPKNP6fZP6ldEpciwKqNu7qMBRm3=mDavQ2szxqr=Cf9Yti-YuEU9c3I7t5WkQDs=LlDID2zSMp-gZI-BSf-0bkfURp3Y5fiRuirrB0aFBqf=SNVjLmBY5jakHlzURuu8cTM5wXKv3gaitlD8M3-vMIcYufiDrE4_fi-RhivW51bi7hY7jkhrhfbUBFh=6IaUbXhWTiu8R0Z_T37SQqW=jMLfA5sEBqrDARzQTqU=Nfkqyia7RTLrMUUgCFJbR2TIhl80I9BSLfaijYM6M5XYrCzvjIN6mU_j3jM-sRNiufLPTjUPm8K16Ua5tfr=b3=pfBacjF46fX7uMCujTsW=j0LS4lK8yWTIKlzQMYNhjfqF8WaPcIbIOsSIyp=8cqy6fvKS5de_Bg7v_IzhtYMf7FI8bluf7kJkZ8UFfqifSiSXBlc7ukXDkFJ=LiKs5pZP_f7jBFzvy5y_rRJ6_qM6Nd7PnL46xItbRu3PMUOEjIbSwgKtsHc9MFz9RI-BepBaLqi7A5zqZFapf0N6PYSiuFtj6iZ6LTsX3pbNSWL=w2USZpbPM3a=RI77fI-DnwiUMeSEcDjm0dyfZPb8Mi6jbE1folpkyf-vNMbUcUV9WfoXu5zEMrrj-pur7jr=MprvMYfjZdp5yqapmYrP3D4lteJ8ZphgMI9Y5W2WhjahM5rPRF-RfIILB2zXrDejnFzStjScMY7RurgpMIbsHDU=CIbYjI9ifsSxFfakif-9RU7sMgrmZ9XihWzkZPA8MI-9cR7sB8bPZYfY90asf5ch0davzIB7ulqrKu-8_PZubgL-ApbujDBfZjf8QDUrRjbfhpbThpJ=3iZ6_lkI1DUPsib8NR2pSfasVDw3uvLTqFBiuFbU7WU=jlD8Blbi7hu7MFq8hj30MmDR1jN6VDyStD4Q1jSP-FpMfgr7k2IjM0f83lkzklKhtFw7qlz=nI=ISqrS1g37flKPAizmAfS4ffvrqehI5fTD3fi=sI_YjtZ6Y2eVtj7=t0SNM9KURiakypa0_ib6-sKOcONujgKQ_dejhi7=4IcPQ5chjiNj3IXIrpbIhUjcZDrBfpas74JiuFV=rjf8UUH=3DsPckzkQIkgMuavhIZrZ21ycDpOag7vNLKXhq7PfDNPQCKuilKRulhSy3ij9DDIZI99cqi9fp7OcFJiB2zfjFAenj_PBq78MMI0ZDLSQ29mmDfrzhZPNFK1t93w3i9WZgIjQDeQBUq80qavQDevID-s7iy6Aj7STFJ63FsY5qTmUPO6MiVjsXBjMPfSmaLlbjaSnp7khFBYSg46ZqrbBT78tf1vmQa5yUyPwE-UR4S4=zN6_TZ669KB4fL=6iy6ZEBSmIa5miefM27LBqNPMDzSQqfS3UAUQhip7gs=322kt2ZPm8XPmixikacSCYMMxpBPQiZitiMjIfcEsp-9Rrz833LYus7k9IakRiVfYiQ0epu7ufLi9DaX5pa0UYiLRDd8jp-9SPsaUDIUR37Sm9ccrDd8_IrPV0JIZImiuTbsagaujULPs9hIjphgjYLi7eUSBezQb2zvjU3=mkzu1Ec=TqfI1m7s3gakhDz5AkUf7I-3BXL0hzKIupak_ffY5qSNfii0t3fj9kbUBL4uwF7kUi_8AjaXhiSgEFurL5svxI3_BF7v1dzPMDfjtTNujFq8Ty37tDc=MlU5Al-N6h7E883SM2I86Cc=r0AUR3I7ug7yRF1kygdzvP1XhgejrdL8Ah7=Ki-hujfirP_7tfuU=oWYfgaYIiaIMfJkifzszkU8PPMjADgpZpc9micIWp46BFM0flQ4Rse0tirZBU-lCWLm3YifrUJshfkXjY7P10_MjkaQMU4cukM_RI-S0lIyBXK5VEKchkZZRFd8TFQXuDZ_rezP3k_ZRlJEfT4utq7XBgf7jUfYmDSy-pabrIz9cFAFRFujhQSNR2ffjjNN=j3EBFBYZq7gwj7j0fL=MII=yhw6BkLY5gs=pgq8Z4mkmtevIjJUBIzk3qM6Q9trSY7uSF7=tIrEBUa8yhIrVYakAOQXSihj5DaR4FJEfU-7sDxg_rKpW9K0rlzvthSi7ptRRULj8WIjj8qURdPIRPT8losP3sLIfhLujp-bjb16bgWYS=A8mDbItmu8kjTISpuL7fqi55ZNcqrjwgavmq=gklkXuDsjti9lZ94D9I-hf9h8Qh3kQJasYqrQaqryRp373iZu7O-khX3YufUjQezmm3L9cFJs7gasuz9YjIZ_jisYugs6cf9kNUMr-DpjNqfjAfssuFAIAIcj3qMctFc7EqLNojTIhCxufBQEfSevQXhTxIz4RgL=eM_1Bh_8tiVM1Uhl7DIIJicELfILfdSXCjB8epcrIjr6mjPFDIsRZpbrBXKS68I7hkMfu3ejsq7kjIdTsCAe9i-vhscEToKJcfII7DdrTFzQL_4Tf8yf75J83JLDjEh8Qp3YjWpsA5zkoic=QFLIjq7E7Spjh8zSZXfEfTIYhjrjtgViKdLiu8swwq7SsNZ6P2Z6ZU3Yfkq8MPbrts9kRlpjLDI5AgUky4KPmfJk2q7=UCAm3DNgEFZ66I-==sqw18pkMfsbjp1zx8BjAMZ6iipfpDyvrJcsLjNNRh1vyi_rRqjkmfS4WSSPf3uaxFyPcjfsjiui3bhXjQWUID9uM61=Ajtjt2aDxFBPMta379s=iqLN=XKSMfushq75Ppu9t0aveDN6MFr1jlbUBqaOcIQkjIchhqf9RuVftLbDcYSptj4QLXh=ti6gA9bYhhauA9KQ382=UizLRIZNf8zRgiuncgLjyFX=iuNZ38s=tFZY7Ib8_DzS-DFDBpcvmSfI7i
ICaI_fhDsjU2y_LF38mInRKFuf7gIjcfbbBf7gQfBS_gMHvKuLRq9sujUD6lsPzF3SwI3YLh9IjIVj1fz6NFBDiFuI1iL=3qf4jleRtIN6yIN6j83BcIpj9IchAX1uu24XSsM=QiumNsJ-RfQkyI-7jCZu75kqxlhNzKcfhHs7ffN6MM9=jia5wu7pAi-FrIbFfgfgAWaQJfrI4gLrRpZfh6Z6ZgSv3iVFBWrlJIijRg=ZcFKkyYMvyoYFu24iuU15ME3EfY7Yr8z0QF-9D5VQbpKS6pNkpfa0QDrP0hfdx9tv0sKY75z06fLPiFJ8oYf5AIbsPU77ut7zuSa7t=3gEF3=qT7=61P=yrhsBFISmM-8xXKSpD1g6dMlff-RbDz_l2aI7I_9_XKvbi-1NfxjjWsrEfTPtL7LzjMffg1ufdrkEIb69FifjPh8AiwvMWsPc0agrDbbRoU7MhLksDy6x5N_fqLPhUu731_WLdrWdpK0aqaVhpruxlI83FcYYiIBEhP8SvUPmF-_Riue4t-ijSsv3qMfMiN_StL8m2aR5qrYtIx013a=tki83iIf57-kNI3SyYj6VXqLcFhS3jbFR3LjSiTUM3_S=Exj6f4u5FQP0kPky2LyRf98y3s0mvaNBPT8j5JEfeasjjrYujakb9BYwDdeujPfwFAmefIUM=3yIF-JRYaaxp_SL21vBF74BpMjrFh1tUS=z0T8bPQgtls=NDsa4l46N5Q86qfL_FLeVEQXrIXutRz1MphpAFTURUaTJg_gQpakcIuEAF1yBqa7f8yiticYuqNjSh7NBX46EE_lbgsER9JYt=hLRYNuWEJUR0auhjilth9ERU3=SX01w2_yRI6mMi_=QEp=ADzvLhq80gLWxidptsB96p-QVlIrokg1mI3tLEKkRfy6NIFpjDev3Y7rZlk=t2z9fsZ_lU3Yupxf5Hk8sl-g6YLItIaX7jSjtUt0niU=LjBEBBb030ujMsa9Rtr=o9XS3FXwwkL=mpx0yeJkjoJ=je-0csLfgFLgCIch58IfhiuaSfUrR9hDQFhS3Io3uYLy8Nyurp-Xm8Bf1fSvfpLEcDUfRldL3gSgcZyNzIqPZpHWf9JPygaS3IUPZ4Jcfx_fjfIjtIiYtaNP35u=wRJ=hWBYfRB=jiU8hvLFUOsj9Pzp9DIDTMa98YNPVDk8eiKeJlzSDs-w32UL-iz1Q59kQDBYMf4NVXv00jsB0iBjHWKh3ix66D-5AR9zbpJeJOvgtkZ_rDz=UtJ-sgaPSCQrfg9URf-srhNi7E3gQIuCjp7Px4K9Rf7YKq7LRFBSQ3gjSDz=3pZvQlXkZkq53jKupjWYtjaOBFZuBqakbDzPfpJI7UbrBlII7PQUZ93jyhf8tq39jXqIfP46LUBYjqifAXJsMfavTpKkVFtjMU-5zEaHxK6jVI-kCFJk6XMrMoSvm5hP9Qf=jPqj3ichIar=ZI-BYUMrM0a3tFxrbi-ktuh8QYa5Rg_bRpKQB2zXlhSjyFM6hIejYR3pMFK6mqim-DUNRYLFUi377pWIuIzkQkqYJXqrBhumME-v1=ZijCcst2JERieyJiaS_iIDVqSXjrKLoFZ696Mjmg2=_ib8rptj9LejZiI8bkMRtKK3rYf=wXJ=Bl4uJFsQRpK_cgLQ6kxffqzERD_D3qLItfKSAlU=V2ITxUK1k5IwiscpJfkp7qUERIL_cFXrRq7XBkz0siuMbhkvQUJixisEajISkp-OYjL_VdLIjq=IjhffA298jnqejsPhf27o4fLQk8RsuD9QQizpMHc=3iBY7pbEcIzDxMyu=p3j9FR1rgNFWqYcOlK66XWjii9EpI9s4lkIHpaX5Fq5Bk4PmqN65R4N4KN6Ak34fpcEM9tfugzPAI3Y7jTUoYTuht7zsIcLcdxf7F3BBXJ=6kIaNp_fJq7XKUSv3DMjmgNPZEhSTiLULgav=XKvjj7jZDzOI0NutX1uuhT8-D1ge8UD70LIDFa4fgSvyF986fA67H-ORiuDCMakQXq8bqffViUjQ94PxpN6t0akQkaJBi7P9kxru3=VaUzS39tjBKeft3a0SqLPjizRh91vQpZ_Ohb8tfhPfF9=3RaSefilRzaEhYavwkrjJYacjqffIqa9rpZNBDsSQqL8LwpjcDIjQiaPTif8tqaOMkXfu8K6329IBFX8BpAUBXKVufU=tFh8Mf3jcIyP4grS6FKiZIV81UbiWyz=3l9IficLjC16MFZNR93WsfzSjf7=fkSX3rK6Zf2=QUBE7RlgVh_=tkiWjYrPZFNih2e5tiLPKXJPMWsvykfiI3SvMiSjTe_Yjlpshgs7Bib=ZlUPtUKXIF-=_6vySKFwLIuj1s4fjiN6HDf8DeXpt9J1bIoW9KIzLt_E7MJIuIr=8qaFcV-kL2zQ6FzCKhLctgbP6q98tUrSQF-1VlKXrDLMfhasCXhS9WzOR63=VFzvxg7v1UL6Vf_PLqfjZdurTFt6fDzRhlcLLrcRrkFvZ
x-5ku220jw-b: 3k9rur
x-5ku220jw-c: A__57lBqAQAACgDSJrJqFbzAqmJVRyggwSLzliwMgtTT0dRA3TtiKWkLmSk8AawUHfD6K-8VrJMAAOfvAAAAAA==
x-5ku220jw-d: 0
x-5ku220jw-uniquestatekey: A-gE71BqAQAAbcLSHXYa-TmPShEFZzL7WRKJL2s2ZrEiksXMK8pm-YimkgmLAawUHfGuclcCwH8AABszAAAAAA==
I don't know where they come from or how I can get them, but the code doesn't work without them (or without one of them). I'm sure about that because I can get a response when I send requests to other URLs on the site that don't need these headers.
So, the questions are:
1. What are these headers?
2. How (if it's possible at all) can I get values for these headers?
My code:
import requests
data_url = 'https://www.southwest.com/api/air-booking/v1/air-booking/page/air/low-fare-calendar/select-dates'
payload = {
"adultPassengersCount": "1",
"currencyCode": "USD",
"departureDate": "2019-04-25",
"destinationAirportCode": "MCO",
"originationAirportCode": "MDW",
"passengerType": "ADULT",
"returnAirportCode": "",
"returnDate": "",
"tripType": "oneway",
"application": "air-low-fare-calendar",
"site": "southwest"
}
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
'x-api-key': 'l7xx944d175ea25f4b9c903a583ea82a1c4c'
}
r = requests.post(data_url, json=payload, headers=headers)
print(r.text)
I was trying to learn web scraping and I am facing a freaky issue... My task is to search Google for news on a topic in a certain date range and count the number of results.
my simple code is
import requests, bs4
payload = {'as_epq': 'James Clark', 'tbs':'cdr:1,cd_min:1/01/2015,cd_max:1/01/2015','tbm':'nws'}
r = requests.get("https://www.google.com/search", params=payload)
soup = bs4.BeautifulSoup(r.text)
elems = soup.select('#resultStats')
print(elems[0].getText())
And the result I get is
About 8,600 results
So apparently all works... apart from the fact that the result is wrong. If I open the URL in Firefox (I can obtain the complete URL with r.url)
https://www.google.com/search?tbm=nws&as_epq=James+Clark&tbs=cdr%3A1%2Ccd_min%3A1%2F01%2F2015%2Ccd_max%3A1%2F01%2F2015
I see that the results are actually only 2, and if I manually download the HTML file, open the page source and search for id="resultStats" I find that the number of results is indeed 2!
Can anybody help me understand why searching for the same id tag in the saved HTML file and in the soup item leads to two different numerical results?
************** UPDATE
It seems that the problem is the custom date range that does not get processed correctly by requests.get. If I use the same URL with selenium I get the correct answer
from selenium import webdriver
driver = webdriver.Firefox()
driver.get(url)
content = driver.page_source
soup = bs4.BeautifulSoup(content)
elems = soup.select('#resultStats')
print(elems[0].getText())
And the answer is
2 results (0.09 seconds)
The problem is that this methodology seems to be more cumbersome because I need to open the page in Firefox...
There are a couple of things causing this issue. First, Google wants the day and month parts of the date in two digits, and it also expects a user-agent string of some popular browser. The following code should work:
import requests, bs4
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
}
payload = {'as_epq': 'James Clark', 'tbs':'cdr:1,cd_min:01/01/2015,cd_max:01/01/2015', 'tbm':'nws'}
r = requests.get("https://www.google.com/search", params=payload, headers=headers)
soup = bs4.BeautifulSoup(r.content, 'html5lib')
print(soup.find(id='resultStats').text)
To add to Vikas' answer, Google will also fail to use 'custom date range' for some user-agents. That is, for certain user-agents, Google will simply search for 'recent' results instead of your specified date range.
I haven't detected a clear pattern in which user-agents will break the custom date range. It seems that including a language is a factor.
Here are some examples of user-agents that break cdr:
Mozilla/5.0 (Windows; U; Windows NT 6.1; fr-FR) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27
Mozilla/4.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/5.0)
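If you want to check which user-agents honor the custom date range, a rough sketch (my own, assuming the #resultStats id from the question is still present in the returned page) is to fire the same query with each user-agent and compare the counts:

import requests, bs4

payload = {'as_epq': 'James Clark',
           'tbs': 'cdr:1,cd_min:01/01/2015,cd_max:01/01/2015',
           'tbm': 'nws'}

user_agents = [
    'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; fr-FR) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27',
]

for ua in user_agents:
    r = requests.get('https://www.google.com/search', params=payload, headers={'User-Agent': ua})
    stats = bs4.BeautifulSoup(r.content, 'html.parser').find(id='resultStats')
    # If the date range is ignored, the count will be far larger than the 2 expected results
    print(ua[:40], '->', stats.text if stats else 'no #resultStats found')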
There's no need for selenium; you're looking for this:
soup.select_one('#result-stats nobr').previous_sibling
# About 10,700,000 results
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": 'James Clark', # query
"hl": "en", # lang
"gl": "us", # country to search from
"tbm": "nws", # news filter
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
# if used without the "nobr" selector and previous_sibling it will return the seconds as well: (0.41 seconds)
number_of_results = soup.select_one('#result-stats nobr').previous_sibling
print(number_of_results)
# About 10,700,000 results
Alternatively, you can achieve the same thing by using Google News Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to figure out which selectors will get the job done, or why some of them don't return the data they should, bypass blocks from search engines, or maintain the code over time.
Instead, you only need to iterate over structured JSON and get the data you want, fast.
Code to integrate for your case:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": 'James Clark',
"tbm": "nws",
"gl": "us",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
number_of_results = results['search_information']['total_results']
print(number_of_results)
# 14300000
Disclaimer, I work for SerpApi.