Unable to fetch tabular content from a site using requests - python

I'm trying to fetch tabular content from a webpage using the requests module. After navigating to that webpage, when I manually type 0466425389 right next to Company number and hit the search button, the table is produced accordingly. However, when I mimic the same using requests, I get the following response.
<?xml version='1.0' encoding='UTF-8'?>
<partial-response><redirect url="/bc9/web/catalog"></redirect></partial-response>
I've tried with:
import requests
link = 'https://cri.nbb.be/bc9/web/catalog?execution=e1s1'
payload = {
    'javax.faces.partial.ajax': 'true',
    'javax.faces.source': 'page_searchForm:actions:0:button',
    'javax.faces.partial.execute': 'page_searchForm',
    'javax.faces.partial.render': 'page_searchForm page_listForm pageMessagesId',
    'page_searchForm:actions:0:button': 'page_searchForm:actions:0:button',
    'page_searchForm': 'page_searchForm',
    'page_searchForm:j_id3:generated_number_2_component': '0466425389',
    'page_searchForm:j_id3:generated_name_4_component': '',
    'page_searchForm:j_id3:generated_address_zipCode_6_component': '',
    'page_searchForm:j_id3_activeIndex': '0',
    'page_searchForm:j_id2_stateholder': 'panel_param_visible;',
    'page_searchForm:j_idt133_stateholder': 'panel_param_visible;',
    'javax.faces.ViewState': 'e1s1'
}
headers = {
    'Faces-Request': 'partial/ajax',
    'X-Requested-With': 'XMLHttpRequest',
    'Origin': 'https://cri.nbb.be',
    'Accept': 'application/xml, text/xml, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Host': 'cri.nbb.be',
    'Referer': 'https://cri.nbb.be/bc9/web/catalog?execution=e1s1'
}
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    s.get(link)
    s.headers.update(headers)
    res = s.post(link, data=payload)
    print(res.text)
How can I fetch tabular content from that site using requests?

Looking at the "action" attribute on the search form, the form appears to generate a new JSESSIONID every time it is opened, and this seems to be required. I had some success by including it in the URL.
You don't need to set any headers explicitly other than the User-Agent.
I added some code: (a) to pull out the "action" attribute of the form using BeautifulSoup - you could do this with a regex if you prefer (see the sketch after the result below) - and (b) to get the URL from the redirection XML that you showed at the top of your question.
import re
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
...
with requests.Session() as s:
    s.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
    # GET to get search form
    req1 = s.get(link)
    # Get the form action
    soup = BeautifulSoup(req1.text, "lxml")
    form = soup.select_one("#page_searchForm")
    form_action = urljoin(link, form["action"])
    # POST the form
    req2 = s.post(form_action, data=payload)
    # Extract the target from the redirection xml response
    target = re.search('url="(.*?)"', req2.text).group(1)
    # Final GET to get the search result
    req3 = s.get(urljoin(link, target))
    # Parse and print (some of) the result
    soup = BeautifulSoup(req3.text, "lxml").body
    for detail in soup.select(".company-details tr"):
        columns = detail.select("td")
        if columns:
            print(f"{columns[0].text.strip()}: {columns[1].text.strip()}")
Result:
Company number: 0466.425.389
Name: A en B PARTNERS
Address: Quai de Willebroeck 37
: BE 1000 Bruxelles
Municipality code NIS: 21004 Bruxelles
Legal form: Cooperative company with limited liability
Legal situation: Normal situation
Activity code (NACE-BEL)
The activity code of the company is the statistical activity code in use on the date of consultation, given by the CBSO based on the main activity codes available at the Crossroads Bank for Enterprises and supplementary informations collected from the companies: 69201 - Accountants and fiscal advisors
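As mentioned above, the form's "action" attribute could also be pulled out with a regex instead of BeautifulSoup. A minimal sketch, assuming the id attribute appears before action on the form tag (which may not hold if the markup changes):
import re
from urllib.parse import urljoin
import requests

link = "https://cri.nbb.be/bc9/web/catalog?execution=e1s1"

with requests.Session() as s:
    s.headers["User-Agent"] = "Mozilla/5.0"
    html = s.get(link).text
    # Look for the action attribute on the tag whose id is page_searchForm
    m = re.search(r'id="page_searchForm"[^>]*\baction="([^"]+)"', html)
    if m:
        form_action = urljoin(link, m.group(1))
        print(form_action)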

I don't think requests alone can handle dynamic web pages like this one, so I used helium (a Selenium wrapper) and pandas to do the work.
import helium as he
import pandas as pd

url = 'https://cri.nbb.be/bc9/web/catalog?execution=e1s1'
driver = he.start_chrome(url)
he.write('0466425389', into='Company number')
he.click('Search')
he.wait_until(he.Button('New search').exists)
he.select(he.ComboBox('10'), '100')
he.wait_until(he.Button('New search').exists)
with open('download.html', 'w') as html:
    html.write(driver.page_source)
he.kill_browser()

# pd.read_html returns a list of all tables found in the page
df = pd.read_html('download.html')
df[2]
Output: (not reproduced here)

Related

How to download a CSV link in a Python program

The website is "https://www.nseindia.com/companies-listing/corporate-filings-announcements". A friend sent me the underlying link that downloads data between two dates as a CSV file: "https://www.nseindia.com/api/corporate-announcements?index=equities&from_date=14-01-2022&to_date=20-01-2022&csv=true%27"
This link works fine in a web browser.
First, can someone explain how he got this link, or rather how I can find such a link myself?
Second, I am unable to read the CSV file from this link into a data frame in Python. Maybe there is an issue with %27 or something else. The code is:
import pandas as pd

csv_url = 'https://www.nseindia.com/api/corporate-announcements?index=equities&from_date=14-01-2022&to_date=15-01-2022&csv=true%27'
df = pd.read_csv(csv_url)
print(df.head())
Use the wget package (wget.py):
import wget

DATA_URL = 'http://www.robots.ox.ac.uk/~ankush/data.tar.gz'
# DATA_URL = '/home/xxx/book/data.tar.gz'  # second assignment in the original snippet; would override the URL above
out_fname = 'abc.tar.gz'
wget.download(DATA_URL, out=out_fname)
Okay, so for this issue, you first need to request the main NSE website with the headers mentioned in this post; once you hit the main website, you get some cookies in your session, with which you can then hit your desired URL. To convert that URL's data into a pandas-compatible string, I followed this answer.
Make sure to include the custom user agent in the headers, else the request will fail.
import pandas as pd
import io
import requests

base_url = 'https://www.nseindia.com'
session = requests.Session()
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, '
                  'like Gecko) '
                  'Chrome/80.0.3987.149 Safari/537.36',
    'accept-language': 'en,gu;q=0.9,hi;q=0.8',
    'accept-encoding': 'gzip, deflate, br'}

# Hit the main site first so the session picks up the required cookies
r = session.get(base_url, headers=headers, timeout=5)
cookies = dict(r.cookies)

response = session.get('https://www.nseindia.com/api/corporate-announcements?index=equities&from_date=14-01-2022&to_date=20-01-2022&csv=true', timeout=5, headers=headers)
content = response.content
df = pd.read_csv(io.StringIO(content.decode('utf-8')))
print(df.head())

Can't grab tabular content from a webpage using requests

I'm trying to scrape tabular content from this webpage. To locate the content, it is necessary to click on the 12 tab under the title How to navigate the interactive report. Upon clicking on that tab, the tabular content shows up at the bottom of the page under Moves To Austin-Round Rock-Georgetown, TX MSA.
When I observe the network activity in chrome dev tools while populating the data manually, I can see that a POST request with the appropriate parameters is sent to this url https://public.tableau.com/vizql/w/CBREMigrationAnalysisv1extract/v/CBREMigrationAnalysis/sessions/F3E2227B603E4F5AB3156667A673CF9E-0:0/commands/tabdoc/set-active-story-point, in which the portion between /sessions/ and /commands/ is dynamic.
I have been able to fetch that dynamic portion on the fly before sending the POST request. However, when I try with the following script, I get a 500 status code.
I've tried with:
import requests
from bs4 import BeautifulSoup
from pprint import pprint
base = 'https://public.tableau.com/views/CBREMigrationAnalysisv1extract/CBREMigrationAnalysis?:showVizHome=no&:embed=true&parentUrl=https%3A%2F%2Fwww.cbre.us%2Fresearch-and-reports%2FCOVID-19-Impact-on-Resident-Migration-Patterns'
link = 'https://public.tableau.com/vizql/w/CBREMigrationAnalysisv1extract/v/CBREMigrationAnalysis/sessions/{}/commands/tabdoc/set-active-story-point'
payload = {
    'storyboard': 'CBRE Migration Analysis',
    'storyPointId': '14',
    'shouldAutoCapture': 'false',
    'shouldAutoRevert': 'true'
}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest',
    'x-newrelic-id': 'XA4CV19WGwIBV1RVBQQBUA==',
    'x-tsi-active-tab': 'CBRE%20Migration%20Analysis',
    'x-tsi-supports-accepted': 'true',
    'referer': base,
}
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    r = s.get(base)
    post_link = link.format(r.headers['X-Session-Id'])
    s.headers.update(headers)
    res = s.post(post_link, data=payload)
    print(res.status_code)
    pprint(res.json()['vqlCmdResponse']['layoutStatus']['applicationPresModel'])
How can I access tabular content from that page using requests?
I've just implemented the storypoints feature in the Tableau Scraper library. Check out the storypoint section.
The following code shows all the storypoints and goes to the storypoint with id 14 (equivalent to the storypoint with caption 12 in the UI). It then gets the worksheet named P2P Table into a pandas dataframe:
from tableauscraper import TableauScraper as TS
url = 'https://public.tableau.com/views/CBREMigrationAnalysisv1extract/CBREMigrationAnalysis'
ts = TS()
ts.loads(url)
wb = ts.getWorkbook()
print(wb.getStoryPoints())
print("go to specific storypoint")
sp = wb.goToStoryPoint(storyPointId=14)
print(sp.getWorksheetNames())
print(sp.getWorksheet("P2P Table").data)
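Since the data attribute above is a regular pandas dataframe, it can be inspected or saved like any other frame; a small usage sketch (the output filename is just an example):
df = sp.getWorksheet("P2P Table").data
print(df.head())
df.to_csv("p2p_table.csv", index=False)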
Try this on repl.it

Scraping Data from .ASPX Website URL with Python

I have a static .aspx URL that I am trying to scrape. All of my attempts yield the raw HTML of the regular website instead of the data I am querying for.
My understanding is that the headers I am using (which I found in another post) are correct and generalizable:
import urllib.request
from bs4 import BeautifulSoup
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}

class MyOpener(urllib.request.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'
myopener = MyOpener()
url = 'https://www.mytaxcollector.com/trSearch.aspx'
# first HTTP request without form data
f = myopener.open(url)
soup_dummy = BeautifulSoup(f,"html5lib")
# parse and retrieve two vital form values
viewstate = soup_dummy.select("#__VIEWSTATE")[0]['value']
viewstategen = soup_dummy.select("#__VIEWSTATEGENERATOR")[0]['value']
Trying to enter the form data causes nothing to happen:
formData = (
    ('__VIEWSTATE', viewstate),
    ('__VIEWSTATEGENERATOR', viewstategen),
    ('ctl00_contentHolder_trSearchCharactersAPN', '631091430000'),
    ('__EVENTTARGET', 'ct100$MainContent$calculate')
)
encodedFields = urllib.parse.urlencode(formData)
# second HTTP request with form data
f = myopener.open(url, encodedFields)
soup = BeautifulSoup(f,"html5lib")
trans_emissions = soup.find("span", id="ctl00_MainContent_transEmissions")
print(trans_emissions.text)
This gives raw HTML code almost exactly the same as the "soup_dummy" variable. But what I want to see is the data for the field ('ctl00_contentHolder_trSearchCharactersAPN', '631091430000') being submitted (this is the "parcel number" box).
I would really appreciate the help. If anything, linking me to a good post about HTML requests (one that not only explains but actually walks through scraping aspx) would be great.
To get the result using the parcel number, your parameters have to be somewhat different from what you have already tried. Moreover, you have to use this url https://www.mytaxcollector.com/trSearchProcess.aspx to send the POST request.
Working code:
from urllib.request import Request, urlopen
from urllib.parse import urlencode
from bs4 import BeautifulSoup
url = 'https://www.mytaxcollector.com/trSearchProcess.aspx'
payload = {
    'hidRedirect': '',
    'hidGotoEstimate': '',
    'txtStreetNumber': '',
    'txtStreetName': '',
    'cboStreetTag': '(Any Street Tag)',
    'cboCommunity': '(Any City)',
    'txtParcelNumber': '0108301010000',  # your search term
    'txtPropertyID': '',
    'ctl00$contentHolder$cmdSearch': 'Search'
}

data = urlencode(payload)
data = data.encode('ascii')
req = Request(url, data)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36')
res = urlopen(req)
soup = BeautifulSoup(res.read(), 'html.parser')

for items in soup.select("table.propInfoTable tr"):
    data = [item.get_text(strip=True) for item in items.select("td")]
    print(data)

Python script to download file from button on website

I want to download an xls file by clicking the button "Export to excel" from the following url: https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD.
More specifically, the button: name = "ctl00$MainContent$btndata". I've already been able to do this using selenium, but I plan on building a docker image with this script and running it as a docker container, because this xls is regularly updated and I need the most current data on my local machine, and it doesn't make sense to have a browser open that often to fetch this data. I understand there are headless versions of chrome and firefox, although I don't believe they support downloads. Also, I understand that a plain HTTP GET will not work in this situation because the button is not a static link to the resource. Maybe there's a completely different approach for downloading and updating this data on my computer?
import urllib
import requests
from bs4 import BeautifulSoup
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=.08',
    'Origin': 'https://www.tampagov.net',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Referer': 'https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD',
    'Accept-Encoding': 'gzip,deflate,br',
    'Accept-Language': 'en-US,en;q=0.5',
}

class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'
myopener = MyOpener()
url = 'https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD'
# first HTTP request without form data
f = myopener.open(url)
soup = BeautifulSoup(f, "html.parser")
# parse and retrieve two vital form values
viewstate = soup.select("#__VIEWSTATE")[0]['value']
eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']
formData = (
    ('__EVENTVALIDATION', eventvalidation),
    ('__VIEWSTATE', viewstate),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
    ('Accept-Encoding', 'gzip, deflate, br'),
    ('Accept-Language', 'en-US,en;q=0.5'),
    ('Host', 'apps,tampagov.net'),
    ('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0'))
payload = urllib.urlencode(formData)
# second HTTP request with form data
r = requests.post("https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD", params=payload)
print(r.status_code, r.reason)
First: I removed import urllib because requests is enough.
Some issues you have:
You don't need to build a nested tuple and then apply urllib.urlencode; use a dictionary instead - that is one reason why requests is so popular.
You'd better populate all the parameters for the HTTP POST request, as I did below; otherwise, the request may be rejected by the backend.
I added some simple code to save the content locally.
PS: for those form parameters, you can get their values by analysing the HTML returned by the initial HTTP GET (see the sketch below). You can also customize the parameters as you need, like the page size etc.
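For reference, one way to avoid hard-coding values such as __VIEWSTATE is to collect every hidden input from the GET response and build the POST payload from that. This is only a rough sketch of that idea, not the working sample below; the non-hidden fields still have to be added by hand:
import requests
from bs4 import BeautifulSoup

url = "https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD"

with requests.Session() as s:
    s.headers["User-Agent"] = "Mozilla/5.0"
    soup = BeautifulSoup(s.get(url).text, "html.parser")

    # Every hidden input (e.g. __VIEWSTATE, __EVENTVALIDATION) goes into the payload as-is.
    payload = {
        tag["name"]: tag.get("value", "")
        for tag in soup.select("input[type=hidden]")
        if tag.get("name")
    }
    # Then add the visible fields you actually want to submit.
    payload["ctl00$MainContent$btndata"] = "Export to Excel"

    r = s.post(url, data=payload)
    print(r.status_code, len(r.content))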
Below is a working sample:
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
def downloadExcel():
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=.08',
        'Origin': 'https://www.tampagov.net',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Referer': 'https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD',
        'Accept-Encoding': 'gzip,deflate,br',
        'Accept-Language': 'en-US,en;q=0.5',
    }
    r = requests.get("https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD", headers=headers)
    # parse and retrieve two vital form values
    if not r.status_code == 200:
        print('Error')
        return
    soup = BeautifulSoup(r.content, "html.parser")
    viewstate = soup.select("#__VIEWSTATE")[0]['value']
    eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']
    print('__VIEWSTATE:', viewstate)
    print('__EVENTVALIDATION:', eventvalidation)
    formData = {
        '__EVENTVALIDATION': eventvalidation,
        '__VIEWSTATE': viewstate,
        '__EVENTTARGET': '',
        '__EVENTARGUMENT': '',
        '__VIEWSTATEGENERATOR': '49DF2C80',
        'MainContent_RadScriptManager1_TSM': """;;System.Web.Extensions, Version=4.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35:en-US:59e0a739-153b-40bd-883f-4e212fc43305:ea597d4b:b25378d2;Telerik.Web.UI, Version=2015.2.826.40, Culture=neutral, PublicKeyToken=121fae78165ba3d4:en-US:c2ba43dc-851e-4009-beab-3032480b6a4b:16e4e7cd:f7645509:24ee1bba:c128760b:874f8ea2:19620875:4877f69a:f46195d3:92fe8ea0:fa31b949:490a9d4e:bd8f85e4:58366029:ed16cbdc:2003d0b8:88144a7a:1e771326:aa288e2d:b092aa46:7c926187:8674cba1:ef347303:2e42e72a:b7778d6c:c08e9f8a:e330518b:c8618e41:e4f8f289:1a73651d:16d8629e:59462f1:a51ee93e""",
        'search_block_form': '',
        'ctl00$MainContent$btndata': 'Export to Excel',
        'ctl00_MainContent_RadWindow1_C_RadGridVehicles_ClientState': '',
        'ctl00_MainContent_RadWindow1_ClientState': '',
        'ctl00_MainContent_RadWindowManager1_ClientState': '',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl00$PageSizeComboBox': '20',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl00_PageSizeComboBox_ClientState': '',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RDIPFdispatch_time': '',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RDIPFdispatch_time$dateInput': '',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RDIPFdispatch_time_dateInput_ClientState': '{"enabled":true,"emptyMessage":"","validationText":"","valueAsString":"","minDateStr":"1900-01-01-00-00-00","maxDateStr":"2099-12-31-00-00-00","lastSetTextBoxValue":""}',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RDIPFdispatch_time_ClientState': '{"minDateStr":"1900-01-01-00-00-00","maxDateStr":"2099-12-31-00-00-00"}',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RadComboBox1address': '',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RadComboBox1address_ClientState': '',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RadComboBox1case_description': '',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RadComboBox1case_description_ClientState': '',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$FilterTextBox_grid': '',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$RadComboBox1report_number': '',
        'ctl00_MainContent_RadGrid1_ctl00_ctl02_ctl02_RadComboBox1report_number_ClientState': '',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$FilterTextBox_out_max_date': '',
        'ctl00$MainContent$RadGrid1$ctl00$ctl02$ctl02$FilterTextBox_out_rowcount': '',
        'ctl00$MainContent$RadGrid1$ctl00$ctl03$ctl01$PageSizeComboBox': '20',
        'ctl00_MainContent_RadGrid1_ctl00_ctl03_ctl01_PageSizeComboBox_ClientState': '',
        'ctl00_MainContent_RadGrid1_rfltMenu_ClientState': '',
        'ctl00_MainContent_RadGrid1_gdtcSharedTimeView_ClientState': '',
        'ctl00_MainContent_RadGrid1_gdtcSharedCalendar_SD': '[]',
        'ctl00_MainContent_RadGrid1_gdtcSharedCalendar_AD': '[[1900,1,1],[2099,12,31],[2018,3,29]]',
        'ctl00_MainContent_RadGrid1_ClientState': '',
    }
    # second HTTP request with form data
    r = requests.post("https://apps.tampagov.net/CallsForService_Webapp/Default.aspx?type=TPD", data=formData, headers=headers)
    print('received:', r.status_code, len(r.content))
    with open(r"C:\Users\xxx\Desktop\test\test\apps.xls", "wb") as handle:
        for data in tqdm(r.iter_content()):
            handle.write(data)

downloadExcel()
Find out the URL you need to fetch as #Sphinx explains, and then simulate it using something similar to:
import urllib.request
import urllib.parse
data = urllib.parse.urlencode({...})
data = data.encode('ascii')
with urllib.request.urlopen("http://...", data) as fd:
    print(fd.read().decode('utf-8'))
Take a look at the documentation of urllib.

Google search with the Python requests library

(I've tried looking but all of the other answers seem to be using urllib2)
I've just started trying to use requests, but I'm still not very clear on how to send or request something additional from the page. For example, I'll have
import requests
r = requests.get('http://google.com')
but I have no idea how to now, for example, do a google search using the search bar presented. I've read the quickstart guide but I'm not very familiar with HTML POST and the like, so it hasn't been very helpful.
Is there a clean and elegant way to do what I am asking?
Request Overview
The Google search request is a standard HTTP GET command. It includes a collection of parameters relevant to your queries. These parameters are included in the request URL as name=value pairs separated by ampersand (&) characters. Parameters include data like the search query and a unique CSE ID (cx) that identifies the CSE that is making the HTTP request. The WebSearch or Image Search service returns XML results in response to your HTTP requests.
First, you must get your CSE ID (the cx parameter) from the Control Panel of the Custom Search Engine.
Then, See the official Google Developers site for Custom Search.
There are many examples like this:
http://www.google.com/search?
start=0
&num=10
&q=red+sox
&cr=countryCA
&lr=lang_fr
&client=google-csbe
&output=xml_no_dtd
&cx=00255077836266642015:u-scht7a-8i
The list of parameters you can use is explained there.
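As a rough sketch, the same kind of request can be sent from Python with requests by passing those name=value pairs as a params dictionary; the cx below is just the placeholder value from the example URL above, not a working CSE ID:
import requests

params = {
    "q": "red sox",
    "start": 0,
    "num": 10,
    "cr": "countryCA",
    "lr": "lang_fr",
    "client": "google-csbe",
    "output": "xml_no_dtd",
    "cx": "00255077836266642015:u-scht7a-8i",  # replace with your own CSE ID
}

r = requests.get("http://www.google.com/search", params=params)
print(r.status_code)
print(r.text[:500])  # the service returns XML results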
import requests
from bs4 import BeautifulSoup
headers_Get = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}

def google(q):
    s = requests.Session()
    q = '+'.join(q.split())
    url = 'https://www.google.com/search?q=' + q + '&ie=utf-8&oe=utf-8'
    r = s.get(url, headers=headers_Get)

    soup = BeautifulSoup(r.text, "html.parser")
    output = []
    for searchWrapper in soup.find_all('h3', {'class': 'r'}):  # this line may change in future based on google's web page structure
        url = searchWrapper.find('a')["href"]
        text = searchWrapper.find('a').text.strip()
        result = {'text': text, 'url': url}
        output.append(result)

    return output
This will return a list of Google results in {'text': text, 'url': url} format. The top result URL would be google('search query')[0]['url'].
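For example, a quick usage sketch of the function above (the query is arbitrary):
results = google('red sox')
for item in results:
    print(item['text'], '->', item['url'])

# Top result URL, as described above
print(results[0]['url'])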
Input:
import requests

def googleSearch(query):
    with requests.session() as c:
        url = 'https://www.google.co.in'
        query = {'q': query}
        urllink = requests.get(url, params=query)
        print(urllink.url)

googleSearch('Linkin Park')
Output:
https://www.google.co.in/?q=Linkin+Park
The readable way to send a request with many query parameters would be to pass URL parameters as a dictionary:
params = {
    'q': 'minecraft',  # search query
    'gl': 'us',        # country where to search from
    'hl': 'en',        # language
}
requests.get('URL', params=params)
But, in order to get the actual response (output/text/data) that you see in the browser, you need to send additional headers, most importantly a user-agent: a user-agent string that makes the bot or script announce itself as a "real" browser client instead of an automated one.
The reason your request might be blocked is that the default requests user agent is python-requests, and websites recognize that. Check what your user agent is.
You can read more about it in the blog post I wrote about how to reduce the chance of being blocked while web scraping.
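For instance, a quick way to see the default user agent that requests would send:
import requests

# Prints something like 'python-requests/2.28.1', which many sites treat as a bot.
print(requests.utils.default_headers()['User-Agent'])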
Pass user-agent:
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}
requests.get('URL', headers=headers)
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}
params = {
    'q': 'minecraft',
    'gl': 'us',
    'hl': 'en',
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    print(title, link, sep='\n')
Alternatively, you can achieve the same thing by using Google Organic API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to create it from scratch and maintain it.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "tesla",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result['title'])
    print(result['link'])
Disclaimer: I work for SerpApi.
The following code uses bs4 to get all the h3 elements and print their text:
# Import the beautifulsoup
# and request libraries of python.
import requests
import bs4
# Make two strings with default google search URL
# 'https://google.com/search?q=' and
# our customized search keyword.
# Concatenate them
text= "c++ linear search program"
url = 'https://google.com/search?q=' + text
# Fetch the URL data using requests.get(url),
# store it in a variable, request_result.
request_result=requests.get( url )
# Creating soup from the fetched request
soup = bs4.BeautifulSoup(request_result.text,"html.parser")
filter = soup.find_all("h3")
for i in range(0, len(filter)):
    print(filter[i].get_text())
You can use webbrowser; I think it doesn't get easier than that:
import webbrowser
query = input('Enter your query: ')
webbrowser.open(f'https://google.com/search?q={query}')
