I'm trying to scrape tabular content from this webpage. To locate the content, it is necessary to click the tab labeled 12 under the title "How to navigate the interactive report". Upon clicking that tab, the tabular content shows up at the bottom of the page under "Moves To Austin-Round Rock-Georgetown, TX MSA".
When I observe the network activity in Chrome dev tools while populating the data manually, I can see that a POST request with the appropriate parameters is sent to this URL https://public.tableau.com/vizql/w/CBREMigrationAnalysisv1extract/v/CBREMigrationAnalysis/sessions/F3E2227B603E4F5AB3156667A673CF9E-0:0/commands/tabdoc/set-active-story-point, in which the portion between /sessions/ and /commands/ is dynamic.
I have been able to fetch that dynamic portion on the fly (from the X-Session-Id response header) before sending the POST request. However, when I run the following script, I get a 500 status code.
I've tried with:
import requests
from bs4 import BeautifulSoup
from pprint import pprint
base = 'https://public.tableau.com/views/CBREMigrationAnalysisv1extract/CBREMigrationAnalysis?:showVizHome=no&:embed=true&parentUrl=https%3A%2F%2Fwww.cbre.us%2Fresearch-and-reports%2FCOVID-19-Impact-on-Resident-Migration-Patterns'
link = 'https://public.tableau.com/vizql/w/CBREMigrationAnalysisv1extract/v/CBREMigrationAnalysis/sessions/{}/commands/tabdoc/set-active-story-point'
payload = {
    'storyboard': 'CBRE Migration Analysis',
    'storyPointId': '14',
    'shouldAutoCapture': 'false',
    'shouldAutoRevert': 'true'
}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest',
    'x-newrelic-id': 'XA4CV19WGwIBV1RVBQQBUA==',
    'x-tsi-active-tab': 'CBRE%20Migration%20Analysis',
    'x-tsi-supports-accepted': 'true',
    'referer': base,
}
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    r = s.get(base)
    post_link = link.format(r.headers['X-Session-Id'])
    s.headers.update(headers)
    res = s.post(post_link, data=payload)
    print(res.status_code)
    pprint(res.json()['vqlCmdResponse']['layoutStatus']['applicationPresModel'])
How can I access tabular content from that page using requests?
I've just implemented the storypoints feature in the Tableau Scraper library. Check out the storypoints section.
The following code shows all the storypoints, then goes to the storypoint with id 14 (equivalent to the storypoint with caption 12 in the UI), and finally loads the worksheet named P2P Table into a pandas dataframe:
from tableauscraper import TableauScraper as TS
url = 'https://public.tableau.com/views/CBREMigrationAnalysisv1extract/CBREMigrationAnalysis'
ts = TS()
ts.loads(url)
wb = ts.getWorkbook()
print(wb.getStoryPoints())
print("go to specific storypoint")
sp = wb.goToStoryPoint(storyPointId=14)
print(sp.getWorksheetNames())
print(sp.getWorksheet("P2P Table").data)
Try this on repl.it
The website is https://www.nseindia.com/companies-listing/corporate-filings-announcements. A friend sent me the underlying link that downloads the data between two dates as a CSV file: https://www.nseindia.com/api/corporate-announcements?index=equities&from_date=14-01-2022&to_date=20-01-2022&csv=true%27
This link works fine in a web browser.
First, can someone explain how he got this link, or rather how I can find such a link myself?
Second, I am unable to read the CSV file into a dataframe from this link in Python. Maybe there is some issue with the %27, or something else. The code is:
import pandas as pd

csv_url = 'https://www.nseindia.com/api/corporate-announcements?index=equities&from_date=14-01-2022&to_date=15-01-2022&csv=true%27'
df = pd.read_csv(csv_url)
print(df.head())
Use the wget module:
import wget

DATA_URL = 'http://www.robots.ox.ac.uk/~ankush/data.tar.gz'
# DATA_URL = '/home/xxx/book/data.tar.gz'
out_fname = 'abc.tar.gz'
wget.download(DATA_URL, out=out_fname)
Okay, so for this issue you first need to request the main NSE website with the headers mentioned in this post; once you hit the main website, you get some cookies in your session, using which you can hit your desired URL. To convert that URL's data into a pandas-compatible string, I followed this answer.
Make sure to have the custom user agent in the headers, else it will fail.
import pandas as pd
import io
import requests

base_url = 'https://www.nseindia.com'
session = requests.Session()
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
    'accept-language': 'en,gu;q=0.9,hi;q=0.8',
    'accept-encoding': 'gzip, deflate, br'
}
# Hit the main site first so the session collects the required cookies
r = session.get(base_url, headers=headers, timeout=5)
# The session now holds the cookies and resends them automatically
response = session.get('https://www.nseindia.com/api/corporate-announcements?index=equities&from_date=14-01-2022&to_date=20-01-2022&csv=true', timeout=5, headers=headers)
df = pd.read_csv(io.StringIO(response.content.decode('utf-8')))
print(df.head())
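One more detail from the question: the trailing %27 on the copied link is a URL-encoded apostrophe, and it corrupts the final csv=true parameter, which is another reason pd.read_csv on that exact URL can misbehave. A quick standard-library check:

```python
from urllib.parse import unquote

csv_url = ('https://www.nseindia.com/api/corporate-announcements'
           '?index=equities&from_date=14-01-2022&to_date=15-01-2022&csv=true%27')

# '%27' decodes to a stray apostrophe stuck onto the last parameter
print(unquote('%27'))  # '
# Strip it before requesting, so the parameter really is csv=true
clean_url = csv_url.removesuffix('%27')
print(clean_url.endswith('csv=true'))  # True
```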
I'm trying to fetch tabular content from a webpage using the requests module. After navigating to that webpage, when I manually type 0466425389 into the box next to Company number and hit the search button, the table is produced accordingly. However, when I mimic the same with requests, I get the following response.
<?xml version='1.0' encoding='UTF-8'?>
<partial-response><redirect url="/bc9/web/catalog"></redirect></partial-response>
I've tried with:
import requests
link = 'https://cri.nbb.be/bc9/web/catalog?execution=e1s1'
payload = {
    'javax.faces.partial.ajax': 'true',
    'javax.faces.source': 'page_searchForm:actions:0:button',
    'javax.faces.partial.execute': 'page_searchForm',
    'javax.faces.partial.render': 'page_searchForm page_listForm pageMessagesId',
    'page_searchForm:actions:0:button': 'page_searchForm:actions:0:button',
    'page_searchForm': 'page_searchForm',
    'page_searchForm:j_id3:generated_number_2_component': '0466425389',
    'page_searchForm:j_id3:generated_name_4_component': '',
    'page_searchForm:j_id3:generated_address_zipCode_6_component': '',
    'page_searchForm:j_id3_activeIndex': '0',
    'page_searchForm:j_id2_stateholder': 'panel_param_visible;',
    'page_searchForm:j_idt133_stateholder': 'panel_param_visible;',
    'javax.faces.ViewState': 'e1s1'
}
headers = {
    'Faces-Request': 'partial/ajax',
    'X-Requested-With': 'XMLHttpRequest',
    'Accept': 'application/xml, text/xml, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Host': 'cri.nbb.be',
    'Origin': 'https://cri.nbb.be',
    'Referer': 'https://cri.nbb.be/bc9/web/catalog?execution=e1s1'
}
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    s.get(link)
    s.headers.update(headers)
    res = s.post(link, data=payload)
    print(res.text)
How can I fetch tabular content from that site using requests?
From looking at the "action" attribute on the search form, the form appears to generate a new JSESSIONID every time it is opened, and this seems to be a required attribute. I had some success by including this in the URL.
You don't need to explicitly set the headers other than the User-Agent.
I added some code: (a) to pull out the "action" attribute of the form using BeautifulSoup (you could do this with regex if you prefer), and (b) to get the URL from the redirection XML that you showed at the top of your question.
import re
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
...
with requests.Session() as s:
    s.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
    # GET to fetch the search form
    req1 = s.get(link)
    # Get the form action
    soup = BeautifulSoup(req1.text, "lxml")
    form = soup.select_one("#page_searchForm")
    form_action = urljoin(link, form["action"])
    # POST the form
    req2 = s.post(form_action, data=payload)
    # Extract the target from the redirection XML response
    target = re.search('url="(.*?)"', req2.text).group(1)
    # Final GET to fetch the search result
    req3 = s.get(urljoin(link, target))
    # Parse and print (some of) the result
    soup = BeautifulSoup(req3.text, "lxml").body
    for detail in soup.select(".company-details tr"):
        columns = detail.select("td")
        if columns:
            print(f"{columns[0].text.strip()}: {columns[1].text.strip()}")
Result:
Company number: 0466.425.389
Name: A en B PARTNERS
Address: Quai de Willebroeck 37
: BE 1000 Bruxelles
Municipality code NIS: 21004 Bruxelles
Legal form: Cooperative company with limited liability
Legal situation: Normal situation
Activity code (NACE-BEL)
The activity code of the company is the statistical activity code in use on the date of consultation, given by the CBSO based on the main activity codes available at the Crossroads Bank for Enterprises and supplementary informations collected from the companies: 69201 - Accountants and fiscal advisors
I don't think requests can handle dynamic web pages (content rendered by JavaScript). I used helium and pandas to do the work.
import helium as he
import pandas as pd
url = 'https://cri.nbb.be/bc9/web/catalog?execution=e1s1'
driver = he.start_chrome(url)
he.write('0466425389', into='Company number')
he.click('Search')
he.wait_until(he.Button('New search').exists)
he.select(he.ComboBox('10'), '100')
he.wait_until(he.Button('New search').exists)
with open('download.html', 'w') as html:
html.write(driver.page_source)
he.kill_browser()
df = pd.read_html('download.html')
df[2]
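Note that pd.read_html returns a list of DataFrames, one per &lt;table&gt; element found in the page, which is why the result is indexed as df[2]. A tiny offline illustration (requires lxml or html5lib as the underlying parser):

```python
import io
import pandas as pd

html = '''
<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>
<table><tr><th>x</th></tr><tr><td>9</td></tr></table>
'''
# read_html parses every <table> it finds and returns a list of DataFrames
tables = pd.read_html(io.StringIO(html))
print(len(tables))        # 2
print(tables[0]['a'][0])  # 1
```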
I have a static .aspx url that I am trying to scrape. All of my attempts yield the raw html data of the regular website instead of the data I am querying.
My understanding is the headers I am using (which I found from another post) are correct and generalizable:
import urllib.request
from bs4 import BeautifulSoup
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}

class MyOpener(urllib.request.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'

myopener = MyOpener()
url = 'https://www.mytaxcollector.com/trSearch.aspx'
# first HTTP request without form data
f = myopener.open(url)
soup_dummy = BeautifulSoup(f, "html5lib")
# parse and retrieve two vital form values
viewstate = soup_dummy.select("#__VIEWSTATE")[0]['value']
viewstategen = soup_dummy.select("#__VIEWSTATEGENERATOR")[0]['value']
Trying to enter the form data causes nothing to happen:
formData = (
    ('__VIEWSTATE', viewstate),
    ('__VIEWSTATEGENERATOR', viewstategen),
    ('ctl00_contentHolder_trSearchCharactersAPN', '631091430000'),
    ('__EVENTTARGET', 'ct100$MainContent$calculate')
)
encodedFields = urllib.parse.urlencode(formData)
# second HTTP request with form data
f = myopener.open(url, encodedFields)
soup = BeautifulSoup(f,"html5lib")
trans_emissions = soup.find("span", id="ctl00_MainContent_transEmissions")
print(trans_emissions.text)
This gives raw HTML almost exactly the same as the soup_dummy variable. But what I want to see is the result of the field ('ctl00_contentHolder_trSearchCharactersAPN', '631091430000') being submitted (this is the "parcel number" box).
I would really appreciate the help. If anything, linking me to a good post about HTML requests (one that not only explains but actually walks through scraping aspx) would be great.
To get the result using the parcel number, your parameters have to be somewhat different from what you have already tried. Moreover, you have to use this URL https://www.mytaxcollector.com/trSearchProcess.aspx to send the POST request.
Working code:
from urllib.request import Request, urlopen
from urllib.parse import urlencode
from bs4 import BeautifulSoup
url = 'https://www.mytaxcollector.com/trSearchProcess.aspx'
payload = {
    'hidRedirect': '',
    'hidGotoEstimate': '',
    'txtStreetNumber': '',
    'txtStreetName': '',
    'cboStreetTag': '(Any Street Tag)',
    'cboCommunity': '(Any City)',
    'txtParcelNumber': '0108301010000',  # your search term
    'txtPropertyID': '',
    'ctl00$contentHolder$cmdSearch': 'Search'
}
data = urlencode(payload)
data = data.encode('ascii')
req = Request(url, data)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36')
res = urlopen(req)
soup = BeautifulSoup(res.read(), 'html.parser')
for items in soup.select("table.propInfoTable tr"):
    data = [item.get_text(strip=True) for item in items.select("td")]
    print(data)
I have a question re: the requests module in Python.
So far I have been using it to scrape, and it's been working well.
However, when I run it against one particular website (code below), it just doesn't want to complete the task (the Jupyter cell shows [*] forever).
from bs4 import BeautifulSoup
import requests
import pandas as pd
import json
page = requests.get('https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets', verify = False)
soup = BeautifulSoup(page.content, 'html.parser')
Some users also suggest using headers such as the ones below to speed it up, but that doesn't work for me either:
url = 'https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
req = requests.get(url = url, headers = headers)
I'm not sure what's going on (this is the first time this has happened to me), but I might be missing something obvious. Can someone explain why this is not working? Or, if it works on your machine, please let me know!
The page attempts to set a cookie the first time you visit it. Using the requests module without supplying that cookie will prevent you from connecting to the page.
I've modified your script to include my cookie, which should work; if it doesn't, copy your own cookie (for this host domain) from the browser into the script.
url = 'https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
cookies = {
'TS01e58ec0': '01a1c9e334eb0b8b191d36d0da302b2bca8927a0ffd2565884aff3ce69db2486850b7fb8e283001c711cc882a8d1f749838ff59d3d'
}
req = requests.get(url = url, headers = headers, cookies=cookies)
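As an aside, hardcoding the cookie is fragile because these tokens expire; a requests.Session stores any Set-Cookie values it receives and resends them on later requests automatically, so an initial GET to the page is often enough. A minimal offline sketch of the cookie-jar behaviour (the cookie value below is a placeholder, not a real token):

```python
import requests

s = requests.Session()
s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'

# Simulate the cookie that the first GET would have stored in the jar;
# in real use, `s.get(url)` populates this from the Set-Cookie header.
s.cookies.set('TS01e58ec0', 'placeholder-value', domain='www.stoneisland.com')

# Every later request through `s` to that domain sends the cookie automatically.
print('TS01e58ec0' in s.cookies)  # True
```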
import requests
from bs4 import BeautifulSoup

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.5",
    "Connection": "keep-alive",
    "Host": "mcfbd.com",
    "Referer": "https://mcfbd.com/mcf/FrmView_PropertyTaxStatus.aspx",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0"}
a = requests.session()
soup = BeautifulSoup(a.get("https://mcfbd.com/mcf/FrmView_PropertyTaxStatus.aspx").content)
payload = {"ctl00$ContentPlaceHolder1$txtSearchHouse": "",
           "ctl00$ContentPlaceHolder1$txtSearchSector": "",
           "ctl00$ContentPlaceHolder1$txtPropertyID": "",
           "ctl00$ContentPlaceHolder1$txtownername": "",
           "ctl00$ContentPlaceHolder1$ddlZone": "1",
           "ctl00$ContentPlaceHolder1$ddlSector": "2",
           "ctl00$ContentPlaceHolder1$ddlBlock": "2",
           "ctl00$ContentPlaceHolder1$btnFind": "Search",
           "__VIEWSTATE": soup.find('input', {'id': '__VIEWSTATE'})["value"],
           "__VIEWSTATEGENERATOR": "14039419",
           "__EVENTVALIDATION": soup.find("input", {"name": "__EVENTVALIDATION"})["value"],
           "__SCROLLPOSITIONX": "0",
           "__SCROLLPOSITIONY": "0"}
b = a.post("https://mcfbd.com/mcf/FrmView_PropertyTaxStatus.aspx", headers=headers, data=payload).text
print(b)
above is my code for this website.
https://mcfbd.com/mcf/FrmView_PropertyTaxStatus.aspx
I checked firebug out and these are the values of the form data.
however doing this:
b = requests.post("https://mcfbd.com/mcf/FrmView_PropertyTaxStatus.aspx",headers = headers,data = payload).text
print(b)
throws this error:
[ArgumentException]: Invalid postback or callback argument
is my understanding of submitting forms via request correct?
1. Open Firebug.
2. Submit the form.
3. Go to the NET tab.
4. On the NET tab, choose the POST tab.
5. Copy the form data, as in the code above.
I've always wanted to know how to do this. I could use selenium but I thought I'd try something new and use requests
The error you are receiving is expected, because fields like __VIEWSTATE (and others as well) are not static and cannot be hardcoded. The proper way to do this is as follows:
Create a Requests Session object. It is also advisable to update it with headers containing a USER-AGENT string -
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36"}
s = requests.session()
Navigate to the specified url -
r = s.get(url)
Use BeautifulSoup4 to parse the html returned -
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content, 'html5lib')
Populate formdata with the hardcoded values and dynamic values -
formdata = {
    '__VIEWSTATE': soup.find('input', attrs={'name': '__VIEWSTATE'})['value'],
    'field1': 'value1'
}
Then send the POST request using the session object itself -
s.post(url, data=formdata)
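Putting those steps together, the part that usually goes wrong is collecting every hidden ASP.NET field (__VIEWSTATE, __EVENTVALIDATION, and so on) rather than hardcoding one or two. A small helper sketch, demonstrated on an inline HTML sample rather than the live site (the field values here are made up):

```python
from bs4 import BeautifulSoup

def collect_hidden_fields(html):
    """Return a dict of all hidden <input> fields (e.g. __VIEWSTATE) in a page."""
    soup = BeautifulSoup(html, 'html.parser')
    return {
        inp['name']: inp.get('value', '')
        for inp in soup.find_all('input', type='hidden')
        if inp.get('name')
    }

sample = '''
<form>
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
</form>
'''
fields = collect_hidden_fields(sample)
print(fields['__VIEWSTATE'])  # abc123
# Merge the visible form fields into this dict, then s.post(url, data=fields)
```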