Extracting Production Data using Python

Extracting Production Data using Python - python

I am fairly new to Python and I'm trying to extract production data from the Alabama state website (https://www.gsa.state.al.us/ogb/production). I was wondering if someone could guide me on starting this? This is what I have so far. I was trying to extract production for permit number 8132-C.
headers = {
'Content-Type': 'application/x-www-form-urlencoded',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
}
payload = '8132-C'
session = requests.Session()
r = requests.get('https://www.gsa.state.al.us/ogb/production',
params=payload)
print(r.url)

Instead of r.url , you should r.text to see the data.
import requests
payload = '8132-C'
session = requests.Session()
r = requests.get('https://www.gsa.state.al.us/ogb/production', params=payload)
print(r.text)

Related

Python / Requests, download file from .aspx website after logging in

Python/requests.
I need to:
log in a website
change a parameter
download a file according to change in 2)
Attached the images with form/payload after download completion (Please feel free to ask me more, if you don't find me enough descriptive).
My idea was:
url = 'https://www.sunnyportal.com/Templates/Start.aspx?ReturnUrl=%2f'
protectedurl = 'https://www.sunnyportal.com/FixedPages/Dashboard.aspx'
downloadurl = 'https://www.sunnyportal.com/Redirect/DownloadDiagram'
# your details here to be posted to the login form.
payload = {
'ctl00$ContentPlaceHolder1$Logincontrol1$txtUserName': user,
'ctl00$ContentPlaceHolder1$Logincontrol1$txtPassword': pw
}
# ensure the session context is closed after use.
with requests.Session() as s:
p = s.post(url, data=payload)
print(p.status_code, p.headers)
# authorised request
r = s.get(protectedurl)
print(r.status_code, r.headers)
# download request
d = s.get(downloadurl)
print(d.status_code, d.headers)
I get for all 200 status code, but download doesn't start.
Here you can find the POST payload after logging in:
Thanks, Please please please help me!
I would like to have more clear:
should I add headers to post/get requests? Which headers?
Should I add more to the payload? What exactly?
Should I use straight just 1-2 url(s)? Which one/which ones?
Thanks!

There is a lot to do here but it should be possible. This is an ASP site so you need to get the __VIEWSTATE and __VIEWSTATEGENERATOR for every page you navigate from and include it in the payload. I would include everything in the payload, even the blanks stuff as well as replicate the headers. See the code below for how to login.
Then once you login you can replicate the network call to change the date, again you need to process the __VIEWSTATE and __VIEWSTATEGENERATOR from the page you are moving from and include it in the payload. (use a function like below and just call it with each move).
When you expand the image you will see another network call which you need to replicate, the response will have HTML you can parse and you can find the image in this tag:
<img id="UserControlShowEnergyAndPower1$_diagram" src="/chartfx70/temp/CFV0113_101418049AC.png"
If it's not that exact chart you want then right click the chart and copy-image-address then look for that image url in the HTML to see where it is.
then you can do something like this to save the file:
img_suffix = soup.find('img',{'id':'UserControlShowEnergyAndPower1$_diagram'})['src']
image_name = img_suffix.split('/')[-1]
image_url = 'https://www.sunnyportal.com/'+img_suffix
image_data = s.get(pdf_url) # where s is the requests.Session() variable
print(f'Saving image')
with open(image_name,'wb') as file:
file.write(image_data.content)
Below is how I logged in but you can take it from here to navigate to your image:
import requests
from bs4 import BeautifulSoup
def get_views(resp):
soup = BeautifulSoup(resp,'html.parser')
viewstate = soup.find('input',{'name':'__VIEWSTATE'})['value']
viewstate_gen = soup.find('input',{'name':'__VIEWSTATEGENERATOR'})['value']
return (viewstate,viewstate_gen)
s = requests.Session()
user = 'your_email'
pw = 'your_password'
headers = {
'accept':'*/*',
'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
}
url = 'https://www.sunnyportal.com/Templates/Start.aspx?ReturnUrl=%2f'
protectedurl = 'https://www.sunnyportal.com/FixedPages/Dashboard.aspx'
downloadurl = 'https://www.sunnyportal.com/Redirect/DownloadDiagram'
landing_page = s.get(url,headers=headers)
print(landing_page)
viewstate,viewstate_gen = get_views(landing_page.text)
# your details here to be posted to the login form.
payload = {
'__EVENTTARGET':'',
'__EVENTARGUMENT':'',
'__VIEWSTATE':viewstate,
'__VIEWSTATEGENERATOR':viewstate_gen,
'ctl00$ContentPlaceHolder1$Logincontrol1$txtUserName':user,
'ctl00$ContentPlaceHolder1$Logincontrol1$txtPassword':pw,
'ctl00$ContentPlaceHolder1$Logincontrol1$LoginBtn':'Login',
'ctl00$ContentPlaceHolder1$Logincontrol1$RedirectURL':'',
'ctl00$ContentPlaceHolder1$Logincontrol1$RedirectPlant':'',
'ctl00$ContentPlaceHolder1$Logincontrol1$RedirectPage':'',
'ctl00$ContentPlaceHolder1$Logincontrol1$RedirectDevice':'',
'ctl00$ContentPlaceHolder1$Logincontrol1$RedirectOther':'',
'ctl00$ContentPlaceHolder1$Logincontrol1$PlantIdentifier':'',
'ctl00$ContentPlaceHolder1$Logincontrol1$ServiceAccess':'true',
'ClientScreenWidth':'1920',
'ClientScreenHeight':'1080',
'ClientScreenAvailWidth':'1920',
'ClientScreenAvailHeight':'1050',
'ClientWindowInnerWidth':'1920',
'ClientWindowInnerHeight':'979',
'ClientBrowserVersion':'56',
'ClientAppVersion':'5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
'ClientAppName':'Netscape',
'ClientLanguage':'en-ZA',
'ClientPlatform':'Win32',
'ClientUserAgent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
'ctl00$ContentPlaceHolder1$hiddenLanguage':'en-gb'
}
new_headers = {
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Encoding':'gzip, deflate, br',
'Accept-Language':'en-ZA,en;q=0.9,en-GB;q=0.8,en-US;q=0.7,de;q=0.6',
'Cache-Control':'no-cache',
'Connection':'keep-alive',
'Content-Length':'3917',
'Content-Type':'application/x-www-form-urlencoded',
'DNT':'1',
'Host':'www.sunnyportal.com',
'Origin':'https://www.sunnyportal.com',
'Pragma':'no-cache',
'Referer':'https://www.sunnyportal.com/Templates/Start.aspx?ReturnUrl=%2f',
'sec-ch-ua':'" Not;A Brand";v="99", "Google Chrome";v="97", "Chromium";v="97"',
'sec-ch-ua-mobile':'?0',
'sec-ch-ua-platform':'"Windows"',
'Sec-Fetch-Dest':'document',
'Sec-Fetch-Mode':'navigate',
'Sec-Fetch-Site':'same-origin',
'Sec-Fetch-User':'?1',
'Upgrade-Insecure-Requests':'1',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'
}
login = s.post(url,headers=new_headers,data=payload)
print(login)
print(login.text)

Web scraping request stopped working, showing "Response [401]" in python?

import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36'}
url = 'https://www.nseindia.com/api/chart-databyindex?index=ACCEQN'
r = requests.get(url, headers=headers)
data = r.json()
print(data)
prices = data['grapthData']
print(prices)
It was working fine but now it showing error "Response [401]"

Well, it's all about the site's authentication requirements. It requires a certain level of authorization to access like this.

Getting past a login page in Python

For some reason I cannot get my head around using the requests.Session() function to save login credentials. My current code isn't working and I cannot figure out why, can someone help correct my code and explain the changes that were needed? I don't want to keep asking for help on a website by website basis.
headers = {
'Content-type': 'text/html',
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36',
'origin':'https://stockinvest.us',
'referer':'https://stockinvest.us'
}
data = {'email':'user-email', 'password: 'user-password'}
with requests.Session() as s:
login = s.post('https://stockinvest.us/login', data=data, headers=headers)
s.cookies
r = s.get('https://stockinvest.us/profile/edit')

Python requests: CSRF token missing or incorrect

I have a simple HTML page where I am trying to post form data using requests.post(); however, I keep getting Bad Request 400. CSRF token missing or incorrect even though I am passing it URL-encoded.
Please help.
url = "https://recruitment.advarisk.com/tests/scraping"
res = requests.get(url)
tree = etree.HTML(res.content)
csrf = tree.xpath('//input[#name="csrf_token"]/#value')[0]
postData = dict(csrf_token=csrf, ward=wardName)
print(postData)
postUrl = urllib.parse.quote(csrf)
formData = dict(csrf_token=postUrl, ward=wardName)
print(formData)
headers = {'referer': url, 'content-type': 'application/x-www-form-urlencoded', 'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
page = requests.post(url, data=formData, headers=headers)
return page.content

You have make sure the requests in one session, so that the csrf_token will be matched:
import sys
import requests
wardName = "DHANLAXMICOMPLEX"
url = 'https://recruitment.advarisk.com/tests/scraping'
#make the requests in one session
client = requests.session()
# Retrieve the CSRF token first
tree = etree.HTML(client.get(url).content)
csrf = tree.xpath('//input[#name="csrf_token"]/#value')[0]
#form data
formData = dict(csrf_token=csrf, ward=wardName)
headers = {'referer': url, 'content-type': 'application/x-www-form-urlencoded', 'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
#use same session client
r = client.post(url, data=formData, headers=headers)
print r.content
It will give you the html with the result data table.

Login to website via Python Requests

for a university project I am currently trying to login to a website, and scrap a little detail (a list of news articles) from my user profile.
I am new to Python, but I did this before to some other website. My first two approaches deliver different HTTP errors. I have considered problems with the header my request is sending, however my understanding of this sites login process appears to be insufficient.
This is the login page: http://seekingalpha.com/account/login
My first approach looks like this:
import requests
with requests.Session() as c:
requestUrl ='http://seekingalpha.com/account/orthodox_login'
USERNAME = 'XXX'
PASSWORD = 'XXX'
userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36'
login_data = {
"slugs[]":None,
"rt":None,
"user[url_source]":None,
"user[location_source]":"orthodox_login",
"user[email]":USERNAME,
"user[password]":PASSWORD
}
c.post(requestUrl, data=login_data, headers = {"referer": "http://seekingalpha.com/account/login", 'user-agent': userAgent})
page = c.get("http://seekingalpha.com/account/email_preferences")
print(page.content)
This results in "403 Forbidden"
My second approach looks like this:
from requests import Request, Session
requestUrl ='http://seekingalpha.com/account/orthodox_login'
USERNAME = 'XXX'
PASSWORD = 'XXX'
userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36'
# c.get(requestUrl)
login_data = {
"slugs[]":None,
"rt":None,
"user[url_source]":None,
"user[location_source]":"orthodox_login",
"user[email]":USERNAME,
"user[password]":PASSWORD
}
headers = {
"accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Language":"de-DE,de;q=0.8,en-US;q=0.6,en;q=0.4",
"origin":"http://seekingalpha.com",
"referer":"http://seekingalpha.com/account/login",
"Cache-Control":"max-age=0",
"Upgrade-Insecure-Requests":1,
"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36"
}
s = Session()
req = Request('POST', requestUrl, data=login_data, headers=headers)
prepped = s.prepare_request(req)
prepped.body ="slugs%5B%5D=&rt=&user%5Burl_source%5D=&user%5Blocation_source%5D=orthodox_login&user%5Bemail%5D=XXX%40XXX.com&user%5Bpassword%5D=XXX"
resp = s.send(prepped)
print(resp.status_code)
In this approach I was trying to prepare the header exactly as my browser would do it. Sorry for redundancy. This results in HTTP error 400.
Does someone have an idea, what went wrong? Probably a lot.

Instead of spending a lot of energy on manually logging in and playing with Session, I suggest you just scrape the pages right away using your cookie.
When you log in, usually there is a cookie added to your request to identify your identity. Please see this for example:
Your code will be like this:
import requests
response = requests.get("www.example.com", cookies={
"c_user":"my_cookie_part",
"xs":"my_other_cookie_part"
})
print response.content

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extracting Production Data using Python - python

Instead of r.url , you should r.text to see the data. import requests payload = '8132-C' session = requests.Session() r = requests.get('https://www.gsa.state.al.us/ogb/production', params=payload) print(r.text)

Related

Python / Requests, download file from .aspx website after logging in

Web scraping request stopped working, showing "Response [401]" in python?

Getting past a login page in Python

Python requests: CSRF token missing or incorrect

Login to website via Python Requests

Categories

Resources