request data from api with bearer token and refresh token - python

I am still a beginner at web scraping. I am trying to extract data from an API, but the problem is that it requires a Bearer token, and this token expires every 5 to 6 hours, so I have to go back to the web page and copy the token again. Is there any way to extract the data without opening the web page and copying the token each time?
I found this info as well in the network request. Someone told me that I could use the refresh_token to get access, but I don't know how to do that:
Cache-Control: no-cache,
Connection: keep-alive,
Content-Length: 177,
Content-Type: application/json;charset=UTF-8,
Cookie: dhh_token=; refresh_token=; _hurrier_session=81556f54bf555a952d1a7f780766b028,
dnt: 1
import requests
import json
import pandas as pd
from time import sleep

def make_request():
    headers = {
        'Connection': 'keep-alive',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache',
        'sec-ch-ua': '^\\^',
        'Accept': 'application/json',
        'Authorization': 'Bearer eyJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJMdXRiZlZRUVZhWlpmNTNJbGxhaXFDY3BCVTNyaGtqZiIsInN1YiI6MzEzMTcwLCJleHAiOjE2MjQzMjU2NDcsInJvbCI6ImRpc3BhdGNoZXIiLCJyb2xlcyI6WyJodXJyaWVyLmRpc3BhdGNoZXIiLCJjb2QuY29kX21hbmFnZXIiXSwibmFtIjoiRXNsYW0gWmVmdGF3eSIsImVtYSI6ImV6ZWZ0YXd5QHRhbGFiYXQuY29tIiwidXNlcm5hbWUiOiJlemVmdGF3eUB0YWxhYmF0LmNvbSIsImNvdW50cmllcyI6WyJrdyIsImJoIiwicWEiLCJhZSIsImVnIiwib20iLCJqbyIsInEyIiwiazMiXX0.XYykBij-jaiIS_2tdqKFIfYGfw0uS0rKmcOTSHor8Nk',
        'sec-ch-ua-mobile': '?0',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
        'Content-Type': 'application/json;charset=UTF-8',
        'Origin': 'url',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Dest': 'empty',
        'Referer': 'url',
        'Accept-Language': 'en-US,en;q=0.9,ar-EG;q=0.8,ar;q=0.7',
        'dnt': '1',
    }
    data = {
        'status': 'picked'
    }
    response = requests.post('url/api', headers=headers, json=data)
    print(response.text)
    return json.loads(response.text)

def extract_data(row):
    data_row = {
        'order_id': row['order']['code'],
        'deadline': row['order']['deadline'].split('.')[0],
        'picked_at': row['picked_at'].split('.')[0],
        'picked_by': row['picked_by'],
        'processed_at': row['processed_at'],
        'type': row['type']
    }
    return data_row

def periodique_extract(delay):
    extract_count = 0
    while True:
        extract_count += 1
        data = make_request()
        if extract_count == 1:
            df = pd.DataFrame([extract_data(row) for row in data['data']])
            df.to_csv(r"C:\Users\di\Desktop\New folder\a.csv", mode='a')
        else:
            df = pd.DataFrame([extract_data(row) for row in data['data']])
            df.to_csv(r"C:\Users\di\Desktop\New folder\a.csv", mode='a', header=False)
        print('extracting data {} times'.format(extract_count))
        sleep(delay)

periodique_extract(60)
# note: the website tracks a live operation, so I extract data every 1 minute

Sometimes these tokens require JavaScript execution to be set and are then automatically added to API requests. That means you need to open the page in something that actually runs the JavaScript in order to get the token, i.e. actually opening the page in a browser.
One solution could be to use something like Selenium or Puppeteer to open the page whenever the token expires and fetch a new token, which you then feed to your script. But this depends on the specifics of the page; without a link, the correct solution is difficult to say. However, if opening the page in your browser, copying the token, and then running your script works, then this approach is very likely to work as well.
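For illustration, here is a minimal sketch of that approach with Selenium. Everything site-specific is an assumption: the page URL is a placeholder, and the token may live in localStorage, sessionStorage or a cookie, under a key you have to find in DevTools first.

from selenium import webdriver

def fetch_token():
    driver = webdriver.Chrome()
    driver.get('url')  # placeholder: the page that performs the login / sets the token
    # ... perform or wait for the login here if the session is not remembered ...
    # Assumed storage location and key; verify in DevTools -> Application.
    token = driver.execute_script("return window.localStorage.getItem('access_token');")
    driver.quit()
    return token

You would call fetch_token() once at startup and again whenever a request comes back with 401/403, rebuilding the Authorization header with the fresh value. Alternatively, since the question mentions a refresh_token cookie: if the backend exposes a conventional OAuth2-style refresh endpoint (not confirmed for this site; the endpoint URL and field names below are hypothetical), the refresh token can sometimes be exchanged for a new access token without a browser at all:

import requests

# Hypothetical endpoint and field names; copy the real ones from DevTools.
resp = requests.post('url/oauth/token', json={
    'grant_type': 'refresh_token',
    'refresh_token': '<value of the refresh_token cookie>',
})
new_access_token = resp.json().get('access_token')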

Related

How to login into Instagram using Python Requests?

I am using the following code to make a Python request to log into my Instagram account. I am running this locally.
import requests
from datetime import datetime
import re
from pprint import pprint
import json

time = int(datetime.now().timestamp())
link = 'https://www.instagram.com/accounts/login/'
login_url = 'https://www.instagram.com/accounts/login/ajax/'
payload = {
    'username': 'username',
    'enc_password': f'#PWD_INSTAGRAM_BROWSER:0:{time}:password',
    'queryParams': '{}',
    'optIntoOneTap': 'false',
    'stopDeletionNonce': '',
    'trustedDeviceRecords': '{}'
}
response = requests.get(link)
csrf = response.cookies['csrftoken']
print(csrf)
response = requests.post(login_url, data=payload, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'Referer': 'https://www.instagram.com/',
    'X-CSRFtoken': csrf,
    'Content-Type': 'application/x-www-form-urlencoded',
    'Host': 'www.instagram.com',
    'Origin': 'https://www.instagram.com'
})
response_json = json.loads(response.text)
pprint(response_json)
The response I receive after running the above code shows that my request is not authenticated:
{'authenticated': False, 'status': 'ok', 'user': True}
How can I log in to Instagram using requests? Is there an updated method?
In general, these use cases lend themselves perfectly to selenium, scrapy, playwright or puppeteer. I do not have an Instagram account, so I don't know if this works, but in theory it might return a valid response:
import requests

cookies = {
    'csrftoken': '9e7U8qRNqAbazRC0kwrRgyN2okh1kihx',
    'mid': 'YsM1_AALAAEG2fGCvkPXE5DVlJD0',
    'ig_did': '494394E2-A583-4F01-BC32-5E4344FE2C4D',
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0',
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.5',
    # 'Accept-Encoding': 'gzip, deflate, br',
    'X-CSRFToken': '9e7U8qRNqAbazRC0kwrRgyN2okh1kihx',
    'X-Instagram-AJAX': 'c6412f1b1b7b',
    'X-IG-App-ID': '936619743392459',
    'X-ASBD-ID': '198387',
    'X-IG-WWW-Claim': '0',
    'X-Requested-With': 'XMLHttpRequest',
    'Origin': 'https://www.instagram.com',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Referer': 'https://www.instagram.com/accounts/login/?',
    # Requests sorts cookies= alphabetically
    # 'Cookie': 'csrftoken=9e7U8qRNqAbazRC0kwrRgyN2okh1kihx; mid=YsM1_AALAAEG2fGCvkPXE5DVlJD0; ig_did=494394E2-A583-4F01-BC32-5E4344FE2C4D',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    # Requests doesn't support trailers
    # 'TE': 'trailers',
}
data = {
    'enc_password': '#PWD_INSTAGRAM_BROWSER:10:1656960533:ARxQAMMwb3Yd6w3UdaFGt0Q3mTZ7lMDJHHmZFLaGEfQahXJOTxqb35Q/ZGC3B70DxZRhKcnaf3xImXyL7EFseRF/yZG4dvauui/LzLU7oAHK3rYHSYsjPjQTham/5DFXq6m4foqB5fIiJoChT+ng58EDUkFA1A==',
    'username': 'somerandomemail@hotmail.com',
    'queryParams': '{}',
    'optIntoOneTap': 'false',
    'stopDeletionNonce': '',
    'trustedDeviceRecords': '{}',
}
response = requests.post('https://www.instagram.com/accounts/login/ajax/', cookies=cookies, headers=headers, data=data)
If you run into security issues, try the same but with the cloudscraper library instead of requests.
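For reference, a minimal sketch of that cloudscraper variant, reusing the cookies, headers and data dicts defined above (cloudscraper exposes a requests.Session-compatible interface, so it is close to a drop-in swap):

import cloudscraper

# create_scraper() returns a session-like object that attempts to solve
# Cloudflare's browser checks; post() has the same signature as requests.post().
scraper = cloudscraper.create_scraper()
response = scraper.post(
    'https://www.instagram.com/accounts/login/ajax/',
    cookies=cookies, headers=headers, data=data,
)
print(response.status_code)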
I don't think you can access Instagram with only requests, as far as I know.
Last time I tried, I had to create an app within the Facebook developer account and get an access token from the Facebook/Instagram Graph API to access Instagram and handle login. With that, you can not only log in to your account but also post content from it.
Long story short, refer to the Instagram Graph API and this should get your job done!
Edit:
import facebook  # pip install facebook-sdk

# Sharing on Instagram...
insta = facebook.GraphAPI(facebookUserAccessToken)
InstaSend = insta.put_photo(open(IMAGEPATH, 'rb'), message=TEXT)
if InstaSend:
    print('\nInstagram Share Successful!')

# Sharing on Facebook...
face = facebook.GraphAPI(facebookPageAccessToken)
face.put_object(
    parent_object=facebookPageID,
    connection_name="feed",
    message=TEXT,
)
faceSend = face.put_photo(open(IMAGEPATH, 'rb'), message=TEXT)
if faceSend:
    print('\nFacebook Share Successful!')
The code above dates back to 2019; I wrote it to automatically share content on different social media platforms once a video was published on my YouTube channel. I haven't used it since then, and I doubt it would work for you as is. Some code might need to be changed for it to work, as the Graph API is actively updated by Meta. However, I believe the process has gotten easier.
Additionally, you can check out "Justin Stolpe" on YouTube for more insights on this topic.

login with requests and BeautifulSoup to scrape pages

I need to scrape a page that requires login to access.
I tried to log in with the saved login info converted to cURL, using requests and BeautifulSoup, but it doesn't work.
I need to login on 'https://www.seoprofiler.com/account/login'
And then scrape pages like: 'https://www.seoprofiler.com/lp/links?q=test.com'
Here's my code:
from bs4 import BeautifulSoup
import requests

cookies = {
    'csrftoken': 'token123',
    'seoprofilersession': 'session123',
}
headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'sec-ch-ua': '^\\^',
    'sec-ch-ua-mobile': '?0',
    'Upgrade-Insecure-Requests': '1',
    'Origin': 'https://www.seoprofiler.com',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Referer': 'https://www.seoprofiler.com/account/login',
    'Accept-Language': 'en,en-US;q=0.9,it;q=0.8',
}
data = {
    'csrfmiddlewaretoken': 'token123',
    'username': 'email123@gmail.com',
    'password': 'pass123!',
    'button': ''
}
response = requests.post('https://www.seoprofiler.com/account/login',
                         headers=headers, cookies=cookies, data=data)
url = 'https://www.seoprofiler.com/lp/links?q=test.com'
response = requests.get(url, headers=headers, cookies=cookies)
soup = BeautifulSoup(response.content, 'html.parser')
soup.encode('utf-8')
print(soup.title)
I would not use selenium, as I have to scrape a lot of data and it would take too much time with selenium.
How can I log in so that I can scrape pages while logged in?
Thank you!
You could use requests.Session!
After some trial and error I was able to log in and get the project page using the following script:
import requests

session = requests.Session()  # create new session
session.get(
    "https://www.seoprofiler.com/account/login"
)  # sets the seoprofilersession and csrftoken cookies
session.post(
    "https://www.seoprofiler.com/account/login",
    data={
        "csrfmiddlewaretoken": session.cookies.get_dict()["csrftoken"],
        "username": "your_email",
        "password": "your_password",
    },
)  # login, sets needed cookies
# Now use this session to get all the data you need!
resp = session.get(
    "https://www.seoprofiler.com/project/google.com-fa1b9c855721f3d5"
)  # get main page content
print(resp.status_code)  # my output: 200
Edited:
I just checked one more thing: it appears that it is not mandatory to retrieve the seoprofilersession and csrftoken cookies first; you can simply call the login POST with your credentials (without csrfmiddlewaretoken) and then use your session.
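In other words, something like this shorter sketch may already be enough (untested beyond the check described above; fill in your own credentials):

import requests

session = requests.Session()
session.post(
    "https://www.seoprofiler.com/account/login",
    data={"username": "your_email", "password": "your_password"},
)  # login without the csrftoken round-trip
resp = session.get("https://www.seoprofiler.com/lp/links?q=test.com")
print(resp.status_code)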
How do you know what data structure you must pass to the login page?
A more robust solution is to use selenium to fill in the username and password fields of the login page and then click the login button. Next, go to the desired page and scrape it.
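A rough sketch of that Selenium approach; the element locators below are assumptions (they mirror the field names in the POST data above) and need to be checked against the real login page markup:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.seoprofiler.com/account/login')
# Hypothetical locators; inspect the form to find the real ones.
driver.find_element(By.NAME, 'username').send_keys('email123@gmail.com')
driver.find_element(By.NAME, 'password').send_keys('pass123!')
driver.find_element(By.NAME, 'button').click()
driver.get('https://www.seoprofiler.com/lp/links?q=test.com')
print(driver.title)
driver.quit()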

python requests 403 for json, however in browsers works fine

I'm trying to get data from eToro. This link works in my browser: https://www.etoro.com/sapi/userstats/CopySim/Username/viveredidividend/OneYearAgo, but it's forbidden via requests.get(), even if I add a user agent, headers and even cookies.
import requests

url = "https://www.etoro.com/sapi/userstats/CopySim/Username/viveredidividend/OneYearAgo"
headers = {
    'Host': 'www.etoro.com',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0',
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Referer': 'https://www.etoro.com/people/viveredidividend/chart',
    'Cookie': 'XXX',
    'TE': 'Trailers'
}
requests.get(url, headers=headers)
>>> <Response [403]>
How to solve it without selenium?
This error occurs when your Python code is not authenticated with the site. When you log in on the website, it authenticates you and remembers the session; that's why it works fine in the browser.
To solve this problem, you first need to authenticate from within your Python code.
To authenticate:
import requests
response = requests.get(url, auth=(username, password))
The 403 error tells you that the request you are making is being blocked. The website is protected by Cloudflare, which prevents it from being scraped. You can check this by executing print(response.text) in your code; you'll see "Access denied | www.etoro.com used Cloudflare to restrict access" in the title tag of the returned Cloudflare HTML.
Under the hood, when you send the request it goes through Cloudflare's server, which verifies whether it's coming from a real browser. Only if the request passes that verification is it forwarded to the website's server, which returns the valid response. Otherwise, Cloudflare blocks the request.
It's difficult to bypass Cloudflare. Nevertheless, you can try your luck with the code given below.
Code
import urllib.request

url = 'https://www.etoro.com/sapi/userstats/CopySim/Username/viveredidividend/OneYearAgo'
headers = {
    'authority': 'www.etoro.com',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
    'accept': 'application/json, text/plain, */*',
    'accounttype': 'Real',
    'applicationidentifier': 'ReToro',
    'sec-ch-ua-mobile': '?0',
    'applicationversion': '331.0.2',
    'user-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://www.etoro.com/discover/markets/cryptocurrencies',
    'accept-language': 'en-US,en;q=0.9',
    'cookie': '__cfruid=e7f40231e2946a1a645f6fa0eb19af969527087e-1624781498; _gcl_au=1.1.279416294.1624782732; _gid=GA1.2.518227313.1624782732; _scid=64860a19-28e4-4e83-9f65-252b26c70796; _fbp=fb.1.1624782732733.795190273; __adal_ca=so%3Ddirect%26me%3Dnone%26ca%3Ddirect%26co%3D%28not%2520set%29%26ke%3D%28not%2520set%29; __adal_cw=1624782733150; _sctr=1|1624732200000; _gaexp=GAX1.2.eSuc0QBTRhKbpaD4vT_-oA.18880.x331; _hjTLDTest=1; _hjid=bb69919f-e61b-4a94-a03b-db7b1f4ec4e4; hp_preferences=%7B%22locale%22%3A%22en-gb%22%7D; funnelFromId=38; eToroLocale=en-gb; G_ENABLED_IDPS=google; marketing_visitor_regulation_id=10; marketing_visitor_country=96; __cflb=0KaS4BfEHptJdJv5nwPFxhdSsqV6GxaSK8BuVNBmVkuj6hYxsLDisSwNTSmCwpbFxkL3LDuPyToV1fUsaeNLoSNtWLVGmBErMgEeYAyzW4uVUEoJHMzTirQMGVAqNKRnL; __cf_bm=6ef9d6f250ee71d99f439672839b52ac168f7c89-1624785170-1800-ASu4E7yXfb+ci0NsW8VuCgeJiCE72Jm9uD7KkGJdy1XyNwmPvvg388mcSP+hTCYUJvtdLyY2Vl/ekoQMAkXDATn0gyFR0LbMLl0b7sCd1Fz/Uwb3TlvfpswY1pv2NvCdqJBy5sYzSznxEsZkLznM+IGjMbvSzQffBIg6k3LDbNGPjWwv7jWq/EbDd++xriLziA==; _uetsid=2ba841e0d72211eb9b5cc3bdcf56041f; _uetvid=2babee20d72211eb97efddb582c3c625; _ga=GA1.2.1277719802.1624782732; _gat_UA-2056847-65=1; __adal_ses=*; __adal_id=47f4f887-c22b-4ce0-8298-37d6a0630bdd.1624782733.2.1624785174.1624782818.770dd6b7-1517-45c9-9554-fc8d210f1d7a; _gat=1; TS01047baf=01d53e5818a8d6dc983e2c3d0e6ada224b4742910600ba921ea33920c60ab80b88c8c57ec50101b4aeeb020479ccfac6c3c567431f; outbrain_cid_fetch=true; _ga_B0NS054E7V=GS1.1.1624785164.2.1.1624785189.35; TMIS2=9a74f8b353780f2fbe59d8dc1d9cd901437be0b823f8ee60d0ab36264e2503993c5e999eaf455068baf761d067e3a4cf92d9327aaa1db627113c6c3ae3b39cd5e8ea5ce755fb8858d673749c5c919fe250d6297ac50c5b7f738927b62732627c5171a8d3a86cdc883c43ce0e24df35f8fe9b6f60a5c9148f0a762e765c11d99d; mp_dbbd7bd9566da85f012f7ca5d8c6c944_mixpanel=%7B%22distinct_id%22%3A%20%2217a4c99388faa1-0317c936b045a4-34647600-13c680-17a4c993890d70%22%2C%22%24device_id%22%3A%20%2217a4c99388faa1-0317c936b045a4-34647600-13c680-17a4c993890d70%22%2C%22%24initial_referrer%22%3A%20%22%24direct%22%2C%22%24initial_referring_domain%22%3A%20%22%24direct%22%7D',
}
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request).read()
print(response.decode('utf-8'))
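If that still gets blocked, the cloudscraper library mentioned in the Instagram answer above is worth a try here as well. A sketch; success against Cloudflare is not guaranteed:

import cloudscraper

scraper = cloudscraper.create_scraper()  # requests.Session-compatible object
resp = scraper.get('https://www.etoro.com/sapi/userstats/CopySim/Username/viveredidividend/OneYearAgo')
print(resp.status_code)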

How to get post request auth_token from zyBook's response?

I am trying to log into zyBooks using the Requests library in Python. I saw in the Network tab of Google Chrome that I need an auth_token to add to the URL in order to actually create and perform the login request. Firstly, here is the Network tab snapshot after I log into the website:
So first, I need to do the first POST request, named 'signin' (the second entry in the list; the first, an OPTIONS request, doesn't seem to do anything or respond with anything). The signin POST request is supposed to respond with an auth_token, which I can then use to log in via the third entry in the list, which is the first GET request.
The response of the first POST request is the auth_token:
And here are the details of the first POST request. You can see the request URL and the payload required:
As proof, here is what the request URL would look like. As you can see, it needs the auth_token.
I am, however, unable to get the first POST request's auth_token in any way I have tried so far. The request URLs for the first two 'signin' requests are what is in the code. Here is the code:
import requests

url = 'https://learn.zybooks.com/signin'
payload = {"email": "myemail", "password": "mypassword"}
headers = {
    'Host': 'zyserver.zybooks.com',
    'Connection': 'keep-alive',
    'Content-Length': '52',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'sec-ch-ua': '"Chromium";v="88", "Google Chrome";v="88", ";Not A Brand";v="99"',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'DNT': '1',
    'sec-ch-ua-mobile': '?0',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
    'Content-Type': 'application/json',
    'Origin': 'https://learn.zybooks.com',
    'Sec-Fetch-Site': 'same-site',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'https://learn.zybooks.com/',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
}
session = requests.Session()
req1 = session.post(url)
req2 = session.post(url, data=payload)
print(req2.json())
I just get a JSONDecodeError:
353 obj, end = self.scan_once(s, idx)
354 except StopIteration as err:
--> 355 raise JSONDecodeError("Expecting value", s, err.value) from None
356 return obj, end
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
From what I have researched online, this error happens because the response doesn't contain any JSON. But that doesn't make sense to me, as I need that JSON response with the auth_token to be able to create the GET request that logs into the site.
Got it. It's because zyBooks runs on Ember.js. There is barely any HTML; it's a JavaScript website. The JavaScript needs to be loaded first; then the form can be filled in and submitted.
I did not go through with implementing it myself, but for future readers coming here, there are posts on this subject, such as:
using requests to login to a website that has javascript login form
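If you do go down that road, one common pattern is to let Selenium do the JavaScript-driven login and then hand the resulting cookies over to a requests.Session for the actual scraping. A sketch with hypothetical element locators (the rendered Ember form must be inspected for the real ones):

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://learn.zybooks.com/signin')
# Hypothetical locators; check the rendered form in DevTools.
# (In practice, add an explicit wait for the form to render first.)
driver.find_element(By.NAME, 'email').send_keys('myemail')
driver.find_element(By.NAME, 'password').send_keys('mypassword')
driver.find_element(By.TAG_NAME, 'button').click()

# Copy the browser session's cookies into a plain requests session.
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])
driver.quit()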

POST requests using cookie from session

I am trying to scrape a website, using a POST request to fill in the form:
http://www.planning2.cityoflondon.gov.uk/online-applications/search.do?action=advanced
In Python, this goes as follows:
import requests
import webbrowser

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Cookie': 'JSESSIONID=OwXG0Hkxj+X9ELygHZa-aLQ5.undefined; _ga=GA1.3.1911942552.',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Host': 'www.planning2.cityoflondon.gov.uk',
    'Origin': 'http://www.planning2.cityoflondon.gov.uk',
    'Referer': 'http://www.planning2.cityoflondon.gov.uk/online-applications/search.do?action=advanced',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
data = {
    'searchCriteria.developmentType': '002',
    'date(applicationReceivedStart)': '01/08/2000',
    'date(applicationReceivedEnd)': '01/08/2018'
}
url = 'http://www.planning2.cityoflondon.gov.uk/online-applications/advancedSearchResults.do?action=firstPage'
test_file = 'planning_app.html'
with requests.Session() as session:
    r = session.post(url, headers=headers, data=data)
with open(test_file, 'w') as file:
    file.write(r.text)
webbrowser.open(test_file)
As you can see from the page reopened with webbrowser, this gives an 'outdated cookie' error.
For this to work, I would need to manually go to the webpage, perform a query with the inspect panel of Google Chrome open on the Network tab, look at the cookie in the request headers, and copy-paste that cookie into my code. This would work until, of course, the cookie expires again.
I tried to automate the retrieval of the cookie as follows:
headers_get = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Host': 'www.planning2.cityoflondon.gov.uk',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}
with requests.Session() as session:
    c = session.get('http://www.planning2.cityoflondon.gov.uk/online-applications/', headers=headers_get)
    headers['Cookie'] = 'JSESSIONID=' + list(c.cookies.get_dict().values())[0]
    r = session.post(url, headers=headers, data=data)
with open(test_file, 'w') as file:
    file.write(r.text)
webbrowser.open(test_file)
I would expect this to work, as it simply automates what I do manually: go to the page of the GET request, get the cookie from it, and add said cookie to the headers dict of the POST request.
However, I still receive the 'server error' page from the POST request.
Can anyone help me understand why this happens?
requests.post accepts a cookies parameter. Using it, instead of sending cookies directly in the header, may fix the problem:
with requests.Session() as session:
    c = session.get('http://www.planning2.cityoflondon.gov.uk/online-applications/', headers=headers_get)
    # Also, you can set cookies=session.cookies
    r = session.post(url, headers=headers, data=data, cookies=c.cookies)
Basically, I suppose there may be some JavaScript logic on the site which isn't executed when you use requests.post. If that's the case, you will have to use selenium to fill in and submit the form.
Please see Dynamic Data Web Scraping with Python, BeautifulSoup, which deals with a similar problem: JavaScript not being executed.
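One more detail worth checking in the code above (an observation, not something verified against this site): mixing a hand-built 'Cookie' header with a Session's own cookie handling is fragile, so it is usually cleaner to drop the manual header entirely and let the session carry the JSESSIONID from the GET to the POST:

import requests

# Sketch reusing headers, headers_get, data and url from the question.
headers.pop('Cookie', None)  # drop the stale hand-copied cookie
with requests.Session() as session:
    session.get('http://www.planning2.cityoflondon.gov.uk/online-applications/',
                headers=headers_get)  # the session stores JSESSIONID itself
    r = session.post(url, headers=headers, data=data)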
