Parsing a table with Pandas - python

I am trying to parse the table from https://alreits.com/screener
I have tried this:
import requests
import pandas as pd

main_url = 'https://alreits.com/screener'
r = requests.get(main_url)
df_list = pd.read_html(r.text)
df = df_list[0]
print(df)
but pandas can't find the table.
I have also tried BeautifulSoup4, but it didn't seem to give better results.
This is the selector: #__next > div.MuiContainer-root.MuiContainer-maxWidthLg > div.MuiBox-root.jss9.Card__CardContainer-feksr6-0.fpbzHQ.ScreenerTable__CardContainer-sc-1c5wxgl-0.GRrTj > div > table > tbody
This is the full xPath: /html/body/div/div[2]/div[2]/div/table/tbody
I am trying to get the stock symbol (under Name), sector, score and market cap. The other data would be nice to have but is not necessary.
Thank You!

I found a JSON URL in the browser dev tools. This is an easier way to extract the table than using Selenium: send a POST request to fetch the data directly.
import requests
import pandas as pd

headers = {
    'authority': 'api.alreits.com:8080',
    'sec-ch-ua': '"Google Chrome";v="93", " Not;A Brand";v="99", "Chromium";v="93"',
    'sec-ch-ua-mobile': '?0',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36',
    'sec-ch-ua-platform': '"Windows"',
    'content-type': 'application/json',
    'accept': '*/*',
    'origin': 'https://alreits.com',
    'sec-fetch-site': 'same-site',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://alreits.com/',
    'accept-language': 'en-US,en;q=0.9',
}
params = (
    ('page', '0'),
    ('size', '500'),
    ('sort', ['marketCap,desc', 'score,desc', 'ffoGrowth,desc']),
)
data = '{"filters":[]}'
response = requests.post('https://api.alreits.com:8080/api/reits/screener', headers=headers, params=params, data=data)
df = pd.DataFrame(response.json())
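If the endpoint returns a list of flat JSON records, the frame can then be cut down to the columns the question asks for. The field names used below (`symbol`, `sector`, `score`, `marketCap`) are assumptions about the API's response shape, not verified keys; inspect one record first and adjust:

```python
import pandas as pd

# Hypothetical records standing in for response.json(); the real key
# names may differ -- print one record to check before relying on them.
rows = [
    {"symbol": "PLD", "sector": "Industrial", "score": 4.5, "marketCap": 110_000},
    {"symbol": "AMT", "sector": "Infrastructure", "score": 4.1, "marketCap": 95_000},
]
df = pd.DataFrame(rows)

# Keep only the columns of interest, guarding against missing keys.
wanted = [c for c in ("symbol", "sector", "score", "marketCap") if c in df.columns]
print(df[wanted])
```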

The code below will return the data you are looking for.
import requests
import pprint
import json

headers = {'content-type': 'application/json',
           'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'}
r = requests.post(
    'https://api.alreits.com:8080/api/reits/screener?page=0&size=500&sort=marketCap,desc&sort=score,desc&sort=ffoGrowth,desc',
    headers=headers, data=json.dumps({'filters': []}))
if r.status_code == 200:
    pprint.pprint(r.json())
    # Now you have the data - do what you want with it
else:
    print(r.status_code)

Related

Scrape multiple pages with json

I am trying to scrape multiple pages that return JSON, but I get an error:
import requests
import json
import pandas as pd

headers = {
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8,pt;q=0.7',
    'Connection': 'keep-alive',
    'Origin': 'https://www.nationalhardwareshow.com',
    'Referer': 'https://www.nationalhardwareshow.com/',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'cross-site',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    'accept': 'application/json',
    'content-type': 'application/x-www-form-urlencoded',
    'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
}
params = {
    'x-algolia-agent': 'Algolia for vanilla JavaScript 3.27.1',
    'x-algolia-application-id': 'XD0U5M6Y4R',
    'x-algolia-api-key': 'd5cd7d4ec26134ff4a34d736a7f9ad47',
}
for i in range(0, 4):
    data = '{"params":"query=&page={i}&facetFilters=&optionalFilters=%5B%5D"}'
    resp = requests.post('https://xd0u5m6y4r-dsn.algolia.net/1/indexes/event-edition-eve-e6b1ae25-5b9f-457b-83b3-335667332366_en-us/query', params=params, headers=headers, data=data).json()
    req_json = resp
    df = pd.DataFrame(req_json['hits'])
    f = pd.DataFrame(df[['name', 'representedBrands', 'description']])
    print(f)
The error:
Traceback (most recent call last):
  File "e:\ScriptScraping\Extract data from json\uk.py", line 31, in <module>
    df = pd.DataFrame(req_json['hits'])
KeyError: 'hits'
Concatenate the variable i into the data payload instead of leaving the literal {i} placeholder:
import requests
import json
import pandas as pd

headers = {
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8,pt;q=0.7',
    'Connection': 'keep-alive',
    'Origin': 'https://www.nationalhardwareshow.com',
    'Referer': 'https://www.nationalhardwareshow.com/',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'cross-site',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    'accept': 'application/json',
    'content-type': 'application/x-www-form-urlencoded',
    'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"'
}
params = {
    'x-algolia-agent': 'Algolia for vanilla JavaScript 3.27.1',
    'x-algolia-application-id': 'XD0U5M6Y4R',
    'x-algolia-api-key': 'd5cd7d4ec26134ff4a34d736a7f9ad47'
}
lst = []
for i in range(0, 4):
    data = '{"params":"query=&page=' + str(i) + '&facetFilters=&optionalFilters=%5B%5D"}'
    resp = requests.post('https://xd0u5m6y4r-dsn.algolia.net/1/indexes/event-edition-eve-e6b1ae25-5b9f-457b-83b3-335667332366_en-us/query', params=params, headers=headers, data=data).json()
    req_json = resp
    df = pd.DataFrame(req_json['hits'])
    f = pd.DataFrame(df[['name', 'representedBrands', 'description']])
    lst.append(f)
    # print(f)
d = pd.concat(lst)
print(d)
It is returning status code 400 because the request body is badly formatted: a plain string does not interpolate, so the literal text {i} is sent instead of the page number. Change:
data = '{"params":"query=&page={i}&facetFilters=&optionalFilters=%5B%5D"}'
to
data = '{"params":"query=&page='+str(i)+'&facetFilters=&optionalFilters=%5B%5D"}'
for it to work. Hope I could help.
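An alternative that avoids manual string concatenation is an f-string with the JSON braces doubled, or building the payload with json.dumps. A small sketch (page number 2 used as an example); both forms encode the same JSON:

```python
import json

i = 2  # example page number

# f-string: literal JSON braces must be doubled ({{ and }}) inside an f-string
data_fstring = f'{{"params":"query=&page={i}&facetFilters=&optionalFilters=%5B%5D"}}'

# json.dumps: let the library handle the quoting entirely
data_dumps = json.dumps({"params": f"query=&page={i}&facetFilters=&optionalFilters=%5B%5D"})

# The two strings differ in whitespace but parse to the same JSON object.
print(json.loads(data_fstring) == json.loads(data_dumps))
```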

Why does soup.findAll return []?

I am having trouble reading data from a URL with BeautifulSoup.
This is my code:
from bs4 import BeautifulSoup
import requests

url1 = "https://www.barstoolsportsbook.com/events/1018282280"
headers = {
    'User-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get(url1, headers=headers)
soup = BeautifulSoup(html.text, 'lxml')
data = soup.findAll('div', attrs={"class": "section"})
print(data)
# for x in data:
#     print(x.find('p').text)
When I print(data), I get back []. What could be the reason for this? I would like to avoid using Selenium for this task if possible.
This is the HTML for what I'm trying to grab
<div data-v-50e01018="" data-v-2a52296d="" class="section"><p data-v-5f665d29="" data-v-50e01018="" class="header strongbody2">HOT TIPS</p><p data-v-5f665d29="" data-v-50e01018="" class="tip body2"> The Mets have led after 3 innings in seven of their last nine night games against NL East Division opponents that held a losing record. </p><p data-v-5f665d29="" data-v-50e01018="" class="tip body2"> The 'Inning 1 UNDER 0.5 runs' market has hit in each of the Marlins' last nine games against NL East opponents. </p></div>
You can likely get what you want with this request:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:103.0) Gecko/20100101 Firefox/103.0',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'en-US,en;q=0.5',
    'Authorization': 'Basic MTI2OWJlMjItNDI2My01MTI1LWJlNzMtMDZmMjlmMmZjNWM3Omk5Zm9jajRJQkZwMUJjVUc0NGt2S2ZpWEpremVKZVpZ',
    'Origin': 'https://www.barstoolsportsbook.com',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Referer': 'https://www.barstoolsportsbook.com/',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'cross-site',
}
response = requests.get('https://api.isportgenius.com.au/preview/1018282280', headers=headers)
response.json()
You should browse the network tab to see where the rest of the data is coming from, or use a webdriver.
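If you do go the webdriver route, note that the selector itself is fine: the empty findAll result comes from the content being injected by JavaScript, not from the query. A small check against the question's own HTML snippet (abbreviated here), which parses as expected once the markup actually exists:

```python
from bs4 import BeautifulSoup

# Abbreviated version of the snippet from the question -- this is what the
# page looks like only AFTER JavaScript has run, which is why plain
# requests.get() never sees it.
html = ('<div class="section"><p class="header strongbody2">HOT TIPS</p>'
        '<p class="tip body2"> The Mets have led after 3 innings in seven of their last nine night games. </p>'
        '<p class="tip body2"> The Inning 1 UNDER 0.5 runs market has hit in each of the last nine games. </p></div>')

soup = BeautifulSoup(html, 'html.parser')
sections = soup.find_all('div', attrs={'class': 'section'})
header = sections[0].find('p').text
tips = [p.get_text(strip=True) for p in sections[0].find_all('p', class_='tip')]
print(header)
print(tips)
```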

Getting Response 500 in Python

I want to access this website, https://www.truepeoplesearch.com/, but I am getting a 500 error with requests. I have also tried a proxy API to get a 200 response, but I still get the same result.
Here is my code:
import requests
# headers = {
# 'authority': 'www.truepeoplesearch.com',
# 'method': 'GET',
# 'path': '/?__cf_chl_rt_tk=yRTk.U51_2F3gk2zvxX_GLO5MgGmfwXsQeScGiGloJM-1637358329-0-gaNycGzNCpE',
# 'scheme': 'https',
# 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
# 'accept-encoding': 'gzip, deflate, br',
# 'accept-language': 'en-US,en;q=0.9',
# 'cache-control': 'max-age=0',
# 'cookie':' __cf_bm=jDWsHwlBvdk987UFEnrW63LelWWz03HXgZg3jDPaYY0-1637358168-0-AeyRiZHGcxyaD4j/7LGGq1aVmo5sBj/qmFX58OY4gfcUtvfzaG1exKf4HiYNNAlSaqm6LZ3MWB2UBgaOeIugrxA=; aegis_uid=26435771721; aegis_tid=direct; _ga=GA1.2.1056913930.1637358268; _gid=GA1.2.364813628.1637358268; cf_chl_prog=F13; cf_chl_rc_m=7',
# 'referer': 'https://www.truepeoplesearch.com/?__cf_chl_rt_tk=yRTk.U51_2F3gk2zvxX_GLO5MgGmfwXsQeScGiGloJM-1637358329-0-gaNycGzNCpE',
# 'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="96", "Google Chrome";v="96"',
# 'sec-ch-ua-mobile':' ?0',
# 'sec-ch-ua-platform':' "Windows"',
# 'sec-fetch-dest':' document',
# 'sec-fetch-mode': 'navigate',
# 'sec-fetch-site':' same-origin',
# 'upgrade-insecure-requests': '1',
# 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
# }
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}
payload = {'api_key': 'dd6a50aba760cc67e0276d840a7989cd', 'url': 'https://www.truepeoplesearch.com/'}
r = requests.get('http://api.scraperapi.com', params=payload, headers= headers)
print(r)
I have tried both header sets but didn't get a 200 response. Does anybody have any idea what I am doing wrong?
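When a proxy API intermittently returns 500, a small retry helper can distinguish a transient failure from a hard block. This is a generic sketch, not specific to ScraperAPI; in real use `get_page` would wrap the actual requests.get call:

```python
import time

def fetch_with_retry(get_page, max_tries=3, delay=1.0):
    """Call get_page() until it returns status 200 or the tries run out.

    get_page is any zero-argument callable returning an object with a
    .status_code attribute, e.g.
    lambda: requests.get('http://api.scraperapi.com', params=payload)
    """
    last = None
    for attempt in range(max_tries):
        last = get_page()
        if last.status_code == 200:
            return last
        time.sleep(delay * (attempt + 1))  # simple linear backoff
    return last

# Demo with a stub that fails twice, then succeeds.
class _Resp:
    def __init__(self, code):
        self.status_code = code

_codes = iter([500, 500, 200])
result = fetch_with_retry(lambda: _Resp(next(_codes)), max_tries=3, delay=0)
print(result.status_code)
```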

Obtain data of Freight Index in python

I am trying to get the data from this website, https://en.macromicro.me/charts/947/commodity-ccfi-scfi , for the China and Shanghai Containerized Freight Index.
I understand that the data is fetched from an API; how do I find out how the call is made, and how do I extract the data using Python?
I am new to HTML in general, so I have no idea where to start.
I tried:
import requests
url = "https://en.macromicro.me/charts/data/947/commodity-ccfi-scfi"
resp = requests.get(url)
resp = resp.json()
But the response is <Response [404]>
If I change the url to https://en.macromicro.me/charts/data/947/
the response is {'success': 0, 'data': [], 'msg': 'error #644'}
Try the below
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36',
    'Referer': 'https://en.macromicro.me/charts/947/commodity-ccfi-scfi',
    'X-Requested-With': 'XMLHttpRequest',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Authorization': 'Bearer 9633cefba71a598adae0fde0b56878fe',
    'Cookie': 'PHPSESSID=n74gv10hc8po7mrj491rk4sgo1; _ga=GA1.2.1231997091.1631627585; _gid=GA1.2.1656837390.1631627585; _gat=1; _hjid=c52244fd-b912-4d53-b0e3-3f11f430b51c; _hjFirstSeen=1; _hjAbsoluteSessionInProgress=0'}
r = requests.get('https://en.macromicro.me/charts/data/947', headers=headers)
print(r.json())
output
{'success': 1, 'data': {' ...}

Web scraping - "This application is out of date, please click the refresh button on your browser" return message

I would like to scrape some data from the following web site: https://oss.uredjenazemlja.hr/public/lrServices.jsp?action=publicLdbExtract
Steps I would like to automate are:
Choose "Opcinski sud/ZK odjel". For example choose "Zemljišnoknjižni odjel Benkovac".
Choose "Glavna knjiga". For example choose "BENKOVAC"
Enter "Broj kat. čestice:". For example, enter 576/2.
Select "Da" in "Povijesni pregled" (the last row, leave "Broj ZK uloska empty").
Click "Pregledaj" and solve the captcha.
Scrape the HTML that appears.
I have tried to automate the above steps with plain requests in Python, by following the network traffic after opening the inspector in the web browser.
There are lots of requests on the page. I will divide my code in several steps:
Start a session and make the requests that fire at the start of the page
import requests
import re
import shutil
from twocaptcha import TwoCaptcha
import pandas as pd
import numpy as np
import os
from pathlib import Path
import json
import uuid
# start session
url = 'https://oss.uredjenazemlja.hr/public/lrServices.jsp?action=publicLdbExtract'
session = requests.Session()
session.get(url)
jid = session.cookies.get_dict()['JSESSIONID']
# some requests made at the start of the page (probably redundant)
headers = {
'Referer': 'https://oss.uredjenazemlja.hr/public/lrServices.jsp?action=publicLdbExtract',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36',
}
session.get("https://oss.uredjenazemlja.hr/public/js/libs/modernizr-2.5.3.min.js", headers = headers) #
session.get("https://oss.uredjenazemlja.hr/public/js/libs/jquery-1.7.1.min.js", headers = headers) #
session.get("https://oss.uredjenazemlja.hr/public/js/script.js", headers = headers) # script.json
# no cache json
headers = {
'Cookie': 'ossprivatelang=hr_HR; gxtTheme=m%3Aid%7Cs%3Agray%2Cfile%7Cs%3Axtheme-gray.css; JSESSIONID=' + jid,
"Connection": "keep-alive",
'Host': 'oss.uredjenazemlja.hr',
'Referer': 'https://oss.uredjenazemlja.hr/public/lrServices.jsp?action=publicLdbExtract',
"sec-ch-ua": '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
"sec-ch-ua-mobile": "?0",
"Sec-Fetch-Dest": "script",
"Sec-Fetch-Mode": "no-cors",
"Sec-Fetch-Site": "same-origin",
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'
}
res = session.get('https://oss.uredjenazemlja.hr/public/gwt/hr.ericsson.oss.ui.pia.OssPiaModule.nocache.js', headers = headers)
cache_html = re.findall(r'bc=\'(.*\.cache.html)\',C', res.text)[0]
# cache_html = "1F6C776DEF6D55F56C900B938F84D726.cache.html"
# some more requests made at the start of the page (probably redundant)
headers = {
'Referer': 'https://oss.uredjenazemlja.hr/public/lrServices.jsp?action=publicLdbExtract',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36',
}
session.get("https://oss.uredjenazemlja.hr/public/gwt/tiny_mce_editor/tiny_mce_src.js", headers = headers) # tiny_mce_src.js
session.get("https://oss.uredjenazemlja.hr/public/gwt/js/common.js", headers = headers)
session.get("https://oss.uredjenazemlja.hr/public/gwt/js/blueimp_tmpl.js", headers = headers) # blueimp_tmpl.js
# cache json
headers = {
"DNT": "1",
'Referer': 'https://oss.uredjenazemlja.hr/public/lrServices.jsp?action=publicLdbExtract',
"sec-ch-ua": '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
'sec-ch-ua-mobile': '?0',
'Sec-Fetch-Dest': 'iframe',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'
}
session.get('https://oss.uredjenazemlja.hr/public/gwt/' + cache_html, headers = headers)
Then, I made requests for steps 1 and 2 above:
# commonRPCService opcinski sud 1
headers = {
'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'hr-HR,hr;q=0.9,en-US;q=0.8,en;q=0.7',
'Connection': 'keep-alive',
# 'Content-Length': '166',
'Content-Type': 'text/x-gwt-rpc; charset=UTF-8',
'Cookie': 'gxtTheme=m%3Aid%7Cs%3Agray%2Cfile%7Cs%3Axtheme-gray.css; ossprivatelang=hr_HR; __utma=79801043.802441445.1616788486.1616788486.1616788486.1; __utmz=79801043.1616788486.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); x-auto-31=m%3Acollapsed%7Cb%3Atrue; JSESSIONID=' + jid,
"DNT": "1",
'Host': 'oss.uredjenazemlja.hr',
'Origin': 'https://oss.uredjenazemlja.hr',
'Referer': 'https://oss.uredjenazemlja.hr/public/gwt/' + cache_html,
'Sec-Fetch-Dest': 'empty',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'same-origin',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'
}
payload = '5|0|4|https://oss.uredjenazemlja.hr/public/gwt/|957F3F03E95E97ABBDE314DFFCCEF4BC|hr.ericsson.oss.ui.common.client.core.rpc.ICommonRPCService|getMainBook|1|2|3|4|0|'
res = session.post(
'https://oss.uredjenazemlja.hr/rpc/commonRPCService',
headers = headers,
data=payload
)
print(res.text)
# commonRPCService opcinski sud 2
payload = '5|0|18|https://oss.uredjenazemlja.hr/public/gwt/|957F3F03E95E97ABBDE314DFFCCEF4BC|hr.ericsson.oss.ui.common.client.core.rpc.ICommonRPCService|getLrInstitutions|com.extjs.gxt.ui.client.data.BaseModel|hr.ericsson.oss.ui.common.client.core.data.RpcModel/2891266824|dirty|java.lang.Boolean/476441737|new|deleted|resourceCode|java.lang.Integer/3438268394|elementSelected|class|java.lang.String/2004016611|hr.ericsson.jis.domain.admin.Institution|name||1|2|3|4|1|5|6|7|7|8|0|9|-2|10|-2|11|12|0|13|-2|14|15|16|17|15|18|'
res = session.post(
'https://oss.uredjenazemlja.hr/rpc/commonRPCService',
data=payload,
headers=headers
)
# print(res.text)
# commonRPCService glavna knjiga 1
payload = '5|0|4|https://oss.uredjenazemlja.hr/public/gwt/|957F3F03E95E97ABBDE314DFFCCEF4BC|hr.ericsson.oss.ui.common.client.core.rpc.ICommonRPCService|getMainBook|1|2|3|4|0|'
res = session.post(
'https://oss.uredjenazemlja.hr/rpc/commonRPCService',
data=payload,
headers=headers
)
print(res.text)
# commonRPCService glavna knjiga 2
payload = ('5|0|34|https://oss.uredjenazemlja.hr/public/gwt/|957F3F03E95E97ABBDE314DFFCCEF4BC|hr.ericsson.oss.ui.common.client.core.rpc.ICommonRPCService|getMainBooks|com.extjs.gxt.ui.client.data.BaseModel|hr.ericsson.oss.ui.common.client.core.data.RpcModel/2891266824|dirty|java.lang.Boolean/476441737|new|deleted|resourceCode|java.lang.Integer/3438268394|elementSelected|cadastralMunicipality|class|java.lang.String/2004016611|hr.ericsson.jis.domain.admin.CadastralMunicipality|hr.ericsson.jis.domain.admin.MainBook|institution|institutionId|parentInstitution|name|Općinski sud u Zadru|hr.ericsson.jis.domain.admin.Institution|institutionType|institutionTypeId|hr.ericsson.jis.domain.admin.InstitutionType|source|superviseInstitutionId|Zemljišnoknjižni odjel Benkovac|place|BENKOVAC|preconditionsRequired||1|2|3|4|1|5|6|10|7|8|0|9|-2|10|-2|11|12|0|13|-2|14|6|1|15|16|17|15|16|18|19|6|13|7|-2|9|-2|20|12|500|21|6|8|7|-2|9|-2|10|-2|20|12|605|11|12|0|13|-2|22|16|23|15|16|24|25|6|7|7|-2|9|-2|10|-2|26|12|14|11|-11|13|-2|15|16|27|28|12|1|10|-2|29|-10|11|-11|13|-2|22|16|30|31|16|32|15|-13|33|-2|22|16|34|').encode("utf-8")
res = session.post(
'https://oss.uredjenazemlja.hr/rpc/commonRPCService',
data=payload,
headers=headers
)
Then I solve the captcha:
# some captcha post
payload = ('5|0|4|https://oss.uredjenazemlja.hr/public/gwt/|957F3F03E95E97ABBDE314DFFCCEF4BC|hr.ericsson.oss.ui.common.client.core.rpc.ICommonRPCService|isCaptchaDisabled|1|2|3|4|0|').encode('utf-8')
res = session.post(
'https://oss.uredjenazemlja.hr/rpc/commonRPCService',
data=payload,
headers=headers
)
print(res.text)
# get and save captcha
TWO_CAPTCHA_API_KEY = "myapikey"
solver = TwoCaptcha(TWO_CAPTCHA_API_KEY)
save_path = 'D:/zkrh/captchas'
p = session.get('https://oss.uredjenazemlja.hr/servlets/kaptcha.jpg?1617088523212',
headers=headers,
stream=True)
captcha_path = os.path.join(Path(save_path), 'captcha' + ".jpg")
with open(captcha_path, 'wb') as out_file:
    shutil.copyfileobj(p.raw, out_file)
# solve captcha
result = solver.normal(captcha_path, minLength=5, maxLength=5)
payload = ('5|0|6|https://oss.uredjenazemlja.hr/public/gwt/|957F3F03E95E97ABBDE314DFFCCEF4BC|hr.ericsson.oss.ui.common.client.core.rpc.ICommonRPCService|validateCaptcha|java.lang.String|' +
result['code'] + '|1|2|3|4|1|5|6|').encode('utf-8')
res = requests.post(
'https://oss.uredjenazemlja.hr/rpc/commonRPCService',
data=payload,
headers=headers
)
if res.text.startswith("//OK"):
    os.rename(captcha_path, os.path.join(Path(save_path), result['code'] + ".jpg"))
else:
    print("Kriva captcha. Rijesi!")  # "Wrong captcha. Solve it!"
Now, here is the most important request, and I can't get the right output from it. It should return lots of numbers, the most important being the 7-digit ones (\d{7}; there should be one or more of them). I can use that number in the last step to get the HTML. Here is my try:
payload = ('5|0|40|https://oss.uredjenazemlja.hr/public/gwt/|0EAC9F40996251FDB21FF254E1600E83|hr.ericsson.oss.ui.pia.client.rpc.IOssPublicRPCService|getLrUnitsByMainBookAndParcel|com.extjs.gxt.ui.client.data.BaseModel|java.lang.String|hr.ericsson.oss.ui.common.client.core.data.RpcModel/2891266824|date|java.sql.Date/3996530531|dirty|java.lang.Boolean/476441737|new|cadastralMunicipality|id|java.lang.Integer/3438268394|class|java.lang.String/2004016611|hr.ericsson.jis.domain.admin.CadastralMunicipality|cadastralMunicipalityId|source|creationDate|formatedName|BENKOVAC|userId|cadInstitution|deleted|institutionId|resourceCode|elementSelected|name|Odjel za katastar nekretnina Benkovac|hr.ericsson.jis.domain.admin.Institution|institution|Zemljišnoknjižni odjel Benkovac|place|sidMainBook|java.lang.Long/4227064769|hr.ericsson.jis.domain.admin.MainBook|status|576/2|1|2|3|4|2|5|6|7|18|8|9|115|10|21|10|11|0|12|-3|13|7|3|14|15|98|16|17|18|19|-5|20|15|1|21|9|116|0|1|22|17|23|24|15|-9999|25|7|8|10|-3|12|-3|26|-3|27|15|117|28|15|0|29|-3|30|17|31|16|17|32|33|7|9|10|-3|12|-3|26|-3|27|15|500|28|-13|29|-3|30|17|34|35|-9|16|-15|26|-3|28|15|0|29|-3|30|-9|36|37|4730091|0|14|15|30857|16|17|38|39|-19|40|').encode('utf-8')
res = session.post(
'https://oss.uredjenazemlja.hr/rpc/commonRPCService',
data=payload,
headers=headers
)
print(res.text)
It returns:
"//EX[2,1,["com.google.gwt.user.client.rpc.IncompatibleRemoteServiceException/3936916533","This application is out of date, please click the refresh button on your browser. ( Blocked attempt to access interface 'hr.ericsson.oss.ui.pia.client.rpc.IOssPublicRPCService', which is not implemented by 'hr.ericsson.oss.ui.common.server.core.rpc.CommonRPCService'; this is either misconfiguration or a hack attempt )"],0,5]"
instead of numbers as I explained before.
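For what it's worth, once that RPC call does succeed, the 7-digit numbers can be pulled from the comma/pipe-delimited response text with a regex. The sample string below is made up to illustrate the shape, and `\b\d{7}\b` is a guess at a sufficient pattern that may need tightening if other 7-digit values occur:

```python
import re

# Made-up stand-in for res.text; real GWT-RPC responses are delimited like this.
sample = "//OK[5799992,0,14,15,30857,16,17,4730091,1,[],0,5]"

# \b\d{7}\b matches exactly-seven-digit runs, skipping shorter IDs like 30857.
lr_unit_numbers = re.findall(r"\b\d{7}\b", sample)
print(lr_unit_numbers)  # ['5799992', '4730091']
```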
Then, in the last step, I use the 7-digit number as the lrUnitNumber parameter:
# Publicreportservlet
headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'hr-HR,hr;q=0.9,en-US;q=0.8,en;q=0.7',
'Cache-Control': 'max-age=0',
'Connection': 'keep-alive',
'Content-Length': '169',
'Content-Type': 'application/x-www-form-urlencoded',
'Cookie': 'ossprivatelang=hr_HR; gxtTheme=m%3Aid%7Cs%3Agray%2Cfile%7Cs%3Axtheme-gray.css; JSESSIONID=' + jid,
'Host': 'oss.uredjenazemlja.hr',
'Origin': 'https://oss.uredjenazemlja.hr',
'Referer': 'https://oss.uredjenazemlja.hr/public/lrServices.jsp?action=publicLdbExtract',
'Sec-Fetch-Dest': 'iframe',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-User': '?1',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'
}
dataFrom = {
'pia': 1,
'report_type_id': 4,
'report_type_name': 'bzp_izvadak_oss',
'source': 1,
'institutionID': 500,
'mainBookId': 30857,
'lrUnitNumber': 5509665,
'lrunitID': 5799992,
'status': '0,1',
'footer': '',
'export_type': 'html'
}
res = session.post(
'https://oss.uredjenazemlja.hr/servlets/PublicReportServlet',
data=dataFrom,
headers=headers
)
res
I am providing the R code too. Maybe someone with R and web-scraping knowledge can help:
library(httr)
library(rvest)
library(stringr)
library(reticulate)
twocaptcha <- reticulate::import("twocaptcha")
# captcha python library
TWO_CAPTCHA_API_KEY = ".."
solver = twocaptcha$TwoCaptcha(TWO_CAPTCHA_API_KEY)
#
url = 'https://oss.uredjenazemlja.hr/public/lrServices.jsp?action=publicLdbExtract'
session = GET(url)
jid <- cookies(session)$value
headers_cache = c(
'Referer'= 'https://oss.uredjenazemlja.hr/public/lrServices.jsp?action=publicLdbExtract',
'User-Agent'= 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'
)
session <- rvest:::request_GET(content(session), 'https://oss.uredjenazemlja.hr/public/gwt/hr.ericsson.oss.ui.pia.OssPiaModule.nocache.js',
add_headers(headers_cache))
cache_html <- str_extract(session$response, "bc=\\'(.*\\.cache.html)\\',C")
cache_html <- gsub(".*=\\'|\\'.C", "", cache_html)
headers_cache = c(
'Referer'= 'https://oss.uredjenazemlja.hr/public/lrServices.jsp?action=publicLdbExtract',
'User-Agent'= 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'
)
session <- rvest:::request_GET(session, paste0('https://oss.uredjenazemlja.hr/public/gwt/', cache_html), add_headers(headers_cache))
# meta
commonRPCServiceUrl <- "https://oss.uredjenazemlja.hr/rpc/commonRPCService"
headers = c(
'Accept'= '*/*',
'Accept-Encoding'= 'gzip, deflate, br',
'Accept-Language'= 'hr-HR,hr;q=0.9,en-US;q=0.8,en;q=0.7',
'Connection'= 'keep-alive',
# 'Content-Length'= '166',
'Content-Type'= 'text/x-gwt-rpc; charset=UTF-8',
'Cookie'= paste0('gxtTheme=m%3Aid%7Cs%3Agray%2Cfile%7Cs%3Axtheme-gray.css; ossprivatelang=hr_HR; x-auto-31=m%3Acollapsed%7Cb%3Atrue; JSESSIONID=', jid),
'Host'= 'oss.uredjenazemlja.hr',
'Origin'= 'https://oss.uredjenazemlja.hr',
'Referer'= paste0('https://oss.uredjenazemlja.hr/public/gwt/', cache_html),
'Sec-Fetch-Dest'= 'empty',
'Sec-Fetch-Mode'= 'cors',
'Sec-Fetch-Site'= 'same-origin',
'User-Agent'= 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'
)
payload <- "5|0|4|https://oss.uredjenazemlja.hr/public/gwt/|957F3F03E95E97ABBDE314DFFCCEF4BC|hr.ericsson.oss.ui.common.client.core.rpc.ICommonRPCService|getMainBook|1|2|3|4|0|"
session <- rvest:::request_POST(session, commonRPCServiceUrl, body = payload, add_headers(headers))
session$response$content
readBin(session$response$content, character(), endian = "little")
payload <- "5|0|22|https://oss.uredjenazemlja.hr/public/gwt/|957F3F03E95E97ABBDE314DFFCCEF4BC|hr.ericsson.oss.ui.common.client.core.rpc.ICommonRPCService|getMainBooks|com.extjs.gxt.ui.client.data.BaseModel|hr.ericsson.oss.ui.common.client.core.data.RpcModel/2891266824|dirty|java.lang.Boolean/476441737|new|deleted|resourceCode|java.lang.Integer/3438268394|elementSelected|cadastralMunicipality|class|java.lang.String/2004016611|hr.ericsson.jis.domain.admin.CadastralMunicipality|hr.ericsson.jis.domain.admin.MainBook|institution|preconditionsRequired|name|VELIKA GORICA|1|2|3|4|1|5|6|10|7|8|0|9|-2|10|-2|11|12|0|13|-2|14|6|1|15|16|17|15|16|18|19|0|20|-2|21|16|22|"
session <- rvest:::request_POST(session, commonRPCServiceUrl, body = payload, add_headers(headers))
session$response$content
readBin(session$response$content, character(), endian = "little")
# captcha
payload <- "5|0|4|https://oss.uredjenazemlja.hr/public/gwt/|957F3F03E95E97ABBDE314DFFCCEF4BC|hr.ericsson.oss.ui.common.client.core.rpc.ICommonRPCService|isCaptchaDisabled|1|2|3|4|0|"
session <- rvest:::request_POST(session, commonRPCServiceUrl, body = payload, add_headers(headers))
session$response$content
readBin(session$response$content, character(), endian = "little")
headers_captcha <- c(
"Accept"= "image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8",
"Accept-Encoding"= "gzip, deflate, br",
"Accept-Language"=" hr-HR,hr;q=0.9,en-US;q=0.8,en;q=0.7",
"Connection"= "keep-alive",
"Cookie"= paste0("gxtTheme=m%3Aid%7Cs%3Agray%2Cfile%7Cs%3Axtheme-gray.css; ossprivatelang=hr_HR; x-auto-31=m%3Acollapsed%7Cb%3Atrue; JSESSIONID=", jid),
"DNT"= "1",
"Host"= "oss.uredjenazemlja.hr",
"Referer"= "https://oss.uredjenazemlja.hr/public/lrServices.jsp?action=publicLdbExtract",
"sec-ch-ua"= '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
"sec-ch-ua-mobile"= "?0",
"Sec-Fetch-Dest"= "image",
"Sec-Fetch-Mode"= "no-cors",
"Sec-Fetch-Site"= "same-origin",
"User-Agent"= "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
)
captcha <- GET("https://oss.uredjenazemlja.hr/servlets/kaptcha.jpg?1617286122160", add_headers(headers_captcha))
# session <- rvest:::request_GET(session, "https://oss.uredjenazemlja.hr/servlets/kaptcha.jpg?1617286122160", add_headers(headers_captcha))
captcha$content
captcha$response$content
writeBin(captcha$content, "D:/zkrh/captchas/test.jpg")
result = solver$normal("D:/zkrh/captchas/test.jpg", minLength=5, maxLength=5)
payload <- paste0("5|0|6|https://oss.uredjenazemlja.hr/public/gwt/|957F3F03E95E97ABBDE314DFFCCEF4BC|hr.ericsson.oss.ui.common.client.core.rpc.ICommonRPCService|validateCaptcha|java.lang.String|",
result$code, "|1|2|3|4|1|5|6|")
session <- rvest:::request_POST(session, commonRPCServiceUrl, body = payload, add_headers(headers))
session$response$content
readBin(session$response$content, character(), endian = "little")
# ID!!!!!!
headers = c(
'Accept'= '*/*',
'Accept-Encoding'= 'gzip, deflate, br',
'Accept-Language'= 'hr-HR,hr;q=0.9,en-US;q=0.8,en;q=0.7',
'Connection'= 'keep-alive',
# 'Content-Length'= '166',
'Content-Type'= 'text/x-gwt-rpc; charset=UTF-8',
'Cookie'= paste0('gxtTheme=m%3Aid%7Cs%3Agray%2Cfile%7Cs%3Axtheme-gray.css; ossprivatelang=hr_HR; x-auto-31=m%3Acollapsed%7Cb%3Atrue; JSESSIONID=', jid),
'DNT' = '1',
'Host'= 'oss.uredjenazemlja.hr',
'Origin'= 'https://oss.uredjenazemlja.hr',
'Referer'= paste0('https://oss.uredjenazemlja.hr/public/gwt/', cache_html),
'sec-ch-ua' = '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
'sec-ch-ua-mobile' = "?0",
'Sec-Fetch-Dest'= 'empty',
'Sec-Fetch-Mode'= 'cors',
'Sec-Fetch-Site'= 'same-origin',
'User-Agent'= 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'
)
payload <- paste0("5|0|40|https://oss.uredjenazemlja.hr/public/gwt/|0EAC9F40996251FDB21FF254E1600E83|hr.ericsson.oss.ui.pia.client.rpc.IOssPublicRPCService|getLrUnitByMainBook|com.extjs.gxt.ui.client.data.BaseModel|java.lang.String|hr.ericsson.oss.ui.common.client.core.data.RpcModel/2891266824|date|java.sql.Date/3996530531|dirty|java.lang.Boolean/476441737|new|cadastralMunicipality|id|java.lang.Integer/3438268394|class|java.lang.String/2004016611|hr.ericsson.jis.domain.admin.CadastralMunicipality|cadastralMunicipalityId|source|creationDate|formatedName|VELIKA GORICA|userId|cadInstitution|deleted|institutionId|resourceCode|elementSelected|name|Odjel za katastar nekretnina Velika Gorica|hr.ericsson.jis.domain.admin.Institution|institution|Zemljišnoknjižni odjel Velika Gorica|place|sidMainBook|java.lang.Long/4227064769|hr.ericsson.jis.domain.admin.MainBook|status|1|1|2|3|4|2|5|6|7|18|8|9|114|1|21|10|11|0|12|-3|13|7|3|14|15|102844|16|17|18|19|-5|20|15|1|21|9|116|0|1|22|17|23|24|15|-20|25|7|8|10|-3|12|-3|26|-3|27|15|32|28|15|0|29|-3|30|17|31|16|17|32|33|7|9|10|-3|12|-3|26|-3|27|15|277|28|-13|29|-3|30|17|34|35|-9|16|-15|26|-3|28|-7|29|-3|30|-9|36|37|286610893|17179869184|14|15|21921|16|17|38|39|15|0|40|")
# Encoding(payload) <- "UTF-8"
# payload <- RCurl::curlEscape(payload)
session <- rvest:::request_POST(session, commonRPCServiceUrl, body = payload, add_headers(headers))
session$response$content
readBin(session$response$content, character())
I have found the error: the problem was a wrong URL in one request (the exception shows the payload targets IOssPublicRPCService, but it was posted to the /rpc/commonRPCService endpoint).
You should really look into selenium and beautifulsoup4 to automate this - it's like requests on steroids.
see an example on my github: https://github.com/stevenhurwitt/reliant-scrape/blob/master/reliant_scrape.py
