Does anyone know a strategy to work around an HTML late-load problem? - python

Does anyone know a strategy to work around an HTML late-load problem?
A table on my page does not load when I fetch the page with Python requests.

Found the API call for it. The root URL is: https://br.advfn.com/common/bov-options/api, and an example call is to https://br.advfn.com/common/bov-options/api?symbol=PETR4, where PETR4 is passed as an argument.
Just use a GET request:
import requests
symbol = "PETR4"
res = requests.get(f"https://br.advfn.com/common/bov-options/api?symbol={symbol}")
print(res.text)  # print(res) alone would only show the status, e.g. <Response [200]>
The result:
{"result":[{"symbol":"PETRF286","type":"Call","style":"A","strike_price":"28,46","expiry_date":"18\/06\/2021","volume":"28912100","volume_form":"28.912.100","change_percentage":"25,0%","url":"\/p.php?pid=quote&symbol=BOV%5EPETRF286","class":"up"},{"symbol":"PETRF296","type":"Call","style":"A","strike_price":"28,96","expiry_date":"18\/06\/2021","volume":"25247000","volume_form":"25.247.000
...
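The values in this response use Brazilian number formatting (a comma as the decimal separator, dots as thousands separators), so they need converting before any arithmetic. Here is a minimal sketch of that conversion, using the first record of the result above as sample data. The helper name parse_br_number is made up for illustration and is not part of the site's API.

```python
import json

# First record of the response shown above, pasted in as sample data
# so the example runs without a network call.
SAMPLE = r'''{"result":[{"symbol":"PETRF286","type":"Call","style":"A","strike_price":"28,46","expiry_date":"18\/06\/2021","volume":"28912100","volume_form":"28.912.100","change_percentage":"25,0%","url":"\/p.php?pid=quote&symbol=BOV%5EPETRF286","class":"up"}]}'''


def parse_br_number(text):
    """Convert '28,46' or '28.912.100' (Brazilian formatting) to a float.

    Hypothetical helper: drops thousands dots, then swaps the decimal comma.
    """
    return float(text.replace(".", "").replace(",", "."))


options = json.loads(SAMPLE)["result"]
first = options[0]
strike = parse_br_number(first["strike_price"])
volume = int(first["volume"])
print(first["symbol"], strike, volume)
```

With a live response, res.json()["result"] would take the place of json.loads(SAMPLE)["result"].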

Related

Download csv file which has javascript on-click download with requests

I'm trying to download a CSV file from here: Link. The download starts after clicking "Acesse todos os negócios realizados até o momento", which is in blue next to an image of a cloud with an arrow.
I do know how to solve the problem with selenium, but it's such a heavy library that I'd like to learn other solutions (especially faster ones). My main idea was to use requests, since I think it's the fastest approach.
My code:
import requests
url="https://bvmf.bmfbovespa.com.br/NegociosRealizados/Registro/DownloadArquivoDiretorio?data="
r=requests.get(url,allow_redirects=True)
r.text
r.text is a string of 459432 characters. Here is part of it:
RGF0YS9Ib3JhIGRhIHVsdGltYSBhdHVhbGl6YWNhbzogMTEvMTEvMjAyMiAxNTo1MDoxNApJbnN0cnVtZW50byBGaW5hbmNlaXJvO0VtaXNzb3I7Q29kaWdvIElGO1F1YW50aWRhZGUgTmVnb2NpYWRhO1ByZWNvIE5lZ29jaW87Vm9sdW1lIEZpbmFuY2Vpcm8gUiQ7VGF4YSBOZWdvY2lvO09yaWdlbSBOZWdvY2lvO0hvcmFyaW8gTmVnb2NpbztEYXRhIE5lZ29jaW87Q29kLiBJZGVudGlmaWNhZG9yIGRvIE5lZ29jaW87Q29kLiBJc2luO0RhdGEgTGlxdWlkYWNhbwpDUkE7RUNPQUdST1NFQztDUkEwMTcwMDJCRTsxMDc7MTMxMyw3MDUxMjAwMDsxNDA1NjYsNDU7LDAwMDA7UHJlLXJlZ2lzdHJvIC0gVm9pY2U7MTU6NDc6NDk7MTEvMTEvMjAyMjsjOTc0MzQ0ODY7QlJFQ09BQ1JBMVozOzExLzExLzIwMjIKQ1JJO09QRUFTRUNVUklUSVpBRE9SQTsxOEcwNjg3NTIxOzIzNTsxMjQ4LDY5MTcwNDAwOzI5MzQ0Miw1NTssMDAwMDtQcmUtcmVnaXN0cm8gLSBWb2ljZTsxNTo0Nzo0OTsxMS8xMS8yMDIyOyM5NzQzNDQ4NDtCUlJCUkFDUkk0WTU7MTEvMTEvMjAyMgpERUI7UEVUUk9CUkFTO1BFVFIxNjs3NjsxMTU1LDkzNTgwODAwOzg3ODUxLDEyOzcsMzUwMDtQcmUtcmVnaXN0cm8gLSBWb2ljZTsxNTo0Nzo0OTsxMS8xMS8yMDIyOyM5NzQzNDQ4NTtCUlBFVFJEQlMwMDE7MTEvMTEvMjAyMgpERUI7VkFMRTtDVlJEQTY7Mjk7MzQsMDAwMDAwMDA7OTg2LDAwOywwMDAwO1ByZS1yZWdpc3RybyAtIFZvaWNlOzE1OjQ3OjA0OzExLzExLzIwMjI7Izk3NDMzOTM2O0JSVkFMRURCUzAyODsxMS8xMS8yMDIyCkRFQjtWQUxFO0NWUkRBNjsyOTszMywzMDAwMDAwMDs5NjUsNzA7LDAwMDA7UHJlLXJlZ2lzdHJvIC0gVm9pY2U7MTU6NDc6MDQ7MTEvMTEvMjAyMjsjOTc0MzM5Mzc7QlJWQUxFREJTMDI4OzExLzExLzIwMjIKREVCO0VRVUFUT1JJQUxUUkFOU01JO0VRVUExMTs1OTsxMDA3LDg0NDQ1MzAwOzU5NDYyLDgyOzcsMDMwMDtQcmUtcmVnaXN0cm8gLSBWb2ljZTsxNTo0Njo0MDsxMS8xMS8yMDIyOyM5NzQzMzYxNDtCUkVR...
I don't know what this string is or what to do with it. Is it encoded? Should I call another function with it? Am I calling the wrong link? Should I just try another approach? Is selenium not that bad?
Any help is appreciated. Thank you!
Extra info:
From devtools I found it's calling these javascript functions:
onclick is just that: <a href="#" onclick="carregarDownloadArquivo('')">
carregarDownloadArquivo (I don't know JavaScript; I tried to extract only this specific function from the JS):
function carregarDownloadArquivo(n){var t;
t=n!=null&&n!=""?n:getParameterByName("data")==null?"":getParameterByName("data");
$("#divloadArquivo").show();
$.ajax({url:"/NegociosRealizados/Registro/DownloadArquivoDiretorio?data="+t}).then(function(t,i,r){var u;
if(t!="null"&&t!=""){var f=convertBase64(t),e=window.navigator.userAgent,o=e.indexOf("MSIE ");
(n==null||n=="")&&(n=retornaDataHoje());
u=n+"_NEGOCIOSBALCAO.CSV";
o>0||!!navigator.userAgent.match(/Trident.*rv\:11\./)?DownloadArquivoIE(u,f,r.getAllResponseHeaders()):DownloadArquivo(u,f,r.getAllResponseHeaders())}$("#divloadArquivo").hide()})}
Extra functions needed to understand carregarDownloadArquivo:
function DownloadArquivoIE(n,t,i){var r=new Blob(t,{type:i});
navigator.msSaveBlob(r,n)}function DownloadArquivo(n,t,i){var r=new Blob(t,{type:i});
saveAs(r,n)}function getParameterByName(n,t){t||(t=window.location.href);
n=n.replace(/[\[\]]/g,"\\$&");
var r=new RegExp("[?&]"+n+"(=([^&#]*)|&|#|$)"),i=r.exec(t);
return i?i[2]?decodeURIComponent(i[2].replace(/\+/g," ")):"":null}
I'm not so sure about the ajax and send calls. I also don't know how their code downloads the CSV file.
Try to decode the base64-encoded response from the server:
import base64
import requests
url = "https://bvmf.bmfbovespa.com.br/NegociosRealizados/Registro/DownloadArquivoDiretorio?data="
t = base64.b64decode(requests.get(url).text)
print(t.decode("utf-8"))
Prints:
Data/Hora da ultima atualizacao: 11/11/2022 16:38:20
Instrumento Financeiro;Emissor;Codigo IF;Quantidade Negociada;Preco Negocio;Volume Financeiro R$;Taxa Negocio;Origem Negocio;Horario Negocio;Data Negocio;Cod. Identificador do Negocio;Cod. Isin;Data Liquidacao
DEB;STAGENEBRA;MSGT23;2200;980,96524400;2158123,54;7,8701;Pre-registro - Voice;16:34:56;11/11/2022;#97441063;BRMSGTDBS050;12/11/2022
CRA;VERTSEC;CRA022006N5;306;972,49346900;297583,00;,0000;Pre-registro - Voice;16:34:49;11/11/2022;#97441055;BRVERTCRA2S9;11/11/2022
CRI;VIRGOSEC;21L0823062;96;1012,81900600;97230,62;,0000;Pre-registro - Voice;16:33:21;11/11/2022;#97441034;BRIMWLCRIAF2;11/11/2022
CRI;VIRGOSEC;21L0823062;356;1012,81900600;360563,57;,0000;Pre-registro - Voice;16:32:55;11/11/2022;#97441028;BRIMWLCRIAF2;11/11/2022
COE;;IT5322K6C2H;10;1000,00000000;10000,00;;Registro;16:31:52;11/11/2022;#2022111113338428;;11/11/2022
...and so on.
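Once decoded, the text is a semicolon-separated CSV with one timestamp line before the header row. Here is a rough sketch of turning it into rows with the standard csv module. The sample below is built from lines of the output above and re-encoded locally so that it runs offline; with the real endpoint, the encoded string would be requests.get(url).text instead.

```python
import base64
import csv
import io

# Timestamp line, header and one data row copied from the decoded output above.
sample_text = (
    "Data/Hora da ultima atualizacao: 11/11/2022 16:38:20\n"
    "Instrumento Financeiro;Emissor;Codigo IF;Quantidade Negociada;"
    "Preco Negocio;Volume Financeiro R$;Taxa Negocio;Origem Negocio;"
    "Horario Negocio;Data Negocio;Cod. Identificador do Negocio;"
    "Cod. Isin;Data Liquidacao\n"
    "DEB;STAGENEBRA;MSGT23;2200;980,96524400;2158123,54;7,8701;"
    "Pre-registro - Voice;16:34:56;11/11/2022;#97441063;BRMSGTDBS050;12/11/2022\n"
)
# Stand-in for the server response: the same text, base64-encoded.
encoded = base64.b64encode(sample_text.encode("utf-8")).decode("ascii")

decoded = base64.b64decode(encoded).decode("utf-8")
lines = decoded.splitlines()
timestamp_line = lines[0]  # "Data/Hora da ultima atualizacao: ..."

# Everything after the first line is regular CSV with ';' as the delimiter.
reader = csv.DictReader(io.StringIO("\n".join(lines[1:])), delimiter=";")
rows = list(reader)

print(timestamp_line)
print(rows[0]["Emissor"], rows[0]["Quantidade Negociada"])
```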

Redirect hostname/endpoint to api.hostname/endpoint in django

I have my api built with this pattern: api.hostname/endpoint.
However there is a plugin to my app that uses hostname/endpoint pattern.
I would like to solve it on the backend side by adding redirection to api.hostname/endpoint.
I tried to experiment with adding urls or paths to urlpatterns, but it didn't help me.
How can I achieve it? Any ideas?
Regards,
Maciej.
You can use urllib to build the target URL:
import urllib.parse
url = "https://hostname/endpoint"
split_url = urllib.parse.urlsplit(url)
# The split result has no "endpoint" attribute; use .path, which already starts with "/"
result = f"{split_url.scheme}://api.{split_url.hostname}{split_url.path}"
print(result)
>> https://api.hostname/endpoint
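The same idea can be wrapped in a small function that also keeps the query string and any port. This is only a sketch: the name to_api_url and the "api." prefix argument are assumptions for illustration, and it does not handle URLs that carry user credentials in the host part.

```python
from urllib.parse import urlsplit, urlunsplit


def to_api_url(url, prefix="api."):
    """Return url with prefix prepended to its host (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.netloc.startswith(prefix):
        # Already pointing at the api host; leave it untouched.
        return url
    # netloc includes the port if there is one, so "host:8000" stays intact.
    return urlunsplit(parts._replace(netloc=prefix + parts.netloc))


print(to_api_url("https://hostname/endpoint?x=1"))
```

In a Django project the returned string could be handed to a redirect response (for example django.http.HttpResponseRedirect) from a view or middleware that catches the hostname/endpoint requests. How that is wired up depends on the project and is not shown in the answer.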

How to fix the tainted source of data in Coverity issue

I'm reading a URL from a JSON file, and the Coverity report flags it as tainted (an untrusted source of data). The issue is reported as URL Manipulation at the point where I use the URL attribute from the JSON.
Can anyone suggest ways to mitigate the URL Manipulation error in the Coverity report?
It means you need to parse/validate the url string.
You can do this in a number of ways - either with your own regex, or with purpose-built libraries (urllib, validators).
For example:
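from urllib.parse import urlparse
URL_TO_TEST = "https:/www.google.com"  # single slash: parsed with an empty netloc, so this raises
result = urlparse(URL_TO_TEST)
# "(a and b) is False" is never true for strings; test truthiness instead
if not (result.scheme and result.netloc):
    raise ValueError("Invalid url string")
print(f"url: '{URL_TO_TEST}' is valid")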
The Snyk page for the same type of issue provides some good info.
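A stricter variant of the check above is to accept only URLs whose scheme and host are on an allowlist. The sketch below assumes such a list exists for your application. The names ALLOWED_SCHEMES, ALLOWED_HOSTS and validate_url are made up for illustration, and whether Coverity treats this as sanitizing the value depends on its configuration.

```python
from urllib.parse import urlparse

# Hypothetical allowlists: replace with the values your application expects.
ALLOWED_SCHEMES = {"https"}
ALLOWED_HOSTS = {"www.google.com"}


def validate_url(url):
    """Return url unchanged if it passes the allowlist, else raise ValueError."""
    result = urlparse(url)
    if result.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"Scheme not allowed: {result.scheme!r}")
    # hostname is None when the URL has no network location at all.
    if result.hostname not in ALLOWED_HOSTS:
        raise ValueError(f"Host not allowed: {result.hostname!r}")
    return url


for candidate in ("https://www.google.com/search", "https:/www.google.com"):
    try:
        print("accepted:", validate_url(candidate))
    except ValueError as exc:
        print("rejected:", exc)
```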

python web-scraping yahoo finance

Since Yahoo Finance updated their website, some tables seem to be created dynamically and are not actually stored in the HTML (I used to get this information using BeautifulSoup and urllib, but this no longer works). I am after the Analyst tables, for example for ADP, specifically the Earnings Estimates for Year Ago EPS (Current Year column). You cannot get this information from the API.
I found this link, which works well for the Analyst Recommendation Trends. Does anyone know how to do something similar for the main table on this page? (Link: python lxml etree applet information from yahoo)
I tried to follow the steps taken, but frankly it's beyond me.
Returning the whole table is all I need; I can pick out bits from there. Cheers.
To get that data, open Chrome DevTools and select the Network tab with the XHR filter. If you click on the ADP request, you can see the link under Request URL.
You can use the Requests library to make the request and parse the JSON response from the site.
import requests
from pprint import pprint
url = 'https://query1.finance.yahoo.com/v10/finance/quoteSummary/ADP?formatted=true&crumb=ILlIC9tOoXt&lang=en-US&region=US&modules=upgradeDowngradeHistory%2CrecommendationTrend%2CfinancialData%2CearningsHistory%2CearningsTrend%2CindustryTrend%2CindexTrend%2CsectorTrend&corsDomain=finance.yahoo.com'
r = requests.get(url).json()
pprint(r)
Further to vold's answer above, and using the answer in the link I posted above (credit to saaj): this gives just the dataset I need and is neater when calling the module. I am not sure what the crumb parameter is, but it seems to work OK without it.
import json
from pprint import pprint
from urllib.request import urlopen
from urllib.parse import urlencode
def parse():
    host = 'https://query1.finance.yahoo.com'
    #host = 'https://query2.finance.yahoo.com' # try if above doesn't work
    path = '/v10/finance/quoteSummary/%s' % 'ADP'
    params = {
        'formatted' : 'true',
        #'crumb' : 'ILlIC9tOoXt',
        'lang' : 'en-US',
        'region' : 'US',
        'modules' : 'earningsTrend',
        'domain' : 'finance.yahoo.com'
    }
    response = urlopen('{}{}?{}'.format(host, path, urlencode(params)))
    data = json.loads(response.read().decode())
    pprint(data)
if __name__ == '__main__':
    parse()
Other modules (just add a comma between them):
assetProfile
financialData
defaultKeyStatistics
calendarEvents
incomeStatementHistory
cashflowStatementHistory
balanceSheetHistory
recommendationTrend
upgradeDowngradeHistory
earningsHistory
earningsTrend
industryTrend
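To request several of these modules at once, the names are joined with commas into the single modules parameter. Here is a small sketch of building such a URL, reusing the host and path from the code above. The function name build_quote_summary_url is made up for illustration, and the example only builds the URL without calling the endpoint, whose current behavior is not verified here.

```python
from urllib.parse import urlencode

HOST = "https://query1.finance.yahoo.com"


def build_quote_summary_url(symbol, modules):
    """Build a quoteSummary URL for symbol (hypothetical helper)."""
    path = "/v10/finance/quoteSummary/%s" % symbol
    params = {
        "formatted": "true",
        "lang": "en-US",
        "region": "US",
        # Several modules go into one comma-separated value.
        "modules": ",".join(modules),
    }
    # urlencode percent-encodes the commas as %2C, matching the URL
    # captured from DevTools earlier in this thread.
    return "{}{}?{}".format(HOST, path, urlencode(params))


url = build_quote_summary_url("ADP", ["earningsTrend", "earningsHistory"])
print(url)
```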
On GitHub, c0redumb has proposed a complete solution. You can download yqd.py. After importing it, you can get Yahoo Finance data with one line of code, as below.
import yqd
yf_data = yqd.load_yahoo_quote('GOOG', '20170722', '20170725')
The result 'yf_data' is:
['Date,Open,High,Low,Close,Adj Close,Volume',
'2017-07-24,972.219971,986.200012,970.770020,980.340027,980.340027,3248300',
'2017-07-25,953.809998,959.700012,945.400024,950.700012,950.700012,4661000',
'']

wikitools parsing error

I'm using the wikitools package to parse Wikipedia. I just copied this example from the documentation, but it's not working. When I run this code, I get the following error:
Invalid JSON,trying requesting again. Can you please help me? Thanks.
from wikitools import wiki
from wikitools import api
# create a Wiki object
site = wiki.Wiki("http://my.wikisite.org/w/api.php")
# define the params for the query
params = {'action':'query', 'titles':'Papori'}
# create the request object
request = api.APIRequest(site, params)
# query the API
result = request.query()
The "http://my.wikisite.org/w/api.php" is only an example, there is no MediaWiki under that domain. Try with "http://en.wikipedia.org/w/api.php" which searches in the English Wikipedia.
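To check what the API returns for the same parameters without wikitools, the query URL can be built by hand and opened in a browser. This sketch only constructs the URL. The format=json parameter is an addition not present in the question's code; it asks the MediaWiki API for a JSON response.

```python
from urllib.parse import urlencode

API_URL = "http://en.wikipedia.org/w/api.php"

# Same query parameters as in the question, plus format=json (added here).
params = {"action": "query", "titles": "Papori", "format": "json"}

query_url = "{}?{}".format(API_URL, urlencode(params))
print(query_url)
```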
