I'm trying to download a CSV file from here: Link, after clicking on "Acesse todos os negócios realizados até o momento", which is in blue next to an image of a cloud with an arrow.
I do know how to solve the problem with Selenium, but it's such a heavy library that I'd like to learn other solutions (especially faster ones). My main idea was to use requests, since I think it's the fastest approach.
My code:
import requests

url = "https://bvmf.bmfbovespa.com.br/NegociosRealizados/Registro/DownloadArquivoDiretorio?data="
r = requests.get(url, allow_redirects=True)
r.text
r.text is a string of 459,432 characters and gives the following output (only part of it is shown here):
RGF0YS9Ib3JhIGRhIHVsdGltYSBhdHVhbGl6YWNhbzogMTEvMTEvMjAyMiAxNTo1MDoxNApJbnN0cnVtZW50byBGaW5hbmNlaXJvO0VtaXNzb3I7Q29kaWdvIElGO1F1YW50aWRhZGUgTmVnb2NpYWRhO1ByZWNvIE5lZ29jaW87Vm9sdW1lIEZpbmFuY2Vpcm8gUiQ7VGF4YSBOZWdvY2lvO09yaWdlbSBOZWdvY2lvO0hvcmFyaW8gTmVnb2NpbztEYXRhIE5lZ29jaW87Q29kLiBJZGVudGlmaWNhZG9yIGRvIE5lZ29jaW87Q29kLiBJc2luO0RhdGEgTGlxdWlkYWNhbwpDUkE7RUNPQUdST1NFQztDUkEwMTcwMDJCRTsxMDc7MTMxMyw3MDUxMjAwMDsxNDA1NjYsNDU7LDAwMDA7UHJlLXJlZ2lzdHJvIC0gVm9pY2U7MTU6NDc6NDk7MTEvMTEvMjAyMjsjOTc0MzQ0ODY7QlJFQ09BQ1JBMVozOzExLzExLzIwMjIKQ1JJO09QRUFTRUNVUklUSVpBRE9SQTsxOEcwNjg3NTIxOzIzNTsxMjQ4LDY5MTcwNDAwOzI5MzQ0Miw1NTssMDAwMDtQcmUtcmVnaXN0cm8gLSBWb2ljZTsxNTo0Nzo0OTsxMS8xMS8yMDIyOyM5NzQzNDQ4NDtCUlJCUkFDUkk0WTU7MTEvMTEvMjAyMgpERUI7UEVUUk9CUkFTO1BFVFIxNjs3NjsxMTU1LDkzNTgwODAwOzg3ODUxLDEyOzcsMzUwMDtQcmUtcmVnaXN0cm8gLSBWb2ljZTsxNTo0Nzo0OTsxMS8xMS8yMDIyOyM5NzQzNDQ4NTtCUlBFVFJEQlMwMDE7MTEvMTEvMjAyMgpERUI7VkFMRTtDVlJEQTY7Mjk7MzQsMDAwMDAwMDA7OTg2LDAwOywwMDAwO1ByZS1yZWdpc3RybyAtIFZvaWNlOzE1OjQ3OjA0OzExLzExLzIwMjI7Izk3NDMzOTM2O0JSVkFMRURCUzAyODsxMS8xMS8yMDIyCkRFQjtWQUxFO0NWUkRBNjsyOTszMywzMDAwMDAwMDs5NjUsNzA7LDAwMDA7UHJlLXJlZ2lzdHJvIC0gVm9pY2U7MTU6NDc6MDQ7MTEvMTEvMjAyMjsjOTc0MzM5Mzc7QlJWQUxFREJTMDI4OzExLzExLzIwMjIKREVCO0VRVUFUT1JJQUxUUkFOU01JO0VRVUExMTs1OTsxMDA3LDg0NDQ1MzAwOzU5NDYyLDgyOzcsMDMwMDtQcmUtcmVnaXN0cm8gLSBWb2ljZTsxNTo0Njo0MDsxMS8xMS8yMDIyOyM5NzQzMzYxNDtCUkVR...
I don't know what this string is or what to do with it. Is it encoded? Should I call another function with it? Am I calling the wrong link? Should I just try another approach? Is selenium not that bad?
Any help is appreciated. Thank you!
Extra info:
From DevTools I found it's calling these JavaScript functions:
onclick is just that: <a href="#" onclick="carregarDownloadArquivo('')">
carregarDownloadArquivo (I don't know JavaScript; I tried to extract only this specific function from the js):
function carregarDownloadArquivo(n) {
    var t = n != null && n != "" ? n : getParameterByName("data") == null ? "" : getParameterByName("data");
    $("#divloadArquivo").show();
    $.ajax({url: "/NegociosRealizados/Registro/DownloadArquivoDiretorio?data=" + t}).then(function (t, i, r) {
        var u;
        if (t != "null" && t != "") {
            var f = convertBase64(t), e = window.navigator.userAgent, o = e.indexOf("MSIE ");
            (n == null || n == "") && (n = retornaDataHoje());
            u = n + "_NEGOCIOSBALCAO.CSV";
            o > 0 || !!navigator.userAgent.match(/Trident.*rv\:11\./)
                ? DownloadArquivoIE(u, f, r.getAllResponseHeaders())
                : DownloadArquivo(u, f, r.getAllResponseHeaders());
        }
        $("#divloadArquivo").hide();
    });
}
Extra functions to understand carregardownload:
function DownloadArquivoIE(n, t, i) { var r = new Blob(t, {type: i}); navigator.msSaveBlob(r, n); }
function DownloadArquivo(n, t, i) { var r = new Blob(t, {type: i}); saveAs(r, n); }
function getParameterByName(n, t) {
    t || (t = window.location.href);
    n = n.replace(/[\[\]]/g, "\\$&");
    var r = new RegExp("[?&]" + n + "(=([^&#]*)|&|#|$)"), i = r.exec(t);
    return i ? (i[2] ? decodeURIComponent(i[2].replace(/\+/g, " ")) : "") : null;
}
I'm not so sure about the ajax and send calls, and I also don't know how their code is downloading the CSV file.
Try to decode the base64-encoded response from the server:
import base64
import requests

url = "https://bvmf.bmfbovespa.com.br/NegociosRealizados/Registro/DownloadArquivoDiretorio?data="
# The endpoint returns the CSV contents as a base64-encoded string
t = base64.b64decode(requests.get(url).text)
print(t.decode("utf-8"))
Prints:
Data/Hora da ultima atualizacao: 11/11/2022 16:38:20
Instrumento Financeiro;Emissor;Codigo IF;Quantidade Negociada;Preco Negocio;Volume Financeiro R$;Taxa Negocio;Origem Negocio;Horario Negocio;Data Negocio;Cod. Identificador do Negocio;Cod. Isin;Data Liquidacao
DEB;STAGENEBRA;MSGT23;2200;980,96524400;2158123,54;7,8701;Pre-registro - Voice;16:34:56;11/11/2022;#97441063;BRMSGTDBS050;12/11/2022
CRA;VERTSEC;CRA022006N5;306;972,49346900;297583,00;,0000;Pre-registro - Voice;16:34:49;11/11/2022;#97441055;BRVERTCRA2S9;11/11/2022
CRI;VIRGOSEC;21L0823062;96;1012,81900600;97230,62;,0000;Pre-registro - Voice;16:33:21;11/11/2022;#97441034;BRIMWLCRIAF2;11/11/2022
CRI;VIRGOSEC;21L0823062;356;1012,81900600;360563,57;,0000;Pre-registro - Voice;16:32:55;11/11/2022;#97441028;BRIMWLCRIAF2;11/11/2022
COE;;IT5322K6C2H;10;1000,00000000;10000,00;;Registro;16:31:52;11/11/2022;#2022111113338428;;11/11/2022
...and so on.
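If you also want to save the decoded payload to disk the way the page's JavaScript does, here is a minimal sketch; only the _NEGOCIOSBALCAO.CSV suffix comes from the JS above, while the date format in the filename and the use of the csv module are my assumptions:
import base64
import csv
import datetime
import io

import requests

# An empty "data=" parameter returns the file for the current day, as in the question
url = "https://bvmf.bmfbovespa.com.br/NegociosRealizados/Registro/DownloadArquivoDiretorio?data="
csv_bytes = base64.b64decode(requests.get(url).text)  # the body is base64-encoded CSV text

# Mimic the JS filename pattern <date>_NEGOCIOSBALCAO.CSV (date format assumed)
today = datetime.date.today().strftime("%Y-%m-%d")
with open(f"{today}_NEGOCIOSBALCAO.CSV", "wb") as f:
    f.write(csv_bytes)

# The data is semicolon-delimited; the first line is an "ultima atualizacao" timestamp
rows = list(csv.reader(io.StringIO(csv_bytes.decode("utf-8")), delimiter=";"))
print(rows[1])  # column headers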
Related
I'm trying to read a URL from a JSON file, which the Coverity report flags as tainted (an untrusted source of data). The issue is called URL Manipulation and points to where I used the URL attribute from the JSON.
Can anyone suggest ways to mitigate the URL Manipulation error in the Coverity report?
It means you need to parse/validate the url string.
You can do this in a number of ways - either with your own regex, or with purpose-built libraries (urllib, validators).
For example:
from urllib.parse import urlparse

URL_TO_TEST = "https:/www.google.com"  # note the missing slash: this should fail validation

result = urlparse(URL_TO_TEST)
# urlparse does not raise on a malformed URL; check that both a scheme and a host were found
if not (result.scheme and result.netloc):
    raise ValueError("Invalid url string")
print(f"url: '{URL_TO_TEST}' is valid")
The Snyk page for the same type of issue provides some good info.
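Since the tainted value comes from a JSON file, one common mitigation is to validate (and ideally allow-list) the URL right where it is read. A minimal sketch, assuming a hypothetical config.json with a "url" attribute and an allow-list of hosts you trust:
import json
from urllib.parse import urlparse

ALLOWED_HOSTS = {"api.example.com"}  # hypothetical allow-list; adjust to your environment

def load_validated_url(path: str) -> str:
    with open(path) as f:
        config = json.load(f)
    url = config["url"]  # untrusted value from the JSON file
    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.netloc not in ALLOWED_HOSTS:
        raise ValueError(f"Refusing untrusted url: {url!r}")
    return url

# url = load_validated_url("config.json")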
Does anyone know a strategy to bypass an HTML late-load problem?
Here, a table doesn't load on the page I fetch with a Python request.
Found the API call for it. The root URL is: https://br.advfn.com/common/bov-options/api, and an example call is to https://br.advfn.com/common/bov-options/api?symbol=PETR4, where PETR4 is passed as an argument.
Just use a GET request:
import requests

symbol = "PETR4"
res = requests.get(f"https://br.advfn.com/common/bov-options/api?symbol={symbol}")
print(res.text)
The result:
{"result":[{"symbol":"PETRF286","type":"Call","style":"A","strike_price":"28,46","expiry_date":"18\/06\/2021","volume":"28912100","volume_form":"28.912.100","change_percentage":"25,0%","url":"\/p.php?pid=quote&symbol=BOV%5EPETRF286","class":"up"},{"symbol":"PETRF296","type":"Call","style":"A","strike_price":"28,96","expiry_date":"18\/06\/2021","volume":"25247000","volume_form":"25.247.000
...
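Since the endpoint returns JSON, you can also let requests parse it rather than printing the raw text. A small sketch; the field names are taken from the response shown above:
import requests

symbol = "PETR4"
res = requests.get(f"https://br.advfn.com/common/bov-options/api?symbol={symbol}")
res.raise_for_status()

# "result" is a list of option contracts
for opt in res.json()["result"][:5]:
    print(opt["symbol"], opt["type"], opt["strike_price"], opt["expiry_date"], opt["volume_form"])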
I'm trying to automate the process of creating an account for something, let's call it X, but I can't figure out what to do.
I saw this code somewhere,
import urllib
import urllib2
import webbrowser

data = urllib.urlencode({'q': 'Python'})
url = 'http://duckduckgo.com/html/'
full_url = url + '?' + data
response = urllib2.urlopen(full_url)
with open("results.html", "w") as f:
    f.write(response.read())
webbrowser.open("results.html")
But I can't figure out how to modify it for my use.
I would highly recommend using Selenium + WebDriver for this, since your question appears to be UI- and browser-based. You can install Selenium via 'pip install selenium' in most cases. Here are a couple of good references to get started.
- http://selenium-python.readthedocs.io/
- https://pypi.python.org/pypi/selenium
Also, if this process needs to drive the browser headlessly, look into including PhantomJS (via GhostDriver), which can be downloaded from the phantomjs.org website.
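As a minimal sketch of that approach, here is Selenium driving the DuckDuckGo form from the snippet in the question; the q field name comes from that snippet, while everything about the real signup form for X is unknown, so treat this purely as a pattern:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # or webdriver.Firefox(); requires the matching driver installed
try:
    driver.get("https://duckduckgo.com/html/")
    # Locate the form field and fill it in; a signup flow would do the same for name/email/password
    search_box = driver.find_element(By.NAME, "q")
    search_box.send_keys("Python")
    search_box.submit()
    print(driver.title)
finally:
    driver.quit()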
Forgive me if I come straight out with it, but Python is driving me nuts over something that seemed quite simple.
In a nutshell
I'm writing an extension for a music video scraper that is responsible for getting the fanart backdrop.
Here is the URL:
github.com/MViDLibraryToolKit/.../APICaller
So I was able to call the Fanart.tv API and receive the right JSON response. My problem is that I can't manage to collect the URLs under the element "artistbackground".
I searched the internet and found a very similar post here on Stack Overflow, but unfortunately it concerned Python 2, API v2, and a different category at fanart.tv, so I wasn't able to make use of it. Here it was
Anyway, here is my poor try at collecting the URLs into a list:
# --------------------- Response processing
# Output for debugging
# print(fanartTVresp)
# http://webservice.fanart.tv/v3/music/albums/ba853904-ae25-4ebb-89d6-c44cfbd71bd2?api_key=fdadba00cfaaf3621eaa748669256a9e&client_key=dce01d75553d7e3fbc2ad742aaf5d371
# list to fill
url_list = []
# load the web response
json_response = json.loads(fanartTVresp)
# loop over the element artistbackground
for artistbackground in json_response:
    url = urllib.parse.quote(['url'], ':/')
    if url:
        url_list.append(url)
print(url_list)
The libs I loaded...
import musicbrainzngs
import urllib
import json
import socket
from pprint import pprint
from urllib.parse import quote
You can find the rest of the code at my GitHub link. Please help me, it's driving me crazy ^^
Kind regards
P.S. Please excuse my English, I come from Germany :)
I think I finally got it.
# URL list for background images
url_list = []
# set only for debug / value comes from the PowerShell runtime later
location = os.path.abspath('C:/temp')
# decode json
json_response = json.loads(fanartTVresp.decode())
# "artistbackground" is a list of dicts; each entry's "url" key holds the image URL
bgitems = json_response["artistbackground"]
# iterate the items and collect the URLs
for bgcover in bgitems:
    url_list.append(bgcover["url"])
print(url_list)
After getting some hours of sleep I realized that json.loads deserializes the response into regular Python objects (dicts and lists). Correct me if I'm wrong.
Anyway, it finally works!
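As a follow-up, downloading the collected backdrops into the location directory could look roughly like this; the .jpg extension is an assumption about what fanart.tv serves:
import os
import urllib.request

# Fetch each collected background image into the target directory
for i, bg_url in enumerate(url_list):
    target = os.path.join(location, f"artistbackground_{i}.jpg")  # extension assumed
    urllib.request.urlretrieve(bg_url, target)
    print(f"saved {bg_url} -> {target}")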
I am trying to crawl wordreference, but I am not succeeding.
The first problem I have encountered is that a big part of the page is loaded via JavaScript, but that shouldn't be much of a problem because I can see what I need in the source code.
So, for example, I want to extract for a given word, the first two meanings, so in this url: http://www.wordreference.com/es/translation.asp?tranword=crane I need to extract grulla and grúa.
This is my code:
import lxml.html as lh
import urllib2

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
doc = lh.parse(urllib2.urlopen(url))
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print i
The result is that I get an empty list.
I have tried to crawl it with Scrapy too, with no success. I am not sure what is going on; the only way I have been able to crawl it is using curl, but that is sloppy. I want to do it in an elegant way, with Python.
Thank you very much
It looks like you need a User-Agent header to be sent, see Changing user agent on urllib2.urlopen.
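If you want to stay with urllib2, a minimal sketch of sending an explicit User-Agent (the header value here is just an example browser string):
import lxml.html as lh
import urllib2

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
# Build a Request with a custom User-Agent so the site doesn't reject the default one
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
doc = lh.parse(urllib2.urlopen(req))
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print i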
Also, just switching to requests would do the trick (it automatically sends the python-requests/version User Agent by default):
import lxml.html as lh
import requests

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
response = requests.get(url)
doc = lh.fromstring(response.content)
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print(i)
Prints:
grulla
grúa
plataforma
...
grulla blanca
grulla trompetera
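Since you only need the first two meanings, you can slice the list from the snippet above:
# The XPath returns the translations in page order, so the first two are the main meanings
print(trans[:2])  # ['grulla', 'grúa']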