How to fix a tainted source of data in a Coverity issue - Python

I am trying to read a URL from a JSON file, which the Coverity report flags as tainted (an untrusted source of data). The issue is called URL Manipulation and points at the place where I use the URL attribute from the JSON.
Can anyone suggest ways to mitigate the URL Manipulation error in the Coverity report?

It means you need to parse/validate the URL string.
You can do this in a number of ways - either with your own regex, or with purpose-built libraries (urllib, validators).
For example:
from urllib.parse import urlparse

URL_TO_TEST = "https://www.google.com"
result = urlparse(URL_TO_TEST)
if not (result.scheme and result.netloc):  # both parts must be present
    raise ValueError("Invalid url string")
print(f"url: '{URL_TO_TEST}' is valid")
The Snyk page for the same type of issue provides some good info.


Download csv file which has javascript on-click download with requests

I'm trying to download a csv file from here: Link, after clicking on "Acesse todos os negócios realizados até o momento", which is in blue next to an image of a cloud with an arrow.
I know how to solve the problem with selenium, but it's such a heavy library that I'd like to learn other solutions (especially faster ones). My main idea was to use requests, since I think it's the fastest approach.
My code:
import requests
url="https://bvmf.bmfbovespa.com.br/NegociosRealizados/Registro/DownloadArquivoDiretorio?data="
r=requests.get(url,allow_redirects=True)
r.text
r.text is a string of 459432 characters and gives the following output (just part of it here):
RGF0YS9Ib3JhIGRhIHVsdGltYSBhdHVhbGl6YWNhbzogMTEvMTEvMjAyMiAxNTo1MDoxNApJbnN0cnVtZW50byBGaW5hbmNlaXJvO0VtaXNzb3I7Q29kaWdvIElGO1F1YW50aWRhZGUgTmVnb2NpYWRhO1ByZWNvIE5lZ29jaW87Vm9sdW1lIEZpbmFuY2Vpcm8gUiQ7VGF4YSBOZWdvY2lvO09yaWdlbSBOZWdvY2lvO0hvcmFyaW8gTmVnb2NpbztEYXRhIE5lZ29jaW87Q29kLiBJZGVudGlmaWNhZG9yIGRvIE5lZ29jaW87Q29kLiBJc2luO0RhdGEgTGlxdWlkYWNhbwpDUkE7RUNPQUdST1NFQztDUkEwMTcwMDJCRTsxMDc7MTMxMyw3MDUxMjAwMDsxNDA1NjYsNDU7LDAwMDA7UHJlLXJlZ2lzdHJvIC0gVm9pY2U7MTU6NDc6NDk7MTEvMTEvMjAyMjsjOTc0MzQ0ODY7QlJFQ09BQ1JBMVozOzExLzExLzIwMjIKQ1JJO09QRUFTRUNVUklUSVpBRE9SQTsxOEcwNjg3NTIxOzIzNTsxMjQ4LDY5MTcwNDAwOzI5MzQ0Miw1NTssMDAwMDtQcmUtcmVnaXN0cm8gLSBWb2ljZTsxNTo0Nzo0OTsxMS8xMS8yMDIyOyM5NzQzNDQ4NDtCUlJCUkFDUkk0WTU7MTEvMTEvMjAyMgpERUI7UEVUUk9CUkFTO1BFVFIxNjs3NjsxMTU1LDkzNTgwODAwOzg3ODUxLDEyOzcsMzUwMDtQcmUtcmVnaXN0cm8gLSBWb2ljZTsxNTo0Nzo0OTsxMS8xMS8yMDIyOyM5NzQzNDQ4NTtCUlBFVFJEQlMwMDE7MTEvMTEvMjAyMgpERUI7VkFMRTtDVlJEQTY7Mjk7MzQsMDAwMDAwMDA7OTg2LDAwOywwMDAwO1ByZS1yZWdpc3RybyAtIFZvaWNlOzE1OjQ3OjA0OzExLzExLzIwMjI7Izk3NDMzOTM2O0JSVkFMRURCUzAyODsxMS8xMS8yMDIyCkRFQjtWQUxFO0NWUkRBNjsyOTszMywzMDAwMDAwMDs5NjUsNzA7LDAwMDA7UHJlLXJlZ2lzdHJvIC0gVm9pY2U7MTU6NDc6MDQ7MTEvMTEvMjAyMjsjOTc0MzM5Mzc7QlJWQUxFREJTMDI4OzExLzExLzIwMjIKREVCO0VRVUFUT1JJQUxUUkFOU01JO0VRVUExMTs1OTsxMDA3LDg0NDQ1MzAwOzU5NDYyLDgyOzcsMDMwMDtQcmUtcmVnaXN0cm8gLSBWb2ljZTsxNTo0Njo0MDsxMS8xMS8yMDIyOyM5NzQzMzYxNDtCUkVR...
I don't know what this string is or what to do with it. Is it encoded? Should I call another function with it? Am I calling the wrong link? Should I just try another approach? Is selenium not that bad?
Any help is appreciated. Thank you!
Extra info:
From devtools I found it's calling these JavaScript functions:
onclick is just that: <a href="#" onclick="carregarDownloadArquivo('')">
carregarDownloadArquivo (I don't know JavaScript; I tried to extract only this specific function from the js):
function carregarDownloadArquivo(n) {
    var t = (n != null && n != "") ? n : (getParameterByName("data") == null ? "" : getParameterByName("data"));
    $("#divloadArquivo").show();
    $.ajax({url: "/NegociosRealizados/Registro/DownloadArquivoDiretorio?data=" + t}).then(function (t, i, r) {
        var u;
        if (t != "null" && t != "") {
            var f = convertBase64(t), e = window.navigator.userAgent, o = e.indexOf("MSIE ");
            (n == null || n == "") && (n = retornaDataHoje());
            u = n + "_NEGOCIOSBALCAO.CSV";
            o > 0 || !!navigator.userAgent.match(/Trident.*rv\:11\./) ? DownloadArquivoIE(u, f, r.getAllResponseHeaders()) : DownloadArquivo(u, f, r.getAllResponseHeaders());
        }
        $("#divloadArquivo").hide();
    });
}
Extra functions to help understand carregarDownloadArquivo:
function DownloadArquivoIE(n, t, i) { var r = new Blob(t, {type: i}); navigator.msSaveBlob(r, n); }
function DownloadArquivo(n, t, i) { var r = new Blob(t, {type: i}); saveAs(r, n); }
function getParameterByName(n, t) {
    t || (t = window.location.href);
    n = n.replace(/[\[\]]/g, "\\$&");
    var r = new RegExp("[?&]" + n + "(=([^&#]*)|&|#|$)"), i = r.exec(t);
    return i ? (i[2] ? decodeURIComponent(i[2].replace(/\+/g, " ")) : "") : null;
}
I'm not so sure about the ajax and send calls, and I don't know how their code downloads the csv file.
Try to decode the base64-encoded response from the server:
import base64
import requests

url = "https://bvmf.bmfbovespa.com.br/NegociosRealizados/Registro/DownloadArquivoDiretorio?data="
t = base64.b64decode(requests.get(url).text)  # the endpoint returns base64-encoded CSV text
print(t.decode("utf-8"))
Prints:
Data/Hora da ultima atualizacao: 11/11/2022 16:38:20
Instrumento Financeiro;Emissor;Codigo IF;Quantidade Negociada;Preco Negocio;Volume Financeiro R$;Taxa Negocio;Origem Negocio;Horario Negocio;Data Negocio;Cod. Identificador do Negocio;Cod. Isin;Data Liquidacao
DEB;STAGENEBRA;MSGT23;2200;980,96524400;2158123,54;7,8701;Pre-registro - Voice;16:34:56;11/11/2022;#97441063;BRMSGTDBS050;12/11/2022
CRA;VERTSEC;CRA022006N5;306;972,49346900;297583,00;,0000;Pre-registro - Voice;16:34:49;11/11/2022;#97441055;BRVERTCRA2S9;11/11/2022
CRI;VIRGOSEC;21L0823062;96;1012,81900600;97230,62;,0000;Pre-registro - Voice;16:33:21;11/11/2022;#97441034;BRIMWLCRIAF2;11/11/2022
CRI;VIRGOSEC;21L0823062;356;1012,81900600;360563,57;,0000;Pre-registro - Voice;16:32:55;11/11/2022;#97441028;BRIMWLCRIAF2;11/11/2022
COE;;IT5322K6C2H;10;1000,00000000;10000,00;;Registro;16:31:52;11/11/2022;#2022111113338428;;11/11/2022
...and so on.
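If you also want to write the CSV to disk, here is a minimal sketch. The page's JavaScript names the file <date>_NEGOCIOSBALCAO.CSV via retornaDataHoje(), so the date format below is an assumption:
import base64
from datetime import date

import requests

url = "https://bvmf.bmfbovespa.com.br/NegociosRealizados/Registro/DownloadArquivoDiretorio?data="
csv_bytes = base64.b64decode(requests.get(url).text)

# Mimics the page's JavaScript naming scheme; the exact date format is a guess
filename = date.today().strftime("%Y-%m-%d") + "_NEGOCIOSBALCAO.CSV"
with open(filename, "wb") as f:
    f.write(csv_bytes)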

Connecting to the YouTube API and downloading URLs - getting KeyError

My goal is to connect to the YouTube API and download the URLs of specific music producers. I found the following script, which I used from this link: https://www.youtube.com/watch?v=_M_wle0Iq9M. In the video the code works beautifully. But when I try it on Python 2.7 it gives me KeyError: 'items'.
I know KeyErrors can occur when a dictionary is used incorrectly or when a key doesn't exist.
I have checked the Google developers site for YouTube to make sure that 'items' exists, and it does.
I am also aware that using get() may help with my problem, but I am not sure. Any suggestions for fixing my KeyError using the following code, or any suggestions on how to improve my code to reach my main goal of downloading the URLs (I have a YouTube API key)?
Here is the code:
# these modules help with HTTP requests to YouTube
import urllib
import urllib2
import json

API_KEY = open("/Users/ereyes/Desktop/APIKey.rtf", "r")
API_KEY = API_KEY.read()

searchTerm = raw_input('Search for a video:')
searchTerm = urllib.quote_plus(searchTerm)

url = 'https://www.googleapis.com/youtube/v3/search?part=snippet&q=' + searchTerm + '&key=' + API_KEY
response = urllib.urlopen(url)
videos = json.load(response)

videoMetadata = []  # declaring our list
for video in videos['items']:  # cycle through the items in the JSON response
    if video['id']['kind'] == 'youtube#video':  # make sure the item we are looking at is a video
        videoMetadata.append(video['snippet']['title'] +  # get the title and put it into the list
                             "\nhttp://youtube.com/watch?v=" + video['id']['videoId'])

videoMetadata.sort()  # sorts our list alphabetically
print("\nSearch Results:\n")  # print out search results
for metadata in videoMetadata:
    print(metadata + "\n")

raw_input('Press Enter to Exit')
The problem is most likely a combination of using an RTF file instead of a plain-text file for the API key, and you also seem to be unsure whether to use urllib or urllib2, since you imported both.
Personally, I would recommend requests, but either way I think you need to read() the contents of the response to get a string:
response = urllib.urlopen(url).read()
You can check that by printing the response variable.
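For example, here is a minimal sketch of the same search using requests (the plain-text key path is a placeholder, not from the original code; an RTF file contains markup, not just the key):
import requests

with open("/Users/ereyes/Desktop/APIKey.txt") as f:  # hypothetical plain-text file
    api_key = f.read().strip()

search_term = raw_input('Search for a video:')
params = {"part": "snippet", "q": search_term, "key": api_key}
response = requests.get("https://www.googleapis.com/youtube/v3/search", params=params)
data = response.json()

for item in data.get("items", []):  # .get() avoids the KeyError when the call fails
    if item["id"]["kind"] == "youtube#video":
        print(item["snippet"]["title"])
        print("http://youtube.com/watch?v=" + item["id"]["videoId"])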

Problems crawling wordreference

I am trying to crawl wordreference, but I am not succeeding.
The first problem I have encountered is that a big part of the page is loaded via JavaScript, but that shouldn't be much of a problem because I can see what I need in the source code.
So, for example, I want to extract the first two meanings for a given word; from this url: http://www.wordreference.com/es/translation.asp?tranword=crane I need to extract grulla and grúa.
This is my code:
import lxml.html as lh
import urllib2

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
doc = lh.parse(urllib2.urlopen(url))
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print i
The result is that I get an empty list.
I have tried to crawl it with scrapy too, with no success. I am not sure what is going on; the only way I have been able to crawl it is using curl, but that is sloppy. I want to do it in an elegant way, with Python.
Thank you very much
It looks like you need a User-Agent header to be sent; see Changing user agent on urllib2.urlopen.
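A minimal sketch of that fix with urllib2 (Python 2); the UA string is an arbitrary browser-like value:
import urllib2

req = urllib2.Request(
    'http://www.wordreference.com/es/translation.asp?tranword=crane',
    headers={'User-Agent': 'Mozilla/5.0'}  # any browser-like User-Agent
)
html = urllib2.urlopen(req).read()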
Also, just switching to requests would do the trick (it automatically sends the python-requests/version User Agent by default):
import lxml.html as lh
import requests

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
response = requests.get(url)
doc = lh.fromstring(response.content)
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print(i)
Prints:
grulla
grúa
plataforma
...
grulla blanca
grulla trompetera

Python urllib2.urlopen with @ symbol in URL

I'm playing around in Python and there's a URL that I'm trying to use which goes like this:
https://[username@domain.com]:[password]@domain.com/blah
This is my code:
response = urllib2.urlopen("https://[username@domain.com]:[password]@domain.com/blah")
html = response.read()
print("data=" + html)
This isn't going through; it doesn't like the @ symbols, and probably the : too. I tried searching, and I read something about unquote, but that's not doing anything. This is the error I get:
raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
httplib.InvalidURL: nonnumeric port: 'password@updates.opendns.com'
How do I get around this? The actual site is "https://updates.opendns.com/nic/update?hostname=
thank you!
URIs have a bunch of reserved characters separating distinguishable parts of the URI (/, ?, &, @ and a few others). If any of these characters appears in either the username (@ does in your case) or the password, it needs to be percent-encoded, or the URI becomes invalid.
In Python 3:
>>> from urllib import parse
>>> parse.quote("p@ssword?")
'p%40ssword%3F'
In Python 2:
>>> import urllib
>>> urllib.quote("p@ssword?")
'p%40ssword%3F'
Also, don't put the username and password in square brackets; that is not valid either.
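As an alternative sketch, requests lets you skip embedding credentials in the URL entirely by passing them separately (assuming the endpoint uses HTTP basic auth; the values are placeholders):
import requests

response = requests.get(
    "https://updates.opendns.com/nic/update?hostname=",
    auth=("username@domain.com", "password"),  # placeholders; no percent-encoding needed
)
print("data=" + response.text)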
Use urlencode! Not sure if urllib2 has it, but urllib has an urlencode function. One sec and I'll get back to you.
I did a quick check, and it seems that you need to use urllib instead of urllib2 for that... importing urllib and then using urllib.urlencode(YOUR URL) should work!
import urllib
url = urllib.urlencode(<your_url_here>)
EDIT: it's actually urllib2.quote()!
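For reference, a quick sketch of that last suggestion in Python 2 (urllib2 re-exports quote from urllib):
import urllib2

print urllib2.quote("p@ssword?")  # p%40ssword%3F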

wikitools parsing error

I'm using the wikitools package to parse Wikipedia. I just copied this example from the documentation, but it's not working. When I run this code, I get the following error:
Invalid JSON, trying requesting again. Can you please help me? Thanks.
from wikitools import wiki
from wikitools import api
# create a Wiki object
site = wiki.Wiki("http://my.wikisite.org/w/api.php")
# define the params for the query
params = {'action':'query', 'titles':'Papori'}
# create the request object
request = api.APIRequest(site, params)
# query the API
result = request.query()
The "http://my.wikisite.org/w/api.php" is only an example, there is no MediaWiki under that domain. Try with "http://en.wikipedia.org/w/api.php" which searches in the English Wikipedia.
