Can't download entire file using urllib2 - python

Hi, I'm trying to download an Excel file from this URL (http://www.sicom.gov.co/precios/controller?accion=ExportToExcel) and then parse it using xlrd.
The problem is that when I open that URL in a browser I get an Excel file of roughly 2MB, but when I download it using urllib2, httplib2, or even curl from the command line, I only get a 4k file, and parsing that incomplete file obviously fails miserably.
Strangely enough, xlrd seems to be able to read the correct sheet name from the downloaded file, so I'm guessing the file is the right one, but it is clearly incomplete.
Here is some sample code of what I'm trying to achieve
import urllib2
from xlrd import open_workbook

excel_url = 'http://www.sicom.gov.co/precios/controller?accion=ExportToExcel'
result = urllib2.urlopen(excel_url)
wb = open_workbook(file_contents=result.read())
response = ""
for s in wb.sheets():
    response += 'Sheet:' + s.name + '<br>'
    for row in range(s.nrows):
        values = []
        for col in range(s.ncols):
            value = s.cell(row, col).value
            if value:
                values.append(str(value) + " not empty")
            else:
                # col and row are ints, so format them rather than concatenating
                values.append("Value at %d, %d was empty" % (col, row))
        response += str(values) + "<br>"

You have to call your first URL first; it seems to set a cookie or something like that. Then call the second one to download the Excel file. For that kind of job you should prefer http://docs.python-requests.org/en/latest/#, because it's much easier to use than the standard-library tools and it handles special cases (like cookies) much better by default.
import requests

s = requests.Session()
s.get('http://www.sicom.gov.co/precios/controller?accion=Home&option=SEARCH_PRECE')
response = s.get('http://www.sicom.gov.co/precios/controller?accion=ExportToExcel')
with open('out.xls', 'wb') as f:
    f.write(response.content)
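Since the bytes are already in memory, you can also hand them straight to xlrd without writing to disk. A small sketch reusing the open_workbook call from the question:

import requests
from xlrd import open_workbook

s = requests.Session()
s.get('http://www.sicom.gov.co/precios/controller?accion=Home&option=SEARCH_PRECE')
response = s.get('http://www.sicom.gov.co/precios/controller?accion=ExportToExcel')
wb = open_workbook(file_contents=response.content)  # no temporary file needed
print wb.sheet_names()  # should now list the complete workbook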

Related

HTML Diff File is getting malformed

With the difflib library I am trying to generate a diff file in HTML format. It works most of the time, but occasionally the generated HTML is malformed: sometimes it doesn't have all the content, and sometimes the lines are not in the proper place.
Below is the code I am using:
import difflib
import os  # needed for the os.path.exists / os.makedirs calls below

try:
    print("Reading file from first file")
    firstfile = open(firstFilePath, "r")
    contentsFirst = firstfile.readlines()
    print("Reading file from second file")
    secondfile = open(secondFilePath, "r")
    contentsSecond = secondfile.readlines()
    print("Creating diff file:")
    config_diff = difflib.HtmlDiff(wrapcolumn=70).make_file(contentsSecond, contentsFirst)
    if not os.path.exists(diff_file_path):
        os.makedirs(diff_file_path)
    final_path = diff_file_path + "/" + diff_file_name + '.html'
    diff_file = open(final_path, 'w')
    diff_file.write(config_diff)
    print("Diff file is generated:")
except Exception as error:
    print("Exception occurred in create_diff_file " + str(error))
    raise Exception(str(error))
This piece of code is called from a threaded program. With a retry I do get the desired result, but I don't know the reason for the malformed and inconsistent diff file. If someone can help me find the actual reason behind it and propose a solution, that would be helpful.
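One plausible cause, offered as an assumption rather than a confirmed diagnosis: none of the file handles above are ever closed, so under threading another reader can open the diff file before its buffered HTML has been flushed. A minimal sketch of the same flow with deterministic closing:

import difflib
import os

def create_diff_file(firstFilePath, secondFilePath, diff_file_path, diff_file_name):
    # 'with' closes each handle when its block ends, so the HTML
    # is fully flushed to disk before anyone else reads it
    with open(firstFilePath, "r") as firstfile:
        contentsFirst = firstfile.readlines()
    with open(secondFilePath, "r") as secondfile:
        contentsSecond = secondfile.readlines()
    config_diff = difflib.HtmlDiff(wrapcolumn=70).make_file(contentsSecond, contentsFirst)
    if not os.path.exists(diff_file_path):
        os.makedirs(diff_file_path)
    final_path = os.path.join(diff_file_path, diff_file_name + '.html')
    with open(final_path, 'w') as diff_file:
        diff_file.write(config_diff)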

Using python to download table data without .csv file address provided

My purpose is to download the data from this website:
http://transoutage.spp.org/
When opening this website, at the bottom of the page there is a description of how to auto-download the data. For example:
http://transoutage.spp.org/report.aspx?download=true&actualendgreaterthan=3/1/2018&includenulls=true
The code I wrote is this:
import requests
ul_begin = 'http://transoutage.spp.org/report.aspx?download=true'
timeset = '3/1/2018' #define the time, m/d/yyyy
fn = ['&actualendgreaterthan='] + [timeset] + ['&includenulls=true']
fn = ''.join(fn)
ul = ul_begin+fn
r = requests.get(ul, verify=False)
If you enter the web address
http://transoutage.spp.org/report.aspx?download=true&actualendgreaterthan=3/1/2018&includenulls=true
into Chrome, it will auto-download the data as a .csv file. I do not know how to continue my code from here.
Please help!
You need to write the response you receive to a file:
r = requests.get(ul, verify=False)
if 200 <= r.status_code < 300:
    # The request has succeeded
    file_path = '<path_where_file_has_to_be_downloaded>'
    f = open(file_path, 'wb')  # binary mode, since r.content is raw bytes
    f.write(r.content)
    f.close()
This will work properly if the csv file is small, but for large files you need to use the stream parameter so the download happens in chunks: http://masnun.com/2016/09/18/python-using-the-requests-module-to-download-large-files-efficiently.html
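A minimal sketch of that streaming approach (the chunk size is an arbitrary choice, not something the linked post mandates):

import requests

# ul and file_path as defined above
r = requests.get(ul, verify=False, stream=True)
r.raise_for_status()
with open(file_path, 'wb') as f:
    # iter_content reads the body piece by piece, keeping memory usage flat
    for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)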

How can I send an xls file one line per time through http response?

Problem
Recently I ran into a problem with exporting a large amount of data and sending it to the client side.
The detailed problem description shows in the linked page below:
How can I adapt my code to make it compatible to Microsoft Excel?
What's Different
Although the first answer on the linked page helped me solve the problem of garbled text when opening the .csv file in Excel, as I commented there, that approach would be a little inconvenient for the user. So I tried to export an .xls file directly.
My Question Is
Because the dataset is quite large, I cannot generate the whole .xls file at once; maybe it's a good idea to send one line or several lines to the client side at a time, as I did with the .csv file.
So how can I send the .xls data piece by piece to the client side? Or is there any better recommendation?
I would really appreciate your answers!
This is a possible solution using the dependencies Flask + SQLAlchemy + pandas. Instead of building the whole workbook at once, it paginates the query and exports one page per request:
def export_query(query, file_name=None):
    results = db.session.execute(query)
    fetched = results.fetchall()
    dataframe = pd.DataFrame(fetched)
    dataframe.columns = get_query_coloumn_names(query)
    base_path = current_app.config['UPLOAD_FOLDER']
    workingdocs = base_path + datetime.now().strftime("%Y%m%d%H%M%S") + '/'
    if not os.path.exists(workingdocs):
        os.makedirs(workingdocs)
    if file_name is None:
        file_name = workingdocs + str(uuid.uuid4()) + '-' + 'export.xlsx'
    else:
        file_name = workingdocs + file_name
    dataframe.to_excel(file_name)
    return file_name

def export_all(q, page_limit, page):
    query = db.session.query(...).\
        outerjoin(..).\
        filter(..).\
        order_by(...)
    paging_query = query.paginate(page, page_limit, True)
    # TODO need to return total to help user know to keep trying paging_query.total
    return export_query(paging_query)

@api.route('/export_excel/', methods=['POST'])
@permission_required(Permission.VIEW_REPORT)
def export_excel():
    json = request.get_json(silent=False, force=True)
    q = ''.join(('%', json['q'], '%'))
    page_limit = try_parse_int(json['page_limit'])
    page = try_parse_int(json['page'])
    file_name = export_all(q, page_limit, page)
    response = send_file(file_name)
    response.headers["Content-Disposition"] = "attachment; filename=export.xlsx"
    return response
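For completeness, a hypothetical client-side call against that endpoint; the host, port, and JSON values are assumptions for illustration, not taken from the original post:

import requests

# Hypothetical host/port; adjust to wherever the Flask app runs
resp = requests.post('http://localhost:5000/export_excel/',
                     json={'q': 'search-term', 'page_limit': 100, 'page': 1})
with open('export.xlsx', 'wb') as f:
    f.write(resp.content)  # the attachment produced by send_file above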

Writing data to csv or text file using python

I am trying to write some data to a csv file after checking a condition, as below.
I have a list of urls in a text file:
urls.txt
www.example.com/3gusb_form.aspx?cid=mum
www.example_second.com/postpaid_mum.aspx?cid=mum
www.example_second.com/feedback.aspx?cid=mum
Now I go through each url from the text file, read the content of the url using the urllib2 module in Python, and search for a string in the entire html page. If the required string is found, I write that url to a csv file.
But when I try to write the data (url) to the csv file, each character is saved into its own column, as below, instead of the entire url (data) being saved into one column:
h t t p s : / / w w w......
Code.py
import urllib2
import csv

search_string = 'Listen Capcha'
html_urls = open('/path/to/input/file/urls.txt', 'r').readlines()
outputcsv = csv.writer(open('output/path' + 'urls_contaning _%s.csv' % search_string, "wb"),
                       delimiter=',', quoting=csv.QUOTE_MINIMAL)
outputcsv.writerow(['URL'])

for url in html_urls:
    url = url.replace('\n', '').strip()
    if not len(url) == 0:
        req = urllib2.Request(url)
        response = urllib2.urlopen(req)
        if str(search_string) in response.read():
            outputcsv.writerow(url)
So what's wrong with the above code, and what needs to be done to save the entire url (string) into one column of the csv file?
Also, how can we write the data to a text file in the same way?
Edited
Also, I have a url such as http://www.vodafone.in/Pages/tuesdayoffers_che.aspx;
in a browser this url is actually redirected to http://www.vodafone.in/pages/home_che.aspx?cid=che, but when I tried it through the code below, the result is almost the same as the url I started with:
import urllib2, httplib
httplib.HTTPConnection.debuglevel = 1
request = urllib2.Request("http://www.vodafone.in/Pages/tuesdayoffers_che.aspx")
opener = urllib2.build_opener()
f = opener.open(request)
print f.geturl()
Result
http://www.vodafone.in/pages/tuesdayoffers_che.aspx?cid=che
So finally, how do I catch the redirected url with urllib2 and fetch the data from it?
Change the last line to:
outputcsv.writerow([url])
writerow expects a sequence of fields; when you hand it a bare string, the string is treated as a sequence of characters, so each character lands in its own column.
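A quick way to see the difference, sketched with Python 2's StringIO to match the urllib2 code above:

import csv
from StringIO import StringIO  # Python 2, as in the question's code

buf = StringIO()
writer = csv.writer(buf)
writer.writerow('http://example.com')    # string: one column per character
writer.writerow(['http://example.com'])  # list: the whole url in one column
print buf.getvalue()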

urllib2 File Download Fails...due to System Security?

I'm trying to download files (approximately 1 - 1.5MB/file) from a NASA server (URL), but to no avail! I've tried a few things with urllib2 and run into two results:
I create a new file on my machine that is only ~200KB and has nothing in it
I create a 1.5MB file on my machine that has nothing in it!
By "nothing in it" I mean when I open the file (these are hdf5 files, so I open them in hdfView) I see no hierarchical structure...literally looks like an empty h5 file. But, when I open the file in a text editor I can see there is SOMETHING there (it's binary, so in text it looks like...well, binary).
I think I am using urllib2 appropriately, though I have never successfully used urllib2 before. Would you please comment on whether what I am doing is right or not, and suggest something better?
from urllib2 import Request, urlopen, URLError, HTTPError

base_url = 'http://avdc.gsfc.nasa.gov/index.php?site=1480884223&id=40&go=list&path=%2FH2O%2F/2010'
file_name = 'download_2.php?site=1480884223&id=40&go=download&path=%2FH2O%2F2010&file=MLS-Aura_L2GP-H2O_v03-31-c01_2010d360.he5'
url = base_url + file_name
req = Request(url)
file_mode = "b"  # not defined in the original post; assumed binary, since HDF5 is a binary format

# Open the url
try:
    f = urlopen(req)
    print "downloading " + url
    # Open our local file for writing
    local_file = open('test.he5', "w" + file_mode)
    # Write to our local file
    local_file.write(f.read())
    local_file.close()
except HTTPError, e:
    print "HTTP Error:", e.code, url
except URLError, e:
    print "URL Error:", e.reason, url
I got this script (which seems to be the closest to working) from here.
I am unsure what the file_name should be. I looked at the page source of the archive and pulled the file name listed there (not the same as what shows up on the web page); doing this yields the 1.5MB file that shows nothing in hdfView.
You are creating an invalid url:
base_url = 'http://avdc.gsfc.nasa.gov/index.php?site=1480884223&id=40&go=list&path=%2FH2O%2F/2010'
file_name = 'download_2.php?site=1480884223&id=40&go=download&path=%2FH2O%2F2010&file=MLS-Aura_L2GP-H2O_v03-31-c01_2010d360.he5'
url = base_url + file_name
You probably meant:
base_url = 'http://avdc.gsfc.nasa.gov/'
file_name = 'download_2.php?site=1480884223&id=40&go=download&path=%2FH2O%2F2010&file=MLS-Aura_L2GP-H2O_v03-31-c01_2010d360.he5'
When downloading a large file, it's better to use a buffered copy from filehandle to filehandle:
import shutil

# ...
f = urlopen(req)
with open('test.he5', "wb") as local_file:  # "wb": binary mode for the HDF5 payload
    shutil.copyfileobj(f, local_file)
.copyfileobj will efficiently stream from the open urllib connection to the open local_file handle. Note the with statement: when the code block underneath it concludes, the file is closed for you automatically.
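Putting the answer's pieces together, a minimal end-to-end sketch (binary mode is assumed here because .he5 is a binary HDF5 format):

from urllib2 import Request, urlopen
import shutil

base_url = 'http://avdc.gsfc.nasa.gov/'
file_name = 'download_2.php?site=1480884223&id=40&go=download&path=%2FH2O%2F2010&file=MLS-Aura_L2GP-H2O_v03-31-c01_2010d360.he5'

f = urlopen(Request(base_url + file_name))
with open('test.he5', 'wb') as local_file:
    shutil.copyfileobj(f, local_file)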
