Download xls file from url - python

I am unable to download an xls file from a URL. I have tried both urlopen and urlretrieve, but I receive a really long error message starting with:
Traceback (most recent call last):
File "C:/Users/Henrik/Documents/Development/Python/Projects/ImportFromWeb.py", line 6, in
f = ur.urlopen(dls)
File "C:\Users\Henrik\AppData\Local\Programs\Python\Python35\lib\urllib\request.py", line 163, in urlopen
return opener.open(url, data, timeout)
and ending with:
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found
Unfortunately I can't provide the URL I am using, since the data is sensitive. However, I will give you the URL with some parts removed.
https://xxxx.xxxx.com/xxxxlogistics/w/functions/transportinvoicelist?0-8.IBehaviorListener.2-ListPageForm-table-TableForm-exportToolbar-xlsExport&antiCache=1477160491504
As you can see, the URL doesn't end with "/file.xls", for example. I don't know if that matters, but most of the threads regarding this issue have had those types of links.
If I enter the URL in my address bar, the file download window appears:
[Image of the download window]
The code I have written looks like this:
import urllib.request as ur
import openpyxl as pyxl
dls = 'https://xxxx.xxxx.com/xxxxlogistics/w/functions/transportinvoicelist?0-8.IBehaviorListener.2-ListPageForm-table-TableForm-exportToolbar-xlsExport&antiCache=1477160491504'
f = ur.urlopen(dls)
I am grateful for any help you can provide!
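For what it's worth, a 302 loop like this often means the export endpoint expects the session cookies (and sometimes a browser-like User-Agent) that a browser sends automatically. A minimal sketch using the requests library, assuming the URL works in a browser because of an authenticated session; the header value and output file name are illustrative only:

import requests

session = requests.Session()  # keeps cookies between requests
session.headers.update({'User-Agent': 'Mozilla/5.0'})  # some servers reject the default client

# ... authenticate here first if the site requires a login ...

resp = session.get(dls, allow_redirects=True)
resp.raise_for_status()

with open('transportinvoicelist.xls', 'wb') as f:
    f.write(resp.content)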

Related

ValueError: I/O operation on closed file with urllib

I have a problem scraping data from the SeekingAlpha website. I know this question has been asked several times so far, but the solutions provided didn't help.
I have the following block of code:
import urllib.request
from bs4 import BeautifulSoup

class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

def scrape_news(url, source):
    opener = AppURLopener()
    if(source=='SeekingAlpha'):
        print(url)
        with opener.open(url) as response:
            s = response.read()
        data = BeautifulSoup(s, "lxml")
        print(data)

scrape_news('https://seekingalpha.com/news/3364386-apple-confirms-hiring-waymo-senior-engineer', 'SeekingAlpha')
Any idea what might be going wrong here?
EDIT:
whole traceback:
Traceback (most recent call last):
File ".\news.py", line 107, in <module>
scrape_news('https://seekingalpha.com/news/3364386-apple-confirms-hiring-waymo-senior-engineer','SeekingAlpha')
File ".\news.py", line 83, in scrape_news
with opener.open(url) as response:
File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\urllib\response.py", line 30, in __enter__
raise ValueError("I/O operation on closed file")
ValueError: I/O operation on closed file
Your URL returns a 403. Try this in your terminal to confirm:
curl -s -o /dev/null -w "%{http_code}" https://seekingalpha.com/news/3364386-apple-confirms-hiring-waymo-senior-engineer
Or, try this in your Python repl:
import urllib.request
url = 'https://seekingalpha.com/news/3364386-apple-confirms-hiring-waymo-senior-engineer'
opener = urllib.request.FancyURLopener()
response = opener.open(url)
print(response.getcode())
FancyURLopener swallows any errors about the failure response code, which is why your code continues to response.read() instead of exiting, even though it hasn't received a valid response. The standard urllib.request.urlopen should handle this for you by raising an exception on a 403 error; otherwise you can handle it yourself.
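If you do switch to the standard opener, a minimal sketch would look something like this (the User-Agent string is illustrative, and the site may still block automated clients):

import urllib.request
import urllib.error

url = 'https://seekingalpha.com/news/3364386-apple-confirms-hiring-waymo-senior-engineer'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})

try:
    with urllib.request.urlopen(req) as response:
        html = response.read()
except urllib.error.HTTPError as e:
    # a 403 (or any other failure status) now surfaces as an exception instead of a closed file
    print('Request failed with status', e.code)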

Error with Requests using Pandas/BeautifulSoup: requests.exceptions.TooManyRedirects: Exceeded 30 redirects

I'm using Python 3 to scrape a Pandas data frame I've created from a csv file that contains the source URLs of 63,067 webpages. The for-loop is supposed to scrape the news articles for a project and place them into giant text files for cleaning later on.
I'm a bit rusty with Python and this project is the reason I've started programming in it again. I haven't used BeautifulSoup before, so I'm having some difficulty and just got the for-loop to work on the Pandas data frame with BeautifulSoup.
This is for one of the three data sets I'm using (the other two are programmed into the code below to repeat the same process for different data sets, which is why I'm mentioning this).
from bs4 import BeautifulSoup as BS
import requests, csv
import pandas as pd
negativedata = pd.read_csv('negativedata.csv')
positivedata = pd.read_csv('positivedata.csv')
neutraldata = pd.read_csv('neutraldata.csv')
negativedf = pd.DataFrame(negativedata)
positivedf = pd.DataFrame(positivedata)
neutraldf = pd.DataFrame(neutraldata)
negativeURLS = negativedf[['sourceURL']]
for link in negativeURLS.iterrows():
    url = link[1]['sourceURL']
    negative = requests.get(url)
    negative_content = negative.text
    negativesoup = BS(negative_content, "lxml")
    for text in negativesoup.find_all('a', href=True):
        text.append((text.get('href')))
I think I finally got my for-loop to work so the code runs through all of the source URLs. However, I then get the error:
Traceback (most recent call last):
File "./datacollection.py", line 18, in <module>
negative = requests.get(url)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
resp = self.send(prep, **send_kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 640, in send
history = [resp for resp in gen] if allow_redirects else []
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 640, in <listcomp>
history = [resp for resp in gen] if allow_redirects else []
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 140, in resolve_redirects
raise TooManyRedirects('Exceeded %s redirects.' % self.max_redirects, response=resp)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
I know that the issue arises when I'm requesting the URLs, but I'm not sure which URL (or whether a single URL) is the problem, given the number of webpages in the data frame being iterated through. Is the problem one bad URL, or do I simply have too many and should I use a different package like scrapy?
I would suggest using a module like mechanize for scraping. Mechanize has a way of handling robots.txt and is much better if your application scrapes data from URLs of different websites. But in your case, the redirect is probably caused by not having a user-agent in the headers, as mentioned here (https://github.com/requests/requests/issues/3596). And here's how you set headers with requests (Sending "User-agent" using Requests library in Python).
P.S.: mechanize is only available for Python 2.x. If you wish to use Python 3.x, there are other options (Installing mechanize for python 3.4).
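To illustrate that last point about headers, here is a minimal sketch of the loop with a browser-like User-Agent and a guard for pages that still refuse to resolve; the header value, timeout, and exception handling are assumptions, not part of the original answer:

import requests
from requests.exceptions import TooManyRedirects

headers = {'User-Agent': 'Mozilla/5.0'}

for link in negativeURLS.iterrows():
    url = link[1]['sourceURL']
    try:
        negative = requests.get(url, headers=headers, timeout=30)
    except TooManyRedirects:
        # skip URLs that keep bouncing between redirects even with a user agent set
        print('Skipping', url)
        continue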

Saving Image from URL using Python Requests - URL type error

Using the following code:
with open('newim','wb') as f:
    f.write(requests.get(repr(url)))
where the url is:
url = 'data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAFoAAAArCAYAAAD41p9mAAAAzUlEQVR42u3awQ4DIQhFUf7/p9tNt20nHQGl5yUuh4c36BglgoiIiIiIiGiVHq+RGfvdiGG+lxKonGiWd4vvKZNd5V/u2zXRO953c2jx3bGiMrewLt+PgbJA/xJ3RS5dvl9PEdXLduK3baeOrKrc1bcF9MnLP7WqgR4GOjtOl28L6AlHtLSqBhpooIEGGmiggQYaaKCBBhpodx3H3XW4vQN6HugILyztoL0Zhlfw9G4tfR0FfR0VnTw6lQoT0XtXmMxfdJPuALr0x5Pp+wT35KKWb6NaVgAAAABJRU5ErkJggg=='
I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python33\lib\site-packages\requests\api.py", line 69, in get
return request('get', url, params=params, **kwargs)
File "C:\Python33\lib\site-packages\requests\api.py", line 50, in request
response = session.request(method=method, url=url, **kwargs)
File "C:\Python33\lib\site-packages\requests\sessions.py", line 465, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python33\lib\site-packages\requests\sessions.py", line 567, in send
adapter = self.get_adapter(url=request.url)
File "C:\Python33\lib\site-packages\requests\sessions.py", line 641, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
I have seen other posts with what, at first glance, appears to be a similar problem, but I haven't had any luck just adding 'https://' or anything like that. I seriously want to avoid having to do this with webdriver+AutoIt or something, because I have to do a similar exercise for thousands of images.
There seems to be a problem with your understanding of the concept of embedded images. The URL you have posted is, actually, what your browser returns when you select 'View Image' or 'Copy Image Location' (or something similar, depending on the browser) from the context menu, and is formally called a data URI.
It is not an HTTP URL pointing to an image, and you cannot use it to retrieve an actual image from any server: this is exactly what requests points out in the error message.
So, how do we get these images?
The following script will handle this task:
import requests
from lxml import html
import binascii as ba

i = 0
url = "<Page URL goes here>"  # Ex: http://server/dir/images.html
page = requests.get(url)
struct = html.fromstring(page.text)
images = struct.xpath('//img/@src')

for img in images:
    i += 1
    ext = img.partition('data:image/')[2].split(';')[0]
    with open('newim' + str(i) + '.' + ext, 'wb') as f:
        f.write(ba.a2b_base64(img.partition('base64,')[2]))

print("Done")
To run it you will need to install, along with requests, the lxml library which can be found here.
Here follows a short description of how the script functions:
First it requests the URL from the server and, after it gets the server's response, stores it in a Response object (page).
Then it utilizes html.fromstring() from lxml to transform the "textified" content of page into a tree structure that can be processed by commands using XPath syntax, like this one: images = struct.xpath('//img/@src').
The result is a list containing the contents of the src attribute of every image in the page. In this case (embedded images) these are the data URIs.
Then, for every image in the list, it first gets the image type (which will be used as newim's extension) using partition() and split(), and stores it in ext. It then converts the base64-encoded data to binary (using a2b_base64() from the binascii module) and writes the output to the file.
As a small demo, save this html code (as, eg, images.html) somewhere in your server
<h1>Images</h1>
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAEcAAAAmCAYAAACWGIbgAAACKElEQVR4nO2aPWsUQRiAnznSKYI/IbBJoTkCEhDuuhQnMY2VhdjYbAoLO9N4VfwBlyKFFlYKYqNN9iCFnQeCCHrJFdkFf4IgKcO9FnLJJtzOfs3OGJgHtpi5nZl3H+abUyIieObSch3A/4yXo8HL0eDlaPByNHg5GrwcDV6OhnJyPt0bK6XYGhqMYLiFUir1dNlNDNafpmz8kkskU1gHZPaEUX6pfGKZ9nkOSPe9xJfz6AxmmTWpHn+2nDe8S1doVk5KAqFcqC4Kz9uq05CB+OfL0VRsRM7H3s9MAfFAWnCtVluG4p8/5zyRR/JPHBIPaMH1gqO0AAny/eBt5s/BMqdwd5Z8/XKX0lOQofjtr1bJPgs77BV+f/SB/aYm6BzsyzmMxlM4KV5gxCRuLhwd9uX8PhiXLXL0p/zIMoFlOQnyix/pnO76Ru6Hf/k8DJqLKZursUM+PHbSdSzLiWGHb3bbrM7V6DmOsCxnCfqslS62soyLSceynAC1yGrZUkUm7SawP6xu9trpZJGV6PYNJx3HgZyV++1y2/kOt5aaC0eHfTnBJqd9nhZ+v/OQTSf9xslqFaDu9B6fJS/vYZJjFuDrLBm+eOZmTFFRTu3t/IO99rTPNgCjCReOTvGEs7NXGPFqo1ZLcykcf6W7ESO3dDk3gWauG2vFX+myK/2cf1hF0jd/INCRQV3zhuJXIv5fFln444MGL0eDl6PBy9Hg5WjwcjR4ORr+Aq7+02kTcdF1AAAAAElFTkSuQmCC" />
<br />
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAFoAAAArCAYAAAD41p9mAAAAzUlEQVR42u3awQ4DIQhFUf7/p9tNt20nHQGl5yUuh4c36BglgoiIiIiIiGiVHq+RGfvdiGG+lxKonGiWd4vvKZNd5V/u2zXRO953c2jx3bGiMrewLt+PgbJA/xJ3RS5dvl9PEdXLduK3baeOrKrc1bcF9MnLP7WqgR4GOjtOl28L6AlHtLSqBhpooIEGGmiggQYaaKCBBhpodx3H3XW4vQN6HugILyztoL0Zhlfw9G4tfR0FfR0VnTw6lQoT0XtXmMxfdJPuALr0x5Pp+wT35KKWb6NaVgAAAABJRU5ErkJggg=="></img>
<br />
<img src="data:image/jpg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQH/2wBDAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQH/wAARCAAhADYDASIAAhEBAxEB/8QAHAABAQACAgMAAAAAAAAAAAAAAAoICQILBAYH/8QAKxAAAQQDAAICAQIGAwAAAAAABQMEBgcBAggACQoWFRQhERMXNjh2d7W2/8QAGQEAAgMBAAAAAAAAAAAAAAAAAAQBAgMF/8QAIxEAAgMAAgMBAQEAAwAAAAAAAwQBAgUSEwAGERQjIRUiJP/aAAwDAQACEQMRAD8Av48eYVd0Vd2Ba1RphuOuxQHFUyEvnUgktmP+aIn0vIC4IWPWXSi0cBT+dxiFRjUg5xtkwbKRyYEVG2rdENgEumu7dRu/DrvK7+hOs/YbYPQVx2helhf0dokEpOrcncosOV7hxU3sbI8TodlxUyUQEMN3LrYcJSd6jx+HC2jRBLVTbGYzZnR1W8uIle6aDejchJi0mXXzHnxlXoObVkZmESoTByrsCINhiVrgqrZ2X/iOYLRme6pn87PrQccajM9ohStQ9ycb1IEJqOUgAWAmpagbMANJahv38eSO/Kc9k9vcyRrk/irn2yZTU8z68myq9s2BX5JeO2GApUNIY5GtgcQlzTfUhE304PnnWjo0H/llPxcUICP1KQ0wRbPcQfe1LYt6C+lvWH0j67K/ifOsbk2LLg3RlMVIFZV/W/TFc1G5rDcQPuaMxpMeDn0xahLJnbQHYsoFlZuFMGUZC3PbEhzHdCiZhMyiYxKK5+l7I16sm/e0WH/yGejVh9lqs9cL5irjCmbdihGGO+mmSydAoAtoXMEtSsJrUs1qK+vz7KZCvEVvwXIzZYAy3njZ9rPzdbREAlBAkIs2n6pvp/8Akug8eaUfZnzb2d7Oqtq3n3lTpDbjPmC268Xsi+elAbNeWWVNwsgGsvodBwmFxycwM59XlDEgRO3AaezGKCyMe+vRRi5lwsvNourCrUlOdvfHg95vPfLkCujewgV2WdQgwkyiah2MQLpOjLbsQjWoxvP65fviY0VNguzyX6BkHD+Xb15L08FozJizVfLsiwgIrO2rhuDJmsaWkbHy5MO5CMaICHWKNpekQTPDVpZiKSfk20mAmkik0kRMjazRhDxj7KV66IUs8Oq/UX0VQIHqA47rmvE1dLZNgJJqCv5gOmDlutqO1con2rHjx48z8189dl/9pSj/AF03/wBY6869r4R/+Q/sP/40pr/3FgeXb9NPOpk4KsL5VragLCl58fIhBJboK8rFpeNxXDwOq3CG2Olc8736UnOUyi2NysbX2r3G7BDOGkp0cOf4tZZvRr6SPal6eLwuKwTK/Ad7wq+YtFojLwYro/oWv5BF8x2WbGm8rAvHnFMvZyBZiMKSJrrEXv11Aw7djt1pYHTZqbrxg/x9o0WzfxVL64xliOWOAyPHz/YBjDWZ/wBinZoJUszaKqUsaYuev52ug2f6euAVF/VmPYMvSsEf/a9Ekn0CHLaI/wA51GA96L1mzJKi/mG0lXg2APyzo0fjvtk9WlqmlVUq9KRuAx8c4/HqbItTdf8AR2ZBMt8O0ElF3i34abxNXZjphRRvonooglnd5vjb6F85HGdgfrX11xnO2S/VmMYxjOc5zlnz3jGMYx++c5z+2MY/fOfKgPb36oa69rtBxWvTM4IVDcdPTRCy6CuYYI0kf0iYJaIokB8gjKj8T9jiElaNmiZYe2LiSDQmMBG2T/fIlUWTw9v71M9Cez3pzje4fZk351gdO8Tj3Z4PRHNlpWdeDfom1ZK/jxOTF7FlVo0ZQGlaVsi8gMKwlAAsds03JhLiShCk9EaKs3+ySaNTI4uKe8rCzPfd72Bt21azScXeVh/9C44mvc0q+NnLslW9mC3ui5NaqsMFVeI9ZXXd9goOGpb9IWwBIhmKGrq5yj2CARL2jjRZpTVU05cmvWIaWsG/JgaYnNuPIo7FHcK8zC7SkI0FrUnKVOs7ClEjJIig4NODVLHkpSZOFyrhNsNFidBj50QIkXSSDRo2VculUk099tdFvG3Ku/sL9lkv983RAfMN5yraGYrr1swudMHkYMG6ph7U+ht2BPhp39FqBisv+wzmaU21PIIvXcflw2fumolnHYSZkGQvvi4X9ovsXqCPcscS2rzFSPPkgQSK36ZtKxrdi1lWY+YEN1AdbMR9f0fYAgVV4/DVhITzjMl/MTM1swCuhgWPR17pNtC8f+PP8jqTGo9Fb79uQ+waFMmgge4a8e96d5TEdMapckWjadRNaGyun28UkqBuK7ExeI9IVm4ErhxqwKLIMVllNOzV0uv7I/skrTNYY0XoyTtTa6uYbWIyDR2jCHxKe4Vmio514kP5kzabtRGdayWsjjVSHletoY8XK+IWcjTXElA4PoJ5gglXxRdswD48wuBt6liRS5AKZpr/AJb6YD3wMH7AqwZFBb1oSGEmjZ+OIsHKLxg/YPEdHDR6ydt91G7po6bqJrtnKCiiK6KmiqW+2m2u2XnNo0asGjViybpNGTJui0aNW+mqSDZq2T1Rbt0UtMY0TSRS00TT01xjXTTXXXXGMYx48XJw7L9XPq526+yYknD7PDnNYis34/OU1iI+/fkfPGBdnWPu4d3Cnb1TaR9nGOzrm8RaacvvCbRFuPz7H3755Hjx48p5fx48ePDw8ePHjw8PHjx48PDz/9k="/>
and point to it in the script: requests.get("http://yourserver/somedir/images.html").
When you run the script you will get three images, named newim1.png, newim2.png and newim3.jpg respectively.
As a reminder, do note that this script (in its current form) will only handle embedded images. If you want to process also ordinary linked images, then you have to modify it accordingly (but this is not difficult).
This is an image encoded in base64. Quoting the URL below: "base64 equals to text (string) representation of the image itself".
Read this for a detailed explanation:
http://www.stoimen.com/blog/2009/04/23/when-you-should-use-base64-for-images/
In order to use them you'll have to implement a base64 decoder. Luckily SO already provides you with the answer on how to do it:
Python base64 data decode
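For a single data URI like the one in the question, the standard base64 module is enough; a minimal sketch, assuming url holds the data URI shown in the question (the output file name is arbitrary):

import base64

# url is the 'data:image/png;base64,...' string from the question
header, _, encoded = url.partition('base64,')
ext = header.partition('data:image/')[2].rstrip(';')  # 'png' for this URI

with open('newim.' + ext, 'wb') as f:
    f.write(base64.b64decode(encoded))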

HTTPError with example biopython code querying pubmed

I want to query PubMed through Python. I found a nice biology-related library to do this:
http://biopython.org/DIST/docs/tutorial/Tutorial.html
I found some example code here:
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc116
from Bio import Entrez

Entrez.email = "A.N.Other@example.com"
handle = Entrez.egquery(term="orchid")
record = Entrez.read(handle)
for row in record["eGQueryResult"]:
    if row["DbName"] == "pubmed":
        print row["Count"]
When I change the email and run this code I get the following error:
Traceback (most recent call last):
File "pubmed.py", line 15, in <module>
handle = Entrez.egquery(term=my_query)
File "/usr/lib/pymodules/python2.7/Bio/Entrez/__init__.py", line 299, in egquery
return _open(cgi, variables)
File "/usr/lib/pymodules/python2.7/Bio/Entrez/__init__.py", line 442, in _open
raise exception
urllib2.HTTPError: HTTP Error 404: Not Found
There is not much of a lead to the source of the problem. I don't know what URL it is trying to access. When I search for "pubmed entrez urllib2.HTTPError: HTTP Error 404: Not Found", I get 8 results, none of which are related (aside from this thread).
The example works for me. It looks like it was a temporary NCBI issue, although the "Error 404" is quite unusual and not typical of the network problems I have seen with Entrez. In general with any network resource, give it a few hours or a day before worrying that something has broken.
There is also an Entrez Utilities announcement mailing list you may wish to subscribe to, although if there was a planned service outage recently it was not mentioned here:
http://www.ncbi.nlm.nih.gov/mailman/listinfo/utilities-announce
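If the failure really is transient, a small retry wrapper is one way to make scripts resilient to it. A minimal sketch, assuming Python 3 and a recent Biopython, where the error surfaces as urllib.error.HTTPError (under Python 2 it comes from urllib2 instead); the attempt count and delay are arbitrary:

import time
from urllib.error import HTTPError
from Bio import Entrez

Entrez.email = "A.N.Other@example.com"

def egquery_with_retry(term, attempts=3, delay=15):
    for attempt in range(attempts):
        try:
            return Entrez.egquery(term=term)
        except HTTPError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)  # wait before retrying a transient server error

handle = egquery_with_retry("orchid")
record = Entrez.read(handle)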

Python Mechanize won't properly handle a redirect

I'm working on a scraper using Mechanize and Beautiful Soup in Python and for some reason redirects aren't working. Here's my code (I apologize for naming my variables "thing" and "stuff"; I don't normally do that, trust me):
stuff = soup.find('div', attrs={'class' : 'paging'}).ul.findAll('a', href=True)

for thing in stuff:
    pageUrl = thing['href']
    print pageUrl
    req = mechanize.Request(pageUrl)
    response = browser.open(req)
    searchPage = response.read()
    soup = BeautifulSoup(searchPage)
    soupString = soup.prettify()
    print soupString
Anyway, products on Kraft's website that have more than one page of search results display a link to go to the next page(s). The source code lists, for example, this as the next page for Kraft's line of steak sauces and marinades, which redirects to this.
Anyway, thing['href'] has the old link in it because it scrapes the web page for it; one would think that doing browser.open() on that link would cause mechanize to go to the new link and return that as a response. However, running the code gives this result:
http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=1&searchtext=a.1. steak sauces and marinades&pageno=2
Traceback (most recent call last):
File "C:\Development\eclipse\mobile development\Crawler\src\Kraft.py", line 58, in <module>
response = browser.open(req)
File "build\bdist.win-amd64\egg\mechanize\_mechanize.py", line 203, in open
File "build\bdist.win-amd64\egg\mechanize\_mechanize.py", line 255, in _mech_open
mechanize._response.httperror_seek_wrapper: HTTP Error 408: Request Time-out
I get a time-out; I imagine it's because, for some reason, mechanize is looking for the old URL and isn't being redirected to the new one (I also tried this with urllib2 and received the same result). What's going on here?
Thanks for the help and let me know if you need any more information.
Update: Alright, I enabled logging; now my code reads:
req = mechanize.Request(pageUrl)
print logging.INFO
When I run it I get this:
url argument is not a URI (contains illegal characters) u'http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=1&searchtext=a.1. steak sauces and marinades&pageno=2'
20
Update 2 (which occurred while writing the first update): It turns out that it was the spaces in my string! All I had to do was this: pageUrl = thing['href'].replace(' ', "+") and it works perfectly.
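As a side note, a more general fix than replacing spaces by hand is to percent-encode the href before opening it. A small sketch, assuming Python 2 as in the question and that thing['href'] holds the raw link; the set of safe characters is an assumption:

import urllib

# percent-encodes spaces and other unsafe characters while leaving the URL structure intact
pageUrl = urllib.quote(thing['href'], safe=':/?&=')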
Both urllib2 and mechanize openers include a handler for redirect responses by default (you can check by looking at the handlers attribute), so I don't think the problem is that a redirect response isn't being correctly followed.
To troubleshoot the problem, you should capture the traffic in your web browser (in Firefox, Live HTTP Headers and HttpFox are useful for this) and compare it with the logs from your script (I'd recommend subclassing urllib2.BaseHandler to create your own handler that logs all the information you need for every request, and adding it to your opener object using the add_handler method).
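A minimal sketch of such a logging handler, assuming Python 2 and plain urllib2 (a mechanize opener can register the same handler via add_handler); the handler_order value and printed fields are illustrative:

import urllib2

class LoggingHandler(urllib2.BaseHandler):
    handler_order = 100  # run before the default handlers

    def http_request(self, req):
        print 'REQUEST', req.get_full_url(), req.headers
        return req

    def http_response(self, req, resp):
        print 'RESPONSE', resp.code, resp.msg
        return resp

    https_request = http_request
    https_response = http_response

opener = urllib2.build_opener(LoggingHandler())
response = opener.open('http://www.kraftrecipes.com/')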
