Unicode error when reading a csv file with pandas - python

Why pandas is not able to read this csv file and returns 'UnicodeEncodeError'. I tried lot of solutions from stackoverflow (local download, different encoding, change the engine...), but still not working...How to fix it?
import pandas as pd
url = 'http://é.com'
pd.read_csv(url,encoding='utf-8')

TL;DR
Your URL contains non ASCII character as the error complains.
Just change:
url = 'http://é.com'
For:
url = 'http://%C3%A9.com'
And the problem is fixed.
Solutions
Automatic URL escaping
Reading the error in depth shows that after executing the request to get resource behind the URL, the read_csv function expects the URL of resource to be ASCII encoded which seems not the be the case for this specific resource.
This call that is made by read_csv fails miserably:
import urllib.request
urllib.request.urlopen(url)
The problem is due to the accent in é that must be escaped to prevent urlopen to fail. Below a clean way to enforce this requirement:
import urllib.parse
result = urllib.parse.urlparse(url)
replaced = result._replace(path=urllib.parse.quote(result.path))
url = urllib.parse.urlunparse(replaced)
pd.read_csv(url)
Handling dataflow by yourself
Alternatively you can by pass this limitation by handling the complete flow by yourself. Following snippet does the trick:
import io
import gzip
import pandas as pd
import requests
url = 'http://é.com'
response = requests.get(url)
file = io.BytesIO(response.content)
with gzip.open(file, 'rb') as handler:
df = pd.read_csv(handler)
The key is to get the HTTP resource and deflate it then fake the content as a file-like object because read_csv does read directly CSV strings.

Related

How to parse CSV into pandas dataframe

I am having a couple issues with setting up a way to automate the download of a csv. The two issues are when downloading using a simple pandas read_csv(url) method I get and SSL error, so I switched to using requests and trying to parse the response. The next issues is that I am getting some errors in parsing the response. I'm not sure if the reason is that the URL is actually returning a zip file and if that is how can I get around that.
Here is the URL: https://www.californiadgstats.ca.gov/download/interconnection_rule21_applications/
and here is the code:
import pandas as pd
import numpy as np
import os
import io
import requests
import urllib3
requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS = 'ALL:#SECLEVEL=1'
url = "https://www.californiadgstats.ca.gov/download/interconnection_rule21_applications/"
res = requests.get(url).content
data = pd.read_csv(io.StringIO(res.decode('utf-8')))
If the content is zip format, you should unzip it, and use its contents (csv, txt...).
I wasn't able to download the file due to the low speed from host
Here is the answer I found although I don't really need to actually save these files locally, so if anyone knows how to parse zipfiles without downloading that would be great. Also not sure why I get that SSL error with pandas, but not with requests...
import requests
import zipfile
from io import BytesIO
url = "https://www.californiadgstats.ca.gov/download/interconnection_rule21_applications/"
pathSave = "C:/Users/wherever"
filename = url.split('/')[-1]
r = requests.get(url)
zipfile= zipfile.ZipFile(BytesIO(r.content))
zipfile.extractall(pathSave)

Python script to download PDF not downloading the PDF?

I have a Python 3.10 script to download a PDF from a URL, I get no errors but when I run the code the PDF does not download. I've done a sanity check to ensure the PDF is actually on the URL (which it is)
I'm not sure if this maybe has something to do with HTTP/ HTTPS? This site does have an expired HTTPS certificate, but it is a government site and this is really for testing only so I am not worried about that and can ignore the error
from fileinput import filename
import os
import os.path
from datetime import datetime
import urllib.request
import requests
import urllib3
urllib3.disable_warnings()
resp = requests.get('http:// url domain .org', verify=False)
urllib.request.urlopen('http:// my url .pdf')
filename = datetime.now().strftime("%Y_%m_%d-%I_%M_%S_%p")
save_path = "C:/Users/bob/Desktop/folder"
Or maybe is the issue something to do with urllib3 ignoring the error and urllib downloading the file?
Redacted the specific URL here
The urllib.request.urlopen method doesn't save the remote URL to a file -- it returns a response object that can be treated as a file-like object. You could do something like:
response = urllib.request.urlopen('http:// my url .pdf')
with open('filename.pdf') as fd:
fd.write(response.read())
The urllib.request.urlretrieve method, on the other hand, will take care of writing the remote content to a local file. You would use it like this to write the PDF file to a local file named filename.pdf:
response = urllib.request.urlretrieve('http://my url .pdf',
filename='filename.pdf')
See the documentation for information about the return value from the urlretrieve method.

Download excel file using python

I have a web link which downloads an excel file directly. It opens a page writing "your file is downloading" and starts downloading the file.
Is there any way i can automate it using requests module ?
I am able to do it with selenium but i want it to run in background so i was wondering if i can use request module.
I have used request.get but it simply gives the text i.e "your file is downloading" but somehow i am not able to get the file.
This Python3 code downloads any file from web to a memory:
import requests
from io import BytesIO
url = 'your.link/path'
def get_file_data(url):
response = requests.get(url)
f = BytesIO()
for chunk in response.iter_content(chunk_size=1024):
f.write(chunk)
f.seek(0)
return f
data = get_file_data(url)
You can use next code to read the Excel file:
import pandas as pd
xlsx = pd.read_excel(data, skiprows=0)
print(xlsx)
It sounds like you don't actually have a direct URL to the file, and instead need to engage with some javascript. Perhaps there is an underlying network call that you can find by inspecting the page traffic in your browser that shows a direct URL for downloading the file. With that you can actually just read the excel file URL directly with pandas:
import pandas as pd
url = "https://example.com/some_file.xlsx"
df = pd.read_excel(url)
print(df)
This is nice and tidy, but if you really want to use requests (or avoid pandas) you can download the raw file content as shown in this answer and then use the pyexcel_xlsx package's get_xlsx function to read it without any pandas involvement.

Python Error: iterator should return strings, not bytes (did you open the file in text mode?) [duplicate]

I've been struggling with this simple problem for too long, so I thought I'd ask for help. I am trying to read a list of journal articles from National Library of Medicine ftp site into Python 3.3.2 (on Windows 7). The journal articles are in a .csv file.
I have tried the following code:
import csv
import urllib.request
url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(ftpstream)
data = [row for row in csvfile]
It results in the following error:
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
data = [row for row in csvfile]
File "<pyshell#4>", line 1, in <listcomp>
data = [row for row in csvfile]
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
I presume I should be working with strings not bytes? Any help with the simple problem, and an explanation as to what is going wrong would be greatly appreciated.
The problem relies on urllib returning bytes. As a proof, you can try to download the csv file with your browser and opening it as a regular file and the problem is gone.
A similar problem was addressed here.
It can be solved decoding bytes to strings with the appropriate encoding. For example:
import csv
import urllib.request
url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(ftpstream.read().decode('utf-8')) # with the appropriate encoding
data = [row for row in csvfile]
The last line could also be: data = list(csvfile) which can be easier to read.
By the way, since the csv file is very big, it can slow and memory-consuming. Maybe it would be preferable to use a generator.
EDIT:
Using codecs as proposed by Steven Rumbalski so it's not necessary to read the whole file to decode. Memory consumption reduced and speed increased.
import csv
import urllib.request
import codecs
url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(codecs.iterdecode(ftpstream, 'utf-8'))
for line in csvfile:
print(line) # do something with line
Note that the list is not created either for the same reason.
Even though there is already an accepted answer, I thought I'd add to the body of knowledge by showing how I achieved something similar using the requests package (which is sometimes seen as an alternative to urlib.request).
The basis of using codecs.itercode() to solve the original problem is still the same as in the accepted answer.
import codecs
from contextlib import closing
import csv
import requests
url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
with closing(requests.get(url, stream=True)) as r:
reader = csv.reader(codecs.iterdecode(r.iter_lines(), 'utf-8'))
for row in reader:
print row
Here we also see the use of streaming provided through the requests package in order to avoid having to load the entire file over the network into memory first (which could take long if the file is large).
I thought it might be useful since it helped me, as I was using requests rather than urllib.request in Python 3.6.
Some of the ideas (e.g using closing()) are picked from this similar post
I had a similar problem using requests package and csv.
The response from post request was type bytes.
In order to user csv library, first I a stored them as a string file in memory (in my case the size was small), decoded utf-8.
import io
import csv
import requests
response = requests.post(url, data)
# response.content is something like:
# b'"City","Awb","Total"\r\n"Bucuresti","6733338850003","32.57"\r\n'
csv_bytes = response.content
# write in-memory string file from bytes, decoded (utf-8)
str_file = io.StringIO(csv_bytes.decode('utf-8'), newline='\n')
reader = csv.reader(str_file)
for row_list in reader:
print(row_list)
# Once the file is closed,
# any operation on the file (e.g. reading or writing) will raise a ValueError
str_file.close()
Printed something like:
['City', 'Awb', 'Total']
['Bucuresti', '6733338850003', '32.57']
urlopen will return a urllib.response.addinfourl instance for an ftp request.
For ftp, file, and data urls and requests explicity handled by legacy
URLopener and FancyURLopener classes, this function returns a
urllib.response.addinfourl object which can work as context manager...
>>> urllib2.urlopen(url)
<addinfourl at 48868168L whose fp = <addclosehook at 48777416L whose fp = <socket._fileobject object at 0x0000000002E52B88>>>
At this point ftpstream is a file like object, using .read() would return the contents however csv.reader requires an iterable in this case:
Defining a generator like so:
def to_lines(f):
line = f.readline()
while line:
yield line
line = f.readline()
We can create our csv reader like so:
reader = csv.reader(to_lines(ftps))
And with a url
url = "http://pic.dhe.ibm.com/infocenter/tivihelp/v41r1/topic/com.ibm.ismsaas.doc/reference/CIsImportMinimumSample.csv"
The code:
for row in reader: print row
Prints
>>>
['simpleci']
['SCI.APPSERVER']
['SRM_SaaS_ES', 'MXCIImport', 'AddChange', 'EN']
['CI_CINUM']
['unique_identifier1']
['unique_identifier2']

Convert text data from requests object to dataframe with pandas

Using requests I am creating an object which is in .csv format. How can I then write that object to a DataFrame with pandas?
To get the requests object in text format:
import requests
import pandas as pd
url = r'http://test.url'
r = requests.get(url)
r.text #this will return the data as text in csv format
I tried (doesn't work):
pd.read_csv(r.text)
pd.DataFrame.from_csv(r.text)
Try this
import requests
import pandas as pd
import io
urlData = requests.get(url).content
rawData = pd.read_csv(io.StringIO(urlData.decode('utf-8')))
I think you can use read_csv with url:
pd.read_csv(url)
filepath_or_buffer : str, pathlib.Path, py._path.local.LocalPath or any object with a read() method (such as a file handle or StringIO)
The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file ://localhost/path/to/table.csv
import pandas as pd
import io
import requests
url = r'http://...'
r = requests.get(url)
df = pd.read_csv(io.StringIO(r))
If it doesnt work, try update last line:
import pandas as pd
import io
import requests
url = r'http://...'
r = requests.get(url)
df = pd.read_csv(io.StringIO(r.text))
Using "read_csv with url" worked:
import requests, csv
import pandas as pd
url = 'https://arte.folha.uol.com.br/ciencia/2020/coronavirus/csv/mundo/dados-bra.csv'
corona_bra = pd.read_csv(url)
print(corona_bra.head())
if the url has no authentication then you can directly use read_csv(url)
if you have authentication you can use request to get it un-pickel and print the csv and make sure the result is CSV and use panda.
You can directly use importing
import csv

Categories

Resources