read a zipped csv from S3 into python dataframe

read a zipped csv from S3 into python dataframe - python

I Have a bucket in S3 with a csv in it.
There are no none-ASCII characters in it.
when I try to read it using python it will not let me.
I used: df = self.s3_input_bucket.get_file_contents_from_s3(path)
as I used on many occasions recently in the same script, and get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 14: invalid start byte.
to make sure it goes to the right path, i put another plain text file in the same folder and was able to read it without a problem.
I tried many solutions I found on other questions. just one example, I saw a solution someone offered, to try this:
str = unicode(str, errors='replace')
or
str = unicode(str, errors='ignore')
from this question: UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c
but how can I use them in this case?
this did not work:
str = unicode(self.s3_input_bucket.get_file_contents_from_s3(path), errors='replace')

Apparently, I tried to open a zipped filed.
after much research, I was able to read it into a data frame using this code:
import zipfile
import s3fs
s3_fs = s3fs.S3FileSystem(s3_additional_kwargs={'ServerSideEncryption': 'AES256'})
market_score = self._zipped_csv_from_s3_to_df(os.path.join(my-bucket, path-in-bucket), s3_fs)
def _zipped_csv_from_s3_to_df(self, path, s3_fs):
with s3_fs.open(path) as zipped_dir:
with zipfile.ZipFile(zipped_dir, mode='r') as zipped_content:
for score_file in zipped_content.namelist():
with zipped_content.open(score_file) as scores:
return pd.read_csv(scores)
I will always have only one csv file inside the zip, so that is why I know I can return on the first iteration. however this function iterate over the files in the zip.

The error message in the question actually related to a CSV encoding issue (quite separate from the title: "read zipped CSV from s3").
One possible solution to the title question is:
pd.read_csv('s3://bucket-name/path/to/zip/my_file.zip')
Pandas will open the zip and read in the CSV. This will only work if the zip contains a single CSV file. If there are multiple, another solution is required (perhaps more like OP's solution).
The encoding issue can be resolved by specifying the encoding type in the read. For example:
pd.read_csv('s3://bucket-name/path/to/zip/my_file.zip', encoding = "ISO-8859-1")

Related

How do I remove quotation marks around the fields of a CSV?

I'm working on taking csv files and putting them into a postgreSQL database. For one of the files though, every field is surrounded by quotes (When looking at it in Excel it looks normal. In notepad though, one row looks like "Firstname","Lastname","CellNumber","HomeNumber",etc. when it should look like Firstname,Lastname,CellNumber,HomeNumber). It breaks when I tried to load it into SQL.
I tried loading the file into python to do data cleaning, but i'm getting an error:
This is the code I'm running to load in the file in python:
import pandas as pd
logics = pd.read_csv("test.csv")
and this is the error I'm getting:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 28682: invalid continuation byte
I tried encoding it into utf-8, but that gave me a different error.
code:
import pandas as pd
logics = pd.read_csv("test.csv", encoding= 'utf-8')
error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 12 fields in line 53, saw 14
For whatever reason, when I manually save the file in file explorer as UTF-8 and then save it back again as a CSV file it removes the quotation marks, but I need to automate this process. Is there any way I can use python to remove these quotation marks? Is it just some different kind of encoding?

So you can add more to this, maybe pull out some of the functionality into a function called "clean_line". Below should go through your csv, and remove all " characters in any of the lines. No real need for the pandas overhead on this one, using the standard python libraries should make it faster as well.
with open("test.csv",'r')as f:
lines = f.readlines()
with open("output.csv", 'w') as f:
output=[]
for line in lines:
output.append(line.replace('"',''))
f.writelines(output)

decoding issue while parsing JSON [python]

I am reading a JSON file in Python which has lots of fields and values (~8000 records).
Env: windows 10, python 3.6.4;
code:
import json
json_data = json.load(open('json_list.json'))
print (json_data)
With this I get an error. Below is the stack trace:
json_data = json.load(open('json_list.json'))
File "C:\Program Files (x86)\Python36-32\lib\json\__init__.py", line 296, in load
return loads(fp.read(),
File "C:\Program Files (x86)\Python36-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7977319: character maps to <undefined>
Along with this I have tried
import json
with open('json_list.json', encoding='utf-8') as fd:
json_data = json.load(fd)
print (json_data)
with this my program runs for a long time then hangs with no output.
I have searched almost all topics related to this and could not find a solution.
Note: The JSON data is a valid one as when I see it on Postman/any REST client it doesn't report any anomalies.
Any help on this or alternative solution on how can I load my JSON data (any way by converting it to string then back to JSON etc) will be of great help.
Here is what the file looks like around the reported error:
>>> from pprint import pprint
>>> f = open('C:/Users/c5242046/Desktop/test2/dblist_rest.json', 'rb')
>>> f.seek(7977319)
7977319
>>> pprint(f.read(100))
(b'\x81TICA EL ABGEN INGL\xc3\x83\xc2\x89S, S.A.","memory_size_gb":"64","since'
b'":"2017-04-10","storage_size_gb":"84.747')

The snippet you are asking about seems to have been double-encoded. Basically, whatever originally generated this data produced text in Latin-1 or some related encoding (Windows code page 1252?). It was then fed to a process which converts Latin-1 to UTF-8 ... twice.
Of course, "converting" data which is already UTF-8 but telling the computer that it's Latin-1 just produces mojibake.
The string INGL\xc3\x83\xc2\x89S suggests this analysis, if you can guess that it is supposed to say Inglés in upper case, and realize that the UTF-8 encoding for É is \xC3 \x89 and then examine which characters these two bytes encode in Latin-1 (or, as it happens, Unicode, which is a superset of Latin-1, though they are not compatible on the encoding level).
Notice that being able to guess which string a problematic sequence is supposed to represent is the crucial step here; it also explains why including a representative snippet of the problematic data - with enough context! - is vital for debugging.
Anyway, if the entire file has the same symptom, you should be able to undo the second, superfluous and incorrect round of re-encoding; though an error this far into the file makes me imagine it's probably a local problem with just one or a few records. Maybe they were merged from multiple input files, only one of which had this error. Then fixing it requires a fair bit of detective work, and manual editing, or identifying and fixing the erroneous source. A quick and dirty workaround is to simply manually remove any erroneous records.

Error reading csv file using pandas [duplicate]

This question already has answers here:
UnicodeDecodeError when reading CSV file in Pandas with Python
(25 answers)
Closed 5 years ago.
what i am trying is reading a csv to make a dataframe---making changes in a column---again updating/reflecting changed value into same csv(to_csv)- again trying to read that csv to make another dataframe...there i am getting an error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte
my code is
import pandas as pd
df = pd.read_csv("D:\ss.csv")
df.columns #o/p is Index(['CUSTOMER_MAILID', 'False', 'True'], dtype='object')
df['True'] = df['True'] + 2 #making changes to one column of type float
df.to_csv("D:\ss.csv") #updating that .csv
df1 = pd.read_csv("D:\ss.csv") #again trying to read that csv
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe7 in position 7: invalid continuation byte
So please suggest how can i avoid the error and be able to read that csv again to a dataframe.
I know somewhere i am missing "encode = some codec type" or "decode = some type" while reading and writing to csv.
But i don't know what exactly should be changed.so need help.

Known encoding
If you know the encoding of the file you want to read in,
you can use
pd.read_csv('filename.txt', encoding='encoding')
These are the possible encodings:
https://docs.python.org/3/library/codecs.html#standard-encodings
Unknown encoding
If you do not know the encoding, you can try to use chardet, however this is not guaranteed to work. It is more a guess work.
import chardet
import pandas as pd
with open('filename.csv', 'rb') as f:
result = chardet.detect(f.read()) # or readline if the file is large
pd.read_csv('filename.csv', encoding=result['encoding'])

Is that error happening on your first read of the data, or on the second read after you write it out and read it back in again? My guess is that it's actually happening on the first read of the data, because your CSV has an encoding that isn't UTF-8.
Try opening that CSV file in Notepad++, or Excel, or LibreOffice. Does your data source have the ç (C with cedilla) character in it? If it does, then that 0xE7 byte you're seeing is probably the ç encoded in either Latin-1 or Windows-1252 (called "cp1252" in Python).
Looking at the documentation for the Pandas read_csv() function, I see it has an encoding parameter, which should be the name of the encoding you expect that CSV file to be in. So try adding encoding="cp1252" to your read_csv() call, as follows:
df = pd.read_csv(r"D:\ss.csv", encoding="cp1252")
Note that I added the character r in front of the filename, so that it will be considered a "raw string" and backslashes won't be treated specially. That way you don't get a surprise when you change the filename from ss.csv to new-ss.csv, where the string D:\new-ss.csv would be read as D, :, newline character, e, w, etc.
Anyway, try that encoding parameter on your first read_csv() call and see if it works. (It's only a guess, since I don't know your actual data. If the data file isn't private and isn't too large, try posting the data file so we can see its contents -- that would let us do better than just guessing.)

One simple solution is you can open the csv file in an editor like Sublime Text and save it with 'utf-8' encoding. Then we can easily read the file through pandas.

Above method used by importing and then detecting file type works
import chardet
import pandas as pd
import chardet
with open('filename.csv', 'rb') as f:
result = chardet.detect(f.read()) # or readline if the file is large
pd.read_csv('filename.csv', encoding=result['encoding'])

Yes you'll get this error. I have work around with this problem, by opening csv file in notepad++ and changing the encoding throught Encoding menu -> convert to UTF-8. Then saving the file. Then again running python program over it.
Other solution is using codecs module in python for encoding-decoding of files. I haven't used that.

I am new to python. Ran into this exact issue when I manually changed the extension on my excel file to .csv and tried to read it with read_csv. However, if I opened the excel file and saved as csv file instead it seemed to work.

Error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

https://github.com/affinelayer/pix2pix-tensorflow/tree/master/tools
An error occurred when compiling "process.py" on the above site.
python tools/process.py --input_dir data -- operation resize --outp
ut_dir data2/resize
data/0.jpg -> data2/resize/0.png
Traceback (most recent call last):
File "tools/process.py", line 235, in <module>
main()
File "tools/process.py", line 167, in main
src = load(src_path)
File "tools/process.py", line 113, in load
contents = open(path).read()
File"/home/user/anaconda3/envs/tensorflow_2/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
What is the cause of the error?
Python's version is 3.5.2.

Python tries to convert a byte-array (a bytes which it assumes to be a utf-8-encoded string) to a unicode string (str). This process of course is a decoding according to utf-8 rules. When it tries this, it encounters a byte sequence which is not allowed in utf-8-encoded strings (namely this 0xff at position 0).
Since you did not provide any code we could look at, we only could guess on the rest.
From the stack trace we can assume that the triggering action was the reading from a file (contents = open(path).read()). I propose to recode this in a fashion like this:
with open(path, 'rb') as f:
contents = f.read()
That b in the mode specifier in the open() states that the file shall be treated as binary, so contents will remain a bytes. No decoding attempt will happen this way.

Use this solution it will strip out (ignore) the characters and return the string without them. Only use this if your need is to strip them not convert them.
with open(path, encoding="utf8", errors='ignore') as f:
Using errors='ignore'
You'll just lose some characters. but if your don't care about them as they seem to be extra characters originating from a the bad formatting and programming of the clients connecting to my socket server.
Then its a easy direct solution.
reference

Use encoding format ISO-8859-1 to solve the issue.

Had an issue similar to this, Ended up using UTF-16 to decode. my code is below.
with open(path_to_file,'rb') as f:
contents = f.read()
contents = contents.rstrip("\n").decode("utf-16")
contents = contents.split("\r\n")
this would take the file contents as an import, but it would return the code in UTF format. from there it would be decoded and seperated by lines.

I've come across this thread when suffering the same error, after doing some research I can confirm, this is an error that happens when you try to decode a UTF-16 file with UTF-8.
With UTF-16 the first characther (2 bytes in UTF-16) is a Byte Order Mark (BOM), which is used as a decoding hint and doesn't appear as a character in the decoded string. This means the first byte will be either FE or FF and the second, the other.
Heavily edited after I found out the real answer

It simply means that one chose the wrong encoding to read the file.
On Mac, use file -I file.txt to find the correct encoding. On Linux, use file -i file.txt.

I had a similar issue with PNG files. and I tried the solutions above without success.
this one worked for me in python 3.8
with open(path, "rb") as f:

use only
base64.b64decode(a)
instead of
base64.b64decode(a).decode('utf-8')

This is due to the different encoding method when read the file. In python, it defaultly
encode the data with unicode. However, it may not works in various platforms.
I propose an encoding method which can help you solve this if 'utf-8' not works.
with open(path, newline='', encoding='cp1252') as csvfile:
reader = csv.reader(csvfile)
It should works if you change the encoding method here. Also, you can find other encoding method here standard-encodings , if above doesn't work for you.

Those getting similar errors while handling Pandas for data frames use the following solution.
example solution.
df = pd.read_csv("File path", encoding='cp1252')

I had this UnicodeDecodeError while trying to read a '.csv' file using pandas.read_csv(). In my case, I could not manage to overcome this issue using other encoder types. But instead of using
pd.read_csv(filename, delimiter=';')
I used:
pd.read_csv(open(filename, 'r'), delimiter=';')
which just seems working fine for me.
Note that: In open() function, use 'r' instead of 'rb'. Because 'rb' returns bytes object that causes to happen this decoder error in the first place, that is the same problem in the read_csv(). But 'r' returns str which is needed since our data is in .csv, and using the default encoding='utf-8' parameter, we can easily parse the data using read_csv() function.

if you are receiving data from a serial port, make sure you are using the right baudrate (and the other configs ) : decoding using (utf-8) but the wrong config will generate the same error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
to check your serial port config on linux use : stty -F /dev/ttyUSBX -a

I had a similar issue and searched all the internet for this problem
if you have this problem just copy your HTML code in a new HTML file and use the normal <meta charset="UTF-8">
and it will work....
just create a new HTML file in the same location and use a different name

Check the path of the file to be read. My code kept on giving me errors until I changed the path name to present working directory. The error was:
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

If you are on a mac check if you for a hidden file, .DS_Store. After removing the file my program worked.

I had a similar problem.
Solved it by:
import io
with io.open(filename, 'r', encoding='utf-8') as fn:
lines = fn.readlines()
However, I had another problem. Some html files (in my case) were not utf-8, so I received a similar error. When I excluded those html files, everything worked smoothly.
So, except from fixing the code, check also the files you are reading from, maybe there is an incompatibility there indeed.

You have to use the encoding as latin1 to read this file as there are some special character in this file, use the below code snippet to read the file.
The problem here is the encoding type. When Python can't convert the data to be read, it gives an error.
You can you latin1 or other encoding values.
I say try and test to find the right one for your dataset.

I have the same issue when processing a file generated from Linux. It turns out it was related with files containing question marks..

Following code worked in my case:
df = pd.read_csv(filename,sep = '\t', encoding='cp1252')

If possible, open the file in a text editor and try to change the encoding to UTF-8. Otherwise do it programatically at the OS level.

Python 2.7 encoding

I am trying to write a xml file from python (2.7) using xml.append function.
I have a string "Frédéric" that needs to be written to xml file as one of the values.
I am trying to use unicode function on this string and then encode function to write to the file.
a ="Frédéric"
unicode(a, 'utf8')
By I get error message as 'ascii' codec can't decode byte 0xe9 in position 9'
I have gone through other stackoverflow posts for this scenario, the suggestion was to add unicode-literal before the string.
a = u'Frédéric'
a.encode('utf8')
Since, my 'a' variable is going to be dynamic (it can take any value from a list) I need to use unicode function.
Any suggestions please?
Thanks

Maybe following helps. You can use codecs to save the XML string while using utf-8.
import codecs
def save_xml_string(path, xml_string):
"""
Writes the given string to the file associated with the given path.
:param path:
Path to the file to write to.
:param xml_string:
The string to be written
:return:
nothing
"""
output_file = codecs.open(path, "w", "utf-8")
output_file.write(xml_string)
output_file.close()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.