I am trying to parse multiple gpx files stored in a directory with gpxpy in Python and create a pandas data frame.
Here is my code:
import gpxpy
import os
# Open each file in read mode and parse it
gpx_dir = r'/Users/Gav/GPX Data/'
for filename in os.listdir(gpx_dir):
    gpx_file = open(os.path.join(gpx_dir, filename), 'r')
    gpx = gpxpy.parse(gpx_file)
I am getting the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 3131: ordinal not in range(128)
I know the gpx file is fine as I am able to open it and parse it as a single file, but as soon as I try to open multiple gpx files it gives this error.
OK, after lots of digging around I fixed the problem myself. It turns out there was a .DS_Store file in my data folder (a hidden, auto-generated macOS file), and it caused the issue. Removing it fixed the problem.
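A minimal sketch of how the loop could skip hidden files like .DS_Store by filtering on the .gpx extension (the helper name `iter_gpx_paths` is my own, not part of gpxpy):

```python
import os

def iter_gpx_paths(gpx_dir):
    # Yield full paths of .gpx files only, skipping hidden files such as .DS_Store
    for filename in sorted(os.listdir(gpx_dir)):
        if filename.startswith('.') or not filename.lower().endswith('.gpx'):
            continue
        yield os.path.join(gpx_dir, filename)
```

Each yielded path can then be opened and handed to `gpxpy.parse()` exactly as in the original loop.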
I'm working on taking csv files and putting them into a PostgreSQL database. For one of the files though, every field is surrounded by quotes (when looking at it in Excel it looks normal; in Notepad, though, one row looks like "Firstname","Lastname","CellNumber","HomeNumber",etc. when it should look like Firstname,Lastname,CellNumber,HomeNumber). It broke when I tried to load it into SQL.
I tried loading the file into Python to do data cleaning, but I'm getting an error.
This is the code I'm running to load in the file in python:
import pandas as pd
logics = pd.read_csv("test.csv")
and this is the error I'm getting:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 28682: invalid continuation byte
I tried specifying utf-8 encoding, but that gave me a different error.
code:
import pandas as pd
logics = pd.read_csv("test.csv", encoding= 'utf-8')
error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 12 fields in line 53, saw 14
For whatever reason, when I manually save the file in file explorer as UTF-8 and then save it back again as a CSV file it removes the quotation marks, but I need to automate this process. Is there any way I can use python to remove these quotation marks? Is it just some different kind of encoding?
You can extend this, perhaps pulling some of the functionality out into a function called "clean_line". The code below goes through your csv and removes all " characters from every line. There's no real need for the pandas overhead on this one; using the standard Python library should make it faster as well.
with open("test.csv", 'r') as f:
    lines = f.readlines()

with open("output.csv", 'w') as f:
    output = []
    for line in lines:
        output.append(line.replace('"', ''))
    f.writelines(output)
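As an alternative: since the reported UnicodeDecodeError mentions byte 0xe1 (a common Latin-1/cp1252 byte, 'á'), it may be enough to tell pandas the encoding; pandas already strips the surrounding quotes from quoted fields on its own. A minimal sketch with hypothetical sample data:

```python
import io
import pandas as pd

# 0xe1/0xe9 are accented letters in Latin-1; as stray bytes they break strict
# UTF-8 decoding. The sample row here is made up for illustration.
raw = b'"Firstname","Lastname"\n"Jos\xe9","Garc\xeda"\n'

# read_csv parses the quoted fields and drops the quotes automatically
df = pd.read_csv(io.BytesIO(raw), encoding='latin-1')
```

Whether 'latin-1' or 'cp1252' is the right choice depends on where the file actually came from, so treat the encoding name as an assumption to verify.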
I am having trouble reading a csv file using read_csv in Pandas. Here's the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
I have tried a bunch of different encoding types with the file I am dealing with and none seem to work. The file is from Google's Search Ads 360 product, which says the csv should be in the 'UTF-16' format. Strangely, if I open the file in Excel and save it as a utf-8 format, I can use read_csv normally.
I've tried the solutions to a similar problem here, but they did not work for me. This is the only code I am running:
import pandas as pd
df = pd.read_csv('path/file.csv')
Edit: I read the file in as tab-delimited, and that seemed to work. I still don't understand why I got the error I did when I tried to read it in as a normal csv. Any insight into this would be appreciated!
Try this encoding:
import pandas as pd
df = pd.read_csv('path/file.csv', encoding='cp1252')
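For what it's worth, a 0xff byte at position 0 is also consistent with a UTF-16 little-endian byte-order mark, which would match the product's stated 'UTF-16' format; such exports are often tab-delimited as well, which would fit the tab-delimited workaround in the edit. A sketch with made-up data:

```python
import io
import pandas as pd

# A UTF-16 LE file starts with the BOM bytes 0xFF 0xFE -- exactly the
# 0xff "invalid start byte" the utf-8 codec complains about.
raw = 'col_a\tcol_b\n1\t2\n'.encode('utf-16')  # hypothetical tab-delimited export

# The 'utf-16' codec consumes the BOM itself, so no extra handling is needed
df = pd.read_csv(io.BytesIO(raw), encoding='utf-16', sep='\t')
```

If cp1252 happens to work on a given file, that suggests the file was not UTF-16 after all, so it is worth checking the first two bytes before deciding.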
I have a bucket in S3 with a CSV in it.
There are no non-ASCII characters in it.
When I try to read it using Python, it will not let me.
I used: df = self.s3_input_bucket.get_file_contents_from_s3(path)
as I used on many occasions recently in the same script, and get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x84 in position 14: invalid start byte.
To make sure it goes to the right path, I put another plain-text file in the same folder and was able to read it without a problem.
I tried many solutions I found on other questions. As just one example, I saw a solution someone offered, to try this:
str = unicode(str, errors='replace')
or
str = unicode(str, errors='ignore')
from this question: UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c
but how can I use them in this case?
this did not work:
str = unicode(self.s3_input_bucket.get_file_contents_from_s3(path), errors='replace')
Apparently, I had been trying to open a zipped file.
After much research, I was able to read it into a data frame using this code:
import os
import zipfile

import pandas as pd
import s3fs

def _zipped_csv_from_s3_to_df(self, path, s3_fs):
    with s3_fs.open(path) as zipped_dir:
        with zipfile.ZipFile(zipped_dir, mode='r') as zipped_content:
            for score_file in zipped_content.namelist():
                with zipped_content.open(score_file) as scores:
                    return pd.read_csv(scores)

s3_fs = s3fs.S3FileSystem(s3_additional_kwargs={'ServerSideEncryption': 'AES256'})
market_score = self._zipped_csv_from_s3_to_df(os.path.join('my-bucket', 'path-in-bucket'), s3_fs)
I will always have only one CSV file inside the zip, which is why I know I can return on the first iteration. However, the function does iterate over all the files in the zip.
The error message in the question actually relates to a CSV encoding issue (quite separate from the title: "read zipped CSV from s3").
One possible solution to the title question is:
pd.read_csv('s3://bucket-name/path/to/zip/my_file.zip')
Pandas will open the zip and read in the CSV. This will only work if the zip contains a single CSV file. If there are multiple, another solution is required (perhaps more like OP's solution).
The encoding issue can be resolved by specifying the encoding type in the read. For example:
pd.read_csv('s3://bucket-name/path/to/zip/my_file.zip', encoding='ISO-8859-1')
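If the zip holds more than one CSV, one possible approach is to open the archive explicitly and pick the member you need by name. In this sketch an in-memory zip stands in for the S3 object (with S3 the buffer would instead come from s3_fs.open(path), as in the answer above), and the member names are made up:

```python
import io
import zipfile
import pandas as pd

# Build an in-memory zip with two CSV members as a stand-in for the S3 object
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('scores.csv', 'id,score\n1,0.9\n')
    zf.writestr('extra.csv', 'id,notes\n1,ok\n')
buf.seek(0)

# Open the archive and read only the member we care about
with zipfile.ZipFile(buf) as zf:
    with zf.open('scores.csv') as member:
        df = pd.read_csv(member)
```

`ZipFile.open` returns a binary file-like object, which `read_csv` accepts directly, so no temporary file is needed.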
I have a comma-separated .txt file with French characters such as Vétérinaire and Désinfectant.
import pandas as pd
df = pd.read_csv('somefile.txt', sep=',', header=None, encoding='utf-8')
[Decode error - output not utf-8]
I have read many Q&A posts (including this) and tried many different encodings such as 'latin1' and 'utf-16'; they didn't work. However, when I ran the exact same script on a different Windows 10 computer with a similar Python setup (both Python 3.6), it worked perfectly fine.
Edit: I tried this. Using encoding='cp1252' helps for some of the .txt files I want to import, but for a few .txt files, it gives the following error.
File "C:\Program_Files_Extra\Anaconda3\lib\encodings\cp1252.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 25: character maps to <undefined>
Edit:
Trying to identify encoding from chardet
import chardet
import pandas as pd
test_txt = 'somefile.txt'
rawdata = open(test_txt, 'rb').read()
result = chardet.detect(rawdata)
charenc = result['encoding']
print (charenc)
df = pd.read_csv(test_txt, sep=',', header=None, encoding=charenc)
print (df.head())
utf-8
[Decode error - output not utf-8]
Your program opens your files with a default encoding that doesn't match the contents of the file you are trying to open.
Option 1: Decode the file contents to python string objects:
rawdata = open(test_txt, 'r', encoding='utf-8').read()
Option 2: Open the csv file in an editor like Sublime Text and save it with utf-8 encoding to easily read the file through pandas.
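A third, more defensive option when chardet guesses wrong: decode the raw bytes with errors='replace' so undecodable bytes become U+FFFD instead of raising, then hand the resulting text to pandas. The sample bytes below are hypothetical, chosen to reproduce the 0x8f byte that cp1252 cannot map:

```python
import io
import pandas as pd

# Hypothetical raw bytes: cp1252 accented letters plus 0x8f, which cp1252 leaves undefined
raw = b'name,product\nV\xe9t\xe9rinaire,D\xe9sinfectant\x8f\n'

# errors='replace' substitutes U+FFFD for any byte the codec cannot decode
text = raw.decode('cp1252', errors='replace')
df = pd.read_csv(io.StringIO(text), sep=',')
```

This keeps the import from crashing at the cost of losing the original undecodable bytes, so it is best used for inspection rather than as a silent default.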
I have a .xlsx file and transformed it into a .csv file. Then I'm uploading the .csv file to a Python script I wrote, but an error is thrown.
Since the file is uploaded through HTTP, I'm accessing it with file = request.files['file']. This returns a file of type FileStorage. Then I'm trying to read it with a StringIO object as follows:
io.StringIO(file.stream.read().decode("UTF8"), newline=None)
I'm getting the following error:
TypeError: initial_value must be str or None, not bytes
I also tried to read the file of FileStorage object this way:
file_data = file.read().decode("utf-8")
and I'm getting the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 97: invalid start byte
Maybe it is interesting to note that I am able to read the file directly, i.e. as a csv file, with the following code:
with open('file_path', 'r') as file:
csv_reader = csv.reader(file, delimiter=";")
...
But since I'm trying to get the file from an upload button, i.e. an input HTML element of type file, as mentioned above, I'm getting a FileStorage object, which I'm not able to read.
Anyone has any idea how could I approach this?
Thank you in advance!
It could be that it's not encoded in utf-8. Try decoding it as latin-1 instead:
file_data = file.read().decode("latin-1")
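If that works, the full path from the uploaded stream to a DataFrame might look like this sketch. An in-memory BytesIO stands in for file.stream here, and the semicolon delimiter is an assumption carried over from the question's csv.reader call:

```python
import io
import pandas as pd

# 0xfc is 'ü' in Latin-1; as a lone byte it is invalid UTF-8, matching the
# "invalid start byte" error in the question. BytesIO simulates file.stream.
stream = io.BytesIO('Name;City\nM\xfcller;Z\xfcrich\n'.encode('latin-1'))

# Decode the raw bytes explicitly, then wrap the text for pandas
text = stream.read().decode('latin-1')
df = pd.read_csv(io.StringIO(text), sep=';')
```

Note that latin-1 can decode any byte sequence, so success here does not prove the file really is Latin-1; it only guarantees no UnicodeDecodeError.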