Pandas: Read CSV: ValueError: could not convert string to float - python

I'm trying to read a large and complex CSV file with pandas.read_csv.
The exact command is
pd.read_csv(filename, quotechar='"', low_memory=True, dtype=data_types, usecols=columns, true_values=['T'], false_values=['F'])
I am pretty sure that the data types are correct. I can read the first 16 million lines (setting nrows=16000000) without problems, but somewhere after that I get the following error:
ValueError: could not convert string to float: '1,123'
It seems that, for some reason, pandas is treating two columns as one.
What could be the problem, and how can I fix it?

I found the mistake. The problem was a thousands separator.
When writing the CSV file, most numbers were below one thousand and were written correctly. This one value, however, was greater than one thousand and was written as "1,123", which pandas did not recognize as a number but as a string.
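If re-generating the file is not an option, pandas' thousands parameter handles this directly. A minimal sketch, reusing the names from the question's own call:

import pandas as pd

# thousands=',' makes read_csv parse quoted values like "1,123"
# as the number 1123 instead of leaving them as strings.
df = pd.read_csv(filename, quotechar='"', thousands=',',
                 low_memory=True, dtype=data_types, usecols=columns,
                 true_values=['T'], false_values=['F'])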

Related

Trouble reading CSV file using pandas

I'm working on a data analysis project and I wanted to read data from CSV files using pandas. I read the first CSV file and it was fine, but the second one gave me a UTF-8 encoding error. I exported the file to CSV and encoded it as UTF-8 in the Numbers spreadsheet app. However, the resulting DataFrame is not in the expected format. Any idea why?
(screenshot: the original CSV file in Numbers)
It looks like your file is semicolon-separated, not comma-separated.
To fix this, add the sep=';' parameter to the pd.read_csv call:
pd.read_csv("mitb.csv", sep=';')
Try adding the correct delimiter, in this case ";", to read the CSV:
mitb = pd.read_csv('mitb.csv', sep=";")
The file is semicolon-separated, and the decimal marker is a comma, not a dot:
df = pd.read_csv('mitb.csv', sep=';', decimal=',')
And please do not upload images of code/data/errors.

Pandas read_csv produces unexpected behavior, why?

I've got a large tab-separated file and am trying to load it using
import pandas as pd
df = pd.read_csv(..., sep="\t")
however, the process crashes with the following error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 8 fields in line 1743925, saw 12
Nothing looked wrong with that particular line when I printed it out manually. Confident that there was nothing wrong with my file, I went and calculated the field counts myself...
from collections import Counter

lengths = []
with open(...) as f:
    for line in f:
        lengths.append(len(line.split('\t')))

c = Counter(lengths)
print(c)
...and got the result Counter({8: 2385674}). So I wondered what pandas does differently, but the error is raised inside a .pyx file, so I cannot set a breakpoint there. What could be the cause of this? Where is my expectation flawed?
Fixed the issue. It turned out the problem was a mismatch between the quoting used on CSV export and on read. The issue was solved by matching the quoting argument of read_csv with the quoting used by the to_csv call that created the file. I assume that, because of the mismatch, some tabs and newlines were treated as part of quoted string literals, hence the parser seeing 11 tab characters on one row (they actually spanned 2+ rows).
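A hedged sketch of that fix: read the file back with the same quoting that was used to write it. QUOTE_NONE below is only an example; the point is that the two sides must agree.

import csv
import pandas as pd

# Write and read with matching quoting so tabs and newlines are
# never misread as part of quoted string literals.
df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
df.to_csv('data.tsv', sep='\t', index=False, quoting=csv.QUOTE_NONE)
df_back = pd.read_csv('data.tsv', sep='\t', quoting=csv.QUOTE_NONE)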

Truncated Beginning String Characters When Reading txt file to pandas

I am trying to read a txt file into a pandas DataFrame using pandas.read_fwf. Here's my line of code:
klia_sepang = pd.read_fwf('KLIA_SEPANG.txt', sep='[ ]{1,}')
However, I'm finding that every number in the hundreds has its first digit truncated. So 791.0 becomes 91.0, 309.0 becomes 09.0, and so on. I'm not sure why this happens. I've tried adding parameters like colspecs and widths, to no avail.
(screenshots: the txt file and the resulting pandas DataFrame)
Looking at your text file, you probably want to use the widths or colspecs parameters to define how to break the file up into columns. Or you might have success just letting read_fwf infer how to organize the columns of data.
I don't think passing sep with those characters is helping, and it may be confusing the parser.
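For example, a minimal sketch assuming the columns really do sit at fixed character positions (the explicit ranges below are hypothetical):

import pandas as pd

# Drop sep entirely; read_fwf is for fixed-width files and infers the
# column boundaries by default (colspecs='infer').
klia_sepang = pd.read_fwf('KLIA_SEPANG.txt')

# If inference still clips leading digits, give explicit (start, end)
# character ranges; the values below are placeholders.
# klia_sepang = pd.read_fwf('KLIA_SEPANG.txt', colspecs=[(0, 10), (10, 18), (18, 26)])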

Can't convert base64 images from csv to file in Python

I know it looks like this has been answered before, but I can't seem to find a solution for this issue. I have a CSV file that contains very long strings of Base64-encoded images (~5 MB each). I raised the CSV field size limit to the maximum. Each row holds several encoded images in separate columns, then a few values that are only a couple of words long. I can read the short values, e.g. with print(row[7]), no problem. The image columns won't print their Base64 strings, and when I try to decode them and save them to the filesystem, the files end up empty. Any thoughts?
import base64

# Decode the Base64 payload from the CSV row and write it out as a PNG.
fh = open("~path~/image.png", "wb")
x = base64.b64decode(row[1])
fh.write(x)
fh.close()
Thanks for any help!
EDIT: Works now. CSV splitting in Python seems to act a little differently than in Java. The empty values came up because the CSV was saved differently than the exporting tool indicated, leaving rows like ("8",,"data:image/png;base64,IR0BRR....",...). I didn't catch the empty field before, which is why the output was blank. I had also prepended the data:image/png part to the string again, because I assumed Python's string split would break on the comma after base64 the way Java would. After adjusting for both, the image saves correctly to my filesystem.
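A hedged sketch of that fix (the file names and column index are the question's examples, not a fixed layout):

import base64
import csv

# Each image cell holds a data URI like "data:image/png;base64,...",
# so strip the header before decoding, and skip the empty fields the
# export left behind.
with open('images.csv', newline='') as src:
    for i, row in enumerate(csv.reader(src)):
        cell = row[1]
        if not cell:
            continue
        payload = cell.split(',', 1)[1]  # drop "data:image/png;base64,"
        with open('image_%d.png' % i, 'wb') as fh:
            fh.write(base64.b64decode(payload))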

Writing Integers to a File

I'm having a really difficult time writing integers out to a file. Here's my situation. I have a file, let's call it 'idlist.txt'. It has multiple columns and is fairly long (10,000 rows), but I only care about the first column of data.
I'm loading it into python using:
import numpy as np
FH = np.loadtxt('idlist.txt',delimiter=',',comments='#')
# Testing initial data type
print FH[0,0],type(FH[0,0])
>>> 85000370342.0 <type 'numpy.float64'>
# Converting to integers
F = [int(FH[i,0]) for i in range(len(FH))]
print F[0],type(F[0])
>>> 85000370342 <type 'long'>
As you can see, the data must be made into integers. What I would now like to do is write the entries of this list out as the first column of another file (really the only column in the entire file); call it 'idonly.txt'. Here is how I'm trying to do it:
with open('idonly.txt','a') as f:
    for i in range(len(F)):
        f.write('%d\n' % (F[i]))
This is clearly not producing the desired output - when I open the file 'idonly.txt', each entry is actually a float (i.e., 85000370342.0). What exactly is going on here, and why is writing integers to a file such a complicated task? I found the string-formatting idea here: How to write integers to a file, but it didn't fix my issue.
Okay, well it appears that this is completely my fault. When opening the file I used mode 'a', which means append. It turns out that the first time I wrote this out to a file I did it incorrectly, and ever since I've been appending the correct answer onto that and simply not looking far enough down, since it's a really long file.
For reference, here are all of the modes you can use when handling files in Python: http://www.tutorialspoint.com/python/python_files_io.htm. Choose carefully.
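A minimal sketch of the fix described above (F is the integer list from the question): open the file with mode 'w' so stale output from earlier runs is replaced rather than appended to.

# Mode 'w' truncates the file first, discarding the old incorrect rows.
with open('idonly.txt', 'w') as f:
    for value in F:
        f.write('%d\n' % value)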
Try using:
f.write('{}\n'.format(F[i]))
