Pandas - Tokenizing Data Expected 1 field saw multiple - python

A bit confused why I am getting this error. I thought skiprows should have taken care of this for me.
Error:
CParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 13
Line:
df_data = pd.read_csv(infile.name, skiprows=[6], sep=',')
CSV:
Header: 1asdf
Header: 2fac
Header: 3aaz
Header: 4ssw
Header: 5aaa
0.0,-64,192,152,27023,3,0,26275,31473,149,67,77,0.0
0.04050016403198242,-64,192,148,27021,3,0,26274,31471,149,67,77,0.038919925689697266
0.08100008964538574,-64,192,148,27017,3,0,26275,31467,149,67,77,0.07783985137939453
0.12150001525878906,-60,192,148,27019,3,0,26277,31467,149,67,77,0.1167600154876709
0.16199994087219238,-60,192,144,27015,3,0,26277,31463,149,67,77,0.15567994117736816
0.2025001049041748,-60,192,148,27075,3,0,26319,31463,149,67,77,0.19460034370422363

If you pass a list to skiprows, it interprets it as 'skip the rows in this list (0 indexed)'. Pass an integer instead. You probably also want header=None so your first row of data doesn't become the column names.
pd.read_csv(infile.name, skiprows=6, header=None)
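For illustration (this is not from the original answer), a minimal self-contained sketch of the difference, using an in-memory file in place of infile.name:

import pandas as pd
from io import StringIO

raw = "junk header 1\njunk header 2\n1,2,3\n4,5,6\n"

# skiprows as an integer: skip the first N rows entirely.
print(pd.read_csv(StringIO(raw), skiprows=2, header=None))

# skiprows as a list: skip only the row indices named in the list (0-indexed).
print(pd.read_csv(StringIO(raw), skiprows=[0, 1], header=None))

Both calls return the two data rows with integer column names 0, 1, 2, because header=None stops the first data row from being promoted to column names.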

I got the same error message. In my case it was because commas were used as decimal marks and the cells were separated by semicolons, so sep=";" solved the problem:
pd.read_csv(infile.name, sep=";")

This error comes when a row has more column entries than the parser expects from the schema.
That usually means the value in one of your columns contains the delimiter itself.
The parser then assumes a new column is starting, but in reality there isn't one, so the exception is thrown at runtime.
Solutions:
The best one is to ask whoever generates your input to fix it at the source.
The second, if you are permitted to drop the offending records, is to skip bad lines while reading:
df = pd.read_csv(file_loc, sep=',', on_bad_lines='skip')  # use error_bad_lines=False on pandas < 1.3
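If you want to see which lines are malformed before deciding to drop them, a small standard-library check is enough; a sketch with hypothetical sample data:

import csv
from io import StringIO

# Hypothetical file contents: the third line has an extra field.
raw = StringIO("a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n")

expected = 3
for lineno, row in enumerate(csv.reader(raw), start=1):
    if len(row) != expected:
        print(f"line {lineno}: expected {expected} fields, saw {len(row)}")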

Related

Pandas ignoring cells with " and ,

I have a semicolon-delimited CSV loaded into a pandas DataFrame with all dtypes of object. Within some of the cells the string value can contain ", a comma (,), or both (e.g. TES"T_ING,_VALUE). I then query the DataFrame using df.query based on some condition to get a subset, but the rows that have the pattern described in the example are omitted completely, while the remaining rows are returned just fine. Another requirement is that I need to pair every " within the text with a closing quote as well, but applying a lambda to replace " with "" is also not working properly. I have tried several methods and they are listed below.
Problem 1:
pd.read_csv("file.csv", delimiter=';')
pd.read_csv("file.csv", delmiter=';', thousands=',')
pd.read_csv("file.csv", delimiter=";", escapechar='"')
pd.read_csv("file.csv", delimiter=";", encoding='utf-8')
All of the above fail to load the data in question.
Problem 2:
Input: TES"T_ING,_VALUE should become TES""T_ING,_VALUE
I have tried:
df.apply(lambda s: s.str.replace('"', '""'))
which doesn't do anything.
What exactly is going on? I haven't been able to find any questions tackling this particular type of issue anywhere.
Appreciate your help in advance.
EDIT: Sorry I didn't provide some mockup data due to sensitivity but here is some fake data that illustrates the issue
The following is a sample of the csv structure:
Column1;Column2;Column3;Column4;Column5\n
TES"T_ING,_VALUE;Col2Value;Col3Value;Col4Value;Col5Value\n
Col1value;TES"T_ING,_VALUE2;Col3Value;Col4Value;Col5Value\n
I have tried utilizing quoting=csv.QUOTE_ALL/QUOTE_NONNUMERIC and quotechar='"' when loading in the df but the result ends up being
Column1;Column2;Column3;Column4;Column5\n
"TES"T_ING,_VALUE;Col2Value;Col3Value;Col4Value;Col5Value";;;;\n
"Col1value;TES"T_ING,_VALUE2;Col3Value;Col4Value;Col5Value";;;;\n
So it interprets the whole row as a value in column 1 rather than actually splitting on the ; and applying the quoting to only column 1. Truthfully I could iterate through each row in the df, do a split, and load the remaining values into their respective columns, but the CSV is quite large so this operation would take some time. The subset of the data the user queries on is supposed to be returned from an endpoint (this part is already working).
The problem was solved using DataFrame.apply with a custom function to process each record.
df = pd.read_csv("csv_file.csv", delimiter=';', escapechar='\\')
num_columns = len(df.columns)  # expected number of fields per row

def mapper(record):
    if ';' in record['col1']:
        content = record['col1'].split(';')
        if len(content) == num_columns:
            if '"' in content[0]:
                content[0] = content[0].replace('"', '""')
            record['col1'] = content[0]
            # repeat for the remaining columns
    return record

processed = df.apply(mapper, axis=1)
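As an alternative sketch (not the approach taken in the answer above), disabling quote handling at parse time keeps stray " characters inside their fields, so each row still splits on every ;:

import csv
from io import StringIO

import pandas as pd

raw = StringIO(
    'Column1;Column2;Column3;Column4;Column5\n'
    'TES"T_ING,_VALUE;Col2Value;Col3Value;Col4Value;Col5Value\n'
)

# csv.QUOTE_NONE tells the parser to treat " as an ordinary character.
df = pd.read_csv(raw, sep=';', quoting=csv.QUOTE_NONE)
print(df.loc[0, 'Column1'])   # TES"T_ING,_VALUE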

Python pandas extra 0 in numeric values

I have a simple piece of code that reads a csv file. After that I change the names of the columns and print them. I found one weird issue: for some numeric columns it is adding an extra .0. Here is my code:
v_df = pd.read_csv('csvfile', delimiter=';')
v_df = v_df.rename(columns={'Order No.': 'Order_Id'})
for index, csv_row in v_df.iterrows():
    print(csv_row.Order_Id)
Output is:
149545961155429.0
149632391661184.0
If I remove the empty row (2nd one in the above output) from the csv file, .0 does not appear in the ORDER_ID.
After doing some search, I found that converting this column to string will solve the problem. It does work if I change the first row of the above code to:
v_df = pd.read_csv('csvfile', delimiter=';', dtype={'Order No.': 'str'})
However, the issue is that the column name 'Order No.' is changed to Order_Id by the rename, so I cannot use 'Order No.'. For this reason I tried the following:
v_df[['Order_Id']] = v_df[['Order_Id']].values.astype('str')
But unfortunately it seems that astype is not changing the datatype and .0 is still appearing. My questions are:
1- Why is .0 appearing in the first place when there is an empty row in the csv file?
2- Why is the datatype change not happening after the rename?
My aim is just to get rid of the .0; I don't want to change the datatype if the .0 can go away using any other method.
I am trying to emulate your df here; although it has some differences, I think it will work for you:
import pandas as pd
import numpy as np
v_df = pd.DataFrame([['13-Oct-22', '149545961155429.0', '149545961255429', 'Delivered'],
                     ['12-Oct-22', None, None, 'delivered'],
                     ['15-Oct-22', '149632391661184.0', '149632391761184', 'Delivered']],
                    columns=['Transaction Date', 'Order_Id', 'Order Item No.', 'Order Item Status'])
v_df[['Order_Id']] = v_df[['Order_Id']].fillna(np.nan).values.astype('float').astype('int').astype('str')
Try it and let me know
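For context (this is not part of the answer above): the trailing .0 appears because the blank row becomes NaN, and NaN cannot be stored in an int64 column, so pandas promotes the whole column to float64. One way to keep integers alongside missing values is pandas' nullable Int64 dtype; a minimal sketch with made-up data:

import pandas as pd
from io import StringIO

raw = StringIO("Order No.;Status\n149545961155429;Delivered\n;delivered\n149632391661184;Delivered\n")

# Parsing the column as nullable Int64 turns the blank row into <NA> instead of
# forcing the whole column to float64.
v_df = pd.read_csv(raw, delimiter=';', dtype={'Order No.': 'Int64'})
v_df = v_df.rename(columns={'Order No.': 'Order_Id'})
print(v_df['Order_Id'])   # integers plus <NA>, no trailing .0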

Koalas Dataframe read_csv reads null column as not null

I am working on loading a sample csv file using koalas. What I see is a weird behavior.
The file has a blank column, area_code. All the rows for this column are blank.
When I read the file as df = ks.read_csv('zipcodes.csv'), the column comes back with nulls, as expected. All good.
When I read the file as df = ks.read_csv('zipcodes.csv', dtype=str), the column no longer shows any nulls.
After a closer look, it seems that dtype=str is causing this column to be loaded with the string value 'None'.
Any reason why this would happen? Any help is appreciated. Thanks in advance.
Bhupesh C
For pandas, that issue was discussed here and seems to be solved.
I don't know much about koalas, but you can try this:
import numpy as np
df = ks.read_csv('zipcodes.csv', dtype=str, keep_default_na=False).replace('', np.nan)
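For reference, the same idea in plain pandas (a sketch only; the zipcodes.csv layout is assumed from the question). keep_default_na=False keeps the empty fields as empty strings instead of letting dtype=str turn them into odd string values, and replace('', np.nan) restores real nulls:

import numpy as np
import pandas as pd
from io import StringIO

raw = StringIO("zip,area_code\n10001,\n10002,\n")

df = pd.read_csv(raw, dtype=str, keep_default_na=False).replace('', np.nan)
print(df['area_code'].isnull().sum())   # 2, so null checks behave as expected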

Pandas DataFrame- Finding Index Value for a Column

I have a DataFrame that has columns such as ID, Name, Specification, Time.
Here is my file path to open it:
mc = pd.read_csv("C:\\data.csv", sep = ",", header = 0, dtype = str)
When I checked my column values, using
mc.columns.values
I found my ID column had a weird character in front of it that looked like this:
['\ufeffID', 'Name', 'Specification', 'Time']
After this, I assigned that column the name ID like this:
mc.columns.values[0] = "ID"
When I checked this using
mc.columns.values
I got my result as:
array(['ID', 'Name', 'Specification', 'Time'])
Then, I checked with,
"ID" in mc.columns.values
it gave me "True"
Then I tried,
mc["ID"]
I got an error like this:
KeyError: 'ID'
I want to get the values of the ID column and get rid of that weird character in front of the column name. Is there any way to solve that? Any help would be appreciated. Thank you in advance.
That's a UTF-16 BOM; pass encoding='utf-16' to read_csv. See: https://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding
mc = pd.read_csv("C:\\data.csv", sep=",", header=0, dtype=str, encoding='utf-16')
The above should work; FE FF is the BOM for UTF-16 big-endian, to be specific.
Also you should use rename rather than try to overwrite the np array value:
mc.rename(columns={mc.columns[0]: "ID"}, inplace=True)
should work correctly
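If the file turns out to be UTF-8 with a BOM rather than UTF-16 (a leading '\ufeff' in the first column name when reading with the default encoding is typical of that case), the utf-8-sig codec strips the marker; a hedged alternative:
mc = pd.read_csv("C:\\data.csv", sep=",", header=0, dtype=str, encoding='utf-8-sig')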

read_table pandas python numeric error

I am doing a basic pd.read_table of a .txt file. The first column is a list of cusips. The cusip "65248E10" is being read as a number 65248E10 = 652480000000000 (E10 as scientific notation).
I have been going through the pandas documentation but I can't figure out how to require it to stay as a string. http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.parsers.read_table.html#pandas.io.parsers.read_table
Also, even if I put header = 0, it seems to be putting the first row as the headers, and then row 0 is the second row and so on. If my text file has no column names, how can I get that to default to NULL (or 1, 2, 3, etc.)?
Thanks for the help. I am new to pandas/python
If we have a data file which looks like
65248E10 11
55555E55 22
then we can read it in with something like
>>> pd.read_table("cusip.txt", header=None, delimiter=" ", converters={0: str})
          0   1
0  65248E10  11
1  55555E55  22
where we use header=None to tell it that there aren't any headers, delimiter=" " to tell it there's a space delimiter (adjust to match your data format), and converters={0: str} to tell it that the first column should be read in as a string and left alone rather than processed further. Instead of converters={0: str}, dtype={0: str} would work too; either way pandas can still figure out what the other columns are on its own.
The problem with using header=0 is that 0 here doesn't mean "no header", it means use row number #0 (the first row) as the headers.
To stop your column from being read as a number, use the converters parameter and specify str as the converter for the column containing your "cusips".
For the header, as documented on the page you linked to, header is the number of the row which is to be considered the header; it is not a boolean saying "do I have a header or not". Setting it to zero means to use row zero (i.e., the first row) as the header. The documentation explicitly says:
Specify None if there is no header row.
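A sketch of the dtype-based variant mentioned above, under the same assumed cusip.txt layout (the column names passed to names= are made up for illustration):

import pandas as pd

# header=None: the file has no header row; names= supplies our own labels.
# dtype={'cusip': str} keeps the first column as text so 65248E10 is not read
# as scientific notation, while the other columns are still inferred.
df = pd.read_csv("cusip.txt", sep=" ", header=None, names=["cusip", "value"],
                 dtype={"cusip": str})
print(df.dtypes)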
