I understand why I get this error when I call df = pd.read_csv(file):
ParserError: Error tokenizing data. C error: Expected 14 fields in line 7, saw 30
When pandas reads the CSV, it sees 14 strings/columns in the first row and, based on that first row, treats them as the headers (which is what I want).
However, later rows extend to more columns (specifically when it gets to row 7).
I can find solutions that read the file by skipping rows 1-6, but I don't want that. I still want the whole CSV to be read; instead of the header being 14 columns, how can I tell pandas to make the header 30 columns, and if there is no text/string, just leave the column as "", null, or some arbitrary numbering? In other words, I don't care what each column is named; I just need the placeholder so parsing can continue past row 6.
I'm wondering is there a way to read in the csv, and explicitly say there are 30 columns but have not found a solution.
I can throw some random solutions that I think should work.
1) Set header=None and give column names via the names parameter of read_csv:
df = pd.read_csv(file, header=None, names=['field1', 'field2', ..., 'field30'])
PS. This will work if your CSV doesn't have a header already.
2) Secondly, you can try the command below (if your CSV already has a header row). Note that for 30 columns the indices run 0-29:
df = pd.read_csv(file, usecols=[0, 1, 2, ..., 29])
Let me know if this works out for you.
Thanks,
Rohan Hodarkar
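A third sketch, using a small inline sample rather than the real 14/30-column file: passing a names= list wider than the widest row makes pandas pad short rows with NaN, which gives the placeholder columns the question asks for.

```python
import pandas as pd
from io import StringIO

# Tiny stand-in for the 14-vs-30 case: a 3-column header,
# but one later row has 5 fields.
csv_text = "a,b,c\n1,2,3\n4,5,6,7,8\n"

# names= wider than the widest row pads missing cells with NaN.
# With header=None the original header line becomes a data row;
# add skiprows=1 if you would rather drop it.
df = pd.read_csv(StringIO(csv_text), header=None, names=range(5))
print(df.shape)  # (3, 5)
```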
What about trying the following? To be noted: error_bad_lines=False will cause the offending lines to be skipped.
data = pd.read_csv('File_path', error_bad_lines=False)
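As a side note, error_bad_lines was deprecated in pandas 1.3 and later removed; the replacement is the on_bad_lines parameter. A minimal sketch with a made-up sample:

```python
import pandas as pd
from io import StringIO

# A small sample where the second data row has an extra field.
csv_text = "a,b\n1,2\n3,4,5\n6,7\n"

# In pandas >= 1.3, on_bad_lines replaces error_bad_lines;
# 'skip' drops rows that have too many fields.
df = pd.read_csv(StringIO(csv_text), on_bad_lines='skip')
print(df.shape)  # (2, 2)
```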
A few more collected answers:
It might be an issue with the delimiters in your data's first row.
To solve it, try specifying the sep and/or header arguments when calling read_csv. For instance,
df = pandas.read_csv('File_path', sep='delimiter', header=None)
In the code above, sep defines your delimiter and header=None tells pandas that your source data has no row for headers / column titles. As the documentation says: "If file contains no header row, then you should explicitly pass header=None". In this case, pandas automatically creates whole-number indices for each field {0,1,2,...}.
According to the docs, the delimiter thing should not be an issue. The docs say that "if sep is None [not specified], will try to automatically determine this." I however have not had good luck with this, including instances with obvious delimiters.
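If auto-detection is unreliable, one way to inspect the delimiter yourself is the standard library's csv.Sniffer. A small sketch with a made-up semicolon-separated sample:

```python
import csv

# A made-up sample; in practice, pass the first few lines of your file.
sample = "a;b;c\n1;2;3\n4;5;6\n"

# Sniffer guesses the dialect (delimiter, quoting) from the text.
dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter)  # ';'
```

The detected delimiter can then be passed straight to read_csv via sep=.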
This might be a delimiter issue: many CSV files are created with sep='\t', so try calling read_csv with the tab character (\t) as the separator. That is, try opening the file with the following line of code:
data=pd.read_csv("File_path", sep='\t')
OR
pandas.read_csv('File_path',header=None,sep=', ')
Related
I am trying to get the values from the columns in a file. The doc looks like this:
the data I want to read
So all the examples I have found use pd.read_csv or pd.DataFrame, but all that data usually has a clear header and nothing on top of the data (my file has about 10 lines at the top that I don't really need for what I am doing).
Also, I think maybe there is something wrong because I tried to run:
data = pd.read_csv('tdump_BIL_100_-96_17010100.txt',header=15)
and I get
pd.read_csv output
which is just the row in one column, so there is no separation apparently, and therefore no way of getting the columns I need.
So my question is if there is a way to get the data from this file with pandas and how to get it.
If the number of lines to skip is fixed, skip the initial rows, indicate that no header is present, and specify that values are separated by whitespace:
df = pd.read_csv(filename, skiprows=15, header=None, sep=r'\s+')
See read_csv for documentation.
My goal is to separate data stored in cells to multiple columns in the same row.
For example, I would like to take data that looks like this:
Row 1: [<1><2>][<3><4>][][]
Row 2: [<1><2>][<3><4>][][]
Into data that looks like this:
Row 1: [1][2][3][4]
Row 2: [1][2][3][4]
I tried using the code below to pull the csv and separate each line at the ">"
df = pd.read_csv('file.csv', engine='python', sep="\*>", header=None)
However, the code did not function as anticipated. Instead, the separation occurred at seemingly random and unpredictable points (I'm sure there's a pattern but I don't see it.) And each break created another row as opposed to another column. For example:
Row 1: [<1>][<2>]
Row 2: [<3>]
Row 3: [<4>]
I thought the issue might lie with reading the CSV file, so I tried re-scraping the site with the separator included, but it produced the same results, so I'm assuming it's an issue with the separator call. However, I found that call only after trying many others that caused various errors. For example, when I tried using sep = '>' I got the following error: ParserError: '>' expected after '"', and when I tried sep = '\>', I got the following error: ParserError: Expected 36 fields in line 1106, saw 120. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
These errors sent me looking through multiple resources, including this and this among others.
However, I have found no resources that successfully demonstrate how to separate each column within a row using a '>' delimiter. If anyone knows how to do this, please let me know. Your help is much appreciated!
Update:
Here is an actual screenshot of the CSV file for a better understanding of what I was trying to demonstrate above. My end goal is to have all the data in columns, with each column holding one descriptive factor as opposed to many as they do now.
Would this work:
string="[<1><2>][<3><4>][][]"
string=string.replace("[","")
string=string.replace("]","")
string=string.replace("<","[")
string=string.replace(">","]")
print(string)
Result:
[1][2][3][4]
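If the data is already in a DataFrame, the same replace chain can be applied to a whole column with the vectorised str.replace (the column name 'raw' is made up for this sketch):

```python
import pandas as pd

# Made-up column holding strings shaped like those in the question.
df = pd.DataFrame({'raw': ['[<1><2>][<3><4>][][]',
                           '[<5><6>][<7><8>][][]']})

# Same replace chain as above, vectorised over the column:
# strip the square brackets, then turn the angle brackets into them.
cleaned = (df['raw']
           .str.replace('[', '', regex=False)
           .str.replace(']', '', regex=False)
           .str.replace('<', '[', regex=False)
           .str.replace('>', ']', regex=False))
print(cleaned.tolist())  # ['[1][2][3][4]', '[5][6][7][8]']
```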
I ended up using Google Sheets. Once you upload the CSV, there is a menu titled "Data" and under it an option "Split text to columns."
If you want a faster way to do this with code, you can also do the following with pandas:
# new data frame with split value columns
new = data["Name"].str.split(" ", n=1, expand=True)
# making separate first name column from new data frame
data["First Name"] = new[0]
# making separate last name column from new data frame
data["Last Name"] = new[1]
# dropping the old Name column
data.drop(columns=["Name"], inplace=True)
# display df
data
I've read both the pandas.read_csv and pyspark.sql.DataFrameReader.csv documentation, and it seems the PySpark side doesn't have a doublequote parameter, so a quote character inside a field is escaped using the escape character, while pandas doubles the quote character to show that a quote character is inside the field.
This can be solved by setting doublequote=False and escapechar='\\' in pandas.to_csv, and setting multiLine=True in pyspark.sql.DataFrameReader.csv.
But after I set those parameters in pandas.to_csv and then tried pandas.read_csv with the same parameters, I got an error saying that this line has 4 fields when 3 were expected:
1242,"I see him, I know him \",an_username
1243,"I think I'm good now",another_username
I think the error occurs because the second field of the first line ends with \, and pandas reads it as escaping the following " character, so it thinks the second field doesn't end there. Is there any way to solve this besides removing the \ character?
This is an example script that reproduces the error:
import pandas as pd
from io import StringIO

f = StringIO()
pd.DataFrame({'class': ['y', 'y', 'n'],
              'text': ['I am fine', 'I saw him, I knew him \\', 'I think, I am good now'],
              'value': ['username', 'an_username', 'another_username']})\
    .to_csv(f, doublequote=False, escapechar='\\', index=False)
f.seek(0)
print(f.read())
f.seek(0)
pd.read_csv(f, doublequote=False, escapechar='\\')
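For what it's worth, the read side parses cleanly if the literal backslash is itself escaped (doubled) in the file, so that the parser sees \\ as a literal \ rather than \" as an escaped quote. A hand-built sketch of the failing line with the backslash doubled (one possible workaround, on setups where the writer does not escape the escapechar, is therefore to pre-double backslashes in the data before calling to_csv):

```python
import pandas as pd
from io import StringIO

# Hand-written CSV: the backslash before the closing quote is
# itself escaped, so escapechar unescapes it to a literal '\'
# instead of swallowing the quote.
csv_text = 'id,text,user\n1242,"I see him, I know him \\\\",an_username\n'

df = pd.read_csv(StringIO(csv_text), doublequote=False, escapechar='\\')
print(df['text'].iloc[0])  # I see him, I know him \
```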
I tried the same but didn't get this problem. Please check the code I tried below:
import pandas as pd
data = pd.read_csv('c.csv')
print(data)
df = pd.DataFrame(data)
print(df)
df.to_csv('d.csv', doublequote=False)
data_1 = pd.read_csv('d.csv')
print(data_1)
The output of the above code is:
Empty DataFrame
Columns: [1242, I see him, I know him, True]
Index: []
Empty DataFrame
Columns: [1242, I see him, I know him, True]
Index: []
Empty DataFrame
Columns: [Unnamed: 0, 1242, I see him, I know him, True]
Index: []
Hope this helps.
The first row in my pandas data table has turned into a column. I've tried various renaming and restructuring methods and it hasn't been working. It's something really trivial, but unfortunately I need some help.
The line "0" is supposed to come down as the first data row "Bachelor". Could someone please point me to the proper way of getting this done?
I think the problem is that your CSV has no header, so it's possible to create default range column names:
df_degree = pd.read_csv(file, header=None)
Or it's possible to define custom column names:
df_degree = pd.read_csv(file, names=['col1','col2'])
I can get column names from self.rows.colnames, but it contains them in tablename.attributename format, e.g. tblname.fieldname.
Web2Py's export_to_csv uses it by default.
also db.tblname.fields will give you a list of column names.
I get internal errors whenever I try to override the colnames using the colnames parameter of export_to_csv.
I want to take self.rows.colnames, trim and capitalize the names, and set them as the header for the CSV.
Thanks in advance!!!
You can't use the colnames parameter because, as the docs state: "colnames: list of column names to use (default self.colnames)". Its only purpose is to let you export a subset of colnames, not something that isn't already in self.colnames. If you intend to simply have a different header in your CSV, you could:
Reopen the generated csv file and write your own header over the first line, or
Write your own csv serializer for your particular case of Rows object.
If you go with the second option, which I recommend, you should read the csv.DictWriter documentation.
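A rough sketch of the first option using only the standard library; the file content and field names are made up, and in practice you would read and rewrite the generated file on disk:

```python
import csv
from io import StringIO

# Made-up export with Web2Py-style "table.field" column names.
exported = "tbl.first_name,tbl.last_name\nada,lovelace\n"

rows = list(csv.reader(StringIO(exported)))

# Trim the table prefix and capitalise each name, as asked.
rows[0] = [name.split('.')[-1].capitalize() for name in rows[0]]

out = StringIO()
csv.writer(out).writerows(rows)
print(out.getvalue())
```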