Error tokenizing data while uploading CSV file into Pandas Dataframe - python

I have an 8GB CSV file that contains information about companies created in France.
When I try to upload it in Python using pandas.read_csv, I get various types of error; I believe it’s a combination of 3 factors that cause the problem:
The size of the file (8GB)
The French characters in the cells (like “é”)
The fact that this CSV file is organized like an Excel file; the fields are separated by column, just like an XLS file
When I tried to import the file using:
import pandas as pd
df = pd.read_csv(r'C:\..\data.csv')
I got the following error: OSError: Initializing from file failed
Then, to eliminate the problem about the size, I copy the file (data.csv) and paste it, only keeping the first 25 rows (data2.csv). This is a much lighter file, to eliminate the size problem:
df = pd.read_csv(r'C:\..\data2.csv')
I get the same OSError: Initializing from file failed error.
After some research, I try the following code with Data2.csv
df = pd.read_csv(r'C:\..\data2.csv', sep="\t", encoding="latin")
This time, the import successfully works, but in a weird format, like this: https://imgur.com/a/y6WJHC5. All fields are in the same column.
So this even with the size problem eliminated, it doesn't properly read the csv file. And still, I need to work with the main file, Data.csv. So I try the same code on the initial file (data.csv):
df = pd.read_csv(r'C:\..\data.csv', sep="\t", encoding="latin")
I get: ParserError: Error tokenizing data. C error: out of memory
What is the proper code to read this data.csv properly?
Thank you,

From your image it looks like the file is separated by semi-colons (;). Try using ";" as the sep in the read_csv function.
Pandas reads the csv into ram - an 8GB file could easily exhaust this - try reading the file in chunks. See this answer.

Related

Jupyter Notebook - Pandas

I am new to using Jupyter, NumPy, and pandas. I was looking for a solution online but I could not find anything to solve the error.
I am trying to load a file.csv but I got an error each time I find a solution. I also tried to upload the file to Jupyter notebook to use just the file directly but my system respond that the file is not there. I convert the file from .txt to .csv assuming that that was the problem but still can't load directly. Thus, I decided to use the long format but still have problems.
data = pd.read_csv(r'C:/Users/kharm/Dropbox/Jupyter/Assignment/AutoInsurSweden.csv', header=None)
data.head()
I got the error:
ParserError: Error tokenizing data. C error: Expected 1 field in line 12, saw 2
If I modify to:
data = pd.read_csv(r'C:/Users/kharm/Dropbox/Jupyter/Assignment/AutoInsurSweden.csv', header=None, error_bad_lines=False )
data.head()
or
data = pd.read_csv(r'C:/Users/kharm/Dropbox/Jupyter/Assignment/AutoInsurSweden.csv', header=None, sep='\n')
data.head()
that error suggests that the problem is with the datafile itself not your code, it seems on line 12 of the csv you have an extra data field

How to export data to csv/excel from clipboard with python

I have being trying to export a table to a csv file. The table is copied to the clipboard and it is ready to be put into a csv (at least manually).
I have seen that you can read with pandas anything that you have in the clipboard and assign it to a dataframe, so I tried this code.
df = pd.read_clipboard()
df
df.to_csv('data.csv')
However, I got this error:
pandas.errors.ParserError: Expected 10 fields in line 5, saw 16. Error could possibly be due to
quotes being ignored when a multi-char delimiter is used.
I have being looking for a solution or an alternative but failed.
Thanks in advance!

How can i show my csv data file in jupyter notebook using pyspark

I am working on a big data csv dataset. I need to read it on jupyter-notebook using pyspark. My data is about 4+ million records (540000 rows and 7 columns.) What can i do so i can show all my dataset printed?
I tried to use pandas dataframe, but it does show error as in the attached screenshot, then i tried to change the encoding type it gives SyntaxError: unexpected EOF while parsing. Can you please help me?
For the last screenshot I think you are missing the way files are reading in python by using the handler with. If your data is in a json file your can read it as follows:
with open('data_file.json', encoding='utf-8') as data_file:
data = json.loads(data_file.read())
Note that it is 'data_file.json' and not data_file.json. The same logis holds for the csv example
If it is in a csv file, tha's pretty straigtforward:
file = pd.read_csv('data_file.csv')
Try removing the encoding parameter in your csv reading step
I would not recommend to use a notebook for reading such a huge file even if you are using pyspark for that. Consider using a portion of that file for visualized in a notebook and then switch to another platform.
Hope it helps

How to import PGN file for Machine Learning

I am attempting to create a Dataframe in python in order to perform some machine learning tasks on a chess AI. I am having trouble printing out the Dataframe.
I am using pandas to read a csv file. This file was originally a pgn file that I simply saved as a csv file. I am using pandas.head() in attempts to read said file.
import pandas as pd
Fischer_games = pd.read_csv("/home/rhulain/Desktop/Python Projects/Fischer_ai/Fischer_dataset.csv", sep=".")
print(Fischer_games.head())
I expected to see the first 5 items of the csv file as separated at each period. This would be the first 5 moves within the first chess game within the file.
Instead I get this error:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 3
My intuition says that the formatting of the csv file is somehow in such a way that the pandas parser isn't handling it well. In that case, I am unsure how to format the information within the csv file to have pandas properly read it.
I found the solution. The issue was in blank data columns being read as data.
Following code fixed it:
bob_games = pd.read_csv("/home/rhulain/Desktop/Python Projects/bob_ai/Fischer_dataset.csv", sep='delimiter', header=None)

CParserError: Error tokenizing data

I'm having some trouble reading a csv file
import pandas as pd
df = pd.read_csv('Data_Matches_tekha.csv', skiprows=2)
I get
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 1 fields in line 526, saw 5
and when I add sep=None to df I get another error
Error: line contains NULL byte
I tried adding unicode='utf-8', I even tried CSV reader and nothing works with this file
the csv file is totally fine, I checked it and i see nothing wrong with it
Here are the errors I get:
In your actual code, the line is:
>>> pandas.read_csv("Data_Matches_tekha.xlsx", sep=None)
You are trying to read an Excel file, and not a plain text CSV which is why things are not working.
Excel files (xlsx) are in a special binary format which cannot be read as simple text files (like CSV files).
You need to either convert the Excel file to a CSV file (note - if you have multiple sheets, each sheet should be converted to its own csv file), and then read those.
You can use read_excel or you can use a library like xlrd which is designed to read the binary format of Excel files; see Reading/parsing Excel (xls) files with Python for for more information on that.
Use read_excel instead read_csv if Excel file:
import pandas as pd
df = pd.read_excel("Data_Matches_tekha.xlsx")
I have encountered the same error when I used to_csv to write some data and then read it in another script. I found an easy solution without passing by pandas' read function, it's a package named Pickle.
You can download it by typing in your terminal
pip install pickle
Then you can use for writing your data (first) the code below
import pickle
with open(path, 'wb') as output:
pickle.dump(variable_to_save, output)
And finally import your data in another script using
import pickle
with open(path, 'rb') as input:
data = pickle.load(input)
Note that if you want to use, when reading your saved data, a different python version than the one in which you saved your data, you can precise that in the writing step by using protocol=x with x corresponding to the version (2 or 3) aiming to use for reading.
I hope this can be of any use.

Categories

Resources