CParserError: Error tokenizing data


I'm having some trouble reading a csv file
import pandas as pd
df = pd.read_csv('Data_Matches_tekha.csv', skiprows=2)
I get
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 1 fields in line 526, saw 5
and when I add sep=None to the read_csv call I get another error
Error: line contains NULL byte
I tried adding unicode='utf-8', and I even tried the csv reader, but nothing works with this file.
The csv file looks totally fine; I checked it and I see nothing wrong with it.

In your actual code, the line is:
>>> pandas.read_csv("Data_Matches_tekha.xlsx", sep=None)
You are trying to read an Excel file, not a plain-text CSV, which is why things are not working.
Excel files (xlsx) are zipped archives of XML rather than plain text, so they cannot be read as simple text files (like CSV files). That binary content is the likely source of the NULL byte error.
You need to either convert the Excel file to a CSV file (note: if you have multiple sheets, each sheet should be converted to its own CSV file) and then read that, or read the Excel file directly.
To read it directly, you can use read_excel or a library like xlrd, which is designed to read Excel files (from version 2.0 on, xlrd reads only the older .xls format); see Reading/parsing Excel (xls) files with Python for more information on that.
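
If you are not sure what a file really is, its first few bytes usually tell you more than its extension. Below is a minimal sketch, not a complete file-type detector; the helper name sniff_format and the returned labels are made up for illustration. It relies on the fact that .xlsx files are ZIP archives and legacy .xls files are OLE2 compound files, each with a well-known signature.

```python
# Leading bytes ("magic numbers") of the two Excel container formats.
ZIP_MAGIC = b"PK\x03\x04"                          # .xlsx is a ZIP archive
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"   # legacy .xls (OLE2 compound file)


def sniff_format(path):
    """Guess a file's container format from its first 8 bytes (hypothetical helper)."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    if head.startswith(ZIP_MAGIC):
        return "zip (possibly xlsx)"
    if head.startswith(OLE2_MAGIC):
        return "ole2 (possibly xls)"
    # Anything else may be plain text such as CSV, but this is only a guess.
    return "unknown (possibly text)"
```

A file that reports "zip (possibly xlsx)" should go to read_excel, not read_csv, whatever its name says.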

Use read_excel instead of read_csv if the file is an Excel file:
import pandas as pd
df = pd.read_excel("Data_Matches_tekha.xlsx")

I encountered the same error when I used to_csv to write some data and then read it in another script. I found an easy solution that bypasses pandas' read function: the pickle module.
pickle is part of the Python standard library, so there is nothing to install.
First, write your data with the code below
import pickle
with open(path, 'wb') as output:
    pickle.dump(variable_to_save, output)
And finally load your data in another script using
import pickle
with open(path, 'rb') as source:
    data = pickle.load(source)
Note that if you want to read the saved data with an older Python version than the one that wrote it, you can specify that in the writing step with protocol=x, where x is a pickle protocol number the reading version supports (for example, protocol=2 is the highest protocol Python 2 can read).
I hope this can be of any use.
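
For reference, here is a small self-contained round trip along the lines described above. The data and the file name are invented for the example, and the file is written to the system temp directory only to keep the sketch runnable anywhere.

```python
import pickle
import tempfile
from pathlib import Path

data = {"rows": [1, 2, 3], "label": "example"}  # made-up example data

# Hypothetical file name; any writable path works.
path = Path(tempfile.gettempdir()) / "example_data.pkl"

# protocol=2 is the newest pickle protocol that Python 2 can still read.
with open(path, "wb") as output:
    pickle.dump(data, output, protocol=2)

with open(path, "rb") as source:
    restored = pickle.load(source)

print(restored == data)  # True
```

One caveat: only unpickle files you trust, since loading a pickle can execute arbitrary code.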

Related

Python - trying to import/open incorrectly formatted .xls file

I'm trying to write some Python code which needs to take data from an .xls file created by another application (outside of my control). I've tried using pandas and xlrd and neither is able to open the file; I get the error messages:
"Excel file format cannot be determined, you must specify an engine manually." using Pandas.
"Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\t'" using xlrd
I think it has to do with the way the file is exported from the program that creates it. When opened directly through Excel, I get the error message "The file format and extension don't match". However, you can ignore this message and the file opens in a usable format and can be edited and all of the expected values are in the right cells etc. Interestingly, when I go to save the file in Excel, the default option that comes up is a webpage.
Currently I have a workaround in that I can just open the file in Excel, save it as a .csv, then read it into Python as a csv. This does have to be done through Excel though; if I just change the file extension to .csv, the resulting file is garbage.
However, ideally I would like to avoid the user having to do anything manually. It would be greatly appreciated if anyone has any suggestions of ways that this might be possible (i.e. can I 'open' the file in Excel and save it through Excel using Python commands?) or if there are any packages or commands I can use to open/fix badly formatted .xls files.
Cheers!
P.S. I'm pretty new to Python and only have experience in R otherwise so my current knowledge is quite limited, apologies in advance!

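Try this:
from pathlib import Path
import pandas as pd
file_path = Path(filename)  # filename: the path to your file
# pass the path itself; note that openpyxl reads .xlsx, not legacy .xls
df = pd.read_excel(file_path, engine='openpyxl')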
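
The symptoms in the question (Excel offering "web page" as the save format, and xlrd finding b'\r\n\t' where it expected a BOF record) suggest, but do not prove, that the file is really an HTML table with an .xls extension. If that assumption holds, the table can be pulled out with the standard library alone. The sketch below is a guess at the file's format, not a confirmed fix; the class and function names are made up, and it ignores nested tables and merged cells.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_rows(text):
    """Return all table rows found in an HTML string as lists of cell text."""
    parser = TableRows()
    parser.feed(text)
    parser.close()
    return parser.rows


sample = "<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>"
print(html_table_rows(sample))  # [['a', 'b'], ['1', '2']]
```

pandas also has read_html, which returns a list of DataFrames, but it needs an extra parser package installed, so the version above sticks to the standard library.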

How to filter out useable data from csv files using python?

Please help me extract important data from a .csv file using Python. I got the .csv file from 'citrine'.
I want to extract the element name and atomic percentage in the form of "Al2.5B0.02C0.025Co14.7Cr16.0Mo3.0Ni57.48Ti5.0W1.25Zr0.03"
ORIGINAL
[{""element"":""Al"",""idealAtomicPercent"":{""value"":""5.4""}},{""element"":""B"",""idealAtomicPercent"":{""value"":""0.02""}},{""element"":""C"",""idealAtomicPercent"":{""value"":""0.13""}},{""element"":""Co"",""idealAtomicPercent"":{""value"":""7.5""}},{""element"":""Cr"",""idealAtomicPercent"":{""value"":""6.1""}},{""element"":""Mo"",""idealAtomicPercent"":{""value"":""2.0""}},{""element"":""Nb"",""idealAtomicPercent"":{""value"":""0.5""}},{""element"":""Ni"",""idealAtomicPercent"":{""value"":""61.0""}},{""element"":""Re"",""idealAtomicPercent"":{""value"":""0.5""}},{""element"":""Ta"",""idealAtomicPercent"":{""value"":""9.0""}},{""element"":""Ti"",""idealAtomicPercent"":{""value"":""1.0""}},{""element"":""W"",""idealAtomicPercent"":{""value"":""5.8""}},{""element"":""Zr"",""idealAtomicPercent"":{""value"":""0.13""}}]

Without knowing the file structure it is hard to tell.
Try loading the file with:
import csv
with open(file_path, newline='') as file:
    reader = csv.DictReader(file)
    rows = list(reader)
You may need to pass extra arguments to DictReader (such as delimiter), depending on the file.
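
Going by the sample cell in the question, each cell holds a JSON list whose quotes are doubled only because the cell sits inside a quoted CSV field. Once the csv module has unescaped it, json.loads can read it directly. This is a rough sketch: the column name composition is an assumption (the real header is not shown), and the sample is shortened to three elements.

```python
import csv
import io
import json

# One CSV row, shortened from the sample in the question. The column name
# "composition" is hypothetical; use the real header from your file.
raw_csv = (
    'composition\n'
    '"[{""element"":""Al"",""idealAtomicPercent"":{""value"":""5.4""}},'
    '{""element"":""B"",""idealAtomicPercent"":{""value"":""0.02""}},'
    '{""element"":""C"",""idealAtomicPercent"":{""value"":""0.13""}}]"\n'
)


def to_formula(cell):
    """Turn the JSON list in one cell into a string such as 'Al5.4B0.02'."""
    entries = json.loads(cell)
    return "".join(e["element"] + e["idealAtomicPercent"]["value"] for e in entries)


reader = csv.DictReader(io.StringIO(raw_csv))
formulas = [to_formula(row["composition"]) for row in reader]
print(formulas)  # ['Al5.4B0.02C0.13']
```

With a real file, replace io.StringIO(raw_csv) with the file object from open(file_path, newline='').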

How can I show my csv data file in jupyter notebook using pyspark

I am working on a big data csv dataset. I need to read it in jupyter-notebook using pyspark. My data is about 4+ million records (540000 rows and 7 columns.) What can I do to print my whole dataset?
I tried to use a pandas dataframe, but it shows an error as in the attached screenshot. Then I tried to change the encoding type, and it gives SyntaxError: unexpected EOF while parsing. Can you please help me?

From the last screenshot, I think you are missing how files are opened in Python with the with statement. If your data is in a JSON file, you can read it as follows:
import json
with open('data_file.json', encoding='utf-8') as data_file:
    data = json.load(data_file)
Note that it is 'data_file.json' (a quoted string) and not data_file.json. The same logic holds for the csv example.
If it is in a csv file, that's pretty straightforward:
import pandas as pd
file = pd.read_csv('data_file.csv')
Try removing the encoding parameter in your csv reading step.
I would not recommend using a notebook to read such a huge file, even if you are using pyspark. Consider visualizing a portion of the file in a notebook and then switching to another platform.
Hope it helps
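
To look at just a portion of a large file without loading all of it, something like the following may be enough. It is a sketch using only the standard library; preview_rows is a made-up helper name, and the in-memory "file" stands in for a real one.

```python
import csv
import io
from itertools import islice


def preview_rows(lines, n=5):
    """Return the header and the first n data rows without reading the rest."""
    reader = csv.reader(lines)
    header = next(reader)
    return header, list(islice(reader, n))


# Stand-in for a large file; with a real file, pass the object returned by
# open('data_file.csv', newline='', encoding='utf-8') instead.
fake_file = io.StringIO("id,value\n" + "".join(f"{i},{i * i}\n" for i in range(1000)))
header, rows = preview_rows(fake_file, n=3)
print(header)  # ['id', 'value']
print(rows)    # [['0', '0'], ['1', '1'], ['2', '4']]
```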

How to import PGN file for Machine Learning

I am attempting to create a Dataframe in python in order to perform some machine learning tasks on a chess AI. I am having trouble printing out the Dataframe.
I am using pandas to read a csv file. This file was originally a pgn file that I simply saved as a csv file. I am using DataFrame.head() in an attempt to view said file.
import pandas as pd
Fischer_games = pd.read_csv("/home/rhulain/Desktop/Python Projects/Fischer_ai/Fischer_dataset.csv", sep=".")
print(Fischer_games.head())
I expected to see the first 5 items of the csv file as separated at each period. This would be the first 5 moves within the first chess game within the file.
Instead I get this error:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 3
My intuition says that the formatting of the csv file is somehow in such a way that the pandas parser isn't handling it well. In that case, I am unsure how to format the information within the csv file to have pandas properly read it.

I found the solution. The issue was blank data columns being read as data.
The following code fixed it (the string 'delimiter' never occurs in the file, so each line is read as a single column):
bob_games = pd.read_csv("/home/rhulain/Desktop/Python Projects/bob_ai/Fischer_dataset.csv", sep='delimiter', header=None)
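
Since a PGN file is not really a CSV, another option is to split the move text yourself instead of relying on a separator. Here is a minimal sketch, assuming plain movetext of the form "1. e4 c5 2. Nf3 ..." with no comments, variations, or annotations; the function name is made up, and a dedicated PGN library such as python-chess would be more robust for real game files.

```python
import re

# Matches move numbers such as "1." or "12..." so they can be dropped.
MOVE_NUMBER = re.compile(r"\d+\.+")
# Game-termination markers that may end the movetext.
RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}


def movetext_to_moves(movetext):
    """Split simple PGN movetext into a flat list of moves."""
    tokens = MOVE_NUMBER.sub(" ", movetext).split()
    return [token for token in tokens if token not in RESULTS]


print(movetext_to_moves("1. e4 c5 2. Nf3 d6 3. d4 cxd4 1-0"))
# ['e4', 'c5', 'Nf3', 'd6', 'd4', 'cxd4']
```

The resulting list can then be turned into a DataFrame with one move per row, which is closer to what head() was expected to show.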

Error tokenizing data while uploading CSV file into Pandas Dataframe

I have an 8GB CSV file that contains information about companies created in France.
When I try to upload it in Python using pandas.read_csv, I get various types of error; I believe it’s a combination of 3 factors that cause the problem:
The size of the file (8GB)
The French characters in the cells (like “é”)
The fact that this CSV file is organized like an Excel file; the fields are separated by column, just like an XLS file
When I tried to import the file using:
import pandas as pd
df = pd.read_csv(r'C:\..\data.csv')
I got the following error: OSError: Initializing from file failed
Then, to eliminate the problem about the size, I copy the file (data.csv) and paste it, only keeping the first 25 rows (data2.csv). This is a much lighter file, to eliminate the size problem:
df = pd.read_csv(r'C:\..\data2.csv')
I get the same OSError: Initializing from file failed error.
After some research, I try the following code with data2.csv
df = pd.read_csv(r'C:\..\data2.csv', sep="\t", encoding="latin")
This time, the import successfully works, but in a weird format, like this: https://imgur.com/a/y6WJHC5. All fields are in the same column.
So even with the size problem eliminated, it doesn't read the csv file properly. And I still need to work with the main file, data.csv. So I try the same code on the initial file (data.csv):
df = pd.read_csv(r'C:\..\data.csv', sep="\t", encoding="latin")
I get: ParserError: Error tokenizing data. C error: out of memory
What is the proper code to read this data.csv properly?
Thank you,

From your image it looks like the file is separated by semicolons (;). Try using ";" as the sep in the read_csv function.
Pandas reads the whole csv into RAM, and an 8GB file could easily exhaust it. Try reading the file in chunks with the chunksize parameter of read_csv.
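
Both suggestions can be combined roughly as follows. This is only a sketch: the column names and sample rows are invented, and a tiny in-memory string stands in for the 8GB file so the example runs as-is.

```python
import io

import pandas as pd

# Tiny stand-in for the real file: semicolon-separated with accented text.
# The columns (siren, nom, ville) are made up for illustration.
sample = "siren;nom;ville\n1;Café Rémy;Paris\n2;Boulangerie Léa;Lyon\n3;Atelier Zoé;Nice\n"

total_rows = 0
cities = []
# chunksize makes read_csv return an iterator of DataFrames instead of one
# big DataFrame, so only one chunk is held in memory at a time.
for chunk in pd.read_csv(io.StringIO(sample), sep=";", chunksize=2):
    total_rows += len(chunk)
    cities.extend(chunk["ville"])

print(total_rows)  # 3
print(cities)      # ['Paris', 'Lyon', 'Nice']
```

For the real file, the call would look more like pd.read_csv(r'C:\..\data.csv', sep=";", encoding="latin-1", chunksize=100000), with the chunk size tuned to the available memory. Keep only what you need from each chunk (a filtered subset or an aggregate); collecting every chunk into one list brings the memory problem back.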
