How to import PGN file for Machine Learning - python

I am trying to create a DataFrame in Python so that I can run some machine learning tasks for a chess AI, but I am having trouble printing out the DataFrame.
I am using pandas to read a CSV file. This file was originally a PGN file that I simply saved with a .csv extension. I then call head() on the result to inspect it.
import pandas as pd
Fischer_games = pd.read_csv("/home/rhulain/Desktop/Python Projects/Fischer_ai/Fischer_dataset.csv", sep=".")
print(Fischer_games.head())
I expected to see the first 5 items of the csv file as separated at each period. This would be the first 5 moves within the first chess game within the file.
Instead I get this error:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 3
My intuition is that the file is formatted in a way the pandas parser cannot handle. If so, I am unsure how to reformat the file so that pandas reads it properly.

I found the solution. The issue was that blank data columns were being read as data.
The following code fixed it:
bob_games = pd.read_csv("/home/rhulain/Desktop/Python Projects/bob_ai/Fischer_dataset.csv", sep='delimiter', header=None)
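As far as I can tell, sep='delimiter' avoids the error because that string never occurs in the file, so pandas places each whole line into a single column instead of splitting it. To get individual moves, the text still has to be parsed. Below is a minimal sketch using only the standard library; the sample movetext and the split_moves helper are made up for illustration, and the helper ignores PGN comments and variations. A dedicated PGN parser such as python-chess would be more robust.

```python
import re

# Hypothetical sample: one game's movetext as it might appear in a PGN file.
movetext = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 1-0"

# Tokens that mark the game result rather than a move.
RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}


def split_moves(text):
    """Return the moves in a movetext string, dropping move numbers and the result."""
    tokens = text.split()
    # Move numbers look like "1." or "12..."; any other non-result token is a move.
    return [t for t in tokens if not re.fullmatch(r"\d+\.+", t) and t not in RESULTS]


moves = split_moves(movetext)
print(moves[:5])
```

The resulting list can then be passed to pd.DataFrame or pd.Series if a DataFrame is still the goal.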

Related

Pandas txt to csv output only displays the first two lines of values, how do I get the full data to show?

My issue is as follows.
I've gathered some contact data from SurveyMonkey using the SM API, and I've converted that data into a txt file. When opening the txt file, I see the full data from the survey that I'm trying to convert into csv, however when I use the following code:
df = pd.read_csv("my_file.txt",sep =",", encoding = "iso-8859-10")
df.to_csv('my_file.csv')
It creates a csv file with only two lines of values (and cuts off in the middle of the second line). Similarly, if I try to organize the data within a pandas dataframe, it only registers the first two lines, meaning most of my txt file is not being read.
As I've never run into this problem before and I've been able to convert into CSV without issues, I'm wondering if anyone here has ideas as to what might be causing this issue to occur and how I could go about solving it?
All help is much appreciated.
Edit:
I was able to get the data to display properly in csv when I converted it directly into csv from json instead of converting it to a txt file first. I was not, however, able to figure out what went wrong in the conversion from txt to csv, as I tried multiple different encodings but came to the same result.
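A direct JSON-to-CSV conversion of the kind described in the edit might look like the sketch below. The payload shape and field names are invented for illustration; a real SurveyMonkey response will be structured differently. One possible reason this route is more reliable is that the csv module quotes fields containing commas, which a hand-made txt dump may not do.

```python
import csv
import io
import json

# Hypothetical payload; the second name contains a comma on purpose.
raw = (
    '[{"name": "Ada", "email": "ada@example.com"},'
    ' {"name": "Bob, Jr.", "email": "bob@example.com"}]'
)
records = json.loads(raw)

# StringIO stands in for open("my_file.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "email"])
writer.writeheader()
writer.writerows(records)  # fields with commas are quoted automatically

csv_text = buffer.getvalue()
print(csv_text)
```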

How can I show my CSV data file in Jupyter Notebook using PySpark

I am working with a large CSV dataset that I need to read in a Jupyter notebook using PySpark. My data is about 4+ million records (540000 rows and 7 columns). What can I do to print the whole dataset?
I tried to use a pandas DataFrame, but it shows an error, as in the attached screenshot. When I then tried to change the encoding type, it gave SyntaxError: unexpected EOF while parsing. Can you please help me?
For the last screenshot, I think you are missing the way files are read in Python, which is with a with block acting as the file handler. If your data is in a JSON file, you can read it as follows:
import json
with open('data_file.json', encoding='utf-8') as data_file:
    data = json.load(data_file)
Note that it is 'data_file.json' (a quoted string) and not data_file.json. The same logic holds for the CSV example.
If it is in a CSV file, that's pretty straightforward:
file = pd.read_csv('data_file.csv')
Try removing the encoding parameter in your CSV reading step.
I would not recommend using a notebook to read such a huge file, even if you are using PySpark for it. Consider using a portion of the file for visualization in a notebook and then switching to another platform.
Hope it helps
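Taking only a portion of a large file, as suggested above, could be sketched like this with the standard library. The in-memory sample and its column names are placeholders for the real file; with pandas, the nrows parameter of read_csv serves a similar purpose.

```python
import csv
import io
from itertools import islice

# StringIO stands in for open("data_file.csv", newline="").
sample = io.StringIO("id,value\n1,a\n2,b\n3,c\n4,d\n")

reader = csv.reader(sample)
header = next(reader)                 # first line holds the column names
first_rows = list(islice(reader, 2))  # read only the next 2 rows, not the whole file

print(header, first_rows)
```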

Error tokenizing data while uploading CSV file into Pandas Dataframe

I have an 8GB CSV file that contains information about companies created in France.
When I try to upload it in Python using pandas.read_csv, I get various types of error; I believe it’s a combination of 3 factors that cause the problem:
The size of the file (8GB)
The French characters in the cells (like “é”)
The fact that this CSV file is organized like an Excel file; the fields are separated by column, just like an XLS file
When I tried to import the file using:
import pandas as pd
df = pd.read_csv(r'C:\..\data.csv')
I got the following error: OSError: Initializing from file failed
Then, to rule out the size problem, I copied the file (data.csv) and kept only the first 25 rows (data2.csv), which makes for a much lighter file:
df = pd.read_csv(r'C:\..\data2.csv')
I get the same OSError: Initializing from file failed error.
After some research, I tried the following code with data2.csv:
df = pd.read_csv(r'C:\..\data2.csv', sep="\t", encoding="latin")
This time, the import works, but in a weird format, like this: https://imgur.com/a/y6WJHC5. All fields are in the same column.
So even with the size problem eliminated, it doesn't read the CSV file properly. And I still need to work with the main file, data.csv. So I tried the same code on the initial file (data.csv):
df = pd.read_csv(r'C:\..\data.csv', sep="\t", encoding="latin")
I get: ParserError: Error tokenizing data. C error: out of memory
What is the proper code to read this data.csv properly?
Thank you,
From your image, it looks like the file is separated by semicolons (;). Try using ";" as the sep in the read_csv function.
Pandas reads the whole CSV into RAM, which an 8GB file could easily exhaust, so try reading the file in chunks. See this answer.
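Both suggestions can be combined in one call. The sketch below uses a small in-memory sample with invented column names in place of data.csv; for the real file you would pass the path and, most likely, an encoding such as "latin-1", and a much larger chunksize.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated sample with accented characters.
sample = (
    "siren;nom;ville\n"
    "1;Café Léon;Paris\n"
    "2;Boulangerie Émile;Lyon\n"
    "3;Garage Noël;Nice\n"
)

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading everything into memory at once.
chunks = pd.read_csv(io.StringIO(sample), sep=";", chunksize=2)

chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)
```

Inside the loop over chunks, each DataFrame can be filtered or aggregated and then discarded, so only one chunk is held in memory at a time.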

Split csv files based on columns

I have a csv file that I am trying to split based on the number of columns. The original file has about 24000 columns and I want to split it into files that each have a fixed number of columns (say 1000). I want to run feature selection in Weka on the individual files. I have the following code in Python.
import pandas as pd
import numpy as np
i=0
df=pd.read_csv("glio.csv")
#row_split=int(input("Enter the Row Split: "))
row_split=6000
name ="temp_file_"
ext=".csv"
rows, columns = df.shape
df_temp=df.iloc[:,:row_split]
df_temp.to_csv(name+str(i)+ext)
i=i+1
while(row_split<columns):
    df_temp=df.iloc[:,row_split+1:row_split+100]
    df_temp.to_csv(name+str(i)+ext)
    i=i+1
    row_split+=1000
It is generating the individual files as expected but after splitting I am not able to load the individual files in weka. I am getting the following error
I am new to this and have no idea why this occurs, and I cannot find answers online. It would be really helpful if someone could explain why this is happening and how to correct it.
First of all add index=False to the to_csv call:
df_temp.to_csv(name+str(i)+ext, index=False)
Also please upload a screenshot of the csv file when you open it in some csv viewer application (e.g. Excel).
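Separately from the index issue, the slice in the question (row_split+1:row_split+100, stepping by 1000) appears to skip a column at each boundary and to keep only 99 columns per file. A minimal sketch of an even split, assuming consecutive non-overlapping column blocks are what is wanted, is shown below. The tiny DataFrame and the cols_per_file value are stand-ins for the real 24000-column file and a block size of 1000.

```python
import pandas as pd

# Small stand-in for the wide file: 2 rows, 10 columns named c0..c9.
df = pd.DataFrame(
    [[r * 10 + c for c in range(10)] for r in range(2)],
    columns=[f"c{c}" for c in range(10)],
)

cols_per_file = 4  # the question mentions 1000
pieces = []
for i, start in enumerate(range(0, df.shape[1], cols_per_file)):
    # iloc excludes the end position, so consecutive blocks neither overlap nor skip.
    piece = df.iloc[:, start:start + cols_per_file]
    pieces.append(piece)
    # piece.to_csv(f"temp_file_{i}.csv", index=False)  # uncomment to write the files

print([p.shape[1] for p in pieces])
```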

CParserError: Error tokenizing data

I'm having some trouble reading a csv file
import pandas as pd
df = pd.read_csv('Data_Matches_tekha.csv', skiprows=2)
I get
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 1 fields in line 526, saw 5
and when I add sep=None to the read_csv call I get another error
Error: line contains NULL byte
I tried adding unicode='utf-8', and I even tried the csv reader, but nothing works with this file.
The CSV file is totally fine; I checked it and I see nothing wrong with it.
Here are the errors I get:
In your actual code, the line is:
>>> pandas.read_csv("Data_Matches_tekha.xlsx", sep=None)
You are trying to read an Excel file, and not a plain text CSV which is why things are not working.
Excel files (xlsx) are not plain text: an .xlsx file is a ZIP archive of XML documents, so it cannot be read as a simple text file the way a CSV file can.
You need to either convert the Excel file to CSV (note: if you have multiple sheets, each sheet should be converted to its own CSV file) and then read those files,
or use read_excel or a library such as xlrd that is designed to read Excel files (recent xlrd releases read only the older .xls format); see Reading/parsing Excel (xls) files with Python for more information on that.
Use read_excel instead of read_csv if it is an Excel file:
import pandas as pd
df = pd.read_excel("Data_Matches_tekha.xlsx")
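When it is unclear whether a file is really CSV or a renamed Excel workbook, one rough check is to test whether it is a ZIP archive, since .xlsx files are ZIP containers. The sketch below uses in-memory stand-ins rather than real files, and looks_like_xlsx is a hypothetical helper name. It is a heuristic only: other ZIP-based formats would also pass.

```python
import io
import zipfile


def looks_like_xlsx(fileobj):
    """Return True if the file is a ZIP archive, as .xlsx workbooks are."""
    return zipfile.is_zipfile(fileobj)


# Build two in-memory stand-ins: a ZIP archive and plain CSV bytes.
zip_bytes = io.BytesIO()
with zipfile.ZipFile(zip_bytes, "w") as zf:
    zf.writestr("xl/workbook.xml", "<workbook/>")
csv_bytes = io.BytesIO(b"a,b\n1,2\n")

is_zip = looks_like_xlsx(zip_bytes)
is_csv_zip = looks_like_xlsx(csv_bytes)
print(is_zip, is_csv_zip)
```

zipfile.is_zipfile also accepts a path, so the same check can be run on a file on disk before choosing between read_excel and read_csv.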
I encountered the same error when I used to_csv to write some data and then read it in another script. I found an easy solution that bypasses pandas' read function: a module named pickle.
It is part of the Python standard library, so there is nothing to install with pip.
You can write your data first with the code below
import pickle
with open(path, 'wb') as output:
    pickle.dump(variable_to_save, output)
And finally load your data in another script using
import pickle
with open(path, 'rb') as infile:
    data = pickle.load(infile)
Note that if you want to read your saved data with a different Python version than the one that wrote it, you can specify the pickle protocol in the writing step with protocol=x. Here x is a pickle protocol number rather than a Python version; protocol 2 is the highest one that Python 2 can read.
I hope this can be of any use.
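A round trip with an explicit protocol can be sketched as follows. The dictionary is made-up sample data, and a BytesIO buffer stands in for the file opened with open(path, 'wb') so the example runs without touching the disk.

```python
import io
import pickle

# Hypothetical data to persist between scripts.
variable_to_save = {"player": "Fischer", "moves": ["e4", "c5", "Nf3"]}

# BytesIO stands in for open(path, "wb") / open(path, "rb").
buffer = io.BytesIO()
pickle.dump(variable_to_save, buffer, protocol=2)  # pickle protocol number, not a Python version

buffer.seek(0)  # rewind before reading, as reopening the file would
data = pickle.load(buffer)
print(data == variable_to_save)
```

For DataFrames specifically, pandas also offers DataFrame.to_pickle and pd.read_pickle, which wrap the same mechanism.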
