WEKA - Can't read CSV generated with Python pandas

I've been working on some dataframes with Python. I load them in using read_csv(filename, index_col=0) and it's all fine. The files also open fine in Excel. I also opened them in Notepad, and they seem alright; below is an example line:
851,1.218108787,0.636454978,0.269719611,-0.849476404,-0.143909689,0.050626813,-0.094248374,-0.3096134,-0.131347142,0.671271112,0.167593329,0.439417259,-0.198164647,-0.031552824,-0.215189948,-0.1791156,0.092648696,-0.107840318,-0.162596466,0.019324121,0.040572892,-0.008307331,-0.077819297,-0.023809355,-0.148229913,-0.041082835,0.138234498,-0.070986117,0.024788437,-0.050982962,0.24689969,0
The first column is, as I understand it, an index column. Then there's a bunch of principal components, and at the end is a 1/0 label.
When I try to load the file into WEKA, however, it gives me a nasty error and urges me to use the converter, saying:
Reason:
32 Problem encountered on line: 2
When I attempt to use the converter with the default settings, it states a new error:
Couldn't read object file_name.csv invalid stream header: 2C636F6D
Could anyone help with any of this? I can't provide the entire data file, but if requested I can try cutting out a few rows and paste only those if the error still occurs. Are there any flags I need to specify when saving a file to CSV in Python? At the moment I just use .to_csv('x.csv').

The index column not having a header is what prevents WEKA from reading the file: pandas.to_csv() writes the index as an unnamed first column, so the header row starts with a bare comma. (The "invalid stream header: 2C636F6D" is hex for ",com", i.e. the file's first bytes are that leading comma followed by the start of the first real column name.) When you write with pandas.to_csv(), set index=False:
df.to_csv('x.csv', index=False)
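A minimal sketch of the fix, using a hypothetical stand-in dataframe and the x.csv file name from the question:
import pandas as pd

# Hypothetical stand-in for the question's dataframe.
df = pd.DataFrame({'pc1': [1.218, -0.849], 'label': [0, 1]})

# Write without the index so the header row has no leading comma.
df.to_csv('x.csv', index=False)

# Reading it back no longer needs index_col, since no index was written.
df2 = pd.read_csv('x.csv')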

Related

Removing Excel sheets with openpyxl raises an error when I open the file

I tested the openpyxl .remove() function and it works on multiple empty files.
Problem: I have a more complex Excel file with multiple sheets that I need to remove. If I remove one or two it works, but when I try to remove three or more, Excel raises an error when I open the file:
Sorry, we have trouble getting info in file bla bla.....
The logs talk about picture trouble, and about error105960_01.xml.
The strange thing is that it complains about pictures, yet I don't get this error if I remove fewer than three sheets, and I'm not trying to remove any sheet with images!
Even stranger, it's always about the number: every sheet can be deleted without trouble on its own, but if I remove three or more, Excel yells at me.
The thing is, it's OK when Excel "repairs" the "error", but sometimes Excel resets the formatting of the sheets (cell sizes, bold, character widths, etc.) and everything fails: a bad visual that I want to avoid.
If someone has an idea, I'm running out of creativity!
For the code, I only use basic functions (simplified here, but it would be long to show more):
import openpyxl

INPUT_EXCEL_PATH = "my_excel.xlsx"
OUTPUT_EXCEL_PATH = "new_excel.xlsx"

wb = openpyxl.load_workbook(INPUT_EXCEL_PATH)
# Remove the three sheets by name, then save under a new name.
for sheet_name in ("sheet1", "sheet2", "sheet3"):
    wb.remove(wb[sheet_name])
wb.save(OUTPUT_EXCEL_PATH)
In my case it was a left-over empty CalculationChainPart. I used DocxToSource to investigate the corrupted file. Excel will attempt to fix the file on load; save the repaired file and compare its structure to the original file. To delete descendant parts you can use the DeletePart() method.
using (SpreadsheetDocument doc = SpreadsheetDocument.Open(document, true)) {
    WorkbookPart workbookPart = doc.WorkbookPart;
    // Remove the stale calculation chain; Excel rebuilds it at load time.
    if (workbookPart.CalculationChainPart != null) {
        workbookPart.DeletePart(workbookPart.CalculationChainPart);
    }
}
The CalculationChainPart can also be removed at any time:
While calculation chain information can be loaded by a spreadsheet application, it is not required. A calculation chain can be constructed in memory at load time. (source)
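If you would rather stay in Python, a rough equivalent is to rewrite the .xlsx archive (which is just a zip file) without the calculation-chain part. This is a minimal sketch, assuming the part lives at the conventional path xl/calcChain.xml; the output file name is arbitrary, and dropping the part can leave a dangling relationship behind, so test on a copy of the file:
import zipfile

def strip_calc_chain(src, dst):
    # Copy every archive member except the calculation chain,
    # which Excel can rebuild on its own at load time.
    with zipfile.ZipFile(src) as zin, \
         zipfile.ZipFile(dst, 'w', zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            if item.filename != 'xl/calcChain.xml':
                zout.writestr(item, zin.read(item.filename))

strip_calc_chain('new_excel.xlsx', 'new_excel_fixed.xlsx')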

Pandas read_csv(): ignore bad lines/rows containing FEWER fields (text file)

I am trying to read this huge text file: https://www.dropbox.com/s/3ikikw8bxde6y1i/TCAD_SPECIAL%20EXPORT_2019_20200409.zip?dl=0 (if you download the zip, the file is Special_ARB.txt; not necessary for my question, imo).
I am running this code, adding error_bad_lines=False to ignore lines with more fields than expected, which works well:
pd.read_csv(r'~/Special_ARB.txt', sep='|',
            header=None, encoding='cp1252', error_bad_lines=False)
The problem is that read_csv() crashes when a line has only one field, with the following error:
Too many columns specified: expected 77 and found 1
Is there a way to tell Python/pandas to ignore this error? It doesn't tell me which line it is, and there are more than a million rows, so I can't just find it on my own.
I tried a for loop to read the file line by line and figure it out from there, but the data is so large that Python crashed.
The number of columns is 77, which is correctly identified by pandas when running the code, so I don't think that's the issue.
Thanks,
See the Python documentation on Errors and Exceptions, and any Python try/except tutorial:
try:
    pd.read_csv(r'~/Special_ARB.txt', sep='|', header=None,
                encoding='cp1252', error_bad_lines=False)
except pd.errors.ParserError:  # the class of the "Too many columns" error
    ...  # <do this>
This should work for in-memory datasets; for large datasets you can use chunking for a solution: https://stackoverflow.com/a/59331754/9379924
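Catching the exception only tells you the read failed, though; it does not skip the offending row. Below is a minimal streaming pre-filter sketch, reusing the file name, separator, and 77-column count from the question (and assuming no field contains a quoted '|'); it drops malformed rows without loading the whole file into memory:
import pandas as pd

EXPECTED_FIELDS = 77  # the column count pandas inferred

with open('Special_ARB.txt', encoding='cp1252') as src, \
     open('Special_ARB_clean.txt', 'w', encoding='cp1252') as dst:
    for line in src:
        # Keep only rows with exactly 77 fields (76 separators);
        # this filters both short and long rows in one pass.
        if line.count('|') == EXPECTED_FIELDS - 1:
            dst.write(line)

df = pd.read_csv('Special_ARB_clean.txt', sep='|',
                 header=None, encoding='cp1252')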

What happened when I used pandas to read CSV files multiple times in Kaggle's notebook?

I am participating in Kaggle's NCAA March Madness Analytics competition. I used pandas to read data from CSV files but encountered a problem:
seeds = pd.read_csv('/kaggle/input/march-madness-analytics-2020/2020DataFiles/2020DataFiles/2020-Womens-Data/WDataFiles_Stage1/WNCAATourneySeeds.csv')
seeds
Here the output is empty. So I tried again like this:
rank = seeds.merge(teams)
Then I got an error:
NameError: name 'seeds' is not defined.
I can't figure out what happened, and when I tried it offline nothing went wrong. Am I missing anything? How can I fix it? Note that this was not the first time I used read_csv() in this notebook, though I couldn't tell whether that's related to my situation.
Put the CSV file in the notebook's current working directory.
Run this to find out where that is:
%pwd
Put the file there and run:
seeds = pd.read_csv('WNCAATourneySeeds.csv')
You can also pass a full path:
seeds = pd.read_csv(r'C:\Users\...\WNCAATourneySeeds.csv')
where C: is the drive your file is saved on; replace "..." with the rest of the path. The r prefix makes it a raw string, so the backslashes are not treated as escape characters (forward slashes also work on Windows).
I finally found the problem: I was writing my code in a Markdown cell, so it never executed. Stupid me!

Spyder, variable explorer, xpt

I'm coming to Python from a SAS background.
I've imported a SAS version 5 transport file (XPT) into Python using:
df = pd.read_sas(r'C:\mypath\myxpt.xpt')
The file is a simple SAS transport file, converted from a SAS dataset created with the following:
DATA myxpt;
DO i = 1 TO 10;
y = "XXX";
OUTPUT;
END;
RUN;
The file imports correctly and I can view the contents using:
print(df)
[Screenshot showing the printed dataframe]
However, when I view the file using the Variable Explorer, all character columns are shown as blank.
[Screenshot showing the dataframe viewed through the Variable Explorer]
I've tried reading this as a SAS dataset instead of a transport file and importing that into Python, but I have the same problem.
I've also tried creating a dataframe within Python containing character columns, and it displays correctly in the Variable Explorer.
Any suggestions as to what's going wrong?
Thanks in advance.
Column Y is a column of byte strings; you have to decode it first. The Variable Explorer cannot guess the correct encoding and apparently does not show byte strings. If you do not know the encoding you will have to guess. Try df['utf8'] = df.Y.str.decode('utf8') and see if the result makes any sense.
As you have noted, it is possible to specify the encoding in the import function:
df = pd.read_sas(r'C:\mypath\myxpt.xpt', encoding='utf8')
As a side note, you should always be aware of, and preferably explicit about, the encodings in use, to avoid major headaches.
For a list of all available encodings and their aliases, check the codecs section of the Python documentation.
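A short sketch for decoding every byte-string column in one pass, assuming UTF-8 is the right guess (swap in another codec if the output looks wrong); the path reuses the one from the question:
import pandas as pd

df = pd.read_sas(r'C:\mypath\myxpt.xpt')

# Decode each bytes (object-dtype) column in place.
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].str.decode('utf-8')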

ValueError in Python 3; specifically Jupyter Notebook

I'm trying to read in a file. The text file itself is laid out in 9 columns with tons of data (454 lines total). I'm trying to read in and retrieve certain columns of data so I can plot a diagram of mass against temperature (an HR diagram).
However, when I try to load the text using:
file = 'nameoftext.txt'  # the file itself is saved as a .txt from Notepad++
track1 = np.loadtxt(file, skiprows=70)  # skipping 70 rows of headers (np is numpy)
I get an error saying:
ValueError: could not convert string to float: 'iso'
I have no idea what this means or what I'm doing wrong.
I'm also using np.loadtxt because that's the only way my professor showed us how to load files, and I have no idea how else to do it.
Another option for loading .txt files in Python is the genfromtxt() function, also in numpy. With this function the type of the values in each column can be specified, or you can let the function guess each type on its own.
Check out the link below for a similar question:
Loading text file containing both float and string using numpy.loadtxt
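A minimal sketch with genfromtxt, reusing the file name and header count from the question; dtype=None asks numpy to infer a type per column instead of forcing everything to float:
import numpy as np

# With dtype=None each column gets its own inferred type, so string
# tokens such as 'iso' no longer break the float conversion.
track1 = np.genfromtxt('nameoftext.txt', skip_header=70,
                       dtype=None, encoding=None)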
