Python cannot use loadtxt for csv file - python

I have an Excel spreadsheet that contains only numbers. When I try to open it, I get an error:
for fname in glob.glob("Train*"):
    prob = 0
    a = array(loadtxt(fname, skiprows=1, dtype=object)[prob], dtype=float)
ERROR: a = array(loadtxt(fname, skiprows=1, dtype=object)[prob], dtype=float)
ValueError: setting an array element with a sequence.
I remember this working before but I haven't opened it in a while, not sure what is wrong.

Break it down.
The first step is to identify the file that is giving you the problem. Insert
    print(fname)
as the first line inside the loop. The last name it prints before the error is the file in question.
Then, at the command prompt run
loadtxt("thebadfilename", skiprows=1, dtype=object)
See what you get.
At about this point you should see what is going wrong.
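The same search can be automated instead of read off the printed names. A minimal sketch, assuming a hypothetical helper find_bad_files (not part of numpy) that accepts any loader callable and reports which files make it raise; the parse_floats stand-in loader is also an assumption, used only so the sketch runs without numpy:

```python
import glob


def find_bad_files(paths, loader):
    """Call loader(path) for each path and collect the ones that raise.

    Hypothetical helper for illustration; loader can be any callable,
    for example a lambda wrapping numpy's loadtxt.
    """
    failures = []
    for path in paths:
        try:
            loader(path)
        except Exception as exc:  # broad on purpose: we only want a report
            failures.append((path, f"{type(exc).__name__}: {exc}"))
    return failures


if __name__ == "__main__":
    # Stand-in loader (an assumption): skip the header line, then parse
    # every whitespace-separated token as a float.
    def parse_floats(path):
        with open(path) as handle:
            next(handle, None)  # skip the header row, like skiprows=1
            return [[float(tok) for tok in line.split()] for line in handle]

    for path, error in find_bad_files(sorted(glob.glob("Train*")), parse_floats):
        print(path, "->", error)
```

To check the original call, the loader could be something like lambda p: loadtxt(p, skiprows=1, dtype=object).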

As said in the comments, numpy.loadtxt cannot read Excel files.
You could try pandas.ExcelFile to read your data (not sure if this will work, as you didn't give an example).
docstring:
Class for parsing tabular excel sheets into DataFrame objects.
Uses xlrd for parsing .xls files or openpyxl for .xlsx files.
See ExcelFile.parse for more documentation
Parameters
----------
path : string or file-like object
Path to xls file
kind : {'xls', 'xlsx', None}, default None

Related

Python - trying to import/open incorrectly formatted .xls file

I'm trying to write some Python code which needs to take data from an .xls file created by another application (outside of my control). I've tried using pandas and xlrd and neither are able to open the file, I get the error messages:
"Excel file format cannot be determined, you must specify an engine manually." using Pandas.
"Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\t'" using xlrd
I think it has to do with the way the file is exported from the program that creates it. When opened directly through Excel, I get the error message "The file format and extension don't match". However, you can ignore this message and the file opens in a usable format and can be edited and all of the expected values are in the right cells etc. Interestingly, when I go to save the file in Excel, the default option that comes up is a webpage.
Currently I have a workaround in that I can just open the file in Excel, save it as a .csv, then read it into Python as a csv. This does have to be done through Excel though; if I just change the file extension to .csv, the resulting file is garbage.
However, ideally I would like to avoid the user having to do anything manually. It would be greatly appreciated if anyone has any suggestions of ways that this might be possible (i.e. can I 'open' the file in Excel and save it through Excel using Python commands?) or if there are any packages or commands I can use to open/fix badly formatted .xls files.
Cheers!
P.S. I'm pretty new to Python and only have experience in R otherwise so my current knowledge is quite limited, apologies in advance!
Try this:
from pathlib import Path
import pandas as pd
file_path = Path(filename)
# Note: openpyxl reads .xlsx-style files, so this only helps if the file is really in that format.
df = pd.read_excel(file_path, engine='openpyxl')
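Before choosing an engine, it can help to check what the file really is. The xlrd message above shows the file starting with b'\r\n\t' instead of an Excel signature, and Excel offers to save it as a web page, so it may be HTML with an .xls extension (an assumption, not something the question confirms). A minimal sketch of a signature check; sniff_spreadsheet_format and its return labels are hypothetical names invented for this example:

```python
def sniff_spreadsheet_format(path):
    """Guess the real format of a file from its first bytes.

    Hypothetical helper for illustration; the returned labels are our own.
    """
    with open(path, "rb") as handle:
        head = handle.read(512)

    if head.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "xls"      # OLE2 compound file: legacy binary Excel
    if head.startswith(b"PK\x03\x04"):
        return "xlsx"     # ZIP container: .xlsx (or another zipped format)
    # Drop a UTF-8 byte order mark and leading whitespace before looking
    # for the first markup character.
    stripped = head.lstrip(b"\xef\xbb\xbf \t\r\n")
    if stripped.startswith(b"<"):
        return "markup"   # HTML or XML saved with a spreadsheet extension
    return "text"         # probably delimited text such as CSV or TSV
```

If this reports "markup", pd.read_html may be worth trying on the file; if it reports "text", pd.read_csv with a suitable separator is the more likely fit.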

pandas dataframe to excel

I am trying to save a pandas DataFrame to an Excel file. After some methods that scrape the data, I end up in the final method, where I write the data to an Excel file.
The problem is that I want the sheet_name to be an input variable for each scrape I do.
But with the code below, I got the error:
ValueError: No engine for filetype: ''
def datacollection(self,filename):
    tbl= self.find_element_by_xpath("/html/body/form/div[3]/div[2]/div[3]/div[3]/div[1]/table").get_attribute('outerHTML')
    df=pd.read_html(tbl)
    print(df[0])
    print(type(df[0]))
    final=pd.DataFrame(df[0])
    final.to_excel(r'C:\Users\ADMIN\Desktop\PROJECTS\Python',sheet_name=f'{filename}')
I believe the problem here is that you are asking it to write to a file called Python, without any file extension.
You could name it Python.xlsx for example.
Or, if Python was the directory name, then it should be Python/somefilename.xlsx
EDIT: Given that you were trying to name the file after filename, you are using the sheet_name parameter wrong, which names the sheet instead of the file. Ditch the sheet_name and change the last line to:
final.to_excel(fr'C:\Users\ADMIN\Desktop\PROJECTS\Python\{filename}.xlsx')
You need to give a file extension for the excel file:
final.to_excel(r'C:\Users\ADMIN\Desktop\PROJECTS\Python.xlsx',sheet_name=f'{filename}')
SOLUTION:
If you use a plain (non-raw) f-string, change the path separators from \ to /:
def datacollection(self,filename):
    tbl= self.find_element_by_xpath("/html/body/form/div[3]/div[2]/div[3]/div[3]/div[1]/table").get_attribute('outerHTML')
    df=pd.read_html(tbl)
    print(df[0])
    print(type(df[0]))
    final=pd.DataFrame(df[0])
    final.to_excel(f'C:/Users/ADMIN/Desktop/PROJECTS/Python/{filename}.xlsx')
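This might solve the error (note the fr prefix: without r, the \U in the path is read as a Unicode escape and the line fails with a SyntaxError):
final.to_excel(fr'C:\Users\ADMIN\Desktop\PROJECTS\Python\{filename}.xlsx')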
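The answers above all come down to handing to_excel a path that ends in a real file name with an .xlsx extension. A minimal sketch using pathlib, which avoids the backslash-escaping question altogether; excel_output_path is a hypothetical helper name, not part of pandas:

```python
from pathlib import Path


def excel_output_path(directory, filename):
    """Return directory/filename with an .xlsx extension.

    Hypothetical helper for illustration. The extension is appended only
    when the name does not already end in .xlsx.
    """
    path = Path(directory) / filename
    if path.suffix.lower() != ".xlsx":
        path = path.with_name(path.name + ".xlsx")
    return path
```

With this, the last line of datacollection could become something like final.to_excel(excel_output_path(r'C:\Users\ADMIN\Desktop\PROJECTS\Python', filename)), with sheet_name=filename added if the sheet should carry the same name.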

Pandas: ValueError: Worksheet index 0 is invalid, 0 worksheets found

Simple problem that has me completely dumbfounded. I am trying to read an Excel document with pandas but I am stuck with this error:
ValueError: Worksheet index 0 is invalid, 0 worksheets found
My code snippet works well for all but one Excel document linked below. Is this an issue with my Excel document (which definitely has sheets when I open it in Excel) or am I missing something completely obvious?
Excel Document
EDIT - Forgot the code. It is quite simply:
import pandas as pd
df = pd.read_excel(FOLDER + 'omx30.xlsx')
FOLDER Is the absolute path to the folder in which the file is located.
Your file is saved as Strict Open XML Spreadsheet (*.xlsx). Because it shares the same extension as Excel Workbook, it isn't obvious that the format is different. Open the file in Excel and Save As. If the selected option is Strict Open XML Spreadsheet (*.xlsx), change it to Excel Workbook (*.xlsx), save it and try loading it again with pandas.
EDIT: with the info that you have the original .csv, re-do your cleaning and save it as a .csv from Excel; or, if you prefer, pd.read_csv the original, and do your cleaning from the CLI with pandas directly.
It may be that the first sheet (index 0) was deleted in Excel, so the remaining sheets now have an index greater than 0, while the sheet_name parameter of pd.read_excel is 0, which raises the error.
It seems there indeed is a problem with my excel file. We have not been able to figure out what though. For now the path of least resistance is simply saving as a .csv in excel and using pd.read_csv to read this instead.

openpyxl cannot read Strict Open XML Spreadsheet format: UserWarning: File contains an invalid specification for Sheet1. This will be removed

A few of my users (all of whom use Mac) have uploaded an Excel file into my application, which then rejected it because the file appeared to be empty. After some debugging, I've determined that the file was saved in Strict Open XML Spreadsheet format, and that openpyxl (2.6.0) doesn't issue an error, but rather prints a warning to stderr.
To reproduce, open a file, add a few rows and save as Strict Open XML Spreadsheet (*.xlsx) format.
import openpyxl
with open('excel_open_strict.xlsx', 'rb') as f:
    workbook = openpyxl.load_workbook(filename=f)
This will print the following warning, but will not throw any exception:
UserWarning: File contains an invalid specification for Sheet1. This will be removed
Furthermore, the workbook appears to have no sheets:
assert workbook.get_sheet_names() == []
I've now had three Mac users experience this issue. It seems like Mac will sometimes default to using this Strict Open XML Spreadsheet format. If this is a normal case, then openpyxl should be able to handle it. Otherwise, it would be great if openpyxl would just throw an exception. As a workaround, it seems I can do the following:
import openpyxl
with open('excel_open_strict.xlsx', 'rb') as f:
    workbook = openpyxl.load_workbook(filename=f)
if not workbook.get_sheet_names():
    raise Exception("The Excel was saved in an incorrect format")
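Another way to get an exception instead of a warning is to escalate warnings while the workbook loads. A minimal sketch using only the standard warnings module; call_with_warnings_as_errors is a hypothetical helper name, and it assumes the library emits the message as a UserWarning through warnings.warn, as the output quoted in this thread suggests:

```python
import warnings


def call_with_warnings_as_errors(func, *args, **kwargs):
    """Run func and turn any UserWarning it emits into a raised exception.

    Hypothetical helper for illustration. The filter change is local to
    this call: catch_warnings restores the previous filters on exit.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("error", UserWarning)
        return func(*args, **kwargs)
```

For example, call_with_warnings_as_errors(openpyxl.load_workbook, filename=f) would then raise UserWarning for the file above rather than return a workbook without sheets.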
I had similar problems with XLSX files created using the R library openxlsx. A sample error message from a simple python program to open the file and retrieve a single value from sheet Crops:
Warning (from warnings module):
File "C:\Python38\lib\site-packages\openpyxl\reader\workbook.py", line 88
warn(msg)
UserWarning: File contains an invalid specification for Crops. This will be removed
My first, very clumsy solution:
Open with Excel
Save the file as *.xls, which triggered a warning about compatibility.
Re-save as *.xlsx
My second solution works if you only need to read the file:
Impose a read-only restriction:
wb = load_workbook(filename = 'CAF_LTAR_crops_out_0.3.xlsx', read_only=True)
The broad lesson seems to be that the XLSX file specification is not uniformly (correctly?) implemented across programming languages.
I am working on a Windows PC and had the same problem with openpyxl. I was given an Excel template that had been saved as Strict Open XML Spreadsheet (*.xlsx). When I tried to fill out the template, I got a warning like the one below for each worksheet, and when I printed the list of worksheet names it was empty: [].
UserWarning: File contains an invalid specification for Sheetname. This will be removed
Solution
I saved the file as Excel Workbook (*.xlsx) and not as Strict Open XML Spreadsheet (*.xlsx). After that I had no fault message, the array included all Worksheets and I could fill out the template with openpyxl.
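Because both variants share the .xlsx extension, it can be useful to detect a Strict file before handing it to openpyxl. A minimal sketch that peeks inside the ZIP container; is_strict_ooxml is a hypothetical helper name, and it assumes the workbook part sits at the conventional location xl/workbook.xml and declares the Strict spreadsheet namespace (http://purl.oclc.org/ooxml/spreadsheetml/main):

```python
import zipfile

# Namespace used by the Strict variant of the spreadsheet format.
STRICT_NAMESPACE = b"http://purl.oclc.org/ooxml/spreadsheetml/main"


def is_strict_ooxml(path_or_file, workbook_part="xl/workbook.xml"):
    """Return True if the workbook part declares the Strict namespace.

    Hypothetical helper for illustration. Accepts a path or a binary
    file object; returns False if the workbook part is not found at the
    assumed location.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        try:
            workbook_xml = archive.read(workbook_part)
        except KeyError:
            return False
    return STRICT_NAMESPACE in workbook_xml
```

An application could call this on an uploaded file and ask the user to re-save as Excel Workbook (*.xlsx) when it returns True.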

CParserError: Error tokenizing data

I'm having some trouble reading a csv file
import pandas as pd
df = pd.read_csv('Data_Matches_tekha.csv', skiprows=2)
I get
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 1 fields in line 526, saw 5
and when I add sep=None to df I get another error
Error: line contains NULL byte
I tried adding unicode='utf-8', I even tried CSV reader and nothing works with this file
The csv file is totally fine; I checked it and I see nothing wrong with it.
Here are the errors I get:
In your actual code, the line is:
>>> pandas.read_csv("Data_Matches_tekha.xlsx", sep=None)
You are trying to read an Excel file, and not a plain text CSV which is why things are not working.
Excel files (.xlsx) are zipped XML packages, not plain text, so they cannot be read as simple text files (like CSV files).
You need to either convert the Excel file to CSV (note: if you have multiple sheets, each sheet should be converted to its own csv file) and then read those files, or read the Excel file directly.
To read it directly, you can use read_excel, or you can use a library like xlrd which is designed to read Excel files; see Reading/parsing Excel (xls) files with Python for more information on that.
Use read_excel instead of read_csv if the file is an Excel file:
import pandas as pd
df = pd.read_excel("Data_Matches_tekha.xlsx")
I encountered the same error when I used to_csv to write some data and then read it in another script. I found an easy solution that bypasses pandas' read function: the pickle module.
pickle is part of the Python standard library, so there is nothing to install with pip.
First, write your data with the code below:
import pickle
with open(path, 'wb') as output:
    pickle.dump(variable_to_save, output)
Then load your data in another script using:
import pickle
with open(path, 'rb') as input:
    data = pickle.load(input)
Note that if you want to read the saved data with a different Python version than the one that wrote it, you can specify the pickle protocol in the writing step with protocol=x; for example, protocol=2 is the highest protocol that Python 2 can read.
I hope this can be of any use.
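The two steps can be wrapped into a pair of small functions. A minimal sketch; save_object and load_object are hypothetical names chosen for this example, and the default protocol is simply the newest one the running interpreter supports:

```python
import pickle


def save_object(obj, path, protocol=pickle.HIGHEST_PROTOCOL):
    """Serialize obj to path; pass a lower protocol for older readers."""
    with open(path, "wb") as output:
        pickle.dump(obj, output, protocol=protocol)


def load_object(path):
    """Load an object written by save_object.

    Only unpickle files you trust: loading a pickle can run arbitrary code.
    """
    with open(path, "rb") as source:
        return pickle.load(source)
```

Keep in mind that this sidesteps the CSV problem rather than fixing it, and the resulting file can only be read from Python.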
