Pandas, Python - Problem with converting xlsx to csv - python

I found to have problem with conversion of .xlsx file to .csv using pandas library.
Here is the code:
import pandas as pd
# If pandas is not installed: pip install pandas
class Program:
def __init__(self):
# file = input("Insert file name (without extension): ")
file = "Daty"
self.namexlsx = "D:\\" + file + ".xlsx"
self.namecsv = "D:\\" + file + ".csv"
Program.export(self.namexlsx, self.namecsv)
def export(namexlsx, namecsv):
try:
read_file = pd.read_excel(namexlsx, sheet_name='Sheet1', index_col=0)
read_file.to_csv(namecsv, index=False, sep=',')
print("Conversion to .csv file has been successful.")
except FileNotFoundError:
print("File not found, check file name again.")
print("Conversion to .csv file has failed.")
Program()
After running the code the console shows the ValueError: File is not a recognized excel file error
File i have in that directory is "Daty.xlsx". Tried couple of thigns like looking up to documentation and other examples around internet but most had similar code.
Edit&Update
What i intend afterwards is use the created csv file for conversion to .db file. So in the end the line of import will go .xlsx -> .csv -> .db. The idea of such program came as a training, but i cant get past point described above.

You can use like this-
import pandas as pd
data_xls = pd.read_excel('excelfile.xlsx', 'Sheet1', index_col=None)
data_xls.to_csv('csvfile.csv', encoding='utf-8', index=False)

I checked the xlsx itself, and apparently for some reason it was corrupted with columns in initial file being merged into one column. After opening and correcting the cells in the file everything runs smoothly.
Thank you for your time and apologise for inconvenience.

Related

How to read the files of Azure file share as csv that is pandas dataframe

I have few csv files in my Azure File share which I am accessing as text by following the code:
from azure.storage.file import FileService
storageAccount='...'
accountKey='...'
file_service = FileService(account_name=storageAccount, account_key=accountKey)
share_name = '...'
directory_name = '...'
file_name = 'Name.csv'
file = file_service.get_file_to_text(share_name, directory_name, file_name)
print(file.content)
The contents of the csv files are being displayed but I need to pass them as dataframe which I am not able to do. Can anyone please tell me how to read the file.content as pandas dataframe?
After reproducing from my end, I could able to read a csv file into dataframe from the contents of the file following the below code.
generator = file_service.list_directories_and_files('fileshare/')
for file_or_dir in generator:
print(file_or_dir.name)
file=file_service.get_file_to_text('fileshare','',file_or_dir.name)
df = pd.read_csv(StringIO(file.content), sep=',')
print(df)
RESULTS:

Error open file after saving it with storeFile of pysmb

I am reading an Excel file (.xlsx) with pysmb.
import tempfile
from smb.SMBConnection import SMBConnection
conn = SMBConnection(userID, password, client_machine_name, server_name, use_ntlm_v2 = True)
conn.connect(server_ip, 139)
file_obj = tempfile.TemporaryFile()
file_attributes, filesize = conn.retrieveFile(service_name, test.xlsx, file_obj)
This step works, I am able to transform the file in pandas.DataFrame
import pandas as pd
pd.read_excel(file_obj)
Next, I want to save the file, the file is saved but if I want to open it with Excel, I have an error message "Excel has run into an error"
Here the code to save the file
conn.storeFile(service_name, 'test_save.xlsx', file_obj)
file_obj.close()
How can I save correctly the file and open it with excel ?
Thank you
I tried with a .txt file file and it is working. An error occurs with .xlsx, .xls and .pdf files. I have also tried without extension, same issue, imossible to open the file.
I would like to save the file with .pdf and .xlsx extension, and open it.
Thank you.
I found a solution an I will post it here in case someone face a similar issue.
Excel can be save as a binary stream.
from io import BytesIO
df = pd.read_excel(file_obj)
output = BytesIO()
writer = pd.ExcelWriter(output, engine='xlsxwriter')
df.to_excel(writer, sheet_name='data', index = False)
writer.save()
output.seek(0)
conn.storeFile(service_name, 'test_save.xlsx', output)

Pandas gives an unordered csv file

What can I do to make this (1 Pic):
look like this one with pandas (2 Pic):
Here's the code I used to make the csv file in the 1 Picture
import pandas as pd
import os
all_months_data = pd.DataFrame()
files = [file for file in os.listdir('Sales_Data/')]
for file in files:
df = pd.read_csv('Sales_Data/' + file)
all_months_data = pd.concat([all_months_data, df])
all_months_data.to_csv('all_data.csv')
I just figured the problem and it was Exel itself that have read my csv file as a text.
I did this and it worked:
Open Excel
Go to 'Data' tab
Select 'From Text/CSV' and select the .CSV file you want to import.
Click 'Import' and you're done!

Python - XLRDError: Unsupported format, or corrupt file: Expected BOF record

I am trying to open an excel file which was given to me for my project, the excel file is the file that we get from a SAP system. But when I try opening it using pandas I am getting the following error:
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '\xff\xfe\r\x00\n\x00\r\x00'
The following is my code:
import pandas as pd
# To open an excel file
df = pd.ExcelFile('myexcel.xls').parse('Sheet1')
Dont know whether it will work for you once it had worked for me, but anyway can you try the following:
from __future__ import unicode_literals
from xlwt import Workbook
import io
filename = r'myexcel.xls'
# Opening the file using 'utf-16' encoding
file1 = io.open(filename, "r", encoding="utf-16")
data = file1.readlines()
# Creating a workbook object
xldoc = Workbook()
# Adding a sheet to the workbook object
sheet = xldoc.add_sheet("Sheet1", cell_overwrite_ok=True)
# Iterating and saving the data to sheet
for i, row in enumerate(data):
# Two things are done here
# Removeing the '\n' which comes while reading the file using io.open
# Getting the values after splitting using '\t'
for j, val in enumerate(row.replace('\n', '').split('\t')):
sheet.write(i, j, val)
# Saving the file as an excel file
xldoc.save('myexcel.xls')
I had faced the same xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; error and solved it by writing an XML to XLSX converter. You can call pd.ExcelFile('myexcel.xlsx') after the convertion. The reason is that actually, pandas uses xlrd for reading Excel files and xlrd does not support XML Spreadsheet (*.xml) i.e. NOT in XLS or XLSX format.
import pandas as pd
from bs4 import BeautifulSoup
def convert_to_xlsx():
with open('sample.xls') as xml_file:
soup = BeautifulSoup(xml_file.read(), 'xml')
writer = pd.ExcelWriter('sample.xlsx')
for sheet in soup.findAll('Worksheet'):
sheet_as_list = []
for row in sheet.findAll('Row'):
sheet_as_list.append([cell.Data.text if cell.Data else '' for cell in row.findAll('Cell')])
pd.DataFrame(sheet_as_list).to_excel(writer, sheet_name=sheet.attrs['ss:Name'], index=False, header=False)
writer.save()
What worked for me was applying this advice:
How to cope with an XLRDError
There you also find a suitable explanation that was appropiated for me. It says that the problem was a file format not correctly saved. When I opened the xls file, it offered to save it as html.I saved it a ".xlsx" and solved the problem

pd.read_excel throws PermissionError if file is open in Excel

Whenever I have the file open in Excel and run the code, I get the following error which is surprising because I thought read_excel should be a read only operation and would not require the file to be unlocked?
Traceback (most recent call last):
File "C:\Users\Public\a.py", line 53, in <module>
main()
File "C:\Users\Public\workspace\a.py", line 47, in main
blend = plStream(rootDir);
File "C:\Users\Public\workspace\a.py", line 20, in plStream
df = pd.read_excel(fPath, sheetname="linear strategy", index_col="date", parse_dates=True)
File "C:\Users\Public\Continuum\Anaconda35\lib\site-packages\pandas\io\excel.py", line 163, in read_excel
io = ExcelFile(io, engine=engine)
File "C:\Users\Public\Continuum\Anaconda35\lib\site-packages\pandas\io\excel.py", line 206, in __init__
self.book = xlrd.open_workbook(io)
File "C:\Users\Public\Continuum\Anaconda35\lib\site-packages\xlrd\__init__.py", line 394, in open_workbook
f = open(filename, "rb")
PermissionError: [Errno 13] Permission denied: '<Path to File>'
Generally Excel have a lot of restrictions when opening files (can't open the same file twice, can't open 2 different files with the same name ..etc).
I don't have excel on machine to test, but checking the docs for read_excel I've noticed that it allows you to set the engine.
from the stack trace you posted it seems like the error is thrown by xlrd which is the default engine used by pandas.
try using any of the other ones
Supported engines: “xlrd”, “openpyxl”, “odf”, “pyxlsb”, default “xlrd”.
so try with the rest, like
df = pd.read_excel(fPath, sheetname="linear strategy", index_col="date", parse_dates=True, engine="openpyxl")
I know this is not a real answer, but you might want to submit a bug report to pandas or xlrd teams.
As a workaround I suggest making python create a copy of the original file then read from the copy. After that the code should delete the copied file. It's a bit of extra work but should work.
Example
import shutil
shutil.copy("C://Test//Test.xlsx", "C://Test//koko.xlsx")
I would suggest using the xlwings module instead which allows for greater functionality.
Firstly, you will need to load your workbook using the following line:
If the spreadsheet is in the same folder as your python script:
import xlwings as xw
workbook = xw.Book('myfile.xls')
Alternatively:
workbook = xw.Book('"C:\Users\...\myfile.xls')
Then, you can create your Pandas DataFrame, by specifying the sheet within your spreadsheet and the cell where your dataset begins:
df = workbook.sheets[0].range('A1').options(pd.DataFrame,
header=1,
index=False,
expand='table').value
When specifying a sheet you can either specify a sheet by its name or by its location (i.e. first, second etc.) in the following way:
workbook.sheets[0] or workbook.sheets['sheet_name']
Lastly, you can simply install the xlwings module by using Pip install xlwings
Mostly there is no issues in your code. [ If you publish the code it will be easier.]
You need to change the permissions of the directory you are using so that all users have read and write permissions.
I got this to work by first setting the working directory, then opening the file. Maybe something to do with shared drive permissions and read_excel function.
import os
import pandas as pd
os.chdir("c:\\Users\\...\\")
filepath = "...\\filename.xlsx"
sheetname = 'sheet1'
df_xls = pd.read_excel(filepath, sheet_name=sheetname, engine='openpyxl')
I fix this error simply closing the .xlsx file that was open.
You can set engine = 'xlrd', then you can run the code while Excel has the file open.
df = pd.read_excel(filename, sheetname, engine = 'xlrd')
You may need to pip install xlrd if you don't have it
You may also want to check if the file has a password? Alternatively you can open the file with the password required using the code below:
import sys
import win32com.client
xlApp = win32com.client.Dispatch("Excel.Application")
print "Excel library version:", xlApp.Version
filename, password = <-- enter your own filename and password
xlwb = xlApp.Workbooks.Open(filename, Password=password)
# xlwb = xlApp.Workbooks.Open(filename)
xlws = xlwb.Sheets([insert number here]) # counts from 1, not from 0
print xlws.Name
print xlws.Cells(1, 1) # that's A1
You can set engine='python' then you can run it even if the file is open
df = pd.read_excel(filename, engine = 'python')

Categories

Resources