Modifying the xlsx file using openpyxl in databricks directly without pandas/dataframe

import openpyxl
input_workbook1 = openpyxl.load_workbook('/dbfs/FileStore/Test/my_excel.xlsx')
sheet_1 = input_workbook1.active
sheet_1['A2'] = 'A2'
input_workbook1.save('/dbfs/FileStore/Test/Output.xlsx')
OSError: [Errno 95] Operation not supported
I read the Excel file directly with openpyxl in Databricks. I can read and modify it without pandas/dataframes, but saving (the last line of the code above) fails with the error shown. Can anyone help, please?

I tried the same procedure and got the same error, OSError: [Errno 95] Operation not supported. The reason is that the /dbfs mount does not support random writes, which saving an xlsx file relies on. The official Microsoft documentation (Local File API limitations) describes this limitation.
So, instead of saving directly to a /dbfs path, write the file to the driver's local disk under /databricks/driver/ and then copy or move it to the required directory.
Modify your code as follows:
import openpyxl
input_workbook1 = openpyxl.load_workbook('/dbfs/FileStore/Test/my_excel.xlsx')
sheet_1 = input_workbook1.active
sheet_1['A2'] = 'A2'
input_workbook1.save('Output.xlsx')
#will be saved to '/databricks/driver/'.
#Use dbutils.fs.ls('/databricks/driver/') to view.
from shutil import move
move('/databricks/driver/Output.xlsx','/dbfs/FileStore/Test/')
wb1 = openpyxl.load_workbook('/dbfs/FileStore/Test/Output.xlsx')
ws1 = wb1.active
for row in ws1.iter_rows():
    print([col.value for col in row])
The above code will successfully move your file to the required path without any errors.

Related

How to save an excel using pywin32?

I am trying to save an excel file generated by another application that is open. i.e the excel application is in the foreground. This file has some data and it needs to be saved i.e written into the disk.
In other words, I need to do an operation like File->SaveAs.
Steps to reproduce:
Open an Excel Application. This will be shown as Book1 - Excel in the title by default
Write this code and run
import win32com.client as win32
app = win32.gencache.EnsureDispatch('Excel.Application')
app.Workbooks(1).SaveAs(r"C:\Users\test\Desktop\test.xlsx")
app.Application.Quit()
Error -
Traceback (most recent call last):
File "c:/Users/test/Downloads/automate_excel.py", line 6, in <module>
ti = disp._oleobj_.GetTypeInfo()
pywintypes.com_error: (-2147418111, 'Call was rejected by callee.', None, None)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:/Users/test/Downloads/automate_excel.py", line 6, in <module>
app = win32.gencache.EnsureDispatch('Excel.Application')
File "C:\Users\test\AppData\Local\Programs\Python\Python38\lib\site-packages\win32com\client\gencache.py", line 633, in EnsureDispatch
raise TypeError(
TypeError: This COM object can not automate the makepy process - please run makepy manually for this object

There could be many causes for your problem, so I would appreciate it if you shared more code. The second error can occur, for example, when the line excel = win32.gencache.EnsureDispatch('Excel.Application') runs multiple times, such as inside a for loop.
Also make sure your version of Excel is fully activated and licensed.

This is working for me (on python==3.9.8 and pywin32==305). You'll see that the Dispatch line is different from yours, but I think that's really it.
In the course of this we kept getting AttributeErrors for the Workbook or for setting DisplayAlerts. We found (from this question: Excel.Application.Workbooks attribute error when converting excel to pdf) that if Excel is busy (for example, a cell is being edited or a pop-up is open), you will get an error. So be sure to press Enter to leave the cell so that you aren't editing it.
import win32com.client as win32
savepath = 'c:\\my\\file\\path\\test\\'
xl = win32.Dispatch('Excel.Application')
wb = xl.Workbooks['Book1']
xl.DisplayAlerts = False # set on the Excel application; helpful when saving the same file repeatedly, since it suppresses the overwrite pop-up and saves by default
filename = 'new_xl.xlsx'
wb.SaveAs(savepath+filename)
wb.Close()
xl.Quit()

This is the version that worked for me, based on @scotscotmcc's answer. The issue was that a cell was in edit mode while I was running the program. Make sure you press Enter in the current cell to leave edit mode in Excel.
import win32com.client as win32
import random
xl = win32.Dispatch('Excel.Application')
wb = xl.Workbooks['Book1']
wb.SaveAs(r"C:\Users\...\Desktop\Form"+str(random.randint(0,1000))+".xlsx")
wb.Close()
xl.Quit()

How to write pandas dataframe into Databricks dbfs/FileStore?

I'm new to Databricks and need help writing a pandas dataframe to the Databricks local file system.
I searched Google but could not find a similar case, and the help guide provided by Databricks did not work either. I tried the variations below. Most of the commands run without error, but the file is never written to the directory (I expect wrtdftodbfs.txt to be created).
df.to_csv("/dbfs/FileStore/NJ/wrtdftodbfs.txt")
Result: throws the below error
FileNotFoundError: [Errno 2] No such file or directory:
'/dbfs/FileStore/NJ/wrtdftodbfs.txt'
df.to_csv("\\dbfs\\FileStore\\NJ\\wrtdftodbfs.txt")
Result: No errors, but nothing written either
df.to_csv("dbfs\\FileStore\\NJ\\wrtdftodbfs.txt")
Result: No errors, but nothing written either
df.to_csv(path ="\\dbfs\\FileStore\\NJ\\",file="wrtdftodbfs.txt")
Result: TypeError: to_csv() got an unexpected keyword argument 'path'
df.to_csv("dbfs:\\FileStore\\NJ\\wrtdftodbfs.txt")
Result: No errors, but nothing written either
df.to_csv("dbfs:\\dbfs\\FileStore\\NJ\\wrtdftodbfs.txt")
Result: No errors, but nothing written either
The directory exists and manually created files show up, but pandas to_csv neither writes the file nor raises an error.
dbutils.fs.put("/dbfs/FileStore/NJ/tst.txt","Testing file creation and existence")
dbutils.fs.ls("dbfs/FileStore/NJ")
Out[186]: [FileInfo(path='dbfs:/dbfs/FileStore/NJ/tst.txt',
name='tst.txt', size=35)]
Appreciate your time and pardon me if the enclosed details are not clear enough.

Try this in your Databricks notebook:
import pandas as pd
from io import StringIO
data = """
CODE,L,PS
5d8A,N,P60490
5d8b,H,P80377
5d8C,O,P60491
"""
df = pd.read_csv(StringIO(data), sep=',')
#print(df)
df.to_csv('/dbfs/FileStore/NJ/file1.txt')
pandas_df = pd.read_csv("/dbfs/FileStore/NJ/file1.txt", header='infer')
print(pandas_df)
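If the FileNotFoundError from the question persists, one possible cause is that the target directory does not exist under the /dbfs mount. The listing in the question shows tst.txt at dbfs:/dbfs/FileStore/NJ/, which suggests that dbutils.fs treated "/dbfs/FileStore/NJ" as a path below the DBFS root, so /dbfs/FileStore/NJ itself may never have been created. A minimal sketch that creates the parent directory first, assuming os.makedirs works on the mount (ensure_parent_dir is a hypothetical helper, and the demo uses a temporary directory instead of /dbfs):

```python
import os
import tempfile


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if it is missing, then return file_path."""
    parent = os.path.dirname(file_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return file_path


# Demo against a temporary directory (stand-in for /dbfs)
demo_root = tempfile.mkdtemp()
demo_path = ensure_parent_dir(
    os.path.join(demo_root, "FileStore", "NJ", "wrtdftodbfs.txt")
)
with open(demo_path, "w") as handle:
    handle.write("CODE,L,PS\n")
```

With pandas this would presumably become df.to_csv(ensure_parent_dir('/dbfs/FileStore/NJ/wrtdftodbfs.txt')).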

This worked out for me:
outname = 'pre-processed.csv'
outdir = '/dbfs/FileStore/'
dfPandas.to_csv(outdir+outname, index=False, encoding="utf-8")
To download the file, add files/filename to your notebook URL (before the question mark ?):
https://community.cloud.databricks.com/files/pre-processed.csv?o=189989883924552#
(you need to edit your home URL, which for me is:
https://community.cloud.databricks.com/?o=189989883924552#)

Python 2.7 Openpyxl UserWarning

Why do I receive this warning message every time I run my code? (below). Is it possible to get rid of it? If so, how do I do that?
My code:
from openpyxl import load_workbook
from openpyxl import Workbook
wb = load_workbook('NFL.xlsx', data_only = True)
ws = wb.active
sh = wb["Sheet1"]
ptsDiff = (sh['J127'].value)
print ptsDiff
The code works but I get this warning message:
Warning (from warnings module):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/openpyxl/reader/worksheet.py", line 320
warn(msg)
UserWarning: Unknown extension is not supported and will be removed

This warning appears when openpyxl cannot understand/read an extension (source). Here is the list of built-in extensions that openpyxl currently knows it doesn't support:
Conditional Formatting
Data Validation
Sparkline Group
Slicer List
Protected Range
Ignored Error
Web Extension
Slicer List
Timeline Ref
Also see the Worksheet extension list specification.

Try to add single quotes to your data_only parameter like this:
wb = load_workbook('NFL.xlsx', data_only = 'True')
This works for me.

Using Python 3.5 under Anaconda3, Excel 2016, Windows 10: I had the same problem initially with an xlsx file. I tried converting it to csv and that did not work. What worked was: select the entire spreadsheet, copy it into Notepad, select the Notepad text, paste it into a new spreadsheet, and save as xlsx. It looks like any extra formatting results in a warning.

The first answer already explains what causes it. If you only want to get rid of the message that is printed in red, you can go to the file location shown in the warning and comment out (#) the line that says warn(msg). This stops the message from being displayed, and the code still works fine in my experience. I am not sure whether this will still work once the code is compiled, but it should work on the same machine.
PS: I had the same warning and this is what I did, because I thought it could be confusing for the end user.
PS: You can use a try/except block too, but this is quicker.
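A less invasive option than editing the installed library is to filter the message with Python's standard warnings module. The sketch below hides only UserWarnings whose text starts with "Unknown extension". call_quietly is a hypothetical helper, and fake_load_workbook is a stand-in so the example runs without openpyxl or an actual workbook.

```python
import warnings


def call_quietly(func, *args, **kwargs):
    """Call func while hiding only the 'Unknown extension' UserWarning."""
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            message="Unknown extension",  # matched against the start of the text
            category=UserWarning,
        )
        return func(*args, **kwargs)


def fake_load_workbook(path):
    # Hypothetical stand-in for openpyxl.load_workbook
    warnings.warn(
        "Unknown extension is not supported and will be removed", UserWarning
    )
    return "workbook:" + path


wb = call_quietly(fake_load_workbook, "NFL.xlsx")
```

With openpyxl installed, the call would presumably be call_quietly(load_workbook, 'NFL.xlsx', data_only=True). The filter is restored when the with block exits, so other warnings keep showing.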

Error: Line magic function

I'm trying to read a file using python and I keep getting this error
ERROR: Line magic function `%user_vars` not found.
My code is very basic just
names = read_csv('Combined data.csv')
names.head()
I get this any time I try to read or open a file. I tried using this thread for help:
ERROR: Line magic function `%matplotlib` not found
I'm using Enthought Canopy and I have IPython version 2.4.1. I made sure to update using the IPython installation page. I'm not sure what's wrong, because it should be very simple to open/read files. I even get this error when opening text files.
EDIT:
I imported traceback and used
print(traceback.format_exc())
But all I get is None printed. I'm not sure what that means.

Looks like you are using pandas. Try the following (assuming your csv file is in the same directory as your script), and enter it one line at a time if you are using the IPython shell:
import pandas as pd
names = pd.read_csv('Combined data.csv')
names.head()

Index out of range when reading with Openpyxl

I am trying to open a .xlsx file with Openpyxl, using the "Optimized reader" tips from the documentation :
# -*- coding: iso-8859-1 -*-
from openpyxl import load_workbook
wb = load_workbook(filename = r'/path/to/the/file.xlsx', use_iterators = True)
This gives me the following error:
Traceback (most recent call last):
File "/home/me/test.py", line 5, in <module>
wb = load_workbook(filename = r'/path/to/the/file.xlsx', use_iterators = True)
File "/usr/local/lib/python2.6/dist-packages/openpyxl/reader/excel.py", line 151, in load_workbook
_load_workbook(wb, archive, filename, read_only, keep_vba)
File "/usr/local/lib/python2.6/dist-packages/openpyxl/reader/excel.py", line 240, in _load_workbook
wb._named_ranges = list(read_named_ranges(archive.read(ARC_WORKBOOK), wb))
File "/usr/local/lib/python2.6/dist-packages/openpyxl/reader/workbook.py", line 160, in read_named_ranges
named_range.scope = workbook.worksheets[int(location_id)]
IndexError: list index out of range
I also tried using flags (keep_vba = True|False, guess_types = True|False, data_only = True|False) with every combination. Same error.
The .xlsx file I am trying to open has 13 worksheets, and no worksheet has more than 200 rows, so I suppose this is not a size problem.
I can't edit anything in this .xlsx file. I don't have permission; it is a read-only file for me.
I am using Python 2.6 on 64-bit Debian Squeeze, and the version of Openpyxl is 2.1.0.
If I try to open another file (an empty test file), it works fine (no error is triggered and the script carries on).
So I suppose the question is: what is wrong with the .xlsx file I am trying to open?

The problem is related to the defined names / ranges in use. I've seen it in another file but am not yet sure quite what's triggering it. Can you please submit a bug, preferably with a sample file, as this will make tracking the problem down a lot easier?
The 2.1 branch should contain a fix for this if you can try a checkout. As far as I can tell, the issue is related to the use of defined names from other workbooks, or to some of the reserved names for print areas, etc. Such definitions are likely lost when the file is processed by openpyxl, but that shouldn't affect the data itself.
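To check whether a given file contains such definitions, the defined names can be inspected without openpyxl, since an .xlsx file is a zip archive containing xl/workbook.xml. The sketch below is a heuristic only: it assumes the failing location_id in the traceback corresponds to the localSheetId attribute of a definedName element, and it compares that value with the number of sheet elements. find_suspect_defined_names and the in-memory demo workbook are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def find_suspect_defined_names(xlsx_file):
    """Return (name, localSheetId) pairs whose localSheetId is >= the sheet count."""
    with zipfile.ZipFile(xlsx_file) as archive:
        root = ET.fromstring(archive.read("xl/workbook.xml"))
    sheet_count = len(root.findall("m:sheets/m:sheet", NS))
    suspects = []
    for node in root.findall("m:definedNames/m:definedName", NS):
        local_id = node.get("localSheetId")
        if local_id is not None and int(local_id) >= sheet_count:
            suspects.append((node.get("name"), int(local_id)))
    return suspects


# Demo: a minimal in-memory archive with one sheet and one out-of-range name
demo_xml = (
    '<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    '<sheets><sheet name="Sheet1" sheetId="1"/></sheets>'
    "<definedNames>"
    '<definedName name="ok_name" localSheetId="0">Sheet1!$A$1</definedName>'
    '<definedName name="bad_name" localSheetId="3">Sheet1!$A$1</definedName>'
    "</definedNames></workbook>"
)
demo_buffer = io.BytesIO()
with zipfile.ZipFile(demo_buffer, "w") as demo_zip:
    demo_zip.writestr("xl/workbook.xml", demo_xml)
demo_buffer.seek(0)
print(find_suspect_defined_names(demo_buffer))  # [('bad_name', 3)]
```

For a real file, pass its path instead of the buffer, for example find_suspect_defined_names('/path/to/the/file.xlsx'). Reading the archive this way does not modify the file, so it also works on a read-only workbook.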
