I've switched over to JupyterLab recently, and discovered that pd.read_excel() now requires engine='openpyxl' in its arguments to avoid a known error in defaulting to xlrd. Unfortunately, openpyxl as an engine is introducing issues that none of my previous code accounted for.
In particular, it appears to append rows of NaN values to the end of dataframes when I import an xlsx file. I'm aware of the issue where blank rows at the start of an Excel sheet get pushed to the end of the import, and that's not the case here. I have an Excel file with multiple tabs, 16 unique column headers in the first row of each tab (identical between tabs), and every row filled with data. Previously, in Jupyter Notebook (and without engine='openpyxl'), pd.read_excel() with sheet_name=None would create a dictionary of dataframes from each tab, reading no additional rows beyond the end of the data. Now, I get upwards of one thousand blank rows at the end of some of the dataframes.
I'm not looking forward to going through all of my old code and adding dropna(how='all') to every import, and I'm afraid this might be indicative of a larger issue I'm not catching. Has anyone experienced something similar? Below is the import in JupyterLab of one of the tabs in question as an example, and the Excel sheet for the tab itself, which has no data beyond row 5226.
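For reference, the cleanup I'd have to sprinkle everywhere looks roughly like this (a sketch using a constructed dict to stand in for the real pd.read_excel(..., sheet_name=None) output):

```python
import numpy as np
import pandas as pd

# Stand-in for the dict that pd.read_excel(..., sheet_name=None) returns;
# the real call would look like:
#   sheets = pd.read_excel("book.xlsx", sheet_name=None, engine="openpyxl")
sheets = {
    "Tab1": pd.DataFrame({"a": [1, 2, np.nan], "b": [3, 4, np.nan]}),
    "Tab2": pd.DataFrame({"a": [5, np.nan], "b": [6, np.nan]}),
}

# Drop only rows where every column is NaN (the phantom trailing rows)
sheets = {name: df.dropna(how="all").reset_index(drop=True)
          for name, df in sheets.items()}

print(len(sheets["Tab1"]), len(sheets["Tab2"]))  # 2 1
```

Rows with real data in any column survive; only the all-NaN padding is removed.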
Thanks for the help!
With pandas 1.1.4 you needed engine="openpyxl", but as of pandas 1.2.4 you no longer need to pass the engine parameter.
So an upgrade might be worthwhile, though I'm not sure it will fix your issue.
pip install pandas --upgrade
to check what versions you have installed
import pandas as pd
import openpyxl
print(pd.__version__)
print(openpyxl.__version__)
I don't have code specifics for this problem, unfortunately, but using Python 2.7 + the openpyxl engine, after writing to xlsx the Excel worksheet initially opens with a lot of blank rows, which then get fixed automatically after scrolling down and back up. Here are some pictures:
There isn't a problem with writing/reading the files with blank rows, because the resulting dataframe always matches the number of rows there are supposed to be, so I believe this is happening when writing to Excel. I can't seem to find a similar question online, so I'm wondering if I need to go over my code again, or if someone else has experienced/knows the problem I seem to be having. Thanks!
I would like to automatically move a value from the table into a new column and duplicate it down as many rows as there are before a row containing only a single value, but I don't know which tool to use.
This is probably not a Python, pandas, or dataframe question, but more about running a macro in Excel.
One can run macros in Excel with Python using: https://www.xlwings.org/
This is open source and free, comes preinstalled with Anaconda and WinPython, and works on Windows and macOS.
Although, you might simply prefer the native Excel VBA editor for this and "record a macro".
Hope this is helpful.
Using ffill answers the question directly:
df['col'] = df['col'].ffill()
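A minimal illustration (with made-up data) of what ffill does to the column:

```python
import numpy as np
import pandas as pd

# Each NaN is filled with the last non-NaN value above it
df = pd.DataFrame({"col": ["x", np.nan, np.nan, "y", np.nan]})
df["col"] = df["col"].ffill()

print(df["col"].tolist())  # ['x', 'x', 'x', 'y', 'y']
```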
Assuming I have an Excel sheet already open, make some changes in the file, and use pd.read_excel to create a dataframe based on that sheet, I understand that the dataframe will only reflect the data in the last saved version of the Excel file. I would have to save the sheet first in order for the pandas dataframe to take the changes into account.
Is there anyway for pandas or other python packages to read an opened excel file and be able to refresh its data real time (without saving or closing the file)?
Have you tried the mitosheet package? It doesn't answer your question directly, but it lets you work on pandas dataframes as you would in Excel sheets. That way you can edit the data on the fly as in Excel and still get a pandas dataframe as a result (while generating the code to perform the same operations in Python). Does this help?
There is no way to do this. The table is not saved to disk, so pandas cannot read it from disk.
Be careful not to over-engineer, that being said:
Depending on your use case, if this is really needed, I could theoretically imagine a Robotic Process Automation (RPA) tool such as Blue Prism, UiPath, or Power Automate continuously loading live data from Excel into a Python environment with a pandas DataFrame and then changing it.
This use case would have to be a really important process though, otherwise licensing RPA is not worth it here.
df = pd.read_excel("path")
If you run the program in the Spyder IDE, you can inspect the data in the Variable Explorer.
I'm working a lot with Excel xlsx files which I convert using Python 3 into Pandas dataframes, wrangle the data using Pandas and finally write the modified data into xlsx files again.
The files also contain text data which may be formatted. While most modifications (which I have done) have been pretty straightforward, I experience problems when it comes to partly formatted text within a single cell:
Example of cell content: "Medical device whith remote control and a Bluetooth module for communication"
The formatting in the example is bold and italic but may also be a color.
So, I have two questions:
Is there a way of preserving such formatting in xlsx files when importing the file into a Python environment?
Is there a way of creating/modifying such formatting using a specific python library?
So far I have been using Pandas, OpenPyxl, and XlsxWriter but have not succeeded yet, so I'd appreciate your help!
As pointed out in a comment below and the linked question, OpenPyxl does not allow for this kind of formatting:
Any other ideas on how to tackle my task?
I have been working with openpyxl recently. Generally, if a cell has a single style (font/color), you can get it from cell.font: cell.font.b means bold, cell.font.i means italic, and cell.font.color contains a color object.
But if the style varies within one cell, this cannot help; there is only a minor indication on cell.value.
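A small sketch of reading those attributes back (building a throwaway workbook in memory rather than a real file; the cell text is made up):

```python
from openpyxl import Workbook
from openpyxl.styles import Font

# Throwaway in-memory workbook just to demonstrate the font attributes
wb = Workbook()
ws = wb.active
ws["A1"] = "Medical device with remote control"
ws["A1"].font = Font(bold=True, italic=True)  # bold/italic are aliases for b/i

cell = ws["A1"]
print(cell.font.b, cell.font.i)  # True True
```

Again, this only works when the whole cell shares one style; per-run formatting inside a single cell isn't exposed this way.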
I created an Excel spreadsheet using Pandas and xlsxwriter, which has all the data in the right rows and columns. However, the formatting in xlsxwriter is pretty basic, so I want to solve this problem by writing my Pandas spreadsheet on top of a template spreadsheet with Pyxl.
First, however, I need to get Pyxl to only import data up to the first blank row, and to get rid of the column headings. This way I could write my Excel data from the xlsxwriter output to the template.
I have no clue how to go about this and can't find it here or in the docs. Any ideas?
How about if I want to read data from the first column after the first blank column? (I can think of a workaround for this, but it would help if I knew how)
To be honest, I'd be tempted to suggest you use openpyxl all the way if there is something that xlsxwriter doesn't do, though I think its formatting options are pretty extensive. The most recent version of openpyxl is as fast as xlsxwriter if lxml is installed.
However, it's worth noting that Pandas has tended to ship with an older version of openpyxl because we changed the style API.
Otherwise, you can use max_row to get the highest row, but this won't stop at an empty row.
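One way to stop at the first blank row is itertools.takewhile over iter_rows, skipping the header with min_row (a sketch with a tiny in-memory sheet standing in for the real file):

```python
from itertools import takewhile

from openpyxl import Workbook

# Build a tiny in-memory sheet standing in for the real file:
# a header row, three data rows, a blank row, then stray data below it.
wb = Workbook()
ws = wb.active
for row in [["h1", "h2"], [1, 2], [3, 4], [5, 6], [None, None], [99, 99]]:
    ws.append(row)

rows = ws.iter_rows(min_row=2, values_only=True)  # min_row=2 skips the header
data = list(takewhile(lambda r: any(v is not None for v in r), rows))

print(data)  # [(1, 2), (3, 4), (5, 6)]
```

Everything at or below the first fully blank row (including the stray data) is ignored, and the headers never enter the result.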