pandas ExcelWriter merge but keep value that's already there - python

I have a few small data frames that I'm outputting to Excel on one sheet. To make them fit better, I need to merge some cells in one table, but to write this with XlsxWriter I need to specify the data parameter. I want to keep the data that was already written to the left cell by the to_excel() call. Is there a way to do this without having to specify the data parameter, or do I need to look up the value in the dataframe to put in there?
For example:
df.to_excel(writer, 'sheet') produces the unmerged table shown in the question's screenshot.
Then I want to merge across C:D for this table without having to specify what data should be there (because it is already in column C), using something like:
worksheet.merge_range('C1:D1', cell_format=fmat)
to get the merged layout shown in the second screenshot.
Is this possible? Or will I need to lookup the values in the dataframe?

Is this possible? Or will I need to lookup the values in the dataframe?
You will need to lookup the data from the dataframe. There is no way in XlsxWriter to write formatting on top of existing data. The data and formatting need to be written at the same time (apart from Conditional Formatting which can't be used for merging anyway).
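A minimal sketch of that lookup-and-merge approach (the dataframe contents and cell addresses here are hypothetical; adjust the `iloc` lookup to match where `to_excel()` actually placed your value):

```python
import pandas as pd

# Hypothetical data; substitute your own dataframe.
df = pd.DataFrame({'C': ['keep me', 'second row'], 'D': ['', '']})

writer = pd.ExcelWriter('merged.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='sheet')

workbook = writer.book
worksheet = writer.sheets['sheet']
fmat = workbook.add_format({'align': 'center'})

# merge_range() overwrites whatever to_excel() wrote, so re-supply
# the value by looking it up in the dataframe.
value = df.iloc[0, 0]
worksheet.merge_range('C1:D1', value, fmat)

writer.close()
```

The key point is that the merge and the data are written in a single call; the dataframe serves as the source of truth for what belongs in the merged cell.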

Related

Cell Format for a range of cells using xlsxwriter

I am using xlsxwriter to export a pandas dataframe to an Excel file. I need to format a range of cells without using the worksheet.write function, as the data is already present in the cells.
If I use set_row or set_column, it adds the format to the entire row or column.
Please help me find a solution.
I need to format a range of cells without using the worksheet.write function, as the data is already present in the cells.
In general that isn't possible with XlsxWriter. If you want to specify formatting for cells then you need to do it when you write the data to the cells.
There are some options which may or may not suit your needs:
Row and column formatting. However, that formats the rest of the row or column as well, not just the cells with data.
Add a table format via add_table().
Add a conditional format via conditional_format().
However, these are just workarounds. If you really need to format the cells then you will need to do it when using write().
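As an illustration of the conditional-format workaround: a conditional format whose criteria always matches acts as an overlay on cells that already contain data (note that conditional formats can only set a subset of properties, e.g. font, number format, fill, and border):

```python
import xlsxwriter

workbook = xlsxwriter.Workbook('overlay.xlsx')
worksheet = workbook.add_worksheet()

# Data written first, without any formatting.
for row in range(5):
    worksheet.write(row, 0, row * 10)

# A cell criteria that is always true ('>= 0' for non-negative data)
# applies the format to the existing range without rewriting it.
bold = workbook.add_format({'bold': True})
worksheet.conditional_format('A1:A5', {'type': 'cell',
                                       'criteria': '>=',
                                       'value': 0,
                                       'format': bold})

workbook.close()
```

This is still a workaround: pick a criteria that matches all of your data, and remember it cannot merge cells.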

Check the missing value in an excel table

I am working on my assignment for data visualization. First, I have to check the dataset I found and do the data wrangling, if necessary. The data consists of several particle indices for air quality in Madrid, collected by different stations.
I found that some values are missing in the table. How can I check those missing values quickly with a tool (Python, R, or Tableau) and replace them?
In Python, you can use the pandas module to load the Excel file as a DataFrame. After that, it is easy to substitute the NaN/missing values.
Let's say your excel is named madrid_air.xlsx
import pandas as pd
df = pd.read_excel('madrid_air.xlsx')
After this you will have what pandas calls a DataFrame, which holds the data from the Excel file in the same tabular format, with column names and an index. Missing values are loaded as NaN. To get the rows that contain NaN values:
df_nan = df[df.isna().any(axis=1)]
df_nan will have the rows that contain at least one NaN value.
Now if you want to fill all those NaN values with let's say 0.
df_zerofill = df.fillna(0)
df_zerofill will have the whole DataFrame with all the NaNs substituted with 0.
To fill specific columns, use the column names:
df[['NO','NO_2']] = df[['NO','NO_2']].fillna(0)
This will fill the NO and NO_2 columns' missing values with 0.
To read up more about DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
To read up more about handling missing data in DataFrames : https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
There are several Python libraries for processing Excel spreadsheets. My favorite is openpyxl. It loads a spreadsheet into a workbook object in which you can then address a specific cell by its coordinates. What comes in quite handy is that it also recognizes row and column labels. Of course you can also update your tables with it. But be careful: if you run buggy code, your xlsx files might get permanently damaged.
Edit1:
import openpyxl

wb = openpyxl.load_workbook('filename.xlsx')
# if your worksheet is the first one in the workbook
ws = wb[wb.sheetnames[0]]
for row in ws.iter_rows(min_row=ws.min_row, max_row=ws.max_row,
                        min_col=7, max_col=9):  # columns G to I
    for cell in row:
        if cell.value is None:
            cell.value = 0
wb.save('filename.xlsx')
Well, in Tableau you can create a worksheet, drag and drop the lowest level of granularity from the dimension table (blue pill), and put the columns (as measures) in the same chart.
If your table is truly atomic, you will get a message at the bottom right of the worksheet telling you about the null values. Clicking on it lets you clear or replace these specific values in the workbook's data.
Just to clarify, it's not the "high-end" coding way, but it is the simplest one.
PS: You can also check for missing values in Tableau's data input window by filtering the columns by "null" values.
PS2: If you want to change it dynamically, you will need to use formulas like:
IF ISNULL(Measure1)
THEN (Measure2) // or another formula
ELSE null
END

Merging and cleaning up csv files in Python

I have been using pandas but am open to all suggestions; I'm not an expert at scripting and am at a complete loss. My goal is the following:
Merge multiple CSV files. Was able to do this in Pandas and have a dataframe with the merged dataset.
Screenshot of how merged dataset looks like
Delete the duplicated "GEO" columns after the first set. I can't use df = df.loc[:,~df.columns.duplicated()] for this because they are not technically duplicated: the repeated column names end with .1, .2, etc., which I'm guessing the concat adds. Another problem is that some columns have a duplicated column name but hold different datasets. I have been using the first row as the index since it always contains the same coded values, but this row is unnecessary and will be deleted later in the script. This is my biggest problem right now.
Delete certain columns, such as the ones with the "Margins". I use ~df2.columns.str.startswith for this and have no trouble with it.
Replace spaces, ":" and ";" with underscores in the first row. I have no clue how to do this.
Insert a new column, write the '=TEXT(B1,0)' formula, do this for the whole column (the formula would change to B2, B3, etc.), then copy the column and paste as values. I was able to do this in openpyxl, although I was having trouble and was not able to check the final output due to Excel trouble.
source = excel.Workbooks.Open(filename)
excel.Range("C1:C1337").Select()
excel.Selection.Copy()
excel.Selection.PasteSpecial(Paste=constants.xlPasteValues)
Not sure if it works, and I was wondering whether it is possible in pandas or win32com, or if I should stay with openpyxl. Thanks all!
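For the column-dedup and underscore-replacement steps, a pandas-only sketch is possible (the column names and data here are hypothetical, and the pattern assumes the repeated columns are named GEO.1, GEO.2, etc. as described):

```python
import pandas as pd

# Hypothetical merged frame: pandas appends .1, .2, ... to repeated names,
# and the first row holds string-coded values.
df = pd.DataFrame([['a b:c', 'x y', 'a b:c', 'p;q'],
                   [1, 2, 1, 3]],
                  columns=['GEO', 'X', 'GEO.1', 'Y'])

# Drop the renamed GEO duplicates (GEO.1, GEO.2, ...), keeping the first GEO.
df = df.loc[:, ~df.columns.str.match(r'GEO\.\d+$')]

# Replace spaces, ':' and ';' with underscores in the first row.
df.iloc[0] = df.iloc[0].str.replace(r'[ :;]', '_', regex=True)
```

The `str.match` filter only targets the suffixed copies, so columns that merely share a name with a different dataset are untouched; whether that is safe depends on how your concat named them.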

export table to csv keeping format python

I have a dataframe grouped by 3 variables. It looks like:
https://i.stack.imgur.com/q8W0y.png
When I export the table to csv, the format changes. I want to keep the original format.
Any ideas?
Thanks!
Pandas to_csv (and CSV in general) does not support the MultiIndex used in your data. It just stores the index "long": each level of the MultiIndex becomes a column, and each row carries its index values. I suspect that's what you are calling "format changes".
The upshot is that if you expect to save a pandas dataframe to CSV and then re-establish the dataframe from the CSV, you need to rebuild the MultiIndex yourself after importing it.
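A round-trip sketch of that re-indexing (the column names here are hypothetical placeholders for your three grouping variables):

```python
import io
import pandas as pd

# A frame grouped by three variables, producing a MultiIndex.
df = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': [1, 2, 1],
                   'c': ['p', 'q', 'p'], 'val': [10, 20, 30]})
grouped = df.groupby(['a', 'b', 'c']).sum()

# to_csv flattens the MultiIndex levels into ordinary columns...
csv_text = grouped.to_csv()

# ...so rebuild the MultiIndex on the way back in with index_col.
restored = pd.read_csv(io.StringIO(csv_text), index_col=['a', 'b', 'c'])
```

Equivalently, you can `read_csv` without `index_col` and call `set_index(['a', 'b', 'c'])` afterwards.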

Pandas: how to format both rows and columns when exporting to Excel (row format takes precedence)?

I am using pandas and xlsxwriter to export and format a number of dataframes to Excel.
The xlsxwriter documentation mentions that:
http://xlsxwriter.readthedocs.io/worksheet.html?highlight=set_column
A row format takes precedence over a default column format
Precedence means that if you format column B as percentage and then row 2 as bold, cell B2 won't be both bold and in % - it will be bold only, but not in %!
I have provided an example below. Is there a way around it? Maybe an engine other than xlsxwriter? Maybe some way to apply formatting after exporting the dataframes to Excel?
It makes no difference whether I format the row first and the columns later, or vice versa.
It's not shown in the example below, but in my code I export a number of dataframes, all with the same columns, to the same Excel sheet. The dataframes are the equivalent of an Excel Pivot table, with a 'total' row at the bottom. I'd like the header row and the total row to be bold, and each column to have a specific formatting depending on the data (%, thousands, millions, etc). Sample code below:
import pandas as pd
writer = pd.ExcelWriter('test.xlsx')
wk = writer.book.add_worksheet('Test')
fmt_bold = writer.book.add_format({'bold':True})
fmt_pct = writer.book.add_format({'num_format': '0.0%'})
wk.write(1,1,1)
wk.write(2,1,2)
wk.set_column(1,1, None, fmt_pct)
wk.set_row(1,None, fmt_bold)
writer.close()
As @jmcnamara notes, openpyxl provides different formatting options because it essentially lets you process the worksheet cell by cell. NB: openpyxl does not support row or column formats.
The openpyxl dataframe_to_rows() function converts a dataframe to a generator of values, row by row, allowing you to apply whatever formatting or additional processing you like.
In this case you will need to create another format that is a combination of the row and column formats and apply it to the cell.
To do that, you will need to iterate over the dataframe and call XlsxWriter directly, rather than using the pandas-Excel interface.
Alternatively, you may be able to do this using OpenPyXL as the pandas Excel engine. Recent versions of the pandas interface added the ability to add formatting to the Excel data after writing the dataframe, when using OpenPyXL.
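A sketch of the combined-format approach with XlsxWriter (the dataframe, file name, and choice of which row counts as the "total" row are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'pct': [0.1, 0.25, 0.35], 'other': [1, 2, 3]})

writer = pd.ExcelWriter('formatted.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Test', index=False)

workbook = writer.book
worksheet = writer.sheets['Test']

# One format per column, plus a combined variant for the total row,
# so each cell receives a single format object.
fmt_pct = workbook.add_format({'num_format': '0.0%'})
fmt_pct_bold = workbook.add_format({'num_format': '0.0%', 'bold': True})

# Rewrite the 'pct' column cell by cell, using the combined
# bold-percentage format for the last (total) row.
last = len(df) - 1
for i, value in enumerate(df['pct']):
    fmt = fmt_pct_bold if i == last else fmt_pct
    worksheet.write(i + 1, 0, value, fmt)  # +1 skips the header row

writer.close()
```

Because every cell ends up with exactly one format, the row-over-column precedence rule never comes into play.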
