How to save a pandas pivot table with xlsxwriter in excel - python

I want to save a pandas pivot table, properly and nicely formatted, into an Excel workbook.
I have a pandas pivot table, built with this call:
table = pd.pivot_table(d2, values=['Wert'], index=['area', 'Name'], columns=['Monat'],
aggfunc={'Wert': np.sum}, margins=True).fillna('')
From my original dataframe:
df
Area Name2 Monat Wert
A A 1 2
A A 1 3
A B 1 2
A A 2 1
so the pivot table looks like this:
          Wert
Monat        1  2 All
Area Name
A    A       5  1   6
     B       2      2
All          7  1   8
Then I want to save this in an excel workbook with the following code:
import pandas as pd
import xlsxwriter
workbook = xlsxwriter.Workbook('myexcel.xlsx')
worksheet1 = workbook.add_worksheet('table1')
caption = 'Table1'
worksheet1.set_column(1, 14, 25) #irrelevant, just a random size right now
worksheet1.write('B1', caption)
worksheet1.add_table('B3:K100', {'data': table.values.tolist()}) #also wrong size from B3 to K100
workbook.close()
But this looks like this (with different values), so the headers are missing:
How can I adjust it and save a pivot table in excel?
If I am using the pandas command .to_excel it looks like this:
Which is fine, but the column widths do not fit the names, the background color does not look nice, and I am also missing a caption.

I found a solution by combining this with a related topic. First, flatten the pivot table:
flattened = pd.DataFrame(table.to_records())
flattened.columns = [column.replace("('Wert', ", "Monat: ").replace(")", "") for column in flattened.columns] ##only for renaming the column headers
And then:
workbook = xlsxwriter.Workbook(excelfilename, options={'nan_inf_to_errors': True})
worksheet = workbook.add_worksheet('Table1')
worksheet.set_column(0, flattened.shape[1] - 1, 25)
worksheet.add_table(0, 0, flattened.shape[0], flattened.shape[1]-1,
{'data': flattened.values.tolist(),
'columns': [{'header': c} for c in flattened.columns.tolist()],
'style': 'Table Style Medium 9'})
workbook.close()
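
As a variation on the flattening step above, here is a minimal sketch that avoids the string replacement on record names by resetting the index and renaming the month columns directly. The sample frame and the column names `Area` and `Name` are assumptions modelled on the question, and `values` is passed as a plain string so the pivot has single-level columns.

```python
import pandas as pd

# Sample data modelled on the question's frame (column names are assumptions)
df = pd.DataFrame({
    "Area": ["A", "A", "A", "A"],
    "Name": ["A", "A", "B", "A"],
    "Monat": [1, 1, 1, 2],
    "Wert": [2, 3, 2, 1],
})

# values as a string (not a list) keeps the column index single-level
table = pd.pivot_table(df, values="Wert", index=["Area", "Name"],
                       columns="Monat", aggfunc="sum", margins=True)

# Turn the row index into ordinary columns, then give every column a string header
flattened = table.reset_index()
flattened.columns = [
    c if c in ("Area", "Name") else f"Monat: {c}"
    for c in flattened.columns
]

# Header spec in the shape that worksheet.add_table() expects for 'columns'
header_spec = [{"header": c} for c in flattened.columns]
print(flattened)
```

Missing combinations stay as NaN here, so either call `fillna('')` before handing `flattened.values.tolist()` to `add_table`, or keep the `nan_inf_to_errors` option used above.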

Related

Comparing 2 revisions of excel files in python pandas

I am very new to pandas. It might be a silly question to some of you.
I am looking to compare 2 excel files and output the changes or the new entries
old.csv
Product Price Description
1 1.25 Product 1
2 2.25 Product 2
3 3.25 Product 3
new.csv
Product Price Description
1 1.25 Product 1 # Product 2 not in list
3 3.50 Product 3 # Price update
4 4.25 Product 4 # New entry
TRIED
import pandas as pd
import numpy as np
import requests
url = '<SomeUrl>/<PriceList>.xls'
resp = requests.get(url)
df = pd.DataFrame(pd.read_excel(resp.content))
df.to_csv('new.csv')
old = pd.read_csv('old.csv')
new = pd.read_csv('new.csv')
changes = new.loc[new['Price'] != old['Price']]
changes_csv = changes[['Product', 'Price', 'Description']]
print(changes_csv)
EXPECTING
3 3.50 Product 3 # Price update
4 4.25 Product 4 # New entry
I get the correct results if the length matches exactly. Otherwise I get
pandas valueerror can only compare identically labeled objects
BONUS
It would be awesome if I could also produce output for discontinued products.
You could create a master index of all products, create two old/new dataframes using that master index, then use df.compare() to compare the two dataframes:
import pandas as pd
df1 = pd.DataFrame([[1,1.25,'Product 1'],[2,2.25,'Product 2'],[3,3.25,'Product 3']], columns=['Product','Price','Description'])
df2 = pd.DataFrame([[1,1.25,'Product 1'],[3,3.5,'Product 2'],[4,4.25,'Product 3 Change']], columns=['Product','Price','Description'])
df1product = df1[['Product']]
df2product = df2[['Product']]
dfproducts = df1product.merge(df2product, on='Product', how='outer')
df1 = dfproducts.merge(df1, how='left', on='Product')
df1.set_index(df1['Product'], inplace=True)
df2 = dfproducts.merge(df2, how='left', on='Product')
df2.set_index(df2['Product'], inplace=True)
dfcompare = df1.compare(df2, align_axis=0)
I have solved the problem. Even though @WCeconomics kindly took the time to type the code out, it did not help me get the output I wanted, likely because I am a noob with pandas.
This is how I solved it, in case it is useful to the community.
import pandas as pd
import openpyxl # to write excel files
from openpyxl.utils.dataframe import dataframe_to_rows
old = pd.read_excel('old.xls')
new = pd.read_excel('new.xls')
# data for these is in the same format as in question, with 'Product Description' instead of 'Description'
merged = old.merge(new, on='Product', how='outer', suffixes=('_old', '_new'))
df = merged[['Product', 'Product Description_old', 'Price_old', 'Price_new']]
changes = df.loc[(df['Price_new'] > df['Price_old'])].dropna(how='any', axis=0)
wb = openpyxl.Workbook()
ws = wb.active
for r in dataframe_to_rows(changes, index=False, header=True):
    ws.append(r)
# openpyxl writes the .xlsx format, so use that extension
wb.save('avp_changes.xlsx')
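
For the bonus part of the question (discontinued products), one possible approach is an outer merge with `indicator=True`, which labels each row by the frame it came from. The sketch below uses small in-memory frames that mirror the question's sample data; the variable names are illustrative only.

```python
import pandas as pd

# In-memory stand-ins for old.csv and new.csv from the question
old = pd.DataFrame({"Product": [1, 2, 3],
                    "Price": [1.25, 2.25, 3.25],
                    "Description": ["Product 1", "Product 2", "Product 3"]})
new = pd.DataFrame({"Product": [1, 3, 4],
                    "Price": [1.25, 3.50, 4.25],
                    "Description": ["Product 1", "Product 3", "Product 4"]})

# indicator=True adds a '_merge' column: left_only, right_only or both
merged = old.merge(new, on="Product", how="outer",
                   suffixes=("_old", "_new"), indicator=True)

new_entries = merged[merged["_merge"] == "right_only"]   # only in new
discontinued = merged[merged["_merge"] == "left_only"]   # only in old
both = merged[merged["_merge"] == "both"]
price_changes = both[both["Price_old"] != both["Price_new"]]

print(new_entries[["Product", "Price_new"]])
print(discontinued[["Product", "Price_old"]])
print(price_changes[["Product", "Price_old", "Price_new"]])
```

Because the comparison happens after the merge, the two inputs no longer need the same length or the same index, which is what triggered the "identically labeled objects" error.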

How can I fetch data from another Excel tab using a formula as reference using Python and Pandas (or something like)

First, I'm not sure pandas is the right approach here; it may be better done with VBA or another library such as openpyxl.
I have an Excel workbook with two tabs. tab1 has a name and a value, where the value is a formula such as ='tab2'!H10; tab2 holds that value (or the values being summed) along with a bunch of other information.
I want to read the value column on tab1, whose formula may reference more than one cell on the second tab, for example ='tab2'!H10 + 'tab2'!H12 + 'tab2'!H20 on the row for Name1. From that I want to extract the referenced rows (rows 10, 12 and 20 in this example) and fetch 3 columns from tab2 for those rows.
Then I want to "join" (not sure it is really a join) the name on tab1 with those 3 columns from tab2 for those rows. The end result should look something like this:
| Name 1 (from tab 1 - line) | Column 1 (from tab2) | Column 2 | Column 3 | from row 10
| Name 1 (from tab 1 - line) | Column 1 (from tab2) | Column 2 | Column 3 | from row 12
| Name 1 (from tab 1 - line) | Column 1 (from tab2) | Column 2 | Column 3 | from row 20
Code that I'm trying and it's not currently working, error
ValueError: cannot join with no overlapping index names
import numpy as np
import pandas as pd
from IPython.display import display
from openpyxl import Workbook
from openpyxl import load_workbook
wbx = load_workbook(filename= 'test.xlsx')
sheet_names = wbx.sheetnames
name1 = sheet_names[0]
sheet_ranges1 = wbx[name1]
df1 = pd.DataFrame(sheet_ranges1.values)
name2 = sheet_names[1]
sheet_ranges2 = wbx[name2]
df2 = pd.DataFrame(sheet_ranges2.values)
pd.set_option("display.max_rows", None, "display.max_columns", None)
c1 = df1.iloc[:,[1]]
c2 = df1.iloc[:,24]
print(c1.dtypes)
res = c2.str.extractall(r"!H(?P<line>\d+)?")
res2 = c1.merge(pd.DataFrame(res), how='left', left_index=True, right_index=True)
hope it helps:
import pandas as pd
df1 = pd.read_excel(r'.\foldername\filename.xlsx', sheet_name='sheet1')
df2 = pd.read_excel(r'.\foldername\filename.xlsx', sheet_name='sheet2')
df3 = pd.read_excel(r'.\foldername\filename.xlsx', sheet_name='sheet3')
# drop columns that should not appear in the merged result, or that would be duplicated as col_x and col_y
df1 = df1.drop(columns=['col2', 'col3'])
# join table
dfx = df1.merge(df2, how="inner", left_on="col1", right_on="col2")
merged = dfx.merge(df3, how="left", left_on="col7", right_on="col3")
print(merged.head())
you can do as well in VBA
Sub JoinTables()
Dim connection As ADODB.Connection
Set connection = New ADODB.Connection
With connection
.Provider = "Microsoft.Jet.OLEDB.4.0"
.ConnectionString = "Data Source=" & ThisWorkbook.FullName & ";" & "Extended Properties=Excel 8.0;"
.Open
End With
Dim recordset As ADODB.Recordset
Set recordset = New ADODB.Recordset
recordset.Open "SELECT * FROM [Sheet1$] INNER JOIN [Sheet2$] ON [Sheet1$].[type] = " & "[Sheet2$].[type]", connection
With Worksheets("Sheet3")
.Cells(2, 1).CopyFromRecordset recordset
End With
recordset.Close
connection.Close
End Sub
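
Coming back to the row-reference part of the question: `str.extractall` returns a frame with an extra `match` index level, which may be why the index-based merge fails. A minimal sketch of an alternative is to collect the row numbers with `str.findall`, expand them with `explode`, and then look the rows up in tab2. The frames below are hypothetical stand-ins for the two tabs (tab2 is indexed by Excel row number), and the column names `name`, `formula`, `col1` to `col3` are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for tab1: a name and the formula text of the value cell
tab1 = pd.DataFrame({
    "name": ["Name1", "Name2"],
    "formula": ["='tab2'!H10 + 'tab2'!H12 + 'tab2'!H20", "='tab2'!H11"],
})

# Hypothetical stand-in for tab2, indexed by Excel row number (1..20)
rows = range(1, 21)
tab2 = pd.DataFrame(
    {"col1": [f"a{r}" for r in rows],
     "col2": [f"b{r}" for r in rows],
     "col3": [f"c{r}" for r in rows]},
    index=rows,
)

# One list of referenced row numbers per formula, then one output row per reference
refs = tab1.assign(row=tab1["formula"].str.findall(r"!H(\d+)")).explode("row")
refs["row"] = refs["row"].astype(int)

# Look up the referenced rows in tab2 and attach the three columns to the name
result = (refs[["name", "row"]]
          .merge(tab2, left_on="row", right_index=True, how="left")
          .reset_index(drop=True))
print(result)
```

Formulas with no `!H` reference would produce a missing value after `explode`, so the `astype(int)` step assumes every formula contains at least one reference.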

Delete a row when a cell is empty

I'm trying to delete a row when a cell is empty from the 'calories.xlsx' spreadsheet and send all data, except empty rows, to the 'destination.xlsx' spreadsheet. The code below is how far I got. But still, it does not delete rows that have an empty value based on the calories column.
This is the data set:
Data Set
How can I develop my code to solve this problem?
import pandas as pd
FileName = 'calories.xlsx'
SheetName = pd.read_excel(FileName, sheet_name = 'Sheet1')
df = SheetName
print(df)
ListCalories = ['Calories']
print(ListCalories)
for Cell in ListCalories:
    if Cell == 'NaN':
        df.drop[Cell]
print(df)
df.to_excel('destination.xlsx')
Create dummy data
import numpy as np
import pandas as pd
df=pd.DataFrame({
    'calories':[2306,3256,1235,np.nan,3654,3256],
    'Person':['person1','person2','person3','person4','person5','person6',]
})
Print data frame
   calories   Person
0    2306.0  person1
1    3256.0  person2
2    1235.0  person3
3       NaN  person4
4    3654.0  person5
5    3256.0  person6
remove row, if calories value is missing
new_df=df.dropna(how='any',subset=['calories'])
Result
calories Person
0 2306.0 person1
1 3256.0 person2
2 1235.0 person3
4 3654.0 person5
5 3256.0 person6
save as excel
new_df.to_excel('destination.xlsx',index=False)
Your ListCalories contains only one element, the string 'Calories'; I'll assume this was a typo.
What you are probably trying to do is:
import pandas as pd
FileName = 'calories.xlsx'
df = pd.read_excel(FileName, sheet_name = 'Sheet1')
print(df)
# you don't need this, but I kept it for you
ListCalories = df['Calories']
print(ListCalories)
clean_df = df[df['Calories'].notna()] # this selects only the rows that don't have an NA value in the Calories column
print(clean_df)
clean_df.to_excel('destination.xlsx')
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notna.html
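
Both answers rely on the empty cells arriving as NaN. If some cells instead contain an empty string or only spaces, they would survive `dropna` and `notna`. A minimal sketch for that case, using a made-up frame in place of the spreadsheet, converts blank strings to NaN first:

```python
import numpy as np
import pandas as pd

# Made-up data: one truly missing value plus two "blank" strings
df = pd.DataFrame({
    "Calories": [2306, "", None, "   ", 3654],
    "Person": ["person1", "person2", "person3", "person4", "person5"],
})

# Treat empty or whitespace-only strings as missing, then drop those rows
blank_as_nan = df.replace(r"^\s*$", np.nan, regex=True)
clean_df = blank_as_nan.dropna(subset=["Calories"])
print(clean_df)
```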

formatting colour for cell and row in excel using pandas

My requirement is to highlight cells that differ in yellow, and to highlight newly added rows in red, in the Excel output.
load1 dataset
StudentID Visit1 Visit 2 Visit 3 Visit 4
1 16-04-20 23-04-20 30-04-20 07-05-20
2 17-04-20 24-04-20 01-05-20 08-05-20
3 18-04-20 25-04-20 02-05-20 09-05-20
4 19-04-20 26-04-20 03-05-20 10-05-20
5 20-04-20 27-04-20 04-05-20 11-05-20
6 21-04-20 28-04-20 05-05-20 12-05-20
7 22-04-20 29-04-20 06-05-20 13-05-20
8 23-04-20 30-04-20 07-05-20 14-05-20
load2 table
StudentID Visit1 Visit 2 Visit 3 Visit 4
1 16-04-20 23-04-20 30-04-20 07-05-20
2 17-04-20 24-04-20 01-05-20 08-05-20
3 18-04-20 25-09-20 02-05-20 09-05-20
4 19-04-20 26-04-20 03-06-20 10-05-20
5 20-04-20 27-04-20 04-05-20 11-05-20
6 21-04-20 28-04-20 05-05-20 12-06-20
7 22-04-20 29-08-20 06-05-20 13-05-20
8 23-04-20 30-04-20 07-05-20 14-05-20
9 15-05-20 16-05-20 17-05-20 18-05-20
Output: I am looking for the output in Excel in the format below, with the colours highlighted.
I am new to pandas. I was able to get the differences at the dataframe level, but not in the exported Excel file, where each cell and row should be formatted with colours. Please help.
Below is the script I tried in order to get the differences.
import pandas as pd
import os
import numpy as np
colour1= pd.read_excel('pandas.xlsx',sheet_name='load1')
colour2= pd.read_excel('pandas.xlsx',sheet_name='load2')
colour_merge=colour1.merge(colour2,left_on='StudentID', right_on='StudentID',how='outer')
colour_merge['Visit1dif']= np.where(colour_merge['Visit1_x']==colour_merge['Visit1_y'],0,1)
colour_merge['Visit2dif']= np.where(colour_merge['Visit 2_x']==colour_merge['Visit 2_y'],0,1)
colour_merge['Visit3dif']= np.where(colour_merge['Visit 3_x']==colour_merge['Visit 3_y'],0,1)
colour_merge['Visit4dif']= np.where(colour_merge['Visit 4_x']==colour_merge['Visit 4_y'],0,1)
colour_merge[['StudentID','Visit1_x','Visit1_y','Visit1dif','Visit 2_x','Visit 2_y','Visit2dif','Visit 3_x','Visit 3_y','Visit3dif','Visit 4_x','Visit 4_y','Visit4dif']]
I think you have two sheets, load1 and load2, and want to add a third one as displayed in the picture.
In order to add a new sheet, pandas.ExcelWriter should be opened in append mode, and the openpyxl library is needed along with pandas.DataFrame.to_excel and openpyxl's conditional_formatting functions:
import pandas as pd
import openpyxl
import xlrd
from openpyxl.styles import Alignment, Font, NamedStyle, PatternFill, colors
from openpyxl.styles.differential import DifferentialStyle
from openpyxl.formatting.rule import Rule, ColorScaleRule, CellIsRule, FormulaRule
file = r'C:\\app\\excel\\pandas.xlsx'
xls = xlrd.open_workbook(file)
sht = xls.sheet_names()
colour1= pd.read_excel(file,sheet_name=sht[0])
colour2= pd.read_excel(file,sheet_name=sht[1])
colour_merge = colour1.merge(colour2,left_on='StudentID',right_on='StudentID',how='outer')
colour_merge = colour_merge[['StudentID','Visit 1_x','Visit 1_y','Visit 2_x','Visit 2_y','Visit 3_x','Visit 3_y','Visit 4_x','Visit 4_y']]
l_sid = []
mxc1 = max(colour1['StudentID'])+1
mxc2 = max(colour2['StudentID'])+1
for j in range(min(mxc1,mxc2),max(mxc1,mxc2)):
    l_sid.append(j+1)
writer_args = { 'path': file, 'mode': 'a', 'engine': 'openpyxl'}
with pd.ExcelWriter(**writer_args) as xlsx:
    colour_merge.to_excel(xlsx, 'load3', index=False)
    ws = xlsx.sheets['load3']
    mxc= ws.max_column
    title_row = '1'
    yellow = PatternFill(bgColor=colors.YELLOW)
    red = PatternFill(bgColor=colors.RED)
    i=1
    while i <= mxc*2:
        l_col = []
        l_col.append(chr(i+65))
        l_col.append(chr(i+65+1))
        for j in range(2,mxc+1):
            for k in l_col:
                if j not in l_sid:
                    ws.conditional_formatting.add(k+str(j), FormulaRule(formula=[l_col[0]+'$'+str(j)+'<>'+l_col[1]+'$'+str(j)], stopIfTrue=True, fill=yellow))
        i+=2
    r = Rule(type="expression", dxf=DifferentialStyle(fill=red), stopIfTrue=False)
    r.formula = ['1=1']
    while l_sid:
        el=l_sid.pop(-1)
        ws.conditional_formatting.add('A'+str(el)+':'+chr(64+mxc)+str(el), r)
    xlsx.save()
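
A different route, sketched here under assumptions rather than taken from the answer above, is to compute the colours in pandas and let the Styler write them out. The helper `build_styles` is a hypothetical name, and the two small frames only mimic the shape of load1 and load2 (same column names in both sheets, StudentID as the key).

```python
import pandas as pd

# Small stand-ins for the load1 and load2 sheets
load1 = pd.DataFrame({"StudentID": [1, 2, 3],
                      "Visit1": ["16-04-20", "17-04-20", "18-04-20"],
                      "Visit 2": ["23-04-20", "24-04-20", "25-04-20"]})
load2 = pd.DataFrame({"StudentID": [1, 2, 3, 9],
                      "Visit1": ["16-04-20", "17-04-20", "18-04-20", "15-05-20"],
                      "Visit 2": ["23-04-20", "24-04-20", "25-09-20", "16-05-20"]})

old = load1.set_index("StudentID")
new = load2.set_index("StudentID")


def build_styles(new, old):
    """Return a frame of CSS strings with the same shape and labels as `new`."""
    # Align the old data to the new labels; rows missing from old become NaN
    aligned = old.reindex(index=new.index, columns=new.columns)
    is_new_row = ~new.index.isin(old.index)
    changed = new.ne(aligned)

    styles = pd.DataFrame("", index=new.index, columns=new.columns)
    styles[changed] = "background-color: yellow"         # changed cells
    styles.loc[is_new_row, :] = "background-color: red"  # whole new rows
    return styles


styles = build_styles(new, old)
print(styles)
```

The result can then be handed to the Styler, for example `new.style.apply(lambda _: styles, axis=None).to_excel('diff.xlsx', engine='openpyxl')`, which needs jinja2 and openpyxl installed; the output file name is only an example.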

Shift down one row then rename the column

My data is looking like this:
pd.read_csv('/Users/admin/desktop/007538839.csv').head()
105586.18
0 105582.910
1 105585.230
2 105576.445
3 105580.016
4 105580.266
I want to move that 105586.18 to the 0 index, because right now it is the column name. After that I want to name this column 'flux'. I've tried
pd.read_csv('/Users/admin/desktop/007538839.csv', sep='\t', names = ["flux"])
but it did not work, probably because the dataframe is not in the right format.
How can I achieve that?
Your code works fine for me:
import pandas as pd
from io import StringIO
temp=u"""105586.18
105582.910
105585.230
105576.445
105580.016
105580.266"""
#after testing, replace 'StringIO(temp)' with '/Users/admin/desktop/007538839.csv'
df = pd.read_csv(StringIO(temp), sep='\t', names = ["flux"])
print (df)
flux
0 105586.180
1 105582.910
2 105585.230
3 105576.445
4 105580.016
5 105580.266
To overwrite the original file with the same data and the new header flux:
df.to_csv('/Users/admin/desktop/007538839.csv', index=False)
Try this:
df=pd.read_csv('/Users/admin/desktop/007538839.csv',header=None)
df.columns=['flux']
header=None is your friend here: it tells pandas that the first line is data, not a header.
