I want to save a pandas pivot table, properly and nicely formatted, into an Excel workbook.
I have a pandas pivot table, built like this:
table = pd.pivot_table(d2, values=['Wert'], index=['area', 'Name'], columns=['Monat'],
aggfunc={'Wert': np.sum}, margins=True).fillna('')
From my original dataframe:
df
Area Name2 Monat Wert
A A 1 2
A A 1 3
A B 1 2
A A 2 1
so the pivot table looks like this:
Wert
Monat 1 2 All
Area Name
A A 5 1 6
B 2 2
All 7 1 8
Then I want to save this in an excel workbook with the following code:
import pandas as pd
import xlsxwriter
workbook = xlsxwriter.Workbook('myexcel.xlsx')
worksheet1 = workbook.add_worksheet('table1')
caption = 'Table1'
worksheet1.set_column(1, 14, 25) #irrelevant, just a random size right now
worksheet1.write('B1', caption)
worksheet1.add_table('B3:K100', {'data': table.values.tolist()}) #also wrong size from B3 to K100
workbook.close()
But this looks like this (with different values), so the headers are missing:
How can I adjust it and save a pivot table in excel?
If I am using the pandas command .to_excel it looks like this:
That is fine, but the column widths do not fit the names, the background color does not look nice, and I am also missing a caption.
I found the solution by combining this with the approach from another topic:
flattened = pd.DataFrame(table.to_records())
flattened.columns = [column.replace("('Wert', ", "Monat: ").replace(")", "") for column in flattened.columns] ##only for renaming the column headers
And then:
workbook = xlsxwriter.Workbook(excelfilename, options={'nan_inf_to_errors': True})
worksheet = workbook.add_worksheet('Table1')
worksheet.set_column(0, flattened.shape[1], 25)
worksheet.add_table(0, 0, flattened.shape[0], flattened.shape[1]-1,
{'data': flattened.values.tolist(),
'columns': [{'header': c} for c in flattened.columns.tolist()],
'style': 'Table Style Medium 9'})
workbook.close()
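For anyone who wants the caption and fitted column widths from the question as well, here is a minimal sketch of the same flatten-then-add_table idea. The column names "Area" and "Name", the helper names column_widths and write_table, and the width rule (longest text plus two characters) are my own assumptions, not part of the original code.

```python
import pandas as pd

# Sample data shaped like the question's dataframe.
# Assumption: the grouping columns are called "Area" and "Name".
df = pd.DataFrame({
    "Area": ["A", "A", "A", "A"],
    "Name": ["A", "A", "B", "A"],
    "Monat": [1, 1, 1, 2],
    "Wert": [2, 3, 2, 1],
})

table = pd.pivot_table(df, values="Wert", index=["Area", "Name"],
                       columns="Monat", aggfunc="sum", margins=True)

# Turn the row index into ordinary columns and give every column a string header.
index_cols = ["Area", "Name"]
flattened = table.reset_index()
flattened.columns = [c if c in index_cols else f"Monat: {c}" for c in flattened.columns]


def column_widths(frame, padding=2):
    """Hypothetical helper: longest header or cell text per column, plus padding."""
    return [
        max(len(str(col)), int(frame[col].astype(str).str.len().max())) + padding
        for col in frame.columns
    ]


def write_table(frame, path, caption="Table1"):
    """Hypothetical helper: write `frame` as an Excel table below a caption."""
    import xlsxwriter  # imported here so the pandas part runs without it

    # nan_inf_to_errors lets xlsxwriter accept the NaN cells left by the pivot.
    workbook = xlsxwriter.Workbook(path, {"nan_inf_to_errors": True})
    worksheet = workbook.add_worksheet("table1")
    worksheet.write(0, 0, caption)
    for col_idx, width in enumerate(column_widths(frame)):
        worksheet.set_column(col_idx, col_idx, width)
    first_row = 2  # leave one blank row between the caption and the table
    worksheet.add_table(
        first_row, 0, first_row + len(frame), len(frame.columns) - 1,
        {
            "data": frame.values.tolist(),
            "columns": [{"header": c} for c in frame.columns],
            "style": "Table Style Medium 9",
        },
    )
    workbook.close()
```

Calling write_table(flattened, 'myexcel.xlsx') should then produce a sheet with the caption in A1 and the styled table starting on the third row.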
I am very new to pandas. It might be a silly question to some of you.
I am looking to compare 2 excel files and output the changes or the new entries
old.csv
Product Price Description
1 1.25 Product 1
2 2.25 Product 2
3 3.25 Product 3
new.csv
Product Price Description
1 1.25 Product 1 # Product 2 not in list
3 3.50 Product 3 # Price update
4 4.25 Product 4 # New entry
TRIED
import pandas as pd
import numpy as np
import requests
url = '<SomeUrl>/<PriceList>.xls'
resp = requests.get(url)
df = pd.DataFrame(pd.read_excel(resp.content))
df.to_csv('new.csv')
old = pd.read_csv('old.csv')
new = pd.read_csv('new.csv')
changes = new.loc[new['Price'] != old['Price']]
changes_csv = changes[['Product', 'Price', 'Description']]
print(changes_csv)
EXPECTING
3 3.50 Product 3 # Price update
4 4.25 Product 4 # New entry
I get the correct results if the length matches exactly. Otherwise I get
pandas valueerror can only compare identically labeled objects
BONUS
It would be awesome if I could also produce output for discontinued products.
You could create a master index of all products, create 2 old/new dataframes using all the master index, then use df.compare() to compare the two databases:
import pandas as pd
df1 = pd.DataFrame([[1,1.25,'Product 1'],[2,2.25,'Product 2'],[3,3.25,'Product 3']], columns=['Product','Price','Description'])
df2 = pd.DataFrame([[1,1.25,'Product 1'],[3,3.5,'Product 2'],[4,4.25,'Product 3 Change']], columns=['Product','Price','Description'])
df1product = df1[['Product']]
df2product = df2[['Product']]
dfproducts = df1product.merge(df2product, on='Product', how='outer')
df1 = dfproducts.merge(df1, how='left', on='Product')
df1.set_index(df1['Product'], inplace=True)
df2 = dfproducts.merge(df2, how='left', on='Product')
df2.set_index(df2['Product'], inplace=True)
dfcompare = df1.compare(df2, align_axis=0)
I have solved the problem. Even though @WCeconomics kindly took the time to type the code out, it did not get me the output I wanted, likely because I am a noob with pandas.
This is how I solved it, in case it is useful to the community.
import pandas as pd
import openpyxl # to write excel files
from openpyxl.utils.dataframe import dataframe_to_rows
old = pd.read_excel('old.xls')
new = pd.read_excel('new.xls')
# data for these is in the same format as in question, with 'Product Description' instead of 'Description'
merged = old.merge(new, on='Product', how='outer', suffixes=('_old', '_new'))
df = merged[['Product', 'Product Description_old', 'Price_old', 'Price_new']]
changes = df.loc[(df['Price_new'] > df['Price_old'])].dropna(how='any', axis=0)
wb = openpyxl.Workbook()
ws = wb.active
for r in dataframe_to_rows(changes, index=False, header=True):
    ws.append(r)
# openpyxl writes the .xlsx format, so the file is saved with that extension
wb.save('avp_changes.xlsx')
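For the bonus part (discontinued products), one option is the indicator flag of merge, which records whether each row came from the old file, the new file, or both. This is only a sketch using the sample data from the question, with the column called 'Description' as in the question; the variable names discontinued, added and price_changes are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question's old.csv and new.csv.
old = pd.DataFrame({
    "Product": [1, 2, 3],
    "Price": [1.25, 2.25, 3.25],
    "Description": ["Product 1", "Product 2", "Product 3"],
})
new = pd.DataFrame({
    "Product": [1, 3, 4],
    "Price": [1.25, 3.50, 4.25],
    "Description": ["Product 1", "Product 3", "Product 4"],
})

# indicator=True adds a "_merge" column with left_only / right_only / both.
merged = old.merge(new, on="Product", how="outer",
                   suffixes=("_old", "_new"), indicator=True)

# Only in the old list: discontinued. Only in the new list: new entry.
discontinued = merged.loc[merged["_merge"] == "left_only", "Product"].tolist()
added = merged.loc[merged["_merge"] == "right_only", "Product"].tolist()

# Present in both lists but with a different price.
both = merged[merged["_merge"] == "both"]
price_changes = both.loc[both["Price_old"] != both["Price_new"], "Product"].tolist()

print("discontinued:", discontinued)
print("new entries:", added)
print("price changes:", price_changes)
```

Because the comparison happens after the merge, both price columns sit in the same frame and the "identically labeled objects" error from the question cannot occur.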
First, I'm not sure whether pandas is the right approach here; it may be better done with VBA or another library such as openpyxl.
I have an Excel workbook with two tabs. tab1 has a name and a value, where the value is a formula such as ='tab2'!H10. tab2 holds the referenced value (or the values being summed) along with a bunch of other information.
I want to read the value column on tab1, whose formula may reference more than one cell on the second tab, for example ='tab2'!H10 + 'tab2'!H12 + 'tab2'!H20 on the row for Name1. From that I want to extract the referenced rows (rows 10, 12 and 20 in this example) and fetch three columns from tab2 for those rows.
Then I want to "join" (not sure if it is really a join) the name on tab1 with those three columns from tab2 for those rows. The end result would look something like this:
| Name 1 (from tab 1 - line) | Column 1 (from tab2) | Column 2 | Column 3 | from row 10
| Name 1 (from tab 1 - line) | Column 1 (from tab2) | Column 2 | Column 3 | from row 12
| Name 1 (from tab 1 - line) | Column 1 (from tab2) | Column 2 | Column 3 | from row 20
The code I'm trying is below. It currently fails with this error:
ValueError: cannot join with no overlapping index names
import numpy as np
import pandas as pd
from IPython.display import display
from openpyxl import Workbook
from openpyxl import load_workbook
wbx = load_workbook(filename= 'test.xlsx')
sheet_names = wbx.sheetnames
name1 = sheet_names[0]
sheet_ranges1 = wbx[name1]
df1 = pd.DataFrame(sheet_ranges1.values)
name2 = sheet_names[1]
sheet_ranges2 = wbx[name2]
df2 = pd.DataFrame(sheet_ranges2.values)
pd.set_option("display.max_rows", None, "display.max_columns", None)
c1 = df1.iloc[:,[1]]
c2 = df1.iloc[:,24]
print(c1.dtypes)
res = c2.str.extractall(r"!H(?P<line>\d+)?")
res2 = c1.merge(pd.DataFrame(res), how='left', left_index=True, right_index=True)
hope it helps:
import pandas as pd
df1 = pd.read_excel(r'.\foldername\filename.xlsx', sheet_name='sheet1')
df2 = pd.read_excel(r'.\foldername\filename.xlsx', sheet_name='sheet2')
df3 = pd.read_excel(r'.\foldername\filename.xlsx', sheet_name='sheet3')
# drop columns as needed that are not to include in merged result, or to avoid duplicate column that will be col_x and col_y
# drop() takes no index=False flag; passing only columns= is enough
df1 = df1.drop(columns=['col2', 'col3'])
# join table
dfx = df1.merge(df2, how="inner", left_on="col1", right_on="col2")
merged = dfx.merge(df3, how="left", left_on="col7", right_on="col3")
print(merged.head())
You can also do this in VBA:
Sub JoinTables()
    Dim connection As ADODB.Connection
    Set connection = New ADODB.Connection
    With connection
        .Provider = "Microsoft.Jet.OLEDB.4.0"
        .ConnectionString = "Data Source=" & ThisWorkbook.FullName & ";" & "Extended Properties=Excel 8.0;"
        .Open
    End With
    Dim recordset As ADODB.Recordset
    Set recordset = New ADODB.Recordset
    recordset.Open "SELECT * FROM [Sheet1$] INNER JOIN [Sheet2$] ON [Sheet1$].[type] = " & "[Sheet2$].[type]", connection
    With Worksheets("Sheet3")
        .Cells(2, 1).CopyFromRecordset recordset
    End With
    recordset.Close
    connection.Close
End Sub
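Coming back to the formula part of the question: the "no overlapping index names" error most likely comes from extractall, which returns a two-level index (original row, match number) that cannot be joined directly against the single-level index of the names. Below is a rough sketch of one way around that. The frames tab1 and tab2, their column names, and the cell values are invented sample data, and it assumes the formulas were read as text (for example with openpyxl, which keeps formulas by default).

```python
import pandas as pd

# Hypothetical stand-in for tab1: a name plus the formula text of the value cell.
tab1 = pd.DataFrame({
    "name": ["Name1", "Name2"],
    "formula": ["='tab2'!H10 + 'tab2'!H12 + 'tab2'!H20", "='tab2'!H11"],
})

# Hypothetical stand-in for tab2: "row" holds the Excel row number of each record.
tab2 = pd.DataFrame({
    "row": [10, 11, 12, 20],
    "col1": ["c1-r10", "c1-r11", "c1-r12", "c1-r20"],
    "col2": ["c2-r10", "c2-r11", "c2-r12", "c2-r20"],
    "col3": ["c3-r10", "c3-r11", "c3-r12", "c3-r20"],
})

# One output row per reference found in each formula.
refs = tab1["formula"].str.extractall(r"!H(?P<row>\d+)")
refs["row"] = refs["row"].astype(int)

# Drop the "match" level so the index lines up with tab1 again.
refs = refs.reset_index(level="match", drop=True)

# Attach the name to every referenced row, then pull in the tab2 columns.
result = tab1[["name"]].join(refs).merge(tab2, on="row", how="left")
print(result)
```

The result has one line per referenced row, with the name repeated, which matches the layout sketched in the question.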
I'm trying to delete a row when a cell is empty from the 'calories.xlsx' spreadsheet and send all data, except empty rows, to the 'destination.xlsx' spreadsheet. The code below is how far I got. But still, it does not delete rows that have an empty value based on the calories column.
This is the data set:
Data Set
How can I develop my code to solve this problem?
import pandas as pd
FileName = 'calories.xlsx'
SheetName = pd.read_excel(FileName, sheet_name = 'Sheet1')
df = SheetName
print(df)
ListCalories = ['Calories']
print(ListCalories)
for Cell in ListCalories:
    if Cell == 'NaN':
        df.drop[Cell]
print(df)
df.to_excel('destination.xlsx')
Create dummy data
import numpy as np
import pandas as pd
df=pd.DataFrame({
    'calories':[2306,3256,1235,np.nan,3654,3256],
    'Person':['person1','person2','person3','person4','person5','person6',]
})
Print data frame
calories Person
0 2306.0 person1
1 3256.0 person2
2 1235.0 person3
3 NaN person4
4 3654.0 person5
5 3256.0 person6
remove row, if calories value is missing
new_df=df.dropna(how='any',subset=['calories'])
Result
calories Person
0 2306.0 person1
1 3256.0 person2
2 1235.0 person3
4 3654.0 person5
5 3256.0 person6
save as excel
new_df.to_excel('destination.xlsx',index=False)
Your ListCalories contains only one element, the string 'Calories'. I'll assume this was a typo.
What you are probably trying to do is:
import pandas as pd
FileName = 'calories.xlsx'
df = pd.read_excel(FileName, sheet_name = 'Sheet1')
print(df)
# you don't need this, but I kept it for you
ListCalories = df['Calories']
print(ListCalories)
clean_df = df[df['Calories'].notna()] # this selects only the rows that don't have an NA value in the Calories column
print(clean_df)
clean_df.to_excel('destination.xlsx')
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notna.html
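One caveat worth checking: dropna and notna only catch real missing values. If some "empty" cells actually contain an empty string, they are kept. Whether that applies depends on how calories.xlsx was produced, so treat the following as a sketch on made-up data rather than a description of your file.

```python
import pandas as pd

# Made-up data: one cell holds an empty string, another is truly missing.
df = pd.DataFrame({
    "Calories": [2306, "", 1235, None, 3654],
    "Person": ["person1", "person2", "person3", "person4", "person5"],
})

# errors="coerce" turns anything that is not a number into NaN ...
numeric = pd.to_numeric(df["Calories"], errors="coerce")

# ... so a single dropna now removes both kinds of empty cell.
clean_df = df.assign(Calories=numeric).dropna(subset=["Calories"])
print(clean_df)
```

After that, clean_df.to_excel('destination.xlsx', index=False) writes the remaining rows as in the answers above.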
My requirement is to highlight the cells that differ in yellow and to highlight newly added rows in red in the Excel output.
load1 dataset
StudentID Visit1 Visit 2 Visit 3 Visit 4
1 16-04-20 23-04-20 30-04-20 07-05-20
2 17-04-20 24-04-20 01-05-20 08-05-20
3 18-04-20 25-04-20 02-05-20 09-05-20
4 19-04-20 26-04-20 03-05-20 10-05-20
5 20-04-20 27-04-20 04-05-20 11-05-20
6 21-04-20 28-04-20 05-05-20 12-05-20
7 22-04-20 29-04-20 06-05-20 13-05-20
8 23-04-20 30-04-20 07-05-20 14-05-20
load2 table
StudentID Visit1 Visit 2 Visit 3 Visit 4
1 16-04-20 23-04-20 30-04-20 07-05-20
2 17-04-20 24-04-20 01-05-20 08-05-20
3 18-04-20 25-09-20 02-05-20 09-05-20
4 19-04-20 26-04-20 03-06-20 10-05-20
5 20-04-20 27-04-20 04-05-20 11-05-20
6 21-04-20 28-04-20 05-05-20 12-06-20
7 22-04-20 29-08-20 06-05-20 13-05-20
8 23-04-20 30-04-20 07-05-20 14-05-20
9 15-05-20 16-05-20 17-05-20 18-05-20
Output: I am looking for the output in Excel in the format below, with the colours highlighted.
I am new to pandas. I was able to get the differences at the dataframe level, but not in the Excel output, where each cell and row needs to be formatted with colours. Please help.
Below is the script I tried to get the differences.
import pandas as pd
import os
import numpy as np
colour1= pd.read_excel('pandas.xlsx',sheet_name='load1')
colour2= pd.read_excel('pandas.xlsx',sheet_name='load2')
colour_merge=colour1.merge(colour2,left_on='StudentID', right_on='StudentID',how='outer')
colour_merge['Visit1dif']= np.where(colour_merge['Visit1_x']==colour_merge['Visit1_y'],0,1)
colour_merge['Visit2dif']= np.where(colour_merge['Visit 2_x']==colour_merge['Visit 2_y'],0,1)
colour_merge['Visit3dif']= np.where(colour_merge['Visit 3_x']==colour_merge['Visit 3_y'],0,1)
colour_merge['Visit4dif']= np.where(colour_merge['Visit 4_x']==colour_merge['Visit 4_y'],0,1)
colour_merge[['StudentID','Visit1_x','Visit1_y','Visit1dif','Visit 2_x','Visit 2_y','Visit2dif','Visit 3_x','Visit 3_y','Visit3dif','Visit 4_x','Visit 4_y','Visit4dif']]
I think you have two sheets, load1 and load2, and want to add a third one like the one displayed in the picture.
To add a new sheet, pandas.ExcelWriter should be opened in append mode, and the openpyxl library is needed along with pandas.DataFrame.to_excel and the conditional_formatting functions:
import pandas as pd
import openpyxl
import xlrd
from openpyxl.styles import Alignment, Font, NamedStyle, PatternFill, colors
from openpyxl.styles.differential import DifferentialStyle
from openpyxl.formatting.rule import Rule, ColorScaleRule, CellIsRule, FormulaRule
file = r'C:\\app\\excel\\pandas.xlsx'
# xlrd >= 2.0 no longer reads .xlsx files, so the sheet names are taken from pandas instead
with pd.ExcelFile(file) as xls:
    sht = xls.sheet_names
colour1= pd.read_excel(file,sheet_name=sht[0])
colour2= pd.read_excel(file,sheet_name=sht[1])
colour_merge = colour1.merge(colour2,left_on='StudentID',right_on='StudentID',how='outer')
colour_merge = colour_merge[['StudentID','Visit 1_x','Visit 1_y','Visit 2_x','Visit 2_y','Visit 3_x','Visit 3_y','Visit 4_x','Visit 4_y']]
l_sid = []
mxc1 = max(colour1['StudentID'])+1
mxc2 = max(colour2['StudentID'])+1
for j in range(min(mxc1,mxc2),max(mxc1,mxc2)):
    l_sid.append(j+1)
writer_args = { 'path': file, 'mode': 'a', 'engine': 'openpyxl'}
with pd.ExcelWriter(**writer_args) as xlsx:
    colour_merge.to_excel(xlsx, sheet_name='load3', index=False)
    ws = xlsx.sheets['load3']
    mxc= ws.max_column
    title_row = '1'
    yellow = PatternFill(bgColor=colors.YELLOW)
    red = PatternFill(bgColor=colors.RED)
    i=1
    # walk over the column pairs (old value, new value) and flag differences in yellow
    while i <= mxc*2:
        l_col = []
        l_col.append(chr(i+65))
        l_col.append(chr(i+65+1))
        for j in range(2,mxc+1):
            for k in l_col:
                if j not in l_sid:
                    ws.conditional_formatting.add(k+str(j), FormulaRule(formula=[l_col[0]+'$'+str(j)+'<>'+l_col[1]+'$'+str(j)], stopIfTrue=True, fill=yellow))
        i+=2
    # rows that exist in only one sheet are filled red
    r = Rule(type="expression", dxf=DifferentialStyle(fill=red), stopIfTrue=False)
    r.formula = ['1=1']
    while l_sid:
        el=l_sid.pop(-1)
        ws.conditional_formatting.add('A'+str(el)+':'+chr(64+mxc)+str(el), r)
    # no explicit save is needed: the with block writes the file on exit
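If conditional formatting is more than you need, a simpler variant is to work out in pandas which cells changed and which rows are new, and then colour the cells directly. The sketch below uses a cut-down copy of the sample data (three students, two visits); the helper write_highlighted, the sheet name 'load3' and the colour codes are my own choices, and it assumes StudentID is unique in both sheets.

```python
import pandas as pd

# Cut-down sample data in the spirit of load1 and load2.
load1 = pd.DataFrame({
    "StudentID": [1, 2, 3],
    "Visit 1": ["16-04-20", "17-04-20", "18-04-20"],
    "Visit 2": ["23-04-20", "24-04-20", "25-04-20"],
})
load2 = pd.DataFrame({
    "StudentID": [1, 2, 3, 4],
    "Visit 1": ["16-04-20", "17-04-20", "18-04-20", "19-04-20"],
    "Visit 2": ["23-04-20", "24-04-20", "25-09-20", "26-04-20"],
})

old = load1.set_index("StudentID")
new = load2.set_index("StudentID")

# Rows whose StudentID does not exist in the old sheet are new rows.
new_rows = ~new.index.isin(old.index)

# Compare cell by cell against the old values, aligned on StudentID.
changed = new.ne(old.reindex(new.index))
changed.loc[new_rows, :] = False  # new rows are handled separately (red)


def write_highlighted(path):
    """Hypothetical helper: write `new` with yellow changed cells and red new rows."""
    from openpyxl.styles import PatternFill  # needs openpyxl installed

    yellow = PatternFill(fill_type="solid", start_color="FFFF00")
    red = PatternFill(fill_type="solid", start_color="FF0000")
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        new.to_excel(writer, sheet_name="load3")
        ws = writer.sheets["load3"]
        # Sheet row 1 is the header; sheet column 1 is the StudentID index.
        for r, is_new in enumerate(new_rows, start=2):
            for c in range(1, new.shape[1] + 2):
                cell = ws.cell(row=r, column=c)
                if is_new:
                    cell.fill = red
                elif c > 1 and changed.iat[r - 2, c - 2]:
                    cell.fill = yellow
```

The fills here are static, so they do not update if someone edits the sheet later; the conditional formatting approach above does.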
My data is looking like this:
pd.read_csv('/Users/admin/desktop/007538839.csv').head()
105586.18
0 105582.910
1 105585.230
2 105576.445
3 105580.016
4 105580.266
I want to move that 105586.18 to index 0, because right now it is the column name. After that I want to name this column 'flux'. I've tried
pd.read_csv('/Users/admin/desktop/007538839.csv', sep='\t', names = ["flux"])
but it did not work, probably because the dataframe is not in the right format.
How can I achieve that?
Your code works fine for me:
import pandas as pd
from io import StringIO
temp=u"""105586.18
105582.910
105585.230
105576.445
105580.016
105580.266"""
#after testing, replace 'StringIO(temp)' with '/Users/admin/desktop/007538839.csv'
#(pd.compat.StringIO was removed from recent pandas versions; io.StringIO is the standard-library equivalent)
df = pd.read_csv(StringIO(temp), sep='\t', names = ["flux"])
print (df)
flux
0 105586.180
1 105582.910
2 105585.230
3 105576.445
4 105580.016
5 105580.266
To overwrite the original file with the same data and the new header flux:
df.to_csv('/Users/admin/desktop/007538839.csv', index=False)
Try this:
df=pd.read_csv('/Users/admin/desktop/007538839.csv',header=None)
df.columns=['flux']
header=None is your friend here.
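To see why the first value ends up as the column name, here is a small self-contained comparison on in-memory data (the three numbers are taken from the question; StringIO just stands in for the file path).

```python
from io import StringIO

import pandas as pd

raw = "105586.18\n105582.910\n105585.230\n"

# Default behaviour: the first line is consumed as the header.
default = pd.read_csv(StringIO(raw))

# header=None keeps the first line as data; names supplies the column label.
fixed = pd.read_csv(StringIO(raw), header=None, names=["flux"])

print(default.columns.tolist(), len(default))
print(fixed.columns.tolist(), len(fixed))
```

With the default call the frame has two rows and a column literally named 105586.18, while the second call keeps all three values under 'flux'.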