Why duplicates aren't being removed in Pandas - python

Very new to Python and Pandas...but the issue is that my final output file isn't excluding any duplicates on the 'Customer Number'. Any suggestions on why this would be happening would be appreciated!
import pandas as pd
import numpy as np #numpy is used below to replace missing values (NaN) in the combined data
from openpyxl import load_workbook
from openpyxl.styles import Font
df_1 = pd.read_excel('PRT Tracings 2020.xlsx', sheet_name='Export') #this reads the 2020 Excel workbook and looks at the 'Export' sheet
df_2 = pd.read_excel('PRT Tracings 2021.xlsx', sheet_name='Export') #this reads the same Excel document but looks at a different sheet
df_3 = pd.read_excel('PRT Tracings YTD 2022.xlsx', sheet_name='Export') #this reads a different Excel file, and only has one sheet so no need to have it read a sheet
df_all = pd.concat([df_1, df_2, df_3], sort=False) #this combines the sheets from 1,2,3 and the sort function as false so our columns stay in the same order
to_excel = df_all.to_excel('Combined_PRT_Tracings.xlsx', index=None) #this Excel file combines all three sheets into one spreadsheet
df_all = df_all.replace(np.nan, 'N/A', regex=True) #replaces errors with N/A
remove = ['ORDERNUMBER', 'ORDER_TYPE', 'ORDERDATE', 'Major Code Description', 'Product_Number_And_Desc', 'Qty', 'Order_$', 'Order_List_$'] #this will remove all unwanted columns
df_all.drop(columns=remove, inplace=True)
df_all.drop_duplicates(subset=['Customer Number'], keep=False) #this is meant to remove all rows with duplicate 'Customer Number' values
to_excel = df_all.to_excel('Combined_PRT_Tracings.xlsx', index=None) #this Excel file combines all three sheets into one spreadsheet
wb = load_workbook('Combined_PRT_Tracings.xlsx') #we are using this to have openpyxl read the data, from the spreadsheet already created
ws = wb.active #this workbook is active
wb.save('Combined_PRT_Tracings.xlsx')

df_all.drop_duplicates returns a new DataFrame by default instead of modifying df_all in place (this default protects the original data from unintended changes). Assign the return value back to a variable, or pass inplace=True, so the deduplicated result is actually kept.
Try:
df_all = df_all.drop_duplicates(subset='Customer Number', keep=False)
Or the equivalent:
df_all.drop_duplicates(subset='Customer Number', keep=False, inplace=True)
That removes every row whose 'Customer Number' appears more than once, because keep=False discards all duplicates. If you want to keep the first or last occurrence of each duplicate instead, change keep to 'first' or 'last'.
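For example, a minimal sketch of how the keep argument changes the result (the 'Region' column is hypothetical, purely for illustration):
import pandas as pd
df = pd.DataFrame({'Customer Number': [1, 1, 2], 'Region': ['East', 'West', 'North']})
print(df.drop_duplicates(subset='Customer Number', keep=False))    # only customer 2 survives
print(df.drop_duplicates(subset='Customer Number', keep='first'))  # first row of customer 1, plus customer 2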

Related

How to use pandas and numpy to compare two excel workbooks with multiple tabs with difference in the rows and columns in a single tab?

I have two xlsx files that have multiple tabs. I need to compare values in each tab based on the tab name, but in some tabs there is a difference in the number of rows and columns (e.g. sheet1 in file1 needs to be compared with sheet1 in file2, and so on). When I use the following code, it only compares and writes output for sheets with the same number of rows and columns. Please help me figure out why all tabs do not get compared.
import pandas as pd
import numpy as np
df1 = pd.read_excel('test_1.xlsx', sheet_name=None)
df2 = pd.read_excel('test_2.xlsx', sheet_name=None)
with pd.ExcelWriter('./Excel_diff.xlsx') as writer:
    for sheet, df1 in df1.items():
        # check if sheet is in the other Excel file
        if sheet in df2:
            df2sheet = df2[sheet]
            comparison_values = df1.values == df2sheet.values
            print(comparison_values)
            rows, cols = np.where(comparison_values == False)
            for item in zip(rows, cols):
                df1.iloc[item[0], item[1]] = '{} → {}'.format(df1.iloc[item[0], item[1]], df2sheet.iloc[item[0], item[1]])
            df1.to_excel(writer, sheet_name=sheet, index=False, header=True)
This endeavor is non-trivial irrespective of tool or library.
However, once you get it going or need to validate your progress, Microsoft has already built a very useful and intuitive tool that compares workbooks quite nicely. Consider using it to accompany your dev efforts: Excel Compare
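As a rough sketch of how the shape mismatch itself could be handled (an assumption about your intent, not a drop-in fix): reindex both sheets to the union of their row and column labels before comparing, so the element-wise comparison no longer fails when one tab has extra rows or columns.
import pandas as pd
sheets1 = pd.read_excel('test_1.xlsx', sheet_name=None)
sheets2 = pd.read_excel('test_2.xlsx', sheet_name=None)
with pd.ExcelWriter('Excel_diff.xlsx') as writer:
    for name, s1 in sheets1.items():
        if name not in sheets2:
            continue
        s2 = sheets2[name]
        # pad both frames to a common shape so element-wise comparison works
        rows = s1.index.union(s2.index)
        cols = s1.columns.union(s2.columns)
        a = s1.reindex(index=rows, columns=cols)
        b = s2.reindex(index=rows, columns=cols)
        # keep matching cells as-is; mark differing cells as "old → new"
        diff = a.where(a.eq(b) | (a.isna() & b.isna()),
                       a.astype(str) + ' → ' + b.astype(str))
        diff.to_excel(writer, sheet_name=name, index=False)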

Add header with merged cells from one excel and insert to another excel Pandas

I have been searching for how to append/insert/concat a row from one excel file to another, but with merged cells. I was not able to find what I am looking for.
What I need to get is this:
and append to the very first row of this:
I tried using pandas append() but it destroyed the arrangement of columns.
df = pd.DataFrame()
for f in ['merge1.xlsx', 'test1.xlsx']:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)
df.to_excel('test3.xlsx')
Is there a way pandas could do it? I just need to literally insert the header into the top row.
Although I am still trying to find a way, it would actually be fine to me if this question had a duplicate as long as I can find answers or advice.
You can use pd.read_excel to read in the workbook with the data you want, in your case that is 'test1.xlsx'. You could then utilize openpyxl.load_workbook() to open an existing workbook with the header, in your case that is 'merge1.xlsx'. Finally you could save the new workbook under a new name ('test3.xlsx') without changing the two existing workbooks.
Below I've provided a fully reproducible example of how you can do this. To make this example fully reproducible, I create 'merge1.xlsx' and 'test1.xlsx'.
Please note that if your 'merge1.xlsx' only has the header that you want and nothing else in the file, you can make use of the two lines I've left commented out below. This would just append your data from 'test1.xlsx' to the header in 'merge1.xlsx'. If this is the case then you can get rid of the two for loops at the end. Otherwise, as in my example, it's a bit more complicated.
In creating 'test3.xlsx', we loop through each row and we determine how many columns there are using len(df3.columns). In my example this is equal to two but this code would also work for a greater number of columns.
import pandas as pd
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows
df1 = pd.DataFrame()
writer = pd.ExcelWriter('merge1.xlsx', engine='xlsxwriter') #merge_range and write below are xlsxwriter methods
df1.to_excel(writer, sheet_name='Sheet1')
ws = writer.sheets['Sheet1']
ws.merge_range('A1:C1', 'This is a merged cell')
ws.write('A3', 'some string I might not want in other workbooks')
writer.save()
df2 = pd.DataFrame({'col_1': [1,2,3,4,5,6], 'col_2': ['A','B','C','D','E','F']})
writer = pd.ExcelWriter('test1.xlsx')
df2.to_excel(writer, sheet_name='Sheet1')
writer.save()
df3 = pd.read_excel('test1.xlsx')
wb = load_workbook('merge1.xlsx')
ws = wb['Sheet1']
#for row in dataframe_to_rows(df3):
#    ws.append(row)
column = 2
for item in list(df3.columns.values):
    ws.cell(row=2, column=column).value = str(item)
    column = column + 1
for row_index, row in df3.iterrows():
    ws.cell(row=row_index+3, column=1).value = row_index #comment out to remove index
    for i in range(0, len(df3.columns)):
        ws.cell(row=row_index+3, column=i+2).value = row[i]
wb.save("test3.xlsx")
Expected Output of the 3 Workbooks:

Python Pandas - How to write in a specific column in an Excel Sheet

I am having trouble updating an Excel sheet using pandas by writing new values into it. I already have an existing frame df1 that reads the values from MySheet1.xlsx, so this needs to either be a new dataframe or somehow copy and overwrite the existing one.
The spreadsheet is in this format:
I have a python list: values_list = [12.34, 17.56, 12.45]. My goal is to insert the list values vertically under the Col_C header. At the moment my code overwrites the entire worksheet, without preserving the existing values.
df2 = pd.DataFrame({'Col_C': values_list})
writer = pd.ExcelWriter('excelfile.xlsx', engine='xlsxwriter')
df2.to_excel(writer, sheet_name='MySheet1')
workbook = writer.book
worksheet = writer.sheets['MySheet1']
How to get this end result? Thank you!
Below I've provided a fully reproducible example of how you can go about modifying an existing .xlsx workbook using pandas and the openpyxl module (link to Openpyxl Docs).
First, for demonstration purposes, I create a workbook called test.xlsx:
from openpyxl import load_workbook
import pandas as pd
writer = pd.ExcelWriter('test.xlsx', engine='openpyxl')
wb = writer.book
df = pd.DataFrame({'Col_A': [1,2,3,4],
                   'Col_B': [5,6,7,8],
                   'Col_C': [0,0,0,0],
                   'Col_D': [13,14,15,16]})
df.to_excel(writer, index=False)
wb.save('test.xlsx')
This is the Expected output at this point:
In this second part, we load the existing workbook ('test.xlsx') and modify the third column with different data.
from openpyxl import load_workbook
import pandas as pd
df_new = pd.DataFrame({'Col_C': [9, 10, 11, 12]})
wb = load_workbook('test.xlsx')
ws = wb['Sheet1']
for index, row in df_new.iterrows():
    cell = 'C%d' % (index + 2)
    ws[cell] = row[0]
wb.save('test.xlsx')
This is the Expected output at the end:
In my opinion, the easiest solution is to read the excel file as a pandas dataframe, modify it, and write it out as an excel file again. So for example:
Comments:
Import pandas as pd.
Read the excel sheet into a pandas data-frame.
Take your data, which could be in a list format, and assign it to the column you want (just make sure the lengths are the same). Save your data-frame as an excel file, either overriding the old excel file or creating a new one.
Code:
import pandas as pd
ExcelDataInPandasDataFrame = pd.read_excel("./YourExcel.xlsx")
YourDataInAList = [12.34,17.56,12.45]
ExcelDataInPandasDataFrame["Col_C"] = YourDataInAList
ExcelDataInPandasDataFrame.to_excel("./YourNewExcel.xlsx", index=False)

How to write to separate excel worksheets within the same workbook in a for loop?

I have a for-loop that computes DataFrames and outputs them to an excel sheet. Right now, I am able to export one full DF to one excel sheet.
As shown below, I am trying to add code inside the for-loop so that it can do this computation for many DFs and export each DF to a separate excel worksheet within the same workbook, delineated by the sheet_name.
list = ["AUD_JPY", "AUD_USD"]
granularity = "D"
bar_count = 5000
MA_list = [20, 50]
writer = pd.ExcelWriter('BT.xlsx')
resultsDF = pd.DataFrame(columns = ['instrument', 'Start date', 'End date'])
for MA in MA_list:
    for instrument in list:
        ...
        data = {'instrument': [instrument], 'Start date': [startDate], 'End date': [endDate]}
        resultsDF = resultsDF.append(data, ignore_index=True)
        instrument=+1
    resultsDF.to_excel(writer, sheet_name='Price_Over_SMA{0}_{1}'.format(MA, granularity))
    writer.save()
    MA=+1
This only exports the first DF from the 'instrument' loop to the excel sheet. I get this worksheet:
But I also want to export another worksheet next to that one that should be called 'Price_Over_SMA50_D' in this case.
Not sure what I am doing wrong.
Thanks
Using Pandas, you can save different dataframes into different sheets of the same excel file:
Let's say I have a list of dataframes dfs that I want to save as separate sheets in an excel file whose absolute path is file_name.
You could use something like this:
import pandas as pd
file_name = # your excel file name you're saving
dfs = # my list of dataframes
with pd.ExcelWriter(file_name) as writer:
    for i, df in enumerate(dfs):
        df.to_excel(writer, sheet_name='Sheet_{0}'.format(i))
I found out it was a matter of moving writer.save() outside of my 'MA' for-loop and resetting the DF after each loop.
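As a rough sketch of that restructuring (variable names taken from the question, with the computation inside the instrument loop elided as in the original; the question's name list is renamed to instruments here only to avoid shadowing the built-in):
writer = pd.ExcelWriter('BT.xlsx')
for MA in MA_list:
    # reset the results frame for each moving-average window
    resultsDF = pd.DataFrame(columns=['instrument', 'Start date', 'End date'])
    for instrument in instruments:
        ...
        data = {'instrument': [instrument], 'Start date': [startDate], 'End date': [endDate]}
        resultsDF = resultsDF.append(data, ignore_index=True)
    # each MA gets its own worksheet in the same workbook
    resultsDF.to_excel(writer, sheet_name='Price_Over_SMA{0}_{1}'.format(MA, granularity))
# save once, after every sheet has been written
writer.save()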

How to read Excel Workbook (pandas)

First I want to say that I am not an expert by any means. I am versed but carry a burden of schedule and learning Python like I should have at a younger age!
Question:
I have a workbook that will on occasion have more than one worksheet. When reading in the workbook I will not know the number of sheets or their sheet names. The data arrangement will be the same on every sheet, with some columns going by the name of 'Unnamed'. The problem is that everything I try or find online uses pandas.ExcelFile to gather all sheets, which is fine, but I need to be able to skip 4 rows, only read the 42 rows after that, and parse specific columns. Although the sheets might have the exact same structure, the column names might be the same or different, and I would like them to be merged.
So here is what I have:
import pandas as pd
from openpyxl import load_workbook
# Load in the file location and name
cause_effect_file = r'C:\Users\Owner\Desktop\C&E Template.xlsx'
# Set up the ability to write dataframe to the same workbook
book = load_workbook(cause_effect_file)
writer = pd.ExcelWriter(cause_effect_file)
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
# Get the file skip rows and parse columns needed
xl_file = pd.read_excel(cause_effect_file, skiprows=4, parse_cols = 'B:AJ', na_values=['NA'], convert_float=False)
# Loop through the sheets loading data in the dataframe
dfi = {sheet_name: xl_file.parse(sheet_name)
       for sheet_name in xl_file.sheet_names}
# Remove columns labeled as un-named
for col in dfi:
    if r'Unnamed' in col:
        del dfi[col]
# Write dataframe to sheet so we can see what the data looks like
dfi.to_excel(writer, "PyDF", index=False)
# Save it back to the book
writer.save()
The link to the file i am working with is below
Excel File
Try to modify the following based on your specific need:
import os
import pandas as pd
df = pd.DataFrame()
xls = pd.ExcelFile(path)
Then iterate over all the available data sheets:
for x in range(0, len(xls.sheet_names)):
    a = xls.parse(x, header=4, parse_cols='B:AJ')
    a["Sheet Name"] = [xls.sheet_names[x]] * len(a)
    df = df.append(a)
You can adjust the header row and the columns to read for each sheet. I added a column that will indicate the name of the data sheet the row came from.
You probably want to look at using read_only mode in openpyxl. This will allow you to load only the sheets you're interested in and look at only the cells you're interested in.
If you want to work with Pandas dataframes then you'll have to create these yourself, but that shouldn't be too hard.
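A minimal sketch of that approach, assuming the bounds from the question (skip 4 rows, then roughly 42 rows across columns B:AJ); the exact row/column limits and the blank-header filtering are assumptions, not code from this answer:
import pandas as pd
from openpyxl import load_workbook

wb = load_workbook(r'C:\Users\Owner\Desktop\C&E Template.xlsx', read_only=True)
frames = []
for ws in wb.worksheets:
    # rows 5-46 in columns B (2) through AJ (36); the first row is treated as the header
    data = list(ws.iter_rows(min_row=5, max_row=46, min_col=2, max_col=36, values_only=True))
    df = pd.DataFrame(data[1:], columns=data[0])
    # drop columns whose header cell was blank (these show up as 'Unnamed' via read_excel)
    df = df.loc[:, [c is not None for c in df.columns]]
    df['Sheet Name'] = ws.title
    frames.append(df)
wb.close()
combined = pd.concat(frames, ignore_index=True)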
