Mapping in Excel using Python

Excel 1
Group Summary Label    Amount
Individual Member
Family Member
Family

Excel 2
Network Label          Value
Individual Member      100
Family Member          200
Family                 300
I have two Excel sheets and I am trying to map values between them. As you can see, the two sheets have different column names, but the rows are the same. I am trying to map 'Value' in Excel 2 to 'Amount' in Excel 1.
I am expecting a result like the one below. How can I do this using Python? I am new and trying to learn.
Group Summary Label    Amount
Individual Member      100
Family Member          200
Family                 300

First, open each sheet of the workbook as its own dataframe:
import pandas as pd

xls = pd.ExcelFile('excelfilename.xls')
df1 = pd.read_excel(xls, 'Sheet1')
df2 = pd.read_excel(xls, 'Sheet2')
Now you can combine the two dataframes into one:
df1 = df1.drop(["Amount"], axis=1)   # drop returns a copy, so assign it back
df2 = df2.join(df1)
df2 = df2.rename(columns={"Value": "Amount"}, errors="raise")
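If the two tables live in separate workbooks, or if the rows are not guaranteed to be in the same order, a merge on the label columns is a sketch of the same mapping (the file names here are assumed; the column names come from the question):

import pandas as pd

df1 = pd.read_excel('excel1.xlsx')   # 'Group Summary Label' plus an empty 'Amount'
df2 = pd.read_excel('excel2.xlsx')   # 'Network Label' and 'Value'

result = (df1.drop(columns='Amount')
             .merge(df2, left_on='Group Summary Label', right_on='Network Label')
             .drop(columns='Network Label')
             .rename(columns={'Value': 'Amount'}))
result.to_excel('excel1_mapped.xlsx', index=False)

Matching on the labels rather than on row position keeps the result correct even if one sheet is sorted differently.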

Related

Dataframes to Excel file (multiple sheets) per unique value

I have three different dataframes which all contain a column with certain IDs.
DF_1
DF_2
DF_3
What I am trying to achieve is to create one Excel file per unique ID, named after that ID, with the three dataframes 'DF_1, DF_2, DF_3' as its sheets. So '1.xlsx' should contain three sheets (one per dataframe) holding only the records associated with that ID. The thing I get stuck at is either getting the multiple sheets or getting only the corresponding values per unique ID.
for name, r in df_1.groupby("ID"):
    r.to_excel(f'{name}.xlsx', index=False)
This piece of code gives me the correct output, but only for df_1: I get 5 Excel files with the corresponding rows per ID, but each with only one sheet, namely for df_1. I can't figure out how to include df_2 and df_3 per ID. When I try to use the following piece of code with nested loops, I get all the rows instead of only those for each unique value:
writer = pd.ExcelWriter(f'{name}.xlsx')
r.to_excel(writer, sheet_name=f'{name}_df1')
r.to_excel(writer, sheet_name=f'{name}_df2')
r.to_excel(writer, sheet_name=f'{name}_df3')
writer.save()
There is more data transformation going on before this part, and these final dataframes are the ones that are needed eventually. Frankly, I have no idea how to fix this or how to achieve it. Hopefully, someone has some insightful comments.
Can you try the following:
unique_ids = df_1['ID'].unique()

for name in unique_ids:
    writer = pd.ExcelWriter(f'{name}.xlsx')
    r1 = df_1[df_1['ID'].eq(name)]
    r1.to_excel(writer, sheet_name=f'{name}_df1')
    r2 = df_2[df_2['ID'].eq(name)]
    r2.to_excel(writer, sheet_name=f'{name}_df2')
    r3 = df_3[df_3['ID'].eq(name)]
    r3.to_excel(writer, sheet_name=f'{name}_df3')
    writer.save()
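On pandas 2.0+ writer.save() has been removed, so the context-manager form is safer; this is a sketch of the same loop, assuming the three dataframes above:

import pandas as pd

for name in df_1['ID'].unique():
    # the with block saves and closes each file automatically
    with pd.ExcelWriter(f'{name}.xlsx') as writer:
        for label, frame in (('df1', df_1), ('df2', df_2), ('df3', df_3)):
            frame[frame['ID'].eq(name)].to_excel(writer, sheet_name=f'{name}_{label}')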

Select rows in a panda dataframe based on condition from another dataframe with a different size

Consider a 100x200 dataframe (called df1) representing clinical data from 100 patients. Each patient can be identified through one number in column 'ID' and another number in column 'CENTER'.
Now, consider a second 40x170 dataframe df2 containing data from a subset of 40 patients randomly selected from df1 and tested 6 months later on different variables. Like df1, df2 contains the columns 'ID' and 'CENTER'. I am trying to select these 40 patients in df1 based on their ID and CENTER numbers, but can't find an easy way to do so using Pandas. Any idea?
You could try this:
df3 = df1[df1.ID.isin(df2.ID) & df1.CENTER.isin(df2.CENTER)]
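One caveat: the two isin checks are independent, so a row can match with the ID of one patient and the CENTER of another. If the (ID, CENTER) pairs must match together, a merge on both columns is a sketch of a stricter alternative:

# match on (ID, CENTER) pairs rather than on each column separately
df3 = df1.merge(df2[['ID', 'CENTER']].drop_duplicates(),
                on=['ID', 'CENTER'], how='inner')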

Writing a dict with keys and dataframes to an excel sheet using python

I have the dictionary below, which maps month keys to dataframes.
The data and keys:
Data Period Jan'18 Data Period Jan'18 Data Period Jan'18
Churn Period Feb'18 Churn Period Mar'18 Churn Period Apr'18
Variable_Name correlation Variable_Name correlation Variable_Name correlation
Pending_Disconnect 0.553395448 Pending_Change 0.043461995 active_frq_N 0.025697016
status_Active 0.539464806 status_Active 0.038057697 active_frq_Y 0.025697016
days_active 0.414774231 ethnic 0.037503202 ethnic 0.025195149
days_pend_disco 0.392915837 days_active 0.037227245 ecgroup 0.023192408
prop_tenure 0.074321692 archetype_grp 0.035761434 age 0.023121305
abs_change_3m 0.062267386 age_nan 0.035761434 archetype_nan 0.023121305
The keys and the dataframes have to be written to an Excel sheet with a gap between each key/dataframe combination.
The Data Period comes from the first part of the key and the Churn Period from the second part, after the '-'.
Each dataframe contains data which looks like below:
Variable_Name correlation
Pending_Disconnect 0.553395448
status_Active 0.539464806
days_active 0.414774231
days_pend_disco 0.392915837
prop_tenure 0.074321692
abs_change_3m 0.062267386
Can someone please help me with this?
1.) Concatenate all the dataframes in the dictionary into one big dataframe.
Create an empty dataframe:
tmp = pd.DataFrame()
Iterate through the keys of your dictionary (let's say d) and concat the dfs:
for i in d.keys():
    tmp = pd.concat([tmp, d[i]], axis=1)
Now, tmp is a big df with all the smaller dataframes concatenated.
2.) Append blank columns to this new df tmp. The point here is that each small df should be separated from the next by a blank column.
So, if there are 3 small dfs, append 2 blank columns to tmp.
tmp[' '] = ''    ## the blank columns need distinct names, e.g. one and two spaces
tmp['  '] = ''
3.) Now, re-structure your tmp df by placing the blank cols in between the small dfs.
Suppose the columns in tmp are:
'variable_name', 'correlation', 'Attribute', 'Datatype', 'variable_name', 'correlation', 'Attribute', 'Datatype',
'variable_name', 'correlation', 'Attribute', 'Datatype',
' ', '  '    ## last 2 cols are the blank ones from step 2
These are the columns of all the small dfs that were concatenated.
Now, create a col_list that places a blank column between each pair of small dfs:
col_list = ['variable_name', 'correlation', 'Attribute', 'Datatype', ' ', 'variable_name', 'correlation', 'Attribute', 'Datatype', '  ', 'variable_name', 'correlation', 'Attribute', 'Datatype']
4.) Re-arrange tmp as per col_list.
tmp = tmp[col_list]
5.) Now you have this big dataframe ready, with each small dataframe separated from the next by a blank column.
Write this to Excel now.
tmp.to_excel()    ## fill in the required parameters (path, sheet name, etc.)
Let me know if this helps.
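A more direct alternative (a sketch, assuming the dictionary is named d and an output file name of your choosing) is to write each dataframe to the same sheet at increasing column offsets, which sidesteps the duplicate-column problem entirely:

import pandas as pd

with pd.ExcelWriter('correlations.xlsx') as writer:   # hypothetical output name
    startcol = 0
    for key, frame in d.items():
        # write the key as a single cell above its dataframe
        pd.DataFrame([key]).to_excel(writer, sheet_name='Sheet1',
                                     startrow=0, startcol=startcol,
                                     index=False, header=False)
        # write the dataframe itself just below the key
        frame.to_excel(writer, sheet_name='Sheet1',
                       startrow=1, startcol=startcol, index=False)
        startcol += frame.shape[1] + 1   # leave one blank column as the gap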

Split dataframe according to column value and export to different Excel worksheets

Sources used before asking:
Pandas: Iterate through a list of DataFrames and export each to excel sheets
Splitting dataframe into multiple dataframes
I have managed to do all of this:
# sort the dataframe (df.sort was removed from pandas; use sort_values)
df.sort_values(by=['name'], inplace=True)
# set the index to be this and don't drop
df.set_index(keys=['name'], drop=False, inplace=True)
# get a list of names
names = df['name'].unique().tolist()
# now we can perform a lookup on a 'view' of the dataframe
joe = df.loc[df.name == 'joe']
# now you can query all 'joes'
I have managed to make this work - joe = df.loc[df.name=='joe'] - and it gave exactly the result I was looking for.
As a solution to make it work for a big amount of data I found this potential solution:
writer = pandas.ExcelWriter("MyData.xlsx", engine='xlsxwriter')
List = [Data, ByBrand]
for i in List:
    i.to_excel(writer, sheet_name=i)
writer.save()
Currently I have:
teacher_names = ['Teacher A', 'Teacher B', 'Teacher C']
df =
ID Teacher_name Student_name
Teacher_name
Teacher A 1.0 Teacher A Student 1
Teacher A NaN Teacher A Student 2
Teacher B 0.0 Teacher B Student 3
Teacher C 2.0 Teacher C Student 4
If I use test = df.loc[df.Teacher_name=='Teacher A'], I receive exactly the result I'm after.
Question: how can I automate this so that the "test" result is saved separately for each teacher (.to_excel(writer, sheet_name=Teacher_name)) under the teacher's name, for every teacher existing in the database?
This should work for you. You were nearly there; you just need to iterate the names list and filter your dataframe each time.
names = df['name'].unique().tolist()
writer = pandas.ExcelWriter("MyData.xlsx", engine='xlsxwriter')
for myname in names:
    mydf = df.loc[df.name == myname]
    mydf.to_excel(writer, sheet_name=myname)
writer.save()
#jpp, the text 'sheetname' has to be replaced with 'sheet_name'. Also, once the 'name' variable is converted to a list, upon running the for loop to create multiple sheets based on unique name values, I get the following error:
InvalidWorksheetName: Invalid Excel character '[]:*?/\' in sheetname '['.
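That error means a forbidden character (or a non-string such as a list) ended up in a sheet name. Excel forbids the characters []:*?/\ and caps sheet names at 31 characters, so a small sanitizer (a sketch; the underscore replacement is an arbitrary choice) avoids the problem:

import re

def safe_sheet_name(name):
    # stringify, replace Excel-forbidden characters, truncate to the 31-char limit
    return re.sub(r'[\[\]:*?/\\]', '_', str(name))[:31]

# usage inside the loop above:
# mydf.to_excel(writer, sheet_name=safe_sheet_name(myname))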
Alternate way of creating multiple worksheets (in a single Excel file) based on a column value (through a function):
def writesheet(g):
    a = g['name'].tolist()[0]
    g.to_excel(writer, sheet_name=str(a), index=False)

df.groupby('name').apply(writesheet)
writer.save()
Source: How to split a large excel file into multiple worksheets based on their given ip address using pandas python
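The snippet above assumes writer was created beforehand; a self-contained sketch of the same idea on current pandas (the output file name is assumed) would be:

import pandas as pd

def write_groups(df, path='split_by_name.xlsx'):   # hypothetical output name
    with pd.ExcelWriter(path) as writer:
        for name, g in df.groupby('name'):
            # each group g holds all rows for one name; use the name as the sheet name
            g.to_excel(writer, sheet_name=str(name), index=False)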

Pandas excel difference generator

I am trying to create a Python program that gives me the differences between 2 big Excel files with multiple sheets. I got it to print the results to an Excel file, but apparently when one of the cells contains datetime data, multiplying a boolean dataframe with the dataframe that contains dates no longer works. I get the following error:
TypeError: unsupported operand type(s) for *: 'bool' and 'datetime.datetime'
EDIT: I just realized this method doesn't work for strings either (it only works for purely numerical data). What would be a better way to do this that works on strings, numbers and time data?
#start of program
import pandas as pd
from pandas import ExcelWriter
import numpy as np

df1 = pd.read_excel('4_Input EfE_2030.xlsm', None)
df2 = pd.read_excel('5_Input EfE_2030.xlsm', None)
keys1 = df1.keys()
keys2 = df2.keys()
writer = ExcelWriter('test1.xlsx')
#loop over all sheets and create new dataframes with the differences
for x in keys1:
    df3 = pd.read_excel('4_Input EfE_2030.xlsm', sheetname=x, header=None)
    df4 = pd.read_excel('5_Input EfE_2030.xlsm', sheetname=x, header=None)
    dif = df3 != df4
    df = dif * df3
    df2 = dif * df4
    nrcolumns = len(df.columns)
    #when there are no differences in the entire sheet the dataframe will be empty.
    #Add 1 to the row indexes so the numbers coincide with Excel row numbers
    if not df.empty:
        # df.columns = ['A']
        df.index = np.arange(1, len(df) + 1)
    if not df2.empty:
        # df2.columns = ['A']
        df2.index = np.arange(1, len(df) + 1)
    #delete rows with all 0
    df = df.loc[~(df == 0).all(axis=1)]
    df2 = df2.loc[~(df2 == 0).all(axis=1)]
    #create a new df with the data of the 2 sheets
    result = pd.concat([df, df2], axis=1)
    print(result)
    result.to_excel(writer, sheet_name=x)
Updated answer
Approach
This is an interesting question. Another approach is to compare the column values in one Excel worksheet against the column values in another Excel worksheet by using the Panel data structure offered by Pandas. This data structure stores data as a 3-dimensional array. With the data from two Excel worksheets stored in a Panel, we can compare rows across worksheets that are uniquely identified by one column or a combination of columns (e.g., a unique ID). We make this comparison by applying a custom function that compares the value in each cell of each column in one worksheet to the value in the same cell of the same column in the second worksheet. One benefit of this approach is that the datatype of each value no longer matters, since we're just comparing values (e.g., 1 == 1, 'my name' == 'my name', etc.).
Assumptions
This approach makes several assumptions about your data:
The rows in each of the worksheets share one or a combination of columns that uniquely identify each row.
The columns of interest for comparison exist in both worksheets and share the same column headers.
(There may be other assumptions I'm failing to notice.)
Implementation
The implementation of this approach is a bit involved. Also, because I do not have access to your data, I cannot customize the implementation specifically to your data. With that said, I'll implement this approach using some dummy data shown below.
"Old" dataset:
id col_num col_str col_datetime
1 123 My string 1 2001-12-04
2 234 My string 2 2001-12-05
3 345 My string 3 2001-12-06
"New" dataset:
id col_num col_str col_datetime
1 123 My string 1 MODIFIED 2001-12-04
3 789 My string 3 2001-12-10
4 456 My string 4 2001-12-07
Notice the following differences about these two dataframes:
col_str in the row with id 1 is different
col_num in the row with id 3 is different
col_datetime in the row with id 3 is different
The row with id 2 exists in "old" but not "new"
The row with id 4 exists in "new" but not "old"
Okay, let's get started. First, we read in the datasets into separate dataframes:
df_old = pd.read_excel('old.xlsx', 'Sheet1', na_values=['NA'])
df_new = pd.read_excel('new.xlsx', 'Sheet1', na_values=['NA'])
Then we add a new version column to each dataframe to keep our thinking straight. We'll also use this column later to separate out rows from the "old" and "new" dataframes into their own separate dataframes:
df_old['VER'] = 'OLD'
df_new['VER'] = 'NEW'
Then we concatenate the "old" and "new" datasets into a single dataframe. Notice that the ignore_index parameter is set to True so that we ignore the index as it is not meaningful for this operation:
df_full = pd.concat([df_old, df_new], ignore_index=True)
Now we're going to identify all of the duplicate rows that exist across the two dataframes. These are rows where all of the column values are the same across the "old" and "new" dataframes; in other words, rows where no differences exist.
Once identified, we drop these duplicate rows. What we're left with are the rows that (a) differ between the two dataframes, (b) exist in the "old" dataframe but not the "new" dataframe, and (c) exist in the "new" dataframe but not the "old" dataframe:
df_diff = df_full.drop_duplicates(subset=['id', 'col_num', 'col_str', 'col_datetime'])
Next we identify and extract the values for id (i.e., the primary key across the "old" and "new" dataframes) for the rows that exist in both the "old" and "new" dataframes. It's important to note that these ids do not include rows that exist in one or the other dataframes but not both (i.e., rows removed or rows added):
diff_ids = df_diff.set_index('id').index.get_duplicates()
Now we restrict df_full to only those rows identified by the ids in diff_ids:
df_diff_ids = df_full[df_full['id'].isin(diff_ids)]
Now we move the duplicate rows from the "old" and "new" dataframes into separate dataframes that we can plug into a Panel data structure for comparison:
df_diff_old = df_diff_ids[df_diff_ids['VER'] == 'OLD']
df_diff_new = df_diff_ids[df_diff_ids['VER'] == 'NEW']
Next we set the index for both of these dataframes to the primary key (i.e., id). This is necessary for Panel to work effectively:
df_diff_old.set_index('id', inplace=True)
df_diff_new.set_index('id', inplace=True)
We slot both of these dataframes into a Panel data structure:
df_panel = pd.Panel(dict(df1=df_diff_old, df2=df_diff_new))
Finally we make our comparison using a custom function (find_diff) and the apply method:
def find_diff(x):
    # keep the value when both sides match; otherwise show "old -> new"
    return x[0] if x[0] == x[1] else '{} -> {}'.format(*x)

df_diff = df_panel.apply(find_diff, axis=0)
If you print out the contents of df_diff you can easily note which values changed between the "old" and "new" dataframes:
col_num col_str col_datetime
id
1 123 My string 1 -> My string 1 MODIFIED 2001-12-04 00:00:00
3 345 -> 789 My string 3 2001-12-06 00:00:00 -> 2001-12-10 00:00:00
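For readers on current pandas: pd.Panel was removed in pandas 1.0, so the same cell-by-cell comparison can be sketched with DataFrame.compare (available from pandas 1.1), reusing the df_diff_old and df_diff_new frames built above:

# a sketch; sort_index() guarantees the identically-labeled rows compare() requires
changed = (df_diff_old.drop(columns='VER').sort_index()
                      .compare(df_diff_new.drop(columns='VER').sort_index()))
print(changed)   # a two-level column index pairs the old ('self') and new ('other') values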
Improvements
There are a few improvements I'll leave to you to make to this implementation:
Add a binary (1/0) flag that indicates whether one or more values in a row changed
Identify which rows in the "old" dataframe were removed (i.e., are not present in the "new" dataframe)
Identify which rows in the "new" dataframe were added (i.e., not present in the "old" dataframe)
Original answer
Issue:
The issue is that you cannot perform arithmetic operations on datetimes.
However, you can perform arithmetic operations on timedeltas.
I can think of a few solutions that might help you:
Solution 1:
Convert your datetimes to strings.
If I'm understanding your problem correctly, you're comparing Excel worksheets for differences, correct? If this is the case, then I don't think it matters if the datetimes are represented as explicit datetimes (i.e., you're not performing any datetime calculations).
To implement this solution you would modify your pd.read_excel() calls and explicitly set the dtype parameter to convert your datetimes to strings:
df1 = pd.read_excel('4_Input EfE_2030.xlsm', dtype={'LABEL FOR DATETIME COL 1': str})
Solution 2:
Convert your datetimes to timedeltas.
For each datetime column, you can subtract a fixed reference date to get timedeltas, e.g. df['LABEL FOR DATETIME COL'] - pd.Timestamp('1970-01-01'). (Note that pd.Timedelta() itself does not accept a whole column.)
Overall, without seeing your data, I believe Solution 1 is the most straightforward.
