I have this data frame:
id-input  id-output  Date        Price  Type
1         3          20/09/2020  100    ABC
2         1          20/09/2020  200    ABC
2         1          21/09/2020  300    ABC
1         3          21/09/2020  50     AD
1         2          21/09/2020  40     AD
I want to get this output:
id-inp-ABC  id-out-ABC  Date-ABC    Price-ABC  Type-ABC  id-inp-AD  id-out-AD  Date-AD     Price-AD  Type-AD
1           3           20/09/2020  10         ABC       2          1          20/09/2020  10        AD
1'          3           20/09/2020  90         ABC       NaN        NaN        NaN         NaN       NaN
2           1           20/09/2020  40         ABC       1          2          21/09/2020  40        AD
2'          1           20/09/2020  160        ABC       NaN        NaN        NaN         NaN       NaN
2           1           21/09/2020  300        ABC       NaN        NaN        NaN         NaN       NaN
My idea is to:
- divide the dataframe into two dataframes by type,
- iterate through both dataframes and check whether id-input == id-output,
- check if the prices are equal; if not, split the row and subtract the price,
- rename the columns and merge them.
grp = df.groupby('Type')
transformed_df_list = []
for idx, frame in grp:
    frame.reset_index(drop=True, inplace=True)
    transformed_df_list.append(frame.copy())
ABC = transformed_df_list[0]
AD = transformed_df_list[1]
for i, row in ABC.iterrows():
    for j, row1 in AD.iterrows():
        if row['id-input'] == row1['id-output']:
            row_df = pd.DataFrame([row1])
            row_df = row_df.rename(columns={'id-input': 'id-inp-AD', 'id-output': 'id-out-AD', 'Date': 'Date-AD', 'Price': 'Price-AD'})
            output = pd.merge(ABC.set_index('id-input', drop=False), row_df.set_index('id-out-AD', drop=False), how='left', left_on=['id-input'], right_on=['id-inp-AD'])
but the result is NaN in the id-inp-AD, id-out-AD, Date-AD, Price-AD and Type-AD part,
and row_df contains just the last row:
1 2 21/09/2020 40 AD
I also want the iteration to respect the order, and each row inserted into the output dataframe to be sorted by date.
The most elegant way to solve your problem is to use pandas.DataFrame.pivot. You end up with multilevel column names instead of a single level. If you need to transfer the DataFrame back to single level column names, check the second answer here.
import pandas as pd
input = [
[1, 3, '20/09/2020', 100, 'ABC'],
[2, 1, '20/09/2020', 200, 'ABC'],
[2, 1, '21/09/2020', 300, 'ABC'],
[1, 3, '21/09/2020', 50, 'AD'],
[1, 2, '21/09/2020', 40, 'AD']
]
df = pd.DataFrame(data=input, columns=["id-input", "id-output", "Date", "Price", "Type"])
df_pivot = df.pivot(columns=["Type"])
print(df_pivot)
Output
id-input id-output Date Price
Type ABC AD ABC AD ABC AD ABC AD
0 1.0 NaN 3.0 NaN 20/09/2020 NaN 100.0 NaN
1 2.0 NaN 1.0 NaN 20/09/2020 NaN 200.0 NaN
2 2.0 NaN 1.0 NaN 21/09/2020 NaN 300.0 NaN
3 NaN 1.0 NaN 3.0 NaN 21/09/2020 NaN 50.0
4 NaN 1.0 NaN 2.0 NaN 21/09/2020 NaN 40.0
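If you do need single-level names like id-input-ABC afterwards, one way (a sketch; the "-" separator here is an arbitrary choice, not something pivot produces) is to join the two levels of each column tuple:

```python
import pandas as pd

data = [
    [1, 3, '20/09/2020', 100, 'ABC'],
    [2, 1, '20/09/2020', 200, 'ABC'],
    [2, 1, '21/09/2020', 300, 'ABC'],
    [1, 3, '21/09/2020', 50, 'AD'],
    [1, 2, '21/09/2020', 40, 'AD'],
]
df = pd.DataFrame(data=data, columns=["id-input", "id-output", "Date", "Price", "Type"])
df_pivot = df.pivot(columns=["Type"])

# each column label is now a tuple like ('Price', 'ABC'); join the levels
df_pivot.columns = ['-'.join(col) for col in df_pivot.columns]
print(df_pivot.columns.tolist())
```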
Related
I am struggling with the following issue.
My DF is:
df = pd.DataFrame(
[
['7890-1', '12345N', 'John', 'Intermediate'],
['7890-4', '30909N', 'Greg', 'Intermediate'],
['3300-1', '88117N', 'Mark', 'Advanced'],
['2502-2', '90288N', 'Olivia', 'Elementary'],
['7890-2', '22345N', 'Joe', 'Intermediate'],
['7890-3', '72245N', 'Ana', 'Elementary']
],
columns=['Id', 'Code', 'Person', 'Level'])
print(df)
I would like to get such a result:
   Id    Code 1  Person 1  Level 1       Code 2  Person 2  Level 2       Code 3  Person 3  Level 3     Code 4  Person 4  Level 4
0  7890  12345N  John      Intermediate  22345N  Joe       Intermediate  72245N  Ana       Elementary  30909N  Greg      Intermediate
1  3300  88117N  Mark      Advanced      NaN     NaN       NaN           NaN     NaN       NaN         NaN     NaN       NaN
2  2502  NaN     NaN       NaN           90288N  Olivia    Elementary    NaN     NaN       NaN         NaN     NaN       NaN
I'd start with the same approach as @Andrej Kesely, but then sort by index after unstacking and map over the column names with ' '.join.
df[["Id", "No"]] = df["Id"].str.split("-", expand=True)
df_wide = df.set_index(["Id", "No"]).unstack(level=1).sort_index(axis=1,level=1)
df_wide.columns = df_wide.columns.map(' '.join)
Output
Code 1 Level 1 Person 1 Code 2 Level 2 Person 2 Code 3 \
Id
2502 NaN NaN NaN 90288N Elementary Olivia NaN
3300 88117N Advanced Mark NaN NaN NaN NaN
7890 12345N Intermediate John 22345N Intermediate Joe 72245N
Level 3 Person 3 Code 4 Level 4 Person 4
Id
2502 NaN NaN NaN NaN NaN
3300 NaN NaN NaN NaN NaN
7890 Elementary Ana 30909N Intermediate Greg
Try:
df[["Id", "Id2"]] = df["Id"].str.split("-", expand=True)
x = df.set_index(["Id", "Id2"]).unstack(level=1)
x.columns = [f"{a} {b}" for a, b in x.columns]
print(
x[sorted(x.columns, key=lambda k: int(k.split()[-1]))]
.reset_index()
.to_markdown()
)
Prints:
|    |   Id | Code 1 | Person 1 | Level 1      | Code 2 | Person 2 | Level 2      | Code 3 | Person 3 | Level 3    | Code 4 | Person 4 | Level 4      |
|---:|-----:|:-------|:---------|:-------------|:-------|:---------|:-------------|:-------|:---------|:-----------|:-------|:---------|:-------------|
|  0 | 2502 | nan    | nan      | nan          | 90288N | Olivia   | Elementary   | nan    | nan      | nan        | nan    | nan      | nan          |
|  1 | 3300 | 88117N | Mark     | Advanced     | nan    | nan      | nan          | nan    | nan      | nan        | nan    | nan      | nan          |
|  2 | 7890 | 12345N | John     | Intermediate | 22345N | Joe      | Intermediate | 72245N | Ana      | Elementary | 30909N | Greg     | Intermediate |
I would like to extract the values based on the labels Name, Grade, School and Class.
For example, to find Name and Grade I go through column 0 and look for the value in the next few columns, but the value is scattered (to be extracted) around the neighbouring columns. The same goes for School and Class.
Refer to this: extract column value based on another column pandas dataframe
I have multiple files:
0 1 2 3 4 5 6 7 8
0 nan nan nan Student Registration nan nan
1 Name: nan nan John nan nan nan nan nan
2 Grade: nan 6 nan nan nan nan nan nan
3 nan nan nan School: C College nan Class: 1A
0 1 2 3 4 5 6 7 8 9
0 nan nan nan Student Registration nan nan nan
1 nan nan nan nan nan nan nan nan nan nan
2 Name: Mary nan nan nan nan nan nan nan nan
3 Grade: 7 nan nan nan nan nan nan nan nan
4 nan nan nan School: nan D College Class: nan 5A
This is my code (it errors):
for file in files:
    df = pd.read_csv(file, header=0)
    df['Name'] = df.loc[df[0].isin('Name')[1,2,3]
    df['Grade'] = df.loc[df[0].isin('Grade')[1,2,3]
    df['School'] = df.loc[df[3].isin('School')[4,5]
    df['Class'] = df.loc[df[7].isin('Class')[8,9]
    d.append(df)
df = pd.concat(d, ignore_index=True)
This is the outcome I want: (Melt Function)
Name Grade School Class ... .... ... ...
John 6 C College 1A
John 6 C College 1A
John 6 C College 1A
John 6 C College 1A
Mary 7 D College 5A
Mary 7 D College 5A
Mary 7 D College 5A
Mary 7 D College 5A
I think it is possible to use:
for file in files:
    df = pd.read_csv(file, header=0)
    # filter out the first row and reshape - remove NaNs, convert to a 1-column df
    df = df.iloc[1:].stack().reset_index(drop=True).to_frame('data')
    # label cells end with ':'
    m = df['data'].str.endswith(':', na=False)
    # shift the values into a new column, next to their labels
    df['val'] = df['data'].shift(-1)
    # filter and transpose
    df = df[m].set_index('data').T.rename_axis(None, axis=1)
    d.append(df)
df = pd.concat(d, ignore_index=True)
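A minimal sketch of how that reshaping behaves on one mocked file (the frame below is a hypothetical stand-in for a parsed CSV, mirroring the layout in the question):

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for one parsed file
raw = pd.DataFrame([
    [np.nan, np.nan, 'Student Registration', np.nan],
    ['Name:', np.nan, 'John', np.nan],
    ['Grade:', 6, np.nan, np.nan],
    [np.nan, 'School:', 'C College', np.nan],
])

# stack() drops NaNs and flattens the cells into one column, row by row
data = raw.iloc[1:].stack().reset_index(drop=True).to_frame('data')
# label cells end with ':'; their values directly follow them after flattening
m = data['data'].str.endswith(':', na=False)
data['val'] = data['data'].shift(-1)
row = data[m].set_index('data').T.rename_axis(None, axis=1)
print(row)
```

Each file then yields a one-row frame keyed by its labels, and pd.concat stitches those rows together.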
EDIT:
You can use:
for file in files:
    # if the inputs are excel files, change read_csv to read_excel
    df = pd.read_excel(file, header=None)
    df['Name'] = df.loc[df[0].eq('Name:'), [1,2,3]].dropna(axis=1).squeeze()
    df['Grade'] = df.loc[df[0].eq('Grade:'), [1,2,3]].dropna(axis=1).squeeze()
    df['School'] = df.loc[df[3].eq('School:'), [4,5]].dropna(axis=1).squeeze()
    df['Class'] = df.loc[df[6].eq('Class:'), [7,8]].dropna(axis=1).squeeze()
    print(df)
I am working with a multi-index data frame but I am having a few problems while trying to filter/update its values.
What I need:
Change 'Name 1', 'Name 2' and the others to upper case
Get all the names with value 1 in {Group 1+ A} for example
Get the list of the names in the previous step after selection (NAME 1, NAME 2, etc)
If I could also convert this MultiIndex data frame into a "normal" data frame it would be fine too.
A sample code:
import pandas as pd
sample_file = '.../Sample.xlsx'
excel_file = pd.ExcelFile(sample_file)
df = excel_file.parse(header=[0, 1], index_col=[0], sheet_name=0)
# Upper case columns
c_cols = df.columns.get_level_values(0).str.upper()
s_cols = df.columns.get_level_values(1).str.upper()
df.columns = pd.MultiIndex.from_arrays([c_cols, s_cols])
# TODO: step 1
# Step 2
valid = df[df[('GROUP 1', 'A')] == 1]
# TODO: Step 3
This is the sample file I am using: Sample file
This is a sample picture of a data frame:
Thank you!
Using your excel file:
df = pd.read_excel('Downloads/Sample.xlsx', header=[0,1], index_col=0)
df
Output:
Lists Group 1 ... Group 2
Name AR AZ CA CO CT FL GA IL IN KY ... SC SD TN TX UT VA WA WI WV WY
Name 1 NaN 1.0 1.0 1.0 NaN 1.0 NaN NaN 1 1 ... 1 NaN 1.0 1.0 1.0 1.0 1 1.0 NaN 1.0
Name 2 NaN NaN NaN NaN NaN 1.0 NaN 1.0 1 1 ... 1 NaN 1.0 NaN NaN 1.0 1 NaN NaN NaN
Name 3 NaN NaN NaN NaN NaN NaN NaN 1.0 1 1 ... 1 NaN NaN NaN NaN NaN 1 NaN NaN NaN
[3 rows x 72 columns]
To Do #1
df.index = df.index.str.upper()
df
Output:
Lists Group 1 ... Group 2
Name AR AZ CA CO CT FL GA IL IN KY ... SC SD TN TX UT VA WA WI WV WY
NAME 1 NaN 1.0 1.0 1.0 NaN 1.0 NaN NaN 1 1 ... 1 NaN 1.0 1.0 1.0 1.0 1 1.0 NaN 1.0
NAME 2 NaN NaN NaN NaN NaN 1.0 NaN 1.0 1 1 ... 1 NaN 1.0 NaN NaN 1.0 1 NaN NaN NaN
NAME 3 NaN NaN NaN NaN NaN NaN NaN 1.0 1 1 ... 1 NaN NaN NaN NaN NaN 1 NaN NaN NaN
[3 rows x 72 columns]
To Do #2
df[df.loc[:, ('Group 1', 'AZ')] == 1].index.to_list()
Output:
['NAME 1']
To Do #3
df[df.loc[:, ('Group 1', 'IL')] == 1].index.to_list()
Output:
['NAME 2', 'NAME 3']
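For the remaining wish, converting the MultiIndex frame into a "normal" one, a common trick is to join the two header levels into plain strings. A sketch on a small mocked frame (the values here are made up, not from Sample.xlsx):

```python
import pandas as pd

# hypothetical miniature of the Sample.xlsx frame
cols = pd.MultiIndex.from_product([['Group 1', 'Group 2'], ['AZ', 'IL']])
df = pd.DataFrame([[1, None, 1, None], [None, 1, None, 1]],
                  index=['NAME 1', 'NAME 2'], columns=cols)

# collapse the two column levels into single-level string names
flat = df.copy()
flat.columns = [' '.join(c) for c in flat.columns]
print(flat.columns.tolist())
```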
I can only assume what you're trying to achieve since you did not provide an input sample.
If you're trying to select and modify a specific row with a MultiIndex, you can use the .loc operator and the corresponding tuple that you specified in the MultiIndex, e.g.
df.loc['Name1', ('GROUP 1', 'A')]
Let's mock some data...
import string
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
data = np.array(list(string.ascii_lowercase))[:24].reshape((4, 6))
df = pd.DataFrame(columns=columns, index=index, data=data)
Here's our MultiIndex DataFrame:
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year visit
2013 1 a b c d e f
2 g h i j k l
2014 1 m n o p q r
2 s t u v w x
Let's select the first row and change the letters to uppercase...
df.loc[(2013, 1)].str.upper()
...and likewise for the first column...
df.loc[('Bob', 'HR')].str.upper()
...and finally we pick a specific cell
df.loc[(2014, 1), ('Guido', 'HR')].upper()
which returns
'O'
I hope that gives you an idea of how to use the .loc operator....
Consider the following dataset:
After running the code:
convert_dummy1 = convert_dummy.pivot(index='Product_Code', columns='Month', values='Sales').reset_index()
The data is in the right form, but my index column is named 'Month', and I cannot seem to remove this at all. I have tried code such as the line below, but it does not do anything.
del convert_dummy1.index.name
I can save the dataset to a csv, delete the ID column, and then read the csv back in - but there must be a more efficient way.
Dataset after reset_index():
convert_dummy1
Month Product_Code 0 1 2 3 4
0 10133.9 0 0 0 0 0
1 10146.9 120 80 60 0 100
convert_dummy1.index = pd.RangeIndex(len(convert_dummy1.index))
del convert_dummy1.columns.name
convert_dummy1
Product_Code 0 1 2 3 4
0 10133.9 0 0 0 0 0
1 10146.9 120 80 60 0 100
Since you pivot with columns="Month", each column in the output corresponds to a month. If you reset the index after the pivot, you should check the column names with convert_dummy1.columns.values, which in your case returns:
array(['Product_Code', 1, 2, 3, 4, 5], dtype=object)
while convert_dummy1.columns.names should return:
FrozenList(['Month'])
So to rename Month, use the rename_axis function:
convert_dummy1.rename_axis('index',axis=1)
Output:
index Product_Code 1 2 3 4 5
0 10133 NaN NaN NaN NaN 0.0
1 10234 NaN 0.0 NaN NaN NaN
2 10245 0.0 NaN NaN NaN NaN
3 10345 NaN NaN NaN 0.0 NaN
4 10987 NaN NaN 1.0 NaN NaN
If you wish to reproduce it, this is my code:
df1=pd.DataFrame({'Product_Code':[10133,10245,10234,10987,10345], 'Month': [1,2,3,4,5], 'Sales': [0,0,0,1,0]})
df2=df1.pivot_table(index='Product_Code', columns='Month', values='Sales').reset_index().rename_axis('index',axis=1)
I have a pandas dataframe with two id variables:
df = pd.DataFrame({'id': [1,1,1,2,2,3],
'num': [10,10,12,13,14,15],
'q': ['a', 'b', 'd', 'a', 'b', 'z'],
'v': [2,4,6,8,10,12]})
id num q v
0 1 10 a 2
1 1 10 b 4
2 1 12 d 6
3 2 13 a 8
4 2 14 b 10
5 3 15 z 12
I can pivot the table with:
df.pivot(index='id', columns='q', values='v')
And end up with something close:
q a b d z
id
1 2 4 6 NaN
2 8 10 NaN NaN
3 NaN NaN NaN 12
However, what I really want is (the original unmelted form):
id num a b d z
1 10 2 4 NaN NaN
1 12 NaN NaN 6 NaN
2 13 8 NaN NaN NaN
2 14 NaN 10 NaN NaN
3 15 NaN NaN NaN 12
In other words:
'id' and 'num' are my indices (normally I've only seen either 'id' or 'num' as the index, but I need both since I'm trying to retrieve the original unmelted form),
'q' holds my columns,
'v' holds my values in the table.
Update
I found a close solution from Wes McKinney's blog:
df.pivot_table(index=['id','num'], columns='q')
v
q a b d z
id num
1 10 2 4 NaN NaN
12 NaN NaN 6 NaN
2 13 8 NaN NaN NaN
14 NaN 10 NaN NaN
3 15 NaN NaN NaN 12
However, the format is not quite the same as what I want above.
You could use set_index and unstack
In [18]: df.set_index(['id', 'num', 'q'])['v'].unstack().reset_index()
Out[18]:
q id num a b d z
0 1 10 2.0 4.0 NaN NaN
1 1 12 NaN NaN 6.0 NaN
2 2 13 8.0 NaN NaN NaN
3 2 14 NaN 10.0 NaN NaN
4 3 15 NaN NaN NaN 12.0
You're really close, slaw. Just rename your column index to None and you've got what you want.
df2 = df.pivot_table(index=['id','num'], columns='q')
df2.columns = df2.columns.droplevel().rename(None)
df2.reset_index().fillna("null").to_csv("test.csv", sep="\t", index=None)
Note that the 'v' column is expected to be numeric by default so that it can be aggregated. Otherwise, Pandas will error out with:
DataError: No numeric types to aggregate
To resolve this, you can specify your own aggregation function by using a custom lambda function:
df2 = df.pivot_table(index=['id','num'], columns='q', aggfunc= lambda x: x)
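If the identity lambda feels opaque, aggfunc='first' behaves the same way here: each (id, num, q) combination is unique, so nothing is actually aggregated (a sketch with made-up string values):

```python
import pandas as pd

# hypothetical frame with non-numeric values in 'v'
df = pd.DataFrame({'id': [1, 1, 2], 'num': [10, 12, 13],
                   'q': ['a', 'd', 'a'], 'v': ['x', 'y', 'z']})

# 'first' passes each lone value through untouched
df2 = df.pivot_table(index=['id', 'num'], columns='q', values='v', aggfunc='first')
print(df2)
```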
You can remove the name q:
df1.columns = df1.columns.tolist()
Zero's answer + remove q =
df1 = df.set_index(['id', 'num', 'q'])['v'].unstack().reset_index()
df1.columns=df1.columns.tolist()
id num a b d z
0 1 10 2.0 4.0 NaN NaN
1 1 12 NaN NaN 6.0 NaN
2 2 13 8.0 NaN NaN NaN
3 2 14 NaN 10.0 NaN NaN
4 3 15 NaN NaN NaN 12.0
This might work just fine:
Pivot
df2 = df.pivot_table(index=['id', 'num'], columns='q', values='v').reset_index()
Concatenate the 1st-level column names with the 2nd:
df2.columns = [s1 + str(s2) for (s1, s2) in df2.columns.tolist()]
Came up with a close solution
df2 = df.pivot_table(index=['id','num'], columns='q')
df2.columns = df2.columns.droplevel()
df2.reset_index().fillna("null").to_csv("test.csv", sep="\t", index=None)
Still can't figure out how to drop 'q' from the dataframe
It can be done in three steps:
#1: Prepare an auxiliary column 'id_num':
df['id_num'] = df[['id', 'num']].apply(tuple, axis=1)
df = df.drop(columns=['id', 'num'])
#2: 'pivot' is almost an inverse of melt:
df, df.columns.name = df.pivot(index='id_num', columns='q', values='v').reset_index(), ''
#3: Bring back 'id' and 'num' columns:
df['id'], df['num'] = zip(*df['id_num'])
df = df.drop(columns=['id_num'])
This is a result, but with different order of columns:
a b d z id num
0 2.0 4.0 NaN NaN 1 10
1 NaN NaN 6.0 NaN 1 12
2 8.0 NaN NaN NaN 2 13
3 NaN 10.0 NaN NaN 2 14
4 NaN NaN NaN 12.0 3 15
Alternatively with proper order:
def multiindex_pivot(df, columns=None, values=None):
    # inspired by: https://github.com/pandas-dev/pandas/issues/23955
    names = list(df.index.names)
    df = df.reset_index()
    list_index = df[names].values
    tuples_index = [tuple(i) for i in list_index]  # hashable
    df = df.assign(tuples_index=tuples_index)
    df = df.pivot(index="tuples_index", columns=columns, values=values)
    tuples_index = df.index  # reduced
    index = pd.MultiIndex.from_tuples(tuples_index, names=names)
    df.index = index
    df = df.reset_index()  # me
    df.columns.name = ''  # me
    return df
df = df.set_index(['id', 'num'])
df = multiindex_pivot(df, columns='q', values='v')