How to compare two dataframes and flag values with 0/1?
df1
L_ID L_Values
1 20-25
2 30-35
3 25
4 45
5 30-45
df2
Each L_ID in df1 corresponds to one of the columns 1, 2, 3, 4, 5 in df2.
Name 1 2 3 4 5
John 25 25 20 30 45
Zara 20 NaN NaN 25 30
Kim NaN NaN NaN 45 50
I would like to check whether each value in df2 falls within the corresponding range in df1 (yes = 0, no = 1).
Expect output in df3
Name 1 2 3 4 5
John 0 1 1 1 0
Zara 0 1 1 1 0
Kim 1 1 1 0 1
The following should work. Just take care with the NaN condition: val == val is False only when val is NaN, because NaN is not equal to itself. If your missing values are instead the string 'Nan', replace that check with df3[col].iloc[i] != 'Nan'.
The code is:
d = {}
for i in range(len(df1)):
    temp = df1.L_Values.iloc[i].split('-')
    if len(temp) == 2:
        d[df1.L_ID.iloc[i]] = [float(temp[0]), float(temp[1])]
    else:
        d[df1.L_ID.iloc[i]] = [float(temp[0]), float(temp[0])]

df3 = df2.copy()
for i in range(len(df3)):
    for col in df3.columns:
        if col == 'Name':  # only the numbered columns map to ranges in d
            continue
        val = df3[col].iloc[i]
        # val == val is False only for NaN, so NaN cells fall through to 1
        if val == val and d[col][0] <= val <= d[col][1]:
            df3.iloc[i, df3.columns.get_loc(col)] = 0
        else:
            df3.iloc[i, df3.columns.get_loc(col)] = 1
print(df3)
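For larger frames, a vectorized sketch of the same comparison (assuming df1 and df2 as above, and that df2's numbered column labels line up with the L_ID values in df1):

import pandas as pd

# Parse each range into numeric lower/upper bounds; single values get lo == hi
bounds = df1['L_Values'].str.split('-', expand=True).astype(float)
lo = bounds[0].set_axis(list(df1['L_ID']))
hi = bounds[1].fillna(bounds[0]).set_axis(list(df1['L_ID']))

vals = df2.set_index('Name')
# Any comparison against NaN is False, so NaN cells end up flagged 1
in_range = vals.ge(lo, axis=1) & vals.le(hi, axis=1)
df3 = (~in_range).astype(int).reset_index()
print(df3)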
Customer Material ID Bill Quantity
0 1 64578 100
1 2 64579 58
2 3 64580 36
3 4 64581 45
4 5 64582 145
We have to concatenate the Material ID at index 0 with the Material ID at index 1 and put the result into the record at index 0; similarly for indices 1 and 2, 3 and 4, and so on.
The result should contain only the concatenated records.
Just shift the data and combine the columns (note that while the IDs are numeric, + adds them; see below for string concatenation).
df.assign(new_ID=df["Material ID"] + df.shift(-1)["Material ID"])
   Customer  Material ID  Bill Quantity    new_ID
0         1        64578            100  129157.0
1         2        64579             58  129159.0
2         3        64580             36  129161.0
3         4        64581             45  129163.0
4         5        64582            145       NaN
If you need to concatenate it as a str type then the following would work.
df["Material ID"] = df["Material ID"].astype(str)
df.assign(new_ID=df["Material ID"] + df.shift(-1)["Material ID"])
   Customer Material ID  Bill Quantity      new_ID
0         1       64578            100  6457864579
1         2       64579             58  6457964580
2         3       64580             36  6458064581
3         4       64581             45  6458164582
4         5       64582            145         NaN
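Since the question asks for only the concatenated records, a minimal follow-up sketch (assuming the df above) drops the trailing row that has no partner to pair with:

result = (
    df.assign(new_ID=df["Material ID"] + df["Material ID"].shift(-1))
      .dropna(subset=["new_ID"])
)
print(result)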
I've seen other answered questions similar to this one, but to my knowledge I have yet to find one that does exactly what I am looking for. I have 2 pandas dataframes: df1, which has 3 columns (ID, A, and B), and df2, which has 4 columns (ID, C, D, and E).
df1 has the following rows:
ID A B
0 1 200 0.5
1 1 201 0.5
2 2 99 1.1
And df2 has the following rows:
ID C D E
0 1 50 1.1250 0
1 1 52 1.1300 0
2 1 50 1.1200 0
3 2 25 0.6667 20
4 2 24 0.6667 20
I want to merge df1 and df2 on the ID column such that if a pair of rows from each dataframe has a matching ID, we combine them into a single row. Notice that the dataframes are not the same size. If one dataframe has a row with no more available matches from the other dataframe, then we fill in the missing data with NaN. How can I accomplish this merge in pandas?
So far, I have tried variations of the function pd.merge(df1, df2, on='ID', how='...'), but no matter whether I put how='left', 'right', 'outer', or 'inner', I get a wrong result: a dataframe with 8 rows, because every row with a given ID in df1 is paired with every row with that ID in df2. Below is the desired result.
Desired result:
ID A B C D E
0 1 200 0.5 50 1.1250 0
1 1 201 0.5 52 1.1300 0
2 1 NaN NaN 50 1.1200 0
3 2 99 1.1 25 0.6667 20
4 2 NaN NaN 24 0.6667 20
You need to number the occurrences of each ID using groupby on ID plus cumcount, so that the first ID-1 row in df1 joins with the first ID-1 row in df2, the second with the second, and so on; the same happens for ID 2 and every other ID in both dataframes. Then merge on both ID and key with how='outer'.
df1k = df1.assign(key=df1.groupby('ID').cumcount())
df2k = df2.assign(key=df2.groupby('ID').cumcount())
df_out = df1k.merge(df2k, on=['ID','key'], how='outer').sort_values('ID')
Output:
ID A B key C D E
0 1 200.0 0.5 0 50 1.1250 0
1 1 201.0 0.5 1 52 1.1300 0
3 1 NaN NaN 2 50 1.1200 0
2 2 99.0 1.1 0 25 0.6667 20
4 2 NaN NaN 1 24 0.6667 20
And you can also drop the 'key' column:
df_out.drop('key', axis=1)
Output:
ID A B C D E
0 1 200.0 0.5 50 1.1250 0
1 1 201.0 0.5 52 1.1300 0
3 1 NaN NaN 50 1.1200 0
2 2 99.0 1.1 25 0.6667 20
4 2 NaN NaN 24 0.6667 20
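If the row numbering of the desired result matters, a small cosmetic sketch (assuming df1k and df2k from above) sorts on both keys before dropping 'key' and resets the index:

df_out = (
    df1k.merge(df2k, on=['ID', 'key'], how='outer')
        .sort_values(['ID', 'key'])
        .drop('key', axis=1)
        .reset_index(drop=True)
)
print(df_out)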
My input dataframe:
   MinA  MinB  MaxA  MaxB
0   1.0   2.0   5.0   7.0
1   1.0   0.0   8.0   6.0
2   2.0   NaN  15.0  15.0
3   NaN   3.0   NaN   NaN
4   NaN   NaN   NaN  10.0
I want to merge the "Min" columns into one column and the "Max" columns into another, giving the A columns priority over the B columns.
If both columns are null, default values should be used: 0 for Min and 100 for Max.
The desired output is:
   MinA  MinB  MaxA  MaxB  Min  Max
0   1.0   2.0   5.0   7.0    1    5
1   1.0   0.0   8.0   6.0    1    8
2   2.0   NaN  15.0  15.0    2   15
3   NaN   3.0   NaN   NaN    3  100
4   NaN   NaN   NaN  10.0    0   10
Could you please help me with this?
This can be accomplished using mask. With your data that would look like the following:
df = pd.DataFrame({
'MinA': [1,1,2,None,None],
'MinB': [2,0,None,3,None],
'MaxA': [5,8,15,None,None],
'MaxB': [7,6,15,None,10],
})
# Create the new column, using A as the base; where it is NaN, fall back to B.
# Then apply the default value in the same way.
df['Min'] = df['MinA'].mask(pd.isna, df['MinB']).mask(pd.isna, 0)
df['Max'] = df['MaxA'].mask(pd.isna, df['MaxB']).mask(pd.isna, 100)
The above would result in the desired output:
MinA MinB MaxA MaxB Min Max
0 1 2 5 7 1 5
1 1 0 8 6 1 8
2 2 NaN 15 15 2 15
3 NaN 3 NaN NaN 3 100
4 NaN NaN NaN 10 0 10
Just using fillna() will be fine:
df['Min'] = df['MinA'].fillna(df['MinB']).fillna(0)
df['Max'] = df['MaxA'].fillna(df['MaxB']).fillna(100)
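For completeness, the same fallback chain can also be spelled with combine_first (a sketch under the same assumptions as the answers above):

df['Min'] = df['MinA'].combine_first(df['MinB']).fillna(0)
df['Max'] = df['MaxA'].combine_first(df['MaxB']).fillna(100)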
I have dataframe which looks like below:
Name width height breadth
0 1 13 90 2
1 2 101 45 1
2 3 78 6 1
3 5 11 34 1
4 6 23 8 2
As seen, Name is not a continuous sequence; there are missing entries in between.
I want to shift the width and height values down one row wherever the Names are consecutive; where they are not, the width and height of the row should become NaN.
I tried the below code:
diff = data['Name'].diff()
and then tried to do a groupby using the resulting diff values, but it did not work.
I am expecting a result like below:
   Name  width  height  breadth
0     1    NaN     NaN        2
1     2     13      90        1
2     3    101      45        1
3     5    NaN     NaN        1
4     6     11      34        2
Create a helper Series for the groups using Series.diff, compare it with Series.ne, accumulate with Series.cumsum, and pass the result to DataFrameGroupBy.shift:
diff = data['Name'].diff().ne(1).cumsum()
data[['width', 'height']] = data.groupby(diff)[['width', 'height']].shift()
print(data)
Name width height breadth
0 1 NaN NaN 2
1 2 13.0 90.0 1
2 3 101.0 45.0 1
3 5 NaN NaN 1
4 6 11.0 34.0 2
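To see why the helper groups the rows correctly, these are the intermediate values for the sample data (illustration only):

# data['Name'].diff()  ->  NaN, 1, 1, 2, 1
# .ne(1)               ->  True, False, False, True, False
# .cumsum()            ->  1, 1, 1, 2, 2   (rows 0-2 form group 1, rows 3-4 group 2)
# The shift then happens within each group, so the first row of each run gets NaN.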
You could use a temporary dataframe to add the missing rows and shift the values (only width and height, so that breadth stays in place):
import numpy as np

temp = pd.DataFrame({'Name': np.arange(
    data.Name.min(), data.Name.max() + 1)}).merge(data, on='Name', how='left')
temp[['width', 'height']] = temp[['width', 'height']].shift()
result = pd.DataFrame(data.Name).merge(temp, on='Name')
Because temp contains one row for every Name in the full range, a plain shift moves each value exactly one Name later, and the final merge drops the filler rows again.
I am working with a dataframe that looks like this.
   id  time  diff
0   0    34   NaN
1   0    36     2
2   1    43     7
3   1    55    12
4   1    59     4
5   2     2   -57
6   2    10     8
What is an efficient way to find the minimum value of 'time' for each id and then set 'diff' to NaN at those minimum values? I am looking for a solution that results in:
   id  time  diff
0   0    34   NaN
1   0    36     2
2   1    43   NaN
3   1    55    12
4   1    59     4
5   2     2   NaN
6   2    10     8
Use groupby('id') with idxmin to find the index of the minimum 'time' value in each group, then use loc to assign np.nan:
import numpy as np

df.loc[df.groupby('id').time.idxmin(), 'diff'] = np.nan
df
You can group time by id and compute a boolean vector that is True where the time is the minimum within its group, then use that vector to assign NaN to the corresponding rows:
import numpy as np
import pandas as pd
df.loc[df.groupby('id')['time'].apply(lambda g: g == min(g)), "diff"] = np.nan
df
# id time diff
#0 0 34 NaN
#1 0 36 2.0
#2 1 43 NaN
#3 1 55 12.0
#4 1 59 4.0
#5 2 2 NaN
#6 2 10 8.0
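An equivalent spelling marks the minima with transform instead of a lambda (a sketch assuming the same df; note that unlike idxmin, it flags every tied minimum rather than only the first occurrence):

import numpy as np

is_min = df['time'] == df.groupby('id')['time'].transform('min')
df.loc[is_min, 'diff'] = np.nan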