Concat two DataFrames on missing indices - python

I have two DataFrames and want to use the second one only on the rows whose index is not already contained in the first one.
What is the most efficient way to do this?
Example:
df_1
idx val
0 0.32
1 0.54
4 0.26
5 0.76
7 0.23
df_2
idx val
1 10.24
2 10.90
3 10.66
4 10.25
6 10.13
7 10.52
df_final
idx val
0 0.32
1 0.54
2 10.90
3 10.66
4 0.26
5 0.76
6 10.13
7 0.23
Recap: I need to add the rows in df_2 for which the index is not already in df_1.
EDIT
Removed some indices from df_2 to illustrate that not all indices from df_1 are covered by df_2.

You can reindex df_1 to the union of both indexes and then use combine_first or fillna:
df = df_1.reindex(df_1.index.union(df_2.index)).combine_first(df_2)
print(df)
val
idx
0 0.32
1 0.54
2 10.90
3 10.66
4 0.26
5 0.76
6 10.13
7 0.23
df = df_1.reindex(df_1.index.union(df_2.index)).fillna(df_2)
print(df)
val
idx
0 0.32
1 0.54
2 10.90
3 10.66
4 0.26
5 0.76
6 10.13
7 0.23

You can achieve the wanted output by using the combine_first method of the DataFrame. From the documentation of the method:
Combine two DataFrame objects and default to non-null values in the frame calling the method. The resulting index and columns will be the union of the respective indexes and columns.
Example usage:
import pandas as pd
df_1 = pd.DataFrame([0.32, 0.54, 0.26, 0.76, 0.23], columns=['val'], index=[0, 1, 4, 5, 7])
df_1.index.name = 'idx'
df_2 = pd.DataFrame([10.24, 10.90, 10.66, 10.25, 10.13, 10.52], columns=['val'], index=[1, 2, 3, 4, 6, 7])
df_2.index.name = 'idx'
df_final = df_1.combine_first(df_2)
This will give the desired result:
In [7]: df_final
Out[7]:
val
idx
0 0.32
1 0.54
2 10.90
3 10.66
4 0.26
5 0.76
6 10.13
7 0.23
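Another option, sketched here as an alternative that is not part of the answers above but matches the "concat" wording of the title: concatenate both frames and keep only the first occurrence of each index label.
# Put df_1 first so its rows win for duplicated index labels,
# then drop the later duplicates and restore the index order.
df_final = pd.concat([df_1, df_2])
df_final = df_final[~df_final.index.duplicated(keep='first')].sort_index()
This gives the same df_final, since keep='first' prefers the df_1 rows wherever an index label appears in both frames.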

Related

Why Pandas groupby.apply doesn't aggregate rows in this example

I saw this example in a tutorial, and the apply function doesn't behave as I expected.
The dataframe is
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': np.arange(6),
                   'data2': [5, 0, 3, 3, 7, 9]})
key data1 data2
0 A 0 5
1 B 1 0
2 C 2 3
3 A 3 3
4 B 4 7
5 C 5 9
The passed-in function normalizes data1 by dividing it by the group sum of data2.
def normalize(group):
    group['data1'] /= group['data2'].sum()
    return group
The result is
df.groupby('key').apply(normalize).round(2)
key data1 data2
0 A 0.00 5
1 B 0.14 0
2 C 0.17 3
3 A 0.38 3
4 B 0.57 7
5 C 0.42 9
However, according to my understanding of apply, the rows of the same group should be put together, and there should be a MultiIndex of group names, since I didn't suppress group_keys. The result I expect looks like this:
key data1 data2
A 0 A 0.00 5
3 A 0.38 3
B 1 B 0.14 0
4 B 0.57 7
C 2 C 0.17 3
5 C 0.42 9
So can somebody explain why the apply function behaves strangely in this example, and what triggers this behavior?
For comparison, the following example has normal behavior:
def mysort(group):
    return group.sort_values('data2')

df.groupby('key1').apply(mysort)
key1 key2 data1 data2
key1
a 0 a one -0.8 -0.4
1 a two -2.0 0.0
4 a one 0.7 0.9
b 3 b two 1.8 -0.3
2 b one -0.8 0.6
I realize it's probably because the former function modifies the group in place, but I'm looking for an in-depth explanation of the row order and the missing index.
Thank you!
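For what it's worth, here is a minimal sketch of what appears to trigger the difference (based on how groupby.apply behaved in the pandas versions this question targets; newer versions have changed the rules around group_keys): when the applied function returns an object whose index is identical to the input group's index, apply treats the result as transform-like, stitches the pieces back in the original row order and does not prepend the group keys; when the returned index differs, as after sort_values, the group keys are added as an outer index level.
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': np.arange(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

def normalize(group):
    group = group.copy()                              # avoid mutating the original frame
    group['data1'] = group['data1'] / group['data2'].sum()
    return group                                      # index identical to the input group

def normalize_sorted(group):
    group = group.copy()
    group['data1'] = group['data1'] / group['data2'].sum()
    return group.sort_values('data2')                 # index order now differs

# Identical index -> original row order, no group-key level.
print(df.groupby('key').apply(normalize).round(2))

# Changed index order -> group keys appear as an outer index level.
print(df.groupby('key').apply(normalize_sorted).round(2))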

DataFrame Create new column after applying a function on groupby values

I have a dataframe as follows. Here is a minimal example:
d = {'Subject': [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 3],
     'Pattern': [1, 1, 2, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2],
     'Time': [0.85, 0.92, 1.03, 1.06, 0.89, 0.85, 1.20, 1.03, 1.25, 100.03, 1.97, 0.23, 0.64]}
df = pd.DataFrame(data=d)
Subject ranges from 1 to 8 and Pattern from 1 to 3. I want to create a new column where, after grouping by Subject and Pattern, I apply a function that removes outliers from the Time values of each group. I currently have a solution that works, but I was wondering whether there is a more elegant way to do it, so that I learn to interact better with DataFrames. With the example data, it should output:
Subject Pattern Time Time_2
0 1 1 0.85 0.85
1 1 1 0.92 0.92
2 1 2 1.03 1.03
3 1 2 1.06 1.06
4 2 3 0.89 0.89
5 2 3 0.85 0.85
6 3 2 1.20 1.20
7 3 2 1.03 1.03
8 3 2 1.25 1.25
9 3 2 100.03 0.00 # <---
10 3 2 1.97 1.97
11 3 2 0.23 0.23
12 3 2 0.64 0.64
My current code:
def remove_outliers(arr):
    elements = np.array(arr)
    mean = np.mean(elements)
    sd = np.std(elements)
    return [x if (mean - 2 * sd < x < mean + 2 * sd) else 0 for x in arr]

df_g = df.groupby(['Subject', 'Pattern'])['Time']
times = []
keys = list(df_g.groups.keys())
for i, l in enumerate(df_g.apply(list)):
    times.append((keys[i], remove_outliers(l)))

df['Time_2'] = 0
for k, l in times:
    vals = df[(df['Subject'] == k[0]) & (df['Pattern'] == k[1])].index.values
    df['Time_2'].iloc[vals] = l
Try this:
Use groupby + transform to get the group-wise mean and std for each row.
Use these Series to build the out-of-range condition from your function.
Then use df.mask to mask the values that lie outside this range and fill them with 0 instead.
grouper = df.groupby(['Subject', 'Pattern'])['Time']
mean = grouper.transform('mean')
std = grouper.transform('std').fillna(0)
check = (df['Time'] < (mean - 2*std)) | (df['Time'] > (mean + 2*std))
df['Time_new'] = df['Time'].mask(check).fillna(0)
print(df)
Subject Pattern Time Time_new
0 1 1 0.85 0.85
1 1 1 0.92 0.92
2 1 2 1.03 1.03
3 1 2 1.06 1.06
4 2 3 0.89 0.89
5 2 3 0.85 0.85
6 3 2 1.20 1.20
7 3 2 1.03 1.03
8 3 2 1.25 1.25
9 3 2 100.03 0.00 #<---
10 3 2 1.97 1.97
11 3 2 0.23 0.23
12 3 2 0.64 0.64
NOTE: Just to add, a 3-std condition is too wide a range for your example; use 2 std.
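A slightly more compact variant, sketched here rather than taken from the answer above, pushes the whole rule into a single groupby().transform() call; it assumes the same 2-std criterion and the same column names as the question:
def clip_outliers(s):
    # Population std (ddof=0) to mirror np.std in the question's function.
    mean, std = s.mean(), s.std(ddof=0)
    return s.where(s.between(mean - 2 * std, mean + 2 * std), 0)

df['Time_2'] = df.groupby(['Subject', 'Pattern'])['Time'].transform(clip_outliers)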

Map numeric data into bins in Pandas dataframe for separate groups using dictionaries

I have a pandas dataframe as follows:
df = pd.DataFrame(data = [[1,0.56],[1,0.59],[1,0.62],[1,0.83],[2,0.85],[2,0.01],[2,0.79],[3,0.37],[3,0.99],[3,0.48],[3,0.55],[3,0.06]],columns=['polyid','value'])
polyid value
0 1 0.56
1 1 0.59
2 1 0.62
3 1 0.83
4 2 0.85
5 2 0.01
6 2 0.79
7 3 0.37
8 3 0.99
9 3 0.48
10 3 0.55
11 3 0.06
I need to reclassify the 'value' column separately for each 'polyid'. For the reclassification I have two dictionaries: one with the bins describing how I want to cut the values for each 'polyid':
bins_dic = {1:[0,0.6,0.8,1], 2:[0,0.2,0.9,1], 3:[0,0.5,0.6,1]}
And one with the ids with which I want to label the resulting bins:
ids_dic = {1:[1,2,3], 2:[1,2,3], 3:[1,2,3]}
I tried to get this answer to work for my use case, but I could only come up with applying pd.cut to each 'polyid' subset and then using pd.concat to put all subsets back into one dataframe:
import pandas as pd

def reclass_df_dic(df, bins_dic, names_dic, bin_key_col, val_col, name_col):
    df_lst = []
    for key in df[bin_key_col].unique():
        bins = bins_dic[key]
        names = names_dic[key]
        sub_df = df[df[bin_key_col] == key]
        sub_df[name_col] = pd.cut(df[val_col], bins, labels=names)
        df_lst.append(sub_df)
    return pd.concat(df_lst)
df = pd.DataFrame(data = [[1,0.56],[1,0.59],[1,0.62],[1,0.83],[2,0.85],[2,0.01],[2,0.79],[3,0.37],[3,0.99],[3,0.48],[3,0.55],[3,0.06]],columns=['polyid','value'])
bins_dic = {1:[0,0.6,0.8,1], 2:[0,0.2,0.9,1], 3:[0,0.5,0.6,1]}
ids_dic = {1:[1,2,3], 2:[1,2,3], 3:[1,2,3]}
df = reclass_df_dic(df, bins_dic, ids_dic, 'polyid', 'value', 'id')
This results in my desired output:
polyid value id
0 1 0.56 1
1 1 0.59 1
2 1 0.62 2
3 1 0.83 3
4 2 0.85 2
5 2 0.01 1
6 2 0.79 2
7 3 0.37 1
8 3 0.99 3
9 3 0.48 1
10 3 0.55 2
11 3 0.06 1
However, the line:
sub_df[name_col] = pd.cut(df[val_col], bins, labels=names)
raises the warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
which I am unable to solve by using .loc. Also, I guess there is generally a more efficient way of doing this without having to loop over each category?
A simpler solution would be to use groupby and apply a custom function on each group. In this case, we can define a function reclass that obtains the correct bins and ids and then uses pd.cut:
def reclass(group, name):
    bins = bins_dic[name]
    ids = ids_dic[name]
    return pd.cut(group, bins, labels=ids)
df['id'] = df.groupby('polyid')['value'].apply(lambda x: reclass(x, x.name))
Result:
polyid value id
0 1 0.56 1
1 1 0.59 1
2 1 0.62 2
3 1 0.83 3
4 2 0.85 2
5 2 0.01 1
6 2 0.79 2
7 3 0.37 1
8 3 0.99 3
9 3 0.48 1
10 3 0.55 2
11 3 0.06 1
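For completeness, a sketch of how the warning in the question's own loop can be avoided (this is not part of the answer above): take an explicit .copy() of each subset and cut the subset's values rather than the full column.
def reclass_df_dic(df, bins_dic, names_dic, bin_key_col, val_col, name_col):
    df_lst = []
    for key in df[bin_key_col].unique():
        # Explicit copy, so the assignment below does not write into a view of df.
        sub_df = df[df[bin_key_col] == key].copy()
        # Cut only this subset's values instead of the whole column.
        sub_df[name_col] = pd.cut(sub_df[val_col], bins_dic[key], labels=names_dic[key])
        df_lst.append(sub_df)
    return pd.concat(df_lst)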

Generate new column based on values in another column and their index

In the df below, I want to reorder the values of column 'cdf_X' based on columns 'A' and 'X'. Columns 'X' and 'cdf_X' are connected, so if a value of 'X' appears in column 'A', the corresponding 'cdf_X' value should be placed at the index of that value in column 'A', in a new column. (Values don't occur twice in column 'cdf_A'.)
Example: 'X' = 3 at index 0 -> cdf_X = 0.05 at index 0 -> 3 appears in column 'A' at index 4 -> cdf_A at index 4 = cdf_X at index 0.
Initial df:
A X cdf_X
0 7 3 0.05
1 4 4 0.15
2 11 7 0.27
3 9 9 0.45
4 3 11 0.69
5 13 13 1.00
Desired df:
A X cdf_X cdf_A
0 7 3 0.05 0.27
1 4 4 0.15 0.15
2 11 7 0.27 0.69
3 9 9 0.45 0.45
4 3 11 0.69 0.05
5 13 13 1.00 1.00
Tried code:
import pandas as pd

df = pd.DataFrame({"A": [7, 4, 11, 9, 3, 13],
                   "cdf_X": [0.05, 0.15, 0.27, 0.45, 0.69, 1.00],
                   "X": [3, 4, 7, 9, 11, 13]})
df.loc[:, 'cdf_A'] = df['cdf_X'].where(df['A'] == df['X'])
print(df)
Check with map:
df['cdf_A'] = df.A.map(df.set_index('X')['cdf_X'])
I think you need replace:
df['cdf_A'] = df.A.replace(df.set_index('X')['cdf_X'])
Out[989]:
A X cdf_X cdf_A
0 7 3 0.05 0.27
1 4 4 0.15 0.15
2 11 7 0.27 0.69
3 9 9 0.45 0.45
4 3 11 0.69 0.05
5 13 13 1.00 1.00
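A small note on the difference between the two approaches: map returns NaN for labels of A that are missing from X, while replace leaves such labels unchanged; for this particular data both give the same result. A quick sketch to compare them:
lookup = df.set_index('X')['cdf_X']
print(df['A'].map(lookup))      # labels of A missing from X would become NaN
print(df['A'].replace(lookup))  # such labels would keep their original A value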

Merge two DataFrames based on columns and values of a specific column with Pandas in Python 3.x

Hello, I have a problem for which I am not able to implement a solution.
I have the following two DataFrames:
>>> df1
A B date
1 1 01-2016
2 1 02-2017
1 2 03-2017
2 2 04-2020
>>> df2
A B 01-2016 02-2017 03-2017 04-2020
1 1 0.10 0.22 0.55 0.77
2 1 0.20 0.12 0.99 0.125
1 2 0.13 0.15 0.15 0.245
2 2 0.33 0.1 0.888 0.64
What I want is the following DataFrame:
>>> df3
A B date value
1 1 01-2016 0.10
2 1 02-2017 0.12
1 2 03-2017 0.15
2 2 04-2020 0.64
I already tried the following:
summarize_dates = self.summarize_specific_column(data=df1, column='date')
for date in summarize_dates:
    left_on = np.append(left_on, date)
    right_on = np.append(right_on, merge_columns.upper())

result = pd.merge(left=df2, right=df1,
                  left_on=left_on, right_on=right_on,
                  how='right')
print(result)
This does not work. Can you help me and suggest a more comfortable implementation? Many thanks in advance!
You can melt df2 and then merge using the default 'inner' merge:
df3 = df1.merge(df2.melt(id_vars = ['A', 'B'], var_name='date'))
A B date value
0 1 1 01-2016 0.10
1 2 1 02-2017 0.12
2 1 2 03-2017 0.15
3 2 2 04-2020 0.64
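To make the intermediate step concrete, a small sketch using the question's data: melt turns every date column of df2 into rows, and the default inner merge with df1 then keeps only the (A, B, date) combinations that appear in df1.
# Long format of df2: one row per (A, B, date) combination.
long_df = df2.melt(id_vars=['A', 'B'], var_name='date')
df3 = df1.merge(long_df)  # inner merge on the shared columns A, B and date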
Using lookup:
df1['value'] = df2.set_index(['A', 'B']).lookup(df1.set_index(['A', 'B']).index, df1.date)
df1
Out[228]:
A B date value
0 1 1 01-2016 0.10
1 2 1 02-2017 0.12
2 1 2 03-2017 0.15
3 2 2 04-2020 0.64
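One caveat: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so on newer versions you can use the melt approach above or a position-based lookup along these lines (a sketch that rebuilds the question's frames):
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 1, 2], 'B': [1, 1, 2, 2],
                    'date': ['01-2016', '02-2017', '03-2017', '04-2020']})
df2 = pd.DataFrame({'A': [1, 2, 1, 2], 'B': [1, 1, 2, 2],
                    '01-2016': [0.10, 0.20, 0.13, 0.33],
                    '02-2017': [0.22, 0.12, 0.15, 0.10],
                    '03-2017': [0.55, 0.99, 0.15, 0.888],
                    '04-2020': [0.77, 0.125, 0.245, 0.64]})

# Find the df2 row matching each (A, B) pair and pick the column named by df1['date'].
wide = df2.set_index(['A', 'B'])
rows = wide.index.get_indexer(pd.MultiIndex.from_frame(df1[['A', 'B']]))
cols = wide.columns.get_indexer(df1['date'])
df1['value'] = wide.to_numpy()[rows, cols]
print(df1)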
