Drop dataframe rows which are missing in an intersection with another dataframe - python

I have a dataframe such as:
name value_1 ... value_n
1 a 11.5 ... 13.2
2 b 11.5 ... 17.9
3 a 10.0 ... 21.3
4 a 9.5 ... 11.1
5 b 10.0 ... 7.2
6 a 10.5 ... 3.0
I grouped by name, so I have now two dataframes:
name value_1 ... value_n
1 a 11.5 ... 13.2
3 a 10.0 ... 21.3
4 a 9.5 ... 11.1
6 a 10.5 ... 3.0
name value_1 ... value_n
2 b 11.5 ... 17.9
5 b 10.0 ... 7.2
Then, I want to keep only those entries whose value_1 is in both dataframes. I don't care about the other columns. My attempts:
using isin -> does not work, because it requires all the columns to contain the same data
intersection: pd.merge(group_a, group_b, how='inner', on=['value_1']), which kind of works, but results in a dataframe containing the columns of both frames merged, such as value_n_x and value_n_y, which does not fit my needs
Any other ideas?

I think you can merge subsets of both dataframes:
print(group_a)
name value_1 value_n
1 a 11.5 13.2
3 a 10.0 21.3
3 a 10.0 21.3
4 a 9.5 1.1
6 a 10.5 3.0
print(group_b)
name value_1 value_n
2 b 11.5 17.9
5 b 10.0 7.2
print(pd.merge(group_a[['value_1']], group_b[['value_1']], how='inner', on=['value_1']))
value_1
0 11.5
1 10.0
2 10.0
A second solution uses numpy.intersect1d and loc with isin:
import numpy as np

inter = np.intersect1d(group_a['value_1'], group_b['value_1'])
print(inter)
[ 10. 11.5]
mask1 = group_a['value_1'].isin(inter)
mask2 = group_b['value_1'].isin(inter)
print(group_a.loc[mask1])
name value_1 value_n
1 a 11.5 13.2
3 a 10.0 21.3
3 a 10.0 21.3
print(group_b.loc[mask2])
name value_1 value_n
2 b 11.5 17.9
5 b 10.0 7.2
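If the goal is to filter the original dataframe in one step rather than each group separately, a minimal sketch (assuming the original frame is called df and has the name/value_1 columns from the question) intersects the per-group value sets and filters once:

import pandas as pd

# Hedged sketch: keep only rows of df whose value_1 occurs in every
# name group; generalizes to any number of groups.
common = set.intersection(*(set(v) for _, v in df.groupby('name')['value_1']))
result = df[df['value_1'].isin(common)]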

Related

Find a value in a column as a function of another column

Assuming that the value exists, how can I, for example, create another column "testFinal" in the dataframe that holds the absolute value of the difference between df["test"] and the value of df["test"] 0.2 seconds later?
For example, the first value of testFinal is the absolute difference between 2 and the value 0.2 seconds later, which is 8, so the result is abs(2-8) = 6.
My goal is to calculate "testFinal".
I don't know if it's clear, so here is the example.
NB: the Timestamp is not homogeneous, so the interval between two values can differ over time.
Thanks a lot.
Here is the code for the dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame({'Timestamp': [11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8, 11.9, 12.0, 12.1],
                   'test': [2, 22, 8, 4, 5, 4, 5, 3, 54, 23, 89],
                   'testFinal': [6, 18, 3, 0, 0, 1, 49, 20, 35, np.nan, np.nan]})
First, create a temporary column temp by converting the Timestamp column to timedelta with pd.to_timedelta, and set it as the dataframe index. Then build testFinal as this new index shifted forward by 0.2 seconds, and use Series.map to map those shifted timestamps to the values of df['test'], so testFinal now holds the test value from 0.2 s later. Finally, subtract the test column from testFinal and take the absolute value:
df['temp'] = pd.to_timedelta(df['Timestamp'], unit='s')
df = df.set_index('temp')
df['testFinal'] = df.index + pd.Timedelta(seconds=0.2)  # timestamp 0.2 s ahead
df['testFinal'] = df['testFinal'].map(df['test']).sub(df['test']).abs()  # look up, subtract, abs
df = df.reset_index(drop=True)
print(df)
Timestamp test testFinal
0 11.1 2 6.0
1 11.2 22 18.0
2 11.3 8 3.0
3 11.4 4 0.0
4 11.5 5 0.0
5 11.6 4 1.0
6 11.7 5 49.0
7 11.8 3 20.0
8 11.9 54 35.0
9 12.0 23 NaN
10 12.1 89 NaN
You could use numpy as follows; I created a new column test_final to compare with the expected testFinal column. Note that the fixed shift of two positions matches a 0.2 s offset only while the samples are spaced 0.1 s apart:
import numpy as np

test = df.test.values
# shift the array two positions ahead and pad the tail with NaN
df['test_final'] = np.abs(test - np.concatenate((test[2:], np.array([np.nan] * 2)), axis=0))
print(df)
Output:
Timestamp test testFinal test_final
0 11.1 2 6.0 6.0
1 11.2 22 18.0 18.0
2 11.3 8 3.0 3.0
3 11.4 4 0.0 0.0
4 11.5 5 0.0 0.0
5 11.6 4 1.0 1.0
6 11.7 5 49.0 49.0
7 11.8 3 20.0 20.0
8 11.9 54 35.0 35.0
9 12.0 23 NaN NaN
10 12.1 89 NaN NaN
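Since the question notes that the Timestamp column is not homogeneous, a fixed two-row shift can pick the wrong row. A minimal sketch of a position-independent lookup, assuming the df defined above (the 0.05 s tolerance is an illustrative assumption), uses pd.merge_asof to find the row nearest to Timestamp + 0.2 s:

import pandas as pd

# Hedged sketch: look up the test value closest to Timestamp + 0.2 s,
# so irregular sampling intervals are handled correctly.
left = df.assign(target=df['Timestamp'] + 0.2).sort_values('target')
right = df[['Timestamp', 'test']].rename(columns={'Timestamp': 'ts_later', 'test': 'test_later'})
out = pd.merge_asof(left, right.sort_values('ts_later'),
                    left_on='target', right_on='ts_later',
                    direction='nearest', tolerance=0.05)
out['testFinal'] = (out['test_later'] - out['test']).abs()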

Write previous entries of a time series into additional columns

I have a dataframe that contains values for individual days:
day value
1 10.1
2 15.4
3 12.1
4 14.1
5 -9.7
6 2.0
8 3.4
There is not necessarily a value for each day (day 7 is missing in my example), but there is never more than one value per day.
I want to add additional columns to this dataframe, containing per row the value of the day before, the value of two days ago, the value of three days ago etc. The result would be:
day value value-of-1 value-of-2 value-of-3
1 10.1 NaN NaN NaN
2 15.4 10.1 NaN NaN
3 12.1 15.4 10.1 NaN
4 14.1 12.1 15.4 10.1
5 -9.7 14.1 12.1 15.4
6 2.0 -9.7 14.1 12.1
8 3.4 NaN 2.0 -9.7
At the moment, I add to the original dataframe a column containing the required day and then merge the original dataframe with itself, using this new column as the join condition. After some reorganizing of the columns, I get my result:
data = [[1, 10.1], [2, 15.4], [3, 12.1], [4, 14.1], [5, -9.7], [6, 2.0], [8, 3.4]]
df = pd.DataFrame(data, columns=['day', 'value'])

def add_column_for_prev_day(df, day):
    df[f"day-{day}"] = df["day"] - day
    df = df.merge(df[["day", "value"]], how="left", left_on=f"day-{day}", right_on="day", suffixes=("", "_r")) \
           .drop(["day_r", f"day-{day}"], axis=1) \
           .rename({"value_r": f"value-of-{day}"}, axis=1)
    return df

df = add_column_for_prev_day(df, 1)
df = add_column_for_prev_day(df, 2)
df = add_column_for_prev_day(df, 3)
I wonder if there is a better and faster way to get the same result, especially without having to merge the dataframe over and over again.
A simple shift does not help as there are days without data.
You can use:
m = df.set_index('day').reindex(range(df['day'].min(), df['day'].max() + 1))

for i in [1, 2, 3]:
    m[f"value_of_{i}"] = m['value'].shift(i)

m.reindex(df.day).reset_index()
day value value_of_1 value_of_2 value_of_3
0 1 10.1 NaN NaN NaN
1 2 15.4 10.1 NaN NaN
2 3 12.1 15.4 10.1 NaN
3 4 14.1 12.1 15.4 10.1
4 5 -9.7 14.1 12.1 15.4
5 6 2.0 -9.7 14.1 12.1
6 8 3.4 NaN 2.0 -9.7
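A minimal variant of the same idea, assuming the df from the question, builds all lag columns in a single assign call instead of a loop:

# Hedged sketch: reindex to a full day range so shift() respects the
# gap at day 7, add the lag columns in one go, then cut back to the
# original days.
full = df.set_index('day').reindex(range(df['day'].min(), df['day'].max() + 1))
lags = {f"value_of_{i}": full['value'].shift(i) for i in (1, 2, 3)}
result = full.assign(**lags).reindex(df['day']).reset_index()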

Combine two columns with if condition in pandas

I have two columns whose data overlaps for some entries (and is nearly identical when it does).
df = pd.DataFrame(
    {'x': [2.1, 3.1, 5.4, 1.9, np.nan, 4.3, np.nan, np.nan, np.nan],
     'y': [np.nan, np.nan, 5.3, 1.9, 3.2, 4.2, 9.1, 7.8, 4.1]}
)
I want the result to be a column 'xy' which contains the average of x and y when both have values, and x or y when only one of them has a value, like this:
df['xy']=[2.1,3.1,5.35,1.9,3.2,4.25,9.1,7.8,4.1]
Here you go:
Solution
df['xy'] = df[['x','y']].mean(axis=1)
Output
print(df.to_string())
x y xy
0 2.1 NaN 2.10
1 3.1 NaN 3.10
2 5.4 5.3 5.35
3 1.9 1.9 1.90
4 NaN 3.2 3.20
5 4.3 4.2 4.25
6 NaN 9.1 9.10
7 NaN 7.8 7.80
8 NaN 4.1 4.10
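This works because mean skips NaN by default (skipna=True), so a row with a single value averages to that value. For comparison, a minimal sketch of the explicit if-style version the title asks about, assuming the df above:

import numpy as np

# average where both columns have values, otherwise take whichever exists
both = df['x'].notna() & df['y'].notna()
df['xy'] = np.where(both, (df['x'] + df['y']) / 2, df['x'].fillna(df['y']))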

Concatenate multiple unequal dataframes on condition

I have 7 dataframes (df_1, df_2, df_3, ..., df_7), all with the same columns but different lengths, which sometimes contain the same values.
I'd like to concatenate all 7 dataframes under the conditions that:
if df_n.iloc[row_i] != df_n+1.iloc[row_i] and df_n.iloc[row_i][0] < df_n+1.iloc[row_i][0]:
    pd.concat([df_n.iloc[row_i], df_n+1.iloc[row_i], df_n+2.iloc[row_i], ..., df_n+6.iloc[row_i]])
where df_n.iloc[row_i] is the ith row of the nth dataframe and df_n.iloc[row_i][0] is the first column of the ith row.
For example, if we had only 2 dataframes with len(df_1) < len(df_2) and applied the conditions above, the input would be:
df_1                        df_2
index      0     1   2      index     0     1   2
0      12.12  11.0  31      0      12.2  12.6  30
1      12.3   12.1  33      1      12.3  12.1  33
2      10     9.1   33      2      13    12.1  23
3      16     12.1  33      3      13.1  12.1  27
                            4      14.4  13.1  27
                            5      15.2  13.2  28
And the output would be:
conditions -> pd.concat([df_1, df_2]):
index      0     1   2     3     4   5
0      12.12  11.0  31  12.2  12.6  30
2      10     9.1   33  13    12.1  23
4        NaN   NaN NaN  14.4  13.1  27
5        NaN   NaN NaN  15.2  13.2  28
Is there an easy way to do this?
IIUC, concat first, then groupby over the columns to get the differences, and implement your condition:
s = pd.concat([df_1, df_2], axis=1)
# per column label, the difference between the df_1 and df_2 values
s1 = s.groupby(level=0, axis=1).apply(lambda x: x.iloc[:, 0] - x.iloc[:, 1])
# keep rows where the frames differ and df_1's first column is smaller,
# plus rows that only exist in df_2
yourdf = s[s1.ne(0).any(axis=1) & s1.iloc[:, 0].lt(0) | s1.iloc[:, 0].isnull()]
Output:
0 1 2 0 1 2
index
0 12.12 11.0 31.0 12.2 12.6 30
2 10.00 9.1 33.0 13.0 12.1 23
4 NaN NaN NaN 14.4 13.1 27
5 NaN NaN NaN 15.2 13.2 28
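For reference, a hedged reconstruction of the two example frames from the question, so the snippet above can be run as-is (column labels and the index name follow the question's tables):

import pandas as pd

df_1 = pd.DataFrame({0: [12.12, 12.3, 10, 16],
                     1: [11.0, 12.1, 9.1, 12.1],
                     2: [31, 33, 33, 33]}).rename_axis('index')
df_2 = pd.DataFrame({0: [12.2, 12.3, 13, 13.1, 14.4, 15.2],
                     1: [12.6, 12.1, 12.1, 12.1, 13.1, 13.2],
                     2: [30, 33, 23, 27, 27, 28]}).rename_axis('index')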

Transposing a pandas col into a row, using another col as an index

I'm looking to transpose a column within a dataframe so that it becomes a row, while using another column as the index. Specifically, I need all ColB values where ColA == 1.0 to become the values of the first output row, and all ColB values where ColA == 2.0 to become the values of the second row.
i.e. I need to turn:
index ColA ColB
0 1.0 1.1
1 1.0 12.2
2 1.0 4.5
3 2.0 5.1
4 2.0 7.7
5 2.0 9.5
into ...
ColB    0     1    2
ColA
1.0   1.1  12.2  4.5
2.0   5.1   7.7  9.5
------ Update #1 --------
In reference to the answer provided by @Scott_Boston:
df.groupby('ColA').apply(lambda x: x.reset_index().ColB)
seems to give me:
ColA
1.0  0     1.1
     1    12.2
     2     4.5
2.0  0     5.1
     1     7.7
     2     9.5
df.groupby('ColA').ColB.apply(list).apply(pd.Series).rename_axis('ColB', axis=1)
Output:
ColB 0 1 2
ColA
1.0 1.1 12.2 4.5
2.0 5.1 7.7 9.5
Let's use groupby, apply, and reset_index:
df.groupby('ColA').apply(lambda x: x.reset_index().ColB)
Output:
ColB 0 1 2
ColA
1.0 1.1 12.2 4.5
2.0 5.1 7.7 9.5
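If groupby/apply returns a stacked Series rather than the wide frame (as happened in Update #1; the behaviour depends on the pandas version), a minimal sketch of a shape-stable alternative numbers the rows within each group and pivots:

# Hedged sketch: cumcount gives each row its position within its ColA
# group; pivoting on that position reproduces the table above.
out = (df.assign(pos=df.groupby('ColA').cumcount())
         .pivot(index='ColA', columns='pos', values='ColB')
         .rename_axis('ColB', axis=1))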
