Assume the following:
df1:
x y z
1 10 11
2 20 22
3 30 33
4 40 44
1 20 21
1 30 31
1 40 41
2 10 12
2 30 32
2 40 42
3 10 31
3 20 23
3 40 43
4 10 14
4 20 24
4 30 34
df2:
x b
1 100
2 200
df3:
y c
10 1000
20 2000
I want all rows from df1 for which either x appears in df2.x or y appears in df3.y, meaning in this case:
out:
x y z
1 10 11
2 20 22
1 20 21
1 30 31
1 40 41
2 10 12
2 30 32
2 40 42
3 10 31
3 20 23
4 10 14
4 20 24
I would like to do this in pure pandas, with no for loops. It seems standard enough to me, but I don't really know what to search for.
You can use isin for both cases, chain the conditions with a bitwise OR, and perform boolean indexing on the dataframe with the result:
df1[df1.x.isin(df2.x) | df1.y.isin(df3.y)]
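For reference, a minimal runnable sketch with the question's data rebuilt inline:

import pandas as pd

df1 = pd.DataFrame({'x': [1, 2, 3, 4, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
                    'y': [10, 20, 30, 40, 20, 30, 40, 10, 30, 40, 10, 20, 40, 10, 20, 30],
                    'z': [11, 22, 33, 44, 21, 31, 41, 12, 32, 42, 31, 23, 43, 14, 24, 34]})
df2 = pd.DataFrame({'x': [1, 2], 'b': [100, 200]})
df3 = pd.DataFrame({'y': [10, 20], 'c': [1000, 2000]})

# keep rows whose x appears in df2.x OR whose y appears in df3.y
out = df1[df1.x.isin(df2.x) | df1.y.isin(df3.y)]
print(out)  # 12 rows, matching the expected output above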
I have a dataset where I would like to sum two columns and then perform a subtraction while displaying a cumulative sum.
Data
id date t1 t2 total start cur_t1 cur_t2 final_o finaldb de_t1 de_t2
a q122 4 1 5 50 25 20 55 21 1 1
a q222 1 1 2 50 25 20 57 22 0 0
a q322 0 0 0 50 25 20 57 22 5 5
b q122 5 5 10 100 30 40 110 27 4 4
b q222 2 2 4 100 30 70 114 29 5 1
b q322 3 4 7 100 30 70 121 33 0 1
Desired
id date t1 t2 total start cur_t1 cur_t2 final_o finaldb de_t1 de_t2 finalt1
a q122 4 1 5 50 25 20 55 21 1 1 28
a q222 1 1 2 50 25 20 57 22 0 0 29
a q322 0 0 0 50 25 20 57 22 5 5 24
b q122 5 5 10 100 30 40 110 27 4 4 31
b q222 2 2 4 100 30 70 114 29 5 1 28
b q322 3 4 7 100 30 70 121 33 0 1 31
Logic
Create the 'finalt1' column by adding 'cur_t1' to the cumulative sum of 't1', then subtracting the cumulative sum of 'de_t1', grouping by 'id' and 'date'.
Doing
df['finalt1'] = df['cur_t1'].add(df.groupby('id')['t1'].cumsum())
I am still researching on how to subtract the 'de_t1' column cumulatively.
I can't test right now, but logically:
df['finalt1'] = (df['cur_t1'].add(df.groupby('id')['t1'].cumsum())
                 .sub(df.groupby('id')['de_t1'].cumsum())
                )
Of note, there is also this option to avoid grouping twice (it calculates both cumsums at once and takes the difference), but it is actually slower:
df['cur_t1'].add(df.groupby('id')[['de_t1', 't1']].cumsum().diff(axis=1)['t1'])
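For reference, a self-contained sketch on the question's data (rebuilt inline); the result matches the 'finalt1' column of the desired output:

import pandas as pd

df = pd.DataFrame({'id':     ['a', 'a', 'a', 'b', 'b', 'b'],
                   'date':   ['q122', 'q222', 'q322', 'q122', 'q222', 'q322'],
                   't1':     [4, 1, 0, 5, 2, 3],
                   'cur_t1': [25, 25, 25, 30, 30, 30],
                   'de_t1':  [1, 0, 5, 4, 5, 0]})

# cur_t1 + running total of t1, minus running total of de_t1, per id
df['finalt1'] = (df['cur_t1']
                 .add(df.groupby('id')['t1'].cumsum())
                 .sub(df.groupby('id')['de_t1'].cumsum()))
print(df['finalt1'].tolist())  # [28, 29, 24, 31, 28, 31]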
I have 2 dataframes, df1 and df2, and df2 holds the min and max values for the corresponding columns.
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.randint(0,50,size=(10, 5)), columns=list('ABCDE'))
df2 = pd.DataFrame(np.array([[5,3,4,7,2],[30,20,30,40,50]]),columns=list('ABCDE'))
I would like to iterate through df1 and replace the cell values with those of df2 when the df1 cell value is below/above the respective columns' min/max values.
First, don't loop/iterate in pandas when a better, vectorized solution exists, like here.
Use numpy.select with broadcasting to set values by condition:
np.random.seed(123)
df1 = pd.DataFrame(np.random.randint(0,50,size=(10, 5)), columns=list('ABCDE'))
df2 = pd.DataFrame(np.array([[5,3,4,7,2],[30,20,30,40,50]]),columns=list('ABCDE'))
print (df1)
A B C D E
0 45 2 28 34 38
1 17 19 42 22 33
2 32 49 47 9 32
3 46 32 47 25 19
4 14 36 32 16 4
5 49 3 2 20 39
6 2 20 47 48 7
7 41 35 28 38 33
8 21 30 27 34 33
print (df2)
A B C D E
0 5 3 4 7 2
1 30 20 30 40 50
#for pandas below 0.24 change .to_numpy() to .values
min1 = df2.loc[0].to_numpy()
max1 = df2.loc[1].to_numpy()
arr = df1.to_numpy()
df = pd.DataFrame(np.select([arr < min1, arr > max1], [min1, max1], arr),
index=df1.index,
columns=df1.columns)
print (df)
A B C D E
0 30 3 28 34 38
1 17 19 30 22 33
2 30 20 30 9 32
3 30 20 30 25 19
4 14 20 30 16 4
5 30 3 4 20 39
6 5 20 30 40 7
7 30 20 28 38 33
8 21 20 27 34 33
9 12 20 4 40 5
Another, better solution uses numpy.clip:
df = pd.DataFrame(np.clip(arr, min1, max1), index=df1.index, columns=df1.columns)
print (df)
A B C D E
0 30 3 28 34 38
1 17 19 30 22 33
2 30 20 30 9 32
3 30 20 30 25 19
4 14 20 30 16 4
5 30 3 4 20 39
6 5 20 30 40 7
7 30 20 28 38 33
8 21 20 27 34 33
9 12 20 4 40 5
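If you prefer to stay in pandas, DataFrame.clip can also take per-column bounds with axis=1; this is an alternative not shown above, assuming a reasonably recent pandas version:

# per-column bounds, aligned positionally with df1's columns
df = df1.clip(lower=min1, upper=max1, axis=1)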
I have a df like this:
1 2 3 4 5 6
0 5 10 12 35 70 80
1 10 11 23 40 42 47
2 5 26 27 38 60 65
Where all the values in each row are distinct and in increasing order.
I would like to create a new column containing 1 if a row has at least two consecutive numbers, and 0 otherwise.
For example, the second and third rows have 10 and 11, and 26 and 27. Is there a more pythonic way than using an iterator?
Thanks
Use DataFrame.diff for row-wise differences, compare with 1, check whether at least one value per row is True, and finally cast to integers:
df['check'] = df.diff(axis=1).eq(1).any(axis=1).astype(int)
print (df)
1 2 3 4 5 6 check
0 5 10 12 35 70 80 0
1 10 11 23 40 42 47 1
2 5 26 27 38 60 65 1
To improve performance, use numpy:
arr = df.values
df['check'] = np.any(((arr[:, 1:] - arr[:, :-1]) == 1), axis=1).astype(int)
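To reproduce, with the question's frame rebuilt inline:

import numpy as np
import pandas as pd

df = pd.DataFrame([[5, 10, 12, 35, 70, 80],
                   [10, 11, 23, 40, 42, 47],
                   [5, 26, 27, 38, 60, 65]],
                  columns=[1, 2, 3, 4, 5, 6])

# pandas version: row-wise differences, flag rows containing a difference of exactly 1
df['check'] = df.diff(axis=1).eq(1).any(axis=1).astype(int)
print(df['check'].tolist())  # [0, 1, 1]

# numpy version: the same test on the raw array
arr = df[[1, 2, 3, 4, 5, 6]].values
print(np.any(arr[:, 1:] - arr[:, :-1] == 1, axis=1).astype(int))  # [0 1 1]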
I have 2 data frames, df1 and df2:
df1
Out[66]:
A B
0 1 11
1 1 2
2 1 32
3 1 42
4 1 54
5 1 66
6 2 16
7 2 23
8 3 13
9 3 24
10 3 35
11 3 46
12 3 51
13 4 12
14 4 28
15 4 39
16 4 49
df2
Out[80]:
B
0 32
1 42
2 13
3 24
4 35
5 39
6 49
I want to merge the dataframes but at the same time include the first and/or last value of each set in column A. This is an example of the desired outcome:
df3
Out[93]:
A B
0 1 2
1 1 32
2 1 42
3 1 54
4 3 13
5 3 24
6 3 35
7 3 46
8 4 28
9 4 39
10 4 49
I'm trying to use merge, but that only keeps the portion of the dataframes that coincides. Does someone have an idea of how to deal with this? Thanks!
Here's one way to do it using merge with indicator, groupby, and rolling:
df1[df1.merge(df2, on='B', how='left', indicator='Ind').eval('Found = Ind == "both"')
    .groupby('A')['Found']
    .apply(lambda x: x.astype(int).rolling(3, center=True, min_periods=2).max()).astype(bool)]
Output:
A B
1 1 2
2 1 32
3 1 42
4 1 54
8 3 13
9 3 24
10 3 35
11 3 46
14 4 28
15 4 39
16 4 49
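The core trick is the centered rolling window: it widens each True flag to its immediate neighbors, which is what pulls in the row just before the first match and just after the last match of each group. A small illustration on a hypothetical flag series:

import pandas as pd

found = pd.Series([False, False, True, True, False, False])
# a centered window of 3 marks each match and its immediate neighbors
mask = found.astype(int).rolling(3, center=True, min_periods=2).max().astype(bool)
print(mask.tolist())  # [False, True, True, True, True, False]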
(pd.concat([df1.groupby('A').min().reset_index(),
            pd.merge(df1, df2, on="B"),
            df1.groupby('A').max().reset_index()])
   .reset_index(drop=True)
   .drop_duplicates()
   .sort_values(['A', 'B']))
A B
0 1 2
4 1 32
5 1 42
11 1 66
1 2 16
12 2 23
2 3 13
7 3 24
8 3 35
13 3 51
3 4 12
9 4 39
10 4 49
Breaking down each part
#Get Minimum
df1.groupby('A').min().reset_index()
# Merge on B
pd.merge(df1,df2, on="B")
# Get Maximum
df1.groupby('A').max().reset_index()
# Reset the Index and drop duplicated rows since there may be similarities between the Merge and Min/Max. Sort values by 'A' then by 'B'
.reset_index(drop=True).drop_duplicates().sort_values(['A','B'])
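To reproduce the whole pipeline with the question's frames rebuilt inline (note that this approach also pulls in each group's min and max even when the group has no match in df2, which is why group 2 shows up here but not in the desired df3):

import pandas as pd

df1 = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4],
                    'B': [11, 2, 32, 42, 54, 66, 16, 23, 13, 24, 35, 46, 51, 12, 28, 39, 49]})
df2 = pd.DataFrame({'B': [32, 42, 13, 24, 35, 39, 49]})

out = (pd.concat([df1.groupby('A').min().reset_index(),   # per-group minimum
                  pd.merge(df1, df2, on='B'),             # rows whose B appears in df2
                  df1.groupby('A').max().reset_index()])  # per-group maximum
         .reset_index(drop=True)
         .drop_duplicates()
         .sort_values(['A', 'B']))
print(out)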
I have a dataframe that holds column indices for another dataframe. Is there a way to return the values from the other dataframe instead of just the column indices?
I basically want to match up the index between the Push and df dataframes. The values in the Push dataframe indicate which column I want to return from the df dataframe.
Push dataframe:
0 1
0 1 2
1 0 3
2 0 3
3 1 3
4 0 2
df dataframe:
0 1 2 3 4
0 10 11 22 33 44
1 10 11 22 33 44
2 10 11 22 33 44
3 10 11 22 33 44
4 10 11 22 33 44
return:
0 1
0 11 22
1 10 33
2 10 33
3 11 33
4 10 22
You can do it with np.take; however, this function works on the flattened array, so push must be shifted like this:
In [285]: push1 = push.values+np.arange(0,25,5)[:,None]
In [229]: pd.DataFrame(df.values.take(push1))
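A generalized sketch of the same idea, using the question's push and df (the row stride of 5 is hardcoded above; np.take_along_axis, available since NumPy 1.15, performs the row-wise lookup directly):

import numpy as np
import pandas as pd

# flat index = column index + row number * row width
flat = push.values + np.arange(df.shape[0])[:, None] * df.shape[1]
out = pd.DataFrame(df.values.take(flat))

# or, more directly:
out = pd.DataFrame(np.take_along_axis(df.values, push.values, axis=1))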
EDIT
Actually, I just reinvented np.choose:
In [24]: df
Out[24]:
0 1 2 3 4
0 0 1 2 3 4
1 10 11 12 13 14
2 20 21 22 23 24
3 30 31 32 33 34
4 40 41 42 43 44
In [25]: push
Out[25]:
0 1
0 1 2
1 0 3
2 0 3
3 1 3
4 0 2
In [27]: np.choose(push.T, df.T).T
Out[27]:
0 1
0 1 2
1 10 13
2 20 23
3 31 33
4 40 42
Using melt, then replace (note: df1 here is your push, and df2 is your df). This works here because every row of df is identical, so each column maps to a single value:
df1.astype(str).replace(df2.melt().drop_duplicates().set_index('variable').value.to_dict())
Out[31]:
0 1
0 11 22
1 10 33
2 10 33
3 11 33
4 10 22
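Plain integer-array indexing is one more option, an alternative to the answers above using the question's push and df: index the rows with a column vector of row numbers and the columns with push itself:

import numpy as np
import pandas as pd

out = pd.DataFrame(df.values[np.arange(len(df))[:, None], push.values],
                   index=push.index, columns=push.columns)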