Iterating over a dataframe and replacing values with those from another dataframe - python

I have 2 dataframes, df1 and df2, and df2 holds the min and max values for the corresponding columns.
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.randint(0,50,size=(10, 5)), columns=list('ABCDE'))
df2 = pd.DataFrame(np.array([[5,3,4,7,2],[30,20,30,40,50]]),columns=list('ABCDE'))
I would like to iterate through df1 and replace the cell values with those of df2 when the df1 cell value is below/above the respective columns' min/max values.

First, don't loop/iterate in pandas if a better, vectorized solution exists, as it does here.
Use numpy.select with broadcasting to set values by condition:
np.random.seed(123)
df1 = pd.DataFrame(np.random.randint(0,50,size=(10, 5)), columns=list('ABCDE'))
df2 = pd.DataFrame(np.array([[5,3,4,7,2],[30,20,30,40,50]]),columns=list('ABCDE'))
print (df1)
    A   B   C   D   E
0  45   2  28  34  38
1  17  19  42  22  33
2  32  49  47   9  32
3  46  32  47  25  19
4  14  36  32  16   4
5  49   3   2  20  39
6   2  20  47  48   7
7  41  35  28  38  33
8  21  30  27  34  33
print (df2)
    A   B   C   D   E
0   5   3   4   7   2
1  30  20  30  40  50
#for pandas below 0.24 change .to_numpy() to .values
min1 = df2.loc[0].to_numpy()
max1 = df2.loc[1].to_numpy()
arr = df1.to_numpy()
df = pd.DataFrame(np.select([arr < min1, arr > max1], [min1, max1], arr),
                  index=df1.index,
                  columns=df1.columns)
print (df)
    A   B   C   D   E
0  30   3  28  34  38
1  17  19  30  22  33
2  30  20  30   9  32
3  30  20  30  25  19
4  14  20  30  16   4
5  30   3   4  20  39
6   5  20  30  40   7
7  30  20  28  38  33
8  21  20  27  34  33
9  12  20   4  40   5
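The broadcasting works because min1 and max1 have shape (5,), one value per column of the (10, 5) array arr, so every comparison is applied column-wise. A sketch of the intermediate masks (the names below and above are illustrative):
below = arr < min1   # boolean mask of shape (10, 5); compares column j against min1[j]
above = arr > max1   # boolean mask of shape (10, 5); compares column j against max1[j]
df = pd.DataFrame(np.select([below, above], [min1, max1], arr),
                  index=df1.index, columns=df1.columns)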
An even better solution is numpy.clip:
df = pd.DataFrame(np.clip(arr, min1, max1), index=df1.index, columns=df1.columns)
print (df)
    A   B   C   D   E
0  30   3  28  34  38
1  17  19  30  22  33
2  30  20  30   9  32
3  30  20  30  25  19
4  14  20  30  16   4
5  30   3   4  20  39
6   5  20  30  40   7
7  30  20  28  38  33
8  21  20  27  34  33
9  12  20   4  40   5
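If you prefer to stay in pandas, DataFrame.clip accepts Series bounds and aligns them on the columns when axis=1, so the same result should be obtainable without dropping to numpy (a sketch, not part of the original answer):
df = df1.clip(lower=df2.loc[0], upper=df2.loc[1], axis=1)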

Related

How to iterate over rows in a pandas DataFrame to perform a manipulation

How to iterate over rows in pandas to perform a manipulation in the format below
I have a CSV file that contains 365 columns and 1152 rows (the row index repeats in blocks like (1,48), (1,48), ...). I need to select the K maximum rows from every (1,48) block and perform some manipulation on them.
Steps I took:
I used df.apply to do this.
Code I tried
def with_battery(val):
    for i in range(d2i.shape[0]):
        if i in [31,32,33,34,35,36]:  # [31,32,33,34,35,36] should be replaced by top K max.
            # batterysize = 50
            if val.iloc[i] > batterysize:
                val.iloc[i] = 0
            else:
                val.iloc[i] -= batterysize
    return val

D2j = D2i.apply(with_battery, axis=0)
How the data is:
Input Dataframe
   1   2   3   4   5   6   7
1  10  11  34  21  23  12  10
2  11  11  11  11  11  11  11
3  32  32  32  32  32  32  32
4  21  21  21  21  21  21  21
5  42  42  42  42  42  42  42
6  34  34  34  34  34  34  34
1  21  21  21  21  21  21  21
2  22  22  22  22  22  22  22
3  54  54  54  54  54  54  54
4  45  45  45  45  45  45  45
5  43  43  43  43  43  43  43
6  42  42  42  42  42  42  42
> For K=3, rows (3,5,6) are the maxima, so I set values less than 50 to zero and values more than 50 to value - 50. Similarly, in the next chunk, rows (3,4,5) are the top 3 max rows and I performed the same action as above. (A vectorized sketch follows the output below.)
Output Dataframe
   1   2   3   4   5   6   7
1  10  11  34  21  23  12  10
2  11  11  11  11  11  11  11
3   0   0   0   0   0   0   0
4  21  21  21  21  21  21  21
5   0   0   0   0   0   0   0
6   0   0   0   0   0   0   0
1  21  21  21  21  21  21  21
2  22  22  22  22  22  22  22
3   4   4   4   4   4   4   4
4   0   0   0   0   0   0   0
5   0   0   0   0   0   0   0
6  42  42  42  42  42  42  42
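A vectorized sketch of the described manipulation, avoiding the row loop, could look like the following. It assumes batterysize = 50, blocks delimited by the index restarting at 1, and "K maximum rows" meaning the K rows with the largest sums; block and adjust are hypothetical names, while D2i/D2j follow the question's naming:
import numpy as np

batterysize = 50
K = 3

# A new block starts each time the row index restarts at 1
block = (D2i.index.to_numpy() == 1).cumsum()

def adjust(chunk):
    chunk = chunk.copy()
    top = chunk.sum(axis=1).nlargest(K).index   # the K rows with the largest sums
    vals = chunk.loc[top]
    # values above batterysize are reduced by it; the rest become zero
    chunk.loc[top] = np.where(vals > batterysize, vals - batterysize, 0)
    return chunk

D2j = D2i.groupby(block, group_keys=False).apply(adjust)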

How to convert a dataframe into a pandas IndexSlice to index another MultiIndex dataframe

I have the following 2 dataframes df1 and df2:
import pandas as pd
m_idx = pd.MultiIndex.from_product([range(3), range(3, 6), range(6, 8)])
m_idx.names = ['a', 'b', 'c']
df1 = pd.DataFrame(None, index=m_idx, columns=['x1', 'x2', 'x3'])
df1.loc[:, 'x1'] = m_idx.get_level_values('a') + m_idx.get_level_values('b') + m_idx.get_level_values('c')
df1.loc[:, 'x2'] = df1.loc[:, 'x1'] * 2
df1.loc[:, 'x3'] = df1.loc[:, 'x1'] * 3
df2 = pd.DataFrame({'a': [0, 2], 'c': [6, 6]})
df1:
       x1  x2  x3
a b c
0 3 6   9  18  27
    7  10  20  30
  4 6  10  20  30
    7  11  22  33
  5 6  11  22  33
    7  12  24  36
1 3 6  10  20  30
    7  11  22  33
  4 6  11  22  33
    7  12  24  36
  5 6  12  24  36
    7  13  26  39
2 3 6  11  22  33
    7  12  24  36
  4 6  12  24  36
    7  13  26  39
  5 6  13  26  39
    7  14  28  42
df2:
   a  c
0  0  6
1  2  6
How can I convert df2 into something I can use to index into df1, where the column names of df2 are the index levels and each row holds a combination of keys to pull out of the df1 index?
Or, in other words, how can I convert df2 into something that does the equivalent of
df1.loc[pd.IndexSlice[[0, 2], :, [6, 6]], :]
which would return:
       x1  x2  x3
a b c
0 3 6   9  18  27
  4 6  10  20  30
  5 6  11  22  33
2 3 6  11  22  33
  4 6  12  24  36
  5 6  13  26  39
This is a very simplified, small-scale version of what I am actually trying to solve, so I am really looking to create the pd.IndexSlice on the fly.
I have seen a separate question that suggested the following, and I have done something similar in my code, but it takes too long for my purposes.
df_list = [df1.loc[(v[0], slice(None), v[1]), :] for r, v in df2.iterrows()]
df_sliced = pd.concat(df_list)
So I am hoping that using pd.IndexSlice, or another alternative, could be much quicker.
Many thanks!
Convert the index of df1 to a dataframe, then use isin + all on the matching levels to replicate the behaviour of the index slice:
d = df2.to_dict('list')
df1[df1.index.to_frame()[d].isin(d).all(1)]
       x1  x2  x3
a b c
0 3 6   9  18  27
  4 6  10  20  30
  5 6  11  22  33
2 3 6  11  22  33
  4 6  12  24  36
  5 6  13  26  39
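Step by step, this is roughly what happens (the intermediate names lvl, mask and result are illustrative):
d = df2.to_dict('list')                  # {'a': [0, 2], 'c': [6, 6]}
lvl = df1.index.to_frame()[['a', 'c']]   # the matched index levels as ordinary columns
mask = lvl.isin(d).all(axis=1)           # keep a row if its a is in [0, 2] and its c is in [6, 6]
result = df1[mask]
Note that, like pd.IndexSlice, this matches each level against the full set of values independently (a cross-product), not against the row-wise (a, c) pairs of df2; here the two coincide because c is 6 in both rows.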

Can I assign a numpy array to new columns in pandas 1.0.3?

This bit of code:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(25).reshape(5,5))
df[['A', 'B']] = np.arange(30,40).reshape(5,2)
This works fine in 1.1.1 but throws an error in 1.0.3:
KeyError: "None of [Index(['A', 'B'], dtype='object')] are in the [columns]"
Is there a way to do this in a backwards compatible way?
You can use assign:
df=df.assign(**dict(zip(['A', 'B'], np.arange(30,40).reshape(2,5))))
Out[119]:
    0   1   2   3   4   A   B
0   0   1   2   3   4  30  35
1   5   6   7   8   9  31  36
2  10  11  12  13  14  32  37
3  15  16  17  18  19  33  38
4  20  21  22  23  24  34  39
Use T to transpose the array and use unpacking:
df['A'], df['B'] = np.arange(30,40).reshape(5,2).T
Result:
    0   1   2   3   4   A   B
0   0   1   2   3   4  30  31
1   5   6   7   8   9  32  33
2  10  11  12  13  14  34  35
3  15  16  17  18  19  36  37
4  20  21  22  23  24  38  39
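A third, backwards-compatible option (a sketch, not from either answer above) is to wrap the array in a DataFrame with the target column names and concatenate along the columns; this should behave the same in 1.0.3 and 1.1.1:
new = pd.DataFrame(np.arange(30, 40).reshape(5, 2), columns=['A', 'B'], index=df.index)
df = pd.concat([df, new], axis=1)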

Select rows from pandas df, where index appears somewhere in another df

Assume the following:
df1:
x y z
1 10 11
2 20 22
3 30 33
4 40 44
1 20 21
1 30 31
1 40 41
2 10 12
2 30 32
2 40 42
3 10 31
3 20 23
3 40 43
4 10 14
4 20 24
4 30 34
df2:
x b
1 100
2 200
df3:
y c
10 1000
20 2000
I want all rows from df1 for which either x appears in df2 or y appears in df3, which in this case means
out:
x y z
1 10 11
2 20 22
1 20 21
1 30 31
1 40 41
2 10 12
2 30 32
2 40 42
3 10 31
3 20 23
4 10 14
4 20 24
I would like to do this in pure pandas, with no for loops. It seems standard enough to me, but I don't really know what to search for.
You can use isin in both cases, chain the conditions with a bitwise OR, and perform boolean indexing on the dataframe with the result:
df1[df1.x.isin(df2.x) | df1.y.isin(df3.y)]
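If more key columns need checking later, the same idea generalizes by collecting the masks in a list and reducing them with a logical OR (a minimal sketch, assuming df2 and df3 as above):
import numpy as np
masks = [df1.x.isin(df2.x), df1.y.isin(df3.y)]
out = df1[np.logical_or.reduce(masks)]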

Appending a dataframe to the right of another one with the same columns

I have two different dataframes with the same column names:
e.g.
    0   1   2
0  10  13  17
1  14  21  34
2  68  32  12

    0   1   2
0  45  56  32
1   9  22  86
2  55  64  19
I would like to append the second frame to the right of the first one while continuing the column names from the first frame. The output would look like this:
    0   1   2   3   4   5
0  10  13  17  45  56  32
1  14  21  34   9  22  86
2  68  32  12  55  64  19
What is the most efficient way of doing this?
Thanks.
Use pd.concat first and then reset the column labels:
In [1108]: df_out = pd.concat([df1, df2], axis=1)
In [1109]: df_out.columns = list(range(len(df_out.columns)))
In [1110]: df_out
Out[1110]:
    0   1   2   3   4   5
0  10  13  17  45  56  32
1  14  21  34   9  22  86
2  68  32  12  55  64  19
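A shorter variant (a sketch): pd.concat can relabel the concatenation axis itself via ignore_index, which applies to the columns here because axis=1, making the manual reset unnecessary:
df_out = pd.concat([df1, df2], axis=1, ignore_index=True)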
Why not join:
>>> df=df.join(df_,lsuffix='_')
>>> df.columns=range(len(df.columns))
>>> df
    0   1   2   3   4   5
0  10  13  17  45  56  32
1  14  21  34   9  22  86
2  68  32  12  55  64  19
join is your friend; I use lsuffix (rsuffix would work too) to avoid the error about duplicate columns.
