get difference between the two minimum values rowwise pandas columns - python

I have a dataframe as a result of various pivot operations with float numbers (this example using integers for simplicity)
import numpy as np
import pandas as pd
np.random.seed(365)
rows = 10
cols= {'col_a': [np.random.randint(100) for _ in range(rows)],
'col_b': [np.random.randint(100) for _ in range(rows)],
'col_c': [np.random.randint(100) for _ in range(rows)]}
data = pd.DataFrame(cols)
data
col_a col_b col_c
0 82 36 43
1 52 48 12
2 33 28 77
3 91 99 11
4 44 95 27
5 5 94 64
6 98 3 88
7 73 39 92
8 26 39 62
9 56 74 50
I want to detect the two minimum values in a row and get the diff in a new column.
For example, in first row, the 2 minimum values are 36 and 43, so the difference will be 7
I've tried this way:
data['difference']=data[data.apply(lambda x: x.nsmallest(2).astype(float), axis=1).isna()].subtract(axis=1)
but i get:
TypeError: f() missing 1 required positional argument: 'other'

Better use numpy:
a = np.sort(data)
data['difference'] = a[:,1]-a[:,0]
output:
col_a col_b col_c difference
0 82 36 43 7
1 52 48 12 36
2 33 28 77 5
3 91 99 11 80
4 44 95 27 17
5 5 94 64 59
6 98 3 88 85
7 73 39 92 34
8 26 39 62 13
9 56 74 50 6

Follow your idea with nsmallest on rows
data['difference'] = data.apply(lambda x: x.nsmallest(2).tolist(), axis=1, result_type='expand').diff(axis=1)[1]
# or
data['difference'] = data.apply(lambda x: x.nsmallest(2).diff().iloc[-1], axis=1)
print(data)
col_a col_b col_c difference
0 82 36 43 7
1 52 48 12 36
2 33 28 77 5
3 91 99 11 80
4 44 95 27 17
5 5 94 64 59
6 98 3 88 85
7 73 39 92 34
8 26 39 62 13
9 56 74 50 6

Here is a way using rank()
(df.where(
df.rank(axis=1,method = 'first')
.le(2))
.stack()
.sort_values()
.groupby(level=0)
.agg(lambda x: x.diff().sum()))
If your df was larger and you wanted to potentially use more than the 2 smallest, this should work
(df.where(
df.rank(axis=1,method = 'first')
.le(2))
.stack()
.sort_values(ascending=False)
.groupby(level=0)
.agg(lambda x: x.mul(-1).cumsum().add(x.max()*2).iloc[-1]))

Related

Sum row values of all columns where column names meet string match condition

I have the following dataset:
ID Length Width Range_CAP Capacity_CAP
0 1 33 25 16 50
1 2 34 22 11 66
2 3 22 12 15 42
3 4 46 45 66 54
4 5 16 6 23 75
5 6 21 42 433 50
I basically want to sum the row values of the columns only where the columns match a string (in this case, all columns with _CAP at the end of their name). And store the sum of the result in a new column.
So that I end up with a dataframe that looks something like this:
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
I first tried to use the solution recommended in this question here:
Summing columns in Dataframe that have matching column headers
However, the solution doesn't work for me since they are summing up columns that have the same exact name so a simple groupby can accomplish the result whereas I am trying to sum columns with specific string matches only.
Code to recreate above sample dataset:
data1 = [['1', 33,25,16,50], ['2', 34,22,11,66],
['3', 22,12,15,42],['4', 46,45,66,54],
['5',16,6,23,75], ['6', 21,42,433,50]]
df = pd.DataFrame(data1, columns = ['ID', 'Length','Width','Range_CAP','Capacity_CAP'])
Let us do filter
df['CAP_SUM'] = df.filter(like='CAP').sum(1)
Out[86]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
If have other CAP in front
df.filter(regex='_CAP$').sum(1)
Out[92]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
One approach is:
df['CAP_SUM'] = df.loc[:, df.columns.str.endswith('_CAP')].sum(1)
print(df)
Output
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
The expression:
df.columns.str.endswith('_CAP')
creates a boolean mask where the values are True if and only if the column name ends with CAP. As an alternative use filter, with the following regex:
df['CAP_SUM'] = df.filter(regex='_CAP$').sum(1)
print(df)
Output (of filter)
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
You may try this:
columnstxt = df.columns
df['sum'] = 0
for i in columnstxt:
if i.find('_CAP') != -1:
df['sum'] = df['sum'] + df[i]
else:
pass

How to select all rows which contain values greater than a threshold?

The request is simple: I want to select all rows which contain a value greater than a threshold.
If I do it like this:
df[(df > threshold)]
I get these rows, but values below that threshold are simply NaN. How do I avoid selecting these rows?
There is absolutely no need for the double transposition - you can simply call any along the column index (supplying 1 or 'columns') on your Boolean matrix.
df[(df > threshold).any(1)]
Example
>>> df = pd.DataFrame(np.random.randint(0, 100, 50).reshape(5, 10))
>>> df
0 1 2 3 4 5 6 7 8 9
0 45 53 89 63 62 96 29 56 42 6
1 0 74 41 97 45 46 38 39 0 49
2 37 2 55 68 16 14 93 14 71 84
3 67 45 79 75 27 94 46 43 7 40
4 61 65 73 60 67 83 32 77 33 96
>>> df[(df > 95).any(1)]
0 1 2 3 4 5 6 7 8 9
0 45 53 89 63 62 96 29 56 42 6
1 0 74 41 97 45 46 38 39 0 49
4 61 65 73 60 67 83 32 77 33 96
Transposing as your self-answer does is just an unnecessary performance hit.
df = pd.DataFrame(np.random.randint(0, 100, 10**8).reshape(10**4, 10**4))
# standard way
%timeit df[(df > 95).any(1)]
1 loop, best of 3: 8.48 s per loop
# transposing
%timeit df[df.T[(df.T > 95)].any()]
1 loop, best of 3: 13 s per loop
This is actually very simple:
df[df.T[(df.T > 0.33)].any()]

Retriving all the rows from a csv file and plotting

I need to retrieve the rows from a csv file generated from the function:
def your_func(row):
return (row['x-momentum']**2+ row['y-momentum']**2 + row['z-momentum']**2)**0.5 / row['mass']
columns_to_keep = ['#time', 'x-momentum', 'y-momentum', 'z-momentum', 'mass']
dataframe = pd.read_csv("./flash.csv", usecols=columns_to_keep)
dataframe['mean_velocity'] = dataframe.apply(your_func, axis=1)
print dataframe
I got rows up until 29s then it skipped to the last few lines, also I need to plot this column 2 against 1
you can adjust pd.options.display.max_rows option, but it won't affect your plots, so your plots will contain all your data
demo:
In [25]: df = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=list('ABC'))
In [26]: df
Out[26]:
A B C
0 93 76 5
1 33 70 12
2 50 52 26
3 88 98 85
4 90 93 92
5 66 10 58
6 82 43 39
7 17 20 91
8 47 90 33
9 44 30 26
In [27]: pd.options.display.max_rows = 4
Now it'll display 4 rows at most
In [36]: df
Out[36]:
A B C
0 93 76 5
1 33 70 12
.. .. .. ..
8 47 90 33
9 44 30 26
[10 rows x 3 columns]
but it'll plot all your data
In [37]: df.plot.bar()
Out[37]: <matplotlib.axes._subplots.AxesSubplot at 0x49e2d68>
In [38]: pd.options.display.max_rows = 60
In [39]: df
Out[39]:
A B C
0 93 76 5
1 33 70 12
2 50 52 26
3 88 98 85
4 90 93 92
5 66 10 58
6 82 43 39
7 17 20 91
8 47 90 33
9 44 30 26

Shuffle DataFrame rows except the first row

I am trying to randomize all rows in a data frame except for the first. I would like for the first row to always appear first, and the remaining rows can be in any randomized order.
My data frame is:
df = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
Any suggestions as to how I can approach this?
try this:
df = pd.concat([df[:1], df[1:].sample(frac=1)]).reset_index(drop=True)
test:
In [38]: df
Out[38]:
a b c d e
0 2.070074 2.216060 -0.015823 0.686516 -0.738393
1 -1.213517 0.994057 0.634805 0.517844 -0.128375
2 0.937532 0.814923 -0.231120 1.970019 1.438927
3 1.499967 0.105707 1.255207 0.929084 -3.359826
4 0.418702 -0.894226 -1.088968 0.631398 0.152026
5 1.214119 -0.122633 0.983818 -0.445202 -0.807955
6 0.252078 -0.258703 -0.445209 -0.179094 1.180077
7 1.428827 -0.569009 -0.718485 0.161108 1.300349
8 -1.403100 2.154548 -0.492264 -0.544538 -0.061745
9 0.468671 0.004839 -0.738240 -0.385624 -0.532640
In [39]: df = pd.concat([df[:1], df[1:].sample(frac=1)]).reset_index(drop=True)
In [40]: df
Out[40]:
a b c d e
0 2.070074 2.216060 -0.015823 0.686516 -0.738393
1 0.468671 0.004839 -0.738240 -0.385624 -0.532640
2 0.418702 -0.894226 -1.088968 0.631398 0.152026
3 -1.213517 0.994057 0.634805 0.517844 -0.128375
4 1.428827 -0.569009 -0.718485 0.161108 1.300349
5 0.937532 0.814923 -0.231120 1.970019 1.438927
6 0.252078 -0.258703 -0.445209 -0.179094 1.180077
7 1.499967 0.105707 1.255207 0.929084 -3.359826
8 -1.403100 2.154548 -0.492264 -0.544538 -0.061745
9 1.214119 -0.122633 0.983818 -0.445202 -0.807955
Use numpy's shuffle
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(100).reshape(20, 5), columns=list('ABCDE'))
np.random.shuffle(df.values[1:, :])
print df
A B C D E
0 0 1 2 3 4
1 55 56 57 58 59
2 10 11 12 13 14
3 80 81 82 83 84
4 90 91 92 93 94
5 70 71 72 73 74
6 25 26 27 28 29
7 40 41 42 43 44
8 65 66 67 68 69
9 5 6 7 8 9
10 45 46 47 48 49
11 85 86 87 88 89
12 15 16 17 18 19
13 30 31 32 33 34
14 60 61 62 63 64
15 20 21 22 23 24
16 35 36 37 38 39
17 95 96 97 98 99
18 75 76 77 78 79
19 50 51 52 53 54
np.random.shuffle shuffles an ndarray in place. The dataframe is just a wrapper on an ndarray. You can access that ndarray with the values attribute. To specify that all but the first row get shiffled, operate on the array slice [1:, :].

pandas - get a fraction of each multi-index level label rows

I have an XY problem. My setup is as follows - I have a dataframe with multi-index of 2 levels. I want to split it to two dataframes, taking only a fraction of rows from each label in the first level. For example:
df = pd.DataFrame({'a':[1, 1, 1, 1, 7, 7, 10, 10, 10, 10, 10, 10, 10], 'b': np.random.randint(0, 100, 13), 'c':np.random.randint(0, 100, 13)}).set_index(['a', 'b'])
df
Out[13]:
c
a b
1 86 83
1 37
57 64
53 5
7 4 66
13 49
10 61 0
32 84
97 59
69 98
25 52
17 31
37 95
So let's say the fraction is 0.5, I want to split it to two dataframes:
c
a b
1 86 83
1 37
7 4 66
10 61 0
32 84
97 59
69 98
c
a b
1 57 64
53 5
7 13 49
10 25 52
17 31
37 95
I thought about doing (df.groupby(level = 0).count() * 0.5).astype(int) to get the limit on which to "slice" the dataframe. Then, if only I had a way to add a running distance such as this:
c r
a b
1 38 36 0
6 47 1
57 6 2
55 45 3
7 7 51 0
90 96 1
10 59 75 0
27 16 1
58 7 2
79 51 3
58 77 4
63 48 5
87 60 6
I could join the limits and this df and filter with a boolean condition. Any suggestions on either problem? (splitting a fraction of rows or adding a level-aware running index)
This turns out to be pretty trivial with groupby:
In [36]: df.groupby(level=0).apply(lambda x:x.head(int(x.shape[0] * 0.5))).reset_index(level=0, drop=True)
Out[36]:
c
a b
1 86 83
1 37
7 4 66
10 61 0
32 84
97 59
Also getting the running index per group:
In [33]: df.groupby(level=0).cumcount()
Out[33]:
a b
1 38 0
6 1
57 2
55 3
7 7 0
90 1
10 59 0
27 1
58 2
79 3
58 4
63 5
87 6

Categories

Resources