Checking if values of a row are consecutive

Checking if values of a row are consecutive - python

I have a df like this:
1 2 3 4 5 6
0 5 10 12 35 70 80
1 10 11 23 40 42 47
2 5 26 27 38 60 65
Where all the values in each row are different and have an increasing order.
I would like to create a new column with 1 or 0 if there are at least 2 consecutive numbers.
For example the second and third row have 10 and 11, and 26 and 27. Is there a more pythonic way than using an iterator?
Thanks

Use DataFrame.diff for difference per rows, compare by 1, check if at least one True per rows and last cast to integers:
df['check'] = df.diff(axis=1).eq(1).any(axis=1).astype(int)
print (df)
1 2 3 4 5 6 check
0 5 10 12 35 70 80 0
1 10 11 23 40 42 47 1
2 5 26 27 38 60 65 1
For improve performance use numpy:
arr = df.values
df['check'] = np.any(((arr[:, 1:] - arr[:, :-1]) == 1), axis=1).astype(int)

Related

Pandas fill dataframe with count of values within a range from another dataframe

I currently have two dataframes, df_ages and df_count:
In [1]: df_ages
Out [1]:
Enrolled Age
1 Y 44
2 Y 35
3 N 37
4 Y 55
5 N 26
6 Y 19
7 N 18
8 N 49
9 Y 26
10 Y 25
11 Y 25
12 Y 32
13 Y 25
14 N 50
15 N 58
In [2]: df_count
Out [2]:
Min Max counts percentage
1 18 25
2 26 35
3 36 45
4 46 55
5 56 65
I am looking for code to populate df_count [count] column with the sum of people who fit within the min and max age range in the previous columns.
The [percentage] column should be the percentage of number of entries.
The desired resulting output is shown below:
In [2]: df_count
Out [2]:
Min Max counts percentage
1 18 25 5 33.3
2 26 35 4 26.7
3 36 45 2 13.3
4 46 55 3 20.0
5 56 65 1 6.7

You can try apply on rows with Series.between
df_count['counts'] = df_count.apply(lambda row: df_ages['Age'].between(row['Min'], row['Max']).sum(), axis=1)
df_count['percentage'] = df_count['counts'].div(len(df_ages)).mul(100).round(1)
print(df_count)
Min Max counts percentage
0 18 25 5 33.3
1 26 35 4 26.7
2 36 45 2 13.3
3 46 55 3 20.0
4 56 65 1 6.7

Is there any way to ungroup the groupby dataframe with adding an additional column

Suppose we take a pandas dataframe...
item MRP sold
0 A 10 10
1 A 36 4
2 B 32 6
3 A 26 7
4 B 30 9
Then do a groupby('item').mean()
it becomes
item MRP sold
0 A 24 7
1 B 31 7.5
Is there a way to retain the mean values of MRP, of all the unique items and make another column which will contain those values when ungrouped.
Basically what i want is
item MRP sold Mean_MRP
0 A 10 10 24
1 A 36 4 24
2 B 32 6 31
3 A 26 7 24
4 B 30 9 31
There are a lot of items, so i need a faster and optimised way to do this

Use the Transform function :
df = (df
.assign(Mean_MRP = lambda x:x.groupby('item')['MRP']
.transform('mean')))
df
item MRP sold Mean_MRP
0 A 10 10 24
1 A 36 4 24
2 B 32 6 31
3 A 26 7 24
4 B 30 9 31
You could also use the pyjanitor module, which makes the code a bit cleaner:
import janitor
df.groupby_agg(by='item',
agg='mean',
agg_column_name="MRP",
new_column_name='Mean_MRP')

Try using transform:
df['Mean_MRP'] = df.groupby('item').transform('mean')

Select rows from pandas df, where index appears somewhere in another df

Assume the following:
df1:
x y z
1 10 11
2 20 22
3 30 33
4 40 44
1 20 21
1 30 31
1 40 41
2 10 12
2 30 32
2 40 42
3 10 31
3 20 23
3 40 43
4 10 14
4 20 24
4 30 34
df2:
x b
1 100
2 200
df3:
y c
10 1000
20 2000
I want all rows from df1, for which either x or y appears in either df2 or df3 respectively, meaning in this case
out:
x y z
1 10 11
2 20 22
1 20 21
1 30 31
1 40 41
2 10 12
2 30 32
2 40 42
3 10 31
3 20 23
4 10 14
4 20 24
I would like to do this in pure pandas, with no for loops, seems standard enough to me, but I don't really know what to look for

You can use isin on both cases, chain the conditions with a bitwise OR and perform boolean indexation on the dataframe with the result:
df1[df1.x.isin(df2.x) | df1.y.isin(df3.y)]

Pandas Cumsum skip rows

I have a dataframe like the following.
idx vals
0 10
1 21
2 12
3 33
4 14
5 55
6 16
7 77
I would like to perform a cumsum (and avoid a for) but only considering rows with the same idx mod 2. For instance, for row 3 I would like to obtain 21+33=54, while for row 4, 10+12+14 = 36.
Any ideas?

You just need groupby here
df.vals.groupby(df.idx%2).cumsum()
Out[75]:
0 10
1 21
2 22
3 54
4 36
5 109
6 52
7 186
Name: vals, dtype: int64

Fastest way to sort each row in a pandas dataframe

I need to find the quickest way to sort each row in a dataframe with millions of rows and around a hundred columns.
So something like this:
A B C D
3 4 8 1
9 2 7 2
Needs to become:
A B C D
8 4 3 1
9 7 2 2
Right now I'm applying sort to each row and building up a new dataframe row by row. I'm also doing a couple of extra, less important things to each row (hence why I'm using pandas and not numpy). Could it be quicker to instead create a list of lists and then build the new dataframe at once? Or do I need to go cython?

I think I would do this in numpy:
In [11]: a = df.values
In [12]: a.sort(axis=1) # no ascending argument
In [13]: a = a[:, ::-1] # so reverse
In [14]: a
Out[14]:
array([[8, 4, 3, 1],
[9, 7, 2, 2]])
In [15]: pd.DataFrame(a, df.index, df.columns)
Out[15]:
A B C D
0 8 4 3 1
1 9 7 2 2
I had thought this might work, but it sorts the columns:
In [21]: df.sort(axis=1, ascending=False)
Out[21]:
D C B A
0 1 8 4 3
1 2 7 2 9
Ah, pandas raises:
In [22]: df.sort(df.columns, axis=1, ascending=False)
ValueError: When sorting by column, axis must be 0 (rows)

To Add to the answer given by #Andy-Hayden, to do this inplace to the whole frame... not really sure why this works, but it does. There seems to be no control on the order.
In [97]: A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
In [98]: A
Out[98]:
one two three four five
0 22 63 72 46 49
1 43 30 69 33 25
2 93 24 21 56 39
3 3 57 52 11 74
In [99]: A.values.sort
Out[99]: <function ndarray.sort>
In [100]: A
Out[100]:
one two three four five
0 22 63 72 46 49
1 43 30 69 33 25
2 93 24 21 56 39
3 3 57 52 11 74
In [101]: A.values.sort()
In [102]: A
Out[102]:
one two three four five
0 22 46 49 63 72
1 25 30 33 43 69
2 21 24 39 56 93
3 3 11 52 57 74
In [103]: A = A.iloc[:,::-1]
In [104]: A
Out[104]:
five four three two one
0 72 63 49 46 22
1 69 43 33 30 25
2 93 56 39 24 21
3 74 57 52 11 3
I hope someone can explain the why of this, just happy that it works 8)

You could use pd.apply.
Eg:
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
print (A)
one two three four five
0 2 75 44 53 46
1 18 51 73 80 66
2 35 91 86 44 25
3 60 97 57 33 79
A = A.apply(np.sort, axis = 1)
print(A)
one two three four five
0 2 44 46 53 75
1 18 51 66 73 80
2 25 35 44 86 91
3 33 57 60 79 97
Since you want it in descending order, you can simply multiply the dataframe with -1 and sort it.
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
A = A * -1
A = A.apply(np.sort, axis = 1)
A = A * -1

Instead of using pd.DataFrame constructor, an easier way to assign the sorted values back is to use double brackets:
original dataframe:
A B C D
3 4 8 1
9 2 7 2
df[['A', 'B', 'C', 'D']] = np.sort(df)[:, ::-1]
A B C D
0 8 4 3 1
1 9 7 2 2
This way you can also sort a part of the columns:
df[['B', 'C']] = np.sort(df[['B', 'C']])[:, ::-1]
A B C D
0 3 8 4 1
1 9 7 2 2

One could try this approach to preserve the integrity of the df:
import pandas as pd
import numpy as np
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
print (A)
print(type(A))
one two three four five
0 85 27 64 50 55
1 3 90 65 22 8
2 0 7 64 66 82
3 58 21 42 27 30
<class 'pandas.core.frame.DataFrame'>
B = A.apply(lambda x: np.sort(x), axis=1, raw=True)
print(B)
print(type(B))
one two three four five
0 27 50 55 64 85
1 3 8 22 65 90
2 0 7 64 66 82
3 21 27 30 42 58
<class 'pandas.core.frame.DataFrame'>

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Checking if values of a row are consecutive - python

Related

Pandas fill dataframe with count of values within a range from another dataframe

Is there any way to ungroup the groupby dataframe with adding an additional column

Select rows from pandas df, where index appears somewhere in another df

Pandas Cumsum skip rows

Fastest way to sort each row in a pandas dataframe

Categories

Resources