Pandas Cumsum skip rows - python

I have a dataframe like the following.
idx vals
0 10
1 21
2 12
3 33
4 14
5 55
6 16
7 77
I would like to perform a cumsum (and avoid a for loop), but only considering rows with the same idx mod 2. For instance, for row 3 I would like to obtain 21+33=54, while for row 4, 10+12+14=36.
Any ideas?

You just need groupby here:
df.vals.groupby(df.idx%2).cumsum()
Out[75]:
0 10
1 21
2 22
3 54
4 36
5 109
6 52
7 186
Name: vals, dtype: int64
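As a self-contained sketch (the frame is rebuilt from the data shown in the question):
import pandas as pd

df = pd.DataFrame({'idx': range(8),
                   'vals': [10, 21, 12, 33, 14, 55, 16, 77]})

# idx % 2 puts even and odd rows in separate groups; cumsum then runs
# independently within each group
df['cs'] = df.vals.groupby(df.idx % 2).cumsum()
print(df)   # row 3 -> 21 + 33 = 54, row 4 -> 10 + 12 + 14 = 36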

Related

Combining dataframe with first 2 columns of another dataframe without changing index position

I originally have a 'monthly' DataFrame with months (1-11) as column index and number of disease cases as values.
I have another 'disease' DataFrame with the first 2 columns as 'Country' and 'Province'.
I want to combine the 'monthly' DataFrame with the 2 columns, and the 2 columns should still be the first 2 columns in the combined 'monthly' DataFrame (same index position).
In other words, the original 'monthly' DataFrame is:
1 2 3 4 5 6 7 8 9 10 11
0 1 5 8 0 9 9 8 18 82 89 81
1 0 1 9 19 8 12 29 19 91 74 93
The desired output is:
Country Province 1 2 3 4 5 6 7 8 9 10 11
0 Afghanistan Afghanistan 1 5 8 0 9 9 8 18 82 89 81
1 Argentina Argentina 0 1 9 19 8 12 29 19 91 74 93
I was able to append the 2 columns to the 'monthly' DataFrame with this code:
monthly['Country'] = disease['Country']
monthly['Province'] = disease['Province']
However, this puts the 2 columns at the end of the 'monthly' DataFrame.
1 2 3 4 5 6 7 8 9 10 11 Country Province
0 1 5 8 0 9 9 8 18 82 89 81 Afghanistan Afghanistan
1 0 1 9 19 8 12 29 19 91 74 93 Argentina Argentina
How should I improve the code without using the insert() function? Can I use iloc to specify the index position?
Thanks for your help in advance!
Use concat, selecting the first 2 columns by position with DataFrame.iloc, where the first : means select all rows:
df = pd.concat((disease.iloc[:, :2], monthly), axis=1)
Or by columns names:
df = pd.concat((disease[['Country','Province']], monthly), axis=1)
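A minimal runnable sketch of the first variant (the two-row frames here are made-up stand-ins for the real data):
import pandas as pd

disease = pd.DataFrame({'Country': ['Afghanistan', 'Argentina'],
                        'Province': ['Afghanistan', 'Argentina'],
                        'Cases': [100, 200]})
monthly = pd.DataFrame([[1, 5, 8], [0, 1, 9]], columns=[1, 2, 3])

# axis=1 concatenates column-wise; both frames share the default 0..n index
df = pd.concat((disease.iloc[:, :2], monthly), axis=1)
print(df)   # Country and Province come first, then the month columns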

Checking if values of a row are consecutive

I have a df like this:
1 2 3 4 5 6
0 5 10 12 35 70 80
1 10 11 23 40 42 47
2 5 26 27 38 60 65
Where all the values in each row are different and have an increasing order.
I would like to create a new column containing 1 or 0 depending on whether the row contains at least 2 consecutive numbers.
For example, the second and third rows qualify because they contain 10 and 11, and 26 and 27 respectively. Is there a more pythonic way than using an iterator?
Thanks
Use DataFrame.diff for the differences within each row, compare them to 1, check whether at least one value per row is True, and finally cast to integers:
df['check'] = df.diff(axis=1).eq(1).any(axis=1).astype(int)
print (df)
1 2 3 4 5 6 check
0 5 10 12 35 70 80 0
1 10 11 23 40 42 47 1
2 5 26 27 38 60 65 1
To improve performance, use numpy:
import numpy as np

arr = df.values
df['check'] = np.any((arr[:, 1:] - arr[:, :-1]) == 1, axis=1).astype(int)
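Both variants side by side in a runnable sketch (data copied from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame([[5, 10, 12, 35, 70, 80],
                   [10, 11, 23, 40, 42, 47],
                   [5, 26, 27, 38, 60, 65]],
                  columns=[1, 2, 3, 4, 5, 6])

# pandas: row-wise first differences, test for any difference of exactly 1
check_pd = df.diff(axis=1).eq(1).any(axis=1).astype(int)

# numpy: the same test on the underlying array
arr = df.values
check_np = np.any((arr[:, 1:] - arr[:, :-1]) == 1, axis=1).astype(int)

assert (check_pd.values == check_np).all()   # both give [0, 1, 1]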

Merge dataframes including extreme values

I have 2 data frames, df1 and df2:
df1
Out[66]:
A B
0 1 11
1 1 2
2 1 32
3 1 42
4 1 54
5 1 66
6 2 16
7 2 23
8 3 13
9 3 24
10 3 35
11 3 46
12 3 51
13 4 12
14 4 28
15 4 39
16 4 49
df2
Out[80]:
B
0 32
1 42
2 13
3 24
4 35
5 39
6 49
I want to merge the dataframes, but at the same time include the rows immediately before and/or after each matched value within its column-A group. This is an example of the desired outcome:
df3
Out[93]:
A B
0 1 2
1 1 32
2 1 42
3 1 54
4 3 13
5 3 24
6 3 35
7 3 46
8 4 28
9 4 39
10 4 49
I'm trying to use merge, but that only keeps the portion of the data frames that coincides. Does someone have an idea how to deal with this? Thanks!
Here's one way to do it using merge with indicator, groupby, and rolling:
df1[df1.merge(df2, on='B', how='left', indicator='Ind').eval('Found = Ind == "both"')
    .groupby('A')['Found']
    .apply(lambda x: x.rolling(3, center=True, min_periods=2).max()).astype(bool)]
Output:
A B
1 1 2
2 1 32
3 1 42
4 1 54
8 3 13
9 3 24
10 3 35
11 3 46
14 4 28
15 4 39
16 4 49
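The centered rolling window is what pulls in the neighbours: after the merge, Found marks the rows of df1 whose B appears in df2, and a max over a window of 3 within each A group turns every match into a True for the matching row plus the rows on either side of it. A sketch of the same idea, using isin in place of the merge/indicator step and transform to keep the original index (an equivalent formulation, not the answerer's exact code):
# Mark rows of df1 whose B value appears in df2
found = df1['B'].isin(df2['B']).astype(int)

# Within each A group, a centered window of 3 spreads each match to the
# row itself and both of its immediate neighbours
mask = (found.groupby(df1['A'])
             .transform(lambda x: x.rolling(3, center=True, min_periods=2).max())
             .astype(bool))
print(df1[mask])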
(pd.concat([df1.groupby('A').min().reset_index(),
            pd.merge(df1, df2, on='B'),
            df1.groupby('A').max().reset_index()])
   .reset_index(drop=True)
   .drop_duplicates()
   .sort_values(['A', 'B']))
A B
0 1 2
4 1 32
5 1 42
1 2 16
2 3 13
7 3 24
8 3 35
3 4 12
9 4 39
10 4 49
Breaking down each part
#Get Minimum
df1.groupby('A').min().reset_index()
# Merge on B
pd.merge(df1,df2, on="B")
# Get Maximum
df1.groupby('A').max().reset_index()
# Reset the index and drop duplicate rows, since the merge and the min/max parts may overlap; then sort values by 'A' and by 'B'
.reset_index(drop=True).drop_duplicates().sort_values(['A','B'])

multiplication of dataframes with different lengths

I have two dataframes: both have 5 columns, but the first one has 100 rows and the second just one row. I need to multiply every row of the first dataframe by the single row of the second, then sum the values across the columns of each row and store this value in a 6th new column, 'sum of multiplications'. I've seen the "np.dot" operation, but I'm not sure I can apply it to dataframes. I'm also looking for a pythonic/pandas operation or method, if it's possible to replace the slightly heavy from-scratch numpy code. Thank you in advance for your advice.
I think you can convert the DataFrames to numpy arrays with values, multiply them, and finally sum:
import pandas as pd
import numpy as np
np.random.seed(1)
df1 = pd.DataFrame(np.random.randint(10, size=(1,5)))
df1.columns = list('ABCDE')
print(df1)
A B C D E
0 5 8 9 5 0
np.random.seed(0)
df2 = pd.DataFrame(np.random.randint(10,size=(10,5)))
df2.columns = list('ABCDE')
print(df2)
A B C D E
0 5 0 3 3 7
1 9 3 5 2 4
2 7 6 8 8 1
3 6 7 7 8 1
4 5 9 8 9 4
5 3 0 3 5 0
6 2 3 8 1 3
7 3 3 7 0 1
8 9 9 0 4 7
9 3 2 7 2 0
print(df2.values * df1.values)
[[25 0 27 15 0]
[45 24 45 10 0]
[35 48 72 40 0]
[30 56 63 40 0]
[25 72 72 45 0]
[15 0 27 25 0]
[10 24 72 5 0]
[15 24 63 0 0]
[45 72 0 20 0]
[15 16 63 10 0]]
df = pd.DataFrame(df2.values * df1.values)
df['sum'] = df.sum(axis=1)
print(df)
0 1 2 3 4 sum
0 25 0 27 15 0 67
1 45 24 45 10 0 124
2 35 48 72 40 0 195
3 30 56 63 40 0 189
4 25 72 72 45 0 214
5 15 0 27 25 0 67
6 10 24 72 5 0 111
7 15 24 63 0 0 102
8 45 72 0 20 0 137
9 15 16 63 10 0 104
Timing:
In [1185]: %timeit df2.mul(df1.ix[0], axis=1)
The slowest run took 5.07 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 287 µs per loop
In [1186]: %timeit pd.DataFrame(df2.values * df1.values)
The slowest run took 6.31 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 98 µs per loop
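For completeness, the pandas-native route from the timing above also answers the question directly while keeping the original column labels; a minimal sketch using the frames defined above (iloc replaces the long-deprecated ix):
# Multiply every row of df2 by the single row of df1, aligned on column labels,
# then sum across the columns into the new column
result = df2.mul(df1.iloc[0], axis=1)
result['sum of multiplications'] = result.sum(axis=1)
print(result)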
You are probably looking for something like this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A': [1.1, 2.7, 3.4],
                    'B': [-1., -2.5, -3.9]})
df1['sum of multiplications'] = df1.sum(axis=1)
df2 = pd.DataFrame({'A': [2.],
                    'B': [3.],
                    'sum of multiplications': [1.]})
print(df1)
print(df2)
row = df2.iloc[0]
df5 = df1.mul(row, axis=1)
df5.loc['Total'] = df5.sum()
print(df5)

pandas drop row below each row containing an 'na'

I have a dataframe with, say, 4 columns [['a','b','c','d']], to which I add another column ['total'] containing the sum of all the other columns for each row. I then add another column ['growth of total'] with the growth rate of the total.
Some of the values in [['a','b','c','d']] are blank, rendering the ['total'] column invalid for these rows. I can easily get rid of these rows with df.dropna(how='any').
However, my growth rate will be invalid not only for the rows with missing values in [['a','b','c','d']], but also for the row following each of them. How do I drop all of these rows?
IIUC, you can use notnull with all to mask off any rows with NaN and any rows that follow NaN rows:
In [43]:
df = pd.DataFrame({'a':[0,np.NaN, 2, 3,np.NaN], 'b':[np.NaN, 1,2,3,4], 'c':[0, np.NaN,2,3,4]})
df
Out[43]:
a b c
0 0 NaN 0
1 NaN 1 NaN
2 2 2 2
3 3 3 3
4 NaN 4 4
In [44]:
df[df.notnull().all(axis=1) & df.shift().notnull().all(axis=1)]
Out[44]:
a b c
3 3 3 3
Here's one option that I think does what you're looking for:
In [76]: df = pd.DataFrame(np.arange(40).reshape(10,4))
In [77]: df.iloc[1, 2] = np.nan
In [78]: df.iloc[6, 1] = np.nan
In [79]: df['total'] = df.sum(axis=1, skipna=False)
In [80]: df
Out[80]:
0 1 2 3 total
0 0 1 2 3 6
1 4 5 NaN 7 NaN
2 8 9 10 11 38
3 12 13 14 15 54
4 16 17 18 19 70
5 20 21 22 23 86
6 24 NaN 26 27 NaN
7 28 29 30 31 118
8 32 33 34 35 134
9 36 37 38 39 150
In [81]: df['growth'] = df['total'].iloc[1:] - df['total'].values[:-1]
In [82]: df
Out[82]:
0 1 2 3 total growth
0 0 1 2 3 6 NaN
1 4 5 NaN 7 NaN NaN
2 8 9 10 11 38 NaN
3 12 13 14 15 54 16
4 16 17 18 19 70 16
5 20 21 22 23 86 16
6 24 NaN 26 27 NaN NaN
7 28 29 30 31 118 NaN
8 32 33 34 35 134 16
9 36 37 38 39 150 16
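A compact variant of the first answer's mask, shown end to end on the same toy frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, np.nan, 2, 3, np.nan],
                   'b': [np.nan, 1, 2, 3, 4],
                   'c': [0, np.nan, 2, 3, 4]})

# Keep a row only if it is complete AND the row above it was complete too
complete = df.notnull().all(axis=1)
print(df[complete & complete.shift(fill_value=False)])   # only row 3 survives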
