The code below provides a cumulative count of how many times the column changes to one of the specified values. A count is only returned on the row where the value actually changes.
import pandas as pd
import numpy as np

d = {'Who': ['Out','Even','Home','Home','Even','Away','Home','Out','Even','Away','Away','Home','Away']}
df = pd.DataFrame(d)

# Specified values
Teams = ['Home', 'Away']
for who in Teams:
    # True at the first row of each new block of `who`
    s = df[df.Who == who].index.to_series().diff() != 1
    df['Change_' + who] = s[s].cumsum()
Output:
Who Change_Home Change_Away
0 Out NaN NaN
1 Even NaN NaN
2 Home 1.0 NaN
3 Home NaN NaN
4 Even NaN NaN
5 Away NaN 1.0
6 Home 2.0 NaN
7 Out NaN NaN
8 Even NaN NaN
9 Away NaN 2.0
10 Away NaN NaN
11 Home 3.0 NaN
12 Away NaN 3.0
I'm trying to further split the output based on which value precedes Home and Away. The code above doesn't differentiate what Home and Away were changed from; it just counts the number of times the value changed to Home/Away.
Is there a way to alter the code above to split it up by what Home/Away was changed from, or will I have to start again?
My intended output is:
    Even_Away  Even_Home  Swap_Away  Swap_Home   Who
0                                                 Out
1                                                Even
2                      1                         Home
3                                                Home
4                                                Even
5           1                                    Away
6                                             1  Home
7                                                 Out
8                                                Even
9           2                                    Away
10                                               Away
11                                            2  Home
12                                1              Away
So Even_ represents how many times it went from Even to Home/Away and Swap_ represents how many times it went from Home to Away or vice versa.
The main function for a dynamic solution is get_dummies - it creates new columns for all previous values defined in the Teams list:
#create DataFrame
df = pd.DataFrame(d)
Teams = ['Home', 'Away']
#create boolean mask for check value by list and compare with shifted column
shifted = df['Who'].shift().fillna('')
m1 = df['Who'].isin(Teams)
#mask for exclude same previous values Home_Home, Away_Away
m2 = df['Who'] == shifted
#chain together, ~ invert mask
m = m1 & ~m2
#join column by mask and create indicator df
df1 = pd.get_dummies(np.where(m, shifted + '_' + df['Who'], np.nan))
#rename columns dynamically
c = df1.columns[df1.columns.str.startswith(tuple(Teams))]
c1 = ['Swap_' + x.split('_')[1] for x in c]
df1 = df1.rename(columns = dict(zip(c, c1)))
#count values by cumulative sum, add column Who
df2 = df1.cumsum().mask(df1 == 0, 0).join(df[['Who']])
print (df2)
Swap_Home Even_Away Even_Home Swap_Away Who
0 0 0 0 0 Out
1 0 0 0 0 Even
2 0 0 1 0 Home
3 0 0 0 0 Home
4 0 0 0 0 Even
5 0 1 0 0 Away
6 1 0 0 0 Home
7 0 0 0 0 Out
8 0 0 0 0 Even
9 0 2 0 0 Away
10 0 0 0 0 Away
11 2 0 0 0 Home
12 0 0 0 1 Away
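If you would rather see blanks instead of 0 on the rows where nothing changed, as in the intended output above, a small follow-up can mask the zeros. A minimal sketch, assuming df1 and df2 from the answer are still in scope:
# show a count only on rows where a change happened, blank otherwise
count_cols = [c for c in df2.columns if c != 'Who']
df3 = df2.copy()
df3[count_cols] = df3[count_cols].where(df1[count_cols].astype(bool), '')
print(df3)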
I have missing values in one column that I would like to fill by random sampling from a source distribution:
import pandas as pd
import numpy as np
source = pd.DataFrame({'age':5*[21],
'location':[0,0,1,1,1],
'x':[1,2,3,4,4]})
source
age location x
0 21 0 1
1 21 0 2
2 21 1 3
3 21 1 4
4 21 1 4
target = pd.DataFrame({'age':5*[21],
'location':[0,0,0,1,2],
'x':5*[np.nan]})
target
age location x
0 21 0 NaN
1 21 0 NaN
2 21 0 NaN
3 21 1 NaN
4 21 2 NaN
Now I need to fill in the missing values of x in the target dataframe by choosing a random value of x (with replacement) from the rows of the source dataframe that have the same values for age and location as the row with the missing x. If there is no value of x in source with the same age and location as the missing value, it should be left as missing.
Expected output:
age location x
0 21 0 1 with probability 0.5 2 otherwise
1 21 0 1 with probability 0.5 2 otherwise
2 21 0 1 with probability 0.5 2 otherwise
3 21 1 3 with probability 0.33 4 otherwise
4 21 2 NaN
I can loop through all the missing combinations of age and location and slice the source dataframe and then take a random sample, but my dataset is large enough that it takes quite a while to do.
Is there a better way?
You can create a MultiIndex in both DataFrames and then, in a custom function applied with GroupBy.transform, replace NaN values using numpy.random.choice on the matching source group:
source = pd.DataFrame({'age':5*[21],
'location':[0,0,1,1,1],
'x':[1,2,3,4,4]})
target = pd.DataFrame({'age':5*[21],
'location':[0,0,0,1,2],
'x':5*[np.nan]})
cols = ['age', 'location']
source1 = source.set_index(cols)['x']
target1 = target.set_index(cols)['x']
def f(x):
    try:
        # all source values of x available for this (age, location) group
        a = source1.loc[x.name].to_numpy()
        m = x.isna()
        x[m] = np.random.choice(a, size=m.sum())
        return x
    except KeyError:
        # no matching (age, location) in source -> leave the group as NaN
        return np.nan
target1 = target1.groupby(level=[0,1]).transform(f).reset_index()
print (target1)
age location x
0 21 0 1.0
1 21 0 2.0
2 21 0 2.0
3 21 1 3.0
4 21 2 NaN
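Because the fill values are drawn at random, the filled column will differ between runs. If you need reproducible output, seeding NumPy's global random state before running either approach should do it (not part of the original answer):
np.random.seed(42)  # makes np.random.choice (and DataFrame.sample below) repeatable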
You can create a common grouper and perform a merge:
cols = ['age', 'location']
(target[cols]
   .assign(group=target.groupby(cols).cumcount())   # compute subgroup for duplicates
   .merge(  # below: assign a random row group to each source row
          source.assign(group=source.sample(frac=1).groupby(cols, sort=False).cumcount())
                .groupby(cols + ['group'], as_index=False)
                .first(),                            # get one row per group
          on=cols + ['group'], how='left')           # merge
   # .drop('group', axis=1)                          # column kept for clarity, uncomment to remove
)
Output (note: this run was generated from a different random example than the source/target defined above):
age location group x
0 20 0 0 0.339955
1 20 0 1 0.700506
2 21 0 0 0.777635
3 22 1 0 NaN
I have a question related to the earlier question: Identifying consecutive NaN's with pandas
I am new on stackoverflow so I cannot add a comment, but I would like to know how I can partly keep the original index of the dataframe when counting the number of consecutive nans.
So instead of:
df = pd.DataFrame({'a':[1,2,np.NaN, np.NaN, np.NaN, 6,7,8,9,10,np.NaN,np.NaN,13,14]})
df
Out[38]:
a
0 1
1 2
2 NaN
3 NaN
4 NaN
5 6
6 7
7 8
8 9
9 10
10 NaN
11 NaN
12 13
13 14
I would like to obtain the following:
Out[41]:
a
0 0
1 0
2 3
5 0
6 0
7 0
8 0
9 0
10 2
12 0
13 0
I have found a workaround. It is quite ugly, but it does the trick. I hope you don't have massive data, because it might not perform very well:
df = pd.DataFrame({'a':[1,2,np.NaN, np.NaN, np.NaN, 6,7,8,9,10,np.NaN,np.NaN,13,14]})
df1 = df.a.isnull().astype(int).groupby(df.a.notnull().astype(int).cumsum()).sum()
# Determine the different groups of NaNs. We only want to keep the 1st. The 0's are non-NaN values, the 1's are the first in a group of NaNs.
b = df.isna()
df2 = b.cumsum() - b.cumsum().where(~b).ffill().fillna(0).astype(int)
df2 = df2.loc[df2['a'] <= 1]
# Set index from the non-zero 'NaN-count' to the index of the first NaN
df3 = df1.loc[df1 != 0]
df3.index = df2.loc[df2['a'] == 1].index
# Update the values from df3 (which has the right values, and the right index), to df2
df2.update(df3)
The NaN-group logic is inspired by an answer to the question linked above (Identifying consecutive NaN's with pandas).
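For comparison, a more compact sketch of the same idea (not part of the original workaround; it assumes the df defined above):
m = df['a'].isna()
groups = (~m).cumsum()                      # group id, constant within each NaN run
runs = m.groupby(groups).transform('sum')   # run length, broadcast over the group
first = m & ~m.shift(fill_value=False)      # True only at the first NaN of a run
out = runs.where(first, 0)[~m | first]      # 0 on non-NaN rows, run length on first-NaN rows
print(out)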
I'm trying to create 2-dimensional bins from a pandas DataFrame based on 3 columns. Here is a snippet from my DataFrame:
Scatters N z Dist_first
---------------------------------------
0 0 0 0.096144 2.761508
1 1 0 -8.229910 17.403039
2 2 0 0.038125 21.466233
3 3 0 -2.050480 29.239867
4 4 0 -1.620470 NaN
5 5 0 -1.975930 NaN
6 6 0 -11.672200 NaN
7 7 0 -16.629000 26.554049
8 8 0 0.096002 NaN
9 9 0 0.176049 NaN
10 10 0 0.176005 NaN
11 11 0 0.215408 NaN
12 12 0 0.255889 NaN
13 13 0 0.301834 27.700308
14 14 0 -29.593600 9.155065
15 15 1 -2.582290 NaN
16 16 1 0.016441 2.220946
17 17 1 -17.329100 NaN
18 18 1 -5.442320 34.520919
19 19 1 0.001741 39.579189
For my result, each Dist_first should be binned together with all "z <= 0" values of lower index (within the same group N) than the distance itself. "Scatters" is a copy of the index left over from an operation at an earlier stage of my code, which is not relevant here; nonetheless I came to use it instead of the index in the example below. The bins for the distances and z values are in 10 m and 0.1 m steps, respectively, and I can obtain a result by looping through groups of the DataFrame:
# create new column for maximal possible distances per group N
for j in range(N.groupby('N')['Dist_first'].count().max()):
    N[j+1] = N.loc[N[N['Dist_first'].notna()].groupby('N')['Scatters'].nlargest(j+1).groupby('N').min()]['Dist_first']
    # fill nans with zeros to allow
    N[j+1] = N[j+1].fillna(0)
    # make sure no value is repeated
    if j+1 > 1:
        N[j+1] = N[j+1] - N[list(np.arange(j)+1)].sum(axis=1)

# and set all values <= 0 to NaN
N[N[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)] <= 0] = np.nan
# backwards fill to make sure every distance gets all necessary depths
N[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)] = N.set_index('N').groupby('N').bfill().set_index('Scatters')[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)]

# bin the result(s)
for j in range(N.groupby('N')['Dist_first'].count().max()):
    binned = N[N['z'] >= 0].groupby([pd.cut(N[N['z'] >= 0]['z'], bins_v, include_lowest=True),
                                     pd.cut(N[N['z'] >= 0][j+1], bins_h, include_lowest=True)])
    binned = binned.size().unstack()
    ## rename
    binned.index = N_v.index; binned.columns = N_h.index
    ## and sum up with earlier chunks
    V = V + binned
This bit of code works just fine and the result for the small snippet of the data I've shared looks like this:
Distance [m] 0.0 10.0 20.0 30.0 40.0
Depth [m]
----------------------------------------------------
0.0 1 1 1 4 2
0.1 1 2 2 4 0
0.2 0 3 0 3 0
0.3 0 2 0 2 0
0.4 0 0 0 0 0
However, the whole datasets are excessively large (> 300 million rows each) and looping through all rows is not an option. Therefore I'm looking for a vectorized solution.
I suggest you calculate the criteria in extra columns and then use a standard pandas binning function, like qcut. It can be applied separately along the two binning dimensions. Not the most elegant, but definitely vectorized.
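To make that concrete, here is a minimal sketch of the binning step only (it does not reproduce the lower-index pairing logic from the question). The column names z and Dist_first and the 0.1 m / 10 m bin widths come from the question; the data is purely illustrative, and fixed-width pd.cut is used here instead of qcut:
import numpy as np
import pandas as pd
# illustrative data only; in practice this would be the large DataFrame from the question
rng = np.random.default_rng(0)
df = pd.DataFrame({'z': rng.uniform(0.0, 0.5, 1000),
                   'Dist_first': rng.uniform(0.0, 50.0, 1000)})
bins_v = np.arange(0.0, 0.6, 0.1)    # depth bins, 0.1 m steps
bins_h = np.arange(0.0, 60.0, 10.0)  # distance bins, 10 m steps
# binning criteria as extra columns - fully vectorized, no row loop
df['z_bin'] = pd.cut(df['z'], bins_v, include_lowest=True)
df['d_bin'] = pd.cut(df['Dist_first'], bins_h, include_lowest=True)
# 2-D table of counts per (depth bin, distance bin)
counts = df.groupby(['z_bin', 'd_bin'], observed=False).size().unstack()
print(counts)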
Assume that we have the following pandas dataframe:
df = pd.DataFrame({'x':[0,0,1,0,0,0,0],'y':[1,1,1,1,1,1,0],'z':[0,1,1,1,0,0,1]})
x y z
0 0 1 0
1 0 1 1
2 1 1 1
3 0 1 1
4 0 1 0
5 0 1 0
6 0 0 1
The whole dataframe is filled with either 1 or 0. Looking at each column separately, if the current row value is different from the previous value, I need to count the number of previous consecutive values:
   x  y  z
0
1        1
2  2
3  1
4        3
5
6     6  2
I tried to write a lambda function and apply it to the entire dataframe, but I failed. Any idea?
Let's try this:
def f(col):
    x = (col != col.shift().bfill())
    s = x.cumsum()
    return s.groupby(s).transform('count').shift().where(x)

df.apply(f).fillna('')
Output:
   x  y  z
0
1        1
2  2
3  1
4        3
5
6     6  2
Details:
Use apply to run a custom function on each column of the dataframe.
Find the change spots in the column, then use cumsum to create groups of consecutive values, then groupby and transform to attach a count to each record, and finally mask the values in the column using where so that only the change spots remain (illustrated step by step below).
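To make those steps concrete, here is the same function unrolled for a single column, using the x column from the example:
col = df['x']                              # [0, 0, 1, 0, 0, 0, 0]
x = col != col.shift().bfill()             # True where the value differs from the previous one
s = x.cumsum()                             # group id that increases at every change
counts = s.groupby(s).transform('count')   # length of the group each row belongs to
result = counts.shift().where(x)           # previous group's length, only at the change spots
print(result)                              # 2.0 at index 2, 1.0 at index 3, NaN elsewhere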
You can try the following, where you first identify the "runs" and get their lengths. You only want an entry where the value switches, so you keep the lengths of all runs except the last one.
import pandas as pd
import numpy as np

def func(x, missing=np.NaN):
    # id of the run each element belongs to
    runs = np.cumsum(np.append(0, np.diff(x) != 0))
    # positions where the value switches to something new
    switches = np.where(np.diff(x) != 0)[0] + 1
    out = np.repeat(missing, len(x))
    # at each switch point, record the length of the run that just ended
    out[switches] = np.bincount(runs)[:-1]
    # thanks to Scott, see comments below
    ##out[switches] = pd.value_counts(runs, sort=False)[:-1]
    return out

df.apply(func)
x y z
0 NaN NaN NaN
1 NaN NaN 1.0
2 2.0 NaN NaN
3 1.0 NaN NaN
4 NaN NaN 3.0
5 NaN NaN NaN
6 NaN 6.0 2.0
It might be faster with a good implementation of run-length encoding, but I am not too familiar with it in Python.
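For what it's worth, below is a rough sketch of what a run-length-encoding based version could look like; this is only a guess at that idea, not the author's code:
import numpy as np
import pandas as pd
def rle_func(x, missing=np.NaN):
    # same result as func above, expressed through run lengths
    values = np.asarray(x)
    # start index of every run: position 0 plus every switch point
    starts = np.concatenate(([0], np.flatnonzero(np.diff(values) != 0) + 1))
    lengths = np.diff(np.append(starts, len(values)))  # length of each run
    out = np.full(len(values), missing, dtype=float)
    # at each switch point, record the length of the run that just ended
    out[starts[1:]] = lengths[:-1]
    return out
print(df.apply(rle_func))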
I want to make a table with all available products for every customer. However, I only have a table with the product/customer combinations that were actually bought. I want to make a new table that also includes the products that were not bought by a customer. The current table looks as follows:
The table I want to end up with is:
Could anyone help me with how to do this in pandas?
One way to do this is to use pd.MultiIndex and reindex:
df = pd.DataFrame({'Product':list('ABCDEF'),
'Customer':[1,1,2,3,3,3],
'Amount':[4,5,3,1,1,2]})
indx = pd.MultiIndex.from_product([df['Product'].unique(),
df['Customer'].unique()],
names=['Product','Customer'])
df.set_index(['Product','Customer'])\
.reindex(indx, fill_value=0)\
.reset_index()\
.sort_values(['Customer','Product'])
Output:
Product Customer Amount
0 A 1 4
3 B 1 5
6 C 1 0
9 D 1 0
12 E 1 0
15 F 1 0
1 A 2 0
4 B 2 0
7 C 2 3
10 D 2 0
13 E 2 0
16 F 2 0
2 A 3 0
5 B 3 0
8 C 3 0
11 D 3 1
14 E 3 1
17 F 3 2
You can also create a pivot to do what you want in one line. Note that the output format is different -- it's a wide table produced by pandas.DataFrame.pivot (one column per customer) rather than the long format above. But if you're not especially fussed about that (it depends on how you intend to use the final table), the following code does the job.
df = pd.DataFrame({'Product':['A','B','C','D','E','F'],
'Customer':[1,1,2,3,3,3],
'Amount':[4,5,3,1,1,2]})
pivot_df = df.pivot(index='Product',
columns='Customer',
values='Amount').fillna(0).astype('int')
Output:
Customer 1 2 3
Product
A 4 0 0
B 5 0 0
C 0 3 0
D 0 0 1
E 0 0 1
F 0 0 2
df.pivot creates NaN values when there are no corresponding entries in the original df (it creates a NaN value for Product A and Customer 2, for instance). NaNs are float values, so all the 'Amounts' in the pivot are implicitly converted into floats. This is why I use fillna(0) to convert the NaN values into 0s, and then finally change the dtype back to int.
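If you do need the long Product/Customer/Amount layout from the first answer rather than the wide pivot, one option is to stack the pivot back. A small sketch, not part of the original answer:
# reshape the wide pivot back into the long format used by the reindex answer
long_df = (pivot_df.stack()
                   .reset_index(name='Amount')
                   .sort_values(['Customer', 'Product']))
print(long_df)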