Compare groupby and np.where with R's data.table package - python

I'm struggling to create a new variable in Python based on a condition with a groupby statement. I know there is now a datatable package in Python, but I would like to explore the options below.
In R, I did the following:
require(data.table)
id <- c(1,2,3,4,1,5)
units <- c(34, 31, 16, 32, NA, 35)
unitsnew <- c(23, 11, 21, 22, 27, 11)
df = data.table(id, units, unitsnew)
df[, col:= ifelse(is.na(units), min(unitsnew, na.rm = T), units), by = id]
id units unitsnew col
1 34 23 34
2 31 11 31
3 16 21 16
4 32 22 32
1 NA 27 23
5 35 11 35
In Python,
import pandas as pd
import numpy as np
matrix = [(1, 34, 23),
(2, 31, 11),
(3, 16, 21),
(4, 32, 22),
(1, np.nan, 27),
(5, 35, 11)]
df = pd.DataFrame(matrix, columns = ['id', 'units', 'unitsnew'])
df.assign(col = np.where(np.isnan(df.units), np.nanmin(df.unitsnew), df.units))
id units unitsnew col
1 34 23 34
2 31 11 31
3 16 21 16
4 32 22 32
1 NA 27 11
5 35 11 35
I'm not sure where to fit the groupby statement such that the minimum value per id is taken. The desired output is generated with R. In python, the nan value is filled with 11 because I failed to add the groupby statement. How can I solve this? Thanks,

First, let's look at how to get the minimum value per group using groupby + transform('min'):
df.groupby('id')['unitsnew'].transform('min')
output:
0 23
1 11
2 21
3 22
4 23
5 11
Name: unitsnew, dtype: int64
Now we can use where to assign the above value when df['units'].isna():
df['col'] = df['units'].where(df['units'].notna(),
df.groupby('id')['unitsnew'].transform('min'))
output:
id units unitsnew col
0 1 34.0 23 34.0
1 2 31.0 11 31.0
2 3 16.0 21 16.0
3 4 32.0 22 32.0
4 1 NaN 27 23.0
5 5 35.0 11 35.0
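Equivalently (a small variant, not shown in the answer above), fillna can be used with the same grouped minimum:
# Same result: fill the missing units with each id's minimum unitsnew
df['col'] = df['units'].fillna(df.groupby('id')['unitsnew'].transform('min'))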

Related

Randomly Sample in Python with certain distribution

I want to create a dataframe with two columns: an 'id' column which repeats the ids 1-100 three times, and an 'age' column where I randomly sample ages 0-14 17% of the time, ages 15-64 65% of the time, and ages 65-100 18% of the time.
Example DF:
id age
1 21
1 21
1 21
2 45
2 45
2 45
3 64
3 64
3 64
Code I have so far:
import numpy as np
import pandas as pd
N = 100
R = 3
d = {'id': np.repeat(np.arange(1, N + 1), R)}
df = pd.DataFrame(d)
I'm stuck on how to simulate the age though.
How can I do this?
You can apply numpy.random.randint for your specific ranges and counts (note that randint's upper bound is exclusive):
df['ages'] = np.repeat(np.concatenate([np.random.randint(0, 15, 17),
                                       np.random.randint(15, 65, 65),
                                       np.random.randint(65, 101, 18)]), R)
print(df)
print(df)
If needed, the concatenated array can additionally be shuffled with np.random.shuffle (before the ages are repeated with np.repeat):
ages = np.concatenate([np.random.randint(0, 15, 17),
                       np.random.randint(15, 65, 65),
                       np.random.randint(65, 101, 18)])
np.random.shuffle(ages)
df['ages'] = np.repeat(ages, R)
id ages
0 1 11
1 1 11
2 1 11
3 2 3
4 2 3
5 2 3
6 3 12
7 3 12
8 3 12
9 4 8
10 4 8
11 4 8
12 5 10
13 5 10
14 5 10
.. ... ...
285 96 70
286 96 70
287 96 70
288 97 83
289 97 83
290 97 83
291 98 70
292 98 70
293 98 70
294 99 98
295 99 98
296 99 98
297 100 92
298 100 92
299 100 92
I suggest this method:
import pandas as pd
import numpy as np
ids = np.repeat(range(1, 101), 3)
age_choices = [(np.arange(0, 15), 0.17), (np.arange(15, 65), 0.65), (np.arange(65, 101), 0.18)]
ages = np.concatenate([np.random.choice(choice[0], size=int(len(ids)*choice[1]), replace=True) for choice in age_choices])
df = pd.DataFrame({'id': ids, 'age': ages})
print(df.head(30))
which gives
id age
0 1 2
1 1 13
2 1 8
3 2 14
4 2 0
5 2 14
6 3 7
7 3 6
8 3 9
9 4 13
10 4 9
11 4 6
12 5 7
13 5 6
14 5 12
15 6 12
16 6 4
17 6 2
18 7 0
19 7 10
20 7 4
21 8 10
22 8 8
23 8 1
24 9 10
25 9 5
26 9 13
27 10 8
28 10 13
29 10 4
Maybe something like:
import numpy as np
import pandas as pd
N = 100
R = 3
ids = np.arange(1, N + 1)
# Assuming max age of 99
possible_ages = np.arange(100)
sizes = np.array([16, 50, 34])
percentages = np.array([17, 65, 18])
ages = np.random.choice(possible_ages, size=N, p=np.repeat(percentages / sizes / 100, sizes))
df = pd.DataFrame({
"id": np.repeat(ids, R),
"age": np.repeat(ages, R)
})
Alternatively you could sample the age group using your specified percentages first, then uniformly sample from the obtained group after.
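A minimal sketch of that two-stage approach (not one of the answers above), using numpy's default Generator; the group boundaries are half-open [low, high) ranges chosen to match the question:
import numpy as np
import pandas as pd
N, R = 100, 3
rng = np.random.default_rng()
groups = [(0, 15), (15, 65), (65, 101)]   # [low, high) age ranges
probs = [0.17, 0.65, 0.18]
# Step 1: pick an age group per id, weighted by the requested percentages.
picks = rng.choice(len(groups), size=N, p=probs)
# Step 2: sample uniformly inside the chosen group.
ages = np.array([rng.integers(*groups[g]) for g in picks])
df = pd.DataFrame({'id': np.repeat(np.arange(1, N + 1), R),
                   'age': np.repeat(ages, R)})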

Fill Missing values with max from a group and get rows corresponding to that max value

I have an input data as shown:
df = pd.DataFrame({"colony" : [22, 22, 22, 33, 33, 33],
"measure" : [np.nan, 7, 11, 13, np.nan, 9,],
"Length" : [14, 17, 13, 10, 19,16],
"net/gross" : [np.nan, "gross", "net", "gross", "np.nan", "net"]})
df
colony measure Length net/gross
0 22 NaN 14 NaN
1 22 7 17 gross
2 22 11 13 net
3 33 13 10 gross
4 33 NaN 19 NaN
5 33 9 16 net
I want to fill the NaN in the measure column with the maximum value from each colony group, and fill the NaN in the net/gross column with the net/gross value of the row where measure is maximum (e.g. fill the NaN at index 0 with "net", and 13 in the length_adj column, because that is the row where colony 22's measure is max). I also want a remarks column that marks the filled rows as "max_filled" and the other rows as "unchanged", to arrive at the output below:
colony measure net/gross length_adj remarks
0 22 11 net 13 max_filled
1 22 7 gross 17 unchanged
2 22 11 net 13 unchanged
3 33 13 gross 10 unchanged
4 33 13 gross 10 max_filled
5 33 9 net 16 unchanged
One approach that allows maximum control of each step (but may be less efficient than more direct pandas methods) is to use apply (with axis=1 to iterate rows) with a custom function, passing the dataframe as an argument as well.
You can use np.isnan to check whether a row's measure is NaN.
Without using groupby, you can, for each row, select the subset of the dataframe for the corresponding colony and retrieve the index of its maximum measure with idxmax():
def my_func(row, df):
    if np.isnan(row.measure):
        max_index_location = df[df.colony == row.colony]['measure'].idxmax()
        row.measure = df.loc[max_index_location, 'measure']
        row['Length'] = df.loc[max_index_location, 'Length']
        row['net/gross'] = df.loc[max_index_location, 'net/gross']
        row['remarks'] = 'max_filled'
    else:
        row['remarks'] = 'unchanged'
    return row

df = df.apply(lambda x: my_func(x, df), axis=1)
Dataframe will be:
   colony  measure  Length net/gross     remarks
0      22       11      13       net  max_filled
1      22        7      17     gross   unchanged
2      22       11      13       net   unchanged
3      33       13      10     gross   unchanged
4      33       13      10     gross  max_filled
5      33        9      16       net   unchanged
Here you go:
df['measure'] = df['measure'].fillna(df.groupby('colony')['measure'].transform('max'))
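That one-liner only fills the measure column. Starting again from the original df in the question, here is a minimal sketch (not part of this answer) of one way to also pull net/gross and Length from the row holding each colony's maximum measure and to add the remarks column:
import numpy as np

was_nan = df['measure'].isna()
# Index of each colony's maximum measure (idxmax skips NaN)
max_idx = df.groupby('colony')['measure'].idxmax()
max_rows = df.loc[max_idx, ['colony', 'measure', 'Length', 'net/gross']]
# Broadcast the max-row values to every row of the same colony
filled = df[['colony']].merge(max_rows, on='colony', how='left')
for col in ['measure', 'Length', 'net/gross']:
    df[col] = df[col].where(~was_nan, filled[col])
df['remarks'] = np.where(was_nan, 'max_filled', 'unchanged')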
Step 1: fill the NaN in the measure column with each colony's max
s = df.groupby('colony')['measure'].transform(lambda x: x.fillna(x.max()))
s
0 11.0
1 7.0
2 11.0
3 13.0
4 13.0
5 9.0
Name: measure, dtype: float64
Make s the measure column:
df[['colony']].assign(measure=s)
result A
colony measure
0 22 11.0
1 22 7.0
2 22 11.0
3 33 13.0
4 33 13.0
5 33 9.0
Step 2: drop the rows that contain NaN
df1 = df[df.columns[::-1]].dropna()
df1
net/gross Length measure colony
1 gross 17 7.0 22
2 net 13 11.0 22
3 gross 10 13.0 33
5 net 16 9.0 33
Step 3: merge result A and df1
df[['colony']].assign(measure=s).merge(df1, how='left')
resultB
colony measure net/gross Length
0 22 11.0 net 13
1 22 7.0 gross 17
2 22 11.0 net 13
3 33 13.0 gross 10
4 33 13.0 gross 10
5 33 9.0 net 16
Step 4: turn result B into the desired output (full code below)
import pandas as pd
import numpy as np
# df as defined in the question
s = df.groupby('colony')['measure'].transform(lambda x: x.fillna(x.max()))
df1 = df[df.columns[::-1]].dropna()
s2 = np.where(df['measure'].isna(), 'max_filled', 'unchanged')
(df[['colony']].assign(measure=s).merge(df1, how='left')
 .assign(remark=s2).rename(columns={'Length': 'Length_adj'}))
output
colony measure net/gross Length_adj remark
0 22 11.0 net 13 max_filled
1 22 7.0 gross 17 unchanged
2 22 11.0 net 13 unchanged
3 33 13.0 gross 10 unchanged
4 33 13.0 gross 10 max_filled
5 33 9.0 net 16 unchanged

pandas convert dataframe to pivot_table where index is the sorting values

I have the following dataframe:
site height_id height_meters
0 9 c3 24
1 9 c2 30
2 9 c1 36
3 3 c0 18
4 3 bf 24
5 3 be 30
6 4 10 18
7 4 0f 24
8 4 0e 30
I want to transform it to the following, where the column labels are the values of 'site', the cell values are 'height_meters', and the rows are indexed by the order of the values within each site (I looked on the internet and didn't find something similar... tried groupby and a pivot table without success):
   9   3   4
0  24  18  18
1  30  24  24
2  36  30  30
The gap between the numbers isn't necessary...
here is the df
my_df = pd.DataFrame(dict(
site=[9, 9, 9, 3, 3, 3, 4, 4, 4],
height_id='c3,c2,c1,c0,bf,be,10,0f,0e'.split(','),
height_meters=[24, 30, 36, 18, 24, 30, 18, 24, 30]
))
You can use GroupBy.cumcount for a per-group counter of column site:
print (my_df.groupby('site').cumcount())
0 0
1 1
2 2
3 0
4 1
5 2
6 0
7 1
8 2
dtype: int64
You can convert it to index with site column and reshape by Series.unstack:
df = my_df.set_index([my_df.groupby('site').cumcount(), 'site'])['height_meters'].unstack()
print (df)
site 3 4 9
0 18 18 24
1 24 24 30
2 30 30 36
Similar solution with DataFrame.pivot and a column created by cumcount (keyword arguments are required in newer pandas):
df = (my_df.assign(new=my_df.groupby('site').cumcount())
           .pivot(index='new', columns='site', values='height_meters'))
print (df)
site 3 4 9
new
0 18 18 24
1 24 24 30
2 30 30 36
If order is important add DataFrame.reindex by unique values of column site:
df = (my_df.set_index([my_df.groupby('site').cumcount(), 'site'])['height_meters']
.unstack()
.reindex(my_df['site'].unique(), axis=1))
print (df)
site 9 3 4
0 24 18 18
1 30 24 24
2 36 30 30
Finally, to remove the site (new) column and index names, use DataFrame.rename_axis:
df = df.rename_axis(index=None, columns=None)
print (df)
3 4 9
0 18 18 24
1 24 24 30
2 30 30 36
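For completeness, a close variant (not from the answer above) using pivot_table with the same cumcount counter; aggfunc='first' simply keeps the single value present in each cell, and the reindex/rename_axis steps mirror the ones shown earlier:
counter = my_df.groupby('site').cumcount()
out = (my_df.assign(new=counter)
            .pivot_table(index='new', columns='site',
                         values='height_meters', aggfunc='first')
            .reindex(my_df['site'].unique(), axis=1)
            .rename_axis(index=None, columns=None))
print(out)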

Python pandas: Fast way to create a unique identifier for groups

I have data that looks something like this
df
Out[10]:
ID1 ID2 Price Date
0 11 21 10.99 3/15/2016
1 11 22 11.99 3/15/2016
2 12 23 5 3/15/2016
3 11 21 10.99 3/16/2016
4 11 22 12.99 3/16/2016
5 11 21 10.99 3/17/2016
6 11 22 11.99 3/17/2016
The goal is to get a unique ID for each group of ID1 with particular prices for each of its ID2's, like so:
# Desired Result
df
Out[14]:
ID1 ID2 Price Date UID
0 11 21 10.99 3/15/2016 1
1 11 22 11.99 3/15/2016 1
2 12 23 5 3/15/2016 7
3 11 21 10.99 3/16/2016 5
4 11 22 12.99 3/16/2016 5
5 11 21 10.99 3/17/2016 1
6 11 22 11.99 3/17/2016 1
Speed is an issue because of the size of the data. The best way I can come up with is below, but it is still a fair amount slower than is desirable. If anyone has a way that they think should be naturally faster I'd love to hear it. Or perhaps there is an easy way to do the within group operations in parallel to speed things up?
My method basically concatenates ID's and prices (after padding with zeros to ensure same lengths) and then takes ranks to simplify the final ID. The bottleneck is the within group concatenation done with .transform(np.sum).
# concatenate ID2 and Price
df['ID23'] = df['ID2'].astype(str) + df['Price'].astype(str)
df
Out[12]:
ID1 ID2 Price Date ID23
0 11 21 10.99 3/15/2016 2110.99
1 11 22 11.99 3/15/2016 2211.99
2 12 23 5 3/15/2016 235
3 11 21 10.99 3/16/2016 2110.99
4 11 22 12.99 3/16/2016 2212.99
5 11 21 10.99 3/17/2016 2110.99
6 11 22 11.99 3/17/2016 2211.99
# groupby ID1 and Date and then concatenate the ID23's
grouped = df.groupby(['ID1','Date'])
df['summed'] = grouped['ID23'].transform(np.sum)
df
Out[16]:
ID1 ID2 Price Date ID23 summed UID
0 6 3 0010.99 3/15/2016 30010.99 30010.9960011.99 630010.9960011.99
1 6 6 0011.99 3/15/2016 60011.99 30010.9960011.99 630010.9960011.99
2 7 7 0000005 3/15/2016 70000005 70000005 770000005
3 6 3 0010.99 3/16/2016 30010.99 30010.9960012.99 630010.9960012.99
4 6 6 0012.99 3/16/2016 60012.99 30010.9960012.99 630010.9960012.99
5 6 3 0010.99 3/17/2016 30010.99 30010.9960011.99 630010.9960011.99
6 6 6 0011.99 3/17/2016 60011.99 30010.9960011.99 630010.9960011.99
# Concatenate ID1 on the front and take rank to get simpler ID's
df['UID'] = df['ID1'].astype(str) + df['summed']
df['UID'] = df['UID'].rank(method = 'min')
# Drop unnecessary columns
df.drop(['ID23','summed'], axis=1, inplace=True)
UPDATE:
To clarify, consider the original data grouped as follows:
grouped = df.groupby(['ID1','Date'])
for name, group in grouped:
print group
ID1 ID2 Price Date
0 11 21 10.99 3/15/2016
1 11 22 11.99 3/15/2016
ID1 ID2 Price Date
3 11 21 10.99 3/16/2016
4 11 22 12.99 3/16/2016
ID1 ID2 Price Date
5 11 21 10.99 3/17/2016
6 11 22 11.99 3/17/2016
ID1 ID2 Price Date
2 12 23 5 3/15/2016
UID's should be at the group level and match if everything about that group is identical ignoring the date. So in this case the first and third printed groups are the same, meaning that rows 0,1,5, and 6 should all get the same UID. Rows 3 and 4 belong to a different group because a price changed and therefore need a different UID. Row 2 is also a different group.
A slightly different way of looking at this problem is that I want to group as I have here, drop the date column (which was important for initially forming the groups) and then aggregate across groups which are equal once I have removed the dates.
Edit: The code below is actually slower than OP's solution. I'm leaving it as it is for now in case someone uses it to write a better solution.
For visualization, I'll be using the following data:
df
Out[421]:
ID1 ID2 Price Date
0 11 21 10.99 3/15/2016
1 11 22 11.99 3/15/2016
2 12 23 5.00 3/15/2016
3 11 21 10.99 3/16/2016
4 11 22 12.99 3/16/2016
5 11 21 10.99 3/17/2016
6 11 22 11.99 3/17/2016
7 11 22 11.99 3/18/2016
8 11 21 10.99 3/18/2016
9 12 22 11.99 3/18/2016
10 12 21 10.99 3/18/2016
11 12 23 5.00 3/19/2016
12 12 23 5.00 3/19/2016
First, let's group it by 'ID1' and 'Date' and aggregate the results as tuples (sorted). I also reset the index, so there is a new columns named 'index'.
gr = df.reset_index().groupby(['ID1','Date'], as_index = False)
df1 = gr.agg(lambda x : tuple(sorted(x)))
df1
Out[425]:
ID1 Date index ID2 Price
0 11 3/15/2016 (0, 1) (21, 22) (10.99, 11.99)
1 11 3/16/2016 (3, 4) (21, 22) (10.99, 12.99)
2 11 3/17/2016 (5, 6) (21, 22) (10.99, 11.99)
3 11 3/18/2016 (7, 8) (21, 22) (10.99, 11.99)
4 12 3/15/2016 (2,) (23,) (5.0,)
5 12 3/18/2016 (9, 10) (21, 22) (10.99, 11.99)
6 12 3/19/2016 (11, 12) (23, 23) (5.0, 5.0)
After all grouping is done, I'll use indices from column 'index' to access rows from df (they'd better be unique). (Notice also that df1.index and df1['index'] are completely different things.)
Now, let's group 'index' (skipping dates):
df2 = df1.groupby(['ID1','ID2','Price'], as_index = False)['index'].sum()
df2
Out[427]:
ID1 ID2 Price index
0 11 (21, 22) (10.99, 11.99) (0, 1, 5, 6, 7, 8)
1 11 (21, 22) (10.99, 12.99) (3, 4)
2 12 (21, 22) (10.99, 11.99) (9, 10)
3 12 (23,) (5.0,) (2,)
4 12 (23, 23) (5.0, 5.0) (11, 12)
I believe this is the grouping needed for the problem, so we can now add labels to df. For example like this:
df['GID'] = -1
for i, t in enumerate(df2['index']):
    df.loc[list(t), 'GID'] = i
df
Out[430]:
ID1 ID2 Price Date GID
0 11 21 10.99 3/15/2016 0
1 11 22 11.99 3/15/2016 0
2 12 23 5.00 3/15/2016 3
3 11 21 10.99 3/16/2016 1
4 11 22 12.99 3/16/2016 1
5 11 21 10.99 3/17/2016 0
6 11 22 11.99 3/17/2016 0
7 11 22 11.99 3/18/2016 0
8 11 21 10.99 3/18/2016 0
9 12 22 11.99 3/18/2016 2
10 12 21 10.99 3/18/2016 2
11 12 23 5.00 3/19/2016 4
12 12 23 5.00 3/19/2016 4
Or in a possibly faster but tricky way:
# EXPERIMENTAL CODE!
df3 = df2['index'].apply(pd.Series).stack().reset_index()
df3.index = df3[0].astype(int)
df['GID'] = df3['level_0']
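For reference, a minimal sketch of another possible approach (not the OP's method and not the answer above), assuming the original df from the question: build a hashable signature for each (ID1, Date) group from ID1 and its sorted (ID2, Price) pairs, then factorize the signatures so identical groups share one integer UID. The UID values are arbitrary labels (0, 1, 2, ...), not the exact numbers in the desired output.
import pandas as pd

# Signature per (ID1, Date) group: ID1 plus its sorted (ID2, Price) pairs,
# so the date itself never enters the signature.
sig = (df.sort_values(['ID2', 'Price'])
         .groupby(['ID1', 'Date'])
         .apply(lambda g: (g['ID1'].iat[0],) + tuple(zip(g['ID2'], g['Price']))))

# Identical signatures get the same integer label; map it back onto the rows.
uid = pd.Series(pd.factorize(sig)[0], index=sig.index, name='UID')
df = df.join(uid, on=['ID1', 'Date'])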

Python pandas.cut

Edit: Added defT
Does using pandas.cut change the structure of a pandas.DataFrame?
I am using pandas.cut in the following manner to map single age years to age groups and then aggregating afterwards. However, the aggregation does not work as I end up with NaN in all columns that are being aggregated. Here is my code:
cutoff = numpy.hstack([numpy.array(defT.MinAge[0]), defT.MaxAge.values])
labels = defT.AgeGrp
df['ageGrp'] = pandas.cut(df.Age,
bins = cutoff,
labels = labels,
include_lowest = True)
Here is defT:
AgeGrp MaxAge MinAge
1 18 14
2 21 19
3 24 22
4 34 25
5 44 35
6 54 45
7 65 55
Then I pass the data-frame into another function to aggregate:
grouped = df.groupby(['Year', 'Month', 'OccID', 'ageGrp', 'Sex', \
'Race', 'Hisp', 'Educ'],
as_index = False)
final = grouped.aggregate(numpy.sum)
If I change the ages to age groups via this manner it works perfectly:
df['ageGrp'] = 1
df.loc[(df.Age >= 14) & (df.Age <= 18), 'ageGrp'] = 1  # Age 14 - 18
df.loc[(df.Age >= 19) & (df.Age <= 21), 'ageGrp'] = 2  # Age 19 - 21
df.loc[(df.Age >= 22) & (df.Age <= 24), 'ageGrp'] = 3  # Age 22 - 24
df.loc[(df.Age >= 25) & (df.Age <= 34), 'ageGrp'] = 4  # Age 25 - 34
df.loc[(df.Age >= 35) & (df.Age <= 44), 'ageGrp'] = 5  # Age 35 - 44
df.loc[(df.Age >= 45) & (df.Age <= 54), 'ageGrp'] = 6  # Age 45 - 54
df.loc[(df.Age >= 55) & (df.Age <= 64), 'ageGrp'] = 7  # Age 55 - 64
df.loc[df.Age >= 65, 'ageGrp'] = 8                     # Age 65+
I would prefer to do this on the fly, importing the definition table and using pandas.cut, instead of being hard-coded.
Thank you in advance.
Here is, perhaps, a work-around.
Consider the following example which replicates the symptom you describe:
import numpy as np
import pandas as pd
np.random.seed(2015)
defT = pd.DataFrame({'AgeGrp': [1, 2, 3, 4, 5, 6, 7],
'MaxAge': [18, 21, 24, 34, 44, 54, 65],
'MinAge': [14, 19, 22, 25, 35, 45, 55]})
cutoff = np.hstack([np.array(defT['MinAge'][0]), defT['MaxAge'].values])
labels = defT['AgeGrp']
N = 50
df = pd.DataFrame(np.random.randint(100, size=(N,2)), columns=['Age', 'Year'])
df['ageGrp'] = pd.cut(df['Age'], bins=cutoff, labels=labels, include_lowest=True)
grouped = df.groupby(['Year', 'ageGrp'], as_index=False)
final = grouped.agg(np.sum)
print(final)
# Year ageGrp Age
# Year ageGrp
# 3 1 NaN NaN NaN
# 2 NaN NaN NaN
# ...
# 97 1 NaN NaN NaN
# 2 NaN NaN NaN
# [294 rows x 3 columns]
If we change
grouped = df.groupby(['Year', 'ageGrp'], as_index=False)
final = grouped.agg(np.sum)
to
grouped = df.groupby(['Year', 'ageGrp'], as_index=True)
final = grouped.agg(np.sum).dropna()
print(final)
then we obtain:
Age
Year ageGrp
6 7 61
16 4 32
18 1 34
25 3 23
28 5 39
34 7 60
35 5 42
38 4 25
40 2 19
53 7 59
56 4 25
5 35
66 6 54
67 7 55
70 7 56
73 6 51
80 5 36
81 6 46
85 5 38
90 7 58
97 1 18
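As a side note (not part of the answer above): because pandas.cut returns a Categorical column, groupby by default produces every category combination, which is where the all-NaN rows come from. In newer pandas (0.23+), passing observed=True restricts the result to combinations that actually occur; a minimal sketch:
# Only group keys that actually appear in the data are kept
grouped = df.groupby(['Year', 'ageGrp'], observed=True, as_index=False)
final = grouped.agg({'Age': 'sum'})
print(final)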
