Python/Pandas - Best way to group by criteria? - python

I have tried to find an answer to my question, but maybe I'm just not applying the solutions correctly to my situation. This is what I created to group some rows in my dataset into income groups. I created 4 new dataframes and then concatenated them after applying an index to each. Is this optimal, or is there a better way to do things?
I should add that my goal is to create a boxplot using these new groups and the boxplot "by=" argument.
df_nonull1 = df_nonull[(df_nonull['mn_earn_wne_p6'] < 20000)]
df_nonull2 = df_nonull[(df_nonull['mn_earn_wne_p6'] >= 20000) & (df_nonull['mn_earn_wne_p6'] < 30000)]
df_nonull3 = df_nonull[(df_nonull['mn_earn_wne_p6'] >= 30000) & (df_nonull['mn_earn_wne_p6'] < 40000)]
df_nonull4 = df_nonull[(df_nonull['mn_earn_wne_p6'] >= 40000)]
df_nonull1['inc_index'] = 1
df_nonull2['inc_index'] = 2
df_nonull3['inc_index'] = 3
df_nonull4['inc_index'] = 4
frames = [df_nonull1,df_nonull2,df_nonull3,df_nonull4]
results = pd.concat(frames)

Edit. As Paul mentioned in the comments, there is a pd.cut function for exactly this sort of thing, which is much more elegant than my original answer.
# equal-width bins
df['inc_index'] = pd.cut(df.A, bins=4, labels=[1, 2, 3, 4])
# custom bin edges
df['inc_index'] = pd.cut(df.A, bins=[0, 20000, 30000, 40000, 50000],
                         labels=[1, 2, 3, 4])
Note that the labels argument is optional. pd.cut produces an ordered categorical Series, so you can sort by the resulting column regardless of labels:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 20, (10, 2)), columns=list('AB'))
df['inc_index'] = pd.cut(df.A, bins=[0, 7, 13, 15, 20])
print(df.sort_values('inc_index'))
which outputs (modulo random numbers)
A B inc_index
6 2 16 (0, 7]
7 5 5 (0, 7]
3 12 6 (7, 13]
4 10 8 (7, 13]
5 9 13 (7, 13]
1 15 10 (13, 15]
2 15 7 (13, 15]
8 15 13 (13, 15]
0 18 10 (15, 20]
9 16 12 (15, 20]
Original solution. This is a generalization of Alexander's answer to variable bucket widths. You can build the inc_index column using Series.apply. For example,
def bucket(v):
    # of course, the thresholds can be arbitrary
    if v < 20000:
        return 1
    if v < 30000:
        return 2
    if v < 40000:
        return 3
    return 4
df['inc_index'] = df.mn_earn_wne_p6.apply(bucket)
or, if you really want to avoid a def,
df['inc_index'] = df.mn_earn_wne_p6.apply(
    lambda v: 1 if v < 20000 else 2 if v < 30000 else 3 if v < 40000 else 4)
Note that if you just want to subdivide the range of mn_earn_wne_p6 into equal buckets, then Alexander's way is much cleaner and faster.
df['inc_index'] = df.mn_earn_wne_p6 // bucket_width
Then, to produce the result you want, you can just sort by this column.
df.sort_values('inc_index')
You can also groupby('inc_index') to aggregate results within each bucket.

If all your values are between 10k and 50k, you can assign your index using integer division (//):
df_nonull['inc_index'] = df_nonull.mn_earn_wne_p6 // 10000
You don't need to break up your dataframes and concatenate them; you just need a way to create your inc_index column from your mn_earn_wne_p6 field.
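Since the stated goal is a boxplot with the by= argument, here is a minimal sketch of how the pd.cut result plugs into that, assuming the df_nonull / mn_earn_wne_p6 names from the question and a hypothetical column some_metric standing in for whatever you want to compare across brackets:
import pandas as pd
import matplotlib.pyplot as plt

# bracket the earnings column; the top bucket is open-ended, as in the question
df_nonull['inc_index'] = pd.cut(df_nonull['mn_earn_wne_p6'],
                                bins=[0, 20000, 30000, 40000, float('inf')],
                                labels=[1, 2, 3, 4])

# 'some_metric' is a hypothetical column name used only for illustration
df_nonull.boxplot(column='some_metric', by='inc_index')
plt.show()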

Related

Is there a more efficient or concise way to divide a df according to a list of indexes?

I'm trying to slice/divide the following dataframe
df = pd.DataFrame(
    {'time': [4, 10, 15, 6, 0, 20, 40, 11, 9, 12, 11, 25],
     'value': [0, 0, 0, 50, 100, 0, 0, 70, 100, 0, 100, 20]}
)
according to a list of indexes to split on :
[5, 7, 9]
The beginning and end of the original dataframe serve as the outer boundaries, so the three given indexes split it into four pieces. I'm trying to get the following four dataframes as a result, each assigned to its own variable:
time value
0 4 0
1 10 0
2 15 0
3 6 50
4 0 100
time value
5 20 0
6 40 0
time value
7 11 70
8 9 100
time value
9 12 0
10 11 100
11 25 20
My current solution gives me a list of dataframes that I could then assign to variables manually by list index, but the code is a bit complex, and I'm wondering if there's a simpler/more efficient way to do this.
indexes = [5, 7, 9]
indexes.insert(0, 0)
indexes.append(df.index[-1] + 1)
i = 0
df_list = []
while i + 1 < len(indexes):
    df_list.append(df.iloc[indexes[i]:indexes[i+1]])
    i += 1
This is all coming out of my attempt to answer this question. I'm sure there's a better approach to that answer, but I did feel like there should be a simpler way to do this kind of slicing than what I thought of.
You can use np.split:
df_list = np.split(df, indexes)
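Since the goal was to have each piece in its own variable, a minimal sketch (the variable names are just illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'time': [4, 10, 15, 6, 0, 20, 40, 11, 9, 12, 11, 25],
     'value': [0, 0, 0, 50, 100, 0, 0, 70, 100, 0, 100, 20]}
)

# np.split splits before each listed index, so [5, 7, 9] yields four pieces:
# rows 0-4, rows 5-6, rows 7-8 and rows 9-11
df_a, df_b, df_c, df_d = np.split(df, [5, 7, 9])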

How to sort & extract values with multiple conditions in R?

I have a basic conditional data extraction issue. I have already written the code in Python. I am learning R, and I would like to replicate the same code in R.
I tried to put conditional arguments using which, but that doesn't seem to work. I am not yet fully versed in R syntax.
I have a dataframe with 2 columns: x and y
The idea is to extract a list of at most 5 x-values (multiplied by 2) corresponding to the largest y-values, subject to the condition that we only select y-values that are at least 0.45 times the peak y-value.
So, the algorithm will have the following steps:
We find the peak value of y: max_y
We define the threshold = 0.45 * max_y
We apply a filter, to get the list of all y-values that are greater than the threshold value: y_filt
We get a list of x-values corresponding to the y-values in step 3: x_filt
If the number of values in x_filt is less than or equal to 5, then our result would be the values in x_filt multiplied by 2
If x_filt has more than 5 values, we only select the 5 values corresponding to the 5 maximum y-values in the list. Then we multiply by 2 to get our result
Python Code
max_y = max(y)
max_x = x[y.argmax()]
print(max_x, max_y)
threshold = 0.45 * max_y
y_filt = y[y > threshold]
x_filt = x[y > threshold]
if len(y_filt) > 4:
    n_highest = 5
else:
    n_highest = len(y_filt)
y_filt_highest = y_filt.argsort()[-n_highest:][::-1]
result = [x_filt[i]*2 for i in range(len(x_filt)) if i in y_filt_highest]
For Example Data-set
x y
1 20
2 7
3 5
4 11
5 0
6 8
7 3
8 10
9 2
10 6
11 15
12 18
13 0
14 1
15 12
The above code will give the following results
max_y = 20
max_x = 1
threshold = 9
y_filt = [20, 11, 10, 15, 18, 12]
x_filt = [1, 4, 8, 11, 12, 15]
n_highest = 5
y_filt_highest = [20, 11, 15, 18, 12]
result = [2, 8, 22, 24, 30]
I wish to do the same in R.
One of the reasons R is so powerful and easy to use for statistical work is that the built-in data.frame is foundational. Using one here simplifies things:
# Create a data.frame with the example data from the question
df <- data.frame(x = 1:15,
                 y = c(20, 7, 5, 11, 0, 8, 3, 10, 2, 6, 15, 18, 0, 1, 12))
# Refer to columns with the $ notation
max_y <- max(df$y)
max_x <- df$x[which(df$y == max_y)]
# If you want to print both values, you need to create a list with c()
print(c(max_x, max_y))
# But you could also just call the values directly, as in python
max_x
max_y
# Calculate a threshold and then create a filtered data.frame
threshold <- 0.45 * max_y
df_filt <- df[which(df$y > threshold), ]
df_filt <- df_filt[order(-df_filt$y), ]
if (nrow(df_filt) > 5) {
  df_filt <- df_filt[1:5, ]
}
# Calculate the result
result <- df_filt$x * 2
# Alternatively, you may want the result to be part of your data.frame
df_filt$result <- df_filt$x*2
# Should show identical results
max_y
max_x
threshold
df_filt # Probably don't want to print a df if it is large
result
Of course if you really need separate vectors for y_filt and x_filt, you could create them easily after the fact:
y_filt <- df_filt$y
x_filt <- df_filt$x
Note that, unlike numpy.argmax, which(df$y == max(df$y)) will return all matching indices if your maximum is not unique.

split dataframe values into a specified number of groups and apply function - pandas

df=pd.DataFrame([1,4,1,3,2,8,3,6,3,7,3,1,2,9])
I'd like to split df into a specified number of groups and sum all elements in each group. For example, dividing df into 4 groups
1,4,1,3 2,8,3,6 3,7,3,1 2,9
would result in
9
19
14
11
I could do df.groupby(np.arange(len(df))//4).sum(), but this won't work for larger dataframes
For example
df1=pd.DataFrame([1,4,1,3,2,8,3,6,3,7,3,1,2,9,1,5,3,4])
df1.groupby(np.arange(len(df1))//4).sum()
creates 5 groups instead of 4
You can use numpy.array_split:
df=pd.DataFrame([1,4,1,3,2,8,3,6,3,7,3,1,2,9,1,5,3,4])
a = pd.Series([x.values.sum() for x in np.array_split(df, 4)])
print (a)
0 11
1 27
2 15
3 13
dtype: int64
Solution with concat and sum:
a = pd.concat(np.array_split(df, 4), keys=np.arange(4)).sum(level=0)
print (a)
0
0 11
1 27
2 15
3 13
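Note that in newer pandas releases the level argument to sum has been removed; if that applies to you, the equivalent of the concat solution would be (a sketch, assuming df as defined above):
a = pd.concat(np.array_split(df, 4), keys=np.arange(4)).groupby(level=0).sum()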
Say you have this data frame:
df = pd.DataFrame([1,4,1,3,2,8,3,6,3,7,3,1,2,9])
You can achieve it using a list comprehension and loc:
group_size = 4
[df.loc[i:i+group_size-1].values.sum() for i in range(0, len(df), group_size)]
Output:
[9, 19, 14, 11]
Looking at the comments, I think you can use some explicit Python code when the "usual" pandas functions can't fulfill your needs.
So:
import pandas as pd

def get_sum(a, chunks):
    for k in range(0, len(a), chunks):
        yield a[k:k+chunks].values.sum()

df = pd.DataFrame([1,4,1,3,2,8,3,6,3,7,3,1,2,9])
group_sums = list(get_sum(df, 4))
print(group_sums)
Output:
[9, 19, 14, 11]

Create dataframe column based on other column

I have a dataframe with columns [id, type, income] and want to add an additional column called incomebracket based on income. Does anyone have any suggestions?
Ideally I would create the new incomebracket column based on a series of intervals. ie:
incomebracket = 1 if 100000 < income < 150000
So far I know how to create a blank dataframe column: df['incomebracket'], but I can't figure out the rest.
Any suggestions?
Cheers
Try this
df['incomebracket'] = 0  # default
df.loc[(df.income >= 100000) & (df.income < 150000), 'incomebracket'] = 1
My preferred way is using numpy where
import numpy as np
df['incomebracket'] = np.where((df.income >= 100000) & (df.income < 150000), 1, 0)
You might be interested in pd.cut:
>>> df = pd.DataFrame({"income": np.random.uniform(0, 10**6, 10)})
>>> df["incomebracket"] = pd.cut(df.income, np.linspace(0, 10**6, 11))
>>> df
income incomebracket
0 474229.041695 (400000, 500000]
1 128577.241314 (100000, 200000]
2 254345.417166 (200000, 300000]
3 622104.725105 (600000, 700000]
4 93779.964789 (0, 100000]
5 865556.464985 (800000, 900000]
6 304711.799685 (300000, 400000]
7 601910.710932 (600000, 700000]
8 229606.880350 (200000, 300000]
9 49889.911661 (0, 100000]
[10 rows x 2 columns]
See also pd.qcut.
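Since pd.qcut is mentioned above: it works like pd.cut but picks the bin edges from quantiles of the data, so each bracket ends up with roughly the same number of rows. A minimal sketch on the same toy data:
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": np.random.uniform(0, 10**6, 10)})
# four quantile-based brackets with roughly equal counts per bracket
df["incomebracket"] = pd.qcut(df.income, 4, labels=[1, 2, 3, 4])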

Most efficient way to sum huge 2D NumPy array, grouped by ID column?

I have a massive data array (500k rows) that looks like:
id value score
1 20 20
1 10 30
1 15 0
2 12 4
2 3 8
2 56 9
3 6 18
...
As you can see, there is a non-unique ID column to the left, and various scores in the 3rd column.
I'm looking to quickly add up all of the scores, grouped by IDs. In SQL this would look like SELECT sum(score) FROM table GROUP BY id
With NumPy I've tried iterating through each ID, truncating the table by each ID, and then summing the score up for that table.
table_trunc = table[(table == id).any(1)]
score = sum(table_trunc[:,2])
Unfortunately I'm finding the first command to be dog-slow. Is there any more efficient way to do this?
You can use np.bincount():
import numpy as np
ids = [1,1,1,2,2,2,3]
data = [20,30,0,4,8,9,18]
print(np.bincount(ids, weights=data))
The output is [ 0. 50. 21. 18.], which means the sum for id==0 is 0, the sum for id==1 is 50, and so on.
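Applied to the table from the question, this would look roughly like the sketch below, assuming the data is a NumPy array called table with the id in column 0 and the score in column 2, and that the ids are small non-negative integers:
import numpy as np

# array shaped like the question's data: id, value, score
table = np.array([[1, 20, 20],
                  [1, 10, 30],
                  [1, 15,  0],
                  [2, 12,  4],
                  [2,  3,  8],
                  [2, 56,  9],
                  [3,  6, 18]])

# sums[i] is the total score for id == i
sums = np.bincount(table[:, 0].astype(int), weights=table[:, 2])
print(sums)  # [ 0. 50. 21. 18.]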
I noticed the numpy tag, but in case you don't mind using pandas (or if you read in these data using this module), this task becomes a one-liner:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,2,2,3], 'score': [20,30,0,4,8,9,18]})
So your dataframe would look like this:
id score
0 1 20
1 1 30
2 1 0
3 2 4
4 2 8
5 2 9
6 3 18
Now you can use the functions groupby() and sum():
df.groupby(['id'], sort=False).sum()
which gives you the desired output:
score
id
1 50
2 21
3 18
By default the result would be sorted by group key, so I use the flag sort=False, which might improve speed for huge dataframes.
You can try using boolean operations:
import numpy as np

ids = np.array([1, 1, 1, 2, 2, 2, 3])
data = np.array([20, 30, 0, 4, 8, 9, 18])
[((ids == i) * data).sum() for i in np.unique(ids)]
This may be a bit more effective than using np.any, but it will clearly have trouble if you have a very large number of unique ids along with a large overall data table.
If you're looking only for the sum, you probably want to go with bincount. If you also need other grouping operations like product, mean, std, etc., have a look at https://github.com/ml31415/numpy-groupies . It offers the fastest Python/NumPy grouping operations around; see the speed comparison there.
Your sum operation there would look like:
res = aggregate(id, score)
The numpy_indexed package has vectorized functionality to perform this operation efficiently, in addition to many related operations of this kind:
import numpy_indexed as npi
npi.group_by(id).sum(score)
You can use a for loop and numba:
from numba import njit
import numpy as np

@njit
def wbcnt(b, w, k):
    bins = np.arange(k)
    bins = bins * 0
    for i in range(len(b)):
        bins[b[i]] += w[i]
    return bins
Using @HYRY's variables:
ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]
Then:
wbcnt(ids, data, 4)
array([ 0, 50, 21, 18])
Timing
%timeit wbcnt(ids, data, 4)
%timeit np.bincount(ids, weights=data)
1000000 loops, best of 3: 1.99 µs per loop
100000 loops, best of 3: 2.57 µs per loop
Maybe using itertools.groupby, you can group on the ID and then iterate over the grouped data.
(The data must be sorted according to the grouping key, in this case the ID.)
>>> import itertools
>>> data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
>>> groups = itertools.groupby(data, lambda x: x[0])
>>> for i in groups:
...     for y in i:
...         if isinstance(y, int):
...             print(y)
...         else:
...             for p in y:
...                 print('-', p)
Output:
1
- (1, 20, 20)
- (1, 10, 30)
- (1, 15, 0)
2
- (2, 12, 4)
- (2, 3, 0)
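The snippet above only prints the grouped rows. To actually sum the score (the third element) per ID with itertools.groupby, a minimal sketch along the same lines:
import itertools

data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]

# the data must already be sorted by the key (the ID in position 0)
sums = {key: sum(row[2] for row in rows)
        for key, rows in itertools.groupby(data, key=lambda x: x[0])}
print(sums)  # {1: 50, 2: 4}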
