Create dataframe column based on other column - python

I have a dataframe with columns [id, type, income] and want to add an additional column called incomebracket based on income.
Ideally I would create the new incomebracket column based on a series of intervals, e.g.:
incomebracket = 1 if 100000 < income < 150000
So far I know how to create a blank dataframe column: df['incomebracket'], but I can't figure out the rest.
Any suggestions?
Cheers

Try this
df['incomebracket'] = 0  # default
df.loc[(df.income >= 100000) & (df.income < 150000), 'incomebracket'] = 1
My preferred way is using numpy where
import numpy as np
df['incomebracket'] = np.where((df.income >= 100000) & (df.income < 150000), 1, 0)
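If you need several brackets rather than a single condition, one possibility is np.select, which takes a list of conditions and a matching list of values. This is only a sketch: the 100,000-150,000 bound comes from the question, the 150,000-200,000 bracket is an assumed extra cut-off for illustration.

import numpy as np

# assumed bracket boundaries; only the 100k-150k bound comes from the question
conditions = [
    (df.income >= 100000) & (df.income < 150000),
    (df.income >= 150000) & (df.income < 200000),
]
choices = [1, 2]
df['incomebracket'] = np.select(conditions, choices, default=0)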

You might be interested in pd.cut:
>>> df = pd.DataFrame({"income": np.random.uniform(0, 10**6, 10)})
>>> df["incomebracket"] = pd.cut(df.income, np.linspace(0, 10**6, 11))
>>> df
          income     incomebracket
0  474229.041695  (400000, 500000]
1  128577.241314  (100000, 200000]
2  254345.417166  (200000, 300000]
3  622104.725105  (600000, 700000]
4   93779.964789       (0, 100000]
5  865556.464985  (800000, 900000]
6  304711.799685  (300000, 400000]
7  601910.710932  (600000, 700000]
8  229606.880350  (200000, 300000]
9   49889.911661       (0, 100000]

[10 rows x 2 columns]
See also pd.qcut.
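If you want integer bracket codes like in the question rather than Interval objects, pd.cut also accepts explicit bin edges together with labels=False. A small sketch, assuming bracket boundaries made up for illustration (with these edges, incomes in (100000, 150000] get code 1, as in the question):

import numpy as np
import pandas as pd

edges = [0, 100000, 150000, 200000, np.inf]  # assumed bracket boundaries
# labels=False returns the integer bin index instead of an Interval
df["incomebracket"] = pd.cut(df.income, bins=edges, labels=False)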

Related

Fill a dataframe with Cartesian product of variably shaped input lists

I want to create a script that fills a dataframe with values that are the Cartesian product of parameters I want to vary in a series of experiments.
My first thought was to use the product function of itertools, however it seems to require a fixed set of input lists.
The output I'm looking for can be generated using this sample:
import itertools
import numpy as np
import pandas as pd

cols = ['temperature', 'pressure', 'power']
l1 = [1, 100, 50.0]
l2 = [1000, 10, np.nan]
l3 = [0, 100, np.nan]
data = []
for val in itertools.product(l1, l2, l3):  # use itertools to get the Cartesian product of the lists
    data.append(val)                       # make a list of lists to store each variation
df = pd.DataFrame(data, columns=cols).dropna(0)  # make a dataframe from the list of lists (dropping NaN values)
However, I would like instead to extract the parameters from dataframes of arbitrary shape and then fill up a dataframe with the product, like so (code doesn't work):
data = [{'parameter': 'temperature', 'value1': 1, 'value2': 100, 'value3': 50},
        {'parameter': 'pressure', 'value1': 1000, 'value2': 10},
        {'parameter': 'power', 'value1': 0, 'value2': 100},
        ]
df = pd.DataFrame(data)

l = []
cols = []
for i in range(df.shape[0]):
    l.append(df.iloc[i][1:].to_list())  # store the values of each df row to a separate list
    cols.append(df.iloc[i][0])          # store the first value of the row as column header

data = []
for val in itertools.product(l):  # ask itertools to parse a list of lists
    data.append(val)
df2 = pd.DataFrame(data, columns=cols).dropna(0)
Can you recommend a way to go about this? My goal is creating the final dataframe, so it's not a requirement to use itertools.
Another alternative without product (nothing wrong with product, though) could be to use .join() with how="cross" to produce successive cross-products:
df2 = df.T.rename(columns=df.iloc[:, 0]).drop(df.columns[0])
df2 = (
    df2.iloc[:, [0]]
    .join(df2.iloc[:, [1]], how="cross")
    .join(df2.iloc[:, [2]], how="cross")
    .dropna(axis=0)
)
Result:
   temperature pressure power
0            1     1000     0
1            1     1000   100
3            1       10     0
4            1       10   100
9          100     1000     0
10         100     1000   100
12         100       10     0
13         100       10   100
18        50.0     1000     0
19        50.0     1000   100
21        50.0       10     0
22        50.0       10   100
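If the number of parameter rows isn't fixed at three, the same idea could be generalized by chaining the cross joins with functools.reduce. This is only a sketch under the same transposed layout as above (the name wide is just illustrative, to avoid clobbering df2) and assumes pandas >= 1.2 for how="cross":

from functools import reduce

# same transposed frame as above: one column per parameter
wide = df.T.rename(columns=df.iloc[:, 0]).drop(df.columns[0])

# cross-join the single-column frames pairwise, however many there are
parts = [wide.iloc[:, [i]] for i in range(wide.shape[1])]
result = reduce(lambda left, right: left.join(right, how="cross"), parts).dropna(axis=0)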
A more compact version with product:
from itertools import product

df2 = pd.DataFrame(
    product(*df.set_index("parameter", drop=True).itertuples(index=False)),
    columns=df["parameter"]
).dropna(axis=0)

Pandas groupby aggregation with percentages

I have the following dataframe:
import pandas as pd
import numpy as np
np.random.seed(123)
n = 10
df = pd.DataFrame({"val": np.random.randint(1, 10, n),
                   "cat": np.random.choice(["X", "Y", "Z"], n)})
   val cat
0    3   Z
1    3   X
2    7   Y
3    2   Z
4    4   Y
5    7   X
6    2   X
7    1   X
8    2   X
9    1   Y
I want to know the percentage each category X, Y, and Z has of the entire val column sum. I can aggregate df like this:
total_sum = df.val.sum()
#32
s = df.groupby("cat").val.sum().div(total_sum)*100
#this is the desired result in % of total val
cat
X    46.875    # 15/32
Y    37.500    # 12/32
Z    15.625    #  5/32
Name: val, dtype: float64
However, I find it rather surprising that pandas seemingly does not have a percentage/frequency function, something like df.groupby("cat").val.freq(), analogous to df.groupby("cat").val.sum() or df.groupby("cat").val.mean(). I assumed this is a common operation, and Series.value_counts has implemented it with normalize=True, but for groupby aggregation I cannot find anything similar. Am I missing something here, or is there indeed no out-of-the-box function?
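For reference, one compact way to get the same percentages without the intermediate total_sum variable is to normalize after aggregating. This is just a sketch of the pattern already used above, not a built-in groupby method:

# sum per category, then divide by the overall sum in one chain
pct = df.groupby("cat").val.sum().pipe(lambda s: s / s.sum() * 100)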

How to give certain rows 'points' depending on how much larger that row's column 1 is compared to that row's column 2

I'm looking at creating an algorithm where if the views_per_hour is 2x larger than the average_views_per_hour, I give the channel 5 points; if it is 3x larger I give the row 10 points and if it is 4x larger, I give the row 20 points. I'm not really sure how to go about this and would really appreciate some help.
df = pd.DataFrame({'channel':['channel1','channel2','channel3','channel4'], 'views_per_hour_today':[300,500,2000,100], 'average_views_per_hour':[100,200,200,50],'points': [0,0,0,0] })
df.loc[:, 'average_views_per_hour'] *= 2
df['n=2'] = np.where((df['views_per_hour'] >= df['average_views_per_hour']) , 5, 0)
df.loc[:, 'average_views_per_hour'] *= 3
df['n=3'] = np.where((df['views_per_hour'] >= df['average_views_per_hour']) , 5, 0)
df.loc[:, 'average_views_per_hour'] *= 4
df['n=4'] = np.where((df['views_per_hour'] >= df['average_views_per_hour']) , 10, 0)
I expected to be able to add up the results from columns n=2, n=3, n=4 for each row in the 'Points' column but the columns are always showing either 5 or 10 and never 0 (the code thinks that the views_per_hour is always greater than the average_views_per_hour, even when the average_views_per_hour is multiplied by a large integer.)
There are multiple ways of solving this kind of problem. You can use numpy select, which has more concise syntax, or you can define a function and apply it to the data frame (a sketch of that variant follows the result below).
div = df['views_per_hour_today']/df['average_views_per_hour']
cond = [(div >= 2) & (div < 3), (div >= 3) & (div < 4), (div >= 4) ]
choice = [5, 10, 20]
df['points'] = np.select(cond, choice)
    channel  views_per_hour_today  average_views_per_hour  points
0  channel1                   300                     100      10
1  channel2                   500                     200       5
2  channel3                  2000                     200      20
3  channel4                   100                      50       5
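The apply-based variant mentioned above could look roughly like this. A sketch using the same ratio thresholds; the helper name score_row is just illustrative:

def score_row(row):
    # ratio of today's views to the channel's average
    ratio = row['views_per_hour_today'] / row['average_views_per_hour']
    if ratio >= 4:
        return 20
    if ratio >= 3:
        return 10
    if ratio >= 2:
        return 5
    return 0

df['points'] = df.apply(score_row, axis=1)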

Python/Pandas - Best way to group by criteria?

I have tried to find an answer to my question, but maybe I'm just not applying the solutions correctly to my situation. This is what I created to group some rows in my datasheet into income groups. I created 4 new dataframes and then concatenated them after applying an index to each. Is this optimal or is there a better way to do things?
I should add that my goal is to create a boxplot using these new groups and the boxplot "by=" argument.
df_nonull1 = df_nonull[(df_nonull['mn_earn_wne_p6'] < 20000)]
df_nonull2 = df_nonull[(df_nonull['mn_earn_wne_p6'] >= 20000) & (df_nonull['mn_earn_wne_p6'] < 30000)]
df_nonull3 = df_nonull[(df_nonull['mn_earn_wne_p6'] >= 30000) & (df_nonull['mn_earn_wne_p6'] < 40000)]
df_nonull4 = df_nonull[(df_nonull['mn_earn_wne_p6'] >= 40000)]
df_nonull1['inc_index'] = 1
df_nonull2['inc_index'] = 2
df_nonull3['inc_index'] = 3
df_nonull4['inc_index'] = 4
frames = [df_nonull1,df_nonull2,df_nonull3,df_nonull4]
results = pd.concat(frames)
Edit. As Paul mentioned in the comments, there is a pd.cut function for exactly this sort of thing, which is much more elegant than my original answer.
# equal-width bins
df['inc_index'] = pd.cut(df.A, bins=4, labels=[1, 2, 3, 4])

# custom bin edges
df['inc_index'] = pd.cut(df.A, bins=[0, 20000, 30000, 40000, 50000],
                         labels=[1, 2, 3, 4])
Note that the labels argument is optional. pd.cut produces an ordered categorical Series, so you can sort by the resulting column regardless of labels:
df = pd.DataFrame(np.random.randint(1, 20, (10, 2)), columns=list('AB'))
df['inc_index'] = pd.cut(df.A, bins=[0, 7, 13, 15, 20])
print(df.sort_values('inc_index'))
which outputs (modulo random numbers)
    A   B inc_index
6   2  16    (0, 7]
7   5   5    (0, 7]
3  12   6   (7, 13]
4  10   8   (7, 13]
5   9  13   (7, 13]
1  15  10  (13, 15]
2  15   7  (13, 15]
8  15  13  (13, 15]
0  18  10  (15, 20]
9  16  12  (15, 20]
Original solution. This is a generalization on Alexander's answer to variable bucket widths. You can build the inc_index column using Series.apply. For example,
def bucket(v):
    # of course, the thresholds can be arbitrary
    if v < 20000:
        return 1
    if v < 30000:
        return 2
    if v < 40000:
        return 3
    return 4

df['inc_index'] = df.mn_earn_wne_p6.apply(bucket)
or, if you really want to avoid a def,
df['inc_index'] = df.mn_earn_wne_p6.apply(
    lambda v: 1 if v < 20000 else 2 if v < 30000 else 3 if v < 40000 else 4)
Note that if you just want to subdivide the range of mn_earn_wne_p6 into equal buckets, then Alexander's way is much cleaner and faster.
df['inc_index'] = df.mn_earn_wne_p6 // bucket_width
Then, to produce the result you want, you can just sort by this column.
df.sort_values('inc_index')
You can also groupby('inc_index') to aggregate results within each bucket.
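For example, a quick per-bucket aggregation might look like this (just a sketch; mean earnings is an arbitrary choice of statistic):

# average earnings within each income bucket
df.groupby('inc_index')['mn_earn_wne_p6'].mean()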
If all your values are between 10k and 50k, you can assign your index using integer division (//):
df_nonull['inc_index'] = df_nonull.mn_earn_wne_p6 // 10000
You don't need to break up your dataframes and concatenate them; you need to find a way to create your inc_index from your mn_earn_wne_p6 field.

Most efficient way to sum huge 2D NumPy array, grouped by ID column?

I have a massive data array (500k rows) that looks like:
id value score
1 20 20
1 10 30
1 15 0
2 12 4
2 3 8
2 56 9
3 6 18
...
As you can see, there is a non-unique ID column to the left, and various scores in the 3rd column.
I'm looking to quickly add up all of the scores, grouped by IDs. In SQL this would look like SELECT sum(score) FROM table GROUP BY id
With NumPy I've tried iterating through each ID, truncating the table by each ID, and then summing the score up for that table.
table_trunc = table[(table == id).any(1)]
score = sum(table_trunc[:,2])
Unfortunately I'm finding the first command to be dog-slow. Is there any more efficient way to do this?
You can use np.bincount():
import numpy as np
ids = [1,1,1,2,2,2,3]
data = [20,30,0,4,8,9,18]
print(np.bincount(ids, weights=data))
The output is [ 0. 50. 21. 18.], which means the sum for id==0 is 0, for id==1 it is 50, for id==2 it is 21, and for id==3 it is 18.
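Since bincount also produces a slot for ids that never occur (id==0 here), one way to keep only the ids that are actually present could be the following small sketch on top of the answer above:

sums = np.bincount(ids, weights=data)
present = np.unique(ids)   # ids that actually occur: [1, 2, 3]
result = sums[present]     # array([50., 21., 18.])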
I noticed the numpy tag, but in case you don't mind using pandas (or if you read in these data using this module), this task becomes a one-liner:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,2,2,3], 'score': [20,30,0,4,8,9,18]})
So your dataframe would look like this:
   id  score
0   1     20
1   1     30
2   1      0
3   2      4
4   2      8
5   2      9
6   3     18
Now you can use the functions groupby() and sum():
df.groupby(['id'], sort=False).sum()
which gives you the desired output:
    score
id
1      50
2      21
3      18
By default the groups would be sorted by key, therefore I use the flag sort=False, which might improve speed for huge dataframes.
You can try using boolean operations:
import numpy as np

ids = np.array([1, 1, 1, 2, 2, 2, 3])
data = np.array([20, 30, 0, 4, 8, 9, 18])
[((ids == i) * data).sum() for i in np.unique(ids)]
This may be a bit more effective than using np.any, but will clearly have trouble if you have a very large number of unique ids to go along with large overall size of the data table.
If you're looking only for sum, you probably want to go with bincount. If you also need other grouping operations like product, mean, std, etc., have a look at https://github.com/ml31415/numpy-groupies . It offers the fastest python/numpy grouping operations around; see the speed comparison there.
Your sum operation there would look like:
from numpy_groupies import aggregate

res = aggregate(id, score)  # func defaults to 'sum'
The numpy_indexed package has vectorized functionality to perform this operation efficiently, in addition to many related operations of this kind:
import numpy_indexed as npi
npi.group_by(id).sum(score)
You can use a for loop and numba
import numpy as np
from numba import njit

@njit
def wbcnt(b, w, k):
    bins = np.arange(k)
    bins = bins * 0
    for i in range(len(b)):
        bins[b[i]] += w[i]
    return bins
Using @HYRY's variables
ids = [1, 1, 1, 2, 2, 2, 3]
data = [20, 30, 0, 4, 8, 9, 18]
Then:
wbcnt(ids, data, 4)
array([ 0, 50, 21, 18])
Timing
%timeit wbcnt(ids, data, 4)
%timeit np.bincount(ids, weights=data)
1000000 loops, best of 3: 1.99 µs per loop
100000 loops, best of 3: 2.57 µs per loop
Maybe using itertools.groupby, you can group on the ID and then iterate over the grouped data.
(The data must be sorted according to the group by func, in this case ID)
>>> import itertools
>>> data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
>>> groups = itertools.groupby(data, lambda x: x[0])
>>> for i in groups:
...     for y in i:
...         if isinstance(y, int):
...             print(y)
...         else:
...             for p in y:
...                 print('-', p)
Output:
1
- (1, 20, 20)
- (1, 10, 30)
- (1, 15, 0)
2
- (2, 12, 4)
- (2, 3, 0)
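To actually get the per-id sums the question asks for, the same grouping could be reduced directly. A sketch on the same sample data, where the score is the third element of each tuple:

import itertools

data = [(1, 20, 20), (1, 10, 30), (1, 15, 0), (2, 12, 4), (2, 3, 0)]
sums = {key: sum(row[2] for row in group)
        for key, group in itertools.groupby(data, lambda x: x[0])}
# {1: 50, 2: 4} with this truncated sample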
