Edit: Added defT
Does using pandas.cut change the structure of a pandas.DataFrame?
I am using pandas.cut in the following manner to map single-year ages to age groups and then aggregate afterwards. However, the aggregation does not work: I end up with NaN in all columns that are being aggregated. Here is my code:
cutoff = numpy.hstack([numpy.array(defT.MinAge[0]), defT.MaxAge.values])
labels = defT.AgeGrp
df['ageGrp'] = pandas.cut(df.Age,
                          bins=cutoff,
                          labels=labels,
                          include_lowest=True)
Here is defT:
AgeGrp MaxAge MinAge
1 18 14
2 21 19
3 24 22
4 34 25
5 44 35
6 54 45
7 65 55
Then I pass the data-frame into another function to aggregate:
grouped = df.groupby(['Year', 'Month', 'OccID', 'ageGrp', 'Sex',
                      'Race', 'Hisp', 'Educ'],
                     as_index=False)
final = grouped.aggregate(numpy.sum)
If I change the ages to age groups via this manner it works perfectly:
df['ageGrp'] = 1
df.loc[(df.Age >= 14) & (df.Age <= 18), 'ageGrp'] = 1  # Age 14 - 18
df.loc[(df.Age >= 19) & (df.Age <= 21), 'ageGrp'] = 2  # Age 19 - 21
df.loc[(df.Age >= 22) & (df.Age <= 24), 'ageGrp'] = 3  # Age 22 - 24
df.loc[(df.Age >= 25) & (df.Age <= 34), 'ageGrp'] = 4  # Age 25 - 34
df.loc[(df.Age >= 35) & (df.Age <= 44), 'ageGrp'] = 5  # Age 35 - 44
df.loc[(df.Age >= 45) & (df.Age <= 54), 'ageGrp'] = 6  # Age 45 - 54
df.loc[(df.Age >= 55) & (df.Age <= 64), 'ageGrp'] = 7  # Age 55 - 64
df.loc[df.Age >= 65, 'ageGrp'] = 8                     # Age 65+
I would prefer to do this on the fly, importing the definition table and using pandas.cut, instead of hard-coding the ranges.
Thank you in advance.
Here is, perhaps, a work-around.
Consider the following example which replicates the symptom you describe:
import numpy as np
import pandas as pd
np.random.seed(2015)
defT = pd.DataFrame({'AgeGrp': [1, 2, 3, 4, 5, 6, 7],
'MaxAge': [18, 21, 24, 34, 44, 54, 65],
'MinAge': [14, 19, 22, 25, 35, 45, 55]})
cutoff = np.hstack([np.array(defT['MinAge'][0]), defT['MaxAge'].values])
labels = defT['AgeGrp']
N = 50
df = pd.DataFrame(np.random.randint(100, size=(N,2)), columns=['Age', 'Year'])
df['ageGrp'] = pd.cut(df['Age'], bins=cutoff, labels=labels, include_lowest=True)
grouped = df.groupby(['Year', 'ageGrp'], as_index=False)
final = grouped.agg(np.sum)
print(final)
# Year ageGrp Age
# Year ageGrp
# 3 1 NaN NaN NaN
# 2 NaN NaN NaN
# ...
# 97 1 NaN NaN NaN
# 2 NaN NaN NaN
# [294 rows x 3 columns]
If we change
grouped = df.groupby(['Year', 'ageGrp'], as_index=False)
final = grouped.agg(np.sum)
to
grouped = df.groupby(['Year', 'ageGrp'], as_index=True)
final = grouped.agg(np.sum).dropna()
print(final)
then we obtain:
Age
Year ageGrp
6 7 61
16 4 32
18 1 34
25 3 23
28 5 39
34 7 60
35 5 42
38 4 25
40 2 19
53 7 59
56 4 25
5 35
66 6 54
67 7 55
70 7 56
73 6 51
80 5 36
81 6 46
85 5 38
90 7 58
97 1 18
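On newer pandas versions (0.23+, where groupby accepts the observed keyword), another way to avoid the all-NaN cartesian product is to keep only the observed category combinations; pd.cut makes ageGrp categorical, which is what produces the unobserved groups in the first place. A minimal sketch, assuming the same df as above:
# observed=True keeps only the (Year, ageGrp) pairs that actually occur in df,
# so the empty categorical combinations never show up as NaN rows.
grouped = df.groupby(['Year', 'ageGrp'], as_index=False, observed=True)
final = grouped.agg('sum')
print(final)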
Related
I want to create a dataframe with two columns: an 'id' column which repeats the ids 1-100 three times, and an 'age' column where I randomly sample ages 0-14 17% of the time, ages 15-64 65% of the time, and ages 65-100 18% of the time.
Example DF:
id age
1 21
1 21
1 21
2 45
2 45
2 45
3 64
3 64
3 64
Code I have so far:
N = 100
R = 3
d = {'id': np.repeat(np.arange(1, N + 1), R)}
df = pd.DataFrame(d)
I'm stuck on how to simulate the age though.
How can I do this?
You can use numpy.random.randint for your specific ranges and counts:
# randint's upper bound is exclusive, so use 15/65/101 to include ages 14, 64 and 100
df['ages'] = np.repeat(np.concatenate([np.random.randint(0, 15, 17),
                                       np.random.randint(15, 65, 65),
                                       np.random.randint(65, 101, 18)]), R)
print(df)
If needed, the concatenated array can additionally be shuffled with np.random.shuffle (before the ages are repeated with np.repeat):
ages = np.concatenate([np.random.randint(0, 15, 17),
                       np.random.randint(15, 65, 65),
                       np.random.randint(65, 101, 18)])
np.random.shuffle(ages)
df['ages'] = np.repeat(ages, R)
id ages
0 1 11
1 1 11
2 1 11
3 2 3
4 2 3
5 2 3
6 3 12
7 3 12
8 3 12
9 4 8
10 4 8
11 4 8
12 5 10
13 5 10
14 5 10
.. ... ...
285 96 70
286 96 70
287 96 70
288 97 83
289 97 83
290 97 83
291 98 70
292 98 70
293 98 70
294 99 98
295 99 98
296 99 98
297 100 92
298 100 92
299 100 92
I suggest this method:
import pandas as pd
import numpy as np
ids = np.repeat(range(1, 101), 3)
age_choices = [(np.arange(0, 15), 0.17),
               (np.arange(15, 65), 0.65),
               (np.arange(65, 101), 0.18)]
ages = np.concatenate([np.random.choice(choice[0], size=int(len(ids) * choice[1]), replace=True)
                       for choice in age_choices])
df = pd.DataFrame({'id': ids, 'age': ages})
print(df.head(30))
which gives
id age
0 1 2
1 1 13
2 1 8
3 2 14
4 2 0
5 2 14
6 3 7
7 3 6
8 3 9
9 4 13
10 4 9
11 4 6
12 5 7
13 5 6
14 5 12
15 6 12
16 6 4
17 6 2
18 7 0
19 7 10
20 7 4
21 8 10
22 8 8
23 8 1
24 9 10
25 9 5
26 9 13
27 10 8
28 10 13
29 10 4
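One caveat worth noting: this draws a fresh age for every row, so the three rows of a given id generally will not share one age as in the example DF. If that matters, here is a small sketch (my own variation, reusing the age_choices list from above and applying the percentages to the 100 ids) that draws one age per id and then repeats it:
# one age per id: 17, 65 and 18 draws from the three groups, then each repeated 3 times
ages = np.concatenate([np.random.choice(grp, size=cnt, replace=True)
                       for (grp, _), cnt in zip(age_choices, [17, 65, 18])])
df = pd.DataFrame({'id': ids, 'age': np.repeat(ages, 3)})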
Maybe something like:
import numpy as np
import pandas as pd
N = 100
R = 3
ids = np.arange(1, N + 1)
# Assuming max age of 99
possible_ages = np.arange(100)
sizes = np.array([15, 50, 35])   # number of ages in 0-14, 15-64 and 65-99
percentages = np.array([17, 65, 18])
ages = np.random.choice(possible_ages, size=N, p=np.repeat(percentages / sizes / 100, sizes))
df = pd.DataFrame({
"id": np.repeat(ids, R),
"age": np.repeat(ages, R)
})
Alternatively, you could sample the age group using your specified percentages first, then sample uniformly from within the obtained group, as sketched below.
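A minimal sketch of that two-step idea (the bounds list, probs and rng names here are my own assumptions, not from the question): first draw an age group per id with the stated probabilities, then draw an age uniformly within the chosen group, and repeat each age R times so all three rows of an id share it.
import numpy as np
import pandas as pd
N, R = 100, 3
rng = np.random.default_rng(0)
# (low, high) bounds per group, high inclusive -- taken from the question
bounds = [(0, 14), (15, 64), (65, 100)]
probs = [0.17, 0.65, 0.18]
# step 1: pick an age group for every id with the stated probabilities
group_idx = rng.choice(len(bounds), size=N, p=probs)
# step 2: sample uniformly within each chosen group (rng.integers' high is exclusive, hence +1)
ages = np.array([rng.integers(bounds[g][0], bounds[g][1] + 1) for g in group_idx])
df = pd.DataFrame({'id': np.repeat(np.arange(1, N + 1), R),
                   'age': np.repeat(ages, R)})
print(df.head(9))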
Economy  Year  Indicator1  Indicator2  Indicator3  Indicator4  ...
UK       1     23          45          56          78
UK       2     24          87          32          42
UK       3     22          87          32          42
UK       4     2           87          32          42
FR       ...   ...         ...         ...         ...
This is my data, which extends further and is held as a DataFrame. I want to swap the header (the indicators) with the Year column, which seems like a pivot. There are hundreds of indicators and 20 years.
Use DataFrame.melt with DataFrame.pivot:
df = (df.melt(['Economy', 'Year'], var_name='Ind')
        .pivot(index=['Economy', 'Ind'], columns='Year', values='value')
        .reset_index()
        .rename_axis(None, axis=1))
print (df)
Economy Ind 1 2 3 4
0 UK Indicator1 23 24 22 2
1 UK Indicator2 45 87 87 87
2 UK Indicator3 56 32 32 32
3 UK Indicator4 78 42 42 42
Another option is to set the Year column as the index and then use transpose.
Consider the code below:
import pandas as pd
df = pd.DataFrame(columns=['Economy', 'Year', 'Indicator1', 'Indicator2', 'Indicator3', 'Indicator4'],
data=[['UK', 1, 23, 45, 56, 78],['UK', 2, 24, 87, 32, 42],['UK', 3, 22, 87, 32, 42],['UK', 4, 2, 87, 32, 42],
['FR', 1, 22, 33, 11, 35]])
# Make Year column as index
df = df.set_index('Year')
# Transpose columns to rows and vice-versa
df = df.transpose()
print(df)
gives you
Year 1 2 3 4 1
Economy UK UK UK UK FR
Indicator1 23 24 22 2 22
Indicator2 45 87 87 87 33
Indicator3 56 32 32 32 11
Indicator4 78 42 42 42 35
You can use transpose like this:
df = df.set_index('Year')
df = df.transpose()
print (df)
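Note that with several economies the plain transpose produces repeated year columns (one block per economy), as in the output above. If that is undesirable, a small variation, sketched here assuming the same df, is to put both Economy and Year into the index before transposing, so the columns become an (Economy, Year) MultiIndex:
# Both Economy and Year end up as column levels after the transpose;
# the rows are the indicator names.
out = df.set_index(['Economy', 'Year']).T
print(out)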
First, let us create a random dataframe:
df = pd.DataFrame(
{
"A": np.random.randint(0, 70, size=5),
"B": np.random.randint(-10, 35, size=5),
"C": np.random.randint(10, 50, size=5)
}
)
Then, I am using min and max functions to create two additional columns:
df['max'] = df[['A', 'B', 'C']].max(axis=1)
df['min'] = df[['A', 'B', 'C']].min(axis=1)
Output:
A B C max min
0 17 26 31 31 17
1 45 31 17 45 17
2 36 24 31 36 24
3 16 17 24 24 16
4 16 12 23 23 12
What would be the most efficient and elegant way to get the remaining value into the 'mid' column, so that the output looks like this:
A B C max min mid
0 17 26 31 31 17 26
1 45 31 17 45 17 31
2 36 24 31 36 24 31
3 16 17 24 24 16 17
4 16 12 23 23 12 16
I am looking for a vectorized solution. I was able to achieve this using conditions:
conditions = [((df['A'] > df['B']) & (df['A'] < df['C']) | (df['A'] > df['C']) & (df['A'] < df['B'])),
              ((df['B'] > df['A']) & (df['B'] < df['C']) | (df['B'] > df['C']) & (df['B'] < df['A'])),
              ((df['C'] > df['A']) & (df['C'] < df['B']) | (df['C'] > df['B']) & (df['C'] < df['A']))]
choices = [df['A'], df['B'], df['C']]
df['mid'] = np.select(conditions, choices, default=0)
However, I think there is more elegant solution for that.
Should you use median?
df[["A","B","C"]].median(axis=1)
By the way, instead of running the aggregations one by one, you can do everything in one go as follows:
df.join(df.agg([min, max, 'median'], axis=1))
Output:
A B C min max median
0 2 22 38 2.0 38.0 22.0
1 29 15 40 15.0 40.0 29.0
2 48 -5 17 -5.0 48.0 17.0
3 17 18 43 17.0 43.0 18.0
4 60 -10 39 -10.0 60.0 39.0
The advantage of this is that, in a case like the one you described (i.e. you want to aggregate the entire row), you don't need to specify the names of the columns you want to aggregate. If you start adding columns one aggregation at a time, you need to make sure you don't include the new column in the following aggregation, so you would have to specify the columns you want to aggregate explicitly.
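Since the max and min columns already exist, another vectorized option for the original three-column case is plain arithmetic: the middle value is whatever remains after subtracting the row's max and min from its sum. A minimal sketch, assuming the df with columns A, B, C, max and min from above:
# With exactly three values per row, sum - max - min is the middle one.
df['mid'] = df[['A', 'B', 'C']].sum(axis=1) - df['max'] - df['min']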
I'm struggling to create a new variable in Python based on a condition with a groupby statement. I know there is a new datatable package in Python, but I would like to explore the options below.
In R, I did the following:
require(data.table)
id <- c(1,2,3,4,1,5)
units <- c(34, 31, 16, 32, NA, 35)
unitsnew <- c(23, 11, 21, 22, 27, 11)
df = data.table(id, units, unitsnew)
df[, col:= ifelse(is.na(units), min(unitsnew, na.rm = T), units), by = id]
id units unitsnew col
1 34 23 34
2 31 11 31
3 16 21 16
4 32 22 32
1 NA 27 23
5 35 11 35
In Python,
import pandas as pd
import numpy as np
matrix = [(1, 34, 23),
          (2, 31, 11),
          (3, 16, 21),
          (4, 32, 22),
          (1, np.nan, 27),
          (5, 35, 11)]
df = pd.DataFrame(matrix, columns = ['id', 'units', 'unitsnew'])
df.assign(col = np.where(np.isnan(df.units), np.nanmin(df.unitsnew), df.units))
id units unitsnew col
1 34 23 34
2 31 11 31
3 16 21 16
4 32 22 32
1 NA 27 11
5 35 11 35
I'm not sure where to fit the groupby statement so that the minimum value per id is taken. The desired output is the one generated by the R code. In Python, the NaN value is filled with 11 because I failed to add the groupby statement. How can I solve this? Thanks.
First, let's look at how to get the minimum value per group using groupby + transform('min'):
df.groupby('id')['unitsnew'].transform('min')
output:
0 23
1 11
2 21
3 22
4 23
5 11
Name: unitsnew, dtype: int64
Now we can use where to assign the above value when df['units'].isna():
df['col'] = df['units'].where(df['units'].notna(),
                              df.groupby('id')['unitsnew'].transform('min'))
output:
id units unitsnew col
0 1 34.0 23 34.0
1 2 31.0 11 31.0
2 3 16.0 21 16.0
3 4 32.0 22 32.0
4 1 NaN 27 23.0
5 5 35.0 11 35.0
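An equivalent, slightly more compact variant (a sketch using the same df; fillna only touches the rows where units is NaN) is:
# Fill missing units with the per-id minimum of unitsnew; non-missing units are kept as-is.
df['col'] = df['units'].fillna(df.groupby('id')['unitsnew'].transform('min'))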
The Scenario
I have a dataframe df1 which needs to be split into different dataframes based on a list y_km.
Dataframe df1 holds data as follows:
0 1 2
0 3.000000 4.000000 3.000000
1 3.618555 3.646074 3.923834
2 2.669256 2.769302 2.897346
3 4.340775 4.311200 4.341143
and y_km as [0, 3, 2, 1, 2, 3, 3, 3, 1, 1, 0, 1, 2]
My Snippet
df1 = pd.DataFrame(X)
df1 = df1.iloc[0:5,:10]
cl0 = pd.DataFrame()
cl1 = pd.DataFrame()
cl2 = pd.DataFrame()
cl3 = pd.DataFrame()
y_km = list(y_kmeans)
for i in y_kmeans:
    rows = df1.iloc[i, :]
    if i == 0:
        cl0 = cl0.append(rows, ignore_index=False)
    elif i == 1:
        cl1 = cl1.append(rows, ignore_index=False)
    elif i == 2:
        cl2 = cl2.append(rows, ignore_index=False)
    elif i == 3:
        cl3 = cl3.append(rows, ignore_index=False)
The issue with this is that my clX DataFrames all end up containing the same records as the first one inserted.
You want .groupby:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame(np.random.randint(0, 100, (13, 3)))
In [4]: df
Out[4]:
0 1 2
0 73 85 15
1 4 56 5
2 30 74 1
3 93 16 9
4 94 97 41
5 37 49 63
6 28 66 10
7 74 35 4
8 1 76 65
9 5 79 27
10 54 33 74
11 99 54 46
12 67 28 77
Simply:
In [5]: y_km = [0, 3, 2, 1, 2, 3, 3, 3, 1, 1, 0, 1, 2]
In [6]: dfs = {k:g for k,g in df.groupby(y_km)}
Now, I've gone ahead and put the data frames in a dict, but you can do whatever you want. I advise against using a bunch of separate variables; rather, keep things together in a container of some sort. Note:
In [7]: dfs[0]
Out[7]:
0 1 2
0 73 85 15
10 54 33 74
In [8]: dfs[1]
Out[8]:
0 1 2
3 93 16 9
8 1 76 65
9 5 79 27
11 99 54 46
In [9]: dfs[3]
Out[9]:
0 1 2
1 4 56 5
5 37 49 63
6 28 66 10
7 74 35 4
In [10]: dfs[2]
Out[10]:
0 1 2
2 30 74 1
4 94 97 41
12 67 28 77
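If the original cl0 .. cl3 names are still wanted, they can be pulled back out of the dict; a small sketch, assuming all four labels 0-3 occur in y_km:
In [11]: cl0, cl1, cl2, cl3 = (dfs[k] for k in range(4))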