For the DataFrame below, I need to create a new column 'unit_count', which is 'unit' / 'count' for each year and month. However, because each year/month combination appears on more than one row, for each entry I only want to use the 'count' value that comes from the option 'B' row of that month.
key UID count month option unit year
0 1 100 1 A 10 2015
1 1 200 1 B 20 2015
2 1 300 2 A 30 2015
3 1 400 2 B 40 2015
Essentially, I need a function that does the following:
unit_count = df.unit / df.count
for each value of 'unit', but using only the 'count' value of option 'B' in that given 'month'.
So the end result would look like the table below, where unit_count is the number of units divided by the count of option 'B' for that month.
key UID count month option unit year unit_count
0 1 100 1 A 10 2015 0.05
1 1 200 1 B 20 2015 0.10
2 1 300 2 A 30 2015 0.075
3 1 400 2 B 40 2015 0.10
Here is the code I used to create the original DataFrame:
df = pd.DataFrame({'UID': [1, 1, 1, 1],
                   'year': [2015, 2015, 2015, 2015],
                   'month': [1, 1, 2, 2],
                   'option': ['A', 'B', 'A', 'B'],
                   'unit': [10, 20, 30, 40],
                   'count': [100, 200, 300, 400]})
You can first create NaN where option is not 'B', and then divide by the back-filled values.
Note: the DataFrame has to be sorted by year, month and option first, so that the 'B' row is the last row within each year/month group.
#if necessary in real data
#df.sort_values(['year','month', 'option'], inplace=True)
df['unit_count'] = df.loc[df.option=='B', 'count']
print (df)
UID count month option unit year unit_count
0 1 100 1 A 10 2015 NaN
1 1 200 1 B 20 2015 200.0
2 1 300 2 A 30 2015 NaN
3 1 400 2 B 40 2015 400.0
df['unit_count'] = df.unit.div(df['unit_count'].bfill())
print (df)
UID count month option unit year unit_count
0 1 100 1 A 10 2015 0.050
1 1 200 1 B 20 2015 0.100
2 1 300 2 A 30 2015 0.075
3 1 400 2 B 40 2015 0.100
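If the rows might not be sorted, an alternative is to take the 'count' from the 'B' rows and merge it back on year and month (a minimal sketch, not from the original answer; b_count is just a helper column name, and it assumes at most one 'B' row per year/month):

# counts taken from the option 'B' rows, one per (year, month)
b_counts = (df.loc[df['option'] == 'B', ['year', 'month', 'count']]
              .rename(columns={'count': 'b_count'}))

out = df.merge(b_counts, on=['year', 'month'], how='left')
out['unit_count'] = out['unit'] / out['b_count']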
I have a Pandas DataFrame that looks like this:
index ID value_1 value_2
0     1  200     126
1     1  200     127
2     1  200     128.1
3     1  200     125.7
4     2  300.1   85
5     2  289.4   0
6     2  0       76.9
7     2  199.7   0
My aim is to find, within each ID group (1 and 2 in this example), the row that has the maximum value in the value_1 column. The second condition: if there are multiple maximum values in a group, the row with the maximum value in column value_2 should be taken.
So the target table should look like this:
index ID value_1 value_2
0     1  200     128.1
1     2  300.1   85
Use DataFrame.sort_values by all 3 columns and then DataFrame.drop_duplicates:
df1 = (df.sort_values(['ID', 'value_1', 'value_2'], ascending=[True, False, False])
         .drop_duplicates('ID'))
print (df1)
ID value_1 value_2
2 1 200.0 128.1
4 2 300.1 85.0
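The same result can also be obtained by taking the first row of each group after sorting (a sketch of the same idea, not from the original answer):

df1 = (df.sort_values(['value_1', 'value_2'], ascending=False)
         .groupby('ID', sort=False)
         .head(1)
         .sort_index())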
I have a DF with 50-ish columns and duplicate IDs. The section I'm interested in looks roughly like this:
ID Value year
0 3 200 1995
1 3 100 2001
2 4 300 1995
3 4 250 2000
The first entry of each ID is always for 1995; the second entry corresponds to a ValuedFrom column (the second entry is the retirement year of each object, and so its last value in most cases). I'd like to merge all three of these columns so that I end up with two, like so:
ID Value1995 ValueRetired
0 3 200 100
1 4 300 250
Any ideas on how I might do this?
General solution:
print (df)
ID year Value
1 3 2003 95
2 3 1995 200
2 3 2001 100
3 4 1995 300
4 4 2000 250
5 4 2004 150
6 5 2000 201
7 5 1995 202 <- remove this row with 1995, because it is the last value of group 5; if we selected the next row it would be in another group
8 6 2000 203
9 6 2000 204
First select the indices of the 1995 rows (excluding rows that are last in their group), then add the rows immediately after them:
idx = df.index[(df['year'] == 1995) & (df.groupby('ID').cumcount(ascending=False) != 0)]
idx2 = df.index.intersection(idx + 1).union(idx)
df = df.loc[idx2]
print (df)
ID year Value ValuedFrom
2 3 1995 200 1995
2 3 2001 100 2001
3 4 1995 300 1995
4 4 2000 250 2000
Detail:
print (df.groupby('ID').cumcount(ascending=False))
1 2
2 1
2 0
3 2
4 1
5 0
6 1
7 0
8 1
9 0
dtype: int64
Then change the values of the year column and reshape with unstack:
df['year'] = np.where(df['year'] == 1995, 'Value1995', 'ValueRetired')
df = df.set_index(['ID', 'year'])['Value'].unstack().reset_index().rename_axis(None, axis=1)
print (df)
ID Value1995 ValueRetired
0 3 200 100
1 4 300 250
You can map the year values to labels in a new column, then use pd.DataFrame.pivot:
df['YearType'] = np.where(df['year'] == 1995, 'Value1995', 'ValueRetired')
res = df.pivot(index='ID', columns='YearType', values='Value')
print(res)
YearType Value1995 ValueRetired
ID
3 200 100
4 300 250
5 150 95
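One caveat (an added note, not from the original answer): pd.DataFrame.pivot raises an error if an (ID, YearType) pair appears more than once, e.g. two non-1995 rows for the same ID. In that case pivot_table with an aggregation such as 'last' can be used instead, a sketch:

res = (df.pivot_table(index='ID', columns='YearType', values='Value', aggfunc='last')
         .reset_index()
         .rename_axis(None, axis=1))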
I have the following code which produces a data frame showing me a per month and per year average sold price. I would like to add to this a total row per year and a total row per pid (person). Sample code and data:
import pandas as pd
from io import StringIO
s = StringIO("""pid,year,month,price
1,2017,4,2000
1,2017,4,2900
1,2018,4,2000
1,2018,4,2300
1,2018,5,2000
1,2018,5,1990
1,2018,6,2200
1,2018,6,2400
1,2018,6,2250
1,2018,7,2150
""")
df = pd.read_csv(s)
maths = {'price': 'mean'}
gb = df.groupby(['pid','year','month'])
counts = gb.size().to_frame(name='n')
out = counts.join(gb.agg(maths)).reset_index()
print(out)
Which yields:
pid year month n price
0 1 2017 4 2 2450.000000
1 1 2018 4 2 2150.000000
2 1 2018 5 2 1995.000000
3 1 2018 6 3 2283.333333
4 1 2018 7 1 2150.000000
I would like the additional per-year rows to look like:
pid year month n price
0 1 2017 all 2 2450.000000
0 1 2018 all 8 2161.000000
And then the per pid rollup to look like:
pid year month n price
0 1 all all 10 2218.000000
I'm having trouble cleanly grouping/aggregating those last two frames, where I essentially want an 'all' bucket for the year and month values, and then combining the three data frames into one that I can write to CSV or a database table.
Using pd.concat
df1=df.groupby(['pid','year','month']).price.agg(['size','mean']).reset_index()
df2=df.groupby(['pid','year']).price.agg(['size','mean']).assign(month='all').reset_index()
df3=df.groupby(['pid']).price.agg(['size','mean']).assign(**{'month':'all','year':'all'}).reset_index()
pd.concat([df1,df2,df3])
Out[484]:
mean month pid size year
0 2450.000000 4 1 2 2017
1 2150.000000 4 1 2 2018
2 1995.000000 5 1 2 2018
3 2283.333333 6 1 3 2018
4 2150.000000 7 1 1 2018
0 2450.000000 all 1 2 2017
1 2161.250000 all 1 8 2018
0 2219.000000 all 1 10 all
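To get the combined frame back to the question's column names and order before writing it out, a sketch (the file name is only a placeholder):

combined = (pd.concat([df1, df2, df3], ignore_index=True, sort=False)
              .rename(columns={'size': 'n', 'mean': 'price'})
              [['pid', 'year', 'month', 'n', 'price']])
combined.to_csv('rollup.csv', index=False)  # or load into a database table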
I have the following df: a visitor can make multiple visits, and the number of page views is recorded for each visit.
df = pd.DataFrame({'visitor_id':[1,1,2,1],'visit_id':[1,2,1,3], 'page_views':[10,20,30,40]})
page_views visit_id visitor_id
0 10 1 1
1 20 2 1
2 30 1 2
3 40 3 1
What I need is to create an additional column called weight, which will diminish with a certain parameter. For example, if this parameter is 1/2, the newest visit has a weight of 1, 2nd newest visit a weight of 1/2, 3rd is 1/4 and so on.
E.g. I want my dataframe to look like:
page_views visit_id visitor_id weight
0 10 1(oldest) 1 0.25
1 20 2 1 0.5
2 30 1(newest) 2 1
3 40 3(newest) 1 1
Then I will be able to aggregate using the weight, e.g. df.groupby(['visitor_id']).weight.sum(), to get weighted page views per visitor.
Doesn't work as expected; with the data below the weights come out as only 0 and 1:
df = pd.DataFrame({'visitor_id':[1,1,2,2,1,1],'visit_id':[5,6,1,2,7,8], 'page_views':[10,20,30,30,40,50]})
df['New']=df.groupby('visitor_id').visit_id.transform('max') - df.visit_id
df['weight'] = pd.Series([1/2]*len(df)).pow(df.New.values)
df
page_views visit_id visitor_id New weight
0 10 5 1 3 0
1 20 6 1 2 0
2 30 1 2 1 0
3 30 2 2 0 1
4 40 7 1 1 0
5 50 8 1 0 1
Is this what you need?
df.groupby('visitor_id').visit_id.apply(lambda x : 1*1/2**(max(x)-x))
Out[1349]:
0 0.25
1 0.50
2 1.00
3 1.00
Name: visit_id, dtype: float64
Maybe try this
df['New']=df.groupby('visitor_id').visit_id.transform('max')-df.visit_id
pd.Series([1/2]*len(df)).pow(df.New.values)
Out[45]:
0 0.25
1 0.50
2 1.00
3 1.00
Name: New, dtype: float64
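Putting the pieces together: the zero weights in the earlier attempt are likely a Python 2 artifact (1/2 is integer division and equals 0), so use a float decay factor. A sketch that also computes weighted page views per visitor (my reading of the goal, not code from the answers above):

import pandas as pd

df = pd.DataFrame({'visitor_id': [1, 1, 2, 1],
                   'visit_id': [1, 2, 1, 3],
                   'page_views': [10, 20, 30, 40]})

decay = 0.5  # weight shrinks by this factor for each older visit
# number of newer visits the same visitor has after this one
age = df.groupby('visitor_id')['visit_id'].transform('max') - df['visit_id']
df['weight'] = decay ** age

# weighted page views per visitor (page_views times weight, summed)
weighted = (df['page_views'] * df['weight']).groupby(df['visitor_id']).sum()
print(weighted)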
I am trying to model different scenarios for groups of assets in future years. This is something I have accomplished very tediously in Excel, but want to leverage the large database I have built with Pandas.
Example:
annual_group_cost = 0.02
df1:
year group x_count y_count value
2018 a 2 5 109000
2019 a 0 4 nan
2020 a 3 0 nan
2018 b 0 0 55000
2019 b 1 0 nan
2020 b 1 0 nan
2018 c 5 1 500000
2019 c 3 0 nan
2020 c 2 5 nan
df2:
group x_benefit y_cost individual_avg starting_value
a 0.2 0.72 1000 109000
b 0.15 0.75 20000 55000
c 0.15 0.70 20000 500000
I would like to update the values in df1, by taking the previous year's value (or starting value) and adding the x benefit, y cost, and annual cost. I am assuming this will take a function to accomplish, but I don't know of an efficient way to handle it.
The final output I would like to have is:
df1:
year group x_count y_count value
2018 a 2 5 103620
2019 a 0 4 98667.3
2020 a 3 0 97294.248
2018 b 0 0 53900
2019 b 1 0 56822
2020 b 1 0 59685.56
2018 c 5 1 495000
2019 c 3 0 497100
2020 c 2 5 420158
I achieved this by using:
starting_value-(starting_value*annual_group_cost)+(x_count*(individual_avg*x_benefit))-(y_count*(individual_avg*y_cost))
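For the first row of group a, for example, that is 109000 - (109000 * 0.02) + 2 * (1000 * 0.2) - 5 * (1000 * 0.72) = 106820 + 400 - 3600 = 103620, which matches the first value in the desired output.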
Since each new value depends on the previously calculated one, this will need to involve a loop (even if it happens behind the scenes, e.g. via apply):
for i in range(1, len(df1)):
    if np.isnan(df1.loc[i, 'value']):
        df1.loc[i, 'value'] = df1.loc[i-1, 'value']  # your logic here
You can merge the two tables together and then apply the formula to the resulting Series:
hold = df1.merge(df2, on=['group']).fillna(0)
x = hold.x_count * (hold.individual_avg * hold.x_benefit)
y = hold.y_count * (hold.individual_avg * hold.y_cost)
for year in hold.year.unique():
    start = hold.loc[hold.year == year, 'starting_value']
    hold.loc[hold.year == year, 'value'] = start - (start * annual_group_cost) + x - y
    if year != hold.year.max():
        hold.loc[hold.year == year + 1, 'starting_value'] = hold.loc[hold.year == year, 'value'].values
hold.drop(['x_benefit', 'y_cost', 'individual_avg', 'starting_value'], axis=1)
Will give you
year group x_count y_count value
0 2018 a 2 5 103620.0
1 2019 a 0 4 98667.6
2 2020 a 3 0 97294.25
3 2018 b 0 0 53900.0
4 2019 b 1 0 55822.0
5 2020 b 1 0 57705.56
6 2018 c 5 1 491000.0
7 2019 c 3 0 490180.0
8 2020 c 2 5 416376.4
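As a further note (not part of the answers above): because the recurrence is linear, value_t = (1 - annual_group_cost) * value_{t-1} + (x benefit - y cost), it can also be unrolled per group with cumulative sums instead of a year loop. A sketch, assuming df1 is ordered by group and year and df2 is the lookup table from the question:

hold = df1.merge(df2, on='group').sort_values(['group', 'year'])

# per-row net change from x benefits and y costs
d = (hold['x_count'] * hold['individual_avg'] * hold['x_benefit']
     - hold['y_count'] * hold['individual_avg'] * hold['y_cost'])
keep = 1 - annual_group_cost

# years elapsed within each group: 1, 2, 3, ...
t = hold.groupby('group').cumcount() + 1
factor = keep ** t

# closed form of value_t = keep * value_{t-1} + d_t, starting from starting_value
hold['value'] = factor * (hold['starting_value'] + (d / factor).groupby(hold['group']).cumsum())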