Using condition-based logic in group-wise transformations - python

I have a dataframe that has a year column ('Year') and a dollar value column. I want to group by the year, then for each row determine whether the row is at least 20% above the group's median or at least 20% below it.
I tried the following:
def f(x):
    if x >= 1.2 * np.median(x):
        return 'H'
    elif x <= 0.8 * np.median(x):
        return 'L'

transformed = df.groupby('Year').transform(f)
But I get an error saying the truth value of an array is ambiguous. This makes me think Python is treating x as the array of values on both sides of the comparison, whereas in other transformation functions it knows that x on the left-hand side is the row element and, where x is wrapped in an aggregation on the right-hand side, x is the array.
Any idea on how to do this?

I think what you want is something like this:
import numpy as np
import pandas as pd
from numpy.random import poisson, randint

n = 20
dr = randint(2000, 2014, size=n)
df = pd.DataFrame({'year': dr,
                   'dollar': np.hstack((poisson(1000, size=n // 2),
                                         poisson(100000, size=n // 2)))})

def med_replace(x):
    res = pd.Series(index=x.index, name='med_cmp', dtype=object)
    med = x.dollar.median()
    upper = 1.2 * med
    lower = 0.8 * med
    res[x.dollar >= upper] = 'H'
    res[x.dollar <= lower] = 'L'
    res[(x.dollar > lower) & (x.dollar < upper)] = 'N'
    return x.join(res)

df.groupby('year').apply(med_replace)
yielding:
dollar year med_cmp
0 1016 2004 N
1 956 2002 L
2 1044 2010 N
3 985 2008 L
4 1038 2001 L
5 997 2001 L
6 1015 2001 L
7 971 2012 L
8 1017 2013 N
9 1040 2010 N
10 99760 2001 H
11 99835 2001 H
12 100017 2012 H
13 99532 2001 H
14 100311 2011 N
15 100344 2002 H
16 100209 2007 N
17 99988 2008 H
18 100204 2007 N
19 100996 2005 N
A numpy ndarray is not a valid argument to bool unless its size is 0 or 1. This means that you cannot evaluate its "truthiness" in an if statement unless it has 0 or 1 elements. This is why you're getting the error you reported.
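For instance, a minimal illustration of the error (a sketch using a plain array, not the original data):
import numpy as np

x = np.array([1, 2, 3])
x >= 1.2 * np.median(x)      # array([False, False,  True]) -- an array, not a single bool
if x >= 1.2 * np.median(x):  # raises ValueError: the truth value of an array ... is ambiguous
    pass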

Related

Data cleansing - change the data in one column if criteria on 2 columns are met

I have columns with vehicle data. For vehicles more than 1 year old with mileage less than 100, I want to replace the mileage with 1000.
My attempts -
mileage_corr = vehicle_data_all.loc[(vehicle_data_all["mileage"] < 100) & (vehicle_data_all["year"] < 2020)], 1000
Error - AttributeError: 'tuple' object has no attribute
and
mileage_corr = vehicle_data_all.loc[(vehicle_data_all["mileage"] < 100) & (vehicle_data_all["year"] < 2020)]
mileage_corr['mileage'].where(mileage_corr['mileage'] <= 100, 1000, inplace=True)
error -
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
return self._where(
Without complete information, assuming your vehicle_data_all DataFrame looks something like this,
year mileage
0 2019 192
1 2014 78
2 2010 38
3 2018 119
4 2019 4
5 2012 122
6 2005 50
7 2015 69
8 2004 56
9 2003 194
Pandas has a way of assigning based on a filter result. This is referred to as setting values.
df.loc[condition, "field_to_change"] = desired_change
Applied to your dataframe, it would look something like this:
vehicle_data_all.loc[((vehicle_data_all["mileage"] < 100) & (vehicle_data_all["year"] < 2020)), "mileage"] = 1000
This was my result,
year mileage
0 2019 192
1 2014 1000
2 2010 1000
3 2018 119
4 2019 1000
5 2012 122
6 2005 1000
7 2015 1000
8 2004 1000
9 2003 194
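As a side note, an equivalent approach (a sketch, assuming the same column names) uses numpy.where to build the corrected column in one step:
import numpy as np

# set mileage to 1000 where both conditions hold, otherwise keep the original value
vehicle_data_all["mileage"] = np.where(
    (vehicle_data_all["mileage"] < 100) & (vehicle_data_all["year"] < 2020),
    1000,
    vehicle_data_all["mileage"],
)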

I am trying to build a summary data frame using totals from another data frame

So I have all the data points formatted into a table where I can now start to summarise my findings.
home_goals away_goals result home_points away_points
2006-2007 1 1 D 1 1
2006-2007 1 1 D 1 1
2006-2007 2 1 H 3 0
2006-2007 2 1 H 3 0
2006-2007 3 0 H 3 0
... ... ... ... ... ... ... ...
2019 - 2020 0 2 A 0 3
2019 - 2020 5 0 H 3 0
2019 - 2020 1 3 A 0 3
2019 - 2020 3 1 H 3 0
2019 - 2020 1 1 D 1 1
My objective is to collate this into a new data frame that summarises each season under the following columns:
Season_breakdown = pd.DataFrame(columns=['Season', 'Matches Played', 'Home Wins',
                                         'Draws', 'Away Wins', 'Home Points',
                                         'Away Points'])
My current solution is to run something like this:
index_count = pd.Index(data_master.index).value_counts()
index_count
That outputs:
2007-2008 380
2013-2014 380
2010-2011 380
2017-2018 380
2015-2016 380
2016-2017 380
2009-2010 380
2014-2015 380
2012-2013 380
2006-2007 380
2018-2019 380
2011-2012 380
2008-2009 380
2019 - 2020 P1 288
2019 - 2020 P2 92
and then hardcode the results into a new data variable which I can incorporate into my Season_breakdown, repeating similar steps to collate home wins, draws, away wins, home points and away points by season.
The aim is to have something along the lines of:
Season MatchesPlayed HomeWins Draws AwayWins HomePoints AwayPoints
2006-2007 380 (sum H 6/7) (sum D 6/7) (sum A 6/7) (sum h_points)(sum a_points)
2007-2008 380 (sum H 7/8) (sum D 7/8) (sum A 7/8) (sum h_points)(sum a_points)
2008-2009 380 (sum H 8/9) (sum D 8/9) (sum A 8/9) (sum h_points)(sum a_points)
2009-2010 380 (sum H 9/10)(sum D 9/10)(sum A 9/10)(sum h_points)(sum a_points)
Etc.
I feel like there is a far more robust way to approach this and was hoping for some insight.
Thanks
You have multi-level aggregations. Points are aggregated at the season level, while Wins/Draws are aggregated at the combined season/result level. So one option is to aggregate the results in multiple steps and then concatenate/join them:
season_points = df.groupby(level=[0]).agg({'home_points': 'sum', 'away_points': 'sum'})
season_count = df.groupby(level=[0]).result.count().rename('MatchesPlayed').to_frame()
season_results = pd.crosstab(df.index, df.result).rename(
    columns={'A': 'AwayWins', 'D': 'Draws', 'H': 'HomeWins'})
season_results.index.name = None
agg_df = pd.concat([season_count, season_results, season_points], axis=1) \
    .rename(columns={'home_points': 'HomePoints', 'away_points': 'AwayPoints'})
print(agg_df)
# MatchesPlayed AwayWins Draws HomeWins HomePoints AwayPoints
#2006-2007 5 0 2 3 11 2
#2019 - 2020 5 2 1 2 7 7
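For what it's worth, a single-pass variant using named aggregation (pandas 0.25+) would also work; this is a sketch assuming the season is the DataFrame index, as above:
summary = (
    df.assign(HomeWins=df.result.eq('H'),
              Draws=df.result.eq('D'),
              AwayWins=df.result.eq('A'))
      .groupby(level=0)
      .agg(MatchesPlayed=('result', 'size'),
           HomeWins=('HomeWins', 'sum'),
           Draws=('Draws', 'sum'),
           AwayWins=('AwayWins', 'sum'),
           HomePoints=('home_points', 'sum'),
           AwayPoints=('away_points', 'sum'))
)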

function calculation based on condition [duplicate]

This question already has an answer here:
Pandas Rolling Python to create new Columns
(1 answer)
Closed 2 years ago.
I am practising and new to creating functions in Python with conditions:
Create a function that takes as input an integer number (for example m, where m is between 2 and n, and n is the maximum number of rows). This function calculates 'Sum A' and 'Sum B' from the last m days. There will be no value for the first m days.
The original data:
Day     V         TP            A            B        Sum A        Sum B
  1  3509      47.81
  2  4862  48.406667  235353.2133
  3  1810      49.26      89160.6
  4  3824  49.263333  188382.9867
  5  2209  47.386667               104677.1467
  6  4558  45.573333               207723.2533
  7  3832  44.396667               170128.0267
  8  3778      43.75                  165287.5
  9  1005      44.64      44863.2
 10  4047      43.76                 177096.72
 11  2201  44.383333  97687.71667               655447.7167  824912.6467
 12  2507  45.156667  113207.7633               533302.2667  824912.6467
 13  4392    44.4333                  195151.2  444141.6667  1020063.847
 14  3497  43.296667               151408.4433    255758.68   1171472.29
 15  1181      43.07                  50865.67    255758.68  1117660.813
 16  1971      42.89                  84536.19    255758.68    994473.75
 17  4994  43.563333  217555.2867               473313.9667  824345.7233
 18  2017  44.816667  90395.21667               563709.1833  659058.2233
 19  2823  44.936667    126856.21               645702.1933  659058.2233
 20  2774      45.13    125190.62               770892.8133  481961.5033
The attempt I have made so far is below, and it shows the error KeyError: 'A':
curret_period = int(input("enter days: "))
sumA = curret_period * ((df["A"] < df["A"]),'')
sumB = curret_period * ((df["B"] >= df["B"]),'')
print(sumA)
print(sumB)
I am wondering whether there is a better way to create the function. I also wonder if something like the below is what I need:
def function_name():
    print()
Expected result when m = 10:
                      A                   B        Sum A        Sum B
 0
 1   235353.21333333332
 2    89160.59999999999
 3   188382.98666666663
 4                        104677.1466666667
 5                       207723.25333333333
 6                       170128.02666666667
 7                                 165287.5
 8   44863.200000000004
 9                                177096.72
10    97687.71666666666                      655447.7167  824912.6467
11   113207.76333333334                      533302.2667  824912.6467
12                                 195151.2  444141.6667  1020063.847
13                        151408.4433333333    255758.68   1171472.29
14                        50865.66999999999    255758.68  1117660.813
15                        84536.19000000002    255758.68    994473.75
16   217555.28666666665                      473313.9667  824345.7233
17    90395.21666666666                      563709.1833  659058.2233
18            126856.21                      645702.1933  659058.2233
19   125190.61999999998                      770892.8133  481961.5033
Any suggestion? Thank you in advance.
You can utilize df.tail() to get the last m rows of the dataframe and then simply sum() each column.
We can also check that m is not greater than the length of the dataframe; even without this check, it would just sum the entire dataframe.
def sumof(df, m):
    if m <= len(df.index):
        rows = df.tail(m)
        print(rows['A'].sum())
        print(rows['B'].sum())
    else:
        print("'m' can not be greater than length of dataframe")
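For reference, a minimal usage sketch (assuming df holds the sample data with columns 'A' and 'B'). The rolling variant at the end, in the spirit of the linked duplicate, produces running 'Sum A'/'Sum B' values for every row rather than a single total; the handling of the first window may need adjusting to match the expected output exactly:
m = 10
sumof(df, m)  # prints the sums of 'A' and 'B' over the last m rows

# rolling alternative (a sketch): running sums over the previous m rows,
# treating missing A/B entries as 0
df['Sum A'] = df['A'].fillna(0).rolling(m).sum()
df['Sum B'] = df['B'].fillna(0).rolling(m).sum()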

How to avoid depreciation going negative?

I am working on a project for my thesis, which has to do with the capitalization of Research & Development (R&D) expenses for a data set of companies that I have.
For those who are not familiar with financial terminology, I am trying to accumulate the values of each year's R&D expenses with the following ones by decaying its value (or "depreciating" it) every time period.
I was able to apply the following code to get the gist of the operation:
df['rd_capital'] = [(df['r&d_exp'].iloc[:i] * (1 - df['dep_rate'].iloc[:i]*np.arange(i)[::-1])).sum() for i in range(1,len(df)+1)]
However, there is a major flaw with this method, which is that it continues to take away the depreciation rate once the value has reached zero, therefore going into negative territory.
For example if we have Apple's R&D expenses for 5 years at a constant depreciation rate of 20%, the code above gives me the following result:
year r&d_exp dep_rate r&d_capital
0 1999 10 0.2 10
1 2000 8 0.2 16
2 2001 12 0.2 24.4
3 2002 7 0.2 25.4
4 2003 15 0.2 33
5 2004 8 0.2 30.6
6 2005 11 0.2 29.6
However, the value for the year 2005 is incorrect as it should be 31.6!
If it was not clear, r&d_capital is retrieved the following way:
2000 = 10*(1-0.2) + 8
2001 = 10*(1-0.4) + 8*(1-0.2) + 12
2002 = 10*(1-0.6) + 8*(1-0.4) + 12*(1-0.2) + 7
2003 = 10*(1-0.8) + 8*(1-0.6) + 12*(1-0.4) + 7*(1-0.2) + 15
the key problem comes here as the code above does the following:
2004 = 10*(1-1) + 8*(1-0.8) + 12*(1-0.6) + 7*(1-0.4) + 15*(1-0.2) + 8
2005 = 10*(1-1.2) + 8*(1-1) + 12*(1-0.8) + 7*(1-0.6) + 15*(1-0.4) + 8*(1-0.2) + 11
Instead it should discard the values once the value reaches zero, just like this:
2004 = 8*(1-0.8) + 12*(1-0.6) + 7*(1-0.4) + 15*(1-0.2) + 8
2005 = 12*(1-0.8) + 7*(1-0.6) + 15*(1-0.4) + 8*(1-0.2) + 11
Thank you in advance for any help that you will give, really appreciate it :)
A possible way would be to compute the residual part for each investment. The assumption is that there is a finite and known number of years after which any investment is fully depreciated. Here I will use 6 years (5 would be enough, but it demonstrates how to avoid negative depreciations):
# cumulated depreciation rates:
cum_rate = pd.DataFrame(index=df.index)
for i in range(2, 7):
    cum_rate['cum_rate' + str(i)] = df['dep_rate'].rolling(i).sum().shift(1 - i)
cum_rate['cum_rate1'] = df['dep_rate']
cum_rate[cum_rate > 1] = 1  # avoid negative rates

# residual values
resid = pd.DataFrame(index=df.index)
for i in range(1, 7):
    resid['r' + str(i)] = (df['r&d_exp'] * (1 - cum_rate['cum_rate' + str(i)])).shift(i)

# compute the capital
df['r&d_capital'] = resid.apply('sum', axis=1) + df['r&d_exp']
It gives as expected:
year r&d_exp dep_rate r&d_capital
0 1999 10 0.2 10.0
1 2000 8 0.2 16.0
2 2001 12 0.2 24.4
3 2002 7 0.2 25.4
4 2003 15 0.2 33.0
5 2004 8 0.2 30.6
6 2005 11 0.2 31.6
You have to keep track of the absolute depreciation and stop depreciating when the asset reaches value zero. Look at the following code:
>>> exp = [10, 8, 12, 7, 15, 8, 11]
>>> dep = [0.2*x for x in exp]
>>> cap = [0]*7
>>> for i in range(7):
...     x = exp[:i+1]
...     for j in range(i):
...         x[j] -= (i-j)*dep[j]
...         x[j] = max(x[j], 0)
...     cap[i] = sum(x)
...
>>> cap
[10, 16.0, 24.4, 25.4, 33.0, 30.599999999999998, 31.6]
>>>
In the for loops I calculate for every year the remaining value of all assets (in variable x). When this reaches zero, I stop depreciating. That is what the statement x[j] = max(x[j], 0) does. The sum of the value of all assets is then stored in cap[i].
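A similar result can also be obtained by keeping the one-liner from the question and clipping each residual value at zero before summing. This is a sketch (not the original answer's code) using the question's column names:
import numpy as np
import pandas as pd

# sample data from the question
df = pd.DataFrame({'year': range(1999, 2006),
                   'r&d_exp': [10, 8, 12, 7, 15, 8, 11],
                   'dep_rate': [0.2] * 7})

# for row i, expense j has depreciated by (i - j) * dep_rate[j];
# clipping the residual at zero drops fully written-off expenses
df['rd_capital'] = [
    (df['r&d_exp'].iloc[:i]
     * (1 - df['dep_rate'].iloc[:i] * np.arange(i)[::-1])).clip(lower=0).sum()
    for i in range(1, len(df) + 1)
]
# df['rd_capital'] -> 10.0, 16.0, 24.4, 25.4, 33.0, 30.6, 31.6 (up to float noise)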

Group rows in a CSV by blocks of 25

I have a csv file with 2 columns, representing a distribution of items per year which looks like this:
A B
1900 10
1901 2
1903 5
1908 8
1910 25
1925 3
1926 4
1928 1
1950 10
etc, about 15000 lines.
When making a distribution diagram based on this data, there are too many points on the axis and it is not very pretty. I want to group rows by blocks of 25 years, so that at the end I would have fewer points on the axis.
So, for example, from 1900 till 1925 I would have a sum of produced items, 1 row in A column and 1 row in B column:
1925 53
1950 15
So far I have only figured out how to convert the data in the CSV file to int:
o = open('/dates_dist.csv', 'rU')
mydata = csv.reader(o)

def int_wrapper(mydata):
    for v in reader:
        yield map(int, v)

reader = int_wrapper(mydata)
Can't find how to do it further...
You could use itertools.groupby:
import itertools as IT
import csv

def int_wrapper(mydata):
    for v in mydata:
        yield map(int, v)

with open('data', 'rU') as o:
    mydata = csv.reader(o)
    header = next(mydata)
    reader = int_wrapper(mydata)
    for key, group in IT.groupby(reader, lambda row: (row[0]-1)//25+1):
        year = key*25
        total = sum(row[1] for row in group)
        print(year, total)
yields
(1900, 10)
(1925, 43)
(1950, 15)
Note that 1900 to 1925 (inclusive) spans 26 years, not 25. So
if you want to group 25 years, given the way you are reporting the totals, you probably want the half-open interval (1900, 1925].
The expression row[0]//25 takes the year and integer divides by 25.
This number will be the same for all numbers in the range [1900, 1925).
To make the range half-open on the left, subtract and add 1: (row[0]-1)//25+1.
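To illustrate the arithmetic with a few sample years (a quick sketch):
# years 1901-1925 share one bucket, 1926-1950 the next (half-open on the left)
[(y - 1) // 25 + 1 for y in (1900, 1901, 1925, 1926, 1950)]
# -> [76, 77, 77, 78, 78]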
Here is my approach. It's definitely not the most engaging Python code, but it could be a way to achieve the desired output.
if __name__ == '__main__':
    o = open('dates_dist.csv', 'rU')
    lines = o.read().split("\n")  # Create a list having each line of the file
    out_dict = {}
    curr_date = 0
    curr_count = 0
    chunk_sz = 25  # years
    if len(lines) > 0:
        line_split = lines[0].split(",")
        start_year = int(line_split[0])
        curr_count = 0
        # Iterate over each line of the file
        for line in lines:
            # Split at comma to get the year and the count.
            # line_split[0] will be the year and line_split[1] will be the count.
            line_split = line.split(",")
            curr_year = int(line_split[0])
            time_delta = curr_year - start_year
            if time_delta < chunk_sz or time_delta == chunk_sz:
                curr_count = curr_count + int(line_split[1])
            else:
                out_dict[start_year + chunk_sz] = curr_count
                start_year = start_year + chunk_sz
                curr_count = int(line_split[1])
            # print curr_year, curr_count
        out_dict[start_year + chunk_sz] = curr_count
    print out_dict
You could create a dummy column and group by it after doing some integer division:
>>> df['temp'] = df['A'] // 25
>>> df
A B temp
0 1900 10 76
1 1901 2 76
2 1903 5 76
3 1908 8 76
4 1910 25 76
5 1925 3 77
6 1926 4 77
7 1928 1 77
8 1950 10 78
>>> df.groupby('temp').sum()
A B
temp
76 9522 50
77 5779 8
78 1950 10
My numbers are slightly different from yours since I am technically grouping from 1900-1924, 1925-1949, and 1950-1974, but the idea is the same.
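If you would rather label each group by the starting year of its 25-year block instead of the raw bucket number, a small variant (a sketch, assuming the same df) multiplies the integer division back out:
# 1900-1924 -> 1900, 1925-1949 -> 1925, 1950-1974 -> 1950
df.groupby(df['A'] // 25 * 25)['B'].sum()
# B sums per block for this sample: 50, 8, 10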
