I am still at the beginning of my Python career and I am trying to add a BMI column to a DataFrame, calculated from two other columns. My code does not work yet, and maybe someone can help me. My data quite often has missing values, where I just want "NaN", and I think that is why my code fails.
df = pd.DataFrame({'gender': ['m', 'w', 'm', 'm'],
                   'bodyheight': [180, 170, 185, 'NaN'],
                   'bodyweight': [75, 59, 83, 90]},
                  columns=['gender', 'bodyheight', 'bodyweight'])
df.apply(lambda x: (x.bodyweight/(x.bodyheight**2)), axis=1)
That's because you have the string 'NaN' in there. You could replace it with an actual NaN value and then use vectorized division (it also looks like you forgot to divide the height by 100 to convert centimeters to meters):
import numpy as np

df = df.replace('NaN', np.nan)  # turn the string 'NaN' into a real missing value
df['BMI'] = df['bodyweight'] / df['bodyheight'].div(100).pow(2)  # cm -> m, square, divide
Output:
gender bodyheight bodyweight BMI
0 m 180.0 75 23.148148
1 w 170.0 59 20.415225
2 m 185.0 83 24.251278
3 m NaN 90 NaN
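If the column can contain other non-numeric junk besides the literal string 'NaN', a slightly more general sketch (assuming heights in centimeters, as above) is to coerce the column to numeric first, which turns anything unparseable into a real NaN:
# coerce to numeric: any non-numeric entry (like the string 'NaN') becomes NaN
df['bodyheight'] = pd.to_numeric(df['bodyheight'], errors='coerce')
df['BMI'] = df['bodyweight'] / df['bodyheight'].div(100).pow(2)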
Say I have a df that looks like this:
name day var_A var_B
0 Pete Wed 4 5
1 Luck Thu 1 10
2 Pete Sun 10 10
And I want to sum var_A and var_B for every name/person and then divide that sum by the number of occurrences of that name/person to get an average.
Let's take Pete for example. Sum his variables (in this case, (4+10) + (5+10) = 29) and divide this sum by the occurrences of Pete in the df (29/2 = 14.5). The "day" column would be eliminated; there would be only one column for the name and another for the average.
Would look like this:
>>> df.method().method()
name avg
0 Pete 14.5
1 Luck 11.0
I've been trying to do this using groupby and other methods, but I eventually got stuck.
Any help would be appreciated.
I came up with
df.groupby('name')[['var_A', 'var_B']].apply(lambda g: g.stack().sum()/len(g)).rename('avg').reset_index()
which produces the correct result, but I'm not sure it's the most elegant way.
pandas' groupby is a lazy expression, and as such it is reusable:
# create group
group = df.drop(columns="day").groupby("name")
# compute output
group.sum().sum(axis=1) / group.size()
name
Luck 11.0
Pete 14.5
dtype: float64
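Equivalently, since the desired number is just the per-name mean of the row totals, a sketch of the same computation (using a hypothetical helper column named total) is:
# row-wise total of the two variables, then its mean per name
df.assign(total=df['var_A'] + df['var_B']).groupby('name')['total'].mean().reset_index(name='avg')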
TeamA TeamB TeamC
12 17 19
13 20 21
14 21 26
15 22 15
difference = numpy.abs(data['TeamA'] - data['TeamB'])
teamC = data['TeamC']
df1 = pd.DataFrame(difference)
df1.columns = ['diff']
df2 = pd.DataFrame(teamC)
correlation = df1.corrwith(df2,axis=0)
I am looking to return the correlation between (the absolute points difference between Team A and Team B) and the number of points of Team C. However, my code does not return a number. Any suggestions?
corrwith needs a Series here rather than a DataFrame. When two DataFrames are passed, pandas correlates pairs of columns with matching names, and since df1 ('diff') and df2 ('TeamC') share no column names, every result is NaN.
You should instead be doing:
df1.corrwith(df2["TeamC"])
Output:
diff 0.18221
dtype: float64
This answer is just an extension of this thread:
pandas.DataFrame corrwith() method
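Since each side has only one column here anyway, a minimal alternative sketch is to skip corrwith entirely and correlate the two Series directly:
# correlation between the absolute points difference and Team C's points
data['TeamA'].sub(data['TeamB']).abs().corr(data['TeamC'])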
I know there are tons of groupby-filter questions about pandas, but I've gone through a number of them and they don't have what I need.
Anyway here's what I have for the dataframe df:
user1 user2 date quantity
-----------------------------
Alice Bob 2018-05-21 100
Alice Bob 2018-05-19 20
Alice Carol 2018-01-01 1000
Bob Carol 2018-02-01 100
I want to calculate a function (let's say some function func) of the quantity for a given user1-user2 pair for weekdays only.
So far what I have are:
df['day'] = df['date'].dt.weekday
df.groupby(['user1','user2']).filter(lambda x: (x.day < 5).any() )
But I don't get what I expect. Apparently, filter keeps every row of any pair where at least one day entry is < 5. What I need, though, are only the rows where the day column is less than 5 within each particular user1-user2 pair.
One straightforward solution is to filter your dataframe before you perform the groupby:
res = df[df['date'].dt.weekday < 5].groupby(...)
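For instance, if func were simply a sum of quantity (a hypothetical stand-in for whatever func actually is), the full expression might look like:
# keep weekday rows only, then aggregate quantity per user1/user2 pair
res = df[df['date'].dt.weekday < 5].groupby(['user1', 'user2'])['quantity'].sum()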
I am still struggling to get really familiar with the pandas groupby operations. When passing a function to agg, what if the aggregation function needs to consider values in columns other than the one being aggregated?
Consider the following example data frame, which lists sales of products by two salesmen:
import datetime
import numpy as np
import pandas as pd

DateList = np.array([(datetime.date.today() - datetime.timedelta(7)) + datetime.timedelta(days=x) for x in [1, 2, 2, 3, 4, 4, 5]] +
                    [(datetime.date.today() - datetime.timedelta(7)) + datetime.timedelta(days=x) for x in [1, 1, 2, 3, 4, 5, 5]])
Names = np.array(['Joe' for x in range(7)] + ['John' for x in range(7)])
Product = np.array(['Product1', 'Product1', 'Product2', 'Product2', 'Product2', 'Product3', 'Product3',
                    'Product1', 'Product2', 'Product2', 'Product2', 'Product2', 'Product2', 'Product3'])
Volume = np.array([100, 0, 150, 175, 15, 120, 150, 75, 0, 115, 130, 135, 10, 120])
Prices = {'Product1': 25.99, 'Product2': 13.99, 'Product3': 8.99}

SalesDF = pd.DataFrame({'Date': DateList, 'Seller': Names, 'Product': Product, 'Volume': Volume})
SalesDF.sort_values(['Date', 'Seller'], inplace=True)
SalesDF['Prices'] = SalesDF.Product.map(Prices)
On some days each seller sells more than one item. Suppose you wanted to aggregate the data set into a single observation per day/seller, keeping the product that sold the most volume. For the Volume measure this would be simple: just pass a max function to agg. But deciding which Product and Prices values should remain means determining which volume was highest and returning the values that correspond to that maximum.
I am able to get the result I want by using the index values in the column that is passed to the function when agg is called and referencing the underlying data frame:
def AggFunc(x, df, col1):
    # Index values of the rows in the group passed as x
    IndexVals = list(x.index)
    # Values of col1 in the underlying data frame at those index positions
    ColList = list(df[col1][IndexVals])
    # Find the maximum of col1 and its position within the group
    MaxVal = np.max(ColList)
    MaxValIndex = ColList.index(MaxVal)
    # Return the entry of x at the position of the max of col1
    return list(x)[MaxValIndex]
FunctionDict = {'Product': lambda x: AggFunc(x, SalesDF, 'Volume'),
                'Volume': 'max',
                'Prices': lambda x: AggFunc(x, SalesDF, 'Volume')}
SalesDF.groupby(['Date', 'Seller'], as_index=False).agg(FunctionDict)
But I'm wondering if there is a better way where I can pass 'Volume' as an argument to the function that aggregates Product without having to get the index values and create lists from the data in the underlying dataframe? Something tells me no, as agg passes each column as a series to the aggregation function, rather than the dataframe itself.
Any ideas?
Thanks
Maybe extracting the right indices first using .idxmax would be simpler?
>>> grouped = SalesDF.groupby(["Date", "Seller"])["Volume"]
>>> max_idx = grouped.apply(pd.Series.idxmax)
>>> SalesDF.loc[max_idx]
Date Product Seller Volume Prices
0 2013-11-04 Product1 Joe 100 25.99
7 2013-11-04 Product1 John 75 25.99
2 2013-11-05 Product2 Joe 150 13.99
9 2013-11-05 Product2 John 115 13.99
3 2013-11-06 Product2 Joe 175 13.99
10 2013-11-06 Product2 John 130 13.99
5 2013-11-07 Product3 Joe 120 8.99
11 2013-11-07 Product2 John 135 13.99
6 2013-11-08 Product3 Joe 150 8.99
13 2013-11-08 Product3 John 120 8.99
idxmax gives the index of the first occurrence of the maximum value. If you want to keep multiple products when they tie for the maximum volume, it'd be a little different, something more like
>>> max_vols = SalesDF.groupby(["Date", "Seller"])["Volume"].transform('max')
>>> SalesDF[SalesDF.Volume == max_vols]
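As a side note (worth checking against your pandas version), idxmax is also available directly on the groupby object, so the apply in the first snippet can likely be dropped:
>>> SalesDF.loc[SalesDF.groupby(["Date", "Seller"])["Volume"].idxmax()]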