Conditional mean over a Pandas DataFrame - python

I have a dataset from which I want a few averages of multiple variables I created.
I started off with:
data2['socialIdeology2'].mean()
data2['econIdeology'].mean()
^ that works perfectly, and gives me the averages I'm looking for.
Now I'm trying to compute a conditional mean, that is, the mean for only a select group within the data set. (I want the ideologies broken down by whom the respondent voted for in the 2016 election.) In Stata, the code would be similar to: mean(variable) if voteChoice == 'Clinton'
Looking into it, I came to the conclusion a conditional mean just isn't a thing (although hopefully I am wrong?), so I was writing my own function for it.
This is me just starting out with a 'mean' function, to create a foundation for a conditional mean function:
def mean():
    sum = 0.0
    count = 0
    for index in range(0, len(data2['socialIdeology2'])):
        sum = sum + (data2['socialIdeology2'][index])
        print(data2['socialIdeology2'][index])
        count = count + 1
    return sum / count
print(mean())
Yet I keep getting 'nan' as the result. Printing data2['socialIdeology2'][index] within the loop prints nan over and over again.
So my question is: if the data stored within the socialIdeology2 variable really is a nan (which I don't understand how it could be), why is it that the .mean() function works with it?
And how can I generate means by category?

Conditional mean is indeed a thing in pandas. You can use DataFrame.groupby():
means = data2.groupby('voteChoice').mean()
or maybe, in your case, the following would be more efficient:
means = data2.groupby('voteChoice')['socialIdeology2'].mean()
to drill down to the mean you're looking for. (The first case will calculate means for all columns.) This is assuming that voteChoice is the name of the column you want to condition on.

If you're only interested in the mean for a single group (e.g. Clinton voters) then you could create a boolean series that is True for members of that group, then use this to index into the rows of the DataFrame before taking the mean:
voted_for_clinton = data2['voteChoice'] == 'Clinton'
mean_for_clinton_voters = data2.loc[voted_for_clinton, 'socialIdeology2'].mean()
If you want to get the means for multiple groups simultaneously then you can use groupby, as in Brad's answer. However, I would do it like this:
means_by_vote_choice = data2.groupby('voteChoice')['socialIdeology2'].mean()
Placing the ['socialIdeology2'] index before the .mean() means that you only compute the mean over the column you're interested in, whereas if you place the indexing expression after the .mean() (i.e. data2.groupby('voteChoice').mean()['socialIdeology2']) this computes the means over all columns and then selects only the 'socialIdeology2' column from the result, which is less efficient.
See the pandas documentation for more info on indexing DataFrames using .loc and on groupby.
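As a minimal sketch of the approaches above, the following uses a small made-up DataFrame (the values are invented for illustration; only the column names come from the question). It also touches on the NaN part of the question: pandas' .mean() skips missing values by default (skipna=True), whereas a hand-written running sum turns into NaN as soon as a single missing value is added to it, which may be why the two disagree.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column names are taken from the question.
data2 = pd.DataFrame({
    "voteChoice": ["Clinton", "Trump", "Clinton", "Trump", "Clinton"],
    "socialIdeology2": [2.0, 6.0, np.nan, 4.0, 4.0],
})

# .mean() ignores NaN by default (skipna=True).
overall_mean = data2["socialIdeology2"].mean()

# With skipna=False a single NaN makes the result NaN,
# which is what a naive running sum would also produce.
mean_with_nan = data2["socialIdeology2"].mean(skipna=False)

# Conditional mean for one group via a boolean mask.
voted_for_clinton = data2["voteChoice"] == "Clinton"
clinton_mean = data2.loc[voted_for_clinton, "socialIdeology2"].mean()

# Conditional means for all groups at once.
means_by_vote_choice = data2.groupby("voteChoice")["socialIdeology2"].mean()

print(overall_mean, mean_with_nan, clinton_mean)
print(means_by_vote_choice)
```

If the column really does contain missing values, data2['socialIdeology2'].isna().sum() reports how many there are.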

Related

How to apply a function pairwise on rows in a series?

I want something like this:
df.groupby("A")["B"].diff()
But instead of diff(), I want to be able to check whether two consecutive rows are different or identical, returning 1 if the current row is different from the previous one and 0 if it is identical.
Moreover, I really would like to use a custom function instead of diff(), so that I can do general pairwise row operations.
I tried using .rolling(2) and .apply() in different places, but I just cannot get it to work.
Edit:
Each row in the dataset is a packet.
The first row in the dataset is the first recorded packet, and the last row is the last recorded packet, i.e., they are ordered by time.
One of the features (columns) is called "ID", and several packets have the same ID.
Another column is called "data"; its values are 64-bit binary values (strings), i.e., 001011010011001.....10010 (length 64).
I want to create two new features (columns):
Compare the "data" field of the current packet with the "data" field of the previous packet with the same ID, and compute:
Whether they are different (1 or 0)
How different they are (a figure between 0 and 1)
Hi, I think it is best to forgo groupby and use shift instead:
equal_index = (df == df.shift(1))[X].all(axis=1)
where X is a list of the columns you want to be identical. Then you can create your own grouper with
my_grouper = (~equal_index).cumsum()
and use it together with agg to aggregate with whatever function you wish
df.groupby(my_grouper).agg({'B':f})
Use DataFrameGroupBy.shift and compare for inequality with Series.ne:
df["dc"] = df.groupby("ID")["data"].shift().ne(df['data']).astype(int)
EDIT: for the correlation between two Series (note that Series.corr returns a single number, not a per-row value) use:
df["dc"] = df['data'].corr(df.groupby("ID")["data"].shift())
OK, I solved it myself with:
import pandas as pd
def create_dc(df: pd.DataFrame):
    # True where a packet's data differs from the previous packet with the same ID
    dc = df.groupby("ID")["data"].apply(lambda x: x != x.shift(1)).astype(int)
    dc.fillna(1, inplace=True)
    df["dc"] = dc
This does what I want.
Thank you @Arnau for inspiring me to use .shift()!
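For the "how different" part of the question, one possible sketch is to pair each row with the previous row of the same ID via shift() and then apply a custom function to each pair. The helper hamming_fraction and the column name how_different are hypothetical names introduced here, the sample data is invented (4-bit strings instead of 64-bit for brevity), and using the fraction of differing bits as the measure is an assumption about what "how different" should mean.

```python
import pandas as pd

def hamming_fraction(current: str, previous: str) -> float:
    """Fraction of positions at which two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(current, previous)) / len(current)

# Hypothetical packets, ordered by time.
df = pd.DataFrame({
    "ID": ["a", "b", "a", "b", "a"],
    "data": ["0000", "1111", "0011", "1111", "0011"],
})

# Previous "data" value within the same ID (missing for the first packet of each ID).
prev = df.groupby("ID")["data"].shift()

# 1 if different from the previous packet with the same ID, else 0.
# The first packet of each ID has no predecessor and is counted as different.
df["dc"] = df["data"].ne(prev).astype(int)

# Custom pairwise function applied to (current, previous) pairs;
# rows without a predecessor get NaN.
df["how_different"] = [
    hamming_fraction(cur, prv) if isinstance(prv, str) else float("nan")
    for cur, prv in zip(df["data"], prev)
]

print(df)
```

Any other two-argument function can be swapped in for hamming_fraction, which is what makes this pattern usable for general pairwise row operations.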

Pandas conditional row values based on an other column

[Image: picture of the dataframe]
Hi! I've been trying to figure out how I could calculate wallet balances of ERC-20 tokens, but can't get this to work. The idea is simple: when the row's value in the "Status" column is "Sending", the value should be negative, and when it is "Receiving", it should be positive. Lastly, I would use groupby and calculate sums by token symbol. The problem is that I can't get the conditional statement to work. What would be a way to do this? I've tried looping over the rows, but that doesn't seem to work.
Assuming that df is the dataframe you presented, it's enough to select the proper slice and multiply its values by -1:
df.loc[df['Status'] == 'Sending', 'Value'] *= -1
And then grouping:
df = df.groupby(['Symbol']).sum().reset_index()
Looping in pandas is generally not a good idea: you can perform these operations in a faster, vectorised manner, so try to avoid it.
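A small end-to-end sketch of the two steps above, using invented transactions (the token symbols and amounts are made up; the column names "Symbol", "Status" and "Value" follow the answer). Selecting the "Value" column before summing is a choice made here so that text columns such as "Status" are not aggregated as well.

```python
import pandas as pd

# Hypothetical transactions.
df = pd.DataFrame({
    "Symbol": ["DAI", "DAI", "USDC", "USDC"],
    "Status": ["Receiving", "Sending", "Receiving", "Sending"],
    "Value": [100.0, 30.0, 50.0, 50.0],
})

# Flip the sign of outgoing transfers in place.
df.loc[df["Status"] == "Sending", "Value"] *= -1

# Net balance per token symbol.
balances = df.groupby("Symbol")["Value"].sum().reset_index()

print(balances)
```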

Why can't I use the groupby function to calculate the average of another column here?

I am trying to find the average CTR for a set of emails which I would like to categorize by the time that they were sent in order to determine if the CTR is affected by the time they were sent. But for some reason, pandas just doesn't want to let me find the mean of the CTR values.
As you'll see below, I have tried using the mean function to find the mean of the CTR for each of the times, but I continually get the error:
DataError: No numeric types to aggregate
This to me would imply that my CTR figures are not integers or floats, but are instead strings. However, though they came in as strings, I have already converted them to floats. I know this too because if I use the sum() function in lieu of the average function, it works just fine.
The line of code is very simple:
df.groupby("TIME SENT", as_index=False)['CTR'].mean()
I can't imagine why the sum function would work and the mean function would fail, especially if the error is the one described above. Anyone got any ideas?
EDIT: Code I used to turn CTR column from string percentage (85.8%) to float:
i = 0
for index, row in df.iterrows():
    df.loc[i, "CTR"] = float(row['CTR'].strip('%'))/100
    i += 1
Link to df.head() : https://ethercalc.org/zw6xmf2c7auw
df['CTR']= (df['CTR'].str.strip('%').astype('float'))/100
The above code strips the % from the CTR column, converts the column to float, and divides by 100. You can then do your groupby.
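A likely explanation for the original error, sketched below with invented data: writing floats into a column of strings one cell at a time leaves the column's dtype as object, so pandas may still treat it as non-numeric, whereas the vectorised conversion produces a real float64 column. The sample values are hypothetical, and the exact behaviour of aggregations on object columns varies between pandas versions.

```python
import pandas as pd

# Hypothetical data; CTR arrives as percentage strings.
df = pd.DataFrame({
    "TIME SENT": ["9am", "9am", "5pm"],
    "CTR": pd.Series(["85.8%", "80.0%", "50.0%"], dtype=object),
})

# Cell-by-cell conversion: the values become floats, but the dtype stays object.
looped = df.copy()
for idx in looped.index:
    looped.loc[idx, "CTR"] = float(looped.loc[idx, "CTR"].strip("%")) / 100
looped_dtype = looped["CTR"].dtype

# Vectorised conversion: the whole column becomes float64.
converted = df.copy()
converted["CTR"] = converted["CTR"].str.strip("%").astype(float) / 100
converted_dtype = converted["CTR"].dtype

mean_ctr = converted.groupby("TIME SENT", as_index=False)["CTR"].mean()

print(looped_dtype, converted_dtype)
print(mean_ctr)
```

Checking df.dtypes before aggregating is a quick way to confirm which situation you are in.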

Pandas dataframe find first and last element given condition and calculate slope

The situation:
I have a pandas dataframe with some data about the production of a product. The product is produced in 3 phases. The phases are not fixed, meaning that the number of cycles (how long each phase lasts) varies. During the production phases, the temperature of the product is measured at each cycle.
Please see the table below:
The problem:
I need to calculate the slope for each cycle of each phase for each product. I also need to add it to the dataframe in a new column called "Slope". The one you can see, highlighted in yellow was added by me manually in an excel file. The real dataset contains hundreds of parameters (not only temperatures) so in reality I need to calculate the slope for many, many columns, therefore I tried to define a function.
My solution is not working at all:
This is the code I tried, but it does not work. I am trying to catch the first and last row for the given product, for the given phase. And then get the temperature data and the difference of these two rows. And this way I could calculate the slope.
This is all I could come up with so far (I created another column called: "Max_cylce_no", this stores the maximum amount of the cycle for each phase):
temp_at_start=-1
def slope(col_name):
    global temp_at_start
    start_cycle_no = 1
    if row["Cycle"]==1:
        temp_at_start =row["Temperature"]
        start_row = df.index(row)
        cycle_numbers = row["Max_cylce_no"]
        last_cycle_row = cycle_numbers + start_row
        last_temp = df.loc[last_cycle_row, "Temperature"]
And the way I would like to apply it:
df.apply(slope("Temperature"), axis=1)
Unfortunately, I get a NameError right away, saying: name 'row' is not defined.
Could you please help me and show me the right direction on how to solve this problem. It gives me a really hard time. :(
Thank you in advance!
I believe you need GroupBy.transform with a function that subtracts the first value from the last and divides by the group length:
f = lambda x: (x.iloc[-1] - x.iloc[0]) / len(x)
df['new'] = df.groupby(['Product_no','Phase_no'])['Temperature'].transform(f)
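To make the transform approach concrete, here is a sketch on a tiny invented dataset (the temperatures are made up; the column names follow the question and the answer, and first_to_last_slope is a hypothetical helper name). transform broadcasts each group's single slope value back onto every row of that group, so the result can be assigned directly as a new column.

```python
import pandas as pd

# Hypothetical measurements: one product, two phases.
df = pd.DataFrame({
    "Product_no": [1, 1, 1, 1, 1],
    "Phase_no": [1, 1, 1, 2, 2],
    "Cycle": [1, 2, 3, 1, 2],
    "Temperature": [20.0, 22.0, 26.0, 30.0, 28.0],
})

def first_to_last_slope(x: pd.Series) -> float:
    """(last value - first value) / number of rows in the group, as in the answer."""
    return (x.iloc[-1] - x.iloc[0]) / len(x)

df["Slope"] = (
    df.groupby(["Product_no", "Phase_no"])["Temperature"]
      .transform(first_to_last_slope)
)

print(df)
```

Whether to divide by the number of cycles, as here, or by the number of intervals between them (len(x) - 1) depends on how you define the slope. The same call can be repeated in a loop over column names to cover many parameters.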

Rolling Standard Deviation in Pandas Returning Zeroes for One Column

Has anyone had issues with rolling standard deviations not working on only one column in a pandas dataframe?
I have a dataframe with a datetime index and associated financial data. When I run df.rolling().std() (pseudocode, see the actual code below), I get correct data for all columns except one. That column returns zeros where there should be standard deviation values. I see the same problem when using .rolling_std(). When I run df.rolling().skew(), all the other columns work and this column gives NaN.
What's throwing me off about this error is that the other columns work correctly and for this column, df.rolling().mean() works. In addition, the column has dtype float64, which shouldn't be a problem. I also checked and don't see missing data. I'm using a rolling window of 30 days and if I try to get the last standard deviation value using series[-30:].std() I get a correct result. So it seems like something specifically about the rolling portion isn't working. I played around with the parameters of .rolling() but couldn't get anything to change.
# combine the return, volume and slope data
raw_factor_data = pd.concat([fut_rets, vol_factors, slope_factors], axis=1)
# create new dataframe for each factor type (mean,
# std dev, skew) and combine
mean_vals = raw_factor_data.rolling(window=past, min_periods=past).mean()
mean_vals.columns = [column + '_mean' for column in list(mean_vals)]
std_vals = raw_factor_data.rolling(window=past, min_periods=past).std()
std_vals.columns = [column + '_std' for column in list(std_vals)]
skew_vals = raw_factor_data.rolling(window=past, min_periods=past).skew()
skew_vals.columns = [column + '_skew' for column in list(skew_vals)]
fact_data = pd.concat([mean_vals, std_vals, skew_vals], axis=1)
The first line combines three dataframes together. Then I create separate dataframes with rolling mean, std and skew (past = 30), and then combine those into a single dataframe.
The name of the column I'm having trouble with is 'TY1_slope'. So I've run some code as follows to see where there is an error.
print(raw_factor_data['TY1_slope'][-30:].std())
print(raw_factor_data['TY1_slope'][-30:].mean())
print(raw_factor_data['TY1_slope'].rolling(window=30, min_periods=30).std())
print(raw_factor_data['TY1_slope'].rolling(window=30, min_periods=30).mean())
The first two lines of code output a correct standard deviation and mean (.08 and .14). However, the third line of code produces zeroes but the fourth line produces accurate mean values (the final values in those series are 0.0 and .14).
If anyone can help with how to look at the .rolling source code that would be helpful too. I'm new to doing that and tried the following, but just got a few lines that didn't seem very helpful.
import inspect
import pandas as pd
print inspect.getsourcelines(pd.rolling_std)
Quoting JohnE's comment since it worked (although still not sure the root cause of the issue). JohnE, feel free to change to an answer and I'll upvote.
shot in the dark, but you could try rolling(30).apply(lambda x: np.std(x, ddof=1)) in case it's some weird syntax bug with rolling + std – JohnE
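A sketch of that workaround on synthetic data (the random series below is invented and merely stands in for 'TY1_slope'). On well-behaved input the built-in rolling std and the np.std-based apply should agree, so comparing the two on the real column is one way to see where they diverge. This does not identify the root cause.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the problem column.
rng = np.random.default_rng(0)
series = pd.Series(rng.normal(loc=0.14, scale=0.08, size=100))

# Built-in rolling sample standard deviation (ddof=1 by default).
built_in = series.rolling(window=30, min_periods=30).std()

# Workaround from the comment: compute each window's std with NumPy.
via_apply = series.rolling(window=30, min_periods=30).apply(
    lambda x: np.std(x, ddof=1), raw=True
)

# Direct computation on the last 30 values, for comparison.
last_direct = series.iloc[-30:].std()

print(built_in.iloc[-1], via_apply.iloc[-1], last_direct)
```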
