I have a for loop with the intent of checking for values greater than zero.
Problem is, I only want each iteration to check the sum for a group of IDs.
The grouping would be a match of the first 8 characters of the ID string.
I have that grouping taking place before the loop but the loop still appears to search the entire df instead of each group.
LeftGroup = newDF.groupby('ID_Left_8')
for g in LeftGroup.groups:
    if sum(newDF['Hours_Calc'] > 0):
        print(g)
Is there a way to filter that sum to each grouping of leftmost 8 characters?
I was expecting the .groups function to accomplish this, but it still seems to search every single ID.
Thank you.
def filter_and_sum(group):
    # Sum only the positive Hours_Calc values within this group
    return group.loc[group['Hours_Calc'] > 0, 'Hours_Calc'].sum()

LeftGroup = newDF.groupby('ID_Left_8')
results = LeftGroup.apply(filter_and_sum)
print(results)
This will compute the sum of the Hours_Calc column for each group, filtered by the condition Hours_Calc > 0. The resulting series will have the leftmost 8 characters as the index, and the sum of the Hours_Calc column as the value.
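If you only need the per-group sums, a minimal sketch of an equivalent one-liner, assuming the same newDF with ID_Left_8 and Hours_Calc columns:
# Filter to positive hours first, then sum within each 8-character prefix group
results = newDF.loc[newDF['Hours_Calc'] > 0].groupby('ID_Left_8')['Hours_Calc'].sum()
Note one behavioral difference: groups with no positive hours drop out entirely here, whereas the apply version returns 0 for them.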
I'm trying to find a vectorized way of determining the first instance where my column of data has a sign change. I looked at this question and it gets close to what I want, except it evaluates my first zeros as true. I'm open to different solutions including changing how the data is set up in the first place. I'll detail what I'm doing below.
I have two columns, let's call them positive and negative, that look at a third column. The third column has values ranging between [-5, 5]. When this column is [3, 5], my positive column gets a +1 on that same row; all other rows are 0 in that column. Likewise, when the third column is between [-5, -3], my negative column gets a -1 in that row; all other rows are 0.
I combine these columns into one column. You can conceptualize this as 'turn machine on, keep it on/off, turn it off, keep it on/off, turn machine on ... etc.' The problem I'm having is that my combined column looks something like below:
pos = [1,1,1,0, 0, 0,0,0,0,0,1, 0,1]
neg = [0,0,0,0,-1,-1,0,0,0,0,0,-1,0]
com = [1,1,1,0,-1,-1,0,0,0,0,1,-1,1]
# Below is what I want to have as the final column.
cor = [1,0,0,0,-1, 0,0,0,0,0,1,-1,1]
The problem with what I've linked is that it gets close, but it evaluates the first 0 as a sign change as well. Zeros should be ignored; I tried a few things, but I only seem to be creating new errors. For the sake of completeness, this is what the linked code outputs:
lnk = [True,False,False,True,True,False,True,False,False,False,True,True,True]
As you can see, it handles the repeated 1s and -1s correctly (no flip), but it treats the zeros as sign changes. I'm not sure whether I should change how the combined column is made, change the logic that creates the component columns, or both. The big thing is I need this code vectorized for performance reasons.
Any help would be greatly appreciated!
Let's suppose your dataframe is named df with columns pos and neg; then you can try something like the following:
df.loc[:, "switch_pos"] = (np.diff(df.pos, prepend=0) > 0) * 1
df.loc[:, "switch_neg"] = (np.diff(df.neg, prepend=0) < 0) * (-1)
You can then combine your two switch columns.
Explanations
np.diff gives you the row-by-row difference: for the pos column it yields 1 on a 0-to-1 transition and -1 on a 1-to-0 transition. Considering your desired output, you only want to keep the 0-to-1 switches, which is why you keep only the positive differences (and, symmetrically, only the negative differences for neg).
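A minimal end-to-end sketch using the pos and neg arrays from the question, just to show the combination step:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "pos": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
    "neg": [0, 0, 0, 0, -1, -1, 0, 0, 0, 0, 0, -1, 0],
})
# Mark only the 0 -> 1 switches in pos and the 0 -> -1 switches in neg
df["switch_pos"] = (np.diff(df.pos, prepend=0) > 0) * 1
df["switch_neg"] = (np.diff(df.neg, prepend=0) < 0) * (-1)
# Combine: the two switch columns never overlap, so a plain sum works
df["cor"] = df["switch_pos"] + df["switch_neg"]
print(df["cor"].tolist())
# [1, 0, 0, 0, -1, 0, 0, 0, 0, 0, 1, -1, 1]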
Is it possible to use .nlargest to get the two highest numbers in a set of numbers, but ensure that they are x amount of rows apart?
For example, in the following code I would want to find the largest values but ensure that they are more than 5 rows apart from each other. Is there an easy way to do this?
data = {'Pressure': [100, 112, 114, 120, 123, 420, 1222, 132, 123, 333, 123, 1230, 132, 1, 23, 13, 13, 13, 123, 13, 123, 3, 222, 2303, 1233, 1233, 1, 1, 30, 20, 40, 401, 10, 40, 12, 122, 1, 12, 333]}
If I understand the question correctly, you need to output the largest value, and then the next largest value that's at least X rows apart from it (based on the index).
First value is just df.Pressure.max(), where df = pd.DataFrame(data). Its index is df.Pressure.idxmax().
Second value is either before or after the first value's index:
max_before = df.Pressure.loc[:df.Pressure.idxmax() - X].max()
max_after = df.Pressure.loc[df.Pressure.idxmax() + X:].max()
second_value = max(max_before, max_after)
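Putting it together with the data above, a minimal sketch (taking X = 5 from the "5 values apart" in the question):
import pandas as pd

df = pd.DataFrame(data)
X = 5
first_value = df.Pressure.max()
peak_idx = df.Pressure.idxmax()
# .loc slicing is inclusive at both ends, so each window stops exactly X rows from the peak
max_before = df.Pressure.loc[:peak_idx - X].max()
max_after = df.Pressure.loc[peak_idx + X:].max()
second_value = max(max_before, max_after)
print(first_value, second_value)  # 2303 1230 for this data
One caveat: if the peak sits within X rows of either end, the corresponding slice is empty and its max is NaN, so a real implementation would need to guard against that.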
I want to replace the values in the 'animal' column for each subid/group, based on a condition.
The values in the animal column are numbers (0-3) and vary for each subid, so a, where cond == 1, might look like [0,3] for one subid, [2,1] for another, or [0,3] again, and the same goes for b.
for s in sids:
    a = df[(df['subid']==s)&(df['cond'] == 1)]['animal'].unique()
    b = df[(df['subid']==s)&(df['cond'] == 0)]['animal'].unique()
    df["animal"].replace({a[0]: 0,a[1]:1,b[0]:2,b[1]:3})
The thing is I think the dataframe overwrites entirely each time and uses only the last iteration of the for loop instead of saving the appropriate values for each group.
I tried specifying the subid at the beginning, like df[df['subid']==s]["animal"].replace({a[0]:0,a[1]:1,b[0]:2,b[1]:3}), but it didn't work.
Any pointers are appreciated, thanks!
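One likely fix, sketched below: replace returns a new Series rather than modifying df in place, so the result has to be assigned back to that subid's rows. This assumes the same df, sids, and column names as above.
for s in sids:
    a = df.loc[(df['subid'] == s) & (df['cond'] == 1), 'animal'].unique()
    b = df.loc[(df['subid'] == s) & (df['cond'] == 0), 'animal'].unique()
    mask = df['subid'] == s
    # Assign the replaced values back to this subid's rows only
    df.loc[mask, 'animal'] = df.loc[mask, 'animal'].replace({a[0]: 0, a[1]: 1, b[0]: 2, b[1]: 3})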
So I have 2 dataframes: one with values, and a second one with test statistic (TS) values. I need to check each cell in the TS dataframe and, if its value is smaller than 1, set the corresponding cell in the values dataframe to 0.
I've tried to map them but could not find the right way.
yearly_flux = yearly_flux.map(lambda x : 0 ts_yearly_flux else x, ts_yearly_flux)
I have no idea if it can be solved like this, but I've tried.
It's my second question so sorry if something is missing.
# Toy data of the same shape; here df1 plays the role of the TS dataframe
# and df2 the values dataframe
df1 = pd.DataFrame(np.random.normal(size=(5, 10)))
df2 = pd.DataFrame(np.random.normal(size=(5, 10)))
# Wherever df1 < 1, zero out df2; elsewhere keep df2's value.
# Assigning through df2[:] preserves df2's index and columns.
df2[:] = np.where(df1 < 1, 0, df2)
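Adapted to the names in the question, assuming yearly_flux and ts_yearly_flux share the same shape:
yearly_flux[:] = np.where(ts_yearly_flux < 1, 0, yearly_flux)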