Replace values for each group - python

I want to replace values in ['animal'] for each subid/group, based on a condition.
The values in the animal column are numbers (0-3) and vary for each subid, so a (the unique animals where cond == 1) might look like [0, 3] for one subid and [2, 1] for another, and the same goes for b.
for s in sids:
    a = df[(df['subid'] == s) & (df['cond'] == 1)]['animal'].unique()
    b = df[(df['subid'] == s) & (df['cond'] == 0)]['animal'].unique()
    df["animal"].replace({a[0]: 0, a[1]: 1, b[0]: 2, b[1]: 3})
The thing is I think the dataframe overwrites entirely each time and uses only the last iteration of the for loop instead of saving the appropriate values for each group.
I tried specifying the subid at the beginning, like so: df[df['subid']==s]["animal"].replace({a[0]: 0, a[1]: 1, b[0]: 2, b[1]: 3}), but it didn't work.
Any pointers are appreciated, thanks!
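A minimal sketch of one possible fix, assuming df, sids, and the a/b arrays above: Series.replace returns a new Series rather than modifying df in place, so the result of each iteration is discarded. Writing the replacement back through df.loc with a mask for the current subid keeps each group's values:

for s in sids:
    mask = df['subid'] == s
    a = df.loc[mask & (df['cond'] == 1), 'animal'].unique()
    b = df.loc[mask & (df['cond'] == 0), 'animal'].unique()
    # Map this subid's animals onto the fixed codes 0-3 and assign
    # the result back to this subid's rows only (a dict replace maps
    # original values simultaneously, so there is no cascading).
    df.loc[mask, 'animal'] = df.loc[mask, 'animal'].replace(
        {a[0]: 0, a[1]: 1, b[0]: 2, b[1]: 3})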

Related

How do you sum a dataframe based off a grouping in Python pandas?

I have a for loop with the intent of checking for values greater than zero.
The problem is, I only want each iteration to check the sum of a group of IDs.
The grouping would be a match of the first 8 characters of the ID string.
I have that grouping taking place before the loop but the loop still appears to search the entire df instead of each group.
LeftGroup = newDF.groupby('ID_Left_8')
for g in LeftGroup.groups:
    if sum(newDF['Hours_Calc'] > 0):
        print(g)
Is there a way to filter that sum to each grouping of leftmost 8 characters?
I was expecting the .groups function to accomplish this, but it still seems to search every single ID.
Thank you.
def filter_and_sum(group):
    # Sum only the positive Hours_Calc values within this group
    return group.loc[group['Hours_Calc'] > 0, 'Hours_Calc'].sum()

LeftGroup = newDF.groupby('ID_Left_8')
results = LeftGroup.apply(filter_and_sum)
print(results)
This will compute the sum of the Hours_Calc column for each group, filtered by the condition Hours_Calc > 0. The resulting series will have the leftmost 8 characters as the index, and the sum of the Hours_Calc column as the value.
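If the goal is the original check, printing only the groups whose positive-hours total exceeds zero, a possible follow-up using the results Series above:

# Print only the group keys whose positive-hours sum is greater than zero
for g, total in results.items():
    if total > 0:
        print(g)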

Find Sign when Sign Changes in Pandas Column while Ignoring Zeros using Vectorization

I'm trying to find a vectorized way of determining the first instance where my column of data has a sign change. I looked at this question and it gets close to what I want, except it evaluates my first zeros as true. I'm open to different solutions including changing how the data is set up in the first place. I'll detail what I'm doing below.
I have two columns, let's call them positive and negative, that look at a third column. The third column has values ranging between [-5, 5]. When this column is [3, 5], my positive column gets a +1 on that same row; all other rows are 0 in that column. Likewise, when the third column is between [-5, -3], my negative column gets a -1 in that row; all other rows are 0.
I combine these columns into one column. You can conceptualize this as 'turn machine on, keep it on/off, turn it off, keep it on/off, turn machine on ... etc.' The problem I'm having is that my combined column looks something like below:
pos = [1,1,1,0, 0, 0,0,0,0,0,1, 0,1]
neg = [0,0,0,0,-1,-1,0,0,0,0,0,-1,0]
com = [1,1,1,0,-1,-1,0,0,0,0,1,-1,1]
# Below is what I want to have as the final column.
cor = [1,0,0,0,-1, 0,0,0,0,0,1,-1,1]
The problem with what I've linked is that it gets close, but it evaluates the first 0 as a sign change as well. Zeros should be ignored; I tried a few things but seem to be creating new errors. For the sake of completeness, this is what the linked code outputs:
lnk = [True,False,False,True,True,False,True,False,False,False,True,True,True]
As you can see, it handles the 1 and -1 runs correctly, but it also flags the zeros as sign changes. I'm not sure whether I should change how the combined column is made, change the logic that creates the component columns, or both. The big thing is I need to vectorize this code for performance reasons.
Any help would be greatly appreciated!
Let's suppose your dataframe is named df with columns pos and neg; then you can try something like the following:
import numpy as np

# Mark switch-on rows: pos going 0 -> 1, neg going 0 -> -1
df.loc[:, "switch_pos"] = (np.diff(df.pos, prepend=0) > 0) * 1
df.loc[:, "switch_neg"] = (np.diff(df.neg, prepend=0) < 0) * (-1)
You can then combine your two switch columns.
Explanations
np.diff gives the row-by-row difference. For the pos column, a 0-to-1 transition yields 1 and a 1-to-0 transition yields -1; given your desired output you want to keep only the 0-to-1 transitions, which is why you keep only the greater-than-zero output. For the neg column a switch-on is a 0-to-(-1) transition, which shows up as a negative difference, hence the less-than-zero test.
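A small sketch of the combination step, assuming the switch columns above (the plain sum works because the two columns are never nonzero on the same row, since pos and neg cannot both switch on at once):

# Combine the two switch columns into the corrected signal
df["switch"] = df["switch_pos"] + df["switch_neg"]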

First Transition Value of DataFrame Column without Temporary Variables

I am trying to find the first transition value of a dataframe column as efficiently as possible. I would prefer not to have temporary variables. Say I have a dataframe (df) with a column of:
Column1
0
0
0
-1
1
In this case, the value which I'm looking for is -1, which is the first time the value changes. I want to use this in an if statement for whether the value is first transitioning to 1 or -1. The pseudocode being:
if (first transition value == 1):
    # Something
elif (first transition value == -1):
    # Something else
General case
You can compare the values in the dataframe to the first one, take only the differing values and use the first value of these.
df[df.Column1 != df.Column1.iloc[0]].Column1.values[0]
Special case
If you always want to find the first element differing from 0, you could just do it like this:
df[df.Column1 != 0].Column1.values[0]
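Plugged back into the pseudocode from the question, one possible form (using a single variable for readability, though the expression could be inlined to honor the no-temporaries preference; this assumes a transition actually exists):

first_transition = df[df.Column1 != df.Column1.iloc[0]].Column1.values[0]
if first_transition == 1:
    pass  # Something
elif first_transition == -1:
    pass  # Something else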

If statement over pandas.series and append result to list

I'm trying to build a few lists made from results. Could you tell me why this result is empty?
I'm not looking for a solution with numpy; that's why originally I'll create > 50 lists and later save them to CSV.
import pandas as pd

df1 = pd.DataFrame(data={"Country": ["USA", "Germany", "Russia", "Poland"],
                         "Capital": ["Washington", "Berlin", "Moscow", "Warsaw"],
                         "Region": ["America", "Europe", "Europe", "Europe"]})
America = []
if (df1['Region'] == 'America').all():
    America.append(df1)
print(America)
Your expression df1['Region']=='America' gives a so-called boolean mask (docs on boolean indexing). A boolean mask is a pandas Series of True and False whose index is lined up with the index of df1.
It's easy to get your expected values once you get used to boolean indexing:
df1[df1['Region']=='America']
Country Capital Region
0 USA Washington America
If you are interested in keeping entire rows, don't bother manually building a python list; that would complicate your work immensely compared to sticking to pandas. You can store the rows in a new DataFrame:
# Use df.copy() here so that changing America won't change df1
America = df1[df1['Region']=='America'].copy()
Why if (df1['Region']=='America').all(): didn't work
The Series.all() method checks whether all values in the Series are True. What you need to do here is to check each row for your condition df1['Region']=='America', and keep only those rows that match this condition (if I understand you correctly).
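For illustration, here is what the mask and .all() evaluate to on the df1 above, which is why the if-branch never runs:

mask = df1['Region'] == 'America'
print(mask.tolist())   # [True, False, False, False]
print(mask.all())      # False, so America.append(df1) is never reached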
I'm not sure exactly what you want.
If you want to add the whole dataframe to the list when 'America' appears in Region:
for region in df1.Region:
    if region == 'America':
        America.append(df1)
If you want to add the elements from each column that sit at the same index as 'America' in the Region column:
count = 0
for region in df1.Region:
    if region == 'America':
        America.append(df1.Country[count])
        America.append(df1.Capital[count])
    count += 1
Does this answer the question?

assigning values to first three rows of every group

I'm trying to code the following logic in pandas: for the first three rows of every group I want to create a variable that has the value 1 (1st row), 2 (2nd row), 3 (3rd row). I'm doing it like below. In the code I'm not creating a new variable, because I don't know how to do that, so I'm replacing a variable that's already present in the data set. Though my code doesn't throw an error, it's giving me very strange results.
def func(i):
    data.loc[data.groupby('ID').nth(i).index, 'date'] = i

func(1)
Any suggestions?
Thanks in Advance.
If you don't have a duplicated index, you can create a row number within each group, keep only the numbers that are at most 3, and then assign the result back to the data frame:
data['date'] = (data.groupby('ID').cumcount() + 1)[lambda x: x <= 3]
This gives the first three rows for each ID 1,2,3, rows beyond 3 will have NaN values.
data = pd.DataFrame({"ID":[1,1,1,1,2,2,3,3,3]})
data['date'] = (data.groupby('ID').cumcount() + 1)[lambda x: x <= 3]
data
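For reference, the frame this produces looks like the following; the fourth row of ID 1 falls outside the first three, so its date is NaN (and the column becomes float):

   ID  date
0   1   1.0
1   1   2.0
2   1   3.0
3   1   NaN
4   2   1.0
5   2   2.0
6   3   1.0
7   3   2.0
8   3   3.0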
