Drop rows while iterating through groups in pandas groupby - python

For a dataset like:
import pandas as pd

te = {'A': [1, 1, 1, 2, 2, 2], 'B': [0, 3, 6, 0, 5, 7]}
df = pd.DataFrame(te, index=range(6))
vol = 0
I'd like to group by A and iterate through the groups after the groupby call:
for name, group in df.groupby('A'):
    for i, row in group.iterrows():
        if row['B'] <= 0:
            group = group.drop(i)
            vol += row['A']
Somehow my code doesn't work and the dataframe df remains the same as before the loop. I need to use the groupby() method because the rows of the dataset will grow through another loop outside this one. Is there any way to drop rows from the groups of a groupby? Or how can I filter them out while also summing row['A']?

If I understand correctly, you can do both operations separately, without a loop:
vol = df.A[df.B <= 0].sum()
df = df[df.B > 0]  # equivalent to dropping those rows
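Applied to the sample data from the question, a quick check (a minimal sketch; the printed value assumes the exact te dictionary above):
import pandas as pd

te = {'A': [1, 1, 1, 2, 2, 2], 'B': [0, 3, 6, 0, 5, 7]}
df = pd.DataFrame(te, index=range(6))

vol = df.A[df.B <= 0].sum()  # rows 0 and 3 have B <= 0, so vol = 1 + 2 = 3
df = df[df.B > 0]            # drops rows 0 and 3, keeping the other four
print(vol)  # 3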

Related

Loop in Pandas faster

I need to make a loop in pandas faster. It's a time series.
The code below works pretty well, but it is slow for a massive df.
It iterates through the df and, at the first zero of each run of zeros in column A (only the first zero of each run matters; the df has many such runs), calculates the delta (in absolute value) between the column B values one period before and one period after that initial zero of column A.
Then it stores the results in a new df with a column called 'Delta'.
I bet I can do something with loc, but I cannot figure out how.
deltas = []
indexes = []
i = 0
for idx, row in df.iterrows():
    if df.A[i] == 0 and df.A[i-1] != 0:
        deltas.append(abs(df.B.shift(periods=1)[i] - df.B.shift(periods=-1)[i]))
        indexes.append(idx)
    i += 1
s_delta = pd.Series(deltas, name="Delta", index=indexes)
df_delta = s_delta.to_frame()
Use assign to process the df as whole series instead of per row (each new column needs its own lambda, and later lambdas can refer to columns created earlier in the same assign):
import numpy as np

df = df.assign(
    n=lambda x: x.B.shift(1),
    p=lambda x: x.B.shift(-1),
    s_delta=lambda x: np.abs(x.n - x.p),
)
Then you can select the rows you need using np.where or a boolean mask.
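For the "first zero of each run" condition, a boolean mask fits naturally. A sketch, assuming the assign above has already added s_delta (the first_zero and df_delta names are illustrative):
# A is 0 here but was not 0 on the previous row: the first zero of each run.
first_zero = (df.A == 0) & (df.A.shift(1) != 0)
df_delta = df.loc[first_zero, ['s_delta']].rename(columns={'s_delta': 'Delta'})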

How can I update the value in a pandas dataframe

I have a pandas DataFrame df which consists of three columns: doc1, doc2, value.
I set value to 0 in all the rows. I want to update value using a Jaccard similarity function (suppose it is defined).
I do the following:
df['value'] = 0
for index, row in df.iterrows():
    sim = jaccardSim(row['doc1'], row['doc2'])
    df.at[index, 'value'] = sim
Unfortunately, it does not work. When I print df, df['value'] still contains 0.
How can I solve that?
You can try:
df['value'] = [jaccardSim(x, y) for x, y in zip(df['doc1'], df['doc2'])]
You can also do it with a vectorized function via np.vectorize. Vectorize the two-argument function and pass it the two columns (note: rebinding jaccardSim to a lambda that wraps itself would recurse, and calling the vectorized function on the whole df would apply it elementwise rather than per row):
vect_jaccardSim = np.vectorize(jaccardSim)
df['value'] = vect_jaccardSim(df['doc1'], df['doc2'])
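An equivalent row-wise version with DataFrame.apply, as a sketch (usually slower than the list comprehension above on large frames, but readable):
df['value'] = df.apply(lambda row: jaccardSim(row['doc1'], row['doc2']), axis=1)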

Is there an equivalent Python function similar to complete.cases in R

I am removing a number of records from a pandas data frame which contains diverse combinations of NaN across its 4 columns. I have created a function called complete_cases to return the indexes of rows that meet the following condition: all columns in the row are NaN.
I have tried this function below:
def complete_cases(dataframe):
indx = []
indx = [x for x in list(dataframe.index) \
if dataframe.loc[x, :].isna().sum() ==
len(dataframe.columns)]
return indx
I am wondering whether this is optimal enough or whether there is a better way to do this.
Absolutely. All you need to do is:
df.dropna(axis=0, how='any', inplace=True)
This will remove all rows that have at least one missing value, and it updates the data frame in place.
I'd recommend using loc, isna, and any with the 'columns' axis, like this:
df.loc[df.isna().any(axis='columns')]
This selects the rows that contain at least one NaN; negate the mask with ~ to keep only the complete rows, as complete.cases does in R.
A possible solution:
Count the number of NaN values per row, saving the count in a new column
Based on this new column, filter the rows of the data frame as you wish
Remove the (now) unnecessary column
It is possible to do it with a lambda function. For example, if you want to remove rows that have 10 NaN values:
df['count'] = df.apply(lambda x: 0 if x.isna().sum() == 10 else 1, axis=1)
df = df[df['count'] != 0]  # df.count (without brackets) would be the DataFrame method, not the column
del df['count']
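For the exact condition stated in the question (rows where every column is NaN), a vectorized sketch of complete_cases, assuming the same signature as above:
def complete_cases(dataframe):
    # Indexes of rows where all columns are NaN, without a Python-level loop.
    return dataframe.index[dataframe.isna().all(axis=1)].tolist()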

Iterate over a subset of a Pandas groupby object

I have a Pandas groupby object, and I would like to iterate over the first n groups. I've tried:
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'd', 'd'],
                   'B': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]})
df_grouped = df.groupby('A')
i = 0
n = 2  # for instance
for name, group in df_grouped:
    # DO SOMETHING
    if i == n:
        break
    i += 1
and
group_list = list(df_grouped.groups.keys())[:n]
for name in group_list:
    group = df_grouped.get_group(name)
    # DO SOMETHING
but I wondered if there was a more elegant/pythonic way to do it?
My actual groupby has 1000s of groups within it, and I'd like to only perform an operation on a subset, just to get an impression of the data as a whole.
You can filter on the original df first, then do whatever else you need. ngroup() numbers the groups from 0, so <= 1 keeps the first two:
yourdf = df[df.groupby('A').ngroup() <= 1]
Or, equivalently, with factorize:
yourdf = df[pd.factorize(df.A)[0] <= 1]
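Since a groupby object is iterable, itertools.islice is another natural fit if you do want to iterate; a sketch (not part of the answer above) reusing the df_grouped and n from the question:
from itertools import islice

for name, group in islice(df_grouped, n):
    # DO SOMETHING with each of the first n groups
    print(name, len(group))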

Add values to bottom of DataFrame automatically with Pandas

I'm initializing a DataFrame:
columns = ['Thing','Time']
df_new = pd.DataFrame(columns=columns)
and then writing values to it like this:
counter = 0
for t in df.Thing.unique():
    df_temp = df[df['Thing'] == t]  # filtering the df
    df_new.loc[counter, 'Thing'] = t  # writing the filter value to df_new
    df_new.loc[counter, 'Time'] = df_temp['delta'].sum(axis=0)  # summing and adding that value to df_new
    counter += 1  # increment the row index
Is there a better way to add new values to the dataframe each time without explicitly incrementing the row index with 'counter'?
If I'm interpreting this correctly, I think this can be done in one line:
newDf = df.groupby('Thing')['delta'].sum().reset_index()
Grouping by 'Thing' gives you the various "t-filters" from your for-loop. The sum() is then applied to 'delta' within each group. At that point the dataframe has the values of "t" as the index and the per-group delta sums as a column; reset_index() then bumps the "t's" back into their own column, giving the desired output.
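To match the 'Thing'/'Time' columns of df_new exactly, a small follow-up sketch (the rename is an assumption about the desired output, not part of the answer):
df_new = (df.groupby('Thing')['delta']
            .sum()
            .reset_index()
            .rename(columns={'delta': 'Time'}))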
