Iterate over a subset of a Pandas groupby object - python

I have a Pandas groupby object, and I would like to iterate over the first n groups. I've tried:
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'd', 'd'],
                   'B': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]})
df_grouped = df.groupby('A')
i = 0
n = 2  # for instance
for name, group in df_grouped:
    # DO SOMETHING
    if i == n:
        break
    i += 1
and
group_list = list(df_grouped.groups.keys())[:n]
for name in group_list:
    group = df_grouped.get_group(name)
    # DO SOMETHING
but I wondered if there was a more elegant/pythonic way to do it?
My actual groupby has 1000s of groups within it, and I'd like to only perform an operation on a subset, just to get an impression of the data as a whole.

You can filter the original df first, then do whatever else you need with the result:

yourdf = df[df.groupby('A').ngroup() <= 1]

or, equivalently:

yourdf = df[pd.factorize(df.A)[0] <= 1]
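For illustration, here is a minimal sketch (using the question's df and n = 2) of what the ngroup() filter keeps; ngroup() labels each row with the number of the group it belongs to, so comparing against n selects the rows of the first n groups:

import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'd', 'd'],
                   'B': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]})
n = 2

# ngroup() numbers the groups 0, 1, 2, ... in the order groupby sorts them
# ('a' -> 0, 'b' -> 1, ...), so "< n" keeps only the first n groups' rows.
subset = df[df.groupby('A').ngroup() < n]

# The subset can then be grouped and iterated as usual.
for name, group in subset.groupby('A'):
    print(name, len(group))  # a 3, b 2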

Related

Is there an equivalent Python function similar to complete.cases in R

I am removing a number of records from a four-column pandas data frame that contains various combinations of NaN. I have created a function called complete_cases to provide the indexes of rows that meet the following condition: all columns in the row are NaN.
I have tried this function below:
def complete_cases(dataframe):
    indx = [x for x in list(dataframe.index)
            if dataframe.loc[x, :].isna().sum() == len(dataframe.columns)]
    return indx
I am wondering whether this is optimal or whether there is a better way to do it.
Absolutely. All you need to do is
df.dropna(axis=0, how='any', inplace=True)

This will remove all rows that have at least one missing value, and it updates the data frame in place.
I'd recommend using loc, isna, and any along the columns axis, like this:

df.loc[df.isna().any(axis='columns')]

This selects the rows that contain at least one missing value; negating the mask, df.loc[~df.isna().any(axis='columns')], gives the complement, which is what complete.cases keeps in R.
A possible solution:

1. Count the number of NaN values per row, saving the count in a new column.
2. Filter the rows of the data frame based on this new column.
3. Remove the (now) unnecessary column.
It is possible to do it with a lambda function. For example, to remove rows that have 10 NaN values:

df['count'] = df.apply(lambda x: 0 if x.isna().sum() == 10 else 1, axis=1)
df = df[df['count'] != 0]  # note: df.count would refer to the count() method, not the column
del df['count']
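For reference, the exact check the original complete_cases function performs (indexes of rows in which every column is NaN) can also be written without a Python-level loop; a minimal sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'w': [1, np.nan, np.nan],
                   'x': [2, np.nan, 5]})

# isna().all(axis=1) is True exactly where every column in the row is NaN.
all_nan_idx = df.index[df.isna().all(axis=1)]
print(list(all_nan_idx))  # [1]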

Pandas: Setting values in GroupBy doesn't affect original DataFrame

import numpy as np
import pandas as pd

data = pd.read_csv("file.csv")
As = data.groupby('A')
for name, group in As:
    current_column = group.iloc[:, i]
    current_column.iloc[0] = np.nan

The problem: data stays the same after this loop, even though I'm trying to set values to np.nan.
As @ohduran suggested:

data = pd.read_csv("file.csv")
As = data.groupby('A')
groups = []
for name, group in As:
    # edit the grouped data,
    # e.g. group.loc[:, 'column'] = np.nan
    groups.append(group)
new_data = pd.concat(groups)

.groupby() does not change the initial DataFrame. You might want to store what you do with groupby() on a different variable and accumulate it in a new DataFrame using that for loop.
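If the goal is to blank out the first row of each group in the original DataFrame, another option is to write through data.loc using the index of the group heads; a minimal sketch (the column name 'value' is a stand-in for whichever column the loop's i selects):

import numpy as np
import pandas as pd

data = pd.DataFrame({'A': ['x', 'x', 'y', 'y'],
                     'value': [1, 2, 3, 4]})

# head(1) returns the first row of every group with its original index,
# so .loc writes back into data itself rather than into a copy.
first_rows = data.groupby('A').head(1).index
data.loc[first_rows, 'value'] = np.nan
print(data)  # rows 0 and 2 now hold NaN in 'value'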

Get all previous values for every row

I'm about to write a backtesting tool, so for every row I'd like to have access to all of the dataframe up to that row. In the following example I'm doing it from a fixed index using a loop. I'm wondering if there is any better solution.
import numpy as np
import pandas as pd

N = 10
df = pd.DataFrame({"a": np.arange(N)})
for i in range(3, N):
    print(df["a"][:i].values)
UPDATE (toy example)

I need to apply a custom function to all the previous values. As a toy example I will use the sum of the squares of all previous values.

def toyFun(v):
    return np.sum(v**2)

res = np.empty(N)
res[:] = np.nan
for i in range(3, N):
    res[i] = toyFun(df["a"][:i].values)
df["res"] = res
If you are indexing rows for a particular column, say 'a', you can use the .iloc indexer (integer-location based indexing) to slice by position:

df = pd.DataFrame({'a': [1, 2, 3, 4]})
print(df.a.iloc[:2])  # get the first two values

So you can do:

for i in range(3, 10):
    print(df.a.iloc[:i])
The best way is to use a temporary column holding the intermediate results, so that you are not re-calculating everything:

df["a"].apply(lambda x: x**2).cumsum()

Then re-index as you wish:

res[3:] = df["a"].apply(lambda x: x**2).cumsum()[2:N-1].values

or assign directly to the dataframe.
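Another option worth knowing about is pandas' expanding window, which feeds each row all the values up to and including it; shifting the result by one reproduces the "all previous values" semantics of df["a"][:i]. A minimal sketch with the toy function:

import numpy as np
import pandas as pd

N = 10
df = pd.DataFrame({"a": np.arange(N)})

# expanding().apply() calls the function on a[:i+1] at each row i;
# shift(1) turns the inclusive window into "all previous values" (a[:i]).
df["res"] = (df["a"]
             .expanding()
             .apply(lambda v: np.sum(v**2), raw=True)
             .shift(1))
print(df)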

Drop rows while iterating through groups in pandas groupby

For a dataset like:

import pandas as pd

te = {'A': [1, 1, 1, 2, 2, 2], 'B': [0, 3, 6, 0, 5, 7]}
df = pd.DataFrame(te, index=range(6))
vol = 0

I'd like to group by A and iterate through the groups:

for name, group in df.groupby('A'):
    for i, row in group.iterrows():
        if row['B'] <= 0:
            group = group.drop(i)
            vol += row['A']

Somehow my code didn't work, and the dataframe df remains the same as before the for loop. I need to use the groupby() method because the rows of the dataset will grow through another loop outside this one. Is there any method to drop rows in groups from groupby? Or how can I filter them out while also summing row['A']?
If I understand correctly, you can do both operations separately without a loop:
vol = df.A[df.B <= 0].sum()
df = df[df.B > 0] # equivalent to drop
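Checking this against the example data: rows 0 and 3 have B <= 0, so vol sums their A values; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [0, 3, 6, 0, 5, 7]})

vol = df.A[df.B <= 0].sum()  # A values 1 and 2, so vol == 3
df = df[df.B > 0]            # drops rows 0 and 3
print(vol)
print(df)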

Pandas DataFrame, how to calculate a new column element based on multiple rows

I am currently trying to implement a statistical test for a specific row based on the content of different rows. Given the dataframe in the following image:
[image: DataFrame]
I would like to create a new column based on a function that takes into account all the rows of the dataframe that have the same string in the "Template" column.
For example, in this case there are 2 rows with Template "[Are|Off]", and for each one of those rows I would need to create an element in a new column based on "Clicks", "Impressions" and "Conversions" of both rows.
How would you best approach this problem?
PS: I apologise in advance for the way I am describing the problem; as you might have noticed, I am not a professional coder :D But I would really appreciate your help!

Here is the formula with which I solved this in Excel:

[image: Excel chi-squared test]
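One common pattern for "new column computed from all rows sharing a Template" is groupby().transform(), which computes a per-group statistic and broadcasts it back to every row of the group; a minimal sketch (the share-of-group-clicks metric is just an illustrative stand-in, not the asker's chi-squared formula):

import pandas as pd

df = pd.DataFrame({'Template': ['[Are|Off]', '[Are|Off]', 'Comp'],
                   'Clicks': [10, 30, 5],
                   'Impressions': [100, 200, 50]})

# transform('sum') returns a Series aligned with df, holding each row's
# group total, so per-row versus per-group math stays vectorized.
group_clicks = df.groupby('Template')['Clicks'].transform('sum')
df['click_share'] = df['Clicks'] / group_clicks
print(df)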
This might be overly general but I would use some sort of function map if different things should be done depending on the template name:
import collections

import numpy as np
import pandas as pd

template_column = ['are|off', 'are|off', 'comp', 'comp', 'comp|city']
n = len(template_column)
df = pd.DataFrame(np.random.random((n, 3)), index=range(n),
                  columns=['Clicks', 'Impressions', 'Conversions'])
df['template'] = template_column

# Use a defaultdict so that you can define a default value if a template is
# not defined.
function_map = collections.defaultdict(lambda: lambda df: np.nan)

# Now define functions to compute what the new columns should do depending on
# the template.
function_map.update({
    'are|off': lambda df: df.sum().sum(),
    'comp': lambda df: df.mean().mean(),
    'something else': lambda df: df.mean().max()
})
# The lambda functions are just placeholders. You could do whatever you want
# in these functions... for example:
def do_special_stuff(df):
    """Do something that uses rows and columns...
    You could also do looping or whatever you want, as long
    as the result is a scalar or a sequence with the same
    number of rows as the template group.
    """
    crazy_stuff = np.prod(np.sum(df.values, axis=1)[:, None] + 2*df.values, axis=1)
    return crazy_stuff

function_map['comp'] = do_special_stuff
def wrap(f):
    """Wrap a function so that it returns an updated dataframe."""
    def wrapped(df):
        df = df.copy()
        new_column_data = f(df.drop('template', axis=1))
        df['new_column'] = new_column_data
        return df
    return wrapped

# Wrap all the functions so that each template has a function defined that
# does the correct thing.
series_function_map = {k: wrap(function_map[k]) for k in df['template'].unique()}

# Throw everything back together.
new_df = pd.concat([series_function_map[label](group)
                    for label, group in df.groupby('template')],
                   ignore_index=True)

# Print your shiny new dataframe.
print(new_df)
The result is then something like:
Clicks Impressions Conversions template new_column
0 0.959765 0.111648 0.769329 are|off 4.030594
1 0.809917 0.696348 0.683587 are|off 4.030594
2 0.265642 0.656780 0.182373 comp 0.502015
3 0.753788 0.175305 0.978205 comp 0.502015
4 0.269434 0.966951 0.478056 comp|city NaN
Hope it helps!
OK, so after the groupby you need to apply this formula, and you can do that in pandas as well:

import numpy as np

t = df.groupby("Template")  # this is the groupby

def calculater(b5, b6, c5, c6):
    return b5 / (b5 + b6) * (c5 + c6)

df['result'] = np.vectorize(calculater)(df["b5"], df["b6"], df["c5"], df["c6"])

Here b5, b6, ... are the column names of the cells shown in the image.
This should work for you, though you may need to make some minor changes to the maths.
