Filtering counts in Pandas .agg

Filtering counts in Pandas .agg - python

I am attempting to create a new Pandas Dataframe with specific counts from an existing dataframe (grouping by date and department).
I have read the documentation here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html
and have constructed the following:
new_values = df.groupby(['department',pd.to_datetime(df.date).dt.strftime('%m/%Y')]).agg({'id':"count", 'confirmed':"count"})
I am having difficulty with the syntax, as I want the second count to count only where 'confirmed'=='1'.
I can do this using a second dataframe
filtered_values = df[df['confirmed']==1]
taking another count, and then merging them back together, however, is there a way to do it in the aggregate listed above?

If the values are just (0, 1), you can do directly:
new_values = df.groupby(['department',pd.to_datetime(df.date).dt.strftime('%m/%Y')]).agg({'id':"count", 'confirmed': lambda x: x.sum()})
In any other case (more different values) you can do:
new_values = df.groupby(['department',pd.to_datetime(df.date).dt.strftime('%m/%Y')]).agg({'id':"count", 'confirmed': lambda x: x.eq(1).sum()})

Related

pandas apply function to each group (output is not really an aggregation)

I have a list of time-series (=pandas dataframe) and want to calculate for each time-series (of a device) the matrixprofile.
One option is to iterate all the devices - which seems to be slow.
A second option would be to group by the devices - and apply a UDF. The problem is now, that the UDF will return 1:1 rows i.e. not a single scalar value per group but the same number of rows will be outputted as the input.
Is it still possible to somehow vectorize this calculation for reach group when 1:1 (or at least non scalar values) are returned?
import pandas as pd
df = pd.DataFrame({
'foo':[1,2,3], 'baz':[1.1, 0.5, 4], 'bar':[1,2,1]
})
display(df)
print('***************************')
# slow version retaining all the rows
for g in df.bar.unique():
print(g)
this_group = df[df.bar == g]
# perform a UDF which needs to have all the values per group
# i.e. for real I want to calculate the matrixprofile for each time-series of a device
this_group['result'] = this_group.baz.apply(lambda x: 1)
display(this_group)
print('***************************')
def my_non_scalar1_1_agg_function(x):
display(pd.DataFrame(x))
return x
# neatly vectorized application of a non_scalar function
# but this fails as: Must produce aggregated value
df = df.groupby(['bar']).baz.agg(my_non_scalar1_1_agg_function)
display(df)

For non-aggregated functions applied to each distinct group that does not return a non-scalar value, you need to iterate method across groups and then compile together.
Therefore, consider a list or dict comprehension using groupby(), followed by concat. Be sure method inputs and returns a full data frame, series, or ndarray.
# LIST COMPREHENSION
df_list = [ myfunction(sub) for index, sub in df.groupby(['group_column']) ]
final_df = pd.concat(df_list)
# DICT COMPREHENSION
df_dict = { index: myfunction(sub) for index, sub in df.groupby(['group_column']) }
final_df = pd.concat(df_dict, ignore_index=True)

Indeed this (see also the link above in the comment) is a way to get it to work in a faster/more desired way. Perhaps there is even a better alternative
import pandas as pd
df = pd.DataFrame({
'foo':[1,2,3], 'baz':[1.1, 0.5, 4], 'bar':[1,2,1]
})
display(df)
grouped_df = df.groupby(['bar'])
altered = []
for index, subframe in grouped_df:
display(subframe)
subframe = subframe# obviously we need to apply the UDF here - not the idempotent operation (=doing nothing)
altered.append(subframe)
print (index)
#print (subframe)
pd.concat(altered, ignore_index=True)
#pd.DataFrame(altered)

Add values to bottom of DataFrame automatically with Pandas

I'm initializing a DataFrame:
columns = ['Thing','Time']
df_new = pd.DataFrame(columns=columns)
and then writing values to it like this:
for t in df.Thing.unique():
df_temp = df[df['Thing'] == t] #filtering the df
df_new.loc[counter,'Thing'] = t #writing the filter value to df_new
df_new.loc[counter,'Time'] = dftemp['delta'].sum(axis=0) #summing and adding that value to the df_new
counter += 1 #increment the row index
Is there are better way to add new values to the dataframe each time without explicitly incrementing the row index with 'counter'?

If I'm interpreting this correctly, I think this can be done in one line:
newDf = df.groupby('Thing')['delta'].sum().reset_index()
By grouping by 'Thing', you have the various "t-filters" from your for-loop. We then apply a sum() to 'delta', but only within the various "t-filtered" groups. At this point, the dataframe has the various values of "t" as the indices, and the sums of the "t-filtered deltas" as a corresponding column. To get to your desired output, we then bump the "t's" into their own column via reset_index().

Python dataframe groupby by dictionary list then sum

I have two dataframes. The first named mergedcsv is of the format:
mergedcsv dataframe
The second dataframe named idgrp_df is of a dictionary format which for each region Id a list of corresponding string ids.
idgrp_df dataframe - keys with lists
For each row in mergedcsv (and the corresponding row in idgrp_df) I wish to select the columns within mergedcsv where the column labels are equal to the list with idgrp_df for that row. Then sum the values of those particular values and add the output to a column within mergedcsv. The function will iterate through all rows in mergedcsv (582 rows x 600 columns).
My line of code to try to attempt this is:
mergedcsv['TotRegFlows'] = mergedcsv.groupby([idgrp_df],as_index=False).numbers.apply(lambda x: x.iat[0].sum())
It returns a ValueError: Grouper for class pandas.core.frame.DataFrame not 1-dimensional.
This relates to the input dataframe for the groupby. How can I access the list for each row as the input for the groupby?
So for example, for the first row in mergedcsv I wish to select the columns with labels F95RR04, F95RR06 and F95RR15 (reading from the list in the first row of idgrp_df). Sum the values in these columns for that row and insert the sum value into TotRegFlows column.
Any ideas as to how I can utilize the list would be very much appreciated.
Edits:
Many thanks IanS. Your solution is useful. Following modification of the code line based on this advice I realised that (as suggested) my index in both dataframes are out of sync. I tested the indices (mergedcsv had 'None' and idgrp_df has 'REG_ID' column as index. I set the mergedcsv to 'REG_ID' also. Then realised that the mergedcsv has 582 rows (the REG_ID is not unique) and the idgrp_df has 220 rows (REG_ID is unique). I therefor think I am missing a groupby based on REG_ID index in mergedcsv.
I have modified the code as follows:
mergedcsv.set_index('REG_ID', inplace=True)
print mergedcsv.index.name
print idgrp_df.index.name
mergedcsvgroup = mergedcsv.groupby('REG_ID')[mergedcsv.columns].apply(lambda y: y.tolist())
mergedcsvgroup['TotRegFlows'] = mergedcsvgroup.apply(lambda row: row[idgrp_df.loc[row.name]].sum(), axis=1)
I have a keyError:'REG_ID'.
Any further recommendations are most welcome. Would it be more efficient to combine the groupby and apply into one line?
I am new to working with pandas and trying to build experience in python
Further amendments:
Without an index for mergedcsv:
mergedcsv['TotRegFlows'] = mergedcsv.apply(lambda row: row[idgrp_df.loc[row.name]].groupby('REG_ID').sum(), axis=1)
this throws a KeyError: (the label[0] is not in the [index], u 'occurred at index 0')
With an index for mergedcsv:
mergedcsv.set_index('REG_ID', inplace=True)
columnlist = list(mergedcsv.columns.values)
mergedcsv['TotRegFlows'] = mergedcsv.apply(lambda row: row[idgrp_df.loc[row.name]].groupby('REG_ID')[columnlist].transform().sum(), axis=1)
this throws a TypeError: ("unhashable type:'list'", u'occurred at index 7')
Or finally separating the groupby function:
columnlist = list(mergedcsv.columns.values)
mergedcsvgroup = mergedcsv.groupby('REG_ID')
mergedcsv['TotRegFlows'] = mergedcsvgroup.apply(lambda row: row[idgrp_df.loc[row.name]].sum())
this throws a TypeError: unhashable type list. The axis=1 argument is not available also with groupby apply.
Any ideas how I can use the lists with the apply function? I've explored tuples in the apply code but have not had any success.
Any suggestions much appreciated.

If I understand correctly, I have a simple solution with apply:
Setup
import pandas as pd
df = pd.DataFrame({'A': [1,2,3], 'B': [4,5,6], 'C': [7,8,9]})
lists = pd.Series([['A', 'B'], ['A', 'C'], ['C']])
Solution
I apply a lambda function that gets the list of columns to be summed from the lists series:
df.apply(lambda row: row[lists[row.name]].sum(), axis=1)
The trick is that, when iterating over rows (axis=1), row.name is the original index of the dataframe df. I use that to access the list from the lists series.
Notes
This solution assumes that both dataframes share the same index, which appears not to be the case in the screenshots you included. You have to address that.
Also, if idgrp_df is a dataframe and not a series, then you need to access its values with .loc.

How to group a Series by values in pandas?

I currently have a pandas Series with dtype Timestamp, and I want to group it by date (and have many rows with different times in each group).
The seemingly obvious way of doing this would be something similar to
grouped = s.groupby(lambda x: x.date())
However, pandas' groupby groups Series by its index. How can I make it group by value instead?

grouped = s.groupby(s)
Or:
grouped = s.groupby(lambda x: s[x])

Three methods:
DataFrame: pd.groupby(['column']).size()
Series： sel.groupby(sel).size()
Series to DataFrame:
pd.DataFrame( sel, columns=['column']).groupby(['column']).size()

For anyone else who wants to do this inline without throwing a lambda in (which tends to kill performance):
s.to_frame(0).groupby(0)[0]

You should convert it to a DataFrame, then add a column that is the date(). You can do groupby on the DataFrame with the date column.
df = pandas.DataFrame(s, columns=["datetime"])
df["date"] = df["datetime"].apply(lambda x: x.date())
df.groupby("date")
Then "date" becomes your index. You have to do it this way because the final grouped object needs an index so you can do things like select a group.

To add another suggestion, I often use the following as it uses simple logic:
pd.Series(index=s.values).groupby(level=0)

How to iterate over columns of pandas dataframe to run regression

I have this code using Pandas in Python:
all_data = {}
for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')
prices = DataFrame({tic: data['Adj Close'] for tic, data in all_data.iteritems()})
returns = prices.pct_change()
I know I can run a regression like this:
regs = sm.OLS(returns.FIUIX,returns.FSTMX).fit()
but how can I do this for each column in the dataframe? Specifically, how can I iterate over columns, in order to run the regression on each?
Specifically, I want to regress each other ticker symbol (FIUIX, FSAIX and FSAVX) on FSTMX, and store the residuals for each regression.
I've tried various versions of the following, but nothing I've tried gives the desired result:
resids = {}
for k in returns.keys():
reg = sm.OLS(returns[k],returns.FSTMX).fit()
resids[k] = reg.resid
Is there something wrong with the returns[k] part of the code? How can I use the k value to access a column? Or else is there a simpler approach?

for column in df:
print(df[column])

You can use iteritems():
for name, values in df.iteritems():
print('{name}: {value}'.format(name=name, value=values[0]))

This answer is to iterate over selected columns as well as all columns in a DF.
df.columns gives a list containing all the columns' names in the DF. Now that isn't very helpful if you want to iterate over all the columns. But it comes in handy when you want to iterate over columns of your choosing only.
We can use Python's list slicing easily to slice df.columns according to our needs. For eg, to iterate over all columns but the first one, we can do:
for column in df.columns[1:]:
print(df[column])
Similarly to iterate over all the columns in reversed order, we can do:
for column in df.columns[::-1]:
print(df[column])
We can iterate over all the columns in a lot of cool ways using this technique. Also remember that you can get the indices of all columns easily using:
for ind, column in enumerate(df.columns):
print(ind, column)

You can index dataframe columns by the position using ix.
df1.ix[:,1]
This returns the first column for example. (0 would be the index)
df1.ix[0,]
This returns the first row.
df1.ix[:,1]
This would be the value at the intersection of row 0 and column 1:
df1.ix[0,1]
and so on. So you can enumerate() returns.keys(): and use the number to index the dataframe.

A workaround is to transpose the DataFrame and iterate over the rows.
for column_name, column in df.transpose().iterrows():
print column_name

Using list comprehension, you can get all the columns names (header):
[column for column in df]

Based on the accepted answer, if an index corresponding to each column is also desired:
for i, column in enumerate(df):
print i, df[column]
The above df[column] type is Series, which can simply be converted into numpy ndarrays:
for i, column in enumerate(df):
print i, np.asarray(df[column])

I'm a bit late but here's how I did this. The steps:
Create a list of all columns
Use itertools to take x combinations
Append each result R squared value to a result dataframe along with excluded column list
Sort the result DF in descending order of R squared to see which is the best fit.
This is the code I used on DataFrame called aft_tmt. Feel free to extrapolate to your use case..
import pandas as pd
# setting options to print without truncating output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
import statsmodels.formula.api as smf
import itertools
# This section gets the column names of the DF and removes some columns which I don't want to use as predictors.
itercols = aft_tmt.columns.tolist()
itercols.remove("sc97")
itercols.remove("sc")
itercols.remove("grc")
itercols.remove("grc97")
print itercols
len(itercols)
# results DF
regression_res = pd.DataFrame(columns = ["Rsq", "predictors", "excluded"])
# excluded cols
exc = []
# change 9 to the number of columns you want to combine from N columns.
#Possibly run an outer loop from 0 to N/2?
for x in itertools.combinations(itercols, 9):
lmstr = "+".join(x)
m = smf.ols(formula = "sc ~ " + lmstr, data = aft_tmt)
f = m.fit()
exc = [item for item in x if item not in itercols]
regression_res = regression_res.append(pd.DataFrame([[f.rsquared, lmstr, "+".join([y for y in itercols if y not in list(x)])]], columns = ["Rsq", "predictors", "excluded"]))
regression_res.sort_values(by="Rsq", ascending = False)

I landed on this question as I was looking for a clean iterator of columns only (Series, no names).
Unless I am mistaken, there is no such thing, which, if true, is a bit annoying. In particular, one would sometimes like to assign a few individual columns (Series) to variables, e.g.:
x, y = df[['x', 'y']] # does not work
There is df.items() that gets close, but it gives an iterator of tuples (column_name, column_series). Interestingly, there is a corresponding df.keys() which returns df.columns, i.e. the column names as an Index, so a, b = df[['x', 'y']].keys() assigns properly a='x' and b='y'. But there is no corresponding df.values(), and for good reason, as df.values is a property and returns the underlying numpy array.
One (inelegant) way is to do:
x, y = (v for _, v in df[['x', 'y']].items())
but it's less pythonic than I'd like.

Most of these answers are going via the column name, rather than iterating the columns directly. They will also have issues if there are multiple columns with the same name. If you want to iterate the columns, I'd suggest:
for series in (df.iloc[:,i] for i in range(df.shape[1])):
...

assuming X-factor, y-label (multicolumn):
columns = [c for c in _df.columns if c in ['col1', 'col2','col3']] #or '..c not in..'
_df.set_index(columns, inplace=True)
print( _df.index)
X, y = _df.iloc[:,:4].values, _df.index.values

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Filtering counts in Pandas .agg - python

Related

pandas apply function to each group (output is not really an aggregation)

Add values to bottom of DataFrame automatically with Pandas

Python dataframe groupby by dictionary list then sum

How to group a Series by values in pandas?

How to iterate over columns of pandas dataframe to run regression

Categories

Resources