I have two dataframes. The first, named mergedcsv, has the following format:
mergedcsv dataframe
The second dataframe, named idgrp_df, is in a dictionary-like format: for each region ID it holds a list of corresponding string IDs.
idgrp_df dataframe - keys with lists
For each row in mergedcsv (and the corresponding row in idgrp_df) I wish to select the columns of mergedcsv whose labels appear in the idgrp_df list for that row, sum the values in those columns, and write the result to a new column in mergedcsv. The function should iterate through all rows in mergedcsv (582 rows x 600 columns).
The line of code I have tried is:
mergedcsv['TotRegFlows'] = mergedcsv.groupby([idgrp_df],as_index=False).numbers.apply(lambda x: x.iat[0].sum())
It returns a ValueError: Grouper for class pandas.core.frame.DataFrame not 1-dimensional.
This relates to the input dataframe for the groupby. How can I access the list for each row as the input for the groupby?
So for example, for the first row in mergedcsv I wish to select the columns with labels F95RR04, F95RR06 and F95RR15 (reading from the list in the first row of idgrp_df), sum the values in these columns for that row, and insert the sum into the TotRegFlows column.
Any ideas as to how I can utilize the list would be very much appreciated.
Edits:
Many thanks IanS. Your solution is useful. Following modification of the code line based on this advice, I realised that (as suggested) the indices in my two dataframes are out of sync. I checked: mergedcsv had None as its index and idgrp_df has the 'REG_ID' column as its index, so I set mergedcsv's index to 'REG_ID' as well. I then realised that mergedcsv has 582 rows (REG_ID is not unique) while idgrp_df has 220 rows (REG_ID is unique). I therefore think I am missing a groupby on the REG_ID index in mergedcsv.
I have modified the code as follows:
mergedcsv.set_index('REG_ID', inplace=True)
print mergedcsv.index.name
print idgrp_df.index.name
mergedcsvgroup = mergedcsv.groupby('REG_ID')[mergedcsv.columns].apply(lambda y: y.tolist())
mergedcsvgroup['TotRegFlows'] = mergedcsvgroup.apply(lambda row: row[idgrp_df.loc[row.name]].sum(), axis=1)
This gives a KeyError: 'REG_ID'.
Any further recommendations are most welcome. Would it be more efficient to combine the groupby and apply into one line?
I am new to working with pandas and am trying to build experience in Python.
Further amendments:
Without an index for mergedcsv:
mergedcsv['TotRegFlows'] = mergedcsv.apply(lambda row: row[idgrp_df.loc[row.name]].groupby('REG_ID').sum(), axis=1)
this throws a KeyError: ('the label [0] is not in the [index]', u'occurred at index 0')
With an index for mergedcsv:
mergedcsv.set_index('REG_ID', inplace=True)
columnlist = list(mergedcsv.columns.values)
mergedcsv['TotRegFlows'] = mergedcsv.apply(lambda row: row[idgrp_df.loc[row.name]].groupby('REG_ID')[columnlist].transform().sum(), axis=1)
this throws a TypeError: ("unhashable type: 'list'", u'occurred at index 7')
Or finally separating the groupby function:
columnlist = list(mergedcsv.columns.values)
mergedcsvgroup = mergedcsv.groupby('REG_ID')
mergedcsv['TotRegFlows'] = mergedcsvgroup.apply(lambda row: row[idgrp_df.loc[row.name]].sum())
this throws a TypeError: unhashable type: 'list'. The axis=1 argument is also not available with groupby apply.
Any ideas how I can use the lists with the apply function? I've explored tuples in the apply code but have not had any success.
Any suggestions much appreciated.
If I understand correctly, I have a simple solution with apply:
Setup
import pandas as pd
df = pd.DataFrame({'A': [1,2,3], 'B': [4,5,6], 'C': [7,8,9]})
lists = pd.Series([['A', 'B'], ['A', 'C'], ['C']])
Solution
I apply a lambda function that gets the list of columns to be summed from the lists series:
df.apply(lambda row: row[lists[row.name]].sum(), axis=1)
The trick is that, when iterating over rows (axis=1), row.name is the original index of the dataframe df. I use that to access the list from the lists series.
Notes
This solution assumes that both dataframes share the same index, which appears not to be the case in the screenshots you included. You have to address that.
Also, if idgrp_df is a dataframe and not a series, then you need to access its values with .loc.
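Since your REG_ID is not unique in mergedcsv while it is unique in idgrp_df, a minimal sketch of how this could be adapted (assuming idgrp_df is indexed by REG_ID and holds its lists in a single column, here given the hypothetical name id_list) is to keep REG_ID as an ordinary column of mergedcsv and look the list up per row:

import pandas as pd

# Hypothetical data mirroring the structure described in the question:
# idgrp_df is indexed by REG_ID and holds one column of lists of column labels.
idgrp_df = pd.DataFrame({'id_list': [['F95RR04', 'F95RR06'], ['F95RR15']]},
                        index=pd.Index(['R1', 'R2'], name='REG_ID'))
mergedcsv = pd.DataFrame({'REG_ID': ['R1', 'R1', 'R2'],
                          'F95RR04': [1.0, 2.0, 3.0],
                          'F95RR06': [4.0, 5.0, 6.0],
                          'F95RR15': [7.0, 8.0, 9.0]})

# For each row, pick the columns named in that row's REG_ID list and sum them.
mergedcsv['TotRegFlows'] = mergedcsv.apply(
    lambda row: row[idgrp_df.loc[row['REG_ID'], 'id_list']].sum(), axis=1)
print(mergedcsv)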
Related
I have the following dataframe:
newItem = pd.DataFrame({'c1': range(10), 'c2': (1,90,100,50,30,10,50,30,90,1000)})
Which looks like this:
I want to sort the columns by descending order, and extract the i'th row to a new pandas series.
So my function looks like this:
def getLargestRow(dataFrame, indexAfterSort):
    numRows, numCols = dataFrame.shape
    seriesToReturn = pd.Series()
    dataFrame = dataFrame.sort_values(by=list(dataFrame.columns), ascending=False)
My problem is that I can't manage to get row number indexAfterSort of dataFrame into the series.
I've tried to use:
seriesToReturn = seriesToReturn.add(dataFrame.iloc[indexAfterSort])
But confusingly I got NaN values, instead of the row values.
The dataframe after sort:
The output I receive (no matter what's the input for row index):
What am I missing here?
Thanks in advance.
It's a good idea to use built-in pandas functions for simple operations like sorting. The sort_values function is a good option here. This sorts the rows of the dataframe by the c1 column:
seriesToReturn = newItem.sort_values('c1', ascending=False)
This returns a dataframe with both the c1 and c2 columns; to get a Series of just the c2 column, use seriesToReturn = seriesToReturn['c2'].
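If the goal is to pull out the i-th row after sorting, a minimal sketch (re-implementing the question's getLargestRow under the assumption that you want to sort by every column, as in the original function) is to use iloc. The NaN values came from Series.add, which aligns on index labels, and an empty Series has no labels in common with the row being added:

import pandas as pd

newItem = pd.DataFrame({'c1': range(10),
                        'c2': (1, 90, 100, 50, 30, 10, 50, 30, 90, 1000)})

def get_largest_row(data_frame, index_after_sort):
    # Sort by all columns in descending order, then pick the row by position.
    sorted_df = data_frame.sort_values(by=list(data_frame.columns), ascending=False)
    return sorted_df.iloc[index_after_sort]   # returned as a pandas Series

print(get_largest_row(newItem, 0))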
I have a csv that I loaded into a Pandas Dataframe.
I then select only the rows with duplicate dates in the DF:
df_dups = df[df.duplicated(['Date'])].copy()
I'm trying to get the sum of all the rows with the exact same date for 4 columns (all float values), like this:
df_sum = df_dups.groupby('Date')["Received Quantity","Sent Quantity","Fee Amount","Market Value"].sum()
However, this does not give the desired result. When I examine the groups, I noticed that the first date is not included in the indices. So for two items with the same date, there is only one index in the groups object.
pprint(df_dups.groupby('Date')["Received Quantity","Sent Quantity","Fee Amount","Market Value"].groups)
I have no idea how to get the sum of all duplicates.
I've also tried:
df_sum = df_dups.groupby('Date')["Received Quantity","Sent Quantity","Fee Amount","Market Value"].apply(lambda x : x.sum())
This gives the same result, which makes sense I guess, as the indices in the groupby object are not complete. What am I missing here?
Check the documentation for the method duplicated. By default, duplicates are marked True except for the first occurrence, which is why the first date is not included in your sums.
You only need to pass keep=False to duplicated for your desired behaviour.
df_dups = df[df.duplicated(['Date'], keep=False)].copy()
After that the sum can be calculated properly with the expression you wrote
df_sum = df_dups.groupby('Date')["Received Quantity","Sent Quantity","Fee Amount","Market Value"].apply(lambda x : x.sum())
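As a small self-contained sketch (with made-up dates and values) of the difference keep=False makes:

import pandas as pd

df = pd.DataFrame({'Date': ['2021-01-01', '2021-01-01', '2021-01-02'],
                   'Received Quantity': [1.0, 2.0, 5.0]})

# Default keep='first': the first occurrence of each date is NOT marked.
print(df[df.duplicated(['Date'])])                # only the second 2021-01-01 row

# keep=False: every row sharing a date is marked, so the sums include them all.
df_dups = df[df.duplicated(['Date'], keep=False)].copy()
print(df_dups.groupby('Date')[['Received Quantity']].sum())   # 2021-01-01 -> 3.0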
I have a pandas df which has 7000 rows * 7 columns, and I have a list (row_list) containing the values I want to filter on in df.
What I want to do is filter the rows of df that contain a corresponding value from the list.
This is what I got when I tried:
"Empty DataFrame
Columns: [A,B,C,D,E,F,G]
Index: []"
df = pd.read_csv('filename.csv')
df1 = pd.read_csv('filename1.csv', names = 'A')
row_list = []
for index, rows in df1.iterrows():
    my_list = [rows.A]
    row_list.append(my_list)
boolean_series = df.D.isin(row_list)
filtered_df = df[boolean_series]
print(filtered_df)
Replace
boolean_series = df.D.isin(row_list)
with
boolean_series = df.D.isin(df1.A)
and let us know the result. If it doesn't work, show a sample of df and df1.A.
A few alternatives to consider:
(1) generate a separate df for each condition, concat them, then dedup (slow);
(2) write a custom function that annotates a bool column (default False, set True when the condition is fulfilled), then filter on that column;
(3) keep a list of the indices of all rows containing your row_list values, then filter with iloc based on that index list.
Without an MRE, sample data, or a reason why your method didn't work, it's difficult to provide a more specific answer.
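For what it's worth, the empty result in the question most likely comes from row_list being a list of one-element lists: Series.isin compares each cell against the list elements, and a cell like 'b' never equals ['b']. A minimal sketch of the fix, with hypothetical stand-in frames for the two CSV files:

import pandas as pd

df = pd.DataFrame({'D': ['a', 'b', 'c', 'd']})      # stands in for filename.csv
df1 = pd.DataFrame({'A': ['b', 'd']})               # stands in for filename1.csv

# isin wants a flat iterable of scalars, not a list of single-element lists.
boolean_series = df['D'].isin(df1['A'])
filtered_df = df[boolean_series]                     # rows whose D value appears in df1.A
print(filtered_df)

# To drop those rows instead, negate the mask:
print(df[~boolean_series])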
I have the following dataframes site_1_df and site_2_df (both are similar):
site_1_df:
And the following dataframe:
site_1_index_df = pd.DataFrame(site_1_df.index.values.tolist(), columns=["SiteNumber", "WeekNumber", "PG"])
site_2_index_df = pd.DataFrame(site_2_df.index.values.tolist(), columns=["SiteNumber", "WeekNumber", "PG"])
index_intersection = pd.merge(left=site_1_index_df, right=site_2_index_df,
on=["WeekNumber", "PG"], how="inner")[["WeekNumber", "PG"]]
index_intersection:
Consequently, it is clear that site_1_df and site_2_df are multi-level indexed dataframes. Therefore, I would like to use index_intersection to index the above dataframe. Or, if I am indexing from site_1_df, then I want a subset of the rows from that same dataframe. Technically, I should get back a dataframe with (8556 rows x 6 columns), i.e., the same number of rows as index_intersection. How can I achieve that efficiently in pandas?
I tried:
index_intersection = pd.merge(left=site_1_index_df, right=site_2_index_df,
on=["WeekNumber", "PG"], how="inner")[["SiteNumber_x", "WeekNumber", "PG"]]
index_intersection = index_intersection.rename(columns={"SiteNumber_x": "SiteNumber"})
index_intersection = index_intersection.set_index(["SiteNumber", "WeekNumber", "PG"])
index_intersection
And I got:
However, indexing the dataframe using another dataframe such as:
site_2_df.loc[index_intersection]
# or
site_2_df.loc[index_intersection.index]
# or
site_2_df.loc[index_intersection.index.values]
will give me an error:
NotImplementedError: Indexing a MultiIndex with a DataFrame key is not implemented
Any help is much appreciated!!
So I figured out that I can find the intersection of 2 dataframes, based on their index through:
sites_common_rows = pd.merge(left=site_1_df.reset_index([0]), right=site_2_df.reset_index([0]),
left_index=True, right_index=True, how="inner")
The reset_index([0]) above is used to ignore SiteNumber, since it is totally different from one dataframe to the other. Consequently, I am able to find the inner join between the two dataframes on their indexes.
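An alternative sketch (assuming pandas >= 0.24 for MultiIndex.from_frame, and that the index levels are named SiteNumber, WeekNumber and PG as the question implies) is to turn the two-column intersection into a MultiIndex and test membership against the (WeekNumber, PG) part of site_2_df's index:

import pandas as pd

# index_intersection here is the two-column frame from the first merge above,
# i.e. it has just the WeekNumber and PG columns.
wanted = pd.MultiIndex.from_frame(index_intersection[["WeekNumber", "PG"]])

# Drop the SiteNumber level so the remaining (WeekNumber, PG) pairs can be
# compared directly, then use the boolean mask to subset the rows.
mask = site_2_df.index.droplevel("SiteNumber").isin(wanted)
subset = site_2_df[mask]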
I have this code using Pandas in Python:
all_data = {}
for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')
prices = DataFrame({tic: data['Adj Close'] for tic, data in all_data.iteritems()})
returns = prices.pct_change()
I know I can run a regression like this:
regs = sm.OLS(returns.FIUIX,returns.FSTMX).fit()
but how can I do this for each column in the dataframe? Specifically, how can I iterate over columns, in order to run the regression on each?
Specifically, I want to regress each other ticker symbol (FIUIX, FSAIX and FSAVX) on FSTMX, and store the residuals for each regression.
I've tried various versions of the following, but nothing I've tried gives the desired result:
resids = {}
for k in returns.keys():
    reg = sm.OLS(returns[k],returns.FSTMX).fit()
    resids[k] = reg.resid
Is there something wrong with the returns[k] part of the code? How can I use the k value to access a column? Or else is there a simpler approach?
for column in df:
    print(df[column])
You can use iteritems():
for name, values in df.iteritems():
    print('{name}: {value}'.format(name=name, value=values[0]))
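Applied to the regression in the question, a minimal sketch (with a made-up returns frame standing in for the one built from the Yahoo data) that iterates the columns and stores the residuals of each ticker regressed on FSTMX:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical returns frame with the same tickers as in the question.
rng = np.random.default_rng(0)
returns = pd.DataFrame(rng.normal(size=(100, 4)),
                       columns=['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX'])

resids = {}
for name, series in returns.items():     # items() yields (column name, Series)
    if name == 'FSTMX':
        continue                          # skip regressing FSTMX on itself
    model = sm.OLS(series, returns['FSTMX']).fit()
    resids[name] = model.resid

residuals = pd.DataFrame(resids)          # one column of residuals per ticker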
This answer is to iterate over selected columns as well as all columns in a DF.
df.columns gives an Index containing all the column names in the DF. Now that isn't very helpful if you want to iterate over all the columns, but it comes in handy when you want to iterate over columns of your choosing only.
We can easily use Python's list slicing to slice df.columns according to our needs. For example, to iterate over all columns but the first one, we can do:
for column in df.columns[1:]:
    print(df[column])
Similarly to iterate over all the columns in reversed order, we can do:
for column in df.columns[::-1]:
    print(df[column])
We can iterate over all the columns in a lot of cool ways using this technique. Also remember that you can get the indices of all columns easily using:
for ind, column in enumerate(df.columns):
    print(ind, column)
You can index dataframe columns by the position using ix.
df1.ix[:,1]
This returns the column at position 1, for example (position 0 would be the index).
df1.ix[0,]
This returns the first row.
This would be the value at the intersection of row 0 and column 1:
df1.ix[0,1]
and so on. So you can enumerate over returns.keys() and use the position to index the dataframe.
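Note that ix has since been deprecated and removed from pandas; a quick sketch of the same positional lookups with iloc:

import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

df1.iloc[:, 1]    # the column at position 1 (positions start at 0)
df1.iloc[0]       # the first row
df1.iloc[0, 1]    # the value at row position 0, column position 1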
A workaround is to transpose the DataFrame and iterate over the rows.
for column_name, column in df.transpose().iterrows():
    print column_name
Using list comprehension, you can get all the columns names (header):
[column for column in df]
Based on the accepted answer, if an index corresponding to each column is also desired:
for i, column in enumerate(df):
    print i, df[column]
The above df[column] type is Series, which can simply be converted into numpy ndarrays:
for i, column in enumerate(df):
    print i, np.asarray(df[column])
I'm a bit late but here's how I did this. The steps:
Create a list of all columns
Use itertools to take x combinations
Append each result R squared value to a result dataframe along with excluded column list
Sort the result DF in descending order of R squared to see which is the best fit.
This is the code I used on a DataFrame called aft_tmt. Feel free to extrapolate to your use case.
import pandas as pd
# setting options to print without truncating output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
import statsmodels.formula.api as smf
import itertools
# This section gets the column names of the DF and removes some columns which I don't want to use as predictors.
itercols = aft_tmt.columns.tolist()
itercols.remove("sc97")
itercols.remove("sc")
itercols.remove("grc")
itercols.remove("grc97")
print itercols
len(itercols)
# results DF
regression_res = pd.DataFrame(columns = ["Rsq", "predictors", "excluded"])
# excluded cols
exc = []
# change 9 to the number of columns you want to combine from N columns.
#Possibly run an outer loop from 0 to N/2?
for x in itertools.combinations(itercols, 9):
lmstr = "+".join(x)
m = smf.ols(formula = "sc ~ " + lmstr, data = aft_tmt)
f = m.fit()
exc = [item for item in x if item not in itercols]
regression_res = regression_res.append(pd.DataFrame([[f.rsquared, lmstr, "+".join([y for y in itercols if y not in list(x)])]], columns = ["Rsq", "predictors", "excluded"]))
regression_res.sort_values(by="Rsq", ascending = False)
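A side note: DataFrame.append, used inside the loop above, has been removed in recent pandas versions (2.0 and later). The usual replacement is to collect plain dicts in a list and build the frame once at the end; a minimal sketch of that pattern with stand-in results:

import pandas as pd

rows = []
for rsq, predictors in [(0.91, "a+b"), (0.87, "a+c")]:    # stand-in model results
    rows.append({"Rsq": rsq, "predictors": predictors})

regression_res = pd.DataFrame(rows).sort_values(by="Rsq", ascending=False)
print(regression_res)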
I landed on this question as I was looking for a clean iterator of columns only (Series, no names).
Unless I am mistaken, there is no such thing, which, if true, is a bit annoying. In particular, one would sometimes like to assign a few individual columns (Series) to variables, e.g.:
x, y = df[['x', 'y']] # does not work
There is df.items() that gets close, but it gives an iterator of tuples (column_name, column_series). Interestingly, there is a corresponding df.keys() which returns df.columns, i.e. the column names as an Index, so a, b = df[['x', 'y']].keys() assigns properly a='x' and b='y'. But there is no corresponding df.values(), and for good reason, as df.values is a property and returns the underlying numpy array.
One (inelegant) way is to do:
x, y = (v for _, v in df[['x', 'y']].items())
but it's less pythonic than I'd like.
Most of these answers are going via the column name, rather than iterating the columns directly. They will also have issues if there are multiple columns with the same name. If you want to iterate the columns, I'd suggest:
for series in (df.iloc[:,i] for i in range(df.shape[1])):
    ...
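For example, a tiny sketch showing this positional iteration coping with duplicated column names, where name-based access would return a DataFrame instead of a Series:

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['x', 'x'])   # duplicated names on purpose

for series in (df.iloc[:, i] for i in range(df.shape[1])):
    print(series.sum())    # 4, then 6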
Assuming an X factor matrix and a (possibly multi-column) y label:
columns = [c for c in _df.columns if c in ['col1', 'col2','col3']] #or '..c not in..'
_df.set_index(columns, inplace=True)
print( _df.index)
X, y = _df.iloc[:,:4].values, _df.index.values