Pandas groupby.sum for all columns - python

I have a dataset with a set of columns I want to sum for each row. The columns in question all follow a specific naming pattern that I have been able to group in the past via the .sum() function:
pd.DataFrame.sum(data.filter(regex=r'_name$'),axis=1)
Now I need to perform this same operation, but grouped by the values of another column:
data.groupby('group').sum(data.filter(regex=r'_name$'),axis=1)
However, this does not work: .sum() on a groupby object does not accept a filtered set of columns as an argument. Is there another way to approach this while keeping my data.filter() code?
Example toy dataset. Real dataset contains over 500 columns where all columns are not cleanly ordered:
toy_data = {'id': [1, 2, 3, 4, 5, 6],
            'group': ["a", "a", "b", "b", "c", "c"],
            'a_name': [1, 6, 7, 3, 7, 3],
            'b_name': [4, 9, 2, 4, 0, 2],
            'c_not': [5, 7, 8, 4, 2, 5],
            'q_name': [4, 6, 8, 2, 1, 4]}
df = pd.DataFrame(toy_data, columns=['id', 'group', 'a_name', 'b_name', 'c_not', 'q_name'])
Edit: Missed this in the original post. My objective is to get a variable "sum" holding the row-wise summation of all the selected columns, as in the output shown in the answer below.

You can filter first and then pass df['group'] instead of 'group' to groupby; last, add the sum column with DataFrame.assign:
df1 = (df.filter(regex=r'_name$')
         .groupby(df['group']).sum()
         .assign(sum=lambda x: x.sum(axis=1)))
An alternative is to filter the column names and pass them after groupby:
cols = df.filter(regex=r'_name$').columns
df1 = df.groupby('group')[cols].sum()
Or:
cols = df.columns[df.columns.str.contains(r'_name$')]
df1 = df.groupby('group')[cols].sum().assign(sum = lambda x: x.sum(axis=1))
print(df1)
       a_name  b_name  q_name  sum
group
a           7      13      10   30
b          10       6      10   26
c          10       2       5   17
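If, instead, you want each original row to keep its group's total (an assumption about intent, not something the question asks for), groupby.transform broadcasts the group sums back onto the rows; a minimal sketch:
cols = df.filter(regex=r'_name$').columns
# per-row group totals for each *_name column, then a row-wise grand total
group_totals = df.groupby('group')[cols].transform('sum')
df['sum'] = group_totals.sum(axis=1)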

Related

How to concatenate values from many columns into one column when one doesn't know how many columns there will be

My application saves an indeterminate number of values in different columns. As a result, I have a data frame with a certain number of known columns at the beginning, and then, from a particular column onward (which I know), an uncertain number of columns all holding the same kind of data.
Example:
known1  known2  know3  unknow1  unknow2  unknow3 ...
     1       3      3     data    data2    data3
The result I would like to get should be something like this
known1  known2  know3        all_unknow
     1       3      3  data,data2,data3
How can I do this when I don't know the number of unknown columns, knowing only that (in this example) they start from the 4th column?
IIUC, use filter to select the columns by keyword:
cols = list(df.filter(like='unknow'))
# ['unknow1', 'unknow2', 'unknow3']
df['all_unknow'] = df[cols].apply(','.join, axis=1)
df = df.drop(columns=cols)
or take all columns from the 4th one:
cols = df.columns[3:]
df['all_unknow'] = df[cols].apply(','.join, axis=1)
df = df.drop(columns=cols)
output:
   known1  known2  know3        all_unknow
0       1       3      3  data,data2,data3
df['all_unknown'] = df.iloc[:, 3:].apply(','.join, axis=1)
if you also want to drop all columns after the 4th:
cols = df.columns[3:-1]
df = df.drop(cols, axis=1)
the -1 is to avoid dropping the new column
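A self-contained sketch of the whole flow, using the example row above (the column names are assumed from the question):
import pandas as pd

df = pd.DataFrame([[1, 3, 3, 'data', 'data2', 'data3']],
                  columns=['known1', 'known2', 'know3',
                           'unknow1', 'unknow2', 'unknow3'])

# everything from the 4th column onward is "unknown"
cols = df.columns[3:]
df['all_unknow'] = df[cols].apply(','.join, axis=1)
df = df.drop(columns=cols)
print(df)
#    known1  known2  know3        all_unknow
# 0       1       3      3  data,data2,data3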

I want to extract a dataframe that meets certain conditions using python, pandas

I load Excel data with the columns Time, Name, Good, Bad using python and pandas.
I want to reprocess the dataframe into another dataframe that meets certain conditions.
In detail, I would like to print out a dataframe that stores the sum of the Good and Bad data for each Name over the entire time period.
Please help, anybody who knows python and pandas well.
First aggregate sum with DataFrame.groupby, change the column names with DataFrame.add_prefix, add a new column with DataFrame.assign, and last convert the index to a column with DataFrame.reset_index:
df = pd.DataFrame({
    'Name': list('aaabbb'),
    'Bad': [1, 3, 5, 7, 1, 0],
    'Good': [5, 3, 6, 9, 2, 4]
})
df1 = (df.groupby('Name')[['Good', 'Bad']]
         .sum()
         .add_prefix('Total_')
         .assign(Total_Count=lambda x: x.sum(axis=1))
         .reset_index())
print(df1)
  Name  Total_Good  Total_Bad  Total_Count
0    a          14          9           23
1    b          15          8           23
Use pandas named aggregation (NamedAgg) with eval:
df.groupby('Name')[['Good', 'Bad']]\
  .agg(Total_Good=('Good', 'sum'),
       Total_Bad=('Bad', 'sum'))\
  .eval('Total_Count = Total_Good + Total_Bad')
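For the sample df above this produces the same totals, with Name as the index:
      Total_Good  Total_Bad  Total_Count
Name
a             14          9           23
b             15          8           23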

How to efficiently decode arrays to columns in a pandas dataframe

I have a function that produces results for every month of a year. In my dataframe I collect these results for different data columns. After that, I have a dataframe containing multiple columns with arrays as values. Now I want to "pivot" those columns to have each value in its own column.
For example, if a row contains values [1,2,3,4,5,6,7,8,9,10,11,12] in column 'A', I want to have twelve columns 'A_01', 'A_02', ..., 'A_12' that each contain one value from the array.
My current code is this:
# create new columns
columns_to_add = []
column_count = len(columns_to_process)
for _, row in df[columns_to_process].iterrows():
    columns_to_add += [[row[name][offset] if type(row[name]) == list else row[name]
                        for offset in range(array_len) for name in range(column_count)]]
new_df = pd.DataFrame(columns_to_add,
                      columns=[name + '_' + str(offset + 1) for offset in range(array_len)
                               for name in columns_to_process],
                      index=df.index)  # make dataframe addendum
(note: some rows don't have any values, so I had to put the condition if type() == list into the iteration)
But this code is awfully slow. I believe there must be a much more elegant solution. Can you show me such a solution?
IIUC, use Series.tolist with the pandas.DataFrame constructor.
We'll use DataFrame.rename as well to fix your column name format.
# Setup
df = pd.DataFrame({'A': [ [1,2,3,4,5,6,7,8,9,10,11,12] ]})
pd.DataFrame(df['A'].tolist()).rename(columns=lambda x: f'A_{x+1:0>2d}')
[out]
   A_01  A_02  A_03  A_04  A_05  A_06  A_07  A_08  A_09  A_10  A_11  A_12
0     1     2     3     4     5     6     7     8     9    10    11    12
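The question mentions several array columns and rows without lists; a hedged sketch extending the same tolist idea under those assumptions (columns_to_process and array_len as in the question, column names hypothetical):
import pandas as pd

array_len = 12
columns_to_process = ['A', 'B']  # placeholder names

pieces = []
for name in columns_to_process:
    # pad non-list cells with an all-None list so tolist() stays rectangular
    filled = df[name].apply(lambda v: v if isinstance(v, list) else [None] * array_len)
    wide = pd.DataFrame(filled.tolist(), index=df.index)
    pieces.append(wide.rename(columns=lambda x, n=name: f'{n}_{x+1:0>2d}'))

new_df = pd.concat(pieces, axis=1)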

Finding mean of consecutive column data

I have the following data:
(the data given here is just representational)
(sample data was shown as an image in the original post)
I want to do the following with this data:
I want to keep only the columns from 201 onward,
i.e. I want to remove the 200-1 to 200-4 column data.
One way to do this is to select only the required columns while reading the data from Excel, but I want to know how to filter column names on the basis of a particular pattern, since the 200-1 to 200-4 column names all match the pattern 200-*.
I want to make a column after 202-4 which stores the values in the following ways:
201q1= mean of (201-1 and 201-2)
201q2 = mean of(201-3 and 201-4)
Similarly, if 202-1 to 202-4 data were present, similar columns should be formed.
Please help.
Thanks in advance for your support.
This is a rough example but it will get you close. The example assumes that there are always four columns per group:
# sample data
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame(np.random.randn(2, 12),
                  columns=['200-1', '200-2', '200-3', '200-4',
                           '201-1', '201-2', '201-3', '201-4',
                           '202-1', '202-2', '202-3', '202-4'])
# remove 200-* columns
df2 = df[df.columns[~df.columns.str.contains('200-')]]
# use np.arange to create groups
new = df2.groupby(np.arange(len(df2.columns))//2, axis=1).mean()
# rename columns
new.columns = [f'{v}{k}' for v,k in zip([x[:3] for x in df2.columns[::2]], ['q1','q2']*int(len(df2.columns[::2])/2))]
# join
df2.join(new)
      201-1     201-2     201-3     201-4     202-1     202-2     202-3  \
0  0.865408 -2.301539  1.744812 -0.761207  0.319039 -0.249370  1.462108
1 -0.172428 -0.877858  0.042214  0.582815 -1.100619  1.144724  0.901591

      202-4     201q1     201q2     202q1     202q2
0 -2.060141 -0.718066  0.491802  0.034834 -0.299016
1  0.502494 -0.525143  0.312514  0.022052  0.702043
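A side note: axis=1 grouping is deprecated in recent pandas (2.x); an equivalent sketch groups the transpose instead:
new = df2.T.groupby(np.arange(len(df2.columns)) // 2).mean().T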
For step 1, you can get away with a list comprehension and the pandas drop function:
dropcols = [x for x in df.columns if '200-' in x]
df.drop(dropcols, axis=1, inplace=True)
Steps 3 and 4 are similar: you can calculate the rolling mean of the columns:
df2 = df.rolling(2, axis = 1).mean() # creates rolling mean
df2.columns = [x.replace('-', 'q') for x in df2.columns] # renames the columns
dfans = pd.concat([df, df2], axis = 1) # concatenate the columns together
Now, you just need to remove the columns that you don't want and rename the rest; a sketch of that cleanup follows.
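A hedged sketch of that cleanup, assuming the pair-ending windows are the ones to keep (each *q2 and *q4 column holds a within-group mean; the first window is NaN and the remaining windows mix pairs or groups):
# keep only the windows that close a pair, then relabel q2 -> q1 and q4 -> q2
wanted = [c for c in df2.columns if c.endswith(('q2', 'q4'))]
q = df2[wanted].rename(columns=lambda c: c[:3] + ('q1' if c.endswith('2') else 'q2'))
dfans = pd.concat([df, q], axis=1)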

How to apply a function to multiple columns to create multiple columns in Pandas?

I am trying to apply a function on multiple columns and in turn create multiple columns to count the length of each entry.
Basically I have 5 columns with indexes 5, 7, 9, 13 and 15. Each entry in those columns is a string of the form 'WrappedArray(|2008-11-12, |2008-11-12)'. In my function I try to strip the WrappedArray part, split the values, and count them (length - 1) using the following:
def updates(row, num_col):
    strp = row[num_col].strip('WrappedAway')
    lis = list(strp.split(','))
    return len(lis) - 1
where num_col is the index of the column and can take the values 5, 7, 9, 13, 15.
I have done this but only for 1 column:
fn = lambda row: updates(row,5)
col = df.apply(fn, axis=1)
df = df.assign(**{'count1':col.values})
I basically want to apply this function to ALL the columns with the indexes mentioned (not just column 5 as above), creating a separate count column for each of columns 5, 7, 9, 13 and 15 in short code, instead of doing it separately for each value.
I hope I made sense.
Regarding finding the number of elements in the list, it looks like you could simply use str.count() to count the ',' occurrences in the strings. And in order to apply a defined function to a set of columns you could do something like:
cols = [5, 7, 9, 13, 15]
for col in cols:
    col_counts = {'{}_count'.format(col): df.iloc[:, col].apply(lambda x: x.count(','))}
    df = df.assign(**col_counts)
Alternatively you can also use strip('WrappedAway').split(',') as you were using:
def count_elements(x):
    return len(x.strip('WrappedAway').split(',')) - 1

for col in cols:
    col_counts = {'{}_count'.format(col):
                  df.iloc[:, col].apply(count_elements)}
    df = df.assign(**col_counts)
So for example with the following dataframe:
df = pd.DataFrame({'A': ['WrappedArray(|2008-11-12, |2008-11-12, |2008-10-11)', 'WrappedArray(|2008-11-12, |2008-11-12)'],
                   'B': ['WrappedArray(|2008-11-12,|2008-11-12)', 'WrappedArray(|2008-11-12, |2008-11-12)'],
                   'C': ['WrappedArray(|2008-11-12|2008-11-12)', 'WrappedArray(|2008-11-12|2008-11-12)']})
Redefining the set of columns on which we want to count the amount of elements:
for col in [0, 1, 2]:
    col_counts = {'{}_count'.format(col):
                  df.iloc[:, col].apply(count_elements)}
    df = df.assign(**col_counts)
Would yield:
                                                   A  \
0  WrappedArray(|2008-11-12, |2008-11-12, |2008-1...
1            WrappedArray(|2008-11-12, |2008-11-12)

                                         B  \
0   WrappedArray(|2008-11-12,|2008-11-12)
1  WrappedArray(|2008-11-12, |2008-11-12)

                                       C  0_count  1_count  2_count
0  WrappedArray(|2008-11-12|2008-11-12)        2        1        0
1  WrappedArray(|2008-11-12|2008-11-12)        1        1        0
You are confusing row-wise and column-wise operations by trying to do both in one function. Choose one or the other. Column-wise operations are usually more efficient and you can utilize Pandas str methods.
Setup
df = pd.DataFrame({'A': ['WrappedArray(|2008-11-12, |2008-11-12, |2008-10-11)', 'WrappedArray(|2008-11-12, |2008-11-12)'],
                   'B': ['WrappedArray(|2008-11-12,|2008-11-12)', 'WrappedArray(|2008-11-12|2008-11-12)']})
Logic
# perform operations on strings in a series
def calc_length(series):
return series.str.strip('WrappedAway').str.split(',').str.len() - 1
# apply to each column and join to original dataframe
df = df.join(df.apply(calc_length).add_suffix('_Length'))
Result
print(df)
                                                   A  \
0  WrappedArray(|2008-11-12, |2008-11-12, |2008-1...
1            WrappedArray(|2008-11-12, |2008-11-12)

                                        B  A_Length  B_Length
0  WrappedArray(|2008-11-12,|2008-11-12)         2         1
1   WrappedArray(|2008-11-12|2008-11-12)         1         0
I think we can use pandas str.count():
df = pd.DataFrame({
    "col1": ['WrappedArray(|2008-11-12, |2008-11-12)',
             'WrappedArray(|2018-11-12, |2017-11-12, |2018-11-12)'],
    "col2": ['WrappedArray(|2008-11-12, |2008-11-12,|2008-11-12,|2008-11-12)',
             'WrappedArray(|2018-11-12, |2017-11-12, |2018-11-12)']})
df["col1"].str.count(',')
