How to calculate totals of all possible combinations of columns - python
I have the following df:
df = pd.DataFrame({'a': [1,2,3,4,2], 'b': [3,4,1,0,4], 'c':[1,2,3,1,0], 'd':[3,2,4,1,4]})
I want to generate a combination of totals from these 4 columns, which equals 4 x 3 x 2 = 24 total combinations minus duplicates. I want the results in the same df.
I want something that looks like this (partial results shown):
A combo of a_b is the same as b_a, and therefore I wouldn't want such a calculation since it's a duplicate.
Is there a way to calculate all combinations and exclude duplicate totals?
import itertools as it
orig_cols = df.columns
for r in range(2, df.shape[1] + 1):
for cols in it.combinations(orig_cols, r):
df["_".join(cols)] = df.loc[:, cols].sum(axis=1)
Needs some looping, but over the combinations rather than the dataframe itself. We take the 2-, 3-, ..., N-element combinations of the column names, where N is the number of columns, and form each new _-joined column as the corresponding row sum.
In [11]: df
Out[11]:
a b c d a_b a_c a_d b_c b_d c_d a_b_c a_b_d a_c_d b_c_d a_b_c_d
0 1 3 1 3 4 2 4 4 6 4 5 7 5 7 8
1 2 4 2 2 6 4 4 6 6 4 8 8 6 8 10
2 3 1 3 4 4 6 7 4 5 7 7 8 10 8 11
3 4 0 1 1 4 5 5 1 1 2 5 5 6 2 6
4 2 4 0 4 6 2 6 4 8 4 6 10 6 8 10
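One point worth checking: itertools.combinations already yields unordered selections, so duplicate orderings such as b_a never appear, and the number of new columns is C(4,2) + C(4,3) + C(4,4) = 6 + 4 + 1 = 11 rather than 24. A self-contained sketch of the loop above (same data as the question, using a plain list for the column selection):

```python
import itertools as it
import math

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 2], 'b': [3, 4, 1, 0, 4],
                   'c': [1, 2, 3, 1, 0], 'd': [3, 2, 4, 1, 4]})
orig_cols = df.columns

# One summed column per unordered combination of 2..N original columns.
for r in range(2, df.shape[1] + 1):
    for cols in it.combinations(orig_cols, r):
        df["_".join(cols)] = df[list(cols)].sum(axis=1)

# combinations() is order-free, so b_a never duplicates a_b:
# C(4,2) + C(4,3) + C(4,4) = 11 new columns on top of the original 4.
n_new = sum(math.comb(4, r) for r in range(2, 5))
assert len(df.columns) == 4 + n_new
```

Capturing `orig_cols` before the loop matters: iterating `df.columns` directly would pick up the freshly added sum columns as well.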
Related
Pandas: Get rows with consecutive column values
I need to go through a large DataFrame and select consecutive rows with similar values in a column, i.e. in the frame below, selecting on column x. I want to specify consecutive values in column x, say consecutive values of 3 and 5 only:

col row x y
1   1   1 1
5   7   3 0
2   2   2 2
6   3   3 8
9   2   3 4
5   3   3 9
4   9   4 4
5   5   5 1
3   7   5 2
6   6   6 6
5   8   6 2
3   7   6 0

The results output would be:

col row x y consecutive-count
6   3   3 8 1
9   2   3 4 1
5   3   3 9 1
5   5   5 1 2
3   7   5 2 2

I tried:

m = df['x'].eq(df['x'].shift())
df[m|m.shift(-1, fill_value=False)]

But that includes the consecutive 6 that I don't want. I also tried:

df.query('x in [3,5]')

That prints every row where x has 3 or 5.
IIUC, use masks for boolean indexing. Check for 3 or 5, and use a cummax and a reverse cummax to enforce the order:

m1 = df['x'].eq(3)
m2 = df['x'].eq(5)
out = df[(m1|m2)&(m1.cummax()&m2[::-1].cummax())]

Output:

   col  row  x  y
2    6    3  3  8
3    9    2  3  4
4    5    3  3  9
6    5    5  5  1
7    3    7  5  2
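A runnable sketch of this mask logic, on a hypothetical nine-row sample whose x column matches the printed output above. The key trick: m1.cummax() is True from the first 3 onward, and m2[::-1].cummax(), realigned by index, is True up to the last 5, so their AND brackets the span between them:

```python
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 6, 9, 5, 4, 5, 3, 6],
                   'row': [1, 2, 3, 2, 3, 9, 5, 7, 6],
                   'x':   [1, 2, 3, 3, 3, 4, 5, 5, 6],
                   'y':   [1, 2, 8, 4, 9, 4, 1, 2, 6]})

m1 = df['x'].eq(3)
m2 = df['x'].eq(5)
# m1.cummax(): True from the first 3 onward.
# m2[::-1].cummax(): True up to the last 5 (the & realigns it by index).
out = df[(m1 | m2) & (m1.cummax() & m2[::-1].cummax())]
assert list(out.index) == [2, 3, 4, 6, 7]
```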
You can create a group column for consecutive values, and filter by the group count and the value of x:

# create unique ids for consecutive groups, then get group length:
group_num = (df.x.shift() != df.x).cumsum()
group_len = group_num.groupby(group_num).transform("count")

# filter main df:
df2 = df[(df.x.isin([3,5])) & (group_len > 1)]

# add new group num col
df2['consecutive-count'] = (df2.x != df2.x.shift()).cumsum()

output:

   col  row  x  y  consecutive-count
3    6    3  3  8                  1
4    9    2  3  4                  1
5    5    3  3  9                  1
7    5    5  5  1                  2
8    3    7  5  2                  2
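The same run-labelling approach as a self-contained sketch, on a hypothetical twelve-row sample matching the indices in the output above. The added .copy() avoids pandas' SettingWithCopyWarning when the new column is assigned to the filtered slice:

```python
import pandas as pd

df = pd.DataFrame({'col': [1, 5, 2, 6, 9, 5, 4, 5, 3, 6, 5, 3],
                   'row': [1, 7, 2, 3, 2, 3, 9, 5, 7, 6, 8, 7],
                   'x':   [1, 3, 2, 3, 3, 3, 4, 5, 5, 6, 6, 6],
                   'y':   [1, 0, 2, 8, 4, 9, 4, 1, 2, 6, 2, 0]})

# Label each run of equal consecutive x values, then measure each run.
group_num = (df.x.shift() != df.x).cumsum()
group_len = group_num.groupby(group_num).transform("count")

# Keep only runs of 3s and 5s longer than one row;
# .copy() silences SettingWithCopyWarning on the next assignment.
df2 = df[(df.x.isin([3, 5])) & (group_len > 1)].copy()
df2['consecutive-count'] = (df2.x != df2.x.shift()).cumsum()
```

Note how the single 3 at index 1 is dropped by the group_len > 1 condition, which is exactly what distinguishes this from a plain isin filter.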
Pandas: get rows with consecutive column values and add a counter column
I need to go through a large DataFrame and select consecutive rows with similar values in a column, i.e. in the frame below, selecting on column x:

col row x y
1   1   1 1
2   2   2 2
6   3   3 8
9   2   3 4
5   3   3 9
4   9   4 4
5   5   5 1
3   7   5 2
6   6   6 6

The results output would be:

col row x y
6   3   3 8
9   2   3 4
5   3   3 9
5   5   5 1
3   7   5 2

Not sure how to do this.
IIUC, use boolean indexing with a mask of the consecutive values:

m = df['x'].eq(df['x'].shift())
df[m|m.shift(-1, fill_value=False)]

Output:

   col  row  x  y
2    6    3  3  8
3    9    2  3  4
4    5    3  3  9
6    5    5  5  1
7    3    7  5  2
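A self-contained sketch of the shift-and-mask trick on the same hypothetical sample. Unlike the cummax variant above, this keeps every consecutive run regardless of its value:

```python
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 6, 9, 5, 4, 5, 3, 6],
                   'row': [1, 2, 3, 2, 3, 9, 5, 7, 6],
                   'x':   [1, 2, 3, 3, 3, 4, 5, 5, 6],
                   'y':   [1, 2, 8, 4, 9, 4, 1, 2, 6]})

# m flags rows whose x equals the previous row's x; OR-ing with m
# shifted back one step also recovers the first row of each run.
m = df['x'].eq(df['x'].shift())
out = df[m | m.shift(-1, fill_value=False)]
assert list(out.index) == [2, 3, 4, 6, 7]
```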
Sort a subset of columns of a pandas dataframe alphabetically by column name
I'm having trouble finding the solution to a fairly simple problem. I would like to alphabetically arrange certain columns of a pandas dataframe that has over 100 columns (i.e. so many that I don't want to list them manually).

Example df:

import pandas as pd
subject = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
c = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
d = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
a = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
b = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'subject':subject, 'timepoint':timepoint, 'c':c, 'd':d, 'a':a, 'b':b})

df.head()
   subject  timepoint  c  d  a  b
0        1          1  2  2  2  2
1        1          2  3  3  3  3
2        1          3  4  4  4  4
3        1          4  5  5  5  5
4        1          5  6  6  6  6

How could I rearrange the column names to generate a df.head() that looks like this:

   subject  timepoint  a  b  c  d
0        1          1  2  2  2  2
1        1          2  3  3  3  3
2        1          3  4  4  4  4
3        1          4  5  5  5  5
4        1          5  6  6  6  6

i.e. keep the first two columns where they are and then alphabetically arrange the remaining column names. Thanks in advance.
You can split your dataframe based on column names using the normal indexing operator [], sort the other columns alphabetically with sort_index(axis=1), and concat back together:

>>> pd.concat([df[['subject','timepoint']],
...            df[df.columns.difference(['subject', 'timepoint'])].sort_index(axis=1)],
...           ignore_index=False, axis=1)
    subject  timepoint  a  b  c  d
0         1          1  2  2  2  2
1         1          2  3  3  3  3
2         1          3  4  4  4  4
3         1          4  5  5  5  5
4         1          5  6  6  6  6
5         1          6  7  7  7  7
6         2          1  3  3  3  3
7         2          2  4  4  4  4
8         2          3  1  1  1  1
9         2          4  2  2  2  2
10        2          5  3  3  3  3
11        2          6  4  4  4  4
12        3          1  5  5  5  5
13        3          2  4  4  4  4
14        3          4  5  5  5  5
15        4          1  8  8  8  8
16        4          2  4  4  4  4
17        4          3  5  5  5  5
18        4          4  6  6  6  6
19        4          5  2  2  2  2
20        4          6  3  3  3  3
Specify the first two columns you want to keep (or determine them from the data), then sort all of the other columns. Use .loc with the combined list to "sort" the DataFrame:

import numpy as np

first_cols = ['subject', 'timepoint']
#first_cols = df.columns[0:2].tolist()  # OR determine the first two from the data

other_cols = np.sort(df.columns.difference(first_cols)).tolist()
df = df.loc[:, first_cols + other_cols]

print(df.head())
   subject  timepoint  a  b  c  d
0        1          1  2  2  2  2
1        1          2  3  3  3  3
2        1          3  4  4  4  4
3        1          4  5  5  5  5
4        1          5  6  6  6  6
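As a side note, Index.difference already returns its result in sorted order by default, so the explicit np.sort is not strictly needed. A minimal sketch assuming the same first two columns, on a shortened version of the example frame:

```python
import pandas as pd

df = pd.DataFrame({'subject': [1, 1, 2], 'timepoint': [1, 2, 1],
                   'c': [2, 3, 3], 'd': [2, 3, 3],
                   'a': [2, 3, 3], 'b': [2, 3, 3]})

first_cols = ['subject', 'timepoint']
# Index.difference sorts its result by default, so the remaining
# column names already come back in alphabetical order.
other_cols = df.columns.difference(first_cols).tolist()
df = df.loc[:, first_cols + other_cols]
assert df.columns.tolist() == ['subject', 'timepoint', 'a', 'b', 'c', 'd']
```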
You can try getting the dataframe columns as a list, rearranging them, and assigning the list back with df = df[cols]:

import pandas as pd

subject = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
c = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
d = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
a = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
b = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'subject':subject, 'timepoint':timepoint, 'c':c, 'd':d, 'a':a, 'b':b})

cols = df.columns.tolist()
cols = cols[:2] + sorted(cols[2:])
df = df[cols]
Python Pandas keep maximum 3 consecutive duplicates
I have this table:

import pandas as pd
list1 = [1,1,2,2,3,3,3,3,4,1,1,1,1,2,2]
df = pd.DataFrame(list1)
df.columns = ['A']

I want to keep at most 3 consecutive duplicates, or keep all rows when there are fewer than 3 (or no) duplicates. The result should look like this:

list2 = [1,1,2,2,3,3,3,4,1,1,1,2,2]
result = pd.DataFrame(list2)
result.columns = ['A']
Use GroupBy.head with a consecutive-run Series, created by comparing shifted values for inequality and taking the cumulative sum with Series.cumsum:

df1 = df.groupby(df.A.ne(df.A.shift()).cumsum()).head(3)
print (df1)
    A
0   1
1   1
2   2
3   2
4   3
5   3
6   3
8   4
9   1
10  1
11  1
13  2
14  2

Detail:

print (df.A.ne(df.A.shift()).cumsum())
0     1
1     1
2     2
3     2
4     3
5     3
6     3
7     3
8     4
9     5
10    5
11    5
12    5
13    6
14    6
Name: A, dtype: int32
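The run-labelling step can be seen in isolation in this self-contained sketch of the GroupBy.head approach, using the question's data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3, 3, 3, 4, 1, 1, 1, 1, 2, 2]})

# Each run of equal consecutive values gets its own id: the comparison
# with the shifted series is True at every run boundary, and cumsum
# turns those boundary flags into increasing run labels.
run_id = df.A.ne(df.A.shift()).cumsum()

# head(3) then keeps at most the first 3 rows of each run.
df1 = df.groupby(run_id).head(3)
assert df1['A'].tolist() == [1, 1, 2, 2, 3, 3, 3, 4, 1, 1, 1, 2, 2]
```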
Let us do:

df[df.groupby(df['A'].diff().ne(0).cumsum())['A'].cumcount() < 3]

    A
0   1
1   1
2   2
3   2
4   3
5   3
6   3
8   4
9   1
10  1
11  1
13  2
14  2
Solving with itertools.groupby, which groups only consecutive duplicates, then slicing 3 elements per group:

import itertools
pd.Series(itertools.chain.from_iterable([*g][:3] for i, g in itertools.groupby(df['A'])))
0     1
1     1
2     2
3     2
4     3
5     3
6     3
7     4
8     1
9     1
10    1
11    2
12    2
dtype: int64
Pandas - Merge multiple columns and sum
I have a main df like so:

index  A  B  C
5      1  5  8
6      2  4  1
7      8  3  4
8      3  9  5

and an auxiliary df2 that I want to add to the main df like so:

index  A  B
5      4  2
6      4  3
7      7  1
8      6  2

Columns A & B have the same names; however, the main df contains many columns that the secondary df2 does not. I want to sum the columns that are common and leave the others as is.

Output:

index  A   B  C
5      5   7  8
6      6   7  1
7      15  4  4
8      9  11  5

I have tried variations of df.join, pd.merge and groupby, but am having no luck at the moment.

Last attempt:

df.groupby('index').sum().add(df2.groupby('index').sum())

But this does not keep the common columns. With pd.merge I am getting _x and _y suffixes.
Use add only with the matching columns, obtained by intersection:

c = df.columns.intersection(df2.columns)
df[c] = df[c].add(df2[c], fill_value=0)
print (df)
        A   B  C
index
5       5   7  8
6       6   7  1
7      15   4  4
8       9  11  5

If you use add alone, the integer columns that are not matched are converted to floats:

df = df.add(df2, fill_value=0)
print (df)
        A   B    C
index
5       5   7  8.0
6       6   7  1.0
7      15   4  4.0
8       9  11  5.0

EDIT: If string columns are possible among the common columns:

print (df)
       A  B  C  D
index
5      1  5  8  a
6      2  4  1  e
7      8  3  4  r
8      3  9  5  w

print (df2)
       A  B  D
index
5      4  2  a
6      4  3  e
7      7  1  r
8      6  2  w

The solution is similar; just filter for the numeric columns first with select_dtypes:

import numpy as np

c = df.select_dtypes(np.number).columns.intersection(df2.select_dtypes(np.number).columns)
df[c] = df[c].add(df2[c], fill_value=0)
print (df)
        A   B  C  D
index
5       5   7  8  a
6       6   7  1  e
7      15   4  4  r
8       9  11  5  w
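A self-contained sketch of the intersection-based add, reconstructing the question's two frames (the index values are taken from the sample shown there):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 8, 3], 'B': [5, 4, 3, 9], 'C': [8, 1, 4, 5]},
                  index=pd.Index([5, 6, 7, 8], name='index'))
df2 = pd.DataFrame({'A': [4, 4, 7, 6], 'B': [2, 3, 1, 2]},
                   index=pd.Index([5, 6, 7, 8], name='index'))

# Add only the columns both frames share; C is left untouched and
# keeps its integer dtype, since it never goes through the add.
c = df.columns.intersection(df2.columns)
df[c] = df[c].add(df2[c], fill_value=0)
assert df['A'].tolist() == [5, 6, 15, 9]
assert df['C'].tolist() == [8, 1, 4, 5]
```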
Not the cleanest way, but it might work:

df_new = pd.DataFrame()
df_new['A'] = df['A'] + df2['A']
df_new['B'] = df['B'] + df2['B']
df_new['C'] = df['C']
print(df_new)
   A   B  C
0  5   7  8
1  6   7  1
2  15  4  4
3  9  11  5