Pandas DataFrame get column combined max values - python
I have a pandas DataFrame like following.
df = pd.DataFrame({"A": [3,1,2,4,5,3,4,10], "B": [1,3,2,4,0,0,1,0]})
The values 0 to 10 are recommendation scores (10 is best). Each DataFrame column is a category (A, B, etc.) that the 0-to-10 score relates to. All categories have the same weight, but each row is related to one item.
I want the DataFrame sorted so that items with the highest values combined across both (or more) categories come first. So if a row for an item has a value of 10 in category A but a value of 0 in category B, it should not be the top-rated item. In the example above, the row with the values [4, 4] would be the best choice.
My groupby solution does not give the expected result.
grouped = df.groupby(['A', 'B'])
grouped[["A", "B"]].max().sort(ascending=False)
result:
       A  B
A  B
10 0  10  0
5  0   5  0
4  4   4  4
   1   4  1
3  1   3  1
   0   3  0
2  2   2  2
1  3   1  3
A row-based total sum would also not yield the expected result, since it does not differentiate between categories.
df = pd.DataFrame({"A": [3,1,2,4,5,3,4,10], "B": [1,3,2,4,0,0,1,0]})
Then calculate the rank for each column in the data frame:
rank = df.rank(method = "dense")
rank
Out[44]:
A B
0 3 2
1 1 4
2 2 3
3 4 5
4 5 1
5 3 1
6 4 2
7 6 1
Add a new column to the data frame which is the total rank based on all categories:
df['total_rank'] = rank.sum(axis = 1)
df
Out[46]:
A B total_rank
0 3 1 5
1 1 3 5
2 2 2 5
3 4 4 9
4 5 0 6
5 3 0 4
6 4 1 6
7 10 0 7
And finally, sort your data frame by total rank:
df.sort_values(by='total_rank', ascending=False)
Out[49]:
A B total_rank
3 4 4 9
7 10 0 7
4 5 0 6
6 4 1 6
0 3 1 5
1 1 3 5
2 2 2 5
5 3 0 4
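For reference, here is the whole rank-based approach condensed into one runnable snippet (a sketch, using the same example data as above):

import pandas as pd

df = pd.DataFrame({"A": [3, 1, 2, 4, 5, 3, 4, 10],
                   "B": [1, 3, 2, 4, 0, 0, 1, 0]})

# dense-rank each category separately, then sum the ranks across categories
rank = df.rank(method="dense")
df['total_rank'] = rank.sum(axis=1)

# highest combined rank first; the [4, 4] row comes out on top
print(df.sort_values(by='total_rank', ascending=False))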
How about this:
df['pos'] = df.A / df.A.mean() + df.B / df.B.mean()
df.sort_values(by='pos', ascending=False)
# A B pos
#3 4 4 3.909091
#7 10 0 2.500000
#1 1 3 2.431818
#2 2 2 1.954545
#6 4 1 1.727273
#0 3 1 1.477273
#4 5 0 1.250000
#5 3 0 0.750000
If you have more columns you want to rank, ['A', 'B', 'C', ...], then (with numpy imported as np):
cols = ['A', 'B']  # , 'C', 'D', ... ]
df['pos'] = np.sum([df[col] / df[col].mean() for col in cols], axis=0)
Update
Because 0 is considered a quality value (the lowest), I would amend my answer as follows (not sure it makes a huge difference):
df['pos'] = (df.A + 1) / (df.A.max() + 1) + (df.B + 1) / (df.B.max() + 1)
df.sort_values(by='pos', ascending=False)
# A B pos
#3 4 4 1.454545
#7 10 0 1.200000
#1 1 3 0.981818
#2 2 2 0.872727
#6 4 1 0.854545
#0 3 1 0.763636
#4 5 0 0.745455
#5 3 0 0.563636
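Putting the amended scoring together for an arbitrary list of category columns, a minimal runnable sketch (same example data; only the cols list would change if you had more categories):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [3, 1, 2, 4, 5, 3, 4, 10],
                   "B": [1, 3, 2, 4, 0, 0, 1, 0]})

cols = ['A', 'B']  # add 'C', 'D', ... here if you have more categories

# shift by +1 so a 0 score still contributes, and normalize by each column's maximum
df['pos'] = np.sum([(df[col] + 1) / (df[col].max() + 1) for col in cols], axis=0)
print(df.sort_values(by='pos', ascending=False))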
Related
Create new dataframe with max value from each cell per column out of a set of dataframes
I have code that loops through a list and spits out a dataframe for each iteration of the loop. Each of these dataframes is the same size (number of columns and rows). I want to create a new dataframe that contains the max value for each position in each column across all of the dataframes. In the code below, MRS_response_df are the newly generated dataframes that I need the max values from.

Example:

df_1 =
1 2 3 4
0 2 2 2 2
1 3 1 2 5

df_2 =
1 1 3 4
0 2 3 5 9
1 1 8 3 4

output =
1 2 3 4
0 2 3 5 9
1 3 8 3 5

Code Example:

data = 0
for i in dataset:
    # do stuff
    for i in column:
        print('Max Response Spectrum: ' + str(i) + ' out of ' + str(len(column)))
        y = accel_df[accel_df.columns[i]]
        MRS_frequency, MRS_response = MRS(t, y, Q, fn)
        MRS_frequency_df[MRS_frequency_df.columns[i]] = MRS_frequency
        MRS_response_df[MRS_response_df.columns[i]] = MRS_response
    data += 1
use .where and combine_first():

df1.where(df1 > df2).combine_first(df2).astype(int)

   0  1  2  3
0  1  2  3  4
1  2  3  5  9
2  3  8  3  5
Here's one way:

pd.concat([df1, df2], keys=[1, 2]).max(level=1)

Output:

   0  1  2  3
0  1  2  3  4
1  2  3  5  9
2  3  8  3  5
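Both answers assume df1 and df2 already exist as DataFrames of the same shape. Here is a runnable sketch with the question's example reconstructed as plain 3x4 frames; note that max(level=1) is spelled groupby(level=1).max() in current pandas, and numpy's element-wise maximum is a third option:

import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3, 4], [2, 2, 2, 2], [3, 1, 2, 5]])
df2 = pd.DataFrame([[1, 1, 3, 4], [2, 3, 5, 9], [1, 8, 3, 4]])

# first answer: keep df1 where it is larger, fill the remaining cells from df2
print(df1.where(df1 > df2).combine_first(df2).astype(int))

# second answer, spelled for current pandas
print(pd.concat([df1, df2], keys=[1, 2]).groupby(level=1).max())

# element-wise maximum via numpy; preserves index and column labels
print(np.maximum(df1, df2))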
Append Pandas disjunction of 2 dataframes to first dataframe
Given two pandas tables, each with the three columns id, x, and y (coordinates). Several rows with the same id represent one graph with its x-y values. How would I find paths that do not exist in the first table but do exist in the second, and append them to the first table? The key problem is that the order of the graphs can differ between the two tables.

Example:

df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3], 'x':[1,1,5,4,4,1,1,1], 'y':[1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,4,4,4], 'x':[1,1,1,1,1,5,4,4,10,10,9], 'y':[4,5,6,1,2,4,4,3,1,2,2]})

(df1            intersect df2)   --------->   df1
id x  y         id x  y                       id x  y
1  1  1         1  1  4                       1  1  1
1  1  2         1  1  5                       1  1  2
2  5  4         1  1  6                       2  5  4
2  4  4         2  1  1                       2  4  4
2  4  3         2  1  2                       2  4  3
3  1  4         3  5  4                       3  1  4
3  1  5         3  4  4                       3  1  5
3  1  6         3  4  3                       3  1  6
                4  10 1                       4  10 1
                4  10 2                       4  10 2
                4  9  2                       4  9  2

Should become:

df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3,4,4,4], 'x':[1,1,5,4,4,1,1,1,10,10,9], 'y':[1,2,4,4,3,4,5,6,1,2,2]})

As you can see, up to id = 3, df1 and df2 contain the same graphs, but their order differs from one table to the other; for example, df1's first graph is df2's second graph. df2 also has a fourth path that is not in df1. That fourth path should be detected and appended to df1. In other words, I want to take the intersection of the two tables and append the disjunction of the two to the first table, with the condition that the ids, i.e. the order of the paths, can differ between the tables.
Imports:

import pandas as pd

Set starting DataFrames:

df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3], 'x':[1,1,5,4,4,1,1,1], 'y':[1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,4,4,4], 'x':[1,1,1,1,1,5,4,4,10,10,9], 'y':[4,5,6,1,2,4,4,3,1,2,2]})

Outer merge:

df_merged = df1.merge(df2, on=['x', 'y'], how='outer')

produces:

    id_x   x  y  id_y
0    1.0   1  1     2
1    1.0   1  2     2
2    2.0   5  4     3
3    2.0   4  4     3
4    2.0   4  3     3
5    3.0   1  4     1
6    3.0   1  5     1
7    3.0   1  6     1
8    NaN  10  1     4
9    NaN  10  2     4
10   NaN   9  2     4

Note: id_x becomes float because the unmatched rows hold NaN, and an integer column cannot store NaN, so pandas upcasts it.

Fill NaN:

df_merged.id_x = df_merged.id_x.fillna(df_merged.id_y).astype('int')

produces:

    id_x   x  y  id_y
0      1   1  1     2
1      1   1  2     2
2      2   5  4     3
3      2   4  4     3
4      2   4  3     3
5      3   1  4     1
6      3   1  5     1
7      3   1  6     1
8      4  10  1     4
9      4  10  2     4
10     4   9  2     4

Drop id_y:

df_merged = df_merged.drop(['id_y'], axis=1)

produces:

    id_x   x  y
0      1   1  1
1      1   1  2
2      2   5  4
3      2   4  4
4      2   4  3
5      3   1  4
6      3   1  5
7      3   1  6
8      4  10  1
9      4  10  2
10     4   9  2

Rename id_x to id:

df_merged = df_merged.rename(columns={'id_x': 'id'})

produces:

    id   x  y
0    1   1  1
1    1   1  2
2    2   5  4
3    2   4  4
4    2   4  3
5    3   1  4
6    3   1  5
7    3   1  6
8    4  10  1
9    4  10  2
10   4   9  2

The final program is four lines of code (after the imports and DataFrame definitions):

import pandas as pd

df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3], 'x':[1,1,5,4,4,1,1,1], 'y':[1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,4,4,4], 'x':[1,1,1,1,1,5,4,4,10,10,9], 'y':[4,5,6,1,2,4,4,3,1,2,2]})

df_merged = df1.merge(df2, on=['x', 'y'], how='outer')
df_merged.id_x = df_merged.id_x.fillna(df_merged.id_y).astype('int')
df_merged = df_merged.drop(['id_y'], axis=1)
df_merged = df_merged.rename(columns={'id_x': 'id'})

Please remember to put a check next to the selected answer.
Mauritius, try this code:

df1 = pd.DataFrame({'id':[1,1,2,2,2,3,3,3], 'x':[1,1,5,4,4,1,1,1], 'y':[1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,4,4,4,5], 'x':[1,1,1,1,1,5,4,4,10,10,9,1], 'y':[4,5,6,1,2,4,4,3,1,2,2,2]})

df1_s = [{(x, y) for x, y in df1[['x','y']][df1.id==i].values} for i in df1.id.unique()]

def f(df2):
    data = {(x, y) for x, y in df2[['x','y']].values}
    if data not in df1_s:
        return True
    else:
        return False

check = df2.groupby('id').apply(f).apply(pd.Series)
ids = check[check[0]].index.values
df2 = df2.set_index('id').loc[ids].reset_index()
df1 = df1.append(df2)

OUT:

   id   x  y
0   1   1  1
1   1   1  2
2   2   5  4
3   2   4  4
4   2   4  3
5   3   1  4
6   3   1  5
7   3   1  6
0   4  10  1
1   4  10  2
2   4   9  2
3   5   1  2

I think it can be done in a simpler and more Pythonic way, but I have thought about it a lot and still don't know how =). Also, you should check that the ids in df2 do not collide with those already in df1 before appending one df to the other at the end; I might add this later. Does this code do what you want?
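A more compact variant of the same set-based idea, offered as a sketch rather than the answerer's code; it describes each path by the set of its (x, y) points and uses pd.concat, since DataFrame.append has been removed from recent pandas:

import pandas as pd

df1 = pd.DataFrame({'id': [1,1,2,2,2,3,3,3],
                    'x':  [1,1,5,4,4,1,1,1],
                    'y':  [1,2,4,4,3,4,5,6]})
df2 = pd.DataFrame({'id': [1,1,1,2,2,3,3,3,4,4,4],
                    'x':  [1,1,1,1,1,5,4,4,10,10,9],
                    'y':  [4,5,6,1,2,4,4,3,1,2,2]})

# every path in df1 as a frozenset of its (x, y) points, so order and id labels do not matter
paths1 = {frozenset(zip(g.x, g.y)) for _, g in df1.groupby('id')}

# ids of df2 paths that df1 does not already contain
new_ids = [i for i, g in df2.groupby('id')
           if frozenset(zip(g.x, g.y)) not in paths1]

# append only those paths to df1
df1 = pd.concat([df1, df2[df2.id.isin(new_ids)]], ignore_index=True)
print(df1)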
Is it possible to obtain groupby style counts without collapsing Pandas DataFrame?
I have a DataFrame with 9 columns, and I'm trying to add a column of counts of unique values based on the first 3 columns (e.g. columns A, B, and C must match to count as a unique value, but the remaining columns can vary). I attempted to do this with groupby:

df = pd.DataFrame(resultsFile500.groupby(['chr','start','end']).size().reset_index().rename(columns={0:'count'}))

This returns a DataFrame with 5 columns, and the counts are what I want. However, I also need values from the original data frame, so what I have been trying to do is somehow get those counts as a column in the original df. This would mean that if two rows had identical values in the columns chr, start, and end, the counts column would be 2 in both rows, but they would not be collapsed to one row. Is there an easy solution here that I'm missing, or do I need to hack something together?
You can use .transform to get non-collapsing behavior:

>>> df
   a  b  c  d  e
0  3  4  1  3  0
1  3  1  4  3  0
2  4  3  3  2  1
3  3  4  1  4  0
4  0  4  3  3  2
5  1  2  0  4  1
6  3  1  4  2  1
7  0  4  3  4  0
8  1  3  0  1  1
9  3  4  1  2  1

>>> df.groupby(['a','b','c']).transform('count')
   d  e
0  3  3
1  2  2
2  1  1
3  3  3
4  2  2
5  1  1
6  2  2
7  2  2
8  1  1
9  3  3

Note, I'll have to choose an arbitrary column from the .transform result, but then just do:

>>> df['unique_count'] = df.groupby(['a','b','c']).transform('count')['d']
>>> df
   a  b  c  d  e  unique_count
0  3  4  1  3  0             3
1  3  1  4  3  0             2
2  4  3  3  2  1             1
3  3  4  1  4  0             3
4  0  4  3  3  2             2
5  1  2  0  4  1             1
6  3  1  4  2  1             2
7  0  4  3  4  0             2
8  1  3  0  1  1             1
9  3  4  1  2  1             3
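The same idea without having to pick an arbitrary column afterwards: select a single column before calling .transform. A runnable sketch built from the answer's example data:

import pandas as pd

df = pd.DataFrame({'a': [3, 3, 4, 3, 0, 1, 3, 0, 1, 3],
                   'b': [4, 1, 3, 4, 4, 2, 1, 4, 3, 4],
                   'c': [1, 4, 3, 1, 3, 0, 4, 3, 0, 1],
                   'd': [3, 3, 2, 4, 3, 4, 2, 4, 1, 2],
                   'e': [0, 0, 1, 0, 2, 1, 1, 0, 1, 1]})

# count the rows in each (a, b, c) group and broadcast the count back onto every row
df['unique_count'] = df.groupby(['a', 'b', 'c'])['d'].transform('count')
print(df)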
Python, pandas, cumulative sum in new column on matching groups
If I have these columns in a dataframe:

a    b
1    5
1    7
2    3
1,2  3
2    5

How do I create column c, where column b is summed using groupings of column a (a string), while keeping the existing dataframe? Some rows can belong to more than one group.

a    b  c
1    5  15
1    7  15
2    3  11
1,2  3  26
2    5  11

Is there an easy and efficient solution, as the dataframe I have is very large?
You first need to split column a and join it back to the original DataFrame:

print (df.a.str.split(',', expand=True)
           .stack()
           .reset_index(level=1, drop=True)
           .rename('a'))

0    1
1    1
2    2
3    1
3    2
4    2
Name: a, dtype: object

df1 = (df.drop('a', axis=1)
         .join(df.a.str.split(',', expand=True)
                 .stack()
                 .reset_index(level=1, drop=True)
                 .rename('a')))

print (df1)

   b  a
0  5  1
1  7  1
2  3  2
3  3  1
3  3  2
4  5  2

Then use transform for a sum without aggregation:

df1['c'] = df1.groupby(['a'])['b'].transform(sum)
# cast to str so the ','.join aggregation below works
df1['a'] = df1.a.astype(str)
print (df1)

   b  a   c
0  5  1  15
1  7  1  15
2  3  2  11
3  3  1  15
3  3  2  11
4  5  2  11

Finally, group by the index and aggregate the columns with agg:

print (df1.groupby(level=0)
          .agg({'a': ','.join, 'b': 'first', 'c': sum})
          [['a','b','c']])

     a  b   c
0    1  5  15
1    1  7  15
2    2  3  11
3  1,2  3  26
4    2  5  11
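On pandas 0.25 or later, Series.explode offers a shorter route to the same result; a sketch with the same data (column a holds strings such as '1,2'):

import pandas as pd

df = pd.DataFrame({'a': ['1', '1', '2', '1,2', '2'],
                   'b': [5, 7, 3, 3, 5]})

# one row per (original row, group): '1,2' becomes two rows that share index 3
exploded = df.assign(a=df.a.str.split(',')).explode('a')

# sum b per group and broadcast it back onto the exploded rows
group_sum = exploded.groupby('a')['b'].transform('sum')

# collapse back to the original rows by summing over the duplicated index
df['c'] = group_sum.groupby(level=0).sum()
print(df)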
Sort pandas DataFrame by multiple columns and duplicated index
I have a pandas DataFrame with duplicated indices. There are 3 rows with each index, and they correspond to a group of items. There are two columns, a and b.

df = pandas.DataFrame([{'i': b % 4, 'a': abs(b - 6), 'b': b} for b in range(12)]).set_index('i')

I want to sort the DataFrame so that:

1. All of the rows with the same index are adjacent (all of the groups are together).
2. The groups are in reverse (descending) order by the lowest value of a within the group. For example, in the df above, the first three rows should be the ones with index 0, because the lowest a value for those three rows is 2, and every other group has at least one row with an a value lower than 2. The next three rows could be either group 3 or group 1, because the lowest a value in both of those groups is 1. The last group should be group 2, because it has a row with an a value of 0.
3. Within each group, the rows are sorted in ascending order by b.

Desired output:

   a   b
i
0  6   0
0  2   4
0  2   8
3  3   3
3  1   7
3  5  11
1  5   1
1  1   5
1  3   9
2  4   2
2  0   6
2  4  10

I've been trying something like:

df.groupby('i')[['a']].transform(min).sort(['a', 'b'], ascending=[0, 1])

But it gives me a KeyError, and it only gets that far if I make i a column instead of the index anyway.
The most straightforward way I see is moving your index to a column and calculating a new column with the group minimum.

In [43]: df = df.reset_index()

In [45]: df['group_min'] = df.groupby('i')['a'].transform('min')

Then you can sort by your conditions:

In [49]: df.sort_values(['group_min', 'i', 'b'], ascending=[False, False, True])
Out[49]:
    i  a   b  group_min
0   0  6   0          2
4   0  2   4          2
8   0  2   8          2
3   3  3   3          1
7   3  1   7          1
11  3  5  11          1
1   1  5   1          1
5   1  1   5          1
9   1  3   9          1
2   2  4   2          0
6   2  0   6          0
10  2  4  10          0

To get back to your desired frame, drop the tracking variable and set i back as the index:

In [50]: df.sort_values(['group_min', 'i', 'b'], ascending=[False, False, True]).drop('group_min', axis=1).set_index('i')
Out[50]:
   a   b
i
0  6   0
0  2   4
0  2   8
3  3   3
3  1   7
3  5  11
1  5   1
1  1   5
1  3   9
2  4   2
2  0   6
2  4  10
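The same ordering can also be produced without reset_index: since pandas 0.23, sort_values accepts index level names alongside column names, so a sketch (not the answerer's code) could look like this:

import pandas as pd

df = pd.DataFrame([{'i': b % 4, 'a': abs(b - 6), 'b': b} for b in range(12)]).set_index('i')

# broadcast each group's minimum 'a' onto its rows, sort, then drop the helper column
out = (df.assign(group_min=df.groupby(level='i')['a'].transform('min'))
         .sort_values(['group_min', 'i', 'b'], ascending=[False, False, True])
         .drop(columns='group_min'))
print(out)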
You can first sort by a in descending order and then sort your index:

>>> df.sort_values(['a', 'b'], ascending=[False, True]).sort_index()
   a   b
i
0  6   0
0  2   4
0  2   8
1  5   1
1  3   9
1  1   5
2  4   2
2  4  10
2  0   6
3  5  11
3  3   3
3  1   7