How can I obtain the top n groups in pandas? - python

I have a pandas dataframe. The final column is the max value of the RelAb column for each unique group (in this case, a species assignment), obtained by:
df_melted['Max'] = df_melted.groupby('Species')['RelAb'].transform('max')
As you can see, the max value is repeated in every row of the group. Each group contains a large number of rows. I have the df sorted by max values, with about 100 rows per max value. My goal is to obtain the top 20 groups based on the max value (i.e. a df with 100 × 20 = 2000 rows). I do not want to drop individual rows from groups, but rather entire groups.
I am pasting a subset of the dataframe where the max for a group changes from one "Max" value to the next:
My feeling is that I need to collapse the max so that a single value represents the entire group, and then sort based on that column, perhaps as such?
For context, the reason I am doing this is because I am planning to make a stacked barchart with the most abundant species in the table for each sample. Right now, there are just way too many species, so it makes the stacked bar chart uninformative.

One way to do it:
# Per-species max of RelAb, then the 20 largest of those maxima (keep='all' retains ties).
aux = (df_melted.groupby('Species')['RelAb']
                .max()
                .nlargest(20, keep='all')
                .to_list())
# Keep every row whose group max is among the top 20 values.
top20 = df_melted.loc[df_melted['Max'].isin(aux), :].copy()
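An equivalent variant filters on the species names themselves rather than on the precomputed 'Max' column (a sketch using the same df_melted; the result should match the code above):

# Names of the species whose maxima are among the 20 largest.
top_species = (df_melted.groupby('Species')['RelAb']
                        .max()
                        .nlargest(20, keep='all')
                        .index)
top20 = df_melted.loc[df_melted['Species'].isin(top_species)].copy()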

Related

Combine two Pandas Data Frames where "Dates" matches

I have two Pandas DataFrames with one column in common, namely "Dates". I need to merge these two where "Dates" correspond. With pd.merge() it does the expected join but removes the non-matching values, and I want to keep those other values too.
Ex: I have historical data for a stock at 1-minute resolution and an indicator calculated on 5-minute data, i.e. for every 5 rows of the 1-minute DataFrame I have one new calculated value.
I suspect the Series.dt.floor method may prove useful here, but I couldn't figure it out.
I concatenated the respective "Dates" to the calculated indicator Series so that I could merge them where the column matches. I obtained a correct result but with missing values. I need continuity of the 1-minute values, i.e. the same indicator must be valid for the next 5 entries, and then it is the second indicator value's turn to be merged.
df1.merge(df2, left_on='Dates', right_on='Dates')
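Since no full answer is given here, a minimal sketch of the Series.dt.floor idea mentioned in the question, assuming 'Dates' is a datetime column in both frames and that each 5-minute indicator is stamped at the start of its interval:

# Align each 1-minute timestamp to the start of its 5-minute bucket,
# then left-merge so every 1-minute row keeps its bucket's indicator.
df1 = df1.copy()
df1['Dates_5min'] = df1['Dates'].dt.floor('5min')
merged = df1.merge(df2.rename(columns={'Dates': 'Dates_5min'}),
                   on='Dates_5min', how='left').drop(columns='Dates_5min')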

Create Matrix from a DataFrame

I have a dataframe with several columns, including Department and ICA. I need to create a matrix where the rows are the departments and the columns are ICA values (they range from bad to acceptable to good).
So position r,c would be a number that shows how many observations of ICA were recorded for each department.
For example, if Amazonas is row 1 and Acceptable is column 3, position (1,3) would be the number of acceptable observations for Amazonas.
Thanks!
You can get values from your DataFrame using integer-based indexing with the DataFrame.iloc method. This seems to do what you need.
For example, if df is your DataFrame, then df.iloc[0, 2] will give you the value at the first row and third column.
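For building the count matrix itself, a minimal sketch using pd.crosstab (the column names 'Department' and 'ICA' and the category labels are assumptions based on the question):

import pandas as pd

# Rows = departments, columns = ICA categories, cells = observation counts.
matrix = pd.crosstab(df['Department'], df['ICA'])
# Optional: enforce the bad -> acceptable -> good column order (labels assumed).
matrix = matrix.reindex(columns=['Bad', 'Acceptable', 'Good'], fill_value=0)
# matrix.loc['Amazonas', 'Acceptable'] then gives the number of acceptable observations for Amazonas.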

How do I divide one column in one df by another column in a different df in pandas?

If I have two dataframes (say df4avg and df5avg) with identical corrected wavelengths and different count rates, and I want to divide the df4avg count rate by df5avg's count rate and get an output of the corrected wavelength and the new divided value with a new column name (say 'ratio'), how would I do this?
If you want to add the ratio column to the df4avg DataFrame, then:
df4avg['ratio'] = df4avg['COUNT_RATE'] / df5avg['COUNT_RATE']
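If instead you want a separate output holding just the corrected wavelength and the ratio, a minimal sketch (the column names 'CORRECTED_WAVELENGTH' and 'COUNT_RATE' are assumptions; note that the division aligns on the index, so both frames should share the same index):

import pandas as pd

# Column names are assumed; adjust them to match the actual data.
ratio_df = pd.DataFrame({
    'CORRECTED_WAVELENGTH': df4avg['CORRECTED_WAVELENGTH'],
    'ratio': df4avg['COUNT_RATE'] / df5avg['COUNT_RATE'],  # element-wise, index-aligned division
})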

Having trouble retrieving max values in a pyspark dataframe

After calculating the average of quantities over a 5-row window for each row in a PySpark dataframe, using a window partitioned over a group of columns:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

prep_df = ...
# 5-row forward-looking window (current row plus the next 4) within each group.
window = Window.partitionBy([F.col(x) for x in group_list]).rowsBetween(Window.currentRow, Window.currentRow + 4)
consecutive_df = prep_df.withColumn('aveg', F.avg(prep_df['quantity']).over(window))
I am trying to group by the same columns and select the maximum of the averaged values, like this:
grouped_consecutive_df = consecutive_df.groupBy(group_column_list).agg(F.max(consecutive_df['aveg']).alias('aveg'))
However, when I debug, I see that the calculated maximum values are wrong. For specific instances, I saw that the retrieved max numbers are not even in the 'aveg' column.
I'd like to ask whether I am taking a wrong approach or missing something trivial. Any comments are appreciated.
I could solve this with a workaround: before aggregating, I mapped the max of the quantity averages onto a new column, and then selected one of the rows in each group.
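A sketch of that workaround, assuming the consecutive_df, 'aveg' column, and group_column_list from above: a second window maps the group max onto every row, and one matching row per group is then kept.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Unordered window over the whole group, so F.max sees every row of the group.
group_window = Window.partitionBy([F.col(x) for x in group_column_list])
result = (consecutive_df
          .withColumn('max_aveg', F.max('aveg').over(group_window))  # group max on every row
          .filter(F.col('aveg') == F.col('max_aveg'))                # rows attaining that max
          .dropDuplicates(group_column_list))                        # keep one row per group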

How can I add a column of one data frame to another based on the nearest identifier?

Problem:
I have a data frame foo that contains measurements and a common_step column, which contains integers indicating when each row was measured.
I have a second data frame that also contains a common_step column and a bar_step column. It translates between the two integer steps.
I would like to add bar_step as a column to foo. However, the common_step values of both data frames are not aligned.
Thus, for each row in foo, I would like to find the row in bar with the nearest common_step and add its bar_step to the foo row.
I have found a way to do this. However, the solution is very slow, because for every row in foo it searches through all rows in bar to find the one with the closest common_step.
foo.sort_values('common_step', inplace=True)
bar.sort_values('common_step', inplace=True)

def find_nearest(foo_row):
    # Index of the bar row whose common_step is closest to this foo row's.
    index = abs(bar.common_step - foo_row.common_step).idxmin()
    return bar.loc[index].bar_step

foo['bar_step'] = foo.apply(find_nearest, axis=1)
Questions:
How can I add the closest match for bar_step to the foo data frame with sub-quadratic run time?
Moreover, it would be ideal to have a flag that chooses the row with the closest but smaller common_step.
As @QuangHoang suggested in the comments, merge_asof does this. Moreover, the second data frame should contain no other columns, so that they do not interfere with existing columns in the first one:
import pandas

# merge_asof requires both frames to be sorted on the merge key.
foo.sort_values('common_step', inplace=True)
bar.sort_values('common_step', inplace=True)
bar = bar[['bar_step', 'common_step']]
foo = pandas.merge_asof(foo, bar, on='common_step', direction='backward')
The direction parameter specifies whether to use the nearest lower match, nearest higher match, or nearest match considering both directions. From the documentation:
A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.
A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.
A “nearest” search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.
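As a tiny illustration of the backward direction (the values below are made up for demonstration):

import pandas as pd

foo = pd.DataFrame({'common_step': [1, 5, 10], 'value': [0.1, 0.2, 0.3]})
bar = pd.DataFrame({'common_step': [0, 4, 9], 'bar_step': [100, 200, 300]})

# Each foo row receives the bar_step from the last bar row whose common_step <= its own:
# common_step 1 -> 100, 5 -> 200, 10 -> 300.
out = pd.merge_asof(foo, bar, on='common_step', direction='backward')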
