Find the difference of max and min in pandas by groupby? - python

If I have a data frame, as below, of three years of rainfall (2015-2017) for three stations, could you help me find the difference between the maximum and minimum for every station?

The code below uses groupby() with axis=1 to get min() and max() for each row (note that axis=1 in groupby() is deprecated in newer pandas versions). The results are then combined using .merge():
Option-1:
Using the non-repeating names in the column 'Name'
# Import libraries
import pandas as pd
# Create DataFrame
df = pd.DataFrame({
    'Name': ['Baghdad', 'Basra', 'Mousl'],
    'R2015': [300, 190, 350],
    'R2016': [240, 180, 540],
    'R2017': [290, 160, 490]
})
# Convert column to index
df = df.set_index('Name')
# Get min and max
df_min = df.groupby(['min']*df.shape[1],axis=1).min()
df_max = df.groupby(['max']*df.shape[1],axis=1).max()
# Combine
df_min_max = df_min.merge(df_max, on='Name')
# Get difference
df_min_max['diff'] = abs(df_min_max['min'] - df_min_max['max'])
# Output
df_min_max
Option-2:
If names in the column Name repeat, then the code below should work. Here, Baghdad is added as an additional repeating row, and a groupby() of a groupby() is used.
# Import libraries
import pandas as pd
# Create DataFrame
df = pd.DataFrame({
    'Name': ['Baghdad', 'Basra', 'Mousl', 'Baghdad'],
    'R2015': [300, 190, 350, 780],
    'R2016': [240, 180, 540, 455],
    'R2017': [290, 160, 490, 23]
})
# Convert column to index
df = df.set_index('Name')
# Get min and max
df_min = df.groupby(['min']*df.shape[1],axis=1).min().groupby(['Name']).min()
df_max = df.groupby(['max']*df.shape[1],axis=1).max().groupby(['Name']).max()
# Combine
df_min_max = df_min.merge(df_max, on='Name')
# Get difference
df_min_max['diff'] = abs(df_min_max['min'] - df_min_max['max'])
# Output
df_min_max
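An alternative that avoids the deprecated axis=1 groupby is to reshape to long form with melt() and aggregate once per station; a minimal sketch reusing the Option-2 data (with Baghdad repeating):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Baghdad', 'Basra', 'Mousl', 'Baghdad'],
    'R2015': [300, 190, 350, 780],
    'R2016': [240, 180, 540, 455],
    'R2017': [290, 160, 490, 23]
})

# Reshape to one (Name, year, rain) row per measurement,
# then aggregate each station's values in a single groupby
rain = df.melt(id_vars='Name', value_name='rain').groupby('Name')['rain']
result = (rain.max() - rain.min()).rename('diff')
print(result)
```

This handles both the unique-name and repeated-name cases with the same code path.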

Related

Finding .mean() of all columns in python using loop

I have the following dataframe:
Dataframe
Now I want to find the average of every column and create a new dataframe with the result.
My only solution has been:
#convert all rows to mean of values in column
df_find_mean['Germany'] = (df_find_mean["Germany"].mean())
df_find_mean['Turkey'] = (df_find_mean["Turkey"].mean())
df_find_mean['USA_NJ'] = (df_find_mean["USA_NJ"].mean())
df_find_mean['USA_TX'] = (df_find_mean["USA_TX"].mean())
df_find_mean['France'] = (df_find_mean["France"].mean())
df_find_mean['Sweden'] = (df_find_mean["Sweden"].mean())
df_find_mean['Italy'] = (df_find_mean["Italy"].mean())
df_find_mean['SouthAfrica'] = (df_find_mean["SouthAfrica"].mean())
df_find_mean['Taiwan'] = (df_find_mean["Taiwan"].mean())
df_find_mean['Hungary'] = (df_find_mean["Hungary"].mean())
df_find_mean['Portugal'] = (df_find_mean["Portugal"].mean())
df_find_mean['Croatia'] = (df_find_mean["Croatia"].mean())
df_find_mean['Albania'] = (df_find_mean["Albania"].mean())
df_find_mean['England'] = (df_find_mean["England"].mean())
df_find_mean['Switzerland'] = (df_find_mean["Switzerland"].mean())
df_find_mean['Denmark'] = (df_find_mean["Denmark"].mean())
#Remove all rows except first
df_find_mean = df_find_mean.loc[[0]]
#Verify data
display(df_find_mean)
Which works, but is not very elegant.
Is there some way to iterate over each column and construct a new dataframe as the average (.mean()) of that column?
Expected output:
Dataframe with average of columns from previous dataframes
Use DataFrame.mean, then convert the resulting Series to a one-row DataFrame with Series.to_frame and transpose:
df = df_find_mean.mean().to_frame().T
display(df)
Just use DataFrame.mean() to compute the mean of all your columns: compute the per-column means with df_find_mean.mean(), then wrap the resulting Series in pd.DataFrame([...]) to get a one-row frame:
means = df_find_mean.mean()
df_mean = pd.DataFrame([means])
display(df_mean)
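Both answers produce the same one-row frame; a minimal self-contained sketch with a hypothetical two-column stand-in for df_find_mean:

```python
import pandas as pd

# Hypothetical stand-in for df_find_mean
df_find_mean = pd.DataFrame({'Germany': [1.0, 3.0], 'Turkey': [2.0, 4.0]})

# One row holding the per-column means
row_of_means = df_find_mean.mean().to_frame().T
print(row_of_means)
```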

Find values in a Pandas dataframe and insert the data in a column of another Pandas dataframe

I have a dataframe in which I need to convert the rows of the Custom Field column into columns of a second dataframe. This part I have managed to do, and it works fine.
The problem is that I need to add the corresponding values from the id column to the respective columns of the second dataframe.
Here is an example:
This is the first dataframe:
This is the second dataframe, with the columns already converted.
But I would like to add the values corresponding to the id column of the first dataframe to the second dataframe:
Attached is the code:
import pandas as pd
data = {
    "Custom Field": ["CF1", "CF2", "CF3"],
    "id": [50, 40, 45],
    "Name": ["Wilson", "Junior", "Otavio"]
}
### create the dataframe ###
df = pd.DataFrame(data)
print(df)
### add new columns from a list ###
columns_list = []
for x in df['Custom Field']:
    ### create multiple columns with x ##
    columns_list.append(x)
### convert list to new columns ###
df2 = pd.DataFrame(df,columns=columns_list)
df2["Name"] = df["Name"]
print(df2)
### If Name df3 is equal to Name df and equal to Custom Field of df, then get the id of df and insert the value into the corresponding column in df3. ###
#### First unsuccessful attempt ###
df2_columns_names = list(df2.columns.values)
for df2_name in df2['Name']:
    for df2_cf in df2_columns_names:
        for df_name in df['Name']:
            for df_cf in df['Custom Field']:
                for df_id in df['id']:
                    if df2_name == df_name and df2_cf == df_cf:
                        df2.loc[df2_name, df2_cf] = df_id
print(df2)
Any suggestions?
Thanks in advance.
Use pivot_table:
df.pivot_table(index='Name', columns='Custom Field', values='id')
As a general rule of thumb, if you are doing for loops and changing cells manually, you're using pandas wrong. Explore the methods of the framework in the docs, it can be very powerful :)
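Applied to the question's data, a sketch of the pivot_table approach (pivot_table defaults to aggfunc='mean', which simply passes each single id through here):

```python
import pandas as pd

df = pd.DataFrame({
    'Custom Field': ['CF1', 'CF2', 'CF3'],
    'id': [50, 40, 45],
    'Name': ['Wilson', 'Junior', 'Otavio']
})

# One row per Name, one column per Custom Field, cells filled with id
df2 = df.pivot_table(index='Name', columns='Custom Field', values='id')
print(df2)
```

Cells with no matching (Name, Custom Field) pair come out as NaN.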

How to find the top any % of a dataframe?

I want to find the top 1% of values in my dataframe and append them to a list. Then I can take the first value inside and use it as a filter on the dataframe. Any idea how to do that, or a simpler way to do it?
You can find the dataframe i use here :
https://raw.githubusercontent.com/srptwice/forstack/main/resultat_projet.csv
What I tried is to inspect my dataframe with a heatmap (from Seaborn) and use a filter like this:
df4 = df2[df2 > 50700]
You can use df.<column name>.quantile(<percentile>) to get the top % of a dataframe. For example, the code below gets the rows of df where the bfly column is in the top 10% (above the 90th percentile):
import pandas as pd
df = pd.read_csv('./resultat_projet.csv')
df.columns = df.columns.str.replace(' ', '') # remove blank spaces in columns
df2 = df[df.bfly > df.bfly.quantile(0.9)]
print(df2)
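The same idea on hypothetical data, without the CSV: with the integers 1 through 100 as a stand-in column, the 90th percentile lands at about 90.1 and the filter keeps the top ten rows.

```python
import pandas as pd

# Hypothetical stand-in column: the integers 1..100
df = pd.DataFrame({'bfly': range(1, 101)})

threshold = df.bfly.quantile(0.9)  # 90th percentile, ~90.1 here
top_10 = df[df.bfly > threshold]
print(threshold, len(top_10))
```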

Extract holidays from a dataframe

I have a dataframe with dates as an index and values as the first column.
I want to take all of the Belgian holidays out of that dataframe and create a new dataframe.
Things I've tried:
be_holidays = holidays.BE()
#example of the data frame (same format)
index = pd.date_range(start='08/08/2018',end='08/09/2018',freq='1H')
df = pd.DataFrame([1,2,3,5,3,5,4,6,2,4,6,6,3,2,5,9,7,8,8,5,1,2,5,3,6],columns=['A'], index = index)
new_df = df.applymap(lambda x: str(df.index[x]).split()[0] in be_holidays)
new_df = df[~(str(df.index).split()[0]).isin(be_holidays)]
#for context
type(df.index[0])
#results is
pandas._libs.tslib.Timestamp
I think this would work:
import pandas as pd
import holidays
# index must be a datetime
df.index = pd.to_datetime(df.index)
# build the calendar for the years actually present; holidays.BE() is
# populated lazily, so passing an explicit `years` argument is safest
be_holidays = holidays.BE(years=df.index.year.unique())
# boolean mask to identify holidays; compare by calendar date, since
# an hourly timestamp never equals a midnight date directly
mask = pd.Series(df.index.date, index=df.index).isin(be_holidays.keys())
# drop holidays, keep rest (~ is `not`)
df = df.loc[~mask]
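The same masking pattern, with a hypothetical hand-built set of holiday dates standing in for holidays.BE() so the sketch is self-contained:

```python
import pandas as pd

# Hourly data over two days, as in the question
index = pd.date_range(start='2018-08-08', end='2018-08-09', freq='h')
df = pd.DataFrame({'A': range(len(index))}, index=index)

# Hypothetical stand-in for the Belgian holiday calendar
be_holiday_dates = {pd.Timestamp('2018-08-08').date()}

# Compare by calendar date, then drop the matching rows
mask = pd.Series(df.index.date, index=df.index).isin(be_holiday_dates)
df_without_holidays = df.loc[~mask]
print(len(df), len(df_without_holidays))
```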

Creating an empty Pandas DataFrame column with a fixed first value then filling it with a formula

I'd like to create an empty column in an existing DataFrame, with the first value of that column set to 100. After that, I'd like to iterate and fill the rest of the column with a formula like row[C][t-1] * (1 + row[B][t]),
very similar to:
Creating an empty Pandas DataFrame, then filling it?
But the difference is fixing the first value of column 'C' to 100 vs entirely formulas.
import datetime
import pandas as pd
import numpy as np
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq='D')
columns = ['A','B','C']
df_ = pd.DataFrame(index=index, columns=columns)
df_ = df_.fillna(0)
data = np.array([np.arange(10)]*3).T
df = pd.DataFrame(data, index=index, columns=columns)
df['B'] = df['A'].pct_change()
df['C'] = df['C'].shift() * (1+df['B'])
## how do I set 2016-10-03 in column 'C' to equal 100 and then calculate consecutively from there?
df
Try this. Unfortunately, something like a for loop is likely needed, because each row is calculated from the prior row's value, which has to be carried in a variable as you move down the rows (c_column in my example):
c_column = []
c_column.append(100)
for x, i in enumerate(df['B']):
    if x > 0:
        c_column.append(c_column[x - 1] * (1 + i))
df['C'] = c_column
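If row t only ever depends on row t-1 through multiplication, the loop can also be replaced by a cumulative product; a sketch on a hypothetical returns column (the NaN that pct_change leaves in the first row is treated as a zero return):

```python
import pandas as pd

# Hypothetical period-over-period returns, NaN in the first row
df = pd.DataFrame({'B': [float('nan'), 0.10, -0.05, 0.20]})

# C starts at 100 and compounds: C[t] = C[t-1] * (1 + B[t])
df['C'] = 100 * (1 + df['B'].fillna(0)).cumprod()
print(df['C'].tolist())
```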
