I am working with water quality data for both surface water and groundwater well locations. I would like to create a summary statistics table for all three of my parameters (pH, Temp, Salinity), grouped by the location type the samples were taken from (Surface Water vs. Groundwater), as shown below:
     |       Surface Water       |        Groundwater        |
     | min | max | mean | std    | min | max | mean | std    |
pH   |     |     |      |        |     |     |      |        |
The way I set up my Excel sheet for data collection includes the following columns: Date, Monitoring ID (either Surface Water or Groundwater), pH, Temp, and Salinity.
How can I tell Python to do this? I am familiar with the groupby and describe() functions, but I don't know how to organize the output the way that I want. Any help would be appreciated!
I have tried using the groupby function for each descriptive stat, for example:
mean = df.groupby('Monitoring ID')[['pH', 'SAL (ppt)', 'Temperature (°C)', 'DO (mg/L)']].mean()
min = df.groupby('Monitoring ID')[['pH', 'SAL (ppt)', 'Temperature (°C)', 'DO (mg/L)']].min()
etc., but I don't know how to combine it all into one nice table.
You can use groupby + describe as you suggest, then stack and transpose:
metrics = ['count', 'mean', 'std', 'min', 'max']
out = df.groupby('Monitoring ID').describe().stack().T.loc[:, (slice(None), metrics)]
>>> out
Monitoring ID Groundwater Surface Water
count mean std min max count mean std min max
pH 159.0 6.979182 0.587316 6.00 7.98 141.0 6.991135 0.564097 6.00 7.99
SAL (ppt) 159.0 1.976226 0.577557 1.02 2.99 141.0 1.917589 0.576650 1.01 2.99
Temperature (°C) 159.0 13.466101 4.805317 4.13 21.78 141.0 13.099645 4.989240 4.03 21.61
DO (mg/L) 159.0 1.984277 0.609071 1.00 2.99 141.0 1.939433 0.577651 1.00 2.96
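If you want the statistics in the min/max/mean/std order from your sketch, a small follow-up sketch (reindexing the second column level of the out table built above) should do it:
metrics = ['min', 'max', 'mean', 'std']
out = out.reindex(columns=metrics, level=1)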
You can use agg along with groupby:
import pandas as pd
import numpy as np
# Sample data
data = {'Date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-01', '2022-01-02', '2022-01-03'],
'Monitoring ID': ['Surface Water', 'Surface Water', 'Surface Water', 'Groundwater', 'Groundwater', 'Groundwater'],
'pH': [7.1, 7.2, 7.5, 7.8, 7.6, 7.4],
'Temp': [10, 12, 9, 15, 13, 14],
'Salinity': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]}
df = pd.DataFrame(data)
# Group by 'Monitoring ID' and calculate summary statistics
summary_stats = df.groupby('Monitoring ID').agg({'pH': ['min', 'max', 'mean', 'std'],
'Temp': ['min', 'max', 'mean', 'std'],
'Salinity': ['min', 'max', 'mean', 'std']})
# Flatten the MultiIndex columns by joining the parameter and statistic names
summary_stats.columns = ['_'.join(col).strip() for col in summary_stats.columns.values]
# Summary table
print(summary_stats)
Pardon me, I'm still trying to figure out how to show the output of the code here, but I hope this helps.
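For reference, with the sample data above the printed table should come out roughly as follows (values rounded; exact spacing depends on your pandas version):
               pH_min  pH_max   pH_mean    pH_std  Temp_min  Temp_max  ...
Monitoring ID
Groundwater       7.4     7.8  7.600000  0.200000        13        15  ...
Surface Water     7.1     7.5  7.266667  0.208167         9        12  ...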
I have a data frame in which there is an indefinite number of columns, to be defined later.
Like this:
| index  | GDP  | 2004 | 2005 | ... |
|--------|------|------|------|-----|
| brasil | 1000 | 0.10 | 0.10 | ... |
| china  | 1000 | 0.15 | 0.10 | ... |
| india  | 1000 | 0.05 | 0.10 | ... |
df = pd.DataFrame({'index': ['brasil', 'china', 'india'],
'GDP': [1000,1000,1000],
'2004': [0.10, 0.15, 0.05],
'2005': [0.10, 0.10, 0.10]})
The GDP column holds the initial GDP, and the columns from 2004 onwards are floats representing the percentage growth of GDP in each year.
Using those percentages, I need the absolute GDP value for each year, based on the initial GDP, so that I end up with a dataframe like this:
| index  | GDP  | 2004 | 2005 |
|--------|------|------|------|
| brasil | 1000 | 1100 | 1210 |
| china  | 1000 | 1150 | 1265 |
| india  | 1000 | 1050 | 1155 |
I tried to use itertuples, df.columns, and for loops, but I am probably missing something.
Remember that there is an indefinite number of columns.
Thank you very much in advance!
My answer is a combination of Wardy and user19*.
Starting with...
df = pd.DataFrame(data={'GDP': [1000, 1000, 1000],
'2004': [0.10, 0.15, 0.5],
'2005': [0.10, 0.10, 0.10],
'index': ['brasil', 'china', 'india']})
Find the percentage columns and make sure they are in the right order.
columns_of_interest = sorted(c for c in df.columns if c not in ['GDP', 'index'])
Now we calculate...
running_GDP = df['GDP'].copy()  # starting value (copy, so df['GDP'] itself is not modified in place)
for column in columns_of_interest:
    running_GDP *= 1.0 + df[column]
    df[column] = running_GDP
This results in
GDP 2004 2005 index
0 1000 1100.0 1210.0 brasil
1 1000 1150.0 1265.0 china
2 1000 1500.0 1650.0 india
A simple way is to count the columns and loop over:
num = df.shape[1]
start = 2
for idx in range(start, num):
    df.iloc[:, idx] = df.iloc[:, idx - 1] * (1 + df.iloc[:, idx])
print(df)
which gives
index GDP 2004 2005
0 brasil 1000 1100.0 1210.0
1 china 1000 1150.0 1265.0
2 india 1000 1050.0 1155.0
You can use df.columns to access a list of the dataframe's columns.
Then you can loop over all of these column names. Here is an example with your data frame where I multiplied every value by 2. If you want to do different operations on different columns, you can add conditions inside the loop (see the sketch at the end of this answer).
df = pd.DataFrame({'index': ['brasil', 'china', 'india'],
'GDP': [1000,1000,1000],
'2004': [0.10, 0.15, 0.5],
'2005': [0.10, 0.10, 0.10]})
for colName in df.columns:
    df[colName] *= 2
print(df)
This returns:
index GDP 2004 2005
0 brasilbrasil 2000 0.2 0.2
1 chinachina 2000 0.3 0.2
2 indiaindia 2000 1.0 0.2
Hope this helps!
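For the GDP question specifically, that condition-in-the-loop idea might look roughly like this (a sketch assuming a fresh copy of the original df, before the *2 example above; it keeps a running total and only transforms the year columns):
running = df['GDP'].copy()
for colName in df.columns:
    if colName not in ('index', 'GDP'):
        # compound the growth and overwrite the percentage column with the absolute GDP
        running = running * (1 + df[colName])
        df[colName] = running
print(df)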
Add one to the percentages and calculate the cumulative product:
q = (df.iloc[:,2:] + 1).cumprod(axis=1)
Then multiply by the starting GDP:
q = q.mul(df['GDP'],axis='index')
If you are trying to change the original DataFrame, assign the result back:
df.iloc[:,2:] = q
If you want to make a new DataFrame, concatenate the result with the first columns of the original:
new = pd.concat([df.iloc[:,:2],q],axis=1)
You can put those first two lines together if you want.
q = (df.iloc[:,2:] + 1).cumprod(axis=1).mul(df.GDP,axis='index')
I am trying to calculate the t-test score of data stored in different dataframes using ttest_rel from scipy.stats. But when calculating the t-test between the same data, it returns a numpy ndarray of NaNs instead of a single NaN. What am I doing wrong that makes me get a numpy array instead of a single value?
My code with a sample dataframe is as follows:
import pandas as pd
import numpy as np
import re
from scipy.stats import ttest_rel
cities_nhl = pd.DataFrame({'metro': ['NewYork', 'LosAngeles', 'StLouis', 'Detroit', 'Boston', 'Baltimore'],
'total_ratio': [0.45, 0.51, 0.62, 0.43, 0.26, 0.32]})
cities_nba = pd.DataFrame({'metro': ['Boston', 'LosAngeles', 'Phoenix', 'Baltimore', 'Detroit', 'NewYork'],
'total_ratio': [0.50, 0.41, 0.34, 0.53, 0.33, 0.42]})
cities_mlb = pd.DataFrame({'metro': ['Seattle', 'Detroit', 'Boston', 'Baltimore', 'NewYork', 'LosAngeles'],
'total_ratio': [0.48, 0.27, 0.52, 0.33, 0.28, 0.67]})
cities_nfl = pd.DataFrame({'metro': ['LosAngeles', 'Atlanta', 'Detroit', 'Boston', 'NewYork', 'Baltimore'],
'total_ratio': [0.47, 0.41, 0.82, 0.13, 0.56, 0.42]})
needed_cols = ['metro', 'total_ratio'] #metro is a string and total_ratio is a float column
df_dict = {'NHL': cities_nhl[needed_cols], 'NBA': cities_nba[needed_cols],
'MLB': cities_mlb[needed_cols], 'NFL': cities_nfl[needed_cols]} #keeping all dataframes in a dictionary
#for ease of access
sports = ['NHL','NBA','MLB','NFL'] #name of sports
p_values_dict = {'NHL':[], 'NBA':[], 'MLB':[], 'NFL':[]} #dictionary to store p values
for clm1 in sports:
    for clm2 in sports:
        # merge the dataframes of two sports and then calculate their ttest score
        _df = pd.merge(df_dict[clm1], df_dict[clm2],
                       how='inner', on='metro', suffixes=[f'_{clm1}', f'_{clm2}'])
        _pval = ttest_rel(_df[f"total_ratio_{clm1}"], _df[f"total_ratio_{clm2}"])[1]
        p_values_dict[clm1].append(_pval)
p_values = pd.DataFrame(p_values_dict, index=sports)
p_values
| |NHL |NBA |MLB |NFL |
|-------|-----------|-----------|-----------|----------|
|NHL |[nan, nan] |0.589606 |0.826298 |0.38493 |
|NBA |0.589606 |[nan, nan] |0.779387 |0.782173 |
|MLB |0.826298 |0.779387 |[nan, nan] |0.713229 |
|NFL |0.38493 |0.782173 |0.713229 |[nan, nan]|
The problem here is actually not related to scipy, but is due to duplicate column labels in your dataframes. In this part of your code:
_df = pd.merge(df_dict[clm1], df_dict[clm2],
how='inner', on='metro', suffixes=[f'_{clm1}', f'_{clm2}'])
When clm1 and clm2 are equal (say they are both NHL), you get a _df dataframe like this:
metro total_ratio_NHL total_ratio_NHL
0 NewYork 0.45 0.45
1 LosAngeles 0.51 0.51
2 StLouis 0.62 0.62
3 Detroit 0.43 0.43
4 Boston 0.26 0.26
5 Baltimore 0.32 0.32
Then, when you pass the columns to the ttest_rel function, you end up passing both columns when you refer to a single column label, because they have the same label:
ttest_rel(_df[f"total_ratio_{clm1}"], _df[f"total_ratio_{clm2}"])
And that's how you get two t-statistics and two p-values.
So, you can modify those two lines to eliminate duplicate column labels, like this:
_df = pd.merge(df_dict[clm1], df_dict[clm2],
how='inner', on='metro', suffixes=[f'_{clm1}_1', f'_{clm2}_2'])
_pval = ttest_rel(_df[f"total_ratio_{clm1}_1"], _df[f"total_ratio_{clm2}_2"])[1]
The result will look like this:
NHL NBA MLB NFL
NHL NaN 0.589606 0.826298 0.384930
NBA 0.589606 NaN 0.779387 0.782173
MLB 0.826298 0.779387 NaN 0.713229
NFL 0.384930 0.782173 0.713229 NaN
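Alternatively, a small sketch that keeps the original suffixes: select the two merged ratio columns by position inside the loop (the merged frame has metro first, as shown above), which sidesteps the duplicate-label lookup when clm1 equals clm2:
_pval = ttest_rel(_df.iloc[:, 1], _df.iloc[:, 2])[1]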
I want to divide columns by other columns in a big data frame in pandas. How can I do this operation in an easy and fast way?
This is an example:
sent1 sent2 sent3 media fake other
0.67 0.25 1.6 3.0 4.0 5.0
My desired output would be:
sent1  sent2  media  fake  other  sent1/media  sent1/fake  sent1/other  sent2/media  sent2/fake  sent2/other
0.67   0.25   3.0    4.0   5.0    0.223        0.1675      0.134        0.083        0.0625      0.05
I would like to obtain these results in the easiest way.
So far, I've calculated this by doing:
df['sent1/media'] = df['sent1'] / df['media']
df['sent1/fake'] = df['sent1'] / df['fake']
df['sent1/other'] = df['sent1'] / df['other']
You could do something like this:
for num in ['sent1', 'sent2']:
for denom in ['media', 'fake', 'other']:
df[f'{num}/{denom}'] = df[num] / df[denom]
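If the frame is wide, a variant of the same idea builds all the ratio columns in one pass and joins them once (a sketch assuming the df and column names from the question):
ratios = pd.DataFrame({f'{num}/{den}': df[num] / df[den]
                       for num in ['sent1', 'sent2']
                       for den in ['media', 'fake', 'other']})
df = df.join(ratios)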
A broadcasting option:
from itertools import product
import pandas as pd
df = pd.DataFrame({
'sent1': {0: 0.67}, 'sent2': {0: 0.25},
'sent3': {0: 1.6}, 'media': {0: 3.0},
'fake': {0: 4.0}, 'other': {0: 5.0}
})
# Grab sent1 and sent2 Columns
sents = df[['sent1', 'sent2']]
# Grab Non Sent Columns
others = df.filter(regex='^(?!sent)')
# Broadcast Division
results = (
    sents.to_numpy()[..., None] / others.to_numpy()[:, None]
).reshape((len(df), len(sents.columns) * len(others.columns)))
# Convert to new dataframe with new column labels
new_df = pd.DataFrame(
    results,
    columns=map('/'.join, product(sents.columns.tolist(), others.columns.tolist()))
)
# Join to df
new_df = df.join(new_df)
print(new_df.to_string())
sent1 sent2 sent3 media fake other sent1/media sent1/fake sent1/other sent2/media sent2/fake sent2/other
0 0.67 0.25 1.6 3.0 4.0 5.0 0.223333 0.1675 0.134 0.083333 0.0625 0.05
I have a data frame with the following columns:
id
name
product
count
price
discount
and I want to create a summary data frame showing the total each client has spent, with and without the discount applied.
I tried the following:
summary = df.groupby('client_name')['price', 'count', 'discount'].agg([
    ('Total pre discount', df['price'] * df['count']),
    ('Discount applied', df['price'] * df['count'] * df['discount'])
])
and I'm getting this error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Is it even possible to do this in one step?
What am I doing wrong?
Please note that the agg() function cannot be used for calculations involving multiple columns. You have to use the apply() function instead. Refer to this post for details:
if you had multiple columns that needed to interact together then you
cannot use agg, which implicitly passes a Series to the aggregating
function. When using apply the entire group as a DataFrame gets passed
into the function.
For your case, you have to define a customized function as follows:
def f(x):
    data = {}
    data['Total pre discount'] = (x['price'] * x['count']).sum()
    data['Discount applied'] = (x['price'] * x['count'] * x['discount']).sum()
    return pd.Series(data)
Then perform your desired task by:
df.groupby('client_name').apply(f)
or if you want to use lambda function instead of customized function:
df.groupby('client_name').apply(lambda x: pd.Series({'Total pre discount': (x['price'] * x['count']).sum(), 'Discount applied': (x['price'] * x['count'] * x['discount']).sum()}))
Run Demonstration
Test Data Creation
data = {'id': ['0100', '0200', '0100', '0200', '0300'], 'client_name': ['Ann', 'Bob', 'Ann', 'Bob', 'Charles'], 'product': ['pen', 'paper', 'folder', 'pencil', 'tray'], 'count': [12, 300, 5, 12, 10], 'price': [2.00, 5.00, 3.50, 2.30, 8.20], 'discount': [0.0, 0.1, 0.15, 0.1, 0.12]}
df = pd.DataFrame(data)
print(df)
Output:
id client_name product count price discount
0 0100 Ann pen 12 2.0 0.00
1 0200 Bob paper 300 5.0 0.10
2 0100 Ann folder 5 3.5 0.15
3 0200 Bob pencil 12 2.3 0.10
4 0300 Charles tray 10 8.2 0.12
Run the New Code
# Use either one of the following two lines of code:
summary = df.groupby('client_name').apply(f) # Using the customized function f() defined above
# or using lambda function
summary = df.groupby('client_name').apply(lambda x: pd.Series({'Total pre discount': (x['price'] * x['count']).sum(), 'Discount applied': (x['price'] * x['count'] * x['discount']).sum()}))
print(summary)
Output:
Total pre discount Discount applied
client_name
Ann 41.5 2.625
Bob 1527.6 152.760
Charles 82.0 9.84
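A hedged alternative sketch that avoids apply() entirely: precompute the row-level amounts first, then a plain groupby sum does the rest (the helper column names gross and discount_amt are made up for illustration):
df['gross'] = df['price'] * df['count']
df['discount_amt'] = df['gross'] * df['discount']
summary = (df.groupby('client_name')[['gross', 'discount_amt']].sum()
             .rename(columns={'gross': 'Total pre discount',
                              'discount_amt': 'Discount applied'}))
print(summary)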
df.groupby(['client', 'discount']).agg({'price' : 'sum'}).reset_index()
I have the following dataframe:
date, industry, symbol, roc
25-02-2015, Health, abc, 200
25-02-2015, Health, xyz, 150
25-02-2015, Mining, tyr, 45
25-02-2015, Mining, ujk, 70
26-02-2015, Health, abc, 60
26-02-2015, Health, xyz, 310
26-02-2015, Mining, tyr, 65
26-02-2015, Mining, ujk, 23
I need to determine the average 'roc', max 'roc', and min 'roc', as well as how many symbols exist for each date+industry. In other words, I need to group by date and industry, and then determine various averages, max/min, etc.
So far I am doing the following, which is working but seems to be very slow and inefficient:
sector_df = primary_df.groupby(['date', 'industry'], sort=True).mean()
tmp_max_df = primary_df.groupby(['date', 'industry'], sort=True).max()
tmp_min_df = primary_df.groupby(['date', 'industry'], sort=True).min()
tmp_count_df = primary_df.groupby(['date', 'industry'], sort=True).count()
sector_df['max_roc'] = tmp_max_df['roc']
sector_df['min_roc'] = tmp_min_df['roc']
sector_df['count'] = tmp_count_df['roc']
sector_df.reset_index(inplace=True)
sector_df.set_index(['date', 'industry'], inplace=True)
The above code works, resulting in a dataframe indexed by date+industry, showing me what was the min/max 'roc' for each date+industry, as well as how many symbols existed for each date+industry.
I am basically doing a complete groupby multiple times (to determine the mean, max, min, count of the 'roc'). This is very slow because it's doing the same thing over and over.
Is there a way to just do the group by once. Then perform the mean, max etc on that object and assign the result to the sector_df?
You want to perform an aggregate using agg:
In [72]:
df.groupby(['date','industry']).agg([pd.Series.mean, pd.Series.max, pd.Series.min, pd.Series.count])
Out[72]:
roc
mean max min count
date industry
2015-02-25 Health 175.0 200 150 2
Mining 57.5 70 45 2
2015-02-26 Health 185.0 310 60 2
Mining 44.0 65 23 2
This allows you to pass an iterable (a list in this case) of functions to perform.
EDIT
To access individual results you need to pass a tuple for each axis:
In [78]:
gp.loc[('2015-02-25','Health'),('roc','mean')]
Out[78]:
175.0
Where gp = df.groupby(['date','industry']).agg([pd.Series.mean, pd.Series.max, pd.Series.min, pd.Series.count])
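On current pandas versions, an equivalent sketch using string aggregation names (rather than passing the Series methods) would be:
gp = df.groupby(['date', 'industry'])['roc'].agg(['mean', 'max', 'min', 'count'])
Note that the columns then come out flat (mean, max, min, count) rather than nested under roc, so the element access above becomes gp.loc[('2015-02-25', 'Health'), 'mean'].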
You can just save the groupby part to a variable as shown below:
primary_df = pd.DataFrame([['25-02-2015', 'Health', 'abc', 200],
['25-02-2015', 'Health', 'xyz', 150],
['25-02-2015', 'Mining', 'tyr', 45],
['25-02-2015', 'Mining', 'ujk', 70],
['26-02-2015', 'Health', 'abc', 60],
['26-02-2015', 'Health', 'xyz', 310],
['26-02-2015', 'Mining', 'tyr', 65],
['26-02-2015', 'Mining', 'ujk', 23]],
columns='date industry symbol roc'.split())
grouped = primary_df.groupby(['date', 'industry'], sort=True)
sector_df = grouped.mean()
tmp_max_df = grouped.max()
tmp_min_df = grouped.min()
tmp_count_df = grouped.count()
sector_df['max_roc'] = tmp_max_df['roc']
sector_df['min_roc'] = tmp_min_df['roc']
sector_df['count'] = tmp_count_df['roc']
sector_df.reset_index(inplace=True)
sector_df.set_index(['date', 'industry'], inplace=True)
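If you also want the statistics themselves computed in a single pass, the saved groupby can feed one agg call instead of the four separate calls above (a sketch, assuming only the 'roc' column needs the statistics):
sector_df = (grouped['roc']
             .agg(['mean', 'max', 'min', 'count'])
             .rename(columns={'mean': 'roc', 'max': 'max_roc', 'min': 'min_roc'}))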