Python: Standard Deviation within a list of dataframes

I have a list of, say, 50 dataframes, 'list1'; each dataframe has columns 'Speed' and 'Value', like this:
Speed Value
1 12
2 17
3 19
4 21
5 25
I am trying to get the standard deviation of 'Value' for each speed, across all dataframes. The end goal is to get a list or df of the standard deviation for each speed, like this:
Speed Standard Deviation
1 1.23
2 2.5
3 1.98
4 5.6
5 5.77
I've tried to pull the values into a new dataframe using a for loop, to then use 'statistics.stdev' on, but I can't seem to get it working. Any help is really appreciated!
Update!
pd.concat([d.set_index('Speed') for d in df_power], axis=1).std(axis=1)
This worked. Although, I forgot to mention that the values for Speed are not always the same between dataframes: some dataframes are missing a few speeds, and this ends up returning NaN in those instances.

You can concat and use std:
list_dfs = [df1, df2, df3, ...]
pd.concat([d.set_index('Speed') for d in list_dfs], axis=1).std(axis=1)
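For context, a minimal self-contained sketch of this approach (the three sample dataframes here are made up):
import pandas as pd

# hypothetical sample data: each dataframe has 'Speed' and 'Value' columns
list_dfs = [
    pd.DataFrame({'Speed': [1, 2, 3], 'Value': [12, 17, 19]}),
    pd.DataFrame({'Speed': [1, 2, 3], 'Value': [14, 15, 22]}),
    pd.DataFrame({'Speed': [1, 2, 3], 'Value': [11, 18, 20]}),
]

# align on Speed so each dataframe contributes one 'Value' column,
# then take the row-wise (sample) standard deviation per speed
wide = pd.concat([d.set_index('Speed') for d in list_dfs], axis=1)
stds = wide.std(axis=1)
Note that std skips NaN by default (skipna=True), so a speed missing from some dataframes still gets a value; the result is only NaN when a speed occurs in just one dataframe.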

You'll want to concatenate, group by speed, and take the standard deviation.
1) Concatenate your dataframes:
list1 = [df_1, df_2, ...]
full_df = pd.concat(list1, axis=0)  # stack all dataframes
2) Group by speed and take the standard deviation:
std_per_speed_df = full_df.groupby('Speed')[['Value']].std()
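Putting both steps together and renaming the result to match the desired output (list1 is the question's list of dataframes, so this is a sketch rather than a standalone script):
full_df = pd.concat(list1, axis=0)  # stack all dataframes
std_per_speed = (full_df.groupby('Speed')['Value']
                        .std()
                        .reset_index(name='Standard Deviation'))
Because the standard deviation is computed per group, this avoids the index-alignment NaNs mentioned in the update; a speed only yields NaN if it occurs exactly once across all dataframes.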

If the dataframes are all saved in the same folder, you can use pd.concat + groupby as already suggested, or you can use dask:
import dask.dataframe as dd
import pandas as pd
df = dd.read_csv("data/*")
out = df.groupby("Speed")["Value"].std()\
.compute()\
.reset_index(name="Standard Deviation")

Related


I can't get the average or mean of a column in pandas. I have a dataframe. Neither of the things I tried below gives me the average of the column weight:
>>> allDF
ID birthyear weight
0 619040 1962 0.1231231
1 600161 1963 0.981742
2 25602033 1963 1.3123124
3 624870 1987 0.94212
The following returns several values, not one:
allDF[['weight']].mean(axis=1)
So does this:
allDF.groupby('weight').mean()
If you only want the mean of the weight column, select the column (which is a Series) and call .mean():
In [479]: df
Out[479]:
ID birthyear weight
0 619040 1962 0.123123
1 600161 1963 0.981742
2 25602033 1963 1.312312
3 624870 1987 0.942120
In [480]: df.loc[:, 'weight'].mean()
Out[480]: 0.83982437500000007
Try df.mean(axis=0); the axis=0 argument calculates the column-wise mean of the dataframe, giving one value per column. axis=1 is the row-wise mean, which is why you are getting multiple values.
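A quick illustration of the difference, using a made-up two-column frame:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})
print(df.mean(axis=0))  # column-wise: a -> 2.0, b -> 20.0 (one value per column)
print(df.mean(axis=1))  # row-wise: 5.5, 11.0, 16.5 (one value per row)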
Also give print(df.describe()) a shot; it is very helpful for getting an overall description of your dataframe.
Mean for each column in df:
A B C
0 5 3 8
1 5 3 9
2 8 4 9
df.mean()
A 6.000000
B 3.333333
C 8.666667
dtype: float64
And if you want the average of all columns:
df.stack().mean()
6.0
You can use
df.describe()
to get basic statistics of the dataframe, and to get the mean of a specific column you can use
df["columnname"].mean()
You can also access a column using the dot notation (also called attribute access) and then calculate its mean:
df.your_column_name.mean()
You can use either of the two statements below:
numpy.mean(df['col_name'])
# or
df['col_name'].mean()
Additionally, if you want to round the value after finding the mean:
# Create a DataFrame
import pandas as pd

df1 = {'Subject': ['semester1', 'semester2', 'semester3', 'semester4',
                   'semester1', 'semester2', 'semester3'],
       'Score': [62.73, 47.76, 55.61, 74.67, 31.55, 77.31, 85.47]}
df1 = pd.DataFrame(df1, columns=['Subject', 'Score'])
rounded_mean = round(df1['Score'].mean()) # specified nothing as decimal place
print(rounded_mean) # 62
rounded_mean_decimal_0 = round(df1['Score'].mean(), 0) # specified decimal place as 0
print(rounded_mean_decimal_0) # 62.0
rounded_mean_decimal_1 = round(df1['Score'].mean(), 1) # specified decimal place as 1
print(rounded_mean_decimal_1) # 62.2
Do note that the column needs to be a numeric data type in the first place.
import pandas as pd
df['column'] = pd.to_numeric(df['column'], errors='coerce')
Next, find the mean of one column, or get it for all numeric columns using describe():
df['column'].mean()
df.describe()
Example of result from describe:
column
count 62.000000
mean 84.678548
std 216.694615
min 13.100000
25% 27.012500
50% 41.220000
75% 70.817500
max 1666.860000
You can simply go for:
df.describe()
which will provide you with all the relevant details you need, but to find the min, max, or average value of a particular column ('weight' in your case), use:
df['weight'].mean(): for the average value
df['weight'].max(): for the maximum value
df['weight'].min(): for the minimum value
You can use the method agg (aggregate):
df.agg('mean')
It's possible to apply multiple statistics:
df.agg(['mean', 'max', 'min'])
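For example, applied to just the weight column of the question's allDF (the output comments are computed from the sample rows shown above):
allDF['weight'].agg(['mean', 'max', 'min'])
# mean    0.839824
# max     1.312312
# min     0.123123
# Name: weight, dtype: float64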
You can follow the code below:
import pandas as pd
import numpy as np
classxii = {'Name': ['Karan', 'Ishan', 'Aditya', 'Anant', 'Ronit'],
            'Subject': ['Accounts', 'Economics', 'Accounts', 'Economics', 'Accounts'],
            'Score': [87, 64, 58, 74, 87],
            'Grade': ['A1', 'B2', 'C1', 'B1', 'A2']}
df = pd.DataFrame(classxii, index=['a', 'b', 'c', 'd', 'e'],
                  columns=['Name', 'Subject', 'Score', 'Grade'])
print(df)
# use the following for the mean if you already have a dataframe
print('mean of score is:')
print(df[['Score']].mean())

Different aggregation for dataframe with several columns

I am looking for a shortcut to reduce the manual grouping required:
I have a dataframe with many columns. When grouping the dataframe by 'Level', I want to aggregate two columns using nunique(), but all other columns (ca. 60 columns representing years from 2021 onward) using mean().
Does anyone have an idea how to define 'the rest' of the columns?
Thanks!
I would do it the following way:
import pandas as pd
df = pd.DataFrame({'X':[1,1,1,2,2,2],'A':[1,2,3,4,5,6],'B':[1,2,3,4,5,6],'C':[7,8,9,10,11,12],'D':[13,14,15,16,17,18],'E':[19,20,21,22,23,24]})
aggdct = dict.fromkeys(df.columns, pd.Series.mean)
del aggdct['X']
aggdct['A'] = pd.Series.nunique
print(df.groupby('X').agg(aggdct))
output
A B C D E
X
1 3 2 8 14 20
2 3 5 11 17 23
Explanation: using dict.fromkeys, I prepare a dict describing how to aggregate each column, with the column names as keys and the pd.Series.mean function as every value. I then remove the column used in the groupby, and change the selected column to hold pd.Series.nunique rather than pd.Series.mean.
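A slightly more compact sketch of the same idea, using a dict comprehension and string aggregation names (equivalent on recent pandas versions):
# mean for every column except the group key 'X';
# 'A' is overridden to count unique values instead
agg_spec = {col: 'nunique' if col == 'A' else 'mean'
            for col in df.columns if col != 'X'}
print(df.groupby('X').agg(agg_spec))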

How to merge two dataframes with different lengths in python

I am trying to merge two weekly DataFrames, each made up of one column, but with different lengths.
Could I please know how to merge them while maintaining the 'Week' indexing?
[df1]
Week Coeff1
1 -0.456662
1 -0.533774
1 -0.432871
1 -0.144993
1 -0.553376
... ...
53 -0.501221
53 -0.025225
53 1.529864
53 0.044380
53 -0.501221
[16713 rows x 1 columns]
[df2]
Week Coeff
1 0.571707
1 0.086152
1 0.824832
1 -0.037042
1 1.167451
... ...
53 -0.379374
53 1.076622
53 -0.547435
53 -0.638206
53 0.067848
[63265 rows x 1 columns]
I've tried this code:
df3 = pd.merge(df1, df2, how='inner', on='Week')
df3 = df3.drop_duplicates()
df3
But it gave me a new df (df3) with 13386431 rows × 2 columns
Desired outcome: a new df which has 3 columns (week, coeff1, coeff2); as df2 is longer, I expect to have some NaNs in coeff1 to fill the gaps.
I assume your output should look somewhat like this:
Week  Coeff1     Coeff2
1     -0.456662  0.571707
1     -0.533774  0.086152
1     -0.432871  0.824832
2     3          3
2     NaN        3
Don't mind the actual numbers though.
The problem is that you won't achieve that with a join on Week, neither left nor inner, because the Week index is not unique.
So, on a left join, pandas is going to join all the Coeff2 values where df2.Week == 1 onto every single row in df1 where df1.Week == 1, and that is why you get these millions of rows.
I will try to give you a workaround later, but maybe this helps you to think about the problem from another perspective!
Now is later:
What you actually want to do is to concatenate the Dataframes "per week".
You achieve that by iterating over every week, building a per-week subset that concatenates df1[week] and df2[week] on axis=1, and then concatenating all these subsets on axis=0 afterwards:
weekly_dfs = []
for week in df1.Week.unique():
    sub_df1 = df1.loc[df1.Week == week, "Coeff1"].reset_index(drop=True)
    sub_df2 = df2.loc[df2.Week == week, "Coeff2"].reset_index(drop=True)
    concat_df = pd.concat([sub_df1, sub_df2], axis=1)
    concat_df["Week"] = week
    weekly_dfs.append(concat_df)
df3 = pd.concat(weekly_dfs).reset_index(drop=True)
The last reset of the index is optional, but I recommend it anyway!
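An alternative sketch that avoids the explicit loop: number the rows within each week with cumcount, then outer-merge on Week plus that counter. This assumes the value columns are named Coeff1 and Coeff2 as above, and that pairing rows by their order within each week is acceptable:
# pair rows by position within each week; the outer merge leaves NaN
# where the shorter dataframe runs out of rows for a given week
df1['n'] = df1.groupby('Week').cumcount()
df2['n'] = df2.groupby('Week').cumcount()
df3 = (df1.merge(df2, on=['Week', 'n'], how='outer')
          .sort_values(['Week', 'n'])
          .drop(columns='n')
          .reset_index(drop=True))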
Based on your last comment on the question, you may want to concatenate instead of merging the two data frames:
df3 = pd.concat([df1,df2], ignore_index=True, axis=1)
The resulting DataFrame should have 63265 rows and will need some work to get it to the required format (remove the added index columns, rename the remaining columns, etc.), but pd.concat should be a good start.
According to pandas' merge documentation, what you are looking for is a left join. However, the default option is an inner join. You can change this by passing a different how argument:
df2.merge(df1,how='left', left_on='Week', right_on='Week')
Note that this keeps all rows of the bigger df and assigns NaN to them where the shorter df has no match.

Merging columns based on the percentage column

I have a dataframe that has numerical and categorical values. Essentially, what I am trying to accomplish is to merge rows based on a specific criterion: accumulating down the rows, once the percentage column reaches 100%, merge those rows into one. The numerical columns will be averaged and the categorical values will be listed.
I am here for ideas on how to tackle the problem in the most efficient way possible, preferably in Python.
Here is what the dataframe looks like:
x    y    z    a    %
3    8    lem  or   0.5
7    9    lem  or   0.5
5    10   lem  or   0.3
5    9    or   or   0.7
10   8    or   or   1
This is what the final dataframe would look like:
x    y    z         a       %
5    8.5  lem, lem  or, or  1
5    9.5  lem, or   or, or  1
10   8    or        or      1
IIUC, let's try:
s = df['%'].cumsum()
grp = s.where(s.mod(1).eq(0)).bfill()
df.groupby(grp, as_index=False).agg({'x': 'mean',
                                     'y': 'mean',
                                     'z': ", ".join,
                                     'a': ", ".join,
                                     '%': 'sum'})
Output:
x y z a %
0 5 8.5 lem, lem or, or 1.0
1 5 9.5 lem, or or, or 1.0
2 10 8.0 or or 1.0
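To see why the grouper works, here are the intermediate values, rebuilding the question's table as a small self-contained sketch:
import pandas as pd

df = pd.DataFrame({'x': [3, 7, 5, 5, 10],
                   'y': [8, 9, 10, 9, 8],
                   'z': ['lem', 'lem', 'lem', 'or', 'or'],
                   'a': ['or', 'or', 'or', 'or', 'or'],
                   '%': [0.5, 0.5, 0.3, 0.7, 1]})

s = df['%'].cumsum()         # 0.5, 1.0, 1.3, 2.0, 3.0
mask = s.mod(1).eq(0)        # False, True, False, True, True
grp = s.where(mask).bfill()  # 1.0, 1.0, 2.0, 2.0, 3.0 -> group labels
Note that with other decimals, floating-point error could make mod(1).eq(0) miss an exact 100%; rounding first (e.g. s.round(6).mod(1).eq(0)) is a safer check.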

How to copy one DataFrame column in to another Dataframe if their indexes values are the same

After creating a DataFrame with some duplicated cell values in the column named 'keys':
import pandas as pd
df = pd.DataFrame({'keys': [1,2,2,3,3,3,3],'values':[1,2,3,4,5,6,7]})
I go ahead and create two more DataFrames, which are consolidated versions of the original DataFrame df. Those newly created DataFrames will have no duplicated cell values under the 'keys' column:
df_sum = df.groupby('keys').sum().reset_index()
df_mean = df.groupby('keys').mean().reset_index()
As you can see, the df_sum['values'] cell values were all summed together, while the df_mean['values'] cell values were averaged with the mean() method.
Lastly I rename the 'values' column in both dataframes with:
df_sum.columns = ['keys', 'sums']
df_mean.columns = ['keys', 'means']
Now I would like to copy the df_mean['means'] column into the dataframe df_sum.
How to achieve this?
The desired result is a single DataFrame in which both the 'sums' and 'means' columns appear next to 'keys'.
There are several ways to do this. Using the merge method on the dataframe is the most efficient:
df_both = df_sum.merge(df_mean, how='left', on='keys')
df_both
Out[1]:
keys sums means
0 1 1 1.0
1 2 5 2.5
2 3 22 5.5
I think pandas.merge() is the function you are looking for, e.g. pd.merge(df_sum, df_mean, on="keys"). Besides, this result can also be produced with a single agg call, as follows:
df.groupby('keys')['values'].agg(['sum', 'mean']).reset_index()
# keys sum mean
#0 1 1 1.0
#1 2 5 2.5
#2 3 22 5.5
