Calculate standard deviation for intervals in dataframe column - python

I would like to calculate standard deviations for non-rolling intervals.
I have a df like this:
value  std  year
3      NaN  2001
2      NaN  2001
4      NaN  2001
19     NaN  2002
23     NaN  2002
34     NaN  2002
and so on. I would just like to calculate the standard deviation for every year and save it in every cell in the respective row in "std". I have the same amount of data for every year, thus the length of the intervals never changes.
I already tried:
df["std"] = df.groupby("year").std()
but since the right-hand side returns a new dataframe with the standard deviation of every column grouped by year, this obviously does not work.
Thank you all very much for your support!

IIUC:
Try via the transform() method:
df['std'] = df.groupby("year")['value'].transform('std')
OR
If you want to find the standard deviation of multiple columns then:
df[['std1','std2']]=df.groupby("year")[['column1','column2']].transform('std')
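As a self-contained sketch of the transform() approach on the question's sample data (column names as given there):

```python
import pandas as pd

# Sample data from the question; "std" is filled in by transform
df = pd.DataFrame({
    "value": [3, 2, 4, 19, 23, 34],
    "year": [2001, 2001, 2001, 2002, 2002, 2002],
})

# transform('std') computes one std per year and broadcasts it
# back to every row of that year's group
df["std"] = df.groupby("year")["value"].transform("std")
print(df)
```

Every 2001 row gets the same value (1.0 here), and likewise for 2002, which is exactly the "same value in every cell of the interval" behaviour the question asks for.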

How do I count values in one dataframe based on the conditions in another dataframe

I have two dataframes. df1 shows annual rainfall over a certain area:
df1:
longitude  latitude  year
-13.0      8.0       1979    15.449341
                     1980    21.970507
                     1981    18.114307
                     1982    16.881737
                     1983    24.122467
                     1984    27.108953
                     1985    27.401234
                     1986    18.238272
                     1987    25.421076
                     1988    11.796293
                     1989    17.778618
                     1990    18.095036
                     1991    20.414757
and df2 shows the upper limits of each bin:
bin limits
0 16.655970
1 18.204842
2 19.526524
3 20.852657
4 22.336731
5 24.211905
6 27.143820
I'm trying to add a new column to df2 that shows the frequency of rainfall events from df1 in their corresponding bin. For example, in bin 1 I'd be looking for the values in df1 that fall between 16.65 and 18.2.
I've tried the following:
rain = df1['tp1']
for i in range(7):
    limit = df2.iloc[i]
    out4['count'] = rain[rain > limit].count()
However, I get the following message:
ValueError: Can only compare identically-labeled Series objects
Which I think is referring to the fact that I'm comparing two df's that are different sizes? I'm also unsure if that loop is correct or not.
Any help is much appreciated, thanks!
Use pd.cut to assign your rainfall into bins:
# Define the limits for your bins
# Bin 0: (-np.inf , 16.655970]
# Bin 1: (16.655970, 18.204842]
# Bin 2: (18.204842, 19.526524]
# ...
# note that your bins only go up to 27.14 while max rainfall is 27.4 (row 6).
# You may need to add / adjust your limits.
limits = [-np.inf] + df2["limits"].to_list()
# Assign the rainfall to each bin
bins = pd.cut(df1["rainfall"], limits, labels=df2["bin"])
# Count how many values fall into each bin
bins.value_counts(sort=False).rename_axis("bin")
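For reference, a runnable sketch of this answer using the rainfall values quoted in the question (the column is called rainfall here for illustration; the question's real column was tp1):

```python
import numpy as np
import pandas as pd

# Rainfall values from df1 in the question
df1 = pd.DataFrame({"rainfall": [
    15.449341, 21.970507, 18.114307, 16.881737, 24.122467,
    27.108953, 27.401234, 18.238272, 25.421076, 11.796293,
    17.778618, 18.095036, 20.414757,
]})
# Bin upper limits from df2
df2 = pd.DataFrame({
    "bin": range(7),
    "limits": [16.655970, 18.204842, 19.526524, 20.852657,
               22.336731, 24.211905, 27.143820],
})

# Prepend -inf so bin 0 is open-ended on the left
limits = [-np.inf] + df2["limits"].to_list()
bins = pd.cut(df1["rainfall"], limits, labels=df2["bin"])

# One count per bin; the 27.401234 value exceeds the last limit
# and is dropped, as the answer warns
df2["count"] = bins.value_counts(sort=False).rename_axis("bin").to_numpy()
print(df2)
```

Note that only 12 of the 13 values are counted: 27.401234 falls outside the last bin edge, which is the adjustment the comment in the answer points at.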

Filter values as per std deviation for individual column

I am working on a requirement where I need to replace particular values with NaN based on the variable upper, which is my upper standard-deviation bound.
Here is a sample code:
data = {'year': ['2014','2014','2015','2014','2015','2015','2015','2014','2015'],
        'month': ['Hyundai','Toyota','Hyundai','Toyota','Hyundai','Toyota','Hyundai','Toyota','Toyota'],
        'make': [23,34,32,22,12,33,44,11,21]}
df = pd.DataFrame.from_dict(data)
df = pd.pivot_table(df, index='month', columns='year', values='make', aggfunc=np.sum)
upper = df.mean() + 3*df.std()
This is just the sample data, the real data is huge, based on upper's value for every year, I need to filter the year column accordingly.
Sample input (df), the upper values, and the desired output were attached as images.
Based on the upper value for each individual year, a cell should be converted to NaN if value < upper.
E.g. 2014 has upper = 138, so only in 2014's column, if value < upper, convert it to NaN.
2014's upper value applies only to 2014 itself, and the same goes for 2015.
IIUC, use DataFrame.lt to compare the DataFrame against the Series, and then set NaNs where it matches via DataFrame.mask:
print(df.lt(upper))
year 2014 2015
month
Hyundai True True
Toyota True True
df = df.mask(df.lt(upper))
print(df)
year 2014 2015
month
Hyundai NaN NaN
Toyota NaN NaN
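Putting the answer together with the question's sample data as a runnable sketch (aggfunc='sum' is used here in place of np.sum; the result is the same):

```python
import pandas as pd

data = {
    "year": ["2014", "2014", "2015", "2014", "2015",
             "2015", "2015", "2014", "2015"],
    "month": ["Hyundai", "Toyota", "Hyundai", "Toyota", "Hyundai",
              "Toyota", "Hyundai", "Toyota", "Toyota"],
    "make": [23, 34, 32, 22, 12, 33, 44, 11, 21],
}
df = pd.DataFrame.from_dict(data)
df = pd.pivot_table(df, index="month", columns="year",
                    values="make", aggfunc="sum")

# One threshold per year column: mean + 3 * std
upper = df.mean() + 3 * df.std()

# lt() compares each column against its own year's threshold;
# mask() turns every True cell into NaN
masked = df.mask(df.lt(upper))
print(masked)
```

With this data, 2014's upper is about 138 and every pivoted value sits below its year's threshold, so the whole frame becomes NaN, matching the printed output above.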

Plotting by Index with different labels

I am using pandas and matplotlib to generate some charts.
My DataFrame:
                                              Journal  Papers per year in journal
0                 Information and Software Technology                           4
1   2012 International Conference on Cyber Securit...                           4
2        Journal of Network and Computer Applications                           4
3                             IEEE Security & Privacy                           5
4                                Computers & Security                          11
My DataFrame is the result of a groupby on a larger dataframe. What I want now is a simple bar chart, which in theory works fine with df_groupby_time.plot(kind='bar'). However, the resulting chart is not what I want.
What I want are different colored bars, and a legend which states which color corresponds to which paper.
Playing around with relabeling hasn't gotten me anywhere so far. And I have no idea anymore on how to achieve what I want.
EDIT:
Setting the Journal column as the index and plotting isn't what I want:
df_groupby_time.set_index("Journals").plot(kind='bar')
I found a solution, based on this question here.
So, the dataframe needs to be transformed into a matrix where the values exist only on the main diagonal.
First, I save the column journals for later in a variable.
new_cols = df["Journal"].values
Secondly, I wrote a function that takes a series (the column Papers per year in journal) and the previously saved new columns as input parameters, and returns a dataframe where the values sit only on the main diagonal:
def values_into_main_diagonal(some_series, new_cols):
    """Puts the values of a series onto the main diagonal of a new df.

    some_series - any series given
    new_cols - the new column labels as list or numpy.ndarray
    """
    x = [{i: some_series[i]} for i in range(len(some_series))]
    main_diag_df = pd.DataFrame(x)
    main_diag_df.columns = new_cols
    return main_diag_df
Thirdly, feeding the function the Papers per year in journal column and our saved new column names returns the following dataframe:
new_df:
1_journal 2_journal 3_journal 4_journal 5_journal
0 4 NaN NaN NaN NaN
1 NaN 4 NaN NaN NaN
2 NaN NaN 4 NaN NaN
3 NaN NaN NaN 5 NaN
4 NaN NaN NaN NaN 11
Finally, plotting new_df via new_df.plot(kind='bar', stacked=True) gives me what I want: the journals in different colors in the legend, and NOT on the axis.
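The whole solution can be sketched end to end (the journal names are taken from the question's display, including the one truncated there):

```python
import pandas as pd

df = pd.DataFrame({
    "Journal": [
        "Information and Software Technology",
        "2012 International Conference on Cyber Securit...",
        "Journal of Network and Computer Applications",
        "IEEE Security & Privacy",
        "Computers & Security",
    ],
    "Papers per year in journal": [4, 4, 4, 5, 11],
})

def values_into_main_diagonal(some_series, new_cols):
    """Put the values of a series onto the main diagonal of a new df."""
    # One single-key dict per row: row i holds a value only in column i
    x = [{i: some_series[i]} for i in range(len(some_series))]
    main_diag_df = pd.DataFrame(x)
    main_diag_df.columns = new_cols
    return main_diag_df

new_cols = df["Journal"].values
new_df = values_into_main_diagonal(df["Papers per year in journal"], new_cols)

# Each journal is now its own column, so stacked bars get one color
# and one legend entry per journal:
# new_df.plot(kind='bar', stacked=True)
```

The diagonal trick works because plot(kind='bar', stacked=True) assigns one color per column, and here every column contributes exactly one bar segment.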

Python pandas dataframe select rows from columns

In an Excel sheet with columns Rainfall / Year / Month, I want to sum rainfall data per year. That is, for instance, for the year 2000, from month 1 to 12, summing all the Rainfall cells into a new one.
I tried using pandas in Python but cannot manage (just started coding). How can I proceed? Any help is welcome, thanks!
Here the head of the data (which has been downloaded):
   rainfall (mm)  \tyear  month country iso3 iso2
0      120.54000    1990      1     ECU  NaN  NaN
1      231.15652    1990      2     ECU  NaN  NaN
2      136.62088    1990      3     ECU  NaN  NaN
3      203.47653    1990      4     ECU  NaN  NaN
4      164.20956    1990      5     ECU  NaN  NaN
Use groupby and aggregate with sum if you need the total for every year:
df = df.groupby('\tyear')['rainfall (mm)'].sum()
But if you need only one value:
df.loc[df['\tyear'] == 2000, 'rainfall (mm)'].sum()
If you just want the year 2000, use
df[df['\tyear'] == 2000]['rainfall (mm)'].sum()
Otherwise, jezrael's answer is nice because it sums rainfall (mm) for each distinct value of \tyear.
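A minimal sketch using the five rows shown in the head (the year column name really does start with a tab character, hence '\tyear'; since the head only contains 1990, that year is used instead of 2000):

```python
import pandas as pd

# The five rows shown in the head of the data
df = pd.DataFrame({
    "rainfall (mm)": [120.54000, 231.15652, 136.62088,
                      203.47653, 164.20956],
    "\tyear": [1990, 1990, 1990, 1990, 1990],
    "month": [1, 2, 3, 4, 5],
})

# Annual totals for every year at once
totals = df.groupby("\tyear")["rainfall (mm)"].sum()

# Or the total for a single year
total_1990 = df.loc[df["\tyear"] == 1990, "rainfall (mm)"].sum()
print(total_1990)
```

Both routes agree; groupby is the better choice as soon as you want more than one year at a time.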

Rolling standard deviation with Pandas, and NaNs

I have data that looks like this:
1472698113000000000 -28.84
1472698118000000000 -26.69
1472698163000000000 -27.65
1472698168000000000 -26.1
1472698238000000000 -27.33
1472698243000000000 -26.47
1472698248000000000 -25.24
1472698253000000000 -25.53
1472698283000000000 -27.3
...
This is a time series that grows. Each time it grows, I attempt to get the rolling standard deviation of the set, using pandas.rolling_std. Each time, the result includes NaNs, which I cannot use (I am trying to insert the result into InfluxDB, and it complains when it sees the NaNs.)
I've experimented with different window sizes. I am doing this on different series, of varying rates of growth and current sizes (some just a couple of measurements long, some hundreds or thousands).
Simply, I just want to have a rolling standard deviation in InfluxDB so that I can graph it and watch how the source data is changing over time, with respect to its mean. How can I overcome this NaN problem?
If you are doing something like
df.rolling(5).std()
and getting
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 5.032395e+10 1.037386
5 5.345559e+10 0.633024
6 4.263215e+10 0.967352
7 3.510698e+10 0.822879
8 1.767767e+10 0.971972
You can strip away the NaNs by using .dropna().
df.rolling(5).std().dropna():
4 5.032395e+10 1.037386
5 5.345559e+10 0.633024
6 4.263215e+10 0.967352
7 3.510698e+10 0.822879
8 1.767767e+10 0.971972
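A quick sketch of this approach on the question's data, keeping only the measurement column for brevity:

```python
import pandas as pd

# The measurement column from the question's data
values = [-28.84, -26.69, -27.65, -26.1, -27.33,
          -26.47, -25.24, -25.53, -27.3]
s = pd.Series(values)

# The first window-1 entries are NaN by construction;
# dropna() removes them before writing anywhere downstream
rolling_std = s.rolling(5).std().dropna()
print(rolling_std)
```

The surviving values start at index 4 and match the second column of the output shown above (1.037386, 0.633024, ...), so nothing but the unavoidable warm-up NaNs is lost.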
