Plot mean of subset of a pandas DataFrame - python

Assume a big set of data like
   Height (m)  My data
0          18      5.0
1          25      6.0
2          10      1.0
3          13      1.5
4          32      8.0
5          26      6.7
6          23      5.0
7           5      2.0
8           7      2.0
And I want to plot the average (and, if possible, the standard deviation) of "My data" as a function of height, separated into the ranges [0,5), [5,10), [10,15), and so on.
Any idea? I've tried different approaches and none of them work

If I understand you correctly:
import pandas as pd

# Precompute bin edges for pd.cut: 0, 5, 10, ... up to just past the max height
bins = list(range(0, df['Height (m)'].max() + 5, 5))
# Cut Height into intervals which exclude the right endpoint,
# with bin edges at multiples of 5
df['HeightBin'] = pd.cut(df['Height (m)'], bins=bins, right=False)
# Within each bin, get the mean, the stdev (normalized by N-1 by default),
# and the sample size, which explains why some std values are NaN
df.groupby('HeightBin')['My data'].agg(['mean', 'std', 'count'])
            mean       std  count
HeightBin
[0, 5)       NaN       NaN      0
[5, 10)     2.00  0.000000      2
[10, 15)    1.25  0.353553      2
[15, 20)    5.00       NaN      1
[20, 25)    5.00       NaN      1
[25, 30)    6.35  0.494975      2
[30, 35)    8.00       NaN      1
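Since the question asks for a plot, here is a minimal sketch of one way to draw it (assuming matplotlib is available and the df/HeightBin column from above; the bar-chart form is my choice, not part of the question):

import matplotlib.pyplot as plt

stats = df.groupby('HeightBin')['My data'].agg(['mean', 'std'])
# One bar per height bin; the standard deviation becomes the error bar
stats['mean'].plot(kind='bar', yerr=stats['std'], capsize=4)
plt.xlabel('Height bin (m)')
plt.ylabel('Mean of My data')
plt.tight_layout()
plt.show()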

If I understand correctly, this is what you would like to do:
import pandas as pd
import numpy as np
bins = np.arange(0, 40, 5)  # edges 0, 5, ..., 35; make sure the last edge goes past your max height
df_stats = pd.DataFrame(columns=['mean', 'st_dev'])  # DataFrame for the results
grouped = df.groupby(pd.cut(df['Height (m)'], bins, right=False))['My data']
df_stats['mean'] = grouped.mean()
df_stats['st_dev'] = grouped.std()
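To turn df_stats into the requested plot, one possible sketch (assuming matplotlib; using each interval's midpoint as the x position is my choice, not part of the answer above):

import matplotlib.pyplot as plt

centers = [iv.mid for iv in df_stats.index]  # midpoint of each 5 m interval
plt.errorbar(centers, df_stats['mean'], yerr=df_stats['st_dev'], fmt='o-', capsize=4)
plt.xlabel('Height (m)')
plt.ylabel('My data')
plt.show()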

Related

Compare two columns based on last N rows in a pandas DataFrame

I want to group by "ts_code" and, over the last N rows of each group, calculate the percentage change between the maximum of one column and the minimum of another column after that maximum. Specifically,
df
  ts_code  high  low
0       A    20   10
1       A    30    5
2       A    40   20
3       A    50   10
4       A    20   30
5       B    50   10
6       B    30    5
7       B    40   20
8       B    10   10
9       B    20   30
Goal
Below is my expected result:
  ts_code  high  low  l3_high_low_pct_chg  l4_high_low_pct_chg
0       A    20   10                  NaN                  NaN
1       A    30    5                  NaN                  NaN
2       A    40   20                 0.50                  NaN
3       A    50   10                 0.80                 0.80
4       A    20   30                 0.80                 0.80
5       B    50   10                  NaN                  NaN
6       B    30    5                  NaN                  NaN
7       B    40   20                 0.90                  NaN
8       B    10   10                 0.75                 0.90
9       B    20   30                 0.75                 0.75
lN_high_low_pct_chg (e.g. l3_high_low_pct_chg) = 1 - (the min value of the low column after the high peak) / (the max value of the high column), computed over the last N rows, for each group and each row.
What I tried, and the problem
df['l3_highest']=df.groupby('ts_code')['high'].transform(lambda x: x.rolling(3).max())
df['l3_lowest']=df.groupby('ts_code')['low'].transform(lambda x: x.rolling(3).min())
df['l3_high_low_pct_chg']=1-df['l3_lowest']/df['l3_highest']
But it fails: for the row at index 2, for example, l3_lowest is 5 when it should be 20, because the 5 occurs before the high peak. I don't know how to restrict the calculation to the rows after the peak.
For the last 4 rows of group B: at index 8 the max high is 50 and the min low after that peak is 5, so l4_high_low_pct_chg = 1 - 5/50 = 0.9; at index 9 the max high is 40 and the min low after it is 10, so l4_high_low_pct_chg = 1 - 10/40 = 0.75.
Another test case
If the rolling window is 52, for hy_code 880912 group and index 1252, l52_high_low_pct_chg would be 0.281131 and 880301 group and index 1251, l52_high_low_pct_chg would be 0.321471.
Grouping by 'ts_code' is just a trivial groupby() call. DataFrame.rolling() works on single columns, so it's tricky to apply when you need data from multiple columns. You can use "from numpy_ext import rolling_apply as rolling_apply_ext" as in this example: Pandas rolling apply using multiple columns. However, I first wrote a function that manually slices the dataframe into n-length sub-dataframes, then applies the function to calculate the value. idxmax() finds the index of the peak of the high column, and then we take the min() of the low values that follow. The rest is pretty straightforward.
import numpy as np
import pandas as pd

df = pd.DataFrame([['A', 20, 10],
                   ['A', 30, 5],
                   ['A', 40, 20],
                   ['A', 50, 10],
                   ['A', 20, 30],
                   ['B', 50, 10],
                   ['B', 30, 5],
                   ['B', 40, 20],
                   ['B', 10, 10],
                   ['B', 20, 30]],
                  columns=['ts_code', 'high', 'low'])

def custom_f(df, n):
    s = pd.Series(np.nan, index=df.index)

    def sub_f(df_):
        high_peak_idx = df_['high'].idxmax()
        min_low_after_peak = df_.loc[high_peak_idx:]['low'].min()
        max_high = df_['high'].max()
        return 1 - min_low_after_peak / max_high

    for i in range(df.shape[0] - n + 1):
        df_ = df.iloc[i:i + n]
        s.iloc[i + n - 1] = sub_f(df_)
    return s

df['l3_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 3).values
df['l4_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 4).values
print(df)
If you prefer to use the rolling function, this method gives the same output:
def rolling_f(rolling_df):
    # rolling() hands us one column at a time, so use the window's index
    # to look up all columns in the original dataframe
    df_ = df.loc[rolling_df.index]
    high_peak_idx = df_['high'].idxmax()
    min_low_after_peak = df_.loc[high_peak_idx:]["low"].min()
    max_high = df_['high'].max()
    return 1 - min_low_after_peak / max_high

df['l3_high_low_pct_chg'] = df.groupby("ts_code").rolling(3).apply(rolling_f).values[:, 0]
df['l4_high_low_pct_chg'] = df.groupby("ts_code").rolling(4).apply(rolling_f).values[:, 0]
print(df)
Finally, if you want a true rolling-window calculation that avoids any index lookup, you can use the numpy_ext library (https://pypi.org/project/numpy-ext/):
from numpy_ext import rolling_apply

def np_ext_f(rolling_df, n):
    def rolling_apply_f(high, low):
        return 1 - low[np.argmax(high):].min() / high.max()
    try:
        return pd.Series(rolling_apply(rolling_apply_f, n, rolling_df['high'].values, rolling_df['low'].values),
                         index=rolling_df.index)
    except ValueError:
        return pd.Series(np.nan, index=rolling_df.index)

df['l3_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=3).sort_index(level=1).values
df['l4_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=4).sort_index(level=1).values
print(df)
output:
  ts_code  high  low  l3_high_low_pct_chg  l4_high_low_pct_chg
0       A    20   10                  NaN                  NaN
1       A    30    5                  NaN                  NaN
2       A    40   20                 0.50                  NaN
3       A    50   10                 0.80                 0.80
4       A    20   30                 0.80                 0.80
5       B    50   10                  NaN                  NaN
6       B    30    5                  NaN                  NaN
7       B    40   20                 0.90                  NaN
8       B    10   10                 0.75                 0.90
9       B    20   30                 0.75                 0.75
For large datasets, the speed of these operations becomes an issue. So, to compare the speed of these different methods, I created a timing function:
import time

def timeit(f):
    def timed(*args, **kw):
        ts = time.time()
        result = f(*args, **kw)
        te = time.time()
        print('func:%r took: %2.4f sec' % (f.__name__, te - ts))
        return result
    return timed
Next, let's make a large DataFrame, just by copying the existing dataframe 500 times:
df = pd.concat([df for x in range(500)], axis=0)
df = df.reset_index()
Finally, we run the three tests under a timing function:
@timeit
def method_1():
    df['l52_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 52).values
method_1()

@timeit
def method_2():
    df['l52_high_low_pct_chg'] = df.groupby("ts_code").rolling(52).apply(rolling_f).values[:, 0]
method_2()

@timeit
def method_3():
    df['l52_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=52).sort_index(level=1).values
method_3()
Which gives us this output:
func:'method_1' took: 2.5650 sec
func:'method_2' took: 15.1233 sec
func:'method_3' took: 0.1084 sec
So, the fastest method is to use the numpy_ext, which makes sense because that's optimized for vectorized calculations. The second fastest method is the custom function I wrote, which is somewhat efficient because it does some vectorized calculations while also doing some Pandas lookups. The slowest method by far is using Pandas rolling function.
For my solution, we'll use .groupby("ts_code") and then .rolling to process windows of a certain size, together with a custom function. The custom function takes each window and, instead of applying a formula directly to the received values, uses those values to query the original dataframe. We can then calculate the expected values by finding the row with the "high" peak, looking at the following rows for the minimum "low" value, and finally applying your formula:
def custom_function(group, df):
    # Query the original dataframe using the group's values (index positions)
    group = df.loc[group.values]
    # Calculate your formula
    high_peak_row = group["high"].idxmax()
    min_low_after_peak = group.loc[high_peak_row:, "low"].min()
    return 1 - min_low_after_peak / group.loc[high_peak_row, "high"]

# Reset the index to roll over that column and be able to query the original dataframe
df["l3_high_low_pct_chg"] = df.reset_index().groupby("ts_code")["index"].rolling(3).apply(custom_function, args=(df,)).values
df["l4_high_low_pct_chg"] = df.reset_index().groupby("ts_code")["index"].rolling(4).apply(custom_function, args=(df,)).values
Output:
  ts_code  high  low  l3_high_low_pct_chg  l4_high_low_pct_chg
0       A    20   10                  NaN                  NaN
1       A    30    5                  NaN                  NaN
2       A    40   20                 0.50                  NaN
3       A    50   10                 0.80                 0.80
4       A    20   30                 0.80                 0.80
5       B    50   10                  NaN                  NaN
6       B    30    5                  NaN                  NaN
7       B    40   20                 0.90                  NaN
8       B    10   10                 0.75                 0.90
9       B    20   30                 0.75                 0.75
We can take this idea further and only group once:
groups = df.reset_index().groupby("ts_code")["index"]
for n in [3, 4]:
    df[f"l{n}_high_low_pct_chg"] = groups.rolling(n).apply(custom_function, args=(df,)).values

How to find smallest positive integer in data frame row

I have looked everywhere for this answer which must exist. I am trying to find the smallest positive integer per row in a data frame.
Imagine a dataframe:
df = pd.DataFrame({'lat': [-120, -90, -100, -100],
                   'long': [20, 21, 19, 18],
                   'dist1': [2, 6, 8, 1],
                   'dist2': [1, 3, 10, 5]})
The following function gives me the minimum value, but includes negatives. i.e. the df['lat'] column.
df.min(axis = 1)
Obviously, I could drop the lat column, or convert to string or something, but I will need it later. The lat column is the only column with negative values. I am trying to return a new column such as
df['min_dist'] = [1,3,8,1]
I hope this all makes sense. Thanks in advance for any help.
In general you can use DataFrame.where to mask out the non-positive values (they become NaN) and exclude them from the min calculation:
df['min_dist'] = df.where(df > 0).min(1)
df
    lat  long  dist1  dist2  min_dist
0  -120    20      2      1       1.0
1   -90    21      6      3       3.0
2  -100    19      8     10       8.0
3  -100    18      1      5       1.0
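Note that where introduces NaN for the masked entries, so the result is float. Assuming every row has at least one positive value, you can cast back to integer:

df['min_dist'] = df.where(df > 0).min(1).astype(int)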
Filter for just the dist columns and apply the minimum function:
df.assign(min_dist=df.iloc[:, -2:].min(1))
Out[205]:
    lat  long  dist1  dist2  min_dist
0  -120    20      2      1         1
1   -90    21      6      3         3
2  -100    19      8     10         8
3  -100    18      1      5         1
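A name-based variant that does not depend on column positions (assuming the distance columns all share the 'dist' prefix):

df['min_dist'] = df.filter(like='dist').min(1)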
Just use:
df['min_dist'] = df[df > 0].min(1)

Matplotlib - Plot uneven steps from DataFrame

I have this DataFrame with the x-axis data organized in columns. However, the columns for non-existent x values were omitted, so the steps are uneven. For instance:
   0.1  0.2  0.5  ...
0    1    4    7  ...
1    2    5    8  ...
2    3    6    9  ...
I want to plot each of these with x-axis np.arange(0, max(df.columns), step=0.1), and also a combined plot of all of them. Is there an easy way to achieve this with matplotlib.pyplot?
plt.plot(np.arange(0, max(df.columns), step=0.1), new_data)
Any help would be appreciated.
If I understood you correctly, your final dataframe is supposed to look like this:
   0.0  0.1  0.2  0.3  0.4  0.5
0  0.0    1    4  0.0  0.0    7
1  0.0    2    5  0.0  0.0    8
2  0.0    3    6  0.0  0.0    9
which can be generated (and then also plotted) like this:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({0.1:[1,2,3],0.2:[4,5,6],0.5:[7,8,9]})
## make sure to actually include the maximum value (add one step)
# or alternatively rather use np.linspace() with appropriate number of points
xs = np.arange(0, max(df.columns) +0.1, step=0.1)
df = df.reindex(columns=xs, fill_value=0.0)
plt.plot(df.T)
plt.show()
which yields a line plot of the three series, with zeros filling the previously missing columns.
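Alternatively, if you would rather keep the uneven spacing than fill the gaps with zeros, you can pass the column labels as x values directly (a sketch using the original three-column df):

import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({0.1: [1, 2, 3], 0.2: [4, 5, 6], 0.5: [7, 8, 9]})
# each row of df becomes one line, plotted at x = 0.1, 0.2, 0.5
plt.plot(df.columns, df.T)
plt.show()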

understanding cut command in python DataFrame

The class variable in my DataFrame consists of several numbers, with these value counts:
5 681
6 638
7 199
4 53
8 18
3 10
I have seen the following command on a website:
bins = (2,6.5,8)
group_names = ['bad','good']
categories = pd.cut(df['quality'], bins, labels = group_names)
df['quality'] = categories
After that, the quality column contains only two categorical values: bad and good. I am interested in how exactly this works. If a number is between 2 and 6.5, is it bad and all others good, or vice versa? Please explain this to me.
Consider:
import pandas as pd

df = pd.DataFrame({'score': range(10)})
bins = (2, 6.5, 8)
labels = ('bad', 'good')
df['quality'] = pd.cut(df['score'], bins, labels=labels)
print(df)
The output is:
   score quality
0      0     NaN
1      1     NaN
2      2     NaN
3      3     bad
4      4     bad
5      5     bad
6      6     bad
7      7    good
8      8    good
9      9     NaN
There are 2 bins into which the score data is assigned: (2, 6.5] and (6.5, 8]. The left end is exclusive and the right end is inclusive. All numbers in (2, 6.5] evaluate to bad and those in (6.5, 8] to good. Data points outside these intervals get no category, hence NaN.
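As an aside (an option you may or may not want, not part of the command you quoted): pd.cut can change the boundary handling. include_lowest=True closes the first interval on the left, and right=False makes all bins left-inclusive instead:

# a score of exactly 2 now lands in the first bin
df['quality'] = pd.cut(df['score'], bins, labels=labels, include_lowest=True)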

How to control width of graph line in matplotlib?

I am trying to plot line graphs in matplotlib with the following data. The x, y points belonging to the same id form one line, so there are 3 lines in the df below:
    id     x    y
0    1  0.50  0.0
1    1  1.00  0.3
2    1  1.50  0.5
4    1  2.00  0.7
5    2  0.20  0.0
6    2  1.00  0.8
7    2  1.50  1.0
8    2  2.00  1.2
9    2  3.50  2.0
10   3  0.10  0.0
11   3  1.10  0.5
12   3  3.55  2.2
It can be simply plotted with following code:
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib notebook

fig, ax = plt.subplots(figsize=(12, 8))
cmap = plt.cm.get_cmap("viridis")
groups = df.groupby("id")
ngroups = len(groups)
for i1, (key, grp) in enumerate(groups):
    grp.plot(linestyle="solid", x="x", y="y", ax=ax, label=key)
plt.show()
But I have another data frame, df2, where the weight of each id is given, and I am hoping to find a way to control the thickness of each line according to its weight: the larger the weight, the thicker the line. How can I do this? Also, what relation should hold between the weight and the width of the line?
   id  weight
0   1       5
1   2      15
2   3       2
Please let me know if anything is unclear.
Based on the comments, you need to know a few things:
How to set the line width?
That's simple: linewidth=number. See https://matplotlib.org/examples/pylab_examples/set_and_get.html
How to take the weight and make it a significant width?
This depends on the range of your weight. If it's consistently between 2 and 15, I'd recommend simply dividing it by 2, i.e.:
linewidth=weight/2
If you find this aesthetically unpleasing, divide by a bigger number, though that would obviously reduce the variation between the line widths.
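If you want more control than a fixed divisor, a hedged alternative is to rescale the weights linearly onto a chosen width range (the [1, 6] point range here is an arbitrary choice):

w = df2.set_index('id')['weight']
# smallest weight maps to a 1 pt line, largest to 6 pt
widths = 1 + 5 * (w - w.min()) / (w.max() - w.min())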
How to get the weight out of df2?
Given the df2 you described and the code you showed, key is the id in df2. So you want the scalar:
df2.loc[df2['id'] == key, 'weight'].iat[0]
Putting it all together:
Replace your grp.plot line with the following:
grp.plot(linestyle="solid",
         linewidth=df2.loc[df2['id'] == key, 'weight'].iat[0] / 2.0,
         x="x", y="y", ax=ax, label=key)
(This is just your line with a linewidth entry added; .iat[0] extracts a scalar so matplotlib gets a plain number rather than a one-element Series.)
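As a design note (my variation, not the answer above): you can also merge the weights into df once and skip the per-group lookup:

df = df.merge(df2, on='id')
for key, grp in df.groupby('id'):
    grp.plot(linestyle="solid", linewidth=grp['weight'].iat[0] / 2.0,
             x="x", y="y", ax=ax, label=key)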
