Here is the data showing two columns between which I need to plot histogram
Cont Bin_Data
21 1
21 1
22.8 1
21.4 0
18.7 0
18.1 0
14.3 0
24.4 0
22.8 1
19.2 1
17.8 0
16.4 1
17.3 0
15.2 1
I have to plot Bin_Data(column) based histogram to compare Cont(column). I have tried 3 approaches and am not getting satisfactory result/plot.
Approach #1
plt.hist('mpg', bins=5, data=am)
Approach #2
plt.hist(mpg, bins=np.arange(mpg.min(), mpg.max()+1))
Approach #3
am = data['am']
legend = ['am', 'mpg']
mpg = data['mpg']
plt.hist([mpg, am], color=['orange', 'green'])
plt.xlabel("am")
plt.ylabel("mpg")
plt.legend(legend)
#plt.xticks(range(0, 7))
#plt.yticks(range(1, 20))
plt.title('Analysis of "am" upon "mpg"')
plt.show()
Related
I have a dataframe with messy data.
df:
1 2 3
-- ------- ------- -------
0 123/100 221/100 103/50
1 49/100 333/100 223/50
2 153/100 81/50 229/100
3 183/100 47/25 31/20
4 2.23 3.2 3.04
5 2.39 3.61 2.69
I want the fractional values to be converted to decimal with the conversion formula being
e.g:
123/100 = (123/100 + 1) = 2.23
333/100 = (333/100 +1) = 4.33
The calculation is fractional value + 1
And of course leave the decimal values as is.
How can I do it in Pandas and Python?
A simple way to do this is to first define a conversion function that will be applied to each element in a column:
def convert(s):
if '/' in s: # is a fraction
num, den = s.split('/')
return 1+(int(num)/int(den))
else:
return float(s)
Then use the .apply function to run all elements of a column through this function:
df['1'] = df['1'].apply(convert)
Result:
df['1']:
0 2.23
1 1.49
2 2.53
3 2.83
4 2.23
5 2.39
Then repeat on any other column as needed.
If you trust the data in your dataset, the simplest way is to use eval or better, suggested by #mozway, pd.eval:
>>> df.replace(r'(\d+)/(\d+)', r'1+\1/\2', regex=True).applymap(pd.eval)
1 2 3
0 2.23 3.21 3.06
1 1.49 4.33 5.46
2 2.53 2.62 3.29
3 2.83 2.88 2.55
4 2.23 3.20 3.04
5 2.39 3.61 2.69
I have a pandas dataframe like there is longer gaps in time and I want to slice them into smaller dataframes where time "clusters" are together
Time Value
0 56610.41341 8.55
1 56587.56394 5.27
2 56590.62965 6.81
3 56598.63790 5.47
4 56606.52203 6.71
5 56980.44206 4.75
6 56592.53327 6.53
7 57335.52837 0.74
8 56942.59094 6.96
9 56921.63669 9.16
10 56599.52053 6.14
11 56605.50235 5.20
12 57343.63828 3.12
13 57337.51641 3.17
14 56593.60374 5.69
15 56882.61571 9.50
I tried sorting this and taking time difference of two consecutive points with
df = df.sort_values("Time")
df['t_dif'] = df['Time'] - df['Time'].shift(-1)
And it gives
Time Value t_dif
1 56587.56394 5.27 -3.06571
2 56590.62965 6.81 -1.90362
6 56592.53327 6.53 -1.07047
14 56593.60374 5.69 -5.03416
3 56598.63790 5.47 -0.88263
10 56599.52053 6.14 -5.98182
11 56605.50235 5.20 -1.01968
4 56606.52203 6.71 -3.89138
0 56610.41341 8.55 -272.20230
15 56882.61571 9.50 -39.02098
9 56921.63669 9.16 -20.95425
8 56942.59094 6.96 -37.85112
5 56980.44206 4.75 -355.08631
7 57335.52837 0.74 -1.98804
13 57337.51641 3.17 -6.12187
12 57343.63828 3.12 NaN
Lets say I want to slice this dataframe to smaller dataframes where time difference between two consecutive points is smaller than 40 how would I go by doing this?
I could loop the rows but this is frowned upon so is there a smarter solution?
Edit: Here is a example:
df1:
Time Value t_dif
1 56587.56394 5.27 -3.06571
2 56590.62965 6.81 -1.90362
6 56592.53327 6.53 -1.07047
14 56593.60374 5.69 -5.03416
3 56598.63790 5.47 -0.88263
10 56599.52053 6.14 -5.98182
11 56605.50235 5.20 -1.01968
4 56606.52203 6.71 -3.89138
df2:
0 56610.41341 8.55 -272.20230
df3:
15 56882.61571 9.50 -39.02098
9 56921.63669 9.16 -20.95425
8 56942.59094 6.96 -37.85112
...
etc.
I think you can just
df1 = df[df['t_dif']<30]
df2 = df[df['t_dif']>=30]
def split_dataframe(df, value):
df = df.sort_values("Time")
df = df.reset_index()
df['t_dif'] = (df['Time'] - df['Time'].shift(-1)).abs()
indxs = df.index[df['t_dif'] > value].tolist()
indxs.append(-1)
indxs.append(len(df))
indxs.sort()
frames = []
for i in range(1, len(indxs)):
val = df.iloc[indxs[i] + 1: indxs[i]]
frames.append(val)
return frames
Returns the correct dataframes as a list
In a python script I have a pd's describe() output, called df looks like the following. The output has two index --Class and EL_base.
I want to make individual boxplot for each class. How can I do it?
count mean std min 25% 50% 75% max
Class EL_base
PC1 0 8 247.04 8.16 236.90 244.15 245.17 247.71 265.41
1 8 243.25 2.96 237.22 242.57 243.84 244.49 247.29
PC2 0 8 243.25 2.96 237.22 242.57 243.84 244.49 247.29
1 8 518.96 6.35 507.27 515.38 519.72 523.65 526.25
2 8 519.52 2.84 513.77 518.17 520.50 521.46 522.39
I want to display mean and standard deviation values above each of the boxplots in the grouped boxplot (see picture).
My code is
import pandas as pd
import seaborn as sns
from os.path import expanduser as ospath
df = pd.read_excel(ospath('~/Documents/Python/Kandidatspeciale/TestData.xlsx'),'Ark1')
bp = sns.boxplot(y='throw angle', x='incident angle',
data=df,
palette="colorblind",
hue='Bat type')
bp.set_title('Rubber Comparison',fontsize=15,fontweight='bold', y=1.06)
bp.set_ylabel('Throw Angle [degrees]',fontsize=11.5)
bp.set_xlabel('Incident Angle [degrees]',fontsize=11.5)
Where my dataframe, df, is
Bat type incident angle throw angle
0 euro 15 28.2
1 euro 15 27.5
2 euro 15 26.2
3 euro 15 27.7
4 euro 15 26.4
5 euro 15 29.0
6 euro 30 12.5
7 euro 30 14.7
8 euro 30 10.2
9 china 15 29.9
10 china 15 31.1
11 china 15 24.9
12 china 15 27.5
13 china 15 31.2
14 china 15 24.4
15 china 30 9.7
16 china 30 9.1
17 china 30 9.5
I tried with the following code. It needs to be independent of number of x (incident angles), for instance it should do the job for more angles of 45, 60 etc.
m=df.mean(axis=0) #Mean values
st=df.std(axis=0) #Standard deviation values
for i, line in enumerate(bp['medians']):
x, y = line.get_xydata()[1]
text = ' μ={:.2f}\n σ={:.2f}'.format(m[i], st[i])
bp.annotate(text, xy=(x, y))
Can somebody help?
This question brought me here since I was also looking for a similar solution with seaborn.
After some trial and error, you just have to change the for loop to:
for i in range(len(m)):
bp.annotate(
' μ={:.2f}\n σ={:.2f}'.format(m[i], st[i]),
xy=(i, m[i]),
horizontalalignment='center'
)
This change worked for me (although I just wanted to print the actual median values). You can also add changes like the fontsize, color or style (i.e., weight) just by adding them as arguments in annotate.
I am brand new to pandas and working with two dataframes. My goal is to append the non-date values of df_ls (below) column-wise to their nearest respective date in df_1. Is the only way to do this with a traditional for-loop or is their some more effective built-in method/function. I have googled this extensively without any luck and have only found ways to append blocks of dataframes to other dataframes. I haven't found a way to search through a dataframe and append a row in another dataframe at the nearest respective date. See example below:
Example of first dataframe (lets call it df_ls):
DATE ALBEDO_SUR B13_RATIO B23_RATIO B1_RAW B2_RAW
0 1999-07-04 0.070771 1.606958 1.292280 0.128069 0.103018
1 1999-07-20 0.030795 2.326290 1.728147 0.099020 0.073595
2 1999-08-21 0.022819 2.492871 1.762536 0.096888 0.068502
3 1999-09-06 0.014613 2.792271 1.894225 0.090590 0.061445
4 1999-10-08 0.004978 2.781847 1.790768 0.089291 0.057521
5 1999-10-24 0.003144 2.818474 1.805257 0.090623 0.058054
6 1999-11-09 0.000859 3.146100 1.993941 0.092787 0.058823
7 1999-12-11 0.000912 2.913604 1.656642 0.097239 0.055357
8 1999-12-27 0.000877 2.974692 1.799949 0.098282 0.059427
9 2000-01-28 0.000758 3.092533 1.782112 0.095153 0.054809
10 2000-03-16 0.002933 2.969185 1.727465 0.083059 0.048322
11 2000-04-01 0.016814 2.366437 1.514110 0.089720 0.057398
12 2000-05-03 0.047370 1.847763 1.401930 0.109767 0.083290
13 2000-05-19 0.089432 1.402798 1.178798 0.137965 0.115936
14 2000-06-04 0.056340 1.807828 1.422489 0.118601 0.093328
Example of second dataframe (let's call it df_1)
Sample Date Value
0 2000-05-09 1.68
1 2000-05-09 1.68
2 2000-05-18 1.75
3 2000-05-18 1.75
4 2000-05-31 1.40
5 2000-05-31 1.40
6 2000-06-13 1.07
7 2000-06-13 1.07
8 2000-06-27 1.49
9 2000-06-27 1.49
10 2000-07-11 2.29
11 2000-07-11 2.29
In the end, my goal is to have something like this (Note the appended values are values closest to the Sample Date, even though they dont match up perfectly):
Sample Date Value ALBEDO_SUR B13_RATIO B23_RATIO B1_RAW B2_RAW
0 2000-05-09 1.68 0.047370 1.847763 1.401930 0.109767 0.083290
1 2000-05-09 1.68 0.047370 1.847763 1.401930 0.109767 0.083290
2 2000-05-18 1.75 0.089432 1.402798 1.178798 0.137965 0.115936
3 2000-05-18 1.75 0.089432 1.402798 1.178798 0.137965 0.115936
4 2000-05-31 1.40 0.056340 1.807828 1.422489 0.118601 0.093328
5 2000-05-31 1.40 0.056340 1.807828 1.422489 0.118601 0.093328
6 2000-06-13 1.07 ETC.... ETC.... ETC ...
7 2000-06-13 1.07
8 2000-06-27 1.49
9 2000-06-27 1.49
10 2000-07-11 2.29
11 2000-07-11 2.29
Thanks for any and all help. As I said I am new to this and I have experience with this sort of thing in MATLAB but PANDAS is a new to me.
Thanks