Plotting a histogram in pandas? - python

I have a dataframe like this and wish to make a frequency histogram.
1.19714
1.04872
0.188158
1
1.02339
1
1
1
0.38496
1.31858
1
1.33654
0.945736
1.00877
0.413445
0.810127
1
0.625
0.492857
0.187156
0.95847
I want to plot a frequency histogram with x-axis bins from -1 to 1. How can I do this in pandas?

pandas has a built-in histogram method, DataFrame.hist.
Assuming your dataframe is called df:
import numpy as np
# np.arange excludes the stop value, so use 1.1 to get bin edges all the way to 1
df.hist(bins=np.arange(-1, 1.1, 0.1))
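A minimal self-contained sketch, assuming the values above sit in a single column (the column name value is just a placeholder):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# a few of the values posted in the question; 'value' is a placeholder column name
df = pd.DataFrame({'value': [1.19714, 1.04872, 0.188158, 1, 1.02339, 1, 1, 1,
                             0.38496, 1.31858, 1, 1.33654, 0.945736, 1.00877]})

# bin edges from -1 to 1 in steps of 0.1; values above 1 fall outside these bins and are not counted
df['value'].hist(bins=np.arange(-1, 1.1, 0.1))
plt.show()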

Related

Create a seaborn histogram with two columns of a dataframe

I am trying to display a histogram with this dataframe.
gr_age weighted_cost
0 1 2272.985462
1 2 2027.919360
2 3 1417.617779
3 4 946.568598
4 5 715.731002
5 6 641.716770
I want to use gr_age column as the X axis and weighted_cost as the Y axis. Here is an example of what I am looking for with Excel:
I tried the following code, also with discrete=True, but it gives a different result, and I didn't do any better with displot.
sns.histplot(data=df, x="gr_age", y="weighted_cost")
plt.show()
Thank you for your ideas!
You want a barplot (x vs. y values), not a histplot, which plots the distribution of a dataset:
import seaborn as sns
ax = sns.barplot(data=df, x='gr_age', y='weighted_cost', color='#4473C5')
ax.set_title('Values by age group')
output:
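For a self-contained run, a sketch that rebuilds the dataframe from the table in the question:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# data copied from the question
df = pd.DataFrame({'gr_age': [1, 2, 3, 4, 5, 6],
                   'weighted_cost': [2272.985462, 2027.919360, 1417.617779,
                                     946.568598, 715.731002, 641.716770]})

ax = sns.barplot(data=df, x='gr_age', y='weighted_cost', color='#4473C5')
ax.set_title('Values by age group')
plt.show()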

Create bivariate histogram from binned pandas dataframe

I have a pandas dataframe with two columns a and b.
a b
23 1.1
76 2.1
98 3.4
18 1.4
I have binned b into 10 bins and want to create a histogram of the a values within the bins of b.
dg = pd.cut(x=df['b'] , bins=10)
I want a histogram with the x-axis as the bins (of b) and the y-axis as the total of the a values (like 23+18+...) in those bins.
How should I approach this problem?
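One way to approach it, sketched under the assumption that you want the sum of a per bin of b: group on the result of pd.cut and plot the sums as a bar chart.
import pandas as pd
import matplotlib.pyplot as plt

# data copied from the question
df = pd.DataFrame({'a': [23, 76, 98, 18], 'b': [1.1, 2.1, 3.4, 1.4]})

# bin b, then sum a within each bin
bins = pd.cut(df['b'], bins=10)
totals = df.groupby(bins, observed=False)['a'].sum()

# bins of b on the x-axis, total of a on the y-axis
totals.plot(kind='bar')
plt.xlabel('bins of b')
plt.ylabel('total of a')
plt.show()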

How to plot the variation of a feature

I have a dataset with 2 columns and I would like to show the variation of one feature according to the binary output value.
data
id Column1 output
1 15 0
2 80 1
3 120 1
4 20 0
... ... ...
I would like to draw a plot with Python where the x-axis contains the values of Column1 and the y-axis contains the percentage of positive output values.
I already know that my plot will have the form of an exponential function: when Column1 has smaller values I get more positive outputs than when it has larger values.
An exponential plot may just need two lists of points, like this.
Try this:
import matplotlib.pyplot as plt
# x-axis points (must be the same length as the y list)
xL = [5, 10, 15, 20, 25]
# y-axis points
yL = [100, 50, 25, 12, 10]
plt.plot(xL, yL)
plt.axis([0, 35, 0, 200])
plt.show()
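The lists above are just placeholder points; here is a sketch of how the curve could be computed from the actual data, assuming a dataframe with the Column1 and output columns from the question (the bin count is arbitrary):
import pandas as pd
import matplotlib.pyplot as plt

# rows copied from the question (the real dataset has more rows)
df = pd.DataFrame({'Column1': [15, 80, 120, 20],
                   'output': [0, 1, 1, 0]})

# bin Column1, then take the mean of the binary output per bin:
# the mean of a 0/1 column is the fraction of positive outputs
bins = pd.cut(df['Column1'], bins=3)
pct_positive = df.groupby(bins, observed=False)['output'].mean() * 100

# plot bin midpoints against the percentage of positive outputs
mids = [interval.mid for interval in pct_positive.index]
plt.plot(mids, pct_positive.values, marker='o')
plt.xlabel('Column1 (bin midpoint)')
plt.ylabel('% positive output')
plt.show()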

Plot only unique rows from large pandas dataframe

I have a pandas dataframe of 434300 rows with the following structure:
x y p1 p2
1 8.0 1.23e-6 10 12
2 7.9 4.93e-6 10 12
3 7.8 7.10e-6 10 12
...
.
...
4576 8.0 8.85e-6 5 16
4577 7.9 2.95e-6 5 16
4778 7.8 3.66e-6 5 16
...
...
...
434300 ...
with the key point being that for every block of varying x,y data there are p1 and p2 that do not vary. Note that these blocks of constant p1,p2 are of varying length so it is not simply a matter of slicing the data every n rows.
I would like to plot the values p1 vs p2 in a graph, but would only like to plot the unique points.
If I plot p1 vs p2 using:
In [1]: fig=plt.figure()
In [2]: ax=plt.subplot(111)
In [3]: ax.plot(df['p1'],df['p2'])
In [4]: len(ax.lines[0].get_xdata())
Out[4]: 434300
I see that matplotlib is plotting each individual line of data which is to be expected.
What is the neatest way to plot only the unique points from columns p1 and p2?
Here is a csv of a small example dataset that has all of the important features of my dataset.
Just drop the duplicates and plot:
df.drop_duplicates(subset=['p1', 'p2'])[['p1', 'p2']].plot()
You can slice the p1 and p2 columns from the data frame and then drop duplicates before plotting.
sub_df = df[['p1','p2']].drop_duplicates()
fig, ax = plt.subplots(1,1)
ax.plot(sub_df['p1'],sub_df['p2'])
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('exampleData.csv')
d = data[['p1', 'p2']].drop_duplicates()
plt.plot(d['p1'], d['p2'], 'o')
plt.show()
After looking at this answer to a similar question in R (whose data frames are what pandas DataFrames are modeled on), I found the pandas method pandas.DataFrame.drop_duplicates. If we modify my example code as follows:
In [1]: fig=plt.figure()
In [2]: ax=plt.subplot(111)
In [3]: df.drop_duplicates(subset=['p1','p2'],inplace=True)
In [4]: ax.plot(df['p1'],df['p2'])
In [5]: len(ax.lines[0].get_xdata())
Out[5]: 15
We see that this restricts df to only the unique points to be plotted. An important point is that you must pass a subset to drop_duplicates so that it only uses those columns to determine duplicate rows.

Using data from Python's pandas DataFrames to sample from normal distributions

I'm trying to sample from a normal distribution using means and standard deviations that are stored in pandas DataFrames.
For example:
means= numpy.arange(10)
means=means.reshape(5,2)
produces:
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
and:
sts=numpy.arange(10,20)
sts=sts.reshape(5,2)
produces:
0 1
0 10 11
1 12 13
2 14 15
3 16 17
4 18 19
How would I produce another pandas DataFrame with the same shape, but with values sampled from the normal distribution using the corresponding means and standard deviations?
i.e. position 0,0 of this new dataframe would sample from a normal distribution with mean=0 and standard deviation=10, and so on.
My function so far:
def make_distributions(self):
    num_data_points, num_species = self.means.shape
    samples = []
    for i, j in zip(self.means, self.stds):
        for k, l in zip(self.means[i], self.stds[j]):
            samples.append(numpy.random.normal(k, l, self.n))
will sample from the distributions for me but I'm having difficulty putting the data back into the same shaped dataframe as the mean and standard deviation dfs. Does anybody have any suggestions as to how to do this?
Thanks in advance.
You can use numpy.random.normal to sample from a normal distribution.
IIUC, then this might be easiest, taking advantage of broadcasting:
import numpy as np
np.random.seed(1) # only for demonstration
np.random.normal(means,sts)
array([[ 16.24345364, -5.72932055],
[ -4.33806103, -10.94859209],
[ 16.11570681, -29.52308045],
[ 33.91698823, -5.94051732],
[ 13.74270373, 4.26196287]])
Check that it works:
np.random.seed(1)
print(np.random.normal(0,10))
print(np.random.normal(1,11))
16.2434536366
-5.72932055015
If you need a pandas DataFrame:
import pandas as pd
pd.DataFrame(np.random.normal(means,sts))
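If means and sts are themselves DataFrames rather than NumPy arrays, a small sketch (the names means_df and sts_df are mine) that keeps their index and column labels on the sampled values:
import numpy as np
import pandas as pd

means_df = pd.DataFrame(np.arange(10).reshape(5, 2))
sts_df = pd.DataFrame(np.arange(10, 20).reshape(5, 2))

# broadcasting works on the underlying arrays; reuse the labels from means_df
samples = pd.DataFrame(np.random.normal(means_df.to_numpy(), sts_df.to_numpy()),
                       index=means_df.index, columns=means_df.columns)
print(samples)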
I will use a dictionary to construct this DataFrame. Suppose the indices and columns are the same for means and stds:
import numpy
import pandas as pd

means = numpy.arange(10)
means = pd.DataFrame(means.reshape(5, 2))
stds = numpy.arange(10, 20)
stds = pd.DataFrame(stds.reshape(5, 2))
samples = {}
for i in means.columns:
    col = {}
    for j in means.index:
        # .loc replaces the deprecated .ix accessor
        col[j] = numpy.random.normal(means.loc[j, i], stds.loc[j, i], 2)
    samples[i] = col
print(pd.DataFrame(samples))
# 0 1
#0 [0.0760974520154, 3.29439282825] [11.1292510583, 0.318246201796]
#1 [-25.4518020981, 19.2176263823] [17.0826945017, 9.36179435872]
#2 [14.5402484325, 8.33808246538] [6.96459947914, 26.5552235093]
#3 [0.775891790613, -2.09168601369] [2.38723023677, 15.8099942902]
#4 [-0.828518484847, 45.4592922652] [26.8088977308, 16.0818556353]
Or reset the dtype of a DataFrame and reassign the values:
import itertools
samples = means * 0
samples = samples.astype(object)
for i, j in itertools.product(means.index, means.columns):
    # .at replaces the removed set_value, and .loc replaces the removed .ix
    samples.at[i, j] = numpy.random.normal(means.loc[i, j], stds.loc[i, j], 2)
