bin value of histograms from grouped data - python

I am a beginner in Python and I am making separate histograms of travel distance per departure hour. I'm working with about 2,500 rows of data; Distance is float64 and Departuretime is str. For further calculations I'd like to have the value of each bin, for all histograms.
Up until now, I have the following:
df['Distance'].hist(by=df['Departuretime'], color='red', edgecolor='black',
                    figsize=(15, 15), sharex=True, density=True)
This creates, in my case, a figure with 21 small histograms.
Of all these histograms I want to know the y-axis value of each bar, preferably in a dataframe with the distance binning as rows and the hours as columns.
With a single histogram, I'd prepend counts, bins, bars = to the line and the variable counts would contain the data I was looking for; however, in this case that does not work.
Ideally I'd like a dataframe or list of some sort for each histogram, containing the density values of the bins. I hope someone can help me out! Big thanks in advance!

First of all, note that the bins used in the different histograms you are generating don't have the same edges (you can see this because, even with sharex=True, the resulting bars don't have the same width). In every case you get 10 bins (the default), but they are not the same 10 bins.
This makes it impossible to combine them all in a single table in any meaningful way. You could pass a fixed list of bin edges as the bins parameter to standardize this.
Alternatively, I suggest you compute a new column that records which bin each row belongs to; this also unifies the bin calculation.
You can do this with the cut function, which gives you the same freedom to choose the number of bins or the specific bin edges as hist does.
df['DistanceBin'] = pd.cut(df['Distance'], bins=10)
Then, you can use pivot_table to obtain a table of counts with DistanceBin as rows and Departuretime as columns, as you asked (passing values picks the column to count, so you get a flat table instead of hierarchical columns):
df.pivot_table(index='DistanceBin', columns='Departuretime', values='Distance', aggfunc='count')
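Putting the whole answer together, a minimal sketch with invented sample data standing in for the travel dataset (column names follow the question; the values and the three departure hours are made up). The last two lines also convert the counts into the density values the question asked for, mirroring density=True by dividing each column by its group total times the bin width:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the ~2,500-row travel dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Distance': rng.exponential(scale=5.0, size=200),
    'Departuretime': rng.choice(['07', '08', '09'], size=200),
})

# Unified bin edges across all groups
df['DistanceBin'] = pd.cut(df['Distance'], bins=10)

counts = df.pivot_table(index='DistanceBin', columns='Departuretime',
                        values='Distance', aggfunc='count',
                        fill_value=0, observed=False)

# To mirror density=True: divide each column by (group total * bin width)
widths = pd.Series([iv.length for iv in counts.index], index=counts.index)
density = counts.div(counts.sum(axis=0), axis=1).div(widths, axis=0)
```

With this, the density values in each column integrate to 1 over the bins, just like each of the 21 small histograms.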

Related

Plotly Express: How do I add a second colormap to heatmap object?

I created a heatmap where the correlations of two entities are visualized. However, as the matrix is symmetric, I added significance values below the diagonal for higher information density. As those values are usually far smaller than the correlation coefficients, I want to use a second colormap to differentiate between the upper and lower triangles of the matrix. The code is the following:
fig = px.imshow(data,
                labels=dict(x="Correlation of Returns", y="", color="PCC"),
                x=domain,
                y=domain,
                color_continuous_scale=px.colors.diverging.balance,
                zmin=-1, zmax=1)
The data object is simply my nxn matrix as a list of lists, and domain holds my label values. The resulting figure already uses one colormap. Is there a way to add a second one and apply it to the values below the diagonal? I didn't find a solution online yet. Thanks in advance!
Note: I am using Dash, so I may need to stick to plotly figures and won't be able to use e.g. matplotlib

Is there a Python package that can trace a curve with a Gaussian lineshape over several x and y values?

My apologies for my ignorance in advance; I've only been learning Python for about two months. Every example question that I've seen on Stack Overflow seems to discuss a single distribution over a series of data, but not one distribution per data point with band broadening.
I have some (essentially) infinitely thin bars at values x with heights y that I need to run a line over so that it looks like an experimental spectrum.
The bars are obtained from a table of data; the curve is what I'm trying to make.
I am doing some TD-DFT work to calculate a theoretical UV/visible spectrum. It will output absorbance strengths (y-values, i.e., heights) for specific wavelengths of light (x-values). Theoretically, these are typically plotted as infinitely-thin bars, though we experimentally obtain a curve instead. The theoretical data can be made to appear like an experimental spectrum by running a curve over it that hugs y=0 and has a Gaussian lineshape around every absorbance bar.
I'm not sure if there's a feature that will do this for me, or if I need to do something like make a loop summing Gaussian curves for every individual absorbance, and then plot the resulting formula.
Thanks for reading!
It looks like my answer was to use Seaborn to do a kernel density estimation. Because a KDE isn't weighted and only considers the density of x-values, I had to create a small loop to build a new list consisting of the x-entries, each repeated in proportion to its intensity:
list5 = []
for j in range(len(list1)):                   # list1 contains x-values
    list5.append([list1[j]] * int(list3[j]))  # list3 holds integer intensities (see below)

# now to drop the brackets from within the list:
list4 = []
for k in range(len(list5)):                   # list5 contains intensity-proportional x-values
    for l in list5[k]:
        list4.append(l)                       # now a flat list rather than a list of lists
(I had to make another list earlier of the intensities, multiplied by 1,000,000 to make them all integers):
list3 = [i * 1000000 for i in list2]  # list3 now contains integer intensities
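The "loop summing Gaussian curves" idea from the question can also be done directly, without the KDE detour, and it gives you exact control over the linewidth. A minimal sketch; the wavelengths, strengths, and sigma are invented placeholders for the TD-DFT output:

```python
import numpy as np

wavelengths = np.array([230.0, 280.0, 310.0])  # x-values of the bars (nm)
strengths = np.array([0.8, 1.0, 0.4])          # bar heights (absorbance strengths)
sigma = 10.0                                   # Gaussian width (nm); tune to taste

x = np.linspace(200.0, 360.0, 801)
# One Gaussian per bar, centred on its wavelength and scaled by its height,
# summed into a single broadened curve that hugs y=0 between peaks
y = (strengths[:, None]
     * np.exp(-(x[None, :] - wavelengths[:, None]) ** 2 / (2 * sigma ** 2))
     ).sum(axis=0)
```

Plotting x against y (e.g. with matplotlib) then gives the smooth envelope over the stick spectrum.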

Visualizing line density using a color map, versus alpha transparency

I want to visualize 200k-300k lines, maybe up to 1 million, where each line is a cumulative sequence of integer values that grows over time, one value per day over roughly 1000 days. The final values of each line range from 0 to 500.
It's likely that some lines will appear in my population thousands of times, others hundreds, others tens of times, and some outliers will be unique. For plotting large numbers of points in an xy plane, alpha transparency can be a solution in some cases, but it isn't great if you want to robustly distinguish overplot density. A solution that scales better is something like hexbin, which bins the space and lets you use a color map to plot the density of points in each bin.
I haven’t been able to find a ready-made solution in python (ideally) or R for doing the analogous thing for plotting lines instead of points.
The following code demonstrates the issue using a small sample (n=1000 lines): can anyone propose how I might drop the alpha value approach in favor of a solution that allows me to introduce a color map for line density, using a transform I can control?
df = pd.DataFrame(np.random.randint(2,size=(100,1000)))
df.cumsum().plot(legend=False, color='grey', alpha=.1, figsize=(12,8))
In response to a request, this is what a sample plot looks like now: in the wide dark band, 10 overplots fully saturate the line, so segments of lines overplotted 10, 100, and 1000 times are indistinguishable.
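One sketch of the hexbin-style idea applied to lines: treat every (day, value) vertex of every line as a point, bin them into a 2-D histogram, and map counts through a transform you control (log1p here) to a colormap. The bin counts are arbitrary choices, and this counts line vertices rather than line segments, which is usually acceptable when each line has one sample per day:

```python
import numpy as np
import matplotlib.pyplot as plt

# Same synthetic data as the question: 1000 lines of 100 cumulative daily steps
rng = np.random.default_rng(0)
lines = rng.integers(0, 2, size=(100, 1000)).cumsum(axis=0)

# Every line contributes one (day, value) point per day
days = np.broadcast_to(np.arange(100)[:, None], lines.shape)
H, xedges, yedges = np.histogram2d(days.ravel(), lines.ravel(), bins=[100, 60])

plt.figure(figsize=(12, 8))
plt.pcolormesh(xedges, yedges, np.log1p(H).T, cmap='viridis')
plt.colorbar(label='log(1 + lines per bin)')
```

Swapping np.log1p for another transform (sqrt, a clipped linear map, etc.) is the "transform I can control" knob; for the full 1M-line case, libraries like datashader do essentially this aggregation at scale.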

Plot several continuous variables according to the nominal values of two variables using Python

I would like to create a figure that shows how much money people earned in a game (continuous variable) as a function of the categorical values of three other variables. The first variable is whether people were included or excluded prior to the money game, the second is whether people knew their decision-making partner, and the last is the round of the game (participants played 5 rounds with a known co-player and 5 rounds with an unknown co-player). I know how to draw plots as a function of the values of two categorical variables using FacetGrid (see below), but I did not manage to add another layer to it.
g = sns.FacetGrid(df_long, col='XP_Social_Condition', size=5, aspect=1)
g.map(sns.boxplot, 'DM partner', 'Money', palette=col_talk)
I have created two dataframe versions: my initial one and a melted one. I have also tried to create two plots together using f, (ax_l, ax_r) = , but this does not seem to accept FacetGrid plots as subplots. The plot I would like as each subplot shows one condition, one for the known and one for the unknown co-player. I am happy to share the data if it would help.
I have now tried the proposed solution:
grid = sns.FacetGrid(melted_df, hue='DM partner', col='XP_Social_Condition')
grid.map(sns.violinplot, 'Round', 'Money')
But it still does not work: the third (hue) variable does not clearly distinguish the different conditions.
Thank you very much for your help.
OK, so you want to create one plot of continuous data depending on three different categorical variables?
I think what you're looking for is:
grid = sns.FacetGrid(melted_df, col='XP_Social_Condition')
grid.map(sns.violinplot, 'Round', 'Money', 'DM partner').add_legend()
The col argument results in two plots, one for each value of XP_Social_Condition. The three values passed to grid.map split the data so 'Round' becomes the x-axis, 'Money' the y-axis, and 'DM partner' the color. You can play around and swap 'DM partner', 'XP_Social_Condition' and 'Round'.
The result should now show one violin plot per condition, colored by 'DM partner' (or by 'Round', if you swap them).
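On recent seaborn versions the same figure can also be produced with catplot, which wraps FacetGrid. A self-contained sketch with synthetic data standing in for the melted game data (column names follow the question; the values are invented):

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Invented stand-in for melted_df
rng = np.random.default_rng(0)
n = 400
melted_df = pd.DataFrame({
    'XP_Social_Condition': rng.choice(['included', 'excluded'], size=n),
    'DM partner': rng.choice(['known', 'unknown'], size=n),
    'Round': rng.integers(1, 6, size=n),
    'Money': rng.normal(10.0, 3.0, size=n),
})

# One facet per condition, violins per round, colored by partner knowledge
g = sns.catplot(data=melted_df, kind='violin', x='Round', y='Money',
                hue='DM partner', col='XP_Social_Condition')
```

Using keyword arguments here avoids the positional-argument mapping of grid.map, which tends to be easier to read and less error-prone.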

How can I normalize a histogram such that the sum of the heights is equal to 1?

I generated the figure below using a call to matplotlib.pyplot.hist in which I passed the kwarg normed=True:
Upon further research, I realized that this kind of normalization works in such a way that the integral of the histogram equals 1. How can I plot the same data such that the sum of the heights of the bars equals 1?
In other words, I want each bar to represent the proportion of the whole that its values contain.
I'm not sure if there's a straightforward way, but
you can manually divide all bar heights by the length of the input (this assumes numpy as np and matplotlib.pyplot as plt are imported):
inp = np.random.normal(size=1000)
h = plt.hist(inp)
Which gives you the counts and bin edges in h. Now, you can do:
plt.bar(h[1][:-1], h[0] / float(len(inp)), width=np.diff(h[1]), align='edge')
and get
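An alternative to the manual division above (not from the original answer, but a standard matplotlib pattern): pass a weights array so each sample contributes 1/N to its bar, making the heights sum to 1 directly:

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(size=1000)
# Each of the N samples contributes 1/N, so the bar heights sum to 1
counts, bins, patches = plt.hist(data, weights=np.ones_like(data) / len(data))
```

This keeps the bar positions and widths exactly as hist computed them, with no second plotting call needed.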
