Python: Binning and Visualization with Pandas

I'm pretty new to Python.
I'm trying to make an age interval column for my dataframe:
df['age_interval'] = pd.cut(x=df['Age'], bins=[18, 22, 27, 32, 37, 42, 47, 52, 57, 60], include_lowest=True)
And I added my graph:
Problem: In the visualization the [18-22] bin is displayed as [17.99-22]
What I want: I want it to display 18-22.
Below is the plot code:
plt.figure(figsize=(15,8))
dist = sns.barplot(x=ibm_ages.index, y=ibm_ages.values, color='blue')
dist.set_title('IBM Age Distribution', fontsize = 24)
dist.set_xlabel('Age Range', fontsize=18)
dist.set_ylabel('Total Count', fontsize=18)
sizes=[]
for p in dist.patches:
    height = p.get_height()
    sizes.append(height)
    dist.text(p.get_x()+p.get_width()/2.,
              height + 5,
              '{:1.2f}%'.format(height/total*100),
              ha="center", fontsize= 8)
plt.tight_layout(h_pad=3)
plt.show()
Thank you

That may be because the column is of type float64 and you want integers. Try:
df['age_interval'] = pd.cut(x=df['Age'].astype('Int64'), bins=[18, 22, 27, 32, 37, 42, 47, 52, 57, 60], include_lowest=True)
You can use .astype('Int64') whenever you want to convert float64 to Int64.
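If the cast does not change the label, another option is to give pd.cut explicit labels. As far as I can tell, with include_lowest=True pandas moves the left edge of the first bin slightly below 18 so that 18 itself falls inside it, which would explain the 17.99 in the plot. A minimal sketch, assuming a small made-up Age column and "lo-hi" label strings (both are my own illustration, not from the question):

```python
import pandas as pd

# Hypothetical sample ages standing in for the real df['Age'] column.
df = pd.DataFrame({"Age": [18, 21, 22, 23, 30, 41, 59, 60]})

bins = [18, 22, 27, 32, 37, 42, 47, 52, 57, 60]

# One label per bin, built from consecutive edges: "18-22", "22-27", ...
# Bins stay right-closed, so an age of 22 still lands in "18-22".
labels = [f"{lo}-{hi}" for lo, hi in zip(bins[:-1], bins[1:])]

df["age_interval"] = pd.cut(
    df["Age"], bins=bins, labels=labels, include_lowest=True
)
print(df)
```

The value counts of age_interval can then be passed to the barplot as before, and the x tick labels will show the strings from labels.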

Related

plt grid ALPHA parameter not working in matplotlib

I have the following function, which generates a simple chart in matplotlib. As the image shows, the alpha parameter does not seem to work. This happens in VS Code; if I test it in a notebook, it works fine.
What I need is the same output in VS Code.
data:
,hora,id
0,0,10
1,1,3
2,2,2
3,3,3
4,4,5
5,5,3
6,6,11
7,7,32
8,8,41
9,9,71
10,10,75
11,11,70
12,12,57
13,13,69
14,14,50
15,15,73
16,16,47
17,17,64
18,18,73
19,19,54
20,20,45
21,21,43
22,22,34
23,23,27
code:
import pandas as pd
from matplotlib import pyplot as plt
dfhoras=pd.read_clipboard(sep=',')
def questionsHour(dfhoras):
    x = dfhoras['hora']
    y = dfhoras['id']
    horas=[x for x in range(24)]
    plt.figure(facecolor="w")
    plt.figure(figsize=(15,3))
    plt.rcParams['axes.spines.top'] = False
    plt.bar(x, y,linewidth=3,color="#172a3d")
    plt.xticks(horas,fontweight='bold',color="#e33e31",fontsize=9)
    plt.yticks(fontweight='bold',color="#e33e31",fontsize=9)
    plt.grid(color="#172a3d", linestyle='--',linewidth=1,axis='y',alpha=0.15)
    # here I create the labels for the points
    for x,y in zip(x,y):
        label = "{:.2f}".format(y)
        plt.annotate(label,
                     (x,y),
                     textcoords="offset points",
                     xytext=(0,5),
                     ha='center',
                     fontsize=9,
                     fontweight='bold',
                     color="#e33e31")
    plt.savefig('questions1.png',dpi=600,transparent=True,bbox_inches='tight')
questionsHour(dfhoras)
this is the result in vs code
and this is the result in a notebook
Make sure the environment packages used by VSCode are updated, as they aren't necessarily the same as those being used by Jupyter.
Tested in python 3.8.11, pandas 1.3.2, matplotlib 3.4.3
I was originally testing with matplotlib 3.4.2, which seems to have a bug and would not set weight='bold', so if VSCode is using a different package version, there could be a bug with alpha=0.15.
The OP uses plt.figure(facecolor="w") and plt.figure(figsize=(15,3)), which creates two different figures (not noticeable with inline plots, but two windows will open if using interactive plots). It should be plt.figure(facecolor="w", figsize=(15, 3)).
The following code uses the object oriented approach with axes, which makes sure all methods are applied to the correct axes being plotted.
Plot the dataframe directly with pandas.DataFrame.plot, which uses matplotlib as the default backend, and returns an axes.
Annotations are made using matplotlib.axes.Axes.bar_label (called as ax.bar_label below), which is available from matplotlib 3.4.
import pandas as pd
# test data
data = {'hora': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], 'id': [10, 3, 2, 3, 5, 3, 11, 32, 41, 71, 75, 70, 57, 69, 50, 73, 47, 64, 73, 54, 45, 43, 34, 27]}
df = pd.DataFrame(data)
# plot function
def questionsHour(df):
    ax = df.plot(x='hora', y='id', figsize=(15, 3), linewidth=3, color="#172a3d", rot=0, legend=False, xlabel='', kind='bar', width=0.75)
    ax.set_xticklabels(ax.get_xticklabels(), weight='bold', color="#e33e31", fontsize=9)
    ax.set_yticks(ax.get_yticks()) # prevents warning for next line
    ax.set_yticklabels(ax.get_yticks(), weight='bold', color="#e33e31", fontsize=9)
    ax.grid(color="#172a3d", linestyle='--', linewidth=1, axis='y', alpha=0.15)
    ax.spines['top'].set_visible(False)
    ax.bar_label(ax.containers[0], fmt='%.2f', fontsize=9, weight='bold', color="#e33e31")
questionsHour(df)

How to rearrange small rectangles into a big rectangle with a bin packing algorithm

I'm new to Python and I am trying to rearrange rectangles built from raw data. I'm using a bin packing algorithm and I want to sort the rectangles by color, as shown below. Can anyone help?
Output Now:
Expected Output:
Only small changes to the code are needed. Please try the code below:
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
fig = plt.figure()
ax = fig.add_subplot(111)
temp_y=0
i=0
layout_height = 300
layout_width = 300
ax.set_xlim([0, layout_width])
ax.set_ylim([0, layout_height])
area_height = [80, 75, 50, 60, 52, 72, 100, 120, 150]
area_width = [50, 46, 52, 52, 50, 48, 25, 40, 48]
# stack each rectangle on top of the previous one along the y-axis
for (i,j) in zip(area_height, area_width):
    print(i,j)
    ax.add_patch(Rectangle((0, temp_y), float(j), float(i),edgecolor ='black',facecolor = 'red'))
    temp_y = temp_y + float(i)
plt.show()
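The loop above only stacks the rectangles in one column. Since the expected output image is not visible here, the following is just a sketch of one common bin packing heuristic (shelf packing, tallest rectangle first), not necessarily the layout the question has in mind. The function name shelf_pack and its parameters are my own, chosen for illustration:

```python
def shelf_pack(sizes, bin_width):
    """Place (width, height) rectangles on shelves; return a list of (x, y, width, height)."""
    placements = []
    x = 0             # left edge of the next rectangle on the current shelf
    y = 0             # bottom edge of the current shelf
    shelf_height = 0  # height of the tallest rectangle on the current shelf
    # Tallest first, so each shelf is only as tall as its first rectangle.
    for w, h in sorted(sizes, key=lambda s: s[1], reverse=True):
        if w > bin_width:
            raise ValueError(f"rectangle of width {w} does not fit in {bin_width}")
        if x + w > bin_width:
            # The current shelf is full: open a new one above it.
            y += shelf_height
            x = 0
            shelf_height = 0
        placements.append((x, y, w, h))
        x += w
        shelf_height = max(shelf_height, h)
    return placements


area_height = [80, 75, 50, 60, 52, 72, 100, 120, 150]
area_width = [50, 46, 52, 52, 50, 48, 25, 40, 48]
placements = shelf_pack(list(zip(area_width, area_height)), bin_width=300)
for x, y, w, h in placements:
    print(x, y, w, h)
```

Each tuple can then be drawn with Rectangle((x, y), w, h, ...) in place of the fixed (0, temp_y) position, and facecolor can be chosen per rectangle if the colors are meant to group them.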

How to set the hue range for a numeric variable using a colored bubble plot in seaborn, python?

I'm trying to use seaborn to create a colored bubbleplot of 3-D points (x,y,z), each coordinate being an integer in range [0,255]. I want the axes to represent x and y, and the hue color and size of the scatter bubbles to represent the z-coordinate.
The code:
import seaborn
seaborn.set()
import pandas
import matplotlib.pyplot
x = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200]
y = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200]
z = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200]
df = pandas.DataFrame(list(zip(x, y, z)), columns =['x', 'y', 'z'])
ax = seaborn.scatterplot(x="x", y="y",
                         hue="z",
                         data=df)
matplotlib.pyplot.xlim(0,255)
matplotlib.pyplot.ylim(0,255)
matplotlib.pyplot.show()
gets me pretty much what I want:
This however makes the hue range be based on the data in z. I instead want to set the range according to the range of the min and max z values (as 0,255), and then let the color of the actual points map onto that range accordingly (so if a point has z-value 50, then that should be mapped onto the color represented by the value 50 in the range [0,255]).
My summarized question:
How to manually set the hue color range of a numerical variable in a scatterplot using seaborn?
I've looked thoroughly online on many tutorials and forums, but have not found an answer. I'm not sure I've used the right terminology. I hope my message got across.
Following @JohanC's suggestion of using hue_norm was the solution. I first tried removing the hue= parameter and using only hue_norm=, which didn't produce any colors at all (which makes sense).
Naturally, one should use both the hue= and hue_norm= parameters.
import seaborn
seaborn.set()
import pandas
import matplotlib.pyplot
x = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200]
y = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200]
z = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 255]
df = pandas.DataFrame(list(zip(x, y, z)), columns =['x', 'y', 'z'])
ax = seaborn.scatterplot(x="x", y="y",
                         hue="z",
                         hue_norm=(0,255), # <------- the solution
                         data=df)
matplotlib.pyplot.xlim(0,255)
matplotlib.pyplot.ylim(0,255)
matplotlib.pyplot.show()

Matplotlib pandas - Plotting average line on a scatter plot?

I have a scatter plot created from two columns of a pandas data frame and I would like to add a line across each axis representing the average. Is this possible with a scatter plot?
plt.title("NFL Conversion Rates", fontsize=40)
# simulating a pandas df['team'] column
types = df.Tm
x_coords = df['3D%']
y_coords = df['4D%']
binsy = [15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
binsx = [30,35,40,45,50,55]
avg_y = y_coords.mean()
avg_y = round(avg_y, 1)
display(avg_y)
avg_x = x_coords.mean()
avg_x = round(avg_x, 1)
display(avg_x)
for i,type in enumerate(types):
    x = x_coords[i]
    y = y_coords[i]
    plt.scatter(x, y, s=30, marker='o', edgecolor='black', cmap='purple', linewidth=1, alpha = 0.5)
    plt.text(x+0.2, y+0.1, type, fontsize=14)
plt.xlabel('3rd Down Conversion Percentage',fontsize=30)
plt.ylabel('4th Down Conversion Percentage', fontsize=30)
plt.xticks(binsx)
plt.yticks(binsy)
You can try
plt.axvline(<value>, color='red', ls='--') and plt.axhline(<value>, color='red', ls='--'). Replace <value> with the position where you want each line, e.g. avg_x for axvline and avg_y for axhline.
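Put together with the averages from the question, this could look roughly like the sketch below. The small dataframe is made up so the example runs on its own (the real df and its values are not shown in the question), and the per-team text labels are left out for brevity:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the question's df (team, 3rd and 4th down rates).
df = pd.DataFrame({
    "Tm": ["A", "B", "C", "D"],
    "3D%": [35.0, 40.0, 45.0, 50.0],
    "4D%": [20.0, 60.0, 40.0, 50.0],
})

avg_x = round(df["3D%"].mean(), 1)
avg_y = round(df["4D%"].mean(), 1)

fig, ax = plt.subplots()
ax.scatter(df["3D%"], df["4D%"], s=30, edgecolor="black", alpha=0.5)

# Vertical line at the mean of x, horizontal line at the mean of y.
vline = ax.axvline(avg_x, color="red", ls="--")
hline = ax.axhline(avg_y, color="red", ls="--")

ax.set_xlabel("3rd Down Conversion Percentage")
ax.set_ylabel("4th Down Conversion Percentage")
plt.show()
```

Both lines span the whole axes, so they stay in place even if the axis limits change afterwards.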

Why does subplot produce empty space on xaxis?

I have values for the x and y axes and am trying to produce a simple line graph in a subplot. Here is a simple, basic example that shows the problem.
import matplotlib.pyplot as plt
x1 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58,
59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]
y1 = [24.892730712890625, 25.268890380859375, 26.677642822265625, 28.294586181640625, 29.477203369140625,
30.61334228515625, 31.656219482421875, 32.371551513671875, 31.412261962890625, 31.973724365234375, 31.563812255859375,
30.72821044921875, 29.249237060546875, 26.759185791015625, 26.081024169921875, 25.27996826171875, 24.69805908203125,
24.92388916015625, 24.76177978515625, 24.385498046875, 24.093231201171875, 23.92156982421875, 23.788543701171875,
23.67657470703125, 23.581085205078125, 23.92095947265625, 25.90557861328125, 27.767333984375, 29.196136474609375,
30.25726318359375, 31.262786865234375, 32.2996826171875, 32.92620849609375, 33.32098388671875, 33.228057861328125,
30.495269775390625, 29.17010498046875, 28.04144287109375, 27.326202392578125, 24.904205322265625, 23.775054931640625,
24.1328125, 24.195343017578125, 23.751312255859375, 23.55316162109375, 23.459228515625, 23.304534912109375,
23.233062744140625, 23.093170166015625, 23.15887451171875, 25.13739013671875, 27.397430419921875, 28.923431396484375,
29.945037841796875, 30.976715087890625, 31.93109130859375, 32.665435791015625, 32.701324462890625, 31.212799072265625,
30.201507568359375, 29.591888427734375, 28.002410888671875, 27.72802734375, 27.371002197265625, 26.072509765625,
25.39373779296875, 25.196044921875, 25.2684326171875, 24.815582275390625, 24.27130126953125, 23.758575439453125,
23.49615478515625, 23.3907470703125]
plt.subplot(513)
plt.plot(x1, y1, 'b-')
plt.grid(True)
plt.show()
Everything works fine, but the output plot has some empty space on the x-axis. Here is the image that shows the problem:
Any help to solve the issue is appreciated.
You could use xlim to override this behaviour, which is introduced by the "AutoLocator" in the background:
plt.subplot(513, xlim=(0,72))
# or
plt.subplot(513, xlim=(x1[0], x1[-1]))
You could also adjust the Locator as shown in this example (and others):
http://matplotlib.org/examples/pylab_examples/major_minor_demo2.html
By default, in matplotlib versions <2.0, matplotlib will choose "even" limits for the axes. For example:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1977)
x = np.linspace(2.1, 22.8, 1000)
y = np.random.normal(0, 1, x.size).cumsum()
fig, ax = plt.subplots()
ax.plot(x, y)
plt.show()
If you'd prefer to have the limits strictly set to the data limits, you can use ax.axis('tight'). However, this will set both the x and y-limits to be "tight", which often is not what you want.
In this case, you'd more likely want to set just the x-limits to be "tight". An easy way to do this is to use ax.margins(x=0). margins specifies that the autoscaling should pad things by a percentage of the data range. Therefore, by setting x=0, we're effectively making the x-limits identical to the data limits in the x-direction.
For example:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1977)
x = np.linspace(2.1, 22.8, 1000)
y = np.random.normal(0, 1, x.size).cumsum()
fig, ax = plt.subplots()
ax.plot(x, y)
ax.margins(x=0)
plt.show()
You could also accomplish this by using ax.autoscale(axis='x', tight=True).
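A minimal sketch of that variant, reusing the same random data as the examples above. Only the x-axis is tightened, the y-axis keeps its default padding:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1977)
x = np.linspace(2.1, 22.8, 1000)
y = np.random.normal(0, 1, x.size).cumsum()

fig, ax = plt.subplots()
ax.plot(x, y)
# Tighten only the x-limits to the data range.
ax.autoscale(axis='x', tight=True)
xmin, xmax = ax.get_xlim()
print(xmin, xmax)
plt.show()
```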
However, an additional advantage of margins is that you'll often want to have the y-axis pad by a percentage of the data range as well. Therefore, it's common to want to do something like:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1977)
x = np.linspace(2.1, 22.8, 1000)
y = np.random.normal(0, 1, x.size).cumsum()
fig, ax = plt.subplots()
ax.plot(x, y)
ax.margins(x=0, y=0.05)
plt.show()
