Matplotlib Grouped Bar graphs not working properly - python

I was playing with grouped bar graphs in matplotlib. I am trying to plot grouped bar graphs with proper width and spacing. The data consists of the median salaries of JavaScript developers, Python developers, and all developers, but I am not able to group them properly.
from matplotlib import pyplot as plt
import numpy as np
ages_x = [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55]
x_indexes = np.arange(len(ages_x))
width = 0.9
dev_y = [17784, 16500, 18012, 20628, 25206, 30252, 34368, 38496, 42000, 46752,
49320, 53200, 56000, 62316, 64928, 67317, 68748, 73752, 77232, 78000,
78508, 79536, 82488, 88935, 90000, 90056, 95000, 90000, 91633, 91660,
98150, 98964, 100000, 98988, 100000, 108923, 105000, 103117]
plt.bar(x_indexes- width, dev_y, color='#444444',width = width, label='All Devs')
py_dev_y = [20046, 17100, 20000, 24744, 30500, 37732, 41247, 45372, 48876,
53850, 57287, 63016, 65998, 70003, 70000, 71496, 75370, 83640,
84666, 84392, 78254, 85000, 87038, 91991, 100000, 94796, 97962,
93302, 99240, 102736, 112285, 100771, 104708, 108423, 101407,
112542, 122870, 120000]
plt.bar(x_indexes, py_dev_y, width = width,label='Python')
js_dev_y = [16446, 16791, 18942, 21780, 25704, 29000, 34372, 37810, 43515,
46823, 49293, 53437, 56373, 62375, 66674, 68745, 68746, 74583,
79000, 78508, 79996, 80403, 83820, 88833, 91660, 87892, 96243,
90000, 99313, 91660, 102264, 100000, 100000, 91660, 99240, 108000,
105000, 104000]
plt.bar(x_indexes+width, js_dev_y,width = width, label='JavaScript')
plt.legend()
plt.savefig('plot.png')
plt.xlabel('Ages')
plt.ylabel('Median Salary (USD)')
plt.title('Median Salary (USD) by Age')
plt.xkcd()
This is how my graph currently looks:
This is how I want it to look. Don't focus on the colours.

To expand on the comment by ImportanceOfBeingErnest, here is the output from using width = 0.27 instead of width = 0.9:
Also note that this is without the plt.xkcd() - which is not really appropriate for this plot because it obfuscates the data and doesn't handle the bar offsetting correctly:
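For reference, here is a minimal sketch of the narrower-bar version, using only the first five ages from the question's data to keep it short. The three bars of a group sit at x - width, x, and x + width, so 3 * width has to stay below 1 for neighbouring groups not to overlap. The Agg backend line and the fig/ax variable names are my own assumptions, not part of the original code.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
from matplotlib import pyplot as plt

# First five entries of the question's data
ages_x = [18, 19, 20, 21, 22]
dev_y = [17784, 16500, 18012, 20628, 25206]
py_dev_y = [20046, 17100, 20000, 24744, 30500]
js_dev_y = [16446, 16791, 18942, 21780, 25704]

x_indexes = np.arange(len(ages_x))
width = 0.27  # three bars per group: 3 * 0.27 = 0.81 < 1, so groups stay separated

fig, ax = plt.subplots()
ax.bar(x_indexes - width, dev_y, width=width, color='#444444', label='All Devs')
ax.bar(x_indexes, py_dev_y, width=width, label='Python')
ax.bar(x_indexes + width, js_dev_y, width=width, label='JavaScript')

# Show the real ages on the axis instead of the 0..n-1 positions
ax.set_xticks(x_indexes)
ax.set_xticklabels(ages_x)
ax.set_xlabel('Ages')
ax.set_ylabel('Median Salary (USD)')
ax.set_title('Median Salary (USD) by Age')
ax.legend()
```

If you save the figure, call fig.savefig('plot.png') after the labels and title are set, otherwise they will not appear in the file. Likewise, plt.xkcd() only affects artists created after it is called, so putting it at the end of the script has no effect on the plot.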

Related

Is there a numpy alternative to this for loop problem?

I have 3 arrays of the same length:
import numpy as np
weights = np.array([10, 14, 18, 22, 26, 30, 32, 34, 36, 38, 40])
resistances = np.array([15, 16.5, 18, 19.5, 21, 24, 27, 30, 33, 36, 39])
depths = np.array([0,1,2,3,4,5,6,7,8,9,10])
I want to take each item in weights, then find the nearest match that is >= this item in resistances, and then using the index of this nearest match I want to return the corresponding value from depths i.e. depths[index].
BUT, with the additional condition that if nothing in resistances is >= the item, just return the last value in depths. I then want to populate a list with the results.
Is there a better way than the for loop approach below? I would like to avoid the loop.
SWP = []
for w in weights:
    if len(depths[w<=resistances]) == 0:
        swp=depths[-1]
    else:
        swp = np.min(depths[w<=resistances])
    SWP.append(swp)
SWP
You can .clip the indices that np.searchsorted produces with len(resistances)-1:
depths[
np.searchsorted(resistances, weights).clip(max=len(resistances)-1)
]
So any index larger than the last one becomes the last one.
Alternative idea (like np.searchsorted itself, this only works if your resistances are sorted): clip the weights with the maximum of resistances:
depths[
np.searchsorted(resistances, weights.clip(max=resistances.max()))
]
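As a quick sanity check, the sketch below compares the clipped np.searchsorted version against the loop from the question on the posted data. The function names swp_loop and swp_searchsorted are mine. The two agree here because resistances is sorted and depths is ascending, so the minimum of the masked depths is the depth at the first matching index.

```python
import numpy as np

weights = np.array([10, 14, 18, 22, 26, 30, 32, 34, 36, 38, 40])
resistances = np.array([15, 16.5, 18, 19.5, 21, 24, 27, 30, 33, 36, 39])
depths = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

def swp_loop(weights, resistances, depths):
    # Reference implementation: the loop from the question
    out = []
    for w in weights:
        candidates = depths[w <= resistances]
        out.append(candidates.min() if len(candidates) else depths[-1])
    return np.array(out)

def swp_searchsorted(weights, resistances, depths):
    # side="left" returns the first index i with resistances[i] >= w
    idx = np.searchsorted(resistances, weights, side="left")
    # An index of len(resistances) means "no match"; clip it to the last position
    return depths[idx.clip(max=len(resistances) - 1)]

loop_result = swp_loop(weights, resistances, depths)
vector_result = swp_searchsorted(weights, resistances, depths)
print(vector_result)
```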
Usually to do what you're talking about you want to create a function that can be mapped over a list.
import numpy as np
weights = np.array([10, 14, 18, 22, 26, 30, 32, 34, 36, 38, 40])
resistances = np.array([15, 16.5, 18, 19.5, 21, 24, 27, 30, 33, 36, 39])
depth = np.array([0,1,2,3,4,5,6,7,8,9,10])
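def evaluate_weight(w):
    # depths whose resistance is >= w; fall back to the last depth if none qualify
    candidates = depth[resistances >= w]
    return np.min(candidates) if len(candidates) else depth[-1]
SWP = list(map(evaluate_weight, weights))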

Plotly Sankey Diagram, aligning nodes

I created a Sankey diagram using plotly (python) and it looks like this:
As you can see, some links overlap, but this plot can be easily changed (manually) to this:
I think the overlapping result comes from the 3rd column of nodes being centered on Y. Is there a way for me to align the 3rd column to the top (or bottom) to fix this problem? (or any other fix is also welcome of course)
The only thing I've found is setting x and y for nodes manually, but I seem to not be able to only set the y, and this also would involve calculating all those coordinates.
Thank you for the help!
Edit: My code
import plotly.graph_objects as go
sources = [23, 23, 23, 23, 23, 23, 23, 24, 8, 23, 23, 23, 30, 17, 5, 12, 20, 20, 23, 18, 18, 18, 18, 23, 33, 33, 33, 33, 33, 23, 16, 16, 23]
targets = [7, 13, 6, 21, 1, 2, 15, 23, 23, 32, 25, 19, 23, 23, 23, 23, 27, 22, 20, 31, 4, 0, 3, 18, 11, 26, 9, 14, 28, 33, 29, 10, 16]
values = [50.0, 1542.78, 287.44, 2619.76, 1583.26, 722.1, 5133.69, 6544.0, 2563.35, 6476.59, 4314.0, 82.87, 650.0, 1773.68, 16723.0, 32297.7, 81.64, 266.92, 348.56, 388.57, 743.2, 5403.24, 5821.52, 12356.53, 12905.68, 316.12, 497.68, 354.42, 3830.44, 17904.34, 175.95, 1224.46, 1400.41]
fig = go.Figure(data=[go.Sankey(
    node = dict(
        pad = 5,
        thickness = 10,
        line = dict(color = "black", width = 0.5),
        label = list(range(len(values))),
        color = "blue"
    ),
    link = dict(
        source = sources,
        target = targets,
        value = values
    ))])
fig.update_layout(title_text="Basic Sankey Diagram", font_size=8)
fig.write_html("test.html")
There's an open issue on github that both x and y positions have to be set in order for manual positioning to work. Does manually adding y coordinates along with x coordinates address your problem?
In general there are other issues with sankey sorting as well.
I have been working with problems in this area only in plotly.R so I'm afraid I can't offer specific python suggestions to modify your code.
If you're also looking for suggestions about calculating the coordinates manually, you can calculate this as
1 - (cumulative_sum_of_higher_nodes + current_node_size/2)
or
1 - (cumulative_sum_of_all_nodes_including_current_node - current_node_size/2)
assuming y = 0 is at the bottom of the plot area.
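The first formula can be sketched in Python roughly as follows. This is only an illustration under a few assumptions: node_centers is a hypothetical helper, the sizes are already expressed as fractions of the plot height, they are listed from top to bottom, and padding between nodes is ignored.

```python
def node_centers(sizes):
    """Return the y centre of each node in one column.

    sizes: node heights as fractions of the plot height, listed top to bottom.
    Uses y = 1 - (cumulative size of the nodes above + own size / 2),
    i.e. y = 0 is taken to be the bottom of the plot area.
    """
    centers = []
    above = 0.0  # cumulative size of the nodes already placed above
    for size in sizes:
        centers.append(1 - (above + size / 2))
        above += size
    return centers

centers = node_centers([0.5, 0.3, 0.2])
print(centers)
```

If your plotting library measures y from the top instead, drop the "1 -" and use the cumulative term directly.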

Python - cut only the descending part of the dataset

I have a timeseries with various downcasts. My question is how do I slice a pandas dataframe (or in this case the array, just to keep it simple) to get the data and its indexes of the descending bits of the timeseries?
import matplotlib.pyplot as plt
import numpy as np
b = np.asarray([ 1.3068586 , 1.59882279, 2.11291473, 2.64699527,
3.23948166, 3.81979878, 4.37630243, 4.97740025,
5.59247254, 6.18671493, 6.77414586, 7.43078595,
8.02243495, 8.59612224, 9.22302662, 9.83263379,
10.43125902, 11.0956864 , 11.61107838, 12.09616684,
12.63973254, 12.49437955, 11.6433792 , 10.61083269,
9.50534291, 8.47418827, 7.40571742, 6.56611512,
5.66963658, 4.89748187, 4.10543794, 3.44828054,
2.76866318, 2.24306623, 1.68034463, 1.26568186,
1.44548443, 2.01225076, 2.60715524, 3.21968562,
3.8622007 , 4.57035958, 5.14021305, 5.77879484,
6.42776897, 7.09397923, 7.71722028, 8.30860725,
8.96652218, 9.66157193, 10.23469208, 10.79889453,
10.5788411 , 9.38270646, 7.82070643, 6.74893389,
5.68200335, 4.73429009, 3.78358222, 3.05924946,
2.30428171, 1.78052369, 1.27897065, 1.16840532,
1.59452726, 2.13085096, 2.70989933, 3.3396291 ,
3.97318058, 4.62429262, 5.23997774, 5.91232803,
6.5906609 , 7.21099657, 7.82936331, 8.49636247,
9.15634983, 9.76450244, 10.39680729, 11.04659976,
11.69287237, 12.35692643, 12.99957563, 13.66228386,
14.31806385, 14.91871927, 15.57212978, 16.22288287,
16.84697357, 17.50502002, 18.15907842, 18.83068151,
19.50945548, 20.18020639, 20.84441358, 21.52792846,
22.17933087, 22.84614545, 23.51212887, 24.18308399,
24.8552263 , 25.51709528, 26.18724379, 26.84531493,
27.50690265, 28.16610365, 28.83394822, 29.49621179,
30.15118676, 30.8019521 , 31.46714114, 32.1213546 ,
32.79366952, 33.45233007, 34.12158193, 34.77502197,
35.4532211 , 36.11018053, 36.76540453, 37.41746323])
plt.plot(-b)
plt.show()
You can just set the points with a non-negative diff to NaN and then plot:
import pandas as pd
bb = pd.Series(-b)
bb[bb.diff().ge(0)] = np.nan
bb.plot()
To get the indexes of descending values, use:
bb.index[bb.diff().lt(0)]
Int64Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, 20, 37, 38, 39, 40, 41, 42,
43, 44, 45, 46, 47, 48, 49, 50, 51, 65, 66, 67, 68,
69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81,
82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94,
95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,
108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119],
dtype='int64')
Create a second column that holds the data shifted by one index, then subtract the two columns term by term. Keeping only the rows with a negative diff should give you what you want:
from pandas import DataFrame, concat
df = DataFrame(b)
df = concat([df.shift(1),df],axis = 1)
df.columns = ['t-1','t']
df.reset_index()
df = df.drop(df.index[0])
df['diff'] = df['t']-df['t-1']
res = df[df['diff']<0]
There is also an easy numpy-only solution (the question is tagged pandas but the code uses only numpy) using np.where. You want the points where the graph is descending which means the data is ascending.
# the indices where the data is ascending.
ix, = np.where(np.diff(b) > 0)
# the values
c = b[ix]
Note that this will give you the first value in each ascending pair of consecutive values, while the pandas-based solution gives the second one. To get the same indices just add 1 to ix.
s = pd.Series(b)
assert np.all(s[s.diff() > 0].index == ix + 1)
assert np.all(s[s.diff() > 0] == b[ix + 1])
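If you want each downcast as its own slice rather than one flat list of indices, one option is to split the index array wherever it stops being consecutive. The sketch below is numpy-only; descending_runs is a name I made up, and it reports descending stretches of whatever array you pass in, so pass -b if you want the descending parts of the plotted curve.

```python
import numpy as np

def descending_runs(values):
    """Return a list of index arrays, one per contiguous descending stretch."""
    values = np.asarray(values)
    # Index i + 1 is "descending" when values[i + 1] < values[i]
    idx = np.flatnonzero(np.diff(values) < 0) + 1
    if idx.size == 0:
        return []
    # Start a new run wherever two descending indices are not adjacent
    breaks = np.flatnonzero(np.diff(idx) > 1) + 1
    return np.split(idx, breaks)

demo = np.array([1, 2, 3, 2, 1, 2, 3, 1])
runs = descending_runs(demo)
for run in runs:
    print(run, demo[run])
```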

numpy/scipy, loop over subarrays

Lately I've been doing a lot of processing on 8x8 blocks of image-data.
Standard approach has been to use nested for-loops to extract the blocks, e.g.
for y in range(0,height,8):
    for x in range(0,width,8):
        d = image_data[y:y+8,x:x+8]
        # further processing on the 8x8-block
I can't help but wonder if there is a way to vectorize this operation, or another approach using numpy/scipy that I can use instead. An iterator of some kind?
An MWE [1]:
#!/usr/bin/env python
import sys
import numpy as np
from scipy.fftpack import dct, idct
import scipy.misc
import matplotlib.pyplot as plt
def dctdemo(coeffs=1):
    unzig = np.array([
        0, 1, 8, 16, 9, 2, 3, 10,
        17, 24, 32, 25, 18, 11, 4, 5,
        12, 19, 26, 33, 40, 48, 41, 34,
        27, 20, 13, 6, 7, 14, 21, 28,
        35, 42, 49, 56, 57, 50, 43, 36,
        29, 22, 15, 23, 30, 37, 44, 51,
        58, 59, 52, 45, 38, 31, 39, 46,
        53, 60, 61, 54, 47, 55, 62, 63])
    # scipy.misc.lena was removed from later SciPy releases; any 2-D grayscale array works here
    lena = scipy.misc.lena()
    width, height = lena.shape
    # reconstructed
    rec = np.zeros(lena.shape, dtype=np.int64)
    # Can this part be vectorized?
    for y in range(0,height,8):
        for x in range(0,width,8):
            d = lena[y:y+8,x:x+8].astype(float)
            D = dct(dct(d.T, norm='ortho').T, norm='ortho').reshape(64)
            Q = np.zeros(64, dtype=float)
            Q[unzig[:coeffs]] = D[unzig[:coeffs]]
            Q = Q.reshape([8,8])
            q = np.round(idct(idct(Q.T, norm='ortho').T, norm='ortho'))
            rec[y:y+8,x:x+8] = q.astype(np.int64)
    plt.imshow(rec, cmap='gray')
    plt.show()
if __name__ == '__main__':
    try:
        c = int(sys.argv[1])
    except ValueError:
        sys.exit()
    else:
        if 1 <= int(sys.argv[1]) <= 64:
            dctdemo(int(sys.argv[1]))
Footnotes:
[1] Actual application: https://github.com/figgis/dctdemo
There's a function view_as_windows for this in Scikit Image
http://scikit-image.org/docs/dev/api/skimage.util.html#view-as-windows
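Unfortunately I will have to finish this answer another time, but you can grab the windows in a form that you can pass to dct with:
from skimage.util import view_as_windows
# your code...
# step=8 yields the non-overlapping 8x8 blocks; the default step of 1 yields every overlapping window
d = view_as_windows(lena.astype(float), (8, 8), step=8).reshape(-1, 8, 8)
# 2-D DCT of each block: transform along both block axes, not along the block index
dct(dct(d, axis=1, norm='ortho'), axis=2, norm='ortho')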
There is a function called extract_patches in the scikit-learn feature extraction routines. You need to specify a patch_size and an extraction_step. The result will be a view on your image as patches, which may overlap. The resulting array is 4D, the first 2 index the patch, and the last two index the pixels of the patch. Try this
from sklearn.feature_extraction.image import extract_patches
patches = extract_patches(image_data, patch_size=(8, 8), extraction_step=(4, 4))
This gives (8, 8) size patches that overlap by half.
Note that up until now this uses no extra memory, because it is implemented using stride tricks. You can force a copy by reshaping
patches = patches.reshape(-1, 8, 8)
which will basically yield a list of patches.
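For non-overlapping blocks you can also get by with plain numpy reshaping and let the DCT run over the last two axes of the whole block array at once. The sketch below is one possible way, not a drop-in replacement for the MWE: it assumes the image height and width are multiples of the block size, uses dctn/idctn from scipy.fft rather than scipy.fftpack, and the helper names to_blocks, from_blocks and blockwise_roundtrip are mine.

```python
import numpy as np
from scipy.fft import dctn, idctn

def to_blocks(image, block=8):
    """View a (H, W) image as an array of shape (H//block, W//block, block, block)."""
    h, w = image.shape
    if h % block or w % block:
        raise ValueError("image dimensions must be multiples of the block size")
    return image.reshape(h // block, block, w // block, block).swapaxes(1, 2)

def from_blocks(blocks):
    """Inverse of to_blocks: stitch the blocks back into a (H, W) image."""
    rows, cols, bh, bw = blocks.shape
    return blocks.swapaxes(1, 2).reshape(rows * bh, cols * bw)

def blockwise_roundtrip(image, block=8):
    blocks = to_blocks(image.astype(float), block)
    # 2-D DCT of every block in one call: transform only the last two axes
    coeffs = dctn(blocks, axes=(-2, -1), norm='ortho')
    # ... per-block coefficient selection would go here ...
    restored = idctn(coeffs, axes=(-2, -1), norm='ortho')
    return from_blocks(restored)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(16, 24))
restored = blockwise_roundtrip(image)
```

Here blocks[i, j] is the same data as image[8*i:8*i+8, 8*j:8*j+8], so the zig-zag coefficient selection from the MWE can be applied to coeffs.reshape(-1, 64) in one indexing step instead of inside the double loop.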

python, weighted linspace

Can anyone show me the best way to generate a (numpy) array containing values from 0 to 100 that is weighted by, for example, a normal distribution function with mean 50 and variance 5, so that there are more 50s and fewer (nearly no) zeros and hundreds? I think the problem should not be too hard to solve, but I'm stuck somehow...
I thought about something with np.linspace, but it seems that there is no weight option.
So just to be clear: I don't want a simple normal distribution from 0 to 100, but something like an array from 0 to 100 with a higher density of values in the middle.
Thanks
You can use scipy's stats distributions:
import numpy as np
from scipy import stats
# your distribution:
distribution = stats.norm(loc=50, scale=5)
# percentile point, the range for the inverse cumulative distribution function:
bounds_for_range = distribution.cdf([0, 100])
# Linspace for the inverse cdf:
pp = np.linspace(*bounds_for_range, num=1000)
x = distribution.ppf(pp)
# And just to check that it makes sense you can try:
from matplotlib import pyplot as plt
plt.hist(x)
plt.show()
Of course, I admit the start and end points are not quite exact like this due to numerical inaccuracies when going back and forth.
It is important to understand that your problem is not exactly solvable, since in general a finite discrete sample cannot exactly reproduce your distribution.
You can easily see this by asking trivial versions of your question, such as a set of 3 values in [0,1] with an equal distribution. Here the results [0,0,1] and [0,1,1] would both be reasonable.
However, you can solve the problem roughly. If you ask for an array with count elements out of [0,1,...,N] where the given probabilities are p=[p0,p1,...,pN] and normalized (p0+...+pN==1) then the count c_k of the element k in your resulting array is theoretically
c[k] = p[k]*count
but these counts now are floats. You have to decide for a way to "round" them while keeping their total sum. This is the freedom of choice arising from the under-definedness of your question.
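One possible way to do that rounding is the largest-remainder rule: floor every count, then hand the leftover elements to the values whose ideal counts had the largest fractional parts. The sketch below is just one such choice, not the only reasonable one. The name weighted_values and its defaults are my own, the probabilities are taken from the normal pdf evaluated at the integers 0..100, and scale=5 is treated as a standard deviation.

```python
import numpy as np
from scipy import stats

def weighted_values(count, low=0, high=100, loc=50, scale=5):
    """Return `count` integers in [low, high], distributed roughly like a normal pdf."""
    values = np.arange(low, high + 1)
    p = stats.norm(loc=loc, scale=scale).pdf(values)
    p /= p.sum()                      # normalise so the probabilities sum to 1
    ideal = p * count                 # c[k] = p[k] * count, still floats
    counts = np.floor(ideal).astype(int)
    leftover = count - counts.sum()   # elements lost by flooring
    # Largest-remainder rule: top up the values with the biggest fractional parts
    order = np.argsort(ideal - counts)[::-1]
    counts[order[:leftover]] += 1
    return np.repeat(values, counts)

x = weighted_values(1000)
print(len(x), x.min(), x.max())
```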
>>> sorted([int(random.gauss(50,5)) for i in range(100)])
[33, 40, 40, 40, 40, 40, 42, 42, 42, 42, 43, 43, 43, 43, 44, 44, 44, 44, 44, 45, 45, 45, 46, 46, 46, 46, 46, 46, 46, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 48, 48, 48, 48, 48, 48, 48, 49, 49, 50, 50, 50, 50, 50, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 53, 53, 53, 54, 54, 54, 54, 54, 54, 54, 54, 54, 55, 55, 56, 56, 57, 57, 57, 57, 57, 57, 57, 58, 61]
