How can I get sorted cumulative plots in numpy/matplotlib or Pandas?
Let me explain this with an example. Say we have the following data:
number_of_items_sold_per_store = [10, 6, 90, 5, 102, 10, 6, 50, 85, 1, 2, 3, 6]
We want to plot a chart where a given (x, y) point reads as: the top X% of stores (by sales) sold Y% of the items, so the best-selling stores are to the left (i.e. the slope of the plot decreases monotonically). How can I do this in numpy or Pandas (i.e. assuming the above is a Series)?
Assuming that you want the best performing stores to come first:
import numpy as np
import matplotlib.pyplot as plt
number_of_items_sold_per_store = [10, 6, 90, 5, 102, 10, 6, 50, 85, 1, 2, 3, 6]
ar = sorted(number_of_items_sold_per_store, reverse=True)
y = np.cumsum(ar).astype("float32")
# normalise to a percentage
y /= y.max()
y *= 100.
# prepend a 0 to y as zero stores have zero items
y = np.hstack((0, y))
# get cumulative percentage of stores
x = np.linspace(0, 100, y.size)
# plot
plt.plot(x, y)
plt.show()
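Since the question mentions the data may already be a pandas Series, here is a minimal sketch of the same approach done directly with pandas (the Series construction is just for illustration):
import pandas as pd
s = pd.Series(number_of_items_sold_per_store)
# sort descending, accumulate, then normalise to a percentage of total sales
y = s.sort_values(ascending=False).cumsum()
y = 100 * y / y.iloc[-1]
x = np.linspace(0, 100, len(y))
plt.plot(x, y)
plt.show()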
I think the steps involved here are:
Sort the list of sale counts in descending order
Get the cumulative sum of the sorted list
Divide by the overall total and multiply by 100 to convert to percentage
Plot!
import numpy as np
import matplotlib.pyplot as plt
n_sold = number_of_items_sold_per_store
sorted_sales = list(reversed(sorted(n_sold)))
total_sales = np.sum(n_sold)
cum_sales = np.cumsum(sorted_sales).astype(np.float64) / total_sales
cum_sales *= 100 # Convert to percentage
# Borrowing the linspace trick from ebarr
x_vals = np.linspace(0, 100, len(cum_sales))
plt.plot(x_vals, cum_sales)
plt.show()
This works for me (if number_of_items_sold_per_store is a pandas Series, you can convert it to a numpy array using number_of_items_sold_per_store.values):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
number_of_items_sold_per_store = [10, 6, 90, 5, 102, 10, 6, 50, 85, 1, 2, 3, 6]
# Create histogram
values, base = np.histogram(number_of_items_sold_per_store, bins=500)
# Cumulative data
cum = np.cumsum(values)
# plot the cumulative function
plt.plot(base[:-1], cum, c='red')
plt.show()
I have x, y, v arrays of data points and I am binning v on the x-y plane. I am trying to get the x, y, v values back after binning, but I want them as arrays corresponding to each bin. My code can get them individually, but that will not work for large data sets with many bins. Maybe I need some kind of loop, but my understanding of loops is weak. Code:
from scipy import stats
import numpy as np
x=np.array([-10,-2,4,12,3,6,8,14,3])
y=np.array([5,5,-6,8,-20,10,2,2,8])
v=np.array([4,-6,-10,40,22,-14,20,8,-10])
ret = stats.binned_statistic_2d(x, y, v, 'count', bins=2, expand_binnumbers=True)
print('counts=',ret.statistic)
print('binnumber=', ret.binnumber)
binnumber = ret.binnumber
statistic = ret.statistic
# get the bin numbers according to some condition
idx_bin_x, idx_bin_y = np.where(statistic==statistic[1][1])#[0]
print('idx_binx=',idx_bin_x)
print('idx_bin_y=',idx_bin_y)
# A binnumber of i means the corresponding value is
# between (bin_edges[i-1], bin_edges[i]).
# -> increment the bin indices by one
idx_bin_x += 1
idx_bin_y += 1
print('idx_binx+1=',idx_bin_x)
print('idx_bin_y+1=',idx_bin_y)
# get the boolean mask and apply it
is_event_x = np.in1d(binnumber[0], idx_bin_x)
print('eventx=',is_event_x)
is_event_y = np.in1d(binnumber[1], idx_bin_y)
print('eventy=',is_event_y)
is_event_xy = np.logical_and(is_event_x, is_event_y)
print('event_xy=', is_event_xy)
events_x = x[is_event_xy]
events_y = y[is_event_xy]
event_v=v[is_event_xy]
print('x=', events_x)
print('y=', events_y)
print('v=',event_v)
This outputs x, y, v for the bin with count=5, but I want all 4 bins, returning arrays of x, y, v for each bin, e.g. for bin 1: x_bin1=[...], y_bin1=[...], v_bin1=[...], and so on for all 4 bins.
Also, feel free to suggest easier ways to bin a 2d plane (x, y) with values (v) like mine and get the binned values back. Thank you!
Working with numpy arrays gives a compact way to recover the arrays you are after:
import numpy as np
from scipy import stats
# coordinates and values
x = np.array([-10, -2, 4, 12, 3, 6, 8, 14, 3])
y = np.array([5, 5, -6, 8, -20, 10, 2, 2, 8])
v = np.array([4, -6, -10, 40, 22, -14, 20, 8, -10])
ret = stats.binned_statistic_2d(x, y, None, 'count', bins=2, expand_binnumbers=True)
b = ret.binnumber
for i in [1, 2]:
    for j in [1, 2]:
        m = (b[0] == i) & (b[1] == j)  # mask for bin (i, j)
        print((list(x[m]), list(y[m]), list(v[m])))
which gives for each of the four bins a tuple of 3 lists corresponding to x, y and v values:
([], [], [])
([-10, -2], [5, 5], [4, -6])
([4, 3], [-6, -20], [-10, 22])
([12, 6, 8, 14, 3], [8, 10, 2, 2, 8], [40, -14, 20, 8, -10])
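If you need to keep the per-bin arrays around rather than just print them, a small variation (the name per_bin is illustrative) collects them in a dictionary keyed by bin index:
per_bin = {}
for i in [1, 2]:
    for j in [1, 2]:
        m = (b[0] == i) & (b[1] == j)
        per_bin[(i, j)] = (x[m], y[m], v[m])
# e.g. the x, y, v arrays that fell into bin (2, 2)
x_bin, y_bin, v_bin = per_bin[(2, 2)]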
I have a segmentation map (a numpy.ndarray) that contains objects labeled with unique numbers. I want to combine objects across multiple slices by labeling them with the same number. Specifically, I want to renumber objects based on a DataFrame containing centroid positions and the desired label values.
First, I created some mock labels and a DataFrame:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "slice": [0, 0, 0, 0, 1, 1, 1, 2, 2, 2],
    "number": [1, 2, 3, 4, 1, 2, 3, 1, 2, 3],
    "x": [10, 20, 30, 40, 11, 21, 31, 12, 22, 32],
    "y": [10, 20, 30, 40, 11, 21, 31, 12, 22, 32]
})
def make_segmap(df):
    x, y = np.indices((50, 50))
    maps = []
    # Iterate over slices and coordinates
    for n_slice in df["slice"].unique():
        masks = []
        for row in df[df["slice"] == n_slice].iterrows():
            # Create circle
            mask_circle = (x - row[1]["x"])**2 + (y - row[1]["y"])**2 < 5**2
            # Random index number (here just a multiple)
            masks.append(mask_circle * row[1]["number"] * 3)
        maps.append(np.max(masks, axis=0))
    return np.stack(maps, axis=0)
segmap = make_segmap(df)
For renumbering, this is what I came up with so far:
new_maps = []
# Iterate over slices
for n_slice in df["slice"].unique():
    new_labels = []
    for row in df[df["slice"] == n_slice].iterrows():
        # Find current value at position
        original_label = segmap[n_slice, row[1]["y"], row[1]["x"]]
        # Replace all label occurrences with the desired label from the DataFrame
        replaced_label = np.where(segmap[n_slice] == original_label, row[1]["number"], 0)
        new_labels.append(replaced_label)
    new_maps.append(np.max(new_labels, axis=0))
new_segmap = np.stack(new_maps, axis=0)
This works reasonably well but doesn't scale to larger datasets. The real dataset has thousands of objects across hundreds of slices and this approach takes very long to run (an hour or so). Are there any suggestions on how to replace multiple values at once to improve performance?
Thanks in advance.
You can use groupby to replace the current quadratic search algorithm by a (quasi) linear search. Moreover, you can take advantage of Numpy's vectorization and broadcasting to remove the inner loop and make the computation faster.
Here is a faster implementation:
def make_segmap_fast(df):
    x, y = np.indices((50, 50))
    maps = []
    # Iterate over slices and coordinates
    for n_slice, subDf in df.groupby("slice"):
        subDf_x = subDf["x"].to_numpy()[:, None, None]
        subDf_y = subDf["y"].to_numpy()[:, None, None]
        subDf_number = subDf["number"].to_numpy()[:, None, None]
        # Create circles (one mask per row, via broadcasting)
        mask_circle = (x - subDf_x)**2 + (y - subDf_y)**2 < 5**2
        # Random index number (here just a multiple)
        masks = mask_circle * subDf_number
        maps.append(np.max(masks, axis=0) * 3)
    return np.stack(maps, axis=0)
On my machine, this is 2 times faster on the very small example (much more on bigger dataframes).
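For the renumbering loop itself (the part the question reports taking an hour), a common trick for replacing many label values at once is a lookup table indexed by the old labels. The following is only a rough sketch of that idea under the assumption that labels are small non-negative integers, as in the mock data; the names lut and sub are illustrative, not part of the code above:
new_maps = []
for n_slice, sub in df.groupby("slice"):
    slice_map = segmap[n_slice]
    # labels currently sitting at the centroid positions
    old_labels = slice_map[sub["y"].to_numpy(), sub["x"].to_numpy()]
    # lookup table: old label -> desired label, everything else -> 0
    lut = np.zeros(slice_map.max() + 1, dtype=slice_map.dtype)
    lut[old_labels] = sub["number"].to_numpy()
    # one vectorized pass replaces all occurrences at once
    new_maps.append(lut[slice_map])
new_segmap = np.stack(new_maps, axis=0)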
I have an array of data-points, for example:
[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
and I need to perform the following sum on the values: y = sum_i x_i^alpha * ln(x_i), with alpha = 1/ln(2).
However, the problem is that I need this sum separately for each tail of the array, i.e. over all values from a given position onwards. For example, using the last 3 values in the set the sum would be 3^alpha*ln(3) + 2^alpha*ln(2) + 1^alpha*ln(1), and so on up to all 10 values.
If I run something like:
import numpy as np
x = np.array([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
alpha = 1/np.log(2)
for i in x:
    y = sum(x**(alpha)*np.log(x))
print(y)
It returns a single value of y = 247.7827060452275, whereas I need an array of values. I think I need to reverse the order of the data to achieve what I want, but I'm having trouble visualising the problem as a whole (I hope I explained it properly), so any suggestions would be much appreciated.
The following computes all the partial sums of the grand sum in your formula
import numpy as np
# Generate numpy array [1, 10]
x = np.arange(1, 11)
alpha = 1 / np.log(2)
# Compute parts of the sum
parts = x ** alpha * np.log(x)
# Compute all partial sums
part_sums = np.cumsum(parts)
print(part_sums)
You really do not need any explicit loop or non-numpy operation (like the built-in sum()) here; numpy takes care of all your needs.
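If you also want the sums arranged to match the original descending array (so the entry at a given position covers that value and everything after it), one option is to reverse before and after the cumulative sum. A small sketch of that idea:
import numpy as np
alpha = 1 / np.log(2)
x = np.array([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
parts = x ** alpha * np.log(x)
# tail_sums[i] is the sum of parts over x[i:], so tail_sums[-3] covers the last 3 values
tail_sums = np.cumsum(parts[::-1])[::-1]
print(tail_sums)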
I have a time series with several large data gaps. I would like to see a connecting line between data points that are less than an hour apart, but not if the gap is larger. The accepted answer to the question, Put a gap/break in a line plot, would work except that you sacrifice the masked points. I would like to avoid that.
I have attempted to write a list comprehension that would insert NaNs into the array; I think that would automatically achieve the same result, but I can't seem to get it right. The best I have found is as follows:
import datetime as dtm
import numpy as np
x = np.array([dtm.datetime(2001,4,3,0,47,30),dtm.datetime(2001,4,3,0,52,30),dtm.datetime(2001,4,3,0,57,30),dtm.datetime(2001,4,3,3,57,30),dtm.datetime(2001,4,3,4,2,30),dtm.datetime(2001,4,3,4,7,30)])
xmod = np.array([x[0]]+[dt1 if dt1-dt0 < dtm.timedelta(hours=1.) else [dt1,np.nan] for dt1, dt0 in zip(x[1:],x[:-1])])
This gives the result:
In [7]: xmod
Out[7]:
array([datetime.datetime(2001, 4, 3, 0, 47, 30),
datetime.datetime(2001, 4, 3, 0, 47, 30),
datetime.datetime(2001, 4, 3, 0, 52, 30),
[datetime.datetime(2001, 4, 3, 0, 57, 30), nan],
datetime.datetime(2001, 4, 3, 3, 57, 30),
datetime.datetime(2001, 4, 3, 4, 2, 30)], dtype=object)
I have not been able to find a way to insert both the data point and the np.nan without putting brackets around them. Is this possible? Is there a better way to achieve my goal? Thanks!
In accordance with the comment above, probably the easiest way to do this would be to separate the data into groups where you need the gaps. Here is one way to implement such a thing.
import datetime as dtm
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
x = np.array([dtm.datetime(2001,4,3,0,47,30),dtm.datetime(2001,4,3,0,52,30),dtm.datetime(2001,4,3,0,57,30),
dtm.datetime(2001,4,3,3,57,30),dtm.datetime(2001,4,3,4,2,30),dtm.datetime(2001,4,3,4,7,30)])
y = range(len(x))
# make a dataframe with groups separated that are over an hour apart
data = []
g = 0
for i in range(len(x)):
    x0 = x[i]
    y0 = y[i]
    if i < (len(x) - 1):
        x1 = x[i+1]
        td = x1 - x0
        elapsed_seconds = td.total_seconds()
        hrs = (elapsed_seconds / 60) / 60
        if hrs < 1:
            data.append([x0, y0, g])
        else:
            data.append([x0, y0, g])
            g += 1
    else:
        data.append([x0, y0, g])
df = pd.DataFrame(data, columns=['x', 'y', 'group'])
# draw a plot
fig, ax = plt.subplots(1, 1, figsize=(8, 5))
for i, dfg in df.groupby('group'):
    ax.plot(dfg['x'], dfg['y'], c='b')
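As a side note, the grouping step can also be written with pandas alone by flagging gaps longer than an hour and cumulatively summing the flags. This is just a sketch of that idea (the names s and group are illustrative), not part of the answer above:
s = pd.Series(y, index=pd.to_datetime(x))
# a new group starts whenever the time since the previous point exceeds one hour
group = (s.index.to_series().diff() > pd.Timedelta(hours=1)).cumsum()
for _, part in s.groupby(group):
    plt.plot(part.index, part.values, c='b')
plt.show()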
So, I accepted the answer by djakubosky because it seems clean and is probably the right approach. However, by the time that answer was posted, I had decided that what I was doing was inappropriate for a list comprehension and simply wrote it as a for loop - and that worked fine. Possibly this will be useful to someone else. Here is the code:
def insert_breaks(x, y):
    import datetime as dtm
    import numpy as np
    xnew = []
    ynew = []
    for dt1, dt0, y1, y0 in zip(x[1:], x[:-1], y[1:], y[:-1]):
        if dt1 - dt0 < dtm.timedelta(hours=1):
            xnew += [dt0]
            ynew += [y0]
        else:
            xnew += [dt0, dt0 + (dt1 - dt0) / 2]
            ynew += [y0, np.nan]
    xnew += [dt1]
    ynew += [y1]
    return xnew, ynew
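A minimal usage sketch (the y values here are made up just so there is something to plot against the timestamps in x):
import matplotlib.pyplot as plt
y = list(range(len(x)))        # dummy y-values matching the timestamps in x
xnew, ynew = insert_breaks(x, y)
plt.plot(xnew, ynew)           # the inserted NaNs break the line at the large gaps
plt.show()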
I would like to create a scatter plot in matplotlib to measure the performance of my algorithm.
An example of my data is as follows:
x = [1, 2, 3, 4, 5]
y1 = [1, 2, 3] # corresponding to x = 1
y2 = [4, 5, 6] # corresponding to x = 2
y3 = [7, 8, 9] # corresponding to x = 3
y4 = [10, 11, 12] # corresponding to x = 4
y5 = [13, 14, 15] # corresponding to x = 5
What data type would be best to represent multiple y values with one x value?
In my example the relation is exponential. Is there a way to plot an exponential regression line in matplotlib?
I think this comes down to how you analyse the data. If I understand correctly, you want to compare the time efficiency of each test, but each test run should take place in the same test environment (the same machine, the same input data, etc.). As a suggestion, you could use each test's average run time as a reference value when presenting your results. Here is some code you can use.
import numpy as np
import matplotlib.pyplot as plt
data_dim = 4       # number of tests
data_points = 100  # number of data points per test
data_set = np.random.rand(data_dim, data_points)
time = [list(range(len(i))) for i in data_set]
# get each test's average value, repeated so it can be drawn as a flat line
norm = np.full((data_dim, data_points), 1)
aver = []
for ndx, i in enumerate(norm):
    aver.append(i * np.sum(data_set[ndx]) / data_points)
fig = plt.figure(figsize=(10, 10))
ndx = 1
for i in range(0, 2):
    for j in range(0, 2):
        ax = fig.add_subplot(2, 2, ndx)
        ax.plot(time[ndx-1], data_set[ndx-1], 'ko')
        ax.plot(time[ndx-1], aver[ndx-1], 'r')
        ax.set_ylim(-1, 2)
        ndx += 1
plt.show()
Running this gives a 2x2 grid of plots, one per test. The red solid line is that test's average run time, which gives a sense of how each test run behaves.
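Regarding the exponential regression line asked about in the question, one common approach is to fit a straight line to log(y) and overlay the resulting curve on the scatter. This is only a sketch of that idea using the sample data from the question, not part of the answer above:
import numpy as np
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
ys = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]]
# repeat each x value once per corresponding y value and flatten the y lists
xs = np.repeat(x, [len(y) for y in ys])
all_y = np.concatenate(ys)
plt.scatter(xs, all_y)
# fit y = a * exp(b * x) by a linear least-squares fit in log space
b, log_a = np.polyfit(xs, np.log(all_y), 1)
x_line = np.linspace(min(x), max(x), 100)
plt.plot(x_line, np.exp(log_a + b * x_line), 'r')
plt.show()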