Plotting a dictionary with multiple values per key - python

I have a dictionary that looks like this:
1: ['4026', '4024', '1940', '2912', '2916'], 2: ['3139', '2464'], 3: ['212']...
For a few hundred keys, I'd like to plot each key as the y value against its set of x values. I tried this bit of code, which gives the error underneath:
for rank, structs in b.iteritems():
    y = b.keys()
    x = b.values()
    ax.plot(x, y, 'ro')
plt.show()
ValueError: setting an array element with a sequence
I'm at a bit of a loss on how to proceed so any help would be greatly appreciated!

You need to construct your list of Xs and Ys manually:
In [258]: b={1: ['4026', '4024', '1940', '2912', '2916'], 2: ['3139', '2464'], 3:['212']}
In [259]: xs, ys=zip(*((int(x), k) for k in b for x in b[k]))
In [260]: xs, ys
Out[260]: ((4026, 4024, 1940, 2912, 2916, 3139, 2464, 212), (1, 1, 1, 1, 1, 2, 2, 3))
In [261]: plt.plot(xs, ys, 'ro')
...: plt.show()
resulting in a plot with one horizontal row of red dots per key.

1) Repeat your x values
plot expects a list of x values and a list of y values of the same length. That's why you have to repeat each rank value several times; itertools.repeat() can do that for you.
2) Change your iterator
iteritems() already yields (key, value) tuples, so you don't have to call keys() and values() yourself.
Here's the code:
import itertools
for rank, structs in b.iteritems():
    x = list(itertools.repeat(rank, len(structs)))
    plt.plot(x, structs, 'ro')
3) Combine the plots
Using your code, you'd produce one plot per item in the dictionary. I guess you'd rather plot them all within a single graph. If so, change your code accordingly:
import itertools
x = []
y = []
for rank, structs in b.iteritems():
    x.extend(list(itertools.repeat(rank, len(structs))))
    y.extend(structs)
plt.plot(x, y, 'ro')
4) Example
Here's an example using your data:
import itertools
import matplotlib.pyplot as plt
d = {1: ['4026', '4024', '1940', '2912', '2916'], 2: ['3139', '2464'], 3:['212']}
x= []
y= []
for k, v in d.iteritems():
    x.extend(list(itertools.repeat(k, len(v))))
    y.extend(v)
plt.xlim(0, 5)
plt.plot(x, y, 'ro')
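The answers above use Python 2's iteritems(); in Python 3 the loop looks like this. This is a sketch on the same sample data; the int() cast is my addition so the y-axis stays numeric instead of matplotlib treating the strings as categories:

```python
import matplotlib.pyplot as plt

# same sample dictionary as in the question
d = {1: ['4026', '4024', '1940', '2912', '2916'], 2: ['3139', '2464'], 3: ['212']}

x, y = [], []
for k, v in d.items():           # items() replaces iteritems() in Python 3
    x.extend([k] * len(v))       # repeat the key once per value
    y.extend(int(s) for s in v)  # cast strings to ints for a numeric y-axis
plt.xlim(0, 5)
plt.plot(x, y, 'ro')
```

Call plt.show() afterwards to display the figure.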

This is because you mismatched your data entries.
Currently you have
1: ['4026', '4024', '1940', '2912', '2916']
2: ['3139', '2464'],
...
hence
x = [1,2,...]
y = [['4026', '4024', '1940', '2912', '2916'], ['3139', '2464'], ...]
when you really need
x = [1, 1, 1, 1, 1, 2, 2, ...]
y = ['4026', '4024', '1940', '2912', '2916', '3139', '2464',...]
Try
# Match each key with the appropriate number of copies of itself
# Produces x = [1, 1, 1, 1, 1, 2, 2, ...]
x = [key for (key, values) in b.items() for _ in xrange(len(values))]
# Flatten the values lists
# Produces y = ['4026', '4024', '1940', '2912', '2916', '3139', '2464', ...]
y = [val for subl in b.values() for val in subl]
ax.plot(x, y, 'ro')
plt.show()

Related

Creating Density/Heatmap Plot from Coordinates and Magnitude in Python

I have some data which is the number of readings at each point on a 5x10 grid, which is in the format of;
X = [1, 2, 3, 4,..., 5]
Y = [1, 1, 1, 1,...,10]
Z = [9,8,14,0,89,...,0]
I would like to plot this as a heatmap/density map viewed from above, but all of the matplotlib functions (incl. contourf) that I have found require a 2D array for Z, and I don't understand why.
EDIT;
I have now collected the actual coordinates that I want to plot which are not as regular as what I have above they are;
X = [8,7,7,7,8,8,8,9,9.5,9.5,9.5,11,11,11,10.5,
10.5,10.5,10.5,9,9,8, 8,8,8,6.5,6.5,1,2.5,4.5,
4.5,2,2,2,3,3,3,4,4.5,4.5,4.5,4.5,3.5,2.5,2.5,
1,1,1,2,2,2]
Y = [5.5,7.5,8,9,9,8,7.5,6,6.5,8,9,9,8,6.5,5.5,
5,3.5,2,2,1,2,3.5,5,1,1,2,4.5,4.5,4.5,4,3,
2,1,1,2,3,4.5,3.5,2.5,1.5,1,5.5,5.5,6,7,8,9,
9,8,7]
z = [286,257,75,38,785,3074,1878,1212,2501,1518,419,33,
3343,1808,3233,5943,10511,3593,1086,139,565,61,61,
189,155,105,120,225,682,416,30632,2035,165,6777,
7223,465,2510,7128,2296,1659,1358,204,295,854,7838,
122,5206,6516,221,282]
From what I understand you can't use floats in a np.array so I have tried to multiply all values by 10 so that they are all integers, but I am still running into some issues. Am I trying to do something that will not work?
They expect a 2D array because they use the "row" and "column" to set the position of the value. For example, if array[2, 3] = 5, then when x is 2 and y is 3, the heatmap will use the value 5.
So, let's try transforming your current data into a single array:
>>> array = np.empty((len(set(X)), len(set(Y))))
>>> for x, y, z in zip(X, Y, Z):
...     array[x-1, y-1] = z
If X and Y are already np.arrays, you could do this instead (SO answer):
>>> array = np.empty((X.shape[0], Y.shape[0]))
>>> array[X - 1, Y - 1] = Z
And now just plot the array as you prefer:
>>> plt.imshow(array, cmap="hot", interpolation="nearest")
>>> plt.show()
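For the irregular float coordinates in the EDIT, direct integer indexing won't work. One alternative (my suggestion, not part of the answer above) is to bin the points with np.histogram2d, passing the z values as weights so each grid cell holds the sum of the readings that fall into it:

```python
import numpy as np

# a small made-up subset of float coordinates with magnitudes
X = [8, 7, 7.5, 2.5, 4.5, 2]
Y = [5.5, 7.5, 8, 4.5, 4.5, 3]
Z = [286, 257, 75, 105, 120, 225]

# bin the scattered points onto a 10x10 grid; each cell sums the z weights
heat, xedges, yedges = np.histogram2d(X, Y, bins=10, weights=Z)
```

plt.imshow(heat.T, origin="lower", cmap="hot", interpolation="nearest") would then display it; note the transpose, since histogram2d returns heat indexed as [x_bin, y_bin] while imshow expects rows to be y.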

Find the highest value of y for each x value and connect the points with a line

I am exploring the best way to do this.
I have a scatter plot of y versus x, where x is income per capita.
After plotting all values as a scatter plot, I would like to find the highest value for y for each x value (i.e., at each income level) and then connect these points with a line.
How can I do this in Python?
You could use pandas, because it has a convenient groupby method and plays well with matplotlib:
import pandas as pd
# example data
df = pd.DataFrame({'x': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
                   'y': [3, 7, 9, 4, 1, 2, 8, 6, 4, 4, 3, 1]})
# standard scatter plot
ax = df.plot.scatter('x', 'y')
# find max. y value for each x
groupmax = df.groupby('x').max()
# connect max. values with lines
groupmax.plot(ax=ax, legend=False);
You have two parallel lists: x and y. You want to group them by x and take the maximum in y. First, you should sort the lists together. Zip them into a list of tuples and sort:
xy = sorted(zip(x, y))
Now, group the sorted list by the first element ("x"). The result is a list of tuples where the first element is x and the second is a list of all points with that x. Naturally, each point is also a tuple, and the first element of each tuple is the same x:
from itertools import groupby
grouped = groupby(xy, lambda item: item[0])
Finally, take the x and the max of the points for each group:
envelope = [(xp, max(points)[1]) for xp, points in grouped]
envelope is a list of xy tuples that envelope your scatter plot. You can further unzip it into xs and ys:
x1, y1 = zip(*envelope)
Putting it all together:
x1, y1 = zip(*[(xp, max(points)[1])
               for xp, points
               in groupby(sorted(zip(x, y)), lambda item: item[0])])
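For illustration, here is the whole pipeline run on some made-up scatter data (the values are mine, chosen so each x has several y readings):

```python
from itertools import groupby

# made-up scatter data: several y readings per x value
x = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y = [3, 7, 9, 4, 1, 2, 8, 6, 4]

# sort the pairs, group by x, and keep the largest y in each group
xy = sorted(zip(x, y))
envelope = [(xp, max(points)[1]) for xp, points in groupby(xy, lambda item: item[0])]
x1, y1 = zip(*envelope)
print(x1, y1)  # (1, 2, 3) (9, 4, 8)
```

plt.plot(x1, y1) would then draw the connecting line on top of the scatter.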

Putting serial values in a matrix with Python

I have a stream of elements arriving serially: (x0, y0, val0), (x1, y1, val1), ... (xN, yN, valN), etc.
x and y are coordinates directly indicating where val should be put in a matrix. I tried the following, but it does not work (I expected the interpreter to expand the matrix automatically, but it does not):
data_matrix = [ [], [] ]
while (elem = new_element_available()):
    data_matrix[ elem[x] ][ elem[y] ] = elem[val]
How can I do this in Python, as simply as possible?
Extend your array to accommodate incoming points as you go. You may end up with a jagged 2D array, but you should be able to square it up easily if you need to.
inp = [(0, 0, 0), (0, 2, 1), (1, 1, 2), (3, 3, 3)]
outp = []
for x, y, val in inp:
    outp.extend([] for _ in range(x - len(outp) + 1))
    outp[x].extend(None for _ in range(y - len(outp[x]) + 1))
    outp[x][y] = val
print(outp)
[[0, None, 1], [None, 2], [], [None, None, None, 3]]
Alternatively you can use a dictionary-based structure with defaultdict:
import collections
outp = collections.defaultdict(dict)
for x, y, val in inp:
    outp[x][y] = val
print(dict(outp))
{0: {0: 0, 2: 1}, 1: {1: 2}, 3: {3: 3}}
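If you later need a rectangular matrix rather than nested dicts, the structure can be squared up. A sketch (the None fill value is my choice, matching the jagged-list version above):

```python
import collections

inp = [(0, 0, 0), (0, 2, 1), (1, 1, 2), (3, 3, 3)]
outp = collections.defaultdict(dict)
for x, y, val in inp:
    outp[x][y] = val

# pad to a full rows x cols matrix, None where no value arrived
rows = max(outp) + 1
cols = max(y for d in outp.values() for y in d) + 1
matrix = [[outp.get(r, {}).get(c) for c in range(cols)] for r in range(rows)]
print(matrix)
```

Rows that never received a value (row 2 here) come out as all-None rather than being absent.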

Pythonic way of finding indexes of unique elements in two arrays

I have two sorted, numpy arrays similar to these ones:
x = np.array([1, 2, 8, 11, 15])
y = np.array([1, 8, 15, 17, 20, 21])
Elements never repeat within the same array. I want to figure out a Pythonic way of producing a list of the index pairs at which the same element exists in both arrays.
For instance, 1 exists in x and y at index 0. Element 2 in x doesn't exist in y, so I don't care about that item. However, 8 does exist in both arrays - in index 2 in x but index 1 in y. Similarly, 15 exists in both, in index 4 in x, but index 2 in y. So the outcome of my function would be a list that in this case returns [[0, 0], [2, 1], [4, 2]].
So far what I'm doing is:
def get_indexes(x, y):
    indexes = []
    for i in range(len(x)):
        # Find index where item x[i] is in y:
        j = np.where(x[i] == y)[0]
        # If it exists, save it:
        if len(j) != 0:
            indexes.append([i, j[0]])
    return indexes
But the problem is that arrays x and y are very large (millions of items), so it takes quite a while. Is there a better pythonic way of doing this?
Without Python loops
Code
def get_indexes_darrylg(x, y):
    ' darrylg answer '
    # Use intersect to find common elements between the two arrays
    overlap = np.intersect1d(x, y)
    # Indexes of common elements in each array
    loc1 = np.searchsorted(x, overlap)
    loc2 = np.searchsorted(y, overlap)
    # Result is the zip of two 1d numpy arrays into a 2d array
    return np.dstack((loc1, loc2))[0]
Usage
x = np.array([1, 2, 8, 11, 15])
y = np.array([1, 8, 15, 17, 20, 21])
result = get_indexes_darrylg(x, y)
# result: array([[0, 0],
#                [2, 1],
#                [4, 2]], dtype=int64)
Timing Posted Solutions
Results show that the darrylg code has the fastest run time.
Code Adjustment
Each posted solution as a function.
Slight mod so that each solution outputs a numpy array.
Curve named after poster
Code
import numpy as np
import perfplot
def create_arr(n):
    ' Creates pair of 1d numpy arrays with half the elements equal '
    max_val = 100000  # One more than largest value in output arrays
    arr1 = np.random.randint(0, max_val, (n,))
    arr2 = arr1.copy()
    # Change half the elements in arr2
    all_indexes = np.arange(0, n, dtype=int)
    indexes = np.random.choice(all_indexes, size=n//2, replace=False)  # locations to make changes
    np.put(arr2, indexes, np.random.randint(0, max_val, (n//2, )))  # assign new random values at change locations
    arr1 = np.sort(arr1)
    arr2 = np.sort(arr2)
    return (arr1, arr2)
def get_indexes_lllrnr101(x, y):
    ' lllrnr101 answer '
    ans = []
    i = 0
    j = 0
    while (i < len(x) and j < len(y)):
        if x[i] == y[j]:
            ans.append([i, j])
            i += 1
            j += 1
        elif (x[i] < y[j]):
            i += 1
        else:
            j += 1
    return np.array(ans)
def get_indexes_joostblack(x, y):
    'joostblack'
    indexes = []
    for idx, val in enumerate(x):
        idy = np.searchsorted(y, val)
        try:
            if y[idy] == val:
                indexes.append([idx, idy])
        except IndexError:
            continue  # ignore index errors
    return np.array(indexes)
def get_indexes_mustafa(x, y):
    indices_in_x = np.flatnonzero(np.isin(x, y))  # array([0, 2, 4])
    indices_in_y = np.flatnonzero(np.isin(y, x[indices_in_x]))  # array([0, 1, 2])
    return np.array(list(zip(indices_in_x, indices_in_y)))
def get_indexes_darrylg(x, y):
    ' darrylg answer '
    # Use intersect to find common elements between the two arrays
    overlap = np.intersect1d(x, y)
    # Indexes of common elements in each array
    loc1 = np.searchsorted(x, overlap)
    loc2 = np.searchsorted(y, overlap)
    # Result is the zip of two 1d numpy arrays into a 2d array
    return np.dstack((loc1, loc2))[0]
def get_indexes_akopcz(x, y):
    ' akopcz answer '
    return np.array([
        [i, j]
        for i, nr in enumerate(x)
        for j in np.where(nr == y)[0]
    ])
perfplot.show(
    setup=create_arr,  # tuple of two 1D random arrays
    kernels=[
        lambda a: get_indexes_lllrnr101(*a),
        lambda a: get_indexes_joostblack(*a),
        lambda a: get_indexes_mustafa(*a),
        lambda a: get_indexes_darrylg(*a),
        lambda a: get_indexes_akopcz(*a),
    ],
    labels=["lllrnr101", "joostblack", "mustafa", "darrylg", "akopcz"],
    n_range=[2 ** k for k in range(5, 21)],
    xlabel="Array Length",
    # More optional arguments with their default values:
    # logx="auto",  # set to True or False to force scaling
    # logy="auto",
    equality_check=None,  # np.allclose; set to None to disable "correctness" assertion
    # show_progress=True,
    # target_time_per_measurement=1.0,
    # time_unit="s",  # set to one of ("auto", "s", "ms", "us", or "ns") to force plot units
    # relative_to=1,  # plot the timings relative to one of the measurements
    # flops=lambda n: 3*n,  # FLOPS plots
)
What you are doing is O(n log n), which is decent enough.
If you want, you can do it in O(n) by iterating over both arrays with two pointers; since the arrays are sorted, advance the pointer of the array with the smaller element.
See below:
x = [1, 2, 8, 11, 15]
y = [1, 8, 15, 17, 20, 21]
def get_indexes(x, y):
    ans = []
    i = 0
    j = 0
    while (i < len(x) and j < len(y)):
        if x[i] == y[j]:
            ans.append([i, j])
            i += 1
            j += 1
        elif (x[i] < y[j]):
            i += 1
        else:
            j += 1
    return ans
print(get_indexes(x,y))
which gives me:
[[0, 0], [2, 1], [4, 2]]
This function will search for all occurrences of x[i] in the y array; if duplicates are not allowed in y, it will find x[i] at most once.
def get_indexes(x, y):
    return [
        [i, j]
        for i, nr in enumerate(x)
        for j in np.where(nr == y)[0]
    ]
You can use numpy.searchsorted:
def get_indexes(x, y):
    indexes = []
    for idx, val in enumerate(x):
        idy = np.searchsorted(y, val)
        if y[idy] == val:
            indexes.append([idx, idy])
    return indexes
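Note that searchsorted can return len(y) when a value is larger than everything in y, which makes y[idy] raise an IndexError. A fully vectorized variant (my sketch; the function name is made up) clips the candidate indices and masks out non-matches:

```python
import numpy as np

def get_indexes_vectorized(x, y):
    # candidate insertion point of each x value in the sorted y
    idy = np.searchsorted(y, x)
    idy = np.minimum(idy, len(y) - 1)  # clip so y[idy] can't go out of bounds
    mask = y[idy] == x                 # keep only genuine matches
    return np.column_stack((np.flatnonzero(mask), idy[mask]))

x = np.array([1, 2, 8, 11, 15])
y = np.array([1, 8, 15, 17, 20, 21])
print(get_indexes_vectorized(x, y))  # rows: [0, 0], [2, 1], [4, 2]
```

This relies on y being sorted, as in the question, and on elements never repeating within an array.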
One solution is to first look from x's side to see what values are included in y by getting their indices through np.isin and np.flatnonzero, and then use the same procedure from the other side; but instead of giving x entirely, we give only the (already found) intersected elements to gain time:
indices_in_x = np.flatnonzero(np.isin(x, y)) # array([0, 2, 4])
indices_in_y = np.flatnonzero(np.isin(y, x[indices_in_x])) # array([0, 1, 2])
Now you can zip them to get the result:
result = list(zip(indices_in_x, indices_in_y)) # [(0, 0), (2, 1), (4, 2)]

Perform summation at indices saved as x and y arrays

I have a basic question regarding indexing. I have two arrays of length 9 million holding vectorized image coordinates that were extracted by a previous function. Now I want to decrement a heatmap using the vectorized data. I could use a for loop and zip the coordinates. However, I would prefer a faster solution like
T = [L[i] +=1 for i in zip(X,Y)]
or something. Is this possible?
coord = [x_coords,y_coords]
Heatmap[coord[0],coord[1]] -= 1
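One caveat with the fancy-indexing form above: when an (x, y) pair occurs several times, `Heatmap[coord[0], coord[1]] -= 1` subtracts only once per unique pair because the assignment is buffered. np.subtract.at applies the operation once per occurrence, which matches the semantics of a zip loop (a sketch with made-up data):

```python
import numpy as np

heatmap = np.full((8, 4), 10)
x_coords = [2, 3, 2, 2]  # the pair (2, 1) occurs three times
y_coords = [1, 0, 1, 1]

# buffered fancy indexing subtracts only once at (2, 1)...
buffered = heatmap.copy()
buffered[x_coords, y_coords] -= 1

# ...while the unbuffered ufunc counts every occurrence
np.subtract.at(heatmap, (x_coords, y_coords), 1)
print(buffered[2, 1], heatmap[2, 1])  # 9 7
```

With unique coordinate pairs the two forms agree, so this only matters when coordinates repeat, as they typically do in heatmap data.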
This is one solution using collections. I have also added a performance comparison versus @Piinthesky's pandas solution.
import numpy as np
import pandas as pd
from collections import Counter, OrderedDict
#your pre-existing heatmap as a numpy array
heat_map = np.arange(32).reshape(8, 4)
#your x and y pairs as lists
x = [2, 3, 0, 5, 6, 2, 3, 4, 3]
y = [3, 1, 2, 0, 3, 3, 1, 1, 1]
def jp_data_analysis(heat_map, x, y):
    #count occurrences of x, y pairs
    c = OrderedDict(Counter(zip(x, y)))
    #create numpy array with count as value at position x, y
    pic_occur = np.zeros(heat_map.shape, dtype=int)
    x_c, y_c = list(zip(*c))
    pic_occur[x_c, y_c] = list(c.values())
    #subtract this from heatmap
    heat_map -= pic_occur
    return heat_map
def piinthesky(heat_map, x, y):
    #count occurrences of x, y pairs
    df = pd.DataFrame({"x": x, "y": y}).groupby(["x", "y"]).size().reset_index(name='count')
    #create numpy array with count as value at position x, y
    pic_occur = np.zeros([heat_map.shape[0], heat_map.shape[1]], dtype=int)
    pic_occur[df["x"], df["y"]] = df["count"]
    #and subtract this from heatmap
    heat_map -= pic_occur
    return heat_map
%timeit jp_data_analysis(heat_map, x, y)
# 10000 loops, best of 3: 43.8 µs per loop
%timeit piinthesky(heat_map, x, y)
# 100 loops, best of 3: 4.45 ms per loop
This is a solution using numpy/pandas. The x, y convention here follows the usual one for images, but you'd better check this against your dataset.
import numpy as np
import pandas as pd
#your pre-existing heatmap as a numpy array
heat_map = np.arange(32).reshape(8, 4)
#your x and y pairs as lists
x = [2, 3, 0, 5, 6, 2, 3, 4, 3]
y = [3, 1, 2, 0, 3, 3, 1, 1, 1]
#count occurrences of x, y pairs
df = pd.DataFrame({"x": x, "y": y}).groupby(["x", "y"]).size().reset_index(name='count')
#create numpy array with count as value at position x, y
pic_occur = np.zeros([heat_map.shape[0], heat_map.shape[1]], dtype = int)
pic_occur[df["x"], df["y"]] = df["count"]
#and subtract this from heatmap
heat_map -= pic_occur
