verify distribution of uniformly distributed 3D coordinates - python

I would like to write a Python script that generates uniformly distributed 3D coordinates (x, y, z), where x, y, and z are floats between 0 and 1. For the moment z can be fixed, so what I need is uniformly distributed points in the 2D (x-y) plane. I have written a script to do this and checked that x and y are each uniformly distributed. However, I am not sure whether the points are uniformly distributed in the (x-y) plane.
My code is
import matplotlib.pyplot as plt
import random
import numpy as np
import csv

nk1 = 300
nk2 = 300
nk3 = 10
kx = []
ky = []
kz = []
for i in range(nk1):
    for j in range(nk2):
        for k in range(nk3):
            xkg1 = random.random()
            xkg2 = random.random()
            xkg3 = float(k) / nk3
            kx.append(xkg1)
            ky.append(xkg2)
            kz.append(xkg3)
kx = np.array(kx)
# density=True replaces the old normed=True, which newer Matplotlib no longer accepts
count, bins, ignored = plt.hist(kx, density=True)
plt.plot(bins, np.ones_like(bins), linewidth=2, color='r')
plt.show()
The plot shows that both kx and ky are uniformly distributed. However, how can I make sure that the (x, y) points are uniformly distributed in the 2D plane?

Just as you used np.histogram [1] to check uniformity in 1D, you can use np.histogram2d to do the same thing in 2D, and np.histogramdd in 3D and higher.
To see an example, let's first fix your loops by making them go away:
kx = np.random.random(nk1 * nk2 * nk3)
ky = np.random.random(nk1 * nk2 * nk3)
kz = np.tile(np.arange(nk3) / nk3, nk1 * nk2)
hist2d, *_ = np.histogram2d(kx, ky, range=[[0, 1], [0, 1]])
The range parameter ensures that you are binning over [0, 1) in each direction, not over the actual min and max of your data, no matter how close to 0 and 1 they may be.
Now it's entirely up to you how to visualize the 100 bin counts in hist2d (10 x 10 bins by default). One simple way would be to just ravel it and do a bar chart like you did for the 1D case:
plt.bar(np.arange(hist2d.size), hist2d.ravel())
plt.plot([0, hist2d.size], [nk1 * nk2 * nk3 / hist2d.size] * 2)
Another simple way would be to do a heat map:
plt.imshow(hist2d, interpolation='nearest', cmap='hot')
This is actually not as useful as the bar chart, and doesn't generalize to higher dimensions as well.
Your best bet is probably just checking the standard deviation of the raw data.
[1] Or rather, plt.hist did it for you under the hood.
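As a rough numeric complement to the plots, the bin counts can be compared with the count expected under uniformity. The sketch below is an illustration, not the answerer's code: the sample size, the fixed seed, and the use of a Pearson chi-square statistic are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed (an assumption) so the run is repeatable
n = 100_000                      # illustrative sample size, not taken from the question
kx = rng.random(n)
ky = rng.random(n)

# 10 x 10 bins over the unit square, as in the histogram2d call above
hist2d, xedges, yedges = np.histogram2d(kx, ky, bins=10, range=[[0, 1], [0, 1]])

expected = n / hist2d.size       # count per bin if the points are uniform
rel_dev = np.abs(hist2d - expected) / expected

# Pearson chi-square statistic; for uniform data it should be of the same
# order as the number of degrees of freedom (hist2d.size - 1 = 99)
chi2 = ((hist2d - expected) ** 2 / expected).sum()

# For a uniform variable on [0, 1) the standard deviation is 1/sqrt(12), about 0.2887
print("std of kx:", kx.std(), "std of ky:", ky.std())
print("largest relative deviation from the expected count:", rel_dev.max())
print("chi-square statistic:", chi2)
```

A chi-square value far above the number of bins, or a few bins that deviate much more than the rest, would suggest the points are not uniform in the plane even if each coordinate looks uniform on its own.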

With the help of @Mad Physicist, I finally found a way to verify the uniform distribution of random numbers in 2D. Here is my script, with comments explaining the details:
import numpy as np
import random
import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)
nk1 = 100
nk2 = 100
nk3 = 1
kx = []
ky = []
kz = []
for i in range(nk1):
    for j in range(nk2):
        for k in range(nk3):
            xkg = random.random()
            ykg = random.random()
            zkg = float(k) / nk3
            kx.append(xkg)
            ky.append(ykg)
            kz.append(zkg)
kx = np.array(kx)
ky = np.array(ky)
kz = np.array(kz)
xedges, yedges = np.linspace(0, 1, 6), np.linspace(0, 1, 6)
## count the number of points in the region defined by (xedges[i], xedges[i+1])
## and (yedges[j], yedges[j+1]). Six edges per axis give 5*5 = 25 squares in total.
hist, xedges, yedges = np.histogram2d(kx, ky, (xedges, yedges))
## index of the bin each point falls into, clipped to the valid range
xidx = np.clip(np.digitize(kx, xedges) - 1, 0, hist.shape[0] - 1)
yidx = np.clip(np.digitize(ky, yedges) - 1, 0, hist.shape[1] - 1)
ax1.bar(np.arange(hist.size), hist.ravel())
ax1.plot([0, hist.size], [nk1 * nk2 * nk3 / hist.size] * 2)
## colour each point by the count of the bin it belongs to
c = hist[xidx, yidx]
new = ax2.scatter(kx, ky, c=c, cmap='jet')
## colorbar tied to the scatter plot, so its scale shows the bin counts
fig.colorbar(new, ax=ax2)
ax2.grid(True)
plt.show()

Related

Multiple datasets and fit line

I have different datasets:
Df1
X Y
1 1
2 5
3 14
4 36
5 90
Df2
X Y
1 1
2 5
3 21
4 38
5 67
Df3
X Y
1 1
2 5
3 10
4 50
5 78
I would like to determine a line which fits this data and plot all data in one chart (like a regression).
On the x axis I have the time; on the y axis I have the frequency of an event that occurs.
Any help on the approach on how to determine the line and plot the results keeping the different legends (would be ok with seaborn or matplotlib) would be helpful.
What I have done so far is plotting the three lines as follows:
plot_df = pd.DataFrame(list(zip(dataset_list, x_lists, y_lists)),
                       columns=['Dataset', 'X', 'Y']).set_index('Dataset', inplace=False)
plot_df = plot_df.apply(pd.Series.explode).reset_index()  # this step should transpose the resulting df and explode the values
# plot
fig, ax = plt.subplots(figsize=(10,8))
for name, group in plot_df.groupby('Dataset'):
    group.plot(x="X", y="Y", ax=ax, label=name)
Please note that the three lists at the beginning contain information on the three different df.
I recommend using linregress from scipy.stats, as it gives very readable code. You just need to add the fit to your loop:
from scipy.stats import linregress
for name, group in plot_df.groupby('Dataset'):
    group.plot(x="X", y="Y", ax=ax, label=name)
    # fit a line to the data
    fit = linregress(group.X, group.Y)
    ax.plot(group.X, group.X * fit.slope + fit.intercept, label=f'{name} fit')
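If a single line through all three datasets is wanted instead of one line per dataset, the points can be pooled before fitting. The following is a minimal sketch under that assumption: it swaps in `np.polyfit` with degree 1 rather than `linregress`, and the arrays are typed in from the Df1, Df2 and Df3 tables in the question.

```python
import numpy as np

# X is 1..5 for each of the three datasets; Y is Df1, Df2, Df3 in that order
x = np.tile(np.arange(1, 6), 3).astype(float)
y = np.array([1, 5, 14, 36, 90,
              1, 5, 21, 38, 67,
              1, 5, 10, 50, 78], dtype=float)

# Degree-1 least-squares fit over the pooled points
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept

print(f"pooled fit: y = {slope:.2f} * x + ({intercept:.2f})")
```

The pooled line could then be drawn on the same axes with something like `ax.plot(x, fitted, label='pooled fit')`, next to the per-dataset lines.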

Pandas - plotting user RFM

Given the following DF of user RFM activity:
uid R F M
0 1 10 1 5
1 1 2 2 10
2 1 4 3 1
3 1 5 4 10
4 2 10 1 3
5 2 1 2 10
6 2 1 3 4
Recency: The time between the last purchase and today, represented by
the distance between the rightmost circle and the vertical dotted line
that's labeled Now.
Frequency: The time between purchases, represented by the distance
between the circles on a single line.
Monetary: The amount of money spent on each purchase, represented by
the size of the circle. This amount could be the average order value
or the quantity of products that the customer ordered.
I would like to plot something like the figure below:
Where the size of the circle is the M value and the distance is the R. Any help would be appreciated.
Update
As suggested by Diziet Asahi I've tried the following:
import matplotlib.pyplot as plt
import pandas as pd

def plot_users(df):
    fig, ax = plt.subplots()
    ax.axis('off')
    ax.scatter(x=df['M'], y=df['uid'], s=30*df['R'], marker='o', color='grey')
    ax.invert_xaxis()
    ax.axvline(0, ls='--', color='black', zorder=-1)
    for y in df['uid'].unique():
        ax.axhline(y, color='grey', zorder=-1)

tmp = pd.DataFrame({'uid':[1,1,1,1,2,2,2],'R':[10,2,4,5,10,1,1],'F':[1,2,3,4,1,3,4],'M':[5,10,1,10,3,10,4]})
plot_users(tmp)
And I get the following:
So I think there is a bug, since the first user has 4 records and the sizes also don't match.
You can use matplotlib's scatter() with the s= argument to draw markers with an area proportional to the value in M. The rest is just tweaking the appearance of the plot.
c = 'xkcd:dark grey'
fig, ax = plt.subplots()
ax.axis('off')
ax.scatter(x=df['R'],y=df['uid'],s=60*df['M'], marker='o', color=c)
ax.invert_xaxis()
ax.axvline(0, ls='--', color=c, zorder=-1)
for y in df['uid'].unique():
    ax.axhline(y, color=c, zorder=-1)
ax.set_ymargin(1)

Generating an histogram with Matplotlib using a dataframe for x and y

I'm generating a simple line chart with Matplotlib, here is my code:
fig = plt.figure(facecolor='#131722', dpi=155, figsize=(8, 4))
ax1 = plt.subplot2grid((1,2), (0,0), facecolor='#131722')
for x in OrderedList:
    rate_buy = []
    total_buy = []
    for y in x['data']['bids']:
        rate_buy.append(y[0])
        total_buy.append(y[1])
    rBuys = pd.DataFrame({'buy': rate_buy})
    tBuys = pd.DataFrame({'total': total_buy})
    ax1.plot(rBuys.buy, tBuys.total, color='#0400ff', linewidth=0.5, alpha=1)
    ax1.fill_between(rBuys.buy, 0, tBuys.total, facecolor='#0400ff', alpha=1)
Which gives me the following output:
And here is the data I used in the dataframes:
buy
0 9611
1 9610
2 9609
3 9608
4 9607
5 9606
6 9605
7 9604
8 9603
9 9602
10 9601
11 9600
12 9599
total
0 3.033661
1 3.295753
2 3.599813
3 22.305765
4 22.987476
5 30.975145
6 39.492845
7 42.828580
8 46.677708
9 49.533740
10 50.925840
11 61.396243
12 61.921523
I want to get the same output as in the image, but as a histogram or something similar, where the height of each column on the y axis comes from the total dataframe and its position on the x axis comes from the buy dataframe. So the first element would be at x=9611 with height y=3.033661.
Is it possible to do that with Matplotlib? I tried to use hist, but it doesn't allow me to set both the x and the y values.
Pandas uses matplotlib as well, and the API is very easy once you have the dataframe.
Here is an example.
import pandas as pd
import matplotlib.pyplot as plt

d = {
    'buy': [9611, 9610, 9609, 9608, 9607, 9606, 9605,
            9604, 9603, 9602, 9601, 9600, 9599],
    'total': [3.033661, 3.295753, 3.599813, 22.305765, 22.987476,
              30.975145, 39.492845, 42.828580, 46.677708, 49.533740,
              50.925840, 61.396243, 61.921523],
}
df = pd.DataFrame(d)
df = df.sort_values(by=['buy'])  # remember to sort your x values!
df.plot(kind='bar', x='buy', y='total', width=1)
plt.show()

plt.bar - x axis plot not in line with the x axis labels (matplotlib in python)

I am trying to make a bar plot, but the bars and the x-axis labels are not aligned.
import matplotlib.pyplot as plt
import numpy as np
#import matplotlib.patches as mpatches
residue_distb = np.loadtxt('Aggregated AA Probabilities Final.csv',
                           delimiter=',', skiprows=1, usecols=(1))
print(residue_distb)
x = np.arange(len(residue_distb))
bar_width = 0.4
plt.bar(x, residue_distb, width=bar_width, color='blue')
aminoacids = np.loadtxt('Aggregated AA Probabilities Final.csv',
                        dtype='str', delimiter=',', skiprows=1, usecols=(0))
plt.xticks(x + bar_width*2, aminoacids)
plt.xticks(rotation=90)
I think the problem is that x runs from 0 to 19 and not from 1 to 20:
x = [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
This means that my last label has no value and my first value has no label.
I tried a list comprehension to add one to every value, but when I plot this again it shows the numbers instead of the x-axis labels.
Is there a way to shift this?
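One likely cause is the tick offset rather than the 0-based x: the labels are placed at `x + bar_width*2`, which is 0.8 to the right of each bar position. A minimal sketch of one way to line them up, assuming made-up probabilities and one-letter labels in place of the CSV file (which is not shown in the question): centre the bars on `x` explicitly and put the ticks at the same `x`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

aminoacids = list("ACDEFGHIKLMNPQRSTVWY")        # hypothetical labels, 20 entries
residue_distb = np.full(len(aminoacids), 0.05)   # hypothetical probabilities

x = np.arange(len(residue_distb))
bar_width = 0.4

fig, ax = plt.subplots()
# align="center" makes each bar sit symmetrically around its x value
bars = ax.bar(x, residue_distb, width=bar_width, color="blue", align="center")
ax.set_xticks(x)                             # ticks at the bar centres, no offset
ax.set_xticklabels(aminoacids, rotation=90)
```

With the ticks at `x` itself, the first bar gets the first label and the last bar gets the last one, so nothing needs to be shifted to 1-20.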

Trouble with numpy.histogram2d

I'm trying to see if numpy.histogram2d will cross tabulate data in 2 arrays for me. I've never used this function before and I'm getting an error I don't know how to fix.
import numpy as np
import random
zones = np.zeros((20,30), int)
values = np.zeros((20,30), int)
for i in range(20):
    for j in range(30):
        values[i,j] = random.randint(0,10)
zones[:8,:15] = 100
zones[8:,:15] = 101
zones[:8,15:] = 102
zones[8:,15:] = 103
np.histogram2d(zones,values)
This code results in the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-18-53447df32000> in <module>()
----> 1 np.histogram2d(zones,values)
C:\Python27\ArcGISx6410.2\lib\site-packages\numpy\lib\twodim_base.pyc in histogram2d(x, y, bins, range, normed, weights)
613 xedges = yedges = asarray(bins, float)
614 bins = [xedges, yedges]
--> 615 hist, edges = histogramdd([x,y], bins, range, normed, weights)
616 return hist, edges[0], edges[1]
617
C:\Python27\ArcGISx6410.2\lib\site-packages\numpy\lib\function_base.pyc in histogramdd(sample, bins, range, normed, weights)
279 # Sample is a sequence of 1D arrays.
280 sample = atleast_2d(sample).T
--> 281 N, D = sample.shape
282
283 nbin = empty(D, int)
ValueError: too many values to unpack
Here is what I am trying to accomplish:
I have 2 arrays. One array comes from a geographic dataset (raster) representing Landcover classes (e.g. 1=Tree, 2=Grass, 3=Building, etc.). The other array comes from a geographic dataset (raster) representing some sort of political boundary (e.g. parcels, census blocks, towns, etc). I am trying to get a table that lists each unique political boundary area (array values represent a unique id) as rows and the total number of pixels within each boundary for each landcover class as columns.
I'm assuming values is the landcover and zones is the political boundaries. You might want to use np.bincount, which is like a special histogram where each bin has spacing and width of exactly one.
import numpy as np
zones = np.zeros((20,30), int)
zones[:8,:15] = 100
zones[8:,:15] = 101
zones[:8,15:] = 102
zones[8:,15:] = 103
values = np.random.randint(0,10,(20,30)) # no need for that loop
tab = np.array([np.bincount(values[zones==zone]) for zone in np.unique(zones)])
You can do this more simply with histogram, though, if you are careful with the bin edges:
np.histogram2d(zones.flatten(), values.flatten(), bins=[np.unique(zones).size, values.max()-values.min()+1])
The way this works is as follows. The easiest example is to look at all values regardless of zone:
np.bincount(values.ravel())
This gives you one row with the counts for each value (0 to 9); bincount needs 1D input, hence the ravel. The next step is to look at the zones.
For one zone, you'd have just one row, and it would be:
zone = 101 # the desired zone
mask = zone==zones # a mask that is True wherever your zones map matches the desired zone
np.bincount(values[mask]) # count the values where the mask is True
Now, we just want to do this for each zone in the map. You can get a list of the unique values in your zones map with
zs = np.unique(zones)
and loop through it with a list comprehension, where each item is one of the rows as above:
tab = np.array([np.bincount(values[zones==zone]) for zone in np.unique(zones)])
Then, your table looks like this:
print(tab)
# elements with cover =
#   0  1  2  3  4  5  6  7  8  9     # in zone:
[[16 11 10 12 13 15 11  7 13 12]     # 100
 [13 23 15 16 24 16 24 21 15 13]     # 101
 [10 12 23 13 12 11 11  5 11 12]     # 102
 [19 25 20 12 16 19 13 18 22 16]]    # 103
Finally, you can plot this in matplotlib as so:
import matplotlib.pyplot as plt
plt.hist2d(zones.flatten(), values.flatten(), bins=[np.unique(zones).size, values.max()-values.min()+1])
histogram2d expects 1D arrays as input, and your zones and values are 2D. You could linearize them with ravel:
np.histogram2d(zones.ravel(), values.ravel())
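With default arguments that call returns a 10 x 10 grid of equal-width bins, which does not necessarily line up with the zone ids or the cover classes. A hedged sketch of one way to get exactly one bin per integer id, assuming integer-valued rasters like those in the question (the fixed seed and the half-integer edges are choices made for this example):

```python
import numpy as np

rng = np.random.default_rng(0)               # fixed seed, for repeatability only
zones = np.zeros((20, 30), dtype=int)
zones[:8, :15] = 100
zones[8:, :15] = 101
zones[:8, 15:] = 102
zones[8:, 15:] = 103
values = rng.integers(0, 10, size=(20, 30))  # cover classes 0..9

# Edges at half-integers put every integer id in the middle of its own bin
zone_edges = np.arange(zones.min(), zones.max() + 2) - 0.5
value_edges = np.arange(values.min(), values.max() + 2) - 0.5

tab, _, _ = np.histogram2d(zones.ravel(), values.ravel(),
                           bins=[zone_edges, value_edges])
tab = tab.astype(int)   # histogram2d returns floats; the counts are whole numbers
print(tab)
```

Each row is a zone and each column a cover class. If the ids are not consecutive integers, this layout produces empty rows or columns for the ids that are missing.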
If efficiency isn't a concern, I think this works for what you want to do
from collections import Counter
c = Counter(zip(zones.flat[:], landcover_classes.flat[:]))
c is a Counter whose keys are (zone, landcover class) tuples and whose values are the pixel counts. You can populate an array, if you like, with
for (i, j), count in c.items():
    my_table[i, j] = count
That only works, of course, if i and j are sequential integers starting at zero (i.e., from 0 to Ni and 0 to Nj).
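The restriction to sequential indices can be lifted by mapping each id to its position among the sorted unique ids. A minimal sketch, assuming small illustrative arrays in place of the real rasters (the names `cover` and `table` are made up for this example): `np.unique(..., return_inverse=True)` supplies the mapping and `np.add.at` accumulates the counts.

```python
import numpy as np

zones = np.array([[100, 100, 102],
                  [101, 103, 102]])   # illustrative zone ids, not starting at zero
cover = np.array([[1, 3, 3],
                  [1, 1, 7]])         # illustrative land-cover classes, with gaps

# zi[k] is the row index of pixel k's zone, ci[k] the column index of its class
zone_ids, zi = np.unique(zones.ravel(), return_inverse=True)
cover_ids, ci = np.unique(cover.ravel(), return_inverse=True)

table = np.zeros((zone_ids.size, cover_ids.size), dtype=int)
np.add.at(table, (zi, ci), 1)   # unbuffered add, so repeated (row, col) pairs all count

print(zone_ids)    # row labels
print(cover_ids)   # column labels
print(table)
```

Row r of `table` belongs to `zone_ids[r]` and column c to `cover_ids[c]`, so the original ids can be any integers.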
