I have the following boundary conditions for a time series in python.
The notation I use here is t_x, where x describes the time in milliseconds (this is not my code, I just thought this notation was good to explain my issue).
t_0 = 0
t_440 = -1.6
t_830 = 0
mean_value = -0.6
I want to create a list that contains 84 values (so the spacing is 10 ms between values).
The list should describe a "curve" that starts at zero, has its minimum value of -1.6 at 440 ms (index 44 in the list), ends with 0 at 830 ms (index 83 in the list), and the overall mean value of the list should be -0.6.
I absolutely could not come up with an idea for how to "fit" the boundaries to create such a list.
I would really appreciate help.
It is a quick and dirty approach, but it works:
X = list(range(0, 830 + 1, 10))
Y = [0.0 for x in X]
Y[44] = -1.6
b = 12.3486  # found by trial and error (see explanation below)
for x in range(44):
    Y[x] = -1.6*(b*x + x**2)/(b*44 + 44**2)
for x in range(83, 44, -1):
    Y[x] = -1.6*(b*(83-x) + (83-x)**2)/(b*38 + 38**2)
print(f'{sum(Y)/len(Y)=:8.6f}, {Y[0]=}, {Y[44]=}, {Y[83]=}')
from matplotlib import pyplot as plt
plt.plot(X, Y)
plt.show()
With the code giving the following output:
sum(Y)/len(Y)=-0.600000, Y[0]=-0.0, Y[44]=-1.6, Y[83]=-0.0
And showing the following diagram:
The first step in coming up with the above approach was to create a linearly sloping 'curve' from the minimum to the zeroes. It turned out that the linear approach gives a mean Y value that is too large here, which means that the 'curve' must have a sharp peak at its minimum and needs to be approximated with a polynomial. To keep things simple I decided to use a quadratic polynomial and to approach the minimum from the left and right sides separately, as the curve isn't symmetric. The b value was found by trial and error; its precision can be increased manually or by writing a small function that finds it iteratively.
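Such a small function could, for example, bisect on b, because the mean gets more negative as b grows; a minimal sketch reusing the construction above:
def mean_for(b):
    # same quadratic construction as in the snippet above
    Y = [0.0] * 84
    Y[44] = -1.6
    for x in range(44):
        Y[x] = -1.6*(b*x + x**2)/(b*44 + 44**2)
    for x in range(83, 44, -1):
        Y[x] = -1.6*(b*(83-x) + (83-x)**2)/(b*38 + 38**2)
    return sum(Y)/len(Y)

def find_b(target=-0.6, lo=0.0, hi=1000.0, tol=1e-9):
    # mean_for() decreases monotonically in b, so bisection applies
    while hi - lo > tol:
        mid = (lo + hi)/2
        if mean_for(mid) > target:  # mean not yet negative enough -> need larger b
            lo = mid
        else:
            hi = mid
    return (lo + hi)/2

print(find_b())  # should land near the hand-tuned 12.3486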
Update providing a generic solution as requested in a comment
The code below provides a
meanYboundaryXY(lbc = [(0,0), (440,-1.6), (830,0), -0.6], shape='saw')
function returning the X and Y lists of the time series data calculated from the passed boundary values:
def meanYboundaryXY(lbc=[(0, 0), (440, -1.6), (830, 0), -0.6], shape='saw'):
    lbcXY = lbc[0:3]
    meanY_boundary = lbc[3]
    minX = min(x for x, y in lbcXY)
    maxX = max(x for x, y in lbcXY)
    minY = lbc[1][1]
    step = 10
    X = list(range(minX, maxX + 1, step))
    lenX = len(X)
    Y = [None for x in X]
    sumY = 0
    for x, y in lbcXY:
        Y[x//step] = y
        sumY += y
    target_sumY = meanY_boundary*lenX
    if shape == 'rect':
        subY = (target_sumY - sumY)/(lenX - 3)
        for i, y in enumerate(Y):
            if y is None:
                Y[i] = subY
    elif shape == 'saw':
        peakNextY = 2*(target_sumY - sumY)/(lenX - 1)
        iYleft = lbc[1][0]//step - 1
        iYrght = iYleft + 2
        iYstart = lbc[0][0]//step
        iYend = lbc[2][0]//step
        for i in range(iYstart, iYleft + 1, 1):
            Y[i] = peakNextY * i/iYleft
        for i in range(iYend, iYrght - 1, -1):
            Y[i] = peakNextY * (iYend - i)/(iYend - iYrght)
    else:
        raise ValueError(f'meanYboundaryXY() EXIT, {shape=} not in ["saw","rect"]')
    return (X, Y)
X, Y = meanYboundaryXY()
print(f'{sum(Y)/len(Y)=:8.6f}, {Y[0]=}, {Y[44]=}, {Y[83]=}')
from matplotlib import pyplot as plt
plt.plot(X,Y)
plt.show()
The code outputs:
sum(Y)/len(Y)=-0.600000, Y[0]=0, Y[44]=-1.6, Y[83]=0
and creates the following two diagrams for shape='rect' and shape='saw':
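For reference, the two variants are produced by calling:
X, Y = meanYboundaryXY(shape='rect')
X, Y = meanYboundaryXY(shape='saw')  # the default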
As an old geek, I tried to solve the question with a simple algorithm.
First calculate points as two symmetric lines, 0 up to 44 and 44 down to 88 (orange on the graph).
Then calculate the sum of all points except the middle one, and its ratio to the sum the points should have for a mean of -0.6 (again excluding the middle point).
Apply that ratio to the previous points, except the middle point (blue curve on the graph).
This yields the curve Claudio called "saw".
For my part, I think Claudio's quadratic interpolation gives a nicer curve, but it needs trial-and-error loops.
# define goals
nbPoints = 89
msPerPoint = 10
midPoint = nbPoints//2
valueMidPoint = -1.6
meanGoal = -0.6

def createSerieLinear():
    # two lines, 0 up to 44 and 44 down to 88 (89 values centered on 44)
    serie = [0 for i in range(0, nbPoints)]
    interval = valueMidPoint/midPoint
    for i in range(0, midPoint + 1):
        serie[i] = i*interval
        serie[nbPoints - 1 - i] = i*interval
    return serie

# keep an original to plot
orange = createSerieLinear()
# work on a copy
base = createSerieLinear()
# total except midPoint
totalBase = sum(base) - valueMidPoint
# total goal except point 44
totalGoal = meanGoal*nbPoints - valueMidPoint
# apply ratio to reduce
reduceRatio = totalGoal/totalBase
for i in range(0, midPoint):
    base[i] *= reduceRatio
    base[nbPoints - 1 - i] *= reduceRatio
# verify
meanBase = sum(base)/nbPoints
print("new mean:", meanBase)
# draw
from matplotlib import pyplot as plt
X = [i*msPerPoint for i in range(0, nbPoints)]
plt.plot(X, base)
plt.plot(X, orange)
plt.show()
new mean: -0.5999999999999998
Hope you enjoy simple things :)
I have tried to simulate some event onsets and predictors for an experiment. I have two predictors (circles and squares). The stimuli ('events') take 1 second and the ISI (interstimulus interval) is 8 seconds. I am also interested in both contrasts against baseline (circles against baseline; squares against baseline). In the end, I am trying to run the function I have defined (simulate_data_fixed_ISI, where N=420 is a fixed parameter) 1000 times, calculating an efficiency score at each iteration and storing the efficiency scores in a list.
import numpy as np
# glover_hrf is assumed to come from an fMRI package such as nilearn,
# e.g. from nilearn.glm.first_level import glover_hrf

def simulate_data_fixed_ISI(N=420):
    dg_hrf = glover_hrf(tr=1, oversampling=1)
    # Create indices in regularly spaced intervals (9 seconds, i.e. 1 sec stim + 8 ISI)
    stim_onsets = np.arange(10, N - 15, 9)
    stimcodes = np.repeat([1, 2], stim_onsets.size // 2)  # create codes for two conditions
    np.random.shuffle(stimcodes)  # random shuffle
    stim = np.zeros((N, 1))
    c = np.array([[0, 1, 0], [0, 0, 1]])
    # Fill stim array with codes at onsets
    for i, stim_onset in enumerate(stim_onsets):
        stim[stim_onset] = 1 if stimcodes[i] == 1 else 2
    stims_A = (stim == 1).astype(int)
    stims_B = (stim == 2).astype(int)
    reg_A = np.convolve(stims_A.squeeze(), dg_hrf)[:N]
    reg_B = np.convolve(stims_B.squeeze(), dg_hrf)[:N]
    X = np.hstack((np.ones((reg_B.size, 1)), reg_A[:, np.newaxis], reg_B[:, np.newaxis]))
    dvars = [(c[i, :].dot(np.linalg.inv(X.T.dot(X))).dot(c[i, :].T))
             for i in range(c.shape[0])]
    eff = c.shape[0] / np.sum(dvars)
    return eff
However, I want to run this entire chunk 1000 times, store the 'eff' values in an array, etc., so that later on I can display them as a histogram. How can I do this?
If I understand you correctly, you should be able to just run
EFF = [simulate_data_fixed_ISI() for i in range(1000)] #1000 repeats
As @theonlygusti clarified, this line runs your function simulate_data_fixed_ISI() 1000 times and puts each return value into the list EFF.
Test
import numpy as np

def simulate_data_fixed_ISI(n=1):
    """
    Returns 'n' random numbers
    """
    return np.random.rand(n)
EFF = [simulate_data_fixed_ISI() for i in range(5)]
EFF
#[array([0.19585137]),
# array([0.91692933]),
# array([0.49294667]),
# array([0.79751017]),
# array([0.58294512])]
Your question seems to boil down to:
I am trying to run the function that I have defined 1000 times; at each iteration I would like to calculate an efficiency score and store the efficiency scores in a list
I guess "the function that I have defined" is the simulate_data_fixed_ISI in your question?
Then you can simply run it 1000 times using a basic for loop, and add the results into a list:
def simulate_data_fixed_ISI(N=420):
    dg_hrf = glover_hrf(tr=1, oversampling=1)
    # Create indices in regularly spaced intervals (9 seconds, i.e. 1 sec stim + 8 ISI)
    stim_onsets = np.arange(10, N - 15, 9)
    stimcodes = np.repeat([1, 2], stim_onsets.size // 2)  # create codes for two conditions
    np.random.shuffle(stimcodes)  # random shuffle
    stim = np.zeros((N, 1))
    c = np.array([[0, 1, 0], [0, 0, 1]])
    # Fill stim array with codes at onsets
    for i, stim_onset in enumerate(stim_onsets):
        stim[stim_onset] = 1 if stimcodes[i] == 1 else 2
    stims_A = (stim == 1).astype(int)
    stims_B = (stim == 2).astype(int)
    reg_A = np.convolve(stims_A.squeeze(), dg_hrf)[:N]
    reg_B = np.convolve(stims_B.squeeze(), dg_hrf)[:N]
    X = np.hstack((np.ones((reg_B.size, 1)), reg_A[:, np.newaxis], reg_B[:, np.newaxis]))
    dvars = [(c[i, :].dot(np.linalg.inv(X.T.dot(X))).dot(c[i, :].T))
             for i in range(c.shape[0])]
    eff = c.shape[0] / np.sum(dvars)
    return eff

eff_results = []
for _ in range(1000):
    efficiency_score = simulate_data_fixed_ISI()
    eff_results.append(efficiency_score)
Now eff_results contains 1000 entries, each of which is the result of one call to your function simulate_data_fixed_ISI.
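Since you want a histogram in the end, a minimal matplotlib sketch for that last step might be:
import matplotlib.pyplot as plt

plt.hist(eff_results, bins=30)  # the bin count is an arbitrary choice here
plt.xlabel('efficiency score')
plt.ylabel('count')
plt.show()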
I'm trying to create graphs of the Mandelbrot set. I have managed to do this by iterating over a lot of points, but this takes a lot of processing power, so I'm now trying to generate a polynomial by iterating f(z) = z**2 + c many times and then finding its roots in c, in order to generate a boundary of the set.
However, I can't seem to get Python to generate the polynomial; any help would be much appreciated.
Edit: Implemented azro's fix, but now I get the error - TypeError: unsupported operand type(s) for ** or pow(): 'NoneType' and 'int'
Code so far:
import numpy as np

c = None

def f(z):
    return z**2 + c

eqn = c
for i in range(100):
    eqn = f(eqn)
np.roots(eqn)
This is a very hard problem. Searching through the literature, I only found this (which doesn't seem very reputable). However, it does seem to begin to create what you want. This is only up to 8 iterations, so the polynomial gets very complicated very fast. See the following code:
import numpy as np
import matplotlib.pyplot as plt

# lower-order coefficient list (fewer iterations)
coeff = [0, 1, 1, 2, 5, 14, 26, 44, 69, 94, 114, 116, 94, 60, 28, 8, 1]
# coefficient list after 8 iterations (this overrides the list above)
coeff = [0, 1, 1, 2, 5, 14, 42, 132, 429, 1302, 3774, 10652, 29538, 80812, 218324, 582408, 1534301, 3993030, 10269590, 26108844, 65626918, 163107044, 400844588, 974083128, 2340595778, 5560968284, 13062923500, 30336029592, 69640352964, 158015533208, 354347339496, 785248461712, 1719477330477, 3720187393990, 7952125694214, 16792863663700, 35031835376454, 72188854953372, 146932182777116, 295372837865192, 586400982013486, 1149605839249820, 2225301467579844, 4252710138415640, 8022825031835276, 14938862548001560, 27452211062573400, 49778848242964944, 89054473147697354, 157160523515654628, 273551721580800380, 469540646039042536, 794643418760272876, 1325752376790240280, 2180053774442766712, 3532711259225506384, 5640327912922026260, 8870996681171366696, 13741246529612440920, 20959276151880728336, 31472438318100876584, 46514944583399578896, 67649247253332557392, 96791719611591962592, 136210493669590627493, 188481251186354006062, 256386228250001079082, 342743629811082484420, 450159936955994386738, 580706779030058464252, 735537050036491961156, 914470757914434625800, 1115597581733327913554, 1334957092752100409132, 1566365198635995978988, 1801452751402955781592, 2029966595320794439668, 2240353897304462193848, 2420609646335251593480, 2559320275988283588176, 2646791812246207696810, 2676118542978972739644, 2644036970936308845148, 2551425591643957182856, 2403354418943890067404, 2208653487832260558008, 1979045408073272278264, 1727958521630464742736, 1469189341596552030212, 1215604411161527170376, 978057923319151340728, 764655844340519788496, 580430565842543266504, 427417353874088245520, 305060580205223726864, 210835921361505594848, 140960183546144741182, 91071943593142473900, 56796799826096529620, 34150590308701283528, 19772322481956974532, 11008161481780603512, 5884917700519129288, 3016191418506637264, 1479594496462756340, 693434955498545848, 309881648709683160, 131760770157606224, 53181959591958024, 20324543852025936, 7333879739219600, 2490875091238112, 793548088258508, 236221241425176, 65418624260840, 16771945556496, 3958458557608, 854515874096, 167453394320, 29524775520, 4634116312, 639097008, 76185104, 7685024, 637360, 41696, 2016, 64, 1]
r = np.roots(coeff)
plt.plot(r.real, r.imag, '.')
plt.show()
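If you would rather generate such coefficient lists than hard-code them, a symbolic sketch is possible; the snippet below (an assumption on my part: it uses sympy, which the original code does not) expands the iterated polynomial and extracts its coefficients in the highest-power-first order that np.roots expects:
import sympy as sp

c = sp.symbols('c')
p = c  # first iterate of z**2 + c starting from z = 0
for _ in range(3):  # the degree doubles each iteration, so keep this small
    p = sp.expand(p**2 + c)

# highest-power-first coefficient list, ready for np.roots
coeffs = [float(a) for a in sp.Poly(p, c).all_coeffs()]
The degree doubling on every iteration is exactly why the hard-coded list above is so long.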
I would suggest something more like the following (stolen and modified from here). This sounds like something you've already tried. But try changing the max iterations to get something that can run relatively fast (30 was fast and had relatively high resolution for me).
import numpy as np
import matplotlib.pyplot as plt

MAX_ITER = 30

def mandelbrot(c):
    z = 0
    n = 0
    while abs(z) <= 2 and n < MAX_ITER:
        z = z*z + c
        n += 1
    return n

# Image size (pixels)
WIDTH = 600
HEIGHT = 400

# Plot window
RE_START = -2
RE_END = 1
IM_START = -1
IM_END = 1

img = np.zeros((WIDTH, HEIGHT))
for x in range(0, WIDTH):
    for y in range(0, HEIGHT):
        # Convert pixel coordinate to complex number
        c = complex(RE_START + (x / WIDTH) * (RE_END - RE_START),
                    IM_START + (y / HEIGHT) * (IM_END - IM_START))
        # Compute the number of iterations
        m = mandelbrot(c)
        if m > MAX_ITER - 1:
            img[x, y] = 1

plt.imshow(img.T, cmap='bone')
plt.show()
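If the pixel-by-pixel loop above is still too slow, a vectorized variant is one way to speed it up; this is a sketch of the same escape-time logic done with numpy array operations, reusing the constants defined above:
# Vectorized sketch: iterate every pixel at once instead of per-pixel Python loops.
re = np.linspace(RE_START, RE_END, WIDTH)
im = np.linspace(IM_START, IM_END, HEIGHT)
C = re[:, None] + 1j * im[None, :]     # grid of candidate c values
Z = np.zeros_like(C)
counts = np.zeros(C.shape, dtype=int)  # iterations before escape
for _ in range(MAX_ITER):
    mask = np.abs(Z) <= 2              # points that have not escaped yet
    Z[mask] = Z[mask]**2 + C[mask]
    counts += mask

plt.imshow((counts == MAX_ITER).T, cmap='bone')  # points that never escaped
plt.show()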
Imagine that there are 10 houses, each containing anywhere from one person to an infinite number of persons. Each of those persons sends some number of messages containing their userid and the house number; this can be from 1 to an infinite number of messages. I want to know the average number of messages sent per person for each house, to later plot which house got the largest average number of messages.
Now, that I've explained conceptually: the houses aren't houses, but latitude bands, e.g. from -90 to -89, and a person can send messages from different houses.
So I've got a database with latitude and senderID. I want to plot the density of latitudes per unique senderID:
Number of rows / number of unique userids at each latitude over an interval
This is a sample input:
lat = [-83.76, -44.88, -38.36, -35.50, -33.99, -31.91, -27.56, -22.95,
40.72, 47.59, 54.42, 63.84, 76.77, 77.43, 78.54]
userid= [5, 7, 6, 6, 6, 6, 5, 2,
2, 2, 1, 5, 10, 9 ,8]
Here are the corresponding densities:
-80 to -90: 1
-40 to -50: 1
-30 to -40: 4
-20 to -30: 1
40 to 50: 2
50 to 60: 1
60 to 70: 1
70 to 80: 1
Another input:
lat = [70,70,70,70,70,80,80,80]
userid = [1,2,3,4,5,1,1,2]
The density for latitude 70 is 1, while the density for latitude 80 is 1.5.
If I were to do this through a database query/pseudocode, I would do something like:
SELECT count(latitude) FROM messages WHERE latitude < 79 AND latitude > 69
SELECT count(distinct userid) FROM messages WHERE latitude < 79 AND latitude > 69
The density would then be count(latitude)/count(distinct userid), also to be interpreted as totalmessagesFromCertainLatitude/distinctUserIds. This would be repeated for intervals from -90 to 90, i.e. -90 < latitude < -89 up to 89 < latitude < 90.
Getting any help with this is probably a far stretch, but I just can't organize my thoughts to do this in a way I'm confident has no errors. I would be happy with anything. I'm sorry if I was unclear.
Because this packs so neatly into pandas' built-ins, it's probably fast in pandas for big datasets.
lat = [-83.76, -44.88, -38.36, -35.50, -33.99, -31.91, -27.56, -22.95,
       40.72, 47.59, 54.42, 63.84, 76.77, 77.43, 78.54]
userid = [5, 7, 6, 6, 6, 6, 5, 2,
          2, 2, 1, 5, 10, 9, 8]

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
from matplotlib.collections import PatchCollection
from math import floor

df = pd.DataFrame(list(zip(userid, lat)), columns=['userid', 'lat'])

zonewidth = 10  # for ten-degree zones
#zonewidth = 1  # for one-degree zones
df['zone'] = [floor(x) * zonewidth for x in df.lat / zonewidth]

dfz = df.groupby('zone')  # a GroupBy object, iterable as (zone, sub-DataFrame) pairs
#for k, v in dfz:  # useful for exploring the GroupBy object
#    print(k, v.userid.values, float(len(v.userid.values))/len(set(v.userid.values)))
p = [(k, float(len(v.userid.values))/len(set(v.userid.values))) for k, v in dfz]

# plotting could be tightened up -- PatchCollection?
R = [Rectangle((x, 0), zonewidth, y, facecolor='red', edgecolor='black', fill=True)
     for x, y in p]
fig, ax = plt.subplots()
for r in R:
    ax.add_patch(r)
plt.xlim((-90, 90))
tall = max([r.get_height() for r in R])
plt.ylim((0, tall + 0.5))
plt.show()
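Picking up the "PatchCollection?" comment in the code above, the patch loop could indeed be replaced with a collection; a minimal sketch, same output:
fig, ax = plt.subplots()
# match_original=True keeps each rectangle's own face/edge colors
ax.add_collection(PatchCollection(R, match_original=True))
ax.set_xlim(-90, 90)
ax.set_ylim(0, max(r.get_height() for r in R) + 0.5)
plt.show()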
For the first set of test data:
I'm not 100% sure I've understood the output you want, but this will produce a stepped, cumulative histogram-like plot with the x-axis being latitudes (binned) and the y axis being the density you define above.
From your sample code, you already have numpy installed and are happy to use it. The approach I would take is to get two data sets rather like what would be returned by your SQL sample, then use them to compute the densities and plot. Using your existing latitude/userid data format, it might look something like this
EDIT: Removed first version of code from here and some comments which were redundant following clarification and question edits from the OP
Following comments and OP clarification - I think this is what is desired:
import numpy as np
import matplotlib.pyplot as plt
from itertools import groupby
def draw_hist(latitudes, userids):
    min_lat = -90
    max_lat = 90
    binwidth = 1
    bin_range = np.arange(min_lat, max_lat, binwidth)
    binned_latitudes = np.digitize(latitudes, bin_range)
    all_in_bins = list(zip(binned_latitudes, userids))
    unique_in_bins = list(set(all_in_bins))
    all_in_bins.sort()
    unique_in_bins.sort()
    bin_count_all = []
    for bin, group in groupby(all_in_bins, lambda x: x[0]):
        bin_count_all += [(bin, len([k for k in group]))]
    bin_count_unique = []
    for bin, group in groupby(unique_in_bins, lambda x: x[0]):
        bin_count_unique += [(bin, len([k for k in group]))]
    # bin_count_all and bin_count_unique now contain the data
    # corresponding to the SQL / pseudocode in your question
    # for each latitude bin
    bin_density = [(bin_range[b-1], a*1.0/u)
                   for ((b, a), (_, u)) in zip(bin_count_all, bin_count_unique)]
    bin_density = np.array(bin_density).transpose()
    # plot as standard bar - note you can put uneven widths in
    # as an array-like here if necessary
    # the * simply unpacks the x and y values from the density
    plt.bar(*bin_density, width=binwidth)
    plt.show()
    # can save away plot here if desired

latitudes = [-70.5, 5.3, 70.32, 70.43, 5, 32, 80, 80, 87.3]
userids = [1, 1, 2, 2, 4, 5, 1, 1, 2]
draw_hist(latitudes, userids)
Sample output with different bin widths on OP dataset
I think this solves the case, although it isn't efficient at all:
import sqlite3 as lite
import time
from operator import itemgetter
from matplotlib import pyplot as plt

con = lite.connect(databasepath)  # databasepath: path to the sqlite database file
binwidth = 1
latitudes = []
userids = []
info = []
densities = []
with con:
    cur = con.cursor()
    cur.execute('SELECT latitude, userid FROM dynamicMessage')
    con.commit()
    print("executed")
    while True:
        tmp = cur.fetchone()
        if tmp is not None:
            info.append([float(tmp[0]), float(tmp[1])])
        else:
            break
info = sorted(info, key=itemgetter(0))
for x in info:
    latitudes.append(x[0])
    userids.append(x[1])
x = 0
latitudecount = 0
for b in range(int(min(latitudes)), int(max(latitudes)) + 1):
    numlatitudes = sum(i < b for i in latitudes)
    if numlatitudes > 1:
        tempdensities = latitudes[0:numlatitudes]
        latitudes = latitudes[numlatitudes:]
        tempuserids = userids[0:numlatitudes]
        userids = userids[numlatitudes:]
        density = numlatitudes/len(list(set(tempuserids)))
        if density > 1:
            tempdensities = [b]*int(density)
        densities.extend(tempdensities)
plt.hist(densities, bins=len(list(set(densities))))
plt.savefig('latlongstats' + 't' + str(time.strftime("%H:%M:%S")), format='png')
What follows is not a complete solution in terms of plotting the required histogram, but I think it's nevertheless worth reporting.
The bulk of the solution: we scan the list of tuples, select the ones in the required range, and count both the number of selected tuples and the number of unique ids; for the latter we use the trick of building a set (which automatically discards duplicates) and taking its size.
Eventually we return the required ratio, or zero if the count of distinct ids is zero.
def ratio(d, mn, mx):
    tmp = [(lat, uid) for lat, uid in d if mn <= lat < mx]
    nlats, nduids = len(tmp), len({t[1] for t in tmp})
    return 1.0*nlats/nduids if nduids > 0 else 0
The data is input and assigned, via zip, to a list of tuples:
lat = [-83.76, -44.88, -38.36, -35.50, -33.99, -31.91, -27.56, -22.95,
-19.00, -12.32, -6.14, -1.11, 4.40, 10.23, 19.40, 31.18,
40.72, 47.59, 54.42, 63.84, 76.77]
userid= [52500.0, 70100.0, 35310.0, 47776.0, 70100.0, 30991.0, 37328.0, 25575.0,
37232.0, 6360.0, 52908.0, 52908.0, 52908.0, 77500.0, 345.0, 6360.0,
3670.0, 36690.0, 3720.0, 2510.0, 2730.0]
data = list(zip(lat, userid))  # list() so the data can be scanned repeatedly in Python 3
Preparation of the bins:
extremes = list(range(-90, 91, 10))
intervals = list(zip(extremes[:-1], extremes[1:]))
The actual computation; the result is a list of floats that can be passed to the relevant pyplot functions:
ratios = [ratio(data, *i) for i in intervals]
print(ratios)
# [1.0, 0, 0, 0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 0, 1.0, 1.0, 1.0, 1.0, 1.0, 0]
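For example, a minimal bar-plot sketch over the 10-degree bins (the axis labels are my own):
import matplotlib.pyplot as plt

# one bar per interval, anchored at the lower edge of each bin
plt.bar([mn for mn, mx in intervals], ratios, width=10, align='edge', edgecolor='black')
plt.xlabel('latitude')
plt.ylabel('messages per distinct userid')
plt.show()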
from random import uniform

prob = [0.25, 0.30, 0.45]

def onetrial(prob):
    u = uniform(0, 1)
    if 0 < u <= prob[0]:
        return 11
    if prob[0] < u <= prob[0] + prob[1]:
        return 23
    if prob[0] + prob[1] < u <= prob[0] + prob[1] + prob[2]:
        return 39

print(onetrial(prob))
I wonder how to reduce the repetitive part in the def using some for-loop techniques. Thanks.
The following is equivalent to your current code and it uses a for loop:
from random import uniform

prob = [0.25, 0.30, 0.45]

def onetrial(prob):
    u = uniform(0, 1)
    return_values = [11, 23, 39]
    total_prob = 0
    for i in range(3):
        total_prob += prob[i]
        if u <= total_prob:
            return return_values[i]
I am a little unclear on the relationship between the values you return and the probabilities; it seems like for your code prob will always have exactly three elements, so I made that assumption as well.
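If prob might not always have exactly three elements, the same loop generalizes by pairing probabilities with return values via zip; a small variation (the values tuple is an assumption of mine):
from random import uniform

def onetrial(prob, values=(11, 23, 39)):
    u = uniform(0, 1)
    total_prob = 0
    for p, v in zip(prob, values):
        total_prob += p
        if u <= total_prob:
            return v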
I like F.J's answer, but I would use a list of tuples, assuming you can easily do so:
from random import uniform

prob = [(0.25, 11), (0.30, 23), (0.45, 39)]

def onetrial(prob):
    u = uniform(0, 1)
    total_prob = 0
    for i in range(3):
        total_prob += prob[i][0]
        if u <= total_prob:
            return prob[i][1]
Assuming you call onetrial frequently, calculate the CDF first to make it a bit faster:
from random import uniform

vals = [11, 23, 39]
prob = [0.25, 0.30, 0.45]
cdf = [sum(prob[0:i+1]) for i in range(3)]

def onetrial(vals, cdf):
    u = uniform(0, 1)
    for i in range(3):
        if u <= cdf[i]:
            return vals[i]
You could use bisect to make it even faster.
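A sketch of that bisect variant, assuming the vals and cdf lists from above:
from bisect import bisect_left
from random import uniform

vals = [11, 23, 39]
cdf = [0.25, 0.55, 1.0]  # cumulative probabilities

def onetrial(vals, cdf):
    # bisect_left finds the first bin whose cumulative probability >= u
    return vals[bisect_left(cdf, uniform(0, 1))]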