I currently have the numbers above in a list. How would you go about grouping similar numbers (those within 850 of each other), averaging each group, and replacing the group with its mean to make the list smaller?
For example, I have the list
l = [2000, 2200, 5000, 2350]
In this list, I want to find numbers that are similar, within n + 500 of each other.
So I want all the numbers within n + 500, which are 2000, 2200 and 2350, to be added together and divided by their count, which is 3, to find the mean. The mean then replaces the three numbers, so the list becomes l = [2183, 5000].
As the image above shows the numbers in the list, here I would like all the numbers within n + 850 of each other to be selected and their mean found.
It seems that you are looking for a clustering algorithm, something like k-means.
This algorithm is implemented in the scikit-learn package.
After you find your k means, you can count how many of your data points were clustered around each mean and make your computations.
However, it is not clear in your case what k is. You can try running the algorithm for several values of k until you satisfy your constraint (the n + 500 distance between the means).
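For illustration, here is a minimal sketch of that idea, assuming the scikit-learn package is available; the stopping rule (every point within 500 of its cluster centre) is just one possible reading of the constraint:
import numpy as np
from sklearn.cluster import KMeans
l = np.array([2000, 2200, 5000, 2350]).reshape(-1, 1)
for k in range(1, len(l) + 1):
    km = KMeans(n_clusters=k, n_init=10).fit(l)
    # distance of every point to its own cluster centre
    distances = np.abs(l - km.cluster_centers_[km.labels_]).ravel()
    if distances.max() <= 500:
        break
print(km.cluster_centers_.ravel())  # e.g. [2183.33, 5000.] once k == 2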
You can use:
import numpy as np
l = np.array([2000, 2200, 5000, 2350])
# group the numbers into 500-wide bins using integer division
similar = l // 500
# for each bin, take the average and convert it to an integer (as in the desired output)
new_list = [np.average(l[similar == num]).astype(int) for num in np.unique(similar)]
print(new_list)
Output:
[2183, 5000]
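Note that l // 500 groups by fixed 500-wide bins anchored at 0, so two numbers that straddle a bin boundary (say 1999 and 2001) land in different groups even though they differ by only 2; for the sample data this does not matter, but it is worth knowing for other inputs.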
Step 1:
data = [5620.77978515625,
7388.43017578125,
7683.580078125,
8296.6513671875,
8320.82421875,
8557.51953125,
8743.5,
9163.220703125,
9804.7939453125,
9913.86328125,
9940.1396484375,
9951.74609375,
10074.23828125,
10947.0419921875,
11048.662109375,
11704.099609375,
11958.5,
11964.8232421875,
12335.70703125,
13103.0,
13129.529296875,
16463.177734375,
16930.900390625,
17712.400390625,
18353.400390625,
19390.96484375,
20089.0,
34592.15625,
36542.109375,
39478.953125,
40782.078125,
41295.26953125,
42541.6796875,
42893.58203125,
44578.27734375,
45077.578125,
48022.2890625,
52535.13671875,
58330.5703125,
61597.91796875,
62757.12890625,
64242.79296875,
64863.09765625,
66930.390625]
Step 2:
import numpy as np

seen = []      # to log index pairs already visited
diff_dic = {}  # to record index pairs and their difference
for i, a in enumerate(data):
    for j, b in enumerate(data):
        if i != j and (i, j)[::-1] not in seen:
            seen.append((i, j))
            diff_dic[(i, j)] = abs(a - b)

keys = []
for ind, diff in diff_dic.items():
    if diff <= 850:
        keys.append(ind)

uniques_k = []  # to record unique indices
for pair in keys:
    for key in pair:
        if key not in uniques_k:
            uniques_k.append(key)

data_arr = np.array(data)
nearest_avg = np.mean(data_arr[uniques_k])
data_arr = np.delete(data_arr, uniques_k)
data_arr = np.append(data_arr, nearest_avg)
data_arr
Output:
array([ 5620.77978516, 34592.15625, 36542.109375, 39478.953125, 48022.2890625, 52535.13671875, 58330.5703125 , 61597.91796875, 62757.12890625, 66930.390625 , 20566.00205365])
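Note that this approach merges every number that is within 850 of at least one other number into a single average (the 20566.0 at the end of the output above), rather than forming separate clusters per neighbourhood; whether that is acceptable depends on how spread out the data is.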
You just need a conditional list comprehension like this:
import numpy as np
l = [2000, 2200, 5000, 2350]
n = 2000
a = [x for x in l if (n - 250) < x < (n + 250)]
Then you can average with
np.mean(a)
or whatever method you prefer.
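If you then want the average to replace the selected numbers in the original list, a minimal sketch along the same lines (the ±250 window is an assumption about what "similar by n+500" means, and note that it excludes 2350 from the example, so the window may need widening):
import numpy as np
l = [2000, 2200, 5000, 2350]
n = 2000
close = [x for x in l if (n - 250) < x < (n + 250)]
rest = [x for x in l if not ((n - 250) < x < (n + 250))]
new_list = ([int(np.mean(close))] + rest) if close else rest
print(new_list)  # [2100, 5000, 2350] with this window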
I have a function which creates a set of results in a list. This is in a for-loop which changes one of the variables in each iteration. I need to be able to store these lists separately so that I can show the difference in results between each iteration as a graph.
Is there any way to store them separately like that? So far the only solution I've found is to copy the function out multiple times and manually change the variable and the name of the list it stores to, but obviously this is a terrible way of doing it, and I figure there must be a proper way.
Here is the code. The function is messy but works. Ideally I would be able to put this all in another for-loop which changes deceleration_p each iteration and then stores collected_averages as a different list so that I could compare collected_averages for each iteration.
import numpy as np
import random
import matplotlib.pyplot as plt
from statistics import mean

road_length = 500
deceleration_p = 0.1
max_speed = 5
buffer_road = np.zeros(road_length, dtype=int)
buffer_speed = 0
number_of_iterations = 1000
average_speed = 0
average_speed_list = []
collected_averages = []
total_speed = 0

for cars in range(1, road_length):
    empty_road = np.ones(road_length - cars, dtype=int) * -1
    cars_on_road = np.ones(cars, dtype=int)
    road = np.append(empty_road, cars_on_road)
    np.random.shuffle(road)
    for i in range(0, number_of_iterations):
        # acceleration
        for speed in np.nditer(road, op_flags=['readwrite']):
            if -1 < speed < max_speed:
                speed[...] += 1
        # randomisation
        for speed in np.nditer(road, op_flags=['readwrite']):
            if 0 < speed:
                if deceleration_p > random.random():
                    speed += -1
        # slowing down
        for cell in range(0, road_length):
            speed = road[cell]
            for val in range(1, speed + 1):
                new_speed = val
                if (cell + val) > (road_length - 1):
                    val += -road_length
                if road[cell + val] > -1:
                    speed = val - 1
                    road[cell] = new_speed - 1
                    break
        buffer_road = np.ones(road_length, dtype=int) * -1
        for cell in range(0, road_length):
            speed = road[cell]
            buffer_cell = cell + speed
            if buffer_cell > (road_length - 1):
                buffer_cell += -road_length
            if speed > -1:
                total_speed += speed
                buffer_road[buffer_cell] = speed
        road = buffer_road
        average_speed = total_speed / cars
        average_speed_list.append(average_speed)
        average_speed = 0
        total_speed = 0
    steady_state_average = mean(average_speed_list[9:number_of_iterations])
    average_speed_list = []
    collected_averages.append(steady_state_average)
Not to my knowledge. As stated in the comments, you could use a dictionary, but my suggestion is to use a list: for every iteration of the outer loop, append that run's results. Since (from what I understood) your results are already in a list, you would end up with a 2D array. My recommendation would be to use a numpy array, as it is much faster. Hopefully this is helpful.
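For example, here is a minimal sketch of the dictionary variant; the list of deceleration_p values and the plotting at the end are illustrative assumptions, standing in for the existing simulation body:
import matplotlib.pyplot as plt

results = {}  # maps each deceleration_p value to that run's collected_averages

for deceleration_p in [0.1, 0.2, 0.3]:  # hypothetical values to sweep over
    collected_averages = []
    # ... run the existing simulation here, appending to collected_averages ...
    results[deceleration_p] = collected_averages

# later, plot every run on one graph to compare the iterations
for p, averages in results.items():
    plt.plot(averages, label='p = {}'.format(p))
plt.legend()
plt.show()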
I have a numpy array with these values:
[10620.5, 11899., 11879.5, 13017., 11610.5]
import numpy as np
array = np.array([10620.5, 11899, 11879.5, 13017, 11610.5])
I would like to get values that are "close" (in this instance, 11899 and 11879.5) and average them, then replace them with a single instance of the new number, resulting in this:
[10620.5, 11889, 13017, 11610.5]
The term "close" would be configurable; let's say a difference of 50.
The purpose of this is to create Spans on a Bokeh graph, and some lines are just too close together.
I am super new to Python in general (a couple of weeks of intense dev).
I would think that I could arrange the values in order, somehow grab the ones to the left and right, and do some math on them, replacing a match with the average value, but at the moment I just don't have any idea how.
Try something like this; I added a few extra steps, just to show the flow:
The idea is to group the data into adjacent groups, and decide whether to average them based on how spread out they are.
So, as you describe, you can combine your data into sets of 3 numbers, and if the difference between the max and min of a set is less than 50 you average them; otherwise you leave them as they are.
import numpy as np

arr = np.ravel([1, 24, 5.3, 12, 8, 45, 14, 18, 33, 15, 19, 22])
arr.sort()

def reshape_arr(a, n):  # n is the number of consecutive adjacent items you want to compare for averaging
    hold = len(a) % n
    if hold != 0:
        container = a[-hold:]  # numbers that do not fit into the reshaped array are excluded from averaging
        a = a[:-hold].reshape(-1, n)
    else:
        a = a.reshape(-1, n)
        container = None
    return a, container

def get_mean(a, close):  # close = how close adjacent numbers need to be in order to be averaged together
    my_list = []
    for i in range(len(a)):
        if a[i].max() - a[i].min() > close:
            for j in range(len(a[i])):
                my_list.append(a[i][j])
        else:
            my_list.append(a[i].mean())
    return my_list

def final_list(a, c):  # add any elements held in the container to the final list
    if c is not None:
        c = c.tolist()
        for i in range(len(c)):
            a.append(c[i])
    return a
arr, container = reshape_arr(arr,3)
arr = get_mean(arr, 5)
final_list(arr, container)
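With the sample array above, this should yield something like [1.0, 5.3, 8.0, 13.67, 19.67, 24.0, 33.0, 45.0]: the two middle groups of three are within 5 of each other and collapse to their means, while the first and last groups are too spread out and are kept as-is.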
You could use fuzzywuzzy here to gauge the ratio of closeness between two data sets.
See details here: http://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/fuzzing-matching-in-pandas-with-fuzzywuzzy/
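Note that fuzzywuzzy compares strings rather than numbers, so the values would have to be stringified first; a minimal sketch of its basic call (the sample values are just the ones from the question):
from fuzzywuzzy import fuzz
# ratio returns an integer similarity score between 0 and 100
print(fuzz.ratio("11899.0", "11879.5"))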
Taking Gustavo's answer and tweaking it to my needs:
def reshape_arr(a, close):
    flag = True
    while flag:
        array = a.sort_values().unique()
        l = len(array)
        flag = False
        for i in range(l):
            previous_item = next_item = None
            if i > 0:
                previous_item = array[i - 1]
            if i < (l - 1):
                next_item = array[i + 1]
            if previous_item is not None:
                if abs(array[i] - previous_item) < close:
                    average = (array[i] + previous_item) / 2
                    flag = True
                    # find matching values in a, and replace them with the average
                    a.replace(previous_item, value=average, inplace=True)
                    a.replace(array[i], value=average, inplace=True)
            if next_item is not None:
                if abs(next_item - array[i]) < close:
                    flag = True
                    average = (array[i] + next_item) / 2
                    # find matching values in a, and replace them with the average
                    a.replace(array[i], value=average, inplace=True)
                    a.replace(next_item, value=average, inplace=True)
    return a
This will do it if I call it with something like this:
candlesticks['support'] = reshape_arr(supres_df['support'], 150)
where candlesticks is the main DataFrame that I am using and supres_df is another DataFrame that I am massaging before I apply it to the main one.
It works, but it is extremely slow; I am trying to optimize it now.
I added a while loop because, after averaging, the averages themselves can become close enough to be averaged again, so I loop until nothing more needs averaging. This is total newbie work, so if you see something silly, please comment.
Looking for occurrences of a pattern on each row of a matrix, I found that there was no clear solution to do it in Python for a very big matrix with good performance.
I have a matrix similar to
matrix = np.array([[0,1,1,0,1,0],
                   [0,1,1,0,1,0]])
print 'matrix: ', matrix
where I want to check the occurrences of the patterns [0,0], [0,1], [1,0] and [1,1] on each row, considering overlapping. For the example given, where both rows are equal, the result is the same for both rows:
pattern[0,0] = [0,0]
pattern[0,1] = [2,2]
pattern[1,0] = [2,2]
pattern[1,1] = [1,1]
The matrix in this example is quite small, but I am looking for performance as I have a huge matrix. You can test with matrix = numpy.random.randint(2, size=(100000,10)), or bigger, to see the differences.
First I thought of a possible answer converting the rows to strings and looking for occurrences, based on this answer (string count with overlapping occurrences):
def string_occurrences(matrix):
    print '\n===== String count with overlapping ====='
    numRow, numCol = np.shape(matrix)
    Ocur = np.zeros((numRow, 4))
    for i in range(numRow):
        strList = ''.join(map(str, matrix[i, :]))
        Ocur[i, 0] = occurrences(strList, '00')
        Ocur[i, 1] = occurrences(strList, '01')
        Ocur[i, 2] = occurrences(strList, '10')
        Ocur[i, 3] = occurrences(strList, '11')
    return Ocur
using the occurrences function from that answer:
def occurrences(string, sub):
    count = start = 0
    while True:
        start = string.find(sub, start) + 1
        if start > 0:
            count += 1
        else:
            return count
but considering that the real array is huge, this solution is very, very slow, as it uses for loops, strings, and so on.
So, looking for a numpy solution, I used a trick to compare the values with a pattern and roll the matrix on axis=1 to check all the occurrences.
I call it a pseudo rolling window on 2D, as the window is not square and the way of calculation is different. There are two options, where the second (Option 2) is faster because it avoids the extra computation of numpy.roll.
def pseudo_rolling_window_Opt12(matrix):
    print '\n===== pseudo_rolling_window ====='
    numRow, numCol = np.shape(matrix)
    Ocur = np.zeros((numRow, 4))
    index = 0
    for i in np.arange(2):
        for j in np.arange(2):
            #pattern = -9*np.ones(numCol)   # Option 1
            pattern = -9*np.ones(numCol+1)  # Option 2
            pattern[0] = i
            pattern[1] = j
            for idCol in range(numCol-1):
                #Ocur[:,index] += np.sum(np.roll(matrix,-idCol, axis=1) == pattern, axis=1) == 2  # Option 1: 219.398691893 seconds (for my real matrix)
                Ocur[:,index] += np.sum(matrix[:,idCol:] == pattern[:-(idCol+1)], axis=1) == 2    # Option 2: 80.929688930 seconds (for my real matrix)
            index += 1
    return Ocur
Searching for other possibilities, I found the "rolling window" approach, which seemed to be a good answer for performance as it uses numpy functions. Looking at this answer (Rolling window for 1D arrays in Numpy?) and the links in it, I checked the following function. But really, I do not understand the output, as the calculations of the window do not seem to match what I was expecting as a result.
def rolling_window(a, size):
    shape = a.shape[:-1] + (a.shape[-1] - size + 1, size)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
Used as:
a = rolling_window(matrix, 2)
print a == np.array([0,1])
print np.all(rolling_window(matrix, 2) == [0,1], axis=1)
Does someone know what is wrong in this last case? Or is there any possibility with better performance?
You are using the wrong axis of the numpy array. You should change the axis in np.all from 1 to 2.
Using the following code:
a = rolling_window(matrix, 2)
print np.all(rolling_window(matrix, 2) == [0,1], axis=2)
you get:
>>>[[ True False False True False]
[ True False False True False]]
So, in order to get the results you are looking for:
print np.sum(np.all(rolling_window(matrix, 2) == [0,1], axis=2),axis=1)
>>>[2 2]
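Putting it together for all four patterns at once (reusing the same rolling_window function from the question), a sketch:
patterns = [[0, 0], [0, 1], [1, 0], [1, 1]]
windows = rolling_window(matrix, 2)
# one row of counts per matrix row, one column per pattern
counts = np.column_stack([np.sum(np.all(windows == p, axis=2), axis=1) for p in patterns])
print(counts)  # [[0 2 2 1]
               #  [0 2 2 1]]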
So, among the selected values, I want to calculate the median value.
arcpy.env.workspace = r"Database Connections\local.sde"
pLoc = "local.DBO.Parcels"
luLoc = "local.DBO.Land_Use"
luFields = ["MedYrBlt","MedVal","OCCount"]
arcpy.MakeFeatureLayer_management(pLoc,"cities_lyr")
arcpy.SelectLayerByAttribute_management("cities_lyr", "NEW_SELECTION", "YrBlt > 1000")
From the selected cities_lyr, I want to calculate the mean value of the YrBlt field:
with arcpy.da.SearchCursor(luLoc, ["OID@", "SHAPE@", luFields[0], luFields[1], luFields[2]]) as cursor:
    for row in cursor:
        if arcpy.Exists('in_memory/stats'):
            arcpy.Delete_management(r'in_memory/stats')
        arcpy.SelectLayerByLocation_management('cities_lyr', select_features=row[1])
        arcpy.Statistics_analysis('cities_lyr', 'in_memory/stats', 'YrBlt MEAN', 'OBJECTID')
Here comes a question:
I just want to see the mean value, how can I do that?
luFields = ["MedYrBlt","MedVal","OCCount"]
will be used later and are not important for now.
Append values to an empty list and then calculate the mean of that list. For example:
# create a list & cycle through the cursor rows, appending values to the list
yrArray = []
for row in cursor:
    val = row.getValue("yrBlt")
    yrArray.append(val)

# get the sum of all values in the list
x = 0
for i in yrArray:
    x += i

# get the average by dividing the sum above by the length of the list
meanYrBlt = x / len(yrArray)
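As a side note, the summing loop can be replaced with Python's built-in sum():
meanYrBlt = sum(yrArray) / len(yrArray)  # use float(len(yrArray)) under Python 2 to avoid integer truncation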
On another note it may be beneficial to separate these processes out into their own classes. For example:
class arrayAvg:
    def __init__(self, array):
        x = 0
        for i in array:
            x += i  # sum the values
        arrayLength = len(array)
        self.avg = x / arrayLength
        self.count = arrayLength
This way you can reuse the code by calling:
yrBltAvg = arrayAvg(yrArray)
avg = yrBltAvg.avg #returns average
count = yrBltAvg.count #returns count
The second portion is unnecessary, but it lets you take advantage of object-oriented programming, and you can expand on that throughout the program.