I want to develop an algorithm in python that returns an 18x9 matrix randomly populated with numbers from 1 to 90 using certain rules/conditions.
Rules
#1 - Maintain an array of 18x9 between 1 and 90.
#2 - First column contains 1-10, second column contains 11-20, etc.
#3 - Each row must have 5 numbers. Other columns should be set to 0.
#4 - Numbers must be arranged in ascending order from top to bottom in a column
What I have done so far:
import numpy as np
columns = 9
rows = 18
n_per_row = 5
matrix = np.zeros((rows, columns), dtype=int)
# Keep track of available places at each row.
available_places = {k: n_per_row for k in range(rows)}
# Shuffle order in which we fill the columns.
col_order = np.arange(columns)
np.random.shuffle(col_order)
for column in col_order:
    # Indices of available rows.
    indices = [c for c in available_places if available_places[c]]
    # Sample which row to use of the available.
    indices = np.random.choice(indices, size=min(len(indices), 10), replace=False)
    # print(indices)
    # Values for this column.
    col_values = np.random.choice(list(np.arange(1, 10+1)), size=min(len(indices), 10), replace=False) + column*10
    # Fill in ascending order.
    matrix[sorted(indices), column] = sorted(col_values)
    for idx in indices:
        available_places[idx] -= 1
print(matrix)
Result
[[ 0 0 0 31 0 51 0 71 81]
[ 1 11 0 0 0 52 61 72 0]
[ 0 0 21 32 41 0 62 73 0]
[ 0 0 22 33 0 0 0 74 82]
[ 0 12 23 0 42 0 63 0 83]
[ 2 13 24 34 0 53 0 0 0]
[ 3 0 0 0 43 54 64 0 84]
[ 4 0 0 0 44 55 65 0 85]
[ 5 14 0 0 45 0 66 0 86]
[ 6 0 25 35 46 0 0 75 0]
[ 7 15 26 36 0 0 0 0 87]
[ 8 16 0 0 47 0 0 76 88]
[ 0 17 27 0 48 0 0 77 89]
[ 0 18 0 0 49 56 67 78 0]
[ 9 0 28 39 0 57 0 79 0]
[ 0 0 29 0 50 58 68 80 0]
[ 0 19 30 40 0 59 69 0 0]
[10 20 0 0 0 60 70 0 90]]
Expected Result: https://images.meesho.com/images/products/56485141/snksv_512.jpg
Final result according to the 4 rules
5 values per row
10 values per column starting with 1,11,21, etc in ascending order
( Notice these rules are not ok for a bingo as seen in the image )
============ final matrix ===============
--------------------------------
[1, 11, 21, 31, 41, 0, 0, 0, 0]
[2, 12, 0, 32, 42, 0, 61, 0, 0]
[0, 13, 0, 33, 0, 0, 62, 71, 81]
[3, 0, 0, 34, 0, 0, 63, 72, 82]
[0, 0, 22, 0, 0, 51, 64, 73, 83]
[4, 14, 23, 35, 0, 52, 0, 0, 0]
[5, 0, 24, 0, 43, 53, 0, 0, 84]
[6, 15, 0, 36, 44, 54, 0, 0, 0]
[7, 0, 0, 37, 0, 0, 65, 74, 85]
[0, 0, 0, 0, 45, 55, 66, 75, 86]
[8, 16, 25, 0, 0, 0, 67, 76, 0]
[0, 0, 26, 0, 46, 56, 0, 77, 87]
[9, 17, 0, 0, 0, 0, 68, 78, 88]
[10, 18, 0, 0, 0, 57, 0, 79, 89]
[0, 19, 27, 38, 47, 0, 0, 80, 0]
[0, 20, 28, 39, 48, 58, 0, 0, 0]
[0, 0, 29, 0, 49, 59, 69, 0, 90]
[0, 0, 30, 40, 50, 60, 70, 0, 0]
--------------------------------
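To double-check a matrix like the one above against the stated rules, a small checker can help. This is only a sketch: `check_matrix` and its parameter names are made up for illustration. It assumes the matrix is a list of rows, that 0 marks an empty cell, and that column c may only hold values from c*10+1 to c*10+10.

```python
def check_matrix(matrix, per_row=5, per_col=10):
    """Return a list of rule violations; an empty list means the matrix passes."""
    problems = []

    # Rule 3: every row holds exactly `per_row` non-zero values.
    for r, row in enumerate(matrix):
        filled = sum(1 for value in row if value != 0)
        if filled != per_row:
            problems.append(f"row {r}: {filled} values instead of {per_row}")

    for c in range(len(matrix[0])):
        # Non-zero values of this column, read from top to bottom.
        values = [row[c] for row in matrix if row[c] != 0]
        low, high = c * 10 + 1, c * 10 + 10

        if len(values) != per_col:
            problems.append(f"col {c}: {len(values)} values instead of {per_col}")
        # Rule 2: column c only contains numbers from its own block of ten.
        if any(not (low <= v <= high) for v in values):
            problems.append(f"col {c}: value outside {low}-{high}")
        # Rule 4: values ascend from top to bottom.
        if values != sorted(values):
            problems.append(f"col {c}: values are not in ascending order")

    return problems
```

Calling `check_matrix(matrix)` on a finished matrix should return an empty list if every rule holds; otherwise each entry names the row or column that breaks a rule.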
Principles:
Start with a matrix of 0s and 1s, where each 1 is a placeholder for a future value.
Randomly pick 0 or 1 for each cell, while tracking the number of 1s in the current row and column so that the constraints are respected.
Because the random draws may produce too few 1s early on, both constraints cannot always be satisfied on the first try. The program retries automatically and prints each attempt for observation (maximum seen in my tests: 10 loops; mean: <= 3 loops).
Once a satisfactory matrix of 0s and 1s is obtained, replace each 1 with the next value for its column.
A solution:
import random
# #1 - Maintain an array of 18x9 (between 1 and 90)
maxRow = 18
maxCol = 9
# #2 - First column contains 1-10, second column contains 11-20, etc.
# ie first column 1 start from 1 and have 10 entries, column 2 start from 11 and have 10 entries, etc.
origins = [i*10 +1 for i in range(maxCol)] #[1, 11, 21, 31, 41, 51, 61, 71, 81]
maxInCol = [10 for i in range(maxCol)] #[10, 10, 10, 10, 10, 10, 10, 10, 10]
# comfort : display matrix
def showMatrix():
    print('--------------------------------')
    for row in range(len(matrix)):
        print(matrix[row])
    print('--------------------------------')
# comfort : count #values in a col
def countInCol(col):
    count = 0
    for row in range(maxRow):
        count+=matrix[row][col]
    return count
# verify the rules of 5 per row and 10 per cols
def verify():
    ok = True
    showMatrix()
    # count elements in a col
    for col in range(maxCol):
        count = 0
        for row in range(maxRow):
            count+= matrix[row][col]
        if(count!= maxInCol[col]):
            print ('*** wrong # of elements in col {0} : {1} instead of {2}'.format(col, count,maxInCol[col]))
            ok = False
    # count elements in a row
    for row in range(maxRow):
        count = 0
        for col in range(maxCol):
            count+= matrix[row][col]
        if(count!=5):
            print('***** wrong # of elements in row {0} : {1}'.format(row, count))
            ok = False
    if (not ok): print( '********************************************')
    return ok
# -- main ----
# need to iterate in case of no more value to complete a col
tour = 1
maxTour = 100 #security limit
while True:
    # prepare a matrix of rows of cols of 0
    matrix = [[0 for i in range(maxCol)] for i in range(18)]
    # begin to fill some places with 1 instead of 0
    for row in range(maxRow):
        count = 0
        for col in range(maxCol):
            if (count==5): break # line is already full with 5 elt
            # random a 0 or 1
            placeHolder = random.choice([0,1])
            # if the remaining cols of this row needs to be 1 to complete at 5/row
            if (5-count) == (maxCol-col):
                placeHolder = 1 # must complete the row
            else:
                inCol = countInCol(col)
                # 10 places max in col
                if (inCol)==maxInCol[col]: placeHolder = 0 # this col is full
                # constraint : if the remaining rows of this col need to be 1 to complete the expected 10 values
                if(maxRow-row) == (maxInCol[col]-inCol): placeHolder = 1
            matrix[row][col] = placeHolder
            count+= placeHolder
    #-------- some case are not correct . prog loops
    if verify():
        print(' ok after {0} loop(s)'.format(tour))
        break
    # security infinite loop
    if (tour>=maxTour): break
    tour +=1
# now replace the placeholders by successive values per col
print('\n============ final matrix ===============')
for row in range(maxRow):
    for col in range(maxCol):
        if matrix[row][col]==1:
            matrix[row][col] = origins[col]
            origins[col]+=1
showMatrix()
HTH
I have time series data with a column that sums up the seconds that something has been running. All numbers are divisible by 30 s, but the column sometimes skips values (it may jump from 30 to 90). The column can also reset while it is running, setting the count back to 30 s. How would I break up every chunk of runtime?
For example: if the numbers in the column are 30, 60, 120, 150, 30, 60, 90, 30, 60, how would I break the dataframe apart into the full sequences with no resets:
30, 60, 120, 150 in one dataframe, 30, 60, 90 in the next, and 30, 60 in the last? At the end, I need to take the max of each dataframe and add them together (that part I could figure out).
Using #RSale's input:
import pandas as pd
df = pd.DataFrame({'data': [30, 60, 120, 150, 30, 60, 90, 30, 60]})
d = dict(tuple(df.groupby(df['data'].eq(30).cumsum())))
d is a dictionary of three dataframes:
d[1]:
   data
0    30
1    60
2   120
3   150
d[2]:
   data
4    30
5    60
6    90
And d[3]:
   data
7    30
8    60
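Since the question notes that the counter does not always restart at exactly 30, and the end goal is the sum of each run's maximum, here is a hedged variation. It treats any drop in the value as a reset instead of testing for 30. The names `run_id`, `chunks`, `run_max` and `total` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'data': [30, 60, 120, 150, 30, 60, 90, 30, 60]})

# A reset is any row whose value is smaller than the previous one.
# The cumulative sum of those flags gives each run its own number (0, 1, 2, ...).
run_id = df['data'].diff().lt(0).cumsum()

# One sub-frame per run, keyed by run number.
chunks = {key: chunk for key, chunk in df.groupby(run_id)}

# Maximum of each run, then the grand total the question asks for.
run_max = df.groupby(run_id)['data'].max()
total = run_max.sum()

print(run_max.tolist(), total)  # [150, 90, 60] 300
```

If only the total is needed, the `chunks` dictionary can be skipped entirely; the grouped `max()` followed by `sum()` is enough.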
Not very elegant, but it gets the job done.
Loop through an array. Add array to a list when a number is smaller than the one before. Remove the saved array from the list and repeat until no change is detected.
numpy & recursive
import numpy as np
a = np.array([30, 60, 120, 150, 30, 60, 90, 30, 60])
y = []
def split(a,y):
    for count,val in enumerate(a):
        if count == 0:
            pass
        elif val < a[count-1]:
            y.append(a[:count])
            a = a[count:]
            if len(a)> 0 and sorted(a) != list(a):
                split(a,y)
            else:
                y.append(a)
                a = []
            return(y)
    return(y)
y = split(a,y)
print(y)
>>[array([ 30, 60, 120, 150]), array([30, 60, 90]), array([30, 60])]
print([max(lis) for lis in y])
>>[150,90,60]
This does not treat 30 as the starting point; a new chunk starts at whatever (smaller) number follows the reset.
Or using diff to find the change points.
numpy & diff version
import numpy as np
a = np.array([30, 60, 120, 150, 30, 60, 90, 30, 60])
y = []
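def split(a,y):
    # Indices of the last element before each reset (where the next value is smaller).
    a_diff = np.where(np.diff(a)<0)[0]
    # Use > 0 (not > 1) so that an array with a single reset is split as well.
    while len(a_diff)>0:
        y.append(a[:a_diff[0]+1])
        a = a[a_diff[0]+1:]
        # Recompute the reset positions for the remaining tail.
        a_diff = np.where(np.diff(a)<0)[0]
    y.append(a)
    return(y)
y = split(a,y)
print(y)
print([max(lis) for lis in y])
>>[array([ 30, 60, 120, 150]), array([30, 60, 90]), array([30, 60])]
>>[150, 90, 60]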
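The same diff idea can also be written without an explicit loop, by handing the change points to np.split. This is just a compact sketch of the technique above; `cut_points`, `runs` and `maxima` are names picked for the example.

```python
import numpy as np

a = np.array([30, 60, 120, 150, 30, 60, 90, 30, 60])

# np.diff(a) < 0 marks the last element before each reset;
# adding 1 turns that into the index where the next run starts.
cut_points = np.where(np.diff(a) < 0)[0] + 1

# np.split cuts the array at every start index in one call.
runs = np.split(a, cut_points)
maxima = [int(run.max()) for run in runs]

print([run.tolist() for run in runs])  # [[30, 60, 120, 150], [30, 60, 90], [30, 60]]
print(maxima)                          # [150, 90, 60]
```

An array with no reset at all gives an empty `cut_points`, in which case np.split returns the whole array as a single run.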
pandas & DataFrame version
import pandas as pd
df = pd.DataFrame({'data': [30, 60, 120, 150, 30, 60, 90, 30, 60]})
y = []
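def split(df,y):
    a = df['data']
    # Positions of the last element before each reset (negative difference to the next value).
    a_diff = [count for count,val in enumerate(a.diff().iloc[1:]) if val < 0 ]
    # Use > 0 (not > 1) so that a column with a single reset is split as well.
    while len(a_diff)>0:
        # .iloc slices by position, whatever the index labels are.
        y.append(a.iloc[:a_diff[0]+1])
        a = a.iloc[a_diff[0]+1:]
        a_diff = [count for count,val in enumerate(a.diff().iloc[1:]) if val < 0 ]
    y.append(a)
    return(y)
y = split(df,y)
print(y)
print([max(lis) for lis in y])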
I am trying to convert column 'reward levels' to int type, it seems that it is listed as object type.
I have tried
.astype(int)
ValueError: invalid literal for int() with base 10: '25,50,100,250,500,1,000,2,500'
also:
tuple(map(int, df['reward levels'].split(',')))
AttributeError: 'Series' object has no attribute 'split'
final:
pd.to_numeric(df['reward levels'])
ValueError: Unable to parse string "25,50,100,250,500,1,000,2,500" at position 0
https://drive.google.com/file/d/0By26wLpAqHfQaF9Jb19RUFVnNjA/view
link to the data. Thanks in advance I am a novice.
After looking at your data, it seems that reward levels holds either NaN or comma-separated values, each prefixed with a $ sign. So, for each value of reward levels, you can:
Remove all $ signs by replacing them with the empty string ''
Split each value on the comma , to get a list of number strings
Call pd.to_numeric on each row of reward levels
df['reward levels'] = df['reward levels'].str.replace('$', '', regex=False).str.split(',').apply(pd.to_numeric)
OUTPUT:
1 [1, 5, 10, 25, 50]
2 [1, 10, 25, 40, 50, 100, 250, 1, 0, 1, 337, 9, 1]
3 [1, 10, 25, 30, 50, 75, 85, 100, 110, 250, 500...
4 [10, 25, 50, 100, 150, 250]
...
45952 [20, 50, 100]
45953 [1, 5, 10, 25, 50, 50, 75, 100, 200, 250, 500,...
45954 [10, 25, 100, 500]
45955 [15, 16, 19, 29, 29, 39, 75]
45956 [25, 25, 50, 100, 125, 250, 500, 1, 250, 2, 50...
Name: reward levels, Length: 45957, dtype: object
Furthermore, if you wish to have each of the list items on a separate row, you can use explode
df.explode('reward levels')
OUTPUT:
0 25
0 50
0 100
0 250
0 500
...
45956 250
45956 2
45956 500
45956 5
45956 0
Name: reward levels, Length: 416706, dtype: object
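Note that splitting on every comma also cuts amounts such as 1,000 into 1 and 000, which is why stray 1, 0 and 2 entries show up in the output above. If each amount really is prefixed with a $ sign, as described, one possible workaround is to split on the $ instead and only then drop the commas. The sketch below assumes whole-dollar amounts and strings shaped like '$25,$50,$1,000'; `parse_levels` and the sample strings are made up for illustration and are not taken from the linked file.

```python
import pandas as pd

def parse_levels(text):
    """Turn '$25,$50,$1,000' into [25, 50, 1000]; missing values give []."""
    if pd.isna(text):
        return []
    amounts = []
    # Every amount starts right after a '$', so split there first.
    for piece in str(text).split('$'):
        # What is left is e.g. '1,000,': drop separator and thousands commas.
        digits = piece.replace(',', '').strip()
        if digits:
            amounts.append(int(digits))
    return amounts

# Hypothetical sample rows in the assumed format.
sample = pd.Series(['$25,$50,$100,$1,000,$2,500', None, '$15,$16'])
parsed = sample.apply(parse_levels)
print(parsed.tolist())  # [[25, 50, 100, 1000, 2500], [], [15, 16]]
```

On the real column this would be applied as df['reward levels'].apply(parse_levels), provided the format assumption holds; amounts with cents would need float instead of int.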
It depends what you want the output format to be. If you just want to split the strings as comma separated values and cast them as ints, you can use:
data = {'reward_levels': {0: '25,50,100,250,500,1,000,2,500',
1: '25,50,10',
2: '15,16,19,22'}}
df = pd.DataFrame(data)
df.apply(lambda x: [int(j) for j in x.reward_levels.split(",")], axis=1)
but the result may not be exactly what you want:
0 [25, 50, 100, 250, 500, 1, 0, 2, 500]
1 [25, 50, 10]
2 [15, 16, 19, 22]
It is more typical to have a single value for each cell/index. You can either explode into multiple columns, or duplicate as rows; the latter might be preferable as your arrays are of unequal length:
df.reward_levels.str.split(",", expand=True)
output:
    0   1    2     3     4     5     6     7     8
0  25  50  100   250   500     1   000     2   500
1  25  50   10  None  None  None  None  None  None
2  15  16   19    22  None  None  None  None  None
or
df.reward_levels.str.split(",").explode().astype(int)
output:
0 25
0 50
0 100
0 250
0 500
0 1
0 0
0 2
0 500
1 25
1 50
1 10
2 15
2 16
2 19
2 22
x = onefile1['quiz1']
grading = []
for i in x :
    if i == '-':
        grading.append(0)
    elif float(i) < float(50.0):
        grading.append('lessthen50')
    elif i > 50.0 and i < 60.0:
        grading.append('between50to60')
    elif i > 60.0 and i < 70.0:
        grading.append('between60to70')
    elif i > 70.0 and i < 80.0:
        grading.append('between70to80')
    elif i > 80.0:
        grading.append('morethen80')
    else:
        grading.append(0)
onefile1 = file.reset_index()
onefile1['grade'] = grading
It is giving me the following error :
Length of values does not match length of index
You probably have a value equal to 50, 60 or 70 etc. You can use <= instead of < or cut from pandas,
import numpy as np
import pandas as pd
onefile1['quiz1'] = (onefile1['quiz1']
                     .astype(str).str.replace('-', '0')
                     .astype(float))
labels = [
    0, 'lessthen50', 'between50to60',
    'between60to70', 'between70to80', 'morethen80'
]
bins = [-1, 0, 50, 60, 70, 80, np.inf]
onefile1['grade'] = pd.cut(
    onefile1.quiz1, bins=bins,
    labels=labels, include_lowest=True)
Here is an example,
>>> import numpy as np
>>> import pandas as pd
>>> onefile1 = pd.DataFrame({'quiz1': [0, 40, 30, 60, 80, 100, '-']})
>>> onefile1['quiz1'] = (onefile1['quiz1']
...                      .astype(str).str.replace('-', '0')
...                      .astype(float))
>>> labels = [
...     0, 'lessthen50', 'between50to60',
...     'between60to70', 'between70to80', 'morethen80'
... ]
>>> bins = [-1, 0, 50, 60, 70, 80, np.inf]
>>> onefile1['grade'] = pd.cut(
...     onefile1.quiz1, bins=bins,
...     labels=labels, include_lowest=True)
>>> onefile1
   quiz1          grade
0    0.0              0
1   40.0     lessthen50
2   30.0     lessthen50
3   60.0  between50to60
4   80.0  between70to80
5  100.0     morethen80
6    0.0              0
PS: It is a good idea to check the parameters include_lowest and right before use.
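To make the effect of right concrete, here is a small side-by-side sketch. The scores and the simplified bins are made up for the comparison (the special 0 label from above is left out). With the default right=True a score of exactly 60 falls into the 50-60 bucket; with right=False it moves up to 60-70.

```python
import numpy as np
import pandas as pd

scores = pd.Series([40, 50, 60, 80, 100])
bins = [0, 50, 60, 70, 80, np.inf]
labels = ['lessthen50', 'between50to60', 'between60to70',
          'between70to80', 'morethen80']

# Default: bins are closed on the right, e.g. (50, 60].
right_closed = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)
# right=False: bins are closed on the left, e.g. [60, 70).
left_closed = pd.cut(scores, bins=bins, labels=labels, right=False)

print(right_closed.astype(str).tolist())
# ['lessthen50', 'lessthen50', 'between50to60', 'between70to80', 'morethen80']
print(left_closed.astype(str).tolist())
# ['lessthen50', 'between50to60', 'between60to70', 'morethen80', 'morethen80']
```

Which variant is right depends on whether a score of exactly 50, 60, ... should count toward the lower or the upper grade.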
I have a problem where I need to determine where a value lands between other values. This is an awfully long question, but it's a convoluted problem (at least to me).
The simplest presentation of the problem can be seen with the following data:
I have a value of 24.0. I need to determine where that value lands within six 'ranges'. The ranges are: 10, 20, 30, 40, 50, 60. I need to calculate where along the ranges, the value lands. I can see that it lands between 20 and 30. A simple if statement can find that for me.
My if statement for checking if the value is between 20 and 30 would be:
if value >=20 and value <=30:
Pretty simple stuff.
What I'm having trouble with is when I try to rank the output.
As an example, let's say that each range value is given an integer representation. 10 =1, 20=2, 30=3, 40=4, 50=5, 60=6, 70=7. Additionally, lets say that if the value is less than the midpoint between two values, it is assigned the rank output of the lower value. For example, my value of 24 is between 20 and 30 so it should be ranked as a "2".
This in and of itself is fairly straightforward with this example, but using real world data, I have ranges and values like the following:
Value = -13 with Ranges = 5,35,30,25,-25,-30,-35
Value = 50 with Ranges = 5,70,65,60,40,35,30
Value = 6 with Ranges = 1,40,35,30,5,3,0
Another wrinkle - the orders of the ranges matter. In the above, the first range number equates to a ranking of 1, the second to a ranking of 2, etc as I mentioned a few paragraphs above.
The negative numbers in the range values were causing trouble until I decided to use a percentile ranking, which gets rid of the negative values altogether. To do this, I am using an answer from "Map each list value to its corresponding percentile", like this:
y=[stats.percentileofscore(x, a, 'rank') for a in x]
where x is the ranges AND the value I'm checking. Running the value=6 values above through this results in y being:
x = [1, 40, 35, 30, 5, 3, 0, 6]
y=[stats.percentileofscore(x, a, 'rank') for a in x]
Looking at "y", we see it as:
[25.0, 100.0, 87.5, 75.0, 50.0, 37.5, 12.5, 62.5]
What I need to do now is compare that last value (62.5) with the other values to see what the final ranking will be (rankings of 1 through 7) according to the following ranking map:
1=25.0
2=100.0
3=87.5
4=75.0
5=50.0
6=37.5
7=12.5
If the value lies between two of the values, it should be assigned the lower rank. In this example, the 62.5 value would have a final ranking value of 4 because it sits between 75.0 (rank=4) and 50.0 (rank=5).
If I take 'y' and break it out and use those values in multiple if/else statements it works for some but not all (the -13 example does not work correctly).
My question is this:
How can I programmatically analyze any value/range set to find the final ranking without building an enormous if/elif structure? Here are a few sample sets. Rankings are in order of presentation below (first value in Ranges =1 , second = 2, etc etc)
Value = -13 with Ranges = 5, 35, 30, 25, -25, -30, -35 --> Rank = 4
Value = 50 with Ranges = 5, 70, 65, 60, 40, 35, 30 --> Rank = 4
Value = 6 with Ranges = 1, 40, 35, 30, 5, 3,0 --> Rank = 4
Value = 24 with Ranges = 10, 20, 30, 40, 50, 60, 70 --> Rank = 2
Value = 2.26 with Ranges = 0.1, 0.55, 0.65, 0.75, 1.75, 1.85, 1.95 --> Rank = 7
Value = 31 with Ranges = 10, 20, 30, 40, 60, 70, 80 --> Rank = 3
I may be missing something very easy within python to do this...but I've bumped my head on this wall for a few days with no progress.
Any help/pointers are appreciated.
def checker(term):
    return term if term >= 0 else abs(term)+1e10
l1, v1 = [5, 35, 30, 25, -25, -30, -35], -13 # Desired: 4
l2, v2 = [5, 70, 65, 60, 40, 35, 30], 50 # Desired: 4
l3, v3 = [1, 40, 35, 30, 5, 3, 0], 6 # Desired: 4
l4, v4 = [10, 20, 30, 40, 50, 60, 70], 24 # Desired: 2
l5, v5 = [0.1, 0.55, 0.65, 0.75, 1.75, 1.85, 1.95], 2.26 # Desired: 7
l6, v6 = [10, 20, 30, 40, 60, 70, 80], 31 # Desired: 3
Result:
>>> print(*(sorted(l_+[val], key=checker).index(val) for
... l_, val in zip((l1,l2,l3,l4,l5,l6),(v1,v2,v3,v4,v5,v6))), sep='\n')
4
4
4
2
7
3
Taking the first example of -13.
y = [5, 35, 30, 25, -25, -30, -35]
value_to_check = -13
max_rank = len(y) # Default value in case no range found (as per 2.26 value example)
for ii in range(len(y)-1,0,-1):  # range replaces Python 2's xrange
    if (y[ii] <= value_to_check <= y[ii-1]) or (y[ii] >= value_to_check >= y[ii-1]):
        max_rank = ii
        break
>>> max_rank
4
In function form:
def get_rank(y, value_to_check):
    max_rank = len(y) # Default value in case no range found (as per 2.26 value example)
    for ii in range(len(y)-1,0,-1):  # range replaces Python 2's xrange
        if (y[ii] <= value_to_check <= y[ii-1]) or (y[ii] >= value_to_check >= y[ii-1]):
            max_rank = ii
            break
    return max_rank
When you call:
>>> get_rank(y, value_to_check)
4
This correctly finds the answer for all your data:
def get_rank(l,n):
    mindiff = float('inf')
    minindex = -1
    for i in range(len(l) - 1):
        if l[i] <= n <= l[i + 1] or l[i + 1] <= n <= l[i]:
            diff = abs(l[i + 1] - l[i])
            if diff < mindiff:
                mindiff = diff
                minindex = i
    if minindex != -1:
        return minindex + 1
    if n > max(l):
        return len(l)
    return 1
>>> test()
[5, 35, 30, 25, -25, -30, -35] -13 Desired: 4 Actual: 4
[5, 70, 65, 60, 40, 35, 30] 50 Desired: 4 Actual: 4
[1, 40, 35, 30, 5, 3, 0] 6 Desired: 4 Actual: 4
[10, 20, 30, 40, 50, 60, 70] 24 Desired: 2 Actual: 2
[0.1, 0.55, 0.65, 0.75, 1.75, 1.85, 1.95] 2.26 Desired: 7 Actual: 7
[10, 20, 30, 40, 60, 70, 80] 31 Desired: 3 Actual: 3
For completeness, here is my test() function, but you only need get_rank for what you are doing:
>>> def test():
...     lists = [[[5, 35, 30, 25, -25, -30, -35],-13,4],[[5, 70, 65, 60, 40, 35, 30],50,4],[[1, 40, 35, 30, 5, 3,0],6,4],[[10, 20, 30, 40, 50, 60, 70],24,2],[[0.1, 0.55, 0.65, 0.75, 1.75, 1.85, 1.95],2.26,7],[[10, 20, 30, 40, 60, 70, 80],31,3]]
...     for l,n,desired in lists:
...         print(l,n,'Desired:',desired,'Actual:',get_rank(l,n))