I have 14400 values saved in a list, which represent 600 frames of an image that is 4 pixels high by 6 pixels wide.
Here are the values if anyone is interested
# len of blurry values 14400
data = np.array(blurry_values)
#shape of data (4, 6, 600)
shape = ( 4,6,600 )
data= data.reshape(shape)
#print(np.round(np.mean(data, axis=2),2))
[[0.89 0.37 0.45 0.44 0.51 0.52]
[0.5 0.47 0.53 0.48 0.48 0.53]
[0.49 0.5 0.5 0.53 0.48 0.54]
[0.48 0.51 0.45 0.55 0.5 0.49]]
However, when I sanity-check the first average by doing the following:
list1 = blurry_values[::23]
np.round(np.mean(list1),2)
I get 0.51 instead of 0.89
I am trying to get the average value of the pixel across all the frames. Why are these values different?
I don't know exactly why, but:
list1 = blurry_values[:600]
gives 0.89
list1 = blurry_values[600:1200]
gives 0.37
I believe NumPy fills the last dimension first when reshaping (row-major order), so with shape (4, 6, 600) the first 600 values all end up in pixel (0, 0).
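A minimal sketch of that behaviour, using synthetic stand-in data (the real blurry_values are not shown, so the `values` array here is hypothetical). It only illustrates how the default row-major reshape lays the numbers out:

```python
import numpy as np

# Hypothetical stand-in for blurry_values: 14400 distinct numbers.
values = np.arange(4 * 6 * 600, dtype=float)

# Default (C / row-major) order: the LAST axis varies fastest.
data = values.reshape(4, 6, 600)

# Pixel (0, 0) therefore holds the first 600 values, pixel (0, 1) the next 600.
first_pixel_matches = np.array_equal(data[0, 0, :], values[:600])
second_pixel_matches = np.array_equal(data[0, 1, :], values[600:1200])

# A strided slice does not pick out one pixel in this layout.
strided_matches = np.array_equal(data[0, 0, :], values[::24])

print(first_pixel_matches, second_pixel_matches, strided_matches)
```

So with this shape, a contiguous block of 600 values corresponds to one pixel, and a strided slice mixes values from different pixels.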
Let us tackle this with a smaller array:
import numpy as np
np.random.seed(42)
values = np.random.randint(low=0, high=100, size=48)
shape = (2,4,6)
data = values.reshape(shape) # 2 frames of 4 pixels by 6 pixels each
print(data, '\n')
print(np.round(np.mean(data, axis=0),2), '\n') # average values across frames
list1 = values[::24]
print(np.round(np.mean(list1),2)) # average of first pixel across frames
Output:
[[[51 92 14 71 60 20]
[82 86 74 74 87 99]
[23 2 21 52 1 87]
[29 37 1 63 59 20]]
[[32 75 57 21 88 48]
[90 58 41 91 59 79]
[14 61 61 46 61 50]
[54 63 2 50 6 20]]]
[[41.5 83.5 35.5 46. 74. 34. ]
[86. 72. 57.5 82.5 73. 89. ]
[18.5 31.5 41. 49. 31. 68.5]
[41.5 50. 1.5 56.5 32.5 20. ]]
41.5
Since I haven't seen the code that produced blurry_values, I can't be 100% sure, but I'm guessing that you're re-shaping blurry_values wrongly.
In most programming scenarios, I would expect the pixel-height and pixel-width to be represented by the last two axes, and the frame to be represented by an axis preceding these two.
So, I'm guessing that your shape should have been shape = (600, 4, 6) instead of shape = (4, 6, 600).
In that case, you should be doing np.round(np.mean(data, axis=0),2) rather than np.round(np.mean(data, axis=2),2). BTW, that would also produce a shape of (4, 6).
Then, for your sanity check, you should be doing this:
list1 = blurry_values[::24] # Note that it's 24, not 23
np.round(np.mean(list1),2)
Then compare the first value of np.round(np.mean(data, axis=0),2) with the value of np.round(np.mean(list1),2). (I haven't tested this myself, though.)
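Here is a small sketch of that check. It assumes the values are stored frame by frame (which is a guess about how blurry_values was produced) and uses random stand-in data named `values`, not the original list:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for blurry_values, assumed to be stored frame by frame.
values = rng.random(600 * 4 * 6)

frames = values.reshape(600, 4, 6)     # axes: (frame, row, column)
per_pixel_mean = frames.mean(axis=0)   # average over frames, shape (4, 6)

# With this layout every 24th value belongs to the same pixel.
check = values[::24].mean()
agrees = np.isclose(per_pixel_mean[0, 0], check)

print(per_pixel_mean.shape, agrees)
```

If the data were actually stored pixel by pixel, the (4, 6, 600) shape with axis=2 would be the right choice and the check would use contiguous slices such as values[:600] instead.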
I read a .csv file using this command:
df = pd.read_csv('filename.csv', nrows=200)
I set the number of rows to 200. So it will only get the data for 200 rows. (200 rows x 1 column)
data
1 4.33
2 6.98
.
.
200 100.896
I want to plot this data, but I would like to divide the row numbers by 50 (there will still be 200 elements, but the row labels will be divided by 50).
data
0.02 4.33
0.04 6.98
.
.
4 100.896
I'm not sure how I would do that. Is there a way of doing this?
Just divide the index by 50.
Here is an example:
import pandas as pd
import random
data = pd.DataFrame({'col1' : random.sample(range(300), 200)}, index = range(1,201))
data.index = data.index / 50
data
      col1
0.02   196
0.04   198
0.06   278
0.08   209
0.10    36
...    ...
3.92   175
3.94    69
3.96   145
3.98    15
4.00    18
I have a large dataset (around 8 million rows x 25 columns) in Pandas and I am struggling to perform one operation efficiently.
Here is what my dataset looks like:
                   temp   size
location_id hours
135         78     12.0  100.0
            79      NaN    NaN
            80      NaN    NaN
            81     15.0  112.0
            82      NaN    NaN
            83      NaN    NaN
            84     14.0   22.0
I have a multi-index on [location_id, hours]. I have around 60k locations and 140 hours for each location (making up the 8 million rows).
The rest of the data is numeric (float). I have only included 2 columns here, normally there are around 20 columns.
What I want to do is fill those NaN values using the values around them. Basically, the value of hour 79 will be derived from the values of 78 and 81. For this example, the temp value of 79 will be 13.0 (basic linear interpolation).
I always know that only the 78, 81, 84 (multiples of 3) hours will be filled and the rest will have NaN. That will always be the case. This is true for hours between 78-120.
With these in mind, I have implemented the following algorithm in Pandas:
df_relevant_data = df.loc[(df.index.get_level_values(1) >= 78) & (df.index.get_level_values(1) <= 120), :]
for location_id, data_of_location_id in df_relevant_data.groupby("location_id"):
    for hour in range(81, 123, 3):
        top_hour_data = data_of_location_id.loc[(location_id, hour), ['temp', 'size']]  # e.g. 81
        bottom_hour_data = data_of_location_id.loc[(location_id, (hour - 3)), ['temp', 'size']]  # e.g. 78
        difference = top_hour_data.values - bottom_hour_data.values
        bottom_bump = difference * (1/3)  # amount to add to calculate the 79th hour
        top_bump = difference * (2/3)  # amount to add to calculate the 80th hour
        df.loc[(location_id, (hour - 2)), ['temp', 'size']] = bottom_hour_data.values + bottom_bump
        df.loc[(location_id, (hour - 1)), ['temp', 'size']] = bottom_hour_data.values + top_bump
This works really well functionally, however the performance is horrible. It is taking at least 10 minutes on my dataset and that is currently not acceptable.
Is there a better/faster way to implement this? I am actually working only on a slice of the whole data (only hours between 78-120) so I would really expect it to work much faster.
I believe you are looking for interpolate:
print (df.interpolate())
                        temp   size
location_id hours
135         78     12.000000  100.0
            79     13.000000  104.0
            80     14.000000  108.0
            81     15.000000  112.0
            82     14.666667   82.0
            83     14.333333   52.0
            84     14.000000   22.0
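A plain df.interpolate() treats the whole frame as one sequence. If it matters that values are never interpolated across two different locations, one possible variant is to interpolate within each location_id group. The sketch below uses a small made-up frame (the second location, 136, and all its values are invented for illustration) and is not benchmarked on 8 million rows:

```python
import numpy as np
import pandas as pd

nan = np.nan
# Made-up sample: two locations, hours 78..84, values known every 3 hours.
index = pd.MultiIndex.from_product(
    [[135, 136], range(78, 85)], names=["location_id", "hours"]
)
df = pd.DataFrame(
    {
        "temp": [12.0, nan, nan, 15.0, nan, nan, 14.0,
                 20.0, nan, nan, 26.0, nan, nan, 23.0],
        "size": [100.0, nan, nan, 112.0, nan, nan, 22.0,
                 50.0, nan, nan, 80.0, nan, nan, 20.0],
    },
    index=index,
)

# Linear interpolation applied separately inside each location.
filled = df.groupby(level="location_id").transform(lambda g: g.interpolate())
print(filled)
```

When every location starts and ends with a known value, as in the question, the grouped and ungrouped versions should give the same numbers, and the ungrouped call is likely the faster of the two.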
I have a weird question: it concerns slicing arrays to extract small thumbnail cutouts. I do have a solution, but it's a chunky for loop which runs fairly slowly on big images.
The current solution looks something like this:
import numpy as np
image = np.arange(0,10000,1).reshape(100,100) #create an image
cutouts = np.zeros((100,10,10)) #array to hold the thumbnails
l = 0
for i in range(0,10):
    for j in range(0,10): #step a (10,10) box across the image + save results
        cutouts[l,:,:] = image[(i*10):(i+1)*10, (j*10):(j+1)*10]
        l = l+1
print(cutouts[0,:,:])
[[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
[ 100. 101. 102. 103. 104. 105. 106. 107. 108. 109.]
[ 200. 201. 202. 203. 204. 205. 206. 207. 208. 209.]
[ 300. 301. 302. 303. 304. 305. 306. 307. 308. 309.]
[ 400. 401. 402. 403. 404. 405. 406. 407. 408. 409.]
[ 500. 501. 502. 503. 504. 505. 506. 507. 508. 509.]
[ 600. 601. 602. 603. 604. 605. 606. 607. 608. 609.]
[ 700. 701. 702. 703. 704. 705. 706. 707. 708. 709.]
[ 800. 801. 802. 803. 804. 805. 806. 807. 808. 809.]
[ 900. 901. 902. 903. 904. 905. 906. 907. 908. 909.]]
So, like I said, this works. But once I get to very large images (I work in astronomy) with a couple of different colour bands, it gets slow and clunky. In my dream world, I'd be able to do something like:
import numpy as np
image = np.arange(0,10000,1).reshape(100,100) #create an image
cutouts = image.reshape(100,10,10)
BUT, this doesn't create the right thumbnails, because it will read a whole row into the first (10,10) array before moving on to the next one:
print(cutouts[0,:,:])
[[ 0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47 48 49]
[50 51 52 53 54 55 56 57 58 59]
[60 61 62 63 64 65 66 67 68 69]
[70 71 72 73 74 75 76 77 78 79]
[80 81 82 83 84 85 86 87 88 89]
[90 91 92 93 94 95 96 97 98 99]]
So yeah, that's the problem: am I going mad and the for loop is the best way to do it, or is there some clever way I can slice the image array so that it produces the thumbnails I need?
Cheers!
Reshape to 4D, permute axes, reshape again -
H,W = 10,10 # height,width of thumbnail imgs
m,n = image.shape
cutouts = image.reshape(m//H,H,n//W,W).swapaxes(1,2).reshape(-1,H,W)
More info on the intuition behind it.
A more compact version with scikit-image builtin : view_as_blocks -
from skimage.util.shape import view_as_blocks
cutouts = view_as_blocks(image,(H,W)).reshape(-1,H,W)
If you are okay with the intermediate 4D output, it would be a view into the input image and hence virtually free at runtime. Let's verify the view part -
In [51]: np.shares_memory(image, image.reshape(m//H,H,n//W,W))
Out[51]: True
In [52]: np.shares_memory(image, view_as_blocks(image,(H,W)))
Out[52]: True
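As a quick sanity check, the reshape/swapaxes approach can be compared against the loop from the question. This is only a sketch on the 100x100 test image and assumes the image dimensions are exact multiples of the thumbnail size:

```python
import numpy as np

image = np.arange(10000).reshape(100, 100)
H, W = 10, 10            # thumbnail height and width
m, n = image.shape

# Loop version from the question, kept as the reference result.
expected = np.zeros((100, H, W))
l = 0
for i in range(m // H):
    for j in range(n // W):
        expected[l] = image[i * H:(i + 1) * H, j * W:(j + 1) * W]
        l += 1

# Split rows and columns into blocks, bring the two block axes together,
# then flatten them into a single "thumbnail index" axis.
cutouts = image.reshape(m // H, H, n // W, W).swapaxes(1, 2).reshape(-1, H, W)

same = np.array_equal(cutouts, expected)
print(cutouts.shape, same)
```

The final reshape after swapaxes has to copy the data, so only the intermediate 4D array is a view.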
I want to find out how many samples will be taken from each level using the proportional allocation method.
I have 3 levels in total: [Small, Medium, Large].
First, I want to take the sum of workers for each of these 3 levels.
Next, I want to find the probability for each of these 3 levels.
Next, I want to multiply this probability by the number of villages in each level.
And the last step is: the samples are selected as the top villages of each level.
Data :
Village Workers Level
Aagar 10 Small
Dhagewadi 32 Small
Sherewadi 34 Small
Shindwad 42 Small
Dhokari 84 Medium
Khanapur 65 Medium
Ambikanagar 45 Medium
Takali 127 Large
Gardhani 122 Large
Pi.Khand 120 Large
Pangri 105 Large
Let me explain with the Excel screenshot I attached.
In the first step, I want the sum of workers for each level (Small, Medium and Large), e.g. (10+32+34+42) = 118 for the Small level.
In the next step, I want the probability for each level, rounded to 2 decimals,
e.g. (118/786) = 0.15 for the Small level.
Then I multiply the number of villages in each level by its probability to find out how many samples (villages) are taken from that level,
e.g. the Medium level has probability 0.25 and 3 villages, so 0.25*3 = 0.75 samples are taken from the Medium level.
This is rounded up to the next whole number, 0.75 ~ 1, so 1 sample is taken from the Medium level, and it is the top village in that level: in the Medium level, the village "Dhokari" will be selected.
I have done some work:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df=pd.read_csv("/home/desktop/Desktop/t.csv")
df = df.sort_values('Workers', ascending=True)
df['level'] = pd.qcut(df['Workers'], 3, ['Small','Medium','Large'])
df
I use this command to get the sum for each level. I am confused about what to do next:
df=df.groupby(['level'])['Workers'].aggregate(['sum']).unstack()
Is it possible in Python to get the same village names that I get using Excel?
You can use:
transform with sum for same length of column
divide by div with sum and round
another transform with size
last custom function
df['Sum_Level_wise'] = df.groupby('Level')['Workers'].transform('sum')
df['Probability'] = df['Sum_Level_wise'].div(df['Workers'].sum()).round(2)
df['Sample'] = df['Probability'] * df.groupby('Level')['Workers'].transform('size')
df['Selected villages'] = df['Sample'].apply(np.ceil).astype(int)
df['Selected village'] = (df.groupby('Level')
                            .apply(lambda x: x['Village'].head(x['Selected villages'].iat[0]))
                            .reset_index(level=0)['Village'])
df['Selected village'] = df['Selected village'].fillna('')
print (df)
Village Workers Level Sum_Level_wise Probability Sample \
0 Aagar 10 Small 118 0.15 0.60
1 Dhagewadi 32 Small 118 0.15 0.60
2 Sherewadi 34 Small 118 0.15 0.60
3 Shindwad 42 Small 118 0.15 0.60
4 Dhokari 84 Medium 194 0.25 0.75
5 Khanapur 65 Medium 194 0.25 0.75
6 Ambikanagar 45 Medium 194 0.25 0.75
7 Takali 127 Large 474 0.60 2.40
8 Gardhani 122 Large 474 0.60 2.40
9 Pi.Khand 120 Large 474 0.60 2.40
10 Pangri 105 Large 474 0.60 2.40
Selected villages Selected village
0 1 Aagar
1 1
2 1
3 1
4 1 Dhokari
5 1
6 1
7 3 Takali
8 3 Gardhani
9 3 Pi.Khand
10 3
You can try debugging with a custom function:
def f(x):
    a = x['Village'].head(x['Selected villages'].iat[0])
    print (x['Village'])
    print (a)
    if (len(x) < len(a)):
        print ('original village cannot be filled to Selected village, because length is higher')
    return a
df['Selected village'] = df.groupby('Level').apply(f).reset_index(level=0)['Village']
df['Selected village'] = df['Selected village'].fillna('')
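If a per-level summary is easier to read than repeated columns, the same arithmetic can be sketched on a small table with one row per level. This assumes "top villages" means the first rows of each level in file order, which matches the output above; the names `summary`, `rank` and `selected` are made up for this sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Village": ["Aagar", "Dhagewadi", "Sherewadi", "Shindwad", "Dhokari", "Khanapur",
                "Ambikanagar", "Takali", "Gardhani", "Pi.Khand", "Pangri"],
    "Workers": [10, 32, 34, 42, 84, 65, 45, 127, 122, 120, 105],
    "Level": ["Small"] * 4 + ["Medium"] * 3 + ["Large"] * 4,
})

# One row per level: total workers and number of villages.
summary = df.groupby("Level", sort=False)["Workers"].agg(total="sum", villages="size")
# Share of all workers, rounded to 2 decimals as in the question.
summary["probability"] = (summary["total"] / summary["total"].sum()).round(2)
# Samples per level, rounded up to the next whole number.
summary["samples"] = np.ceil(summary["probability"] * summary["villages"]).astype(int)

# Keep the first `samples` villages of each level (assumption: file order = "top").
rank = df.groupby("Level").cumcount()
selected = df[rank < df["Level"].map(summary["samples"])]

print(summary)
print(selected["Village"].tolist())
```

To pick the villages with the most workers instead, the frame could be sorted by Workers in descending order before computing `rank`.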
I'm trying to create a table of cosines using numpy in python. I want to have the angle next to the cosine of the angle, so it looks something like this:
0.0 1.000 5.0 0.996 10.0 0.985 15.0 0.966
20.0 0.940 25.0 0.906 and so on.
I'm trying to do it using a for loop but I'm not sure how to get this to work.
Currently, I have .
Any suggestions?
Let's say you have:
>>> d = np.linspace(0, 360, 10, endpoint=False)
>>> c = np.cos(np.radians(d))
If you don't mind having some brackets and such on the side, then you can simply concatenate column-wise using np.c_, and display:
>>> print(np.c_[d, c])
[[ 0.00000000e+00 1.00000000e+00]
[ 3.60000000e+01 8.09016994e-01]
[ 7.20000000e+01 3.09016994e-01]
[ 1.08000000e+02 -3.09016994e-01]
[ 1.44000000e+02 -8.09016994e-01]
[ 1.80000000e+02 -1.00000000e+00]
[ 2.16000000e+02 -8.09016994e-01]
[ 2.52000000e+02 -3.09016994e-01]
[ 2.88000000e+02 3.09016994e-01]
[ 3.24000000e+02 8.09016994e-01]]
But if you care about removing them, one possibility is to use a simple regex:
>>> import re
>>> print(re.sub(r' *\n *', '\n',
np.array_str(np.c_[d, c]).replace('[', '').replace(']', '').strip()))
0.00000000e+00 1.00000000e+00
3.60000000e+01 8.09016994e-01
7.20000000e+01 3.09016994e-01
1.08000000e+02 -3.09016994e-01
1.44000000e+02 -8.09016994e-01
1.80000000e+02 -1.00000000e+00
2.16000000e+02 -8.09016994e-01
2.52000000e+02 -3.09016994e-01
2.88000000e+02 3.09016994e-01
3.24000000e+02 8.09016994e-01
I'm removing the brackets, and then passing it to the regex to remove the spaces on either side in each line.
np.array_str also lets you set the precision. For more control, you can use np.array2string instead.
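For instance, np.array2string accepts a formatter argument that controls how each float is rendered. A minimal sketch, reusing d and c from above (the 8.3f format is just one arbitrary choice):

```python
import numpy as np

d = np.linspace(0, 360, 10, endpoint=False)
c = np.cos(np.radians(d))

# Render every float with a fixed width of 8 and 3 decimals,
# which avoids the scientific notation seen above.
text = np.array2string(
    np.c_[d, c],
    formatter={"float_kind": lambda x: f"{x:8.3f}"},
)
print(text)
```

The brackets are still part of the result, so the replace/regex step above would still be needed to strip them.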
Side-by-Side Array Comparison using Numpy
A built-in Numpy approach using the column_stack((...)) method.
numpy.column_stack((A, B)) stacks arrays column-wise, which lets you compare two or more matrices/arrays side by side.
Pass the arrays to numpy.column_stack as a single tuple argument, hence the double parentheses. The tuple can hold as many matrices/arrays as you want.
import numpy as np
A = np.random.uniform(size=(10,1))
B = np.random.uniform(size=(10,1))
C = np.random.uniform(size=(10,1))
np.column_stack((A, B, C)) ## <-- Compare Side-by-Side
The result looks like this:
array([[0.40323596, 0.95947336, 0.21354263],
[0.18001121, 0.35467198, 0.47653884],
[0.12756083, 0.24272134, 0.97832504],
[0.95769626, 0.33855075, 0.76510239],
[0.45280595, 0.33575171, 0.74295859],
[0.87895151, 0.43396391, 0.27123183],
[0.17721346, 0.06578044, 0.53619146],
[0.71395251, 0.03525021, 0.01544952],
[0.19048783, 0.16578012, 0.69430883],
[0.08897691, 0.41104408, 0.58484384]])
Numpy column_stack is useful for AI/ML applications when comparing the predicted results with the expected answers. This determines the effectiveness of the Neural Net training. It is a quick way to detect where errors are in the network calculations.
Pandas is a very convenient module for such tasks:
In [174]: import pandas as pd
...:
...: x = pd.DataFrame({'angle': np.linspace(0, 355, 355//5+1),
...: 'cos': np.cos(np.deg2rad(np.linspace(0, 355, 355//5+1)))})
...:
...: pd.options.display.max_rows = 20
...:
...: x
...:
Out[174]:
angle cos
0 0.0 1.000000
1 5.0 0.996195
2 10.0 0.984808
3 15.0 0.965926
4 20.0 0.939693
5 25.0 0.906308
6 30.0 0.866025
7 35.0 0.819152
8 40.0 0.766044
9 45.0 0.707107
.. ... ...
62 310.0 0.642788
63 315.0 0.707107
64 320.0 0.766044
65 325.0 0.819152
66 330.0 0.866025
67 335.0 0.906308
68 340.0 0.939693
69 345.0 0.965926
70 350.0 0.984808
71 355.0 0.996195
[72 rows x 2 columns]
You can use python's zip function to go through the elements of both lists simultaneously.
import numpy as np
degreesVector = np.linspace(0.0, 360.0, 73)  # num must be an integer
cosinesVector = np.cos(np.radians(degreesVector))
for d, c in zip(degreesVector, cosinesVector):
    print(d, c)
And if you want to make a numpy array out of the degrees and cosine values, you can modify the for loop in this way:
table = []
for d, c in zip(degreesVector, cosinesVector):
    table.append([d, c])
table = np.array(table)
And now on one line!
np.array([[d, c] for d, c in zip(degreesVector, cosinesVector)])
You were close - but if you iterate over angles, just generate the cosine for that angle:
In [293]: for angle in range(0,60,10):
...: print('{0:8}{1:8.3f}'.format(angle, np.cos(np.radians(angle))))
...:
0 1.000
10 0.985
20 0.940
30 0.866
40 0.766
50 0.643
To work with arrays, you have lots of options:
In [294]: angles=np.linspace(0,60,7)
In [295]: cosines=np.cos(np.radians(angles))
Iterate over an index:
In [297]: for i in range(angles.shape[0]):
...: print('{0:8}{1:8.3f}'.format(angles[i],cosines[i]))
Use zip to dish out the values 2 by 2:
for a,c in zip(angles, cosines):
    print('{0:8}{1:8.3f}'.format(a,c))
A slight variant on that:
for ac in zip(angles, cosines):
    print('{0:8}{1:8.3f}'.format(*ac))
You could concatenate the arrays together into a 2d array, and display that:
In [302]: np.vstack((angles, cosines)).T
Out[302]:
array([[ 0. , 1. ],
[ 10. , 0.98480775],
[ 20. , 0.93969262],
[ 30. , 0.8660254 ],
[ 40. , 0.76604444],
[ 50. , 0.64278761],
[ 60. , 0.5 ]])
In [318]: print(np.vstack((angles, cosines)).T)
[[ 0. 1. ]
[ 10. 0.98480775]
[ 20. 0.93969262]
[ 30. 0.8660254 ]
[ 40. 0.76604444]
[ 50. 0.64278761]
[ 60. 0.5 ]]
np.column_stack can do that without the transpose.
And you can pass that array to your formatting with:
for ac in np.vstack((angles, cosines)).T:
    print('{0:8}{1:8.3f}'.format(*ac))
or you could write that to a csv style file with savetxt (which just iterates over the 'rows' of the 2d array and writes with fmt):
In [310]: np.savetxt('test.txt', np.vstack((angles, cosines)).T, fmt='%8.1f %8.3f')
In [311]: cat test.txt
0.0 1.000
10.0 0.985
20.0 0.940
30.0 0.866
40.0 0.766
50.0 0.643
60.0 0.500
Unfortunately savetxt requires the old-style % formatting. And trying to write to sys.stdout runs into bytes vs. unicode string issues in Py3.
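With more recent NumPy releases (1.14 or later, as far as I know) savetxt seems to accept text-mode streams as well, so the table can be written to an in-memory text buffer and then printed. A small sketch under that assumption:

```python
import io
import sys

import numpy as np

angles = np.linspace(0, 60, 7)
cosines = np.cos(np.radians(angles))
table = np.column_stack((angles, cosines))

# Write the formatted rows into a text buffer instead of a file on disk.
buffer = io.StringIO()
np.savetxt(buffer, table, fmt="%8.1f %8.3f")

sys.stdout.write(buffer.getvalue())
```

On such versions, passing sys.stdout directly as the first argument should work the same way.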
Just in numpy with some format ideas, to use @MaxU's syntax
a = np.array([[i, np.cos(np.deg2rad(i)), np.sin(np.deg2rad(i))]
for i in range(0,361,30)])
args = ["Angle", "Cos", "Sin"]
frmt = ("{:>8.0f}"+"{:>8.3f}"*2)
print(("{:^8}"*3).format(*args))
for i in a:
    print(frmt.format(*i))
Angle Cos Sin
0 1.000 0.000
30 0.866 0.500
60 0.500 0.866
90 0.000 1.000
120 -0.500 0.866
150 -0.866 0.500
180 -1.000 0.000
210 -0.866 -0.500
240 -0.500 -0.866
270 -0.000 -1.000
300 0.500 -0.866
330 0.866 -0.500
360 1.000 -0.000