I got a very wired test for python2.7 numpy array. Please look at this code.
import numpy as np
times = np.arange(5., 85, 0.1)
print times
times = np.array(times * 10, dtype=np.int)
print times
the original times should be [5.0 ~ 84.9]. After multiply 10, it should become [50 ~ 849], but result is like this:
[ 50 51 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
67 68 69 70 71 72 ... ]
There are two 51 between 50 and 52
The problem is, that your third entry isn't exactly 52.0 but 51.999999999999993 (see Is floating point math broken?). Truncating that value therefore results in 51.
The correct way would be to first round the values. (As pointed out in Safest way to convert float to integer in python? all small enough integer numbers can be exactly expressed as a float.) You therefore have to calculate: times = np.array(np.round(times * 10), dtype=np.int)
Related
What I tried was this:
import numpy as np
def test_random(nr_selections, n, prob):
selected = np.random.choice(n, size=nr_selections, replace= False, p = prob)
print(str(nr_selections) + ': ' + str(selected))
n = 100
prob = np.random.choice(100, n)
prob = prob / np.sum(prob) #only for demonstration purpose
for i in np.arange(10, 100, 10):
np.random.seed(123)
test_random(i, n, prob)
The result was:
10: [68 32 25 54 72 45 96 67 49 40]
20: [68 32 25 54 72 45 96 67 49 40 36 74 46 7 21 20 53 65 89 77]
30: [68 32 25 54 72 45 96 67 49 40 36 74 46 7 21 20 53 62 86 60 35 37 8 48
52 47 31 92 95 56]
40: ...
Contrary to my expectation and hope, the 30 numbers selected do not contain all of the 20 numbers. I also tried using numpy.random.default_rng, but only strayed further away from my desired output. I also simplified the original problem somewhat in the above example. Any help would be greatly appreciated. Thank you!
Edit for clarification: I do not want to generate all the sequences in one loop (like in the example above) but rather use the related sequences in different runs of the same program. (Ideally, without storing them somewhere)
I am new to R and Python, so forgive me if this is an elementary question. I have a large data set of genes (columns) by patients (rows), with each value being an RNA expression value (most values falling between 0 and 1). I want to multiply the entire data set by 1000 so that all non-zero values will be >1.
Currently:
Pt GeneA GeneB GeneC
1 0.001 2 0
2 0 0.5 0.002
Would like:
Pt GeneA GeneB GeneC
1 1 2000 0
2 0 500 2
I have tried to do this in both R and Python and am running into issues with both. I have also tried converting my data between data frame and matrix, and it won't work with either. I have searched extensively on this website and find information about how to multiply an entire df/matrix by a vector, or individual columns by a scalar, but not the entire thing. Could someone kindly point me in the right direction? I feel like it can't possibly be this hard :)
Using R:
df <- read.csv("/Users/m/Desktop/data.csv")
df * 100
In Ops.factor(left, right) : ‘*’ not meaningful for factors
mtx <- as.matrix(df)
mtx * 100
Error in mtx * 100 : non-numeric argument to binary operator
Using Python 3.7.6:
df = df * 1000
^ This runs without an error message but the values in the cells are exactly the same, so it didn't actually multiply anything...
df = df.div(.001)
TypeError: unsupported operand type(s) for /: 'str' and 'float'
Any creative ideas or resources to point me in the right direction? Thank you!
What does str(df) give you? At least some of your columns have been converted to factors because they are character strings. Open the csv file in a text editor and make sure the numbers are not surrounded by "" or that missing values have been labeled with a character. Once you have the data read properly it will be simple:
set.seed(42)
dat <- data.frame(matrix(sample.int(100, 100, replace=TRUE), 10, 10))
str(dat)
# 'data.frame': 10 obs. of 10 variables:
# $ X1 : int 49 65 25 74 100 18 49 47 24 71
# $ X2 : int 100 89 37 20 26 3 41 89 27 36
# $ X3 : int 95 5 84 34 92 3 58 97 42 24
# $ X4 : int 30 43 15 22 58 8 36 68 86 18
# $ X5 : int 92 69 4 98 50 99 88 87 49 26
# $ X6 : int 6 6 2 3 21 2 58 10 40 5
# $ X7 : int 33 49 100 73 29 76 84 9 35 93
# $ X8 : int 16 92 69 92 2 82 24 18 69 55
# $ X9 : int 40 21 100 57 100 42 18 91 13 53
# $ X10: int 54 83 32 80 60 29 81 73 85 43
dat1000 <- dat * 1000
Try this option:
df[,c(2:ncol(df)] <- 1000*df[,c(2:ncol(df)]
If you instead wanted a perhaps more generic solution targeting only columns whose name starts with Gene, then use:
df[grep("^Gene", names(df))] <- 1000*df[grep("^Gene", names(df))]
Looking at your target result, you need to multiply all columns except pt. In python:
target_cols = [i for i in df.columns if i!='Pt']
for i in target_cols:
df[i] = df[i].astype(float)
df[i] = df[i]*1000
I am writing a script to encrypt and decrypt an image in python3 using PIL. Here I am converting the image into a numpy array and then multiplying every element of the array with 10. Now I noticed that the default function in PIL fromarray() is converting every element of the array to the mod of 256 if its larger than the 255, so when I am trying to retrieve the original value of the matrix I'm not getting the original one. For example, if the original value is 40 then its 10 times is 400 so the fromarray() is making it as 400 mod 256, which will give 144. Now if I add 256 to 144 I will have 400 and then divided by 10 will give me 40. But if the value is 54 then 10times is 540 and 540 mod 256 is 28. Now to get back the original value I need to add 256 two times which will give me 540. 540 isn't the only number which will give me 28 when I will mod it with 256. So I will never know when to add 256 one time and when two times or more. Is there any way I can make it stop of replacing every element of the matrix with its mod of 256?
from PIL import Image
from numpy import *
from pylab import *
#encryption
img1 = (Image.open('image.jpeg').convert('L'))
img1.show() #displaying the image
img = array(Image.open('image.jpeg').convert('L'))
a,b = img.shape
print(img)
print((a,b))
tup = a,b
for i in range (0, tup[0]):
for j in range (0, tup[1]):
img[i][j]= img[i][j]*10 #converting every element of the original array to its 10times
print(img)
imgOut = Image.fromarray(img)
imgOut.show()
imgOut.save('img.jpeg')
#decryption
img2 = (Image.open('img.jpeg'))
img2.show()
img3 = array(Image.open('img.jpeg'))
print(img3)
a1,b1 = img3.shape
print((a1,b1))
tup1 = a1,b1
for i1 in range (0, tup1[0]):
for j1 in range (0, tup1[1]):
img3[i1][j1]= ((img3[i1][j1])/10) #reverse of encryption
print(img3)
imgOut1 = Image.fromarray(img3)
imgOut1.show()
part of the original matrix before multiplying with 10 :
[41 42 45 ... 47 41 33]
[41 43 45 ... 44 38 30]
[41 42 46 ... 41 36 30]
[43 43 44 ... 56 56 55]
[45 44 45 ... 55 55 54]
[46 46 46 ... 53 54 54]
part of the matrix after multiplying with 10 :
[[154 164 194 ... 214 154 74]
[154 174 194 ... 184 124 44]
[154 164 204 ... 154 104 44]
[174 174 184 ... 48 48 38]
[194 184 194 ... 38 38 28]
[204 204 204 ... 18 28 28]
part of the expected matrix after dividing by 10 :
[41 42 45 ... 47 41 33]
[41 43 45 ... 44 38 30]
[41 42 46 ... 41 36 30]
[43 43 44 ... 56 56 55]
[45 44 45 ... 55 55 54]
[46 46 46 ... 53 54 54]
part of th output the script is providing: [[41 41 45 ... 48 40 33]
[41 43 44 ... 44 37 31]
[41 41 48 ... 41 35 30]
[44 42 43 ... 30 30 29]
[44 42 45 ... 29 29 29]
[45 47 44 ... 28 28 28]]
There are several problems with what you're trying to do here.
PIL images are either 8 bit per channel or 16 bit per channel (to the best of my knowledge). When you load a JPEG, it's loaded as 8 bits per channel, so the underlying data type is an unsigned 8-bit integer, i.e. range 0..255. Operations that would overflow or underflow this range wrap, which looks like the modulus behavior you're seeing.
You could convert the 8-bit PIL image to a floating point numpy array with np.array(img).astype('float32') and then normalize this to 0..1 by dividing with 255.
At this point you have non-quantized floating point numbers you can freely mangle however you wish.
However, then you still need to save the resulting image, at which point you again have a format problem. I believe TIFFs and some HDR image formats support floating point data, but if you want something that is widely readable, you'd likely go for PNG or JPEG.
For an encryption use case, JPEGs are not a good choice, as they're always inherently lossy, and you will, more likely than not, not get the same data back.
PNGs can be 8 or 16 bits per channel, but still, you'd have the problem of having to compress a basically infinite "dynamic range" of pixels (let's say you'd multiplied everything by a thousand!) into 0..255 or 0..65535.
An obvious way to do this is to find the maximum value in the image (np.max(...)), divide everything by it (so now you're back to 0..1), then multiply with the maximum value of the image data format... so with a simple multiplication "cipher" as you'd described, you'd essentially get the same image back.
Another way would be to clip the infinite range at the allowed values, i.e. everything below zero is zero, everything above it is, say, 65535. That'd be a lossy operation though, and you'd have no way of getting the unclipped values back.
First of all, PIL only supports 8-bit per channel images - although Pillow (the PIL fork) supports many more formats including higher bit-depths. The JPEG format is defined as only 8-bit per channel.
Calling Image.open() on a JPEG in PIL will therefore return an 8-bit array, so any operations on individual pixels will be performed as equivalent to uint8_t arithmetic in the backing representation. Since the maximum value in a uint8_t value is 256, all your arithmetic is necessarily modulo 256.
If you want to avoid this, you'll need to convert the representation to a higher bit-depth, such as 16bpp or 32bpp. You can do this with the NumPy code, such as:
img16 = np.array(img, dtype=np.uint16)
# or
img32 = np.array(img, dtype=np.uint32)
That will give you the extended precision that you desire.
However - your code example shows that you are trying to encryption and decrypt the image data. In that case, you do want to use modulo arithmetic! You just need to do some more research on actual encryption algorithms.
As none of the answers have helped me that much and I have solved the problem I would like to give an answer hoping one day it will help someone. Here the keys are (3, 25777) and (16971,25777).
The working code is as follows:
from PIL import Image
import numpy as np
#encryption
img1 = (Image.open('image.jpeg').convert('L'))
img1.show()
img = array((Image.open('image.jpeg').convert('L')))
img16 = np.array(img, dtype=np.uint32)
a,b = img.shape
print('\n\nOriginal image: ')
print(img16)
print((a,b))
tup = a,b
for i in range (0, tup[0]):
for j in range (0, tup[1]):
x = img16[i][j]
x = (pow(x,3)%25777)
img16[i][j] = x
print('\n\nEncrypted image: ')
print(img16)
imgOut = Image.fromarray(img16)
imgOut.show()
#decryption
img3_16 = img16
img3_16 = np.array(img, dtype=np.uint32)
print('\n\nEncrypted image: ')
print(img3_16)
a1,b1 = img3_16.shape
print((a1,b1))
tup1 = a1,b1
for i1 in range (0, tup1[0]):
for j1 in range (0, tup1[1]):
x1 = img3_16[i1][j1]
x1 = (pow(x,16971)%25777)
img3_16[i][j] = x1
print('\n\nDecrypted image: ')
print(img3_16)
imgOut1 = Image.fromarray(img3_16)y
imgOut1.show()
Feel free to point out the faults. Thank you.
I have a random list of series (integers) along with dates in a csv like:
1/1/2019,34 44 57 62 70
12/28/2018,09 10 25 37 38
12/25/2018,02 08 42 43 50
12/21/2018,10 13 61 62 70
12/18/2018,13 22 32 60 69
12/14/2018,05 22 26 43 49
12/11/2018,04 38 39 54 59
12/7/2018,04 10 20 33 57
12/4/2018,28 31 41 42 50
The list goes all the way back to year 1997. What I am trying is to predict the next series (or as closest as possible) based on these data:
The size of the list (2336)
What have I tried?
The approach that I've used so far is (e.g. for 1/1/2019,34 44 57 62 70):
1) Get the occurrence of each number in the list, i.e. the number 34 has occurred 170 times out the total list (2336).
2) Find the percentage of each number that has occurred. i.e.
Perc/Chances(34) = Occurrence/TotalNo.
Chances(34) = 170/2336
Chances(34) = 0.072 ~ 07
One way to get the list would be to just find the 5 numbers from the list with the least Percentages. but that won't be much effective.
On the other hand, Now I have a data which has each number, its percentage and its occurrence. Is there any way I can somehow train a neural network that predicts the next series? or closest.
Hierarchy:
Where comp_data.csv contains data like:
1/1/2019,34 44 57 62 70
12/28/2018,09 10 25 37 38
12/25/2018,02 08 42 43 50
12/21/2018,10 13 61 62 70
12/18/2018,13 22 32 60 69
12/14/2018,05 22 26 43 49
12/11/2018,04 38 39 54 59
12/7/2018,04 10 20 33 57
12/4/2018,28 31 41 42 50
and occurrence.csv contains:
34,170
44,197
57,36
62,38
70,37
09,186
10,210
25,197
37,185
38,206
02,217
08,185
and report.csv contains the number, occurrence and its percentage:
34,3,11
44,1,03
57,5,19
62,5,19
70,5,19
09,1,03
10,5,19
25,2,07
37,3,11
38,2,07
02,1,03
08,2,07
So I have the list of series, its occurrences over a period of time, and the percentages. Is there anyway I can create a NN that expects some INPUTS trains over a data and predicts the OUT (a series in this case)
The Problem:
Which ones would be the Input? As it is a pure random problem. PS. I cannot provide any Input since I need a series without INPUT. Perhaps, a LSTM Network for Regression?
I might be doing something wrong but I can't figure out what it is. I'm trying to reproduce some results from a real state dataset from Baton Rouge, LA. The original code is written in WinBUGS here. There are some minor differences between the dataset used in the link above and the one I'm using right now. However, I think that is not significant. This is the code:
import pymc as pm, pandas as pd, numpy as np
from scipy.spatial.distance import pdist, squareform
from numpy.linalg import inv
# Loading dataset
df = pd.read_table('http://pastebin.com/raw.php?i=41us4HVj', sep=' ')
# Setting priors
beta = pm.Normal('beta', 0.0, 0.1, size=3)
mu = pm.Lambda('mu', lambda b=beta:
b[0]+b[1]*df['LivingArea']/1000.0+b[2]*df['Age'])
tau = pm.Gamma('tau', 0.1, 0.1)
phi = pm.Uniform('phi', 0.1, 10)
# Trying to build a covariate matrix
A = squareform(pdist(np.array(zip(df['Latitude'], df['Longitude']))))
# Using the powered exponential to obtain a precision matrix
precision = pm.Lambda('exp', lambda u=A, tau=tau, phi=phi, kappa=1:
inv((1/tau)*np.exp(-(phi*u)**kappa)))
If I inspect the value of mu, I get this:
mu.value
Out[2]:
0 24.568272
1 2.909063
2 -2.778916
3 28.206696
4 -0.270571
5 -2.865153
6 14.158162
7 31.466438
8 44.681351
9 22.191397
10 -6.412350
11 11.709424
12 25.453254
13 24.366674
14 34.711048
...
55 24.625763
56 21.763089
57 65.108136
58 15.428714
59 20.992329
60 36.384037
61 16.730507
62 23.021763
63 54.887747
64 30.612696
65 52.685840
66 59.612372
67 18.822422
68 18.940658
69 72.678188
Length: 70, dtype: float64
However, after running MvNormal, the value of mu is changed:
w = pm.MvNormal('w', mu, precision)
mu.value
Out[4]:
0 -107.913779
1 -1.243466
2 8.283926
3 26.412651
4 1.806728
5 -1.300734
6 -80.657396
7 71.614343
8 -3.817774
9 -10.283683
10 -3.804962
11 8.639403
12 18.927553
13 -10.004095
14 -37.431770
...
55 88.612179
56 18.011459
57 -7.421157
58 7.974531
59 -3.697444
60 -17.520367
61 36.453531
62 -39.235745
63 -6.701737
64 68.672902
65 -44.040923
66 11.826075
67 -21.995198
68 -15.886362
69 4.653335
Length: 70, dtype: float64
By the way, this only happens to mu. The precision variable remains the same.
Did I make a mistake?
UPDATE:
Already filed an issue on GitHub. After further inspection, the culprit seems to be the pd.Series object that is used in the mu variable. If I convert or remove the Series, mu won't change after calling MvNormal.
Thanks!