Is numpys setdiff1d broken?


To select data for training and validation in my machine learning projects, I usually use NumPy's masking functionality. A typical recurring block of code to select the indices for validation and training data looks like this:
import numpy as np
validation_split = 0.2
all_idx = np.arange(0,100000)
idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)))
idxTrain = np.setdiff1d(all_idx, idxValid)
Now the following should always be true:
len(all_idx) == len(idxValid)+len(idxTrain)
Unfortunately, I found out that this is not always the case. As I increase the number of elements chosen from the all_idx array, the resulting numbers no longer add up. Here is another standalone example, which breaks once the number of randomly chosen validation indices reaches about 1000:
import numpy as np
all_idx = np.arange(0,100000)
idxValid = np.random.choice(all_idx, 1000)
idxTrain = np.setdiff1d(all_idx, idxValid)
print(len(all_idx), len(idxValid), len(idxTrain))
This results in -> 100000, 1000, 99005
I am confused. Please try it yourself; I would be glad to understand this.

idxValid = np.random.choice(all_idx, 10, replace=False)
Careful: you need to indicate that you don't want duplicates in idxValid. To do so, just add replace=False to the np.random.choice call. From the documentation:
replace : boolean, optional
Whether the sample is with or without replacement
By default (replace=True) the same value can be selected more than once.
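To see where the "missing" elements go, here is a minimal sketch (the seed and the variable names are arbitrary choices, not part of the question). It checks that np.setdiff1d removes each distinct validation index once, so the sizes add up only after idx_valid has been deduplicated, or when sampling without replacement.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
all_idx = np.arange(100_000)

# Default behaviour: sampling WITH replacement, so indices may repeat.
idx_valid = rng.choice(all_idx, 1000)
idx_train = np.setdiff1d(all_idx, idx_valid)

# setdiff1d drops each distinct value once, however often it was drawn.
n_unique = len(np.unique(idx_valid))
print(len(idx_valid), n_unique, len(idx_train))
assert n_unique + len(idx_train) == len(all_idx)

# Sampling WITHOUT replacement: no duplicates, so the plain lengths add up.
idx_valid_nr = rng.choice(all_idx, 1000, replace=False)
idx_train_nr = np.setdiff1d(all_idx, idx_valid_nr)
assert len(idx_valid_nr) + len(idx_train_nr) == len(all_idx)
```

So setdiff1d is not broken; the discrepancy equals the number of repeated draws in idx_valid.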

Consider the following example:
all_idx = np.arange(0, 100)
print(all_idx)
>>> [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 97 98 99]
Now if you print out your validation dataset:
idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)))
print(idxValid)
>>> [31 57 55 45 26 25 55 76 33 69 49 90 46 14 18 30 89 73 47 82]
You can see that the result contains duplicates (55 appears twice). np.setdiff1d removes each distinct value only once, so
len(all_idx) == len(idxValid)+len(idxTrain)
does not evaluate to True.
What you need to do is make sure that np.random.choice samples without replacement by passing replace=False:
idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)), replace=False)
Now the results should be as expected:
import numpy as np
validation_split = 0.2
all_idx = np.arange(0, 100)
print(all_idx)
idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)), replace=False)
print(idxValid)
idxTrain = np.setdiff1d(all_idx, idxValid)
print(idxTrain)
print(len(all_idx) == len(idxValid)+len(idxTrain))
and the output is:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 97 98 99]
[12 85 96 64 48 21 55 56 80 42 11 92 54 77 49 36 28 31 70 66]
[ 0 1 2 3 4 5 6 7 8 9 10 13 14 15 16 17 18 19 20 22 23 24 25 26
27 29 30 32 33 34 35 37 38 39 40 41 43 44 45 46 47 50 51 52 53 57 58 59
60 61 62 63 65 67 68 69 71 72 73 74 75 76 78 79 81 82 83 84 86 87 88 89
90 91 93 94 95 97 98 99]
True
Alternatively, consider train_test_split from scikit-learn, which is straightforward:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
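If adding scikit-learn is not an option, a comparable disjoint split can be sketched with NumPy alone by shuffling the indices once and cutting the result in two. This is an alternative technique, not what train_test_split does internally; the seed and names below are assumptions made for illustration.

```python
import numpy as np

validation_split = 0.2
n_samples = 100_000  # same size as in the question

rng = np.random.default_rng(0)         # arbitrary seed
shuffled = rng.permutation(n_samples)  # every index appears exactly once

n_valid = int(validation_split * n_samples)
idx_valid = shuffled[:n_valid]
idx_train = shuffled[n_valid:]

# The two parts are disjoint and together cover all indices.
assert len(idx_valid) + len(idx_train) == n_samples
assert len(np.intersect1d(idx_valid, idx_train)) == 0
```

Because both parts come from one permutation, no call to setdiff1d is needed.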

Related

issubset method different than subSet in superSet - Error in Python3.x

Why doesn't the issubset method of sets in Python 3.x return the same result as subSet in superSet?
Logically it looks correct, but the console gives me an unexpected result.
It works fine with short sets, but with large sets (subSet in superSet) gives wrong results.
def isStrictSuperset(superSet, subSet):
    strictSuperset = False
    # condition1 = subSet.issubset(superSet)  # why is this different from the following condition?
    condition1 = subSet in superSet  # Error! incorrect result line
    condition2 = superSet != subSet
    if condition1 and condition2:
        strictSuperset = True
    return strictSuperset  # return whether it is a strict superset or not

if __name__ == "__main__":
    # lists of strings
    superSet = input().split(' ')
    subSet = input().split(" ")
    # convert the lists of strings to sets of integers
    superSet = set(int(x) for x in superSet)
    subSet = set(int(x) for x in subSet)
    # output
    print(isStrictSuperset(superSet, subSet))
input:
51 28 10 61 99 31 55 7 88 48 18 80 18 36 49 21 36 1 49 53 11 78 46 87 82 28 76 50 89 31 14 81 87 39 3 69 26 18 85 18 23 43 75 5 64 47 34 19 2 54 92 45 79 80 59 16 75 80 55 24 56 74 76 31 22 74 20 93 79 81 12 57 21 79 65 32 57 37 47 84 82 28 72 15 53 50 86 58 83 88 3 44 76 63 32 14 13 38 29 70 38 4 71 15 45 4 94 24 46 6 95 48 15 82 92 62 6 67 38 20 60 78 37 84 32 39 51 88 13 99 6 3 64 37 83 68 18 51 98 37 11 48 63 97 30 90 73 44 63 25 78 12 25 91 36 38 59 12 36 51 58 61 82 91 31 41 36 99 28 50 28 64 22 56 26 39 75 53 8 41 94 86 35 69 48 17 80 32 12 29 2 33 51 79 58 74 91 46 6 54 66 0 75 60 30 95 57 36 70 32 83 1 88 27 57 2 67 28 18 51 61 16 40 79 96 78 27 72 85 45 73 12 89 31 11 24 42 94 22 84 1 67 8 62 80 77 81 58 1 6 63 30 64 37 44 60 11 14 68 28 81 86 30 17 81 14 30 44 64 89 7 94 89 13 59 88 34 42 6 51 10 19 66 91 46 22 41 34 98 4 26 90 84 90 44 90 84 13 36 6 97 21 30 52 46 15 83 89 45 83 33 11 3 18 6 82 17 23 13 91 27 39 76 11 86 12 97 64 51 48 84 35 66 15 48 32 99 11 18 93 11 85 71 63 57 76 1 80 45 19 7 39 80 70 78 3 17 51 14 99 47 83 17 82 23 59 59 41 77 22 7 35 22 98 59 90 80 72 60 67 22 75 3 99 18 81 47 48 18 98 18 37 47 65 98 86 82 5 30 87 25 17 97 60 93 33 99 89 62 98 40 27 70 57 49 93 46 11 38 94 43 75 61 75 55 45 26 9 84 89 40 87 14 61 31 99 53 6 83 55 15 95 46 8 58 73 58 57 9 7 49 21 31 88 31 32 61 30 19 69 78 33 3 0 70 73 40 91 91 96 72 79 0 41 91 51 10 80 50 77 30 38 1 85 56 90 78 36 31 0 82 12 95 28 1 65 72 75 89 54
81 79 97 20 68 23 19 12 53 86 26 36 4 64 10 43 12 75 98 30 12 33 27 1 32 68 64 49 99 10 16 9 7 47 23 29 30 94 57 25 38 15 57 33 79 28 45 98 20 50 34 93 6 14 9 29 56 13 44 67 5 23 32 38 78 20 55 35 25 91 64 10 47 32 97 44 85 65 87 36 91 88 78 6 48 86 67 56 44 18 98 39 10 80 47 65 49 98 63 21
output: False
expected output: True

subset in superset checks whether subset is an element of superset; i.e., it checks ∈, not ⊆.
You can simply use < to check whether a set is a proper subset of another: https://docs.python.org/3/library/stdtypes.html#frozenset.issubset
print({1, 2} in {1,2,3}) # False
print({1, 2} < {1,2,3}) # True
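As a sketch of how the function from the question could look with the comparison operators, the version below relies on the built-in > operator for sets (the snake_case function name is a hypothetical rename). It also shows a case where in does return True, namely when the set really is an element of the other set.

```python
def is_strict_superset(super_set, sub_set):
    # super_set > sub_set is True when every element of sub_set is in
    # super_set and the two sets are not equal.
    return super_set > sub_set

big = {1, 2, 3}
small = {1, 2}

print(small in big)                    # False: {1, 2} is not an element of big
print(small.issubset(big))             # True
print(is_strict_superset(big, small))  # True
print(is_strict_superset(big, big))    # False: equal sets are not strict supersets

# "in" only succeeds when the set itself is stored as an element.
nested = {frozenset(small), 3}
print(small in nested)                 # True
```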

How do you split a time series into separate, even segments?

I want to perform a manual short-time Fourier transform. I have a simple time series in the form of a cosine wave, and I want to split it into a number of evenly spaced segments that overlap. How do I do that?
This is my time series:
import numpy as np
fs = 10e3 # Sampling frequency
N = 1e5 # Number of samples
time = np.arange(N) / fs
x = np.cos(5*time) # Some random audio wave
# x.shape gives (100000,)
How do I split it into, say, 10 evenly spaced segments?

Here's one way to do this.
import numpy as np

def get_windows(n, Mt, olap):
    # Split a signal of length n into windows of Mt samples each, overlapping by olap percent
    ists = []  # window start indices
    ieds = []  # window end indices (exclusive)
    ist = 0
    while True:
        ied = ist + Mt
        if ied > n:
            break
        ists.append(ist)
        ieds.append(ied)
        ist += int(Mt * (1 - olap/100))
    return ists, ieds

n = 100
x = np.arange(n)
ists, ieds = get_windows(n, Mt=20, olap=50)  # windows of length 20 and 50% overlap
for ist, ied in zip(ists, ieds):
    print(x[ist:ied])
result:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
[10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29]
[20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39]
[30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49]
[40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59]
[50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69]
[60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79]
[70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89]
[80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99]
If your data is relatively small and you are comfortable with storing all the windows in RAM, then you can continue as follows:
X = np.array([x[ist:ied] for ist, ied in zip(ists, ieds)])
# X.shape is (nwindows, Mt)
You can then generate a windowing function W (e.g. a Hann window) as a 1D array of shape (Mt,), so that W*X broadcasts and W is applied to each window in X.
Note that the term "window" is used with two meanings here: a segment of the signal, and the tapering function applied to each segment. Sorry for the confusion.
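Putting the pieces together for the signal from the question, a manual short-time Fourier transform could be sketched as follows. The window length Mt = 1000 and the 50% overlap are assumptions chosen for illustration, and the start indices are computed directly with np.arange rather than with get_windows.

```python
import numpy as np

fs = 10e3                  # sampling frequency
N = int(1e5)               # number of samples
time = np.arange(N) / fs
x = np.cos(5 * time)

Mt = 1000                  # samples per segment (assumed value)
step = Mt // 2             # 50% overlap between consecutive segments

# Start index of every segment that fits completely inside the signal.
starts = np.arange(0, N - Mt + 1, step)
X = np.array([x[s:s + Mt] for s in starts])  # shape (nwindows, Mt)

W = np.hanning(Mt)                # tapering function, shape (Mt,)
S = np.fft.rfft(W * X, axis=1)    # one spectrum per segment

print(X.shape, S.shape)
```

Each row of S holds the spectrum of one tapered segment; np.abs(S) gives the magnitudes usually plotted as a spectrogram.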

How to write this code in an optimal (pythonic) way?

I have the following code in R and need to write it efficiently in Python using pandas. I wrote a version, but it takes a long time to run.
1) Can someone confirm that the Python code is equivalent to the R code?
2) How can I write it in a Pythonic (optimal) way?
In R:
for (i in 1:dim(df1)[1])
  df1$column1[i] <- sum(df2[i,4:33])
In Python:
for i in range(df1.shape[0]):
    df1['column1'][i] = df2.iloc[i,3:34].sum()

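Either of these replaces the loop. Note that R's 4:33 is 1-based and inclusive, so it corresponds to the positional slice 3:33 in pandas; 3:34 would include one extra column.
df1['column1'] = df2.iloc[:, 3:33].sum(axis=1)
or
df1.loc[:, 'column1'] = df2.iloc[:, 3:33].sum(axis=1)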
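One detail worth checking when translating the R loop is the index range: R's 4:33 selects 30 columns (1-based, both ends included), which is positions 3 through 32 in pandas, i.e. the slice 3:33. The sketch below uses made-up data (5 rows, 40 columns) to compare the vectorized sum with a row-by-row loop.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 5 rows, 40 numeric columns holding 0..199.
df2 = pd.DataFrame(np.arange(200).reshape(5, 40))
df1 = pd.DataFrame(index=df2.index)

# R's df2[i, 4:33] covers 1-based columns 4..33, i.e. 0-based positions 3..32.
df1["column1"] = df2.iloc[:, 3:33].sum(axis=1)

# Row-by-row version, kept only to confirm both give the same result.
loop_sums = [df2.iloc[i, 3:33].sum() for i in range(df2.shape[0])]
assert df1["column1"].tolist() == loop_sums

print(df2.iloc[:, 3:33].shape[1])  # number of columns summed per row
```

The vectorized form computes all row sums in a single call, which is where the speed-up over the explicit loop comes from.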

Use vectorized operations:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randint(0, 100, (10, 15)), columns=list('abcdefghijklmno'))
>>> df
a b c d e f g h i j k l m n o
0 71 93 12 32 17 23 35 57 26 89 4 29 28 83 30
1 98 78 75 0 61 81 8 17 93 71 48 47 72 52 11
2 13 62 93 48 31 23 42 66 77 99 59 1 40 72 87
3 7 5 5 43 83 19 59 36 18 96 50 60 46 45 54
4 32 69 93 6 7 12 15 49 29 11 37 83 75 97 84
5 52 53 43 61 93 85 91 99 65 62 35 89 55 77 62
6 44 7 41 56 40 11 39 91 87 46 95 48 30 75 16
7 93 15 63 23 14 20 7 33 29 31 41 40 82 0 16
8 46 63 59 59 81 51 34 41 89 68 20 64 95 70 74
9 33 58 49 91 51 46 43 83 37 53 47 32 42 12 59
Then simply:
>>> df['column1'] = df.iloc[:, 3:8].sum(axis=1)
>>> df
a b c d e f g h i j k l m n o column1
0 71 93 12 32 17 23 35 57 26 89 4 29 28 83 30 164
1 98 78 75 0 61 81 8 17 93 71 48 47 72 52 11 167
2 13 62 93 48 31 23 42 66 77 99 59 1 40 72 87 210
3 7 5 5 43 83 19 59 36 18 96 50 60 46 45 54 240
4 32 69 93 6 7 12 15 49 29 11 37 83 75 97 84 89
5 52 53 43 61 93 85 91 99 65 62 35 89 55 77 62 429
6 44 7 41 56 40 11 39 91 87 46 95 48 30 75 16 237
7 93 15 63 23 14 20 7 33 29 31 41 40 82 0 16 97
8 46 63 59 59 81 51 34 41 89 68 20 64 95 70 74 266
9 33 58 49 91 51 46 43 83 37 53 47 32 42 12 59 314
>>>

How to create a pandas DataFrame in which a specific column always has a value greater than another column, using np.random.randint

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
print(df)
I want column 'A' always to have a value greater than column 'B'.

df.A, df.B = df[['A', 'B']].max(axis=1), df[['A', 'B']].min(axis=1)
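A slightly more explicit version of the same idea is sketched below. It computes the row-wise maximum and minimum of the two columns before assigning either of them, then checks the result. Note that ties are possible with random integers, so this guarantees A >= B rather than a strict A > B.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD"))

# Compute both results first so the second does not see the overwritten A.
larger = df[["A", "B"]].max(axis=1)
smaller = df[["A", "B"]].min(axis=1)
df["A"] = larger
df["B"] = smaller

# Columns C and D are left untouched; A is never smaller than B.
assert (df["A"] >= df["B"]).all()
print(df.head())
```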

Try this:
newdf = df.apply(lambda x: x if x[0]>x[1] else [*x[:2][::-1],*x[2:]],axis=1)
print(newdf)
Output:
A B C D
0 85 14 22 85
1 62 54 20 1
2 82 78 48 59
3 81 59 54 39
4 92 12 79 44
5 69 64 8 11
6 49 34 48 69
7 68 28 80 27
8 72 17 2 40
9 26 15 49 62
10 29 2 86 12
11 69 7 32 99
12 39 35 65 32
13 45 36 36 12
14 54 21 29 79
15 91 82 35 80
16 67 16 4 37
17 94 82 93 37
18 64 18 2 15
19 13 11 28 82
20 78 9 93 45
21 72 41 16 33
22 92 71 62 69
23 87 79 71 11
24 31 14 8 24
25 85 27 43 3
26 82 34 14 52
27 41 32 39 48
28 13 12 24 86
29 96 17 14 80
.. .. .. .. ..
70 17 13 20 91
71 26 7 57 96
72 41 0 24 58
73 98 68 90 13
74 88 35 81 56
75 65 43 70 86
76 82 81 44 68
77 97 45 23 66
78 81 45 78 48
79 62 24 43 62
80 43 13 42 49
81 97 28 75 45
82 3 0 54 40
83 57 46 16 38
84 87 46 35 13
85 41 13 78 89
86 62 36 94 23
87 84 35 69 93
88 63 18 39 3
89 45 42 30 6
90 81 8 49 82
91 28 28 11 47
92 97 81 49 92
93 86 24 82 40
94 76 72 30 51
95 93 92 1 69
96 97 76 38 81
97 87 49 26 64
98 98 25 93 55
99 57 2 87 10
[100 rows x 4 columns]

You can apply this to any number of columns. Note that it sorts every row in descending order across all columns, so the values in C and D are rearranged as well.
import numpy as np
import pandas as pd
#np.random.seed(1)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
# Sort the values of each row in descending order, in place.
df.values[:,::-1].sort()
print(df)
It gives the following output:
A B C D
0 72 37 12 9
1 79 75 64 5
2 76 71 16 1
3 50 25 20 6
4 84 28 18 11
5 68 50 29 14
6 96 94 87 87
7 86 13 9 7
8 63 61 57 22
9 81 60 1 0
10 88 47 13 8
11 72 71 30 3
12 70 57 49 21
13 68 43 24 3
14 80 76 52 26
15 82 64 41 15
16 98 87 68 25
17 26 25 22 7
18 67 27 23 9
19 83 57 38 37
20 34 32 10 8

My output format is not correct. Can someone help me?

Given an unsorted list of integers, output the integers in order.
Sample Input
100 63 25 73 1 98 73 56 84 86 57 16 83 8 25 81 56 9 53 98 67 99 12 83 89 80 91 39 86 76 85 74 39 25 90 59 10 94 32 44 3 89 30 27 79 46 96 27 32 18 21 92 69 81 40 40 34 68 78 24 87 42 69 23 41 78 22 6 90 99 89 50 30 20 1 43 3 70 95 33 46 44 9 69 48 33 60 65 16 82 67 61 32 21 79 75 75 13 87 70 33
Sample Output
1 1 3 3 6 8 9 9 10 12 13 16 16 18 20 21 21 22 23 24 25 25 25 27 27 30 30 32 32 32 33 33 33 34 39 39 40 40 41 42 43 44 44 46 46 48 50 53 56 56 57 59 60 61 63 65 67 67 68 69 69 69 70 70 73 73 74 75 75 76 78 78 79 79 80 81 81 82 83 83 84 85 86 86 87 87 89 89 89 90 90 91 92 94 95 96 98 98 99 99
My Output
11 33 6 8 99 10 12 13 1616 18 20 2121 22 23 24 252525 2727 3030 323232 333333 34 3939 4040 41 42 43 4444 4646 48 50 53 5656 57 59 60 61 63 65 6767 68 696969 7070 7373 74 7575 76 7878 7979 80 8181 82 8383 84 85 8686 8787 898989 9090 91 92 94 95 96 9898 9999
My code:
a=int(input())
b=input()
b1=b.split(" ")
arr=list(map(int,b1))
ans=[]
for i in range(0,100,1):
    #print(arr.count(i),end=' ')
    ans.append(arr.count(i))
for i in range(0,len(ans)):
    if(i==0):
        continue
    else:
        print(str(i)*ans[i],end=' ')

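str(i)*ans[i] repeats the digits with nothing between them, which is why 16 16 comes out as 1616. If you have to do it that way, put the separator after each repetition and drop the extra one from end:
print((str(i) + ' ') * ans[i], end='')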

Use sorted with key:
string = "100 63 25 73 1 98 73 56 84 ... "
sorted_string = " ".join(sorted(string.split(), key=int))
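If the counting approach from the question should be kept, one way to avoid spacing problems is to collect the pieces in a list and join them once at the end. The following is a minimal sketch; the function name and the max_value parameter are hypothetical, and it assumes all inputs lie between 0 and max_value.

```python
def counting_sort_line(line, max_value=99):
    # Count how often each value occurs (assumes 0 <= value <= max_value).
    counts = [0] * (max_value + 1)
    for token in line.split():
        counts[int(token)] += 1

    # Emit each value as many times as it occurred, in increasing order.
    parts = []
    for value, count in enumerate(counts):
        parts.extend([str(value)] * count)

    # A single join puts exactly one space between neighbouring numbers.
    return " ".join(parts)

print(counting_sort_line("63 25 73 1 98 73 56 25"))
```

Values with a count of zero contribute nothing to parts, so no stray spaces appear in the output.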
