How do you split a time series into separate, even segments?

I want to perform a manual short-time Fourier transform (STFT). I have a simple time series in the form of a cosine wave, and I want to split it into a number of evenly spaced segments that overlap. How do I do that?
This is my time series:
import numpy as np
fs = 10e3  # Sampling frequency
N = int(1e5)  # Number of samples
time = np.arange(N) / fs
x = np.cos(5*time)  # A simple cosine wave standing in for an audio signal
# x.shape gives (100000,)
How do I split it into, say, 10 evenly spaced segments?

Here's one way to do this.
import numpy as np
def get_windows(n, Mt, olap):
    # Split a signal of length n into olap% overlapping windows each containing Mt terms
    ists = []
    ieds = []
    ist = 0
    while True:
        ied = ist + Mt
        if ied > n:
            break
        ists.append(ist)
        ieds.append(ied)
        ist += int(Mt * (1 - olap/100))
    return ists, ieds
n = 100
x = np.arange(n)
ists, ieds = get_windows(n, Mt=20, olap=50)  # windows of length 20 and 50% overlap
for ist, ied in zip(ists, ieds):
    print(x[ist:ied])
Result:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
[10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29]
[20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39]
[30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49]
[40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59]
[50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69]
[60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79]
[70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89]
[80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99]
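The question asks for a fixed number of segments rather than a fixed segment length. One way to get there is to solve for the segment length first. The sketch below is only an illustration: the helper name even_segments and its parameters are made up here, and it assumes olap is a percentage below 100, as in get_windows above.

```python
def even_segments(n, k, olap):
    """Return k (start, end) index pairs of equal length for a signal of
    n samples, with olap percent overlap between neighbouring segments."""
    keep = 1 - olap / 100                     # fraction of each segment that is new
    # k segments span seg_len * (1 + (k - 1) * keep) samples in total
    seg_len = int(n / (1 + (k - 1) * keep))
    hop = int(seg_len * keep)                 # distance between segment starts
    return [(i * hop, i * hop + seg_len) for i in range(k)]

bounds = even_segments(n=100_000, k=10, olap=50)
print(len(bounds), bounds[0], bounds[-1])
# 10 (0, 18181) (81810, 99991)
```

Because of the rounding, a few samples at the end of the signal (9 in this case) are left unused.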
If your data is relatively small and you are comfortable with storing all the windows in RAM, then you can continue as follows:
X = np.array([x[ist:ied] for ist, ied in zip(ists, ieds)])
# X.shape is (nwindows, Mt)
With X in this shape, you can generate a window function W (e.g. a Hanning window) as a 1D array of shape (Mt,), and W*X will broadcast so that W is applied to each row (segment) of X.
Note that the term "window" is used with two meanings in this context: a segment of the signal, and the tapering function applied to it. Sorry for the confusion.
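Putting the pieces together, a manual STFT could look roughly like the sketch below. It is not the only way to do it: it swaps the index lists for sliding_window_view (which needs NumPy 1.20 or newer), and the segment length Mt = 1024 and the 50% overlap are assumed values chosen for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

fs = 10e3                      # sampling frequency from the question
N = 100_000                    # number of samples from the question
t = np.arange(N) / fs
x = np.cos(5 * t)

Mt = 1024                      # segment length (assumed value)
hop = Mt // 2                  # 50% overlap (assumed value)

# All length-Mt views of x, then keep every hop-th one: shape (nwindows, Mt)
X = sliding_window_view(x, Mt)[::hop]

W = np.hanning(Mt)                       # tapering function, shape (Mt,)
spectra = np.fft.rfft(X * W, axis=1)     # one spectrum per segment
freqs = np.fft.rfftfreq(Mt, d=1 / fs)    # frequency axis in Hz

print(X.shape, spectra.shape, freqs.shape)
```

sliding_window_view returns a read-only view, so no segment is copied until X * W is evaluated.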

Related

Creating circle from data

I have some data, and I am trying to draw a full circle and a half circle from it. Below is my code so far. The curve should start at zero and end at zero, but what I get is only a so-called half circle. Is there a way to create a half circle and a full circle that start at zero and end at zero, ideally using the data without manipulating it?
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(15)
data = np.random.randint(0, 100, 100)
print(data)
arr = data - np.mean(data)
arr = np.cumsum(np.sort(arr))
plt.plot(arr)
plt.axhline(0, color="#000000", ls="-.", linewidth=0.5)
plt.show()
[72 12 5 0 28 27 71 75 85 47 93 17 31 23 32 62 10 15 68 39 37 19 44 77
60 29 79 15 56 49 1 31 96 85 26 34 75 50 65 53 70 41 34 40 22 63 79 56
28 99 4 7 66 42 96 7 24 60 45 83 49 53 29 76 88 76 33 2 88 42 81 51
62 23 93 98 87 18 90 90 16 77 90 32 70 4 28 84 35 28 69 54 64 73 84 56
46 38 35 14]

You can use Circle (http://matplotlib.org/api/patches_api.html):
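import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
c = plt.Circle((0, 0), radius=1, edgecolor='b', facecolor='none')
ax.add_patch(c)
ax.set_xlim(-1.5, 1.5)  # the default limits (0 to 1) would show only a quarter of the circle
ax.set_ylim(-1.5, 1.5)
ax.set_aspect('equal')  # keep the circle round
plt.show()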

Is numpys setdiff1d broken?

To select data for training and validation in my machine learning projects, I usually use numpy's masking functionality. A typical recurring block of code that selects the indices for validation and training data looks like this:
import numpy as np
validation_split = 0.2
all_idx = np.arange(0,100000)
idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)))
idxTrain = np.setdiff1d(all_idx, idxValid)
Now the following should always be true:
len(all_idx) == len(idxValid)+len(idxTrain)
Unfortunately, I found out that this is not always the case. As I increase the number of elements chosen from the all_idx array, the resulting numbers no longer add up. Here is another standalone example, which breaks as soon as I increase the number of randomly chosen validation indices above 1000:
import numpy as np
all_idx = np.arange(0,100000)
idxValid = np.random.choice(all_idx, 1000)
idxTrain = np.setdiff1d(all_idx, idxValid)
print(len(all_idx), len(idxValid), len(idxTrain))
This results in -> 100000, 1000, 99005
I am confused?! Please try yourself. I would be glad to understand this.

Careful: you need to indicate that you don't want duplicates in idxValid. To do so, just add replace=False to np.random.choice:
idxValid = np.random.choice(all_idx, 10, replace=False)
From the documentation:
replace : boolean, optional
Whether the sample is with or without replacement

Consider the following example:
all_idx = np.arange(0, 100)
print(all_idx)
>>> [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 97 98 99]
Now if you print out your validation dataset:
idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)))
print(idxValid)
>>> [31 57 55 45 26 25 55 76 33 69 49 90 46 14 18 30 89 73 47 82]
You can see that there are duplicates in the result (55 appears twice), and thus
len(all_idx) == len(idxValid)+len(idxTrain)
would not evaluate to True.
What you need to do is make sure that np.random.choice samples without replacement by passing replace=False:
idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)), replace=False)
Now the results should be as expected:
import numpy as np
validation_split = 0.2
all_idx = np.arange(0, 100)
print(all_idx)
idxValid = np.random.choice(all_idx, int(validation_split * len(all_idx)), replace=False)
print(idxValid)
idxTrain = np.setdiff1d(all_idx, idxValid)
print(idxTrain)
print(len(all_idx) == len(idxValid)+len(idxTrain))
and the output is:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 97 98 99]
[12 85 96 64 48 21 55 56 80 42 11 92 54 77 49 36 28 31 70 66]
[ 0 1 2 3 4 5 6 7 8 9 10 13 14 15 16 17 18 19 20 22 23 24 25 26
27 29 30 32 33 34 35 37 38 39 40 41 43 44 45 46 47 50 51 52 53 57 58 59
60 61 62 63 65 67 68 69 71 72 73 74 75 76 78 79 81 82 83 84 86 87 88 89
90 91 93 94 95 97 98 99]
True
Consider using train_test_split from scikit-learn, which is straightforward:
from sklearn.model_selection import train_test_split
idxTrain, idxValid = train_test_split(all_idx, test_size=0.2)
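To see where the "missing" indices go, it can help to count the distinct values that were drawn. The sketch below is an illustration only: the seed is arbitrary, and the shuffle-and-slice split at the end is an alternative to replace=False, not something the answers above prescribe.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, for repeatability only
all_idx = np.arange(100_000)

# Sampling with replacement (the default) can draw the same index twice,
# so setdiff1d removes fewer than 1000 distinct indices.
with_repl = rng.choice(all_idx, 1000)
n_unique = len(np.unique(with_repl))
idx_train = np.setdiff1d(all_idx, with_repl)
print(len(idx_train) == len(all_idx) - n_unique)   # True

# Alternative: shuffle once and slice, which cannot produce duplicates.
perm = rng.permutation(all_idx)
n_valid = int(0.2 * len(all_idx))
idx_valid, idx_train = perm[:n_valid], perm[n_valid:]
print(len(idx_valid) + len(idx_train) == len(all_idx))   # True
```

So setdiff1d is doing its job: it removes each distinct value once, however many times that value was drawn.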

Project Euler #18 in Python- Beginner

By starting at the top of the triangle below and moving to adjacent numbers on the row below, the maximum total from top to bottom is 23.
3
7 4
2 4 6
8 5 9 3
That is, 3 + 7 + 4 + 9 = 23.
Find the maximum total from top to bottom of the triangle below:
75
95 64
17 47 82
18 35 87 10
20 04 82 47 65
19 01 23 75 03 34
88 02 77 73 07 63 67
99 65 04 28 06 16 70 92
41 41 26 56 83 40 80 70 33
41 48 72 33 47 32 37 16 94 29
53 71 44 65 25 43 91 52 97 51 14
70 11 33 28 77 73 17 78 39 68 17 57
91 71 52 38 17 14 91 43 58 50 27 29 48
63 66 04 68 89 53 67 30 73 16 69 87 40 31
04 62 98 27 23 09 70 98 73 93 38 53 60 04 23
NOTE: As there are only 16384 routes, it is possible to solve this problem by trying every route. However, Problem 67, is the same challenge with a triangle containing one-hundred rows; it cannot be solved by brute force, and requires a clever method! ;o)
My code is a bit haywire
a="75, 95 64, 17 47 82, 18 35 87 10, 20 04 82 47 65, 19 01 23 75 03 34, 88 02 77 73 07 63 67, 99 65 04 28 06 16 70 92, 41 41 26 56 83 40 80 70 33, 41 48 72 33 47 32 37 16 94 29, 53 71 44 65 25 43 91 52 97 51 14, 70 11 33 28 77 73 17 78 39 68 17 57, 91 71 52 38 17 14 91 43 58 50 27 29 48, 63 66 04 68 89 53 67 30 73 16 69 87 40 31, 04 62 98 27 23 09 70 98 73 93 38 53 60 04 23"
b=a.split(", ")
d=[]
ans=0
for x in range(len(b)):
    b[x]= b[x].split(" ")
    c= [int(i) for i in b[x]]
    d.append(c)
index= d[0].index(max(d[0]))
print(index)
for y in range(len(d)):
    ans+= d[y][index]
    if y+1==len(d):
        break
    else:
        index= d[y+1].index(max(d[y+1][index], d[y+1][index+1]))
print(ans)
So I'm getting 1063 as the answer whereas the actual answer is 1074. I guess my approach is right but there's some bug I'm still not able to figure out.

Your approach is incorrect. You can't just use a greedy algorithm. Consider the example:
3
7 4
2 4 6
8 5 9 500
Clearly:
3 + 7 + 4 + 9 = 23 < 500 + (other terms here)
Yet your algorithm follows the path that sums to 23.
However if the triangle were just:
3
7 4
The greedy approach works. In other words, we need to reduce the problem to a kind of "3 number" triangle. So, assume the path you follow gets to 6: what choice should be made? Go to 500. (What happens if the path goes to 4? What about 2?)
How can we use these results to make a smaller triangle?
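One way to read the hint is to fold the bottom row into the row above it, over and over, until a single number is left. The sketch below illustrates that idea on the modified triangle and compares it with a greedy walk; the function names are made up for this example.

```python
def greedy_total(tri):
    """Walk down, always stepping to the larger of the two children."""
    col, total = 0, tri[0][0]
    for row in tri[1:]:
        if row[col + 1] > row[col]:
            col += 1
        total += row[col]
    return total

def collapse(upper, lower):
    """Fold the lower row into the upper one: each cell keeps its better child."""
    return [v + max(lower[i], lower[i + 1]) for i, v in enumerate(upper)]

def best_total(tri):
    """Collapse from the bottom up until only the apex remains."""
    row = tri[-1]
    for upper in reversed(tri[:-1]):
        row = collapse(upper, row)
    return row[0]

tri = [[3], [7, 4], [2, 4, 6], [8, 5, 9, 500]]
print(collapse(tri[2], tri[3]))   # [10, 13, 506]
print(greedy_total(tri))          # 23
print(best_total(tri))            # 513
```

After the first collapse the triangle is one row shorter, and the same step applies again.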

It looks like you always choose the larger number (of left and right) in the next line. (This is called a greedy algorithm.) But by choosing the smaller number first, you might be able to choose larger numbers in the subsequent lines. (Indeed, by doing so, 1074 can be achieved.)
The hints in the comments are useful:
A backtrack approach would give the correct result.
A dynamic programming approach would give the correct result, and it's faster than the backtrack, thus it can work for problem 67 as well.
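As a rough sketch of both hints at once: a plain recursive search tries every route (backtracking), and caching its results turns it into a top-down dynamic programming solution. The names below are illustrative, and functools.lru_cache is one choice of cache among several.

```python
from functools import lru_cache

TRIANGLE = """\
75
95 64
17 47 82
18 35 87 10
20 04 82 47 65
19 01 23 75 03 34
88 02 77 73 07 63 67
99 65 04 28 06 16 70 92
41 41 26 56 83 40 80 70 33
41 48 72 33 47 32 37 16 94 29
53 71 44 65 25 43 91 52 97 51 14
70 11 33 28 77 73 17 78 39 68 17 57
91 71 52 38 17 14 91 43 58 50 27 29 48
63 66 04 68 89 53 67 30 73 16 69 87 40 31
04 62 98 27 23 09 70 98 73 93 38 53 60 04 23
"""
rows = [[int(v) for v in line.split()] for line in TRIANGLE.splitlines()]

@lru_cache(maxsize=None)          # remove this line to get plain backtracking
def best_from(r, c):
    """Best total of any path that starts at row r, column c."""
    if r == len(rows) - 1:
        return rows[r][c]
    return rows[r][c] + max(best_from(r + 1, c), best_from(r + 1, c + 1))

print(best_from(0, 0))   # 1074
```

With the cache, each cell is evaluated once, so the work grows with the number of cells rather than the number of routes.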

Small remark on your code.
The maximum sum in this triangle is indeed 1074. Your numbers are correct, just change your for-loop code from
for i,cell in enumerate(line):
    newline.append(int(cell)+max(map(int,topline[max([0,i-1]):min([len(line),i+2])])))
to
for i,cell in enumerate(line):
    newline.append(int(cell)+max(map(int,topline[max([0,i-1]):min([len(line),i+1])])))
(Note the "1" instead of the "2" in the upper bound of the slice.)
Kindly

You can reach each cell only from the (at most) three adjacent cells on the line above, and the most favorable one is the one with the highest running total among them. You don't need to keep track of the others.
Here is an example of the code. It works for the pyramid aligned to the left, as in your question (the original problem uses a centered pyramid, but at least this way I don't completely spoil the problem). The max total for my case is 1116:
d="""
75
95 64
17 47 82
18 35 87 10
20 04 82 47 65
19 01 23 75 03 34
88 02 77 73 07 63 67
99 65 04 28 06 16 70 92
41 41 26 56 83 40 80 70 33
41 48 72 33 47 32 37 16 94 29
53 71 44 65 25 43 91 52 97 51 14
70 11 33 28 77 73 17 78 39 68 17 57
91 71 52 38 17 14 91 43 58 50 27 29 48
63 66 04 68 89 53 67 30 73 16 69 87 40 31
04 62 98 27 23 09 70 98 73 93 38 53 60 04 23
"""
ll=[line.split() for line in d.split('\n')][1:-1]
topline=ll[0]
for line in ll[1:]:
    newline=[]
    for i,cell in enumerate(line):
        newline.append(int(cell)+max(map(int,topline[max([0,i-1]):min([len(line),i+2])])))
    topline=newline
    print(newline)
print("final result is %i"%max(newline))

I thought about this problem all day and night. Here is my solution:
# Maximum path sum I
import time
def e18():
    start = time.time()
    f=open("num_triangle.txt")
    summ=[75]
    for s in f:
        slst=s.split()
        lst=[int(item) for item in slst]
        for i in range(len(lst)):
            if i==0:
                lst[i]+=summ[i]
            elif i==len(lst)-1:
                lst[i]+=summ[i-1]
            elif (lst[i]+summ[i-1])>(lst[i]+summ[i]):
                lst[i]+=summ[i-1]
            else:
                lst[i]+=summ[i]
        summ=lst
    end = time.time() - start
    print("Runtime =", end)
    f.close()
    return max(summ)
print(e18()) #1074
P.S. num_triangle.txt contains the triangle without its first row ('75').

Processing string with numbers causes 'ValueError: invalid literal for int()' error in Python

So basically we are given a text that looks like this:
20
08 02 22 97 38 15 00 40 00 75 04 05 07 78 52 12 50 77 91 08
49 49 99 40 17 81 18 57 60 87 17 40 98 43 69 48 04 56 62 00
81 49 31 73 55 79 14 29 93 71 40 67 53 88 30 03 49 13 36 65
52 70 95 23 04 60 11 42 69 24 68 56 01 32 56 71 37 02 36 91
22 31 16 71 51 67 63 89 41 92 36 54 22 40 40 28 66 33 13 80
24 47 32 60 99 03 45 02 44 75 33 53 78 36 84 20 35 17 12 50
32 98 81 28 64 23 67 10 26 38 40 67 59 54 70 66 18 38 64 70
67 26 20 68 02 62 12 20 95 63 94 39 63 08 40 91 66 49 94 21
24 55 58 05 66 73 99 26 97 17 78 78 96 83 14 88 34 89 63 72
21 36 23 09 75 00 76 44 20 45 35 14 00 61 33 97 34 31 33 95
78 17 53 28 22 75 31 67 15 94 03 80 04 62 16 14 09 53 56 92
16 39 05 42 96 35 31 47 55 58 88 24 00 17 54 24 36 29 85 57
86 56 00 48 35 71 89 07 05 44 44 37 44 60 21 58 51 54 17 58
19 80 81 68 05 94 47 69 28 73 92 13 86 52 17 77 04 89 55 40
04 52 08 83 97 35 99 16 07 97 57 32 16 26 26 79 33 27 98 66
88 36 68 87 57 62 20 72 03 46 33 67 46 55 12 32 63 93 53 69
04 42 16 73 38 25 39 11 24 94 72 18 08 46 29 32 40 62 76 36
20 69 36 41 72 30 23 88 34 62 99 69 82 67 59 85 74 04 36 16
20 73 35 29 78 31 90 01 74 31 49 71 48 86 81 16 23 57 05 54
01 70 54 71 83 51 54 69 16 92 33 48 61 43 52 01 89 19 67 48
It's a 20x20 grid of positive integers, and you have to find the greatest product of four adjacent numbers in the same direction (horizontal, vertical, or diagonal). This is what I have:
def main():
    # open file for reading
    in_file = open ("./Grid.txt", "r")
    dimension = in_file.readline()
    dimensions = int(dimension)
    greatest = 0
    grid = ''
    largest = [0, 0, 0, 0]
    for i in range (0, dimensions):
        grid = grid + in_file.readline()
    grid = grid.strip()
    grid = grid.replace(" ","")
    i = 0
    j = 0
    print(int(grid[i]))
    for i in range (0, dimensions * 2 + (dimensions - 1)):
        for j in range (0, dimensions * 2 + (dimensions - 1) - 3):
            if (int(grid[i])*10 + int(grid[i+1]))*(int(grid[i+2])*10 + int(grid[i+3]))*(int(grid[i+4])*10 + int(grid[i+5]))*(int(grid[i+6])*10 + int(grid[i+7])) > largest[0]:
                largest[0] = (int(grid[i])*10 + int(grid[i+1]))*(int(grid[i+2])*10 + int(grid[i+3]))*(int(grid[i+4])*10 + int(grid[i+5]))*(int(grid[i+6])*10 + int(grid[i+7]))
    print(max(largest))
main()
I know it's super complicated, but basically I'm not sure how to make this set of numbers look like a list of numbers (an array), so I essentially end up having to build the numbers digit by digit. For example, for the number 02, I multiply 0 by 10 and add 2. Anyway, the problem is that I get ValueError: invalid literal for int() with base 10: '\n'. Any help is appreciated!

The problem is this line:
grid = grid + in_file.readline()
Change it to:
grid = grid + in_file.readline().strip() # you must strip each line
You need to strip each line as you read it. Currently you strip only the final string, which removes the trailing newline of the last line but leaves the newline characters between the lines in place (the later replace(" ","") removes only spaces). Eventually, your code tries to convert one of those newline characters into a number and runs into the error.
After the fix, running it produces the following output:
➜ /tmp ./test.py
0
1614
Additional Recommendations
You definitely need to make your code more readable before posting. It was painful to look at and even more painful to debug... I almost left it there.
One possible start could be in the complicated for loop. Consider:
for i in range (0, dimensions * 2 + (dimensions - 1)):
    for j in range (0, dimensions * 2 + (dimensions - 1) - 3):
        tmp = (int(grid[i]) * 10 + int(grid[i+1])) \
            * (int(grid[i+2]) * 10 + int(grid[i+3])) \
            * (int(grid[i+4]) * 10 + int(grid[i+5])) \
            * (int(grid[i+6]) * 10 + int(grid[i+7]))
        if tmp > largest[0]:
            largest[0] = tmp
First, it let me see that the culprit was the int(grid[i+7]) call, whereas before the error pointed at the entire long line and was not informative.
Second, it does not calculate exactly the same thing twice. It uses a temporary variable instead.
Third, you should consider converting your grid variable into an actual grid (e.g. an array of arrays). Currently, it's merely a string, so the name is misleading.
Fourth, while you turn grid into an actual grid, you can use a list comprehension and convert the values into numbers directly, as in this short example:
>>> line = '12 34 5 6 78 08 1234'
>>> [int(v) for v in line.split()]
[12, 34, 5, 6, 78, 8, 1234] # array of integers, not strings
>>>
It saves you the conversions before you get to the other parts and validates the data for you in the process, while the code stays simpler, instead of waiting for your complicated calculations to blow up.
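Following the third and fourth recommendations, a version that parses the text into a list of lists and then scans all four directions might look like the sketch below. The function names and the small 4x4 sample are made up for illustration; the real input would be the contents of Grid.txt, and the grid is assumed to be square.

```python
SAMPLE = """\
4
01 02 03 04
05 06 07 08
09 10 11 12
13 14 15 16
"""

def parse_grid(text):
    """First line is the dimension; the following lines hold the rows."""
    lines = text.strip().splitlines()
    size = int(lines[0])
    return [[int(v) for v in line.split()] for line in lines[1:size + 1]]

def greatest_product(grid, run=4):
    """Largest product of `run` adjacent cells in a row, column or diagonal."""
    n = len(grid)                                   # assumes a square grid
    directions = [(0, 1), (1, 0), (1, 1), (1, -1)]  # right, down, two diagonals
    best = 0
    for r in range(n):
        for c in range(n):
            for dr, dc in directions:
                end_r, end_c = r + (run - 1) * dr, c + (run - 1) * dc
                if not (0 <= end_r < n and 0 <= end_c < n):
                    continue                        # the run would leave the grid
                product = 1
                for k in range(run):
                    product *= grid[r + k * dr][c + k * dc]
                best = max(best, product)
    return best

print(greatest_product(parse_grid(SAMPLE)))   # 43680 (13 * 14 * 15 * 16)
```

For the file from the question, parse_grid(open("./Grid.txt").read()) would replace the sample text.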

Python: I am unable to comprehend the concept of a For Loop, apparently

I've got a list of 400 numbers, and I want to put them in a 20x20 grid using Python.
I've made a "2d array" (not really, because Python doesn't support them; I've had to use a list of lists).
When I try to loop through and assign each subsequent item to the next box in the grid, it fails. I end up assigning the last item in the list to every box.
Here's the code:
numbers = "08 02 22 97 38 15 00 40 00 75 04 05 07 78 52 12 50 77 91 08 49 49 99 40 17 81 18 57 60 87 17 40 98 43 69 48 04 56 62 00 81 49 31 73 55 79 14 29 93 71 40 67 53 88 30 03 49 13 36 65 52 70 95 23 04 60 11 42 69 24 68 56 01 32 56 71 37 02 36 91 22 31 16 71 51 67 63 89 41 92 36 54 22 40 40 28 66 33 13 80 24 47 32 60 99 03 45 02 44 75 33 53 78 36 84 20 35 17 12 50 32 98 81 28 64 23 67 10 26 38 40 67 59 54 70 66 18 38 64 70 67 26 20 68 02 62 12 20 95 63 94 39 63 08 40 91 66 49 94 21 24 55 58 05 66 73 99 26 97 17 78 78 96 83 14 88 34 89 63 72 21 36 23 09 75 00 76 44 20 45 35 14 00 61 33 97 34 31 33 95 78 17 53 28 22 75 31 67 15 94 03 80 04 62 16 14 09 53 56 92 16 39 05 42 96 35 31 47 55 58 88 24 00 17 54 24 36 29 85 57 86 56 00 48 35 71 89 07 05 44 44 37 44 60 21 58 51 54 17 58 19 80 81 68 05 94 47 69 28 73 92 13 86 52 17 77 04 89 55 40 04 52 08 83 97 35 99 16 07 97 57 32 16 26 26 79 33 27 98 66 88 36 68 87 57 62 20 72 03 46 33 67 46 55 12 32 63 93 53 69 04 42 16 73 38 25 39 11 24 94 72 18 08 46 29 32 40 62 76 36 20 69 36 41 72 30 23 88 34 62 99 69 82 67 59 85 74 04 36 16 20 73 35 29 78 31 90 01 74 31 49 71 48 86 81 16 23 57 05 54 01 70 54 71 83 51 54 69 16 92 33 48 61 43 52 01 89 19 67 48"
grid = [[0 for col in range(20)] for row in range(20)]
for x in range(0, 1200, 3):
    y = x + 2
    a = numbers[x:y]
    for i in range(20):
        for j in range(20):
            grid[i][j] = a
print(grid)
I can see why it's going wrong: the two loops that generate the grid coordinates are inside the loop that gets each item from the list, so each time they run, the value they are assigning doesn't change.
Therefore I guess seeing as they don't work in the loop, they need to be out of it.
The problem is that I can't work out where exactly.
Anyone give me a hand?

This is the sort of thing that list comprehensions are good for.
nums = iter(numbers.split())
grid = [[next(nums) for col in range(20)] for row in range(20)]
Alternatively, as a for loop:
grid = [[0]*20 for row in range(20)]
nums = iter(numbers.split())
for i in range(20):
    for j in range(20):
        grid[i][j] = next(nums)
I'm not recommending that you do this, but if you don't want to just split the string and call next on its iterator, you can write a generator that parses the string the way you were doing and then call next on that. I point this out because there are situations where builtins won't do it for you, so you should see the pattern (not a Design Pattern, just a pattern, for the pedantic):
def items(numbers):
    for x in range(0, len(numbers), 3):
        yield numbers[x:x+2]
nums = items(numbers)
for i in range(20):
    for j in range(20):
        grid[i][j] = next(nums)
This lets you step through the two loops in parallel.

Another alternative is to use the grouper idiom:
nums = iter(numbers.split())
grid = list(zip(*[nums]*20))
Note that this makes a list of tuples, not a list of lists.
If you need a list of lists, then (starting from a fresh nums iterator)
grid = list(map(list, zip(*[nums]*20)))

Your for loop confusion stems from having more loops than you need.
Instead, one approach is to start by looping over your grid squares and then figuring out the needed offset into your list of numbers:
for i in range(20):
    for j in range(20):
        offset = i*20*3 + j*3
        grid[i][j] = numbers[offset:offset+2]
In this case, the offset has a few constants. i is the row, so for each row you need to skip over a row's worth of characters (60) in your list. j is the column index; each column is 3 characters wide.
You could, of course, do this operation in reverse. In that case, you would loop over your list of numbers and then figure out to which grid square it belongs. It works very similarly, except you'd be using division and modulo arithmetic instead of multiplication in the offset above.
Hopefully this provides some insight into how to use for loops to work with multiple objects and multiple dimensions.
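The reverse direction mentioned above can be sketched as follows. To keep the example short it uses a small 4x5 stand-in string built on the spot instead of the 400-number string from the question; the names rows, cols and count are illustrative.

```python
rows, cols = 4, 5    # small stand-in for the 20x20 grid
numbers = " ".join(f"{v:02d}" for v in range(rows * cols))   # "00 01 02 ..."

grid = [[0] * cols for _ in range(rows)]
count = (len(numbers) + 1) // 3      # each entry is 2 digits plus a separator
for k in range(count):
    row, col = divmod(k, cols)       # division gives the row, modulo the column
    grid[row][col] = numbers[3 * k:3 * k + 2]

print(grid[1])   # ['05', '06', '07', '08', '09']
```

There is a single loop over the list items here, and the grid position is derived from the item's index rather than the other way round.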
