In the FFT (second) plot, I expect a bigger peak at frequency = 1.0 than at the other frequencies, since the signal is a 1 Hz square wave sampled at 5 Hz.
I am a beginner at this and am possibly missing something silly here.
Here's what I have done:
import numpy as np
from matplotlib import pyplot as plt
from scipy import signal
t500 = np.linspace(0,5,500,endpoint=False)
s1t500 = signal.square(2*np.pi*1.0*t500)
The first plot shows the 1 Hz square wave sampled at 5 Hz for 5 seconds:
t5 = np.linspace(0,5,25,endpoint=False)
t5 = t5 + 1e-14
s1t5 = signal.square(2.0*np.pi*1.0*t5)
plt.ylim(-2,2); plt.plot(t500,s1t500,'k',t5,s1t5,'b',t5,s1t5,'bo'); plt.show()
In the second plot, I expect the magnitude at f = 1 Hz to be larger than at f = 2 Hz. Am I missing something?
y1t5 = np.fft.fft(s1t5)
ff1t5 = np.fft.fftfreq(25,d=0.2)
plt.plot(ff1t5,y1t5); plt.show()
It seems you missed the fact that the Fourier transform produces functions (or sequences of numbers, in the case of the DFT/FFT) in complex space:
>>> np.fft.fft(s1t5)
[ 5. +0.j 0. +0.j 0. +0.j 0. +0.j 0. +0.j
5.-15.38841769j 0. +0.j 0. +0.j 0. +0.j 0. +0.j
5. +3.63271264j 0. +0.j 0. +0.j 0. +0.j 0. +0.j
# and so on
In order to see the amplitude spectrum on your plot, apply np.absolute or abs:
>>> np.absolute(np.fft.fft(s1t5))
[ 5. 0. 0. 0. 0. 16.18033989
0. 0. 0. 0. 6.18033989 0. 0.
0. 0. 6.18033989 0. 0. 0. 0.
16.18033989 0. 0. 0. 0. ]
Otherwise only the real part will be plotted, because matplotlib discards the imaginary part of complex input (with a warning).
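A minimal sketch of the fix, for illustration only. To keep it self-contained, the 25 samples are written out by hand as five repeats of [1, 1, 1, -1, -1], which is what the sampled square wave in the question looks like; the names samples, magnitude, i1 and i2 are mine, not from the question.

```python
import numpy as np

# One period of the sampled square wave, repeated five times
# (25 samples at 5 Hz, i.e. a sample spacing of 0.2 s).
samples = np.tile([1.0, 1.0, 1.0, -1.0, -1.0], 5)

spectrum = np.fft.fft(samples)                # complex-valued DFT
freqs = np.fft.fftfreq(samples.size, d=0.2)   # bin frequencies in Hz
magnitude = np.abs(spectrum)                  # amplitude spectrum

# Locate the bins closest to 1 Hz and 2 Hz and compare them.
i1 = int(np.argmin(np.abs(freqs - 1.0)))
i2 = int(np.argmin(np.abs(freqs - 2.0)))
print(freqs[i1], magnitude[i1])
print(freqs[i2], magnitude[i2])
```

To get the expected plot, pass freqs and magnitude to plt.plot (or plt.stem) instead of the raw complex output.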
I am looking to use TfidfVectorizer and then convert the CSR matrix to an array, but the returned array contains only 0's. I need to understand what's going on.
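from sklearn.feature_extraction.text import TfidfVectorizer
vector = TfidfVectorizer()  # converts text to a TF-IDF matrix
x_feature_train = vector.fit_transform(X_train)  # fit on the training data and transform it
x_test_feature_test = vector.transform(X_test)  # transform the test data with the fitted vocabulary
arr = x_feature_train.toarray()
print(arr[0][0])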
Output
0.0
<class 'scipy.sparse.csr.csr_matrix'> [[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
The output of fit_transform is a sparse matrix, which means that most of its entries are zero.
It is likely that the nonzero values sit in the middle of this large matrix, so when print shows only the head and tail of the array, it looks as if everything is zero.
Another possibility is that something is wrong with the X_train data.
Without additional information we can't tell which, so, as @desertnaut recommended, posting a minimal reproducible example would help.
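One way to check which case applies is to count the nonzero entries instead of reading the printed corners. The sketch below uses a small hand-built CSR matrix as a hypothetical stand-in for the TF-IDF result (the shape and values are made up); with real data you would call the same attributes on x_feature_train.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for a TF-IDF result: 6 documents, 8 terms,
# with the only nonzero weights away from the corners.
rows = np.array([2, 3, 3])
cols = np.array([4, 3, 5])
vals = np.array([0.7, 0.5, 0.5])
x_feature_train = sparse.csr_matrix((vals, (rows, cols)), shape=(6, 8))

arr = x_feature_train.toarray()
print(arr[0][0])                   # a single corner entry says very little
print(x_feature_train.nnz)         # number of stored nonzero entries
print(x_feature_train.nonzero())   # (row indices, column indices) of those entries
print(arr[arr != 0])               # the nonzero values themselves
```

If nnz is 0 for the real matrix, the problem is in X_train (or in the tokenization) rather than in how the array is printed.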
I'm working on an animated bar plot to show how the number frequencies of rolling a six-sided die converge the more you roll the die. I'd like to show the number frequencies after each iteration, and for that I have to get a list of the number frequencies for that iteration in another list. Here's the code so far:
import numpy as np
import numpy.random as rd
rd.seed(23)
n_samples = 3
freqs = np.zeros(6)
frequencies = []
for roll in range(n_samples):
    x = rd.randint(0, 6)
    freqs[x] += 1
    print(freqs)
    frequencies.append(freqs)
print()
for x in frequencies:
    print(x)
Output:
[0. 0. 0. 1. 0. 0.]
[1. 0. 0. 1. 0. 0.]
[1. 1. 0. 1. 0. 0.]
[1. 1. 0. 1. 0. 0.]
[1. 1. 0. 1. 0. 0.]
[1. 1. 0. 1. 0. 0.]
Desired output:
[0. 0. 0. 1. 0. 0.]
[1. 0. 0. 1. 0. 0.]
[1. 1. 0. 1. 0. 0.]
[0. 0. 0. 1. 0. 0.]
[1. 0. 0. 1. 0. 0.]
[1. 1. 0. 1. 0. 0.]
The upper three lists indeed show the number frequencies after each iteration. However, when I try to append the list to the 'frequencies' list, in the end it just shows the final number frequencies each time as can be seen in the lower three lists. This one's got me stumped, and I am rather new to Python. How would one get each list like in the first three lists of the output, in another? Thanks in advance!
You only need to replace frequencies.append(freqs) with frequencies.append(freqs.copy()). This appends a copy of freqs that is independent of the original, so a later change to freqs won't change the copy.
import numpy as np
import numpy.random as rd
rd.seed(23)
n_samples = 3
freqs = np.zeros(6)
frequencies = []
for roll in range(n_samples):
    x = rd.randint(0, 6)
    freqs[x] += 1
    print(freqs)
    frequencies.append(freqs.copy())
    print(frequencies)
print()
for x in frequencies:
    print(x)
Python keeps track of freqs as a single object: append stores a reference to that array rather than a snapshot of its values, so every later change to freqs shows up in each entry of the list.
However, here is a quick and dirty workaround:
import numpy as np
import numpy.random as rd
rd.seed(23)
n_samples = 3
freqs = np.zeros(6)
frequencies = []
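for roll in range(n_samples):
    x = rd.randint(0, 6)
    # Update the running counts first, so they accumulate across rolls
    freqs[x] += 1
    # Then build an independent copy of the current counts
    freqs_copy = []
    for item in freqs:
        freqs_copy.append(item)
    print(freqs_copy)
    frequencies.append(freqs_copy)
print()
for x in frequencies:
    print(x)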
The idea is to make a copy of "freqs" that would be independent of original "freqs". In the code above "freqs_copy" would be unique to each iteration.
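To see the difference between appending the array itself and appending a copy, here is a small sketch. The fixed rolls list is a hypothetical replacement for the random draws so that the result is reproducible, and the names aliased and snapshots are mine.

```python
import numpy as np

rolls = [3, 0, 1]     # hypothetical fixed rolls instead of random draws
freqs = np.zeros(6)
aliased = []          # will hold references to the same array
snapshots = []        # will hold independent copies

for x in rolls:
    freqs[x] += 1
    aliased.append(freqs)            # same object every time
    snapshots.append(freqs.copy())   # new array every time

# Every entry of `aliased` is the very same object as `freqs`.
print(all(entry is freqs for entry in aliased))
# The copies keep the state after each roll.
print([s.tolist() for s in snapshots])
```

The is check confirms that the list never held three different arrays in the first place, which is why all three printed rows were identical.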
I have this Python code to calculate the distances between the coordinates of different points.
IDs,X,Y,Z
0-20,193.722,175.733,0.0998975
0-21,192.895,176.727,0.0998975
7-22,187.065,178.285,0.0998975
0-23,192.296,178.648,0.0998975
7-24,189.421,179.012,0.0998975
8-25,179.755,179.347,0.0998975
8-26,180.436,179.288,0.0998975
7-27,186.453,179.2,0.0998975
8-28,178.899,180.92,0.0998975
The code works perfectly, but the number of coordinates I now have is very large (~50000), so I need to optimise it; otherwise it is impossible to run. Could someone suggest a more memory-efficient way of doing this? Thanks for any suggestions.
#!/usr/bin/env python
import pandas as pd
import scipy.spatial as spsp
df_1 =pd.read_csv('Spots.csv', sep=',')
coords = df_1[['X', 'Y', 'Z']].to_numpy()
distances = spsp.distance_matrix(coords, coords)
df_1['dist'] = distances.tolist()
# CREATES columns d0, d1, d2, d3
dist_cols = df_1['IDs']
df_1[dist_cols] = df_1['dist'].apply(pd.Series)
df_1.to_csv("results_Spots.csv")
There are a couple of ways to save space. The first is to only store the upper triangle of your matrix and make sure that your indices always reflect that. The second is only to store the values that meet your threshold. This can be done collectively by using sparse matrices, which support most of the operations you will likely need, and will only store the elements you need.
To store half the data, preprocess your indices when you access your matrix. So for your matrix, access index [i, j] like this:
def getitem(dist, i, j):
    # Only the upper triangle is stored, so make sure i <= j
    if i > j:
        i, j = j, i
    return dist[i, j]
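If you want the upper triangle without allocating the full square matrix at all, scipy.spatial.distance.pdist returns it as a flat "condensed" vector. The sketch below is one possible way to index into that vector; get_distance is a hypothetical helper of mine, and the four points are the first rows of the sample data.

```python
import numpy as np
from scipy.spatial.distance import pdist

coords = np.array([
    [193.722, 175.733, 0.0998975],
    [192.895, 176.727, 0.0998975],
    [187.065, 178.285, 0.0998975],
    [192.296, 178.648, 0.0998975],
])
n = coords.shape[0]
condensed = pdist(coords)   # length n * (n - 1) // 2, upper triangle only

def get_distance(condensed, n, i, j):
    """Look up d(i, j) in a condensed distance vector (hypothetical helper)."""
    if i == j:
        return 0.0
    if i > j:
        i, j = j, i
    # Offset of row i in the condensed vector, plus the position within that row.
    return condensed[n * i - i * (i + 1) // 2 + (j - i - 1)]

print(get_distance(condensed, n, 1, 0))
```

This halves the storage but is still dense, so for ~50000 points it only helps if half of the full matrix fits in memory.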
scipy.sparse supports a number of sparse matrix formats: BSR, Coordinate, CSR, CSC, Diagonal, DOK, LIL. According to the usage reference, the easiest way to construct a matrix is using DOK or LIL format. I will show the latter for simplicity, although the former may be more efficient. I will leave it up to the reader to benchmark different options once a basic functioning approach has been shown. Remember to convert to CSR or CSC format when doing matrix math.
We will sacrifice speed for spatial efficiency by constructing one row at a time:
import numpy as np
import scipy.sparse

N = coords.shape[0]
threshold = 2
threshold2 = threshold**2  # minor optimization to save on some square roots
distances = scipy.sparse.lil_matrix((N, N))
for i in range(N):
    # Compute square distances from point i to all later points
    d2 = np.sum(np.square((coords[i + 1:, :] - coords[i])), axis=1)
    # Threshold
    mask = np.flatnonzero(d2 <= threshold2)
    # Apply, only compute square root if necessary
    distances[i, mask + i + 1] = np.sqrt(d2[mask])
For your toy example, we find that there are only four elements that actually pass threshold, making the storage very efficient:
>>> distances.nnz
4
>>> distances.toarray()
array([[0. , 1.29304486, 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 1.1008038 , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0.68355102, 0. , 1.79082802],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ]])
Using the result from scipy.spatial.distance_matrix confirms that these numbers are in fact accurate.
If you want to fill the matrix (effectively doubling the storage, which should not be prohibitive), you should probably move away from LIL format before doing so. Simply add the transpose to the original matrix to fill it out.
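A minimal sketch of that fill step, using a tiny hypothetical upper-triangle matrix (the 4 x 4 shape and the two values are made up for illustration):

```python
import numpy as np
import scipy.sparse

# Hypothetical upper-triangle result for 4 points (two pairs within threshold).
upper = scipy.sparse.lil_matrix((4, 4))
upper[0, 1] = 1.5
upper[2, 3] = 0.5

upper_csr = upper.tocsr()        # leave LIL before doing arithmetic
full = upper_csr + upper_csr.T   # mirror the upper triangle into the lower one

print(full.toarray())
```

Adding the transpose is safe here because the diagonal is empty; if the diagonal held values, they would be doubled.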
The approach shown here addresses your storage concerns, but you can improve the efficiency of the entire computation using spatial sorting and other geospatial techniques. For example, you could use scipy.spatial.KDTree or the similar scipy.spatial.cKDTree to arrange your dataset and query neighbors within a specific threshold directly and efficiently.
For example, the following would replace the matrix construction shown here with what is likely a more efficient method:
import scipy.spatial

tree = scipy.spatial.KDTree(coords)
distances = tree.sparse_distance_matrix(tree, threshold)
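Here is the same idea as a self-contained sketch on the nine sample points from the question, with the threshold of 2 used above. It also calls query_pairs, which returns only the index pairs (i, j) with i < j; whether you need the pairs, the distances, or both is an assumption about your use case.

```python
import numpy as np
import scipy.spatial

coords = np.array([
    [193.722, 175.733, 0.0998975],
    [192.895, 176.727, 0.0998975],
    [187.065, 178.285, 0.0998975],
    [192.296, 178.648, 0.0998975],
    [189.421, 179.012, 0.0998975],
    [179.755, 179.347, 0.0998975],
    [180.436, 179.288, 0.0998975],
    [186.453, 179.2, 0.0998975],
    [178.899, 180.92, 0.0998975],
])
threshold = 2.0
tree = scipy.spatial.KDTree(coords)

# Index pairs (i, j), i < j, whose distance is at most the threshold.
pairs = sorted(tree.query_pairs(threshold))
# Sparse matrix holding the distances of all pairs within the threshold.
distances = tree.sparse_distance_matrix(tree, threshold)

for i, j in pairs:
    print(i, j, distances[i, j])
```

The four pairs found this way match the four nonzero entries of the row-by-row construction.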
Your code asks for point-to-point distances in a ~50000 x ~50000 matrix. The result will be very big if you really want to store it: 2.5 billion float64 values take about 20 GB. The matrix is dense, since each point has a positive distance to every other point.
I recommend revisiting your business requirements. Do you really need to calculate all these distances upfront and store them in a file on disk? Sometimes it is better to do the required calculations on the fly; scipy.spatial is fast, perhaps not even much slower than reading a precalculated value.
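As a rough sketch of what "on the fly" could look like, the following computes the distances from a single point only when they are needed. distances_from is a hypothetical helper, and the three points are the first rows of the sample data.

```python
import numpy as np

def distances_from(coords, i):
    """Distances from point i to every point, computed on demand (hypothetical helper)."""
    return np.linalg.norm(coords - coords[i], axis=1)

coords = np.array([
    [193.722, 175.733, 0.0998975],
    [192.895, 176.727, 0.0998975],
    [187.065, 178.285, 0.0998975],
])
row = distances_from(coords, 0)
print(row)
```

For 50000 points, one such row is 50000 float64 values, about 400 KB, instead of the whole matrix.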
EDIT (based on comment):
You can filter the calculated distances by a threshold (here, for illustration, 5.0) and then look up the IDs in the DataFrame:
import numpy as np
import pandas as pd
import scipy.spatial as spsp
df_1 = pd.read_csv('Spots.csv', sep=',')
coords = df_1[['X', 'Y', 'Z']].to_numpy()
distances = spsp.distance_matrix(coords, coords)
# Index pairs (i, j) closer than the threshold; this includes i == j and both (i, j) and (j, i)
adj_5 = np.argwhere(distances < 5.0)
pd.DataFrame(zip(df_1['IDs'][adj_5[:,0]].values,
                 df_1['IDs'][adj_5[:,1]].values),
             columns=['from', 'to'])
I have a list of x, y points like the ones in the picture above.
In code it looks like this:
np.array([[1.3,2.1],[1.5,2.2],[3.1,4.8]])
Now I would like to define a grid for which I can set the start, the number of columns and rows, and the row and column size, and then count the number of points in each cell.
In this example [0,0] has 1 point in it, [1,0] has 1, [2,0] has 3, [0,1] has 4 and so on.
I know it would probably be trivial to do by hand, even without numpy, but I need it to be as fast as possible, since I will have to process a ton of data this way.
What's a good way to do this? Basically, I want to create a 2D histogram of points. And more importantly, how can I do it as fast as possible?
I think numpy.histogram2d is the best option.
import numpy as np
a = np.array([[1.3,2.1],[1.5,2.2],[3.1,4.8]])
H, _, _ = np.histogram2d(a[:, 0], a[:, 1], bins=(range(6), range(6)))
print(H)
# [[0. 0. 0. 0. 0.]
# [0. 0. 2. 0. 0.]
# [0. 0. 0. 0. 0.]
# [0. 0. 0. 0. 1.]
# [0. 0. 0. 0. 0.]]
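To cover the configurable grid from the question (start, number of columns and rows, cell size), you could build the bin edges yourself and pass them to numpy.histogram2d. This is only a sketch; grid_counts and its parameter names are hypothetical, not part of numpy.

```python
import numpy as np

def grid_counts(points, start, n_cols, n_rows, cell_w, cell_h):
    """Count points per grid cell; result[i, j] is column i (x), row j (y)."""
    # Edges of the cells along each axis, starting at `start`.
    x_edges = start[0] + cell_w * np.arange(n_cols + 1)
    y_edges = start[1] + cell_h * np.arange(n_rows + 1)
    counts, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                                  bins=(x_edges, y_edges))
    return counts.astype(int)

points = np.array([[1.3, 2.1], [1.5, 2.2], [3.1, 4.8]])
counts = grid_counts(points, start=(0.0, 0.0), n_cols=3, n_rows=3,
                     cell_w=2.0, cell_h=2.0)
print(counts)
```

Points that fall outside the grid are simply not counted, so the sum of the result can be smaller than the number of input points.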
In my case I have a PCM.txt file that contains the binary representation of PCM data, like below.
[1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1.
0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1.
0. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1.
0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0.
0. 1. 0. 1.]
1's meaning binary 1
0's meaning binary 0
This is nothing but 100 samples of data.
Is it possible to write Python code that reads this PCM.txt as input and plots the PCM data using matplotlib? Could you please give me some tips on how to implement this?
Will the plotted figure look like a square wave?
I think you want this:
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(100)
y = [1.,1.,0.,1.,0.,1.,1.,1.,1.,1.,0.,1.,1.,1.,1.,1.,1.,1.,0.,1.,1.,1.,0.,1.,0.,1.,0.,1.,0.,0.,1.,0.,0.,0.,0.,0.,1.,0.,0.,0.,0.,0,0.,0.,1.,0.,0.,1.,0.,1.,0.,1.,0.,1.,0.,1.,1.,1.,1.,1.,0.,1.,1.,1.,1.,1.,1.,1.,0.,1.,1.,1.,0.,1.,0.,1.,0.,1.,0.,0.,1.,0.,0.,0.,0.,0.,1.,0.,0.,0.,0.,0.,0.,0.,1.,0.,0.,1.,0.,1.]
plt.step(x, y)
plt.show()
If you are having trouble reading the file, you can just use a regex to find things that look like numbers:
import matplotlib.pyplot as plt
import numpy as np
import re
# Slurp entire file
with open('PCM.txt') as f:
    s = f.read()
# Set y to anything that looks like a number, converted from text to float
y = [float(v) for v in re.findall(r'[0-9.]+', s)]
# Set x according to number of samples found
x = np.arange(len(y))
# Plot that thing
plt.step(x, y)
plt.show()
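If you would rather not use a regex, a plain string split also works for this bracketed format. The sketch below parses a short hypothetical excerpt held in a string so that it runs without the file; with the real data you would read the text from PCM.txt first.

```python
import numpy as np

# Hypothetical excerpt of PCM.txt, in the bracketed format shown in the question.
text = """[1. 1. 0. 1. 0. 1. 1. 1.
 0. 1. 0. 0.]"""

# Drop the brackets, split on whitespace, and convert each token to a float.
tokens = text.replace("[", " ").replace("]", " ").split()
y = np.array([float(tok) for tok in tokens])
x = np.arange(y.size)
print(y)
```

Passing these to plt.step(x, y, where="post") holds each sample until the next one, which gives the square-wave look asked about in the question.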