Python numpy array filter

I have the following numpy array named 'data'. It consists of 15118 rows and 2 columns. The first column mostly consists of 0.01 steps, but sometimes there is a step in between (shown in red) which I would like to remove/filter out.
I achieved this with the following code:
# Create array [0, 0.01, ..., 140], rounded to 2 decimals to prevent floating point error
b = np.round(np.arange(0, 140.01, 0.01), 2)
# New empty data array
new_data = np.empty(shape=[0, 2])
# Loop over values to remove/filter out data
for x in b:
    Index = np.where(x == data[:, 0])[0][0]
    new_data = np.vstack([new_data, data[Index]])
I feel like this code is far from optimal and I was wondering if anyone knows a faster/better way of achieving this?

Here's a solution using pandas for resampling. You could probably achieve the same result in pure numpy, but there are a number of floating point and rounding pitfalls you would face; it may be better to let a trusted library do the work for you.
Let's say arr is your data array and assume your index to be in fractions of seconds. You can convert your array to a dataframe with a timedelta index:
df = pd.DataFrame(arr[:,1], index=arr[:,0])
df.index = pd.to_timedelta(df.index, unit="s")
Then resampling is pretty easy: 10ms is the frequency you want, and first() should give you the expected result, dropping everything but the records at 10ms ticks. Feel free to experiment with other aggregation functions.
df = df.resample("10ms").first()
Eventually you could get back to your array with something like:
np.vstack([pd.to_numeric(df.index, downcast="float").values / 1e9,
           df.values.squeeze()]).T
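Putting the pieces together, here is a hypothetical end-to-end sketch on a small synthetic array (the off-grid rows at 0.013 s and 0.027 s stand in for the unwanted steps). To dodge the float-to-timedelta rounding pitfalls mentioned above, it converts the index via integer milliseconds and uses total_seconds() for the way back:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`: a regular 0.01 s grid with two extra
# off-grid rows (at 0.013 s and 0.027 s) playing the role of the unwanted steps.
t = np.arange(10) * 0.01
arr = np.column_stack([t, np.arange(10, dtype=float)])
arr = np.insert(arr, 2, [0.013, 99.0], axis=0)
arr = np.insert(arr, 4, [0.027, 98.0], axis=0)

# Convert via integer milliseconds to sidestep float-to-timedelta rounding.
df = pd.DataFrame(arr[:, 1],
                  index=pd.to_timedelta((arr[:, 0] * 1000).round().astype(int),
                                        unit="ms"))

# Keep the first record in each 10 ms bin; the off-grid rows are dropped.
df = df.resample("10ms").first()

# Back to a 2-column array: seconds in column 0, values in column 1.
new_data = np.vstack([df.index.total_seconds(), df.values.squeeze()]).T
```

The two extra rows fall into the same 10 ms bins as the on-grid records before them, so first() discards them and new_data is the clean 10-row grid.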

Related

Pandas dataframe sum of row won't let me use result in equation

Can anybody help me understand why the code below doesn't work?
start_date = '1990-01-01'
ticker_list = ['SPY', 'QQQ', 'IWM','GLD']
tickers = yf.download(ticker_list, start=start_date)['Close'].dropna()
ticker_vol_share = (tickers.pct_change().rolling(20).std()) \
    / ((tickers.pct_change().rolling(20).std()).sum(axis=1))
Both (tickers.pct_change().rolling(20).std()) and ((tickers.pct_change().rolling(20).std()).sum(axis=1)) run fine by themselves, but when run together they produce a dataframe with thousands of columns, all filled with NaN.
Try this.
rolling_std = tickers.pct_change().rolling(20).std()
ticker_vol_share = rolling_std.apply(lambda row: row / sum(row), axis=1)
Why it's not working as expected:
Your tickers object is a DataFrame, as are tickers.pct_change() and tickers.pct_change().rolling(20).std(). tickers.pct_change().rolling(20).std().sum(axis=1), on the other hand, is a Series, because sum() collapses an axis.
You're therefore dividing a DataFrame by a Series. With the / operator, pandas aligns the Series' index against the DataFrame's columns. Here the Series is indexed by date while the DataFrame's columns are ticker names, so nothing aligns: the result is a DataFrame whose columns are the union of the tickers and every date (hence the thousands of columns), filled with NaN. To divide row-wise you have to be explicit about the axis, e.g. rolling_std.div(rolling_std.sum(axis=1), axis=0).
It's not clear to me what your expected output is, so I'll defer to others or wait for an update before answering that part.
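As a runnable sketch of the vectorized row-wise division (synthetic random prices stand in for the yfinance download, so no network access is needed; the ticker names are just labels):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the yfinance download: random close prices
# for four hypothetical tickers over 100 days.
rng = np.random.default_rng(0)
tickers = pd.DataFrame(rng.uniform(90, 110, size=(100, 4)),
                       columns=['SPY', 'QQQ', 'IWM', 'GLD'])

rolling_std = tickers.pct_change().rolling(20).std()

# Divide each row by its own sum, aligning on the row index (axis=0),
# instead of broadcasting the Series across thousands of new columns.
ticker_vol_share = rolling_std.div(rolling_std.sum(axis=1), axis=0)
```

After the 20-day warm-up period, every row of ticker_vol_share sums to 1, which is presumably the intended "share of volatility" output.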

Speeding up 3D numpy and dataframe lookup

I currently have a pretty large 3D numpy array (atlasarray - 14M elements with type int64) in which I want to create a duplicate array where every element is a float based on a separate dataframe lookup (organfile).
I'm very much a beginner, so I'm sure there must be a better (quicker) way to do this. Currently it takes around 90 s, which isn't ages but can surely be reduced. Most of the code below was pieced together from hours of Googling, so it surely isn't optimised.
import numpy as np
import pandas as pd
from tqdm import tqdm

organfile = pd.read_excel('/media/sf_VMachine_Shared_Path/ValidationData/ICRP110/AF/AF_OrgansSimp.xlsx')
densityarray = atlasarray
densityarray = densityarray.astype(float)
# Create an iterable of elements that can be written over, and go element by element
for idx, x in tqdm(np.ndenumerate(densityarray), total=densityarray.size):
    densityarray[idx] = organfile.loc[x, 'Density']
All of the elements in the original numpy array are integers which correspond to an OrganID. I used pandas to read in the key from an Excel file and generate a 4-column dataframe, from which I want to extract the 4th column (a float). OrganIDs go up to 142. The first rows look like this:

| OrganID | OrganName | TissueType | Density |
|---------|-----------|------------|---------|
| 0       | Air       | 53         | 0.001   |
| 1       | Adrenal   | 43         | 1.030   |
Any recommendations on ways I can speed this up would be gratefully received.
Put the density from the dataframe into a numpy array:
density = np.array(organfile['Density'])
Then run:
density[atlasarray]
Don't use loops; they are slow. The following example with 14M elements and 143 possible IDs (0 through 142) takes less than 1 second to run:
density = np.random.random(143)
atlasarray = np.random.randint(0, 143, (1000, 1000, 14))
densityarray = density[atlasarray]
Shape of densityarray:
print(densityarray.shape)
(1000, 1000, 14)
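To convince yourself that the fancy-indexing lookup really matches the original element-by-element loop, here is a small runnable check on a toy array (sizes shrunk so the loop finishes instantly; variable names follow the question):

```python
import numpy as np

# Toy stand-ins for the real data: 143 densities, a small 3D atlas of IDs.
density = np.random.random(143)
atlasarray = np.random.randint(0, 143, (10, 10, 5))

# Vectorized lookup: one fancy-indexing operation replaces the whole loop.
densityarray = density[atlasarray]

# Original element-wise loop, for comparison.
loop_result = np.empty(atlasarray.shape)
for idx, x in np.ndenumerate(atlasarray):
    loop_result[idx] = density[x]
```

Both approaches produce identical arrays; the fancy-indexing version is simply one C-level gather instead of 14M Python-level iterations.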

Quickly search through a numpy array and sum the corresponding values

I have an array with around 160k entries which I get from a CSV-file and it looks like this:
data_arr = np.array([['ID0524', 1.0],
                     ['ID0965', 2.5],
                     .
                     .
                     ['ID0524', 6.7],
                     ['ID0324', 3.0]])
I now get around 3k unique ID's from some database and what I have to do is look up each of these IDs in the array and sum the corresponding numbers.
So if I would need to look up "ID0524", the sum would be 7.7.
My current working code looks something like this (I'm sorry that it's pretty ugly, I'm very new to numpy):
def sumValues(self, id):
    sub_arr = data_arr[data_arr[0:data_arr.size, 0] == id]
    sum_arr = sub_arr[0:sub_arr.size, 1]
    return sum_arr.sum()
And it takes around ~18s to do this for all 3k IDs.
I wondered if there is a faster way to do this, as the current runtime seems a bit too long to me. I would appreciate any guidance and hints. Thank you!
You could try using the built-in numpy methods:
numpy.intersect1d to find the unique IDs
numpy.sum to sum them up
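A different pure-numpy route (not intersect1d, but np.unique with return_inverse combined with np.bincount) sums every ID in one vectorized pass over all rows, so the 3k lookups afterwards are just dictionary hits:

```python
import numpy as np

# Hypothetical tiny version of the 160k-row array from the question.
data_arr = np.array([['ID0524', 1.0],
                     ['ID0965', 2.5],
                     ['ID0524', 6.7],
                     ['ID0324', 3.0]], dtype=object)

ids = data_arr[:, 0].astype(str)
amounts = data_arr[:, 1].astype(float)

# Map every row to an integer code for its ID, then sum per code in one pass.
unique_ids, inverse = np.unique(ids, return_inverse=True)
sums = np.bincount(inverse, weights=amounts)

# Dictionary for O(1) lookups of the 3k query IDs.
totals = dict(zip(unique_ids, sums))
```

For 'ID0524' this gives 1.0 + 6.7 = 7.7, matching the expected result, and the whole aggregation is a single pass instead of one boolean scan per query ID.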
A convenient tool to do your task is Pandas, with its grouping mechanism.
Start from the necessary import:
import pandas as pd
Then convert data_arr to a pandasonic DataFrame:
df = pd.DataFrame({'Id': data_arr[:, 0], 'Amount': data_arr[:, 1].astype(float)})
The slight complication in the above code comes from the fact that the elements of your input array are all of a single type (in this case object), so the second column must be converted to float.
Then you can get the expected result in a single instruction:
result = df.groupby('Id').sum()
The result, for your data sample, is:
        Amount
Id
ID0324     3.0
ID0524     7.7
ID0965     2.5
Another approach: you could read your CSV file directly into a DataFrame (see the read_csv method), so there is no need for any intermediate Numpy array. The advantage is that read_csv is clever enough to recognize the data type of each column separately; at the very least it can tell numbers apart from strings.
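For instance (a hypothetical inline CSV stands in for your file here; in practice you would pass the file path, and the column names 'Id' and 'Amount' are assumptions):

```python
import io
import pandas as pd

# Hypothetical CSV contents; replace io.StringIO(csv_text) with your file path.
csv_text = "Id,Amount\nID0524,1.0\nID0965,2.5\nID0524,6.7\nID0324,3.0\n"
df = pd.read_csv(io.StringIO(csv_text))

# dtypes are inferred per column: Id stays a string column, Amount becomes float64.
result = df.groupby('Id')['Amount'].sum()
```

Because read_csv already delivered Amount as float64, no astype conversion is needed before the groupby.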

Computation between two large columns in a Pandas Dataframe

I have a dataframe with 2 columns of zipcodes, and I would like to add another column with their distance values. I am able to do this with a fairly low number of rows, but I am now working with a dataframe that has about 500,000 rows. The code I have works, but on my current dataframe it has been running for about 30 minutes with no completion in sight, so I suspect what I'm doing is extremely inefficient.
Here is the code
import pgeocode

dist = pgeocode.GeoDistance('us')

def distance_pairing(start, end):
    return dist.query_postal_code(start, end)

zips['distance'] = zips.apply(lambda x: distance_pairing(x['zipstart'], x['zipend']), axis=1)
zips
I know looping is out of the question, so is there something else I can do, efficiency wise that would make this better?
Whenever possible, use vectorized operations in pandas and numpy. In this case:
zips['distance'] = dist.query_postal_code(
    zips['zipstart'].values,
    zips['zipend'].values,
)
This won't always work, but in this case, the underlying pgeocode.haversine function is written (in numpy) to accommodate arrays of x and y coordinates. This should speed up your code by several orders of magnitude for a dataframe of this size.

pandas: check whether an element is in dataframe or given column leads to strange results

I am doing some data handling based on a DataFrame with the shape of (135150, 12) so double checking my results manually is not applicable anymore.
I encountered some 'strange' behavior when I tried to check if an element is part of the dataframe or a given column.
This behavior is reproducible with even smaller dataframes as follows:
import numpy as np
import pandas as pd
start = 1e-3
end = 2e-3
step = 0.01e-3
arr = np.arange(start, end+step, step)
val = 0.0019
df = pd.DataFrame(arr, columns=['example_value'])
print(val in df) # prints `False`
print(val in df['example_value']) # prints `True`
print(val in df.values) # prints `False`
print(val in df['example_value'].values) # prints `False`
print(df['example_value'].isin([val]).any()) # prints `False`
Since I am a beginner in data analysis, I am not able to explain this behavior.
I know that I am using different approaches involving different datatypes (like pd.Series and np.ndarray) to check if the given value exists in the dataframe, and I am aware that machine precision comes into play when comparing floating point numbers.
However, in the end I need to implement several functions to filter the dataframe and count the occurrences of some values, which I have done successfully several times before using boolean columns combined with operations like > and <.
But in this case I need to filter by an exact value and count its occurrences, which led me to the issue described above.
So could anyone explain, what's going on here?
The underlying issue, as Divakar suggested, is floating point precision. Because DataFrames/Series are built on top of numpy, there isn't really a penalty for using numpy methods though, so you can just do something like:
df['example_value'].apply(lambda x: np.isclose(x, val)).any()
or
np.isclose(df['example_value'], val).any()
both of which correctly return True.
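The same mask also solves the filter-and-count requirement from the question. A sketch, rebuilding the example data from above:

```python
import numpy as np
import pandas as pd

# Rebuild the example from the question.
start, end, step = 1e-3, 2e-3, 0.01e-3
arr = np.arange(start, end + step, step)
df = pd.DataFrame(arr, columns=['example_value'])
val = 0.0019

# Boolean mask of near-matches; np.isclose's default tolerances
# (rtol=1e-5, atol=1e-8) are far tighter than the 1e-5 grid spacing,
# so only the one intended element matches.
mask = np.isclose(df['example_value'], val)
matches = df[mask]        # the filtered rows
count = int(mask.sum())   # the number of occurrences
```

Here count is 1, because exactly one grid point lies within tolerance of 0.0019, even though the exact equality checks in the question all failed.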
