pandas: round up to closest float number defined by user - python

I have one array which contains continuous values. I need to round those values up to the closest half, e.g. 32.25 to 32.50, 30.29 to 30.50, 33.75 to 34.00. In short: if the fractional part is from .1 to .49, round up to .50, and if it is from .51 to .99, round up to .00. How can I do it? Thank you in advance.
array([32.5 , 32.49, 32.48, 32.47, 32.46, 32.45, 32.44, 32.43, 32.42,
32.41, 32.4 , 32.39, 32.38, 32.37, 32.36, 32.35, 32.34, 32.33,
32.32, 32.31, 32.3 , 32.29, 32.28, 32.27, 32.26, 32.25, 15.75,
15.76, 15.77, 15.78, 15.79, 15.8 , 15.81, 15.82, 15.83, 15.84,
15.85, 15.86, 15.87, 15.88, 15.89, 15.9 , 15.91, 15.92, 15.93,
15.94, 15.95, 15.96, 15.97, 15.98, 15.99, 16. , 16.01, 16.02,
16.03, 16.04, 16.05, 16.06, 16.07, 16.08, 16.09, 16.1 , 16.11,
16.12, 16.13, 16.14, 16.15, 16.16, 16.17, 16.18, 16.19, 16.2 ,
16.21, 16.22, 16.23, 16.24, 16.25, 16.26, 16.27, 16.28, 16.29,
16.3 , 16.31, 16.32, 16.33, 16.34, 16.35, 16.36, 16.37, 16.38,
16.39, 16.4 , 16.41, 16.42, 16.43, 16.44, 16.45, 16.46, 16.47,
16.48, 16.49, 16.5 , 25.25, 25.5 , 25.51, 25.52, 25.53, 25.54,
25.55, 25.56, 25.57, 25.58, 25.59, 25.6 , 25.61, 25.62, 25.63,
25.64, 25.65, 25.66, 25.67, 25.68, 25.69, 25.7 , 25.71, 25.72,
25.73, 25.74, 26. , 26.01, 26.02, 26.03, 26.04, 26.05, 26.06,
26.07, 26.08, 26.09, 26.1 , 26.11, 26.12, 26.13, 26.14, 26.15,
26.16, 26.17, 26.18, 26.19, 26.2 , 26.21, 26.22, 26.23, 26.24,
26.25, 26.26, 26.27, 26.28, 26.29, 26.3 , 26.31, 26.32, 26.5 ,
26.49, 26.48, 26.47, 26.46, 26.45, 26.44, 26.43, 26.42, 26.41,
26.4 , 26.39, 26.38, 26.37, 26.36, 26.35, 26.34, 26.33, 28.5 ,
28.51, 28.52, 28.53, 28.54, 28.55, 28.56, 28.57, 28.58, 28.59,
28.6 , 28.61, 28.62, 28.63, 28.64, 28.65, 28.66, 30.5 , 30.49,
30.48, 30.47, 30.46, 30.45, 30.44, 30.43, 30.42, 30.41, 30.4 ,
30.39, 30.38, 30.37, 30.36, 30.35, 30.34, 30.33, 30.32, 30.31,
30.3 , 30.29, 30.28, 30.27, 30.26, 30.25])

Did you not experiment with this? numpy is built for experimentation.
array = (array * 2 + 0.4999).round() / 2
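An equivalent way to write this, without the magic constant: scale so each half step becomes a whole number, take the ceiling, then scale back. (The 0.4999 above guards against floating-point noise pushing an exact half just past the boundary; for values that are clean multiples of 0.01 the two agree.)
import numpy as np
array = np.ceil(array * 2) / 2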

Another solution:
import math
[math.floor(item) + 0.5 if 0 < item % 1 <= 0.5 else math.ceil(item) for item in array]

Use the code below:
import numpy as np
round_off_values = np.round(array, decimals=1)  # note: rounds to the nearest 0.1, it does not round up to the next half


array is 1-dimensional, but 2 were indexed when using numpy and recfromcsv

I am looping through a bunch of files and importing their contents as numpy arrays:
# get the dates for our gaps
import os.path
import glob
from pathlib import Path
from numpy import recfromcsv

folder = "daily_bars_filtered/*.csv"
df_gapper_list = []
df_intraday_analysis = []

# loop through the daily gappers
for fname in glob.glob(folder)[0:2]:
    ticker = Path(fname).stem
    daily_bars_arr = recfromcsv(fname, delimiter=',')
    print(ticker)
    print(daily_bars_arr)
Output:
AACG
[(b'2021-07-15', 43796169., 2.98, 3.83, 4.75, 2.9401, 2.98, 59.39597315)
(b'2022-01-04', 14934689., 1.25, 2.55, 2.59, 1.25 , 1.19, 117.64705882)
(b'2022-01-05', 8067429., 1.8 , 2.3 , 2.64, 1.72 , 2.55, 3.52941176)
(b'2022-01-07', 9718034., 1.93, 2.64, 2.94, 1.85 , 1.98, 48.48484848)]
AAL
[(b'2022-03-04', 76218689., 15.27 , 14.59, 15.4799, 14.42 , 15.71, 1.46467218)
(b'2022-03-07', 89360330., 14.32 , 12.84, 14.62 , 12.77 , 14.59, 0.20562029)
(b'2022-03-08', 88067102., 13.035, 13.51, 14.27 , 12.4401, 12.84, 11.13707165)
(b'2022-03-09', 88884229., 14.44 , 14.3 , 14.75 , 14.05 , 13.51, 9.17838638)
(b'2022-03-10', 56463182., 13.82 , 14.2 , 14.44 , 13.46 , 14.3 , 0.97902098)
(b'2022-03-11', 48342029., 14.4 , 14.02, 14.56 , 13.9 , 14.2 , 2.53521127)
(b'2022-03-14', 53284254., 14.04 , 14.25, 14.83 , 13.7 , 14.02, 5.77746077)]
What I then try to do is target the first column where my dates are, by doing:
print(daily_bars_arr[:,[0]])
But then I get the following error:
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
What am I doing wrong?
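A likely cause, for reference: recfromcsv returns a one-dimensional structured array, where each element is one row and the columns are named fields, so two-dimensional indexing like [:, [0]] cannot work. A sketch of field access; the field names are whatever recfromcsv inferred, so check dtype.names first:
print(daily_bars_arr.dtype.names)                      # the inferred field names
dates = daily_bars_arr[daily_bars_arr.dtype.names[0]]  # first field: the dates
print(dates)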

Quantile Regression with Tensorflow Probability

I am giving TensorFlow Probability a try. I have a simple quantile regression in R, and I would like to get the same results from TensorFlow Probability.
R Quantile Regression
library("quantreg")
mtcars_url <- "https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv"
mtcars <- readr::read_csv(mtcars_url)
rqfit <- rq(mpg ~ wt, data = mtcars, tau = seq(.1, .9, by = 0.1))
predict(rqfit, mtcars, interval = c("confidence"), level = .95)
The output for the above is 9 quantile predictions for each value:
tau= 0.1 tau= 0.2 tau= 0.3 tau= 0.4 tau= 0.5 tau= 0.6 tau= 0.7 tau= 0.8 tau= 0.9
1 20.493299 20.92800 21.238356 22.176471 22.338816 23.592283 24.475462 25.12302 27.97207
2 18.837113 19.33680 19.910959 20.938971 21.181250 22.313183 23.110731 23.78409 26.37404
3 22.441753 22.80000 22.800000 23.632353 23.700658 25.097106 26.081028 26.69824 29.85210
4 16.628866 17.21520 18.141096 19.288971 19.637829 20.607717 21.291089 21.99884 24.24333
5 15.167526 15.81120 16.969863 18.197059 18.616447 19.479100 20.086915 20.81743 22.83330
6 15.037629 15.68640 16.865753 18.100000 18.525658 19.378778 19.979877 20.71242 22.70797
7 14.323196 15.00000 16.293151 17.566176 18.026316 18.827010 19.391169 20.13484 22.01862
8 16.791237 17.37120 18.271233 19.410294 19.751316 20.733119 21.424886 22.13011 24.40000
9 17.051031 17.62080 18.479452 19.604412 19.932895 20.933762 21.638962 22.34014 24.65067
10 15.167526 15.81120 16.969863 18.197059 18.616447 19.479100 20.086915 20.81743 22.83330
11 15.167526 15.81120 16.969863 18.197059 18.616447 19.479100 20.086915 20.81743 22.83330
12 11.075773 11.88000 13.690411 15.139706 15.756579 16.318971 16.715226 17.50948 18.88523
13 13.284021 14.00160 15.460274 16.789706 17.300000 18.024437 18.534868 19.29472 21.01594
14 12.959278 13.68960 15.200000 16.547059 17.073026 17.773633 18.267273 19.03219 20.70260
15 3.411856 4.51680 7.547945 9.413235 10.400000 10.400000 10.400000 11.31363 11.49042
16 2.281753 3.43104 6.642192 8.568824 9.610132 9.527203 9.468772 10.40000 10.40000
17 2.794845 3.92400 7.053425 8.952206 9.968750 9.923473 9.891571 10.81481 10.89508
18 23.221134 23.54880 23.424658 24.214706 24.245395 25.699035 26.723254 27.32833 30.60412
19 27.020619 27.19920 26.469863 27.053676 26.900987 28.633441 29.854108 30.40000 34.27019
20 25.591753 25.82640 25.324658 25.986029 25.902303 27.529904 28.676693 29.24484 32.89150
21 21.500000 21.89520 22.045205 22.928676 23.042434 24.369775 25.305004 25.93689 28.94342
22 14.647938 15.31200 16.553425 17.808824 18.253289 19.077814 19.658764 20.39737 22.33196
23 15.200000 15.84240 16.995890 18.221324 18.639145 19.504180 20.113674 20.84369 22.86464
24 12.569588 13.31520 14.887671 16.255882 16.800658 17.472669 17.946160 18.71714 20.32659
25 12.537113 13.28400 14.861644 16.231618 16.777961 17.447588 17.919401 18.69089 20.29526
26 24.942268 25.20240 24.804110 25.500735 25.448355 27.028296 28.141504 28.71977 32.26482
27 23.610825 23.92320 23.736986 24.505882 24.517763 26.000000 27.044367 27.64337 30.98013
28 27.683093 27.83568 27.000822 27.548676 27.364013 29.145080 30.400000 30.93557 34.90940
29 16.921134 17.49600 18.375342 19.507353 19.842105 20.833441 21.531924 22.23513 24.52534
30 19.519072 19.99200 20.457534 21.448529 21.657895 22.839871 23.672679 24.33542 27.03205
31 14.323196 15.00000 16.293151 17.566176 18.026316 18.827010 19.391169 20.13484 22.01862
32 19.454124 19.92960 20.405479 21.400000 21.612500 22.789711 23.619160 24.28291 26.96938
TensorFlow Probability Attempt
# pip3 install tensorflow_probability must be installed
from pprint import pprint
import numpy as np
import pandas as pd
import tensorflow.compat.v2 as tf
import tensorflow_probability as tfp
import termplotlib as tpl

np.set_printoptions(formatter={'float': lambda x: "{0:0.1f}".format(x)})
tf.enable_v2_behavior()
tfd = tfp.distributions

mtcars_url = "https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv"
df_cars = pd.read_csv(mtcars_url)
y = np.array(df_cars.mpg)
x = np.array(df_cars.wt)

negloglik = lambda y, rv_y: -rv_y.log_prob(y)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1 + 1),
    tfp.layers.DistributionLambda(
        lambda t: tfd.Normal(loc=t[..., :1],
                             scale=1e-3 + tf.math.softplus(0.05 * t[..., 1:]))),
])

# Do inference.
model.compile(optimizer=tf.optimizers.Adam(learning_rate=0.01), loss=negloglik)
model.fit(x, y, epochs=1000, verbose=False)

# generate 40000 samples
n = 40000
yhat = model(np.array(x))
a = yhat.sample(n).numpy().reshape(32, n, 1)

# check that the quantiles of the 40000 samples are similar to the R script
quantiles = np.linspace(start=0.1, stop=0.9, num=9)
array_quants = np.quantile(a, q=quantiles, axis=1)
print(array_quants.reshape(32, 9))
The output here is:
[[13.2 13.3 13.2 13.2 13.2 13.2 13.2 13.2 13.2]
[13.2 13.2 13.2 13.2 13.2 13.2 13.2 13.2 13.2]
[13.2 13.2 13.2 13.2 13.2 13.2 13.2 13.2 13.2]
[13.2 13.2 13.2 13.2 13.2 15.1 15.1 15.1 15.2]
[15.1 15.1 15.2 15.2 15.2 15.2 15.1 15.2 15.2]
[15.2 15.2 15.2 15.2 15.1 15.2 15.2 15.2 15.1]
[15.2 15.2 15.2 15.2 15.2 15.2 15.1 15.1 15.2]
[15.1 16.9 16.9 16.9 16.9 16.9 16.9 16.9 16.9]
[16.9 16.9 16.9 16.9 16.9 16.9 16.9 16.9 16.9]
[16.9 16.9 16.9 16.9 16.9 16.9 16.9 16.9 16.9]
[16.9 16.9 16.9 16.9 16.9 16.9 18.3 18.3 18.3]
[18.3 18.3 18.3 18.3 18.3 18.3 18.3 18.3 18.3]
[18.3 18.3 18.3 18.3 18.3 18.3 18.3 18.3 18.3]
[18.3 18.3 18.3 18.3 18.3 18.3 18.3 18.3 18.3]
[18.3 18.3 19.4 19.3 19.4 19.4 19.4 19.4 19.3]
[19.4 19.4 19.3 19.4 19.3 19.4 19.4 19.4 19.3]
[19.4 19.4 19.4 19.3 19.4 19.4 19.4 19.3 19.4]
[19.4 19.3 19.4 19.4 19.4 19.4 19.3 20.2 20.3]
[20.2 20.2 20.2 20.3 20.2 20.2 20.2 20.2 20.2]
[20.3 20.2 20.2 20.3 20.2 20.3 20.3 20.3 20.2]
[20.3 20.3 20.3 20.2 20.2 20.3 20.2 20.3 20.3]
[20.3 20.3 20.2 21.1 21.1 21.1 21.1 21.1 21.2]
[21.1 21.1 21.1 21.1 21.1 21.1 21.1 21.1 21.1]
[21.1 21.2 21.2 21.1 21.1 21.1 21.1 21.2 21.1]
[21.1 21.2 21.2 21.1 21.1 21.1 21.2 21.1 22.2]
[22.2 22.2 22.2 22.2 22.2 22.2 22.2 22.2 22.2]
[22.2 22.2 22.2 22.2 22.2 22.2 22.2 22.2 22.2]
[22.2 22.2 22.2 22.2 22.2 22.2 22.2 22.3 22.2]
[22.2 22.2 22.2 22.2 24.9 24.9 24.9 24.9 24.9]
[24.8 24.9 24.9 24.9 24.9 24.8 24.9 24.8 24.9]
[24.9 24.9 24.9 24.9 24.9 24.9 24.9 24.9 24.9]
[24.9 24.9 24.9 24.9 24.9 24.8 24.9 24.8 24.9]]
I am still learning TFP so I am sure there is some simple mistake I made in the model specification, but it is not currently obvious to me what that mistake is.
Try adjusting the learning rate and see if the model converges correctly. Also, with the reshape call in your code, you probably intended a transpose. With a learning rate of 1.0 and some code fixes, this is what I got:
>>> n = 1000
>>> a = yhat.sample(n)[:,:,0].numpy().T
>>> quantiles = np.linspace(start=0.1, stop=0.9, num=9)
>>> array_quants = np.quantile(a, q = quantiles, axis=1)
>>> np.round(array_quants.T,1)
array([[18.1, 19.8, 21.1, 22. , 23.1, 24.2, 25.4, 26.5, 28. ],
[16.6, 18.4, 19.8, 21. , 21.8, 23. , 24.3, 25.7, 27.4],
[19.9, 21.5, 22.9, 23.8, 24.7, 25.6, 26.6, 28. , 29.4],
[14.2, 16.1, 17.7, 19.1, 20.3, 21.5, 22.7, 24.4, 26.4],
[12.8, 15. , 16.4, 17.7, 19. , 20.3, 21.8, 23.4, 25.4],
[12.3, 14.6, 16.3, 17.7, 19. , 20.2, 21.7, 23.2, 25.2],
[12.1, 14.5, 16.2, 17.3, 18.6, 19.7, 21.3, 23. , 25.1],
[14.6, 16.3, 17.5, 19.1, 20.4, 21.6, 22.9, 24.2, 26.6],
[14.9, 17. , 18.4, 19.5, 20.5, 21.7, 22.9, 24.4, 26.5],
[13.1, 15.2, 17. , 18.2, 19.3, 20.6, 21.9, 23.4, 25.4],
[12.9, 15. , 16.7, 18.2, 19.4, 20.7, 22. , 23.6, 25.7],
[ 8.7, 11.6, 13.3, 14.5, 16.1, 17.5, 19. , 20.8, 23. ],
[10.7, 13.3, 15.1, 16.5, 17.7, 19.1, 20.6, 22.3, 24.3],
[11.1, 13.3, 14.9, 16.3, 17.4, 18.9, 20.4, 21.9, 24.1],
[ 2. , 5. , 7.1, 8.6, 10.2, 11.9, 13.4, 15.9, 18.6],
[-0.1, 3.9, 6.1, 8. , 9.7, 11.6, 13.5, 15.5, 17.9],
[ 0.3, 3.9, 6.3, 8.1, 9.8, 11.5, 13.1, 15.4, 18.4],
[20.6, 22.3, 23.3, 24.3, 25.2, 26.2, 27.2, 28.3, 30. ],
[24.2, 25.6, 26.6, 27.6, 28.3, 29.2, 30. , 30.8, 32. ],
[22.7, 24.1, 25.1, 26.1, 26.9, 27.9, 28.8, 29.9, 31.3],
[19. , 20.9, 22. , 23.3, 24. , 25. , 26. , 27.4, 29.1],
[12.4, 14.6, 16.3, 17.7, 18.9, 20. , 21.5, 23.1, 25.7],
[13.2, 15.3, 16.8, 18.2, 19.2, 20.5, 22. , 23.4, 25.4],
[10.5, 13.1, 14.9, 16.2, 17.4, 18.6, 20.1, 21.9, 24.3],
[10.3, 12.7, 14.4, 15.8, 17.4, 18.7, 20. , 21.9, 24.3],
[22.4, 23.9, 24.9, 25.8, 26.6, 27.5, 28.3, 29.3, 30.8],
[21.1, 22.6, 23.6, 24.5, 25.5, 26.3, 27.2, 28. , 29.6],
[25.1, 26.3, 27.2, 28. , 28.7, 29.4, 30.1, 31. , 32.3],
[14.4, 16.3, 17.7, 18.9, 20. , 21.3, 22.7, 24.2, 26.5],
[16.9, 18.8, 20.2, 21.2, 22.5, 23.5, 24.6, 25.8, 27.5],
[12.4, 14.7, 16.1, 17.5, 18.7, 20. , 21.5, 23.1, 25.3],
[16.8, 18.7, 20. , 21.2, 22.5, 23.5, 24.6, 26.3, 28.1]])
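For context on why a transpose rather than a reshape: yhat.sample(n) comes back with shape (n, 32, 1), sample dimension first, so reshape(32, n, 1) reinterprets the flat buffer and mixes samples from different rows into the same row, while the transpose keeps each row's n samples together. A quick shape check (the sampled values themselves will vary):
samples = yhat.sample(5)
print(samples.shape)            # (5, 32, 1): (num_samples, num_rows, 1)
a = samples[:, :, 0].numpy().T  # (32, 5): one row of samples per observation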

Change the value of only 1 cell in a DataFrame based on criteria

Based on a condition, I want to change the value of the first matching row in a certain column. So far this is what I have:
despesas['recibos'] = ''
for a in recibos['recibos']:
    if len(despesas.loc[(despesas['despesas'] == a) & (despesas['recibos'] == ''), 'recibos']) > 0:
        despesas.loc[(despesas['despesas'] == a) & (despesas['recibos'] == ''),
                     'recibos'].iloc[0] = a
So I want to change only the first value of the column recibos to the value of a where (despesas['despesas']==a) & (despesas['recibos']=='').
Edit 1
Example:
despesas['despesas'] = [11.95, 2.5, 1.2 , 0.6 , 2.66, 2.66, 3. , 47.5 , 16.95,17.56]
recibos['recibos'] = [11.95, 1.2 , 1.2 , 0.2 , 2.66, 2.66, 3. , 47.5 , 16.95, 17.56]
And the result should be:
[[11.95, 11.95], [2.5, null], [1.2, 1.2], [0.6, null], [2.66, 2.66], [2.66, 2.66], [3., 3.], [47.5, 47.5], [16.95, 16.95], [17.56, 17.56]]
This could work:
mapper = recibos['recibos'].map(despesas['despesas'].value_counts()).fillna(0)
despesas['recibos'] = recibos['recibos'].where(
    recibos.groupby('recibos').cumcount().lt(mapper), 'null')
print(despesas)
despesas recibos
0 11.95 11.95
1 2.50 1.2
2 1.20 null
3 0.60 null
4 2.66 2.66
5 2.66 2.66
6 3.00 3
7 47.50 47.5
8 16.95 16.95
9 17.56 17.56
I found the solution that I was looking for:
despesas['recibos'] = ''
for index, a in despesas.iterrows():
    if len(recibos.loc[recibos['recibos'] == a['despesas']]) > 0:
        despesas.iloc[index, 1] = True
        recibos.drop(recibos.loc[recibos['recibos'] == a['despesas']][:1].index, inplace=True)
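For reference, the reason the original loop never changed anything: despesas.loc[mask, 'recibos'].iloc[0] = a is chained indexing, so the assignment lands on a temporary copy rather than on despesas itself. Writing the first matching row in a single .loc call avoids that; a sketch using the same mask:
mask = (despesas['despesas'] == a) & (despesas['recibos'] == '')
if mask.any():
    despesas.loc[mask[mask].index[0], 'recibos'] = a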

How can I select a single item from one list and do an operation on all items of a second list using Python

For example, I have one list whose items should be selected one by one:
a = [0.11, 0.22, 0.13, 6.7, 2.5, 2.8]
and another one, all of whose items should be selected:
b = [1.2, 1.4, 2.6, 2.3, 5.7, 9.9]
If I select 0.11 from a and do an operation like addition with all the items of b, and then save the result in a new array or list, how is that possible with Python? ...
I am sorry for the question, as I am trying to learn Python on my own; kindly tell me how this is possible.
Thank you in advance.
You need a nested loop. You can do it in a list comprehension to produce a list of lists:
[[item_a + item_b for item_b in b] for item_a in a]
If you want the end result to be a list of lists it could go like this:
c = [[x + y for x in b] for y in a]
If you want the end result to be a single list with the sublists concatenated, you could write:
c = []
for y in a:
    c += [y + x for x in b]
Another option is to convert your lists into numpy arrays and then exploit the broadcasting property of numpy arrays:
import numpy as np
npA = np.array(a)
npB = np.array(b)
npA[:, None] + npB
array([[ 1.31, 1.51, 2.71, 2.41, 5.81, 10.01],
[ 1.42, 1.62, 2.82, 2.52, 5.92, 10.12],
[ 1.33, 1.53, 2.73, 2.43, 5.83, 10.03],
[ 7.9 , 8.1 , 9.3 , 9. , 12.4 , 16.6 ],
[ 3.7 , 3.9 , 5.1 , 4.8 , 8.2 , 12.4 ],
[ 4. , 4.2 , 5.4 , 5.1 , 8.5 , 12.7 ]])
You can also do element-wise multiplication simply with:
npA[:, None] * npB
which returns:
array([[ 0.132, 0.154, 0.286, 0.253, 0.627, 1.089],
[ 0.264, 0.308, 0.572, 0.506, 1.254, 2.178],
[ 0.156, 0.182, 0.338, 0.299, 0.741, 1.287],
[ 8.04 , 9.38 , 17.42 , 15.41 , 38.19 , 66.33 ],
[ 3. , 3.5 , 6.5 , 5.75 , 14.25 , 24.75 ],
[ 3.36 , 3.92 , 7.28 , 6.44 , 15.96 , 27.72 ]])
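What makes the broadcasting work: npA[:, None] has shape (6, 1) and npB has shape (6,), so numpy stretches both to a common (6, 6) shape, pairing every element of a with every element of b:
print(npA[:, None].shape)          # (6, 1)
print((npA[:, None] + npB).shape)  # (6, 1) and (6,) broadcast to (6, 6)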

Elegant numpy array shifting and NaN filling?

I have a specific performance problem here. I'm working with meteorological forecast timeseries, which I compile into a numpy 2d array such that
dim0 = time at which forecast series starts
dim1 = the forecast horizon, eg. 0 to 120 hrs
Now, I would like dim0 to have hourly intervals, but some sources yield forecasts only every N hours. As an example, say N=3 and the time step in dim1 is M=1 hour. Then I get something like
12:00 11.2 12.2 14.0 15.0 11.3 12.0
13:00 nan nan nan nan nan nan
14:00 nan nan nan nan nan nan
15:00 14.7 11.5 12.2 13.0 14.3 15.1
But of course there is information at 13:00 and 14:00 as well, since it can be filled in from the 12:00 forecast run. So I would like to end up with something like this:
12:00 11.2 12.2 14.0 15.0 11.3 12.0
13:00 12.2 14.0 15.0 11.3 12.0 nan
14:00 14.0 15.0 11.3 12.0 nan nan
15:00 14.7 11.5 12.2 13.0 14.3 15.1
What is the fastest way to get there, assuming dim0 is on the order of 1e4 and dim1 on the order of 1e2? Right now I'm doing it row by row, but that is very slow:
nRows, nCols = dat.shape
if N >= M:
    assert N % M == 0  # must have whole numbers
    for i in range(1, nRows):
        k = np.array(np.where(np.isnan(dat[i, :])))
        k = k[k < nCols - N]  # do not overstep
        dat[i, k] = dat[i-1, k+N]
I'm sure there must be a more elegant way to do this. Any hints would be greatly appreciated.
Behold, the power of boolean indexing!!!
import numpy as np

def shift_nans(arr):
    while True:
        nan_mask = np.isnan(arr)
        write_mask = nan_mask[1:, :-1]
        read_mask = nan_mask[:-1, 1:]
        write_mask &= ~read_mask
        if not np.any(write_mask):
            return arr
        arr[1:, :-1][write_mask] = arr[:-1, 1:][write_mask]
I think the naming is self-explanatory of what is going on. Getting the slicing right is a pain, but it seems to be working:
In [214]: shift_nans(test_data)
Out[214]:
array([[ 11.2, 12.2, 14. , 15. , 11.3, 12. ],
[ 12.2, 14. , 15. , 11.3, 12. , nan],
[ 14. , 15. , 11.3, 12. , nan, nan],
[ 14.7, 11.5, 12.2, 13. , 14.3, 15.1],
[ 11.5, 12.2, 13. , 14.3, 15.1, nan],
[ 15.7, 16.5, 17.2, 18. , 14. , 12. ]])
And for timings:
tmp1 = np.random.uniform(-10, 20, (10000, 100))
nan_idx = np.random.randint(30, 10000 - 1, 10000)
tmp1[nan_idx] = np.nan
tmp = tmp1.copy()
import timeit
t1 = timeit.timeit(stmt='shift_nans(tmp)',
                   setup='from __main__ import tmp, shift_nans',
                   number=1)
t2 = timeit.timeit(stmt='shift_time(tmp1)',  # Ophion's code
                   setup='from __main__ import tmp1, shift_time',
                   number=1)
In [242]: t1, t2
Out[242]: (0.12696346416487359, 0.3427293070417363)
First slice your data using a = yourdata[:, 1:], then:
def shift_time(dat):
    # Find number of required iterations
    check = np.where(np.isnan(dat[:, 0]) == False)[0]
    maxiters = np.max(np.diff(check)) - 1
    # No sense in iterations where it just updates nans
    cols = dat.shape[1]
    if cols < maxiters:
        maxiters = cols - 1
    for iters in range(maxiters):
        # Find nans
        col_loc, row_loc = np.where(np.isnan(dat[:, :-1]))
        dat[(col_loc, row_loc)] = dat[(col_loc - 1, row_loc + 1)]

a = np.array([[11.2, 12.2, 14.0, 15.0, 11.3, 12.0],
              [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
              [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
              [14.7, 11.5, 12.2, 13.0, 14.3, 15.0]])
shift_time(a)
print(a)
[[ 11.2 12.2 14. 15. 11.3 12. ]
[ 12.2 14. 15. 11.3 12. nan]
[ 14. 15. 11.3 12. nan nan]
[ 14.7 11.5 12.2 13. 14.3 15. ]]
You can use your data as is, or the function can be changed slightly to take it directly, but this seems a clear way to show it:
shift_time(yourdata[:,1:]) #Updates in place, no need to return anything.
Using tiago's test:
tmp = np.random.uniform(-10, 20, (10000, 100))
nan_idx = np.random.randint(30, 10000 - 1, 10000)
tmp[nan_idx] = np.nan
import time
t = time.time()
shift_time(tmp)
print(time.time() - t)
0.364198923111 (seconds)
If you are really clever you should be able to get away with a single np.where.
This seems to do the trick:
import numpy as np

def shift_time(dat):
    NX, NY = dat.shape
    for i in range(NY):
        x, y = np.where(np.isnan(dat))
        xr = x - 1
        yr = y + 1
        idx = (xr >= 0) & (yr < NY)
        dat[x[idx], y[idx]] = dat[xr[idx], yr[idx]]
    return
Now with some test data:
In [1]: test_data = np.array([[11.2, 12.2, 14.0, 15.0, 11.3, 12.0],
   ...:                       [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
   ...:                       [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
   ...:                       [14.7, 11.5, 12.2, 13.0, 14.3, 15.1],
   ...:                       [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
   ...:                       [15.7, 16.5, 17.2, 18.0, 14.0, 12.0]])
In [2]: shift_time(test_data)
In [3]: test_data
Out[3]:
array([[ 11.2, 12.2, 14. , 15. , 11.3, 12. ],
[ 12.2, 14. , 15. , 11.3, 12. , nan],
[ 14. , 15. , 11.3, 12. , nan, nan],
[ 14.7, 11.5, 12.2, 13. , 14.3, 15.1],
[ 11.5, 12.2, 13. , 14.3, 15.1, nan],
[ 15.7, 16.5, 17.2, 18. , 14. , 12. ]])
And testing with a (1e4, 1e2) array:
In [1]: tmp = np.random.uniform(-10, 20, (10000, 100))
In [2]: nan_idx = np.random.randint(30, 10000 - 1, 10000)
In [3]: tmp[nan_idx] = np.nan
In [4]: %time shift_time(tmp)
CPU times: user 1.53 s, sys: 0.06 s, total: 1.59 s
Wall time: 1.59 s
Each iteration of this pad, roll, roll combo essentially does what you are looking for:
import numpy as np
from numpy import nan

# Startup array
A = np.array([[11.2, 12.2, 14.0, 15.0, 11.3, 12.0],
              [ nan,  nan,  nan,  nan,  nan,  nan],
              [ nan,  nan,  nan,  nan,  nan,  nan],
              [14.7, 11.5, 12.2, 13.0, 14.3, 15.1]])

def pad_nan(v, pad_width, iaxis, kwargs):
    v[:pad_width[0]] = nan
    v[-pad_width[1]:] = nan
    return v

def roll_data(A):
    idx = np.isnan(A)
    A[idx] = np.roll(np.roll(np.pad(A, 1, pad_nan), 1, 0), -1, 1)[1:-1, 1:-1][idx]
    return A

print(A)
print(roll_data(A))
print(roll_data(A))
The output gives:
[[ 11.2 12.2 14. 15. 11.3 12. ]
[ nan nan nan nan nan nan]
[ nan nan nan nan nan nan]
[ 14.7 11.5 12.2 13. 14.3 15.1]]
[[ 11.2 12.2 14. 15. 11.3 12. ]
[ 12.2 14. 15. 11.3 12. nan]
[ nan nan nan nan nan nan]
[ 14.7 11.5 12.2 13. 14.3 15.1]]
[[ 11.2 12.2 14. 15. 11.3 12. ]
[ 12.2 14. 15. 11.3 12. nan]
[ 14. 15. 11.3 12. nan nan]
[ 14.7 11.5 12.2 13. 14.3 15.1]]
Everything is pure numpy, so each iteration should be extremely fast. However, I'm not sure of the cost of creating a padded array and running the multiple iterations; if you try it, let me know the results!
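For what it's worth, np.pad and each np.roll allocate a fresh copy of the full array, so every roll_data call performs three whole-array copies, and one call is needed per step of the longest run of nan rows; that per-call copying is presumably why the in-place boolean-indexing version above timed faster, though that is an untested guess.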
