I have a specific performance problem here. I'm working with meteorological forecast timeseries, which I compile into a numpy 2d array such that
dim0 = time at which forecast series starts
dim1 = the forecast horizon, e.g. 0 to 120 hrs
Now, I would like dim0 to have hourly intervals, but some sources yield forecasts only every N hours. As an example, say N=3 and the time step in dim1 is M=1 hour. Then I get something like
12:00 11.2 12.2 14.0 15.0 11.3 12.0
13:00 nan nan nan nan nan nan
14:00 nan nan nan nan nan nan
15:00 14.7 11.5 12.2 13.0 14.3 15.1
But of course there is information at 13:00 and 14:00 as well, since it can be filled in from the 12:00 forecast run. So I would like to end up with something like this:
12:00 11.2 12.2 14.0 15.0 11.3 12.0
13:00 12.2 14.0 15.0 11.3 12.0 nan
14:00 14.0 15.0 11.3 12.0 nan nan
15:00 14.7 11.5 12.2 13.0 14.3 15.1
What is the fastest way to get there, assuming dim0 is in the order of 1e4 and dim1 in the order of 1e2? Right now I'm doing it row by row but that is very slow:
nRows, nCols = dat.shape
if N >= M:
    assert N % M == 0  # must shift by a whole number of columns
    for i in range(1, nRows):
        k = np.where(np.isnan(dat[i, :]))[0]
        k = k[k < nCols - N]  # do not overstep the last column
        dat[i, k] = dat[i - 1, k + N]
I'm sure there must be a more elegant way to do this? Any hints would be greatly appreciated.
Behold, the power of boolean indexing!!!
def shift_nans(arr):
    while True:
        nan_mask = np.isnan(arr)
        write_mask = nan_mask[1:, :-1]
        read_mask = nan_mask[:-1, 1:]
        write_mask &= ~read_mask
        if not np.any(write_mask):
            return arr
        arr[1:, :-1][write_mask] = arr[:-1, 1:][write_mask]
I think the naming is self explanatory of what is going on. Getting the slicing right is a pain, but it seems to be working:
In [214]: shift_nans(test_data)
Out[214]:
array([[ 11.2, 12.2, 14. , 15. , 11.3, 12. ],
[ 12.2, 14. , 15. , 11.3, 12. , nan],
[ 14. , 15. , 11.3, 12. , nan, nan],
[ 14.7, 11.5, 12.2, 13. , 14.3, 15.1],
[ 11.5, 12.2, 13. , 14.3, 15.1, nan],
[ 15.7, 16.5, 17.2, 18. , 14. , 12. ]])
And for timings:
tmp1 = np.random.uniform(-10, 20, (10000, 100))
nan_idx = np.random.randint(30, 10000 - 1, 10000)
tmp1[nan_idx] = np.nan
tmp = tmp1.copy()
import timeit
t1 = timeit.timeit(stmt='shift_nans(tmp)',
                   setup='from __main__ import tmp, shift_nans',
                   number=1)
t2 = timeit.timeit(stmt='shift_time(tmp1)',  # Ophion's code
                   setup='from __main__ import tmp1, shift_time',
                   number=1)
In [242]: t1, t2
Out[242]: (0.12696346416487359, 0.3427293070417363)
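For completeness: when every gap is a whole row (as in the question's data), the fill can also be done in a single pass with no Python-level loop. This is only a sketch under that assumption, and shift_nans_vectorized is just an illustrative name:
import numpy as np

def shift_nans_vectorized(dat):
    # Assumes each row is either fully valid or fully NaN
    n_rows, n_cols = dat.shape
    valid = ~np.isnan(dat[:, 0])                 # rows holding a real forecast run
    src = np.maximum.accumulate(np.where(valid, np.arange(n_rows), -1))
    shift = np.arange(n_rows) - src              # how far each row lags its source run
    cols = np.arange(n_cols) + shift[:, None]    # column to read in the source row
    out = np.full_like(dat, np.nan)
    ok = (src[:, None] >= 0) & (cols < n_cols)   # skip rows with no earlier run, clip the right edge
    rows = np.broadcast_to(src[:, None], dat.shape)
    out[ok] = dat[rows[ok], cols[ok]]
    return out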
First, slice the time column off your data: a = yourdata[:, 1:].
def shift_time(dat):
    # Find the number of required iterations
    check = np.where(~np.isnan(dat[:, 0]))[0]
    maxiters = np.max(np.diff(check)) - 1
    # No sense in iterations that would just update nans
    cols = dat.shape[1]
    if cols < maxiters:
        maxiters = cols - 1
    for iters in range(maxiters):
        # Find nans and fill each from the row above, one column to the right
        row_loc, col_loc = np.where(np.isnan(dat[:, :-1]))
        dat[(row_loc, col_loc)] = dat[(row_loc - 1, col_loc + 1)]
a = np.array([[11.2, 12.2, 14.0, 15.0, 11.3, 12.0],
              [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
              [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
              [14.7, 11.5, 12.2, 13.0, 14.3, 15.]])
shift_time(a)
print(a)
[[ 11.2 12.2 14. 15. 11.3 12. ]
[ 12.2 14. 15. 11.3 12. nan]
[ 14. 15. 11.3 12. nan nan]
[ 14.7 11.5 12.2 13. 14.3 15. ]]
It could be changed slightly to take your data directly, but this seems a clear way to show it:
shift_time(yourdata[:,1:]) #Updates in place, no need to return anything.
Using tiago's test:
tmp = np.random.uniform(-10, 20, (10000, 100))
nan_idx = np.random.randint(30, 10000 - 1, 10000)
tmp[nan_idx] = np.nan
import time
t = time.time()
shift_time(tmp)
print(time.time() - t)
0.364198923111 (seconds)
If you are really clever you should be able to get away with a single np.where.
This seems to do the trick:
import numpy as np

def shift_time(dat):
    NX, NY = dat.shape
    for i in range(NY):
        x, y = np.where(np.isnan(dat))
        xr = x - 1
        yr = y + 1
        idx = (xr >= 0) & (yr < NY)
        dat[x[idx], y[idx]] = dat[xr[idx], yr[idx]]
    return
Now with some test data:
In [1]: test_data = array([[ 11.2, 12.2, 14. , 15. , 11.3, 12. ],
[ nan, nan, nan, nan, nan, nan],
[ nan, nan, nan, nan, nan, nan],
[ 14.7, 11.5, 12.2, 13. , 14.3, 15.1],
[ nan, nan, nan, nan, nan, nan],
[ 15.7, 16.5, 17.2, 18. , 14. , 12. ]])
In [2]: shift_time(test_data)
In [3]: test_data
Out[3]:
array([[ 11.2, 12.2, 14. , 15. , 11.3, 12. ],
[ 12.2, 14. , 15. , 11.3, 12. , nan],
[ 14. , 15. , 11.3, 12. , nan, nan],
[ 14.7, 11.5, 12.2, 13. , 14.3, 15.1],
[ 11.5, 12.2, 13. , 14.3, 15.1, nan],
[ 15.7, 16.5, 17.2, 18. , 14. , 12. ]])
And testing with a (1e4, 1e2) array:
In [1]: tmp = np.random.uniform(-10, 20, (10000, 100))
In [2]: nan_idx = np.random.randint(30, 10000 - 1, 10000)
In [3]: tmp[nan_idx] = np.nan
In [4]: %time shift_time(tmp)
CPU times: user 1.53 s, sys: 0.06 s, total: 1.59 s
Wall time: 1.59 s
Each iteration of this pad/roll/roll combo essentially does what you are looking for:
import numpy as np
from numpy import nan as nan
# Startup array
A = np.array([[11.2, 12.2, 14.0, 15.0, 11.3, 12.0],
              [nan, nan, nan, nan, nan, nan],
              [nan, nan, nan, nan, nan, nan],
              [14.7, 11.5, 12.2, 13.0, 14.3, 15.1]])
def pad_nan(v, pad_width, iaxis, kwargs):
    # Pad both ends of each vector with NaN (numpy's callable-pad interface)
    v[:pad_width[0]] = nan
    v[-pad_width[1]:] = nan
    return v

def roll_data(A):
    # Fill NaNs from the row above, shifted one column to the left
    idx = np.isnan(A)
    A[idx] = np.roll(np.roll(np.pad(A, 1, pad_nan), 1, 0), -1, 1)[1:-1, 1:-1][idx]
    return A
print(A)
print(roll_data(A))
print(roll_data(A))
The output gives:
[[ 11.2 12.2 14. 15. 11.3 12. ]
[ nan nan nan nan nan nan]
[ nan nan nan nan nan nan]
[ 14.7 11.5 12.2 13. 14.3 15.1]]
[[ 11.2 12.2 14. 15. 11.3 12. ]
[ 12.2 14. 15. 11.3 12. nan]
[ nan nan nan nan nan nan]
[ 14.7 11.5 12.2 13. 14.3 15.1]]
[[ 11.2 12.2 14. 15. 11.3 12. ]
[ 12.2 14. 15. 11.3 12. nan]
[ 14. 15. 11.3 12. nan nan]
[ 14.7 11.5 12.2 13. 14.3 15.1]]
Everything is pure numpy, so each iteration should be extremely fast. However, I'm not sure of the cost of creating a padded array and running the multiple iterations; if you try it, let me know the results!
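If you do try it: a minimal driver that applies roll_data until the NaN count stops shrinking (a sketch; fill_all is just an illustrative name, and it reuses roll_data from above):
import numpy as np

def fill_all(A):
    # Repeat the pad/roll/roll fill until no further NaNs can be resolved
    prev = -1
    while True:
        n_nan = int(np.isnan(A).sum())
        if n_nan == prev:
            return A
        prev = n_nan
        roll_data(A)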
I have one array which contains continuous values. I need to round those values up to the closest half. For example: 32.25 to 32.50, 30.29 to 30.50, 33.75 to 34.00. In short: if the fractional part is from .01 to .49, round up to .50, and if it is from .51 to .99, round up to .00. How can I do it? Thank you in advance.
array([32.5 , 32.49, 32.48, 32.47, 32.46, 32.45, 32.44, 32.43, 32.42,
32.41, 32.4 , 32.39, 32.38, 32.37, 32.36, 32.35, 32.34, 32.33,
32.32, 32.31, 32.3 , 32.29, 32.28, 32.27, 32.26, 32.25, 15.75,
15.76, 15.77, 15.78, 15.79, 15.8 , 15.81, 15.82, 15.83, 15.84,
15.85, 15.86, 15.87, 15.88, 15.89, 15.9 , 15.91, 15.92, 15.93,
15.94, 15.95, 15.96, 15.97, 15.98, 15.99, 16. , 16.01, 16.02,
16.03, 16.04, 16.05, 16.06, 16.07, 16.08, 16.09, 16.1 , 16.11,
16.12, 16.13, 16.14, 16.15, 16.16, 16.17, 16.18, 16.19, 16.2 ,
16.21, 16.22, 16.23, 16.24, 16.25, 16.26, 16.27, 16.28, 16.29,
16.3 , 16.31, 16.32, 16.33, 16.34, 16.35, 16.36, 16.37, 16.38,
16.39, 16.4 , 16.41, 16.42, 16.43, 16.44, 16.45, 16.46, 16.47,
16.48, 16.49, 16.5 , 25.25, 25.5 , 25.51, 25.52, 25.53, 25.54,
25.55, 25.56, 25.57, 25.58, 25.59, 25.6 , 25.61, 25.62, 25.63,
25.64, 25.65, 25.66, 25.67, 25.68, 25.69, 25.7 , 25.71, 25.72,
25.73, 25.74, 26. , 26.01, 26.02, 26.03, 26.04, 26.05, 26.06,
26.07, 26.08, 26.09, 26.1 , 26.11, 26.12, 26.13, 26.14, 26.15,
26.16, 26.17, 26.18, 26.19, 26.2 , 26.21, 26.22, 26.23, 26.24,
26.25, 26.26, 26.27, 26.28, 26.29, 26.3 , 26.31, 26.32, 26.5 ,
26.49, 26.48, 26.47, 26.46, 26.45, 26.44, 26.43, 26.42, 26.41,
26.4 , 26.39, 26.38, 26.37, 26.36, 26.35, 26.34, 26.33, 28.5 ,
28.51, 28.52, 28.53, 28.54, 28.55, 28.56, 28.57, 28.58, 28.59,
28.6 , 28.61, 28.62, 28.63, 28.64, 28.65, 28.66, 30.5 , 30.49,
30.48, 30.47, 30.46, 30.45, 30.44, 30.43, 30.42, 30.41, 30.4 ,
30.39, 30.38, 30.37, 30.36, 30.35, 30.34, 30.33, 30.32, 30.31,
30.3 , 30.29, 30.28, 30.27, 30.26, 30.25])
Did you not experiment with this? numpy is built for experimentation.
array = (array * 2 + 0.4999).round() / 2
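A quick sanity check of the one-liner against the question's own examples (not from the original answer):
import numpy as np

vals = np.array([32.25, 30.29, 33.75, 32.50, 16.00])
print((vals * 2 + 0.4999).round() / 2)   # -> [32.5 30.5 34.  32.5 16. ]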
Another solution:
import math
[math.modf(item)[1] + 0.5 if 0 < (item % 1) <= 0.5 else float(math.ceil(item)) for item in array]
Rounding up to the nearest half can also be done in a single numpy call:
round_off_values = np.ceil(array * 2) / 2
I am looping through a bunch of files and importing their contents as numpy arrays:
# get the dates for our gaps
import os.path
import glob
from pathlib import Path
from numpy import recfromcsv
folder = "daily_bars_filtered/*.csv"
df_gapper_list = []
df_intraday_analysis = []
# loop through the daily gappers
for fname in glob.glob(folder)[0:2]:
ticker = Path(fname).stem
daily_bars_arr = recfromcsv(fname, delimiter=',')
print(ticker)
print(daily_bars_arr)
Output:
AACG
[(b'2021-07-15', 43796169., 2.98, 3.83, 4.75, 2.9401, 2.98, 59.39597315)
(b'2022-01-04', 14934689., 1.25, 2.55, 2.59, 1.25 , 1.19, 117.64705882)
(b'2022-01-05', 8067429., 1.8 , 2.3 , 2.64, 1.72 , 2.55, 3.52941176)
(b'2022-01-07', 9718034., 1.93, 2.64, 2.94, 1.85 , 1.98, 48.48484848)]
AAL
[(b'2022-03-04', 76218689., 15.27 , 14.59, 15.4799, 14.42 , 15.71, 1.46467218)
(b'2022-03-07', 89360330., 14.32 , 12.84, 14.62 , 12.77 , 14.59, 0.20562029)
(b'2022-03-08', 88067102., 13.035, 13.51, 14.27 , 12.4401, 12.84, 11.13707165)
(b'2022-03-09', 88884229., 14.44 , 14.3 , 14.75 , 14.05 , 13.51, 9.17838638)
(b'2022-03-10', 56463182., 13.82 , 14.2 , 14.44 , 13.46 , 14.3 , 0.97902098)
(b'2022-03-11', 48342029., 14.4 , 14.02, 14.56 , 13.9 , 14.2 , 2.53521127)
(b'2022-03-14', 53284254., 14.04 , 14.25, 14.83 , 13.7 , 14.02, 5.77746077)]
What I then try to do is target the first column where my dates are, by doing:
print(daily_bars_arr[:,[0]])
But then I get the following error:
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
What am I doing wrong?
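A likely cause: recfromcsv returns a one-dimensional structured array, with one element per CSV row and named fields instead of a second axis, so 2-D indexing raises this error. A sketch of selecting the first field without assuming its name:
# The array is 1-D; fields replace columns, so select by field name
first_field = daily_bars_arr.dtype.names[0]
dates = daily_bars_arr[first_field]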
Based on a condition, I want to change the value of the first row in a certain column; so far this is what I have:
despesas['recibos'] = ''
for a in recibos['recibos']:
    if len(despesas.loc[(despesas['despesas'] == a) & (despesas['recibos'] == ''), 'recibos']) > 0:
        despesas.loc[(despesas['despesas'] == a) & (despesas['recibos'] == ''),
                     'recibos'].iloc[0] = a
So I want to change only the first value of the column recibos to the value of a where (despesas['despesas'] == a) & (despesas['recibos'] == '').
Edit 1
Example:
despesas['despesas'] = [11.95, 2.5, 1.2 , 0.6 , 2.66, 2.66, 3. , 47.5 , 16.95,17.56]
recibos['recibos'] = [11.95, 1.2 , 1.2 , 0.2 , 2.66, 2.66, 3. , 47.5 , 16.95, 17.56]
And the result should be:
[[11.95, 11.95], [2.5, null], [1.2, 1.2], [0.6, null], [2.66, 2.66], [2.66, 2.66], [3., 3.], [47.5, 47.5], [16.95, 16.95], [17.56, 17.56]]
This could work:
mapper = recibos['recibos'].map(despesas['despesas'].value_counts()).fillna(0)
despesas['recibos'] = recibos['recibos'].where(recibos.groupby('recibos')
                                                      .cumcount()
                                                      .lt(mapper), 'null')
print(despesas)
despesas recibos
0 11.95 11.95
1 2.50 1.2
2 1.20 null
3 0.60 null
4 2.66 2.66
5 2.66 2.66
6 3.00 3
7 47.50 47.5
8 16.95 16.95
9 17.56 17.56
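To reproduce this, a minimal setup matching the question's lists (assuming both despesas and recibos are single-column DataFrames):
import pandas as pd

despesas = pd.DataFrame({'despesas': [11.95, 2.5, 1.2, 0.6, 2.66, 2.66, 3.0, 47.5, 16.95, 17.56]})
recibos = pd.DataFrame({'recibos': [11.95, 1.2, 1.2, 0.2, 2.66, 2.66, 3.0, 47.5, 16.95, 17.56]})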
I found the solution that I was looking for:
despesas['recibos'] = ''
for index, a in despesas.iterrows():
    if len(recibos.loc[recibos['recibos'] == a['despesas']]) > 0:
        despesas.iloc[index, 1] = True
        recibos.drop(recibos.loc[recibos['recibos'] == a['despesas']][:1].index, inplace=True)
I have merged ten txt files (A_1, A_2, ..., A_10 and B_1, B_2, ..., B_10) horizontally and got the output as A_B_1, A_B_2, ..., A_B_10. The issue is that the A files have a large, fixed number of rows (4320), while the B files have a smaller, fluctuating number of rows (2689, 3078, ...). So whenever I try to load a merged file using numpy, I get a wrong-number-of-columns error starting at the first line where the B file has no values. Any suggestion on how to solve this issue would be appreciated.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
data=np.loadtxt('/Users/Hrihaan/Desktop/Code/A_B_5.txt')
time=data[:,1]
V=data[:,3]
plt.plot(time,V)
Suppose you have a file named "A_B_5.txt".
The contents are:
3044 1995 9.0 3.8 3044 1995 9.0 3.8
3044 1995 9.0 3.8 3044 1995 9.0 3.8
3044 1995 9.0 3.8 3044 1995 9.0 3.8
3044 1995 9.0 3.8
3044 1995 9.0 3.8
3044 1995 9.0 3.8
You can use read_table from pandas:
import pandas as pd
data = pd.read_table("A_B_5.txt", sep=r"\s+", header=None).values
You'll get:
array([[ 3044. , 1995. , 9. , 3.8, 3044. , 1995. , 9. , 3.8],
[ 3044. , 1995. , 9. , 3.8, 3044. , 1995. , 9. , 3.8],
[ 3044. , 1995. , 9. , 3.8, 3044. , 1995. , 9. , 3.8],
[ 3044. , 1995. , 9. , 3.8, nan, nan, nan, nan],
[ 3044. , 1995. , 9. , 3.8, nan, nan, nan, nan],
[ 3044. , 1995. , 9. , 3.8, nan, nan, nan, nan]])
To read the whole list of files A_B_i.txt for i in 1, 2, ..., 10:
data = [pd.read_table("A_B_" + str(i) + ".txt", sep=r"\s+", header=None).values
        for i in range(1, 11)]
Then access each array as data[0], data[1], etc.
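From there, pulling the columns the question plots is plain 2-D indexing (a sketch; index 4 corresponds to A_B_5.txt):
time_col = data[4][:, 1]   # column 1 of A_B_5.txt
V = data[4][:, 3]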
For example, if I have one list containing data, whose items should be selected one by one,
a = [0.11 , 0.22 , 0.13, 6.7, 2.5, 2.8]
and the other one for which all items should be selected
b = [1.2, 1.4, 2.6, 2.3, 5.7, 9.9]
If I select 0.11 from a and do an operation like addition with all the items of b, and then save the result in a new array or list, how would that be possible with Python? ...
I am sorry for the question, as I am trying to learn Python on my own; kindly tell me how this is possible.
Thank you in advance.
You need a nested loop. You can do it in a list comprehension to produce a list of lists:
[[item_a + item_b for item_b in b] for item_a in a]
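For instance, with the two lists from the question (a quick check; printed values may differ in the last decimal places due to float rounding):
a = [0.11, 0.22, 0.13, 6.7, 2.5, 2.8]
b = [1.2, 1.4, 2.6, 2.3, 5.7, 9.9]

result = [[item_a + item_b for item_b in b] for item_a in a]
print(result[0])   # ~[1.31, 1.51, 2.71, 2.41, 5.81, 10.01]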
If you want the end result to be a list of lists it could go like this:
c = [[x + y for x in b] for y in a]
If you want the end result to be a single list with next sublists appended to each other you could write as such:
c = []
for y in a:
    c += [y + x for x in b]
Another option is to convert your lists into numpy arrays and then exploit the broadcasting property of numpy arrays:
import numpy as np
npA = np.array(a)
npB = np.array(b)
npA[:, None] + npB
array([[ 1.31, 1.51, 2.71, 2.41, 5.81, 10.01],
[ 1.42, 1.62, 2.82, 2.52, 5.92, 10.12],
[ 1.33, 1.53, 2.73, 2.43, 5.83, 10.03],
[ 7.9 , 8.1 , 9.3 , 9. , 12.4 , 16.6 ],
[ 3.7 , 3.9 , 5.1 , 4.8 , 8.2 , 12.4 ],
[ 4. , 4.2 , 5.4 , 5.1 , 8.5 , 12.7 ]])
You can also do element-wise multiplication simply with:
npA[:, None] * npB
which returns:
array([[ 0.132, 0.154, 0.286, 0.253, 0.627, 1.089],
[ 0.264, 0.308, 0.572, 0.506, 1.254, 2.178],
[ 0.156, 0.182, 0.338, 0.299, 0.741, 1.287],
[ 8.04 , 9.38 , 17.42 , 15.41 , 38.19 , 66.33 ],
[ 3. , 3.5 , 6.5 , 5.75 , 14.25 , 24.75 ],
[ 3.36 , 3.92 , 7.28 , 6.44 , 15.96 , 27.72 ]])
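An equivalent spelling that skips the manual conversion is np.add.outer / np.multiply.outer, which accept the plain lists directly:
import numpy as np

a = [0.11, 0.22, 0.13, 6.7, 2.5, 2.8]
b = [1.2, 1.4, 2.6, 2.3, 5.7, 9.9]

print(np.add.outer(a, b))        # same as npA[:, None] + npB
print(np.multiply.outer(a, b))   # same as npA[:, None] * npB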