Missing column values - python

I have merged ten pairs of txt files (A_1, A_2, ..., A_10 with B_1, B_2, ..., B_10) horizontally and got output files A_B_1, A_B_2, ..., A_B_10. The issue is that the A files have a large, fixed number of rows (4320), while the B files have fewer rows and the count fluctuates (2689, 3078, ...). So whenever I try to load a merged file using numpy, I get a wrong-number-of-columns error starting from the first line where the B file has no values. Any suggestion on how to solve this issue would be appreciated.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
data=np.loadtxt('/Users/Hrihaan/Desktop/Code/A_B_5.txt')
time=data[:,1]
V=data[:,3]
plt.plot(time,V)

Suppose you have a file named "A_B_5.txt".
The contents are:
3044 1995 9.0 3.8 3044 1995 9.0 3.8
3044 1995 9.0 3.8 3044 1995 9.0 3.8
3044 1995 9.0 3.8 3044 1995 9.0 3.8
3044 1995 9.0 3.8
3044 1995 9.0 3.8
3044 1995 9.0 3.8
You can use read_table from pandas:
import pandas as pd
data = pd.read_table("A_B_5.txt", sep=r"\s+", header=None).values
You'll get:
array([[ 3044. , 1995. , 9. , 3.8, 3044. , 1995. , 9. , 3.8],
[ 3044. , 1995. , 9. , 3.8, 3044. , 1995. , 9. , 3.8],
[ 3044. , 1995. , 9. , 3.8, 3044. , 1995. , 9. , 3.8],
[ 3044. , 1995. , 9. , 3.8, nan, nan, nan, nan],
[ 3044. , 1995. , 9. , 3.8, nan, nan, nan, nan],
[ 3044. , 1995. , 9. , 3.8, nan, nan, nan, nan]])
======
Read a list of files A_B_i.txt for i in 1,2,... 10:
data = [pd.read_table("A_B_" + str(i) + ".txt", sep=r"\s+", header=None).values
        for i in range(1, 11)]
And access each array as data[0], data[1], etc.
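To see the NaN padding without touching files on disk, here is a self-contained sketch of the same approach run on an in-memory copy of the sample file above:

```python
import io
import pandas as pd

# Toy merged file whose last rows are missing the B columns
# (mirrors the A_B_5.txt sample above).
toy = io.StringIO(
    "3044 1995 9.0 3.8 3044 1995 9.0 3.8\n" * 3
    + "3044 1995 9.0 3.8\n" * 3
)

# pandas pads the short rows with NaN instead of raising an error
data = pd.read_table(toy, sep=r"\s+", header=None).values
print(data.shape)  # (6, 8)
```

The short rows come back with NaN in the B columns, so the array is rectangular and indexing like data[:, 3] works on every row.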

Related

array is 1-dimensional, but 2 were indexed when using numpy and recfromcsv

I am looping through a bunch of files and importing their contents as numpy arrays:
# get the dates for our gaps
import os.path
import glob
from pathlib import Path
from numpy import recfromcsv

folder = "daily_bars_filtered/*.csv"
df_gapper_list = []
df_intraday_analysis = []

# loop through the daily gappers
for fname in glob.glob(folder)[0:2]:
    ticker = Path(fname).stem
    daily_bars_arr = recfromcsv(fname, delimiter=',')
    print(ticker)
    print(daily_bars_arr)
Output:
AACG
[(b'2021-07-15', 43796169., 2.98, 3.83, 4.75, 2.9401, 2.98, 59.39597315)
(b'2022-01-04', 14934689., 1.25, 2.55, 2.59, 1.25 , 1.19, 117.64705882)
(b'2022-01-05', 8067429., 1.8 , 2.3 , 2.64, 1.72 , 2.55, 3.52941176)
(b'2022-01-07', 9718034., 1.93, 2.64, 2.94, 1.85 , 1.98, 48.48484848)]
AAL
[(b'2022-03-04', 76218689., 15.27 , 14.59, 15.4799, 14.42 , 15.71, 1.46467218)
(b'2022-03-07', 89360330., 14.32 , 12.84, 14.62 , 12.77 , 14.59, 0.20562029)
(b'2022-03-08', 88067102., 13.035, 13.51, 14.27 , 12.4401, 12.84, 11.13707165)
(b'2022-03-09', 88884229., 14.44 , 14.3 , 14.75 , 14.05 , 13.51, 9.17838638)
(b'2022-03-10', 56463182., 13.82 , 14.2 , 14.44 , 13.46 , 14.3 , 0.97902098)
(b'2022-03-11', 48342029., 14.4 , 14.02, 14.56 , 13.9 , 14.2 , 2.53521127)
(b'2022-03-14', 53284254., 14.04 , 14.25, 14.83 , 13.7 , 14.02, 5.77746077)]
What I then try to do is target the first column where my dates are, by doing:
print(daily_bars_arr[:,[0]])
But then I get the following error:
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
What am I doing wrong?
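For what it's worth: recfromcsv returns a one-dimensional *structured* array (one record per row), not a 2-D array, which is why 2-D indexing like [:, [0]] raises exactly this error; columns are selected by field name instead. A minimal sketch with np.genfromtxt (the reader that recfromcsv wraps), using toy data since the original CSV files aren't available here:

```python
import io
import numpy as np

# toy CSV standing in for one of the daily-bars files
csv = io.StringIO(
    "date,volume,open\n"
    "2021-07-15,43796169,2.98\n"
    "2022-01-04,14934689,1.25\n"
)

# names=True takes field names from the header; dtype=None infers per-column types
arr = np.genfromtxt(csv, delimiter=",", names=True, dtype=None, encoding="utf-8")

print(arr.ndim)     # 1 -- one record per row, not a 2-D table
print(arr["date"])  # select the date column by field name
```

So in the loop above, something like daily_bars_arr["date"] (using whatever the first header field is actually called) should give the dates column.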

boxplot (from seaborn) would not plot as expected

The boxplot does not plot as expected. (The question included images of the actual and intended plots; they are not reproduced here.)
This is the code and data:
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions

scores = []
for ne in range(1, 41):  # ne is the number of trees
    clf = RandomForestClassifier(n_estimators=ne)
    score_list = cross_val_score(clf, X, Y, cv=10)
    scores.append(score_list)

sns.boxplot(scores)  # scores is a list of arrays
plt.xlabel('Number of trees')
plt.ylabel('Classification score')
plt.title('Classification score as a function of the number of trees')
plt.show()
scores =
[array([ 0.8757764 , 0.86335404, 0.75625 , 0.85 , 0.86875 ,
0.81875 , 0.79375 , 0.79245283, 0.8490566 , 0.85534591]),
array([ 0.89440994, 0.8447205 , 0.79375 , 0.85 , 0.8625 ,
0.85625 , 0.86875 , 0.88050314, 0.86792453, 0.8427673 ]),
array([ 0.91304348, 0.9068323 , 0.83125 , 0.84375 , 0.8875 ,
0.875 , 0.825 , 0.83647799, 0.83647799, 0.87421384]),
array([ 0.86956522, 0.86956522, 0.85 , 0.875 , 0.88125 ,
0.86875 , 0.8625 , 0.8490566 , 0.86792453, 0.89308176]),
....]
I would first create pandas DF out of scores:
import pandas as pd
In [15]: scores
Out[15]:
[array([ 0.8757764 , 0.86335404, 0.75625 , 0.85 , 0.86875 , 0.81875 , 0.79375 , 0.79245283, 0.8490566 , 0.85534591]),
array([ 0.89440994, 0.8447205 , 0.79375 , 0.85 , 0.8625 , 0.85625 , 0.86875 , 0.88050314, 0.86792453, 0.8427673 ]),
array([ 0.91304348, 0.9068323 , 0.83125 , 0.84375 , 0.8875 , 0.875 , 0.825 , 0.83647799, 0.83647799, 0.87421384]),
array([ 0.86956522, 0.86956522, 0.85 , 0.875 , 0.88125 , 0.86875 , 0.8625 , 0.8490566 , 0.86792453, 0.89308176])]
In [16]: df = pd.DataFrame(scores)
In [17]: df
Out[17]:
0 1 2 3 4 5 6 7 8 9
0 0.875776 0.863354 0.75625 0.85000 0.86875 0.81875 0.79375 0.792453 0.849057 0.855346
1 0.894410 0.844720 0.79375 0.85000 0.86250 0.85625 0.86875 0.880503 0.867925 0.842767
2 0.913043 0.906832 0.83125 0.84375 0.88750 0.87500 0.82500 0.836478 0.836478 0.874214
3 0.869565 0.869565 0.85000 0.87500 0.88125 0.86875 0.86250 0.849057 0.867925 0.893082
now we can easily plot boxplots; note the transpose, so that each column (and hence each box) corresponds to one n_estimators value rather than to one CV fold:
In [18]: sns.boxplot(data=df.T)
Out[18]: <matplotlib.axes._subplots.AxesSubplot at 0xd121128>

How can I select a single item from one list and do an operation on all items of a second list using Python

For example, say I have one list whose items should be selected one by one:
a = [0.11, 0.22, 0.13, 6.7, 2.5, 2.8]
and another for which all items should be selected:
b = [1.2, 1.4, 2.6, 2.3, 5.7, 9.9]
If I select 0.11 from a and do an operation like addition with all the items of b, and then save the results in a new array or list, how is that possible with Python? I am sorry for the question, as I am trying to learn Python on my own; kindly tell me how this is possible.
Thank you in advance.
You need a nested loop. You can do it in a list comprehension to produce a list of lists:
[[item_a + item_b for item_b in b] for item_a in a]
If you want the end result to be a list of lists it could go like this:
c = [[x + y for x in b] for y in a]
If you want the end result to be a single list with next sublists appended to each other you could write as such:
c = []
for y in a:
    c += [y + x for x in b]
Another option is to convert your list into numpy array and then exploit the broadcasting property of numpy arrays:
import numpy as np
npA = np.array(a)
npB = np.array(b)
npA[:, None] + npB
array([[ 1.31, 1.51, 2.71, 2.41, 5.81, 10.01],
[ 1.42, 1.62, 2.82, 2.52, 5.92, 10.12],
[ 1.33, 1.53, 2.73, 2.43, 5.83, 10.03],
[ 7.9 , 8.1 , 9.3 , 9. , 12.4 , 16.6 ],
[ 3.7 , 3.9 , 5.1 , 4.8 , 8.2 , 12.4 ],
[ 4. , 4.2 , 5.4 , 5.1 , 8.5 , 12.7 ]])
You can also do element wise multiplication simply with:
npA[:, None] * npB
which returns:
array([[ 0.132, 0.154, 0.286, 0.253, 0.627, 1.089],
[ 0.264, 0.308, 0.572, 0.506, 1.254, 2.178],
[ 0.156, 0.182, 0.338, 0.299, 0.741, 1.287],
[ 8.04 , 9.38 , 17.42 , 15.41 , 38.19 , 66.33 ],
[ 3. , 3.5 , 6.5 , 5.75 , 14.25 , 24.75 ],
[ 3.36 , 3.92 , 7.28 , 6.44 , 15.96 , 27.72 ]])
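As a side note, the same pairwise tables can be written with NumPy's outer ufunc methods, which some find more readable than manual broadcasting (a sketch using the lists from the question):

```python
import numpy as np

a = np.array([0.11, 0.22, 0.13, 6.7, 2.5, 2.8])
b = np.array([1.2, 1.4, 2.6, 2.3, 5.7, 9.9])

sums = np.add.outer(a, b)           # same result as a[:, None] + b
products = np.multiply.outer(a, b)  # same result as a[:, None] * b
```

Every binary ufunc (np.add, np.multiply, np.subtract, ...) has an .outer method that applies the operation to all pairs.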

Python: Removing a range of numbers from array list

I'm having issues removing elements in a range a through b from an array list. The solutions I've found online seem to only work for individual elements, adjacent elements, and/or whole numbers. I'm dealing with float numbers.
self.genx = np.arange(0, 5, 0.1)
temp_select = self.genx[1:3] #I want to remove numbers from 1 - 3 from genx
print(temp_select)
self.genx = list(set(self.genx)-set(temp_select))
print(self.genx)
plt.plot(self.genx,self.geny)
However, I get the following in the console. Because I'm dealing with floats rather than whole numbers, the set difference scrambles the order and the values rather than cleanly removing the range, as it appears to with whole numbers:
genx: [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9]
temp_select: [ 0.1 0.2]
genx(after subtracted): [0.0, 0.5, 2.0, 3.0, 4.0, 1.5, 1.0, 1.1000000000000001, 0.70000000000000007, 0.90000000000000002, 2.7000000000000002, 0.30000000000000004, 2.9000000000000004, 1.9000000000000001, 3.3000000000000003, 0.40000000000000002, 4.7000000000000002, 3.4000000000000004, 2.2000000000000002, 2.8000000000000003, 1.4000000000000001, 0.60000000000000009, 3.6000000000000001, 1.3, 1.2000000000000002, 4.2999999999999998, 4.2000000000000002, 4.9000000000000004, 3.9000000000000004, 3.8000000000000003, 2.3000000000000003, 4.8000000000000007, 3.2000000000000002, 1.7000000000000002, 2.5, 3.5, 1.8, 4.1000000000000005, 2.4000000000000004, 4.4000000000000004, 1.6000000000000001, 0.80000000000000004, 2.6000000000000001, 4.6000000000000005, 2.1000000000000001, 3.1000000000000001, 3.7000000000000002, 4.5]
I didn't test this, but you should be able to do something like the following (exclusive bounds):
self.genx = [ item for item in self.genx if not range_min < item < range_max ]
or, with inclusive bounds:
self.genx = [ item for item in self.genx if not range_min <= item <= range_max ]
Is this what you want?
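Since genx is already a NumPy array, a boolean mask keeps the order and avoids the float-identity problems of set() entirely (a sketch, assuming the range to drop is 1 through 3):

```python
import numpy as np

genx = np.arange(0, 5, 0.1)

# keep only the values outside the range [1, 3]; order is preserved
genx = genx[(genx < 1) | (genx > 3)]
```

Comparisons tolerate the tiny float-representation noise that makes set subtraction unreliable here.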

Elegant numpy array shifting and NaN filling?

I have a specific performance problem here. I'm working with meteorological forecast timeseries, which I compile into a numpy 2d array such that
dim0 = time at which forecast series starts
dim1 = the forecast horizon, eg. 0 to 120 hrs
Now, I would like dim0 to have hourly intervals, but some sources yield forecasts only every N hours. As an example, say N=3 and the time step in dim1 is M=1 hour. Then I get something like
12:00 11.2 12.2 14.0 15.0 11.3 12.0
13:00 nan nan nan nan nan nan
14:00 nan nan nan nan nan nan
15:00 14.7 11.5 12.2 13.0 14.3 15.1
But of course there is information at 13:00 and 14:00 as well, since it can be filled in from the 12:00 forecast run. So I would like to end up with something like this:
12:00 11.2 12.2 14.0 15.0 11.3 12.0
13:00 12.2 14.0 15.0 11.3 12.0 nan
14:00 14.0 15.0 11.3 12.0 nan nan
15:00 14.7 11.5 12.2 13.0 14.3 15.1
What is the fastest way to get there, assuming dim0 is in the order of 1e4 and dim1 in the order of 1e2? Right now I'm doing it row by row but that is very slow:
nRows, nCols = self.dat.shape
if N >= M:
    assert N % M == 0  # must have whole numbers
    for i in range(1, nRows):
        k = np.array(np.where(np.isnan(self.dat[i, :])))
        k = k[k < nCols - N]  # do not overstep
        self.dat[i, k] = self.dat[i - 1, k + N]
I'm sure there must be a more elegant way to do this? Any hints would be greatly appreciated.
Behold, the power of boolean indexing!!!
def shift_nans(arr):
    while True:
        nan_mask = np.isnan(arr)
        write_mask = nan_mask[1:, :-1]
        read_mask = nan_mask[:-1, 1:]
        write_mask &= ~read_mask
        if not np.any(write_mask):
            return arr
        arr[1:, :-1][write_mask] = arr[:-1, 1:][write_mask]
I think the naming is self explanatory of what is going on. Getting the slicing right is a pain, but it seems to be working:
In [214]: shift_nans(test_data)
Out[214]:
array([[ 11.2, 12.2, 14. , 15. , 11.3, 12. ],
[ 12.2, 14. , 15. , 11.3, 12. , nan],
[ 14. , 15. , 11.3, 12. , nan, nan],
[ 14.7, 11.5, 12.2, 13. , 14.3, 15.1],
[ 11.5, 12.2, 13. , 14.3, 15.1, nan],
[ 15.7, 16.5, 17.2, 18. , 14. , 12. ]])
And for timings:
tmp = np.random.uniform(-10, 20, (10000, 100))
nan_idx = np.random.randint(30, 10000 - 1, 10000)
tmp[nan_idx] = np.nan
tmp1 = tmp.copy()
import timeit
t1 = timeit.timeit(stmt='shift_nans(tmp)',
                   setup='from __main__ import tmp, shift_nans',
                   number=1)
t2 = timeit.timeit(stmt='shift_time(tmp1)',  # Ophion's code
                   setup='from __main__ import tmp1, shift_time',
                   number=1)
In [242]: t1, t2
Out[242]: (0.12696346416487359, 0.3427293070417363)
Start by slicing your data, using a = yourdata[:, 1:].
def shift_time(dat):
    # find the number of required iterations
    check = np.where(~np.isnan(dat[:, 0]))[0]
    maxiters = np.max(np.diff(check)) - 1
    # no sense in iterations that would just update nans
    cols = dat.shape[1]
    if cols < maxiters:
        maxiters = cols - 1
    for iters in range(maxiters):
        # find nans and pull values from the row above, one column over
        col_loc, row_loc = np.where(np.isnan(dat[:, :-1]))
        dat[(col_loc, row_loc)] = dat[(col_loc - 1, row_loc + 1)]
a = np.array([[11.2, 12.2, 14.0, 15.0, 11.3, 12.0],
              [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
              [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
              [14.7, 11.5, 12.2, 13.0, 14.3, 15.]])
shift_time(a)
print(a)
[[ 11.2 12.2 14. 15. 11.3 12. ]
[ 12.2 14. 15. 11.3 12. nan]
[ 14. 15. 11.3 12. nan nan]
[ 14.7 11.5 12.2 13. 14.3 15. ]]
The function can be changed slightly to take your data as-is, but this seems a clear way to show it:
shift_time(yourdata[:, 1:])  # updates in place, no need to return anything
Using tiago's test:
tmp = np.random.uniform(-10, 20, (10000, 100))
nan_idx = np.random.randint(30, 10000 - 1, 10000)
tmp[nan_idx] = np.nan
t = time.time()
shift_time(tmp)
print(time.time() - t)
0.364198923111 (seconds)
If you are really clever you should be able to get away with a single np.where.
This seems to do the trick:
import numpy as np

def shift_time(dat):
    NX, NY = dat.shape
    for i in range(NY):
        x, y = np.where(np.isnan(dat))
        xr = x - 1
        yr = y + 1
        idx = (xr >= 0) & (yr < NY)
        dat[x[idx], y[idx]] = dat[xr[idx], yr[idx]]
    return
Now with some test data:
In [1]: test_data = np.array([[ 11.2, 12.2, 14. , 15. , 11.3, 12. ],
                              [  nan,  nan,  nan,  nan,  nan,  nan],
                              [  nan,  nan,  nan,  nan,  nan,  nan],
                              [ 14.7, 11.5, 12.2, 13. , 14.3, 15.1],
                              [  nan,  nan,  nan,  nan,  nan,  nan],
                              [ 15.7, 16.5, 17.2, 18. , 14. , 12. ]])
In [2]: shift_time(test_data)
In [3]: print(test_data)
[[ 11.2  12.2  14.   15.   11.3  12. ]
 [ 12.2  14.   15.   11.3  12.    nan]
 [ 14.   15.   11.3  12.    nan   nan]
 [ 14.7  11.5  12.2  13.   14.3  15.1]
 [ 11.5  12.2  13.   14.3  15.1   nan]
 [ 15.7  16.5  17.2  18.   14.   12. ]]
And testing with a (1e4, 1e2) array:
In [1]: tmp = np.random.uniform(-10, 20, (10000, 100))
In [2]: nan_idx = np.random.randint(30, 10000 - 1, 10000)
In [3]: tmp[nan_idx] = np.nan
In [4]: %time shift_time(tmp)
CPU times: user 1.53 s, sys: 0.06 s, total: 1.59 s
Wall time: 1.59 s
Each iteration of this pad,roll,roll combo essentially does what you are looking for:
import numpy as np
from numpy import nan

# starting array
A = np.array([[11.2, 12.2, 14.0, 15.0, 11.3, 12.0],
              [ nan,  nan,  nan,  nan,  nan,  nan],
              [ nan,  nan,  nan,  nan,  nan,  nan],
              [14.7, 11.5, 12.2, 13.0, 14.3, 15.1]])

def pad_nan(v, pad_width, iaxis, kwargs):
    v[:pad_width[0]] = nan
    v[-pad_width[1]:] = nan
    return v

def roll_data(A):
    idx = np.isnan(A)
    A[idx] = np.roll(np.roll(np.pad(A, 1, pad_nan), 1, 0), -1, 1)[1:-1, 1:-1][idx]
    return A

print(A)
print(roll_data(A))
print(roll_data(A))
The output gives:
[[ 11.2 12.2 14. 15. 11.3 12. ]
[ nan nan nan nan nan nan]
[ nan nan nan nan nan nan]
[ 14.7 11.5 12.2 13. 14.3 15.1]]
[[ 11.2 12.2 14. 15. 11.3 12. ]
[ 12.2 14. 15. 11.3 12. nan]
[ nan nan nan nan nan nan]
[ 14.7 11.5 12.2 13. 14.3 15.1]]
[[ 11.2 12.2 14. 15. 11.3 12. ]
[ 12.2 14. 15. 11.3 12. nan]
[ 14. 15. 11.3 12. nan nan]
[ 14.7 11.5 12.2 13. 14.3 15.1]]
Everything is pure numpy, so each iteration should be extremely fast. However, I'm not sure of the cost of creating the padded array and running multiple iterations; if you try it, let me know the results!
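To tie the answers together, here is a self-contained sketch of the boolean-mask fill, checked against the expected table from the question (the function name is mine, not from any of the original answers):

```python
import numpy as np

nan = np.nan

def fill_from_previous_forecast(arr):
    """Repeatedly copy arr[i-1, j+1] into arr[i, j] wherever arr[i, j] is NaN
    and the up-right neighbour is not, until nothing more can be filled."""
    while True:
        nan_mask = np.isnan(arr)
        write_mask = nan_mask[1:, :-1] & ~nan_mask[:-1, 1:]
        if not write_mask.any():
            return arr
        arr[1:, :-1][write_mask] = arr[:-1, 1:][write_mask]

data = np.array([[11.2, 12.2, 14.0, 15.0, 11.3, 12.0],
                 [ nan,  nan,  nan,  nan,  nan,  nan],
                 [ nan,  nan,  nan,  nan,  nan,  nan],
                 [14.7, 11.5, 12.2, 13.0, 14.3, 15.1]])

# the target output from the question: each missing row is the previous
# row shifted one forecast step to the left, padded with NaN on the right
expected = np.array([[11.2, 12.2, 14.0, 15.0, 11.3, 12.0],
                     [12.2, 14.0, 15.0, 11.3, 12.0,  nan],
                     [14.0, 15.0, 11.3, 12.0,  nan,  nan],
                     [14.7, 11.5, 12.2, 13.0, 14.3, 15.1]])

filled = fill_from_previous_forecast(data)
```

The loop runs once per consecutive gap row (here twice), so for N-hourly forecasts it does at most N-1 vectorized passes regardless of the number of rows.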
