Merging numpy ndarray from CSVs - python

I have the following code:
from numpy import genfromtxt
import os

nysedatafile = os.getcwd() + '/nyse.txt'
nysedata = genfromtxt(nysedatafile, delimiter='\t', names=True, dtype=None)
nasdaqdatafile = os.getcwd() + '/nasdaq.txt'
nasdaqdata = genfromtxt(nasdaqdatafile, delimiter='\t', names=True, dtype=None)
Now I would like to merge the data from the 2 CSVs and I tried various functions:
For example:
import numpy as np

alldata = np.array(np.concatenate((nysedata, nasdaqdata)))
print('NYSE stocks:' + str(nysedata.shape[0]))
print('NASDAQ stocks:' + str(nasdaqdata.shape[0]))
print('ALL stocks:' + str(alldata.shape[0]))
returns:
TypeError: invalid type promotion
I also tried numpy.vstack, and calling np.array on the result.
I expect the last print to give the sum of the row counts of the two CSV files.
EDIT:
These commands:
print('NYSE shape:' + str(nysedata.shape))
print('NASDAQ shape:' + str(nasdaqdata.shape))
print('NYSE dtype:' + str(nysedata.dtype))
print('NASDAQ dtype:' + str(nasdaqdata.dtype))
return:
NYSE shape:(3257,)
NASDAQ shape:(2719,)
NYSE dtype:[('Symbol', 'S14'), ('Name', 'S62'), ('LastSale', 'S9'), ('MarketCap', '<f8'), ('ADR_TSO', 'S3'), ('IPOyear', 'S4'), ('Sector', 'S21'), ('industry', 'S62'), ('Summary_Quote', 'S38')]
NASDAQ dtype:[('Symbol', 'S14'), ('Name', 'S62'), ('LastSale', 'S7'), ('MarketCap', '<f8'), ('ADR_TSO', 'S3'), ('IPOyear', 'S4'), ('Sector', 'S21'), ('industry', 'S62'), ('Summary_Quote', 'S34')]

The reason np.vstack (or np.concatenate) raises an error is that the dtypes of the two arrays do not match.
Notice the very last field: ('Summary_Quote', 'S38') versus ('Summary_Quote', 'S34'). nysedata's Summary_Quote column is 38 bytes long, while nasdaqdata's column is only 34 bytes long.
(Edit: The LastSale column suffers a similar problem.)
This happened because genfromtxt guesses the dtype of each column when dtype=None is passed. For string columns, genfromtxt determines the minimum number of bytes needed to contain all the strings in that column.
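You can see this width-guessing in action with a toy sketch (made-up inline data; depending on your NumPy version the string kind may be 'U' instead of 'S', but the widths still differ):
from io import StringIO
import numpy as np

# the string column gets a different inferred width in each file,
# because genfromtxt sizes it to the longest value it sees
a = np.genfromtxt(StringIO("Symbol\tMarketCap\nAAPL\t1.0"), delimiter='\t', names=True, dtype=None)
b = np.genfromtxt(StringIO("Symbol\tMarketCap\nGE\t2.0"), delimiter='\t', names=True, dtype=None)
print(a.dtype)  # e.g. [('Symbol', 'S4'), ('MarketCap', '<f8')]
print(b.dtype)  # e.g. [('Symbol', 'S2'), ('MarketCap', '<f8')]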
So to stack the two arrays, the smaller one has to be promoted to the larger one's dtype:
import numpy.lib.recfunctions as recfunctions
recfunctions.stack_arrays([nysedata,nasdaqdata.astype(nysedata.dtype)], usemask = False)
(My previous answer used np.vstack. This results in a 2-dimensional array of shape (N,1). recfunctions.stack_arrays returns a 1-dimensional array of shape (N,). Since nysedata and nasdaqdata are 1-dimensional, I think it is better to return a 1-dimensional array too.)
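With the dtypes aligned, a quick sanity check (a sketch based on the shapes printed above) should show the row counts adding up:
import numpy.lib.recfunctions as recfunctions

# promote nasdaqdata to nysedata's (wider) dtype, then stack
alldata = recfunctions.stack_arrays([nysedata, nasdaqdata.astype(nysedata.dtype)], usemask=False)
print('NYSE stocks:' + str(nysedata.shape[0]))      # 3257
print('NASDAQ stocks:' + str(nasdaqdata.shape[0]))  # 2719
print('ALL stocks:' + str(alldata.shape[0]))        # expected: 5976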
Possibly an easier solution would be to concatenate the two csv files first and then call genfromtxt:
import numpy as np
import os
cwd = os.getcwd()
nysedatafile = os.path.join(cwd, 'nyse.txt')
nasdaqdatafile = os.path.join(cwd, 'nasdaq.txt')
alldatafile = os.path.join(cwd, 'all.txt')
with open(nysedatafile) as f1, open(nasdaqdatafile) as f2, open(alldatafile, 'w') as g:
    # copy the NYSE file verbatim, including its header line
    for line in f1:
        g.write(line)
    # skip the NASDAQ header so it does not appear as a data row
    next(f2)
    for line in f2:
        g.write(line)
alldata = np.genfromtxt(alldatafile, delimiter='\t', names=True, dtype=None)

Related

h5py doesn't support NumPy dtype('U') (Unicode) and pandas doesn't support NumPy dtype('O')

I'm trying to create a .h5 file with a dataset that contains the data from a .dat file. First, I approach this using numpy:
import numpy as np
import h5py
filename = 'VAL220408-invparms.dat'
datasetname = 'EM27_104_COCCON_VAL/220408'
dtvec = [float for i in range(149)]  # my data file has 149 columns
dtvec[1] = str
dtvec[2] = str  # I specify the dtype of the second and third columns
dataset = np.genfromtxt(filename,skip_header=0,names=True,dtype=dtvec)
fh5 = h5py.File('my_data.h5', 'w')
fh5.create_dataset(datasetname,data=dataset)
fh5.flush()
fh5.close()
But when running I get the error:
TypeError: No conversion path for dtype: dtype('<U')
If I don't specify the dtype, everything is fine: the dataset is in order and the numerical values are correct, but the second and third columns come out as NaN, and I don't want that.
I found that h5py does not support NumPy's encoding for strings, so I supposed that using a dataframe from pandas would work. My code using pandas is like this:
import numpy as np
import pandas as pd
import h5py
filename = 'VAL220408-invparms.dat'
datasetname = 'EM27_104_COCCON_VAL/220408'
df = pd.read_csv(filename,header=0,sep="\s+")
fh5 = h5py.File('my_data.h5', 'w')
fh5.create_dataset(datasetname,data=df)
fh5.flush()
fh5.close()
But then I get the error:
TypeError: Object dtype dtype('O') has no native HDF5 equivalent
Then I found that pandas has a function that transforms a dataframe into a .h5 file, so instead of using the h5py library I did:
df.to_hdf('my_data.h5','datasetname',format='table',mode='a')
BUT the data is all messed up in many tables inside the .h5 file. 😫
I would really like some help getting the data of the second and third columns as what it really is: str.
I'm using Python 3.8
Thank you very much for reading.
I just figured it out.
In the h5py docs they say to specify the strings as h5py-strings using:
h5py.string_dtype(encoding='utf-8', length=None)
So in my first piece of code I put:
dtvec[1] = h5py.string_dtype(encoding='utf-8', length=None)
dtvec[2] = h5py.string_dtype(encoding='utf-8', length=None)
Hope this is helpful to someone reading this question.
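For reference, the full first snippet with that change might look like this (a sketch, reusing the filenames from the question):
import numpy as np
import h5py

filename = 'VAL220408-invparms.dat'
datasetname = 'EM27_104_COCCON_VAL/220408'

dtvec = [float for i in range(149)]  # 149 columns, as above
dtvec[1] = h5py.string_dtype(encoding='utf-8', length=None)
dtvec[2] = h5py.string_dtype(encoding='utf-8', length=None)

dataset = np.genfromtxt(filename, skip_header=0, names=True, dtype=dtvec)

with h5py.File('my_data.h5', 'w') as fh5:  # the context manager flushes and closes the file
    fh5.create_dataset(datasetname, data=dataset)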
To clarify, this problem is related to handling of NumPy's Unicode string type. HDF5 (and h5py) don't support this type. Details here: h5py: What about NumPy’s U type?
When you define your string fields (columns) as str, you get Unicode values. You can verify with the following:
dtvec = [float for i in range(149)]  # my data file has 149 columns
dtvec[1] = str
dtvec[2] = str  # I specify the dtype of the second and third columns
dataset = np.genfromtxt(filename,names=True,dtype=dtvec)
print(dataset.dtype)
Output will look like this. The <U fields are where you have Unicode values. The Unicode values in fields 'str1' and 'str2' caused your original error.
[('float1', '<f8'), ('str1', '<U'), ('str2', '<U'), ('float2', '<f8').....]
When you modify to use h5py.string_dtype(), h5py knows how to convert the Unicode values to byte strings (which are supported by HDF5 and h5py). Setting length=None allows for variable length strings which are mapped to NumPy objects (arrays of byte strings). Details here: h5py: Variable-length strings
dtvec[1] = h5py.string_dtype(encoding='utf-8', length=None)
dtvec[2] = h5py.string_dtype(encoding='utf-8', length=None)
dataset = np.genfromtxt(filename,names=True,dtype=dtvec)
print(dataset.dtype)
Output will look like this. The O fields are where you have strings (as arrays of byte strings):
[('float1', '<f8'), ('str1', 'O'), ('str2', 'O'), ('float2', '<f8').....]
You can also define fixed length byte strings. (I used 5 because that's the size of my test data.)
dtvec[1] = h5py.string_dtype(encoding='utf-8', length=5)
dtvec[2] = h5py.string_dtype(encoding='utf-8', length=5)
# alternate definition, same result
# dtvec[1] = 'S5'
# dtvec[2] = 'S5'
dataset = np.genfromtxt(filename,names=True,dtype=dtvec)
print(dataset.dtype)
Output will look like this. The S5 fields are where you have byte strings:
[('float1', '<f8'), ('str1', 'S5'), ('str2', 'S5'), ('float2', '<f8').....]
As an aside on np.genfromtxt(): you don't have to define the dtype. If you set dtype=None, the dtype of each column is determined from its contents (individually). This is handy when you don't know the data types in advance. Here is an example for your data:
dataset = np.genfromtxt(filename,names=True,dtype=None)
print(dataset.dtype)
Output will look like this. I did not set the encoding= parameter above, so I get byte string values; np.genfromtxt() issues a VisibleDeprecationWarning in that case. However, you can write this data to HDF5.
[('float1', '<f8'), ('str1', 'S5'), ('str2', 'S5'), ('float2', '<f8').....]

Python/PyTables: Is it possible to have different data types for different columns of an array?

I create an expandable EArray of Nx4 columns. Some columns require the float64 datatype, the others can be managed with int32. Is it possible to vary the data types among the columns? Right now I just use one (float64, below) for all, but it takes huge disk space for (>10 GB) files.
For example, how can I ensure that the elements of columns 1-2 are int32 and those of columns 3-4 are float64?
import tables
f1 = tables.open_file("table.h5", "w")
a = f1.create_earray(f1.root, "dataset_1", atom=tables.Float32Atom(), shape=(0, 4))
Here is a simplistic version of how I am appending using the EArray:
Matrix = np.ones(shape=(10**6, 4))

if counter <= 10**6: # keep appending to Matrix until 10**6 rows
    Matrix[s:s+length, 0:4] = chunk2[left:right] # chunk2 is input np.ndarray
    s += length

# save to disk when rows = 10**6
if counter > 10**6:
    a.append(Matrix[:s])
    del Matrix
    Matrix = np.ones(shape=(10**6, 4))
What are the cons for the following method?
import tables as tb
import numpy as np
filename = 'foo.h5'
f = tb.open_file(filename, mode='w')
int_app = f.create_earray(f.root, "col1", atom=tb.Int32Atom(), shape=(0,2), chunkshape=(3,2))
float_app = f.create_earray(f.root, "col2", atom=tb.Float64Atom(), shape=(0,2), chunkshape=(3,2))
# array containing ints; in reality it will be 10**6 x 2
arr1 = np.array([[1, 1],
                 [2, 2],
                 [3, 3]], dtype=np.int32)
# array containing floats; in reality it will be 10**6 x 2
arr2 = np.array([[1.1, 1.2],
                 [1.1, 1.2],
                 [1.1, 1.2]], dtype=np.float64)
for i in range(3):
    int_app.append(arr1)
    float_app.append(arr2)
f.close()
print('\n*********************************************************')
print("\t\t Reading Now=> ")
print('*********************************************************')
c = tb.open_file('foo.h5', mode='r')
chunks1 = c.root.col1
chunks2 = c.root.col2
chunk1 = chunks1.read()
chunk2 = chunks2.read()
print(chunk1)
print(chunk2)
No and Yes. All PyTables array types (Array, CArray, EArray, VLArray) are for homogeneous datatypes (similar to a NumPy ndarray). If you want to mix datatypes, you need to use a Table. Tables are extendable; they have an .append() method to add rows of data.
The creation process is similar to this answer (only the dtype is different): PyTables create_array fails to save numpy array. You only define the datatypes for a row. You don't define the shape or number of rows. That is implied as you add data to the table. If you already have your data in a NumPy recarray, you can reference it with the description= entry, and the Table will use the dtype for the table and populate with the data. More info here: PyTables Tables Class
Your code would look something like this:
import tables as tb
import numpy as np
table_dt = np.dtype(
    {'names': ['int1', 'int2', 'float1', 'float2'],
     'formats': [int, int, float, float]})
# Create some random data:
i1 = np.random.randint(0,1000, (10**6,) )
i2 = np.random.randint(0,1000, (10**6,) )
f1 = np.random.rand(10**6)
f2 = np.random.rand(10**6)
with tb.File('table.h5', 'w') as h5f:
    a = h5f.create_table('/', 'dataset_1', description=table_dt)

    # Method 1: create an empty recarray 'Matrix', then add data:
    Matrix = np.recarray((10**6,), dtype=table_dt)
    Matrix['int1'] = i1
    Matrix['int2'] = i2
    Matrix['float1'] = f1
    Matrix['float2'] = f2
    # Append Matrix to the table
    a.append(Matrix)

    # Method 2: create recarray 'Matrix' with data in 1 step:
    Matrix = np.rec.fromarrays([i1, i2, f1, f2], dtype=table_dt)
    # Append Matrix to the table
    a.append(Matrix)
You mentioned creating a very large file, but did not say how many rows (obviously way more than 10**6). Here are some additional thoughts based on comments in another thread.
The .create_table() method has an optional parameter: expectedrows=. This parameter is used 'to optimize the HDF5 B-Tree and amount of memory used'. The default value is set in tables/parameters.py (look for EXPECTED_ROWS_TABLE; it's only 10000 in my installation). I highly suggest you set this to a larger value if you are creating 10**6 (or more) rows.
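For instance (a sketch of the create_table call from above; the row count is an assumption for illustration):
a = h5f.create_table('/', 'dataset_1', description=table_dt,
                     expectedrows=10**7)  # assumed row count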
Also, you should consider file compression. There's a trade-off: compression reduces the file size, but will reduce I/O performance (increases access time).
There are a few options:
Enable compression when you create the file (add the filters= parameter when you create the file). Start with tb.Filters(complevel=1); see the sketch below.
Use the HDF Group utility h5repack - run against a HDF5 file to create a new file (useful to go from uncompressed to compressed, or vice-versa).
Use the PyTables utility ptrepack - it works similarly to h5repack and is delivered with PyTables.
I tend to use uncompressed files I work with often for best I/O performance. Then when done, I convert to compressed format for long term archiving.
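A minimal sketch of the first option, reusing the file and table names from the example above:
import tables as tb

# default filters for everything created in this file: light zlib compression
filters = tb.Filters(complevel=1, complib='zlib')
with tb.open_file('table.h5', 'w', filters=filters) as h5f:
    a = h5f.create_table('/', 'dataset_1', description=table_dt)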

unable to do np.savetxt on save newly appended string column to numpy array

I have a numpy array mfcc holding MFCC values, of shape (5911, 20).
I have a list a = [] which holds 5911 labels like apple, cow, dog.
I want to append these labels to the mfcc numpy array.
STEP1
I converted the list of labels to an array:
at = np.array(a)
print (at)
print at.shape
print type(at)
['apple' 'apple' 'apple' ..., 'cow' 'cow' 'cow']
(5912,)
<type 'numpy.ndarray'>
STEP2 I made sure at and mfcc had the same length:
if len(at) > len(mfcc):
    at = at[:-1]
STEP3 Then I stacked them together.
mfcc_with_labels=np.hstack((mfcc_with_labels,at[:,None]))
print mfcc_with_labels.shape
(5911,21)
PROBLEM STEP Now I want to save this mfcc_with_labels to a file, so that I can feed it to a neural network later.
np.savetxt("mfcc_with_labels.txt", mfcc, newline= "\n", delimiter="/t")
and it throws a huge ERROR
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-15-7709c644ca06> in <module>()
1 print mfcc_features_with_times_and_labels.shape
2
----> 3 np.savetxt("mfcc_with_labels.txt", mfcc, newline= "\n", delimiter="/t")
/usr/local/lib/python2.7/dist-packages/numpy/lib/npyio.pyc in savetxt(fname, X, fmt, delimiter, newline, header, footer, comments)
1256 raise TypeError("Mismatch between array dtype ('%s') and "
1257 "format specifier ('%s')"
-> 1258 % (str(X.dtype), format))
1259 if len(footer) > 0:
1260 footer = footer.replace('\n', '\n' + comments)
TypeError: Mismatch between array dtype ('|S32') and format specifier ('%.18e/t%.18e/t%.18e/t%.18e/t%.18e/t%.18e/t%.18e/t%.18e/t%.18e/t%.18e/t%.18e/t%.18e/t%.18e/t%.18e/t%.18e/t%.18e/t%.18e/t%.18e/t%.18e/t%.18e/t%.18e/t%.18e')
I tried specifying fmt='%s' as an option but nothing happens.
I inspected mfcc_with_labels[1], and the stacking/appending did work:
['-498.357912575' '-3.40930872496e-14' '1.55285010312e-14'
'-5.31554105812e-14' '4.81736993039e-15' '-3.17281148841e-14'
'5.24276966145e-15' '-3.58849635039e-14' '3.11248820963e-14'
'-6.31521494552e-15' '1.96551267563e-14' '1.26848188878e-14'
'6.53784651891e-14' '-3.15089835366e-14' '2.84134910594e-14'
'1.03625144071e-13' '-5.52444866686e-14' '-5.04415946628e-14'
'1.9026074286e-14' '3.42584334296e-14' 'apple']
Unable to comprehend why it is not being saved.
I already looked at : numpy beginner: writing an array using numpy.savetxt
numpy savetxt fails when using file handler
How to combine a numpy array and a text column and export to csv
Please guide me how to save this new numpy array properly.
I'm from an R programming background; is there an easy Python equivalent of saving this array as something like an R data frame?
Final goal is to send this into a neural network.
The default fmt for savetxt is %.18e. Try that with a number
In [84]: '%.18e'%12
Out[84]: '1.200000000000000000e+01'
The actual format is that string replicated 21 times (the number of columns) and joined with the delimiter.
But your array has a string dtype, and contains strings (because you appended the labels). That doesn't work with that format.
Your mfcc_with_labels[1]
In [86]: row = np.array(['-5.04415946628e-14', '1.9026074286e-14',
    ...:                  '3.42584334296e-14', 'apple'])
In [87]: row
Out[87]:
array(['-5.04415946628e-14', '1.9026074286e-14', '3.42584334296e-14',
'apple'], dtype='<U18')
'%s' fmt should work; this formatting does:
In [88]: '%s,%s,%s,%s'%tuple(row)
Out[88]: '-5.04415946628e-14,1.9026074286e-14,3.42584334296e-14,apple'
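Putting that together, saving the stacked array with a plain string format should work (a sketch; note that the delimiter should be a real tab, '\t', not '/t'):
np.savetxt('mfcc_with_labels.txt', mfcc_with_labels, fmt='%s', delimiter='\t')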

How to print arrays of multiple data formats with control on precision of floats in Python?

I am trying to print to a file with np.savetxt arrays of multiple formats, as shown below:
import numpy as np
f = open('./multiple_format.dat', 'w')
c1 = np.array(['A', 'B'])
n1 = np.array([1.545446367853, 6.8218467347894])
n2 = np.array([1.546715887182, 2.9718145367852])
np.savetxt(f, np.column_stack([c1, np.around(n1, decimals = 3), np.round(n2, 3)]), fmt='%s', delimiter='\t')
I have already seen two answers at A1 and A2. Some of the answers in those posts require the character width to be specified, and accordingly whitespace is added before the string if it is shorter than that width, as shown below:
import numpy as np
f = open('./multiple_format.dat', 'w')
c1 = np.array(['A', 'B'])
n1 = np.array([1.545446367853, 6.8218467347894])
n2 = np.array([1.546715887182, 2.9718145367852])
A['v1'] = c1
A['v2'] = n1
np.savetxt(f, A, fmt="%10s %10.3f")
I don't want leading space before the string, and I need np.savetxt to print arrays of multiple data formats with control over the precision of the floats. How can this be done in Python?
The core of savetxt is
for row in X:
    try:
        fh.write(asbytes(format % tuple(row) + newline))
It iterates on the rows of your array, and applies
format % tuple(row)
to turn it into a string.
format is constructed from your fmt parameter. In your case, with 2 % items, it uses the fmt as is:
In [95]: "%10s %10.3f"%tuple(np.array([1.545446367853, 6.8218467347894]))
Out[95]: '1.54544636785 6.822'
So when it comes to spacing and precision, you are at the mercy of the standard Python formatting system. I'd suggest playing with that kind of expression directly.
In [96]: "%.2f, %10.3f"%tuple(np.array([1.545446367853, 6.8218467347894]))
Out[96]: '1.55, 6.822'
In [97]: "%.2f, %.10f"%tuple(np.array([1.545446367853, 6.8218467347894]))
Out[97]: '1.55, 6.8218467348'
In [106]: (', '.join(["%.2e"]*2))%tuple([1.545446367853, 6.8218467347894])
Out[106]: '1.55e+00, 6.82e+00'
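Following that suggestion, one way to keep the string column while controlling the float precision is to skip savetxt's single-format machinery and write the formatted rows yourself (a sketch using the c1, n1, n2 arrays defined in the question):
with open('./multiple_format.dat', 'w') as f:
    for s, x, y in zip(c1, n1, n2):
        # one string column, two float columns at 3-decimal precision
        f.write('%s\t%.3f\t%.3f\n' % (s, x, y))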

Custom reading CSV files (Keyword accesible / custom structure)

I am trying to do the following:
I downloaded a csv file containing my banking transactions of the last 180 days.
I want to read in this csv file and then do some plots with the data.
For that I set up a program that reads the csv file and makes the data available through keywords.
E.g. in the csv file there is a column "Buchungstag";
I replace that with the date keyword, etc.
import numpy as np
import matplotlib.pylab as mpl
import csv
class finanz():
    def __init__(self):
        path = "/home/***/"
        self.dataFileName = path + "test.csv"
        self.data_read = open(self.dataFileName, 'r')
        self._columns = {}
        self._columns[0] = ["date", "Buchungstag", "", "S15"]
        self._columns[1] = ["value", "Umsatz", "Euro", "f8"]
        self._ident = {"Buchungstag": "date", "Umsatz in {0}": "value"}
        self.base = 1205.30
        self._readData()

    def _readData(self):
        r = csv.DictReader(self.data_read, delimiter=';')
        dtype = map(lambda x: (self._columns[x][0], self._columns[x][3]), range(len(self._columns)))
        self.data = np.recarray((2), dtype=dtype)
        desiredKeys = map(lambda x: x, self._ident.iterkeys())
        for i, x in enumerate(r):
            for k in desiredKeys:
                if k == "Umsatz in {0}":
                    v = np.float(x[k].replace(",", ".")) + self.base
                else:
                    v = x[k]
                self.data[self._ident[k]][i] = v

    def getAllData(self):
        return self.data.copy()

a = finanz()
b = a.getAllData()
print type(b)
print type(b['value']), type(b['date'])
Sample data
"Buchungstag";"Wertstellung (Valuta)";"Vorgang";"Buchungstext";"Umsatz in {0}";
"02.06.2015";"02.06.2015";"Lastschrift/Belast.";"Auftraggeber: abc";"-3,75";
My first question is: why is type(b['date']) a <class 'numpy.core.records.recarray'> while type(b['value']) is a <type 'numpy.ndarray'>?
And my second question would be how to "save" the date in a format that I can use with matplotlib.
The third and final question is how I can check how many rows the csv file has (for the creation of the empty self.data array).
Thx!
Repeating your array generation without the extra code:
In [230]: dt=np.dtype([('date', 'S15'), ('value', '<f8')])
In [231]: data=np.recarray((2,),dtype=dt)
In [232]: type(data['date'])
Out[232]: numpy.core.records.recarray
In [233]: type(data['value'])
Out[233]: numpy.ndarray
The fact that one field is returned as ndarray, and the other as recarray isn't significant. It's just how the recarray class is setup.
Now we mostly use 'structured arrays', created for example with
data1=np.empty((2,),dtype=dt)
or filled with '0s':
data1 = np.zeros((2,), dtype=dt)
# array([('', 0.0), ('', 0.0)],
#       dtype=[('date', 'S15'), ('value', '<f8')])
With this, both data1['date'] and data1['value'] are ndarray. recarray is the old version, and still compatible, but structured arrays are more consistent in their syntax and behavior. There are lots of SO questions about structured arrays, many produced by np.genfromtxt applied to csv files like yours.
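A quick check of that claim (a small sketch with the same dtype as above):
dt = np.dtype([('date', 'S15'), ('value', '<f8')])
data1 = np.zeros((2,), dtype=dt)
print(type(data1['date']), type(data1['value']))  # both plain numpy.ndarray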
I could combine this idea, plus my comment (about list appends):
def _readData(self):
    r = csv.DictReader(self.data_read, delimiter=';')
    if self._columns[0][1].endswith('tag'):
        self._columns[0][3] = 'datetime64[D]'
    dtype = map(lambda x: (self._columns[x][0], self._columns[x][3]), range(len(self._columns)))
    desiredKeys = map(lambda x: x, self._ident.iterkeys())
    data = []
    for x in r:
        aline = np.zeros((1,), dtype=dtype)
        for k in desiredKeys:
            if k == "Umsatz in {0}":
                v = np.float(x[k].replace(",", ".")) + self.base
            else:
                v = x[k]
                v1 = v.split('.')
                if len(v1) == 3:  # convert date to yyyy-mm-dd format
                    v = '%s-%s-%s' % (v1[2], v1[1], v1[0])
            aline[self._ident[k]] = v
        data.append(aline)
    self.data = np.concatenate(data)
producing a b like:
array([(datetime.date(2015, 6, 2), 1201.55),
       (datetime.date(2015, 6, 2), 1201.55),
       (datetime.date(2015, 6, 2), 1201.55)],
      dtype=[('date', '<M8[D]'), ('value', '<f8')])
I believe genfromtxt collects each row as a tuple, and creates the array at the end. The docs for structured arrays show that they can be constructed from
np.array([(item1, item2), (item3, item4),...], dtype=dtype)
I chose to construct an array for each line, and concatenate them at the end because that required fewer changes to your code.
I also changed that function so it converts the 'tag' column to np.datetime64 dtype. There are a number of SO questions about using that dtype. I believe it can be used in matplotlib, though I don't have experience with that.
