Custom reading CSV files (Keyword accessible / custom structure) - python

I am trying to do the following:
I downloaded a csv file containing my banking transactions of the last 180 days.
I want to read in this csv file and then do some plots with the data.
For that I set up a program that reads the csv file and makes the data available through keywords,
e.g. in the csv file there is a column "Buchungstag";
I replace that with the date keyword etc.
import numpy as np
import matplotlib.pylab as mpl
import csv

class finanz():
    def __init__(self):
        path = "/home/***/"
        self.dataFileName = path + "test.csv"
        self.data_read = open(self.dataFileName, 'r')
        self._columns = {}
        self._columns[0] = ["date", "Buchungstag", "", "S15"]
        self._columns[1] = ["value", "Umsatz", "Euro", "f8"]
        self._ident = {"Buchungstag":"date", "Umsatz in {0}":"value"}
        self.base = 1205.30
        self._readData()

    def _readData(self):
        r = csv.DictReader(self.data_read, delimiter=';')
        dtype = map(lambda x: (self._columns[x][0], self._columns[x][3]), range(len(self._columns)))
        self.data = np.recarray((2), dtype=dtype)
        desiredKeys = map(lambda x: x, self._ident.iterkeys())
        for i, x in enumerate(r):
            for k in desiredKeys:
                if k == "Umsatz in {0}":
                    v = np.float(x[k].replace(",", "."))+self.base
                else:
                    v = x[k]
                self.data[self._ident[k]][i] = v

    def getAllData(self):
        return self.data.copy()

a = finanz()
b = a.getAllData()
print type(b)
print type(b['value']), type(b['date'])
Sample data
"Buchungstag";"Wertstellung (Valuta)";"Vorgang";"Buchungstext";"Umsatz in {0}";
"02.06.2015";"02.06.2015";"Lastschrift/Belast.";"Auftraggeber: abc";"-3,75";
My question now is: why is type(b['date']) a class 'numpy.core.records.recarray', while type(b['value']) is a type 'numpy.ndarray'?
And my second question would be how to "save" the date in a format that I can use with matplotlib?
The third and final question is how I can check how many rows the csv file has (for the creation of the empty self.data array).
Thx!

Repeating your array generation without the extra code:
In [230]: dt=np.dtype([('date', 'S15'), ('value', '<f8')])
In [231]: data=np.recarray((2,),dtype=dt)
In [232]: type(data['date'])
Out[232]: numpy.core.records.recarray
In [233]: type(data['value'])
Out[233]: numpy.ndarray
The fact that one field is returned as ndarray, and the other as recarray, isn't significant. It's just how the recarray class is set up.
Now we mostly use 'structured arrays', created for example with
data1=np.empty((2,),dtype=dt)
or filled with '0s':
data1=np.zeros((2,),dtype=dt)
# array([('', 0.0), ('', 0.0)],
#       dtype=[('date', 'S15'), ('value', '<f8')])
With this, both data1['date'] and data1['value'] are ndarray. recarray is the old version, and still compatible, but structured arrays are more consistent in their syntax and behavior. There are lots of SO questions about structured arrays, many produced by np.genfromtxt applied to csv files like yours.
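For a file like your sample, a genfromtxt call could look roughly like this (an untested sketch of mine, not part of the original answer; the column positions, field names, and the quote/decimal-comma converters are guesses):
import numpy as np

# Sketch: build a structured array straight from the csv with genfromtxt.
# Column 0 is the date, column 4 the "Umsatz in {0}" amount; the converters
# strip the quotes and turn the German decimal comma into a dot.
strip_quotes = lambda s: s.strip('"')
to_float = lambda s: float(s.strip('"').replace(',', '.'))
data1 = np.genfromtxt('test.csv', delimiter=';', skip_header=1,
                      usecols=(0, 4),
                      dtype=[('date', 'U15'), ('value', 'f8')],
                      converters={0: strip_quotes, 4: to_float},
                      encoding='utf-8')
print(data1.dtype)   # both fields come back as plain ndarrays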
I could combine this idea, plus my comment (about list appends):
def _readData(self):
    r = csv.DictReader(self.data_read, delimiter=';')
    if self._columns[0][1].endswith('tag'):
        self._columns[0][3] = 'datetime64[D]'
    dtype = map(lambda x: (self._columns[x][0], self._columns[x][3]), range(len(self._columns)))
    desiredKeys = map(lambda x: x, self._ident.iterkeys())
    data = []
    for x in r:
        aline = np.zeros((1,), dtype=dtype)
        for k in desiredKeys:
            if k == "Umsatz in {0}":
                v = np.float(x[k].replace(",", "."))+self.base
            else:
                v = x[k]
                v1 = v.split('.')
                if len(v1) == 3:  # convert date to yyyy-mm-dd format
                    v = '%s-%s-%s' % (v1[2], v1[1], v1[0])
            aline[self._ident[k]] = v
        data.append(aline)
    self.data = np.concatenate(data)
producing a b like:
array([(datetime.date(2015, 6, 2), 1201.55),
       (datetime.date(2015, 6, 2), 1201.55),
       (datetime.date(2015, 6, 2), 1201.55)],
      dtype=[('date', '<M8[D]'), ('value', '<f8')])
I believe genfromtxt collects each row as a tuple, and creates the array at the end. The docs for structured arrays show that they can be constructed from
np.array([(item1, item2), (item3, item4),...], dtype=dtype)
I chose to construct an array for each line, and concatenate them at the end because that required fewer changes to your code.
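For reference, the all-tuples variant could slot into _readData roughly like this (my illustration; it reuses r, self.base and dtype from the code above):
# Sketch: collect each row as a plain tuple, build the structured array once.
rows = []
for x in r:                              # r is the csv.DictReader from above
    d, m, y = x["Buchungstag"].split('.')
    date = '%s-%s-%s' % (y, m, d)        # yyyy-mm-dd so datetime64 can parse it
    value = float(x["Umsatz in {0}"].replace(',', '.')) + self.base
    rows.append((date, value))
self.data = np.array(rows, dtype=dtype)  # one allocation at the end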
I also changed that function so it converts the 'tag' column to np.datetime64 dtype. There are a number of SO questions about using that dtype. I believe it can be used in matplotlib, though I don't have experience with that.
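For the matplotlib part of the question, datetime64 dates can be plotted more or less directly; a rough sketch (assuming b is the array returned by getAllData() with the datetime64 'date' field):
import matplotlib.pyplot as plt

# Rough sketch: tolist() turns datetime64[D] values into datetime.date objects,
# which matplotlib knows how to place on the x axis.
dates = b['date'].astype('datetime64[D]').tolist()
plt.plot(dates, b['value'], marker='o')
plt.gcf().autofmt_xdate()      # tilt the date labels so they stay readable
plt.ylabel('Umsatz (Euro)')
plt.show()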

Related

Time series storage in HDF5 format

I want to store the results of a time series (sensor data) in an HDF5 file. I cannot seem to assign values to my dataset. Clearly, I am doing something wrong, I am just not sure what…
The code:
from datetime import datetime, timezone
import h5py
import numpy as np

TIME_SERIES_FLOAT = np.dtype([("time", h5py.special_dtype(vlen=str)),
                              ("value", np.float)])
h5 = h5py.File('balh.h5', "w")
dset = h5.create_dataset('data', (1, 2), chunks=True, maxshape=(None, 2), dtype=TIME_SERIES_FLOAT)
dset[0]['time'] = datetime.now(timezone.utc).astimezone().isoformat()
dset[0]['value'] = 0.0
Then the update code resizes the dataset and adds more values. Clearly doing that per value is inefficient:
size = list(dset.shape)
size[0] += 1
dset.resize(tuple(size))
dset[size[0]-1]['time'] = datetime.now(timezone.utc).astimezone().isoformat()
dset[size[0]-1]['value'] = value
A much better method would be to collate some data into an np.array and then add that every so often…
Is this sensible?…
I need more coffee…
The defined type is a tuple containing a string (aka the time) and a float (aka the value), so to add one entry I need:
dset[-1] = (datetime.now(timezone.utc).astimezone().isoformat(), value)
It is actually that simple!
Adding many entries is done this way:
l = [('stamp', x) for x in range(10)]
size = list(dset.shape)
tmp = size[0]
size[0] += len(l)
dset.resize(tuple(size))
for x in range(len(l)):
    dset[tmp+x] = l[x]
Nonetheless, this feels somewhat clunky and sub-optimal…
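The per-element loop can be collapsed into one resize plus one slice write. A rough sketch of the "collate then append" idea, assuming a one-dimensional dataset of the same compound dtype (the chunk contents here are placeholders):
import numpy as np

# Rough sketch: buffer new rows in a structured numpy array, then append them
# to the dataset with a single resize and a single slice assignment.
chunk = np.array([('stamp-%d' % i, float(i)) for i in range(10)],
                 dtype=TIME_SERIES_FLOAT)          # same compound dtype as dset
old = dset.shape[0]
dset.resize(old + len(chunk), axis=0)              # grow along the first axis
dset[old:old + len(chunk)] = chunk                 # single bulk write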

How to keep column names when converting from pandas to numpy

According to this post, I should be able to access the names of columns in an ndarray as a.dtype.names
However, if I convert a pandas DataFrame to an ndarray with df.as_matrix() or df.values, then the dtype.names field is None. Additionally, if I try to assign column names to the ndarray:
X = pd.DataFrame(dict(age=[40., 50., 60.], sys_blood_pressure=[140.,150.,160.]))
print X
print type(X.as_matrix())# <type 'numpy.ndarray'>
print type(X.as_matrix()[0]) # <type 'numpy.ndarray'>
m = X.as_matrix()
m.dtype.names = list(X.columns)
I get
ValueError: there are no fields defined
UPDATE:
I'm particularly interested in the cases where the matrix only needs to hold a single type (it is an ndarray of a specific numeric type), since I'd also like to use cython for optimization. (I suspect numpy records and structured arrays are more difficult to deal with since they're more freely typed.)
Really, I'd just like to maintain the column-name metadata for arrays passed through a deep tree of scikit-learn predictors. Its .fit(X, y) and .predict(X) interface doesn't permit passing additional metadata about the column labels outside of the X and y objects.
Consider a DF as shown below:
X = pd.DataFrame(dict(one=['Strawberry', 'Fields', 'Forever'], two=[1,2,3]))
X
Provide a list of tuples as data input to the structured array:
arr_ip = [tuple(i) for i in X.as_matrix()]
Ordered list of field names:
dtyp = np.dtype(list(zip(X.dtypes.index, X.dtypes)))
Here, X.dtypes.index gives you the column names and X.dtypes their corresponding dtypes, which are zipped into a list of tuples and fed as input to the dtype to be constructed.
arr = np.array(arr_ip, dtype=dtyp)
gives:
arr
# array([('Strawberry', 1), ('Fields', 2), ('Forever', 3)],
# dtype=[('one', 'O'), ('two', '<i8')])
and
arr.dtype.names
# ('one', 'two')
Pandas dataframe also has a handy to_records method. Demo:
X = pd.DataFrame(dict(age=[40., 50., 60.],
                      sys_blood_pressure=[140., 150., 160.]))
m = X.to_records(index=False)
print repr(m)
Returns:
rec.array([(40.0, 140.0), (50.0, 150.0), (60.0, 160.0)],
dtype=[('age', '<f8'), ('sys_blood_pressure', '<f8')])
This is a "record array", which is an ndarray subclass that allows field access using attributes, e.g. m.age in addition to m['age'].
You can pass this to a cython function as a regular float array by constructing a view:
m_float = m.view(float).reshape(m.shape + (-1,))
print repr(m_float)
Which gives:
rec.array([[  40.,  140.],
           [  50.,  150.],
           [  60.,  160.]],
          dtype=float64)
Note that in order for this to work, the original DataFrame must have a float dtype for every column. To make sure, use m = X.astype(float, copy=False).to_records(index=False).
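Putting the two steps together for a mixed-dtype frame (a small sketch of mine; the column values are just placeholders):
import numpy as np
import pandas as pd

# Small sketch: force every column to float before to_records, then expose the
# record array to plain-float consumers (e.g. cython) through a view.
X = pd.DataFrame(dict(age=[40, 50, 60],                 # ints on purpose
                      sys_blood_pressure=[140., 150., 160.]))
m = X.astype(float, copy=False).to_records(index=False)
m_float = m.view(float).reshape(m.shape + (-1,))
print(m.dtype.names)      # ('age', 'sys_blood_pressure')
print(m_float.shape)      # (3, 2)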
Yet more methods of converting a pandas.DataFrame to numpy.array while preserving label/column names
This is mainly for demonstrating how to set dtype/column_dtypes, because sometimes a data source iterator's output will need some pre-normalization.
Method one inserts by column into a zeroed array of predefined height, and is loosely based on a Creating Structured Arrays guide that a bit of web-crawling turned up.
import numpy

def to_tensor(dataframe, columns = [], dtypes = {}):
    # Use all columns from data frame if none were listed when called
    if len(columns) <= 0:
        columns = dataframe.columns
    # Build list of dtypes to use, updating from any `dtypes` passed when called
    dtype_list = []
    for column in columns:
        if column not in dtypes.keys():
            dtype_list.append(dataframe[column].dtype)
        else:
            dtype_list.append(dtypes[column])
    # Build dictionary with lists of column names and formatting in the same order
    dtype_dict = {
        'names': columns,
        'formats': dtype_list
    }
    # Initialize _mostly_ empty numpy array with column names and formatting
    numpy_buffer = numpy.zeros(
        shape = len(dataframe),
        dtype = dtype_dict)
    # Insert values from dataframe columns into numpy fields
    for column in columns:
        numpy_buffer[column] = dataframe[column].to_numpy()
    # Return results of conversion
    return numpy_buffer
Method two is based on user7138814's answer and will likely be more efficient, as it is basically a wrapper for the built-in to_records method available to pandas.DataFrames.
def to_tensor(dataframe, columns = [], dtypes = {}, index = False):
    to_records_kwargs = {'index': index}
    if not columns:  # Default to all `dataframe.columns`
        columns = dataframe.columns
    if dtypes:       # Pull in modifications only for dtypes listed in `columns`
        to_records_kwargs['column_dtypes'] = {}
        for column in dtypes.keys():
            if column in columns:
                to_records_kwargs['column_dtypes'].update({column: dtypes.get(column)})
    return dataframe[columns].to_records(**to_records_kwargs)
With either of the above one could do...
X = pandas.DataFrame(dict(age = [40., 50., 60.], sys_blood_pressure = [140., 150., 160.]))
# Example of overwriting dtype for a column
X_tensor = to_tensor(X, dtypes = {'age': 'int32'})
print("Ages -> {0}".format(X_tensor['age']))
print("SBPs -> {0}".format(X_tensor['sys_blood_pressure']))
... which should output...
Ages -> array([40, 50, 60])
SBPs -> array([140., 150., 160.])
... and a full dump of X_tensor should look like the following.
array([(40, 140.), (50, 150.), (60, 160.)],
dtype=[('age', '<i4'), ('sys_blood_pressure', '<f8')])
Some thoughts
While method two will likely be more efficient than the first, method one (with some modifications) may be more useful for merging two or more pandas.DataFrames into one numpy.array.
Additionally (after swinging back through to review), method two will likely face-plant as it's written, with errors about to_records_kwargs not being a mapping if dtypes is not defined; next time I'm feeling Pythonic I may resolve that with an else condition.
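For the merging case mentioned above, the column-insert approach might look roughly like this (a sketch under the assumption that the frames share the same columns; the helper name and sample frames are hypothetical):
import numpy
import pandas

# Hypothetical sketch: stack DataFrames with identical columns into one
# structured array by inserting column-wise, as in method one above.
def frames_to_tensor(frames, columns=None):
    columns = list(columns or frames[0].columns)
    dtype_dict = {
        'names': columns,
        'formats': [frames[0][c].dtype for c in columns],
    }
    total = sum(len(df) for df in frames)
    buffer = numpy.zeros(shape=total, dtype=dtype_dict)
    offset = 0
    for df in frames:
        for c in columns:
            buffer[c][offset:offset + len(df)] = df[c].to_numpy()
        offset += len(df)
    return buffer

X1 = pandas.DataFrame(dict(age=[40., 50.], sys_blood_pressure=[140., 150.]))
X2 = pandas.DataFrame(dict(age=[60.], sys_blood_pressure=[160.]))
print(frames_to_tensor([X1, X2]))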
Create an example:
import pandas
import numpy
PandasTable = pandas.DataFrame( {
    "AAA": [4, 5, 6, 7],
    "BBB": [10, 20, 30, 40],
    "CCC": [100, 50, -30, -50],
    "DDD": ['asdf1', 'asdf2', 'asdf3', 'asdf4'] } )
Solve the problem noting that we are creating something called a "structured numpy array":
NumpyDtypes = list( PandasTable.dtypes.items() )
NumpyTable = PandasTable.to_numpy(copy=True)
NumpyTableRows = [ tuple(Row) for Row in NumpyTable]
NumpyTableWithHeaders = numpy.array( NumpyTableRows, dtype=NumpyDtypes )
Rewrite the solution in 1 line of code:
NumpyTableWithHeaders2 = numpy.array( [ tuple(Row) for Row in PandasTable.to_numpy(copy=True)], dtype=list( PandasTable.dtypes.items() ) )
Print out the solution results:
print ('NumpyTableWithHeaders', NumpyTableWithHeaders)
print ('NumpyTableWithHeaders.dtype', NumpyTableWithHeaders.dtype)
print ('NumpyTableWithHeaders2', NumpyTableWithHeaders2)
print ('NumpyTableWithHeaders2.dtype', NumpyTableWithHeaders2.dtype)
NumpyTableWithHeaders [(4, 10, 100, 'asdf1') (5, 20, 50, 'asdf2') (6, 30, -30, 'asdf3')
(7, 40, -50, 'asdf4')]
NumpyTableWithHeaders.dtype [('AAA', '<i8'), ('BBB', '<i8'), ('CCC', '<i8'), ('DDD', 'O')]
NumpyTableWithHeaders2 [(4, 10, 100, 'asdf1') (5, 20, 50, 'asdf2') (6, 30, -30, 'asdf3')
(7, 40, -50, 'asdf4')]
NumpyTableWithHeaders2.dtype [('AAA', '<i8'), ('BBB', '<i8'), ('CCC', '<i8'), ('DDD', 'O')]
Documentation I had to read
Adding row/column headers to NumPy arrays
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html
How to keep column names when converting from pandas to numpy
https://numpy.org/doc/stable/user/basics.creation.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html
https://docs.scipy.org/doc/numpy-1.10.1/user/basics.rec.html
Notes and thoughts:
Pandas should add a flag in their 'to_numpy' function which does this.
Recent versions of the NumPy documentation should be updated to cover structured arrays, which behave differently from regular ones.
OK, here is where I'm leaning:
class NDArrayWithColumns(np.ndarray):
    def __new__(cls, obj, columns=None):
        obj = obj.view(cls)
        obj.columns = columns
        return obj

    def __array_finalize__(self, obj):
        if obj is None: return
        self.columns = getattr(obj, 'columns', None)

    @staticmethod
    def from_dataframe(df):
        cols = tuple(df.columns)
        arr = df.as_matrix(cols)
        return NDArrayWithColumns.from_array(arr, cols)

    @staticmethod
    def from_array(array, columns):
        if isinstance(array, NDArrayWithColumns):
            return array
        return NDArrayWithColumns(array, tuple(columns))

    def __str__(self):
        sup = np.ndarray.__str__(self)
        if self.columns:
            header = ", ".join(self.columns)
            header = "# " + header + "\n"
            return header + sup
        return sup
NAN = float("nan")
X = pd.DataFrame(dict(age=[40., NAN, 60.], sys_blood_pressure=[140.,150.,160.]))
arr = NDArrayWithColumns.from_dataframe(X)
print arr
print arr.columns
print arr.dtype
Gives:
# age, sys_blood_pressure
[[  40.  140.]
 [  nan  150.]
 [  60.  160.]]
('age', 'sys_blood_pressure')
float64
and can also be passed to a typed cython function expecting an ndarray[2, double_t].
UPDATE: this works pretty well except for some oddness when passing the type to ufuncs.
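For what it's worth, a quick sketch of mine (not from the original post) showing where the columns attribute survives; __array_finalize__ copies it onto slices, views and ufunc results, which is also where the ufunc oddness would show up:
# Quick sketch: columns survives slicing and ufuncs because __array_finalize__
# copies it onto every new instance derived from arr.
row = arr[0:2]
print(row.columns)             # ('age', 'sys_blood_pressure')
doubled = np.multiply(arr, 2)  # result is still NDArrayWithColumns; for odd
print(doubled.columns)         # cases __array_wrap__ may need overriding too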

How to print arrays of multiple data formats with control on precision of floats in Python?

I am trying to print to file with np.savetxt arrays of multiple formats as shown below:
import numpy as np
f = open('./multiple_format.dat', 'w')
c1 = np.array(['A', 'B'])
n1 = np.array([1.545446367853, 6.8218467347894])
n2 = np.array([1.546715887182, 2.9718145367852])
np.savetxt(f, np.column_stack([c1, np.around(n1, decimals = 3), np.round(n2, 3)]), fmt='%s', delimiter='\t')
I have already seen two answers at A1 and A2. Some of the answers in those posts require the character width to be specified, and accordingly whitespace is added before the string if it is shorter than this width, as shown below:
import numpy as np
f = open('./multiple_format.dat', 'w')
c1 = np.array(['A', 'B'])
n1 = np.array([1.545446367853, 6.8218467347894])
n2 = np.array([1.546715887182, 2.9718145367852])
# A structured array holding both columns is needed here (its exact definition
# was not shown in the original snippet), e.g.:
A = np.zeros(c1.size, dtype=[('v1', 'S10'), ('v2', 'f8')])
A['v1'] = c1
A['v2'] = n1
np.savetxt(f, A, fmt="%10s %10.3f")
I don't need leading space before the string, and I need np.savetxt to print arrays of multiple data formats with control over the precision of the floats. How can this be done in Python?
The core of savetxt is
for row in X:
    try:
        fh.write(asbytes(format % tuple(row) + newline))
It iterates on the rows of your array, and applies
format % tuple(row)
to turn it into a string.
format is constructed from your fmt parameter. In your case, with 2 % items, it uses the fmt as is:
In [95]: "%10s %10.3f"%tuple(np.array([1.545446367853, 6.8218467347894]))
Out[95]: '1.54544636785 6.822'
So when it comes to spacing and precision, you are at the mercy of the standard Python formatting system. I'd suggest playing with that kind of expression directly.
In [96]: "%.2f, %10.3f"%tuple(np.array([1.545446367853, 6.8218467347894]))
Out[96]: '1.55, 6.822'
In [97]: "%.2f, %.10f"%tuple(np.array([1.545446367853, 6.8218467347894]))
Out[97]: '1.55, 6.8218467348'
In [106]: (', '.join(["%.2e"]*2))%tuple([1.545446367853, 6.8218467347894])
Out[106]: '1.55e+00, 6.82e+00'
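Applying that directly to your mixed string/float data, one workable sketch (my variant, not a savetxt feature) is to format each row yourself:
import numpy as np

# Sketch: bypass savetxt's single-format limitation by writing each row with
# an explicit format string (string column first, then two floats).
c1 = np.array(['A', 'B'])
n1 = np.array([1.545446367853, 6.8218467347894])
n2 = np.array([1.546715887182, 2.9718145367852])
with open('./multiple_format.dat', 'w') as f:
    for row in zip(c1, n1, n2):
        f.write('%s\t%.3f\t%.3f\n' % row)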

reading file into numpy array python

For an assignment I need to read a file and write it into a numpy array;
the data consists of a string and 2 floats:
# naam massa(kg) radius(km)
Venus 4.8685e24 6051.8
Aarde 5.9736e24 6378.1
Mars 6.4185e23 3396.2
Maan 7.349e22 1738.1
Saturnus 5.6846e26 60268
the following was my solution to this problem:
def dataread(filename):
    temp = np.empty((1,3), dtype=np.object)
    x = 0
    f = open(filename,'r')
    for line in f:
        if line[0] != '#':
            l = line.split('\t')
            temp[0,0], temp[0,1], temp[0,2] = l[0], float(l[1]), float(l[2])
            if x == 0:
                data = temp
            if x > 0:
                data = np.vstack((data,temp))
            x += 1
    f.close()
    return data
somehow this returns the following array:
[['Aarde' 5.9736e+24 6378.1]
['Aarde' 5.9736e+24 6378.1]
['Mars' 6.4185e+23 3396.2]
['Maan' 7.349e+22 1738.1]
['Saturnus' 5.6846e+26 60268.0]]
The first line is being read but does not end up in the array while the second row is read in twice.
What am I doing wrong? I'm new to Python, so any comments on efficiency are also very much appreciated.
Thanks in advance
This will read in your three columns into a numpy structured array:
import numpy as np

data = np.genfromtxt(
    'data.txt',
    dtype=None,  # determine types automatically
    names=['name', 'mass', 'radius'],
)
print(data['name'])
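As for why the original loop loses the first row: data = temp makes data an alias of temp, so the next iteration's in-place write overwrites it before vstack copies anything. A sketch of a fixed version that keeps the same spirit (collect rows, stack once):
import numpy as np

# Sketch: append a fresh object array per row and stack once at the end,
# avoiding the aliasing of `temp` that dropped the first line.
def dataread(filename):
    rows = []
    with open(filename) as f:
        for line in f:
            if not line.startswith('#'):
                name, mass, radius = line.split('\t')
                rows.append(np.array([name, float(mass), float(radius)],
                                     dtype=object))
    return np.vstack(rows)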

Merging numpy ndarray from CSVs

I have the following code:
from numpy import genfromtxt
nysedatafile = os.getcwd() + '/nyse.txt';
nysedata = genfromtxt(nysedatafile, delimiter='\t', names=True, dtype=None);
nasdaqdatafile = os.getcwd() + '/nasdaq.txt';
nasdaqdata = genfromtxt(nasdaqdatafile, delimiter='\t', names=True, dtype=None);
Now I would like to merge the data from the 2 CSVs and I tried various functions:
For example:
import numpy as np;
alldata = np.array(np.concatenate((nysedata, nasdaqdata)));
print('NYSE stocks:' + str(nysedata.shape[0]));
print('NASDAQ stocks:' + str(nasdaqdata.shape[0]));
print('ALL stocks:' + str(alldata.shape[0]));
returns:
TypeError: invalid type promotion
I also tried numpy.vstack, and tried calling np.array on the result.
I expect the last print to give the sum of the rows of the two previous csv files.
EDIT:
These commands:
print('NYSE shape:' + str(nysedata.shape));
print('NASDAQ shape:' + str(nasdaqdata.shape));
print('NYSE dtype:' + str(nysedata.dtype));
print('NASDAQ dtype:' + str(nasdaqdata.dtype));
returns:
NYSE shape:(3257,)
NASDAQ shape:(2719,)
NYSE dtype:[('Symbol', 'S14'), ('Name', 'S62'), ('LastSale', 'S9'), ('MarketCap', '<f8'), ('ADR_TSO', 'S3'), ('IPOyear', 'S4'), ('Sector', 'S21'), ('industry', 'S62'), ('Summary_Quote', 'S38')]
NASDAQ dtype:[('Symbol', 'S14'), ('Name', 'S62'), ('LastSale', 'S7'), ('MarketCap', '<f8'), ('ADR_TSO', 'S3'), ('IPOyear', 'S4'), ('Sector', 'S21'), ('industry', 'S62'), ('Summary_Quote', 'S34')]
The reason why np.vstack (or np.concatenate) is raising an error is because the dtypes of the two arrays do not match.
Notice the very last field: ('Summary_Quote', 'S38') versus ('Summary_Quote', 'S34'). nysedata's Summary_Quote column is 38 bytes long, while nasdaqdata's column is only 34 bytes long.
(Edit: The LastSale column suffers a similar problem.)
This happened because genfromtxt guesses the dtype of the columns when the dtype = None parameter is set. For string columns, genfromtxt determines the minimum number of bytes needed to contain
all the strings in that column.
So to stack the two arrays, the smaller one has to be promoted to the larger one's dtype:
import numpy.lib.recfunctions as recfunctions
recfunctions.stack_arrays([nysedata,nasdaqdata.astype(nysedata.dtype)], usemask = False)
(My previous answer used np.vstack. This results in a 2-dimensional array of shape (N,1). recfunctions.stack_arrays returns a 1-dimensional array of shape (N,). Since nysedata and nasdaqdata are 1-dimensional, I think it is better to return a 1-dimensional array too.)
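If you'd rather not promote the dtype by hand, stack_arrays can, as far as I know, do the promotion itself via its autoconvert flag; a short sketch using the arrays from the question:
import numpy.lib.recfunctions as recfunctions

# Sketch: let stack_arrays widen the mismatched string fields itself instead
# of calling astype manually.
alldata = recfunctions.stack_arrays([nysedata, nasdaqdata],
                                    usemask=False, autoconvert=True)
print('ALL stocks:' + str(alldata.shape[0]))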
Possibly an easier solution would be to concatenate the two csv files first and then call genfromtxt:
import numpy as np
import os
cwd = os.getcwd()
nysedatafile = os.path.join(cwd, 'nyse.txt')
nasdaqdatafile = os.path.join(cwd, 'nasdaq.txt')
alldatafile = os.path.join(cwd, 'all.txt')
with open(nysedatafile) as f1, open(nasdaqdatafile) as f2, open(alldatafile, 'w') as g:
    for line in f1:
        g.write(line)
    next(f2)  # skip the header row of the second file
    for line in f2:
        g.write(line)
alldata = np.genfromtxt(alldatafile, delimiter='\t', names=True, dtype=None)
