I am working on machine translating some text that is stored in a mongodb database. I am trying pull the data from a database and then store it in numpy recarray. However I keep getting errors when I try to save the ObjectId field to the recarray--despite the different type conversions and such I have read about. Here is my code. Any suggestions would help.
#Pull the records from the DB into a resultset
db_results_records_to_translate = \
db_connector.db_fetch_untranslated_records_from_db(
article_collection,rec_number)
#Create an empty numpy recarray to store the data
data_table_for_translation=np.zeros([db_results_records_to_translate.count(),6],
dtype=[('_id', np.str),
('article_raw_text', np.str),
('article_raw_date', np.str),
('translated',np.bool),
('translated_text',np.str),
('translated_date',np.str)])
#Write record data to the recarray
for index, r in enumerate(db_results_records_to_translate):
data_table_for_translation[index, 0] = str(r['_id']) # Line with errors!!!
data_table_for_translation[index,1] = r['article_raw_text']
data_table_for_translation[index,2] = r['article_raw_date']
data_table_for_translation[index, 3] = r['translated']
So after running this code, I get an error TypeError: expected an object with a buffer interface.
Now I have tried to convert the objectid from bson to string using the str(ObjectId) function as referenced in the documentation, but no luck.
Any suggestions?
NOTE: I noticed that this error happens even for the non-id columns too, so even straight text has an issue.
There are errors in the definition of the array, including the dtype, and errors in indexing fields during the iteration.
This is clip illustrates the changes I think you need to make to get this assignment to work:
# fake data - a list of tuples
db_results_records_to_translate = [('12','raw text','raw date')]
#Create an empty numpy recarray to store the data
data_table_for_translation=np.zeros([1,],
dtype=[('_id', 'U10'),
('article_raw_text', 'U10'),
('article_raw_date', 'U10')])
# string dtype has to include length
# I'm using unicode here (Python3), 'S10' would do just as well (in py2)
#Write record data to the structured array
for index, r in enumerate(db_results_records_to_translate):
data_table_for_translation[index]['_id'] = str(r[0])
data_table_for_translation[index]['article_raw_text'] = r[1]
data_table_for_translation[index]['article_raw_date'] = r[2]
print(db_results_records_to_translate)
Note that I index the 'fields' by name, not number. data_table... is a 1d array with n fields, not a 2d array with n columns. I'm indexing r by number because my mock data is a tuple, not the db named fields.
Related
I have a list of np.datetime64 data that looks as follows:
times =[2015-03-26T16:02:42.000000Z,
2015-03-26T16:02:45.000000Z,...]
type(times) returns list
type(times[1]) returns obspy.core.utcdatetime.UTCDateTime
Now, I understand that h5py does not support date time data.
I have tried the following:
time_str = [n.encode("ascii", "ignore") for n in time_str]
time_str = [str(s) for s in time_str]
type(time_str[1]) returns bytes
I am okay with creating the dataset and storing these date time values as a string
However, when attempting to create the dataset, I get the following error:
with h5py.File('data_ML.hdf5', 'w') as f:
f.create_dataset("time", data=time_str,maxshape=(None),chunks=True, dtype='str')
TypeError: No conversion path for dtype: dtype('<U')
Where am I messing up/ is there an alternative way to store these values as is so I can extract them later?
Ok, here we go. I couldn't get some of you code to work together (maybe you left some steps out, or changed variable names?). And, I could not get the obspy.core.utcdatetime.UTCDateTime object your have.
So I created an example that does the following:
Starts with a list of np.datetime64() objects,
Converts to a list of np.datetime_as_string() in UTC format
objects **see note at Item 4
Converts to a np.array with dtype='S30'
Note: I included Step 2 to replicate your data. See following section
for simpler version
Code below:
times =[np.datetime64('2015-03-26T16:02:42.000000'),
np.datetime64('2015-03-26T16:02:45.000000'),
np.datetime64('2015-03-26T16:02:48.000000'),
np.datetime64('2015-03-26T16:02:55.000000') ]
utc_times = [ np.datetime_as_string(n,timezone='UTC') for n in times ]
utc_str_arr = np.array(utc_times,dtype='S30')
with h5py.File('data_ML.hdf5', 'w') as f:
f.create_dataset("time", data=utc_str_arr,maxshape=(None),chunks=True)
You can simplify the process if you are starting with np.datetime64() objects, and don't have (and don't need or want) the intermediate list of string objects (variable utc_times in my code). The method below skips Step 2 above, and shows 2 ways to create a np.array() of properly encoded strings.
Code below:
times =[np.datetime64('2015-03-26T16:02:42.000000'),
np.datetime64('2015-03-26T16:02:45.000000'),
np.datetime64('2015-03-26T16:02:48.000000'),
np.datetime64('2015-03-26T16:02:55.000000') ]
# Create empty array with defined size and 'S#' dtype, then populate with for loop:
utc_str_arr1 = np.empty((len(times),),dtype='S30')
for i, n in enumerate(times):
utc_str_arr1[i] = np.datetime_as_string(n,timezone='UTC')
# -OR- Create array and populate using loop comprehension:
utc_str_arr2 = np.array( [np.datetime_as_string(n,timezone='UTC').encode('utf-8') for n in times] )
with h5py.File('data_ML.hdf5', 'w') as f:
f.create_dataset("time1", data=utc_str_arr1,maxshape=(None),chunks=True)
f.create_dataset("time2", data=utc_str_arr2,maxshape=(None),chunks=True)
Final result looks similar with either method (second method creates 2 identical datsets).
Image from HDFView:
To Read the Data:
Per request in Aug-02-2021 comment, here is the code to extract data from HDF5 and create Pandas timestamp objects (then saved to a dataframe). First the byte strings in the dataset are read and converted to NumPy Unicode strings with .astype(). Then the strings are converted to Pandas timestamp objects with pd.to_datetime() using the format= parameter.
import h5py
import numpy as np
import pandas as pd
with h5py.File('data_ML.hdf5', 'r') as h5f:
## returns a h5py dataset object:
dts_ds = h5f["time"]
longest_word=len(max(dts_ds, key=len))
## returns an array of byte strings representing np.datetime64:
## .astype() used to convert byte strings to unicode
dts_arr = dts_ds[:].astype('U'+str(longest_word))
## create a new array to hold Pandas datetime objects
## then loop over first array to convert and populate new array
pd_dts_arr = np.empty((dts_arr.shape[0],),dtype=object)
for i, dts in enumerate(dts_arr):
pd_dts_arr[i] = pd.to_datetime(dts, format='%Y-%m-%dT%H:%M:%S.%fZ')
dts_df = pd.DataFrame(pd_dts_arr)
There are a lot of ways to represent dates and time using native Python, NumPy and Pandas objects. More details about working with them can be found at this answer:
Converting between datetime, Timestamp and datetime64
I create an expandable earray of Nx4 columns. Some columns require float64 datatype, the others can be managed with int32. Is it possible to vary the data types among the columns? Right now I just use one (float64, below) for all, but it takes huge disk space for (>10 GB) files.
For example, how can I ensure column 1-2 elements are int32 and 3-4 elements are float64?
import tables
f1 = tables.open_file("table.h5", "w")
a = f1.create_earray(f1.root, "dataset_1", atom=tables.Float32Atom(), shape=(0, 4))
Here is a simplistic version of how I am appending using Earray:
Matrix = np.ones(shape=(10**6, 4))
if counter <= 10**6: # keep appending to Matrix until 10**6 rows
Matrix[s:s+length, 0:4] = chunk2[left:right] # chunk2 is input np.ndarray
s += length
# save to disk when rows = 10**6
if counter > 10**6:
a.append(Matrix[:s])
del Matrix
Matrix = np.ones(shape=(10**6, 4))
What are the cons for the following method?
import tables as tb
import numpy as np
filename = 'foo.h5'
f = tb.open_file(filename, mode='w')
int_app = f.create_earray(f.root, "col1", atom=tb.Int32Atom(), shape=(0,2), chunkshape=(3,2))
float_app = f.create_earray(f.root, "col2", atom=tb.Float64Atom(), shape=(0,2), chunkshape=(3,2))
# array containing ints..in reality it will be 10**6x2
arr1 = np.array([[1, 1],
[2, 2],
[3, 3]], dtype=np.int32)
# array containing floats..in reality it will be 10**6x2
arr2 = np.array([[1.1,1.2],
[1.1,1.2],
[1.1,1.2]], dtype=np.float64)
for i in range(3):
int_app.append(arr1)
float_app.append(arr2)
f.close()
print('\n*********************************************************')
print("\t\t Reading Now=> ")
print('*********************************************************')
c = tb.open_file('foo.h5', mode='r')
chunks1 = c.root.col1
chunks2 = c.root.col2
chunk1 = chunks1.read()
chunk2 = chunks2.read()
print(chunk1)
print(chunk2)
No and Yes. All PyTables array types (Array, CArray, EArray, VLArray) are for homogeneous datatypes (similar to a NumPy ndarray). If you want to mix datatypes, you need to use a Table. Tables are extendable; they have an .append() method to add rows of data.
The creation process is similar to this answer (only the dtype is different): PyTables create_array fails to save numpy array. You only define the datatypes for a row. You don't define the shape or number of rows. That is implied as you add data to the table. If you already have your data in a NumPy recarray, you can reference it with the description= entry, and the Table will use the dtype for the table and populate with the data. More info here: PyTables Tables Class
Your code would look something like this:
import tables as tb
import numpy as np
table_dt = np.dtype(
{'names': ['int1', 'int2', 'float1', 'float2'],
'formats': [int, int, float, float] } )
# Create some random data:
i1 = np.random.randint(0,1000, (10**6,) )
i2 = np.random.randint(0,1000, (10**6,) )
f1 = np.random.rand(10**6)
f2 = np.random.rand(10**6)
with tb.File('table.h5', 'w') as h5f:
a = h5f.create_table('/', 'dataset_1', description=table_dt)
# Method 1 to create empty recarray 'Matrix', then add data:
Matrix = np.recarray( (10**6,), dtype=table_dt)
Matrix['int1'] = i1
Matrix['int2'] = i2
Matrix['float1'] = f1
Matrix['float2'] = f2
# Append Matrix to the table
a.append(Matrix)
# Method 2 to create recarray 'Matrix' with data in 1 step:
Matrix = np.rec.fromarrays([i1, i2, f1, f2], dtype=table_dt)
# Append Matrix to the table
a.append(Matrix)
You mentioned creating a very large file, but did not say how many rows (obviously way more than 10**6). Here are some additional thoughts based on comments in another thread.
The .create_table() method has an optional parameter: expectedrows=. This parameter is used 'to optimize the HDF5 B-Tree and amount of memory used'. Default value is set in tables/parameters.py (look for EXPECTED_ROWS_TABLE; It's only 10000 in my installation.) I highly suggest you set this to a larger value if you are creating 10**6 (or more) rows.
Also, you should consider file compression. There's a trade-off: compression reduces the file size, but will reduce I/O performance (increases access time).
There are a few options:
Enable compression when you create the file (add the filters= parameter when you create the file). Start with tb.Filters(complevel=1).
Use the HDF Group utility h5repack - run against a HDF5 file to create a new file (useful to go from uncompressed to compressed, or vice-versa).
Use the PyTables utility ptrepack - works similar to h4repack and delivered with PyTables.
I tend to use uncompressed files I work with often for best I/O performance. Then when done, I convert to compressed format for long term archiving.
I am using Firestore to store time series data being pulled from a sensor. I am using Python to push the data, namely the Firebase-Admin package for verification. I chose to store this data using arrays, where each index corresponds with an array across different fields. Is there a way to add non-unique elements to the array? Or can arrays only store unique elements? If so, what data structure would you suggest for storing time series data.
I am trying to add observations to an existing array in Firestore, but ArrayUpdate only adds the element if it is not already present in the array. When I execute the second chuck of code (to update the existing array), only unique values are saved
# Initialize arrays and push to Firestore
import firebase_admin
from firebase_admin import credentials
from firebase_admin import firestore
import datetime
cred = credentials.Certificate('path_to_certificate')
firebase_admin.initialize_app(cred)
db = firestore.client()
cell_1_arr = []
cell_2_arr = []
cell_3_arr = []
exec_time_arr = []
curr_time_arr = []
pred_volt = 12562.70
meas_volt = 12362.70
current = 0.0
soc = 0,0
cell_1_volt = 4.32
cell_2_volt = 4.4
cell_3_volt = 4.23
exec_time = 0.4
curr_time = datetime.datetime.now()
pred_v_arr.append(pred_volt)
meas_v_arr.append(meas_volt)
c_arr.append(current)
soc_arr.append(soc)
cell_1_arr.append(cell_1_volt)
cell_2_arr.append(cell_2_volt)
cell_3_arr.append(cell_3_volt)
exec_time_arr.append(exec_time)
curr_time_arr.append(curr_time)
try:
push_data = {
u'time': curr_time_arr,
u'vpred': meas_v_arr,
u'vmeas': pred_v_arr,
u'current': c_arr,
u'soc': soc_arr,
u'cell1': cell_1_arr,
u'cell2': cell_2_arr,
u'cell3': cell_3_arr,
u'exectime': exec_time_arr
}
db.collection(u'battery1').document(u"day1").set(push_data)
# Add add a new observation to the different arrays
db.collection(u'battery1').document(u"day1").update({'time': firestore.ArrayUnion(curr_time_arr)})
db.collection(u'battery1').document(u"day1").update({'vpred': firestore.ArrayUnion(pred_v_arr)})
db.collection(u'battery1').document(u"day1").update({'vmeas': firestore.ArrayUnion(meas_v_arr)})
db.collection(u'battery1').document(u"day1").update({'current': firestore.ArrayUnion(c_arr)})
db.collection(u'battery1').document(u"day1").update({'cell1': firestore.ArrayUnion(cell_1_arr)})
db.collection(u'battery1').document(u"day1").update({'cell2': firestore.ArrayUnion(cell_2_arr)})
db.collection(u'battery1').document(u"day1").update({'cell3': firestore.ArrayUnion(cell_3_arr)})
db.collection(u'battery1').document(u"day1").update({'exectime': firestore.ArrayUnion(exec_time_arr)})
db.collection(u'battery1').document(u"day1").update({'soc': firestore.ArrayUnion(soc_arr)})
In the screenshot above you can see that there are 8 elements in the "time" field (as all calls to datetime.now() produce unique instances of timestamps), while all the other fields have only saved the unique data points sent (exectime/soc only have two data points, for 8 calls to ArrayUnion).
When you use firestore.ArrayUnion that operator's job is literally to ensure each value can only be present once in the array.
If you want to allow non-unique values, don't use firestore.ArrayUnion but just add the elements to the array regularly. This does require that you read the entire document and array first, then add the element locally, and write the result back.
I am currently discovering HDf5 library n Python and I have some problem. I have a dataset with this layout:
GROUP "GROUP1" {
DATASET "DATASET1" {
DATATYPE H5T_COMPOUND {
H5T_STD_I64LE "DATATYPE1";
H5T_STD_I64LE "DATATYPE2";
H5T_STD_I64LE "DATATYPE3";
}
DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
DATA {
(0): {
1,
2,
3
I am trying to iterate in dataset to get the values associated to each datatype and copying them in a text file. (For example, "1" is the associated value to "DATATYPE1".) This following script does work:
new_file = open('newfile.txt', 'a')
for i in range(len(dataset[...])):
new_file.write('Ligne '+ str(i)+" "+":"+" ")
for j in range(len(dataset[i,...])):
new_file.write(str(dataset[i][j]) + "\n")
But it is not this clean... So I tried to get values by calling the datatypes by name. The closest script I found is the following:
for attribute in group.attrs:
print group.attrs[attribute]
Unfortunately, despite my tries it does not work on datatype :
Checking datatypes leads to dataset
for data.dtype in dataset.dtype:
#then print datatypes
print dataset.dtype[data.dtype
The backing error message is "numpy.dtype' object is not iterable".
Do you please have any idea how to process? I hope my question is clear.
Without your data it's hard to offer specific solutions. Here is a very simple example that mimics your data schema using pytables (& numpy). First it creates the HDF5 file, with table named DATASET1 under group GROUP1. DATASET1 has 3 int values in each row named: DATATYPE1, DATATYPE2, and DATATYPE3. The ds1.append() function adds rows of data to the table (1 row at a time).
After the data is created, walk_nodes() is used to traverse the HDF5 file structure and print node names and dtypes for tables.
import tables as tb
import numpy as np
with tb.open_file("SO_56545586.h5", mode = "w") as h5f:
ds1 = h5f.create_table('/GROUP1', 'DATASET1',
description=np.dtype([('DATATYPE1', int),('DATATYPE2', int),('DATATYPE3', int)]),
createparents=True)
for row in range(5) :
row_vals = [ (row, row+1, row*2), ]
ds1.append(row_vals)
## This section walks the file strcuture (groups and datasets), printing node names and dtype for tables:
for this_node in h5f.walk_nodes('/'):
print (this_node)
if isinstance(this_node, tb.Table) :
print (this_node.dtype)
Note: do not use mode = "w" when you open an existing file. It will create a new file (overwrite the existing file). Use mode = "a" or mode = "r+" if you need to append data, or mode = "r" if you only need to read the data.
To complete solution added by kcw78 I also found this script which also work. Because I can't iterate over dataset, I copied dataset into a new array :
dataset = file['path_to_dataset']
data = np.array(dataset) # Create a new array filled with dataset values as numpy.
print(data)
ls_column = list(data.dtype.names) # Get a list with datatypes associated to each data values.
print(ls_column) # Show layout of datatypes associated to each previous data values.
# Create an array filled with same datatypes rather than same subcases.
for col in ls_column:
k = data[col] # example : k=data['DATATYPE1'], k=data['DATATYPE2']
print(k)
Arnaud, OK, I see you are using h5py.
I don't understand what you mean by "I can't iterate over dataset". You can iterate over rows, or columns/fields.
Here is an example to demonstrate with h5py.
It shows 4 ways to extract data from the dataset, the last one iterates):
Read the entire HDF5 dataset to a np array
Then read 1 column from that array to another array
Read 1 column from the HDF5 dataset as an array
Loop thru HDF5 dataset columns and read 1 at a time as an array
Note that the return from .dtype.names is iterable. You don't need to create a list (unless you need it for other purposes). Also, HDF5 supports mixed types in datasets, so you can get a dtype with int, float, and string values (it will be a record array).
import h5py
import numpy as np
with h5py.File("SO_56545586.h5", "w") as h5f:
# create empty dataset 'DATASET1' in group '/GROUP1'
# dyte argument defines names and types
ds1 = h5f.create_dataset('/GROUP1/DATASET1', (10,),
dtype=np.dtype([('DATATYPE1', int),('DATATYPE2', int),('DATATYPE3', int)]) )
for row in range(5) : # load some arbitrary data into the dataset
row_vals = [ (row, row+1, row*2), ]
ds1[row] = row_vals
# to read the entire dataset as an array
ds1_arr = h5f['/GROUP1/DATASET1'][:]
print (ds1_arr.dtype)
# to read 1 column from ds1_arr as an array
ds1_col1 = ds1_arr[:]['DATATYPE1']
print ('for DATATYPE1 from ds1_arr, dtype=',ds1_col1.dtype)
# to read 1 HDF5 dataset column as an array
ds1_col1 = h5f['/GROUP1/DATASET1'][:,'DATATYPE1']
print ('for DATATYPE1 from HDF5, dtype=',ds1_col1.dtype)
# to loop thru HDF5 dataset columns and read 1 at a time as an array
for col in h5f['/GROUP1/DATASET1'].dtype.names :
print ('for ', col, ', dtype=',h5f['/GROUP1/DATASET1'][col].dtype)
col_arr = h5f['/GROUP1/DATASET1'][col][:]
print (col_arr.shape)
I have some very noisy (astronomy) data in csv format. Its shape is (815900,2) with 815k points giving information of what the mass of a disk is at a certain time. The fluctuations are pretty noticeable when you look at it close up. For example, here is an snippet of the data where the first column is time in seconds and the second is mass in kg:
40023700,2.40896E+028
40145700,2.44487E+028
40267700,2.44487E+028
40389700,2.44478E+028
40511600,1.535E+028
40633500,2.19067E+028
40755400,2.44496E+028
40877200,2.44489E+028
40999000,2.44489E+028
41120800,2.34767E+028
41242600,2.40936E+028
So it looks like there is a 1.53E+028 data point of noise, and also probably the 2.19E+028 and 2.35E+028 points.
To fix this, I am trying to set a Python script that will read in the csv data, then put some restriction on it so that if the mass is e.g. < 2.35E+028, it will remove the whole row and then create a new csv file with only the "good" data points:
40023700,2.40896E+028
40145700,2.44487E+028
40267700,2.44487E+028
40389700,2.44478E+028
40755400,2.44496E+028
40877200,2.44489E+028
40999000,2.44489E+028
41242600,2.40936E+028
Following this old question top answer by n8henrie, I so far have:
import pandas as pd
import csv
# Here are the locations of my csv file of my original data and an EMPTY csv file that will contain my good, noiseless set of data
originaldata = '/Users/myname/anaconda2/originaldata.csv'
gooddata = '/Users/myname/anaconda2/gooddata.csv'
# I use pandas to read in the original data because then I can separate the columns of time as 'T' and mass as 'M'
originaldata = pd.read_csv('originaldata.csv',delimiter=',',header=None,names=['t','m'])
# Numerical values of the mass values
M = originaldata['m'].values
# Now to put a restriction in
for row in M:
new_row = []
for column in row:
if column > 2.35E+028:
new_row.append(column)
csv.writer(open(newfile,'a')).writerow(new_row)
print('\n\n')
print('After:')
print(open(newfile).read())
However, when I run this, I get this error:
TypeError: 'numpy.float64' object is not iterable
I know the first column (time) is dtype int64 and the second column (mass) is dtype float64... but as a beginner, I'm still not quite sure what this error means or where I'm going wrong. Any help at all would be appreciated. Thank you very much in advance.
You can select rows by a boolean operation. Example:
import pandas as pd
from io import StringIO
data = StringIO('''\
40023700,2.40896E+028
40145700,2.44487E+028
40267700,2.44487E+028
40389700,2.44478E+028
40511600,1.535E+028
40633500,2.19067E+028
40755400,2.44496E+028
40877200,2.44489E+028
40999000,2.44489E+028
41120800,2.34767E+028
41242600,2.40936E+028
''')
df = pd.read_csv(data,names=['t','m'])
good = df[df.m > 2.35e+28]
out = StringIO()
good.to_csv(out,index=False,header=False)
print(out.getvalue())
Output:
40023700,2.40896e+28
40145700,2.44487e+28
40267700,2.44487e+28
40389700,2.44478e+28
40755400,2.44496e+28
40877200,2.44489e+28
40999000,2.44489e+28
41242600,2.40936e+28
This returns a column: M = originaldata['m'].values
So when you do for row in M:, you get only one value in row, so you can't iterate on it again.