I'm using PyTables 2.2.1 w/ Python 2.6, and I would like to create a table which contains nested arrays of variable length.
I have searched the PyTables documentation, and the tutorial example (PyTables Tutorial 3.8) shows how to create a nested array of length = 1. But for this example, how would I add a variable number of rows to data 'info2/info3/x' and 'info2/info3/y'?
For perhaps an easier to understand table structure, here's my homegrown example:
"""Desired Pytable output:
DIEM TEMPUS Temperature Data
5 0 100 Category1 <--||--> Category2
x <--| |--> y z <--|
0 0 0
2 1 1
4 1.33 2.67
6 1.5 4.5
8 1.6 6.4
5 1 99
2 2 0
4 2 2
6 2 4
8 2 6
5 2 96
4 4 0
6 3 3
8 2.67 5.33
Note that nested arrays have variable length.
"""
import tables as ts
tableDef = {'DIEM': ts.Int32Col(pos=0),
            'TEMPUS': ts.Int32Col(pos=1),
            'Temperature': ts.Float32Col(pos=2),
            'Data':
                {'Category1':
                     {'x': ts.Float32Col(),
                      'y': ts.Float32Col()},
                 'Category2':
                     {'z': ts.Float32Col()}
                 }
            }
# create output file
fpath = 'TestDb.h5'
fh = ts.openFile(fpath, 'w')
# define my table
tableName = 'MyData'
fh.createTable('/', tableName, tableDef)
tablePath = '/'+tableName
table = fh.getNode(tablePath)
# get row iterator
row = table.row
for i in xrange(3):
    print '\ni=', i
    # calc some fake data
    row['DIEM'] = 5
    row['TEMPUS'] = i
    row['Temperature'] = 100 - i**2

    for j in xrange(5 - i):
        # Note that nested array has variable number of rows
        print 'j=', j,
        # calc some fake nested data
        val1 = 2.0*(i + j)
        val2 = val1/(j + 1.0)
        val3 = val1 - val2

        ''' Magic happens here...
        How do I write 'j' rows of data to the elements of
        Category1 and/or Category2?

        In bastardized pseudo-code, I want to do:

        row['Data/Category1/x'][j] = val1
        row['Data/Category1/y'][j] = val2
        row['Data/Category2/z'][j] = val3
        '''

    row.append()

table.flush()
fh.close()
I have not found any indication in the PyTables docs that such a structure is not possible... but in case such a structure is in fact not possible, what are my alternatives to variable length nested columns?
EArray? VLArray? If so, how to integrate these data types into the above described structure?
some other idea?
Any assistance is greatly appreciated!
EDIT w/ additional info:
It appears that the PyTables gurus have already addressed the "is such a structure possible" question:
PyTables Mail Forum - Hierarchical Datasets
So has anyone figured out a way to create an analogous PyTable data structure?
Thanks again!
I have a similar task: to dump fixed size data with arrays of a variable length.
I first tried using fixed size StringCol(64*1024) fields to store my variable length data (they are always < 64K). But it was rather slow and wasted a lot of disk space, despite blosc compression.
After days of investigation I ended with the following solution:
(spoiler: we store array fields in separate EArray instances, one EArray per one array field)
I store fixed size data in a regular pytables table.
I added 2 additional fields to these tables: arrFieldName_Offset and arrFieldName_Length:
class Particle(IsDescription):
    idnumber      = Int64Col()
    ADCcount      = UInt16Col()
    TDCcount      = UInt8Col()
    grid_i        = Int32Col()
    grid_j        = Int32Col()
    pressure      = Float32Col()
    energy        = FloatCol()
    buffer_Offset = UInt32Col()   # note this field!
    buffer_Length = UInt32Col()   # and this one too!
I also create one EArray instance per each array field:
datatype = StringAtom(1)
buffer = h5file.createEArray('/detector', 'arr', datatype, (0,), "")
Then I add rows containing the fixed-size data:
row['idnumber'] = ...
...
row['energy'] = ...
row['buffer_Offset'] = buffer.nrows
# my_buf is a string (I get it from a stream)
row['buffer_Length'] = len(my_buf)
table.append(row)
Ta-dah! Add the buffer into the array.
buffer.append(np.ndarray((len(my_buf),), buffer=my_buf, dtype=datatype))
That's the trick. In my experiments this approach is 2-10x faster than storing ragged data in fixed-size fields (like StringAtom(HUGE_NUMBER)), and the resulting DB is a few times smaller (2-5x).
Getting the buffer data is easy. Suppose that row is a single row you read from your DB:
# Open the array for reading (it was created above as /detector/arr)
buffer = h5file.getNode('/detector', 'arr')
...
row = ...
...
bufferDataYouNeed = buffer[ row['buffer_Offset'] : row['buffer_Offset'] + row['buffer_Length']]
This is a common thing that folks starting out with PyTables want to do. Certainly, it was the first thing I tried to do. As of 2009, I don't think this functionality was supported. You can look here for one solution "I always recommend":
http://www.mail-archive.com/pytables-users@lists.sourceforge.net/msg01207.html
In short, just put each VLArray in a separate place. If you do that, maybe you don't end up needing VLArrays. If you store separate VLArrays for each trial (or whatever), you can keep metadata on those VLArrays (guaranteed to stay in sync with the array across renames, moves, etc.) or put it in a table (easier to search).
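For illustration, a minimal sketch of that layout, using the same PyTables 2.x API as the question (the file, group and trial names here are made up):
import tables as ts
import numpy as np

fh = ts.openFile('separate_vlarrays.h5', 'w')
grp = fh.createGroup('/', 'trials')

# One VLArray per trial; the fixed-size metadata lives in the node's attributes.
for i, temperature in enumerate([100, 99, 96]):
    vla = fh.createVLArray(grp, 'trial_%d' % i, ts.Float32Atom())
    vla.attrs.DIEM = 5
    vla.attrs.TEMPUS = i
    vla.attrs.Temperature = temperature
    # one variable-length row of 'x' values for this trial
    vla.append(np.arange(0, 10 - 2*i, 2, dtype='float32'))

fh.close()
Each VLArray holds one ragged sequence, and the per-trial scalars ride along as node attributes, so they stay in sync with the array across renames and moves.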
But you may also do well to pick whatever a single time-point would be for your column atom, then simply add another column for a time stamp. This would allow for a "ragged" array that still has a regular, repeated (tabular) structure in memory. For example:
Trial   Data
1       0.4, 0.5, 0.45
2       0.3, 0.4, 0.45, 0.56

becomes

Trial   Timepoint   Data
1       1           0.4
1       2           0.5
...
2       4           0.56
Data above is a single number, but it could be, e.g. a 4x5x3 atom.
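As a rough sketch of that long-format layout in PyTables (again the 2.x API; the table and column names are only illustrative):
import tables as ts

class Reading(ts.IsDescription):
    trial     = ts.Int32Col(pos=0)
    timepoint = ts.Int32Col(pos=1)
    data      = ts.Float32Col(pos=2)   # could instead be Float32Col(shape=(4, 5, 3))

fh = ts.openFile('ragged_as_table.h5', 'w')
table = fh.createTable('/', 'readings', Reading)

row = table.row
for trial, values in [(1, [0.4, 0.5, 0.45]), (2, [0.3, 0.4, 0.45, 0.56])]:
    for timepoint, value in enumerate(values, start=1):
        row['trial'] = trial
        row['timepoint'] = timepoint
        row['data'] = value
        row.append()

table.flush()
fh.close()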
If nested VLArrays are supported in PyTables now, I'd certainly love to know!
Alternatively, I think h5py does support the full HDF5 feature-set, so if you're really committed to the nested data layout, you may have more luck there. You'll be losing out on a lot of nice features, though! And in my experience, naive neuroscientists end up with quite poor performance since they don't get PyTables' intelligent choices for data layout, chunking, etc. Please report back if you go that route!
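If you do go the h5py route, variable-length data is exposed through a special dtype; a minimal sketch (the dataset name is made up, and this is a plain ragged dataset rather than a nested table):
import h5py
import numpy as np

vlen_float = h5py.special_dtype(vlen=np.dtype('float32'))

with h5py.File('ragged_h5py.h5', 'w') as f:
    # one variable-length row per trial
    dset = f.create_dataset('category1_x', shape=(3,), dtype=vlen_float)
    dset[0] = np.array([0, 2, 4, 6, 8], dtype='float32')
    dset[1] = np.array([2, 4, 6, 8], dtype='float32')
    dset[2] = np.array([4, 6, 8], dtype='float32')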
I also ran into this, and I ended up using a fixed array size. The arrays I was trying to store were of variable length, so I created new ones with the correct fixed length.
I did something along the lines of
def filled_list(src_list, targ_len):
    """Takes a variable len() list and creates a new one with a fixed len()."""
    for i in range(targ_len):
        try:
            yield src_list[i]
        except IndexError:
            yield 0
src_list = [1,2,3,4,5,6,7,8,9,10,11]
new_list = [x for x in filled_list(src_list, 100)]
That did the trick for me.
Related
I have a data frame made of tweets and their author, there is a total of 45 authors. I want to divide the data frame into groups of 2 authors at a time such that I can export them later into csv files.
I tried using the following (given that the authors are in a column named 'B' and the tweets are in a column named 'A'):
I took the following from this question
df.set_index(keys=['B'],drop=False,inplace=True)
authors = df['B'].unique().tolist()
in order to separate the lists:
dgroups = []
for i in range(0, len(authors)-1, 2):
    dgroups.append(df.loc[df.B==authors[i]])
    dgroups.extend(df.loc[df.B==authors[i+1]])
but instead it gives me sub-lists like this:
dgroups = [['A'],['B'],
[tweet,author],
['A'],['B'],
[tweet,author2]]
Prior to this I was able to divide them correctly into 45 sub-lists, derived from the previously linked question, as follows:
for i in authors:
    groups.append(df.loc[df.B==i])
So how would I do that for 2 authors, or 3 authors, and so on?
EDIT: from @Jonathan Leon's answer, I thought I would do the following, which worked but isn't a dynamic solution and I guess is inefficient, especially if n > 3:
dgroups = []
for i in range(2, len(authors)+1, 2):
    tempset1 = []
    tempset2 = []
    tempset1 = df.loc[df.B==authors[i-2]]
    if(i-1 != len(authors)):
        tempset2 = df.loc[df.B==authors[i-1]]
        dgroups.append(tempset1.append(tempset2))
    else:
        dgroups.append(tempset1)
This imports the foreign-language text incorrectly, but the logic works to create a new csv for every two authors.
df = pd.read_csv('TrainDataAuthorAttribution.csv')
# df.groupby('B').count()
authors=df.B.unique().tolist()
auths_in_subset = 2
for i in range(auths_in_subset, len(authors)+auths_in_subset, auths_in_subset):
    # print(authors[i-auths_in_subset:i])
    dft = df[df.B.isin(authors[i-auths_in_subset:i])]
    # print(dft)
    dft.to_csv('df' + str(i) + '.csv')
I've been going in circles for days now, and I've run out of steam. Doesn't help that I'm new to python / numpy / pandas etc.
I started with numpy which led me to pandas, because of a GIS function that delivers a numpy array of data. That is my starting point. I'm trying to get to an endpoint being a small enriched dataset, in an excel spreadsheet.
But it seems like going down a rabbit hole trying to extract that data, and then manipulate it with the numpy toolsets. The delivered data is one dimensional, but each row contains 8 fields. A simple conversion to pandas and then to ndarray, magically makes it all good. Except that I lose headers in the process, and it just snowballs from there.
I've had to re-evaluate my understanding, based on some feedback on another post, and that's fine. But I'm just going in circles. Example after example seems to use predominantly numerical data, and I'm starting to get the feeling that's where its strength lies. Trying to use it for what I'd call a more non-mathematical/numerical purpose feels like barking up the wrong tree.
Any advice?
Addendum
The data I extract from the GIS system is names, dates, other textual data. I then have another csv file that I need to use as a lookup, so that I can enrich the source with more textual information which finally gets published to excel.
SAMPLE DATA - SOURCE
   WorkCode   Status  WorkName                      StartDate   EndDate     siteType        Supplier
0  AT-W34319  None    Second building               2020-05-04  2020-05-31  Type A          Acem 1
1  AT-W67713  None    Left of the red office tower  2019-02-11  2020-08-28  Type B          Quester Q
2  AT-W68713  None    12 main street                2019-05-23  2020-11-03  Class 1 Type B  Dettlim Group
3  AT-W70105  None    city central                  2019-03-07  2021-08-06  Other           Hans Int
4  AT-W73855  None    top floor                     2019-05-06  2020-10-28  Type a          None
SAMPLE DATA - CSV
["Id", "Version","Utility/Principal","Principal Contractor Contact"]
XM-N33463,7.1,"A Contracting company", "555-12345"
XM-N33211,2.1,"Contractor #b", "555-12345"
XM-N33225,1.3,"That other contractor", "555-12345"
XM-N58755,1.0,"v Contracting", "555-12345"
XM-N58755,2.3,"dsContracting", "555-12345"
XM-222222,2.3,"dsContracting", "555-12345"
BM-O33343,2.1,"dsContracting", "555-12345"
def SMAN():
    ####################################################################################################################
    # Exporting the results of the analysis...
    ####################################################################################################################
    """
    Approach is as follows:
    1) Get the source data
    2) Get the CSV lookup data loaded into memory - it'll be faster
    3) Iterate through the source data, looking for matches in the CSV data
    4) Add an extra couple of columns onto the source data, and populate it with the (matching) lookup data.
    5) Export the now enhanced data to excel.
    """
    arcpy.env.workspace = workspace + filenameGDB
    input = "ApprovedActivityByLocalBoard"
    exportFile = arcpy.da.FeatureClassToNumPyArray(input, ['WorkCode', 'Status', 'WorkName', 'PSN2', 'StartDate', 'EndDate', 'siteType', 'Supplier'])
    # we have our data, but it's (9893,) instead of [9893 rows x 8 columns]
    pdExportFile = pandas.DataFrame(exportFile)
    LBW = pdExportFile.to_numpy()
    del exportFile
    del pdExportFile
    # Now we have [9893 rows x 8 columns] - but we've lost the headers

    col_list = ["WorkCode", "Version", "Principal", "Contact"]
    allPermits = pandas.read_csv("lookup.csv", usecols=col_list)
    # Now we have the CSV file loaded ... and only the important parts - should be fast.
    # Shape: (94523, 4)
    # will have to find a way to improve this...
    # CSV file has got more than WorkCode, because there are different versions (as different records)
    # Only want the last one.

    # each record must now be "enhanced" with the matching record from the CSV file.
    finalReport = []  # we are expecting this to be [9893 rows x 12 columns] at the end
    counter = -1
    for eachWorksite in LBW[:5]:  # let's just work with 5 records right now...
        counter += 1
        # eachWorksite = list(eachWorksite)  # eachWorksite is a tuple - so need to convert it
        # # but if we change it to a list, we lose the headers!
        certID = LBW[counter][0]  # get the ID to use for lookup matching
        # Search the CSV data
        permitsFound = allPermits[allPermits['Id'] == certID]
        permitsFound = permitsFound.to_numpy()
        if numpy.shape(permitsFound)[0] > 1:
            print("Too many hits!")  # got to deal with that CSV Version field.
            exit()
        else:
            # now "enrich" the record/row by adding on the fields from the lookup
            # so a row goes from 8 fields to 12 fields
            newline = numpy.append(eachWorksite, permitsFound)
            # and this enhanced record/row must become the new normal
            # but I cannot change the original, so it must go into a new container
            finalReport = numpy.append(finalReport, newline, axis=0)
    # now I should have a new container, of "enriched" data
    # which has gone from [9893 rows x 8 columns] to [9893 rows x 12 columns]
    # Some of the columns, of course, could be empty.

    # Now let's dump the results to an Excel file and make it accessible for everyone else.
    df = pandas.DataFrame(finalReport)
    filepath = 'finalreport.csv'
    df.to_csv('filepath', index=False)
    # Somewhere I was getting Error("Cannot convert {0!r} to Excel".format(value))
    # Now I get
    filepath = 'finalReport.xlsx'
    df.to_excel(filepath, index=False)
I have eventually answered my own question, and this is how:
Yes, for my situation, pandas worked just fine, even beautifully for
manipulating non numerical data. I just had to learn some basics.
The biggest learning was to understand the pandas data frame as an object that has to be manipulated remotely by various functions/tools. Just because I "print" the dataframe doesn't mean it's just text. (Thanks juanpa.arrivillaga for pointing out my erroneous assumptions in Why can I not reproduce a nd array manually?)
I also had to wrap my mind around the concept of indexes and columns, how they can be altered/manipulated etc., and then how to use them to maximum effect.
Once those fundamentals had been sorted, the rest followed naturally, and my code reduced to a couple of nice elegant functions.
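For anyone with a similar task, the enrichment step described above essentially reduces to a pandas merge; here is a minimal sketch under assumed file and column names (the real code depends on the GIS export):
import pandas as pd

# Source records from the GIS export (read from a CSV here purely for illustration)
source = pd.read_csv('source_export.csv')   # WorkCode, Status, WorkName, ...

# Lookup data; keep only the latest Version per Id before joining
lookup = pd.read_csv('lookup.csv',
                     usecols=['Id', 'Version', 'Utility/Principal',
                              'Principal Contractor Contact'])
latest = lookup.sort_values('Version').drop_duplicates('Id', keep='last')

# Left join keeps every source row and adds the lookup columns where they match
enriched = source.merge(latest, how='left', left_on='WorkCode', right_on='Id')
enriched.to_excel('finalReport.xlsx', index=False)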
Cheers
I am desperately trying to solve this issue:
I have a csv with information on well core data with different columns, among them one column with IDs and two with X and Y coordinates. I was told by the data supplier that some of the well cores (= rows) have wrong Y coordinates - the value should be, e.g., -1400 instead of 1400.
I am now trying to write a script to automatically change the Y-values in the affected rows (well cores) by multiplying them by -1, but nothing has worked:
ges = pd.read_csv(r"C:\A....csv")
bk = [26740001, 26740002, 26740003] # List of IDs that should be changed
for x in bk:
for line in ges:
np.where(ges.query('ID== {}'.format(x)), ges.Y=ges.Y*-1, ges['Y'])
I have also tried it like this:
for line in ges:
    if ges.ID.values == bk:
        ges.Y = ges.Y*-1
    else:
        pass
or like this:
ges.loc[(ges.ID == bk), 'Y']=*-1
or:
ges.loc[ges['ID'].isin(bk), ges['Y']] = *-1
or:
ges.loc[ges['ID'].isin(bk), ges['Y']] = ges['Y']*-1
I am very grateful for every help!
edit:
I am sorry, this is my first post. To make it clearer, my data looks like this:
Now I was informed that the Y-values of ID 2, 3 and 6 are wrong and should be negative values. So my desired output is the following:
ID   X     Y      other column   other column
1    3459  1245   information    information
2    4541  -1256  information    information
3    2378  -2353  information    information
4    6947  874    information    information
5    2349  2351   information    information
6    2347  -746   information    information
I hope it is clear now. Thanks.
Try the following:
ids = [26740001, 26740002, 26740003]
for number_id in ids:
    idx = ges['ID'] == number_id
    ges.loc[idx, 'Y'] *= -1
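The loop could also be collapsed into a single vectorised assignment with isin, which is essentially what the last attempt in the question was reaching for:
ges.loc[ges['ID'].isin(ids), 'Y'] *= -1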
I have a csv file that has a primary_id field and a version field and it looks like this:
ful_id                                version  xs   at_grade  date
000c1a6c-1f1c-45a6-a70d-f3555f7dd980  3        123  yes       20171003
000c1a6c-1f1c-45a6-a70d-f3555f7dd980  1        12   no        20170206
034c1a6c-4f1c-aa36-a70d-f2245f7rr342  1        334  yes       20150302
00dc5fec-ddb8-45fa-9c86-77e09ff590a9  1        556  yes       20170201
000c1a6c-1f1c-45a6-a70d-f3555f7dd980  2        123  no        20170206
Edit: this is what the actual data looks like, plus 106 more columns of data and 20,000 records.
The larger version number is the latest version of that record. I am having a difficult time thinking of the logic to get the latest record based on version and dumping that into a dictionary. I am pulling the info from the csv into a blank list, but if anyone could give me some guidance on some of the logic moving forward, I would appreciate it.
import csv
import pprint
from collections import defaultdict

reader = csv.DictReader(open('rpm_inv.csv', 'rb'))
allData = list(reader)

dict_list = []
for line in allData:
    dict_list.append(line)
pprint.pprint(dict_list)
I'm not exactly sure what you want your output to look like, but this might point you at least in the right direction, as long as you're not opposed to pandas.
import pandas as pd
df = pd.read_csv('rpm_inv.csv', header=0)
by_version = df.groupby('Version')
latest = by_version.max()
# To put it into a dictionary of {version:ID}
{v:row['ID'] for v, row in latest.iterrows()}
There's no need for anything fancy.
defaultdict is included in Python's standard library. It's an improved dictionary. I've used it here because it obviates the need to initialise entries in a dictionary. This means that I can write, for instance, result[id] = max(result[id], version). If no entry exists for id then defaultdict creates one and puts version in it (because it's obvious that this will be the maximum).
I read through the lines in the input file, one at a time, discarding end-lines and blanks, splitting on the commas, and then using map to apply the int function to each string produced.
I ignore the first line in the file simply by reading it and assigning its contents to a variable that I have arbitrarily called ignore.
Finally, just to make the results more intelligible, I sort the keys in the dictionary, and present the contents of it in order.
>>> from collections import defaultdict
>>> result = defaultdict(int)
>>> with open('to_dict.txt') as input:
... ignore = input.readline()
... for line in input:
... id, version = map(int, line.strip().replace(' ', '').split(','))
... result[id] = max(result[id], version)
...
>>> ids = list(result.keys())
>>> ids.sort()
>>> for id in ids:
... id, result[id]
...
(3, 1)
(11, 3)
(20, 2)
(400, 2)
EDIT: With that much data it becomes a different question, in my estimation, better processed with pandas.
I've put the df.groupby(['ful_id']).version.idxmax() bit in to demonstrate what I've done. I group on ful_id, then ask for the maximum value of version and the index of the maximum value, all in one step using idxmax. Although pandas displays this as a two-column table, the result is actually a Series of integer row indices that I can use to select rows from the dataframe.
That's what I do with df.iloc[df.groupby(['ful_id']).version.idxmax(),:]. Here the df.groupby(['ful_id']).version.idxmax() part identifies the rows, and the : part identifies the columns, namely all of them.
Thanks for an interesting question!
>>> import pandas as pd
>>> df = pd.read_csv('different.csv', sep='\s+')
>>> df
ful_id version xs at_grade date
0 000c1a6c-1f1c-45a6-a70d-f3555f7dd980 3 123 yes 20171003
1 000c1a6c-1f1c-45a6-a70d-f3555f7dd980 1 12 no 20170206
2 034c1a6c-4f1c-aa36-a70d-f2245f7rr342 1 334 yes 20150302
3 00dc5fec-ddb8-45fa-9c86-77e09ff590a9 1 556 yes 20170201
4 000c1a6c-1f1c-45a6-a70d-f3555f7dd980 2 123 no 20170206
>>> df.groupby(['ful_id']).version.idxmax()
ful_id
000c1a6c-1f1c-45a6-a70d-f3555f7dd980 0
00dc5fec-ddb8-45fa-9c86-77e09ff590a9 3
034c1a6c-4f1c-aa36-a70d-f2245f7rr342 2
Name: version, dtype: int64
>>> new_df = df.iloc[df.groupby(['ful_id']).version.idxmax(),:]
>>> new_df
ful_id version xs at_grade date
0 000c1a6c-1f1c-45a6-a70d-f3555f7dd980 3 123 yes 20171003
3 00dc5fec-ddb8-45fa-9c86-77e09ff590a9 1 556 yes 20170201
2 034c1a6c-4f1c-aa36-a70d-f2245f7rr342 1 334 yes 20150302
I've created a tuple generator that extracts information from a file, filtering only the records of interest and converting each one to a tuple that the generator returns.
I've tried to create a DataFrame from it:
import pandas as pd
df = pd.DataFrame.from_records(tuple_generator, columns = tuple_fields_name_list)
but it throws an error:
...
C:\Anaconda\envs\py33\lib\site-packages\pandas\core\frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
1046 values.append(row)
1047 i += 1
-> 1048 if i >= nrows:
1049 break
1050
TypeError: unorderable types: int() >= NoneType()
I managed to get it to work by consuming the generator into a list, but that uses twice the memory:
df = pd.DataFrame.from_records(list(tuple_generator), columns = tuple_fields_name_list)
The files I want to load are big, and memory consumption matters. On my last try, my computer spent two hours trying to grow its virtual memory :(
The question: does anyone know a method to create a DataFrame directly from a record generator, without first converting it to a list?
Note: I'm using python 3.3 and pandas 0.12 with Anaconda on Windows.
Update:
It's not a problem of reading the file; my tuple generator does that well. It scans a compressed text file of intermixed records line by line, converts only the wanted data to the correct types, and then yields the fields as a generator of tuples.
Some numbers: it scans 2111412 records in a 130MB gzip file, about 6.5GB uncompressed, in about a minute, with little memory used.
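The generator itself isn't shown above; purely for illustration, it is something along these lines (the field names, layout and filter are made up):
import gzip

def tuple_generator(path):
    """Yield (id, timestamp, value) tuples for the records of interest only."""
    with gzip.open(path, 'rt') as fh:            # gzipped text file, read line by line
        for line in fh:
            fields = line.rstrip('\n').split('\t')
            if fields[0] != 'WANTED':            # stand-in for the real filtering rule
                continue
            yield int(fields[1]), fields[2], float(fields[3])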
Pandas 0.12 does not allow generators; the dev version allows them, but it puts the whole generator into a list and then converts that to a frame. It's not efficient, but it's something pandas has to deal with internally. Meanwhile, I must think about buying some more memory.
You certainly can construct a pandas.DataFrame() from a generator of tuples, as of version 0.19 (and probably earlier). Don't use .from_records(); just use the constructor, for example:
import pandas as pd
someGenerator = ( (x, chr(x)) for x in range(48,127) )
someDf = pd.DataFrame(someGenerator)
Produces:
type(someDf) #pandas.core.frame.DataFrame
someDf.dtypes
#0 int64
#1 object
#dtype: object
someDf.tail(10)
# 0 1
#69 117 u
#70 118 v
#71 119 w
#72 120 x
#73 121 y
#74 122 z
#75 123 {
#76 124 |
#77 125 }
#78 126 ~
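If you also want named columns, the constructor accepts a columns argument as well (the names below are just for illustration):
someGenerator = ((x, chr(x)) for x in range(48, 127))   # recreate it; the first one is exhausted
someDf = pd.DataFrame(someGenerator, columns=['code', 'char'])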
You cannot create a DataFrame from a generator with the 0.12 version of pandas. You can either update to the development version (get it from GitHub and compile it, which is a little bit painful on Windows, but I would prefer this option).
Or you can, since you said you are filtering the lines, first filter them, write them to a file and then load them using read_csv or something else...
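A rough sketch of that filter-then-load route (the file names, header and filter condition are stand-ins):
import pandas as pd

# Write only the interesting lines to an intermediate CSV, then let read_csv do the parsing.
with open('raw_input.txt') as src, open('filtered.csv', 'w') as dst:
    dst.write('col1,col2\n')              # header row for read_csv
    for line in src:
        if line.startswith('foo'):        # stand-in for the real filtering rule
            dst.write(line)

df = pd.read_csv('filtered.csv')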
If you want to get super complicated you can create a file like object that will return the lines:
def gen():
    lines = [
        'col1,col2\n',
        'foo,bar\n',
        'foo,baz\n',
        'bar,baz\n'
    ]
    for line in lines:
        yield line

class Reader(object):
    def __init__(self, g):
        self.g = g

    def read(self, n=0):
        try:
            return next(self.g)
        except StopIteration:
            return ''
And then use the read_csv:
>>> pd.read_csv(Reader(gen()))
col1 col2
0 foo bar
1 foo baz
2 bar baz
To get it to be memory efficient, read in chunks. Something like this, using Viktor's Reader class from above.
df = pd.concat(list(pd.read_csv(Reader(gen()), chunksize=10000)), axis=0)
You can also use something like (Python tested in 2.7.5)
from itertools import izip

def dataframe_from_row_iterator(row_iterator, colnames):
    col_iterator = izip(*row_iterator)
    return pd.DataFrame({cn: cv for (cn, cv) in izip(colnames, col_iterator)})
You can also adapt this to append rows to a DataFrame.
--
Edit, Dec 4th: s/row/rows in last line
If the generator is just like a list of DataFrames, you just need to create a new DataFrame by concatenating the elements of the list:
result = pd.concat(list)
Recently I've faced the same problem.