My goal:
I have a data structure in C++ which holds strings (or, more accurately, a multi-dimensional char array). I wish to expose this structure to Python via NumPy and pandas. Eventually the goal is to let the user modify a dataframe in a way that actually modifies the underlying C++ data structure.
What I've accomplished so far:
I've wrapped the C++ data structure with a 2D numpy array (via the PyArray_New API call) and returned it to Python. Then, inside Python, I'm using the pandas.DataFrame(data=ndarray, columns=columns, copy=False) constructor to wrap the ndarray in a pandas dataframe without copying any data.
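As a quick sanity check that the wrap really is zero-copy, numpy's shares_memory helper can compare the wrapped array against what the dataframe holds. This is only a sketch; arr and columns below stand for the array and column names handed over by the C++ extension:

import numpy as np
import pandas as pd

# arr: the 2D array returned by the extension via PyArray_New (illustrative name)
df = pd.DataFrame(data=arr, columns=columns, copy=False)
print(np.shares_memory(arr, df.values))   # should print True if pandas kept a view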
I've also managed to modify a single column. For example, I've turned its strings into lower case in the following way:
tmp = df["Some_field"].str.decode('ascii').str.lower().str.encode('ascii')
df["Some_field"][:] = tmp
The problem:
I'm now trying to make multiple columns lower-case. I thought it would be straightforward, but I've been struggling with this for a while since the manipulations do not change the underlying numpy arrays.
What I've tried to solve the problem:
1.
fields_to_change = [...]
for field in fields_to_change:
    tmp = df[field].str.decode('ascii').str.lower().str.encode('ascii')
    df[field][:] = tmp
This yields SettingWithCopyWarning and the underlying structure is changed only for the first field in "fields_to_change".
2.
fields_to_change = [...]
for field in fields_to_change:
    tmp = df[field].str.decode('ascii').str.lower().str.encode('ascii')
    df.loc[:, field] = tmp[:]
This runs without errors or warnings but, again, the underlying data is not changed.
3.
fields_to_change = [...]
for field in fields_to_change:
    tmp = df[field].str.decode('ascii').str.lower().str.encode('ascii')
    np.copyto(dst=df[field].values, src=tmp.values, casting='unsafe')
This works perfectly and changes the underlying data. But this code is problematic from a different angle: the whole point is to expose pandas functionality that transparently modifies the underlying data. I could copy all values from the user's manipulated dataframe back into the arrays which hold the underlying data, but it would severely slow down my program.
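For what it's worth, a variation on 3 that skips the decode/lower/encode round-trip is to write through each column's backing array with numpy's bytes-aware string functions. This is only a sketch and relies on the same assumption as 3, namely that df[field].values is still a view onto the wrapped buffer:

import numpy as np

for field in fields_to_change:
    col = df[field].values           # view onto the wrapped buffer (same assumption as in 3)
    col[:] = np.char.lower(col)      # lower-case the bytes in place, without a pandas copy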
TLDR; my question is:
How can I use pandas to manipulate strings in certain columns so that the changes propagate to the underlying numpy arrays from which the dataframe was composed? Also, is there a way to make sure that the user cannot detach the dataframe from those underlying numpy arrays?
Thanks very much in advance.
Related
With the code below I'm trying to update the column df_test['placed'] to 1 when the if statement is triggered and a prediction is placed. I haven't been able to get this to update correctly though: the code runs without errors but the value is not set to 1 for the respective predictions placed.
df_test['placed'] = np.zeros(len(df_test))
for i in set(df_test['id']):
    mask = df_test['id'] == i
    predictions = lm.predict(X_test[mask])
    j = np.argmax(predictions)
    if predictions[j] > 0:
        df_test['placed'][mask][j] = 1
        print(df_test['placed'][mask][j])
Answering your question
Edit: changed suggestion based on comments
The assignment part of your code, df_test['placed'][mask][j] = 1, uses what is called chained indexing. In short, your assignment only changes a temporary copy of the DataFrame that gets immediately thrown away, and never changes the original DataFrame.
To avoid this, the rule of thumb when doing assignment is: use only one set of square brackets on a single DataFrame. For your problem, that should look like:
df_test.loc[mask.nonzero()[0][j], 'placed'] = 1
(I know the mask.nonzero() uses two sets of square brackets; actually nonzero() returns a tuple, and the first element of that tuple is an ndarray. But the dataframe only uses one set, and that's the important part.)
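If df_test does not use the default integer RangeIndex, the positions returned by nonzero() will not line up with the labels that .loc expects. A label-based variant of the same fix (just a sketch, still one set of brackets on the DataFrame) would be:

df_test.loc[df_test.index[mask][j], 'placed'] = 1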
Some other notes
There are a couple of notes I have on using pandas (& numpy).
Pandas & NumPy both have a feature called broadcasting. Basically, if you're assigning a single value to an entire array, you don't need to make an array of the same size first; you can just assign the single value, and pandas/NumPy automagically figures out for you how to apply it. So the first line of your code can be replaced with df_test['placed'] = 0, and it accomplishes the same thing.
Generally speaking when working with pandas & numpy objects, loops are bad; usually you can find a way to use some combination of broadcasting, element-wise operations and boolean indexing to do what a loop would do. And because of the way those features are designed, it'll run a lot faster too. Unfortunately I'm not familiar enough with the lm.predict method to say, but you might be able to avoid the whole for-loop entirely for this code.
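For example, if lm.predict can score every row of X_test in one call, and the rows of X_test line up with the rows of df_test (both of those are assumptions on my part), the loop could collapse into something like this sketch:

import pandas as pd

scores = pd.Series(lm.predict(X_test), index=df_test.index)   # one prediction per row (assumed)
best = scores.groupby(df_test['id']).idxmax()                 # row label of the top score per id
best = best[scores.loc[best].to_numpy() > 0]                  # keep only ids whose top score is positive
df_test['placed'] = 0
df_test.loc[best, 'placed'] = 1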
I've got the following code in python and I think I'd need some help optimizing it.
I'm reading in a few million lines of data, but then throwing most of them out if one coordinate per line does not fit my criterion.
The code is as follows:
def loadFargoData(dataname, thlimit):
    temp = np.loadtxt(dataname)
    return temp[np.abs(temp[:, 1]) < thlimit]
I've coded it as if it were C-style code, and of course in Python this is now crazy slow.
Can I throw out my temp object somehow? Or what other optimization can the Pythonian population help me with?
The data reader included in pandas might speed up your script: it reads faster than numpy's. Pandas will produce a dataframe object, which is easy to view as a numpy array (and easy to convert if preferred), so you can apply your condition in numpy (which looks efficient enough in your question).
import pandas as pd

def loadFargoData(dataname, thlimit):
    temp = pd.read_csv(dataname)   # returns a dataframe
    temp = temp.values             # returns a numpy array
    # the 2 lines above can be replaced by temp = pd.read_csv(dataname).values
    return temp[np.abs(temp[:, 1]) < thlimit]
You might want to check the pandas documentation to learn the function arguments you may need to read the file correctly (header, separator, etc.).
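For a whitespace-delimited, header-less numeric file of the kind np.loadtxt typically reads, the call would probably look something like the sketch below; the exact arguments depend on your file:

import numpy as np
import pandas as pd

def loadFargoData(dataname, thlimit):
    # sep=r'\s+' splits on any run of whitespace; header=None because there is no header row
    temp = pd.read_csv(dataname, sep=r'\s+', header=None).values
    return temp[np.abs(temp[:, 1]) < thlimit]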
I have a list of several large hdf5 files, each with a 4D dataset. I would like to obtain a concatenation of them on the first axis, as in, an array-like object that would be used as if all datasets were concatenated. My final intent is to sequentially read chunks of the data along the same axis (e.g. [0:100,:,:,:], [100:200,:,:,:], ...), multiple times.
Datasets in h5py share a significant part of the numpy array API, which allows me to call numpy.concatenate to get the job done:
files = [h5.File(name, 'r') for name in filenames]
X = np.concatenate([f['data'] for f in files], axis=0)
On the other hand, the memory layout is not the same, and memory cannot be shared among them (related question). Alas, concatenate will eagerly copy the entire content of each array-like object into a new array, which I cannot accept in my use case. The source code of the array concatenation function confirms this.
How can I obtain a concatenated view over multiple array-like objects, without eagerly reading them into memory? Slicing and indexing over this view should behave just as if I had a concatenated array.
I can imagine that writing a custom wrapper would work, but I would like to know whether such an implementation already exists as a library, or whether another solution to the problem is available and just as feasible. My searches so far have yielded nothing of this sort. I am also willing to accept solutions specific to h5py.
flist = [f['data'] for f in files] is a list of dataset objects. The actual data is in the h5 files and is accessible as long as those files remain open.
When you do
arr = np.concatenate(flist, axis=0)
I imagine concatenate first does
temp = [np.asarray(a) for a in flist]
that is, construct a list of numpy arrays. I assume np.asarray(f['data']) is the same as f['data'].value or f['data'][:] (as I discussed 2 yrs ago in the linked SO question). I should do some time tests comparing that with
arr = np.concatenate([a.value for a in flist], axis=0)
flist is a kind of lazy compilation of these data sets, in that the data still resides in the file and is accessed only when you do something more with it.
[a.value[:,:,:10] for a in flist]
would load a portion of each of those data sets into memory; I expect that a concatenate on that list would be the equivalent of arr[:,:,:10].
Generators or generator comprehensions are a form of lazy evaluation, but I think they have to be turned into lists before use in concatenate. In any case, the result of concatenate is always an array with all the data in a contiguous block of memory. It is never blocks of data residing in files.
You need to tell us more about what you intend to do with this large concatenated array of data sets. As outlined, I think you can construct arrays that contain slices of all the data sets. You could also perform other actions, as I demonstrated in the previous answer, but with an access-time cost.
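For what it's worth, here is a minimal sketch of the kind of custom wrapper the question alludes to. It only supports contiguous step-1 slices along the first axis and reads each file lazily; the class and attribute names are purely illustrative:

import numpy as np

class ConcatOnAxis0:
    """Present several array-likes as one array along axis 0, reading lazily on slicing."""
    def __init__(self, datasets):
        self.datasets = datasets
        self.offsets = np.cumsum([0] + [d.shape[0] for d in datasets])

    @property
    def shape(self):
        return (int(self.offsets[-1]),) + self.datasets[0].shape[1:]

    def __getitem__(self, sl):
        # this sketch only handles plain slices with step 1 over the first axis
        start, stop, _ = sl.indices(int(self.offsets[-1]))
        pieces = []
        for d, off in zip(self.datasets, self.offsets[:-1]):
            lo, hi = max(start - off, 0), min(stop - off, d.shape[0])
            if lo < hi:
                pieces.append(d[lo:hi])   # h5py reads only this slab from disk
        return np.concatenate(pieces, axis=0)

Used as X = ConcatOnAxis0([f['data'] for f in files]), reading X[0:100], X[100:200], ... pulls each chunk from disk on demand, which matches the sequential access pattern described in the question.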
My problem requires the incremental addition of rows into a sorted DataFrame (with a DateTimeIndex), but I'm currently unable to find an efficient way to do this. There doesn't seem to be any concept of an "insort".
I've tried appending the row and re-sorting in place, and I've also tried getting the insertion point with searchsorted, then slicing and concatenating to create a new DataFrame. Both are "too slow".
Is Pandas just not suited to jobs where you don't have all the data at once and instead get your data incrementally?
Solutions I've tried:
Concatenation
def insert_data(df, data, index):
    insertion_index = df.index.searchsorted(index)
    new_df = pandas.concat([df[:insertion_index], pandas.DataFrame(data, index=[index]), df[insertion_index:]])
    return new_df, insertion_index
Resorting
def insert_data(df, data, index):
    new_df = df.append(pandas.DataFrame(data, index=[index]))
    new_df.sort_index(inplace=True)
    return new_df
pandas is built on numpy, and numpy arrays are fixed-size objects. While there are numpy append and insert functions, in practice they construct new arrays from the old and new data.
There are 2 practical approaches to incrementally defining these arrays:
initialize a large empty array, and fill in values incrementally
incrementally create a Python list (or dictionary), and create the array from the completed list.
Appending to a Python list is a common and fast operation. There is also a list insert, but it is slower. For sorted inserts there are specialized Python tools (e.g. the bisect module).
Pandas may have added functions to deal with common creation scenarios. But unless it has coded something special in C it is unlikely to be faster than a more basic Python structure.
Even if you have to use Pandas features at various points along the incremental build, it might be best to create a new DataFrame on the fly from the underlying Python structure.
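A minimal sketch of the second approach, using the bisect module to keep incoming rows sorted in plain Python lists and only building a DataFrame when one is actually needed (the function names are illustrative):

import bisect
import pandas as pd

timestamps = []   # kept sorted
rows = []         # dicts of column values, kept in the same order as timestamps

def insert_data(timestamp, data):
    pos = bisect.bisect_right(timestamps, timestamp)
    timestamps.insert(pos, timestamp)
    rows.insert(pos, data)

def to_frame():
    return pd.DataFrame(rows, index=pd.DatetimeIndex(timestamps))

list.insert is O(n), but shifting references in a list is far cheaper than rebuilding a DataFrame (and its index) on every insertion.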
I have a fragmented structure in memory and I'd like to access it as a contiguous-looking memoryview. Is there an easy way to do this or should I implement my own solution?
For example, consider a file format that consists of records. Each record has a fixed-length header that specifies the length of the record's content. A higher-level logical structure may spread over several records. Implementing the higher-level structure would be easier if it could see its own fragmented memory locations as a simple contiguous array of bytes.
Update:
It seems that python supports this 'segmented' buffer type internally, at least based on this part of the documentation. But this is only the C API.
Update2:
As far as I can see, the referenced C API - called old-style buffers - does what I need, but it's now deprecated and unavailable in newer versions of Python (3.x). The new buffer protocol - specified in PEP 3118 - offers a new way to represent buffers. This API is more usable in most use cases (among them, cases where the represented buffer is not contiguous in memory), but it does not support this specific one, where a one-dimensional array may be laid out completely freely (multiple differently sized chunks) in memory.
First, I am assuming you are just trying to do this in pure Python rather than in a C extension. So I am assuming you have loaded the different records you are interested in into a set of Python objects, and your problem is that you want to see the higher-level structure that is spread across these objects, with bits here and there throughout them.
So can you not simply load each of the records into a bytearray? You can then use Python slicing to create a new array that has just the data for the high-level structure you are interested in. You will then have a single bytearray with just the data you care about and can print it out or manipulate it in any way you want.
So something like:
a = bytearray(b"Hello World")     # put your records into byte arrays like this
b = bytearray(b"Stack Overflow")
complexStructure = bytearray(a[0:6] + b[0:])  # slice and join arrays to form a new array
                                              # with just the data from your high level entity
print(complexStructure)
Of course you will still need to know where within the records your high-level structure is in order to slice the arrays correctly, but you would need to know this anyway.
EDIT:
Note that taking a slice of a list does not copy the data in the list; it just creates a new set of references to the data, so:
>>> a = [1,2,3]
>>> b = a[1:3]
>>> id(a[1])
140268972083088
>>> id(b[0])
140268972083088
However, changes to the list b will not change a, as b is a new list. To have the changes automatically appear in the original list, you would need to make a more complicated object that contains the lists of the original records and hides them in such a way as to decide which list, and which element of that list, to change or view when a user looks to modify/view the complex structure. So something like:
class ComplexStructure():
    def __init__(self):
        self.listofrecords = []

    def add_records(self, record):
        self.listofrecords.append(record)

    def get_value(self, position):
        listnum, posinlist = ...  # formula to figure out which list and where in
                                  # that list the element of the complex structure is
        return self.listofrecords[listnum][posinlist]

    def set_value(self, position, value):
        listnum, posinlist = ...  # formula to figure out which list and where in
                                  # that list the element of the complex structure is
        self.listofrecords[listnum][posinlist] = value
Granted, this is not the simple way of doing things you were hoping for, but it should do what you need.
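One illustrative way to fill in the elided position formula above, assuming each record supports len(), is a helper that walks the record list:

    def _locate(self, position):
        # walk the records until the offset falls inside one of them
        for listnum, record in enumerate(self.listofrecords):
            if position < len(record):
                return listnum, position
            position -= len(record)
        raise IndexError("position out of range")

get_value and set_value could then start with listnum, posinlist = self._locate(position).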