I've just started learning numpy to analyse experimental data, and I am trying to give custom names to the columns of an ndarray so that I can do operations without slicing.
The data I receive is generally a two-column .txt file (I'll call the columns X and Y for the sake of clarity) with a large number of rows corresponding to the measured data. I do operations on those data and generate new columns (I'll call them F(X,Y), G(X,Y,F), ...). I know I can do column-wise operations by slicing, i.e. Data[:,2]=Data[:,1]+Data[:,0], but with a large number of added columns this becomes tedious. Hence I'm looking for a way to label the columns so that I can refer to a column by its label, and can also label the new columns I generate. So essentially I'm looking for something that'll allow me to directly write F=X+Y (as a substitute for the example above).
Currently, I'm assigning each column to a new variable, doing the operations, and then hstack-ing the result back onto the data, but I'm unsure of the memory usage here. For example,
X = Data[:, 0]
Y = Data[:, 1]
F = X + Y
Data = numpy.hstack((Data, F.reshape(n, 1)))  # n is the number of rows
I've seen the use of structured arrays and record arrays, but the data I'm dealing with is homogeneous and new columns are being added continuously. Also, I hear Pandas is well suited for what I'm describing, but since I'm working with numerical data, I don't see the need to learn a new module unless it's really necessary. Thanks in advance.
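To make the goal concrete, here is a rough sketch of the kind of labelled access I'm after, using a plain dict of named 1-D arrays (the file name and labels are just placeholders):

import numpy as np

# Placeholder file: two measured columns, many rows.
data = np.loadtxt('measurement.txt')          # shape (n, 2)

# Keep the columns in a dict keyed by name instead of slicing by index.
cols = {'X': data[:, 0], 'Y': data[:, 1]}

# New columns are just new dict entries; no hstack/reshape needed.
cols['F'] = cols['X'] + cols['Y']
cols['G'] = cols['X'] * cols['F']

# If a single 2-D array is needed later, stack the named columns once.
combined = np.column_stack([cols[name] for name in ('X', 'Y', 'F', 'G')])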
Related
I have a few Pandas dataframes with several millions of rows each. The dataframes have columns containing JSON objects each with 100+ fields. I have a set of 24 functions that run sequentially on the dataframes, process the JSON (for example, compute some string distance between two fields in the JSON) and return a JSON with some new fields added. After all 24 functions execute, I get a final JSON which is then usable for my purposes.
I am wondering what the best ways are to speed up processing of this dataset. A few things I have considered and read up on:
It is tricky to vectorize because many operations are not as straightforward as "subtract this column's values from another column's values".
I read up on some of the Pandas documentation and a few options indicated are Cython (may be tricky to convert the string edit distance to Cython, especially since I am using an external Python package) and Numba/JIT (but this is mentioned to be best for numerical computations only).
Possibly controlling the number of threads could be an option. The 24 functions can mostly operate without any dependencies on each other.
You are asking for advice, and this is not the best site for general advice, but I will nevertheless try to point a few things out.
The ideas you have already considered are not going to help much: neither Cython, Numba, nor threading addresses the main problem, which is that the format of your data is not conducive to fast operations on it.
I suggest that you first "unpack" the JSONs that you store in the column(s?) of your dataframe. Preferably, each field of the JSON (mandatory or optional; deal with empty values at this stage) ends up being a column of the dataframe. If there are nested dictionaries, you may want to consider splitting the dataframe (particularly if the 24 functions work separately on separate nested JSON dicts). Alternatively, you should strive to flatten the JSONs. (A short sketch follows these steps.)
Convert to the data format that gives you the best performance. JSON stores everything as text, while numbers are best handled in their binary form. You can do that column by column on the columns you suspect should be converted, using df['col'].astype(...) (it works on the whole dataframe too).
Update the 24 functions to operate not on JSON strings stored in the dataframe but on its columns.
Recombine the JSONs for storage (I assume you need them in this format). At this stage the implicit conversion from numbers to strings will occur.
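As a rough illustration of the unpacking and conversion steps above (the column name 'payload' and the JSON fields are hypothetical, and the 24 functions are elided):

import json
import pandas as pd

# Hypothetical column 'payload' holding the raw JSON strings.
df = pd.DataFrame({'payload': ['{"a": "1.5", "b": {"c": "2"}}',
                               '{"a": "3.0", "b": {"c": "4"}}']})

# Unpack/flatten the JSON into ordinary columns ('a' and 'b.c').
flat = pd.json_normalize(df['payload'].apply(json.loads).tolist())

# Convert the textual numbers into binary dtypes.
flat = flat.astype({'a': 'float64', 'b.c': 'int64'})

# The 24 functions would now work on flat['a'], flat['b.c'], ...
# and the result can be serialized back to JSON strings at the very end:
out = flat.to_json(orient='records')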
Given the level of detail you provided in the question, the suggestions are necessarily brief. Should you have more detailed questions about any of the above points, it would be best to ask a maximally simple question on each of them (preferably containing a self-sufficient MWE).
I have multiple Excel files with different sheets in each file. These files have been made by people, so each one has a different format, a different number of columns, and a different structure for representing the data.
For example, in one sheet the dataframe/table starts at the 8th row, second column. In another it starts at row 122, etc.
What I want to retrieve from these Excel files is common to all of them: variable names and their associated information.
However, I don't know how I could retrieve all this information without parsing each individual file by hand, which is not an option because there are a lot of these files, with lots of sheets in each one.
I have been thinking about using regex as well as edit distance between words, but I don't know if that is the best option.
Any help is appreciated.
I will divide my answer into what I think you can do now, and suggestions for the future (if feasible).
An attempt to "solve" the problem you have with existing files.
Without regularity in your input files (such as at least a common column name), I think what you're describing is among the best solutions. Having said that, a "fancier" similarity metric between column names may be more useful than regular expressions.
If you believe that there will be some regularity in the column names, you could look at string distances such as the Hamming distance or the Levenshtein distance, and use a threshold on the distance that works for you. As an example, say you have a function d(a: str, b: str) -> float that calculates a distance between column names; you could do something like this:
# this variable is a small sample of "expected" column names
plausible_columns = [
    'interesting column',
    'interesting',
    'interesting-column',
    'interesting_column',
]

for f in excel_files:
    # process the file until you find columns;
    # I'm assuming you can put the column names into
    # a variable `columns` here.
    for c in columns:
        for p in plausible_columns:
            if d(c, p) < threshold:
                # do something to process the column:
                # add to a pandas DataFrame, calculate the mean, etc.
                ...
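For what it's worth, d does not have to be anything sophisticated to start with; a minimal stand-in based on the standard library's difflib (the threshold value is just a guess to tune on your real column names) could be:

import difflib

def d(a: str, b: str) -> float:
    # 0.0 for identical strings, 1.0 for completely dissimilar ones
    return 1.0 - difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

threshold = 0.3  # hypothetical value; tune it on a sample of real names
print(d('Interesting Column', 'interesting_column'))  # small, so likely a match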
If the data itself can tell you whether you should process it (such as following a particular distribution, or lying in a particular range), you can use such features to decide whether to use that column or not. Even better, you can combine several of these characteristics to make a finer decision.
Having said this, I don't think a fully automated solution exists without inspecting some of the data manually and studying the distribution of the data, the variability in the column names, etc.
For the future
Even with fancy methods to calculate features and some analysis of the data you have right now, I think it is impossible to guarantee that you will always get the data you need (by the very nature of the problem). A reasonable way to solve this, in my opinion (if it is feasible in whatever context you're working in), is to impose a stricter format at the data-generation end (I suppose this is a manual process, with people entering data into Excel directly). I would argue that the best solution is to get rid of the problem at the root: create a unified form or Excel sheet template and distribute it to the people who will fill in the data, so that the data can be ingested automatically with a minimal risk of errors.
I'm relatively new to data analysis using Python and I'm trying to determine the most practical and useful way to read in my data so that I can index into it and use it in calculations. I have many images in the form of np.arrays that each have a corresponding set of data such as x- and y-coordinates, size, filter number, etc. I just want to make sure each set of data is grouped together with its corresponding image. My first thought was sticking the data in an np.array of dataclass instances (where each element of the array is an instance that contains all my data). My second thought was a pandas dataframe.
My gut is telling me that using a dataframe makes more sense. Do np.arrays store nicely inside dataframes? What are the pros/cons to each method and which would be best if I will need to be pulling data from them often, and I always need to make sure the data can be matched with its corresponding image?
The variables I have to read in: x_coord (float), y_coord (float), filter (int), image (np.ndarray).
I've been trying to put the image arrays into a pandas dataframe, but when I index into it using .loc the Jupyter Notebook cell is extremely slow to run. Populating the dataframe with .from_dict() was also very slow. I'm guessing dataframes weren't meant to hold np.ndarrays?
My biggest concerns are bookkeeping and ease of indexing: what can I do to make sure I can always retrieve the metadata for the corresponding image? In what form should my data be so that I can easily extract an image and its metadata, or all images with the same filter number, etc.?
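To make the bookkeeping concern concrete, here is a rough sketch of one arrangement I've been considering (all names and values are made up): metadata in a dataframe, with the image arrays held separately and matched through a shared id.

import numpy as np
import pandas as pd

# Image arrays keyed by an integer id.
images = {0: np.zeros((64, 64)), 1: np.ones((128, 128))}

# Metadata dataframe indexed by the same id.
meta = pd.DataFrame({
    'image_id': [0, 1],
    'x_coord': [10.5, 3.2],
    'y_coord': [7.1, 8.8],
    'filter': [2, 3],
}).set_index('image_id')

meta.loc[0]                              # all metadata for one image
ids = meta.index[meta['filter'] == 2]    # ids of images with filter 2
selected = [images[i] for i in ids]      # ...and their arrays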
Suppose you have the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.nan, columns=['A', 'B', 'C'], index=[0, 1, 2])
Suppose I want an additional "layer" on top of this pandas dataframe, such that column A, row 0 has its own value in the new layer, column B, row 0 has a different value, column C, row 0 has something, and likewise for row 1 and so on. In other words, a dataframe on top of this existing one.
Is it possible to add other layers? How does one access these layers? Is this efficient, i.e. should I just use a separate dataframe altogether? And would one save these multiple layers as a csv by accessing the individual layers, or is there a function that would break them down into different worksheets in the same workbook?
pandas.DataFrame cannot have 3 dimensions:
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.
However, there is a way to fake 3-dimensions with MultiIndex / Advanced Indexing:
Hierarchical indexing (MultiIndex)
Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).
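As a minimal sketch of faking a third dimension with a MultiIndex (the layer names here are made up), continuing the example from the question:

import numpy as np
import pandas as pd

# Two "layers" stacked into the row index: (layer, row) pairs.
index = pd.MultiIndex.from_product([['base', 'overlay'], [0, 1, 2]],
                                   names=['layer', 'row'])
df = pd.DataFrame(np.nan, index=index, columns=['A', 'B', 'C'])

base = df.loc['base']                # one layer as an ordinary 2-D frame
df.loc[('overlay', 0), 'A'] = 1.0    # set a value in the other layer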
If you really need that extra dimension go with pandas.Panel:
Panel is a somewhat less-used, but still important container for 3-dimensional data.
but don't miss this important disclaimer from the docs:
Note: Unfortunately Panel, being less commonly used than Series and DataFrame, has been slightly neglected feature-wise. A number of methods and options available in DataFrame are not available in Panel.
There is also pandas.Panel4D (experimental) in the unlikely event that you need it.
PyTables supports the creation of tables from user-defined classes that inherit from the IsDescription class. This includes support for multidimensional cells, as in the following example from the documentation:
from tables import IsDescription, StringCol, Int32Col, Float32Col, Float64Col

class Particle(IsDescription):
    name        = StringCol(itemsize=16)    # 16-character string
    lati        = Int32Col()                # integer
    longi       = Int32Col()                # integer
    pressure    = Float32Col(shape=(2, 3))  # array of floats (single-precision)
    temperature = Float64Col(shape=(2, 3))  # array of doubles (double-precision)
However, is it possible to store an arbitrarily-shaped multidimensional array in a single cell? Following the above example, something like pressure = Float32Col(shape=(x, y)) where x and y are determined upon the insertion of each row.
If not, what is the preferred approach? Storing each (arbitrarily-shaped) multidimensional array in a CArray with a unique name and then storing those names in a master index table? The application I'm imagining is storing images and associated metadata, which I'd like to be able to both query and use numexpr on.
Any pointers toward PyTables best practices are much appreciated!
The long answer is "yes, but you probably don't want to."
PyTables probably doesn't support it directly, but HDF5 does support the creation of nested variable-length datatypes, allowing ragged arrays in multiple dimensions. Should you wish to go down that path, you'll want to use h5py and browse through the HDF5 User's Guide, Datatypes chapter. See section 6.4.3.2.3, Variable-length Datatypes. (I'd link it, but they apparently chose not to put anchors that deep.)
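For a flavour of what that looks like in h5py, here is a minimal sketch with a single level of variable length, i.e. one ragged dimension per row (fully nested vlen types are more involved; the file and dataset names are made up):

import h5py
import numpy as np

# One variable-length float32 array per row (ragged in one dimension).
vlen_f32 = h5py.vlen_dtype(np.dtype('float32'))

with h5py.File('ragged.h5', 'w') as f:
    dset = f.create_dataset('pressure', shape=(3,), dtype=vlen_f32)
    dset[0] = np.arange(4, dtype='float32')   # length 4
    dset[1] = np.arange(7, dtype='float32')   # length 7
    dset[2] = np.zeros(2, dtype='float32')    # length 2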
Personally, the way that I would arrange the data you've got is into groups of datasets, not into a single table. That is, something like:
/particles/particlename1/pressure
/particles/particlename1/temperature
/particles/particlename2/pressure
/particles/particlename2/temperature
and so on. The lat and long values would be attributes on the /particles/particlename group rather than datasets, though having small datasets for them is perfectly fine too.
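In h5py, that layout could look roughly like this (the names and values are just placeholders):

import h5py
import numpy as np

with h5py.File('particles.h5', 'w') as f:
    grp = f.create_group('/particles/particlename1')
    grp.attrs['lati'] = 42        # small scalars as group attributes
    grp.attrs['longi'] = -71
    # each array becomes its own dataset, whatever its shape
    grp.create_dataset('pressure', data=np.random.rand(2, 3).astype('float32'))
    grp.create_dataset('temperature', data=np.random.rand(5, 7))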
If you want to be able to do searches based on the lat and long, then having a dataset with the lat/long/name columns would be good. And if you wanted to get really fancy, there's an HDF5 datatype for references, allowing you to store a pointer to a dataset, or even to a subset of a dataset.
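A tiny sketch of the reference idea in recent h5py (again with made-up names):

import h5py
import numpy as np

with h5py.File('refs.h5', 'w') as f:
    pressure = f.create_dataset('particles/p1/pressure', data=np.zeros((2, 3)))
    # a lookup dataset whose single column holds object references
    lookup = f.create_dataset('index', shape=(1,), dtype=h5py.ref_dtype)
    lookup[0] = pressure.ref
    target = f[lookup[0]]         # dereference back to the pressure dataset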
The short answer is "no", and I think it's a "limitation" of hdf5 rather than pytables.
I think the reason is that each unit of storage (the compound dataset row) must have a well-defined size; if one or more components can change size, it obviously cannot. Note that it is entirely possible to resize and extend a dataset in hdf5 (pytables makes heavy use of this), but not the units of data within that array.
I suspect the best thing to do is either:
a) make it a well-defined size and provide a flag for overflow. This works well if the largest reasonable size is still pretty small and you are okay with tail events being thrown out. Note you might be able to get rid of the unused disk space with hdf5 compression.
b) do as you suggest and create a new CArray in the same file, then just read it in when required (to keep things tidy you might want to put these all under their own group), as sketched below.
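A rough sketch of option (b) with a master index table (names, shapes, and metadata fields are just placeholders):

import numpy as np
import tables

class ImageIndex(tables.IsDescription):
    name  = tables.StringCol(itemsize=32)   # name of the CArray node
    lati  = tables.Int32Col()
    longi = tables.Int32Col()

with tables.open_file('images.h5', mode='w') as h5:
    arrays = h5.create_group('/', 'arrays')           # one CArray per image
    index = h5.create_table('/', 'index', ImageIndex)

    img = np.random.rand(480, 640).astype('float32')  # any shape per image
    h5.create_carray(arrays, 'img_0000', obj=img)

    row = index.row
    row['name'] = b'img_0000'     # string columns are stored as bytes
    row['lati'] = 42
    row['longi'] = -71
    row.append()
    index.flush()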
HDF5 actually has an API which is designed (and optimized) for storing images in an hdf5 file. I don't think it's exposed in pytables.