I have multiple data sources of financial data that I want to parse into a common data model.
API retrieval. Single format from single source (currently)
CSV files – multiple formats from multiple sources
Once cleaned and validated, the data is stored in a database (this is a Django project, but I don’t think that’s important for this discussion).
I have opted to use Pydantic for the data cleaning and validation, but am open to other options.
Where I’m struggling is with the preprocessing of the data, especially with the CSVs.
Each CSV has a different set of headers and data structure. Some CSVs contain all information in a single row, while others spread it across multiple rows. As you can tell, there are very specific rules for each data source based on its origin. I have a dict that maps all the header variations to the model fields, which I filter by source.
Currently, I’m loading the CSV into a Pandas data frame and using the groupby function to break the data up into blocks. I can then loop through the groups, modify the data based on its origin, and then assign the data to the appropriate columns to pass into a Pydantic BaseModel. After I did this, it seemed a bit pointless to be using Pydantic, as all the work was being done beforehand.
To make things more reusable, I thought of moving all the logic into the Pydantic BaseModel, passing the raw grouped data into a property and processing it into the appropriate data elements. But this just seems wrong.
As with most problems, I’m sure this has been solved before. I’m looking for some guidance on appropriate patterns for this style of processing. All of the examples I’ve found to date are based on a single input format.
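One common shape for this problem is a thin per-source adapter that only renames/reshapes, plus a single shared Pydantic model that owns validation. A minimal sketch, assuming a header-mapping dict per source as described in the question (all field and source names here are hypothetical):

```python
import pandas as pd
from pydantic import BaseModel

# Shared target model: validation lives here, not in the adapters.
class Transaction(BaseModel):
    trade_date: str
    amount: float
    currency: str

# One header-mapping dict per source, filtered by origin as in the question.
HEADER_MAPS = {
    "broker_a": {"Date": "trade_date", "Amt": "amount", "Ccy": "currency"},
    "broker_b": {"TradeDate": "trade_date", "Value": "amount", "Cur": "currency"},
}

def parse_csv(raw: pd.DataFrame, source: str) -> list[Transaction]:
    """Rename source-specific headers, then let Pydantic validate each row."""
    renamed = raw.rename(columns=HEADER_MAPS[source])
    return [Transaction(**row) for row in renamed.to_dict(orient="records")]
```

Multi-row sources would get their own adapter that collapses a group into one dict before the `Transaction(**row)` call; the validation stays in one place either way, so Pydantic keeps earning its keep.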
Related
I have a few Pandas dataframes with several millions of rows each. The dataframes have columns containing JSON objects each with 100+ fields. I have a set of 24 functions that run sequentially on the dataframes, process the JSON (for example, compute some string distance between two fields in the JSON) and return a JSON with some new fields added. After all 24 functions execute, I get a final JSON which is then usable for my purposes.
I am wondering what the best ways are to speed up processing of this dataset. A few things I have considered and read up on:
It is tricky to vectorize because many operations are not as straightforward as "subtract this column's values from another column's values".
I read up on some of the Pandas documentation and a few options indicated are Cython (may be tricky to convert the string edit distance to Cython, especially since I am using an external Python package) and Numba/JIT (but this is mentioned to be best for numerical computations only).
Possibly controlling the number of threads could be an option. The 24 functions can mostly operate without any dependencies on each other.
You are asking for advice, and this is not the best site for general advice, but nevertheless I will try to point a few things out.
The ideas you have already considered are not going to be helpful: neither Cython, Numba, nor threading will address the main problem, which is that the format of your data is not conducive to performant operations on it.
I suggest that you first "unpack" the JSONs that you store in the column(s?) of your dataframe. Preferably, each field of the JSON (mandatory or optional - deal with empty values at this stage) ends up being a column of the dataframe. If there are nested dictionaries you may want to consider splitting the dataframe (particularly if the 24 functions are working separately at separate nested JSON dicts). Alternatively, you should strive to flatten the JSONs.
Convert to the data format that gives you the best performance. JSON stores all data as text; numbers are best handled in their binary format. You can do that column-wise on the columns that you suspect should be converted, using df['col'].astype(...) (it works on the whole dataframe too).
Update the 24 functions to operate not on JSON strings stored in dataframe but on the fields of the dataframe.
Recombine the JSONs for storage (I assume you need them in this format). At this stage the implicit conversion from numbers to strings will occur.
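The unpack → convert → recombine flow above might look roughly like this (column and field names are hypothetical, and the "processing" step is reduced to one vectorized operation for illustration):

```python
import json
import pandas as pd

df = pd.DataFrame({"payload": [
    '{"id": "1", "price": "10.5", "name": "foo"}',
    '{"id": "2", "price": "20.0", "name": "bar"}',
]})

# 1. Unpack: one dataframe column per JSON field.
fields = pd.json_normalize(list(df["payload"].map(json.loads)))

# 2. Convert textual numbers to a binary dtype.
fields["price"] = fields["price"].astype(float)

# 3. The processing functions now operate column-wise on the dataframe.
fields["price_with_tax"] = fields["price"] * 1.2

# 4. Recombine into JSON strings for storage (numbers become text again here).
df["payload_out"] = [json.dumps(rec) for rec in fields.to_dict(orient="records")]
```
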
Given the level of detail you provided in the question, these suggestions are necessarily brief. Should you have more detailed questions on any of the above points, it would be best to ask a maximally simple question about each of them (preferably containing a self-sufficient MWE).
I am working with large textual data (articles containing words, symbols, escape characters, line breaks, etc.). Each article also has attributes like date, author, etc.
It will be used in Python for NLP purposes. What is the best way to store this data so that it can be read efficiently from disk?
EDIT :
I have loaded the data as a pandas dataframe in Python.
Storing as a CSV results in corruption due to line breaks (\n) in the text, making the data unusable.
Storing as JSON is not working due to encoding problems.
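For what it's worth, pandas quotes fields containing line breaks by default when writing CSV, so a round-trip check like the sketch below may be worth running before ruling CSV out (binary formats such as pickle via `to_pickle`, or Parquet via pyarrow, are common alternatives for faster reads):

```python
import io
import pandas as pd

df = pd.DataFrame({
    "author": ["Jane"],
    "date": ["2024-01-02"],
    "text": ["First line\nSecond line, with a comma"],
})

buf = io.StringIO()
df.to_csv(buf, index=False)  # embedded newlines are wrapped in quotes automatically
buf.seek(0)

restored = pd.read_csv(buf)  # read_csv honors quoted newlines on the way back
```
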
I am working on a data analysis project based on Pandas. The data to be analyzed is collected from application log files. Log entries are based on sessions, which can be of different types (and can have different actions); each session can in turn have multiple services (also with different types, actions, etc.). I have transformed the log file entries into a pandas dataframe and, based on that, completed all required calculations. At this moment that's around a few hundred different calculations, which are printed to stdout at the end. If an anomaly is found, it is specifically flagged. So, the basic functionality is there, but now that this first phase is done, I'm not happy with the readability of the code, and it seems to me that there must be a way to make the code better organized.
For example what I have at the moment is:
def build(log_file):
    # build dataframe from log file entries
    return df

def transform(df):
    # transform dataframe (for example based on grouped sessions, services)
    return transformed_df

def calculate(transformed_df):
    # make calculations based on transformed dataframe and print them to stdout
    print(calculation1)
    print(calculation2)
    # etc.
Since there are numerous criteria for filtering the data, there are at least 30-40 different dataframe filters present. They are used in both the calculate and transform functions. In the calculate functions I also have some helper functions which perform tasks applicable to similar session/service types, where the result is based on the dataframe filtered for that specific type. With all these requirements, transformations, and filters, I now have more than 1000 lines of code, which, as I said, I feel could be more readable.
My current idea is to have perhaps classes organized like this:
class Session:
    # main class for sessions (it can be inherited by other session types), also with standardized output for calculations

class Service:
    # main class for services (it can be inherited by other service types), also with standardized output for calculations, etc.

class Dataframe:
    # dataframe class with filters, etc.
But I'm not sure if this is a good approach. I tried searching here, on GitHub, and on various blogs, but I didn't find anything that provides examples of the best way to organize code in more-than-basic pandas projects. I would appreciate any suggestion that would point me in the right direction.
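A lightweight version of the class idea sketched above, with the per-type filter captured in the class and calculations returned as data rather than printed (all column names and rules here are hypothetical placeholders for the real log schema):

```python
import pandas as pd

class SessionReport:
    """Bundles the filter and calculations for one session type.

    Subclasses override `session_type` (and any calculation methods)
    to specialize behavior per type, keeping output standardized.
    """

    session_type = "login"

    def __init__(self, df: pd.DataFrame):
        # Apply the per-type filter once, up front.
        self.df = df[df["session_type"] == self.session_type]

    def calculations(self) -> dict:
        # Returning a dict (instead of printing) makes results easy to
        # collect, test, and flag for anomalies in one place.
        return {
            "count": len(self.df),
            "actions_per_session":
                self.df.groupby("session_id")["action"].count().mean(),
        }
```

The 30-40 filters then become named methods or class attributes instead of inline expressions, and the anomaly flagging can live in one loop over the collected dicts.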
I'm using the sample Python Machine Learning "IRIS" dataset (for starting point of a project). These data are POSTed into a Flask web service. Thus, the key difference between what I'm doing and all the examples I can find is that I'm trying to load a Pandas DataFrame from a variable and not from a file or URL which both seem to be easy.
I extract the IRIS data from Flask's POST request.values. All good. But at that point, I can't figure out how to get a pandas dataframe the way "pd.read_csv(....)" does. So far, it seems the only solution is to parse each row and build up several Series to use with the DataFrame constructor? There must be something I'm missing, since reading this data from a URL is a simple one-liner.
I'm assuming reading a variable into a Pandas DataFrame should not be difficult since it seems like an obvious use-case.
I tried wrapping with io.StringIO(csv_data), then following up with read_csv on that variable, but that doesn't work either.
Note: I also tried things like ...
data = pd.DataFrame(csv_data, columns=['....'])
but got nothing but errors (for example, "constructor not called correctly!")
I am hoping for a simple method to call that can infer the columns and names and create the DataFrame for me, from a variable, without me needing to know a lot about Pandas (just to read and load a simple CSV data set, anyway).
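For reference, the io.StringIO approach mentioned above does normally work, so it may be worth rechecking: wrapping the posted string and handing it to read_csv gives the same column and dtype inference as reading from a file or URL. A minimal sketch (the CSV content stands in for whatever was extracted from request.values):

```python
import io
import pandas as pd

# Hypothetical string as it might arrive in the POST body.
csv_data = "sepal_length,sepal_width,species\n5.1,3.5,setosa\n4.9,3.0,setosa\n"

# read_csv accepts any file-like object, so StringIO makes a string look
# like a file; columns and dtypes are inferred exactly as with a path/URL.
df = pd.read_csv(io.StringIO(csv_data))
```

If the DataFrame constructor was raising "DataFrame constructor not properly called!", that is typically because it was handed the raw string directly rather than a file-like object or parsed rows.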
So I'm trying to store Pandas DataFrames in HDF5 and getting strange errors, rather inconsistently. At least half the time, some part of the read-process-move-write cycle fails, often with no clearer explanation than "HDF5 Read Error". Even worse, sometimes the table ends up with nonsense/corrupted data that doesn't stop things until downstream -- either values that are off by orders of magnitude (and not even correlated with the correct ones) or dates that don't make sense (recent data mismarked as being dated in the 1750s...etc).
I thought I'd go through the current process and then the things that I suspect might be causing problems, in case that helps. Here's what it looks like:
Read some of the tables (call them "QUERY1" and "QUERY2") to see if they're up to date, and if they aren't,
Take the table that had been in the HDF5 store as "QUERY1" and store it as "QUERY1_YYYY_MM_DD" in the HDF5 store instead
Run the associated query on external database for that table. Each one is between 100 and 1500 columns of daily data back to 1980.
Store the result of query 1 as the new "QUERY1" in the HDF5 store
Compute several transformations of one or more of QUERY1, QUERY2,...QUERYn which will have hierarchical (Pandas MultiIndex) columns. Overwrite item "Derived_Frame1"...etc with its update/replacement in the HDF5 store
Multiple people with access to the relevant .h5 file on a Windows network drive run this routine -- potentially sometimes, but not usually, at the same time.
Some things I suspect could be part of the problem:
using the default format (df.to_hdf(store, key)) instead of insisting on "table" format with df.to_hdf(store, key, format='table'). I do this because the default format is between 2 and 5x faster on both read and write, according to %timeit
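As a point of comparison, the table format trades that speed for appendability and on-disk querying, which also makes "read only what you need" workflows possible instead of full rewrites. A minimal sketch of the trade-off (file and key names are placeholders):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12.0).reshape(4, 3),
    index=pd.date_range("2020-01-01", periods=4),
    columns=["x", "y", "z"],
)

# 'table' format: slower than the default 'fixed' format, but appendable
# and queryable on disk instead of requiring wholesale rewrites.
df.to_hdf("demo.h5", key="QUERY1", format="table", mode="w")

# With table format, a subset can be read back without loading everything:
subset = pd.read_hdf("demo.h5", key="QUERY1", where="index >= '2020-01-03'")
```

Note that neither format makes concurrent writers from multiple machines safe; HDF5 on a shared network drive generally needs external locking regardless of format.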
Using a network drive to allow several users to run this routine and access at least the derived frames. Not much I can do about this requirement, especially for read access to the derived dataframes at any time.
From the docs, it sounds like repeatedly deleting and re-writing an item in the HDF5 store can do weird things (at least gradually increasing the file size, not sure what else). Maybe I should be storing query archives in another file? Maybe I should be nuking and replacing the whole main file upon update?
Storing dataframes with MultiIndex columns in HDF5 in the first place -- this seems to be what gets me a "warning" under the default format, although it seems like the warning goes away if I use format='table'.
Edit: it is also possible/likely that different users running the routine above are using different versions of Pandas and different versions of PyTables.
Any ideas?