Still new to this, sorry if I ask something really stupid. What are the differences between a Python ordered dictionary and a pandas series?
The only difference I could think of is that an orderedDict can have nested dictionaries within the data. Is that all? Is that even true?
Would there be a performance difference between using one vs the other?
My project is a sales forecast, most of the data will be something like: {Week 1 : 400 units, Week 2 : 550 units}... Perhaps an ordered dictionary would be redundant since input order is irrelevant compared to Week#?
Again I apologize if my question is stupid, I am just trying to be thorough as I learn.
Thank you!
-Stephen
Most importantly, pd.Series is part of the pandas library so it comes with a lot of added functionality - see attributes and methods as you scroll down the pd.Series docs. This compares to OrderDict: docs.
For your use case, using pd.Series or pd.DataFrame (which could be a way of using nested dictionaries as it has an index and multiple columns) seem quite appropriate. If you take a look at the pandas docs, you'll also find quite comprehensive time series functionality that should come in handy for a project around weekly sales forecasts.
Since pandas is built on numpy, the specialized scientific computing package, performance is quite good.
Ordered dict is implemented as part of the python collections lib. These collection are very fast containers for specific use cases. If you would be looking for only dictionary related functionality (like order in this case) i would go for that. While you say you are going to do more deep analysis in an area where pandas is really made for (eg plotting, filling missing values). So i would recommend you going for pandas.Series.
Related
I have a few Pandas dataframes with several millions of rows each. The dataframes have columns containing JSON objects each with 100+ fields. I have a set of 24 functions that run sequentially on the dataframes, process the JSON (for example, compute some string distance between two fields in the JSON) and return a JSON with some new fields added. After all 24 functions execute, I get a final JSON which is then usable for my purposes.
I am wondering what the best ways to speed up performance for this dataset. A few things I have considered and read up on:
It is tricky to vectorize because many operations are not as straightforward as "subtract this column's values from another column's values".
I read up on some of the Pandas documentation and a few options indicated are Cython (may be tricky to convert the string edit distance to Cython, especially since I am using an external Python package) and Numba/JIT (but this is mentioned to be best for numerical computations only).
Possibly controlling the number of threads could be an option. The 24 functions can mostly operate without any dependencies on each other.
You are asking for advice and this is not the best site for general advice but nevertheless I will try to point a few things out.
The ideas you have already considered are not going to be helpful - neither Cython, Numba, nor threading are not going to address the main problem - the format of your data that is not conductive for performance of operations on the data.
I suggest that you first "unpack" the JSONs that you store in the column(s?) of your dataframe. Preferably, each field of the JSON (mandatory or optional - deal with empty values at this stage) ends up being a column of the dataframe. If there are nested dictionaries you may want to consider splitting the dataframe (particularly if the 24 functions are working separately at separate nested JSON dicts). Alternatively, you should strive to flatten the JSONs.
Convert to the data format that gives you the best performance. JSON stores all the data in the textual format. Numbers are best used in their binary format. You can do that column-wise on the columns that you suspect should be converted using df['col'].astype(...) (works on the whole dataframe too).
Update the 24 functions to operate not on JSON strings stored in dataframe but on the fields of the dataframe.
Recombine the JSONs for storage (I assume you need them in this format). At this stage the implicit conversion from numbers to strings will occur.
Given the level of details you provided in the question, the suggestions are necessarily brief. Should you have any more detailed questions at any of the above points, it would be best to ask maximally simple question on each of them (preferably containing a self-sufficient MWE).
I am working on a project that involves some larger-than-memory datasets, and have been evaluating different tools for working on a cluster instead of my local machine. One project that looked particularly interesting was dask, as it has a very similar API to pandas for its DataFrame class.
I would like to be taking aggregates of time-derivatives of timeseries-related data. This obviously necessitates ordering the time series data by timestamp so that you are taking meaningful differences between rows. However, dask DataFrames have no sort_values method.
When working with Spark DataFrame, and using Window functions, there is out-of-the-box support for ordering within partitions. That is, you can do things like:
from pyspark.sql.window import Window
my_window = Window.partitionBy(df['id'], df['agg_time']).orderBy(df['timestamp'])
I can then use this window function to calculate differences etc.
I'm wondering if there is a way to achieve something similar in dask. I can, in principle, use Spark, but I'm in a bit of a time crunch, and my familiarity with its API is much less than with pandas.
You probably want to set your timeseries column as your index.
df = df.set_index('timestamp')
This allows for much smarter time-series algorithms, including rolling operations, random access, and so on. You may want to look at http://dask.pydata.org/en/latest/dataframe-api.html#rolling-operations.
Note that in general setting an index and performing a full sort can be expensive. Ideally your data comes in a form that is already sorted by time.
Example
So in your case, if you just want to compute a derivative you might do something like the following:
df = df.set_index('timestamp')
df.x.diff(...)
Since I recently started a new project, I'm stuck in the "think before you code" phase. I've always done basic coding, but I really think I now need to carefully plan how I should organize the results that are produced by my script.
It's essentially quite simple: I have a bunch of satellite data I'm extracting from Google Earth Engine, including different sensors, different acquisition modes, etc. What I would like to do is to loop through a list of "sensor-acquisition_mode" couples, request the data, do some more processing, and finally save it to a variable or file.
Suppose I have the following example:
sensors = ['landsat','sentinel1']
sentinel_modes = ['ASCENDING','DESCENDING']
sentinel_polarization = ['VV','VH']
In the end, I would like to have some sort of nested data structure that at the highest level has the elements 'landsat' and 'sentinel1'; under 'landsat' I would have a time and values matrix; under 'sentinel1' I would have the different modes and then as well the data matrices.
I've been thinking about lists, dictionaries or classes with attributes, but I really can't make up my mind, also since I don't have that much of experience.
At this stage, a little help in the right direction would be much appreciated!
Lists: Don't use lists for nested and complex data structures. You're just shooting yourself in the foot- code you write will be specialized to the exact format you are using, and any changes or additions will be brutal to implement.
Dictionaries: Aren't bad- they'll nest nicely and you can use a dictionary whose value is a dictionary to hold named info about the keys. This is probably the easiest choice.
Classes: Classes are really really useful for this if you need a lot of behavior to go with them - you want the string of it to be represented a certain way, you want to be able to use primitive operators for some functionality, or you just want to make the code slightly more readable or reusable.
From there, it's all your choice- if you want to go through the extra code (it's good for you) of writing them as classes, do it! Otherwise, dictionaries will get you where you need to go. Notably the only thing a dictionary couldn't do would be if you have two things that should be at the key level in the dictionary with the same name (Dicts don't do repetition).
Question for experienced Pandas users on approach to working with Dataframe data.
Invariably we want to use Pandas to explore relationships among data elements. Sometimes we use groupby type functions to get summary level data on subsets of the data. Sometimes we use plots and charts to compare one column of data against another. I'm sure there are other application I haven't thought of.
When I speak with other fairly novice users like myself, they generally try to extract portions of a "large" dataframe into smaller dfs that are sorted or formatted properly to run applications or plot. This approach certainly has disadvantages in that if you strip out a subset of data into a smaller df and then want to run an analysis against a column of data you left in the bigger df, you have to go back and recut stuff.
My question is - is best practices for more experienced users to leave the large dataframe and try to syntactically pull out the data in such a way that the effect is the same or similar to cutting out a smaller df? Or is it best to actually cut out smaller dfs to work with?
Thanks in advance.
I am trying to create an interface between structured data and NLTK. NLP libraries generally work with bags of words, hence I need to turn my structured data into bags of words.
I need to associate the offset of a word with it's meta-data.Therefore my best bet is to have some sort of container that holds ranges as keys (allowing nested ranges) and can retrieve all the meta-data (multiple if the word offset is part of a nested range).
What code can I pickup that would do this efficiently (--i.e., sparse represention of the data ) ? Efficient because my global corpus will have at least a few hundred megabytes.
Note :
I am serialising structured forum posts. which will include posts with sections of quotes with them. I want to know which topic a word belonged to, and weather it's a quote or user-text. There will probably be additional metadata as my work progresses. Note that a word belonging to a quote is what I meant by nested meta-data, so the word is part of a quote, that belongs to a post made by a user.
I know that one can tag words in NLTK I haven't looked into it, if its possible to do what I want that way please comment. But I am still looking for the original approach.
There is probably something in numpy that can solve my problem, looking at that now
edit
The input data is far too complex to rip out and post. I have found what I was looking for tho http://packages.python.org/PyICL/. I needed to talk about intervals and not ranges :D I have used boost extensively, however making that a dependency makes me a bit uneasy (Sadly, I am having compiler errors with PyICL :( ).
The question now is: anyone know an interval container library or data structure that can be used to index nested intervals in a sparse fashion. Or put differently provides similar semantics to boost.icl
If you don't want to use PyICL or boost.icl Instead of relying on a specialized library you could just use sqlite3 to do the job ? If you use an in0memory version it will still be a few orders of magnitudes slower than boost.icl (from experience coding other data structures vs sqlite3) but should be more effective than using a c++ std::vector style approach on top of python containers.
You can use two integers and have date_type_low < offset < date_type_high predicate in your where clause. And depending on your table structure this will return nested/overlapping ranges.