Apologies for the somewhat subjective question, but I was wondering if anyone could summarize the relative pros and cons of using GroupBy vs hierarchical indexing for working on groups of data within a single dataframe.
My project involves constructing a dataframe of typing data. Each row represents a single keystroke (letter, touch/release time, touch/release coordinates); rows are then organized by sentence (the first level of grouping) and then by day (the second level). I need both to select sentences by an index (day/sentence) and to perform operations on entire sentence groups (such as mapping functions over these groups rather than over individual rows).
It seems that both GroupBy and hierarchical indexing have some benefits -- GroupBy seems to make it easier to perform calculations at the group level, while hierarchical indexing makes it easier to pluck specific sentences out of the dataframe for other purposes. Can anyone advise on which technique would be more appropriate? Or is some combination of the two the best way to go?
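For concreteness, here is a minimal sketch with made-up column names (not my actual data) showing the two options side by side, plus the combination:

import pandas as pd

# Hypothetical keystroke data; columns are illustrative only.
df = pd.DataFrame({
    "day":      [1, 1, 1, 2],
    "sentence": [0, 0, 1, 0],
    "letter":   ["h", "i", "y", "o"],
    "touch_t":  [0.10, 0.35, 1.20, 0.05],
})

# Option 1: hierarchical index -- easy to pluck one sentence out by (day, sentence).
idx = df.set_index(["day", "sentence"]).sort_index()
one_sentence = idx.loc[(1, 0)]

# Option 2: GroupBy -- easy to map a function over each sentence group.
durations = df.groupby(["day", "sentence"])["touch_t"].apply(lambda s: s.max() - s.min())

# Combination: group a MultiIndexed frame by its index levels.
per_sentence_counts = idx.groupby(level=["day", "sentence"]).size()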
Thank you!
Edit: this question was marked as a duplicate of a question regarding working with hierarchical indexing, but it is not -- this question is concerned with hierarchical indexing vs another approach.
I'm writing a function that (hopefully) simplifies a complex operation for other users. As part of this, the user passes in some dataframes and an arbitrary boolean Column expression computing something from those dataframes, e.g.
(F.col("first")*F.col("second").getItem(2) < F.col("third")) & (F.col("fourth").startswith("a")).
The dataframes may have dozens of columns each, but I only need the result of this expression, so it should be more efficient to select only the relevant columns before the tables are joined. Is there a way, given an arbitrary Column, to extract the names of the source columns that Column is being computed from, i.e.
["first", "second", "third", "fourth"]?
I'm using PySpark, so an ideal solution would be contained only in Python, but some sort of hack that requires Scala would also be interesting.
Alternatives I've considered would be to require the users to pass the names of the source columns separately, or to simply join the entire tables instead of selecting the relevant columns first. (I don't have a good understanding of Spark internals, so maybe the efficiency loss isn't as large as I think.) I might also be able to do something by cross-referencing the string representation of the column with the list of column names in each dataframe, but I suspect that approach would be unreliable.
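For what it's worth, a rough sketch of that string cross-referencing idea (the function name and details are hypothetical); it is heuristic and can misfire, e.g. when a column name also appears inside a string literal in the expression:

import re

def guess_source_columns(expr_col, dataframes):
    # Heuristically match known column names against the Column's string form,
    # which looks roughly like "Column<'((first * second[2]) < third)'>".
    expr_text = str(expr_col)
    candidates = {name for df in dataframes for name in df.columns}
    return sorted(name for name in candidates
                  if re.search(rf"\b{re.escape(name)}\b", expr_text))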
I've been studying the different ways to filter and subset pandas DataFrames and came across the pandas.DataFrame.filter() method. However, I can't figure out why one would use this over another method of filtering (loc, iloc, logical operators, str.contains(), .query(), etc). Can anyone provide an example of when it makes sense to use .filter() over the alternatives?
filter is applied to the index or column labels, not to the values.
This is in contrast to query, str.contains, etc., which filter a DataFrame based on its contents.
If, for example, you would like to keep only the columns whose names end with 'address', you could use df.filter(regex='address$')
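A small, made-up example of this label-based filtering (column names invented for illustration):

import pandas as pd

df = pd.DataFrame({
    "home_address": ["1 Main St", "2 Oak Ave"],
    "work_address": ["3 Elm St", "4 Pine Rd"],
    "age":          [30, 40],
})

# Keep only columns whose labels end with 'address'.
addresses = df.filter(regex="address$")

# Keep only columns whose labels contain 'home'.
home_cols = df.filter(like="home")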
Each function is used in particular circumstances. filter() is useful for getting a large dataset down to a smaller size, based on the questions you want to ask. query(), on the other hand, is useful for phrasing questions that use comparison operators (less than, equal to, greater than, etc.).
The pandas query() method is a fantastic way to filter and query data. Unlike other pandas methods, it takes a string argument whose syntax is rather similar to SQL.
You should only use query() when your question (query) can be posed as greater than, less than, equal to, or not equal to (or some combination of these).
Use filter() when you want to get a quick sense of your dataset or, as we shall see, create a new dataframe based on the columns you want. It is particularly useful if your dataset has many columns. You can also use it to reorder your columns in a more desirable way.
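A short, made-up illustration of the contrast described above:

import pandas as pd

df = pd.DataFrame({
    "population_2020": [100, 200, 300],
    "population_2021": [110, 190, 310],
    "state":           ["AA", "BB", "CC"],
})

# filter() works on labels: pick out (and reorder) the columns of interest.
pop = df.filter(["state", "population_2021", "population_2020"])

# query() works on values: pose a comparison as a string expression.
grew = df.query("population_2021 > population_2020")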
For more, see [http://pandas.pythonhumanities.com][1]
[1]: http://pandas.pythonhumanities.com/03_02_advanced_querying.html#:~:text=Filter()%20is%20useful%20for,greater%20than%2C%20etc.).
So, I am trying to pivot this data (link here) so that all the metrics/numbers end up in a single column, with another column serving as the ID column. Having a ton of metrics spread across a bunch of columns is much harder to compare and do calculations on than having them all in one column.
I know what tools I need for this: pandas, findall, wide_to_long (or melt), and maybe stack. However, I am having a bit of difficulty putting them all in the right place.
I can easily import the data into the df and view it, but when it comes to using findall with wide_to_long to pivot the data I get pretty confused. I am basing my idea on this example (about halfway down, where they use findall / regex to define new column names). I am looking to create a new column for each category of metric (i.e. population estimate is one column and % change is another; they should not all be in one column).
Can someone help me set up the syntax correctly for this part? I am not good with regular expressions and pattern matching.
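In case it helps, here is a minimal wide_to_long sketch with hypothetical column names (the real file will differ, so the stub regex would need adjusting):

import pandas as pd

# Hypothetical wide layout: one row per ID, one column per metric/year pair.
df = pd.DataFrame({
    "ID": ["A", "B"],
    "POPESTIMATE2010": [100, 200],
    "POPESTIMATE2011": [110, 210],
    "NPOPCHG2010": [1, 2],
    "NPOPCHG2011": [3, 4],
})

# Derive the stub names (the non-year part of each metric column) with a regex,
# similar to the findall trick in the linked example.
stubs = sorted(set(df.columns.str.extract(r"^(\D+)\d{4}$", expand=False).dropna()))
# stubs == ['NPOPCHG', 'POPESTIMATE']

# Reshape: one row per (ID, year), one column per metric category.
long_df = pd.wide_to_long(df, stubnames=stubs, i="ID", j="year").reset_index()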
Thank you.
I'm processing a bunch of text-based records in CSV format using Dask, which I am learning to use to work around data that is too large to fit in memory, and I'm trying to filter records within groups that best match a complicated set of criteria.
The best approach I've identified so far is basically to use Dask to group records into bite-sized chunks and then write the applicable logic in Python:
import pandas as pd

def reduce_frame(partition):
    records = partition.to_dict('records')  # one dict per row
    shortlisted_records = []
    for record in records:
        # Use Python to locate promising looking records.
        # Some of the criteria can be cythonized; one criterion
        # revolves around whether a record is a parent or child
        # of records already in shortlisted_records.
        for other in shortlisted_records:
            if other['path'].startswith(record['path']) \
                    or record['path'].startswith(other['path']):
                ...  # keep one, possibly both
        ...
    return pd.DataFrame.from_dict(shortlisted_records)

df = df.groupby('key').apply(reduce_frame, meta={...})
In case it matters, the complicated criteria revolve around weeding out promising-looking links on a web page based on link URL, link text, and CSS selectors across the entire group. Think: given A and B already in the shortlist and C a new record, keep all three if each is very promising; otherwise prefer C over A and/or B if it is more promising than either or both; otherwise drop C. The resulting Pandas partition objects above are tiny. (The dataset in its entirety is not, hence my using Dask.)
Since pandas exposes inherently row- and column-based functionality, I'm struggling to imagine any vectorized approach to solving this, so I'm exploring writing the logic in plain Python.
Is the above the correct way to proceed, or are there more Dask/Pandas-idiomatic ways, or simply better ways, to approach this type of problem? Ideally one that allows me to parallelize the computations across a cluster, for instance by using Dask.bag or Dask.delayed and/or cytoolz or something else I might have missed while learning Python?
I know nothing about Dask, but can tell a little about passing / blocking some rows using pandas.
It is possible to use groupby(...).apply(...) to "filter" the source DataFrame.
Example: df.groupby('key').apply(lambda grp: grp.head(2)) returns the first 2 rows from each group.
In your case, write a function to be applied to each group, which:
contains some logic processing the current group,
generates the output DataFrame based on this logic, e.g. returning only some of the input rows.
The returned rows are then concatenated, forming the result of apply.
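A toy sketch of this apply-based filtering, with a hypothetical 'key'/'score' frame (not your data) and an arbitrary per-group rule:

import pandas as pd

df = pd.DataFrame({
    "key":   ["a", "a", "a", "b", "b"],
    "score": [0.9, 0.2, 0.7, 0.1, 0.8],
})

def keep_promising(grp, threshold=0.5):
    # Per-group logic: keep rows above a threshold,
    # falling back to the single best row if none qualify.
    good = grp[grp["score"] > threshold]
    return good if not good.empty else grp.nlargest(1, "score")

result = df.groupby("key", group_keys=False).apply(keep_promising)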
Another possibility is to use groupby(...).filter(...), but in this case the underlying function returns a single decision, "passing" or "blocking" each whole group of rows.
Yet another possibility is to define a "filtering function", say filtFun, which returns True (pass the row) or False (block the row). Then:
Run msk = df.apply(filtFun, axis=1) to generate a mask (which rows passed the filter).
In further processing use df[msk], i.e. only the rows which passed the filter.
But in this case the underlying function has access only to the current row, not to the whole group of rows.
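Minimal sketches of these two alternatives, reusing the same hypothetical frame as above:

# groupby(...).filter(...): the function sees a whole group and returns one
# True/False that keeps or drops the entire group.
kept_groups = df.groupby("key").filter(lambda grp: grp["score"].mean() > 0.5)

# Row-wise mask: the function sees one row at a time, not the group.
msk = df.apply(lambda row: row["score"] > 0.5, axis=1)
kept_rows = df[msk]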
A question for experienced pandas users on approaches to working with DataFrame data.
Invariably we want to use pandas to explore relationships among data elements. Sometimes we use groupby-type functions to get summary-level data on subsets of the data. Sometimes we use plots and charts to compare one column of data against another. I'm sure there are other applications I haven't thought of.
When I speak with other fairly novice users like myself, they generally try to extract portions of a "large" dataframe into smaller dfs that are sorted or formatted properly for the analysis or plot. This approach certainly has disadvantages: if you strip out a subset of data into a smaller df and then want to run an analysis against a column you left in the bigger df, you have to go back and recut everything.
My question is: is it best practice for more experienced users to keep the large dataframe and pull the data out syntactically, so that the effect is the same as or similar to cutting out a smaller df? Or is it better to actually cut out smaller dfs to work with?
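To make the contrast concrete, a tiny made-up example of the two workflows (column names invented):

import pandas as pd

df = pd.DataFrame({"region": ["E", "E", "W"], "sales": [10, 20, 30], "cost": [5, 9, 21]})

# Workflow A: cut out a smaller df to work with.
east = df[df["region"] == "E"].copy()
east_sales_mean = east["sales"].mean()

# Workflow B: keep the large df intact and pull the subset out syntactically.
east_sales_mean = df.loc[df["region"] == "E", "sales"].mean()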
Thanks in advance.