Aggregating Pandas DF - Losing Data - python

I'm trying to aggregate a pandas df the way an Excel pivot table would. I have one quantitative variable called "Count". I would like rows with the same qualitative variables to combine and their "Count" data to sum.
However, when I try to do this with the below code, I see that I am somehow losing data. Any idea why this might be happening and how I can fix it?
I expect the number of rows to decrease, but the total sum of the "Count" column shouldn't change.

Since you have NaNs in the columns you are grouping on, those rows are dropped by groupby by default, so their "Count" values never make it into the sum. That is why the total changes.
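A minimal sketch with made-up data and hypothetical column names (Region, Product): either fill the missing keys before grouping, or keep them as their own group with dropna=False (available in pandas 1.1+). Either way, the total of "Count" is preserved.
import pandas as pd

df = pd.DataFrame({
    "Region": ["East", "East", None, "West"],
    "Product": ["A", "A", "B", None],
    "Count": [10, 5, 3, 7],
})

# Option 1: fill the missing keys so no rows are dropped by groupby
filled = df.fillna({"Region": "Unknown", "Product": "Unknown"})
agg1 = filled.groupby(["Region", "Product"], as_index=False)["Count"].sum()

# Option 2 (pandas >= 1.1): keep NaN keys as their own group
agg2 = df.groupby(["Region", "Product"], dropna=False, as_index=False)["Count"].sum()

# Both totals match the original sum (25)
assert agg1["Count"].sum() == df["Count"].sum() == agg2["Count"].sum()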


Iterating through big data with pandas, large and small dataframes

This is my first post here and it’s based upon an issue I’ve created and tried to solve at work. I’ll try to summarize my issue precisely, as I’m having trouble wrapping my head around a preferred solution. Step #3 below is a real stumper for me.
1. Grab a large data file based on a parquet - no problem
2. Select 5 columns from the parquet and create a dataframe - no problem
import pandas as pd
df = pd.read_parquet('/Users/marmicha/Downloads/sample.parquet',
                     columns=["ts", "session_id", "event", "duration", "sample_data"])
3. But here is where it gets a bit tricky for me. One column (a key column) is called "session_id". Many values are unique, but many duplicate values of session_id also exist, each with multiple associated rows of data. I wish to iterate through the master dataframe and create a unique dataframe per session_id. Each of these unique (sub) dataframes would have a calculation done that simply gets the SUM of the "duration" column per session_id. Again, that SUM would be unique per session_id, so each sub dataframe would have its own SUM, with a row added listing that total along with the session_id. I'm thinking there is a nested loop that will work for me, but every effort has been a mess to date.
4. Ultimately, I'd like to have a final dataframe that is a collection of these unique sub dataframes. I guess I'd need to define this final dataframe and append each new sub dataframe to it as I iterate through the data. I should be able to do that simply enough.
5. Finally, write this final df to a new parquet file. That should be simple enough, so I won't need help with that.
But that is my challenge in a nutshell. The main design I’d need help with is #3. I’ve played with itertuples and iterrows.
I think the groupby function will work:
df.groupby('session_id')['duration'].sum()
More info here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
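Putting that together with the rest of the pipeline, a rough sketch (the input path is the one from the question; the output path is just a hypothetical example) might look like:
import pandas as pd

df = pd.read_parquet('/Users/marmicha/Downloads/sample.parquet',
                     columns=["ts", "session_id", "event", "duration", "sample_data"])

# One row per session_id with the total duration for that session;
# no explicit loop over sub-dataframes is needed
totals = df.groupby('session_id', as_index=False)['duration'].sum()

# Write the aggregated result to a new parquet file (hypothetical output path)
totals.to_parquet('/Users/marmicha/Downloads/session_totals.parquet')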

How to undo column aggregation on pandas dataframe

I have a dataframe whose date columns are a cumulative count of coronavirus cases over time.
I need the data in the date columns to be the number of new cases for that day instead of the cumulative total.
So for example, I am trying to get the first row to be like
Anhui, Mainland China, 1, 8, 6
I think there might be a quick pandas way to do this, but I can't find it by Google searching. A brute force method would be okay too. Thanks!
You can take the finite difference along the rows (axis=1) of the dataframe. If df is a copy of the numerical part of the dataframe, then the following will do it:
df.diff(axis=1)
Documentation
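A minimal sketch with made-up numbers (the date column names are hypothetical): diff(axis=1) leaves the first date column as NaN, so you can restore it from the original cumulative values.
import pandas as pd

# Cumulative case counts per province (made-up values matching the example row)
df = pd.DataFrame({'1/22/20': [1, 14], '1/23/20': [9, 22], '1/24/20': [15, 36]},
                  index=['Anhui', 'Beijing'])

daily = df.diff(axis=1)
# The first column has no previous day to subtract from, so keep the original values
daily.iloc[:, 0] = df.iloc[:, 0]
# The first row is now 1, 8, 6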

How can I order entries in a Dask dataframe for display in seaborn?

I'm trying to get a Seaborn barplot containing the top n entries from a dataframe, sorted by one of the columns.
In Pandas, I'd typically do this using something like this:
df = df.sort_values('ColumnFoo', ascending=False)
sns.barplot(data=df[:10], x='ColumnFoo', y='ColumnBar')
Trying out Dask, though, I find there is (fairly obviously) no option to sort a dataframe, since dataframes are largely deferred objects, and sorting them would eliminate many of the benefits of using Dask in the first place.
Is there a way either to get ordered entries from a dataframe, or to have Seaborn pick the top n values from a dataframe's column?
If you're moving data to seaborn then it almost certainly fits in memory. I recommend just converting to a Pandas dataframe and then doing the sorting there.
Generally, once you've hit the small-data regime there is no reason to use Dask over Pandas. Pandas is more mature and a smoother experience. Dask Dataframe developers recommend using Pandas when feasible.
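A hedged sketch of that, assuming the data comes from a parquet file (the path here is hypothetical) and using the column names from the question: reduce to the top n rows with nlargest (which Dask dataframes support), then compute() the small result into a Pandas dataframe for Seaborn.
import dask.dataframe as dd
import seaborn as sns

ddf = dd.read_parquet('data.parquet')  # hypothetical source file

# Reduce to the top 10 rows on the Dask side, then materialize the small
# result as a regular Pandas dataframe for plotting
top10 = ddf.nlargest(10, 'ColumnFoo').compute()

sns.barplot(data=top10, x='ColumnFoo', y='ColumnBar')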

pyspark Window.partitionBy vs groupBy

Let's say I have a dataset with around 2.1 billion records.
It's a dataset with customer information, and I want to know how many times each customer did something. So I should group on the ID and sum one column (it has 0 and 1 values, where 1 indicates an action).
Now, I can use a simple groupBy and agg(sum) it, but to my understanding this is not really efficient. The groupBy will move around a lot of data between partitions.
Alternatively, I can also use a Window function with a partitionBy clause and then sum the data. One of the disadvantages is that I'll then have to apply an extra filter, because it keeps all the data, and I want one record per ID.
But I don't see how this Window handles the data. Is it better than the groupBy and sum, or is it the same?
As far as I know, when working with Spark DataFrames, the groupBy operation is optimized via Catalyst. The groupBy on DataFrames is unlike the groupBy on RDDs.
For instance, the groupBy on DataFrames performs the aggregation on each partition first, and then shuffles the aggregated results for the final aggregation stage. Hence, only the reduced, aggregated results get shuffled, not the entire data. This is similar to reduceByKey or aggregateByKey on RDDs. See this related SO article with a nice example.
In addition, see slide 5 in this presentation by Yin Huai which covers the benefits of using DataFrames in conjunction with Catalyst.
In conclusion, I think you're fine using groupBy with Spark DataFrames. A Window function does not seem appropriate for your requirement.
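A minimal PySpark sketch of that groupBy approach (the column names ID and action are hypothetical):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# One row per event, with a 0/1 "action" column (hypothetical schema and data)
df = spark.createDataFrame([(1, 1), (1, 0), (2, 1)], ["ID", "action"])

# Partial aggregation runs on each partition before the shuffle, so only the
# reduced per-ID sums move between executors, not the full 2.1 billion rows
counts = df.groupBy("ID").agg(F.sum("action").alias("action_count"))
counts.show()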

Pandas performance: Multiple dtypes in one column or split into different dtypes?

I work with huge pandas DataFrames: 20 million rows, 30 columns. The rows carry a lot of data, and each row has a "type" that only uses certain columns. Because of this, I've currently designed the DataFrame so that some columns hold mixed dtypes, depending on which 'type' the row is.
My question is: performance-wise, should I split the mixed-dtype columns into separate columns or keep them as one? I'm running into problems getting some of these DataFrames to even save (to_pickle), and I'm trying to be as efficient as possible.
The columns could be mixes of float/str, float/int, float/int/str as currently constructed.
It seems to me that it may depend on what your subsequent use case is, but IMHO I would give each column a single dtype; otherwise functions such as groupby with totals and other common Pandas operations simply won't work as expected.
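A hedged sketch of the split, with made-up column names and data: a mixed float/str column is stored as object dtype, which is slow and memory-hungry; splitting it into one numeric and one string column keeps each column a single dtype.
import pandas as pd

# "value" is object dtype because it mixes floats and strings (made-up data)
df = pd.DataFrame({"type": ["a", "b", "a"], "value": [1.5, "red", 2.0]})

# Split into one numeric column and one string column, each with a single dtype
df["value_num"] = pd.to_numeric(df["value"], errors="coerce")
df["value_str"] = df["value"].where(df["value_num"].isna()).astype("string")
df = df.drop(columns="value")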
