How to undo column aggregation on pandas dataframe - python

I have a dataframe with columns that are an aggregation of corona virus cases over time.
I need the data in the date columns to be the new number of cases for that day instead of the aggregation.
So for example, I am trying to get the first row to be like
Anhui, Mainland China, 1, 8, 6
I think there might be a quick pandas way to do this but can't find it by google searching. A brute force method would be okay too. Thanks!

You can take the finite difference along each row of the dataframe (i.e. across the date columns). If df is a copy of the numerical part of the dataframe, the following will do it:
df.diff(axis=1)
See the pandas documentation for DataFrame.diff.
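For example, here is a minimal sketch of undoing the cumulative sum on the date columns only (the column names below are assumptions, not taken from your data):

import pandas as pd

# Hypothetical cumulative data: region columns followed by one column per date
df = pd.DataFrame({
    'Province/State': ['Anhui'],
    'Country/Region': ['Mainland China'],
    '1/22/20': [1], '1/23/20': [9], '1/24/20': [15],
})

date_cols = df.columns[2:]                   # the cumulative date columns
daily = df[date_cols].diff(axis=1)           # difference along each row
daily.iloc[:, 0] = df[date_cols].iloc[:, 0]  # the first day has no previous value; keep it
df[date_cols] = daily.astype(int)            # optional: back to integers
print(df)  # first row becomes Anhui, Mainland China, 1, 8, 6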

Related

Aggregating Pandas DF - Losing Data

I'm trying to aggregate a pandas df the way an Excel pivot table would. I have one quantitative variable called "Count". I would like rows with the same qualitative variables to be combined and the "Count" data to be summed.
However, when I am trying to do this with the below code, I see that I am somehow losing data. Any idea why this might be happening and how I can fix it?
I expect the number of rows to decrease but the total sum of the "Count" column shouldn't change.
Since you have NaNs in your dataframe, those rows won't be included in your groupby operation, and thus the data for those rows will not be summed.
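A minimal sketch of the effect and two possible fixes (the column names here are made up):

import numpy as np
import pandas as pd

# Hypothetical data: one qualitative key with a missing value
df = pd.DataFrame({'Category': ['A', 'B', np.nan, 'A'],
                   'Count': [1, 2, 3, 4]})

print(df['Count'].sum())                             # 10
print(df.groupby('Category')['Count'].sum().sum())   # 7 -- the NaN row was dropped

# Option 1: keep the NaN group explicitly (pandas >= 1.1)
print(df.groupby('Category', dropna=False)['Count'].sum().sum())   # 10
# Option 2: fill the missing keys before grouping
print(df.fillna({'Category': 'missing'}).groupby('Category')['Count'].sum().sum())  # 10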

Does Pandas use hashing for a single-indexed dataframe and binary searching for a multi-indexed dataframe?

I have always been under the impression that Pandas uses hashing when indexing the rows of a dataframe, so that operations like df.loc[some_label] are O(1).
However, I realized today that this is not the case, at least for multi-indexed dataframes. As pointed out in the documentation, "Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning)". Some articles I found seem to suggest that, for a multi-indexed dataframe, Pandas uses binary-search-based indexing if you have called sort_index() on the dataframe; otherwise, it just linearly scans the rows.
My questions are:
Does a single-indexed dataframe use hash-based indexing or not?
If not, does it use binary search when sort_index() has been called, and a linear scan otherwise, as in the case of a multi-indexed dataframe?
If yes, why does Pandas choose not to use hash-based indexing for multi-indexes as well?
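For reference, a small sketch of the sorted-versus-unsorted distinction the question is about (the data is made up):

import numpy as np
import pandas as pd

# Hypothetical multi-indexed frame, built in an unsorted order
idx = pd.MultiIndex.from_tuples([('b', 2), ('a', 1), ('b', 1), ('a', 2)])
df = pd.DataFrame({'x': np.arange(4)}, index=idx)

print(df.index.is_monotonic_increasing)  # False: partial-label lookups can be slow
df = df.sort_index()                     # sorting enables efficient label-based slicing
print(df.loc['a'])                       # all rows whose first level is 'a'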

Python: (findall, wide to long, pandas): Data manipulation

So, I am trying to pivot this data (link here) so that all the metrics/numbers end up in one column, with another column being the ID column. Obviously, having a ton of data/metrics spread across a bunch of columns is much harder to compare and calculate with than having it all in one column.
I know what tools I need for this: pandas, findall, wide_to_long (or melt), and maybe stack. However, I am having a bit of difficulty putting them all in the right place.
I can easily import the data into the df and view it, but when it comes to using findall with wide_to_long to pivot the data I get pretty confused. I am basing my idea off of this example (about halfway down, where they use findall / regex to define new column names). I am looking to create a new column for each category of metric (i.e. population estimate is one column and % change is another; they should not all be one column).
Can someone help me set up the syntax correctly for this part? I am not good with regular expressions for pattern matching.
Thank you.
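Without the linked data in hand, here is a minimal sketch of the pattern, assuming hypothetical column names of the form metric_year:

import pandas as pd

# Hypothetical wide data: one row per area, one column per metric/year pair
df = pd.DataFrame({
    'ID': ['A', 'B'],
    'pop_est_2018': [100, 200], 'pop_est_2019': [110, 210],
    'pct_change_2018': [1.0, 2.0], 'pct_change_2019': [1.5, 2.5],
})

# wide_to_long needs the "stub" part of each column name; the stubs are written
# out here, but they could also be derived with a regex / str.findall on df.columns.
long_df = pd.wide_to_long(
    df,
    stubnames=['pop_est', 'pct_change'],  # one output column per metric category
    i='ID',                               # identifier column(s)
    j='year',                             # name of the new variable column
    sep='_',
    suffix=r'\d+',                        # the year part of the column names
).reset_index()

print(long_df)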

Creating dataframe from scratch with daily rows

What is the simplest code that can be used to create a basic dataframe with a single column (let's call it date), with daily rows between dateA and dateB?
This dataframe can later be used for multiple purposes.
I can think of many ways to create it, but all of them need multiple lines of code. I wonder if there is a one-liner, or an example of very simple code, for such a simple task?
You can use
df = pd.DataFrame({'date': pd.date_range('2018-10-01','2019-09-01', freq='D')})
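Note that both endpoints are included in the range, and 'D' is already the default frequency when a start and end date are given, so the freq argument could even be dropped.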

pyspark Window.partitionBy vs groupBy

Let's say I have a dataset with around 2.1 billion records.
It's a dataset with customer information, and I want to know how many times each customer did something. So I should group on the ID and sum one column (it has 0 and 1 values, where a 1 indicates an action).
Now, I can use a simple groupBy and agg(sum) on it, but to my understanding this is not really efficient: the groupBy will move a lot of data around between partitions.
Alternatively, I can also use a Window function with a partitionBy clause and then sum the data. One of the disadvantages is that I'll then have to apply an extra filter, because it keeps all the rows and I want one record per ID.
But I don't see how this Window handles the data. Is it better than the groupBy and sum, or is it the same?
As far as I know, when working with Spark DataFrames, the groupBy operation is optimized via Catalyst. groupBy on DataFrames is unlike groupBy on RDDs.
For instance, groupBy on DataFrames performs the aggregation within each partition first, and then shuffles the aggregated results for the final aggregation stage. Hence, only the reduced, aggregated results get shuffled, not the entire data. This is similar to reduceByKey or aggregateByKey on RDDs. See this related SO article with a nice example.
In addition, see slide 5 in this presentation by Yin Huai, which covers the benefits of using DataFrames in conjunction with Catalyst.
In conclusion, I think you're fine using groupBy with Spark DataFrames. A Window does not seem appropriate to me for your requirement.
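For concreteness, a rough sketch of both approaches (the column names customer_id and action are assumptions, not from your data):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema: one row per event, 'action' is 0 or 1
df = spark.createDataFrame(
    [(1, 1), (1, 0), (2, 1), (2, 1)],
    ['customer_id', 'action'],
)

# groupBy: partial aggregation per partition, then only the reduced rows are shuffled
per_customer = df.groupBy('customer_id').agg(F.sum('action').alias('n_actions'))

# Window alternative: every input row is kept, so an extra deduplication step is needed
w = Window.partitionBy('customer_id')
per_customer_window = (
    df.withColumn('n_actions', F.sum('action').over(w))
      .select('customer_id', 'n_actions')
      .dropDuplicates(['customer_id'])
)

per_customer.show()
per_customer_window.show()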
