How to "impute" missing item in 2 layered group by in pandas - python

I have a question regarding how to impute a second level index after a 2 layered groupby in pandas.
I have a dataframe of patient information. I'm trying to track when these reports were generated so I can chart them in pyplot. The three things that matter for what I'm trying to do are when the report was generated, the technology that generated the report, and the count of each technology per month. I have this line of code so far:
frame.groupby([pd.Grouper(key="reportDate", freq='M'), pd.Grouper(key="sourceFilePathTechnology")], observed=False).count()
which generates the following table.
I'm close to what I'm trying to get, but I'm missing something and I can't find what I'm looking for in the documentation or in another SO post. The final missing step is that I would like to have every technology represented in the sourceFilePathTechnology index per month. So 2016-03-31 only has FSG, when I need it to also have NTP and MOL, even if the count is 0, and I need this for every month in the reportDate index. Does anyone know how I can resolve this?
Thank you to anyone who can offer some input!

Found my answer. I needed to google "pandas group by and count 0" and came across this post: Pandas groupby for zero values.
The answer was:
frame.groupby([pd.Grouper(key="reportDate", freq='M'), pd.Grouper(key="sourceFilePathTechnology")], observed=False).count().unstack(fill_value=0).stack()
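For reference, here is a minimal, self-contained sketch of that approach with made-up sample data (the column names follow the question; everything else is invented for illustration):

import pandas as pd

# Tiny made-up frame: two technologies, only one of which appears in March.
frame = pd.DataFrame({
    "reportDate": pd.to_datetime(["2016-03-05", "2016-03-20", "2016-04-02"]),
    "sourceFilePathTechnology": ["FSG", "FSG", "NTP"],
    "reportId": [1, 2, 3],
})

counts = (
    frame.groupby(
        [pd.Grouper(key="reportDate", freq="M"),   # newer pandas versions prefer freq="ME"
         pd.Grouper(key="sourceFilePathTechnology")],
        observed=False,
    )
    .count()
    .unstack(fill_value=0)  # missing technologies become columns filled with 0
    .stack()                # back to the (month, technology) two-level index
)
print(counts)  # March now has an explicit 0 row for NTP, and April one for FSG

The unstack/stack round trip is what materializes the full month-by-technology grid, so every technology appears in every month even when its count is zero.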

Related

Plot multiple lines between an x and a y column based on a third column

First question on Stack Overflow! New to Pandas. I have data, a snippet of which is pasted below. I want to plot Hits (Y-axis) against Week (X-axis), with multiple line graphs, one for each experiment group.
I'm confused as to what exactly to do. I know I likely need to use groupby. Any help is appreciated. Thanks.
Hits, Experiment Group, Week
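One possible sketch, assuming the three columns are named as in the snippet above (the sample values are invented):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "Hits": [10, 15, 12, 20, 8, 11],
    "Experiment Group": ["A", "A", "A", "B", "B", "B"],
    "Week": [1, 2, 3, 1, 2, 3],
})

# One line per experiment group: loop over the groups...
for name, group in df.groupby("Experiment Group"):
    plt.plot(group["Week"], group["Hits"], label=name)
plt.xlabel("Week")
plt.ylabel("Hits")
plt.legend(title="Experiment Group")
plt.show()

# ...or pivot so each group becomes its own column and let pandas plot:
# df.pivot(index="Week", columns="Experiment Group", values="Hits").plot()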

How to make a 5-Year calculation by matching date variables in a Pandas dataframe?

The objective is to determine the price percent change within a group (ticker). The percent change must be measured across 5 years of the date variable. The table below shows what I have now:
The final output would look like the following:
One idea I have experimented with is calculating the 5-year date for each record, then iterating through the data frame, pairing the matching records and saving them into a tuple. I think this approach would be very inefficient though. Any ideas?
Any ideas or help is greatly appreciated!
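Since the tables themselves aren't shown, here is only a rough sketch under assumed column names ("ticker", "date", "price"): a self-merge that pairs each row with the row dated exactly five years later within the same ticker, which avoids the row-by-row iteration.

import pandas as pd

df = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB", "BBB"],
    "date": pd.to_datetime(["2010-01-01", "2015-01-01", "2011-06-01", "2016-06-01"]),
    "price": [10.0, 15.0, 20.0, 25.0],
})

# The date each record should be matched against.
df["date_plus_5y"] = df["date"] + pd.DateOffset(years=5)

merged = df.merge(
    df,
    left_on=["ticker", "date_plus_5y"],
    right_on=["ticker", "date"],
    suffixes=("", "_future"),
)
merged["pct_change_5y"] = (merged["price_future"] - merged["price"]) / merged["price"] * 100
print(merged[["ticker", "date", "price", "price_future", "pct_change_5y"]])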

Problems selecting values in a row

I'm having trouble selecting specific values of a row with pandas.
I have a CSV file with confirmed cases of Coronavirus in each country each day. So obviously some countries started having cases on different days and progressed in different ways.
Dataframe of countries I'm trying to plot:
I would like to filter each row from its 50th confirmed case onwards, which occurs on a different day for each country.
I tried to use the command df[df['column']>50], but this works for a single column and I want to do it for all columns.
All my life I have worked only with procedural programming in Python, without libraries, but this week I decided to start using some of them, so my understanding of libraries is very limited and I don't know how to combine a for loop with a library function, which I think is needed here. This is also my first question on Stack Overflow, so if I am doing something wrong please tell me. Thank you!
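A possible sketch, assuming countries are the rows, the date columns hold cumulative confirmed cases, and the data below is invented: because the counts are cumulative, masking everything below 50 effectively keeps each row only from its own 50th case onwards.

import pandas as pd

df = pd.DataFrame(
    [[0, 10, 60, 120, 300],
     [0,  0, 20,  80, 200]],
    index=["CountryA", "CountryB"],
    columns=["day1", "day2", "day3", "day4", "day5"],
)

# Everything below the threshold becomes NaN, row by row and column by column.
masked = df.where(df >= 50)
print(masked)

# Optional: realign each country to a common "days since 50th case" axis
# by dropping the NaNs per row and renumbering the positions.
aligned = masked.apply(lambda row: pd.Series(row.dropna().to_numpy()), axis=1)
print(aligned)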

How to undo column aggregation on pandas dataframe

I have a dataframe with columns that are an aggregation of coronavirus cases over time.
I need the data in the date columns to be the new number of cases for that day instead of the aggregation.
So for example, I am trying to get the first row to be like
Anhui, Mainland China, 1, 8, 6
I think there might be a quick pandas way to do this but can't find it by google searching. A brute force method would be okay too. Thanks!
You can take the finite difference across the columns of the dataframe. If df is a copy of the numerical part of the dataframe then the following will do it:
df.diff(axis=1)
Documentation
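A small illustration of that fix, with invented cumulative numbers chosen to match the 1, 8, 6 example above (the real column names will differ):

import pandas as pd

df = pd.DataFrame({
    "Province": ["Anhui"],
    "Country": ["Mainland China"],
    "1/22/20": [1],
    "1/23/20": [9],
    "1/24/20": [15],
})

date_cols = ["1/22/20", "1/23/20", "1/24/20"]
new_cases = df[date_cols].diff(axis=1)
# The first date column has nothing to its left, so keep its original value.
new_cases["1/22/20"] = df["1/22/20"]

result = pd.concat([df[["Province", "Country"]], new_cases], axis=1)
print(result)  # Anhui, Mainland China, 1, 8, 6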

Python: aggregating data by row count

I'm trying to aggregate this call center data in various different ways in Python, for example mean q_time by type and priority. This is fairly straightforward using df.groupby.
However, I would also like to be able to aggregate by call volume. The problem is that each line of the data represents a call, so I'm not sure how to do it. If I'm just grouping by date then I can just use 'count' as the aggregate function, but how would I aggregate by e.g. weekday, i.e. create a data frame like:
weekday mean_row_count
1 100
2 150
3 120
4 220
5 200
6 30
7 35
Is there a good way to do this? All I can think of is looping through each weekday and counting the number of unique dates, then dividing the counts per weekday by the number of unique dates, but I think this could get messy and maybe really slow it down if I need to also group by other variables, or do it by date and hour of the day.
Since the date of each call is given, one idea is to implement a function to determine the day of the week from a given date. There are many ways to do this, such as Conway's Doomsday algorithm.
https://en.wikipedia.org/wiki/Doomsday_rule
One can then go through each line, determine the weekday, and add to the count for that weekday.
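If the data is already in pandas, a simpler route than hand-rolling a weekday algorithm is the built-in dayofweek accessor on datetime indexes. A sketch with an assumed column name ("call_date" for the call's date):

import pandas as pd

df = pd.DataFrame({
    "call_date": pd.to_datetime(
        ["2020-03-02", "2020-03-02", "2020-03-03", "2020-03-09", "2020-03-10"]
    ),
    "q_time": [30, 45, 20, 60, 15],
})

# Number of calls on each individual date...
calls_per_date = df.groupby("call_date").size()

# ...then average those daily counts across dates sharing a weekday
# (0 = Monday, 6 = Sunday).
mean_count = calls_per_date.groupby(calls_per_date.index.dayofweek).mean()
print(mean_count.rename_axis("weekday").rename("mean_row_count"))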
When I find myself thinking about how to aggregate and query data in a versatile way, I think the solution is probably a database. SQLite is a lightweight embedded database with high performance for simple use cases, and Python has native support for it.
My advice here is: create a database and a table for your data, possibly add ancillary tables depending on your needs, load the data into it, and use interactive sqlite or Python scripts for your queries.
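A small sketch of that route with Python's built-in sqlite3 module and pandas (the table and column names here are made up):

import sqlite3
import pandas as pd

df = pd.DataFrame({
    "call_date": ["2020-03-02", "2020-03-02", "2020-03-03"],
    "type": ["sales", "support", "sales"],
    "q_time": [30, 45, 20],
})

con = sqlite3.connect("calls.db")
df.to_sql("calls", con, if_exists="replace", index=False)

# Example query: call volume per date; strftime('%w', call_date) would give
# the weekday (0 = Sunday) if grouping by weekday instead.
volume = pd.read_sql(
    "SELECT call_date, COUNT(*) AS call_count FROM calls GROUP BY call_date",
    con,
)
print(volume)
con.close()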
