Pandas, Python: How to turn a row value into a column and aggregate the values of another column as a sum

I'm trying to analyse a COVID data set and I'm kind of at a loss on how to reshape the data with pandas. The data set looks like the following:
I'm trying to make it look like this:
              April 2                        | April 3                        | April 4
unique_tests  total unique tests for April 2 | total unique tests for April 3 | total unique tests for April 4
positive      total positive for April 2     | total positive for April 3     | total positive for April 4
negative      total negative for April 2     | total negative for April 3     | total negative for April 4
remaining     total remaining for April 2    | total remaining for April 3    | total remaining for April 4
I have dates up to April 24.
Any ideas on how I can implement this? I can't make it work with pivot_table in pandas.

Use:
# convert numeric columns (handling thousands separators) and parse the date column to datetimes
df = pd.read_csv(file, thousands=',', parse_dates=['date'])
# group by a custom string format of the dates, aggregate with sum, then transpose
df1 = df.groupby(df['date'].dt.strftime('%d-%b')).sum().T
Or it is possible to reassign the date column, filled with the new format of the datetimes:
df1 = df.assign(date = df['date'].dt.strftime('%d-%b')).groupby('date').sum().T
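A minimal runnable sketch of the second variant, using made-up numbers in the same shape as the question (the values are invented; the column names are taken from the desired output):
import io
import pandas as pd

# made-up sample data in the shape described in the question
csv = io.StringIO(
    "date,unique_tests,positive,negative,remaining\n"
    '2020-04-02,"1,200",100,1050,50\n'
    '2020-04-03,"1,500",160,1290,50\n'
)
df = pd.read_csv(csv, thousands=',', parse_dates=['date'])

df1 = df.assign(date=df['date'].dt.strftime('%d-%b')).groupby('date').sum().T
print(df1)
# date          02-Apr  03-Apr
# unique_tests    1200    1500
# positive         100     160
# negative        1050    1290
# remaining         50      50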


How to filter a dataframe based on the condition that its index is between date intervals?

I have 2 dataframes:
df_dec_light and df_rally.
df_dec_light.head():
log_return month year
1970-12-01 0.003092 12 1970
1970-12-02 0.011481 12 1970
1970-12-03 0.004736 12 1970
1970-12-04 0.006279 12 1970
1970-12-07 0.005351 12 1970
1970-12-08 -0.005239 12 1970
1970-12-09 0.000782 12 1970
1970-12-10 0.004235 12 1970
1970-12-11 0.003774 12 1970
1970-12-14 -0.005109 12 1970
df_rally.head():
rally_start rally_end
0 1970-12-18 1970-12-31
1 1971-12-17 1971-12-31
2 1972-12-15 1972-12-29
3 1973-12-21 1973-12-31
4 1974-12-20 1974-12-31
I need to filter df_dec_light based on the condition that df_dec_light.index is between the values of the columns df_rally['rally_start'] and df_rally['rally_end'].
I've tried something like this:
df_dec_light[(df_dec_light.index >= df_rally['rally_start']) & (df_dec_light.index <= df_rally['rally_end'])]
I was expecting to receive a filtered df_dec_light dataframe with indexes that are within the intervals between df_rally['rally_start'] and df_rally['rally_end'].
Something like this:
log_return month year
1970-12-18 0.001997 12 1970
1970-12-21 -0.003108 12 1970
1970-12-22 0.001111 12 1970
1970-12-23 0.000666 12 1970
1970-12-24 0.005644 12 1970
1970-12-28 0.005283 12 1970
1970-12-29 0.010810 12 1970
1970-12-30 0.002061 12 1970
1970-12-31 -0.001301 12 1970
Would really appreciate any help. Thanks!
Let's create an IntervalIndex from the start and end column values in the df_rally dataframe, then map the intervals onto the index of the df_dec_light dataframe and use notna to check whether each index value is contained in any interval:
ix = pd.IntervalIndex.from_arrays(df_rally.rally_start, df_rally.rally_end, closed='both')
mask = df_dec_light.index.map(ix.to_series()).notna()
Then use the mask to filter the dataframe:
df_dec_light[mask]
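A self-contained usage sketch of this approach, with tiny frames in the shape shown above (the values are for illustration only):
import pandas as pd

df_dec_light = pd.DataFrame(
    {'log_return': [0.005, 0.001997, -0.003108]},
    index=pd.to_datetime(['1970-12-17', '1970-12-18', '1970-12-21']))
df_rally = pd.DataFrame({'rally_start': pd.to_datetime(['1970-12-18']),
                         'rally_end':   pd.to_datetime(['1970-12-31'])})

# intervals are closed on both sides, so the start and end dates are included
ix = pd.IntervalIndex.from_arrays(df_rally.rally_start, df_rally.rally_end, closed='both')
mask = df_dec_light.index.map(ix.to_series()).notna()
print(df_dec_light[mask])  # keeps 1970-12-18 and 1970-12-21, drops 1970-12-17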
To solve this we can first turn the ranges in df_rally into DatetimeIndex objects by calling pd.date_range on each row. This gives us each row of df_rally as a pd.DatetimeIndex.
As we later want to check whether the index of df_dec_light is in any of the ranges, we combine all of these ranges. This is done with union.
We assert that the newly created pd.Series index_list is not empty and then select its first element. This element is the pd.DatetimeIndex on which we can now call union with all the other pd.DatetimeIndex objects.
We can then use pd.Index.isin to create a boolean array indicating whether each index date is found in the passed set of dates.
If we now apply this mask to df_dec_light, it returns only the entries that fall within one of the specified ranges of df_rally.
index_list = df_rally.apply(lambda x: pd.date_range(x['rally_start'], x['rally_end']), axis=1)
assert not index_list.empty
all_ranges = index_list.iloc[0]
for rng in index_list:
    all_ranges = all_ranges.union(rng)
print(all_ranges)
mask = df_dec_light.index.isin(all_ranges)
print(df_dec_light[mask])
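If the explicit loop feels verbose, the same union can be expressed with functools.reduce; a small equivalent sketch, assuming the index_list built above:
from functools import reduce

# fold the per-row DatetimeIndex ranges into a single combined index
all_ranges = reduce(lambda acc, rng: acc.union(rng), index_list)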

Find the missing month in a given date range, then add that missing date to the data with the same records as the last date

I have a statement of accounts, where I have Unique ID, Disbursed date, payment date and the balance amount.
Date range for the data below: Disbursed date to May-2022.
Example of the data:
Unique Disbursed date payment date balance amount
123 2022-Jan-13 2022-Jan-27 10,000
123 2022-Jan-13 2022-Feb-28 5,000
123 2022-Jan-13 2022-Apr-29 2,000
First, I want to group by payment date (the last day of each month), and as the aggregation function, instead of sum or mean, I want to carry forward the balance shown on the last day of that month.
As you can see, March is missing from the records; here I want to add a new record for March with the same balance as given in Feb-22, i.e. 5,000, and the date for the new record should be the last day of Mar-22.
Since the date range is given up to 2022-May, I also want to add another new record for May-22 with the same balance as given in the previous month (Apr-22), i.e. 2,000, and the date for the new record should be the last day of May-22.
Note: I have multiple unique ids like 123, 456, 789, etc.
I tried the code below to find the missing month:
for i in df['date']:
    pd.date_range(i, '2020-11-28').difference(df.index)
    print(i)
but it gives the missing dates day-wise. I want to find the missing "month" instead of the missing date, for each unique id.
You can use:
# generate needed month ends
idx = pd.date_range('2022-01', '2022-06', freq='M')
out = (df
       # compute the month end for existing data
       .assign(month_end=pd.to_datetime(df['payment date'])
                           .sub(pd.Timedelta('1d'))
                           .add(pd.offsets.MonthEnd()))
       .set_index(['Unique', 'month_end'])
       # reindex with missing ID/month ends
       .reindex(pd.MultiIndex.from_product([df['Unique'].unique(), idx],
                                           names=['Unique', 'idx']))
       .reset_index()
       # fill missing month ends with the correct format
       .assign(**{'payment date': lambda d:
                  d['payment date'].fillna(d['idx'].dt.strftime('%Y-%b-%d'))})
       # ffill the data per ID
       .groupby('Unique').ffill()
       )
output:
Unique idx Disbursed date payment date balance amount
0 123 2022-01-31 2022-Jan-13 2022-Jan-27 10,000
1 123 2022-02-28 2022-Jan-13 2022-Feb-28 5,000
2 123 2022-03-31 2022-Jan-13 2022-Mar-31 5,000
3 123 2022-04-30 2022-Jan-13 2022-Apr-29 2,000
4 123 2022-05-31 2022-Jan-13 2022-May-31 2,000
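For reference, a minimal construction of the input frame assumed by the pipeline above, with column names and values taken from the question's example:
import pandas as pd

df = pd.DataFrame({
    'Unique': [123, 123, 123],
    'Disbursed date': ['2022-Jan-13', '2022-Jan-13', '2022-Jan-13'],
    'payment date': ['2022-Jan-27', '2022-Feb-28', '2022-Apr-29'],
    'balance amount': ['10,000', '5,000', '2,000'],
})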

count values of groups by consecutive days

I have data with 3 columns: date, id, sales.
My first task is filtering sales above 100, which I did.
The second task is grouping ids by consecutive days.
index  date        id  sales
0      01/01/2018  03  101
1      01/01/2018  07  178
2      02/01/2018  03  120
3      03/01/2018  03  150
4      05/01/2018  07  205
The result should be:
index  id  count
0      03  3
1      07  1
2      07  1
I need to do this task without using pandas/dataframes, but right now I can't imagine from which side to attack this problem.
Just for effort, I tried the suggested solution here: count consecutive days python dataframe
but the ids are not grouped.
Here is my code:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date']).dt.date
s = data.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = data.groupby(['id', s]).size().reset_index(level=0, drop=True)
It is very important that "new_frame" has a "count" column, because afterwards I need to count ids by ranges of those count days in the "count" column, e.g. the count of ids in the range of 0-7 days, 7-12 days, etc., but that's not part of my question.
Thank you a lot.
Your code is close, but needs some fine-tuning, as follows:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date'], dayfirst=True)
df2 = data.sort_values(['id', 'date'])
s = df2.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = df2.groupby(['id', s]).size().reset_index(level=1, drop=True).reset_index(name='count')
Result:
print(new_frame)
id count
0 3 3
1 7 1
2 7 1
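For reference, the question's sample data can be constructed like this to reproduce the result above (dates are dd/mm/yyyy strings, as in the question):
import pandas as pd

df = pd.DataFrame({
    'date':  ['01/01/2018', '01/01/2018', '02/01/2018', '03/01/2018', '05/01/2018'],
    'id':    ['03', '07', '03', '03', '07'],
    'sales': [101, 178, 120, 150, 205],
})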
Summary of changes:
As your dates are in dd/mm/yyyy instead of the default mm/dd/yyyy, you have to specify the parameter dayfirst=True in pd.to_datetime(). Otherwise, 02/01/2018 will be regarded as 2018-02-01 instead of 2018-01-02 as expected and the day diff with adjacent entries will be around 30 as opposed to 1.
We added a sort step to sort by columns id and date to simplify the later grouping during the creation of the series s.
In the last groupby(), the code reset_index(level=0, drop=True) should drop level=1 instead, since level=0 is the id field, which we want to keep.
In the last groupby(), we also do an extra .reset_index(name='count') to convert the pandas Series back to a dataframe and name the new column count.
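As a rough sketch of the follow-up step mentioned in the question (bucketing the streak lengths into day ranges), pd.cut can bin the count column; the bin edges here are assumptions:
# count how many streaks (id groups) fall into each day range
bins = [0, 7, 12, float('inf')]
buckets = pd.cut(new_frame['count'], bins=bins)
print(buckets.value_counts(sort=False))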

Python Pandas Year To Date vs. Last Year To Date (YTD, LYTD)

I am trying to work out how to get the values of year to date versus last year to date from a dataframe.
Dataframe:
ID start_date distance
1 2019-7-25 2
2 2019-7-26 2
3 2020-3-4 1
4 2020-3-4 1
5 2020-3-5 3
6 2020-3-6 3
There is data back to 2017 and more data will keep getting added so I would like the YTD and LYTD to be dynamic based upon the current year.
I know how to get the cumulative sum for each year and month but I am really struggling with how to calculate the YTD and LYTD.
year_month_distance_df = distance_kpi_df.groupby(["Start_Year","Start_Month"]).agg({"distance":"sum"}).reset_index()
The other code I tried:
cum_sum_distance_ytd = distance_kpi_df[["start_date_local","distance"]]
cum_sum_distance_ytd = cum_sum_distance_ytd.set_index("start_date_local")
cum_sum_distance_ytd = cum_sum_distance_ytd.groupby(pd.Grouper(freq = "D")).sum()
When I try this logic and add Start_Day into the group by it obviously just sums all the data for that day.
Expected output:
Year to Date = 8
Last Year to Date = 4
You could split the date into its components and get the YTD for all years with:
# start_date is assumed to be a datetime column (convert with pd.to_datetime first if needed)
expanding = df.groupby([
    df.start_date.dt.month, df.start_date.dt.day, df.start_date.dt.year
]).distance.sum().unstack().cumsum()
Unstacking will fill with np.nan wherever a year does not have a value for the row's date. If that is a problem, you can use the fill_value parameter:
.unstack(fill_value=0).cumsum()
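To then read off the dynamic YTD and LYTD values from expanding, one possible sketch (it assumes the (month, day) index is sorted and that both the current and previous year appear as columns):
from datetime import date

today = date.today()

# slice all rows up to today's (month, day), forward-fill gaps, take the
# latest available row, then read this year's and last year's columns
upto_today = expanding.sort_index().loc[:(today.month, today.day)]
latest = upto_today.ffill().iloc[-1]
ytd, lytd = latest[today.year], latest[today.year - 1]
print(f"Year to Date = {ytd}, Last Year to Date = {lytd}")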

Linear Regression in Pandas Groupby with freq='W-MON'

I have data over the timespan of over a year. I am interested in grouping the data by week, and getting the slope of two variables by week. Here is what the data looks like:
Date | Total_Sales| Products
2015-12-30 07:42:50| 2900 | 24
2015-12-30 09:10:10| 3400 | 20
2016-02-07 07:07:07| 5400 | 25
2016-02-07 07:08:08| 1000 | 64
So ideally I would like to perform a linear regression on total_sales and products on each week of this data and record the slope. This works when each week is represented in the data, but I have problems when there are some weeks skipped in the data. I know I could do this with turning the date into the week number but I feel like the result will be skewed because there is over a year's worth of data.
Here is the code I have so far:
df['Date']=pd.to_datetime(vals['EventDate']) - pd.to_timedelta(7,unit='d')
df.groupby(pd.Grouper(key='Week', freq='W-MON')).apply(lambda v: linregress(v.Total_Sales, v.Products)[0]).reset_index()
However, I get the following error:
ValueError: Inputs must not be empty.
I expect the output to look like this:
Date | Slope
2015-12-28 | -0.008
2016-02-01 | -0.008
I assume this is happening because the groupby is not working properly and it cannot recognise the datetime as the key, as the Date column has varying timestamps too.
Try the following code. It worked for me:
from datetime import timedelta
from scipy import stats

df['Date'] = pd.to_datetime(df['Date'])  #### Converts the Date column to Python datetimes
df['daysoffset'] = df['Date'].apply(lambda x: x.weekday())
#### Returns the day of the week as an integer, where Monday is 0 and Sunday is 6.
df['week_start'] = df.apply(lambda x: x['Date'].date() - timedelta(days=x['daysoffset']), axis=1)
#### x['Date'].date() removes the timestamp and considers only the date;
#### this line assigns the date of the preceding Monday to column 'week_start'.
df.groupby('week_start').apply(lambda v: stats.linregress(v.Total_Sales, v.Products)[0]).reset_index()
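As a design note, the row-wise apply for week_start can also be written in vectorized form; a small equivalent sketch (assuming Date has already been converted to datetime as above):
# subtract each date's weekday offset (Monday=0) to land on that week's Monday,
# then drop the time-of-day component
df['week_start'] = (df['Date'] - pd.to_timedelta(df['Date'].dt.weekday, unit='d')).dt.normalize()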
