My data looks like this:
id Open Close
1 1/1/15 1/1/15
2 1/1/15 2/1/15
3 3/1/15 4/1/15
I need to create a dataframe that shows the number of open cases on any day, so the result of the data above would look like:
Date #Open
1/1/15 1
2/1/15 0
3/1/15 1
Any ideas?
This method creates an index of all days between the first open date and the later of the last open and last close dates. It then iterates through each of these dates, filters the dataframe to the cases open on that date, and takes the resulting size.
df['Open'] = pd.to_datetime(df.Open)
df['Close'] = pd.to_datetime(df.Close)

idx = pd.date_range(df.Open.min(), max(df.Open.max(), df.Close.max()))

# For each date, count the rows open on it (open date inclusive, close date exclusive).
cases = pd.DataFrame([len(df[(date >= df.Open) & (date < df.Close)])
                      for date in idx],
                     index=idx, columns=['case_count'])
>>> cases.head(3)
case_count
2015-01-01 1
2015-01-02 1
2015-01-03 1
>>> cases.tail(3)
case_count
2015-03-30 1
2015-03-31 1
2015-04-01 0
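For larger frames, the per-date filter rescans the whole dataframe once per day. A vectorized sketch of the same count (not part of the original answer) tallies opens and closes per day and takes a running difference:

import pandas as pd

# Hypothetical reconstruction of the sample frame from the question.
df = pd.DataFrame({'Open': ['1/1/15', '1/1/15', '3/1/15'],
                   'Close': ['1/1/15', '2/1/15', '4/1/15']})
df['Open'] = pd.to_datetime(df['Open'])
df['Close'] = pd.to_datetime(df['Close'])

idx = pd.date_range(df.Open.min(), max(df.Open.max(), df.Close.max()))

# Each open adds a case on its date, each close removes one;
# the cumulative sum is the number open on each day (close date exclusive).
opens = df['Open'].value_counts().reindex(idx, fill_value=0)
closes = df['Close'].value_counts().reindex(idx, fill_value=0)
cases = (opens - closes).cumsum().to_frame('case_count')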
I want to add a column to my data frame prod_data based on a range of dates. This is an example of the data in the ['Mount Time'] column that the new column will be derived from:
0 2022-08-17 06:07:00
1 2022-08-17 06:12:00
2 2022-08-17 06:40:00
3 2022-08-17 06:45:00
4 2022-08-17 06:47:00
The new column is named ['Week'] and I want it to run Monday through Sunday, with week 1 starting on 9/5/22 and running through 9/11/22, week 2 covering the next Monday through Sunday, and so on up to the last week, which would be 53. I would also like weeks before 9/5 to have negative week numbers, so 8/29/22 would be the start of week -1 and so on.
The only thing I could think of was to create 2 massive lists and use np.select to define the parameters of the column, but there has to be a cleaner way of doing this, right?
You can use pandas datetime objects to figure out how many days away a date is from your start date, 9/5/2022, and then use floor division to convert that to week numbers. I made the "mount_time" column just to emphasize that the original column should be a datetime object.
prod_data["mount_time"] = pd.to_datetime(prod_data["Mount Time"])
start_date = pd.to_datetime("9/5/2022")
days_away = prod_data.mount_time - start_date
prod_data["Week"] = (days_away.dt.days // 7) + 1
As intended, 9/5/2022 through 9/11/2022 will have a value of 1. 8/29/2022 would start week 0 (not -1 as you wrote) unless you want 9/5/2022 to start as week 0 (in which case just delete the + 1 from the code). Some more examples:
>>> test[ ["date", "Week" ] ]
date Week
0 2022-08-05 -4
1 2022-08-14 -3
2 2022-08-28 -1
3 2022-08-29 0
4 2022-08-30 0
5 2022-09-05 1
6 2022-09-11 1
7 2022-09-12 2
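For reference, a minimal self-contained sketch (the test frame here is assumed, rebuilt from the dates shown) that reproduces the table above:

import pandas as pd

# Hypothetical test frame rebuilt from the dates in the table above.
test = pd.DataFrame({'date': pd.to_datetime(
    ['2022-08-05', '2022-08-14', '2022-08-28', '2022-08-29',
     '2022-08-30', '2022-09-05', '2022-09-11', '2022-09-12'])})

start_date = pd.to_datetime('9/5/2022')
# Floor division rounds toward negative infinity, so pre-start
# dates land in week 0, -1, -2, ... before the + 1 shift.
test['Week'] = ((test['date'] - start_date).dt.days // 7) + 1
print(test[['date', 'Week']])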
Given data is:
id  date
1   10/20/2019
2   11/02/2019
3   12/12/2019
1   02/06/2019
1   05/14/2018
3   5/13/2019
2   07/20/2018
3   08/23/2019
2   06/25/2018
I want it in this format:
id  date1       date2       date3
1   05/14/2018  02/06/2019  10/20/2019
2   06/25/2018  07/20/2018  11/02/2019
3   05/13/2019  08/23/2019  12/12/2019
I am using a for loop to implement this on 400,000+ unique ids and it's time-consuming. Is there an easier method?
I am using this code:
Each policy number has multiple DATEs; I want them arranged min to max in a row across different columns, as shown in the second table.
f = pd.DataFrame()
for i in range(0, len(uni_pol)):
    d = ct.loc[ct["Policy_no"] == uni_pol[i]]
    t = d.sort_values('DATE', ascending=True).T
    df = pd.DataFrame(t)
    a = df.loc['Policy_no']
    col = df.columns
    df['Policy_no'] = a.loc[col[0]]
    for j in range(0, len(col)):
        nn = str(j + 1)
        name = "Paydt" + nn
        df[name] = df[col[j]]
        cc = col[j]
        df = df.drop([cc], axis=1)
    f = f.append(df.loc['DATE'])
Here's one approach:
sort_values by "date"; then groupby "id" and create a list from dates; this builds a Series. Then create a DataFrame from the lists in the Series:
df['date'] = pd.to_datetime(df['date'])
s = df.sort_values(by='date').groupby('id')['date'].agg(list)
out = pd.DataFrame(s.tolist(), index=s.index,
                   columns=[f'date{i}' for i in range(1, len(s.iat[0]) + 1)]).reset_index()
Output:
id date1 date2 date3
0 1 2018-05-14 2019-02-06 2019-10-20
1 2 2018-06-25 2018-07-20 2019-11-02
2 3 2019-05-13 2019-08-23 2019-12-12
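If the ids can have different numbers of dates (an assumption beyond the sample data, where every id has exactly three), a pivot on a per-id counter avoids hard-coding the column count; ids with fewer dates get NaN. A sketch:

# Number each id's dates in order, then pivot them into columns.
out = (df.sort_values('date')
         .assign(n=lambda d: d.groupby('id').cumcount() + 1)
         .pivot(index='id', columns='n', values='date')
         .add_prefix('date')
         .reset_index())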
I have a dataframe of daily license_type activations (either full or trial) as shown below. Basically, I am trying to see the monthly count of Trial to Full License conversions. I am trying to do this by taking into consideration the daily data and the user_email column.
Date User_Email License_Type P.Letter Month (conversions)
0 2017-01-01 10431046623214402832 trial d 2017-01
1 2017-07-09 246853380240772174 trial b 2017-07
2 2017-07-07 13685844038024265672 trial e 2017-07
3 2017-02-12 2475366081966194134 full c 2017-02
4 2017-04-08 761179767639020420 full g 2017-04
The logic I have is to iteratively check the User_Email column. If a User_Email value is a duplicate, then check the License_Type column: if the value there is 'full', return 1 in a new column called 'Conversion', else return 0. This would be the amendment to the original dataframe above.
Then group the 'Date' column by month, and I should have an aggregate value of monthly conversions in the 'Conversion' column. It should look something like this:
Date
2017-Apr 1
2017-Feb 2
2017-Jan 1
2017-Jul 0
2017-Mar 1
Name: Conversion
Below was my attempt at getting the desired output above:
# attempt to create a new column Conversion and fill with 1 and 0 for if converted or not
for values in df['User_email']:
    if value.is_unique:
        df['Conversion'] = 0  # because there is no chance to go from trial to Full
    else:
        if df['License_type'] = 'full':  # check if license type is full
            df['Conversion'] = 1  # if full, I assume it was originally trial and now is full

# Grouping daily data by month to get monthly total of conversions
converted = df.groupby(df['Date'].dt.strftime('%Y-%b'))['Conversion'].sum()
Your sample data doesn't have the features you say you are looking for. Rather than loop (always a pandas anti-pattern), use a simple function that operates row by row:
For the uniqueness test, I'm getting a count of each email address first and setting the number of times it occurs on each row.
Your logic I've transcribed in a slightly different way.
import re
import pandas as pd

data = """ Date User_Email License_Type P.Letter Month
0 2017-01-01 10431046623214402832 trial d 2017-01
1 2017-07-09 246853380240772174 trial b 2017-07
2 2017-07-07 13685844038024265672 trial e 2017-07
3 2017-02-12 2475366081966194134 full c 2017-02
3 2017-03-13 2475366081966194134 full c 2017-03
3 2017-03-13 2475366081966194 full c 2017-03
4 2017-04-08 761179767639020420 full g 2017-04"""
a = [[t.strip() for t in re.split(" ", l) if t.strip() != ""]
     for l in [re.sub("([0-9]?[ ])*(.*)", r"\2", l) for l in data.split("\n")]]
df = pd.DataFrame(a[1:], columns=a[0])
df["Date"] = pd.to_datetime(df["Date"])
df = df.assign(
    emailc=df.groupby("User_Email")["User_Email"].transform("count"),
    Conversion=lambda dfa: dfa.apply(
        lambda r: 0 if r["emailc"] == 1 or r["License_Type"] == "trial" else 1, axis=1)
).drop("emailc", axis=1)
df.groupby(df['Date'].dt.strftime('%Y-%b'))['Conversion'].sum()
Output:
Date
2017-Apr 0
2017-Feb 1
2017-Jan 0
2017-Jul 0
2017-Mar 1
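The same logic can also be written without apply, as a single boolean expression (a sketch, equivalent to the emailc test above):

# An email seen more than once whose row is 'full' counts as a conversion.
df['Conversion'] = ((df.groupby('User_Email')['User_Email'].transform('count') > 1)
                    & (df['License_Type'] == 'full')).astype(int)
df.groupby(df['Date'].dt.strftime('%Y-%b'))['Conversion'].sum()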
I want to perform a rolling median on the price column over the 4 days back; the data will be grouped by date. So basically I want to take the prices for a given day and all prices for the 4 days back, and calculate the median of these values.
Here are the sample data:
id date price
1637027 2020-01-21 7045204.0
280955 2020-01-11 3590000.0
782078 2020-01-28 2600000.0
1921717 2020-02-17 5500000.0
1280579 2020-01-23 869000.0
2113506 2020-01-23 628869.0
580638 2020-01-25 650000.0
1843598 2020-02-29 969000.0
2300960 2020-01-24 5401530.0
1921380 2020-02-19 1220000.0
853202 2020-02-02 2990000.0
1024595 2020-01-27 3300000.0
565202 2020-01-25 3540000.0
703824 2020-01-18 3990000.0
426016 2020-01-26 830000.0
I got close with combining rolling and groupby:
df.groupby('date').rolling(window=4, on='date')['price'].median()
But this seems to add one row per index value, and given how the median is defined, I can't merge these rows to produce one result per date.
Result now looks like this:
date date
2020-01-10 2020-01-10 NaN
2020-01-10 NaN
2020-01-10 NaN
2020-01-10 3070000.0
2020-01-10 4890000.0
...
2020-03-11 2020-03-11 4290000.0
2020-03-11 3745000.0
2020-03-11 3149500.0
2020-03-11 3149500.0
2020-03-11 3149500.0
Name: price, Length: 389716, dtype: float64
It seems it just dropped the first 3 values and then printed the price values.
Is it possible to get one lagged / moving median value per date?
You can use rolling with a frequency window of 5 days to get today plus the last 4 days, then drop_duplicates to keep the last row per day. First create a copy (if you want to keep the original), sort_values by date, and make sure the date column is a datetime:
#sort and change to datetime
df_f = df[['date','price']].copy().sort_values('date')
df_f['date'] = pd.to_datetime(df_f['date'])
#create the column rolling
df_f['price'] = df_f.rolling('5D', on='date')['price'].median()
#drop_duplicates and keep the last row per day
df_f = df_f.drop_duplicates(['date'], keep='last').reset_index(drop=True)
print(df_f)
date price
0 2020-01-11 3590000.0
1 2020-01-18 3990000.0
2 2020-01-21 5517602.0
3 2020-01-23 869000.0
4 2020-01-24 3135265.0
5 2020-01-25 2204500.0
6 2020-01-26 849500.0
7 2020-01-27 869000.0
8 2020-01-28 2950000.0
9 2020-02-02 2990000.0
10 2020-02-17 5500000.0
11 2020-02-19 3360000.0
12 2020-02-29 969000.0
This is a step by step process. There are probably more efficient methods of getting what you want. Note, if you have time information for your dates, you would need to drop that information before grouping by date.
import pandas as pd
import statistics as stat
import numpy as np
# Replace with you data import
df = pd.read_csv('random_dates_prices.csv')
# Convert your date to a datetime
df['date'] = pd.to_datetime(df['date'])
# Sort your data by date
df = df.sort_values(by = ['date'])
# Create group by object
dates = df.groupby('date')
# Reformat dataframe for one row per day, with prices in a nested list
df = pd.DataFrame(dates['price'].apply(lambda s: s.tolist()))
# Extract price lists to a separate list
prices = df['price'].tolist()
# Initialize list to store past four days of prices for current day
four_days = []
# Loop over the prices list to combine the last four days to a single list
for i in range(3, len(prices), 1):
    x = i - 1
    y = i - 2
    z = i - 3
    four_days.append(prices[i] + prices[x] + prices[y] + prices[z])
# Initialize a list to store median values
medians = []
# Loop through four_days list and calculate the median of the last four days for the current date
for i in range(len(four_days)):
    medians.append(stat.median(four_days[i]))
# Create dummy zero values so the new lists can be added to the dataframe
four_days.insert(0, 0)
four_days.insert(0, 0)
four_days.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
# Add both new lists to data frames
df['last_four_day_prices'] = four_days
df['last_four_days_median'] = medians
# Replace dummy zeros with np.nan
df[['last_four_day_prices', 'last_four_days_median']] = df[['last_four_day_prices', 'last_four_days_median']].replace(0, np.nan)
# Clean data frame so you only have a single date a median value for past four days
df_clean = df.drop(['price', 'last_four_day_prices'], axis=1)
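The same positional window can be condensed into a few pandas/numpy lines (a sketch of the loop above, starting again from the original df; note it uses the last four distinct dates present in the data, not calendar days, unlike the '5D' answer):

import numpy as np
import pandas as pd

# Median over the prices of each date plus the three preceding dates.
daily = df.sort_values('date').groupby('date')['price'].apply(list)
medians = [np.median(sum(daily.iloc[i - 3:i + 1].tolist(), []))
           for i in range(3, len(daily))]
df_clean = pd.Series([np.nan] * 3 + medians, index=daily.index,
                     name='last_four_days_median').to_frame()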
I need to get the month-end balance from a series of entries.
Sample data:
date contrib totalShrs
0 2009-04-23 5220.00 10000.000
1 2009-04-24 10210.00 20000.000
2 2009-04-27 16710.00 30000.000
3 2009-04-30 22610.00 40000.000
4 2009-05-05 28909.00 50000.000
5 2009-05-20 38409.00 60000.000
6 2009-05-28 46508.00 70000.000
7 2009-05-29 56308.00 80000.000
8 2009-06-01 66108.00 90000.000
9 2009-06-02 78108.00 100000.000
10 2009-06-12 86606.00 110000.000
11 2009-08-03 95606.00 120000.000
The output would look something like this:
2009-04-30 40000
2009-05-31 80000
2009-06-30 110000
2009-07-31 110000
2009-08-31 120000
Is there a simple Pandas method?
I don't see how I can do this with something like a groupby?
Or would I have to do something like iterrows, find all the monthly entries, order them by date and pick the last one?
Thanks.
Use pd.Grouper with GroupBy.last, forward fill missing values with ffill, and convert back to columns with Series.reset_index:
#if necessary
#df['date'] = pd.to_datetime(df['date'])
df = df.groupby(pd.Grouper(freq='m',key='date'))['totalShrs'].last().ffill().reset_index()
#alternative
#df = df.resample('m',on='date')['totalShrs'].last().ffill().reset_index()
print(df)
date totalShrs
0 2009-04-30 40000.0
1 2009-05-31 80000.0
2 2009-06-30 110000.0
3 2009-07-31 110000.0
4 2009-08-31 120000.0
The following gives you the information you want, i.e. end-of-month values, though the format is not exactly what you asked for (it groups on the month number alone, so months without entries, like July here, are skipped):
df['month'] = df['date'].str.split('-', expand=True)[1]  # split date column to get month column
newdf = pd.DataFrame(columns=df.columns)  # create a new dataframe for output
grouped = df.groupby('month')  # get grouped values
for g in grouped:  # for each group, get last row
    gdf = pd.DataFrame(data=g[1])
    newdf.loc[len(newdf), :] = gdf.iloc[-1, :]  # fill new dataframe with last row obtained
newdf = newdf.drop('date', axis=1)  # drop date column, since month column is there
print(newdf)
Output:
contrib totalShrs month
0 22610 40000 04
1 56308 80000 05
2 86606 110000 06
3 95606 120000 08
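Note that grouping on the bare month number would also merge the same month across different years. Grouping on a year-month period instead (a sketch, not part of the original answer) keeps years separate:

# Group on a year-month period so e.g. 2009-04 and 2010-04 stay distinct.
df['date'] = pd.to_datetime(df['date'])
newdf = df.sort_values('date').groupby(df['date'].dt.to_period('M')).last()
print(newdf)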