How can I create a unique id in this pandas dataframe with datetime and number entry? - python

I've got a pandas dataframe that looks like this
miles dollars gallons date gal_cost mpg tank%_used day
0 253.2 21.37 11.138 2019-01-15 1.918657 22.732986 0.821993 Tuesday
1 211.9 22.24 11.239 2019-01-26 1.978824 18.853991 0.829446 Saturday
2 258.1 22.70 11.708 2019-02-02 1.938845 22.044756 0.864059 Saturday
3 223.0 22.24 11.713 2019-02-15 1.898745 19.038675 0.864428 Friday
I'd like to create a new column called 'id' that is unique for each entry. For the first entry in the df, the id would be c0115201901 because it is from the df_c dataframe, the date is 01 15 2019 and it is the first entry.
I know I'll end up doing something like this
df_c = df_c.assign(id=('c'+df_c['date']) + ?????)
but I'd like to parse the df_c['date'] column to pull values for the day, month and year individually. The df_c['date'] column is a datetime64[ns] type.
The other issue is I'd like to have a counter at the end of the id to count which number entry for the date it is. For example, 01 for the first entry, 02 for the second, etc.
I also have a df_m dataframe, but I can repeat the process with a different letter for that dataframe.

Refer to the pandas datetime-properties docs.
The day, month and year can be extracted individually like
df_c['date'].dt.day, df_c['date'].dt.month and df_c['date'].dt.year
or formatted in one string with df_c['date'].dt.strftime('%m%d%Y').
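Putting it together, here is a minimal sketch (sample data reconstructed from the question) that builds the id from the formatted date plus a per-date counter via groupby/cumcount:

```python
import pandas as pd

# sample data reconstructed from the question
df_c = pd.DataFrame({'date': pd.to_datetime(
    ['2019-01-15', '2019-01-26', '2019-02-02', '2019-02-15'])})

# per-date counter: 01 for the first entry on a date, 02 for the second, ...
counter = df_c.groupby(df_c['date'].dt.date).cumcount().add(1)

df_c = df_c.assign(id='c'
                      + df_c['date'].dt.strftime('%m%d%Y')
                      + counter.astype(str).str.zfill(2))
print(df_c['id'].tolist())
# ['c0115201901', 'c0126201901', 'c0202201901', 'c0215201901']
```

For the df_m dataframe, the same pattern applies with 'm' in place of 'c'.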

Related

Find the missing month in given date range then add that missing date in the data with same records as given in the last date

I have a statement of accounts, where I have a Unique ID, Disbursed date, payment date and the balance amount.
Date range for the below data = Disbursed date to May-2022.
Example of the data:
Unique Disbursed date payment date balance amount
123 2022-Jan-13 2022-Jan-27 10,000
123 2022-Jan-13 2022-Feb-28 5,000
123 2022-Jan-13 2022-Apr-29 2,000
First I want to groupby payment date (last day of each month), and as an aggregation function, instead of sum or mean, I want to carry forward the balance reflected on the last day of that month.
As you can see, March is missing from the records; here I want to add a new record for March with the same balance as given in Feb-22, i.e. 5,000, and the date for the new record should be the last day of Mar-22.
Since the date range is given until May-2022, I also want to add another new record for May-22 with the same balance as in the previous month (Apr-22), i.e. 2,000, and the date for the new record should be the last day of May-22.
Note : I have multiple unique ids like 123, 456, 789, etc.
I tried the code below to find the missing months:
for i in df['date']:
    pd.date_range(i, '2020-11-28').difference(df.index)
    print(i)
but it gives day-wise missing dates. I want to find the missing "month" instead, for each unique id.
You can use:
# generate needed month ends
idx = pd.date_range('2022-01', '2022-06', freq='M')
out = (df
       # compute the month end for existing data
       .assign(month_end=pd.to_datetime(df['payment date'])
                           .sub(pd.Timedelta('1d'))
                           .add(pd.offsets.MonthEnd()))
       .set_index(['Unique', 'month_end'])
       # reindex with missing ID/month ends
       .reindex(pd.MultiIndex.from_product([df['Unique'].unique(), idx],
                                           names=['Unique', 'idx']))
       .reset_index()
       # fill missing month end with correct format
       .assign(**{'payment date': lambda d:
                  d['payment date'].fillna(d['idx'].dt.strftime('%Y-%b-%d'))})
       # ffill the data per ID
       .groupby('Unique').ffill()
       )
output:
Unique idx Disbursed date payment date balance amount
0 123 2022-01-31 2022-Jan-13 2022-Jan-27 10,000
1 123 2022-02-28 2022-Jan-13 2022-Feb-28 5,000
2 123 2022-03-31 2022-Jan-13 2022-Mar-31 5,000
3 123 2022-04-30 2022-Jan-13 2022-Apr-29 2,000
4 123 2022-05-31 2022-Jan-13 2022-May-31 2,000
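One step worth unpacking: adding pd.offsets.MonthEnd() to a date that is already a month end would roll it forward to the *next* month end, which is why the code subtracts one day first. A quick sketch of just that step:

```python
import pandas as pd

s = pd.to_datetime(pd.Series(['2022-01-27', '2022-01-31', '2022-02-28']))

# subtract a day, then roll forward to the month end:
# mid-month dates map to their own month end, and dates that are
# already month ends stay put instead of jumping to the next month
month_end = s.sub(pd.Timedelta('1d')).add(pd.offsets.MonthEnd())
print(month_end.dt.strftime('%Y-%m-%d').tolist())
# ['2022-01-31', '2022-01-31', '2022-02-28']
```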

How to create time series for every end date of the year with its corresponding value?

I have a dataframe that looks like this, where the 'Date' column already has datetime64 dtype:
Date Income_Company_A
0 1990-02-01 2185600.0
1 1990-02-02 3103200.0
........................................
5467 2011-10-10 29555500.0
5468 2011-10-11 54708100.0
How can I get the values for Income_Company_A where the date has to be an ending date for each year, i.e., it has to be 31 Dec for every year starting from 1990 till 2011?
Also, if the value is Null/NaN for the ending date for each year, then how can I fill it up with the value that can be found prior to that date from the dataframe?
The first output with NaN values should look like this:
1990-12-31 1593200.0
1991-12-31 4802000.0
1992-12-31 3302000.0
1993-12-31 5765200.0
1994-12-31 NaN
Then by replacing the NaN value for the date 1994-12-31 by the value that can be found for a prior date, for example, 1994-12-29 7865200.0, the final output should look like this:
1990-12-31 1593200.0
1991-12-31 4802000.0
1992-12-31 3302000.0
1993-12-31 5765200.0
1994-12-31 7865200.0
Assuming Date column is already in datetime data type:
df.loc[(df['Date'].dt.month == 12) & (df['Date'].dt.day == 31)].ffill()
If 31 Dec might be missing for some year, take the last available date of each year instead:
df.loc[df.groupby(df['Date'].dt.year)['Date'].idxmax()].ffill()
Use resample and take the last valid value of year:
out = df.assign(Date=df['Date'].astype('datetime64')).resample('Y', on='Date').last()
You can omit .assign(...) if your Date column already has datetime64 dtype.
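Note that last() takes the last *non-null* value within each year, which is what handles the NaN-at-year-end case. A small self-contained sketch (the income figures are made up, loosely based on the question's example):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Date': pd.to_datetime(['1993-12-30', '1993-12-31',
                            '1994-12-29', '1994-12-31']),
    'Income_Company_A': [5100000.0, 5765200.0, 7865200.0, np.nan],
})

# last() skips NaN, so 1994 picks up the 1994-12-29 value
out = df.resample('Y', on='Date')['Income_Company_A'].last()
print(out)
# 1993-12-31    5765200.0
# 1994-12-31    7865200.0
```

(On pandas >= 2.2 the 'Y' alias is deprecated in favour of 'YE'.)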

count values of groups by consecutive days

I have data with 3 columns: date, id, sales.
My first task was filtering sales above 100, which I did.
The second task is grouping ids by consecutive days.
index  date        id  sales
0      01/01/2018  03  101
1      01/01/2018  07  178
2      02/01/2018  03  120
3      03/01/2018  03  150
4      05/01/2018  07  205
the result should be:
index  id  count
0      03  3
1      07  1
2      07  1
I need to do this task without using pandas/dataframes eventually, but right now I can't see from which side to attack this problem.
Just for the effort, I tried the suggestion from count consecutive days python dataframe, but the ids were not grouped.
Here is my code:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date']).dt.date
s = data.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = data.groupby(['id', s]).size().reset_index(level=0, drop=True)
It is very important that "new_frame" has a "count" column, because afterwards I need to count ids by ranges of those consecutive-day counts, e.g. the count of ids in the range of 0-7 days, 7-12 days, etc. But that's not part of my question.
Thank you a lot.
Your code is close, but needs some fine-tuning, as follows:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date'], dayfirst=True)
df2 = data.sort_values(['id', 'date'])
s = df2.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = df2.groupby(['id', s]).size().reset_index(level=1, drop=True).reset_index(name='count')
Result:
print(new_frame)
id count
0 3 3
1 7 1
2 7 1
Summary of changes:
As your dates are in dd/mm/yyyy instead of the default mm/dd/yyyy, you have to specify the parameter dayfirst=True in pd.to_datetime(). Otherwise, 02/01/2018 will be regarded as 2018-02-01 instead of 2018-01-02 as expected and the day diff with adjacent entries will be around 30 as opposed to 1.
We added a sort step to sort by columns id and date to simplify the later grouping during the creation of the series s.
In the last groupby(), the code reset_index(level=0, drop=True) should drop level=1 instead, since level=0 is the id field, which we want to keep.
In the last groupby(), we do an extra .reset_index(name='count') to make the Pandas series change back to a dataframe and also name the new column as count.
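For the follow-up counting the asker mentions (how many ids fall in each range of consecutive-day counts), one possible sketch uses pd.cut; the bin edges and labels below are hypothetical and should be adjusted to the ranges actually needed:

```python
import pandas as pd

# result of the grouping step above
new_frame = pd.DataFrame({'id': [3, 7, 7], 'count': [3, 1, 1]})

# hypothetical bin edges/labels -- adjust to the ranges you need
bins = [0, 7, 12]
labels = ['0-7 days', '7-12 days']

summary = (new_frame
           .assign(bucket=pd.cut(new_frame['count'], bins=bins, labels=labels))
           .groupby('bucket', observed=False)['id'].count())
print(summary)
# 0-7 days     3
# 7-12 days    0
```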

How can the Pandas datetime object be used dynamically?

Given a dataframe df with only one column (consisting of datetime values that can be repeated). e.g:
date
2017-09-17
2017-09-17
2017-09-22
2017-11-04
2017-11-15
and df.dtypes is date datetime64[ns].
How can I create a new dataframe, derived from the existing one, that has a second column with the number of observations for each month of each year?
The result for the above example would be something like:
date     observations
2017-09  3
2017-11  2
You can do:
(df['date'].dt.to_period('M')         # change date to Month
   .value_counts()                    # count the Month
   .reset_index(name='observations')  # make dataframe
)
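A self-contained version of the above, with a sort_index so the months come out in chronological order (reset_index naming varies a little across pandas versions, hence the explicit rename_axis):

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(
    ['2017-09-17', '2017-09-17', '2017-09-22', '2017-11-04', '2017-11-15'])})

out = (df['date'].dt.to_period('M')   # Timestamp('2017-09-17') -> Period('2017-09')
         .value_counts()              # observations per month
         .sort_index()                # chronological order
         .rename_axis('date')
         .reset_index(name='observations'))
print(out)
#       date  observations
# 0  2017-09             3
# 1  2017-11             2
```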

Pandas: Dynamically Find Date of Current Week's Sunday and Place in Empty Cell in Dataframe

I have a pandas data frame that does counts by week. The beginning of week is always a Monday and the end is the corresponding Sunday.
Below is my sample data frame:
Week_Start_Date (Mon) Week_End_Date (Sun) Count
2018-08-20 35
2018-08-13 2018-08-19 40
I want to fill the blank cell (date associated with current Sunday) with the Sunday associated with the current week. I want this to be dynamic because the weeks will be changing.
Two questions:
Q1) How do I find the date of the Sunday associated with current week?
Q2) How do I place that date in the missing cell? Positionally, the missing cell will always be 2nd column, 1st row.
I have scoured Google and stackoverflow for some direction but couldn't find anything.
First convert to datetime. Then use fillna with your start date incremented by 6 days:
cols = ['Week_Start_Date', 'Week_End_Date']
df[cols] = df[cols].apply(pd.to_datetime, errors='coerce')
df['Week_End_Date'] = df['Week_End_Date'].fillna(df['Week_Start_Date'] + pd.DateOffset(days=6))
print(df)
Week_Start_Date Week_End_Date Count
0 2018-08-20 2018-08-26 35
1 2018-08-13 2018-08-19 40
If the 6-day increment is always true, you don't even need fillna:
df['Week_End_Date'] = df['Week_Start_Date'] + pd.DateOffset(days=6)
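For Q1 specifically (the date of the current week's Sunday, independent of any dataframe, with weeks running Monday to Sunday), one sketch:

```python
import pandas as pd

today = pd.Timestamp.today().normalize()
# dayofweek: Monday=0 ... Sunday=6, so stepping back to Monday
# and forward 6 days lands on this week's Sunday
this_sunday = today - pd.Timedelta(days=today.dayofweek) + pd.Timedelta(days=6)
print(this_sunday.date())
```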
