I have sales information for different types of parts with different durations. I want to take the difference in months when my dates are in 'YYYYMM' format.
I have tried this.
data.YYYYMM.max() - data.YYYYMM.min()
which gives me the difference in days. How can I get this difference in months?
You can convert the column with to_datetime and then to_period:
df = pd.DataFrame({'YYYYMM':['201505','201506','201508','201510']})
print (df)
YYYYMM
0 201505
1 201506
2 201508
3 201510
df['YYYYMM'] = pd.to_datetime(df['YYYYMM'], format='%Y%m').dt.to_period('M')
a = df.YYYYMM.max() - df.YYYYMM.min()
print (a)
5
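Note: in recent pandas versions, subtracting two Periods returns a DateOffset (e.g. <5 * MonthEnds>) rather than a plain integer. A minimal sketch of getting the integer back, assuming the converted frame above:
diff = df.YYYYMM.max() - df.YYYYMM.min()
# older pandas returns the int directly; newer versions return a
# DateOffset whose .n attribute holds the number of months
months = diff if isinstance(diff, int) else diff.n
print(months)  # 5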
I have a database with a column describing the dates when certain data were collected. However, the dates were inserted as MM-DD (e.g., Jul-13) and they are coded as strings.
ids = pd.Series([1, 2, 3, 4])
dates = pd.Series(["Jul-29", "Jul-29", "Dec-29", "Apr-22"])
df = pd.DataFrame({"ids" : ids, "dates" : dates})
ids dates
0 1 Jul-29
1 2 Jul-29
2 3 Dec-29
3 4 Apr-22
I would like to insert the year in these dates before converting to date based on a condition. I know that data from December belongs to 2021, whereas the rest of the data was collected in 2022. Therefore I need something like this:
ids dates corrected_dates
0 1 Jul-29 Jul-29-2022
1 2 Jul-29 Jul-29-2022
2 3 Dec-29 Dec-29-2021
3 4 Apr-22 Apr-22-2022
I have tried:
df["corrected_dates"] = np.where("Dec" in df["dates"], df["dates"] + "-2021", df["dates"] + "-2022")
but this resulted in
ids dates corrected_dates
0 1 Jul-29 Jul-29-2022
1 2 Jul-29 Jul-29-2022
2 3 Dec-29 Dec-29-2022
3 4 Apr-22 Apr-22-2022
Therefore, I am probably not coding the conditional properly but I can't find out what I am doing wrong.
I was able to insert the year in a new column by doing
corrected_dates = []
for date in df["dates"]:
    if "Dec" in date:
        new_date = date + "-2021"
    else:
        new_date = date + "-2022"
    corrected_dates.append(new_date)
and then df["corrected_dates"] = corrected_dates, but this seems too cumbersome (not to mention that I am not sure this would work if there were missing data in df["dates"]).
Can anyone help me understand what I am doing wrong when using np.where() or suggest a better alternative than using a for loop?
Thanks
Let us do str.startswith. Your attempt fails because "Dec" in df["dates"] is a plain Python membership test against the Series, which checks the index rather than the values; it therefore evaluates to a single False, and np.where broadcasts that scalar to every row:
df['new'] = np.where(df["dates"].str.startswith('Dec'), df["dates"] + "-2021", df["dates"] + "-2022")
df
Out[19]:
ids dates new
0 1 Jul-29 Jul-29-2022
1 2 Jul-29 Jul-29-2022
2 3 Dec-29 Dec-29-2021
3 4 Apr-22 Apr-22-2022
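If df["dates"] can contain missing values, note that str.startswith accepts na=False so NaNs fall into the else branch instead of propagating. And once the year is in place, the strings can be parsed into real datetimes; a minimal sketch building on the answer above (the format string is an assumption about the 'Jul-29-2022' layout):
mask = df['dates'].str.startswith('Dec', na=False)  # NaN counts as "not December"
df['new'] = np.where(mask, df['dates'] + '-2021', df['dates'] + '-2022')
df['new'] = pd.to_datetime(df['new'], format='%b-%d-%Y')  # '%b' matches 'Jul', 'Dec', ...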
I have a large dataset (df) with lots of columns and I am trying to get the total number of records for each day.
   datetime    id  col3  col4  col...
1  11-11-2020  7   col3  col4  col...
2  10-11-2020  5   col3  col4  col...
3  09-11-2020  5   col3  col4  col...
4  10-11-2020  4   col3  col4  col...
5  10-11-2020  4   col3  col4  col...
6  07-11-2020  4   col3  col4  col...
I want my result to be something like this:
   datetime    id  col3  col4  col...  Count
6  07-11-2020  4   col3  col4  col...  1
3  09-11-2020  5   col3  col4  col...  1
2  10-11-2020  5   col3  col4  col...  1
4  10-11-2020  4   col3  col4  col...  2
1  11-11-2020  7   col3  col4  col...  1
I tried to use resample like this: df = df.groupby(['id','col3', pd.Grouper(key='datetime', freq='D')]).sum().reset_index(), and below is my result. I am still new to programming and Pandas; I have read the pandas docs but am still unable to do it.
   datetime    id  col3  col4  col...
6  07-11-2020  4   col3  1     0.0
3  07-11-2020  5   col3  1     0.0
2  10-11-2020  5   col3  1     0.0
4  10-11-2020  4   col3  2     0.0
1  11-11-2020  7   col3  1     0.0
try this:
df = df.groupby(['datetime','id','col3']).count()
If you want the count values for all columns based only on the date, then:
df.groupby('datetime').count()
And you'll get a DataFrame that has the datetime as the index, with the cells in each column giving the number of entries for that index.
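If you want the count as an ordinary column named Count, a minimal sketch (assuming df from the question): .size() counts rows per group, which sidesteps the per-column counts that .count() produces.
counts = df.groupby(['datetime', 'id']).size().reset_index(name='Count')
print(counts)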
I have a dataframe with more than 4 million rows and 30 columns. I am just providing a sample of my patient dataframe
df = pd.DataFrame({
'subject_ID':[1,1,1,1,1,2,2,2,2,2,3,3,3],
'date_visit':['1/1/2020 12:35:21','1/1/2020 14:35:32','1/1/2020 16:21:20','01/02/2020 15:12:37','01/03/2020 16:32:12',
'1/1/2020 12:35:21','1/3/2020 14:35:32','1/8/2020 16:21:20','01/09/2020 15:12:37','01/10/2020 16:32:12',
'11/01/2022 13:02:31','13/01/2023 17:12:31','16/01/2023 19:22:31'],
'item_name':['PEEP','Fio2','PEEP','Fio2','PEEP','PEEP','PEEP','PEEP','PEEP','PEEP','Fio2','Fio2','Fio2']})
I would like to do two things
1) Find the subjects and the dates that are missing from their visit sequence
2) Get the count of item_name for each subjects
For q2, this is what I tried
df.groupby(['subject_ID','item_name']).count() # though this produces output, the column name is not okay. I mean, why does it show the count value in the `date_visit` column?
For q1, this is what I am trying
df['day'].le(df['shift_date'].add(1))
I expect my output to be like the one shown below.
You can get the first part with:
In [14]: df.groupby("subject_ID")['item_name'].value_counts().unstack(fill_value=0)
Out[14]:
item_name Fio2 PEEP
subject_ID
1 2 3
2 0 5
3 3 0
EDIT:
I think you've still got your date formats a bit messed up in your sample output, and strongly recommend switching everything to the ISO 8601 standard since that prevents problems like that down the road. pandas won't correctly parse that 11/01/2022 entry on its own, so I've manually fixed it in the sample.
Using what I assume these dates are supposed to be, you can find the gaps by grouping and using .resample():
In [73]: df['dates'] = pd.to_datetime(df['date_visit'])
In [74]: df.loc[10, 'dates'] = pd.to_datetime("2022-01-11 13:02:31")
In [75]: dates = df.groupby("subject_ID").apply(lambda x: x.set_index('dates').resample('D').first())
In [76]: dates.index[dates.isnull().any(axis=1)].to_frame().reset_index(drop=True)
Out[76]:
subject_ID dates
0 2 2020-01-02
1 2 2020-01-04
2 2 2020-01-05
3 2 2020-01-06
4 2 2020-01-07
5 3 2022-01-12
6 3 2022-01-14
7 3 2022-01-15
You can then add a sequence-status flag to that first frame by checking whether each subject_ID shows up in this new frame.
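For example, a minimal sketch of such a flag (seq_complete is a hypothetical column name), using dates from above:
missing = dates.index[dates.isnull().any(axis=1)].to_frame().reset_index(drop=True)
# subjects appearing in `missing` have at least one day without a visit
df['seq_complete'] = ~df['subject_ID'].isin(missing['subject_ID'])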
I have a pandas dataframe where date and hour are in two different columns, as shown below -
I want to combine these two columns into a new datetime column on which I can apply pandas window/shift functions. Please share your views.
date hour
0 20190409 0
1 20190409 0
2 20190409 0
3 20190409 0
4 20190409 0
Use pandas.to_datetime and pd.to_timedelta and add them together:
df['datetime'] = pd.to_datetime(df['date'], format='%Y%m%d') + pd.to_timedelta(df['hour'], unit='h')
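Applied to the sample data (a minimal sketch; the date column is assumed to hold values like 20190409):
import pandas as pd

df = pd.DataFrame({'date': [20190409] * 5, 'hour': [0, 0, 0, 0, 0]})
df['datetime'] = pd.to_datetime(df['date'], format='%Y%m%d') + pd.to_timedelta(df['hour'], unit='h')
print(df['datetime'].head())  # 2019-04-09 00:00:00 for every row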
Given the following example DataFrame:
>>> df
Times Values
0 05/10/2017 01:01:03 1
1 05/10/2017 01:05:00 2
2 05/10/2017 01:06:10 3
3 05/11/2017 08:25:20 4
4 05/11/2017 08:30:14 5
5 05/11/2017 08:30:35 6
I want to subset this DataFrame by the 'Times' column, by matching a partial string up to the hour. For example, I want to subset using the partial strings "05/10/2017 01:" and "05/11/2017 08:", which breaks the data up into two new data frames:
>>> df1
Times Values
0 05/10/2017 01:01:03 1
1 05/10/2017 01:05:00 2
2 05/10/2017 01:06:10 3
and
>>> df2
0 05/11/2017 08:25:20 4
1 05/11/2017 08:30:14 5
2 05/11/2017 08:30:35 6
Is it possible to make this subset iterative in Pandas, for multiple dates/times that similarly have the date/hour as the common identifier?
First, cast your Times column into a datetime format, and set it as the index:
df['Times'] = pd.to_datetime(df['Times'])
df.set_index('Times', inplace = True)
Then use the groupby method with a time-based Grouper (pd.TimeGrouper in older pandas; it has since been removed in favor of pd.Grouper):
g = df.groupby(pd.Grouper(freq='h'))
g is iterable and yields tuple pairs of times and sub-dataframes of those times. If you just want the sub-dfs, you can do list(zip(*g))[1] (zip returns a lazy iterator in Python 3).
A caveat: the sub-dfs are indexed by the timestamp, and a frequency-based pd.Grouper only works when the times are the index. If you want to have the timestamp as a column, you could instead do:
df['Times'] = pd.to_datetime(df['Times'])
df['time_hour'] = df['Times'].dt.floor('1h')
g = df.groupby('time_hour')
Alternatively, you could just call .reset_index() on each of the dfs from the former method, but this will probably be much slower.
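For instance, a minimal sketch of consuming the iterator directly, using g from above:
for hour, sub_df in g:
    if not sub_df.empty:  # skip the empty hourly bins a frequency Grouper can produce
        print(hour, len(sub_df))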
Convert Times to an hourly period, group by it, and then extract each group as a DataFrame:
df1, df2 = [g.drop(columns='hour')
            for n, g in df.assign(hour=pd.DatetimeIndex(df.Times).to_period('h'))
                          .groupby('hour')]
df1
Out[874]:
Times Values
0 2017-05-10 01:01:03 1
1 2017-05-10 01:05:00 2
2 2017-05-10 01:06:10 3
df2
Out[875]:
Times Values
3 2017-05-11 08:25:20 4
4 2017-05-11 08:30:14 5
5 2017-05-11 08:30:35 6
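If the number of hourly groups isn't known up front, the same pattern can collect them into a list instead of unpacking into fixed names (a sketch under the same assumptions):
subsets = [g.drop(columns='hour')
           for _, g in df.assign(hour=pd.DatetimeIndex(df.Times).to_period('h')).groupby('hour')]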
First, make sure that the Times column is of datetime type.
Second, set the Times column as the index.
Third, use the between_time method.
df['Times'] = pd.to_datetime(df['Times'])
df.set_index('Times', inplace=True)
df1 = df.between_time('1:00:00', '1:59:59')
df2 = df.between_time('8:00:00', '8:59:59')
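One caveat (an assumption about the intent here): between_time filters by time of day across all dates, so this splits cleanly only because the two dates in the sample use different hours. To key on the date and hour together, you could floor the index instead:
# selects rows whose timestamp falls within the 05/10 01:00 hour specifically
df1 = df[df.index.floor('h') == pd.Timestamp('2017-05-10 01:00:00')]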
If you use the datetime type you can extract things like hours and days.
times = pd.to_datetime(df['Times'])
hours = times.dt.hour
df1 = df[hours == 1]
You can use the str[] accessor to truncate the string representation of your date (you might have to cast with astype(str) if your column is a datetime), and then use groupby.groups to access the dataframes as a dictionary where the keys are your truncated date values:
>>> df.groupby(df.Times.astype(str).str[0:13]).groups
{'2017-05-10 01': DatetimeIndex(['2017-05-10 01:01:03', '2017-05-10 01:05:00',
'2017-05-10 01:06:10'],
dtype='datetime64[ns]', name='time', freq=None),
'2017-05-11 08': DatetimeIndex(['2017-05-11 08:25:20', '2017-05-11 08:30:14',
'2017-05-11 08:30:35'],
dtype='datetime64[ns]', name='time', freq=None)}
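groupby.groups only gives you the row labels; a minimal sketch of materializing the sub-frames themselves, under the same truncation:
subsets = {key: grp for key, grp in df.groupby(df.Times.astype(str).str[0:13])}
df1 = subsets['2017-05-10 01']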