Add missing dates to datetime column in Pandas using last value - python

I've already checked out Add missing dates to pandas dataframe, but I don't want to fill in the new dates with a generic value.
My dataframe looks more or less like this:
date (dd/mm/yyyy)  value
01/01/2000         a
02/01/2000         b
03/01/2000         c
06/01/2000         d
So in this example, days 04/01/2000 and 05/01/2000 are missing. What I want to do is to insert them before the 6th, with a value of c, the last value before the missing days. So the "correct" df should look like:
date (dd/mm/yyyy)  value
01/01/2000         a
02/01/2000         b
03/01/2000         c
04/01/2000         c
05/01/2000         c
06/01/2000         d
There are multiple instances of missing dates, and it's a large df (~9000 rows).
Thanks for your time! :)

Try this:
# If your date format is dayfirst, then use the following code
df['date (dd/mm/yyyy)'] = pd.to_datetime(df['date (dd/mm/yyyy)'], dayfirst=True)
out = df.set_index('date (dd/mm/yyyy)').asfreq('D', method='ffill').reset_index()
print(out)
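A quick self-contained check of this approach on the question's data (column renamed to `date` for brevity; `asfreq('D', method='ffill')` reindexes to a daily frequency and forward-fills the gaps):

```python
import pandas as pd

df = pd.DataFrame({'date': ['01/01/2000', '02/01/2000', '03/01/2000', '06/01/2000'],
                   'value': ['a', 'b', 'c', 'd']})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
# Reindex to daily frequency; missing days take the previous day's value
out = df.set_index('date').asfreq('D', method='ffill').reset_index()
```

This produces six rows, with 04/01 and 05/01 carrying the value c.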

Assuming that your dates are drawn at a regular frequency, you can generate a DatetimeIndex with date_range, filter out the dates already present in your date column, create a dataframe with NaN in the value column to concatenate, and then fill the NaNs with the forward fill method.
# assuming your dataframe is df, with a datetime 'date' column:
import numpy as np
import pandas as pd

all_dates = pd.date_range(start=df.date.min(), end=df.date.max(), freq='D')  # daily frequency, not 'M'
known_dates = set(df.date.to_list())  # set membership tests are much faster than a list
unknown_dates = all_dates[~all_dates.isin(known_dates)]
df2 = pd.DataFrame({'date': unknown_dates})
df2['value'] = np.nan
df = pd.concat([df, df2])
df = df.sort_values('date').ffill()  # sort by date (not value) before forward-filling
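Putting those steps together, a self-contained sketch of the date_range/concat approach on the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2000-01-01', '2000-01-02',
                                           '2000-01-03', '2000-01-06']),
                   'value': ['a', 'b', 'c', 'd']})
# Every day between the first and last known date
all_dates = pd.date_range(start=df.date.min(), end=df.date.max(), freq='D')
known_dates = set(df.date.to_list())
unknown_dates = all_dates[~all_dates.isin(known_dates)]
# Rows for the missing days, with NaN values to be forward-filled
df2 = pd.DataFrame({'date': unknown_dates, 'value': np.nan})
df = pd.concat([df, df2]).sort_values('date').ffill().reset_index(drop=True)
```

The missing 4th and 5th of January pick up the last seen value, c.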

pandas increment row based on how many times a date is in a dataframe

I have this list, for example: dates = ["2020-2-1", "2020-2-3", "2020-5-8"]. Now I want to make a dataframe which contains only the month and year, plus a count of how many times each appeared. The output should look like:
Date    Count
2020-2  2
2020-5  1
Shorter code (assuming df['dates'] has already been converted with pd.to_datetime):
df['month_year'] = df['dates'].dt.to_period('M')
df1 = df.groupby('month_year')['dates'].count().reset_index(name="count")
print(df1)
  month_year  count
0    2020-02      2
1    2020-05      1
import pandas as pd
dates = ["2020-2-1", "2020-2-3", "2020-5-8"]
df = pd.DataFrame({'Date': dates})
df['Date'] = df['Date'].str.slice(0, 6)
df['Count'] = 1
df = df.groupby('Date').sum().reset_index()
Note: you might want to use the format "2020-02-01" with zero-padded months and days; the year and month are then always the first 7 characters, so you would use str.slice(0, 7) instead.
This will give you separate "Month" and "Year" columns with the count for each year/month pair. You could combine the month and year columns afterwards, but this gives you the results you expect, if not exactly cleaned up:
df = pd.DataFrame({'Column1' : ["2020-2-1", "2020-2-3", "2020-5-8"]})
df['Month'] = pd.to_datetime(df['Column1']).dt.month
df['Year'] = pd.to_datetime(df['Column1']).dt.year
df.groupby(['Month', 'Year']).agg('count').reset_index()
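The to_period/groupby approach shown above, as one self-contained sketch starting from the question's list of strings:

```python
import pandas as pd

dates = ["2020-2-1", "2020-2-3", "2020-5-8"]
df = pd.DataFrame({'Date': pd.to_datetime(dates)})
# Collapse each date to its year-month period, then count per period
df['month_year'] = df['Date'].dt.to_period('M')
out = df.groupby('month_year')['Date'].count().reset_index(name='count')
```

`to_period('M')` handles the zero-padding concern automatically, since it works on parsed datetimes rather than strings.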

pandas possible bug with groupby and resample

I am a newbie in pandas and seeking advice: is this a possible bug?
I have a dataframe with a non-unique datetime index. col1 is a group variable, col2 holds values.
I want to resample the hourly values to years, grouping by the group variable. I do this with this command:
df_resample = df.groupby('col1').resample('Y').mean()
This works fine and creates a multiindex of col1 and the datetime index, where col1 is now NOT a column in the dataframe.
However, if I change mean() to max() this is not the case. Then col1 is part of the multiindex, but the column is still present in the dataframe.
Isn't this a bug?
Sorry, but I don't know how to present dummy data as a dataframe in this post.
Edit:
code example:
from datetime import datetime, timedelta
import pandas as pd
data = {'category':['A', 'B', 'C'],
'value_hour':[1,2,3]}
days = pd.date_range(datetime.now(), datetime.now() + timedelta(2), freq='D')
df = pd.DataFrame(data, index=days)
df_mean = df.groupby('category').resample('Y').mean()
df_max = df.groupby('category').resample('Y').max()
print(df_mean, df_max)
                     value_hour
category
A        2021-12-31         1.0
B        2021-12-31         2.0
C        2021-12-31         3.0

                    category  value_hour
category
A        2021-12-31        A           1
B        2021-12-31        B           2
C        2021-12-31        C           3
Trying to drop the category column from df_max gives a KeyError:
df_max.drop('category')
File "C:\Users\mav\Anaconda3\envs\EWDpy\lib\site-packages\pandas\core\indexes\base.py", line 3363, in get_loc
raise KeyError(key) from err
KeyError: 'category'
Concerning the KeyError: the problem is that you are trying to drop a "category" row instead of the column.
When using drop to remove a column you should add axis=1, as in the following code:
df_max.drop('category', axis=1)
axis=1 indicates you are looking at the columns.
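A minimal sketch of the difference, using the column names from the question (the dataframe here is a simplified stand-in, not the resampled one):

```python
import pandas as pd

df = pd.DataFrame({'category': ['A', 'B'], 'value_hour': [1, 2]})
out = df.drop('category', axis=1)  # drops the column; df itself is untouched
# df.drop('category') would look for a ROW labelled 'category' and raise KeyError
```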

Python - fill NaN by range of date

I have a dataframe where:
one column is a Date column.
another column is X, which has missing values.
I want to fill column X for a specific range of dates.
So far I got to this code:
df[df['Date'] < datetime.date(2017,1,1)]['X'].fillna(1,inplace=True)
But it does not work: I am not getting an error, yet the data isn't filled.
Also, it looks messy; maybe there is a better way.
Thanks for the help.
First, you need to create your data frame:
import pandas as pd
df = pd.DataFrame({'Date': ['2016-01-01', '2018-01-01']})
df['Date'] = pd.to_datetime(df['Date'])
Next, you can conditionally set the column value with .loc (your original attempt used chained indexing, which operates on a temporary copy, so the fillna never reached df):
df.loc[df['Date'] < '2017-01-01','X'] = 1
The result would be like this:
Date X
0 2016-01-01 1.0
1 2018-01-01 NaN
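If X already holds some values and you only want to fill the NaNs inside the date range (rather than overwrite everything in it), a sketch with illustrative data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2016-01-01', '2016-06-01', '2018-01-01']),
                   'X': [np.nan, 2.0, np.nan]})
mask = df['Date'] < '2017-01-01'
# fillna only the masked rows; existing values and out-of-range NaNs are untouched
df.loc[mask, 'X'] = df.loc[mask, 'X'].fillna(1)
```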

subset the pandas dataframe

I have a pandas dataframe:
date        Speed
1986-01-01  0.3
....
2017-03-01  0.4
where date is the index of the dataframe. I want to create a dataframe containing only the data from 1986, 2000 and 2017, without the date index, like:
index  date  speed
1      1986  0.3
....
13     2000  0.5
Assuming your 'date' index is already a datetime dtype:
df.reset_index(inplace = True)
df['date'] = df['date'].dt.year
df = df[df['date'].isin([1986,2000,2017])]
...and if not, add df['date'] = pd.to_datetime(df['date']) after the reset_index
Use the following steps:
# Step one - move the date index into a regular column
df = df.reset_index()
# Step two - convert the date column to datetime type
df['date'] = pd.to_datetime(df['date'])
# Step three - create a year column from the datetime
df['year'] = df.date.dt.year
# Step four - select the target years
df[df.year.isin([1986, 2000, 2017])]
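The steps above, run end to end on stand-in data shaped like the question's (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Speed': [0.3, 0.5, 0.4]},
                  index=pd.to_datetime(['1986-01-01', '2000-06-01', '2017-03-01']))
df.index.name = 'date'
out = df.reset_index()               # date index becomes a column
out['date'] = out['date'].dt.year    # keep only the year
out = out[out['date'].isin([1986, 2000, 2017])].reset_index(drop=True)
```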

How to index DateTime in Pandas dataframe

I have a few columns where some values are datetime objects and some are simply the year. I'd like to index into the datetime values so that, looping through the column, i = i[:4] gives me the year instead of raising the error 'datetime.datetime' object is not subscriptable. Essentially, in a column mixing ints that are already years with datetime instances, I'd like to drop everything from the datetimes except the year.
I am not exactly sure what you are trying to do, and unfortunately I cannot post comments (yet).
From what I can guess, you have a DataFrame with a column of mixed types, and you want to convert all values of type datetime to int. As an example, here is such a DataFrame:
>>> from datetime import datetime
>>> import pandas as pd
>>> data = [[1, 1990], [2, datetime(1991, 1, 1)]]
>>> df = pd.DataFrame(data, columns=['id', 'time'])
>>> df
   id                 time
0   1                 1990
1   2  1991-01-01 00:00:00
You can use map to convert the second column (note the column is named 'time'):
>>> df['time'] = df['time'].map(lambda x: x.year if isinstance(x, datetime) else x)
>>> df
id time
0 1 1990
1 2 1991
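The same steps as a plain script, so the map/isinstance pattern can be run directly:

```python
from datetime import datetime

import pandas as pd

df = pd.DataFrame([[1, 1990], [2, datetime(1991, 1, 1)]], columns=['id', 'time'])
# Replace datetime values with their year; leave plain ints alone
df['time'] = df['time'].map(lambda x: x.year if isinstance(x, datetime) else x)
```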
