Python - fill NaN by range of dates

I have a dataframe where:
one column is a Date column.
another column is X, which has missing values.
I want to fill column X for a specific range of dates.
So far I have gotten to this code:
df[df['Date'] < datetime.date(2017,1,1)]['X'].fillna(1, inplace=True)
But it does not work: I get no error, yet the data isn't filled.
It also looks messy; maybe there is a better way.
Thanks for the help.

First, you need to create your data frame:
import pandas as pd
df = pd.DataFrame({'Date': ['2016-01-01', '2018-01-01']})
df['Date'] = pd.to_datetime(df['Date'])
Next, you can conditionally set the column value:
df.loc[df['Date'] < '2017-01-01','X'] = 1
The result would be like this:
Date X
0 2016-01-01 1.0
1 2018-01-01 NaN
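A self-contained sketch of why the original line fails and how the `.loc` form fixes it (the sample dates and values here are made up for illustration):

```python
import datetime
import pandas as pd

# Sample frame with a Date column and a partially missing X column
df = pd.DataFrame({
    'Date': pd.to_datetime(['2016-01-01', '2016-06-01', '2018-01-01']),
    'X': [float('nan'), 2.0, float('nan')],
})

# Chained indexing (df[mask]['X'].fillna(..., inplace=True)) operates on a
# temporary copy, so the original frame is never modified. Assigning
# through .loc modifies the frame itself:
mask = df['Date'] < pd.Timestamp(datetime.date(2017, 1, 1))
df.loc[mask, 'X'] = df.loc[mask, 'X'].fillna(1)
```

Using `fillna` on the selected slice (rather than assigning a constant) only touches the rows that were actually NaN, so existing values inside the date range are preserved.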

Related

pandas increment row based on how many times a date is in a dataframe

I have this list, for example: dates = ["2020-2-1", "2020-2-3", "2020-5-8"]. Now I want to make a dataframe that contains only the month and year plus a count of how many times each appeared. The output should look like:
Date     Count
2020-2   2
2020-5   1
Shorter code:
df['month_year'] = df['dates'].dt.to_period('M')
df1 = df.groupby('month_year')['dates'].count().reset_index(name="count")
print(df1)
month_year count
0 2020-02 2
1 2020-05 1
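The fragment above assumes a datetime column named `dates` already exists; a self-contained version might look like this:

```python
import pandas as pd

dates = ["2020-2-1", "2020-2-3", "2020-5-8"]
df = pd.DataFrame({'dates': pd.to_datetime(dates)})

# Collapse each date to its year-month period, then count rows per period
df['month_year'] = df['dates'].dt.to_period('M')
df1 = df.groupby('month_year')['dates'].count().reset_index(name="count")
print(df1)
```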
import pandas as pd
dates = ["2020-2-1", "2020-2-3", "2020-5-8"]
df = pd.DataFrame({'Date': dates})
# Keep only the "year-month" prefix of each string (assumes single-digit
# months; see the note below about zero-padding)
df['Date'] = df['Date'].str.slice(0, 6)
df['Count'] = 1
df = df.groupby('Date').sum().reset_index()
Note: you might want to use zero-padded dates like "2020-02-01" so that the first 7 characters are always the year and month (and slice (0, 7) accordingly).
This will give you a "Month" and a "Year" column along with the count for each year/month pair.
If you want, you can combine the month and year columns afterwards, but this gives the results you expect, if not fully cleaned up.
import pandas as pd
df = pd.DataFrame({'Column1': ["2020-2-1", "2020-2-3", "2020-5-8"]})
df['Month'] = pd.to_datetime(df['Column1']).dt.month
df['Year'] = pd.to_datetime(df['Column1']).dt.year
df.groupby(['Month', 'Year']).agg('count').reset_index()

Add missing dates to datetime column in Pandas using last value

I've already checked out Add missing dates to pandas dataframe, but I don't want to fill in the new dates with a generic value.
My dataframe looks more or less like this:
date (dd/mm/yyyy)   value
01/01/2000          a
02/01/2000          b
03/01/2000          c
06/01/2000          d
So in this example, days 04/01/2000 and 05/01/2000 are missing. What I want to do is to insert them before the 6th, with a value of c, the last value before the missing days. So the "correct" df should look like:
date (dd/mm/yyyy)   value
01/01/2000          a
02/01/2000          b
03/01/2000          c
04/01/2000          c
05/01/2000          c
06/01/2000          d
There are multiple instances of missing dates, and it's a large df (~9000 rows).
Thanks for your time! :)
try this:
# If your date format is dayfirst, then use the following code
df['date (dd/mm/yyyy)'] = pd.to_datetime(df['date (dd/mm/yyyy)'], dayfirst=True)
out = df.set_index('date (dd/mm/yyyy)').asfreq('D', method='ffill').reset_index()
print(out)
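Applied to the question's sample data, the whole thing might look like this (column names taken from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'date (dd/mm/yyyy)': ['01/01/2000', '02/01/2000', '03/01/2000', '06/01/2000'],
    'value': ['a', 'b', 'c', 'd'],
})
df['date (dd/mm/yyyy)'] = pd.to_datetime(df['date (dd/mm/yyyy)'], dayfirst=True)

# asfreq('D') reindexes to a complete daily index; method='ffill' copies the
# last known value into each newly created row
out = df.set_index('date (dd/mm/yyyy)').asfreq('D', method='ffill').reset_index()
print(out)
```

After this, the two missing days both carry the value `c`, as the question asked.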
Assuming your dates fall on a regular frequency, you can generate a pd.DatetimeIndex with date_range, filter out the dates already present in your date column, create a dataframe with NaN in the value column, concatenate it, and fill the NaNs with a forward fill.
# assuming your dataframe is df:
import numpy as np
all_dates = pd.date_range(start=df.date.min(), end=df.date.max(), freq='D')
known_dates = set(df.date.to_list())  # set membership checks are much faster than a list
unknown_dates = all_dates[~all_dates.isin(known_dates)]
df2 = pd.DataFrame({'date': unknown_dates})
df2['value'] = np.nan
df = pd.concat([df, df2])
df = df.sort_values('date').ffill()

pandas possible bug with groupby and resample

I am a newbie in pandas and am seeking advice on whether this is a possible bug.
I have a dataframe with a non-unique datetime index. col1 is a group variable, col2 holds values.
I want to resample the hourly values to years, grouping by the group variable. I do this with this command:
df_resample = df.groupby('col1').resample('Y').mean()
This works fine and creates a MultiIndex of col1 and the datetime index, where col1 is now NOT a column in the dataframe.
However, if I change mean() to max(), this is not the case: col1 is part of the MultiIndex, but the column is still present in the dataframe.
Isn't this a bug?
Sorry, but I don't know how to present dummy data as a dataframe in this post.
Edit:
code example:
from datetime import datetime, timedelta
import pandas as pd
data = {'category': ['A', 'B', 'C'],
        'value_hour': [1, 2, 3]}
days = pd.date_range(datetime.now(), datetime.now() + timedelta(2), freq='D')
df = pd.DataFrame(data, index=days)
df_mean = df.groupby('category').resample('Y').mean()
df_max = df.groupby('category').resample('Y').max()
print(df_mean, df_max)
category value_hour
A 2021-12-31 1.0
B 2021-12-31 2.0
C 2021-12-31 3.0
category category value_hour
A 2021-12-31 A 1
B 2021-12-31 B 2
C 2021-12-31 C 3
Trying to drop the category column from df_max gives a KeyError:
df_max.drop('category')
File "C:\Users\mav\Anaconda3\envs\EWDpy\lib\site-packages\pandas\core\indexes\base.py", line 3363, in get_loc
raise KeyError(key) from err
KeyError: 'category'
Concerning the KeyError: the problem is that you are trying to drop a "category" row instead of the column.
When using drop to remove a column, you should pass axis=1, as in the following code:
df_max.drop('category', axis=1)
axis=1 indicates you are operating on the columns.
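A minimal sketch of the difference between the two axes (data shortened from the example above):

```python
import pandas as pd

df = pd.DataFrame({'category': ['A', 'B'], 'value_hour': [1, 2]})

# axis=0 (the default) looks for 'category' among the row labels, which is
# why df.drop('category') raises a KeyError here.
# axis=1 looks among the column labels instead:
dropped = df.drop('category', axis=1)   # equivalent: df.drop(columns='category')
print(dropped.columns.tolist())
```

Note that drop returns a new dataframe; the original `df` keeps its `category` column unless you reassign or pass `inplace=True`.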

Bug/Feature for pandas where a multi-indexed dataframe filtered by date returns all the unfiltered dates when extracting the date index level

This is easiest to explain by code, so here goes - imagine the commands in ipython/jupyter notebooks:
from io import StringIO
import pandas as pd
test = StringIO("""Date,Ticker,x,y
2008-10-23,A,0,10
2008-10-23,B,1,11
2008-10-24,A,2,12
2008-10-24,B,3,13
2008-10-25,A,4,14
2008-10-25,B,5,15
2008-10-26,A,6,16
2008-10-26,B,7,17
""")
# Multi-index by Date and Ticker
df = pd.read_csv(test, index_col=[0, 1], parse_dates=True)
df
# Output to the command line
x y
Date Ticker
2008-10-23 A 0 10
B 1 11
2008-10-24 A 2 12
B 3 13
2008-10-25 A 4 14
B 5 15
2008-10-26 A 6 16
B 7 17
ts = pd.Timestamp(2008, 10, 25)
# Filter the data by Date >= ts
filtered_df = df.loc[ts:]
# output the filtered data
filtered_df
x y
Date Ticker
2008-10-25 A 4 14
B 5 15
2008-10-26 A 6 16
B 7 17
# Get all the level 0 data (i.e. the dates) in the filtered dataframe
dates = filtered_df.index.levels[0]
# output the dates in the filtered dataframe:
dates
DatetimeIndex(['2008-10-23', '2008-10-24', '2008-10-25', '2008-10-26'], dtype='datetime64[ns]', name='Date', freq=None)
# WTF!!!??? This was ALL of the dates in the original dataframe - I asked for the dates in the filtered dataframe!
# The correct output should have been:
DatetimeIndex(['2008-10-25', '2008-10-26'], dtype='datetime64[ns]', name='Date', freq=None)
So clearly, when one filters a multi-indexed dataframe, the filtered dataframe's index retains all of the index values of the original, and only the visible ones appear when viewing the dataframe itself. When accessing data by index level, however, the entire index, including the invisible entries, is used, which is what produced the surprising result in the operation above.
This is actually explained in the MultiIndex's User Guide (emphasis added):
The MultiIndex keeps all the defined levels of an index, even if they are not actually used. When slicing an index, you may notice this. ... This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only the used levels, you can use the get_level_values() method.
In your case:
>>> filtered_df.index.get_level_values(0)
DatetimeIndex(['2008-10-25', '2008-10-25', '2008-10-26', '2008-10-26'], dtype='datetime64[ns]', name='Date', freq=None)
This contains only the dates actually present in the filtered dataframe (each repeated once per ticker); append .unique() if you want each date only once.
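If you want each date to appear only once, you can either deduplicate the level values or prune the stale level entries with remove_unused_levels(); a sketch using a trimmed version of the question's data:

```python
from io import StringIO
import pandas as pd

test = StringIO("""Date,Ticker,x,y
2008-10-23,A,0,10
2008-10-23,B,1,11
2008-10-25,A,4,14
2008-10-25,B,5,15
2008-10-26,A,6,16
2008-10-26,B,7,17
""")
df = pd.read_csv(test, index_col=[0, 1], parse_dates=True)
filtered_df = df.loc[pd.Timestamp(2008, 10, 25):]

# Option 1: take the level values actually present, then deduplicate
dates_a = filtered_df.index.get_level_values(0).unique()
# Option 2: rebuild the index levels so the stale 2008-10-23 entry is dropped
dates_b = filtered_df.index.remove_unused_levels().levels[0]
```

Both give a DatetimeIndex containing just 2008-10-25 and 2008-10-26.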

Pandas series inserted into dataframe are read as NaN

I'm finding that when I add a series, based on the same time period, to an existing dataframe, it gets imported as NaNs. The dataframe has a Field column, but I don't understand why that should change anything. To see the steps of my code, you can review the attached image. I hope someone can help!
Illustration showing the dataframe the series is inserted into and how it gets read as NaN
Assuming that the value in the Field Index column is "actual" for every row, a solution could be the following:
test.reset_index().set_index('Date').assign(m1=m1)
That solution works, but it can be done more concisely:
import pandas as pd
days = pd.to_datetime(['2018-01-31', '2018-02-28', '2018-03-31'])
df = pd.DataFrame({'Field': ['Actual']*3, 'Date': days, 'Val': [1, 2, 3]}).set_index(['Field', 'Date'])
m1 = pd.Series([0, 2, 4], index=days)
df.reset_index(level='Field').assign(m1=m1)
Field Val m1
Date
2018-01-31 Actual 1 0
2018-02-28 Actual 2 2
2018-03-31 Actual 3 4
btw, that would be a nice mcve
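A sketch of what is likely going on (using the names from the answer above): assignment aligns on the index, and a date-indexed series finds no matching labels in a (Field, Date) MultiIndex, so every value comes through as NaN.

```python
import pandas as pd

days = pd.to_datetime(['2018-01-31', '2018-02-28', '2018-03-31'])
df = pd.DataFrame({'Field': ['Actual'] * 3, 'Date': days,
                   'Val': [1, 2, 3]}).set_index(['Field', 'Date'])
m1 = pd.Series([0, 2, 4], index=days)

# The frame's labels are ('Actual', date) tuples, the series' labels are
# plain dates: nothing matches, so the column is all NaN
broken = df.assign(m1=m1)

# Moving 'Field' out of the index leaves a plain date index on both sides,
# so the values align as intended
fixed = df.reset_index(level='Field').assign(m1=m1)
```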
