Pandas dataframe apply to datetime column: does not work - python

I have a dataframe with a datetime column. I want to apply a function that sets values in this column to None if the date is earlier than another date. But the applied function sets all my values to None. Could you help me?
Here is my code:
import datetime

dateused = datetime.datetime.strptime('202004', '%Y%m')
df['date_pack'] = df['date_pack'].apply(lambda x: None if x < dateused else x)
The dtype of df['date_pack'] is datetime64[ns].
After this, all my values in my column 'date_pack' are None.
Thanks

I think you need Series.mask, which sets the masked values to NaT, the missing-value marker for datetimes:
import datetime
import pandas as pd

df = pd.DataFrame({'date_pack': pd.to_datetime(['2020-08-10', '2002-02-09'])})
dateused = datetime.datetime.strptime('202004', '%Y%m')
df['date_pack'] = df['date_pack'].mask(df['date_pack'] < dateused)
print (df)
   date_pack
0 2020-08-10
1        NaT
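If you prefer explicit assignment, the same result can be had with .loc (a minimal sketch reusing the sample frame above; assigning pd.NaT rather than None keeps the datetime64[ns] dtype):
df.loc[df['date_pack'] < dateused, 'date_pack'] = pd.NaT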

Related

What is the equivalent function of ts from R language in python? [duplicate]

I have a dataframe with various attributes, including one datetime column. I want to extract one of the attribute columns as a time series indexed by the datetime column. This seemed pretty straightforward, and I can construct time series with random values, as all the pandas docs show... but when I do so from a dataframe, my attribute values all convert to NaN.
Here's an analogous example.
df = pd.DataFrame({'a': [0, 1],
                   'date': [pd.to_datetime('2017-04-01'),
                            pd.to_datetime('2017-04-02')]})
s = pd.Series(df.a, index=df.date)
In this case, the series will have the correct time series index, but all the values will be NaN.
I can do the series in two steps, as below, but I don't understand why this should be required.
s = pd.Series(df.a)
s.index = df.date
What am I missing? I assume it has to do with series references, but don't understand at all why the values would go to NaN.
I am also able to get it to work by copying the index column.
s = pd.Series(df.a, df.date.copy())
The problem is that pd.Series() uses the index you pass to align (reindex) the data: it looks up each date label in the index of df.a, and since those dates are not present there, every value comes back as NaN.
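You can see the alignment behaviour in isolation (a minimal sketch, independent of the question's data): building a Series from another Series plus an explicit index reindexes by label, and absent labels produce NaN.
import pandas as pd
s0 = pd.Series([0, 1])                  # default integer index 0, 1
print(pd.Series(s0, index=['x', 'y']))  # 'x', 'y' not in s0's index -> both NaN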
You can set the index to the date column and then select the one data column you want. This will return a series with the dates as the index:
import pandas as pd
df = pd.DataFrame({'a': [0, 1],
                   'date': [pd.to_datetime('2017-04-01'),
                            pd.to_datetime('2017-04-02')]})
s = df.set_index('date')['a']
Examining s gives:
In [1]: s
Out[1]:
date
2017-04-01 0
2017-04-02 1
Name: a, dtype: int64
And you can confirm that s is a Series:
In [2]: isinstance(s, pd.Series)
Out[2]: True
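Alternatively, you can strip the index off the data before constructing the Series (a sketch using the same df; a plain NumPy array has no labels, so nothing gets aligned):
s = pd.Series(df.a.values, index=df.date)
Here df.a.values is a bare array, so pandas simply pairs it with the dates instead of reindexing.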

Concatenating dates into a new column in a dataframe

I have a dataframe with a column date of type datetime64[ns].
When I try to create a new column day in the format MM-DD based on the date column, only the first of the two methods below works. Why doesn't the second method work in pandas?
df['day'] = df['date'].dt.strftime('%m-%d')
df['day2'] = str(df['date'].dt.month) + '-' + str(df['date'].dt.day)
Result for one row:
day 01-04
day2 0 1\n1 1\n2 1\n3 1\n4 ...
Types of columns
day object
day2 object
The problem with that solution is that calling str() on df['date'].dt.month stringifies the whole Series (including its index) rather than each element; the correct way is Series.astype:
df['day2'] = df['date'].dt.month.astype(str) + '-' + df['date'].dt.day.astype(str)
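One caveat, not covered above: astype(str) does not zero-pad, so this produces 1-4 where strftime('%m-%d') gives 01-04. A padded variant (a sketch assuming the same date column), using Series.str.zfill:
df['day2'] = (df['date'].dt.month.astype(str).str.zfill(2) + '-' +
              df['date'].dt.day.astype(str).str.zfill(2))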

Find nearest future date from columns

Please help with finding, for each row, the next date from today among the four columns shown below. I have been stuck at this for a while now.
InDate1 InDate2 InDate3 InDate4
284075 2018-03-07 2018-09-07 2019-03-07 2019-01-21
334627 2018-03-07 2018-09-07 2019-03-07 2019-09-07
Using lookup:
For each row, find the column that holds the closest future date:
import pandas as pd
s = (df.apply(pd.to_datetime)  # If not already datetime
       .apply(lambda x: (x - pd.to_datetime('today')).dt.total_seconds())
       .where(lambda x: x.gt(0)).idxmin(1))
print(s)
#284075 InDate3
#334627 InDate3
#dtype: object
Then lookup the values for each row:
df.lookup(s.index, s)
#array(['2019-03-07', '2019-03-07'], dtype=object)
To elaborate on what this does, you can look at what each part does separately.
First, determine the difference in time between your DataFrame and today. .apply(pd.to_datetime) converts everything to a datetime so it can do arithmetic with the dates, and the second apply finds the time difference, converting it from a timedelta to a number of seconds, which is just a float. (Not sure why a simple df - pd.to_datetime('today') doesn't quite work and the apply is needed.)
s = (df.apply(pd.to_datetime)  # If not already datetime
       .apply(lambda x: (x - pd.to_datetime('today')).dt.total_seconds()))
print(s)
InDate1 InDate2 InDate3 InDate4
284075 -2.769565e+07 -1.179805e+07 3.840347e+06 -4.765262e+04
334627 -2.769565e+07 -1.179805e+07 3.840347e+06 1.973795e+07
Dates in the future will have a positive time difference, so I use .where to keep only the cells that have positive values, replacing everything else with NaN.
s = s.where(lambda x: x.gt(0))
# Could use s.where(s.gt(0)) here since `s` is defined
print(s)
InDate1 InDate2 InDate3 InDate4
284075 NaN NaN 3.840347e+06 NaN
334627 NaN NaN 3.840347e+06 1.973795e+07
Then .idxmin(axis=1) will return the column that has the minimum value (ignoring NaN), for each row (axis=1), which is the closest future date.
s = s.idxmin(1)
print(s)
284075 InDate3
334627 InDate3
dtype: object
Finally, DataFrame.lookup to lookup the original date in that cell is fairly self-explanatory.
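A side note: DataFrame.lookup was deprecated in later pandas releases and eventually removed. Where it is unavailable, an equivalent row-wise pick (a sketch assuming, as above, that s holds one column label per row) is:
import numpy as np
vals = df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(s)]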
Please check this.
First, stack the date values into rows so that we can apply the minimum and the comparison with today.
df1 = df.stack().reset_index()
df1.columns = ["ID", "Field", "Date"]
Then filter the data to dates after today and find the minimum date.
df1 = df1[df1.Date > datetime.datetime.now()].groupby("ID").agg("min").reset_index()
Then pivot the resulting data; before that, assign a static value to Field so there is a single column header instead of InDate1, etc.
df1.Field = "MinValue"
df1 = df1.pivot(index="ID", columns="Field", values="Date").reset_index()
Finally, merge the minimum-date dataframe back into the original dataframe.
df = df.merge(df1, how="left")
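Put together as a runnable sketch (the frame below is made up to mirror the question's layout, with dates shifted relative to today so the filter keeps something; since the IDs sit in the index, they are moved into an ID column before the final merge):
import datetime
import pandas as pd

today = pd.Timestamp('today').normalize()
df = pd.DataFrame({'InDate1': [today - pd.Timedelta(days=300)] * 2,
                   'InDate2': [today - pd.Timedelta(days=120)] * 2,
                   'InDate3': [today + pd.Timedelta(days=45)] * 2,
                   'InDate4': [today - pd.Timedelta(days=1),
                               today + pd.Timedelta(days=230)]},
                  index=[284075, 334627])

df1 = df.stack().reset_index()
df1.columns = ["ID", "Field", "Date"]
df1 = df1[df1.Date > datetime.datetime.now()].groupby("ID").agg("min").reset_index()
df1.Field = "MinValue"
df1 = df1.pivot(index="ID", columns="Field", values="Date").reset_index()
out = df.reset_index().rename(columns={"index": "ID"}).merge(df1, how="left")
print(out)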

Fill nan with zero python pandas

This is my code:
import re

for col in df:
    if col.startswith('event'):
        df[col].fillna(0, inplace=True)
        df[col] = df[col].map(lambda x: re.sub(r"\D", "", str(x)))
I have event columns 0 to 10 ("event_0", "event_1", ...).
When I fill NaN with this code, it fills all the NaN cells under all the event columns with 0, but it does not change event_0, the first column of that selection, which is also filled with NaN.
I made these columns from the 'events' column with the following code:
event_seperator = lambda x: pd.Series([i for i in str(x).strip().split('\n')]).add_prefix('event_')

df_events = df['events'].apply(event_seperator)
df = pd.concat([df.drop(columns=['events']), df_events], axis=1)
Please tell me what is wrong? You can see the dataframe before the change in the picture.
"I don't know why that happened since I made all those columns the same."
Your data suggests this is precisely what has not been done.
You have a few options depending on what you are trying to achieve.
1. Convert all non-numeric values to 0
Use pd.to_numeric with errors='coerce':
df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0)
2. Replace either string ('nan') or null (NaN) values with 0
Use pd.Series.replace followed by fillna:
import numpy as np

df[col] = df[col].replace('nan', np.nan).fillna(0)
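A quick check of both options on made-up values ('nan' here is the string that str(x) produces from a missing value):
import numpy as np
import pandas as pd

s = pd.Series(['7', 'nan', np.nan])
print(pd.to_numeric(s, errors='coerce').fillna(0))  # option 1 -> 7.0, 0.0, 0.0
print(s.replace('nan', np.nan).fillna(0))           # option 2 -> '7', 0, 0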

Extracting the hour from a time column in pandas

Suppose I have the following dataset:
How would I create a new column, to be the hour of the time?
For example, the code below works for individual times, but I haven't been able to generalise it for a column in pandas.
from datetime import datetime

t = datetime.strptime('9:33:07', '%H:%M:%S')
print(t.hour)
Use to_datetime to convert the column to datetimes, then dt.hour:
df = pd.DataFrame({'TIME':['9:33:07','9:41:09']})
#should be slower
#df['hour'] = pd.to_datetime(df['TIME']).dt.hour
df['hour'] = pd.to_datetime(df['TIME'], format='%H:%M:%S').dt.hour
print (df)
TIME hour
0 9:33:07 9
1 9:41:09 9
If you want to keep working with datetimes in the TIME column, it is possible to assign back:
df['TIME'] = pd.to_datetime(df['TIME'], format='%H:%M:%S')
df['hour'] = df['TIME'].dt.hour
print (df)
TIME hour
0 1900-01-01 09:33:07 9
1 1900-01-01 09:41:09 9
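Another route, if the dummy 1900-01-01 date bothers you: parse the strings as timedeltas instead (a sketch on the same sample column; .dt.components.hours returns the hour part):
df['hour'] = pd.to_timedelta(df['TIME']).dt.components.hours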
My suggestion:
df = pd.DataFrame({'TIME':['9:33:07','9:41:09']})
df['hour'] = df.TIME.str.extract(r"(^\d+):", expand=False)
"str.extract(...)" is a vectorized function that extract a regular expression pattern ( in our case "(^\d+):" which is the hour of the TIME) and return a Pandas Series object by specifying the parameter "expand= False"
The result is stored in the "hour" column
You can use extract() twice to pull out the 'hour' column:
df['hour'] = df.TIME.str.extract(r"(\d+:)", expand=False)
df['hour'] = df.hour.str.extract(r"(\d+)", expand=False)
