Replace dataframe values from another list with specific indexes - python

I have a dataframe with a date column, and I'm trying to replace some of its values with entries from another list, matched by index. For example,
dirty_dates_indexes is the list of indexes where the date is in the wrong format in the original dataframe df:
dirty_dates_indexes=[4,33,48,54,59,91,95,132,160,175,180,197,203,206,229,237,266,271,278,294,298,333,348,373,380,420,442]
formated_dates=['2019-04-25','2019-12-01','2019-06-16','2019-10-07','2019-08-06','2019-02-17','2019-11-20','2019-03-10','2019-10-11','2019-03-04','2019-07-31','2019-10-12','2019-09-13','2019-08-26','2019-12-29','2019-10-11','2019-11-20','2019-06-16','2019-12-12','2019-03-22','2019-01-21','2019-03-21','2019-10-15','2019-12-01','2019-03-20','2019-09-08','2019-08-19']
I'm trying to replace all values in df at the indexes in
dirty_dates_indexes with the values in formated_dates.
I've tried the following code, but I'm receiving an error:
for index in dirty_dates_indexes:
    df.loc[index].date.replace(df.loc[index].date, formated_dates(f for f in range(0, len(range(formated_dates)))))
Error:
TypeError: 'list' object cannot be interpreted as an integer
How can I solve this? Or is there a better approach?

You are taking the value from dirty_dates_indexes and using it to look up a position in formated_dates; that is likely what is tripping you up.
You are also using loc instead of iloc to reach the specific row.
Here's what I did.
dirty_dates_indexes=[4,33,48,54,
59,91,95,132,
160,175,180,197,
203,206,229,237,
266,271,278,294,
298,333,348,373,
380,420,442]
formated_dates=['2019-04-25','2019-12-01','2019-06-16','2019-10-07',
'2019-08-06','2019-02-17','2019-11-20','2019-03-10',
'2019-10-11','2019-03-04','2019-07-31','2019-10-12',
'2019-09-13','2019-08-26','2019-12-29','2019-10-11',
'2019-11-20','2019-06-16','2019-12-12','2019-03-22',
'2019-01-21','2019-03-21','2019-10-15','2019-12-01',
'2019-03-20','2019-09-08','2019-08-19']
import pandas as pd
df = pd.DataFrame()
df['dirty_dates'] = pd.date_range('2019-01-01', periods=500,freq='D')
for i, row_id in enumerate(dirty_dates_indexes):
    # single-step .iloc assignment avoids chained indexing (SettingWithCopyWarning)
    df.iloc[row_id, df.columns.get_loc('dirty_dates')] = pd.to_datetime(formated_dates[i])
print(df.head(20))
The results are as follows:
dirty_dates
0 2019-01-01
1 2019-01-02
2 2019-01-03
3 2019-01-04
4 2019-04-25 # <-- this row changed
5 2019-01-06
6 2019-01-07
7 2019-01-08
8 2019-01-09
9 2019-01-10
10 2019-01-11
11 2019-01-12
12 2019-01-13
13 2019-01-14
14 2019-01-15
15 2019-01-16
16 2019-01-17
17 2019-01-18
18 2019-01-19
19 2019-01-20
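Since the DataFrame above has a default RangeIndex, the loop can also be collapsed into one vectorized assignment. A minimal sketch along the same lines (not from the original answer), using a shortened index list for brevity:

```python
import pandas as pd

dirty_dates_indexes = [4, 33, 48]                       # subset for brevity
formated_dates = ['2019-04-25', '2019-12-01', '2019-06-16']

df = pd.DataFrame({'dirty_dates': pd.date_range('2019-01-01', periods=60, freq='D')})

# One .loc call assigns all corrected dates at once; this assumes the
# positions are also the index labels, which holds for a default RangeIndex.
df.loc[dirty_dates_indexes, 'dirty_dates'] = pd.to_datetime(formated_dates)
```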

Related

Getting a new datetime by offsetting a lapse in seconds from a fixed constant datetime

Given a fixed datetime '2019-01-15 7:00:00', the objective is to create a new datetime based on an offset value under the column ('lapse', ''). The unit of the offset value is seconds.
The expected output is given in the column time:
(lapse, '') time
0 0 2019-01-15 7:00:00
1 20 2019-01-15 7:00:20
2 40
3 60 2019-01-15 7:01:00
4 80
... ...
4315 86300
4316 86320
4317 86340
4318 86360
4319 86380
[4320 rows x 1 columns]
My impression is that this can be achieved via
pd.to_datetime(['2019-01-15 7:00:00']).add(pd.to_timedelta(df[('lapse','')],unit='s'))
However, the interpreter returns an error:
AttributeError: 'DatetimeIndex' object has no attribute 'add'
How can I resolve this issue?
The full code to reproduce the issue is below:
import numpy as np
import pandas as pd
np.random.seed(0)
increment=20
max_val=86400
# Lapse unit in seconds
aran=np.arange(0,max_val,increment).astype(int)
df=pd.DataFrame(aran,columns=[('lapse','')])
df['time']=pd.to_datetime(['2019-01-15 7:00:00']).add(pd.to_timedelta(df[('lapse','')],unit='s'))
Use to_datetime with origin and unit='s' parameters:
df['time'] = pd.to_datetime(df[('lapse','')], origin='2019-01-15 7:00:00', unit='s')
print (df)
(lapse, ) time
0 0 2019-01-15 07:00:00
1 20 2019-01-15 07:00:20
2 40 2019-01-15 07:00:40
3 60 2019-01-15 07:01:00
4 80 2019-01-15 07:01:20
... ... ...
4315 86300 2019-01-16 06:58:20
4316 86320 2019-01-16 06:58:40
4317 86340 2019-01-16 06:59:00
4318 86360 2019-01-16 06:59:20
4319 86380 2019-01-16 06:59:40
[4320 rows x 2 columns]
Pass your timestamp as a str instead of a list and use the + operator:
df['time'] = pd.to_datetime('2019-01-15 7:00:00') + pd.to_timedelta(df[('lapse','')],unit='s')
[out]
(lapse, ) time
0 0 2019-01-15 07:00:00
1 20 2019-01-15 07:00:20
2 40 2019-01-15 07:00:40
3 60 2019-01-15 07:01:00
4 80 2019-01-15 07:01:20
... ... ...
4315 86300 2019-01-16 06:58:20
4316 86320 2019-01-16 06:58:40
4317 86340 2019-01-16 06:59:00
4318 86360 2019-01-16 06:59:20
4319 86380 2019-01-16 06:59:40
[4320 rows x 2 columns]
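Both suggestions produce the same timestamps; a quick self-contained sketch (using the question's setup) to confirm they agree:

```python
import numpy as np
import pandas as pd

aran = np.arange(0, 86400, 20).astype(int)
df = pd.DataFrame(aran, columns=[('lapse', '')])

# origin= anchors to_datetime's epoch; the + operator adds a Timedelta instead
via_origin = pd.to_datetime(df[('lapse', '')], origin='2019-01-15 7:00:00', unit='s')
via_add = pd.to_datetime('2019-01-15 7:00:00') + pd.to_timedelta(df[('lapse', '')], unit='s')

assert (via_origin == via_add).all()
```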

Python dataframe import matching column and index data

I have a master data frame and an auxiliary data frame. Both have the same timestamp index and columns, with master having a few more columns. I want to copy a certain column's data from aux to master.
My code:
import numpy as np
import pandas as pd

maindf = pd.DataFrame({'A': [0.0, np.nan], 'B': [10, 20], 'C': [100, 200]}, index=pd.date_range(start='2020-05-04 08:00:00', freq='1h', periods=2))
auxdf = pd.DataFrame({'A': [1, 2], 'B': [30, 40]}, index=pd.date_range(start='2020-05-04 08:00:00', freq='1h', periods=2))
maindf =
A B C
2020-05-04 08:00:00 0.0 10 100
2020-05-04 09:00:00 NaN 20 200
auxdf =
A B
2020-05-04 08:00:00 1 30
2020-05-04 09:00:00 2 40
Expected answer: I want to take column A's data in auxdf and copy it to maindf by matching the index.
maindf =
A B C
2020-05-04 08:00:00 1 10 100
2020-05-04 09:00:00 2 20 200
My solution:
maindf['A'] = auxdf['A']
My solution is not correct because I am copying values directly without checking for a matching index. How do I achieve this?
You can use .update(), as follows:
maindf['A'].update(auxdf['A'])
.update() uses non-NA values from the passed Series to make updates, aligning on the index.
Note also that the original dtype of maindf['A'] is retained: it remains float even though auxdf['A'] is of int type.
Result:
print(maindf)
A B C
2020-05-04 08:00:00 1.0 10 100
2020-05-04 09:00:00 2.0 20 200
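If the goal is to overwrite every value (not only the NaNs that .update() fills) while still matching on the index, an explicit reindex is an alternative. A sketch, not part of the original answer; note that here the column's dtype follows auxdf:

```python
import numpy as np
import pandas as pd

idx = pd.date_range(start='2020-05-04 08:00:00', freq='1h', periods=2)
maindf = pd.DataFrame({'A': [0.0, np.nan], 'B': [10, 20], 'C': [100, 200]}, index=idx)
auxdf = pd.DataFrame({'A': [1, 2], 'B': [30, 40]}, index=idx)

# .reindex looks up auxdf's rows by maindf's index labels, so the copy
# matches on index even if row order differs; missing labels become NaN.
maindf['A'] = auxdf['A'].reindex(maindf.index)
```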

Difference between datetimes in terms of number of business days using pandas

Is there a (more) convenient/efficient method to calculate the number of business days between two dates using pandas?
I could do
len(pd.bdate_range(start='2018-12-03',end='2018-12-14'))-1 # minus one only if end date is a business day
but for longer distances between the start and end day this seems rather inefficient.
There are a couple of suggestions on how to use the BDay offset object, but they all seem to refer to the creation of date ranges or something similar.
I am thinking more in terms of a Timedelta object that is represented in business-days.
Say I have two series, s1 and s2, containing datetimes. If pandas had something along the lines of
s1.dt.subtract(s2,freq='B')
# giving a new series containing timedeltas where the number of days calculated
# use business days only
would be nice.
(numpy has a busday_count() method. But I would not want to convert my pandas Timestamps to numpy, as this can get messy.)
I think np.busday_count is a good idea here; converting to numpy arrays is also not necessary:
s1 = pd.Series(pd.date_range(start='05/01/2019',end='05/10/2019'))
s2 = pd.Series(pd.date_range(start='05/04/2019',periods=10, freq='5d'))
s = pd.Series([np.busday_count(a, b) for a, b in zip(s1, s2)])
print (s)
0 3
1 5
2 7
3 10
4 14
5 17
6 19
7 23
8 25
9 27
dtype: int64
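For long series the Python-level loop over pairs can be avoided, since np.busday_count also accepts arrays as long as they are datetime64[D]. A sketch on the same data:

```python
import numpy as np
import pandas as pd

s1 = pd.Series(pd.date_range(start='05/01/2019', end='05/10/2019'))
s2 = pd.Series(pd.date_range(start='05/04/2019', periods=10, freq='5d'))

# Cast nanosecond timestamps down to day precision, then count in one call
counts = np.busday_count(s1.to_numpy().astype('datetime64[D]'),
                         s2.to_numpy().astype('datetime64[D]'))
```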
from xone import calendar

def business_dates(start, end):
    us_cal = calendar.USTradingCalendar()
    kw = dict(start=start, end=end)
    return pd.bdate_range(**kw).drop(us_cal.holidays(**kw))
In [1]: business_dates(start='2018-12-20', end='2018-12-31')
Out[1]: DatetimeIndex(['2018-12-20', '2018-12-21', '2018-12-24', '2018-12-26',
'2018-12-27', '2018-12-28', '2018-12-31'],
dtype='datetime64[ns]', freq=None)
Source: Get business days between start and end date using pandas
# create a dataframe with the dates
df = pd.DataFrame({'dates': pd.date_range(start='05/01/2019', end='05/31/2019')})
# keep only the dates that fall on business days
df[df['dates'].isin(pd.bdate_range(df['dates'].iloc[0], df['dates'].iloc[-1]))]
out[]:
0 2019-05-01
1 2019-05-02
2 2019-05-03
5 2019-05-06
6 2019-05-07
7 2019-05-08
8 2019-05-09
9 2019-05-10
12 2019-05-13
13 2019-05-14
14 2019-05-15
15 2019-05-16
16 2019-05-17
19 2019-05-20
20 2019-05-21
21 2019-05-22
22 2019-05-23
23 2019-05-24
26 2019-05-27
27 2019-05-28
28 2019-05-29
29 2019-05-30
30 2019-05-31

Pandas - "time data does not match format" error when the string does match the format?

I'm getting a value error saying my data does not match the format when it does. Not sure if this is a bug or I'm missing something here. I'm referring to this documentation for the string format. The weird part is that if I write the 'data' DataFrame to a csv, read it back in, and then call the function below, it converts the date fine, so I'm not sure why it doesn't work without the csv round trip.
Any ideas?
data['Date'] = pd.to_datetime(data['Date'], format='%d-%b-%Y')
I'm getting two errors
TypeError: Unrecognized value type: <class 'str'>
ValueError: time data '27‑Aug‑2018' does not match format '%d-%b-%Y' (match)
Example dates -
2‑Jul‑2018
27‑Aug‑2018
28‑May‑2018
19‑Jun‑2017
5‑Mar‑2018
15‑Jan‑2018
11‑Nov‑2013
23‑Nov‑2015
23‑Jun‑2014
18‑Jun‑2018
30‑Apr‑2018
14‑May‑2018
16‑Apr‑2018
26‑Feb‑2018
19‑Mar‑2018
29‑Jun‑2015
Is it because the days aren't all double digits? What is the string format value for single-digit days? This looks like it could be the cause, but I'm not sure why it would error on '27' though.
End solution (the hyphens were a Unicode look-alike character, not the ASCII hyphen) -
data['Date'] = data['Date'].apply(unidecode.unidecode)
data['Date'] = data['Date'].apply(lambda x: x.replace("-", "/"))
data['Date'] = pd.to_datetime(data['Date'], format="%d/%b/%Y")
There seems to be an issue with your date strings. I replicated your issue with your sample data, and if I remove the hyphens and retype them manually (for the first three dates), the code works:
pd.to_datetime(df1['Date'] ,errors ='coerce')
output:
0 2018-07-02
1 2018-08-27
2 2018-05-28
3 NaT
4 NaT
5 NaT
6 NaT
7 NaT
8 NaT
9 NaT
10 NaT
11 NaT
12 NaT
13 NaT
14 NaT
15 NaT
Bottom line: your hyphens look like regular ones but are actually something else; just clean your source data and you're good to go.
You've got a special mark here; it is not '-':
df.iloc[0,0][2]
Out[287]: '‑'
Replace it with '-'
pd.to_datetime(df.iloc[:,0].str.replace('‑','-'),format='%d-%b-%Y')
Out[288]:
0 2018-08-27
1 2018-05-28
2 2017-06-19
3 2018-03-05
4 2018-01-15
5 2013-11-11
6 2015-11-23
7 2014-06-23
8 2018-06-18
9 2018-04-30
10 2018-05-14
11 2018-04-16
12 2018-02-26
13 2018-03-19
14 2015-06-29
Name: 2‑Jul‑2018, dtype: datetime64[ns]
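One way to confirm what the stray character actually is: inspect its code point. A small sketch; the sample strings are built with the '\u2011' (NON-BREAKING HYPHEN) escape so the look-alike is explicit:

```python
import pandas as pd

s = pd.Series(['2\u2011Jul\u20112018', '27\u2011Aug\u20112018'])

# The stray character's code point is 0x2011, not the ASCII hyphen 0x2d
print([hex(ord(c)) for c in s.iloc[0] if not c.isalnum()])

# Swap it for a real hyphen before parsing
fixed = pd.to_datetime(s.str.replace('\u2011', '-', regex=False), format='%d-%b-%Y')
```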

Pandas DataFrame.resample monthly offset from particular day of month

I have a DataFrame df with sporadic daily business day rows (i.e., there is not always a row for every business day.)
For each row in df I want to create a historical resampled mean dfm going back one month at a time. For example, if I have a row for 2018-02-22 then I want rolling means for rows in the following date ranges:
2018-01-23 : 2018-02-22
2017-12-23 : 2018-01-22
2017-11-23 : 2017-12-22
etc.
But I can't see a way to keep this pegged to the particular day of the month using conventional offsets. For example, if I do:
dfm = df.resample('30D').mean()
Then we see two problems:
It references the beginning of the DataFrame. In fact, I can't find a way to force .resample() to peg itself to the end of the DataFrame, even if I have it operate on df_reversed = df.loc[:'2018-02-22'].iloc[::-1]. Is there a way to "peg" the resampling to something other than the earliest date in the DataFrame? (And ideally pegged to each particular row as I run some lambda on the associated historical resampling from each row's date?)
It will drift over time, because not every month is 30 days long. So as I go back in time I will find that the interval 12 "months" prior ends 2017-02-27, not 2017-02-22 like I want.
Knowing that I want to resample by non-overlapping "months," the second problem can be well-defined for month days 29-31: For example, if I ask to resample for '2018-03-31' then the date ranges would end at the end of each preceding month:
2018-03-01 : 2018-03-31
2018-02-01 : 2018-02-28
2018-01-01 : 2018-01-31
etc.
Though again, I don't know: is there a good or easy way to do this in pandas?
tl;dr:
Given something like the following:
someperiods = 20 # this can be a number of days covering many years
somefrequency = '8D' # this can vary from 1D to maybe 10D
rng = pd.date_range('2017-01-03', periods=someperiods, freq=somefrequency)
df = pd.DataFrame({'x': rng.day}, index=rng) # x in practice is exogenous data
from pandas.tseries.offsets import *
df['MonthPrior'] = df.index.to_pydatetime() + DateOffset(months=-1)
Now:
For each row in df: calculate df['PreviousMonthMean'] = rolling average of all df.x in range [df.MonthPrior, df.index). In this example the resulting DataFrame would be:
Index x MonthPrior PreviousMonthMean
2017-01-03 3 2016-12-03 NaN
2017-01-11 11 2016-12-11 3
2017-01-19 19 2016-12-19 7
2017-01-27 27 2016-12-27 11
2017-02-04 4 2017-01-04 19
2017-02-12 12 2017-01-12 16.66666667
2017-02-20 20 2017-01-20 14.33333333
2017-02-28 28 2017-01-28 12
2017-03-08 8 2017-02-08 20
2017-03-16 16 2017-02-16 18.66666667
2017-03-24 24 2017-02-24 17.33333333
2017-04-01 1 2017-03-01 16
2017-04-09 9 2017-03-09 13.66666667
2017-04-17 17 2017-03-17 11.33333333
2017-04-25 25 2017-03-25 9
2017-05-03 3 2017-04-03 17
2017-05-11 11 2017-04-11 15
2017-05-19 19 2017-04-19 13
2017-05-27 27 2017-04-27 11
2017-06-04 4 2017-05-04 19
If we can get that far, then I need an efficient way to iterate it so that, for each row in df, I can aggregate consecutive but non-overlapping df['PreviousMonthMean'] values going back one calendar month at a time from the given DateTimeIndex...
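The PreviousMonthMean column described above can at least be reproduced with a brute-force per-row mask. A sketch only (it does not address the efficient follow-up aggregation), using df.index - pd.DateOffset(months=1) for the calendar-month shift:

```python
import pandas as pd

rng = pd.date_range('2017-01-03', periods=20, freq='8D')
df = pd.DataFrame({'x': rng.day}, index=rng)
df['MonthPrior'] = df.index - pd.DateOffset(months=1)

# For each row, average x over rows whose timestamps fall in [MonthPrior, index)
df['PreviousMonthMean'] = [
    df.loc[(df.index >= lo) & (df.index < hi), 'x'].mean()
    for lo, hi in zip(df['MonthPrior'], df.index)
]
```

This reproduces the table in the question (NaN for the first row, 3 for the second, and so on), but it is O(n^2) in the number of rows, so it only demonstrates the semantics, not the efficient solution being asked for.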
