Replace year on pandas dataframe with variable of Timestamp format - python

I have created the following df with the following code:
df = pd.read_table('https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/Wind_Stats/wind.data', sep = "\s+", parse_dates = [[0,1,2]])
If we run the following command:
type(df['Yr_Mo_Dy'][0])
We'll see that the observations under ['Yr_Mo_Dy'] are of pandas._libs.tslibs.timestamps.Timestamp format.
What I am trying to do is the following: whenever a date in ['Yr_Mo_Dy'] has a year >= 2061, I want to subtract 100 from the year; otherwise I just keep the year and continue with the iteration.
I have tried the following code:
for i in list(range(df.shape[0])):
    # assign all the observations under df['Yr_Mo_Dy'] to ts
    ts = df['Yr_Mo_Dy'][i]
    if df['Yr_Mo_Dy'][i].year >=2061:
        # replace the year in ts by year - 100
        ts.replace(year=df['Yr_Mo_Dy'][i].year - 100)
    else:
        continue
But the loop does nothing. I feel it has something to do with the variable assignment ts = df['Yr_Mo_Dy'][i], yet I cannot figure out another way of getting this done.
I am trying to assign a variable after each loop iteration considering the answer I saw in this post.

You should aim to avoid manual loops for vectorisable operations.
In this case, you can use numpy.where to create a conditional series:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': pd.to_datetime(['2018-01-01', '2080-11-30',
                                        '1955-04-05', '2075-10-09'])})
# shift dates from 2061 onwards back by 100 years, keep the rest unchanged
df['B'] = np.where(df['A'].dt.year >= 2061,
                   df['A'] - pd.DateOffset(years=100), df['A'])
print(df)
           A          B
0 2018-01-01 2018-01-01
1 2080-11-30 1980-11-30
2 1955-04-05 1955-04-05
3 2075-10-09 1975-10-09
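As for why the loop in the question appears to do nothing: Timestamp.replace returns a new Timestamp and leaves the original untouched, and the loop never writes the result back to the DataFrame. Here is a minimal sketch of that behaviour, together with a Series.mask variant of the same conditional shift. The sample dates and the names ts, shifted, s and fixed are illustrative only, not taken from the wind data.

```python
import pandas as pd

# Timestamp objects are immutable: replace() hands back a new object.
ts = pd.Timestamp("2061-01-01")
shifted = ts.replace(year=ts.year - 100)
print(ts.year, shifted.year)  # ts itself still carries the original year

# Vectorised variant: mask() swaps in the shifted date only where the
# condition holds and keeps the original value everywhere else.
s = pd.Series(pd.to_datetime(["2018-01-01", "2080-11-30",
                              "1955-04-05", "2075-10-09"]))
fixed = s.mask(s.dt.year >= 2061, s - pd.DateOffset(years=100))
print(fixed)
```

To apply this to the original frame, the result would be assigned back to the column, for example df['Yr_Mo_Dy'] = fixed computed from df['Yr_Mo_Dy'].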

Related

Applying a conditional statement on one column to achieve a result in another column

I am trying to write a script that will add 4 months or 8 months to the column titled "Date" depending on the column titled "Quarterly_Call__c". For instance, if the value in Quarterly_Call__c = 2 then add 4 months to the "Date" column and if the value is 3, add 8 months. Finally, I want the output in the column titled "New Date".
So far I am able to add the number of months I want using this piece of code:
from datetime import date
from dateutil.relativedelta import relativedelta
new_date = []
df['Date'] = df['Date'].dt.normalize()
for value in df['Date']:
    new_date.append(value + relativedelta(months=+4))
df['New Date'] = new_date
However, as I mentioned, I would like this to work depending on the value in Quarterly_Call__c, so I tried writing this code:
for i in df['Quarterly_Call__c'].astype(int).to_list():
    if i == 2:
        for value in df['Date']:
            new_date.append(value + relativedelta(months=+4))
    elif i == 3:
        for value in df['Date']:
            new_date.append(value + relativedelta(months=+8))
Unfortunately, this does not work. Could you please recommend a solution? Thanks!
Applying a lambda expression to each row of your DataFrame seems to be the most convenient approach:
def date_calc(q, d):
    # q is the Quarterly_Call__c value, d is the date of the same row
    if q == 2:
        return d + relativedelta(months=+4)
    else:
        return d + relativedelta(months=+8)
df['New Date'] = df.apply(lambda x: date_calc(x['Quarterly_Call__c'], x['Date']), axis=1)
The date_calc function holds the same logic you posted in your question while taking the inputs as arguments, and the apply method of the DataFrame is used to calculate the 'New Date' column for each row where the variable x of the lambda expression represents a row of the DataFrame.
Keep in mind that setting the axis argument to 1 is what makes sure that the function is applied to each row of the DataFrame rather than to each column. More info about the apply method can be found here.
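If the row-wise apply turns out to be slow on a large frame, one possible alternative is to compute both candidate dates for the whole column and pick between them with Series.where. This is only a sketch: the column names come from the question, the sample values and the names plus4 and plus8 are made up for illustration, and, like date_calc above, any value other than 2 gets 8 months.

```python
import pandas as pd

df = pd.DataFrame({
    "Quarterly_Call__c": [2, 3, 2, 3],
    "Date": pd.to_datetime(["2021-02-25", "2021-03-25",
                            "2021-04-25", "2021-05-25"]),
})

# Compute both candidates for every row at once.
plus4 = df["Date"] + pd.DateOffset(months=4)
plus8 = df["Date"] + pd.DateOffset(months=8)

# Keep the +4 result where the call value is 2, otherwise take +8.
df["New Date"] = plus4.where(df["Quarterly_Call__c"] == 2, plus8)
print(df)
```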
You could iterate through row by row to access the row data, and calculate the new date.
import pandas as pd
from dateutil.relativedelta import relativedelta
df = pd.DataFrame({
'Quarterly_Call__c': [2,3,2,3],
'Date': ['2021-02-25', '2021-03-25', '2021-04-25', '2021-05-25']
})
df['Date'] = pd.to_datetime(df['Date'])
df['New Date'] = pd.NaT #new empty datetime column
for i in range(len(df)):
    if df.loc[i, 'Quarterly_Call__c'] == 2:
        df.loc[i, 'New Date'] = df.loc[i, 'Date'] + relativedelta(months=+4)
    elif df.loc[i, 'Quarterly_Call__c'] == 3:
        df.loc[i, 'New Date'] = df.loc[i, 'Date'] + relativedelta(months=+8)
df['New Date'] = df['New Date'].dt.normalize()
Output
   Quarterly_Call__c       Date   New Date
0                  2 2021-02-25 2021-06-25
1                  3 2021-03-25 2021-11-25
2                  2 2021-04-25 2021-08-25
3                  3 2021-05-25 2022-01-25
You can try lambda functions on your dataframe. For example:
import pandas as pd
numbers = {'set_of_numbers': [1,2,3,4,5,6,7,8,9,10]}
df = pd.DataFrame(numbers,columns=['set_of_numbers'])
df['equal_or_lower_than_4?'] = df['set_of_numbers'].apply(lambda x: 'True' if x <= 4 else 'False')
print (df)
You can check this link [1] for more information on how to apply if conditions on Pandas DataFrame.
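For a simple threshold test like this one, the comparison can also be applied to the whole column directly, which gives real booleans instead of the strings 'True' and 'False'. A short sketch using the same made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({"set_of_numbers": list(range(1, 11))})

# Comparing a Series with a scalar returns a boolean Series of the same length.
df["equal_or_lower_than_4?"] = df["set_of_numbers"] <= 4
print(df)
```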

Pandas SettingWithCopyWarning for unclear reason

Consider the following example code
import pandas as pd
import numpy as np
pd.set_option('display.expand_frame_repr', False)
foo = pd.read_csv("foo2.csv", skipinitialspace=True, index_col='Index')
foo.loc[:, 'Date'] = pd.to_datetime(foo.Date)
for i in range(0, len(foo)-1):
    if foo.at[i, 'Type'] == 'Reservation':
        for j in range(i+1, len(foo)):
            if foo.at[j, 'Type'] == 'Payout':
                foo.at[j, 'Nights'] = foo.at[i, 'Nights']
                break
mask = (foo['Date'] >= '2018-03-31') & (foo['Date'] <= '2019-03-31')
foo2019 = foo.loc[mask]
foopayouts2019 = foo2019.loc[foo2019['Type'] == 'Payout']
foopayouts2019.loc[:, 'Nights'] = foopayouts2019['Nights'].apply(np.int64)
# foopayouts2019.loc[:, 'Nights'] = foopayouts2019['Nights'].astype(np.int64, copy=False)
with foo2.csv as:
Index,Date,Type,Nights,Amount,Payout
0,03/07/2018,Reservation,2.0,1000.00,
1,03/07/2018,Payout,,,1000.00
2,09/11/2018,Reservation,3.0,1500.00,
3,09/11/2018,Payout,,,1500.00
4,02/16/2019,Reservation,2.0,2000.00,
5,02/16/2019,Payout,,,2000.00
6,04/25/2019,Reservation,7.0,1200.00,
7,04/25/2019,Payout,,,1200.00
This gives the following warning:
/usr/lib/python2.7/dist-packages/pandas/core/indexing.py:543: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.obj[item] = s
The warning does not mention a line number, but appears to be coming from the line:
foopayouts2019.loc[:, 'Nights'] = foopayouts2019['Nights'].apply(np.int64)
At least, if I comment that line out, the error goes away. So, I have two questions.
First, what is causing that error? I've been trying to use .loc where appropriate, including in that line where the warning is (possibly) coming from. If the problem is actually earlier, where is it?
Second, which is the better choice, .apply or astype, as used in the following lines of code?
foopayouts2019.loc[:, 'Nights'] = foopayouts2019['Nights'].apply(np.int64)
# foopayouts2019.loc[:, 'Nights'] = foopayouts2019['Nights'].astype(np.int64, copy=False)
It seems that both of them work, except for that warning.
I would change a few things in the code:
Check whether the current row is a Reservation and the next row is a Payout by using shift(), then use np.where() to fill every other row with the forward-filled Nights values:
foo.Date=pd.to_datetime(foo.Date) #convert to datetime
c=foo.Type.eq('Reservation')&foo.Type.shift(-1).eq('Payout')
foo.Nights=np.where(~c,foo.Nights.ffill(),foo.Nights) #replace if else with np.where
Or:
c=foo.Type.shift().eq('Reservation')&foo.Type.eq('Payout')
foo.Nights=np.where(c,foo.Nights.ffill(),foo.Nights)
Then use series.between() to check if dates fall between 2 dates:
foo2019 = foo[foo.Date.between('2018-03-31','2019-03-31')].copy() #changes
foopayouts2019 = foo2019[foo2019['Type'] == 'Payout'].copy() #changes .copy()
Or directly:
foopayouts2019=foo[foo.Date.between('2018-03-31','2019-03-31')&foo.Type.eq('Payout')].copy()
foopayouts2019.loc[:, 'Nights'] = foopayouts2019['Nights'].apply(np.int64) #.astype(int)
   Index       Date    Type  Nights  Amount  Payout
3      3 2018-09-11  Payout       3     NaN  1500.0
5      5 2019-02-16  Payout       2     NaN  2000.0
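To make the role of .copy() concrete: the warning is raised when a value is set on a frame that pandas considers a slice of another frame, so taking an explicit copy of the filtered rows tells pandas the result is independent. Below is a small self-contained sketch. The frame is a cut-down, hypothetical stand-in for foo2.csv, and payouts is an illustrative name. It also uses astype rather than apply(np.int64), since astype converts the whole column in one step.

```python
import pandas as pd

# Hypothetical stand-in for the relevant columns of foo2.csv.
foo = pd.DataFrame({
    "Date": pd.to_datetime(["2018-03-07", "2018-09-11",
                            "2019-02-16", "2019-04-25"]),
    "Type": ["Payout"] * 4,
    "Nights": [2.0, 3.0, 2.0, 7.0],
})

mask = foo["Date"].between("2018-03-31", "2019-03-31") & foo["Type"].eq("Payout")

# .copy() makes the filtered frame independent of foo, so the later
# assignment is unambiguous and does not touch the original.
payouts = foo.loc[mask].copy()
payouts["Nights"] = payouts["Nights"].astype("int64")
print(payouts)
```

Assigning with payouts["Nights"] = ... replaces the column as a whole, so the new integer dtype is kept.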

Create pandas column of pd.date_range

I have data like this:
import datetime as dt
import pandas as pd
df = pd.DataFrame({'date':[dt.datetime(2018,8,25), dt.datetime(2018,7,21)],
'n':[10,7]})
I would like to create a third column which contains a date range created by pd.date_range, using 'date' as the start date and 'n' as the number of periods.
So the first entry should be:
pd.date_range(dt.datetime(2018,8,25), periods=10, freq='d')
(I have a list of "target" dates, and my goal is to check whether the date_range contains any of those target dates).
I tried this:
df['date_range'] = df.apply(lambda x: pd.date_range(x['date'],
x['n'],
freq='d'))
But this gives a KeyError: ('date', 'occurred at index date')
Any idea on how to do this without using a for loop, or is there a better solution altogether?
You can solve your problem without creating date range or day columns. To check if a target date in tgt belongs to a date range specified by rows of df, you can calculate the end of date range, and then check if each date in tgt falls in between the start and end of a time interval. The code below implements this, and produces "target_date" column identical to the one in your own answer:
df = pd.DataFrame({'date':[dt.datetime(2018,8,25), dt.datetime(2018,7,21)],
'n':[10,7]})
df["daterange_end"] = df.apply(lambda x: x["date"] + pd.Timedelta(days=x["n"]), axis=1)
tgt = [dt.datetime(2018,8,26)]
df['target_date'] = 0
df.loc[(tgt[0] > df.date) &(tgt[0] < df.daterange_end),"target_date"] = 1
print(df)
# date n daterange_end target_date
# 0 2018-08-25 10 2018-09-04 1
# 1 2018-07-21 7 2018-07-28 0
You should add axis=1 in apply. Also pass n as periods, because the second positional argument of pd.date_range is end:
df['date_range'] = df.apply(lambda x: pd.date_range(x['date'], periods=x['n'], freq='D'), axis=1)
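If the date ranges are only needed to test for the target dates, one option is to build each range on the fly and store just the flag. This sketch uses the df and the target date from the question. The name targets is illustrative, and the list comprehension is still a loop under the hood, so it is a convenience rather than a vectorised solution.

```python
import datetime as dt
import pandas as pd

df = pd.DataFrame({"date": [dt.datetime(2018, 8, 25), dt.datetime(2018, 7, 21)],
                   "n": [10, 7]})
targets = [pd.Timestamp("2018-08-26")]

# For each row, build the n-day range starting at 'date' and flag the row
# with 1 if any target date falls inside that range, else 0.
df["target_date"] = [
    int(pd.date_range(start, periods=n, freq="D").isin(targets).any())
    for start, n in zip(df["date"], df["n"])
]
print(df)
```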
I came up with a solution that works (but I'm sure there's a nicer way...)
# define target
tgt = [dt.datetime(2018,8,26)]
# find max n
max_n = max(df['n'])
# create that many columns and increment the day
for i in range(max_n):
    df['date_{}'.format(i)] = df['date'] + dt.timedelta(days=i)
new_cols = ['date_{}'.format(n) for n in range(max_n)]
# check each one and replace with a 1 if it matches the "tgt"
df['target_date'] = 0
for col in new_cols:
    df['target_date'] = np.where(df[col].isin(tgt),
                                 1,
                                 df['target_date'])
# drop intermediate cols
df = df[[i for i in df.columns if not i in new_cols]]

pandas standalone series and from dataframe different behavior

Here is my code and warning message. If I change s to be a standalone Series by using s = pd.Series(np.random.randn(5)), there will be no such warning. Using Python 2.7 on Windows.
It seems that a standalone Series and a Series created from a column of a data frame behave differently? Thanks.
My purpose is to change the Series values themselves, rather than change a copy.
Source code,
import pandas as pd
sample = pd.read_csv('123.csv', header=None, skiprows=1,
dtype={0:str, 1:str, 2:str, 3:float})
sample.columns = pd.Index(data=['c_a', 'c_b', 'c_c', 'c_d'])
sample['c_d'] = sample['c_d'].astype('int64')
s = sample['c_d']
#s = pd.Series(np.random.randn(5))
for i in range(len(s)):
    if s.iloc[i] > 0:
        s.iloc[i] = s.iloc[i] + 1
    else:
        s.iloc[i] = s.iloc[i] - 1
Warning message,
C:\Python27\lib\site-packages\pandas\core\indexing.py:132: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
Content of 123.csv,
c_a,c_b,c_c,c_d
hello,python,numpy,0.0
hi,python,pandas,1.0
ho,c++,vector,0.0
ho,c++,std,1.0
go,c++,std,0.0
Edit 1, seems lambda solution does not work, tried to print s before and after, the same value,
import pandas as pd
sample = pd.read_csv('123.csv', header=None, skiprows=1,
dtype={0:str, 1:str, 2:str, 3:float})
sample.columns = pd.Index(data=['c_a', 'c_b', 'c_c', 'c_d'])
sample['c_d'] = sample['c_d'].astype('int64')
s = sample['c_d']
print s
s.apply(lambda x:x+1 if x>0 else x-1)
print s
0 0
1 1
2 0
3 1
4 0
Name: c_d, dtype: int64
Backend TkAgg is interactive backend. Turning interactive mode on.
0 0
1 1
2 0
3 1
4 0
regards,
Lin
By doing s = sample['c_d'], if you make a change to the value of s then your original Dataframe sample also changes. That's why you got the warning.
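You can do s = sample['c_d'].copy() instead, so that changing the value of s doesn't change the value of the c_d column of the Dataframe sample.
I suggest you use the apply function instead. Note that apply returns a new Series rather than modifying s in place, so the result has to be assigned back:
sample['c_d'] = sample['c_d'].apply(lambda x:x+1 if x>0 else x-1)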
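The same update can also be written without a Python-level function by using np.where on the column and assigning the result back to the DataFrame. A minimal sketch, with the c_d values from 123.csv typed in directly instead of read from the file:

```python
import numpy as np
import pandas as pd

# c_d values from 123.csv after the astype('int64') step.
sample = pd.DataFrame({"c_d": [0, 1, 0, 1, 0]})

# Add 1 where the value is positive, subtract 1 otherwise, and write the
# result back to the column so the DataFrame itself is updated.
sample["c_d"] = np.where(sample["c_d"] > 0, sample["c_d"] + 1, sample["c_d"] - 1)
print(sample)
```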

Upsample data and interpolate

I have the following dataframe:
Month Col_1 Col_2
1 0,121 0,123
2 0,231 0,356
3 0,150 0,156
4 0,264 0,426
...
I need to resample this to weekly resolution and to interpolate between the points. The latter part, the interpolation, is straightforward. The reindex part, on the other hand, is a bit tricky, at least for me.
If I use the DataFrame.reindex() method, it will only erase all the entries from the dataframe. I have tried to do it manually, by using .loc() to create new 'NaN' entries between each consecutive months, but this method overwrites the entries I already have.
Any clue how to do it? Thanks!
I have to assume a start date, I chose 2009-12-31.
To get resample to work, you need a pd.DatetimeIndex.
start_date = pd.to_datetime('2009-12-31')
df.Month = df.Month.apply(lambda x: start_date + pd.offsets.MonthEnd(x))
df = df.set_index('Month')
df.resample('W').interpolate()
Replicable code
from io import StringIO
import pandas as pd
text = """Month Col_1 Col_2
1 0,121 0,123
2 0,231 0,356
3 0,150 0,156
4 0,264 0,426"""
df = pd.read_csv(StringIO(text), decimal=',', sep=r'\s+')
start_date = pd.to_datetime('2009-12-31')
df.Month = df.Month.apply(lambda x: start_date + pd.offsets.MonthEnd(x))
df = df.set_index('Month')
df.resample('W').interpolate()
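One thing to watch for: month ends rarely fall on the weekly anchor day (Sunday for 'W'), and depending on the pandas version, the original monthly points may not all take part in the interpolation. A possible workaround, sketched below under the same assumed 2010 dates, is to reindex onto the union of the monthly and weekly dates, interpolate on time, and then keep only the weekly rows. The names monthly, weekly_index, combined and weekly are illustrative.

```python
import pandas as pd

# Monthly data indexed by month end (dates assume the 2009-12-31 start above).
monthly = pd.DataFrame(
    {"Col_1": [0.121, 0.231, 0.150, 0.264],
     "Col_2": [0.123, 0.356, 0.156, 0.426]},
    index=pd.to_datetime(["2010-01-31", "2010-02-28",
                          "2010-03-31", "2010-04-30"]),
)

# Weekly (Sunday) dates spanning the monthly data.
weekly_index = pd.date_range(monthly.index[0], monthly.index[-1], freq="W")

# Keep the original points while interpolating, then select the weekly rows.
combined = monthly.reindex(monthly.index.union(weekly_index))
weekly = combined.interpolate(method="time").loc[weekly_index]
print(weekly)
```

method="time" weights the interpolation by the actual time gaps, which matters here because the combined index is not evenly spaced.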
