Applying Pandas iterrows function - python

I was wondering whether somebody could please give me some assistance with the pandas iterrows method.
I'm currently using an iterative function that works, but I was wondering whether using iterrows (or something else that avoids the explicit for loop) would make it more efficient?
import pandas as pd
import numpy as np
dataframe_1 = pd.read_csv(r"D\data\2018_19.csv")  # raw string, so "\2" is not read as an escape sequence
def append_date_column(df):
    df = df.copy()
    df['date_column'] = np.nan
    # Month-end dates starting in January 2001; "income2" holds a 1-based position in this range
    date_range = pd.date_range(start = '01/01/2001', periods = 207, freq = 'M').values
    for row in range(df.shape[0]):
        date_number = df.loc[row, "income2"]
        if (not pd.isna(date_number)) and date_number < 207:
            date = date_range[int(date_number) -1]
            df.loc[row, 'date_column'] = date
    df_with_date_column = df
    return df_with_date_column
Thanks!
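For what it's worth, iterrows is itself a Python-level loop, so it is unlikely to be faster than the loop above. A minimal sketch of a vectorized alternative follows. The function name `append_date_column_vectorized` and the demo frame are made up for illustration, and the sketch assumes `income2` is numeric and that only values from 1 up to (but not including) 207 are meant to be valid, which differs from the loop version for zero or negative values.

```python
import numpy as np
import pandas as pd

def append_date_column_vectorized(df, n_periods=207):
    """Map 1-based month numbers in 'income2' to month-end dates without a row loop."""
    df = df.copy()
    # Month-end lookup table starting January 2001 (month starts rolled forward to month end).
    lookup = pd.date_range("2001-01-01", periods=n_periods, freq="MS") + pd.offsets.MonthEnd(0)
    numbers = df["income2"]
    # Assumption: only 1 <= value < n_periods counts as valid; NaN compares as False.
    valid = (numbers >= 1) & (numbers < n_periods)
    out = np.full(len(df), np.datetime64("NaT"), dtype="datetime64[ns]")
    positions = numbers[valid].astype(int).to_numpy() - 1
    out[valid.to_numpy()] = lookup.to_numpy()[positions]
    df["date_column"] = out
    return df

# Illustrative data only, not taken from the CSV in the question.
demo = pd.DataFrame({"income2": [1, 2, np.nan, 300]})
demo_out = append_date_column_vectorized(demo)
print(demo_out)
```

Rows with a missing or out-of-range number end up as NaT, so the column keeps a datetime dtype instead of mixing NaN and dates.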

Related

Performance improvement: using pd.date_range() with apply() in python

I have a question about performance improvement of the following code:
df["range_col"] = list(zip(df.START, df.END))
df["range_col"] = df["range_col"].swifter.apply(lambda x: pd.date_range(x[0], x[1], freq="60min"))
Explanation: I have two datetime columns. From these columns I build a tuple per row and then a date range with a 60-minute frequency.
For larger datasets, it takes quite a long time to run this code.
Below I have created some sample data to run the code.
Does anyone perhaps know of an alternative that produces the same result but is faster?
import faker
import pandas as pd
from faker import Faker
import swifter
# create some fake date data
fake = Faker()
Faker.seed(0)
df = []
for _ in range(5):
    df.append(fake.date("%Y-%m-%d_%H_%M_%S"))
df1 = []
for _ in range(5):
    df1.append(fake.date("%Y-%m-%d_%H_%M_%S"))
# create df
df = pd.DataFrame(df)
df["START"] = df
df = pd.DataFrame(df["START"])
df["START"] = pd.to_datetime(df["START"], format="%Y-%m-%d_%H_%M_%S")
# create df
df1 = pd.DataFrame(df1)
df1["END"] = df1
df1 = pd.DataFrame(df1["END"])
df1["END"] = pd.to_datetime(df1["END"], format="%Y-%m-%d_%H_%M_%S")
# merge
df2 = pd.concat([df, df1], axis = 1)
# create tuple
df2["range_col"] = list(zip(df2.START, df2.END))
# create date range
df2["range__col1"] = df2["range_col"].swifter.apply(lambda x: pd.date_range(x[0], x[1], freq="60min"))
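One option that may be worth timing is to skip both the intermediate tuple column and apply, and build the ranges in a plain list comprehension over the two columns. This is only a sketch with made-up sample values, and whether it actually beats swifter on a large frame is something to measure rather than assume, since each row still calls pd.date_range once.

```python
import pandas as pd

# Illustrative data only; the real frame would come from the code in the question.
df2 = pd.DataFrame({
    "START": pd.to_datetime(["2020-01-01 00:00", "2020-01-02 10:30"]),
    "END": pd.to_datetime(["2020-01-01 03:00", "2020-01-02 12:00"]),
})

# One DatetimeIndex per row, built directly from the two columns.
df2["range_col1"] = [
    pd.date_range(start, end, freq="60min")
    for start, end in zip(df2["START"], df2["END"])
]
print(df2["range_col1"].iloc[0])
```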

How to fill a new column with the businessDuration result in a pandas DataFrame

Please help me solve this: how do I create a new column in df that holds the duration result for every row? Thanks.
import pandas as pd
from datetime import time,datetime
from itertools import repeat
df = pd.read_csv("data.csv")
df['startdate_column'] = pd.to_datetime(df['startdate_column'])
df['enddate_column'] = pd.to_datetime(df['enddate_column'])
start_time=time(8,0,0)
end_time=time(17,0,0)
unit='min'
df['Duration'] = list(map(businessDuration,startdate=df['startdate_column'],enddate=df['enddate_column'],repeat(start_time),repeat(end_time),repeat(weekendlist=[6]),repeat(unit)))
Use:
f = lambda x: businessDuration(startdate=x['startdate_column'],
                               enddate=x['enddate_column'],
                               starttime=start_time,
                               endtime=end_time,
                               weekendlist=[6],
                               unit=unit)
df['Duration'] = df.apply(f, axis=1)
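The same row-by-row call can also be written as a list comprehension over the two columns, which passes every argument by keyword and avoids building a row Series for each call. The sketch below uses a hypothetical stand-in, `elapsed_minutes`, in place of businessDuration so that it runs without the extra package; it returns plain elapsed minutes and does not implement any business-hours logic.

```python
import pandas as pd

def elapsed_minutes(startdate, enddate):
    # Hypothetical stand-in for businessDuration: plain elapsed minutes only.
    return (enddate - startdate).total_seconds() / 60

# Illustrative data only.
df = pd.DataFrame({
    "startdate_column": pd.to_datetime(["2021-03-01 08:00", "2021-03-02 09:30"]),
    "enddate_column": pd.to_datetime(["2021-03-01 10:00", "2021-03-02 10:00"]),
})

# One call per row, pairing the start and end values with zip.
df["Duration"] = [
    elapsed_minutes(startdate=start, enddate=end)
    for start, end in zip(df["startdate_column"], df["enddate_column"])
]
print(df)
```

With the real function, the extra keyword arguments (starttime, endtime, weekendlist, unit) would go inside the call in the comprehension.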

Copy and convert all values in pandas dataframe

In a dataframe, I have a column "UnixTime" and want to convert it to a new column containing the UTC time.
import pandas as pd
from datetime import datetime
df = pd.DataFrame([1565691196, 1565691297, 1565691398], columns = ["UnixTime"])
unix_list = df["UnixTime"].tolist()
utc_list = []
for i in unix_list:
    i = datetime.utcfromtimestamp(i).strftime('%Y-%m-%d %H:%M:%S')
    utc_list.append(i)
df["UTC"] = utc_list
This works, but I guess there is a smarter approach?
Could you try this:
df["UTC"] = pd.to_datetime(df['UnixTime'], unit='s')
If by a smarter approach you mean the pandas way with less code, then this is your answer:
df["UTC"] = pd.to_datetime(df["UnixTime"], unit = "s")
Hope this helps.
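Note that pd.to_datetime gives a datetime64 column, while the loop in the question produced formatted strings. If the string form is what is needed, a minimal sketch is to format the converted column with .dt.strftime (the column name `UTC_str` is just an illustrative choice):

```python
import pandas as pd

df = pd.DataFrame([1565691196, 1565691297, 1565691398], columns=["UnixTime"])

# Epoch seconds to datetime64 values (interpreted as UTC, stored without a timezone).
df["UTC"] = pd.to_datetime(df["UnixTime"], unit="s")

# Optional: the same string format that the original loop produced.
df["UTC_str"] = df["UTC"].dt.strftime("%Y-%m-%d %H:%M:%S")
print(df)
```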

Why can't I search for a row in a pandas df using a date as part of a tuple index?

I am trying to search a pandas df I made which has a tuple as an index. The first part of the tuple is a date and the second part is a forex pair. I've tried a few things but I can't seem to search using a date-formatted string as part of a tuple with .loc or .ix
My df looks like this:
                        Open    Close
(11-01-2018, AEDAUD)  0.3470   0.3448
(11-01-2018, AEDCAD)  0.3415   0.3408
(11-01-2018, AEDCHF)  0.2663   0.2656
(11-01-2018, AEDDKK)  1.6955   1.6838
(11-01-2018, AEDEUR)  0.2277   0.2261
Here is the complete code :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
forex_11 = pd.read_csv('FOREX_20180111.csv', sep=',', parse_dates=['Date'])
forex_12 = pd.read_csv('FOREX_20180112.csv', sep=',', parse_dates=['Date'])
time_format = '%d-%m-%Y'
forex = forex_11.append(forex_12, ignore_index=False)
forex['Date'] = forex['Date'].dt.strftime(time_format)
GBP = forex[forex['Symbol'] == "GBPUSD"]
forex.index = list(forex[['Date', 'Symbol']].itertuples(index=False, name=None))
forex_open_close = pd.DataFrame(np.array(forex[['Open','Close']]), index=forex.index)
forex_open_close.columns = ['Open', 'Close']
print(forex_open_close.head())
print(forex_open_close.ix[('11-01-2018', 'GBPUSD')])
How do I get the row which has index ('11-01-2018', 'GBPUSD') ?
Can you try putting the tuple in a list using brackets?
Like this:
print(forex_open_close.ix[[('11-01-2018', 'GBPUSD')]])
I would recommend using the Pandas multiIndex. In your case you could do the following:
tuples = list(data[['Date', 'Symbol']].itertuples(index=False, name=None))
data.index = pd.MultiIndex.from_tuples(tuples, names=['Date', 'Symbol'])
# And then to index
data.loc['2018-01-11', 'AEDCAD']
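A self-contained sketch of the MultiIndex approach, using three of the rows shown in the question. It assumes the dates are kept as real datetimes rather than being formatted back to strings, and it passes the key as an explicit tuple so that the second element cannot be mistaken for a column label.

```python
import pandas as pd

# Rows copied from the sample output in the question.
data = pd.DataFrame({
    "Date": ["11-01-2018", "11-01-2018", "11-01-2018"],
    "Symbol": ["AEDAUD", "AEDCAD", "AEDCHF"],
    "Open": [0.3470, 0.3415, 0.2663],
    "Close": [0.3448, 0.3408, 0.2656],
})
data["Date"] = pd.to_datetime(data["Date"], format="%d-%m-%Y")

# set_index with two columns builds the MultiIndex directly.
data = data.set_index(["Date", "Symbol"]).sort_index()

# Look up one row by its full (date, symbol) key.
row = data.loc[(pd.Timestamp("2018-01-11"), "AEDCAD")]
print(row)
```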

python pandas: vectorized time series window function

I have a pandas dataframe in the following format:
'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
3,2017-07-15,thing3,55,17
3,2016-05-12,thing3,55,47
4,2012-02-23,thing2,150,22
4,2009-10-10,thing1,25,12
4,2014-04-04,thing2,150,2
5,2008-07-09,thing2,150,43
I have written the following to create two new fields indicating 30 day windows:
import numpy as np
import pandas as pd
start_date_period = pd.period_range('2004-01-01', '12-31-2017', freq='30D')
end_date_period = pd.period_range('2004-01-30', '12-31-2017', freq='30D')
def find_window_start_date(x):
    window_start_date_idx = np.argmax(x < start_date_period.end_time)
    return start_date_period[window_start_date_idx]
df['window_start_dt'] = df['transaction_dt'].apply(find_window_start_date)
def find_window_end_date(x):
    window_end_date_idx = np.argmin(x > end_date_period.start_time)
    return end_date_period[window_end_date_idx]
df['window_end_dt'] = df['transaction_dt'].apply(find_window_end_date)
Unfortunately, this is far too slow doing the row-wise apply for my application. I would greatly appreciate any tips on vectorizing these functions if possible.
EDIT:
The resultant dataframe should have this layout:
'customer_id','transaction_dt','product','price','units','window_start_dt','window_end_dt'
It does not need to be resampled or windowed in the formal sense. It just needs 'window_start_dt' and 'window_end_dt' columns to be added. The current code works; it just needs to be vectorized if possible.
EDIT 2: pandas.cut is built-in:
tt=[[1,'2004-01-02',0.1,25,47],
    [1,'2004-01-17',0.2,150,8],
    [2,'2004-01-29',0.2,150,25],
    [3,'2017-07-15',0.3,55,17],
    [3,'2016-05-12',0.3,55,47],
    [4,'2012-02-23',0.2,150,22],
    [4,'2009-10-10',0.1,25,12],
    [4,'2014-04-04',0.2,150,2],
    [5,'2008-07-09',0.2,150,43]]
start_date_period = pd.date_range('2004-01-01', '12-01-2017', freq='MS')
end_date_period = pd.date_range('2004-01-30', '12-31-2017', freq='M')
df = pd.DataFrame(tt,columns=['customer_id','transaction_dt','product','price','units'])
df['transaction_dt'] = pd.Series([pd.to_datetime(sub_t[1],format='%Y-%m-%d') for sub_t in tt])
the_cut = pd.cut(df['transaction_dt'],bins=start_date_period,right=True,labels=False,include_lowest=True)
df['win_start_test'] = pd.Series([start_date_period[int(x)] if not np.isnan(x) else 0 for x in the_cut])
df['win_end_test'] = pd.Series([end_date_period[int(x)] if not np.isnan(x) else 0 for x in the_cut])
print(df.head())
win_start_test and win_end_test should be equal to their counterparts computed using your function.
The ValueError was coming from not casting x to int in the relevant line. I also added a NaN check, though it wasn't needed for this toy example.
Note the change to pd.date_range and the use of the start-of-month and end-of-month frequency aliases MS and M, respectively, as well as converting the date strings into datetime.
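If fixed 30-day windows (as in the original question) are wanted rather than calendar months, the window bounds can also be computed with plain date arithmetic and no lookup at all. This is a sketch under stated assumptions: windows are counted from 2004-01-01, each window spans 30 days inclusive of its start, and the result columns hold Timestamps rather than Period objects.

```python
import pandas as pd

# A few transaction dates taken from the sample data in the question.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 5],
    "transaction_dt": pd.to_datetime(
        ["2004-01-02", "2004-01-17", "2004-01-29", "2008-07-09"]
    ),
})

origin = pd.Timestamp("2004-01-01")  # assumed start of the first window
window_days = 30

# Integer index of the 30-day window each transaction falls into.
window_idx = (df["transaction_dt"] - origin).dt.days // window_days

df["window_start_dt"] = origin + pd.to_timedelta(window_idx * window_days, unit="D")
df["window_end_dt"] = df["window_start_dt"] + pd.Timedelta(days=window_days - 1)
print(df)
```

Every step here operates on whole columns, so there is no per-row apply.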
