Python Pandas - Creating a new column using currency_converter

I have a dataframe (dfFF) like this:
  Sector   Country    Currency  Amount  Fund Start Year
0 Public   USA        USD       22000   2016
0 Private  Hong Kong  HKD       42000   2015
...
I want to create a new column that converts the currency/amount/fund start year into Euros and then to GBP (currency_converter only converts everything to Euros or back, hence why I am not converting straight to GBP). I want the currency rate for the year that the funding took place.
I am using the template code given by the website:
https://pypi.org/project/CurrencyConverter/
from currency_converter import CurrencyConverter
from datetime import date
c = CurrencyConverter()
c.convert(100, 'EUR', 'USD', date=date(2013, 3, 21))
I want to use the 1st of January for every year to make it consistent, so I have tried doing the following:
c = CurrencyConverter()
dfFF['Value'] = (c.convert(dfFF['Amount'],dfFF['Currency'],
'EUR',date=date(dfFF['Fund Start Year'],1,1)))
I am getting the error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Although I feel that my solution isn't the best way of doing this.
Any suggestions? Even if I can just get it into Euros, and then do the same to convert it to GBP, that would be great.
Thank you

You have to create a function that converts the currency using the data of a row:
def currency_convertor(row):
    amount = row['Amount']
    curr = row['Currency']
    date_r = row['Fund Start Year']
    new_curr = c.convert(amount, curr, 'EUR', date=date(date_r, 1, 1))
    return new_curr
and then apply it to the dataframe:
dfFF['EUR_new'] = dfFF.apply(currency_convertor, axis=1)  # make sure to set axis=1
General Version
def currency_convertor(row, new_currency='EUR'):
    amount = row['Amount']
    curr = row['Currency']
    date_r = row['Fund Start Year']
    new_curr = c.convert(amount, curr, new_currency, date=date(date_r, 1, 1))
    return new_curr

dfFF['new_currency'] = dfFF.apply(lambda x: currency_convertor(x, new_currency="GBP"), axis=1)
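For completeness, here is a minimal end-to-end sketch of the two-step conversion (source currency to EUR, then EUR to GBP) on a small stand-in dataframe. It assumes the column names shown in the question; the fallback options are documented by the library and help when 1 January falls on a day with no ECB rate, but check them against your installed version.
import pandas as pd
from datetime import date
from currency_converter import CurrencyConverter

# fallback options help when 1 January has no ECB rate (e.g. a holiday)
# or a currency has gaps in the data; verify against your installed version
c = CurrencyConverter(fallback_on_wrong_date=True, fallback_on_missing_rate=True)

dfFF = pd.DataFrame({
    'Sector': ['Public', 'Private'],
    'Country': ['USA', 'Hong Kong'],
    'Currency': ['USD', 'HKD'],
    'Amount': [22000, 42000],
    'Fund Start Year': [2016, 2015],
})

def to_gbp(row):
    d = date(row['Fund Start Year'], 1, 1)
    eur = c.convert(row['Amount'], row['Currency'], 'EUR', date=d)  # source -> EUR
    return c.convert(eur, 'EUR', 'GBP', date=d)                     # EUR -> GBP

dfFF['Value_GBP'] = dfFF.apply(to_gbp, axis=1)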


How to find the maximum date value with conditions in python?

I have a three-column dataframe as follows. I want to calculate the three-month return per day for every fund, so I need to get the date that has recorded NAV data from three months before. Should I use the max() function with the filter() function to deal with this problem? If so, how? If not, could you please help me figure out a better way to do this?
fund code  date        NAV
fund 1     2021-01-04  1.0000
fund 1     2021-01-05  1.0001
fund 1     2021-01-06  1.0023
...        ...         ...
fund 2     2020-02-08  1.0000
fund 2     2020-02-09  0.9998
fund 2     2020-02-10  1.0001
...        ...         ...
fund 3     2022-05-04  2.0021
fund 3     2022-05-05  2.0044
fund 3     2022-05-06  2.0305
I tried to combine the max() function with filter() as follows:
max(filter(lambda x: x<=df['date']-timedelta(days=91)))
But it didn't work.
Were this in Excel, I know I could use the following functions to solve this problem:
{max(if(B:B<=B2-91,B:B))}
{max(if(B:B<=B3-91,B:B))}
{max(if(B:B<=B4-91,B:B))}
....
But with python, I don't know what I could do. I just learnt it three days ago. Please help me.
This picture is what I want if it were in Excel. The yellow area is the original data. The white part is the procedure I need for the calculation and the red part is the result I want. To get this result, I need to divide the 3rd column by the 5th column.
I know that I could use the pct_change(periods=7) function to get the same results as in this picture. But here is the tricky part: the row 7 rows before is not necessarily the data from 7 days before, and not all the funds are recorded daily. Some funds are recorded weekly, some monthly. So I need to check whether the data used for the division exists first.
What you need is an implementation of the maximum in a sliding window (for your example, 1 week / 7 days).
I recreated your example as follows (to create the data frame you have):
import pandas as pd
import datetime
from random import randint

# build the rows first; DataFrame.append has been removed from recent pandas
date = datetime.datetime.strptime("2021-01-04", '%Y-%m-%d')
rows = []
for i in range(10):
    rows.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)})
for i in range(20, 25):
    rows.append({"fund code": 'fund 1', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)})
for i in range(20, 25):
    rows.append({"fund code": 'fund 2', "date": date + datetime.timedelta(i), "NAV": randint(0, 10)})
df = pd.DataFrame(rows, columns=["fund code", "date", "NAV"])
This will look like your example, with non-continuous dates and two different funds.
The maximum sliding window (for a variable window length in days) looks like this:
import datetime
from collections import deque

class max_queue:
    def __init__(self, win=7):
        self.win = win
        self.queue = deque()
        self.date = None

    def append(self, date, value):
        # drop smaller values from the back: they can never be the max again
        while self.queue and value > self.queue[-1][1]:
            self.queue.pop()
        # drop entries from the front that have fallen out of the window
        while self.queue and date - self.queue[0][0] >= datetime.timedelta(self.win):
            self.queue.popleft()
        self.queue.append((date, value))
        self.date = date

    def get_max(self):
        return self.queue[0][1]
Now you can simply iterate over the rows and get the max value in the time frame you are interested in:
mq = max_queue(7)
pre_code = ''
for idx, row in df.iterrows():
    code, date, nav, *_ = row
    if code != pre_code:
        mq = max_queue(7)
        pre_code = code
    mq.append(date, nav)
    df.at[idx, 'max'] = mq.get_max()
The result will look like this, with an added max column. This assumes that each fund's rows are stored contiguously; you could also modify it to keep a separate max_queue per fund.
Using a max queue to keep track of only the maximum in the window gives the correct O(n) complexity for a solution, which is important if you are dealing with huge datasets and especially with bigger date ranges (instead of a week).
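If you prefer to stay inside pandas, the same per-fund sliding-window maximum can be expressed with a time-based rolling window. This is a sketch rather than the answer's original code; it assumes the columns 'fund code', 'date' and 'NAV' as above and a 7-day window:
import pandas as pd

df['date'] = pd.to_datetime(df['date'])           # rolling time windows need a datetime column
df = df.sort_values(['fund code', 'date'])

# rolling maximum of NAV over the last 7 days, computed separately per fund
rolling_max = (
    df.groupby('fund code')
      .rolling('7D', on='date')['NAV']
      .max()
      .droplevel(0)        # drop the fund-code level so it aligns with df's index
)
df['max'] = rolling_max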

error "can only convert an array of size 1 to a Python scalar" when I try to get the index of a row of a pandas dataframe as an integer

I have an excel file with stock symbols and many other columns. I have a simplified version of the excel file below:
   Symbol  Industry
0  AAPL    Technology Manufacturing
1  MSFT    Technology Manufacturing
2  TSLA    Electric Car Manufacturing
Essentially, I am trying to get the Industry based on the Symbol.
For example, if I use 'AAPL' I want to get 'Technology Manufacturing'. Here is my code so far.
import pandas as pd
excel_file1 = 'file.xlsx'
df = pd.read_excel(excel_file1)
stock = 'AAPL'
row_index = df[df['Symbol'] == stock].index.item()
industry = df['Industry'][row_index]
print(industry)
After trying to get row_index, I get an error: "ValueError: can only convert an array of size 1 to a Python scalar".
Can someone solve this? Also, let's say row_index works: is this code (below) correct?
industry = df['Industry'][row_index]
Use:
stock = 'AAPL'
industry = df[df['Symbol'] == stock]['Industry'][0]
Or, if you want to search using the index, use df.loc:
stock = 'AAPL'
industry = df.loc[df[df['Symbol'] == stock].index, 'Industry'][0]
But the first one's much better.
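For context, .index.item() raises that ValueError whenever the filter matches zero rows or more than one row. A positionally safe sketch, assuming the same df and stock as above:
matches = df.loc[df['Symbol'] == stock, 'Industry']
if not matches.empty:
    industry = matches.iloc[0]  # first match by position, regardless of index labels
    print(industry)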

How to append two dataframe objects containing same column data but different column names?

I want to append an expense df to a revenue df but can't properly do so. Can anyone offer how I may do this?
import pandas as pd
import lxml
from lxml import html
import requests
import numpy as np
symbol = 'MFC'
url = 'https://www.marketwatch.com/investing/stock/'+ symbol +'/financials'
df=pd.read_html(url)
revenue = pd.concat(df[0:1]) # the revenue dataframe obj
revenue = revenue.dropna(axis='columns') # drop naN column
header = revenue.iloc[:0] # revenue df header row
expense = pd.concat(df[1:2]) # the expense dataframe obj
expense = expense.dropna(axis='columns') # drop naN column
statement = revenue.append(expense) #results in a dataframe with an added column (Unnamed:0)
The columns of revenue = pd.concat(df[0:1]) are:
'Fiscal year is January-December. All values CAD millions.', '2015', '2016', '2017', '2018', '2019'
The columns of expense = pd.concat(df[1:2]) are:
'Unnamed: 0', '2015', '2016', '2017', '2018', '2019'
How can I append the expense dataframe to the revenue dataframe so that I am left with a single dataframe object?
Thanks,
Rename columns.
df = df.rename(columns={'old_name': 'new_name',})
Then append with merge(), join(), or concat().
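A minimal sketch of the combine step with concat(), once both frames share the renamed column (column names taken from the follow-up below, so treat them as an assumption):
import pandas as pd

# after renaming the mismatched first columns to a common name, e.g. 'LineItem'
statement = pd.concat([revenue, expense], ignore_index=True)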
I managed to append the dataframes with the following code. Thanks David for putting me on the right track. I admit this is not the best way to do this because in a run time environment, I don't know the value of the text to rename and I've hard coded it here. Ideally, it would be best to reference a placeholder at df.iloc[:0,0] instead, but I'm having a tough time getting that to work.
df=pd.read_html(url)
revenue = pd.concat(df[0:1])
revenue = revenue.dropna(axis='columns')
revenue.rename({'Fiscal year is January-December. All values CAD millions.':'LineItem'},axis=1,inplace=True)
header = revenue.iloc[:0]
expense = pd.concat(df[1:2])
expense = expense.dropna(axis='columns')
expense.rename({'Unnamed: 0':'LineItem'}, axis=1, inplace=True)
statement = pd.concat([revenue, expense], ignore_index=True)  # DataFrame.append has been removed from recent pandas
Using the df=pd.read_html(url) construct, a list of several dataframes is returned when scraping the MarketWatch financials. The function below returns a single dataframe of all balance sheet elements. The same code applies to quarterly and annual income and cash flow statements.
def getBalanceSheet(url):
    df = pd.read_html(url)
    count = sum([1 for listitem in df if 'Unnamed: 0' in listitem])
    statement = pd.concat(df[0:1])
    statement = statement.dropna(axis='columns')
    if 'q' in url:  # quarterly
        statement.rename({'All values CAD millions.': 'LineItem'}, axis=1, inplace=True)
    else:
        statement.rename({'Fiscal year is January-December. All values CAD millions.': 'LineItem'}, axis=1, inplace=True)
    for rowidx in range(count):
        df_name = pd.concat(df[rowidx + 1:rowidx + 2])
        df_name = df_name.dropna(axis='columns')
        df_name.rename({'Unnamed: 0': 'LineItem'}, axis=1, inplace=True)
        statement = pd.concat([statement, df_name], ignore_index=True)
    return statement
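A hypothetical usage, assuming the MarketWatch balance-sheet page follows the same URL pattern as the financials page above (the exact path may differ):
symbol = 'MFC'
url = 'https://www.marketwatch.com/investing/stock/' + symbol + '/financials/balance-sheet'
balance_sheet = getBalanceSheet(url)
print(balance_sheet.head())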

Faster loop in Pandas looking for ID and older date

So, I have a DataFrame that represents purchases with 4 columns:
date (date of purchase in format %Y-%m-%d)
customer_ID (string column)
claim (1-0 column that means 1-the customer complained about the purchase, 0-customer didn't complain)
claim_value (for claim = 1 it means how much the claim cost to the company, for claim = 0 it's NaN)
I need to build 3 new columns:
past_purchases (how many purchases the specific customer had before this purchase)
past_claims (how many claims the specific customer had before this purchase)
past_claims_value (how much did the customer's past claims cost)
This has been my approach until now:
past_purchases = []
past_claims = []
past_claims_value = []
for i in range(0, len(df)):
    date = df['date'][i]
    customer_ID = df['customer_ID'][i]
    df_temp = df[(df['date'] < date) & (df['customer_ID'] == customer_ID)]
    past_purchases.append(len(df_temp))
    past_claims.append(df_temp['claim'].sum())
    past_claims_value.append(df_temp['claim_value'].sum())
df['past_purchases'] = pd.DataFrame(past_purchases)
df['past_claims'] = pd.DataFrame(past_claims)
df['past_claims_value'] = pd.DataFrame(past_claims_value)
The code works fine, but it's too slow. Can anyone make it work faster? Thanks!
PS: It's important to check that the date is older; if the customer had 2 purchases on the same date they shouldn't count towards each other.
PPS: I'm willing to use libraries for parallel processing like multiprocessing, concurrent.futures, joblib or dask, but have never used them in a similar way before.
Expected outcome:
Maybe you can try using a cumsum over customers, if the dates are sorted in ascending order:
df.sort_values('date', inplace=True)

# shift the claim columns within each customer so the current purchase is excluded
new_temp_columns = ['claim_s', 'claim_value_s']
shifted = df.groupby('customer_ID')[['claim', 'claim_value']].shift().fillna(0)
df['claim_s'] = shifted['claim']
df['claim_value_s'] = shifted['claim_value']

# running totals of past claims and their value, per customer
df['past_claims'] = df.groupby('customer_ID')['claim_s'].cumsum()
df['past_claims_value'] = df.groupby('customer_ID')['claim_value_s'].cumsum()

# purchases on the same date must not count each other: take the min within each (customer, date) group
dfc = df.groupby(['customer_ID', 'date'])[['past_claims', 'past_claims_value']]
df[['past_claims', 'past_claims_value']] = dfc.transform('min')

# remove the temp columns
df = df.loc[:, ~df.columns.isin(new_temp_columns)]
Again, this will only work if the dates are sorted.
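A tiny hypothetical example to check the same-date rule (values invented for illustration): the two purchases on 2021-01-05 should each see only the one earlier claim worth 100, and the 2021-02-01 purchase should see two past claims worth 150.
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01', '2021-01-05', '2021-01-05', '2021-02-01']),
    'customer_ID': ['A', 'A', 'A', 'A'],
    'claim': [1, 0, 1, 0],
    'claim_value': [100.0, None, 50.0, None],
})
# running the snippet above on this frame should give, row by row:
# past_claims       = [0, 1, 1, 2]
# past_claims_value = [0, 100, 100, 150]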

Calculating customer lifetime using Pandas

I'm performing a Cohort analysis using python, and I am having trouble creating a new column that sums up the total months a user has stayed with us.
I know the math behind the answer, all I have to do is:
subtract the year when they canceled our service from when they started it
Multiply that by 12.
Subtract the month when they canceled our service from when they started it.
Add those two numbers together.
So in Excel, it looks like this:
=(YEAR(C2)-YEAR(B2))*12+(MONTH(C2)-MONTH(B2))
C is the date when the customer canceled, and B is when they started.
The problem is that I am very new to Python and Pandas, and I am having trouble translating that function into Python.
What I have tried so far:
df['Lifetime'] = df.Plan_Cancel_Date('%Y') - df.Plan_Start_Date('%Y')*12 +
df.Plan_Cancel_Date('%m') - df.Plan_Start_Date('%m')
df.head()
It returns with an error 'Series' is not callable, and I have a general understanding of what that means.
I then tried:
def LTVCalc(Plan_Start_Date, Plan_Cancel_Date):
    df['Lifetime'] = df.Plan_Cancel_Date('%Y') - df.Plan_Start_Date('%Y')*12 + \
        df.Plan_Cancel_Date('%m') - df.Plan_Start_Date('%m')
    df.head()
But that didn't add the Column 'Lifetime' to the DataFrame.
Anyone able to help a rookie?
I think you need to first convert with to_datetime and then use dt.year and dt.month:
df = pd.DataFrame({
    'Plan_Cancel_Date': ['2018-07-07', '2019-03-05', '2020-10-08'],
    'Plan_Start_Date': ['2016-02-07', '2017-01-05', '2017-08-08']
})
#print (df)
#if necessary convert to datetimes
df.Plan_Start_Date = pd.to_datetime(df.Plan_Start_Date)
df.Plan_Cancel_Date = pd.to_datetime(df.Plan_Cancel_Date)
df['Lifetime'] = ((df.Plan_Cancel_Date.dt.year - df.Plan_Start_Date.dt.year)*12 +
                  df.Plan_Cancel_Date.dt.month - df.Plan_Start_Date.dt.month)
print (df)
Plan_Cancel_Date Plan_Start_Date Lifetime
0 2018-07-07 2016-02-07 29
1 2019-03-05 2017-01-05 26
2 2020-10-08 2017-08-08 38
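An equivalent sketch using monthly periods; in recent pandas versions subtracting two monthly periods gives a month offset, whose .n attribute is the integer number of months:
# difference in whole months via monthly periods; same Lifetime values as above
df['Lifetime'] = (
    df.Plan_Cancel_Date.dt.to_period('M') - df.Plan_Start_Date.dt.to_period('M')
).apply(lambda offset: offset.n)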
