I need some help thinking through this:
I have a dataset with 61k service records. Each service is renewed on a specific date, and each service has a cost that is billed in one of 10 different currencies.
For each service record, I need to convert the service cost to CAD using the exchange rate for the date the service was renewed.
When I do this on a small sample dataset with 6 services it takes 3 seconds, which implies that the 61k-record dataset could take over 8 hours. That is way too long (I think I could do it faster in Excel or Google Sheets, which I don't want to do).
Is there a better way or approach to do this with pandas/Python in Google Colab so it doesn't take that long?
Thank you in advance.
# setup
import pandas as pd
!pip install forex-python
from forex_python.converter import CurrencyRates
#sample dataset/df
dummy_data = {
    'siteid': ['11', '12', '13', '41', '42', '51'],
    'userid': [0, 0, 0, 0, 0, 0],
    'domain': ['A', 'B', 'C', 'E', 'F', 'G'],
    'currency': ['MXN', 'CAD', 'USD', 'USD', 'AUD', 'HKD'],
    'servicecost': [2.5, 3.3, 1.3, 2.5, 2.5, 2.3],
    'date': ['2022-02-04', '2022-03-05', '2022-01-03', '2021-04-06', '2022-12-05', '2022-11-01']
}
df = pd.DataFrame(dummy_data, columns=['siteid', 'userid', 'domain', 'currency', 'servicecost', 'date'])
#ensure date is in the proper datatype
df['date'] = pd.to_datetime(df['date'],errors='coerce')
#go through df, get the data to do the conversion and populate a new series
def convertServiceCostToCAD(currency, servicecost, date):
    return CurrencyRates().convert(currency, 'CAD', servicecost, date)

df['excrate'] = list(map(convertServiceCostToCAD, df['currency'], df['servicecost'], df['date']))
So, if I understand this correctly, what this package does is provide a daily fixed rate between two currencies (so one direction is the inverse of the other).
And what makes things so slow is very clearly the calls to the package's methods. For me it's around ~4 seconds per call.
And you are always interested in the rate between currency x and CAD.
The package has a method .get_rates() which seems to provide the same information used by the .convert() method, but for one currency and all others.
So what you can do is:
Collect all unique dates in the DataFrame
Call .get_rates() for each of those dates and save the result
Use the results plus your amounts to calculate the required column
E.g. as follows:
import pandas as pd
from forex_python.converter import CurrencyRates
from tqdm import tqdm # use 'pip install tqdm' before
df = pd.DataFrame({
    'siteid': ['11', '12', '13', '41', '42', '51'],
    'userid': [0, 0, 0, 0, 0, 0],
    'domain': ['A', 'B', 'C', 'E', 'F', 'G'],
    'currency': ['MXN', 'CAD', 'USD', 'USD', 'AUD', 'HKD'],
    'servicecost': [2.5, 3.3, 1.3, 2.5, 2.5, 2.3],
    'date': ['2022-02-04', '2022-03-05', '2022-01-03', '2021-04-06', '2022-12-05', '2022-11-01']
})
# get rates for all unique dates, added tqdm progress bar to see progress
rates_dict = {date: CurrencyRates().get_rates('CAD', date_obj=pd.to_datetime(date, errors='coerce'))
for date in tqdm(df['date'].unique())}
# now use these rates: cost * 1/(CAD -> currency_x rate), except when the currency is CAD
# or servicecost is 0; in those cases just use servicecost
df['excrate'] = df.apply(lambda row: 1.0 / rates_dict[row['date']][row['currency']] * row['servicecost']
                         if row['currency'] != 'CAD' and row['servicecost'] != 0
                         else row['servicecost'], axis=1)
print(df)
  siteid  userid domain currency  servicecost        date   excrate
0     11       0      A      MXN          2.5  2022-02-04  0.154553
1     12       0      B      CAD          3.3  2022-03-05  3.300000
2     13       0      C      USD          1.3  2022-01-03  1.670334
3     41       0      E      USD          2.5  2021-04-06  3.140874
4     42       0      F      AUD          2.5  2022-12-05  2.219252
5     51       0      G      HKD          2.3  2022-11-01  0.380628
How much this speeds things up depends heavily on how many different dates there are in your data. But since you said the original DataFrame has 61k rows, I assume many dates occur multiple times. This code should take roughly ~4 seconds * the number of unique dates in your DataFrame to run.
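If the per-row apply itself ever becomes noticeable on the full 61k rows, a possible vectorized variant (just a sketch, building on the rates_dict above) is to turn the rates into a long DataFrame and merge:

import pandas as pd

# sketch: turn rates_dict into a long table (date, currency, CAD->currency rate) and merge,
# instead of calling a Python lambda per row
rates_long = (pd.DataFrame(rates_dict).T              # one row per date, one column per currency
                .rename_axis('date').reset_index()
                .melt(id_vars='date', var_name='currency', value_name='cad_to_cur'))
df = df.merge(rates_long, on=['date', 'currency'], how='left')
# CAD rows (and any currency missing from the rates) get NaN -> keep the original amount
df['excrate'] = (df['servicecost'] / df['cad_to_cur']).fillna(df['servicecost'])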
I've been googling all day and can't seem to find a solution to my problem, so I'll try my luck here. Here's an example of my event data (I obviously have more columns, but they don't really matter here):
data = {'id': ['a', 'a', 'a', 'b', 'b', 'b', 'c'],
        'timestamp': ['2022-02-01 04:05:53', '2022-02-01 04:05:54', '2022-02-01 04:05:55',
                      '2022-02-01 04:05:56', '2022-02-01 04:06:22', '2022-02-01 04:06:24',
                      '2022-02-01 04:05:55']}
df2 = pd.DataFrame(data)
id timestamp
0 a 2022-02-01 04:05:53
1 a 2022-02-01 04:05:54
2 a 2022-02-01 04:05:55
3 b 2022-02-01 04:05:56
4 b 2022-02-01 04:06:22
5 b 2022-02-01 04:06:24
6 c 2022-02-01 04:05:55
I want to get a DataFrame where, for each id, I keep only consecutive sequences of timestamps (at least 7200 events long) in which the time difference between two consecutive data points is never longer than 15 seconds. In other words, I want to keep only sufficiently long gap-free stretches of data. So the data for vehicle b won't be there anymore, because it has gaps in its timestamps that are longer than 15 seconds.
I tried different variations of cumsum/groupby/diff, but I can't find a way to satisfy all the conditions at once. I also tried the following, but it was the wrong approach: by deleting the rows that follow a time gap I was just creating even bigger gaps:
delta = pd.Timedelta(seconds=15)
df2['timestamp'] = pd.to_datetime(df2['timestamp'])
# 'id' in the toy example corresponds to 'transmittingVehicle' in the real data
df2['time_diff'] = df2.groupby('id')['timestamp'].diff().gt(delta)
df_out = df2.loc[df2.time_diff != True]
Currently, I'm trying to use the boolean value from time_diff to iterate through the dataset, but that would be a very slow & lame solution, so I would like to avoid that.
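For what it's worth, the cumsum/groupby/diff idea can be sketched as follows (this is not from the original thread; it uses the toy df2 above with a small min_len so the example produces output, where the real data would use 7200):

import pandas as pd

df2['timestamp'] = pd.to_datetime(df2['timestamp'])
delta = pd.Timedelta(seconds=15)

# start a new run whenever the gap within an id exceeds 15 seconds
new_run = df2.groupby('id')['timestamp'].diff().gt(delta)
df2['run'] = new_run.groupby(df2['id']).cumsum()

# keep only the (id, run) groups that are long enough
min_len = 3  # would be 7200 on the real data
df_out = df2.groupby(['id', 'run']).filter(lambda g: len(g) >= min_len)
print(df_out)  # only the three gap-free rows for id 'a' survive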
I want to create new columns based on the time column.
I am trying to append the last 5 values (vol, open, close, high, low) of each subsequent row onto row zero, using the time as part of the header for each new column.
index ticker date time vol open close high low
0 AAPL 2022-01-06 09:00 121611 174.78 174.00 175.08 173.76
1 AAPL 2022-01-06 10:00 83471 174.11 173.89 174.64 173.88
2 AAPL 2022-01-06 11:00 76327 173.99 173.55 174.25 173.16
3 AAPL 2022-01-06 12:00 83471 174.11 173.89 174.64 173.88
Ultimately I want it to look like this:
Ticker date time vol9am open9am close9am high9am low9am vol10am open10am close10am high10am low10am
AAPL 2022-01-06 09:00 121611 174.78 174.00 175.08 173.76 83471 174.11 173.89 174.64 173.88
Any suggestions?
Pandas' unstack function will achieve this. It will produce a MultiIndex for the column names; you can use to_flat_index if you only want a single-level index:
df = pd.DataFrame(columns=['index', 'ticker', 'date', 'time', 'vol', 'open', 'close', 'high', 'low'],
                  data=[
                      [0, 'AAPL', '2022-01-06', '09:00', 121611, 174.78, 174.00, 175.08, 173.76],
                      [1, 'AAPL', '2022-01-06', '10:00', 83471, 174.11, 173.89, 174.64, 173.88],
                      [2, 'AAPL', '2022-01-06', '11:00', 76327, 173.99, 173.55, 174.25, 173.16],
                      [3, 'AAPL', '2022-01-06', '12:00', 83471, 174.11, 173.89, 174.64, 173.88],
                  ])
df.set_index(['ticker','date','time'])[['open','close']].unstack()
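If you want flat column names instead of the MultiIndex, a possible sketch using to_flat_index is below (it produces names like 'open_09:00' rather than the exact 'open9am' format in the question, so the join is an assumption about the desired naming):

# flatten the MultiIndex columns into single strings like 'vol_09:00', 'open_10:00', ...
wide = df.set_index(['ticker', 'date', 'time'])[['vol', 'open', 'close', 'high', 'low']].unstack()
wide.columns = ['_'.join(col) for col in wide.columns.to_flat_index()]
print(wide.reset_index())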
Ideally if you can post your questions with some minimum working code it makes it much easier to replicate :)
I'm trying to figure out an efficient way to handle the following on a big dataset: the data contains multiple rows per day, with codes (strings) and ratings as columns. I want to create a new dataset with one column for each of the strings in this list, strings = ['239', '345', '346'], where each column contains the mean rating on each day, so that I get a time series of means for the specified codes.
This would be a simple example dataset:
df1 = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', '2021-01-02', '2021-01-02', '2021-01-03'],
    'Code': ['P:346 K,329 28', 'N2:345 P239', 'P:346 K2', 'E32 345', 'Q2_325', 'P;235 K345', '2W345', 'Pq-245 3460239'],
    'Ratings': [9.0, 8.0, 5.0, 3.0, 2, 3, 6, 5]})
I'm trying to achieve something like the table below, but so far I haven't been able to do it efficiently.
strings = ['239', '345', '346']
df2 = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-02', '2021-01-03'],
    '239': [8.5, 'NA', '5'],
    '345': [8, 4, 'NA'],
    '346': [7, 'NA', 5]})
Thank you very much for your help:)
IIUC you can extract the strings in the code column and then pivot:
print(df1.assign(Code=df1["Code"].str.extractall(f"({'|'.join(strings)})")  # find every target code in each row
                                 .groupby(level=0).agg(tuple))              # collect the matches per original row
         .explode("Code")                                                   # one row per matched code
         .pivot_table(index="Date", columns="Code", values="Ratings", aggfunc="mean"))
Code 239 345 346
Date
2021-01-01 8.5 8.0 7.0
2021-01-02 NaN 4.0 NaN
2021-01-03 5.0 NaN 5.0
This question is an extension of a question I posted here a while ago. I'm trying to understand the accepted answer provided by @patrickjlong1 (thanks again), so I'm running the code step by step and checking the result.
I found this part hard to fathom.
>>> df_initial
data seriesID
0 {'year': '2017', 'period': 'M12', 'periodName'... SMS42000000000000001
1 {'year': '2017', 'period': 'M11', 'periodName'... SMS42000000000000001
2 {'year': '2017', 'period': 'M10', 'periodName'... SMS42000000000000001
3 {'year': '2017', 'period': 'M09', 'periodName'... SMS42000000000000001
4 {'year': '2017', 'period': 'M08', 'periodName'... SMS42000000000000001
5 {'year': '2017', 'period': 'M07', 'periodName'... SMS42000000000000001
The element in each row of the first column is a dictionary, and they all have common keys: 'year', 'period', etc. What I want to convert it to is:
footnotes period periodName value year
0 {} M12 December 6418025 2017
0 {} M11 November 6418195 2017
0 {} M10 October 6418284 2017
...
The solution provided by @patrickjlong1 is to convert one row at a time and then append them all, which I understand, since one dictionary can be converted to one DataFrame:
for i in range(0, len(df_initial)):
    df_row = pd.DataFrame(df_initial['data'][i])
    df_row['seriesID'] = series_col
    df = df.append(df_row, ignore_index=True)
My question is: is this the only way to convert the data to the format I want? If not, what are the other methods?
Thanks
Avoid pd.DataFrame.append in a loop
I can't stress this enough. The pd.DataFrame.append method is expensive as it copies data unnecessarily. Putting this in a loop makes it n times more expensive.
Instead, you can feed a list of dictionaries to the pd.DataFrame constructor:
df = pd.DataFrame(df_initial['data'].tolist())
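Putting it together as a self-contained sketch (the two sample rows and the empty footnotes value below are assumptions reconstructed from the frames shown above, not the real API response):

import pandas as pd

# assumed reconstruction of df_initial, based on the frames shown in the question
df_initial = pd.DataFrame({
    'data': [
        {'year': '2017', 'period': 'M12', 'periodName': 'December', 'value': '6418025', 'footnotes': {}},
        {'year': '2017', 'period': 'M11', 'periodName': 'November', 'value': '6418195', 'footnotes': {}},
    ],
    'seriesID': ['SMS42000000000000001'] * 2,
})

# expand all the dictionaries in one constructor call, then re-attach the seriesID column
df = pd.DataFrame(df_initial['data'].tolist())
df['seriesID'] = df_initial['seriesID'].values
print(df)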
In the case of weather or stock market data, temperatures and stock prices are both measured at multiple stations or stock tickers for any given date.
Therefore what is the most effective way to set an index which contains two fields?
For weather: the weather_station and then Date
For Stock Data: the stock_code and then Date
Setting the index in this way would allow filtering such as:
stock_df["code"]["start_date":"end_date"]
weather_df["station"]["start_date":"end_date"]
As mentioned by Anton, you need to use a MultiIndex, as follows:
stock_df.index = pd.MultiIndex.from_arrays(stock_df[['code', 'date']].values.T, names=['idx1', 'idx2'])
weather_df.index = pd.MultiIndex.from_arrays(weather_df[['station', 'date']].values.T, names=['idx1', 'idx2'])
That functionality currently exists. Please refer to the documentation for more examples.
stock_df = pd.DataFrame({'symbol': ['AAPL', 'AAPL', 'F', 'F', 'F'],
                         'date': ['2016-1-1', '2016-1-2', '2016-1-1', '2016-1-2', '2016-1-3'],
                         'price': [100., 101, 50, 47.5, 49]}).set_index(['symbol', 'date'])
>>> stock_df
price
symbol date
AAPL 2016-1-1 100.0
2016-1-2 101.0
F 2016-1-1 50.0
2016-1-2 47.5
2016-1-3 49.0
>>> stock_df.loc['AAPL']
price
date
2016-1-1 100
2016-1-2 101
>>> stock_df.loc['AAPL', '2016-1-2']
price 101
Name: (AAPL, 2016-1-2), dtype: float64
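For the start_date:end_date style filtering asked about in the question, one possible sketch is to sort the index and slice the date level with pd.IndexSlice (the slice below compares the dates as strings, since that is how they are stored in this example):

stock_df = stock_df.sort_index()  # label slicing on a MultiIndex needs a sorted index
idx = pd.IndexSlice
print(stock_df.loc[idx['AAPL', '2016-1-1':'2016-1-2'], :])

Converting the date level to real datetimes (e.g. with pd.to_datetime before set_index) would make the range slicing more robust than plain string comparison.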