I'm trying to figure out an efficient way to handle the following on a big dataset: the data contains multiple rows per day, with a code (string) column and a ratings column. I want to create a new dataset with one column for each string in this list: strings = ['239', '345', '346']. The new dataset should contain the mean rating on each day, so that I get a time series of daily means for the specified codes.
This would be a simple example dataset:
df1 = pd.DataFrame({
'Date':['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', '2021-01-02', '2021-01-02', '2021-01-03'],
'Code':['P:346 K,329 28', 'N2:345 P239', 'P:346 K2', 'E32 345', 'Q2_325', 'P;235 K345', '2W345', 'Pq-245 3460239'],
'Ratings':[9.0, 8.0, 5.0, 3.0, 2, 3, 6, 5]})
I'm trying to achieve something like the table below, but so far I haven't been able to do it efficiently.
strings = ['239', '345', '346']
df2 = pd.DataFrame({
'Date':['2021-01-01', '2021-01-02', '2021-01-03'],
'239':[8.5, None, 5.0],
'345':[8.0, 4.0, None],
'346':[7.0, None, 5.0]})
Thank you very much for your help:)
IIUC you can extract the strings in the Code column and then pivot:
print(df1.assign(Code=df1["Code"].str.extractall(f"({'|'.join(strings)})")
                                 .groupby(level=0).agg(tuple))
         .explode("Code")
         .pivot_table(index="Date", columns="Code", values="Ratings", aggfunc="mean"))
Code 239 345 346
Date
2021-01-01 8.5 8.0 7.0
2021-01-02 NaN 4.0 NaN
2021-01-03 5.0 NaN 5.0
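For reference, a near-equivalent sketch with str.findall, which skips the groupby/tuple step (same df1 and strings as above):
# a minimal alternative sketch: findall returns all matches per row as a list
pattern = "|".join(strings)  # '239|345|346'
out = (df1.assign(Code=df1["Code"].str.findall(pattern))
          .explode("Code")
          .pivot_table(index="Date", columns="Code", values="Ratings", aggfunc="mean"))
print(out)  # same table as above; rows with no match become NaN and are dropped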
So I have a dataframe structured like:
Date Metric Value
2020-01-01 Low 34.5
2020-01-01 High 36.5
2020-01-01 Open 23.5
2020-01-02 Low 32.5
...
I am trying to create another frame where, for every date, there is a new 'Volume' column equal to High minus Low for that specific date. The frame is not keyed on the dates, so presumably it needs to be reshaped or joined and the values in the different columns combined? I'm not sure exactly how to do this. I'm trying to get the final result to look like this:
Date Volume
2020-01-01 2.00
2020-01-02 6.45
One approach could be as follows:
First, select only from df the rows which have High and Low in column Metric using Series.isin.
Next, use df.pivot to reshape the df and assign a new column Volume, containing the result of values in column Low subtracted from those in column High (see: Series.sub).
Finally, we make some cosmetic changes: we drop columns High and Low, reset the index (see: df.reset_index), and get rid of df.columns.name (which is automatically set to Metric during df.pivot).
import pandas as pd
data = {'Date': {0: '2020-01-01', 1: '2020-01-01', 2: '2020-01-01',
3: '2020-01-02', 4: '2020-01-02', 5: '2020-01-02'},
'Metric': {0: 'Low', 1: 'High', 2: 'Open', 3: 'Low', 4: 'High',
5: 'Open'},
'Value': {0: 34.5, 1: 36.5, 2: 23.5, 3: 32.5, 4: 38.95, 5: 32.5}}
df = pd.DataFrame(data)
res = df[df.Metric.isin(['Low', 'High'])].pivot(
    index='Date', columns='Metric', values='Value')
res = (res.assign(Volume=res['High'].sub(res.Low))
          .drop(['High', 'Low'], axis=1)
          .reset_index(drop=False))
res.columns.name = None
print(res)
Date Volume
0 2020-01-01 2.00
1 2020-01-02 6.45
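For comparison, a minimal alternative sketch (same df as above) that avoids the pivot entirely: slice the High and Low rows with Series.xs and let the subtraction align on Date.
# index by (Date, Metric), then pull out the High and Low slices;
# the subtraction aligns automatically on the Date level
s = df.set_index(['Date', 'Metric'])['Value']
res2 = (s.xs('High', level='Metric') - s.xs('Low', level='Metric'))
res2 = res2.rename('Volume').reset_index()
print(res2)  # same Date/Volume table as above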
You can create two DataFrames by filtering on Low and High and join them by date. Finally, subtract the Low column from the High column.
import pandas as pd

data = [
("2020-01-01","Low",34.5),
("2020-01-01","High",36.5),
("2020-01-01","Open",23.5),
("2020-01-02","Low",32.5),
("2020-01-02","High",38.95),
]
columns = ["Date", "Metric", "Value"]
df = pd.DataFrame(data=data, columns=columns)
df_low = df[df["Metric"]=="Low"].rename(columns={"Value": "Low"}).drop("Metric", axis=1)
df_high = df[df["Metric"]=="High"].rename(columns={"Value": "High"}).drop("Metric", axis=1)
df2 = df_low.merge(df_high, on="Date", how="inner")
df2["Volume"] = df2["High"] - df2["Low"]
[Out]:
Date Low High Volume
0 2020-01-01 34.5 36.50 2.00
1 2020-01-02 32.5 38.95 6.45
I need some help thinking through this:
I have a dataset with 61k records of services. Each service gets renewed on a specific date; each service also has a cost, and that cost is billed in one of 10 different currencies.
What I need to do for each service record is convert the service cost to CAD for the date the service was renewed.
When I do this on a small sample dataset with 6 services it takes 3 seconds, which implies that the 61k record dataset might take over 8 hours, which is way too long (I think I could do it faster in Excel or Google Sheets, which I don't want to do).
Is there a better way or approach to do this with pandas/Python in Google Colab so it doesn't take that long?
Thank you in advance.
# setup
import pandas as pd
!pip install forex-python
from forex_python.converter import CurrencyRates
#sample dataset/df
dummy_data = {
'siteid': ['11', '12', '13', '41', '42','51'],
'userid': [0,0,0,0,0,0],
'domain': ['A', 'B', 'C', 'E', 'F', 'G'],
'currency':['MXN', 'CAD', 'USD', 'USD', 'AUD', 'HKD'],
'servicecost': [2.5, 3.3, 1.3, 2.5, 2.5, 2.3],
'date': ['2022-02-04', '2022-03-05', '2022-01-03', '2021-04-06', '2022-12-05', '2022-11-01']
}
df = pd.DataFrame(dummy_data, columns = ['siteid', 'userid', 'domain','currency','servicecost','date'])
#ensure date is in the proper datatype
df['date'] = pd.to_datetime(df['date'],errors='coerce')
# go through df, get the data to do the conversion, and populate a new series
def convertServiceCostToCAD(currency, servicecost, date):
    return CurrencyRates().convert(currency, 'CAD', servicecost, date)
df['excrate']=list(map(convertServiceCostToCAD, df['currency'], df['servicecost'], df['date']))
So, if I understand this correctly, what this package does is provide a daily fixed rate between two currencies (so one direction is the inverse of the other).
What makes things so slow is very clearly the calls to the package's methods: for me, around ~4 seconds per call.
And you are always interested in the rate between currency x and CAD.
The package has a method .get_rates() which seems to provide the same information used by the .convert() method, but for one currency and all others.
So what you can do is:
Collect all unique dates in the DataFrame
Call .get_rates() for each of those dates and save the result
Use the results plus your amounts to calculate the required column
E.g. as follows:
import pandas as pd
from forex_python.converter import CurrencyRates
from tqdm import tqdm  # install with 'pip install tqdm' first
df = pd.DataFrame({
'siteid': ['11', '12', '13', '41', '42', '51'],
'userid': [0, 0, 0, 0, 0, 0],
'domain': ['A', 'B', 'C', 'E', 'F', 'G'],
'currency': ['MXN', 'CAD', 'USD', 'USD', 'AUD', 'HKD'],
'servicecost': [2.5, 3.3, 1.3, 2.5, 2.5, 2.3],
'date': ['2022-02-04', '2022-03-05', '2022-01-03', '2021-04-06', '2022-12-05', '2022-11-01']
})
# get rates for all unique dates, added tqdm progress bar to see progress
rates_dict = {date: CurrencyRates().get_rates('CAD', date_obj=pd.to_datetime(date, errors='coerce'))
for date in tqdm(df['date'].unique())}
# now use these rates to compute the CAD value: multiply by 1/(CAD -> currency_x rate),
# except when the currency is CAD or servicecost is 0; in those cases just use servicecost
df['excrate'] = df.apply(
    lambda row: 1.0 / rates_dict[row['date']][row['currency']] * row['servicecost']
    if row['currency'] != 'CAD' and row['servicecost'] != 0
    else row['servicecost'],
    axis=1)
print(df)
siteid userid domain currency servicecost date excrate
0 11 0 A MXN 2.5 2022-02-04 0.154553
1 12 0 B CAD 3.3 2022-03-05 3.300000
2 13 0 C USD 1.3 2022-01-03 1.670334
3 41 0 E USD 2.5 2021-04-06 3.140874
4 42 0 F AUD 2.5 2022-12-05 2.219252
5 51 0 G HKD 2.3 2022-11-01 0.380628
How much this speeds things up depends drastically on how many different dates there are in your data. But since you said the original DataFrame has 61k rows, I assume many dates occur multiple times. This code should take roughly ~4 seconds × the number of unique dates in your DataFrame to run.
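If the remaining row-wise apply itself ever becomes a bottleneck on 61k rows, the same lookup can be vectorized; a sketch, assuming the df and rates_dict built above:
# flatten rates_dict into a lookup table and merge instead of applying per row
rates_df = pd.DataFrame(
    [(date, cur, rate)
     for date, rates in rates_dict.items()
     for cur, rate in rates.items()],
    columns=['date', 'currency', 'rate'])
out = df.merge(rates_df, on=['date', 'currency'], how='left')
out['rate'] = out['rate'].fillna(1.0)  # CAD itself is not in get_rates('CAD', ...)
out['excrate'] = out['servicecost'] / out['rate']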
I have two dataframes (simplified examples below). One contains a series of dates and values (df1), the second contains a date range (df2). I would like to identify/select/mask the date range from df2 in df1, sum the associated df1 values and add them to a new column in df2.
I'm a novice, and all the techniques I have tried have been unsuccessful: a combination of the wrong method, incompatible methods combined together, syntax errors, and so on. I have searched the Q&As here, but none have quite addressed this issue.
import pandas as pd
#********** df1: dates and values ***********
rng = pd.date_range('2012-02-24', periods=12, freq='D')
df1 = pd.DataFrame({ 'STATCON': ['C00028', 'C00489', 'C00038', 'C00589', 'C10028', 'C00499', 'C00238', 'C00729',
'C10044', 'C00299', 'C00288', 'C00771'],
'Date': rng,
'Val': [0.96, 0.57, 0.39, 0.17, 0.93, 0.86, 0.54, 0.58, 0.43, 0.19, 0.40, 0.32]
})
#********** df2: date range ***********
df2 = pd.DataFrame({
'BCON': ['B002', 'B004', 'B005'],
'Start': ['2012-02-25', '2012-02-28', '2012-03-01'],
'End': ['2012-02-29', '2012-03-04', '2012-03-06']
})
df2[['Start','End']] = df2[['Start','End']].apply(pd.to_datetime)
#********** Desired Output: df2 -- date range with summed values ***********
df3 = pd.DataFrame({
'BCON': ['B002', 'B004', 'B005'],
'Start': ['2012-02-25', '2012-02-28', '2012-03-01'],
'End': ['2012-02-29', '2012-03-04', '2012-03-06'],
'Sum_Val': [2.92, 3.53, 2.46]
})
You can solve this with the DataFrame.apply function as follows:
def to_value(row):
    return df1[(row['Start'] <= df1['Date']) & (df1['Date'] <= row['End'])]['Val'].sum()
df3 = df2.copy()
df3['Sum_Val'] = df3.apply(to_value, axis=1)
The to_value function is called on every row of the df3 dataframe.
See here for a live implementation of the solution: https://1000words-hq.com/n/TcYN1Fz6Izp
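Running the code above on the example frames reproduces the desired Sum_Val column:
print(df3)
#    BCON      Start        End  Sum_Val
# 0  B002 2012-02-25 2012-02-29     2.92
# 1  B004 2012-02-28 2012-03-04     3.53
# 2  B005 2012-03-01 2012-03-06     2.46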
One option is with conditional_join from pyjanitor - it avoids building a full cartesian join before filtering (which can be memory consuming, depending on the data size):
# pip install pyjanitor
import pandas as pd
import numpy as np
df2 = df2.astype({'Start': 'datetime64[ns]', 'End': 'datetime64[ns]'})
(df1
.conditional_join(
df2,
('Date', 'Start', '>='),
('Date', 'End', '<='))
.loc[:, ['BCON', 'Start', 'End', 'Val']]
.groupby(['BCON', 'Start', 'End'], as_index=False)
.agg(sum_val=('Val', 'sum'))
)
BCON Start End sum_val
0 B002 2012-02-25 2012-02-29 2.92
1 B004 2012-02-28 2012-03-04 3.53
2 B005 2012-03-01 2012-03-06 2.46
This is an example dataset:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': ['11-9-2019', '11-9-2020', '11-8-2019', '15-5-2020'],
'name': ['Allen', 'Allen', 'David', 'David'],
'Grade': [50, np.nan, 60, np.nan],
'code': [3352326, np.nan, 22233467, np.nan]})
df['Date'] = pd.to_datetime(df['Date'])
df
What I want to do is filter for the latest date per column 'name' (e.g. Allen). However, all the info (e.g. Grade, code) is in the row with the oldest date, and the row with the latest date has only missing data. I want the result to show only the rows with the latest date, but with the info moved over from the rows with the oldest date, like the result below.
df1 = pd.DataFrame({'Date': ['11-9-2020', '15-5-2020'],
'name': ['Allen', 'David'],
'Grade': [50, 60],
'code': [3352326, 22233467]})
df1['Date'] = pd.to_datetime(df1['Date'])
df1
I'm not sure if it's possible, and I couldn't find related results.
Thanks in advance!
If the actual values are always the first row of the group, you could try:
>>> df.groupby('name').first().reset_index()
name Date Grade code
0 Allen 2019-11-09 50.0 3352326.0
1 David 2019-11-08 60.0 22233467.0
>>>
If they are not always the first one, you could try:
>>> df.groupby('name').apply(lambda x: x.apply(lambda y: y.dropna().iloc[0])).rename_axis('').reset_index().drop('', axis=1)
Date Grade code name
0 2019-11-09 50.0 3352326.0 Allen
1 2019-11-08 60.0 22233467.0 David
>>>
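If the non-null values can sit in any row, another compact sketch is to take the latest date per name and the first non-null value of the other columns (GroupBy 'first' skips NaN by default):
# latest Date per name, first non-null Grade/code per name
out = df.groupby('name', as_index=False).agg(
    {'Date': 'max', 'Grade': 'first', 'code': 'first'})
print(out)
#     name       Date  Grade        code
# 0  Allen 2020-11-09   50.0   3352326.0
# 1  David 2020-05-15   60.0  22233467.0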
Try:
df[['Grade','code']] = df[['Grade','code']].ffill()
df_out=df.groupby('name').agg('last').reset_index()
output:
df_out
Out[32]:
name Date Grade code
0 Allen 2020-11-09 50.0 3352326.0
1 David 2020-05-15 60.0 22233467.0
Basically:
Is there a way to apply a function that uses the column name of a DataFrame in pandas?
Like this:
df['label'] = df.apply(lambda x: '_'.join(labels_dict[column_name][x]), axis=1)
Where column name is the column that the apply is 'processing'.
Details:
I'd like to create a label for each row of a dataframe, based on a dictionary.
Let's take the dataframe df:
df = pd.DataFrame({ 'Application': ['Compressors', 'Fans', 'Fans', 'Material Handling'],
'HP': ['0.25', '0.25', '3.0', '15.0'],
'Sector': ['Commercial', 'Industrial', 'Commercial', 'Residential']},
index=[0, 1, 2, 3])
After I apply the label:
In [139]: df['label'] = df.apply(lambda x: '_'.join(x), axis=1)
In [140]: df
Out[140]:
Application HP Sector label
0 Compressors 0.25 Commercial Compressors_0.25_Commercial
1 Fans 0.25 Industrial Fans_0.25_Industrial
2 Fans 3.0 Commercial Fans_3.0_Commercial
3 Material Handling 15.0 Residential Material Handling_15.0_Residential
But the label is too long, especially when I consider the full dataframe, which contains a lot more columns. What I want is to use a dictionary to shorten the fields that come from the columns (I pasted the code for the dictionary at the end of the question).
I can do that for one field:
In [145]: df['application_label'] = df['Application'].apply(
lambda x: labels_dict['Application'][x])
In [146]: df
Out[146]:
Application HP Sector application_label
0 Compressors 0.25 Commercial cmp
1 Fans 0.25 Industrial fan
2 Fans 3.0 Commercial fan
3 Material Handling 15.0 Residential mat
But I want to do it for all the fields, like I did in snippet #2. So I'd like to do something like:
df['label'] = df.apply(lambda x: '_'.join(labels_dict[column_name][x]), axis=1)
Where column name is the column of df to which the function is being applied. Is there a way to access that information?
Thank you for your help!
I defined the dictionary as:
In [141]: labels_dict
Out[141]:
{u'Application': {u'Compressors': u'cmp',
u'Fans': u'fan',
u'Material Handling': u'mat',
u'Other/General': u'oth',
u'Pumps': u'pum'},
u'ECG': {u'Polyphase': u'pol',
u'Single-Phase (High LRT)': u'sph',
u'Single-Phase (Low LRT)': u'spl',
u'Single-Phase (Med LRT)': u'spm'},
u'Efficiency Level': {u'EL0': u'el0',
u'EL1': u'el1',
u'EL2': u'el2',
u'EL3': u'el3',
u'EL4': u'el4'},
u'HP': {0.25: 1.0,
0.33: 2.0,
0.5: 3.0,
0.75: 4.0,
1.0: 5.0,
1.5: 6.0,
2.0: 7.0,
3.0: 8.0,
10.0: 9.0,
15.0: 10.0},
u'Sector': {u'Commercial': u'com',
u'Industrial': u'ind',
u'Residential': u'res'}}
I worked out one way to do it, but it seems clunky. I'm hoping there's something more elegant out there.
df['label'] = pd.DataFrame([df[column_name].apply(lambda x: labels_dict[column_name][x])
for column_name in df.columns]).apply('_'.join)
I would say this is a bit more elegant:
df.apply(lambda x: '_'.join([str(labels_dict[col][v]) for col, v in zip(df.columns, x)]), axis=1)
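A usage sketch on the example df. Note that in the labels_dict above, the HP sub-dict is keyed by floats and maps to floats, while df['HP'] holds strings, so this hypothetical short() helper casts the key and stringifies the result before joining (it assumes df has only the Application/HP/Sector columns):
def short(col, val):
    # hypothetical helper: HP keys are floats in labels_dict, but df stores strings
    key = float(val) if col == 'HP' else val
    return str(labels_dict[col][key])

df['label'] = df.apply(
    lambda x: '_'.join(short(col, v) for col, v in zip(df.columns, x)), axis=1)
# e.g. row 0 -> 'cmp_1.0_com'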