I am trying to do change data capture on two dataframes. The logic is to concatenate the two dataframes, group by a set of key columns, and then loop over the groups with count > 1 to see which columns were updated. I am getting a strange error; any help is appreciated.
Code:
import pandas as pd
import numpy as np
# pd.set_option('display.height', 1000)  # removed in newer pandas; 'display.width' below suffices
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print("reading wolverine xlxs")
# defining metadata
df_header = ['DisplayName','StoreLanguage','Territory','WorkType','EntryType','TitleInternalAlias',
'TitleDisplayUnlimited','LocalizationType','LicenseType','LicenseRightsDescription',
'FormatProfile','Start','End','PriceType','PriceValue','SRP','Description',
'OtherTerms','OtherInstructions','ContentID','ProductID','EncodeID','AvailID',
'Metadata', 'AltID', 'SuppressionLiftDate','SpecialPreOrderFulfillDate','ReleaseYear','ReleaseHistoryOriginal','ReleaseHistoryPhysicalHV',
'ExceptionFlag','RatingSystem','RatingValue','RatingReason','RentalDuration','WatchDuration','CaptionIncluded','CaptionExemption','Any','ContractID',
'ServiceProvider','TotalRunTime','HoldbackLanguage','HoldbackExclusionLanguage']
df_w01 = pd.read_excel("wolverine_1.xlsx", names = df_header)
df_w02 = pd.read_excel("wolverine_2.xlsx", names = df_header)
df_w01['version'] = 'OLD'
df_w02['version'] = 'NEW'
#print(df_w01)
df_m_d = pd.concat([df_w01, df_w02], ignore_index = True)
first_pass = df_m_d[df_m_d.duplicated(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType','FormatProfile'], keep=False)]
first_pass_keep_duplicate = df_m_d[df_m_d.duplicated(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType','FormatProfile'], keep='first')]
group_by_1 = first_pass.groupby(['StoreLanguage','Territory','TitleInternalAlias','LocalizationType','LicenseType','FormatProfile'])
for i, rows in group_by_1.iterrows():
    print("rownumber", i)
    print(rows)
print(first_pass)
And the error I get:
AttributeError: Cannot access callable attribute 'iterrows' of 'DataFrameGroupBy' objects, try using the 'apply' method
Any help is much appreciated.
Your GroupBy object supports iteration, so instead of
for i, rows in group_by_1.iterrows():
    print("rownumber", i)
    print(rows)
you need to do something like
for name, group in group_by_1:
    print(name)
    print(group)
Then you can do what you need to do with each group. See the docs.
Why not do as suggested and use apply? Something like:
def print_rows(rows):
    print(rows)

group_by_1.apply(print_rows)
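Once you are iterating over the groups, the change-data-capture step itself can be a column-by-column comparison of the OLD and NEW rows. A minimal sketch, assuming each duplicated group contains exactly one 'OLD' and one 'NEW' row (key columns as in the question):
key_cols = ['StoreLanguage', 'Territory', 'TitleInternalAlias',
            'LocalizationType', 'LicenseType', 'FormatProfile']
for name, group in group_by_1:
    old = group[group['version'] == 'OLD']
    new = group[group['version'] == 'NEW']
    if len(old) == 1 and len(new) == 1:
        # compare every non-key column; note NaN != NaN, so missing
        # values will always show up as changed
        for col in group.columns.drop(key_cols + ['version']):
            old_val, new_val = old[col].iloc[0], new[col].iloc[0]
            if old_val != new_val:
                print(name, col, old_val, '->', new_val)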
I'm using auto_arima via pmdarima to fit multiple time series via a groupby. This is to say, I have a pd.DataFrame of stacked time-indexed data, grouped by the variable column, and have successfully applied transform(pm.auto_arima) to each group. The reproducible example finds boring best ARIMA models, but the idea seems to work. I now want to apply .predict() similarly, but cannot get it to play nice with apply / lambda(x) / their combinations.
The code below works until the # Forecasting - help! section. I'm having trouble catching the correct object (apparently) in the apply. How might I adapt one of test1, test2, or test3 to get what I want? Or, is there some other best-practice construct to consider? Is it better across columns (without a melt)? Or via a loop?
Ultimately, I hope that test1, say, is a stacked pd.DataFrame (or pd.Series at least) with 8 rows: 4 forecasted values for each of the 2 time series in this example, with an identifier column variable (possibly tacked on after the fact).
import pandas as pd
import pmdarima as pm
import itertools
# Get data - this is OK.
url = 'https://raw.githubusercontent.com/nickdcox/learn-airline-delays/main/delays_2018.csv'
keep = ['arr_flights', 'arr_cancelled']
# Setup data - this is OK.
df = pd.read_csv(url, index_col=0)
df.index = pd.to_datetime(df.index, format = "%Y-%m")
df = df[keep]
df = df.sort_index()
df = df.loc['2018']
df = df.groupby(df.index).sum()
df.reset_index(inplace = True)
df = df.melt(id_vars = 'date', value_vars = df.columns.to_list()[1:])
# Fit auto.arima for each time series - this is OK.
fit = df.groupby('variable')['value'].transform(pm.auto_arima).drop_duplicates()
fit = fit.to_frame(name = 'model')
fit['variable'] = keep
fit.reset_index(drop = True, inplace = True)
# Setup forecasts - this is OK.
max_date = df.date.max()
dr = pd.to_datetime(pd.date_range(max_date, periods = 4 + 1, freq = 'MS').tolist()[1:])
yhat = pd.DataFrame(list(itertools.product(keep, dr)), columns = ['variable', 'date'])
yhat.set_index('date', inplace = True)
# Forecasting - help! - Can't get any of these to work.
def predict_fn(obj):
    return obj.loc[0].predict(4)
predict_fn(fit.loc[fit['variable'] == 'arr_flights']['model']) # Appears to work!
test1 = fit.groupby('variable')['model'].apply(lambda x: x.predict(n_periods = 4)) # Try 1: 'Series' object has no attribute 'predict'.
test2 = fit.groupby('variable')['model'].apply(lambda x: x.loc[0].predict(n_periods = 4)) # Try 2: KeyError
test3 = fit.groupby('variable')['model'].apply(predict_fn) # Try 3: KeyError
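For what it's worth, the likely issue in test1 is that groupby('variable')['model'] hands the lambda a Series of fitted models rather than a single model, so .predict does not exist on it; and .loc[0] fails in test2 and test3 because the group's index labels are not 0. A sketch that selects the one model per group positionally and stacks the 4-step forecasts into the 8-row result described above (assuming each group holds exactly one fitted model):
test1 = (fit.groupby('variable')['model']
            .apply(lambda x: pd.Series(x.iloc[0].predict(n_periods=4), index=dr)))
# flatten the (variable, date) MultiIndex into columns
test1 = test1.reset_index()
test1.columns = ['variable', 'date', 'value']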
I am using investpy to get historical stock data for 2 stocks (TRP_pb, TRP_pc):
import investpy
import pandas as pd
import numpy as np
TRP_pb = investpy.get_stock_historical_data(stock='TRP_pb',
                                            country='canada',
                                            from_date='01/01/2022',
                                            to_date='01/04/2022')
print(TRP_pb.head())
TRP_pc = investpy.get_stock_historical_data(stock='TRP_pc',
                                            country='canada',
                                            from_date='01/01/2022',
                                            to_date='01/04/2022')
print(TRP_pc.head())
I can append the two tables by using the append method
appendedtable = TRP_pb.append(TRP_pc, ignore_index=False)
What I am trying to do is use a loop to combine these two tables.
Here is what I have tried so far:
preferredlist = ['TRP_pb', 'TRP_pc']
for i in preferredlist:
    new = investpy.get_stock_historical_data(stock=i,
                                             country='canada',
                                             from_date='01/01/2022',
                                             to_date='01/04/2022')
    new.append(new, ignore_index=True)
However, this doesn't work.
I would appreciate any help.
Since get_stock_historical_data returns a DataFrame, you can create an empty dataframe before the for loop and concat inside the loop. (Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so pd.concat is the way to go.)
preferredlist = ['TRP_pb', 'TRP_pc']
final_list = pd.DataFrame()
for i in preferredlist:
    new = investpy.get_stock_historical_data(stock=i,
                                             country='canada',
                                             from_date='01/01/2022',
                                             to_date='01/04/2022')
    final_list = pd.concat([final_list, new])
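Concatenating into final_list on every pass re-copies the whole frame each time; a common variant is to collect the pieces in a list and concat once at the end:
frames = []
for i in preferredlist:
    frames.append(investpy.get_stock_historical_data(stock=i,
                                                     country='canada',
                                                     from_date='01/01/2022',
                                                     to_date='01/04/2022'))
final_list = pd.concat(frames)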
import pandas as pd
nba = pd.read_csv("nba.csv")
names = pd.Series(nba['Name'])
data = nba['Salary']
nba_series = (data, index=[names])
print(nba_series)
Hello, I am trying to convert the columns 'Name' and 'Salary' from a dataframe into a Series. I need to set the names as the index and the salaries as the values, but I cannot figure it out. This is my best attempt so far; any guidance is appreciated.
I think you are over-thinking this. Simply construct it with pd.Series(). Note that the data needs to be passed with .values, otherwise pandas will align the Salary values on their original index against the new Name index and you'll get NaNs.
import pandas as pd
nba = pd.read_csv("nba.csv")
nba_series = pd.Series(data=nba['Salary'].values, index=nba['Name'])
Maybe try set_index?
nba.set_index('Name', inplace=True)
nba_series = nba['Salary']
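Equivalently, without mutating nba, the same thing as a one-liner:
nba_series = nba.set_index('Name')['Salary']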
This might help you:
import pandas as pd
nba = pd.read_csv("nba.csv")
names = nba['Name']  # it's automatically a Series
data = nba['Salary']
# set names as the index of the series
data.index = names
data.index = names might be correct but depends on the data
I wrote a function that depends only on a dataframe. The function's output is also a dataframe. I would like to make different dataframes according to a condition and save them as different datasets with different names. However, I couldn't save them as dataframes with different names; instead I do the process manually. Is there code that would do the same? It would be very helpful.
import os
import numpy as np
import pandas as pd
data1 = pd.read_csv('C:/Users/Oz/Desktop/vintage/vintage1.csv', encoding='latin-1')
product_list= data1['product_types'].unique()
def vintage_table(df):
    df['Disbursement_Date'] = pd.to_datetime(df.Disbursement_Date)
    df['Closing_Date'] = pd.to_datetime(df.Closing_Date)
    df['NPL_date'] = pd.to_datetime(df.NPL_date, errors='ignore')
    df['NPL_date_period'] = df.loc[df.NPL_date > '2015-01-01', 'NPL_date'].apply(lambda x: x.strftime('%Y-%m'))
    df['Dis_date_period'] = df.Disbursement_Date.apply(lambda x: x.strftime('%Y-%m'))
    df['diff'] = ((df.NPL_date - df.Disbursement_Date) / np.timedelta64(3, 'M')).round(0)
    df = df.groupby(['Dis_date_period', 'NPL_date_period']).agg({'Dis_amount': 'sum', 'NPL_amount': 'sum', 'diff': 'mean'})
    df.reset_index(level=0, inplace=True)
    df['Vintage_Ratio'] = df['NPL_amount'] / df['Dis_amount']
    table = pd.pivot_table(df, values='Vintage_Ratio', index='Dis_date_period', columns=['diff']).fillna(0)
    return
The above is the function
#for e in product_list:
# sub = data1[data1['product_types'] == e]
# print(sub)
consumer = data1[data1['product_types'] == product_list[0]]
mortgage = data1[data1['product_types'] == product_list[1]]
vehicle = data1[data1['product_types'] == product_list[2]]
table_con = vintage_table(consumer)
table_mor = vintage_table(mortgage)
table_veh = vintage_table(vehicle)
I would like to improve this part. Is there a better way to do the same process?
You could have your vintage_table() function return a dataframe instead of just modifying one dataframe over and over, and that way you could do this in the second code block:
table_con = vintage_table(consumer)
table_mor = vintage_table(mortgage)
table_veh = vintage_table(vehicle)
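To remove the copy-paste entirely, one option is a dict comprehension keyed by product type (a sketch, assuming the bare return at the end of vintage_table is changed to return table):
# assumes vintage_table ends with `return table`
tables = {p: vintage_table(data1[data1['product_types'] == p]) for p in product_list}
table_con = tables[product_list[0]]  # and similarly for the other product types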
I am writing a function that will serve as a filter for the rows I want to use.
The sample data frame is as follows:
df = pd.DataFrame()
df['Xstart'] = [1, 2.5, 3, 4, 5]
df['Xend'] = [6, 8, 9, 10, 12]
df['Ystart'] = [0, 1, 2, 3, 4]
df['Yend'] = [6, 8, 9, 10, 12]
df['GW'] = [1, 1, 2, 3, 4]

def filter(data, Game_week):
    pass_data = data[(data['GW'] == Game_week)]
When I call the function filter as follows, I get an error:
df1 = filter(df, 1)
The error message is
AttributeError: 'NoneType' object has no attribute 'head'
But when I filter manually, it works:
pass_data = df[(df['GW'] == [1])]
This is my first issue.
My second issue is that I want to filter rows with multiple GW values (1, 2, 3, etc.).
I can do that manually as follows:
pass_data = df[(df['GW'] == [1]) | (df['GW'] == [2]) | (df['GW'] == [3])]
But if I want the function to take a list such as [1, 2, 3] as input, how can I write it so that I can pass in a range of 1 to 3?
Could anyone please advise?
Thanks,
Zep
Use isin to pass a list of values instead of a scalar. Also, filter is a built-in function in Python, so it is better to rename your function:
def filter_vals(data, Game_week):
    return data[data['GW'].isin(Game_week)]

df1 = filter_vals(df, range(1, 4))
Because you don't return anything from the function, it returns None instead of the desired dataframe, so do this (note also that the extra parentheses inside data[...] are unnecessary):
def filter(data, Game_week):
    return data[data['GW'] == Game_week]
Also, isin may well be better:
def filter(data, Game_week):
    return data[data['GW'].isin(Game_week)]
Use return to return data from the function for the first part. For the second, use:
def filter(data, Game_week):
    return data[data['GW'].isin(Game_week)]
Now apply the filter function:
df1 = filter(df, [1, 2])