Pandas merging dataframes resulting in _x and _y suffixes - python

I am creating my own dataset for a uni project. I've used the merge function often and it has always worked perfectly. This time I get _x and _y suffixes, which I cannot understand. I know pandas does this because "the rows in the two data frames that match on the specified columns are extracted, and joined together. If there is more than one match, all possible matches contribute one row each." But I really don't get why. I assume it has to do with a warning I got earlier:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
unique_website_user['PurchaseOnWebsite'] = [apply_website_user() for x in unique_website_user.index]
I tried to merge the dataframes on the column 'CustomerID' where they obviously match. I really don't get the error.
Here is my code:
I first want to remove duplicate rows, where the relevant columns are CustomerID and WebsiteID.
Then I want to apply a function which randomly returns 'True' or 'False' as a string. Up to this point the resulting dataframe looks fine; the only warning I get is the one I described earlier.
And lastly I want to merge them, which results in a dataframe far larger than the original one. I really don't understand that.
import numpy as np
import pandas as pd
from numpy.random import choice

df = pd.DataFrame()
df['AdID'] = np.random.randint(1, 1000001, size=100000)
df['CustomerID'] = np.random.randint(1, 1001, size=len(df))
df["Datetime"] = choice(pd.date_range('2015-01-01', '2020-12-31'), len(df))

def check_weekday(date):
    res = len(pd.bdate_range(date, date))
    if res == 0:
        result = "Weekend"
    else:
        result = "Working Day"
    return result

df["Weekend"] = df["Datetime"].apply(check_weekday)

def apply_age():
    age = choice([16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36],
                 p=[.00009, .00159, .02908, .06829, .09102, .10043, .10609, .10072, .09223, .08018, .06836, .05552,
                    .04549, .03864, .03009, .02439, .01939, .01586, .01280, .01069, .00905])
    return age

def apply_income_class():
    income_class = choice([np.random.randint(50, 501), np.random.randint(502, 1001), np.random.randint(1002, 1501), np.random.randint(1502, 2001)],
                          p=[.442, .387, .148, .023])
    return income_class

def apply_gender():
    gender = choice(['male', 'female'], p=[.537, .463])
    return gender

unique_customers = df[['CustomerID']].drop_duplicates(keep="first")
unique_customers['Age'] = [apply_age() for x in unique_customers.index]
unique_customers['Gender'] = [apply_gender() for x in unique_customers.index]
unique_customers['Monthly Income'] = [apply_income_class() for x in unique_customers.index]
unique_customers['Spending Score'] = [np.random.randint(1, 101) for x in unique_customers.index]
df = df.merge(unique_customers, on=['CustomerID'], how='left')

df['WebsiteID'] = np.random.randint(1, 31, len(df))
df['OfferID'] = np.random.randint(1, 2001, len(df))
df['BrandID'] = np.random.randint(1, 10, len(df))

unique_offers = df[['OfferID']].drop_duplicates(keep="first")
print(len(unique_offers))
unique_offers['CategoryID'] = [np.random.randint(1, 501) for x in unique_offers.index]
unique_offers['NPS'] = [np.random.randint(1, 101) for x in unique_offers.index]
df = df.merge(unique_offers, on=['OfferID'], how='left')

def apply_website_user():
    purchase = np.random.choice(['True', 'False'])
    return purchase

unique_website_user = df.drop_duplicates(subset=['CustomerID', 'WebsiteID'], keep="first").copy()
unique_website_user['PurchaseOnWebsite'] = [apply_website_user() for x in unique_website_user.index]
print(unique_website_user.head())

df = df.merge(unique_website_user[['CustomerID', 'PurchaseOnWebsite']], on='CustomerID', how='left')
#df['PurchaseOnWebsite'] = df.groupby(['CustomerID', 'WebsiteID']).apply(apply_website_user)
print(df.head())

# Create the csv file
#df.to_csv(r'/Users/alina/Desktop/trainingsdaten.csv', sep=',', index=False)

It's better to paste the data rather than provide images, so this is just guidance, as I can't test it. You have a couple of issues, and I don't think they are related.
Copy-or-slice warning: you might be able to get rid of this in two ways. One is to reconfigure the line:
unique_website_user['PurchaseOnWebsite'] = [apply_website_user() for x in unique_website_user.index]
to the format it is suggesting. The other, simpler way that might work is to use .copy() on the line before it. You are dropping duplicates and then modifying the result, and pandas is just warning that you are modifying a slice or view of the original. Try this:
unique_website_user = df.drop_duplicates(subset=['CustomerID', 'WebsiteID'], keep="first").copy()
If you just want to merge back that one column and reduce number of columns, try this:
df = df.merge(unique_website_user[['CustomerID','PurchaseOnWebsite']], on='CustomerID', how='left')
Another alternative would be to use groupby() and apply your True/False function with an apply method. Something like:
df.groupby(['CustomerID']).apply(yourfunctionhere)
This gets rid of creating and merging dataframes, as sketched below. If you post the actual dataframe, we can be more specific.
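A minimal sketch of that idea (the illustrative data and grouping on both CustomerID and WebsiteID are my assumptions, untested against your frame):
import numpy as np
import pandas as pd

df = pd.DataFrame({'CustomerID': [1, 1, 2, 2],
                   'WebsiteID': [10, 10, 10, 20]})

# transform broadcasts the single random choice to every row of each
# (CustomerID, WebsiteID) group, so no second dataframe or merge is needed
df['PurchaseOnWebsite'] = (df.groupby(['CustomerID', 'WebsiteID'])['CustomerID']
                             .transform(lambda g: np.random.choice(['True', 'False'])))
print(df)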
UPDATE:
Saw your comment that you found your own answer. Also, this is way faster than your call to the weekday function.
df["Weekend"] = df['Datetime'].apply(lambda x: 'Weekend' if (x.weekday() == 5 or x.weekday() == 6) else 'Working Day')

Related

Pandas for python - code failing to sort by a specific column added to the data

I'm working with Python and pandas. I have cleaned some data, added a new column, and populated it. Now the dataframe refuses to sort for some reason. I have tried two different methods to ensure the column "Review_Score" is in numeric form, and both work. I then tried to sort by name and it would not work either. Can anyone explain what went wrong here?
from itertools import count
from platform import platform
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#import the csv file
df = pd.read_csv("video_game.csv")
#clean the data converting str to numbers where needed + drop unwanted columns
df = df.drop(columns=["NA_players", "EU_players", "JP_players", "Other_players", "Global_players", "User_Count", "Rating", "Critic_Count"])
df['User_Score'] = pd.to_numeric(df['User_Score'] ,errors='coerce')
df = df.replace(np.nan, 0, regex=True)
df['User_Score'] = df['User_Score'].astype(float)
df['Critic_Score'] = pd.to_numeric(df['Critic_Score'] ,errors='coerce')
df = df.replace(np.nan, 0, regex=True)
df['Critic_Score'] = df['Critic_Score'].astype(float)
#filter all the NES games released in 1988 on the list
df2 = df.loc[(df['Platform'] == 'NES') & (df['Year_of_Release'] == 1988)]
score = [94, 78, 80, 76, 72, 43, 94, 95, 65, 35, 68]
#add new column and populate with the review scores from the list score
df2['Review_Score'] = score
#df2.Review_Score = df2.Review_Score.astype(float)
df2.Review_Score = pd.to_numeric(df2.Review_Score, errors='coerce')
df2.sort_values('Review_Score', ascending=True)
print(df2)
For the sort_values method, you need to add the extra parameter inplace=True in order to apply the changes to the original dataframe. You could do it like this:
df2.sort_values('Review_Score', ascending=True, inplace=True)
Another way you could apply the changes is by reassigning the dataframe to the original variable as such:
df2 = df2.sort_values('Review_Score', ascending=True)
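A quick self-contained way to see the difference (illustrative frame, not the poster's CSV):
import pandas as pd

df2 = pd.DataFrame({'Name': ['b', 'a'], 'Review_Score': [78, 94]})
df2.sort_values('Review_Score')        # returns a sorted copy; df2 itself is unchanged
df2 = df2.sort_values('Review_Score')  # reassignment (or inplace=True) makes it stick
print(df2)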

How can we represent a pandas.Series value in Django?

I have the following code, where I am binning a Pandas dataframe into given number of bins:
def contibin(data, target, bins=10):
    # Empty DataFrames
    newDF, woeDF = pd.DataFrame(), pd.DataFrame()
    # Extract column names
    cols = data.columns
    for ivars in cols[~cols.isin([target])]:
        if (data[ivars].dtype.kind in 'bifc') and (len(np.unique(data[ivars])) > 10):
            binned_x = pd.qcut(data[ivars], bins, duplicates='drop')
            d0 = pd.DataFrame({'x': binned_x, 'y': data[target]})
            #print(d0)
        else:
            d0 = pd.DataFrame({'x': data[ivars], 'y': data[target]})
        d = d0.groupby("x", as_index=False).agg({"y": ["count", "sum"]})
        d.columns = ['Range', 'Total', 'No. of Good']
        d['No. of Bad'] = d['Total'] - d['No. of Good']
        d['Dist. of Good'] = np.maximum(d['No. of Good'], 0.5) / d['No. of Good'].sum()
        d['Dist. of Bad'] = np.maximum(d['No. of Bad'], 0.5) / d['No. of Bad'].sum()
        d['WoE'] = np.log(d['Dist. of Good'] / d['Dist. of Bad'])
        d['IV'] = d['WoE'] * (d['Dist. of Good'] - d['Dist. of Bad'])
        #temp = pd.DataFrame({"Variable": [ivars], "IV": [d['IV'].sum()]}, columns=["Variable", "IV"])
        #newDF = pd.concat([newDF, temp], axis=0)
        woeDF = pd.concat([woeDF, d], axis=0)
    return woeDF
The problem I am facing is that when I try to integrate the code into a front end using Django, I am not able to represent woeDF['Range'] the way I see it normally. I tried converting the pandas Series to string, but it still isn't giving me what I want. To illustrate what I want to see in my front end, I am attaching a picture of a sample table which I got by running this code on the Churn Modelling dataset. The image of the table I need
You can turn the DataFrame into an iterable of row objects (namedtuples) using DataFrame.itertuples(index=False).
You will then be able to iterate through the dataframe in Jinja, accessing the columns via their names. See the example below in Python:
import pandas as pd

columns = {"name": ["john", "skip", "abu", "harry", "ben"],
           "age": [10, 20, 30, 40, 50]}
df = pd.DataFrame(columns)
print(df)

df_objects = df.itertuples(index=False)
for person in df_objects:
    print("{0}: {1}".format(person.name, person.age))

Pandas - Incrementally add to DataFrame

I'm trying to add rows and columns to pandas incrementally. I have a lot of data stored across multiple datastores and a heuristic to determine a value. As I navigate across this datastore, I'd like to be able to incrementally update a dataframe, where in some cases, either names or days will be missing.
import random

import pandas as pd

def foo():
    df = pd.DataFrame()
    year = 2016
    names = ['Bill', 'Bob', 'Ryan']
    for day in range(1, 4, 1):
        for name in names:
            if random.choice([True, False]):  # sometimes a name will be missing
                continue
            value = random.randrange(0, 20, 1)  # random value from heuristic
            col = '{}_{}'.format(year, day)  # column name
            df = df.append({col: value, 'name': name}, ignore_index=True)
    df.set_index('name', inplace=True, drop=True)
    print(df.loc['Bill'])
This produces the following results:
      2016_1  2016_2  2016_3
name
Bill    15.0     NaN     NaN
Bill     NaN    12.0     NaN
I've created a heatmap of the data and it's blocky due to duplicate names, so the output I'm looking for is:
      2016_1  2016_2  2016_3
name
Bill    15.0    12.0     NaN
How can I combine these rows?
Is there a more efficient means of creating this dataframe?
Try this:
df.groupby('name')[df.columns.values].sum()
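For instance, with a minimal reproduction of the duplicated rows (illustrative data; min_count=1 is my addition, to keep all-NaN groups as NaN instead of 0):
import numpy as np
import pandas as pd

df = pd.DataFrame({'name':   ['Bill', 'Bill'],
                   '2016_1': [15.0, np.nan],
                   '2016_2': [np.nan, 12.0],
                   '2016_3': [np.nan, np.nan]}).set_index('name')

# one row per name; min_count=1 leaves 2016_3 as NaN rather than 0.0
print(df.groupby('name').sum(min_count=1))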
Or try this:
df.pivot_table(index='name', aggfunc='sum', dropna=False)
After you run your foo() function, you can use any aggregation function (if you have only one value per column and all the others are null) and groupby on df.
First, use reset_index to get back your name column.
Then use groupby and apply. Here I propose a custom function which checks that there is only one value per column, and raise a ValueError if not.
df.reset_index(inplace=True)

def aggdata(x):
    if all([i <= 1 for i in x.count()]):
        return x.mean()
    else:
        raise ValueError

ddf = df.groupby('name').apply(aggdata)
If all but one of the values in a column are null, x.mean() will return that value (actually, you can use almost any aggregator, since there is only one non-null value, and that is the one returned).
It would be easier to have the name as column and date as index instead. Plus, you can work within the loop with lists and afterwards create the pd.DataFrame.
e.g.
import random

import numpy as np
import pandas as pd

year = 2016
names = ['Bill', 'Bob', 'Ryan']
index = []
valueBill = []
valueBob = []
valueRyan = []
for day in range(1, 4):
    if random.choice([True, False]):  # sometimes a name will be missing
        valueBill.append(random.randrange(0, 20))
        valueBob.append(random.randrange(0, 90))
        valueRyan.append(random.randrange(0, 200))
        index.append('{}-0{}'.format(year, day))  # column name
    else:
        valueBill.append(np.nan)
        valueBob.append(np.nan)
        valueRyan.append(np.nan)
        index.append(np.nan)

df = pd.DataFrame({})
for name, value in zip(names, [valueBill, valueBob, valueRyan]):
    df[name] = value
df = df.set_index(pd.to_datetime(index))
You can append the entries whose names do not already exist, and then do an update to fill in the existing entries.
import pandas as pd
import random

def foo():
    df = pd.DataFrame()
    year = 2016
    names = ['Bill', 'Bob', 'Ryan']
    for day in range(1, 4, 1):
        for name in names:
            if random.choice([True, False]):  # sometimes a name will be missing
                continue
            value = random.randrange(0, 20, 1)  # random value from heuristic
            col = '{}_{}'.format(year, day)  # column name
            new_df = pd.DataFrame({col: value, 'name': name}, index=[1]).set_index('name')
            df = pd.concat([df, new_df[~new_df.index.isin(df.index)].dropna()])
            df.update(new_df)
    #df.set_index('name', inplace=True, drop=True)
    print(df)

I want to run a loop with a condition and save all outputs as dataframes with different names

I wrote a function which depends only on a dataframe, and its output is also a dataframe. I would like to make different dataframes according to a condition and save them as different datasets with different names. However, I couldn't save them as dataframes with different names, so instead I do the process manually. Is there code which would do the same? It would be very beneficial.
import os
import numpy as np
import pandas as pd
data1 = pd.read_csv('C:/Users/Oz/Desktop/vintage/vintage1.csv', encoding='latin-1')
product_list= data1['product_types'].unique()
def vintage_table(df):
    df['Disbursement_Date'] = pd.to_datetime(df.Disbursement_Date)
    df['Closing_Date'] = pd.to_datetime(df.Closing_Date)
    df['NPL_date'] = pd.to_datetime(df.NPL_date, errors='ignore')
    df['NPL_date_period'] = df.loc[df.NPL_date > '2015-01-01', 'NPL_date'].apply(lambda x: x.strftime('%Y-%m'))
    df['Dis_date_period'] = df.Disbursement_Date.apply(lambda x: x.strftime('%Y-%m'))
    df['diff'] = ((df.NPL_date - df.Disbursement_Date) / np.timedelta64(3, 'M')).round(0)
    df = df.groupby(['Dis_date_period', 'NPL_date_period']).agg({'Dis_amount': 'sum', 'NPL_amount': 'sum', 'diff': 'mean'})
    df.reset_index(level=0, inplace=True)
    df['Vintage_Ratio'] = df['NPL_amount'] / df['Dis_amount']
    table = pd.pivot_table(df, values='Vintage_Ratio', index='Dis_date_period', columns=['diff']).fillna(0)
    return
The above is the function
#for e in product_list:
#    sub = data1[data1['product_types'] == e]
#    print(sub)
consumer = data1[data1['product_types'] == product_list[0]]
mortgage = data1[data1['product_types'] == product_list[1]]
vehicle = data1[data1['product_types'] == product_list[2]]
table_con = vintage_table(consumer)
table_mor = vintage_table(mortgage)
table_veh = vintage_table(vehicle)
I would like to improve this part: is there a better way to do the same process?
You could have your vintage_table() function return a dataframe (i.e. end it with return table instead of a bare return) rather than just modifying one dataframe over and over, and that way you could do this in the second code block:
table_con = vintage_table(consumer)
table_mor = vintage_table(mortgage)
table_veh = vintage_table(vehicle)
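If you also want to avoid repeating the filtering by hand, one possibility (a sketch on my part, not from the original answer) is to keep the tables in a dict keyed by product type:
tables = {p: vintage_table(data1[data1['product_types'] == p])
          for p in product_list}
# access e.g. tables[product_list[0]] instead of table_con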

Applying similar functions across multiple columns in python/pandas

Problem: Given the dataframe below, I'm trying to come up with the code that will apply a function to three distinct columns without having to write three separate function calls.
The code for the data:
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'days': [365, 365, 213, 318, 71],
'spend_30day': [22, 241.5, 0, 27321.05, 345],
'spend_90day': [22, 451.55, 64.32, 27321.05, 566.54],
'spend_365day': [854.56, 451.55, 211.65, 27321.05, 566.54]}
df = pd.DataFrame(data)
cols = df.columns.tolist()
cols = ['name', 'days', 'spend_30day', 'spend_90day', 'spend_365day']
df = df[cols]
df
The function below will essentially annualize spend; if someone has fewer than, say, 365 days in the "days" column, the following function will tell me what the spend would have been if they had 365 days:
def annualize_spend_365(row):
    if row['days'] / float(365) < 1:
        return row['spend_365day'] / (row['days'] / float(365))
    else:
        return row['spend_365day']
Then I apply the function to the particular column:
df.spend_365day = df.apply(annualize_spend_365, axis=1).round(2)
df
This works exactly as I want it to for that one column. However, I don't want to have to rewrite this for each of the three different "spend" columns (30, 90, 365). I want to be able to write code that will generalize and apply this function to multiple columns in one pass.
I thought I could create lists of the columns and their respective days, use the "zip" function, and nest the function in a for loop, but my attempt below ultimately fails:
spend_cols = [df.spend_30day, df.spend_90day, df.spend_365day]
days_list = [30, 90, 365]
for col, day in zip(spend_cols, days_list):
    def annualize_spend(row):
        if row.days / float(day) < 1:
            return row.col / (row.days / float(day))
        else:
            return row.col
    col = df.apply(annualize_spend, axis=1)
The error:
AttributeError: ("'Series' object has no attribute 'col'")
I'm not sure why the loop approach is failing. Regardless, I'm hoping for guidance on how to generalize function application in pandas. Thanks in advance!
Look at your two function definitions:
def annualize_spend_365(row):
    if row['days'] / float(365) < 1:
        return row['spend_365day'] / (row['days'] / float(365))
    else:
        return row['spend_365day']
and
# col in [df.spend_30day, df.spend_90day, df.spend_365day]
def annualize_spend(row):
    if row.days / float(day) < 1:
        return row.col / (row.days / float(day))
    else:
        return row.col
See the difference? In the first case you access the fields with explicit field names, and it works. In the second case you try to access row.col, which fails, because here col holds the corresponding columns of df themselves rather than their names. Instead try
spend_cols = ['spend_30day', 'spend_90day', 'spend_365day']
before your loop. On the other hand, in the syntax df.days the field name is actually "days", but in df.col the field name is not the string "col"; it is the value of the variable col. So you want to use row[col] in the latter case as well. And anyway, I'm not sure how wise it is to use col as an output variable inside your loop over col.
I'm unfamiliar with pandas.DataFrame.apply, but it's probably possible to use a single function definition, which takes the number of days and the field of interest as input variables:
def annualize_spend(col, day, row):
    if row['days'] / float(day) < 1:
        return row[col] / (row['days'] / float(day))
    else:
        return row[col]

spend_cols = ['spend_30day', 'spend_90day', 'spend_365day']
days_list = [30, 90, 365]
for col, day in zip(spend_cols, days_list):
    col = df.apply(lambda row, col=col, day=day: annualize_spend(col, day, row), axis=1)
The lambda will ensure that only one input parameter of your function is left dangling free when it gets applied.
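Putting it together, a small self-contained sketch (assigning back with df[col] = ... is my assumption about the intended behaviour, since rebinding the loop variable would discard the result):
import pandas as pd

df = pd.DataFrame({'name': ['Jason', 'Amy'],
                   'days': [365, 71],
                   'spend_30day': [22.0, 345.0],
                   'spend_90day': [22.0, 566.54],
                   'spend_365day': [854.56, 566.54]})

def annualize_spend(col, day, row):
    if row['days'] / float(day) < 1:
        return row[col] / (row['days'] / float(day))
    return row[col]

for col, day in zip(['spend_30day', 'spend_90day', 'spend_365day'], [30, 90, 365]):
    df[col] = df.apply(lambda row, col=col, day=day: annualize_spend(col, day, row),
                       axis=1).round(2)

print(df)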
