Use an Excel sheet to create a dictionary for replacing values - python

I have an Excel file with product names. The first row holds the categories, and each cell below it is a product in that category (A1: Water, A2: Sparkling, A3: Still, B1: Soft Drinks, B2: Coca Cola, B3: Orange Juice, B4: Lemonade, etc.). I want to keep this list in a viewable format (not comma-separated, etc.) because that makes it very easy for anybody to update the product names (a second person runs the script without understanding it).
If it helps, I can also save the Excel file as CSV, and I can move the categories from the top row to the first column.
I would like to replace the cells of a dataframe (df) with the product categories. For example, Coca Cola would become Soft Drinks. If a product is not in the Excel file, it should not be replaced (e.g. Cookie).
print(df)
     Product  Quantity
0  Coca Cola      1234
1     Cookie         4
2      Still       333
3      Chips        88
Expected Outcome:
print (df1)
       Product  Quantity
0  Soft Drinks      1234
1       Cookie         4
2        Water       333
3       Snacks        88

Use DataFrame.melt with DataFrame.dropna (or DataFrame.stack) to build a helper Series, and then use Series.replace. Here df1 is the lookup frame read from the Excel file:
s = df1.melt().dropna().set_index('value')['variable']
Alternative:
s = df1.stack().reset_index(name='v').set_index('v')['level_1']
df['Product'] = df['Product'].replace(s)
#if performance is important
#df['Product'] = df['Product'].map(s).fillna(df['Product'])
print (df)
       Product  Quantity
0  Soft Drinks      1234
1       Cookie         4
2        Water       333
3       Snacks        88
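For an end-to-end sketch, assuming the lookup sheet is saved as products.xlsx (a hypothetical filename) with one category per column header and the product names underneath:

import pandas as pd

# read the lookup sheet; the first row becomes the column headers (categories)
# (the filename is an assumption -- adjust to your file)
df1 = pd.read_excel('products.xlsx')

# melt to long format, drop the NaN padding of shorter columns,
# and index by product name to get a product -> category mapping
s = df1.melt().dropna().set_index('value')['variable']

df = pd.DataFrame({'Product': ['Coca Cola', 'Cookie', 'Still', 'Chips'],
                   'Quantity': [1234, 4, 333, 88]})

# map() returns NaN for products not in the sheet, so keep the original there
df['Product'] = df['Product'].map(s).fillna(df['Product'])
print(df)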

Related

How to remove duplicates when, upon editing an entity, the originals are not replaced?

Consider that we have a dataset that represents some purchases. Products that have been bought together have the same basket ID.
When a purchased product is edited (e.g. the wrong price was inserted at first) it does not replace the original record. Instead, a new record is made for EVERY product of that basket ID and a new Basket ID is assigned to the purchase.
For example consider a purchase of a bottle of milk and a chocolate:
     Product  Price  BasketID  PreviousBasketID
0       Milk      2      1234              Null
1  Chocolate      3      1234              Null
Let's say that we'd like to edit the price of chocolate. Then the dataset would be:
     Product  Price  BasketID  PreviousBasketID
0       Milk      2      1234              Null
1  Chocolate      3      1234              Null
2       Milk      2      5678              1234
3  Chocolate      4      5678              1234
Is there a way to keep only the latest version of the basket (i.e. BasketID = 5678) and get rid of any previous versions?
Can you remove any rows that have a BasketID that appears in PreviousBasketID?
Something like:
df = df[~df["BasketID"].isin(df["PreviousBasketID"])]
Here the ~ is the element-wise NOT operator, which inverts the boolean mask.
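As a quick check with the sample data above, a minimal sketch:

import pandas as pd

df = pd.DataFrame({
    'Product': ['Milk', 'Chocolate', 'Milk', 'Chocolate'],
    'Price': [2, 3, 2, 4],
    'BasketID': [1234, 1234, 5678, 5678],
    'PreviousBasketID': [None, None, 1234, 1234],
})

# drop every row whose BasketID shows up as some row's PreviousBasketID,
# leaving only the latest version of each basket
df = df[~df['BasketID'].isin(df['PreviousBasketID'])]
print(df)  # keeps only the BasketID 5678 rows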

Duplicate Analysis Using Python

I am a beginner in Python.
So far I have identified the duplicates using the pandas library, but I don't know how this helps me.
import pandas as pd
import numpy as np
dataframe = pd.read_csv("HKTW_INDIA_Duplicate_Account.csv")
dataframe.info()
name = dataframe["PARTY_NAME"]
duplicate_data = dataframe[name.isin(name[name.duplicated()])].sort_values("PARTY_NAME")
duplicate_data.head()
What I want: I have a set of duplicated records that I need to merge based on certain conditions, populating the feedback in a new column.
I could also do this manually in Excel, but the record count is very high (more than 400,000 rows), which would consume a lot of time.
Primary Account ID  Secondary Account ID  Account Name  Translated Name  Created on Date  Amount Today  Amount Total  Split  Remarks  New ID
              1234                   245  Julia         Julia            24-May-20                 530            45  N
              2345                        Julia         Julia            24-Sep-20                                    N
              3456                    42  Sara          Sara             24-Aug-20                 230                Y
              4567                        Sara          Sara             24-Sep-20                                    Y
              5678                        Matt          Matt             24-Jun-20                                    N
              6789                        Matt          Matt             24-Sep-20                                    N
              7890                    58  Robert        Robert           24-Feb-20                 525            21  N
              1937                        Robert        Robert           24-Sep-20                                    N
              7854                    55  Robert        Robert           24-Jan-20                 543            74  N
Conditions:
1. Only those accounts can be merged where Split is "N" and both Amount_Total and Amount_Today are blank.
2. Whether or not there is a value in Secondary_Account_ID matters. Example: row 2 has no Secondary Account ID and no value in Amount_Total or Amount_Today, but row 1 does have a Secondary_Account_ID, so row 2 can be merged into row 1 because both have the same name. The Remarks column should then note that the winner account has a secondary ID (rows 1 and 2), and the Account ID from row 1 should be copied into the New ID column of both rows.
3. If duplicate accounts have Amount_Total and Amount_Today values, they should not be merged.
4. If none of the duplicate accounts has a value in Secondary_Account_ID, check the Amount_Today and Amount_Total columns: the account without values in those two columns is merged into the one that has them.
5. If more than one duplicate account has a Secondary ID and Amount_Today or Amount_Total is available for exactly one of them, that account is the winner account.
6. If more than one duplicate account has a Secondary ID and Amount_Today or Amount_Total is available for more than one of them, the account with the maximum Amount_Total is the winner account.
7. If Secondary_Account_ID, Amount_Total, and Amount_Today are all blank, the oldest account should be taken as the winner.
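A minimal sketch of the winner-selection part, assuming column names such as Primary_Account_ID, Secondary_Account_ID, Amount_Today, Amount_Total and Created_On (none of these are confirmed by the file) and implementing only a subset of the rules above:

import pandas as pd

# column names below are assumptions based on the description
df = pd.read_csv('HKTW_INDIA_Duplicate_Account.csv', parse_dates=['Created_On'])

def pick_winner(group):
    # rank the duplicates of one name: prefer a Secondary_Account_ID,
    # then any amount, then the largest Amount_Total, then the oldest date
    ranked = group.assign(
        has_secondary=group['Secondary_Account_ID'].notna(),
        has_amount=group[['Amount_Today', 'Amount_Total']].notna().any(axis=1),
    ).sort_values(
        by=['has_secondary', 'has_amount', 'Amount_Total', 'Created_On'],
        ascending=[False, False, False, True],
    )
    return ranked['Primary_Account_ID'].iloc[0]

# every duplicate of a name gets the winner's ID in New_ID
winners = df.groupby('PARTY_NAME').apply(pick_winner).rename('New_ID')
df = df.join(winners, on='PARTY_NAME')

The Split == "N" and blank-amount eligibility check, and the Remarks text, would layer on top of this.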

How to get unique values with many different URLs

I have a dataframe that looks something like the one below.
  Product                        URLs  Company
0   shoes     www.walmart.com/12va15a  walmart
1   shoes  www.costco.com/1apsd-dfasx   costco
2   pants      www.amazon.com/adsffa1      NaN
3   shirt     www.Amazon.com/fas19axl   Amazon
4   shoes    www.walmart.com/ywsg141q      NaN
I'm not sure if pandas can pull the company names out of the URLs column and fill them into the NaNs in the Company column.
The dataframe that I will like looks like that below
  Product                        URLs  Company
0   shoes     www.walmart.com/12va15a  walmart
1   shoes  www.costco.com/1apsd-dfasx   costco
2   pants      www.amazon.com/adsffa1   amazon
3   shirt     www.Amazon.com/fas19axl   amazon
4   shoes    www.walmart.com/ywsg141q  walmart
Edit: I have lowercased all the URLs, but I'm not sure how to extract just the keywords like amazon, costco, etc. Thanks
Use Series.str.extract to grab the value between the first and second dot:
df.Company = df.URLs.str.lower().str.extract(r'\.(.+)\.', expand=False)
print (df)
  Product                        URLs  Company
0   shoes     www.walmart.com/12va15a  walmart
1   shoes  www.costco.com/1apsd-dfasx   costco
2   pants      www.amazon.com/adsffa1   amazon
3   shirt     www.Amazon.com/fas19axl   amazon
4   shoes    www.walmart.com/ywsg141q  walmart
If you want to replace only the missing values, combine it with Series.fillna:
df.Company = df.Company.fillna(df.URLs.str.lower().str.extract(r'\.(.+)\.', expand=False))
Lowercase your URLs before processing:
df.URLs = df.URLs.str.lower()
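Putting it together as a self-contained sketch, filling only the missing Company values:

import pandas as pd

df = pd.DataFrame({
    'Product': ['shoes', 'pants'],
    'URLs': ['www.walmart.com/12va15a', 'www.amazon.com/adsffa1'],
    'Company': ['walmart', None],
})

# pull out whatever sits between the first and second dot, lowercased
extracted = df['URLs'].str.lower().str.extract(r'\.(.+)\.', expand=False)

# fill only the rows where Company is NaN
df['Company'] = df['Company'].fillna(extracted)
print(df)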

Creating a dictionary of categoricals in SQL and aggregating them in Python

I have a rather cross-platform question. I hope it is not too general.
One of my tables, say customers, consists of my customer id's and their associated demographic information. Another table, say transaction, contains all purchases from the customers in the respective shops.
I am interested in analyzing basket compositions together with demographics in Python. Hence, I would like to have the shops as columns and each customer's summed amounts at those shops in my dataframe.
For clarity,
select *
from customer
where id=1 or id=2
gives me
id  age  gender
 1   35    MALE
 2   57  FEMALE
and
select *
from transaction
where id=1 or id=2
gives me
customer_id  shop  amount
          1     2     250
          1     2     500
          2     3     100
          2     7     200
          2    11     125
Which should end up in a (preferably) Pandas dataframe as
id  age  gender  shop_2  shop_3  shop_7  shop_11
 1   35    MALE     750       0       0        0
 2   57  FEMALE       0     100     200      125
Such that the last columns are the aggregated baskets of the customers.
I have tried to create a python dictionary of the purchases and amounts for each customer in SQL in the following way:
select customer_id, array_agg(concat(cast(shop as varchar), ' : ', cast(amount as varchar))) as basket
from transaction
group by customer_id
Resulting in
id  basket
 1  ['2 : 250', '2 : 500']
 2  ['3 : 100', '7 : 200', '11 : 125']
which could easily be joined on the customer table.
However, this solution is not optimal because the values inside the brackets are strings, not integers. It therefore requires a lot of manipulation and looping in Python to get the data into the format I want.
Is there any way where I can aggregate the purchases in SQL making it easier for python to read and aggregate into columns?
One simple solution would be to do the aggregation in pandas using pivot_table on the second dataframe and then merge with the first:
df2 = df2.pivot_table(columns='shop', values='amount', index='customer_id', aggfunc='sum', fill_value=0.0).reset_index()
df = pd.merge(df1, df2, left_on='id', right_on='customer_id')
Resulting dataframe:
id  age  gender    2    3    7   11
 1   35    MALE  750    0    0    0
 2   57  FEMALE    0  100  200  125
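If you also want the shop_ prefix from the desired output above, a sketch with the sample data (add_prefix renames the pivoted columns):

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'age': [35, 57], 'gender': ['MALE', 'FEMALE']})
df2 = pd.DataFrame({'customer_id': [1, 1, 2, 2, 2],
                    'shop': [2, 2, 3, 7, 11],
                    'amount': [250, 500, 100, 200, 125]})

# one column per shop, summed amounts, zeros where a customer never shopped there
wide = df2.pivot_table(index='customer_id', columns='shop', values='amount',
                       aggfunc='sum', fill_value=0)
wide = wide.add_prefix('shop_').reset_index()

df = pd.merge(df1, wide, left_on='id', right_on='customer_id').drop(columns='customer_id')
print(df)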

Regress by group in pandas dataframe and add columns with forecast values and beta/t-stats

Here is an example of my dataframe df:
  Category            Y           X1           X2
0    Apple  0.083050996  0.164056482  0.519875358
1    Apple  0.411044939  0.774160332  0.002869499
2    Apple  0.524315907  0.422193005  0.977200910
3    Apple  0.721124638  0.645927536  0.750210715
4    Berry  0.134488729  0.299288214  0.522933484
5    Berry  0.733162132  0.608742944  0.957595544
6    Berry  0.113051075  0.641533175  0.197996350
7    Berry  0.275379123  0.249143751  0.049082766
8   Carrot  0.588121494  0.750480977  0.615399987
9   Carrot  0.878221581  0.021366296  0.069184879
Now I want the code to run a regression for each Category (i.e., a cross-sectional regression grouped by Category: one for Apple, one for Berry, one for Carrot, etc.).
Then I want to add a new column df['Y_hat'] with the forecast values from the regression, plus the corresponding two beta and two t-statistic values (the beta and t-stat values would repeat across rows of the same category).
The final df would have 5 additional columns: Y_hat, beta 1, beta 2, t-stat 1 and t-stat 2.
You want to do a lot of things for a "GroupBy" :)
I think it is better to slice the DataFrame by Category, store each category's results in a dictionary, and build your final DataFrame from that dictionary at the end of the loop. The sketch below uses statsmodels OLS for the regression itself; that library choice is an assumption, so swap in whatever estimator you prefer:
import pandas as pd
import statsmodels.api as sm  # assumed choice of regression library

result = {}
# loop on every category
for category in df['Category'].unique():
    # slice out the rows for this category
    df_slice = df[df['Category'] == category]
    # regress Y on X1 and X2 (with an intercept)
    X = sm.add_constant(df_slice[['X1', 'X2']])
    model = sm.OLS(df_slice['Y'], X).fit()
    # forecast values go back onto the matching rows
    df.loc[df_slice.index, 'Y_hat'] = model.predict(X)
    # store the per-category betas and t-stats
    result[category] = {
        'beta 1': model.params['X1'],
        'beta 2': model.params['X2'],
        't-stat 1': model.tvalues['X1'],
        't-stat 2': model.tvalues['X2'],
    }

# build a dataframe of the per-category results and broadcast
# them onto every row of the matching category
final_df = df.join(pd.DataFrame(result).T, on='Category')
It will be much easier if you ever need to debug, too! Good luck! :)
