I have a data frame df which consists of supplier information (data at the invoice level). Suppliers sometimes have non-standard names, for example:
Invoice no.   Product name   Supplier Name
1             product 1      Pepsico
2             product 2      Pepsi
3             product 3      Peppsi
4             product 4      Mountain Dew
All of the above rows have the same supplier - Pepsi - but it has been registered under different names. How do I identify such rows so that I can standardize all such entries?
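One possible direction (a minimal sketch, not necessarily the best approach) is fuzzy string matching against a list of canonical supplier names, for example with difflib from the standard library. The canonical list and the cutoff below are assumptions for illustration:

import difflib
import pandas as pd

df = pd.DataFrame({'Invoice no.': [1, 2, 3, 4],
                   'Product name': ['product 1', 'product 2', 'product 3', 'product 4'],
                   'Supplier Name': ['Pepsico', 'Pepsi', 'Peppsi', 'Mountain Dew']})

# hypothetical list of canonical supplier names
canonical = ['Pepsi', 'Coca Cola']

def standardize(name, cutoff=0.6):
    # return the closest canonical name, or the original if nothing is close enough
    match = difflib.get_close_matches(name, canonical, n=1, cutoff=cutoff)
    return match[0] if match else name

df['Supplier Name (std)'] = df['Supplier Name'].apply(standardize)

Note that purely textual matching only catches spelling variants (Pepsico, Peppsi); brand-to-supplier relations such as Mountain Dew -> Pepsi would still need an explicit lookup table.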
I have a string: String = 'Oil - this company'
In my dataframe df1:
id CompanyName
1 Oil - this company
2 oil
3 oily
4 comp
I want to keep the rows whose CompanyName matches part of the string.
My final df should be: df1
id CompanyName
1 Oil - this company
2 oil
I tried:
df = df[df['CompanyName'].str.contains(String)]
but it dropped the second row (2, oil).
Is there any way to keep the rows whose CompanyName matches part of the string?
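One reading of the desired output (keep 'oil' but drop 'oily' and 'comp') is that CompanyName must match either the whole string or one of its words, case-insensitively. A minimal sketch under that assumption:

import pandas as pd

String = 'Oil - this company'
df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'CompanyName': ['Oil - this company', 'oil', 'oily', 'comp']})

# the words of the search string, lowercased, plus the full string itself
words = set(String.lower().split()) | {String.lower()}

df1 = df1[df1['CompanyName'].str.lower().isin(words)]
print(df1)

This keeps rows 1 and 2 and drops 'oily' and 'comp'; if partial overlaps such as 'comp' should also be kept, the rule would need to be relaxed.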
I want to create a column which essentially shows the data type of the data within an Excel spreadsheet, i.e. whether the data within any given cell is a string, an integer, a float, etc. Currently I'm working with mocked-up data to test with, and I hope to eventually use this for larger Excel files with more field headers.
My current high-level method is as follows:
1. Read the Excel file and create a dataframe.
2. Re-format this table to create a column of all data I wish to label with a data type (i.e. whether it is a string, integer or float), alongside the respective field headers.
3. Create a 'Data Type' column which will contain these labels for each piece of data, populated from a dictionary of regular expressions.
import os
from glob import glob
import pandas as pd
from os import path
import re
sample_file = 'C:/Users/951297/Documents/Python Scripts/DD\\Fund_Data.xlsx'
dataf = pd.read_excel(sample_file)
dataf
FUND ID FUND NAME AMOUNT
0 10101 Holdings company A 10000
1 20202 Holdings company B 2000.5
2 30303 Holdings company C 3000
# Create column list of data attributes
stackdf = dataf.stack().reset_index()
stackdf = stackdf.rename(columns={'level_0':'index','level_1':'fh',0:'attribute'})
# Create a duplicate column of attribute to apply regex
stackdf_regex = stackdf.iloc[:,2:].rename(columns = {'attribute':'Data Type'})
# Dictionary of regex to replace values within the 'Data Type' column depending on the attribute
repl_dict = {re.compile(r'^[\d]+$'):'Integer',
re.compile(r'^[a-zA-Z0-9_ ]*$'): 'String',
re.compile(r'[\d]+\.'): 'Float'}
#concatenate tables
pd.concat([stackdf, stackdf_regex], axis=1)
This is the reformatted table I wish to apply my regular expressions onto:
index fh attribute Data Type
0 0 FUND ID 10101 10101
1 0 FUND NAME Holdings company A Holdings company A
2 0 AMOUNT 10000 10000
3 1 FUND ID 20202 20202
4 1 FUND NAME Holdings company B Holdings company B
5 1 AMOUNT 2000.5 2000.5
6 2 FUND ID 30303 30303
7 2 FUND NAME Holdings company C Holdings company C
8 2 AMOUNT 3000 3000
This is the desired output:
index fh attribute Data Type
0 0 FUND ID 10101 Integer
1 0 FUND NAME Holdings company A String
2 0 AMOUNT 10000 Integer
3 1 FUND ID 20202 Integer
4 1 FUND NAME Holdings company B String
5 1 AMOUNT 2000.5 Float
6 2 FUND ID 30303 Integer
7 2 FUND NAME Holdings company C String
8 2 AMOUNT 3000 Integer
However the following code produces the table below:
stackdf_regex = stackdf_regex.replace({'Data Type':repl_dict}, regex=True)
pd.concat([stackdf, stackdf_regex], axis=1)
index fh attribute Data Type
0 0 FUND ID 10101 10101
1 0 FUND NAME Holdings company A String
2 0 AMOUNT 10000 10000
3 1 FUND ID 20202 20202
4 1 FUND NAME Holdings company B String
5 1 AMOUNT 2000.5 2000.5
6 2 FUND ID 30303 30303
7 2 FUND NAME Holdings company C String
8 2 AMOUNT 3000 3000
Perhaps my regular expressions are incorrect, or my understanding of how to apply regular expressions to the dataframe is lacking. I'm happy to receive any suggestions on this current method, or on another suitable/efficient method I have not considered.
Note: I hope to eventually expand the regex dictionary to account for more data types, and I understand it may not be efficient to check every cell for a pattern on larger datasets, but I'm still in the early stages.
You can use np.select, where each condition tests a given regex against the column 'Data Type' using Series.str.contains, and choices corresponds to the conditions:
import numpy as np

conditions = [
    df['Data Type'].str.contains(r'^\d+$'),
    df['Data Type'].str.contains(r'^[\w\s]+$'),
    df['Data Type'].str.contains(r'^\d+\.\d+$')]
choices = ['Integer', 'String', 'Float']

df['Data Type'] = np.select(conditions, choices, default=None)
# print(df)
index fh attribute Data Type
0 0 FUND ID 10101 Integer
1 0 FUND NAME Holdings company A String
2 0 AMOUNT 10000 Integer
3 1 FUND ID 20202 Integer
4 1 FUND NAME Holdings company B String
5 1 AMOUNT 2000.5 Float
6 2 FUND ID 30303 Integer
7 2 FUND NAME Holdings company C String
8 2 AMOUNT 3000 Integer
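One caveat worth flagging (this is an assumption about the stacked frame, not something stated above): depending on the column dtypes, the 'Data Type' column may hold real int/float objects rather than text, in which case .str.contains returns NaN and np.select rejects the non-boolean conditions. A sketch that casts to string first and allows whole numbers that render as '10000.0':

s = df['Data Type'].astype(str)

conditions = [
    s.str.contains(r'^\d+(?:\.0+)?$'),   # whole numbers, possibly rendered as '10000.0'
    s.str.contains(r'^\d+\.\d+$'),       # remaining decimals
    s.str.contains(r'^[\w\s]+$')]        # text
choices = ['Integer', 'Float', 'String']

df['Data Type'] = np.select(conditions, choices, default=None)

Since np.select takes the first matching condition, the Integer check is listed before the Float check.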
I have an Excel file with product names. The first row holds the categories, and the cells below each one are the products in that category (e.g. A1: Water, with A2: Sparkling and A3: Still below it; B1: Soft Drinks, with B2: Coca Cola, B3: Orange Juice, B4: Lemonade below it, etc.). I want to keep this list in a viewable format (not comma-separated etc.), as that makes it very easy for anybody to update the product names (I have a second person running the script without understanding the script).
If it helps, I can also have the Excel file in CSV format, and I can also move the categories from the top row to the first column.
I would like to replace the cells of a dataframe (df) with the product categories. For example, Coca Cola would become Soft Drinks. If the product is not in the Excel file it would not be replaced (e.g. Cookie).
print(df)
Product Quantity
0 Coca Cola 1234
1 Cookie 4
2 Still 333
3 Chips 88
Expected Outcome:
print (df1)
Product Quantity
0 Soft Drinks 1234
1 Cookie 4
2 Water 333
3 Snacks 88
Use DataFrame.melt with DataFrame.dropna, or DataFrame.stack, to build a helper Series, and then use Series.replace:
s = df1.melt().dropna().set_index('value')['variable']
Alternative:
s = df1.stack().reset_index(name='v').set_index('v')['level_1']
df['Product'] = df['Product'].replace(s)
#if performance is important
#df['Product'] = df['Product'].map(s).fillna(df['Product'])
print (df)
Product Quantity
0 Soft Drinks 1234
1 Cookie 4
2 Water 333
3 Snacks 88
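As a rough illustration of what the helper Series does, here is a minimal self-contained sketch with a hypothetical category frame (column headers are the categories, the cells below are the products), standing in for the frame read with pd.read_excel:

import pandas as pd

# hypothetical category sheet: headers are categories, cells below are products
cats = pd.DataFrame({'Water': ['Sparkling', 'Still'],
                     'Soft Drinks': ['Coca Cola', 'Orange Juice'],
                     'Snacks': ['Chips', None]})

# product -> category mapping
s = cats.melt().dropna().set_index('value')['variable']

df = pd.DataFrame({'Product': ['Coca Cola', 'Cookie', 'Still', 'Chips'],
                   'Quantity': [1234, 4, 333, 88]})
df['Product'] = df['Product'].replace(s)
print(df)

Products that are not in the category sheet (here Cookie) are left untouched, which matches the expected outcome above.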
I have a rather "cross-platform" question. I hope it is not too general.
One of my tables, say customers, consists of my customer IDs and their associated demographic information. Another table, say transaction, contains all purchases from the customers in the respective shops.
I am interested in analyzing basket compositions together with demographics in Python. Hence, I would like to have the shops as columns and the summed amount per customer per shop in my dataframe.
For clarity,
select *
from customer
where id=1 or id=2
gives me
id age gender
1 35 MALE
2 57 FEMALE
and
select *
from transaction
where id=1 or id=2
gives me
customer_id shop amount
1 2 250
1 2 500
2 3 100
2 7 200
2 11 125
Which should end up in a (preferably) Pandas dataframe as
id age gender shop_2 shop_3 shop_7 shop_11
1 35 MALE 750 0 0 0
2 57 FEMALE 0 100 200 125
Such that the last columns are the aggregated baskets of the customers.
I have tried to create a python dictionary of the purchases and amounts for each customer in SQL in the following way:
select customer_id, array_agg(concat(cast(shop as varchar), ' : ', cast(amount as varchar))) as basket
from transaction
group by customer_id
Resulting in
id basket
1 ['2 : 250', '2 : 500']
2 ['3 : 100', '7 : 200', '11 : 125']
which could easily be joined on the customer table.
However, this solution is not optimal, because the values inside the [] are strings rather than integers. Hence, it involves a lot of manipulation and looping in Python to get it into the format I want.
Is there any way where I can aggregate the purchases in SQL making it easier for python to read and aggregate into columns?
One simple solution would be to do the aggregation in pandas using pivot_table on the second dataframe and then merge with the first:
df2 = df2.pivot_table(columns='shop', values='amount', index='customer_id', aggfunc='sum', fill_value=0.0).reset_index()
df = pd.merge(df1, df2, left_on='id', right_on='customer_id')
Resulting dataframe:
id age gender 2 3 7 11
1 35 MALE 750 0 0 0
2 57 FEMALE 0 100 200 125
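If the shop_2, shop_3, ... column names from the desired layout are wanted, one option (a sketch using the same df1/df2 names as above) is to prefix the pivoted columns before merging:

df2 = (df2.pivot_table(columns='shop', values='amount', index='customer_id',
                       aggfunc='sum', fill_value=0)
          .add_prefix('shop_')
          .reset_index())
df = pd.merge(df1, df2, left_on='id', right_on='customer_id')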
I currently have 2 datasets
1 = Drugs prescribed per hospital
2 = Crimes committed
I have been able to assign the located hospital ID to the various crimes so therefore I can identify which hospital is closer.
What I really would like to do is assign the amount of drugs prescribed (using the value_counts method) to the hospital ID in the Crime data, so that I can then plot a scatter matrix of where the crimes took place against the total quantity of drugs prescribed by the closest hospital.
I have tried using the following
df = Crimes.merge(hosp[['hosp no', 'Total Quantity']],
left_on='hosp_no', right_on='hosp no').drop('hosp no', 1)
df
However, when I use the above code, the Hosp ID associated with the crime changes, and I don't want it to!
I am new to jupyter notebook so I would be most grateful for any help!!
Thank you in advance
Crimes df
ID Type Hosp No
0 Anti-Social 222
Hosp df
Hosp no Total Quantity Drug name
222 1000 Paracetamol
So basically, Hosp 222 has prescribed 1000 units of Paracetamol. How can I assign the number 1000 to the Crimes df where Hosp No = 222, so that it looks like this:
Crimes df
ID Type Hosp No Total Quantity
0 Anti-Social 222 1000
If the columns you are merging on share the same name, you don't need the on parameter. Since you need the column added to Crimes, we can use how='left':
Crimes = Crimes.merge(Hosp[['Hosp No', 'Total Quantity']], how = 'left')
ID Type Hosp No Total Quantity
0 0 Anti-Social 222 1000
Let me know if this is the desired output or if you need anything else.
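One thing to watch (going by the tables shown above, where the key appears as 'Hosp No' in Crimes but 'Hosp no' in Hosp): if the spelling or capitalisation really does differ, pandas will not line the columns up automatically, so rename first. A minimal sketch:

Hosp = Hosp.rename(columns={'Hosp no': 'Hosp No'})
Crimes = Crimes.merge(Hosp[['Hosp No', 'Total Quantity']], how='left')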