pandas keep row if it contains part of string - python

I have a string as String = 'Oil - this company
In my dataframe df1:
id CompanyName
1 Oil - this company
2 oil
3 oily
4 comp
I want to keep the rows that contain part of CompanyName
My final df should be: df1
id CompanyName
1 Oil - this company
2 oil
I tried:
df = df[df['CompanyName'].str.contains(String)]
but it deleted the second row 2 oil
Is there any way to keep the company Name that contains part of the string?

Related

Standardize textual data in a python dataframe

I have a data frame df which consists of supplier information (data at the invoice level). Suppliers sometimes have non-standard names, for eg:
Invoice no.
Product name
Supplier Name
1
product 1
Pepsico
2
product 2
Pepsi
3
product 3
Peppsi
4
product 4
Mountain Dew
All of the above rows have the same supplier - Pepsi, but it has been registered with different names. How do I identify such kinds of rows to standardize all such entries?

Pandas transforming chronological rows to columns

I have a table of work experiences where each row represents a job in chronological order from the first job to the most recent job. For data science purposes I'm trying to create a new table based on this table that displays new job attributes and old job attributes on the same row. For example, the original table would be like:
uniqueID
personID
startdate
enddate
title
functions
1
A1
1/1/21
12/1/21
Analyst
data science
2
A1
1/1/22
12/1/22
Manager
admin
The new table would be something like this:
uniqueID
personID
new_title
new_function
old_title
old_function
1
A1
Analyst
data science
nan
nan
2
A1
Manager
admin
Analyst
data science
I tried to use some groupby variations but haven't been able to get this result.
If I understand correctly, you're looking for a shift:
cols = ['title', 'functions']
df[['old_' + c for c in cols]] = df.groupby('personID')[cols].shift(1)
df = df.drop(['startdate', 'enddate'], axis=1).rename({c: 'new_' + c for c in cols}, axis=1)
Output:
>>> df
uniqueID personID new_title new_functions old_title old_functions
0 1 A1 Analyst data science NaN NaN
1 2 A1 Manager admin Analyst data science

Removing the rows from dataframe till the actual column names are found

I am reading tabular data from the email in the pandas dataframe.
There is no guarantee that column names will contain in the first row.Sometimes data is in the following format.The actual column names are [ID,Name and Year]
dummy1 dummy2 dummy3
test_column1 test_column2 test_column3
ID Name Year
1 John Sophomore
2 Lisa Junior
3 Ed Senior
Sometimes the column names come in the first row as expected.
ID Name Year
1 John Sophomore
2 Lisa Junior
3 Ed Senior
Once I read the HTML table from the email,how I remove the initial rows that don't contain the column names?So in the first case I would need to remove first 2 rows in the dataframe(including column row) and in the second case,i wouldn't have to remove anything.
Also,the column names can be in any sequence.
basically,I want to do in following
1.check whether once of the column names contains in one of the rows in dataframe
2.Remove the rows above
if "ID" in row:
remove the above rows
How can I achieve this?
You can first get index of valid columns and then filter and set accordingly.
df = pd.read_csv("d.csv",sep='\s+', header=None)
col_index = df.index[(df == ["ID","Name","Year"]).all(1)].item() # get columns index
df.columns = df.iloc[col_index].to_numpy() # set valid columns
df = df.iloc[col_index + 1 :] # filter data
df
ID Name Year
3 1 John Sophomore
4 2 Lisa Junior
5 3 Ed Senior
or
If you want to se ID as index
df = df.iloc[col_index + 1 :].set_index('ID')
df
Name Year
ID
1 John Sophomore
2 Lisa Junior
3 Ed Senior
Ugly but effective quick try:
id_name = df.columns[0]
df_clean = df[(df[id_name] == 'ID') | (df[id_name].dtype == 'int64')]

Creating a pandas column from a dictionary of regular expressions

I want to create a column which essentially shows the data type of the data within an excel spreadsheet, i.e. if the data within any given cell is a string or an integer or a float etc. Currently I'm working with mocked up data to test with and hope to eventually use this for larger excel files with more field headers.
My Current high level method is as follows:
Read Excel file and create a dataframe
Re-format this table to create a column of all data I wish to label with a data type (i.e if it is a string, integer or float), alongside the respective field headers.
Create a 'Data Type' column which will contain these labels for each piece of data which is populated by the corresponding data types held in a dictionary of regular expressions
import os
from glob import glob
import pandas as pd
from os import path
import re
sample_file = 'C:/Users/951297/Documents/Python Scripts/DD\\Fund_Data.xlsx'
dataf = pd.read_excel(sample_file)
dataf
FUND ID FUND NAME AMOUNT
0 10101 Holdings company A 10000
1 20202 Holdings company B 2000.5
2 30303 Holdings company C 3000
# Create column list of data attributes
stackdf= dataf.stack().reset_index()
stackdf = stackdf.rename(columns={'level_0':'index','level_1':'fh',0:'attribute'})
# Create a duplicate column of attribute to apply regex
stackdf_regex = stackdf.iloc[:,2:].rename(columns = {'attribute':'Data Type'})
# Dictionary of regex to replace values within the 'Data Type' column depending on the attribute
repl_dict = {re.compile(r'^[\d]+$'):'Integer',
re.compile(r'^[a-zA-Z0-9_ ]*$'): 'String',
re.compile(r'[\d]+\.'): 'Float'}
#concatenate tables
pd.concat([stackdf, stackdf_regex], axis=1)
This is the reformatted table I wish to apply my regular expressions onto:
index fh attribute Data Type
0 0 FUND ID 10101 10101
1 0 FUND NAME Holdings company A Holdings company A
2 0 AMOUNT 10000 10000
3 1 FUND ID 20202 20202
4 1 FUND NAME Holdings company B Holdings company B
5 1 AMOUNT 2000.5 2000.5
6 2 FUND ID 30303 30303
7 2 FUND NAME Holdings company C Holdings company C
8 2 AMOUNT 3000 3000
This is the desired output:
index fh attribute Data Type
0 0 FUND ID 10101 Integer
1 0 FUND NAME Holdings company A String
2 0 AMOUNT 10000 Integer
3 1 FUND ID 20202 Integer
4 1 FUND NAME Holdings company B String
5 1 AMOUNT 2000.5 Float
6 2 FUND ID 30303 Integer
7 2 FUND NAME Holdings company C String
8 2 AMOUNT 3000 Integer
However the following code produces the table below:
stackdf_regex = stackdf_regex.replace({'Data Type':repl_dict}, regex=True)
pd.concat([stackdf, stackdf_regex], axis=1)
index fh attribute Data Type
0 0 FUND ID 10101 10101
1 0 FUND NAME Holdings company A String
2 0 AMOUNT 10000 10000
3 1 FUND ID 20202 20202
4 1 FUND NAME Holdings company B String
5 1 AMOUNT 2000.5 2000.5
6 2 FUND ID 30303 30303
7 2 FUND NAME Holdings company C String
8 2 AMOUNT 3000 3000
Perhaps my regular expressions are incorrect or my understanding is lacking in applying the regular expressions on the dataframe. Happy to receive any suggestions on this current method or another suitable/efficient method I have not considered.
Note: I hope to eventually expand the regex dictionary to account for more data types and I understand it may not be efficient to check every cell for a pattern for larger datasets but I'm still in the early stages.
You can use, np.select, where each of the conditions test a given regex to the column Data Type using Series.str.contains and choices corresponds to the conditions:
conditions = [
df['Data Type'].str.contains(r'^\d+$'),
df['Data Type'].str.contains(r'^[\w\s]+$'),
df['Data Type'].str.contains(r'^\d+\.\d+$')]
choices = ['Interger', 'String', 'Float']
df['Data Type'] = np.select(conditions, choices, default=None)
# print(df)
index fh attribute Data Type
0 0 FUND ID 10101 Interger
1 0 FUND NAME Holdings company A String
2 0 AMOUNT 10000 Interger
3 1 FUND ID 20202 Interger
4 1 FUND NAME Holdings company B String
5 1 AMOUNT 2000.5 Float
6 2 FUND ID 30303 Interger
7 2 FUND NAME Holdings company C String
8 2 AMOUNT 3000 Interger

Use Excel sheet to create dictionary in order to replace values

I have an excel file with product names. First row is the category (A1: Water, A2: Sparkling, A3:Still, B1: Soft Drinks, B2: Coca Cola, B3: Orange Juice, B4:Lemonade etc.), each cell below is a different product. I want to keep this list in a viewable format (not comma separated etc.) as this is very easy for anybody to update the product names (I have a second person running the script without understanding the script)
If it helps I can also have the excel file in a CSV format and I can also move the categories from the top row to the first column
I would like to replace the cells of a dataframe (df) with the product categories. For example, Coca Cola would become Soft Drinks. If the product is not in the excel it would not be replaced (ex. Cookie).
print(df)
Product Quantity
0 Coca Cola 1234
1 Cookie 4
2 Still 333
3 Chips 88
Expected Outcome:
print (df1)
Product Quantity
0 Soft Drinks 1234
1 Cookie 4
2 Water 333
3 Snacks 88
Use DataFrame.melt with DataFrame.dropna or DataFrame.stack for helper Series and then use Series.replace:
s = df1.melt().dropna().set_index('value')['variable']
Alternative:
s = df1.stack().reset_index(name='v').set_index('v')['level_1']
df['Product'] = df['Product'].replace(s)
#if performance is important
#df['Product'] = df['Product'].map(s).fillna(df['Product'])
print (df)
Product Quantity
0 Soft Drinks 1234
1 Cookie 4
2 Water 333
3 Snacks 88

Categories

Resources