I am trying to insert a column of True/False values based on validating another column. The difficulty is that the validation pattern depends on a second column, which acts as the key into a dictionary of regex patterns.
E.g.
Table I have:
Type   Value
TypeA  a1111
TypeB  1b111
TypeC  11c11
TypeD  111d1
TypeD  1111e
Dictionary I have:
Column A  Column B
A         \w\d\d\d\d
B         \d\w\d\d\d
C         \d\d\w\d\d
D         \d\d\d\w\d
Result I want:
Type   Value  Result
TypeA  a1111  True
TypeB  1b111  True
TypeC  11c11  True
TypeD  111d1  True
TypeD  1111e  False
Any help would be appreciated!
I have tried playing around with numpy.where() but haven't had much luck.
You can use np.vectorize to wrap a lambda function that takes a Value and a pattern and returns True or False based on the output of re.match.
import re
import numpy as np
import pandas as pd
# Create the dataframe
df = pd.DataFrame({
"Type": ["TypeA", "TypeB", "TypeC", "TypeD", "TypeD"],
"Value": ["a1111", "1b111", "11c11", "111d1", "1111e"]
})
# Create the dictionary dataframe
df_dict = pd.DataFrame({
"Column A": ["A", "B", "C", "D"],
"Column B": [r"\w\d\d\d\d", r"\d\w\d\d\d", r"\d\d\w\d\d", r"\d\d\d\w\d"]
})
# Add "Type" to the beginning of each value to match the "Type" column in the main dataframe
df_dict["Column A"] = "Type" + df_dict["Column A"]
# Merge the two dataframes to get corresponding regex pattern for "Type"
df = df.merge(df_dict, left_on="Type", right_on="Column A")
match_func = np.vectorize(lambda x, pattern: bool(re.match(pattern, x)))  # Create a vectorized function that checks each value against its regex pattern
df["Result"] = match_func(df["Value"], df["Column B"]) # Add the result to the dataframe
df = df.drop(columns=["Column A", "Column B"]) # Drop the columns that are no longer needed
df
Type Value Result
0 TypeA a1111 True
1 TypeB 1b111 True
2 TypeC 11c11 True
3 TypeD 111d1 True
4 TypeD 1111e False
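A hedged alternative sketch that skips the merge entirely (assuming the same df and df_dict, with the "Type" prefix already added to Column A): build a plain dict of patterns and evaluate each row with a list comprehension.
# Look up each row's pattern directly; bool(...) guards against a None match object
patterns = dict(zip(df_dict["Column A"], df_dict["Column B"]))
df["Result"] = [bool(re.match(patterns[t], v)) for t, v in zip(df["Type"], df["Value"])]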
Just One Line
import re
import pandas as pd

df = pd.DataFrame({'Type': ['TypeA', 'TypeB', 'TypeC', 'TypeD', 'TypeD'],
                   'Value': ['a1111', '1b111', '11c11', '111d1', '1111e']})
re_dict = {'A': r'\w\d\d\d\d', 'B': r'\d\w\d\d\d', 'C': r'\d\d\w\d\d', 'D': r'\d\d\d\w\d'}
df['result'] = df.apply(lambda row: re.match(re_dict[row['Type'][-1]], row['Value']) is not None, axis=1)
Type Value result
0 TypeA a1111 True
1 TypeB 1b111 True
2 TypeC 11c11 True
3 TypeD 111d1 True
4 TypeD 1111e False
With Series.map and re.match functions:
d = {'A': r'\w\d\d\d\d', 'B': r'\d\w\d\d\d', 'C': r'\d\d\w\d\d', 'D': r'\d\d\d\w\d'}
df['Result'] = df.Type.str[-1].map(d)
df['Result'] = df.apply(lambda x: bool(re.match(x.Result, x.Value)), axis=1)
Type Value Result
0 TypeA a1111 True
1 TypeB 1b111 True
2 TypeC 11c11 True
3 TypeD 111d1 True
4 TypeD 1111e False
I am importing data which should be categorical from an externally sourced csv file into a pandas dataframe.
The first thing I want to do is to validate that the values are valid for the categorical type.
My strategy is to create an instance of CategoricalDtype and then using apply to test each value.
Question: The only way I can figure out is to test whether each value is in CategoricalDtype.categories.values, but is there a "better" way? Are there any methods I can use to achieve the same? I'm new to CategoricalDtype and it doesn't feel like this is the best way to validate the data values.
# example of what I'm doing
import pandas as pd
from pandas.api.types import CategoricalDtype
df = pd.read_csv('data.csv')
cat = CategoricalDtype(categories=["A", "B", "C"], ordered=False)
df['data_is_valid']=df['data_field'].apply(lambda x: x in cat.categories.values)
If you need to test whether the values of column data_field exist in the categories:
df['data_is_valid']=df['data_field'].isin(cat.categories)
If you also need to test that the column has a categorical dtype:
from pandas.api.types import is_categorical_dtype
df['data_is_valid']=df['data_field'].isin(cat.categories) & is_categorical_dtype(df['data_field'])
The difference can be seen in this data sample:
from pandas.api.types import CategoricalDtype
from pandas.api.types import is_categorical_dtype
df = pd.DataFrame({ "data_field": ["A", "B", "C", "D", 'E']})
cat = CategoricalDtype(categories=["A", "B", "C"], ordered=False)
#categories match, but the column is not Categorical
df['data_is_valid1']=df['data_field'].isin(cat.categories) & is_categorical_dtype(df['data_field'])
#categories match, Categorical dtype not tested
df['data_is_valid2']=df['data_field'].isin(cat.categories)
cat_type = CategoricalDtype(categories=["A", "B", "C", 'D', 'E'], ordered=True)
#created Categorical column
df['data_field'] = df['data_field'].astype(cat_type)
#categories and Categorical dtype both match
df['data_is_valid3']=df['data_field'].isin(cat.categories) & is_categorical_dtype(df['data_field'])
#categories match, Categorical dtype not tested
df['data_is_valid4']=df['data_field'].isin(cat.categories)
print (df)
data_field data_is_valid1 data_is_valid2 data_is_valid3 data_is_valid4
0 A False True True True
1 B False True True True
2 C False True True True
3 D False False False False
4 E False False False False
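A further hedged, self-contained sketch using the cat dtype defined above: casting to a restricted CategoricalDtype turns any value outside its categories into NaN, so notna() marks the valid rows.
# Values not in cat's categories ("D", "E") become NaN after the cast
s = pd.Series(["A", "B", "C", "D", "E"])
s.astype(cat).notna()   # -> True, True, True, False, False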
I have a dataframe as shown below:
>>> import pandas as pd
>>> df = pd.DataFrame(data = [['app;',1,2,3],['app; web;',4,5,6],['web;',7,8,9],['',1,4,5]],columns = ['a','b','c','d'])
>>> df
a b c d
0 app; 1 2 3
1 app; web; 4 5 6
2 web; 7 8 9
3 1 4 5
I have an input array that looks like this: ["app","web"]
For each of these values I want to check against a specific column of a dataframe and return a decision as shown below:
>>> df.a.str.contains("app")
0 True
1 True
2 False
3 False
Since str.contains only allows me to look for an individual value, I was wondering if there's some other direct way to do the same, something like:
df.a.str.contains(["app","web"]) # Returns TypeError: unhashable type: 'list'
My end goal is not an exact match (df.a.isin(["app", "web"])), but rather a 'contains' check that returns True whenever those characters are present anywhere in that cell of the dataframe.
Note: I can of course use apply method to create my own function for the same logic such as:
elementsToLookFor = ["app","web"]
df['result'] = df['a'].apply(lambda element: all(a in element for a in elementsToLookFor))
But I am more interested in the optimal approach, so I would prefer a native pandas function, or failing that, the most optimized custom solution.
This should work too:
l = ["app","web"]
df['a'].str.findall('|'.join(l)).map(lambda x: len(set(x)) == len(l))
also this should work as well:
pd.concat([df['a'].str.contains(i) for i in l],axis=1).all(axis = 1)
So many solutions; which one is the most efficient?
The str.contains-based answers are generally fastest, though str.findall is also very fast on smaller dfs:
values = ['app', 'web']
pattern = ''.join(f'(?=.*{value})' for value in values)
def replace_dummies_all(df):
return df.a.str.replace(' ', '').str.get_dummies(';')[values].all(1)
def findall_map(df):
return df.a.str.findall('|'.join(values)).map(lambda x: len(set(x)) == len(values))
def lower_contains(df):
return df.a.astype(str).str.lower().str.contains(pattern)
def contains_concat_all(df):
return pd.concat([df.a.str.contains(l) for l in values], axis=1).all(1)
def contains(df):
return df.a.str.contains(pattern)
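A minimal timing sketch to compare them (a hedged example: the scaled-up frame, repeat count, and sizes below are arbitrary assumptions, and df is the sample frame from the question):
import timeit

funcs = [replace_dummies_all, findall_map, lower_contains, contains_concat_all, contains]
big_df = pd.concat([df] * 1000, ignore_index=True)  # scale up the sample frame
for func in funcs:
    elapsed = timeit.timeit(lambda: func(big_df), number=10)
    print(f"{func.__name__}: {elapsed:.4f}s")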
Try with str.get_dummies
df.a.str.replace(' ','').str.get_dummies(';')[['web','app']].all(1)
0 False
1 True
2 False
3 False
dtype: bool
Update
df['a'].str.contains(r'^(?=.*web)(?=.*app)')
Update 2 (to handle case insensitivity and to ensure the column dtype is str, without which the logic may fail):
elementList = ['app', 'web']
valueString = ''
for eachValue in elementList:
    valueString += f'(?=.*{eachValue})'
df[header] = df[header].astype(str).str.lower()  # ensure case insensitivity and string dtype; header is the column to check
result = df[header].str.contains(valueString)
For a particular column (dtype = object), how can I add '-' to the start of the string, given it ends with '-'.
i.e. convert 'MAY500-' to '-MAY500-'
(I need to add this to every element in the column)
Try something like this:
#setup
df = pd.DataFrame({'col':['aaaa','bbbb-','cc-','dddddddd-']})
mask = df.col.str.endswith('-')
df.loc[mask, 'col'] = '-' + df.loc[mask, 'col']
Output
df
col
0 aaaa
1 -bbbb-
2 -cc-
3 -dddddddd-
You can use np.select
Given a dataframe like this:
df
values
0 abcd-
1 a-bcd
2 efg-
You can use np.select as follows:
df['values'] = np.select([df['values'].str.endswith('-')], ['-' + df['values']], df['values'])
output:
df
values
0 -abcd-
1 a-bcd
2 -efg-
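An equivalent one-liner with np.where, as a hedged sketch assuming the same df:
import numpy as np

# Prefix '-' only where the value ends with '-', otherwise keep it unchanged
df['values'] = np.where(df['values'].str.endswith('-'), '-' + df['values'], df['values'])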
import numpy as np
import pandas as pd

def add_prefix(text):
# If text is null or empty string the -1 index will result in IndexError
if text and text[-1] == "-":
return "-"+text
return text
df = pd.DataFrame(data={'A':["MAY500", "MAY500-", "", None, np.nan]})
# Change the column to string dtype first
df['A'] = df['A'].astype(str)
df['A'] = df['A'].apply(add_prefix)
0 MAY500
1 -MAY500-
2
3 None
4 nan
Name: A, dtype: object
I have a knack for using apply with lambda functions a lot. It just makes the code a lot easier to read.
df['value'] = df['value'].apply(lambda x: '-'+str(x) if str(x).endswith('-') else x)
I tried the linked answer, but it doesn't work for my example given below.
I tried loc[0] on the output and .item(), but neither helps me.
>>> df2 = pd.DataFrame({ 'Item':['[Phone]', '[Watch]', '[Pen]', '[Pencil]', '[Knife]'], 'RelatedItem': ['[Phone cover]', '[Watch strap]', '[Pen cap]', '[Pencil lead]', '[fork]'], 'CountinInventory':['20','50','40','80','90']})
>>> df2
Item RelatedItem CountinInventory
0 [Phone] [Phone cover] 20
1 [Watch] [Watch strap] 50
2 [Pen] [Pen cap] 40
3 [Pencil] [Pencil lead] 80
4 [Knife] [fork] 90
>>> df2.loc[df2['Item'] == 'Phone', 'RelatedItem']
Series([], Name: RelatedItem, dtype: object)
>>> df2.loc[df2['Item'] == 'Phone', 'RelatedItem', 'CountinInventory']
pandas.core.indexing.IndexingError: Too many indexers
Given this data, when I feed in Phone, I need to get Phone cover along with the CountinInventory value as my answer. Please advise what mistake I am making here.
I believe you need str indexing to remove the first and last [], or use str.strip:
mask = df2['Item'].str[1:-1] == 'Phone'
#alternative solution
#mask = df2['Item'].str.strip('[]') == 'Phone'
print (mask)
0 True
1 False
2 False
3 False
4 False
Name: Item, dtype: bool
If there are no missing values, it is possible to use a list comprehension, which is faster for large data:
mask = [x[1:-1] == 'Phone' for x in df2['Item']]
mask = [x.strip('[]') == 'Phone' for x in df2['Item']]
print (mask)
[True, False, False, False, False]
Last, to select multiple columns, use a list:
df3 = df2.loc[mask, ['RelatedItem', 'CountinInventory']]
print (df3)
RelatedItem CountinInventory
0 [Phone cover] 20
You could also use:
df.loc[df['Item'].str.contains('Phone'), ['RelatedItem', 'CountinInventory']]
The "Too many indexers" error occurs because df.loc[] accepts at most one indexer per axis (a single label, a list of labels, a slice, or a boolean array); passing the two column names as separate arguments gives it three indexers for a two-dimensional frame.
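A minimal contrast sketch (the mask here is illustrative; column names are from the question):
mask = df2['Item'].str.contains('Phone')
df2.loc[mask, ['RelatedItem', 'CountinInventory']]    # one list of column labels -> works
# df2.loc[mask, 'RelatedItem', 'CountinInventory']    # three separate indexers -> IndexingError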
for i in range(len(ingName)):
    x = ingName[i]
    # check if the Sub_Seg_Sub_Cat string contains the ingredient name, ignoring case
    df = df_grocery[df_grocery['Sub_Seg_Sub_Cat'].str.contains(str(ingName[i]), case=False)]
    df['IngredientName'] = ingName[i]
    df['IngredientID'] = ingID[i]
    # write it out to a csv
    df.to_csv("ingred11.csv", mode='a', encoding='utf-8', header=None)
What is the best way to check whether a row in a column contains certain whole words? I am using str.contains, but it matches substrings: it reports that sauc is present in pasta sauce, whereas I want it to check whether the word sauce is present in pasta sauce. Is there any way to achieve this?
Thanks
You can use the isin method:
In [11]: s = pd.Series(["cat", "dog", "sheep"])
In [12]: s.isin(["cat", "dog"])
Out[12]:
0 True
1 True
2 False
dtype: bool
Is this what you are after?
s = pd.Series(["cat dog", "dog", "sheep"])
# check if 'cat' or 'dog' is present in the Series.
s.apply(lambda x: np.max(np.in1d(np.asarray(x.split()), ['cat', 'dog'])))
Out[785]:
0 True
1 True
2 False
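A hedged alternative sketch: if whole-word matching is the goal, str.contains with a word-boundary pattern avoids the split-and-compare step (the pattern construction below is an assumption, not from the original answers):
import re
words = ['cat', 'dog']
# \b anchors the match to word boundaries so substrings inside longer words don't count
pattern = r'\b(?:' + '|'.join(map(re.escape, words)) + r')\b'
s.str.contains(pattern)   # True for "cat dog" and "dog", False for "sheep"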