I have a pandas dataframe:
col1
johns id is 81245678316
eric bought 82241624316 yesterday
mine is87721624316
frank is a genius
i accepted new 82891224316again
I want to create a new column with dummy variables (0, 1) depending on col1: if there are 11 digits in a row starting with 8, it must be 1, otherwise 0.
So I wrote this code:
df["is_number"] = df.col1.str.contains(r"\b8\d{10}").map({True: 1, False: 0})
However, the output is:
col1 is_number
johns id is 81245678316 1
eric bought 82241624316 yesterday 1
mine is87721624316 0
frank is a genius 0
i accepted new 82891224316again 0
As you see, the third and fifth rows have 0 in "is_number", but I want them to have 1 even though the space between words and numbers is missing in some places. How do I do that? I want:
col1 is_number
johns id is 81245678316 1
eric bought 82241624316 yesterday 1
mine is87721624316 1
frank is a genius 0
i accepted new 82891224316again 1
You can use numeric boundaries, as the numbers in your input can be "glued" to letters (letters are word characters, so there is no word boundary between a letter and an 8):
df["is_number"] = df['col1'].str.contains(r"(?<!\d)8\d{10}(?!\d)").map({True: 1, False: 0})
Output:
>>> df
col1 is_number
0 johns id is 81245678316 1
1 eric bought 82241624316 yesterday 1
2 mine is87721624316 1
3 frank is a genius 0
4 i accepted new 82891224316again 1
You just need to remove the \b, which stands for a word boundary, since you do not care whether there is a boundary or not. Note that without any boundary the pattern will also match inside a longer digit run; if that matters, use the lookaround version above.
df["is_number"] = df.col1.str.contains(r"8\d{10}").map({True: 1, False: 0})
The solution can be as simple as yours, except that '\b' must be removed, because it requires a word boundary at that position:
df.col1.str.contains(r"8\d{10}").astype(int)
If you want exactly 11 digits, not more, then demand that the characters before and after the eleven digits either do not exist or are not digits:
df.col1.str.contains(r"(^|\D)8\d{10}($|\D)").astype(int)
I have this data in a DataFrame df, where Names is a column name and below it are its values:
Names
------
23James
0Sania
4124Thomas
101Craig
8Rick
How can I return it to this:
Names
------
James
Sania
Thomas
Craig
Rick
I tried df.strip, but certain numbers still remain in the DataFrame.
You can also extract all characters after digits using a capture group:
df['Names'] = df['Names'].str.extract(r'^\d+(.*)')
print(df)
# Output
Names
0 James
1 Sania
2 Thomas
3 Craig
4 Rick
We can use str.replace here with the regex pattern ^\d+, which targets leading digits.
df["Names"] = df["Names"].str.replace(r'^\d+', '')
The answer by Tim certainly solves this, but I usually feel uncomfortable using regex as I'm not proficient with it, so I would approach it like this:
def removeStartingNums(s):
    count = 0
    for i in s:
        if i.isnumeric():
            count += 1
        else:
            break
    return s[count:]

df["Names"] = df["Names"].apply(removeStartingNums)
What the function essentially does is count the leading characters that are numeric, and then return the string with those starting characters sliced off.
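For instance, on the sample data (a quick check, assuming the DataFrame from the question):
import pandas as pd

df = pd.DataFrame({"Names": ["23James", "0Sania", "4124Thomas", "101Craig", "8Rick"]})
df["Names"] = df["Names"].apply(removeStartingNums)
print(df)
#     Names
# 0   James
# 1   Sania
# 2  Thomas
# 3   Craig
# 4    Rick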
I am stuck on a problem with a dataframe that has a column of film names containing a bunch of non-Latin names, like Japanese or Chinese (and maybe Russian too). My code is:
df['title'].head(5)
1 I am legend
2 wonder women
3 アライヴ
4 怪獣総進撃
5 dead sea
I just want an output that removes every title with non-Latin characters, i.e. I want to remove every row that contains characters like those in rows 3 and 4. My desired output is:
df['title'].head(5)
1 I am legend
2 wonder women
5 dead sea
6 the rig
7 altitude
Any help with this code?
You can use str.match with the Latin-1 character range to identify non-Latin characters, and use the boolean output to slice the data:
df_latin = df[~df['title'].str.match(r'.*[^\x00-\xFF]')]
output:
title
1 I am legend
2 wonder women
5 dead sea
6 the rig
7 altitude
You can encode your title column with unicode_escape and then decode it back to latin1. If this round trip does not reproduce your original data, remove the row, because it contains some non-Latin characters:
df = df[df['title'] == df['title'].str.encode('unicode_escape').str.decode('latin1')]
print(df)
# Output
title
0 I am legend
1 wonder women
3 dead sea
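To see why the comparison works, here is what the round trip does to each kind of string (a small illustration):
>>> "アライヴ".encode('unicode_escape').decode('latin1')
'\\u30a2\\u30e9\\u30a4\\u30f4'
>>> "I am legend".encode('unicode_escape').decode('latin1')
'I am legend'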
You can use the isascii() method (if you're using Python 3.7+). Example:
"I am legend".isascii() # True
"アライヴ".isascii() # False
Even if you have just one non-ASCII letter, the isascii() method will return False.
(Note that for strings like '34?#5' the method will return True, because those are all ASCII characters.)
We can easily make a function that returns whether a string is ASCII or not, and based on that we can then filter our dataframe.
import pandas as pd

dict_1 = {'col1': list(range(1, 6)),
          'col2': ['I am legend', 'wonder women', 'アライヴ', '怪獣総進撃', 'dead sea']}

def check_ascii(string):
    # True when every character in the string is ASCII
    return string.isascii()

df = pd.DataFrame(dict_1)
df['is_eng'] = df['col2'].apply(check_ascii)
df2 = df[df['is_eng']]
df2
Output
   col1          col2  is_eng
0     1   I am legend    True
1     2  wonder women    True
4     5      dead sea    True
Hello, I have a dataframe where I want to remove the specific set of characters 'Fwd:' from every row that starts with it. The issue I am facing is that the code I am using is removing anything that starts with the letter 'F'.
my dataframe looks like this:
summary
0 Fwd: Please look at the attached documents and take action
1 NSN for the ones who care
2 News for all team members
3 Fwd: Please take action on the action needed items
4 Fix all the mistakes please
When I used the code:
df['Clean Summary'] = individual_receivers['summary'].map(lambda x: x.lstrip('Fwd:'))
I end up with a dataframe that looks like this:
summary
0 Please look at the attached documents and take action
1 NSN for the ones who care
2 News for all team members
3 Please take action on the action needed items
4 ix all the mistakes please
I don't want the last row to lose the F in 'Fix'.
You should use a regex, remembering that ^ anchors the match to the start of the string:
df['Clean Summary'] = df['summary'].str.replace(r'^Fwd: ', '', regex=True)
Here's an example:
df = pd.DataFrame({'msg':['Fwd: o','oe','Fwd: oj'],'B':[1,2,3]})
df['clean_msg'] = df['msg'].str.replace(r'^Fwd: ', '', regex=True)
print(df)
Output:
msg B clean_msg
0 Fwd: o 1 o
1 oe 2 oe
2 Fwd: oj 3 oj
You are not only losing 'F' but also 'w', 'd', and ':'. This is the way lstrip works: it strips all leading characters that appear anywhere in the passed string, in any combination.
You should actually use x.replace('Fwd:', '', 1)
The final argument 1 ensures that only the first occurrence of the string is removed.
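Plugged back into the original map call, a minimal sketch (assuming the 'summary' column from the question; 'Fwd: ' includes the trailing space so no leading blank is left behind, and the startswith guard keeps any mid-string occurrence untouched):
df['Clean Summary'] = df['summary'].map(
    lambda x: x.replace('Fwd: ', '', 1) if x.startswith('Fwd: ') else x)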
I have a dataframe where one column consists of strings that have three patterns:
1) Upper case letters only: APPLE COMPANY
2) Upper case letters and ends with the letters AS: CAR COMPANY AS
3) Upper and lower case letters: John Smith
df = pd.DataFrame({'NAME': ['APPLE COMPANY', 'CAR COMPANY AS', 'John Smith']})
NAME ...
0 APPLE COMPANY ...
1 CAR COMPANY AS ...
2 John Smith ...
3 ... ...
How can I take out the rows that do not meet the conditions of 2) and 3), i.e. the rows of type 1)? In other words, how can I remove rows that consist of UPPER case letters only and do not end with AS?
I came up with this:
df['NAME'].str.findall(r"(^[A-Z ':]+$)")
df['NAME'].str.findall('AS')
The first one extracts strings with only upper case letters, but the second one only finds AS. If there are methods other than regex, I am happy to try those as well.
Expected outcome is:
NAME ...
1 CAR COMPANY AS ...
2 John Smith ...
3 ... ...
This regex should work:
^(?:[A-Z ':]+ AS|.*[a-z].*)$
It matches either one of these:
[A-Z ':]+ AS - The case of all uppercase letters followed by AS
.*[a-z].* - The case of lowercase letters
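Applied in pandas, a minimal sketch (assuming the df from the question; str.contains with the anchored pattern keeps only the rows that match):
mask = df['NAME'].str.contains(r"^(?:[A-Z ':]+ AS|.*[a-z].*)$")
print(df[mask])
#              NAME
# 1  CAR COMPANY AS
# 2      John Smith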
One way would be:
df['temp'] = df['NAME'].str.extract(r"(^[A-Z ':]+$)", expand=False)
s1 = df['temp'] == df['NAME']
s2 = ~df['NAME'].str.endswith('AS')
print(df.loc[~(s1 & s2), 'NAME'])
Output:
1 CAR COMPANY AS
2 John Smith
Name: NAME, dtype: object
Also you can try:
df_new = df[~df['NAME'].str.isupper() | df['NAME'].str.endswith('AS')]
Using apply and different patterns that you may want to check:
import re

def myfilter(x):
    patterns = ['[A-Z]*AS$', '[A-Z][a-z]{1,}']
    for p in patterns:
        if len(re.findall(p, x.NAME)):
            return True
    return False

selector = df.apply(myfilter, axis=1)
filtered_df = df[selector]
There are a lot of similar questions, but I have not found a solution for my problem. I have a data frame with the following structure/form:
col_1
0 BULKA TARTA 500G KAJO 1
1 CUKIER KRYSZTAL 1KG KSC 4
2 KASZA JĘCZMIENNA 4*100G 2 0.92
3 LEWIATAN MAKARON WSTĄŻKA 1 0.89
However, I want to achieve the effect:
col_1
0 BULKA TARTA 500G KAJO
1 CUKIER KRYSZTAL 1KG KSC
2 KASZA JĘCZMIENNA 4*100G
3 LEWIATAN MAKARON WSTĄŻKA
So I want to remove the standalone natural and decimal numbers, but leave the numbers that are attached to letters.
I tried to use df.col_1.str.isdigit().replace([True, False], [np.nan, df.col_1]), but it only checks whether the entire cell is a number or not.
Do you have some ideas how to do it? Or maybe it would be good to split the column on spaces and then compare?
We could create a function that tries to convert the string to float; if that fails, we return True (not a float):
import pandas as pd

df = pd.DataFrame({"col_1": ["BULKA TARTA 500G KAJO 1",
                             "CUKIER KRYSZTAL 1KG KSC 4",
                             "KASZA JĘCZMIENNA 4*100G 2 0.92",
                             "LEWIATAN MAKARON WSTĄŻKA 1 0.89"]})

def is_not_float(string):
    try:
        float(string)
        return False
    except ValueError:  # string is not a number
        return True

df["col_1"] = df["col_1"].apply(lambda x: [i for i in x.split(" ") if is_not_float(i)])
df
Or, following the example of my fellow SO users (however, this would treat 130. as a number):
df["col_1"] = df["col_1"].apply(
    lambda x: [i for i in x.split(" ") if not i.replace(".", "").isnumeric()])
Returns
col_1
0 [BULKA, TARTA, 500G, KAJO]
1 [CUKIER, KRYSZTAL, 1KG, KSC]
2 [KASZA, JĘCZMIENNA, 4*100G]
3 [LEWIATAN, MAKARON, WSTĄŻKA]
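If you want plain strings back rather than lists (as in the expected output), you can join the words again, e.g.:
df["col_1"] = df["col_1"].str.join(" ")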
Sure, you could use a regex. Note that re.sub works on a single string, not on a whole Series, so apply it per row; the pattern below (one possible choice) removes numbers that stand alone between spaces or at either end of the string:
import re

df.col_1 = df.col_1.apply(
    lambda s: re.sub(r"(?:^|\s)\d+(?:\.\d+)?(?=\s|$)", "", s).strip())
Yes you can:
def no_nums(col):
    return ' '.join(filter(lambda word: not word.replace('.', '').isdigit(), col.split()))

df.col_1 = df.col_1.apply(no_nums)
This filters out of each value any word that is made up entirely of digits, possibly containing a decimal point.
If you want to filter out numbers like 1,000 as well, simply add another replace for ',', as sketched below.
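A minimal variant handling a thousands separator too (assuming ',' only ever appears as a separator):
def no_nums(col):
    # drop words that are pure numbers once '.' and ',' are stripped
    return ' '.join(filter(lambda word: not word.replace('.', '').replace(',', '').isdigit(),
                           col.split()))

df.col_1 = df.col_1.apply(no_nums)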