Issue Finding Substring - python

I'm trying to find a substring Chief in the df column. Its working fine with split() on text with spaces but not working as expected with find().
sum(df['JobTitle'].apply(lambda x :'chief' in x.lower().split() ))
sum(df['JobTitle'].apply(lambda x : x.lower().find('chief') ==1))
Can you please highlight what the issue in find usage is here?

You can try with re:
import re
# if it appears, add 1, else add 0
sum(df['JobTitle'].apply(lambda x : int(bool(re.findall(r'\bchief\b', x.lower()))))
# add the number of times the word appears
sum(df['JobTitle'].apply(lambda x : len(re.findall(r'\bchief\b', x.lower())))
EDIT
If you want to catch chief but not words with chief inside, like mischief, use r'\bchief\b'
Demo : https://regex101.com/r/jYOfM1/1

Related

How to find required word in novel in python?

I have a text and I have got a task in python with reading module:
Find the names of people who are referred to as Mr. XXX. Save the result in a dictionary with the name as key and number of times it is used as value. For example:
If Mr. Churchill is in the novel, then include {'Churchill' : 2}
If Mr. Frank Churchill is in the novel, then include {'Frank Churchill' : 4}
The file is .txt and it contains around 10-15 paragraphs.
Do you have ideas about how can it be improved? (It gives me error after some words, I guess error happens due to the reason that one of the Mr. is at the end of the line.)
orig_text= open('emma.txt', encoding = 'UTF-8')
lines= orig_text.readlines()[32:16267]
counts = dict()
for line in lines:
wordsdirty = line.split()
try:
print (wordsdirty[wordsdirty.index('Mr.') + 1])
except ValueError:
continue
Try this:
text = "When did Mr. Churchill told Mr. James Brown about the fish"
m = [x[0] for x in re.findall('(Mr\.( [A-Z][a-z]*)+)', text)]
You get:
['Mr. Churchill', 'Mr. James Brown']
To solve the line issue simply read the entire file:
text = file.read()
Then, to count the occurrences, simply run:
Counter(m)
Finally, if you'd like to drop 'Mr. ' from all your dictionary entries, use x[0][4:] instead of x[0].
This can be easily done using regex and capturing group.
Take a look here for reference, in this scenario you might want to do something like
# retrieve a list of strings that match your regex
matches = re.findall("Mr\. ([a-zA-Z]+)", your_entire_file) # not sure about the regex
# then create a dictionary and count the occurrences of each match
# if you are allowed to use modules, this can be done using Counter
Counter(matches)
To access the entire file like that, you might want to map it to memory, take a look at this question

remove words starting with "#" in a column from a dataframe

I have a dataframe called tweetscrypto and I am trying to remove all the words from the column "text" starting with the character "#" and gather the result in a new column "clean_text". The rest of the words should stay exactly the same:
tweetscrypto['clean_text'] = tweetscrypto['text'].apply(filter(lambda x:x[0]!='#', x.split()))
it does not seem to work. Can somebody help?
Thanks in advance
Please str.replace string starting with #
Sample Data
text
0 News via #livemint: #RBI bars banks from links
1 Newsfeed from #oayments_source: How Africa
2 is that bitcoin? not my thing
tweetscrypto['clean_text']=tweetscrypto['text'].str.replace('(\#\w+.*?)',"")
Still, can capture # without escaping as noted by #baxx
tweetscrypto['clean_text']=tweetscrypto['text'].str.replace('(#\w+.*?)',"")
clean_text
0 News via : bars banks from links
1 Newsfeed from : How Africa
2 is that bitcoin? not my thing
In this case it might be better to define a method rather than using a lambda for mainly readability purposes.
def clean_text(X):
X = X.split()
X_new = [x for x in X if not x.startswith("#")
return ' '.join(X_new)
tweetscrypto['clean_text'] = tweetscrypto['text'].apply(clean_text)

Find values using regex (includes brackets)

it's my first time with regex and I have some issues, which hopefully you will help me find answers. Let's give an example of data:
chartData.push({
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
});
var newDate = new Date();
newDate.setFullYear(
2007,
10,
1 );
Want I want to retrieve is to get the date which is the last bracket and the corresponding description. I have no idea how to do it with one regex, thus I decided to split it into two.
First part:
I retrieve the value after the description:. This was managed with the following code:[\n\r].*description:\s*([^\n\r]*) The output gives me the result with a quote "9710" but I can fairly say that it's alright and no changes are required.
Second part:
Here it gets tricky. I want to retrieve the values in brackets after the text newDate.setFullYear. Unfortunately, what I managed so far, is to only get values inside brackets. For that, I used the following code \(([^)]*)\) The result is that it picks all 3 brackets in the example:
"{
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
}",
"()",
"2007,
10,
1 "
What I am missing is an AND operator for REGEX with would allow me to construct a code allowing retrieval of data in brackets after the specific text.
I could, of course, pick every 3rd result but unfortunately, it doesn't work for the whole dataset.
Does anyone of you know the way how to resolve the second part issue?
Thanks in advance.
You can use the following expression:
res = re.search(r'description: "([^"]+)".*newDate.setFullYear\((.*)\);', text, re.DOTALL)
This will return a regex match object with two groups, that you can fetch using:
res.groups()
The result is then:
('9710', '\n2007,\n10,\n1 ')
You can of course parse these groups in any way you want. For example:
date = res.groups()[1]
[s.strip() for s in date.split(",")]
==>
['2007', '10', '1']
import re
test = r"""
chartData.push({
date: 'newDate',
visits: 9710,
color: "#016b92",
description: "9710"
})
var newDate = new Date()
newDate.setFullYear(
2007,
10,
1);"""
m = re.search(r".*newDate\.setFullYear(\(\n.*\n.*\n.*\));", test, re.DOTALL)
print(m.group(1).rstrip("\n").replace("\n", "").replace(" ", ""))
The result:
(2007,10,1)
The AND part that you are referring to is not really an operator. The pattern matches characters from left to right, so after capturing the values in group 1 you cold match all that comes before you want to capture your values in group 2.
What you could do, is repeat matching all following lines that do not start with newDate.setFullYear(
Then when you do encounter that value, match it and capture in group 2 matching all chars except parenthesis.
\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);
Regex demo | Python demo
Example code
import re
regex = r"\r?\ndescription: \"([^\"]+)\"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);"
test_str = ("chartData.push({\n"
"date: newDate,\n"
"visits: 9710,\n"
"color: \"#016b92\",\n"
"description: \"9710\"\n"
"});\n"
"var newDate = new Date();\n"
"newDate.setFullYear(\n"
"2007,\n"
"10,\n"
"1 );")
print (re.findall(regex, test_str))
Output
[('9710', '\n2007,\n10,\n1 ')]
There is another option to get group 1 and the separate digits in group 2 using the Python regex PyPi module
(?:\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(|\G)\r?\n(\d+),?(?=[^()]*\);)
Regex demo

Regex to find alpha&numeric words in a text

I've got this text:
due to previous assess c6c587469 and 4ec0f198
nearest and with fill station in the citi
becaus of our satisfact in the d4a29a already
averaging my thoughts on e977f33588f react to
and I want to remove all "alpha&numeric" words
In output, I want
due to previous assess and
nearest and with fill station in the citi
becaus of our satisfact in the already
averaging my thoughts on react to
I tried this, but it doesn't work..
df_colum = df_colum.str.replace('[^A-Za-z0-9\s]+', '')
Any regex expert ?
Thanks
Try using this regex:
df_colum = df_colum.str.replace('\w*\d\w*', '')
Here's one way without regex:
def parser(x):
return ' '.join([i for i in x.split() if not any(c.isdigit() for c in i)])
df['text'] = df['text'].apply(parser)
print(df)
text
0 due to previous assess and
1 nearest and with fill station in the citi
2 becaus of our satisfact in the already
3 averaging my thoughts on react to
This one should work:
df_colum = df_colum.str.replace('(?:[0-9][^ ]*[A-Za-z][^ ]*)|(?:[A-Za-z][^ ]*[0-9][^ ]*)', '')
Explanation of the regex can be found here
You can look for where a digit meets a letter \d[a-z] or [a-z]\d then match up to end:
(?i)\b(?:[a-z]+\d+|\d+[a-z]+)\w*\b *
Live demo
(?i) Enables case-insensitivity
(?:...) Constructs a non-capturing group
\b Means a word boundary
Python code:
re.sub(r"\b(?:[a-z]+\d+|\d+[a-z]+)\w*\b *", "", str)

Using regular expressions in python to extract location mentions in a sentence

I am writing a code using python to extract the name of a road,street, highway, for example a sentence like "There is an accident along Uhuru Highway", I want my code to be able to extract the name of the highway mentioned, I have written the code below.
sentence="there is an accident along uhuru highway"
listw=[word for word in sentence.lower().split()]
for i in range(len(listw)):
if listw[i] == "highway":
print listw[i-1] + " "+ listw[i]
I can achieve this but my code is not optimized, i am thinking of using regular expressions, any help please
'uhuru highway' can be found as follows
import re
m = re.search(r'\S+ highway', sentence) # non-white-space followed by ' highway'
print(m.group())
# 'uhuru highway'
If the location you want to extract will always have highway after it, you can use:
>>> sentence = "there is an accident along uhuru highway"
>>> a = re.search(r'.* ([\w\s\d\-\_]+) highway', sentence)
>>> print(a.group(1))
>>> uhuru
You can do the following without using regexes:
sentence.split("highway")[0].strip().split(' ')[-1]
First split according to "highway". You'll get:
['there is an accident along uhuru', '']
And now you can easily extract the last word from the first part.

Categories

Resources