Use regex on list items to replace whole word - python

I have a large dataset all_transcripts with conversations and a small list gemeentes containing names of different cities. In all_transcripts, I want to replace every instance of a city name with 'woonplaats' (Dutch for city).
To do so, I have the following code:
all_transcripts['filtered'] = all_transcripts['no_punc'].str.replace('|'.join(gemeentes),' woonplaats ')
However, this replaces every occurrence of the character sequence, not just whole words.
What I'm looking for is something like:
all_transcripts['filtered'] = all_transcripts['no_punc'].re.sub('|'r"\b{}\b".format(join(gemeentes)),' woonplaats ')
But this doesn't work.
As an example, I have:
all_transcripts['no_punc'] = ['i live in amsterdam', 'i come from haarlem', 'groningen is her favourite city']
gemeentes = ['amsterdam', 'rotterdam', 'den haag', 'haarlem', 'groningen']
The output that I want, after I run the code is as follows:
>>> ['i live in woonplaats', 'i come from woonplaats', 'woonplaats is her favourite city']
I've worked with regex's \b word-boundary anchor before, but I don't know how to apply it here. I could run a for loop for each word in gemeentes and apply it to the whole dataset, but given the size (gemeentes has over 300 entries and all_transcripts over 2.5 million rows), this would be very computationally expensive. I would therefore like an approach similar to the one above, in which I replace via a single pattern using the OR operator.

It looks like you're close, but you'll want to change your re.sub call a little. Something like this should work:
import re

gemeentes = ['amsterdam', 'rotterdam', 'den haag', 'haarlem', 'groningen']
# One alternation of all city names, anchored at word boundaries
pattern = r"\b({})\b".format("|".join(gemeentes))
all_transcripts['filtered'] = [re.sub(pattern, "woonplaats", s) for s in all_transcripts['no_punc']]
Output
['i live in woonplaats', 'i come from woonplaats', 'woonplaats is her favourite city']
As for performance, I'm not sure you'll get better speed out of this than from a traditional for loop, since you're still looping over the 2.5 million entries and applying the regex to each.
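If the regex work itself becomes the bottleneck, one thing that may help (a sketch, assuming all_transcripts is a pandas DataFrame) is to escape the city names, put longer alternatives first, build the pattern once, and let pandas apply it vectorized:

import re

# Escape each name and try longer ones first, so multi-word cities
# like 'den haag' are not shadowed by a shorter alternative.
pattern = r"\b(?:{})\b".format(
    "|".join(map(re.escape, sorted(gemeentes, key=len, reverse=True))))

all_transcripts['filtered'] = all_transcripts['no_punc'].str.replace(
    pattern, 'woonplaats', regex=True)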

If you are using a pandas DataFrame, then you can use the following:
import pandas as pd
all_transcripts['filtered'] = all_transcripts['no_punc'].replace(['amsterdam', 'rotterdam', 'den haag', 'haarlem', 'groningen'], 'woonplaats', regex=True)
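Note that replace(..., regex=True) matches anywhere in a string, not only at whole words. If you need the whole-word behaviour the question asks for, one option (a sketch) is to wrap each name in \b yourself:

import re

all_transcripts['filtered'] = all_transcripts['no_punc'].replace(
    {r'\b{}\b'.format(re.escape(city)): 'woonplaats' for city in gemeentes},
    regex=True,
)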

Search for terms in list within dataframe column, add terms found to new column

Example dataframe:
data = pd.DataFrame({'Name': ['Nick', 'Matthew', 'Paul'],
                     'Text': ["Lived in Norway, England, Spain and Germany with his car",
                              "Used his bikes in England. Loved his bike",
                              "Lived in Alaska"]})
Example list:
example_list = ["England", "Bike"]
What I need
I want to create a new column, called x, where if a term from example_list is found as a string/substring in data.Text (case-insensitively), the matched word is added to the new column.
Output
So in row 1, the word England was found and returned, and bike was found and returned, as well as bikes (of which bike is a substring).
Progress so far:
I have managed, with the following code, to return terms that match regardless of case. However, it won't find substrings... e.g. if I search for "bike" and it finds "bikes", I want it to return "bikes".
import re

pattern = fr'({"|".join(example_list)})'
data['Text'] = data['Text'].str.findall(pattern, flags=re.IGNORECASE).str.join(", ")
I think I might have found a solution for your pattern there:
pattern = fr'({"|".join("[a-zA-Z]*" + ex + "[a-zA-Z]*" for ex in example_list)})'
data['x'] = data['Text'].str.findall(pattern, flags=re.IGNORECASE).str.join(",")
Basically what I do is extend the pattern by optionally allowing letters before the word (I think you don't explicitly mention this; maybe it has to be omitted) and after it.
As output I get the following (shown here as a plain list, since the exact display may differ):
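>>> data['x'].tolist()
['England', 'bikes,England,bike', '']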
I'm just not sure which format you want this x-column in. In your code you join the matches with commas (which I followed here), but in your example output you only show a list of the values. If you specify this, I can update my solution.

What is the most efficient way to grab and store part of a given string between keywords with python?

I have an array of keywords:
keyword_list = ['word1', 'anotherWord', 'wordup', 'word to your papa']
I have a string of text:
string_of_text = '''So this is a string of text. I want to talk about anotherWord...and then I'm going to say something I've been meaning to say "wordup". But I also wanted to say the following: word to your papa. And lastly I wanted to talk about word1...'''
I want to return the following:
{'list_word': 'word1', 'string_of_text_after': '...'}, {'list_word': 'anotherWord', 'string_of_text_after': '...and then I'm going to say something I've been meaning to say "'}, {'list_word': 'wordup', 'string_of_text_after': '". But I also wanted to say the following: '}, {'list_word': 'word to your papa', 'string_of_text_after': '. And lastly I wanted to talk about '}
As you can see it is a list of dictionaries with the list word and then the text that comes after the list word item but only until the next list word is detected is when it would discontinue.
What would be the most efficient way to do this in Python (Python 3 or later; 2 is also OK if there are any issues with deprecated methods)?
You could try something like this:
keyword_list = ['word1', 'anotherWord', 'wordup', 'word to your papa']
string_of_text = """So this is a string of text. I want to talk about anotherWord...\
and then I'm going to say something I've been meaning to say "wordup".\
But I also wanted to say the following: word to your papa.\
And lastly I wanted to talk about word1..."""
def extract(keywords, text):
    # Map each keyword to its length so we can slice just past the match.
    lengths = {k: len(k) for k in keywords}
    # Use != -1 so a keyword at position 0 is not skipped.
    return [{"list_word": k, "string_of_text_after": text[text.find(k) + lengths[k]:]}
            for k in lengths if text.find(k) != -1]

from pprint import pprint
pprint(extract(keyword_list, string_of_text))
Result:
[{'list_word': 'wordup',
  'string_of_text_after': '". But I also wanted to say the following: word to your papa. And lastly I wanted to talk about word1...'},
 {'list_word': 'word1', 'string_of_text_after': '...'},
 {'list_word': 'anotherWord',
  'string_of_text_after': '... and then I\'m going to say something I\'ve been meaning to say "wordup". But I also wanted to say the following: word to your papa. And lastly I wanted to talk about word1...'},
 {'list_word': 'word to your papa',
  'string_of_text_after': '. And lastly I wanted to talk about word1...'}]
ATTENTION
This code has several caveats:
the keyword_list has to contain unique elements;
the call text.find(k) is executed twice per keyword;
the function returns a list, which must be held in memory. This could be avoided by returning a generator instead:
return ({"list_word": k, "string_of_text_after": text[text.find(k) + lengths[k]:]}
        for k in lengths if text.find(k) != -1)
and consuming it where and when needed.
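If you also need each snippet to stop at the next keyword, as the question asks, here is a rough sketch of one way to do it (my addition, not the code above), using re.finditer; the re.escape and longest-first sort are there so overlapping keywords like 'word1' and 'word to your papa' match correctly:

import re

# Longest keywords first, so 'word to your papa' is preferred over 'word1'.
pattern = re.compile("|".join(
    re.escape(k) for k in sorted(keyword_list, key=len, reverse=True)))

matches = list(pattern.finditer(string_of_text))
result = [
    {"list_word": m.group(0),
     "string_of_text_after": string_of_text[
         m.end():(matches[i + 1].start() if i + 1 < len(matches)
                  else len(string_of_text))]}
    for i, m in enumerate(matches)
]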
Good luck ! :)

How can words with misspellings be corrected in a data frame?

I have a data frame and I want the misspelled words in it to be corrected. First I delete characters that are repeated more than twice in a word, and then I apply spell correction. So far I can only apply the first part to individual strings; I want to be able to apply it to the data frame as well. How can I do this?
text='Aye concreeete steel and plastic housesss will keep us alll safe and flourishing ?'
import re

def reduce_lengthening(text):
    # Collapse any character repeated three or more times down to two.
    pattern = re.compile(r"(.)\1{2,}")
    return pattern.sub(r"\1\1", text)

print('string is: ', reduce_lengthening(text))
Output string is:
Aye concreete steel and plastic housess will keep us all safe and flourishing ?
How can I apply this function to the following data frame?
text=['dear pados wali anttty , can just keep your thoughts and nose out business raising_hands thaaaank .',
'but least did not call him losers suckers , juuust was did not want the cemetery and honor them , big deal.',
'some hunters are just entitled , you are lucky have them.',
'thin corrrect time that.. only one person could save from this crisis .. correct sarthak ? ?',
'thereee also the wuhan virus. that totally different ?',
'does nooot every woman hav adam apple amp; flat hairy chest ?']
import pandas as pd
df=pd.DataFrame()
df['Text']=text
You can do it with the apply function like below:
df["Text"] = df["Text"].apply(reduce_lengthening)
Or, before adding the column (using df['Text'] = text), you can pass each element to reduce_lengthening in a list comprehension and store the resulting list:
df["Text"] = [reduce_lengthening(x) for x in text]

Create a list of alphabetically sorted UNIQUE words and display the first N words in python

I am new to Python, apologize for a simple question. My task is the following:
Create a list of alphabetically sorted unique words and display the first 5 words
I have text variable, which contains a lot of text information
I did
test = text.split()
sorted(test)
As a result, I receive a list that starts with symbols like $ and numbers.
How do I get to the words and print the first N of them?
I'm assuming that by "word" you mean strings consisting only of alphabetical characters. In that case, you can use the filter built-in to first get rid of the unwanted strings, turn the result into a set, sort it, and then print your stuff.
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Keep only the purely alphabetic tokens
words = filter(lambda x: x.isalpha(), text.split(' '))
# First 5 unique words, sorted
sorted(set(words))[:5]
Output-
['atop', 'king', 'mountain', 'of', 'peak']
But the problem with this is that it will still ignore words like mountain's, because of that pesky '. A regex solution might actually be far better in such a case-
For now, we'll go with this regex: ^[A-Za-z']+$, which means the string must contain only letters and '; you may add more to this regex according to what you deem a "word".
We'll be using re.match instead of .isalpha this time.
import re

WORD_PATTERN = re.compile(r"^[A-Za-z']+$")
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Keep only tokens made of letters and apostrophes
words = filter(lambda x: bool(WORD_PATTERN.match(x)), text.split(' '))
# First 5 unique words, sorted
sorted(set(words))[:5]
Output-
['atop', 'king', 'mountain', "mountain's", 'of']
Keep in mind, however, that this gets tricky when you have a string like hi! What's your name?. Here hi! and name? are both words, except they are not fully alphabetic. The trick is to split the text in such a way that you get hi instead of hi! and name instead of name? in the first place.
Unfortunately, a true word split is far outside the scope of this question. I suggest taking a look at this question
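As a small sketch of that idea, you can extract word-like runs directly instead of splitting on spaces, so trailing punctuation never becomes part of a token:

import re

s = "hi! What's your name?"
# Word-like runs only: '!' and '?' never stick to the tokens.
print(re.findall(r"[A-Za-z']+", s))  # ['hi', "What's", 'your', 'name']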
I am a newbie here, apologies for mistakes. Thank you.
test = '''The coronavirus outbreak has hit hard the cattle farmers in Pabna and Sirajganj as they are now getting hardly any customer for the animals they prepared for the last year targeting the Eid-ul-Azha this year.
Normally, cattle traders flock in large numbers to the belt -- one of the biggest cattle producing areas of the country -- one month ahead of the festival, when Muslims slaughter animals as part of their efforts to honour Prophet Ibrahim's spirit of sacrifice.
But the scene is different this year.'''
test = test.lower().split()
test2 = sorted([j for j in test if j.isalpha()])
print(test2[:5])
You can slice the sorted list to take elements up to position 5:
sorted(test)[:5]
or if looking only for words
sorted([i for i in test if i.isalpha()])[:5]
or by regex
import re
sorted([i for i in test if re.search(r"[a-zA-Z]", i)])[:5]
By using a slice of a list you will be able to get all list elements up to a specific index, in this case 5.

Extracting titles from a text

I'm working on a project, where I have to extract honorific titles (Mr, Mrs, St, etc.) from a novel. The desired output with the text I'm working with is:
['Col', 'Dr', 'Mr', 'Mrs', 'Otto', 'Rev', 'St']
However, with the code I wrote, the output is this:
{'Tom.', 'Mrs.', 'Otto.', 'Mary.', 'Bots.', 'Come.', 'No.', 'Col.', 'Cain.', 'Dr.', 'Gang.', 'Ike.', 'Kean.', 'St.', 'Hank.', 'Him.', 'Finn.', 'Ann.', 'Jane.', 'Alas.', 'Huck.', 'Sis.', 'Buck.', 'Jim.', 'Sid.', 'Mr.', 'Bill.', 'Rev.', 'Yes.'}
This is the code I have so far:
import re

def get_titles(text):
    pattern = re.compile(r'[A-Z][a-z]{1,3}\.')
    title_tokens = set(re.findall(pattern, text))
    pattern2 = re.compile(r'[A-Z][a-z]{1,3}')
    pseudo_titles = set(re.findall(pattern2, text))
    pseudo_titles = [word.strip() for word in pseudo_titles]
    pseudo_titles = [word.replace('\n', '') for word in pseudo_titles]
    difference = title_tokens.difference(pseudo_titles)
    return difference

test = get_titles(text)
print(test)
As you can notice, the output gives me additional words with periods in them. I believe the issue stems from the regular expressions, but I'm not sure. Any advice or tips are appreciated.
The text can be found here: http://www.gutenberg.org/files/76/76-0.txt
Essentially, you are asking for an algorithm which can tell the difference between a title and a one-word sentence. These are lexically indistinguishable; for example, consider the following two strings:
"Do I know who did this? Yes. Smith did it."
"Do I know who did this? Mr. Smith did it."
In the first sentence, "Yes." is a one-word sentence, and in the second, "Mr." is a title. As humans we only know this because we understand the meanings of the tokens "Yes" and "Mr"; so an algorithm which is able to distinguish between these cases requires some information about the meanings of the tokens it's parsing. It cannot work purely lexically like a regex does. This means you must either write a whitelist of allowed titles, or a blacklist of words which are not titles, or otherwise the problem is much more difficult.
Alternatively, if your project doesn't involve parsing titles from very many novels, you could just trim down the results by hand, using your human knowledge that "Tom" and "Yes" aren't titles. It shouldn't be that much work.
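For example, here is a minimal sketch of the whitelist approach; the whitelist contents are my own assumption, and a nonstandard title such as 'Otto' is only found if you add it explicitly:

import re

KNOWN_TITLES = {'Col', 'Dr', 'Mr', 'Mrs', 'Ms', 'Otto', 'Rev', 'St'}

def get_titles(text):
    # Candidate titles: short capitalized words followed by a period.
    candidates = re.findall(r'\b([A-Z][a-z]{1,3})\.', text)
    # Keep only the ones on the whitelist.
    return sorted(set(candidates) & KNOWN_TITLES)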
