Python and Pandas: Using a function to replace text

I have a pandas dataframe with 2 columns (Line, Sentence) and I need to count the number of times the word "RESULT" appears in each sentence, but I don't want to count it when it appears as "AS A RESULT" or "WAS THE RESULT", etc. (the actual list is quite long and includes other words).
I had this problem before with a list and I used a little trick: I replaced the confusing strings, ran the count, and then replaced them back with the originals. See the function below (version 1, first pass; version 2, second pass).
import re

def ConfusingStrings(text, version):
    # First pass: mask the confusing phrases so they are not counted
    if version == 1:
        text = re.sub(r"AS A RESULT", "XXXASAREXULT", text)
        text = re.sub(r"WAS THE RESULT", "XXXWASTHEREXULT", text)
    # Second pass: restore the original phrases
    if version == 2:
        text = re.sub(r"XXXASAREXULT", "AS A RESULT", text)
        text = re.sub(r"XXXWASTHEREXULT", "WAS THE RESULT", text)
    return text
Now, with the pandas dataframe, I am trying to use the apply function (see below), but to be honest I cannot get it to work.
df['sentence'] = df.apply(ConfusingStrings(df['sentence'],1), axis=1)
Thanks for any input.
UPDATE:
import pandas as pd
c = pd.DataFrame({'A': [1,2,3,4], 'B':['ABC RESULTS FROM XYZ', 'AS A RESULT WE WILL NOT', 'THE RESULT IS THAT', 'THE BORDER WAS THE RESULT OF'], 'C':[1, 0,1,0]})
print (c)
The outcome I need is something like column C (which I created manually here), but bear in mind that this is a simplification; the actual list of confusing words/expressions is quite long, which is why I want to separate it into a function (easier to update, and it keeps the main code cleaner). So basically I need to create column C via a function, I think.
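For illustration, a minimal sketch (mine, not from the answers below) of how the masking function above could be wired into apply on this example frame:

def count_result(sentence):
    # Mask the confusing phrases first, then count the bare word
    masked = ConfusingStrings(sentence, 1)
    return masked.count("RESULT")

c['C'] = c['B'].apply(count_result)
print(c)
#    A                             B  C
# 0  1          ABC RESULTS FROM XYZ  1
# 1  2       AS A RESULT WE WILL NOT  0
# 2  3            THE RESULT IS THAT  1
# 3  4  THE BORDER WAS THE RESULT OF  0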

Hope this helps: I just created a dummy data frame to include 'ab' and exclude the list 'fc ab', 'ab ac'.
import pandas as pd
df = pd.DataFrame({'A': [1,2,3,4,5,6], 'B':['ab', 'ab ac', 'fc ab', 'ab', 'ab ac', 'fc ab']})
list_to_include = ['ab']
list_to_exclude = ['fc ab', 'ab ac']
df['match'] = df['B'].str.count(r'|'.join(list_to_include)) - df['B'].str.count(r'|'.join(list_to_exclude))
match is the column containing the count. You can also wrap the difference in abs() as a safeguard against negative values.
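Applied to the example frame from the question's UPDATE, the same include/exclude idea might look like this (a sketch; the short lists stand in for the OP's longer one):

import pandas as pd

c = pd.DataFrame({'A': [1, 2, 3, 4],
                  'B': ['ABC RESULTS FROM XYZ', 'AS A RESULT WE WILL NOT',
                        'THE RESULT IS THAT', 'THE BORDER WAS THE RESULT OF'],
                  'C': [1, 0, 1, 0]})

list_to_include = ['RESULT']
list_to_exclude = ['AS A RESULT', 'WAS THE RESULT']

# Count the target word, then subtract the counts of the confusing phrases
c['match'] = (c['B'].str.count('|'.join(list_to_include))
              - c['B'].str.count('|'.join(list_to_exclude)))
# c['match'] now equals [1, 0, 1, 0], matching the hand-made column C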

Related

Search for terms in list within dataframe column, add terms found to new column

Example dataframe:
data = pd.DataFrame({'Name': ['Nick', 'Matthew', 'Paul'],
                     'Text': ["Lived in Norway, England, Spain and Germany with his car",
                              "Used his bikes in England. Loved his bike",
                              "Lived in Alaska"]})
Example list:
example_list = ["England", "Bike"]
What I need
I want to create a new column, called x, where, if a term from example_list is found as a string/substring in data.Text (case insensitive), the word it matched is added to the new column.
Output
So in row 1, the word England was found and returned, and bike was found and returned, as well as bikes (of which bike is a substring).
Progress so far:
I have managed, with the following code, to return terms that match regardless of case; however, it won't find substrings... e.g. if I search for "bike" and it finds "bikes", I want it to return "bikes".
import re

pattern = fr'({"|".join(example_list)})'
data['Text'] = data['Text'].str.findall(pattern, flags=re.IGNORECASE).str.join(", ")
I think I might have found a solution for your pattern there:
pattern = fr'({"|".join("[a-zA-Z]*" + ex + "[a-zA-Z]*" for ex in example_list)})'
data['x'] = data['Text'].str.findall(pattern, flags=re.IGNORECASE).str.join(",")
Basically what I do is extend the pattern by optionally allowing letters before the word (I don't think you explicitly ask for this part, so maybe it should be omitted) and after it.
As output I get the following (shown as a screenshot in the original post). I'm just not sure in which format you want this x column: in your code you join it with commas (which I followed here), but in the picture you have only a list of the values. If you specify this, I can update my solution.
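For reference, a self-contained version of that suggestion; the commented output is what the pattern produces on the example frame (leading/trailing letters are kept, so "bikes" comes back whole):

import re
import pandas as pd

data = pd.DataFrame({'Name': ['Nick', 'Matthew', 'Paul'],
                     'Text': ["Lived in Norway, England, Spain and Germany with his car",
                              "Used his bikes in England. Loved his bike",
                              "Lived in Alaska"]})
example_list = ["England", "Bike"]

# Optional letters on either side of each term return substring hits in full
pattern = fr'({"|".join("[a-zA-Z]*" + ex + "[a-zA-Z]*" for ex in example_list)})'
data['x'] = data['Text'].str.findall(pattern, flags=re.IGNORECASE).str.join(",")
# Expected x column:
# 0               England
# 1    bikes,England,bike
# 2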

Use regex on list items to replace whole word

I have a large dataset all_transcripts with conversations, and I have a small list gemeentes containing the names of different cities. In all_transcripts, I want to replace each instance where the name of a city appears with 'woonplaats' (Dutch for city).
To do so, I have the following code:
all_transcripts['filtered'] = all_transcripts['no_punc'].str.replace('|'.join(gemeentes),' woonplaats ')
However, this replaces each instance in which the word combination appears and not just whole words.
What I'm looking for is something like:
all_transcripts['filtered'] = all_transcripts['no_punc'].re.sub('|'r"\b{}\b".format(join(gemeentes)),' woonplaats ')
But this doesn't work.
As an example, I have:
all_transcripts['no_punc'] = ['i live in amsterdam', 'i come from haarlem', 'groningen is her favourite city']
gemeentes = ['amsterdam', 'rotterdam', 'den haag', 'haarlem', 'groningen']
The output that I want, after I run the code is as follows:
>>> ['i live in woonplaats', 'i come from woonplaats', 'woonplaats is her favourite city']
Before, I've worked with the '\b' option of regex. However, I don't know how to apply it here. I could run a for loop for each word in gemeentes and apply it to the whole dataset, but given the size (gemeentes has over 300 entries and all_transcripts over 2.5 million rows), this would be very computationally expensive, so I would like a similar approach to the one above, in which I replace with a single pattern using the OR operator.
It looks like you're close, but you'll want to change your re.sub call a little. Something like this should work:
import re

gemeentes = ['amsterdam', 'rotterdam', 'den haag', 'haarlem', 'groningen']
all_transcripts['filtered'] = [re.sub(r"\b({})\b".format("|".join(gemeentes)), "woonplaats", s) for s in all_transcripts['no_punc']]
Output
all_transcripts['filtered'] = ['i live in woonplaats', 'i come from woonplaats', 'woonplaats is her favourite city']
As for performance, I'm not sure you'll get better speeds out of this than a traditional for loop, since you're still looping over the 2.5 million entries and applying the regex to each.
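One tweak that may help (a sketch, not from the original answer): build the whole-word pattern once and let pandas' vectorized str.replace do the looping; re.escape guards against metacharacters in the city names:

import re

# Compile the whole-word alternation once
pattern = r"\b(?:{})\b".format("|".join(map(re.escape, gemeentes)))

# Vectorized replacement over the whole column
all_transcripts['filtered'] = all_transcripts['no_punc'].str.replace(
    pattern, 'woonplaats', regex=True)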
If you are using a pandas DataFrame, then you can use the following:
import pandas as pd

all_transcripts['filtered'] = all_transcripts['no_punc'].replace(
    ['amsterdam', 'rotterdam', 'den haag', 'haarlem', 'groningen'],
    'woonplaats', regex=True)

Searching substrings through 3 million records in Python

I have a huge DataFrame with 3M records and a column called description. I also have a set of around 5k possible substrings.
I want to get the rows in which the description contains any of the substrings.
I used the following loop:
import re

trans = []
for i in range(0, len(searchstring)):
    ss = searchstring[i]
    for k in range(0, len(df)):
        desc = df['description'].iloc[k].lower()
        if re.search(ss, desc):
            trans.append(df.iloc[k])
The issue is that it takes too much time, as the search loops 5k times over 3M rows.
Is there a better way to search for substrings?
It should be faster if you use the pandas isin() function.
Example:
import pandas as pd
a = 'Hello world'
ss = a.split(" ")
df = pd.DataFrame({'col1': ['Hello', 'asd', 'asdasd', 'world']})
df.loc[df['col1'].isin(ss)].index
Returns a list of indexes:
Int64Index([0, 3], dtype='int64')
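Note that isin() matches whole cell values, not substrings. For substring matching, one vectorized alternative (a sketch, assuming searchstring is the 5k-substring list from the question) is to fold the list into a single alternation and use str.contains:

import re

# One alternation over all substrings; re.escape guards against metacharacters
pattern = '|'.join(map(re.escape, searchstring))

# Keep the rows whose description contains any of the substrings
trans = df[df['description'].str.contains(pattern, case=False, regex=True)]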
I found an alternative method. I created a word dictionary (an inverted index) for the description column of the 3M data set by splitting each row into words. (I replaced numbers in the description with zeros before generating the dictionary.)
import re

def tokenize(desc):
    # Replace digits with zeros, then split on whitespace
    desc = re.sub(r'\d', '0', desc)
    tokens = re.split(r'\s+', desc)
    return tokens

def make_inv_index(df):
    # Map each token to the list of row indexes that contain it
    inv_index = {}
    for i, tokens in df['description_removed_numbers'].items():
        for token in tokens:
            try:
                inv_index[token].append(i)
            except KeyError:
                inv_index[token] = [i]
    return inv_index

df['description_removed_numbers'] = df['description'].apply(tokenize)
inv_index_df = make_inv_index(df)
Now, while searching the descriptions, I apply the same tokenization to the search string, take the intersection of the index lists for those particular words using the dictionary, and search only those rows. This substantially reduced the overall running time of the program.
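A sketch of what that lookup step might look like; the search helper below is my reconstruction, not code from the original post:

def search(df, inv_index, query):
    # Tokenize the query exactly like the descriptions were tokenized
    tokens = tokenize(query)
    # Intersect the posting lists: only rows containing every token survive
    candidates = set(inv_index.get(tokens[0], []))
    for token in tokens[1:]:
        candidates &= set(inv_index.get(token, []))
    # Only these candidate rows need the full substring check
    return df.loc[sorted(candidates)]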

Replace multiple substrings in a Pandas series with a value

All,
To replace one string in one particular column I have done this and it worked fine:
dataUS['sec_type'].str.strip().str.replace("LOCAL","CORP")
I would now like to replace multiple strings with one string, say replace ["LOCAL", "FOREIGN", "HELLO"] with "CORP".
How can I make it work? The code below didn't work:
dataUS['sec_type'].str.strip().str.replace(["LOCAL", "FOREIGN", "HELLO"], "CORP")
You can perform this task by forming a |-separated string. This works because pd.Series.str.replace accepts regex:
Replace occurrences of pattern/regex in the Series/Index with some other string. Equivalent to str.replace() or re.sub().
This avoids the need to create a dictionary.
import pandas as pd
df = pd.DataFrame({'A': ['LOCAL TEST', 'TEST FOREIGN', 'ANOTHER HELLO', 'NOTHING']})
pattern = '|'.join(['LOCAL', 'FOREIGN', 'HELLO'])
df['A'] = df['A'].str.replace(pattern, 'CORP')
# A
# 0 CORP TEST
# 1 TEST CORP
# 2 ANOTHER CORP
# 3 NOTHING
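One caveat worth flagging (my addition, not part of the original answer): in newer pandas releases, str.replace defaults to regex=False, so pass the flag explicitly for the alternation to be treated as a regex:

# Explicit regex=True keeps this working on pandas >= 2.0, where
# str.replace treats the pattern as a literal string by default
df['A'] = df['A'].str.replace(pattern, 'CORP', regex=True)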
replace can accept a dict, so we just create a dict for the values that need to be replaced:
dataUS['sec_type'].str.strip().replace(dict(zip(["LOCAL", "FOREIGN", "HELLO"], ["CORP"]*3)),regex=True)
The dict looks like this:
dict(zip(["LOCAL", "FOREIGN", "HELLO"], ["CORP"]*3))
Out[585]: {'FOREIGN': 'CORP', 'HELLO': 'CORP', 'LOCAL': 'CORP'}
The reason why you received the error is that str.replace is different from replace.
The answer of @Rakesh is very neat but does not allow for substrings. With a small change, however, it does.
Use a replacement dictionary, because it makes the solution much more generic.
Add the keyword argument regex=True to Series.replace() (not Series.str.replace). This does two things: it switches to regex replacement, which is much more powerful, though you will have to escape special characters, so beware of that; and it makes the replacement work on substrings instead of the entire string. Which is really cool!
replacement = {
    "LOCAL": "CORP",
    "FOREIGN": "CORP",
    "HELLO": "CORP",
}
dataUS['sec_type'].replace(replacement, regex=True)
Full code example
dataUS = pd.DataFrame({'sec_type': ['LOCAL', 'Sample text LOCAL',
                                    'Sample text LOCAL sample FOREIGN']})
replacement = {
    "LOCAL": "CORP",
    "FOREIGN": "CORP",
    "HELLO": "CORP",
}
dataUS['sec_type'].replace(replacement, regex=True)
Output
0                            CORP
1                Sample text CORP
2    Sample text CORP sample CORP
Name: sec_type, dtype: object
@JJP's answer is a good one if you have a long list. But if you have just two or three items, you can simply use '|' within the pattern. Make sure to add the regex=True parameter.
Clearly .str.strip() is not a requirement, but it is good practice.
import pandas as pd
df = pd.DataFrame({'A': ['LOCAL TEST', 'TEST FOREIGN', 'ANOTHER HELLO', 'NOTHING']})
df['A'] = df['A'].str.strip().str.replace("LOCAL|FOREIGN|HELLO", "CORP", regex=True)
Output
A
0 CORP TEST
1 TEST CORP
2 ANOTHER CORP
3 NOTHING
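A side note (my addition, not part of the original answer): if any of the strings to replace can contain regex metacharacters, it is safer to escape them when building the pattern:

import re

to_replace = ["LOCAL", "FOREIGN", "HELLO"]
# re.escape neutralizes characters like '.', '(' or '+' in the inputs
pattern = "|".join(map(re.escape, to_replace))
df['A'] = df['A'].str.strip().str.replace(pattern, "CORP", regex=True)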
Function to replace multiple values in a pandas Series:
def replace_values(series, to_replace, value):
    # Replace each target string in turn
    for i in to_replace:
        series = series.str.replace(i, value)
    return series
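A hypothetical call using the column from this question might be:

dataUS['sec_type'] = replace_values(
    dataUS['sec_type'].str.strip(), ["LOCAL", "FOREIGN", "HELLO"], "CORP")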
Hope this helps someone
Try:
dataUS.replace({"sec_type": { 'LOCAL' : "CORP", 'FOREIGN' : "CORP"}})

Python Pandas Summing Up Score if Description Contains Phrase in a List

I have a long list (200,000+) of phrases:
phrase_list = ['some word', 'another example', ...]
And a two-column pandas dataframe with a description in the first column and a score in the second:
Description                               Score
this sentence contains some word in it        6
some word is on my mind                       3
repeat another example of me                  2
this sentence has no matches                100
another example with some word               10
There are 300,000+ rows. For each phrase in phrase_list, I want the aggregate score over the rows in which that phrase is found. So, for 'some word', the score would be 6 + 3 + 10 = 19. For 'another example', the score would be 2 + 10 = 12.
The code that I have thus far works but is very slow:
phrase_score = []
for phrase in phrase_list:
    phrase_score.append([phrase, df['score'][df['description'].str.contains(phrase)].sum()])
I would like to return pandas dataframe with the phrase in one column and the score in the second (this part is trivial if I have the list of lists). However, I wanted a faster way to get the list of lists.
You can use a dictionary comprehension to generate the score for each phrase in your phrase list.
For each phrase, it creates a mask of the rows in the dataframe that contain that phrase: df.Description.str.contains(phrase). This mask is then applied to the scores, which are in turn summed; effectively df.Score[mask].sum().
df = pd.DataFrame({'Description': ['this sentence contains some word in it',
                                   'some word on my mind',
                                   'repeat another word on my mind',
                                   'this sentence has no matches',
                                   'another example with some word'],
                   'Score': [6, 3, 2, 100, 10]})
phrase_list = ['some word', 'another example']
scores = {phrase: df.Score[df.Description.str.contains(phrase)].sum()
          for phrase in phrase_list}
>>> scores
{'another example': 10, 'some word': 19}
After re-reading your post in more detail, I note the similarity to your approach. I believe, however, that a dictionary comprehension might be faster than a for loop, although based on my testing the results appear similar. I'm not aware of a more efficient solution without resorting to multiprocessing.
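Since the question also asks for a dataframe with the phrase in one column and the score in the other, the dict converts directly (a one-line sketch):

result = pd.DataFrame(list(scores.items()), columns=['phrase', 'score'])
#             phrase  score
# 0        some word     19
# 1  another example     10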
