Compare pandas column values with words in textfile using regular expressions - python

I have a df dataframe like this:
product  name   description0  description1  description2  description3
A        plane  flies         air           passengers    wings
B        car    rolls         road          NaN           NaN
C        boat   floats        sea           passengers    NaN
For each row, I want to check whether any value in the description columns appears in a text file.
Let's say my test.txt file is:
He flies to London then crosses the sea to reach New-York.
The result would look like this:
product  name   description0  description1  description2  description3  Match
A        plane  flies         air           passengers    wings         Match
B        car    rolls         road          NaN           NaN           No match
C        boat   floats        sea           passengers    NaN           Match
I know the main structure but I'm a bit lost for the rest
with open("test.txt", 'r') as searchfile:
    for line in searchfile:
        print line
        if re.search() in line:
            print(match)

You can search the input text using str.find(), since you are searching for string literals. re.search() seems like overkill.
A quick-and-dirty solution using .apply(axis=1):
Data
# df as given
input_text = "He flies to London then crosses the sea to reach New-York."
Code
input_text_lower = input_text.lower()

def search(row):
    for el in row:  # description0 .. description3
        # skip non-string cells (e.g. NaN); return True on the first successful search
        if isinstance(el, str) and (input_text_lower.find(el.lower()) >= 0):
            return True
    return False

df["Match"] = df[[f"description{i}" for i in range(4)]].apply(search, axis=1)
Result
print(df)
   product  name   description0  description1  description2  description3  Match
0  A        plane  flies         air           passengers    wings         True
1  B        car    rolls         road          NaN           NaN           False
2  C        boat   floats        sea           passengers    NaN           True
Note
Word boundaries, punctuation and hyphens are not considered in the original problem. In real cases, additional preprocessing steps are likely to be required, but that is outside the scope of the original question.
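If whole-word matching is wanted, one possible preprocessing step is to split the text into words first and test set membership instead of substrings. This is only a sketch: the dataframe is rebuilt by hand from the question, the helper name has_whole_word_match is made up for illustration, and the "Match" / "No match" labels follow the desired output in the question.

```python
import re
import pandas as pd

# Sample data rebuilt by hand from the question (missing cells as None)
df = pd.DataFrame({
    "product": ["A", "B", "C"],
    "name": ["plane", "car", "boat"],
    "description0": ["flies", "rolls", "floats"],
    "description1": ["air", "road", "sea"],
    "description2": ["passengers", None, "passengers"],
    "description3": ["wings", None, None],
})
input_text = "He flies to London then crosses the sea to reach New-York."

# Lowercase words of the text; \w+ drops punctuation and splits on hyphens
text_words = set(re.findall(r"\w+", input_text.lower()))

def has_whole_word_match(row):
    # Hypothetical helper: True if any string cell equals a whole word of the text
    return any(isinstance(el, str) and el.lower() in text_words for el in row)

desc_cols = [c for c in df.columns if c.startswith("description")]
df["Match"] = (
    df[desc_cols]
    .apply(has_whole_word_match, axis=1)
    .map({True: "Match", False: "No match"})
)
print(df)
```

With this variant, a description such as "air" would no longer match a longer word like "fair", which the substring search would accept.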

Related

How can I count locally by a custom condition in dataframe python3

I have a dataframe called DF which contains two types of information: datetime and sentence(string).
0 2019-02-01 point say give choice invest motor today money...
1 2019-02-01 get inside car drive drunk excuse bad driving ...
2 2019-02-01 look car snow know buy car snow
3 2019-02-01 drive home car day terrify experience stay least
4 2019-02-01 quid way ferry nice trip enjoy land list celeb...
... ... ...
35818 2021-09-30 choice life drive type car holiday type carava...
35819 2021-09-30 scarlet carson bloody marvellous big car lover...
35820 2021-09-30 podcast adriano great episode dude weird car d...
35821 2021-09-30 scarlet carson smugly cruise traffic know driv...
35822 2021-09-30 hornet know fuel shortage brexit destroy suppl...
Now I have a word list, and I want to check whether each sentence contains any of these strings:
word_list=['drive','car','buy','fuel','electric','panic','tax','second hand','petrol','auto']
A sentence should be counted only once, even if several words from the list appear in it. Here is my solution:
set_list=[]
for word in word_list:
    for sentence in DF['new_processed_text']:
        if word in sentence:
            set_list.append(sentence)
count=len(set(set_list))
However, this counts over the whole dataset, and I want to do the count per day.
I am not familiar with DataFrame.groupby. Do I need it here?
You can use a regular expression, pd.DataFrame.loc method, pd.Series.value_counts() method and pandas string methods for your purposes:
In aggregate:
len(df.loc[df['new_processed_text'].str.contains('|'.join(word_list), regex=True)].index)
Processing per day:
df.loc[df['new_processed_text'].str.contains('|'.join(word_list), regex=True), 'date_column'].value_counts()
You can remove duplicates first and then use the string methods of pandas Series objects.
import pandas as pd
s = pd.Series(['abc def', 'def xyz ijk', 'xyz ijk', 'abc def', 'abc def', 'ijk mn', 'def xyz'])
words = ['abc', 'xyz']
s_prime = s.drop_duplicates()
contains_word = s_prime.str.contains("|".join(words))
print(contains_word.sum())
In your case, s = DF['new_processed_text'] and words = word_list.
I forgot the per-day part, which @Vladimir Vilimaitis provides in the last line of his answer.
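To answer the groupby part of the question directly, the per-day count of distinct matching sentences could be sketched as follows. The rows are made-up sample data, the column name date_column follows the placeholder used in the answer above, and re.escape is added as a precaution in case a keyword ever contains regex metacharacters.

```python
import re
import pandas as pd

# Made-up sample rows; 'date_column' is a placeholder name for the date column
DF = pd.DataFrame({
    "date_column": pd.to_datetime(
        ["2019-02-01", "2019-02-01", "2019-02-01", "2019-02-02", "2019-02-02"]
    ),
    "new_processed_text": [
        "look car snow know buy car snow",
        "look car snow know buy car snow",  # duplicate sentence on the same day
        "quid way ferry nice trip",
        "drive home car day",
        "nice trip enjoy land",
    ],
})
word_list = ["drive", "car", "buy", "second hand"]

# One alternation pattern; each keyword is escaped so it is matched literally
pattern = "|".join(re.escape(w) for w in word_list)

# Keep the rows that contain at least one keyword
hits = DF[DF["new_processed_text"].str.contains(pattern, regex=True)]

# Number of distinct matching sentences per day
per_day = hits.groupby("date_column")["new_processed_text"].nunique()
print(per_day)
```

Note that this is still substring matching, so "car" also matches inside "carson"; wrapping each keyword in \b word boundaries would restrict it to whole words.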

How to filter rows with non Latin characters

I am stuck on a problem with a dataframe that has a column of film names. Many of the names are non-Latin, such as Japanese or Chinese (and maybe Russian too). My code is:
df['title'].head(5)
1 I am legend
2 wonder women
3 アライヴ
4 怪獣総進撃
5 dead sea
I want to remove every title that contains non-Latin characters, that is, every row like rows 3 and 4, so my desired output is:
df['title'].head(5)
1 I am legend
2 wonder women
5 dead sea
6 the rig
7 altitude
Any help with this code?
You can use str.match with the Latin character range to identify non-latin characters, and use the boolean output to slice the data:
df_latin = df[~df['title'].str.match(r'.*[^\x00-\xFF]')]
output:
title
1 I am legend
2 wonder women
5 dead sea
6 the rig
7 altitude
You can encode your title column and then decode it as latin1. If this round trip does not reproduce the original value, remove the row, because it contains non-Latin characters:
df = df[df['title'] == df['title'].str.encode('unicode_escape').str.decode('latin1')]
print(df)
# Output
title
0 I am legend
1 wonder women
3 dead sea
You can use the isascii() method (if you're using Python 3.7+). Example:
"I am legend".isascii() # True
"アライヴ".isascii() # False
If the string contains even one non-ASCII character, isascii() returns False.
(Note that for strings like '34?#5' the method will return True, because those are all ASCII characters.)
We can easily write a function that returns whether a string is ASCII, and then filter the dataframe on the result.
import pandas as pd

dict_1 = {'col1':list(range(1,6)), 'col2':['I am legend','wonder women','アライヴ','怪獣総進撃','dead sea']}

def check_ascii(string):
    # True only if every character in the string is ASCII
    return string.isascii()

df = pd.DataFrame(dict_1)
df['is_eng'] = df['col2'].apply(check_ascii)
df2 = df[df['is_eng']]
df2
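The regex range \x00-\xFF and isascii() are not equivalent: the first keeps accented Latin-1 letters, the second does not. A small sketch of the difference, using the titles from the question plus one made-up accented title ("Amélie") added purely for illustration:

```python
import pandas as pd

# Titles from the question, plus one accented title added for illustration
df = pd.DataFrame(
    {"title": ["I am legend", "wonder women", "アライヴ", "怪獣総進撃", "dead sea", "Amélie"]}
)

# Latin-1 check: reject any title with a character above U+00FF
latin1_mask = ~df["title"].str.contains(r"[^\x00-\xFF]")

# ASCII check: reject any title with a character above U+007F
ascii_mask = df["title"].map(str.isascii)

df_latin1 = df[latin1_mask]
df_ascii = df[ascii_mask]
print(df_latin1["title"].tolist())
print(df_ascii["title"].tolist())
```

Which filter is appropriate depends on whether accented titles should count as Latin for the task at hand.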

matching content creating new column

Hello, I have a dataset where I want to match keywords against the location. The problem is that the locations "Afghanistan", "Kabul" or "Helmund" appear in my dataset in over 150 combinations, including spelling mistakes, different capitalization and a city or town attached to the name. I want to create a separate column that contains 1 if any of the strings "afg", "Afg", "kab" or "helm" is contained in the location. I am not sure whether upper or lower case makes a difference.
For instance there are hundreds of location combinations like so: Jegdalak, Afghanistan, Afghanistan,Ghazni♥, Kabul/Afghanistan,
I have tried this code and it is good if it matches the phrase exactly but there is too much variation to write every exception down
keywords= ['Afghanistan','Kabul','Herat','Jalalabad','Kandahar','Mazar-i-Sharif', 'Kunduz', 'Lashkargah', 'mazar', 'afghanistan','kabul','herat','jalalabad','kandahar']
#how to make a column that shows rows with a certain keyword..
def keyword_solution(value):
    strings = value.split()
    if any(word in strings for word in keywords):
        return 1
    else:
        return 0

taleban_2['keyword_solution'] = taleban_2['location'].apply(keyword_solution)
# the rows below are the ones flagged with 1 (the column holds integers, not strings)
taleban_2[taleban_2['keyword_solution'].isin([1])].head(5)
I just need to replace this logic so that the "keyword_solution" column flags every location that contains "Afg", "afg", "kab", "Kab", "kund" or "Kund".
Given the following:
Sentences from the New York Times
Remove all non-alphanumeric characters
Change everything to lowercase, thereby removing the need for different word variations
Split the sentence into a list or set. I used set because of the long sentences.
Add to the keywords list as needed
Matching words from two lists
'afgh' in ['afghanistan']: False
'afgh' in 'afghanistan': True
Therefore, the list comprehension searches for each keyword, in each word of word_list.
[True if word in y else False for y in x for word in keywords]
This allows the list of keywords to be shorter (i.e. given afgh, afghanistan is not required)
import re
import pandas as pd
keywords = ['jalalabad',
            'kunduz',
            'lashkargah',
            'mazar',
            'herat',
            'afgh',
            'kab',
            'kand']
df = pd.DataFrame({'sentences': ['The Taliban have wanted the United States to pull troops out of Afghanistan Turkey has wanted the Americans out of northern Syria and North Korea has wanted them to at least stop military exercises with South Korea.',
'President Trump has now to some extent at least obliged all three — but without getting much of anything in return. The self-styled dealmaker has given up the leverage of the United States’ military presence in multiple places around the world without negotiating concessions from those cheering for American forces to leave.',
'For a president who has repeatedly promised to get America out of foreign wars, the decisions reflect a broader conviction that bringing troops home — or at least moving them out of hot spots — is more important than haggling for advantage. In his view, decades of overseas military adventurism has only cost the country enormous blood and treasure, and waiting for deals would prolong a national disaster.',
'The top American commander in Afghanistan, Gen. Austin S. Miller, said Monday that the size of the force in the country had dropped by 2,000 over the last year, down to somewhere between 13,000 and 12,000.',
'“The U.S. follows its interests everywhere, and once it doesn’t reach those interests, it leaves the area,” Khairullah Khairkhwa, a senior Taliban negotiator, said in an interview posted on the group’s website recently. “The best example of that is the abandoning of the Kurds in Syria. It’s clear the Kabul administration will face the same fate.”',
'afghan']})
# replace runs of non-alphanumeric characters with a single space
df['sentences'] = df['sentences'].apply(lambda x: re.sub(r'[\W_]+', ' ', x))
# create a new column with the set of lowercase words in each sentence
df['word_list'] = df['sentences'].apply(lambda x: set(x.lower().split()))
# True if any keyword is a substring of any word in the set
df['location'] = df.word_list.apply(lambda x: any(word in y for y in x for word in keywords))
# final
print(df.location)
0 True
1 False
2 False
3 True
4 True
5 True
Name: location, dtype: bool
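For the short location strings in the question, a case-insensitive substring search with the pandas string methods may be enough on its own. This is a sketch only: the sample locations are partly taken from the question and partly made up, and the prefix list is an assumption based on the fragments the asker mentions.

```python
import pandas as pd

# Sample locations: some from the question, some made up for illustration
taleban_2 = pd.DataFrame(
    {"location": ["Jegdalak, Afghanistan", "Afghanistan,Ghazni♥", "Kabul/Afghanistan",
                  "KABUL", "London, UK", None]}
)

# Assumed fragments, based on those listed in the question
prefixes = ["afg", "kab", "helm", "kund"]
pattern = "|".join(prefixes)

# case=False ignores capitalization; na=False treats missing locations as no match
taleban_2["keyword_solution"] = (
    taleban_2["location"]
    .str.contains(pattern, case=False, na=False, regex=True)
    .astype(int)
)
print(taleban_2)
```

Because the search is case-insensitive, there is no need to list both "Afg" and "afg".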

How to remove the same and rare words in dataframe pandas?

How do I remove the most common words and the rare words from the dataframe named df3?
My code below doesn't seem to work:
import numpy as np
import pandas as pd

df3 = pd.DataFrame(np.array(c3), columns=["content"]).drop_duplicates()

def text_processing_cat3(df3):
    # === Removal of common words: the 10 most frequent ===
    freq = pd.Series(' '.join(df3['content']).split()).value_counts()[:10]
    freq = list(freq.index)
    df3['content'] = df3['content'].apply(lambda x: " ".join(x for x in
                                          x.split() if x not in freq))
    # === Removal of rare words: the 10 least frequent ===
    freq = pd.Series(' '.join(df3['content']).split()).value_counts()[-10:]
    freq = list(freq.index)
    df3['content'] = df3['content'].apply(lambda x: " ".join(x for x in
                                          x.split() if x not in freq))
    return df3

print(text_processing_cat3(df3))
The sample output for the above is:
cat_id content
0 3 male malay man nkda walking stick home ambulant ws void deck able walk bendemeer mall home bus stop away adli stays daughter family husband none image image image order cancellation note ct brain duplicate image
1 3 yo chinese man nkda phx hypertension hyperlipidemia benign hyperplasia open cholecystectomy gallbladder empyema distal gastrectomy pud penetrating aortic
Please help me check and improve the code above. Thank you!
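One way to approach this, sketched under assumptions: count every word once with collections.Counter, then drop the most common and the rare words in a single pass. The sample rows are made up, the cut-off n_common is a hypothetical parameter, and "rare" is defined here as "appears exactly once", which differs from the last-10 slice used in the question.

```python
from collections import Counter
import pandas as pd

# Made-up sample rows standing in for df3['content']
df3 = pd.DataFrame({"content": [
    "man walking stick home man",
    "man home bus stop",
    "man hypertension home",
]})

# Count every word across the whole column once
counts = Counter(" ".join(df3["content"]).split())

n_common = 1  # hypothetical cut-off: how many of the most frequent words to drop
common = {word for word, _ in counts.most_common(n_common)}
rare = {word for word, count in counts.items() if count == 1}  # assumed definition of rare
to_drop = common | rare

# Rebuild each row without the dropped words
df3["content"] = df3["content"].apply(
    lambda text: " ".join(w for w in text.split() if w not in to_drop)
)
print(df3)
```

Using sets for the words to drop keeps the per-word membership test cheap on long texts.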

Python, find words from array in string

How can I find words from an array in a string?
I need a filter that finds the words saved in my array in the text that a user types into a text box on my website.
The array (or list, or similar) will hold 30+ words.
The user types text in the text box.
The script should then find all of the words.
Something like a spam filter, I guess.
Thanks
import re
words = ['word1', 'word2', 'word4']
s = 'Word1 qwerty word2, word3 word44'
r = re.compile('|'.join([r'\b%s\b' % w for w in words]), flags=re.I)
r.findall(s)
>> ['Word1', 'word2']
Solution 1 uses a regex and returns every occurrence of each keyword found in the data. Solution 2 returns the index of every occurrence of each keyword found in the data.
import re
dataString = '''Life morning don't were in multiply yielding multiply gathered from it. She'd of evening kind creature lesser years us every, without Abundantly fly land there there sixth creature it. All form every for a signs without very grass. Behold our bring can't one So itself fill bring together their rule from, let, given winged our. Creepeth Sixth earth saying also unto to his kind midst of. Living male without for fruitful earth open fruit for. Lesser beast replenish evening gathering.
Behold own, don't place, winged. After said without of divide female signs blessed subdue wherein all were meat shall that living his tree morning cattle divide cattle creeping rule morning. Light he which he sea from fill. Of shall shall. Creature blessed.
Our. Days under form stars so over shall which seed doesn't lesser rule waters. Saying whose. Seasons, place may brought over. All she'd thing male Stars their won't firmament above make earth to blessed set man shall two it abundantly in bring living green creepeth all air make stars under for let a great divided Void Wherein night light image fish one. Fowl, thing. Moved fruit i fill saw likeness seas Tree won't Don't moving days seed darkness.
'''
keyWords = ['Life', 'stars', 'seed', 'rule']
#---------------------- SOLUTION 1
print('Solution 1 output:')
for keyWord in keyWords:
    print(re.findall(keyWord, dataString))
#---------------------- SOLUTION 2
print('\nSolution 2 output:')
for keyWord in keyWords:
    index = 0
    indexes = []
    indexFound = 0
    # advance the start position one character at a time until find() returns -1
    while indexFound != -1:
        indexFound = dataString.find(keyWord, index)
        if indexFound not in indexes:
            indexes.append(indexFound)
        index += 1
    indexes.pop(-1)  # drop the trailing -1 sentinel
    print(indexes)
Output:
Solution 1 output:
['Life']
['stars', 'stars']
['seed', 'seed']
['rule', 'rule', 'rule']
Solution 2 output:
[0]
[765, 1024]
[791, 1180]
[295, 663, 811]
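The index search of Solution 2 can also be written with re.finditer, which yields one match object per occurrence. This is a sketch on a short made-up string rather than the text above; the names data_string, key_words and positions are illustrative only.

```python
import re

# Short made-up text for illustration
data_string = "stars rule the seed; Stars rule"
key_words = ["stars", "seed", "rule"]

# Map each keyword to the start index of every case-sensitive occurrence
positions = {
    kw: [m.start() for m in re.finditer(re.escape(kw), data_string)]
    for kw in key_words
}
print(positions)
```

Passing flags=re.IGNORECASE to re.finditer would also pick up the capitalized "Stars".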
Try
import string

words = ['word1', 'word2', 'word4']
s = 'word1 qwerty word2, word3 word44'
i = 0
for x in s.split(" "):
    x = x.strip(string.punctuation)  # drop the trailing comma in 'word2,'
    if x in words:
        print(x)
        i += 1
print("count is " + str(i))
output
word1
word2
count is 2
