Python: Match strings for certain terms

I have a list of tweets, from which I have to choose tweets that contain terms like "sale", "discount", or "offer". I also need to find tweets that advertise certain deals, like a discount, by recognizing things like "%", "Rs.", "$", amongst others. I have absolutely no idea about regular expressions and the documentation isn't getting me anywhere. Here is my code; it's rather lousy, but please excuse that.
import pymongo
import re
import datetime
client = pymongo.MongoClient()
db = client.PWSocial
fourteen_days_ago = datetime.datetime.utcnow() - datetime.timedelta(days=14)
id_list = [57947109, 183093247, 89443197, 431336956]
ar1 = [" deal "," deals ", " offer "," offers " "discount", "promotion", " sale ", " inr", " rs", "%", "inr ", "rs ", " rs."]
def func(ac_id):
    mylist = []
    newlist = []
    tweets = list(db.tweets.find({'user_id' : ac_id, 'created_at': { '$gte': fourteen_days_ago }}))
    for item in tweets:
        data = item.get('text')
        data = data.lower()
        data = data.split()
        flag = 0
        if set(ar1).intersection(data):
            flag = 1
        abc = []
        for x in ar1:
            for y in data:
                if re.search(x,y):
                    abc.append(x)
                    flag = 1
                    break
        if flag == 1:
            mylist.append(item.get('id'))
            newlist.append(abc)
    print mylist
    print newlist

for i in id_list:
    func(i)
This code doesn't give me any correct results, and being a noob to regexes, I cannot figure out what's wrong with it. Can anyone suggest a better way to do this job? Any help is appreciated.

My first piece of advice: learn regular expressions; they give you almost unlimited power for text processing.
But to give you a working solution (and a starting point for further exploration), try this:
import re

re_offers = re.compile(r'''
    \b          # Word boundary
    (?:         # Non-capturing parenthesis
        deals?      # deal or deals
        |           # or ...
        offers?     # offer or offers
        |
        discount
        |
        promotion
        |
        sale
        |
        rs\.?       # rs or rs.
        |
        inr\d+      # INR then digits
        |
        \d+inr      # digits then INR
    )           # End of group
    \b          # Word boundary
    |           # or ...
    \b\d+%      # Digits (1 or more) then percent
    |
    \$\d+\b     # Dollar then digits (didn't care about thousand separators yet)
''', re.I | re.X)   # Ignore case, verbose format - for you :)

abc = re_offers.findall("e misio $1 is inr123 discount 1INR a 1% and deal")
print(abc)
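If you want to plug this into your original loop, here is a minimal sketch (reusing db.tweets, fourteen_days_ago, mylist and newlist from the question; lower() and split() are no longer needed because re.I handles case and the pattern handles word boundaries):

def func(ac_id):
    mylist = []
    newlist = []
    tweets = db.tweets.find({'user_id': ac_id, 'created_at': {'$gte': fourteen_days_ago}})
    for item in tweets:
        hits = re_offers.findall(item.get('text', ''))   # all offer-like terms in this tweet
        if hits:
            mylist.append(item.get('id'))
            newlist.append(hits)
    print(mylist)
    print(newlist)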

You don't need to use a regular expression for this; you can use any():
if any(term in tweet for term in search_terms):

In your array of things to search for, you don't have a comma between " offers " and "discount", which causes the two string literals to be joined together into " offers discount".
Also, when you use split you are getting rid of the whitespace in your input text. "I have a deal" will become ["I","have","a","deal"], but your search terms almost all contain whitespace. So remove the spaces from your search terms in the array ar1.
However, you might want to avoid using regular expressions and just use in instead (you will still need the changes I suggest above, though):
if x in y:
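Putting those fixes together, a minimal sketch of the loop body (assuming tweets, mylist and newlist are set up as in your function, and that ar1 gets the missing comma added and the surrounding spaces removed):

ar1 = ["deal", "deals", "offer", "offers", "discount",
       "promotion", "sale", "inr", "rs", "rs.", "%"]

for item in tweets:
    data = item.get('text').lower()
    matched = [term for term in ar1 if term in data]   # plain substring check, no regex needed
    if matched:
        mylist.append(item.get('id'))
        newlist.append(matched)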

You might want to consider starting with find instead of a regex. You don't have complex expressions, and as you're handling a line of text you don't need to call split; instead just use find:
for token in ar1:
    if data.find(token) != -1:
        abc.append(data)
Your for item in tweets loop becomes:
for item in tweets:
    data = item.get('text')
    data = data.lower()
    for x in ar1:
        if data.find(x) != -1:
            newlist.append(data)
            mylist.append(item.get('id'))
            break
Re: your comment on jonsharpe's post, to avoid including substrings, surround your tokens by spaces, e.g. " rs ", " INR "
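A small sketch of that idea (item and mylist as in the original code): pad the tweet text itself with spaces so terms at the very start or end of the tweet still match:

data = ' ' + item.get('text').lower() + ' '
if ' rs ' in data or ' inr ' in data:
    # words like "cars" or "inroads" no longer produce false matches
    mylist.append(item.get('id'))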

Related

How do I remove some text from "get_text()" output in BeautifulSoup

I'm making a web scraping program to get the retail trading sentiment from IG Markets.
The output I would like to be displayed in the console is:
"EUR/USD: 57% of clients accounts are short on this market".
The output I get right now is:
"EUR/USD: 57% of client accounts are short on this market The percentage of IG client
accounts with positions in this market that are currently long or short. Calculated
to the nearest 1%."
How do I remove this text:
"The percentage of IG client accounts with positions in this market that are
currently long or short. Calculated to the nearest 1%."
Thank you.
Here's the code:
import bs4, requests

def getIGsentiment(pairUrl):
    res = requests.get(pairUrl)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    elems = soup.select('.price-ticket__sentiment')
    return elems[0].get_text(" ", strip=True)

retail_positions = getIGsentiment('https://www.ig.com/us/forex/markets-forex/eur-usd')
print('EUR/USD: ' + retail_positions)
You can use a regular expression (regex) for that:
>>> import re
>>> print('EUR/USD: ' + re.match('^.*on this market',retail_positions).group())
EUR/USD: 57% of client accounts are short on this market
You express a search pattern (^.*on this market), re.match() returns a re.Match object, and you retrieve the matched text with the group() method.
This search pattern consists of 3 parts:
^ matches the start of the line
.* means zero or more (*) instances of any character (.)
on this market literally matches this string
Regexes are widely used and supported, but beware of the different flavours: Python doesn't support the POSIX [[:digit:]] character class, for example.
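If you would rather strip the unwanted tail than match the part you want, a small re.sub() sketch could also work (assuming the boilerplate always begins with "The percentage of", as in your example):

import re

cleaned = re.sub(r'\s*The percentage of.*$', '', retail_positions)
print('EUR/USD: ' + cleaned)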
If your string changes but the capitalization doesn't, you can simply loop over the string to find the 7th uppercase character and split on it. In this case, it's the letter 'T'.
Something like this:
phrase = "EUR/USD: 57 % of client accounts are short on this market The percentage of
IG client accounts with positions in this market that are currently long or short.
Calculated to the nearest 1 % ."
upperchars = []
for char in phrase:
if char.isupper():
upperchars.append(char)
final = phrase.split(upperchars[6])[0]
print(final)
The result would be:
EUR/USD: 57 % of client accounts are short on this market

Pandas Python Regular Expression Assistance

I wasn't sure what to call this title, feel free to edit it if you think there is a better name.
What I am trying to do is find cases that match certain search criteria.
Specifically, I am trying to find sentences that contain the word "where" in them. Once I have identified that, I am trying to find cases where "SqlCommand" is also located within that same tag.
Let's say I have a dataframe that looks like this:
search_criteria = ['where']
df4
Q R
0 file.sql <sentence>dave likes stuff</sentence><properties>version = "2", description = "example" type="SqlCommand">select id, name, from table where criteria = '5'</property><sentence>dave hates stuff>
0 file.sql <sentence>dave likes stuff</sentence><properties>version = "2", description = "example">select id, name, from table where criteria = '5'</properties><sentence>dave hates stuff>
I am trying to return this:
Q R
0 file.sql <properties>version = "2", description = "example">select id, name, from table</properties>
This record should get returned because it contains both "where" and "sqlcommand".
Here is my current process:
regex_stuff = df_all_xml_mfiles_tgther[cc:cc+1].R.str.findall('(<[^<]*?' + 'where' + '[^>]*?>)', re.IGNORECASE)
sql_command_regex_stuff = df_all_xml_mfiles_tgther[cc:cc+1].R.str.findall('(<property[^<]*?' + 'sqlcommand' + '[^>]*?<\/property>)', re.IGNORECASE)
if not regex_stuff.empty:  # if one of the search criteria is found
    if not sql_command_regex_stuff.empty:  # check to see if the phrase "sqlcommand" is found anywhere as well
        (insert rest of code)
This does not return anything.
What am I doing wrong?
Edit #1:
It seems like I need to do something at the end, to make the regex look something like this:
<property[^<]*?SqlCommand[^(<\/property>)]*
I feel like this is the right direction; it doesn't work yet, but I feel like it's the right step.
You could just filter with str.contains:
df[(df['R'].str.contains('where', flags=re.IGNORECASE) & df['R'].str.contains('sqlcommand', flags=re.IGNORECASE))]
Q R
0 file.sql <sentence>dave likes stuff</sentence><properti...
or use ~ to return the opposite: strings that do not contain 'sqlcommand' or 'where'
df[~(df['R'].str.contains('where', flags=re.IGNORECASE) & df['R'].str.contains('sqlcommand', flags=re.IGNORECASE))]
Q R
1 file.sql <sentence>dave likes stuff</sentence><properti...
First of all, you have to have proper XML and SQL content, so you should
make the following corrections:
As the opening tag is <properties>, the closing tag must also be
</properties>, not </property>.
version, description and type are attributes (after them there is > closing the opening tag), so after properties there should be a space, not >.
Remove , after version="2".
Remove , after name.
Remove ( before <properties and ) after </properties>.
To find the required rows, use str.contains as the filtering
expression.
Below you have an example program:
import pandas as pd
import re

df4 = pd.DataFrame({
    'Q' : 'file.sql',
    'R' : [
        '<s>dave</s><properties type="SqlCommand">select id, name '
        'from table where criteria=\'5\'</properties><s>dave</s>',
        '<s>dave</s><properties>select id, name from table '
        'where criteria=\'6\'</properties><s>dave</s>',
        '<s>mike</s><properties type="SqlCommand">drop table "Xyz"'
        '</properties><s>mike</s>' ]})

df5 = df4[df4.R.str.contains(
    '<properties[^<>]+?sqlcommand[^<>]+?>[^<>]+?where',
    flags=re.IGNORECASE)]
print(df5)
Note that the regex takes care of the proper sequence of strings:
First match <properties.
Then a sequence of chars other than < and > ([^<>]+?), so we are still within the just-opened XML tag.
Then match sqlcommand (ignoring case).
Then another sequence of chars other than < and >
([^<>]+?).
Then >, closing the tag.
Then another sequence of chars other than < and >
([^<>]+?).
And finally where (also ignoring case).
An attempt to check for sqlcommand and where in two separate
regexes is wrong, as these words can be at other locations,
which do not meet your requirement.
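To see why the order matters, here is a small hypothetical example: a row where both words occur, but "where" is outside the <properties ...> tag. The two separate contains checks flag it, while the single ordered regex does not:

import pandas as pd
import re

df = pd.DataFrame({'R': [
    # 'where' only appears in the sentence, not inside the SqlCommand tag
    '<s>where dave lives</s><properties type="SqlCommand">drop table "Xyz"</properties>']})

both = (df.R.str.contains('where', flags=re.IGNORECASE)
        & df.R.str.contains('sqlcommand', flags=re.IGNORECASE))
ordered = df.R.str.contains('<properties[^<>]+?sqlcommand[^<>]+?>[^<>]+?where',
                            flags=re.IGNORECASE)
print(both[0], ordered[0])   # True False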

Exhaustively parse file for all matches

I have a grammar for parsing some log files using pyparsing but am running into an issue where only the first match is being returned. Is there a way to ensure that I get exhaustive matches? Here's some code:
from pyparsing import Literal, Optional, oneOf, OneOrMore, ParserElement, Regex, restOfLine, Suppress, ZeroOrMore
ParserElement.setDefaultWhitespaceChars(' ')
dt = Regex(r'''\d{2} (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) 20\d\d \d\d:\d\d:\d\d\,\d{3}''')
# TODO maybe add a parse action to make a datetime object out of the dt capture group
log_level = Suppress('[') + oneOf("INFO DEBUG ERROR WARN TRACE") + Suppress(']')
package_name = Regex(r'''(com|org|net)\.(\w+\.)+\w+''')
junk_data = Optional(Regex('\(.*?\)'))
guid = Regex('[A-Za-z0-9]{8}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{12}')
first_log_line = dt.setResultsName('datetime') + \
                 log_level('log_level') + \
                 guid('guid') + \
                 junk_data('junk') + \
                 package_name('package_name') + \
                 Suppress(':') + \
                 restOfLine('message') + \
                 Suppress('\n')
additional_log_lines = Suppress('\t') + package_name + restOfLine
log_entry = (first_log_line + Optional(ZeroOrMore(additional_log_lines)))
log_batch = OneOrMore(log_entry)
In my mind, the last two lines are sort of equivalent to
log_entry := first_log_line | first_log_line additional_log_lines
additional_log_lines := additional_log_line | additional_log_line additional_log_lines
log_batch := log_entry | log_entry log_batch
Or something of the sort. Am I thinking about this wrong? I only see a single match with all of the expected tokens when I do print(log_batch.parseString(data).dump()).
Your scanString behavior is a strong clue. Suppose I wrote an expression to match one or more items, and erroneously defined my expression such that the second item in my list did not match. Then OneOrMore(expr) would fail, while expr.scanString would "succeed", in that it would give me more matches, but would still overlook the match I might have wanted, but just mis-parsed.
import pyparsing as pp
data = "AAA _AB BBB CCC"
expr = pp.Word(pp.alphas)
print(pp.OneOrMore(expr).parseString(data))
Gives:
['AAA']
At first glance, this looks like the OneOrMore is failing, whereas scanString shows more matches:
['AAA']
['AB'] <- really wanted '_AB' here
['BBB']
['CCC']
Here is a loop using scanString which prints not the matches, but the gaps between the matches, and where they start:
# loop to find non-matching parts in data
last_end = 0
for t,s,e in expr.scanString(data):
    gap = data[last_end:s]
    print(s, ':', repr(gap))
    last_end = e
Giving:
0 : ''
5 : ' _' <-- AHA!!
8 : ' '
12 : ' '
Here's another way to visualize this.
# print markers where each match begins in input string
markers = [' ']*len(data)
for t,s,e in expr.scanString(data):
    markers[s] = '^'
print(data)
print(''.join(markers))
Prints:
AAA _AB BBB CCC
^ ^ ^ ^
Your code would be a little more complex since your data spans many lines, but using pyparsing's line, lineno and col methods, you could do something similar.
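For example, a rough sketch of the same gap-finding loop for multi-line input, using pyparsing's lineno and col helpers (log_batch and data as in the question):

import pyparsing as pp

last_end = 0
for tokens, start, end in log_batch.scanString(data):
    gap = data[last_end:start]
    if gap.strip():
        # report where each unmatched chunk starts, as line/column in the log text
        print('unmatched text at line', pp.lineno(start, data),
              'col', pp.col(start, data), ':', repr(gap))
    last_end = end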
So, there's a workaround that seems to do the trick. For whatever reason, scanString does iterate through them all appropriately, so I can very simply get my matches in a generator with:
matches = (m for m, _, _ in log_batch.scanString(data))
Still not sure why parseString isn't working exhaustively, though, and still a bit worried that I've misunderstood something about pyparsing, so more pointers are welcome here.
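One small pointer (a sketch, not specific to your grammar): parseString quietly stops at the first point where the grammar no longer matches and ignores the rest of the input; passing parseAll=True makes that point visible by raising a ParseException there:

from pyparsing import ParseException

try:
    log_batch.parseString(data, parseAll=True)
except ParseException as err:
    # err.lineno, err.col and err.line describe the first offending spot
    print(err.line)
    print(' ' * (err.col - 1) + '^')
    print(err)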

Match unique patterns in string - Python

I have a list of strings called txtFreeForm:
['Add roth Sweep non vested money after 5 years of termination',
'Add roth in-plan to the 401k plan.']
I need to check if only 'Add roth' exists in the sentence. To do that I used this:
for each_line in txtFreeForm:
    match = re.search('add roth', each_line.lower())
    if match is not None:
        print(each_line)
But this obviously returns both the strings in my list, as both contain 'add roth'. Is there a way to exclusively search for 'Add roth' in a sentence? I have a bunch of these patterns to search for in strings.
Thanks for your help!
Can you fix this problem by using the .Length property of strings? I'm not an experienced Python programmer, but here is how I think it should work:
for each_line in txtFreeForm:
    match = re.search('add roth', each_line.lower())
    if (match is not None) and (len(each_line) == len("Add Roth")):
        print(each_line)
Basically, if the text is in the string, AND the length of the string is exactly equal to the length of the string "Add Roth", then it must ONLY contain "Add Roth".
I hope this was helpful.
EDIT:
I misunderstood what you were asking. You want to print out sentences that contain "Add Roth", but not sentences that contain "Add Roth in plan". Is this correct?
How about this code?
for each_line in txtFreeForm:
    match_AR = re.search('add roth', each_line.lower())
    match_ARIP = re.search('add roth in plan', each_line.lower())
    if (match_AR is not None) and (match_ARIP is None):
        print(each_line)
This seems like it should fix the problem. You can exclude any strings (like "in plan") by searching for them too and adding them to the comparison.
You're close :) Give this a shot:
for each_line in txtFreeForm:
    match = re.search('add roth (?!in[-]plan)', each_line.lower())
    if match is not None:
        print(each_line[match.end():])
EDIT:
Ahhh I misread... you have a LOT of these. This calls for some more aggressive magic.
import re
from functools import partial, reduce

txtFreeForm = ['Add roth Sweep non vested money after 5 years of termination',
               'Add roth in-plan to the 401k plan.']

def roths(rows):
    for row in rows:
        match = re.search(r'add roth\s*', row.lower())
        if match:
            yield row, row[match.end():]

def filter_pattern(pattern):
    return partial(lazy_filter_out, pattern)

def lazy_filter(pattern):   # (not used below)
    return partial(lazy_filter, pattern)

def lazy_filter_out(pattern, rows):
    for row, rest in rows:
        if not re.match(pattern, rest):
            yield row, rest

def magical_transducer(bad_words, nice_rows):
    magical_sentences = reduce(lambda x, y: y(x),
                               [roths] + list(map(filter_pattern, bad_words)),
                               nice_rows)
    for row, _ in magical_sentences:
        yield row

def main():
    magic = magical_transducer(['in[-]plan'], txtFreeForm)
    print(list(magic))

if __name__ == '__main__':
    main()
To explain a bit about what's happening here, you mentioned you have a LOT of these words to process. The traditional way you might compare two groups of items is with nested for-loops. So,
results = []
for word in words:
    for pattern in patterns:
        data = do_something(word_pattern)
        results.append(data)
        for item in data:
            for thing in item:
                and so on...
                and so forth...
I'm using a few different techniques to attempt to achieve a "flatter" implementation and avoid the nested loops. I'll do my best to describe them.
Function compositions
# You will often see patterns that look like this:
x = foo(a)
y = bar(b)
z = baz(y)
# You may also see patterns that look like this:
z = baz(bar(foo(a)))
# an alternative way to do this is to use a functional composition
# the technique works like this:
z = reduce(lambda x, y: y(x), [foo, bar, baz], a)

Python: matching OR of two variables containing regex code

What I am trying to do is take user-input text which would contain wildcards (so I need to keep them that way) and then search for the specified input. So, for the example that I have working below, I use the pipe |.
I figured out how to make this work:
dual = 'a bunch of stuff and a bunch more stuff!'
reobj = re.compile('b(.*?)f|\s[a](.*?)u', re.IGNORECASE)
result = reobj.findall(dual)
for link in result:
    print link[0] + ' ' + link[1]
which returns:
unch o
nd a b
As well
dual2 = 'a bunch of stuff and a bunch more stuff!'
#So I want to now send in the regex codes of my own.
userin1 = 'b(.*?)f'
userin2 = '\s[a](.*?)u'
reobj = re.compile(userin1, re.IGNORECASE)
result = reobj.findall(dual2)
for link in result:
    print link[0] + ' ' + link[1]
Which returns:
u n
u n
I don't understand what it is doing; if I keep only link[0] in the print, I get:
u
u
I can, however, pass in a user-input regex string:
dual = 'a bunch of stuff and a bunch more stuff!'
userinput = 'b(.*?)f'
reobj = re.compile(userinput, re.IGNORECASE)
result = reobj.findall(dual)
print(result)
but when I try to update this to two user strings with the pipe:
dual = 'a bunch of stuff and a bunch more stuff!'
userin1 = 'b(.*?)f'
userin2 = '\s[a](.*?)u'
reobj = re.compile(userin1|userin2, re.IGNORECASE)
result = reobj.findall(dual)
print(result)
I get the error:
reobj = re.compile(userin1|userin2, re.IGNORECASE)
TypeError: unsupported operand type(s) for |: 'str' and 'str'
I get this error a lot, for example if I put brackets () or [] around userin1|userin2.
I have found the following:
Python regular expressions OR
but cannot get it to work ;..{-( .
What I would like to do is understand how to pass in these regex variables and combine them, for example with OR, and return all the matches of both, as well as something such as AND. In the end this is useful as it will operate on files and let me know which files contain particular words with the various logical relations OR, AND, etc.
Thanks much for your thoughts,
Brian
Although I couldn't get the answer from A. Rodas to work, he gave me the idea for the .join. The example I worked out, although slightly different, returns (in link[0] and link[1]) the desired results.
userin1 = '(T.*?n)'
userin2 = '(G.*?p)'
list_patterns = [userin1,userin2]
swaplogic = '|'
string = 'What is a Torsion Abelian Group (TAB)?'
theresult = re.findall(swaplogic.join(list_patterns), string)
print theresult
for link in theresult:
    print link[0] + ' ' + link[1]
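To cover the AND case mentioned in the question as well, here is a minimal sketch (separate from the .join example above) that keeps the user patterns as individual strings: OR is the '|' join, while AND just requires every pattern to match somewhere in the text:

import re

userin1 = 'b(.*?)f'
userin2 = r'\s[a](.*?)u'
patterns = [userin1, userin2]
text = 'a bunch of stuff and a bunch more stuff!'

# OR: at least one of the user patterns matches
or_hit = any(re.search(p, text, re.IGNORECASE) for p in patterns)

# AND: every user pattern matches somewhere in the text
and_hit = all(re.search(p, text, re.IGNORECASE) for p in patterns)

print(or_hit, and_hit)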
