How do I remove some text from "get_text()" output in BeautifulSoup - python

I'm making a web scraping program to get the retail trading sentiment from IG Markets.
The output I would like to be displayed in the console is:
"EUR/USD: 57% of clients accounts are short on this market".
The output I get right now is:
"EUR/USD: 57% of client accounts are short on this market The percentage of IG client
accounts with positions in this market that are currently long or short. Calculated
to the nearest 1%."
How do I remove this text:
"The percentage of IG client accounts with positions in this market that are
currently long or short. Calculated to the nearest 1%."
Thank you.
Here's the code:
import bs4, requests
def getIGsentiment(pairUrl):
res = requests.get(pairUrl)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
elems = soup.select('.price-ticket__sentiment')
return elems[0].get_text(" ", strip = True)
retail_positions = getIGsentiment('https://www.ig.com/us/forex/markets-forex/eur-usd')
print('EUR/USD: ' + retail_positions)

You can use a Regular expression (regex) for that :
>>> import re
>>> print('EUR/USD: ' + re.match('^.*on this market',retail_positions).group())
EUR/USD: 57% of client accounts are short on this market
You express a search pattern (^.*on this market) and re.match() will return a re.Match object and you can retrieve the match with the group() function.
This search pattern consist of 3 parts :
^ match the start of the line
.* mean to match zero or more (*) instance of any character (.)
on this market literally match this string
Regex are widely used and supported, but beware some variants, Python don't seem to support the [[:digit:]] character class...

If your string changes but the capitalization not, you can simply create a for loop to look after the 7th upper character and split the string. In this case, it's the letter 'T'.
Something like this:
phrase = "EUR/USD: 57 % of client accounts are short on this market The percentage of
IG client accounts with positions in this market that are currently long or short.
Calculated to the nearest 1 % ."
upperchars = []
for char in phrase:
if char.isupper():
upperchars.append(char)
final = phrase.split(upperchars[6])[0]
print(final)
The result would be:
EUR/USD: 57 % of client accounts are short on this market

Related

How do I make my code remove the sender names found in the messages saved in a txt file and the tags using regex

Having this dialogue between a sender and a receiver through Discord, I need to eliminate the tags and the names of the interlocutors, in this case it would help me to eliminate the previous to the colon (:), that way the name of the sender would not matter and I would always delete whoever sent the message.
This is the information what is inside the generic_discord_talk.txt file
Company: <#!808947310809317387> Good morning, technical secretary of X-company, will Maria attend you, how can we help you?
Customer: Hi <#!808947310809317385>, I need you to help me with the order I have made
Company: Of course, she tells me that she has placed an order through the store's website and has had a problem. What exactly is Maria about?
Customer: I add the product to the shopping cart and nothing happens <#!808947310809317387>
Company: Does Maria have the website still open? So I can accompany you during the purchase process
Client: <#!808947310809317387> Yes, I have it in front of me
import collections
import pandas as pd
import matplotlib.pyplot as plt #to then graph the words that are repeated the most
archivo = open('generic_discord_talk.txt', encoding="utf8")
a = archivo.read()
with open('stopwords-es.txt') as f:
st = [word for line in f for word in line.split()]
print(st)
stops = set(st)
stopwords = stops.union(set(['you','for','the'])) #OPTIONAL
#print(stopwords)
I have created a regex to detect the tags
regex = re.compile("^(<#!.+>){,1}\s{,}(messegeA|messegeB|messegeC)(<#!.+>){,1}\s{,}$")
regex_tag = re.compile("^<#!.+>")
I need that the sentence print(st) give me return the words to me but without the emitters and without the tags
You could remove either parts using an alternation | matching either from the start of the string to the first occurrence of a comma, or match <#! till the first closing tag.
^[^:\n]+:\s*|\s*<#!\d+>
The pattern matches:
^ Start of string
[^:\n]+:\s* Match 1+ occurrences of any char except : or a newline, then match : and optional whitspace chars
| Or
\s*<#! Match literally, preceded by optional whitespace chars
[^<>]+ Negated character class, match 1+ occurrences of any char except < and >
> Match literally
Regex demo
If there can be only digits after <#!
^[^:\n]+:|<#!\d+>
For example
archivo = open('generic_discord_talk.txt', encoding="utf8")
a = archivo.read()
st = re.sub(r"^[^:\n]+:\s*|\s*<#![^<>]+>", "", a, 0, re.M)
If you also want to clear the leading and ending spaces, you can add this line
st = re.sub(r"^[^\S\n]*|[^\S\n]*$", "", st, 0, re.M)
I think this should work:
import re
data = """Company: <#!808947310809317387> Good morning, technical secretary of X-company, will Maria attend you, how can we help you?
Customer: Hi <#!808947310809317385>, I need you to help me with the order I have made
Company: Of course, she tells me that she has placed an order through the store's website and has had a problem. What exactly is Maria about?
Customer: I add the product to the shopping cart and nothing happens <#!808947310809317387>
Company: Does Maria have the website still open? So I can accompany you during the purchase process
Client: <#!808947310809317387> Yes, I have it in front of me"""
def run():
for line in data.split("\n"):
line = re.sub(r"^\w+: ", "", line) # remove the customer/company part
line = re.sub(r"<#!\d+>", "", line) # remove tags
print(line)

Python Regex - Extract text between (multiple) expressions in a textfile

I am a Python beginner and would be very thankful if you could help me with my text extraction problem.
I want to extract all text, which lies between two expressions in a textfile (the beginning and end of a letter). For both, the beginning and the end of the letter there are multiple possible expressions (defined in the lists "letter_begin" and "letter_end", e.g. "Dear", "to our", etc.). I want to analyze this for a bunch of files, find below an example of how such a textfile looks like -> I want to extract all text starting from "Dear" till "Douglas". In cases where the "letter_end" has no match, i.e. no letter_end expression is found, the output should start from the letter_beginning and end at the very end of the text file to be analyzed.
Edit: the end of "the recorded text" has to be after the match of "letter_end" and before the first line with 20 characters or more (as is the case for "Random text here as well" -> len=24.
"""Some random text here
 
Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards
Douglas
Random text here as well"""
This is my code so far - but it is not able to flexible catch the text between the expressions (there can be anything (lines, text, numbers, signs, etc.) before the "letter_begin" and after the "letter_end")
import re
letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter
closings = "|".join(letter_end)
regex = r"(?:" + openings + r")\s+.*?" + r"(?:" + closings + r"),\n\S+"
with open(filename, 'r', encoding="utf-8") as infile:
text = infile.read()
text = str(text)
output = re.findall(regex, text, re.MULTILINE|re.DOTALL|re.IGNORECASE) # record all text between Regex (Beginning and End Expressions)
print (output)
I am very thankful for every help!
You may use
regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)
This pattern will result in a regex like
(?:dear|to our|estimated)[\s\S]*?(?:sincerely|yours|best regards).*(?:\n.*){0,2}
See the regex demo. Note you should not use re.DOTALL with this pattern, and the re.MULTILINE option is also redundant.
Details
(?:dear|to our|estimated) - any of the three values
[\s\S]*? - any 0+ chars, as few as possible
(?:sincerely|yours|best regards) - any of the three values
.* - any 0+ chars other than newline
(?:\n.*){0,2} - zero, one or two repetitions of a newline followed with any 0+ chars other than newline.
Python demo code:
import re
text="""Some random text here
Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards
Douglas
Random text here as well"""
letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter
closings = "|".join(letter_end)
regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)
print(regex)
print(re.findall(regex, text, re.IGNORECASE))
Output:
['Dear Shareholders We\nare pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.\nBest regards \nDouglas\n']

Python to extract the #user and url link in twitter text data with regex

There is a list string twitter text data, for example, the following data (actually, there is a large number of text,not just these data), I want to extract the all the user name after # and url link in the twitter text, for example: galaxy5univ and url link.
tweet_text = ['#galaxy5univ I like you',
'RT #BestOfGalaxies: Let's sit under the stars ...',
'#jonghyun__bot .........((thanks)',
'RT #yosizo: thanks.ddddd <https://yahoo.com>',
'RT #LDH_3_yui: #fam, ccccc https://msn.news.com']
my code:
import re
pu = re.compile(r'http\S+')
pn = re.compile(r'#(\S+)')
for row in twitter_text:
text = pu.findall(row)
name = (pn.findall(row))
print("url: ", text)
print("name: ", name)
Through testing the code in a large number of twitter data, I have got that my two patterns for url and name both are wrong(although in a few twitter text data is right). Do you guys have some documents or link about extract name and url from twitter text in the case of large twitter data.
If you have advices about extracting name and url from twitter data, please tell me, thanks!
Note that your pn = re.compile(r'#(\S+)') regex will capture any 1+ non-whitespace characters after #.
To exclude matching :, you need to convert the shorthand \S class to [^\s] negated character class equivalent, and add : to it:
pn = re.compile(r'#([^\s:]+)')
Now, it will stop capturing non-whitespace symbols before the first :. See the regex demo.
If you need to capture until the last :, you can just add : after the capturing group: pn = re.compile(r'#(\S+):').
As for a URL matching regex, there are many on the Web, just choose the one that works best for you.
Here is an example code:
import re
p = re.compile(r'#([^\s:]+)')
test_str = "#galaxy5univ I like you\nRT #BestOfGalaxies: Let's sit under the stars ...\n#jonghyun__bot .........((thanks)\nRT #yosizo: thanks.ddddd <https://y...content-available-to-author-only...o.com>\nRT #LDH_3_yui: #fam, ccccc https://m...content-available-to-author-only...s.com"
print(p.findall(test_str))
p2 = re.compile(r'(?:http|ftp|https)://(?:[\w_-]+(?:(?:\.[\w_-]+)+))(?:[\w.,#?^=%&:/~+#-]*[\w#?^=%&/~+#-])?')
print(p2.findall(test_str))
# => ['galaxy5univ', 'BestOfGalaxies', 'jonghyun__bot', 'yosizo', 'LDH_3_yui']
# => ['https://yahoo.com', 'https://msn.news.com']
If the usernames doesn't contain special chars, you can use:
#([\w]+)
See Live demo

How to distinguish between added sentences and altered sentences with difflib and nltk?

Downloading this page and making a very minor edit to it, changing the first 65 in this paragraph to 68:
I then run it through the following code to pull out the diffs.
import bs4
from bs4 import BeautifulSoup
import urllib2
import lxml.html as lh
url = 'https://secure.ssa.gov/apps10/reference.nsf/links/02092016062645AM'
response = urllib2.urlopen(url)
content = response.read() # get response as list of lines
root = lh.fromstring(content)
section1 = root.xpath("//div[#class = 'column-12']")[0]
section1_text = section1.text_content()
url2 = 'file:///Users/Pyderman/repos/02092016062645AM-modified.html'
response2 = urllib2.urlopen(url2)
content2 = response2.read() # get response as list of lines
root2 = lh.fromstring(content2)
section2 = root2.xpath("//div[#class = 'column-12']")[0]
section2_text = section2.text_content()
d = difflib.Differ()
soup = bs4.BeautifulSoup(unicode(section1_text))
soup2= bs4.BeautifulSoup(unicode(section2_text))
from nltk import sent_tokenize
sentences = [sentence for string in soup.stripped_strings for sentence in sent_tokenize(string)]
sentences2 = [sentence for string in soup2.stripped_strings for sentence in sent_tokenize(string)]
diff = d.compare(sentences, sentences2)
changes = [change for change in diff if change.startswith('-') or change.startswith('+')]
for change in changes:
print(change)
Printing the changes gives:
- It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).
+ It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).
So a change gets marked with a +, whether it's a new addition (a brand new full sentence also gets marked with a +) or a minor change to an existing sentence. As it stands then, unless my program does some additional processing, it will think that a new sentence was added and another one was removed.
How can we take advantage of the fact that what difflib sees as the apparently 'removed' sentence and the apparently 'added' sentence are very similar, in order to determine that we are in fact dealing with an in-place change to an existing sentence?
NOTE: The solution will need to be able to process potentially several changes in a single page, so it won't be sufficient to apply something like if sentence1 very similar to sentence 2: then it's a modification, since there will be several diffs to compare and contrast.

Python: Match strings for certain terms

I have a list of tweets, from which I have to choose tweets that have terms like "sale", "discount", or "offer". Also, I need to find tweets that advertise certain deals, like a discount, by recognizing things like "%", "Rs.", "$" amongst others. I have absolutely no idea about regular expressions and the documentation isn't getting me anywhere. Here is my code. It's rather lousy, but please excuse that
import pymongo
import re
import datetime
client = pymongo.MongoClient()
db = client .PWSocial
fourteen_days_ago = datetime.datetime.utcnow() - datetime.timedelta(days=14)
id_list = [57947109, 183093247, 89443197, 431336956]
ar1 = [" deal "," deals ", " offer "," offers " "discount", "promotion", " sale ", " inr", " rs", "%", "inr ", "rs ", " rs."]
def func(ac_id):
mylist = []
newlist = []
tweets = list(db.tweets.find({'user_id' : ac_id, 'created_at': { '$gte': fourteen_days_ago }}))
for item in tweets:
data = item.get('text')
data = data.lower()
data = data.split()
flag = 0
if set(ar1).intersection(data):
flag = 1
abc = []
for x in ar1:
for y in data:
if re.search(x,y):
abc.append(x)
flag = 1
break
if flag == 1:
mylist.append(item.get('id'))
newlist.append(abc)
print mylist
print newlist
for i in id_list:
func(i)
This code soen't give me any correct results, and being a noob to regexes, I cannot figure out whats wrong with it. Can anyone suggest a better way to do this job? Any help is appreciated.
My first advice - learn regular expressions, it gives you an unlimited power of text processing.
But, to give you some working solution (and start point to further exploration) try this:
import re
re_offers = re.compile(r'''
\b # Word boundary
(?: # Non capturing parenthesis
deals? # Deal or deals
| # or ...
offers? # Offer or offers
|
discount
|
promotion
|
sale
|
rs\.? # rs or rs.
|
inr\d+ # INR then digits
|
\d+inr # Digits then INR
) # And group
\b # Word boundary
| # or ...
\b\d+% # Digits (1 or more) then percent
|
\$\d+\b # Dollar then digits (didn't care of thousand separator yet)
''',
re.I|re.X) # Ignore case, verbose format - for you :)
abc = re_offers.findall("e misio $1 is inr123 discount 1INR a 1% and deal")
print(abc)
You don't need to use a regular expression for this, you can use any:
if any(term in tweet for term in search_terms):
In your array of things to search for you don't have a comma between " offers " and "discount" which is causing them to be joined together.
Also when you use split you are getting rid of the whitespace in your input text. "I have a deal" will become ["I","have","a","deal"] but your search terms almost all contain whitespace. So remove the spaces from your search terms in array ar1.
However you might want to avoid using regular expressions and just use in instead (you will still need the chnages I suggest above though):
if x in y:
You might want to consider starting with find instead instead of a regex. You don't have complex expressions, and as you're handling a line of text you don't need to call split, instead just use find:
for token in ar1:
if data.find(token) != -1:
abc.append(data)
Your for item in tweets loop becomes:
for item in tweets:
data = item.get('text')
data = data.lower()
for x in ar1:
if data.find(x)
newlist.append(data)
mylist.append(item.get('id'))
break
Re: your comment on jonsharpe's post, to avoid including substrings, surround your tokens by spaces, e.g. " rs ", " INR "

Categories

Resources