Extract hashtags from a website's HTML - python

I want to extract all the hashtags from a given website.
For example, given "I love #stack overflow because #people are very #helpful!", this should pull the 3 hashtags into a table.
On the website I am targeting there is a table with a description for each #tag, so for #love we can find "this hashtag speaks about love".
This is my work:
#import the library used to query a website
import urllib2
#specify the url
wiki = "https://www.symplur.com/healthcare-hashtags/tweet-chats/all"
#query the website and return the html to the variable 'page'
page = urllib2.urlopen(wiki)
#import the BeautifulSoup functions to parse the data returned from the website
from bs4 import BeautifulSoup
#parse the html in the 'page' variable, and store it in BeautifulSoup format
soup = BeautifulSoup(page, "lxml")
print soup.prettify()
s = soup.get_text()
import re
re.findall("#(\w+)", s)
I have two issues with the output.
The first is that the output looks like this:
[u'eeeeee',
u'333333',
u'222222',
u'222222',
u'222222',
u'222222',
u'222222',
u'222222',
u'222222',
u'AASTGrandRoundsacute'
The second is that the output concatenates each hashtag with the first word of its description: compared to the example I gave above, the output is 'lovethis'.
How can I extract only the hashtag itself, without the word that follows it?
Thank you

I think there's no need to use regex to parse the text you get from the page; you can use BeautifulSoup itself for that. (The concatenation you saw happens because get_text() joins adjacent text nodes without a separator, so a hashtag and the first word of its description run together; your regex also picks up CSS colour codes such as #eeeeee.) I'm using Python 3.6 in the code below, just to show the entire code, but the important line is hashtags = soup.findAll('td', {'id':'tweetchatlist_hashtag'}). Notice all hashtags in the table have the td tag and the id attribute tweetchatlist_hashtag, so calling .findAll is the way to go here:
import requests
from bs4 import BeautifulSoup
wiki = "https://www.symplur.com/healthcare-hashtags/tweet-chats/all"
page = requests.get(wiki).text
soup = BeautifulSoup(page, "lxml")
hashtags = soup.findAll('td', {'id':'tweetchatlist_hashtag'})
Now let's have a look at the first item of our list:
>>> hashtags[0]
<td id="tweetchatlist_hashtag" itemprop="location"><a href="..." title="#AASTGrandRounds">#AASTGrandRounds</a></td>
So we see that what we really want is the value of the title attribute of the a tag inside it:
>>> hashtags[0].a['title']
'#AASTGrandRounds'
To get a list of all hashtags, use a list comprehension:
>>> lst = [hashtag.a['title'] for hashtag in hashtags]
If you are not used to list comprehension syntax, the line above is equivalent to this:
>>> lst = []
>>> for hashtag in hashtags:
...     lst.append(hashtag.a['title'])
lst is then the desired output; see the first 20 items of the list:
>>> lst[:20]
['#AASTGrandRounds', '#abcDrBchat', '#addictionchat', '#advocacychat', '#AetnaMyHealthy', '#AlzChat', '#AnatQ', '#anzOTalk', '#AskAvaility', '#ASPChat', '#ATtalk', '#autchat', '#AXSChat', '#ayacsm', '#bcceu', '#bccww', '#BCSM', '#benurse', '#BeTheDifference', '#bioethx']
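Since the original goal was to pull the hashtags into a table, here is a minimal sketch of writing lst to a CSV file (the hashtags.csv filename is just an example):

import csv

# write each hashtag as one row of a single-column table
with open('hashtags.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['hashtag'])            # header row
    writer.writerows([tag] for tag in lst)  # one row per hashtag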

Related

How can I split the text elements out from this HTML String? Python

Good Morning,
I'm doing some HTML parsing in Python and I've run across the following, which is a time & name pairing in a single table cell. I'm trying to extract each piece of information separately and have tried several different approaches to split the following string.
HTML String:
<span><strong>13:30</strong><br/>SecondWord</span></a>
My output would hopefully be:
text1 = "13:30"
text2 = "SecondWord"
I'm currently looping through all the rows in the table, taking the text and splitting it by a newline. I noticed the HTML has a line-break tag in between, so the two parts render separately on the web; I was trying to replace that with a newline and run my split on it. However, my string.replace() and re.sub() approaches don't seem to be working.
I'd love to know what I'm doing wrong.
Latest Approach:
resub_pat = r'<br/>'
rows = list()
for row in table.findAll("tr"):
    a = re.sub(resub_pat, "\n", row.text).split("\n")
This is a bit hacked together, but I hope I've captured my problem! I wasn't able to find any similar issues.
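A note on why the re.sub approach fails: row.text is the extracted text with all tags already removed, so the string '<br/>' never appears in it and the substitution finds nothing to replace. A minimal demonstration:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<span><strong>13:30</strong><br/>SecondWord</span>", "lxml")
# the <br/> tag is gone from the extracted text, so re.sub(r'<br/>', ...) matches nothing
print(soup.span.text)  # 13:30SecondWord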
You could try:
from bs4 import BeautifulSoup
import re
# the soup
soup = BeautifulSoup("<span><strong>13:30</strong><br/>SecondWord</span></a>", 'lxml')
# the regex object
rx = re.compile(r'(\d+:\d+)(.+)')
# time, text
text = soup.find('span').get_text()
x,y = rx.findall(text)[0]
print(x)
print(y)
This uses recursive=False to get only the direct text of the span, and strong.text to get the other part.
Ex:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<span><strong>13:30</strong><br/>SecondWord</span></a>", 'lxml')
# text1
print(soup.find("span").strong.text) # --> 13:30
# text2
print(soup.find("span").find(text=True, recursive=False)) # --> SecondWord
Another option, using get_text() with a separator:
from bs4 import BeautifulSoup
txt = '''<span><strong>13:30</strong><br/>SecondWord</span></a>'''
soup = BeautifulSoup(txt, 'html.parser')
text1, text2 = soup.span.get_text(strip=True, separator='|').split('|')
print(text1)
print(text2)
Prints:
13:30
SecondWord
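One caveat with the separator trick: if the cell text could itself contain '|', pick a separator that cannot occur in the data, for example a newline:

text1, text2 = soup.span.get_text(strip=True, separator='\n').split('\n')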

Is there any convenient way to get the index of a subsection in a page?

It is convenient to use "index-x" to quickly locate a subsection in a page.
For instance,
https://docs.python.org/3/library/re.html#index-2
gives the 3rd subsection of that page.
When I want to share the location of a subsection with others, how can I get its index conveniently?
For instance, how do I get the index of the {m,n} subsection without counting up from index-0?
With bs4 4.7.1 you can use :has and :contains to target a specific text string and return the index (note that select_one returns the first match; use a list comprehension with select if you want to return all matches):
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://docs.python.org/3/library/re.html')
soup = bs(r.content, 'lxml')
index = soup.select_one('dl:has(.pre:contains("{m,n}"))')['id']
print(index)
Any version: if you want a dictionary that maps special characters to indices. Thanks to @zoe for spotting the error in my dictionary comprehension.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://docs.python.org/3/library/re.html')
soup = bs(r.content, 'lxml')
mappings = dict([(item['id'], [i.text for i in item.select('dt .pre')]) for item in soup.select('[id^="index-"]')])
indices = {i: k for (k, v) in mappings.items() for i in v}
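A lookup is then just a dictionary access (assuming the page structure hasn't changed):

print(indices['{m,n}'])  # 'index-7'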
You're looking for index-7.
You can download the HTML of the page and get all the possible values of index-something with the following code:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get('https://docs.python.org/3/library/re.html')
soup = BeautifulSoup(r.content.decode(), 'html.parser')
result = [t['id'] for t in soup.find_all(id=re.compile(r'index-\d+'))]
print(result)
Output:
['index-0', 'index-1', 'index-2', 'index-3', 'index-4', 'index-5', 'index-6', 'index-7', 'index-8', 'index-9', 'index-10', 'index-11', 'index-12', 'index-13', 'index-14', 'index-15', 'index-16', 'index-17', 'index-18', 'index-19', 'index-20', 'index-21', 'index-22', 'index-23', 'index-24', 'index-25', 'index-26', 'index-27', 'index-28', 'index-29', 'index-30', 'index-31', 'index-32', 'index-33', 'index-34', 'index-35', 'index-36', 'index-37', 'index-38']
The t objects in the list comprehension contain the HTML of the tags whose id matches the regex.
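Since the goal was to share a location, any of these anchors can then be turned into a full URL, for example (assuming result is in document order, so result[2] is 'index-2'):

base = 'https://docs.python.org/3/library/re.html'
urls = [base + '#' + anchor for anchor in result]
print(urls[2])  # https://docs.python.org/3/library/re.html#index-2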

Python BeautifulSoup find_all function without duplication

markup = '<b></b><a></a><p>hey</p><li><p>How</p></li><li><p>are you </p></li>'
soup = BeautifulSoup(markup, 'lxml')
data = soup.find_all(['p', 'li'])
print(data)
The result looks like this
['<p>hey</p>,<p>How </p>,<li><p>How</p></li>,<p>are you </p>,<li><p>are you </p></li>']
How can I make it return only
['<p>hey</p>,<li><p>How</p></li>,<li><p>are you </p></li>']
or is there any way to remove the duplicated p tags? Thanks
BeautifulSoup is not analyzing the text of the html; it is collecting every tag that matches your query, so a nested <p> is returned both on its own and inside its parent <li>. In this case, you need to perform an additional step and check whether the text of a <p> has already been seen in a different structure:
import re
from bs4 import BeautifulSoup as soup

markup = '<b></b><a></a><p>hey</p><li><p>How</p></li><li><p>are you </p></li>'
s = soup(markup, 'lxml')
final_s = s.find_all(re.compile('p|li'))
final_data = [','.join(map(str, [a for i, a in enumerate(final_s) if not any(a.text == b.text for b in final_s[:i])]))]
Output:
['<p>hey</p>,<li><p>How</p></li>,<li><p>are you </p></li>']
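For reference, the deduplication step in that one-liner is equivalent to this more readable loop, which keeps the first tag seen for each distinct text:

seen = set()
unique = []
for tag in final_s:
    if tag.text not in seen:   # skip any tag whose text was already seen
        seen.add(tag.text)
        unique.append(tag)
final_data = [','.join(map(str, unique))]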
If you are looking for the text and not the <li> tags, you can just search for the <p> tags and get the desired result without duplication.
>>> data = soup.find_all('p')
>>> data
[<p>hey</p>, <p>How</p>, <p>are you </p>]
Or, if you are simply looking for the text, you can use this:
>>> data = [x.text for x in soup.find_all('p')]
>>> data
['hey', 'How', 'are you ']

Find and extract curr_id number from Investing

I need to know the curr_id in order to submit a request to investing.com with Python and extract historic data for a number of currencies/commodities. As in the example below, I'm able to extract all the script tags, but I cannot figure out which script contains curr_id, or how to extract the digits '2103' from it. Example: I need the code to find 2103.
import requests
from bs4 import BeautifulSoup

#URL
url = 'http://www.investing.com/currencies/usd-brl-historical-data'
#OPEN URL
r = requests.get(url)
#DETERMINE FORMAT
soup = BeautifulSoup(r.content, 'html.parser')
#FIND ALL SCRIPT TAGS IN soup
curr_data = soup.find_all('script', {'type': 'text/javascript'})
UPDATE
I did it like this:
g_data_string = str(g_data)
if 'curr_id' in g_data_string:
    print('success')
    start = g_data_string.find('curr_id') + 9
    end = g_data_string.find('curr_id') + 13
    print(g_data_string[start:end])
But I'm sure there is a better way to do it.
You can use a regular expression pattern as a text argument to find a specific script element. Then, search inside the text of the script using the same regular expression:
import re
import requests
from bs4 import BeautifulSoup
url = 'http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
pattern = re.compile(r"curr_id: (\d+)")
script = soup.find('script', text=pattern)
match = pattern.search(script.text)
if match:
    print(match.group(1))
Prints 2103.
Here (\d+) is a capturing group that would match one or more digits.
You don't actually need a regex; you can get the id by extracting the value attribute from the input tag with name=item_ID:
In [6]: from bs4 import BeautifulSoup
In [7]: import requests
In [8]: r = requests.get("http://www.investing.com/currencies/usd-brl-historical-data").content
In [9]: soup = BeautifulSoup(r, "html.parser")
In [10]: soup.select_one("input[name=item_ID]")["value"]
Out[10]: u'2103'
You could also look for the id starting with item_id:
In [11]: soup.select_one("input[id^=item_id]")["value"]
Out[11]: u'2103'
Or look for the first div with the pair_id attribute:
In [12]: soup.select_one("div[pair_id]")["pair_id"]
Out[12]: u'2103'
There are actually numerous ways to get it.
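For something defensive, here is a sketch combining two of these approaches with a regex fallback (the selectors and pattern are taken from the answers above; the site's markup may well have changed since):

import re
import requests
from bs4 import BeautifulSoup

def get_curr_id(url):
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    # preferred: the hidden input carrying the instrument id
    tag = soup.select_one('input[name=item_ID]')
    if tag and tag.get('value'):
        return tag['value']
    # fallback: pull the id out of the inline JavaScript
    pattern = re.compile(r'curr_id: (\d+)')
    script = soup.find('script', text=pattern)
    return pattern.search(script.text).group(1) if script else None

print(get_curr_id('http://www.investing.com/currencies/usd-brl-historical-data'))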

LiveScore BeautifulSoup Python

I am using BeautifulSoup to parse this site:
http://www.livescore.com/soccer/champions-league/
I'm looking to get the links for the rows with numbers, e.g.:
FT Zenit St. Petersburg 3 - 0 Standard Liege
The 3 - 0 is a link; what I want to do is find every link with numbers (so not results like
15:45 APOEL Nicosia ? - ? Paris Saint Germain
), so I can go load these links and parse out the minute data (<td class="min">).
Edit: I'm now able to get the links, like this:
import urllib2, re, bs4

sitioweb = urllib2.urlopen('http://www.livescore.com/soccer/champions-league/').read()
soup = bs4.BeautifulSoup(sitioweb)
href_tags = soup.find_all('a', {'class': "scorelink"})
links = []
for x in xrange(1, len(href_tags)):
    insert = href_tags[x].get("href")
    links.append(insert)
print links
Now my problem is the following: I want to write all of this into a DB (like sqlite), together with the minute in which each goal was scored (information I can get from the linked page). But that is only possible when the score is not ? - ?, as in that case no goal has been made yet.
I hope you can understand me...
Best regards and thanks a lot for your help,
Marco
The following search matches only your links:
import re
links = soup.find_all('a', class_='scorelink', href=True,
                      text=re.compile(r'\d+ - \d+'))
The search is limited to:
<a> tags
with the class scorelink
a non-empty href attribute
and the link text containing two digits separated by a dash.
Extracting just the links is then trivial:
score_urls = [link['href'] for link in soup.find_all(
    'a', class_='scorelink', href=True, text=re.compile(r'\d+ - \d+'))]
Demo:
>>> from bs4 import BeautifulSoup
>>> import requests
>>> from pprint import pprint
>>> soup = BeautifulSoup(requests.get('http://www.livescore.com/soccer/champions-league/').content)
>>> [link['href'] for link in soup.find_all('a', class_='scorelink', href=True, text=re.compile('\d+ - \d+'))]
['/soccer/champions-league/group-e/cska-moscow-vs-manchester-city/1-1821202/', '/soccer/champions-league/qualifying-round/zenit-st-petersburg-vs-standard-liege/1-1801440/', '/soccer/champions-league/qualifying-round/apoel-nicosia-vs-aab/1-1801432/', '/soccer/champions-league/qualifying-round/bate-borisov-vs-slovan-bratislava/1-1801436/', '/soccer/champions-league/qualifying-round/celtic-vs-maribor/1-1801428/', '/soccer/champions-league/qualifying-round/fc-porto-vs-lille/1-1801444/', '/soccer/champions-league/qualifying-round/arsenal-vs-besiktas/1-1801438/', '/soccer/champions-league/qualifying-round/athletic-bilbao-vs-ssc-napoli/1-1801446/', '/soccer/champions-league/qualifying-round/bayer-leverkusen-vs-fc-koebenhavn/1-1801442/', '/soccer/champions-league/qualifying-round/malmo-ff-vs-salzburg/1-1801430/', '/soccer/champions-league/qualifying-round/pfc-ludogorets-razgrad-vs-steaua-bucuresti/1-1801434/']
>>> pprint(_)
['/soccer/champions-league/group-e/cska-moscow-vs-manchester-city/1-1821202/',
'/soccer/champions-league/qualifying-round/zenit-st-petersburg-vs-standard-liege/1-1801440/',
'/soccer/champions-league/qualifying-round/apoel-nicosia-vs-aab/1-1801432/',
'/soccer/champions-league/qualifying-round/bate-borisov-vs-slovan-bratislava/1-1801436/',
'/soccer/champions-league/qualifying-round/celtic-vs-maribor/1-1801428/',
'/soccer/champions-league/qualifying-round/fc-porto-vs-lille/1-1801444/',
'/soccer/champions-league/qualifying-round/arsenal-vs-besiktas/1-1801438/',
'/soccer/champions-league/qualifying-round/athletic-bilbao-vs-ssc-napoli/1-1801446/',
'/soccer/champions-league/qualifying-round/bayer-leverkusen-vs-fc-koebenhavn/1-1801442/',
'/soccer/champions-league/qualifying-round/malmo-ff-vs-salzburg/1-1801430/',
'/soccer/champions-league/qualifying-round/pfc-ludogorets-razgrad-vs-steaua-bucuresti/1-1801434/']
This is fairly easy to do without pushing the filtering into BeautifulSoup: just find all the links first, then filter out the ones whose text is ? - ?, and finally get the href attribute from each item in the cleaned-up list. See below.
In [1]: from bs4 import BeautifulSoup as bsoup
In [2]: import requests as rq
In [3]: url = "http://www.livescore.com/soccer/champions-league/"
In [4]: r = rq.get(url)
In [5]: bs = bsoup(r.text)
In [6]: links = bs.find_all("a", class_="scorelink")
In [7]: links
Out[7]:
[<a class="scorelink" href="/soccer/champions-league/group-a/atletico-madrid-vs-malmo-ff/1-1821150/" onclick="return false;">? - ?</a>,
<a class="scorelink" href="/soccer/champions-league/group-a/olympiakos-vs-juventus/1-1821151/" onclick="return false;">? - ?</a>,
...
In [8]: links_clean = [link for link in links if link.get_text() != "? - ?"]
In [9]: links_clean
Out[9]:
[<a class="scorelink" href="/soccer/champions-league/group-e/cska-moscow-vs-manchester-city/1-1821202/" onclick="return false;">0 - 1</a>,
<a class="scorelink" href="/soccer/champions-league/qualifying-round/zenit-st-petersburg-vs-standard-liege/1-1801440/" onclick="return false;">3 - 0</a>,
...
In [10]: links_final = [link["href"] for link in links_clean]
In [11]: links_final
Out[11]:
['/soccer/champions-league/group-e/cska-moscow-vs-manchester-city/1-1821202/',
'/soccer/champions-league/qualifying-round/zenit-st-petersburg-vs-standard-liege/1-1801440/',
...
Extracting the minutes from each link is, of course, up to you.
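As a starting point, a minimal sketch of that last step, using the <td class="min"> cell mentioned in the question (the site's markup may have changed, so treat the selector as an assumption):

import requests
from bs4 import BeautifulSoup

BASE = "http://www.livescore.com"

def goal_minutes(path):
    # fetch one match page and return the text of every <td class="min"> cell
    soup = BeautifulSoup(requests.get(BASE + path).content, "html.parser")
    return [td.get_text(strip=True) for td in soup.find_all("td", class_="min")]

for path in links_final:
    print(path, goal_minutes(path))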
