Dividing scraped text with Python and Beautiful Soup

I've scraped the timetable from this website.
The output I get is:
"ROUTE": "NAPOLI PORTA DI MASSA \u00bb ISCHIA"
but I would like:
"DEPARTURE PORT": "NAPOLI PORTA DI MASSA"
"ARRIVAL PORT": "ISCHIA"
How do I divide the string?
Here is the code:
medmar_live_departures_table = list(soup.select('li.tratta'))
departure_time = []
for li in medmar_live_departures_table:
    next_li = li.find_next_sibling("li")
    while next_li and next_li.get("data-toggle"):
        if next_li.get("class") == ["corsa-yes"]:
            # departure_time.append(next_li.strong.text)
            medmar_live_departures_data.append({
                'ROUTE': li.text
            })

Two things:
1. Since "»" is a non-ASCII character, Python displays it as the escape \u00bb, so splitting the text on that character will work:
parse=li.get_text().split('\u00bb')
You can also use the re library to split on non-ASCII characters (you will need to import re if you choose this path):
import re
non_ascii = li.get_text()
parse = re.split('[^\x00-\x7f]', non_ascii)
# [^\x00-\x7f] will select non-ASCII characters, as pointed out by Moinuddin Quadri in https://stackoverflow.com/questions/40872126/python-replace-non-ascii-character-in-string
However, the split will produce a list of parts, and not all texts in the "li" HTML tag carry the "»" character (e.g. the text "POZZUOLI-PROCIDA" at the end of the table on the website), so we must account for that or we'll run into some issues.
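For example, a quick check of a row without the separator:
parse = "POZZUOLI-PROCIDA".split('\u00bb')
print(parse)  # ['POZZUOLI-PROCIDA'] -- only one element, no arrival port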
2. A dictionary may be a poor choice of data structure, since some of the routes you are parsing share the same departure port, which would become the same key.
For example, POZZUOLI » CASAMICCIOLA and POZZUOLI » PROCIDA both start from POZZUOLI, so CASAMICCIOLA and PROCIDA would share the same key. Python will simply overwrite/update the value of the POZZUOLI key, so POZZUOLI: CASAMICCIOLA becomes POZZUOLI: PROCIDA instead of being kept as two separate dictionary entries.
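A quick illustration of that overwriting behaviour:
routes = {}
routes['POZZUOLI'] = 'CASAMICCIOLA'
routes['POZZUOLI'] = 'PROCIDA'  # same key, so the first value is lost
print(routes)  # {'POZZUOLI': 'PROCIDA'}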
I suggest collecting each part of the parse into lists, as tuples, like so:
import re

single_port = []
ports = []
medmar_live_departures_table = list(soup.select('li.tratta'))
departure_time = []
for li in medmar_live_departures_table:
    next_li = li.find_next_sibling("li")
    while next_li and next_li.get("data-toggle"):
        if next_li.get("class") == ["corsa-yes"]:
            # departure_time.append(next_li.strong.text)
            non_ascii = li.get_text()
            parse = re.split('[^\x00-\x7f]', non_ascii)
            # The if statement takes care of table data strings that
            # don't have the non-ascii character "»"
            if len(parse) > 1:
                ports.append((parse[0], parse[1]))
            else:
                single_port.append(parse[0])
        next_li = next_li.find_next_sibling("li")  # advance so the while loop terminates

# This will print out your data in your desired manner
for i in ports:
    print("DEPARTURE: " + i[0])
    print("ARRIVAL: " + i[1])
for i in single_port:
    print(i)
I also used the split method in a test script that I ran:
import requests
from bs4 import BeautifulSoup
import re

url = "https://www.medmargroup.it/"
response = requests.get(url)
bs = BeautifulSoup(response.text, 'html.parser')

timeTable = bs.find('section', class_="primarystyle-timetable")
medmar_live_departures_table = timeTable.find('ul')

single_port = []
ports = []
for li in medmar_live_departures_table.find_all('li', class_="tratta"):
    parse = li.get_text().split('\u00bb')
    if len(parse) > 1:
        ports.append((parse[0], parse[1]))
    else:
        single_port.append(parse[0])

for i in ports:
    print("DEPARTURE: " + i[0])
    print("ARRIVAL: " + i[1])
for i in single_port:
    print(i)
I hope this helps!

try this:
medmar_live_departures_table = list(soup.select('li.tratta'))
departure_time = []
for li in medmar_live_departures_table:
    next_li = li.find_next_sibling("li")
    while next_li and next_li.get("data-toggle"):
        if next_li.get("class") == ["corsa-yes"]:
            # departure_time.append(next_li.strong.text)
            route = li.text.split("\u00bb")  # split once on the "»" escape
            medmar_live_departures_data.append({
                'DEPARTURE PORT': route[0],
                'ARRIVAL PORT': route[1]
            })
        next_li = next_li.find_next_sibling("li")  # advance so the while loop terminates
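One caveat: li elements without the "»" separator (e.g. POZZUOLI-PROCIDA) would raise an IndexError on route[1]. A sketch of a safer variant using str.partition:
dep, sep, arr = li.text.partition('\u00bb')
if sep:  # separator found, so this row really is a two-port route
    medmar_live_departures_data.append({
        'DEPARTURE PORT': dep.strip(),
        'ARRIVAL PORT': arr.strip()
    })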

Related

How do I get the first 3 sentences of a webpage in python?

I have an assignment where one of the things I can do is find the first 3 sentences of a webpage and display them. Finding the webpage text is easy enough, but I'm having problems figuring out how to find the first 3 sentences.
import requests
from bs4 import BeautifulSoup

url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(text=True)

output = ''
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head',
    'input',
    'script'
]
for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)

tempout = output.split('.')
tempout = tempout[:3]  # keep only the first three '.'-delimited chunks
output = '.'.join(tempout)
print(output)
Finding sentences out of text is difficult. Normally you would look for characters that might complete a sentence, such as '.' and '!'. But a period ('.') could appear in the middle of a sentence as in an abbreviation of a person's name, for example. I use a regular expression to look for a period followed by either a single space or the end of the string, which works for the first three sentences, but not for any arbitrary sentence.
import requests
from bs4 import BeautifulSoup
import re

url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')

paragraphs = soup.select('section.article_text p')
sentences = []
for paragraph in paragraphs:
    matches = re.findall(r'(.+?[.!])(?: |$)', paragraph.text)
    needed = 3 - len(sentences)
    found = len(matches)
    n = min(found, needed)
    for i in range(n):
        sentences.append(matches[i])
    if len(sentences) == 3:
        break
print(sentences)
Prints:
['Many people will land on this page after learning that their email address has appeared in a data breach I\'ve called "Collection #1".', "Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.", "Let's start with the raw numbers because that's the headline, then I'll drill down into where it's from and what it's composed of."]
To scrape the first three sentences, just add these lines to your code:
section = soup.find('section',class_ = "article_text post") #Finds the section tag with class "article_text post"
txt = section.p.text #Gets the text within the first p tag within the variable section (the section tag)
print(txt)
Output:
Many people will land on this page after learning that their email address has appeared in a data breach I've called "Collection #1". Most of them won't have a tech background or be familiar with the concept of credential stuffing so I'm going to write this post for the masses and link out to more detailed material for those who want to go deeper.
Hope this helps!
Actually, using Beautiful Soup you can filter by the class "article_text post" (see the page's source code):
myData = soup.find('section', class_="article_text post")
print(myData.p.text)
and get the inner text of the p element.

Can't properly read API JSON data

I was successful in extracting the data about matches I required in the first half of my code, but I can't seem to do the other part. I am reading JSON data in much the same way, but I'm getting strings, not dictionaries with data. I'm sure it's a logic problem or something; please help me. I have the working part on my GitHub: https://github.com/LEvinson2504/Football-Prediction-and-analysis
import urllib.request
import json

#Match odds
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib.request

def SportDemo():
    # Set url parameter
    url = "http://api.isportsapi.com/sport/free/football/odds/main?api_key=" + api_key

    # Call iSport Api to get data in json format
    f = urllib.request.urlopen(url)
    content = f.read()
    #data = json.loads((content.decode('utf-8')))
    data = content.decode('utf-8')

    '''store match ids
    matches = []
    #English teams match id
    for team in data['data']:
        if (team == 'English Premier League'):
            #store match ids
            matches.append(team['matchId'])
    '''

    #here is the problem, tried several ways to access data
    for i in data[data]:
        print(i['asia'])
    '''
    for match in data[data]['asia']:
        for coun in match:
            print(coun)
    '''
    '''
    if(match == 'asian'):
        print(type(match))
    '''
    #if (match['leagueName'] == 'ENG U23 D1'):
    #for odds in data['data']:
    #for i in matches:
        #print()

SportDemo()
Expected output: I want to read inside the dictionaries to get the data under the keys "europe" and "asia". The JSON format is documented here: https://www.isportsapi.com/docs?isportsDocIndex=1-4-24 (sorry, I couldn't format it here).
But I get nothing.
Firstly, when asking a question, please take the time to tidy it up so that it represents what you actually ran, and remove any commented-out code.
In your case, the problem can be reduced to:
f = urllib.request.urlopen(url)
content = f.read()
data = json.loads(content.decode('utf-8'))

# here is the problem, tried several ways to access data
for i in data[data]:
    print(i['asia'])
and we can actually see what the issue is. data is a dict; within that dict is a key 'data', which is itself a dict. Iterating through a dict gives you its keys (and data[data] is not valid at all, since a dict is unhashable and cannot be used as a key). If you just want to access the 'asia' data, then do so directly, no need to loop at all:
print(data['data']['asia'])
If you did want to iterate through every item, then use items():
for region, matches in data['data'].items():
    print(region)
    print(matches)
The downloaded data is too big (6.2 MB). Change the Jupyter notebook configuration file (jupyter_notebook_config.py):
Edit ~/.jupyter/jupyter_notebook_config.py
If you cannot find the file, generate it:
$ jupyter notebook --generate-config
Open the file and edit:
c.NotebookApp.iopub_data_rate_limit = 10000000
and restart: $ jupyter notebook
url = "http://api.isportsapi.com/sport/free/football/odds/main?api_key=" + api_key

# Call iSport Api to get data in json format
f = urllib.request.urlopen(url)
content = f.read()
#print(content.decode('utf-8'))
data = json.loads(content.decode('utf-8'))

print(data['data']['asian'])  # there is no 'asia' field in that content
Output is
[{'matchId': '4196461', 'companyId': '1', 'initialHandicap': '-0.25', 'initialHome': '0.78', 'initialAway': '1.02', 'instantHandicap': '-0.25', 'instantHome': '0.78', 'instantAway': '1.02', 'modifyTime': 1567434821, 'close': False, 'inPlay': False}, {'matchId': '4196461', 'companyId': '3', 'initialHandicap': '-0.25', 'initialHome': '0.91', 'initialAway': '0.91', 'instantHandicap': '-0.25', 'instantHome': '0.81', 'instantAway': '1.09', 'modifyTime': 1567709243, 'close': False, 'inPlay': True}, {'matchId': '4196461', 'companyId': '8', 'initialHandicap': '-0.25', 'initialHome': '0.85', 'initialAway': '1.00', 'instantHandicap': '-0.25', 'instantHome': '0.80',
...

XML data from URL using BeautifulSoup and output to dictionary

I need to read XML data from a URL (an exchange-rate list); the output should be a dictionary. Right now I can only get the first currency. I tried with find_all but without success.
Can somebody show me where I need to put the for loop to read all the values?
import bs4 as bs
import urllib.request

source = urllib.request.urlopen('http://www.xxxy.hr/Downloads/PBZteclist.xml').read()
soup = bs.BeautifulSoup(source, 'xml')

name = soup.find('Name').text
unit = soup.find('Unit').text
buyratecache = soup.find('BuyRateCache').text
buyrateforeign = soup.find('BuyRateForeign').text
meanrate = soup.find('MeanRate').text
sellrateforeign = soup.find('SellRateForeign').text
sellratecache = soup.find('SellRateCache').text

devize = {'naziv_valute': '{}'.format(name),
          'jedinica': '{}'.format(unit),
          'kupovni': '{}'.format(buyratecache),
          'kupovni_strani': '{}'.format(buyrateforeign),
          'srednji': '{}'.format(meanrate),
          'prodajni_strani': '{}'.format(sellrateforeign),
          'prodajni': '{}'.format(sellratecache)}

print("devize:", devize)
Example of XML:
<ExchRates>
  <ExchRate>
    <Bank>Privredna banka Zagreb</Bank>
    <CurrencyBase>HRK</CurrencyBase>
    <Date>12.01.2019.</Date>
    <Currency Code="036">
      <Name>AUD</Name>
      <Unit>1</Unit>
      <BuyRateCache>4,485390</BuyRateCache>
      <BuyRateForeign>4,530697</BuyRateForeign>
      <MeanRate>4,646869</MeanRate>
      <SellRateForeign>4,786275</SellRateForeign>
      <SellRateCache>4,834138</SellRateCache>
    </Currency>
    <Currency Code="124">
      <Name>CAD</Name>
      <Unit>1</Unit>
      <BuyRateCache>4,724225</BuyRateCache>
      <BuyRateForeign>4,771944</BuyRateForeign>
      <MeanRate>4,869331</MeanRate>
      <SellRateForeign>4,991064</SellRateForeign>
      <SellRateCache>5,040975</SellRateCache>
    </Currency>
    <Currency Code="203">
      <Name>CZK</Name>
      <Unit>1</Unit>
      <BuyRateCache>0,280057</BuyRateCache>
      <BuyRateForeign>0,284322</BuyRateForeign>
      <MeanRate>0,290124</MeanRate>
      <SellRateForeign>0,297377</SellRateForeign>
      <SellRateCache>0,300351</SellRateCache>
    </Currency>
    ...etc...
  </ExchRate>
</ExchRates>
Simply iterate through all Currency nodes (not the soup object) and even use a list comprehension to build a list of dictionaries:
soup = bs.BeautifulSoup(source, 'xml')

# ALL CURRENCY NODES
currency_nodes = soup.findAll('Currency')

# LIST OF DICTIONARIES
devize_list = [{'naziv_valute': c.find('Name').text,
                'jedinica': c.find('Unit').text,
                'kupovni': c.find('BuyRateCache').text,
                'kupovni_strani': c.find('BuyRateForeign').text,
                'srednji': c.find('MeanRate').text,
                'prodajni_strani': c.find('SellRateForeign').text,
                'prodajni': c.find('SellRateCache').text
                } for c in currency_nodes]
Alternatively, incorporate a dictionary comprehension, since you are extracting all child elements anyway (one dictionary per Currency node, keyed by tag name):
devize_list = [{n.name: n.text for n in c.children if n.name is not None}
               for c in currency_nodes]
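Either way, devize_list ends up with one dictionary per Currency node, which you can print just like the original devize:
for devize in devize_list:
    print("devize:", devize)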

Parsing HTML using LXML Python

I'm trying to parse Oxford Dictionary in order to obtain the etymology of a given word.
import lxml.html
from urllib.request import urlopen

class SkipException(Exception):
    def __init__(self, value):
        self.value = value

try:
    doc = lxml.html.parse(urlopen('https://en.oxforddictionaries.com/definition/%s' % "good"))
except SkipException:
    doc = ''

if doc:
    table = []
    trs = doc.xpath("//div[1]/div[2]/div/div/div/div[1]/section[5]/div/p")
I cannot seem to work out how to obtain the string of text I need. I know I'm missing some lines beyond the ones I have copied, but I don't fully understand how HTML or lxml works. I would much appreciate it if someone could show me the correct way to solve this.
You don't want to do web scraping here, especially when practically every dictionary has an API. In the case of Oxford, create an account at https://developer.oxforddictionaries.com/. Get the API credentials from your account and do something like this:
import requests
import json

api_base = 'https://od-api.oxforddictionaries.com:443/api/v1/entries/{}/{}'
language = 'en'
word = 'parachute'

headers = {
    'app_id': '',
    'app_key': ''
}

url = api_base.format(language, word)
reply = requests.get(url, headers=headers)
if reply.ok:
    reply_dict = json.loads(reply.text)
    results = reply_dict.get('results')
    if results:
        headword = results[0]
        entries = headword.get('lexicalEntries')[0].get('entries')
        if entries:
            entry = entries[0]
            senses = entry.get('senses')
            if senses:
                sense = senses[0]
                print(sense.get('short_definitions'))
Here's a sample to get you started scraping Oxford dictionary pages:
import lxml.html as lh
from urllib.request import urlopen

url = 'https://en.oxforddictionaries.com/definition/parachute'
html = urlopen(url)
root = lh.parse(html)
body = root.find("body")
elements = body.xpath("//span[@class='ind']")
for element in elements:
    print(element.text)
To find the correct search string you need to format the html so you can see the structure. I used the html formatter at https://www.freeformatter.com/html-formatter.html. Looking at the formatted HTML, I could see the definitions were in the span elements with the 'ind' class attribute.

Is it possible to pass a variable to (Beautifulsoup) soup.find()?

Hi, I need to pass a variable to the soup.find() function, but it doesn't work :(
Does anyone know a solution for this?
from bs4 import BeautifulSoup

html = '''<div> blabla
<p class='findme'> p-tag content</p>
</div>'''

sources = {'source1': '\'p\', class_=\'findme\'',
           'source2': '\'span\', class_=\'findme2\'',
           'source3': '\'div\', class_=\'findme3\''}

test = BeautifulSoup(html)

# this works
#print(test.find('p', class_='findme'))
# >>> <p class="findme"> p-tag content</p>

# this doesn't work
tag = '\'p\' class_=\'findme\''
# a source gets passed
print(test.find(sources[source]))
# >>> None
# >>> None
I am trying to split it up as suggested like this:
pattern = '"p", {"class": "findme"}'
tag = pattern.split(', ')
tag1 = tag[0]
filter = tag[1]
date = test.find(tag1, filter)
I don't get errors, just None for date. The problem is probably the content of tag1 and filter. The PyCharm debugger gives me:
tag1 = '"p"'
filter = '{"class": "findme"}'
Printing them doesn't show these apostrophes. Is it possible to remove these apostrophes?
The first argument is a tag name, and your string isn't just a tag name. BeautifulSoup (or Python, generally) won't parse out a string like that; it cannot guess that you put some arbitrary Python syntax in that value.
Separate out the components:
tag = 'p'
filter = {'class_': 'findme'}
test.find(tag, **filter)
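Note that find() also accepts the attributes as a plain dict through its attrs parameter, which avoids the class_ keyword entirely:
test.find(tag, attrs={'class': 'findme'})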
Okay, I got it, thanks again.
dic_date = {'source1': 'p, class:findme'}  # ... other sources
pattern = dic_date[source]
tag = pattern.split(', ')
if len(tag) == 2:
    att = tag[1].split(':')  # getting the attribute
    att = {att[0]: att[1]}   # building a dictionary for the attributes
    date = soup.find(tag[0], att)
else:
    date = soup.find(tag[0])  # if there is only a tag without an attribute
Well, it doesn't look very nice, but it's working :)
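An alternative sketch (same idea, but storing the tag and attribute filter as structured data, so no string parsing is needed; the 'source3' key is assumed for the third entry):
from bs4 import BeautifulSoup

html = '''<div> blabla
<p class='findme'> p-tag content</p>
</div>'''

sources = {'source1': ('p', {'class': 'findme'}),
           'source2': ('span', {'class': 'findme2'}),
           'source3': ('div', {'class': 'findme3'})}

tag, attrs = sources['source1']
test = BeautifulSoup(html, 'html.parser')
print(test.find(tag, attrs=attrs))
# >>> <p class="findme"> p-tag content</p>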
