I want to get
the number after: page=
the number after: "new">
the number after: /a>-
<td> </td>
<td> qwqwqwqwqw <br/> qwqwqwqwqw 4449-4450<br/> </td>
<td> </td>
<td> qwqwqwqwqw <br/> qwqwqwqwqw 5111-5550<br/> </td>
<td> </td>
...
My code
tables = soup.find_all('a', attrs={'target': 'new'})
gives me only a list (see below) without the third number
[4449,
5111,
...]
Here is how I would try to extract the 3 numbers from my list, once it has the third number in it.
list_of_number1 = []
list_of_number2 = []
list_of_number3 = []
regex = re.compile(r"page=(\d+)")
for table in tables:
    number1 = filter(regex.match, tables)
    number2 = table.next_sibling.strip()
    number3 =
    list_of_number1.append(number1)
    list_of_number2.append(number2)
    list_of_number3.append(number3)
Do I use BeautifulSoup for the third number, or is it feasible to regex through the whole HTML for any number following "/a>-"?
Here is how you can obtain your result using just the information you need: the numbers in the specific a element and in the text node immediately to its right:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all('a', attrs={'target': 'new'})
print([(t.text, t["href"].split('=')[-1], t.next_sibling.replace('-', '')) for t in tables])
# => [('4449', '99', '4450'), ('5111', '77', '5550')]
You may certainly go the harder way with regexps:
import re
# ... the same initialization code as above
results = []
for t in tables:
    page = ""
    page_m = re.search(r"[#?]page=(\d+)", t["href"])
    if page_m:
        page = page_m.group(1)
    num = "".join([x for x in t.next_sibling if x.isdigit()])
    results.append((int(t.text), int(page), int(num)))
print(results)
# => [(4449, 99, 4450), (5111, 77, 5550)]
NOTE:
t.text - gets the element text
t["href"] - gets the href attribute value of the t element
t.next_sibling - gets the next node after the current one on the same hierarchy level.
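Putting those pieces together as a runnable sketch (the HTML below is made up to mirror the structure in the question; the href values page=99 and page=77 are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mirroring the question's markup; hrefs are invented.
html = '''
<td> <a target="new" href="index.html?page=99">4449</a>-4450 </td>
<td> <a target="new" href="index.html?page=77">5111</a>-5550 </td>
'''
soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all('a', attrs={'target': 'new'})
# element text, last part of the href, text node to the right
result = [(t.text, t["href"].split('=')[-1], t.next_sibling.replace('-', '').strip())
          for t in tables]
print(result)  # [('4449', '99', '4450'), ('5111', '77', '5550')]
```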
You can also try:
for b in soup.select('a'):
    print(b.attrs['href'].split('=')[1], b.text, b.next_sibling)
Output:
99 4449 -4450
77 5111 -5550
Related
I am trying to extract column headings from one of the tables from ABBV 10-k sec filing (`Issuer Purchases of Equity Securities' table on page 25 - below the graph.)
Inside the <td> tag in the column-heading <tr> tag, the text is in separate <div> tags, as in the example below:
<tr>
<td>
<div>string1</div>
<div>string2</div>
<div>string3</div>
</td>
</tr>
When trying to extract all text from a tag, there is no space separation between the texts (e.g. for the above HTML the output will be string1string2string3; expected: string1 string2 string3).
I am using the code below to extract the column headings from the table:
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/1551152/000155115218000014/abbv-20171231x10k.htm'
htmlpage = requests.get(url)
soup = BeautifulSoup(htmlpage.text, "lxml")
table = soup.find_all('table')[76]
rows = table.find_all('tr')
table_data = []
for tr in rows[2:3]:
    row_data = []
    cells = tr.find_all(['td', 'th'], recursive=False)
    for cell in cells[1:4]:
        row_data.append(cell.text.encode('utf-8'))
    table_data.append([x.decode('utf-8').strip() for x in row_data])
print(table_data)
output:[['(a) TotalNumberof Shares(or Units)Purchased', '', '(b) AveragePricePaid per Share(or Unit)']]
Expected output: [['(a) Total Number of Shares (or Units) Purchased', '', '(b) Average Price Paid per Share (or Unit)']] (each word separated by a space)
use the separator parameter with .get_text():
html = '''<tr>
<td>
<div>string1</div>
<div>string2</div>
<div>string3</div>
</td>
</tr>'''
import bs4
soup = bs4.BeautifulSoup(html, 'html.parser')
td = soup.find('td')
td.get_text(separator=' ')
Here's how it looks with your code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.sec.gov/Archives/edgar/data/1551152/000155115218000014/abbv-20171231x10k.htm'
htmlpage = requests.get(url)
soup = BeautifulSoup(htmlpage.text, "lxml")
table = soup.find_all('table')[76]
rows = table.find_all('tr')
table_data = []
for tr in rows[2:3]:
    row_data = []
    cells = tr.find_all(['td', 'th'], recursive=False)
    for cell in cells[1:4]:
        row_data.append(cell.get_text(separator=' ').encode('utf-8'))
    table_data.append([x.decode('utf-8').strip() for x in row_data])
print(table_data)
Output:
[['(a) Total Number of Shares (or Units) Purchased', '', '(b) Average Price Paid per Share (or Unit)']]
I am a self-learner and a beginner; I searched a lot, but maybe my searching was lacking. I am scraping some values from two web sites and I want to compare them in an HTML output. On each web page I combine two classes and put them into a list. But when producing the HTML output, I don't want the whole list to print, so I made a function to choose which keywords to print. When I print that function's result, it comes out as 'None' in the HTML output, but it prints what I wanted on the console. So how do I show that specific list?
OS= Windows , Python3.
from bs4 import BeautifulSoup
import requests
import datetime
import os
import webbrowser
carf_meySayf = requests.get('https://www.carrefoursa.com/tr/tr/meyve/c/1015?show=All').text
carf_soup = BeautifulSoup(carf_meySayf, 'lxml')
#spans
carf_name_span = carf_soup.find_all('span', {'class' : 'item-name'})
carf_price_span = carf_soup.find_all('span', {'class' : 'item-price'})
#spans to list
carf_name_list = [span.get_text() for span in carf_name_span]
carf_price_list = [span.get_text() for span in carf_price_span]
#combine lists
carf_mey_all = [carf_name_list +' = ' + carf_price_list for carf_name_list, carf_price_list in zip(carf_name_list, carf_price_list)]
#Function to choose and print special product
def test(namelist,product):
    for i in namelist:
        if product in i:
            print(i)
a = test(carf_mey_all,'Muz')
# Date
date = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
# HTML part
html_str = """
<html>
<title>Listeler</title>
<h2>Tarih: %s</h2>
<h3>Product & Shop List</h3>
<table style="width:100%%">
<tr>
<th>Carrefour</th>
</tr>
<tr>
%s
</tr>
</html>
"""
whole = html_str %(date,a)
Html_file= open("Meyve.html","w")
Html_file.write(whole)
Html_file.close()
The function test() must return a value, for example:
def test(namelist,product):
    results = ''
    for i in namelist:
        if product in i:
            print(i)
            results += '<td>%s</td>\n' % i
    return results
Meyve.html results:
<html>
<title>Listeler</title>
<h2>Tarih: 2018-12-29 07:34:00</h2>
<h3>Product & Shop List</h3>
<table style="width:100%">
<tr>
<th>Carrefour</th>
</tr>
<tr>
<td>Muz = 6,99 TL</td>
<td>İthal Muz = 12,90 TL</td>
<td>Paket Yerli Muz = 9,99 TL</td>
</tr>
</html>
Note: to be valid HTML you also need to add <body></body>.
The problem is that your test() function isn't explicitly returning anything, so it is implicitly returning None.
To fix this, test() should accumulate the text it wants to return (i.e, by building a list or string) and return a string containing the text you want to insert into html_str.
I'm trying to extract links from this page:
http://www.tadpoletunes.com/tunes/celtic1/
but I only want the reels, which in the page are delineated by:
start:
<th align="left"><b><a name="reels">REELS</a></b></th>
end (the lines above the following):
<th align="left"><b><a name="slides">SLIDES</a></b></th>
The question is how to do this. I have the following code which gets the links for everything with a .mid extension:
def import_midifiles():
    archive_url = "http://www.tadpoletunes.com/tunes/celtic1/"
    sauce = urllib.request.urlopen("http://www.tadpoletunes.com/tunes/celtic1/celtic.htm").read()
    soup = bs.BeautifulSoup(sauce, 'lxml')
    tables = soup.find_all('table')
    for table in tables:
        for link in table.find_all('a', href=True):
            if link['href'].endswith('.mid'):
                listofmidis.append(archive_url + link['href'])
    if listofmidis:
        listoflists.append(listofmidis)
    midi_list = [item for sublist in listoflists for item in sublist]
    return midi_list
I cannot figure this out from the beautifulsoup docs. I need the code because I will be repeating the activity on other sites in order to scrape data for training a model.
To get all the "REELS" links, you need to do the following:
Get the links in between "REELS" and "SLIDES" as you mentioned. To do that, first you'll need to find the <tr> tag containing <a name="reels">REELS</a>. This can be done using the .find_parent() method.
reels_tr = soup.find('a', {'name': 'reels'}).find_parent('tr')
Now, you can use the .find_next_siblings() method to get all the <tr> tags after "REELS". We can break the loop when we find the <tr> tag with <a name="slides">SLIDES</a> (or .find('a').text == 'SLIDES').
Complete code:
import requests
from bs4 import BeautifulSoup

def import_midifiles():
    BASE_URL = 'http://www.tadpoletunes.com/tunes/celtic1/'
    r = requests.get(BASE_URL)
    soup = BeautifulSoup(r.text, 'lxml')
    midi_list = []
    reels_tr = soup.find('a', {'name': 'reels'}).find_parent('tr')
    for tr in reels_tr.find_next_siblings('tr'):
        if tr.find('a').text == 'SLIDES':
            break
        midi_list.append(BASE_URL + tr.find('a')['href'])
    return midi_list

print(import_midifiles())
Partial output:
['http://www.tadpoletunes.com/tunes/celtic1/ashplant.mid',
'http://www.tadpoletunes.com/tunes/celtic1/bashful.mid',
'http://www.tadpoletunes.com/tunes/celtic1/bigpat.mid',
'http://www.tadpoletunes.com/tunes/celtic1/birdcage.mid',
'http://www.tadpoletunes.com/tunes/celtic1/boatstre.mid',
...
...
'http://www.tadpoletunes.com/tunes/celtic1/silspear.mid',
'http://www.tadpoletunes.com/tunes/celtic1/stafreel.mid',
'http://www.tadpoletunes.com/tunes/celtic1/kilkenny.mid',
'http://www.tadpoletunes.com/tunes/celtic1/swaltail.mid',
'http://www.tadpoletunes.com/tunes/celtic1/cuptea.mid']
HTML code:
<td> <label class="identifier">Speed (avg./max):</label> </td> <td class="value"> <span class="block">4.5 kn<br>7.1 kn</span> </td>
I need to get the values 4.5 kn and 7.1 kn as separate list items so I can append them separately. I wanted to split the text string using re.sub, but it does not work. I also tried to use replace to remove the br, but that did not work either. Can anybody provide any insight?
Python code:
def NameSearch(shipLink, mmsi, shipName):
    from bs4 import BeautifulSoup
    import urllib2
    import csv
    import re

    values = []
    values.append(mmsi)
    values.append(shipName)
    regex = re.compile(r'[\n\r\t]')
    i = 0
    with open('Ship_indexname.csv', 'wb') as f:
        writer = csv.writer(f)
        while True:
            try:
                shipPage = urllib2.urlopen(shipLink, timeout=5)
            except urllib2.URLError:
                continue
            except:
                continue
            break
        soup = BeautifulSoup(shipPage, "html.parser")  # Read the web page HTML
        #soup.find('br').replaceWith(' ')
        #for br in soup('br'):
        #    br.extract()
        table = soup.find_all("table", {"id": "vessel-related"})  # Finds the table with id "vessel-related"
        for mytable in table:  # Loops over the matched tables
            table_body = mytable.find_all('tbody')  # Finds the tbody section in the table
            for body in table_body:
                rows = body.find_all('tr')  # Finds all rows
                for tr in rows:  # Loops over the rows
                    cols = tr.find_all('td')  # Finds the columns
                    for td in cols:  # Loops over the columns
                        checker = td.text.encode('ascii', 'ignore')
                        check = regex.sub('', checker)
                        if check == ' Speed (avg./max): ':
                            i = 1
                        elif i == 1:
                            print td.text
                            pat = re.compile(r'<br\s*/>')
                            print pat.sub(" ", td.text)
                            values.append(td.text.strip("\n").encode('utf-8'))  # Takes the second column's value and appends it to values
                            i = 0
    #print values
    return values

NameSearch('https://www.fleetmon.com/vessels/kind-of-magic_0_3478642/', '230034570', 'KIND OF MAGIC')
Locate the "Speed (avg./max)" label first and then go to the value via .find_next():
from bs4 import BeautifulSoup
data = '<td> <label class="identifier">Speed (avg./max):</label> </td> <td class="value"> <span class="block">4.5 kn<br>7.1 kn</span> </td>'
soup = BeautifulSoup(data, "html.parser")
label = soup.find("label", class_="identifier", text="Speed (avg./max):")
value = label.find_next("td", class_="value").get_text(strip=True)
print(value) # prints 4.5 kn7.1 kn
Now, you can extract the actual numbers from the string:
import re
speed_values = re.findall(r"([0-9.]+) kn", value)
print(speed_values)
Prints ['4.5', '7.1'].
You can then further convert the values to floats and unpack into separate variables:
avg_speed, max_speed = map(float, speed_values)
I want to get the movie title, year, rating, genres, and run time of five movies from the HTML page given in the code. These are in the rows of the table called results.
from bs4 import BeautifulSoup
import urllib2
def read_from_url(url, num_m=5):
    html_string = urllib2.urlopen(url)
    soup = BeautifulSoup(html_string)
    movie_table = soup.find('table', 'results') # table of movies
    list_movies = []
    count = 0
    for row in movie_table.find_all("tr"):
        dict_each_movie = {}
        title = title.encode("ascii", "ignore") # getting title
        dict_each_movie["title"] = title
        year = year.encode("ascii","ignore") # getting year
        dict_each_movie["year"] = year
        rank = rank.encode("ascii","ignore") # getting rank
        dict_each_movie["rank"] = rank
        # genres = [] # getting genres of a movie
        runtime = runtime.encode("ascii","ignore") # getting runtime
        dict_each_movie["runtime"] = runtime
        list_movies.append(dict_each_movie)
        count += 1
        if count == num_of_m:
            break
    return list_movies
print read_from_url('http://www.imdb.com/search/title?at=0&sort=user_rating&start=1&title_type=feature&year=2005,2015',2)
Expected output:
[{'rating': '10.0', 'genres': ['Comedy', 'Family'], 'title': 'How to Beat a Bully', 'rank': '1', 'year': '2014', 'runtime': '90'},..........]
You're accessing a variable that hasn't been declared. When the interpreter sees title.encode("ascii", "ignore"), it looks for the variable title, which hasn't been declared previously. Python can't possibly know what title is, so you can't call encode on it. The same goes for year and rank. Instead use:
title = 'How to Beat a Bully'.encode('ascii','ignore')
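The error can be reproduced in isolation (at the top level this raises a NameError; inside a function body it would surface as an UnboundLocalError):

```python
# 'title' is read on the right-hand side before it was ever assigned,
# so Python raises a NameError.
try:
    title = title.encode("ascii", "ignore")
except NameError as e:
    print(e)  # name 'title' is not defined
```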
Why so???
Make your life easier with CSS Selectors.
<table>
<tr class="my_class">
<td id="id_here">
<a href="link_here">First Link</a>
</td>
<td id="id_here">
<a href="link_here">Second Link</a>
</td>
</tr>
</table>
for tr in movie_table.select("tr.my_class"):
    for td in tr.select("td#id_here"):
        print("Link " + td.select("a")[0]["href"])
        print("Text " + td.select("a")[0].text)
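Tying the two snippets above together into something runnable (link_here and id_here are the placeholder values from the example markup):

```python
from bs4 import BeautifulSoup

# The illustrative HTML from above; hrefs and ids are placeholders.
html = '''
<table>
<tr class="my_class">
<td id="id_here"><a href="link_here">First Link</a></td>
<td id="id_here"><a href="link_here">Second Link</a></td>
</tr>
</table>
'''
soup = BeautifulSoup(html, "html.parser")
links = []
for tr in soup.select("tr.my_class"):        # rows by class
    for td in tr.select("td#id_here"):       # cells by id
        a = td.select("a")[0]
        links.append((a["href"], a.text))
print(links)  # [('link_here', 'First Link'), ('link_here', 'Second Link')]
```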