I'd like to parse a website with rates, but I couldn't extract the data from the <td> elements.
I wrote a short test script that gets the first data row of the table:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://www.gpw.pl/wskazniki_spolek_full')
table = driver.find_elements_by_xpath("//table[@class='tab03']/tbody/tr")[4].text
print table
driver.quit()
and I'm getting this result:
2 PLNFI0800016 141 08OCTAVA 42 786 848 44,07 63,86 2016-12-31 H 0,69 27,80 ---
but I'd like to loop through all <td> elements in each <tr>, for every table that has class='tab03':
table = driver.find_elements_by_xpath("//table[@class='tab03']/tbody/tr")
for el in table:
    col_id = el.find_element_by_tag_name('td')[1].text
    col_kod = el.find_element_by_tag_name('td')[2].text
    print("{}".format(col_id, col_kod))
driver.quit()
but I'm getting an error: selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: td
Some header rows have no td elements inside them; skip them:
rows = driver.find_elements_by_xpath("//table[@class='tab03']/tbody/tr")
for row in rows[3:]:
    cells = row.find_elements_by_tag_name('td')
    col_id = cells[0].text
    col_kod = cells[1].text
    print("{} {}".format(col_id, col_kod))
Also note that, to get to the td cells, you need find_elements_by_tag_name() (plural), which returns a list, and then pick the desired cells by 0-based index.
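Rather than hard-coding rows[3:], one option is to skip any row that has no td cells at all. The sketch below shows that check on a small made-up table, using only the standard library so it runs without a browser; the markup and the first code value are assumptions, not the real page. With Selenium the same idea would be an `if not cells: continue` right after the find_elements_by_tag_name('td') call, since the plural form returns an empty list instead of raising.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the real page's markup.
SAMPLE = """
<table class="tab03">
  <tbody>
    <tr><th>Lp</th><th>Kod</th></tr>
    <tr><td>1</td><td>PLNFI0600010</td></tr>
    <tr><td>2</td><td>PLNFI0800016</td></tr>
  </tbody>
</table>
"""

def data_rows(table):
    """Return the cell texts of every row that contains at least one <td>."""
    result = []
    for row in table.findall("./tbody/tr"):
        cells = row.findall("td")
        if not cells:  # header row: only <th> children, nothing to read
            continue
        result.append([cell.text for cell in cells])
    return result

rows = data_rows(ET.fromstring(SAMPLE.strip()))
for col_id, col_kod in rows:
    print(col_id, col_kod)
```

This keeps working if the number of header rows changes, which a fixed slice does not.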
I'm new to Python and I'm trying to make a web scraper to get the name and the IP of Minecraft servers.
The problem is that I was able to get the value of the <td>, but the IP of the server, for example, is in a div inside the <td>.
I'm using pandas and lxml.html.
Example:
<tr>
  <td class="server-rank visible-sm visible-md visible-lg">
    <p><span class="badge">#1</span></p>
  </td>
  <td class="server-name" align="center">
    <div class="server-ip input-group">
      <p> this is the ip of the server </p> -I WANT TO GET HERE-
    </div>
  </td>
</tr>
I don't know how to get to the div inside the td.
I have this script, taken from a web page, that works perfectly for the other things but not for getting to the inner elements.
from numpy import tile
import requests
import lxml.html as lh
import pandas as pd
import re
#https://www.servidoresminecraft.info/1.8/
url='https://topminecraftservers.org/version/1.8.8'
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')
#Check the length of the first 5 rows
[len(T) for T in tr_elements[:5]]
#Create empty list
col=[]
i=0
#For each cell of the first row, store its text (header) and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print ('%d:"%s"'%(i,name))
    col.append((name,[]))
#Since our first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    #If row is not of size 3, the //tr data is not from our table
    if len(T)!=3:
        break
    #i is the index of our column
    i=0
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content()
        #Skip the first column when converting
        if i>0:
            #Convert any numerical value to integers
            try:
                if i==2 and j == 1:
                    print(2)
                data=int(data)
            except:
                pass
        #Append the data to the list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1
[len(C) for (title,C) in col]
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
print(df.head())
I just want to get an output that shows a table with the name of the server and the IP:
Name ip
server1 xxx.xxx.x.x
server2 xxx.xxx.x.x
Any help?
If I understand you correctly, this should get you what you're looking for:
servers = []
cols = ["Name", "ip"]
for s in doc.xpath("//td[@class='server-name']"):
    s_ip = s.xpath(".//div[@class='server-ip input-group']//span[@class='form-control text-justify']/text()")[0]
    s_name = s.xpath('.//h4/a/span/text()')[0]
    servers.append([s_name,s_ip])
pd.DataFrame(servers, columns = cols)
Output:
Name ip
0 AkumaMC akumamc.net
1 BattleAsya 1.8-1.16 play.battleasya.com
2 Caraotacraft network PRISON caraotacraft.top
3 FlameSquad 87.121.54.214:25568
4 LunixCraft lunixcraft.dk
etc.
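One caveat with the snippet above: indexing the XPath result with [0] raises IndexError for any row where the span is missing. A minimal sketch of a more forgiving lookup follows, using the standard library's ElementTree on made-up markup that mirrors the structure the XPath expressions imply; the sample HTML and the server named NoAddressServer are assumptions. With lxml, the equivalent guard would be to check that the list returned by xpath() is non-empty before taking its first item.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup modelled on the structure the XPath expressions imply.
SAMPLE = """
<table>
  <tr>
    <td class="server-name">
      <h4><a><span>AkumaMC</span></a></h4>
      <div class="server-ip input-group">
        <span class="form-control text-justify">akumamc.net</span>
      </div>
    </td>
  </tr>
  <tr>
    <td class="server-name">
      <h4><a><span>NoAddressServer</span></a></h4>
    </td>
  </tr>
</table>
"""

def extract_servers(root):
    """Return (name, ip) pairs; ip is None when a row has no address block."""
    servers = []
    for cell in root.findall(".//td[@class='server-name']"):
        # Paths starting with "." are relative to the current cell,
        # so each lookup stays inside its own row.
        name = cell.findtext("./h4/a/span")
        ip = cell.findtext(".//div[@class='server-ip input-group']/span")
        servers.append((name, ip))
    return servers

servers = extract_servers(ET.fromstring(SAMPLE.strip()))
for name, ip in servers:
    print(name, ip)
```

Rows without an address then show up with an empty value instead of stopping the whole scrape.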
As seen in the picture, I have more than 1 table rows. However, I am unable to retrieve any of those table rows except for the first row.
Code:
roster_tbody = browser.find_element(By.XPATH, "//*[@id='tableDay']/tbody")
tr = roster_tbody.find_elements(By.TAG_NAME, "tr")
print("number of table rows: " + str(len(tr)))
for trEl in tr:
    print(trEl.text)
Expected output: the date, day, year and 'Full' (if applicable) printed for every row.
Actual output:
What I've tried:
1)
tr = WebDriverWait(browser, 20).until(
    EC.visibility_of_all_elements_located((By.TAG_NAME, "tr"))
)
This does not work either; it returns only the first table row.
2)
row_2 = browser.find_element(By.XPATH, "//*[@id='tableDay']/tbody/tr[2]")
print(row_2.text)
Outcome:
3)
Changing //*[@id='tableDay']/tbody to .//*[@id='tableDay']/tbody does not work either.
4)
Searching for tr2 in the Elements panel: tr2 was found and highlighted.
5)
Searching for tr in the Console panel: all 37 tr elements were found.
6)
roster_tbody = browser.find_elements(By.XPATH, "//*[@id='tableDay']/tbody/tr")
j = 1
for i in range(len(tr)):
    element = browser.find_element(By.XPATH, f"(//*[@id='tableDay']/tbody/tr)[{j}]")
    browser.execute_script("arguments[0].scrollIntoView(true);", element)
    print(element.get_attribute("innerHTML"))
    j = j + 1
Output:
Any help is deeply appreciated!
Since there are a lot of tr tags, I would assume that scrolling to each element is necessary.
Also, I am using an index to look up each web element.
Code :
roster_tbody = browser.find_elements(By.XPATH, "//*[@id='tableDay']/tbody/tr")
j = 1
for i in range(len(roster_tbody)):
    element = browser.find_element(By.XPATH, f"(//*[@id='tableDay']/tbody/tr)[{j}]")
    browser.execute_script("arguments[0].scrollIntoView(true);", element)
    print(element.get_attribute('innerHTML'))
    j = j + 1
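One possible cause, which the thread does not confirm, is that the remaining rows exist in the DOM but are not displayed: Selenium's .text property returns only rendered text, so such rows print as empty strings. A small helper that falls back to the DOM's textContent attribute is sketched below. FakeRow and its sample strings are made-up stand-ins so the snippet runs without a browser; with a real driver you would pass the WebElement objects returned by find_elements.

```python
def row_text(element):
    """Return a row's text even when the row is not currently displayed."""
    visible = element.text  # Selenium: rendered text only
    if visible:
        return visible
    # DOM text, available regardless of visibility
    raw = element.get_attribute("textContent")
    return (raw or "").strip()


class FakeRow:
    """Stand-in for a Selenium WebElement (hypothetical test double)."""

    def __init__(self, text, text_content):
        self.text = text
        self._text_content = text_content

    def get_attribute(self, name):
        return self._text_content if name == "textContent" else None


# Made-up rows: the second one simulates a row that is not displayed.
rows = [
    FakeRow("1 Jan Mon 2023", "1 Jan Mon 2023"),
    FakeRow("", "  2 Jan Tue 2023 Full "),
]
texts = [row_text(r) for r in rows]
for line in texts:
    print(line)
```

If textContent also comes back empty, the rows are probably loaded lazily and the scrolling approach above is the more likely fix.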
I am trying to extract data from a dynamic table with the following structure:
Team 1 - Score - Team 2 - Minute first goal.
It is a table of soccer match results; there are about 10 matches per table and one table for each matchday. This is an example of the website I'm working with: https://www.resultados-futbol.com/premier/grupo1/jornada1
For this I am trying web scraping with BeautifulSoup in Python. Although I've made good progress, I'm running into a problem. I would like to write code that iterates over each row of the table, cell by cell, and stores each value in a list, so that I would have, for example:
List Team 1: Real Madrid, Barcelona
Score list: 1-0, 1-0
List Team 2: Atletico Madrid, Sevilla
First goal minutes list: 17', 64'
Once I have the lists, my intention is to build a complete dataframe from all the extracted data. However, I have the following problem: matches that end 0-0. In that case the 'Minute first goal' column is empty and nothing is extracted, so I cannot fill that value in my dataframe and I get an error. To continue with the previous example, imagine that the second game ended 0-0 and that the 'First goal minutes list' holds only one value (17').
In my mind the solution would be a loop that takes the data cell by cell, with a condition on 'Score': if it is 0-0, a value such as 'No goals' is added to the first goal minutes list.
This is the code I am using. I paste only the part where I would like to create the loop:
page = BeautifulSoup(driver.page_source, 'html.parser') # I have to use Selenium first because I have to expand some buttons on the page
table = page.find('div', class_= 'contentitem').find_all('tr', class_= 'vevent')
teams1 = []
teams2 = []
scores = []
for cell in table:
    team1 = cell.find('td', class_='team1')
    for name in team1:
        nteam1 = name.text
        teams1.append(nteam1)
    team2 = cell.find('td', class_='team2')
    for name in team2:
        nteam2 = name.text
        teams2.append(nteam2)
    score = cell.find('span', class_='clase')
    for name in score:
        nscore = name.text
        scores.append(nscore)
It is not clear to me how to iterate over the table so that the content of each cell is stored in its list, and it is essential to include a condition: "when the score cell is 0-0, add a 'No goals' entry to the list".
If someone could help me, I would be very grateful. Best regards
You are close to your goal, but you can simplify your script a bit.
Do not use these separate lists, just use one:
data = []
Try to get all the information in one loop (there is a td that contains all of it) and push a dict to your list:
for row in soup.select('tr.vevent .rstd'):
    teams = row.select_one('.summary').get_text().split(' - ')
    score = row.select_one('.clase').get_text()
    data.append({
        'team1':teams[0],
        'team2':teams[1],
        'score': score if score != '0-0' else 'No goals'
    })
Push your data into a DataFrame:
pd.DataFrame(data)
Example
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
url = 'https://www.resultados-futbol.com/premier/grupo1/jornada1'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser') # Selenium is needed first to expand some buttons on the page
data = []
for row in soup.select('tr.vevent .rstd'):
    teams = row.select_one('.summary').get_text().split(' - ')
    score = row.select_one('.clase').get_text()
    data.append({
        'team1':teams[0],
        'team2':teams[1],
        'score': score if score != '0-0' else 'No goals'
    })
pd.DataFrame(data)
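The answer above writes 'No goals' into the score itself. If the score should stay as it is and only the first-goal minute should be filled in, as the question describes, the condition could be sketched like this. The source does not show the CSS class of the minute cell, so the sketch works on values that have already been extracted; the tuples in extracted are made-up sample data based on the question's example, and None marks a match with no minute cell.

```python
def first_goal_label(score, minute):
    """Return the first-goal minute, or 'No goals' for a goalless match."""
    if score.replace(" ", "") == "0-0" or not minute:
        return "No goals"
    return minute


# Hypothetical rows as (team1, score, team2, minute).
extracted = [
    ("Real Madrid", "1-0", "Atletico Madrid", "17'"),
    ("Barcelona", "0-0", "Sevilla", None),
]

data = []
for team1, score, team2, minute in extracted:
    # One dict per match keeps every column the same length.
    data.append({
        "team1": team1,
        "score": score,
        "team2": team2,
        "first_goal": first_goal_label(score, minute),
    })

for row in data:
    print(row)
```

Because every match contributes exactly one dict, the resulting list can be passed to pd.DataFrame(data) without the length mismatch the separate lists caused.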
I have the following table :
<table id="sample">
  <tbody>
    <tr class="toprow">
      <td style="width:25%"></td>
      <td style="width:25%">Number of Jurisdictions</td>
      <td style="width:25%">Per cent of total</td>
    </tr>
    <tr>
      <td class="leftcol">Europe</td>
      <td class="data">44</td>
      <td class="data">29%</td>
    </tr>
  </tbody>
</table>
I am using beautifulsoup to get the content of the table :
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
# Site URL
url = "http://archive.ifrs.org/Use-around-the-world/Pages/Analysis-of-the-IFRS-jurisdictional-profiles.aspx"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse HTML code for the entire site
soup = BeautifulSoup(html_content, "lxml")
# print(soup.prettify()) # print the parsed data of html
# On site there are 2 tables with id="sample"
# The following line will generate a list of HTML content for each table
gdp = soup.find_all("table", id="sample")
print("Number of tables on site: ", len(gdp))
# Lets go ahead and scrape first table with HTML code gdp[0]
table1 = gdp[0]
# the head will form our column names
body = table1.find_all("tr")
# Head values (Column names) are the first items of the body list
head = body[0] # 0th item is the header row
body_rows = body[1:] # All other items becomes the rest of the rows
# Lets now iterate through the head HTML code and make list of clean headings
# Declare empty list to keep Columns names
headings = []
for item in head.find_all("td"): # loop through all td elements of the header row
    # convert the td elements to text and strip "\n"
    item = (item.text).rstrip("\n")
    # append the clean column name to headings
    headings.append(item)
print(headings)
I was able to get the header :
['', 'Number of Jurisdictions', 'Per cent of total']
Now I want to get the content of the cells, but I don't know how to loop through the <td> tags, since their class may be either "leftcol" or "data".
If I understand you correctly, I would simplify this a bit:
gdp = soup.select("table#sample")[0]
rows = []
cols = []
for g in gdp.select('tr.toprow'):
    for c in g.select('td'):
        cols.append(c.text)
for g in gdp.select('tr:not(.toprow)'):
    row = []
    for item in g.select('td'):
        row.append(item.text)
    rows.append(row)
pd.DataFrame(rows, columns=cols)
Or you could simplify it even more (at the cost of making it, I believe, less readable) by using list comprehensions:
cols = [c.text for g in gdp.select('tr.toprow') for c in g.select('td')]
rows = [[item.text for item in g.select('td')] for g in gdp.select('tr:not(.toprow)')]
pd.DataFrame(rows, columns=cols)
Output:
Number of Jurisdictions Per cent of total
0 Europe 44 29%
1 Africa 23 15%
2 Middle East 13 9%
3 Asia and Oceania 33 22%
4 Americas 37 25%
5 Totals 150 100%
I'm trying to scrape the data in a table on the FT website, but I cannot get my code to work. I've been through other similar questions here on Stack Overflow, and while they have helped, it's beyond my skill to get the code working.
I'm looking to scrape the table and output to a list of dicts, or a dict of dicts, which I would then put into a pandas DataFrame.
EDIT FOR CLARITY:
I want to:
1. Scrape the table
2. Strip out the HTML tags
3. Return a dict where the first cell of each row is the key, and the rest are the values of that key
So far I can do (1). I see (2) as more of a cleanup exercise that shouldn't be too hard. (3) is where I have issues. Some of the rows contain only one entry because they are section headings, but they are not marked up as such in the HTML, so the standard dict comprehensions I've seen in other answers either return an error, because there is a key with no values, or set the first entry as the key for all the rest of the data.
The table is here.
My code so far is:
from bs4 import BeautifulSoup
import urllib2
import lxml
soup = BeautifulSoup(urllib2.urlopen('http://markets.ft.com/research/Markets/Tearsheets/Financials?s=CLLN:LSE&subview=BalanceSheet').read())
table = soup.find('table', {'data-ajax-content' : 'true'})
for row in table.findAll('tr'):
    for cell in row.findAll('td'):
        print cell.findAll(text = True)
Which gets me this kind of output:
[u'Fiscal data as of Dec 31 2013']
[u'2013']
[u'2012']
[u'2011']
[u'ASSETS']
[u'Cash And Short Term Investments']
[u'416']
[u'660']
[u'495']
I have tried:
rows = iter(table)
headers = [col.text for col in next(rows)]
for row in rows:
    values = [col.text for col in row]
    print dict(zip(headers, values))
which may work, but I'm getting:
urllib2.HTTPError: HTTP Error 407: Proxy Authorization Required,
which I assume is because I'm behind a corporate proxy.
EDIT:
I tried the code above from home, and it gets this:
{u'Fiscal data as of Dec 31 2013201320122011': u'ASSETS'}
{u'Fiscal data as of Dec 31 2013201320122011': u'LIABILITIES'}
{u'Fiscal data as of Dec 31 2013201320122011': u'SHAREHOLDERS EQUITY'}
which is tantalisingly close, but has only captured the first row of each section.
Any help is greatly appreciated. I am new to python, so if you have time to explain your answer, that will also meet with my gratitude.
EDIT:
I've read around a bit more and tried a few more things:
table = soup.find('table', {'data-ajax-content' : 'true'})
rows = table.findAll('tr')
dict_for_series = {row[0]:row[1:] for row in rows}
print dict_for_series
Which results in:
{<tr><td class="label">Fiscal data as of Dec 31 2013</td><td>2013</td><td>2012</td><td>2011</td></tr>: [<tr class="section even"><td colspan="4">ASSETS</td></tr>, <tr class="odd"><td class="label">Cash And Short Term Investments</td><td>416</td><td>660</td><td>495</td></tr>, <tr class="even"><td class="label">Total Receivables, Net</td><td>1,216</td><td>1,122</td><td>1,102</td></tr>, <tr class="odd"><td class="label">Total Inventory</td><td>49</td><td>55</td><td>72</td><
In this case it seems the code sets the first entry as the key, and the rest as values.
Another attempt:
table = soup.find('table', {'data-ajax-content' : 'true'})
rows = table.findAll('tr')
d = []
for row in rows:
    d.append(row.findAll('td'))
rowsdict = {t[0]:t[1:] for t in d}
dictSer = Series(rowsdict)
dictframe = DataFrame(dictSer)
print dictframe
Which results in:
0
<td class="label">Fiscal data as of Dec 31 2013</td> [<td>2013</td>, <td>2012</td>, <td>2011</td>]
<td colspan="4">ASSETS</td> []
<td class="label">Cash And Short Term Investments</td> [<td>416</td>, <td>660</td>, <td>495</td>]
<td class="label">Total Receivables, Net</td> [<td>1,216</td>, <td>1,122</td>, <td>1,102</td>]
which is very close to what I want. The structure is almost right, but judging by the placement of the square brackets, this is treating all the values, i.e. <td>1,216</td>, as a single cell.
Anyway, I'll keep playing around and trying to make it work, but if anyone has any pointers, please let me know!
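One way to approach step (3) is to pull out the text of each cell first and then drop every row that has only one cell, since those are the section headings. The sketch below is written for Python 3 and uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs; the sample rows are the ones shown in the output above, and the class and function names are made up for illustration. With BeautifulSoup, the per-row list could instead come from [td.get_text(strip=True) for td in row.find_all('td')].

```python
from html.parser import HTMLParser

# Rows copied from the output shown in the question.
SAMPLE = """
<table data-ajax-content="true">
<tr><td class="label">Fiscal data as of Dec 31 2013</td><td>2013</td><td>2012</td><td>2011</td></tr>
<tr class="section even"><td colspan="4">ASSETS</td></tr>
<tr class="odd"><td class="label">Cash And Short Term Investments</td><td>416</td><td>660</td><td>495</td></tr>
<tr class="even"><td class="label">Total Receivables, Net</td><td>1,216</td><td>1,122</td><td>1,102</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def rows_to_dict(rows):
    """Map first cell -> remaining cells, skipping one-cell section rows."""
    return {row[0]: row[1:] for row in rows if len(row) > 1}


parser = RowCollector()
parser.feed(SAMPLE)
table = rows_to_dict(parser.rows)
for key, values in table.items():
    print(key, values)
```

The values are still strings such as '1,216'; converting them to numbers would be a separate cleanup step.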