When I tried to extract the table from investing.com's historical data page, I couldn't retrieve the column items and the error below occurred:
Traceback (most recent call last):
File "<ipython-input-119-ba739f477693>", line 2, in <module>
col = row.find_elements(By.TAG_NAME, "td")[0]
IndexError: list index out of range
rows = table_id.find_elements(By.TAG_NAME, "tr")
This successfully extracted all the rows of the table, but when I tried to loop over each row and print the first column item, the error above occurred.
I further checked that the first column item in the second row is the following web element:
"selenium.webdriver.remote.webelement.WebElement (session="760e711c6189c07f6986103c1374ce13", element="0.5172974513607607-24")"
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
browser = webdriver.Chrome(executable_path=r'chromedriver.exe')
browser.get("https://www.investing.com/commodities/brent-oil-historical-data")
table_id = browser.find_element(By.XPATH, '//*[@id="curr_table"]')
rows = table_id.find_elements(By.TAG_NAME, "tr")
for row in rows:
    col = row.find_elements(By.TAG_NAME, "td")[0]  # e.g. get the first col
    print(col.text)
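(For reference: the likely cause is that the first tr of the table is the header row, which contains th cells and no td cells, so find_elements(By.TAG_NAME, "td") returns an empty list there and indexing [0] raises IndexError. A minimal guard, continuing from the rows variable and By import above:)
for row in rows:
    cols = row.find_elements(By.TAG_NAME, "td")
    if not cols:  # header rows hold <th> cells, so the <td> list is empty
        continue
    print(cols[0].text)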
You can use the below way of fetching the elements (shown here in Java); I have just changed the way the table and rows are fetched, targeting the tbody so the header row is skipped:
WebElement table_id = driver.findElement(By.xpath("//table[@id='curr_table']/tbody"));
List<WebElement> rows = table_id.findElements(By.tagName("tr"));
for (WebElement row : rows){
    List<WebElement> col = row.findElements(By.tagName("td"));
    System.out.println(col.get(0).getText());
}
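A rough Python equivalent of the same idea, since the question uses the Python bindings (limiting the search to the tbody skips the header row that triggers the IndexError):
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome(executable_path=r'chromedriver.exe')
browser.get("https://www.investing.com/commodities/brent-oil-historical-data")

# Only data rows live inside the tbody, so every row here has <td> cells
table_id = browser.find_element(By.XPATH, "//table[@id='curr_table']/tbody")
for row in table_id.find_elements(By.TAG_NAME, "tr"):
    cols = row.find_elements(By.TAG_NAME, "td")
    print(cols[0].text)  # first column of each data row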
I've hit a brick wall with my Scrapy scrape of an HTML table. Basically, I have a piece of code that works by first assigning column names as objects, using them as keys, and then appending corresponding XPath entries to a separate object. These are then put into a pandas dataframe and ultimately converted into a csv for final use.
import scrapy
from scrapy.selector import Selector
import re
import pandas as pd
class PostSpider(scrapy.Spider):
    name = "standard_squads"
    start_urls = [
        "https://fbref.com/en/comps/11/stats/Serie-A-Stats",
    ]

    def parse(self, response):
        column_index = 1
        columns = {}
        for column_node in response.xpath('//*[@id="stats_standard_squads"]/thead/tr[2]/th'):
            column_name = column_node.xpath("./text()").extract_first()
            print("column name is: " + column_name)
            columns[column_name] = column_index
            column_index += 1

        matches = []
        for row in response.xpath('//*[@id="stats_standard_squads"]/tbody/tr'):
            match = {}
            for column_name in columns.keys():
                if column_name == 'Squad':
                    match[column_name] = row.xpath('th/a/text()').extract_first()
                else:
                    match[column_name] = row.xpath(
                        "./td[{index}]//text()".format(index=columns[column_name] - 1)
                    ).extract_first()
            matches.append(match)

        print(matches)
        df = pd.DataFrame(matches, columns=columns.keys())
        yield df.to_csv("test_squads.csv", sep=",", index=False)
However, I just realised that the column header names in the XPath response (//*[@id="stats_standard_squads"]/thead/tr[2]/th) actually contain duplicates (for example, xG appears twice in the table on the page, as does xA). Because of this, when I loop through columns.keys() the duplicates are tossed away, so I only end up with 20 columns in the final csv instead of 25.
I'm not sure what to do now. I've tried adding the column names to a list, adding them as dataframe headers, and then appending a new row each time, but it seems to be a lot of boilerplate. I was hoping there might be a simpler solution for this automated scrape that allows for duplicate names in a pandas dataframe column.
What about reading the column names into a list and adding suffixes to the duplicates:
def parse(self, response):
    columns = []
    for column_node in response.xpath('//*[@id="stats_standard_squads"]/thead/tr[2]/th'):
        column_name = column_node.xpath("./text()").extract_first()
        columns.append(column_name)

    matches = []
    for row in response.xpath('//*[@id="stats_standard_squads"]/tbody/tr'):
        match = {}
        suffixes = {}
        for column_index, column_name in enumerate(columns):
            # Build a unique name for the current column
            if column_name not in suffixes:
                suffixes[column_name] = 1
                df_name = column_name  # no suffix for the first occurrence
            else:
                suffixes[column_name] += 1
                df_name = f'{column_name}_{suffixes[column_name]}'
            if column_name == 'Squad':
                match[df_name] = row.xpath('th/a/text()').extract_first()
            else:
                match[df_name] = row.xpath(
                    "./td[{index}]//text()".format(index=column_index)
                ).extract_first()
        matches.append(match)

    print(matches)
    df = pd.DataFrame(matches)  # the dict keys already carry the suffixed names
    yield df.to_csv("test_squads.csv", sep=",", index=False)
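As an aside, if duplicate labels are acceptable downstream, pandas itself allows duplicate column names when the frame is built from row lists rather than dicts. A minimal sketch with made-up values:
import pandas as pd

columns = ["Squad", "xG", "xA", "xG", "xA"]     # raw headers, duplicates kept
rows = [["Milan", "1.2", "0.9", "1.1", "0.8"]]  # illustrative values only
df = pd.DataFrame(rows, columns=columns)        # duplicate labels are preserved
df.to_csv("test_squads.csv", sep=",", index=False)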
I'm scraping the worldometers home page to pull the data in the table with Python, but I am struggling because the numeric values aren't pulling in correctly. (The strings are fine: the Country column pulls USA, Spain, Italy, ... correctly.)
import requests
import lxml.html as lh
import pandas as pd
from tabulate import tabulate
url="https://www.worldometers.info/coronavirus/"
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')
#Create empty list
col=[]
colLen = len(tr_elements[1])
i=0
#For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d:"%s"' % (i, name))
    col.append((name, []))
print(colLen)

#Since our first row is the header, data is stored from the second row onwards
for j in range(1, len(tr_elements)):
    #T is our j'th row
    T = tr_elements[j]
    if len(T) != len(tr_elements[0]): break
    #i is the index of our column
    i = 0
    #Iterate through each element of the row
    for t in T.iterchildren():
        data = t.text_content()
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i += 1

Dict = {title: column for (title, column) in col}
df = pd.DataFrame(Dict)
df.head()
#Print Total Cases Col (this is incorrect when comparing to the webpage)
print(col[1][0:])
#Print Country Col (this is correct)
print(col[0][0:])
I can't seem to figure out what the issue is. Please help me solve it. I'm also open to suggestions for doing this another way :)
Screenshots: the data table on the webpage; command prompt output for Country (correct); command prompt output for Total Cases (incorrect).
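A possible culprit: doc.xpath('//tr') matches rows from every table on the page, so cells from unrelated tables can shift the columns. A sketch that scopes the query to a single table first (the id main_table_countries_today is an assumption about the page's markup, so verify it in the browser inspector):
import requests
import lxml.html as lh

page = requests.get("https://www.worldometers.info/coronavirus/")
doc = lh.fromstring(page.content)
# Scope the row query to one table instead of the whole document;
# the table id here is an assumption -- check it in the inspector
tr_elements = doc.xpath('//table[@id="main_table_countries_today"]//tr')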
I am trying to scrape a bunch of tables from one web page. With the code below I can get one table and the output shows correctly with pandas, but I cannot get more than one table at a time.
import bs4 as bs
import urllib.request
import pandas as pd
source = urllib.request.urlopen('https://www.URLHERE.com').read()
soup = bs.BeautifulSoup(source, 'lxml')
tables = soup.select('table')[-1]
rows = tables.find_all('tr')
output = []
for row in rows:
    cols = row.find_all('td')
    cols = [item.text.strip() for item in cols]
    output.append([item for item in cols if item])
df = pd.DataFrame(output, columns = ['1','2', '3', '4', '5', '6'])
df = df.iloc[1:]
print(df)
If I remove the [-1] from my tables variable, then I get the error below.
AttributeError: 'list' object has no attribute 'find_all'
What do I need to change to get all the tables off the page?
You're on the right track already. As a commenter said, you'll need to find_all the tables, then apply the row logic you are already using to each table in a loop instead of to a single table. Your code will look something like this:
tables = soup.find_all('table')
for table in tables:
    # individual table logic here
    rows = table.find_all('tr')
    for row in rows:
        # individual row logic here
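Filled out with the row logic from the question, that skeleton might look like this (one DataFrame per table; the URL placeholder is carried over from the question):
import bs4 as bs
import urllib.request
import pandas as pd

source = urllib.request.urlopen('https://www.URLHERE.com').read()
soup = bs.BeautifulSoup(source, 'lxml')

dataframes = []
for table in soup.find_all('table'):
    output = []
    for row in table.find_all('tr'):
        cols = [td.text.strip() for td in row.find_all('td')]
        if cols:  # skip header rows, which have no <td> cells
            output.append(cols)
    dataframes.append(pd.DataFrame(output))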
I took a better look at that, and here is the sample code that I tested:
source = urllib.request.urlopen('URL').read()
soup = bs.BeautifulSoup(source, 'lxml')

tables = soup.select('table')
print("I found " + str(len(tables)) + " tables.")

all_rows = []
for table in tables:
    print("Searching for <tr> items...")
    rows = table.find_all('tr')
    print("Found " + str(len(rows)) + " rows.")
    for row in rows:
        all_rows.append(row)

print("In total I have got " + str(len(all_rows)) + " rows.")
# example of first row
print(all_rows[0])
A little explanation: the problem with the AttributeError when you removed [-1] was that the tables variable was a list object, and a list doesn't have a find_all method.
Your approach with [-1] is okay; I assume you know that [-1] grabs the last item from a list. You just have to do the same thing for all the elements, which is shown in the code above.
You might find it interesting to read about the for construct and iterables in Python: https://pythonfordatascience.org/for-loops-and-iterations-python/
Well, if you want to extract all the different tables present on a web page in one go, you should try:
tables = pd.read_html("<URL_HERE>")
tables will be a list of dataframes, one for each table present on that page.
For more specific documentation, refer to the Pandas documentation.
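For example (the URL placeholder is carried over from the question; read_html needs lxml or html5lib installed):
import pandas as pd

tables = pd.read_html("https://www.URLHERE.com")  # list of DataFrames, one per table
print("Found", len(tables), "tables")
print(tables[-1].head())  # same table the original soup.select('table')[-1] grabbed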
Link to website: http://www.tennisabstract.com/cgi-bin/player-classic.cgi?p=RafaelNadal
I am trying to write code which goes through each row in a table and extracts each element from that row.
I am aiming for output in the following layout:
Row1Element1, Row1Element2, Row1Element3
Row2Element1, Row2Element2, Row2Element3
Row3Element1, Row3Element2, Row3Element3
I have had two major attempts at coding this.
Attempt 1:
rows = driver.find_elements_by_xpath('//table//body//tr')
elements = rows.find_elements_by_xpath('//td')
# this gets all the rows in the table, but then gets all the elements
# on the page, not just those in the table
Attempt 2:
driver.find_elements_by_xpath('//table//body//tr//td')
# this gets all the elements that I want, but makes no distinction as to
# which row each element belongs to
Any help is appreciated
You can get the table headers and use their indexes to pull the right cells from each row.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://www.tennisabstract.com/cgi-bin/player-classic.cgi?p=RafaelNadal")
table_headers = [th.text.strip() for th in driver.find_elements_by_css_selector("#matchheader th")]
rows = driver.find_elements_by_css_selector("#matches tbody > tr")
date_index = table_headers.index("Date")
tournament_index = table_headers.index("Tournament")
score_index = table_headers.index("Score")
for row in rows:
    table_data = row.find_elements_by_tag_name("td")
    print(table_data[date_index].text, table_data[tournament_index].text, table_data[score_index].text)
This is the locator for each row of the table you mean:
XPath: //table[@id="matches"]//tbody//tr
First, the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
Each row:
driver.get('http://www.tennisabstract.com/cgi-bin/player-classic.cgi?p=RafaelNadal')
rows = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, '//table[#id="matches"]//tbody//tr')))
for row in rows:
    print(row.text)
Or each cell:
for row in rows:
    cols = row.find_elements_by_tag_name('td')
    for col in cols:
        print(col.text)
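To get the comma-separated layout shown in the question, the cell loop can join the texts per row (same rows variable as above):
for row in rows:
    cols = row.find_elements_by_tag_name('td')
    print(', '.join(col.text for col in cols))  # Row1Element1, Row1Element2, ...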
I'm using BeautifulSoup to pull the elements of an HTML table into a Python dict. The problem I'm having is that when I create the dict, the first record from the table is repeatedly loaded into it. Printing the variable rows shows the expected number of different records returned in the response, but only the first record is printed when print(d) is called.
import requests
from bs4 import BeautifulSoup as bs
url = 'http://host.com/user_activity?page=3'
r = requests.get(url)
#print(r.text)
soup = bs(r.text, 'lxml')
table = soup.find_all('table')[0]
rows = table.find_all('td')
#records = soup.find_all('td')
#print(table.prettify())
ct=0
for record in rows:
    if ct < 20:
        keys = [th.get_text(strip=True) for th in table.find_all('th')]
        values = [td.get_text(strip=True) for td in rows]
        d = dict(zip(keys, values))
        print(d)
    ct += 1
I think you meant to get the header cells from the first row of the table (once, before the loop) and iterate over the tr elements instead of td.
You can also use a regular find() instead of find_all()[0] and enumerate() to handle the loop increment variable more nicely:
table = soup.find('table')
rows = table.find_all('tr')
headers = [th.get_text(strip=True) for th in rows[0].find_all('th')]
for ct, row in enumerate(rows[1:]):
    values = [td.get_text(strip=True) for td in row.find_all('td')]
    d = dict(zip(headers, values))
    print(d)
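If you want to keep all the rows rather than just printing each dict, the same loop can collect them into a list (a small extension of the snippet above, reusing its rows and headers variables):
records = []
for row in rows[1:]:
    values = [td.get_text(strip=True) for td in row.find_all('td')]
    records.append(dict(zip(headers, values)))
# records now holds one dict per data row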
Apart from what alecxe has already shown, you can also do it like this using selectors. Just make sure the table index is accurate, i.e. whether it's the first table, the second table, or another one you want to parse.
table = soup.select("table")[0]  # be sure to put the correct index here
for items in table.select("tr"):
    data = ' '.join([item.text for item in items.select("th,td")])
    print(data)