How to scrape all topics from Twitter - Python

All Twitter topics can be found at this link.
I would like to scrape all of them, along with each subcategory inside.
BeautifulSoup doesn't seem to be useful here. I tried using Selenium, but I don't know how to match the XPaths that appear after clicking a main category.
from selenium import webdriver
from selenium.common import exceptions
url = 'https://twitter.com/i/flow/topics_selector'
driver = webdriver.Chrome('absolute path to chromedriver')
driver.get(url)
driver.maximize_window()
main_topics = driver.find_elements_by_xpath('/html/body/div[1]/div/div/div[1]/div[2]/div/div/div/div/div/div[2]/div[2]/div/div/div[2]/div[2]/div/div/div/div/span')
topics = {}
for main_topic in main_topics[2:]:
    print(main_topic.text.strip())
    topics[main_topic.text.strip()] = {}
I know I can click a main category using main_topics[3].click(), but I don't know how to, perhaps recursively, click through them until I find only the entries with Follow on the right.

To scrape all the main topics, e.g. Arts & culture, Business & finance, etc., using Selenium and Python you have to induce WebDriverWait for visibility_of_all_elements_located() and you can use either of the following locator strategies:
Using XPATH and text attribute:
driver.get("https://twitter.com/i/flow/topics_selector")
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[@role='button']/div/span")))])
Using XPATH and get_attribute():
driver.get("https://twitter.com/i/flow/topics_selector")
print([my_elem.get_attribute("textContent") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[@role='button']/div/span")))])
Console Output:
['Arts & culture', 'Business & finance', 'Careers', 'Entertainment', 'Fashion & beauty', 'Food', 'Gaming', 'Lifestyle', 'Movies and TV', 'Music', 'News', 'Outdoors', 'Science', 'Sports', 'Technology', 'Travel']
To scrape all the main and sub topics using Selenium and WebDriver you can use the following Locator Strategy:
Using XPATH and get_attribute("textContent"):
driver.get("https://twitter.com/i/flow/topics_selector")
elements = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[@role='button']/div/span")))
for element in elements:
    element.click()
print([my_elem.get_attribute("textContent") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@role='button']/div/span[text()]")))])
driver.quit()
Console Output:
['Arts & culture', 'Animation', 'Art', 'Books', 'Dance', 'Horoscope', 'Theater', 'Writing', 'Business & finance', 'Business personalities', 'Business professions', 'Cryptocurrencies', 'Careers', 'Education', 'Fields of study', 'Entertainment', 'Celebrities', 'Comedy', 'Digital creators', 'Entertainment brands', 'Podcasts', 'Popular franchises', 'Theater', 'Fashion & beauty', 'Beauty', 'Fashion', 'Food', 'Cooking', 'Cuisines', 'Gaming', 'Esports', 'Game development', 'Gaming hardware', 'Gaming personalities', 'Tabletop gaming', 'Video games', 'Lifestyle', 'Animals', 'At home', 'Collectibles', 'Family', 'Fitness', 'Unexplained phenomena', 'Movies and TV', 'Movies', 'Television', 'Music', 'Alternative', 'Bollywood music', 'C-pop', 'Classical music', 'Country music', 'Dance music', 'Electronic music', 'Hip-hop & rap', 'J-pop', 'K-hip hop', 'K-pop', 'Metal', 'Musical instruments', 'Pop', 'R&B and soul', 'Radio stations', 'Reggae', 'Reggaeton', 'Rock', 'World music', 'News', 'COVID-19', 'Local news', 'Social movements', 'Outdoors', 'Science', 'Biology', 'Sports', 'American football', 'Australian rules football', 'Auto racing', 'Baseball', 'Basketball', 'Combat Sports', 'Cricket', 'Extreme sports', 'Fantasy sports', 'Football', 'Golf', 'Gymnastics', 'Hockey', 'Lacrosse', 'Pub sports', 'Rugby', 'Sports icons', 'Sports journalists & coaches', 'Tennis', 'Track & field', 'Water sports', 'Winter sports', 'Technology', 'Computer programming', 'Cryptocurrencies', 'Data science', 'Information security', 'Operating system', 'Tech brands', 'Tech personalities', 'Travel', 'Adventure travel', 'Destinations', 'Transportation']
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
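Putting the pieces together, a minimal end-to-end sketch might look like the following (it assumes chromedriver is available on your PATH and that the on-page text the XPath anchors on has not changed; Twitter's markup is not a stable API, so treat the locators as best-effort):
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
try:
    driver.get("https://twitter.com/i/flow/topics_selector")
    # wait until the main topic buttons are rendered, then read their labels
    main_topics = WebDriverWait(driver, 20).until(
        EC.visibility_of_all_elements_located((
            By.XPATH,
            "//span[contains(., 'see top Tweets about them in your timeline')]"
            "//following::div[@role='button']/div/span"))
    )
    print([el.text for el in main_topics])
finally:
    driver.quit()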

Take a look at how XPath works. Just write '//element[@attribute="foo"]' and you don't have to spell out the whole path. Be careful: both the main topics and the sub topics (which become visible after clicking the main topics) have the same class name, and that was causing the error. Here's how I was able to click the sub topics, though I'm sure there's a better way:
I found the topics elements using:
topics = WebDriverWait(browser, 5).until(
    EC.presence_of_all_elements_located((By.XPATH, '//div[@class="css-901oao r-13gxpu9 r-1qd0xha r-1b6yd1w r-1vr29t4 r-ad9z0x r-bcqeeo r-qvutc0"]'))
)
then I created an empty list named:
main_topics = []
Then I looped through topics, appended each element's text to the main_topics list, and clicked each element to reveal its sub topics.
for topic in topics:
    main_topics.append(topic.text)
    topic.click()
Then I created a new variable called sub_topics (it now holds all the opened topics):
sub_topics = WebDriverWait(browser, 5).until(
    EC.presence_of_all_elements_located((By.XPATH, '//span[@class="css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0"]'))
)
Then I created two more lists:
subs_list = []
skip_these_words = ["Done", "Follow your favorite Topics", "You’ll see top Tweets about them in your timeline. Don’t see your favorite Topics yet? New Topics are added every week.", "Follow"]
Then I looped through sub_topics and used an if statement to append each element's text to subs_list only if it was not in the main_topics or skip_these_words lists. I did this to filter out the main topics and the unnecessary text at the top, since all of these elements share the same class name. Finally, each sub topic is clicked. This last part is confusing, so here's an example:
for sub in sub_topics:
    if sub.text not in main_topics and sub.text not in skip_these_words:
        subs_list.append(sub.text)
        sub.click()
There are also a few more hidden sub-sub topics. See if you can click the remaining sub-sub topics. Then, see if you can find the follow button element and click each one.
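As a rough sketch of that follow-up step (not tested against the live page; the locator simply matches buttons whose visible text is "Follow", which is an assumption about the markup), something like this could work, continuing from the code above:
from selenium.common.exceptions import ElementClickInterceptedException

follow_buttons = WebDriverWait(browser, 5).until(
    EC.presence_of_all_elements_located(
        (By.XPATH, '//div[@role="button"][.//span[text()="Follow"]]'))
)
for button in follow_buttons:
    try:
        button.click()
    except ElementClickInterceptedException:
        # some buttons may be covered or off-screen; skip those
        pass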

Related

Getting only a part of the page source using selenium webdriver

I'm trying to get the HTML of Billboard's top 100 chart, but I keep getting only about half of the page.
I tried getting the page source using this code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
s=Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
url = "https://www.billboard.com/charts/hot-100/"
driver.get(url)
driver.implicitly_wait(10)
print(driver.page_source)
But it always returns the page source only from the 53rd song on the chart (I've tried increasing the implicit wait and nothing changed)
Not sure why you are getting 53 elements. Using explicit waits I am getting all 100 elements.
Use WebDriverWait() as an explicit wait.
driver.get('https://www.billboard.com/charts/hot-100/')
elements=WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "ul.lrv-a-unstyle-list h3#title-of-a-story")))
print(len(elements))
print([item.text for item in elements])
You need to import the libraries below.
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
Output:
100
['Anti-Hero', 'Lavender Haze', 'Maroon', 'Snow On The Beach', 'Midnight Rain', 'Bejeweled', 'Question...?', "You're On Your Own, Kid", 'Karma', 'Vigilante Shit', 'Unholy', 'Bad Habit', 'Mastermind', 'Labyrinth', 'Sweet Nothing', 'As It Was', 'I Like You (A Happier Song)', "I Ain't Worried", 'You Proof', "Would've, Could've, Should've", 'Bigger Than The Whole Sky', 'Super Freaky Girl', 'Sunroof', "I'm Good (Blue)", 'Under The Influence', 'The Great War', 'Vegas', 'Something In The Orange', 'Wasted On You', 'Jimmy Cooks', 'Wait For U', 'Paris', 'High Infidelity', 'Tomorrow 2', 'Titi Me Pregunto', 'About Damn Time', 'The Kind Of Love We Make', 'Late Night Talking', 'Cuff It', 'She Had Me At Heads Carolina', 'Glitch', 'Me Porto Bonito', 'Die For You', 'California Breeze', 'Dear Reader', 'Forever', 'Hold Me Closer', 'Just Wanna Rock', '5 Foot 9', 'Unstoppable', 'Thank God', 'Fall In Love', 'Rock And A Hard Place', 'Golden Hour', 'Heyy', 'Half Of Me', 'Until I Found You', 'Victoria’s Secret', 'Son Of A Sinner', "Star Walkin' (League Of Legends Worlds Anthem)", 'Romantic Homicide', 'What My World Spins Around', 'Real Spill', "Don't Come Lookin'", 'Monotonia', 'Stand On It', 'Never Hating', 'Wishful Drinking', 'No Se Va', 'Free Mind', 'Not Finished', 'Music For A Sushi Restaurant', 'Pop Out', 'Staying Alive', 'Poland', 'Whiskey On You', 'Billie Eilish.', 'Betty (Get Money)', '2 Be Loved (Am I Ready)', 'Wait In The Truck', 'All Mine', 'La Bachata', 'Last Last', 'Glimpse Of Us', 'Freestyle', 'Calm Down', 'Gatubela', 'Evergreen', 'Pick Me Up', 'Gotta Move On', 'Perfect Timing', 'Country On', 'Snap', 'She Likes It', 'Made You Look', 'Dark Red', 'From Now On', 'Forget Me', 'Miss You', 'Despecha']
driver.implicitly_wait(10) is not a pause command!
It has no effect on the next line print(driver.page_source).
If you want to wait for the page to be completely loaded, you can wait for some specific element there to become visible. Use WebDriverWait with expected_conditions for that, or just add a hardcoded pause time.sleep(10) instead of driver.implicitly_wait(10).
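A minimal sketch of both options (the CSS selector is borrowed from the answer above; adjust it if Billboard changes its markup):
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.billboard.com/charts/hot-100/")

# Option 1: explicit wait until the chart entries are present,
# then read the fully rendered page source.
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located(
        (By.CSS_SELECTOR, "ul.lrv-a-unstyle-list h3#title-of-a-story"))
)
html = driver.page_source

# Option 2 (cruder): a hardcoded pause instead of an explicit wait.
# time.sleep(10)
# html = driver.page_source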

Selenium: No Such element exception when there is an element in the page

I have been trying to get the names of the batsmen from the page but Selenium is throwing
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":".table-body__cell rankings-table__name name"}.
I am not able to figure out why this is happening, as I am blatantly copy-pasting the class name. I have tried the implicitly_wait function, but nothing happens with that either. Can someone please help me out with this?
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import time
import pandas as pd
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
PATH = "C:\Program Files (x86)\chromedriver"
driver = webdriver.Chrome(PATH)
driver.get("https://www.icc-cricket.com/rankings/mens/player-rankings/odi/batting")
driver.implicitly_wait(20)
elements = driver.find_element_by_class_name("table-body__cell rankings-table__name name")
driver.quit()
To extract the names of the batsmen from the webpage you need to induce WebDriverWait for visibility_of_all_elements_located() and you can use either of the following locator strategies:
Using CSS_SELECTOR:
driver.get('https://www.icc-cricket.com/rankings/mens/player-rankings/odi/batting')
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "td.table-body__cell.rankings-table__name.name a[href^='/rankings/mens/player-rankings']")))])
Using XPATH:
driver.get('https://www.icc-cricket.com/rankings/mens/player-rankings/odi/batting')
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//td[@class='table-body__cell rankings-table__name name']//a[starts-with(@href, '/rankings/mens/player-rankings')]")))])
Console Output:
['Virat Kohli', 'Rohit Sharma', 'Ross Taylor', 'Quinton de Kock', 'Aaron Finch', 'Jonny Bairstow', 'David Warner', 'Kane Williamson', 'Rassie van der Dussen', 'Shai Hope', 'Fakhar Zaman', 'Joe Root', 'Mushfiqur Rahim', 'Shikhar Dhawan', 'Imam-ul-Haq', 'Martin Guptill', 'Steve Smith', 'Jason Roy', 'Glenn Maxwell', 'Tamim Iqbal', 'Alex Carey', 'Paul Stirling', 'Ben Stokes', 'Eoin Morgan', 'Shakib Al Hasan', 'Lokesh Rahul', 'Tom Latham', 'David Miller', 'Jos Buttler', 'Shimron Hetmyer', 'Janneman Malan', 'Haris Sohail', 'Henry Nicholls', 'Nicholas Pooran', 'Rahmat Shah', 'Andrew Balbirnie', 'Kyle Coetzer', 'Aqib Ilyas', 'Mahmudullah', 'Brendan Taylor', 'Avishka Fernando', 'Sean Williams', 'Litton Das', 'Evin Lewis', 'Kariyawasa Asalanka', 'Harry Tector', 'Kedar Jadhav', 'Kusal Perera', 'Hardik Pandya', 'Sikandar Raza', 'Angelo Mathews', 'Hashmatullah Shaidi', 'Imad Wasim', 'Danushka Gunathilaka', 'Soumya Sarkar', 'Marcus Stoinis', 'Najibullah Zadran', 'Temba Bavuma', 'Kusal Mendis', 'Colin de Grandhomme', 'Sarfaraz Ahmed', 'Calum MacLeod', 'Marnus Labuschagne', 'Jimmy Neesham', 'Mitchell Marsh', 'Niroshan Dickwella', 'Mohammad Nabi', 'Richard Berrington', 'Craig Ervine', 'Aiden Markram', 'Assad Vala', 'William Porterfield', 'Asghar Afghan', 'Muhammad Usman', 'Heinrich Klaasen', 'Shreyas Iyer', 'Dhananjaya de Silva', 'Dasun Shanaka', 'George Munsey', 'Mitchell Santner', 'Rishabh Pant', 'Jason Holder', 'Mithun Ali', 'Andile Phehlukwayo', 'Rashid Khan', 'Ravindra Jadeja', 'Matthew Wade', 'Lahiru Thirimanne', 'Mohammad Rizwan', 'Dinesh Chandimal', 'Rovman Powell', 'Kieron Pollard', 'Zeeshan Maqsood', 'Gulbadin Naib', 'Matthew Cross', 'Moeen Ali', 'Curtis Campher', 'Chris Woakes', 'Scott Edwards']
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
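For context, the original error comes from passing a compound class value ("table-body__cell rankings-table__name name" is three separate classes) to find_element_by_class_name, which only accepts a single class name. Joining the classes with dots in a CSS selector avoids this; a minimal sketch using the class names from the question (the site's structure may of course change):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.icc-cricket.com/rankings/mens/player-rankings/odi/batting")

# the compound class value becomes td.table-body__cell.rankings-table__name.name
cells = WebDriverWait(driver, 20).until(
    EC.visibility_of_all_elements_located(
        (By.CSS_SELECTOR, "td.table-body__cell.rankings-table__name.name"))
)
print([cell.text for cell in cells])
driver.quit()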

match one or more specific strings in a list

I would like to have a function that works whether there is one word or several words before 'in' in each string of a list. For example, I have a list called single_Word that consists of four strings:
single_Word = ['news in media', 'car in automobile', 'email in technology', 'painting in art']
I would like to extract the 1st word (or basically any string before 'in'), so it returns the following output:
['news', 'car', 'email', 'painting']
I have the following code that shows what I intended to do:
text_list = []
for text in single_Word:
    x = text.split()
    text_list.append(x[0])
print(text_list)
# ['news', 'car', 'email', 'painting']
which is fine for me and works as expected, but once I have another list whose entries have more than a single word before 'in', it fails to catch that. I know the main reason is x[0], which returns only the first element, but how can I change this so it matches more than one word (or any string before 'in')? The following are the lists that I would match on.
two_Word = ['news online in media', 'car insurance in automobile', 'email account in technology', 'painting ideas in art']
three_Word = ['news online live in media', 'car insurance online in automobile', 'email account settings in technology', 'painting ideas pinterest in art']
The desired output for the 2nd and 3rd lists:
['news online', 'car insurance', 'email account', 'painting ideas']
['news online live', 'car insurance online', 'email account settings', 'painting ideas pinterest']
Thank you @Ma0, it works by using split(' in ', 1):
text_list = []
for text in three_Word:
    x = text.split(' in ', 1)[0]
    text_list.append(x)
print(text_list)
# ['news', 'car', 'email', 'painting']
# ['news online', 'car insurance', 'email account', 'painting ideas']
# ['news online live', 'car insurance online', 'email account settings', 'painting ideas pinterest']
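The same approach as a one-line list comprehension (just a compact rewrite of split(' in ', 1), nothing new):
single_Word = ['news in media', 'car in automobile', 'email in technology', 'painting in art']
two_Word = ['news online in media', 'car insurance in automobile', 'email account in technology', 'painting ideas in art']

# split each string once on the delimiter ' in ' and keep the part before it
print([text.split(' in ', 1)[0] for text in single_Word])  # ['news', 'car', 'email', 'painting']
print([text.split(' in ', 1)[0] for text in two_Word])     # ['news online', 'car insurance', 'email account', 'painting ideas']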

Getting "StaleElementReferenceException" when accessing options in drop down menu

In Python and Selenium, I'm populating a form, submitting it, then scraping the resulting multi-page table that appears on the page underneath the form. After I scrape every page of this table, I reset the form and attempt to repopulate the form. However, a drop down menu is tripping up the code.
I've tried to make the driver wait for the drop down menu to reappear after I reset the form, but this doesn't help. I still receive the StaleElementReferenceException error on the if option.text == state line:
StaleElementReferenceException: Message: The element reference of <option> is stale; either the element is no longer attached to the DOM, it is not in the current frame context, or the document has been refreshed
How do I submit the form over and over for different options within the drop down menu?
states = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
'Colorado', 'Connecticut', 'Delaware', 'District of Columbia',
'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana',
'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland',
'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri',
'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey',
'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio',
'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina',
'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia',
'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']
# Construct browser and link
browser = webdriver.Firefox(executable_path='/usr/local/bin/geckodriver')
url = 'https://myaccount.rid.org/Public/Search/Member.aspx'
ignored_exceptions = (StaleElementReferenceException,)
# Navigate to link
browser.get(url)
try:
    # For each state
    for state in states:
        print('Searching ' + state)
        # Set category and select state menu
        category = Select(browser.find_element_by_name('ctl00$FormContentPlaceHolder$Panel$categoryDropDownList'))
        category.select_by_value('a027b6c0-07bb-4301-b9b5-1b38dcdc59b6')
        state_menu = Select(WebDriverWait(browser, 10, ignored_exceptions=ignored_exceptions).until(EC.presence_of_element_located((By.ID, 'FormContentPlaceHolder_Panel_stateDropDownList'))))
        options = state_menu.options
        for option in options:
            if option.text == state:
                state_menu.select_by_value(option.get_attribute('value'))
                browser.find_element_by_name('ctl00$FormContentPlaceHolder$Panel$searchButtonStrip$searchButton').click()
                # Scrape the first page of results
                results = []
                curr_page = 1
                onFirstPage = True
                scrape_page(curr_page)
                # Reset form
                browser.find_element_by_name('ctl00$FormContentPlaceHolder$Panel$searchButtonStrip$resetButton').click()
                break
finally:
    pass
The moment you select the option, the element references are updated and you can't use the older references. The reason you are getting the exception is that you are trying to get the attribute from an option reference that is no longer valid.
Rather than iterating over the options, I would use an XPath to select the option as shown below:
state_menu = WebDriverWait(browser, 10, ignored_exceptions=ignored_exceptions).until(EC.presence_of_element_located((By.ID, 'FormContentPlaceHolder_Panel_stateDropDownList')))
#options = state_menu.options <== replace this line with below line
option = state_menu.find_element_by_xpath("//option[.='" + state + "']")
#for option in options: <== remove this line
# if option.text == state: <== remove this
option.click()
browser.find_element_by_name('ctl00$FormContentPlaceHolder$Panel$searchButtonStrip$searchButton').click()
# Scrape the first page of results
results = []
curr_page = 1
onFirstPage = True
scrape_page(curr_page)
# Reset form
browser.find_element_by_name('ctl00$FormContentPlaceHolder$Panel$searchButtonStrip$resetButton').click()
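An alternative sketch (not part of the answer above, just Selenium's standard Select API): re-locate the dropdown on every iteration and use select_by_visible_text, so no option reference is held long enough to go stale:
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

for state in states:
    # re-locate the dropdown on each pass so we never reuse a stale reference
    state_menu = Select(WebDriverWait(browser, 10).until(
        EC.presence_of_element_located(
            (By.ID, 'FormContentPlaceHolder_Panel_stateDropDownList'))))
    state_menu.select_by_visible_text(state)
    browser.find_element_by_name(
        'ctl00$FormContentPlaceHolder$Panel$searchButtonStrip$searchButton').click()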

Parse HTML table data with BeautifulSoup into a dict

I am trying to use BeautifulSoup to parse the information stored in an HTML table and store it into a dict. I've been able to get to the table, and iterate through the values, but there is still a lot of junk in the table that I'm not sure how to take care of.
import requests
from bs4 import BeautifulSoup

# load the HTML file
r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, "html.parser")
# navigate to the item attributes table
table = soup.find('div', 'itemAttr')
# iterate through the attribute information
attr = []
for i in table.findAll("tr"):
    attr.append(i.text.strip().replace('\t', ''))
With this method, this is what the data looks like. As you can see, there is a lot of junk in there, and some lines contain multiple items, like Year and VIN.
[u'Condition:\nUsed',
u'Seller Notes:\n\u201cExcellent Condition\u201d',
u'Year: \n\n2015\n\n VIN (Vehicle Identification Number): \n\n2G1FJ1EW2F9192023',
u'Mileage: \n\n29,000\n\n Transmission: \n\nManual',
u'Make: \n\nChevrolet\n\n Body Type: \n\nCoupe',
u'Model: \n\nCamaro\n\n Warranty: \n\nVehicle has an existing warranty',
u'Trim: \n\nSS Coupe 2-Door\n\n Vehicle Title: \n\nClear',
u'Engine: \n\n6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated\n\n Options: \n\nLeather Seats',
u'Drive Type: \n\nRWD\n\n Safety Features: \n\nAnti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags',
u'Power Options: \n\nAir Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats\n\n Sub Model: \n\n1LE',
u'Fuel Type: \n\nGasoline\n\n Color: \n\nWhite',
u'For Sale By: \n\nPrivate Seller\n\n Interior Color: \n\nBlack',
u'Disability Equipped: \n\nNo\n\n Number of Cylinders: \n\n8',
u'']
Ultimately, I want the data to be stored in a dictionary like below. I know how to create a dictionary, but don't know how to clean up the data that needs to go into the dictionary without brute force find-and-replace.
{'Condition' : 'Used',
'Seller Notes' : 'Excellent Condition',
'Year': '2015',
'VIN (Vehicle Identification Number)': '2G1FJ1EW2F9192023',
'Mileage': '29,000',
'Transmission': 'Manual',
'Make': 'Chevrolet',
'Body Type': 'Coupe',
'Model': 'Camaro',
'Warranty': 'Vehicle has an existing warranty',
'Trim': 'SS Coupe 2-Door',
'Vehicle Title' : 'Clear',
'Engine': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated',
'Options': 'Leather Seats',
'Drive Type': 'RWD',
'Safety Features' : 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags',
'Power Options' : 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats',
'Sub Model' : '1LE',
'Fuel Type' : 'Gasoline',
'Exterior Color' : 'White',
'For Sale By' : 'Private Seller',
'Interior Color' : 'Black',
'Disability Equipped' : 'No',
'Number of Cylinders': '8'}
Rather than trying to parse out the data from the tr elements, a better approach would be to iterate over the td.attrLabels data elements. You can use these labels as the key, and then use the adjacent sibling elements as the value.
In the example below, the CSS selector div.itemAttr td.attrLabels is used to select all td elements with .attrLabels classes that are descendants of the div.itemAttr. From there, the method .find_next_sibling() is used to find the adjacent sibling element.
r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, 'lxml')
data = []
for label in soup.select('div.itemAttr td.attrLabels'):
    data.append({ label.text.strip(): label.find_next_sibling().text.strip() })
Output:
> [{'Year:': '2015'}, {'VIN (Vehicle Identification Number):': '2G1FJ1EW2F9192023'}, {'Mileage:': '29,000'}, {'Transmission:': 'Manual'}, {'Make:': 'Chevrolet'}, {'Body Type:': 'Coupe'}, {'Model:': 'Camaro'}, {'Warranty:': 'Vehicle has an existing warranty'}, {'Trim:': 'SS Coupe 2-Door'}, {'Vehicle Title:': 'Clear'}, {'Engine:': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated'}, {'Options:': 'Leather Seats'}, {'Drive Type:': 'RWD'}, {'Safety Features:': 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags'}, {'Power Options:': 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats'}, {'Sub Model:': '1LE'}, {'Fuel Type:': 'Gasoline'}, {'Exterior Color:': 'White'}, {'For Sale By:': 'Private Seller'}, {'Interior Color:': 'Black'}, {'Disability Equipped:': 'No'}, {'Number of Cylinders:': '8'}]
If you also want to retrieve the table header th elements, then you could select the table element and then use the CSS selector th, td.attrLabels in order to retrieve both labels:
r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('div', 'itemAttr')
data = []
for label in table.select('th, td.attrLabels'):
    data.append({ label.text.strip(): label.find_next_sibling().text.strip() })
Output:
> [{'Condition:': 'Used'}, {'Seller Notes:': '“Excellent Condition”'}, {'Year:': '2015'}, {'VIN (Vehicle Identification Number):': '2G1FJ1EW2F9192023'}, {'Mileage:': '29,000'}, {'Transmission:': 'Manual'}, {'Make:': 'Chevrolet'}, {'Body Type:': 'Coupe'}, {'Model:': 'Camaro'}, {'Warranty:': 'Vehicle has an existing warranty'}, {'Trim:': 'SS Coupe 2-Door'}, {'Vehicle Title:': 'Clear'}, {'Engine:': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated'}, {'Options:': 'Leather Seats'}, {'Drive Type:': 'RWD'}, {'Safety Features:': 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags'}, {'Power Options:': 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats'}, {'Sub Model:': '1LE'}, {'Fuel Type:': 'Gasoline'}, {'Exterior Color:': 'White'}, {'For Sale By:': 'Private Seller'}, {'Interior Color:': 'Black'}, {'Disability Equipped:': 'No'}, {'Number of Cylinders:': '8'}]
If you want to strip out non-alphanumeric character(s) for the keys, then you could use:
import re

r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('div', 'itemAttr')
data = []
for label in table.select('th, td.attrLabels'):
    key = re.sub(r'\W+', '', label.text.strip())
    value = label.find_next_sibling().text.strip()
    data.append({ key: value })
Output:
> [{'Condition': 'Used'}, {'SellerNotes': '“Excellent Condition”'}, {'Year': '2015'}, {'VINVehicleIdentificationNumber': '2G1FJ1EW2F9192023'}, {'Mileage': '29,000'}, {'Transmission': 'Manual'}, {'Make': 'Chevrolet'}, {'BodyType': 'Coupe'}, {'Model': 'Camaro'}, {'Warranty': 'Vehicle has an existing warranty'}, {'Trim': 'SS Coupe 2-Door'}, {'VehicleTitle': 'Clear'}, {'Engine': '6.2L 6162CC 376Cu. In. V8 GAS OHV Naturally Aspirated'}, {'Options': 'Leather Seats'}, {'DriveType': 'RWD'}, {'SafetyFeatures': 'Anti-Lock Brakes, Driver Airbag, Passenger Airbag, Side Airbags'}, {'PowerOptions': 'Air Conditioning, Cruise Control, Power Locks, Power Windows, Power Seats'}, {'SubModel': '1LE'}, {'FuelType': 'Gasoline'}, {'ExteriorColor': 'White'}, {'ForSaleBy': 'Private Seller'}, {'InteriorColor': 'Black'}, {'DisabilityEquipped': 'No'}, {'NumberofCylinders': '8'}]
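Since the desired result in the question is a single dictionary rather than a list of one-entry dicts, a small variation (a sketch reusing the same selectors) can collect everything into one dict and drop the trailing colon from each key:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ebay.com/itm/222378225962")
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('div', 'itemAttr')

# one dict: label text (minus the trailing ':') -> adjacent cell text
item = {
    label.text.strip().rstrip(':'): label.find_next_sibling().text.strip()
    for label in table.select('th, td.attrLabels')
}
print(item.get('Condition'))  # 'Used'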
