Getting only a part of the page source using selenium webdriver

Getting only a part of the page source using selenium webdriver - python

I'm trying to get the HTML of Billboard's top 100 chart, but I keep getting only about half of the page.
I tried getting the page source using this code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
s=Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
url = "https://www.billboard.com/charts/hot-100/"
driver.get(url)
driver.implicitly_wait(10)
print(driver.page_source)
But it always returns the page source only from the 53rd song on the chart (I've tried increasing the implicit wait and nothing changed)

Not sure why you are getting 53 elements. Using explicit waits I am getting all 100 elements.
User webdriverwait() as an explicit wait.
driver.get('https://www.billboard.com/charts/hot-100/')
elements=WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "ul.lrv-a-unstyle-list h3#title-of-a-story")))
print(len(elements))
print([item.text for item in elements])
Need to import below libraries.
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
OutPut:
100
['Anti-Hero', 'Lavender Haze', 'Maroon', 'Snow On The Beach', 'Midnight Rain', 'Bejeweled', 'Question...?', "You're On Your Own, Kid", 'Karma', 'Vigilante Shit', 'Unholy', 'Bad Habit', 'Mastermind', 'Labyrinth', 'Sweet Nothing', 'As It Was', 'I Like You (A Happier Song)', "I Ain't Worried", 'You Proof', "Would've, Could've, Should've", 'Bigger Than The Whole Sky', 'Super Freaky Girl', 'Sunroof', "I'm Good (Blue)", 'Under The Influence', 'The Great War', 'Vegas', 'Something In The Orange', 'Wasted On You', 'Jimmy Cooks', 'Wait For U', 'Paris', 'High Infidelity', 'Tomorrow 2', 'Titi Me Pregunto', 'About Damn Time', 'The Kind Of Love We Make', 'Late Night Talking', 'Cuff It', 'She Had Me At Heads Carolina', 'Glitch', 'Me Porto Bonito', 'Die For You', 'California Breeze', 'Dear Reader', 'Forever', 'Hold Me Closer', 'Just Wanna Rock', '5 Foot 9', 'Unstoppable', 'Thank God', 'Fall In Love', 'Rock And A Hard Place', 'Golden Hour', 'Heyy', 'Half Of Me', 'Until I Found You', 'Victoria’s Secret', 'Son Of A Sinner', "Star Walkin' (League Of Legends Worlds Anthem)", 'Romantic Homicide', 'What My World Spins Around', 'Real Spill', "Don't Come Lookin'", 'Monotonia', 'Stand On It', 'Never Hating', 'Wishful Drinking', 'No Se Va', 'Free Mind', 'Not Finished', 'Music For A Sushi Restaurant', 'Pop Out', 'Staying Alive', 'Poland', 'Whiskey On You', 'Billie Eilish.', 'Betty (Get Money)', '2 Be Loved (Am I Ready)', 'Wait In The Truck', 'All Mine', 'La Bachata', 'Last Last', 'Glimpse Of Us', 'Freestyle', 'Calm Down', 'Gatubela', 'Evergreen', 'Pick Me Up', 'Gotta Move On', 'Perfect Timing', 'Country On', 'Snap', 'She Likes It', 'Made You Look', 'Dark Red', 'From Now On', 'Forget Me', 'Miss You', 'Despecha']

driver.implicitly_wait(10) is not a pause command!
It has no effect on the next line print(driver.page_source).
If you want to wait for page to be completely load you can wait for some specific element there to be visible. Use WebDriverWait expected_conditions for that. Or just add a hardcoded pause time.sleep(10) instead of driver.implicitly_wait(10)

Related

how to stop letter repeating itself python

I am making a code which takes in jumble word and returns a unjumbled word , the data.json contains a list and here take a word one-by-one and check if it contains all the characters of the word and later checking if the length is same , but the problem is when i enter a word as helol then the l is checked twice and giving me some other outputs including the main one(hello). i know why does it happen but i cant get a fix to it
import json
val = open("data.json")
val1 = json.load(val)#loads the list
a = input("Enter a Jumbled word ")#takes a word from user
a = list(a)#changes into list to iterate
for x in val1:#iterates words from list
for somethin in a:#iterates letters from list
if somethin in list(x):#checks if the letter is in the iterated word
continue
else:
break
else:#checks if the loop ended correctly (that means word has same letters)
if len(a) != len(list(x)):#checks if it has same number of letters
continue#returns
else:
print(x)#continues the loop to see if there are more like that
EDIT: many people wanted the json file so here it is
['Torres Strait Creole', 'good bye', 'agon', "queen's guard", 'animosity', 'price list', 'subjective', 'means', 'severe', 'knockout', 'life-threatening', 'entry into the war', 'dominion', 'damnify', 'packsaddle', 'hallucinate', 'lumpy', 'inception', 'Blankenese', 'cacophonous', 'zeptomole', 'floccinaucinihilipilificate', 'abashed', 'abacterial', 'ableism', 'invade', 'cohabitant', 'handicapped', 'obelus', 'triathlon', 'habitue', 'instigate', 'Gladstone Gander', 'Linked Data', 'seeded player', 'mozzarella', 'gymnast', 'gravitational force', 'Friedelehe', 'open up', 'bundt cake', 'riffraff', 'resourceful', 'wheedle', 'city center', 'gorgonzola', 'oaf', 'auf', 'oafs', 'galoot', 'imbecile', 'lout', 'moron', 'news leak', 'crate', 'aggregator', 'cheating', 'negative growth', 'zero growth', 'defer', 'ride back', 'drive back', 'start back', 'shy back', 'spring back', 'shrink back', 'shy away', 'abderian', 'unable', 'font manager', 'font management software', 'consortium', 'gown', 'inject', 'ISO 639', 'look up', 'cross-eyed', 'squinting', 'health club', 'fitness facility', 'steer', 'sunbathe', 'combatives', 'HTH', 'hope that helps', 'How The Hell', 'distributed', 'plum cake', 'liberalization', 'macchiato', 'caffè macchiato', 'beach volley', 'exult', 'jubilate', 'beach volleyball', 'be beached', 'affogato', 'gigabyte', 'terabyte', 'petabyte', 'undressed', 'decameter', 'sensual', 'boundary marker', 'poor man', 'cohabitee', 'night sleep', 'protruding ears', 'three quarters of an hour', 'spermophilus', 'spermophilus stricto sensu', "devil's advocate", 'sacred king', 'sacral king', 'myr', 'million years', 'obtuse-angled', 'inconsolable', 'neurotic', 'humiliating', 'mortifying', 'theological', 'rematch', 'varıety', 'be short', 'ontological', 'taxonomic', 'taxonomical', 'toxicology testing', 'on the job training', 'boulder', 'unattackable', 'inviolable', 'resinous', 'resiny', 'ionizing radiation', 'citrus grove', 'comic book shop', 'preparatory measure', 'written account', 'brittle', 'locker', 'baozi', 'bao', 'bau', 'humbow', 'nunu', 'bausak', 'pow', 'pau', 'yesteryear', 'fire drill', 'rotted', 'putto', 'overthrow', 'ankle monitor', 'somewhat stupid', 'a little stupid', 'semordnilap', 'pangram', 'emordnilap', 'person with a sunlamp tan', 'tittle', 'incompatible', 'autumn wind', 'dairyman', 'chesty', 'lacustrine', 'chronophotograph', 'chronophoto', 'leg lace', 'ankle lace', 'ankle lock', 'Babelfy', 'ventricular', 'recurrent', 'long-lasting', 'long-standing', 'long standing', 'sea bass', 'reap', 'break wind', 'chase away', 'spark', 'speckle', 'take back', 'Westphalian', 'Aeolic Greek', 'startup', 'abseiling', 'impure', 'bottle cork', 'paralympic', 'work out', 'might', 'ice-cream man', 'ice cream man', 'ice cream maker', 'ice-cream maker', 'traveling', 'special delivery', 'prizefighter', 'abs', 'ab', 'churro', 'pilfer', 'dehumanize', 'fertilize', 'inseminate', 'digitalize', 'fluke', 'stroke of luck', 'decontaminate', 'abandonware', 'manzanita', 'tule', 'jackrabbit', 'system administrator', 'system admin', 'springtime lethargy', 'Palatinean', 'organized religion', 'bearing puller', 'wheel puller', 'gear puller', 'shot', 'normalize', 'palindromic', 'lancet window', 'terminological', 'back of head', 'dragon food', 'barbel', 'Central American Spanish', 'basis', 'birthmark', 'blood vessel', 'ribes', 'dog-rose', 'dreadful', 'freckle', 'free of charge', 'weather verb', 'weather sentence', 'gipsy', 'gypsy', 'glutton', 'hump', 'low voice', 'meek', 'moist', 'river mouth', 'turbid', 'multitude', 'palate', 'peak of mountain', 'poetry', 'pure', 'scanty', 'spicy', 'spicey', 'spruce', 'surface', 'infected', 'copulate', 'dilute', 'dislocate', 'grow up', 'hew', 'hinder', 'infringe', 'inhabit', 'marry off', 'offend', 'pass by', 'brother of a man', 'brother of a woman', 'sister of a man', 'sister of a woman', 'agricultural farm', 'result in', 'rebel', 'strew', 'scatter', 'sway', 'tread', 'tremble', 'hog', 'circuit breaker', 'Southern Quechua', 'safety pin', 'baby pin', 'college student', 'university student', 'pinus sibirica', 'Siberian pine', 'have lunch', 'floppy', 'slack', 'sloppy', 'wishi-washi', 'turn around', 'bogeyman', 'selfish', 'Talossan', 'biomembrane', 'biological membrane', 'self-sufficiency', 'underevaluation', 'underestimation', 'opisthenar', 'prosody', 'Kumhar Bhag Paharia', 'psychoneurotic', 'psychoneurosis', 'levant', "couldn't-care-less attitude", 'noctambule', 'acid-free paper', 'decontaminant', 'woven', 'wheaten', 'waste-ridden', 'war-ridden', 'violence-ridden', 'unwritten', 'typewritten', 'spoken', 'abiogenetically', 'rasp', 'abstractly', 'cyclically', 'acyclically', 'acyclic', 'ad hoc', 'spare tire', 'spare wheel', 'spare tyre', 'prefabricated', 'ISO 9000', 'Barquisimeto', 'Maracay', 'Ciudad Guayana', 'San Cristobal', 'Barranquilla', 'Arequipa', 'Trujillo', 'Cusco', 'Callao', 'Cochabamba', 'Goiânia', 'Campinas', 'Fortaleza', 'Florianópolis', 'Rosario', 'Mendoza', 'Bariloche', 'temporality', 'papyrus sedge', 'paper reed', 'Indian matting plant', 'Nile grass', 'softly softly', 'abductive reasoning', 'abductive inference', 'retroduction', 'Salzburgian', 'cymotrichous', 'access point', 'wireless access point', 'dynamic DNS', 'IP address', 'electrolyte', 'helical', 'hydrometer', 'intranet', 'jumper', 'MAC address', 'Media Access Control address', 'nickel–cadmium battery', 'Ni-Cd battery', 'oscillograph', 'overload', 'photovoltaic', 'photovoltaic cell', 'refractor telescope', 'autosome', 'bacterial artificial chromosome', 'plasmid', 'nucleobase', 'base pair', 'base sequence', 'chromosomal deletion', 'deletion', 'deletion mutation', 'gene deletion', 'chromosomal inversion', 'comparative genomics', 'genomics', 'cytogenetics', 'DNA replication', 'DNA repair', 'DNA sequence', 'electrophoresis', 'functional genomics', 'retroviral', 'retroviral infection', 'acceptance criteria', 'batch processing', 'business rule', 'code review', 'configuration management', 'entity–relationship model', 'lifecycle', 'object code', 'prototyping', 'pseudocode', 'referential', 'reusability', 'self-join', 'timestamp', 'accredited', 'accredited translator', 'certify', 'certified translation', 'computer-aided design', 'computer-aided', 'computer-assisted', 'management system', 'computer-aided translation', 'computer-assisted translation', 'machine-aided translation', 'conference interpreter', 'freelance translator', 'literal translation', 'mother-tongue', 'whispered interpreting', 'simultaneous interpreting', 'simultaneous interpretation', 'base anhydride', 'binary compound', 'absorber', 'absorption coefficient', 'attenuation coefficient', 'active solar heater', 'ampacity', 'amorphous semiconductor', 'amorphous silicon', 'flowerpot', 'antireflection coating', 'antireflection', 'armored cable', 'electric arc', 'breakdown voltage','casing', 'facing', 'lining', 'assumption of Mary', 'auscultation']
Just a example and the dictionary is full of items

As I understand it you are trying to identify all possible matches for the jumbled string in your list. You could sort the letters in the jumbled word and match the resulting list against sorted lists of the words in your data file.
sorted_jumbled_word = sorted(a)
for word in val1:
if len(sorted_jumbled_word) == len(word) and sorted(word) == sorted_jumbled_word:
print(word)
Checking by length first reduces unnecessary sorting. If doing this repeatedly, you might want to create a dictionary of the words in the data file with their sorted versions, to avoid having to repeatedly sort them.
There are spaces and punctuation in some of the terms in your word list. If you want to make the comparison ignoring spaces then remove them from both the jumbled word and the list of unjumbled words, using e.g. word = word.replace(" ", "")

How to scrape all topics from twitter

All topics in twitter can be found in this link
I would like to scrape all of them with each of the subcategory inside.
BeautifulSoup doesn't seem to be useful here. I tried using selenium, but I don't know how to match the Xpaths that come after clicking the main category.
from selenium import webdriver
from selenium.common import exceptions
url = 'https://twitter.com/i/flow/topics_selector'
driver = webdriver.Chrome('absolute path to chromedriver')
driver.get(url)
driver.maximize_window()
main_topics = driver.find_elements_by_xpath('/html/body/div[1]/div/div/div[1]/div[2]/div/div/div/div/div/div[2]/div[2]/div/div/div[2]/div[2]/div/div/div/div/span')
topics = {}
for main_topic in main_topics[2:]:
print(main_topic.text.strip())
topics[main_topic.text.strip()] = {}
I know I can click the main category using main_topics[3].click(), but I don't know how I can maybe recursively click through them until I find only the ones with Follow on the right.

To scrape all the main topics e.g. Arts & culture, Business & finance, etc using Selenium and python you have to induce WebDriverWait for visibility_of_all_elements_located() and you can use either of the following Locator Strategies:
Using XPATH and text attribute:
driver.get("https://twitter.com/i/flow/topics_selector")
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[#role='button']/div/span")))])
Using XPATH and get_attribute():
driver.get("https://twitter.com/i/flow/topics_selector")
print([my_elem.get_attribute("textContent") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[#role='button']/div/span")))])
Console Output:
['Arts & culture', 'Business & finance', 'Careers', 'Entertainment', 'Fashion & beauty', 'Food', 'Gaming', 'Lifestyle', 'Movies and TV', 'Music', 'News', 'Outdoors', 'Science', 'Sports', 'Technology', 'Travel']
To scrape all the main and sub topics using Selenium and WebDriver you can use the following Locator Strategy:
Using XPATH and get_attribute("textContent"):
driver.get("https://twitter.com/i/flow/topics_selector")
elements = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[#role='button']/div/span")))
for element in elements:
element.click()
print([my_elem.get_attribute("textContent") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[#role='button']/div/span[text()]")))])
driver.quit()
Console Output:
['Arts & culture', 'Animation', 'Art', 'Books', 'Dance', 'Horoscope', 'Theater', 'Writing', 'Business & finance', 'Business personalities', 'Business professions', 'Cryptocurrencies', 'Careers', 'Education', 'Fields of study', 'Entertainment', 'Celebrities', 'Comedy', 'Digital creators', 'Entertainment brands', 'Podcasts', 'Popular franchises', 'Theater', 'Fashion & beauty', 'Beauty', 'Fashion', 'Food', 'Cooking', 'Cuisines', 'Gaming', 'Esports', 'Game development', 'Gaming hardware', 'Gaming personalities', 'Tabletop gaming', 'Video games', 'Lifestyle', 'Animals', 'At home', 'Collectibles', 'Family', 'Fitness', 'Unexplained phenomena', 'Movies and TV', 'Movies', 'Television', 'Music', 'Alternative', 'Bollywood music', 'C-pop', 'Classical music', 'Country music', 'Dance music', 'Electronic music', 'Hip-hop & rap', 'J-pop', 'K-hip hop', 'K-pop', 'Metal', 'Musical instruments', 'Pop', 'R&B and soul', 'Radio stations', 'Reggae', 'Reggaeton', 'Rock', 'World music', 'News', 'COVID-19', 'Local news', 'Social movements', 'Outdoors', 'Science', 'Biology', 'Sports', 'American football', 'Australian rules football', 'Auto racing', 'Baseball', 'Basketball', 'Combat Sports', 'Cricket', 'Extreme sports', 'Fantasy sports', 'Football', 'Golf', 'Gymnastics', 'Hockey', 'Lacrosse', 'Pub sports', 'Rugby', 'Sports icons', 'Sports journalists & coaches', 'Tennis', 'Track & field', 'Water sports', 'Winter sports', 'Technology', 'Computer programming', 'Cryptocurrencies', 'Data science', 'Information security', 'Operating system', 'Tech brands', 'Tech personalities', 'Travel', 'Adventure travel', 'Destinations', 'Transportation']
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

Take a look at how XPATH works. Just put '//element[#attribute="foo"]' and you don't have to write out the whole path. Be careful as both main topics and sub topics (which are visible after clicking the main topics) have the same class name. That was causing the error. So, here's how I was able to click the subtopics, but I'm sure there's a better way:
I found the topics elements using:
topics = WebDriverWait(browser, 5).until(
EC.presence_of_all_elements_located((By.XPATH, '//div[#class="css-901oao r-13gxpu9 r-1qd0xha r-1b6yd1w r-1vr29t4 r-ad9z0x r-bcqeeo r-qvutc0"]'))
)
then I created an empty list named:
main_topics = []
Then, I for looped through topics and appeneded each element.text to the main_topics list, and clicked each element to show the main topics.
for topic in topics:
main_topics.append(topic.text)
topic.click()
Then, I a new variable called sub_topics: (it's now all the opened topics)
sub_topics = WebDriverWait(browser, 5).until(
EC.presence_of_all_elements_located((By.XPATH, '//span[#class="css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0"]'))
)
Then, I created two more empty lists named:
subs_list = []
skip_these_words = ["Done", "Follow your favorite Topics", "You’ll see top Tweets about them in your timeline. Don’t see your favorite Topics yet? New Topics are added every week.", "Follow"]
]
Then, I for looped through the sub_topics and made an if statement to only append the elements.text to the subs_list ONLY IF they were not in the main_topics and skip_these_words lists. I did this to filter out the main topics and unecessary text at the top since all these dern elements have the same class name. Finally, each sub topic is clicked. This last part is confusing so here's an example:
for sub in sub_topics:
if sub.text not in main_topics and sub.text not in skip_these_words:
subs_list.append(sub.text)
sub.click()
There are also a few more hidden sub-sub topics. See if you can click the remaining sub-sub topics. Then, see if you can find the follow button element and click each one.

Selenium only returns an empty list

I'm trying to scrape football team names from betfair.com and no matter what, it returs an empty list. This is what I've tried most recently.
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome(r'C:\Users\Tom\Desktop\chromedriver\chromedriver.exe')
driver.get('https://www.betfair.com/exchange/plus/football')
team = driver.find_elements_by_xpath('//*[#id="main-wrapper"]/div/div[2]/div/ui-view/div/div/div/div/div[1]/div/div[1]/bf-super-coupon/main/ng-include[3]/section[1]/div[2]/bf-coupon-table/div/table/tbody/tr[1]/td[1]/a/event-line/section/ul[1]/li[1]')
print(team)

You should use WebDriverWait. Also, you should use a relative xPath, not absolute xPath. One more thing you are using find_elements for a single element.
Here I'm printing all the teams
from pprint import pprint
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(r"C:\Users\Tom\Desktop\chromedriver\chromedriver.exe")
driver.get('https://www.betfair.com/exchange/plus/football')
teams = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//*[#id="main-wrapper"]//ul[#class="runners"]/li')))
print([i.text for i in teams])
Output:
['Everton',
'West Ham',
'Tottenham',
'Watford',
'Chelsea',
'Newcastle',
'Wolves',
'Southampton',
'Leicester',
'Burnley',
'Aston Villa',
'Brighton',
'Bournemouth',
'Norwich',
'Crystal Palace',
'Man City',
'Man Utd',
'Liverpool',
'Sheff Utd',
'Arsenal',
'Eintracht Frankfurt',
'Leverkusen',
'Werder Bremen',
'Hertha Berlin',
'Augsburg',
'Bayern Munich',
'Fortuna Dusseldorf',
'Mainz',
'RB Leipzig',
'Wolfsburg',
'Union Berlin',
'Freiburg',
'Dortmund',
'Mgladbach',
'FC Koln',
'Paderborn',
'Hoffenheim',
'Schalke 04',
'St Etienne',
'Lyon',
'Nice',
'Paris St-G',
'Lyon',
'Dijon',
'Reims',
'Montpellier',
'Nimes',
'Amiens',
'Toulouse',
'Lille',
'Metz',
'Nantes',
'Angers',
'Brest',
'Bordeaux',
'St Etienne',
'Monaco',
'Rennes',
'Houston Dynamo',
'LA Galaxy',
'Philadelphia',
'New York City',
'Atlanta Utd',
'New England',
'Seattle Sounders',
'Minnesota Utd',
'DC Utd',
'FC Cincinnati',
'Orlando City',
'Chicago Fire',
'Montreal Impact',
'New York Red Bulls',
'Toronto FC',
'Columbus',
'Los Angeles FC',
'Colorado',
'FC Dallas',
'Kansas City',
'Shakhtar',
'Dinamo Zagreb',
'Atletico Madrid',
'Leverkusen',
'Club Brugge',
'Paris St-G',
'Tottenham',
'Crvena Zvezda',
'Olympiakos',
'Bayern Munich',
'Man City',
'Atalanta',
'Galatasaray',
'Real Madrid',
'Juventus',
'Lokomotiv',
'Ajax',
'Chelsea',
'RB Leipzig',
'Zenit St Petersburg']

Scraping data with a lack of classes / ids on elements

I'm trying to scrape data to build an object which looks like;
{
"artist": "Oasis",
"albums": {
"Definitely Maybe": [
"Rock n Roll Star",
"Shakermaker",
...
],
"(What's The Story) Morning Glory": [
"Hello",
"Roll With It"
...
],
...
}
}
Here is how the HTML on the page looks;
I'm currently scrapping the data like so;
data = []
for div in soup.find_all("div",{"id":"listAlbum"}):
links = div.findAll('a')
for a in links:
if a.text.strip() is "":
pass
elif a.text.strip():
data.append(a.text.strip())
Likewise, grabbing the album names is straightforward also;
for div in soup.find_all("div",{"class":"album"}):
titles = div.findAll('b')
for t in titles:
...
My problem is how to use the above two loops to build an object like the one at the top. How can I ensure the songs from X album, go into the correct album object. If each song had an album attribute, it would be clear to me. However, with the HTML structured the way it is - I'm at a bit of a loss.
EDIT: Find the HTML below;
<div id="listAlbum">
<a id="1368"></a>
<div class="album">album: <b>"Definitely Maybe"</b> (1994)</div>
Rock 'n' Roll Star<br>
Shakermaker<br>
Live Forever<br>
Up In The Sky<br>
Columbia<br>
Supersonic<br>
Bring It On Down<br>
Cigarettes & Alcohol<br>
Digsy's Diner<br>
Slide Away<br>
Married With Children<br>
Sad Song<br>
<a id="1366"></a>
<div class="album">album: <b>"(What's The Story) Morning Glory"</b> (1995)</div>
Hello<br>
Roll With It<br>
Wonderwall<br>
Don't Look Back In Anger<br>
Hey Now<br>
Some Might Say<br>
Cast No Shadow<br>
She's Electric<br>
Morning Glory<br>
Champagne Supernova<br>
Bonehead's Bank Holiday<br>

You can do this using find_next_siblings().
Code:
oasis = {
'artist': 'Oasis',
'albums': {}
}
soup = BeautifulSoup(html, 'lxml') # where html is the html you've provided
all_albums = soup.find('div', id='listAlbum')
first_album = all_albums.find('div', class_='album')
album_name = first_album.b.text
songs = []
for tag in first_album.find_next_siblings(['a', 'div']):
# If tag is <div> add the previous album.
if tag.name == 'div':
oasis['albums'][album_name] = songs
songs = []
album_name = tag.b.text
# If tag is <a> append song to the list.
else:
songs.append(tag.text)
# Add the last album
oasis['albums'][album_name] = songs
print(oasis)
Output:
{
'artist': 'Oasis',
'albums': {
'"Definitely Maybe"': ["Rock 'n' Roll Star", 'Shakermaker', 'Live Forever', 'Up In The Sky', 'Columbia', 'Supersonic', 'Bring It On Down', 'Cigarettes & Alcohol', "Digsy's Diner", 'Slide Away', 'Married With Children', 'Sad Song', ''],
'"(What\'s The Story) Morning Glory"': ['Hello', 'Roll With It', 'Wonderwall', "Don't Look Back In Anger", 'Hey Now', 'Some Might Say', 'Cast No Shadow', "She's Electric", 'Morning Glory', 'Champagne Supernova', "Bonehead's Bank Holiday"]
}
}
EDIT:
After checking the website, I've made a few changes to the code.
First, you need to skip this <a id="6910"></a> tag (which is located at the end of each album) as it will add a song with empty name. Second, the text other songs: is not located inside <b> tag; so it will raise an error with album_name = tag.b.text.
Doing the following changes will give you exactly what you need.
for tag in first_album.find_next_siblings(['a', 'div']):
if tag.name == 'div':
oasis['albums'][album_name] = songs
songs = []
album_name = tag.text if tag.text == 'other songs:' else tag.b.text
continue
if tag.get('id'):
continue
songs.append(tag.text)
Final output:
{
'artist': 'Oasis',
'albums': {
'"Definitely Maybe"': ["Rock 'n' Roll Star", 'Shakermaker', 'Live Forever', 'Up In The Sky', 'Columbia', 'Supersonic', 'Bring It On Down', 'Cigarettes & Alcohol', "Digsy's Diner", 'Slide Away', 'Married With Children', 'Sad Song'],
'"(What\'s The Story) Morning Glory"': ['Hello', 'Roll With It', 'Wonderwall', "Don't Look Back In Anger", 'Hey Now', 'Some Might Say', 'Cast No Shadow', "She's Electric", 'Morning Glory', 'Champagne Supernova', "Bonehead's Bank Holiday"],
'"Be Here Now"': ["D'You Know What I Mean?", 'My Big Mouth', 'Magic Pie', 'Stand By Me', 'I Hope, I Think, I Know', 'The Girl In The Dirty Shirt', 'Fade In-Out', "Don't Go Away", 'Be Here Now', 'All Around The World', "It's Getting Better (Man!!)"],
'"The Masterplan"': ['Acquiesce', 'Underneath The Sky', 'Talk Tonight', 'Going Nowhere', 'Fade Away', 'I Am The Walrus (Live)', 'Listen Up', "Rockin' Chair", 'Half The World Away', "(It's Good) To Be Free", 'Stay Young', 'Headshrinker', 'The Masterplan'],
'"Standing On The Shoulder Of Giants"': ["Fuckin' In The Bushes", 'Go Let It Out', 'Who Feels Love?', 'Put Yer Money Where Yer Mouth Is', 'Little James', 'Gas Panic!', 'Where Did It All Go Wrong?', 'Sunday Morning Call', 'I Can See A Liar', 'Roll It Over'],
'"Heathen Chemistry"': ['The Hindu Times', 'Force Of Nature', 'Hung In A Bad Place', 'Stop Crying Your Heart Out', 'Song Bird', 'Little By Little', '(Probably) All In The Mind', 'She Is Love', 'Born On A Different Cloud', 'Better Man'],
'"Don\'t Believe The Truth"': ['Turn Up The Sun', 'Mucky Fingers', 'Lyla', 'Love Like A Bomb', 'The Importance Of Being Idle', 'The Meaning Of Soul', "Guess God Thinks I'm Abel", 'Part Of The Queue', 'Keep The Dream Alive', 'A Bell Will Ring', 'Let There Be Love'],
'"Dig Out Your Soul"': ['Bag It Up', 'The Turning', 'Waiting For The Rapture', 'The Shock Of The Lightning', "I'm Outta Time", '(Get Off Your) High Horse Lady', 'Falling Down', "To Be Where There's Life", "Ain't Got Nothin'", 'The Nature Of Reality', 'Soldier On', 'I Believe In All'],
'other songs:': ["(As Long As They've Got) Cigarettes In Hell", '(I Got) The Fever', 'Alice', 'Alive', 'Angel Child', 'Boy With The Blues', 'Carry Us All', 'Cloudburst', 'Cum On Feel The Noize', "D'Yer Wanna Be A Spaceman", 'Eyeball Tickler', 'Flashbax', 'Full On', 'Helter Skelter', 'Heroes', 'I Will Believe', "Idler's Dream", 'If We Shadows', "It's Better People", 'Just Getting Older', "Let's All Make Believe", 'My Sister Lover', 'One Way Road', 'Round Are Way', 'Step Out', 'Street Fighting Man', 'Take Me', 'Take Me Away', 'The Fame', 'Whatever', "You've Got To Hide Your Love Away"]
}
}

Scraper harvesting major portion of data while missing few

I've written a script in python with selenium to scrape the complete flight schedule from a webpage. Upon running my script I could see that it is working good so far except for some fields which are not getting parsed. I've checked for the elements within which the data are located but I noticed that elements for already scraped one and the missing one are no different. What to do to get the full content. Thanks in advance.
Here is the script I'm trying with:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("http://www.yvr.ca/en/passengers/flights/departing-flights")
wait = WebDriverWait(driver, 10)
item = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "table.yvr-flights__table")))
list_of_data = [[item.text for item in data.find_elements_by_css_selector('td')]
for data in item.find_elements_by_css_selector('tr')]
for tab_data in list_of_data:
print(tab_data)
driver.quit()
Here is the partial picture of the data [missing one and scraped one]:
https://www.dropbox.com/s/xaqeiq97b6upj5j/flight_stuff.jpg?dl=0
Here are the td elements for one block:
<tr class="yvr-flights__row yvr-flights__row--departed " id="226792377">
<td>
<time class="yvr-flights__label yvr-flights__scheduled-label yvr-flights__scheduled-label--departed notranslate" datetime="2017-08-24T06:20:00-07:00">
06:20
</time>
</td>
<td class="yvr-flights__table-cell--revised notranslate">
<time class="yvr-flights__label yvr-flights__revised-label yvr-flights__revised-label--departed" datetime="2017-08-24T06:20:00-07:00">
06:19
</time>
</td>
<td class="yvr-table__cell yvr-flights__flightNumber notranslate">AC560</td>
<td class="hidden-until--md yvr-table__cell yvr-table__cell--fade-out yvr-table__cell--nowrap notranslate">Air Canada</td>
<td class="yvr-table__cell yvr-table__cell--fade-out yvr-table__cell--nowrap notranslate">San Francisco</td>
<td class="hidden-until--md yvr-table__cell yvr-table__cell--nowrap notranslate">
Main
</td>
<td class="hidden-until--md yvr-table__cell yvr-table__cell--nowrap notranslate">E87</td>
<td class="yvr-flights__table-cell--status yvr-table__cell--nowrap">
<span class="yvr-flights__status yvr-flights__status--departed">Departed</span>
</td>
<td class="hidden-until--md yvr-table__cell yvr-table__cell--nowrap">
</td>
<td class="visible-until--md">
<button class="yvr-flights__toggle-flight">Toggle flight</button>
</td>
</tr>

Since you are looking to fix your script and not scrape data. I found few issues in your script.
One your scanning all tr nodes. But the tr you are interested in should have yvr-flights__row class. But there are ones which are hidden and don't have data. They have yvr-flights__row--hidden. So you don't want them
Also the 2nd column of table doesn't have data always. When it has it is more like below
<td class="yvr-flights__table-cell--revised notranslate">
<time class="yvr-flights__label yvr-flights__revised-label yvr-flights__revised-label--early" datetime="2017-08-25T06:30:00-07:00">
06:20
</time>
</td>
So you when you use .text on the td. The node itself has no text. But it has a time node which has the text. There are multiple ways to fix that. But I use JS to get the content of such node
driver.execute_script("return arguments[0].textContent;").trim()
So if you combine all of it below script does all the work
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("http://www.yvr.ca/en/passengers/flights/departing-flights")
wait = WebDriverWait(driver, 10)
item = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "table.yvr-flights__table")))
list_of_data = [
[
item.text if item.text else driver.execute_script("return arguments[0].textContent.trim();", item).strip()
for item in data.find_elements_by_css_selector('td')
]
for data in item.find_elements_by_css_selector('tr.yvr-flights__row:not(.yvr-flights__row--hidden)')
]
for tab_data in list_of_data:
print(tab_data)
It gives me the below output
['02:00', '02:20', 'CX889', 'Cathay Pacific', 'Hong Kong', 'Main', 'D64', 'Departed', '', 'Toggle flight']
['05:15', '', 'PR127', 'Philippine Airlines', 'Manila', 'Main', 'D70', 'Departed', '', 'Toggle flight']
['06:00', '', 'AS964', 'Alaska Airlines', 'Seattle', 'Main', 'E73', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:00', '', 'DL4805', 'Delta Air Lines', 'Seattle', 'Main', 'E90', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:00', '', 'WS3114', 'WestJet', 'Kelowna', 'Main', 'A9', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:00', '', 'AA6045', 'American Airlines', 'Los Angeles', 'Main', 'E86', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:00', '', 'AC100', 'Air Canada', 'Toronto', 'Main', 'C45', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:01', '', 'UA618', 'United Airlines', 'San Francisco', 'Main', 'E76', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:10', '', 'AC8606', 'Air Canada', 'Winnipeg', 'Main', 'C39', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:10', '', 'AC8190', 'Air Canada', 'Kamloops', 'Main', 'C34', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:10', '', 'AC200', 'Air Canada', 'Calgary', 'Main', 'C29', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:15', '', 'WS560', 'WestJet', 'Calgary', 'Main', 'B13', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:20', '', 'AC560', 'Air Canada', 'San Francisco', 'Main', 'E87', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:30', '06:20', 'DL2555', 'Delta Air Lines', 'Minneapolis', 'Main', 'E88', 'Early', 'NOTIFY ME', 'Toggle flight']
['06:30', '', 'WS700', 'WestJet', 'Toronto', 'Main', 'B15', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:30', '', 'UA664', 'United Airlines', 'Chicago', 'Main', 'E75', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:40', '', 'AM695', 'AeroMexico', 'Mexico City', 'Main', 'D53', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:40', '', 'WS6110', 'WestJet', 'Mexico City', 'Main', 'D53', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:45', '06:45', 'AC8055', 'Air Canada', 'Victoria', 'Main', '',
...
['23:25', '', 'AC8269', 'Air Canada', 'Nanaimo', 'Main', '', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:25', '', 'AM697', 'AeroMexico', 'Mexico City', 'Main', 'D54', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:25', '', 'WS6108', 'WestJet', 'Mexico City', 'Main', 'D54', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:25', '', 'AC8083', 'Air Canada', 'Victoria', 'Main', 'C38', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:25', '', 'AC308', 'Air Canada', 'Montreal', 'Main', 'C29', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:26', '', 'WS564', 'WestJet', 'Montreal', 'Main', 'B13', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:30', '', 'AC128', 'Air Canada', 'Toronto', 'Main', 'C47', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:40', '', 'AC33', 'Air Canada', 'Sydney', 'Main', 'D52', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:45', '', 'AC35', 'Air Canada', 'Brisbane', 'Main', 'D65', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:45', '', 'AC344', 'Air Canada', 'Ottawa', 'Main', 'C49', 'On Time', 'NOTIFY ME', 'Toggle flight']

You should just open this URL and get all the details
http://www.yvr.ca/en/_api/Flights?%24filter=FlightScheduledTime%20gt%20DateTime%272017-08-24T00%3A00%3A00%27%20and%20FlightScheduledTime%20lt%20DateTime%272017-08-25T00%3A00%3A00%27%20and%20FlightType%20eq%20%27D%27&%24orderby=FlightScheduledTime%20asc
If I escape the URL it becomes like
http://www.yvr.ca/en/_api/Flights?$filter=FlightScheduledTime gt DateTime'2017-08-24T00:00:00' and FlightScheduledTime lt DateTime'2017-08-25T00:00:00' and FlightType eq 'D'&$orderby=FlightScheduledTime asc
So you should just parameterize this and replace dates based on current date get all the data in JSON form
{
odata.metadata: "http://www.yvr.ca/_api/$metadata#Flights",
value: [
{
FlightStatus: "Departed",
FlightRemarksAdjusted: "Departed",
FlightScheduledTime: "2017-08-24T06:15:00",
FlightEstimatedTime: "2017-08-24T06:10:00",
FlightNumber: "WS560",
FlightAirlineName: "WestJet",
FlightAircraftType: "73H",
FlightDeskTo: "",
FlightDeskFrom: "",
FlightCarousel: "",
FlightRange: "D",
FlightCarrier: "WS",
FlightCity: "Calgary",
FlightType: "D",
FlightAirportCode: "YYC",
FlightGate: "B14",
FlightRemarks: "Departed",
FlightID: 226790614,
FlightQuickConnect: ""
},
{
FlightStatus: "Departed",
FlightRemarksAdjusted: "Departed",
FlightScheduledTime: "2017-08-24T06:20:00",
FlightEstimatedTime: "2017-08-24T06:19:00",

as suggested by Tarun Lalwani, WebDriver is really the wrong tool for this activity.
The problem is that webdriver only returns text from elements that are visible on the screen, so if you want to see the data from all the rows you will need to scroll down the rows and collect the data one row at a time as discussed in WebElement getText() is an empty string in Firefox if element is not physically visible on the screen
This will be painfully slow.
I guess you could also grab the textcontent instead of item.text
in java:
item.getAttribute("textContent");
I'm sure python has an equivalent.
jsoup would be an alternative means to grab the data in a single shot and much faster

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Getting only a part of the page source using selenium webdriver - python

Related

how to stop letter repeating itself python

How to scrape all topics from twitter

Selenium only returns an empty list

Scraping data with a lack of classes / ids on elements

Scraper harvesting major portion of data while missing few

Categories

Resources