Scraper harvesting major portion of data while missing few - python

I've written a Python script with Selenium to scrape the complete flight schedule from a webpage. The script mostly works, but some fields are not being parsed. I inspected the elements that hold the data and could find no difference between the cells that were scraped and the ones that are missing. What should I do to get the full content? Thanks in advance.
Here is the script I'm using:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("http://www.yvr.ca/en/passengers/flights/departing-flights")
wait = WebDriverWait(driver, 10)
item = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "table.yvr-flights__table")))
list_of_data = [[item.text for item in data.find_elements_by_css_selector('td')]
                for data in item.find_elements_by_css_selector('tr')]
for tab_data in list_of_data:
    print(tab_data)
driver.quit()
Here is the partial picture of the data [missing one and scraped one]:
https://www.dropbox.com/s/xaqeiq97b6upj5j/flight_stuff.jpg?dl=0
Here are the td elements for one block:
<tr class="yvr-flights__row yvr-flights__row--departed " id="226792377">
<td>
<time class="yvr-flights__label yvr-flights__scheduled-label yvr-flights__scheduled-label--departed notranslate" datetime="2017-08-24T06:20:00-07:00">
06:20
</time>
</td>
<td class="yvr-flights__table-cell--revised notranslate">
<time class="yvr-flights__label yvr-flights__revised-label yvr-flights__revised-label--departed" datetime="2017-08-24T06:20:00-07:00">
06:19
</time>
</td>
<td class="yvr-table__cell yvr-flights__flightNumber notranslate">AC560</td>
<td class="hidden-until--md yvr-table__cell yvr-table__cell--fade-out yvr-table__cell--nowrap notranslate">Air Canada</td>
<td class="yvr-table__cell yvr-table__cell--fade-out yvr-table__cell--nowrap notranslate">San Francisco</td>
<td class="hidden-until--md yvr-table__cell yvr-table__cell--nowrap notranslate">
Main
</td>
<td class="hidden-until--md yvr-table__cell yvr-table__cell--nowrap notranslate">E87</td>
<td class="yvr-flights__table-cell--status yvr-table__cell--nowrap">
<span class="yvr-flights__status yvr-flights__status--departed">Departed</span>
</td>
<td class="hidden-until--md yvr-table__cell yvr-table__cell--nowrap">
</td>
<td class="visible-until--md">
<button class="yvr-flights__toggle-flight">Toggle flight</button>
</td>
</tr>

Since you want to fix your script rather than just get the data, here are a few issues I found in it.
First, you are scanning all tr nodes, but the rows you are interested in have the yvr-flights__row class. Some of those rows are hidden and carry no data; they have the yvr-flights__row--hidden class, so you want to exclude them.
Second, the 2nd column of the table does not always have data. When it does, it looks like this:
<td class="yvr-flights__table-cell--revised notranslate">
<time class="yvr-flights__label yvr-flights__revised-label yvr-flights__revised-label--early" datetime="2017-08-25T06:30:00-07:00">
06:20
</time>
</td>
So when you use .text on the td, you get an empty string: the td itself has no text, only a nested time node that holds it. There are multiple ways to fix that; I use JavaScript to read the content of such a node (item here is the td element):
driver.execute_script("return arguments[0].textContent.trim();", item)
Combining all of it, the script below does all the work:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("http://www.yvr.ca/en/passengers/flights/departing-flights")
wait = WebDriverWait(driver, 10)
item = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "table.yvr-flights__table")))
list_of_data = [
[
item.text if item.text else driver.execute_script("return arguments[0].textContent.trim();", item).strip()
for item in data.find_elements_by_css_selector('td')
]
for data in item.find_elements_by_css_selector('tr.yvr-flights__row:not(.yvr-flights__row--hidden)')
]
for tab_data in list_of_data:
    print(tab_data)
It gives me the following output:
['02:00', '02:20', 'CX889', 'Cathay Pacific', 'Hong Kong', 'Main', 'D64', 'Departed', '', 'Toggle flight']
['05:15', '', 'PR127', 'Philippine Airlines', 'Manila', 'Main', 'D70', 'Departed', '', 'Toggle flight']
['06:00', '', 'AS964', 'Alaska Airlines', 'Seattle', 'Main', 'E73', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:00', '', 'DL4805', 'Delta Air Lines', 'Seattle', 'Main', 'E90', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:00', '', 'WS3114', 'WestJet', 'Kelowna', 'Main', 'A9', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:00', '', 'AA6045', 'American Airlines', 'Los Angeles', 'Main', 'E86', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:00', '', 'AC100', 'Air Canada', 'Toronto', 'Main', 'C45', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:01', '', 'UA618', 'United Airlines', 'San Francisco', 'Main', 'E76', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:10', '', 'AC8606', 'Air Canada', 'Winnipeg', 'Main', 'C39', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:10', '', 'AC8190', 'Air Canada', 'Kamloops', 'Main', 'C34', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:10', '', 'AC200', 'Air Canada', 'Calgary', 'Main', 'C29', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:15', '', 'WS560', 'WestJet', 'Calgary', 'Main', 'B13', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:20', '', 'AC560', 'Air Canada', 'San Francisco', 'Main', 'E87', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:30', '06:20', 'DL2555', 'Delta Air Lines', 'Minneapolis', 'Main', 'E88', 'Early', 'NOTIFY ME', 'Toggle flight']
['06:30', '', 'WS700', 'WestJet', 'Toronto', 'Main', 'B15', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:30', '', 'UA664', 'United Airlines', 'Chicago', 'Main', 'E75', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:40', '', 'AM695', 'AeroMexico', 'Mexico City', 'Main', 'D53', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:40', '', 'WS6110', 'WestJet', 'Mexico City', 'Main', 'D53', 'On Time', 'NOTIFY ME', 'Toggle flight']
['06:45', '06:45', 'AC8055', 'Air Canada', 'Victoria', 'Main', '',
...
['23:25', '', 'AC8269', 'Air Canada', 'Nanaimo', 'Main', '', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:25', '', 'AM697', 'AeroMexico', 'Mexico City', 'Main', 'D54', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:25', '', 'WS6108', 'WestJet', 'Mexico City', 'Main', 'D54', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:25', '', 'AC8083', 'Air Canada', 'Victoria', 'Main', 'C38', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:25', '', 'AC308', 'Air Canada', 'Montreal', 'Main', 'C29', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:26', '', 'WS564', 'WestJet', 'Montreal', 'Main', 'B13', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:30', '', 'AC128', 'Air Canada', 'Toronto', 'Main', 'C47', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:40', '', 'AC33', 'Air Canada', 'Sydney', 'Main', 'D52', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:45', '', 'AC35', 'Air Canada', 'Brisbane', 'Main', 'D65', 'On Time', 'NOTIFY ME', 'Toggle flight']
['23:45', '', 'AC344', 'Air Canada', 'Ottawa', 'Main', 'C49', 'On Time', 'NOTIFY ME', 'Toggle flight']
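Once the rows come back as plain lists like the ones above, it may be easier to work with named fields. The following is a minimal sketch, not part of the original answer: the column names are assumptions inferred from the cell classes in the question's HTML, and row_to_record is a hypothetical helper.

```python
# Column names are assumptions inferred from the cell classes in the question's HTML.
COLUMNS = ("scheduled", "revised", "flight", "airline",
           "destination", "terminal", "gate", "status")


def row_to_record(row):
    """Map the first eight cells of a scraped row to named fields."""
    # zip() stops at the shorter input, so the trailing 'NOTIFY ME' and
    # 'Toggle flight' cells are dropped.
    return dict(zip(COLUMNS, row))


# Two rows copied from the output above.
rows = [
    ['02:00', '02:20', 'CX889', 'Cathay Pacific', 'Hong Kong', 'Main', 'D64', 'Departed', '', 'Toggle flight'],
    ['06:30', '06:20', 'DL2555', 'Delta Air Lines', 'Minneapolis', 'Main', 'E88', 'Early', 'NOTIFY ME', 'Toggle flight'],
]

records = [row_to_record(row) for row in rows]
print(records[1]["flight"], records[1]["revised"])
```

With records in this shape, a delayed or early flight can be picked out by checking whether the revised field is non-empty.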

You should just open this URL and get all the details:
http://www.yvr.ca/en/_api/Flights?%24filter=FlightScheduledTime%20gt%20DateTime%272017-08-24T00%3A00%3A00%27%20and%20FlightScheduledTime%20lt%20DateTime%272017-08-25T00%3A00%3A00%27%20and%20FlightType%20eq%20%27D%27&%24orderby=FlightScheduledTime%20asc
If I decode the URL, it becomes:
http://www.yvr.ca/en/_api/Flights?$filter=FlightScheduledTime gt DateTime'2017-08-24T00:00:00' and FlightScheduledTime lt DateTime'2017-08-25T00:00:00' and FlightType eq 'D'&$orderby=FlightScheduledTime asc
So you should just parameterize this, replacing the dates based on the current date, and get all the data in JSON form:
{
odata.metadata: "http://www.yvr.ca/_api/$metadata#Flights",
value: [
{
FlightStatus: "Departed",
FlightRemarksAdjusted: "Departed",
FlightScheduledTime: "2017-08-24T06:15:00",
FlightEstimatedTime: "2017-08-24T06:10:00",
FlightNumber: "WS560",
FlightAirlineName: "WestJet",
FlightAircraftType: "73H",
FlightDeskTo: "",
FlightDeskFrom: "",
FlightCarousel: "",
FlightRange: "D",
FlightCarrier: "WS",
FlightCity: "Calgary",
FlightType: "D",
FlightAirportCode: "YYC",
FlightGate: "B14",
FlightRemarks: "Departed",
FlightID: 226790614,
FlightQuickConnect: ""
},
{
FlightStatus: "Departed",
FlightRemarksAdjusted: "Departed",
FlightScheduledTime: "2017-08-24T06:20:00",
FlightEstimatedTime: "2017-08-24T06:19:00",
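The parameterization suggested in this answer can be sketched with the standard library alone. This is only an illustration: departures_url is a hypothetical helper, the endpoint and filter are copied from the URL in the answer, and nothing here confirms that the endpoint still responds.

```python
from datetime import date, timedelta
from urllib.parse import quote, urlencode

BASE_URL = "http://www.yvr.ca/en/_api/Flights"  # endpoint taken from the answer


def departures_url(day):
    """Build the OData query URL for all departures scheduled on the given date."""
    start = day.strftime("%Y-%m-%dT00:00:00")
    end = (day + timedelta(days=1)).strftime("%Y-%m-%dT00:00:00")
    params = {
        "$filter": (
            f"FlightScheduledTime gt DateTime'{start}' and "
            f"FlightScheduledTime lt DateTime'{end}' and "
            "FlightType eq 'D'"
        ),
        "$orderby": "FlightScheduledTime asc",
    }
    # quote_via=quote encodes spaces as %20 (not '+'), matching the answer's URL.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


url = departures_url(date(2017, 8, 24))
print(url)
```

Calling departures_url(date.today()) would give the URL for the current day. The response could then be fetched with any HTTP client, and the flights read from the value list shown in the JSON sample.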

As suggested by Tarun Lalwani, WebDriver is really the wrong tool for this activity.
The problem is that WebDriver only returns text from elements that are visible on the screen, so to get the data from all the rows you would need to scroll through them and collect the data one row at a time, as discussed in "WebElement getText() is an empty string in Firefox if element is not physically visible on the screen".
This will be painfully slow.
You could also read the textContent attribute instead of item.text.
In Java:
item.getAttribute("textContent");
The Python equivalent is item.get_attribute("textContent").
jsoup would be an alternative way to grab the data in a single shot, and much faster.
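In Python, the textContent fallback could look like the sketch below. cell_text is a hypothetical helper, and FakeCell is a stand-in for a Selenium WebElement so that the example runs without a browser; with a real driver you would pass the td elements directly.

```python
def cell_text(cell):
    """Return the visible text if there is any, else fall back to textContent."""
    visible = cell.text
    if visible:
        return visible.strip()
    # textContent is read from the DOM, so it is available even when the
    # element is not rendered as visible.
    return (cell.get_attribute("textContent") or "").strip()


class FakeCell:
    """Hypothetical stand-in for a WebElement (only .text and .get_attribute)."""

    def __init__(self, text, text_content):
        self.text = text
        self._text_content = text_content

    def get_attribute(self, name):
        return self._text_content if name == "textContent" else None


hidden = FakeCell("", "\n06:20\n")   # .text is empty, the DOM still has the time
shown = FakeCell("AC560", "AC560")   # .text works as usual

print(cell_text(hidden), cell_text(shown))
```

This keeps .text as the first choice and only falls back for cells that come back empty, which mirrors the conditional used in the first answer's list comprehension.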

Related

Getting only a part of the page source using selenium webdriver

I'm trying to get the HTML of Billboard's top 100 chart, but I keep getting only about half of the page.
I tried getting the page source using this code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
s=Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
url = "https://www.billboard.com/charts/hot-100/"
driver.get(url)
driver.implicitly_wait(10)
print(driver.page_source)
But it always returns the page source only from the 53rd song on the chart (I've tried increasing the implicit wait and nothing changed).
Not sure why you are getting 53 elements. Using explicit waits I am getting all 100 elements.
Use WebDriverWait() as an explicit wait.
driver.get('https://www.billboard.com/charts/hot-100/')
elements=WebDriverWait(driver,10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "ul.lrv-a-unstyle-list h3#title-of-a-story")))
print(len(elements))
print([item.text for item in elements])
You need the following imports:
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
Output:
100
['Anti-Hero', 'Lavender Haze', 'Maroon', 'Snow On The Beach', 'Midnight Rain', 'Bejeweled', 'Question...?', "You're On Your Own, Kid", 'Karma', 'Vigilante Shit', 'Unholy', 'Bad Habit', 'Mastermind', 'Labyrinth', 'Sweet Nothing', 'As It Was', 'I Like You (A Happier Song)', "I Ain't Worried", 'You Proof', "Would've, Could've, Should've", 'Bigger Than The Whole Sky', 'Super Freaky Girl', 'Sunroof', "I'm Good (Blue)", 'Under The Influence', 'The Great War', 'Vegas', 'Something In The Orange', 'Wasted On You', 'Jimmy Cooks', 'Wait For U', 'Paris', 'High Infidelity', 'Tomorrow 2', 'Titi Me Pregunto', 'About Damn Time', 'The Kind Of Love We Make', 'Late Night Talking', 'Cuff It', 'She Had Me At Heads Carolina', 'Glitch', 'Me Porto Bonito', 'Die For You', 'California Breeze', 'Dear Reader', 'Forever', 'Hold Me Closer', 'Just Wanna Rock', '5 Foot 9', 'Unstoppable', 'Thank God', 'Fall In Love', 'Rock And A Hard Place', 'Golden Hour', 'Heyy', 'Half Of Me', 'Until I Found You', 'Victoria’s Secret', 'Son Of A Sinner', "Star Walkin' (League Of Legends Worlds Anthem)", 'Romantic Homicide', 'What My World Spins Around', 'Real Spill', "Don't Come Lookin'", 'Monotonia', 'Stand On It', 'Never Hating', 'Wishful Drinking', 'No Se Va', 'Free Mind', 'Not Finished', 'Music For A Sushi Restaurant', 'Pop Out', 'Staying Alive', 'Poland', 'Whiskey On You', 'Billie Eilish.', 'Betty (Get Money)', '2 Be Loved (Am I Ready)', 'Wait In The Truck', 'All Mine', 'La Bachata', 'Last Last', 'Glimpse Of Us', 'Freestyle', 'Calm Down', 'Gatubela', 'Evergreen', 'Pick Me Up', 'Gotta Move On', 'Perfect Timing', 'Country On', 'Snap', 'She Likes It', 'Made You Look', 'Dark Red', 'From Now On', 'Forget Me', 'Miss You', 'Despecha']
driver.implicitly_wait(10) is not a pause command!
It has no effect on the next line, print(driver.page_source).
If you want to wait for the page to load completely, wait for a specific element on it to become visible, using WebDriverWait with expected_conditions. Alternatively, add a hardcoded pause with time.sleep(10) instead of driver.implicitly_wait(10).
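To make the difference concrete: an explicit wait polls a condition until it holds or a timeout expires, whereas a fixed sleep always waits the full time. The sketch below is not Selenium's implementation, only a plain-Python illustration of that idea; wait_until and the simulated items_loaded condition are hypothetical names.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Simulated page: the 100 chart entries only "appear" on the third check.
calls = {"n": 0}


def items_loaded():
    calls["n"] += 1
    return list(range(100)) if calls["n"] >= 3 else []


items = wait_until(items_loaded, timeout=2.0, poll=0.01)
print(len(items))
```

The wait returns as soon as the condition is met, so it costs only as much time as the page actually needs, and it fails loudly if the content never shows up.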

how to Get item from nested arrays in python

I am scraping a web table and appending the scraped data to an array. After appending, I get an array like this (an array of arrays):
[['Action'], ['1 796004', '35', '2022-04-28', '2013 FORD FUSION TITANIUM', '43004432', '3FA6P0RU3DR297126', 'CA', 'Copart', 'Batumi, Georgia', 'CAIU7608231EBKG03172414', '2022-05-02', '2022-05-02', '0000-00-00', '', 'dock receipt', 'YES', '', 'No', '', '5/3/2022 Per auction, the title is present and will be prepared for mail out; Follow up for a tracking number-Clara5/9/2022 Per auction, they are still working on
mailing out this title; Follow up required for a tracking number-Clara5/11/2022 Per auction, the title was mailed out; tr#776771949089-Clara[Add notes]', 'A779937', '', '', '', '[edit]', ''], ['2 763189', '43', '2022-01-10', '2018 TOYOTA CAMRY', '43241241', '4T1B11HK7JU080162', 'GA', 'Copart', 'Poti, Georgia', 'MRKU5529916217682189', '2022-01-25', '2022-01-28', '2022-06-20', '2022-06-27', 'dock receipt', 'YES', '2022-01-28', 'Yes', '', '[Add notes]', 'A774742', '', '', '', '[edit]', ''], ['3 762850', '37', '2022-01-07', '2017 VOLKSWAGEN TOUAREG', '65835511', 'WVGRF7BP3HD000549', 'CA', 'Copart', 'Batumi, Georgia', 'MSDU7281152EBKG02708589', '2022-02-09', '2022-02-09', '2022-06-07', '2022-06-14', 'dock receipt', 'YES', '2022-01-20', 'Yes', '', '[Add notes]', 'A774650', '', '', '', '[edit]', ''],]
Now I want to get the 4th item (index 3) of each inner array. It is the car model; for the first appended array it is "2013 FORD FUSION TITANIUM". I want to end up with "2013 FORD FUSION TITANIUM", "2018 TOYOTA CAMRY", etc.
How can I achieve that?
Loop from index 1 of the array to the end, which skips the first subarray (['Action']).
At every iteration, select the ith subarray and take the element at index 3.
The result will be an array of the strings that you wanted.
prompt = [
['Action'],
['1 796004', '35', '2022-04-28', '2013 FORD FUSION TITANIUM', '43004432', '3FA6P0RU3DR297126', 'CA', 'Copart', 'Batumi, Georgia', 'CAIU7608231EBKG03172414', '2022-05-02', '2022-05-02', '0000-00-00', '', 'dock receipt', 'YES', '', 'No', '', '5/3/2022 Per auction, the title is present and will be prepared for mail out; Follow up for a tracking number-Clara5/9/2022 Per auction, they are still working on mailing out this title; Follow up required for a tracking number-Clara5/11/2022 Per auction, the title was mailed out; tr#776771949089-Clara[Add notes]', 'A779937', '', '', '', '[edit]', ''],
['2 763189', '43', '2022-01-10', '2018 TOYOTA CAMRY', '43241241', '4T1B11HK7JU080162', 'GA', 'Copart', 'Poti, Georgia', 'MRKU5529916217682189', '2022-01-25','2022-01-28', '2022-06-20', '2022-06-27', 'dock receipt', 'YES', '2022-01-28', 'Yes', '', '[Add notes]', 'A774742', '', '', '', '[edit]', ''],
['3 762850', '37', '2022-01-07', '2017 VOLKSWAGEN TOUAREG', '65835511', 'WVGRF7BP3HD000549', 'CA', 'Copart', 'Batumi, Georgia', 'MSDU7281152EBKG02708589', '2022-02-09', '2022-02-09', '2022-06-07', '2022-06-14', 'dock receipt', 'YES', '2022-01-20', 'Yes', '', '[Add notes]', 'A774650', '', '', '', '[edit]', ''],
]
res = [prompt[i][3] for i in range(1, len(prompt))]
print(res) # ['2013 FORD FUSION TITANIUM', '2018 TOYOTA CAMRY', '2017 VOLKSWAGEN TOUAREG']
If I misunderstood the question, please let me know.
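A variant of the same idea iterates over the rows directly and skips any row that is too short, instead of relying on the header being at position 0. This is a sketch under assumptions: the rows are shortened to their first four items for readability, and MODEL_INDEX is a name introduced here.

```python
# Rows shortened to their first four items; the real rows are much longer.
rows = [
    ['Action'],
    ['1 796004', '35', '2022-04-28', '2013 FORD FUSION TITANIUM'],
    ['2 763189', '43', '2022-01-10', '2018 TOYOTA CAMRY'],
    ['3 762850', '37', '2022-01-07', '2017 VOLKSWAGEN TOUAREG'],
]

MODEL_INDEX = 3  # the car model is the 4th item of each data row

# Rows with fewer than four items (such as ['Action']) are skipped,
# so no IndexError is raised for header or partial rows.
models = [row[MODEL_INDEX] for row in rows if len(row) > MODEL_INDEX]
print(models)
```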

how to stop letter repeating itself python

I am writing code that takes a jumbled word and returns the unjumbled word. data.json contains a list of words; I take each word one by one, check whether it contains all the characters of the jumbled word, and then check whether the lengths match. The problem is that when I enter a word such as helol, the l is checked twice, which gives me other outputs in addition to the correct one (hello). I know why this happens, but I can't find a fix for it.
import json
val = open("data.json")
val1 = json.load(val)  # loads the list
a = input("Enter a Jumbled word ")  # takes a word from user
a = list(a)  # changes into list to iterate
for x in val1:  # iterates words from list
    for somethin in a:  # iterates letters from list
        if somethin in list(x):  # checks if the letter is in the iterated word
            continue
        else:
            break
    else:  # checks if the loop ended correctly (that means word has same letters)
        if len(a) != len(list(x)):  # checks if it has same number of letters
            continue  # returns
        else:
            print(x)  # continues the loop to see if there are more like that
EDIT: many people wanted the json file so here it is
['Torres Strait Creole', 'good bye', 'agon', "queen's guard", 'animosity', 'price list', 'subjective', 'means', 'severe', 'knockout', 'life-threatening', 'entry into the war', 'dominion', 'damnify', 'packsaddle', 'hallucinate', 'lumpy', 'inception', 'Blankenese', 'cacophonous', 'zeptomole', 'floccinaucinihilipilificate', 'abashed', 'abacterial', 'ableism', 'invade', 'cohabitant', 'handicapped', 'obelus', 'triathlon', 'habitue', 'instigate', 'Gladstone Gander', 'Linked Data', 'seeded player', 'mozzarella', 'gymnast', 'gravitational force', 'Friedelehe', 'open up', 'bundt cake', 'riffraff', 'resourceful', 'wheedle', 'city center', 'gorgonzola', 'oaf', 'auf', 'oafs', 'galoot', 'imbecile', 'lout', 'moron', 'news leak', 'crate', 'aggregator', 'cheating', 'negative growth', 'zero growth', 'defer', 'ride back', 'drive back', 'start back', 'shy back', 'spring back', 'shrink back', 'shy away', 'abderian', 'unable', 'font manager', 'font management software', 'consortium', 'gown', 'inject', 'ISO 639', 'look up', 'cross-eyed', 'squinting', 'health club', 'fitness facility', 'steer', 'sunbathe', 'combatives', 'HTH', 'hope that helps', 'How The Hell', 'distributed', 'plum cake', 'liberalization', 'macchiato', 'caffè macchiato', 'beach volley', 'exult', 'jubilate', 'beach volleyball', 'be beached', 'affogato', 'gigabyte', 'terabyte', 'petabyte', 'undressed', 'decameter', 'sensual', 'boundary marker', 'poor man', 'cohabitee', 'night sleep', 'protruding ears', 'three quarters of an hour', 'spermophilus', 'spermophilus stricto sensu', "devil's advocate", 'sacred king', 'sacral king', 'myr', 'million years', 'obtuse-angled', 'inconsolable', 'neurotic', 'humiliating', 'mortifying', 'theological', 'rematch', 'varıety', 'be short', 'ontological', 'taxonomic', 'taxonomical', 'toxicology testing', 'on the job training', 'boulder', 'unattackable', 'inviolable', 'resinous', 'resiny', 'ionizing radiation', 'citrus grove', 'comic book shop', 'preparatory measure', 'written account', 
'brittle', 'locker', 'baozi', 'bao', 'bau', 'humbow', 'nunu', 'bausak', 'pow', 'pau', 'yesteryear', 'fire drill', 'rotted', 'putto', 'overthrow', 'ankle monitor', 'somewhat stupid', 'a little stupid', 'semordnilap', 'pangram', 'emordnilap', 'person with a sunlamp tan', 'tittle', 'incompatible', 'autumn wind', 'dairyman', 'chesty', 'lacustrine', 'chronophotograph', 'chronophoto', 'leg lace', 'ankle lace', 'ankle lock', 'Babelfy', 'ventricular', 'recurrent', 'long-lasting', 'long-standing', 'long standing', 'sea bass', 'reap', 'break wind', 'chase away', 'spark', 'speckle', 'take back', 'Westphalian', 'Aeolic Greek', 'startup', 'abseiling', 'impure', 'bottle cork', 'paralympic', 'work out', 'might', 'ice-cream man', 'ice cream man', 'ice cream maker', 'ice-cream maker', 'traveling', 'special delivery', 'prizefighter', 'abs', 'ab', 'churro', 'pilfer', 'dehumanize', 'fertilize', 'inseminate', 'digitalize', 'fluke', 'stroke of luck', 'decontaminate', 'abandonware', 'manzanita', 'tule', 'jackrabbit', 'system administrator', 'system admin', 'springtime lethargy', 'Palatinean', 'organized religion', 'bearing puller', 'wheel puller', 'gear puller', 'shot', 'normalize', 'palindromic', 'lancet window', 'terminological', 'back of head', 'dragon food', 'barbel', 'Central American Spanish', 'basis', 'birthmark', 'blood vessel', 'ribes', 'dog-rose', 'dreadful', 'freckle', 'free of charge', 'weather verb', 'weather sentence', 'gipsy', 'gypsy', 'glutton', 'hump', 'low voice', 'meek', 'moist', 'river mouth', 'turbid', 'multitude', 'palate', 'peak of mountain', 'poetry', 'pure', 'scanty', 'spicy', 'spicey', 'spruce', 'surface', 'infected', 'copulate', 'dilute', 'dislocate', 'grow up', 'hew', 'hinder', 'infringe', 'inhabit', 'marry off', 'offend', 'pass by', 'brother of a man', 'brother of a woman', 'sister of a man', 'sister of a woman', 'agricultural farm', 'result in', 'rebel', 'strew', 'scatter', 'sway', 'tread', 'tremble', 'hog', 'circuit breaker', 'Southern Quechua', 'safety 
pin', 'baby pin', 'college student', 'university student', 'pinus sibirica', 'Siberian pine', 'have lunch', 'floppy', 'slack', 'sloppy', 'wishi-washi', 'turn around', 'bogeyman', 'selfish', 'Talossan', 'biomembrane', 'biological membrane', 'self-sufficiency', 'underevaluation', 'underestimation', 'opisthenar', 'prosody', 'Kumhar Bhag Paharia', 'psychoneurotic', 'psychoneurosis', 'levant', "couldn't-care-less attitude", 'noctambule', 'acid-free paper', 'decontaminant', 'woven', 'wheaten', 'waste-ridden', 'war-ridden', 'violence-ridden', 'unwritten', 'typewritten', 'spoken', 'abiogenetically', 'rasp', 'abstractly', 'cyclically', 'acyclically', 'acyclic', 'ad hoc', 'spare tire', 'spare wheel', 'spare tyre', 'prefabricated', 'ISO 9000', 'Barquisimeto', 'Maracay', 'Ciudad Guayana', 'San Cristobal', 'Barranquilla', 'Arequipa', 'Trujillo', 'Cusco', 'Callao', 'Cochabamba', 'Goiânia', 'Campinas', 'Fortaleza', 'Florianópolis', 'Rosario', 'Mendoza', 'Bariloche', 'temporality', 'papyrus sedge', 'paper reed', 'Indian matting plant', 'Nile grass', 'softly softly', 'abductive reasoning', 'abductive inference', 'retroduction', 'Salzburgian', 'cymotrichous', 'access point', 'wireless access point', 'dynamic DNS', 'IP address', 'electrolyte', 'helical', 'hydrometer', 'intranet', 'jumper', 'MAC address', 'Media Access Control address', 'nickel–cadmium battery', 'Ni-Cd battery', 'oscillograph', 'overload', 'photovoltaic', 'photovoltaic cell', 'refractor telescope', 'autosome', 'bacterial artificial chromosome', 'plasmid', 'nucleobase', 'base pair', 'base sequence', 'chromosomal deletion', 'deletion', 'deletion mutation', 'gene deletion', 'chromosomal inversion', 'comparative genomics', 'genomics', 'cytogenetics', 'DNA replication', 'DNA repair', 'DNA sequence', 'electrophoresis', 'functional genomics', 'retroviral', 'retroviral infection', 'acceptance criteria', 'batch processing', 'business rule', 'code review', 'configuration management', 'entity–relationship model', 'lifecycle', 
'object code', 'prototyping', 'pseudocode', 'referential', 'reusability', 'self-join', 'timestamp', 'accredited', 'accredited translator', 'certify', 'certified translation', 'computer-aided design', 'computer-aided', 'computer-assisted', 'management system', 'computer-aided translation', 'computer-assisted translation', 'machine-aided translation', 'conference interpreter', 'freelance translator', 'literal translation', 'mother-tongue', 'whispered interpreting', 'simultaneous interpreting', 'simultaneous interpretation', 'base anhydride', 'binary compound', 'absorber', 'absorption coefficient', 'attenuation coefficient', 'active solar heater', 'ampacity', 'amorphous semiconductor', 'amorphous silicon', 'flowerpot', 'antireflection coating', 'antireflection', 'armored cable', 'electric arc', 'breakdown voltage','casing', 'facing', 'lining', 'assumption of Mary', 'auscultation']
This is just an example; the full dictionary has many more items.
As I understand it you are trying to identify all possible matches for the jumbled string in your list. You could sort the letters in the jumbled word and match the resulting list against sorted lists of the words in your data file.
sorted_jumbled_word = sorted(a)
for word in val1:
    if len(sorted_jumbled_word) == len(word) and sorted(word) == sorted_jumbled_word:
        print(word)
Checking by length first reduces unnecessary sorting. If doing this repeatedly, you might want to create a dictionary of the words in the data file with their sorted versions, to avoid having to repeatedly sort them.
There are spaces and punctuation in some of the terms in your word list. If you want to make the comparison ignoring spaces then remove them from both the jumbled word and the list of unjumbled words, using e.g. word = word.replace(" ", "")
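The dictionary of sorted versions mentioned above could be built as follows. This is a minimal sketch, assuming spaces should be ignored and case should not matter (lowercasing is an extra assumption); signature, build_index and unjumble are hypothetical names, and the word list is a made-up sample rather than the real data.json.

```python
from collections import defaultdict


def signature(text):
    """Sorted lowercase letters of text, ignoring spaces."""
    return "".join(sorted(text.replace(" ", "").lower()))


def build_index(words):
    """Group words by signature so each lookup is a single dictionary access."""
    index = defaultdict(list)
    for word in words:
        index[signature(word)].append(word)
    return index


# Made-up sample; in the question this list would come from data.json.
words = ["hello", "reap", "pear", "good bye", "hole"]
index = build_index(words)


def unjumble(jumbled):
    """Return every word that uses exactly the same letters as jumbled."""
    return index.get(signature(jumbled), [])


print(unjumble("helol"))
print(unjumble("aepr"))
```

Because the signature keeps repeated letters, helol matches hello but not hole, which is the double-counting problem from the question. The index is built once, so repeated lookups do not sort the word list again.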

Scraping Address Information Using Selenium in Python

I am trying to scrape address information from https://www.smartystreets.com/products/single-address-iframe. I have a script that searches for the given address in its parameters. When I look at the website itself, one can see various fields like Carrier Route.
Using 3301 South Greenfield Rd Gilbert, AZ 85297 as a hypothetical example, when one goes to the page manually, one can see the Carrier Route: R109.
I am having trouble, however, finding the carrier route with Selenium in order to scrape it. Does anyone have recommendations for how to find the Carrier Route for any given address?
Starting code:
driver = webdriver.Chrome('chromedriver')
address = "3301 South Greenfield Rd Gilbert, AZ 85297\n"
url = 'https://www.smartystreets.com/products/single-address-iframe'
driver.get(url)
driver.find_element_by_id("lookup-select-button").click()
driver.find_element_by_id("lookup-select").find_element_by_id("address-freeform").click()
driver.find_element_by_id("freeform-address").send_keys(address)
# Find Carrier Route here
You can use driver.execute_script to provide input for the fields and to click the submission button:
from selenium import webdriver
d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://www.smartystreets.com/products/single-address-iframe')
s = '3301 South Greenfield Rd Gilbert, AZ 85297'
a, a1 = s.split(' Rd ')
route = d.execute_script(f'''
document.querySelector('#address-line1').value = '{a}'
document.querySelector('#city').value = '{(j:=a1.split())[0][:-1]}'
document.querySelector('#state').value = '{j[1]}'
document.querySelector('#zip-code').value = '{j[2]}'
document.querySelector('#submit-request').click()
return document.querySelector('#us-street-metadata li:nth-of-type(2) .answer.col-sm-5.col-xs-3').textContent
''')
Output:
'R109'
To get a full display of all the parameter data, you can use BeautifulSoup:
from bs4 import BeautifulSoup as soup
... #selenium driver source here
cols = soup(d.page_source, 'html.parser').select('#us-street-output div')
data = {i.h4.text:{b.select_one('span:nth-of-type(1)').get_text(strip=True)[:-1]:b.select_one('span:nth-of-type(2)').get_text(strip=True)
for b in i.select('ul li')} for i in cols}
print(data)
print(data['Metadata']['Congressional District'])
Output:
{'Metadata': {'Building Default': 'default', 'Carrier Route': 'R109', 'Congressional District': '05', 'Latitude': '33.291248', 'Longitude': '-111.737427', 'Coordinate Precision': 'Rooftop', 'County Name': 'Maricopa', 'County FIPS': '04013', 'eLOT Sequence': '0160', 'eLOT Sort': 'A', 'Observes DST': 'default', 'RDI': 'Commercial', 'Record Type': 'S', 'Time Zone': 'Mountain', 'ZIP Type': 'Standard'}, 'Analysis': {'Vacant': 'N', 'DPV Match Code': 'Y', 'DPV Footnotes': 'AABB', 'General Footnotes': 'L#', 'CMRA': 'N', 'EWS Match': 'default', 'LACSLink Code': 'default', 'LACSLink Indicator': 'default', 'SuiteLink Match': 'default', 'Enhanced Match': 'default'}, 'Components': {'Urbanization': 'default', 'Primary Number': '3301', 'Street Predirection': 'S', 'Street Name': 'Greenfield', 'Street Postdirection': 'default', 'Street Suffix': 'Rd', 'Secondary Designator': 'default', 'Secondary Number': 'default', 'Extra Secondary Designator': 'default', 'Extra Secondary Number': 'default', 'PMB Designator': 'default', 'PMB Number': 'default', 'City': 'Gilbert', 'Default City Name': 'Gilbert', 'State': 'AZ', 'ZIP Code': '85297', '+4 Code': '2176', 'Delivery Point': '01', 'Check Digit': '2'}}
'05'

Selenium only returns an empty list

I'm trying to scrape football team names from betfair.com, but no matter what I try, it returns an empty list. This is what I've tried most recently.
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome(r'C:\Users\Tom\Desktop\chromedriver\chromedriver.exe')
driver.get('https://www.betfair.com/exchange/plus/football')
team = driver.find_elements_by_xpath('//*[@id="main-wrapper"]/div/div[2]/div/ui-view/div/div/div/div/div[1]/div/div[1]/bf-super-coupon/main/ng-include[3]/section[1]/div[2]/bf-coupon-table/div/table/tbody/tr[1]/td[1]/a/event-line/section/ul[1]/li[1]')
print(team)
You should use WebDriverWait. Also, use a relative XPath rather than an absolute one. One more thing: you are using find_elements for a single element.
Here I'm printing all the teams:
from pprint import pprint
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(r"C:\Users\Tom\Desktop\chromedriver\chromedriver.exe")
driver.get('https://www.betfair.com/exchange/plus/football')
teams = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="main-wrapper"]//ul[@class="runners"]/li')))
print([i.text for i in teams])
Output:
['Everton',
'West Ham',
'Tottenham',
'Watford',
'Chelsea',
'Newcastle',
'Wolves',
'Southampton',
'Leicester',
'Burnley',
'Aston Villa',
'Brighton',
'Bournemouth',
'Norwich',
'Crystal Palace',
'Man City',
'Man Utd',
'Liverpool',
'Sheff Utd',
'Arsenal',
'Eintracht Frankfurt',
'Leverkusen',
'Werder Bremen',
'Hertha Berlin',
'Augsburg',
'Bayern Munich',
'Fortuna Dusseldorf',
'Mainz',
'RB Leipzig',
'Wolfsburg',
'Union Berlin',
'Freiburg',
'Dortmund',
'Mgladbach',
'FC Koln',
'Paderborn',
'Hoffenheim',
'Schalke 04',
'St Etienne',
'Lyon',
'Nice',
'Paris St-G',
'Lyon',
'Dijon',
'Reims',
'Montpellier',
'Nimes',
'Amiens',
'Toulouse',
'Lille',
'Metz',
'Nantes',
'Angers',
'Brest',
'Bordeaux',
'St Etienne',
'Monaco',
'Rennes',
'Houston Dynamo',
'LA Galaxy',
'Philadelphia',
'New York City',
'Atlanta Utd',
'New England',
'Seattle Sounders',
'Minnesota Utd',
'DC Utd',
'FC Cincinnati',
'Orlando City',
'Chicago Fire',
'Montreal Impact',
'New York Red Bulls',
'Toronto FC',
'Columbus',
'Los Angeles FC',
'Colorado',
'FC Dallas',
'Kansas City',
'Shakhtar',
'Dinamo Zagreb',
'Atletico Madrid',
'Leverkusen',
'Club Brugge',
'Paris St-G',
'Tottenham',
'Crvena Zvezda',
'Olympiakos',
'Bayern Munich',
'Man City',
'Atalanta',
'Galatasaray',
'Real Madrid',
'Juventus',
'Lokomotiv',
'Ajax',
'Chelsea',
'RB Leipzig',
'Zenit St Petersburg']
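Why the relative XPath is more robust can be seen without a browser. The sketch below uses the standard library's ElementTree, which supports a subset of XPath, on a made-up fragment; the markup is hypothetical and far simpler than the real betfair page, and only the runners class name is taken from the answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page nests these lists much deeper.
FRAGMENT = """
<div id="main-wrapper">
  <section>
    <table><tbody>
      <tr><td><ul class="runners"><li>Everton</li><li>West Ham</li></ul></td></tr>
      <tr><td><ul class="runners"><li>Tottenham</li><li>Watford</li></ul></td></tr>
    </tbody></table>
  </section>
</div>
"""

root = ET.fromstring(FRAGMENT)

# './/' matches at any depth, so the expression keeps working if wrapper
# elements are added or removed between the root and the lists.
teams = [li.text for li in root.findall(".//ul[@class='runners']/li")]
print(teams)
```

An absolute path would have to spell out every intermediate element (section, table, tbody, tr, td) and would stop matching as soon as one of them changed.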
