How can I loop through multiple pages to scrape table data (Python)

I'm struggling to find a way to loop through pages and scrape data from a table. I've managed to get the data from the first page, but I don't know how to go through each page and get its data. I've tried various bits of code but can't get anything to work. The site I'm trying to scrape adds &pageno=2 to the end of the URL and uses next buttons (rather than numbered buttons). Any help would be great.
My current code, which scrapes the first page successfully, is as follows:
import requests
import pprint
import csv
from bs4 import BeautifulSoup
from lxml import html
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
url = 'https://www.revcomps.com/past-entry-lists/?draw_chosen=2693823'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
table = soup.find('table', {'class':'ticket_results'})
data = [td.text for td in table.find_all('td')]
for table in soup.find_all('table', {'class':'ticket_results'}):
    data = [td.text for td in table.find_all('td')]
    pprint.pprint(data)

You can just put your requests into a loop over the page number. A Python f-string can be used to add the page variable into the URL:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
for page in range(1, 3):
    print(f"Page {page}")
    url = f'https://www.revcomps.com/past-entry-lists/?draw_chosen=2693823&pageno={page}'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    for table in soup.find_all('table', {'class':'ticket_results'}):
        data = [td.text for td in table.find_all('td')]
        print(data)
Giving you output starting:
Page 1
['Order', 'WINNING TICKET', 'Ticket', '2694340', 'Andrew Reynolds', '694', '2699224', 'Martin Lilge', '315', '2703986', 'Ricky Parton', '975']
['Order', 'Customer', 'Ticket', '2704184', 'Philip Stoyles', '001', '2700874', 'Timothy Powell', '002', '2696801', 'Steven Hill', '003', '2696301', 'Trevor Larken', '004', '2696387', 'george malone', '005', '2701735', 'Williams jonathan', '006', '2704193', 'Michael Worthington', '007', '2695573', 'Mike Bates', '008', '2695170', 'Debbie Gent', '009', '2699892', 'Edward Buffett', '010', '2694080', 'David Miller', '011', '2701554', 'Liz Coates', '012', '2694944', 'Amanda Demellweek', '013', '2695128', 'John Crowe', '014', '2698092', 'Jamie Houston', '015', '2703986', 'Ricky Parton', '016', '2700944', 'Tom Chant', '017', '2698687', 'Gary Young', '018', '2696026', 'Tritean Emanuel', '019', '2704117', 'Stephen Melekeowei', '020', '2700379', 'Darren Pearson', '021', '2696357', 'Kane Nicholas', '022', '2704062', 'Jessie Nellany', '023', '2700621', 'Nick Hart', '024', '2704879', 'Chris Maynard', '025', '2703091', 'Nils Omell', '026', '2702854', 'mr stephen elsley', '027', '2698997', 'Mark Skedgel-hill', '028', '2701558', 'Bradley King', '029', '2698372', 'simon miles', '030', '2694701', 'Gillian Chisnall', '031', '2701365', 'Sarah Tingle-kitchen', '032', '2694591', 'Robert Townsend', '033', '2695077', 'Glen Davies', '034', '2695177', 'Wayne Cummings', '035', '2701899', 'Ross Hay', '036', '2703464', 'Shaun Raynsford', '037', '2704149', 'K Oszlanczi', '038', '2703566', 'Daniel Fuller', '039', '2699263', 'Adam Torok', '040', '2700621', 'Nick Hart', '041', '2703279', 'Omar Shawesh', '042', '2699452', 'Mark Widger', '043', '2695848', 'Zoey Longley', '044', '2703704', 'Daniel Hyndman', '045', '2696997', 'Paul Daniel', '046', '2694506', 'Mark Thompson', '047', '2699460', 'Martin Buckingham', '048', '2695186', 'Matt Beavis', '049', '2701503', 'Craig Driscoll', '050', '2699318', 'Alan Parker', '051', '2699729', 'Stephen Minnikin', '052', '2695573', 'Mike Bates', '053', '2698438', 'Andrew Kosinski', '054', '2698679', 'Carly Mason', '055', '2702121', 'Mark Adams', '056', '2698613', 
'Neil Gunn', '057', '2704149', 'K Oszlanczi', '058', '2699109', 'Steve Bowen', '059', '2702108', 'Thomas Martin', '060', '2696482', 'mr stephen elsley', '061', '2696813', 'Nigel Scott', '062', '2701394', 'Chris Brown', '063', '2698459', 'Gordon Bickerton', '064', '2700546', 'jon tribbeck', '065', '2702492', 'Mark Bentley', '066', '2704155', 'Ryan Stephens', '067', '2694831', 'David Godfrey', '068', '2695671', 'Lee Smith', '069', '2695066', 'Kristian Howells', '070', '2694225', 'Simon Costello', '071', '2695186', 'Matt Beavis', '072', '2699947', 'Anthony Abbey', '073', '2701845', 'Paul Quarterman', '074', '2695573', 'Mike Bates', '075', '2701618', 'Adam Kimber', '076', '2704433', 'John Hayes', '077', '2699484', 'Jamie Brookes', '078', '2695587', 'Richard Hurst', '079', '2696301', 'Trevor Larken', '080', '2698200', 'Ewa Krzyszkowska', '081', '2698023', 'Jason Reed', '082', '2702455', 'Simon Harrington', '083', '2694869', 'Mike Bates', '084', '2703644', 'Jason Gow', '085', '2700989', 'A J Freeman', '086', '2696784', 'Adam Timberlake', '087', '2701447', 'Lewis Middleton', '088', '2701236', 'Scot Beall', '089', '2695477', 'Simon Farrow', '090', '2697197', 'Marcus du Preez', '091', '2697115', 'Roderick Evans', '092', '2700621', 'Nick Hart', '093', '2701231', 'Norma Matheson', '094', '2695587', 'Richard Hurst', '095', '2702017', 'Michael Richardson', '096', '2703702', 'Dean Brain', '097', '2699907', 'Lee Murray', '098', '2694583', 'Kasia Krzyzak', '099', '2700048', 'terri palmer', '100', '2699499', 'Simon Hack', '101', '2694206', 'Graeme Allister', '102', '2700158', 'Melissa Dedman', '103', '2699262', 'Romans Zolovs', '104', '2694125', 'Jonathan Byrne', '105', '2702812', 'Nicola McLaughlin', '106', '2704152', 'Howard Pearson', '107', '2696432', 'Zac Sirrell', '108', '2696474', "Luke Davies-O'Grady", '109', '2699367', 'Charles Mulinder', '110', '2701365', 'Sarah Tingle-kitchen', '111', '2703659', 'Andrew Fenton', '112', '2695167', 'Roy heer', '113', '2698200', 'Ewa 
Krzyszkowska', '114', '2697494', 'Steve Nightingale', '115', '2698916', 'Dale Hodges', '116', '2695502', 'G G hearn', '117', '2699776', 'Antiny Swift', '118', '2704778', 'MARK SHAKESBY', '119', '2698200', 'Ewa Krzyszkowska', '120', '2694027', 'Paul Otway', '121', '2700621', 'Nick Hart', '122', '2695847', 'Gavin Holmes', '123', '2699915', 'Torquil Stupart', '124', '2703807', 'Andrew Telfer', '125', '2699931', 'Lloyd Reed', '126', '2700991', 'Clare Brown', '127', '2699914', 'Luke Twivey', '128', '2699308', 'MIKE SPENCER', '129', '2698885', 'dave wills', '130', '2695933', 'Regan Thacker', '131', '2696301', 'Trevor Larken', '132', '2698960', 'Adam Hamada', '133', '2699566', 'Action Fighter', '134', '2703704', 'Daniel Hyndman', '135', '2702652', 'Sarah Brooke', '136', '2694305', 'Scott Knowles', '137', '2700635', 'Jasen Swann', '138', '2696301', 'Trevor Larken', '139', '2694831', 'David Godfrey', '140', '2694174', 'Silviu Dan', '141', '2704446', 'Alan Ball', '142', '2699026', 'Adam Gillett', '143', '2699916', 'Dillon Graham', '144', '2698613', 'Neil Gunn', '145', '2697494', 'Steve Nightingale', '146', '2696380', 'Danny James Pearson', '147', '2700010', 'Peter Ede-Morley', '148', '2704731', 'Simon Wise', '149', '2694056', 'Joel Binns', '150']
Page 2
['Order', 'WINNING TICKET', 'Ticket', '2694340', 'Andrew Reynolds', '694', '2699224', 'Martin Lilge', '315', '2703986', 'Ricky Parton', '975']
['Order', 'Customer', 'Ticket', '2694305', 'Scott Knowles', '151', '2694171', 'Mariusz Karczewski', '152', '2704983', 'Jonathan Hill', '153', '2696473', 'claudia stefanoaia', '154', '2694111', 'David Robinson', '155', '2696301', 'Trevor Larken', '156', '2696270', 'Stuart Bowater', '157', '2699819', 'Ben Funnell', '158', '2703237', 'Mark Lund', '159', '2702804', 'Iain Wallace', '160', '2694206', 'Graeme Allister', '161', '2703060', 'Mark Maskell', '162', '2699308', 'MIKE SPENCER', '163', '2700589', 'Aidan McGilligan', '164', '2698428', 'Benjamin Melsome', '165', '2701686', 'Mariusz Karczewski', '166', '2694121', 'Joseph Woodard', '167', '2700989', 'A J Freeman', '168', '2699109', 'Steve Bowen', '169', '2704382', 'Keith Groundwater', '170', '2700144', 'Carl Marshall', '171', '2698017', 'Geoff Hall', '172', '2704941', 'Graham Riley', '173', '2697494', 'Steve Nightingale', '174', '2697796', 'Gary Leech', '175', '2699229', 'Karl Anson', '176', '2702100', 'Gary Plaskett', '177', '2694826', 'Rayminther Singh', '178', '2702394', 'Rebecca Smith', '179', '2694149', 'Martin Yates', '180', '2700860', 'Katie West', '181', '2695412', 'Daniel Payne', '182', '2695412', 'Daniel Payne', '183', '2699052', 'Ryan Stephens', '184', '2699136', 'Kevin Oliver', '185', '2696124', 'Lee Beesley', '186', '2695997', 'Matthew Prowse', '187', '2704493', 'Mrs P E Cranwell-Hayes', '188', '2701735', 'Williams jonathan', '189', '2699013', 'Charley Isaacs', '190', '2696452', 'Caroline Calver', '191', '2703014', 'Ryan Stephens', '192', '2699776', 'Antiny Swift', '193', '2694206', 'Graeme Allister', '194', '2702649', 'Jason Mcknight', '195', '2701415', 'Daniella Murphy', '196', '2694225', 'Simon Costello', '197', '2702685', 'Chris Firth', '198', '2701445', 'Ashlyn Adams', '199', '2694305', 'Scott Knowles', '200', '2694305', 'Scott Knowles', '201', '2695587', 'Richard Hurst', '202', '2694992', 'Dave Tomley', '203', '2694296', 'Rob Thornton', '204', '2699275', 'barry venn', '205', '2701234', 'Ben 
Cassidy', '206', '2699460', 'Martin Buckingham', '207', '2697494', 'Steve Nightingale', '208', '2694206', 'Graeme Allister', '209', '2697361', 'Nathan Bambury', '210', '2703464', 'Shaun Raynsford', '211', '2694471', 'lewis ballantyne', '212', '2694831', 'David Godfrey', '213', '2699627', 'Ross Fulton', '214', '2700449', 'Josh Hill', '215', '2695609', 'Will Badman', '216', '2698885', 'dave wills', '217', '2700989', 'A J Freeman', '218', '2694953', 'Mark Thomas', '219', '2700184', 'steven bennetts', '220', '2699109', 'Steve Bowen', '221', '2694305', 'Scott Knowles', '222', '2701572', 'Ethne Gambrill-Jarman', '223', '2694944', 'Amanda Demellweek', '224', '2698549', 'Mr R Bennett', '225', '2704463', 'Chris Beckett', '226', '2694608', 'Ryan Stephens', '227', '2700637', 'Andrew Mckimm', '228', '2694346', 'Will Stanyard', '229', '2699109', 'Steve Bowen', '230', '2701735', 'Williams jonathan', '231', '2701554', 'Liz Coates', '232', '2694818', 'Matt Dawe', '233', '2694372', 'Richard Lindsay', '234', '2699148', 'Grant Sivewright', '235', '2704556', 'Dale Warren', '236', '2694080', 'David Miller', '237', '2701266', 'Russell Miller', '238', '2694171', 'Mariusz Karczewski', '239', '2701647', 'Peter Renshaw', '240', '2699252', 'Nicola Haigh', '241', '2695609', 'Will Badman', '242', '2702654', 'I Petkuns', '243', '2698634', 'Gay Pieters', '244', '2701286', 'timothy cozens', '245', '2697830', 'Kevin Teasdale', '246', '2695046', 'Dan Christian Buentipo Palos', '247', '2694304', 'Gary Faulkner', '248', '2702737', 'Michael Welch', '249', '2704123', 'Paul Mcdermott', '250', '2696161', 'Jono Carter', '251', '2695871', 'Cameron Davidson', '252', '2704384', 'Lauren Redhead', '253', '2694414', 'Elaine Hills', '254', '2700798', 'Mathew Pierce', '255', '2704839', 'Danny Irvine', '256', '2704790', 'Gary Perry', '257', '2694056', 'Joel Binns', '258', '2694346', 'Will Stanyard', '259', '2700243', 'Scott Gourlay', '260', '2694206', 'Graeme Allister', '261', '2699263', 'Adam Torok', '262', 
'2695077', 'Glen Davies', '263', '2699109', 'Steve Bowen', '264', '2695149', 'Martin Wheeler', '265', '2697877', 'Rob Poundall', '266', '2697906', 'Mike Finn', '267', '2698068', 'Miguel Pacheco', '268', '2701176', 'alex mitchell', '269', '2700998', 'Antonio Domingo', '270', '2697049', 'James Skinner', '271', '2701415', 'Daniella Murphy', '272', '2698886', 'Julie Neill', '273', '2696260', 'John Doody', '274', '2696301', 'Trevor Larken', '275', '2694831', 'David Godfrey', '276', '2703702', 'Dean Brain', '277', '2702017', 'Michael Richardson', '278', '2697361', 'Nathan Bambury', '279', '2699938', 'Charlotte Jukes', '280', '2695350', 'Paul Fieldhouse', '281', '2702350', 'Barry Little', '282', '2694849', 'Matthew Riddell', '283', '2695592', 'Robert Harvey', '284', '2703363', 'Mason BURKINSHAW', '285', '2698579', 'Louise Davies', '286', '2696694', 'Stewart Smith', '287', '2704522', 'Adam Gillett', '288', '2701236', 'Scot Beall', '289', '2696784', 'Adam Timberlake', '290', '2704628', 'Lee Heginbotham', '291', '2699389', 'Lucy Donovan', '292', '2702673', 'James Jackson', '293', '2700232', 'raysean wharton', '294', '2699109', 'Steve Bowen', '295', '2699451', 'Winai Mays', '296', '2702364', 'Graham Williams', '297', '2695368', 'Daniel Moore', '298', '2703678', 'Ian Smith', '299', '2694027', 'Paul Otway', '300']
You should look into what happens when the end of the table is reached and test for that.
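One way to act on that, as a rough sketch only: how the site behaves past the last page is not shown here, so this loop assumes it either returns no rows or keeps serving the last page again, and stops in both cases. The names `scrape_all_pages` and `fetch_rows` are hypothetical; `fetch_rows(page)` stands for the requests/BeautifulSoup code above wrapped in a function that returns the cell list for one page. A stand-in is used here so the loop can be tried without hitting the site.

```python
def scrape_all_pages(fetch_rows, max_pages=1000):
    """Collect rows page by page until a page is empty or repeats the previous one."""
    all_rows = []
    previous = None
    for page in range(1, max_pages + 1):
        rows = fetch_rows(page)
        # Assumed end-of-data signals: an empty page, or the same page served again.
        if not rows or rows == previous:
            break
        all_rows.extend(rows)
        previous = rows
    return all_rows


# Stand-in for the real site (made-up data): page 3 onward repeats page 2.
fake_pages = {1: ['001', '002'], 2: ['003', '004']}

def fake_fetch(page):
    return fake_pages.get(page, fake_pages[2])

all_rows = scrape_all_pages(fake_fetch)
print(all_rows)
```

The `max_pages` cap is only a safety net in case neither stop condition ever triggers.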

Related

Creating undirected unweighted graph from dictionary containing neighborhood relationship

I have a Python dictionary that looks like this:
{'Aitkin': ['Carlton', 'Cass', 'Crow Wing', 'Itasca',
'Kanabec', 'Mille Lacs', 'Pine', 'St. Louis'], 'Anoka':
['Chisago', 'Hennepin', 'Isanti', 'Ramsey', 'Sherburne',
'Washington'], 'Becker': ['Clay', 'Clearwater', 'Hubbard',
'Mahnomen', 'Norman', 'Otter Tail', 'Wadena'], 'Beltrami':
['Cass', 'Clearwater', 'Hubbard', 'Itasca', 'Koochiching',
'Lake of the Woods', 'Marshall', 'Pennington', 'Roseau'],
'Benton': ['Mille Lacs', 'Morrison', 'Sherburne', 'Stearns'], 'Big
Stone': ['Lac qui Parle', 'Stevens', 'Swift', 'Traverse'], 'Blue
Earth': ['Brown', 'Faribault', 'Le Sueur', 'Martin',
'Nicollet', 'Waseca', 'Watonwan'], 'Brown': ['Blue Earth',
'Cottonwood', 'Nicollet', 'Redwood', 'Renville', 'Watonwan'],
'Carlton': ['Aitkin', 'Pine', 'St. Louis'], 'Carver': ['Hennepin',
'McLeod', 'Scott', 'Sibley', 'Wright'], 'Cass': ['Aitkin',
'Beltrami', 'Crow Wing', 'Hubbard', 'Itasca', 'Morrison',
'Todd', 'Wadena'], 'Chippewa': ['Kandiyohi', 'Lac qui Parle',
'Renville', 'Swift', 'Yellow Medicine'], 'Chisago': ['Anoka',
'Isanti', 'Kanabec', 'Pine', 'Washington'], 'Clay': ['Becker',
'Norman', 'Otter Tail', 'Wilkin'], 'Clearwater': ['Becker',
'Beltrami', 'Hubbard', 'Mahnomen', 'Pennington', 'Polk'],
'Cook': ['Lake'], 'Cottonwood': ['Brown', 'Jackson', 'Murray',
'Nobles', 'Redwood', 'Watonwan'], 'Crow Wing': ['Aitkin', 'Cass',
'Mille Lacs', 'Morrison'], 'Dakota': ['Goodhue', 'Hennepin',
'Ramsey', 'Rice', 'Scott', 'Washington'], 'Dodge': ['Goodhue',
'Mower', 'Olmsted', 'Rice', 'Steele'], 'Douglas': ['Grant', 'Otter
Tail', 'Pope', 'Stearns', 'Stevens', 'Todd'], 'Faribault': ['Blue
Earth', 'Freeborn', 'Martin', 'Waseca'], 'Fillmore': ['Houston',
'Mower', 'Olmsted', 'Winona'], 'Freeborn': ['Faribault', 'Mower',
'Steele', 'Waseca'], 'Goodhue': ['Dakota', 'Dodge', 'Olmsted',
'Rice', 'Wabasha'], 'Grant': ['Douglas', 'Otter Tail', 'Pope',
'Stevens', 'Traverse', 'Wilkin'], 'Hennepin': ['Anoka', 'Carver',
'Dakota', 'Ramsey', 'Scott', 'Sherburne', 'Wright'],
'Houston': ['Fillmore', 'Winona'], 'Hubbard': ['Becker', 'Beltrami',
'Cass', 'Clearwater', 'Wadena'], 'Isanti': ['Anoka', 'Chisago',
'Kanabec', 'Mille Lacs', 'Pine', 'Sherburne'], 'Itasca': ['Aitkin',
'Beltrami', 'Cass', 'Koochiching', 'St. Louis'], 'Jackson':
['Cottonwood', 'Martin', 'Nobles', 'Watonwan'], 'Kanabec': ['Aitkin',
'Chisago', 'Isanti', 'Mille Lacs', 'Pine'], 'Kandiyohi': ['Chippewa',
'Meeker', 'Pope', 'Renville', 'Stearns', 'Swift'], 'Kittson':
['Marshall', 'Roseau'], 'Koochiching': ['Beltrami', 'Itasca', 'Lake
of the Woods', 'St. Louis'], 'Lac qui Parle': ['Big Stone',
'Chippewa', 'Swift', 'Yellow Medicine'], 'Lake': ['Cook', 'St.
Louis'], 'Lake of the Woods': ['Beltrami', 'Koochiching', 'Roseau'],
'Le Sueur': ['Blue Earth', 'Nicollet', 'Rice', 'Scott', 'Sibley',
'Waseca'], 'Lincoln': ['Lyon', 'Pipestone', 'Yellow Medicine'],
'Lyon': ['Lincoln', 'Murray', 'Pipestone', 'Redwood', 'Yellow
Medicine'], 'Mahnomen': ['Becker', 'Clearwater', 'Norman', 'Polk'],
'Marshall': ['Beltrami', 'Kittson', 'Pennington', 'Polk', 'Roseau'],
'Martin': ['Blue Earth', 'Faribault', 'Jackson', 'Watonwan'],
'McLeod': ['Carver', 'Meeker', 'Renville', 'Sibley', 'Wright'],
'Meeker': ['Kandiyohi', 'McLeod', 'Renville', 'Stearns', 'Wright'],
'Mille Lacs': ['Aitkin', 'Benton', 'Crow Wing', 'Isanti',
'Kanabec', 'Morrison', 'Sherburne'], 'Morrison': ['Benton',
'Cass', 'Crow Wing', 'Mille Lacs', 'Stearns', 'Todd'], 'Mower':
['Dodge', 'Fillmore', 'Freeborn', 'Olmsted', 'Steele'], 'Murray':
['Cottonwood', 'Lyon', 'Nobles', 'Pipestone', 'Redwood', 'Rock'],
'Nicollet': ['Blue Earth', 'Brown', 'Le Sueur', 'Renville', 'Sibley'],
'Nobles': ['Cottonwood', 'Jackson', 'Murray', 'Rock'], 'Norman':
['Becker', 'Clay', 'Mahnomen', 'Polk'], 'Olmsted': ['Dodge',
'Fillmore', 'Goodhue', 'Mower', 'Wabasha', 'Winona'], 'Otter Tail':
['Becker', 'Clay', 'Douglas', 'Grant', 'Wadena', 'Wilkin'],
'Pennington': ['Beltrami', 'Clearwater', 'Marshall', 'Polk', 'Red
Lake'], 'Pine': ['Aitkin', 'Carlton', 'Chisago', 'Isanti',
'Kanabec'], 'Pipestone': ['Lincoln', 'Lyon', 'Murray', 'Rock'],
'Polk': ['Clearwater', 'Mahnomen', 'Marshall', 'Norman',
'Pennington', 'Red Lake'], 'Pope': ['Douglas', 'Grant',
'Kandiyohi', 'Stearns', 'Stevens', 'Swift'], 'Ramsey': ['Anoka',
'Dakota', 'Hennepin', 'Washington'], 'Red Lake': ['Pennington',
'Polk'], 'Redwood': ['Brown', 'Cottonwood', 'Lyon', 'Murray',
'Renville', 'Yellow Medicine'], 'Renville': ['Brown', 'Chippewa',
'Kandiyohi', 'McLeod', 'Meeker', 'Nicollet', 'Redwood',
'Sibley', 'Yellow Medicine'], 'Rice': ['Dakota', 'Dodge',
'Goodhue', 'Le Sueur', 'Scott', 'Steele', 'Waseca'], 'Rock':
['Murray', 'Nobles', 'Pipestone'], 'Roseau': ['Beltrami', 'Kittson',
'Lake of the Woods', 'Marshall'], 'Scott': ['Carver', 'Dakota',
'Hennepin', 'Le Sueur', 'Rice', 'Sibley'], 'Sherburne': ['Anoka',
'Benton', 'Hennepin', 'Isanti', 'Mille Lacs', 'Stearns',
'Wright'], 'Sibley': ['Carver', 'Le Sueur', 'McLeod', 'Nicollet',
'Renville', 'Scott'], 'St. Louis': ['Aitkin', 'Carlton', 'Itasca',
'Koochiching', 'Lake'], 'Stearns': ['Benton', 'Douglas',
'Kandiyohi', 'Meeker', 'Morrison', 'Pope', 'Sherburne',
'Todd', 'Wright'], 'Steele': ['Dodge', 'Freeborn', 'Mower', 'Rice',
'Waseca'], 'Stevens': ['Big Stone', 'Douglas', 'Grant', 'Pope',
'Swift', 'Traverse'], 'Swift': ['Big Stone', 'Chippewa',
'Kandiyohi', 'Lac qui Parle', 'Pope', 'Stevens'], 'Todd':
['Cass', 'Douglas', 'Morrison', 'Otter Tail', 'Stearns', 'Wadena'],
'Traverse': ['Big Stone', 'Grant', 'Stevens', 'Wilkin'], 'Wabasha':
['Goodhue', 'Olmsted', 'Winona'], 'Wadena': ['Becker', 'Cass',
'Hubbard', 'Otter Tail', 'Todd'], 'Waseca': ['Blue Earth',
'Faribault', 'Freeborn', 'Le Sueur', 'Rice', 'Steele'],
'Washington': ['Anoka', 'Chisago', 'Dakota', 'Ramsey'], 'Watonwan':
['Blue Earth', 'Brown', 'Cottonwood', 'Jackson', 'Martin'], 'Wilkin':
['Clay', 'Grant', 'Otter Tail', 'Traverse'], 'Winona': ['Fillmore',
'Houston', 'Olmsted', 'Wabasha'], 'Wright': ['Carver', 'Hennepin',
'McLeod', 'Meeker', 'Sherburne', 'Stearns'], 'Yellow Medicine':
['Chippewa', 'Lac qui Parle', 'Lincoln', 'Lyon', 'Redwood',
'Renville']}
The keys in the dictionary represent the nodes, while the values (lists) hold the neighbors of each key. This is an undirected, unweighted graph.
Is there some function to implement this in networkx or some network analysis or graph-related libraries?

You can use the nx.Graph constructor -- it will add additional nodes if they don't appear as keys in the original dictionary (data represents the dictionary in the original question):
import networkx as nx
graph = nx.Graph(data)
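To make concrete what "undirected" means for a dictionary like this, here is a plain-Python sketch. It is not how networkx is implemented; `undirected_edges` is a hypothetical helper and `sample` is a three-county excerpt of the data. Each pair is stored once regardless of which side listed it, and a node that appears only as a neighbor still ends up in the node set.

```python
def undirected_edges(neighbors):
    """Return the set of undirected edges implied by a dict of neighbor lists."""
    edges = set()
    for node, adjacent in neighbors.items():
        for other in adjacent:
            # frozenset makes ('Cook', 'Lake') and ('Lake', 'Cook') the same edge.
            edges.add(frozenset((node, other)))
    return edges


sample = {'Cook': ['Lake'], 'Lake': ['Cook', 'St. Louis']}
edges = undirected_edges(sample)
nodes = set().union(*edges)
print(len(edges), sorted(nodes))
```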

How to select PID values with regular expressions?

#!/usr/bin/env python3.7
import subprocess
import re
import os
def main():
    output=subprocess.check_output(["ps","aux"])
    output=output.decode()
    print(output)
if __name__=="__main__":
    main()
I am trying to extract all PID values and put them in a separate list, but I am unable to extract them.

"to extract all PID values and put them in a separate list"
To extract only the PID numbers, change the ps command to use a user-defined format
(-o format) that limits the output fields.
import subprocess
import os
def main():
    output = subprocess.check_output(["ps", "ax", "-o", "pid", "--no-headers"])
    pids = output.decode().split()
    print(pids)
if __name__=="__main__":
    main()
Sample output:
['1', '2', '3', '4', '6', '8', '9', '10', '11', '12', '13', '14', '16', '17',
'18', '19', '20', '21', '23', '24', '25', '26', '27', '28', '30', '31', '32',
'33', '34', '35', '37', '38', '39', '40', '41', '42', '44', '45', '46', '47',
'48', '49', '51', '52', '53', '54', '55', '56', '58', '59', '60', '61', '62',
'63', '65', '66', '67', '68', '69', '70', '72', '73', '74', '75', '76', '77',
'79', '80', '81', '82', '83', '84', '86', '87', '88', '89', '90', '91', '93',
'94', '95', '96', '97', '100', '101', '102', '103', '104', '105', '193', '194',
'195', '199', '200', '202', '205', '206', '209', '210', '211', '212', '213',
'214', '220', '231', '248', '287', '288', '289', '290', '291', '296', '297',
'300', '307', '314', '315', '321', '324', '326', '328', '341', '344', '347',
'348', '357', '361', '362', '363', '366', '432', '483', '488', '494', '516',
'517', '518', '519', '520', '521', '522', '523', '524', '525', '526', '527',
'528', '529', '604', '620', '621', '624', '625', '627', '636', '637', '650',
'651', '743', '744', '752', '753', '770', '771', '785', '786', '791', '792',
'793', '794', '795', '796', '797', '798', '829', '838', '848', '853', '854',
'855', '856', '857', '858', '859', '860', '865', '896', '900', '901', '911',
'912', '921', '936', '937', '940', '944', '960', '964', '968', '970', '975',
'984', '989', '991', '995', '999', '1001', '1016', '1025', '1030', '1033',
'1034', '1036', '1038', '1050', '1059', '1067', '1071', '1078', '1095', '1098',
'1104', '1110', '1112', '1117', '1122', '1131', '1132', '1152', '1157', '1163',
'1169', '1175', '1181', '1191', '1201', '1204', '1210', '1218', '1225', '1250',
'1258', '1261', '1288', '1289', '1290', '1291', '1292', '1293', '1294', '1295',
'1296', '1297', '1298', '1300', '1327', '1334', '1339', '1346', '1395', '1436',
'1444', '1469', '1682', '1687', '1689', '1701', '1715', '1727', '1751', '1771',
'1797', '1837', '1900', '1902', '1992', '2025', '2075', '2307', '2492', '2801',
'2842', '2911', '3404', '3870', '3871', '3874', '4086', '4195', '5217', '5249',
'5745', '5762', '5773', '5803', '5808', '5809', '5812', '5813', '5816', '5836',
'5841', '6008', '6073', '6087', '6104', '6605', '7934', '8127', '8663',
'10274', '10862', '12317', '12428', '12605', '12622', '12650', '12676',
'12677', '12756', '12904', '13242', '13609', '14722', '14812', '15367',
'15409', '15522', '15536', '15839', '15859', '16087', '16152', '16303',
'16386', '16387']
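Since the question asks about regular expressions, here is a sketch of that route as well. It assumes the usual `ps aux` layout, where the PID is the second whitespace-separated column. `extract_pids` is a hypothetical helper, and the sample text is made up rather than real `ps` output.

```python
import re

def extract_pids(ps_aux_output):
    """Return the PID column of `ps aux`-style text as a list of strings."""
    # Each data line starts with the user name, then whitespace, then the PID.
    # The header line does not match because "PID" is not made of digits.
    return re.findall(r'^\S+\s+(\d+)', ps_aux_output, flags=re.MULTILINE)


sample = (
    "USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND\n"
    "root         1  0.0  0.1   1000   500 ?        Ss   10:00   0:01 /sbin/init\n"
    "alice     4321  0.5  1.2   2000   900 pts/0    S+   10:05   0:00 python3 app.py 99\n"
)
pids = extract_pids(sample)
print(pids)
```

Anchoring the pattern at the start of each line keeps numbers in the command column (like the trailing `99`) out of the result.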

Return a list of similar authors

I'm trying to write a function that will return a list of items from a key from a key (if that makes sense). For example, here's a dictionary of authors, and similar authors.
authors = {
'Ray Bradbury': ['Harlan Ellison', 'Robert Heinlein', 'Isaac Asimov', 'Arthur Clarke'],
'Harlan Ellison': ['Neil Stephenson', 'Kurt Vonnegut', 'Richard Morgan', 'Douglas Adams'],
'Kurt Vonnegut': ['Terry Pratchett', 'Tom Robbins', 'Douglas Adams', 'Neil Stephenson', 'Jeff Vandemeer'],
'Thomas Pynchon': ['Isaac Asimov', 'Jorges Borges', 'Robert Heinlein'],
'Isaac Asimov': ['Stephen Baxter', 'Ray Bradbury', 'Arthur Clarke', 'Kurt Vonnegut', 'Neil Stephenson'],
'Douglas Adams': ['Terry Pratchett', 'Chris Moore', 'Kurt Vonnegut']
}
And the function I came up with is this:
def get_similar(author_list, author):
    for item in author_list[author]:
        return author_list[author]
Which only returns the items for the first key. I'd like it to return all of the similar authors, like this:
get_similar(authors, 'Harlan Ellison')
['Terry Pratchett', 'Tom Robbins', 'Douglas Adams', 'Neil Stephenson',
'Jeff Vandemeer','Terry Pratchett', 'Chris Moore', 'Kurt Vonnegut']
Where it finds the key given (author), looks at the items listed for that key, and then returns those keys' items. In this case Harlan Ellison has four authors listed - Neil Stephenson, Kurt Vonnegut, Richard Morgan, and Douglas Adams. The function then looks up those authors, and returns the items listed for them - Kurt Vonnegut returns Terry Pratchett, Tom Robbins, Douglas Adams, Neil Stephenson, and Jeff Vandemeer, and Douglas Adams returns Terry Pratchett, Chris Moore, and Kurt Vonnegut.
Duplicates are fine, and I'd like it in alphabetical order (I assume you could just use a sort command at the end). Any help would be much appreciated, I'm stumped!

I think this is what you are looking for. Hopefully it gets you going.
authors = {'Ray Bradbury': ['Harlan Ellison', 'Robert Heinlein', 'Isaac Asimov', 'Arthur Clarke'], 'Harlan Ellison': ['Neil Stephenson', 'Kurt Vonnegut', 'Richard Morgan', 'Douglas Adams'], 'Kurt Vonnegut': ['Terry Pratchett', 'Tom Robbins', 'Douglas Adams', 'Neil Stephenson', 'Jeff Vandemeer'], 'Thomas Pynchon': ['Isaac Asimov', 'Jorges Borges', 'Robert Heinlein'], 'Isaac Asimov': ['Stephen Baxter', 'Ray Bradbury', 'Arthur Clarke', 'Kurt Vonnegut', 'Neil Stephenson'], 'Douglas Adams': ['Terry Pratchett', 'Chris Moore', 'Kurt Vonnegut']}
def get_similar(authors, author):
    retVal = []
    for k, v in authors.items():
        if k == author:
            for value in v:
                retVal.append(value)
                if value in authors:
                    for v2 in authors[value]:
                        retVal.append(v2)
    return sorted(retVal)
get_similar(authors, "Harlan Ellison") returns
['Chris Moore',
'Douglas Adams',
'Douglas Adams',
'Jeff Vandemeer',
'Kurt Vonnegut',
'Kurt Vonnegut',
'Neil Stephenson',
'Neil Stephenson',
'Richard Morgan',
'Terry Pratchett',
'Terry Pratchett',
'Tom Robbins']
I'll leave it to you to figure out how to remove the duplicates.

You are very close but instead of returning after finding the first list of similar authors, you should store all of the authors you find in a list and then return them all after your for loop has finished:
def get_similar(author_list, author):
    similar_authors = []
    for item in author_list[author]:
        if item in author_list:
            similar_authors.extend(author_list[item])
    return similar_authors
Notice that I also added an if statement to make sure that the item is in fact one of the keys in your dictionary so you don't get an error later on (for example: 'Neil Stephenson' is in the dictionary as a member of one of the values but is not a key).
EXTRA INFO:
(if you are interested)
Another option is to turn your function into a generator instead. This has the advantage of not having to store all the similar authors in a list and instead yields each author as it is found:
def get_similar2(author_list, author):
    for item in author_list[author]:
        if item in author_list:
            for other_author in author_list[item]:
                yield other_author
Or, if you are using Python 3.3+, you can simplify this a bit by using the yield from expression to get functionally the same code as in get_similar2:
def get_similar3(author_list, author):
    for item in author_list[author]:
        if item in author_list:
            yield from author_list[item]
All three of the functions/generators above will give you the same results (just remember to get all the values yielded from the generators):
print(get_similar(authors, 'Harlan Ellison'))
['Terry Pratchett', 'Tom Robbins', 'Douglas Adams', 'Neil Stephenson', 'Jeff Vandemeer', 'Terry Pratchett', 'Chris Moore', 'Kurt Vonnegut']
print(list(get_similar2(authors, 'Harlan Ellison')))
['Terry Pratchett', 'Tom Robbins', 'Douglas Adams', 'Neil Stephenson', 'Jeff Vandemeer', 'Terry Pratchett', 'Chris Moore', 'Kurt Vonnegut']
print(list(get_similar3(authors, 'Harlan Ellison')))
['Terry Pratchett', 'Tom Robbins', 'Douglas Adams', 'Neil Stephenson', 'Jeff Vandemeer', 'Terry Pratchett', 'Chris Moore', 'Kurt Vonnegut']

Here's a simple solution using a set and list comprehension:
def get_similar(author_list, author):
    similar = set(author_list.get(author, []))
    similar.update(*[author_list.get(item, []) for item in similar])
    return sorted(similar)
get_similar(authors, 'Harlan Ellison')
Output:
['Chris Moore', 'Douglas Adams', 'Jeff Vandemeer', 'Kurt Vonnegut',
'Neil Stephenson', 'Richard Morgan', 'Terry Pratchett', 'Tom Robbins']

What you're doing now will work the same way without the for loop - you're essentially just doing a single lookup and return that, hence you get only one entry. What you need to do instead is to do your lookup, find the authors and then do a lookup for each of those authors, then rinse and repeat... The easiest way to do that is to use a bit of recursion:
def get_similar(authors, author):
    return [a for x in authors.pop(author, []) for a in [x] + get_similar(authors, x)]
get_similar(authors, 'Harlan Ellison')
# ['Neil Stephenson', 'Kurt Vonnegut', 'Terry Pratchett', 'Tom Robbins', 'Douglas Adams',
# 'Terry Pratchett', 'Chris Moore', 'Kurt Vonnegut', 'Neil Stephenson', 'Jeff Vandemeer',
# 'Richard Morgan', 'Douglas Adams']
Then all you need to do is to turn it into a set to get rid of the duplicates and then sort it, or if you don't mind a slight performance hit (due to recursion) you can do it right inside your function:
def get_similar(authors, author):
    return sorted(set([a for x in authors.pop(author, []) for a in [x] + get_similar(authors, x)]))
# ['Chris Moore', 'Douglas Adams', 'Jeff Vandemeer', 'Kurt Vonnegut', 'Neil Stephenson', 'Richard Morgan', 'Terry Pratchett', 'Tom Robbins']
Keep in mind that this modifies your input dictionary to avoid infinite recursion, so if you want to keep your authors dictionary intact call the function as get_similar(authors.copy(), author).
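If copying the dictionary feels awkward, one alternative is to track visited names in a separate set instead of popping keys. This is a sketch of a different technique (an iterative traversal with a `seen` set), not the recursive version above. `get_all_similar` is a hypothetical name, the `authors` dictionary is a shortened excerpt, and the result is deduplicated.

```python
def get_all_similar(authors, author):
    """Collect every author reachable from `author` without mutating `authors`."""
    seen = set()
    stack = list(authors.get(author, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already handled; this is what prevents endless loops
        seen.add(current)
        stack.extend(authors.get(current, []))
    return sorted(seen)


authors = {
    'Harlan Ellison': ['Neil Stephenson', 'Kurt Vonnegut', 'Richard Morgan', 'Douglas Adams'],
    'Kurt Vonnegut': ['Terry Pratchett', 'Tom Robbins', 'Douglas Adams', 'Neil Stephenson', 'Jeff Vandemeer'],
    'Douglas Adams': ['Terry Pratchett', 'Chris Moore', 'Kurt Vonnegut'],
}
result = get_all_similar(authors, 'Harlan Ellison')
print(result)
```

Because nothing is popped, the same dictionary can be queried repeatedly.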

A function returns only once, so the return inside the loop ends it on the first iteration. To fix this, return the full list without iterating:
def get_similar(author_list, author):
    return sorted(author_list[author])

I'd use recursion to find similar authors in this fashion. As it turns out, wanting to return duplicates makes this even more inconvenient (and more dangerous and slower).
authors = {'Ray Bradbury': ['Harlan Ellison', 'Robert Heinlein', 'Isaac Asimov', 'Arthur Clarke'], 'Harlan Ellison': ['Neil Stephenson',
'Kurt Vonnegut', 'Richard Morgan', 'Douglas Adams'], 'Kurt Vonnegut': ['Terry Pratchett', 'Tom Robbins', 'Douglas Adams',
'Neil Stephenson', 'Jeff Vandemeer'], 'Thomas Pynchon': ['Isaac Asimov', 'Jorges Borges', 'Robert Heinlein'], 'Isaac Asimov':
['Stephen Baxter', 'Ray Bradbury', 'Arthur Clarke', 'Kurt Vonnegut', 'Neil Stephenson'], 'Douglas Adams': ['Terry Pratchett', 'Chris Moore', 'Kurt Vonnegut']}
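def get_similar(author_list, author, currentList=None):
    # Use None as the default so the list is not shared between calls
    if currentList is None:
        currentList = []
    for similar in author_list[author]:
        if similar not in currentList:
            currentList.append(similar)
            if similar in author_list:
                # Recurse on the similar author, not on the original one
                get_similar(author_list, similar, currentList)
    return sorted(currentList)
print(get_similar(authors, "Harlan Ellison"))
Returns:
['Chris Moore', 'Douglas Adams', 'Jeff Vandemeer', 'Kurt Vonnegut', 'Neil Stephenson', 'Richard Morgan', 'Terry Pratchett', 'Tom Robbins']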

One way is using list comprehension + itertools.chain
from itertools import chain
def get_similar(author_list, author):
    # Use the author_list parameter rather than the global authors dict
    return sorted(set(chain(*[v for k,v in author_list.items() if k in author_list[author]])))
get_similar(authors, 'Harlan Ellison')
#['Chris Moore', 'Douglas Adams', 'Jeff Vandemeer', 'Kurt Vonnegut', 'Neil Stephenson', 'Terry Pratchett', 'Tom Robbins']

I would not include parameter author in the output if that's one of the elements in a list value. You could use list comprehension:
def get_similar(author_list, author):
    # Lists of similar authors
    similar = [author_list[auth] for auth in author_list[author] if auth in author_list]
    # Merge the lists and sort the authors. Do not include parameter author
    return sorted(auth for sub in similar for auth in sub if auth != author)
authors = {
'Ray Bradbury': ['Harlan Ellison', 'Robert Heinlein', 'Isaac Asimov', 'Arthur Clarke'],
'Harlan Ellison': ['Neil Stephenson', 'Kurt Vonnegut', 'Richard Morgan', 'Douglas Adams'],
'Kurt Vonnegut': ['Terry Pratchett', 'Tom Robbins', 'Douglas Adams', 'Neil Stephenson', 'Jeff Vandemeer'],
'Thomas Pynchon': ['Isaac Asimov', 'Jorges Borges', 'Robert Heinlein'],
'Isaac Asimov': ['Stephen Baxter', 'Ray Bradbury', 'Arthur Clarke', 'Kurt Vonnegut', 'Neil Stephenson'],
'Douglas Adams': ['Terry Pratchett', 'Chris Moore', 'Kurt Vonnegut']
}
>>> get_similar(authors, 'Harlan Ellison')
['Chris Moore', 'Douglas Adams', 'Jeff Vandemeer', 'Kurt Vonnegut', 'Neil Stephenson', 'Terry Pratchett', 'Terry Pratchett', 'Tom Robbins']
>>> get_similar(authors, 'Ray Bradbury') # There's 'Ray Bradbury' in the values of 'Isaac Asimov'
['Arthur Clarke', 'Douglas Adams', 'Kurt Vonnegut', 'Kurt Vonnegut', 'Neil Stephenson', 'Neil Stephenson', 'Richard Morgan', 'Stephen Baxter']
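
The two answers above differ in one detail each: the itertools.chain version removes duplicates but can still return the queried author, while the list-comprehension version drops the queried author but keeps duplicates. Here is a minimal sketch that does both, assuming that is the behaviour you want. The name get_similar_unique is hypothetical, not from either answer:

```python
authors = {
    'Ray Bradbury': ['Harlan Ellison', 'Robert Heinlein', 'Isaac Asimov', 'Arthur Clarke'],
    'Harlan Ellison': ['Neil Stephenson', 'Kurt Vonnegut', 'Richard Morgan', 'Douglas Adams'],
    'Kurt Vonnegut': ['Terry Pratchett', 'Tom Robbins', 'Douglas Adams', 'Neil Stephenson', 'Jeff Vandemeer'],
    'Thomas Pynchon': ['Isaac Asimov', 'Jorges Borges', 'Robert Heinlein'],
    'Isaac Asimov': ['Stephen Baxter', 'Ray Bradbury', 'Arthur Clarke', 'Kurt Vonnegut', 'Neil Stephenson'],
    'Douglas Adams': ['Terry Pratchett', 'Chris Moore', 'Kurt Vonnegut'],
}


def get_similar_unique(author_list, author):
    """Hypothetical helper: authors similar to the similar authors, deduplicated."""
    related = set()
    # Only similar authors that have their own entry contribute anything
    for auth in author_list.get(author, []):
        related.update(author_list.get(auth, []))
    # Never report the queried author as similar to themselves
    related.discard(author)
    return sorted(related)


print(get_similar_unique(authors, 'Ray Bradbury'))
# ['Arthur Clarke', 'Douglas Adams', 'Kurt Vonnegut', 'Neil Stephenson', 'Richard Morgan', 'Stephen Baxter']
```

Using a set avoids the repeated 'Kurt Vonnegut' and 'Neil Stephenson' entries shown in the output above, and it avoids the mutable default argument used in the question's code.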

Write a CSV file with delimiter in a cell (two points)

How do I write to a .csv file the following:
1171837, 1974:3655:1862:279:1936
1172, 238:1833:228:234:1821:225:211:245:1941:315:2035:222:3371:231:224:216:1942
instead of this:
1171837, ['1974', '3655', '1862', '279', '1936']
1172, ['238', '1833', '228', '234', '1821', '225', '211', '245', '1941', '315', '2035', '222', '3371', '231', '224', '216', '1942']
These are the lists that I have:
lche=['1171837', '1172']
ltarg=[['1974', '3655', '1862', '279', '1936'],
['238', '1833', '228', '234', '1821', '225', '211', '245', '1941',
'315', '2035', '222', '3371', '231', '224', '216', '1942']]
This is how I did it. I do not know how to use other delimiters.
data="list.csv"
csv_out = open(data, 'wb')
mywriter = csv.writer(csv_out)
for row in zip(lche,ltarg):
    mywriter.writerow(row)
csv_out.close()

You can join the elements of ltarg together with a colon:
>>> ltarg2 = list()
>>> for elem in ltarg:
...     ltarg2.append(':'.join(elem))
...
>>> ltarg2
['1974:3655:1862:279:1936', '238:1833:228:234:1821:225:211:245:1941:315:2035:222:3371:231:224:216:1942']
Then continue as you were with the new list:
for row in zip(lche,ltarg2):
    mywriter.writerow(row)

You can use str.join:
import csv
lche=['1171837', '1172']
ltarg=[['1974', '3655', '1862', '279', '1936'], ['238', '1833', '228', '234', '1821', '225', '211', '245', '1941', '315', '2035', '222', '3371', '231', '224', '216', '1942']]
data="list.csv"
# Python 3: open in text mode with newline='' (Python 2 used 'wb')
csv_out = open(data, 'w', newline='')
mywriter = csv.writer(csv_out)
# Collapse each sub-list into one colon-separated string
l = [':'.join(x) for x in ltarg]
for row in zip(lche,l):
    mywriter.writerow(row)
csv_out.close()

You need to combine (join) the sub-lists in ltarg into single strings while zipping its contents with the lche list.
import csv
lche = ['1171837', '1172']
ltarg = [['1974', '3655', '1862', '279', '1936'],
         ['238', '1833', '228', '234', '1821', '225', '211', '245', '1941',
          '315', '2035', '222', '3371', '231', '224', '216', '1942']]
data = "list.csv"
# Python 3: open in text mode with newline='' (Python 2 used 'wb')
with open(data, 'w', newline='') as csv_out:
    mywriter = csv.writer(csv_out)
    # Join each sub-list into one colon-separated cell while zipping
    for row in zip(lche, (':'.join(lt) for lt in ltarg)):
        mywriter.writerow(row)
Contents of the list.csv file afterwards:
1171837,1974:3655:1862:279:1936
1172,238:1833:228:234:1821:225:211:245:1941:315:2035:222:3371:231:224:216:1942
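
If the file later has to be read back, the colon-joined cell can be split again on ':'. The following is a minimal round-trip sketch under that assumption. It writes to an in-memory io.StringIO buffer instead of list.csv so it runs without touching the disk, and the name restored is made up for illustration:

```python
import csv
import io

lche = ['1171837', '1172']
ltarg = [['1974', '3655', '1862', '279', '1936'],
         ['238', '1833', '228', '234', '1821', '225', '211', '245', '1941',
          '315', '2035', '222', '3371', '231', '224', '216', '1942']]

# In-memory stand-in for open('list.csv', 'w', newline='')
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
for key, targets in zip(lche, ltarg):
    # One comma-separated row: the key, then all targets in a single cell
    writer.writerow([key, ':'.join(targets)])

# Read the rows back and split the second cell on the inner delimiter
buffer.seek(0)
restored = {row[0]: row[1].split(':') for row in csv.reader(buffer)}

print(restored['1171837'])
# ['1974', '3655', '1862', '279', '1936']
```

The comma stays the CSV delimiter and the colon is only a convention inside one cell, so this works as long as no value contains a colon itself.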

XPath - extracting table data with irregular pattern

Extending an existing question and answer here, I am trying to extract each player's name and position. The output should look like:
playername, position
EJ Manuel, Quarterbacks
Tyrod Taylor, Quarterbacks
Anthony Dixon, Running backs
...
This is what I have done so far:
import requests
from lxml import html
tree = html.fromstring(requests.get("https://en.wikipedia.org/wiki/List_of_current_AFC_team_rosters").text)
for h3 in tree.xpath("//table[@class='toccolours']//tr[2]"):
    position = h3.xpath(".//b/text()")
    players = h3.xpath(".//ul/li/a/text()")
    print(position, players)
The code above produces the following, which is not in the format I need.
(['Quarterbacks', 'Running backs', 'Wide receivers', 'Tight ends', 'Offensive linemen', 'Defensive linemen', 'Linebackers', 'Defensive backs', 'Special teams', 'Reserve lists', 'Unrestricted FAs', 'Restricted FAs', 'Exclusive-Rights FAs'], ['EJ Manuel', 'Tyrod Taylor', 'Anthony Dixon', 'Jerome Felton', 'Mike Gillislee', 'LeSean McCoy', 'Karlos Williams', 'Leonard Hankerson', 'Marcus Easley', 'Marquise Goodwin', 'Percy Harvin', 'Dez Lewis', 'Walt Powell', 'Greg Salas', 'Sammy Watkins', 'Robert Woods', 'Charles Clay', 'Chris Gragg', "Nick O'Leary", 'Tyson Chandler', 'Ryan Groy', 'Seantrel Henderson', 'Cyrus Kouandjio', 'John Miller', 'Kraig Urbik', 'Eric Wood', 'T. J. Barnes', 'Marcell Dareus', 'Lavar Edwards', 'IK Enemkpali', 'Jerry Hughes', 'Kyle Williams', 'Mario Williams', 'Jerel Worthy', 'Jarius Wynn', 'Preston Brown', 'Randell Johnson', 'Manny Lawson', 'Kevin Reddick', 'Tony Steward', 'A. J. Tarpley', 'Max Valles', 'Mario Butler', 'Ronald Darby', 'Stephon Gilmore', 'Corey Graham', 'Leodis McKelvin', 'Jonathan Meeks', 'Merrill Noel', 'Nickell Robey', 'Sammy Seamster', 'Cam Thomas', 'Aaron Williams', 'Duke Williams', 'Dan Carpenter', 'Jordan Gay', 'Garrison Sanborn', 'Colton Schmidt', 'Blake Annen', 'Jarrett Boykin', 'Jonathan Dowling', 'Greg Little', 'Jacob Maxwell', 'Ronald Patrick', 'Cedric Reed', 'Cyril Richardson', 'Phillip Thomas', 'James Wilder, Jr.', 'Nigel Bradham', 'Ron Brooks', 'Alex Carrington', 'Cordy Glenn', 'Leonard Hankerson', 'Richie Incognito', 'Josh Johnson', 'Corbin Bryant', 'Stefan Charles', 'MarQueis Gray', 'Chris Hogan', 'Jordan Mills', 'Ty Powell', 'Bacarri Rambo', 'Cierre Wood'])
(['Quarterbacks', 'Running backs', 'Wide receivers', 'Tight ends', 'Offensive linemen', 'Defensive linemen', 'Linebackers', 'Defensive backs', 'Special teams', 'Reserve lists', 'Unrestricted FAs', 'Restricted FAs', 'Exclusive-Rights FAs'], ['Zac Dysert', 'Ryan Tannehill', 'Logan Thomas', 'Jay Ajayi', 'Jahwan Edwards', 'Damien Williams', 'Tyler Davis', 'Robert Herron', 'Greg Jennings', 'Jarvis Landry', 'DeVante Parker', 'Kenny Stills', 'Jordan Cameron', 'Dominique Jones', 'Dion Sims', 'Branden Albert', 'Jamil Douglas', "Ja'Wuan James", 'Vinston Painter', 'Mike Pouncey', 'Anthony Steen', 'Dallas Thomas', 'Billy Turner', 'Deandre Coleman', 'Quinton Coples', 'Terrence Fede', 'Dion Jordan', 'Earl Mitchell', 'Damontre Moore', 'Jordan Phillips', 'Ndamukong Suh', 'Charles Tuaau', 'Robert Thomas', 'Cameron Wake', 'Julius Warmsley', 'Jordan Williams', 'Neville Hewitt', 'Mike Hull', 'Jelani Jenkins', 'Terrell Manning', 'Chris McCain', 'Koa Misi', 'Zach Vigil', 'Walt Aikens', 'Damarr Aultman', 'Brent Grimes', 'Reshad Jones', 'Tony Lippett', 'Bobby McCain', 'Brice McCain', 'Tyler Patmon', 'Dax Swanson', 'Jamar Taylor', 'Matt Darr', 'John Denney', 'Andrew Franks', 'Louis Delmas', 'James-Michael Johnson', 'Rishard Matthews', 'Jacques McClendon', 'Lamar Miller', 'Matt Moore', 'Spencer Paysinger', 'Derrick Shelby', 'Kelvin Sheppard', 'Shelley Smith', 'Olivier Vernon', 'Michael Thomas', 'Brandon Williams', 'Shamiel Gary', 'Matt Hazel', 'Ulrick John', 'Jake Stoneburner'])
...
Any suggestions?

You can use a nested loop for this task. First loop through the positions, and then, for each position, loop through the corresponding players:
#loop through positions
for b in tree.xpath("//table[@class='toccolours']//tr[2]//b"):
    #get current position text
    position = b.xpath("text()")[0]
    #get players that correspond to the current position
    for a in b.xpath("following::ul[1]/li/a[not(*)]"):
        #get current player text
        player = a.xpath("text()")[0]
        #print current position and player together
        print(position, player)
Last part of the output:
.....
('Reserve lists', 'Chris Watt')
('Reserve lists', 'Eric Weddle')
('Reserve lists', 'Tourek Williams')
('Practice squad', 'Alex Bayer')
('Practice squad', 'Isaiah Burse')
('Practice squad', 'Richard Crawford')
('Practice squad', 'Ben Gardner')
('Practice squad', 'Michael Huey')
('Practice squad', 'Keith Lewis')
('Practice squad', 'Chuka Ndulue')
('Practice squad', 'Tim Semisch')
('Practice squad', 'Brad Sorensen')
('Practice squad', 'Craig Watts')
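
To see the grouping idea without fetching the page, here is a small sketch that uses a different technique from the answer above: instead of the following:: axis (which the standard library's xml.etree.ElementTree does not support), it walks the elements in document order and remembers the last <b> heading it saw. The markup in SNIPPET is a made-up, well-formed stand-in for one roster table, and position_player_pairs is a hypothetical helper name. The real Wikipedia HTML is not guaranteed to parse as XML, so treat this only as an illustration of the logic:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one roster table
SNIPPET = """
<table class="toccolours">
  <tr><th>Roster</th></tr>
  <tr><td>
    <p><b>Quarterbacks</b></p>
    <ul><li><a href="#">EJ Manuel</a></li><li><a href="#">Tyrod Taylor</a></li></ul>
    <p><b>Running backs</b></p>
    <ul><li><a href="#">Anthony Dixon</a></li></ul>
  </td></tr>
</table>
"""


def position_player_pairs(markup):
    """Return (player, position) tuples in document order."""
    root = ET.fromstring(markup.strip())
    pairs = []
    position = None
    # iter() visits elements in document order, so each <b> heading
    # is seen before the links in the list that follows it
    for elem in root.iter():
        if elem.tag == "b":
            position = elem.text
        elif elem.tag == "a" and position is not None:
            pairs.append((elem.text, position))
    return pairs


for player, position in position_player_pairs(SNIPPET):
    print(f"{player}, {position}")
```

This prints one "playername, position" line per player, which is the format asked for in the question.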
