Multiple Python scripts running concurrently with different inputs - python

So I have this Python script that scrapes listings from a specific Craigslist URL the user constructs (location, max price, type of item, etc.). It then goes to the URL, scrapes the listings' info (price, date posted, etc.) and returns three outputs. One is 'x' number of items around the average price (the user determines the number of items and the range of prices, such as $100 off the average price). Next are the 'x' closest listings based on the zip code the user provided in the beginning (the user also determines the number of items displayed based on proximity to the zip code). Lastly, the Craigslist URL is output to the user so they can visit the page and look at the items displayed to them earlier. The scraped data is stored in a data.json file and a data.csv file. The content is the same, just in different formats. I would like to offload this data to a database every time a scrape is done, either Cloud Firestore or AWS DynamoDB, since I want to host this as a web app in the future.
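For the database side, the scrape already produces a list of dictionaries before it is written to data.json/data.csv, so the offload could be one extra call per scrape. A rough sketch of the DynamoDB option (untested; it assumes boto3 credentials are configured and that a hypothetical "listings" table with a "listing_id" partition key already exists):

    import uuid
    from decimal import Decimal

    import boto3


    def upload_scrape_to_dynamodb(list_of_attributes, table_name="listings"):
        # Write every scraped listing to DynamoDB in one batch.
        table = boto3.resource("dynamodb").Table(table_name)
        with table.batch_writer() as batch:
            for item in list_of_attributes:
                batch.put_item(Item={
                    "listing_id": str(uuid.uuid4()),  # synthetic key, just for this sketch
                    "Listing": item["Listing"],
                    "Price": int(item["Price"]),
                    "Date posted": item["Date posted"],
                    # DynamoDB rejects Python floats, so the distance goes in as a Decimal
                    "Distance from zip": Decimal(str(item["Distance from zip"])),
                })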
What I want to do is allow the user to have multiple instances of the same script, all with unique Craigslist URLs, running at the same time. All of the code is the same; the only difference is the Craigslist URL each script scrapes.
I made a method that iterated through the creation of the attributes (location, max price, etc.) and returned a list of the URLs, but in my main I call the constructor and it needs all of those attributes, so I had to fish them back out of the URLs, which seemed over the top.
I then tried to have the loop in my main: the user determines how many URLs they want to make and the completed links are appended to a list. Again I ran into the same problem.
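What I am aiming for is something along these lines (untested sketch): collect the raw parameters for each search up front, then build a scraper from each set, so the constructor never has to parse anything back out of a URL:

    def collect_search_parameters():
        # Ask how many searches to build, then prompt for the attributes of each one.
        parameter_sets = []
        how_many = int(input("How many Craigslist URLs would you like to build?: "))
        for n in range(how_many):
            print(f"--- Search #{n + 1} ---")
            parameter_sets.append({
                "location": input("Enter the location you would like to search: "),
                "postal_code": input("Enter the zip code you would like to base radius off of: "),
                "max_price": input("Enter the max price you would like the search to use: "),
                "query": input("Enter the item you would like to search: "),
                "radius": input("Enter the radius you would like the search to use: "),
            })
        return parameter_sets

    # scrapers = [CraigslistScraper(**params) for params in collect_search_parameters()]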
import json

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class CraigslistScraper(object):
    # Constructor of the URL that is being scraped
    def __init__(self, location, postal_code, max_price, query, radius):
        self.location = location  # Location (i.e. city) being searched
        self.postal_code = postal_code  # Postal code of location being searched
        self.max_price = max_price  # Max price of the items that will be searched
        self.query = query  # Search for the type of items that will be searched
        self.radius = radius  # Radius of the area searched derived from the postal code given previously

        self.url = f"https://{location}.craigslist.org/search/sss?&max_price={max_price}&postal={postal_code}&query={query}&20card&search_distance={radius}"

        self.driver = webdriver.Chrome(r"C:\Program Files\chromedriver")  # Path of the Chrome web driver
        self.delay = 7  # The delay the driver gives when loading the web page

    # Load up the web page
    # Gets all relevant data on the page
    # Goes to next page until we are at the last page
    def load_craigslist_url(self):
        data = []
        # url_list = []
        self.driver.get(self.url)
        while True:
            try:
                wait = WebDriverWait(self.driver, self.delay)
                wait.until(EC.presence_of_element_located((By.ID, "searchform")))
                data.append(self.extract_post_titles())
                # url_list.append(self.extract_post_urls())
                WebDriverWait(self.driver, 2).until(
                    EC.element_to_be_clickable((By.XPATH, '//*[@id="searchform"]/div[3]/div[3]/span[2]/a[3]'))).click()
            except:
                break
        return data

    # Extracts all relevant information from the web-page and returns them as individual lists
    def extract_post_titles(self):
        all_posts = self.driver.find_elements_by_class_name("result-row")
        dates_list = []
        titles_list = []
        prices_list = []
        distance_list = []
        for post in all_posts:
            title = post.text.split("$")
            if title[0] == '':
                title = title[1]
            else:
                title = title[0]
            title = title.split("\n")
            price = title[0]
            title = title[-1]
            title = title.split(" ")
            month = title[0]
            day = title[1]
            title = ' '.join(title[2:])
            date = month + " " + day
            if not price[:1].isdigit():
                price = "0"
            int(price)
            raw_distance = post.find_element_by_class_name('maptag').text
            distance = raw_distance[:-2]
            titles_list.append(title)
            prices_list.append(price)
            dates_list.append(date)
            distance_list.append(distance)
        return titles_list, prices_list, dates_list, distance_list

    # Gets all of the url links of each listing on the page
    # def extract_post_urls(self):
    #     soup_list = []
    #     html_page = urllib.request.urlopen(self.driver.current_url)
    #     soup = BeautifulSoup(html_page, "html.parser")
    #     for link in soup.findAll("a", {"class": "result-title hdrlnk"}):
    #         soup_list.append(link["href"])
    #
    #     return soup_list

    # Kills browser
    def kill(self):
        self.driver.close()

    # Gets price value from dictionary and computes average
    @staticmethod
    def get_average(sample_dict):
        price = list(map(lambda x: x['Price'], sample_dict))
        sum_of_prices = sum(price)
        length_of_list = len(price)
        average = round(sum_of_prices / length_of_list)
        return average

    # Displays items around the average price of all the items in prices_list
    @staticmethod
    def get_items_around_average(avg, sample_dict, counter, give):
        print("Items around average price: ")
        print("-------------------------------------------")
        raw_list = []
        for z in range(len(sample_dict)):
            current_price = sample_dict[z].get('Price')
            if abs(current_price - avg) <= give:
                raw_list.append(sample_dict[z])
        final_list = raw_list[:counter]
        for index in range(len(final_list)):
            print('\n')
            for key in final_list[index]:
                print(key, ':', final_list[index][key])

    # Displays nearest items to the zip provided
    @staticmethod
    def get_items_around_zip(sample_dict, counter):
        final_list = []
        print('\n')
        print("Closest listings: ")
        print("-------------------------------------------")
        x = 0
        while x < counter:
            final_list.append(sample_dict[x])
            x += 1
        for index in range(len(final_list)):
            print('\n')
            for key in final_list[index]:
                print(key, ':', final_list[index][key])

    # Converts all_of_the_data list of dictionaries to json file
    @staticmethod
    def convert_to_json(sample_list):
        with open(r"C:\Users\diego\development\WebScraper\data.json", 'w') as file_out:
            file_out.write(json.dumps(sample_list, indent=4))

    @staticmethod
    def convert_to_csv(sample_list):
        df = pd.DataFrame(sample_list)
        df.to_csv("data.csv", index=False, header=True)
# Main, where the big list of data is broken down into its individual parts to be converted to a .csv file;
# the parameters of the search are also set here
if __name__ == "__main__":
    location = input("Enter the location you would like to search: ")  # Location Craigslist searches
    zip_code = input(
        "Enter the zip code you would like to base radius off of: ")  # Postal code Craigslist uses as a base for 'MILES FROM ZIP'
    type_of_item = input(
        "Enter the item you would like to search (ex. furniture, bicycles, cars, etc.): ")  # Type of item you are looking for
    max_price = input(
        "Enter the max price you would like the search to use: ")  # Max price Craigslist limits the items to
    radius = input(
        "Enter the radius you would like the search to use (based off of zip code provided earlier): ")  # Radius from postal code Craigslist limits the search to

    scraper = CraigslistScraper(location, zip_code, max_price, type_of_item,
                                radius)  # Constructs the URL with the given parameters

    results = scraper.load_craigslist_url()  # Inserts the result of the scraping into a large multidimensional list

    titles_list = results[0][0]
    prices_list = list(map(int, results[0][1]))
    dates_list = results[0][2]
    distance_list = list(map(float, results[0][3]))

    scraper.kill()

    # Merge all of the lists into a dictionary
    # Dictionary is then sorted by distance from smallest -> largest
    list_of_attributes = []
    for i in range(len(titles_list)):
        content = {'Listing': titles_list[i], 'Price': prices_list[i], 'Date posted': dates_list[i],
                   'Distance from zip': distance_list[i]}
        list_of_attributes.append(content)
    list_of_attributes.sort(key=lambda x: x['Distance from zip'])

    scraper.convert_to_json(list_of_attributes)
    scraper.convert_to_csv(list_of_attributes)
    # scraper.export_to_mongodb()

    # Below function calls:
    # Get average price and prints it
    # Gets/prints listings around said average price
    # Gets/prints nearest listings
    average = scraper.get_average(list_of_attributes)
    print(f'Average price of items searched: ${average}')
    num_items_around_average = int(input("How many listings around the average price would you like to see?: "))
    avg_range = int(input("Range of listings around the average price: "))
    scraper.get_items_around_average(average, list_of_attributes, num_items_around_average, avg_range)
    print("\n")
    num_items = int(input("How many items would you like to display based off of proximity to zip code?: "))
    print(f"Items around you: ")
    scraper.get_items_around_zip(list_of_attributes, num_items)
    print("\n")
    print(f"Link of listings : {scraper.url}")
What I want is for the program to get the number of URLs the user wants to scrape. That input will determine the number of instances of this script that need to be running.
Then the user will run through the prompts for every scraper, such as building the URL ("What location would you like to search?: "). After they are done creating the URLs, each scraper will run with its specific URL and display back the three outputs described above, specific to the URL that scraper was assigned.
In the future I would like to add a time function where the user determines how often they want the script to run (every hour, every day, every other day, etc.). I would also like to connect to a database and instead just query from the database the 'x' number of listings around the average price range and the 'x' closest listings based on proximity, specific to each URL's results.
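Roughly what I have in mind for the timed reruns, as an untested standard-library sketch (the hypothetical scrape_once callable stands in for one full scrape of a URL):

    import time


    def run_on_interval(scrape_once, interval_seconds):
        # Run one full scrape, then sleep until the next run is due, forever.
        while True:
            scrape_once()
            time.sleep(interval_seconds)

    # e.g. run_on_interval(lambda: scraper.load_craigslist_url(), interval_seconds=60 * 60)  # every hour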

If you want several instances of your scraper to run in parallel while your main is running in a loop, you need to use subprocesses.
https://docs.python.org/3/library/subprocess.html
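A minimal sketch of that approach, assuming the scraper is refactored into a hypothetical scraper.py that takes its Craigslist URL as a command-line argument instead of prompting with input():

    import subprocess
    import sys

    urls = [
        "https://philadelphia.craigslist.org/search/sss?query=graphics+card",  # example URLs only
        "https://newyork.craigslist.org/search/sss?query=bicycle",
    ]

    # Launch one independent Python process per URL, then wait for all of them to finish.
    processes = [subprocess.Popen([sys.executable, "scraper.py", url]) for url in urls]
    for process in processes:
        process.wait()

multiprocessing.Pool would also work if you prefer to keep everything inside a single script rather than shelling out.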

Related

How to iterate a variable in XPATH, extract a link and store it into a list for further iteration

I'm following a Selenium tutorial for an Amazon price tracker (Clever Programming on YouTube) and I got stuck at getting the links from Amazon using their techniques.
tutorial link: https://www.youtube.com/watch?v=WbJeL_Av2-Q&t=4315s
I realized the problem lay in the fact that I'm only getting one link out of the 17 available after doing the product search. I need to get all the links for every product after doing a search and then use them to get into each product and grab its title, seller and price.
The function get_products_links() should get all the links and store them in a list to be used by the function get_product_info().
    def get_products_links(self):
        self.driver.get(self.base_url)  # Go to amazon.com using BASE_URL
        element = self.driver.find_element_by_id('twotabsearchtextbox')
        element.send_keys(self.search_term)
        element.send_keys(Keys.ENTER)
        time.sleep(2)  # Wait to load page
        self.driver.get(f'{self.driver.current_url}{self.price_filter}')
        time.sleep(2)  # Wait to load page
        result_list = self.driver.find_elements_by_class_name('s-result-list')
        links = []
        try:
            ### Trying to get a list of Xpath link attributes ###
            ### Only numbers from 3 to 17 work after doing the product search where 'i' is placed in the XPATH ###
            i = 3
            results = result_list[0].find_elements_by_xpath(
                f'//*[@id="search"]/div[1]/div[1]/div/span[3]/div[2]/div[{i}]/div/div/div/div/div/div[1]/div/div[2]/div/span/a')
            links = [link.get_attribute('href') for link in results]
            return links
        except Exception as e:
            print("Didn't get any products...")
            print(e)
            return links
At this point get_products_links() only returns one link, since I just made 'i' a fixed value of 3 to make it work for now.
I was thinking of iterating 'i' somehow so I can save every different path, but I don't know how to implement this.
I've tried performing a for loop and appending the results into a new list, but then the app stops working.
Here is the complete code:
from amazon_config import (
    get_web_driver_options,
    get_chrome_web_driver,
    set_browser_as_incognito,
    set_ignore_certificate_error,
    NAME,
    CURRENCY,
    FILTERS,
    BASE_URL,
    DIRECTORY
)
import time
from selenium.webdriver.common.keys import Keys


class GenerateReport:
    def __init__(self):
        pass


class AmazonAPI:
    def __init__(self, search_term, filters, base_url, currency):
        self.base_url = base_url
        self.search_term = search_term
        options = get_web_driver_options()
        set_ignore_certificate_error(options)
        set_browser_as_incognito(options)
        self.driver = get_chrome_web_driver(options)
        self.currency = currency
        self.price_filter = f"&rh=p_36%3A{filters['min']}00-{filters['max']}00"

    def run(self):
        print("Starting script...")
        print(f"Looking for {self.search_term} products...")
        links = self.get_products_links()
        time.sleep(1)
        if not links:
            print("Stopped script.")
            return
        print(f"Got {len(links)} links to products...")
        print("Getting info about products...")
        products = self.get_products_info(links)
        # self.driver.quit()

    def get_products_info(self, links):
        asins = self.get_asins(links)
        product = []
        for asin in asins:
            product = self.get_single_product_info(asin)

    def get_single_product_info(self, asin):
        print(f"Product ID: {asin} - getting data...")
        product_short_url = self.shorten_url(asin)
        self.driver.get(f'{product_short_url}?language=en_GB')
        time.sleep(2)
        title = self.get_title()
        seller = self.get_seller()
        price = self.get_price()

    def get_title(self):
        try:
            return self.driver.find_element_by_id('productTitle')
        except Exception as e:
            print(e)
            print(f"Can't get title of a product - {self.driver.current_url}")
            return None

    def get_seller(self):
        try:
            return self.driver.find_element_by_id('bylineInfo')
        except Exception as e:
            print(e)
            print(f"Can't get title of a product - {self.driver.current_url}")
            return None

    def get_price(self):
        return '$99'

    def shorten_url(self, asin):
        return self.base_url + 'dp/' + asin

    def get_asins(self, links):
        return [self.get_asin(link) for link in links]

    def get_asin(self, product_link):
        return product_link[product_link.find('/dp/') + 4:product_link.find('/ref')]

    def get_products_links(self):
        self.driver.get(self.base_url)  # Go to amazon.com using BASE_URL
        element = self.driver.find_element_by_id('twotabsearchtextbox')
        element.send_keys(self.search_term)
        element.send_keys(Keys.ENTER)
        time.sleep(2)  # Wait to load page
        self.driver.get(f'{self.driver.current_url}{self.price_filter}')
        time.sleep(2)  # Wait to load page
        result_list = self.driver.find_elements_by_class_name('s-result-list')
        links = []
        try:
            ### Trying to get a list of Xpath link attributes ###
            ### Only numbers from 3 to 17 work after doing the product search where 'i' is placed ###
            i = 3
            results = result_list[0].find_elements_by_xpath(
                f'//*[@id="search"]/div[1]/div[1]/div/span[3]/div[2]/div[{i}]/div/div/div/div/div/div[1]/div/div[2]/div/span/a')
            links = [link.get_attribute('href') for link in results]
            return links
        except Exception as e:
            print("Didn't get any products...")
            print(e)
            return links


if __name__ == '__main__':
    print("HEY!!!🚀🔥")
    amazon = AmazonAPI(NAME, FILTERS, BASE_URL, CURRENCY)
    amazon.run()
Steps to Run the script:
Step 1:
install Selenium==3.141.0 into your virtual environment
Step 2:
Search for Chrome drivers on Google and download the driver that matches your Chrome version. After downloading, extract the driver and paste it into your working folder.
Step 3:
create a file called amazon_config.py and insert the following code:
from selenium import webdriver

DIRECTORY = 'reports'
NAME = 'PS4'
CURRENCY = '$'
MIN_PRICE = '275'
MAX_PRICE = '650'
FILTERS = {
    'min': MIN_PRICE,
    'max': MAX_PRICE
}
BASE_URL = "https://www.amazon.com/"


def get_chrome_web_driver(options):
    return webdriver.Chrome('./chromedriver', chrome_options=options)


def get_web_driver_options():
    return webdriver.ChromeOptions()


def set_ignore_certificate_error(options):
    options.add_argument('--ignore-certificate-errors')


def set_browser_as_incognito(options):
    options.add_argument('--incognito')
If you performed the steps correctly you should be able to run the script and it will perform the following:
Go to www.amazon.com
Search for a product (In this case "PS4")
Get a link for the first product
Visit that product link
Terminal should print:
HEY!!!🚀🔥
Starting script...
Looking for PS4 products...
Got 1 links to products...
Getting info about products...
Product ID: B012CZ41ZA - getting data...
What I'm not able to do is get all the links and iterate over them so the script visits every link on the first page.
If you are able to get all the links, the terminal should print:
HEY!!!🚀🔥
Starting script...
Looking for PS4 products...
Got 1 links to products...
Getting info about products...
Product ID: B012CZ41ZA - getting data...
Product ID: XXXXXXXXXX - getting data...
Product ID: XXXXXXXXXX - getting data...
Product ID: XXXXXXXXXX - getting data...
# and so on until all links are visited
I can't run it, so I can only guess how I would do it.
I would put the whole try/except inside a for-loop, use links.append() instead of links = [...], and return after exiting the loop:
# --- before loop ---

links = []

# --- loop ---

for i in range(3, 18):
    try:
        results = result_list[0].find_elements_by_xpath(
            f'//*[@id="search"]/div[1]/div[1]/div/span[3]/div[2]/div[{i}]/div/div/div/div/div/div[1]/div/div[2]/div/span/a')
        for link in results:
            links.append(link.get_attribute('href'))
    except Exception as e:
        print(f"Didn't get any products... (i = {i})")
        print(e)

# --- after loop ---

return links
But I would also try to use an XPath with // to skip most of the divs - and maybe if I skipped div[{i}] then I could get all the products without the for-loop.
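For example (untested, and Amazon's result-page markup changes often), a relative XPath from the result-list container might collect every product link in one call:

    # Untested sketch: .// searches anywhere below the result list, so the long chain
    # of positional div[...] steps (and the for-loop over i) is no longer needed.
    results = result_list[0].find_elements_by_xpath(".//h2//a[contains(@class, 'a-link-normal')]")
    links = [link.get_attribute('href') for link in results]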
BTW:
In get_products_info() I see a similar problem - you create an empty list product = [], but later in the loop you assign a value with product = ..., so you remove the previous value from product. It needs product.append() to keep all the values.
Something like
def get_products_info(self, links):
    # --- before loop ---
    asins = self.get_asins(links)
    product = []
    # --- loop ---
    for asin in asins:
        product.append( self.get_single_product_info(asin) )
    # --- after loop ---
    return product

Looping and stop duplicating output | Selenium | Python

Very new to Python and Selenium, looking to scrape a few data points. I'm struggling in three areas:
I don't understand how to loop through multiple URLs properly
I can't figure out why the script is iterating twice over each URL
I can't figure out why it's only outputting the data for the second URL
Much thanks for taking a look!
Here's my current script:
urls = [
    'https://developers.google.com/speed/pagespeed/insights/?url=https://www.crutchfield.com/%2F&tab=mobile',
    'https://developers.google.com/speed/pagespeed/insights/?url=https://www.lastpass.com%2F&tab=mobile'
]

driver = webdriver.Chrome(executable_path='/Library/Frameworks/Python.framework/Versions/3.9/bin/chromedriver')

for url in urls:
    for page in range(0, 1):
        driver.get(url)
        wait = WebDriverWait(driver, 120).until(EC.presence_of_element_located((By.CLASS_NAME, 'origin-field-data')))
        df = pd.DataFrame(columns=['Title', 'Core Web Vitals', 'FCP', 'FID', 'CLS', 'TTI', 'TBT', 'Total Score'])
        company = driver.find_elements_by_class_name("audited-url__link")
        data = []
        for i in company:
            data.append(i.get_attribute('href'))
        for x in data:
            # Get URL name
            title = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[1]/div/div[2]/h1/a')
            co_name = title.text
            # Get Core Web Vitals text pass/fail
            cwv = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[2]/div/div[1]/div[1]/div[1]/span[2]')
            core_web = cwv.text
            # Get FCP
            fcp = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[2]/div/div[1]/div[1]/div[2]/div[1]/div[1]/div')
            first_content = fcp.text
            # Get FID
            fid = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[2]/div/div[1]/div[1]/div[2]/div[3]/div[1]/div')
            first_input = fid.text
            # Get CLS
            cls = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[2]/div/div[1]/div[1]/div[2]/div[4]/div[1]/div')
            layout_shift = cls.text
            # Get TTI
            tti = driver.find_element_by_xpath('//*[@id="interactive"]/div/div[1]')
            time_interactive = tti.text
            # Get TBT
            tbt = driver.find_element_by_xpath('//*[@id="total-blocking-time"]/div/div[1]')
            total_block = tbt.text
            # Get Total Score
            total_score = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[1]/div/div[1]/a/div[2]')
            score = total_score.text
            # Adding all columns to dataframe
            df.loc[len(df)] = [co_name, core_web, first_content, first_input, layout_shift, time_interactive, total_block, score]

driver.close()
# df.to_csv('Double Page Speed Test 9-10.csv')
print(df)
Q1: I don't understand how to loop through multiple URLs properly?
Ans: for url in urls:
Q2: I can't figure out why the script is iterating twice over each URL.
Ans: Because you have for page in range(0, 1):
Update 1:
I did not run your entire code with the DF. Also, sometimes one of the pages does not show the number and href, but when I run the code below,
driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.implicitly_wait(50)
wait = WebDriverWait(driver, 20)

urls = [
    'https://developers.google.com/speed/pagespeed/insights/?url=https://www.crutchfield.com/%2F&tab=mobile',
    'https://developers.google.com/speed/pagespeed/insights/?url=https://www.lastpass.com%2F&tab=mobile'
]

data = []
for url in urls:
    driver.get(url)
    wait = WebDriverWait(driver, 120).until(EC.presence_of_element_located((By.CLASS_NAME, 'origin-field-data')))
    company = driver.find_elements_by_css_selector("h1.audited-url a")
    for i in company:
        data.append(i.get_attribute('href'))
print(data)
this outputs:
['https://www.crutchfield.com//', 'https://www.lastpass.com/', 'https://www.lastpass.com/']
which is correct, because the element locator we used matches 1 element on the first page and 2 elements on the second page.
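For Q3 (only the second URL's data surviving), the likely cause is that df = pd.DataFrame(...) is recreated inside the loop, so every URL starts from an empty frame. A sketch of the restructuring (unverified against the live page; it reuses the urls, driver and locators from the question and elides the other metric lookups):

    # Collect one row per URL, build the DataFrame once after the loop,
    # and only close the driver when every URL has been processed.
    rows = []
    for url in urls:
        driver.get(url)
        WebDriverWait(driver, 120).until(EC.presence_of_element_located((By.CLASS_NAME, 'origin-field-data')))
        co_name = driver.find_element_by_css_selector("h1.audited-url a").text
        score = driver.find_element_by_xpath(
            '//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[1]/div/div[1]/a/div[2]').text
        # gather the remaining metrics (Core Web Vitals, FCP, FID, CLS, TTI, TBT) the same way
        rows.append({'Title': co_name, 'Total Score': score})
    driver.close()
    df = pd.DataFrame(rows, columns=['Title', 'Core Web Vitals', 'FCP', 'FID', 'CLS', 'TTI', 'TBT', 'Total Score'])
    print(df)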

Only items from first Beautiful Soup object are being added to my lists

I suspect this isn't very complicated, but I can't seem to figure it out. I'm using Selenium and Beautiful Soup to parse Petango.com. The data will be used to help a local shelter understand how they compare on different metrics to other area shelters, so the next step will be taking these data frames and doing some analysis.
I grab the detail URLs from a different module and import the list here.
My issue is that my lists only show the values from the HTML for the first dog. I was stepping through and noticed my lengths are different for the soup iterations, so I realize my error is somewhere after that, but I can't figure out how to fix it.
Here is my code so far (running the whole process vs using a cached page):
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
from Petango import pet_links

headings = []
values = []
ShelterInfo = []
ShelterInfoWebsite = []
ShelterInfoEmail = []
ShelterInfoPhone = []
ShelterInfoAddress = []
Breed = []
Age = []
Color = []
SpayedNeutered = []
Size = []
Declawed = []
AdoptionDate = []

# to access sites, change url list to pet_links (break out as needed) and change if false to true. false looks to the html file
url_list = (pet_links[4], pet_links[6], pet_links[8])
#url_list = ("Petango.html", "Petango.html", "Petango.html")

for link in url_list:
    page_source = None
    if True:
        #pet page = link should populate links from above, hard code link was for 1 detail page, =to html was for cached site
        PetPage = link
        #PetPage = 'https://www.petango.com/Adopt/Dog-Terrier-American-Pit-Bull-45569732'
        #PetPage = Petango.html
        PetDriver = webdriver.Chrome(executable_path='/Users/paulcarson/Downloads/chromedriver')
        PetDriver.implicitly_wait(30)
        PetDriver.get(link)
        page_source = PetDriver.page_source
        PetDriver.close()
    else:
        with open("Petango.html", 'r') as f:
            page_source = f.read()

    PetSoup = BeautifulSoup(page_source, 'html.parser')
    print(len(PetSoup.text))

    #get the details about the shelter and add to lists
    ShelterInfo.append(PetSoup.find('div', class_="DNNModuleContent ModPethealthPetangoDnnModulesShelterShortInfoC").find('h4').text)
    ShelterInfoParagraphs = PetSoup.find('div', class_="DNNModuleContent ModPethealthPetangoDnnModulesShelterShortInfoC").find_all('p')
    First_Paragraph = ShelterInfoParagraphs[0]
    if "Website" not in First_Paragraph.text:
        raise AssertionError("first paragraph is not about site")
    ShelterInfoWebsite.append(First_Paragraph.find('a').text)
    Second_Paragraph = ShelterInfoParagraphs[1]
    ShelterInfoEmail.append(Second_Paragraph.find('a')['href'])
    Third_Paragraph = ShelterInfoParagraphs[2]
    ShelterInfoPhone.append(Third_Paragraph.find('span').text)
    Fourth_Paragraph = ShelterInfoParagraphs[3]
    ShelterInfoAddress.append(Fourth_Paragraph.find('span').text)

    #get the details about the pet
    ul = PetSoup.find('div', class_='group details-list').ul  # Gets the ul tag
    li_items = ul.find_all('li')  # Finds all the li tags within the ul tag
    for li in li_items:
        heading = li.strong.text
        headings.append(heading)
        value = li.span.text
        if value:
            values.append(value)
        else:
            values.append(None)

    Breed.append(values[0])
    Age.append(values[1])
    print(Age)
    Color.append(values[2])
    SpayedNeutered.append(values[3])
    Size.append(values[4])
    Declawed.append(values[5])
    AdoptionDate.append(values[6])

ShelterDF = pd.DataFrame(
    {
        'Shelter': ShelterInfo,
        'Shelter Website': ShelterInfoWebsite,
        'Shelter Email': ShelterInfoEmail,
        'Shelter Phone Number': ShelterInfoPhone,
        'Shelter Address': ShelterInfoAddress
    })

PetDF = pd.DataFrame(
    {'Breed': Breed,
     'Age': Age,
     'Color': Color,
     'Spayed/Neutered': SpayedNeutered,
     'Size': Size,
     'Declawed': Declawed,
     'Adoption Date': AdoptionDate
     })

print(PetDF)
print(ShelterDF)
Output from printing the len and the Age value as the loop progresses:
12783
['6y 7m']
10687
['6y 7m', '6y 7m']
10705
['6y 7m', '6y 7m', '6y 7m']
Could someone please point me in the right direction?
Thank you for your help!
Paul
You need to change the find method into find_all() in BeautifulSoup so it locates all the elements.
values is global and you only ever append the value at a fixed index of that list to Age:
Age.append(values[1])
The same problem applies to your other global lists (a static index, whether 1 or 2, etc.).
You need a way to track the appropriate index to use, perhaps through a counter, or determine other logic to ensure the current value is added, e.g. with the current Age, is it the second li in the loop? Or just append PetSoup.select_one("[data-bind='text: age']").text
It looks like each item of interest, e.g. colour, spayed, contains the data-bind attribute, so you can use those with the appropriate attribute value to select each value and avoid a loop over the li elements.
e.g. current_colour = PetSoup.select_one("[data-bind='text: color']").text
It is best to set the result in a variable and test that it is not None before accessing .text.
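A small helper along those lines (a sketch; the binding names other than age and color are guesses and would need checking against the page):

    def get_detail(soup, field):
        # Look the value up on the current page by its data-bind attribute.
        element = soup.select_one(f"[data-bind='text: {field}']")
        return element.text.strip() if element is not None else None

    # Inside the for-link loop, after PetSoup is built:
    # Age.append(get_detail(PetSoup, 'age'))
    # Color.append(get_detail(PetSoup, 'color'))
    # Breed.append(get_detail(PetSoup, 'breed'))  # 'breed' is a guessed binding name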

List is being overwritten

So I am scraping listings off of Craigslist and my lists of titles, prices, and dates are being overwritten every time the web driver goes to the next page. In the end, the only data in my .csv file and MongoDB collection are the listings on the last page.
I tried moving the instantiation of the lists but it still overwrites.
The function that extracts listing information from a page:
def extract_post_information(self):
    all_posts = self.driver.find_elements_by_class_name("result-row")
    dates = []
    titles = []
    prices = []
    for post in all_posts:
        title = post.text.split("$")
        if title[0] == '':
            title = title[1]
        else:
            title = title[0]
        title = title.split("\n")
        price = title[0]
        title = title[-1]
        title = title.split(" ")
        month = title[0]
        day = title[1]
        title = ' '.join(title[2:])
        date = month + " " + day
        if not price[:1].isdigit():
            price = "0"
        int(price)
        titles.append(title)
        prices.append(price)
        dates.append(date)
    return titles, prices, dates
The function that goes to the URL and keeps going to the next page until there is no more next page:
def load_craigslist_url(self):
    self.driver.get(self.url)
    while True:
        try:
            wait = WebDriverWait(self.driver, self.delay)
            wait.until(EC.presence_of_element_located((By.ID, "searchform")))
            print("Page is loaded")
            self.extract_post_information()
            WebDriverWait(self.driver, 2).until(
                EC.element_to_be_clickable((By.XPATH, '//*[@id="searchform"]/div[3]/div[3]/span[2]/a[3]'))).click()
        except:
            print("Last page")
            break
My main
if __name__ == "__main__":
filepath = '/home/diego/git_workspace/PyScrape/data.csv' # Filepath of written csv file
location = "philadelphia" # Location Craigslist searches
postal_code = "19132" # Postal code Craigslist uses as a base for 'MILES FROM ZIP'
max_price = "700" # Max price Craigslist limits the items too
query = "graphics+card" # Type of item you are looking for
radius = "400" # Radius from postal code Craigslist limits the search to
# s = 0
scraper = CraigslistScraper(location, postal_code, max_price, query, radius)
scraper.load_craigslist_url()
titles, prices, dates = scraper.extract_post_information()
d = [titles, prices, dates]
export_data = zip_longest(*d, fillvalue='')
with open('data.csv', 'w', encoding="utf8", newline='') as my_file:
wr = csv.writer(my_file)
wr.writerow(("Titles", "Prices", "Dates"))
wr.writerows(export_data)
my_file.close()
# scraper.kill()
scraper.upload_to_mongodb(filepath)
What I expect it to do is get all the info from one page, go to the next page, get all of that page's info, and append it to the three lists titles, prices, and dates in the extract_post_information function. Once there are no more next pages, create a list out of those three lists called d (seen in my main function).
Should I put the extract_post_information function in the load_craigslist_url function? Or do I have to tweak where I instantiate the three lists in the extract_post_information function?
In the load_craigslist_url() function, you're calling self.extract_post_information() without saving the returned information.
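One way to act on that (a sketch, not tested against the site): have load_craigslist_url() accumulate and return every page's results, and let main unpack them:

    def load_craigslist_url(self):
        # Collect every page's results instead of discarding each call's return value.
        all_titles, all_prices, all_dates = [], [], []
        self.driver.get(self.url)
        while True:
            try:
                WebDriverWait(self.driver, self.delay).until(
                    EC.presence_of_element_located((By.ID, "searchform")))
                titles, prices, dates = self.extract_post_information()
                all_titles.extend(titles)
                all_prices.extend(prices)
                all_dates.extend(dates)
                WebDriverWait(self.driver, 2).until(
                    EC.element_to_be_clickable((By.XPATH, '//*[@id="searchform"]/div[3]/div[3]/span[2]/a[3]'))).click()
            except:
                print("Last page")
                break
        return all_titles, all_prices, all_dates

    # In main: titles, prices, dates = scraper.load_craigslist_url()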

The for loop overwrite the results of another loop and I get only the last result

My problem is that I only get the last results of the loop; it overwrites the other results and shows me only the last ones.
I need to obtain from the JSON some information about a list of songs. This information is stored in a variable resource_url. So first of all I take the list of tracks:
r = rq.get(url + tag)
time.sleep(2)
list = json.loads(r.content)
I get the resource url of each track:
c2 = pd.DataFrame(columns=["title", "resource_url"])
for i, row in enumerate(list["results"]):
    title = row["title"]
    resource_url = row["resource_url"]
    c2.loc[i] = [title, resource_url]
Each track has several songs so I get the songs:
for i, row in enumerate(list["results"]):
    url = row['resource_url']
    r = rq.get(url)
    time.sleep(2)
    songs = json.loads(r.content)
And then I try to get the duration of each song:
c3 = pd.DataFrame(columns=["title", "duration"])
for i, row in enumerate(songs["list"]):
    title = row["title"]
    duration = row["duration"]
    c3.loc[i] = [title, duration]
c3.head(24)
I obtain only the information of the songs of the last track but I need to get all of them. They are overwritten.
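A sketch of one way to keep every track's songs (assuming each per-track response really has a "list" of songs with "title" and "duration"): build the rows inside a single pass over the tracks, instead of rebuilding c3 from only the last value of songs after the loop.

    # Accumulate one row per song across all tracks, then build the DataFrame once.
    rows = []
    for row in list["results"]:
        r = rq.get(row["resource_url"])
        time.sleep(2)
        songs = json.loads(r.content)
        for song in songs["list"]:
            rows.append({"title": song["title"], "duration": song["duration"]})
    c3 = pd.DataFrame(rows, columns=["title", "duration"])
    c3.head(24)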
