So I am scraping listings off of Craigslist, and my lists of titles, prices, and dates are being overwritten every time the web driver goes to the next page. In the end, the only data in my .csv file and MongoDB collection are the listings from the last page.
I tried moving the instantiation of the lists but it still overwrites.
The function that extracts listing information from a page:
def extract_post_information(self):
    all_posts = self.driver.find_elements_by_class_name("result-row")
    dates = []
    titles = []
    prices = []
    for post in all_posts:
        title = post.text.split("$")
        if title[0] == '':
            title = title[1]
        else:
            title = title[0]
        title = title.split("\n")
        price = title[0]
        title = title[-1]
        title = title.split(" ")
        month = title[0]
        day = title[1]
        title = ' '.join(title[2:])
        date = month + " " + day
        if not price[:1].isdigit():
            price = "0"
        int(price)
        titles.append(title)
        prices.append(price)
        dates.append(date)
    return titles, prices, dates
The function that navigates to the URL and keeps going to the next page until there is no next page:
def load_craigslist_url(self):
    self.driver.get(self.url)
    while True:
        try:
            wait = WebDriverWait(self.driver, self.delay)
            wait.until(EC.presence_of_element_located((By.ID, "searchform")))
            print("Page is loaded")
            self.extract_post_information()
            WebDriverWait(self.driver, 2).until(
                EC.element_to_be_clickable((By.XPATH, '//*[@id="searchform"]/div[3]/div[3]/span[2]/a[3]'))).click()
        except:
            print("Last page")
            break
My main
if __name__ == "__main__":
    filepath = '/home/diego/git_workspace/PyScrape/data.csv'  # Filepath of written csv file
    location = "philadelphia"  # Location Craigslist searches
    postal_code = "19132"  # Postal code Craigslist uses as a base for 'MILES FROM ZIP'
    max_price = "700"  # Max price Craigslist limits the items to
    query = "graphics+card"  # Type of item you are looking for
    radius = "400"  # Radius from postal code Craigslist limits the search to
    # s = 0

    scraper = CraigslistScraper(location, postal_code, max_price, query, radius)
    scraper.load_craigslist_url()
    titles, prices, dates = scraper.extract_post_information()

    d = [titles, prices, dates]
    export_data = zip_longest(*d, fillvalue='')
    with open('data.csv', 'w', encoding="utf8", newline='') as my_file:
        wr = csv.writer(my_file)
        wr.writerow(("Titles", "Prices", "Dates"))
        wr.writerows(export_data)
    my_file.close()

    # scraper.kill()
    scraper.upload_to_mongodb(filepath)
What I expect it to do is get all the info from one page, go to the next page, get all of that page's info, and append it to the three lists (titles, prices, and dates) in the extract_post_information function. Once there are no more pages, create a list called d out of those three lists (seen in my main function).
Should I put the extract_post_information function inside the load_craigslist_url function? Or do I have to tweak where I instantiate the three lists in the extract_post_information function?
In the load_craigslist_url() function, you're calling self.extract_post_information() without saving the returned information.
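Because that return value is thrown away, each page's data vanishes when the next page is scraped. The fix is to accumulate the per-page results in lists that outlive the loop. A minimal sketch of the pattern (the Selenium fetching is replaced by plain tuples here, since the point is the accumulation, not the driver):

```python
# Sketch of the accumulation pattern: extend master lists with each
# page's results instead of discarding them. Each element of `pages`
# stands in for the return value of one extract_post_information() call.

def collect_all_pages(pages):
    all_titles, all_prices, all_dates = [], [], []
    for titles, prices, dates in pages:
        all_titles.extend(titles)
        all_prices.extend(prices)
        all_dates.extend(dates)
    return all_titles, all_prices, all_dates

# Simulated results from two pages:
page1 = (["gpu A"], ["100"], ["sep 1"])
page2 = (["gpu B"], ["200"], ["sep 2"])
titles, prices, dates = collect_all_pages([page1, page2])
print(titles)  # ['gpu A', 'gpu B'] - both pages survive
```

In the real script this means either saving the tuples returned inside load_craigslist_url(), or moving the three list instantiations out of extract_post_information() so they persist across pages.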
I am quite new to scraping with XPath. I am trying to scrape product information on Target. Using Selenium and XPath I successfully get the price and name, but XPath cannot return any value when scraping for product size and sales location.
For example, for this URL "https://www.target.com/p/pataday-once-daily-relief-extra-strength-drops-0-085-fl-oz/-/A-83775159?preselect=81887758#lnk=sametab", the xpath for size is:
//*[@id="pageBodyContainer"]/div[1]/div[2]/div[2]/div/div[3]/div/div[1]/text()
The xpath for sales location is:
//*[@id="pageBodyContainer"]/div[1]/div[2]/div[2]/div/div[1]/div[2]/span
I also tried to get these two elements by using requests, but that did not work either. Does anyone know why this happens? Any help appreciated. Thanks.
Following is my code:
def job_function():
    urlList = ['https://www.target.com/p/pataday-once-daily-relief-extra-strength-drops-0-085-fl-oz/-/A-83775159?preselect=81887758#lnk=sametab',
               'https://www.target.com/p/kleenex-ultra-soft-facial-tissue/-/A-84780536?preselect=12964744#lnk=sametab',
               'https://www.target.com/p/claritin-24-hour-non-drowsy-allergy-relief-tablets-loratadine/-/A-80354268?preselect=14351285#lnk=sametab',
               'https://www.target.com/p/opti-free-pure-moist-rewetting-drops-0-4-fl-oz/-/A-14358641#lnk=sametab'
               ]

    def ScrapingTarget(url):
        AArray = []
        wait_imp = 10
        CO = webdriver.ChromeOptions()
        CO.add_experimental_option('useAutomationExtension', False)
        CO.add_argument('--ignore-certificate-errors')
        CO.add_argument('--start-maximized')
        wd = webdriver.Chrome(r'D:\chromedriver\chromedriver_win32new\chromedriver_win32 (2)\chromedriver.exe',
                              options=CO)
        wd.get(url)
        wd.implicitly_wait(wait_imp)
        sleep(1)

        # start scraping
        name = wd.find_element(by=By.XPATH, value="//*[@id='pageBodyContainer']/div[1]/div[1]/h1/span").text
        sleep(0.5)
        price = wd.find_element(by=By.XPATH, value="//*[@id='pageBodyContainer']/div[1]/div[2]/div[2]/div/div[1]/div[1]/span").text
        sleep(0.5)
        try:
            size = wd.find_element(by=By.XPATH, value="//*[@id='pageBodyContainer']/div[1]/div[2]/div[2]/div/div[3]/div/div[1]/text()").text
        except:
            size = "none"
        sleep(0.5)
        try:
            sales_location = wd.find_element(by=By.XPATH, value="//*[@id='pageBodyContainer']/div[1]/div[2]/div[2]/div/div[1]/div[2]/span").text
        except:
            sales_location = "none"
        tz = pytz.timezone('Etc/GMT-0')
        GMT = datetime.now(tz).strftime("%Y-%m-%d %H:%M:%S")

        AArray.append([name, price, size, sales_location, GMT])
        with open(
                r'C:\Users\12987\PycharmProjects\python\Network\priceingAlgoriCoding\export_Target_dataframe.csv',
                'a', newline="", encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerows(AArray)

    with concurrent.futures.ThreadPoolExecutor(4) as executor:
        executor.map(ScrapingTarget, urlList)

sched = BlockingScheduler()
sched.add_job(job_function, 'interval', seconds=60)
sched.start()
Very new to Python and Selenium, looking to scrape a few data points. I'm struggling in three areas:
I don't understand how to loop through multiple URLs properly
I can't figure out why the script is iterating twice over each URL
I can't figure out why it's only outputting the data for the second URL
Much thanks for taking a look!
Here's my current script:
urls = [
    'https://developers.google.com/speed/pagespeed/insights/?url=https://www.crutchfield.com/%2F&tab=mobile',
    'https://developers.google.com/speed/pagespeed/insights/?url=https://www.lastpass.com%2F&tab=mobile'
]

driver = webdriver.Chrome(executable_path='/Library/Frameworks/Python.framework/Versions/3.9/bin/chromedriver')

for url in urls:
    for page in range(0, 1):
        driver.get(url)
        wait = WebDriverWait(driver, 120).until(EC.presence_of_element_located((By.CLASS_NAME, 'origin-field-data')))
        df = pd.DataFrame(columns=['Title', 'Core Web Vitals', 'FCP', 'FID', 'CLS', 'TTI', 'TBT', 'Total Score'])
        company = driver.find_elements_by_class_name("audited-url__link")
        data = []
        for i in company:
            data.append(i.get_attribute('href'))
        for x in data:
            # Get URL name
            title = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[1]/div/div[2]/h1/a')
            co_name = title.text
            # Get Core Web Vitals text pass/fail
            cwv = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[2]/div/div[1]/div[1]/div[1]/span[2]')
            core_web = cwv.text
            # Get FCP
            fcp = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[2]/div/div[1]/div[1]/div[2]/div[1]/div[1]/div')
            first_content = fcp.text
            # Get FID
            fid = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[2]/div/div[1]/div[1]/div[2]/div[3]/div[1]/div')
            first_input = fid.text
            # Get CLS
            cls = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[2]/div/div[1]/div[1]/div[2]/div[4]/div[1]/div')
            layout_shift = cls.text
            # Get TTI
            tti = driver.find_element_by_xpath('//*[@id="interactive"]/div/div[1]')
            time_interactive = tti.text
            # Get TBT
            tbt = driver.find_element_by_xpath('//*[@id="total-blocking-time"]/div/div[1]')
            total_block = tbt.text
            # Get Total Score
            total_score = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[1]/div/div[1]/a/div[2]')
            score = total_score.text
            # Adding all columns to dataframe
            df.loc[len(df)] = [co_name, core_web, first_content, first_input, layout_shift, time_interactive, total_block, score]

driver.close()

# df.to_csv('Double Page Speed Test 9-10.csv')
print(df)
Q1: I don't understand how to loop through multiple URLs properly?
Ans: for url in urls:
Q2: I can't figure out why the script is iterating twice over each URL.
Ans: Because you have for page in range(0, 1):
Update 1:
I did not run your entire code with the DataFrame. Also, sometimes one of the pages does not show the number and href, but when I run the code below,
driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.implicitly_wait(50)
wait = WebDriverWait(driver, 20)

urls = [
    'https://developers.google.com/speed/pagespeed/insights/?url=https://www.crutchfield.com/%2F&tab=mobile',
    'https://developers.google.com/speed/pagespeed/insights/?url=https://www.lastpass.com%2F&tab=mobile'
]

data = []
for url in urls:
    driver.get(url)
    wait = WebDriverWait(driver, 120).until(EC.presence_of_element_located((By.CLASS_NAME, 'origin-field-data')))
    company = driver.find_elements_by_css_selector("h1.audited-url a")
    for i in company:
        data.append(i.get_attribute('href'))
print(data)
this is the output:
['https://www.crutchfield.com//', 'https://www.lastpass.com/', 'https://www.lastpass.com/']
which is correct, because the element locator we used matches 1 element on page 1 and 2 elements on page 2.
I am building a scraper for eBay. I am trying to figure out a way to manipulate the page-number portion of the eBay URL to go to the next page until there are no more pages (if you were on page 2, the page-number portion would look like "_pgn=2"). I noticed that if you put in any number greater than the max number of pages a listing has, the page reloads to the last page rather than giving a page-doesn't-exist error. (If a listing has 5 pages, then _pgn=5 and _pgn=100 route to the same page.) How can I start at page one, get the HTML soup of the page, extract all the relevant data I want from the soup, then load the next page with the new page number, and repeat the process until there are no new pages to scrape? I tried to get the number of results a listing has by using a Selenium XPath, then math.ceil the quotient of the number of results and 50 (the default max listings per page) and use that quotient as my max_page, but I get errors saying the element doesn't exist even though it does (self.driver.find_element_by_xpath('xpath').text). That 243 is what I am trying to get with the xpath.
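The max-page arithmetic described above (result count divided by 50, rounded up) is straightforward once the count is in hand; the failing XPath is the only missing piece. A quick sketch of that computation:

```python
import math

def max_pages(num_results, per_page=50):
    # Number of pages needed to show num_results listings at
    # per_page listings per page; e.g. the 243 results mentioned
    # above span 5 pages of 50.
    return math.ceil(num_results / per_page)

print(max_pages(243))  # 5
```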
class EbayScraper(object):
    def __init__(self, item, buying_type):
        self.base_url = "https://www.ebay.com/sch/i.html?_nkw="
        self.driver = webdriver.Chrome(r"chromedriver.exe")
        self.item = item
        self.buying_type = buying_type + "=1"
        self.url_seperator = "&_sop=12&rt=nc&LH_"
        self.url_seperator2 = "&_pgn="
        self.page_num = "1"

    def getPageUrl(self):
        if self.buying_type == "Buy It Now=1":
            self.buying_type = "BIN=1"
        self.item = self.item.replace(" ", "+")
        url = self.base_url + self.item + self.url_seperator + self.buying_type + self.url_seperator2 + self.page_num
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup

    def getInfo(self, soup):
        for listing in soup.find_all("li", {"class": "s-item"}):
            raw = listing.find_all("a", {"class": "s-item__link"})
            if raw:
                raw_price = listing.find_all("span", {"class": "s-item__price"})[0]
                raw_title = listing.find_all("h3", {"class": "s-item__title"})[0]
                raw_link = listing.find_all("a", {"class": "s-item__link"})[0]
                raw_condition = listing.find_all("span", {"class": "SECONDARY_INFO"})[0]
                condition = raw_condition.text
                price = float(raw_price.text[1:])
                title = raw_title.text
                link = raw_link['href']
                print(title)
                print(condition)
                print(price)
                if self.buying_type != "BIN=1":
                    raw_time_left = listing.find_all("span", {"class": "s-item__time-left"})[0]
                    time_left = raw_time_left.text[:-4]
                    print(time_left)
                print(link)
                print('\n')


if __name__ == '__main__':
    item = input("Item: ")
    buying_type = input("Buying Type (e.g, 'Buy It Now' or 'Auction'): ")
    instance = EbayScraper(item, buying_type)
    page = instance.getPageUrl()
    instance.getInfo(page)
If you want to iterate over all pages and gather all results, then your script needs to check, after visiting each page, whether there is a next page:
import requests
from bs4 import BeautifulSoup


class EbayScraper(object):
    def __init__(self, item, buying_type):
        ...
        self.currentPage = 1

    def get_url(self, page=1):
        if self.buying_type == "Buy It Now=1":
            self.buying_type = "BIN=1"
        self.item = self.item.replace(" ", "+")
        # _ipg=200 means we expect 200 items per page
        return '{}{}{}{}{}{}&_ipg=200'.format(
            self.base_url, self.item, self.url_seperator, self.buying_type,
            self.url_seperator2, page
        )

    def page_has_next(self, soup):
        container = soup.find('ol', 'x-pagination__ol')
        currentPage = container.find('li', 'x-pagination__li--selected')
        next_sibling = currentPage.next_sibling
        if next_sibling is None:
            print(container)
        return next_sibling is not None

    def iterate_page(self):
        # this will loop while there are more pages, otherwise end
        while True:
            page = self.getPageUrl(self.currentPage)
            self.getInfo(page)
            if self.page_has_next(page) is False:
                break
            else:
                self.currentPage += 1

    def getPageUrl(self, pageNum):
        url = self.get_url(pageNum)
        print('page: ', url)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup

    def getInfo(self, soup):
        ...


if __name__ == '__main__':
    item = input("Item: ")
    buying_type = input("Buying Type (e.g, 'Buy It Now' or 'Auction'): ")
    instance = EbayScraper(item, buying_type)
    instance.iterate_page()
The important functions here are page_has_next and iterate_page.
page_has_next - a function that checks whether the page's pagination has another li element after the selected page. E.g. with < 1 2 3 >, if we are on page 1, it checks whether 2 comes next.
iterate_page - a function that loops until there is no next page.
Also note that you don't need Selenium for this unless you need to mimic user clicks or need a browser to navigate.
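The sibling check can be illustrated without eBay at all: given the sequence of page markers and the selected one, "has next" is just whether another marker follows. A toy version of the same logic (a plain list stands in for the pagination `<ol>` and its `<li>` elements):

```python
def page_has_next(pages, selected):
    # True when another page marker follows the selected one,
    # mirroring the next_sibling check on the pagination list.
    idx = pages.index(selected)
    return idx + 1 < len(pages)

print(page_has_next([1, 2, 3], 1))  # True - page 2 follows
print(page_has_next([1, 2, 3], 3))  # False - 3 is the last page
```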
So I have this Python script that scrapes listings off of a specific Craigslist URL the user constructs (location, max price, type of item, etc.). It goes to the URL, scrapes the listings' info (price, date posted, etc.) and returns three outputs. One is 'x' number of items around the average price (the user determines the number of items and the range of prices, such as $100 off the average price). Next are the 'x' closest listings based on the zip code the user provided in the beginning (the user also determines the number of items displayed based on proximity to the zip code). Lastly, the Craigslist URL is output to the user so they can visit the page and look at the items displayed to them earlier. The data of the scrape is stored in a data.json file and a data.csv file; the content is the same, just in different formats. I would like to offload this data to a database every time a scrape is done, either Cloud Firestore or AWS DynamoDB, since I want to host this as a web app in the future.
What I want to do is allow the user to have multiple instances of the same scripts all with unique craigslist urls running at the same time. All of the code is the same, the only difference are the craigslist urls that the script scrapes.
I made a method that iterates through the creation of the attributes (location, max price, etc.) and returns a list of the URLs, but in my main I call the constructor and it needs all of those attributes, so I would have to fish them back out of the URLs, which seemed over the top.
I then tried to have the loop in my main: the user determines how many URL links they want to make, and the completed links are appended to a list. Again I ran into the same problem.
class CraigslistScraper(object):
    # Constructor of the URL that is being scraped
    def __init__(self, location, postal_code, max_price, query, radius):
        self.location = location  # Location (i.e. city) being searched
        self.postal_code = postal_code  # Postal code of location being searched
        self.max_price = max_price  # Max price of the items that will be searched
        self.query = query  # Search for the type of items that will be searched
        self.radius = radius  # Radius of the area searched derived from the postal code given previously
        self.url = f"https://{location}.craigslist.org/search/sss?&max_price={max_price}&postal={postal_code}&query={query}&20card&search_distance={radius}"
        self.driver = webdriver.Chrome(r"C:\Program Files\chromedriver")  # Path of Chrome web driver
        self.delay = 7  # The delay the driver gives when loading the web page

    # Load up the web page
    # Gets all relevant data on the page
    # Goes to next page until we are at the last page
    def load_craigslist_url(self):
        data = []
        # url_list = []
        self.driver.get(self.url)
        while True:
            try:
                wait = WebDriverWait(self.driver, self.delay)
                wait.until(EC.presence_of_element_located((By.ID, "searchform")))
                data.append(self.extract_post_titles())
                # url_list.append(self.extract_post_urls())
                WebDriverWait(self.driver, 2).until(
                    EC.element_to_be_clickable((By.XPATH, '//*[@id="searchform"]/div[3]/div[3]/span[2]/a[3]'))).click()
            except:
                break
        return data
    # Extracts all relevant information from the web-page and returns them as individual lists
    def extract_post_titles(self):
        all_posts = self.driver.find_elements_by_class_name("result-row")
        dates_list = []
        titles_list = []
        prices_list = []
        distance_list = []
        for post in all_posts:
            title = post.text.split("$")
            if title[0] == '':
                title = title[1]
            else:
                title = title[0]
            title = title.split("\n")
            price = title[0]
            title = title[-1]
            title = title.split(" ")
            month = title[0]
            day = title[1]
            title = ' '.join(title[2:])
            date = month + " " + day
            if not price[:1].isdigit():
                price = "0"
            int(price)
            raw_distance = post.find_element_by_class_name(
                'maptag').text
            distance = raw_distance[:-2]
            titles_list.append(title)
            prices_list.append(price)
            dates_list.append(date)
            distance_list.append(distance)
        return titles_list, prices_list, dates_list, distance_list

    # Gets all of the url links of each listing on the page
    # def extract_post_urls(self):
    #     soup_list = []
    #     html_page = urllib.request.urlopen(self.driver.current_url)
    #     soup = BeautifulSoup(html_page, "html.parser")
    #     for link in soup.findAll("a", {"class": "result-title hdrlnk"}):
    #         soup_list.append(link["href"])
    #
    #     return soup_list
    # Kills browser
    def kill(self):
        self.driver.close()

    # Gets price value from dictionary and computes average
    @staticmethod
    def get_average(sample_dict):
        price = list(map(lambda x: x['Price'], sample_dict))
        sum_of_prices = sum(price)
        length_of_list = len(price)
        average = round(sum_of_prices / length_of_list)
        return average

    # Displays items around the average price of all the items in prices_list
    @staticmethod
    def get_items_around_average(avg, sample_dict, counter, give):
        print("Items around average price: ")
        print("-------------------------------------------")
        raw_list = []
        for z in range(len(sample_dict)):
            current_price = sample_dict[z].get('Price')
            if abs(current_price - avg) <= give:
                raw_list.append(sample_dict[z])
        final_list = raw_list[:counter]
        for index in range(len(final_list)):
            print('\n')
            for key in final_list[index]:
                print(key, ':', final_list[index][key])

    # Displays nearest items to the zip provided
    @staticmethod
    def get_items_around_zip(sample_dict, counter):
        final_list = []
        print('\n')
        print("Closest listings: ")
        print("-------------------------------------------")
        x = 0
        while x < counter:
            final_list.append(sample_dict[x])
            x += 1
        for index in range(len(final_list)):
            print('\n')
            for key in final_list[index]:
                print(key, ':', final_list[index][key])

    # Converts all_of_the_data list of dictionaries to json file
    @staticmethod
    def convert_to_json(sample_list):
        with open(r"C:\Users\diego\development\WebScraper\data.json", 'w') as file_out:
            file_out.write(json.dumps(sample_list, indent=4))

    @staticmethod
    def convert_to_csv(sample_list):
        df = pd.DataFrame(sample_list)
        df.to_csv("data.csv", index=False, header=True)
# Main where the big list data is broken down to its individual parts to be converted to a .csv file;
# also the parameters of the website are set
if __name__ == "__main__":
    location = input("Enter the location you would like to search: ")  # Location Craigslist searches
    zip_code = input(
        "Enter the zip code you would like to base radius off of: ")  # Postal code Craigslist uses as a base for 'MILES FROM ZIP'
    type_of_item = input(
        "Enter the item you would like to search (ex. furniture, bicycles, cars, etc.): ")  # Type of item you are looking for
    max_price = input(
        "Enter the max price you would like the search to use: ")  # Max price Craigslist limits the items to
    radius = input(
        "Enter the radius you would like the search to use (based off of zip code provided earlier): ")  # Radius from postal code Craigslist limits the search to

    scraper = CraigslistScraper(location, zip_code, max_price, type_of_item,
                                radius)  # Constructs the URL with the given parameters

    results = scraper.load_craigslist_url()  # Inserts the result of the scraping into a large multidimensional list

    titles_list = results[0][0]
    prices_list = list(map(int, results[0][1]))
    dates_list = results[0][2]
    distance_list = list(map(float, results[0][3]))

    scraper.kill()

    # Merge all of the lists into a dictionary
    # Dictionary is then sorted by distance from smallest -> largest
    list_of_attributes = []
    for i in range(len(titles_list)):
        content = {'Listing': titles_list[i], 'Price': prices_list[i], 'Date posted': dates_list[i],
                   'Distance from zip': distance_list[i]}
        list_of_attributes.append(content)
    list_of_attributes.sort(key=lambda x: x['Distance from zip'])

    scraper.convert_to_json(list_of_attributes)
    scraper.convert_to_csv(list_of_attributes)
    # scraper.export_to_mongodb()

    # Below function calls:
    # Gets average price and prints it
    # Gets/prints listings around said average price
    # Gets/prints nearest listings
    average = scraper.get_average(list_of_attributes)
    print(f'Average price of items searched: ${average}')
    num_items_around_average = int(input("How many listings around the average price would you like to see?: "))
    avg_range = int(input("Range of listings around the average price: "))
    scraper.get_items_around_average(average, list_of_attributes, num_items_around_average, avg_range)
    print("\n")
    num_items = int(input("How many items would you like to display based off of proximity to zip code?: "))
    print(f"Items around you: ")
    scraper.get_items_around_zip(list_of_attributes, num_items)
    print("\n")
    print(f"Link of listings : {scraper.url}")
What I want is for the program to get the number of URLs the user wants to scrape. That input will determine the number of instances of this script that need to be running.
Then the user will run through the prompts for every scraper, such as making the URL ("What location would you like to search?: "). After they are done creating the URLs, each scraper will run with its specific URL and display back the three outputs described above, specific to the URL that scraper was assigned.
In the future I would like to add a timing function where the user determines how often the script runs (every hour, every day, every other day, etc.), connect to a database, and instead just query from the database the 'x' number of listings around the average price range and the 'x' closest listings based on proximity from that specific URL's results.
If you want several instances of your scraper running in parallel while your main runs in a loop, you need to use subprocesses.
https://docs.python.org/3/library/subprocess.html
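A minimal sketch of that approach: spawn one child process per URL and wait for them all. The command lines below are placeholders (a real run would pass something like `[sys.executable, "scraper.py", url]`, assuming the script reads its target URL from `sys.argv`):

```python
import subprocess
import sys

def run_parallel(commands):
    # Start every command as its own OS process, then wait for each
    # one to finish and collect the exit codes.
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]

# Trivial child processes standing in for per-URL scraper runs:
exit_codes = run_parallel([
    [sys.executable, "-c", "print('scraper 1 done')"],
    [sys.executable, "-c", "print('scraper 2 done')"],
])
print(exit_codes)  # [0, 0]
```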
My problem is that I get only the last results of the loop; it overwrites the other results and shows me only the last ones.
I need to obtain from the JSON some information about a list of songs. This information is stored in a variable resource_url. So first of all I need to get the list of tracks:
r = rq.get(url + tag)
time.sleep(2)
list = json.loads(r.content)
I get the resource url of each track:
c2 = pd.DataFrame(columns=["title", "resource_url"])
for i, row in enumerate(list["results"]):
    title = row["title"]
    master_url = row["resource_url"]
    c2.loc[i] = [title, resource_url]
Each track has several songs so I get the songs:
for i, row in enumerate(list["results"]):
    url = row['resource_url']
    r = rq.get(url)
    time.sleep(2)
    songs = json.loads(r.content)
And then I try to get the duration of each song:
c3 = pd.DataFrame(columns=["title", "duration"])
for i, row in enumerate(songs["list"]):
    title = row["title"]
    duration = row["duration"]
    c3.loc[i] = [title, duration]
c3.head(24)
I obtain only the information for the songs of the last track, but I need to get all of them. They are being overwritten.
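The overwriting happens because `songs` holds only the last track's response, and c3 is then built from it alone; the rows for every track need to be collected first, and turned into one DataFrame at the end. A sketch of the pattern, with the per-track `rq.get(...)`/`json.loads(...)` call replaced by a stub lookup:

```python
def collect_durations(tracks, fetch_songs):
    # Gather (title, duration) rows across ALL tracks before building
    # the DataFrame; fetch_songs stands in for the per-track request.
    rows = []
    for track in tracks:
        songs = fetch_songs(track["resource_url"])
        for song in songs["list"]:
            rows.append({"title": song["title"], "duration": song["duration"]})
    return rows  # afterwards: c3 = pd.DataFrame(rows)

# Simulated API responses for two tracks:
fake_api = {
    "u1": {"list": [{"title": "a", "duration": "3:10"}]},
    "u2": {"list": [{"title": "b", "duration": "2:45"}]},
}
rows = collect_durations(
    [{"resource_url": "u1"}, {"resource_url": "u2"}],
    lambda url: fake_api[url],
)
print(len(rows))  # 2: songs from every track, not just the last
```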