I need to get all the links from the following list of websites (a dataframe column converted to a list):
urls = df['URLs'].tolist()
and save the links for each URL in a new column (Links) in a copy of the original dataset.
To get the info from one of these websites, I am using:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('https://www.farmaciairisdiana.it/blog/')  # for example

for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])
This code works pretty well (I tested a few cases).
How can I iterate over each of those URLs and save the results into a new column?
You can iterate over the list of URLs and save each link to a result list. Then you can create a new dataframe or add the list as a new column.
For example:
http = httplib2.Http()

all_links = []
for url in urls:  # `urls` is your list from the question
    status, response = http.request(url)
    for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
        if link.has_attr('href'):
            all_links.append(link['href'])

new_df = pd.DataFrame({'Links': all_links})
print(new_df)

# or, if the number of links happens to match the number of rows:
# df['Links'] = all_links
EDIT: To create a new dataframe with one row per URL, you can use this example:
http = httplib2.Http()

all_links = []
for url in urls:  # `urls` is your list from the question
    status, response = http.request(url)
    l = []
    all_links.append({'URLs': url, 'Links': l})
    for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
        if link.has_attr('href'):
            l.append(link['href'])

new_df = pd.DataFrame(all_links)
print(new_df)
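If you want the links attached to a copy of the original dataframe (as asked in the question), one option is to merge this new dataframe back in on the URLs column; a minimal sketch, assuming the original column is still named 'URLs':
df_copy = df.copy()
df_copy = df_copy.merge(new_df, on='URLs', how='left')
print(df_copy)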
I have a list of URLs that all share the same first part. Every URL contains 'ingredient-disclosure' with the product category coming after it, separated by a /. I want to create a list that contains all the product categories.
So for the URL below, I want to grab the text 'commercial-professional' and store it in a list that contains all the product categories.
Here is one of the urls: https://churchdwight.com/ingredient-disclosure/commercial-professional/42000024-ah-trash-can-dumpster-deodorizer.aspx
Thank you for any help!
You might want to consider using a Python set to store the categories so you end up with one of each.
Try the following example that uses their index page to get possible links:
import requests
from bs4 import BeautifulSoup

url = "https://churchdwight.com/ingredient-disclosure/"
req = requests.get(url)
soup = BeautifulSoup(req.content, "html.parser")

categories = set()
for a_tag in soup.find_all("a", href=True):
    url_parts = [p for p in a_tag["href"].split('/') if p]
    if len(url_parts) > 2 and url_parts[0] == "ingredient-disclosure":
        categories.update([url_parts[1]])

print("\n".join(sorted(categories)))
This would give you the following categories:
Nausea-Relief
antiperspirant-deodorant
cleaning-products
commercial-professional
cough-allergy
dental-care
depilatories
fabric-softener-sheets
feminine-hygiene
hair-care
hand-sanitizer
hemorrhoid-relief
laundry-fabric-care
nasal-care
oral-care
pain-relief
pet-care
pool-products
sexual-health
skin-care
wound-care
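If you need the categories as a plain list (as the question asks) rather than a set, a minimal follow-up:
prod_cat_list = sorted(categories)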
You can split the URLs on the "/" character and take whatever you need from the resulting list:
prod_cat_list = []
url = 'https://churchdwight.com/ingredient-disclosure/commercial-professional/42000024-ah-trash-can-dumpster-deodorizer.aspx'
parts = url.split('/')
domain = parts[2]
prod_category = parts[4]
prod_cat_list.append(prod_category)
print(prod_cat_list)
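To handle a whole list of URLs rather than a single one, the same split can run in a loop; a minimal sketch, where `urls` stands in for your full list (a hypothetical name, since the question only shows one URL):
urls = [
    'https://churchdwight.com/ingredient-disclosure/commercial-professional/42000024-ah-trash-can-dumpster-deodorizer.aspx',
    # ... the rest of your URLs
]

prod_cat_list = []
for url in urls:
    parts = url.split('/')
    if len(parts) > 4:  # guard against malformed entries
        prod_cat_list.append(parts[4])

print(prod_cat_list)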
I am trying to get the actual URLs of shortened URLs using the following code (I have replaced the shortened URLs with others, as Stack Overflow doesn't allow them):
url_list = ['https://stackoverflow.com/questions/62242867/php-lumen-request-body-always-empty', 'https://twitter.com/i/web/status/1269102116364324865']

import requests

actual_list = []
for link in url_list:
    response = requests.get(link)
    actual_url = response.url
    actual_list = actual_url
print(actual_list)
At the end there is only the last URL left in actual_list, but I need each URL. Can someone tell me what I am doing wrong here?
You need to append the URLs to the list.
actual_list = []
for link in url_list:
    response = requests.get(link)
    actual_url = response.url
    actual_list.append(actual_url)
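Since response.url already holds the final URL after any redirects, the same loop can also be written as a list comprehension; a minimal sketch:
actual_list = [requests.get(link).url for link in url_list]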
Please try this; you need to use append to add each item to the list:
url_list = ['https://stackoverflow.com/questions/62242867/php-lumen-request-body-always-empty', 'https://twitter.com/i/web/status/1269102116364324865']

import requests

actual_list = []
for link in url_list:
    response = requests.get(link)
    actual_url = response.url
    actual_list.append(actual_url)
print(actual_list)
You need to append the URL to the list:
actual_list = []
for link in url_list:
    response = requests.get(link)
    actual_url = response.url
    actual_list.append(actual_url)
Instead, you are assigning the URL to the actual_list variable, which overwrites it on every iteration.
I've got a list of IDs which I want to pass through the URLs to collect data on the comments. But I'm a bit of a newbie, and when I try to iterate over the list I get only one URL, and consequently data for only one comment. Can someone please explain what's wrong with my code, and how to build URLs for all the IDs in the list and collect the data for all comments?
comments_from_reddit = ['fkkmga7', 'fkkgxtj', 'fkklfx3', ...]
def getPushshiftData():
    for ID in range(len(comments_from_reddit)):
        url = 'https://api.pushshift.io/reddit/comment/search?ids={}'.format(comments_from_reddit[ID])
        print(url)
        req = requests.get(url)
        data = json.loads(req.text)
        return data['data']
data = getPushshiftData()
Output I'm getting: https://api.pushshift.io/reddit/comment/search?ids=fkkmga7
I will really appreciate any help on my issue. Thanks for your attention.
This should work:
import requests
import json

comments_from_reddit = ['fkkmga7', 'fkkgxtj', 'fkklfx3', ...]

def getPushshiftData():
    result = list()
    for ID in range(len(comments_from_reddit)):
        url = 'https://api.pushshift.io/reddit/comment/search?ids={}'.format(comments_from_reddit[ID])
        print(url)
        req = requests.get(url)
        data = json.loads(req.text)
        result.append(data['data'])
    return result

data = getPushshiftData()
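Note that each element of the returned list is itself a list (the `data` array Pushshift returns for that ID), so if you want one flat list of comment objects, a minimal follow-up:
flat_comments = [comment for batch in data for comment in batch]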
I have a txt file with these values:
https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/
http://www.redbook.com.au/cars/research/used/details/1968-ford-fairmont-xt-manual/SPOT-ITM-336135
http://www.redbook.com.au/cars/research/used/details/1968-ford-f100-manual/SPOT-ITM-317784
Code:
from bs4 import BeautifulSoup
from lxml import html
import requests
import pandas as pd

url = 'https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/'
headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(url, headers=headers)
tree = html.fromstring(page.content)

car_data = {}

# Overview
if tree.xpath('//tr[td="Badge"]//following-sibling::td[2]/text()'):
    badge = tree.xpath('//tr[td="Badge"]//following-sibling::td[2]/text()')[0]
    car_data["badge"] = badge
if tree.xpath('//tr[td="Series"]//following-sibling::td[2]/text()'):
    car_data["series"] = tree.xpath('//tr[td="Series"]//following-sibling::td[2]/text()')[0]
if tree.xpath('//tr[td="Body"]//following-sibling::td[2]/text()'):
    car_data["body_small"] = tree.xpath('//tr[td="Body"]//following-sibling::td[2]/text()')[0]

df = pd.DataFrame([car_data])
Output:
df=
badge body_small series
0 50 Years Edition Sedan 10th Gen
How do I take all the URLs from the txt file and loop over them so that the output appends all the values into a dict or df?
Expected output:
badge body_small series
0 50 Years Edition Sedan 10th Gen
1 (No Badge) Sedan XT
2 (No Badge) Utility (No Series)
I tried converting the file into a list and using a for loop:
url = ['https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/', 'http://www.redbook.com.au/cars/research/used/details/1966-ford-falcon-deluxe-xp-manual/SPOT-ITM-386381']
headers = {'User-Agent': 'Mozilla/5.0'}
for lop in url:
    page = requests.get(lop, headers=headers)
But only one URL value is produced. Also, if there are 1000 URLs, converting them to a list will take a lot of time.
The problem with your code is that you are overwriting the variable 'page' on every pass through the for loop, so you end up with the data of the last request only.
Below is the corrected code:
url = ['https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/', 'http://www.redbook.com.au/cars/research/used/details/1966-ford-falcon-deluxe-xp-manual/SPOT-ITM-386381']
headers = {'User-Agent': 'Mozilla/5.0'}
page = []
for lop in url:
    page.append(requests.get(lop, headers=headers).text)
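Each entry in `page` now holds the raw HTML of one car page, so it can be parsed afterwards with the same lxml/xpath logic as in the question; a minimal sketch for one field:
from lxml import html

cars = []
for text in page:
    tree = html.fromstring(text)
    car_data = {}
    if tree.xpath('//tr[td="Badge"]//following-sibling::td[2]/text()'):
        car_data["badge"] = tree.xpath('//tr[td="Badge"]//following-sibling::td[2]/text()')[0]
    cars.append(car_data)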
Here is another approach. The code will generate a dictionary where each entry maps the URL (key) to the scraped data (value):
from bs4 import BeautifulSoup
import requests

def get_cars_data(url):
    cars_data = {}
    # TODO read the data using requests and with BS populate 'cars_data'
    return cars_data

all_cars = {}
with open('urls.txt') as f:
    urls = [line.strip() for line in f.readlines()]
    for url in urls:
        all_cars[url] = get_cars_data(url)

print('done')
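If you then want the collected dict as a dataframe (one row per URL, as in the expected output), a minimal sketch, assuming pandas is available:
import pandas as pd

df = pd.DataFrame.from_dict(all_cars, orient='index')
print(df)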
If I understood your question correctly, then this is the answer to your question.
from bs4 import BeautifulSoup
from lxml import html
import requests

cars = []  # global list for storing each car_data dict

f = open("file.txt", 'r')  # file.txt would contain all the links that you wish to read

# This for loop will perform your thing for each url in the file
for url in f:
    url = url.strip()  # drop the trailing newline from each line
    car_data = {}  # use it as a local variable
    headers = {'User-Agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=headers)
    tree = html.fromstring(page.content)
    # Overview
    if tree.xpath('//tr[td="Badge"]//following-sibling::td[2]/text()'):
        badge = tree.xpath('//tr[td="Badge"]//following-sibling::td[2]/text()')[0]
        car_data["badge"] = badge
    if tree.xpath('//tr[td="Series"]//following-sibling::td[2]/text()'):
        car_data["series"] = tree.xpath('//tr[td="Series"]//following-sibling::td[2]/text()')[0]
    if tree.xpath('//tr[td="Body"]//following-sibling::td[2]/text()'):
        car_data["body_small"] = tree.xpath('//tr[td="Body"]//following-sibling::td[2]/text()')[0]
    cars.append(car_data)  # append it to the global list
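To get the expected dataframe at the end, the collected list of dicts can be passed straight to pandas; a minimal sketch, assuming pandas is available:
import pandas as pd

df = pd.DataFrame(cars)
print(df)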
How do I iterate through a list in order to add each item to the requests.get call?
import requests, json
url = "http://www.omdbapi.com/?t="
# data = "Titanic"
data = "Titanic", "Avatar"
title_url = url + data
r = requests.get(title_url, '&apikey=xxx3432g')
print r.json
This works perfectly with only one title for data. I do not know how to loop through it so that I get multiple titles.
You can first put all the titles you need into a list, and then pass the query parameters as a dict, like so:
import requests

url = "http://www.omdbapi.com"
titles = ["Titanic", "Avatar"]

for title in titles:
    r = requests.get(url, params={"t": title, "apikey": "xxx3432g"})
    print(r.json())
EDIT:
import requests

url = "http://www.omdbapi.com"
titles = ["Titanic", "Avatar"]

output_results = []
for title in titles:
    r = requests.get(url, params={"t": title, "apikey": "xxx3432g"})
    output_results.append(r.json())

print(output_results[0]["Title"])
print(output_results[1]["Year"])
More details at http://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
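If it is easier to look results up by title than by list position, the same loop can fill a dict instead; a minimal sketch (the apikey value is the placeholder used above):
results_by_title = {}
for title in titles:
    r = requests.get(url, params={"t": title, "apikey": "xxx3432g"})
    results_by_title[title] = r.json()

print(results_by_title["Titanic"].get("Year"))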