python beautifulSoup findAll

I am having an issue getting all of the data from this site.
The section of the code I cannot get to produce all of the data is "pn".
I was hoping this code would produce these numbers from the site:
58312-GA4
58312-RG4
58312-RR$
I have tried a number of things, from switching the tags and classes to going back and forth between find, findAll, and find_all, and no matter what I try I get only one result.
Any help would be great - thanks
Here is the code:
import urllib.request
from bs4 import BeautifulSoup

theurl = "http://www.colehersee.com/home/grid/cat/14/?"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")

for pn in soup.find('table', {"class": "mod_products_grid_listing"}).find_all('span', {"class": "product_code"}):
    pn2 = pn.text
for main in soup.find_all('nav', {"id": "breadcrumb"}):
    main1 = main.text

print(pn2)
print(main1)

You're running the for loop for getting the 'pn' value quite separately from the for loop for the 'main' value. To be specific, by the time your code reaches the second for loop, the previous for loop has already executed in its entirety.
This results in the variable pn2 getting assigned the last value that was returned by the for loop.
You might want to do something like
pn2 = []
for pn in soup.find('table', {"class": "mod_products_grid_listing"}).find_all('span', {"class": "product_code"}):
    pn2.append(pn.text)
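For completeness, a sketch along those lines (same selectors as the question; this assumes you want each part number paired with the breadcrumb text, and the variable names are illustrative):

# collect every product code instead of overwriting one variable
part_numbers = []
for pn in soup.find('table', {"class": "mod_products_grid_listing"}).find_all('span', {"class": "product_code"}):
    part_numbers.append(pn.text)

# the breadcrumb only needs to be read once
breadcrumb = soup.find('nav', {"id": "breadcrumb"}).text

for number in part_numbers:
    print(number, breadcrumb)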

Related

Unable to retrieve value from dictionary after webscraping

I was hoping people on here would be able to answer what I believe to be a simple question. I'm a complete newbie and have been attempting to create an image web scraper for the site Archdaily. Below is my code so far, after numerous attempts to debug it:
#### - Webscraping 0.1 alpha -
#### - Archdaily -

import requests
from bs4 import BeautifulSoup

# Enter the URL of the webpage you want to download the images from
page = 'https://www.archdaily.com/63267/ad-classics-house-vi-peter-eisenman/5037e0ec28ba0d599b000190-ad-classics-house-vi-peter-eisenman-image'

# Returns the webpage source code under page_doc
result = requests.get(page)
page_doc = result.content

# Returns the source code as BeautifulSoup object, as nested data structure
soup = BeautifulSoup(page_doc, 'html.parser')

img = soup.find('div', class_='afd-gal-items')
img_list = img.attrs['data-images']
for k, v in img_list():
    if k == 'url_large':
        print(v)
These elements here:
img = soup.find('div', class_='afd-gal-items')
img_list = img.attrs['data-images']
attempt to isolate the data-images attribute, shown here:
My github upload of this portion, very long
As you can see, or maybe I'm completely wrong here, my attempts to call the 'url_large' values from this final dictionary list comes to a TypeError, shown below:
Traceback (most recent call last):
  File "D:/Python/Programs/Webscraper/Webscraping v0.2alpha.py", line 23, in <module>
    for k, v in img_list():
TypeError: 'str' object is not callable
I believe my error lies in the resulting isolation of 'data-images', which to me looks like a dict within a list, as they're wrapped by brackets and curly braces. I'm completely out of my element here because I basically jumped into this project blind (haven't even read past chapter 4 of Guttag's book yet).
I also looked everywhere for ideas and tried to mimic what I found. I've found solutions others have offered previously to change the data to JSON data, so I found the code below:
jsonData = json.loads(img.attrs['data-images'])
print(jsonData['url_large'])
But that was a bust, shown here:
Traceback (most recent call last):
  File "D:/Python/Programs/Webscraper/Webscraping v0.2alpha.py", line 29, in <module>
    print(jsonData['url_large'])
TypeError: list indices must be integers or slices, not str
There is a step I'm missing here in changing these string values, but I'm not sure where I could change them. I'm hoping someone can help me resolve this issue, thanks!
It's all about the types.
img_list is actually not a list, but a string. You try to call it with img_list(), which results in an error.
You had the right idea of turning it into a dictionary using json.loads. The error here is pretty straightforward: jsonData is a list, not a dictionary. You have more than one image.
You can loop through the list. Each item in the list is a dictionary, and you'll be able to find the url_large attribute in each dictionary in the list:
images_json = img.attrs['data-images']
for image_properties in json.loads(images_json):
    print(image_properties['url_large'])
#infinity & #simic0de are both right, but I wanted to more explicitly address what I see in your code as well.
In this particular block:
img_list = img.attrs['data-images']
for k, v in img_list():
    if k == 'url_large':
        print(v)
There are a couple of errors here.
If img_list truly WAS a dictionary, you could not iterate through it this way. You would need to use img_list.items() (for Python 3) or img_list.iteritems() (Python 2) in the second line.
When you use parentheses like that, it implies that you're calling a function. But here you're trying to iterate through a dictionary. That is why you get the 'is not callable' error.
The other main issue is the Type issue. simic0de & Infinity address that, but ultimately you need to check the type of img_list and convert it as needed so you can iterate through it.
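A minimal illustration of the difference, with a made-up dictionary:

# hypothetical dict, just to show the iteration pattern
d = {'url_large': 'https://example.com/large.jpg', 'alt': 'photo'}

for k, v in d.items():  # .items() yields (key, value) pairs
    print(k, v)

# d() would raise TypeError: 'dict' object is not callable,
# which is the same kind of error the question hits with a string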
Source of error:
img_list is a string. You have to convert it to a list using json.loads; it then becomes a list of dicts that you can loop over.
Working Solution:
import json
import requests
from bs4 import BeautifulSoup

# Enter the URL of the webpage you want to download the images from
page = 'https://www.archdaily.com/63267/ad-classics-house-vi-peter-eisenman/5037e0ec28ba0d599b000190-ad-classics-house-vi-peter-eisenman-image'

# Returns the webpage source code under page_doc
result = requests.get(page)
page_doc = result.content

# Returns the source code as BeautifulSoup object, as nested data structure
soup = BeautifulSoup(page_doc, 'html.parser')

img = soup.find('div', class_='afd-gal-items')
img_list = img.attrs['data-images']
for img in json.loads(img_list):
    for k, v in img.items():
        if k == 'url_large':
            print(v)

Processing all data in a for loop instead of only one element

I wrote some code in order to scrape some data from a website. When I run the code manually I can get all the information for all the shoes, but when I run my script it only gives me one result for each variable.
What can I change to get all the results I want?
For example, when I run the following, I only get one result for marque and one for modele, but when I do it in my terminal I can see that vignette contains multiple values.
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.sarenza.com/store/product/gender-type/list/view?gender=1&type=76&index=0&count=99')
soup = BeautifulSoup(r.text, 'lxml')

vignette = soup.find_all('li', class_='vignette')
for i in range(len(vignette)):
    marque = vignette[i].contents[3].text
    modele = vignette[i].contents[5].contents[3].text
You're updating your marque and modele variables on each iteration of the loop, overwriting their previous values. At the end of the loop they contain only the last values that were assigned to them.
If you want to extract all the values, you need to use two lists, and append values to them like this:
marques = []
modeles = []
for i in range(len(vignette)):
    marques.append(vignette[i].contents[3].text)
    modeles.append(vignette[i].contents[5].contents[3].text)
Or, in a more Pythonic way:
marques = [v.contents[3].text for v in vignette]
modeles = [v.contents[5].contents[3].text for v in vignette]
Now you'll have all the values you need, and you can process them or print them out, like this:
for marque, modele in zip(marques, modeles):
    print('Marque:', marque, 'Modèle:', modele)
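Alternatively, if it's more convenient to keep each shoe's fields together, here is a sketch (same selectors as above, names illustrative) that builds one dict per item:

# one dict per shoe, so related fields stay together
shoes = [
    {'marque': v.contents[3].text,
     'modele': v.contents[5].contents[3].text}
    for v in vignette
]

for shoe in shoes:
    print('Marque:', shoe['marque'], 'Modèle:', shoe['modele'])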

problem with accessing index from for loop and using it to create a new list

I am extremely new to Python and programming in general (I basically started a few days ago) so forgive me if I use the wrong terms or if I'm asking a silly question.
I’m writing a web scraper to get some data from a job vacancy website. I've written some code that first of all downloads the data from the main search results page, parses it and extracts from it the headings which contain a link to each of the vacancy pages where the details of each specific vacancy can be found. Then I’ve written code that opens each link and parses the html from each vacancy page.
Now this all works fine. The issue I have is with the following. I want to scrape some data from each of these vacancy pages and save the data for each vacancy in a separate list, so that later I can put all these lists in a data frame. I've therefore been looking for a way to number or 'index' (if that is the right term to use) each list so that I can refer to it later. Following the advice I found by reading existing answers on Stack Overflow, I've tried to use enumerate to create an index which I can assign to each list, as follows:
vacancy_headings = resultspage1_soup.body.findAll("a", class_="vacancy-link")
vacancydetails = []
for index, vacancy in enumerate(vacancy_headings, start=0):
    vacancypage_url = urljoin("https://www.findapprenticeship.service.gov.uk", vacancy["href"])
    vacancypage_client = urlopen(vacancypage_url)
    vacancypage_html = vacancypage_client.read()
    vacancypage_soup = soup(vacancypage_html, "html.parser")
    vacancydetails[index] = []
    for p in vacancypage_soup.select("p"):
        if p.has_attr("itemprop"):
            if p["itemprop"] == "employmentType" or p["itemprop"] == "streetAddress" or p["itemprop"] == "addressLocality" or p["itemprop"] == "addressRegion" or p["itemprop"] == "postalCode":
                cells = p.text
                vacancydetails[index].append(cells)
But I get the following error message:
IndexError                                Traceback (most recent call last)
<ipython-input-10-b8a75df16395> in <module>()
      9 vacancypage_html = vacancypage_client.read()
     10 vacancypage_soup = soup(vacancypage_html, "html.parser")
---> 11 vacancydetails[index]=[]
     12
     13 for p in vacancypage_soup.select("p"):

IndexError: list assignment index out of range
Could someone explain to me (in easy-to-understand language if possible!) what is going wrong, and how I can fix this problem?
Thanks!!
Since vacancydetails is a list, trying to access a position in the list that doesn't exist is an error. And, when you first create it, the list is empty. So, before accessing any elements from the list, you'll need to first create those elements.
Thus, instead of this:
vacancydetails[index]=[]
...you want to append a new item to the list (and that new item happens to be an empty list itself), like this:
vacancydetails.append([])
The list vacancydetails is empty until you append to it (or assign to it from somewhere else). Because index is counting up from 0, you just want to manipulate the currently-final entry in vacancydetails in the for p loop.
So, rather than vacancydetails[index]=[] you want vacancydetails.append([]). But then the more pythonic thing to do is work with the last entry in vacancydetails, i.e., vacancydetails[-1], in which case you never need the index variable.
for vacancy in vacancy_headings:
    vacancypage_url = urljoin("https://www.findapprenticeship.service.gov.uk", vacancy["href"])
    ### ...
    vacancydetails.append([])
    for p in vacancypage_soup.select("p"):
        if p.has_attr("itemprop"):
            ### ...
            vacancydetails[-1].append(cells)
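Since the stated goal is to end up with a data frame, a possible last step looks like this (a sketch assuming pandas is available and that every vacancy page yields the same five itemprop fields in the same order; the column names here are illustrative):

import pandas as pd

columns = ["employmentType", "streetAddress", "addressLocality",
           "addressRegion", "postalCode"]

# each inner list becomes one row; rows with missing fields
# would need padding before this works
df = pd.DataFrame(vacancydetails, columns=columns)
print(df.head())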

How to integrate this script with this function in Python (Instagram)

I am working on a little script where I want to collect all the "code" values for a tag.
For example:
https://www.instagram.com/explore/tags/%s/?__a=1
The next page will be:
https://www.instagram.com/explore/tags/plebiscito/?__a=1&max_id=end_cursor
However, my problem is getting each URL to give me what I need (the comments and usernames of the people). As the script stands, it does not do that.
The obtain_max_id function works, getting the successive end_cursor values, but I do not know how to adapt it.
I appreciate your help!
In conclusion, I need to adapt the obtain_max_id function into my connect_main function so that I can extract the information I need from each of the URLs.
This is simple.
import requests
import json

host = "https://www.instagram.com/explore/tags/plebiscito/?__a=1"
r = requests.get(host).json()

for x in r['tag']['media']['nodes']:
    print(x['code'])

next = r['tag']['media']['page_info']['end_cursor']
while next:
    r = requests.get(host + "&max_id=" + next).json()
    for x in r['tag']['media']['nodes']:
        print(x['code'])
    next = r['tag']['media']['page_info']['end_cursor']
You have all the data you want in your data variable (in JSON form), right after you execute the line:
data = json.loads(finish.text)
in the while loop inside your obtain_max_id() method. Just use that.
Assuming everything inside the else block of your connect_main() method works, you could simply use that code inside the above while loop, right after you have all the data in your data variable.
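To sketch how the two pieces could be combined (this assumes the JSON layout used in the first answer above, which Instagram may change at any time), the pagination can be wrapped in a generator so the per-node extraction lives in one place:

import requests

def iter_nodes(tag):
    # yield every media node for a tag, following end_cursor pagination
    base = "https://www.instagram.com/explore/tags/%s/?__a=1" % tag
    cursor = None
    while True:
        url = base if cursor is None else base + "&max_id=" + cursor
        media = requests.get(url).json()['tag']['media']
        for node in media['nodes']:
            yield node
        cursor = media['page_info']['end_cursor']
        if not cursor:
            break

for node in iter_nodes('plebiscito'):
    print(node['code'])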

how to convert a bs4.element.ResultSet to strings? Python

I have a simple code like:
p = soup.find_all("p")
paragraphs = []
for x in p:
    paragraphs.append(str(x))
I am trying to convert a list I obtained from XML into strings. I want to keep each element with its original tag so I can reuse some of the text, which is why I am appending it like this. But the list contains over 6000 observations, so a recursion error occurs because of the str call:
"RuntimeError: maximum recursion depth exceeded while calling a Python object"
I read that you can raise the maximum recursion depth, but it's not wise to do so. My next idea was to split the conversion into batches of 500, but I am sure there has to be a better way to do this. Does anyone have any advice?
The problem here is probably that some of the binary graphic data at the bottom of the document contains the sequence of characters <P, which Beautiful Soup is trying to repair into an actual HTML tag. I haven't managed to pinpoint which text is causing the "recursion depth exceeded" error, but it's somewhere in there. It's p[6053] for me, but since you seem to have modified the file a bit (or maybe you're using a different parser for Beautiful Soup), it'll be different for you, I imagine.
Assuming you don't need the binary data at the bottom of the document to extract whatever you need from the actual <p> tags, try this:
# boot out the last `<document>`, which contains the binary data
soup.find_all('document')[-1].extract()

p = soup.find_all('p')
paragraphs = []
for x in p:
    paragraphs.append(str(x))
I believe the issue is that the BeautifulSoup object p is not built iteratively, so the recursion limit is reached before you can finish constructing p = soup.find_all('p'). Note that a RecursionError is similarly thrown when building soup.prettify().
For my solution I used the re module to gather all <p>...</p> tags (see the code below). My final result was len(p) = 5571. This count is lower than yours because the regex did not match any text within the binary graphic data.
import re
import urllib.request

url = 'https://www.sec.gov/Archives/edgar/data/1547063/000119312513465948/0001193125-13-465948.txt'
response = urllib.request.urlopen(url).read()

# re.S lets . match newlines; the non-greedy +? stops at the first closing tag.
# Matching the whole <P>...</P> span (no capture group) keeps the tags and
# makes findall return strings rather than tuples.
p = re.findall(r'<P.+?</P>', str(response), flags=re.S)

paragraphs = []
for x in p:
    paragraphs.append(str(x))
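One more thing that may be worth trying (no guarantee it avoids the recursion, since the stray <P sequences in the binary data still produce p tags): Beautiful Soup's SoupStrainer can restrict parsing to <p> elements only, which at least keeps the rest of the repaired markup out of the tree:

from urllib.request import urlopen
from bs4 import BeautifulSoup, SoupStrainer

url = 'https://www.sec.gov/Archives/edgar/data/1547063/000119312513465948/0001193125-13-465948.txt'
response = urlopen(url).read()

# parse_only keeps just the <p> elements; everything else is skipped
soup = BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('p'))
paragraphs = [str(p) for p in soup.find_all('p')]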
