I'm hoping people here can answer what I believe is a simple question. I'm a complete newbie and have been attempting to build an image web scraper for the site ArchDaily. Below is my code so far, after numerous attempts to debug it:
#### - Webscraping 0.1 alpha -
#### - Archdaily -
import requests
from bs4 import BeautifulSoup
# Enter the URL of the webpage you want to download the images from
page = 'https://www.archdaily.com/63267/ad-classics-house-vi-peter-eisenman/5037e0ec28ba0d599b000190-ad-classics-house-vi-peter-eisenman-image'
# Returns the webpage source code under page_doc
result = requests.get(page)
page_doc = result.content
# Returns the source code as BeautifulSoup object, as nested data structure
soup = BeautifulSoup(page_doc, 'html.parser')
img = soup.find('div', class_='afd-gal-items')
img_list = img.attrs['data-images']
for k, v in img_list():
    if k == 'url_large':
        print(v)
These elements here:
img = soup.find('div', class_='afd-gal-items')
img_list = img.attrs['data-images']
attempt to isolate the data-images attribute, shown here:
My github upload of this portion, very long
As you can see (or maybe I'm completely wrong here), my attempt to pull the 'url_large' values from this final list of dictionaries results in a TypeError, shown below:
Traceback (most recent call last):
  File "D:/Python/Programs/Webscraper/Webscraping v0.2alpha.py", line 23, in <module>
    for k, v in img_list():
TypeError: 'str' object is not callable
I believe my error lies in the resulting isolation of 'data-images', which to me looks like a dict within a list, as they're wrapped by brackets and curly braces. I'm completely out of my element here because I basically jumped into this project blind (haven't even read past chapter 4 of Guttag's book yet).
I also looked everywhere for ideas and tried to mimic what I found. Others have suggested converting the data with the json module, so I tried the code below:
jsonData = json.loads(img.attrs['data-images'])
print(jsonData['url_large'])
But that was a bust, shown here:
Traceback (most recent call last):
  File "D:/Python/Programs/Webscraper/Webscraping v0.2alpha.py", line 29, in <module>
    print(jsonData['url_large'])
TypeError: list indices must be integers or slices, not str
There is a step I'm missing here in changing these string values, but I'm not sure where I could change them. I'm hoping someone can help me resolve this issue, thanks!
It's all about the types.
img_list is actually not a list but a string. You try to call it with img_list(), which results in the error.
You had the right idea of turning it into a dictionary using json.loads. The error here is pretty straightforward: jsonData is a list, not a dictionary, because you have more than one image.
You can loop through the list. Each item in the list is a dictionary, and you'll be able to find the url_large attribute in each dictionary in the list:
images_json = img.attrs['data-images']
for image_properties in json.loads(images_json):
    print(image_properties['url_large'])
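To see why this works without hitting the site, here is a self-contained sketch using a shortened, made-up stand-in for the real data-images string (the actual attribute value is much longer, and these URLs are hypothetical):

```python
import json

# A shortened, hypothetical stand-in for the real data-images attribute value
images_json = '[{"url_large": "https://example.com/a.jpg"}, {"url_large": "https://example.com/b.jpg"}]'

# json.loads turns the string into a list of dicts; each dict is one image
for image_properties in json.loads(images_json):
    print(image_properties['url_large'])
```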
#infinity & #simic0de are both right, but I wanted to more explicitly address what I see in your code as well.
In this particular block:
img_list = img.attrs['data-images']
for k, v in img_list():
if k == 'url_large':
print(v)
There are a couple of errors here.
If 'img_list' truly WAS a dictionary, you couldn't iterate through it this way. You would need to use img_list.items() (Python 3) or img_list.iteritems() (Python 2) in the second line.
When you use the parenthesis like that, it implies that you're calling a function. But here, you're trying to iterate through a dictionary. That is why you get the 'is not callable' error.
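As a minimal illustration of the difference, with a made-up two-entry dictionary:

```python
# Hypothetical dictionary shaped like one image entry
d = {'url_large': 'https://example.com/a.jpg', 'url_small': 'https://example.com/a-small.jpg'}

# d() would raise TypeError: 'dict' object is not callable.
# d.items() yields (key, value) pairs instead:
for k, v in d.items():
    if k == 'url_large':
        print(v)
```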
The other main issue is the Type issue. simic0de & Infinity address that, but ultimately you need to check the type of img_list and convert it as needed so you can iterate through it.
Source of error:
img_list is a string. You have to convert it with json.loads; it then becomes a list of dicts that you can loop over.
Working Solution:
import json
import requests
from bs4 import BeautifulSoup
# Enter the URL of the webpage you want to download the images from
page = 'https://www.archdaily.com/63267/ad-classics-house-vi-peter-eisenman/5037e0ec28ba0d599b000190-ad-classics-house-vi-peter-eisenman-image'
# Returns the webpage source code under page_doc
result = requests.get(page)
page_doc = result.content
# Returns the source code as BeautifulSoup object, as nested data structure
soup = BeautifulSoup(page_doc, 'html.parser')
img = soup.find('div', class_='afd-gal-items')
img_list = img.attrs['data-images']
for image in json.loads(img_list):  # renamed from img to avoid shadowing the tag variable above
    for k, v in image.items():
        if k == 'url_large':
            print(v)
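Once the JSON is parsed, the same result can also be written more compactly with a list comprehension (assuming every entry has a url_large key, which may not hold for every gallery). A self-contained sketch with a shortened, hypothetical stand-in for the attribute string:

```python
import json

# Shortened, made-up stand-in for the data-images attribute value
img_list = '[{"url_large": "https://example.com/1.jpg"}, {"url_large": "https://example.com/2.jpg"}]'

# Collect every url_large in one pass
urls = [item['url_large'] for item in json.loads(img_list)]
print(urls)
```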
UPDATE: I've copied all of my code in now. I'm not a programmer, so it probably isn't formatted correctly, etc. The print statements at the end are just to check that I'm actually getting some kind of output; I plan to write to a MySQL database once I have everything running fine.
I'm trying to scrape some horse racing details from the URL below. I've written some code which scrapes race details - horse names, times, etc - and for the most part, it's working fine.
For some reason though, when it parses the URL in the code below, it returns a NoneType error after about two-thirds of the entries (once it hits the 20:05 race at Lingfield).
I've had a look at the source and as far as I can see, there is text in the div FastResult__item (inside a tag), and the code is returning values for the other winning trainers (it also works if I change the date in the URL).
I'm stumped as to why it's returning None, instead of the expected value of Winning Trainer: Simon Crisford. Any help would be appreciated - I'm by no means an expert at using Python, so go easy.
Code (copied all of my code in now):
from bs4 import BeautifulSoup
import requests
page = requests.get("https://myracing.com/results/2019-08-03/")
soup = BeautifulSoup(page.text, "html.parser")
race_listings = soup.find_all("article", class_="FastResult")
for a in race_listings:
    a_meeting = a.find("h3", class_="FastResult__title")
    a_time = a.find("h3", class_="FastResult__title")
    a_draw = a.find("span", class_="FastResult__draw_no").text.strip()
    a_winning_horse = a.find("span", class_="FastResult__horse-name")
    for div in a_winning_horse("sup", {'class': 'Racecard__sup-text'}):
        div.decompose()
    a_winning_jockey = a.find("span", class_="FastResult__jockey-name")
    for div in a_winning_jockey("sup", {'class': 'Racecard__sup-text'}):
        div.decompose()
    a_winning_trainer = a.find("div", class_="FastResult__item")
    print(a_meeting.text[6:].strip())
    print(a_time.text[0:6].strip())
    print(a_draw.strip("()"))
    print(a_winning_horse.text.strip())
    print(a_winning_jockey.text[3:].strip())
    print(a_winning_trainer.text.strip())
Error:
Traceback (most recent call last):
  File "/Users/andyjarvis/Documents/Python/horse_predictor_v2.py", line 30, in <module>
    print(a_winning_trainer.text.strip())
AttributeError: 'NoneType' object has no attribute 'text'
I've managed to 'fix' this, basically by ignoring it. I think there's an issue with the website, as I'm able to pull most of the information I need. Just seems to be this one trainer that's holding the whole thing up.
I entered the following code, under the 'a_winning_trainer =' line and I'm able to continue:
if a_winning_trainer is None:
    continue
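If skipping the whole race loses too much data, an alternative is to substitute a placeholder for just the missing field. A minimal sketch, simulating the case where find() returns None (the "Unknown" placeholder is my own choice, not from the site):

```python
# Simulate the missing FastResult__item div: find() returns None when there is no match
a_winning_trainer = None

# Fall back to a placeholder instead of dropping the whole race
trainer_text = a_winning_trainer.text.strip() if a_winning_trainer is not None else "Unknown"
print(trainer_text)
```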
Cheers
I'm attempting to scrape PGA stats from the API below.
url = 'https://statdata.pgatour.com/r/stats/current/02671.json?userTrackingId=exp=1594257225~acl=*~hmac=464d3dfcda2b2ccb384b77ac7241436f25b7284fb2eb0383184f48cbdff33cc4'
response = requests.get(url)
pga_stats = response.json()
I would like to only select the nested keys identified in this image. I've been able to traverse to the 'year' key with the below code, but I receive the following AttributeError for anything beyond that.
test = pga_stats.get('tours')[0].get('years')
(prints reduced dictionary)
test = pga_stats.get('tours')[0].get('years').get('stats')
'list' object has no attribute 'get'
My end goal is to write this player data to a csv file. Any suggestions would be greatly appreciated.
pga_stats.get('tours')[0].get('years') returns a list, not a dict. You actually want to use the get method on its first element, like this:
test = pga_stats.get('tours')[0].get('years')[0].get('stats')
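More generally, whenever a nested JSON structure alternates between dicts and lists, each list level needs an index before the next .get(). A self-contained sketch with a made-up miniature of the response shape (the key names below, apart from 'tours', 'years', and 'stats', are hypothetical):

```python
# Hypothetical miniature of the API response: lists nested inside dicts
pga_stats = {'tours': [{'years': [{'stats': [{'statName': 'Driving Distance'}]}]}]}

# .get() works on the dicts; [0] indexes into each intervening list
stats = pga_stats.get('tours')[0].get('years')[0].get('stats')
print(stats[0]['statName'])
```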
So I'm new to Python and am working on a simple program that will read a text file of protein names (PDB IDs) and create a URL to search a database (the PDB) for that protein and some associated data.
Unfortunately, as a newbie, I forgot to save my script, so I can't recall what I did to make my code work!
Below is my code so far:
import urllib
import urllib.parse
import urllib.request
import os
os.chdir("C:\\PythonProjects\\Samudrala Lab Projects")
protein_file = open("protein_list.txt","r")
protein_list = protein_file.read()
for item in protein_list:
    item = item[0:4]
    query_string = urlencode('customReportColumns', 'averageBFactor', 'resolution', 'experimentalTechnique', 'service=wsfile', 'format=csv')
    **final_URL = url + '?pdbid={}{}'.format(url, item, query_string)**
    print(final_URL)
The line of code I'm stuck on is starred.
The object "final_url" within the loop is missing some modification to indicate that I'd like the URL to search for the item as a pdbid. Can anyone give me a hint as to how I can tell the URL to plug in each item on the list as a PDBID?
I'm getting a TypeError indicating that it's not a valid non-string sequence or mapping object. (The original post was edited to add this info.)
Please let me know if this is an unclear question, or if you need any additional info.
Thanks!
How about something like this?
final_URL = "{}?pdbids={}{}".format(url, item, query_string)
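Separately, urlencode from urllib.parse expects a dict (or a sequence of key/value pairs) rather than separate string arguments, which is likely the source of the "not a valid non-string sequence or mapping object" error. A sketch using the report columns from the question, with a placeholder base URL and a made-up PDB ID:

```python
from urllib.parse import urlencode

url = 'http://example.com/report'  # placeholder base URL, not the real PDB endpoint
item = '1ABC'                      # hypothetical 4-character PDB ID

# urlencode takes a mapping of parameter names to values
query_string = urlencode({
    'customReportColumns': 'averageBFactor,resolution,experimentalTechnique',
    'service': 'wsfile',
    'format': 'csv',
})
final_URL = "{}?pdbids={}&{}".format(url, item, query_string)
print(final_URL)
```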
I'm a Python novice, thanks for your patience.
I retrieved a web page, using the requests module. I used Beautiful Soup to harvest a few hundred href objects (links). I used uritools to create an array of full URLs for the target pages I want to download.
I don't want everybody who reads this note to bombard the web server with requests, so I'll show a hypothetical example that is realistic for just 2 hrefs. The array looks like this:
hrefs2 = ['http://ku.edu/pls/WP040?PT001F01=910&pf7331=11',
'http://ku.edu/pls/WP040?PT001F01=910&pf7331=12']
If I were typing these into 100s of lines of code, I understand what to do in order to retrieve each page:
from lxml import html
import requests
url = 'http://ku.edu/pls/WP040/'
payload = {'PT001F01': '910', 'pf7331': '11'}
r = requests.get(url, params = payload)
Then get the second page
payload = {'PT001F01': '910', 'pf7331': '12'}
r = requests.get(url, params = payload)
And keep typing in payload objects. Not all of the hrefs I'm dealing with are sequential, not all of the payloads are different simply in the last integer.
I want to automate this and I don't see how to create the payloads from the hrefs2 array.
While fiddling with uritools, I found urisplit, which can give me the part I need to parse into a payload:
[urisplit(x)[3] for x in hrefs2]
['PT001F01=910&pf7331=11',
'PT001F01=910&pf7331=12']
Each one of those has to be turned into a payload object and I don't understand what to do.
I'm using Python3 and I used uritools because that appears to be the standards-compliant replacement of urltools.
I fell back on a shell script to get the pages with wget, which does work, but it is so un-Pythonic that I'm asking here what to do instead. I mean, this does work:
import subprocess
for i in hrefs2:
    subprocess.call(["wget", i])
You can pass the full url to requests.get() without splitting up the parameters.
>>> requests.get('http://ku.edu/pls/WP040?PT001F01=910&pf7331=12')
<Response [200]>
If for some reason you don't want to do that, you'll need to split up the parameters somehow. I'm sure there are better ways to do it, but the first thing that comes to mind is:
a = ['PT001F01=910&pf7331=11',
'PT001F01=910&pf7331=12']
# list to store all url parameters after they're converted to dicts
urldata = []

# iterate over the list of params
for param in a:
    data = {}
    # split the string into key=value pairs
    for kv in param.split('&'):
        # split each pair up
        b = kv.split('=')
        # first part is the key, second is the value
        data[b[0]] = b[1]
    # after converting every kv pair in the parameter, add the result to the list
    urldata.append(data)
You could do this with less code but I wanted to be clear what was going on. I'm sure there is already a module somewhere out there that does this for you too.
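There is indeed a standard-library helper for exactly this: urllib.parse.parse_qs splits a query string into a dict of parameters, so the manual loop above can be replaced entirely:

```python
from urllib.parse import parse_qs

a = ['PT001F01=910&pf7331=11',
     'PT001F01=910&pf7331=12']

# parse_qs maps each key to a *list* of values (a key can repeat),
# so take the first value of each to get plain key/value dicts
urldata = [{k: v[0] for k, v in parse_qs(q).items()} for q in a]
print(urldata)
# [{'PT001F01': '910', 'pf7331': '11'}, {'PT001F01': '910', 'pf7331': '12'}]
```

Each dict in urldata can then be passed directly as the params argument to requests.get().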
I have a simple code like:
p = soup.find_all("p")
paragraphs = []
for x in p:
    paragraphs.append(str(x))
I am trying to convert a list I obtained from XML into strings. I want to keep each element's original tags so I can reuse some of the text, which is why I am appending them like this. But the list contains over 6,000 observations, so a recursion error occurs in the str call:
"RuntimeError: maximum recursion depth exceeded while calling a Python object"
I read that you can change the max recursion but it's not wise to do so. My next idea was to split the conversion to strings into batches of 500, but I am sure that there has to be a better way to do this. Does anyone have any advice?
The problem here is probably that some of the binary graphic data at the bottom of the document contains the sequence of characters <P, which Beautiful Soup is trying to repair into an actual HTML tag. I haven't managed to pinpoint which text is causing the "recursion depth exceeded" error, but it's somewhere in there. It's p[6053] for me, but since you seem to have modified the file a bit (or maybe you're using a different parser for Beautiful Soup), it'll be different for you, I imagine.
Assuming you don't need the binary data at the bottom of the document to extract whatever you need from the actual <p> tags, try this:
# boot out the last `<document>`, which contains the binary data
soup.find_all('document')[-1].extract()
p = soup.find_all('p')
paragraphs = []
for x in p:
    paragraphs.append(str(x))
I believe the issue is that the BeautifulSoup object p is not built iteratively, so the method-call limit is reached before you can finish constructing p = soup.find_all('p'). Note that the RecursionError is similarly thrown when building soup.prettify().
For my solution I used the re module to gather all <p>...</p> tags (see code below). My final result was len(p) = 5571. This count is lower than yours because the regex conditions did not match any text within the binary graphic data.
import re
import urllib
from urllib.request import Request, urlopen
url = 'https://www.sec.gov/Archives/edgar/data/1547063/000119312513465948/0001193125-13-465948.txt'
response = urllib.request.urlopen(url).read()
p = re.findall(r'<P((.|\s)+?)</P>', str(response))  # (pattern, string)
paragraphs = []
for x in p:
    paragraphs.append(str(x))