I wrote code that gets a username from Instagram. Sometimes my algorithm doesn't work and I get the name as 'p'. I'm trying to write code for this exception (in the if head == 'p': part of the code).
First of all I use soup.select to get this block of information:
{"#context":"http:\/\/schema.org","#type":"ImageObject","caption":"I think I\u2019m getting better at editing these.... and by that I mean that there getting more and more muddled to the point hat I don\u2019t think people will be able to tell what they are soon.... not really what I\u2019m going for but oh well.\n-\n\u2022September 3 2018\u2022\n-\n-\nThis is the update I mentioned on my cuts on my leg. I finally cleaned them after two days. I normally don\u2019t wait that long but didn\u2019t really have the right circumstances to actually get to clean them the night of the relapse. *shrug*\n-\n-\n-\n#selfharm #selfharmo","representativeOfPage":"http:\/\/schema.org\/True","uploadDate":"2018-09-04T06:27:24","author":{"#type":"Person",**"alternateName":"#alittlereddrop"**,"mainEntityofPage":{"#type":"ProfilePage","#id":"https:\/\/www.instagram.com\/alittlereddrop\/"}},"commentCount":"0","interactionStatistic":{"#type":"InteractionCounter","interactionType":{"#type":"LikeAction"},"userInteractionCount":"2"},"mainEntityofPage":{"#type":"ItemPage","#id":"https:\/\/www.instagram.com\/p\/BnS0sdDlsmP\/?tagged=selfharmo"},"description":"2 Likes, 0 Comments - No One Cares (#alittlereddrop) on Instagram: \u201cI think I\u2019m getting better at editing these.... and by that I mean that there getting more and more\u2026\u201d","name":"No One Cares on Instagram: \u201cI think I\u2019m getting better at editing these.... and by that I mean that there getting more and more muddled to the point hat I don\u2019t think\u2026\u201d"}
There is a part "alternateName": which contains the name, but I can't get it even with json.loads. Do you have any ideas?
import json
import requests
from requests.exceptions import HTTPError
from bs4 import BeautifulSoup

file = open('users.txt', 'r', encoding="ISO-8859-1")
urls = file.readlines()
for url in urls:
    url = url.strip('\n')
    try:
        req = requests.get(url)
        req.raise_for_status()
    except HTTPError as http_err:
        output = open('output2.txt', 'a')
        output.write('Unfortunately, the page is unavailable.\n')
    except Exception as err:
        output = open('output2.txt', 'a')
        output.write('Unfortunately, the page is unavailable (other error).\n')
    else:
        output = open('output2.txt', 'a')
        soup = BeautifulSoup(req.text, "lxml")
        the_url = soup.select("[rel='canonical']")[0]['href']
        the_url2 = the_url.replace('https://www.instagram.com/', '')
        head, sep, tail = the_url2.partition('/')
        if head == 'p':
            data = soup.select("[type='application/ld+json']")[0]
            oJson2 = json.loads(data.text)["alternateName"]
            output.write(oJson2 + '\n')
        else:
            output.write(head + '\n')
You have a syntax problem in your JSON: there are double stars placed incorrectly in two places:
**"alternateName":"#alittlereddrop"**,.
If you're opening the JSON from a file, do this:
import json

with open('yourfilename.json') as fo:
    jsn = json.loads(fo.read().replace('**', ''))

print(jsn['author']['alternateName'])
# '#alittlereddrop'
In your case, replace this line:
oJson2 = json.loads(data.text)["alternateName"]
with this:
oJson2 = json.loads(data.text.replace('**', ''))['author']["alternateName"]
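To guard against pages that have no ld+json block at all, a minimal sketch (assuming the same soup and output objects as in the question's code) could look like this:
blocks = soup.select("[type='application/ld+json']")
if blocks:
    data = json.loads(blocks[0].text.replace('**', ''))
    # dict.get avoids a KeyError if 'author' or 'alternateName' is missing
    name = data.get('author', {}).get('alternateName', '')
    output.write(name + '\n')
else:
    output.write('no ld+json block found\n')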
I would like to do web scraping, so I make a simple request:
import urllib.request
fp = urllib.request.urlopen("https://www.iadfrance.fr/trouver-un-conseiller")
mybytes = fp.read()
mystr = mybytes.decode("utf8")
faa = open("demofile2.txt", "a")
faa.write(mystr)
faa.close()
fp.close()
but I don't find any name in my file.
Why? And is there a way to get all the advisers shown on the map?
Thanks for your answers!
Here is how you get the data:
import requests

r = requests.get('https://www.iadfrance.fr/agent-search-location?southwestlat=48.8251752&southwestlng=2.2935677&northeastlat=48.8816507&northeastlng=2.4039459')
if r.status_code == 200:
    print(r.json())
else:
    print(f'Oops. Status code is {r.status_code}')
The fundamental concept here has a name, "HATEOAS", Hypermedia as the Engine of Application State.
The first response you get contains the list of the next resources you need to request. In turn, those may contain quite a few more. Some of those resources might be JavaScript, which when executed requests even more data. That's inconvenient and a violation of the theoretical HATEOAS model, but it is very much the practice for interactive websites.
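As a toy illustration of that model (the entry URL and the 'links' field here are hypothetical, not iadfrance's real API; use your browser's developer tools, network tab, to discover the actual endpoints), following embedded links could look like this:
import requests

seen = set()
queue = ['https://api.example.com/root']  # hypothetical entry point
while queue:
    url = queue.pop()
    if url in seen:
        continue  # don't fetch the same resource twice
    seen.add(url)
    doc = requests.get(url).json()
    print(url, '->', doc.get('name'))
    # each response may embed the URLs of the next resources to request
    queue.extend(doc.get('links', []))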
I am trying to pull the number of followers from a list of Instagram accounts. I have tried using the "find" method within Requests; however, the string that I am looking for when I inspect the actual Instagram page no longer appears when I print "r" from the code below.
I was able to get this code to run successfully in the past; however, it will no longer run.
Webscraping Instagram follower count BeautifulSoup
import requests
user = "espn"
url = 'https://www.instagram.com/' + user
r = requests.get(url).text
start = '"edge_followed_by":{"count":'
end = '},"followed_by_viewer"'
print(r[r.find(start)+len(start):r.rfind(end)])
I receive a "-1" error, which means the substring from the find method was not found within the variable "r".
I think it's because of the last ' in start and the first ' in end... this will work:
import requests
import re
user = "espn"
url = 'https://www.instagram.com/' + user
r = requests.get(url).text
followers = re.search('"edge_followed_by":{"count":([0-9]+)}',r).group(1)
print(followers)
'14061730'
I want to suggest an updated solution to this question, as Derek Eden's answer above from 2019 does not work anymore, as stated in its comments.
The solution was to add the r prefix before the regular expression in the re.search, like so:
follower_count = re.search(r'"edge_followed_by\\":{\\"count\\":([0-9]+)}', response).group(1)
This r'' prefix is really important: without it, Python treats the expression as a regular string, so the backslashes are interpreted as escape sequences before the regex engine ever sees them, and the query gives no results.
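A quick standalone demonstration of the difference (not Instagram-specific):
import re

s = 'count\\":42'  # the text contains a literal backslash before the quote
# without r'', '\\' collapses to a single backslash, so the regex sees \" (just an escaped quote)
print(re.search('count\\":([0-9]+)', s))            # None - no match
# with r'', the regex receives \\ and matches the literal backslash
print(re.search(r'count\\":([0-9]+)', s).group(1))  # '42'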
Also, the Instagram page seems to have backslashes in the object we look for, at least in my tests, so the code example I use is the following; it targets Python 3.10 and works as of July 2022:
# get follower count of instagram profile
import os.path
import requests
import re
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# get instagram follower count
def get_instagram_follower_count(instagram_username):
    url = "https://www.instagram.com/" + instagram_username
    filename = "instagram.html"
    try:
        if not os.path.isfile(filename):
            r = requests.get(url, verify=False)
            print(r.status_code)
            print(r.text)
            response = r.text
            if not r.status_code == 200:
                raise Exception("Error: " + str(r.status_code))
            with open(filename, "w") as f:
                f.write(response)
        else:
            with open(filename, "r") as f:
                response = f.read()
        # print(response)
        follower_count = re.search(r'"edge_followed_by\\":{\\"count\\":([0-9]+)}', response).group(1)
        return follower_count
    except Exception as e:
        print(e)
        return 0

print(get_instagram_follower_count('your.instagram.profile'))
The method returns the follower count as expected. Please note that I added a few lines that save the response to a file, so I don't hammer Instagram's web server and get blocked while testing.
This is a slice of the original HTML content that contains the part we are looking for:
... mRL&s=1\",\"edge_followed_by\":{\"count\":110070},\"fbid\":\"1784 ...
I debugged the regex in RegExr; it seems to work just fine at this point in time.
There are many posts about the regex r prefix, like this one.
Also, the documentation of the re package shows clearly that this is the issue with the code above.
I am trying to download books from "http://www.gutenberg.org/". I want to know why my code gets nothing.
import requests
import re
import os
import urllib.request

def get_response(url):
    response = requests.get(url).text
    return response

def get_content(html):
    reg = re.compile(r'(<span class="mw-headline".*?</span></h2><ul><li>.*</a></li></ul>)', re.S)
    return re.findall(reg, html)

def get_book_url(response):
    reg = r'a href="(.*?)"'
    return re.findall(reg, response)

def get_book_name(response):
    reg = re.compile('>.*</a>')
    return re.findall(reg, response)

def download_book(book_url, path):
    path = ''.join(path.split())
    path = 'F:\\books\\{}.html'.format(path)  # my local file path
    if not os.path.exists(path):
        urllib.request.urlretrieve(book_url, path)
        print('ok!!!')
    else:
        print('no!!!')

def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        book_url = get_book_url(i)
        if book_url:
            book_name = get_book_name(i)
            try:
                download_book(book_url[0], book_name[0])
            except:
                continue

def main():
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main()
I have run the code and get nothing, no tracebacks. How can I download the books automatically from the website?
I have run the code and get nothing, no tracebacks.
Well, there's no chance you get a traceback in the case of an exception in download_book() since you explicitly silence them:
try:
    download_book(book_url[0], book_name[0])
except:
    continue
So the very first thing you want to do is to at least print out errors:
try:
    download_book(book_url[0], book_name[0])
except Exception as e:
    print("while downloading book {} : got error {}".format(book_url[0], e))
    continue
or just don't catch the exception at all (at least until you know what to expect and how to handle it).
I don't even know how to fix it
Learning how to debug is actually even more important than learning how to write code. For a general introduction, you want to read this first.
For something more python-specific, here are a couple ways to trace your program execution:
1/ add print() calls at the important places to inspect what you really get
2/ import your module in the interactive python shell and test your functions in isolation (this is easier when none of them depend on global variables)
3/ use the builtin step debugger (pdb); a minimal sketch follows below
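For instance, a minimal, self-contained sketch of pausing in the builtin debugger (pdb):
import pdb

def get_first(items):
    pdb.set_trace()  # execution pauses here: `p items` prints, `n` steps, `c` continues
    return items[0] if items else None

print(get_first([]))  # prints None once you continue from the debugger prompt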
Now there are a few obvious issues with your code:
1/ you don't test the result of requests.get() - an HTTP request can fail for quite a few reasons, and the fact you get a response doesn't mean you got the expected response (you could have a 400+ or 500+ response as well).
2/ you use regexps to parse HTML. DON'T: regexps cannot reliably work on HTML; you want a proper HTML parser instead (BeautifulSoup is the canonical solution for web scraping, as it's very tolerant); see the sketch after this list. Also, some of your regexps look quite wrong (greedy match-all, etc.).
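To make point 2/ concrete, here is a minimal sketch of the same extraction with BeautifulSoup instead of regexps (the 'ul li a' selector is an assumption about the wiki page's layout; adjust it after inspecting the real markup):
import requests
from bs4 import BeautifulSoup

resp = requests.get('http://www.gutenberg.org/wiki/Category:Classics_Bookshelf')
resp.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
soup = BeautifulSoup(resp.text, 'html.parser')
for link in soup.select('ul li a'):
    # work with parsed elements instead of fragile regexps over raw HTML
    print(link.get_text(strip=True), '->', link.get('href'))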
start_url is not defined in main()
You need to use a global variable. Otherwise, a better (cleaner) approach is to pass in the variable that you are using. In any case, I would expect an error: start_url is not defined.
def main(start_url):
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main(start_url)
EDIT:
Never mind, the problem is in this line: content = get_content(get_response(start_url))
The regex in get_content() does not seem to match anything. My suggestion would be to use BeautifulSoup: from bs4 import BeautifulSoup. For information on why you shouldn't parse HTML with regex, see this answer: RegEx match open tags except XHTML self-contained tags
Asking regexes to parse arbitrary HTML is like asking a beginner to write an operating system
As others have said, you get no output because your regex doesn't match anything. The text returned by the initial URL has a newline between </h2> and <ul>, so try this instead:
r'(<span class="mw-headline".*?</span></h2>\n<ul><li>.*</a></li></ul>)'
When you fix that one, you will face another error; I suggest some debug printouts like this:
def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        print('[DEBUG] Handling:', i)
        book_url = get_book_url(i)
        print('[DEBUG] book_url:', book_url)
        if book_url:
            book_name = get_book_name(i)
            try:
                print('[DEBUG] book_url[0]:', book_url[0])
                print('[DEBUG] book_name[0]:', book_name[0])
                download_book(book_url[0], book_name[0])
            except:
                continue
The code below comes up with the error:
"if soup.find(text=bbb).parent.parent.get_text(strip=True
AttributeError: 'NoneType' object has no attribute 'parent'"
Any help would be appreciated, as I can't quite get it to run fully; Python only returns results up to the error. I need it to return empty if there is no item and move on. I tried putting an if statement in, but that doesn't work.
import csv
import re
import requests
from bs4 import BeautifulSoup

f = open('dataoutput.csv', 'w', newline="")
writer = csv.writer(f)

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.zoopla.co.uk/for-sale/property/nottingham/?price_max=200000&identifier=nottingham&q=Nottingham&search_source=home&radius=0&pn=' + str(page) + '&page_size=100'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class': 'listing-results-price text-price'}):
            href = "http://www.zoopla.co.uk" + link.get('href')
            title = link.string
            get_single_item_data(href)
        page += 1

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for item_e in soup.findAll('table', {'class': 'neither'}):
        Sold = item_e.get_text(strip=True)
    bbb = re.compile('First listed')
    try:
        next_s = soup.find(text=bbb).parent.parent.get_text(strip=True)
    except:
        pass
    try:
        writer.writerow([Sold, next_s])
    except:
        pass

trade_spider(2)
Your exception comes from trying to access an attribute on None. You don't intend to do that, but because some earlier part of your expression turns out to be None where you expected something else, the later parts break.
Specifically, either soup.find(text=bbb) or soup.find(text=bbb).parent is None (probably the former, since I think None is the returned value if find doesn't find anything).
There are a few ways you can write your code to address this issue. You could either try to detect that it's going to happen ahead of time (and do something else instead), or you can just go ahead and try the attribute lookup and react if it fails. These two approaches are often called "Look Before You Leap" (LBYL) and "Easier to Ask Forgiveness than Permission" (EAFP).
Here's a bit of code using an LBYL approach that checks to make sure the values are not None before accessing their attributes:
val = soup.find(text=bbb)
if val and val.parent: # I'm assuming the non-None values are never falsey
next_s = val.parent.parent.get_text(strip=True)
else:
# do something else here?
The EAFP approach is perhaps simpler, but there's some risk that it could catch other unexpected exceptions instead of the ones we expect (so be careful using this design approach during development):
try:
    next_s = soup.find(text=bbb).parent.parent.get_text(strip=True)
except AttributeError:  # try to catch the fewest exceptions possible (so you don't miss bugs)
    pass  # do something else here?
It's not obvious to me what your code should do in the "do something else here" sections in the code above. It might be that you can ignore the situation, but probably you'd need an alternative value for next_s to be used by later code. If there's no useful value to substitute, you might want to bail out of the function early instead (with a return statement).
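For example, the bail-out-early variant wrapped as a helper (get_first_listed is a hypothetical name, not from the original code):
def get_first_listed(soup, bbb):
    # LBYL with an early exit instead of a long chain of attribute lookups
    val = soup.find(text=bbb)
    if val is None or val.parent is None:
        return None  # nothing matched 'First listed'; the caller can skip this page
    return val.parent.parent.get_text(strip=True)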
Python newbie here practicing my skills. I came across a roadblock and would be very happy to receive some help. What I'm trying to do is get a list of links from a spreadsheet. From there, Python will fetch the data, extract a specific class, and paste the result into column B. The problem is, there are instances when the link is broken, so no data is scraped. I used try and except to get around this, but it seems like it's not working: when an error occurs, it just skips writing the data and proceeds to write the next result into the wrong cell. Here is my code:
import gspread
import requests
from bs4 import BeautifulSoup
from oauth2client.service_account import ServiceAccountCredentials

credentials = ServiceAccountCredentials.from_json_keyfile_name('Te....4e.json', scope)
gc = gspread.authorize(credentials)
# selects the spreadsheet
sh = gc.open_by_url('https://docs.google.com/spreadsheets/d/1u7....0')
worksheet = sh.worksheet('Keywords')
colvalue = "A"
rownumber = 2
updaterowvalue = 2
while rownumber < 100:
    try:
        val = worksheet.acell(colvalue + str(rownumber)).value
        rownumber += 1
        url = val
        # scrape elements
        response = requests.get(url)
        soup = BeautifulSoup(response.content, "html.parser")
        # print titles only
        h1 = soup.find("h1", class_="sg-text--headline")
        updatecolvalue = "B"
            worksheet.update_acell(updatecolvalue + str(updaterowvalue), h1.get_text())
        updaterowvalue += 1
    except AttributeError:
        pass
print('DONE')
I assume that the extra indentation on the line starting worksheet.update_acell is an error, since your code is invalid as given.
The problem is that when an exception occurs, updaterowvalue +=1 is not executed, which causes the results to get out of sync with the URLs.
Fixing this is simple: just stop using updaterowvalue and use rownumber in the worksheet.update_acell() call. Since you want the result to be in the same row as the URL, updaterowvalue is unnecessary.
A more pythonic way of writing the loop would be:
for rownumber in range(2,100):
which allows you to eliminate the rownumber += 1 line too.
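Putting both fixes together, a sketch of the corrected loop (same worksheet object and selectors as in the question):
for rownumber in range(2, 100):
    url = worksheet.acell('A' + str(rownumber)).value
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, "html.parser")
        h1 = soup.find("h1", class_="sg-text--headline")
        # write to the same row the URL came from, so results never drift
        worksheet.update_acell('B' + str(rownumber), h1.get_text())
    except AttributeError:
        pass  # broken link or missing headline: leave the cell empty and move on
print('DONE')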