I am trying to pull the number of followers from a list of Instagram accounts. I have tried using the string "find" method on the text returned by Requests; however, the string that I am looking for when I inspect the actual Instagram page no longer appears when I print "r" from the code below.
I was able to get this code to run successfully in the past; however, it will no longer run.
Webscraping Instagram follower count BeautifulSoup
import requests
user = "espn"
url = 'https://www.instagram.com/' + user
r = requests.get(url).text
start = '"edge_followed_by":{"count":'
end = '},"followed_by_viewer"'
print(r[r.find(start)+len(start):r.rfind(end)])
The find call returns "-1", which means the substring it was looking for was not found within the variable "r", so the printed slice is not the follower count.
I think it's because of the last ' in start and the first ' in end... this will work:
import requests
import re
user = "espn"
url = 'https://www.instagram.com/' + user
r = requests.get(url).text
followers = re.search('"edge_followed_by":{"count":([0-9]+)}',r).group(1)
print(followers)
'14061730'
I want to suggest an updated solution to this question, as Derek Eden's answer above from 2019 no longer works, as stated in its comments.
The solution was to add the r prefix before the regular expression in the re.search, like so:
follower_count = re.search(r'"edge_followed_by\\":{\\"count\\":([0-9]+)}', response).group(1)
This r'' prefix is really important: without it, Python treats the expression as a regular string, the backslashes are swallowed as escape sequences, and the search returns no results.
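As a quick standalone illustration (the sample text below is made up, not the real page source), here is the difference the r prefix makes when the target contains literal backslashes:
import re

# sample text with literal backslashes, like the escaped JSON in the page source
text = r'\"edge_followed_by\":{\"count\":42}'

# raw string: the regex engine receives \\ and matches a literal backslash
print(re.search(r'\\"count\\":([0-9]+)', text).group(1))  # prints 42

# plain string: Python collapses \\ to \, so the regex engine only sees \" (a quote)
print(re.search('\\"count\\":([0-9]+)', text))            # prints None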
Also, the Instagram page seems to have backslashes in the object we are looking for, at least in my tests, so the code example I use is the following; it is written for Python 3.10 and working as of July 2022:
# get follower count of instagram profile
import os.path
import requests
import re
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
# get instagram follower count
def get_instagram_follower_count(instagram_username):
url = "https://www.instagram.com/" + instagram_username
filename = "instagram.html"
try:
if not os.path.isfile(filename):
r = requests.get(url, verify=False)
print(r.status_code)
print(r.text)
response = r.text
if not r.status_code == 200:
raise Exception("Error: " + str(r.status_code))
with open(filename, "w") as f:
f.write(response)
else:
with open(filename, "r") as f:
response = f.read()
# print(response)
follower_count = re.search(r'"edge_followed_by\\":{\\"count\\":([0-9]+)}', response).group(1)
return follower_count
except Exception as e:
print(e)
return 0
print(get_instagram_follower_count('your.instagram.profile'))
The method returns the follower count as expected. Please note that I added a few lines to avoid hammering Instagram's web server and getting blocked while testing, by simply saving the response to a file.
This is a slice of the original html content that contains the part we are looking for:
... mRL&s=1\",\"edge_followed_by\":{\"count\":110070},\"fbid\":\"1784 ...
I debugged the regex in RegExr, and it seems to work just fine at this point in time.
There are many posts about the regex r prefix, like this one.
Also, the documentation of the re package clearly shows that this is the issue with the code above.
I have very basic knowledge of Python, so sorry if my question sounds dumb.
I need to query a website for a personal project I am doing, but I need to query it 500 times, changing one specific part of the URL each time, then take the data and upload it to Google Sheets.
(The parentheses signify the part of the URL I need to change.)
'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=(symbol)&apikey=apikey'
I thought about using a while loop and format {} to do it, but I was unsure how to change the string each time, short of writing out the variable names by hand (which defeats the whole purpose).
I already have a list of the symbols I need to use, but I don't know how to feed them in.
Example of how I get 1 piece of data
import requests
url = 'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=MMM&apikey=demo'
r = requests.get(url)
data = r.json()
Example of what I'd like to change it to
import requests
url = 'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=AOS&apikey=demo'
r = requests.get(url)
data = r.json()
#then change it to
import requests
url = 'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=ABT&apikey=demo'
r = requests.get(url)
data = r.json()
so on and so forth, 500 times.
You might combine .format with a for loop; consider the following simple example:
symbols = ["abc","xyz","123"]
for s in symbols:
url = 'https://www.example.com?symbol={}'.format(s)
print(url)
output
https://www.example.com?symbol=abc
https://www.example.com?symbol=xyz
https://www.example.com?symbol=123
You might also elect to use any other way of formatting, e.g. an f-string (requires Python 3.6 or newer), in which case the code would be:
symbols = ["abc","xyz","123"]
for s in symbols:
url = f'https://www.example.com?symbol={s}'
print(url)
Alternatively, you might use the optional params argument of the requests.get function, as follows:
import requests
symbols = ["abc","xyz","123"]
for s in symbols:
r = requests.get('https://www.example.com', params={'symbol':s})
print(r.url)
output
https://www.example.com/?symbol=abc
https://www.example.com/?symbol=xyz
https://www.example.com/?symbol=123
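Putting the same idea together with the balance-sheet endpoint from your question (a sketch only: the symbols list is a stand-in for your ~500 symbols, and "demo" is a placeholder for your real API key):
import requests

symbols = ["MMM", "AOS", "ABT"]  # stand-in; your full list of symbols goes here
api_key = "demo"                 # placeholder; use your own key
results = {}

for s in symbols:
    r = requests.get(
        "https://www.alphavantage.co/query",
        params={"function": "BALANCE_SHEET", "symbol": s, "apikey": api_key},
    )
    results[s] = r.json()  # keep each parsed response keyed by its symbol

print(len(results), "symbols fetched")
From there, results holds one parsed JSON object per symbol, ready to be pushed to Google Sheets however you prefer.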
I was trying to create an Instagram post downloader bot with Python:
import requests
import re
#get url's detail
def get_response(url):
r = requests.get(url)
while r.status_code != 200:
r = requests.get(url)
return r.text
def prepare_urls(matches):
return list({match.replace("\\u0026", "&") for match in matches})
url = input('Enter Instagram URL: ')
response = get_response(url)
#check if there is video url or picture url in the json webpage that is opened
vid_matches = re.findall('"video_url":"([^"]+)"', response)
pic_matches = re.findall('"display_url":"([^"]+)"', response)
vid_urls = prepare_urls(vid_matches)
pic_urls = prepare_urls(pic_matches)
if vid_urls:
print('Detected Videos:\n{0}'.format('\n'.join(vid_urls)))
if pic_urls:
print('Detected Pictures:\n{0}'.format('\n'.join(pic_urls)))
if not (vid_urls or pic_urls):
print('Could not recognize the media in the provided URL.')
After I finished the code, I tried it with a video link and it worked. After an hour I tried the same video link, but it printed the third condition: "Could not recognize the media in the provided URL."
I'm confused. As you can see, I never used my login credentials in the code, but the first time it worked and the second time it did not...
Any ideas?
Make it so that each URL ends with the string "?__a=1". (When I have some free time, I'll edit this post and add the exact command to append the string to the URL's end; in the meantime, a rough sketch is included at the end of this answer.)
For example, instead of:
https://www.instagram.com/p/CECsuu2BgXj/
It should be:
https://www.instagram.com/p/CECsuu2BgXj/?__a=1
Output:
Detected Videos:
https://instagram.fdet1-2.fna.fbcdn.net/v/t50.2886-16/117817389_1889475617843249_1329686959743847420_n.mp4?efg=eyJ2ZW5jb2RlX3RhZyI6InZ0c192b2RfdXJsZ2VuLjcyMC5jbGlwcy5kZWZhdWx0IiwicWVfZ3JvdXBzIjoiW1wiaWdfd2ViX2RlbGl2ZXJ5X3Z0c19vdGZcIl0ifQ&_nc_ht=instagram.fdet1-2.fna.fbcdn.net&_nc_cat=105&_nc_ohc=OZRYx-3yUoAAX-b1xzZ&edm=AABBvjUBAAAA&vs=17858436890092651_3299599943&_nc_vs=HBksFQAYJEdDM0FCUWN4YUFQVGQ3WUdBUHhMQUxJXy0zTVNicV9FQUFBRhUAAsgBABUAGCRHQ0hOQ2dkbFlrcEYwOWtDQUtHQ0RqWUV4cGdzYnFfRUFBQUYVAgLIAQAoABgAGwAVAAAm1onK7OqJuT8VAigCQzMsF0AkmZmZmZmaGBJkYXNoX2Jhc2VsaW5lXzFfdjERAHX%2BBwA%3D&ccb=7-4&oe=6200F187&oh=00_AT-WTSxaoeTOd_GO0gMtqSqkgRXtxibffFG5pJGyCOPTNQ&_nc_sid=83d603
Detected Pictures:
https://instagram.fdet1-1.fna.fbcdn.net/v/t51.2885-15/e35/117915347_192544875567579_944852773653606759_n.jpg?_nc_ht=instagram.fdet1-1.fna.fbcdn.net&_nc_cat=103&_nc_ohc=0Bdvog7HWe8AX-3vsql&edm=AABBvjUBAAAA&ccb=7-4&oh=00_AT_O33BzV3tCKaDp_9eqeBUiYgyzVguImltLTuPIPKP4hg&oe=6201035F&_nc_sid=83d603
For more info, check out this awesome post.
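In case a concrete snippet helps in the meantime, here is one rough way (a sketch, not the exact command promised above) to append the suffix before fetching:
def with_json_suffix(url):
    # strip any trailing slash, then add the ?__a=1 query string
    return url.rstrip('/') + '/?__a=1'

url = input('Enter Instagram URL: ')
response = get_response(with_json_suffix(url))  # get_response as defined in the question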
I wrote some code which gets a username from Instagram. Sometimes my algorithm does not work and I get the name as 'p'. I'm trying to write code for this exception (in the if head == 'p': part of the code).
First of all, I use soup.select to get this block of information:
{"#context":"http:\/\/schema.org","#type":"ImageObject","caption":"I think I\u2019m getting better at editing these.... and by that I mean that there getting more and more muddled to the point hat I don\u2019t think people will be able to tell what they are soon.... not really what I\u2019m going for but oh well.\n-\n\u2022September 3 2018\u2022\n-\n-\nThis is the update I mentioned on my cuts on my leg. I finally cleaned them after two days. I normally don\u2019t wait that long but didn\u2019t really have the right circumstances to actually get to clean them the night of the relapse. *shrug*\n-\n-\n-\n#selfharm #selfharmo","representativeOfPage":"http:\/\/schema.org\/True","uploadDate":"2018-09-04T06:27:24","author":{"#type":"Person",**"alternateName":"#alittlereddrop"**,"mainEntityofPage":{"#type":"ProfilePage","#id":"https:\/\/www.instagram.com\/alittlereddrop\/"}},"commentCount":"0","interactionStatistic":{"#type":"InteractionCounter","interactionType":{"#type":"LikeAction"},"userInteractionCount":"2"},"mainEntityofPage":{"#type":"ItemPage","#id":"https:\/\/www.instagram.com\/p\/BnS0sdDlsmP\/?tagged=selfharmo"},"description":"2 Likes, 0 Comments - No One Cares (#alittlereddrop) on Instagram: \u201cI think I\u2019m getting better at editing these.... and by that I mean that there getting more and more\u2026\u201d","name":"No One Cares on Instagram: \u201cI think I\u2019m getting better at editing these.... and by that I mean that there getting more and more muddled to the point hat I don\u2019t think\u2026\u201d"}
There is a part, "alternateName", which contains a name, but I can't get it even with json.loads. Do you have any ideas?
import json
import requests
from bs4 import BeautifulSoup
from requests.exceptions import HTTPError

file = open('users.txt', 'r', encoding="ISO-8859-1")
urls = file.readlines()
for url in urls:
url = url.strip ('\n')
try:
req = requests.get(url)
req.raise_for_status()
except HTTPError as http_err:
output = open('output2.txt', 'a')
        output.write(f'Unfortunately, the page is unavailable.\n')
except Exception as err:
output = open('output2.txt', 'a')
        output.write(f'Unfortunately, the page is unavailable (2).\n')
else:
output = open('output2.txt', 'a')
soup = BeautifulSoup(req.text, "lxml")
the_url = soup.select("[rel='canonical']")[0]['href']
the_url2=the_url.replace('https://www.instagram.com/','')
head, sep, tail = the_url2.partition('/')
if head == 'p':
data = soup.select("[type='application/ld+json']")[0]
oJson2 = json.loads(data.text)["alternateName"]
str (oJson2)
output.write (oJson2+'\n')
else:
output.write (head+'\n')
You have a problem with the syntax in your JSON file. There are double asterisks placed incorrectly in two places:
**"alternateName":"#alittlereddrop"**,.
If you're opening the JSON from a file, do this:
import json
with open('yourfilename.json') as fo:
jsn = json.loads(fo.read().replace('**', ''))
print(jsn['author']['alternateName'])
# '#alittlereddrop'
In your case, instead of this line:
oJson2 = json.loads(data.text)["alternateName"]
try this:
oJson2 = json.loads(data.text.replace('**', ''))['author']["alternateName"]
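Dropped into the loop from your question, the relevant branch would then look roughly like this (a sketch, keeping your original variable names):
if head == 'p':
    data = soup.select("[type='application/ld+json']")[0]
    # strip the stray ** markers before parsing, then drill into author -> alternateName
    oJson2 = json.loads(data.text.replace('**', ''))['author']['alternateName']
    output.write(oJson2 + '\n')
else:
    output.write(head + '\n')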
My code should take a link passed on the command line, get the HTML code for the webpage at that link, search the HTML for links on that page, and then repeat these steps for each link found. I hope that is clear.
It should print out any links that cause errors.
Some more needed info:
The max visits it can do is 100.
If a website has an error, a None value is returned.
Python3 is what I am using
eg:
s = readwebpage(url)... # This line of code gets the HTML code for the link(url) passed in its argument.... if the link has an error, s = None.
The HTML code for that website has links that end in p2.html, p3.html, p4.html, and p5.html on its webpage. My code reads all of these, but it does not visit these links individually to search for more links. If it did this, it should search through these links and find a link that ends in p10.html, and then it should report that the link ending with p10.html has errors. Obviously it doesn't do that at the moment, and it's giving me a hard time.
My code:
url = args.url[0]
url_list = [url]
checkedURLs = []
AmountVisited = 0
while (url_list and AmountVisited<maxhits):
url = url_list.pop()
s = readwebpage(url)
print("testing url: http",url) #Print the url being tested, this code is here only for testing..
AmountVisited = AmountVisited + 1
if s == None:
print("* bad reference to http", url)
else:
urls_list = re.findall(r'href="http([\s:]?[^\'" >]+)', s) #Creates a list of all links in HTML code starting with...
while urls_list: #... http or https
insert = urls_list.pop()
while(insert in checkedURLs and urls_list):
insert = urls_list.pop()
url_list.append(insert)
checkedURLs = insert
Please help :)
Here is the code you wanted. However, please stop using regexes for parsing HTML; BeautifulSoup is the way to go for that.
import re
from urllib.request import urlopen

def readwebpage(url):
    print("testing", url)
    try:
        return urlopen(url).read().decode('utf-8', errors='replace')
    except Exception:
        # mirror the behaviour described in the question: a bad page gives None
        return None

url = 'http://xrisk.esy.es'  # put starting url here
yet_to_visit = [url]
visited_urls = []
AmountVisited = 0
maxhits = 10

while yet_to_visit and AmountVisited < maxhits:
    print(yet_to_visit)
    current = yet_to_visit.pop()
    AmountVisited = AmountVisited + 1
    html = readwebpage(current)
    if html is None:
        print("* bad reference to http", current)
    else:
        r = re.compile('(?<=href=").*?(?=")')
        links = re.findall(r, html)  # every href value in the page
        for u in links:
            if u in visited_urls:
                continue
            elif u.find('http') != -1:
                yet_to_visit.append(u)
        print(links)
    visited_urls.append(current)
Not Python, but since you mentioned you aren't strictly tied to regex, I think you might find some use in wget for this.
wget --spider -o C:\wget.log -e robots=off -w 1 -r -l 10 http://www.stackoverflow.com
Broken down:
--spider: When invoked with this option, Wget will behave as a Web spider, which means that it will not download the pages, just check that they are there.
-o C:\wget.log: Log all messages to C:\wget.log.
-e robots=off: Ignore robots.txt
-w 1: set a wait time of 1 second
-r: set recursive search on
-l 10: sets the recursion depth to 10, meaning wget will only go 10 levels deep; this may need to change depending on your max requests
http://www.stackoverflow.com: the URL you want to start with
Once complete, you can review the wget.log entries to determine which links had errors by searching for something like HTTP status codes 404, etc.
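If you would rather do that last step from Python, a minimal sketch (assuming the log sits at C:\wget.log and uses wget's default English messages) could be:
# scan the wget log for lines that look like errors
with open(r"C:\wget.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        if "broken link" in line or " 404 " in line or " 500 " in line:
            print(line.rstrip())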
I suspect your regex is part of your problem. Right now, you have http outside your capture group, and [\s:] matches "some sort of whitespace (i.e. \s) or :".
I'd change the regex to: urls_list = re.findall(r'href="(.*)"',s). Also known as "match anything in quotes, after href=". If you absolutely need to ensure the http[s]://, use r'href="(https?://.*)"' (s? => one or zero s)
EDIT: And with an actually working regex, using a non-greedy match: r'href=(?P<q>[\'"])(https?://.*?)(?P=q)'
(Also, uh, while it's not technically necessary in your case because re caches, I think it's good practice to get into the habit of using re.compile.)
I think it's awfully nice that all of your URLs are full URLs. Do you have to deal with relative URLs at all?
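For reference, here is that pattern run against some made-up HTML, just to show the non-greedy group and the back-referenced quote in action:
import re

html = '<a href="http://example.com/p2.html">two</a> <a href=\'https://example.com/p3.html\'>three</a>'
link_re = re.compile(r'href=(?P<q>[\'"])(https?://.*?)(?P=q)')
print([m.group(2) for m in link_re.finditer(html)])
# -> ['http://example.com/p2.html', 'https://example.com/p3.html']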
So I'm trying to create a Python script that will take a search term or query, then search Google for that term. It should then return 5 URLs from the results of the search.
I spent many hours trying to get PyGoogle to work, but later found out that Google no longer supports the SOAP API for search, nor do they provide new license keys. In a nutshell, PyGoogle is pretty much dead at this point.
So my question here is... What would be the most compact/simple way of doing this?
I would like to do this entirely in Python.
Thanks for any help
Use BeautifulSoup and requests to get the links from the Google search results.
import requests
from bs4 import BeautifulSoup
keyword = "Facebook" #enter your keyword here
search = "https://www.google.co.uk/search?sclient=psy-ab&client=ubuntu&hs=k5b&channel=fs&biw=1366&bih=648&noj=1&q=" + keyword
r = requests.get(search)
soup = BeautifulSoup(r.text, "html.parser")
container = soup.find('div',{'id':'search'})
url = container.find("cite").text
print(url)
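If you need the first five result URLs rather than just one, a variation on the code above (still assuming Google wraps result URLs in <cite> tags, which can change at any time) would be:
import requests
from bs4 import BeautifulSoup

keyword = "Facebook"
search = "https://www.google.co.uk/search?q=" + keyword
r = requests.get(search)

soup = BeautifulSoup(r.text, "html.parser")
container = soup.find('div', {'id': 'search'})
for cite in container.find_all("cite")[:5]:  # first five result URLs
    print(cite.text)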
What issues are you having with pygoogle? I know it is no longer supported, but I've utilized that project on many occasions and it would work fine for the menial task you have described.
Your question did make me curious, though, so I went to Google and typed "python google search". Bam, found this repository. I installed it with pip, and within 5 minutes of browsing the documentation I got what you asked for:
import google
for url in google.search("red sox", num=5, stop=1):
print(url)
Maybe try a little harder next time, ok?
Here is a link to the xgoogle library, which does the same.
I tried something similar to get the top 10 links; it also counts occurrences of the word we are targeting in each linked page. I have added the code snippet for your reference:
import operator
import urllib
#This line will import GoogleSearch, SearchError class from xgoogle/search.py file
from xgoogle.search import GoogleSearch, SearchError
my_dict = {}
print "Enter the word to be searched : "
#read user input
yourword = raw_input()
try:
#This will perform google search on our keyword
gs = GoogleSearch(yourword)
gs.results_per_page = 80
#get google search result
results = gs.get_results()
source = ''
    #loop through all results to get each link and its content
for res in results:
#print res.url.encode('utf8')
#this will give url
parsedurl = res.url.encode("utf8")
myurl = urllib.urlopen(parsedurl)
        #the line above opens the URL; the line below reads the content of that web page
source = myurl.read()
        #This line will count occurrences of the entered keyword in the webpage
count = source.count(yourword)
        #We store our result in a dictionary: for each URL, we store its keyword count
my_dict[parsedurl] = count
except SearchError, e:
print "Search failed: %s" % e
print my_dict
#sorted_x = sorted(my_dict, key=lambda x: x[1])
for key in sorted(my_dict, key=my_dict.get, reverse=True):
print(key,my_dict[key])