I was trying to create an Instagram post downloader bot with Python:
import requests
import re

# Fetch the URL's page content (retries until HTTP 200)
def get_response(url):
    r = requests.get(url)
    while r.status_code != 200:
        r = requests.get(url)
    return r.text

def prepare_urls(matches):
    return list({match.replace("\\u0026", "&") for match in matches})

url = input('Enter Instagram URL: ')
response = get_response(url)

# Check if there is a video URL or picture URL in the JSON page that was opened
vid_matches = re.findall('"video_url":"([^"]+)"', response)
pic_matches = re.findall('"display_url":"([^"]+)"', response)
vid_urls = prepare_urls(vid_matches)
pic_urls = prepare_urls(pic_matches)

if vid_urls:
    print('Detected Videos:\n{0}'.format('\n'.join(vid_urls)))
if pic_urls:
    print('Detected Pictures:\n{0}'.format('\n'.join(pic_urls)))
if not (vid_urls or pic_urls):
    print('Could not recognize the media in the provided URL.')
After I finished the code, I tried it with a video link and it worked. An hour later I tried the same video link, but it printed the third condition: "Could not recognize the media in the provided URL."
I'm confused. As you can see, I never used my login credentials in the code, yet it works the first time and not the second...
Any ideas?
Make it so that each URL ends with the string "?__a=1" (a minimal sketch of how to append it follows the example below).
For example, instead of:
https://www.instagram.com/p/CECsuu2BgXj/
It should be:
https://www.instagram.com/p/CECsuu2BgXj/?__a=1
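For instance, one way to append it before calling the question's get_response (a sketch; a URL that already carries other query parameters would need extra handling):

url = input('Enter Instagram URL: ')
# append "?__a=1" so Instagram returns the JSON representation of the post
if not url.endswith('?__a=1'):
    url = url.rstrip('/') + '/?__a=1'
response = get_response(url)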
Output:
Detected Videos:
https://instagram.fdet1-2.fna.fbcdn.net/v/t50.2886-16/117817389_1889475617843249_1329686959743847420_n.mp4?efg=eyJ2ZW5jb2RlX3RhZyI6InZ0c192b2RfdXJsZ2VuLjcyMC5jbGlwcy5kZWZhdWx0IiwicWVfZ3JvdXBzIjoiW1wiaWdfd2ViX2RlbGl2ZXJ5X3Z0c19vdGZcIl0ifQ&_nc_ht=instagram.fdet1-2.fna.fbcdn.net&_nc_cat=105&_nc_ohc=OZRYx-3yUoAAX-b1xzZ&edm=AABBvjUBAAAA&vs=17858436890092651_3299599943&_nc_vs=HBksFQAYJEdDM0FCUWN4YUFQVGQ3WUdBUHhMQUxJXy0zTVNicV9FQUFBRhUAAsgBABUAGCRHQ0hOQ2dkbFlrcEYwOWtDQUtHQ0RqWUV4cGdzYnFfRUFBQUYVAgLIAQAoABgAGwAVAAAm1onK7OqJuT8VAigCQzMsF0AkmZmZmZmaGBJkYXNoX2Jhc2VsaW5lXzFfdjERAHX%2BBwA%3D&ccb=7-4&oe=6200F187&oh=00_AT-WTSxaoeTOd_GO0gMtqSqkgRXtxibffFG5pJGyCOPTNQ&_nc_sid=83d603
Detected Pictures:
https://instagram.fdet1-1.fna.fbcdn.net/v/t51.2885-15/e35/117915347_192544875567579_944852773653606759_n.jpg?_nc_ht=instagram.fdet1-1.fna.fbcdn.net&_nc_cat=103&_nc_ohc=0Bdvog7HWe8AX-3vsql&edm=AABBvjUBAAAA&ccb=7-4&oh=00_AT_O33BzV3tCKaDp_9eqeBUiYgyzVguImltLTuPIPKP4hg&oe=6201035F&_nc_sid=83d603
I am trying to pull the number of followers from a list of Instagram accounts. I have tried using the "find" method within Requests; however, the string that I am looking for when I inspect the actual Instagram page no longer appears when I print "r" from the code below.
I was able to get this code to run successfully in the past; however, it will no longer run.
Webscraping Instagram follower count BeautifulSoup
import requests
user = "espn"
url = 'https://www.instagram.com/' + user
r = requests.get(url).text
start = '"edge_followed_by":{"count":'
end = '},"followed_by_viewer"'
print(r[r.find(start)+len(start):r.rfind(end)])
I receive "-1", which means the substring from the find method was not found within the variable "r".
I think it's because of the last ' in start and first ' in end...this will work:
import requests
import re
user = "espn"
url = 'https://www.instagram.com/' + user
r = requests.get(url).text
followers = re.search('"edge_followed_by":{"count":([0-9]+)}',r).group(1)
print(followers)
'14061730'
I want to suggest an updated solution to this question, as the answer from Derek Eden above, from 2019, no longer works, as stated in its comments.
The solution was to add the r' before the regular expression in the re.search like so:
follower_count = re.search(r'"edge_followed_by\\":{\\"count\\":([0-9]+)}', response).group(1)
This r'' prefix is really important: without it, Python treats the expression as a regular string, processing the backslashes as escape sequences before the regex ever sees them, which leads to the query not giving any results.
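A tiny generic demonstration of the difference (the sample string here is made up for the demo):

import re

sample = r'\"count\":42'  # text containing literal backslashes, like the Instagram HTML
# without r'': '\\"' collapses to '\"', i.e. a plain quote, so the backslash is never matched
print(re.search('\\"count\\":([0-9]+)', sample))            # None
# with r'': '\\' stays two characters and matches one literal backslash
print(re.search(r'\\"count\\":([0-9]+)', sample).group(1))  # '42'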
Also, the Instagram page seems to have backslashes in the object we look for, at least in my tests, so the code example I use is the following, in Python 3.10 and working as of July 2022:
# get follower count of instagram profile
import os.path
import requests
import re
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# get instagram follower count
def get_instagram_follower_count(instagram_username):
    url = "https://www.instagram.com/" + instagram_username
    filename = "instagram.html"
    try:
        if not os.path.isfile(filename):
            r = requests.get(url, verify=False)
            print(r.status_code)
            print(r.text)
            response = r.text
            if not r.status_code == 200:
                raise Exception("Error: " + str(r.status_code))
            with open(filename, "w") as f:
                f.write(response)
        else:
            with open(filename, "r") as f:
                response = f.read()
        # print(response)
        follower_count = re.search(r'"edge_followed_by\\":{\\"count\\":([0-9]+)}', response).group(1)
        return follower_count
    except Exception as e:
        print(e)
        return 0

print(get_instagram_follower_count('your.instagram.profile'))
The method returns the follower count as expected. Please note that I added a few lines to avoid hammering Instagram's webserver and getting blocked while testing, by simply saving the response to a file.
This is a slice of the original html content that contains the part we are looking for:
... mRL&s=1\",\"edge_followed_by\":{\"count\":110070},\"fbid\":\"1784 ...
I debugged the regex in RegExr; it seems to work just fine at this point in time.
There are many posts about the regex r prefix.
The documentation of the re package also clearly shows that this is the issue with the code above.
I've been tinkering with Python using Pythonista on my iPad. I decided to write a simple script that pulls song lyrics in Japanese from one website, and makes post requests to another website that basically annotates the lyrics with extra information.
When I use Python 2 and the module mechanize for the second website, everything works fine, but when I use Python 3 and requests, the resulting text is nonsense.
This is a minimal script that doesn't exhibit the issue:
#!/usr/bin/env python2
from bs4 import BeautifulSoup
import requests
import mechanize

def main():
    # Get lyrics from first website (lyrical-nonsense.com)
    url = 'https://www.lyrical-nonsense.com/lyrics/bump-of-chicken/hello-world/'
    html_raw_lyrics = BeautifulSoup(requests.get(url).text, "html5lib")
    raw_lyrics = html_raw_lyrics.find("div", id="Lyrics").get_text()

    # Use second website to annotate lyrics with furigana
    browser = mechanize.Browser()
    browser.open('http://furigana.sourceforge.net/cgi-bin/index.cgi')
    browser.select_form(nr=0)
    browser.form['text'] = raw_lyrics
    request = browser.submit()

    # My actual script does more stuff at this point, but this snippet doesn't need it
    annotated_lyrics = BeautifulSoup(request.read().decode('utf-8'), "html5lib").find("body").get_text()
    print annotated_lyrics

if __name__ == '__main__':
    main()
The truncated output is:
扉(とびら)開(ひら)けば捻(ねじ)れた昼(ひる)の夜(よる)昨日(きのう)どうやって帰(かえ)った体(からだ)だけが確(たし)かおはよう これからまた迷子(まいご)の続(つづ)き見慣(みな)れた知(し)らない景色(けしき)の中(なか)でもう駄目(だめ)って思(おも)ってから わりと何(なん)だかやれている死(し)にきらないくらいに丈夫(じょうぶ)何(なに)かちょっと恥(は)ずかしいやるべきことは忘(わす)れていても解(わか)るそうしないと とても苦(くる)しいから顔(かお)を上(あ)げて黒(くろ)い目(め)の人(にん)君(くん)が見(み)たから光(ひかり)は生(う)まれた選(えら)んだ色(しょく)で塗(ぬ)った世界(せかい)に [...]
This is a minimal script that exhibits the issue:
#!/usr/bin/env python3
from bs4 import BeautifulSoup
import requests

def main():
    # Get lyrics from first website (lyrical-nonsense.com)
    url = 'https://www.lyrical-nonsense.com/lyrics/bump-of-chicken/hello-world/'
    html_raw_lyrics = BeautifulSoup(requests.get(url).text, "html5lib")
    raw_lyrics = html_raw_lyrics.find("div", id="Lyrics").get_text()

    # Use second website to annotate lyrics with furigana
    url = 'http://furigana.sourceforge.net/cgi-bin/index.cgi'
    data = {'text': raw_lyrics, 'state': 'output'}
    html_annotated_lyrics = BeautifulSoup(requests.post(url, data=data).text, "html5lib")
    annotated_lyrics = html_annotated_lyrics.find("body").get_text()
    print(annotated_lyrics)

if __name__ == '__main__':
    main()
whose truncated output is:
IQp{_<n(åiFcf0c_S`QLºKJoFSK~_÷PnMc_åjDorn-gFÄîcfcfKhU`KfD{kMjDOD+UKacheZKWDyMSho،fDfã]FWjDhhfæWDKTRfÒDînºL_KIo~_x`rgWc_Lkò~fxyjD·nsoiS`FTê`QLÒüíüLn [...]
It's worth noting that if I just try to get the HTML of the second request, like so:
# Use second website to annotate lyrics with furigana
url = 'http://furigana.sourceforge.net/cgi-bin/index.cgi'
data = {'text': raw_lyrics, 'state': 'output'}
annotated_lyrics = requests.post(url, data=data).content.decode('utf-8')
An embedded null character error occurs when printing annotated_lyrics. This issue can be circumvented by passing truncated lyrics to the post request. In the current example, only one character can be passed.
However, with
url = 'https://www.lyrical-nonsense.com/lyrics/aimer/brave-shine/'
I can pass up to 51 characters, like so:
data = {'text': raw_lyrics[0:51], 'state': 'output'}
before triggering the embedded null character error.
I've tried using urllib instead of requests, and decoding and re-encoding to UTF-8 both the resulting HTML of the post request and the data passed as an argument to it. I've also checked that the encoding of the website is UTF-8, which matches the encoding of the post requests:
r = requests.post(url, data=data)
print(r.encoding)
prints utf-8.
I think the problem has to do with how Python 3 is more strict in how it treats strings vs bytes, but I've been unable to pinpoint the exact cause.
While I'd appreciate a working code sample in Python 3, I'm more interested in what exactly I'm doing wrong, and in what the code is doing that results in failure.
I'm able to get the lyrics properly with this code in Python 3.x:
url = 'https://www.lyrical-nonsense.com/lyrics/bump-of-chicken/hello-world/'
resp = requests.get(url)
print(BeautifulSoup(resp.text).find('div', class_='olyrictext').get_text())
Printing (truncated)
>>> BeautifulSoup(resp.text).find('div', class_='olyrictext').get_text()
'扉開けば\u3000捻れた昼の夜\r\n昨日どうやって帰った\u3000体だけ...'
A few things strike me as odd there, notably the \r\n (Windows line ending) and \u3000 (IDEOGRAPHIC SPACE), but that's probably not the problem.
The one thing I noticed that's odd about the form submission (and why the browser emulator probably succeeds) is that the form uses multipart instead of urlencoded form data (signified by enctype="multipart/form-data").
Sending multipart form data is a little bit strange in requests. I had to poke around a bit and eventually found an approach that shows how to format the multipart data in a way that the backing server understands: you have to abuse files but pass a None filename. "for humans", hah!
url2 = 'http://furigana.sourceforge.net/cgi-bin/index.cgi'
resp2 = requests.post(url2, files={'text': (None, raw_lyrics), 'state': (None, 'output')})
And the text is not mangled now!
>>> BeautifulSoup(resp2.text).find('body').get_text()
'\n扉(とびら)開(ひら)けば捻(ねじ)れた昼(ひる)...'
(Note that this code should work in either python2 or python3)
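If you want to see what this files trick actually puts on the wire, you can inspect a prepared request without sending it (a small sketch using a placeholder URL):

import requests

# build (but don't send) the same kind of POST to inspect the multipart body
req = requests.Request(
    'POST', 'http://example.com/',
    files={'text': (None, 'hello'), 'state': (None, 'output')},
).prepare()
print(req.headers['Content-Type'])  # multipart/form-data; boundary=...
print(req.body.decode())            # each part has a field name but no filename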
I am trying to develop a script with Python to scrape some information from a specific website, for learning purposes.
I went over a lot of different tutorials and posts, trying to gather insights from them; they are very useful, but they still didn't help me find a way to log in to the website and do searches with different keywords.
I tried different libraries, such as requests and urllib; maybe I didn't find the right way to solve it.
The steps are listed as follows:
login information set up
Send login information to the website and get response for future use
keywords setup
import header
set up cookiejar
from login response, do the search
After testing, it only works randomly. Here is the code:
import getpass

# marvin
# date: 2018/2/7
# login stage preparation
def login_values():
    login = "https://www.****.com/login"
    username = input("Please insert your username: ")
    password = getpass.getpass("Please type in your password: ")
    host = "www.****.com"
    # store login secrets
    data = {
        "username": username,
        "password": password,
    }
    return login, host, data
The following is for getting the HTML file from a website
import requests
import random
import http.cookiejar
import socket

# Set up web scraping function to output the html text file
def webscrape(login_url, host_url, login_data, target_url):
    # static values preparation
    ## import header
    user_agents = [
        ***
    ]
    agent = random.choice(user_agents)
    headers = {'User-agent': agent,
               'Accept': '*/*',
               'Accept-Language': 'en-US,en;q=0.9;zh-cmn-Hans',
               'Host': host_url,
               'charset': 'utf-8',
               }
    ## set up cookie jar
    cj = http.cookiejar.CookieJar()

    # get the html file
    socket.setdefaulttimeout(20)
    s = requests.Session()
    req = s.post(login_url, data=login_data)
    res = s.get(target_url, cookies=cj, headers=headers)
    html = res.text
    return html
Here is the code to get each link from the HTML:
from bs4 import BeautifulSoup

# set up html parsing function for parsing all the list links
def getlist(keyword, loginurl, hosturl, valuesurl, html_lists):
    page = 1
    pagenum = 10  # set up maximum page num
    links = []
    soup = BeautifulSoup(html_lists, "lxml")
    try:
        for li in soup.find("div", class_="search_pager human_pager in-block").ul.find_all('li'):
            target_part = soup.find_all("div", class_="search_result_single search-2017 pb25 pt25 pl30 pr30 ")
            [links.append(link.find('a')['href']) for link in target_part]
            page += 1
            if page <= pagenum:
                try:
                    nexturl = soup.find('div', class_='search_pager human_pager in-block').ul.find('li', class_='pagination-next ng-scope ').a['href']  # next page
                except AttributeError:
                    print("{}'s links are all stored!".format(keyword))
                    return links
                else:
                    chs_html = webscrape(loginurl, hosturl, valuesurl, nexturl)
                    soup = BeautifulSoup(chs_html, "lxml")
    except AttributeError:
        target_part = soup.find_all("div", class_="search_result_single search-2017 pb25 pt25 pl30 pr30 ")
        [links.append(link.find('a')['href']) for link in target_part]
        print("There is only one page")
        return links
The test code is:
keyword="****"
myurl="https://www.****.com/search/os2?key={}".format(keyword)
chs_html=webscrape(login,host,values,myurl)
chs_links=getlist(keyword,login,host,values,chs_html)
targethtml=webscrape(login,host,values,chs_links[1])
There are 22 links in total and one page contains 19 links, so there is supposed to be more than one page; if the result "There is only one page" shows up, it indicates a failure.
Problems:
The login_values function is meant to protect my login information by combining all functions into a final function, but apparently the username and password are still really easy to reveal with a simple print() call.
This is the main problem!! As I mentioned before, this method works only randomly. By "not working" I mean that the returned HTML is just the login page instead of the search results. I want better control so that it works most of the time. I checked the user-agents by printing the agent each time to see if they were the cause, and they are not! I cleared cookies, suspecting full storage memory, and that is not it either.
Sometimes I face a max retries error or an OS error; I guess these come from the server I am trying to reach. Is there a way to set up a wait timer to prevent these errors from happening?
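One way such a wait timer might look, as a hedged sketch (the attempt count and delay bounds here are arbitrary placeholders):

import random
import time
import requests

def get_with_retry(session, url, attempts=3, min_wait=2, max_wait=8, **kwargs):
    # Retry a GET with a randomized pause between attempts so the
    # server is not hammered; re-raises the last error if all attempts fail.
    for attempt in range(attempts):
        try:
            return session.get(url, timeout=20, **kwargs)
        except requests.exceptions.RequestException:
            if attempt == attempts - 1:
                raise
            time.sleep(random.randint(min_wait, max_wait))

# usage inside webscrape, for example:
# res = get_with_retry(s, target_url, headers=headers)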
import requests
MSA_request=""">G1
MGCTLSAEDKAAVERSKMIDRNLREDGEKAAREVKLLLL
>G2
MGCTVSAEDKAAAERSKMIDKNLREDGEKAAREVKLLLL
>G3
MGCTLSAEERAALERSKAIEKNLKEDGISAAKDVKLLLL"""
q={"stype":"protein","sequence":MSA_request,"outfmt":"clustal"}
r=requests.post("http://www.ebi.ac.uk/Tools/msa/clustalo/",data=q)
This is my script. I send this request to the website, but the result looks like I did nothing; the web service didn't receive my request. This method used to work fine with other websites. Maybe this page has a pop-up window asking for cookie agreement?
The form on the page you are referring to has a separate URL, namely
http://www.ebi.ac.uk/Tools/services/web_clustalo/toolform.ebi
you can verify this with a DOM inspector in your browser.
So in order to proceed with requests, you need to access the right page
r=requests.post("http://www.ebi.ac.uk/Tools/services/web_clustalo/toolform.ebi",data=q)
this will submit a job with your input data; it doesn't return the result directly. To check the results, it's necessary to extract the job ID from the previous response and then send another request (with no data) to
http://www.ebi.ac.uk/Tools/services/web_clustalo/toolresult.ebi?jobId=...
However, you should definitely check whether this programmatic access is compatible with the TOS of that website...
Here is an example:
from lxml import html
import requests
import sys
import time
MSA_request=""">G1
MGCTLSAEDKAAVERSKMIDRNLREDGEKAAREVKLLLL
>G2
MGCTVSAEDKAAAERSKMIDKNLREDGEKAAREVKLLLL
>G3
MGCTLSAEERAALERSKAIEKNLKEDGISAAKDVKLLLL"""
q={"stype":"protein","sequence":MSA_request,"outfmt":"clustal"}
r = requests.post("http://www.ebi.ac.uk/Tools/services/web_clustalo/toolform.ebi",data = q)
tree = html.fromstring(r.text)
title = tree.xpath('//title/text()')[0]
#check the status and get the job id
status, job_id = map(lambda s: s.strip(), title.split(':', 1))
if status != "Job running":
    sys.exit(1)
#it might take some time for the job to finish
time.sleep(10)
#download the results
r = requests.get("http://www.ebi.ac.uk/Tools/services/web_clustalo/toolresult.ebi?jobId=%s" % (job_id))
#prints the full response
#print(r.text)
#isolate the alignment block
tree = html.fromstring(r.text)
alignment = tree.xpath('//pre[@id="alignmentContent"]/text()')[0]
print(alignment)
I am attempting to create a bot that fetches market links from Steam, but I have run into a problem. I was able to return all the data from a single page, but when I attempt to get multiple pages it just gives me copies of the first page, even though I give it working links (e.g., http://steamcommunity.com/market/search?q=appid%3A753#p1 and then http://steamcommunity.com/market/search?q=appid%3A753#p2). I have tested the links and they work in my browser. This is my code.
import urllib2
import random
import time

start_url = "http://steamcommunity.com/market/search?q=appid%3A753"
end_page = 3
urls = []

def get_raw(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read()

def get_market_urls(html):
    index = 0
    while index != -1:
        index = html.find("market_listing_row_link", index + 25)
        beg = html.find("http", index)
        end = html.find('"', beg)
        print html[beg:end]
        urls.append(html[beg:end])

def go_to_page(page):
    return start_url + "#p" + str(page)

def wait(min, max):
    wait_t = random.randint(min, max)
    time.sleep(wait_t)

for i in range(end_page):
    url = go_to_page(i + 1)
    raw = get_raw(url)
    get_market_urls(raw)
Your problem is that you've misunderstood what the URL says.
The number after the hash (#) doesn't mean it's a different URL that can be fetched; it is the fragment identifier, which is never sent to the server. On that particular page, the fragment tells the JavaScript which page to pull over AJAX.
Anyway, you should look at this URL instead: http://steamcommunity.com/market/search/render/?query=appid%3A753&start=00&count=10. You can play with the start=00&count=10 parameters to get the results you want.
Enjoy.
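For illustration, a sketch of paging through that render endpoint with the question's urllib2 setup (the "results_html" field name is my assumption about the JSON response; verify it against the live output):

import urllib2
import json

def get_render_page(start, count=10):
    # fetch one "page" of listings from the render endpoint as JSON
    url = ("http://steamcommunity.com/market/search/render/"
           "?query=appid%3A753&start={0}&count={1}".format(start, count))
    return json.loads(urllib2.urlopen(url).read())

for start in range(0, 30, 10):  # first three pages of 10 results
    data = get_render_page(start)
    # "results_html" is assumed to hold the rendered listings markup
    get_market_urls(data.get("results_html", ""))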