Extracting hyperlinks in Python

I'm trying to make a simple web browser in Python. (I'm a novice programmer and this is the first time I'm using Python.)
I'm aware that I have to save my links in a list and create a function that goes to the chosen URL each time the user selects one from that list, but I have no idea how to do that. I would very much appreciate it if someone could help me with that.
Here's my code:
#!/usr/bin/env python
import urllib

url = "http://google.com"
data = urllib.urlopen(url)
tokens = data.read().split()
List = []
for token in tokens:
    if token == '<body>':
        print ''
    elif token == '</body>':
        print ''
    #elif token[6:-2] == '<a href':
    else:
        print token,
selectedLink = raw_input('Select a link:')
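One way to do what you describe, offered only as a rough sketch: instead of splitting the page on whitespace, feed it through the standard-library HTML parser, collect every href into a list, print the links numbered, and fetch whichever one the user picks. The sketch below is Python 3 (your snippet is Python 2, where urllib and raw_input have the older names):
#!/usr/bin/env python3
# Rough sketch: collect the links on a page and follow the one the user picks.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # every <a href="..."> contributes one entry to the list
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def browse(url):
    html = urlopen(url).read().decode('utf-8', errors='replace')
    parser = LinkCollector()
    parser.feed(html)
    for number, link in enumerate(parser.links):
        print(number, urljoin(url, link))        # resolve relative links against the page URL
    choice = int(input('Select a link: '))
    return urljoin(url, parser.links[choice])    # the URL to open next

next_url = browse("http://google.com")
print('Would now open:', next_url)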

Related

How can I re-check a scraped page with requests in real time (always, auto-update)? Python

I'm a young programmer and I have a question.
I have code checking discount percentages on https://shadowpay.com/en?price_from=0.00&price_to=34.00&game=csgo&hot_deal=true
and I want to make it happen in real time.
Questions:
Is there a way to make it check in real time, or only by refreshing the page?
If it is by refreshing the page:
How can I make it refresh the page? I saw older answers, but they did not work for me because they only worked in their code.
(I tried to request-get it every time the while loop runs, but it doesn't work, or should it?)
This is the code:
import json
import requests
import time
import plyer
import random
import copy
min_notidication_perc = 26; un = 0; us = ""; biggest_number = 0;
r = requests.get('https://api.shadowpay.com/api/market/get_items?types=[]&exteriors=[]&rarities=[]&collections=[]&item_subcategories=[]&float={"from":0,"to":1}&price_from=0.00&price_to=34.00&game=csgo&hot_deal=true&stickers=[]&count_stickers=[]&short_name=&search=&stack=false&sort=desc&sort_column=price_rate&limit=50&offset=0', timeout=3)
while True:
    #Here is the place where I'm thinking of putting it
    time.sleep(5); skin_list = []; perc_list = []
    for i in range(len(r.json()["items"])):
        perc_list.append(r.json()["items"][i]["discount"])
        skin_list.append(r.json()["items"][i]["collection"]["name"])
    skin = skin_list[perc_list.index(max(perc_list))]; print(skin)
    biggest_number = int(max(perc_list))
    if un != biggest_number or us != skin:
        if int(max(perc_list)) >= min_notidication_perc:
            plyer.notification.notify(
                title=f'-{int(max(perc_list))}% ShadowPay',
                message=f'{skin}',
                app_icon="C:\\Users\\<user__name>\\Downloads\\Inipagi-Job-Seeker-Target.ico",
                timeout=120,
            )
        else:
            pass
    else:
        pass
    us = skin; un = biggest_number
    print(f'id: {random.randint(1, 99999999)}')
    print(f'-{int(max(perc_list))}% discount\n')
When you call requests.get() you retrieve the page source for that URL once, and then the response is closed. Since requests already blocks while waiting for the response, you don't need a time.sleep(5) for that part.
To get a real-time value you have to request the page again on every iteration of the loop; that is where time.sleep() belongs, so you don't abuse the API.
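A sketch of that change (the query string is abbreviated here; the full get_items URL from the question goes into API_URL):
import requests
import time

# abbreviated: use the full get_items URL from the question here
API_URL = ('https://api.shadowpay.com/api/market/get_items'
           '?price_from=0.00&price_to=34.00&game=csgo&hot_deal=true&limit=50&offset=0')

while True:
    r = requests.get(API_URL, timeout=3)        # fresh request on every pass, so the data is current
    items = r.json().get("items", [])
    if items:
        best = max(items, key=lambda item: item["discount"])
        print(best["collection"]["name"], best["discount"])
    time.sleep(5)                                # pause between polls so the API isn't hammered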

Extracting follower count from Instagram

I am trying to pull the number of followers from a list of Instagram accounts. I have tried using the "find" method within Requests; however, the string that I am looking for when I inspect the actual Instagram page no longer appears when I print "r" from the code below.
I was able to get this code to run successfully in the past; however, it will no longer run.
Webscraping Instagram follower count BeautifulSoup
import requests
user = "espn"
url = 'https://www.instagram.com/' + user
r = requests.get(url).text
start = '"edge_followed_by":{"count":'
end = '},"followed_by_viewer"'
print(r[r.find(start)+len(start):r.rfind(end)])
I receive a "-1" error, which means the substring from the find method was not found within the variable "r".
I think it's because of the last ' in start and first ' in end...this will work:
import requests
import re
user = "espn"
url = 'https://www.instagram.com/' + user
r = requests.get(url).text
followers = re.search('"edge_followed_by":{"count":([0-9]+)}',r).group(1)
print(followers)
'14061730'
I want to suggest an updated solution to this question, as the answer from Derek Eden above (2019) does not work anymore, as stated in its comments.
The solution was to add r before the regular expression in the re.search, like so:
follower_count = re.search(r'"edge_followed_by\\":{\\"count\\":([0-9]+)}', response).group(1)
This r'' prefix is really important: without it, Python treats the expression as a regular string, which leads to the query not returning any results.
The Instagram page also seems to have backslashes in the object we are looking for, at least in my tests, so the code example I use is the following; it runs on Python 3.10 and is working as of July 2022:
# get follower count of instagram profile
import os.path
import requests
import re
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# get instagram follower count
def get_instagram_follower_count(instagram_username):
    url = "https://www.instagram.com/" + instagram_username
    filename = "instagram.html"
    try:
        if not os.path.isfile(filename):
            r = requests.get(url, verify=False)
            print(r.status_code)
            print(r.text)
            response = r.text
            if not r.status_code == 200:
                raise Exception("Error: " + str(r.status_code))
            with open(filename, "w") as f:
                f.write(response)
        else:
            with open(filename, "r") as f:
                response = f.read()
        # print(response)
        follower_count = re.search(r'"edge_followed_by\\":{\\"count\\":([0-9]+)}', response).group(1)
        return follower_count
    except Exception as e:
        print(e)
        return 0

print(get_instagram_follower_count('your.instagram.profile'))
The method returns the follower count as expected. Please note that I added a few lines so as not to hammer Instagram's web server and get blocked while testing, by simply saving the response to a file.
This is a slice of the original HTML content that contains the part we are looking for:
... mRL&s=1\",\"edge_followed_by\":{\"count\":110070},\"fbid\":\"1784 ...
I debugged the regex in regexr; it seems to work just fine at this point in time.
There are many posts about the regex r prefix, like this one.
The documentation of the re package also shows clearly that this is the issue with the code above.
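To make the raw-string point concrete, here is a small self-contained demo; the response string is just a hand-typed stand-in for the HTML slice shown above:
import re

# Stand-in for the page source: the quotes are preceded by literal backslashes.
response = 'mRL&s=1\\",\\"edge_followed_by\\":{\\"count\\":110070},\\"fbid\\"'

# Without the r prefix, '\\"' in the pattern collapses to \" before it reaches
# the regex engine, which then matches a plain quote and never the backslash:
print(re.search('\\"count\\":([0-9]+)', response))              # None

# With the r prefix the pattern keeps \\ as a literal backslash and matches:
print(re.search(r'\\"count\\":([0-9]+)', response).group(1))    # 110070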

Use the Google Custom Search API to search the web from Python

I'm a newbie in Python, HTML and CSS and am trying to reverse engineer "https://github.com/scraperwiki/google-search-python" to learn the three and to use the Google Custom Search API to search the web from Python. Specifically, I want to search the search engine I made through Google Custom Search, "https://cse.google.com/cse/publicurl?cx=000839040200690289140:u2lurwk5tko". I looked through the code, made some minor adjustments, and came up with the following "Search.py":
import os
from google_search import GoogleCustomSearch

#This is for the traceback
import traceback
import sys

#set variables
os.environ["SEARCH_ENGINE_ID"] = "000839... "
os.environ["GOOGLE_CLOUD_API_KEY"] = "AIza... "

SEARCH_ENGINE_ID = os.environ['SEARCH_ENGINE_ID']
API_KEY = os.environ['GOOGLE_CLOUD_API_KEY']

api = GoogleCustomSearch(SEARCH_ENGINE_ID, API_KEY)

print("we got here\n")

#for result in api.search('prayer', 'https://cse.google.com/cse/publicurl?cx=000839040200690289140:u2lurwk5tko'):
for result in api.search('pdf', 'http://scraperwiki.com'):
    print(result['title'])
    print(result['link'])
    print(result['snippet'])

print(traceback.format_exc())
And the import ("At least the relevant parts") I believe comes from the following code in google_search.py
class GoogleCustomSearch(object):
    def __init__(self, search_engine_id, api_key):
        self.search_engine_id = search_engine_id
        self.api_key = api_key

    def search(self, keyword, site=None, max_results=100):
        assert isinstance(keyword, basestring)
        for start_index in range(1, max_results, 10):  # 10 is max page size
            url = self._make_url(start_index, keyword, site)
            logging.info(url)
            response = requests.get(url)
            if response.status_code == 403:
                LOG.info(response.content)
            response.raise_for_status()
            for search_result in _decode_response(response.content):
                yield search_result
            if 'nextPage' not in search_result['meta']['queries']:
                print("No more pages...")
                return
However, when I try to run it, I get the following.
So, here's my problem: I can't quite figure out why the following lines of code don't print to the terminal. What am I overlooking?
print(result['title'])
print(result['link'])
print(result['snippet'])
The only thing I can think of is that I didn't use a correct ID or something. I created a Google Custom Search engine and a project on the Google Developers Console as the quick start suggested. Here is where I got my SEARCH_ENGINE_ID and GOOGLE_CLOUD_API_KEY from.
After I added the stack trace suggested in the comments, I got this.
Am I just misunderstanding the code, or is there something else I'm missing? I really appreciate any clues that will help me solve this problem; I'm kind of stumped right now.
Thanks in advance, guys!
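For what it's worth, the scraperwiki wrapper ultimately calls the Custom Search JSON API, so querying that API directly with requests is a quick way to check whether the key and engine ID return any items at all; if the response has no "items", the loop above would simply never print anything. A minimal sketch with placeholder credentials:
import requests

API_KEY = "AIza..."             # placeholder: your Google Cloud API key
SEARCH_ENGINE_ID = "000839..."  # placeholder: your custom search engine id (cx)

params = {
    "key": API_KEY,
    "cx": SEARCH_ENGINE_ID,
    "q": "pdf site:scraperwiki.com",
}
response = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
data = response.json()

# If the credentials are wrong, the response carries an "error" object instead of "items".
for item in data.get("items", []):
    print(item["title"])
    print(item["link"])
    print(item["snippet"])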

How to retrieve google URL from search query

So I'm trying to create a Python script that will take a search term or query, then search Google for that term. It should then return five URLs from the results of the search.
I spent many hours trying to get PyGoogle to work, but later found out Google no longer supports the SOAP API for search, nor do they provide new license keys. In a nutshell, PyGoogle is pretty much dead at this point.
So my question here is... What would be the most compact/simple way of doing this?
I would like to do this entirely in Python.
Thanks for any help
Use BeautifulSoup and requests to get the links from the google search results
import requests
from bs4 import BeautifulSoup
keyword = "Facebook" #enter your keyword here
search = "https://www.google.co.uk/search?sclient=psy-ab&client=ubuntu&hs=k5b&channel=fs&biw=1366&bih=648&noj=1&q=" + keyword
r = requests.get(search)
soup = BeautifulSoup(r.text, "html.parser")
container = soup.find('div',{'id':'search'})
url = container.find("cite").text
print(url)
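Since the question asks for five URLs, the same approach can be extended by collecting several cite elements instead of just the first one. Google's result markup changes often, so treat the selectors as best-effort:
import requests
from bs4 import BeautifulSoup

keyword = "Facebook"
search = "https://www.google.co.uk/search?q=" + keyword
r = requests.get(search)
soup = BeautifulSoup(r.text, "html.parser")

container = soup.find('div', {'id': 'search'})
# take the first five displayed URLs, mirroring the single-result version above
urls = [cite.text for cite in container.find_all("cite")[:5]]
for url in urls:
    print(url)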
What issues are you having with pygoogle? I know it is no longer supported, but I've used that project on many occasions and it worked fine for the menial task you have described.
Your question did make me curious, though, so I went to Google and typed "python google search". Bam, found this repository. Installed it with pip, and within 5 minutes of browsing the documentation got what you asked for:
import google
for url in google.search("red sox", num=5, stop=1):
    print(url)
Maybe try a little harder next time, ok?
Here is a link to the xgoogle library, which does the same.
I tried something similar to get the top 10 links, and it also counts occurrences of the word we are targeting in each linked page. I have added the code snippet for your reference:
import operator
import urllib
#This line will import GoogleSearch, SearchError class from xgoogle/search.py file
from xgoogle.search import GoogleSearch, SearchError

my_dict = {}
print "Enter the word to be searched : "
#read user input
yourword = raw_input()

try:
    #This will perform google search on our keyword
    gs = GoogleSearch(yourword)
    gs.results_per_page = 80
    #get google search result
    results = gs.get_results()
    source = ''
    #loop through all results to get each link and its content
    for res in results:
        #print res.url.encode('utf8')
        #this will give url
        parsedurl = res.url.encode("utf8")
        myurl = urllib.urlopen(parsedurl)
        #above line opens the url; the line below reads the content of that web page
        source = myurl.read()
        #This line will count occurrences of the entered keyword in our webpage
        count = source.count(yourword)
        #We store our result in a dictionary data structure. For each url, we store its word count.
        my_dict[parsedurl] = count
except SearchError, e:
    print "Search failed: %s" % e

print my_dict
#sorted_x = sorted(my_dict, key=lambda x: x[1])
for key in sorted(my_dict, key=my_dict.get, reverse=True):
    print(key, my_dict[key])

How can I use Google Map Feed API to get a list of my Google Maps using Python?

I want to create a script in Python which downloads the current KML files of all the Maps I created on Google Maps.
To do so manually, I can use this:
http://maps.google.com.br/maps/ms?msid=USER_ID.MAP_ID&msa=0&output=kml
where USER_ID is a constant number Google uses to identify me, and MAP_ID is the individual map identifier generated by the link icon on top-right corner.
This is not very straightforward, because I have to manually browse "My Places" page on Google Maps, and get the links one by one.
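Fetching one map whose USER_ID and MAP_ID I already know is simple enough; something along these lines should do it (placeholder IDs):
# one-off fetch for a single map whose ids are already known (placeholders)
import urllib

USER_ID = 'USER_ID'   # the constant number Google uses to identify me
MAP_ID = 'MAP_ID'     # the individual map identifier

kml_url = ('http://maps.google.com.br/maps/ms?msid=' + USER_ID + '.' + MAP_ID +
           '&msa=0&output=kml')
urllib.urlretrieve(kml_url, MAP_ID + '.kml')   # saves the KML next to the script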
From Google Maps API HTTP Protocol Reference:
The Map Feed is a feed of user-created maps.
This feed's full GET URI is:
http://maps.google.com/maps/feeds/maps/default/full
This feed returns a list of all maps for the authenticated user.
** The page says this service is no longer available, so I wonder if there is a way to do the same in the present.
So, the question is: is there a way to get/download the list of MAP_IDs of all my maps, preferably using Python?
Thanks for reading
The correct answer to this question involves using the Google Maps Data API, HTML interface, which by the way is deprecated, but it still solves my need in a more official way, or at least more convincingly than parsing a web page. Here it goes:
# coding: utf-8
import urllib2, urllib, re, getpass

username = 'heltonbiker'
senha = getpass.getpass('Senha do usuário ' + username + ':')

dic = {
    'accountType': 'GOOGLE',
    'Email': (username + '#gmail.com'),
    'Passwd': senha,
    'service': 'local',
    'source': 'helton-mapper-1'
}

url = 'https://www.google.com/accounts/ClientLogin?' + urllib.urlencode(dic)
output = urllib2.urlopen(url).read()
authid = output.strip().split('\n')[-1].split('=')[-1]

request = urllib2.Request('http://maps.google.com/maps/feeds/maps/default/full')
request.add_header('Authorization', 'GoogleLogin auth=%s' % authid)
source = urllib2.urlopen(request).read()

for link in re.findall('<link rel=.alternate. type=.text/html. href=((.)[^\1]*?)>', source):
    s = link[0]
    if 'msa=0' in s:
        print s
I arrived at this solution with the help of a bunch of other questions on SO, and a lot of people helped me a lot, so I hope this code might help anyone else trying to do the same in the future.
A quick and dirty way I have found, which skips the Google Maps API completely and might break in the near future, is this:
# coding: utf-8
import urllib, re
from BeautifulSoup import BeautifulSoup as bs

uid = '200931058040775970557'
start = 0
shown = 1

while True:
    url = 'http://maps.google.com/maps/user?uid='+uid+'&ptab=2&start='+str(start)
    source = urllib.urlopen(url).read()
    soup = bs(source)
    maptables = soup.findAll(id=re.compile('^map[0-9]+$'))
    for table in maptables:
        for line in table.findAll('a', 'maptitle'):
            mapid = re.search(uid+'\.([^"]*)', str(line)).group(1)
            mapname = re.search('>(.*)</a>', str(line)).group(1).strip()[:-2]
            print shown, mapid, mapname
            shown += 1
            # uncomment if you want to download the KML files:
            # urllib.urlretrieve('http://maps.google.com.br/maps/ms?msid=' + uid + '.' + str(mapid) +
            #                    '&msa=0&output=kml', mapname + '.kml')
    if '<span>Next</span>' in str(source):
        start += 5
    else:
        break
Of course it only prints a numbered list, but from there, saving the results into a dictionary and/or automating the KML download via the &output=kml URL trick follows naturally.
