Search for Google output using Python

I want to include in my Python program an option to search for the time in a particular city, and to get Google's output on some other things too.
I want to be able to get the Google output that appears at the top of the screen (often it's a snippet from Wikipedia or some other page) using Python code.
For instance:
How would I copy the 6:10 PM output with Python?

As shown, the URL is timeanddate.com; I capture the location, weekday and time.
I don't show the code that reads the HTML from the website; if you are interested, you can get it from the link below. You need to have bs4 and brotli installed.
Jason's Tool.py
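If you'd rather not fetch the helper, here is a minimal stand-in for read_URL, sketched purely from how it is used below (it is assumed to return the response and the HTML text); requests decodes brotli-compressed responses on its own when the brotli package is installed:
import requests

def read_URL(url):
    # hypothetical stand-in for Tool.read_URL; assumed to return (response, html_text)
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    html = response.text if response.ok else None
    return response, html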
import brotli
from Tool import read_URL
from bs4 import BeautifulSoup

url = "https://www.timeanddate.com/worldclock/"
response, html = read_URL(url)
if html:
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all('td')
    time_dict = {tags[i].text: tags[i + 1].text for i in range(0, len(tags), 2)
                 if tags[i].text.strip() != ''}
    # a dictionary created with 'location': 'weekday hh:mm' key pairs, like
    # 'Washington DC *': 'Tue 13:53'
It's better to keep time_dict as a reference, together with your system clock, rather than fetching the time data from the website every time.
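A rough sketch of that idea (my own illustration, not from the original answer): record the system time when time_dict is built, then derive a city's current time from the elapsed time instead of re-scraping:
from datetime import datetime

fetched_at = datetime.now()  # system clock reading taken right after time_dict is built

def city_time(location, time_dict, fetched_at):
    # values look like 'Tue 13:53'; take the hh:mm part and parse it
    hhmm = time_dict[location].split()[1]
    hour, minute = map(int, hhmm.split(':'))
    base = fetched_at.replace(hour=hour, minute=minute, second=0, microsecond=0)
    # add the time elapsed since the fetch to get the city's current local time
    return base + (datetime.now() - fetched_at)

# e.g. city_time('Washington DC *', time_dict, fetched_at).strftime('%H:%M')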

Unable to parse a number to be used within a link

I've created a script in Python to get the value of Tax District from a webpage. On its main page there is a form to fill in to generate the result containing the information I'm looking for. When I use my script below, I get the desired result, but I had to use a different link to parse it. The link I used within my script is available only when the form is filled in. The newly generated link (which I've used within my script) contains a number which I can't figure out how to find.
Main link
In the search form there is a radio button Street Address which is selected by default. Then:
house number: 5587 (just above Exact/Low)
street name: Surrey
This is the link https://wedge.hcauditor.org/view/re/5500171005200/2018/summary that is generated automatically; it contains the number 5500171005200.
I've written the following script to get the result, but I really don't know how the number in that URL is generated, as it changes when I use different search terms:
import requests
from bs4 import BeautifulSoup
url = 'https://wedge.hcauditor.org/view/re/5500171005200/2018/summary'
r = requests.get(url)
soup = BeautifulSoup(r.text,"lxml")
item = soup.select_one("div:contains('Tax District') + div").text
print(item)
How can I get the number used within the newly generated link?
It seems a POST followed by a GET works fine; no need to look for that other number. I use a Session to pass cookies. The link you reference, however, is found within the GET response.
import requests
from bs4 import BeautifulSoup as bs

data = {
    'search_type': 'Address',
    'sort_column': 'Address',
    'site_house_number_low': 5587,
    'site_house_number_high': '',
    'site_street_name': 'surrey'
}

with requests.Session() as s:
    r = s.post('https://wedge.hcauditor.org/execute', data=data)
    r = s.get('https://wedge.hcauditor.org/view_result/0')
    soup = bs(r.content, 'lxml')
    print(soup.select_one('.label + div').text)
You can see the details and sequence captured in the web traffic. I happened to use Fiddler here.

I'm trying to scrape a website for some list items, but beautiful soup does not find any on the page

I'm attempting to make a table where I collect all the works of each composer from this page and arrange them by assigning a "score", e.g. 1 point for 300th place, 290 points for 10th place, etc., using a Python script.
However, BeautifulSoup does not seem to find the li elements. What am I doing wrong? A screenshot of the page HTML: https://gyazo.com/73ff53fb332755300d9b7450011a7130
I have already tried using soup.li, soup.findAll("li") and soup.find_all("li"), but all return "none" or similar. Printing soup.body does return the body though, so I think I do have an HTML document.
from bs4 import BeautifulSoup as bsoup
import requests
link = "https://halloffame.classicfm.com/2019/"
response = requests.get(link)
soup = bsoup(response.text, "html.parser")
print(soup.li)
I was hoping this would give me at least one li item, but instead it returns None.
I don't see all rankings from 300 to 1. Sometimes the page shows only down to 148, other times to 146, and the lowest I have seen is 143; I don't know if this is a design flaw or a bug. The page is updated by javascript, which is why you are getting an empty list: that content hasn't been rendered.
requests only returns content that doesn't rely on javascript to render, i.e. you don't get everything you see when using a browser, which (if javascript is enabled) will load additional content as the various scripts on the page run. This is a feature of modern responsive/dynamic webpages, where you no longer have to reload an entire page when, for example, selections are made on the page.
Often you can use dev tools (F12) to inspect, via the network tab, the web traffic the page uses to update its content. With the network tab open, refresh the page and then filter on XHR.
In this case, the info is actually pulled from a script tag which already holds it. You can open the elements tab (Chrome), press Ctrl+F and search for a composer's name; you will find one match occurs in a script tag. I use a regex to find the script tag this is in by matching on the javascript var songs = [];, which is then followed by the object containing the composer info in the following regex group.
Sample from target script tag:
You can grab these from the script tag:
import requests
from bs4 import BeautifulSoup as bs
import re
soup = bs(requests.get('https://halloffame.classicfm.com/2019/').content, 'lxml')
r = re.compile(r'var songs = \[\];(.*)', re.DOTALL)
data = soup.find('script', text=r).text
script = r.findall(data)[0].strip()
rp = re.compile(r'position:\s+(\d+)')
rankings = rp.findall(script)
rt = re.compile(r'title:\s+"(.*)"')
titles = rt.findall(script)
print(len(titles))
print(len(rankings))
If you can locate the rest of these rankings, you can then zip your lists whilst reversing the rankings list:
results = list(zip(titles, rankings[::-1]))
Either way, you can use the len of the titles to generate a list of numbers in reverse that will give the rankings:
rankings = list(range(len(titles), 0, -1))
results = list(zip(titles, rankings))
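To turn those pairs into the score described in the question, here is a rough sketch (my own illustration, assuming a score of 301 minus the rank, so 300th place scores 1 point):
# assumes results is the list of (title, rank) pairs built above
scores = {title: 301 - int(rank) for title, rank in results}
# e.g. print the titles sorted by descending score
for title, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(title, score)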

How do I scrape a full instagram page in python?

Long story short, I'm trying to create an Instagram Python scraper that loads the entire page and grabs all the links to the images. I have it working; the only problem is that it only loads the initial 12 photos that Instagram shows. Is there any way I can tell requests to load the entire page?
Working code:
import json
import requests
from bs4 import BeautifulSoup
import sys
r = requests.get('https://www.instagram.com/accountName/')
soup = BeautifulSoup(r.text, 'lxml')
script = soup.find('script', text=lambda t: t.startswith('window._sharedData'))
page_json = script.text.split(' = ', 1)[1].rstrip(';')
data = json.loads(page_json)
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)
for post in data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges']:
    image_src = post['node']['display_url']
    print(image_src)
As Scratch already mentioned, Instagram uses "infinite scrolling", which won't allow you to load the entire page. But you can check the total number of posts at the top of the page (within the first span with the _fd86t class). Then you can check whether the page already contains all of the posts. Otherwise, you'll have to use a GET request to get a new JSON response. The benefit of this is that the request contains the first field, which seems to allow you to modify how many posts you get. You can modify this from its standard 12 to get all of the remaining posts (hopefully).
The necessary request looks similar to the following (where I've anonymised the actual entries, and with some help from the comments):
https://www.instagram.com/graphql/query/?query_hash=472f257a40c653c64c666ce877d59d2b&variables={"id":"XXX","first":12,"after":"XXX"}
parse_ig.py
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
from InstagramAPI import InstagramAPI
import time
c = webdriver.Chrome()
# load IG page here, whether a hashtag or a public user's page using c.get(url)
for i in range(10):
    c.send_keys(Keys.END)
    time.sleep(1)
soup = BeautifulSoup(c.page_source, 'html.parser')
ids = [a['href'].split('/') for a in soup.find_all('a') if 'tagged' in a['href']]
Once you have the ids, you can use Instagram's old API to get data for those. I'm not sure if it still works, but there was an API that I used -- which was limited by how much FB has slowly deprecated parts of the old API. Here's the link, in case you don't want to access Instagram API on your own :)
You can also add improvements to this simple code. For instance, instead of a "for" loop, you could use a "while" loop (i.e. while the page is still loading, keep pressing the END button).
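A rough sketch of that while-loop idea (my own illustration, using execute_script to scroll rather than send_keys): keep scrolling until the page height stops growing:
last_height = c.execute_script("return document.body.scrollHeight")
while True:
    c.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(1)
    new_height = c.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # the height stopped growing, so no more content is being loaded
    last_height = new_height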
#zero's answer is incomplete (at least as of 1/15/19). c.send_keys is not a valid method. Instead, this is what I did:
c = webdriver.Chrome()
c.get(some_url)
element = c.find_element_by_tag_name('body') # or whatever tag you're looking to scrape from
for i in range(10):
    element.send_keys(Keys.END)
    time.sleep(1)
soup = BeautifulSoup(c.page_source, 'html.parser')
Here is a link to a good tutorial for scraping Instagram profile info and posts that also handles pagination and works in 2022: Scraping Instagram
In summary, you have to use the Instagram GraphQL API endpoint, which requires the user identifier and the cursor from the previous page's response: https://instagram.com/graphql/query/?query_id=17888483320059182&id={user_id}&first=24&after={end_cursor}
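Below is a minimal sketch of that pagination loop, based only on the endpoint shape quoted above; the query_id, the response field names, and whether extra headers or login cookies are needed are all assumptions, since Instagram changes these details frequently:
import requests

def fetch_all_posts(user_id, query_id='17888483320059182'):
    # hypothetical pagination loop: request pages of 24 posts, following end_cursor
    posts, end_cursor, has_next = [], '', True
    while has_next:
        url = (f'https://instagram.com/graphql/query/?query_id={query_id}'
               f'&id={user_id}&first=24&after={end_cursor}')
        # assumed response shape: data -> user -> edge_owner_to_timeline_media
        media = requests.get(url).json()['data']['user']['edge_owner_to_timeline_media']
        posts += [edge['node']['display_url'] for edge in media['edges']]
        page_info = media['page_info']
        has_next, end_cursor = page_info['has_next_page'], page_info['end_cursor']
    return posts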

Go through to original URL on social media management websites

I'm doing web scraping as part of an academic project, where it's important that all links are followed through to the actual content. Annoyingly, there are some important error cases with "social media management" sites, where users post their links to detect who clicks on them.
For instance, consider this link on linkis.com, which links to http:// + bit.ly + /1P1xh9J (separated link due to SO posting restrictions), which in turn links to http://conservatives4palin.com. The issue arises as the original link at linkis.com does not automatically redirect forward. Instead, the user has to click the cross in the top right corner to go to the original URL.
Furthermore, there seems to be different variations (see e.g. linkis.com link 2, where the cross is at the bottom left of the website). These are the only two variations I've found, but there might be more. Note that I'm using a web scraper very similar to this one. The functionality to go through to the actual link does not need to be stable/functioning over time as this is a one-time academic project.
How do I automatically go on to the original URL? Would the best approach be to design a regex that finds the relevant link?
In many cases, you will have to use browser automation to scrape web pages that generate their content using javascript; scraping the HTML returned by a GET request will not yield the result you want. You have two options here:
Try to work your way around all the additional javascript requests to get the content you want, which can be very time consuming.
Use browser automation, which lets you open a real browser and automate its tasks; you can use Selenium for that.
I have been developing bots and scrapers for years now, and if the webpage you are requesting relies heavily on javascript, you should use something like Selenium.
Here is some code to get you started with selenium:
from selenium import webdriver
#Create a chrome browser instance, other drivers are also available
driver = webdriver.Chrome()
#Request a page
driver.get('http://linkis.com/conservatives4palin.com/uGXam')
#Select elements on the page and trigger events
#Selenium supports also xpath and css selectors
#Clicks the tag with the given id
driver.find_element_by_id('some_id').click()
The common architecture that the website follows is to show the target page inside an iframe. The sample code handles both of the cases you mention.
In order to get the final URL you can do something like this:
import requests
from bs4 import BeautifulSoup
urls = ["http://linkis.com/conservatives4palin.com/uGXam", "http://linkis.com/paper.li/gsoberon/jozY2"]
response_data = []
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    short_url = soup.find("iframe", {"id": "source_site"})['src']
    response_data.append(requests.get(short_url).url)
print(response_data)
Based on the two websites that you've given, I think you may try the following code to get the original URL, since it is hidden in a part of the javascript (the main scraper code I am using is from the question that you posted):
try:
    from HTMLParser import HTMLParser
except ImportError:
    from html.parser import HTMLParser

import requests, re
from contextlib import closing

CHUNKSIZE = 1024
reurl = re.compile("\"longUrl\":\"(.*?)\"")
buffer = ""
htmlp = HTMLParser()

with closing(requests.get("http://linkis.com/conservatives4palin.com/uGXam", stream=True)) as res:
    for chunk in res.iter_content(chunk_size=CHUNKSIZE, decode_unicode=True):
        buffer = "".join([buffer, chunk])
        match = reurl.search(buffer)
        if match:
            print(htmlp.unescape(match.group(1)).replace('\\', ''))
            break
say you're able to grab the href attribute/value:
s = 'href="/url/go/?url=http%3A%2F%2Fbit.ly%2F1P1xh9J"'
then you need to do the following:
import urllib.parse

s = s.partition('http')
s = s[1] + urllib.parse.unquote(s[2][0:-1])
s = urllib.parse.unquote(s)
and s will now be a string of the original bit.ly link!
try the following code:
import requests
url = 'http://'+'bit.ly'+'/1P1xh9J'
realsite = requests.get(url)
print(realsite.url)
it prints the desired output:
http://conservatives4palin.com/2015/11/robert-tracinski-the-climate-change-inquisition-begins.html?utm_source=twitterfeed&utm_medium=twitter

Get the code from inspect element using Python

In the Safari browser, I can right-click and select "Inspect Element", and a lot of code appears. Is it possible to get this code using Python? The best solution would be to get a file with the code in it.
More specifically, I am trying to find the links to the images on this page: http://500px.com/popular. I can see the links from "Inspect Element" and I would like to retrieve them with Python.
One way to get at the source code of a web page is to use the Beautiful Soup library. A tutorial of this is shown here. The code from the page is shown below; the comments are mine. This particular code no longer works as the contents of the example site have changed, but the concept should help you do what you want to do. Hope it helps.
from bs4 import BeautifulSoup
# If Python2:
# from urllib2 import urlopen
# If Python3 (urllib2 has been split into urllib.request and urllib.error):
from urllib.request import urlopen

BASE_URL = "http://www.chicagoreader.com"

def get_category_links(section_url):
    # Put the stuff you see when using Inspect Element in a variable called html.
    html = urlopen(section_url).read()
    # Parse the stuff.
    soup = BeautifulSoup(html, "lxml")
    # The next two lines will change depending on what you're looking for. This
    # line is looking for <dl class="boccat">.
    boccat = soup.find("dl", "boccat")
    # This line organizes what is found in the above line into a list of
    # hrefs (i.e. links).
    category_links = [BASE_URL + dd.a["href"] for dd in boccat.findAll("dd")]
    return category_links
EDIT 1: The solution above provides a general way to web-scrape, but I agree with the comments to the question. The API is definitely the way to go for this site. Thanks to yuvi for providing it. The API is available at https://github.com/500px/PxMagic.
EDIT 2: There is an example that addresses your question about getting links to popular photos. The Python code from the example is pasted below. You will need to install the API library.
import fhp.api.five_hundred_px as f
import fhp.helpers.authentication as authentication
from pprint import pprint
key = authentication.get_consumer_key()
secret = authentication.get_consumer_secret()
client = f.FiveHundredPx(key, secret)
results = client.get_photos(feature='popular')
i = 0
PHOTOS_NEEDED = 2
for photo in results:
    pprint(photo)
    i += 1
    if i == PHOTOS_NEEDED:
        break
