I want to extract only the latitudes from the link: "http://hdfc.com/branch-locator" using the method given below.
The latitudes are given inside a javascript variable called 'location'.
The code is:
from lxml import html
import re
URL = "http://hdfc.com/branch-locator"
var_lat = re.compile('(?<="latitude":).+(?=")')
main_page = html.parse(URL).getroot()
lat = main_page.xpath("//script[@type='text/javascript']")[1]
ans = re.search(var_lat,str(lat))
print ans
But the output comes as "None". What changes should I make to the code without changing the approach to the problem?
I think a few small changes are required. In the line
lat = main_page.xpath("//script[@type='text/javascript']")[1]  # This should be 10
The line
ans = re.search(var_lat,str(lat))
should be
ans = re.search(var_lat, lat.text)
str(lat) is going to call the __str__ method of the lat object, which is not the same as lat.text.
In general it's a good idea to go through all the script elements first and then search each one for the desired string. So this should be:
lat = main_page.xpath("//script[@type='text/javascript']")
for l in lat:
    if l.text is None:
        continue
    # print l.text
    ans = re.search(var_lat, l.text)
    if ans is not None:
        break
print ans
Note: this may not be the exact solution you want, but it should give you the first instance where the required regex matches. You might want to process ans further.
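For example, a minimal sketch of pulling the matched text out of the match object (assuming ans is the match object from the loop above; group(0) is the full matched substring):
if ans is not None:
    print ans.group(0)  # the raw latitude text matched by var_lat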
The code I have written below works for JavaScript embedded in a webpage.
from lxml import html
from json import dump
import re

dumped_data = []

class theAddress:
    latitude = ""

URL = "http://hdfc.com/branch-locator"
var_lat = re.compile('(?<="latitude":").+?(?=")')
main_page = html.parse(URL).getroot()
residue = main_page.xpath("//script[@type='text/javascript']/text()")[1]
all_latitude = re.findall(var_lat, residue)

for i in range(len(all_latitude)):
    obj = theAddress()
    obj.latitude = all_latitude[i]
    dumped_data.append(obj.__dict__)

f = open('hdfc_add.json', 'w')
dump(dumped_data, f, indent=1)
It also makes use of the json module to store the scraped data in a proper format.
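As a side note, a with block (a small tweak to the snippet above, not in the original) guarantees the JSON file is flushed and closed:
with open('hdfc_add.json', 'w') as f:
    dump(dumped_data, f, indent=1)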
from re import I
from requests import get

res = get("https://subsplease.org/api/?f=latest&tz=canada/central").json()
kek = []
for x in res:
    kek.append(x)
lnk = res[kek[0]]['downloads']
anime_name = res[kek[0]]['show']
for x in lnk:
    quality = x['res']
    links = x['magnet']
    data = f"{anime_name}:\n\n{quality}: {links}\n\n"
    print(data)
In this code, how can I prevent the anime name from repeating?
If I move it outside of the loop, only one link gets printed.
You can split your string: print the first half outside the loop and the second half inside the loop:
print(f"{anime_name}:\n\n")
for x in lnk:
quality = x['res']
links = x['magnet']
data = f"{quality}: {links}\n\n"
print(data)
Rewrote a bit; make sure you look at a 'pretty' version of the JSON response using pprint or something similar, to understand where the elements are and where you can loop (remembering to iterate through the dict):
from requests import get

data = get("https://subsplease.org/api/?f=latest&tz=canada/central").json()
for show, info in data.items():
    print(show, '\n')
    for download in info['downloads']:
        print(download['magnet'])
        print(download['res'])
    print('\n')
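For instance, a quick way to get that 'pretty' view of the same endpoint (just a sketch for inspecting the response shape):
from pprint import pprint
from requests import get

data = get("https://subsplease.org/api/?f=latest&tz=canada/central").json()
pprint(data)  # top-level keys are show names; each value holds 'show', 'downloads', etc.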
Also, you won't usually be able to just copy these links to get to the download; you usually need to use a torrent website.
I'm attempting to scrape actor/actress IDs from an IMDB movie page. I only want actors and actresses (I don't want to get any of the crew), and this question is specifically about getting the person's internal ID. I already have peoples' names, so I don't need help getting those. I'm starting with this webpage (https://www.imdb.com/title/tt0084726/fullcredits?ref_=tt_cl_sm#cast) as a hard-coded url to get the code right.
On examination of the links I was able to find that the links for the actors look like this:

/name/nm0000638/?ref_=ttfc_fc_cl_t1 (William Shatner)
/name/nm0000559/?ref_=ttfc_fc_cl_t2 (Leonard Nimoy)
/name/nm…/?ref_=ttfc_fc_cl_t… (Nicholas Guest)

while the ones for other contributors look like this:

/name/nm0583292/?ref_=ttfc_fc_dr1 (Nicholas Meyer)
/name/nm…/?ref_=ttfc_fc_wr1 (Gene Roddenberry)
This should allow me to differentiate actors/actresses from crew like the director or writer by checking whether the end of the href matches "t[0-9]+$", rather than the same pattern with "dr" or "wr".
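For example, against hrefs of the forms shown above:
>>> import re
>>> p = re.compile('t[0-9]+$')
>>> bool(p.search('/name/nm0000638/?ref_=ttfc_fc_cl_t1'))  # actor
True
>>> bool(p.search('/name/nm0583292/?ref_=ttfc_fc_dr1'))    # director
False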
Here's the code I'm running.
import urllib.request
from bs4 import BeautifulSoup
import re

movieNumber = 'tt0084726'
url = 'https://www.imdb.com/title/' + movieNumber + '/fullcredits?ref_=tt_cl_sm#cast'

def clearLists(n):
    return [[] for _ in range(n)]

def getSoupObject(urlInput):
    page = urllib.request.urlopen(urlInput).read()
    soup = BeautifulSoup(page, features="html.parser")
    return(soup)

def getPeopleForMovie(soupObject):
    listOfPeopleNames, listOfPeopleIDs, listOfMovieIDs = clearLists(3)

    #get all the tags with links in them
    link_tags = soupObject.find_all('a')

    #get the ids of people
    for linkTag in link_tags:
        link = str(linkTag.get('href'))
        #print(link)
        p = re.compile('t[0-9]+$')
        q = p.search(link)
        if link.startswith('/name/') and q != None:
            id = link[6:15]
            #print(id)
            listOfPeopleIDs.append(id)

    #return the names and IDs
    return listOfPeopleNames, listOfPeopleIDs

newSoupObject = getSoupObject(url)
pNames, pIds = getPeopleForMovie(newSoupObject)
The above code returns an empty list for the IDs, and if you uncomment the print statement you can see that it's because the value that ends up in the link variable is what's below (with variations for the specific people):
/name/nm0583292/
/name/nm0000638/
That won't do. I want the IDs only for the actors and actresses so that I can use those IDs later.
I've tried to find other answers on Stack Overflow, but I haven't been able to find this particular issue.
This question (Beautifulsoup: parsing html – get part of href) is close to what I want to do, but it gets the info from the text part between tags rather than from the href part in the tag attribute.
How can I make sure I get only the name IDs that I want (just the actor ones) from the page?
(Also, feel free to offer suggestions to tighten up the code)
It appears that the links you are trying to match have either been modified by JavaScript after loading, or perhaps get loaded differently based on variables other than the URL alone (like cookies or headers).
However, since you're only after links of people in the cast, an easier way would be to simply match the ids of people in the cast section. This is actually fairly straightforward, since they are all inside a single element, <table class="cast_list">.
So:
import urllib.request
from bs4 import BeautifulSoup
import re

# it's Python, so use Python conventions, no uppercase in function or variable names
movie_number = 'tt0084726'
# the f-string is often more readable than a + concatenation
url = f'https://www.imdb.com/title/{movie_number}/fullcredits?ref_=tt_cl_sm#cast'

# this is overly fancy for something as simple as initialising some variables
# how about: a, b, c = [], [], []
# def clearLists(n):
#     return [[] for _ in range(n)]


# in an object-oriented program, assuming something is an object is the norm
def get_soup(url_input):
    page = urllib.request.urlopen(url_input).read()
    soup = BeautifulSoup(page, features="html.parser")
    # removed needless parentheses - arguably, even `soup` is superfluous:
    # return BeautifulSoup(page, features="html.parser")
    return soup


# keep two empty lines between functions, it's standard and for good reason:
# it's easier to spot where a function starts and stops
# try using an editor or IDE that highlights your PEP8 mistakes, like PyCharm
# (that's just my opinion there, other IDEs than PyCharm will do as well)
def get_people_for_movie(soup_object):
    # removed unused variables, also 'list_of_people_ids' is needlessly verbose
    # since ids and names go together, why not return people as a list of tuples,
    # or a dictionary? I'd prefer a dictionary, as it gets rid of duplicates as well
    people = {}

    # (put a space at the start of your comment blocks!)
    # get all the anchor tags inside the `cast_list` table
    link_tags = soup_object.find('table', class_='cast_list').find_all('a')

    # the whole point of compiling the regex is to only have to do it once,
    # so do it outside the loop
    id_regex = re.compile(r'/name/nm(\d+)/')

    # get the ids and names of people
    for link_tag in link_tags:
        # the href attribute is already a string, so casting with str() serves no purpose
        href = link_tag.get('href')
        # matching and extracting part of the match can be done in one step:
        match = id_regex.search(href)
        if match:
            # don't shadow Python built-ins like `id` with variable names!
            identifier = match.group(1)
            name = link_tag.text.strip()
            # just ignore the ones with no text, they're the thumbs
            if name:
                people[identifier] = name

    # return the names and IDs
    return people


def main():
    # don't do stuff globally, it'll just cause problems when reusing names in functions
    soup = get_soup(url)
    people = get_people_for_movie(soup)
    print(people)


# not needed here, but a good habit, allows you to import stuff without running main
if __name__ == '__main__':
    main()
Result:
{'0000638': 'William Shatner', '0000559': 'Leonard Nimoy', '0001420': 'DeForest Kelley', ...}
And the code with a few more tweaks and without the commentary on your code:
import urllib.request
from bs4 import BeautifulSoup
import re


def get_soup(url_input):
    page = urllib.request.urlopen(url_input).read()
    return BeautifulSoup(page, features="html.parser")


def get_people_for_movie(soup_object):
    people = {}
    link_tags = soup_object.find('table', class_='cast_list').find_all('a')
    id_regex = re.compile(r'/name/nm(\d+)/')

    # get the ids and names of the cast
    for link_tag in link_tags:
        match = id_regex.search(link_tag.get('href'))
        if match:
            name = link_tag.text.strip()
            if name:
                people[match.group(1)] = name

    return people


def main():
    movie_number = 'tt0084726'
    url = f'https://www.imdb.com/title/{movie_number}/fullcredits?ref_=tt_cl_sm#cast'
    people = get_people_for_movie(get_soup(url))
    print(people)


if __name__ == '__main__':
    main()
I have been developing a Python web crawler to collect used car stock data from this website. (http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=20)
First of all, I would like to collect only the "BMW" listings, so I used the search function from the regular expression module, as in the code below. But it keeps returning None.
Is there anything wrong with my code?
Please give me some advice.
Thanks.
from bs4 import BeautifulSoup
import urllib.request
import re

CAR_PAGE_TEMPLATE = "http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page="

def fetch_post_list():
    for i in range(20, 21):
        URL = CAR_PAGE_TEMPLATE + str(i)
        res = urllib.request.urlopen(URL)
        html = res.read()
        soup = BeautifulSoup(html, 'html.parser')
        table = soup.find('table', class_='cyber')
        print("Page#", i)

        # 50 listings per page
        lists = table.find_all('tr', itemtype="http://schema.org/Article")
        count = 0
        r = re.compile("[BMW]")
        for lst in lists:
            if lst.find_all('td')[3].find('em').text:
                lst_price = lst.find_all('td')[3].find('em').text
                lst_title = lst.find_all('td')[1].find('a').text
                lst_link = lst.find_all('td')[1].find('a')['href']
                lst_photo_url = ''
                if lst.find_all('td')[0].find('img'):
                    lst_photo_url = lst.find_all('td')[0].find('img')['src']
                count += 1
            else:
                continue
            print('#', count, lst_title, r.search("lst_title"))
        return lst_link

fetch_post_list()
r.search("lst_title")
This is searching inside the string literal "lst_title", not the variable named lst_title, that's why it never matches.
r = re.compile("[BMW]")
The square brackets indicate that you're looking for any one of those characters, so, for example, any string containing an M will match. You just want "BMW". In fact, you don't even need regular expressions; you can just test:
"BMW" in lst_title
I am working on a linear regression model for stock ticker data, but I can't get Pylab working properly. I have successfully plotted the data, but I want to get a line of best fit for the data I have. (Not for any particular purpose, just a random set of data to use linear regression on.)
import pylab
import urllib.request
from matplotlib import pyplot as plt
from bs4 import BeautifulSoup
import requests

def chartStocks(*tickers):
    # Run loop for each ticker passed in as an argument
    for ticker in tickers:
        # Convert URL into text for parsing
        url = "http://finance.yahoo.com/q/hp?s=" + str(ticker) + "+Historical+Prices"
        sourceCode = requests.get(url)
        plainText = sourceCode.text
        soup = BeautifulSoup(plainText, "html.parser")

        # Find all links on the page
        for link in soup.findAll('a'):
            href = link.get('href')
            link = []
            for c in href[:48]:
                link.append(c)
            link = ''.join(link)

            # Find the URL for the stock ticker CSV file and convert the data to text
            if link == "http://real-chart.finance.yahoo.com/table.csv?s=":
                csv_url = href
                res = urllib.request.urlopen(csv_url)
                csv = res.read()
                csv_str = str(csv)

                # Parse the CSV to create a list of data points
                point = []
                points = []
                curDay = 0
                day = []
                commas = 0
                lines = csv_str.split("\\n")
                lineOne = True
                for line in lines:
                    commas = 0
                    if lineOne == True:
                        lineOne = False
                    else:
                        for c in line:
                            if c == ",":
                                commas += 1
                            if commas == 4:
                                point.append(c)
                            elif commas == 5:
                                for x in point:
                                    if x == ",":
                                        point.remove(x)
                                point = ''.join(point)
                                point = float(point)
                                points.append(point)
                                day.append(curDay)
                                curDay += 1
                                point = []
                                commas = 0
                points = list(reversed(points))

                # Plot the data
                pylab.scatter(day, points)
                pylab.xlabel('x')
                pylab.ylabel('y')
                pylab.title('title')
                k, b = pylab.polyfit(day, points, 1)
                yVals = k * day + b
                pylab.plot(day, yVals, c='r', linewidth=2)
                pylab.title('title')
                pylab.show()

chartStocks('AAPL')
For some reason I get an AttributeError, and I'm not sure why. Am I improperly passing data to pylab.scatter()? I'm not totally sure whether passing in a list for the x and y values is the correct approach. I haven't been able to find anyone else who has run into this issue, and scatter is definitely part of Pylab, so I'm not sure what's going on.
I think there is a version clash. Try:
plt.scatter(day, points)
When you use pylab it imports some other packages under the hood: import pylab pulls in numpy, so pylab.polyfit is really numpy's polyfit (usually seen with the np prefix, as np.polyfit). As this question shows, I think it's clearer to readers of the code if you just import numpy directly to do this.
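For example, a minimal sketch of the fit with numpy imported directly (the day and points values here are hypothetical stand-ins for the parsed CSV data):
import numpy as np
import matplotlib.pyplot as plt

day = [0, 1, 2, 3, 4]                    # hypothetical x values
points = [10.0, 10.5, 10.2, 11.1, 11.4]  # hypothetical closing prices

k, b = np.polyfit(day, points, 1)  # slope and intercept of the least-squares line
y_vals = k * np.asarray(day) + b   # asarray so the multiplication broadcasts
plt.scatter(day, points)
plt.plot(day, y_vals, c='r', linewidth=2)
plt.show()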
First, I'm pretty new to Python. I'm trying to scrape contact information from offline websites and output the info to a CSV. I'd like to grab the page URL (not sure how to do this from the HTML), email, phone, any names, location data if possible, and the tag line of the HTML site if it exists.
Updated #2 code:
import os, csv, re
from bs4 import BeautifulSoup

topdir = 'C:\\projects\\training\\html'
output = csv.writer(open("scrape.csv", "wb+"))
output.writerow(["headline", "name", "email", "phone", "location", "url"])
all_contacts = []

for root, dirs, files in os.walk(topdir):
    for f in files:
        if f.lower().endswith((".html", ".htm")):
            soup = BeautifulSoup(f)

def mailto_link(soup):
    if soup.name != 'a':
        return None
    for key, value in soup.attrs:
        if key == 'href':
            m = re.search('mailto:(.*)', value)
            if m:
                all_contacts.append(m)
                return m.group(1)
    return None

for ul in soup.findAll('ul'):
    contact = []
    for li in soup.findAll('li'):
        s = li.find('span')
        if not (s and s.string):
            continue
        if s.string == 'Email:':
            a = li.find(mailto_link)
            if a:
                contact['email'] = mailto_link(a)
        elif s.string == 'Website:':
            a = li.find('a')
            if a:
                contact['website'] = a['href']
        elif s.string == 'Phone:':
            contact['phone'] = unicode(s.nextSibling).strip()
    all_contacts.append(contact)

output.writerow([all_contacts])
print "Finished"
This currently doesn't output anything other than the row headers. What am I missing here? It should at least be returning some info from the HTML file, which is this page: http://bendoeslife.tumblr.com/about
There are (at least) two problems here.
First, f is a filename, not the file contents, or the Soup made from those contents. So, f.find('h2') is going to find 'h2' within the filename, which isn't very useful.
Second, most find methods (including str.find, which is what you're calling) return an index, not a substring. Calling str on that index is just going to give you the string version of a number. For example:
>>> s = 'A string with an h2 in it'
>>> i = s.find('h2')
>>> str(i)
'17'
So, your code is doing something like this:
>>> f = 'C:\\python\\training\\offline\\somehtml.html'
>>> headline = f.find('h2')
>>> str(headline)
'-1'
You probably want to call methods on the soup object, rather than f. BeautifulSoup.find returns a "sub-tree" of the soup, which is exactly what you want to stringify here.
However, it's impossible to test that without your sample input, so I can't promise that's the only problem in your code.
Meanwhile, when you get stuck with something like this, you should try printing out intermediate values. Print out f, and headline, and headline2, and it will be much more obvious why headline3 is wrong.
After just replacing the f with soup in the find calls and fixing your indentation error, running against your sample page http://bendoeslife.tumblr.com/about now works.
It doesn't do anything all that useful, however. Since there's no h2 tag anywhere in the file, headline ends up as None. And the same goes for most of the other fields. The only thing that does find anything is url, because you're asking it to find an empty string, which will find something arbitrary. With three different parsers, I get <p>about</p> or <html><body><p>about</p></body></html>, and <html><body></body></html>…
You need to actually understand the structure of the file you're trying to parse before you can do anything useful with it. In this case, for example, there is an email address, but it's in an <a> element with a title of "Email", within an <li> element with an id of "email". So you need to write a find that locates it based on one of those criteria, or something else it actually matches.
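For instance, a sketch based on that description (the id and title values come from the description above; verify them against the actual HTML):
li = soup.find('li', id='email')
if li is not None:
    a = li.find('a', title='Email')
    if a is not None and a.get('href', '').startswith('mailto:'):
        print a['href'][len('mailto:'):]  # the address after the mailto: prefix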