Python: how to match a URL string from HTML content

I'm trying to pull out a URL from a function call found in an HTML element.
Content:
content = 'memoizeFetch("/m/api/v3/classified/1005/221965:1jgntW:T-qH-lYVI3p2dhoiyqFPD1ehlr8/listing-profile/")'
My code:
import re

content = 'memoizeFetch("/m/api/v3/classified/1005/221965:1jgntW:T-qH-lYVI3p2dhoiyqFPD1ehlr8/listing-profile/")'
match = re.search(r'memoizeFetch("(.*?)"', content).group(0)
print(match)
It doesn't work. I need to get the following string out of that function call:
"/m/api/v3/classified/1005/221965:1jgntW:T-qH-lYVI3p2dhoiyqFPD1ehlr8/listing-profile/"
How can I do that?

You need to escape the opening parenthesis in the pattern and select group(1).
Change your code to:
import re

content = 'memoizeFetch("/m/api/v3/classified/1005/221965:1jgntW:T-qH-lYVI3p2dhoiyqFPD1ehlr8/listing-profile/")'
match = re.search(r'memoizeFetch\("(.*?)"', content).group(1)
print(match)
Output:
/m/api/v3/classified/1005/221965:1jgntW:T-qH-lYVI3p2dhoiyqFPD1ehlr8/listing-profile/
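Note that group(0) returns the entire match, including the memoizeFetch(" prefix, while group(1) returns only what the capturing group (.*?) matched. A small sketch showing the difference on the same content string (the closing \) in the pattern is added here only so the full match reads as the complete call):
import re

content = 'memoizeFetch("/m/api/v3/classified/1005/221965:1jgntW:T-qH-lYVI3p2dhoiyqFPD1ehlr8/listing-profile/")'
m = re.search(r'memoizeFetch\("(.*?)"\)', content)
print(m.group(0))  # whole match, including the function name and quotes
print(m.group(1))  # just the captured URL path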

Related

Unable to get regex in python to match pattern

I'm trying to pull out a number from a copy of an HTML page which I got using urllib.request.
I've tried a few different regex patterns but keep getting None as the output, so I'm clearly not writing the pattern correctly and can't get it to work.
Below is a small part of the HTML I have in the string:
</ul>\n \n <p>* * * * *</p>\n -->\n \n <b>DistroWatch database summary</b><br/>\n <ul>\n <li>Number of all distributions in the database: 926<br/>\n <li>Number of <a href="search.php?status=Active">
I'm trying to get just the 926 out of the string. My code is below and I can't figure out what I'm doing wrong:
import urllib.request
import re
page = urllib.request.urlopen('http://distrowatch.com/weekly.php?issue=current')
#print(page.read())
print(page.read())
pageString = str(page.read())
#print(pageString)
DistroCount = re.search('^all distributions</a> in the database: ....<br/>\n$', pageString)
print(DistroCount)
Any help, pointers or resource suggestions would be much appreciated.
You can use BeautifulSoup to convert HTML to text, and then apply a simple regex to extract a number after a hardcoded string:
import urllib.request, re
from bs4 import BeautifulSoup
page = urllib.request.urlopen('http://distrowatch.com/weekly.php?issue=current')
html = page.read()
soup = BeautifulSoup(html, 'lxml')
text = soup.get_text()
m = re.search(r'all distributions in the database:\s*(\d+)', text)
if m:
    print(m.group(1))
# => 926
Here,
soup.get_text() converts HTML to plain text and keeps it in the text variable
The all distributions in the database:\s*(\d+) regex matches the literal text all distributions in the database:, then zero or more whitespace chars, and then captures one or more digits into Group 1 (with (\d+))
I think your problem is that you are reading the whole document into a single string, but you use "^" at the beginning of your regex and "$" at the end, so the pattern can only match if it spans from the very start to the very end of that string.
Either drop ^ and $ (and \n as well…), or process your document line by line.
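For example, a minimal sketch of that suggestion applied to the raw HTML string, simply dropping the anchors and capturing the digits (the exact markup on the live page may differ, so the pattern is an assumption based on the snippet and regex shown above):
import urllib.request, re

page = urllib.request.urlopen('http://distrowatch.com/weekly.php?issue=current')
pageString = page.read().decode('utf-8', errors='replace')

# No ^/$ anchors, and the digits are captured instead of hard-coding '....'
m = re.search(r'all distributions</a> in the database: (\d+)<br/>', pageString)
if m:
    print(m.group(1))  # e.g. 926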

Compare string result from xpath & requests

I am scraping the HTML code from the URL defined, mainly focusing on the <script> tags, to extract their contents. Then I want to check whether the string "example" exists in the script; if yes, print something and set a flag.
I am not able to compare against the results extracted with html.fromstring.
I am able to scrape the HTML content and view the full script content successfully; I wanted to proceed further but am not able to (compare the strings).
import requests
from lxml import html
page = requests.get("http://econpy.pythonanywhere.com/ex/001.html")
tree = html.fromstring(page.text) #was page.content
# To get all the content in <script> of the webpage
scripts = tree.xpath('//script/text()')
# To get line of script that contains the string "location" (text)
keyword = tree.xpath('//script/text()[contains(., "location")]')
# To get the element ID of the script that contains the string "location"
keywordElement = tree.xpath('//script[contains(., "location")]')
print('\n<SCRIPT> is :\n', scripts)
# To print the Element ID
print('\nKEYWORD script is discovered # ', keywordElement)
# To print the line of script that contain "location" in text form
print('Supporting lines... \n\n',keyword)
# ******************************************************
# code below is where the string comparison comes in
# to compare the "keyword" and display output to user
# ******************************************************
string = "location"
if string in keyword:
    print('\nDANGER: Keyword detected in URL entered')
    Flag = "Detected"  # For DB usage
else:
    print('\nSAFE: Keyword does not exist in URL entered')
    Flag = "Safe"  # For DB usage
# END OF PROGRAM
Actual result: able to retrieve all the necessary information, including the elements and their content.
Expected result: to print the DANGER / SAFE message to the user and set the variable "Flag", which will then be stored in the database.
keyword is a list.
You need to index the list to get a string, after which you will be able to search for the specific substring:
"location" in keyword[0]  # gives True

Python adding a string to a match list with multiple items

The code I am working on retrieves a list from an HTML page with two fields, URL and title.
The URLs are relative (they start with /URL...), and I need to prepend "http://website.com" to every value returned from re.findall.
The code so far is this:
soup = bs(html)
tag = soup.find('div', {'class': 'item'})
reg=re.compile('<a href="(.+?)" rel=".+?" title="(.+?)"')
links=re.findall(reg,str(tag))
# (append "http://website.com" to the href "(.+?)" field)
return links
Try:
for link in tag.find_all('a'):
    link['href'] = 'http://website.com' + link['href']
Then use one of these output methods:
return str(soup) gets you the document after the changes are applied.
return tag.find_all('a') gets you all the link elements.
return [str(i) for i in tag.find_all('a')] gets you all the link elements converted to strings.
Now, don't try to parse HTML with regex while you already have an HTML parser doing the work.
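Putting it together, a minimal sketch of the parser-only approach (the base URL, class name and html source string are placeholders carried over from the question, and urllib.parse.urljoin is used here as an assumption to handle the joining):
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = 'http://website.com'
soup = BeautifulSoup(html, 'html.parser')  # html is the page source string
tag = soup.find('div', {'class': 'item'})

links = []
for a in tag.find_all('a', href=True, title=True):
    # urljoin handles both relative and already-absolute hrefs
    links.append((urljoin(base, a['href']), a['title']))
print(links)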

BeautifulSoup python to parse html files

I am using BeautifulSoup to replace all the commas in an HTML file with ‚. Here is my code for that:
import re
import sys
from bs4 import BeautifulSoup

f = open(sys.argv[1], "r")
data = f.read()
soup = BeautifulSoup(data)
comma = re.compile(',')
for t in soup.findAll(text=comma):
    t.replaceWith(t.replace(',', '‚'))
This code works except when there is some JavaScript included in the HTML file. In that case it also replaces the commas within the JavaScript code, which is not wanted. I only want to replace commas in the text content of the HTML file.
soup.findAll can take a callable as its text filter, so you can skip text nodes whose parent is a script or style tag:
tags_to_skip = set(["script", "style"])
# Add to this set as needed

def valid_text(t):
    """Filter text nodes on the basis of their parent tag's name.

    If the parent tag name is found in ``tags_to_skip`` then
    the text node is dropped. Otherwise, it is kept.
    """
    return comma.search(t) is not None and t.parent.name.lower() not in tags_to_skip

for t in soup.findAll(text=valid_text):
    t.replaceWith(t.replace(',', '‚'))
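After the loop runs you can write the modified markup back to disk; a minimal sketch, with the output file name chosen here purely as an example:
with open("output.html", "w", encoding="utf-8") as out:
    out.write(str(soup))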

Python regex to match HTML

I am trying to fetch a web page via urllib2 and extract part of it with a regex.
Here is my code:
def Get(url):
    request = urllib2.Request(url)
    page = urlOpener.open(request)
    return page.read()
page = Get(myurl)
#page = "<html>.....</html>" #local string for test
pattern = re.compile(r'^\s*(<tr>$\s*<td height="25.*?</tr>)$', re.M | re.I | re.DOTALL)
for task in pattern.findall(taskListPage):
If I use a local string (the same as Get(myurl)'s result) for testing, the pattern works, but if I use Get(myurl) directly, the pattern does not match.
I will be grateful if someone can tell me why.
Valid reservations about using regex on html aside, try this regex instead:
(<tr>\s*<td height="25.*?</tr>)
You were only finding matches anchored to the end of a line with $, and the anchoring terms at the front of the regex were causing problems.
This match is brittle, though: let's hope the web guy doesn't change the height of the rows...
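A minimal sketch of the simplified pattern in use, reusing the Get helper and the variable names from the question's code:
import re

page = Get(myurl)  # the HTML string fetched above
pattern = re.compile(r'(<tr>\s*<td height="25.*?</tr>)', re.I | re.DOTALL)
for task in pattern.findall(page):
    print(task)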
