Cannot decode or read website URL for counting a string - python

I am trying to perform a search and count of data in a website using the code below. You can see I have added a few extra prints for debugging. Currently the result is always "0", which suggests to me there is an error in reading the file of some sort. If I print the variable called html, I can clearly see that all three strings I am searching for are contained in the HTML, yet as previously mentioned none of my prints print anything, and the final print of the count simply returns "0". As you can see, I have tried three different methods, with the same problem each time.
import urllib2
import urllib
import re
import json
import mechanize
post_url = "url_of_fishermans_finds"
browser = mechanize.Browser()
browser.set_handle_robots(False)
browser.addheaders = [('User-agent', 'Firefox')]
html = browser.open(post_url).read().decode('UTF-8')
# Attempted method 1
print html.count("SEA BASS")
# Attempted method 2
count = 0
enabled = False
for line in html:
    if 'MAIN FISHERMAN' in line:
        print "found main fisherman"
        enabled = True
    elif 'SEA BASS' in line:
        print "found fish"
        count += 1
    elif 'SECONDARY FISHERMAN' in line:
        print "found secondary fisherman"
        enabled = False
print count
# Attempted method 3
relevant = re.search(r"MAIN FISHERMAN(.*)SECONDARY FISHERMAN", html)[1]
found = relevant.count("SEA BASS")
print found
It is probably something really simple; any comments or help would be greatly appreciated. Kind regards, AEA

Regarding your regular expressions method #3, it appears you aren't grouping your search result prior to running count. I don't have the HTML you're looking at, but you may also be running into trouble with your use of '.' if there are newlines between your two search terms. With these issues in mind, try something like the following to correct them (note: Python 3 syntax):
relevantcompile = re.compile("MAIN FISHERMAN(.*)SECONDARY FISHERMAN", re.DOTALL)
relevantsearch = re.search(relevantcompile, html)
relevantgrouped = relevantsearch.group()
relevantcount = relevantgrouped.count("SEA BASS")
print(relevantcount)
Also, keep in mind the comments above regarding the case sensitivity of regular expression searches. Hope this helps :)
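As a side note on method #2: iterating over a string in Python yields individual characters, not lines, so a multi-character substring like 'SEA BASS' can never be found in a one-character "line". A minimal sketch of the same flag-based count over actual lines (assuming the marker strings really appear on separate lines of the HTML) would be:
count = 0
enabled = False
for line in html.splitlines():  # split the page into lines first
    if 'MAIN FISHERMAN' in line:
        enabled = True
    elif 'SECONDARY FISHERMAN' in line:
        enabled = False
    elif enabled and 'SEA BASS' in line:
        count += 1
print count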

Related

Preserving formatting (\t) in scraped text - Python Selenium

I have a program that takes the text from a website using the following code:
from selenium import webdriver
driver = webdriver.Chrome(executable_path=r"\chromedriver.exe")
def get_raw_input(link_input, website_input, driver):
    driver.get(f'{website_input}')
    try:
        here_button = driver.find_element_by_xpath('/html/body/div[2]/h3/a')
        here_button.click()
        raw_data = driver.find_element_by_xpath('/html/body/pre').text
    except:
        move_on = False
        while move_on == False:
            try:
                raw_data = driver.find_element_by_class_name('output').text
                move_on = True
            except:
                pass
    driver.close()
    return raw_data
The section of text it is targeting is formatted like so:
englishword tab frenchword
however, the return I get is in this format:
englishword space frenchword
The English part of the text could be a phrase with spaces in it, so I cannot simply .split(" ") since that may split the phrase as well.
My end goal is to keep the formatting using tab instead of space so I can .split("\t") to make things easier for later manipulation.
Any help would be greatly appreciated :)
Selenium returns element text the way the browser renders it, so it typically "normalizes" whitespace (all inner runs of whitespace collapse into a single space).
You can see some discussion here. The solution suggested by the Selenium developers to get the text with its original spacing is to query the textContent property of the element.
Here is the example:
raw_data = driver.find_element_by_class_name('output').get_property('textContent')
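From there, the tab-based splitting the question is after should work on the raw text; a minimal sketch (assuming each row of the element really is an English word or phrase, then a tab, then the French word):
# raw_data as obtained above via get_property('textContent')
pairs = []
for line in raw_data.splitlines():
    if '\t' in line:
        # split only on the first tab so phrases containing spaces stay intact
        english, french = line.split('\t', 1)
        pairs.append((english.strip(), french.strip()))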

output more than limited results from a form request

I have the following script that posts a search term into a form and retrieves the results:
import mechanize
url = "http://www.taliesin-arlein.net/names/search.php"
br = mechanize.Browser()
br.set_handle_robots(False) # ignore robots
br.open(url)
br.select_form(name="form")
br["search_surname"] = "*"
res = br.submit()
content = res.read()
with open("surnames.txt", "w") as f:
    f.write(content)
However, the rendered web page, and hence the script here, limits the search to 250 results. Is there any way I can bypass this limit and retrieve all results?
Thank you
You could simply iterate over possible prefixes to get around the limit. There are 270,000 names and a limit of 250 results per query, so you need to make at least 1080 requests. There are 26 letters in the alphabet, so if we assume an even distribution we would need prefixes a little over 2 letters long (log(1080)/log(26) ≈ 2.1); however, the distribution is unlikely to be that even (how many people have surnames starting with ZZ, after all?).
To get around this we use a modified depth first search like so:
import string
import time
import mechanize
def checkPrefix(prefix):
    # Return a list of names with this prefix.
    url = "http://www.taliesin-arlein.net/names/search.php"
    br = mechanize.Browser()
    br.open(url)
    br.select_form(name="form")
    br["search_surname"] = prefix + '*'
    res = br.submit()
    content = res.read()
    return extractSurnames(content)

def extractSurnames(pageText):
    # Write a function to extract the surnames from the HTML here.
    pass

Q = [x for x in string.ascii_lowercase]
listOfSurnames = []
while Q:
    curPrefix = Q.pop()
    print curPrefix
    curSurnames = checkPrefix(curPrefix)
    if len(curSurnames) < 250:
        # Store the surnames; could also write them to a file.
        listOfSurnames += curSurnames
    else:
        # We clearly didn't get all of the names; need to subdivide more.
        Q += [curPrefix + x for x in string.ascii_lowercase]
    time.sleep(5)  # Sleep here to avoid overloading the server for other people.
Thus we query more deeply in places where there are too many results to be displayed, but we do not query ZZZZ if there are fewer than 250 surnames that start with ZZZ (or shorter). Without knowing how skewed the name distribution is, it is hard to estimate how long this will take, but the 5-second sleep multiplied by 1080 requests is about 1.5 hours, so you are probably looking at at least half a day, if not longer.
Note: this could be made more efficient by declaring the browser globally; however, whether that is appropriate depends on where this code will be placed.
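The extractSurnames stub is left empty above because it depends on the page's markup, which I haven't inspected; purely as an illustrative sketch (the <td> assumption is hypothetical, not the site's actual structure), it might look like:
import re

def extractSurnames(pageText):
    # Hypothetical: assumes each surname is rendered in its own <td> cell.
    # Adjust the pattern to the real markup of the results page.
    return [name.strip() for name in re.findall(r'<td[^>]*>([^<]+)</td>', pageText)]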

ruby fetching url content is always empty

I am getting frustrated trying to fetch the content of a specific URL with Ruby.
I've tried many different ways, like open-uri and a standard request; none has worked so far. I always get empty HTML. I also tried to use Python to fetch the same URL, which always returned the correct HTML content. I am really not sure why... Please help, as I am a newbie to both Ruby and Python. I want to use Ruby (I prefer the tidy syntax and human-friendly function names, and it's easier to install libs using gem and homebrew (on Mac) than with Python's easy_install), but I am now considering Python because it just works (though I'm still trying to get my head around the 2.x and 3.x issue). I may be doing something really stupid, but I think that is very unlikely.
ruby 1.9.2p136 (2010-12-25 revision 30365) [i386-darwin10.6.0]
Implementation 1:
url = URI.parse('http//:www.stackoverflow.com/') req = Net::HTTP::Get.new(url.path)
res = Net::HTTP.start(url.host, url.port) {|http| http.request(req) }
puts res.body #empty
Implementation 2:
doc = Nokogiri::HTML(open("http//:www.stackoverflow.com/", "User-Agent" => "Safari"))
#empty
#I tried to use without user agent, without Nokogiri none worked.
The Python implementation, which worked perfectly every time:
f = urllib.urlopen("http//:www.stackoverflow.com/")
# Read from the object, storing the page's contents in 's'.
s = f.read()
f.close()
print s
If that is your exact code, it is invalid for several reasons.
http//: should be http://
The URL needs a path: if you want the root page of example.com it needs to be http://example.com/; the trailing slash is significant.
If you put 2 lines of code on one line, you need to use ; to denote the end of the first one.
So:
require 'net/http'
url = URI.parse('http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia')
req = Net::HTTP::Get.new(url.path)
res = Net::HTTP.start(url.host, url.port) {|http| http.request(req) }
puts res.body
The same is true when using open with Nokogiri.
EDIT: that site is returning bad results many times:
counter = 0
20.times do
  url = URI.parse('http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia')
  req = Net::HTTP::Get.new(url.path)
  res = Net::HTTP.start(url.host, url.port) {|http| http.request(req) }
  sleep 1
  counter += 1 unless res.body.empty?
end
puts counter
For me this returned a non-empty body only once. If you substitute in another site, it works every time.
curl "http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia"
Yields the same inconsistent results.
Two examples with open-uri (standard lib), a wrapper for (among others) the rather cumbersome Net::HTTP:
require 'open-uri'
open("http://www.stackoverflow.com/"){|f| puts f.read}
puts URI::parse("http://www.google.com/").read

Always return proper URL no matter what the user enters?

I have the following python code
from urlparse import urlparse
def clean_url(url):
    new_url = urlparse(url)
    if new_url.netloc == '':
        return new_url.path.strip().decode()
    else:
        return new_url.netloc.strip().decode()
print clean_url("http://www.facebook.com/john.doe")
print clean_url("http://facebook.com/john.doe")
print clean_url("facebook.com/john.doe")
print clean_url("www.facebook.com/john.doe")
print clean_url("john.doe")
In each example I take in a string and return it. This is not what I want. I am trying to take each example and always return "http://www.facebook.com/john.doe" even if they just type www.* or just john.doe.
I am fairly new to programming so please be gentle.
I know this answer is a little late to the party, but if this is exactly what you're trying to do, I recommend a slightly different approach. Rather than reinventing the wheel for canonicalizing facebook urls, consider using the work that Google has already done for use with their Social Graph API.
They've already implemented patterns for a number of similar sites, including facebook. More information on that is here:
http://code.google.com/p/google-sgnodemapper/
import urlparse
p = urlparse.urlsplit("john.doe")
# => ('', '', 'john.doe', '', '')
To build the canonical URL, the first element of the tuple (the scheme) should be "http", the second (the netloc) should be "www.facebook.com", the third (the path) should be "/john.doe", and you can leave the fourth and fifth elements alone. You can then reassemble your URL with urlparse.urlunsplit after processing it.
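A minimal sketch of that reassembly with urlsplit/urlunsplit (assuming the input is always either a facebook.com profile URL or a bare username, as in the question's examples):
import urlparse

def canonical_facebook_url(url):
    # For bare inputs like "john.doe" or "facebook.com/john.doe",
    # urlsplit puts everything in the path slot; take the last path segment.
    parts = urlparse.urlsplit(url)
    username = (parts.path or parts.netloc).rstrip('/').split('/')[-1]
    # Rebuild with a fixed scheme and netloc.
    return urlparse.urlunsplit(('http', 'www.facebook.com', '/' + username, '', ''))

print canonical_facebook_url("facebook.com/john.doe")  # http://www.facebook.com/john.doe
print canonical_facebook_url("john.doe")               # http://www.facebook.com/john.doe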
Just an FYI: to ensure a safe URL segment for 'john.doe' (this may not apply to Facebook, but it's a good rule to know), use urllib.quote(string) to properly escape whitespace, etc.
I am not sure I understood exactly what you asked, but you can try this code; I tested it and it works fine, but if you have trouble with it let me know. I hope it helps.
#!/usr/bin/env python
import urlparse

def clean_url(url):
    url_list = []
    # split the values into a tuple
    url_tuple = urlparse.urlsplit(url)
    # tuples are immutable, so copy this into a list
    # so we can change the values that we need
    counter = 0
    for element in url_tuple:
        url_list.append(element)
    # validate each element individually
    url_list[0] = 'http'
    url_list[1] = 'www.facebook.com'
    # get the user name from the original url
    # ** I understood the user is the only value
    # we can count on being in the url, right?
    user = url.split('/')
    if len(user) == 1:
        # the user was the only value sent
        url_list[2] = user[0]
    else:
        # get the last element of the list
        url_list[2] = user[len(user)-1]
    # convert the list back into a tuple and
    # join all the elements together into the url again
    new_url = urlparse.urlunsplit(tuple(url_list))
    return new_url

if __name__ == '__main__':
    print clean_url("http://www.facebook.com/john.doe")
    print clean_url("http://facebook.com/john.doe")
    print clean_url("facebook.com/john.doe")
    print clean_url("www.facebook.com/john.doe")
    print clean_url("john.doe")

Slicing URL with Python

I am working with a huge list of URLs. Just a quick question: I am trying to slice a part of the URL out, see below:
http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3
How could I slice out:
http://www.domainname.com/page?CONTENT_ITEM_ID=1234
Sometimes there are more than two parameters after the CONTENT_ITEM_ID, and the ID is different each time. I am thinking it can be done by finding the first & and then slicing out the chars before that &, but I'm not quite sure how to do this.
Cheers
Use the urlparse module. Check this function:
import urlparse
def process_url(url, keep_params=('CONTENT_ITEM_ID=',)):
    parsed = urlparse.urlsplit(url)
    filtered_query = '&'.join(
        qry_item
        for qry_item in parsed.query.split('&')
        if qry_item.startswith(keep_params))
    return urlparse.urlunsplit(parsed[:3] + (filtered_query,) + parsed[4:])
In your example:
>>> process_url('http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3')
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
This function has the added bonus that it's easier to use if you decide that you also want some more query parameters, or if the order of the parameters is not fixed, as in:
>>> url='http://www.domainname.com/page?other_value=xx&param3&CONTENT_ITEM_ID=1234&param1'
>>> process_url(url, ('CONTENT_ITEM_ID', 'other_value'))
'http://www.domainname.com/page?other_value=xx&CONTENT_ITEM_ID=1234'
The quick and dirty solution is this:
>>> "http://something.com/page?CONTENT_ITEM_ID=1234&param3".split("&")[0]
'http://something.com/page?CONTENT_ITEM_ID=1234'
Another option would be to use the split function, with & as a parameter. That way, you'd extract both the base URL and the remaining parameters.
url.split("&")
returns a list with
['http://www.domainname.com/page?CONTENT_ITEM_ID=1234', 'param2', 'param3']
I figured it out; below is what I needed to do:
url = "http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3"
url = url[: url.find("&")]
print url
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
Parsing URLs is never as simple as it seems to be; that's why there are the urlparse and urllib modules.
E.g.:
import urllib
url ="http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3"
query = urllib.splitquery(url)
result = "?".join((query[0], query[1].split("&")[0]))
print result
'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
This is still not 100% reliable, but much more so than splitting it yourself, because there are a lot of valid URL formats that you and I don't know about and only discover one day in error logs.
import re
url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
m = re.search('(.*?)&', url)
print m.group(1)
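One caveat worth adding here (my observation, not part of the original answer): re.search returns None when the URL has no '&' at all, so m.group(1) would raise an AttributeError. A variant that also accepts end-of-string avoids that:
import re

url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234'
# match up to the first '&', or to the end of the string if there is none
m = re.search(r'(.*?)(?:&|$)', url)
print m.group(1)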
Look at the urllib2 file name question for some discussion of this topic.
Also see the "Python Find Question" question.
This method isn't dependent on the position of the parameter within the url string. This could be refined, I'm sure, but it gets the point across.
url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234&param2&param3'
parts = url.split('?')
id = dict(i.split('=') for i in parts[1].split('&'))['CONTENT_ITEM_ID']
new_url = parts[0] + '?CONTENT_ITEM_ID=' + id
An ancient question, but still, I'd like to remark that query string parameters can also be separated by ';', not only '&'.
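A minimal sketch that tolerates both separators (just an illustration; the semicolon-separated URL below is made up):
import re

url = 'http://www.domainname.com/page?CONTENT_ITEM_ID=1234;param2&param3'
base, query = url.split('?', 1)
# keep only the CONTENT_ITEM_ID pair, whichever separator the site used
kept = [p for p in re.split(r'[&;]', query) if p.startswith('CONTENT_ITEM_ID=')]
print base + '?' + '&'.join(kept)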
Besides urlparse there is also furl, which IMHO has a better API.
