Searching a website - python

import urllib
import re

search = raw_input('[!]Search: ')
site = "http://www.exploit-db.com/list.php?description=" + search + "&author=&platform=&type=&port=&osvdb=&cve="
print site
source = urllib.urlopen(site).read()
founds = re.findall("href='/exploits/\d+", source)
print "\n[+]Search", len(founds), "Results\n"
if len(founds) >= 1:
    for found in founds:
        found = found.replace("href='", "")
        print "http://www.exploit-db.com" + found
else:
    print "\nCouldn't find anything with your search\n"
When I search the exploit-db.com site I only come up with 25 results. How can I make it go to the next pages, or get past 25 results?

Easy to check by just visiting the site and looking at the URLs as you page manually: put page=1& right after the ? in the URL to see the second page of results, page=2& for the third page, and so forth.
How is this a Python question? It's a (very elementary!) "screen scraping" question.

Apparently the exploit-db.com site doesn't allow extending the page size. You therefore need to "manually" page through the result list by repeating the urllib.urlopen() call to get subsequent pages. The URL is the same as the one initially used, plus an &page=n parameter. Note that this n value appears to be 0-based (i.e. &page=1 gives the second page).
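A minimal sketch of that paging loop, building on the question's code; it assumes that a page with no matches means you have run past the last page (that stop condition is an assumption, not something the site documents):

import urllib
import re

search = raw_input('[!]Search: ')
base = "http://www.exploit-db.com/list.php?description=" + search + "&author=&platform=&type=&port=&osvdb=&cve="

results = []
n = 0
while True:
    source = urllib.urlopen(base + "&page=%d" % n).read()
    founds = re.findall("href='/exploits/\d+", source)
    if not founds:
        break  # assumption: an empty result page means we are past the end
    results.extend(founds)
    n += 1  # page numbers are 0-based, so page=1 is the second page

print "\n[+]Found", len(results), "results\n"
for found in results:
    print "http://www.exploit-db.com" + found.replace("href='", "")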

Related

Figuring out page url in Selenium (javascript:void(0);)

I have a problem with this particular website (link to website).
I'm trying to create a script that can go through all the entries, but with the condition that it has "memory", so it can continue from the page it was last on. That means I need to know the current page number AND a direct url to that page.
Here is what I have so far:
current_page_el = driver.find_element_by_xpath("//ul[contains(@class, 'pagination')]/li[@class='disabled']/a")
current_page = int(current_page_el.text)
current_page_url = current_page_el.get_attribute("href")
That code will result with
current_page_url = 'javascript:void(0);'
Is there a way to get the current url from sites like this? Also, when you click through to the next page, the link just remains the same as what I posted at the beginning.
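Since the pagination links are javascript:void(0);, there may simply be no per-page URL to save. One possible workaround, sketched below, is to persist the page number (which the question's XPath already retrieves) and click back through the pagination on restart. The state-file name, the placeholder URL, and the page-number link locator are all assumptions about how such a site is laid out:

import os
from selenium import webdriver

STATE_FILE = "last_page.txt"  # hypothetical file for remembering progress

def get_current_page(driver):
    # Same locator as in the question: the disabled pagination item is the current page
    el = driver.find_element_by_xpath(
        "//ul[contains(@class, 'pagination')]/li[@class='disabled']/a")
    return int(el.text)

driver = webdriver.Firefox()
driver.get("http://example.com/entries")  # placeholder for the site in question

# On restart, click forward until we reach the page saved last time
if os.path.exists(STATE_FILE):
    saved = int(open(STATE_FILE).read())
    while get_current_page(driver) < saved:
        # Assumes the pagination renders clickable page-number links
        driver.find_element_by_link_text(str(get_current_page(driver) + 1)).click()

# ...process the entries on the current page, then record progress:
with open(STATE_FILE, "w") as f:
    f.write(str(get_current_page(driver)))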

Python Requests Module Doesn't Fetch All The Elements It Ought To

This piece of code works properly
from lxml import html
import requests

page = requests.get(c)  # c is the page URL, a string defined elsewhere
tree = html.fromstring(page.content)
link = tree.xpath('//script/text()')
But it doesn't fetch the whole content; it's as if some of it were hidden.
I can see this is the case because the next thing I do is
print len(link)
and it returns nine (9).
I then go to the page, which is the string c above in the code. I view the source in Mozilla (view-source:), hit Ctrl+F, and search for <script with a space at the end.
That returns thirty-three (33) matches. The one I want cannot be fetched.
What's happening? I can't understand it. Am I blocked or something? How can I bypass this and make the requests module see what Mozilla is seeing?
If you try
tree.xpath('//script')
I expect you will get 33 matches.
On your page, only nine script elements contain something between the opening and closing tags, and //script/text() matches only those text nodes.
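A small self-contained illustration of the difference (a constructed example, not the asker's page):

from lxml import html

doc = html.fromstring(
    '<html><head>'
    '<script src="app.js"></script>'  # external script: empty body, no text node
    '<script>var x = 1;</script>'     # inline script: has a text node
    '</head></html>')

print len(doc.xpath('//script'))         # 2, every script element
print len(doc.xpath('//script/text()'))  # 1, only scripts with text inside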

How can I use urllib2, regex, and sys to read a specific URL and then return my search and the results?

I am in need of some help for an assignment. I need to build a "simple" (according to my teacher) web-scraper that takes a URL as an argument, searches the source code of that URL, and then returns the links within that source code back (anything after href). The example URL my teacher has been having us use is http://citstudent.lanecc.net/tools.shtml. When the program is executed, there should be ten links returned as well as the URL of the website.
Since I am still trying to wrap my head around these concepts, I wasn't sure where to start, so I turned to Stack Overflow and found a script that kind of works. It does what I want it to do, but does not fulfill every requirement:
import urllib2

url = "http://citstudent.lanecc.net/tools.shtml"
page = urllib2.urlopen(url)
data = page.read().split("</a>")
tag = "<a href=\""
endtag = "\">"
for item in data:
    if "<a href" in item:
        try:
            ind = item.index(tag)
            item = item[ind + len(tag):]
            end = item.index(endtag)
        except:
            pass
        else:
            print item[:end]
This works because I hard-coded the URL into my code, and because it prints whatever follows some href tags. Normally I would ask you to guide me through this rather than just give me the code, but I'm having a rough day, and any explanation or example of this has to be better than what we went over in class. Thank you.
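A minimal sketch of what the assignment seems to ask for: take the URL from sys.argv, fetch it with urllib2, and pull the href values out with a regex instead of manual string slicing (the exact regex is an assumption about what counts as a link):

import re
import sys
import urllib2

url = sys.argv[1]  # e.g. http://citstudent.lanecc.net/tools.shtml
source = urllib2.urlopen(url).read()

print url  # the assignment also wants the URL of the website printed
for link in re.findall(r'href="([^"]+)"', source):
    print link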

Extract Text from HTML div using Python and lxml

I'm trying to get python to extract text from one spot of a website. I've identified the HTML div:
<div class="number">76</div>
which is in:
...div/div[1]/div/div[2]
I'm trying to use lxml to extract the '76' from that, but can't get a return out of it other than:
[]
Here's my code:
from lxml import html
import requests

url = 'https://sleepiq.sleepnumber.com/#/##1'
values = {'username': 'my@gmail.com',
          'password': 'mypassword'}
page = requests.get(url, data=values)
tree = html.fromstring(page.content)
hr = tree.xpath('//div[@class="number"]/text()')
print hr
Any suggestions? I feel this should be pretty easy, thanks in advance!
Update: the element I want is not contained in the page.content returned by requests.get.
Update 2: It looks like this is not logging me in to the page with the content I want; it is only getting the login screen content.
Have you tried printing your page.content to make sure your requests.get is retrieving the content you want? That is often where things break. And the empty list returned by the xpath search indicates "not found."
Assuming that's okay, your parsing is close. I just tried the following, which is successful:
from lxml import html
tree = html.fromstring('<body><div class="number">76</div></body>')
number = tree.xpath('//div[@class="number"]/text()')[0]
number now equals '76'. Note the [0] indexing, because xpath always returns a list of what's found. You have to dereference to find the content.
A common gotcha here is that the XPath text() function isn't as inclusive or straightforward as it might seem. If there are any sub-elements in the div, e.g. if the content is really <div class="number"><strong>76</strong></div>, then text() will return an empty list, because the text belongs to the strong, not the div. In real-world HTML, especially HTML that has ever been cut and pasted from a word processor or otherwise edited by humans, such extra elements are entirely common.
While it won't solve all known text management issues, one handy workaround is to use the // multi-level indirection instead of the / single-level indirection to text:
number = ''.join(tree.xpath('//div[@class="number"]//text()'))
Now, regardless of whether there are sub-elements or not, the total text will be concatenated and returned.
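A quick demonstration of the nested-element case described above:

from lxml import html

tree = html.fromstring('<div class="number"><strong>76</strong></div>')
print tree.xpath('//div[@class="number"]/text()')            # [], the text belongs to <strong>
print ''.join(tree.xpath('//div[@class="number"]//text()'))  # '76'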
Update: OK, if your problem is logging in, you probably want to try a requests.post (rather than .get) at minimum. In simpler cases, just that change might work. In others, the login needs to be done on a separate page from the page you want to retrieve/scrape. In that case, you probably want to use a session object:
with requests.Session() as session:
    # First POST to the login page
    landing_page = session.post(login_url, data=values)
    # Now make authenticated requests within the session
    page = session.get(url)
    # ...use page as above...
This is a bit more complex, but shows the logic for a separate login page. Many sites (e.g. WordPress sites) require it. Post-authentication, they often take you to pages (like the site home page) that aren't interesting content, though the landing page can be scraped to check whether the login succeeded. This altered login workflow doesn't change any of the parsing techniques, which work as above.
Beautiful Soup (http://www.pythonforbeginners.com/beautifulsoup/web-scraping-with-beautifulsoup) will help you out.
Another way: http://docs.python-guide.org/en/latest/scenarios/scrape/
I'd use plain regex over xml tools in this case. It's easier to handle.
import re
import requests

url = 'http://sleepiq.sleepnumber.com/#/user/-9223372029758346943##2'
values = {'email-email': 'my@gmail.com', 'password-clear': 'Combination',
          'password-password': 'mypassword'}
page = requests.get(url, data=values, timeout=5)
m = re.search(r'(\w*)(<div class="number">)(.*)(<\/div>)', page.content)
# m = re.search(r'(\w*)(<title>)(.*)(<\/title>)', page.content)
if m:
    print(m.group(3))
else:
    print('Not found')

Finding urls containing a specific string

I haven't used RegEx before, and everyone seems to agree that it's bad for webscraping and html in particular, but I'm not really sure how to solve my little challenge without it.
I have a small Python scraper that opens 24 different webpages. In each webpage, there are links to other webpages. I want to make a simple solution that gets the links I need, and even though the webpages are somewhat similar, the links that I want are not.
The only common thing between the urls seems to be a specific string: 'uge' or 'Uge' (uge means week in Danish, and the week number changes every week, duh). It's not like the urls have a common ID or something like that I could use to target the correct ones each time.
I figure it would be possible to use RegEx to go through the webpage and find all urls that have 'uge' or 'Uge' in them, and then open them. But is there a way to do that using BS? And if I do it using RegEx, what would a possible solution look like?
For example, here are two of the urls I want to grab in different webpages:
http://www.domstol.dk/KobenhavnsByret/retslister/Pages/Uge45-Tvangsauktioner.aspx
http://www.domstol.dk/esbjerg/retslister/Pages/Straffesageruge32.aspx
This should work... The RegEx uge\d\d? tells it to find "uge" followed by a digit, and possibly another one.
import re

for item in listofurls:
    l = re.findall("uge\d\d?", item, re.IGNORECASE)
    if l:
        print item  # just do whatever you want to do when it finds a match
Yes, you can do this with BeautifulSoup.
import re
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html_string)

# To find just 'Uge##' or 'uge##', as specified in the question:
urls = [el["href"] for el in soup.findAll("a", href=re.compile("[Uu]ge\d+"))]

# To find without regard to case at all:
urls = [el["href"] for el in soup.findAll("a", href=re.compile("(?i)uge\d+"))]
Or just use a simple for loop:
list_of_urls = ["""LIST GOES HERE"""]
for url in list_of_urls:
    if 'uge' in url.lower():
        pass  # code to execute
The regex would look something like: uge\d\d
