I use lxml XPath to parse HTML pages in Python 3.
As a sample, I have code that finds an HTML element:
version_android = doc.xpath("//div[@itemprop='operatingSystems']//text()")
Further, I have a MySQL insert query:
insert = ("insert into tracks (version) values ('%s')" % (version_android[0]))
The problem is that if the element is not in the HTML DOM, I get a MySQL error when I try to read the parsed result with version_android[0] and put it in the query.
Sometimes the result list has no version_android[0] but does have version_android[2], which makes the MySQL insert fail.
How can I validate this correctly? I have a lot of similar rules for parsing.
I tried this, but I don't like this solution:
version_android = doc.xpath("//div[@itemprop='operatingSystems']//text()")
if len(version_android):
    version_android = version_android[0]
else:
    version_android = ""
I think a better way (in my opinion) is to use try/except:
#valid_xpath = '__VIEWSTATEGENERATOR'
invalid_xpath = 'XXXXXXXX'
try:
    vgenerator = root.xpath('//*[@id="' + invalid_xpath + '"]//@value')[0]
except IndexError:
    vgenerator = None
print(vgenerator)
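Since there are many similar parsing rules, another option is to wrap the pattern in a small helper that returns the first match or a default. A minimal sketch (the helper name first_text is my own, not part of lxml):

def first_text(doc, xpath_expr, default=""):
    # Return the first XPath match, or the default when there is none.
    results = doc.xpath(xpath_expr)
    return results[0] if results else default

version_android = first_text(doc, "//div[@itemprop='operatingSystems']//text()")

Each rule then becomes a single call, and the MySQL insert always receives a string.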
Related
I have a script where I get some data from a web page with Selenium in Python. However, on some of the pages I'm scraping, some of the elements are not present, and this throws a NoSuchElementException error.
How do I return a null value when the element is not present? I tried using or None, but it still throws the error. Also, the elements following this one depend on the presence of the first one, as shown below:
metadata = driver.find_element(By.PARTIAL_LINK_TEXT, 'Complete metadata on ') or None
metadata_url = metadata.get_attribute('href') or None
dataset_id = metadata_url.split('metadata/')[1] or None
output_dict['datasets'].append({'title': dataset_title, 'url': dataset_link, 'metadata_url': metadata_url})
The element that is missing from some pages is the metadata.
I'm looking to populate the metadata_url field as null.
Please assist with this.
This code:
var = function_call(param) or None
runs the function, gets its output, evaluates that output as a boolean (see truthiness in Python), and if the output is falsy, sets the variable to None instead.
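For illustration, or returns its right-hand operand when the left-hand one is falsy:

print([] or None)     # None  (an empty list is falsy)
print(['a'] or None)  # ['a'] (a non-empty list is truthy)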
However, the function (find_element here) doesn't return a falsy value; it raises a NoSuchElementException if it doesn't find anything.
That means you need to use a try/except block in your code instead of or None:
from selenium.common.exceptions import NoSuchElementException

try:
    metadata = driver.find_element(By.PARTIAL_LINK_TEXT, 'Complete metadata on ')
    # If we reach this line, then find_element found something, and we
    # can set values for our url and dataset id
    metadata_url = metadata.get_attribute('href')  # this will be None if there's no href attribute
    dataset_id = metadata_url.split('metadata/')[1]
except NoSuchElementException:
    metadata_url = None
    dataset_id = None
In the case when metadata_url is None, you will need to handle that case, because metadata_url.split will not work; it will raise AttributeError: 'NoneType' object has no attribute 'split'.
I guess you're trying to use JS syntax in Python. You'll have to check whether the element exists first instead:
if not metadata_url:
    return None
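Putting it together with the dict from the question, a sketch (dataset_title and dataset_link are assumed to be set elsewhere, as in your original code):

from selenium.common.exceptions import NoSuchElementException

try:
    metadata = driver.find_element(By.PARTIAL_LINK_TEXT, 'Complete metadata on ')
    metadata_url = metadata.get_attribute('href')
except NoSuchElementException:
    metadata_url = None

# Only derive dataset_id when we actually have a URL
dataset_id = metadata_url.split('metadata/')[1] if metadata_url else None
output_dict['datasets'].append({'title': dataset_title, 'url': dataset_link, 'metadata_url': metadata_url})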
This is my code:
I import the modules
import shodan
import json
I create my key,
SHODAN_API_KEY = 'xxxxxxxxxxxxxxxxxxxx'
api = shodan.Shodan(SHODAN_API_KEY)
I open my json file,
with open('Ports.json', 'r') as f:
    Ports_dict = json.load(f)

# I loop through my dict,
for Port in Ports_dict:
    print(Port['Port_name'])
    try:
        results = api.search(Port['Port_name'])  # how can I filter ports by country??
        # and I print the content.
        print('Results found: {}'.format(results['total']))
        for result in results['matches']:
            print('IP: {}'.format(result['ip_str']))
            print(result['data'])
            print('')
            print('Country_code: %s' % result['location']['country_code'])
    except shodan.APIError as e:
        print(' Error: %s' % e)
But how can I filter ports by country?
In order to filter the results you need to use a search filter. The following article explains the general search query syntax of Shodan:
https://help.shodan.io/the-basics/search-query-fundamentals
Here is a list of all available search filters:
https://beta.shodan.io/search/filters
And here is a page full of example search queries:
https://beta.shodan.io/search/examples
In your case, you would want to use the port and country filters. For example, the following search query returns the MySQL and PostgreSQL servers in the US:
https://beta.shodan.io/search?query=port%3A3306%2C5432+country%3AUS
I would also recommend using the Shodan CLI for downloading data as it will handle paging through results for you:
https://help.shodan.io/guides/how-to-download-data-with-api
If you need to do it yourself within Python then you would also need to loop through the search results either by providing a page parameter or by simply using the Shodan.search_cursor() method (instead of Shodan.search() as you did in your code). The above article also shows how to use the search_cursor() method.
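For example, a minimal sketch that combines both filters in one query string and lets search_cursor() handle the paging (assuming a valid key in SHODAN_API_KEY):

import shodan

api = shodan.Shodan(SHODAN_API_KEY)
query = 'port:3306 country:US'  # port and country filters combined

try:
    for banner in api.search_cursor(query):
        print('IP: {}'.format(banner['ip_str']))
        print('Country_code: {}'.format(banner['location']['country_code']))
except shodan.APIError as e:
    print('Error: {}'.format(e))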
I would like to get the value of the data-test-carrier-name attribute with Python; more specifically, I want to get "trenitalia" as the result.
The HTML part looks like this:
<div class="_1moixrt _dtnn7w" tabindex="0"><span data-test-carrier-name="trenitalia">
I tried the following, but without any luck:
company = driver.find_element_by_xpath('//div[@class="_1moixrt _dtnn7w"]')
company.get_attribute("data-test-carrier-name")
Please check that the element is visible before you fetch its attribute. Also note that the attribute sits on the span, not the div, so locate the span:
org = driver.find_element_by_xpath('(//div[@class="_1moixrt _dtnn7w"])[1]/span[1]')
# Read the attribute from the span
val = org.get_attribute("data-test-carrier-name")
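Alternatively, assuming the attribute is unique on the page, you can target the span by the attribute itself:

org = driver.find_element_by_xpath('//span[@data-test-carrier-name]')
print(org.get_attribute('data-test-carrier-name'))  # "trenitalia"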
I am writing a Python script which queries the database for a URL string. Below is my snippet.
db.execute('select sitevideobaseurl,videositestring '
           'from site, video '
           'where siteID =1 and site.SiteID=video.VideoSiteID limit 1')
result = db.fetchall()

filename = '/home/Site_info'
output = open(filename, "w")

for row in result:
    videosite = row[0:2]
    link = videosite[0].format(videosite[1])
    full_link = link.replace("http://", "https://")
    print full_link
    output.write("%s\n" % str(full_link))

output.close()
The query basically gives a URL link. It gives me the base URL from one table and the video site string from another table.
output: https://www.youtube.com/watch?v=uqcSJR_7fOc
SiteID is the primary key; it is an int and not sequential.
I wish to loop this SQL query to pick a new SiteID on every execution, so that I get a unique site URL every time and write all the results to a file.
desired output: https://www.youtube.com/watch?v=uqcSJR_7fOc
https://www.dailymotion.com/video/hdfchsldf0f
There are about 1178 records.
Thanks for your time and help in advance.
I'm not sure if I completely understand what you're trying to do. I think your goal is to get a list of all links to videos. You get a link to a video by joining the sitevideobaseurl from site and videositestring from video.
From my experience it's much easier to let the database do the heavy lifting; it's built for that. It should be more efficient to join the tables, return all the results, and then loop through them, instead of making subsequent queries to the database for each row.
The code should look something like this: (Be careful, I didn't test this)
query = """
select s.sitevideobaseurl,
v.videositestring
from video as v
join site as s
on s.siteID = v.VideoSiteID
"""
db.execute(query)
result = db.fetchall()
filename = '/home/Site_info'
output = open(filename, "w")
for row in result:
link = "%s%s" % (row[0],row[1])
full_link = link.replace("http://","https://")
print full_link
output.write("%s\n" % str(full_link))
output.close()
If you have other reasons for wanting to fetch these one by one, an idea might be to fetch a list of all SiteIDs and store them in a list. Afterwards, you start a loop over that list and insert each id into the query via a parameterized query, as sketched below.
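A minimal sketch of that approach, reusing the table and column names from your question (the %s placeholder style assumes a DB-API driver such as MySQLdb):

db.execute('select SiteID from site')
site_ids = [row[0] for row in db.fetchall()]

for site_id in site_ids:
    # Parameterized query: the driver escapes site_id for us
    db.execute('select s.sitevideobaseurl, v.videositestring '
               'from site as s join video as v on s.SiteID = v.VideoSiteID '
               'where s.SiteID = %s', (site_id,))
    row = db.fetchone()
    if row:
        print row[0].format(row[1])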
I'm sure this is something easy to do for someone with programming skills (unlike me). I am playing around with the Google Sites API. Basically, I want to be able to batch-create a bunch of pages, instead of having to do them one by one using the slow web form, which is a pain.
I have installed all the necessary files, read the documentation, and successfully ran this sample file. As a matter of fact, this sample file already has the python code for creating a page in Google Sites:
elif choice == 4:
    print "\nFetching content feed of '%s'...\n" % self.client.site
    feed = self.client.GetContentFeed()

    try:
        selection = self.GetChoiceSelection(
            feed, 'Select a parent to upload to (or hit ENTER for none): ')
    except ValueError:
        selection = None

    page_title = raw_input('Enter a page title: ')

    parent = None
    if selection is not None:
        parent = feed.entry[selection - 1]

    new_entry = self.client.CreatePage(
        'webpage', page_title, '<b>Your html content</b>',
        parent=parent)

    if new_entry.GetAlternateLink():
        print 'Created. View it at: %s' % new_entry.GetAlternateLink().href
I understand that the creation of a page revolves around page_title, new_entry, and CreatePage. However, instead of creating one page at a time, I want to create many.
I've done some research, and I gather I need something like
page_titles = input("Enter a list of page titles separated by commas: ").split(",")
to gather a list of page titles (like page1, page2, page3, etc. -- I plan to use a text editor or spreadsheet to generate a long list of comma separated names).
Now I am trying to figure out how to get that string and "feed" it to new_entry so that it creates a separate page for each value in the string. I can't figure out how to do that. Can anyone help, please?
In case it helps, this is what the Google API needs to create a page:
entry = client.CreatePage('webpage', 'New WebPage Title', html='<b>HTML content</b>')
print 'Created. View it at: %s' % entry.GetAlternateLink().href
Thanks.
Whenever you want to "use that list to perform a command as many times as necessary with each value", that's a for loop. (It may be an implicit for loop, e.g., in a map call or a list comprehension, but it's still a loop.)
So, after this:
page_titles = raw_input("Enter a list of page titles separated by commas: ").split(",")
You do this:
for page_title in page_titles:
    # All the stuff that has to be done for each single title goes here.
    # I'm not entirely clear on what you're doing, but I think that's this part:
    parent = None
    if selection is not None:
        parent = feed.entry[selection - 1]
    new_entry = self.client.CreatePage(
        'webpage', page_title, '<b>Your html content</b>',
        parent=parent)
    if new_entry.GetAlternateLink():
        print 'Created. View it at: %s' % new_entry.GetAlternateLink().href
And that's usually all there is to it.
You can use a for loop to loop over a dict object and create multiple pages. Here is a little snippet to get you started.
import gdata.sites.client

client = gdata.sites.client.SitesClient(
    source=SOURCE_APP_NAME, site=site_name, domain=site_domain)

pages = {"page_1": '<b>Your html content</b>',
         "page_2": '<b>Your other html content</b>'}

for title, content in pages.items():
    feed = client.GetContentFeed()
    parent = None
    if selection is not None:  # selection comes from the earlier sample's menu
        parent = feed.entry[selection - 1]
    client.CreatePage('webpage', title, content, parent=parent)
As for feeding your script from an external source, I would recommend something like a CSV file:
## If you have a csv file called pages.csv with the following lines:
## page_1,<b>Your html content</b>
## page_2,<b>Your other html content</b>
import csv

with open("pages.csv", 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        title, content = row
        client.CreatePage('webpage', title, content, parent=parent)
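One design note on the CSV approach: csv.reader already understands quoted fields, so if the HTML content itself contains commas, wrap that column in double quotes in the file (or generate the file with csv.writer) and the title, content unpacking will keep working.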