Looping SQL query in Python

I am writing a Python script that queries the database for a URL string. Below is my snippet.
db.execute('select sitevideobaseurl, videositestring '
           'from site, video '
           'where siteID = 1 and site.SiteID = video.VideoSiteID limit 1')
result = db.fetchall()

filename = '/home/Site_info'
output = open(filename, "w")

for row in result:
    videosite = row[0:2]
    link = videosite[0].format(videosite[1])
    full_link = link.replace("http://", "https://")
    print full_link
    output.write("%s\n" % str(full_link))

output.close()
The query basically gives me a URL link: the base URL from one table and the video site string from another.
output: https://www.youtube.com/watch?v=uqcSJR_7fOc
SiteID is the primary key; it is an int and its values are not sequential.
I want to loop this SQL query so that it picks a new SiteID on each execution, giving me a unique site URL every time, and write all the results to a file.
desired output: https://www.youtube.com/watch?v=uqcSJR_7fOc
https://www.dailymotion.com/video/hdfchsldf0f
There are about 1178 records.
Thanks for your time and help in advance.

I'm not sure if I completely understand what you're trying to do. I think your goal is to get a list of all links to videos. You get a link to a video by joining the sitevideobaseurl from site and videositestring from video.
From my experience it's much easier to let the database do the heavy lifting; it's built for that. It should be more efficient to join the tables, return all the results, and then loop through them, instead of making a separate query to the database for each row.
The code should look something like this: (Be careful, I didn't test this)
query = """
select s.sitevideobaseurl,
v.videositestring
from video as v
join site as s
on s.siteID = v.VideoSiteID
"""
db.execute(query)
result = db.fetchall()
filename = '/home/Site_info'
output = open(filename, "w")
for row in result:
link = "%s%s" % (row[0],row[1])
full_link = link.replace("http://","https://")
print full_link
output.write("%s\n" % str(full_link))
output.close()
If you have other reasons for wanting to fetch these one by one, an idea might be to first fetch a list of all SiteIDs and store them in a list. Afterwards you loop over each item in that list and insert the id into the query via a parameterized query, as sketched below.
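A rough, untested sketch of that one-by-one variant. It reuses the db cursor from above and assumes a MySQLdb-style driver where %s is the parameter placeholder; adjust the placeholder style to whatever your driver uses.
db.execute("select siteID from site")
site_ids = [row[0] for row in db.fetchall()]

output = open('/home/Site_info', "w")
for site_id in site_ids:
    # Parameterized query: the driver substitutes site_id safely for %s.
    db.execute('select s.sitevideobaseurl, v.videositestring '
               'from site as s join video as v on s.siteID = v.VideoSiteID '
               'where s.siteID = %s limit 1', (site_id,))
    row = db.fetchone()
    if row is None:
        continue
    full_link = ("%s%s" % (row[0], row[1])).replace("http://", "https://")
    output.write("%s\n" % full_link)
output.close()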

Related

Dealing with special characters in file paths - Python script and Athena Query

I have a script that I use to parse out S3 info using Athena and S3 inventory. Currently when it finds a specific file type, the script will use that path to generate a SQL query that is then passed to Athena. Here's an example:
def getS3TagReportData(session, tag, table):
    whereStatement = ""
    for v in tag.values:
        first = True
        for p in v.paths:
            # print(p)  # iterate through paths
            if first:
                whereStatement = f"WHERE dt = (SELECT MAX(dt) FROM \"{awsAthenaDatabase}\".\"{table}\") AND key LIKE '{p}/%'"
                first = False
            else:
                whereStatement += f" OR key LIKE '{p}%'"
This works fine until I hit a path where someone has put in an apostrophe (').
For example: bucketname\data\int'l hotel files\data.jpg
Then the query that gets dynamically generated is:
WHERE dt = (SELECT MAX(dt) FROM \"{awsAthenaDatabase}\".\"{table}\") AND key LIKE 'bucketname\data\int'l hotel files\data.jpg'
As you can see, the WHERE statement is going to fail because there is that extra ' in there. How can I programmatically work around these errors and report them?
I appreciate the help!
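One common workaround (a suggestion for illustration, not from the original post) relies on the standard SQL rule, which Athena/Presto follows, that a single quote inside a string literal is written as two single quotes. A minimal sketch:
def escape_sql_literal(value):
    # Standard SQL escaping: '' inside a quoted literal represents one '.
    return value.replace("'", "''")

p_safe = escape_sql_literal(p)
whereStatement += f" OR key LIKE '{p_safe}%'"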

How can I filter ports by country?

This is my code:
I import the modules:
import shodan
import json

I create my key:
SHODAN_API_KEY = ('xxxxxxxxxxxxxxxxxxxx')
api = shodan.Shodan(SHODAN_API_KEY)

I open my JSON file:
with open('Ports.json', 'r') as f:
    Ports_dict = json.load(f)

# I loop through my dict,
for Port in Ports_dict:
    print(Port['Port_name'])
    try:
        results = api.search(Port['Port_name'])  # how can I filter ports by country??

        # and I print the content.
        print('Results found: {}'.format(results['total']))
        for result in results['matches']:
            print('IP: {}'.format(result['ip_str']))
            print(result['data'])
            print('')
            print('Country_code: %s' % result['location']['country_code'])
    except shodan.APIError as e:
        print(' Error: %s' % e)
But how can I filter ports by country?
In order to filter the results you need to use a search filter. The following article explains the general search query syntax of Shodan:
https://help.shodan.io/the-basics/search-query-fundamentals
Here is a list of all available search filters:
https://beta.shodan.io/search/filters
And here is a page full of example search queries:
https://beta.shodan.io/search/examples
In your case, you would want to use the port and country filter. For example, the following search query returns the MySQL and PostgreSQL servers in the US:
https://beta.shodan.io/search?query=port%3A3306%2C5432+country%3AUS
I would also recommend using the Shodan CLI for downloading data as it will handle paging through results for you:
https://help.shodan.io/guides/how-to-download-data-with-api
If you need to do it yourself within Python then you would also need to loop through the search results either by providing a page parameter or by simply using the Shodan.search_cursor() method (instead of Shodan.search() as you did in your code). The above article also shows how to use the search_cursor() method.
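A rough sketch of that approach (untested; the MySQL port 3306 and the US country code simply mirror the example query linked above, and search_cursor() pages through all results for you):
import shodan

SHODAN_API_KEY = 'xxxxxxxxxxxxxxxxxxxx'
api = shodan.Shodan(SHODAN_API_KEY)

# Combine the port and country filters in one query string.
query = 'port:3306 country:US'

try:
    for banner in api.search_cursor(query):
        print('IP: {}'.format(banner['ip_str']))
        print('Port: {}'.format(banner['port']))
        print('Country_code: {}'.format(banner['location']['country_code']))
except shodan.APIError as e:
    print('Error: {}'.format(e))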

Parsing .log files, then sorting in Access

I'm writing a parsing program that searches through 100+ .log files for certain keywords, then puts the words into different arrays and separates them into columns in Excel. Now I want to sort them in Access automatically so that I can process the different .log file combinations. I can "copy paste" from my Excel file to Access, but that is inefficient and gives some errors... I would like it to be "automatic". I'm new to Access and don't know how to link from Python to Access; I tried doing it as I did with Excel, but that didn't work, so I started looking into ODBC, but I had some problems there too...
import glob                          # includes
import xlwt                          # includes
from os import listdir              # includes
from os.path import isfile, join    # includes

def logfile(filename, tester, createdate, completeresponse):
    # Arrays for strings
    response = []
    message = []
    test = []
    date = []

    with open(filename) as filesearch:              # open search file
        filesearch = filesearch.readlines()         # read file
        for line in filesearch:
            file = filename[39:]                    # extract filename [file]
            for lines in filesearch:
                if createdate in lines:             # extract "Create Time" {date}
                    date.append(lines[15:34])
                if completeresponse in lines:
                    response.append(lines[19:])
            print('pending...')

            i = 1                                   # set a number on log {i}
            d = {}
            for name in filename:
                if not d.get(name, False):
                    d[name] = i
                i += 1

            if tester in line:
                start = '-> '
                end = ':\ '
                # example line: Tester -> 1631 22 F1 2E :\ BCM_APP_31381140 AJ \ Read Data By Identifier \
                number = line[line.find(start)+3: line.find(end)]
                test.append(number)                 # extract tester {test}
                text = line[line.find(end)+3:]      # everything after ':\ '
                message.append(text)

    with open('Excel.txt', 'a') as handler:         # create .txt file
        for i in range(len(message)):
            # A = filename, B = create time, C = number in file, D = tester, E = complete response
            handler.write(f"{file}|{date[i]}|{i}|{test[i]}|{response[i]}")

# open with 'w' to "reset" the file.
with open('Excel.txt', 'w') as file_handler:
    pass

# ---------------------------------------------------------------------------------
for filename in glob.glob(r'C:\Users\Desktop\Access\*.log'):
    logfile(filename, 'Sending Request: Tester ->', 'Create Time:', 'Complete Response:', 'Channel')
def if_number(s):   # check whether the value is a number/float
    try:
        float(s)
        return True
    except ValueError:
        return False

# ----------------------------------------------
my_path = r"C:\Users\Desktop\Access"    # directory

# search directory for .txt files
text_files = [join(my_path, f) for f in listdir(my_path) if isfile(join(my_path, f)) and '.txt' in f]

for text_file in text_files:            # loop over and open each .txt document
    with open(text_file, 'r+') as wordlist:
        string = []                     # array of the saved strings
        for word in wordlist:
            string.append(word.split('|'))   # put each line's fields into the string array
        column_list = zip(*string)      # transpose rows into columns

    workbook = xlwt.Workbook()
    worksheet = workbook.add_sheet('Tab')

    worksheet.col(0)                    # construct cells and set column widths
    first_col = worksheet.col(0)
    first_col.width = 256 * 50
    second_col = worksheet.col(1)
    second_col.width = 256 * 25
    third_col = worksheet.col(2)
    third_col.width = 256 * 10
    fourth_col = worksheet.col(3)
    fourth_col.width = 256 * 50
    fifth_col = worksheet.col(4)
    fifth_col.width = 256 * 100

    i = 0                               # column index: 0 = A, 3 = D, etc.
    for column in column_list:
        for item in range(len(column)):
            value = column[item].strip()
            if if_number(value):
                worksheet.write(item, i, float(value))   # numeric value
            else:
                worksheet.write(item, i, value)          # text value
        i += 1

    print('File:', text_file, 'Done')
    workbook.save(text_file.replace('.txt', '.xls'))
Is there a way to automate the "copy paste" step? If so, how would that look and work? And if that's something that can't be done, some advice would help a lot!
EDIT
Thanks, I have done some googling, and thanks for your help! But now I get an error... I still can't send the information to the Access file; I get a syntax error. And I know the table exists, because I want to update the existing file... is there a command to "update an existing Access file"?
Error:
pyodbc.ProgrammingError: ('42S01', "[42S01] [Microsoft][ODBC Microsoft Access Driver] Table 'tblLogfile' already exists. (-1303) (SQLExecDirectW)")
Code:
import pyodbc

UDC = r'C:\Users\Documents\Access\UDC.accdb'

# DSN connection
constr = "DSN=MS Access Database;DBQ={0};".format(UDC)

# DRIVER connection
constr = "DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};UID=admin;UserCommitSync=Yes;Threads=3;SafeTransactions=0;PageTimeout=5;MaxScanRows=8;MaxBufferSize=2048;FIL={MS Access};DriverId=25;DefaultDir=C:/USERS/DOCUMENTS/ACCESS;DBQ=C:/USERS/DOCUMENTS/ACCESS/UDC.accdb"

# Connect to database UDC and open cursor
db = pyodbc.connect(constr)
cursor = db.cursor()

sql = "SELECT * INTO [tblLogfile]" + \
      " FROM [Excel 8.0;HDR=YES;Database=C:/Users/Documents/Access/Excel.xls].[Tab$];"

cursor.execute(sql)
db.commit()
cursor.close()
db.close()
First, please note, MS Access, a database management system, is not MS Excel, a spreadsheet application. Access sits on top of a relational engine and maintains strict rules in data and relational integrity whereas in Excel anything can be written across cells or ranges of cells with no rules. Additionally, the Access object library (tabledefs, querydefs, forms, reports, macros, modules) is much different than the Excel object library (workbooks, worksheets, range, cell, etc.), so there is no one-to-one translation in code.
Specifically, for your Python project, consider pyodbc with a make-table query that connects directly to the Excel workbook. MS Access' underlying database is the ACE/JET engine (Windows .dll files, available on Windows machines regardless of whether Access is installed), and one feature of this data store is the ability to connect to workbooks and even text files. So really, MSAccess.exe is just a GUI console for viewing .mdb/.accdb files.
Below creates a new database table that replicates the specific workbook sheet data, assuming the sheet maintains:
tabular format beginning in A1 cell (no merged cells/repeating labels)
headers in top row (no leading/trailing spaces or special characters !#$%^~<>)
columns of consistent data type format (i.e., data integrity rules)
Python code
import pyodbc

databasename = 'C:\\Path\\To\\Database\\File.accdb'

# DSN CONNECTION
constr = "DSN=MS Access Database;DBQ={0};".format(databasename)

# DRIVER CONNECTION
constr = "DRIVER={{Microsoft Access Driver (*.mdb, *.accdb)}};DBQ={0};".format(databasename)

# CONNECT TO DATABASE AND OPEN CURSOR
db = pyodbc.connect(constr)
cur = db.cursor()

# RUN MAKE-TABLE QUERY FROM EXCEL WORKBOOK SOURCE
# OLDER EXCEL FORMAT
sql = "SELECT * INTO [myNewTable]" + \
      " FROM [Excel 8.0;HDR=Yes;Database=C:\\Path\\To\\Workbook.xls].[SheetName$];"

# CURRENT EXCEL FORMAT
sql = "SELECT * INTO [myNewTable]" + \
      " FROM [Excel 12.0 Xml;HDR=Yes;Database=C:\\Path\\To\\Workbook.xlsx].[SheetName$];"

cur.execute(sql)
db.commit()
cur.close()
db.close()
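To get past the "Table 'tblLogfile' already exists" error from the edit above, a possible follow-up (untested, and not part of the answer above) is to drop the old table before re-running the make-table query, or to use an append query (INSERT INTO ... SELECT) against the existing table. A rough sketch, run before the cur.close()/db.close() calls and assuming a worksheet named Tab:
# Drop the previous copy of the table if it exists, then recreate it.
try:
    cur.execute("DROP TABLE [tblLogfile];")
    db.commit()
except pyodbc.Error:
    pass   # table was not there yet

sql = "SELECT * INTO [tblLogfile]" + \
      " FROM [Excel 8.0;HDR=Yes;Database=C:\\Users\\Documents\\Access\\Excel.xls].[Tab$];"
cur.execute(sql)
db.commit()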
Almost certainly the answer from Parfait above is a better way to go, but for fun I'll leave my answer below.
If you are willing to put in the time, I think you need 3 things to complete the automation of what you want to do:
1) Send a string representation of your data to the Windows clipboard. There is Windows-specific code for this, or you can just save yourself some time and use pyperclip (see the short sketch after this list).
2) Learn VBA and use VBA to grab the string from the clipboard and process it. Here is some example VBA code that I used in Excel in the past to grab text from the clipboard:
Function GetTextFromClipBoard() As String
    Dim MSForms_DataObject As New MSForms.DataObject
    MSForms_DataObject.GetFromClipboard
    GetTextFromClipBoard = MSForms_DataObject.GetText()
End Function
3) Use pywin32 (I believe it is easily available with Anaconda) to automate the VBA/Access calls from Python. This is probably going to be the hardest part, as the specific call trees are (in my opinion) not well documented and it takes a lot of poking and digging to figure out exactly what you need to do. It's painful to say the least, but use IPython to help you with visual cues of which methods your pywin32 objects have available.
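For step 1, a minimal pyperclip sketch (an illustration only; the row values are made up, and the pipe-separated layout simply mirrors the Excel.txt format from the question):
import pyperclip

# Build one string for the whole dataset, one row per line.
rows = [("file1.log", "2018-01-01 10:00:00", "1", "1631", "Complete Response ...")]
clipboard_text = "\n".join("|".join(row) for row in rows)

pyperclip.copy(clipboard_text)   # the VBA side can now read it from the clipboard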
As I look at the instructions above, I realize it may also be possible to skip the clipboard and just send the information directly from Python to Access via pywin32. If you do go the clipboard route, however, you can break the steps up:
send one dataset to the clipboard
grab and process the data using the VBA editor in Access
after you figure out 1 and 2, use pywin32 to bridge the gap
Good luck, and maybe write a blog post about it if you figure it out to share the details.

Get a list of comma-separated values in Python and use that list to perform a command as many times as necessary with each value

I'm sure this is something easy to do for someone with programming skills (unlike me). I am playing around with the Google Sites API. Basically, I want to be able to batch-create a bunch of pages, instead of having to do them one by one using the slow web form, which is a pain.
I have installed all the necessary files, read the documentation, and successfully ran this sample file. As a matter of fact, this sample file already has the Python code for creating a page in Google Sites:
elif choice == 4:
    print "\nFetching content feed of '%s'...\n" % self.client.site
    feed = self.client.GetContentFeed()
    try:
        selection = self.GetChoiceSelection(
            feed, 'Select a parent to upload to (or hit ENTER for none): ')
    except ValueError:
        selection = None
    page_title = raw_input('Enter a page title: ')

    parent = None
    if selection is not None:
        parent = feed.entry[selection - 1]

    new_entry = self.client.CreatePage(
        'webpage', page_title, '<b>Your html content</b>',
        parent=parent)

    if new_entry.GetAlternateLink():
        print 'Created. View it at: %s' % new_entry.GetAlternateLink().href
I understand the creation of a page revolves around page_title and new_entry and CreatePage. However, instead of creating one page at a time, I want to create many.
I've done some research, and I gather I need something like
page_titles = input("Enter a list of page titles separated by commas: ").split(",")
to gather a list of page titles (like page1, page2, page3, etc. -- I plan to use a text editor or spreadsheet to generate a long list of comma separated names).
Now I am trying to figure out how to get that string and "feed" it to new_entry so that it creates a separate page for each value in the string. I can't figure out how to do that. Can anyone help, please?
In case it helps, this is what the Google API needs to create a page:
entry = client.CreatePage('webpage', 'New WebPage Title', html='<b>HTML content</b>')
print 'Created. View it at: %s' % entry.GetAlternateLink().href
Thanks.
Whenever you want to "use that list to perform a command as many times as necessary with each value", that's a for loop. (It may be an implicit for loop, e.g., in a map call or a list comprehension, but it's still a loop.)
So, after this:
page_titles = raw_input("Enter a list of page titles separated by commas: ").split(",")
You do this:
for page_title in page_titles:
    # All the stuff that has to be done for each single title goes here.
    # I'm not entirely clear on what you're doing, but I think that's this part:
    parent = None
    if selection is not None:
        parent = feed.entry[selection - 1]
    new_entry = self.client.CreatePage(
        'webpage', page_title, '<b>Your html content</b>',
        parent=parent)
    if new_entry.GetAlternateLink():
        print 'Created. View it at: %s' % new_entry.GetAlternateLink().href
And that's usually all there is to it.
You can use a for loop to loop over a dict object and create multiple pages. Here is a little snippet to get you started.
import gdata.sites.client

client = gdata.sites.client.SitesClient(
    source=SOURCE_APP_NAME, site=site_name, domain=site_domain)

pages = {"page_1": '<b>Your html content</b>',
         "page_2": '<b>Your other html content</b>'}

for title, content in pages.items():
    feed = client.GetContentFeed()
    parent = None
    if selection is not None:
        parent = feed.entry[selection - 1]
    client.CreatePage('webpage', title, content, parent=parent)
As for feeding your script from an external source, I would recommend something like a CSV file:
## if you have a csv file called pages.csv with the following lines:
## page_1,<b>Your html content</b>
## page_2,<b>Your other html content</b>

import csv

with open("pages.csv", 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        title, content = row
        client.CreatePage('webpage', title, content, parent=parent)

Email harvester in Python: How to improve performance?

I have found a program called "Best Email Extractor" http://www.emailextractor.net/. The website says it is written in Python. I tried to write a similar program. The above program extracts about 300 - 1000 emails per minute. My program extracts about 30-100 emails per hour. Could someone give me tips on how to improve the performance of my program? I wrote the following:
import sqlite3 as sql
import urllib2
import re
import lxml.html as lxml
import time
import threading

def getUrls(start):
    urls = []
    try:
        dom = lxml.parse(start).getroot()
        dom.make_links_absolute()
        for url in dom.iterlinks():
            if not '.jpg' in url[2]:
                if not '.JPG' in url[2]:
                    if not '.ico' in url[2]:
                        if not '.png' in url[2]:
                            if not '.jpeg' in url[2]:
                                if not '.gif' in url[2]:
                                    if not 'youtube.com' in url[2]:
                                        urls.append(url[2])
    except:
        pass
    return urls

def getURLContent(urlAdresse):
    try:
        url = urllib2.urlopen(urlAdresse)
        text = url.read()
        url.close()
        return text
    except:
        return '<html></html>'

def harvestEmail(url):
    text = getURLContent(url)
    emails = re.findall('[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}', text)
    if emails:
        if saveEmail(emails[0]) == 1:
            print emails[0]

def saveUrl(url):
    connection = sql.connect('url.db')
    url = (url, )
    with connection:
        cursor = connection.cursor()
        cursor.execute('SELECT COUNT(*) FROM urladressen WHERE adresse = ?', url)
        data = cursor.fetchone()
        if(data[0] == 0):
            cursor.execute('INSERT INTO urladressen VALUES(NULL, ?)', url)
            return 1
    return 0

def saveEmail(email):
    connection = sql.connect('emails.db')
    email = (email, )
    with connection:
        cursor = connection.cursor()
        cursor.execute('SELECT COUNT(*) FROM addresse WHERE email = ?', email)
        data = cursor.fetchone()
        if(data[0] == 0):
            cursor.execute('INSERT INTO addresse VALUES(NULL, ?)', email)
            return 1
    return 0

def searchrun(urls):
    for url in urls:
        if saveUrl(url) == 1:
            #time.sleep(0.6)
            harvestEmail(url)
            print url
        urls.remove(url)
        urls = urls + getUrls(url)

urls1 = getUrls('http://www.google.de/#hl=de&tbo=d&output=search&sclient=psy-ab&q=DVD')
urls2 = getUrls('http://www.google.de/#hl=de&tbo=d&output=search&sclient=psy-ab&q=Jolie')
urls3 = getUrls('http://www.finanzen.net')
urls4 = getUrls('http://www.google.de/#hl=de&tbo=d&output=search&sclient=psy-ab&q=Party')
urls5 = getUrls('http://www.google.de/#hl=de&tbo=d&output=search&sclient=psy-ab&q=Games')
urls6 = getUrls('http://www.spiegel.de')
urls7 = getUrls('http://www.kicker.de/')
urls8 = getUrls('http://www.chessbase.com')
urls9 = getUrls('http://www.nba.com')
urls10 = getUrls('http://www.nfl.com')

try:
    threads = []
    urls = (urls1, urls2, urls3, urls4, urls5, urls6, urls7, urls8, urls9, urls10)
    for urlList in urls:
        thread = threading.Thread(target=searchrun, args=(urlList, )).start()
        threads.append(thread)
        print threading.activeCount()
    for thread in threads:
        thread.join()
except RuntimeError:
    print RuntimeError
I don't think many people are going to help you harvest emails. It's a generally detested activity.
Regarding the performance bottlenecks in your code, you need to find out where the time is going by profiling. At the lowest level, replace each of your functions with a dummy that does no processing but returns valid output; so the email collector could return a list of the same address 100 times (or however many are in these URL results). That will show you which function is costing you time.
Things that stick out:
Get the files behind the URLs from the server beforehand; if you spam Google every time you run the script, they could well block you. Reading from disk is faster than requesting the files from the internet and can be done separately and concurrently.
The database code is creating a new connection for each call to saveEmail etc, which will spend most of its time doing handshaking and authentication. Better to have an object that keeps the connection alive between calls, or better yet to insert multiple records at once.
Once the network and database issues are done, the regex could probably do with \b around it so that the matching does less backtracking.
A series of if not 'foo' in str: then if not 'blah' in str: ... is poor coding. Extract the final segment once and check it against multiple values: create a set or even frozenset of all the non-permitted values, like ignoredExtensions = set(['jpg', 'png', 'gif']), and compare against that with if not extension in ignoredExtensions. Note also that converting the extension to lower case first means less checking and it works whether the file is .jpg or .JPG (see the sketch after this list).
Finally, consider running the same script without threading on multiple commandlines. There is no real need to have the threading inside the script except for coordinating the different url lists. Frankly it would be far simpler to just have a set of url lists in files, and start a separate script to work on each. Let the OS do the multithreading, it is better at it.
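A rough sketch of the connection-reuse and extension-filtering suggestions (an untested illustration, staying with the question's Python 2 style and its existing table names):
import sqlite3 as sql
import os
import urlparse

IGNORED_EXTENSIONS = frozenset(['.jpg', '.jpeg', '.png', '.gif', '.ico'])

def is_ignored(link):
    # Compare the lower-cased extension against the set once,
    # instead of chaining one "if not ... in" test per extension.
    path = urlparse.urlparse(link).path
    return os.path.splitext(path)[1].lower() in IGNORED_EXTENSIONS

class EmailStore(object):
    """Keeps one sqlite connection open instead of reconnecting on every call.
    Note: create one store per thread; sqlite connections are not meant to be
    shared across threads by default."""
    def __init__(self, dbfile='emails.db'):
        self.connection = sql.connect(dbfile)

    def save(self, email):
        with self.connection:
            cursor = self.connection.cursor()
            cursor.execute('SELECT COUNT(*) FROM addresse WHERE email = ?', (email,))
            if cursor.fetchone()[0] == 0:
                cursor.execute('INSERT INTO addresse VALUES(NULL, ?)', (email,))
                return 1
        return 0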
