I have a text file with some content. I need to search this content frequently. I have the following two options; which one is the best (in terms of faster execution)?
METHOD 1:
def search_list(search_string):
    if search_string in li:
        print "found at line ", li.index(search_string) + 1

if __name__ == "__main__":
    f = open("input.txt", "r")
    li = []
    for i in f.readlines():
        li.append(i.rstrip("\n"))
    search_list("appendix")
METHOD 2:
def search_dict(search_string):
    if d.has_key(search_string):
        print "found at line ", d[search_string]

if __name__ == "__main__":
    f = open("input.txt", "r")
    d = {}
    for i, j in zip(range(1, len(f.readlines())), f.readlines()):
        d[j.rstrip("\n")] = i
    search_dict("appendix")
For frequent searching, a dictionary is definitely better (provided you have enough memory to store the line numbers as well), since the keys are hashed and looked up in O(1) time. However, your implementation won't work: the first f.readlines() will exhaust the file object, so you won't read anything with the second f.readlines().
What you're looking for is enumerate:
with open('data') as f:
    d = dict((j[:-1], i) for i, j in enumerate(f, 1))
It should also be pointed out that in both cases, the function that does the searching will be faster if you use try/except, provided the item you're looking for is usually found. (In the list case it might be faster anyway, since in is an O(n) operation and so is .index for a list.)
e.g.:
def search_dict(d, search_string):
    try:
        print "found at line {0}".format(d[search_string])
    except KeyError:
        print "string not found"
or for the list:
def search_list(search_string):
    try:
        print "found at line {0}".format(li.index(search_string) + 1)
    except ValueError:
        print "string not found"
If you do it really frequently, then the second method will be faster (you've built something like an index).
Just adapt it a little bit:
def search_dict(d, search_string):
    line = d.get(search_string)
    if line:
        print "found at line {}".format(line)
    else:
        print "string not found"
d = {}
with open("input.txt", "r") as f:
    for i, word in enumerate(f.readlines(), 1):
        d[word.rstrip()] = i

search_dict(d, "appendix")
I'm posting this after reading the answers of eumiro and mgilson.
If you compare your two methods on the command line, I think you'll find that the first one is faster. The other answers say the second method is faster, but they are based on the premise that you'll do several searches on the file after you've built your index; if you run them as-is from the command line, you will not.
Building the index is slower than just searching for the string directly, but once the index is built, searches can be done very quickly, making up for the time spent building it. That extra time is wasted if you only use the index once, because when the program exits the index is discarded and has to be rebuilt on the next run. You need to keep the index in memory between queries for this to pay off.
There are several ways of doing this; one is to run a daemon that holds the index and use a front-end script to query it. Searching for something like python daemon client communication on google will give you pointers on implementing this -- here's one method.
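A minimal sketch of the simplest version of that idea - build the index once, keep it in memory, and answer queries in a loop (the file name and prompt are just for illustration):
d = {}
with open("input.txt") as f:
    for i, line in enumerate(f, 1):
        d[line.rstrip("\n")] = i

# the index now stays in memory for as many queries as you like
while True:
    query = raw_input("search> ")
    if not query:
        break
    print d.get(query, "string not found")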
The first one is O(n); the second one is O(1), but it requires looking up by the exact key. I'd pick the second one.
Neither one will work if you're doing ad hoc searches in the document. For that you'll need to parse and index it using something like Lucene.
Another option to throw in is using the FTS (full-text search) provided by SQLite3... (untested, and making the assumption you're looking for whole words, not substrings of words or other such things)
import sqlite3

# create db and table
db = sqlite3.connect(':memory:')  # replace with a file on disk?
db.execute('create virtual table somedata using fts4(line)')

look_for = 'somestring'

# insert the data
with open('yourfile.txt') as fin:
    for lineno, line in enumerate(fin, 1):
        # You could put in a check here I guess...
        if look_for in line:
            print lineno  # or whatever....
        # put row into FTS table
        db.execute('insert into somedata (line) values (?)', (line,))
    # or, possibly more efficiently, skip the loop and do:
    # db.executemany('insert into somedata (line) values (?)', ((l,) for l in fin))
db.commit()

matches = db.execute('select rowid from somedata where line match ?', (look_for,))
print '{} is on lines: {}'.format(look_for, ', '.join(str(match[0]) for match in matches))
If you only wanted the first line, then add limit 1 to the end of the query.
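For instance, a sketch of the same query reusing the table and search term from above:
first = db.execute('select rowid from somedata where line match ? limit 1', (look_for,)).fetchone()
if first:
    print first[0]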
You could also look at using mmap to map the file, then use the .find method to get the earliest offset of the string. Assuming it's not -1 (i.e. not found) - let's say it's 123456 - you can then do mapped_file[:123456].count('\n') + 1 to get the line number.
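An untested sketch of that idea, reusing the file and search string from the question:
import mmap

with open('input.txt', 'r+b') as f:
    mapped_file = mmap.mmap(f.fileno(), 0)
    offset = mapped_file.find('appendix')
    if offset != -1:
        # count the newlines before the match to get a 1-based line number
        print mapped_file[:offset].count('\n') + 1
    else:
        print "string not found"
    mapped_file.close()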
Related
Hi all. This is my first time having to look for assistance, but I am sort of at a brick wall for now. I have been learning Python since August and I have been given a challenge to complete by the end of November, and I hope I can get some help in making my code work. My task requires finding the IP address which occurs most frequently and counting the number of times it appears; this information must be displayed to the user. I have been given 4 .txt files that contain the IPs. I am also required to make use of non-trivial data structures and built-in Python sorting and/or searching functionality, and to make use of functions, parameter passing and return values in the program. Below is a sample structure they have recommended that I use:
def analyse_logs(parameter):
    # Your Code Here
    return something

def extract_ip(parameter):
    # Your Code Here
    return something

def find_most_frequent(parameter):
    # Your Code Here
    return something

# Test Program
def main():
    # Your Code Here

# Call Test Program
main()
And below is what I have come up with. The code is completely different from the sample that has been provided, and what I have done doesn't give me output straight back; instead it creates a new text file which has been sorted, but that is not what I am looking for:
def sorting(filename):
    infile = open(filename)
    ip_addr = []
    for line in infile:
        temp = line.split()
        for i in temp:
            ip_addr.append(i)
    infile.close()
    ip_addr.sort()
    outfile = open("result.txt", "w")
    for i in ip_addr:
        outfile.writelines(i)
        outfile.writelines(" ")
    outfile.close()

sorting("sample_log_1.txt")
The code that I have created sorts everything that's in the .txt file and outputs it from the most frequently used all the way to the least frequent. All I am looking for is an algorithm that can go through the .txt file, find the IP address that occurs most frequently, then print that IP out along with how many times it appears. I hope I have provided everything; I am sure this is probably something very basic, but I just can't get my head round it.
You should keep track of the number of times each IP address is repeated. You can use a dictionary for this.
ip_count_dict = {"IP1": repeat_count, "IP2": repeat_count}
The first time you find an IP in your list, set its repeat_count to 1; after that, if you find the same IP again, just increase the counter.
For example,
ip_count_dict = {}
ip_list = ['1.1.1.1','1.1.1.2','1.1.1.3','1.1.1.1']
#Loop and count ips
#Final version of ip_count_dict {'1.1.1.1':2 , '1.1.1.2':1, '1.1.1.3':1}
With this dictionary you can store all IPs and sort them by their values.
P.S.: A dictionary keeps key/value pairs; you can search for "sort dictionary by value" once all the counting is done.
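A minimal sketch of that counting loop, assuming ip_list already holds the extracted addresses as in the example above:
ip_count_dict = {}
for ip in ip_list:
    # get() returns 0 the first time an IP is seen
    ip_count_dict[ip] = ip_count_dict.get(ip, 0) + 1

# pick the IP with the highest count
most_frequent = max(ip_count_dict, key=ip_count_dict.get)
print "%s appears %d times" % (most_frequent, ip_count_dict[most_frequent])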
I'd like to make a program that makes offline copies of math questions from Khan Academy. I have a huge 21.6MB text file that contains data on all of their exercises, but I have no idea how to start analyzing it, much less start pulling the questions from it.
Here is a pastebin containing a sample of the JSON data. If you want to see all of it, you can find it here. Warning for long load time.
I've never used JSON before, but I wrote up a quick Python script to try to load up individual "sub-blocks" (or equivalent, correct term) of data.
import sys
import json

exercises = open("exercises.txt", "r+b")
byte = 0
frontbracket = 0
backbracket = 0

while byte < 1000:  # while byte < character we want to read up to
                    # keep at 1000 for testing purposes
    char = exercises.read(1)
    sys.stdout.write(char)
    # Here we decide what to do based on what char we have
    if str(char) == "{":
        frontbracket = byte
        while True:
            char = exercises.read(1)
            if str(char) == "}":
                backbracket = byte
                break
        exercises.seek(frontbracket)
        block = exercises.read(backbracket - frontbracket)
        print "Block is " + str(backbracket - frontbracket) + " bytes long"
        jsonblock = json.loads(block)
        sys.stdout.write(block)
        print jsonblock["translated_display_name"]
        print "\nENDBLOCK\n"
    byte = byte + 1
Ok, the repeated pattern appears to be this: http://pastebin.com/4nSnLEFZ
To get an idea of the structure of the response, you can use JSONlint to copy/paste portions of your string and 'validate'. Even if the portion you copied is not valid, it will still format it into something you can actually read.
First, I used the requests library to pull the JSON for you. It's a super simple library for things like this. The API is slow to respond because it seems you're pulling everything, but it should work fine.
Once you get a response from the API, you can convert that directly to python objects using .json(). What you have is essentially a mixture of nested lists and dictionaries that you can iterate through and pull specific details. In my example below, my_list2 has to use a try/except structure because it would seem that some of the entries do not have two items in the list under translated_problem_types. In that case, it will just put 'None' instead. You might have to use trial and error for such things.
Finally, since you haven't used JSON before, it's also worth noting that it can behave like a dictionary itself; you are not guaranteed the order in which you receive details. However, in this case, it seems the outermost structure is a list, so in theory it's possible that there is a consistent order but don't rely on it - we don't know how the list is constructed.
import requests

api_call = requests.get('https://www.khanacademy.org/api/v1/exercises')
json_response = api_call.json()

# Assume we first want to list "author name" with "author key"
# This should loop through the repeated pattern in the pastebin
# access items as a dictionary
my_list1 = []
for item in json_response:
    my_list1.append([item['author_name'], item['author_key']])
print my_list1[0:5]

# Now let's assume we want the 'sha' of the SECOND entry in translated_problem_types
# to also be listed with author name
my_list2 = []
for item in json_response:
    try:
        the_second_entry = item['translated_problem_types'][0]['items'][1]['sha']
    except IndexError:
        the_second_entry = 'None'
    my_list2.append([item['author_name'], item['author_key'], the_second_entry])
print my_list2[0:5]
I am working on a script that needs to be able to track revisions. The general idea is to give it a list of tuples where the first entry is the name of a field (e.g. "title" or "description"), the second entry is the first version of that field, and the third entry is the revised version. So something like this:
[("Title", "The first version of the title", "The second version of the title")]
Now, using python-docx, I want my script to create a Word file that will show the original version and the new version with the changes bolded. Example:
Original Title:
This is the first version of the title
Revised Title:
This is the second version of the title
The way that this is done in python docx is to create a list of tuples, where the first entry is the text, and the second one is the formatting. So the way to create the revised title would be this:
paratext = [("This is the ", ''),("second",'b'),(" version of the title",'')]
Having recently discovered difflib, I figured this would be a pretty easy task. And indeed, for simple word replacements such as the sample above, it is, and can be done with the following function:
import difflib

def revFinder(str1, str2):
    s = difflib.SequenceMatcher(None, str1, str2)
    matches = s.get_matching_blocks()[:-1]
    paratext = []
    for i in range(len(matches)):
        print "------"
        print str1[matches[i][0]:matches[i][0]+matches[i][2]]
        print str2[matches[i][1]:matches[i][1]+matches[i][2]]
        paratext.append((str2[matches[i][1]:matches[i][1]+matches[i][2]], ''))
        if i != len(matches)-1:
            print ""
            print str1[matches[i][0]+matches[i][2]:matches[i+1][0]]
            print str2[matches[i][1]+matches[i][2]:matches[i+1][1]]
            if len(str2[matches[i][1]+matches[i][2]:matches[i+1][1]]) > len(str1[matches[i][0]+matches[i][2]:matches[i+1][0]]):
                paratext.append((str2[matches[i][1]+matches[i][2]:matches[i+1][1]], 'bu'))
            else:
                paratext.append((str1[matches[i][0]+matches[i][2]:matches[i+1][0]], 'bu'))
    return paratext
The problems come when I want to do anything else. For example, changing 'teh' to 'the' produces t h e h (without the spaces, I couldn't figure out the formatting). Another issue is that extra text appended to the end is not shown as a change (or at all).
So, my question to all of you is: what alternatives are there to difflib which are powerful enough to handle more complicated text comparisons, or how can I use difflib better so that it works for what I want? Thanks in advance.
I found this great answer on how to check whether a line contains one of the strings in a list:
How to check if a line has one of the strings in a list?
But trying to do a similar thing with keys in a dict does not seem to do the job for me:
import urllib2

url_info = urllib2.urlopen('http://rss.timegenie.com/forex.xml')

currencies = {"DKK": [], "SEK": []}
print currencies.keys()

testCounter = 0

for line in url_info:
    if any(countryCode in line for countryCode in currencies.keys()):
        testCounter += 1
        if "DKK" in line or "SEK" in line:
            print line

print "testCounter is %i and should be 2 - if not debug the code" % (testCounter)
The output:
['SEK', 'DKK']
<code>DKK</code>
<code>SEK</code>
testCounter is 377 and should be 2 - if not debug the code
I think that perhaps my problem is that .keys() gives me an array rather than a list, but I haven't figured out how to convert it.
change:
any(countryCode in line for countryCode in currencies.keys())
to:
any([countryCode in line for countryCode in currencies.keys()])
Your original code uses a generator expression whereas (I think) your intention is a list comprehension.
see: Generator Expressions vs. List Comprehension
UPDATE:
I found that when using an IPython interpreter with pylab imported, I got the same results as you did (377 counts versus the anticipated 2). I realized the issue was that any was coming from the numpy package, which is meant to work on an array.
Next, I loaded an IPython interpreter without pylab, so that any was the builtin. In this case your original code works.
So if you're using an IPython interpreter, type:
help(any)
and make sure it comes from the builtin module. If so, your original code should work fine.
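For example, a quick check in a hypothetical interpreter session might look like this:
>>> any                   # the builtin
<built-in function any>
>>> from pylab import *   # pulls in numpy's any
>>> any
<function any at 0x...>
>>> del any               # remove the shadowing name; the builtin is visible again
>>> any
<built-in function any>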
This is not a very good way to examine an XML file.
It's slow. You are making potentially N*M substring searches where N is the number of lines and M is the number of keys.
XML is not a line-oriented text format. Your substring searches could find attribute names or element names too, which is probably not what you want. And if the XML file happens to put all its elements on one line with no whitespace (common for machine-generated and -processed XML) you will get fewer matches than you expect.
If you have line-oriented text input, I suggest you construct a regex from your list of keys:
import re
import urllib2

entire_text = urllib2.urlopen('http://rss.timegenie.com/forex.xml').read()
linetester = re.compile('|'.join(re.escape(key) for key in currencies))

for match in linetester.finditer(entire_text):
    print match.group(0)

# or if entire_text is too long and you want to consume it line by line:
for line in entire_text.splitlines():
    for match in linetester.finditer(line):
        print match.group(0)
However, since you have XML, you should use an actual XML processor:
import xml.etree.cElementTree as ET

forex = ET.fromstring(entire_text)  # parse the document fetched above
for elem in forex.findall('data/code'):
    if elem.text in currencies:
        print elem.text
If you are only interested in what codes are present and don't care about the particular entry you can use set intersection:
codes = frozenset(e.text for e in forex.findall('data/code'))
print codes & frozenset(currencies)
Is there a way in cx_Oracle to capture the stdout output from an oracle stored procedure? These show up when using Oracle's SQL Developer or SQL Plus, but there does not seem to be a way to fetch it using the database drivers.
You can retrieve dbms_output with DBMS_OUTPUT.GET_LINE(buffer, status). Status is 0 on success and 1 when there's no more data.
You can also use get_lines(lines, numlines). numlines is input-output. You set it to the max number of lines and it is set to the actual number on output. You can call this in a loop and exit when the returned numlines is less than your input. lines is an output array.
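A rough, untested sketch of that GET_LINES loop with cx_Oracle (the chunk size and helper name are arbitrary):
import cx_Oracle

def fetch_dbms_output(cursor, chunk=100):
    lines = []
    while True:
        lines_var = cursor.arrayvar(cx_Oracle.STRING, chunk)
        num_lines_var = cursor.var(cx_Oracle.NUMBER)
        num_lines_var.setvalue(0, chunk)
        cursor.callproc("dbms_output.get_lines", (lines_var, num_lines_var))
        num = int(num_lines_var.getvalue())
        lines.extend(lines_var.getvalue()[:num])
        if num < chunk:  # buffer exhausted
            break
    return lines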
Here is a code example based on redcayuga's first answer:
import cx_Oracle

def dbms_lines(cursor):
    status = cursor.var(cx_Oracle.NUMBER)
    line = cursor.var(cx_Oracle.STRING)
    lines = []
    while True:
        cursor.callproc('DBMS_OUTPUT.GET_LINE', (line, status))
        if status.getvalue() == 0:
            lines.append(line.getvalue())
        else:
            break
    return lines
Then run it after calling your stored procedure with:
for line in dbms_lines(cursor):
    log.debug(line)
Whatever you put using put_line, you read using get_line; I believe this is how all these tools work, probably including SQL*Plus itself.
Note that you need to call get_line enough times to exhaust the buffer. If you don't, the unread part will be overwritten by the next put_line.
Do not forget to call
cursor.callproc("dbms_output.enable")
before calling your actual procedure, otherwise the buffer will be empty.
So building on the other two answers here, an example would be (proc_name is your procedure - schema.package.procedure):
def execute_proc(cursor, proc_name):
    cursor.callproc("dbms_output.enable")
    cursor.callproc(proc_name)
    for line in dbms_lines(cursor):
        print(line)
Did you try this?
>>> conn = cx_Oracle.connect('user/pw#SCHEMA')
>>> cursor = conn.cursor()
>>> output = cursor.callproc("dbms_output.put_line", ['foo',])
>>> output
['foo']
The first argument is the procedure to call and the second a sequence of arguments or a dict for bindvars.
see also:
http://cx-oracle.sourceforge.net/html/cursor.html