Python: If dict keys in line

Found this great answer on how to check whether any string from a list appears in a line:
How to check if a line has one of the strings in a list?
But trying to do a similar thing with the keys of a dict does not seem to work for me:
import urllib2
url_info = urllib2.urlopen('http://rss.timegenie.com/forex.xml')
currencies = {"DKK": [], "SEK": []}
print currencies.keys()
testCounter = 0
for line in url_info:
    if any(countryCode in line for countryCode in currencies.keys()):
        testCounter += 1
        if "DKK" in line or "SEK" in line:
            print line
print "testCounter is %i and should be 2 - if not debug the code" % (testCounter)
The output:
['SEK', 'DKK']
<code>DKK</code>
<code>SEK</code>
testCounter is 377 and should be 2 - if not debug the code
I think that perhaps my problem is that .keys() gives me an array rather than a list, but I haven't figured out how to convert it.

change:
any(countryCode in line for countryCode in currencies.keys())
to:
any([countryCode in line for countryCode in currencies.keys()])
Your original code uses a generator expression whereas (I think) your intention is a list comprehension.
see: Generator Expressions vs. List Comprehension
UPDATE:
I found that when using an IPython interpreter with pylab imported, I got the same results as you did (377 counts versus the anticipated 2). I realized the issue was that any came from the numpy package, which is meant to work on an array.
Next, I loaded an IPython interpreter without pylab so that any came from builtins. In this case your original code works.
So if you're using an IPython interpreter, type:
help(any)
and make sure it is from the builtin module. If so, your original code should work fine.
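A minimal sketch (my addition) of the gotcha described above -- numpy's any does not consume a generator expression the way the builtin any does:
import numpy

line = "<code>NOK</code>"
keys = ["DKK", "SEK"]

print any(code in line for code in keys)        # builtin any: False
print numpy.any(code in line for code in keys)  # numpy's any: True - the generator
                                                # object itself is truthy, so every
                                                # line would get counted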

This is not a very good way to examine an xml file.
It's slow. You are making potentially N*M substring searches where N is the number of lines and M is the number of keys.
XML is not a line-oriented text format. Your substring searches could find attribute names or element names too, which is probably not what you want. And if the XML file happens to put all its elements on one line with no whitespace (common for machine-generated and -processed XML) you will get fewer matches than you expect.
If you have line-oriented text input, I suggest you construct a regex from your list of keys:
import re
linetester = re.compile('|'.join(re.escape(key) for key in currencies))
for match in linetester.finditer(entire_text):
    print match.group(0)

#or if entire_text is too long and you want to consume it iteratively, line by line:
for line in entire_text:
    for match in linetester.finditer(line):
        print match.group(0)
However, since you have XML, you should use an actual XML processor:
import xml.etree.cElementTree as ET
forex = ET.parse(url_info)  # parse the feed opened above
for elem in forex.findall('data/code'):
    if elem.text in currencies:
        print elem.text
If you are only interested in what codes are present and don't care about the particular entry you can use set intersection:
codes = frozenset(e.text for e in forex.findall('data/code'))
print codes & frozenset(currencies)
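Putting the pieces together (a minimal sketch of the ElementTree approach above, assuming the feed keeps the data/code structure used in this answer):
import urllib2
import xml.etree.cElementTree as ET

currencies = {"DKK": [], "SEK": []}

url_info = urllib2.urlopen('http://rss.timegenie.com/forex.xml')
forex = ET.parse(url_info)

# count and print only the currency codes we care about
testCounter = 0
for elem in forex.findall('data/code'):
    if elem.text in currencies:
        testCounter += 1
        print elem.text

print "testCounter is %i and should be 2 - if not debug the code" % testCounter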

Related

Formatting csv file to utilize zabbix sender

I have a CSV file that I need to format before I can send the data to Zabbix, but I've run into a problem with one of the CSV files I have to use.
Here is an example of the part of the file I have trouble with:
Some_String_That_ineed_01[ifcontainsCell01removehere],xxxxxxxxx02
Some_String_That_ineed_01vtcrmp01[ifcontainsCell01removehere]-[1],aass
These are two lines from the file; the other lines I have already handled.
I need to check whether Cell01 is in the line
(if 'Cell01' in h: do something)
and, if it is, remove all the content between the [ ] (brackets included) that contains the word Cell01, leaving only this:
Some_String_That_ineed_01,xxxxxxxxx02
Some_String_That_ineed_01vtcrmp01-[1],aass
The other cases my script already handles easily. There must be a better way than what I have in mind, which is to split h on the first [, split again on the comma, drop the content I don't want, and concatenate what is left, since I can't use replace because I need to keep this data ([1]).
Later on I will send the desired result to zabbix_sender as the item and item key_. I already have the hosts, timestamps and values.
You should use a regexp (re module):
import re
s = "Some_String_That_ineed_01[ifcontainsCell01removehere],xxxxxxxxx02"
replaced = re.sub('\[.*Cell01.*\]', '', s)
print (replaced)
will return:
[root#somwhere test]# python replace.py
Some_String_That_ineed_01,xxxxxxxxx02
You can also experiment here.
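One thing to watch out for (my addition, not part of the answer above): because .* is greedy, the pattern can run all the way to the last ] on a line, which would also strip the -[1] part of the second example. Keeping the match inside a single pair of brackets avoids that; a minimal sketch:
import re

lines = [
    "Some_String_That_ineed_01[ifcontainsCell01removehere],xxxxxxxxx02",
    "Some_String_That_ineed_01vtcrmp01[ifcontainsCell01removehere]-[1],aass",
]

# [^\]]* cannot cross a closing bracket, so [1] is left untouched
pattern = re.compile(r'\[[^\]]*Cell01[^\]]*\]')

for h in lines:
    if 'Cell01' in h:
        h = pattern.sub('', h)
    print h
# Some_String_That_ineed_01,xxxxxxxxx02
# Some_String_That_ineed_01vtcrmp01-[1],aass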

Regex to parse out a part of URL using python

I have data as follows:
data['url']
http://hostname.com/aaa/uploads/2013/11/a-b-c-d.jpg https://www.aaa.com/
http://hostname.com/bbb/uploads/2013/11/e-f-g-h.gif https://www.aaa.com/
http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html
http://hostname.com/ddd/uploads/2013/11/w-e-r-t.ico
http://hostname.com/ddd/uploads/2013/11/r-t-y-u.aspx https://www.aaa.com/
http://hostname.com/bbb/uploads/2013/11/t-r-w-q.jpeg https://www.aaa.com/
I want to find the formats such as .jpg, .gif, .png, .ico, .aspx, .html and .jpeg, and parse backwards from each until a "/" is found. I also want to check for multiple occurrences throughout the string. My output should be:
data['parsed']
a-b-c-d
e-f-g-h
e-f-g-h a-a-a-a
w-e-r-t
r-t-y-u
t-r-w-q
Instead of writing an individual command for each of the formats, is there a way to do everything with a single command?
Can anybody help me write these commands? I am new to regex and any help would be appreciated.
This builds a list of (name, extension) pairs:
import re
results = []
for link in data:
    matches = re.search(r'/(\w-\w-\w-\w)\.(\w{2,})\b', link)
    results.append((matches.group(1), matches.group(2)))
This pattern returns the file names. I have just used one of your urls to demonstrate; for more, you could simply append the matches to a list of results:
import re
url = "http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html"
p = r'((?:[a-z]-){3}[a-z]).'
matches = re.findall(p, url)
>>> print('\n'.join(matches))
e-f-g-h
a-a-a-a
There is the assumption that the urls all have the general form you provided.
You might try this:
data['parse'] = re.findall(r'[^/]+\.[a-z]+ ',data['url'])
That will pick out all of the file names with their extensions. If you want to remove the extensions, the code above returns a list which you can then process with list comprehension and re.sub like so:
[re.sub('\.[a-z]+$','',exp) for exp in data['parse']]
Use the .join function to create a string as demonstrated in Totem's answer
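Putting those last two steps together (a minimal sketch of the findall/sub/join approach above; I use a lookahead instead of the literal trailing space so file names at the end of a line are also caught, and the sample rows are taken from the question):
import re

rows = [
    "http://hostname.com/aaa/uploads/2013/11/a-b-c-d.jpg https://www.aaa.com/",
    "http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html",
]

parsed = []
for row in rows:
    # file names with their extensions...
    names = re.findall(r'[^/]+\.[a-z]+(?=\s|$)', row)
    # ...then strip the extensions and join multiple hits with a space
    parsed.append(' '.join(re.sub(r'\.[a-z]+$', '', n) for n in names))

print '\n'.join(parsed)
# a-b-c-d
# e-f-g-h a-a-a-a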

What stronger alternatives are there to difflib?

I am working on a script that needs to be able to track revisions. The general idea is to give it a list of tuples where the first entry is the name of a field (e.g. "title" or "description"), the second entry is the first version of that field, and the third entry is the revised version. So something like this:
[("Title", "The first version of the title", "The second version of the title")]
Now, using python docx I want my script to create a word file that will show the original version, and the new version with the changes bolded. Example:
Original Title:
This is the first version of the title
Revised Title:
This is the second version of the title
The way that this is done in python docx is to create a list of tuples, where the first entry is the text, and the second one is the formatting. So the way to create the revised title would be this:
paratext = [("This is the ", ''),("second",'b'),(" version of the title",'')]
Having recently discovered difflib, I figured this would be a pretty easy task. And indeed, for simple word replacements such as the sample above, it is, and can be done with the following function:
import difflib

def revFinder(str1, str2):
    s = difflib.SequenceMatcher(None, str1, str2)
    matches = s.get_matching_blocks()[:-1]
    paratext = []
    for i in range(len(matches)):
        print "------"
        print str1[matches[i][0]:matches[i][0]+matches[i][2]]
        print str2[matches[i][1]:matches[i][1]+matches[i][2]]
        paratext.append((str2[matches[i][1]:matches[i][1]+matches[i][2]], ''))
        if i != len(matches)-1:
            print ""
            print str1[matches[i][0]+matches[i][2]:matches[i+1][0]]
            print str2[matches[i][1]+matches[i][2]:matches[i+1][1]]
            if len(str2[matches[i][1]+matches[i][2]:matches[i+1][1]]) > len(str1[matches[i][0]+matches[i][2]:matches[i+1][0]]):
                paratext.append((str2[matches[i][1]+matches[i][2]:matches[i+1][1]], 'bu'))
            else:
                paratext.append((str1[matches[i][0]+matches[i][2]:matches[i+1][0]], 'bu'))
    return paratext
The problems come when I want to do anything else. For example, changing 'teh' to 'the' produces t h e h (without the spaces, I couldn't figure out the formatting). Another issue is that extra text appended to the end is not shown as a change (or at all).
So, my question to all of you is: what alternatives are there to difflib that are powerful enough to handle more complicated text comparisons, or how can I use difflib better so that it works for what I want? Thanks in advance.
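As an illustration of the behaviour described above (my addition): SequenceMatcher compares the strings character by character, so a one-letter transposition like 'teh' -> 'the' comes back as single-character matching blocks rather than a word-level change:
import difflib

s = difflib.SequenceMatcher(None, 'teh', 'the')
for block in s.get_matching_blocks():
    print block
# Each matching block (apart from the zero-length sentinel at the end) is a
# single character, which is why revFinder() stitches the output back together
# letter by letter. Matching on word lists instead, e.g.
# difflib.SequenceMatcher(None, str1.split(), str2.split()),
# keeps whole words together.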

Which of the following data structures is the best for frequent searching?

I have a text file with some content. I need to search this content frequently. I have the following two options; which one is the best (in terms of faster execution)?
METHOD 1:
def search_list(search_string):
    if search_string in li:
        print "found at line ", li.index(search_string)+1

if __name__ == "__main__":
    f = open("input.txt", "r")
    li = []
    for i in f.readlines():
        li.append(i.rstrip("\n"))
    search_list("appendix")
METHOD 2:
def search_dict(search_string):
    if d.has_key(search_string):
        print "found at line ", d[search_string]

if __name__ == "__main__":
    f = open("input.txt", "r")
    d = {}
    for i, j in zip(range(1, len(f.readlines())), f.readlines()):
        d[j.rstrip("\n")] = i
    search_dict("appendix")
For frequent searching, a dictionary is definitely better (provided you have enough memory to store the line numbers as well), since the keys are hashed and looked up in O(1) operations. However, your implementation won't work. The first f.readlines() will exhaust the file object and you won't read anything with the second f.readlines().
What you're looking for is enumerate:
with open('data') as f:
d = dict((j[:-1],i) for i,j in enumerate(f,1))
It should also be pointed out that in both cases, the function which does the searching will be faster if you use try/except, provided that the index you're looking for is typically found. (In the first case, it might be faster anyway, since in is an O(N) operation and so is .index for a list.)
e.g.:
def search_dict(d, search_string):
    try:
        print "found at line {0}".format(d[search_string])
    except KeyError:
        print "string not found"
or for the list:
def search_list(search_string):
    try:
        print "found at line {0}".format(li.index(search_string)+1)
    except ValueError:
        print "string not found"
If you do it really frequently, then the second method will be faster (you've built something like an index).
Just adapt it a little bit:
def search_dict(d, search_string):
    line = d.get(search_string)
    if line:
        print "found at line {}".format(line)
    else:
        print "string not found"

d = {}
with open("input.txt", "r") as f:
    for i, word in enumerate(f.readlines(), 1):
        d[word.rstrip()] = i

search_dict(d, "appendix")
I'm posting this after reading the answers of eumiro and mgilson.
If you compare your two methods on the command line, I think you'll find that the first one is faster. The other answers say the second method is faster, but they are based on the premise that you'll do several searches on the file after you've built your index. If you use them as-is from the command line, you will not.
The building of the index is slower than just searching for the string directly, but once you've built an index, searches can be done very quickly, making up for the time spent building it. This extra time is wasted if you just use it once, because when the program is complete, the index is discarded and has to be rebuilt the next run. You need to keep the created index in memory between queries for this to pay off.
There are several ways of doing this; one is making a daemon to hold the index and using a front-end script to query it. Searching for something like python daemon client communication on Google will give you pointers on implementing this -- here's one method.
The first one is O(n); the second one is O(1), but it requires searching on the key. I'd pick the second one.
Neither one will work if you're doing ad hoc searches in the document. For that you'll need to parse and index using something like Lucene.
Another option to throw in is using the FTS provided by SQLite3... (untested, and making the assumption you're looking for whole words, not substrings of words or other such things)
import sqlite3

look_for = 'somestring'

# create db and table
db = sqlite3.connect(':memory:') # replace with a file on-disk?
db.execute('create virtual table somedata using fts4(line)')

# insert the data
with open('yourfile.txt') as fin:
    for lineno, line in enumerate(fin, 1):
        # You could put in a check here I guess...
        if look_for in line:
            print lineno # or whatever....
        # put row into FTS table
        db.execute('insert into somedata (line) values (?)', (line,))
    # or, possibly more efficient, instead of the loop above:
    # db.executemany('insert into somedata (line) values (?)', ((l,) for l in fin))
db.commit()

matches = db.execute('select rowid from somedata where line match ?', (look_for,))
print '{} is on lines: {}'.format(look_for, ', '.join(str(match[0]) for match in matches))
If you only wanted the first line, then add limit 1 to the end of the query.
You could also look at using mmap to map the file, then use the .find method to get the earliest offset of the string; assuming it's not -1 (i.e. not found) and is, say, 123456, do mapped_file[:123456].count('\n') + 1 to get the line number.
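A minimal sketch of that mmap approach (my addition; the file name and search string are placeholders):
import mmap

with open('input.txt', 'r') as f:
    mapped_file = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    offset = mapped_file.find('appendix')
    if offset != -1:
        # count the newlines before the hit to turn the offset into a line number
        print "found at line", mapped_file[:offset].count('\n') + 1
    else:
        print "string not found"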

Why don't I find the words in their original source list?

I am trying to find Chinese words in two different files. It didn't work, so I tried to search for the words in the same file I got them from, but it seems it doesn't find them either. How is that possible?
chin_split = codecs.open("CHIN_split.txt","r+",encoding="utf-8")
I used this for the regex code:
import re
for n in re.findall(ur'[\u4e00-\u9fff]+', chin_split.read()):
    print n in re.findall(ur'[\u4e00-\u9fff]+', chin_split.read())
How come I only get False printed?
FYI I tried to do this and it works:
for x in [1,2,3,4,5,6,6]:
    print x in [1,2,3,4,5,6,6]
BTW, chin_split contains words in English, Hebrew and Chinese.
Some lines from chin_split.txt:
he daodan 核导弹 טיל גרעיני
hedantou 核弹头 ראש חץ גרעיני
helu 阖庐 "ביתו, מעונו
helu 阖庐 שם מלך וו בתקופת ה'אביב והסתיו'"
huiwu 会晤 להיפגש עם
You are reading from the same file object multiple times, and that is the problem.
The first chin_split.read() will yield all the content, but the later calls (inside the loop) will just get an empty string.
That loop makes little sense, but if you want to keep it, save the file content in a variable first.
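A minimal sketch of that fix (my addition): read the file once and reuse the content:
import codecs
import re

chin_split = codecs.open("CHIN_split.txt", "r+", encoding="utf-8")
content = chin_split.read()   # read once, reuse below

words = re.findall(ur'[\u4e00-\u9fff]+', content)
for n in words:
    print n in words   # now every word is found in its own source list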
