How to determine whether a Python string contains an emoji? - python

I've seen previous responses to this question, but none of them are recent and none of them are working for me in Python 3. I have a list of strings, and I simply want to identify which ones contain emoji. What's the fastest way to do this?
To be more specific, I have a lengthy list of email subject lines from AB tests, and I'm trying to determine which subject lines contained emoji.

this link and this link both count © and other common characters as emoji. Also the former has minor mistakes and the latter still doesn't appear to work.
Here's an implementation that errs on the conservative side using this newer data and this documentation. It only considers code points that are marked with the unicode property Emoji_Presentation (which means it's definitely an emoji), or code points marked only with the property Emoji (which means it defaults to text but it could be an emoji), that are followed by a special variation selector code point fe0f that says to default to an emoji instead. The reason I say this is conservative is because certain systems aren't as picky about the fe0f and will treat characters as emoji wherever they can (read more about this here).
import re
from collections import defaultdict
def parse_line(line):
"""Return a pair (property, codepoints) where property is a string and
codepoints is a set of int unicode code points"""
pat = r'([0-9A-Z]+)(\.\.[0-9A-Z]+)? + ; +(\w+) + #.*'
match = re.match(pat, line)
assert match
codepoints = set()
start = int(match.group(1), 16)
if match.group(2):
trimmed = match.group(2)[2:]
end = int(trimmed, 16) + 1
else:
end = start + 1
for cp in range(start, end):
codepoints.add(cp)
return (match.group(3), codepoints)
def parse_emoji_data():
"""Return a dictionary mapping properties to code points"""
result = defaultdict(set)
with open('emoji-data.txt', mode='r', encoding='utf-8') as f:
for line in f:
if '#' != line[0] and len(line.strip()) > 0:
property, cp = parse_line(line)
result[property] |= cp
return result
def test_parse_emoji_data():
sets = parse_emoji_data()
sizes = {
'Emoji': 1123,
'Emoji_Presentation': 910,
'Emoji_Modifier': 5,
'Emoji_Modifier_Base': 83,
}
for k, v in sizes.items():
assert len(sets[k]) == v
def contains_emoji(text):
"""
Return true if the string contains either a code point with the
`Emoji_Presentation` property, or a code point with the `Emoji`
property that is followed by \uFE0F
"""
sets = parse_emoji_data()
for i, ch in enumerate(text):
if ord(ch) in sets['Emoji_Presentation']:
return True
elif ord(ch) in sets['Emoji']:
if len(text) > i+1 and text[i+1] == '\ufe0f':
return True
return False
test_parse_emoji_data()
assert not contains_emoji('hello')
assert not contains_emoji('hello :) :D 125% #%&*(##%&!#(^*(')
assert contains_emoji('here is a smiley \U0001F601 !!!')
To run this you need ftp://ftp.unicode.org/Public/emoji/3.0/emoji-data.txt in the working directory.
Once the regex module supports emoji properties it will be easier to use that instead.

Related

Consolidate similar patterns into single consensus pattern

In the previous post, I did not clarify the questions properly, therefore, I would like to start a new topic here.
I have the following items:
a sorted list of 59,000 protein patterns (range from 3 characters "FFK" to 152 characters long);
some long protein sequences, aka my reference.
I am going to match these patterns against my reference and find the location of where the match is found. (My friend helped wrtoe a script for that.)
import sys
import re
from itertools import chain, izip
# Read input
with open(sys.argv[1], 'r') as f:
sequences = f.read().splitlines()
with open(sys.argv[2], 'r') as g:
patterns = g.read().splitlines()
# Write output
with open(sys.argv[3], 'w') as outputFile:
data_iter = iter(sequences)
order = ['antibody name', 'epitope sequence', 'start', 'end', 'length']
header = '\t'.join([k for k in order])
outputFile.write(header + '\n')
for seq_name, seq in izip(data_iter, data_iter):
locations = [[{'antibody name': seq_name, 'epitope sequence': pattern, 'start': match.start()+1, 'end': match.end(), 'length': len(pattern)} for match in re.finditer(pattern, seq)] for pattern in patterns]
for loc in chain.from_iterable(locations):
output = '\t'.join([str(loc[k]) for k in order])
outputFile.write(output + '\n')
f.close()
g.close()
outputFile.close()
Problem is, within these 59,000 patterns, after sorted, I found that some part of one pattern match with part of the other patterns, and I would like to consolidate these into one big "consensus" patterns and just keep the consensus (see examples below):
TLYLQMNSLRAED
TLYLQMNSLRAEDT
YLQMNSLRAED
YLQMNSLRAEDT
YLQMNSLRAEDTA
YLQMNSLRAEDTAV
will yield
TLYLQMNSLRAEDTAV
another example:
APRLLIYGASS
APRLLIYGASSR
APRLLIYGASSRA
APRLLIYGASSRAT
APRLLIYGASSRATG
APRLLIYGASSRATGIP
APRLLIYGASSRATGIPD
GQAPRLLIY
KPGQAPRLLIYGASSR
KPGQAPRLLIYGASSRAT
KPGQAPRLLIYGASSRATG
KPGQAPRLLIYGASSRATGIPD
LLIYGASSRATG
LLIYGASSRATGIPD
QAPRLLIYGASSR
will yield
KPGQAPRLLIYGASSRATGIPD
PS : I am aligning them here so it's easier to visualize. The 59,000 patterns initially are not sorted so it's hard to see the consensus in the actual file.
In my particular problem, I am not picking the longest patterns, instead, I need to take each pattern into account to find the consensus. I hope I have explained clearly enough for my specific problem.
Thanks!
Here's my solution with randomized input order to improve confidence of the test.
import re
import random
data_values = """TLYLQMNSLRAED
TLYLQMNSLRAEDT
YLQMNSLRAED
YLQMNSLRAEDT
YLQMNSLRAEDTA
YLQMNSLRAEDTAV
APRLLIYGASS
APRLLIYGASSR
APRLLIYGASSRA
APRLLIYGASSRAT
APRLLIYGASSRATG
APRLLIYGASSRATGIP
APRLLIYGASSRATGIPD
GQAPRLLIY
KPGQAPRLLIYGASSR
KPGQAPRLLIYGASSRAT
KPGQAPRLLIYGASSRATG
KPGQAPRLLIYGASSRATGIPD
LLIYGASSRATG
LLIYGASSRATGIPD
QAPRLLIYGASSR"""
test_li1 = data_values.split()
#print(test_li1)
test_li2 = ["abcdefghi", "defghijklmn", "hijklmnopq", "mnopqrst", "pqrstuvwxyz"]
def aggregate_str(data_li):
copy_data_li = data_li[:]
while len(copy_data_li) > 0:
remove_li = []
len_remove_li = len(remove_li)
longest_str = max(copy_data_li, key=len)
copy_data_li.remove(longest_str)
remove_li.append(longest_str)
while len_remove_li != len(remove_li):
len_remove_li = len(remove_li)
for value in copy_data_li:
value_pattern = "".join([x+"?" for x in value])
longest_match = max(re.findall(value_pattern, longest_str), key=len)
if longest_match in value:
longest_str_index = longest_str.index(longest_match)
value_index = value.index(longest_match)
if value_index > longest_str_index and longest_str_index > 0:
longest_str = value[:value_index] + longest_str
copy_data_li.remove(value)
remove_li.append(value)
elif value_index < longest_str_index and longest_str_index + len(longest_match) == len(longest_str):
longest_str += value[len(longest_str)-longest_str_index:]
copy_data_li.remove(value)
remove_li.append(value)
elif value in longest_str:
copy_data_li.remove(value)
remove_li.append(value)
print(longest_str)
print(remove_li)
random.shuffle(test_li1)
random.shuffle(test_li2)
aggregate_str(test_li1)
#aggregate_str(test_li2)
Output from print().
KPGQAPRLLIYGASSRATGIPD
['KPGQAPRLLIYGASSRATGIPD', 'APRLLIYGASS', 'KPGQAPRLLIYGASSR', 'APRLLIYGASSRAT', 'APRLLIYGASSR', 'APRLLIYGASSRA', 'GQAPRLLIY', 'APRLLIYGASSRATGIPD', 'APRLLIYGASSRATG', 'QAPRLLIYGASSR', 'LLIYGASSRATG', 'KPGQAPRLLIYGASSRATG', 'KPGQAPRLLIYGASSRAT', 'LLIYGASSRATGIPD', 'APRLLIYGASSRATGIP']
TLYLQMNSLRAEDTAV
['YLQMNSLRAEDTAV', 'TLYLQMNSLRAED', 'TLYLQMNSLRAEDT', 'YLQMNSLRAED', 'YLQMNSLRAEDTA', 'YLQMNSLRAEDT']
Edit1 - brief explanation of the code.
1.) Find longest string in list
2.) Loop through all remaining strings and find longest possible match.
3.) Make sure that the match is not a false positive. Based on the way I've written this code, it should avoid pairing single overlaps on terminal ends.
4.) Append the match to the longest string if necessary.
5.) When nothing else can be added to the longest string, repeat the process (1-4) for the next longest string remaining.
Edit2 - Corrected unwanted behavior when treating data like ["abcdefghijklmn", "ghijklmZopqrstuv"]
def main():
#patterns = ["TLYLQMNSLRAED","TLYLQMNSLRAEDT","YLQMNSLRAED","YLQMNSLRAEDT","YLQMNSLRAEDTA","YLQMNSLRAEDTAV"]
patterns = ["APRLLIYGASS","APRLLIYGASSR","APRLLIYGASSRA","APRLLIYGASSRAT","APRLLIYGASSRATG","APRLLIYGASSRATGIP","APRLLIYGASSRATGIPD","GQAPRLLIY","KPGQAPRLLIYGASSR","KPGQAPRLLIYGASSRAT","KPGQAPRLLIYGASSRATG","KPGQAPRLLIYGASSRATGIPD","LLIYGASSRATG","LLIYGASSRATGIPD","QAPRLLIYGASSR"]
test = find_core(patterns)
test = find_pre_and_post(test, patterns)
#final = "YLQMNSLRAED"
final = "KPGQAPRLLIYGASSRATGIPD"
if test == final:
print("worked:" + test)
else:
print("fail:"+ test)
def find_pre_and_post(core, patterns):
pre = ""
post = ""
for pattern in patterns:
start_index = pattern.find(core)
if len(pattern[0:start_index]) > len(pre):
pre = pattern[0:start_index]
if len(pattern[start_index+len(core):len(pattern)]) > len(post):
post = pattern[start_index+len(core):len(pattern)]
return pre+core+post
def find_core(patterns):
test = ""
for i in range(len(patterns)):
for j in range(2,len(patterns[i])):
patterncount = 0
for pattern in patterns:
if patterns[i][0:j] in pattern:
patterncount += 1
if patterncount == len(patterns):
test = patterns[i][0:j]
return test
main()
So what I do first is find the main core in the find_core function by starting with a string of length two, as one character is not sufficient information, for the first string. I then compare that substring and see if it is in ALL the strings as the definition of a "core"
I then find the indexes of the substring in each string to then find the pre and post substrings added to the core. I keep track of these lengths and update them if one length is greater than the other. I didn't have time to explore edge cases so here is my first shot

unicode table information about a character in python

Is there a way in python to get the technical information for a given character like it's displayed in the Unicode table? (cf.https://unicode-table.com/en/)
Example:
for the letter "Ȅ"
Name > Latin Capital Letter E with Double Grave
Unicode number > U+0204
HTML-code > Ȅ
Bloc > Latin Extended-B
Lowercase > ȅ
What I actually need is to get for any Unicode number (like here U+0204) the corresponding name (Latin Capital Letter E with Double Grave) and the lowercase version (here "ȅ").
Roughly:
input = a Unicode number
output = corresponding information
The closest thing I've been able to find is the fontTools library but I can't seem to find any tutorial/documentation on how to use it to do that.
Thank you.
The standard module unicodedata defines a lot of properties, but not everything. A quick peek at its source confirms this.
Fortunately unicodedata.txt, the data file where this comes from, is not hard to parse. Each line consists of exactly 15 elements, ; separated, which makes it ideal for parsing. Using the description of the elements on ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html, you can create a few classes to encapsulate the data. I've taken the names of the class elements from that list; the meaning of each of the elements is explained on that same page.
Make sure to download ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt and ftp://ftp.unicode.org/Public/UNIDATA/Blocks.txt first, and put them inside the same folder as this program.
Code (tested with Python 2.7 and 3.6):
# -*- coding: utf-8 -*-
class UnicodeCharacter:
def __init__(self):
self.code = 0
self.name = 'unnamed'
self.category = ''
self.combining = ''
self.bidirectional = ''
self.decomposition = ''
self.asDecimal = None
self.asDigit = None
self.asNumeric = None
self.mirrored = False
self.uc1Name = None
self.comment = ''
self.uppercase = None
self.lowercase = None
self.titlecase = None
self.block = None
def __getitem__(self, item):
return getattr(self, item)
def __repr__(self):
return '{'+self.name+'}'
class UnicodeBlock:
def __init__(self):
self.first = 0
self.last = 0
self.name = 'unnamed'
def __repr__(self):
return '{'+self.name+'}'
class BlockList:
def __init__(self):
self.blocklist = []
with open('Blocks.txt','r') as uc_f:
for line in uc_f:
line = line.strip(' \r\n')
if '#' in line:
line = line.split('#')[0].strip()
if line != '':
rawdata = line.split(';')
block = UnicodeBlock()
block.name = rawdata[1].strip()
rawdata = rawdata[0].split('..')
block.first = int(rawdata[0],16)
block.last = int(rawdata[1],16)
self.blocklist.append(block)
# make 100% sure it's sorted, for quicker look-up later
# (it is usually sorted in the file, but better make sure)
self.blocklist.sort (key=lambda x: block.first)
def lookup(self,code):
for item in self.blocklist:
if code >= item.first and code <= item.last:
return item.name
return None
class UnicodeList:
"""UnicodeList loads Unicode data from the external files
'UnicodeData.txt' and 'Blocks.txt', both available at unicode.org
These files must appear in the same directory as this program.
UnicodeList is a new interpretation of the standard library
'unicodedata'; you may first want to check if its functionality
suffices.
As UnicodeList loads its data from an external file, it does not depend
on the local build from Python (in which the Unicode data gets frozen
to the then 'current' version).
Initialize with
uclist = UnicodeList()
"""
def __init__(self):
# we need this first
blocklist = BlockList()
bpos = 0
self.codelist = []
with open('UnicodeData.txt','r') as uc_f:
for line in uc_f:
line = line.strip(' \r\n')
if '#' in line:
line = line.split('#')[0].strip()
if line != '':
rawdata = line.strip().split(';')
parsed = UnicodeCharacter()
parsed.code = int(rawdata[0],16)
parsed.characterName = rawdata[1]
parsed.category = rawdata[2]
parsed.combining = rawdata[3]
parsed.bidirectional = rawdata[4]
parsed.decomposition = rawdata[5]
parsed.asDecimal = int(rawdata[6]) if rawdata[6] else None
parsed.asDigit = int(rawdata[7]) if rawdata[7] else None
# the following value may contain a slash:
# ONE QUARTER ... 1/4
# let's make it Python 2.7 compatible :)
if '/' in rawdata[8]:
rawdata[8] = rawdata[8].replace('/','./')
parsed.asNumeric = eval(rawdata[8])
else:
parsed.asNumeric = int(rawdata[8]) if rawdata[8] else None
parsed.mirrored = rawdata[9] == 'Y'
parsed.uc1Name = rawdata[10]
parsed.comment = rawdata[11]
parsed.uppercase = int(rawdata[12],16) if rawdata[12] else None
parsed.lowercase = int(rawdata[13],16) if rawdata[13] else None
parsed.titlecase = int(rawdata[14],16) if rawdata[14] else None
while bpos < len(blocklist.blocklist) and parsed.code > blocklist.blocklist[bpos].last:
bpos += 1
parsed.block = blocklist.blocklist[bpos].name if bpos < len(blocklist.blocklist) and parsed.code >= blocklist.blocklist[bpos].first else None
self.codelist.append(parsed)
def find_code(self,codepoint):
"""Find the Unicode information for a codepoint (as int).
Returns:
a UnicodeCharacter class object or None.
"""
# the list is unlikely to contain duplicates but I have seen Unicode.org
# doing that in similar situations. Again, better make sure.
val = [x for x in self.codelist if codepoint == x.code]
return val[0] if val else None
def find_char(self,str):
"""Find the Unicode information for a codepoint (as character).
Returns:
for a single character: a UnicodeCharacter class object or
None.
for a multicharacter string: a list of the above, one element
per character.
"""
if len(str) > 1:
result = [self.find_code(ord(x)) for x in str]
return result
else:
return self.find_code(ord(str))
When loaded, you can now look up a character code with
>>> ul = UnicodeList() # ONLY NEEDED ONCE!
>>> print (ul.find_code(0x204))
{LATIN CAPITAL LETTER E WITH DOUBLE GRAVE}
which by default is shown as the name of a character (Unicode calls this a 'code point'), but you can retrieve other properties as well:
>>> print ('%04X' % uc.find_code(0x204).lowercase)
0205
>>> print (ul.lookup(0x204).block)
Latin Extended-B
and (as long as you don't get a None) even chain them:
>>> print (ul.find_code(ul.find_code(0x204).lowercase))
{LATIN SMALL LETTER E WITH DOUBLE GRAVE}
It does not rely on your particular build of Python; you can always download an updated list from unicode.org and be assured to get the most recent information:
import unicodedata
>>> print (unicodedata.name('\U0001F903'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: no such name
>>> print (uclist.find_code(0x1f903))
{LEFT HALF CIRCLE WITH FOUR DOTS}
(As tested with Python 3.5.3.)
There are currently two lookup functions defined:
find_code(int) looks up character information by codepoint as an integer.
find_char(string) looks up character information for the character(s) in string. If there is only one character, it returns a UnicodeCharacter object; if there are more, it returns a list of objects.
After import unicodelist (assuming you saved this as unicodelist.py), you can use
>>> ul = UnicodeList()
>>> hex(ul.find_char(u'è').code)
'0xe8'
to look up the hex code for any character, and a list comprehension such as
>>> l = [hex(ul.find_char(x).code) for x in 'Hello']
>>> l
['0x48', '0x65', '0x6c', '0x6c', '0x6f']
for longer strings. Note that you don't actually need all of this if all you want is a hex representation of a string! This suffices:
l = [hex(ord(x)) for x in 'Hello']
The purpose of this module is to give easy access to other Unicode properties. A longer example:
str = 'Héllo...'
dest = ''
for i in str:
dest += chr(ul.find_char(i).uppercase) if ul.find_char(i).uppercase is not None else i
print (dest)
HÉLLO...
and showing a list of properties for a character per your example:
letter = u'Ȅ'
print ('Name > '+ul.find_char(letter).name)
print ('Unicode number > U+%04x' % ul.find_char(letter).code)
print ('Bloc > '+ul.find_char(letter).block)
print ('Lowercase > %s' % chr(ul.find_char(letter).lowercase))
(I left out HTML; these names are not defined in the Unicode standard.)
The unicodedata documentation shows how to do most of this.
The Unicode block name is apparently not available but another Stack Overflow question has a solution of sorts and another has some additional approaches using regex.
The uppercase/lowercase mapping and character number information is not particularly Unicode-specific; just use the regular Python string functions.
So in summary
>>> import unicodedata
>>> unicodedata.name('Ë')
'LATIN CAPITAL LETTER E WITH DIAERESIS'
>>> 'U+%04X' % ord('Ë')
'U+00CB'
>>> '&#%i;' % ord('Ë')
'Ë'
>>> 'Ë'.lower()
'ë'
The U+%04X formatting is sort-of correct, in that it simply avoids padding and prints the whole hex number for code points with a value higher than 65,535. Note that some other formats require the use of %08X padding in this scenario (notably \U00010000 format in Python).
You can do this in some ways :
1- create an API yourself ( I can't find anything that do this )
2- create table in database or excel file
3- load and parse a website to do that
I think the 3rd way is very easy. take a look as This Page. you can find some information there Unicodes.
Get your Unicode number and then, find it in web page using parse tools like LXML , Scrapy , Selenium , etc

Python Find Method in Strings Seems To Be Failing

I have a text file like so: http://pastie.org/10309944
This contains numbers corresponding to lists of EDI segments that could possibly be associated with them. My goal is to write a script that takes in one of these codes (the numbers) as input and outputs the corresponding lists. The lists are surrounded by "-" characters to make the parsing easier.
I wrote the following code: `class SegmentsUsedFinder(object):
'''Finds a transaction code and returns the possible segments used.
'''
def __init__(self, transaction_code):
'''Initializes the segment finder.
Args:
transaction_code: The transaction code to find possible segments from.
'''
self._transaction_code = transaction_code + " -"
def find_segment(self):
'''Finds the possible segments that correspond to the
transaction code.
'''
fileObject = open("transactioncodes.txt", 'r')
data = ""
for line in fileObject:
line = line.rstrip('\n').rstrip()
data += line
fileObject.close()
position = data.find(self._transaction_code) + len(self._transaction_code)
with open("transactioncodes.txt", 'r') as file:
file.seek(position)
segments = ""
char = ""
while True:
char = file.read(1)
if char == "-":
break
segments += char
return segments
I then create a finder object like so:
finder = SegmentsUsedFinder("270")
print finder.find_segment()
This code actually works but when I adjust the string inside the SegmentsUsedFinder constructor to 271 or 837 it fails for some reason. I think I'm perhaps misusing the find method, but it works for the first instance. I can also get it to work for 271 if I add 2 to position and to work for 837 if I add 4 to position.
Any help would be greatly appreciated, thanks.
Here's how your find_segment method should look like:
def find_segment(self):
'''Finds the possible segments that correspond to the
transaction code.
'''
with open("transactioncodes.txt", 'r') as _file:
for line in _file:
if line.startswith(self._transaction_code):
return line[len(self._transaction_code):line.rfind("-")]
return ""
Of course it can be improved (the file name to be a private member of the class), but this is a prototype that works (assuming that all the lines respect the pattern: ID -LIST-).
Note: I also renamed the variable name to _file because it was shadowing the builtin file type.

Generating and applying diffs in python

Is there an 'out-of-the-box' way in python to generate a list of differences between two texts, and then applying this diff to one file to obtain the other, later?
I want to keep the revision history of a text, but I don't want to save the entire text for each revision if there is just a single edited line. I looked at difflib, but I couldn't see how to generate a list of just the edited lines that can still be used to modify one text to obtain the other.
Did you have a look at diff-match-patch from google? Apparantly google Docs uses this set of algoritms. It includes not only a diff module, but also a patch module, so you can generate the newest file from older files and diffs.
A python version is included.
http://code.google.com/p/google-diff-match-patch/
Does difflib.unified_diff do want you want? There is an example here.
The original link is broken. There is an example here
I've implemented a pure python function to apply diff patches to recover either of the input strings, I hope someone finds it useful. It uses parses the Unified diff format.
import re
_hdr_pat = re.compile("^## -(\d+),?(\d+)? \+(\d+),?(\d+)? ##$")
def apply_patch(s,patch,revert=False):
"""
Apply unified diff patch to string s to recover newer string.
If revert is True, treat s as the newer string, recover older string.
"""
s = s.splitlines(True)
p = patch.splitlines(True)
t = ''
i = sl = 0
(midx,sign) = (1,'+') if not revert else (3,'-')
while i < len(p) and p[i].startswith(("---","+++")): i += 1 # skip header lines
while i < len(p):
m = _hdr_pat.match(p[i])
if not m: raise Exception("Cannot process diff")
i += 1
l = int(m.group(midx))-1 + (m.group(midx+1) == '0')
t += ''.join(s[sl:l])
sl = l
while i < len(p) and p[i][0] != '#':
if i+1 < len(p) and p[i+1][0] == '\\': line = p[i][:-1]; i += 2
else: line = p[i]; i += 1
if len(line) > 0:
if line[0] == sign or line[0] == ' ': t += line[1:]
sl += (line[0] != sign)
t += ''.join(s[sl:])
return t
If there are header lines ("--- ...\n","+++ ...\n") it skips over them. If we have a unified diff string diffstr representing the diff between oldstr and newstr:
# recreate `newstr` from `oldstr`+patch
newstr = apply_patch(oldstr, diffstr)
# recreate `oldstr` from `newstr`+patch
oldstr = apply_patch(newstr, diffstr, True)
In Python you can generate a unified diff of two strings using difflib (part of the standard library):
import difflib
_no_eol = "\ No newline at end of file"
def make_patch(a,b):
"""
Get unified string diff between two strings. Trims top two lines.
Returns empty string if strings are identical.
"""
diffs = difflib.unified_diff(a.splitlines(True),b.splitlines(True),n=0)
try: _,_ = next(diffs),next(diffs)
except StopIteration: pass
return ''.join([d if d[-1] == '\n' else d+'\n'+_no_eol+'\n' for d in diffs])
On unix: diff -U0 a.txt b.txt
Code is on GitHub here along with tests using ASCII and random unicode characters: https://gist.github.com/noporpoise/16e731849eb1231e86d78f9dfeca3abc
AFAIK most diff algorithms use a simple Longest Common Subsequence match, to find the common part between two texts and whatever is left is considered the difference. It shouldn't be too difficult to code up your own dynamic programming algorithm to accomplish that in python, the wikipedia page above provides the algorithm too.
Does it have to be a python solution?
My first thoughts as to a solution would be to use either a Version Control System (Subversion, Git, etc.) or the diff / patch utilities that are standard with a unix system, or are part of cygwin for a windows based system.
Probably you can use unified_diff to generate the list of difference in a file. Only the changed texts in your file can be written it into a new text file where you can use it for your future reference.
This is the code which helps you to write only the difference to your new file.
I hope this is what you are asking for !
diff = difflib.unified_diff(old_file, new_file, lineterm='')
lines = list(diff)[2:]
# linesT = list(diff)[0:3]
print (lines[0])
added = [lineA for lineA in lines if lineA[0] == '+']
with open("output.txt", "w") as fh1:
for line in added:
fh1.write(line)
print '+',added
removed = [lineB for lineB in lines if lineB[0] == '-']
with open("output.txt", "a") as fh1:
for line in removed:
fh1.write(line)
print '-',removed
Use this in your code to save only the difference output !

How to refactor this python code block to be more efficient

This code block works - it loops through a file that has a repeating number of sets of data
and extracts out each of the 5 pieces of information for each set.
But I I know that the current factoring is not as efficient as it can be since it is looping
through each key for each line found.
Wondering if some python gurus can offer better way to do this more efficiently.
def parse_params(num_of_params,lines):
for line in lines:
for p in range(1,num_of_params + 1,1):
nam = "model.paramName "+str(p)+" "
par = "model.paramValue "+str(p)+" "
opt = "model.optimizeParam "+str(p)+" "
low = "model.paramLowerBound "+str(p)+" "
upp = "model.paramUpperBound "+str(p)+" "
keys = [nam,par,opt,low,upp]
for key in keys:
if key in line:
a,val = line.split(key)
if key == nam: names.append(val.rstrip())
if key == par: params.append(val.rstrip())
if key == opt: optimize.append(val.rstrip())
if key == upp: upper.append(val.rstrip())
if key == low: lower.append(val.rstrip())
print "Names = ",names
print "Params = ",params
print "Optimize = ",optimize
print "Upper = ",upper
print "Lower = ",lower
Though this doesn't answer your question (other answers are getting at that) something that has helped me a lot in doing things similar to what you're doing are List Comprehensions. They allow you to build lists in a concise and (I think) easy to read way.
For instance, the below code builds a 2-dimenstional array with the values you're trying to get at. some_funct here would be a little regex, if I were doing it, that uses the index of the last space in the key as the parameter, and looks ahead to collect the value you're trying to get in the line (the value which corresponds to the key currently being looked at) and appends it to the correct index in the seen_keys 2D array.
Wordy, yes, but if you get list-comprehension and you're able to construct the regex to do that, you've got a nice, concise solution.
keys = ["model.paramName ","model.paramValue ","model.optimizeParam ""model.paramLowerBound ","model.paramUpperBound "]
for line in lines:
seen_keys = [[],[],[],[],[]]
[seen_keys[keys.index(k)].some_funct(line.index(k) for k in keys if k in line]
It's not totally easy to see the expected format. From what I can see, the format is like:
lines = [
"model.paramName 1 foo",
"model.paramValue 2 bar",
"model.optimizeParam 3 bat",
"model.paramLowerBound 4 zip",
"model.paramUpperBound 5 ech",
"model.paramName 1 foo2",
"model.paramValue 2 bar2",
"model.optimizeParam 3 bat2",
"model.paramLowerBound 4 zip2",
"model.paramUpperBound 5 ech2",
]
I don't see the above code working if there is more than one value in each line. Which means the digit is not really significant unless I'm missing something. In that case this works very easily:
import re
def parse_params(num_of_params,lines):
key_to_collection = {
"model.paramName":names,
"model.paramValue":params,
"model.optimizeParam":optimize,
"model.paramLowerBound":upper,
"model.paramUpperBound":lower,
}
reg = re.compile(r'(.+?) (\d) (.+)')
for line in lines:
m = reg.match(line)
key, digit, value = m.group(1, 2, 3)
key_to_collection[key].append(value)
It's not entirely obvious from your code, but it looks like each line can have one "hit" at most; if that's indeed the case, then something like:
import re
def parse_params(num_of_params, lines):
sn = 'Names Params Optimize Upper Lower'.split()
ks = '''paramName paramValue optimizeParam
paramLowerBound paramUpperBound'''.split()
vals = dict((k, []) for k in ks)
are = re.compile(r'model\.(%s) (\d+) (.*)' % '|'.join(ks))
for line in lines:
mo = are.search(line)
if not mo: continue
p = int(mo.group(2))
if p < 1 or p > num_of_params: continue
vals[mo.group(1)].append(mo.group(3).rstrip())
for k, s in zip(ks, sn):
print '%-8s =' % s,
print vals[k]
might work -- I exercised it with a little code as follows:
if __name__ == '__main__':
lines = '''model.paramUpperBound 1 ZAP
model.paramLowerBound 1 zap
model.paramUpperBound 5 nope'''.splitlines()
parse_params(2, lines)
and it emits
Names = []
Params = []
Optimize = []
Upper = ['zap']
Lower = ['ZAP']
which I think is what you want (if some details must differ, please indicate exactly what they are and let's see if we can fix it).
The two key ideas are: use a dict instead of lots of ifs; use a re to match "any of the following possibilities" with parenthesized groups in the re's pattern to catch the bits of interest (the keyword after model., the integer number after that, and the "value" which is the rest of the line) instead of lots of if x in y checks and string manipulation.
There is a lot of duplication there, and if you ever add another key or param, you're going to have to add it in many places, which leaves you ripe for errors. What you want to do is pare down all of the places you have repeated things and use some sort of data model, such as a dict.
Some others have provided some excellent examples, so I'll just leave my answer here to give you something to think about.
Are you sure that parse_params is the bottle-neck? Have you profiled your app?
import re
from collections import defaultdict
names = ("paramName paramValue optimizeParam "
"paramLowerBound paramUpperBound".split())
stmt_regex = re.compile(r'model\.(%s)\s+(\d+)\s+(.*)' % '|'.join(names))
def parse_params(num_of_params, lines):
stmts = defaultdict(list)
for m in (stmt_regex.match(s) for s in lines):
if m and 1 <= int(m.group(2)) <= num_of_params:
stmts[m.group(1)].append(m.group(3).rstrip())
for k, v in stmts.iteritems():
print "%s = %s" % (k, ' '.join(v))
The code given in the OP does multiple tests per line to try to match against the expected set of values, each of which is being constructed on the fly. Rather than construct paramValue1, paramValue2, etc. for each line, we can use a regular expression to try to do the matching in a cheaper (and more robust) manner.
Here's my code snippet, drawing from some ideas that have already been posted. This lets you add a new keyword to the key_to_collection dictionary and not have to change anything else.
import re
def parse_params(num_of_params, lines):
pattern = re.compile(r"""
model\.
(.+) # keyword
(\d+) # index to keyword
[ ]+ # whitespace
(.+) # value
""", re.VERBOSE)
key_to_collection = {
"paramName": names,
"paramValue": params,
"optimizeParam": optimize,
"paramLowerBound": upper,
"paramUpperBound": lower,
}
for line in lines:
match = pattern.match(line)
if not match:
print "Invalid line: " + line
elif match[1] not in key_to_collection:
print "Invalid key: " + line
# Not sure if you really care about enforcing this
elif match[2] > num_of_params:
print "Invalid param: " + line
else:
key_to_collection[match[1]].append(match[3])
Full disclosure: I have not compiled/tested this.
It can certainly be made more efficient. But, to be honest, unless this function is called hundreds of times a second, or works on thousands of lines, is it necessary?
I would be more concerned about making it clear what is happening... currently, I'm far from clear on that aspect.
Just eyeballing it, the input seems to look like this:
model.paramName 1 A model.paramValue 1 B model.optimizeParam 1 C model.paramLowerBound 1 D model.paramUpperBound 1 E model.paramName 2 F model.paramValue 2 G model.optimizeParam 2 H model.paramLowerBound 2 I model.paramUpperBound 2 J
And your desired output seems to be something like:
Names = AF
Params = BG
etc...
Now, since my input certainly doesn't match yours, the output is likely off too, but I think I have the gist.
There are a few points. First, does it matter how many parameters are passed to the function? For example, if the input has two sets of parameters, do I just want to read both, or is it necessary to allow the function to only read one? For example, your code allows me to call parse_params(1,1) and have it only read parameters ending in a 1 from the same input. If that's not actually a requirement, you can skip a large chunk of the code.
Second, is it important to ONLY read the given parameters? If I, for example, have a parameter called 'paramFoo', is it bad if I read it? You can also simplify the procedure by just grabbing all parameters regardless of their name, and extracting their value.
def parse_params(input):
parameter_list = {}
param = re.compile(r"model\.([^ ]+) [0-9]+ ([^ ]+)")
each_parameter = param.finditer(input)
for match in each_parameter:
key = match[0]
value = match[1]
if not key in paramter_list:
parameter_list[key] = []
parameter_list[key].append(value)
return parameter_list
The output, in this instance, will be something like this:
{'paramName':[A, F], 'paramValue':[B, G], 'optimizeParam':[C, H], etc...}
Notes: I don't know Python well, I'm a Ruby guy, so my syntax may be off. Apologies.

Categories

Resources