I need to extract text emoticons from a text using Python and I've been looking for some solutions to do this but most of them like this or this only cover simple emoticons. I need to parse all of them.
Currently I'm using a list of emoticons that I iterate for every text that I have process but this is so inefficient. Do you know a better solution? Maybe a Python library that can handle this problem?
One of most efficient solution is to use Aho–Corasick string matching algorithm and is nontrivial algorithm designed for this kind of problem. (search of multiple predefined strings in unknown text)
There is package available for this.
https://pypi.python.org/pypi/ahocorasick/0.9
https://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/
Edit:
There are also more recent packages available (haven tried any of them)
https://pypi.python.org/pypi/pyahocorasick/1.0.0
Extra:
I have made some performance test with pyahocorasick and it is faster than python re when searching for more than 1 word in dict (2 or more).
Here it is code:
import re, ahocorasick,random,time
# search N words from dict
N=3
#file from http://norvig.com/big.txt
with open("big.txt","r") as f:
text = f.read()
words = set(re.findall('[a-z]+', text.lower()))
search_words = random.sample([w for w in words],N)
A = ahocorasick.Automaton()
for i,w in enumerate(search_words):
A.add_word(w, (i, w))
A.make_automaton()
#test time for ahocorasic
start = time.time()
print("ah matches",sum(1 for i in A.iter(text)))
print("aho done in ", time.time() - start)
exp = re.compile('|'.join(search_words))
#test time for re
start = time.time()
m = exp.findall(text)
print("re matches",sum(1 for _ in m))
print("re done in ",time.time()-start)
Related
I have two .txt files, one that contains 200.000 words and the second contains 100 keywords( one each line). I want to calculate the cosine similarity between each of the 100 keywords and each word of my 200.000 words , and display for every keyword the 50 words with the highest score.
Here's what I did, note that Bertclient is what i'm using to extract vectors :
from sklearn.metrics.pairwise import cosine_similarity
from bert_serving.client import BertClient
bc = BertClient()
# Process words
with open("./words.txt", "r", encoding='utf8') as textfile:
words = textfile.read().split()
with open("./100_keywords.txt", "r", encoding='utf8') as keyword_file:
for keyword in keyword_file:
vector_key = bc.encode([keyword])
for w in words:
vector_word = bc.encode([w])
cosine_lib = cosine_similarity(vector_key,vector_word)
print (cosine_lib)
This keeps running but it doesn't stop. Any idea how I can correct this ?
I know nothing of Bert...but there's something fishy with the import and run. I don't think you have it installed correctly or something. I tried to pip install it and just run this:
from sklearn.metrics.pairwise import cosine_similarity
from bert_serving.client import BertClient
bc = BertClient()
print ('done importing')
and it never finished. Take a look at the dox for bert and see if something else needs to be done.
On your code, it is generally better do do ALL of the reading first, then the processing, so import both lists first, separately, check a few values with something like:
# check first five
print(words[:5])
Also, you need to look at a different way to do your comparisons instead of the nested loops. You realize now that you are converting each word in words EVERY TIME for each keyword, which is not necessary and probably really slow. I would recommend you either use a dictionary to pair the word with the encoding or make a list of tuples with the (word, encoding) in it if you are more comfortable with that.
Comment me back if that doesn't makes sense after you get Bert up and running.
--Edit--
Here is a chunk of code that works similar to what you want to do. There are a lot of options for how you can hold results, etc. depending on your needs, but this should get you started with "fake bert"
from operator import itemgetter
# fake bert ... just return something like length
def bert(word):
return len(word)
# a fake compare function that will compare "bert" conversions
def bert_compare(x, y):
return abs(x-y)
# Process words
with open("./word_data_file.txt", "r", encoding='utf8') as textfile:
words = textfile.read().split()
# Process keywords
with open("./keywords.txt", "r", encoding='utf8') as keyword_file:
keywords = keyword_file.read().split()
# encode the words and put result in dictionary
encoded_words = {}
for word in words:
encoded_words[word] = bert(word)
encoded_keywords = {}
for word in keywords:
encoded_keywords[word] = bert(word)
# let's use our bert conversions to find which keyword is most similar in
# length to the word
for word in encoded_words.keys():
result = [] # make a new result set for each pass
for kword in encoded_keywords.keys():
similarity = bert_compare(encoded_words.get(word), encoded_keywords.get(kword))
# stuff the answer into a tuple that can be sorted
result.append((word, kword, similarity))
result.sort(key=itemgetter(2))
print(f'the keyword with the closest size to {result[0][0]} is {result[0][1]}')
I am doing a small script in python, but since I am quite new I got stuck in one part:
I need to get timing and text from a .srt file. For example, from
1
00:00:01,000 --> 00:00:04,074
Subtitles downloaded from www.OpenSubtitles.org
I need to get:
00:00:01,000 --> 00:00:04,074
and
Subtitles downloaded from www.OpenSubtitles.org.
I have already managed to make the regex for timing, but i am stuck for the text. I've tried to use look behind where I use my regex for timing:
( ?<=(\d+):(\d+):(\d+)(?:\,)(\d+) --> (\d+):(\d+):(\d+)(?:\,)(\d+) )\w+
but with no effect. Personally, i think that using look behind is the right way to solve this, but i am not sure how to write it correctly. Can anyone help me? Thanks.
Honestly, I don't see any reason to throw regex at this problem. .srt files are highly structured. The structure goes like:
an integer starting at 1, monotonically increasing
start --> stop timing
one or more lines of subtitle content
a blank line
... and repeat. Note the bold part - you might have to capture 1, 2, or 20 lines of subtitle content after the time code.
So, just take advantage of the structure. In this way you can parse everything in just one pass, without needing to put more than one line into memory at a time and still keeping all the information for each subtitle together.
from itertools import groupby
# "chunk" our input file, delimited by blank lines
with open(filename) as f:
res = [list(g) for b,g in groupby(f, lambda x: bool(x.strip())) if b]
For example, using the example on the SRT doc page, I get:
res
Out[60]:
[['1\n',
'00:02:17,440 --> 00:02:20,375\n',
"Senator, we're making\n",
'our final approach into Coruscant.\n'],
['2\n', '00:02:20,476 --> 00:02:22,501\n', 'Very good, Lieutenant.\n']]
And I could further transform that into a list of meaningful objects:
from collections import namedtuple
Subtitle = namedtuple('Subtitle', 'number start end content')
subs = []
for sub in res:
if len(sub) >= 3: # not strictly necessary, but better safe than sorry
sub = [x.strip() for x in sub]
number, start_end, *content = sub # py3 syntax
start, end = start_end.split(' --> ')
subs.append(Subtitle(number, start, end, content))
subs
Out[65]:
[Subtitle(number='1', start='00:02:17,440', end='00:02:20,375', content=["Senator, we're making", 'our final approach into Coruscant.']),
Subtitle(number='2', start='00:02:20,476', end='00:02:22,501', content=['Very good, Lieutenant.'])]
Disagree with #roippi. Regex is a very nice solution to text matching. And the Regex for this solution is not tricky.
import re
f = file.open(yoursrtfile)
# Parse the file content
content = f.read()
# Find all result in content
# The first big (__) retrieve the timing, \s+ match all timing in between,
# The (.+) means retrieve any text content after that.
result = re.findall("(\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)\s+(.+)", content)
# Just print out the result list. I recommend you do some formatting here.
print result
number:^[0-9]+$
Time:
^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]$
string: *[a-zA-Z]+*
hope this help.
Thanks #roippi for this excellent parser.
It helped me a lot to write a srt to stl converter in less than 40 lines (in python2 though, as it has to fit in a larger project)
from __future__ import print_function, division
from itertools import groupby
from collections import namedtuple
# prepare - adapt to you needs or use sys.argv
inputname = 'FR.srt'
outputname = 'FR.stl'
stlheader = """
$FontName = Arial
$FontSize = 34
$HorzAlign = Center
$VertAlign = Bottom
"""
def converttime(sttime):
"convert from srt time format (0...999) to stl one (0...25)"
st = sttime.split(',')
return "%s:%02d"%(st[0], round(25*float(st[1]) /1000))
# load
with open(inputname,'r') as f:
res = [list(g) for b,g in groupby(f, lambda x: bool(x.strip())) if b]
# parse
Subtitle = namedtuple('Subtitle', 'number start end content')
subs = []
for sub in res:
if len(sub) >= 3: # not strictly necessary, but better safe than sorry
sub = [x.strip() for x in sub]
number, start_end, content = sub[0], sub[1], sub[2:] # py 2 syntax
start, end = start_end.split(' --> ')
subs.append(Subtitle(number, start, end, content))
# write
with open(outputname,'w') as F:
F.write(stlheader)
for sub in subs:
F.write("%s , %s , %s\n"%(converttime(sub.start), converttime(sub.end), "|".join(sub.content)) )
for time:
pattern = ("(\d{2}:\d{2}:\d{2},\d{3}?.*)")
None of the pure REGEx solution above worked for the real life srt files.
Let's take a look of the following SRT patterned text :
1
00:02:17,440 --> 00:02:20,375
Some multi lined text
This is a second line
2
00:02:20,476 --> 00:02:22,501
as well as a single line
3
00:03:20,476 --> 00:03:22,501
should be able to parse unicoded text too
こんにちは
Take a note that :
text may contain unicode characters.
Text can consist of several lines.
every cue started with an integer value and ended with a blank new line which both unix style and windows style CR/LF are accepted
Here is the working regex :
\d+[\r\n](\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+)[\r\n]((.+\r?\n)+(?=(\r?\n)?))
https://regex101.com/r/qICmEM/1
I have a text file that has 120000 lines, where every line has exactly this format:
ean_code;plu;name;price;state
I tried various operations, including working with file straight away, and best results were given if file was just loaded in memory line by line with readlines() and written to list (at the start of the program).
so i have these 2 lines:
matcher = re.compile('^(?:'+eanic.strip()+'(?:;|$)|[^;]*;'+eanic.strip()+'(?:;|$))').match
line=[next(l.split(';') for l in list if matcher(l))]
do sth with line....
What these lines are trying to accomplish is, they are trying to find (as fast as possible) a plu/ean, which was given by user input in fields: ean_code or plu.
I am particulary interested in second line, as it impacts my performance on WinCE device (PyCE port of python 2.5).
I tried every possible solution there is to make it faster, but this is fastest way to iterate through a certain list to find a match that re.compile is generating.
Any faster way other than for in list comprehension to iterate over big list (120000 lines in my case)?
I am looking for any kind of way possible with any kind of data structure (that is supported until Python 2.5) that will give me faster result than above two lines...
Just to mention, that this is performed on Handheld device (630MHz ARM), with 256MB RAM, and without any kind of connection besides USB is present. Sadly, database access and existance is not an option.
I made a test file and tested a few variations. The fastest way of searching for a static string (as you appear to be doing) by iterating over the file is by using string in line.
However, if you'll be using the loaded data to search more than once (actually more than 30 times according to the testnumbers below), it's worth your (computational) time to produce lookup tables for the PLUs and EANs in the form of dicts and use these for future searches.
loaded 120000 lines
question regex 0.114868402481
simpler regex 0.417045307159
other regex 0.386662817001
startswith 0.236350297928
string in 0.020356798172 <-- iteration winner
dict construction 0.611148500443
dict lookup 0.000002503395 <-- best if you are doing many lookups
Test code follows:
import re
import timeit
def timefunc(function, times, *args):
def wrap():
function(*args)
t = timeit.Timer(wrap)
return t.timeit(times) / times
def question(lines):
eanic = "D41RP9"
matcher = re.compile('^(?:'+eanic.strip()+'(?:;|$)|[^;]*;'+eanic.strip()+'(?:;|$))').match
line=[next(l.split(';') for l in lines if matcher(l))]
return line
def splitstart(lines):
eanic = "D41RP9"
ret = []
for l in lines:
s = l.split(';')
if s[0].startswith(eanic) or s[1].startswith(eanic):
ret.append(l)
return ret
def simpler(lines):
eanic = "D41RP9"
matcher = re.compile('(^|;)' + eanic)
return [l for l in lines if matcher.search(l)]
def better(lines):
eanic = "D41RP9"
matcher = re.compile('^(?:' + eanic + '|[^;]*;' + eanic + ')')
return [l for l in lines if matcher.match(l)]
def strin(lines):
eanic = "D41RP9"
return [l for l in lines if eanic in l]
def mkdicts(lines):
ean = {}
plu = {}
for l in lines:
s = l.split(';')
ean[s[0]] = s
plu[s[1]] = s
return (ean, plu)
def searchdicts(ean, plu):
eanic = "D41RP9"
return (ean.get(eanic, None), plu.get(eanic, None))
with open('test.txt', 'r') as f:
lines = f.readlines()
print "loaded", len(lines), "lines"
print "question regex\t", timefunc(question, 10, lines)
print "simpler regex\t", timefunc(simpler, 10, lines)
print "other regex\t", timefunc(simpler, 10, lines)
print "startswith\t", timefunc(splitstart, 10, lines)
print "string in\t", timefunc(strin, 10, lines)
print "dict construction\t", timefunc(mkdicts, 10, lines)
ean, plu = mkdicts(lines)
print "dict lookup\t", timefunc(searchdicts, 10, ean, plu)
At first I looked up some modules, that are available for python 2.5:
You can use csv-module to read your data. Could be faster.
You can store your data via pickle or cPickle module. So you can store python-objects (like dict, tuples, ints and so on). Comparing ints is faster than searching in strings.
You iterate through a list, but you say your data are in a text file. Do not load your whole text file in a list. Maybe the following is fast enough and there is no need for using the modules I mentioned above.
f = open('source.txt','r') # note: python 2.5, no with-statement yet
stripped_eanic = eanic.strip()
for line in f:
if stripped_eanic in line: # the IDs have a maximum length, don't they? So maybe just search in line[:20]
# run further tests, if you think it is necessary
if further_tests:
print line
break
else:
print "No match"
Edit
I thought about what I mentioned above: Do not load the whole file in a list. I think that is only true, if your search is a onetime procedure and your script exits there. But if you want to search several times I suggest using dict (like beerbajay suggested) and cPickle-files instead of text-file.
I'm currently trying to get used to Python and have recently hit block in my coding. I couldn't run a code that would count the number of times a phrase appears in an html file. I've recently received some help constructing the code for counting the frequency in a text file but am wondering there is a way to do this directly from the html file (to bypass the copy and paste alternative). Any advice will be sincerely appreciated. The previous coding I have used is the following:
#!/bin/env python 3.3.2
import collections
import re
# Defining a function named "findWords".
def findWords(filepath):
with open(filepath) as infile:
for line in infile:
words = re.findall('\w+', line.lower())
yield from words
phcnt = collections.Counter()
from itertools import tee
phrases = {'central bank', 'high inflation'}
fw1, fw2 = tee(findWords('02.2003.BenBernanke.txt'))
next(fw2)
for w1,w2 in zip(fw1, fw2):
phrase = ' '.join([w1, w2])
if phrase in phrases:
phcnt[phrase] += 1
print(phcnt)
You can use some_str.count(some_phrase) function
In [19]: txt = 'Text mining, also referred to as text data mining, Text mining,\
also referred to as text data mining,'
In [20]: txt.lower().count('data mining')
Out[20]: 2
What about just stripping the html tags before doing the analysis? html2text does this job quite well.
import html2text
content = html2text.html2text(infile.read())
would give you the text content (somehow formatted, but this is no problem in your approach I think). There are options to ignore images and links additionally, which you would use like
h = html2text.HTML2Text()
h.ignore_images = True
h.ignore_links = True
content = h.handle(infile.read())
I am trying to increment a version number using regex but I can't seem to get the hang of regex at all. I'm having trouple with the symbols in the string I am trying to read and change. The code I have so far is:
version_file = "AssemblyInfo.cs"
read_file = open(version_file).readlines()
write_file = open(version_file, "w")
r = re.compile(r'(AssemblyFileVersion\s*(\s*"\s*)(\S+))\s*"\s*')
for l in read_file:
m1 = r.match(l)
if m1:
VERSION_ID=map(int,m1.group(2).split("."))
VERSION_ID[2]+=1 # increment version
l = r.sub(r'\g<1>' + '.'.join(['%s' % (v) for v in VERSION_ID]), l)
write_file.write(l)
write_file.close()
The string I am trying to read and change is:
[assembly: AssemblyFileVersion("1.0.0.0")]
What I would like written to the file is:
[assembly: AssemblyFileVersion("1.0.0.1")]
So basically I want to increment the build number by one.
Can anyone help me fix my regualr expression. I seem to have trouble getting to grips with regular expression that have to get around symbols.
Thanks for any help.
If you specify the version as "1.0.0.*" then AFAIK it gets updated on each build automagically, at least if you're using Visual Studio.NET.
I'm not sure regex is your best bet, but one way of doing it would be this:
import re
# Don't bother matching everything, just the bits that matter.
pat = re.compile(r'AssemblyFileVersion.*\.(\d+)"')
# ... lines omitted which set up read_file, write_file etc.
for line in read_file:
m = pat.search(line)
if m:
start, end = m.span(1)
line = line[:start] + str(int(line[start:end]) + 1) + line[end:]
write_file.write(line)
Good luck with regex.
If I had to do the same, I'd convert the string to int by removing the dots, add one and convert back to string.
Well, I'd have also used a integer version number in the first place.