split text by periods except in certain cases [duplicate] - python

This question already has answers here:
How can I split a text into sentences?
I am currently trying to split a string containing an entire text document into sentences so that I can convert it to a csv. Naturally, I would use periods as the delimiter and perform str.split('.'); however, the document contains the abbreviations 'i.e.' and 'e.g.', whose periods I want to ignore.
For example,
Original Sentence: During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing. ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors.
Resulting List: ["During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing", "ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors."]
My only workaround so far is replacing all 'i.e.' and 'e.g.' with 'ie' and 'eg', which is both inefficient and grammatically undesirable. I am fiddling with Python's regex library, which I suspect holds the answer I desire, but my knowledge of it is novice at best.
It is my first time posting a question on here so I apologize if I am using incorrect format or wording.

See How can I split a text into sentences? which suggests the natural language toolkit.
A deeper explanation of why it is done this way, by way of an example:
I go by the name of I. Brown. I bet I could make a sentence difficult to parse. No one is more suited to this task than I.
How do you break this into different sentences?
You need semantics (a formal sentence is usually made up of a subject, an object, and a verb) which a regular expression won't capture. RegEx does syntax very well, but not semantics (meaning).
To prove this: the 115-vote answer someone else suggested, which involves a lot of complex regex and is fairly slow, would break on my simple sentence.
It's an NLP problem, so I linked to an answer that gave an NLP package.
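For reference, a minimal sketch of that approach using NLTK's sent_tokenize (the punkt model download is a one-time step); the Punkt tokenizer is trained on abbreviations, so 'i.e.' should not trigger a split:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # one-time download of the sentence-boundary model

text = ("During this time, it became apparent that vanilla shortest-path routing "
        "would be insufficient. ISPs began to modify routing configurations to "
        "support routing policies, i.e. goals held by the router's owner.")
print(sent_tokenize(text))  # two sentences; the periods in 'i.e.' should be left alone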

This is a crude implementation.
inp = input()
res = []
last = 0
for x in range(len(inp)):
    if x > 1:
        # split at a period only if the characters two positions before and
        # two positions after are not also periods (this skips 'i.e.' and 'e.g.')
        if inp[x] == "." and inp[x-2] != ".":
            if x < len(inp) - 2:
                if inp[x+2] != ".":
                    res.append(inp[last:x])
                    last = x + 2  # skip the period and the following space
res.append(inp[last:-1])  # remaining text, without the trailing period
print(res)
If I use your input, I get this output (hopefully, this is what you are looking for):
['During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing', 'ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors']
Note: You might have to adjust this code if your text does not follow the usual conventions (e.g., missing spaces after the period that ends a sentence).

This one should work!
import re

p = "During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing. ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors."
sentences = []
while len(p) > 0:
    string = ""
    while True:
        # grab a run of capitals plus everything up to the next capital letter
        match = re.search("[A-Z]+[^A-Z]+", p)
        if match is None:
            break
        p = p[len(match.group(0)):]
        string += match.group(0)
        if match.group(0).endswith(". "):  # a real sentence boundary
            break
    if not string:  # nothing matched; stop rather than loop forever
        break
    sentences.append(string)
print(sentences)

Related

Can I use ml/nlp to determine the pattern by which usernames are generated?

I have a dataset which has first and last names along with their respective email ids. Some of the email ids follow a certain pattern such as:
Fn1 = John , Ln1 = Jacobs, eid1= jj#xyz.com
Fn2 = Emily , Ln2 = Queens, eid2= eq#pqr.com
Fn3 = Harry , Ln3 = Smith, eid3= hsm#abc.com
The content after # has no importance for finding the pattern. I want to find out how many people follow a certain pattern and what is that pattern. Is it possible to do so using nlp and python?
EXTRA: To know what kind of pattern applies to some number of people, could we store examples of that pattern along with its count in an Excel sheet?
You certainly could - e.g., you could try to learn a relationship between your input and output data as
(Fn, Ln) --> eid
and further dissect this relationship into patterns.
However, before hitting the problem with complex tools (especially if you are new to ml/nlp), I'd do further analysis of the data first.
For example, I'd first be curious to see what portion of your data displays the clear patterns you've shown in the examples - using the first character(s) of the individual's first/last name to build the corresponding eid (which could easily be determined programmatically).
Setting aside that portion of the data that satisfies this clear pattern - what does the remainder look like?
Is there another clear, but different, pattern in some of this data?
If there is - I'd then perform the same exercise again - construct a proper filter to collect and set aside data satisfying that pattern - and examine the remainder.
Doing this analysis might rather quickly give you at least a partial answer to your inquiry:
To know what kind of pattern applies to some number of people, could we store examples of that pattern along with its count in an Excel sheet?
Moreover, it will help determine
a) whether you even need to use more complex tooling (if enough patterns can be easily sieved out this way, is it worth the investment to go heavier?), or
b) if not, which portion of the data to target with heavier tools (the remainder of this process - those not containing simple patterns).
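For illustration, a minimal sketch of that first sieving step, under the assumption that a row matches the obvious pattern when the part of the eid before '#' is a prefix of the first name followed by a prefix of the last name (the rows below are made up, based on the examples in the question):
def matches_initials(first, last, eid):
    """Return True if the eid's local part splits into a prefix of `first`
    followed by a prefix of `last` (case-insensitive)."""
    local = eid.split("#")[0].lower()
    first, last = first.lower(), last.lower()
    for cut in range(1, len(local)):
        if first.startswith(local[:cut]) and last.startswith(local[cut:]):
            return True
    return False

rows = [("John", "Jacobs", "jj#xyz.com"),
        ("Emily", "Queens", "eq#pqr.com"),
        ("Harry", "Smith", "hsm#abc.com"),
        ("Jane", "Doe", "xk42#abc.com")]  # made-up non-matching row

pattern_rows = [r for r in rows if matches_initials(*r)]
remainder = [r for r in rows if not matches_initials(*r)]
print(len(pattern_rows), "rows match the initials pattern;", len(remainder), "left to inspect")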

How to detect the amount of almost-repetition in a text file?

I am a programming teacher, and I would like to write a script that detects the amount of repetition in a C/C++/Python file. I guess I can treat any file as pure text.
The script's output would be the number of similar sequences that repeat. Ultimately, I am only interested in a DRY metric (how well the code satisfies the DRY principle).
Naively, I tried a simple autocorrelation, but it would be hard to find the proper threshold.
import numpy as np
import matplotlib.pyplot as plt

u = open("find.c").read()
v = [ord(x) for x in u]                 # the file as a sequence of character codes
y = np.correlate(v, v, mode="same")     # autocorrelation of the text with itself
y = y[: int(len(y) / 2)]
x = range(len(y))
z = np.polyval(np.polyfit(x, y, 3), x)  # cubic trend, used to detrend the correlation
f = (y - z)[:-5]
plt.plot(f)
plt.show()
So I am looking at different strategies... I also tried to compare the similarities between each line, each group of 2 lines, each group of 3 lines ...
import difflib
import numpy as np
lines = open("b.txt").readlines()
lines = [line.strip() for line in lines]
n = 3
d = []
for i in range(len(lines)):
    a = lines[i:i+n]
    for j in range(len(lines)):
        b = lines[j:j+n]
        if i == j: continue # skip same line
        group_size = np.sum([len(x) for x in a])
        if group_size < 5: continue # skip short lines
        ratio = 0
        for u, v in zip(a, b):
            r = difflib.SequenceMatcher(None, u, v).ratio()
            ratio += r if r > 0.7 else 0
        d.append(ratio)
dry = sum(d) / len(lines)
In the following, we can identify some repetition at a glance:
w = int(len(d) / 100)
e = np.convolve(d, np.ones(w), "valid") / w * 10
plt.plot(range(len(d)), d, range(len(e)), e)
plt.show()
Why not use:
d = np.exp(np.array(d))
Thus, the difflib module looks promising; SequenceMatcher does some magic (Levenshtein?), but I would need some magic constants as well (0.7)... However, this code is worse than O(n^2) and runs very slowly on long files.
What is funny is that the amount of repetition is quite easily identified with attentive eyes (sorry to this student for having taken his code as a good bad example).
I am sure there is a more clever solution out there.
Any hint?
I would build a system based on compressibility, because that is essentially what things being repeated means. Modern compression algorithms are already looking for how to reduce repetition, so let's piggy back on that work.
Things that are similar will compress well under any reasonable compression algorithm, e.g. LZ. Under the hood, a compressed stream is essentially the text plus references back to earlier parts of itself, which you might be able to pull out.
Write a program that feeds lines [0:n] into the compression algorithm and compares the compressed length with that of [0:n+1].
When the incremental length of the compressed output grows by much less than the incremental input, note that you potentially have a DRY candidate at that location; and if you can figure out the format, you can see which earlier text it was deemed similar to.
If you can figure out the compression format, you don't need to rely on the "size doesn't grow as much" heuristic, you can just pull out the references directly.
If needed, you can find similar structures with different names by pre-processing the input, for instance by normalizing the names. However I foresee this getting a bit messy, so it's a v2 feature. Pre-processing can also be used to normalize the formatting.
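A rough sketch of that incremental-compressibility idea using zlib (the file name and the 0.3 threshold are my own assumptions, not part of the suggestion above):
import zlib

lines = open("find.c").readlines()
prev_size = len(zlib.compress(b""))
for n, line in enumerate(lines, start=1):
    blob = "".join(lines[:n]).encode()
    size = len(zlib.compress(blob, 9))
    growth = size - prev_size   # bytes the compressed output grew by
    raw = len(line.encode())    # bytes of new input that were fed in
    # if the output grows far less than the input, the new line is largely
    # predictable from earlier text: a potential DRY candidate
    if raw > 20 and growth < 0.3 * raw:
        print(f"line {n}: possible repetition ({growth} bytes added for {raw} new bytes)")
    prev_size = size
Recompressing the whole prefix on every iteration is quadratic overall; reusing a zlib compressobj (via its copy() method) would make this cheaper, but the simple version shows the idea.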
Looks like you're choosing a long path. I wouldn't go there.
I would look into trying to minify the code before analyzing it, to completely remove any influence of variable names, extra spacing, formatting and even slight logic reshuffling.
Another approach would be comparing the byte-code of the students. But it may not be a very good idea, since the result will likely have to be additionally cleaned up.
The dis module would be an interesting option.
I would, most likely, stop at comparing their ASTs. But the AST is likely to give false positives for short functions, because their structure may be too similar; consider checking short functions with something else, something trivial.
On top of that, I would consider using Levenshtein distance or something similar to numerically calculate the differences between the byte-codes/sources/ASTs/dis output of the students. That would be what, almost O(N^2)? It shouldn't matter.
Or, if needed, make it more complex and calculate the distance between each function of student A and each function of student B, highlighting cases when the distance is too short. It may be not needed though.
With simplification and normalization of the input, more algorithms should start returning good results. If a student is good enough to take someone's code and reshuffle not only the variables but the logic, and maybe even improve the algorithm, then this student understands the code well enough to defend it and use it with no help in the future. I guess that's the kind of help a teacher would want to be exchanged between students.
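As an illustration of the AST route (my own sketch, not the answer's code): dump the ASTs of two functions and measure how similar the dumps are. Identifier strings still appear in the dump, so for full name-independence you would additionally rename them, e.g. with an ast.NodeTransformer:
import ast
import difflib

src_a = "def mean(xs):\n    total = 0\n    for x in xs:\n        total += x\n    return total / len(xs)\n"
src_b = "def avg(values):\n    s = 0\n    for v in values:\n        s += v\n    return s / len(values)\n"

def structure(src):
    # dump the parse tree; field labels are omitted to keep the string compact
    return ast.dump(ast.parse(src), annotate_fields=False)

ratio = difflib.SequenceMatcher(None, structure(src_a), structure(src_b)).ratio()
print(f"structural similarity: {ratio:.2f}")  # should stay high despite the renamed variables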
You can treat this as a variant of the longest common subsequence problem between the input and itself, where the trivial matching of each element with itself is disallowed. This retains the optimal substructure of the standard algorithm, since it can be phrased as a non-transitive “equality” and the algorithm never relies on transitivity.
As such, we can write this trivial implementation:
import operator

class Repeat:
    def __init__(self, l):
        self.l = list(l)
        self.memo = {}
    def __call__(self, m, n):
        l = self.l
        memo = self.memo
        k = m, n
        ret = memo.get(k)
        if not ret:
            if not m or not n: ret = 0, None
            elif m != n and l[m-1] == l[n-1]:  # critical change here!
                z, tail = self(m-1, n-1)
                ret = z+1, ((m-1, n-1), tail)
            else: ret = max(self(m-1, n), self(m, n-1), key=operator.itemgetter(0))
            memo[k] = ret
        return ret
    def go(self):
        n = len(self.l)
        v = self(n, n)[1]
        ret = []
        while v:
            x, v = v
            ret.append(x)
        ret.reverse()
        return ret

def repeat(l): return Repeat(l).go()
You might want to canonicalize lines of code by removing whitespace (except perhaps between letters), removing comments, and/or replacing each unique identifier with a standardized label. You might also want to omit trivial lines like } in C/C++ to reduce noise. Finally, the symmetry should allow only cases with, say, m>=n to be treated.
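A possible usage sketch along those lines (the file name, the canonicalization rules, and the recursion-limit bump are assumptions on my part):
import re
import sys

sys.setrecursionlimit(10000)                      # the memoized recursion can get deep
raw = open("student.cpp").read().splitlines()
canon = [re.sub(r"\s+", "", ln) for ln in raw]    # drop whitespace
canon = [ln for ln in canon if ln and ln != "}"]  # drop empty and trivial lines
pairs = repeat(canon)                             # list of (i, j) index pairs of repeated lines
print(f"{len(pairs)} repeated line pairs among {len(canon)} lines")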
Of course, there are also "real" answers and real research on this issue!
Frame challenge: I’m not sure you should do this
It’d be a fun programming challenge for yourself, but if you intend to use it as a teaching tool, I’m not sure I would. There’s no good definition of “repeat” from the DRY principle that would be easy to test for fully in a computer program. The human definition, which I’d say is basically “failure to properly abstract your code at an appropriate level, manifested via some type of repetition of code, whether repeating exact blocks, repeating the same idea over and over again, or something in between,” isn’t something I think anyone will be able to get working well enough at this time to use as a tool that teaches good habits with respect to DRY without confusing the student or teaching bad habits too. For now I’d argue this is a job for humans, because it’s easy for us and hard for computers, at least for now.
That said, if you want to give it a try: first define for yourself requirements for what errors you want to catch, what they’ll look like, and what good code looks like; then define acceptable false positive and false negative rates, and test your code on a wide variety of representative inputs, validating it against human judgement to see if it performs well enough for your intended use. But I’m guessing you’re really looking for more than simple repetition of tokens, and if you want to have a chance at succeeding I think you need to clearly define what you’re looking for, decide how you’ll measure success, and then validate your code. A teaching tool can do great harm if it doesn’t actually teach the correct lesson: for example, if your tool simply encourages students to obfuscate their code so it doesn’t get flagged as violating DRY, if it doesn’t flag bad code so the student assumes it’s ok, or if it flags code that is actually very well written.
More specifically, what types of repetition are ok and what aren’t? Is it good or bad to use “if” or “for” or other syntax repeatedly in code? Is it ok for variables and functions/methods to have names with common substrings (e.g. average_age, average_salary, etc.?). How many times is repetition ok before abstraction should happen, and when it does what kind of abstraction is needed and at what level (e.g. a simple method, or a functor, or a whole other class, or a whole other module?). Is more abstraction always better or is perfect sometimes the enemy of on time on budget? This is a really interesting problem, but it’s also a very hard problem, and honestly I think a research problem, which is the reason for my frame challenge.
Edit:
Or, if you definitely want to try this anyway, you can make it a teaching tool: not necessarily as you may have intended, but rather by showing your students your adherence to DRY in the code you write when creating the tool, and by introducing them to the nuances of DRY and the shortcomings of automated code quality assessment, being transparent with them about the limitations of your quality assessment tool. What I wouldn’t do is use it like some professors use plagiarism detection tools, as a digital oracle whose assessment of the quality of the students’ code is unquestioned. That approach is likely to cause more harm than good toward the students.
I suggest the following approach: let's say that repetitions should be at least 3 lines long. Then we hash every 3 lines. If a hash repeats, we write down the line number where it occurred. All that is left is to join adjacent duplicated line numbers together to get longer sequences.
For example, if you have duplicate blocks on lines 100-103 and 200-203, you will get {HASH1: (100, 200), HASH2: (101, 201)} (lines 100-102 and 200-202 produce the same value HASH1, and HASH2 covers lines 101-103 and 201-203). When you join the results, they produce the sequence (100, 101, 200, 201); finding the monotonic subsequences in it gives ((100, 101), (200, 201)).
As no pairwise comparison loops are used, the time complexity is linear (both the hashing pass and the dictionary insertions are O(n) over the whole file).
Algorithm:
read text line by line
transform it by
removing blanks
removing empty lines, saving mapping to original for the future
for each 3 transformed lines, join them and calculate hash on it
filter lines whose hashes occur more than once (these are repetitions of at least 3 lines)
find longest sequences and present repetitive text
Code:
from itertools import groupby, cycle
import re

def sequences(l):
    x2 = cycle(l)
    next(x2)
    grps = groupby(l, key=lambda j: j + 1 == next(x2))
    yield from (tuple(v) + (next((next(grps)[1])),) for k, v in grps if k)

with open('program.cpp') as fp:
    text = fp.readlines()

# remove white spaces
processed, text_map = [], {}
proc_ix = 0
for ix, line in enumerate(text):
    line = re.sub(r"\s+", "", line, flags=re.UNICODE)
    if line:
        processed.append(line)
        text_map[proc_ix] = ix
        proc_ix += 1

# calc hashes
hashes, hpos = [], {}
for ix in range(len(processed)-2):
    h = hash(''.join(processed[ix:ix+3]))  # join 3 lines
    hashes.append(h)
    hpos.setdefault(h, []).append(ix)  # this list will reflect lines that are duplicated

# filter duplicated three liners
seqs = []
for k, v in hpos.items():
    if len(v) > 1:
        seqs.extend(v)
seqs = sorted(list(set(seqs)))

# find longer sequences
result = {}
for seq in sequences(seqs):
    result.setdefault(hashes[seq[0]], []).append((text_map[seq[0]], text_map[seq[-1]+3]))

print('Duplicates found:')
for v in result.values():
    print('-'*20)
    vbeg, vend = v[0]
    print(''.join(text[vbeg:vend]))
    print(f'Found {len(v)} duplicates, lines')
    for line_numbers in v:
        print(f'{1+line_numbers[0]} : {line_numbers[1]}')

Password Cracking

I've asked this question once before on here and didn't get the answer I was looking for, but I did get another method of doing it, so here I go again.
(The problem with the previously found answer was that it was extremely efficient, a bit too efficient. I couldn't count comparisons, and I wanted the most bare-bones way of finding the password, so that I can implement a rating system on top of it.)
I am looking to make a password rating program that rates a password based on the length of time and the number of comparisons the program had to make in order to find the correct password.
I'm looking to use multiple different methods to crack the input, e.g. comparing the input to a database of common words, and generating all possibilities for a password starting with the first character.
Problem :
I can't find a way to start with one element in a list; we will call it A.
A runs through chr(32) - chr(127) ('space' - '~'), and then a second element, B, is added to the list.
For the second loop, A is set to chr(32) and B runs through chr(32) - chr(127). A then turns into chr(33) and B runs through all characters again, and so forth, until all possible options have been compared. For the next loop, another element is added onto the list and the search continues, starting with chr(32), chr(32), chr(32-127). This pattern continues until it finds the correct password.
This is the closest I could get to something that would work (I know it's terrible).
# assumes usrPassword is the target password as a list of characters
# and passCheck starts as an empty list
while ''.join(passCheck) != ''.join(usrPassword):
    for i in range(0, len(usrPassword)):
        if ''.join(passCheck) != ''.join(usrPassword):
            passCheck.append(' ')
            for j in range(32, 127):
                if ''.join(passCheck) != ''.join(usrPassword):
                    passCheck[i] = chr(j)
                    for k in range(0, len(passCheck)):
                        for l in range(32, 127):
                            if ''.join(passCheck) != ''.join(usrPassword):
                                passCheck[k] = chr(l)
                                print(passCheck)
The answer to this question is that you probably want to ask a different one, unfortunately.
Checking password strength is not as simple as calculating all possible combinations, Shannon entropy, etc.
What matters is what's called "guessing entropy", which is a devilishly complex mix of raw bruteforce entropy, known password lists and password leak lists, rules applied to those lists, Levenshtein distance from the elements in those lists, human-generated strings and phrases, keyboard walks, etc.
Further, password strength is deeply rooted in the question "How was this password generated?" ... which is very hard indeed to automatically detect and calculate after the fact. As humans, we can approximate this in many cases by reverse-engineering the psychology of how people select passwords ... but this is still an art, not a science.
For example, 'itsmypartyandillcryifiwantto' is a terrible password, even though its Shannon entropy is quite large. And 'WYuNLDcp0yhsZXvstXko' is a randomly-generated password ... but now that it's public, it's a bad one. And until you know how passwords like 'qscwdvefbrgn' or 'ji32k7au4a83' were generated, they look strong ... but they are definitely not.
So if you apply the strict answer to your question, you're likely to get a tool that dramatically overestimates the strength of many passwords. If your goal is to actually encourage your users to create passwords resistant to bruteforce, you should instead encourage them to use randomly generated passphrases, and ensure that your passwords are stored using a very slow hash (Argon2 family, scrypt, bcrypt, etc. - but be sure to bench these for UX and server performance before choosing one).
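For the storage side, a minimal sketch of a deliberately slow hash using Python's built-in scrypt (the cost parameters below are common starting points, not a recommendation; bench them against your own latency budget as noted above):
import hashlib
import hmac
import os

def hash_password(password):
    salt = os.urandom(16)
    digest = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return salt, digest

def verify_password(password, salt, digest):
    candidate = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return hmac.compare_digest(candidate, digest)  # constant-time comparison

salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))  # True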
References and further reading (disclaimer: some are my answers or have my comments):
https://security.stackexchange.com/a/4631/6203
https://security.stackexchange.com/questions/4630/how-can-we-accurately-measure-a-password-entropy-range
https://security.stackexchange.com/questions/127434/what-are-possible-methods-for-calculating-password-entropy
https://security.stackexchange.com/a/174555/6203
If you check candidates in alphabetical order, then you need only one loop to calculate how many attempts it would take to reach the password.
You have 96 chars (127 - 32 + 1).
password = 'ABC'
rate = 0
for char in password:
    rate = rate*96 + (ord(char)-31)
print(rate)

How to count [any name from a list of names] + [specific last name] in a block of text?

First time posting here. I'm hoping I can find a little help on something I'm trying to accomplish in terms of text analysis.
First, I'm doing this in Python and would like to remain in Python, as this function would be part of a larger, otherwise healthy tool I'm happy with. I have NLTK and Anaconda all set up as well, so drawing on those resources is also possible.
I’ve been working on a tool that tracks and adds up references to city names in large blocks of text. For instance, the tool can count how many times “Chicago,” “New York” or “Los Angeles,” “San Francisco” etc… are detected in a text chunk and can rank them.
The current problem I am having is figuring out how to remove false positives from city names that are also last names. So, for instance, I would want to count, say, Jackson, Mississippi, but not count “Frank Jackson”, “Jane Jackson”, etc.
What I would like to do, however, is figure out a way to account for any false positive of the form [any name from a long list of first names] + [select last name].
I have assembled a list of ~5000 first names from the census data that I can also bring into python as a list. I can also check true/false to find if a name is on that list, so I know I’m getting closer.
However, what I can’t figure out is how to express what I want, which is something like (I’ll use Jackson as an example again):
totalfirstnamejacksoncount = count (“[any name from census list] + Jackson”)
More or less. Is there some way I can phrase it as a wildcard over the census list? Could I set a variable that would read as “any item in this list”, so I could write “anynamevariable + Jackson”? Or is there any other way to denote something like “any word in census list + Jackson”?
Ideally, my aim is to get a total count of “[any first name] + [specified last name]” so I can subtract it from the total [last name that is also a city name] count, and maybe use that count for some other refinements.
In a worst-case scenario I could directly modify the census list, appending Jackson (or whatever last name I need) to each entry and adding the counting lines manually, but I feel like that would make a complete mess of my code, given ~5000 names for each last name I'd like to handle.
Sorry for the long-winded post. I appreciate your help with all this. If you have other suggestions you think might be better ways to approach it I’m happy to hear those out as well.
I propose to use regular expressions in combination with the list of names from NLTK. Suppose your text is:
text = "I met Fred Jackson and Mary Jackson in Jackson Mississippi"
Take the list of all names and convert it into a (huge) regular expression:
import re
import nltk  # nltk.download('names') may be needed once for the names corpus

jackson_names = re.compile("|".join(w + r"\s+" + "Jackson"
                                    for w in nltk.corpus.names.words()))
In case you are not familiar with regular expressions, r'\s+' means "separated by one or more white spaces" and "|" means "or". The regular expression can be expanded to handle other last names.
Now, extract all "Jackson" matches from your text:
jackson_catch = jackson_names.findall(text)
#['Fred Jackson', 'Mary Jackson']
len(jackson_catch)
#2
Let's start by assuming that you are able to work with your data by iterating through words, e.g.
s = 'Hello. I am a string.'
s.split()
Output: ['Hello.', 'I', 'am', 'a', 'string.']
and you have managed to normalize the words by eliminating punctuation, capitalization, etc.
So you have a list of words words_list (which is your text converted into a list) and an index i at which you think there might be a city name, OR it might be someone's last name falsely identified as a city name. Let's call your list of first names FIRST_NAMES, which should be of type set (see comments).
if i >= 1:
    prev_word = words_list[i-1]
    if prev_word in FIRST_NAMES:
        pass  # put false positive code here
    else:
        pass  # put true positive code here
You may also prefer to use regular expressions, as they are more flexible and more powerful. For example, you may notice that even after implementing this, you still have false positives or false negatives for some previously unforeseen reason. RE's could allow you to quickly adapt to the new problem.
On the other hand, if performance is a primary concern, you may be better off not using something so powerful and flexible, so that you can hone your algorithm to fit your specific requirements and run as efficiently as possible.
The current problem I am having is figuring out how to remove false positives from city names that are also last names. So, for instance, I would want to count, say, Jackson, Mississippi, but not count “Frank Jackson”, “Jane Jackson”, etc.
The problem you have is called "named entity recognition", and is best solved with a classifier that takes multiple cues into account to find the named entities and classify them according to type (PERSON, ORGANIZATION, LOCATION, etc., or a similar list).
Chapter 7 in the nltk book, and especially section 3, Developing and evaluating chunkers, walks you through the process of building and training a recognizer. Alternatively, you could install the Stanford named-entity recognizer and measure its performance on your data.
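As a quick illustration of what NLTK's off-the-shelf recognizer gives you (a sketch only; accuracy depends on the pretrained model, and you may still prefer to train your own chunker as the chapter describes):
import nltk

# one-time downloads assumed: punkt, averaged_perceptron_tagger, maxent_ne_chunker, words
text = "I met Frank Jackson before flying to Jackson, Mississippi."
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)

locations = [" ".join(tok for tok, pos in subtree)
             for subtree in tree.subtrees()
             if subtree.label() == "GPE"]
print(locations)  # should list the place mentions, not the person "Frank Jackson"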

fuzzy matching lots of strings

I've got a database with property owners; I would like to count the number of properties owned by each person, but am running into standard mismatch problems:
REDEVELOPMENT AUTHORITY vs. REDEVELOPMENT AUTHORITY O vs. PHILADELPHIA REDEVELOPMEN vs. PHILA. REDEVELOPMENT AUTH
COMMONWEALTH OF PENNA vs. COMMONWEALTH OF PENNSYLVA vs. COMMONWEALTH OF PA
TRS UNIV OF PENN vs. TRUSTEES OF THE UNIVERSIT
From what I've seen, this is a pretty common problem, but my problem differs from those with solutions I've seen for two reasons:
1) I've got a large number of strings (~570,000), so computing the 570000 x 570000 matrix of edit distances (or other pairwise match metrics) seems like a daunting use of resources
2) I'm not focused on one-off comparisons (e.g., matching user input against a database on file, as is most common in the big-data fuzzy-matching questions I've seen). I have one fixed data set that I want to condense once and for all.
Are there any well-established routines for such an exercise? I'm most familiar with Python and R, so an approach in either of those would be ideal, but since I only need to do this once, I'm open to branching out to other, less familiar languages (perhaps something in SQL?) for this particular task.
That is exactly what I am facing daily at my new job (but my line counts are a few million). My approach is to:
1) find a set of unique strings by using p = unique(a)
2) remove punctuation, split the strings in p by whitespace, make a table of word frequencies, create a set of rules, and use gsub to "recover" abbreviations, mistyped words, etc. E.g. in your case "AUTH" should be recovered back to "AUTHORITY", "UNIV" -> "UNIVERSITY" (or vice versa); see the sketch after this list
3) recover typos if I spot them by eye
4) advanced: reorder the words in the strings (the word order is too often improper English) to see whether two or more strings are identical apart from word order (e.g. "10pack 10oz" and "10oz 10pack").
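For illustration, a minimal Python sketch of steps 1) and 2) (the answer describes them in R; the substitution rules below are hand-made assumptions you would derive from your own word-frequency table):
import re

owners = ["REDEVELOPMENT AUTHORITY O", "PHILA. REDEVELOPMENT AUTH",
          "COMMONWEALTH OF PENNA", "COMMONWEALTH OF PA",
          "TRS UNIV OF PENN"]

rules = [  # assumed rules, built by inspecting a word-frequency table
    (r"\bAUTH\b", "AUTHORITY"),
    (r"\bUNIV\b", "UNIVERSITY"),
    (r"\bPHILA\b", "PHILADELPHIA"),
    (r"\bPENNA\b|\bPA\b|\bPENN\b", "PENNSYLVANIA"),
    (r"\bTRS\b", "TRUSTEES"),
]

def normalize(name):
    name = re.sub(r"[^\w\s]", " ", name)   # strip punctuation
    for pattern, repl in rules:
        name = re.sub(pattern, repl, name)
    return " ".join(name.split())          # collapse whitespace

unique_owners = sorted(set(normalize(o) for o in owners))
print(unique_owners)  # the two COMMONWEALTH variants collapse to one entry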
You can also use agrep() in R for fuzzy name matching, by giving a percentage of allowed mismatches. If you pass it a fixed dataset, then you can grep for matches out of your database.
