Check for word in string with unpredictable delimiters - python

I am looking for something slightly more reliable for unpredictable strings than just checking if "word" in "check for word".
To give an example, let's say I have the following sentence:
"Learning Python!"
If the sentence contains "Python", I'd want to evaluate to true, but what if it were:
"Learning #python!"
Doing a split with a space as a delimiter would give me ["learning", "#python"], which does not match "python".
(Note: While I do understand that I could remove the # for this particular case, the problems with this are that 1. I am tagging programming languages and don't want to strip out the # in C#, and 2. this is just an example case; there are a lot of different ways I could see human-typed titles including these hints that I'd still like to catch.)
I'd basically like to check whether, beyond reasonable doubt, the sequence of characters I'm looking for is there, despite any weird ways they might mention it. What are some ways to do this? I have looked at fuzzy search a bit, but I haven't seen any use cases of looking for single words.
The end goal here is that I have tags of programming languages, and I'd like to take in people's stream titles and tag the language if it's mentioned in the title.

This code prints True if the string contains "python", ignoring case:
import re

title = "Learning Python!"  # renamed from `input` to avoid shadowing the built-in
print(re.search("python", title, re.IGNORECASE) is not None)
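A plain substring search is loose, though: it would also tag "pythonic". If you want token-ish matching that still tolerates odd delimiters like # or !, one hedged middle ground is a lookaround-based regex. The sketch below is only an illustration (the mentions helper and the [A-Za-z0-9] boundary classes are assumptions, not from the question); tune the boundary classes to your tag set:
import re

def mentions(term, title):
    # Match `term` as a standalone token. Lookarounds are used instead of \b
    # so that terms containing non-word characters (e.g. "c#") still work,
    # and so that "#python" or "python!" still count as mentions.
    pattern = r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])"
    return re.search(pattern, title, re.IGNORECASE) is not None

print(mentions("python", "Learning Python!"))   # True
print(mentions("python", "Learning #python!"))  # True -- '#' is not in the boundary class
print(mentions("c#", "Tips for C# beginners"))  # True -- re.escape handles the '#'
print(mentions("python", "pythonic tricks"))    # False -- embedded in a longer word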

Related

automate long lists in python

Today I wrote my first program, which is essentially a vocabulary learning program! So naturally I have pretty huge lists of vocabulary, and a couple of questions. I created a class whose parameters include the German vocab and the Spanish vocab. My first question is: is there any way to turn all the plain-text vocabulary that I copy from an internet vocab list into strings and separate them, without adding the " and the commas manually?
And my second question:
I created another list to assign each German vocab to each Spanish vocab and it looks a little bit like that:
vocabs = [
    Vocabulary(spanish_word[0], german_word[0]),
    Vocabulary(spanish_word[1], german_word[1]),
    # etc.
]
Vocabulary is the class, spanish_word the Spanish word list and german_word the German one, obviously. But with a lot of vocab that's a lot of work too. Is there any way to automate the process of pairing each word from the Spanish word list with the German one? I first tried this:
vocabs = [
    for spanish_word in german_word
        Vocabulary(spanish_word[0], german_word[0])
]
But that didn't work. Researching on the internet also didn't help much.
Please don't be rude if these are noob questions. I'm actually pretty happy that my program is running so well, and I would be thankful for any help to make it better.
Without knowing what it is you're looking to do with the result, it appears you're trying to do this:
vocabs = [Vocabulary(s, g) for s, g in zip(spanish_word, german_word)]
You didn't provide any code or example data for the "turn all the plain text vocabulary [..] into strings and separate them without adding the quotes and the commas manually" part. There's sure to be a way to do what you need, but you should probably ask a separate question about it, after first looking for a solution yourself and attempting it. Ask a question if you can't get it to work.
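That said, for the copy-paste question, here is a minimal sketch under the assumption that the pasted list has one Spanish-German pair per line, separated by a tab (the raw text, the delimiter, and the namedtuple stand-in for your Vocabulary class are all hypothetical; adjust them to your data):
from collections import namedtuple

Vocabulary = namedtuple("Vocabulary", ["spanish", "german"])  # stand-in for your class

raw = """hola\thallo
adios\ttschuess"""  # hypothetical pasted text: one pair per line, tab-separated

# Split into lines, then split each line on the delimiter.
pairs = (line.split("\t") for line in raw.splitlines())
vocabs = [Vocabulary(spanish, german) for spanish, german in pairs]
print(vocabs)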

How to remove all the spaces between letters?

I have text with words like this: a n a l i z e, c l a s s etc. But there are normal words as well. I need to remove all the spaces between the letters of such words.
import re

reg_let = re.compile(r'\s[А-Яа-яёЁa-zA-Z](\s)', re.DOTALL)
text = 'T h i s is exactly w h a t I needed'
text = re.sub(reg_let, '', text)
text
OUTPUT:
'Tiis exactlyhtneeded' (while I need 'This is exactly what I needed')
As far as I know, there is no easy way to do this, because your biggest problem is distinguishing the meaningful words; in other words, you need some semantic engine to tell you which word is meaningful in the sentence.
The only thing I can think of is a word-embedding model. Without anything like that you can clear as many spaces as you want, but you can't distinguish the words, meaning you'll never know which spaces not to remove.
I would love it if someone corrected me if there's a simpler way I'm not aware of.
There is no easy solution to this problem.
The only solution that I can think of is the one in which is used a dictionary to check if a word is correct or no (present in the english dictionary).
But even doing so you'll get a lot of false positives. For example, given the text:
a n a n a s
the words:
a
an
as
are all correct in the English dictionary. How do I split the text? For me, as a human who can read the text, it is clear that the word here is ananas. But one could split the text as:
an an as
which is grammatically correct but doesn't make sense in English. The correctness is given by the context, and I, as a human, can understand the context. One could split and concatenate the string in different ways to check if it makes sense, but unfortunately there is no library or simple procedure that can understand context.
Machine Learning could be a way, but there is no perfect solution.
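If you do go the dictionary route, the standard trick is dynamic-programming word segmentation over the text with the spaces stripped out. A minimal sketch, assuming you can supply a set of known words (the toy vocabulary below is made up; real text needs a full dictionary and, as noted above, still can't resolve genuine ambiguity like ananas vs. an an as):
def segment(s, words):
    # best[i] holds one segmentation of s[:i] into dictionary words, if any.
    best = {0: []}
    for i in range(1, len(s) + 1):
        for j in range(i):  # smallest j first, so the longest word ending at i wins
            if j in best and s[j:i] in words:
                best[i] = best[j] + [s[j:i]]
                break
    return best.get(len(s))  # None if no full segmentation exists

collapsed = "thisisexactlywhatineeded"  # spaces already stripped
vocab = {"this", "is", "exactly", "what", "i", "needed"}
print(segment(collapsed, vocab))
# ['this', 'is', 'exactly', 'what', 'i', 'needed']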

highlighting words in a docx file using python-docx gives incorrect results

I would like to highlight specific words in an MS Word document (here given as negativeList) and leave the rest of the document as it was before. I have tried to adapt from this one, but I cannot get it running as it should:
from docx.enum.text import WD_COLOR_INDEX
from docx import Document
import copy
import re

doc = Document(docxFileName)
negativeList = ["king", "children", "lived", "fire"]  # some examples
for paragraph in doc.paragraphs:
    for target in negativeList:
        if target in paragraph.text:  # it is worth checking in detail ...
            currRuns = copy.copy(paragraph.runs)  # copy the run list, as we clear the paragraph below
            paragraph.runs.clear()
            for run in currRuns:
                if target in run.text:
                    words = re.split(r'(\W)', run.text)  # split into words in order to be able to color only one
                    for word in words:
                        if word == target:
                            newRun = paragraph.add_run(word)
                            newRun.font.highlight_color = WD_COLOR_INDEX.PINK
                        else:
                            newRun = paragraph.add_run(word)
                            newRun.font.highlight_color = None
                else:  # our target is not in it so we add it unchanged
                    paragraph.runs.append(run)
doc.save('output.docx')
As an example I am using this text (in a Word docx file):
CHAPTER 1
Centuries ago there lived --
"A king!" my little readers will say immediately.
No, children, you are mistaken. Once upon a time there was a piece of
wood. It was not an expensive piece of wood. Far from it. Just a
common block of firewood, one of those thick, solid logs that are put
on the fire in winter to make cold rooms cozy and warm.
There are multiple problems with my code:
1) The first sentence works, but the second sentence appears twice. Why?
2) The formatting somehow gets lost in the part where I highlight. I would probably need to copy the properties of the original run into the newly created ones, but how do I do this?
3) I lose the trailing "--".
4) In the highlighted last paragraph the "cozy and warm" is missing ...
What I would need is either a fix for these problems, or maybe I am overthinking it and there is a much easier way to do the highlighting? (Something like doc.highlight({"king": "pink"}), but I haven't found anything like that in the documentation.)
You're not overthinking it, this is a challenging problem; it is a form of the search-and-replace problem.
The target text can be located fairly easily by searching Paragraph.text, but replacing it (or in your case adding formatting) while retaining other formatting requires access at the Run level, both of which you've discovered.
There are some complications though, which is what makes it challenging:
There is no guarantee that your "find" target string is located entirely in a single run. So you will need to find the run containing the start of your target string and the run containing the end of your target string, as well as any in-between.
This might be aided by using character offsets: "king" appears at character offset 3 in '"A king!" ...' and has a length of 4; then identify which run contains character 3 and which contains character (3+4).
Related to the first complication, there is no guarantee that all the runs in which the target string partly appears are formatted the same. For example, if your target string was "a bold word", the updated version (after adding highlighting) would require at least three runs, one for "a ", one for "bold", and one for " word" (btw, which run each of the two space characters appear in won't change how they appear).
If you accept the simplification that the target string will always be a single word, you can consider the simplification of giving the replacement run the formatting of the first character (first run) of the found target runs, which is probably the usual approach.
So I suppose there are a few possible approaches, but one would be to "normalize" the runs of each paragraph containing the target string, such that the target string appeared within a distinct run. Then you could just apply highlighting to that run and you'd get the result you wanted.
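A minimal sketch of that normalization, under the single-word simplification above. Caveat: it rebuilds each matching paragraph from its plain text, so any run-level formatting inside the paragraph (bold words etc.) is lost, and only the first occurrence per paragraph is highlighted; the target and file names are hypothetical:
from docx import Document
from docx.enum.text import WD_COLOR_INDEX

doc = Document("test.docx")  # hypothetical input file
target = "king"

for paragraph in doc.paragraphs:
    text = paragraph.text
    idx = text.lower().find(target)
    if idx == -1:
        continue
    paragraph.clear()  # drop the old runs (and their character formatting!)
    if idx > 0:
        paragraph.add_run(text[:idx])  # text before the hit
    hit = paragraph.add_run(text[idx:idx + len(target)])
    hit.font.highlight_color = WD_COLOR_INDEX.PINK
    paragraph.add_run(text[idx + len(target):])  # text after the hit (may be empty)

doc.save("highlighted.docx")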
To be of more help, you'll need to narrow down the problem areas and provide specific inputs and outputs. I'd start with the first one (perhaps losing the "--") (in a separate question, perhaps linked from here) and then proceed one by one until it all works. It's asking too much for a respondent to produce their own test case :)
Then you'd have a question like: "I run the string: 'Centuries ago ... --' through this code and the trailing "--" disappears ...", which is a lot easier for folks to reason through.
Another good next step might be to print out the text of each run, just so you get a sense of how they're broken up. That may give you insight into where it's not working.
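For instance, a quick diagnostic along those lines (file name hypothetical):
from docx import Document

doc = Document("test.docx")
for i, paragraph in enumerate(doc.paragraphs):
    for j, run in enumerate(paragraph.runs):
        print(f"para {i} run {j}: {run.text!r}")
Runs often break in surprising places (Word starts a new run on every formatting change, and sometimes mid-word after edits), so seeing the actual boundaries usually explains duplicated or dropped fragments like the ones you describe.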
I know it's not the same library, but using the win32com library you can highlight all instances of the word in a specific range at once. The code below will highlight all hits.
import win32com.client as win32

word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = True
doc = word.Documents.Open("test.docx")
rng = doc.Range(Start=0, End=0)  # change this range to shorten the replace
rng.Find.Replacement.Highlight = True
rng.Find.Execute(FindText="the", Replace=2, Format=True)  # Replace=2 is wdReplaceAll
I faced a similar issue where I was supposed to highlight a set of words in a document. I modified certain parts of the OP's code and now I am able to highlight the selected words correctly.
As OP said in the comments: paragraph.runs.clear() was changed to paragraph.clear().
And I added a few lines to the following part of the code:
else:
    paragraph.runs.append(run)
to get this:
else:
    oldRun = paragraph.add_run(run.text)
    if oldRun.text in spell_errors:
        oldRun.font.highlight_color = WD_COLOR_INDEX.YELLOW
While iterating over currRuns, we extract the text content of each run and add it back to the paragraph, so we need to highlight those words again.

How to format search autocompletion part lists?

I'm currently working on an AppEngine project, and I'd like to implement autocompletion of search terms. The items that can be searched for are reasonably unambiguous and short, so I was thinking of implementing it by giving each item a list of incomplete typings. So foobar would get a list like [f, fo, foo, foob, fooba, foobar]. The user's text in the searchbox is then compared to this list, and positive matches are suggested.
There are a couple of possible optimizations in this list that I was thinking of:
Removing spaces and punctuation from search terms: Foo. Bar becomes FooBar.
Removing capital letters
Removing leading particles like "the", "a", "an": The Guy would become guy, and be indexed as [g, gu, guy].
Only adding substrings longer than 2 or 3 characters to the index list, so The Guy would be indexed as [gu, guy]. I figured suggestions that only match the first letter would not be very relevant.
The user's search term would also be formatted this way, after which the DB is searched. When suggesting a search term, the particles, punctuation, and capital letters would be restored according to the suggested object's full name. So searching for "the" would give no suggestions, but searching for "The Gu.." or "gu" would suggest "The Guy".
Is this a good idea? Mainly: would this formatting help, or only cause trouble?
I have already run into the same problem, and the solution I adopted was very similar to your idea: I split the items into words, convert them to lowercase, remove accents, and create a list of prefixes. For instance, "Báz Bar" becomes ['b', 'ba', 'bar', 'baz'].
I have posted the code in this thread. The search box of this site is using it. Feel free to use it if you like.
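For reference, a minimal sketch of that kind of prefix indexing (the function name and the min_len cutoff are assumptions; the cutoff corresponds to your "only add substrings longer than 2 or 3" idea):
import unicodedata

def prefixes(name, min_len=1):
    # Lowercase, strip accents, then collect every word prefix of
    # length >= min_len; the sorted set is what gets indexed.
    normalized = unicodedata.normalize("NFKD", name)
    normalized = "".join(c for c in normalized if not unicodedata.combining(c)).lower()
    result = set()
    for word in normalized.split():
        for i in range(min_len, len(word) + 1):
            result.add(word[:i])
    return sorted(result)

print(prefixes("Báz Bar"))  # ['b', 'ba', 'bar', 'baz']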

Parsing in Python: what's the most efficient way to suppress/normalize strings?

I'm parsing a source file, and I want to "suppress" strings. By this I mean transforming every string literal like "bla bla bla +/*" into something like "string" that is deterministic and does not contain any characters that may confuse my parser, because I don't care about the values of the strings. One of the issues here is string formatting using e.g. "%s"; please see my remark about this below.
Take for example the following pseudo code, that may be the contents of a file I'm parsing. Assume strings start with ", and escaping the " character is done by "":
print(i)
print("hello**")
print("hel"+"lo**")
print("h e l l o "+
"hello\n")
print("hell""o")
print(str(123)+"h e l l o")
print(uppercase("h e l l o")+"g o o d b y e")
Should be transformed to the following result:
print(i)
print("string")
print("string"+"string")
print("string"
"string")
print("string")
print(str(123)+"string")
print(uppercase("string")+"string")
Currently I treat it as a special case in the code (i.e. detect the beginning of a string and "manually" run until its end, with several sub-special-cases along the way). If there's a Python library function I can use, or a nice regex that may make my code more efficient, that would be great.
A few remarks:
I would like the "start-of-string" character to be a variable, e.g. ' vs ".
I'm not parsing Python code at this stage, but I plan to, and there the problem obviously becomes more complex because strings can start in several ways and must end in a way corresponding to the start. I'm not attempting to deal with this right now, but if there's any well established best practice I would like to know about it.
The thing bothering me the most about this "suppression" is the case of string formatting with the likes of '%s', that are meaningful tokens. I'm currently not dealing with this and haven't completely thought it through, but if any of you have suggestions about how to deal with this that would be great. Please note I'm not interested in the specific type or formatting of the in-string tokens, it's enough for me to know that there are tokens inside the string (how many). Remark that may be important here: my tokenizer is not nested, because my goal is quite simple (I'm not compiling anything...).
I'm not quite sure about the escaping of the start-of-string character. What would you say are the common ways this is implemented in most programming languages? Is it enough to assume either doubling (e.g. "") or a two-character escape sequence (e.g. \")? Do I need to treat other cases (think of languages like Java, C/C++, PHP, C#)?
Option 1: To sanitize Python source code, try the built-in tokenize module. It can correctly find strings and other tokens in any Python source file.
Option 2: For most languages you can build a custom regexp substitution. For example, the following sanitizes Python source code (but it doesn't work if the source file contains """ or '''):
import re
sanitized = re.sub(r'(#.*)|\'(?:[^\'\\]+|\\.)*\'|"(?:[^"\\]+|\\.)*"',
                   lambda match: match.group(1) or '"string"', source_code)
The regexp above works properly even if the strings contain backslashes (\", \\, \n, \\" etc. all work fine).
When you are building your regexp, make sure to match comments (so your regexp substitution won't touch strings inside comments) and regular expression literals (e.g. in Perl, Ruby and JavaScript), and pay attention to matching backslashes and newlines properly (e.g. in Perl and Ruby a string can contain a newline).
Option 3: Use pygments with HTML output, and replace anything in blue (etc.) with "string". pygments supports a few dozen languages.
Use a dedicated parser for each language, especially since people have already done that work for you. Most of the languages you mentioned have a grammar.
Nowhere do you mention that you are taking an approach using a lexer and parser. If in fact you are not, have a look at e.g. the tokenize module (which is probably what you want), or the third-party module PLY (Python Lex-Yacc). Your problem needs a systematic approach, and these tools (and others) provide it.
(Note that once you have tokenized the code, you can apply another specialized tokenizer to the contents of the strings to detect special formatting directives such as %s. In this case a regular expression may do the job, though.)
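To make the tokenize suggestion concrete, here is a minimal sketch that round-trips Python source through tokenize/untokenize, replacing every STRING token (the function name is an assumption; note that untokenize, when fed 2-tuples, may normalize whitespace slightly):
import io
import tokenize

def suppress_strings(source):
    # Replace every string literal in Python source with "string".
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        text = '"string"' if tok.type == tokenize.STRING else tok.string
        result.append((tok.type, text))
    return tokenize.untokenize(result)

print(suppress_strings('print(uppercase("h e l l o")+"g o o d b y e")\n'))
# -> print (uppercase ("string")+"string")  (modulo spacing)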
