Change two characters into one symbol (Python)

I'm currently working on a file compression task for school, and I can't understand what's happening in this code (more specifically, what ISN'T happening and why).
In this section of the code, what I'm aiming to do is, in non-coding terms, change two adjacent letters which are the same into one symbol, so that they take up less memory:
for i, word in enumerate(file_contents):
    #file_contents = LIST of words in any given text file
    word_contents = (file_contents[i]).split()
    for ind,letter in enumerate(word_contents[:-1]):
        if word_contents[ind] == word_contents[ind+1]:
            word_contents[ind] = ''
            word_contents[ind+1] = '★'
However, when I run the full code with a sample text file, it seemingly doesn't do what I told it to do. For instance, the word 'Sally' should be 'Sa★y' but instead stays the same.
Could anyone help me get on the right track?
EDIT: I missed out a pretty key detail. I want the compressed string to somehow appear back in the original file_contents list where there are double letters, as the purpose of the full compression algorithm is to return a compressed version of the text in an inputted file.

I would suggest using a regex that matches two identical adjacent characters.
Example:
import re
txt = 'sally and bobby'
print(re.sub(r"(.)\1", '*', txt))
# sa*y and bo*y
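The inner loop and the condition checks in your code are not required. Since re.sub works on a string (not on the list returned by split), apply it to each word instead:
file_contents = [re.sub(r"(.)\1", '*', word) for word in file_contents]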
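To address the EDIT in the question (getting the compressed words back into file_contents), the regex idea can be applied per word and written back into the same list. This is only a sketch: the sample words are made up, and the ★ marker is taken from the question rather than from this answer.

```python
import re

# Hypothetical sample data standing in for the list read from the file.
file_contents = ["Sally", "and", "Bobby", "look"]

# (.) captures any single character; \1 requires the same character again.
double_char = re.compile(r"(.)\1")

# Slice assignment replaces the items in place, so any other reference
# to the same list object sees the compressed words too.
file_contents[:] = [double_char.sub("★", word) for word in file_contents]

print(file_contents)
```

Note that the pattern matches any doubled character, including digits and punctuation, so it may need narrowing (for example to [A-Za-z]) if only letters should be compressed.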

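There are a few things wrong with your code (I think).
1) split produces a list, not a str. When you write enumerate(word_contents[:-1]), it looks like you're assuming that you are iterating over the characters of a string, but you are iterating over a list. If each item of file_contents really is a single word, split() returns a one-element list, so word_contents[:-1] is empty and the inner loop never runs.
2) With these lines: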
if word_contents[ind] == word_contents[ind+1]:
    word_contents[ind] = ''
    word_contents[ind+1] = '★'
You're operating on your list again, whereas it looks pretty clear that you want to be operating on the string, or on a list of the characters in the word you're processing. At best this code will do nothing; at worst, you're corrupting the word_contents list.
So when you perform your modifications, you are modifying the word_contents list and not the slice word_contents[:-1] that you are actually looping over. There are more issues, but I think that answers your question (I hope).
If you really want to understand what you're doing wrong, I recommend putting in print statements along the way. If you're looking for someone to do your homework for you, there is another answer which has already given you a solution.
Here is an example of how you could add logging to the code:
for i, word in enumerate(file_contents):
    #file_contents = LIST of words in any given text file
    word_contents = (file_contents[i]).split()
    # See what the word content list actually is
    print(word_contents)
    # See what your slice is actually returning
    print(word_contents[:-1])
    # Unless you have something modifying your list elsewhere you probably want to iterate over the words list generally and not just the slice of it as well.
    for ind,letter in enumerate(word_contents[:-1]):
        # See what your other test is testing
        print(word_contents[ind], word_contents[ind+1])
        # Here you probably actually want
        # word_contents[:-1][ind]
        # which is the list item you iterate over and then the actual string I suspect you get back
        if word_contents[ind] == word_contents[ind+1]:
            word_contents[ind] = ''
            word_contents[ind+1] = '★'
UPDATE: based on the follow up questions from the OP I've made a sample program annotated with descriptions. Note this isn't an optimal solution, but mainly an exercise in teaching flow control and using basic structures.
# define the initial data...
file = "sally was a quick brown fox and jumped over the lazy dog which we'll call billy"
file_contents = file.split()
# Enumerate isn't needed in your example unless you intend to use the index later (example below)
for list_index, word in enumerate(file_contents):
    # changing something you iterate over is dangerous and sometimes confusing; in your case you iterated over
    # word contents and then modified it. if you have to take
    # two characters you change the index and size of the structure, making changes potentially invalid. So we'll create a new data structure to dump the results in
    compressed_word = []
    # since we have a list of strings we'll just iterate over each string (or word) individually
    for character in word:
        # Check to see if there is any data in the intermediate structure yet; if not, there are no duplicate chars yet
        if compressed_word:
            # if there are chars in the new structure, test to see if we hit the same character twice
            if character == compressed_word[-1]:
                # looks like we did, replace it with your star
                compressed_word[-1] = "*"
                # continue skips the rest of this iteration of the loop
                continue
        # if we haven't seen the character before or it is the first character just add it to the list
        compressed_word.append(character)
    # I guess this is one reason why you may want enumerate, to update the list with the new item?
    # join() is just converting the list back to a string
    file_contents[list_index] = "".join(compressed_word)
# prints the new version of the original "file" string
print(" ".join(file_contents))
outputs: "sa*y was a quick brown fox and jumped over the lazy dog which we'* ca* bi*y"
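The same loop can be wrapped in a small function so it is easier to try on individual words. This is a sketch only: compress_word and its marker parameter are names invented here, and it assumes the input does not already contain the marker character.

```python
def compress_word(word, marker="*"):
    """Replace each pair of identical adjacent characters with marker."""
    compressed = []
    for character in word:
        # A repeat of the character we just stored: swap it for the marker.
        if compressed and character == compressed[-1]:
            compressed[-1] = marker
            continue
        compressed.append(character)
    return "".join(compressed)


words = "sally called billy".split()
result = [compress_word(w) for w in words]
print(result)

# Three identical letters: the first two collapse, the third is kept.
print(compress_word("brrr"))
```

The triple-letter case is worth checking by hand, because the marker stored in the list no longer equals the letter that follows it.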

Related

How can I join different segments of a list?

I'm having trouble in a school project because I don't know how to join elements of a list in segments. Here's an example: Let's say I have the following list:
list = ["T","h","i","s","I","s","A","L","i","s","t",]
How could I join this list so that the program outputs the following?:
Output: ["This","Is","A","List"]

Assuming list is your input, and without giving you the answer outright since it's a school project you should do yourself, here are some hints.
You'll want to check if a character is uppercase to know when the start of a word is. With python, you can use isupper() (ex: 'C'.isupper() would return True).
Python strings are iterable.
You can add a character to the end of a string using += (ex: myWord += 'a')
You can add a string to a list using append (ex: myList.append(myWord))
Remember this is a learning experience and there's no real value to being given the answer outright, if that's what you were hoping for. Best of luck and welcome to StackOverflow.
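For readers who have already worked through the hints and want something to compare against, they could be combined roughly as follows. The variable names are placeholders chosen here, and the sketch assumes every word starts with exactly one uppercase letter.

```python
letters = ["T", "h", "i", "s", "I", "s", "A", "L", "i", "s", "t"]

words = []
current = ""
for ch in letters:
    # An uppercase letter starts a new word, so store the one built so far.
    if ch.isupper() and current:
        words.append(current)
        current = ""
    current += ch

# The last word is still in `current` when the loop ends.
if current:
    words.append(current)

print(words)
```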

You can use regex for this
import re
list = ["T","h","i","s","I","s","A","L","i","s","t",]
sep=[s for s in re.split("([A-Z][^A-Z]*)", ''.join(list)) if s]
print(sep)
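A variant of the same regex idea uses re.findall, which returns only the matches and so avoids filtering out the empty strings that re.split leaves between them. The name letters is used here instead of list to avoid shadowing the built-in; that rename is a choice made for this sketch.

```python
import re

letters = ["T", "h", "i", "s", "I", "s", "A", "L", "i", "s", "t"]

# One uppercase letter followed by any run of non-uppercase characters.
words = re.findall(r"[A-Z][^A-Z]*", "".join(letters))

print(words)
```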

Fast way to find if list of words contains at least one word that starts with certain letters (not "find ALL words"!)

I have a set (not a list) of strings (words). It is a big one. (It's extracted from images with OpenCV and Tesseract, so there's no reliable way to predict its contents.)
At some point while working with this set I need to find out whether it contains at least one word that begins with the part I'm currently processing.
So it's like (NOT actual code):
if exists(word.startswith(word_part) in word_set) then continue else break
There is a very good answer on how to find all strings in list that start with something here:
result = [s for s in string_list if s.startswith(lookup)]
or
result = filter(lambda s: s.startswith(lookup), string_list)
But they return a list or an iterator of all the strings found.
I only need to find out whether any such string exists in the set, not get them all.
Performance-wise it seems kinda stupid to build a list, take its len, check whether it's more than zero, and then just drop that list.
Is there a better / faster / cleaner way?

Your pseudocode is very close to real code!
if any(word.startswith(word_part) for word in word_set):
    continue
else:
    break
any returns as soon as it finds one true element, so it's efficient.
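The short-circuit behaviour can be observed by recording which words actually get tested. This is an illustrative sketch: the starts helper and the sample list are made up, and a list is used instead of a set only so that the iteration order is predictable.

```python
checked = []  # records every word that any() actually looked at


def starts(word, prefix):
    checked.append(word)
    return word.startswith(prefix)


words = ["banana", "apple", "cherry"]

# The generator expression is consumed lazily, one word at a time.
found = any(starts(w, "ban") for w in words)

print(found)
print(checked)
```

Only the first word is tested here because it already matches; with a list comprehension inside any([...]) all three would be tested before any() starts.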

You need yield:
def find_word(word_set, letter):
    for word in word_set:
        if word.startswith(letter):
            yield word
    yield None
if next(find_word(word_set, letter)): print('word exists')
yield produces words lazily, so if you call next() once, the generator produces only one word.
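The trailing yield None can be avoided by giving next() a default value, which it returns when the generator is exhausted. A minimal sketch, with first_match as a name invented here:

```python
def first_match(words, prefix):
    """Return the first word starting with prefix, or None if there is none."""
    return next((w for w in words if w.startswith(prefix)), None)


word_set = {"apple", "banana", "cherry"}

hit = first_match(word_set, "ban")
miss = first_match(word_set, "x")

if hit is not None:
    print("word exists:", hit)
```

Comparing against None rather than relying on truthiness keeps the check correct even if the set contains an empty string.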

I want to transform this code to work with full sentences

I am trying to check whether a keyword occurs in a sentence and then record that keyword. I managed to write this solution, but it only works if the search term is exactly one word (the keyword itself). How can I improve it so that it works when the keyword occurs inside a sentence? Here is my code:
keyword = []
for i in keywords['keyword']:
    keyword.append(i) #this was in a dataframe after reading an xlsx file with Pandas so I made it a list
hit = []
for i in phrase['Search term']:
    if i in keyword:
        hit.append(i)
    else:
        hit.append("blank")
phrase['Keyword'] = hit
This only works when the "Search term" is a single keyword, like "cat", but it won't work if the word "cat" is part of a sentence. Any pointers to improve it?
Thank you all in advance

I am not sure what you are trying to achieve here. However, I'm going to point an issue that might help you.
In your comment you said that keyword is a list of words and phrase['Search term'] is a list of sentences.
for i in phrase['Search term']:
    if i in keyword:
        hit.append(i)
    ...
In this part of your code you are checking whether an entire sentence i is one of the single words in keyword. That logic is flawed: you need to check whether a word exists in the sentence, not the other way around.
Something like this:
for i in phrase['Search term']:
    for j in keyword:
        if j in i:
            hit.append(i)
    ...
This is an example you will need to adjust to your purpose, since it now checks every keyword against every sentence.
The code above may lead to undesirable behavior, since it checks whether a smaller string (the word) exists inside a larger string (the sentence). It doesn't really check for whole words. For example, looking for cat in a sentence like:
this patient is catatonic
will make the if condition evaluate to True. A way to minimize this is to split your sentence into a list of words and check whether the word is found in that list, like this:
for i in phrase['Search term']:
    for j in keyword:
        if j in i.split(" "):
            hit.append(i)
    ...
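Putting the whole-word check together with the "blank" fallback from the question might look like the sketch below. The sample data is invented, plain lists stand in for the DataFrame columns, and it assumes the keywords are lowercase and that the first matching keyword is the one to record.

```python
# Hypothetical stand-ins for keywords['keyword'] and phrase['Search term'].
keywords = ["cat", "dog"]
search_terms = ["my cat sleeps", "this patient is catatonic", "walk the dog", "dog"]

hits = []
for sentence in search_terms:
    words = sentence.lower().split()
    # First keyword that appears as a whole word, or "blank" if none does.
    match = next((kw for kw in keywords if kw in words), "blank")
    hits.append(match)

print(hits)
```

Because exactly one value is appended per sentence, hits has the same length as search_terms and could be assigned back as a column. Punctuation attached to a word (such as "cat,") would still prevent a match and would need to be stripped first.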

Highlighting words in a docx file using python-docx gives incorrect results

I would like to highlight specific words in an MS Word document (given here as negativList) and leave the rest of the document as it was before. I have tried to adapt the code from this one but I cannot get it to run as it should:
from docx.enum.text import WD_COLOR_INDEX
from docx import Document
import pandas as pd
import copy
import re
doc = Document(docxFileName)
negativList = ["king", "children", "lived", "fire"] # some examples
for paragraph in doc.paragraphs:
    for target in negativList:
        if target in paragraph.text: # it is worth checking in detail ...
            currRuns = copy.copy(paragraph.runs) # deep copy as we delete/clear the object
            paragraph.runs.clear()
            for run in currRuns:
                if target in run.text:
                    words = re.split('(\W)', run.text) # split into words in order to be able to color only one
                    for word in words:
                        if word == target:
                            newRun = paragraph.add_run(word)
                            newRun.font.highlight_color = WD_COLOR_INDEX.PINK
                        else:
                            newRun = paragraph.add_run(word)
                            newRun.font.highlight_color = None
                else: # our target is not in it so we add it unchanged
                    paragraph.runs.append(run)
doc.save('output.docx')
As example I am using this text (in a word docx file):
CHAPTER 1
Centuries ago there lived --
"A king!" my little readers will say immediately.
No, children, you are mistaken. Once upon a time there was a piece of
wood. It was not an expensive piece of wood. Far from it. Just a
common block of firewood, one of those thick, solid logs that are put
on the fire in winter to make cold rooms cozy and warm.
There are multiple problems with my code:
1) The first sentence works but the second sentence appears twice. Why?
2) The formatting somehow gets lost in the part that I highlight. I would possibly need to copy the properties of the original run into the newly created ones, but how do I do this?
3) I lose the terminal "--"
4) In the highlighted last paragraph the "cozy and warm" is missing ...
What I would need is either a fix for these problems, or maybe I am overthinking it and there is a much easier way to do the highlighting (something like doc.highlight({"king": "pink"}), but I haven't found anything in the documentation)?

You're not overthinking it, this is a challenging problem; it is a form of the search-and-replace problem.
The target text can be located fairly easily by searching Paragraph.text, but replacing it (or in your case adding formatting) while retaining other formatting requires access at the Run level, both of which you've discovered.
There are some complications though, which is what makes it challenging:
1) There is no guarantee that your "find" target string is located entirely in a single run. So you will need to find the run containing the start of your target string and the run containing the end of your target string, as well as any in between.
This might be aided by using character offsets: for example, "king" appears at character offset 3 in '"A king!" ...' and has a length of 4, so you would identify which run contains character 3 and which contains character (3+4).
2) Related to the first complication, there is no guarantee that all the runs in which the target string partly appears are formatted the same. For example, if your target string was "a bold word", the updated version (after adding highlighting) would require at least three runs, one for "a ", one for "bold", and one for " word" (btw, which run each of the two space characters appears in won't change how they appear).
If you accept the simplification that the target string will always be a single word, you can consider the simplification of giving the replacement run the formatting of the first character (first run) of the found target runs, which is probably the usual approach.
So I suppose there are a few possible approaches, but one would be to "normalize" the runs of each paragraph containing the target string, such that the target string appeared within a distinct run. Then you could just apply highlighting to that run and you'd get the result you wanted.
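The character-offset idea can be tried without python-docx by treating each run as a plain string. The sketch below is an illustration only: runs_covering is a name invented here, the run boundaries are made up, and it assumes the paragraph text is simply the concatenation of the run texts.

```python
def runs_covering(run_texts, target):
    """Return the indices of the runs that overlap the first occurrence of target."""
    full_text = "".join(run_texts)
    start = full_text.find(target)
    if start == -1:
        return []
    end = start + len(target)  # exclusive end offset

    covering = []
    offset = 0
    for index, text in enumerate(run_texts):
        run_start, run_end = offset, offset + len(text)
        # Two half-open ranges overlap when each starts before the other ends.
        if text and run_start < end and run_end > start:
            covering.append(index)
        offset = run_end
    return covering


# Hypothetical split of the sample sentence into three runs.
run_texts = ['"A ki', 'ng!" my little ', "readers"]
print(runs_covering(run_texts, "king"))
```

With real documents the strings would come from run.text for each run in the paragraph, and the returned indices identify the runs that would have to be split or merged before highlighting.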
To be of more help, you'll need to narrow down the problem areas and provide specific inputs and outputs. I'd start with the first one (perhaps losing the "--") (in a separate question, perhaps linked from here) and then proceed one by one until it all works. It's asking too much for a respondent to produce their own test case :)
Then you'd have a question like: "I run the string: 'Centuries ago ... --' through this code and the trailing "--" disappears ...", which is a lot easier for folks to reason through.
Another good next step might be to print out the text of each run, just so you get a sense of how they're broken up. That may give you insight into where it's not working.

I know it's not the same library, but using the win32com library (from pywin32) you can highlight all the instances of a word in a specific range at once.
The code below highlights all hits.
import win32com.client as win32
word = win32.gencache.EnsureDispatch('Word.Application');word.Visible = True
word = word.Documents.Open("test.docx")
strage = word.Range(Start=0, End=0) #change this range to shorten the replace
strage.Find.Replacement.Highlight = True
strage.Find.Execute(FindText="the",Replace=2,Format=True)

I faced a similar issue where I was supposed to highlight a set of words in a document. I modified certain parts of the OP's code and now I am able to highlight the selected words correctly.
As OP said in the comments: paragraph.runs.clear() was changed to paragraph.clear().
And I added a few lines to the following part of the code:
else:
    paragraph.runs.append(run)
to get this:
else:
    oldRun = paragraph.add_run(run.text)
    if oldRun.text in spell_errors:
        oldRun.font.highlight_color = WD_COLOR_INDEX.YELLOW
While iterating over the currRuns, we extract the text content of the run and add it to the paragraph, so we need to highlight those words again.

Text Cleaning Issues

I'm learning text cleaning with Python online.
I have to get rid of some stop words and lowercase the letters,
but when I execute this code, it doesn't show anything.
I don't know why.
# we add some words to the stop word list
texts, article = [], []
for w in doc:
    # if it's not a stop word or punctuation mark, add it to our article!
    if w.text != '\n' and not w.is_stop and not w.is_punct and not w.like_num and w.text != 'I':
        # we add the lematized version of the word
        article.append(w.lemma_)
    # if it's a new line, it means we're onto our next document
    if w.text == '\n':
        texts.append(article)
        article = []
When I try to output texts, it's just blank.

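I believe the 'texts' list and the 'article' list may refer to the same content, in which case clearing one list in place also clears the other. Note that this only happens when the list is emptied in place (for example with article.clear()); rebinding the name with article = [] leaves the list already appended to 'texts' untouched.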
Here is a link to a similar question: Python: Append a list to another list and Clear the first list
Please see if the above are useful.
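The difference between rebinding a name and clearing a list in place can be checked with a few lines. The variable names below are placeholders that mirror the question, with suffixes added only to keep the three cases apart.

```python
# Case 1: rebinding. `article = []` creates a new list; texts keeps the old one.
texts = []
article = ["a", "b"]
texts.append(article)
article = []
print(texts)

# Case 2: clearing in place. texts_b holds the same object, so it empties too.
texts_b = []
article_b = ["a", "b"]
texts_b.append(article_b)
article_b.clear()
print(texts_b)

# Case 3: appending a copy makes the stored list independent of later changes.
texts_c = []
article_c = ["a", "b"]
texts_c.append(list(article_c))
article_c.clear()
print(texts_c)
```

If texts stays empty with the code from the question, it is also worth checking whether the loop ever sees a token whose text is exactly '\n', since that is the only place where texts.append is called.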
