How to edit items in a list - Python

This is a follow-on from a previous question.
I've loaded a word list in Python, but there is a problem. For example, when I access the 21st item in wordlist I should get "ABACK". Instead I get:
wordlist[21]
"'ABACK\\n',"
So for each word in wordlist I need to trim "'" off the front, and "\\n'," off the back of every string in wordlist. I've tried different string methods but haven't found one that works yet.

wordlist = [e[1:-4] for e in wordlist]
That does the trick. Courtesy of whoever commented above and a similar post.

Since there is a backslash before \n, doing a simple strip won't work if you want to remove it. You can do this:
wordlist = [word.strip("'").strip().split("\\n")[0] for word in wordlist]
The additional strip() is for when you actually have a \n to get rid of. Or you can do word[1:-4] as @jonrsharpe has suggested.

If this is generic to every word in the list, you can use the lstrip() and rstrip() methods. Note that strings are immutable, so calling these methods in a bare loop discards the result - keep it instead:
wordlist = [word.lstrip("'").rstrip("\\n',") for word in wordlist]
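Taken together, a minimal sketch of the cleanup on entries shaped like the question's wordlist[21] (the two-item sample list here is made up for illustration; removeprefix/removesuffix require Python 3.9+):

```python
# Each string literally contains a quote, the word, a backslash, an "n",
# a quote and a comma - just like the question's "'ABACK\\n',".
wordlist = ["'ABACK\\n',", "'ABASE\\n',"]

# Slicing as above: drop 1 character from the front, 4 from the back.
sliced = [e[1:-4] for e in wordlist]
print(sliced)   # ['ABACK', 'ABASE']

# Python 3.9+: removeprefix/removesuffix strip only exact matches and are
# no-ops when the prefix/suffix is absent, so they are harder to get wrong.
trimmed = [e.removeprefix("'").removesuffix("\\n',") for e in wordlist]
print(trimmed)  # ['ABACK', 'ABASE']
```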


Python regex to check if a substring is at the beginning or at the end of a bigger word

I have a string containing words in the form word1_word2, word3_word4, word5_word1 (so a word can appear at the left or at the right). I want a regex that looks for all the occurrences of a specific word, and returns the "super word" containing it. So if I'm looking for word1, I expect my regex to return word1_word2, word5_word1. Since the word can appear on the left or on the right, I wrote this:
re.findall("( {}_)?[\u0061-\u007a\u00e0-\u00e1\u00e8-\u00e9\u00ec\u00ed\u00f2-\u00f3\u00f9\u00fa]*(_{} )?".format("w1", "w1"), string)
With the optional blocks at the beginning or at the end of the pattern. However, it takes forever to execute and I think something is not correct because I tried removing the optional blocks and writing two separate regex for looking at the beginning and at the end and they are much faster (but I don't want to use two regex). Am I missing something or is it normal?
This would be the regex solution to your problem:
re.findall(rf'\b({yourWord}_\w+?|\w+?_{yourWord})\b', yourString)
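As for why the original pattern takes forever: since both groups are optional and the character class is starred, the whole pattern can match the empty string, so it succeeds (emptily) at every position in the string. The single-alternation pattern above avoids that; a quick check with the sample words from the question:

```python
import re

yourWord = "word1"
yourString = "word1_word2, word3_word4, word5_word1"

# One alternation anchored with \b on both sides: the word on the left
# or on the right of the underscore, never an empty match.
matches = re.findall(rf'\b({yourWord}_\w+?|\w+?_{yourWord})\b', yourString)
print(matches)  # ['word1_word2', 'word5_word1']
```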
Python provides some methods to do this
a=['word1_word2', 'word3_word4', 'word5_word1']
b = [x for x in a if x.startswith("word1") or x.endswith('word1')]
print(b) # ['word1_word2', 'word5_word1']
import re

s = 'word1_word2, word3_word4, word5_word1'
matches = re.finditer(r'(\w+_word1)|(word1_\w+)', s)
result = list(map(lambda x: x.group(), matches))
print(result)  # ['word1_word2', 'word5_word1']
This is one method, but after seeing @Carl's answer I voted for his - it's a faster and cleaner method. I'll just leave this here as one of many regex options.
this regex will do the job for word1 (non-capturing groups, so findall returns whole matches rather than group tuples):
import re
regex = r'(?:word\d_)*word1(?:_word\d)*'
re.findall(regex, string)
you can also use this:
re.findall(rf'\b(word{number}_\w+?|\w+?_word{number})\b', string)
Try the following regex.
In the following, replace word1 with the word you're looking for. This assumes the word you are looking for consists of only alphanumeric characters.
([a-zA-Z0-9]*_word1)|(word1_[a-zA-Z0-9]*)

How to remove prefixes from strings?

I'm trying to do some text preprocessing so that I can do some string matching activities.
I have a set of strings, i want to check if the first word in the string starts with "1/" prefix. If it does, I want to remove this prefix but maintain the rest of the word/string.
I've come up with the following, but it's just removing everything after the first word, not necessarily removing the prefix "1/".
prefixes = (r'1/')

# remove prefixes from string
def prefix_removal(text):
    for word in str(text).split():
        if word.startswith(prefixes):
            return word[len(prefixes):]
        else:
            return word
Any help would be appreciated!
Thank you!
Assuming you only want to remove the prefix from the first word and leave the rest alone, I see no reason to use a for loop. Instead, I would recommend this:
def prefix_removal(text):
    first_word = text.split()[0]
    if first_word.startswith(prefixes):
        return text[len(prefixes):]
    return text
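For instance, with prefixes = '1/' as in the question, this version strips the prefix but keeps the rest of the string intact (the sample inputs are made up):

```python
prefixes = '1/'

def prefix_removal(text):
    # Only the first word is checked; the rest of the string is preserved.
    first_word = text.split()[0]
    if first_word.startswith(prefixes):
        return text[len(prefixes):]
    return text

print(prefix_removal("1/half marathon"))  # half marathon
print(prefix_removal("half marathon"))    # half marathon
```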
Hopefully this answers your question, good luck!
Starting with Python 3.9 you can use str.removeprefix:
word = word.removeprefix(prefix)
For other versions of Python you can use:
if word.startswith(prefix):
    word = word[len(prefix):]
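Note that removeprefix already includes the startswith check: it returns the string unchanged when the prefix is absent. For example (sample values made up):

```python
# Python 3.9+: no startswith guard needed.
print("1/foo".removeprefix("1/"))  # foo
print("foo".removeprefix("1/"))    # foo
```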

Regex for matching exact words that contain apostrophes in Python?

For the purpose of this project, I'm using more exact regex expressions rather than more general ones. I'm counting occurrences of words from a word list in a text file that I import into my script, called vocabWords, where each word in the list is in the format \bword\b.
When I run my script, \bwhat\b will pick up the words "what" and "what's", but \bwhat's\b will pick up no words. If I switch the order so the apostrophe word is before the root word, words are counted correctly. How can I change my regex list so the words are counted correctly? I understand the problem is using "\b", but I haven't been able to find how to fix this. I cannot have a more general regex, and I have to include the words themselves in the regex pattern.
vocabWords:
\bwhat\b
\bwhat's\b
\biron\b
\biron's\b
My code:
matched = []
regex_all = re.compile('|'.join(vocabWords))
for row in df['test']:
    matched.append(re.findall(regex_all, row))
There are at least two other solutions:
Test that the next symbol isn't an apostrophe: r"\bwhat(?!')\b"
Use the more general rule r"\bwhat(?:'s)?\b" to catch both variants, with and without the apostrophe.
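For example, the optional-group variant finds both forms with a single pattern (the sample sentence is made up):

```python
import re

# (?:'s)? makes the possessive part optional, so one pattern
# catches "what" as well as "what's".
pattern = re.compile(r"\bwhat(?:'s)?\b")
print(pattern.findall("what's the word - what else?"))  # ["what's", 'what']
```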
If you sort your wordlist by length before turning it into a regexp, longer words (like "what's") will precede shorter words (like "what"). This should do the trick.
regex_all = re.compile('|'.join(sorted(vocabWords, key=len, reverse=True)))
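With the patterns from the question, sorting by length puts the apostrophe forms first in the alternation, so they win (a small sketch with a made-up test sentence):

```python
import re

vocabWords = [r"\bwhat\b", r"\bwhat's\b", r"\biron\b", r"\biron's\b"]

# Longest alternatives first: "what's" is now tried before "what".
regex_all = re.compile('|'.join(sorted(vocabWords, key=len, reverse=True)))
print(regex_all.findall("what's done is done, what else"))  # ["what's", 'what']
```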

split() not splitting all white spaces?

I am trying to take a text document and write each word separately into another text document. My only issue is that with the code I have, the words aren't always split on the whitespace, and I'm wondering if I'm just using .split() wrong. If so, could you explain why, or what to do better?
Here's my code:
list_of_words = []
with open('ExampleText.txt', 'r') as ExampleText:
    for line in ExampleText:
        for word in line.split(''):
            list_of_words.append(word)
print("Done!")
print("Also done!")
with open('TextTXT.txt', 'w') as EmptyTXTdoc:
    for word in list_of_words:
        EmptyTXTdoc.write("%s\n" % word)
EmptyTXTdoc.close()
This is the first line in the ExampleText text document as it is written in the newly created EmptyTXTdoc:
Submit
a personal
statement
of
research
and/or
academic
and/or
career
plans.
Use .split() (or .split(' ') for only spaces) instead of .split('').
Also, consider sanitizing the line with .strip() on each iteration over the file, since every line arrives with a newline (\n) at its end.
.split('') won't split anything - in fact it raises ValueError: empty separator, because you're telling it to split on, well, nothing.
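A quick demonstration of the difference between the two working variants (the sample line is made up):

```python
line = "Submit  a personal\tstatement\n"

# No argument: any run of whitespace (spaces, tabs, newlines) counts as
# one separator, and leading/trailing whitespace is ignored.
print(line.split())     # ['Submit', 'a', 'personal', 'statement']

# ' ' as argument: splits on every single space, so the double space
# yields an empty string, and the tab and newline stay attached.
print(line.split(' '))  # ['Submit', '', 'a', 'personal\tstatement\n']
```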

Count occurrences of elements in string from a list?

I'm trying to count the number of occurrences of verbal contractions in some speeches I've gathered. One particular speech looks like this:
speech = ("I've changed the path of the economy, and I've increased jobs in our own "
          "home state. We're headed in the right direction - you've all been a great help.")
So, in this case, I'd like to count four (4) contractions. I have a list of contractions, and here are some of the first few terms:
contractions = {"ain't": "am not; are not; is not; has not; have not",
                "aren't": "are not; am not",
                "can't": "cannot", ...}
My code looks something like this, to begin with:
count = 0
for word in speech:
    if word in contractions:
        count = count + 1
print count
I'm not getting anywhere with this, however, as the code's iterating over every single letter, as opposed to whole words.
Use str.split() to split your string on whitespace:
for word in speech.split():
This will split on arbitrary whitespace; this means spaces, tabs, newlines, and a few more exotic whitespace characters, and any number of them in a row.
You may need to lowercase your words using str.lower() (otherwise Ain't won't be found, for example), and strip punctuation:
from string import punctuation

count = 0
for word in speech.lower().split():
    word = word.strip(punctuation)
    if word in contractions:
        count += 1
I use the str.strip() method here; it removes everything found in the string.punctuation string from the start and end of a word.
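Putting this together with the speech from the question - note that the dict below is a hypothetical extension of the question's snippet with the contractions that actually occur in the speech, since the original only shows the first few entries:

```python
from string import punctuation

speech = ("I've changed the path of the economy, and I've increased jobs in our own "
          "home state. We're headed in the right direction - you've all been a great help.")

# Extended with the forms that occur in the speech (assumption for this demo).
contractions = {"ain't": "am not; are not; is not; has not; have not",
                "i've": "i have", "we're": "we are", "you've": "you have"}

count = 0
for word in speech.lower().split():
    # strip() only removes punctuation at the ends, so the internal
    # apostrophe in "i've" survives.
    word = word.strip(punctuation)
    if word in contractions:
        count += 1
print(count)  # 4
```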
You're iterating over a string, so the items are characters. To get the words from a string you can use a naive method like str.split(), which does this for you (then you can iterate over a list of strings - the words, split on the argument of str.split(), by default whitespace). There is even re.split(), which is more powerful, but I don't think you need to split the text with regexes here.
What you have to do at least is lowercase your string with str.lower(), or put all possible occurrences (including capitalized ones) in the dictionary. I strongly recommend the first alternative; the latter isn't really practicable. Removing the punctuation is also necessary. But this is still naive. If you need a more sophisticated method, you have to split the text with a word tokenizer. NLTK is a good starting point for that, see the nltk tokenizer. But I strongly feel that this problem is not your major one and doesn't really affect solving your question. :)
speech = """I've changed the path of the economy, and I've increased jobs in our own home state. We're headed in the right direction - you've all been a great help."""

# Maybe this dict makes more sense (list items as values). But for your
# question it doesn't matter.
contractions = {"ain't": ["am not", "are not", "is not", "has not", "have not"],
                "aren't": ["are not", "am not"],
                "i've": ["i have"]}  # ...

# With re you can define advanced regexes, but maybe
# `from string import punctuation` (the suggestion from Martijn Pieters'
# answer) is still enough for you.
import re

def abbreviation_counter(input_text, abbreviation_dict):
    count = 0
    # What you want is a list of words; str.split() does this job for you.
    # " " is the default, so you can also omit it. But if you really need
    # better methods (see the answer text above), you have to take a word
    # tokenizer tool or write your own.
    for word in input_text.split(" "):
        # Also clean the word (remove ',', ';', ...) afterwards. The advantage
        # of using re over `from string import punctuation` is that you have
        # more control over what you want to remove: you can easily add or
        # remove any punctuation mark. That can be very handy - or overpowered.
        # If the latter is the case, just stick to Martijn Pieters' solution.
        if re.sub(',|;', '', word).lower() in abbreviation_dict:
            count += 1
    return count

print(abbreviation_counter(speech, contractions))
# 2 - yeah, it worked; I've included "i've" in your list :)
It's a little bit frustrating to give an answer at the same time as Martijn Pieters does ;), but I hope I've still generated some value for you. That's why I've edited my answer to give you some hints for future work in addition.
A for loop in Python iterates over all elements in an iterable. In the case of strings the elements are the characters.
You need to split the string into a list (or tuple) of strings that contain the words. You can use .split(delimiter) for this.
Your problem is quite common, so Python has a shortcut: speech.split() splits at any number of spaces/tabs/newlines, so you only get your words in the list.
So your code should look like this:
count = 0
for word in speech.split():
    if word in contractions:
        count = count + 1
print(count)
speech.split(" ") works too, but it only splits on spaces, not tabs or newlines, and if there are double spaces you get empty elements in your resulting list.
