String slicing in specific scenarios Python - python

I have a string I'd like to split to new strings which will contain only text (no commas, spaces, dots etc.). The length of each new string must be of variable n. The slicing must go through each possible combination.
Meaning, for example, an input of func('banana pack', 3) will result in ['ban','ana','nan','ana',pac','ack']. So far what I managed to achieve is:
def func(text, n):
text = text.lower()
text = text.translate(str.maketrans("", "", " .,"))
remainder = len(text) % n
split_text = [text[i:i + n] for i in range(0, len(text) - remainder, n)]
if remainder > 0:
split_text.append(text[-n:])
return split_text

First I clean the input, by removing ',' and '.'. The input is then split at spaces to take only full words into account. For each word the sections are appended.
def func(text,n):
text=text.replace('.','').replace(',','') #Cleanup
words = text.split() #split words
output = []
for word in words:
for i in range(len(word)-n+1):
output.append(word[i:i+n])
return output
You could unroll the loop one level if you just iterate over everything and discard results with unwanted symbols.

Related

Replace sequence of the same letter with single one

I am trying to replace the number of letters with a single one, but seems to be either hard either I am totally block how this should be done
So example of input:
aaaabbcddefff
The output should be abcdef
Here is what I was able to do, but when I went to the last piece of the string I can't get it done. Tried different variants, but I am stucked. Can someone help me finish this code?
text = "aaaabbcddefff"
new_string = ""
count = 0
while text:
for i in range(len(text)):
l = text[i]
for n in range(len(text)):
if text[n] == l:
count += 1
continue
new_string += l
text = text.replace(l, "", count)
break
count = 0
break
Using regex
re.sub(r"(.)(?=\1+)", "", text)
>>> import re
>>> text = "aaaabbcddefff"
>>> re.sub(r"(.)(?=\1+)", "", text)
abcdeaf
Side note: You should consider building your string up in a list and then joining the list, because it is expensive to append to a string, since strings are immutable.
One way to do this is to check if every letter you look at is equal to the previous letter, and only append it to the new string if it is not equal:
def remove_repeated_letters(s):
if not s: return ""
ret = [s[0]]
for index, char in enumerate(s[1:], 1):
if s[index-1] != char:
ret.append(char)
return "".join(ret)
Then, remove_repeated_letters("aaaabbcddefff") gives 'abcdef'.
remove_repeated_letters("aaaabbcddefffaaa") gives 'abcdefa'.
Alternatively, use itertools.groupby, which groups consecutive equal elements together, and join the keys of that operation
import itertools
def remove_repeated_letters(s):
return "".join(key for key, group in itertools.groupby(s))

RegEx for matching substrings

I want to write a regex to match on all substrings of a given input string. I have tried the following:
def sub_string(str):
n = len(str)
# For holding all the formed substrings
output = "\\b("
# This loop maintains the starting character
for i in range(0, n):
# This loop will add a character to start character one by one till the end is reached
for j in range(i, n):
output += str[i:(j + 1)] + "|"
return output + "\\b)"
However, it matches on the wrong words. For example if i input "h" it matches. What could be the problem? And is there another approach I could use?
Thanks in advance!

String Index Out of Range Issue - Python

I am trying to make a lossy text compression program that removes all vowels from the input, except for if the vowel is the first letter of a word. I keep getting this "string index out of range" error on line 6. Please help!
text = str(input('Message: '))
text = (' ' + text)
for i in range(0, len(text)):
i = i + 1
if str(text[i-1]) != ' ': #LINE 6
text = text.replace('a', '')
text = text.replace('e', '')
text = text.replace('i', '')
text = text.replace('o', '')
text = text.replace('u', '')
print(text)
As busybear notes, the loop isn't necessary: your replacements don't depend on i.
Here's how I'd do it:
def strip_vowels(s): # Remove all vowels from a string
for v in 'aeiou':
s = s.replace(v, '')
return s
def compress_word(s):
if not s: return '' # Needed to avoid an out-of-range error on the empty string
return s[0] + strip_vowels(s[1:]) # Strip vowels from all but the first letter
def compress_text(s): # Apply to each word
words = text.split(' ')
new_words = compress_word(w) for w in words
return ' '.join(new_words)
When you replace letters with a blank, your word gets shorter. So what was originally len(text) is going to be out of bounds if you remove any letters. Do note however, replace is replacing all occurrences within your string, so a loop isn't even necessary.
An alternative to use the loop is to just keep track of the index of letters to replace while going through the loop, then replace after the loop is complete.
Shortening your string length by replacing any char with "" means that if you remove a character, len(text) used in your iterator is longer than the actual string length. There are plenty of alternative solutions. for example,
text_list = list(text)
for i in range(1, len(text_list)):
if text_list[i] in "aeiou":
text_list[i] = ""
text = "".join(text_list)
By turning your string into a list of its composite characters, you can remove characters but maintain the list length (since empty elements are allowed) then rejoin them.
Be sure to account for special cases, such as len(text)<2.

Need assistance with cleaning words that were counted from a text file

I have an input text file from which I have to count sum of characters, sum of lines, and sum of each word.
So far I have been able to get the count of characters, lines and words. I also converted the text to all lower case so I don't get 2 different counts for same word where one is in lower case and the other is in upper case.
Now looking at the output I realized that, the count of words is not as clean. I have been struggling to output clean data where it does not count any special characters, and also when counting words not to include a period or a comma at the end of it.
Ex. if the text file contains the line: "Hello, I am Bob. Hello to Bob *"
it should output:
2 Hello
2 Bob
1 I
1 am
1 to
Instead my code outputs
1 Hello,
1 Hello
1 Bob.
1 Bob
1 I
1 am
1 to
1 *
Below is the code I have as of now.
# Open the input file
fname = open('2013_honda_accord.txt', 'r').read()
# COUNT CHARACTERS
num_chars = len(fname)
# COUNT LINES
num_lines = fname.count('\n')
#COUNT WORDS
fname = fname.lower() # convert the text to lower first
words = fname.split()
d = {}
for w in words:
# if the word is repeated - start count
if w in d:
d[w] += 1
# if the word is only used once then give it a count of 1
else:
d[w] = 1
# Add the sum of all the repeated words
num_words = sum(d[w] for w in d)
lst = [(d[w], w) for w in d]
# sort the list of words in alpha for the same count
lst.sort()
# list word count from greatest to lowest (will also show the sort in reserve order Z-A)
lst.reverse()
# output the total number of characters
print('Your input file has characters = ' + str(num_chars))
# output the total number of lines
print('Your input file has num_lines = ' + str(num_lines))
# output the total number of words
print('Your input file has num_words = ' + str(num_words))
print('\n The 30 most frequent words are \n')
# print the number of words as a count from the text file with the sum of each word used within the text
i = 1
for count, word in lst[:10000]:
print('%2s. %4s %s' % (i, count, word))
i += 1
Thanks
Try replacing
words = fname.split()
With
get_alphabetical_characters = lambda word: "".join([char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word])
words = list(map(get_alphabetical_characters, fname.split()))
Let me explain the various parts of the code.
Starting with the first line, whenever you have a declaration of the form
function_name = lambda argument1, argument2, ..., argumentN: some_python_expression
What you're looking at is the definition of a function that doesn't have any side effects, meaning it can't change the value of variables, it can only return a value.
So get_alphabetical_characters is a function that we know due to the suggestive name, that it takes a word and returns only the alphabetical characters contained within it.
This is accomplished using the "".join(some_list) idiom which takes a list of strings and concatenates them (in other words, it producing a single string by joining them together in the given order).
And the some_list here is provided by the generator expression [char if char in 'abcdefghijklmnopqrstuvwxyz' else '' for char in word]
What this does is it steps through every character in the given word, and puts it into the list if it's alphebetical, or if it isn't it puts a blank string in it's place.
For example
[char if char in 'abcdefghijklmnopqrstuvwyz' else '' for char in "hello."]
Evaluates to the following list:
['h','e','l','l','o','']
Which is then evaluates by
"".join(['h','e','l','l','o',''])
Which is equivalent to
'h'+'e'+'l'+'l'+'o'+''
Notice that the blank string added at the end will not have any effect. Adding a blank string to any string returns that same string again.
And this in turn ultimately yields
"hello"
Hope that's clear!
Edit #2: If you want to include periods used to mark decimal we can write a function like this:
include_char = lambda pos, a_string: a_string[pos].isalnum() or a_string[pos] == '.' and a_string[pos-1:pos].isdigit()
words = "".join(map(include_char, fname)).split()
What we're doing here is that the include_char function checks if a character is "alphanumeric" (i.e. is a letter or a digit) or that it's a period and that the character preceding it is numeric, and using this function to strip out all the characters in the string we want, and joining them into a single string, which we then separate into a list of strings using the str.split method.
This program may help you:
#I created a list of characters that I don't want \
# them to be considered as words!
char2remove = (".",",",";","!","?","*",":")
#Received an string of the user.
string = raw_input("Enter your string: ")
#Make all the letters lower-case
string = string.lower()
#replace the special characters with white-space.
for char in char2remove:
string = string.replace(char," ")
#Extract all the words in the new string (have repeats)
words = string.split(" ")
#creating a dictionary to remove repeats
to_count = dict()
for word in words:
to_count[word]=0
#counting the word repeats.
for word in to_count:
#if there is space in a word, it is white-space!
if word.isalpha():
print word, string.count(word)
Works as below:
>>> ================================ RESTART ================================
>>>
Enter your string: Hello, I am Bob. Hello to Bob *
i 1
am 1
to 1
bob 2
hello 2
>>>
Another way is using Regex to remove all non-letter chars (to get rid off char2remove list):
import re
regex = re.compile('[^a-zA-Z]')
your_str = raw_input("Enter String: ")
your_str = your_str.lower()
regex.sub(' ', your_str)
words = your_str.split(" ")
to_count = dict()
for word in words:
to_count[word]=0
for word in to_count:
if word.isalpha():
print word, your_str.count(word)

Ignore punctuation and case when comparing two strings in Python

I have a two dimensional array called "beats" with a bunch of data. In the second column of the array, there is a list of words in alphabetical order.
I also have a sentence called "words" which was originally a string, which I've turned into an array.
I need to check if one of the words in "words" matches any of the words in the second column of the array "beats". If a match has been found, the program changes the matched word in the sentence "words" to "match" and then return the words in a string. This is the code I'm using:
i = 0
while i < len(words):
n = 0
while n < len(beats):
if words[i] == beats[n][1]:
words[i] = "match"
n = n + 1
i = i + 1
mystring = ' '.join(words)
return mystring
So if I have the sentence:
"Money is the last money."
And "money" is in the second column of the array "beats", the result would be:
"match is the last match."
But since there's a period behind "match", it doesn't consider it a match.
Is there a way to ignore punctuation when comparing the two strings? I don't want to strip the sentence of punctuation because I want the punctuation to be in tact when I return the string once my program's done replacing the matches.
You can create a new string that has the properties you want, and then compare with the new string(s). This will strip everything but numbers, letters, and spaces while making all letters lowercase.
''.join([letter.lower() for letter in ' '.join(words) if letter.isalnum() or letter == ' '])
To strip everything but letters from a string you can do something like:
from string import ascii_letters
''.join([letter for letter in word if letter in ascii_letters])
You could use a regex:
import re
st="Money is the last money."
words=st.split()
beats=['money','nonsense']
for i,word in enumerate(words):
if word=='match': continue
for tgt in beats:
word=re.sub(r'\b{}\b'.format(tgt),'match',word,flags=re.I)
words[i]=word
print print ' '.join(words)
prints
match is the last match.
If it is only the fullstop that you are worried about, then you can add another if case to match that too. Or similar you can add custom handling if your cases are limited. or otherwise regex is the way to go.
words="Money is the last money. This money is another money."
words = words.split()
i = 0
while i < len(words):
if (words[i].lower() == "money".lower()):
words[i] = "match"
if (words[i].lower() == "money".lower() + '.'):
words[i] = "match."
i = i + 1
mystring = ' '.join(words)
print mystring
Output:
match is the last match. This match is another match.

Categories

Resources