How can I parse only string without using regex in python? [closed]

I am learning Python and have a question about parsing strings without regex. We are required to use a while loop. Here is the task:
We get a string from the user with the input function, and then we extract just the alphabetic words from that sentence into a list.
For example, sentence: "The weather is so lovely today. Jack (our Jack) – Jason - and Alex went to park..? "
Example output: ["The", "weather", "is", "so","lovely","today","Jack","our","Jack","and","Alex","went","to","park"]
Note that punctuation marks and special characters such as parentheses are not part of words.
Below is the code I tried. I couldn't find where the error is.
s=" The weather is so lovely today. Jack (our Jack) – Jason - and Alex went to park..?"
i = 0
j = 0
l=[]
k=[]
count = 0
while s:
    while j<len(s) and not s[j].isalpha():
        j+=1
    l = s[j:]
    s=s[j:]
    while j < len(s) and l[j].isalpha():
        j+=1
    s=s[j:]
    k.append(l[0:i])
print(k)
print(l)
On the other hand, I did parse the first word with the code below.
s=" The weather is so lovely today. Jack (our Jack) – Jason - and Alex went to park..?"
i = 0
j = 0
l=[]
k=[]
while j<len(s) and not s[j].isalpha():
    j+=1
l = s[j:]
while i < len(l) and l[i].isalpha():
    i+=1
s=s[i:]
k.append(l[0:i])
print(k)
print(l)
Thanks for your help.

By and large, if your goal is to parse a string and you find yourself modifying the string, you're probably doing it wrong. That's particularly true of languages like Python where strings are immutable, and modifying a string really means creating a new one, which takes time proportional to the length of the string. Doing that in a loop effectively turns a linear scan into a quadratic-time algorithm; you might not notice the dramatic consequences with a few short test cases, but sooner or later you (or someone) will try your code out on a significantly longer string, and the quadratic time will come back to bite you.
Anyway, there's no need. All you need to do is to look at the characters, or more accurately, look at each position between two characters, in order to find the positions of the beginnings of the words (where an alphabetic character follows a non-alphabetic character) and the ends of the words (where a non-alphabetic character follows an alphabetic character). Once the beginning and end of each word is discovered, the complete word can be added to the word list.
Note that we don't actually care what each character is, only whether it is alphabetic. So in the following code, I don't save the previous character; rather I save the boolean value of whether the previous character was alphabetic. At the start of the scan, previous_was_alphabetic is set to False, so if the first character in the string is alphabetic, that counts as the start of a word.
There's one little Python trick here, to handle the end of the string. If the last character in the string is alphabetic, then it's the end of a word, so it would be convenient to ensure that the string ends with a non-alphabetic character. But I don't really want to create a modified string, and I'd prefer not to have to write special purpose code for the end of the string. What I do instead is to use a slice; instead of looking at s[i] (the ith character), I use s[i:i+1], the one-character slice starting at position i. Conveniently, if i happens to be the length of s, then s[i:i+1] is an empty string, '', and even more conveniently, ''.isalpha() is False. So that will act as though there were an invisible non-alphabetic character at the end of the string.
This is not really very Pythonic, but your assignment seems to be insisting that you use a while loop rather than the much more natural for loop (which would require a different way of dealing with the end of the string).
def words_from(s):
    """Returns a list of the "words" (contiguous sequences of alphabetic
    characters) from the string s
    """
    words = []
    previous_was_alphabetic = False
    i = 0
    while i <= len(s):
        next_is_alphabetic = s[i:i+1].isalpha()
        if not previous_was_alphabetic and next_is_alphabetic:
            # i is the start of a word
            start = i
        elif previous_was_alphabetic and not next_is_alphabetic:
            # i is the position after the end of a word
            words.append(s[start:i])
        # Move to the next position
        previous_was_alphabetic = next_is_alphabetic
        i += 1
    return words
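For comparison, here is a rough sketch of what the for-loop variant mentioned above might look like. The name words_from_for is made up for this illustration. Because a for loop never visits the position just past the last character, the sketch handles a trailing word with an explicit check after the loop instead of the empty-slice trick.

```python
def words_from_for(s):
    """Return the runs of alphabetic characters in s, scanning with a for loop."""
    words = []
    start = None  # index where the current word began, or None when outside a word
    for i, ch in enumerate(s):
        if ch.isalpha():
            if start is None:
                start = i  # first letter of a new word
        elif start is not None:
            words.append(s[start:i])  # a non-letter closes the current word
            start = None
    if start is not None:
        # The string ended in the middle of a word, so close it here.
        words.append(s[start:])
    return words
```

Like the while-loop version, this never modifies s; it only records positions and slices once per word.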

I think you might want something like this:
s = "The weather is so lovely today. Jack (our Jack) – Jason - and Alex went to park..? "
# Raw string, so the backslash is taken literally instead of starting an escape sequence
punc = r'''!()-[]{};:'"\,–,<>./?##$%^&*_~'''
# Remove punctuation from the string
# using a loop and the punctuation string
for i in s:
    if i in punc:
        s = s.replace(i, "")
print(s.split())
output:
['The', 'weather', 'is', 'so', 'lovely', 'today', 'Jack', 'our', 'Jack', 'Jason', 'and', 'Alex', 'went', 'to', 'park']
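A fixed punctuation string only removes the characters someone remembered to list. As a hedged alternative, the sketch below keeps the same "clean, then split" idea but decides per character with str.isalpha(), turning everything else into a space. The function name alpha_words is invented for this example, and it does not use the while loop the assignment asks for.

```python
def alpha_words(s):
    """Split s into words, treating every non-alphabetic character as a separator."""
    # Replace each non-letter with a space so that split() sees clean boundaries.
    cleaned = "".join(ch if ch.isalpha() else " " for ch in s)
    return cleaned.split()
```

Note that digits also count as separators here, so "abc123def" comes back as two words.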

Related

How to convert a letter to lowercase after a quotation mark?

How to convert a letter to lowercase which is in a string after a quotation mark?
Like this:
"Trees Are Never Sad Look At Them Every Once In Awhile They'Re Quite Beautiful."
should become
"Trees Are Never Sad Look At Them Every Once In Awhile They're Quite Beautiful."

You can loop through the string and look for the apostrophe, then lowercase the single character that follows it:
string = "Trees Are Never Sad Look At Them Every Once In Awhile They'Re Quite Beautiful."
for i in range(len(string) - 1): # Stop one short so string[i+1] always exists
    if string[i] == "'": # Check for apostrophe
        # Lowercase only the character right after the apostrophe.
        # (str.replace would change every occurrence of that letter in the string.)
        string = string[:i+1] + string[i+1].lower() + string[i+2:]
print(string)
Output:
Trees Are Never Sad Look At Them Every Once In Awhile They're Quite Beautiful.

Using re.sub with a callback function we can try:
import re
inp = "Trees Are Never Sad Look At Them Every Once In Awhile They'Re Quite Beautiful."
output = re.sub(r"'([A-Z])", lambda m: "'" + m.group(1).lower(), inp)
print(output)
This prints:
Trees Are Never Sad Look At Them Every Once In Awhile They're Quite Beautiful.

Here is another way to do so using join():
data = "Trees Are Never Sad Look At Them Every Once In Awhile They'Re Quite Beautiful."
new_data = "".join(char.lower() if i and data[i-1] == "'" else char for i, char in enumerate(data))
print(new_data) # Trees Are Never Sad Look At Them Every Once In Awhile They're Quite Beautiful.
Explanation
We use enumerate() to iterate over each character in the string, knowing its index.
For each character, we check if the previous one is ' using data[i-1] == "'".
We also check that it is not the first character of the string, in which case data[i-1] would correspond to the last letter of the string (data[-1]).
We therefore convert the character to lowercase only if the two conditions are met:
i != 0, it can also be written if i, because an integer evaluates to True only if it is non-zero.
data[i-1] == "'"

Python - string index out of range issue

This is the question I was given to solve:
Create a program that inputs a phrase (like a famous quotation) and prints all of the words that start with h-z.
I solved the problem, but the first two methods didn't work and I wanted to know why:
#1 string index out of range
quote = input("enter a 1 sentence quote, non-alpha separate words: ")
word = ""
for character in quote:
    if character.isalpha():
        word += character.upper()
    else:
        if word[0].lower() >= "h":
            print(word)
            word = ""
        else:
            word = ""
I get the IndexError: string index out of range message for any words after "g". Shouldn't the else statement catch it? I don't get why it doesn't, because if I remove the brackets [] from word[0], it works.
#2: last word not printing
quote = input("enter a 1 sentence quote, non-alpha separate words: ")
word = ""
for character in quote:
    if character.isalpha():
        word += character.upper()
    else:
        if word.lower() >= "h":
            print(word)
            word = ""
        else:
            word = ""
In this example, it works to a degree. It eliminates any words before 'h' and prints words after 'h', but for some reason doesn't print the last word. It doesn't matter what quote I use, it doesn't print the last word even if it's after 'h'. Why is that?

You're indexing word[0], which accesses the first character of the string word. If word is empty (that is, word == ""), there is no first character to access, so you get an IndexError. This happens whenever a non-alphabetic character is reached while word is still empty: for example at the second of two consecutive separators (such as the space after a comma), or when the quote begins with a digit or a dash. The inner else never gets a chance to run, because the exception is raised while the if condition is being evaluated. Without the [0], comparing the empty string is legal ("" >= "h" is simply False), which is why that version doesn't crash.
The second snippet leaves off the last word because of the approach you're using. You walk through the sentence character by character and decide whether to print a word only after you have read all of it (which you know because you hit a non-alphabetic character such as a space). But the last character in your sentence isn't a separator; it's just the last letter of the last word. So the else branch is never executed for that word, and it is never printed.
I'd recommend an entirely different approach, using the method str.split(). This method is built into Python and transforms one string into a list of smaller strings, split on the character or substring you specify. So if I do
quote = "Hello this is a sentence"
words = quote.split(' ')
print(words)
you'll end up seeing this:
['Hello', 'this', 'is', 'a', 'sentence']
A couple of things to keep in mind on your next approach to this problem:
You need to account for empty words (like if I have two spaces in a row for some reason), and make sure they don't break the script.
You need to account for non-alphabetic characters like numbers and dashes. You can either ignore them or handle them differently, but you have to have something in place.
You need to make sure that you handle the last word at some point, even if the sentence doesn't end in a space character.
Good luck!
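If you would rather keep the character-by-character loop, the three points above can be sketched like this. The function name words_h_to_z is made up for the example, and it returns a list instead of printing so the result is easy to check. It assumes plain ASCII letters when comparing against "h".

```python
def words_h_to_z(quote):
    """Collect the words of quote whose first letter is in h-z, uppercased."""
    found = []
    word = ""
    # One trailing space acts as a sentinel, so the last word is flushed
    # through the same branch as every other word.
    for character in quote + " ":
        if character.isalpha():
            word += character
        else:
            # "word and ..." skips empty words, so word[0] is never evaluated
            # on an empty string (e.g. at the space after a comma).
            if word and word[0].lower() >= "h":
                found.append(word.upper())
            word = ""
    return found

print(words_h_to_z("Wheresoever you go, go with all your heart"))
```

Running it on the sample quote should list WHERESOEVER, YOU, WITH, YOUR and HEART, including the final word.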

Instead of what you're doing, you can iterate over each word in the string and check whether it begins with one of those letters. Read about str.split(): the argument is the separator, in this case ' ' since you want to split into words, and it returns a list of strings. Iterate over that list in a loop and it should work.

Most Frequent Character - User Submitted String without Dictionaries or Counters

Currently, I am in the midst of writing a program that calculates all of the non white space characters in a user submitted string and then returns the most frequently used character. I cannot use collections, a counter, or the dictionary. Here is what I want to do:
Split the string so that white space is removed. Then count each character and return a value. I would have something to post here but everything I have attempted thus far has been met with critical failure. The closest I came was this program here:
strin=input('Enter a string: ')
fc=[]
nfc=0
for ch in strin:
    i=0
    j=0
    while i<len(strin):
        if ch.lower()==strin[i].lower():
            j+=1
        i+=1
    if j>nfc and ch!=' ':
        nfc=j
        fc=ch
print('The most frequent character in string is: ', fc )
If you can fix this code or tell me a better way of doing it that meets the required criteria that would be helpful. And, before you say this has been done a hundred times on this forum please note I created an account specifically to ask this question. Yes there are a ton of questions like this but some that are reading from a text file or an existing string within the program. And an overwhelmingly large amount of these contain either a dictionary, counter, or collection which I cannot presently use in this chapter.

Just do it "the old way". Create a list (okay it's a collection, but a very basic one so shouldn't be a problem) of 26 zeroes and increase according to position. Compute max index at the same time.
strin="lazy cat dog whatever"
l=[0]*26
maxindex=-1
maxvalue=0
for c in strin.lower():
    pos = ord(c)-ord('a')
    if 0<=pos<=25:
        l[pos]+=1
        if l[pos]>maxvalue:
            maxindex=pos
            maxvalue = l[pos]
print("max count {} for letter {}".format(maxvalue,chr(maxindex+ord('a'))))
result:
max count 3 for letter a

As an alternative to Jean's solution (not using a list that allows for one-pass over the string), you could just use str.count here which does pretty much what you're trying to do:
strin = input("Enter a string: ").strip()
maxcount = float('-inf')
maxchar = ''
for char in strin:
    c = strin.count(char) if not char.isspace() else 0
    if c > maxcount:
        maxcount = c
        maxchar = char
print("Char {}, Count {}".format(maxchar, maxcount))
If lists are available, I'd use Jean's solution. He doesn't use an O(N) function N times :-)
P.s: you could compact this with one line if you use max:
max(((strin.count(i), i) for i in strin if not i.isspace()))

To keep track of several counts for different characters, you have to use a collection (even if it is a global namespace implemented as a dictionary in Python).
To print the most frequent non-space character while supporting arbitrary Unicode strings:
import sys
text = input("Enter a string (case is ignored)").casefold() # default caseless matching
# count non-space character frequencies
counter = [0] * (sys.maxunicode + 1)
for nonspace in map(ord, ''.join(text.split())):
    counter[nonspace] += 1
# find the most common character
print(chr(max(range(len(counter)), key=counter.__getitem__)))
A similar list in Cython was the fastest way to find frequency of each character.

How do I reference a different character in a string while iterating through the string in Python?

I'm trying to write a script that can take doubled letters (aa or tt, for instance) and change them to that letter followed by ː, the length symbol (aa would become aː, and tt would become tː). I want to do this by iterating through the string, and replacing any character in the string that's the same as the last one with a ː. How do I do that?

You could try something like this. I iterate through the string and check each letter against the previous letter. If they match, it performs the replacement; if not, it moves on and stores the new previous letter in previousletter. I also used the .lower() method to match letters even if one is capitalized and one is not.
string = "Tthis is a testt of the ddouble letters"
previousletter = string[0]
for letter in string:
    if letter.lower() == previousletter.lower():
        string = string.replace("%s%s" % (previousletter, letter) , "%s:" % (letter))
    previousletter = letter
print(string)
And here is the output:
t:his is a test: of the d:ouble let:ers
I hope this helps and feel free to ask any questions on the code that I used. Happy programming!
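Another way to look at the previous character while iterating is to build the result in a list instead of calling replace on the string being scanned. The sketch below is only one possible reading of the question: the name mark_long and the length_mark parameter are invented here, the first letter of a pair keeps its original case, and a third repeat in a row is treated as the start of a new letter.

```python
def mark_long(text, length_mark="ː"):
    """Replace the second of two identical adjacent letters with a length mark."""
    out = []
    prev = None  # the last letter that was emitted unchanged, or None
    for ch in text:
        if prev is not None and ch.isalpha() and ch.lower() == prev.lower():
            out.append(length_mark)
            prev = None  # the pair is used up; a further repeat starts over
        else:
            out.append(ch)
            prev = ch
    return "".join(out)

print(mark_long("Tthis is a testt of the ddouble letters"))
```

The isalpha() check keeps doubled spaces or punctuation from being turned into length marks.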

Python script to insert space between different character types: Why is this so slow?

I'm working with some text that has a mix of languages, which I've already done some processing on and which is in the form of a list of single characters (called "letters"). I can tell which language each character is by simply testing if it has case or not (with a small function called "test_lang"). I then want to insert a space between characters of different types, so I don't have any words that are a mix of character types. At the same time, I want to insert a space between words and punctuation (which I defined in a list called "punc"). I wrote a script that does this in a very straightforward way that made sense to me (below), but apparently it is the wrong way to do it, because it is incredibly slow.
Can anyone tell me what the better way to do this is?
# Add a space between Arabic/foreign mixes, and between words and punc
cleaned = ""
i = 0
while i <= len(letters)-2: #range excludes last letter to avoid Out of Range error for i+1
    cleaned += letters[i]
    # words that have case are Latin; otherwise Arabic
    if test_lang(letters[i]) != test_lang(letters[i+1]):
        cleaned += " "
    if letters[i] in punc or letters[i+1] in punc:
        cleaned += " "
    i += 1
cleaned += letters[len(letters)-1] # add in last letter

There are a few things going on here:
You call test_lang() on every letter in the string twice; this is probably the main reason this is slow.
Concatenating strings in Python isn't very efficient; you should instead use a list or generator and then use str.join() (most likely, ''.join()).
Here is the approach I would take, using itertools.groupby():
from itertools import groupby
def keyfunc(letter):
    return (test_lang(letter), letter in punc)
cleaned = ' '.join(''.join(g) for k, g in groupby(letters, keyfunc))
This will group the letters into consecutive letters of the same language and whether or not they are punctuation, then ''.join(g) converts each group back into a string, then ' '.join() combines these strings adding a space between each string.
Also, as noted in comments by DSM, make sure that punc is a set.

Every time you perform a string concatenation, a new string is created. The longer the string gets, the longer each concatenation takes.
http://en.wikipedia.org/wiki/Schlemiel_the_Painter's_algorithm
You might be better off declaring a list big enough to store the characters of the output, and joining them at the end.

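I suggest an entirely different solution that should be very fast:
import re
cleaned = re.sub(r"(?<!\s)\b(?!\s)", " ", "".join(letters))
This inserts a space at every word boundary (defining words as sequences of alphanumeric characters, including accented characters), unless it's a word boundary next to whitespace. Since letters is a list of single characters, it is joined into a string first. Don't pass flags=re.LOCALE here: Python 3 rejects that flag for str patterns, which are Unicode-aware by default.
This splits between words and punctuation. Be aware that with Unicode matching, Latin and Arabic letters both count as word characters, so \b alone does not separate the two scripts; that part still needs a check like test_lang.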

Assuming test_lang is not the bottleneck, I'd try:
cleaned = ''.join(
    x + ' '
    if x in punc or y in punc or test_lang(x) != test_lang(y)
    else x
    for x, y in zip(letters[:-1], letters[1:])
) + letters[-1]  # zip stops one short, so add the last letter back

Here is a solution that uses yield. I would be interested to know whether this runs any faster than your original solution.
This avoids all the indexing in the original. It just iterates through the input, holding onto a single previous character.
This should be easy to modify if your requirements change in the future.
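ch_sep = ' '
def _sep_chars_by_lang(s_input):
    itr = iter(s_input)
    ch_prev = next(itr, None)
    if ch_prev is None:
        return  # empty input: nothing to yield
    yield ch_prev
    # A for loop ends cleanly when itr is exhausted. A bare next() inside
    # "while True" would raise StopIteration in the generator, which is a
    # RuntimeError since Python 3.7.
    for ch in itr:
        if test_lang(ch_prev) != test_lang(ch) or ch_prev in punc:
            yield ch_sep
        yield ch
        ch_prev = ch
def sep_chars_by_lang(s_input):
    return ''.join(_sep_chars_by_lang(s_input))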

Keeping the basic logic of the OP's original code, we speed it up by not doing all that [i] and [i+1] indexing. We use a prev and next reference that scan through the string, maintaining prev one character behind next:
# Add a space between Arabic/foreign mixes, and between words and punc
cleaned = ''
prev = letters[0]
for next in letters[1:]:
    cleaned += prev
    if test_lang(prev) != test_lang(next):
        cleaned += ' '
    if prev in punc or next in punc:
        cleaned += ' '
    prev = next
cleaned += prev  # prev is now the last letter (also correct when there is only one)
Testing on a string of 10 million characters shows this is about twice the speed of the OP code. The "string concatenation is slow" complaint is obsolete, as others have pointed out (at least in CPython, which can often extend the string in place for this pattern). Running the test again using the ''.join(...) idiom shows slightly slower execution than using string concatenation.
Further speedup may come through not calling the test_lang() function but by inlining some simple code. Can't comment as I don't really know what test_lang() does :).
Edit: removed a 'return' statement that should not have been there (testing remnant!).
Edit: Could also speedup by not calling test_lang() twice on the same character (on next in one loop and then prev in the following loop). Cache the test_lang(next) result.
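The caching idea from the last edit could be sketched as follows. This is an untimed illustration, not the code that was benchmarked above: the function name space_between_types is invented, has_case is a hypothetical stand-in for the asker's test_lang, and the pieces are collected in a list and joined once at the end (plain += would work here too).

```python
def has_case(ch):
    """Hypothetical stand-in for test_lang: True for characters that have case."""
    return ch.lower() != ch.upper()

def space_between_types(letters, test_lang, punc):
    """Same spacing rules as the loop above, calling test_lang once per character."""
    if not letters:
        return ""
    parts = []
    prev = letters[0]
    prev_lang = test_lang(prev)
    prev_punc = prev in punc
    for ch in letters[1:]:
        lang = test_lang(ch)   # computed once, then reused as prev_lang next time
        is_punc = ch in punc
        parts.append(prev)
        if prev_lang != lang:
            parts.append(" ")
        if prev_punc or is_punc:
            parts.append(" ")
        prev, prev_lang, prev_punc = ch, lang, is_punc
    parts.append(prev)  # the last letter
    return "".join(parts)

print(space_between_types(list("ab12.c"), has_case, {".", ","}))
```

As in the original loop, a position where both the language and the punctuation rule apply gets two spaces.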
