Splitting string into different line lengths - python

I'm trying to split a variable-length string across several lines with different, predefined maximum lengths. The code below fails with KeyError: 6 when I run it in Python Tutor (I don't have access to a proper Python IDE right now). I guess this means my while loop isn't working properly and keeps incrementing lineNum, but I'm not sure why. Is there a better way to do this, or is this easily fixable?
The code:
import re
#Dictionary containing the line number as key and the max line length
lineLengths = {
1:9,
2:11,
3:12,
4:14,
5:14
}
inputStr = "THIS IS A LONG DESC 7X7 NEEDS SPLITTING" #Test string, should be split on the spaces and around the "X"
splitted = re.split("(?:\s|((?<=\d)X(?=\d)))",inputStr) #splits inputStr on white space and where X is surrounded by numbers eg. dimensions
lineNum = 1 #initialises the line number at 1
lineStr1 = "" #initialises each line as a string
lineStr2 = ""
lineStr3 = ""
lineStr4 = ""
lineStr5 = ""
#Dictionary creating dynamic line variables
lineNumDict = {
1:lineStr1,
2:lineStr2,
3:lineStr3,
4:lineStr4,
5:lineStr5
}
if len(inputStr) > 40:
    print "The short description is longer than 40 characters"
else:
    while lineNum <= 5:
        for word in splitted:
            if word != None:
                if len(lineNumDict[lineNum]+word) <= lineLengths[lineNum]:
                    lineNumDict[lineNum] += word
                else:
                    lineNum += 1
            else:
                if len(lineNumDict[lineNum])+1 <= lineLengths[lineNum]:
                    lineNumDict[lineNum] += " "
                else:
                    lineNum += 1
lineOut1 = lineStr1.strip()
lineOut2 = lineStr2.strip()
lineOut3 = lineStr3.strip()
lineOut4 = lineStr4.strip()
lineOut5 = lineStr5.strip()
I've taken a look at this answer but don't have any real understanding of C#: Split large text string into variable length strings without breaking words and keeping linebreaks and spaces

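It doesn't work because the "for word in splitted" loop is nested inside the "while lineNum <= 5" loop. The whole word list is processed again on every pass of the while loop, and lineNum is incremented inside the for loop without the lineNum <= 5 condition being rechecked, so it eventually reaches 6 and the dictionary lookup raises the KeyError. You have to check lineNum inside the for loop instead: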
if len(inputStr) > 40:
    print "The short description is longer than 40 characters"
else:
    for word in splitted:
        if lineNum > 5:
            break
        if word != None:
            if len(lineNumDict[lineNum]+word) <= lineLengths[lineNum]:
                lineNumDict[lineNum] += word
            else:
                lineNum += 1
        else:
            if len(lineNumDict[lineNum])+1 <= lineLengths[lineNum]:
                lineNumDict[lineNum] += " "
            else:
                lineNum += 1
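Also, lineStr1, lineStr2 and so on won't be changed: strings are immutable, so += on a dict entry binds that entry to a new string and leaves the original variables untouched. You have to read the results from the dict directly. I tried it and got this result: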
print("Lines: %s" % lineNumDict)
Gives:
Lines: {1: 'THIS IS A', 2: 'LONG DESC 7', 3: '7 NEEDS ', 4: '', 5: ''}
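
Note that in the output above the "X" and the word "SPLITTING" are missing: a token that doesn't fit on the current line only triggers the move to the next line and is never added anywhere. A minimal Python 3 sketch that carries such a token over to the next line might look like this. The function name fill_lines and the list-based line lengths are assumptions for illustration, not part of the original code.

```python
import re

def fill_lines(text, max_lengths):
    """Greedily pack `text` into lines no longer than the given limits."""
    lines = [""]
    for word in text.split():
        # Break "7X7" into "7", "X", "7" so a line may end just before the X.
        pieces = [p for p in re.split(r"(?<=\d)(X)(?=\d)", word) if p]
        for i, piece in enumerate(pieces):
            # Only the first piece of a word is separated by a space,
            # and only if the current line already has content.
            sep = " " if i == 0 and lines[-1] else ""
            limit = max_lengths[len(lines) - 1]
            if len(lines[-1]) + len(sep) + len(piece) <= limit:
                lines[-1] += sep + piece
            elif len(lines) < len(max_lengths) and len(piece) <= max_lengths[len(lines)]:
                # Carry the piece over to the next line instead of dropping it.
                lines.append(piece)
            else:
                raise ValueError("text does not fit in the given lines")
    return lines

print(fill_lines("THIS IS A LONG DESC 7X7 NEEDS SPLITTING", [9, 11, 12, 14, 14]))
# ['THIS IS A', 'LONG DESC 7', 'X7 NEEDS', 'SPLITTING']
```

The sketch returns only the lines that were actually used and raises an error rather than silently truncating when the text does not fit.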

for word in splitted:
    ...
    lineNum += 1
Because lineNum += 1 sits inside the for loop, lineNum can be incremented once for every element of splitted before the while condition is checked again.

I wonder if a properly commented regular expression wouldn't be easier to understand?
import re
lineLengths = {1:9,2:11,3:12,4:14,5:14}
inputStr = "THIS IS A LONG DESC 7X7 NEEDS SPLITTING"
pat = r"""
(?:                     # non-capture around the line as we want to drop leading spaces
    \s*                 # drop leading spaces
    (.{{1,{max_len}}})  # up to max_len characters, will be added through 'format'
    (?=[\sX]|$)         # and using whitespace, X and string ending as terminators
                        # but without capturing as we need X to go into the next match
)?                      # and ignoring missing matches if not all lines are necessary
"""
# build a pattern matching up to 5 lines with the corresponding max lengths
pattern = ''.join(pat.format(max_len=x) for x in lineLengths.values())
re.match(pattern, inputStr, re.VERBOSE).groups()
# Out: ('THIS IS A', 'LONG DESC 7', 'X7 NEEDS', 'SPLITTING', None)
Also, there is no real point in using a dict for lineLengths; a list would do nicely.

Related

How do I create a word count that does not include blank spaces?

I am a beginner in Python, and I was tasked with creating a function that accepts a string as a parameter and returns the number of words in it.
I am having trouble with the spaces and with the blank string I passed in. I feel like I am missing something, but I'm a bit lost as to what is missing or what I've messed up. Also, we can't use split.
Any guidance or help would be greatly appreciated.
This is what I have so far:
def word_count(str):
    count = 1
    for i in str:
        if (i == ' '):
            count += 1
    print (count)
word_count('hello') --> Output = 1 (so far correct)
word_count('how are you?') --> Output = 3 (also correct/at least what i am looking for)
word_count(' this string has wide spaces ') --> Output = 7 (Should be 5...)
word_count(' ') --> Output = 2 (Should be ''. I think it's doing count(1+1))
Use this code as an improvement:
def word_count(str):
    count = 1
    for i in str:
        if (i == ' '):
            count += 1
    if str[0] == ' ':
        count -= 1
    if str[-1] == ' ':
        count -= 1
    print (count)
Your error occurs because you're counting spaces that appear at the beginning or the end of the string.
Note that you can't pass an empty string "": it has no characters, so indexing it with str[0] raises an IndexError.
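
The version above still overcounts when several spaces appear in a row, and it fails on an empty string. One way to handle both without split is to count the transitions from "not in a word" to "in a word". This is a sketch in Python 3; the flag name in_word is an assumption for illustration.

```python
def word_count(text):
    """Count words separated by one or more spaces, without using split."""
    count = 0
    in_word = False  # True while the loop is inside a run of non-space characters
    for ch in text:
        if ch == ' ':
            in_word = False
        elif not in_word:
            # First character of a new word
            in_word = True
            count += 1
    return count

print(word_count('  this   string  has wide   spaces '))
```

Because a word is counted only when it starts, leading, trailing, and repeated spaces have no effect, and an empty or all-space string yields 0.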
The problem seems to be when there is a blank in front or behind the sentence. A way to fix this is by using a built in function 'strip'. For example, we can do the following:
example_string = " This is a string "
print(example_string)
stripped_string = example_string.strip()
print(stripped_string)
The output of the first string will be
" This is a string "
The output of the second string will be
"This is a string"
What you can do is the following:
def word_count(input_str):
    return len(input_str.split())
count = word_count(' this is a test ')
print (count)
It basically removes the leading/trailing spaces and splits the phrase into
a list.
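If, on the off chance, you need to use a loop:
def word_count(input_str):
    input_str = input_str.strip()
    if not input_str:  # nothing left after removing the outer spaces
        return 0
    count = 1  # a non-empty stripped string contains at least one word
    for i in input_str:
        if (i == ' '):
            count += 1
    return count
count = word_count(' this is a test ')
print (count)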

Counting specific punctuation symbols in a given text, without using regex or other modules

I have a text file with a huge text written in paragraphs.
I need to count certain punctuation symbols:
without using any module, not even regex
count , and ;
also needs to count ' and -, but only under certain circumstances. Specifically:
count ' marks, but only when they appear as apostrophes surrounded by letters, i.e. indicating a contraction such as "shouldn't" or "won't". (Apostrophe is being included as an indication of more informal writing, perhaps direct speech.)
count - signs, but only when they are surrounded by letters, indicating a compound-word, such as "self-esteem".
Any other punctuation or letters, e.g. digits, should be regarded as white space, so serve to end words.
Note: Some of the texts we will use include double hyphen, i.e. --. This is to be regarded as a space character.
I first created a string and stored some punctuations inside it for example punctuation_string = ";./'-" but it is giving me the total; what I need is count for individual punctuation.
Because of that, I have to change the certain_char variable a number of times.
with open("/Users/abhishekabhishek/downloads/l.txt") as f:
    text_lis = f.read().split()
punctuation_count = {}
certain_char = "/"
freq_count = 0
for word in text_lis:
    for char in word:
        if char in certain_char:
            freq_count += 1
punctuation_count[certain_char] = freq_count
I need values to be displayed like this:
; 40
. 10
/ 5
' 16
etc.
but what I get is total (71).
You will need to create a dictionary where each entry stores the count of each of those punctuation characters.
For commas and semicolons, we can simply use a string search to count the number of occurrences in a word. But we'll need to handle ' and - slightly differently.
This should take care of all the cases:
with open("/Users/abhishekabhishek/downloads/l.txt") as f:
    text_words = f.read().split()
punctuation_count = {}
punctuation_count[','] = 0
punctuation_count[';'] = 0
punctuation_count["'"] = 0
punctuation_count['-'] = 0
def search_for_single_quotes(word):
    single_quote = "'"
    search_char_index = word.find(single_quote)
    search_char_count = word.count(single_quote)
    # Skip words with no quote or with more than one quote
    if search_char_index == -1 or search_char_count != 1:
        return
    index_before = search_char_index - 1
    index_after = search_char_index + 1
    # Check if the characters before and after the quote are alphabets,
    # and the alphabet after the quote is the last character of the word.
    # Will detect `won't`, `shouldn't`, but not `ab'cd`, `y'ess`
    if index_before >= 0 and word[index_before].isalpha() and \
            index_after == len(word) - 1 and word[index_after].isalpha():
        punctuation_count[single_quote] += 1
def search_for_hyphens(word):
    hyphen = "-"
    search_char_index = word.find(hyphen)
    if search_char_index == -1:
        return
    index_before = search_char_index - 1
    index_after = search_char_index + 1
    # Check if the character before and after hyphen is an alphabet.
    # You can also change it to check for characters as well as numbers
    # depending on your use case.
    if index_before >= 0 and word[index_before].isalpha() and \
            index_after < len(word) and word[index_after].isalpha():
        punctuation_count[hyphen] += 1
for word in text_words:
    for search_char in [',', ';']:
        search_char_count = word.count(search_char)
        punctuation_count[search_char] += search_char_count
    search_for_single_quotes(word)
    search_for_hyphens(word)
print(punctuation_count)
The following should work:
text = open("/Users/abhishekabhishek/downloads/l.txt").read()
text = text.replace("--", " ")
for symbol in "-'":
    text = text.replace(symbol + " ", "")
    text = text.replace(" " + symbol, "")
for symbol in ".,/'-":
    print (symbol, text.count(symbol))
Because you don't want to import anything this will be slow and will take some time, but it should work:
file = open("l.txt")  # placeholder name: enter your own file path here
text = file.read().replace('--', ' ')  # a double hyphen counts as a space
file.close()
search_values = {',':0, ';':0, "'":0, '-':0} # a dictionary saves the number of occurrences
for index in range(len(text)):
    ch = text[index]
    if ch == ',' or ch == ';':
        search_values[ch] += 1
    elif ch == "'" or ch == '-':
        # count these only when a letter sits directly on both sides
        has_letter_before = index > 0 and text[index - 1].isalpha()
        has_letter_after = index + 1 < len(text) and text[index + 1].isalpha()
        if has_letter_before and has_letter_after:
            search_values[ch] += 1
for key in search_values:
    print(key + ': ' + str(search_values[key]))
This is obviously not optimal and it is better to use regex here, but it should work.
Feel free to ask if any questions should arise.

How to compare sentence character by character in python?

I want to write code that counts the number of words in a given sentence using character comparison. Below is the code I have written; I am not allowed to use utilities like split(). Could you please point out where I am making mistakes? I am a novice in Python and am currently trying to figure out how to do character-by-character comparison to find simple counts of words, lines, and strings without using built-in utilities.
Input Sentence : I am XYZ
Input_Sentence = raw_input("Enter your sentence: ")
print Input_Sentence
count = 0
i=0
while(Input_Sentence[i] != "\n"):
    if(Input_Sentence[i] == ' '):
        count=count+1
        i+=1
    else:
        i+=1
print ('Number of Words in a given sentence is :' +str(count))
First of all, I wouldn't use a while loop in this context. Why not use a for loop?
for char in Input_Sentence:
With this you iterate over every character.
Then you can keep the rest of your code and check:
if char == ' ':
# initialize the counter
word_count = 0
last_space_index = 0
# loop through each character in the sentence (assuming Input_Sentence is a string)
for i, x in enumerate(Input_Sentence): # enumerate to get the index of the character
    # if a space is found (or newline character for end of sentence)
    if x in (' ', '\n'):
        word_count += 1 # increment the counter
        last_space_index = i # set the index of the last space found
# check whether characters remain after the last space (this is in case the sentence does not end with a newline character or a space)
if len(Input_Sentence) > (last_space_index + 1):
    word_count += 1
# print the total number of words
print 'Number of words:', word_count
The following will avoid errors if there's a space at the beginning or the end of the sentence.
Input_Sentence = raw_input("Enter your sentence: ")
print Input_Sentence
count = 0
sentence_length = len(Input_Sentence)
for i in range(sentence_length):
    if Input_Sentence[i] == " ":
        if i not in (0, sentence_length - 1):
            count += 1
count += 1
print "There are %s words in the sentence \"%s\"." % (count, Input_Sentence)
You may use try-except syntax.
In your code you used while(Input_Sentence[i] != "\n") to find when the sentence comes to an end. If you just print the output at every step before i += 1 like this:
...
while(Input_Sentence[i] != "\n"):
    ...
        print i,Input_Sentence[i]
        i+=1
    else:
        print i,Input_Sentence[i],'*'
        i+=1
...
you can see for yourself that the output is something like this:
Enter your sentence: Python is good
Python is good
0 P *
1 y *
2 t *
3 h *
4 o *
5 n *
6
7 i *
8 s *
9
10 g *
11 o *
12 o *
13 d *
Traceback (most recent call last):
File "prog8.py", line 19, in <module>
while(Input_Sentence[i] != "\n"):
IndexError: string index out of range
This means that the code works fine up to the end of the input sentence. raw_input strips the trailing newline, so the string never contains "\n"; once i reaches the length of the string, evaluating Input_Sentence[i] raises an IndexError. One way around this is Python's exception handling: when an exception occurs inside the try block, the rest of that block is skipped and the except block runs instead.
Input_Sentence = raw_input("Enter your sentence: ")
print Input_Sentence
count = 0
i=0
try:
    while (Input_Sentence[i] != "\n"):
        if (Input_Sentence[i] == ' '):
            count=count+1
            i+=1
        else:
            i+=1
except IndexError:
    # reached the end of the string: count the last word
    count = count+1
print ('Number of Words in a given sentence is :' +str(count))
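
Since the question also mentions counting lines, here is a sketch in Python 3 that counts both words and newline characters in a single character-by-character pass, with no split() and no exception handling. The function name count_text and the choice to treat space, tab, and newline as separators are assumptions for illustration.

```python
def count_text(text):
    """Return (words, newlines) found in `text`, scanning one character at a time."""
    words = 0
    newlines = 0
    in_word = False  # True while inside a run of non-separator characters
    for ch in text:
        if ch == "\n":
            newlines += 1
        if ch in " \t\n":
            in_word = False
        elif not in_word:
            # A separator-to-letter transition marks the start of a word
            in_word = True
            words += 1
    return words, newlines

print(count_text("I am XYZ"))
```

A for loop over the string stops by itself at the last character, so no end-of-sentence marker has to be tested.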

Counting words starting with a character

Write a function that accepts a string and a character as input and
returns the count of all the words in the string which start with the
given character. Assume that capitalization does not matter here. You
can assume that the input string is a sentence i.e. words are
separated by spaces and consists of alphabetic characters.
This is my code:
def count_input_character (input_str, character):
    input_str = input_str.lower()
    character = character.lower()
    count = 0
    for i in range (0, len(input_str)):
        if (input_str[i] == character and input_str[i - 1] == " "):
            count += 1
    return (count)
#Main Program
input_str = input("Enter a string: ")
character = input("Enter character whose occurances are to be found in the given input string: ")
result = count_input_character(input_str, character)
#print(result)
The only part missing here is how to check whether the first word of the sentence starts with the user-given character. Consider this output:
Your answer is NOT CORRECT Your code was tested with different inputs. > For example when your function is called as shown below:
count_input_character ('the brahman the master of the universe', 't')
####### Your function returns ############# 2 The returned variable type is: type 'int'
### Correct return value should be ######## 3 The returned variable type is: type 'int'
Your function misses the first t because in this line
if (input_str[i] == character and input_str[i - 1] == " "):
when i is 0, then input_str[i - 1] is input_str[-1] which Python will resolve as the last character of the string!
To fix this, you could change your condition to
if input_str[i] == character and (i == 0 or input_str[i - 1] == " "):
Or use str.split with a list comprehension, or a regular expression like r'(?i)\b%s', where (?i) means "ignore case", \b is a word boundary, and %s is a placeholder for the character.
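
The regular-expression variant mentioned above could be sketched as follows in Python 3. The function name count_words_starting_with is an assumption for illustration, and re.escape is added so that a character with a special meaning in regular expressions is matched literally.

```python
import re

def count_words_starting_with(sentence, character):
    """Count words in `sentence` that begin with `character`, ignoring case."""
    # (?i) ignores case; \b anchors the character to the start of a word
    pattern = r"(?i)\b%s" % re.escape(character)
    return len(re.findall(pattern, sentence))

print(count_words_starting_with('the brahman the master of the universe', 't'))
# 3
```

This relies on the assumption stated in the exercise that words consist of alphabetic characters and are separated by spaces; \b would also match after an apostrophe or hyphen inside a word.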
Instead of looking for spaces, you could split input_str on whitespace, this would produce a list of words that you could then test against character. (Pseudocode below)
function F sentence, character {
    l = <sentence split by whitespace>
    count = 0
    for word in l {
        if firstchar(word) == character {
            count = count + 1
        }
    }
    return count
}
Although it doesn't fix your specific bug, for educational purposes, please note you could rewrite your function like this using a list comprehension:
def count_input_character (input_str, character):
    return len([x for x in input_str.lower().split() if x.startswith(character.lower())])
or even more efficiently (thanks to tobias_k):
def count_input_character (input_str, character):
    return sum(w.startswith(character.lower()) for w in input_str.lower().split())
def c_upper(text, char):
    text = text.title() #set leading char of words to uppercase
    char = char.upper() #set given char to uppercase
    k = 0 #counter
    for i in text:
        if i.istitle() and i == char: #checking conditions for problem, where i is a char in a given string
            k = k + 1
    return k

How to count the number of words in a sentence, ignoring numbers, punctuation and whitespace?

How would I go about counting the words in a sentence? I'm using Python.
For example, I might have the string:
string = "I am having a very nice 23!#$ day. "
That would be 7 words. I'm having trouble with the random amount of spaces after/before each word as well as when numbers or symbols are involved.
str.split() without any arguments splits on runs of whitespace characters:
>>> s = 'I am having a very nice day.'
>>>
>>> len(s.split())
7
From the linked documentation:
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
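
To make the difference described in the documentation concrete, this small Python 3 sketch compares split() with no argument against split(" ") on a string with extra spaces. The sample string is an assumption for illustration.

```python
s = "  I am   having a very nice day.  "

default_split = s.split()      # runs of whitespace act as one separator
explicit_split = s.split(" ")  # every single space is a separator

print(default_split)
print(explicit_split)
print(len(default_split))
```

With an explicit " " separator, the result contains empty strings for the leading, trailing, and repeated spaces, so its length is not a usable word count.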
You can use re.findall():
import re
line = " I am having a very nice day."
count = len(re.findall(r'\w+', line))
print (count)
import string
s = "I am having a very nice 23!#$ day. "
sum([i.strip(string.punctuation).isalpha() for i in s.split()])
The statement above goes through each chunk of text and strips punctuation from both ends before checking whether the chunk consists only of letters.
This is a simple word counter using regex. The script includes a loop which you can terminate when you're done.
#word counter using regex
import re
while True:
    string = raw_input("Enter the string: ")
    if string == "Done": #command to terminate the loop
        break
    count = len(re.findall("[a-zA-Z_]+", string))
    print (count)
print ("Terminated")
OK, here is my version. I noticed that you want your output to be 7, which means you don't want to count special characters and numbers. So here is the regex pattern:
re.findall("[a-zA-Z_]+", string)
Here [a-zA-Z_] matches any character between a-z (lowercase) or A-Z (uppercase), or an underscore.
About spaces: if you want to remove all extra spaces, just do:
string = string.rstrip().lstrip() # Remove all extra spaces at the start and at the end of the string
while "  " in string: # While there are 2 spaces between words in our string...
    string = string.replace("  ", " ") # ... replace them by one space!
def wordCount(mystring):
    tempcount = 0
    count = 1
    try:
        for character in mystring:
            if character == " ":
                tempcount +=1
                if tempcount ==1:
                    count +=1
                else:
                    tempcount +=1
            else:
                tempcount=0
        return count
    except Exception:
        error = "Not a string"
        return error
mystring = "I am having a very nice 23!#$ day."
print(wordCount(mystring))
The output is 8.
How about using a simple loop to count the number of spaces?
txt = "Just an example here move along"
count = 1
for i in txt:
    if i == " ":
        count += 1
print(count)
import string
sentence = "I am having a very nice 23!#$ day. "
# Remove all punctuation
sentence = sentence.translate(str.maketrans('', '', string.punctuation))
# Remove all digits
sentence = ''.join([char for char in sentence if not char.isdigit()])
count = 0
# Count every position where a non-space character is followed by a space
for index in range(len(sentence)-1):
    if sentence[index+1].isspace() and not sentence[index].isspace():
        count += 1
print(count)
