How to separate using characters instead of whitespace in Python

I have a text file, for example:
test case 1 Pass
Test case 2 Pass
etc etc etc
I am able to separate the strings by whitespace using the split() function, but I want to separate them on the keyword "Pass"/"Fail". How should I go about it?
My current code separates on whitespace, but not all text files will have similar values; they will, however, all contain the "Pass" or "Fail" keywords.
filestr = ''
f = open('/Users/shashankgoud/Downloads/abc/index.txt',"r")
data=f.read()
for line in data.split('\n'):
    strlist = line.split(' ')
    filestr += (' '.join(strlist[:3]) +','+','.join(strlist[3:]))
    filestr += '\n'
print(filestr)
f1 = open('/Users/shashankgoud/Downloads/abc/index.xlsx',"w")
f1.write(filestr)
f1.close()

You can use the re module for that, for example:
import re
txt = "test case 1 Pass Test case 2 Pass etc etc etc"
pattern = re.compile(r'(Pass|Fail)')
parts = pattern.split(txt)
joined_parts = [
    parts[i] + parts[i + 1] for i in range(0, len(parts) - 1, 2)
]
joined_parts += [parts[-1]]
print(joined_parts)
>>> ['test case 1 Pass', ' Test case 2 Pass', ' etc etc etc']
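Applied to the file from the question, the same idea can be run line by line. This is a minimal sketch, assuming every line ends in Pass or Fail; the helper name split_on_status and the sample lines are made up for illustration and are not part of the original answer:

```python
import re

# Capture the keyword so that re.split keeps it in the result
STATUS = re.compile(r'\s+(Pass|Fail)\b')

def split_on_status(line):
    """Return (name, status), or (line, None) when no keyword is found."""
    parts = STATUS.split(line.strip(), maxsplit=1)
    if len(parts) < 3:
        return line.strip(), None
    return parts[0], parts[1]

# Hypothetical sample lines standing in for the text file
lines = ["test case 1 Pass", "Test case 2 Pass", "login with bad password Fail"]
for line in lines:
    name, status = split_on_status(line)
    print(f"{name},{status}")
```

Because the split is anchored on the keyword, the number of words before Pass or Fail no longer matters.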

Related

Python: While reading the file and counting the words in a line, I want to count words coming between " " or ' ' as a single word

I have a file in which I have to count the number of words in each line, but there is a catch: whatever comes between ' ' or " " should be counted as a single word.
Example file:
TopLevel
DISPLAY "In TopLevel. Starting to run program"
PERFORM OneLevelDown
DISPLAY "Back in TopLevel."
STOP RUN.
For the above file the count of words in each line has to be as below:
Line: 1 has: 1 words
Line: 2 has: 2 words
Line: 3 has: 2 words
Line: 4 has: 2 words
Line: 5 has: 2 words
But I am getting as below:
Line: 1 has: 1 words
Line: 2 has: 7 words
Line: 3 has: 2 words
Line: 4 has: 4 words
Line: 5 has: 2 words
from os import listdir
from os.path import isfile, join
srch_dir = r'C:\Users\sagrawal\Desktop\File'
onlyfiles = [srch_dir+'\\'+f for f in listdir(srch_dir) if isfile(join(srch_dir, f))]
for i in onlyfiles:
    index = 0
    with open(i,mode='r') as file:
        lst = file.readlines()
    for line in lst:
        cnt = 0
        index += 1
        linewrds=line.split()
        for lwrd in linewrds:
            if lwrd:
                cnt = cnt +1
        print('Line:',index,'has:',cnt,' words')
If you only have this simple format (no nested quotes or escaped quotes), you could use a simple regex:
lines = '''TopLevel
DISPLAY "In TopLevel. Starting to run program"
PERFORM OneLevelDown
DISPLAY "Back in TopLevel."
STOP RUN.'''.split('\n')
import re
counts = [len(re.findall(r'\'.*?\'|".*?"|\S+', l))
          for l in lines]
# [1, 2, 2, 2, 2]
If not, you have to write a parser.
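Before writing a parser by hand, the standard library's shlex module may already be enough, since it tokenizes text with shell-like quoting rules. This is only a sketch under that assumption (the helper name count_tokens is made up); note that shlex.split raises ValueError on an unbalanced quote, so a line containing a lone apostrophe needs separate handling:

```python
import shlex

lines = [
    'TopLevel',
    'DISPLAY "In TopLevel. Starting to run program"',
    'PERFORM OneLevelDown',
    'DISPLAY "Back in TopLevel."',
    'STOP RUN.',
]

def count_tokens(line):
    # shlex.split treats each quoted section as one token
    return len(shlex.split(line))

counts = [count_tokens(line) for line in lines]
print(counts)  # [1, 2, 2, 2, 2]
```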
If you are looking for a non-regex solution, here is my approach:
# A simple function that counts the words in a line
def count_words(line):
    # Collapse every quoted section into one word first (see manage_quotes below)
    line = manage_quotes(line)
    words = line.strip().split(' ')
    # In case of several spaces in a row, we need to filter out empty words
    words = [word for word in words if len(word) > 0]
    return len(words)
# This function collapses each quoted section into a single placeholder word
def manage_quotes(line):
    # We do not care about escaped quotes; they are like ordinary characters
    # Since the changes are local, we can replace text in line freely
    line = line.replace("\\\"", "Q").replace("\\\'", "q")
    # All words between two quotes act as one word, so we replace them with one simple word, starting with `"`
    # This loop finds every pair of quotes in the line
    while True:
        i1 = line.find("\"")
        if (i1 == -1): # No `"` anymore
            break
        i2 = line.find("\"", i1 + 1) # Search after the previous one
        if (i2 == -1): # What shall we do with unpaired quotes???
            # raise Exception()
            break
        line = line[:i1] + "QUOTE" + line[i2+1:]
    # Now search for `'`
    while True:
        i1 = line.find("\'")
        if (i1 == -1): # No `'` anymore
            break
        i2 = line.find("\'", i1 + 1) # Search after the previous one
        if (i2 == -1): # What shall we do with unpaired quotes???
            # raise Exception()
            break
        line = line[:i1] + "quote" + line[i2+1:]
    return line
This is how the method works. For example, say you have a line like this: DISPLAY "Part One \'Test1\'" AND 'Part Two \"Test2\"'
At first, we replace the escaped quotes:
DISPLAY "Part One qTest1q" AND 'Part Two QTest2Q'
Then we replace the double-quoted sections:
DISPLAY QUOTE AND 'Part Two QTest2Q'
Then the single-quoted ones:
DISPLAY QUOTE AND quote
And now we count the words, which gives 4.
You can solve this without regex if you keep a marker that records whether you are inside a quoted area or not.
str.split() - splits at whitespace, returns a list
str.startswith()
str.endswith() - take a string (or a tuple of strings) and return True if the string starts/ends with (any of) them
Code:
# create input file
name = "file.txt"
with open(name, "w") as f:
    f.write("""TopLevel
DISPLAY "In TopLevel. Starting to run program"
PERFORM OneLevelDown
DISPLAY "Back in TopLevel."
STOP RUN.""")
# for testing later
expected = [(1,1),(2,2),(3,2),(4,2),(5,2)] # 1 base line/word count
# program that counts words
counted = []
with open(name) as f:
    for line_nr, content in enumerate(f,1): # 1 based line count
        splt = content.split()
        in_quotation = []
        line_count = 0
        for word in splt:
            if not in_quotation:
                line_count += 1 # only increments if list empty
            if word.startswith(("'",'"')):
                in_quotation.append(word[0])
            if word.endswith(("'",'"')): # check both quote characters
                in_quotation.pop()
        counted.append((line_nr, line_count))
print(expected)
print(counted)
print("Identical: ", all(a == expected[i] for i,a in enumerate(counted)))
Output:
[(1, 1), (2, 2), (3, 2), (4, 2), (5, 2)]
[(1, 1), (2, 2), (3, 2), (4, 2), (5, 2)]
Identical: True
You can tinker with the code. Currently it does not behave well if you put spaces around your quotes: for a lone " it cannot tell whether something starts or ends, and both tests are True.
It seems that the code in the question doesn't take ' or " into account.
Here is how the Python documentation describes str.split:
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
Code:
input_str = '''TopLevel
DISPLAY "In TopLevel. Starting to run program"
PERFORM OneLevelDown
DISPLAY "Back in TopLevel."
STOP RUN.'''
input_str_list = input_str.split('\n')
print(input_str_list)
def get_trick_word(s: str):
    num = 0
    quotation_list = []
    previous_char = ' '
    for c in s:
        need_set_previous = False
        if c == ' ' and len(quotation_list) == 0 and previous_char != ' ':
            num = num + 1
        else:
            has_quato = len(quotation_list)
            if c == '\'' or c == '"':
                if len(quotation_list) != 0 and quotation_list[-1] == c:
                    quotation_list.pop()
                else:
                    quotation_list.append(c)
            if has_quato and len(quotation_list) == 0:
                num = num + 1
                need_set_previous = True
        previous_char = c
        if need_set_previous:
            previous_char = ' '
    if previous_char != ' ' and len(quotation_list) == 0:
        num = num + 1
    return num
result = [get_trick_word(s) for s in input_str_list]
print(result)
And the result is:
# ['TopLevel ', ' DISPLAY "In TopLevel. Starting to run program" ', ' PERFORM OneLevelDown ', ' DISPLAY "Back in TopLevel." ', ' STOP RUN.']
# [1, 2, 2, 2, 2]

How to count word occurrences without being constrained to only exact matches

I have a file which has content like below.
Someone says; Hello; Someone responded Hello back
Someone again said; Hello; No response
Someone again said; Hello waiting for response
I have a python script which counts number of times a particular word occurred in a file. Following is the script.
#!/usr/bin/env python
filename = "/path/to/file.txt"
number_of_words = 0
search_string = "Hello"
with open(filename, 'r') as file:
    for line in file:
        words = line.split()
        for i in words:
            if (i == search_string):
                number_of_words += 1
print("Number of words in " + filename + " is: " + str(number_of_words))
I am expecting the output to be 4, since Hello occurs 4 times, but I get 2. Following is the output of the script:
Number of words in /path/to/file.txt is: 2
I understand that Hello; is not counted as Hello because it is not exactly the word searched for.
Question:
Is there a way I can make my script pick up Hello even if it is followed by a comma, semicolon or dot? Ideally some simple technique which doesn't require looking for substrings again within the found word.
Regex would be a better tool for this, since you want to ignore punctuation. It could be done with clever filtering and .count() methods, but this is more straightforward:
import re
...
search_string = "Hello"
with open(filename, 'r') as file:
    filetext = file.read()
occurrences = len(re.findall(search_string, filetext))
print("Number of words in " + filename + " is: " + str(occurrences))
If you want to match both Hello and hello, you could change search_string accordingly:
search_string = r"[Hh]ello"
Or if you want explicitly the word Hello but not aHello or Hellon, you could match a word boundary (\b) before and after it (see the re documentation for more fun tricks):
search_string = r"\bHello\b"
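The word-boundary variant can be wrapped in a small helper that works for any search word. This is a sketch, not part of the original answer: the function name count_word and its ignore_case parameter are assumptions made for illustration, and re.escape is used so that a word containing regex metacharacters is matched literally:

```python
import re

def count_word(text, word, ignore_case=False):
    """Count whole-word occurrences of `word` in `text`."""
    pattern = r'\b' + re.escape(word) + r'\b'
    flags = re.IGNORECASE if ignore_case else 0
    return len(re.findall(pattern, text, flags))

text = """Someone says; Hello; Someone responded Hello back
Someone again said; Hello; No response
Someone again said; Hello waiting for response"""

print(count_word(text, "Hello"))  # 4
```

Passing re.IGNORECASE covers every capitalization, which the [Hh]ello pattern does not.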
You can use regex and Counter from collections module:
txt = '''Someone says; Hello; Someone responded Hello back
Someone again said; Hello; No response
Someone again said; Hello waiting for response'''
import re
from collections import Counter
from pprint import pprint
c = Counter()
re.sub(r'\b\w+\b', lambda r: c.update((r.group(0), )), txt)
pprint(c)
Prints:
Counter({'Someone': 4,
'Hello': 4,
'again': 2,
'said': 2,
'response': 2,
'says': 1,
'responded': 1,
'back': 1,
'No': 1,
'waiting': 1,
'for': 1})
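The re.sub call above is used only for its side effect of updating the counter. A more direct sketch of the same count, assuming the same sample text, feeds the matches from re.findall straight into Counter:

```python
import re
from collections import Counter

txt = '''Someone says; Hello; Someone responded Hello back
Someone again said; Hello; No response
Someone again said; Hello waiting for response'''

# findall returns every whole word; Counter tallies them
counts = Counter(re.findall(r'\b\w+\b', txt))
print(counts['Hello'])  # 4
```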
You can use regular expressions to find the answer.
import re
filename = "/path/to/file.txt"
number_of_words = 0
search_string = "Hello"
with open(filename, 'r') as file:
    for line in file:
        words = line.split()
        for i in words:
            b = re.search(r'\bHello;?\b', i)
            if b:
                number_of_words += 1
print("Number of words in " + filename + " is: " + str(number_of_words))
This counts every whitespace-separated token that contains Hello as a whole word, such as "Hello" or "Hello;". You can expand the regex to fit any other needs (such as lowercase).
It will ignore things such as "Helloing", which other examples here may count.
If you prefer not using regex... You can check if taking off the last letter makes it a match such as below:
filename = "/path/to/file.txt"
number_of_words = 0
search_string = "Hello"
with open(filename, 'r') as file:
    for line in file:
        words = line.split()
        for i in words:
            if (i == search_string) or (i[:-1] == search_string and i[-1] == ';'):
                number_of_words += 1
print("Number of words in " + filename + " is: " + str(number_of_words))

How to remove or strip off white spaces without using strip() function?

Write a function that accepts an input string consisting of alphabetic
characters and removes all the leading whitespace of the string and
returns it without using .strip(). For example if:
input_string = " Hello "
then your function should return a string such as:
output_string = "Hello "
The below is my program for removing white spaces without using strip:
def Leading_White_Space (input_str):
    length = len(input_str)
    i = 0
    while (length):
        if(input_str[i] == " "):
            input_str.remove()
        i =+ 1
        length -= 1
#Main Program
input_str = " Hello "
result = Leading_White_Space (input_str)
print (result)
I chose the remove function as it would be easy to get rid of the white spaces before the string 'Hello'. Also, the task says to eliminate only the white space before the actual string. By my logic, I suppose my program eliminates not only the leading but the trailing white space too. Any help would be appreciated.
You can loop over the characters of the string and stop when you reach a non-space one. Here is one solution:
def Leading_White_Space(input_str):
    for i, c in enumerate(input_str):
        if c != ' ':
            return input_str[i:]
    return ''  # the string was empty or contained only spaces
Edit:
@PM 2Ring mentioned a good point. If you want to handle all types of whitespace (e.g. \t, \n, \r), you need to use isspace(), so a correct solution could be:
def Leading_White_Space(input_str):
    for i, c in enumerate(input_str):
        if not c.isspace():
            return input_str[i:]
    return ''  # the string was empty or contained only whitespace
Here's another way to strip the leading whitespace, that actually strips all leading whitespace, not just the ' ' space char. There's no need to bother tracking the index of the characters in the string, we just need a flag to let us know when to stop checking for whitespace.
def my_lstrip(input_str):
    leading = True
    result = ''  # returned as-is if the string is empty or all whitespace
    for ch in input_str:
        if leading:
            # All the chars read so far have been whitespace
            if not ch.isspace():
                # The leading whitespace is finished
                leading = False
                # Start saving chars
                result = ch
        else:
            # We're past the whitespace, copy everything
            result += ch
    return result
# test
input_str = " \n \t Hello "
result = my_lstrip(input_str)
print(repr(result))
output
'Hello '
There are various other ways to do this. Of course, in a real program you'd simply use the string .lstrip method, but here are a couple of cute ways to do it using an iterator:
def my_lstrip(input_str):
    it = iter(input_str)
    for ch in it:
        if not ch.isspace():
            break
    return ch + ''.join(it)
and
def my_lstrip(input_str):
    it = iter(input_str)
    ch = next(it)
    while ch.isspace():
        ch = next(it)
    return ch + ''.join(it)
Use re.sub:
>>> import re
>>> input_string = " Hello "
>>> re.sub(r'^\s+', '', input_string)
'Hello '
or
>>> def remove_space(s):
...     ind = 0
...     for i,j in enumerate(s):
...         if j != ' ':
...             ind = i
...             break
...     return s[ind:]
...
>>> remove_space(input_string)
'Hello '
Just to be thorough and without using other modules, we can also specify which whitespace to remove (leading, trailing, both or all), including tab and new line characters. The code I used (which is, for obvious reasons, less compact than other answers) is as follows and makes use of slicing:
def no_ws(string,which='left'):
    """
    Which takes the value of 'left'/'right'/'both'/'all' to remove relevant
    whitespace.
    """
    remove_chars = (' ','\n','\t')
    first_char = 0; last_char = 0
    if which in ['left','both']:
        first_char = len(string)  # stays here if the string is all whitespace
        for idx,letter in enumerate(string):
            if letter not in remove_chars:
                first_char = idx
                break
        if which == 'left':
            return string[first_char:]
    if which in ['right','both']:
        for idx,letter in enumerate(string[::-1]):
            if letter not in remove_chars:
                # index just past the last non-whitespace character
                last_char = len(string) - idx
                break
        return string[first_char:last_char]
    if which == 'all':
        return ''.join([s for s in string if s not in remove_chars])
You can use itertools.dropwhile to remove all particular characters from the start of your string, like this:
import itertools
def my_lstrip(input_str,remove=" \n\t"):
    return "".join( itertools.dropwhile(lambda x:x in remove,input_str))
To make it more flexible, I added an extra argument called remove, which holds the characters to strip from the start of the string, with a default value of " \n\t". dropwhile then skips every leading character that is in remove; to check this I use a lambda function (a practical way of writing short anonymous functions).
Here are a few tests:
>>> my_lstrip(" \n \t Hello ")
'Hello '
>>> my_lstrip(" Hello ")
'Hello '
>>> my_lstrip(" \n \t Hello ")
'Hello '
>>> my_lstrip("--- Hello ","-")
' Hello '
>>> my_lstrip("--- Hello ","- ")
'Hello '
>>> my_lstrip("- - - Hello ","- ")
'Hello '
>>>
The previous function is equivalent to:
def my_lstrip(input_str,remove=" \n\t"):
    for i,x in enumerate(input_str):
        if x not in remove:
            return input_str[i:]
    return ""  # nothing is left once every character has been removed
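Whichever variant you pick, two edge cases are worth checking: an empty string and a string that is all whitespace. The sketch below compares a hand-written version against the built-in str.lstrip on a few samples; the name lstrip_manual is made up for illustration:

```python
def lstrip_manual(input_str):
    """Remove leading whitespace without calling strip() or lstrip()."""
    for i, ch in enumerate(input_str):
        if not ch.isspace():
            return input_str[i:]
    # Reached only when the string is empty or all whitespace
    return ""

# str.lstrip() is used here purely as the reference result
for sample in ["  Hello  ", "\n\t Hello", "Hello", "   ", ""]:
    assert lstrip_manual(sample) == sample.lstrip()
print("all cases match str.lstrip()")
```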

Print next x lines from string1 until string2

I'm trying to write a function that reads through a text file until it finds a word (say "hello"), then print the next x lines of string starting with string 1 (say "start_description") until string 2 (say "end_description").
hello
start_description 123456 end_description
The function should look like description("hello") and the following output should look like
123456
It's a bit hard to explain. I know how to find the certain word in the text file but I don't know how to print, as said, the next few lines between the two strings (start_description and end_description).
EDIT1:
I found some code which allows to print the next 8, 9, ... lines. But because the text in between the two strings is of variable length, that does not work...
EDIT2:
Basically it's the same question as in this post: Python: Print next x lines from text file when hitting string, but the range(8) does not work for me (see EDIT1).
The input file could look like:
HELLO
salut
A: 123456.
BYE
au revoir
A: 789123.
The code should then look like:
import re
def description(word):
    doc = open("filename.txt",'r')
    word = word.upper()
    for line in doc:
        if re.match(word,line):
            #here it should start printing all the text between start_description and end_description, for example 123456
    return output
print(description("hello"))
123456
print(description("bye"))
789123
Here's a way using split:
start_desc = 'hello'
end_desc = 'bye'
text = 'hello 12345\nabcd asdf\nqwer qwer erty\n bye'
print(text.split(start_desc)[1].split(end_desc)[0])
The first split will result in:
['', ' 12345\nabcd asdf\nqwer qwer erty\n bye']
So feed the second element to the second split and it will result in:
[' 12345\nabcd asdf\nqwer qwer erty\n ', '']
Use the first element.
You can then use strip() to remove the surrounding whitespace if you wish.
def description(infilepath, startblock, endblock, word, startdesc, enddesc):
    with open(infilepath) as infile:
        inblock = False
        name = None
        found = False
        answer = []
        for line in infile:
            if found and not inblock: return answer
            if line.strip() != startblock and not inblock: continue
            if line.strip() == startblock: inblock = True
            elif line.strip() == endblock: inblock = False
            if not line.startswith(startdesc):
                name = line.strip()
                continue
            if name is not None and name != word: continue
            if not line.startswith(startdesc): continue
            found = True  # at least one description line has been collected
            answer.append(line.strip().lstrip(startdesc).rstrip(enddesc))
    return answer
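For the sample input shown in the question (a heading word, a free-text line, then a line of the form A: 123456.), a simpler scan may be enough. This is only a sketch under assumptions read off that sample: the markers "A: " and "." and the parameter names start and end are not confirmed by the question, and the function takes a list of lines rather than a file path to keep the example self-contained:

```python
def description(lines, word, start="A: ", end="."):
    """Return the text between `start` and `end` on the first
    matching line that follows the heading `word`, or None."""
    found = False
    for line in lines:
        text = line.strip()
        if not found:
            # Keep scanning until the heading is seen (case-insensitive)
            found = text.upper() == word.upper()
            continue
        if text.startswith(start) and text.endswith(end):
            return text[len(start):len(text) - len(end)]
    return None

sample = ["HELLO", "salut", "A: 123456.", "BYE", "au revoir", "A: 789123."]
print(description(sample, "hello"))  # 123456
print(description(sample, "bye"))    # 789123
```

Since an open file object yields lines when iterated, it can be passed in place of the list.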

Print a part of the string

I have a string:
"apples = green"
How do I:
print everything before '=' (apples)
print everything after '=' (green)
specify the number of a line in a text file? I have a .txt file which contains:
apples = green
lemons = yellow
... = ...
... = ...
Split the string using .split():
print(astring.split(' = ', 1)[0])
Still split the string using .split():
print(astring.split(' = ', 1)[1])
Alternatively, you could use the .partition() method:
>>> astring = "apples = green"
>>> print(astring.split(' = ', 1))
['apples', 'green']
>>> print(astring.partition(' = '))
('apples', ' = ', 'green')
partition() always splits only once, and returns the separator you split on as well.
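Because partition() also reports whether the separator was found, it is convenient for turning a whole file of such lines into a dictionary. A minimal sketch, assuming one key = value pair per line; the helper name parse_pairs and the sample list are made up for illustration:

```python
def parse_pairs(lines):
    """Build a dict from lines of the form 'key = value'."""
    pairs = {}
    for line in lines:
        key, sep, value = line.partition('=')
        if not sep:
            continue  # no '=' on this line, so skip it
        pairs[key.strip()] = value.strip()
    return pairs

sample = ["apples = green", "lemons = yellow", "no separator here"]
print(parse_pairs(sample))  # {'apples': 'green', 'lemons': 'yellow'}
```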
If you need to read a specific line in a file, skip lines first by iterating over the file object. The itertools.islice() function is the most compact way to return that line; don't worry too much if you don't understand how that all works. If the file doesn't have that many lines, an empty string is returned instead:
from itertools import islice
def read_specific_line(filename, lineno):
    with open(filename) as f:
        return next(islice(f, lineno, lineno + 1), '')
Line numbers count from 0 here, so to read the 3rd line from a file:
line = read_specific_line('/path/to/some/file.txt', 2)
If instead you need to know what the line number is of a given piece of text, you'd need to use the enumerate() to keep track of the line count so far:
def what_line(filename, text):
    with open(filename) as f:
        for lineno, line in enumerate(f):
            if line.strip() == text:
                return lineno
    return -1
which would return the line number (starting to count from 0), or -1 if the line was not found in the file.
Every string in Python has a method called split. If you call string.split("substring"), it returns a list of the pieces between occurrences of the substring, which is exactly what you are looking for.
>>> string = "apples = green"
>>> string.split("=")
['apples ', ' green']
>>> string = "apples = green = leaves = chloroplasts"
>>> string.split("=")
['apples ', ' green ', ' leaves ', ' chloroplasts']
So, if you use string.split(), you can call the index in the resulting list to get the substring you want:
>>> string.split(" = ")[0]
'apples'
>>> string.split(" = ")[1]
'green'
>>> string.split(" = ")[2]
'leaves'
etc... Just make sure you have a string which actually contains the substring, or this will throw an IndexError for any index greater than 0.
