keep quoted blocks intact when splitting by delimiter - python

Given an example string s = 'Hi, my name is Humpty-Dumpty, from "Alice, Through the Looking Glass"' and I want to spearate it to the following chunks:
# To Do: something like {l = s.split(',')}
l = ['Hi', 'my name is Humpty-Dumpty', '"Alice, Through the Looking Glass"']
I don't know where and how many delimiters I'll find.
This is my initial idea, and it is quite long, and not exact, as it removes the all delimiters, while I want the delimiters inside quotes to survive:
s = 'Hi, my name is Humpty-Dumpty, from "Alice, Through the Looking Glass"'
ss = []
inner_string = ""
delimiter = ','
for item in s.split(delimiter):
if not inner_string:
if '\"' not in item: # regullar string. not intersting
ss.append(item)
else:
inner_string += item # start inner string
elif inner_string:
inner_string += item
if '\"' in item: # end inner string
ss.append(inner_string)
inner_string = ""
else: # middle of inner string
pass
print(ss)
# prints ['Hi', ' my name is Humpty-Dumpty', ' from "Alice Through the Looking Glass"'] which is OK-ish

You can split by regular expressions with re.split:
>>> import re
>>> [x for x in re.split(r'([^",]*(?:"[^"]*"[^",]*)*)', s) if x not in (',','')]
when s is equal to:
'Hi, my name is Humpty-Dumpty, from "Alice, Through the Looking Glass"'
it outputs:
['Hi', ' my name is Humpty-Dumpty', ' from "Alice, Through the Looking Glass"']
Regular expression explained:
(
[^",]* zero or more chars other than " or ,
(?: non-capturing group
"[^"]*" quoted block
[^",]* followed by zero or more chars other than " or ,
)* zero or more times
)

I solved this problem by avoiding split entirely:
s = 'Hi, my name is Humpty-Dumpty, from "Alice, Through the Looking Glass"'
l = []
substr = ""
quotes_open = False
for c in s:
if c == ',' and not quotes_open: # check for comma only if no quotes open
l.append(substr)
substr = ""
elif c == '\"':
quotes_open = not quotes_open
else:
substr += c
l.append(substr)
print(l)
Output:
['Hi', ' my name is Humpty-Dumpty', ' from Alice, Through the Looking Glass']
A more generalised function could look something like:
def custom_split(input_str, delimiter=' ', avoid_between_char='\"'):
l = []
substr = ""
between_avoid_chars = False
for c in s:
if c == delimiter and not between_avoid_chars:
l.append(substr)
substr = ""
elif c == avoid_between_char:
between_avoid_chars = not between_avoid_chars
else:
substr += c
l.append(substr)
return l

this would work for this specific case and can provide a starting point.
import re
s = 'Hi, my name is Humpty-Dumpty, from "Alice, Through the Looking Glass"'
cut = re.search('(".*")', s)
r = re.sub('(".*")', '$VAR$', s).split(',')
res = []
for i in r:
res.append(re.sub('\$VAR\$', cut.group(1), i))
Output
print(res)
['Hi', ' my name is Humpty-Dumpty', ' from "Alice, Through the Looking Glass"']

Related

Insert string before first occurence of character

So basically I have this string __int64 __fastcall(IOService *__hidden this);, and I need to insert a word in between __fastcall (this could be anything) and (IOService... such as __int64 __fastcall LmaoThisWorks(IOService *__hidden this);.
I've thought about splitting the string but this seems a bit overkill. I'm hoping there's a simpler and shorter way of doing this:
type_declaration_fun = GetType(fun_addr) # Sample: '__int64 __fastcall(IOService *__hidden this)'
if type_declaration_fun:
print(type_declaration_fun)
type_declaration_fun = type_declaration_fun.split(' ')
first_bit = ''
others = ''
funky_list = type_declaration_fun[1].split('(')
for x in range(0, (len(funky_list))):
if x == 0:
first_bit = funky_list[0]
else:
others = others + funky_list[x]
type_declaration_fun = type_declaration_fun[0] + ' ' + funky_list[0] + ' ' + final_addr_name + others
type_declaration_fun = type_declaration_fun + ";"
print(type_declaration_fun)
The code is not only crap, but it doesn't quite work. Here's a sample output:
void *__fastcall(void *objToFree)
void *__fastcall IOFree_stub_IONetworkingFamilyvoid;
How could I make this work and cleaner?
Notice that there could be nested parentheses and other weird stuff, so you need to make sure that the name is added just before the first parenthesis.
You can use the method replace():
s = 'ABCDEF'
ins = '$'
before = 'DE'
new_s = s.replace(before, ins + before, 1)
print(new_s)
# ABC$DEF
Once you find the index of the character you need to insert before, you can use splicing to create your new string.
string = 'abcdefg'
string_to_insert = '123'
insert_before_char = 'c'
for i in range(len(string)):
if string[i] == insert_before_char:
string = string[:i] + string_to_insert + string[i:]
break
What about this:
s = "__int64__fastcall(IOService *__hidden this);"
t = s.split("__fastcall",1)[0]+"anystring"+s.split("__fastcall",1)[1]
I get:
__int64__fastcallanystring(IOService *__hidden this);
I hope this is what you want. If not, please comment.
Use regex.
In [1]: import re
pattern = r'(?=\()'
string = '__int64 __fastcall(IOService *__hidden this);'
re.sub(pattern, 'pizza', string)
Out[1]: '__int64 __fastcallpizza(IOService *__hidden this);'
The pattern is a positive lookahead to match the first occurrence of (.
x='high speed'
z='new text'
y = x.index('speed')
x =x[:y] + z +x[y:]
print(x)
>>> high new textspeed
this a quick example, please be aware that y inclusuve after the new string.
be Aware that you are changing the original string, or instead just declare a new string.

Removing And Re-Inserting Spaces

What is the most efficient way to remove spaces from a text, and then after the neccessary function has been performed, re-insert the previously removed spacing?
Take this example below, here is a program for encoding a simple railfence cipher:
from string import ascii_lowercase
string = "Hello World Today"
string = string.replace(" ", "").lower()
print(string[::2] + string[1::2])
This outputs the following:
hlooltdyelwrdoa
This is because it must remove the spacing prior to encoding the text. However, if I now want to re-insert the spacing to make it:
hlool tdyel wrdoa
What is the most efficient way of doing this?
As mentioned by one of the other commenters, you need to record where the spaces came from then add them back in
from string import ascii_lowercase
string = "Hello World Today"
# Get list of spaces
spaces = [i for i,x in enumerate(string) if x == ' ']
string = string.replace(" ", "").lower()
# Set string with ciphered text
ciphered = (string[::2] + string[1::2])
# Reinsert spaces
for space in spaces:
ciphered = ciphered[:space] + ' ' + ciphered[space:]
print(ciphered)
You could use str.split to help you out. When you split on spaces, the lengths of the remaining segments will tell you where to split the processed string:
broken = string.split(' ')
sizes = list(map(len, broken))
You'll need the cumulative sum of the sizes:
from itertools import accumulate, chain
cs = accumulate(sizes)
Now you can reinstate the spaces:
processed = ''.join(broken).lower()
processed = processed[::2] + processed[1::2]
chunks = [processed[index:size] for index, size in zip(chain([0], cs), sizes)]
result = ' '.join(chunks)
This solution is not especially straightforward or efficient, but it does avoid explicit loops.
Using list and join operation,
random_string = "Hello World Today"
space_position = [pos for pos, char in enumerate(random_string) if char == ' ']
random_string = random_string.replace(" ", "").lower()
random_string = list(random_string[::2] + random_string[1::2])
for index in space_position:
random_string.insert(index, ' ')
random_string = ''.join(random_string)
print(random_string)
I think this might Help
string = "Hello World Today"
nonSpaceyString = string.replace(" ", "").lower()
randomString = nonSpaceyString[::2] + nonSpaceyString[1::2]
spaceSet = [i for i, x in enumerate(string) if x == " "]
for index in spaceSet:
randomString = randomString[:index] + " " + randomString[index:]
print(randomString)
string = "Hello World Today"
# getting index of ' '
index = [i for i in range(len(string)) if string[i]==' ']
# storing the non ' ' characters
data = [i for i in string.lower() if i!=' ']
# applying cipher code as mention in OP STATEMENT
result = data[::2]+data[1::2]
# inserting back the spaces in there position as they had in original string
for i in index:
result.insert(i, ' ')
# creating a string solution
solution = ''.join(result)
print(solution)
# output hlool tdyel wrdoa
You can make a new string with this small yet simple (kind of) code:
Note this doesn't use any libraries, which might make this slower, but less confusing.
def weird_string(string): # get input value
spaceless = ''.join([c for c in string if c != ' ']) # get spaceless version
skipped = spaceless[::2] + spaceless[1::2] # get new unique 'code'
result = list(skipped) # get list of one letter strings
for i in range(len(string)): # loop over strings
if string[i] == ' ': # if a space 'was' here
result.insert(i, ' ') # add the space back
# end for
s = ''.join(result) # join the results back
return s # return the result

How to write my own split function without using .split and .strip function?

How to write my own split function? I just think I should remove spaces, '\t' and '\n'. But because of the shortage of knowledge, I have no idea of doing this question
Here is the original question:
Write a function split(string) that returns a list of words in the
given string. Words may be separated by one or more spaces ' ' , tabs
'\t' or newline characters '\n' .
And there are examples:
words = split('duff_beer 4.00') # ['duff_beer', '4.00']
words = split('a b c\n') # ['a', 'b', 'c']
words = split('\tx y \n z ') # ['x', 'y', 'z']
Restrictions: Don't use the str.split method! Don't use the str.strip method
Some of the comments on your question provide really interesting ideas to solve the problem with the given restrictions.
But assuming you should not use any python builtin split function, here is another solution:
def split(string, delimiters=' \t\n'):
result = []
word = ''
for c in string:
if c not in delimiters:
word += c
elif word:
result.append(word)
word = ''
if word:
result.append(word)
return result
Example output:
>>> split('duff_beer 4.00')
['duff_beer', '4.00']
>>> split('a b c\n')
['a', 'b', 'c']
>>> split('\tx y \n z ')
['x', 'y', 'z']
I think using regular expressions is your best option as well.
I would try something like this:
import re
def split(string):
return re.findall('\S+',string)
This should return a list of all none whitespace characters in your string.
Example output:
>>> split('duff_beer 4.00')
['duff_beer', '4.00']
>>> split('a b c\n')
['a', 'b', 'c']
>>> split('\tx y \n z ')
['x', 'y', 'z']
This is what you can do with assigning a list, This is tested on python3.6
Below is Just an example..
values = 'This is a sentence'
split_values = []
tmp = ''
for words in values:
if words == ' ':
split_values.append(tmp)
tmp = ''
else:
tmp += words
if tmp:
split_values.append(tmp)
print(split_values)
Desired output:
$ ./splt.py
['This', 'is', 'a', 'sentence']
You can use the following function that sticks to the basics, as your professor apparently prefers:
def split(s):
output = []
delimiters = {' ', '\t', '\n'}
delimiter_found = False
for c in s:
if c in delimiters:
delimiter_found = True
elif output:
if delimiter_found:
output.append('')
delimiter_found = False
output[-1] += c
else:
output.append(c)
return output
so that:
print(split('duff_beer 4.00'))
print(split('a b c\n'))
print(split('\tx y \n z '))
would output:
['duff_beer', '4.00']
['a', 'b', 'c']
['x', 'y', 'z']
One approach would be to iterate over every char until you find a seperator, built a string from that chars and append it to the outputlist like this:
def split(input_str):
out_list = []
word = ""
for c in input_str:
if c not in ("\t\n "):
word += c
else:
out_list.append(word)
word = ""
out_list.append(word)
return out_list
a = "please\nsplit\tme now"
print(split(a))
# will print: ['please', 'split', 'me', 'now']
Another thing you could do is by using regex:
import re
def split(input_str):
out_list = []
for m in re.finditer('\S+', input_str):
out_list.append(m.group(0))
return out_list
a = "please\nsplit\tme now"
print(split(a))
# will print: ['please', 'split', 'me', 'now']
The regex \S+ is looking for any sequence of non whitespace characters and the function re.finditer returns an iterator with MatchObject instances over all non-overlapping matches for the regex pattern.
Please find my solution, it is not the best one, but it works:
def convert_list_to_string(b):
localstring=""
for i in b:
localstring+=i
return localstring
def convert_string_to_list(b):
locallist=[]
for i in b:
locallist.append(i)
return locallist
def mysplit(inputString, separator):
listFromInputString=convert_string_to_list(inputString)
part=[]
result=[]
j=0
for i in range(0, len(listFromInputString)):
if listFromInputString[i]==separator:
part=listFromInputString[j:i]
j=i+1
result.append(convert_to_string(part))
else:
pass
if j != 0:
result.append(convert_to_string(listFromInputString[j:]))
if len(result)==0:
result.append(inputString)
return result
Test:
mysplit("deesdfedefddfssd", 'd')
Result: ['', 'ees', 'fe', 'ef', '', 'fss', '']
Some of your solutions are very good, but it seems to me that there are more alternative options than using the function:
values = 'This is a sentence'
split_values = []
tmp = ''
for words in values:
if words == ' ':
split_values.append(tmp)
tmp = ''
else:
tmp += words
if tmp:
split_values.append(tmp)
print(split_values)
a is string and s is pattern here.
a="Tapas Pall Tapas TPal TapP al Pala"
s="Tapas"
def fun(a,s):
st=""
l=len(s)
li=[]
lii=[]
for i in range(0,len(a)):
if a[i:i+l]!=s:
st=st+a[i]
elif i+l>len(a):
st=st+a[i]
else:
li.append(st)
i=i+l
st=""
li.append(st)
lii.append(li[0])
for i in li[1:]:
lii.append(i[l-1:])
return lii
print(fun(a,s))
print(a.split(s))
This handles for whitespaces in strings and returns empty lists if present
def mysplit(strng):
#
# put your code here
#
result = []
words = ''
for char in strng:
if char != ' ':
words += char
else:
if words:
result.append(words)
words = ''
result.append(words)
for item in result:
if item == '':
result.remove(item)
return result
print(mysplit("To be or not to be, that is the question"))
print(mysplit("To be or not to be,that is the question"))
print(mysplit(" "))
print(mysplit(" abc "))
print(mysplit(""))
def mysplit(strng):
my_string = ''
liste = []
for x in range(len(strng)):
my_string += "".join(strng[x])
if strng[x] == ' ' or x+1 == len(strng):
liste.append(my_string.strip())
my_string = ''
liste = [elem for elem in liste if elem!='']
return liste
It is always a good idea to provide algorithm before coding:
This is the procedure for splitting words on delimiters without using any python built in method or function:
Initialize an empty list [] called result which will be used to save the resulting list of words, and an empty string called word = "" which will be used to concatenate each block of string.
Keep adding string characters as long as the delimiter is not reached
When you reach the delimiter, and len(word) = 0, Don't do whatever is below. Just go to the next iteration. This will help detecting and removing leading spaces.
When you reach the delimiter, and len(word) != 0, Append word to result, reinitialize word and jump to the next iteration without doing whatever is below
Return result
def my_split(s, delimiter = [" ","\t"]):
result,word = [], "" # Step 0
N = len(s)
for i in range(N) : #
if N == 0:# Case of empty string
return result
else: # Non empty string
if s[i] in delimiter and len(word) == 0: # Step 2
continue # Step 2: Skip, jump to the next iteration
if s[i] in delimiter and len(word) != 0: # Step 3
result.append(word) # Step 3
word = "" # Step 3
continue # Step 3: Skip, jump to the next iteration
word = word + s[i] # Step 1.
return result
print(my_split(" how are you? please split me now! "))
All the above answers are good, there is a similar solution with an extra empty list.
def my_split(s):
l1 = []
l2 = []
word = ''
spaces = ['', '\t', ' ']
for letters in s:
if letters != ' ':
word += letters
else:
l1.append(word)
word = ''
if word:
l1.append(word)
for words in l1:
if words not in spaces:
l2.append(words)
return l2
my_string = ' The old fox jumps into the deep river'
y = my_split(my_string)
print(y)

How to remove or strip off white spaces without using strip() function?

Write a function that accepts an input string consisting of alphabetic
characters and removes all the leading whitespace of the string and
returns it without using .strip(). For example if:
input_string = " Hello "
then your function should return a string such as:
output_string = "Hello "
The below is my program for removing white spaces without using strip:
def Leading_White_Space (input_str):
length = len(input_str)
i = 0
while (length):
if(input_str[i] == " "):
input_str.remove()
i =+ 1
length -= 1
#Main Program
input_str = " Hello "
result = Leading_White_Space (input_str)
print (result)
I chose the remove function as it would be easy to get rid off the white spaces before the string 'Hello'. Also the program tells to just eliminate the white spaces before the actual string. By my logic I suppose it not only eliminates the leading but trailing white spaces too. Any help would be appreciated.
You can loop over the characters of the string and stop when you reach a non-space one. Here is one solution :
def Leading_White_Space(input_str):
for i, c in enumerate(input_str):
if c != ' ':
return input_str[i:]
Edit :
#PM 2Ring mentionned a good point. If you want to handle all types of types of whitespaces (e.g \t,\n,\r), you need to use isspace(), so a correct solution could be :
def Leading_White_Space(input_str):
for i, c in enumerate(input_str):
if not c.isspace():
return input_str[i:]
Here's another way to strip the leading whitespace, that actually strips all leading whitespace, not just the ' ' space char. There's no need to bother tracking the index of the characters in the string, we just need a flag to let us know when to stop checking for whitespace.
def my_lstrip(input_str):
leading = True
for ch in input_str:
if leading:
# All the chars read so far have been whitespace
if not ch.isspace():
# The leading whitespace is finished
leading = False
# Start saving chars
result = ch
else:
# We're past the whitespace, copy everything
result += ch
return result
# test
input_str = " \n \t Hello "
result = my_lstrip(input_str)
print(repr(result))
output
'Hello '
There are various other ways to do this. Of course, in a real program you'd simply use the string .lstrip method, but here are a couple of cute ways to do it using an iterator:
def my_lstrip(input_str):
it = iter(input_str)
for ch in it:
if not ch.isspace():
break
return ch + ''.join(it)
and
def my_lstrip(input_str):
it = iter(input_str)
ch = next(it)
while ch.isspace():
ch = next(it)
return ch + ''.join(it)
Use re.sub
>>> input_string = " Hello "
>>> re.sub(r'^\s+', '', input_string)
'Hello '
or
>>> def remove_space(s):
ind = 0
for i,j in enumerate(s):
if j != ' ':
ind = i
break
return s[ind:]
>>> remove_space(input_string)
'Hello '
>>>
Just to be thorough and without using other modules, we can also specify which whitespace to remove (leading, trailing, both or all), including tab and new line characters. The code I used (which is, for obvious reasons, less compact than other answers) is as follows and makes use of slicing:
def no_ws(string,which='left'):
"""
Which takes the value of 'left'/'right'/'both'/'all' to remove relevant
whitespace.
"""
remove_chars = (' ','\n','\t')
first_char = 0; last_char = 0
if which in ['left','both']:
for idx,letter in enumerate(string):
if not first_char and letter not in remove_chars:
first_char = idx
break
if which == 'left':
return string[first_char:]
if which in ['right','both']:
for idx,letter in enumerate(string[::-1]):
if not last_char and letter not in remove_chars:
last_char = -(idx + 1)
break
return string[first_char:last_char+1]
if which == 'all':
return ''.join([s for s in string if s not in remove_chars])
you can use itertools.dropwhile to remove all particualar characters from the start of you string like this
import itertools
def my_lstrip(input_str,remove=" \n\t"):
return "".join( itertools.dropwhile(lambda x:x in remove,input_str))
to make it more flexible, I add an additional argument called remove, they represent the characters to remove from the string, with a default value of " \n\t", then with dropwhile it will ignore all characters that are in remove, to check this I use a lambda function (that is a practical form of write short anonymous functions)
here a few tests
>>> my_lstrip(" \n \t Hello ")
'Hello '
>>> my_lstrip(" Hello ")
'Hello '
>>> my_lstrip(" \n \t Hello ")
'Hello '
>>> my_lstrip("--- Hello ","-")
' Hello '
>>> my_lstrip("--- Hello ","- ")
'Hello '
>>> my_lstrip("- - - Hello ","- ")
'Hello '
>>>
the previous function is equivalent to
def my_lstrip(input_str,remove=" \n\t"):
i=0
for i,x in enumerate(input_str):
if x not in remove:
break
return input_str[i:]

Python cleaning words in a sentence

I am trying to write a function that accepts a string (sentence) and then cleans it and returns all alphabets, numbers and a hypen. however the code seems to error. Kindly know what I am doing wrong here.
Example: Blake D'souza is an !d!0t
Should return: Blake D'souza is an d0t
Python:
def remove_unw2anted(str):
str = ''.join([c for c in str if c in 'ABCDEFGHIJKLNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890\''])
return str
def clean_sentence(s):
lst = [word for word in s.split()]
#print lst
for items in lst:
cleaned = remove_unw2anted(items)
return cleaned
s = 'Blake D\'souza is an !d!0t'
print clean_sentence(s)
You only return last cleaned word!
Should be:
def clean_sentence(s):
lst = [word for word in s.split()]
lst_cleaned = []
for items in lst:
lst_cleaned.append(remove_unw2anted(items))
return ' '.join(lst_cleaned)
A shorter method could be this:
def is_ok(c):
return c.isalnum() or c in " '"
def clean_sentence(s):
return filter(is_ok, s)
s = "Blake D'souza is an !d!0t"
print clean_sentence(s)
A variation using string.translate which has the benefit ? of being easy to extend and is part of string.
import string
allchars = string.maketrans('','')
tokeep = string.letters + string.digits + '-'
toremove = allchars.translate(None, tokeep)
s = "Blake D'souza is an !d!0t"
print s.translate(None, toremove)
Output:
BlakeDsouzaisand0t
The OP said only keep characters, digits and hyphen - perhaps they meant keep whitespace as well?

Categories

Resources