So let's say I have this string like
string = 'abcd <# string that has whitespace> efgh'
And I want to delete all the white space inside this <#...> And not affect anything outside <#...>
But the characters outside <#...> can change too so the <#...> is not going to be in a fixed position.
How should I do this?
This is not a complicated operation. You just do it like you would as a human being. Find the two delimiters, keep the part before the first one, remove space from the middle, keep the rest.
string = 'abcd <# string that has whitespace> efgh'
i1 = string.find('<#')
i2 = string.find('>', i1)  # look for the closing '>' only after the '<#'
res = string[:i1] + string[i1:i2].replace(' ','') + string[i2:]
print(res)
Output:
abcd <#stringthathaswhitespace> efgh
How about this (it assumes exactly one word before and one word after the <#...> part):
string = 'abcd <# string that has whitespace> efgh'
s = string.split()
s = ' '.join( (s[0], ''.join(s[1:-1]), s[-1]) )
If <#...> exists consistently, one method is to use regular expressions (regex) to search for the part of the string containing the characters you want to modify. You then need to strip the white space out of that part.
It takes a bit to get your head around regex, but they can be a powerful tool.
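A minimal sketch of the regex approach, assuming every span starts with <# and ends at the next >. The helper name strip_space_in_tags is made up for illustration:

```python
import re

def strip_space_in_tags(text):
    """Remove all whitespace inside each <#...> span, leaving the rest alone."""
    # <#[^>]*> matches from '<#' up to the next '>'.
    # The inner re.sub deletes whitespace from the matched span only.
    return re.sub(r'<#[^>]*>', lambda m: re.sub(r'\s+', '', m.group(0)), text)

print(strip_space_in_tags('abcd <# string that has whitespace> efgh'))
# abcd <#stringthathaswhitespace> efgh
```

Because re.sub visits every match, this also handles several <#...> spans in one string, wherever they occur.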
I am handed a bunch of data and trying to get rid of certain characters. The data contains multiple instances of "^{number}" → "^0", "^1", "^2", etc.
I am trying to set all of these instances to an empty string, "", is there a better way to do this than
string.replace("^0", "").replace("^1", "").replace("^2", "")
I understand you can use a dictionary, but it seems a little overkill considering each item will be replaced with "".
If I understand correctly, the digits are always at the end of the string. Have a look at the solutions below.
with regex:
import re
text = 'xyz125'
s = re.sub(r"\d+$", '', text)  # raw string so that \d reaches the regex engine unchanged
print(s)
it should print:
xyz
without regex, keep in mind that this solution removes all digits and not only the ones at the end of a string:
text = 'xyz125'
result = ''.join(i for i in text if not i.isdigit())
print(result)
it should print:
xyz
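For the literal "^0", "^1", "^2" markers described in the question, a single substitution may be enough. This is a sketch that assumes each marker is a caret followed by exactly one digit; the function name and sample string are invented for illustration:

```python
import re

def remove_caret_codes(text):
    # \^ is a literal caret; \d matches exactly one digit after it.
    return re.sub(r'\^\d', '', text)

print(remove_caret_codes('^1Player^7 one^0'))
# Player one
```

If the markers can have several digits, \d+ in place of \d would cover that case.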
Here is the code I have so far:
dex = tree.xpath('//div[@class="cd-timeline-topic"]/text()')
names = filter(lambda n: n.strip(), dex)
table = str.maketrans(dict.fromkeys('?:,'))
for index, name in enumerate(dex, start = 0):
    print('{}.{}'.format(index, name.strip().translate(table)))
The problem is that the output also includes strings containing another special character, such as "My name is/Richard". What I need is to replace that character with a space, so that the printed output is "My name is Richard". Can anyone help me?
Thanks!
Your call to dict.fromkeys() does not include the character / in its argument.
If you want to map all the special characters to None, just passing your list of special chars to dict.fromkeys() should be enough. If you want to replace them with a space, you could then iterate over the dict and set the value to " " for each key.
For example:
special_chars = "?:/"
special_char_dict = dict.fromkeys(special_chars)
for k in special_char_dict:
    special_char_dict[k] = " "
You can do this by extending your translation table:
dex = ["My Name is/Richard????::,"]
table = str.maketrans({'?':None,':':None,',':None,'/':' '})
for index, name in enumerate(dex, start = 0):
    print('{}.{}'.format(index, name.strip().translate(table)))
OUTPUT
0.My Name is Richard
You want to replace most special characters with None BUT forward slash with a space. You could use a different method to replace forward slashes as the other answers here do, or you could extend your translation table as above, mapping all the other special characters to None and forward slash to space. With this you could have a whole bunch of different replacements happen for different characters.
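The same table can also be written with the three-argument form of str.maketrans (Python 3), which may be shorter when there are many characters to delete. A small sketch using the sample string from above:

```python
# First two arguments: map each character in '/' to the character at the
# same position in ' '. Third argument: characters to delete outright.
table = str.maketrans('/', ' ', '?:,')

print('My Name is/Richard????::,'.translate(table))
# My Name is Richard
```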
Alternatively, you could use the re.sub function in the following way:
import re
s = 'Te/st st?ri:ng,'
out = re.sub(r'\?|:|,|/',lambda x:' ' if x.group(0)=='/' else '',s)
print(out) #Te st string
The arguments of re.sub are as follows. The first is the pattern, which tells re.sub which substrings to replace: ? must be escaped because it otherwise has a special meaning in a pattern, and | means "or", so re.sub will look for ? or : or , or /. The second argument is a function that returns the text to use in place of each matched substring: a space for / and an empty string for anything else. The third argument is the string to be changed.
>>> a = "My name is/Richard"
>>> a.replace('/', ' ')
'My name is Richard'
To replace any character or sequence of characters in a string, you can use the `.replace()` method. So the solution to your question is:
name.replace("/", " ")
When I tried to transform the string into a dict-like form, I met this problem
s = '&a: 12, &b:13, &c:14, &d: 15' # the string I want to convert
Before converting it, I tried to find all the matched results at first so I used
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*),')
result = dict_form.findall(s)
print(result) # [('&a:', ' 12, &b:13, &c:14')]
It's quite unexpected, and a little bit messy
But when I tried another way to match the string:
dict_form1 = re.compile(r'(&[a-zA-Z]*:)([^,]*)')
result = dict_form1.findall(s)
print(result) # [('&a:', ' 12'), ('&b:', '13'), ('&c:', '14'), ('&d:', ' 15')]
This time, I get a better one with key and item separately stored in a tuple.
The only difference I made was changing (.*), into ([^,]*)
The first one I thought was to find anything until it matches a comma
The second one I thought was to find anything but comma
What's the difference?
In the first instance:
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*),')
the (.*) operator is greedy. This means it will match everything up to the last comma, which is why you see the match extend up to &c:14.
In the second instance, by excluding the comma, you are forcing the match to be bounded by a comma. It's like saying "match everything until we hit a comma". This gives the matching behavior you were expecting in the first place.
As has been said, .* is greedy and will try to match as much as possible. To make it non-greedy, add a question mark (?), as in .*?. In your code:
dict_form = re.compile(r'(&[a-zA-Z]*:)(.*?),')
result = dict_form.findall(s)
print(result)
Another maybe easier solution is to just use string splits instead of regex:
result = [_s.split(':') for _s in s.split(',')]
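Since the goal was a dict-like form, the split approach can be taken one step further. This is only a sketch: it assumes every value is an integer and that the keys should keep their leading &, and the helper name to_dict is made up:

```python
s = '&a: 12, &b:13, &c:14, &d: 15'

def to_dict(text):
    result = {}
    for item in text.split(','):
        # partition splits at the first ':' only.
        key, _, value = item.partition(':')
        # strip() drops the stray spaces; int() tolerates surrounding whitespace.
        result[key.strip()] = int(value)
    return result

print(to_dict(s))
# {'&a': 12, '&b': 13, '&c': 14, '&d': 15}
```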
Consider an input string :
mystr = "just some stupid string to illustrate my question"
and a list of strings indicating where to split the input string:
splitters = ["some", "illustrate"]
The output should look like
result = ["just ", "some stupid string to ", "illustrate my question"]
I wrote some code which implements the following approach. For each of the strings in splitters, I find its occurrences in the input string, and insert something which I know for sure would not be a part of my input string (for example, this '!!'). Then I split the string using the substring that I just inserted.
for s in splitters:
    mystr = re.sub(r'(%s)'%s,r'!!\1', mystr)
result = re.split('!!', mystr)
This solution seems ugly, is there a nicer way of doing it?
Splitting with re.split will always remove the matched string from the output (NB, this is not quite true, see the edit below). Therefore, you must use positive lookahead expressions ((?=...)) to match without removing the match. However, re.split ignores empty matches (this was the behaviour before Python 3.7; later versions do split on them), so simply using a lookahead expression doesn't work there. Instead, you will lose one character at each split at minimum (even trying to trick re with "boundary" matches (\b) does not work). If you don't care about losing one whitespace / non-word character at the end of each item (assuming you only split at non-word characters), you can use something like
re.split(r"\W(?=some|illustrate)", mystr)
which would give
["just", "some stupid string to", "illustrate my question"]
(note that the spaces after just and to are missing). You could then programmatically generate these regexes using str.join. Note that each of the split markers is escaped with re.escape so that special characters in the items of splitters do not affect the meaning of the regular expression in any undesired ways (imagine, e.g., a ) in one of the strings, which would otherwise lead to a regex syntax error).
the_regex = r"\W(?={})".format("|".join(re.escape(s) for s in splitters))
Edit (HT to @Arkadiy): Grouping the actual match, i.e. using (\W) instead of \W, returns the non-word characters inserted into the list as separate items. Joining every two subsequent items would then produce the list as desired as well. Then, you can also drop the requirement of having a non-word character by using (.) instead of \W:
the_new_regex = r"(.)(?={})".format("|".join(re.escape(s) for s in splitters))
the_split = re.split(the_new_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest(the_split[::2], the_split[1::2], fillvalue='')]
Because normal text and auxiliary characters alternate, the_split[::2] contains the normal split text and the_split[1::2] the auxiliary characters. Then, itertools.izip_longest is used to combine each text item with the corresponding removed character, and the last item (which has no counterpart among the removed characters) with fillvalue, i.e. ''. Then, each of these tuples is joined using "".join(x). Note that this requires itertools to be imported (you could of course do this in a simple loop, but itertools provides very clean solutions to these things). Also note that itertools.izip_longest is called itertools.zip_longest in Python 3.
This leads to further simplification of the regular expression, because instead of using auxiliary characters, the lookahead can be replaced with a simple matching group ((some|illustrate) instead of (.)(?=some|illustrate)):
the_newest_regex = "({})".format("|".join(re.escape(s) for s in splitters))
the_raw_split = re.split(the_newest_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest([""] + the_raw_split[1::2], the_raw_split[::2], fillvalue='')]
Here, the slice indices on the_raw_split have swapped, because the matched splitters (the odd-numbered items) must now be prepended to the item that follows them instead of appended to the item before. Also note the [""] + part, which pairs the first text item with "" so that the order comes out right.
(end of edit)
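On Python 3.7 and later, where re.split accepts patterns that can match an empty string, the lookahead on its own may already be enough. A sketch under that assumption, reusing mystr and splitters from the question:

```python
import re

mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]

# A zero-width lookahead: split *before* each splitter without consuming it.
pattern = "(?={})".format("|".join(re.escape(s) for s in splitters))

# Drop the empty first piece that appears if the string starts with a splitter.
parts = [p for p in re.split(pattern, mystr) if p]
print(parts)
# ['just ', 'some stupid string to ', 'illustrate my question']
```

This keeps every character of the input, so no joining step with itertools is needed.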
Alternatively, you can (if you want) use string.replace instead of re.sub for each splitter (I think that is a matter of preference in your case, but in general it is probably more efficient)
for s in splitters:
    mystr = mystr.replace(s, "!!" + s)
Also, if you use a fixed token to indicate where to split, you do not need re.split, but can use string.split instead:
result = mystr.split("!!")
What you could also do (instead of relying on the replacement token not to be in the string anywhere else or relying on every split position being preceded by a non-word character) is finding the split strings in the input using string.find and using string slicing to extract the pieces:
def split(string, splitters):
    while True:
        # Get the positions to split at for all splitters still in the string
        # that are not at the very front of the string
        split_positions = [i for i in (string.find(s) for s in splitters) if i > 0]
        if len(split_positions) > 0:
            # There is still somewhere to split
            next_split = min(split_positions)
            yield string[:next_split]  # Yield everything before that position
            string = string[next_split:]  # Retain the rest of the string
        else:
            yield string  # Yield the rest of the string
            break  # Done.
Here, [i for i in (string.find(s) for s in splitters) if i > 0] generates a list of positions where the splitters can be found, for all splitters that are in the string (for this, i < 0 is excluded) and not right at the beginning (where we (possibly) just split, so i == 0 is excluded as well). If there are any left in the string, we yield (this is a generator function) everything up to (excluding) the first splitter (at min(split_positions)) and replace the string with the remaining part. If there are none left, we yield the last part of the string and exit the function. Because this uses yield, it is a generator function, so you need to use list to turn it into an actual list.
Note that you could also replace yield whatever with a call to some_list.append (provided you defined some_list earlier) and return some_list at the very end. I do not consider that to be very good code style, though.
TL;DR
If you are OK with using regular expressions, use
the_newest_regex = "({})".format("|".join(re.escape(s) for s in splitters))
the_raw_split = re.split(the_newest_regex, mystr)
the_actual_split = ["".join(x) for x in itertools.izip_longest([""] + the_raw_split[1::2], the_raw_split[::2], fillvalue='')]
else, the same can also be achieved using string.find with the following split function:
def split(string, splitters):
    while True:
        # Get the positions to split at for all splitters still in the string
        # that are not at the very front of the string
        split_positions = [i for i in (string.find(s) for s in splitters) if i > 0]
        if len(split_positions) > 0:
            # There is still somewhere to split
            next_split = min(split_positions)
            yield string[:next_split]  # Yield everything before that position
            string = string[next_split:]  # Retain the rest of the string
        else:
            yield string  # Yield the rest of the string
            break  # Done.
Not especially elegant but avoiding regex:
mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]
indexes = [0] + [mystr.index(s) for s in splitters] + [len(mystr)]
indexes = sorted(list(set(indexes)))
print([mystr[i:j] for i, j in zip(indexes[:-1], indexes[1:])])
# ['just ', 'some stupid string to ', 'illustrate my question']
I should acknowledge here that a little more work is needed if a word in splitters occurs more than once because str.index finds only the location of the first occurrence of the word...
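One way that extra work could look: collect every occurrence of every splitter with repeated str.find calls, then slice between the sorted positions. This is a sketch rather than a drop-in replacement, and the name split_at_all is made up for illustration:

```python
def split_at_all(text, splitters):
    """Split text in front of every occurrence of every splitter."""
    cuts = {0, len(text)}
    for s in splitters:
        start = text.find(s)
        while start != -1:
            cuts.add(start)
            # Continue searching just past the occurrence we found.
            start = text.find(s, start + 1)
    ordered = sorted(cuts)
    return [text[i:j] for i, j in zip(ordered, ordered[1:])]

print(split_at_all("a some b some c", ["some"]))
# ['a ', 'some b ', 'some c']
```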
Using Python v2, is there a way to ignore a value in a string if it is there?
For instance: I want someone to enter a value of $100.00, or they could enter a value of 100.00 without the leading $ symbol. What I want to do is ignore the '$' value if it is typed in.
Any push in the right direction would be appreciated.
Maybe
s = " $100.00 "
f = float(s.strip().lstrip("$"))
The .strip() strips whitespace from the beginning and the end of the string, and the .lstrip("$") strips a dollar sign from the beginning, if present.
If you only want to remove a '$' then s.replace('$', '') will do what you want.
If you want to replace more than one character then you need to chain replace calls together, which gets very ugly very quickly and in that case one of the other solutions is probably better.
Just filter out unwanted characters from the string. There are multiple ways of doing this, for clarity you could use:
def clean(s, wanted = "0123456789."):
    """Returns version of s without undesired characters in it."""
    out = ""
    for c in s:
        if c in wanted:
            out += c
    return out
To avoid the dynamic string-building, which is costly, you can build a list and then turn the list into a string:
def clean2(s, wanted = "0123456789."):
    outlist = [c for c in s if c in wanted]
    return "".join(outlist)
You could simply use a regular expression to extract the number from the string.
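A sketch of what such a regular expression might look like, assuming the amount is a plain decimal number with no thousands separators or sign. The function name parse_amount is hypothetical:

```python
import re

def parse_amount(text):
    # One or more digits, optionally followed by a dot and more digits.
    match = re.search(r'\d+(?:\.\d+)?', text)
    if match is None:
        raise ValueError("no number found in {!r}".format(text))
    return float(match.group(0))

print(parse_amount("$100.00"))
# 100.0
```

Anything around the number, such as the leading $ or stray spaces, is simply ignored.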
Or you could be lazy if you just want to remove a leading $:
if s.startswith('$'):
    s = s[1:]
If you want to remove multiple $ signs, replace if with while or use s = s.lstrip('$')
PS: You might want to remove trailing $ signs, too. rstrip() or endswith() and s[:-1] are your friends in this case.
Just lstrip $ from the string before you process it.
value = ...
value = value.lstrip( ' $' ) # Strip blank and $
a = "$100.00"
b = ''.join((c for c in a if c != "$"))
Of course, this is reasonable if you don't know the position of the character you want to remove.