Getting only one '\' character when joining a list - python

I want to have a string with the '\w' character after each letter.
For example:
my_string = 'asdfg'
What I want:
my_string = 'a\ws\wd\wf\wg\w'
Now, how I approached this is first storing each letter into a list:
list=[]
for i in my_string:
    list.append(i)
And then joining it with a \w character in between to form my new string. However, I ran into some problems.
'\w'.join(list)
I'm getting a double backslash character instead of one:
'q\\ww\\we\\wr\\wt\\wy\\wu\\wy\\wt\\wr\\we\\ws\\wd\\wf\\wt\\wy\\wu\\wi\\wo\\wk\\wn\\wn'
I'd greatly appreciate any help in fixing this. Thanks.

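\w is not an escape sequence. You might be thinking of another escape character, but because Python does not recognize \w, it keeps the backslash as a literal character, so '\w' is the same two-character string as '\\w'. The doubled backslash you see is only how the interpreter displays (repr) the string; the string itself contains a single backslash, as print() will show.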
Oh, you might also want to replace your for loop with simply list(my_string) or tuple(my_string) - or even the entire thing with '[whatever character you actually wanted]'.join(my_string) - it's simpler and does the same thing. To get your expected result, you'll also need to add the character to the end of the string, as in '[x]'.join(my_string) + '[x]'. As it stands now, you won't get the character after the very last letter.
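A minimal sketch of the points above, assuming the goal really is a literal backslash followed by w after each letter (the names sep and result are only illustrative):

```python
my_string = 'asdfg'

# r'\w' is two characters: a backslash and the letter w.
sep = r'\w'

# A str is already iterable, so it can be joined directly; the separator is
# appended once more so that it also follows the last letter.
result = sep.join(my_string) + sep

print(len(sep))      # 2
print(result)        # a\ws\wd\wf\wg\w
print(repr(result))  # 'a\\ws\\wd\\wf\\wg\\w'
```

print() shows the actual contents; the doubled backslash only appears in the repr that the interactive prompt displays.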

Related

Using regex to check for no characters before another

I am trying to select any letter character (a-z) with no numbers in front of it. For example:
2x+a-3p returns a
a+b-c+d returns a,b,c,d
7g+8k returns nothing
I'm attempting to use regex for this so that I can use the expression in python but I can't find the solution.
I am using Python 3.10.4 if that is necessary.
You may need a negative lookbehind:
(?<!\d)[a-z]
Translates exactly into "any letter character (a-z) with no numbers in front of it".
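As a quick check, the lookbehind can be run against the three inputs from the question. This is only a sketch; the names pattern, examples and results are illustrative.

```python
import re

# A lowercase letter that is not immediately preceded by a digit.
pattern = re.compile(r'(?<!\d)[a-z]')

examples = ['2x+a-3p', 'a+b-c+d', '7g+8k']
results = {text: pattern.findall(text) for text in examples}

for text, found in results.items():
    print(text, '->', found)
# 2x+a-3p -> ['a']
# a+b-c+d -> ['a', 'b', 'c', 'd']
# 7g+8k -> []
```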
A first pass attempt would be
p = re.compile(r'[^0-9]([a-z])')
This would get a lower-case letter preceded by a character which is not a digit. However, you would miss any letter at the beginning of a line, since no character precedes it. So, you can instead do
p = re.compile(r'(?:[^0-9]|^)([a-z])')
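A small sketch comparing the two capture-group patterns (the variable names are illustrative). One caveat that seems worth knowing: both patterns consume the preceding character, so of two adjacent letters only the second is captured, whereas a lookbehind consumes nothing.

```python
import re

first_pass = re.compile(r'[^0-9]([a-z])')
with_anchor = re.compile(r'(?:[^0-9]|^)([a-z])')
lookbehind = re.compile(r'(?<!\d)[a-z]')

text = 'a+b-c+d'
missed_start = first_pass.findall(text)   # ['b', 'c', 'd']
caught_start = with_anchor.findall(text)  # ['a', 'b', 'c', 'd']

# Adjacent letters: the non-digit 'a' is consumed as the "preceding
# character", so only 'b' is captured by the group-based pattern.
adjacent_group = with_anchor.findall('ab')      # ['b']
adjacent_lookbehind = lookbehind.findall('ab')  # ['a', 'b']

print(missed_start, caught_start, adjacent_group, adjacent_lookbehind)
```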

Deleting of 'd' and 'n' character in strip in python

dataframe
string1
Data%2Fxxx
Data%2Ffrance
Data%2Fdenmark
Data%2Fnorway
Code
df['string1'] = [x.strip('Data%2F') for x in df.string1]
output
string1
xxx
france
enmark
orway
So the strip function is removing the first character when it is 'd' or 'n'. Does anyone know why? How can I stop this from happening? Is this related to '\d' and '\n'?
python version - 3.7.4
The strip() method returns a copy of the string with both leading and trailing characters stripped. According to https://docs.python.org/3/library/stdtypes.html#str.strip, "The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped." Examples from the documentation:
>>> ' spacious '.strip()
'spacious'
>>> 'www.example.com'.strip('cmowz.')
'example'
In other words, x.strip('Data%2F') is directing Python to strip any a's, t's, D's etc. from the beginning and end of the string. This is why "Data%2Faloha".strip("Data%2F") would actually return 'loh' unless you have, say, a space at the end, which is not part of the chars argument in your example. This is my best guess as to what's happening for you.
str.replace() should work perfectly for you.
>>> x.replace('Data%2F', '')
The correct way to proceed is with str.replace():
df['string1'] = [x.replace('Data%2F','') for x in df.string1]
The str.strip() method returns a copy of the string in which all of the given chars have been stripped from the beginning and the end of the string.
When I tested, it gave me a different result but still incorrect.
str.strip() is mostly used to remove whitespace from the start and end of a string, for example.
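This is unlikely to be caused by \d or \n: strip() treats its argument as a set of individual characters, so any of them can be removed from either end. You should rather use replace, because it removes only the exact substring and leaves whitespace alone.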
string.replace("Data%2F","")
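The difference between strip() and replace() discussed in the answers above can be sketched with plain strings (no pandas needed). The sample values and the helper drop_prefix are hypothetical, added only for illustration; the helper is a version-independent stand-in for str.removeprefix, which exists from Python 3.9 on.

```python
prefix = 'Data%2F'
values = ['Data%2Fxxx', 'Data%2Ffrance', 'Data%2Faloha']

# strip() treats its argument as a set of characters {D, a, t, %, 2, F}
# and removes any of them from both ends.
stripped = [v.strip(prefix) for v in values]
print(stripped)  # ['xxx', 'france', 'loh']

# replace() removes the exact substring, wherever it occurs.
replaced = [v.replace(prefix, '') for v in values]
print(replaced)  # ['xxx', 'france', 'aloha']

def drop_prefix(text, prefix):
    """Hypothetical helper: remove prefix only when text starts with it."""
    return text[len(prefix):] if text.startswith(prefix) else text

prefixed = [drop_prefix(v, prefix) for v in values]
print(prefixed)  # ['xxx', 'france', 'aloha']
```

If the marker could also appear in the middle of a value, the prefix-only helper may be the safer choice, since replace() would remove those occurrences too.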

Is it possible to add "any letter" to a string?

I am parsing a database and extracting entries to a new database. For this I use keywords which should and keywords which should not be included. For a keyword I want excluded, it should be "-anyletter-fv", I wonder if -anyletter- is possible to program. If there is no letter, a space, a comma, or anything but a letter, I don't want to exclude it, only if there is specifically a letter in front of it.
If I understand you correctly, you are trying to exclude those cases in which your keyword starts with some letter.
Use the re library for it (https://docs.python.org/3/library/re.html):
print(re.match(r"^\w.*", " keyword"))
will return a match object if a pattern that you look for is found, otherwise None.
You can use it for if-expressions.
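The "^" marks the beginning of the string, "\w" matches a single word character (letters, digits, and the underscore), and ".*" matches whatever follows, of any length. To accept letters only, use "[a-zA-Z]" instead of "\w".
Therefore you get a match only for keywords that start with a word character; for " keyword", which starts with a space, the result is None.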
I hope this helps you.
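One possible reading of the question is "exclude an entry when the keyword fv is directly preceded by a letter". Under that assumption, a minimal sketch could look like this; the function name should_exclude and the restriction to ASCII letters are assumptions, not something stated in the question.

```python
import re

# "fv" directly preceded by an ASCII letter.
letter_before_fv = re.compile(r'[A-Za-z]fv')

def should_exclude(entry):
    """Return True when a letter sits immediately in front of 'fv'."""
    return letter_before_fv.search(entry) is not None

for entry in ['xfv', ' fv', ',fv', 'fv', '3fv']:
    print(repr(entry), should_exclude(entry))
# 'xfv' True
# ' fv' False
# ',fv' False
# 'fv' False
# '3fv' False
```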

Is this regex syntax working?

I wanted to search a string for a substring beginning with ">"
Does this syntax say what I want it to say: this character followed by anything.
regex_firstline = re.compile("[>]{1}.*")
A more Pythonic way to handle such tasks is the str.startswith() method; you don't need a regex.
As for your regex "[>]{1}.*": you don't need {1} after your character class, and you can mark the start of the string with the anchor ^. So it can be "^>.*"
Using http://regex101.com:
[>]{1} matches the single character > literally, exactly one time (regex101 also notes that {1} is a meaningless quantifier), and
.* then matches any character (except a newline) as many times as possible.
If a list was provided inside square brackets (as opposed to a single character), regex would attempt to match a single character within the list exactly one time. http://regex101.com has a good listing of tokens and what they mean.
An ideal regex would be ^[>].*, meaning: at the beginning of a string, find exactly one > character followed by anything else. With only one character in the square brackets, you can remove them to simplify it even further: ^>.*
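The suggestions in this thread can be compared side by side. This is only a sketch; the sample lines are made up for illustration.

```python
import re

lines = ['>header one', 'ACGT', 'x > y']

# Anchored regex: the line must begin with '>'.
first_line = re.compile(r'^>.*')
via_regex = [s for s in lines if first_line.match(s)]

# The same check without a regex.
via_startswith = [s for s in lines if s.startswith('>')]

print(via_regex)       # ['>header one']
print(via_startswith)  # ['>header one']

# Without the anchor, search() finds the substring that starts at the
# first '>' anywhere in the string; [>]{1} and > behave identically.
tail_original = re.search(r'[>]{1}.*', 'x > y').group()
tail_simplified = re.search(r'>.*', 'x > y').group()
print(tail_original, '|', tail_simplified)  # > y | > y
```

Which form fits depends on whether the ">" must be at the very start of the string or may appear anywhere in it.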

Regex: Complement a group of characters (Python)

I want to write a regex to check if a word ends in anything except s,x,y,z,ch,sh or a vowel, followed by an s. Here's my failed attempt:
re.match(r".*[^ s|x|y|z|ch|sh|a|e|i|o|u]s",s)
What is the correct way to complement a group of characters?
Non-regex solution using str.endswith:
>>> from itertools import product
>>> tup = tuple(''.join(x) for x in product(('s','x','y','z','ch','sh'), 's'))
>>> 'foochf'.endswith(tup)
False
>>> 'foochs'.endswith(tup)
True
[^ s|x|y|z|ch|sh|a|e|i|o|u]
This is an inverted character class. Character classes match single characters, so in your case it will match any character except one of these: acehiosuxyz |. Note that it will not respect compound groups like ch and sh, and the | are interpreted as literal pipe characters which just appear multiple times in the character class (where duplicates are ignored).
So this is actually equivalent to the following character class:
[^acehiosuxyz |]
Instead, you will have to use a negative look behind to make sure that a trailing s is not preceded by any of the character sequences:
.*(?<!.[ sxyzaeiou]|ch|sh)s
This one has the problem that it will not be able to match two character words, as, to be able to use look behinds, the look behind needs to have a fixed size. And to include both the single characters and the two-character groups in the look behind, I had to add another character to the single character matches. You can however use two separate look behinds instead:
.*(?<![ sxyzaeiou])(?<!ch|sh)s
As LarsH mentioned in the comments, if you really want to match words that end with this, you should add some kind of boundary at the end of the expression. If you want to match the end of the string/line, you should add a $, and otherwise you should at least add a word boundary \b to make sure that the word actually ends there.
It looks like you need a negative lookbehind here:
import re
rx = r'(?<![sxyzaeiou])(?<!ch|sh)s$'
print(re.search(rx, 'bots')) # ok
print(re.search(rx, 'boxs')) # None
Note that re doesn't support variable-width lookbehinds, therefore you need two of them.
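A slightly larger check of the two-lookbehind pattern, as a sketch; the word list is made up for illustration.

```python
import re

# Final s that is preceded neither by one of s, x, y, z or a vowel,
# nor by the two-letter groups ch or sh.
rx = re.compile(r'(?<![sxyzaeiou])(?<!ch|sh)s$')

words = ['bots', 'boxs', 'watchs', 'maths', 'bus', 'hs']
matched = [w for w in words if rx.search(w)]
print(matched)  # ['bots', 'maths', 'hs']
```

Note that 'hs' matches: with only one character in front of the s, the two-character lookbehind has nothing to compare against, so it does not block the match.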
How about
re.search("([^sxyzaeiouh]|[^cs]h)s$", s)
Using search() instead of match() means the match doesn't have to begin at the beginning of the string, so we can eliminate the .*.
This is assuming that the end of the word is the end of the string; i.e. we don't have to check for a word boundary.
It also assumes that you don't need to match the "word" hs, even though it conforms literally to your rules. If you want to match that as well, you could add another alternative:
re.search("([^sxyzaeiouh]|[^cs]h|^h)s$", s)
But again, we're assuming that the beginning of the word is the beginning of the string.
Note that the raw string notation, r"...", is unnecessary here (but harmless). It only helps when you have backslashes in the regexp, so that you don't have to escape them in the string notation.
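The alternation-based approach can be tried on a few made-up words. This sketch assumes the extended pattern is meant to keep the [^cs]h alternative and add ^h as a third one; the variable names are illustrative.

```python
import re

# Before the final s: a character that is not s, x, y, z, a vowel or h,
# or an h that is not part of ch or sh.
without_hs = re.compile(r'([^sxyzaeiouh]|[^cs]h)s$')

# Same, plus an h at the very start of the string (the "word" hs).
with_hs = re.compile(r'([^sxyzaeiouh]|[^cs]h|^h)s$')

words = ['bots', 'boxs', 'watchs', 'maths', 'bus', 'hs']
matched_without = [w for w in words if without_hs.search(w)]
matched_with = [w for w in words if with_hs.search(w)]

print(matched_without)  # ['bots', 'maths']
print(matched_with)     # ['bots', 'maths', 'hs']
```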
