I'm trying to snip the first phrase in an imported string (s) which always takes the form:
"\first phrase\\...\ ... "
The first phrase can be any length and consist of more than one word
The code I initially tried was:
phrase = s[1:s.find('\',1,len(s))]
which obviously didn't work.
r'\' won't compile (returns EOL error).
Variations of the following: r'\\\'; r'\\\\\\\', "\\\", "\\\\\\\""
resolve to: phrase = s[1:-1].
As the first character is always a backslash I've also tried:
phrase = s[1:find(s[0:1],1,len(s))], but it wasn't having any of it.
Any suggestions appreciated, this was supposed to be a 10 minute job!
Backslashes in string literals need to be escaped.
'\\'
I just use the split method, which will handle your multi-word requirement easily:
>>> s='\\first phrase\\second phrase\\third phrase\\'
>>> print s
\first phrase\second phrase\third phrase\
>>> s.split('\\')
['', 'first phrase', 'second phrase', 'third phrase', '']
>>> s.split('\\')[1]
'first phrase'
The trick is to make sure the backslash is escaped by a backslash.
That's why it turns out to be \\ that you are searching for or splitting on.
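For completeness, the find-based slice from the question also works once the backslash is escaped. This is only a minimal sketch: the value of s is a made-up sample in the same shape as the question's input.

```python
# Hypothetical sample in the same shape as the question's input.
s = '\\first phrase\\second phrase\\'

# '\\' is a one-character string holding a single backslash.
end = s.find('\\', 1)   # index of the second backslash, or -1 if there is none
phrase = s[1:end] if end != -1 else s[1:]

print(phrase)  # first phrase
```

The explicit check for -1 matters: without it, a string with no second backslash would silently lose its last character, which is the s[1:-1] behaviour described in the question.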
You can't have a '\' as the last character of a string literal, even if it's a raw string; it needs to be written '\\'. In fact, if you look at your question, you'll see the highlighting go somewhat wonky. Try changing it as suggested and it may well correct itself.
Related
dataframe
string1
Data%2Fxxx
Data%2Ffrance
Data%2Fdenmark
Data%2Fnorway
Code
df['string1'] = [x.strip('Data%2F') for x in df.string1]
output
string1
xxx
france
enmark
orway
So the strip function is removing the first character ('d' and 'n'). Does anyone know why? How can I stop it from removing them? Is this related to '\d' and '\n'?
python version - 3.7.4
The strip() method returns a copy of the string with both leading and trailing characters stripped. According to https://docs.python.org/3/library/stdtypes.html#str.strip, "The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped." Examples from the documentation:
>>> ' spacious '.strip()
'spacious'
>>> 'www.example.com'.strip('cmowz.')
'example'
In other words, x.strip('Data%2F') is directing Python to strip any a's, t's, D's etc. from the beginning and end of the string. This is why "Data%2Faloha".strip("Data%2F") would actually return 'loh' unless you have, say, a space at the end, which is not part of the chars argument in your example. This is my best guess as to what's happening for you.
str.replace() should work perfectly for you.
>>> x.replace('Data%2F', '')
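If the goal is to drop the text only when it really is a prefix, one option is to check with startswith and slice. The sketch below uses a plain list instead of the DataFrame column, and drop_prefix is a made-up helper name, not a pandas or standard-library function.

```python
PREFIX = 'Data%2F'

def drop_prefix(value, prefix=PREFIX):
    """Remove prefix only when value starts with it (hypothetical helper)."""
    return value[len(prefix):] if value.startswith(prefix) else value

values = ['Data%2Fxxx', 'Data%2Ffrance', 'Data%2Fdenmark', 'Data%2Fnorway']
cleaned = [drop_prefix(v) for v in values]
print(cleaned)

# strip() treats its argument as a set of characters, not as a prefix,
# so it keeps eating matching characters from both ends:
stripped_demo = 'Data%2Faloha'.strip('Data%2F')
print(stripped_demo)  # loh
```

On Python 3.9 and later, str.removeprefix does the same job as the helper; on 3.7 the slice is needed. Unlike replace, this leaves any later occurrence of 'Data%2F' inside the value untouched.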
The correct way to proceed is with string.replace()
df['string1'] = [x.replace('Data%2F','') for x in df.string1]
The string.strip() method returns a copy of the string in which all chars have been stripped from the beginning and the end of the string.
When I tested, it gave me a different result but still incorrect.
string.strip() is more used if you want to remove spaces from the start and end of a string for example.
It should be because of \n if it happens with t as well. You should rather use replace because it won't get rid of whitespaces.
string.replace("Data%2F","")
Hello: I have a text file where the double- and single-quote characters cannot be matched and replaced (Python 3.5.2). Below is a sample word copied and pasted.
>>> line_copied_pasted = 'gilingan.”'
>>> line_copied_pasted.replace('"','')
'gilingan.”'
When the string is manually entered, matching succeeds:
>>> line_manually_entered = 'gilingan."'
>>> line_manually_entered
'gilingan."'
>>> line_manually_entered.replace('"','')
'gilingan.'
The file is UTF-16 encoded, I think. Any help to fix the problem? Thanks.
You seem to have it figured out. Since ” and " are different characters, it does not make sense to replace the latter when the string actually contains the former.
Just do :
line_copied_pasted.replace('”','')
In the copied text, ” (RIGHT DOUBLE QUOTATION MARK, U+201D) and " (QUOTATION MARK, U+0022) are different characters.
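To see which characters are actually in the string, and to remove several kinds of quote in one pass, something like the following may help. It is a sketch: the set of characters in QUOTES is an assumption about which quote marks appear in the file.

```python
import unicodedata

line_copied_pasted = 'gilingan.\u201d'   # same text as in the question, ends with a curly quote

# Show what each quote character really is.
for ch in ('"', '\u201d'):
    print(hex(ord(ch)), unicodedata.name(ch))

# Delete straight and curly quotes in one pass (the character set is an assumption).
QUOTES = '"\'\u2018\u2019\u201c\u201d'
cleaned = line_copied_pasted.translate(str.maketrans('', '', QUOTES))
print(cleaned)  # gilingan.
```

The file's encoding (UTF-16 or otherwise) is not the issue once the text has been decoded into a str. The two quote marks are distinct code points in any encoding.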
I'm writing a quick Python script to do a bit of inspection on some of our Hibernate mapping files. I'm trying to use this bit of Python to get the table name of a POJO, whether or not its class path is fully defined:
searchObj = re.search(r'<class name="(.*\\.|)' + pojo + '".*table="(.*?)"', contents)
However - say pojo is 'MyObject' - the regex is not matching it to this line:
<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">
If I print the string (while stopped in Pdb) I'm searching with, I see this:
'<class name="(.*\\\\.|)MyObject".*table="(.*?)"'
I'm quite confused as to what's going wrong here. For one, I was under the impression that the 'r' prefix made it so that the backslashes wouldn't be escaped. Even so, if I remove one of the backslashes such that my search string is this:
searchObj = re.search(r'<class name="(.*\.|)' + pojo + '".*table="(.*?)"', contents)
And the string searched becomes
'<class name="(.*\\.|)MyObject".*table="(.*?)"'
It still doesn't return a match. What's going wrong here? The regex expression I'm intending to use works on regex101.com (with just one backslash in the apparently problematic area.) Any idea what is going wrong here?
Given this:
re.search(r'<class name="(.*\\.|)' + pojo + '".*table="(.*?)"', contents)
The first part of the pattern is interpreted like this:
1. class name=" a literal string beginning with c and ending with "
2. ( the beginning of a group
3. .* zero or more of any characters
4. \\ a single literal backslash
5. . any single character
6. | OR
7. nothing
8. ) end of the group
Since the string you're searching for does not have a literal backslash, it won't match.
If what you intend is for \\. to mean "a literal period", you need a single backslash since it is inside a raw string: \.
Also, ending the group with a pipe seems weird. I'm not sure what you think that's accomplishing. If you mean to say "any number of characters ending in a dot, or nothing", you can do that with (.*\.)?, since the ? means "zero or one of the preceding match".
This seems to work for me:
import re
contents1 = '''<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">'''
contents2 = '''<class name="MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">'''
pojo="MyObject"
pattern = r'<class name="(.*\.)?' + pojo + '.*table="(.*?)"'
assert(re.search(pattern, contents1))
assert(re.search(pattern, contents2))
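A slightly stricter variant, offered as a sketch and not as the definitive pattern: it keeps the closing quote after the class name, escapes pojo in case it contains regex metacharacters, and keeps the match inside one tag. The helper name table_name and the assumption that package names consist of word characters and dots are mine.

```python
import re

def table_name(contents, pojo):
    """Return the table attribute for pojo, or None (hypothetical helper)."""
    pattern = (
        r'<class name="(?:[\w.]*\.)?'   # optional package prefix ending in a literal dot
        + re.escape(pojo)               # class name, regex metacharacters neutralised
        + r'"[^>]*\btable="([^"]*)"'    # table attribute inside the same tag
    )
    match = re.search(pattern, contents)
    return match.group(1) if match else None

full = '<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true">'
bare = '<class name="MyObject" table="my_cool_object">'

print(table_name(full, 'MyObject'))   # my_cool_object
print(table_name(bare, 'MyObject'))   # my_cool_object
print(table_name(full, 'Object'))     # None
```

The last call shows why the literal dot matters: with it, 'Object' no longer matches the tail of 'MyObject'.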
On Pythex, I tried this regex:
<class name="(.*)\.MyObject" table="([^"]*)"
on this string:
<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">
and got these two match captures:
com.place.package
my_cool_object
So I think in your case, this line
searchObj = re.search(r'<class name="(.*)\.' + pojo + '" table="([^"]*)"', contents)
will produce the result you want.
About the confusing backslashes (you add two and then four show up): the Python documentation for the re module (7.2, Regular expression operations) explains that r'' is “raw string notation”, used to circumvent Python’s regular character escaping, which uses a backslash. So:
'\\' means “a string composed of one backslash”, since the first backslash in the string escapes the second backslash. Python sees the first backslash and thinks, ‘the next character is a special one’; then it sees the second and says, ‘the special character is an actual backslash’. It’s stored as a single character \. If you ask Python for the repr of this string (which is what Pdb and the interactive prompt display), it will escape the output and show you '\\'.
r'\\' means “a string composed of two actual backslashes”. It’s stored as character \ followed by character \. If you ask Python for the repr of this string, it will escape the output and show you '\\\\'.
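A quick way to check this yourself is to compare the length, the printed form, and the repr of each string. The snippet also shows how the two candidate patterns behave on a string that contains no backslash (the sample text 'com.place' is just an illustration).

```python
import re

escaped = '\\'     # one backslash
raw = r'\\'        # two backslashes

print(len(escaped), len(raw))   # 1 2
print(escaped)                  # \
print(repr(escaped))            # '\\'
print(repr(raw))                # '\\\\'

# In a raw string, \. is "a literal period" and \\. is "a backslash, then any character".
dot_match = re.search(r'\.', 'com.place')
backslash_match = re.search(r'\\.', 'com.place')
print(dot_match.group(0), backslash_match)   # . None
```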
I'm trying to remove the apostrophe from a string in python.
Here is what I am trying to do:
source = 'weatherForecast/dataRAW/2004/grib/tmax/'
destination= 'weatherForecast/csv/2004/tmax'
for file in sftp.listdir(source):
filepath = source + str(file)
subprocess.call(['degrib', filepath, '-C', '-msg', '1', '-Csv', '-Unit', 'm', '-namePath', destination, '-nameStyle', '%e_%R.csv'])
filepath currently comes out as the path with wrapped around by apostrophes.
i.e.
`subprocess.call(['', 'weatherForecast/.../filename')]`
and I want to get the path without the apostrophes
i.e.
subprocess.call(['', weatherForecast/.../filename)]
I have tried source.strip(" ' ", ""), but it doesn't really do anything.
I have tried putting in print(filepath) or return(filepath) since these will remove the apostrophes, but they gave me syntax errors.
filepath = print(source + str(file))
^
SyntaxError: invalid syntax
I'm currently out of ideas. Any suggestions?
The strip method of a string object only removes matching characters from the ends of a string; it stops as soon as it encounters a character that is not in the set to remove.
To remove characters, replace them with the empty string.
s = s.replace("'", "")
The accepted answer to this question is actually wrong and can cause lots of trouble. The strip method removes leading and trailing characters, so you use it when the characters to remove are at the start and end of the string.
If you use replace instead, you will remove every occurrence of the character in the string. Here is a quick example.
my_string = "'Hello rokman's iphone'"
my_string.replace("'", "")
The above code will return Hello rokmans iphone. As you can see, you lost the apostrophe before the s. This is not something you would normally want. However, I believe you were only parsing a location that does not contain that character, which is why replace was fine for you at the time.
For the solution, you are doing just one thing wrong. When you call the strip method, you leave a space before and after the apostrophe. The right way to use it is like this.
my_string = "'Hello world'"
my_string.strip("'")
However, this assumes that you got '. If you get " from the response, you can change the quotes like this.
my_string = '"Hello world"'
my_string.strip('"')
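It may also be worth checking whether the apostrophes are in the string at all. When a list is printed, Python shows the repr of each element, and the repr of a str includes quotes that are not part of the value. A small sketch, where the file name example.grb is invented for illustration:

```python
source = 'weatherForecast/dataRAW/2004/grib/tmax/'
filename = 'example.grb'            # hypothetical file name
filepath = source + filename

print([filepath])                   # the list display shows quotes around the path
print(filepath)                     # the string itself has none
has_quote = "'" in filepath
print(has_quote)                    # False

# strip only touches the ends; replace removes every occurrence.
quoted = "'Hello rokman's iphone'"
stripped = quoted.strip("'")
replaced = quoted.replace("'", "")
print(stripped)
print(replaced)
```

If has_quote is False for the real paths, nothing needs removing before the list is passed to subprocess.call.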
I'm trying to use pyparsing to parse quoted strings under the following conditions:
The quoted string might contain internal quotes.
I want to use backslashes to escape internal quotes.
The quoted string might end with a backslash.
I'm struggling to define a successful parser. Also, I'm starting to wonder whether the regular expression used by pyparsing for quoted strings of this kind is correct (see my alternative regular expression below).
Am I using pyparsing incorrectly (most likely) or is there a bug in pyparsing?
Here's a script that demonstrates the problem (Note: ignore this script; please focus instead on the Update below.):
import pyparsing as pp
import re
# A single-quoted string having:
# - Internal escaped quote.
# - A backslash as the last character before the final quote.
txt = r"'ab\'cd\'"
# Parse with pyparsing.
# Does not work as expected: grabs only first 3 characters.
parser = pp.QuotedString(quoteChar = "'", escChar = '\\', escQuote = '\\')
toks = parser.parseString(txt)
print
print 'txt: ', txt
print 'pattern:', parser.pattern
print 'toks: ', toks
# Parse with a regex just like the pyparsing pattern, but with
# the last two groups flipped -- which seems more correct to me.
# This works.
rgx = re.compile(r"\'(?:[^'\n\r\\]|(?:\\.)|(?:\\))*\'")
print
print rgx.search(txt).group(0)
Output:
txt: 'ab\'cd\'
pattern: \'(?:[^'\n\r\\]|(?:\\)|(?:\\.))*\'
toks: ["ab'"]
'ab\'cd\'
Update
Thanks for the replies. I suspect that I've confused things by framing my question badly, so let me try again.
Let's say we are trying to parse a language that uses quoting rules generally like Python's. We want users to be able to define strings that can include internal quotes (protected by backslashes) and we want those strings to be able to end with a backslash. Here's an example file in our language. Note that the file would also parse as valid Python syntax, and if we printed foo (in Python), the output would be the literal value: ab'cd\
# demo.txt
foo = 'ab\'cd\\'
My goal is to use pyparsing to parse such a language. Is there a way to do it? The question above is basically where I ended up after several failed attempts. Below is my initial attempt. It fails because there are two backslashes at the end, rather than just one.
with open('demo.txt') as fh:
txt = fh.read().split()[-1].strip()
parser = pp.QuotedString(quoteChar = "'", escChar = '\\')
toks = parser.parseString(txt)
print
print 'txt: ', txt
print 'pattern:', parser.pattern
print 'toks: ', toks # ["ab'cd\\\\"]
I guess the problem is that QuotedString treats the backslash only as a quote-escape whereas Python treats a backslash as a more general-purpose escape.
Is there a simple way to do this that I'm overlooking? One workaround that occurs to me is to use .setParseAction(...) to handle the double-backslashes after the fact -- perhaps like this, which seems to work:
qHandler = lambda s,l,t: [ t[0].replace('\\\\', '\\') ]
parser = pp.QuotedString(quoteChar = "'", escChar = '\\').setParseAction(qHandler)
I think you're misunderstanding the use of escQuote. According to the docs:
escQuote - special quote sequence to escape an embedded quote string (such as SQL's "" to escape an embedded ") (default=None)
So escQuote is for specifying a complete sequence that is parsed as a literal quote. In the example given in the docs, for instance, you would specify escQuote='""' and it would be parsed as ". By specifying a backslash as escQuote, you are causing a single backslash to be interpreted as a quotation mark. You don't see this in your example because you don't escape anything but quotes. However, if you try to escape something else, you'll see it won't work:
>>> txt = r"'a\Bc'"
>>> parser = pyp.QuotedString(quoteChar = "'", escChar = '\\', escQuote = "\\")
>>> parser.parseString(txt)
(["a'Bc"], {})
Notice that the backslash was replaced with '.
As for your alternative, I think the reason that pyparsing (and many other parsers) don't do this is that it involves special-casing one position within the string. In your regex, a single backslash is an escape character everywhere except as the last character in the string, in which position it is treated literally. This means that you cannot tell "locally" whether a given quote is really the end of the string or not --- even if it has a backslash, it might not be the end if there is one later on without a backslash. This can lead to parse ambiguities and surprising parsing behavior. For instance, consider these examples:
>>> txt = r"'ab\'xxxxxxx"
>>> print rgx.search(txt).group(0)
'ab\'
>>> txt = r"'ab\'xxxxxxx'"
>>> print rgx.search(txt).group(0)
'ab\'xxxxxxx'
By adding an apostrophe at the end of the string, I suddenly caused the earlier apostrophe to no longer be the end, and added all the xs to the string at once. In a real-usage context, this can lead to confusing situations in which mismatched quotes silently result in a reparsing of the string rather than a parse error.
Although I can't come up with an example at the moment, I also suspect that this has the possibility to cause "catastrophic backtracking" if you actually try to parse a sizable document containing multiple strings of this type. (This was my point about the "100MB of other text".) Because the parser can't know whether a given \' is the end of the string without parsing further, it might potentially have to go all the way to the end of the file just to make sure there are no more quote marks out there. If that remaining portion of the file contains additional strings of this type, it may become complicated to figure out which quotes are delimiting which strings. For instance, if the input contains something like
'one string \' 'or two'
we can't tell whether this is two valid strings (one string \ and or two) or one with invalid material after it (one string \' and the non-string tokens or two followed by an unmatched quote). This kind of situation is not desirable in many parsing contexts; you want the decisions about where strings begin and end to be locally determinable, and not depend on the occurrence of other tokens much later in the document.
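For comparison, here is a sketch of the locally determinable rule, using the standard re module in place of pyparsing: a backslash always escapes the next character, wherever it appears, so a trailing backslash has to be written as two. The pattern is my own illustration, not the one pyparsing generates.

```python
import re

# Backslash escapes whatever character follows it, everywhere in the string.
quoted = re.compile(r"'(?:[^'\\\n\r]|\\.)*'")

complete = quoted.search(r"'ab\'cd\\'")
unterminated = quoted.search(r"'ab\'xxxxxxx")
first = quoted.match(r"'one string \' 'or two'")

print(complete.group(0))   # 'ab\'cd\\'
print(unterminated)        # None
print(first.group(0))      # 'one string \' '
```

With this rule the unterminated input fails outright and is not reinterpreted, and the ambiguous example has exactly one reading: a single string ending at the third apostrophe, followed by leftover text.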
What is it about this code that is not working for you?
from pyparsing import *
s = r"foo = 'ab\'cd\\'" # <--- IMPORTANT - use a raw string literal here
ident = Word(alphas)
strValue = QuotedString("'", escChar='\\')
strAssign = ident + '=' + strValue
results = strAssign.parseString(s)
print results.asList() # displays repr form of each element
for r in results:
print r # displays str form of each element
# count the backslashes
backslash = '\\'
print results[-1].count(backslash)
prints:
['foo', '=', "ab'cd\\\\"]
foo
=
ab'cd\\
2
EDIT:
So "\'" becomes just "'", but "\\" is parsed and stays as "\\" instead of being reduced to a single "\". Looks like a bug in QuotedString. For now you can add this workaround:
import re
strValue.setParseAction(lambda t: re.sub(r'\\(.)', r'\g<1>', t[0]))
Which will take every escaped character sequence and just give back the escaped character alone, without the leading '\'.
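To see what that substitution does on its own, here it is as a standalone function without pyparsing. The name unescape and its esc parameter are made up for this sketch.

```python
import re

def unescape(body, esc='\\'):
    """Replace each escape sequence (esc plus any one character) with the bare character."""
    # '.' does not match a newline, so an escaped newline is left untouched.
    return re.sub(re.escape(esc) + r'(.)', r'\1', body)

body = r"ab\'cd\\"          # the text between the outer quotes in demo.txt
result = unescape(body)
print(result)               # ab'cd\
```

Because re.sub scans left to right without overlapping, the pair \\ is consumed as one escape sequence and yields a single backslash.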
I'll add this in the next patch release of pyparsing.
PyParsing's QuotedString parser does not handle quoted strings that end with backslashes. This is a fundamental limitation, that doesn't have any easy workaround that I can see. If you want to support that kind of string, you'll need to use something other than QuotedString.
This is not an uncommon limitation either. Python itself does not allow an odd number of backslashes at the end of a "raw" string literal. Try it: r"foo\" raises a SyntaxError, while r"bar\\" will include both backslashes in the output.
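The raw-string claim can be checked without crashing the interpreter by handing the literal's source text to ast.literal_eval. A small sketch; is_valid_literal is a hypothetical helper name.

```python
import ast

def is_valid_literal(source):
    """Return True if source parses as a Python literal (hypothetical helper)."""
    try:
        ast.literal_eval(source)
        return True
    except SyntaxError:
        return False

odd = 'r"foo\\"'       # source text: r"foo\"
even = 'r"bar\\\\"'    # source text: r"bar\\"

print(is_valid_literal(odd))    # False
print(is_valid_literal(even))   # True

value = ast.literal_eval(even)
print(len(value))               # 5: b, a, r and two backslashes
```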
The reason you are getting truncated output (rather than an exception) from your current code is because you're passing a backslash as the escQuote parameter. I think that is intended to be an alternative to specifying an escape character, rather than a supplement. What is happening is that the first backslash is being interpreted as an internal quote (which it unescapes), and since it's followed by an actual quote character, the parser thinks it's reached the end of the quoted string. Thus you get ab' as your result.