split a file based on string - python

I am trying to split one big file into individual entries. Each entry ends with the string "//". So I tried this:
#!/usr/bin/python
import sys,os
uniprotFile=open("UNIPROT-data.txt") #read original alignment file
uniprotFileContent=uniprotFile.read()
uniprotFileList=uniprotFileContent.split("//")
for items in uniprotFileList:
    seqInfoFile=open('%s.dat'%items[5:14],'w')
    seqInfoFile.write(str(items))
But I realised that there is another string containing "//" (http://www.uniprot.org/terms), so it splits there as well and I don't get the result I want. I tried using a regex but was not able to figure it out.

Use a regex that only splits on // if it's not preceded by :
import re
myre = re.compile("(?<!:)//")
uniprotFileList = myre.split(uniprotFileContent)
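To see the lookbehind split in action, here is a minimal sketch on a made-up two-entry sample. The sample text and the variable names are assumptions for illustration, not real UniProt data:

```python
import re

# Hypothetical sample: two entries, the first one containing a URL with "//".
sample = (
    "ID   ABC_HUMAN\n"
    "CC   See http://www.uniprot.org/terms\n"
    "//\n"
    "ID   XYZ_MOUSE\n"
    "//\n"
)

# Split on "//" only when it is not immediately preceded by ":".
splitter = re.compile(r"(?<!:)//")
entries = [chunk.strip() for chunk in splitter.split(sample) if chunk.strip()]

for entry in entries:
    print(repr(entry))
```

The URL stays inside the first entry because its "//" follows a colon. Whitespace-only chunks (such as the one after the final terminator) are filtered out.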

I am using the code with modified split pattern and it works fine for me:
#!/usr/bin/python
import sys,os
uniprotFile = open("UNIPROT-data.txt")
uniprotFileContent = uniprotFile.read()
uniprotFileList = uniprotFileContent.split("//\n")
for items in uniprotFileList:
    seqInfoFile = open('%s.dat' % items[5:17], 'w')
    seqInfoFile.write(str(items))
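If the file is too big to read in one go, the same idea (treat a line that is exactly "//" as the terminator) can be done line by line. This is only a sketch: iter_entries is a hypothetical helper name, and the in-memory sample stands in for the real file:

```python
import io

def iter_entries(handle):
    """Yield each entry as one string; a line that is exactly '//' ends an entry."""
    lines = []
    for line in handle:
        if line.rstrip("\n") == "//":
            yield "".join(lines)
            lines = []
        else:
            lines.append(line)
    if lines:  # trailing text without a terminator line
        yield "".join(lines)

# Hypothetical sample standing in for open("UNIPROT-data.txt").
sample = io.StringIO(
    "ID   ABC_HUMAN\nCC   http://www.uniprot.org/terms\n//\nID   XYZ_MOUSE\n//\n"
)
entries = list(iter_entries(sample))
print(len(entries))
```

Because only whole lines are compared, a "//" inside a URL can never be mistaken for a terminator.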

You're confusing \ (backslash) and / (slash). You don't need to escape a slash, just use "/". For a backslash, you do need to escape it, so use "\\".
Secondly, if you split with a backslash it will not split on a slash or vice-versa.

Split using a regular expression that doesn't permit the "http:" part before your // marker.
For example: "([^:])\/\/"
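One thing to watch with a pattern like this: [^:] consumes the character before the slashes, and because it sits in a capturing group, re.split also returns that character as a separate list item. A small sketch on made-up text, comparing it with a lookbehind:

```python
import re

text = "entry one\n//\nentry two\n//\n"  # hypothetical sample

# Capturing group: the consumed character shows up in the result.
with_group = re.split(r"([^:])//", text)

# Non-capturing: the character is consumed and dropped from the entries.
without_group = re.split(r"[^:]//", text)

# Lookbehind: nothing but the two slashes is consumed.
lookbehind = re.split(r"(?<!:)//", text)

print(with_group)
print(without_group)
print(lookbehind)
```

For entries that each end in a newline before the terminator this makes little practical difference, but the extra list items from the capturing version are easy to trip over.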

You appear to be mixing up two different characters. A backslash is \ and a forward slash is /; only the backslash is an escape character in Python string literals. Open a prompt and inspect the strings you're using. You'll see something like:
>>> "\\"
'\\'
>>> "\"
SyntaxError
>>> r"\"
SyntaxError
>>> "//"
'//'
So a single backslash has to be written "\\" (a raw string cannot end in a single backslash, so r"\" does not work), while "//" is just two forward slashes and needs no escaping.

Related

EOL while concatenating string + path

I need to concatenate specific folder path with a string, for example:
mystring = "blablabla"
path = "C:\folder\whatever\"
printing (path + mystring) should return:
C:\folder\whatever\blablabla
However, I always get the EOL error, and the path must use backslashes (\), not forward slashes (/).
Please show me the way. I tried the r prefix and it did not work; I tried doubling the quotes (""); nothing works and I can't figure it out.
Always use os.path.join() to join paths, and use the r prefix to allow single backslashes as Windows path separators:
r"C:\folder\whatever"
Now no trailing backslash is needed:
>>> import os
>>> mystring = "blablabla"
>>> path = r"C:\folder\whatever"
>>> os.path.join(path, mystring)
'C:\\folder\\whatever\\blablabla'
Either use escape character \\ for \:
mystring = "blablabla"
path = "C:\\folder\\whatever\\"
conc = path + mystring
print(conc)
# C:\folder\whatever\blablabla
Or use raw strings, moving the last backslash from the end of path to the start of mystring:
mystring = r"\blablabla"
path = r"C:\folder\whatever"
conc = path + mystring
print(conc)
# C:\folder\whatever\blablabla
The reason why your own raw string approach didn't work is that a raw string may not end with a single backslash:
Specifically, a raw literal cannot end in a single backslash (since
the backslash would escape the following quote character).
From
https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
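If the trailing backslash really has to be part of path, one way around the raw-string restriction is to add it as a separate, normally escaped literal. A minimal sketch using the names from the question (folder and full are names chosen for this example):

```python
# A raw string cannot end in a single backslash, so the trailing
# separator is appended as an ordinary escaped literal.
folder = r"C:\folder\whatever" + "\\"
mystring = "blablabla"

full = folder + mystring
print(full)
```

Writing the two literals next to each other without the + (r"C:\folder\whatever" "\\") gives the same string, since adjacent string literals are concatenated.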
Two things.
First, with regard to the EOL error, my best guess (without access to the actual Python session) is that Python was complaining about an unterminated string caused by the final " character being escaped, which will happen even if the string is prefixed with r. My opinion is that you should drop the prefix and just correctly escape all backslashes like so: \\.
In your example, path then becomes path = "C:\\folder\\whatever\\"
Secondly, instead of manually concatenating paths, you should use os.path.join:
import os
mystring = "blablabla"
path = "C:\\folder\\whatever"
print(os.path.join(path, mystring))
# prints C:\folder\whatever\blablabla (on Windows)
Note that os.path uses the path conventions of the operating system the application is running on, so the above code will produce unexpected results if you run it on, say, Linux. Check the notes at the top of the os.path documentation for details.
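If you need Windows-style joining regardless of where the script runs, the standard library offers ntpath (the Windows flavour of os.path) and pathlib.PureWindowsPath. A short sketch with the values from the question:

```python
import ntpath
from pathlib import PureWindowsPath

path = r"C:\folder\whatever"
mystring = "blablabla"

# ntpath always applies Windows rules, even on Linux or macOS.
joined = ntpath.join(path, mystring)

# PureWindowsPath does the same with the pathlib interface.
pure = str(PureWindowsPath(path) / mystring)

print(joined)
print(pure)
```

Both give the backslash-separated result on any platform, because neither touches the real file system or consults the host OS conventions.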

Simple python regex

I have a text file and I want to replace the following pattern:
\"
with:
"
The initial version of what I'm looking at looks like:
{"latestProfileVersion":51,
"scannerAccess":true,
"productRatings":"[{\"7H65018000\":{\"reviewCount\":0,\"avgRating\":0}}
So someone embedded a JSON string inside a JSON response.
This is what I have currently:
rawAuthResponseTextFile = open(rawAuthResponseFilename,'r')
formattedAuthResponse = open('formattedAuthResponse.txt', 'w')
try:
    stringVersionOfAuthResponse = rawAuthResponseTextFile.read().replace('\n','')
    cleanedStringVersionOfAuthResponse = re.sub(r'\"', '"', stringVersionOfAuthResponse)
    jsonVersionOfAuthResponse = json.dumps(cleanedStringVersionOfAuthResponse)
    formattedAuthResponse.write(jsonVersionOfAuthResponse)
finally:
    rawAuthResponseTextFile.close()
    formattedAuthResponse.close()
Using http://pythex.org/ I have found that r'\"' should match only \", but this is not the case when I look at the output which appears to be adding additional escape characters.
I know I am doing something wrong because I cannot get the quotes around the embedded string to look like the quotes in the regular JSON no matter how much I tweak it, escape characters or no.
You need to use this regex
\\"
In a regex, \ is itself the escape character, so you need to escape \ with another \.
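Here is a small sketch of that regex on a shortened, made-up version of the response (the closing ] and } are added so the sample is complete JSON; that is an assumption about the real data). It also shows an alternative that avoids the regex entirely by decoding the JSON twice:

```python
import json
import re

# Hypothetical, shortened response with a JSON string embedded in JSON.
raw = r'{"latestProfileVersion":51,"productRatings":"[{\"7H65018000\":{\"reviewCount\":0}}]"}'

# Regex from the answer: a literal backslash followed by a double quote.
unescaped = re.sub(r'\\"', '"', raw)
print(unescaped)

# Alternative: parse the outer document, then parse the embedded string.
outer = json.loads(raw)
inner = json.loads(outer["productRatings"])
print(inner[0]["7H65018000"]["reviewCount"])
```

Note that the regex output is no longer valid JSON, because the embedded list is still wrapped in quotes. Also, calling json.dumps on a plain string (as in the question) re-escapes every quote, which would explain the extra escape characters in the output.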

With pyparsing, how do you parse a quoted string that ends with a backslash

I'm trying to use pyparsing to parse quoted strings under the following conditions:
The quoted string might contain internal quotes.
I want to use backslashes to escape internal quotes.
The quoted string might end with a backslash.
I'm struggling to define a successful parser. Also, I'm starting to wonder whether the regular expression used by pyparsing for quoted strings of this kind is correct (see my alternative regular expression below).
Am I using pyparsing incorrectly (most likely) or is there a bug in pyparsing?
Here's a script that demonstrates the problem (Note: ignore this script; please focus instead on the Update below.):
import pyparsing as pp
import re
# A single-quoted string having:
# - Internal escaped quote.
# - A backslash as the last character before the final quote.
txt = r"'ab\'cd\'"
# Parse with pyparsing.
# Does not work as expected: grabs only first 3 characters.
parser = pp.QuotedString(quoteChar = "'", escChar = '\\', escQuote = '\\')
toks = parser.parseString(txt)
print
print 'txt: ', txt
print 'pattern:', parser.pattern
print 'toks: ', toks
# Parse with a regex just like the pyparsing pattern, but with
# the last two groups flipped -- which seems more correct to me.
# This works.
rgx = re.compile(r"\'(?:[^'\n\r\\]|(?:\\.)|(?:\\))*\'")
print
print rgx.search(txt).group(0)
Output:
txt: 'ab\'cd\'
pattern: \'(?:[^'\n\r\\]|(?:\\)|(?:\\.))*\'
toks: ["ab'"]
'ab\'cd\'
Update
Thanks for the replies. I suspect that I've confused things by framing my question badly, so let me try again.
Let's say we are trying to parse a language that uses quoting rules generally like Python's. We want users to be able to define strings that can include internal quotes (protected by backslashes) and we want those strings to be able to end with a backslash. Here's an example file in our language. Note that the file would also parse as valid Python syntax, and if we printed foo (in Python), the output would be the literal value: ab'cd\
# demo.txt
foo = 'ab\'cd\\'
My goal is to use pyparsing to parse such a language. Is there a way to do it? The question above is basically where I ended up after several failed attempts. Below is my initial attempt. It fails because there are two backslashes at the end, rather than just one.
with open('demo.txt') as fh:
    txt = fh.read().split()[-1].strip()
parser = pp.QuotedString(quoteChar = "'", escChar = '\\')
toks = parser.parseString(txt)
print
print 'txt: ', txt
print 'pattern:', parser.pattern
print 'toks: ', toks # ["ab'cd\\\\"]
I guess the problem is that QuotedString treats the backslash only as a quote-escape whereas Python treats a backslash as a more general-purpose escape.
Is there a simple way to do this that I'm overlooking? One workaround that occurs to me is to use .setParseAction(...) to handle the double-backslashes after the fact -- perhaps like this, which seems to work:
qHandler = lambda s,l,t: [ t[0].replace('\\\\', '\\') ]
parser = pp.QuotedString(quoteChar = "'", escChar = '\\').setParseAction(qHandler)
I think you're misunderstanding the use of escQuote. According to the docs:
escQuote - special quote sequence to escape an embedded quote string (such as SQL's "" to escape an embedded ") (default=None)
So escQuote is for specifying a complete sequence that is parsed as a literal quote. In the example given in the docs, for instance, you would specify escQuote='""' and it would be parsed as ". By specifying a backslash as escQuote, you are causing a single backslash to be interpreted as a quotation mark. You don't see this in your example because you don't escape anything but quotes. However, if you try to escape something else, you'll see it won't work:
>>> txt = r"'a\Bc'"
>>> parser = pyp.QuotedString(quoteChar = "'", escChar = '\\', escQuote = "\\")
>>> parser.parseString(txt)
(["a'Bc"], {})
Notice that the backslash was replaced with '.
As for your alternative, I think the reason that pyparsing (and many other parsers) don't do this is that it involves special-casing one position within the string. In your regex, a single backslash is an escape character everywhere except as the last character in the string, in which position it is treated literally. This means that you cannot tell "locally" whether a given quote is really the end of the string or not --- even if it has a backslash, it might not be the end if there is one later on without a backslash. This can lead to parse ambiguities and surprising parsing behavior. For instance, consider these examples:
>>> txt = r"'ab\'xxxxxxx"
>>> print rgx.search(txt).group(0)
'ab\'
>>> txt = r"'ab\'xxxxxxx'"
>>> print rgx.search(txt).group(0)
'ab\'xxxxxxx'
By adding an apostrophe at the end of the string, I suddenly caused the earlier apostrophe to no longer be the end, and added all the xs to the string at once. In a real-usage context, this can lead to confusing situations in which mismatched quotes silently result in a reparsing of the string rather than a parse error.
Although I can't come up with an example at the moment, I also suspect that this has the possibility to cause "catastrophic backtracking" if you actually try to parse a sizable document containing multiple strings of this type. (This was my point about the "100MB of other text".) Because the parser can't know whether a given \' is the end of the string without parsing further, it might potentially have to go all the way to the end of the file just to make sure there are no more quote marks out there. If that remaining portion of the file contains additional strings of this type, it may become complicated to figure out which quotes are delimiting which strings. For instance, if the input contains something like
'one string \' 'or two'
we can't tell whether this is two valid strings (one string \ and or two) or one with invalid material after it (one string \' and the non-string tokens or two followed by an unmatched quote). This kind of situation is not desirable in many parsing contexts; you want the decisions about where strings begin and end to be locally determinable, and not depend on the occurrence of other tokens much later in the document.
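To make the "locally determinable" point concrete, here is a plain-regex sketch (not pyparsing, and not the pattern pyparsing generates) in which a backslash always escapes the next character and an unescaped quote always ends the string. QUOTED is a name chosen for this example:

```python
import re

# A backslash always escapes the following character; an unescaped quote
# always closes the string. No position in the string is special-cased.
QUOTED = re.compile(r"'(?:[^'\\\n]|\\.)*'")

python_style = QUOTED.findall(r"foo = 'ab\'cd\\'")
ambiguous = QUOTED.findall(r"'one string \' 'or two'")

print(python_style)
print(ambiguous)
```

With these rules the Python-style example from the update is matched as a single string, and the second input has exactly one reading: one string containing an escaped quote, followed by leftover material with an unmatched quote.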
What is it about this code that is not working for you?
from pyparsing import *
s = r"foo = 'ab\'cd\\'" # <--- IMPORTANT - use a raw string literal here
ident = Word(alphas)
strValue = QuotedString("'", escChar='\\')
strAssign = ident + '=' + strValue
results = strAssign.parseString(s)
print results.asList() # displays repr form of each element
for r in results:
    print r # displays str form of each element
# count the backslashes
backslash = '\\'
print results[-1].count(backslash)
prints:
['foo', '=', "ab'cd\\\\"]
foo
=
ab'cd\\
2
EDIT:
So "\'" becomes just "'", but "\\" is parsed and stays as "\\" instead of being unescaped to "\". Looks like a bug in QuotedString. For now you can add this workaround:
import re
strValue.setParseAction(lambda t: re.sub(r'\\(.)', r'\g<1>', t[0]))
Which will take every escaped character sequence and just give back the escaped character alone, without the leading '\'.
I'll add this in the next patch release of pyparsing.
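The substitution used in that workaround can be tried on its own, without pyparsing. A minimal sketch, where unescape is a hypothetical helper name and the inputs are the string bodies from the examples above:

```python
import re

def unescape(text):
    """Replace each backslash-escaped character with the character itself."""
    return re.sub(r"\\(.)", r"\g<1>", text)

body = r"ab\'cd\\"      # what sits between the quotes in demo.txt
cleaned = unescape(body)
print(cleaned)          # escaped quote and escaped backslash both collapse
```

Since the regex consumes two characters at a time, a doubled backslash collapses to one backslash rather than being treated as two separate escapes.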
PyParsing's QuotedString parser does not handle quoted strings that end with backslashes. This is a fundamental limitation, that doesn't have any easy workaround that I can see. If you want to support that kind of string, you'll need to use something other than QuotedString.
This is not an uncommon limitation either. Python itself does not allow an odd number of backslashes at the end of a "raw" string literal. Try it: r"foo\" will raise an exception, while r"bar\\" will include both backslashes in the output.
The reason you are getting truncated output (rather than an exception) from your current code is because you're passing a backslash as the escQuote parameter. I think that is intended to be an alternative to specifying an escape character, rather than a supplement. What is happening is that the first backslash is being interpreted as an internal quote (which it unescapes), and since it's followed by an actual quote character, the parser thinks it's reached the end of the quoted string. Thus you get ab' as your result.

python replace single backslash with double backslash [duplicate]

In Python, I am trying to replace a single backslash ("\") with a double backslash ("\\"). I have the following code:
directory = string.replace("C:\Users\Josh\Desktop\20130216", "\", "\\")
However, this gives an error message saying it doesn't like the double backslash. Can anyone help?
No need to use str.replace or string.replace here, just convert that string to a raw string:
>>> strs = r"C:\Users\Josh\Desktop\20130216"
           ^
           |
           notice the 'r'
Below is the repr version of the above string, that's why you're seeing \\ here.
But, in fact the actual string contains just '\' not \\.
>>> strs
'C:\\Users\\Josh\\Desktop\\20130216'
>>> s = r"f\o"
>>> s #repr representation
'f\\o'
>>> len(s) #length is 3, as there's only one `'\'`
3
But when you're going to print this string you'll not get '\\' in the output.
>>> print strs
C:\Users\Josh\Desktop\20130216
If you want the string to show '\\' during print then use str.replace:
>>> new_strs = strs.replace('\\','\\\\')
>>> print new_strs
C:\\Users\\Josh\\Desktop\\20130216
repr version will now show \\\\:
>>> new_strs
'C:\\\\Users\\\\Josh\\\\Desktop\\\\20130216'
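The same steps as a script, as a quick way to check how many backslashes are really in each string (doubled is a name chosen for this sketch):

```python
path = r"C:\Users\Josh\Desktop\20130216"
doubled = path.replace("\\", "\\\\")

print(path)             # one backslash per separator
print(doubled)          # two backslashes per separator
print(repr(path))       # repr doubles every backslash it displays

# Counting is a reliable way to see what the string actually contains.
print(path.count("\\"), doubled.count("\\"))
```

The count is what matters: repr only changes how the string is displayed, not what is stored in it.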
Let me make it simple and clear. Let's use the re module in Python to escape the special characters.
Python script:
import re
s = r"C:\Users\Josh\Desktop"
print(s)
print(re.escape(s))
Output:
C:\Users\Josh\Desktop
C:\\Users\\Josh\\Desktop
Explanation:
re.escape puts a backslash in front of every character that is special in a regular expression. The backslash is one of those characters, so each backslash in the path comes out doubled, which is the desired output here. (Python versions before 3.7 escape almost every non-alphanumeric character, so there the colon is escaped as well.)
Hope this helps you.
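One caveat with re.escape, sketched below on a made-up file name: it is meant for building regular expressions, so it escapes other characters too (a dot, for instance), whereas str.replace touches only the backslashes:

```python
import re

s = r"a.b\c"  # hypothetical input containing a dot and a backslash

via_escape = re.escape(s)             # escapes the dot as well
via_replace = s.replace("\\", "\\\\")  # doubles backslashes only

print(via_escape)
print(via_replace)
```

For a path like the one in the question, the two approaches can therefore give different results as soon as the path contains a dot, a space, or another character that re.escape treats as special.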
Use escape characters: "full\\path\\here", "\\" and "\\\\"
In Python, \ (backslash) is used as an escape character. This means that where you wish to insert a special character (such as a newline), you use the backslash followed by another character (\n for newline).
With your example string, when you put "C:\Users\Josh\Desktop\20130216" in the Python 2 REPL you get "C:\\Users\\Josh\\Desktop\x8130216". This is because \201 is an octal escape in a Python string. (Python 3 rejects the literal outright, because \U starts a Unicode escape.) If you wish to specify a literal \ then you need to put two \\ in your string.
"C:\\Users\\Josh\\Desktop\\20130216"
The other option is to notify python that your entire string must NOT use \ as an escape character by pre-pending the string with r
r"C:\Users\Josh\Desktop\20130216"
This is a "raw" string, and very useful in situations where you need to use lots of backslashes such as with regular expression strings.
In case you still wish to replace that single \ with \\ you would then use:
directory = string.replace(r"C:\Users\Josh\Desktop\20130216", "\\", "\\\\")
Notice that I am not using r' in the last two strings above. This is because, when you use the r' form of strings you cannot end that string with a single \
Why can't Python's raw string literals end with a single backslash?
https://pythonconquerstheuniverse.wordpress.com/2008/06/04/gotcha-%E2%80%94-backslashes-are-escape-characters/
Maybe a syntax error in your case,
you may change the line to:
directory = str(r"C:\Users\Josh\Desktop\20130216").replace('\\','\\\\')
which gives you the following output:
C:\\Users\\Josh\\Desktop\\20130216
The backslash introduces an escape sequence. Therefore, directory = path_to_directory.replace("\", "\\") would cause Python to think that the first argument to replace didn't end until the starting quotation mark of the second argument, since it read the closing quotation mark as an escaped character. Escape each backslash instead:
directory=path_to_directory.replace("\\","\\\\")
Given the source string, manipulation with os.path might make more sense, but here's a string solution;
>>> s=r"C:\Users\Josh\Desktop\\20130216"
>>> '\\\\'.join(filter(bool, s.split('\\')))
'C:\\\\Users\\\\Josh\\\\Desktop\\\\20130216'
Note that split treats the \\ in the source string as a delimited empty string. Using filter gets rid of those empty strings so join won't double the already doubled backslashes. Unfortunately, if you have 3 or more, they get reduced to doubled backslashes, but I don't think that hurts you in a windows path expression.
You could use
os.path.abspath(path_with_backlash)
it returns the path with \
Use:
string.replace(r"C:\Users\Josh\Desktop\20130216", "\\", "\\\\")
Escape the \ character.

Matching "~" at the end of a filename with a python regular expression

I'm working on a Python script to find some files. I compare file names against a regular expression pattern. Now I have to find files ending with a "~" (tilde), so I built this regex:
if re.match("~$", string_test):
    print "ok!"
Well, Python doesn't seem to recognize the regex, I don't know why. I tried the same regex in other languages and it works perfectly, any idea?
PS: I read on a website that I have to insert
# -*- coding: utf-8 -*-
but that doesn't help :( .
Thanks a lot; meanwhile I'm going to keep reading to see if I find something.
re.match() is only successful if the regular expression matches at the beginning of the input string. To search for any substring, use re.search() instead:
if re.search("~$", string_test):
    print "ok!"
Your regex will only match strings "~" and (believe it or not) "~\n".
You need re.match(r".*~$", whatever) ... that means zero or more of (anything except a newline) followed by a tilde followed by (end-of-string or a newline preceding the end of string).
In the unlikely event that a filename can include a newline, use the re.DOTALL flag and use \Z instead of $.
"worked" in other languages: you must have used a search function.
The r at the beginning of a string constant means a raw string, e.g. '\n' is a newline but r'\n' is two characters, a backslash followed by n, which can also be written as '\\n'. Raw strings save a lot of \\ in regexes; it's a good habit to always write r"regex".
BTW: in this case avoid the regex confusion ... use whatever.endswith('~')
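A quick side-by-side of the options mentioned so far, on a made-up file name (name is an assumption for this sketch):

```python
import re

name = "notes.txt~"  # hypothetical backup file name

anchored_at_start = re.match(r"~$", name)    # None: match() starts at position 0
found_anywhere = re.search(r"~$", name)      # finds the tilde at the end
with_prefix = re.match(r".*~$", name)        # match() works once .* is added
plain = name.endswith("~")                   # no regex needed

print(anchored_at_start, bool(found_anywhere), bool(with_prefix), plain)

# $ also matches just before a trailing newline, \Z does not.
print(bool(re.match(r"~$", "~\n")), bool(re.match(r"~\Z", "~\n")))
```

For a simple suffix test endswith is the easiest to read, and it has no newline corner case to think about.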
For finding files, use glob instead,
import os
import glob
path = '/path/to/files'
os.chdir(path)
files = glob.glob('./*~')
print files
The correct regex and the glob solution have already been posted. Another option is to use the fnmatch module:
import fnmatch
if fnmatch.fnmatch(string_test, "*~"):
    print "ok!"
This is a tiny bit easier than using a regex. Note that all methods posted here are essentially equivalent: fnmatch is implemented using regular expressions, and glob in turn uses fnmatch.
Note that only in 2009 a patch was added to fnmatch (after six years!) that added support for file names with newlines.
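When you already have a list of names (from os.listdir, say), fnmatch.filter applies the pattern to the whole list in one call. A sketch with made-up names:

```python
import fnmatch

# Hypothetical directory listing.
names = ["report.txt", "report.txt~", "notes~", "~draft"]

backups = fnmatch.filter(names, "*~")
print(backups)
```

Only names that end in a tilde are kept; "~draft" is left out because the pattern requires the tilde at the end.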
