Formatting text that is meant to be replaced - python

This is a rather generic question, but I have a textfile that I want to edit using a script.
What are some ways to format text, so that it will visually stand out but still be recognized by my script?
It works fine when I use text_to_be_replaced, but it is hard to find when you have a large file.
Tried searching, and it seems that the common ways are:
%text_to_be_replaced%
<text_to_be_replaced>
$(text_to_be_replaced)
But maybe there is a commonly used/widely accepted way to format text for visibility?
The language the script is written in is python, if that matters... but I'm looking for a more-or-less generic solution which will work 90% of the time.

I'm not aware of any generic standard here, but if it's meant to be replaced, you can use the new string formatting method as follows:
string = 'some text {add_text_here} some more text'
Then to replace it when you need to:
value = 'formatted'
string = string.format(add_text_here=value)
Now print it out:
>>> string
'some text formatted some more text'
In fact, this is quite neat: the curly {brackets} around the text that needs to be replaced may also make it stand out a little.

At first I thought that {{curly braces}} would be fine, but then I went with $ALLCAPS.
First of all, caps really stands out, while lowercase may be confused with the rest of the code.
And while it $REALLYSTANDSOUT, it shouldn't cause any problems, since it's just a "bookmark" in a text file, and will be replaced with the appropriate stuff determined by the script.
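For the $ALLCAPS style, Python's standard library has string.Template, which uses this kind of placeholder syntax. A minimal sketch, assuming a made-up template text and made-up placeholder names ($USERNAME, $OUTPUT_DIR); in a real script the text would come from the file being edited:

```python
from string import Template

# Hypothetical template text; in practice this would be read from the file.
template_text = "user = $USERNAME\npath = ${OUTPUT_DIR}/results\n"

# Hypothetical replacement values chosen for this example.
values = {"USERNAME": "alice", "OUTPUT_DIR": "/tmp/run1"}

# substitute() raises KeyError if a placeholder has no value;
# safe_substitute() would leave unknown placeholders untouched instead.
result = Template(template_text).substitute(values)
print(result)
```

The ${NAME} form is useful when the placeholder is immediately followed by characters that could otherwise be read as part of the name.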

Related

Fastest way to extract part of a long string in Python

I have a large set of strings, and am looking to extract a certain part of each of the strings. Each string contains a sub string like this:
my_token:[
"key_of_interest"
],
This is the only part in each string it says my_token. I was thinking about getting the end index position of ' my_token:[" ' and after that getting the beginning index position of ' "], ' and getting all the text between those two index positions.
Is there a better or more efficient way of doing this? I'll be doing this for strings of length ~10,000 and sets of size 100,000.
Edit: The file is a .ion file. From my understanding it can be treated as a flat file - as it is text based and used for describing metadata.
How can this possibly be done the "dumbest and simplest way"?
find the starting position
look on for the ending position
grab everything indiscriminately between the two
This is indeed what you're doing. Thus any further improvement can only come from the optimization of each step. Possible ways include:
narrow down the search region (requires additional constraints/assumptions, as noted in the comments)
speed up the search operation bits, which include:
extracting raw data from the format
you already did this by disregarding the format altogether, so you have to make sure there'll never be any incorrect parsing (e.g. your search terms embedded in strings elsewhere or matching a part of a token), as noted in the comments
elementary pattern comparison operation
unlikely to be improved in pure Python, since str.index is implemented in C already and the implementation is probably already as simple as can possibly be
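The three steps above can be sketched with str.find. This is only an illustration: the function name extract_between, the sample string, and the exact whitespace in the two markers are assumptions, since the real files are not shown.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between the first start_marker and the next
    end_marker, or None if either marker is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)          # skip past the start marker itself
    end = text.find(end_marker, start)  # only search after the start marker
    if end == -1:
        return None
    return text[start:end]


# Hypothetical sample; the real layout of the data may differ.
sample = 'other: 1,\nmy_token:[\n"key_of_interest"\n],\nmore: 2'
key = extract_between(sample, 'my_token:[\n"', '"\n],')
print(key)
```

Passing start as the second argument to find avoids rescanning the part of the string that precedes the token.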
The underlying requirement shows through when you clarify:
I was thinking about getting the end index position of ' my_token:[" ' and after that getting the beginning index position of ' "], ' and getting all the text between those two index positions.
That sounds like you're trying to avoid the correct approach: use a parser for whatever language is in the string.
There is no good reason to build directly on top of string primitives for parsing, unless you are interested in writing yet another parsing framework.
So, use libraries written by people who have dealt with the issues before you.
If it's JSON, use the standard library json module; ditto if it's some other language with a parser already in the Python standard library.
If it's some other widely-implemented standard: get whichever already-existing third-party Python library knows how to parse that properly.
If it's not already implemented: write a custom parser using pyparsing or some other well-known solid library.
So to make a good choice you need to know what the data format is (this is not answered by “what are the file names”; rather, you need to know the data format of the content of those files). Then you'll be able to search for a parser library that knows about that data format.
Well, as already mentioned - a parser seems the best option.
But to answer your question without all this extra advice ... if you're just looking at speed, a parser isn't really the best method of doing this. The faster method, if you already have a string like this, would be to use a regex.
import re
matches = re.search(r'my_token:\[\s*"(.*?)"\s*\],', text)  # text is the string to search
key_of_interest = matches.group(1)
There are other issues that come up. For example, what if your key has a " inside it? Stringified JSON will automatically use an escape character there, and that will be captured by the regex too. And therefore this gets a bit too complicated.
And JSON is not regex parsable in itself (is-json-a-regular-language). So, use at your own risk. But with the appropriate restrictions and assumptions regex would be faster than a json parser.

Writing unicode symbols to files (as opposed to unicode code)

I'm new to python and unicode is starting to give me headaches.
Currently I write to file like this:
my_string = "马/馬"
f = codecs.open(local_filepath, encoding='utf-8', mode='w+')
f.write(my_string)
f.close()
And when I open the file with e.g. Gedit, I can see something like this:
\u9a6c/\u99ac\tm\u01ce
While I'd like to see exactly what I've written:
马/馬
I've tried a few different variations, like writing my_string.decode() or my_string.encode('utf-8') instead of just my_string. I know those two methods are opposites, but I was not sure which one I needed. Neither worked anyway.
If I manually write these symbols to a text file, then read the file with python, re-write what I've just read back to the same file and save, the symbols get turned into the code \u9a6c. Not sure if this is important, figured I'd just mention it to help identify the problem.
Edit: the strings came from SQL Alchemy objects' repr method, which turned out to be where the problem lay. I didn't mention it because it just didn't occur to me it could be related to the problem somehow. Thanks again for your help!
From the comments it is now clear you are using either the repr() function or calling the object.__repr__() method directly.
Don't do that. You are writing debugging information to your file:
>>> my_string = u"马/馬"
>>> print repr(my_string)
u'\u9a6c/\u99ac'
The value produced is meant to be pastable back into a Python session so you can re-produce the exact same value, and as such it is ASCII-safe (so it can be used in Python 2 source code without encoding issues).
From the repr() documentation:
For many types, this function makes an attempt to return a string that would yield an object with the same value when passed to eval(), otherwise the representation is a string enclosed in angle brackets that contains the name of the type of the object together with additional information often including the name and address of the object.
Write the Unicode objects to your file directly instead; codecs.open() handles encoding to UTF-8 correctly if you do.
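A minimal sketch of the difference, written to run on Python 3 (the code in the question is Python 2). It uses io.open rather than codecs.open, and the output path is a hypothetical file in a temporary directory. On Python 3, ascii() produces the kind of escaped text that repr() produces for unicode objects on Python 2.

```python
import io
import os
import tempfile

my_string = u"马/馬"

# Hypothetical output path inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# Write the text itself; the file object encodes it to UTF-8.
with io.open(path, mode="w", encoding="utf-8") as f:
    f.write(my_string)

# Read it back to confirm the characters survived the round trip.
with io.open(path, encoding="utf-8") as f:
    read_back = f.read()

print(read_back == my_string)

# The escaped debugging form: this is what should NOT be written to the file.
print(ascii(my_string))
```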

[Python]How to deal with a string ending with one backslash?

I'm getting some content from Twitter API, and I have a little problem, indeed I sometimes get a tweet ending with only one backslash.
More precisely, I'm using simplejson to parse Twitter stream.
How can I escape this backslash ?
From what I have read, such raw string shouldn't exist ...
Even if I add one backslash (with two in fact) I still get an error as I suspected (since I have an odd number of backslashes)
Any idea ?
I can just forget about these tweets too, but I'm still curious about that.
Thanks : )
Prefixing the string literal with r (which stands for "raw") makes Python treat backslashes inside it literally instead of as escape characters. For example:
print r'\b\n\\'
will output
\b\n\\
Have I understood the question correctly?
I guess you are looking for a method similar to stripslashes in PHP. So, here you go:
Python version of PHP's stripslashes
You can try using raw strings by prepending an r (so nothing has to be escaped) to the string or re.escape().
I'm not really sure what you need considering I haven't seen the text of the response. If none of the methods you come up with on your own or get from here work, you may have to forget about those tweets.
Unless you update your question and come back with a real problem, I'm asserting that you don't have an issue except confusion.
You get the string from the Twitter API, ergo the string does not show up in your code. “Raw strings” exist only in your code, and it is “raw strings” in code that can't end in a backslash.
Consider this:
def some_obscure_api():
    "This exists in a library, so you don't know what it does"
    return r"hello" + "\\" # addition just for fun
my_string = some_obscure_api()
print(my_string)
See? my_string happily ends in a backslash and your code couldn't care less.
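The same holds for text that arrives as JSON. A small sketch using the standard library json module (the question uses simplejson, which offers the same loads/dumps calls); the payload here is made up and is not a real Twitter response:

```python
import json

# Made-up payload: inside JSON text, one backslash is written as two.
raw = r'{"text": "ends with a backslash \\"}'

# After parsing, the value is an ordinary string ending in ONE backslash.
tweet = json.loads(raw)["text"]
print(tweet.endswith("\\"))

# Serializing it again escapes the backslash, so no manual escaping is needed.
print(json.dumps(tweet))
```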

In Python what's the best way to emulate Perl's __END__?

Am I correct in thinking that Python doesn't have a direct equivalent for Perl's __END__?
print "Perl...\n";
__END__
End of code. I can put anything I want here.
One thought that occurred to me was to use a triple-quoted string. Is there a better way to achieve this in Python?
print "Python..."
"""
End of code. I can put anything I want here.
"""
The __END__ block in perl dates from a time when programmers had to work with data from the outside world and liked to keep examples of it in the program itself.
Hard to imagine I know.
It was useful, for example, if you had a moving target like a hardware log file with mutating messages due to firmware updates, where you wanted to compare old and new versions of the line, or keep notes not strictly related to the program's operations ("Code seems slow on day x of month every month"), or, as mentioned above, a reference set of data to run the program against. Telcos are an example of an industry where this was a frequent requirement.
Lastly, Python's cult-like restrictiveness seems to have a real and tiresome effect on the mindset of its advocates. If your only response to a question is "Why would you want to do that when you could do X?" when X is not as useful, please keep quiet++.
The triple-quote form you suggested will still create a python string, whereas Perl's parser simply ignores anything after __END__. You can't write:
"""
I can put anything in here...
Anything!
"""
import os
os.system("rm -rf /")
Comments are more suitable in my opinion.
#__END__
#Whatever I write here will be ignored
#Woohoo !
What you're asking for does not exist.
Proof: http://www.mail-archive.com/python-list@python.org/msg156396.html
A simple solution is to escape any " as \" and do a normal multi line string -- see official docs: http://docs.python.org/tutorial/introduction.html#strings
(Also, atexit doesn't work: http://www.mail-archive.com/python-list@python.org/msg156364.html )
Hm, what about sys.exit(0) ? (assuming you do import sys above it, of course)
As to why it would be useful, sometimes I sit down to do a substantial rewrite of something and want to mark my "good up to this point" place.
By using sys.exit(0) in a temporary manner, I know nothing below that point will get executed, therefore if there's a problem (e.g., server error) I know it had to be above that point.
I like it slightly better than commenting out the rest of the file, just because there are more chances to make a mistake and uncomment something (stray key press at beginning of line), and also because it seems better to insert 1 line (which will later be removed), than to modify X-many lines which will then have to be un-modified later.
But yeah, this is splitting hairs; commenting works great too... assuming your editor supports easily commenting out a region, of course; if not, sys.exit(0) all the way!
I use __END__ all the time for multiples of the reasons given. I've been doing it for so long now that I put it (usually preceded by an exit('0');), along with BEGIN {} / END{} routines, in by force-of-habit. It is a shame that Python doesn't have an equivalent, but I just comment-out the lines at the bottom: extraneous, but that's about what you get with one way to rule them all languages.
Python does not have a direct equivalent to this.
Why do you want it? It doesn't sound like a really great thing to have when there are more consistent ways like putting the text at the end as comments (that's how we include arbitrary text in Python source files. Triple quoted strings are for making multi-line strings, not for non-code-related text.)
Your editor should be able to make using many lines of comments easy for you.

using an alternative string quotation syntax in python

Just wondering...
I find using escape characters too distracting. I'd rather do something like this (console code):
>>> print ^'Let's begin and end with sets of unlikely 2 chars and bingo!'^
Let's begin and end with sets of unlikely 2 chars and bingo!
Note the ' inside the string, and how this syntax would have no issue with it, or whatever else inside for basically all cases. Too bad markdown can't properly colorize it (yet), so I decided to <pre> it.
Sure, the ^ could be any other char, I'm not sure what would look/work better. That sounds good enough to me, tho.
Probably some other language already has a similar solution. And, just maybe, Python already has such a feature and I overlooked it. I hope this is the case.
But if it isn't, would it be too hard to, somehow, change Python's interpreter and be able to select an arbitrary (or even standardized) syntax for notating the strings?
I realize there are many ways to change statements and the whole syntax in general by using pre-compilators, but this is far more specific. And going any of those routes is what I call "too hard". I'm not really needing to do this so, again, I'm just wondering.
Python has this: use """ or ''' as the delimiters.
print '''Let's begin and end with sets of unlikely 2 chars and bingo'''
How often do you have both ''' and """ in the same string?
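A short sketch of both delimiter styles; the sentence is made up for illustration. Either style lets single and double quote characters appear unescaped inside the string:

```python
# Triple single quotes: both ' and " can appear unescaped inside.
single = '''Let's say "bingo!" and move on'''

# Triple double quotes work the same way.
double = """Let's say "bingo!" and move on"""

print(single == double)
```

One caveat: if the text ends with the same quote character as the delimiter, that last character still needs escaping, or the other delimiter style has to be used.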
