I have the following code in Python:
linkHTML = "click here" % strLink
The problem is that when strLink has spaces in it the link shows up as
click here
I can use strLink.replace(" ","+")
But I am sure there are other characters which can cause errors. I tried using
urllib.quote(strLink)
But it doesn't seem to help.
Thanks!
Joel
Use urllib.quote_plus(string[, safe]), which replaces spaces with plus signs.
urllib.quote_plus(string[, safe])
Like quote(), but also replaces spaces by plus signs, as required for quoting HTML form values when building up a query string to go into a URL. Plus signs in the original string are escaped unless they are included in safe. It also does not have safe default to '/'.
from http://docs.python.org/library/urllib.html#urllib.quote_plus
Ideally you'd be using the urllib.urlencode function and passing it a sequence of key/value pairs like [("q", "with space"), ("s", "with space & other")] etc.
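A minimal sketch of the urlencode approach, assuming Python 3, where the function lives in urllib.parse (in Python 2 it is urllib.urlencode, as above). The parameter names q and s are only the illustrative ones from this answer.

```python
from urllib.parse import urlencode

# Illustrative key/value pairs; a list of tuples keeps the parameter order fixed.
params = [("q", "with space"), ("s", "with space & other")]

# urlencode quotes every key and value with quote_plus and joins the pairs with "&".
query_string = urlencode(params)
print(query_string)
```

The literal & inside the second value is percent-encoded, so it cannot be confused with the & that separates the parameters.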
As well as quote_plus(*), you also need to HTML-encode any text you output to HTML. Otherwise < and & symbols will be treated as markup, with potential security consequences. (OK, you're not going to get < in a URL, but you definitely are going to get &, so it takes just one parameter name that matches an HTML entity name and your string's messed up.)
html= 'click here' % cgi.escape(urllib.quote_plus(q))
*: actually plain old quote is fine too; I don't know what wasn't working for you, but it is a perfectly good way of URL-encoding strings. It converts spaces to %20 which is also valid, and valid in path parts too. quote_plus is optimal for generating query strings, but otherwise, when in doubt, quote is safest.
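A sketch of both steps together, assuming Python 3: quote and quote_plus live in urllib.parse there, and html.escape stands in for cgi.escape, which was removed in Python 3.8. The anchor markup, the /search path and the lang parameter are assumptions made for illustration, since the original link template is not shown.

```python
from html import escape
from urllib.parse import quote, quote_plus


def make_link(q, base="/search?lang=en&q="):
    """Build an anchor tag; `base` is a hypothetical URL prefix."""
    # Step 1: URL-encode the value so spaces, & and friends are safe in the query string.
    href = base + quote_plus(q)
    # Step 2: HTML-encode the whole URL so the & between parameters becomes &amp;.
    return '<a href="%s">click here</a>' % escape(href, quote=True)


print(quote("with space"))       # spaces become %20
print(quote_plus("with space"))  # spaces become +
print(make_link("tom & jerry"))
```

quote_plus handles the value itself, while escape handles the markup around it. Skipping either step leaves a different class of character unprotected.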
Related
I know raw string, like r'hello world', prevents escaping.
Is it a good practice to always prepend the r symbol even if the string doesn't have any escaping sequences?
Say my exception needs some string literal explanation, I need to connect to a website whose url is a string literal. They don't have backslash. Are there any performance differences between raw string and regular string?
The r sigil means "backslashes in this string are literal backslashes". Putting this sigil on a string which doesn't contain any backslashes is harmless but sometimes mildly confusing to a human reader. A better approach is probably to only use this sigil when you actually need it.
Situations where the string may not contain backslashes at the moment, but where you might expect to add one in the future, such as in regular expressions and Windows file paths, would probably qualify as useful exceptions.
re.findall(r'hello', string) # what if we switch to r'hello\.'?
with open(r'file.txt') as handle: # what if we switch to r'sub\file.txt'?
It would be easy to forget to add the r when you add a backslash, so supplying it in advance has some merit here.
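A small sketch of what the prefix actually changes. The r only affects how the literal is read from the source code, so a raw and a regular literal without backslashes produce the same str object contents and behave identically at run time.

```python
# Without backslashes, the raw and regular literals are the same string.
assert r'hello' == 'hello'

# With a backslash, they differ: '\n' is one newline character,
# while r'\n' is a backslash followed by the letter n.
assert len('\n') == 1
assert len(r'\n') == 2

# The Windows path case: '\f' is a form feed escape in a regular literal,
# so forgetting the r silently changes the path.
regular_path = 'sub\file.txt'
raw_path = r'sub\file.txt'
print(repr(regular_path))
print(repr(raw_path))
```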
You can do that in Python, but I don't recommend it: if you later add an escape such as '\n', the raw string keeps it as a literal backslash followed by n instead of a newline. Reserve the r prefix for regular expressions and paths on Windows.
i have a json file filled with user comments (from web scraping) which I've pulled into python with pandas
import pandas as pd
data = pd.DataFrame(pd.read_json(filename, orient=columnName,encoding="utf-8"),columns=columnName)
data['full_text'] = data['full_text'].replace('^#ABC(\\u2019s)*[ ,\n]*', '', regex=True)
data['full_text'] = data['full_text'].replace('(\\u2019)', "'", regex=True)
data.to_json('new_abc_short.json',orient='records')
The messages don't completely match the respective messages online (emojis shown as \u0234 or something, apostrophes as \u2019, and forward slashes in links and quote marks have a backslash in front).
i want to clean them up so i learnt some regex, so i can pull into python, clean them up and then resave them back to json in a different name (for now) (https://docs.python.org/3/howto/regex.html)
second line helps to remove the twitter handle (if it exists in only in the beginning), then removes 's if it was used (e.g. #ABC's ). If there was no twitter handle at the beginning (maybe used in the middle of the message) then that is kept. then it removes any spaces and commas that were left behind (again only at the beginning of the string)
e.g. "#ABC, hi there" becomes "hi there". "hi there #ABC" stays the same. "#ABC's twitter is big" would become "twitter is big"
third line helps replace every apostrophe that could not be shown (e.g. don\u2019t changes back to don't)
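As a quick sanity check outside pandas, the same two patterns can be run through the standard re module. This is only a sketch: the sample messages are the ones from the examples above, with the apostrophe written as \u2019, and the helper name clean is made up for illustration.

```python
import re

# Handle at the very start, optional ’s, then any spaces, commas or newlines.
HANDLE_PREFIX = re.compile('^#ABC(\u2019s)*[ ,\n]*')
# The typographic apostrophe that shows up as \u2019 in the JSON file.
CURLY_APOSTROPHE = re.compile('\u2019')


def clean(message):
    """Apply the two replacements in the same order as the pandas code."""
    message = HANDLE_PREFIX.sub('', message)
    return CURLY_APOSTROPHE.sub("'", message)


for sample in ["#ABC, hi there", "hi there #ABC", "#ABC\u2019s twitter is big", "don\u2019t"]:
    print(repr(clean(sample)))
```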
i have thousands of records (not all of them have issues with apostrophes, quotes, links etc), and based on the very small examples i've looked at, they seem to work
but my third one doesn't work:
data['full_text'] = data['full_text'].replace('\\"', '"', regex=True)
Example message in the json: "full_text":"#ABC How can you \"accidentally close\" my account"
i want to remove the \ next to the double quotes so it looks like the real message (i assume it is a escape character which the user obviously didn't type)
but no matter what i do, i can't remove it
from my regex learning, " isn't a metacharacter, so a backslash shouldn't even be needed there. But anyway, I've tried:
\\" (which i think should be the obvious one, i have \", no special quirk in " but there is in \ so i need another back slash to escape that)
\\\\" (some forum posts online mention needing 4 slashes)
\\\" (i think someone mentioned in the forum posts that they got it working with 3)
\\\(\") (i know that brackets provide groupings so i tried different combinations)
(\\\\")
all of the above expressions i encased in single quotes, and they didn't work. I thought maybe the double quote was the problem since i only had one, so i replaced the single quotes with triple single quotes
'''\\"'''
but none of the above worked for triple single quotes either
I keep rechecking the newly saved json and i keep seeing:
"full_text":"How can you \"accidentally close\" my account"
(i.e. removing #ABC with space worked, but not the back slash bit)
originally, i tried looking into converting these unicode issues (i.e. using encoding="utf-8"), although my experience in this is limited and it kept failing, so regex is my best option
Ow, I missed the pandas hint, so pandas replace does use regexes. But, to be clear, str.replace doesn't work with regexes. re.sub does.
Now
to match a single backslash, your regex is: "\\"
string to describe that regex: "\\\\"
when using a raw string, a double backslash is enough: r'\\'
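The three levels above can be checked directly in the interpreter. A short sketch:

```python
import re

# The regex that matches one backslash consists of two characters: \\
regex_from_plain_literal = "\\\\"   # four typed backslashes in a regular literal
regex_from_raw_literal = r"\\"      # two typed backslashes in a raw literal

# Both literals produce the same two-character pattern.
assert regex_from_plain_literal == regex_from_raw_literal
assert len(regex_from_raw_literal) == 2

# A Python string holding exactly one backslash.
one_backslash = "\\"
assert len(one_backslash) == 1

# The pattern matches that single backslash.
print(bool(re.fullmatch(regex_from_raw_literal, one_backslash)))
```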
If your string really contains a \ preceding a ", a regex that would do is:
\\(?=\")
which does a lookahead for your " (Look at regex101).
You would have to use something like:
re.sub(r'\\(?=\")',"",s,0)
or a pandas equivalent using that regex.
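One thing worth checking first, offered as an assumption about this data rather than something confirmed in the question: inside a JSON file, \" is simply how a double quote is written within a string, so the backslash may exist only in the file and not in the loaded value. The sketch below uses only the standard library and shows both cases. The sample record is the one from the question.

```python
import json
import re

# The line as it appears in the JSON file (raw literal, so the backslashes are real).
raw_json = r'{"full_text": "#ABC How can you \"accidentally close\" my account"}'

# After parsing, the JSON escape is gone: the value holds plain double quotes.
text = json.loads(raw_json)["full_text"]
print("\\" in text)

# If a string really does contain a backslash before each quote,
# the lookahead regex from above removes it.
literal = 'How can you \\"accidentally close\\" my account'
cleaned = re.sub(r'\\(?=")', "", literal)
print(cleaned)
```

If the first check prints False, there is nothing to remove in the data itself. The backslashes reappear in the saved file only because the JSON writer has to escape the quotes again.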
I am having the same issue as How to pass variables with spaces through URL in :Django. I have tried the solutions mentioned but everything is returning as "The resource you are looking for has been removed, had its name changed, or is temporarily unavailable."
I am trying to pass a file name example : new 3
in urls.py:
url(r'^file_up/delete_file/(?P<oname>[0-9A-Za-z\ ]+)/$', 'app.views.delete_file' , name='delete_file'),
in views.py:
def delete_file(request,fname):
    return render_to_response(
        'app/submission_error.html',
        {'fname':fname,
        },
        context_instance=RequestContext(request)
    )
url : demo.net/file_up/delete_file/new%25203/
Thanks for the help
Thinking this over; are you stuck with having to use spaces? If not, I think you may find your patterns (and variables) easier to work with. A dash or underscore, or even a forward slash will look cleaner, and more predictable.
I also found this: https://stackoverflow.com/a/497972/352452 which cites:
The space character is unsafe because significant spaces may disappear and insignificant spaces may be introduced when URLs are transcribed or typeset or subjected to the treatment of word-processing programs.
You may also be able to capture your space with a literal %20. Not sure. Just leaving some thoughts here that come to mind.
demo.net/file_up/delete_file/new%25203/
This URL is double-encoded. The space is first encoded to %20, then the % character is encoded to %25. Django only decodes the URL once, so the decoded url is /file_up/delete_file/new%203/. Your pattern does not match the literal %20.
If you want to stick to spaces instead of a different delimiter, you should find the source of that URL and make sure it is only encoded once: demo.net/file_up/delete_file/new%203/.
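The double encoding can be reproduced with the standard library. A sketch assuming Python 3's urllib.parse, with the character class copied from the urls.py pattern above:

```python
import re
from urllib.parse import quote, unquote

double_encoded = "new%25203"

# One round of decoding, which is all Django applies to the path.
decoded_once = unquote(double_encoded)
# A second round would be needed to get back the real file name.
decoded_twice = unquote(decoded_once)
print(repr(decoded_once), repr(decoded_twice))

# The character class from the URL pattern: letters, digits and spaces only.
name_pattern = re.compile(r'^[0-9A-Za-z\ ]+$')
print(bool(name_pattern.match(decoded_once)))   # contains '%', so no match
print(bool(name_pattern.match(decoded_twice)))  # plain 'new 3' matches

# Encoding the name exactly once gives the URL segment that works.
print(quote("new 3"))
```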
SHORT VERSION
I am retrieving a database value, which contains a short, but full HTML structure. I want to strip away all of the HTML-tags, and just end up with a single value. The HTML surrounding my relevant info, is always the same, and I just need to figure out what kind of line breaks, tabs or whitespaces the string contains, so that I can make a match, and remove it.
Is there a place I can paste the String online, or another way I can check the actual content of the String, so that I'll be able to remove it?
LONG VERSION, and what I've already tried:
The String is retrieved from an HP Quality Center database. When printed in the console of the automated test execution, the invisible characters show up as two whitespaces. When pasted into Word, Eclipse or the QC script editor, they are shown as a line break.
I've tried to replace the whitespaces with \n, double whitespace and ¶. Nothing works.
I am translating this script from a working VBScript. The problematic invisible characters are defined as vbcrlf and VBCRLF there. For some reason they use lower case in the replace String before the relevant parameter value, and upper case in the string that comes after my relevant substring. They are defined as variables, and are not inside the String itself: <html>"&vbcrlf&"<body>"&vbcrlf&"<div...
This website suggests that I should use \n https://answers.yahoo.com/question/index?qid=20070506205148AAmr92N, as they write:
vbCrLf = "\n" # Carriage returnlinefeed combination
I am a little confused by the inconsistency of the upper/lower case use here though...
EDIT:
After googling Carriage returnlinefeed combination, I learned that it can be written as \r\n here: Order of carriage return and new line feed.
But I spent an awfully long time finding it, and it doesn't answer my question of how I can better identify exactly what kind of invisible characters a string contains. I'll leave the question open.
To view the contents of a string (including its "hidden" values) you can always do:
print( [data] )
# or
print( repr(data) )
If you're in a system which you described in the comments you can also do:
with open('/var/log/debug.log', 'w') as fh:
    fh.write( str( [data] ) )
This will however just give you a general idea of what your data looks like, but if that solves your question or problem then that is great. If you need further assistance, edit your question or submit a new one :)
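For instance, with a made-up sample string that contains the carriage return and line feed pair discussed in the question, repr makes the invisible characters visible, and listing the code points removes any remaining doubt:

```python
# Hypothetical sample; the real value would come from the database.
data = "<html>\r\n<body>"

# repr shows escape sequences instead of rendering the characters.
print(repr(data))

# Listing the code points of the whitespace characters identifies them exactly.
print([hex(ord(ch)) for ch in data if ch.isspace()])

# Once the characters are known, they can be matched and removed.
cleaned = data.replace("\r\n", "")
print(cleaned)
```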
I am using a regex to replace quotes within an input string. My data contains two 'types' of quotes -
" and “
There's a very subtle difference between the two. Currently, I am explicitly mentioning both these types in my regex
\"*\“*
I am afraid though that in future data I may get a different 'type' of quote on which my regex may fail. How many different types of quotes exist? Is there a way to normalize these to just one type so that my regex won't break for unseen data?
Edit -
My input data consists of HTML files and I am escaping HTML entities and URLs to ASCII
escaped_line = HTMLParser.HTMLParser().unescape(urllib.unquote(line.decode('ascii','ignore')))
where line specifies each line in the HTML file. I need to 'ignore' the ASCII as all files in my database don't have the same encoding and I don't know the encoding prior to reading the file.
Edit2
I am unable to do so using replace function. I tried replace('"','') but it doesn't replace the other type of quote '“'. If I add it in another replace function it throws me NON-ASCII character error.
Condition
No external libraries allowed, only native python libraries could be used.
I don't think there is a "quotation marks" character class in Python's regex implementation so you'll have to do the matching yourself.
You could keep a list of common quotation mark unicode characters (here's a list for a good start) and build the part of regex that matches quotation marks programmatically.
I can only help you with the original question about quotation marks. As it turns out, Unicode defines many properties per character and these are all available through the Unicode Character Database. "Quotation mark" is one of these properties.
How many different types of quotes exist?
29, according to Unicode, see below.
The Unicode standard brings us a definitive text file on Unicode properties, PropList.txt, among which a list of quotation marks. Since Python does not support all Unicode properties in regular expressions, you cannot currently use \p{QuotationMark}. However, it's trivial to create a regular expression character class:
// placed on multiple lines for readability, remove spaces
// and then place in your regex in place of the current quotes
[\u0022 \u0027 \u00AB \u00BB
\u2018 \u2019 \u201A \u201B
\u201C \u201D \u201E \u201F
\u2039 \u203A \u300C \u300D
\u300E \u300F \u301D \u301E
\u301F \uFE41 \uFE42 \uFE43
\uFE44 \uFF02 \uFF07 \uFF62
\uFF63]
As "tchrist" pointed out above, you can save yourself the trouble by using Matthew Barnett's regex library which supports \p{QuotationMark}.
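A sketch of building that character class programmatically with only the standard library, assuming Python 3 (where str is Unicode). The code points are copied from the list above. The names QUOTE_CODEPOINTS and normalize_quotes are made up for illustration.

```python
import re

# Code points copied from the quotation mark list above.
QUOTE_CODEPOINTS = [
    0x0022, 0x0027, 0x00AB, 0x00BB,
    0x2018, 0x2019, 0x201A, 0x201B,
    0x201C, 0x201D, 0x201E, 0x201F,
    0x2039, 0x203A, 0x300C, 0x300D,
    0x300E, 0x300F, 0x301D, 0x301E,
    0x301F, 0xFE41, 0xFE42, 0xFE43,
    0xFE44, 0xFF02, 0xFF07, 0xFF62,
    0xFF63,
]

# Build "[...]" from the characters; re.escape guards against regex metacharacters.
QUOTE_RE = re.compile(
    "[" + "".join(re.escape(chr(cp)) for cp in QUOTE_CODEPOINTS) + "]"
)


def normalize_quotes(text, replacement='"'):
    """Replace every listed quotation mark with a single replacement string."""
    return QUOTE_RE.sub(replacement, text)


print(normalize_quotes("\u201chi\u201d \u00abx\u00bb"))
```

Passing an empty replacement strips the quotes instead of normalizing them. Note that the list includes the ASCII apostrophe, so contractions such as it's are affected as well.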
Turns out there's a much easier way to do this. Just prefix the regex literal you write in Python with 'u' (in Python 2 the combined raw/unicode prefix is spelled ur, not ru).
regexp = ur'\"*\“*'
Make sure you use the re.UNICODE flag when you want to compile/search/match your regex to your string.
re.findall(regexp, string, re.UNICODE)
Don't forget to include the
#!/usr/bin/python
# -*- coding:utf-8 -*-
at the start of the source file to make sure unicode strings can be written in your source file.
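For comparison, a sketch of the same idea under Python 3, where every str is already Unicode, str patterns match Unicode by default, and source files are read as UTF-8 unless declared otherwise. Neither the u prefix, the re.UNICODE flag nor the coding line is needed there. The left double quotation mark is written as the escape \u201c only to keep the example independent of editor encoding.

```python
import re

# Character class holding the straight quote and the left curly quote.
pattern = re.compile('["\u201c]')

sample = 'say "hi" and \u201cbye\u201d'

# Only the two listed characters are removed; the right curly quote stays,
# which is exactly the "unseen quote type" problem raised in the question.
stripped = pattern.sub('', sample)
print(stripped)
```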