SHORT VERSION
I am retrieving a database value that contains a short but complete HTML structure. I want to strip away all of the HTML tags and end up with a single value. The HTML surrounding my relevant info is always the same; I just need to figure out what kind of line breaks, tabs or whitespace the string contains, so that I can make a match and remove it.
Is there a place I can paste the String online, or another way I can check the actual content of the String, so that I'll be able to remove it?
LONG VERSION, and what I've already tried:
The String is retrieved from an HP Quality Center database and printed in the console of the automated test execution, where it is interpreted and shown as two whitespaces. When pasted into Word, Eclipse or the QC script editor, it is shown as a line break.
I've tried to replace the whitespaces with \n, double whitespace and ¶. Nothing works.
I am translating this script from a working VBScript. The problematic invisible characters are defined as vbcrlf and VBCRLF there. For some reason they use lower case in the replace string before the relevant parameter value, and upper case in the string that comes after my relevant substring. They are defined as variables, and are not inside the String itself: <html>"&vbcrlf&"<body>"&vbcrlf&"<div...
This website suggests that I should use \n https://answers.yahoo.com/question/index?qid=20070506205148AAmr92N, as they write:
vbCrLf = "\n" # Carriage return/linefeed combination
I am a little confused by the inconsistency of the upper/lower case use here though...
EDIT:
After googling "carriage return/linefeed combination", I learned that it can be written as \r\n here: Order of carriage return and new line feed.
But I spent an awfully long time finding it, and it doesn't answer my question of how I can better identify exactly what kind of invisible characters a string contains. I'll leave the question open.
To view the contents of a string (including its "hidden" values) you can always do:
print( [data] )
# or
print( repr(data) )
If you're in a system like the one you described in the comments, you can also do:
with open('/var/log/debug.log', 'w') as fh:
    fh.write( str( [data] ) )
This will however just give you a general idea of what your data looks like, but if that solves your question or problem then that is great. If you need further assistance, edit your question or submit a new one :)
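For example, with a made-up HTML string like the one described in the question (the exact value will differ), repr makes the otherwise invisible CR/LF pairs visible:
data = "<html>\r\n<body>\r\n<div>42</div>\r\n</body>\r\n</html>"  # made-up example value
print(repr(data))
# '<html>\r\n<body>\r\n<div>42</div>\r\n</body>\r\n</html>'
Once you can see that the invisible characters are \r\n, removing them is just data.replace("\r\n", "").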
Related
Does such a thing exist to turn This is a path into This\ is\ a\ path?
Preferably with all the standard escape sequences such as \t, \n, etc.
It's easy enough to escape things:
print("foo\nbar".encode("unicode_escape").decode("utf-8"))
I'll admit my first response to this question was to use pathlib.Path, and anyone else looking at this should probably be doing that too. However, that doesn't work for this situation, as I found when I peered in more and asked questions. It literally has to be a tool that changes a string into its escaped version, in the style of \t, \n, \r, etc. It's nothing to do with paths, I guess. It's easy enough to just write a simple function, but I'm curious whether something already exists.
The web has an encoder, so the request for one for the \t \n style isn't that crazy, is it?
http://www.utilities-online.info/urlencode/#.XVYwRZNKg5c
test this string
and this one
becomes
test%20this%20string%0Aand%20this%20one
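For comparison, Python's standard library can already do that particular transformation, though it percent-encodes rather than producing the \t \n style I'm after:
from urllib.parse import quote

print(quote("test this string\nand this one"))
# test%20this%20string%0Aand%20this%20one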
In Python, there is a replace method on strings that allows you to replace one substring with another, as shown:
string = 'This is a path'
string = string.replace(' ', '\\ ')
print(string)  # This\ is\ a\ path
The first argument is what you want to change and the second is what you want to replace it with. There is also an optional third argument, a count, which limits how many occurrences are replaced; leave it out and every space is replaced automatically, which is what we want here.
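To illustrate the count argument:
print('This is a path'.replace(' ', '\\ ', 1))  # This\ is a path -- only the first space is replaced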
You can also use regex but I think this is the most 'pythonic' way of doing things since you don't need any other modules.
Also, I'm not sure why you would need to encode a string. You're definitely making it more difficult for yourself than it should be.
I have a JSON file filled with user comments (from web scraping) which I've pulled into Python with pandas:
import pandas as pd
data = pd.DataFrame(pd.read_json(filename, orient=columnName,encoding="utf-8"),columns=columnName)
data['full_text'] = data['full_text'].replace('^#ABC(\\u2019s)*[ ,\n]*', '', regex=True)
data['full_text'] = data['full_text'].replace('(\\u2019)', "'", regex=True)
data.to_json('new_abc_short.json',orient='records')
The messages don't completely match the respective messages online (emojis are shown as \u0234 or something, apostrophes as \u2019, and forward slashes in links and quote marks have a backslash in front of them).
I want to clean them up, so I learnt some regex (https://docs.python.org/3/howto/regex.html) so that I can pull them into Python, clean them up and then re-save them back to JSON under a different name (for now).
The second line helps to remove the Twitter handle (if it exists only at the beginning), then removes the 's if it was used (e.g. #ABC's). If there was no Twitter handle at the beginning (maybe it was used in the middle of the message), then that is kept. It then removes any spaces and commas that were left behind (again, only at the beginning of the string).
e.g. "#ABC, hi there" becomes "hi there". "hi there #ABC" stays the same. "#ABC's twitter is big" would become "twitter is big"
The third line helps replace every apostrophe that could not be shown (e.g. don\u2019t changes back to don't).
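For reference, the same pattern string can be sanity-checked outside pandas with plain re.sub on the examples above:
import re

pattern = '^#ABC(\\u2019s)*[ ,\n]*'  # the same pattern string passed to pandas above
print(re.sub(pattern, '', '#ABC, hi there'))              # hi there
print(re.sub(pattern, '', 'hi there #ABC'))               # hi there #ABC (unchanged)
print(re.sub(pattern, '', '#ABC\u2019s twitter is big'))  # twitter is big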
I have thousands of records (not all of them have issues with apostrophes, quotes, links etc.), and based on the very small set of examples I've looked at, they seem to work.
But my third replacement doesn't work:
data['full_text'] = data['full_text'].replace('\\"', '"', regex=True)
Example message in the json: "full_text":"#ABC How can you \"accidentally close\" my account"
I want to remove the \ next to the double quotes so it looks like the real message (I assume it is an escape character which the user obviously didn't type).
But no matter what I do, I can't remove it.
From my regex learning, " isn't a metacharacter, so the backslash shouldn't even be there. But anyway, I've tried:
\\" (which i think should be the obvious one, i have \", no special quirk in " but there is in \ so i need another back slash to escape that)
\\\\" (some forums posts online mention needing 4 slashes
\\\" ( i think someone mention in the forum posts that they got it workin with 3)
\\\(\") (i know that brackets provide groupings so i tried different combinations)
(\\\\")
all of the above expression i encased in single quotes, and they didn't work. I thought maybe the double quote was the problem since i only had one, so i replaced the single quotes with single quotes x3
'''\\"'''
But none of the above worked with triple single quotes either.
I keep rechecking the newly saved JSON and I keep seeing:
"full_text":"How can you \"accidentally close\" my account"
(i.e. removing #ABC with the space worked, but not the backslash bit)
Originally, I tried looking into converting these Unicode issues (i.e. using encoding="utf-8"), although my experience with this is limited and it kept failing, so regex is my best option.
Oh, I missed the pandas hint; pandas' replace does use regexes. But, to be clear, Python's built-in str.replace doesn't work with regexes; re.sub does.
Now:
to match a single backslash, the regex is: \\
the plain string literal that describes that regex is: "\\\\"
when using a raw string, a double backslash is enough: r'\\'
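A quick check of those three statements with plain re (nothing pandas-specific):
import re

s = 'back\\slash text'               # s contains exactly one literal backslash
print(re.search('\\\\', s).group())  # plain string literal: four backslashes
print(re.search(r'\\', s).group())   # raw string literal: two backslashes
# both print the single backslash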
If your string really contains a \ preceding a ", a regex that would do is:
\\(?=\")
which does a lookahead for your " (Look at regex101).
You would have to use something like:
re.sub(r'\\(?=\")',"",s,0)
or a pandas equivalent using that regex.
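A pandas version of that, assuming (as above) that the parsed strings really do contain a literal backslash rather than it merely reappearing as JSON escaping when the file is written back out, could look like:
data['full_text'] = data['full_text'].replace(r'\\(?=")', '', regex=True)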
I have the following read.json file
{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}
and the following Python script:
import re
shakes = open("read.json", "r")
needed = open("needed.txt", "w")
for text in shakes:
    if re.search('JOL":"(.+?).tr', text):
        print >> needed, text,
I want it to find what's between two words (JOL":" and .tr) and then print it. But all it does is print all the text in "read.json".
You're calling re.search, but you're not doing anything with the returned match, except to check that there is one. Instead, you're just printing out the original text. So of course you get the whole line.
The solution is simple: just store the result of re.search in a variable, so you can use it. For example:
for text in shakes:
    match = re.search('JOL":"(.+?).tr', text)
    if match:
        print >> needed, match.group(1)
In your example, the match is JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr, and the first (and only) group in it is EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD, which is (I think) what you're looking for.
However, a couple of side notes:
First, . is a special pattern in a regex, so you're actually matching anything up to any character followed by tr, not .tr. For that, escape the . with a \. (And, once you start putting backslashes into a regex, use a raw string literal.) So: r'JOL":"(.+?)\.tr'.
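To see why the unescaped dot matters, here's a contrived line with no literal ".tr" in it that the original pattern still matches:
import re

line = 'JOL":"valueAtr","LAPTOP":"error"'    # no ".tr" anywhere
print(re.search('JOL":"(.+?).tr', line))     # still matches, because . also matches the A
print(re.search(r'JOL":"(.+?)\.tr', line))   # None, as intended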
Second, this is making a lot of assumptions about the data that probably aren't warranted. What you really want here is not "everything between JOL":" and .tr", it's "the value associated with key 'JOL' in the JSON object". The only problem is that this isn't quite a JSON object, because of that prefixed :. Hopefully you know where you got the data from, and therefore what format it's actually in. For example, if you know it's actually a sequence of colon-prefixed JSON objects, the right way to parse it is:
d = json.loads(text[1:])  # requires an "import json" at the top of the script
if 'JOL' in d:
    print >> needed, d['JOL']
Finally, double-check the name of the file object you're writing to. If your real code opened 'needed.txt' but bound it to a different variable name, it's possible that you're overwriting some completely different file over and over, and then looking in needed.txt and seeing nothing changed each time…
If you know that your starting and ending matching strings only appear once, you can ignore that it's JSON. If that's OK, then you can split on the starting characters (JOL":"), take the 2nd element of the split array [1], then split again on the ending characters (.tr) and take the 1st element of the split array [0].
>>> text = '{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}'
>>> text.split('JOL":"')[1].split('.tr')[0]
'EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD'
In Python regex, how would I match against a large string of text and flag if any one of the regex values is matched? I have tried this with "|" (or) statements and I have tried making a regex list; neither worked for me. Here is an example of what I am trying to do with the or:
I think my "or" gets commented out
patterns=re.compile(r'[\btext String1\b] | [\bText String2\b]')
if(patterns.search(MyTextFile)):
    print ("YAY one of your text patterns is in this file")
The above code always says it matches regardless of whether the string appears, and if I change it around a bit I get matches on the first regex but it never checks the second. I believe this is because the "raw" prefix is commenting out my or statement, but how would I get around this?
I also tried to get around this by taking out the raw string prefix and putting double backslashes on my \b for escaping, but that didn't work either :(
patterns=re.compile(\\btext String1\\b | \\bText String2\\b)
if(patterns.search(MyTextFile)):
    print ("YAY one of your text patterns is in this file")
I then tried to do 2 separate raw strings with the or, and the interpreter complains about unsupported str operands...
patterns=re.compile(r'\btext String1\b' | r'\bText String2\b')
if(patterns.search(MyTextFile)):
    print ("YAY one of your text patterns is in this file")
patterns=re.compile(r'(\btext String1\b)|(\bText String2\b)')
You want a group (optionally capturing), not a character class. Technically, you don't need a group here:
patterns=re.compile(r'\btext String1\b|\bText String2\b')
will also work (without any capture).
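As a quick check (MyTextFile here is just a stand-in string; in your code it presumably holds the file's contents):
import re

MyTextFile = "some log output containing text String1 in the middle"
patterns = re.compile(r'\btext String1\b|\bText String2\b')
if patterns.search(MyTextFile):
    print("YAY one of your text patterns is in this file")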
The way you had it, it checked for either one of the characters between the first square brackets, or one of those between the second pair. You may find a regex tutorial helpful.
It should be clear where the "unsupported str operands" error comes from. You can't OR strings, and you have to remember the | is processed before the argument even gets to compile.
This part, [\btext String1\b], is a character class: it matches any single one of the characters listed between the brackets (where \b means a backspace character rather than a word boundary). So it matches just about any non-empty line, I think.
In an RE pattern, square brackets [ ] indicate a "character class" (depending on what's inside them, "any one of these characters" or "any character except one of these", the latter indicated by a caret ^ as the first character after the opening [). This is what you're expressing, and it has absolutely nothing to do with what you want -- just remove the brackets and you should be fine ;-).
I have the following code in Python:
linkHTML = "click here" % strLink
The problem is that when strLink has spaces in it the link shows up as
click here
I can use strLink.replace(" ","+")
But I am sure there are other characters which can cause errors. I tried using
urllib.quote(strLink)
But it doesn't seem to help.
Thanks!
Joel
Make sure you use urllib.quote_plus(string[, safe]) to replace spaces with a plus sign.
urllib.quote_plus(string[, safe])
Like quote(), but also replaces spaces by plus signs, as required for quoting HTML form values when building up a query string to go into a URL. Plus signs in the original string are escaped unless they are included in safe. It also does not have safe default to '/'.
from http://docs.python.org/library/urllib.html#urllib.quote_plus
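For example (shown here with the Python 3 location of the function, urllib.parse; in Python 2 it is urllib.quote_plus as quoted above):
from urllib.parse import quote_plus

print(quote_plus("with space & other"))  # with+space+%26+other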
Ideally you'd be using the urllib.urlencode function and passing it a sequence of key/value pairs such as [("q", "with space"), ("s", "with space & other")], etc.
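Something along these lines (again with the Python 3 module path):
from urllib.parse import urlencode

print(urlencode([("q", "with space"), ("s", "with space & other")]))
# q=with+space&s=with+space+%26+other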
As well as quote_plus(*), you also need to HTML-encode any text you output to HTML. Otherwise < and & symbols will be treated as markup, with potential security consequences. (OK, you're not going to get < in a URL, but you definitely are going to get &, so it only takes one parameter name that matches an HTML entity name and your string's messed up.)
html = '<a href="?q=%s">click here</a>' % cgi.escape(urllib.quote_plus(q))
*: actually plain old quote is fine too; I don't know what wasn't working for you, but it is a perfectly good way of URL-encoding strings. It converts spaces to %20 which is also valid, and valid in path parts too. quote_plus is optimal for generating query strings, but otherwise, when in doubt, quote is safest.
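To illustrate the difference (again via urllib.parse in Python 3):
from urllib.parse import quote, quote_plus

print(quote("with space"))       # with%20space
print(quote_plus("with space"))  # with+space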