How can I code away html and special characters in python?

How can I code away html and special characters in python? - python

1543159687.4969957::I think I\u2019ve gotten far enough into my experiment to give an update: Last year, Child of Humanity was free
for Blac\u2026 https://t.co/M3HR5fAoFZ"
This is the result that I am getting. I'd like to create a regex to replace special elements like \u2019 and \u2026 with a space. They always start with "\u" and continue for four more characters.
I'd also like to get rid of the html. It always starts with "https://t.co/" and continues for 10 characters.
I've tried the code below but it is clearly wrong.
tweet = re.sub("#[\\u].{4}", "", tweet)

Those \u characters are simply unicode characters, there's nothing you need to do since they will be automatically converted when you try to print mystring
As to the final url, you can do:
removed = re.sub(r'http\S*$', '', mystring) # remove the final http string.
>>> removed
'1543159687.4969957::I think I’ve gotten far enough into my experiment to give an update: Last year, Child of Humanity was free for Blac… '

Related

python regex - characters between certain characters

Edit: I should add, that the string in the test is supposed to contain every char there possible is (i.e. * + $ § € / etc.). So i thought of regexp should help best.
i am using regex to find all characters between certain characters([" and "]. My example goes like this:
test = """["this is a text and its supposed to contain every possible char."],
["another one after a newline."],
["and another one even with
newlines
in it."]"""
The supposed output should be like this:
['this is a text and its supposed to contain every possible char.', 'another one after a newline.', 'and another one even with newlines in it.']
My code including the regex looks like this:
import re
my_list = re.findall(r'(?<=\[").*(?="\])*[^ ,\n]', test)
print (my_list)
And my outcome is the following:
['this is a text and its supposed to contain every possible char."]', 'another one after a newline."]', 'and another one even with']
so there are two problems:
1) its not removing "] at the end of a text as i want it to do with (?="\])
2) its not capturing the third text in brackets, guess because of the newlines. But so far i wasnt able to capture those when i try .*\n it gives me back an empty string.
I am thankful for any help or hints with this issue. Thank you in advance.
Btw iam using python 3.6 on anaconda-spyder and the newest regex (2018).
EDIT 2: One Alteration to the test:
test = """[
"this is a text and its supposed to contain every possible char."
],
[
"another one after a newline."
],
[
"and another one even with
newlines
in it."
]"""
Once again i have trouble to remove the newlines from it, guess the whitespaces could be removed with \s, so an regexp like this could solve it, i thought.
my_list = re.findall(r'(?<=\[\S\s\")[\w\W]*(?=\"\S\s\])', test)
print (my_list)
But that returns only an empty list. How to get the supposed output above from that input?

In case you might also accept not regex solution, you can try
result = []
for l in eval(' '.join(test.split())):
result.extend(l)
print(result)
# ['this is a text and its supposed to contain every possible char.', 'another one after a newline.', 'and another one even with newlines in it.']

You can try this mate.
(?<=\[\")[\w\s.]+(?=\"\])
Demo
What you missed in your regex .* will not match newline.
P.S I am not matching special characters. if you want it can be achieved very easily.
This one matches special characters too
(?<=\[\")[\w\W]+?(?=\"\])
Demo 2

So here's what I came up:
test = """["this is a text and its supposed to contain every possible char."],
["another one after a newline."],
["and another one even with
newlines
in it."]"""
for i in test.replace('\n', '').replace(' ', ' ').split(','):
print(i.lstrip(r' ["').rstrip(r'"]'))
Which results in the following being printed to the screen
this is a text and its supposed to contain every possible char.
another one after a newline.
and another one even with newlines in it.
If you want a list of those -exact- strings, we could modify it to-
newList = []
for i in test.replace('\n', '').replace(' ', ' ').split(','):
newList.append(i.lstrip(r' ["').rstrip(r'"]'))

Python3 regex not changing \" to "

i have a json file filled with user comments (from web scraping) which I've pulled into python with pandas
import pandas as pd
data = pd.DataFrame(pd.read_json(filename, orient=columnName,encoding="utf-8"),columns=columnName)
data['full_text'] = data['full_text'].replace('^#ABC(\\u2019s)*[ ,\n]*', '', regex=True)
data['full_text'] = data['full_text'].replace('(\\u2019)', "'", regex=True)
data.to_json('new_abc_short.json',orient='records')
The messages don't completely match the respective messages online. (emojis shown as \u0234 or something, apostrophes as \u2019, forward slash in links, and quote marks have back slash.
i want to clean them up so i learnt some regex, so i can pull into python, clean them up and then resave them back to json in a different name (for now) (https://docs.python.org/3/howto/regex.html)
second line helps to remove the twitter handle (if it exists in only in the beginning), then removes 's if it was used (e.g. #ABC's ). If there was no twitter handle at the beginning (maybe used in the middle of the message) then that is kept. then it removes any spaces and commas that were left behind (again only at the beginning of the string)
e.g. "#ABC, hi there" becomes "hi there". "hi there #ABC" stays the same. "#ABC's twitter is big" would become "twitter is big"
third line helps replace every apostrophe that could not be shown (e.g. don\u2019t changes back to don't)
i have thousands of records (not all of them have issues with apostrophes, quotes, links etc), and based on the very small examples i've looked at, they seem to work
but my third one doesn't work:
data['full_text'] = data['full_text'].replace('\\"', '"', regex=True)
Example message in the json: "full_text":"#ABC How can you \"accidentally close\" my account"
i want to remove the \ next to the double quotes so it looks like the real message (i assume it is a escape character which the user obviously didn't type)
but no matter what i do, i can't remove it
from my regex learning, " is't a metacharacter. so backslash shouldn't even be there. But anyway, I've tried:
\\" (which i think should be the obvious one, i have \", no special quirk in " but there is in \ so i need another back slash to escape that)
\\\\" (some forums posts online mention needing 4 slashes
\\\" ( i think someone mention in the forum posts that they got it workin with 3)
\\\(\") (i know that brackets provide groupings so i tried different combinations)
(\\\\")
all of the above expression i encased in single quotes, and they didn't work. I thought maybe the double quote was the problem since i only had one, so i replaced the single quotes with single quotes x3
'''\\"'''
but none of the above worked for triple single quotes either
I keep rechecking the newly saved json and i keep seeing:
"full_text":"How can you \"accidentally close\" my account"
(i.e. removing #ABC with space worked, but not the back slash bit)
originally, i tried looking into converting these unicode issues i.e. using encoding="utf-8") although my experience in this is limited and it kept failing, so regex is my best option

Ow, I missed the pandas hint, so pandas replace does use regexes. But, to be clear, str.replace doesn't work with regexes. re.sub does.
Now
to match a single backslash, your regex is: "\\"
string to describe that regex: "\\\\"
when using a raw string, a double backslash is enough: r'\\'
If your string really contains a \ preceding a ", a regex that would do is:
\\(?=\")
which does a lookahead for your " (Look at regex101).
You would have to use something like:
re.sub(r'\\(?=\")',"",s,0)
or a pandas equivalent using that regex.

How can I identify invisible characters in python strings?

SHORT VERSION
I am retrieving a database value, which contains a short, but full HTML structure. I want to strip away all of the HTML-tags, and just end up with a single value. The HTML surrounding my relevant info, is always the same, and I just need to figure out what kind of line breaks, tabs or whitespaces the string contains, so that I can make a match, and remove it.
Is there a place I can paste the String online, or another way I can check the actual content of the String, so that I'll be able to remove it?
LONG VERSION, and what I've already tried:
The String is retrieved from a HP Quality Center database, and printed in the console of the automated test execution, the string is interpreted to show as two whitespaces. When pasted into word, eclipse or the QC script editor, it is shown as a linebreak.
I've tried to replace the whitespaces with \n, double whitespace and ¶. Nothing works.
I am translatnig this script from a working VBScript. The problematic invisible characters are defined as vbcrlf and VBCRLF there. For some reason they use lower case in the replace String before the relevant parameter value, and upper case in the string that comes after my relevant substring. They are defined as variables, and are not inside the String itself: <html>"&vbcrlf&"<body>"&vbcrlf&"<div...
This website suggests that I should use \n https://answers.yahoo.com/question/index?qid=20070506205148AAmr92N, as they write:
vbCrLf = "\n" # Carriage returnlinefeed combination
I am a little confused by the inconsitency of the upper/lower case use here though...
EDIT:
After googling Carriage returnlinefeed combination, i learned that it can be defined as /r/n here: Order of carriage return and new line feed.
But I spent an awful long time finding it, and it doesn't answer my question, of how I better can identify exactly what kind of invisible characters a string contains. I'll leave the question open.

To view the contents of a string (including it's "hidden" values) you can always do:
print( [data] )
# or
print( repr(data) )
If you're in a system which you described in the comments you can also do:
with open('/var/log/debug.log', 'w') as fh:
fh.write( str( [data] ) )
This will however just give you a general idea of what your data looks like, but if that solves your question or problem then that is great. If you need further assistance, edit your question or submit a new one :)

Search a delimited string in a file - Python

I have the following read.json file
{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}
and python script :
import re
shakes = open("read.json", "r")
needed = open("needed.txt", "w")
for text in shakes:
if re.search('JOL":"(.+?).tr', text):
print >> needed, text,
I want it to find what's between two words (JOL":" and .tr) and then print it. But all it does is printing all the text set in "read.json".

You're calling re.search, but you're not doing anything with the returned match, except to check that there is one. Instead, you're just printing out the original text. So of course you get the whole line.
The solution is simple: just store the result of re.search in a variable, so you can use it. For example:
for text in shakes:
match = re.search('JOL":"(.+?).tr', text)
if match:
print >> needed, match.group(1)
In your example, the match is JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr, and the first (and only) group in it is EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD, which is (I think) what you're looking for.
However, a couple of side notes:
First, . is a special pattern in a regex, so you're actually matching anything up to any character followed by tr, not .tr. For that, escape the . with a \. (And, once you start putting backslashes into a regex, use a raw string literal.) So: r'JOL":"(.+?)\.tr'.
Second, this is making a lot of assumptions about the data that probably aren't warranted. What you really want here is not "everything between JOL":" and .tr", it's "the value associated with key 'JOL' in the JSON object". The only problem is that this isn't quite a JSON object, because of that prefixed :. Hopefully you know where you got the data from, and therefore what format it's actually in. For example, if you know it's actually a sequence of colon-prefixed JSON objects, the right way to parse it is:
d = json.loads(text[1:])
if 'JOL' in d:
print >> needed, d['JOL']
Finally, you don't actually have anything named needed in your code; you opened a file named 'needed.txt', but you called the file object love. If your real code has a similar bug, it's possible that you're overwriting some completely different file over and over, and then looking in needed.txt and seeing nothing changed each time…

If you know that your starting and ending matching strings only appear once, you can ignore that it's JSON. If that's OK, then you can split on the starting characters (JOL":"), take the 2nd element of the split array [1], then split again on the ending characters (.tr) and take the 1st element of the split array [0].
>>> text = '{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}'
>>> text.split('JOL":"')[1].split('.tr')[0]
'EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD'

Extra characters Extracted with XPath and Python (html)

I have been using XPath with scrapy to extract text from html tags online, but when I do I get extra characters attached. An example is trying to extract a number, like "204" from a <td> tag and getting [u'204']. In some cases its much worse. For instance trying to extract "1 - Mathoverflow" and instead getting [u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ']. Is there a way to prevent this, or trim the strings so that the extra characters arent a part of the string? (using items to store the data). It looks like it has something to do with formatting, so how do I get xpath to not pick up that stuff?

What does the line of code look like that returns [u'204']? It looks like what is being returned is a Python list containing a unicode string with the value you want. Nothing wront there--just subscript. As for the carriage returns, linefeeds and tabs, as Wai Yip Tung just answered, strip will take them out.
Probably
my_answer = item1['Title'][0].strip()
Or if you are expecting several matches
for ans_i in item1['Title']:
do_something_with( ans_i.strip() )

The standard XPath function normalize-space() has exactly the wanted effect.
It deletes the leading and trailing wite space and replaces any inner whitespace with just one space.
So, you could use:
normalize-space(someExpression)

Use strip() to remove the leading and trailing white spaces.
>>> u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t '.strip()
u'1 \u2013 MathOverflow'

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.