Python save text to variable without interpreting it - python

I am fetching a text block and saving it to a variable.
Then I split the text block on blank spaces, look for a hotword, and save the word next to the hotword in a new variable.
The word I am trying to save is a math function in MATLAB notation.
Python always interprets the brackets and slashes in the text block before I can even process it.
Text block example:
"This is a text with the hotword function x**2(3*x)+3*x"
The text should be split on blank spaces and saved to an array, but Python always messes up the operators (, ), /, - and +.
How can I escape a text without knowing what will come?
This line raises the error (Twitter API):
textVar = tweet['text']

Your example works fine for me on both Python 2.7 and Python 3.5:
>>> line = "This is a text with the hotword function x**2(3*x)+3*x"
>>> line.split()
['This', 'is', 'a', 'text', 'with', 'the', 'hotword', 'function', 'x**2(3*x)+3*x']

Try using a raw string:
var1 = r"This is a text with the hotword function x**2(3*x)+3*x"
or for an example that has slashes in it:
var2 = r"This is text with \slashes \n and \t escape sequences that aren't interpreted as such"
Evaluate var2 in the interactive interpreter and you'll see that Python displays each backslash doubled: the backslashes are stored as literal characters instead of starting escape sequences.
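A minimal sketch of the workflow from the question, assuming the hotword is the literal word "function" and using a hypothetical dict in place of the real Twitter API response. Text that arrives as data at runtime is stored as-is; escape processing only applies to string literals typed in source code, so the operators survive the split untouched.

```python
# Hypothetical stand-in for the Twitter API response used in the question.
tweet = {"text": "This is a text with the hotword function x**2(3*x)+3*x"}

text_var = tweet["text"]
words = text_var.split()  # split on any run of whitespace

hotword = "function"  # assumption: the word that precedes the expression
expression = None
if hotword in words:
    index = words.index(hotword)
    if index + 1 < len(words):
        expression = words[index + 1]  # the word right after the hotword

print(expression)
```

If the real tweet text still fails, the error most likely comes from somewhere other than the assignment itself, for example from printing to a console with a limited encoding.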

Related

Does Python's split function split by a newline or a whitespace by default

I am learning Python. In particular, I read about the string split method and learned that the default separator for split is whitespace. So I understand how the following works:
text = 'Hello World'
print(text.split()) #this should print ['Hello', 'World'] which it does
The output of the above program in Python3.6 is ['Hello', 'World'] as expected because in the above string variable text we have whitespace.
But then I tried out the following example:
text = 'Hello\nWorld'
print(text.split()) #this should print ['Hello\nWorld'] but it doesn't
The actual output of the above is:
['Hello', 'World'] #this shouldn't happen because there is no whitespace in text
While the expected output of the above is:
['Hello\nWorld'] #this should happen because there is no whitespace in text
As you can see, since there is no whitespace between 'Hello' and 'World', the output should be ['Hello\nWorld'], because \n is not whitespace and whitespace is the default separator for the split method.
What is happening here?
The documentation for str.split says:
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator and the result will contain no empty strings at the start
or end if the string has leading or trailing whitespace.
Tabs (\t), newlines (\n), spaces, and so on all count as whitespace characters, since they all serve the same purpose: spacing things out.
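A short sketch of the difference between the default split and splitting on an explicit space character. The sample string is made up for illustration.

```python
text = "Hello\nWorld\tagain  twice"

# No separator: any run of whitespace (space, tab, newline) is one separator.
default_parts = text.split()

# Explicit ' ' separator: only single spaces split, so \n and \t are kept,
# and the two consecutive spaces produce an empty string.
space_parts = text.split(" ")

print(default_parts)
print(space_parts)
```

So to split only on spaces and keep newlines inside the pieces, pass the separator explicitly.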

How to remove all unicode representations in python

I am trying to remove all representations of special characters in my document. For example, part of the document says "world\u2019s"; when I split this it gives ['world', '\u2019', 's'], but I need only the word (with the Unicode character and the 's' removed).
I am already removing all punctuation, and this works on actual punctuation characters that are shown normally, but not on these Unicode representations.
I have also tried using a regex to match everything that begins with a '\', but that doesn't seem to work either.
import re
string = "world\u2019s"
print (re.sub(r"\b([^\s]+)\\([^\s]+)\b",r'\1',str(string.encode('ascii', 'backslashreplace'), 'ascii')))
Output:
world
You can apply this to your whole document; it should work the same way:
import re
string = "world\u2019s h\u2018e"
print (re.sub(r"\b([^\s]+)\\([^\s]+)\b",r'\1',str(string.encode('ascii', 'backslashreplace'), 'ascii')))
Output:
world h
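An alternative sketch, not the answer's method: instead of encoding with backslashreplace, this matches the first non-ASCII character in a word and drops it together with the rest of that word. The helper name strip_non_ascii_tails is hypothetical, and the sketch assumes every non-ASCII character should be treated this way.

```python
import re

# First non-ASCII character in a word, plus everything up to the next whitespace.
NON_ASCII_TAIL = re.compile(r"[^\x00-\x7f]\S*")


def strip_non_ascii_tails(text):
    """Cut each word at its first non-ASCII character (hypothetical helper)."""
    return NON_ASCII_TAIL.sub("", text)


cleaned = strip_non_ascii_tails("world\u2019s h\u2018e")
print(cleaned)
```

Unlike the regex above, this also removes a word entirely if it starts with a non-ASCII character, which may or may not be what you want.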

Python re.sub(): trying to replace escaped characters only

With Python 3.x, I need to replace escaped double quotes in some text with a custom pattern, leaving non-escaped double quotes as they are. So I wrote this trivial code:
text = 'These are "quotes", and these are \"escaped quotes\"'
print(re.sub(r'\"', '~', text))
And expect to see:
These are "quotes", and these are ~escaped quotes~
But instead I get:
These are ~quotes~, and these are ~escaped quotes~
So, what's the correct pattern to replace escaped quotes only?
The background of this issue is an attempt to read an 'invalid' JSON file that contains a JavaScript function, placed with line feeds as is, but with escaped quotes. If there is an easier way to parse JSON with newline characters in key values, I would appreciate a hint on that.
First, you need to use a raw string to assign text, so that the backslashes will be kept literally (or you can escape the backslashes).
text = r'These are "quotes", and these are \"escaped quotes\"'
Second, you need to escape the backslash in the regexp so that it will be treated literally by the regexp engine.
print(re.sub(r'\\"', '~', text))
Using a raw string might help:
import re
text = r'These are "quotes", and these are \"escaped quotes\"'
print(re.sub(r'\\"', '~', text))
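To see why the raw string matters, this sketch compares the two ways of writing the literal. In the non-raw version, \" is just an escape for a plain double quote, so no backslash ever reaches the string and there is nothing to tell the two kinds of quotes apart.

```python
import re

plain = 'These are "quotes", and these are \"escaped quotes\"'
raw = r'These are "quotes", and these are \"escaped quotes\"'

# Count the literal backslashes that actually ended up in each string.
backslashes_plain = plain.count("\\")
backslashes_raw = raw.count("\\")

# The pattern \\" matches one literal backslash followed by a double quote.
result = re.sub(r'\\"', "~", raw)

print(backslashes_plain, backslashes_raw)
print(result)
```

Text read from a file already contains the literal backslashes, so only the pattern fix applies in that case.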

Regex works in Sublime, not in Python (Jupyter)

I am creating a Jupyter notebook to clean a large amount of novels with regex code I am testing in Sublime.
A lot of my texts contain the phrase 'digitized by Google' because that is where I got the PDF that I ran through optical character recognition.
I want to remove all sentences that contain the phrase 'Digitized', or rather 'gitized' since the first part isn't always correctly transcribed.
When I use this pattern in Sublime's replace function, I get exactly the results I want:
^.*igitized.*$
However, when I try to use the re.sub method in my Jupyter notebook, which works for some other phrases, the 'Digitized by Google' lines are NOT correctly identified and replaced by nothing.
text = re.sub(r'^.*igitized.*$', '', text)
What am I missing?
By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string. Add the re.MULTILINE flag so that they also match at the beginning and end of each line:
text = re.sub(r'^.*igitized.*$', '', text, flags=re.MULTILINE)
See also: Using ^ to match beginning of line in Python regex
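A small sketch of the effect of the flag on a made-up three-line text. Without re.MULTILINE the pattern finds nothing, because '^' can only anchor at the very start of the string and '.' does not cross newlines.

```python
import re

text = "Chapter one\nDigitized by Google\nIt was a dark night.\n"
pattern = r'^.*igitized.*$'

# Default behaviour: no match, so the text comes back unchanged.
without_flag = re.sub(pattern, '', text)

# With re.MULTILINE, '^' and '$' anchor at every line boundary.
with_flag = re.sub(pattern, '', text, flags=re.MULTILINE)

print(repr(without_flag))
print(repr(with_flag))
```

Note that the matched line is emptied but its newline stays, so an empty line remains where the phrase used to be.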

Read regexes from file and avoid or undo escaping

I want to read regular expressions from a file, where each line contains a regex:
lorem.*
dolor\S*
The following code is supposed to read each and append it to a list of regex strings:
vocabulary = []
with open(path, "r") as vocabularyFile:
    for term in vocabularyFile:
        term = term.rstrip()
        vocabulary.append(term)
This code seems to escape the \ special character in the file as \\. How can I either avoid escaping or unescape the string so that it can be worked with as if I wrote this?
regex = r"dolor\S*"
You are getting confused by echoing the value. The Python interpreter echoes values by printing the repr() function result, and this makes sure to escape any meta characters:
>>> regex = r"dolor\S*"
>>> regex
'dolor\\S*'
regex is still an 8 character string, not 9, and the single character at index 5 is a single backslash:
>>> regex[4]
'r'
>>> regex[5]
'\\'
>>> regex[6]
'S'
Printing the string writes out all characters verbatim, so no escaping takes place:
>>> print(regex)
dolor\S*
The same process is applied to the contents of containers, like a list or a dict:
>>> container = [regex, 'foo\nbar']
>>> print(container)
['dolor\\S*', 'foo\nbar']
Note that I didn't echo there, I printed. str(list_object) produces the same output as repr(list_object) here.
If you were to print individual elements from the list, you get the same unescaped result again:
>>> print(container[0])
dolor\S*
>>> print(container[1])
foo
bar
Note how the \n in the second element was written out as a newline now. It is for that reason that containers use repr() for contents; to make otherwise hard-to-detect or non-printable data visible.
In other words, your strings do not contain escaped backslashes here; the doubling only appears in the repr() output.
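A self-contained sketch that confirms this end to end. It writes the two patterns from the question to a temporary file (the file name vocabulary.txt and the sample sentence are assumptions for illustration), reads them back with the same loop, and compiles them directly.

```python
import re
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "vocabulary.txt"  # hypothetical file name
    # The file holds exactly: lorem.* and dolor\S* (one backslash).
    path.write_text("lorem.*\ndolor\\S*\n", encoding="utf-8")

    vocabulary = []
    with open(path, "r", encoding="utf-8") as vocabulary_file:
        for term in vocabulary_file:
            term = term.rstrip()
            if term:  # skip blank lines
                vocabulary.append(term)

# The strings can be compiled as they are; no unescaping step is needed.
patterns = [re.compile(term) for term in vocabulary]
match = patterns[1].search("lorem ipsum dolorem sit")

print(vocabulary[1] == r"dolor\S*")
print(match.group(0))
```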
