Odd character appending to the front of python list

I'm having an issue with regards to running python on linux. I'm trying to learn python and wanted to try and parse a small XML file and put the tags and data into the list. But every time I run the code I get a 'u' appending to each element in the list.
[u'world']
defaultdict(<type 'list'>, {u'world': [u'data']})
My code is as follows:
import xml.sax
from collections import defaultdict
class TransformXML(xml.sax.ContentHandler):
    def __init__ (self):
        self.start_tag_name = -1
        self.tag_data = -1
        self.myDict = defaultdict(list)
        self.tags = []
    def startElement(self, name, attrs):
        self.start_tag_name = name
        print name
        print self.start_tag_name
    def characters(self, content):
        if content.strip(' \r\n\t') != "":
            self.tag_data = content.strip(' \r\n\t')
            print self.start_tag_name
            self.tags.append(self.start_tag_name)
            self.myDict[self.start_tag_name].append(content.strip(' \r\n\t'))
    def endElement(self, name):
        pass
    def __del__ (self):
        if self.myDict:
            del self.myDict
            print "deleteing myDict"
Does anyone know what the issue might be?

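That 'weird' symbol means that the string is a unicode object. In Python 2 the u prefix is part of how a unicode string is displayed by repr(), which is what printing a list or dict uses for its elements. It is not a character in the string itself, so nothing has been added to your data.
E.g., if I have a string Test: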
>>> unicode('Test')
u'Test'
>>> s = unicode('Test')
>>> type(s)
<type 'unicode'>
To sum up, according to the Python documentation,
...a Unicode string is a sequence of code points, which are numbers from
0 to 0x10ffff. This sequence needs to be represented as a set of bytes
(meaning, values from 0-255) in memory. The rules for translating a
Unicode string into a sequence of bytes are called an encoding.
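As a small illustration of the point above, here is a sketch for Python 3, where every str is already Unicode and the u prefix is accepted but has no effect. The variable names are made up for this example; it only shows that the prefix is notation, while the text and its encoded bytes are separate things.

```python
# Python 3 sketch: the u prefix is only notation, not part of the text.
tags = [u"world"]                   # u"..." is still accepted in Python 3.3+

item_text = str(tags[0])            # the text itself, as print() would show it
encoded = tags[0].encode("utf-8")   # Unicode string -> bytes, using an encoding
decoded = encoded.decode("utf-8")   # bytes -> Unicode string again

print(tags)        # a list is displayed using repr() of each element
print(item_text)   # printing the element shows only its characters
print(encoded)     # the byte representation chosen by the encoding
```

In Python 2 the same list would be displayed with the u prefix, but the element would still contain just the five characters of the word.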

Related

Using python ast parser to process multi line strings

When using the python AST parser module in combination with scripts containing multi line strings, these multi line strings are always reduced to single line quoted strings. Example:
import ast
script = "text='''Line1\nLine2'''"
code = ast.parse (script, mode='exec')
print (ast.unparse (code))
node = code.body[0].value
print (node.lineno, node.end_lineno)
The output is:
> text = 'Line1\nLine2'
> 1 2
So in spite of being a multi line string before parsing, the text is reduced to a single line quoted string when unparsed. This makes script transformation difficult, because the multi lines are getting lost when unparsing a transformed AST graph.
Is there a way to parse/unparse scripts with multi line strings correctly with AST ?
Thank you in advance.

An examination of ast.unparse's underlying source reveals that the writer for the visit_Constant method, _write_constant, will produce the string repr unless the backslashing process is specifically avoided:
class _Unparser(NodeVisitor):
    ...
    def _write_constant(self, value):
        if isinstance(value, (float, complex)):
            ...
        elif self._avoid_backslashes and isinstance(value, str):
            self._write_str_avoiding_backslashes(value)
        else:
            self.write(repr(value))
By default, _avoid_backslashes is set to False. However, multi-line strings can be emitted properly by overriding visit_Constant and calling _write_str_avoiding_backslashes when the string node spans more than one line. Note that ast._Unparser is a private class, so this may break between Python versions:
import ast
class Unparser(ast._Unparser):
    def visit_Constant(self, node):
        # a string constant that spans several source lines keeps its line breaks
        if isinstance(node.value, str) and node.lineno < node.end_lineno:
            super()._write_str_avoiding_backslashes(node.value)
            return
        return super().visit_Constant(node)
def _unparse(ast_node):
    u = Unparser()
    return u.visit(ast_node)
script = "text='''Line1\nLine2'''"
print(_unparse(ast.parse(script)))
Output:
text = """Line1
Line2"""
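It may help to confirm that the default ast.unparse output changes only the layout of the literal, not its value. The following is a minimal check, assuming Python 3.9 or later (where ast.unparse exists); the variable names are illustrative only.

```python
import ast

script = "text='''Line1\nLine2'''"
tree = ast.parse(script)

# Default unparse: the literal is written via repr(), on a single line.
regenerated = ast.unparse(tree)

# Parse the regenerated source again and compare the string values.
original_value = tree.body[0].value.value
roundtrip_value = ast.parse(regenerated).body[0].value.value

print(regenerated)
print(original_value == roundtrip_value)
```

So a transformed script still behaves the same after unparsing; the override above is only needed when the original multi-line presentation matters.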

How to find multi-line comments wrapped in quotes?

I am parsing Python code, and I need to remove all possible comments/docstrings. I have successfully been able to remove "comments" of the form:
#comment
"""comment"""
'''comment'''
However, I have found some samples where people write comments of the form:
"'''comment'''"
"\"\"\"\n comment \"\"\""
I am struggling to successfully remove these comments (three single quotes surrounded by a double quote, and double quotes with line breaks). The expression I tried was:
p = re.compile("([\'\"])\1\1(.*?)\1{3}", re.DOTALL)
code = p.sub('', code)
But this did not work for either of the second two cases. Does anyone have any suggestions?

You could try using strip().
It removes any of the characters you pass as the argument from both ends of the string. With no argument it removes leading and trailing whitespace, but you want to remove the three single quotes surrounded by a double quote, and double quotes with line breaks. So an example is:
txt = ",,,,,rrttgg.....banana....rrr"
x = txt.strip(",.grt")
print(x)
And the output you would get is "banana", because every leading and trailing character found in ",.grt" (the argument in x = txt.strip(",.grt")) has been removed.
For more info check out this page; I recommend the information at the bottom for further help:
https://www.w3schools.com/python/python_strings.asp

posting as an answer because my comment was hard to read
This is what I came up with, it's ugly and hacky but it does work.
import re
txt = "if x = 4: continue \"'''hi'''\" print(x) "
print(txt)
#find everything wrapped in double quotes
double_quotes = re.findall(r"\"(.+?)\"", txt)
for string in double_quotes:
    triple_single = re.findall(r"\'''(.+?)\'''", string)[0]
    full_comment = '"'+"'''" +triple_single+"'''"+'"'
    txt = txt.replace(full_comment, '')
print(txt)
Prints:
if x = 4: continue "'''hi'''" print(x)
if x = 4: continue print(x)
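One likely reason the pattern in the question matched nothing at all is that it was written as a normal string rather than a raw string, so \1 is read by Python as the character with code 1 instead of reaching the regex engine as a backreference. The sketch below compares the two spellings on a made-up input; it does not solve the quoted-inside-quotes cases, and the raw version would also strip triple-quoted strings that are real data.

```python
import re

# Pattern as written in the question: in a normal string, \1 is an octal
# escape for chr(1), so the regex engine never sees a backreference.
non_raw_pattern = "([\'\"])\1\1(.*?)\1{3}"

# Same pattern as a raw string: \1 now refers back to the captured quote.
raw_pattern = r"(['\"])\1\1(.*?)\1{3}"

code = 'x = 1\n"""comment"""\ny = 2\n'

unchanged = re.compile(non_raw_pattern, re.DOTALL).sub("", code)
stripped = re.compile(raw_pattern, re.DOTALL).sub("", code)

print(repr(unchanged))
print(repr(stripped))
```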

Unassigned string literals can be treated as nodes in the source code's Abstract Syntax Tree (AST) representation. The problem then reduces to identifying these nodes and rewriting the AST without them, using the tools in the ast module.
Comments (# ...) are not parsed into the AST, so there is no need to handle them.
Unassigned string literals are ast.Expr statements whose value is an ast.Constant, and they form part of the body attribute of nodes that have bodies, such as modules, function definitions and class definitions. We can identify these nodes, remove them from their parents' bodies and then rewrite the AST.
import ast
import io
from unparse import Unparser
with open('comments.py') as f:
    src = f.read()
root = ast.parse(src)
# print(ast.dump(root)) to see the ast structure.
def filter_constants(node):
    if isinstance(node, ast.Expr):
        if isinstance(node.value, ast.Constant):
            if isinstance(node.value.value, str):
                return None
    return node
class CommentRemover(ast.NodeTransformer):
    def visit(self, node):
        if hasattr(node, 'body'):
            node.body = [n for n in node.body if filter_constants(n)]
        return super().visit(node)
remover = CommentRemover()
new = remover.visit(root)
ast.fix_missing_locations(new)
buf = io.StringIO()
Unparser(new, buf)
buf.seek(0)
print(buf.read())
Calling the script on this code (comments.py):
"""Module docstring."""
# A real comment
"""triple-double-quote comment"""
'''triple-single-quote comment'''
"'''weird comment'''"
"\"\"\"\n comment \"\"\""
NOT_A_COMMENT = 'spam'
42
def foo():
    """Function docstring."""
    # Function comment
    bar = 'baz'
    return bar
class Quux:
    """class docstring."""
    # class comment
    def m(self):
        """method comment"""
        return
Gives this output:
NOT_A_COMMENT = 'spam'
42
def foo():
    bar = 'baz'
    return bar
class Quux():
    def m(self):
        return
Notes:
The unparse script can be found in your Python distribution's Tools/parser folder (in v3.8; in previous versions it has been in Tools or in the Demo folder). It may also be downloaded from GitHub; be sure to download the version that matches your version of Python.
As of Python 3.8, the ast.Constant class is used for all constant nodes; for earlier versions you may need to use ast.Num, ast.Str, ast.Bytes, ast.NameConstant and ast.Ellipsis as appropriate. So filter_constants might look like this:
def filter_constants(node):
    if isinstance(node, ast.Expr):
        if isinstance(node.value, ast.Str):
            return None
    return node
As of Python 3.9, the ast module provides an unparse function that may be used instead of the unparse script:
src = ast.unparse(new)
print(src)
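For reference, the same idea can be sketched as one self-contained script for Python 3.9+, using ast.unparse and an inline source string instead of a file. The class and helper names here are invented for the example. Two choices differ from the code above and are assumptions of this sketch: it only filters body attributes that are lists (lambda and conditional expressions also have a body, but it is a single expression), and it inserts pass when a function or class body would otherwise become empty.

```python
import ast

def is_bare_string(stmt):
    """True for an expression statement that is just a string literal."""
    return (isinstance(stmt, ast.Expr)
            and isinstance(stmt.value, ast.Constant)
            and isinstance(stmt.value.value, str))

class StringStatementRemover(ast.NodeTransformer):
    def generic_visit(self, node):
        super().generic_visit(node)          # handle nested bodies first
        body = getattr(node, "body", None)
        if isinstance(body, list):           # skip Lambda/IfExp, whose body is an expression
            kept = [stmt for stmt in body if not is_bare_string(stmt)]
            if not kept and not isinstance(node, ast.Module):
                kept = [ast.Pass()]          # keep the block syntactically valid
            node.body = kept
        return node

# Hypothetical input, built line by line to avoid nested triple quotes.
source = "\n".join([
    '"""Module docstring."""',
    "\"'''weird comment'''\"",
    '"\\"\\"\\"\\n comment \\"\\"\\""',
    "VALUE = 'spam'",
    "def foo():",
    '    """Only a docstring."""',
])

tree = StringStatementRemover().visit(ast.parse(source))
ast.fix_missing_locations(tree)
cleaned = ast.unparse(tree)
print(cleaned)
```

Like the original, this only looks at body, so string statements inside else, except or finally blocks are left alone.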

unicode table information about a character in python

Is there a way in Python to get the technical information for a given character as it's displayed in the Unicode table? (cf. https://unicode-table.com/en/)
Example:
for the letter "Ȅ"
Name > Latin Capital Letter E with Double Grave
Unicode number > U+0204
HTML-code > &#516;
Bloc > Latin Extended-B
Lowercase > ȅ
What I actually need is to get for any Unicode number (like here U+0204) the corresponding name (Latin Capital Letter E with Double Grave) and the lowercase version (here "ȅ").
Roughly:
input = a Unicode number
output = corresponding information
The closest thing I've been able to find is the fontTools library but I can't seem to find any tutorial/documentation on how to use it to do that.
Thank you.

The standard module unicodedata defines a lot of properties, but not everything. A quick peek at its source confirms this.
Fortunately UnicodeData.txt, the data file where this comes from, is not hard to parse. Each line consists of exactly 15 fields separated by semicolons, which makes it ideal for parsing. Using the description of the fields on ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html, you can create a few classes to encapsulate the data. I've taken the names of the class elements from that list; the meaning of each of the elements is explained on that same page.
Make sure to download ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt and ftp://ftp.unicode.org/Public/UNIDATA/Blocks.txt first, and put them inside the same folder as this program.
Code (tested with Python 2.7 and 3.6):
# -*- coding: utf-8 -*-
class UnicodeCharacter:
    def __init__(self):
        self.code = 0
        self.name = 'unnamed'
        self.category = ''
        self.combining = ''
        self.bidirectional = ''
        self.decomposition = ''
        self.asDecimal = None
        self.asDigit = None
        self.asNumeric = None
        self.mirrored = False
        self.uc1Name = None
        self.comment = ''
        self.uppercase = None
        self.lowercase = None
        self.titlecase = None
        self.block = None
    def __getitem__(self, item):
        return getattr(self, item)
    def __repr__(self):
        return '{'+self.name+'}'
class UnicodeBlock:
    def __init__(self):
        self.first = 0
        self.last = 0
        self.name = 'unnamed'
    def __repr__(self):
        return '{'+self.name+'}'
class BlockList:
    def __init__(self):
        self.blocklist = []
        with open('Blocks.txt','r') as uc_f:
            for line in uc_f:
                line = line.strip(' \r\n')
                if '#' in line:
                    line = line.split('#')[0].strip()
                if line != '':
                    rawdata = line.split(';')
                    block = UnicodeBlock()
                    block.name = rawdata[1].strip()
                    rawdata = rawdata[0].split('..')
                    block.first = int(rawdata[0],16)
                    block.last = int(rawdata[1],16)
                    self.blocklist.append(block)
        # make 100% sure it's sorted, for quicker look-up later
        # (it is usually sorted in the file, but better make sure)
        self.blocklist.sort(key=lambda x: x.first)
    def lookup(self,code):
        for item in self.blocklist:
            if code >= item.first and code <= item.last:
                return item.name
        return None
class UnicodeList:
    """UnicodeList loads Unicode data from the external files
    'UnicodeData.txt' and 'Blocks.txt', both available at unicode.org
    These files must appear in the same directory as this program.
    UnicodeList is a new interpretation of the standard library
    'unicodedata'; you may first want to check if its functionality
    suffices.
    As UnicodeList loads its data from an external file, it does not depend
    on the local build from Python (in which the Unicode data gets frozen
    to the then 'current' version).
    Initialize with
        uclist = UnicodeList()
    """
    def __init__(self):
        # we need this first
        blocklist = BlockList()
        bpos = 0
        self.codelist = []
        with open('UnicodeData.txt','r') as uc_f:
            for line in uc_f:
                line = line.strip(' \r\n')
                if '#' in line:
                    line = line.split('#')[0].strip()
                if line != '':
                    rawdata = line.strip().split(';')
                    parsed = UnicodeCharacter()
                    parsed.code = int(rawdata[0],16)
                    # stored as 'name' so that __repr__ and .name work
                    parsed.name = rawdata[1]
                    parsed.category = rawdata[2]
                    parsed.combining = rawdata[3]
                    parsed.bidirectional = rawdata[4]
                    parsed.decomposition = rawdata[5]
                    parsed.asDecimal = int(rawdata[6]) if rawdata[6] else None
                    parsed.asDigit = int(rawdata[7]) if rawdata[7] else None
                    # the following value may contain a slash:
                    # ONE QUARTER ... 1/4
                    # let's make it Python 2.7 compatible :)
                    if '/' in rawdata[8]:
                        rawdata[8] = rawdata[8].replace('/','./')
                        parsed.asNumeric = eval(rawdata[8])
                    else:
                        parsed.asNumeric = int(rawdata[8]) if rawdata[8] else None
                    parsed.mirrored = rawdata[9] == 'Y'
                    parsed.uc1Name = rawdata[10]
                    parsed.comment = rawdata[11]
                    parsed.uppercase = int(rawdata[12],16) if rawdata[12] else None
                    parsed.lowercase = int(rawdata[13],16) if rawdata[13] else None
                    parsed.titlecase = int(rawdata[14],16) if rawdata[14] else None
                    # advance to the block that can contain this code point
                    while bpos < len(blocklist.blocklist) and parsed.code > blocklist.blocklist[bpos].last:
                        bpos += 1
                    parsed.block = blocklist.blocklist[bpos].name if bpos < len(blocklist.blocklist) and parsed.code >= blocklist.blocklist[bpos].first else None
                    self.codelist.append(parsed)
    def find_code(self,codepoint):
        """Find the Unicode information for a codepoint (as int).
        Returns:
            a UnicodeCharacter class object or None.
        """
        # the list is unlikely to contain duplicates but I have seen Unicode.org
        # doing that in similar situations. Again, better make sure.
        val = [x for x in self.codelist if codepoint == x.code]
        return val[0] if val else None
    def find_char(self,str):
        """Find the Unicode information for a codepoint (as character).
        Returns:
            for a single character: a UnicodeCharacter class object or
            None.
            for a multicharacter string: a list of the above, one element
            per character.
        """
        if len(str) > 1:
            result = [self.find_code(ord(x)) for x in str]
            return result
        else:
            return self.find_code(ord(str))
When loaded, you can now look up a character code with
>>> ul = UnicodeList() # ONLY NEEDED ONCE!
>>> print (ul.find_code(0x204))
{LATIN CAPITAL LETTER E WITH DOUBLE GRAVE}
which by default is shown as the name of a character (Unicode calls this a 'code point'), but you can retrieve other properties as well:
>>> print ('%04X' % ul.find_code(0x204).lowercase)
0205
>>> print (ul.find_code(0x204).block)
Latin Extended-B
and (as long as you don't get a None) even chain them:
>>> print (ul.find_code(ul.find_code(0x204).lowercase))
{LATIN SMALL LETTER E WITH DOUBLE GRAVE}
It does not rely on your particular build of Python; you can always download an updated list from unicode.org and be assured to get the most recent information:
>>> import unicodedata
>>> print (unicodedata.name('\U0001F903'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> print (ul.find_code(0x1f903))
{LEFT HALF CIRCLE WITH FOUR DOTS}
(As tested with Python 3.5.3.)
There are currently two lookup functions defined:
find_code(int) looks up character information by codepoint as an integer.
find_char(string) looks up character information for the character(s) in string. If there is only one character, it returns a UnicodeCharacter object; if there are more, it returns a list of objects.
After import unicodelist (assuming you saved this as unicodelist.py), you can use
>>> ul = UnicodeList()
>>> hex(ul.find_char(u'è').code)
'0xe8'
to look up the hex code for any character, and a list comprehension such as
>>> l = [hex(ul.find_char(x).code) for x in 'Hello']
>>> l
['0x48', '0x65', '0x6c', '0x6c', '0x6f']
for longer strings. Note that you don't actually need all of this if all you want is a hex representation of a string! This suffices:
l = [hex(ord(x)) for x in 'Hello']
The purpose of this module is to give easy access to other Unicode properties. A longer example:
str = 'Héllo...'
dest = ''
for i in str:
    dest += chr(ul.find_char(i).uppercase) if ul.find_char(i).uppercase is not None else i
print (dest)
HÉLLO...
and showing a list of properties for a character per your example:
letter = u'Ȅ'
print ('Name > '+ul.find_char(letter).name)
print ('Unicode number > U+%04x' % ul.find_char(letter).code)
print ('Bloc > '+ul.find_char(letter).block)
print ('Lowercase > %s' % chr(ul.find_char(letter).lowercase))
(I left out HTML; these names are not defined in the Unicode standard.)
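The comment in BlockList about sorting "for quicker look-up later" suggests a binary search, although lookup itself scans linearly. The following is a rough sketch of that idea using the standard bisect module. The four block ranges are hard-coded here so the example runs without Blocks.txt, and the names BLOCKS, STARTS and block_name are invented for this illustration.

```python
import bisect

# A few (first, last, name) ranges as they would be read from Blocks.txt,
# already sorted by their first code point.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
]
STARTS = [first for first, _, _ in BLOCKS]

def block_name(codepoint):
    """Return the block containing codepoint, or None if no range matches."""
    # Index of the last block whose first code point is <= codepoint.
    index = bisect.bisect_right(STARTS, codepoint) - 1
    if index >= 0:
        first, last, name = BLOCKS[index]
        if first <= codepoint <= last:
            return name
    return None

print(block_name(0x204))
```

The same approach could replace the list comprehension in find_code if the code points were kept in a sorted list, or a dict keyed by code point would avoid searching altogether.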

The unicodedata documentation shows how to do most of this.
The Unicode block name is apparently not available but another Stack Overflow question has a solution of sorts and another has some additional approaches using regex.
The uppercase/lowercase mapping and character number information is not particularly Unicode-specific; just use the regular Python string functions.
So in summary
>>> import unicodedata
>>> unicodedata.name('Ë')
'LATIN CAPITAL LETTER E WITH DIAERESIS'
>>> 'U+%04X' % ord('Ë')
'U+00CB'
>>> '&#%i;' % ord('Ë')
'&#203;'
>>> 'Ë'.lower()
'ë'
The U+%04X formatting is sort-of correct, in that it simply avoids padding and prints the whole hex number for code points with a value higher than 65,535. Note that some other formats require the use of %08X padding in this scenario (notably \U00010000 format in Python).
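Since the question starts from a Unicode number rather than a character, the calls above can be wrapped in a small helper that takes the code point as an integer. This is only a sketch for Python 3; the function name describe and the dictionary keys are made up here, and the block name is left out because unicodedata does not provide it.

```python
import unicodedata

def describe(codepoint):
    """Collect a few properties for a code point given as an int."""
    char = chr(codepoint)
    return {
        # the second argument is returned when the character has no name
        "name": unicodedata.name(char, "<unnamed>"),
        "number": "U+%04X" % codepoint,
        "html": "&#%d;" % codepoint,
        "lowercase": char.lower(),
    }

info = describe(0x0204)
for key, value in info.items():
    print(key, ">", value)
```

The names come from the Unicode data compiled into the running Python build, so very recent characters may fall back to the placeholder.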

You can do this in a few ways:
1- Create an API yourself (I couldn't find anything that does this).
2- Create a table in a database or an Excel file.
3- Load and parse a website.
I think the third way is the easiest. Take a look at This Page; you can find the Unicode information there.
Get your Unicode number, then find it in the web page using parsing tools such as lxml, Scrapy, Selenium, etc.

Python: New line at the same position

In Python 2.7, how can you achieve the following:
print "some text here"+?+"and then it starts there"
The output on the terminal should look like:
some text here
              and then it starts here
I have searched around and I think \r should do the job, but when I tried it, it did not work. I am confused now.
BTW, is the \r solution portable?
P.S.
In my odd situation, knowing the length of the previous line is quite difficult, so is there any approach other than using the length of the line above it?
==================================================================================
Okay, the situation is like this: I am writing a tree structure and I want to print it out nicely using the __str__ function
class node:
    def __init__(self,key,childern):
        self.key = key
        self.childern = childern
    def __str__(self):
        return "Node:"+self.key+"Children:"+str(self.childern)
where Children is a list.
Every time it prints Children, I want it indented one level more than the previous line. So I think I cannot predict the length of the text before the line I want to print.

\r is probably not a portable solution; how it is rendered depends on the text editor or terminal you're using. On older Mac systems, '\r' was used as the end-of-line character (on Windows it is '\r\n', and on Linux and OS X it is '\n').
You could simply do something like this:
def print_lines_at_same_position(*lines):
    prev_len = 0
    for line in lines:
        print " "*prev_len + line
        prev_len += len(line)
Usage example:
>>> print_lines_at_same_position("hello", "world", "this is a test")
hello
     world
          this is a test
>>>
This will only work if whatever you're outputting to uses a fixed-width font, though. I can't think of anything that will work otherwise.
Edit to fit changed question
Okay, so that's an entirely different question. I don't think there's any way to do it with it starting at exactly the position where the last line left off unless self.key has a predictable length. But you can get something pretty close with this:
class node:
    def __init__(self,key,children):
        self.key = key
        self.children = children
        self.depth = 0
    def set_depth(self, depth):
        self.depth = depth
        for child in self.children:
            child.set_depth(depth+1)
    def __str__(self):
        indent = " "*4*self.depth
        children_str = "\n".join(map(str, self.children))
        if children_str:
            children_str = "\n" + children_str
        return indent + "Node: %s%s" % (self.key, children_str)
Then just set the depth of the root node to 0, and do that again every time you change the structure of the tree. There are more efficient ways if you know exactly how you're changing the tree; you can probably figure those out yourself :)
Usage example:
>>> a = node("leaf", [])
>>> b = node("another leaf", [])
>>> c = node("internal", [a,b])
>>> d = node("root", [c])
>>> d.set_depth(0)
>>> print d
Node: root
    Node: internal
        Node: leaf
        Node: another leaf
>>>
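A variation that avoids storing the depth on each node, and so never needs set_depth to be called again after the tree changes, is to pass the depth down while building the string. This is a sketch for Python 3; the render method and the capitalised class name are not from the code above.

```python
class Node:
    def __init__(self, key, children=()):
        self.key = key
        self.children = list(children)

    def render(self, depth=0):
        """Return this subtree as text, indented four spaces per level."""
        lines = [" " * 4 * depth + "Node: %s" % self.key]
        for child in self.children:
            lines.append(child.render(depth + 1))
        return "\n".join(lines)

    def __str__(self):
        return self.render()

tree = Node("root", [Node("internal", [Node("leaf"), Node("another leaf")])])
rendered = str(tree)
print(rendered)
```

The trade-off is that printing a subtree on its own always starts at column zero, whereas the stored-depth version keeps each node's absolute indentation.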

You could use os.linesep to get a more portable linebreak, instead of just \r. I would then use len() to calculate the length of the 1st string in order to calculate whitespace.
>>> import os
>>> my_str = "some text here"
>>> print my_str + os.linesep + ' ' * len(my_str) + 'and then it starts here'
some text here
              and then it starts here
The key is ' ' * len(my_str). This will repeat the space character len(my_str) times.

The \r solution is not what you are looking for: on Windows it is only part of the newline sequence (\r\n), while on older Mac systems it is the newline itself.
You would need code like the following:
def pretty_print(text):
    total = 0
    for element in text:
        print "{}{}".format(' '*total, element)
        total += len(element)
pretty_print(["lol", "apples", "are", "fun"])
Which will print the lines of text the way you want them to.
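The same staircase idea can be written for Python 3 as a function that returns the text instead of printing it, which makes it easier to reuse inside __str__. The function name staircase is invented for this sketch. It joins with a plain "\n" because text-mode streams such as sys.stdout translate that to the platform's line ending on their own.

```python
def staircase(parts):
    """Put each part on its own line, starting where the previous one ended."""
    lines = []
    offset = 0
    for part in parts:
        lines.append(" " * offset + part)
        offset += len(part)          # next line starts after this part
    return "\n".join(lines)

result = staircase(["some text here", "and then it starts there"])
print(result)
```

As with the other answers, the alignment only holds in a fixed-width font.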

Try using the len("text") * ' ' to get the amount of white space you want.
To get a portable line break, use os.linesep
>>> import os
>>> os.linesep
'\n'
EDIT
Another option that might be suitable in some cases is to override the stdout stream.
import sys, os
class StreamWrap(object):
    TAG = '<br>' # use a string that suits your use case
    def __init__(self, stream):
        self.stream = stream
    def write(self, text):
        tokens = text.split(StreamWrap.TAG)
        indent = 0
        for i, token in enumerate(tokens):
            self.stream.write(indent*' ' + token)
            if i < len(tokens)-1:
                self.stream.write(os.linesep)
            indent += len(token)
    def flush(self):
        self.stream.flush()
sys.stdout = StreamWrap(sys.stdout)
print "some text here"+ StreamWrap.TAG +"and then it starts there"
This will give you a result like this:
$ python test.py
some text here
              and then it starts there

Workarounds when a string is too long for a .join. OverflowError occurs

I'm working through some python problems on pythonchallenge.com to teach myself python and I've hit a roadblock, since the string I am to be using is too large for python to handle. I receive this error:
my-macbook:python owner1$ python singleoccurrence.py
Traceback (most recent call last):
  File "singleoccurrence.py", line 32, in <module>
    myString = myString.join(line)
OverflowError: join() result is too long for a Python string
What alternatives do I have for this issue? My code looks like such...
#open file testdata.txt
#for each character, check if already exists in array of checked characters
#if so, skip.
#if not, character.count
#if count > 1, repeat recursively with first character stripped off of page.
# if count = 1, add to valid character array.
#when string = 0, print valid character array.
valid = []
checked = []
myString = ""
def recursiveCount(bigString):
    if len(bigString) == 0:
        print "YAY!"
        return valid
    myChar = bigString[0]
    if myChar in checked:
        return recursiveCount(bigString[1:])
    if bigString.count(myChar) > 1:
        checked.append(myChar)
        return recursiveCount(bigString[1:])
    checked.append(myChar)
    valid.append(myChar)
    return recursiveCount(bigString[1:])
fileIN = open("testdata.txt", "r")
line = fileIN.readline()
while line:
    line = line.strip()
    myString = myString.join(line)
    line = fileIN.readline()
myString = recursiveCount(myString)
print "\n"
print myString

string.join doesn't do what you think. join is used to combine a sequence of strings into a single string with the given separator, i.e.:
>>> ",".join(('foo', 'bar', 'baz'))
'foo,bar,baz'
The code snippet you posted will attempt to insert myString between every character in the variable line. You can see how that will get big quickly :-). Are you trying to read the entire file into a single string, myString? If so, the way you want to concatenate the strings is like this:
myString = myString + line
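To see how quickly the result grows, here is a small experiment (Python 3, with four made-up three-character lines) that repeats the join from the question and records the length after each step, next to the length that plain concatenation would give.

```python
lines = ["abc", "def", "ghi", "jkl"]

joined = ""
concatenated = ""
join_sizes = []
concat_sizes = []

for line in lines:
    # str.join puts the old accumulated text between every character of line
    joined = joined.join(line)
    # concatenation simply appends the new line
    concatenated = concatenated + line
    join_sizes.append(len(joined))
    concat_sizes.append(len(concatenated))

print(join_sizes)
print(concat_sizes)
```

With join, each step roughly multiplies the previous length by the number of characters in the new line minus one, so a file with many long lines overflows very quickly.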
While I'm here... since you're learning Python here are some other suggestions.
There are easier ways to read an entire file into a variable. For instance:
fileIN = open("testdata.txt", "r")
myString = fileIN.read()
(This won't have the exact behaviour of your existing strip() code, but may in fact do what you want.)
Also, I would never recommend practical Python code use recursion to iterate over a string. Your code will make a function call (and a stack entry) for every character in the string. Also I'm not sure Python will be very smart about all the uses of bigString[1:]: it may well create a second string in memory that's a copy of the original without the first character. The simplest way to process every character in a string is:
for mychar in bigString:
    ... do your stuff ...
Finally, you are using the list named "checked" to see if you've ever seen a particular character before. But the membership test on lists ("if myChar in checked") is slow, because it scans the whole list. In Python you're better off using a dictionary (or a set), where the same test is a hash lookup:
checked = {}
...
if myChar not in checked:
    checked[myChar] = True
...
This exercise you're doing is a great way to learn several Python idioms.
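Putting these suggestions together, the whole task (collect the characters that occur exactly once, in order of appearance) can be sketched without recursion. This version is for Python 3 and uses collections.Counter, which is a swap-in for the dictionary mentioned above rather than something from the original code; the function name is invented, and the sample string stands in for the contents of testdata.txt.

```python
from collections import Counter

def single_occurrence_chars(text):
    """Return the characters that appear exactly once, in original order."""
    counts = Counter(text)               # one pass to count every character
    return [ch for ch in text if counts[ch] == 1]

sample = "aabcbd"
valid = single_occurrence_chars(sample)
print("".join(valid))
```

Both passes are linear in the length of the text, and no copies of the string are made along the way.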
