I'm using Python to compile another Python file. To this end, I use Template from string into which I insert, e.g., a constructed function body, e.g.,
from string import Template
s = Template('''
def main():
${body}
return
''')
# body constructed bit by bit
body = ['a']
body.append('b')
body.append('c')
out = s.substitute(body='\n'.join(body))
print(out)
The output of the above is
def main():
a
b
c
return
which already highlights the problem: ${body} lines other than the first aren't correctly indented. I could of course manually add the spaces when inserting 'b' and 'c' into the body list, but that already assumes knowledge of the template into which the body will be inserted.
(Perhaps string.Template is not be the appropriate template engine to begin with.)
Assuming you need to fix indentation for multi line replacements only when the ${} model is at the beginning of the line (except for the indentation), you could use a regex to find all the tokens in the mask and if they are preceded with only blanks repeat them on all following lines from the replacement list.
You could use code like that:
import string, re
def substitute(s, reps):
t = string.Template(s)
i=0; cr = {} # prepare to iterate through the pattern string
while True:
# search for next replaceable token and its prefix
m =re.search(r'^(.*?)\$\{(.*?)\}', tpl[i:], re.MULTILINE)
if m is None: break # no more : finished
# the list is joined using the prefix if it contains only blanks
sep = ('\n' + m.group(1)) if m.group(1).strip() == '' else '\n'
cr[m.group(2)] = sep.join(rep[m.group(2)])
i += m.end() # continue past last processed replaceable token
return t.substitute(cr) # we can now substitute
With your example (slightly modified), it would give:
s = '''
def main():
${body}
return ${retval}
''')
# body constructed bit by bit
body = ['a']
body.append('b')
body.append('c')
out = substitute(s, { 'body': body, 'retval': 0 }
print (out)
it gives as expected:
def main():
a
b
c
return 0
Related
I'm new to Python and relatively new to programming. I'm trying to replace part of a file path with a different file path. If possible, I'd like to avoid regex as I don't know it. If not, I understand.
I want an item in the Python list [] before the word PROGRAM to be replaced with the 'replaceWith' variable.
How would you go about doing this?
Current Python List []
item1ToReplace1 = \\server\drive\BusinessFolder\PROGRAM\New\new.vb
item1ToReplace2 = \\server\drive\BusinessFolder\PROGRAM\old\old.vb
Variable to replace part of the Python list path
replaceWith = 'C:\ProgramFiles\Microsoft\PROGRAM'
Desired results for Python List []:
item1ToReplace1 = C:\ProgramFiles\Micosoft\PROGRAM\New\new.vb
item1ToReplace2 = C:\ProgramFiles\Micosoft\PROGRAM\old\old.vb
Thank you for your help.
The following code does what you ask, note I updated your '' to '\', you probably need to account for the backslash in your code since it is used as an escape character in python.
import os
item1ToReplace1 = '\\server\\drive\\BusinessFolder\\PROGRAM\\New\\new.vb'
item1ToReplace2 = '\\server\\drive\\BusinessFolder\\PROGRAM\\old\\old.vb'
replaceWith = 'C:\ProgramFiles\Microsoft\PROGRAM'
keyword = "PROGRAM\\"
def replacer(rp, s, kw):
ss = s.split(kw,1)
if (len(ss) > 1):
tail = ss[1]
return os.path.join(rp, tail)
else:
return ""
print(replacer(replaceWith, item1ToReplace1, keyword))
print(replacer(replaceWith, item1ToReplace2, keyword))
The code splits on your keyword and puts that on the back of the string you want.
If your keyword is not in the string, your result will be an empty string.
Result:
C:\ProgramFiles\Microsoft\PROGRAM\New\new.vb
C:\ProgramFiles\Microsoft\PROGRAM\old\old.vb
One way would be:
item_ls = item1ToReplace1.split("\\")
idx = item_ls.index("PROGRAM")
result = ["C:", "ProgramFiles", "Micosoft"] + item_ls[idx:]
result = "\\".join(result)
Resulting in:
>>> item1ToReplace1 = r"\\server\drive\BusinessFolder\PROGRAM\New\new.vb"
... # the above
>>> result
'C:\ProgramFiles\Micosoft\PROGRAM\New\new.vb'
Note the use of r"..." in order to avoid needing to have to 'escape the escape characters' of your input (i.e. the \). Also that the join/split requires you to escape these characters with a double backslash.
Is there a way in python to get the technical information for a given character like it's displayed in the Unicode table? (cf.https://unicode-table.com/en/)
Example:
for the letter "Ȅ"
Name > Latin Capital Letter E with Double Grave
Unicode number > U+0204
HTML-code > Ȅ
Bloc > Latin Extended-B
Lowercase > ȅ
What I actually need is to get for any Unicode number (like here U+0204) the corresponding name (Latin Capital Letter E with Double Grave) and the lowercase version (here "ȅ").
Roughly:
input = a Unicode number
output = corresponding information
The closest thing I've been able to find is the fontTools library but I can't seem to find any tutorial/documentation on how to use it to do that.
Thank you.
The standard module unicodedata defines a lot of properties, but not everything. A quick peek at its source confirms this.
Fortunately unicodedata.txt, the data file where this comes from, is not hard to parse. Each line consists of exactly 15 elements, ; separated, which makes it ideal for parsing. Using the description of the elements on ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html, you can create a few classes to encapsulate the data. I've taken the names of the class elements from that list; the meaning of each of the elements is explained on that same page.
Make sure to download ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt and ftp://ftp.unicode.org/Public/UNIDATA/Blocks.txt first, and put them inside the same folder as this program.
Code (tested with Python 2.7 and 3.6):
# -*- coding: utf-8 -*-
class UnicodeCharacter:
def __init__(self):
self.code = 0
self.name = 'unnamed'
self.category = ''
self.combining = ''
self.bidirectional = ''
self.decomposition = ''
self.asDecimal = None
self.asDigit = None
self.asNumeric = None
self.mirrored = False
self.uc1Name = None
self.comment = ''
self.uppercase = None
self.lowercase = None
self.titlecase = None
self.block = None
def __getitem__(self, item):
return getattr(self, item)
def __repr__(self):
return '{'+self.name+'}'
class UnicodeBlock:
def __init__(self):
self.first = 0
self.last = 0
self.name = 'unnamed'
def __repr__(self):
return '{'+self.name+'}'
class BlockList:
def __init__(self):
self.blocklist = []
with open('Blocks.txt','r') as uc_f:
for line in uc_f:
line = line.strip(' \r\n')
if '#' in line:
line = line.split('#')[0].strip()
if line != '':
rawdata = line.split(';')
block = UnicodeBlock()
block.name = rawdata[1].strip()
rawdata = rawdata[0].split('..')
block.first = int(rawdata[0],16)
block.last = int(rawdata[1],16)
self.blocklist.append(block)
# make 100% sure it's sorted, for quicker look-up later
# (it is usually sorted in the file, but better make sure)
self.blocklist.sort (key=lambda x: block.first)
def lookup(self,code):
for item in self.blocklist:
if code >= item.first and code <= item.last:
return item.name
return None
class UnicodeList:
"""UnicodeList loads Unicode data from the external files
'UnicodeData.txt' and 'Blocks.txt', both available at unicode.org
These files must appear in the same directory as this program.
UnicodeList is a new interpretation of the standard library
'unicodedata'; you may first want to check if its functionality
suffices.
As UnicodeList loads its data from an external file, it does not depend
on the local build from Python (in which the Unicode data gets frozen
to the then 'current' version).
Initialize with
uclist = UnicodeList()
"""
def __init__(self):
# we need this first
blocklist = BlockList()
bpos = 0
self.codelist = []
with open('UnicodeData.txt','r') as uc_f:
for line in uc_f:
line = line.strip(' \r\n')
if '#' in line:
line = line.split('#')[0].strip()
if line != '':
rawdata = line.strip().split(';')
parsed = UnicodeCharacter()
parsed.code = int(rawdata[0],16)
parsed.characterName = rawdata[1]
parsed.category = rawdata[2]
parsed.combining = rawdata[3]
parsed.bidirectional = rawdata[4]
parsed.decomposition = rawdata[5]
parsed.asDecimal = int(rawdata[6]) if rawdata[6] else None
parsed.asDigit = int(rawdata[7]) if rawdata[7] else None
# the following value may contain a slash:
# ONE QUARTER ... 1/4
# let's make it Python 2.7 compatible :)
if '/' in rawdata[8]:
rawdata[8] = rawdata[8].replace('/','./')
parsed.asNumeric = eval(rawdata[8])
else:
parsed.asNumeric = int(rawdata[8]) if rawdata[8] else None
parsed.mirrored = rawdata[9] == 'Y'
parsed.uc1Name = rawdata[10]
parsed.comment = rawdata[11]
parsed.uppercase = int(rawdata[12],16) if rawdata[12] else None
parsed.lowercase = int(rawdata[13],16) if rawdata[13] else None
parsed.titlecase = int(rawdata[14],16) if rawdata[14] else None
while bpos < len(blocklist.blocklist) and parsed.code > blocklist.blocklist[bpos].last:
bpos += 1
parsed.block = blocklist.blocklist[bpos].name if bpos < len(blocklist.blocklist) and parsed.code >= blocklist.blocklist[bpos].first else None
self.codelist.append(parsed)
def find_code(self,codepoint):
"""Find the Unicode information for a codepoint (as int).
Returns:
a UnicodeCharacter class object or None.
"""
# the list is unlikely to contain duplicates but I have seen Unicode.org
# doing that in similar situations. Again, better make sure.
val = [x for x in self.codelist if codepoint == x.code]
return val[0] if val else None
def find_char(self,str):
"""Find the Unicode information for a codepoint (as character).
Returns:
for a single character: a UnicodeCharacter class object or
None.
for a multicharacter string: a list of the above, one element
per character.
"""
if len(str) > 1:
result = [self.find_code(ord(x)) for x in str]
return result
else:
return self.find_code(ord(str))
When loaded, you can now look up a character code with
>>> ul = UnicodeList() # ONLY NEEDED ONCE!
>>> print (ul.find_code(0x204))
{LATIN CAPITAL LETTER E WITH DOUBLE GRAVE}
which by default is shown as the name of a character (Unicode calls this a 'code point'), but you can retrieve other properties as well:
>>> print ('%04X' % uc.find_code(0x204).lowercase)
0205
>>> print (ul.lookup(0x204).block)
Latin Extended-B
and (as long as you don't get a None) even chain them:
>>> print (ul.find_code(ul.find_code(0x204).lowercase))
{LATIN SMALL LETTER E WITH DOUBLE GRAVE}
It does not rely on your particular build of Python; you can always download an updated list from unicode.org and be assured to get the most recent information:
import unicodedata
>>> print (unicodedata.name('\U0001F903'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: no such name
>>> print (uclist.find_code(0x1f903))
{LEFT HALF CIRCLE WITH FOUR DOTS}
(As tested with Python 3.5.3.)
There are currently two lookup functions defined:
find_code(int) looks up character information by codepoint as an integer.
find_char(string) looks up character information for the character(s) in string. If there is only one character, it returns a UnicodeCharacter object; if there are more, it returns a list of objects.
After import unicodelist (assuming you saved this as unicodelist.py), you can use
>>> ul = UnicodeList()
>>> hex(ul.find_char(u'è').code)
'0xe8'
to look up the hex code for any character, and a list comprehension such as
>>> l = [hex(ul.find_char(x).code) for x in 'Hello']
>>> l
['0x48', '0x65', '0x6c', '0x6c', '0x6f']
for longer strings. Note that you don't actually need all of this if all you want is a hex representation of a string! This suffices:
l = [hex(ord(x)) for x in 'Hello']
The purpose of this module is to give easy access to other Unicode properties. A longer example:
str = 'Héllo...'
dest = ''
for i in str:
dest += chr(ul.find_char(i).uppercase) if ul.find_char(i).uppercase is not None else i
print (dest)
HÉLLO...
and showing a list of properties for a character per your example:
letter = u'Ȅ'
print ('Name > '+ul.find_char(letter).name)
print ('Unicode number > U+%04x' % ul.find_char(letter).code)
print ('Bloc > '+ul.find_char(letter).block)
print ('Lowercase > %s' % chr(ul.find_char(letter).lowercase))
(I left out HTML; these names are not defined in the Unicode standard.)
The unicodedata documentation shows how to do most of this.
The Unicode block name is apparently not available but another Stack Overflow question has a solution of sorts and another has some additional approaches using regex.
The uppercase/lowercase mapping and character number information is not particularly Unicode-specific; just use the regular Python string functions.
So in summary
>>> import unicodedata
>>> unicodedata.name('Ë')
'LATIN CAPITAL LETTER E WITH DIAERESIS'
>>> 'U+%04X' % ord('Ë')
'U+00CB'
>>> '&#%i;' % ord('Ë')
'Ë'
>>> 'Ë'.lower()
'ë'
The U+%04X formatting is sort-of correct, in that it simply avoids padding and prints the whole hex number for code points with a value higher than 65,535. Note that some other formats require the use of %08X padding in this scenario (notably \U00010000 format in Python).
You can do this in some ways :
1- create an API yourself ( I can't find anything that do this )
2- create table in database or excel file
3- load and parse a website to do that
I think the 3rd way is very easy. take a look as This Page. you can find some information there Unicodes.
Get your Unicode number and then, find it in web page using parse tools like LXML , Scrapy , Selenium , etc
I have this function, that I would like to see if can be done more pythonic.
The function explains it self what it is trying to achieve.
My concern is that I'm using two regex expressions for content and expected that gives room for error creeping in, best would be if these two variables could use the same regex.
input example:
test_names = "tests[\"Status code: \" +responseCode.code] = responseCode.code === 200;\ntests[\"Schema validator GetChecksumReleaseEventForAll\"] = tv4.validate(data, schema);"
def custom_steps(self, test_names):
""" Extracts unique tests from postman collection """
content = re.findall(r'(?<=tests\[")(.*)(?::|"\])', test_names)
expected = re.findall(r'(?<=\] = )(.*)(?::|;)', test_names)
for i, er in enumerate(expected):
if "===" in er:
expected[i] = er[er.find('===')+4:]
else:
expected[i] = "true"
return content, expected
You can match both groups at the same time:
def custom_steps(self, test_names):
regex = 'tests\["(.*)(?::|"\]).* = (.+)(?::|;)'
for match in re.finditer(regex, test_names):
content, expected = match.groups()
if '===' in expected:
expected = expected[expected.index('===') + 4:]
else:
expected = 'true'
yield content, expected
This gives you a generator over pairs of content, expected:
for c, e in custom_steps(None, test_names):
print c, e
Output:
Status code 200
Schema validator GetChecksumReleaseEventForAll true
hey i need to create simple python randomizer. example input:
{{hey|hello|hi}|{privet|zdravstvuy|kak dela}|{bonjour|salut}}, can {you|u} give me advice?
and output should be:
hello, can you give me advice
i have a script, which can do this but only in one nested level
with open('text.txt', 'r') as text:
matches = re.findall('([^{}]+)', text.read())
words = []
for match in matches:
parts = match.split('|')
if parts[0]:
words.append(parts[random.randint(0, len(parts)-1)])
message = ''.join(words)
this is not enough for me )
Python regex does not support nested structures, so you'll have to find some other way to parse the string.
Here's my quick kludge:
def randomize(text):
start= text.find('{')
if start==-1: #if there are no curly braces, there's nothing to randomize
return text
# parse the choices we have
end= start
word_start= start+1
nesting_level= 0
choices= [] # list of |-separated values
while True:
end+= 1
try:
char= text[end]
except IndexError:
break # if there's no matching closing brace, we'll pretend there is.
if char=='{':
nesting_level+= 1
elif char=='}':
if nesting_level==0: # matching closing brace found - stop parsing.
break
nesting_level-= 1
elif char=='|' and nesting_level==0:
# put all text up to this pipe into the list
choices.append(text[word_start:end])
word_start= end+1
# there's no pipe character after the last choice, so we have to add it to the list now
choices.append(text[word_start:end])
# recursively call this function on each choice
choices= [randomize(t) for t in choices]
# return the text up to the opening brace, a randomly chosen string, and
# don't forget to randomize the text after the closing brace
return text[:start] + random.choice(choices) + randomize(text[end+1:])
As I said above, nesting is essentially useless here, but if you want to keep your current syntax, one way to handle it is to replace braces in a loop until there are no more:
import re, random
msg = '{{hey|hello|hi}|{privet|zdravstvuy|kak dela}|{bonjour|salut}}, can {you|u} give me advice?'
while re.search(r'{.*}', msg):
msg = re.sub(
r'{([^{}]*)}',
lambda m: random.choice(m.group(1).split('|')),
msg)
print msg
# zdravstvuy, can u give me advice?
I have a string that I am getting from a command line application. It has the following structure:
-- section1 --
item11|value11
item12|value12
item13
-- section2 --
item21|value21
item22
what I would like is to parse this to a dict so that I can easily access the values with:
d['section1']['item11']
I already solved it for the case when there are no sections and every key has a value but I get errors otherwise. I have tried a couple things but it is getting complicated because and nothing seems to work. This is what I have now:
s="""
item11|value11
item12|value12
item21|value21
"""
d = {}
for l in s.split('\n'):
print(l, l.split('|'))
if l != '':
d[l.split('|')[0]] = l.split('|')[1]
Can somebody help me extend this for the section case and when no values are present?
Seems like a perfect fit for the ConfigParser module in the standard library:
d = ConfigParser(delimiters='|', allow_no_value=True)
d.SECTCRE = re.compile(r"-- *(?P<header>[^]]+?) *--") # sections regex
d.read_string(s)
Now you have an object that you can access like a dictionary:
>>> d['section1']['item11']
'value11'
>>> d['section2']['item22'] # no value case
None
Regexes are a good take at this:
import re
def parse(data):
lines = data.split("\n") #split input into lines
result = {}
current_header = ""
for line in lines:
if line: #if the line isn't empty
#tries to match anything between double dashes:
match = re.match(r"^-- (.*) --$", line)
if match: #true when the above pattern matches
#grabs the part inside parentheses:
current_header = match.group(1)
else:
#key = 1st element, value = 2nd element:
key, value = line.split("|")
#tries to get the section, defaults to empty section:
section = result.get(current_header, {})
section[key] = value #adds data to section
result[current_header] = section #updates section into result
return result #done.
print parse("""
-- section1 --
item1|value1
item2|value2
-- section2 --
item1|valueA
item2|valueB""")