Splitting a string on a given set of delimiters while including them - Python

Let's say I have a Python list of strings.
The strings are tokens of a C++-like language that I have partially tokenized, but I am left with some strings that haven't been tokenized yet. The problem is that the language has a set of symbols that must also appear as separate tokens in the list.
Example:
class Test
{
method int foo(boolean a, int b) { }
}
The output I need is:
tokens = ['class', 'Test', '{', 'method', 'int', 'foo', '(', 'boolean', 'a', ',', 'int', 'b', ')', '{', '}', '}']
The output I currently get, after splitting the code on whitespace:
tokens = ['class', 'Test', '{', 'method', 'int', 'foo(boolean', 'a,', 'int', 'b){', '}', '}']
The code I use works on this partial list, which has already been split on whitespace:
def tokenize(self, tokens):
    """
    Breaks all tokens into final tokens as needed.
    """
    final_tokens = []
    for token in tokens:
        if not have_symbols(token):
            final_tokens.append(token)
        else:
            current_string = ""
            small_tokens = []
            for character in token:
                if character in SYMBOLS_SET:
                    if current_string:
                        small_tokens.append(current_string)
                        current_string = ""
                    small_tokens.append(character)
                else:
                    current_string += character
            if current_string:  # flush any trailing non-symbol characters
                small_tokens.append(current_string)
            final_tokens = final_tokens + small_tokens
    return final_tokens
where SYMBOLS_SET is a set of symbols:
SYMBOLS_SET = {"{", "}", "(", ")", "[", "]", ".", ",", ";", "+", "-", "*", "/", "&", "|", "<", ">", "=", "~"}
and the function have_symbols(token) returns True if token contains a symbol from SYMBOLS_SET and False otherwise.
I think there might be a more elegant way to do this; I would be glad for some guidance.

import re
input = r"""
class Test
{
method int foo(boolean a, int b) { }
}"""
SYMBOLS_SET = {"{", "}", "(", ")", "[", "]", ".", ",", ";", "+", "-", "*", "/", "&", "|", "<", ">", "=", "~"}
regexp = r"\s(" + "".join([re.escape(i) for i in SYMBOLS_SET]) + ")"
splitted = re.split(regexp, input)
tokens = [x for x in splitted if x not in [None, ""]]
print(tokens)
gives you:
['class', 'Test', '{', 'method', 'int', 'foo', '(', 'boolean', 'a', ',', 'int', 'b', ')', '{', '}', '}']
Putting parentheses around the symbols makes them a regexp subgroup, so they appear in the output; the \s (whitespace) alternative is not captured, so whitespace is not included.
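As an alternative minimal sketch (assuming the same SYMBOLS_SET and input as above), re.findall can produce the token list directly by matching either a single symbol or a run of characters that are neither symbols nor whitespace:
import re

SYMBOLS_SET = {"{", "}", "(", ")", "[", "]", ".", ",", ";", "+", "-", "*", "/", "&", "|", "<", ">", "=", "~"}
source = """
class Test
{
method int foo(boolean a, int b) { }
}"""

# One alternation: a single symbol, or a maximal run of non-symbol, non-whitespace characters.
symbol_class = "".join(re.escape(s) for s in SYMBOLS_SET)
pattern = "[{0}]|[^{0}\\s]+".format(symbol_class)
tokens = re.findall(pattern, source)
print(tokens)
# ['class', 'Test', '{', 'method', 'int', 'foo', '(', 'boolean', 'a', ',', 'int', 'b', ')', '{', '}', '}']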

Related

Python fdb firebirdsql.OperationalError: conversion error from string "?"

I use the following bit of code to build an SQL statement from the key/value pairs in the dict:
record_number = 627
temp_dict = {
    "FOO": 752,
    "BAR": "test",
    "I": "zzzzz",
    "Hate": "tesname",
    "SQL": "testsomethingesle",
    "SO": "commentlol",
    "MCUH": "asadsa",
    "FILLING": "zzzzzz",
    "NAME": "''",
}
update_query = (
    "UPDATE table_name SET {}".format(
        ", ".join("{} = '?'".format(k) for k in temp_dict)
    )
    + " WHERE RECNUM = '"
    + str(record_number)
    + "';"
)
update_values = tuple(temp_dict.values())
cur.execute(update_query, update_values)
the update_query prints out correctly
UPDATE table_name SET FOO = '?', BAR = '?', I = '?', Hate = '?', SQL = '?', SO = '?', MCUH = '?', FILLING = '?', NAME = '?' WHERE RECNUM = '627';
and the update_values also looks right
(752, 'test', 'zzzzz', 'tesname', 'testsomethingesle', 'commentlol', 'asadsa', 'zzzzzz', "''")
but I get back the following error
firebirdsql.OperationalError: conversion error from string "?"
My understanding is that ? is basically a placeholder, and if I pass a tuple or list as the second argument to cur.execute(), it should replace each ? with the values passed in.
What am I doing wrong?
You're generating a statement that has string literals with a question mark ('?'), not a question mark used as a parameter placeholder (plain ?). This means that when you execute the statement, you're trying to assign the literal value ? to a column, and if that column is not a CHAR, VARCHAR or BLOB, this produces an error, because there is no valid conversion from the string ? to the other data types.
You need to use "{} = ?" instead (notice the absence of single quotes around the question mark).
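A minimal sketch of the corrected statement building (assuming the same temp_dict, record_number and open cursor cur as above), with RECNUM also passed as a parameter instead of being concatenated into the string:
update_query = (
    "UPDATE table_name SET "
    + ", ".join("{} = ?".format(k) for k in temp_dict)  # plain ?, no surrounding quotes
    + " WHERE RECNUM = ?"
)
update_values = tuple(temp_dict.values()) + (record_number,)
cur.execute(update_query, update_values)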

How would you clean this dictionary in Python?

This is my first attempt at building something non-web that involves logic-heavy coding.
Please take a look at this god-awful dictionary below:
Messy_Dict = {
    'name': "['\\r\\n NASDAQ: BKEP\\r\\n ']",
    'underlying': "['1.12']",
    'strike_prices_list': ["['2.50'", " '5.00'", " '7.50']"],
    'call_bid': ["['\\r\\n0.05 '", " '\\r\\n0.00 '", " '\\r\\n0.00 ']"],
    'put_ask': ["['\\r\\n2.10 '", " '\\r\\n4.50 '", " '\\r\\n7.00 ']"]
}
What I want to do is clean up the unnecessary sub-strings within each dictionary value to get something like this:
Clean_Dict = {
    'name': "BKEP",
    'underlying': "1.12",
    'strike_prices_list': ["2.50", "5.00", "7.50"],
    'call_bid': ["0.05", "0.00", "0.00"],
    'put_ask': ["2.10", "4.50", "7.00"]
}
I have managed to get from Messy_Dict to Clean_Dict but I used very barbaric means to do so. I will just say that it included a for loop and multiple strip(), replace('', '') methods. And it pains me to look at that block of code in my .py file.
So I guess my question is: is there a more elegant way of converting Messy_Dict to Clean_Dict? I feel as if I'm missing something in my fundamentals here.
Edit
def parse(self, response):
    strike_prices_main = response.css('.highlight , .aright .strike-col').css('::text').extract()
    if not strike_prices_main:
        pass
    else:
        name = response.css('#instrumentticker::text').extract()
        strike_prices_list = response.css('.aright .strike-col').css('::text').extract()
        call_bid = response.css('.aright td:nth-child(5)').css('::text').extract()
        put_ask = response.css('.aright td:nth-child(14)').css('::text').extract()
        underlying = response.css('.pricewrap .bgLast').css('::text').extract()
        file.write('%s|%s|%s|%s|%s\n' % (name, underlying, strike_prices_list, call_bid, put_ask))
Using spiders to crawl!
Maybe like this:
import re

Messy_Dict = {
    'name': "['\\r\\n NASDAQ: BKEP\\r\\n ']",
    'underlying': "['1.12']",
    'strike_prices_list': ["['2.50'", " '5.00'", " '7.50']"],
    'call_bid': ["['\\r\\n0.05 '", " '\\r\\n0.00 '", " '\\r\\n0.00 ']"],
    'put_ask': ["['\\r\\n2.10 '", " '\\r\\n4.50 '", " '\\r\\n7.00 ']"]
}

# Strip literal \r / \n sequences, whitespace, brackets, quotes and the "NASDAQ:" prefix.
regexstr = r"\\(r|n)|\s|\[|\]|\'|NASDAQ:"
dict_clean = {}
for k, v in Messy_Dict.items():
    if isinstance(v, list):
        list_clean = []
        for el in v:
            el_clean = re.sub(regexstr, "", el)
            list_clean.append(el_clean)
        dict_clean[k] = list_clean
    else:
        dict_clean[k] = re.sub(regexstr, "", v)
print(dict_clean)
You can use regular expressions.
Example:
import re

messy_dict = {
    'name': "['\\r\\n NASDAQ: BKEP\\r\\n ']",
    'underlying': "['1.12']",
    'strike_prices_list': ["['2.50'", " '5.00'", " '7.50']"],
    'call_bid': ["['\\r\\n0.05 '", " '\\r\\n0.00 '", " '\\r\\n0.00 ']"],
    'put_ask': ["['\\r\\n2.10 '", " '\\r\\n4.50 '", " '\\r\\n7.00 ']"]
}

# Keep only digits and dots; everything else is stripped.
stripfunc = lambda x: re.sub(r'[^\d\.]', '', str(x))
for key in messy_dict:
    if type(messy_dict[key]) is list:
        messy_dict[key] = [stripfunc(x) for x in messy_dict[key]]
    else:
        messy_dict[key] = stripfunc(messy_dict[key])
print(messy_dict)
Explanation: [^ ] matches anything that is NOT in the set, \d matches digits, and the backslash escapes the dot. str(x) is used so the substitution works even when a value isn't already a string.
Output: {'name': '', 'underlying': '1.12', 'strike_prices_list': ['2.50', '5.00', '7.50'], 'call_bid': ['0.05', '0.00', '0.00'], 'put_ask': ['2.10', '4.50', '7.00']}
Edit: just noticed that you also want to keep the dot. Updated the code.
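One caveat visible in the output above: keeping only digits and dots leaves 'name' empty. A small self-contained sketch (assuming the ticker is always the last run of capital letters in the raw name string) that extracts it instead:
import re

raw_name = "['\\r\\n NASDAQ: BKEP\\r\\n ']"
# All runs of capital letters are ['NASDAQ', 'BKEP']; the ticker is assumed to be the last one.
ticker = re.findall(r'[A-Z]+', raw_name)[-1]
print(ticker)  # BKEP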

Converting string to valid wordpress url in python 3 to be used by requests library

Given the string https://websiteurl/path/photo's url.jpeg,
how can I convert it to the legal URL https://websiteurl/path/photos-url.jpeg (from WordPress's point of view) using Python 3.5? This URL is going to be used in a POST request with JSON, where src will be the key and the legal URL above will be the value.
When the photo was uploaded, the URL https://websiteurl/path/photos-url.jpeg was assigned to it (the ' was removed and the space converted to -).
The only way I see is using "https://websiteurl/path/photo's url.jpeg".replace("'", "").replace(" ", "-").
Is there a more generic, Pythonic way?
You could use str.replace or re.sub. Example:
import re

special_chars = ["?", "[", "]", "/", "\\", "=", "<", ">", ":", ";", ",", "'", "\"", "&", "$", "#", "*", "(", ")", "|", "~", "`", "!", "{", "}", "%", "+"]
uri = "photo's url.jpeg"

# use str.replace
for i in special_chars:
    uri = uri.replace(i, "")

# or re.sub with a character class built from the same list
#uri = re.sub("[" + re.escape("".join(special_chars)) + "]", "", uri)

uri = re.sub(r"\s+", "-", uri)
print(uri)
This will change photo's url.jpeg into photos-url.jpeg.
Take a look at how wordpress is doing it in php here: https://core.trac.wordpress.org/browser/tags/4.7.3/src/wp-includes/formatting.php#L1761
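Note that applying the snippet above to the full URL would also strip the :// after the scheme and the / separators in the path. A minimal sketch (using urllib.parse and assuming only the last path segment needs cleaning):
import re
from urllib.parse import urlsplit, urlunsplit

def clean_segment(segment):
    # Drop apostrophes and other punctuation, then turn whitespace runs into dashes.
    segment = re.sub(r"[?\[\]\\=<>:;,'\"&$#*()|~`!{}%+]", "", segment)
    return re.sub(r"\s+", "-", segment)

url = "https://websiteurl/path/photo's url.jpeg"
parts = urlsplit(url)
segments = parts.path.split("/")
segments[-1] = clean_segment(segments[-1])
print(urlunsplit(parts._replace(path="/".join(segments))))
# https://websiteurl/path/photos-url.jpeg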

List of dicts: how to fill in the information later

I have a list of dicts like this:
mylist = [
    {
        'name': None,
        'Id': sys.argv[1]
    }, {
        'name': None,
        'Id': sys.argv[2]
    }, {
        'name': None,
        'Id': sys.argv[3]
    }
]
I later invoke a subprocess and process its output; I want to put the output into the 'name' value field. After I invoke the command I end up with a list of all output lines, and I read the lines like this:
for line in content:
    if line.startswith('some_identifier'):
        line.strip('\n')
        # put the line into an unused 'name' value field
Later I want to generate a login command that is run by the OS, like so:
for info in mylist:
    subprocess.check_output(['iscsicli.exe', 'LoginTarget', info['name'], 'T', portalip, portalport, '*', info['Id'], '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*'])
So what I want to do is put the line read in the second code snippet into an unused 'name' slot in mylist:
for line in content:
    if line.startswith('some_identifier'):
        line = line.strip('\n')  # strip() returns a new string; reassign it
        # put the line into the first unused 'name' value field
        for entry in mylist:
            if entry['name'] is None:
                entry['name'] = line
                break
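An alternative sketch (assuming content and mylist as above) that avoids rescanning the list for every line, by pairing the matching lines with the dicts via zip:
# Generator of cleaned matching lines, paired positionally with the dicts;
# zip stops at the shorter of the two, so extra lines or extra dicts are left alone.
matches = (line.strip('\n') for line in content if line.startswith('some_identifier'))
for entry, name in zip(mylist, matches):
    entry['name'] = name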

Possible Parser for Unknown String Format(soup?) from SUDS.client

I am using the suds package to query an API from a website. The data returned from their website looks like this:
(1) Can anyone tell me what kind of format this is?
(2) If so, what would be the easiest way to parse data that looks like this? I have dealt quite a lot with HTML/XML using BeautifulSoup, but before I lift a finger to write regular expressions for this type of format, I am curious whether this is some popular format for which a nice parser has already been written. Thanks.
# Below are the header and tail of the response..
(DetailResult)
{
status = (Status){ message = None code = "0" }
searchArgument = (DetailSearchArgument){ reqPartNumber = "BQ" reqMfg = "T" reqCpn = None }
detailsDto[] = (DetailsDto){
summaryDto = (SummaryDto){ PartNumber = "BQ" seMfg = "T" description = "Fast" }
packageDto[] =
(PackageDto){ fetName = "a" fetValue = "b" },
(PackageDto){ fetName = "c" fetValue = "d" },
(PackageDto){ fetName = "d" fetValue = "z" },
(PackageDto){ fetName = "f" fetValue = "Sq" },
(PackageDto){ fetName = "g" fetValue = "p" },
additionalDetailsDto = (AdditionalDetailsDto){ cr = None pOptions = None inv = None pcns = None }
partImageDto = None
riskDto = (RiskDto){ life= "Low" lStage = "Mature" yteol = "10" Date = "2023"}
partOptionsDto[] = (ReplacementDto){ partNumber = "BQ2" manufacturer = "T" type = "Reel" },
inventoryDto[] =
(InventoryDto){ distributor = "V" quantity = "88" buyNowLink = "https://www..." },
(InventoryDto){ distributor = "R" quantity = "7" buyNowLink = "http://www.r." },
(InventoryDto){ distributor = "RS" quantity = "2" buyNowLink = "http://www.rs.." },
},
}
This looks like some kind of nested repr output, similar to JSON but with structure or object name information ("a Status contains a message and a code"). If it's nested, regexes alone won't do the job. Here is a rough pass at a pyparsing parser:
sample = """
... given sample text ...
"""
from pyparsing import *
# punctuation
LPAR,RPAR,LBRACE,RBRACE,LBRACK,RBRACK,COMMA,EQ = map(Suppress,"(){}[],=")
identifier = Word(alphas,alphanums+"_")
# define some types that can get converted to Python types
# (parse actions will do conversion at parse time)
NONE = Keyword("None").setParseAction(replaceWith(None))
integer = Word(nums).setParseAction(lambda t:int(t[0]))
quotedString.setParseAction(removeQuotes)
# define a placeholder for a nested object definition (since objDefn
# will be referenced within its own definition)
objDefn = Forward()
objType = Combine(LPAR + identifier + RPAR)
objval = quotedString | NONE | integer | Group(objDefn)
objattr = Group(identifier + EQ + objval)
arrayattr = Group(identifier + LBRACK + RBRACK + EQ + Group(OneOrMore(Group(objDefn)+COMMA)) )
# use '<<' operator to assign content to previously declared Forward
objDefn << objType + LBRACE + ZeroOrMore((arrayattr | objattr) + Optional(COMMA)) + RBRACE
# parse sample text
result = objDefn.parseString(sample)
# use pprint to list out indented parsed data
import pprint
pprint.pprint(result.asList())
Prints:
['DetailResult',
['status', ['Status', ['message', None], ['code', '0']]],
['searchArgument',
['DetailSearchArgument',
['reqPartNumber', 'BQ'],
['reqMfg', 'T'],
['reqCpn', None]]],
['detailsDto',
[['DetailsDto',
['summaryDto',
['SummaryDto',
['PartNumber', 'BQ'],
['seMfg', 'T'],
['description', 'Fast']]],
['packageDto',
[['PackageDto', ['fetName', 'a'], ['fetValue', 'b']],
['PackageDto', ['fetName', 'c'], ['fetValue', 'd']],
['PackageDto', ['fetName', 'd'], ['fetValue', 'z']],
['PackageDto', ['fetName', 'f'], ['fetValue', 'Sq']],
['PackageDto', ['fetName', 'g'], ['fetValue', 'p']]]],
['additionalDetailsDto',
['AdditionalDetailsDto',
['cr', None],
['pOptions', None],
['inv', None],
['pcns', None]]],
['partImageDto', None],
['riskDto',
['RiskDto',
['life', 'Low'],
['lStage', 'Mature'],
['yteol', '10'],
['Date', '2023']]],
['partOptionsDto',
[['ReplacementDto',
['partNumber', 'BQ2'],
['manufacturer', 'T'],
['type', 'Reel']]]],
['inventoryDto',
[['InventoryDto',
['distributor', 'V'],
['quantity', '88'],
['buyNowLink', 'https://www...']],
['InventoryDto',
['distributor', 'R'],
['quantity', '7'],
['buyNowLink', 'http://www.r.']],
['InventoryDto',
['distributor', 'RS'],
['quantity', '2'],
['buyNowLink', 'http://www.rs..']]]]]]]]
