Possible Parser for Unknown String Format(soup?) from SUDS.client - python

I am using the suds package to query an API from a website, and the data returned looks like the sample below.
(1) Can anyone tell me what kind of format this is?
(2) If so, what is the easiest way to parse data that looks like this? I have dealt quite a lot with HTML/XML using BeautifulSoup, but before I lift a finger to write regular expressions for this format, I am curious whether it is some kind of 'popular format' for which a good parser has already been written. Thanks.
# Below are the header and tail of the response..
(DetailResult)
{
status = (Status){ message = None code = "0" }
searchArgument = (DetailSearchArgument){ reqPartNumber = "BQ" reqMfg = "T" reqCpn = None }
detailsDto[] = (DetailsDto){
summaryDto = (SummaryDto){ PartNumber = "BQ" seMfg = "T" description = "Fast" }
packageDto[] =
(PackageDto){ fetName = "a" fetValue = "b" },
(PackageDto){ fetName = "c" fetValue = "d" },
(PackageDto){ fetName = "d" fetValue = "z" },
(PackageDto){ fetName = "f" fetValue = "Sq" },
(PackageDto){ fetName = "g" fetValue = "p" },
additionalDetailsDto = (AdditionalDetailsDto){ cr = None pOptions = None inv = None pcns = None }
partImageDto = None
riskDto = (RiskDto){ life= "Low" lStage = "Mature" yteol = "10" Date = "2023"}
partOptionsDto[] = (ReplacementDto){ partNumber = "BQ2" manufacturer = "T" type = "Reel" },
inventoryDto[] =
(InventoryDto){ distributor = "V" quantity = "88" buyNowLink = "https://www..." },
(InventoryDto){ distributor = "R" quantity = "7" buyNowLink = "http://www.r." },
(InventoryDto){ distributor = "RS" quantity = "2" buyNowLink = "http://www.rs.." },
},
}

This looks like some kind of nested repr output, similar to JSON but with
structure or object name information ("a Status contains a message and a code").
If it's nested, regexes alone won't do the job. Here is a rough pass at a pyparsing
parser:
sample = """
... given sample text ...
"""
from pyparsing import *
# punctuation
LPAR,RPAR,LBRACE,RBRACE,LBRACK,RBRACK,COMMA,EQ = map(Suppress,"(){}[],=")
identifier = Word(alphas,alphanums+"_")
# define some types that can get converted to Python types
# (parse actions will do conversion at parse time)
NONE = Keyword("None").setParseAction(replaceWith(None))
integer = Word(nums).setParseAction(lambda t:int(t[0]))
quotedString.setParseAction(removeQuotes)
# define a placeholder for a nested object definition (since objDefn
# will be referenced within its own definition)
objDefn = Forward()
objType = Combine(LPAR + identifier + RPAR)
objval = quotedString | NONE | integer | Group(objDefn)
objattr = Group(identifier + EQ + objval)
arrayattr = Group(identifier + LBRACK + RBRACK + EQ + Group(OneOrMore(Group(objDefn)+COMMA)) )
# use '<<' operator to assign content to previously declared Forward
objDefn << objType + LBRACE + ZeroOrMore((arrayattr | objattr) + Optional(COMMA)) + RBRACE
# parse sample text
result = objDefn.parseString(sample)
# use pprint to list out indented parsed data
import pprint
pprint.pprint(result.asList())
Prints:
['DetailResult',
['status', ['Status', ['message', None], ['code', '0']]],
['searchArgument',
['DetailSearchArgument',
['reqPartNumber', 'BQ'],
['reqMfg', 'T'],
['reqCpn', None]]],
['detailsDto',
[['DetailsDto',
['summaryDto',
['SummaryDto',
['PartNumber', 'BQ'],
['seMfg', 'T'],
['description', 'Fast']]],
['packageDto',
[['PackageDto', ['fetName', 'a'], ['fetValue', 'b']],
['PackageDto', ['fetName', 'c'], ['fetValue', 'd']],
['PackageDto', ['fetName', 'd'], ['fetValue', 'z']],
['PackageDto', ['fetName', 'f'], ['fetValue', 'Sq']],
['PackageDto', ['fetName', 'g'], ['fetValue', 'p']]]],
['additionalDetailsDto',
['AdditionalDetailsDto',
['cr', None],
['pOptions', None],
['inv', None],
['pcns', None]]],
['partImageDto', None],
['riskDto',
['RiskDto',
['life', 'Low'],
['lStage', 'Mature'],
['yteol', '10'],
['Date', '2023']]],
['partOptionsDto',
[['ReplacementDto',
['partNumber', 'BQ2'],
['manufacturer', 'T'],
['type', 'Reel']]]],
['inventoryDto',
[['InventoryDto',
['distributor', 'V'],
['quantity', '88'],
['buyNowLink', 'https://www...']],
['InventoryDto',
['distributor', 'R'],
['quantity', '7'],
['buyNowLink', 'http://www.r.']],
['InventoryDto',
['distributor', 'RS'],
['quantity', '2'],
['buyNowLink', 'http://www.rs..']]]]]]]]
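The nested list above keeps every attribute as a [name, value] pair, which is awkward to index. Below is a minimal sketch of one way to fold that shape into plain dicts and lists. The helper names obj_to_dict and convert are made up for illustration, the input is a shortened copy of the printed result, and the sketch assumes exactly the shape shown above (an object is a list that starts with its type name, an array is a list of such objects). Type names are dropped.

```python
def obj_to_dict(obj):
    """Turn ['TypeName', [name, value], ...] into {name: value, ...}."""
    _type_name, *attrs = obj
    return {name: convert(value) for name, value in attrs}


def convert(value):
    """Convert one attribute value; scalars (str, int, None) pass through."""
    if isinstance(value, list):
        # A list whose first item is itself a list is an array of objects.
        if value and isinstance(value[0], list):
            return [obj_to_dict(item) for item in value]
        # Otherwise it is a single nested object.
        return obj_to_dict(value)
    return value


# Shortened copy of the parsed structure printed above.
parsed = ['DetailResult',
          ['status', ['Status', ['message', None], ['code', '0']]],
          ['packageDto',
           [['PackageDto', ['fetName', 'a'], ['fetValue', 'b']],
            ['PackageDto', ['fetName', 'c'], ['fetValue', 'd']]]]]

data = obj_to_dict(parsed)
print(data['status']['code'])             # 0
print(data['packageDto'][1]['fetValue'])  # d
```

With the full parse result, the same call would be obj_to_dict(result.asList()).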

Related

python: custom pandas.DataFrame to dictionary function: some entries are lost

I want to read a .xlsx file, do some things with the data and convert it to a dict to save it in a .json file. To do that I use Python3 and pandas.
This is the code:
import pandas as pd
import json
xls = pd.read_excel(
    io = "20codmun.xlsx",
    converters = {
        "CODAUTO" : str,
        "CPRO" : str,
        "CMUN" : str,
        "DC" : str
    }
)
print(xls)
#print(xls.columns.values)
outDict = {}
print(len(xls["NOMBRE"])) # 8131 rows
for i in range(len(xls.index)):
    codauto = xls["CODAUTO"][i]
    cpro = xls["CPRO"][i]
    cmun = xls["CMUN"][i]
    dc = xls["DC"][i]
    aemetId = cpro + cmun
    outDict[xls["NOMBRE"][i]] = {
        "CODAUTO" : codauto,
        "CPRO" : cpro,
        "CMUN" : cmun,
        "DC" : dc,
        "AEMET_ID" : aemetId
    }
print(i) # 8130
print(len(outDict)) # 8114 entries, SOME ENTRIES ARE LOST!!!!!
#print(outDict["Petrer"])
with open("data.json", "w") as outFile:
    json.dump(outDict, outFile)
I add here the source of the .xlsx file (Spanish government). Select "Fichero con todas las provincias". You have to delete the first row.
As you can see, the pandas.DataFrame has 8131 rows, the for index at the end is 8130, but the length of the final dict is 8114, so some data is lost!
You can check that "Aljucén" is on the .xlsx file, but not in the .json one. Edit: Solved using json.dump(outDict, outFile, ensure_ascii=False)
I have analyzed the file and it seems that some "NOMBRE" values are duplicated. Try executing xls["NOMBRE"].value_counts() and you will see that, for example, "Sada" appears twice. You will also see that there are exactly 8114 unique values.
Because you are using the city name as the dictionary key, every time a name repeats you overwrite the entry stored earlier under that key.
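The overwrite can be reproduced without pandas. The following is a small sketch with a made-up list of names (not taken from the real file) that shows the dict ending up shorter than the input and uses collections.Counter to list the names responsible.

```python
from collections import Counter

# Made-up sample; "Sada" appears twice on purpose.
names = ["Sada", "Petrer", "Sada", "Aljucén"]

# Using the name as key: the second "Sada" replaces the first.
out = {name: position for position, name in enumerate(names)}
print(len(names), len(out))  # 4 3

# Names that occur more than once are the ones that get overwritten.
duplicates = [name for name, count in Counter(names).items() if count > 1]
print(duplicates)  # ['Sada']
```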
I agree with gontxomde: if the column "NOMBRE" contains values that are not unique, each repeated value overwrites the entry already stored under that key in the new dictionary.
To make a proof of concept I made a minimal example based on your approach:
import pandas as pd
feature_str = ['a', 'b', 'c']
df = pd.DataFrame({"NOMBRE": [1, 1, 3],
                   "CODAUTO": feature_str,
                   "CPRO" : feature_str,
                   "CMUN" : feature_str,
                   "DC" : feature_str
                   })
outDict = {}
print(len(df["NOMBRE"])) # 3 rows
for i in range(len(df.index)):
    codauto = df["CODAUTO"][i]
    cpro = df["CPRO"][i]
    cmun = df["CMUN"][i]
    dc = df["DC"][i]
    aemetId = cpro + cmun
    outDict[df["NOMBRE"][i]] = {
        "CODAUTO" : codauto,
        "CPRO" : cpro,
        "CMUN" : cmun,
        "DC" : dc,
        "AEMET_ID" : aemetId
    }
print(outDict)
Which yields:
{1: {'CODAUTO': 'b', 'CPRO': 'b', 'CMUN': 'b', 'DC': 'b', 'AEMET_ID': 'bb'},
3: {'CODAUTO': 'c', 'CPRO': 'c', 'CMUN': 'c', 'DC': 'c', 'AEMET_ID': 'cc'}}
If I could suggest, instead of iterating over the index of the DataFrame, it would be better to use DataFrame methods:
df.set_index("NOMBRE") \
.to_dict(orient='index')
If you use this on a dataset with unique values in NOMBRE, it yields the same result as the loop you wrote (apart from the derived AEMET_ID entry, which you would have to add as a column first). Additionally, if there are duplicates, it raises a ValueError:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [15], in <module>
----> 1 df.set_index("NOMBRE").to_dict(orient='index')
File ~/.pyenv/versions/3.8.7/envs/jupyter/lib/python3.8/site-packages/pandas/core/frame.py:1607, in DataFrame.to_dict(self, orient, into)
1605 elif orient == "index":
1606 if not self.index.is_unique:
-> 1607 raise ValueError("DataFrame index must be unique for orient='index'.")
1608 return into_c(
1609 (t[0], dict(zip(self.columns, t[1:])))
1610 for t in self.itertuples(name=None)
1611 )
1613 else:
ValueError: DataFrame index must be unique for orient='index'.
If you have duplicated values in xls["NOMBRE"], each new duplicate will overwrite the previous one. So you need to choose a strategy for dealing with duplicates: do you want separate entries, like Sada and Sada(2)? Or do you want a single key Sada holding the data from all the duplicates?
For the first example:
for i in range(len(xls.index)):
    name = xls["NOMBRE"][i]
    cpro = xls["CPRO"][i]
    cmun = xls["CMUN"][i]
    entry = {
        "CODAUTO" : xls["CODAUTO"][i],
        "CPRO" : cpro,
        "CMUN" : cmun,
        "DC" : xls["DC"][i],
        "AEMET_ID" : cpro + cmun
    }
    # if it's the first time the value appears, just do the "normal" thing
    if name not in outDict:
        outDict[name] = entry
    # if the value was read before, add number of duplicate after the name
    else:
        # find the first free suffix: Sada(2), Sada(3), ...
        n = 2
        while name + '(' + str(n) + ')' in outDict:
            n += 1
        outDict[name + '(' + str(n) + ')'] = entry
For the second case, there're good solutions here and here.
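For the second strategy (one key holding the data of every duplicate), a minimal sketch could look like the following. The function name group_by_name and the sample values are made up for illustration. It works on a list of row dicts, which is the shape xls.to_dict(orient="records") would give you, so the sketch itself does not need pandas.

```python
from collections import defaultdict


def group_by_name(rows, key="NOMBRE"):
    """Collect all rows that share the same name into one list per name."""
    grouped = defaultdict(list)
    for row in rows:
        # Store everything except the name itself, which becomes the dict key.
        entry = {k: v for k, v in row.items() if k != key}
        grouped[row[key]].append(entry)
    return dict(grouped)


# Made-up sample rows; with pandas: rows = xls.to_dict(orient="records")
rows = [
    {"NOMBRE": "Sada", "CPRO": "01", "CMUN": "001"},
    {"NOMBRE": "Sada", "CPRO": "02", "CMUN": "002"},
    {"NOMBRE": "Petrer", "CPRO": "03", "CMUN": "003"},
]

out = group_by_name(rows)
print(len(out["Sada"]))   # 2
print(out["Petrer"])      # [{'CPRO': '03', 'CMUN': '003'}]
```

Every value is a list, even for names that appear once, so the consumer of the JSON file does not have to handle two different shapes.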

How to make a nested dictionary based on a list of URLs?

I have this list of hierarchical URLs:
data = ["https://python-rq.org/","https://python-rq.org/a","https://python-rq.org/a/b","https://python-rq.org/c"]
And I want to dynamically make a nested dictionary for every URL for which there exists another URL that is a subdomain/subfolder of it.
I already tried the following, but it is not returning what I expect:
import re
result = []
for key,d in enumerate(data):
    form_dict = {}
    r_pattern = re.search(r"(http(s)?://(.*?)/)(.*)",d)
    r = r_pattern.group(4)
    if r == "":
        parent_url = r_pattern.group(3)
    else:
        parent_url = r_pattern.group(3) + "/"+r
    print(parent_url)
    temp_list = data.copy()
    temp_list.pop(key)
    form_dict["name"] = parent_url
    form_dict["children"] = []
    for t in temp_list:
        child_dict = {}
        if parent_url in t:
            child_dict["name"] = t
            form_dict["children"].append(child_dict.copy())
    result.append(form_dict)
This is the expected output.
{
    "name":"https://python-rq.org/",
    "children":[
        {
            "name":"https://python-rq.org/a",
            "children":[
                {
                    "name":"https://python-rq.org/a/b",
                    "children":[
                    ]
                }
            ]
        },
        {
            "name":"https://python-rq.org/c",
            "children":[
            ]
        }
    ]
}
Any advice?
This was a nice problem. I tried going on with your regex method but got stuck and found out that split was actually appropriate for this case. The following works:
data = ["https://python-rq.org/","https://python-rq.org/a","https://python-rq.org/a/b","https://python-rq.org/c"]
temp_list = data.copy()
# This removes the last "/" if any URL ends with one. It makes it a lot easier
# to match the URLs and is not necessary to have a correct link.
data = [x[:-1] if x[-1]=="/" else x for x in data]
print(data)
result = []
# To find a matching parent
def find_match(d, res):
    for t in res:
        if d == t["name"]:
            return t
        elif ( len(t["children"])>0 ):
            temp = find_match(d, t["children"])
            if (temp):
                return temp
    return None
while len(data) > 0:
    d = data[0]
    form_dict = {}
    l = d.split("/")
    # I removed regex as matching the last parentheses wasn't working out
    # split does just what you need however
    parent = "/".join(l[:-1])
    data.pop(0)
    form_dict["name"] = d
    form_dict["children"] = []
    option = find_match(parent, result)
    if (option):
        option["children"].append(form_dict)
    else:
        result.append(form_dict)
print(result)
[{'name': 'https://python-rq.org', 'children': [{'name': 'https://python-rq.org/a', 'children': [{'name': 'https://python-rq.org/a/b', 'children': []}]}, {'name': 'https://python-rq.org/c', 'children': []}]}]
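As a variation on the same split-based idea, the recursive search for the parent can be replaced by a dict that maps each URL to its node. This is only a sketch: the function name build_tree is made up, and it assumes, like the answer above, that a URL's parent is simply the URL with its last path segment removed. Sorting first makes sure a parent is always seen before its children.

```python
def build_tree(urls):
    """Nest URLs under their parent URL; return the list of top-level nodes."""
    nodes = {}   # URL -> node dict, for direct parent lookup
    roots = []
    # A parent is a string prefix of its children, so it sorts before them.
    for url in sorted(u.rstrip("/") for u in urls):
        node = {"name": url, "children": []}
        nodes[url] = node
        parent = url.rsplit("/", 1)[0]   # drop the last path segment
        if parent in nodes:
            nodes[parent]["children"].append(node)
        else:
            roots.append(node)
    return roots


data = ["https://python-rq.org/", "https://python-rq.org/a",
        "https://python-rq.org/a/b", "https://python-rq.org/c"]
tree = build_tree(data)
print(tree[0]["name"])                           # https://python-rq.org
print([c["name"] for c in tree[0]["children"]])
```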

How to write Python method with dictionary choices?

I am writing a simple API manager and I have a problem with using a dictionary in a method. Here is what I wrote so far:
class BnManager():
    def __init__(self, api_key, api_secret):
        self.api_key = api_key
        self.api_secret = api_secret
        self.client = Client(api_key, api_secret)
    def get_candles(self, symbol, interval):
        self.symbol = symbol
        self.interval = interval
        choice = {
            '1m' : Client.KLINE_INTERVAL_1MINUTE,
            '3m' : Client.KLINE_INTERVAL_3MINUTE,
            '5m' : Client.KLINE_INTERVAL_5MINUTE,
            '15m' : Client.KLINE_INTERVAL_15MINUTE,
            '30m' : Client.KLINE_INTERVAL_30MINUTE,
            '1h' : Client.KLINE_INTERVAL_1HOUR,
            '2h' : Client.KLINE_INTERVAL_2HOUR,
            '4h' : Client.KLINE_INTERVAL_4HOUR,
            '6h' : Client.KLINE_INTERVAL_6HOUR,
            '8h' : Client.KLINE_INTERVAL_8HOUR,
            '12h' : Client.KLINE_INTERVAL_12HOUR,
            '1d' : Client.KLINE_INTERVAL_1DAY,
            '3d' : Client.KLINE_INTERVAL_3DAY,
            '1w' : Client.KLINE_INTERVAL_1WEEK,
            # distinct key: a second '1m' would silently replace the 1-minute entry
            '1M' : Client.KLINE_INTERVAL_1MONTH,
        }
        self.klines = self.client.get_klines(
            self.symbol, choice[self.interval])
        self.df = pd.DataFrame(self.klines, columns=[
            'Date', 'Open', 'High', 'Low', 'Close', 'Volume',
            'x', 'x1', 'x2', 'x3', 'x4', 'x5'])
        # drop(..., inplace=True) returns None, so reassign instead of chaining
        self.df = self.df.drop(labels=['x', 'x1', 'x2', 'x3', 'x4', 'x5'],
                               axis=1).astype(float)
        self.df['Date'] = date2num(pd.to_datetime(self.df.Date, unit='ms'))
        self.df['Change'] = self.df['Close'].diff()
The problem appears when I try to execute the get_candles method.
For example, when I write manager.get_candles('BTCUSDT', '1m') I get:
self.symbol, choice[self.interval] TypeError: get_candles() takes 1
positional argument but 3 were given
I know this is probably a trivial question, but I really do not see where the problem is. And my second question: how can I write it without using a dict? I mean I would like to achieve something like:
self.klines = self.client.get_klines(
self.symbol, Client.KLINE_INTERVAL_+interval)
For the rewriting question you could look into Programmatic access to enumeration members and their attributes. Basically the documentation explains that you can use strings as keys for Enums.
from enum import Enum
class Color(Enum):
    RED = 1
    GREEN = 2
    BLUE = 3
print(Color['RED']) # output: Color.RED
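Lookup by name with square brackets works for Enum classes. If the KLINE_INTERVAL_... constants are plain class attributes rather than Enum members (an assumption, since the Client class is not shown in the question), the closest equivalent is getattr with a name built from a string. The sketch below uses a hypothetical stand-in class FakeClient with placeholder values, and the helper name interval_constant is made up as well.

```python
class FakeClient:
    """Hypothetical stand-in for the real Client; values are placeholders."""
    KLINE_INTERVAL_1MINUTE = "one-minute"
    KLINE_INTERVAL_1HOUR = "one-hour"


def interval_constant(client_cls, suffix):
    """Return client_cls.KLINE_INTERVAL_<suffix>, e.g. suffix='1HOUR'."""
    name = "KLINE_INTERVAL_" + suffix
    try:
        return getattr(client_cls, name)
    except AttributeError:
        # Turn a typo in the suffix into a clearer error for the caller.
        raise ValueError(f"unknown interval: {suffix!r}") from None


print(interval_constant(FakeClient, "1HOUR"))  # one-hour
```

An explicit dict like the one in the question is still the more readable choice when the user-facing keys ('1h') differ from the attribute suffixes ('1HOUR').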

append key's value on same key

This is what I currently have
code
coll = con['X']['Y']
s = "meta http equiv"
m = {'i': s}
n = json.dumps(m)
o = json.loads(n)
coll.insert(o)
data
{
"_id" : ObjectId("58527fe656c7a95cfaf40a15"),
"i" : "meta http equiv"
}
Now in the next iteration, s will change(as per my computations) and I want to append the value of s to same key
let's say in next iteration s becomes sample test data and on same key i
So I want this
{
"_id" : ObjectId("58527fe656c7a95cfaf40a15"),
"i" : "meta http equiv sample test data and"
}
How to achieve this?
Change the way you have formed s:
s = "meta http equiv"
s = (coll.get('i', '') + ' ' + s) if coll.get('i', '') else s
If coll isn't a dict object use getattr instead:
s = "meta http equiv"
s = (getattr(coll, 'i', '') + ' ' + s) if getattr(coll, 'i', '') else s
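The concatenation rule from the lines above can be tried out on a plain dict standing in for the stored document. This is only a sketch: the helper name append_text is made up, and how the document is fetched from and written back to the database is deliberately left out.

```python
def append_text(doc, key, new_text, sep=" "):
    """Append new_text to doc[key], or set it if the key is missing or empty."""
    current = doc.get(key, "")
    doc[key] = current + sep + new_text if current else new_text
    return doc


doc = {"i": "meta http equiv"}     # stand-in for the stored document
append_text(doc, "i", "sample test data")
print(doc["i"])  # meta http equiv sample test data
```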

Python utility for parsing blocks?

I have a file that starts something like:
databaseCons = {
    main = {
        database = "readable_name",
        hostname = "hostdb1.serv.com",
        instances= {
            slaves = {
                conns = "8"
            }
        }
        maxconns = "5",
        user = "user",
        pass = "pass"
    }
}
So, what I'd like to do is parse this out into a dict of sub-dicts, something like:
{'main': {'database': 'readable_name', 'hostname': 'hostdb1.serv.com', 'maxconns': '5', 'instances': {'slave': {'maxCount': '8'}}, 'user': 'user', 'pass': 'pass'}}
I think the above makes sense... but please feel free to edit this if it doesn't. Basically I want the equivalent of:
conns = '8'
slave = dict()
slave['maxCount'] = conns
instances = dict()
instances['slave'] = slave
database = 'readable_name'
hostname = 'hostdb1.serv.com'
maxconns = '5'
user = 'user'
pas = 'pass'
main = dict()
main['database'] = database
main['hostname'] = hostname
main['instances'] = instances
main['maxconns'] = maxconns
main['user'] = user
main['pass'] = pas
databaseCons = dict()
databaseCons['main'] = main
Are there any modules out there that can handle this sort of parsing? Even what I've suggested above looks messy.. there's got to be a better way I'd imagine.
Here is a pyparsing parser for your config file:
from pyparsing import *
def to_dict(t):
    return {k:v for k,v in t}
series = Forward()
struct = Suppress('{') + series + Suppress('}')
value = quotedString.setParseAction(removeQuotes) | struct
token = Word(alphanums)
assignment = Group(token + Suppress('=') + value + Suppress(Optional(",")))
series << ZeroOrMore(assignment).setParseAction(to_dict)
language = series + stringEnd
def config_file_to_dict(filename):
    return language.parseFile(filename)[0]
if __name__=="__main__":
    from pprint import pprint
    pprint(config_file_to_dict('config.txt'))
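If installing pyparsing is not an option, the same grammar is small enough for a hand-written recursive-descent parser using only the standard library. This is a different technique from the answer above, offered as a rough sketch: the names tokenize, parse_series and parse_config are made up, and it assumes values are either double-quoted strings without embedded quotes or nested { ... } blocks, with commas optional.

```python
import re

# One token per match: a quoted string, a bare name, or a punctuation mark.
TOKEN = re.compile(r'\s*(?:"([^"]*)"|(\w+)|([{}=,]))')


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        string, name, punct = match.groups()
        if string is not None:
            tokens.append(("str", string))
        elif name is not None:
            tokens.append(("name", name))
        else:
            tokens.append(("punct", punct))
        pos = match.end()
    return tokens


def parse_series(tokens, i=0):
    """Parse `name = value` pairs from index i; return (dict, next index)."""
    result = {}
    while i < len(tokens) and tokens[i][0] == "name":
        key = tokens[i][1]
        if tokens[i + 1:i + 2] != [("punct", "=")]:
            raise ValueError(f"expected '=' after {key!r}")
        i += 2
        if i >= len(tokens):
            raise ValueError(f"missing value for {key!r}")
        kind, text = tokens[i]
        if kind == "str":
            result[key], i = text, i + 1
        elif (kind, text) == ("punct", "{"):
            # Nested block: recurse, then require the closing brace.
            result[key], i = parse_series(tokens, i + 1)
            if tokens[i:i + 1] != [("punct", "}")]:
                raise ValueError(f"missing '}}' for {key!r}")
            i += 1
        else:
            raise ValueError(f"bad value for {key!r}")
        if tokens[i:i + 1] == [("punct", ",")]:
            i += 1   # the separator is optional
    return result, i


def parse_config(text):
    tokens = tokenize(text)
    result, i = parse_series(tokens)
    if i != len(tokens):
        raise ValueError("unexpected trailing input")
    return result


sample = '''
databaseCons = {
    main = {
        database = "readable_name",
        instances = { slaves = { conns = "8" } }
        maxconns = "5"
    }
}
'''
config = parse_config(sample)
print(config["databaseCons"]["main"]["instances"]["slaves"]["conns"])  # 8
```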
