How to ignore tokens in ply.yacc

I'm writing an interpreter for a JSON configuration (i.e., a config file in JSON format) with PLY.
There are huge swaths of the configuration file that I'd like to ignore. Some parts that I'd like to ignore contain tokens that I can't ignore in other parts of the file.
For example, I want to ignore:
"features" : [{
"name" : "someObscureFeature",
"version": "1.2",
"options": {
"values" : ["a", "b", "c"]
"allowWithoutContentLength": false,
"enabled": true
}
...
}]
But I do NOT want to ignore:
"features" : [{
"name" : "importantFeature",
"version": "1.1",
"options": {
"value": {
"id": 587842,
"description": "ramy-single-hostmatch",
"products": [
"Fresca"
]
...
}]
There are also lots of other tokens within the array of features that I want to ignore if the name value is not 'importantFeature'. For example there is likely to be an array of values in both important and obscure features. I need to ignore accordingly.
Notice also that I need to extract certain elements of the values field and that I'd like the values field to be tokenized so I can make use of it. Effectively, I'd like to conditionally tokenize the values field if it's inside of an importantMatch.
Also note that importantFeature is just standing in for what will eventually be about a dozen different features, each with its own grammar inside of its respective features block.
The problem I'm running into is that every feature, obviously, has a name. I'd like to write something along these lines:
def p_FEATURES(p):
'''FEATURES : ARRAY_START FEATURE COMMA FEATURES ARRAY_END
| ARRAY_START FEATURE ARRAY_END'''
def p_FEATURE(p):
'''FEATURE : TESTABLE_FEATURE
| UNTESTABLE_FEATURE'''
def p_TESTABLE_FEATURE(p):
'''TESTABLE_FEATURE : BLOCK_START QUOTE NAME_KEY QUOTE COLON QUOTE CPCODE_FEATURE QUOTE COMMA IGNORE_KEY_VAL_PAIRS COMMA CPCODE_OPTIONS COMMA IGNORE_KEY_VAL_PAIRS'''
def p_UNTESTABLE_FEATURE(p):
'''UNTESTABLE_FEATURE : IGNORE_BLOCK '''
def p_IGNORE_BLOCK(p):
'''IGNORE_BLOCK : BLOCK_START LINES BLOCK_END'''
However, the problem I'm running into is that I can't just "IGNORE_BLOCK", because the block will have a 'name' and I have a token in my lexer called 'name':
def t_NAME_KEY(t): r'name'; return t
Any help greatly appreciated.

When you define a regex rule function, you can choose whether or not to return the token. Depending on what is returned, the token is either ignored or considered. For example:
def t_BLOCK(t):
    r'\{[\s]*name[\s]*:[\s]*(importantFeature|obscureFeature)\}'  # will match a full block with the 'name' key in it
    if 'obscureFeature' not in t.value:
        return t
    # nothing is returned otherwise, so the token is discarded
You can build a rule somewhat along these lines, and then choose whether to return the token or not based on whether your important feature was present or not.
Also, the PLY convention for a token that is defined as a string and should be discarded is to prefix its name with t_ignore_ (for example, t_ignore_COMMENT).
Based on OP's edit: forget about elimination during tokenisation. What you could do instead is manually rebuild the JSON as you parse it with the grammar. For example:
Replace
def p_FEATURE(p):
'''FEATURE : TESTABLE_FEATURE
| UNTESTABLE_FEATURE'''
def p_TESTABLE_FEATURE(p):
'''TESTABLE_FEATURE : BLOCK_START QUOTE NAME_KEY QUOTE COLON QUOTE CPCODE_FEATURE QUOTE COMMA IGNORE_KEY_VAL_PAIRS COMMA CPCODE_OPTIONS COMMA IGNORE_KEY_VAL_PAIRS'''
def p_UNTESTABLE_FEATURE(p):
'''UNTESTABLE_FEATURE : IGNORE_BLOCK '''
with
data = []
def p_FEATURE(p):
'''FEATURE : BLOCK_START DATA BLOCK_END FEATURE
| BLOCK_START DATA BLOCK_END'''
def p_DATA(p):
'''DATA : KEY COLON VALUE COMMA DATA
| KEY COLON VALUE ''' # and so on (have another function for values)
What you can do now is to examine p[2] and see if it is important. If yes, add it to your data variable. Else, ignore.
This is just a rough idea. You'll still have to figure out the grammar rules exactly (for example, VALUE would also probably lead to another state), and adding the right blocks to data and how. But it is possible.
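The "examine p[2]" step can be sketched independently of the grammar. This is only a minimal illustration, assuming the DATA rule has already built a plain dict for each block; IMPORTANT_FEATURES and collect_feature are hypothetical names and not part of PLY.

```python
# Hypothetical helper: in a real grammar this would be called from p_FEATURE
# with p[2], assuming the DATA rule produced a dict for the block.
IMPORTANT_FEATURES = {"importantFeature"}  # assumed set of feature names to keep

data = []

def collect_feature(block):
    """Append the block to data only if its name marks it as important."""
    if block.get("name") in IMPORTANT_FEATURES:
        data.append(block)
        return True
    return False

collect_feature({"name": "someObscureFeature", "version": "1.2"})
collect_feature({"name": "importantFeature", "version": "1.1"})
print([b["name"] for b in data])
```

Obscure blocks are still parsed this way, they are just never stored, which avoids the conflict with the NAME_KEY token.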

Related

pyparsing syntax tree from named value list

I'd like to parse tag/value descriptions using the delimiters :, and •
E.g. the Input would be:
Name:Test•Title: Test•Keywords: A,B,C
the expected result should be the name value dict
{
"name": "Test",
"title": "Test",
"keywords": "A,B,C"
}
potentially already splitting the keywords in "A,B,C" to a list. (This is a minor detail since the python built in split method of string will happily do this).
Also applying a mapping
keys={
"Name": "name",
"Title": "title",
"Keywords": "keywords",
}
as a mapping between names and dict keys would be helpful but could be a separate step.
I tried the code below https://trinket.io/python3/8dbbc783c7
# pyparsing named values
# Wolfgang Fahl
# 2023-01-28 for Stackoverflow question
import pyparsing as pp
notes_text="Name:Test•Title: Test•Keywords: A,B,C"
keys={
"Name": "name",
"Titel": "title",
"Keywords": "keywords",
}
keywords=list(keys.keys())
runDelim="•"
name_values_grammar=pp.delimited_list(
pp.oneOf(keywords,as_keyword=True).setResultsName("key",list_all_matches=True)
+":"+pp.Suppress(pp.Optional(pp.White()))
+pp.delimited_list(
pp.OneOrMore(pp.Word(pp.printables+" ", exclude_chars=",:"))
,delim=",")("value")
,delim=runDelim).setResultsName("tag", list_all_matches=True)
results=name_values_grammar.parseString(notes_text)
print(results.dump())
and variations of it, but I am not even close to the expected result. Currently the dump shows:
['Name', ':', 'Test']
- key: 'Name'
- tag: [['Name', ':', 'Test']]
[0]:
['Name', ':', 'Test']
- value: ['Test']
It seems I don't know how to define the grammar and work with the parse result in a way that gets the needed dict result.
The main questions for me are:
Should I use parse actions?
How is the naming of part results done?
How is the navigation of the resulting tree done?
How is it possible to get the list back from delimitedList?
What does list_all_matches=True achieve? Its behavior seems strange.
I searched for answers to the above questions here on Stack Overflow and couldn't find a consistent picture of what to do.
Pyparsing delimited list only returns first element
Finding lists of elements within a string using Pyparsing
PyParsing seems to be a great tool, but I find it very unintuitive. Fortunately there are lots of answers here, so I hope to learn how to get this example working.
Trying it myself, I took a stepwise approach:
First I checked the delimitedList behavior, see https://trinket.io/python3/25e60884eb
# Try out pyparsing delimitedList
# WF 2023-01-28
from pyparsing import printables, OneOrMore, Word, delimitedList
notes_text="A,B,C"
comma_separated_values=delimitedList(Word(printables+" ", exclude_chars=",:"),delim=",")("clist")
grammar = comma_separated_values
result=grammar.parseString(notes_text)
print(f"result:{result}")
print(f"dump:{result.dump()}")
print(f"asDict:{result.asDict()}")
print(f"asList:{result.asList()}")
which returns
result:['A', 'B', 'C']
dump:['A', 'B', 'C']
- clist: ['A', 'B', 'C']
asDict:{'clist': ['A', 'B', 'C']}
asList:['A', 'B', 'C']
which looks promising and the key success factor seems to be to name this list with "clist" and the default behavior looks fine.
https://trinket.io/python3/bc2517e25a
shows in more detail where the problem is.
# Try out pyparsing delimitedList
# see https://stackoverflow.com/q/75266188/1497139
# WF 2023-01-28
from pyparsing import printables, oneOf, OneOrMore,Optional, ParseResults, Suppress,White, Word, delimitedList
def show_result(title:str,result:ParseResults):
"""
show pyparsing result details
Args:
result(ParseResults)
"""
print(f"result for {title}:")
print(f" result:{result}")
print(f" dump:{result.dump()}")
print(f" asDict:{result.asDict()}")
print(f" asList:{result.asList()}")
# asXML is deprecated and doesn't work any more
# print(f"asXML:{result.asXML()}")
notes_text="Name:Test•Title: Test•Keywords: A,B,C"
comma_text="A,B,C"
keys={
"Name": "name",
"Titel": "title",
"Keywords": "keywords",
}
keywords=list(keys.keys())
runDelim="•"
comma_separated_values=delimitedList(Word(printables+" ", exclude_chars=",:"),delim=",")("clist")
cresult=comma_separated_values.parseString(comma_text)
show_result("comma separated values",cresult)
grammar=delimitedList(
oneOf(keywords,as_keyword=True)
+Suppress(":"+Optional(White()))
+comma_separated_values
,delim=runDelim
)("namevalues")
nresult=grammar.parseString(notes_text)
show_result("name value list",nresult)
#ogrammar=OneOrMore(
# oneOf(keywords,as_keyword=True)
# +Suppress(":"+Optional(White()))
# +comma_separated_values
#)
#oresult=grammar.parseString(notes_text)
#show_result("name value list with OneOf",nresult)
output:
result for comma separated values:
result:['A', 'B', 'C']
dump:['A', 'B', 'C']
- clist: ['A', 'B', 'C']
asDict:{'clist': ['A', 'B', 'C']}
asList:['A', 'B', 'C']
result for name value list:
result:['Name', 'Test']
dump:['Name', 'Test']
- clist: ['Test']
- namevalues: ['Name', 'Test']
asDict:{'clist': ['Test'], 'namevalues': ['Name', 'Test']}
asList:['Name', 'Test']
While the first result makes sense to me, the second is unintuitive. I'd expected a nested result: a dict with a dict of lists.
What causes this unintuitive behavior and how can it be mitigated?

The issues with the grammar are that you are wrapping OneOrMore in delimited_list when you only want the outer one, and that you aren't telling the parser how your data needs to be structured to give the names meaning.
You also don't need the whitespace suppression, as it is automatic.
Adding parse_all=True to the parse_string call will help you see where not everything is being consumed.
name_values_grammar = pp.delimited_list(
pp.Group(
pp.oneOf(keywords,as_keyword=True).setResultsName("key",list_all_matches=True)
+ pp.Suppress(pp.Literal(':'))
+ pp.delimited_list(
pp.Word(pp.printables, exclude_chars=':,').setResultsName('value', list_all_matches=True)
, delim=',')
)
, delim='•'
).setResultsName('tag', list_all_matches=True)
Should I use parse actions?
As you can see, you don't technically need to, but you've ended up with a data structure that might be less efficient for what you want. If the grammar gets more complicated, I think using some parse actions would make sense. Take a look below for some examples that map the key names (only if they are found) and clean up list parsing for a more complicated grammar.
How is the naming of part results done?
By default in a ParseResults object, the last part that is labelled with a name will be returned when you ask for that name. Asking for all matches to be returned using list_all_matches will only work usefully for some simple structures, but it does work. See below for examples.
How is the navigation of the resulting tree done?
By default, everything gets flattened. You can use pyparsing.Group to tell the parser not to flatten its contents into the parent list (and therefore retain useful structure and part names).
How is it possible to get the list back from delimitedList?
If you don't wrap the delimited_list result in another list, the flattening that is done will remove the structure. Parse actions, or Group on the internal structure, come to the rescue again.
What does list_all_matches=True achieve? Its behavior seems strange.
It is a function of the grammar structure that it seems strange. Consider the different outputs in:
import pyparsing as pp
print(
pp.delimited_list(
pp.Word(pp.printables, exclude_chars=',').setResultsName('word', list_all_matches=True)
).parse_string('x,y,z').dump()
)
print(
pp.delimited_list(
pp.Word(pp.printables, exclude_chars=':,').setResultsName('key', list_all_matches=True)
+ pp.Suppress(pp.Literal(':'))
+ pp.Word(pp.printables, exclude_chars=':,').setResultsName('value', list_all_matches=True)
)
.parse_string('x:a,y:b,z:c').dump()
)
print(
pp.delimited_list(
pp.Group(
pp.Word(pp.printables, exclude_chars=':,').setResultsName('key', list_all_matches=True)
+ pp.Suppress(pp.Literal(':'))
+ pp.Word(pp.printables, exclude_chars=':,').setResultsName('value', list_all_matches=True)
)
).setResultsName('tag', list_all_matches=True)
.parse_string('x:a,y:b,z:c').dump()
)
The first one makes sense, giving you a list of all the tokens you would expect. The third one also makes sense, since you have a structure you can walk. But with the second one you end up with two lists that are not necessarily (in a more complicated grammar) going to be easy to match up.
Here's a different way of building the grammar so that it supports quoting strings with delimiters in them so they don't become lists, and keywords that aren't in your mapping. It's harder to do this without parse actions.
import pyparsing as pp
import json
test_string = "Name:Test•Title: Test•Extra: '1,2,3'•Keywords: A,B,C,'D,E',F"
keys={
"Name": "name",
"Title": "title",
"Keywords": "keywords",
}
g_key = pp.Word(pp.alphas)
g_item = pp.Word(pp.printables, excludeChars='•,\'') | pp.QuotedString(quote_char="'")
g_value = pp.delimited_list(g_item, delim=',')
l_key_value_sep = pp.Suppress(pp.Literal(':'))
g_key_value = g_key + l_key_value_sep + g_value
g_grammar = pp.delimited_list(g_key_value, delim='•')
g_key.add_parse_action(lambda x: keys[x[0]] if x[0] in keys else x)
g_value.add_parse_action(lambda x: [x] if len(x) > 1 else x)
g_key_value.add_parse_action(lambda x: (x[0], x[1].as_list()) if isinstance(x[1],pp.ParseResults) else (x[0], x[1]))
key_values = dict()
for k,v in g_grammar.parse_string(test_string, parse_all=True):
key_values[k] = v
print(json.dumps(key_values, indent=2))

Another approach using regular expressions would be:
import re
import typing

def _extractByKeyword(keyword: str, string: str) -> typing.Union[str, None]:
    """
    Extract the value for the given key from the given string.
    Designed for simple key value strings without further formatting,
    e.g.
        Title: Hello World
        Goal: extraction
    For keyword="Goal" the string "extraction" would be returned.
    Args:
        keyword: extract the value associated to this keyword
        string: string to extract from
    Returns:
        str: value associated to given keyword
        None: keyword not found in given string
    """
    if string is None or keyword is None:
        return None
    # https://stackoverflow.com/a/2788151/1497139
    # value is closure of not space not / colon
    pattern = rf"({keyword}:(?P<value>[\s\w,_-]*))(\s+\w+:|\n|$)"
    match = re.search(pattern, string)
    value = None
    if match is not None:
        value = match.group('value')
        if isinstance(value, str):
            value = value.strip()
    return value
keys={
"Name": "name",
"Title": "title",
"Keywords": "keywords",
}
notes_text="Name:Test Title: Test Keywords: A,B,C"
lod = {v: _extractByKeyword(k, notes_text) for k,v in keys.items()}
The extraction function was tested with:
import typing
from dataclasses import dataclass
from unittest import TestCase

class TestExtraction(TestCase):
    def test_extractByKeyword(self):
        """
        tests the keyword extraction
        """
        @dataclass
        class TestParam:
            expected: typing.Union[str, None]
            keyword: typing.Union[str, None]
            string: typing.Union[str, None]

        testParams = [
            TestParam("test", "Goal", "Title:Title\nGoal:test\nLabel:title"),
            TestParam("test", "Goal", "Title:Title\nGoal:test Label:title"),
            TestParam("test", "Goal", "Title:Title\nGoal:test"),
            TestParam("test with spaces", "Goal", "Title:Title\nGoal:test with spaces\nLabel:title"),
            TestParam("test with spaces", "Goal", "Title:Title\nGoal:test with spaces Label:title"),
            TestParam("test with spaces", "Goal", "Title:Title\nGoal:test with spaces"),
            TestParam("SQL-DML", "Goal", "Title:Title\nGoal:SQL-DML"),
            TestParam("SQL_DML", "Goal", "Title:Title\nGoal:SQL_DML"),
            TestParam(None, None, "Title:Title\nGoal:test"),
            TestParam(None, "Label", None),
            TestParam(None, None, None),
        ]
        for testParam in testParams:
            with self.subTest(testParam=testParam):
                actual = _extractByKeyword(testParam.keyword, testParam.string)
                self.assertEqual(testParam.expected, actual)

For the time being I am using a simple workaround, see https://trinket.io/python3/7ccaa91f7e
# Try out parsing name value list
# WF 2023-01-28
import json
notes_text="Name:Test•Title: Test•Keywords: A,B,C"
keys={
"Name": "name",
"Title": "title",
"Keywords": "keywords",
}
result={}
key_values=notes_text.split("•")
for key_value in key_values:
key,value=key_value.split(":")
value=value.strip()
result[keys[key]]=value # could do another split here if need be
print(json.dumps(result,indent=2))
output:
{
"name": "Test",
"title": "Test",
"keywords": "A,B,C"
}
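A slightly more defensive variant of the same workaround might look like the sketch below. It assumes values may themselves contain a colon and that unknown keys should be kept unchanged; parse_named_values is a hypothetical name, and the second split on "," is the optional step mentioned in the question.

```python
import json

def parse_named_values(text, keys, delim="•"):
    """Split text into a dict, mapping known keys and keeping unknown ones as-is."""
    result = {}
    for part in text.split(delim):
        # partition splits on the first colon only, so a value may contain ':'
        key, _, value = part.partition(":")
        key = key.strip()
        result[keys.get(key, key)] = value.strip()
    return result

keys = {"Name": "name", "Title": "title", "Keywords": "keywords"}
parsed = parse_named_values("Name:Test•Title: Test•Keywords: A,B,C", keys)
# optional second split: turn the keywords string into a list
parsed["keywords"] = [k.strip() for k in parsed["keywords"].split(",")]
print(json.dumps(parsed, indent=2))
```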

Parse delimited and nested field names from URL parameter for partial response

In a Flask-RESTful based API, I want to allow clients to retrieve a JSON response partially, via the ?fields=... parameter. It lists field names (keys of the JSON object) that will be used to construct a partial representation of the larger original.
This may be, in its simplest form, a comma-separated list:
GET /v1/foobar?fields=name,id,date
That can be done with webargs' DelimitedList schema field easily, and is no trouble for me.
But, to allow nested objects' keys to be represented, the delimited field list may include arbitrarily nested keys enclosed in matching parentheses:
GET /v1/foobar?fields=name,id,another(name,id),date
{
"name": "",
"id": "",
"another": {
"name": "",
"id": ""
},
"date": ""
}
GET /v1/foobar?fields=id,one(id,two(id,three(id),date)),date
{
"id": "",
"one": {
"id": "",
"two": {
"id": "",
"three": {
"id": ""
},
"date": ""
}
},
"date": ""
}
GET /v1/foobar?fields=just(me)
{
"just": {
"me": ""
}
}
My question is two-fold:
Is there a way to do this (validate & deserialize) with webargs and marshmallow natively?
If not, how would I do this with a parsing framework like pyparsing? Any hint on what the BNF grammar is supposed to look like is highly appreciated.

Pyparsing has a couple of helpful built-ins, delimitedList and nestedExpr. Here is an annotated snippet that builds up a parser for your values. (I also included an example where your list elements might be more than just simple alphabetic words):
import pyparsing as pp
# placeholder element that will be used recursively
item = pp.Forward()
# your basic item type - expand as needed to include other characters or types
word = pp.Word(pp.alphas + '_')
list_element = word
# for instance, add support for numeric values
list_element = word | pp.pyparsing_common.number
# retain x(y, z, etc.) groupings using Group
grouped_item = pp.Group(word + pp.nestedExpr(content=pp.delimitedList(item)))
# define content for placeholder; must use '<<=' operator here, not '='
item <<= grouped_item | list_element
# create parser
parser = pp.Suppress("GET /v1/foobar?fields=") + pp.delimitedList(item)
You can test any pyparsing expression using runTests:
parser.runTests("""
GET /v1/foobar?fields=name,id,date
GET /v1/foobar?fields=name,id,another(name,id),date
GET /v1/foobar?fields=id,one(id,two(id,three(id),date)),date
GET /v1/foobar?fields=just(me)
GET /v1/foobar?fields=numbers(1,2,3.7,-26e10)
""", fullDump=False)
Gives:
GET /v1/foobar?fields=name,id,date
['name', 'id', 'date']
GET /v1/foobar?fields=name,id,another(name,id),date
['name', 'id', ['another', ['name', 'id']], 'date']
GET /v1/foobar?fields=id,one(id,two(id,three(id),date)),date
['id', ['one', ['id', ['two', ['id', ['three', ['id']], 'date']]]], 'date']
GET /v1/foobar?fields=just(me)
[['just', ['me']]]
GET /v1/foobar?fields=numbers(1,2,3.7,-26e10)
[['numbers', [1, 2, 3.7, -260000000000.0]]]
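To get from these nested lists to the nested JSON shape shown in the question, a small recursive conversion is one option. The sketch below assumes the input is what parseString(...).asList() returns for the grammar above, i.e. plain names mixed with [name, [children]] groups; fields_to_dict is a hypothetical helper name.

```python
def fields_to_dict(items):
    """Convert ['a', ['b', ['c']]] style lists into {'a': '', 'b': {'c': ''}}."""
    result = {}
    for item in items:
        if isinstance(item, list):
            # a group has the form [name, [children...]]
            name, children = item
            result[name] = fields_to_dict(children)
        else:
            result[item] = ""
    return result

parsed = ['id', ['one', ['id', ['two', ['id', ['three', ['id']], 'date']]]], 'date']
print(fields_to_dict(parsed))
```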

Replace single quotes with double quotes but leave ones within double quotes untouched

The ultimate goal, and the origin of the problem, is to have a field that is compatible with json_extract_path_text in Redshift.
This is how it looks right now:
{'error': "Feed load failed: Parameter 'url' must be a string, not object", 'errorCode': 3, 'event_origin': 'app', 'screen_id': '6118964227874465', 'screen_class': 'Promotion'}
To extract the field I need from the string in Redshift, I replaced the single quotes with double quotes.
This particular record gives an error because the value of error contains single quotes. If those are replaced as well, the string becomes invalid JSON.
So what I need is:
{"error": "Feed load failed: Parameter 'url' must be a string, not object", "errorCode": 3, "event_origin": "app", "screen_id": "6118964227874465", "screen_class": "Promotion"}

Several ways, one is to use the regex module with
"[^"]*"(*SKIP)(*FAIL)|'
See a demo on regex101.com.
In Python:
import regex as re
rx = re.compile(r'"[^"]*"(*SKIP)(*FAIL)|\'')
new_string = rx.sub('"', old_string)
With the original re module, you'd need to use a function and see if the group has been matched or not - (*SKIP)(*FAIL) lets you avoid exactly that.
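For comparison, the function-based variant with the standard re module could look roughly like this. It is a sketch that assumes double-quoted strings contain no escaped double quotes; swap_quote is a hypothetical name and the sample string is shortened from the question.

```python
import re

# Either capture a complete double-quoted string in group 1,
# or match a lone single quote outside of such a string.
rx = re.compile(r'("[^"]*")|\'')

def swap_quote(match):
    # Group 1 matched: we are looking at a double-quoted string, keep it unchanged.
    if match.group(1) is not None:
        return match.group(1)
    # Otherwise a bare single quote matched: replace it.
    return '"'

old_string = """{'error': "Parameter 'url' must be a string", 'errorCode': 3}"""
new_string = rx.sub(swap_quote, old_string)
print(new_string)
```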

I tried a regex approach but found it too complicated and slow, so I wrote a simple "bracket parser" that keeps track of the current quotation mode. It cannot handle multiple levels of nesting; you'd need a stack for that. For my use case, converting str(dict) to proper JSON, it works:
example input:
{'cities': [{'name': "Upper Hell's Gate"}, {'name': "N'zeto"}]}
example output:
{"cities": [{"name": "Upper Hell's Gate"}, {"name": "N'zeto"}]}
python unit test
def testSingleToDoubleQuote(self):
jsonStr='''
{
"cities": [
{
"name": "Upper Hell's Gate"
},
{
"name": "N'zeto"
}
]
}
'''
listOfDicts=json.loads(jsonStr)
dictStr=str(listOfDicts)
if self.debug:
print(dictStr)
jsonStr2=JSONAble.singleQuoteToDoubleQuote(dictStr)
if self.debug:
print(jsonStr2)
self.assertEqual('''{"cities": [{"name": "Upper Hell's Gate"}, {"name": "N'zeto"}]}''',jsonStr2)
singleQuoteToDoubleQuote
def singleQuoteToDoubleQuote(singleQuoted):
    '''
    convert a single quoted string to a double quoted one
    Args:
        singleQuoted(string): a single quoted string e.g. {'cities': [{'name': "Upper Hell's Gate"}]}
    Returns:
        string: the double quoted version of the string e.g.
    see
        - https://stackoverflow.com/questions/55600788/python-replace-single-quotes-with-double-quotes-but-leave-ones-within-double-q
    '''
    cList = list(singleQuoted)
    inDouble = False
    inSingle = False
    for i, c in enumerate(cList):
        # print("%d:%s %r %r" % (i, c, inSingle, inDouble))
        if c == "'":
            if not inDouble:
                inSingle = not inSingle
                cList[i] = '"'
        elif c == '"':
            inDouble = not inDouble
    doubleQuoted = "".join(cList)
    return doubleQuoted

Array has multi strings against text with multiline ( regular expression) Python

I am working with regular expressions in Python. I have spent a whole week on this and can't understand what is wrong with my code. It seems obvious that multiple strings should match, but I only get a few of them, such as "model" and "US"; I can't match 37abc5afce16xxx or "-104.99875". My goal is just to tell whether any string in the array matches, and which one it is.
I have a string such as:
text = {'"version_name"': '"8.5.2"', '"abi"': '"arm64-v8a"', '"x_dpi"':
'515.1539916992188', '"environment"': '{"sdk_version"',
'"time_zone"':
'"America\\/Wash"', '"user"': '{}}', '"density_default"': '560}}',
'"resolution_width"': '1440', '"package_name"':
'"com.okcupid.okcupid"', '"d44bcbfb-873454-4917-9e02-2066d6605d9f"': '{"language"', '"country"':
'"US"}', '"now"': '1.515384841291E9', '{"extras"': '{"sessions"',
'"device"': '{"android_version"', '"y_dpi"': '37abc5afce16xxx',
'"model"': '"Nexus 6P"', '"new"': 'true}]', '"only_respond_with"':
'["triggers"]}\n0\r\n\r\n', '"start_time"': '1.51538484115E9',
'"version_code"': '1057', '"-104.99875"': '"0"', '"no_acks"': 'true}',
'"display"': '{"resolution_height"'}
The array holds multiple strings:
Keywords =["37abc5afce16xxx","867686022684243", "ffffffff-f336-7a7a-0f06-65f40033c587", "long", "Lat", "uuid", "WIFI", "advertiser", "d44bcbfb-873454-4917-9e02-2066d6605d9f","deviceFinger", "medialink", "Huawei","Andriod","US","local_ip","Nexus", "android2.10.3","WIFI", "operator", "carrier", "angler", "MMB29M", "-104.99875"]
My code is:
for x in Keywords:
pattern = r"^.*"+str(x)+"^.*"
if re.findall(pattern, str(values1),re.M):
print "Match"
print x
else:
print "Not Match"

Your code's goal is a bit unclear, so this answer assumes you want to check which items from the Keywords list also appear in the text dictionary.
In your code, it looks like you only compare the regex against the dictionary values, not the keys (assuming that is what the values1 variable holds).
Also, instead of using the regex "^.*" to match strings, you can simply do:
for X in Keywords:
    if X in yourDictionary.keys():
        doSomething
    if X in yourDictionary.values():
        doSomethingElse
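Note that exact membership only finds a keyword when it equals a whole key or value. In the question's data the keys and values carry embedded quotes and braces (for example '"US"}'), so a substring test may be closer to what was intended. The following is a sketch on a shortened sample of that data; find_keywords is a hypothetical name.

```python
def find_keywords(keywords, mapping):
    """Return the keywords that occur as a substring of any key or value."""
    haystack = [str(k) for k in mapping] + [str(v) for v in mapping.values()]
    return [kw for kw in keywords if any(kw in entry for entry in haystack)]

text = {
    '"y_dpi"': '37abc5afce16xxx',
    '"model"': '"Nexus 6P"',
    '"country"': '"US"}',
    '"-104.99875"': '"0"',
}
keywords = ["37abc5afce16xxx", "uuid", "US", "Nexus", "-104.99875"]
matches = find_keywords(keywords, text)
print(matches)
```

If a regular expression is still preferred, re.search(re.escape(kw), entry) performs the same substring test. The second "^" in the original pattern demands a line start in the middle of the match, which is likely why most keywords never matched.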

MongoDB file path as unique index

How should I organize my collection for documents like this:
{
"path" : "\\192.168.77.1\user\1.wav", // unique text index
"sex" : "male", "age" : 28 // some fields
}
I use this scheme in Python (pymongo):
client = MongoClient(self.addr)
db = self.client['some']
db.files.ensure_index([('path', TEXT)], unique=True)
data = [
{"path": r'\\192.168.77.5\1.wav', "base": "CAGS2"},
{"path": r'\\192.168.77.5\2.wav', "base": "CAGS2"}
]
sid = self.db.files.insert(data)
But error occurs:
pymongo.errors.DuplicateKeyError: insertDocument ::
caused by :: 11000 E11000 duplicate key error index:
some.files.$path_text dup key: { : "168", : 0.75 }
If I remove all dots ('.') inside path keys, everything is ok. What is wrong?

Why are you creating a unique text index? For that matter, why is MongoDB letting you? When you create a text index, the input field values are tokenized:
"the brown fox jumps" -> ["the", "brown", "fox", "jumps"]
The tokens are stemmed, meaning they are reduced (in a language-specific way) to a different form to support natural language matching like "loves" with "love" and "loving" with "love". Stopwords like "the", which are common words that would be more harmful than helpful to match on, are thrown out.
["the", "brown", "fox", "jumps"] -> ["brown", "fox", "jump"]
The index entries for the document are the stemmed tokens of the original field value with a score that's calculated based off of how important the term is in the value string. Ergo, when you put a unique index on these values, you are ensuring that you cannot have two documents with terms that stem to the same thing and have the same score. This is pretty much never what you would want because it's hard to tell what it's going to reject. Here is an example:
> db.test.drop()
> db.test.ensureIndex({ "t" : "text" }, { "unique" : true })
> db.test.insert({ "t" : "ducks are quacking" })
WriteResult({ "nInserted" : 1 })
> db.test.insert({ "t" : "did you just quack?" })
WriteResult({
"nInserted" : 0,
"writeError" : {
"code" : 11000,
"errmsg" : "insertDocument :: caused by :: 11000 E11000 duplicate key error index: test.test.$a_text dup key: { : \"quack\", : 0.75 }"
}
})
> db.test.insert({ "t" : "though I walk through the valley of the shadow of death, I will fear no quack" })
WriteResult({ "nInserted" : 1 })
The stemmed term "quack" will result from all three documents, but in the first two it receives the score of 0.75, so the second insertion is rejected by the unique constraint. It receives a score of 0.5625 in the third document.
What are you actually trying to achieve with the index on the path? A unique text index is not what you want.
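To get a feel for why two different file paths can collide, the tokenization can be imitated very roughly. The sketch below assumes tokens are produced by splitting on every non-alphanumeric character; that is a simplification for illustration, not MongoDB's actual tokenizer, and it does not reproduce stemming or scoring.

```python
import re

def rough_tokens(value):
    """Split on anything that is not a letter or digit (a simplification)."""
    return [tok for tok in re.split(r"[^0-9A-Za-z]+", value) if tok]

first = rough_tokens(r"\\192.168.77.5\1.wav")
second = rough_tokens(r"\\192.168.77.5\2.wav")
shared = sorted(set(first) & set(second))
print(first)
print(shared)
```

Under this assumption both paths share tokens such as "168", which is the term reported in the duplicate key error from the question.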

Have you escaped all the text in the input fields to ensure that it is a valid JSON document?
Here is a valid JSON document:
{
"path": "\"\\\\192.168.77.1\\user\\1.wav\"",
"sex": "male",
"age": 28
}
You have set the text index to be unique. Is there already a document in the collection with a path value of "\\192.168.77.1\user\1.wav"?
Mongo may also be treating the punctuation in the path as delimiters, which may affect how it's stored.
MongoDB $search field

I created a scheme with a TEXT index for 'path' and it was saved in the DB.
I later tried to change TEXT to ASCENDING/DESCENDING, but nothing worked because I had not reset the index (or deleted and recreated the entire DB).
So, as wdberkeley wrote: when you create a text index, the input field values are tokenized:
"the brown fox jumps" -> ["the", "brown", "fox", "jumps"]
A TEXT index is not the right solution for filenames. Use ASCENDING/DESCENDING instead.
