MongoDB file path as unique index - python

How should I organize my collection for documents like this:
{
    "path" : "\\192.168.77.1\user\1.wav",  // unique text index
    "sex" : "male", "age" : 28             // some fields
}
I use this scheme in Python (pymongo):
from pymongo import MongoClient, TEXT

client = MongoClient(addr)
db = client['some']
db.files.ensure_index([('path', TEXT)], unique=True)  # legacy API; create_index in current pymongo
data = [
    {"path": r'\\192.168.77.5\1.wav', "base": "CAGS2"},
    {"path": r'\\192.168.77.5\2.wav', "base": "CAGS2"}
]
sid = db.files.insert(data)
But an error occurs:
pymongo.errors.DuplicateKeyError: insertDocument ::
caused by :: 11000 E11000 duplicate key error index:
some.files.$path_text dup key: { : "168", : 0.75 }
If I remove all dots ('.') inside the path keys, everything works. What is wrong?

Why are you creating a unique text index? For that matter, why is MongoDB letting you? When you create a text index, the input field values are tokenized:
"the brown fox jumps" -> ["the", "brown", "fox", "jumps"]
The tokens are stemmed, meaning they are reduced (in a language-specific way) to a base form, so that natural-language search can match "loves" and "loving" with "love". Stopwords like "the", common words that would be more harmful than helpful to match on, are thrown out.
["the", "brown", "fox", "jumps"] -> ["brown", "fox", "jump"]
The index entries for the document are the stemmed tokens of the original field value, each with a score calculated from how important the term is within the value string. Ergo, when you put a unique index on these values, you are ensuring that you cannot have two documents with terms that stem to the same thing and have the same score. This is pretty much never what you would want, because it's hard to tell what it's going to reject. Here is an example:
> db.test.drop()
> db.test.ensureIndex({ "t" : "text" }, { "unique" : true })
> db.test.insert({ "t" : "ducks are quacking" })
WriteResult({ "nInserted" : 1 })
> db.test.insert({ "t" : "did you just quack?" })
WriteResult({
    "nInserted" : 0,
    "writeError" : {
        "code" : 11000,
        "errmsg" : "insertDocument :: caused by :: 11000 E11000 duplicate key error index: test.test.$t_text dup key: { : \"quack\", : 0.75 }"
    }
})
> db.test.insert({ "t" : "though I walk through the valley of the shadow of death, I will fear no quack" })
WriteResult({ "nInserted" : 1 })
The stemmed term "quack" results from all three documents, but in the first two it receives a score of 0.75, so the second insertion is rejected by the unique constraint. In the third document it receives a score of 0.5625, so that insertion is accepted.
What are you actually trying to achieve with the index on the path? A unique text index is not what you want.
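If the goal is simply to forbid duplicate paths, a plain (non-text) unique index on the whole string is the usual tool. A minimal shell sketch, mirroring the example above:
> db.files.ensureIndex({ "path" : 1 }, { "unique" : true })
> db.files.insert({ "path" : "\\\\192.168.77.5\\1.wav" })
WriteResult({ "nInserted" : 1 })
> db.files.insert({ "path" : "\\\\192.168.77.5\\1.wav" })
// rejected with an E11000 duplicate key error on the exact path, dots and all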

Have you escaped all the text in the input fields to ensure that it is a valid JSON document?
Here is a valid JSON document:
{
    "path": "\\\\192.168.77.1\\user\\1.wav",
    "sex": "male",
    "age": 28
}
You have set the text index to be unique; is there already a document in the collection with a path value of "\\192.168.77.1\user\1.wav"?
Mongo may also be treating the punctuation in the path as delimiters, which may affect how it is stored.
See also the MongoDB documentation on the $search field.

I created the TEXT index for 'path' and it was saved in the DB.
I then tried to change TEXT to ASCENDING/DESCENDING, and nothing worked because I didn't reset the index (or delete and recreate the entire DB).
So, as wdberkeley wrote above: when you create a text index, the input field values are tokenized:
"the brown fox jumps" -> ["the", "brown", "fox", "jumps"]
A TEXT index is not the solution for filenames. Use ASCENDING/DESCENDING instead.
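A minimal pymongo sketch of that fix, assuming the same some.files collection (create_index is the modern spelling of the deprecated ensure_index):
from pymongo import MongoClient, ASCENDING
from pymongo.errors import DuplicateKeyError

client = MongoClient('localhost', 27017)
db = client['some']
db.files.drop_index('path_text')  # remove the old text index first (default name assumed)
db.files.create_index([('path', ASCENDING)], unique=True)
try:
    db.files.insert_one({"path": r'\\192.168.77.5\1.wav', "base": "CAGS2"})
except DuplicateKeyError:
    pass  # a document with this exact path already exists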

Related

Address Parsing [Py]

So I'm looking to write a function that will take input in the form:
123 1st street APT 32S
320 Jumping Alien Road
555 Google Ave
and output, as a dictionary / JSON, all the information parsed from the input string.
The dictionary would look something like:
output = {
    "streetNum" : "123",
    "roadName" : "1st",
    "suffix" : "street",
    "enders" : "APT",  # or None / null
    "room" : "32S"     # or None / null
}
I'm trying to think of the logic, but the best I can come up with is something along the lines of address.split(' ') and then taking the positions where the street number, road name, and suffix would typically be located in the string. Obviously these things aren't always going to be in that order, and road names with spaces inside them would break the function as well.
def addressParser(addressString):
    # naive positional split; breaks on multi-word road names
    parts = addressString.split(' ')
    parts += [None] * (5 - len(parts))  # pad so short addresses don't raise IndexError
    return {
        "streetNum" : parts[0],  # prob need regex help
        "roadName" : parts[1],
        "suffix" : parts[2],
        "enders" : parts[3],
        "room" : parts[4]
    }
Edit: found exactly what I needed here: https://pypi.org/project/address/
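For anyone who'd rather not pull in a library, a rough regex-based sketch of the same positional idea (the pattern, the suffix list, and the field names here are illustrative assumptions, and it still won't handle every real-world address):
import re

# "<number> <name words> <suffix>", optionally followed by "APT/UNIT/STE <room>"
ADDRESS_RE = re.compile(
    r'^(?P<streetNum>\d+)\s+'
    r'(?P<roadName>.+?)\s+'
    r'(?P<suffix>street|st|road|rd|ave|avenue|blvd)'
    r'(?:\s+(?P<enders>APT|UNIT|STE)\s+(?P<room>\S+))?$',
    re.IGNORECASE,
)

def addressParser(addressString):
    m = ADDRESS_RE.match(addressString.strip())
    return m.groupdict() if m else None

print(addressParser("123 1st street APT 32S"))
# {'streetNum': '123', 'roadName': '1st', 'suffix': 'street', 'enders': 'APT', 'room': '32S'}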

How to check if a word or group of words exist in given list of strings and how to extract that word?

I have a list of strings as follows:
list_of_words = ['all saints church','churchill college', "great saint mary's church", 'holy trinity church', "little saint mary's church", 'emmanuel college']
And I have a list of dictionaries with 'text' as the key and a sentence as the value. It is as follows:
"dict_sentences": [
{
"text": "Can you help me book a taxi going from emmanuel college to churchill college?"
},
{
"text": "Yes, I could! What time would you like to depart from Emmanuel College?"
},
{
"text": "I want a taxi to holy trinity church"
},
{
"text": "Alright! I have a yellow Lexus booked to pick you up. The Contact number is 07543493643. Anything else I can help with?"
},
{
"text": "No, that is everything I needed. Thank you!"
},
{
"text": "Thank you! Have a great day!"
}
]
For each sentence in dict_sentences, I want to check whether any of the phrases from list_of_words occurs in that sentence, and if so, store it in another dictionary (as I have to work on it further).
For example, the first sentence in dict_sentences, "Can you help me book a taxi going from emmanuel college to churchill college?", contains the substrings 'churchill college' and 'emmanuel college' from our list_of_words, so I want to store 'churchill college' and 'emmanuel college' in another dictionary like { sent1 : ['churchill college', 'emmanuel college'] }
So the expected output would be:
{ sent1 : ['churchill college', 'emmanuel college'],
  sent2 : ['emmanuel college'],
  sent3 : ['holy trinity church'] }
# the rest of the sentences are ignored, as no phrase from list_of_words exists in them
The main problem here is checking whether a given sentence contains a phrase of one or more words (like 'holy trinity church', 3 words) and, if so, extracting it. I went through other answers, and the following code was suggested for checking whether a word from a list occurs in a sentence:
if any(word in sentence for word in list_of_words):
    pass
However, this only checks whether some word from list_of_words exists in the sentence; to extract the matching word, I would have to run for loops. I want to avoid nested for loops because I need a very time-efficient solution: I have around 300 documents, every document consists of 10-15 (or more) such sentences, and list_of_words is large too, around 300 strings. So I need a time-efficient way to check for, and extract, the phrases from a given sentence that exist in list_of_words.
You could use re.findall so there's no nested loop.
import re

output = {}
# build one alternation pattern from all the phrases
find_words = re.compile('|'.join(list_of_words)).findall
for i, (s,) in enumerate(map(dict.values, data['dict_sentences']), 1):
    words = find_words(s.lower())
    if words:
        output[f"sent{i}"] = words
{'sent1': ['emmanuel college', 'churchill college'],
'sent2': ['emmanuel college'],
'sent3': ['holy trinity church']}
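One caveat with the pattern above: '|'.join uses the phrases as raw regex, so a phrase containing metacharacters (dots, parentheses, and so on) would misbehave; escaping each phrase first is the safe variant:
find_words = re.compile('|'.join(map(re.escape, list_of_words))).findall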
This can be done in a dict comprehension as well, using the walrus operator in Python 3.8+, although it may be a little overboard:
find_sent = re.compile('|'.join(list_of_words)).findall
iter_sent = enumerate(map(dict.values, data['dict_sentences']), 1)
output = {f"sent{i}": words for i, (s,) in iter_sent if (words := find_sent(s.lower()))}
There might be a more efficient way to do this with something like itertools, but I am not very familiar with it.
test = {"dict_sentences": ...}  # I'm assuming this is a section of a larger JSON document / dictionary
output = {}
j = 1
for sent in test["dict_sentences"]:
    addition = []
    for i in list_of_words:
        if i.upper() in sent["text"].upper():
            addition.append(i)
    if addition:
        output[f"sent{j}"] = addition
    j += 1
You can do a nested dict comprehension and compare the content by transforming both to lower case, for example:
output = {
    f"sent{i+1}": [
        phrase for phrase in list_of_words if phrase.lower() in sentence['text'].lower()
    ]
    for i, sentence in enumerate(dict_sentences)
}
output_without_empty_matches = {k: v for k, v in output.items() if v}
print(output_without_empty_matches)
>>> {'sent1': ['churchill college', 'emmanuel college'], 'sent2': ['emmanuel college'], 'sent3': ['holy trinity church']}
new_dict = {}
for index, subdict in enumerate(dict_sentences):
    new_list = []
    for word in list_of_words:
        if word in subdict['text'].lower():
            new_list.append(word)
    if new_list:  # only keep sentences that had at least one match
        new_dict["sent" + str(index + 1)] = new_list
print(new_dict)

Parse delimited and nested field names from URL parameter for partial response

In a Flask-RESTful based API, I want to allow clients to retrieve a JSON response partially, via the ?fields=... parameter. It lists field names (keys of the JSON object) that will be used to construct a partial representation of the larger original.
This may be, in its simplest form, a comma-separated list:
GET /v1/foobar?fields=name,id,date
That can be done with webargs' DelimitedList schema field easily, and is no trouble for me.
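(For reference, a rough sketch of that simple case, assuming webargs 6+ where the query location is spelled location="query"; the resource name is illustrative:)
from flask import Flask
from flask_restful import Api, Resource
from webargs import fields
from webargs.flaskparser import use_args

app = Flask(__name__)
api = Api(app)

class Foobar(Resource):
    # GET /v1/foobar?fields=name,id,date -> args["fields"] == ["name", "id", "date"]
    @use_args({"fields": fields.DelimitedList(fields.Str())}, location="query")
    def get(self, args):
        return {"requested": args["fields"]}

api.add_resource(Foobar, "/v1/foobar")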
But, to allow nested objects' keys to be represented, the delimited field list may include arbitrarily nested keys enclosed in matching parentheses:
GET /v1/foobar?fields=name,id,another(name,id),date
{
    "name": "",
    "id": "",
    "another": {
        "name": "",
        "id": ""
    },
    "date": ""
}
GET /v1/foobar?fields=id,one(id,two(id,three(id),date)),date
{
    "id": "",
    "one": {
        "id": "",
        "two": {
            "id": "",
            "three": {
                "id": ""
            },
            "date": ""
        }
    },
    "date": ""
}
GET /v1/foobar?fields=just(me)
{
    "just": {
        "me": ""
    }
}
My question is two-fold:
Is there a way to do this (validate & deserialize) with webargs and marshmallow natively?
If not, how would I do this with a parsing framework like pyparsing? Any hint on what the BNF grammar is supposed to look like is highly appreciated.
Pyparsing has a couple of helpful built-ins, delimitedList and nestedExpr. Here is an annotated snippet that builds up a parser for your values. (I also included an example where your list elements might be more than just simple alphabetic words):
import pyparsing as pp
# placeholder element that will be used recursively
item = pp.Forward()
# your basic item type - expand as needed to include other characters or types
word = pp.Word(pp.alphas + '_')
list_element = word
# for instance, add support for numeric values
list_element = word | pp.pyparsing_common.number
# retain x(y, z, etc.) groupings using Group
grouped_item = pp.Group(word + pp.nestedExpr(content=pp.delimitedList(item)))
# define content for placeholder; must use '<<=' operator here, not '='
item <<= grouped_item | list_element
# create parser
parser = pp.Suppress("GET /v1/foobar?fields=") + pp.delimitedList(item)
You can test any pyparsing expression using runTests:
parser.runTests("""
GET /v1/foobar?fields=name,id,date
GET /v1/foobar?fields=name,id,another(name,id),date
GET /v1/foobar?fields=id,one(id,two(id,three(id),date)),date
GET /v1/foobar?fields=just(me)
GET /v1/foobar?fields=numbers(1,2,3.7,-26e10)
""", fullDump=False)
Gives:
GET /v1/foobar?fields=name,id,date
['name', 'id', 'date']
GET /v1/foobar?fields=name,id,another(name,id),date
['name', 'id', ['another', ['name', 'id']], 'date']
GET /v1/foobar?fields=id,one(id,two(id,three(id),date)),date
['id', ['one', ['id', ['two', ['id', ['three', ['id']], 'date']]]], 'date']
GET /v1/foobar?fields=just(me)
[['just', ['me']]]
GET /v1/foobar?fields=numbers(1,2,3.7,-26e10)
[['numbers', [1, 2, 3.7, -260000000000.0]]]
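To the BNF half of the question, the grammar this parser accepts corresponds roughly to:
fields ::= item ("," item)*
item   ::= group | word
group  ::= word "(" fields ")"
where word is a bare identifier (or, with the extended element type, a number).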

Look for a JSON key and print its contents without brackets in Python

I'm relatively new to Python/programming and even more of a noob with JSON.
I am making a dictionary app for a side project. It works fine; I can search for words and get their definitions. But I want to make it perfect, and I want the results to be readable for the final user (I know about indenting, but I don't want the brackets and all the JSON formatting to appear in the results).
So this is the JSON I'm pulling the data from:
{
    "Bonjour": {
        "English Word": "Hello",
        "Type of word": "whatever",
        "Defintion": "Means good day",
        "Use case example": "Bonjour igo",
        "Additional information": "BRO"
    }
}
and this is the code I'm using to get the values (it doesn't work); the "search" variable equals "Bonjour" in this case (it's a user input):
currentword = json.load(data)  # part of the "with open..."
for definition in currentword[search]['English Word', 'Definition', 'Use case example']:
    print(definition)
the error I get is the following:
KeyError: ('English Word', 'Definition', 'Use case example')
Now I'm unsure whether "Bonjour" is the key, or whether "English Word", etc. are the keys; if not, what is "Bonjour"?
Anyway, I want it to print the values of "English Word" and the other fields, preferably as "English Word - VALUE/DEFINITION".
Thanks for any help.
From your problem, it looks like you want to extract only some of the key-value pairs from your existing dictionary.
Try it like below:
data = {
    "Bonjour": {
        "English Word": "Hello",
        "Type of word": "whatever",
        "Definition": "Means good day",
        "Use case example": "Bonjour igo",
        "Additional information": "BRO"
    }
}
currentword = data
search = "Bonjour"
result = dict((k, currentword[search][k]) for k in ['English Word', 'Definition', 'Use case example'])
for k, v in result.items():
    print(k + ":" + v)
Result:
Definition:Means good day
English Word:Hello
Use case example:Bonjour igo
JSON format is simply a nice way to pair keys and values.
Keys are the names we give to Values, so it will be easy to access them.
If we took your JSON, and split it by keys and values, this is what we would get:
Keys: "Bonjour", "English Word", "Type of word", "Defintion", "Use case example", "Additional information".
Showing all values is a little complex, so I'll explain:
The value of "Bonjour" is this:
{
    "English Word": "Hello",
    "Type of word": "whatever",
    "Defintion": "Means good day",
    "Use case example": "Bonjour igo",
    "Additional information": "BRO"
}
All the other values are described within the value of "Bonjour".
The value of "English Word" is "Hello" and so on.
When you write a line like currentword[search]['English Word', 'Definition', 'Use case example'], you are telling Python to look for a single key named ('English Word', 'Definition', 'Use case example'), and obviously it does not exist.
What you should do is as follows:
definition = currentword[search]  # the inner dict for "Bonjour"
eng_word = definition['English Word']
print('English Word - {}'.format(eng_word))
Please note that definition contains all the other fields as well, so you can pick whichever one you like.
This line:
currentword[search]['English Word', 'Definition', 'Use case example']
looks up the tuple ('English Word', 'Definition', 'Use case example') as a single key in the inner dict. That key doesn't exist in your dictionary, which is why a KeyError is raised.
If you want just the english word, use this instead:
currentword[search]["English Word"]
Assuming search is "Bonjour".
It also looks like you are trying to filter out specific keys from the inner dict separately. If that is the case, you can do this:
d = {
    "Bonjour": {
        "English Word": "Hello",
        "Type of word": "whatever",
        "Defintion": "Means good day",
        "Use case example": "Bonjour igo",
        "Additional information": "BRO"
    }
}
inner_dict = d['Bonjour']
keys = ["English Word", "Use case example", "Defintion"]
print({k: inner_dict[k] for k in keys})
# {'English Word': 'Hello', 'Use case example': 'Bonjour igo', 'Defintion': 'Means good day'}
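Since the question asked for output shaped like "English Word - VALUE", a final formatting pass over the same inner_dict and keys could look like this:
for k in keys:
    print(f"{k} - {inner_dict[k]}")
# English Word - Hello
# Use case example - Bonjour igo
# Defintion - Means good day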

How to ignore tokens in ply.yacc

I'm writing an interpreter with PLY for a JSON configuration (i.e., a config file in JSON format).
There are huge swaths of the configuration file that I'd like to ignore. Some parts that I'd like to ignore contain tokens that I can't ignore in other parts of the file.
For example, I want to ignore:
"features" : [{
"name" : "someObscureFeature",
"version": "1.2",
"options": {
"values" : ["a", "b", "c"]
"allowWithoutContentLength": false,
"enabled": true
}
...
}]
But I do NOT want to ignore:
"features" : [{
"name" : "importantFeature",
"version": "1.1",
"options": {
"value": {
"id": 587842,
"description": "ramy-single-hostmatch",
"products": [
"Fresca"
]
...
}]
There are also lots of other tokens within the array of features that I want to ignore if the name value is not 'importantFeature'. For example there is likely to be an array of values in both important and obscure features. I need to ignore accordingly.
Notice also that I need to extract certain elements of the values field, and I'd like the values field to be tokenized so I can make use of it. Effectively, I'd like to conditionally tokenize the values field when it is inside an importantFeature.
Also note that importantFeature is just standing in for what will eventually be about a dozen different features, each with their own grammar inside of the their respective features blocks.
The problem I'm running into is that every feature, obviously, has a name. I'd like to write something along these lines:
def p_FEATURES(p):
    '''FEATURES : ARRAY_START FEATURE COMMA FEATURES ARRAY_END
                | ARRAY_START FEATURE ARRAY_END'''

def p_FEATURE(p):
    '''FEATURE : TESTABLE_FEATURE
               | UNTESTABLE_FEATURE'''

def p_TESTABLE_FEATURE(p):
    '''TESTABLE_FEATURE : BLOCK_START QUOTE NAME_KEY QUOTE COLON QUOTE CPCODE_FEATURE QUOTE COMMA IGNORE_KEY_VAL_PAIRS COMMA CPCODE_OPTIONS COMMA IGNORE_KEY_VAL_PAIRS'''

def p_UNTESTABLE_FEATURE(p):
    '''UNTESTABLE_FEATURE : IGNORE_BLOCK'''

def p_IGNORE_BLOCK(p):
    '''IGNORE_BLOCK : BLOCK_START LINES BLOCK_END'''
However, the problem I'm running into is that I can't just use IGNORE_BLOCK, because every block will have a 'name', and I have a token in my lexer for 'name':
def t_NAME_KEY(t):
    r'name'
    return t
Any help greatly appreciated.
When you define a regex rule function, you can choose whether or not to return the token: return it and the parser sees it; return nothing (None) and the token is discarded. For example:
def t_BLOCK(t):
    r'\{[\s]*name[\s]*:[\s]*(importantFeature|obscureFeature)[\s]*\}'  # matches a full block with the 'name' key in it
    if 'obscureFeature' not in t.value:  # t is a LexToken; the matched text is in t.value
        return t
    # falling off the end returns None, which discards the token
You can build a rule somewhat along these lines, and then choose whether to return the token or not based on whether your important feature was present or not.
Also, the PLY convention for specifying tokens that should be matched but discarded is to prefix the name with t_ignore_ (for example, t_ignore_COMMENT).
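A minimal sketch of both discard mechanisms (the token names and rules here are illustrative assumptions, not from the OP's grammar):
import ply.lex as lex

tokens = ('WORD', 'DISCARDED')

def t_WORD(t):
    r'[A-Za-z]+'
    return t              # returned, so the parser sees it

def t_DISCARDED(t):
    r'\d+'
    pass                  # returns None, so the token is dropped

t_ignore_COMMENT = r'\#.*'  # matched but never returned
t_ignore = ' \t'            # plain characters to skip entirely

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()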
Based on OP's edit: forget about elimination during tokenisation. What you could do instead is manually rebuild the JSON as you parse it with the grammar. For example, replace
def p_FEATURE(p):
    '''FEATURE : TESTABLE_FEATURE
               | UNTESTABLE_FEATURE'''

def p_TESTABLE_FEATURE(p):
    '''TESTABLE_FEATURE : BLOCK_START QUOTE NAME_KEY QUOTE COLON QUOTE CPCODE_FEATURE QUOTE COMMA IGNORE_KEY_VAL_PAIRS COMMA CPCODE_OPTIONS COMMA IGNORE_KEY_VAL_PAIRS'''

def p_UNTESTABLE_FEATURE(p):
    '''UNTESTABLE_FEATURE : IGNORE_BLOCK'''
with
data = []

def p_FEATURE(p):
    '''FEATURE : BLOCK_START DATA BLOCK_END FEATURE
               | BLOCK_START DATA BLOCK_END'''

def p_DATA(p):
    '''DATA : KEY COLON VALUE COMMA DATA
            | KEY COLON VALUE'''  # and so on (have another function for values)
What you can do now is examine p[1] (the key) and see whether it is important. If yes, add it to your data variable; otherwise, ignore it.
This is just a rough idea. You'll still have to figure out the grammar rules exactly (for example, VALUE would probably also lead to another state), and work out which blocks to add to data and how. But it is possible.
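A rough sketch of that semantic action, under the assumptions that KEY's value arrives as the matched string and that KEEP_KEYS is an illustrative set of keys worth retaining:
KEEP_KEYS = {'name', 'version', 'options'}  # assumed set of interesting keys
data = []

def p_DATA(p):
    '''DATA : KEY COLON VALUE COMMA DATA
            | KEY COLON VALUE'''
    if p[1] in KEEP_KEYS:           # p[1] is the KEY token's value
        data.append((p[1], p[3]))   # keep only the interesting key/value pairs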
