I am attempting to create a search in PyMongo using a regex. After a match, I want the data appended to a list in the module. I thought I had everything set, but no matter what I set the regex to, it returns 0 results. The code is below:
REGEX = '.*\.com'

def myModule(self, data):
    # after importing everything and setting up the collection in the DB,
    # I call the following:
    cursor = collection.find({'multiple.layers.of.data': REGEX})
    matches = []
    for x in cursor:
        matches.append(x)
    return matches
This is just one of three modules I am using to filter through a huge number of JSON documents stored in MongoDB. However, no matter how I change the formatting, such as declaring /.*.com/ in the operation or using $regex in Mongo, it never finds my data and appends it to the list.
EDIT: Adding in the full code along with what I am trying to identify:
RegEx = '.*\.com'  # or RegEx = re.compile('.*\.com')

def filterData(self, data):
    db = self.client[self.dbName]
    collection = db[self.collectionName]
    cursor = collection.find({'data.item11.sub.level3': {'$regex': RegEx}})
    data = []
    for x in cursor:
        data.append(x)
    return data
I am attempting to parse through JSON data in a mongodb. The data is structured like so:
"data": {
"0": {
"item1": "something",
"item2": 0,
"item3": 000,
"item4": 000000000,
"item5": 000000000,
"item6": "0000",
"item7": 00,
"item8": "0000",
"item9": 00,
"item10": "useful",
"item11": {
"0000": {
"sub": {
"level": "letter",
"level1": 0000,
"level2": 0000000000,
"level3": "domain.com"
},
"more_data": "words"
}
}
}
UPDATE: After further testing it appears as though I need to include all of the layers in the search. Thus, it should look like
collection.find({'data.0.item11.0000.sub.level3': {'$regex': RegEx}}).
However, the "0" can be 1 - 50 and the "0000" is randomly generated. Is there a way to set these to index's as variables so that it will step into it no matter what the value? It will always be a number value.
Well, you need to tell MongoDB that the string should be treated as a regular expression, using the $regex operator:
cursor = collection.find({'multiple.layers.of.data' : {'$regex': REGEX}})
Simply replacing REGEX = '.*\.com' with import re; REGEX = re.compile(r'.*\.com') should also work, since the PyMongo driver serializes compiled patterns as BSON regular expressions.
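A minimal sketch of the compiled-pattern variant (assuming collection is already set up as in the question):

import re

# a compiled pattern can be passed directly; PyMongo serializes
# re.Pattern objects as BSON regular expressions
pattern = re.compile(r'.*\.com')
cursor = collection.find({'multiple.layers.of.data': pattern})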
EDIT:
Regarding the wildcard part of the question: the answer is no. In a nutshell, values that are unknown should never be assigned as keys, because it makes querying very inefficient. There are no "wild card" queries. It is better to restructure the database so that the unknown values are not keys.
See:
MongoDB wildcard in the key of a query
http://groups.google.com/group/mongodb-user/browse_thread/thread/32b00d38d50bd858
https://groups.google.com/forum/#!topic/mongodb-user/TnAQMe-5ZGs
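To illustrate that restructuring on the question's data, here is a sketch (the field names "index" and "id" are invented for the example):

doc = {
    "data": [
        {
            "index": 0,
            "item10": "useful",
            "item11": [
                {
                    "id": "0000",
                    "sub": {"level3": "domain.com"},
                    "more_data": "words"
                }
            ]
        }
    ]
}

# MongoDB's dot notation descends into arrays automatically, so this
# single query now matches regardless of the position or generated id:
cursor = collection.find({'data.item11.sub.level3': {'$regex': RegEx}})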
I have a script set up to pull JSON from an API, and I need to convert the objects into different columns for a single-row layout for a SQL server. See the example below for the raw body layout of an example object:
"answers": {
"agent_star_rating": {
"question_id": 145,
"question_text": "How satisfied are you with the service you received from {{ employee.first_name }} today?",
"comment": "John was exceptionally friendly and knowledgeable.",
"selected_options": {
"1072": {
"option_id": 1072,
"option_text": "5",
"integer_value": 5
}
}
},
In said example I need the output for all parts of agent_star_rating to be individual columns, so that all the data comes out as one row per survey on our SQL server. I have tried mapping several keys like so:
agent_star_rating = [list(response['answers']['agent_star_rating']['selected_options'].values())[0]['integer_value']]
agent_question = (response['answers']['agent_star_rating']['question_text'])
agent_comment = (response['answers']['agent_star_rating']['comment'])
response['agent_question'] = agent_question
response['agent_comment'] = agent_comment
response['agent_star_rating'] = agent_star_rating
I get the expected result until we reach a survey where a field like ['question_text'] has been skipped, and then we get a missing-key error. This happens across other objects too, and I am failing to come up with a solution for these missing keys. If there is a better way to format the output as I've described, beyond the key-mapping method I've used, I'd also love to hear ideas! I'm fresh to learning Python/pandas, so pardon any improper terminology!
I would do something like this:
# values that you always capture
row = ['value1', 'value2', ...]

# defaults: every expected attribute starts as an empty string
gottem_attrs = {'question_id': '',
                'question_text': '',
                'comment': '',
                'selected_options': ''}

# find and save the values that the response does have
for attr in response['answers']['agent_star_rating']:
    gottem_attrs[attr] = response['answers']['agent_star_rating'][attr]

# then you have your final row
final_row = row + list(gottem_attrs.values())
If the response has a value for an attribute, this code will save it. Otherwise, it will save an empty string for that value.
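For example, with a response where 'question_text' was skipped (the sample data here is made up), the missing field simply stays an empty string:

response = {
    'answers': {
        'agent_star_rating': {
            'question_id': 145,
            'comment': 'John was exceptionally friendly.'
        }
    }
}

row = ['value1', 'value2']
gottem_attrs = {'question_id': '', 'question_text': '',
                'comment': '', 'selected_options': ''}
for attr in response['answers']['agent_star_rating']:
    gottem_attrs[attr] = response['answers']['agent_star_rating'][attr]

final_row = row + list(gottem_attrs.values())
# ['value1', 'value2', 145, '', 'John was exceptionally friendly.', '']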
I need to pull data in from elasticsearch, do some cleaning/munging and export as table/rds.
To do this I have a long list of variable names required to pull from elasticsearch. This list of variables is required for the pull, but the issue is that not all fields may be represented within a given pull, meaning that I need to add the fields after the fact. I can do this using a schema (in nested json format) of the same list of variable names.
To try and [slightly] future proof this work I would ideally like to only maintain the list/schema in one place, and convert from list to schema (or vice-versa).
Is there a way to do this in python? Please see example below of input and desired output.
Small part of schema:
{
    "_source": {
        "filters": {"group": {"filter_value": 0}},
        "user": {
            "email": "",
            "uid": ""
        },
        "status": {
            "date": "",
            "active": True
        }
    }
}
Desired string list output:
[
    "_source.filters.group.filter_value",
    "_source.user.email",
    "_source.user.uid",
    "_source.status.date",
    "_source.status.active"
]
I believe that schema -> list might be an easier transformation than list -> schema, though am happy for it to be the other way round if that is simpler (though need to ensure the schema variables have the correct type, i.e. str, bool, float).
I have explored the following answers which come close, but I am struggling to understand since none appear to be in python:
Convert dot notation to JSON
Convert string with dot notation to JSON
Where d is your JSON as a dictionary:

def full_search(d):
    arr = []

    def dfs(d, curr):
        # leaf reached: the current value is not a dict (or the key is
        # missing), so curr is a complete dotted path
        if not type(d) == dict or curr[-1] not in d or type(d[curr[-1]]) != dict:
            arr.append(curr)
            return
        # otherwise, extend the path with every child key
        for key in d[curr[-1]].keys():
            dfs(d[curr[-1]], curr + [key])

    for key in d.keys():
        dfs(d, [key])
    return ['.'.join(x) for x in arr]
If d is a JSON string, use
import json
res = full_search(json.loads(d))
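For the other direction (list -> schema), here is a minimal sketch; it assumes you keep a default value per dotted path, so the schema fields come back with the correct types (str, bool, float, ...):

def to_schema(dotted):
    schema = {}
    for path, default in dotted.items():
        node = schema
        # walk/create intermediate dicts, then set the typed default on the leaf
        *parents, leaf = path.split('.')
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = default
    return schema

# example: each dotted path maps to a typed default
print(to_schema({
    "_source.filters.group.filter_value": 0,
    "_source.user.email": "",
    "_source.status.active": True,
}))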
I have an existing Python application, which logs like:
import logging
import json
logger = logging.getLogger()
some_var = 'abc'
data = {
    1: 2,
    'blah': ['hello']
}
logger.info(f"The value of some_var is {some_var} and data is {json.dumps(data)}")
So the logger.info function is given:
The value of some_var is abc and data is {"1": 2, "blah": ["hello"]}
Currently my logs go to AWS CloudWatch, which does some magic and renders this with indentation like:
The value of some_var is abc and data is {
    "1": 2,
    "blah": [
        "hello"
    ]
}
This makes the logs super clear to read.
Now I want to make some changes to my logging, handling it myself with another python script that wraps around my code and emails out logs when there's a failure.
What I want is some way of taking each log entry (or a stream/list of entries), and applying this indentation.
So I want a function which takes in a string, detects which subset(s) of that string are JSON, then inserts \n and indentation to pretty-print that JSON.
example input:
Hello, {"a": {"b": "c"}} is some json data, but also {"c": [1,2,3]} is too
example output
Hello,
{
    "a": {
        "b": "c"
    }
}
is some json data, but also
{
    "c": [
        1,
        2,
        3
    ]
}
is too
I have considered splitting up each entry into everything before and after the first {. Leave the left half as is, and pass the right half to json.dumps(json.loads(x), indent=4).
But what if there's stuff after the json object in the log file?
Ok, we can just select everything after the first { and before the last }.
Then pass the middle bit to the JSON library.
But what if there's two JSON objects in this log entry? (Like in the above example.) We'll have to use a stack to figure out whether any { appears after all prior { have been closed with a corresponding }.
But what if there's something like {"a": "\}"}. Hmm, ok we need to handle escaping.
Now I find myself having to write a whole json parser from scratch.
Is there any easy way to do this?
I suppose I could use a regex to replace every instance of json.dumps(x) in my whole repo with json.dumps(x, indent=4). But json.dumps is sometimes used outside logging statements, and it just makes all my logging lines that extra bit longer. Is there a neat elegant solution?
(Bonus points if it can parse and indent the json-like output that str(x) produces in python. That's basically json with single quotes instead of double.)
In order to extract JSON objects from a string, see this answer. The extract_json_objects() function from that answer will handle JSON objects, and nested JSON objects but nothing else. If you have a list in your log outside of a JSON object, it's not going to be picked up.
In your case, modify the function to also return the strings/text around all the JSON objects, so that you can put them all into the log together (or replace the logline):
from json import JSONDecoder

def extract_json_objects(text, decoder=JSONDecoder()):
    pos = 0
    while True:
        match = text.find('{', pos)
        if match == -1:
            yield text[pos:]  # return the remaining text
            break
        yield text[pos:match]  # modification for the non-JSON parts
        try:
            result, index = decoder.raw_decode(text[match:])
            yield result
            pos = match + index
        except ValueError:
            pos = match + 1
Use that function to process your loglines, add them to a list of strings, which you then join together to produce a single string for your output, logger, etc.:
import json

def jsonify_logline(line):
    line_parts = []
    for result in extract_json_objects(line):
        if isinstance(result, dict):  # got a JSON obj
            line_parts.append(json.dumps(result, indent=4))
        else:  # got text/non-JSON-obj
            line_parts.append(result)
    # (don't make that a list comprehension, quite un-readable)
    return ''.join(line_parts)
Example:
>>> demo_text = """Hello, {"a": {"b": "c"}} is some json data, but also {"c": [1,2,3]} is too"""
>>> print(jsonify_logline(demo_text))
Hello, {
    "a": {
        "b": "c"
    }
} is some json data, but also {
    "c": [
        1,
        2,
        3
    ]
} is too
>>>
Other things not directly related which would have helped:
Instead of using json.dumps(x) for all your log lines, follow the DRY principle and create a function like logdump(x) which does whatever you'd want to do, like json.dumps(x), json.dumps(x, indent=4), or jsonify_logline(x). That way, if you need to change the JSON format for all your logs, you just change that one function; no need for a mass "search & replace", which comes with its own issues and edge cases.
You can even add an optional parameter like pretty=True to decide whether you want it indented or not. A minimal sketch (the name logdump is just a suggestion; logger and data are as in the question):
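import json

def logdump(x, pretty=True):
    # one place to control how objects are serialized in log lines
    return json.dumps(x, indent=4 if pretty else None)

logger.info(f"The value of some_var is {some_var} and data is {logdump(data)}")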
You could mass search & replace all your existing loglines to do logger.blah(jsonify_logline(<previous log f-string or text>))
If you are JSON-dumping custom objects/class instances, give them a __str__ method that always outputs pretty-printed JSON, and a __repr__ that is non-pretty/compact. Then you wouldn't need to modify the log line at all: logger.info(f'here is my object {x}') directly invokes obj.__str__.
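A sketch of that idea (the class name and attribute are made up):

import json

class Payload:
    def __init__(self, data):
        self.data = data

    def __str__(self):
        # pretty-printed JSON for log lines
        return json.dumps(self.data, indent=4)

    def __repr__(self):
        # compact form for debugging
        return json.dumps(self.data)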
The block of code below works, but I'm not satisfied that it is very optimal, due to my limited understanding of JSON; I can't seem to figure out a more efficient method.
The steam_game_db is like this:
{
    "applist": {
        "apps": [
            {
                "appid": 5,
                "name": "Dedicated Server"
            },
            {
                "appid": 7,
                "name": "Steam Client"
            },
            {
                "appid": 8,
                "name": "winui2"
            },
            {
                "appid": 10,
                "name": "Counter-Strike"
            }
        ]
    }
}
and my Python code so far is
i = 0
x = 570
req_name_from_id = requests.get(steam_game_db)
j = req_name_from_id.json()
while j["applist"]["apps"][i]["appid"] != x:
    i += 1
returned_game = j["applist"]["apps"][i]["name"]
print(returned_game)
Instead of looping through the entire app list, is there a smarter way to search for it? Ideally the elements in the data structure would be numbered the same as their corresponding 'appid', i.e. element 570 in the list would be appid 570, Dota 2. In reality, element 570 in the data structure is appid 5069, Red Faction. Also, what type of data structure is this? Perhaps that has limited my searching for an answer already. (It seems like a dictionary with 'appid' and 'name' for each element?)
EDIT: Changed to a for loop as suggested
# returned_id: string holding the appid from another query
req_name_from_id = requests.get(steam_game_db)
j_2 = req_name_from_id.json()
for app in j_2["applist"]["apps"]:
    if app["appid"] == int(returned_id):
        returned_game = app["name"]
print(returned_game)
The most convenient way to access things by a key (like the app ID here) is to use a dictionary.
You pay a little extra performance cost up-front to fill the dictionary, but after that pulling out values by ID is basically free.
However, it's a trade-off. If you only want to do a single look-up during the life-time of your Python program, then paying that extra performance cost to build the dictionary won't be beneficial, compared to a simple loop like you already did. But if you want to do multiple look-ups, it will be beneficial.
# build dictionary
app_by_id = {}
for app in j["applist"]["apps"]:
    app_by_id[app["appid"]] = app["name"]

# use it (the appids are ints in the JSON, so look up with an int key)
print(app_by_id[570])
Also think about caching the JSON file on disk. This will save time during your program's startup.
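A minimal caching sketch (the file name steam_apps.json is arbitrary; steam_game_db is the API URL from the question):

import json
import os
import requests

CACHE = 'steam_apps.json'

if os.path.exists(CACHE):
    # reuse the copy on disk instead of hitting the API on every start
    with open(CACHE, encoding='utf-8') as fh:
        j = json.load(fh)
else:
    j = requests.get(steam_game_db).json()
    with open(CACHE, 'w', encoding='utf-8') as fh:
        json.dump(j, fh)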
It's better to have the JSON file on disk: you can dump it directly into a dictionary and start building up your lookup table. As an example, I've tried to maintain your logic while using the dict for lookups. Don't forget to specify the encoding, since the JSON has special characters in it.
Setup:
import json

apps = {}
with open('bigJson.json', encoding="utf-8") as handle:
    dictdump = json.loads(handle.read())

for item in dictdump['applist']['apps']:
    apps.setdefault(item['appid'], item['name'])
Usage 1:
That's the way you have used it
for appid in range(0, 570):
    if appid in apps:
        print(appid, apps[appid].encode("utf-8"))
Usage 2: That's how you can query a key; using get instead of [] will prevent a KeyError exception if the appid isn't recorded.
print(apps.get(570, 0))
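Equivalently, the lookup table can be built with a dict comprehension. Note the difference on duplicate appids: setdefault keeps the first name seen, while a comprehension keeps the last.

apps = {item['appid']: item['name'] for item in dictdump['applist']['apps']}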
We run an app that is highly dependent on location. So, we have six models: Country, Province, District, Sector, Cell and Village.
What I want is to generate JSON that represents them. What I tried already is quite long, but since the structure is the same, one chunk of the module will show my problem. So, each cell can have multiple villages inside it:
cells = database.bring('SELECT id,name FROM batteries_cell WHERE sector_id=' + str(sectorID))
if cells:
    for cell in cells:
        cellID = cell[0]
        cellName = cell[1]
        cell_pro = {'id': cellID, 'name': cellName, 'villages': {}}
        villages = database.bring('SELECT id,name FROM batteries_village WHERE cell_id=' + str(cellID))
        if villages:
            for village in villages:
                villageID = village[0]
                villageName = village[1]
                village_pro = {'id': villageID, 'name': villageName}
                cell_pro['villages'].update(village_pro)
However, the update just stores the last village for each cell. Any idea what I am doing wrong? I have been trying and deleting different approaches, only to end up with the same result.
UPDATE needed output is:
[
    {
        "id": 1,
        "name": "Ethiopia",
        "villages": [
            {
                "vid": 1,
                "vname": "one"
            },
            {
                "vid": 2,
                "vname": "village two"
            }
        ]
    },
    {
        "id": 2,
        "name": "Sene",
        "villages": [
            {
                "vid": 3,
                "vname": "third"
            },
            {
                "vid": 4,
                "vname": "fourth"
            }
        ]
    }
]
The update keeps overwriting the same keys in the cell_pro villages dict. For example, if village_pro is {'id':'1', 'name':'A'}, then cell_pro['villages'].update(village_pro) will set cell_pro['villages']['id'] = '1' and cell_pro['villages']['name'] = 'A'. The next village in the loop will overwrite the id and name with something else.
You probably either want to make cell_pro['villages'] into a list or keep it as a dict and add the villages keyed by id:
cell_pro['villages'][villageID] = village_pro
What format do you want the resulting JSON to be? Maybe you just want:
cell_pro['villages'][villageID] = villageName
EDITED FOR DESIRED JSON ADDED TO QUESTION:
In the JSON, the villages are in an array. For that we use a list in Python. Note that cell_pro['villages'] is now a list and we use append() to add to it.
cells = database.bring('SELECT id,name FROM batteries_cell WHERE sector_id=' + str(sectorID))
if cells:
    for cell in cells:
        cellID = cell[0]
        cellName = cell[1]
        cell_pro = {'id': cellID, 'name': cellName, 'villages': []}
        villages = database.bring('SELECT id,name FROM batteries_village WHERE cell_id=' + str(cellID))
        if villages:
            for village in villages:
                villageID = village[0]
                villageName = village[1]
                village_pro = {'vid': villageID, 'vname': villageName}
                cell_pro['villages'].append(village_pro)
TIP: I don't know what database access module you're using but it's generally bad practice to build SQL queries like that because the parameters may not be escaped properly and could lead to SQL injection attacks or errors. Most modules have a way to build query strings with bound parameters that automatically safely escape variables in the query string.
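For example, assuming database.bring can forward a parameter tuple to the driver's cursor.execute() (the placeholder style varies by driver: sqlite3 uses '?', psycopg2 and MySQLdb use '%s'), the query would look like this:

# bound parameter instead of string concatenation; the driver
# escapes the value safely
villages = database.bring(
    'SELECT id, name FROM batteries_village WHERE cell_id = %s',
    (cellID,)
)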