Analysis of fields in nested document elasticsearch - python

I am using a document with nested structure in it where the content is analysed in spite of my telling it "not analysed". The document is defined as follows:
class SearchDocument(es.DocType)
# Verblijfsobject specific data
gebruiksdoel_omschrijving = es.String(index='not_analyzed')
oppervlakte = es.Integer()
bouwblok = es.String(index='not_analyzed')
gebruik = es.String(index='not_analyzed')
panden = es.String(index='not_analyzed')
sbi_codes = es.Nested({
'properties': {
'sbi_code': es.String(index='not_analyzed'),
'hcat': es.String(index='not_analyzed'),
'scat': es.String(index='not_analyzed'),
'hoofdcategorie': es.String(fields= {'raw': es.String(in dex='not_analyzed')}),
'subcategorie': es.String(fields={'raw':es.String(index='not_analyzed')}),
'sub_sub_categorie': es.String(fields= {'raw': es.String(index='not_analyzed')}),
'bedrijfsnaam': es.String(fields= {'raw': es.String(index='not_analyzed')}),
'vestigingsnummer': es.String(index='not_analyzed')
}
})
As is clear, it says "not analysed" in the document for most fields. This works OK for the "regular fields". The problem is in the nested structure. There the hoofdcategorie and other fields are indexed for their separate words instead of the unanalysed version.
The structure is filled with the following data:
[
{
"sbi_code": "74103",
"sub_sub_categorie": "Interieur- en ruimtelijk ontwerp",
"vestigingsnummer": "000000002216",
"bedrijfsnaam": "Flippie Tests",
"subcategorie": "design",
"scat": "22279_12_22254_11",
"hoofdcategorie": "zakelijke dienstverlening",
"hcat": "22279_12"
},
{
"sbi_code": "9003",
"sub_sub_categorie": "Schrijven en overige scheppende kunsten",
"vestigingsnummer": "000000002216",
"bedrijfsnaam": "Flippie Tests",
"subcategorie": "kunst",
"scat": "22281_12_22259_11",
"hoofdcategorie": "cultuur, sport, recreatie",
"hcat": "22281_12"
}
]
Now when I retrieve aggregates it has split the hoofdcategorie in 3 different words ("cultuur", "sport", "recreatie"). This is not what I want, but as far as I know I have specified it correctly using the "not analysed" phrase.
Anyone any ideas?

Related

How do I access specific data in a nested JSON file with Python and Pandas

I am still a newbie with Python and working on my first REST API. I have a JSON file that has a few levels. When I create the data frame with pandas, no matter what I try I cannot access the level I need.
The API is built with Flask and has the correct parameters for the book, chapter and verse.
Below is a small example of the JSON data.
{
"book": "Python",
"chapters": [
{
"chapter": "1",
"verses": [
{
"verse": "1",
"text": "Testing"
},
{
"verse": "2",
"text": "Testing 2"
}
]
}
]
}
Here is my code:
#app.route("/api/v1/<book>/<chapter>/<verse>/")
def api(book, chapter, verse):
book = book.replace(" ", "").title()
df = pd.read_json(f"Python/{book}.json")
filt = (df['chapters']['chapter'] == chapter) & (df['chapters']['verses']['verse'] == verse)
text = df.loc[filt].to_json()
result_dictionary = {'Book': book, 'Chapter': chapter, "Verse": verse, "Text": text}
return result_dictionary
Here is the error I am getting:
KeyError
KeyError: 'chapter'
I have tried normalizing the data, using df.loc to filter and just trying to access the data directly.
Expecting that the API endpoint will allow the user to supply the book, chapter and verse as arguments and then it returns the text for the given position based on those parameters supplied.
You can first create a dataframe of the JSON and then query it.
import json
import pandas as pd
def api(book, chapter, verse):
# Read the JSON file
with open(f"Python/{book}.json", "r") as f:
data = json.load(f)
# Convert it into a DataFrame
df = pd.json_normalize(data, record_path=["chapters", "verses"], meta=["book", ["chapters", "chapter"]])
df.columns = ["Verse", "Text", "Book", "Chapter"] # rename columns
# Query the required content
query = f"Book == '{book}' and Chapter == '{chapter}' and Verse == '{verse}'"
result = df.query(query).to_dict(orient="records")[0]
return result
Here df would look like this after json_normalize:
Verse Text Book Chapter
0 1 Testing Python 1
1 2 Testing 2 Python 1
2 1 Testing Python 2
3 2 Testing 2 Python 2
And result is:
{'Verse': '2', 'Text': 'Testing 2', 'Book': 'Python', 'Chapter': '1'}
You are trying to access a list in a dict with a dict key ?
filt = (df['chapters'][0]['chapter'] == "chapter") & (df['chapters'][0]['verses'][0]['verse'] == "verse")
Will get a result.
But df.loc[filt] requires a list with (boolean) filters and above only gerenerates one false or true, so you can't filter with that.
You can filter like:
df.from_dict(df['chapters'][0]['verses']).query("verse =='1'")
One of the issues here is that "chapters" is a list
"chapters": [
This is why ["chapters"]["chapter"] wont work as you intend.
If you're new to this, it may be helpful to "normalize" the data yourself:
import json
with open("book.json") as f:
book = json.load(f)
for chapter in book["chapters"]:
for verse in chapter["verses"]:
row = book["book"], chapter["chapter"], verse["verse"], verse["text"]
print(repr(row))
('Python', '1', '1', 'Testing')
('Python', '1', '2', 'Testing 2')
It is possible to pass this to pd.DataFrame()
df = pd.DataFrame(
([book["book"], chapter["chapter"], verse["verse"], verse["text"]]
for verse in chapter["verses"]
for chapter in book["chapters"]),
columns=["Book", "Chapter", "Verse", "Text"]
)
Book Chapter Verse Text
0 Python 1 1 Testing
1 Python 1 2 Testing 2
Although it's not clear if you need a dataframe here at all.

How do i get the document id in Marqo?

i added a document to marqo add_documents() but i didn't pass an id and now i am trying to get the document but i don't know what the document_id is?
Here is what my code look like:
mq = marqo.Client(url='http://localhost:8882')
mq.index("my-first-index").add_documents([
{
"Title": title,
"Description": document_body
}]
)
i tried to check whether the document got added or not but ;
no_of_docs = mq.index("my-first-index").get_stats()
print(no_of_docs)
i got;
{'numberOfDocuments': 1}
meaning it was added.
if you don't add the "_id" as part of key/value then by default marqo will generate a random id for you, to access it you can search the document using the document's Title,
doc = mq.index("my-first-index").search(title_of_your_document, searchable_attributes=['Title'])
you should get a dictionary as the result something like this;
{'hits': [{'Description': your_description,
'Title': title_of_your_document,
'_highlights': relevant part of the doc,
'_id': 'ac14f87e-50b8-43e7-91de-ee72e1469bd3',
'_score': 1.0}],
'limit': 10,
'processingTimeMs': 122,
'query': 'The Premier League'}
the part that says _id is the id of your document.

Structuring request JSON for API

I'm building a small API to interact with our database for other projects. I've built the database and have the API functioning fine, however, the data I get back isn't structured how I want it.
I am using Python with Flask/Flask-Restful for the API.
Here is a snippet of my Python that handles the interaction:
class Address(Resource):
def get(self, store):
print('Received a request at ADDRESS for Store ' + store )
conn = sqlite3.connect('store-db.db')
cur = conn.cursor()
addresses = cur.execute('SELECT * FROM Sites WHERE StoreNumber like ' + store)
for adr in addresses:
return(adr, 200)
If I make a request to the /sites/42 endpoint, where 42 is the site id, this is what I'll receive:
[
"42",
"5000 Robinson Centre Drive",
"",
"Pittsburgh",
"PA",
"15205",
"(412) 787-1330",
"(412) 249-9161",
"",
"Dick's Sporting Goods"
]
Here is how it is structured in the database:
Ultimately I'd like to use the column name as the Key in the JSON that's received, but I need a bit of guidance in the right direction so I'm not Googling ambiguous terms hoping to find something.
Here is an example of what I'd like to receive after making a request to that endpoint:
{
"StoreNumber": "42",
"Street": "5000 Robinson Centre Drive",
"StreetSecondary": "",
"City": "Pittsburgh",
"State": "PA",
"ZipCode": "15205",
"ContactNumber": "(412) 787-1330",
"XO_TN": "(412) 249-9161",
"RelocationStatus": "",
"StoreType": "Dick's Sporting Goods"
}
I'm just looking to get some guidance on if I should change how my data is structured in the database (i.e. I've seen some just put the JSON in their database, but I think that's messy) or if there's a more intuitive method I could use to control my data.
Updated Code using Accepted Answer
class Address(Resource):
def get(self, store):
print('Received a request at ADDRESS for Store ' + store )
conn = sqlite3.connect('store-db.db')
cur = conn.cursor()
addresses = cur.execute('SELECT * FROM Sites WHERE StoreNumber like ' + store)
for r in res:
column_names = ["StoreNumber", "Street", "StreetSecondary","City","State", "ZipCode", "ContactNumber", "XO_TN", "RelocationStatus", "StoreType"]
data = [r[0], r[1], r[2], r[3], r[4], r[5], r[6], r[7], r[8]]
datadict = {column_names[itemindex]:item for itemindex, item in enumerate(data)}
return(datadict, 200)
You could just convert your list to a dict and then parse it to a JSON string before passing it back out.
// These are the names of the columns in your database
>>> column_names = ["storeid", "address", "etc"]
// This is the data coming from the database.
// All data is passed as you are using SELECT * in your query
>>> data = [42, "1 the street", "blah"]
// This is a quick notation for creating a dict from a list
// enumerate means we get a list index and a list item
// as the columns are in the same order as the data, we can use the list index to pull out the column_name
>>> datadict = {column_names[itemindex]:item for itemindex, item in enumerate(data)}
//This just prints datadict in my terminal
>>> datadict
We now have a named dict containing your data and the column names.
{'etc': 'blah', 'storeid': 42, 'address': '1 the street'}
Now dump the datadict to a string so that it can be sent to the frontend.
>>> import json
>>> json.dumps(datadict)
The dict has now been converted to a string.
'{"etc": "blah", "storeid": 42, "address": "1 the street"}'
This would require no change to your database but the script would need to know about the column names or retrieve them dynamically using some SQL.
If the data in the database is in the correct format for passing to the frontend then you shouldn't need to change the database structure. If it was not in the correct format then you could either change the way it was stored or change your SQL query to manipulate it.

Get hyperlink from a cell in google Sheet api V4

I want to get the hyperlink of a cell (A1, for example) in Python. I have this code so far. Thanks
properties = {
"requests": [
{
"cell": {
"HyperlinkDisplayType": "LINKED"
},
"fields": "userEnteredFormat.HyperlinkDisplayType"
}
]
}
result = service.spreadsheets().values().get(
spreadsheetId=spreadsheet_id, range=rangeName, body=properties).execute()
values = result.get('values', [])
How about using sheets.spreadsheets.get? This sample script supposes that service of your script has already been able to be used for spreadsheets().values().get().
Sample script :
spreadsheetId = '### Spreadsheet ID ###'
range = ['sheet1!A1:A1'] # This is a sample.
result = service.spreadsheets().get(
spreadsheetId=spreadsheetId,
ranges=range,
fields="sheets/data/rowData/values/hyperlink"
).execute()
If this was not useful for you, I'm sorry.
It seems to me like this is the only way to actually get the link info (address as well as display text):
result = service.spreadsheets().values().get(
spreadsheetId=spreadsheetId, range=range_name,
valueRenderOption='FORMULA').execute()
values = results.get('values', [])
This returns the raw content of the cells which for hyperlinks look like this for each cell:
'=HYPERLINK("sample-link","http://www.sample.com")'
For my use I've parsed it with the following simple regex:
r'=HYPERLINK\("(.*?)","(.*?)"\)'
You can check the hyperlink if you add at the end:
print (values[0])

Dynamically create list/dict during for loop for conversion to JSON

I am trying to build a list/dict that will be converted to JSON later on. I am trying to write the code that builds and populates the multiple levels of the JSON format I ultimately need. I am having an issue wrapping my head around this. Thank you for the help.
What I ultimately need -> Populate this list/dict:
dataset_permission_json = []
with this format:
{
"projects":[
{
"project":"test-project-1",
"datasets":[
{
"dataset":"testing1",
"permissions":[
{
"role":"READER",
"google_group":"testing1#test.com"
}
]
},
{
"dataset":"testing2",
"permissions":[
{
"role":"OWNER",
"google_group":"testing2#test.com"
}
]
},
{
"dataset":"testing3",
"permissions":[
{
"role":"READER",
"google_group":"testing3#test.com"
}
]
},
{
"dataset":"testing4",
"permissions":[
{
"role":"WRITER",
"google_group":"testing4#test.com"
}
]
}
]
}
]
}
I have multiple for loops that successfully print out the information I am pulling from an external API but I to be able to enter that data into the list/dict. The dynamic values I am trying to input are:
'project' i.e. test-project-1
'dataset' i.e. testing1
'role' i.e. READER
'google_group' i.e. testing1#test.com
I have tried things like:
dataset_permission_json.update({'project': project})
but cannot figure out how not to overwrite the data during the multiple for loops.
for project in projects:
print(project) ## Need to add this variable to 'projects'
for bq_group in bq_groups:
delegated_credentials = credentials.create_delegated(bq_group)
http_auth = delegated_credentials.authorize(Http())
list_datasets_in_project = bigquery_service.datasets().list(projectId=project).execute()
datasets = list_datasets_in_project.get('datasets',[])
print(dataset['datasetReference']['datasetId']) ##Add the dataset to 'datasets' under the project
for dataset in datasets:
get_dataset_permissions_result = bigquery_service.datasets().get(projectId=project, datasetId=dataset['datasetReference']['datasetId']).execute()
dataset_permissions = get_dataset_permissions_result.get('access',[])
### ADD THE NEXT LEVEL 'permissions' level here?
for dataset_permission in dataset_permissions:
if 'groupByEmail' in dataset_permission:
if bq_group in dataset_permission['groupByEmail']:
print(dataset['datasetReference']['datasetId'] && dataset_permission['groupByEmail']) ##Add to each dataset
I appreciate the help.
EDIT: Updated Progress
Ok I have created the nested structure that I was looking for using StackOverflow
Things are great except for the last part. I am trying to append the role & group to each 'permission' nest, but after everything runs the data is only appended to the last 'permission' nest in the JSON structure. It seems like it is overwriting itself during the for loop. Thoughts?
Updated for loop:
for project in projects:
for bq_group in bq_groups:
delegated_credentials = credentials.create_delegated(bq_group)
http_auth = delegated_credentials.authorize(Http())
list_datasets_in_project = bigquery_service.datasets().list(projectId=project).execute()
datasets = list_datasets_in_project.get('datasets',[])
for dataset in datasets:
get_dataset_permissions_result = bigquery_service.datasets().get(projectId=project, datasetId=dataset['datasetReference']['datasetId']).execute()
dataset_permissions = get_dataset_permissions_result.get('access',[])
for dataset_permission in dataset_permissions:
if 'groupByEmail' in dataset_permission:
if bq_group in dataset_permission['groupByEmail']:
dataset_permission_json['projects'][project]['datasets'][dataset['datasetReference']['datasetId']]['permissions']
permission = {'group': dataset_permission['groupByEmail'],'role': dataset_permission['role']}
dataset_permission_json['permissions'] = permission
UPDATE: Solved.
dataset_permission_json['projects'][project]['datasets'][dataset['datasetReference']['datasetId']]['permissions']
permission = {'group': dataset_permission['groupByEmail'],'role': dataset_permission['role']}
dataset_permission_json['projects'][project]['datasets'][dataset['datasetReference']['datasetId']]['permissions'] = permission

Categories

Resources