I'm trying to get all documents from an index using the Python client, but the result shows me only the first document.
This is my Python code:
res = es.search(index="92c603b3-8173-4d7a-9aca-f8c115ff5a18", doc_type="doc", body={
    'size': 10000,
    'query': {
        'match_all': {}
    }
})
print("%d documents found" % res['hits']['total'])
data = [doc for doc in res['hits']['hits']]
for doc in data:
    print(doc)
    return "%s %s %s" % (doc['_id'], doc['_source']['0'], doc['_source']['5'])
try "_doc" instead of "doc"
res = es.search(index="92c603b3-8173-4d7a-9aca-f8c115ff5a18", doc_type="_doc", body={
    'size': 100,
    'query': {
        'match_all': {}
    }
})
Elasticsearch by default retrieves only 10 documents. You can change this behaviour (see the doc here). The best practices for pagination are the search_after query and the scroll query; which one fits depends on your needs (see the search_after sketch below). Please read this answer: Elastic search not giving data with big number for page size
To show all the results:
for doc in res['hits']['hits']:
    print(doc['_id'], doc['_source'])
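As a rough sketch of the search_after approach mentioned above (the index name and the timestamp sort field are placeholders; use any unique, sortable field from your own mapping):
from elasticsearch import Elasticsearch

es = Elasticsearch()
body = {
    "size": 1000,
    "query": {"match_all": {}},
    "sort": [{"timestamp": "asc"}]  # placeholder: any unique, sortable field works
}
all_hits = []
res = es.search(index="my-index", body=body)
while res["hits"]["hits"]:
    all_hits.extend(res["hits"]["hits"])
    # continue from the sort values of the last hit on the previous page
    body["search_after"] = res["hits"]["hits"][-1]["sort"]
    res = es.search(index="my-index", body=body)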
You can try the following query. It will return all matching documents (keep in mind that without a size parameter Elasticsearch returns only the first 10 hits).
result = es.search(index="index_name", body={"query":{"match_all":{}}})
You can also use elasticsearch_dsl and its Search API which allows you to iterate over all your documents via the scan method.
import elasticsearch
from elasticsearch_dsl import Search
client = elasticsearch.Elasticsearch()
search = Search(using=client, index="92c603b3-8173-4d7a-9aca-f8c115ff5a18")
for hit in search.scan():
    print(hit)
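If you need plain dictionaries rather than Hit objects, each hit can usually be converted with hit.to_dict().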
I don't see it mentioned that the index must be refreshed if you have just added data. Use this:
es.indices.refresh(index="index_name")
New to programming.
Question: I need to use the below "Data" (two rows as arrays) queried from SQL to create the message structure below.
Data from SQL using fetchall():
Data = [[100,1,4,5],[101,1,4,6]]
##expected message structure
message = {
    "name": "Tom",
    "Job": "IT",
    "info": [
        {
            "id_1": "100",
            "id_2": "1",
            "id_3": "4",
            "id_4": "5"
        },
        {
            "id_1": "101",
            "id_2": "1",
            "id_3": "4",
            "id_4": "6"
        },
    ]
}
I tried to create the below method to iterate over the rows and then fill in the values. This was just a start, but it was also not working:
def create_message(data):
    for row in data:
        {
            "id_1": str(data[0][0]),
            "id_2": str(data[0][1]),
            "id_3": str(data[0][2]),
            "id_4": str(data[0][3]),
        }
Latest Code
def create_info(data):
    info = []
    for row in data:
        temp_dict = {"id_1_tom": "", "id_2_hell": "", "id_3_trip": "", "id_4_clap": ""}
        for i in range(0, 1):
            temp_dict["id_1_tom"] = str(row[i])
            temp_dict["id_2_hell"] = str(row[i+1])
            temp_dict["id_3_trip"] = str(row[i+2])
            temp_dict["id_4_clap"] = str(row[i+3])
        info.append(temp_dict)
    return info
Edit: Updated answer based on updates to the question and comment by original poster.
This function might work for the example you've given to get the desired output, based on the attempt you've provided:
def create_info(data):
    info = []
    for row in data:
        temp_dict = {}
        temp_dict['id_1_tom'] = str(row[0])
        temp_dict['id_2_hell'] = str(row[1])
        temp_dict['id_3_trip'] = str(row[2])
        temp_dict['id_4_clap'] = str(row[3])
        info.append(temp_dict)
    return info
For the input:
[[100, 1, 4, 5],[101,1,4,6]]
This function will return a list of dictionaries:
[{"id_1_tom":"100","id_2_hell":"1","id_3_trip":"4","id_4_clap":"5"},
{"id_1_tom":"101","id_2_hell":"1","id_3_trip":"4","id_4_clap":"6"}]
This can serve as the value for the key info in your dictionary message. Note that you would still have to construct the message dictionary.
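For completeness, a minimal sketch of assembling the full message around that list (the "name" and "Job" values are just the literals from the expected structure in the question):
Data = [[100, 1, 4, 5], [101, 1, 4, 6]]

message = {
    "name": "Tom",
    "Job": "IT",
    "info": create_info(Data)  # the list of per-row dicts built by the function above
}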
I have an index in Elasticsearch which holds a huge amount of data. I am trying to load some of it (more than 10,000 records) into Python for further processing. According to the documentation and web searches, scroll should be used for this, but it is only able to fetch a few records. After some time, this exception occurs:
errorNotFoundError(404, 'search_phase_execution_exception', 'No search context found for id [101781]')
My code is as follows:
from elasticsearch import Elasticsearch
##########elastic configuration
host='localhost'
port=9200
user=''
pasw=''
el_index_name = 'test'
es = Elasticsearch([{'host':host , 'port': port}], http_auth=(user,pasw))
res = es.search(index=el_index_name, body={"query": {"match_all": {}}},scroll='10m')
rows = []
while True:
    try:
        rows.append(es.scroll(scroll_id=res['_scroll_id'])['hits']['hits'])
    except Exception as esl:
        print('error{}'.format(esl))
        break

# deleting scroll
es.clear_scroll(scroll_id=res['_scroll_id'])
I have changed the value of scroll='10m' but still, this exception occurs.
You need to change your scroll request line to this:
rows.append(es.scroll(scroll_id=res['_scroll_id'], body={"scroll": "10m","scroll_id": res['_scroll_id']})['hits']['hits'])
As a piece of advice, it is better to increase the number of documents retrieved per request. Fetching only a handful of documents in each request has a negative influence on performance and adds overhead to your cluster as well. As an example:
{
    "query": {
        "match_all": {}
    },
    "size": 100
}
I have added the part below to answer the question in the comments. It is not stopping because you have used while True in your code. You need to change it to this:
res = es.search(index=el_index_name, body={"query": {"match_all": {}}}, scroll='10m')
scroll_id = res['_scroll_id']
query = {
    "scroll": "10m",
    "scroll_id": scroll_id
}
rows = []
while len(res['hits']['hits']):
    for item in res['hits']['hits']:
        rows.append(item)
    res = es.scroll(scroll_id=scroll_id, body=query)
Please let me know if there was any problem with this.
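If you only need to iterate over every document and don't want to manage the scroll cursor yourself, the helpers.scan generator in the official client wraps this whole loop; a minimal sketch reusing the connection and index name from the question:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch([{'host': host, 'port': port}], http_auth=(user, pasw))
rows = [hit for hit in helpers.scan(es,
                                    index=el_index_name,
                                    query={"query": {"match_all": {}}},
                                    scroll='10m',
                                    size=1000)]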
I want to get the hyperlink of a cell (A1, for example) in Python. I have this code so far. Thanks
properties = {
    "requests": [
        {
            "cell": {
                "HyperlinkDisplayType": "LINKED"
            },
            "fields": "userEnteredFormat.HyperlinkDisplayType"
        }
    ]
}
result = service.spreadsheets().values().get(
    spreadsheetId=spreadsheet_id, range=rangeName, body=properties).execute()
values = result.get('values', [])
How about using sheets.spreadsheets.get? This sample script assumes that the service object in your script can already be used for spreadsheets().values().get().
Sample script:
spreadsheetId = '### Spreadsheet ID ###'
range = ['sheet1!A1:A1']  # This is a sample.
result = service.spreadsheets().get(
    spreadsheetId=spreadsheetId,
    ranges=range,
    fields="sheets/data/rowData/values/hyperlink"
).execute()
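With that fields mask, the hyperlink (when the cell has one) comes back nested under sheets → data → rowData → values. A hedged sketch of pulling out the first cell's link, assuming that response shape:
sheets = result.get('sheets', [])
hyperlink = None
if sheets and sheets[0].get('data'):
    row_data = sheets[0]['data'][0].get('rowData', [])
    if row_data and row_data[0].get('values'):
        hyperlink = row_data[0]['values'][0].get('hyperlink')
print(hyperlink)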
If this was not useful for you, I'm sorry.
It seems to me like this is the only way to actually get the link info (address as well as display text):
result = service.spreadsheets().values().get(
    spreadsheetId=spreadsheetId, range=range_name,
    valueRenderOption='FORMULA').execute()
values = result.get('values', [])
This returns the raw content of the cells, which for hyperlinks looks like this for each cell:
'=HYPERLINK("http://www.sample.com","sample link")'
For my use I've parsed it with the following simple regex:
r'=HYPERLINK\("(.*?)","(.*?)"\)'
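Applied to one raw cell value, a quick sketch with Python's re module (the URL and label are made up):
import re

raw = '=HYPERLINK("http://www.sample.com","sample link")'
match = re.match(r'=HYPERLINK\("(.*?)","(.*?)"\)', raw)
if match:
    url, label = match.group(1), match.group(2)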
You can check the hyperlink if you add this at the end:
print(values[0])
I have a .json.gz file that I wish to load into elastic search.
My first attempt involved using the json module to convert the JSON to a list of dicts.
import gzip
import json
from pprint import pprint
from elasticsearch import Elasticsearch
nodes_f = gzip.open("nodes.json.gz")
nodes = json.load(nodes_f)
Dict example:
pprint(nodes[0])
{u'index': 1,
u'point': [508163.122, 195316.627],
u'tax': u'fehwj39099'}
Using Elasticsearch:
es = Elasticsearch()
data = es.bulk(index="index",body=nodes)
However, this returns:
elasticsearch.exceptions.RequestError: TransportError(400, u'illegal_argument_exception', u'Malformed action/metadata line [1], expected START_OBJECT or END_OBJECT but found [VALUE_STRING]')
Beyond this, I wish to be able to find the tax for a given point query, in case this has an impact on how I should be indexing the data with Elasticsearch.
Alfe pointed me in the right direction, but I couldn't get his code to work.
I found two solutions:
Line by line with a for loop:
es = elasticsearch.Elasticsearch()
for node in nodes:
    _id = node['index']
    es.index(index='nodes', doc_type='external', id=_id, body=node)
In bulk, using the helper:
from elasticsearch import helpers

actions = [
    {
        "_index": "nodes_bulk",
        "_type": "external",
        "_id": str(node['index']),
        "_source": node
    }
    for node in nodes
]
helpers.bulk(es, actions)
Bulk was around 22 times faster for a list of 343724 dicts.
Here is my working code using bulk api:
Define a list of dicts:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
doc = [{'_id': 1, 'price': 10, 'productID': 'XHDK-A-1293-#fJ3'},
       {'_id': 2, 'price': 20, 'productID': 'KDKE-B-9947-#kL5'},
       {'_id': 3, 'price': 30, 'productID': 'JODL-X-1937-#pV7'},
       {'_id': 4, 'price': 30, 'productID': 'QQPX-R-3956-#aD8'}]
helpers.bulk(es, doc, index='products', doc_type='_doc', request_timeout=200)
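Note that the default bulk action is index, so a document whose _id already exists in the target index is simply overwritten.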
The ES bulk helper showed several problems for us, including performance trouble and not being able to set specific _ids. But since the bulk API of ES is not very complicated, we did it ourselves:
import json
import requests

# url should point to the index's _bulk endpoint; the payload is newline-delimited JSON
# and must end with a trailing newline.
headers = {'Content-type': 'application/json',
           'Accept': 'text/plain'}
jsons = []
for d in docs:
    _id = d.pop('_id')  # take _id out of dict
    jsons.append('{"index":{"_id":"%s"}}\n%s\n' % (_id, json.dumps(d)))
data = ''.join(jsons)
response = requests.post(url, data=data, headers=headers)
We needed to set a specific _id but I guess you can skip this part in case you want a random _id set by ES automatically.
Hope that helps.
Following the documentation, I'm trying to create an update statement that will update, or add if it does not exist, a single attribute in a DynamoDB table.
I'm trying this:
response = table.update_item(
    Key={'ReleaseNumber': '1.0.179'},
    UpdateExpression='SET',
    ConditionExpression='Attr(\'ReleaseNumber\').eq(\'1.0.179\')',
    ExpressionAttributeNames={'attr1': 'val1'},
    ExpressionAttributeValues={'val1': 'false'}
)
The error I'm getting is:
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the UpdateItem operation: ExpressionAttributeNames contains invalid key: Syntax error; key: "attr1"
If anyone has done anything similar to what I'm trying to achieve please share example.
Found a working example here. It is very important to list all the key attributes of the table in Key; this requires an additional query before the update, but it works.
response = table.update_item(
    Key={
        'ReleaseNumber': releaseNumber,
        'Timestamp': result[0]['Timestamp']
    },
    UpdateExpression="set Sanity = :r",
    ExpressionAttributeValues={
        ':r': 'false',
    },
    ReturnValues="UPDATED_NEW"
)
Details on dynamodb updates using boto3 seem incredibly sparse online, so I'm hoping these alternative solutions are useful.
get / put
import boto3
table = boto3.resource('dynamodb').Table('my_table')
# get item
response = table.get_item(Key={'pkey': 'asdf12345'})
item = response['Item']
# update
item['status'] = 'complete'
# put (idempotent)
table.put_item(Item=item)
actual update
import boto3

table = boto3.resource('dynamodb').Table('my_table')
table.update_item(
    Key={'pkey': 'asdf12345'},
    AttributeUpdates={
        'status': 'complete',
    },
)
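For reference, AttributeUpdates is documented by AWS as a legacy parameter; a sketch of the same update written with UpdateExpression (same made-up key and attribute, with a # placeholder because status is a reserved word) would be:
import boto3

table = boto3.resource('dynamodb').Table('my_table')
table.update_item(
    Key={'pkey': 'asdf12345'},
    UpdateExpression='SET #s = :s',
    ExpressionAttributeNames={'#s': 'status'},
    ExpressionAttributeValues={':s': 'complete'},
)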
If you don't want to build the update parameter by parameter, I wrote a cool function that returns the parameters needed to perform an update_item call using boto3.
def get_update_params(body):
    """Given a dictionary we generate an update expression and a dict of values
    to update a dynamodb table.

    Params:
        body (dict): Parameters to use for formatting.

    Returns:
        update expression, dict of values.
    """
    update_expression = ["set "]
    update_values = dict()
    for key, val in body.items():
        update_expression.append(f" {key} = :{key},")
        update_values[f":{key}"] = val
    return "".join(update_expression)[:-1], update_values
Here is a quick example:
def update(body):
    a, v = get_update_params(body)
    response = table.update_item(
        Key={'uuid': str(uuid)},
        UpdateExpression=a,
        ExpressionAttributeValues=dict(v)
    )
    return response
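As a rough illustration with hypothetical attribute names, the helper produces output like this:
expr, values = get_update_params({"age": 25, "city": "Paris"})
# expr   -> "set  age = :age, city = :city"
# values -> {":age": 25, ":city": "Paris"}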
The original code example:
response = table.update_item(
    Key={'ReleaseNumber': '1.0.179'},
    UpdateExpression='SET',
    ConditionExpression='Attr(\'ReleaseNumber\').eq(\'1.0.179\')',
    ExpressionAttributeNames={'attr1': 'val1'},
    ExpressionAttributeValues={'val1': 'false'}
)
Fixed:
from boto3.dynamodb.conditions import Attr

response = table.update_item(
    Key={'ReleaseNumber': '1.0.179'},
    UpdateExpression='SET #attr1 = :val1',
    ConditionExpression=Attr('ReleaseNumber').eq('1.0.179'),
    ExpressionAttributeNames={'#attr1': 'val1'},
    ExpressionAttributeValues={':val1': 'false'}
)
In the marked answer it was also revealed that there is a range key, so that should also be included in the Key. The update_item method must address the exact record to be updated; there are no batch updates, and you can't update a range of values filtered to a condition to get to a single record. The ConditionExpression is useful for making updates idempotent, i.e. don't update the value if it is already that value; it's not like a SQL WHERE clause.
Regarding the specific error seen:
ExpressionAttributeNames is a dict of attribute-name placeholders for use in the UpdateExpression, useful if the attribute name is a reserved word.
From the docs, "An expression attribute name must begin with a #, and be followed by one or more alphanumeric characters". The error occurs because the code hasn't used an ExpressionAttributeName that starts with a # and also hasn't used it in the UpdateExpression.
ExpressionAttributeValues are placeholders for the values you want to update to, and they must start with a :.
Based on the official example, here's a simple and complete solution which could be used to manually update (not something I would recommend) a table used by a terraform S3 backend.
Let's say this is the table data as shown by the AWS CLI:
$ aws dynamodb scan --table-name terraform_lock --region us-east-1
{
    "Items": [
        {
            "Digest": {
                "S": "2f58b12ae16dfb5b037560a217ebd752"
            },
            "LockID": {
                "S": "tf-aws.tfstate-md5"
            }
        }
    ],
    "Count": 1,
    "ScannedCount": 1,
    "ConsumedCapacity": null
}
You could update it to a new digest (say you rolled back the state) as follows:
import boto3

dynamodb = boto3.resource('dynamodb', 'us-east-1')
try:
    table = dynamodb.Table('terraform_lock')
    response = table.update_item(
        Key={
            "LockID": "tf-aws.tfstate-md5"
        },
        UpdateExpression="set Digest=:newDigest",
        ExpressionAttributeValues={
            ":newDigest": "50a488ee9bac09a50340c02b33beb24b"
        },
        ReturnValues="UPDATED_NEW"
    )
except Exception as msg:
    print(f"Oops, could not update: {msg}")
Note the : at the start of ":newDigest": "50a488ee9bac09a50340c02b33beb24b"; it's easy to miss or forget.
A small update to Jam M. Hernandez Quiceno's answer, which includes ExpressionAttributeNames to prevent encountering errors such as:
"errorMessage": "An error occurred (ValidationException) when calling the UpdateItem operation:
Invalid UpdateExpression: Attribute name is a reserved keyword; reserved keyword: timestamp",
def get_update_params(body):
    """
    Given a dictionary of key-value pairs to update an item with in DynamoDB,
    generate three objects to be passed to UpdateExpression, ExpressionAttributeValues,
    and ExpressionAttributeNames respectively.
    """
    update_expression = []
    attribute_values = dict()
    attribute_names = dict()
    for key, val in body.items():
        update_expression.append(f" #{key.lower()} = :{key.lower()}")
        attribute_values[f":{key.lower()}"] = val
        attribute_names[f"#{key.lower()}"] = key
    return "set " + ", ".join(update_expression), attribute_values, attribute_names
Example use:
update_expression, attribute_values, attribute_names = get_update_params(
    {"Status": "declined", "DeclinedBy": "username"}
)
response = table.update_item(
    Key={"uuid": "12345"},
    UpdateExpression=update_expression,
    ExpressionAttributeValues=attribute_values,
    ExpressionAttributeNames=attribute_names,
    ReturnValues="UPDATED_NEW"
)
print(response)
An example that updates any number of attributes given as a dict and keeps track of the number of updates. It works with reserved words (e.g. name).
The following attribute names shouldn't be used as we will overwrite the value: _inc, _start.
from typing import Dict
from boto3 import Session


def getDynamoDBSession(region: str = "eu-west-1"):
    """Connect to DynamoDB resource from boto3."""
    return Session().resource("dynamodb", region_name=region)


DYNAMODB = getDynamoDBSession()


def updateItemAndCounter(db_table: str, item_key: Dict, attributes: Dict) -> Dict:
    """
    Update item or create new. If the item already exists, return the previous value and
    increase the counter: update_counter.
    """
    table = DYNAMODB.Table(db_table)

    # Init update-expression
    update_expression = "SET"

    # Build expression-attribute-names, expression-attribute-values, and the update-expression
    expression_attribute_names = {}
    expression_attribute_values = {}
    for key, value in attributes.items():
        update_expression += f' #{key} = :{key},'  # Notice the "#" to solve issue with reserved keywords
        expression_attribute_names[f'#{key}'] = key
        expression_attribute_values[f':{key}'] = value

    # Add counter start and increment attributes
    expression_attribute_values[':_start'] = 0
    expression_attribute_values[':_inc'] = 1

    # Finish update-expression with our counter
    update_expression += " update_counter = if_not_exists(update_counter, :_start) + :_inc"

    return table.update_item(
        Key=item_key,
        UpdateExpression=update_expression,
        ExpressionAttributeNames=expression_attribute_names,
        ExpressionAttributeValues=expression_attribute_values,
        ReturnValues="ALL_OLD"
    )
Hope it might be useful to someone!
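A hypothetical call, just to show the expected shape of the arguments (the table name, key, and attributes are made up):
result = updateItemAndCounter(
    db_table="my-table",
    item_key={"pk": "user#123"},
    attributes={"name": "Alice", "city": "Paris"}
)
print(result.get("Attributes"))  # previous values, if the item already existed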
In a simple way, you can use the code below to update an item value with a new one:
response = table.update_item(
    Key={"my_id_name": "my_id_value"},  # to get the record
    UpdateExpression="set item_key_name=:item_key_value",  # operation action (set)
    ExpressionAttributeValues={":item_key_value": "new_value"},  # the value you need to update to
    ReturnValues="UPDATED_NEW"  # optional, to return the updated attributes
)
Simple example with multiple fields:
import boto3

dynamodb_client = boto3.client('dynamodb')

dynamodb_client.update_item(
    TableName=table_name,
    Key={
        'PK1': {'S': 'PRIMARY_KEY_VALUE'},
        'SK1': {'S': 'SECONDARY_KEY_VALUE'}
    },
    UpdateExpression='SET #field1 = :field1, #field2 = :field2',
    ExpressionAttributeNames={
        '#field1': 'FIELD_1_NAME',
        '#field2': 'FIELD_2_NAME',
    },
    ExpressionAttributeValues={
        ':field1': {'S': 'FIELD_1_VALUE'},
        ':field2': {'S': 'FIELD_2_VALUE'},
    }
)
Using the previous answer from eltbus, it worked for me, except for a minor bug: you have to delete the extra trailing comma, e.g. with update_expression[:-1].