elasticsearch search for large text has "too many clauses"

I have a set of news articles that I'm trying to index. Sometimes I get the same article with a tiny change (e.g. "Sep" vs. "September"). Before loading an article into the database, I'd like to check whether something very similar is already there.
So I tried this (using the Python elasticsearch_dsl library):
    search = elasticsearch_dsl.Search(index=INDEX, doc_type=DOC_TYPE)
    search = search.filter("match", text=article_text)
That works for a while, until I hit a very long article. Then I get an error message saying "maxClauseCount is set to 1024".
Okay, so maybe my text is too long. So I do this:
    text_bits = article_text.split()
    if len(text_bits) > 1024:
        article_text = " ".join(text_bits[:1023])
That works for the first item with lots of text, but not the second. So maybe my original guess is off, or maybe I'm not doing this right.
Incidentally, I see that there's a "more like this" query listed in the documentation, but when I try to use it through Sense, like so:
    POST /myindex/article/_search
    {
        "more_like_this" : {
            "fields" : ["text"],
            "like" : "mary had a little lamb"
        }
    }
I get "unknown search element 'more_like_this'".
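For what it's worth, a search request body expects query types under a top-level "query" key, which would explain the "unknown search element" message. Below is a minimal sketch of the same request built as a Python dict. The helper name build_mlt_body is made up for illustration, and max_query_terms (which limits how many terms the query selects) is included on the assumption that your Elasticsearch version supports it.

```python
def build_mlt_body(text, fields=("text",), max_query_terms=25):
    """Build a search body that wraps more_like_this in a top-level "query"."""
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": text,
                # Assumption: caps the number of terms picked from the text.
                "max_query_terms": max_query_terms,
            }
        }
    }


body = build_mlt_body("mary had a little lamb")
```

The resulting dict could then be sent as the body of POST /myindex/article/_search, or passed to whatever client you use.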

Related

Get documents near a match in pyMongo

I'm trying to get the documents that surround a match in pyMongo. I would search for a string and get the matches plus the entries around each match (ordered by the '_index' field), so the user has some context for the result.
I'm trying to do it with $setWindowFields, without success: I get no results. Am I using the wrong syntax? This is the aggregation I'm trying:
    show_near = [
        {'$setWindowFields': {
            'partitionBy': None,
            'sortBy': {'_index': 1},
            'output': {
                'nearIds': {
                    '$addToSet': '$_id',
                    'window': {'documents': [-2, 2]}
                }
            }
        }},
        {'$match': {field: {'$regex': f'({s})'}}},
        {'$lookup': {
            'from': 'collection',
            'localField': 'nearIds',
            'foreignField': '_id',
            'as': 'nearDocs'
        }},
        {'$unwind': '$nearDocs'},
        {'$replaceRoot': {'newRoot': '$nearDocs'}}
    ]
    cursor = self.collection.aggregate(show_near)
Here 's' is the string I want to match and '_index' is the order of the entries.
Any ideas? Maybe there is another way to do this? This feature looks perfect for what I want, but maybe I'm mistaken. I've tried going back and forth with $gte and $lte, but that isn't feasible when results start to pile up.
Thanks!
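As a reference for what the pipeline is meant to return, here is a client-side sketch of the same "match plus two neighbours on each side" idea in plain Python. It assumes the documents fit in memory and carry an '_index' field. The function name with_context and the sample documents are made up for illustration, and this does not replace the aggregation.

```python
import re


def with_context(docs, field, pattern, before=2, after=2):
    """Return docs matching `pattern` plus their neighbours, ordered by '_index'."""
    ordered = sorted(docs, key=lambda d: d["_index"])
    keep = set()
    for pos, doc in enumerate(ordered):
        if re.search(pattern, doc.get(field, "")):
            # Clamp the window to the bounds of the list.
            start = max(0, pos - before)
            stop = min(len(ordered), pos + after + 1)
            keep.update(range(start, stop))
    return [ordered[i] for i in sorted(keep)]


words = ["alpha", "beta", "gamma", "needle", "delta", "epsilon", "zeta"]
docs = [{"_index": i, "text": t} for i, t in enumerate(words)]
context = with_context(docs, "text", "needle")
```

Comparing this output with the aggregation's output on a small sample may help narrow down which stage drops the results.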

How to Write a Mongo Query Where One of the Key Has a # Character as a Name in a Python / Flask Environment?

I am new to programming. I use Flask/Python as my backend and MongoDB as my database. My MongoDB documents are uploaded from CSV files and I have no control over the header names, so I cannot rename the header to remove the # character.
**Mongo Collection**
    Part #: "ABC123"
    Description : "DC Motor 12V"
**Flask/Python/Backend**
    query = { "Part #" : "ABC123" }
Since the # character denotes a comment in Python, I tried "Part \#" to escape the #, but I think when it is sent to MongoDB as a query, MongoDB sees the backslash in the key name, so no results appear.
I have googled for a long time but have been unable to find a solution. Can someone give me a hint about what I can do? Thank you.

After @difurious commented that there was no issue with the query, I looked deeper.
Apparently the collection's field name was "Part # ", with a trailing space after the hash. In MongoDB Compass, the space doesn't show up unless you click to edit the field.
So the query returned no results because the field name was wrong.
To conclude: it was my mistake. A # inside a Python string is fine in a Mongo query.
Thanks to @difurious.
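To illustrate both points, here is a small sketch: a # inside quotes is ordinary text, and stray whitespace in CSV headers can be stripped before the rows are stored. The helper normalize_headers is a hypothetical name and the sample row is made up to mirror the case above.

```python
def normalize_headers(row):
    """Return a copy of a CSV-derived row with whitespace stripped from each key."""
    return {key.strip(): value for key, value in row.items()}


# Headers as they might arrive from the CSV, with trailing spaces.
row = {"Part # ": "ABC123", "Description ": "DC Motor 12V"}
clean = normalize_headers(row)

query = {"Part #": "ABC123"}  # '#' inside a string literal is not a comment
matches = all(clean.get(key) == value for key, value in query.items())
```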

Scraping data from a http & javaScript site

I currently want to scrape some data from an Amazon page and I'm kind of stuck.
For example, let's take this page:
https://www.amazon.com/NIKE-Hyperfre3sh-Athletic-Sneakers-Shoes/dp/B01KWIUHAM/ref=sr_1_1_sspa?ie=UTF8&qid=1546731934&sr=8-1-spons&keywords=nike+shoes&psc=1
I want to scrape every variant of shoe size and color. That data can be found by opening the page source and searching for 'variationValues'.
There we can see a sort of dictionary containing all the sizes and colors and, below that, in 'asinToDimensionIndexMap', every product code with numbers indicating the variant from the variationValues 'dictionary'.
For example, in asinToDimensionIndexMap we can see
    "B01KWIUH5M":[0,0]
which means that the product code B01KWIUH5M is associated with the size '8M US' (position 0 in the size_name section of variationValues) and the color 'Teal' (same idea as before).
I want to scrape both variationValues and asinToDimensionIndexMap, so I can associate the index map numbers with the variationValues entries.
Another person on the site (thanks for the help, by the way) suggested doing it this way:
    import json
    import re

    script = response.xpath('//script/text()').extract_first()
    # capture everything between {}
    data = re.findall(r'(\{.+?\})', script)
    d = json.loads(data[0])
    d['products'][0]
I can sort of understand the first part. We get everything that's a 'script' as a string and then get everything between {}. The issue is what happens after that. My knowledge of JSON is not that great, and reading about it didn't help much.
Is there a way to get, from that data, two dictionaries or lists with the variationValues and asinToDimensionIndexMap? (Maybe using some regular expressions in the middle to get some data out of a big string.) Or could someone explain a little what happens in the JSON part?
Thanks for the help!
EDIT: Added photo of variationValues and asinToDimensionIndexMap

I think you are close Manuel!
The following code will turn your scraped source into easy-to-select boxes:
    import json
    d = json.loads(data[0])
JSON is a text format for representing structured data. json.loads parses such a string into Python dictionaries and lists, regardless of the platform that produced it.
https://www.w3schools.com/js/js_json_intro.asp
I'm assuming the challenge is that you get errors when accessing a particular "box" inside your JSON object.
Your code format looks correct, but your access within "each box" may look different.
E.g. if your 'asinToDimensionIndexMap' object is nested within a smaller box in the larger 'products' object, then you might access it like this (after running the code above):
    d['products'][0]['asinToDimensionIndexMap']
I've hacked and slashed a little so you can better understand the structure of your particular JSON file. Take a look at the link below. On the right-hand side, you will see which boxes are within one another, which is precisely what you need to know to access what you need.
JSON Object Viewer
For example, the following would yield "companyCompliancePolicies_feature_div":
    import json
    d = json.loads(data[0])
    d['updateDivLists']['full'][0]['divToUpdate']
The person helping you before outlined a general case for you, but you'll need to go in and look at the structure this way to find what you're looking for.

    variationValues = re.findall(r'variationValues\" : ({.*?})', ' '.join(script))[0]
    asinVariationValues = re.findall(r'asinVariationValues\" : ({.*?}})', ' '.join(script))[0]
    dimensionValuesData = re.findall(r'dimensionValuesData\" : (\[.*\])', ' '.join(script))[0]
    asinToDimensionIndexMap = re.findall(r'asinToDimensionIndexMap\" : ({.*})', ' '.join(script))[0]
    dimensionValuesDisplayData = re.findall(r'dimensionValuesDisplayData\" : ({.*})', ' '.join(script))[0]
Now you can easily convert them with json.loads and combine them as you wish.
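As an alternative to tuning the regexes, one option is to let the JSON decoder find where each value ends. The sketch below uses json.JSONDecoder.raw_decode for that and then joins the two structures. The helper extract_json_after is a hypothetical name, the sample script text is a made-up stand-in for the page source (the second product code is invented), and it assumes the index positions follow the key order of variationValues.

```python
import json


def extract_json_after(text, key):
    """Decode the JSON value that follows `"key" :` in `text`."""
    key_pos = text.index('"%s"' % key)
    colon = text.index(":", key_pos + len(key) + 2)
    # raw_decode does not skip leading whitespace, so do it here.
    start = colon + 1
    while text[start].isspace():
        start += 1
    value, _end = json.JSONDecoder().raw_decode(text, start)
    return value


# Made-up stand-in for the script text found in the page source.
script = '''var dataToReturn = {
  "variationValues" : {"size_name":["8M US","9M US"],"color_name":["Teal","Black"]},
  "asinToDimensionIndexMap" : {"B01KWIUH5M":[0,0],"B000000000":[1,1]}
};'''

values = extract_json_after(script, "variationValues")
index_map = extract_json_after(script, "asinToDimensionIndexMap")

names = list(values)  # assumption: index order follows this key order
variants = {
    asin: {name: values[name][i] for name, i in zip(names, indexes)}
    for asin, indexes in index_map.items()
}
```

With the sample text, variants maps each product code to its size and color by name, which is the association described in the question.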

Suds error when creating a issue for Jira

I'm getting the following error every time I try to send an issue to Jira:
    suds.WebFault: Server raised fault: 'org.xml.sax.SAXException:
    Found character data inside an array element while deserializing'
I searched Stack Overflow and the web for an answer, and some people say it is the fault of suds 0.3 or earlier. But I'm using version 0.4.1.1.
Here is my issue dict:
    issue = {"assignee": "user_test",
             "components": "17311",
             "project": "TES",
             "description": "This is a test",
             "priority": "Major",
             "summary": "Just a test title",
             "type": "Incident"
             }
The method from my own Jira class:
    def create_issue(self, issue):
        if not isinstance(issue, dict):
            raise Exception("Issue must be a dict")
        new_issue = self.jira.service.createIssue(in0=self.auth, in1=issue)
        return new_issue["key"]

Using jira-python I was able to add components like this:
    jira.create_issue(project={'key': project_id}, summary=ticket_summary,
                      description=ticket_description, issuetype={'name': ticket_issue_type},
                      components=[{'name': 'Application Slow'}], parent={'id': new_issue_key},
                      customfield_10101=termination_change_date,
                      )
I kept trying to send a component as "components={'name': 'Application Slow'}", but I was getting a "data was not an array" error (or something similar). I took a look at the REST API and how some of their array examples were composed, which is how I arrived at the example above.
https://developer.atlassian.com/display/JIRADEV/JIRA+REST+API+Example+-+Create+Issue#JIRARESTAPIExample-CreateIssue-Request
Labels
"customfield_10006": ["examplelabelnumber1", "examplelabelnumber2"]
Labels are arrays of strings
I know this is a bit off topic, but when I searched for my issue I kept coming back here, so I hope this is helpful for your case and for anyone else. The concept is the same: the components field will only accept an array of objects.

The components value is not right. It has to be an array of things because it is multivalued. There are some hints at https://developer.atlassian.com/display/JIRADEV/Creating+a+JIRA+SOAP+Client, or look at how the JIRA Python CLI does it:
    'components': [17311]
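A small guard along these lines might catch the mistake before the request is sent. It is only a sketch: wrap_multi_valued and MULTI_VALUED_FIELDS are hypothetical names, and whether each element should be a bare id or an object depends on the API in use (SOAP or REST), which this code does not decide.

```python
# Assumption: the set of fields the server treats as multi-valued.
MULTI_VALUED_FIELDS = {"components"}


def wrap_multi_valued(issue, fields=MULTI_VALUED_FIELDS):
    """Return a copy of `issue` with multi-valued fields wrapped in a list."""
    fixed = dict(issue)
    for name in fields:
        value = fixed.get(name)
        if value is not None and not isinstance(value, (list, tuple)):
            fixed[name] = [value]
    return fixed


issue = {"project": "TES", "components": "17311", "summary": "Just a test title"}
fixed = wrap_multi_valued(issue)
```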

How to match search strings to content in python

Usually when we search, we have a list of stories, we provide a search string, and we expect back a list of the stories that the search string matches.
What I am looking to do is the opposite: given a list of search strings and one story, find out which search strings match that story.
Now this could be done with re, but in this case I want to use complex search queries as supported by Solr. Full details of the query syntax here. Note: I won't use boost.
Basically I want some pointers for the doesitmatch function in the sample code below.
    def doesitmatch(contents, searchstring):
        """
        returns result of searching contents for searchstring (True or False)
        """
        # ???????
        # ???????

    story = "big chunk of story 200 to 1000 words long"
    searchstrings = ['sajal', 'sajal AND "is a jerk"', 'sajal kayan', 'sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python))', 'bangkok']
    matches = [[searchstr] for searchstr in searchstrings if doesitmatch(story, searchstr)]
Edit: I would also be interested to know whether any module exists to convert a Lucene query like the one below into a regex:
    sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python) OR "is a jerk")
After extensive googling, I realized that what I am looking to do is a Boolean search.
I found code that makes regexes Boolean-aware: http://code.activestate.com/recipes/252526/
The issue looks solved for now.

Probably slow, but easy solution:
Make a query on the story plus each string to the search engine. If it returns anything, then it matches.
Otherwise you need to implement the search syntax yourself. If that includes things like "title:" and stuff this can be rather complex. If it's only the AND and OR from your example, then it's a recursive function that isn't too hairy.
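For the AND/OR-only case, the recursive function could look roughly like the sketch below. It is not Solr's parser: it handles only AND, OR, parentheses and quoted phrases, matches case-insensitively on whole words, and treats adjacent terms without an operator as AND (an assumption; Lucene's default operator is OR). The sample story is made up.

```python
import re

TOKEN = re.compile(r'"[^"]*"|\(|\)|[^\s()"]+')


def doesitmatch(contents, searchstring):
    """Evaluate a query with AND, OR, parentheses and "quoted phrases"."""
    tokens = TOKEN.findall(searchstring)
    text = contents.lower()
    pos = 0

    def term_matches(term):
        phrase = term.strip('"').lower()
        # Whole-word match; words of a phrase must be adjacent.
        pattern = r"\b" + r"\s+".join(map(re.escape, phrase.split())) + r"\b"
        return re.search(pattern, text) is not None

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "OR":
            pos += 1
            rhs = parse_and()  # always parse, so the position advances
            result = result or rhs
        return result

    def parse_and():
        nonlocal pos
        result = parse_atom()
        while pos < len(tokens) and tokens[pos] not in ("OR", ")"):
            if tokens[pos] == "AND":
                pos += 1
            rhs = parse_atom()
            result = result and rhs
        return result

    def parse_atom():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_or()
            pos += 1  # skip the closing ")"
            return result
        return term_matches(token)

    return parse_or()


story = "Sajal Kayan is a webmaster living in Bangkok"
searchstrings = ['sajal', 'sajal AND "is a jerk"', 'sajal kayan',
                 'sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python))',
                 'bangkok']
matches = [s for s in searchstrings if doesitmatch(story, s)]
```

Field prefixes such as "title:", NOT, wildcards and boosts are left out on purpose; adding them is where it starts to get complex.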

Some time ago I looked for a Python implementation of Lucene and came across Whoosh, which is a pure-Python full-text search engine. Maybe it will satisfy your needs.
You can also try PyLucene, but I didn't investigate that one.

Here's a suggestion in pseudocode. I'm assuming you store a story identifier with the search terms in the index, so that you can retrieve it with the search results.
    def search_strings_matching(story_id_to_match, search_strings):
        result = set()
        for s in search_strings:
            result_story_ids = query_index(s)  # query_index returns an id iterable
            if story_id_to_match in result_story_ids:
                result.add(s)
        return result

This is probably less interesting to you now, since you've already solved your problem, but what you're describing sounds like Prospective Search, which is what you call it when you have the query first and you want to match it against documents as they come along.
Lucene's MemoryIndex is a class that was designed specifically for something like this, and in your case it might be efficient enough to run many queries against a single document.
This has nothing to do with Python, though. You'd probably be better off writing something like this in Java.

If you are writing Python on AppEngine, you can use the AppEngine Prospective Search Service to achieve exactly what you are trying to do here. See: http://code.google.com/appengine/docs/python/prospectivesearch/overview.html
