KeywordAnalyser for lucene python? - python

I want to use lucene using python.
I used StandardAnalyser for indexing and searching both. It works fine but now, my requirement changes, I need to use KeywordAnalyser.
Code for StandardAnalyser:
#Importing packages
import lucene
lucene.initVM()
from org.apache.lucene import analysis, document, index, queryparser, search, store, util
#Initialize Parameters
analyzer = analysis.standard.StandardAnalyzer(util.Version.LUCENE_CURRENT)
config = index.IndexWriterConfig(util.Version.LUCENE_CURRENT, self.analyzer)
directory = store.FSDirectory.open(File(<path_where_to_index>))
iwriter = None
iwriter = index.IndexWriter(directory, config)
#Indexing Part
doc = document.Document()
doc.add(document.Field("fieldname", entity, document.Field.Store.YES, document.Field.Index.ANALYZED))
doc.add(document.Field("category", category, document.Field.Store.YES, document.Field.Index.ANALYZED))
iwriter.addDocument(doc)
iwriter.commit()
#Searching Part
ireader = index.IndexReader.open(directory)
isearcher = search.IndexSearcher(ireader)
parser = queryparser.classic.QueryParser(util.Version.LUCENE_CURRENT, "fieldname", self.analyzer)
query = parser.parse(entity)
hits = isearcher.search(query, None, 100).scoreDocs
print hits
for hit in hits:
hitDoc = isearcher.doc(hit.doc)
print hitDoc
Above code is using StandardAnalyser. I want to use KeywordAnalyser instead of StandardAnalyser.
I have changed analyser in below code. Below Code is using KeywordAnalyser but searching is not performing.
Code for KeywordAnalyser:
#Importing packages
import lucene
lucene.initVM()
from org.apache.lucene import analysis, document, index, queryparser, search, store, util
#Initialize Parameters
analyzer = analysis.core.KeywordAnalyser(util.Version.LUCENE_CURRENT)
config = index.IndexWriterConfig(util.Version.LUCENE_CURRENT, self.analyzer)
directory = store.FSDirectory.open(File(<path_where_to_index>))
iwriter = None
iwriter = index.IndexWriter(directory, config)
#Indexing Part
doc = document.Document()
doc.add(document.Field("fieldname", entity, document.Field.Store.YES, document.Field.Index.ANALYZED))
doc.add(document.Field("category", category, document.Field.Store.YES, document.Field.Index.ANALYZED))
iwriter.addDocument(doc)
iwriter.commit()
#Searching Part
ireader = index.IndexReader.open(directory)
isearcher = search.IndexSearcher(ireader)
parser = queryparser.classic.QueryParser(util.Version.LUCENE_CURRENT, "fieldname", self.analyzer)
query = parser.parse(entity)
hits = isearcher.search(query, None, 100).scoreDocs
print hits
for hit in hits:
hitDoc = isearcher.doc(hit.doc)
print hitDoc
Any help?

I found the solution of my question.
To use KeywordAnalyser, I need analysis.core. I can't use queryparser for searching because it mostly works for StandardAnalyser. To do search on KeywordAnalyser, I need to use index.Term and search.TermQuery
Searching Code:
term_parser = index.Term("fieldname", entity)
query = search.TermQuery(term_parser)
hits = isearcher.search(query, None, 10).scoreDocs

Related

openai.error.InvalidRequestError: Engine not found

Tried accessing the OpenAPI example - Explain code
But it shows error as -
InvalidRequestError: Engine not found
enter code response = openai.Completion.create(
engine="code-davinci-002",
prompt="class Log:\n def __init__(self, path):\n dirname = os.path.dirname(path)\n os.makedirs(dirname, exist_ok=True)\n f = open(path, \"a+\")\n\n # Check that the file is newline-terminated\n size = os.path.getsize(path)\n if size > 0:\n f.seek(size - 1)\n end = f.read(1)\n if end != \"\\n\":\n f.write(\"\\n\")\n self.f = f\n self.path = path\n\n def log(self, event):\n event[\"_event_id\"] = str(uuid.uuid4())\n json.dump(event, self.f)\n self.f.write(\"\\n\")\n\n def state(self):\n state = {\"complete\": set(), \"last\": None}\n for line in open(self.path):\n event = json.loads(line)\n if event[\"type\"] == \"submit\" and event[\"success\"]:\n state[\"complete\"].add(event[\"id\"])\n state[\"last\"] = event\n return state\n\n\"\"\"\nHere's what the above class is doing:\n1.",
temperature=0,
max_tokens=64,
top_p=1.0,
frequency_penalty=0.0,
presence_penalty=0.0,
stop=["\"\"\""]
)
I've been trying to access the engine named code-davinci-002 which is a private beta version engine. So without access it's not possible to access the engine. It seems only the GPT-3 models are of public usage. We need to need to join the OpenAI Codex Private Beta Waitlist in order to access Codex models through API.
Please note that your code is not very readable.
However, from the given error, I think it has to do with the missing colon : in the engine name.
Change this line from:
engine="code-davinci-002",
to
engine="code-davinci:002",
If you are using a finetuned model instead of an engine, you'd want to use model= instead of engine=.
response = openai.Completion.create(
model="<finetuned model>",
prompt=

How do I make it so I only need my api key referenced once?

I am teaching myself how to use python and django to access the google places api to make nearby searches for different types of gyms.
I was only taught how to use python and django with databases you build locally.
I wrote out a full Get request for they four different searches I am doing. I looked up examples but none seem to work for me.
allgyms = requests.get('https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=38.9208,-77.036&radius=2500&type=gym&key=AIzaSyDOwVK7bGap6b5Mpct1cjKMp7swFGi3uGg')
all_text = allgyms.text
alljson = json.loads(all_text)
healthclubs = requests.get('https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=38.9208,-77.036&radius=2500&type=gym&keyword=healthclub&key=AIzaSyDOwVK7bGap6b5Mpct1cjKMp7swFGi3uGg')
health_text = healthclubs.text
healthjson = json.loads(health_text)
crossfit = requests.get('https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=38.9208,-77.036&radius=2500&type=gym&keyword=crossfit&key=AIzaSyDOwVK7bGap6b5Mpct1cjKMp7swFGi3uGg')
cross_text = crossfit.text
crossjson = json.loads(cross_text)
I really would like to be pointed in the right direction on how to have the api key referenced only one time while changing the keywords.
Try this for better readability and better reusability
BASE_URL = 'https://maps.googleapis.com/maps/api/place/nearbysearch/json?'
LOCATION = '38.9208,-77.036'
RADIUS = '2500'
TYPE = 'gym'
API_KEY = 'AIzaSyDOwVK7bGap6b5Mpct1cjKMp7swFGi3uGg'
KEYWORDS = ''
allgyms = requests.get(BASE_URL+'location='+LOCATION+'&radius='+RADIUS+'&type='+TYPE+'&key='+API_KEY) all_text = allgyms.text
alljson = json.loads(all_text)
KEYWORDS = 'healthclub'
healthclubs = requests.get(BASE_URL+'location='+LOCATION+'&radius='+RADIUS+'&type='+TYPE+'&keyword='+KEYWORDS+'&key='+API_KEY)
health_text = healthclubs.text
healthjson = json.loads(health_text)
KEYWORDS = 'crossfit'
crossfit = requests.get(BASE_URL+'location='+LOCATION+'&radius='+RADIUS+'&type='+TYPE+'&keyword='+KEYWORDS+'&key='+API_KEY)
cross_text = crossfit.text
crossjson = json.loads(cross_text)
as V-R suggested in a comment you can go further and define function which makes things more reusable allowing you to use the that function in other places of your application
Function implementation
def makeRequest(location, radius, type, keywords):
BASE_URL = 'https://maps.googleapis.com/maps/api/place/nearbysearch/json?'
API_KEY = 'AIzaSyDOwVK7bGap6b5Mpct1cjKMp7swFGi3uGg'
result = requests.get(BASE_URL+'location='+location+'&radius='+radius+'&type='+type+'&keyword='+keywords+'&key='+API_KEY)
jsonResult = json.loads(result)
return jsonResult
Function invocation
json = makeRequest('38.9208,-77.036', '2500', 'gym', '')
Let me know if there is an issue

Parsing Json with multiple "levels" with Python

I'm trying to parse a json file from an api call.
I have found this code that fits my need and trying to adapt it to what I want:
import math, urllib2, json, re
def download():
graph = {}
page = urllib2.urlopen("http://fx.priceonomics.com/v1/rates/?q=1")
jsrates = json.loads(page.read())
pattern = re.compile("([A-Z]{3})_([A-Z]{3})")
for key in jsrates:
matches = pattern.match(key)
conversion_rate = -math.log(float(jsrates[key]))
from_rate = matches.group(1).encode('ascii','ignore')
to_rate = matches.group(2).encode('ascii','ignore')
if from_rate != to_rate:
if from_rate not in graph:
graph[from_rate] = {}
graph[from_rate][to_rate] = float(conversion_rate)
return graph
And I've turned it into:
import math, urllib2, json, re
def download():
graph = {}
page = urllib2.urlopen("https://bittrex.com/api/v1.1/public/getmarketsummaries")
jsrates = json.loads(page.read())
for pattern in jsrates['result'][0]['MarketName']:
for key in jsrates['result'][0]['Ask']:
matches = pattern.match(key)
conversion_rate = -math.log(float(jsrates[key]))
from_rate = matches.group(1).encode('ascii','ignore')
to_rate = matches.group(2).encode('ascii','ignore')
if from_rate != to_rate:
if from_rate not in graph:
graph[from_rate] = {}
graph[from_rate][to_rate] = float(conversion_rate)
return graph
Now the problem is that there is multiple level in the json "Result > 0, 1,2 etc"
json screenshot
for key in jsrates['result'][0]['Ask']:
I want the zero to be able to be any number, I don't know if thats clear.
So I could get all the ask price to match their marketname.
I have shortened the code so it doesnt make too long of a post.
Thanks
PS: sorry for the english, its not my native language.
You could loop through all of the result values that are returned, ignoring the meaningless numeric index:
for result in jsrates['result'].values():
ask = result.get('Ask')
if ask is not None:
# Do things with your ask...

Using jsonpath in api.ai in case we have already our json file to query in?

import json
def makeWebhookResult(req):
if req.get("result").get("action") != "Phdapp":
return {}
result = req.get("result")
parameters = result.get("parameters")
Progr = parameters.get("PhDsubjects")
time = parameters.get("PhdTime")
Levp = parameters.get("PhDDegLevp")
with open('Sheet1.json') as f:
data = f.read()
jsondata = json.loads(data)
match = jsonpath.jsonpath(jsondata,
'$.features[[?(#.ProgramName == Progr && #.Level == Levp && #.StartDate == time)]].UniversityName')
speech = "This is the universities you were looking for " + match
This is the part of my python code which have errors i can't figure it out, I have an intent with action which is "Phdapp" with three parameters that i need to use their values in my jsonpath querying from "sheet1.json" file in the same repository on GitHub in json format. But i can't get data from my intent neither accessing my son file for querying...is it because api.ai is not compatible with jsonpath or it is the problem of my code! or if there is a best to use which easier it can be my pleasure to know it. Thanks

Optimize Python Script to parse xml

I'm parsing the US Patent XML files (downloaded from Google patent dumps) using Python and Beautifulsoup; parsed data is exported to MYSQL database.
Each year's data contains close to 200-300K patents - which means parsing 200-300K xml files.
The server on which I'm running the python script is pretty powerful - 16 cores, 160 gigs of RAM, etc. but still it is taking close to 3 days to parse one year's worth of data.
I've been learning and using python since 2 years - so I can get stuff done but do not know how to get it done in the most efficient manner. I'm reading on it.
How can I optimize the below script to make it efficient?
Any guidance would be greatly appreciated.
Below is the code:
from bs4 import BeautifulSoup
import pandas as pd
from pandas.core.frame import DataFrame
import MySQLdb as db
import os
cnxn = db.connect('xx.xx.xx.xx','xxxxx','xxxxx','xxxx',charset='utf8',use_unicode=True)
def separated_xml(infile):
file = open(infile, "r")
buffer = [file.readline()]
for line in file:
if line.startswith("<?xml "):
yield "".join(buffer)
buffer = []
buffer.append(line)
yield "".join(buffer)
file.close()
def get_data(soup):
df = pd.DataFrame(columns = ['doc_id','patcit_num','patcit_document_id_country', 'patcit_document_id_doc_number','patcit_document_id_kind','patcit_document_id_name','patcit_document_id_date','category'])
if soup.findAll('us-citation'):
cit = soup.findAll('us-citation')
else:
cit = soup.findAll('citation')
doc_id = soup.findAll('publication-reference')[0].find('doc-number').text
for x in cit:
try:
patcit_num = x.find('patcit')['num']
except:
patcit_num = None
try:
patcit_document_id_country = x.find('country').text
except:
patcit_document_id_country = None
try:
patcit_document_id_doc_number = x.find('doc-number').text
except:
patcit_document_id_doc_number = None
try:
patcit_document_id_kind = x.find('kind').text
except:
patcit_document_id_kind = None
try:
patcit_document_id_name = x.find('name').text
except:
patcit_document_id_name = None
try:
patcit_document_id_date = x.find('date').text
except:
patcit_document_id_date = None
try:
category = x.find('category').text
except:
category = None
print doc_id
val = {'doc_id':doc_id,'patcit_num':patcit_num, 'patcit_document_id_country':patcit_document_id_country,'patcit_document_id_doc_number':patcit_document_id_doc_number, 'patcit_document_id_kind':patcit_document_id_kind,'patcit_document_id_name':patcit_document_id_name,'patcit_document_id_date':patcit_document_id_date,'category':category}
df = df.append(val, ignore_index=True)
df.to_sql(name = 'table_name', con = cnxn, flavor='mysql', if_exists='append')
print '1 doc exported'
i=0
l = os.listdir('/path/')
for item in l:
f = '/path/'+item
print 'Currently parsing - ',item
for xml_string in separated_xml(f):
soup = BeautifulSoup(xml_string,'xml')
if soup.find('us-patent-grant'):
print item, i, xml_string[177:204]
get_data(soup)
else:
print item, i, xml_string[177:204],'***********************************soup not found********************************************'
i+=1
print 'DONE!!!'
Here is a tutorial on multi-threading, because currently that code will run on 1 thread, 1 core.
Remove all try/except statements and handle the code properly. Exceptions are expensive.
Run a profiler to find the chokepoints, and multi-thread those or find a way to do them less times.
So, you're doing two things wrong. First, you're using BeautifulSoup, which is slow, and second, you're using a "find" call, which is also slow.
As a first cut, look at lxml's ability to pre-compile xpath queries (Look at the heading "The Xpath class). That will give you a huge speed boost.
Alternatively, I've been working on a library to do this kind of parsing declaratively, using best practices for lxml speed, including precompiled xpath called yankee.
Yankee on PyPI |
Yankee on GitHub
You could do the same thing with yankee like this:
from yankee.xml import Schema, fields as f
# Create a schema for citations
class Citation(Schema):
num = f.Str(".//patcit")
country = f.Str(".//country")
# ... and so forth for the rest of your fields
# Then create a "wrapper" to get all the citations
class Patent(Schema):
citations = f.List(".//us-citation|.//citation")
# Then just feed the Schema your lxml.etrees for each patent:
import lxml.etree as ET
schema = Patent()
for _, doc in ET.iterparse(xml_string, "xml"):
result = schema.load(doc)
The result will look like this:
{
"citations": [
{
"num": "<some value>",
"country": "<some value>",
},
{
"num": "<some value>",
"country": "<some value>",
},
]
}
I would also check out Dask to help you multithread it more efficiently. Pretty much all my projects use it.

Categories

Resources