I am trying to use Whoosh for text searching for the first time. I want to search for documents containing the word "XML". Because I am new to Whoosh, I just wrote a program that searches for a word in a single document, where the document is a text file (myRoko.txt):
import os, os.path
from whoosh import index
from whoosh.index import open_dir
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import QueryParser
from whoosh.query import *

if not os.path.exists("indexdir3"):
    os.mkdir("indexdir3")

schema = Schema(name=ID(stored=True), content=TEXT)
ix = index.create_in("indexdir3", schema)
writer = ix.writer()

path = "myRoko.txt"
with open(path, "r") as f:
    content = f.read()

writer.add_document(name=path, content=content)
writer.commit()

ix = open_dir("indexdir3")
query_b = QueryParser('content', ix.schema).parse('XML')

with ix.searcher() as srch:
    res_b = srch.search(query_b)
    print res_b[0]
The above code is supposed to print the document that contains the word "XML". However, the code returns the following error:
raise ValueError("%r is not unicode or sequence" % value)
ValueError: 'A large number of documents are now represented and stored
as XML document on the web. Thus ................
What could be the cause of this error?
You have a Unicode problem. You should pass unicode strings to the indexer. For that, you need to open the text file as unicode:
import codecs

with codecs.open(path, "r", "utf-8") as f:
    content = f.read()
and use a unicode string for the file name:
path = u"myRoko.txt"
After these fixes I got this result:
<Hit {'name': u'myRoko.txt'}>
It has to be unicode, so another option is to convert both values at the point of insertion:

writer.add_document(name=unicode(path), content=unicode(content))
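Putting the fixes together, a minimal corrected sketch (Python 2, same file and index names as above):

import os
import codecs
from whoosh import index
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import QueryParser

if not os.path.exists("indexdir3"):
    os.mkdir("indexdir3")

schema = Schema(name=ID(stored=True), content=TEXT)
ix = index.create_in("indexdir3", schema)

path = u"myRoko.txt"  # unicode file name
with codecs.open(path, "r", "utf-8") as f:  # decode the file contents to unicode
    content = f.read()

writer = ix.writer()
writer.add_document(name=path, content=content)
writer.commit()

with ix.searcher() as srch:
    query = QueryParser('content', ix.schema).parse(u'XML')
    print srch.search(query)[0]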
I have created this code to substitute some strings in an XML file with other text. I used BeautifulSoup for this exercise and, as instructed in the documentation, I used soup.prettify at the end in order to save the changed XML. However, the prettified XML is not working for me: I get errors when trying to import it back into the CMS.
Is there any other way to save the updated XML without changing the XML structure and without rewriting the whole code? See my code for reference below. Thanks for the advice!
import openpyxl
import sys
from bs4 import BeautifulSoup

# searching for Part Numbers and descriptions in xml
infile = open('name of my file.xml', "r", encoding="utf8")
contents = infile.read()
infile.close()

soup = BeautifulSoup(contents, 'xml')

# gathering all Part Numbers from xml
all_Products = soup.find_all('Product')

for i in all_Products:
    PN = i.find('Name')
    PN_Descr = i.find_all(AttributeID="PartNumberDescription")
    PN_Details = i.find_all(AttributeID="PartNumberDetails")
    for y in PN_Descr:
        PN_Descr_text = y.find("TranslatableText")
        try:
            string = PN_Descr_text.string
            PN_Descr_text.find(text=string).replace_with("New string")
        except AttributeError:
            print("Attribute error in: PN Description for: ", PN)
            continue
    for z in PN_Details:
        PN_Details_text = z.find("TranslatableText")
        try:
            string = PN_Details_text.string
            PN_Details_text.find(text=string).replace_with("New string")
        except AttributeError:
            print("Attribute error in: PN Details for: ", PN)
            continue

xml = soup.prettify("utf-8")
with open('name of my file.xml', "wb") as file:
    file.write(xml)
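If the CMS is rejecting the extra whitespace that prettify() inserts, one option (a sketch, not from the original thread) is to serialize the soup without re-indenting it, which leaves the structure of the document unchanged:

# soup.encode("utf-8") (or str(soup)) serializes the tree as-is, without the
# newlines and indentation that soup.prettify() adds.
xml = soup.encode("utf-8")
with open('name of my file.xml', "wb") as file:
    file.write(xml)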
I am trying to get the output of this request (https://api.opendota.com/api/players/7841909) into a file, line by line.
For some reason the output is stored as bytes and not str, which I can convert with str().
I tried to use a regular expression to store just the information between the {}, and I also tried the csv module, which led to storing only digits.
What did I do wrong? The following version ignores the line breaks and the delimiters. :/
import requests
import csv
import re

dotaId = "7841909"  # thus stored as a string
pfad = "https://api.opendota.com/api/players/" + dotaId + "/matches"
req = requests.get(pfad)

with open('%s.csv' % dotaId, 'w') as file:
    clean_line = re.findall(r'\{(.*?)\}', req.text)
    file.write(str(clean_line))
Your object clean_line is a list, which you are writing into the file as a single line.
It is better to use the csv writer module and write the content row by row:
with open('new_file.csv', 'w', newline='') as file:
    writer = csv.writer(file, quotechar="'")
    clean_lines = re.findall(r'\{(.*?)\}', req.text)
    for line in clean_lines:
        writer.writerow([str(line)])
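Since the endpoint returns JSON, an even more robust variant (just a sketch, not part of the original answer) is to skip the regex entirely, let requests parse the response, and flatten it with csv.DictWriter:

import csv
import requests

dotaId = "7841909"
req = requests.get("https://api.opendota.com/api/players/" + dotaId + "/matches")
matches = req.json()  # the endpoint returns a JSON array of match objects

with open('%s.csv' % dotaId, 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=sorted(matches[0].keys()))
    writer.writeheader()
    writer.writerows(matches)  # one CSV row per match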
I'm trying to remove the very first character (") from a file which contains a JSON String. I'm using Python for this. Below is my code:
jsonOutput = 'JsonString_{}.{}'.format(str(uuid.uuid1()), "json")
jsonOutput_File = os.path.join(arcpy.env.scratchFolder, jsonOutput)

with open(jsonOutput_File, 'w') as json_file:
    json.dump(jsonString, json_file)

# I was able to remove the very last character using the code below
with open(jsonOutput_File, 'r+') as read_json_file:
    read_json_file.seek(-1, os.SEEK_END)
    read_json_file.truncate()
Basically, when I dump the JSON string to a file, the string gets surrounded by double quotes. I'm trying to remove these double quotes from the first and last positions of the file.
If you already have a JSON string, simply write it to the file.
Encoding the JSON string to JSON again with json.dump() is a bad idea, and the result cannot be fixed as simply as removing a leading and a trailing quote.
Consider the following minimal and complete example:
import json
import os
import uuid

myobject = {"hello": "world"}
jsonString = json.dumps(myobject)

jsonOutput = 'JsonString_{}.{}'.format(str(uuid.uuid1()), "json")
jsonOutput_File = os.path.join("d:\\", jsonOutput)

with open(jsonOutput_File, 'w') as json_file:
    json.dump(jsonString, json_file)
The output is a file with the content:
"{\"hello\": \"world\"}"
Removing the quotes will not make it valid JSON.
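A quick way to see this (a hypothetical check, not from the original post): parsing the file back yields a string rather than a dict, because the outer json.dump() serialized JSON that was already serialized:

with open(jsonOutput_File) as f:
    loaded = json.load(f)

print(type(loaded))  # <class 'str'>: one parse only undoes the outer encoding
print(loaded)        # {"hello": "world"} -- still text, not a dict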
Instead, avoid the double JSON encoding: either remove json.dumps(), which converts the object to JSON the first time, or remove json.dump(), which does it a second time.
Solution 1:
import json
import os
import uuid

myobject = {"hello": "world"}
# <-- deleted line here

jsonOutput = 'JsonString_{}.{}'.format(str(uuid.uuid1()), "json")
jsonOutput_File = os.path.join("d:\\", jsonOutput)

with open(jsonOutput_File, 'w') as json_file:
    json.dump(myobject, json_file)  # <-- changed to object here
Solution 2:
import json
import os
import uuid

myobject = {"hello": "world"}
jsonString = json.dumps(myobject)

jsonOutput = 'JsonString_{}.{}'.format(str(uuid.uuid1()), "json")
jsonOutput_File = os.path.join("d:\\", jsonOutput)

with open(jsonOutput_File, 'w') as json_file:
    json_file.write(jsonString)  # <-- note this line
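With either solution the file now contains plain {"hello": "world"}, so a round trip works; a quick hypothetical check:

with open(jsonOutput_File) as f:
    assert json.load(f) == myobject  # parses straight back to the original dict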
So, I was trying to use NLTK in Python to do part-of-speech tagging on a text file.
This is the code I used:
import nltk
from nltk import word_tokenize, pos_tag

f = open('all.txt')
raw = f.read()
text = word_tokenize(raw)
paosted = nltk.pos_tag(text)

saveFile = open('ol.txt', 'w')
saveFile.write(str(paosted))
saveFile.close()
The code did work, but the problem is that it saved all the text on one single line, as shown in the attached screenshot. I know I should be using "\n", but I am a novice in Python and have no idea how to do it, so any help would be appreciated :)
-------- UPDATE -----------
Well, people have been really helpful and offered some solutions, e.g. this code:
import nltk
from nltk import word_tokenize, pos_tag

f = open('all.txt')
raw = f.read()
text = word_tokenize(raw)
paosted = nltk.pos_tag(text)

saveFile = open('ol.txt', 'w')
saveFile.write(str(paosted).replace('),', '),\n'))
saveFile.close()
But I still need to have it in the form of a paragraph, because I am going to use it later in concordance software. Please have a look at this screenshot:
https://i.stack.imgur.com/tU1NW.png
paosted is a list of tuples; you can iterate over it and write each tuple on its own line.
Ex:
paosted = nltk.pos_tag(text)
saveFile = open('ol.txt', 'w')
for line in paosted:
    saveFile.write(str(line) + "\n")
saveFile.close()
Updating my answer accordingly:
temp = []
for i in paosted:
    temp.append("_".join(i))
" ".join(temp)
Thank you all! I followed some of your instructions and the best result I got was with this code:
import nltk
from nltk import word_tokenize, pos_tag

f = open('all.txt')
raw = f.read()
text = word_tokenize(raw)
paosted = nltk.pos_tag(text)

saveFile = open('output.txt', 'w')
saveFile.write(str(paosted).replace("('.', '.')", "\n"))
saveFile.close()
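Note that this replace drops the period tuples themselves; a small variant (just a sketch) keeps them and still starts a new line after each sentence-final period:

saveFile.write(str(paosted).replace("('.', '.'),", "('.', '.'),\n"))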
I have some simple code to ingest some JSON Twitter data and output a few specific fields into separate columns of a CSV file. My problem is that I cannot for the life of me figure out the proper way to encode the output as UTF-8. Below is the closest I've been able to get, with the help of a member here, but it still isn't running correctly and fails because of the unique characters in the tweet text field.
import json
import sys
import csv
import codecs

def main():
    writer = csv.writer(codecs.getwriter("utf-8")(sys.stdout), delimiter="\t")
    for line in sys.stdin:
        line = line.strip()
        data = []
        try:
            data.append(json.loads(line))
        except ValueError as detail:
            continue
        for tweet in data:
            ## deletes any rate limited data
            if tweet.has_key('limit'):
                pass
            else:
                writer.writerow([
                    tweet['id_str'],
                    tweet['user']['screen_name'],
                    tweet['text']
                ])

if __name__ == '__main__':
    main()
From the docs:
https://docs.python.org/2/howto/unicode.html
a = "string"
encodedstring = a.encode('utf-8')
If that does not work:
Python DictWriter writing UTF-8 encoded CSV files
I have had the same problem. I have a large amount of data from the Twitter firehose, so every possible complication case can arise (and has)!
I've solved it as follows, using try/except:
If the dict value is a string (isinstance(value, basestring)), I try to encode it straight away. If it is not a string, I make it a string and then encode it.
If this fails, it's because some joker is tweeting odd symbols to mess up my script. In that case, I first decode and then re-encode: value.decode('utf-8').encode('utf-8') for strings, and decode, make into a string, then re-encode for non-strings: str(value.decode('utf-8')).encode('utf-8').
Have a go with this:
import csv

def export_to_csv(list_of_tweet_dicts, export_name="flat_twitter_output.csv"):
    utf8_flat_tweets = []
    keys = []
    for tweet in list_of_tweet_dicts:
        tmp_tweet = tweet
        for key, value in tweet.iteritems():
            if key not in keys:
                keys.append(key)
            # convert fields to utf-8 if text
            try:
                if isinstance(value, basestring):
                    tmp_tweet[key] = value.encode('utf-8')
                else:
                    tmp_tweet[key] = str(value).encode('utf-8')
            except:
                if isinstance(value, basestring):
                    tmp_tweet[key] = value.decode('utf-8').encode('utf-8')
                else:
                    tmp_tweet[key] = str(value.decode('utf-8')).encode('utf-8')
        utf8_flat_tweets.append(tmp_tweet)
        del tmp_tweet
    list_of_tweet_dicts = utf8_flat_tweets
    del utf8_flat_tweets

    with open(export_name, 'w') as f:
        dict_writer = csv.DictWriter(f, fieldnames=keys, quoting=csv.QUOTE_ALL)
        dict_writer.writeheader()
        dict_writer.writerows(list_of_tweet_dicts)

    print "exported tweets to '" + export_name + "'"
    return list_of_tweet_dicts
Hope that helps you.
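For what it's worth, the whole encode/decode dance disappears on Python 3, where the csv module writes str natively; a minimal sketch (assuming the same list-of-dicts input):

import csv

def export_to_csv_py3(tweets, export_name="flat_twitter_output.csv"):
    # Collect every key that appears in any tweet so no column is lost.
    keys = sorted({key for tweet in tweets for key in tweet})
    # Opening the file with encoding='utf-8' is all the encoding work needed.
    with open(export_name, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=keys, quoting=csv.QUOTE_ALL)
        writer.writeheader()
        writer.writerows(tweets)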