I have a problem with Whoosh. I want to build the index in several separate runs, because the query that extracts the data is heavy. I have fixed almost all the problems, but I can't get past one: every time I reopen the index to add new documents, the existing contents are wiped instead of new documents being appended. I tried using update_document instead of add_document, and FileStorage.open_index instead of index.open_dir, but nothing changed: the index file always ends up much smaller than expected.
if is_new_index_file:
    if os.path.isdir(<dirname>):
        rmtree(<dirname>)
        os.mkdir(<dirname>)
    else:
        os.mkdir(<dirname>)
    schema = TranslationSchema()
    index.create_in(<dirname>, <schema>, indexname=<indexname>)
    ix = index.open_dir(<dirname>, indexname=<indexname>, schema=<schema>)
else:
    # open an existing index object
    # ix = index.open_dir(<dirname>, indexname=<indexname>)
    # open the file storage and get the index from it
    storage = FileStorage(<dirname>)
    ix = storage.open_index(indexname=<indexname>)
...
list-of-fields = <query-to-the-database-to-extract-fields>
...
writer = ix.writer()
# writer.add_document(<list-of-fields>)
writer.update_document(<list-of-fields>)
writer.commit(merge=False, optimize=True)
ix.close()
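In case it helps anyone with the same symptom: based on the code above, two likely culprits are that index.create_in always wipes whatever is already in the directory, so it must run only on the very first pass, and that update_document deletes every existing document whose unique fields match the one being added, which silently shrinks the index when unique values collide. Below is a minimal sketch of the usual incremental pattern; the directory name, index name, and schema fields are illustrative, not the OP's actual values.

import os
from whoosh import index
from whoosh.fields import Schema, ID, TEXT

# Illustrative schema: 'path' is the unique key that update_document matches on
schema = Schema(path=ID(stored=True, unique=True), content=TEXT)

if index.exists_in("indexdir", indexname="translations"):
    # Reopen without recreating: create_in would wipe the existing segments
    ix = index.open_dir("indexdir", indexname="translations")
else:
    if not os.path.isdir("indexdir"):
        os.mkdir("indexdir")
    ix = index.create_in("indexdir", schema, indexname="translations")

writer = ix.writer()
writer.add_document(path=u"/doc1", content=u"first batch of documents")
writer.commit()  # a plain commit appends; optimize=True rewrites everything into one segment
ix.close()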
I have a file which has 4 comma-separated columns. I need only the first column, so I read the file, split each line on the commas, and store the first value in a list variable called first_file_list.
I have another file which has 6 comma-separated columns. My requirement is to read the first column of each row of that file and check whether that string exists in the list called first_file_list. If it exists, I copy that line to a new file.
My first file has approx. 6 million records and my second file approx. 4.5 million. Just to check the performance of my code, I put only 100k records in the second file instead of the 4.5 million, and processing those 100k records takes approx. 2.5 hours.
Following is my logic for this:
first_file_list = []
with open(r"c:\first_file.csv") as first_f:
    next(first_f)  # Ignoring first row as it is header and I don't need that
    temp = first_f.readlines()
    for x in temp:
        first_file_list.append(x.split(',')[0])
first_f.close()

with open(r"c:\second_file.csv") as second_f:
    next(second_f)
    second_file_co = second_f.readlines()
second_f.close()

out_file = open(r"c:\output_file.csv", "a")
for x in second_file_co:
    if x.split(',')[0] in first_file_list:
        out_file.write(x)
out_file.close()
Can you please help me understand what I am doing wrong here that makes my code take this much time to compare 100k records? Or can you suggest a better way to do this in Python?
Use a set for fast membership checking.
Also, there's no need to copy the contents of the entire file to memory. You can just iterate over the remaining contents of the file.
first_entries = set()
with open(r"c:\first_file.csv") as first_f:
    next(first_f)
    for line in first_f:
        first_entries.add(line.split(',')[0])

with open(r"c:\second_file.csv") as second_f:
    with open(r"c:\output_file.csv", "a") as out_file:
        next(second_f)
        for line in second_f:
            if line.split(',')[0] in first_entries:
                out_file.write(line)
Additionally, I noticed you called .close() on file objects that were opened with the with statement. Using with (a context manager) means all the cleanup is done when you exit its context, so it handles the .close() for you.
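A quick way to convince yourself of this:

with open(r"c:\first_file.csv") as first_f:
    header = next(first_f)
# Exiting the with block closes the file automatically
print(first_f.closed)  # True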
Work with sets - see below:
first_file_values = set()
second_file_values = set()

with open(r"c:\first_file.csv") as first_f:
    next(first_f)
    temp = first_f.readlines()
    for x in temp:
        first_file_values.add(x.split(',')[0])

with open(r"c:\second_file.csv") as second_f:
    next(second_f)
    second_file_co = second_f.readlines()
    for x in second_file_co:
        second_file_values.add(x.split(',')[0])

with open(r"c:\output_file.csv", "a") as out_file:
    # The sets hold only the first-column values, so this writes the distinct
    # matching keys (one per line) rather than the full matching rows
    for x in second_file_values:
        if x in first_file_values:
            out_file.write(x + '\n')
I have an ObjectListView. I remove a line from it and then I want to update the list without the removed line. I fill the list with data from a database. I tried RepopulateList, but it seems to reuse the data that is already in the list.
I think I could solve it with ClearAll (clearing the list) and then AddObjects to add the database contents again. But it seems that it should be possible to just update the list. This is my code:
def deletemeas(self):
    MAid = self.objectma.id
    MAname = self.pagename
    objectsRemList = self.tempmeasurements.GetCheckedObjects()
    print 'objectremlist', objectsRemList
    for measurement in objectsRemList:
        print measurement
        Measname = measurement.filename
        Measid = database.Measurement.select(database.Measurement.q.filename == Measname)[0].id
        deleteMeas = []
        deleteMeas.append(MAid)
        deleteMeas.append(Measid)
        pub.sendMessage('DELETE_MEAS', Container(data=deleteMeas)) # to microanalyse controller

    # here I get the latest information from the database about what should be
    # shown in the objectlist self.tempmeasurements
    MeasInListFromDB = list(database.Microanalysismeasurement.select(database.Microanalysismeasurement.q.microanalysisid == MAid))
    print 'lijstmetingen:', MeasInListFromDB
    # this doesn't work
    self.tempmeasurements.RefreshObjects(MeasInListFromDB)
Ok, this was actually easier than I thought ...
I added this line:
self.tempmeasurements.RemoveObject(measurement)
So I first removed the data from my database table and then just removed the line from my ObjectListView.
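In context, the call goes at the end of the loop in deletemeas, something like:

for measurement in objectsRemList:
    Measname = measurement.filename
    Measid = database.Measurement.select(database.Measurement.q.filename == Measname)[0].id
    deleteMeas = [MAid, Measid]
    # Delete from the database first ...
    pub.sendMessage('DELETE_MEAS', Container(data=deleteMeas))
    # ... then drop the corresponding row from the ObjectListView widget
    self.tempmeasurements.RemoveObject(measurement)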
I have a very weird issue here. I have two functions: one which reads an HDF5 file created using h5py, and one which creates a new HDF5 file concatenating the content returned by the former function.
def read_file(filename):
    with h5py.File(filename + ".hdf5", 'r') as hf:
        group1 = hf.get('group1')
        group2 = hf.get('group2')
        dataset1 = hf.get('dataset1')
        dataset2 = hf.get('dataset2')
        print group1.attrs['w'] # Works here
        return dataset1, dataset2, group1, group2
And the function that creates the file:
def create_chunk(start_index, end_index):
    for i in range(start_index, end_index):
        if i == start_index:
            mergedhf = h5py.File("output.hdf5", 'w')
            mergedhf.create_dataset("dataset1", dtype='float64')
            mergedhf.create_dataset("dataset2", dtype='float64')

            g1 = mergedhf.create_group('group1')
            g2 = mergedhf.create_group('group2')

        rd1, rd2, rg1, rg2 = read_file(filename)
        print rg1.attrs['w'] # gives me <Closed HDF5 group> message

        g1.attrs['w'] = "content"
        g1.attrs['x'] = "content"
        g2.attrs['y'] = "content"
        g2.attrs['z'] = "content"
        print g1.attrs['w'] # Works Here

    return mergedhf.get('dataset1'), mergedhf.get('dataset2'), g1, g2
def calling_function():
    wd1, wd2, wg1, wg2 = create_chunk(start_index, end_index)
    print wg1.attrs['w'] # Works here as well
Now the problem is: I can access the datasets and attributes of the newly created file through wd1, wd2, wg1 and wg2, but I can't do the same for the objects that read_file read and returned.
Can anyone help me fetch the values of the datasets and groups when I have returned the references to the calling function?
The problem is in read_file, this line:
with h5py.File(filename+".hdf5",'r') as hf:
This closes hf at the end of the with block, i.e. when read_file returns. When this happens, the datasets and groups also get closed and you can no longer access them.
There are (at least) two ways to fix this. Firstly, you can open the file like you do in create_chunk:
hf = h5py.File(filename+".hdf5", 'r')
and keep the reference to hf around as long as you need it, before closing it:
hf.close()
The other way is to copy the data from the datasets in read_file and return those instead:
dataset1 = hf.get('dataset1')[:]
dataset2 = hf.get('dataset2')[:]
Note that you can't do this with the groups. The file needs to be open for as long as you need to do things with the groups.
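Putting the copy-based approach together, here is a sketch of read_file; copying the group attributes into plain dicts is one way around the restriction that groups cannot outlive the file:

import h5py

def read_file(filename):
    with h5py.File(filename + ".hdf5", 'r') as hf:
        # [:] copies the dataset contents into in-memory numpy arrays,
        # which remain valid after the file is closed
        dataset1 = hf['dataset1'][:]
        dataset2 = hf['dataset2'][:]
        # The group objects die with the file, but their attributes can
        # be copied out into ordinary dicts first
        group1_attrs = dict(hf['group1'].attrs)
        group2_attrs = dict(hf['group2'].attrs)
    return dataset1, dataset2, group1_attrs, group2_attrs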
Adding to @Yossarian's answer:
The problem is in read_file, this line:
with h5py.File(filename+".hdf5",'r') as hf:
This closes hf at the end of the with block, i.e. when read_file returns. When this happens, the datasets and groups also get closed and you can no longer access them.
For those who come across this and are reading a scalar dataset, make sure to index it using [()]:
scalar_dataset1 = hf['scalar_dataset1'][()]
Preface
I had a similar issue to the OP's, resulting in a return value of <closed hdf5 dataset>. However, I would get a ValueError when attempting to slice my scalar dataset with [:]:
"ValueError: Illegal slicing argument for scalar dataspace"
Indexing with [()], along with @Yossarian's answer, helped solve my problem.
I have a Google Spreadsheet which I'm populating with values using a Python script and the gdata library. If I run the script more than once, it appends new rows to the worksheet; I'd like the script to first clear all the data from the rows before populating them, so that I have a fresh set of data every time I run the script. I've tried using:
UpdateCell(row, col, value, spreadsheet_key, worksheet_id)
but short of running two for loops like this, is there a cleaner way? Also, this loop seems horrendously slow:
for x in range(2, 45):
    for i in range(1, 5):
        self.GetGDataClient().UpdateCell(x, i, '',
                                         self.spreadsheet_key,
                                         self.worksheet_id)
Not sure if you got this sorted out or not, but regarding speeding up the clearing of the current data, try using a batch request. For instance, to clear out every single cell in the sheet, you could do:
cells = client.GetCellsFeed(key, wks_id)
batch_request = gdata.spreadsheet.SpreadsheetsCellsFeed()

# Iterate through every cell in the CellsFeed, replacing each one with ''
# Note that this does not make any calls yet - it all happens locally
for i, entry in enumerate(cells.entry):
    entry.cell.inputValue = ''
    batch_request.AddUpdate(cells.entry[i])

# Now send the entire batch request as a single HTTP request
updated = client.ExecuteBatch(batch_request, cells.GetBatchLink().href)
If you want to do things like save the column headers (assuming they are in the first row), you can use a CellQuery:
# Set up a query that starts at row 2
query = gdata.spreadsheet.service.CellQuery()
query.min_row = '2'

# Pull just those cells
no_headers = client.GetCellsFeed(key, wks_id, query=query)

batch_request = gdata.spreadsheet.SpreadsheetsCellsFeed()

# Iterate through every cell in the CellsFeed, replacing each one with ''
# Note that this does not make any calls yet - it all happens locally
for i, entry in enumerate(no_headers.entry):
    entry.cell.inputValue = ''
    batch_request.AddUpdate(no_headers.entry[i])

# Now send the entire batch request as a single HTTP request
updated = client.ExecuteBatch(batch_request, no_headers.GetBatchLink().href)
Alternatively, you could use this to update your cells as well (perhaps more in line with what you want). The link to the documentation provides a basic way to do that, which is (copied from the docs in case the link ever changes):
import gdata.spreadsheet
import gdata.spreadsheet.service
client = gdata.spreadsheet.service.SpreadsheetsService()
# Authenticate ...
cells = client.GetCellsFeed('your_spreadsheet_key', wksht_id='your_worksheet_id')
batchRequest = gdata.spreadsheet.SpreadsheetsCellsFeed()
cells.entry[0].cell.inputValue = 'x'
batchRequest.AddUpdate(cells.entry[0])
cells.entry[1].cell.inputValue = 'y'
batchRequest.AddUpdate(cells.entry[1])
cells.entry[2].cell.inputValue = 'z'
batchRequest.AddUpdate(cells.entry[2])
cells.entry[3].cell.inputValue = '=sum(3,5)'
batchRequest.AddUpdate(cells.entry[3])
updated = client.ExecuteBatch(batchRequest, cells.GetBatchLink().href)
I have written a script which works, but I'm guessing it isn't the most efficient. What I need to do is the following:
Compare two CSV files that contain user information. It's essentially a member list, where one file is a more updated version of the other.
The files contain data such as ID, name, status, etc.
Write to a third CSV file ONLY the records in the new file that either don't exist in the older file or contain updated information. For each record, there is a unique ID that allows me to determine whether a record is new or previously existed.
Here is the code I have written so far:
import csv

fileAin = open('old.csv','rb')
fOld = csv.reader(fileAin)

fileBin = open('new.csv','rb')
fNew = csv.reader(fileBin)

fileCout = open('NewAndUpdated.csv','wb')
fNewUpdate = csv.writer(fileCout)

old = []
new = []

for row in fOld:
    old.append(row)
for row in fNew:
    new.append(row)

output = []
x = len(new)
i = 0
num = 0

while i < x:
    if new[num] not in old:
        fNewUpdate.writerow(new[num])
    num += 1
    i += 1

fileAin.close()
fileBin.close()
fileCout.close()
In terms of functionality, this script works. However, I'm trying to run it on files that contain hundreds of thousands of records, and it's taking hours to complete. I am guessing the problem lies with reading both files into lists and treating the entire row of data as a single string for comparison.
My question is: for what I am trying to do, is there a faster, more efficient way to process the two files and create the third file containing only new and updated records? I don't really have a target time; I mostly want to understand whether there are better ways in Python to process these files.
Thanks in advance for any help.
UPDATE to include sample row of data:
123456789,34,DOE,JOHN,1764756,1234 MAIN ST.,CITY,STATE,305,1,A
How about something like this? One of the biggest inefficiencies of your code is checking whether new[num] is in old every time: because old is a list, that check has to iterate through the entire list. Using a dictionary is much, much faster.
import csv

fileAin = open('old.csv','rb')
fOld = csv.reader(fileAin)

fileBin = open('new.csv','rb')
fNew = csv.reader(fileBin)

fileCout = open('NewAndUpdated.csv','wb')
fNewUpdate = csv.writer(fileCout)

# Map each unique ID (first column) to the rest of its row
old = {row[0]: row[1:] for row in fOld}
new = {row[0]: row[1:] for row in fNew}

fileAin.close()
fileBin.close()

output = {}
for row_id in new:
    # Keep rows that are new or whose contents changed
    if row_id not in old or not old[row_id] == new[row_id]:
        output[row_id] = new[row_id]

for row_id in output:
    fNewUpdate.writerow([row_id] + output[row_id])

fileCout.close()
difflib is quite efficient: http://docs.python.org/library/difflib.html
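For example, a rough sketch using difflib.unified_diff to keep only the lines that are new or changed in new.csv; this assumes both files fit in memory and are in the same row order, since a line-based diff treats reordered rows as changes:

import difflib

with open('old.csv') as f_old, open('new.csv') as f_new:
    old_lines = f_old.readlines()
    new_lines = f_new.readlines()

with open('NewAndUpdated.csv', 'w') as out_file:
    # n=0 suppresses unchanged context lines; '+'-prefixed lines are
    # additions or the new versions of changed rows in new.csv
    for line in difflib.unified_diff(old_lines, new_lines, n=0):
        if line.startswith('+') and not line.startswith('+++'):
            out_file.write(line[1:])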
Sort the data by your unique field(s), and then use a comparison process analogous to the merge step of merge sort:
http://en.wikipedia.org/wiki/Merge_sort
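A sketch of that merge step, assuming both files have already been sorted by the unique ID in the first column and that the IDs compare consistently as strings (e.g. fixed-width, like the sample row above):

import csv

def write_new_and_updated(old_path, new_path, out_path):
    # Stream both sorted files in a single pass, O(1) memory
    with open(old_path) as f_old, open(new_path) as f_new, \
         open(out_path, 'w', newline='') as f_out:
        old_reader = csv.reader(f_old)
        new_reader = csv.reader(f_new)
        writer = csv.writer(f_out)
        old_row = next(old_reader, None)
        for new_row in new_reader:
            # Advance the old file until its ID catches up with the new ID
            while old_row is not None and old_row[0] < new_row[0]:
                old_row = next(old_reader, None)
            # Keep rows whose ID is absent from the old file, or whose
            # contents changed (header handling omitted for brevity)
            if old_row is None or old_row != new_row:
                writer.writerow(new_row)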