I run a query in a loop, and I want to avoid fetching data that was previously fetched.
The best idea I came up with is to keep an ever-expanding blacklist of the data that has already been fetched, and filter out blacklisted data every time I fetch.
I've managed to do so by adding every item that was fetched successfully to a blacklist (called 'allWords'):
allWords.extend(fetchedData)
And then fetching all the items which are not in 'allWords':
c.execute("SELECT formatted FROM dictionary WHERE formatted LIKE ('__A_')")
words = [item[0] for item in c.fetchall() if item[0] not in allWords]
return words
But this way I still fetch all the data. Is there a smarter way to do this?
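One way to avoid fetching everything and filtering in Python is to push the blacklist into the query itself. Below is only a rough sketch, assuming 'c' is a sqlite3 cursor over the same 'dictionary' table and 'allWords' is kept as a set; for a very large blacklist, SQLite's limit on bound parameters would force a temporary table instead.

def fetch_new_words(c, allWords):
    # Build one placeholder per already-fetched word so the exclusion
    # happens inside SQLite instead of in Python.
    query = "SELECT formatted FROM dictionary WHERE formatted LIKE '__A_'"
    if allWords:
        placeholders = ','.join('?' * len(allWords))
        query += " AND formatted NOT IN ({})".format(placeholders)
    c.execute(query, tuple(allWords))
    words = [row[0] for row in c.fetchall()]
    allWords.update(words)  # extend the blacklist with the new batch
    return words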
I am using multiprocessing.Pool in my current program because I wanted to speed up fetching data from an on-premises data center and dumping it into another database on a different server. The current rate is too slow for MBs worth of data. This is what seems to work best for my current requirement:
def fetch_data():
    # select data from on_prem_db (id, name...data)
    # using Pool and starmap,
    # runs the dump_data function in 5 parallel workers
    dump_data()
    pass

def dump_data():
    # insert entry in table_f1
    # insert entry in table_g1
    pass
Now I am running into an issue where, sometimes, multiple threads fetch already-processed granules, which leads to unique key violations.
E.g. the first thread fetches [10, 20, 40, 50, 70]
and the second thread fetches [30, 40, 60, 70, 80].
Rows with ids 40 and 70 are duplicated. I am supposed to see 10 entries in my db, but I see only 8 entries, and the 2 duplicates raise unique key violations.
How can I make sure that different threads fetch different rows from my source (on-prem) db, so that my program doesn't try to insert already-inserted rows?
An example of my select query:
fetch_data_list_of_ids = [list of ids of processed data]
data_list = list(itertools.islice(on_prem_db.get_data(table_name),5))
Is there a way I can make a list and append the row ids of already-processed data in fetch_data()?
Then, every time data_list runs a new query to fetch data, I would check whether the newly fetched rows have ids that are already in the fetch_data_list_of_ids list.
Or is there any other way to make sure duplicate entries are not processed?
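One way to guarantee that no two workers ever see the same row is to read the list of unprocessed ids once in the parent process, split it into disjoint chunks, and hand each chunk to a Pool worker. The sketch below is only an outline: get_unprocessed_ids, get_rows_by_ids and dump_data are placeholders standing in for your real fetch and dump steps, and note that Pool workers are separate processes, not threads.

from multiprocessing import Pool

CHUNK_SIZE = 5
NUM_WORKERS = 5

def dump_chunk(id_chunk):
    # Placeholder helpers: each worker fetches ONLY its own ids and dumps them,
    # so no two workers ever process the same row.
    rows = on_prem_db.get_rows_by_ids(table_name, id_chunk)  # placeholder
    for row in rows:
        dump_data(row)  # your existing inserts into table_f1 / table_g1

def main():
    # Read the full list of unprocessed ids once, in the parent process.
    all_ids = on_prem_db.get_unprocessed_ids(table_name)  # placeholder
    chunks = [all_ids[i:i + CHUNK_SIZE] for i in range(0, len(all_ids), CHUNK_SIZE)]
    with Pool(NUM_WORKERS) as pool:
        pool.map(dump_chunk, chunks)  # disjoint chunks -> no duplicate inserts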
I have a dataset that I am pulling from an API. The dataset contains fields such as store_id, store_description, monthly_sales, total_sales.
There are 16,219 records in this dataset.
I would like to automate pulling this data, but when I pull from the API more than once I get duplicate records instead of each record being updated. Below is the code I am using to update the data:
for i in json.loads(data):
    for j in col.find({}):
        if i['store_id'] == j['store_id']:
            col.update_one({j}, {i})
        else:
            col.insert_one({i})
I am not really sure what exactly I am doing wrong here. I would appreciate any help.
If your aim is to match on a key of store_id and update the entirety of the record if it matches an existing record, or insert it if it doesn't, then use replace_one() with upsert=True, i.e.:
col.replace_one({'store_id': i['store_id']}, i, upsert=True)
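For example, assuming col is a pymongo collection and data is the JSON string returned by the API (as in the question), the nested loop could be replaced by either a per-record upsert or a single bulk_write of ReplaceOne operations:

import json
from pymongo import ReplaceOne

records = json.loads(data)

# Option 1: one round trip per record, upserting each store by its store_id.
for i in records:
    col.replace_one({'store_id': i['store_id']}, i, upsert=True)

# Option 2: batch the ~16k upserts into a single bulk_write call.
col.bulk_write([
    ReplaceOne({'store_id': i['store_id']}, i, upsert=True)
    for i in records
])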
I am trying to extract data from dictionaries in a context where performance is a high priority.
I have ~1 million dicts to process (and this is going to scale up). However, a dict can occasionally be missing a value.
The current solution is based on the results below: I iterate over the list of inputs, pop 'course', insert it into a unique list if it is not already there (to remove redundancy), and then insert its id back into the input under the key 'course_id'.
Problem:
This is far from efficient, but I am out of ideas and keywords for a better approach. Does SQLite have any tricks for inserting into different tables, or can I separate the items in a better way?
My idea:
'id' is a user_id, and 'course' is a row that should be placed in another table. This data should be inserted into an SQLite database, but I am not sure how to handle the conversion.
Input:
{
    'id': user_id,
    'course': {
        'id': course.get('id'),
        'name': course.get('name')
    },
    'interests': interests.get('id'),
    'address': address.get('name'),
}
Expected results into user table:
{
    'id': 123,
    'course_id': 1,  # foreign key to the course id given above
    'interests': 'Likes to code',
    'address': 'Parents house',
}
Expected results into course table:
{
    'id': 1,
    'name': 'some course'
}
I am currently doing a bulk insert using SQLAlchemy core where I create a list of all items to be inserted and then call execute. Example query:
text('INSERT OR IGNORE INTO users (id, course_id, interests, address) VALUES (:id, :course_id, :interests, :address)'),
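One possible approach, sketched below, is to deduplicate the courses in Python using a dict keyed on the course id, and then run two bulk INSERT OR IGNORE statements (one per table) through SQLAlchemy core. The table name 'courses', the engine URL and the 'inputs' variable are assumptions standing in for your own schema and data:

from sqlalchemy import create_engine, text

engine = create_engine('sqlite:///app.db')  # placeholder URL

def split_users_and_courses(inputs):
    # Separate each input dict into a user row and a deduplicated course row.
    courses = {}  # course_id -> course dict, so each course is kept only once
    users = []
    for item in inputs:
        course = item.get('course') or {}
        course_id = course.get('id')
        if course_id is not None and course_id not in courses:
            courses[course_id] = {'id': course_id, 'name': course.get('name')}
        users.append({
            'id': item.get('id'),
            'course_id': course_id,  # may be None if 'course' was missing
            'interests': item.get('interests'),
            'address': item.get('address'),
        })
    return users, list(courses.values())

users, courses = split_users_and_courses(inputs)  # 'inputs' is your list of dicts
with engine.begin() as conn:
    conn.execute(
        text('INSERT OR IGNORE INTO courses (id, name) VALUES (:id, :name)'),
        courses,
    )
    conn.execute(
        text('INSERT OR IGNORE INTO users (id, course_id, interests, address) '
             'VALUES (:id, :course_id, :interests, :address)'),
        users,
    )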
I have a DynamoDB table that stores email attribute information, with a hash key on the email and a range key on the timestamp (a number). The initial idea behind using the email as the hash key was to be able to query all records for a given email. But one thing I am trying to do is retrieve all email ids (the hash key values). I am using boto for this, but I am unsure how to retrieve distinct email ids.
My current code to pull 10,000 email records is
import boto.dynamodb2
from boto.dynamodb2.table import Table

conn = boto.dynamodb2.connect_to_region('us-west-2')
email_attributes = Table('email_attributes', connection=conn)
s = email_attributes.scan(limit=10000, attributes=['email'])
But to retrieve the distinct records, I would have to do a full table scan and then pick out the distinct records in code. Another idea I have is to maintain a second table that just stores these emails and do conditional writes: check whether an email id exists and, if not, write it. But I am trying to work out whether this would be more expensive, since every write would be a conditional write.
Q1.) Is there a way to retrieve distinct records using a DynamoDB scan?
Q2.) Is there a good way to calculate the cost per query?
Using a DynamoDB Scan, you would need to filter out duplicates on the client side (in your case, using boto). Even if you create a GSI with the reverse schema, you will still get duplicates. Given a hash+range (H+R) table of email_id+timestamp called stamped_emails, a list of all unique email_ids is effectively a materialized view of that table. You could enable a DynamoDB Stream on the stamped_emails table and subscribe a Lambda function to that stream that does a PutItem(email_id) to a hash-only table called emails_only. Then you could Scan emails_only and you would get no duplicates.
Finally, regarding your question about cost: first, Scan reads entire items even if you only request certain projected attributes from those items. Second, Scan has to read through every item, even if it is filtered out by a FilterExpression (condition expression). Third, Scan reads through items sequentially, so each Scan call is metered as one big read. The cost implication is that if a Scan call reads 200 different items, it does not necessarily cost 100 RCU. If each of those items is 100 bytes, that Scan call costs ROUND_UP((20000 bytes / 1024 bytes per KB) / (8 KB per eventually consistent RCU)) = ROUND_UP(2.44) = 3 RCU. Even if the call only returns 123 items after filtering, because the Scan had to read 200 items you would still incur 3 RCU.
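For Q1, the client-side deduplication mentioned above might look like the sketch below, using the same boto interface as in the question. Note that this still scans (and pays for) every item; it only collapses duplicates after they arrive:

import boto.dynamodb2
from boto.dynamodb2.table import Table

conn = boto.dynamodb2.connect_to_region('us-west-2')
email_attributes = Table('email_attributes', connection=conn)

# Scan only the 'email' attribute; the ResultSet pages through the table
# automatically as it is iterated.
unique_emails = set()
for item in email_attributes.scan(attributes=['email']):
    unique_emails.add(item['email'])

print(len(unique_emails))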
I have a big JSON file (about 50 MB) which I have to iterate over, process some text from, and then insert the processed text into a MySQL table.
My question is:
Would it be better to insert record by record into the table while iterating over the JSON file?
move to one item in the JSON -> extract the info I need -> open db connection -> insert the record -> close db connection -> move to the next item in the JSON file... and so on until the end of the file
In this case, would it be better to open and close the db connection every time, or to leave it open until the end of the JSON file?
Or, the other option I thought of would be to iterate over the JSON file and build a list of dictionaries (one dictionary for each record, with keys naming the fields to insert into and values holding the values to be inserted into the database), and then insert into the database:
iterate over the JSON file -> extract the info I need -> store the info in a dictionary -> add the dictionary to a list -> repeat until the end of the file -> open db connection -> iterate over the list -> insert records
In this case, would it be possible to insert the whole list into the database at once instead of iterating over the list with a for loop to insert record by record?
Any ideas on what would be the best option?
Sorry if the question looks stupid, but I am a beginner and could not find this answer anywhere... I have over 100,000 records to insert...
Thanks in advance for any help!
It is definitely much better to insert all records into the database in one go: there is considerable overhead in creating and closing connections and in executing many INSERT statements instead of just one. Perhaps 100,000 records is a bit much to insert at once; if MySQL chokes, try adding chunks of e.g. 1,000 records in one go.
I am assuming memory usage will not be an issue; this depends, of course, on how large each record is.
My advice would be to use SQLAlchemy to access the database if you often need to access databases from Python. It is definitely worth the investment! With SQLAlchemy, the code would look something like this:
CHUNKSIZE = 1000

# < parse JSON >
# < store into a list of dictionaries 'records' >
# < refer to the SQLAlchemy tutorial for how to create the Base and Session classes >

class MyRecord(Base):
    ''' SQLAlchemy record definition.
    Let the column names correspond to the keys in the record dictionaries.
    '''
    ...

session = Session()
for chunk in [records[CHUNKSIZE*i:CHUNKSIZE*(i+1)] for i in range(1 + len(records) // CHUNKSIZE)]:
    for rec in chunk:
        session.add(MyRecord(**rec))
    session.commit()
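For completeness, here is a rough sketch of how the 'records' list used above could be built from the JSON file described in the question. The field names and the process_text() step are placeholders, since the real structure of the file isn't shown:

import json

def build_records(path):
    # Parse the JSON file once and build the list of dictionaries
    # consumed by the chunked insert loop above.
    with open(path) as f:
        items = json.load(f)  # assumption: the file is a single top-level JSON array
    records = []
    for item in items:
        records.append({
            # keys must match the MyRecord column names
            'text': process_text(item['text']),  # hypothetical text-processing step
        })
    return records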