Loading a DB table into nested dictionaries in Python

I have a table in a MySQL DB which I want to load into a dictionary in Python.
The table columns are as follows:
id, url, tag, tagCount
tagCount is the number of times a tag has been repeated for a certain url. So I need a nested dictionary, in other words a dictionary of dictionaries, to load this table, because each url has several tags, each with a different tagCount. The code I used is this (the whole table is about 22,000 records):
from collections import defaultdict

cursor.execute('''SELECT url, tag, tagCount
                  FROM wtp''')
urlTagCount = cursor.fetchall()
d = defaultdict(defaultdict)
for url, tag, tagCount in urlTagCount:
    d[url][tag] = tagCount
print d
First of all, I want to know if this is correct, and if it is, why does it take so much time? Is there a faster solution? I am loading this table into memory to get fast access and avoid the hassle of slow database operations, but at this speed the loading has become a bottleneck itself; it is even much slower than DB access. Can anyone help? Thanks.

You need to ensure that the dictionary (and each of the nested dictionaries) exists before you assign a key and value to it. setdefault is helpful for this purpose. You end up with something like this:
d = {}
for url, tag, tagCount in urlTagCount:
    d.setdefault(url, {})[tag] = tagCount

Maybe you could try normal dicts with tuple keys, like:
d = dict()
for url, tag, tagCount in urlTagCount:
    d[(url, tag)] = tagCount
In any case, did you try:
d = defaultdict(dict)
instead of
d = defaultdict(defaultdict)
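As a quick illustration, here is a minimal sketch of what defaultdict(dict) gives you (the sample rows here are made up):
from collections import defaultdict

# hypothetical rows in the same (url, tag, tagCount) shape as the question
rows = [('example.com', 'python', 3), ('example.com', 'mysql', 1)]

d = defaultdict(dict)  # a missing url key gets a fresh plain dict
for url, tag, tagCount in rows:
    d[url][tag] = tagCount

print d['example.com']  # {'python': 3, 'mysql': 1}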

I managed to verify the code, and it works perfectly. For amateurs like me, I suggest never trying to "print" a very large nested dictionary: that "print d" in the last line of the code was what made it slow. If you remove it, or access the dictionary with actual keys instead, it is very fast.
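If you just want to confirm the dictionary loaded correctly, a cheaper check than printing everything is something like this (the url string here is hypothetical):
print len(d)                    # number of distinct urls
print d['http://example.com']   # spot-check a single url's tag counts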

Related

Python: Filter out rows from result set of pyodbc.row when row contains string

I have not been able to find an answer to this seemingly straightforward filtering process.
I have a result set of table names from a simple ODBC query, and I want to filter out of that result set anything that contains the prefix 'wer_'.
*Some pyodbc connection code*
cursor.execute(<SQL statement which gets the list of tables>)
results = cursor.fetchall()
results = [key for key in results if str(key.name).str.contains('wer_')]
^ I've tried various methods around this but so far no dice. Can you help?
It turned out to be fairly straightforward in the end. It seems pyodbc.Row has a name attribute which you can compare against:
*Some pyodbc connection code*
cursor.execute(<SQL statement which gets the list of tables>)
results = cursor.fetchall()
results = [key for key in results if 'wer_' not in key.name]
I hope this helps somebody in the future!
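One caveat, assuming 'wer_' really is a prefix: the in operator matches the substring anywhere in the name. A sketch of the stricter test with str.startswith:
# keep only rows whose table name does not start with the 'wer_' prefix
results = [key for key in results if not key.name.startswith('wer_')]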

combine fields and values in python for dictionary

I'm probably overlooking something, but I've looked everywhere for a way to do this. I am trying to join fields and values that come out separated (as they would for SQL) into something I can use with MongoDB.
So for example (input):
fields = ['first-name', 'last-name', 'email-address', 'phone-number']
values = ['John', 'Doe', 'john.doe@johndoe.com', '1-800-123-4567']
Output:
{
    'first-name': 'John',
    'last-name': 'Doe',
    'email-address': 'john.doe@johndoe.com',
    'phone-number': '1-800-123-4567'
}
I need it like this so I can just do a simple (I know I don't need to do this):
from pymongo import MongoClient

def getFirstName(self, lastname):
    client = MongoClient()
    db = client.test.contacts
    result = db.find_one({'last-name': lastname})
    return result['first-name']

self.getFirstName("Doe")
My app supports MySQL and PostgreSQL, so I can't really change how it spits out fields and values without breaking those. Sorry if I made code errors; I typed this off the top of my head.
If you need more info, just ask.
You can use zip to wrap the two lists together and pass that to dict():
dict(zip(fields, values))
This assumes, though, that the two lists are always the same length.
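If the lists might not be the same length, a hedged alternative on Python 2.6+ is itertools.izip_longest, which pads the shorter list instead of silently truncating:
from itertools import izip_longest

# missing values come through as None rather than being dropped
d = dict(izip_longest(fields, values, fillvalue=None))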
You could use a dict comprehension and iterate through the lists:
d = {fields[i]: values[i] for i in range(len(fields))}

Python dictionary key length not same as rows returning for the query in mysql

So I am trying to fetch data from MySQL into a Python dictionary.
Here is my code:
def getAllLeadsForThisYear():
    charges = {}
    cur.execute("select lead_id,extract(month from transaction_date),pid,extract(Year from transaction_date) from transaction where lead_id is not NULL and transaction_type='CHARGE' and YEAR(transaction_date)='2015'")
    for i in cur.fetchall():
        lead_id = i[0]
        month = i[1]
        pid = i[2]
        year = str(i[3])
        new = {lead_id: [month, pid, year]}
        charges.update(new)
    return charges

x = getAllLeadsForThisYear()
x=getAllLeadsForThisYear()
When I print len(x.keys()) it gives me some number, say 450, but when I run the same query in MySQL it returns 500 rows. Although I do have some duplicate keys in the dictionary, I expected them all to be counted, since I never check if i not in charges.keys(). Please correct me if I am wrong.
Thanks
As I said, the problem is that you are overwriting your value at a key every time a duplicate key pops up. This can be fixed in two ways:
You can do a check before adding a new value and if the key already exists, append to the already existing list.
For example:
#change these lines
new={lead_id:[month,pid,year]}
charges.update(new)
#to
if lead_id in charges:
    charges[lead_id].extend([month, pid, year])
else:
    charges[lead_id] = [month, pid, year]
Which gives you a structure like this:
charges = {
    '123': [month1, pid1, year1, month2, pid2, year2, ...etc]
}
With this approach, you can reach each separate entry by reading the value at each key in chunks of 3 (this may be useful).
However, I don't really like this approach because it requires you to do that chunking. Which brings me to approach 2.
Use defaultdict from collections, which acts in the exact same way as a normal dict would, except that it creates a default value when you access a key that hasn't already been set.
For example:
#change
charges={}
#to
charges=defaultdict(list)
#and change
new={lead_id:[month,pid,year]}
charges.update(new)
#to
charges[lead_id].append((month,pid,year))
which gives you a structure like this:
charges = {
    '123': [(month1, pid1, year1), (month2, pid2, year2), ...etc]
}
With this approach, you can now iterate through each list at each key with:
for key in charges:
    for entities in charges[key]:
        print(entities)  # would print (month,pid,year) for each separate entry
If you are using this approach, don't forget to from collections import defaultdict. If you don't want to import anything extra, you can mimic this by:
if lead_id in charges:
    charges[lead_id].append((month, pid, year))
else:
    charges[lead_id] = [(month, pid, year)]
This is incredibly similar to the first approach, but it makes explicit the "create a list if the key isn't there" step that defaultdict does implicitly.
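Putting the defaultdict version together, a minimal sketch of how the whole function from the question might look (same cur and query as above):
from collections import defaultdict

def getAllLeadsForThisYear():
    charges = defaultdict(list)
    cur.execute("select lead_id,extract(month from transaction_date),pid,extract(Year from transaction_date) from transaction where lead_id is not NULL and transaction_type='CHARGE' and YEAR(transaction_date)='2015'")
    for lead_id, month, pid, year in cur.fetchall():
        # duplicate lead_ids now accumulate instead of overwriting
        charges[lead_id].append((month, pid, str(year)))
    return charges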

Python insert variable in loop into SQLite database using SQLAlchemy

I am using SQLAlchemy with declarative base and Python 2.6.7 to insert data in a loop into an SQLite database.
As brief background, I have implemented a dictionary approach to creating a set of variables in a loop. What I am trying to do is scrape some data from a website, and have between 1 and 12 pieces of data in the following element:
overall_star_ratings = doc.findall("//div[@id='maincontent2']/div/table/tr[2]//td/img")
count_stars = len(overall_star_ratings)
In an empty SQLite database I have variables "t1_star",..."t12_star", and I want to iterate over the list of values in "overall_star_ratings" and assign the values to the database variables, which vary depending on the page. I'm using SQLAlchemy, so (in highly inefficient language) what I'm looking to do is assign the values and insert them into the DB as follows (I'm looping through 'rows' in the code, such that the 'row' command inserts the value for t1_star into the database column 't1_star', etc.):
if count == 2:
    row.t1_star = overall_star_ratings[1].get('alt')
    row.t2_star = overall_star_ratings[2].get('alt')
elif count == 1:
    row.t1_star = overall_star_ratings[1].get('alt')
This works but is highly inefficient, so I implemented a "dictionary" approach to creating the variables, as I've seen in some "variable variables" questions on Stack Overflow. So, here is what I've tried:
d = {}
for x in range(1, count_stars + 1):
    count = x - 1
    d["t{0}_star".format(x)] = overall_star_ratings[count].get('alt')
This works for creating the 't1_star', 't2_star' keys for the dictionary as well as the values. The problem comes when I try to insert the data into the database. I have tried adding the following to the above loop:
key = "t{0}_star".format(x)
value = d["t{0}_star".format(x)]
row.key = value
I've also tried adding the following after the above loop is completed:
for key, value in d.items():
    row.key = value
The problem is that it is not inserting anything. It appears that the problem is in the row.key part of the script, not in the value, but I am not certain of that. From all that I can see, the keys are the same strings as I'm seeing when I do it the "inefficient" way (i.e., t1_star, etc.), so I'm not sure why this isn't working.
Any suggestions would be greatly appreciated!
Thanks,
Greg
Python attribute access doesn't work like that. row.key looks up the attribute with the literal name "key", not the value that's in the variable key.
You probably need to use setattr:
setattr(row, key, value)
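Applied to the loop from the question, that looks something like:
# assign each scraped value to the matching column attribute on the row
for key, value in d.items():
    setattr(row, key, value)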

What's the most efficient way to insert thousands of records into a table (MySQL, Python, Django)

I have a database table with a unique string field and a couple of integer fields. The string field is usually 10-100 characters long.
Once every minute or so I have the following scenario: I receive a list of 2-10 thousand tuples corresponding to the table's record structure, e.g.
[("hello", 3, 4), ("cat", 5, 3), ...]
I need to insert all these tuples into the table (assume I've verified that none of these strings appear in the database). For clarification, I'm using InnoDB, and I have an auto-incremental primary key for this table; the string is not the PK.
My code currently iterates through this list; for each tuple it creates a Python model object with the appropriate values and calls .save(), something like so:
@transaction.commit_on_success
def save_data_elements(input_list):
    for (s, i1, i2) in input_list:
        entry = DataElement(string=s, number1=i1, number2=i2)
        entry.save()
This code is currently one of the performance bottlenecks in my system, so I'm looking for ways to optimize it.
For example, I could generate SQL statements, each containing an INSERT command for 100 tuples ("hard-coded" into the SQL), and execute them, but I don't know if that would improve anything.
Do you have any suggestion to optimize such a process?
Thanks
You can write the rows to a file in the format
"field1", "field2", .. and then use LOAD DATA to load them
data = '\n'.join(','.join('"%s"' % field for field in row) for row in data)
with open('data.txt', 'w') as f:
    f.write(data)
Then execute this:
LOAD DATA INFILE 'data.txt' INTO TABLE db2.my_table;
For MySQL specifically, the fastest way to load data is using LOAD DATA INFILE, so if you can convert the data into the format it expects, that will probably be the fastest way to get it into the table.
If you don't use LOAD DATA INFILE, as some of the other suggestions mention, two things you can do to speed up your inserts are (see the sketch after this list):
Use prepared statements - this cuts out the overhead of parsing the SQL for every insert
Do all of your inserts in a single transaction - this would require using a DB engine that supports transactions (like InnoDB)
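A minimal sketch of both suggestions combined, assuming a plain DB-API connection conn with MySQLdb's %s paramstyle and a hypothetical table my_table(string, number1, number2):
cur = conn.cursor()
rows = [("hello", 3, 4), ("cat", 5, 3)]
# one parameterized statement executed for many rows,
# all inside a single transaction until commit()
cur.executemany(
    "INSERT INTO my_table (string, number1, number2) VALUES (%s, %s, %s)",
    rows,
)
conn.commit()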
If you can do a hand-rolled INSERT statement, then that's the way I'd go. A single INSERT statement with multiple value clauses is much, much faster than lots of individual INSERT statements.
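For instance, a sketch of hand-rolling one multi-row INSERT (reusing the hypothetical rows, cur, and conn from the sketch above):
# build "VALUES (%s, %s, %s), (%s, %s, %s), ..." for all rows at once
placeholders = ', '.join(['(%s, %s, %s)'] * len(rows))
sql = 'INSERT INTO my_table (string, number1, number2) VALUES ' + placeholders
# flatten [(s, i1, i2), ...] into a single flat parameter list
params = [v for row in rows for v in row]
cur.execute(sql, params)
conn.commit()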
Regardless of the insert method, you will want to use the InnoDB engine for maximum read/write concurrency. MyISAM will lock the entire table for the duration of the insert whereas InnoDB (under most circumstances) will only lock the affected rows, allowing SELECT statements to proceed.
What format do you receive the data in? If it is a file, you can do some sort of bulk load: http://www.classes.cs.uchicago.edu/archive/2005/fall/23500-1/mysql-load.html
This is unrelated to the actual load of data into the DB, but...
If providing a "The data is loading... The load will be done shortly" type of message to the user is an option, then you can run the INSERTs or LOAD DATA asynchronously in a different thread.
Just something else to consider.
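For example, a minimal threading sketch, assuming the save_data_elements function from the question:
import threading

# run the bulk save without blocking the caller
t = threading.Thread(target=save_data_elements, args=(input_list,))
t.start()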
I don't know the exact details, but you can use a JSON-style data representation and load it as fixtures or something similar. I saw something like this in the Django Video Workshop by Douglas Napoleone. See the videos at http://www.linux-magazine.com/online/news/django_video_workshop and http://www.linux-magazine.com/online/features/django_reloaded_workshop_part_1. Hope this one helps.
Hope you can work it out. I just started learning Django, so I can only point you to resources.
