Extract specific columns and group them from dictionary in Python

I want to extract specific columns and group them from the records which I get using MySQLdb. I have written following code:
import _mysql

cdb = _mysql.connect(host="myhost", user="root",
                     passwd="******", db="my_db")
qry = "select col1,col2,col3,col4,col5,col6 from mytable"
cdb.query(qry)
resultset = cdb.store_result()
records = resultset.fetch_row(0, 1)  # 0 - no limit, 1 - output is in dictionary form
I want to extract only three columns, col1, col3 and col4, from the records, and group them by the unique values of those three columns, i.e. all unique combinations of (col1, col3, col4). I know I have to use the set() datatype to find unique values, and I tried to use it, but I didn't have any success. Let me know what a good solution for this would be.
I have thousands of records in the database. I am getting the records in the following form:
({
'col1':'data11',
'col2':'data11',
'col3':'data13',
'col4':'data14',
'col5':'data15',
'col6':'data16'
},
{
'col1':'data21',
'col2':'data21',
'col3':'data23',
'col4':'data24',
'col5':'data25',
'col6':'data26'
})

I have come up with this solution:
def filter_unique(records, columns):
    unique = set(tuple(rec[col] for col in columns) for rec in records)
    return [dict(zip(columns, items)) for items in unique]
It first builds a tuple of the chosen column values for each record, then removes duplicate occurrences with set(), and finally reconstructs a dictionary by zipping the column names back onto each tuple of values.
Call it like this:
filtered_records = filter_unique(records, ['col1', 'col3', 'col4'])
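On the two sample records above, a quick hypothetical run would look roughly like this; note that the output order is not guaranteed, since it comes out of a set:

records = (
    {'col1': 'data11', 'col2': 'data11', 'col3': 'data13',
     'col4': 'data14', 'col5': 'data15', 'col6': 'data16'},
    {'col1': 'data21', 'col2': 'data21', 'col3': 'data23',
     'col4': 'data24', 'col5': 'data25', 'col6': 'data26'},
)

# group on the three columns from the question
print filter_unique(records, ['col1', 'col3', 'col4'])
# [{'col1': 'data21', 'col3': 'data23', 'col4': 'data24'},
#  {'col1': 'data11', 'col3': 'data13', 'col4': 'data14'}]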
Disclaimer: I am a Python beginner myself, so my solution might not be the best or the most optimized one.

Related

Pandas dataframe processing - python

So I am back with another question about Python and pandas.
I have table1 with the following columns:
ID;COUNT;FOREIGN_ID;OTHER_DATA
1;3;xyz1
2;1;xyz2
3;1;xyz3
table2
ID;FOREIGN_ID;OTHER_DATA
1;xyz1;000001
2;xyz1;000002
3;xyz1;000003
4;xyz1;000004
5;xyz1;000005
6;xyz2;000000
7;xyz2;000000
8;xyz3;000000
9;xyz3;000000
Both tables are stored as CSV files. I load both of them into dataframes and then iterate through table1. For each row I must find all records in table2 with the same FOREIGN_ID and randomly select some of them.
df_result = pd.DataFrame()
df_table1 = pd.read_csv(table1, delimiter=';')
df_table2 = pd.read_csv(table2, delimiter=';')
for index, row in df_table1.iterrows():
    df_candidates = df_table2[df_table2['FOREIGN_ID'] == row['FOREIGN_ID']]
    random_numbers = np.random.choice(len(df_candidates), row['COUNT'], replace=False)
    df_result = df_result.append(df_candidates.iloc[random_numbers])
In my earlier question I got an answer that using a for loop is a big time waster... But for this problem I can't find a solution where I wouldn't need to use a for loop.
EDIT:
I am sorry for editing my question so late.. was busy with other stuff...
As requested, below is the result table. Please note that my real tables are slightly different from those below: in my real use case I am joining the tables on 3 foreign keys, but for demonstration I am using tables with fake data.
So the logic should be something like this:
Read the first line of table1.
1;3;xyz1
Find all records with same FOREIGN_ID in table2
count = 3, foreign_id = xyz1
Rows with foreign_id = xyz1 are rows:
1;xyz1;000001
2;xyz1;000002
3;xyz1;000003
4;xyz1;000004
5;xyz1;000005
Because count = 3 I must randomly choose 3 of those records.
I do this with the following line, where df_candidates is the table of all suitable records (the table above):
random_numbers = np.random.choice(len(df_candidates), row['COUNT'], replace=False)
Then I store the randomly chosen records in df_result. After parsing all rows from table1, I write df_result to a CSV.
The problem is that my tables are 0.5 to 1 million rows long, so iterating through every row in table1 is really slow... And I am sure there is a better way of doing this, but I've been stuck on it for the past 2 days.
To select the table2 rows whose FOREIGN_ID values appear in table1, you can use, for example, pd.merge:
col = "FOREIGN_ID"
left = df_table2
right = df_table1[[col]]
filtered = pd.merge(left=left, right=right, on=col, how="inner")
Or df.isin():
ix = df_table2[col].isin(df_table1[col])
filtered = df_table2[ix]
Then, to select one random row per group:
def select_random_row(grp):
    choice = np.random.randint(len(grp))
    return grp.iloc[choice]

filtered.groupby(col).apply(select_random_row)
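If you need to draw row['COUNT'] rows per group rather than a single one, a sketch along the same lines is shown below. This is only an outline under a few assumptions: the file names are placeholders, each FOREIGN_ID appears at most once in table1, and each group in table2 has at least COUNT rows (sampling without replacement raises otherwise):

import numpy as np
import pandas as pd

# placeholder file names - adjust to the real paths
df_table1 = pd.read_csv("table1.csv", delimiter=";")
df_table2 = pd.read_csv("table2.csv", delimiter=";")

# map each FOREIGN_ID to the number of rows requested in table1
counts = df_table1.set_index("FOREIGN_ID")["COUNT"]

# keep only the table2 rows whose FOREIGN_ID appears in table1
candidates = df_table2[df_table2["FOREIGN_ID"].isin(counts.index)]

def sample_group(grp):
    # grp.name is the FOREIGN_ID of this group
    return grp.sample(n=counts[grp.name], replace=False)

df_result = (candidates.groupby("FOREIGN_ID", group_keys=False)
                       .apply(sample_group))
df_result.to_csv("result.csv", sep=";", index=False)

This replaces the Python-level loop over table1 with a single grouped operation, which tends to be much faster on hundreds of thousands of rows.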
Have you looked into using pd.merge()?
Your call would look something like:
results=pd.merge(table1, table2, how='inner', on='FOREIGN_ID')

Replacing multiple values within a pandas dataframe cell - python

My problem: I have a pandas dataframe, and one column in particular that I need to process contains values separated by ":". In some cases, one of those values between the ":" separators can itself be of the form value=value, and it can appear at the start, middle or end of the string. The length of the string can differ from cell to cell as we iterate through the rows, e.g.:
clickstream['events']
1:3:5:7=23
23=1:5:1:5:3
9:0:8:6=5:65:3:44:56
1:3:5:4
I have a file which contains the lookup values for these numbers, e.g.:
event_no,description,event
1,xxxxxx,login
3,ffffff,logout
5,eeeeee,button_click
7,tttttt,interaction
23,ferfef,click1
output required:
clickstream['events']
login:logout:button_click:interaction=23
click1=1:button_click:login:button_click:logout
Is there a Pythonic way of looking up these individual values and replacing them with the event column corresponding to the event_no row, as shown in the output? I have hundreds of events and am trying to work out a smart way of doing this. pd.merge would have done the trick if I had a single value, but I'm struggling to work out how to work across the values and ignore the "=value" part of the string.
Edit to ignore missing keys in the dict:
import pandas as pd

EventsDict = {1: '1:3:5:7', 2: '23:45:1:5:3', 39: '0:8:46:65:3:44:56', 4: '1:3:5:4'}
clickstream = pd.Series(EventsDict)

# Keep this as a dictionary
EventsLookup = {1: 'login', 3: 'logout', 5: 'button_click', 7: 'interaction'}

def EventLookup(x):
    list1 = [EventsLookup.get(int(item), 'Missing') for item in x.split(':')]
    return ":".join(list1)

clickstream.apply(EventLookup)
Since you are using a full DF and not just a series, use:
clickstream['events'].apply(EventLookup)
Output:
1 login:logout:button_click:interaction
2 Missing:Missing:login:button_click:logout
4 login:logout:button_click:Missing
39 Missing:Missing:Missing:Missing:logout:Missing...
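Note that the sample above contains no "=value" parts. If, as the expected output in the question suggests, only the part before "=" should be translated and the part after it kept as-is, a hedged sketch along the same lines (with hypothetical sample data and an EventsLookup that also includes 23) might look like:

import pandas as pd

# hypothetical data mimicking the question's column
clickstream = pd.DataFrame({'events': ['1:3:5:7=23', '23=1:5:1:5:3', '1:3:5:4']})

# event_no -> event; could be built from the lookup file with e.g.
# pd.read_csv('events.csv').set_index('event_no')['event'].to_dict()
EventsLookup = {1: 'login', 3: 'logout', 5: 'button_click',
                7: 'interaction', 23: 'click1'}

def translate_events(cell, lookup):
    out = []
    for token in cell.split(':'):
        if '=' in token:
            # e.g. '7=23': translate only the left side, keep the '=23' part
            left, right = token.split('=', 1)
            out.append('%s=%s' % (lookup.get(int(left), left), right))
        else:
            # unknown event numbers are kept as-is here; use 'Missing' if preferred
            out.append(lookup.get(int(token), token))
    return ':'.join(out)

clickstream['events'] = clickstream['events'].apply(
    lambda cell: translate_events(cell, EventsLookup))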

Obtaining data from PostgreSQL as Dictionary

I have a database table with multiple fields which I am querying and pulling out all data which meets certain parameters. I am using psycopg2 for python with the following syntax:
cur.execute("SELECT * FROM failed_inserts where insertid='%s' AND site_failure=True"%import_id)
failed_sites= cur.fetchall()
This returns the correct values as a list, with the data's integrity and order maintained. However, I want to query the returned list elsewhere in my application, and I only have this list of values; it is not a dictionary with the field names as keys for those values. Rather than having to do
desiredValue = failed_sites[13]  # where 13 is an arbitrary index for desiredValue
I want to be able to query by the field name, like:
desiredValue = failed_sites[fieldName]  # where fieldName is the name of the field I am looking for
Is there a simple way and efficient way to do this?
Thank you!
cursor.description will give you the column information (http://www.python.org/dev/peps/pep-0249/#cursor-objects). You can get the column names from it and use them to create a dictionary per row.
cursor.execute('SELECT ...')
columns = [column[0].lower() for column in cursor.description]

failed_sites = []  # one dictionary per row
for row in cursor:
    site = {}
    for i in range(len(row)):
        value = row[i]
        if isinstance(value, basestring):
            value = value.strip()
        site[columns[i]] = value
    failed_sites.append(site)
The "Dictionary-like cursor", part of psycopg2.extras, seems what you're looking for.

Disturbing odd behavior/bug in Python itertools groupby?

I am using itertools.groupby to parse a short tab-delimited text file. The text file has several columns, and all I want to do is group all the entries that have a particular value x in a particular column. The code below does this for a column called name2, looking for the value held in the variable x. I tried to do this using csv.DictReader and itertools.groupby. In the table, there are 8 rows that match this criterion, so 8 entries should be returned. Instead, groupby returns two sets of entries, one with a single entry and another with 7, which seems like the wrong behavior. I do the matching manually below on the same data and get the right result:
import itertools, operator, csv

col_name = "name2"
x = "ENSMUSG00000002459"
print "looking for entries with value %s in column %s" %(x, col_name)

print "groupby gets it wrong: "
data = csv.DictReader(open(f), delimiter="\t", fieldnames=fieldnames)
for name, entries in itertools.groupby(data, key=operator.itemgetter(col_name)):
    if name == "ENSMUSG00000002459":
        wrong_result = [e for e in entries]
        print "wrong result has %d entries" %(len(wrong_result))

print "manually grouping entries is correct: "
data = csv.DictReader(open(f), delimiter="\t", fieldnames=fieldnames)
correct_result = []
for row in data:
    if row[col_name] == "ENSMUSG00000002459":
        correct_result.append(row)
print "correct result has %d entries" %(len(correct_result))
The output I get is:
looking for entries with value ENSMUSG00000002459 in column name2
groupby gets it wrong:
wrong result has 7 entries
wrong result has 1 entries
manually grouping entries is correct:
correct result has 8 entries
What is going on here? If groupby is really grouping, it seems like I should only get one set of entries per value of x, but instead it returns two. I cannot figure this out. EDIT: Ah, got it: the data should be sorted first.
You're going to want to change your code to force the data into key order before grouping (note that the groupby call must iterate over sorted_data, not the original data):
data = csv.DictReader(open(f), delimiter="\t", fieldnames=fieldnames)
sorted_data = sorted(data, key=operator.itemgetter(col_name))
for name, entries in itertools.groupby(sorted_data, key=operator.itemgetter(col_name)):
    pass  # whatever
groupby's main use, though, is when the dataset is large and the data is already in key order. If you have to sort the data anyway, building a defaultdict is often more efficient:
from collections import defaultdict

name_entries = defaultdict(list)
for row in data:
    name_entries[row[col_name]].append(row)
According to the documentation, groupby() groups only consecutive occurrences of the same key.
I don't know what your data looks like, but my guess is that it's not sorted. groupby only works as expected on data sorted by the grouping key.
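A minimal, self-contained illustration of that consecutive-grouping behaviour (toy data, not from the question):

from itertools import groupby

data = ['a', 'a', 'b', 'a']
print [(key, len(list(group))) for key, group in groupby(data)]
# [('a', 2), ('b', 1), ('a', 1)] - the trailing 'a' starts a new group,
# because groupby only merges *consecutive* equal keys
print [(key, len(list(group))) for key, group in groupby(sorted(data))]
# [('a', 3), ('b', 1)] - sorting first gives one group per distinct key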

Union all type query with python pandas

I am attempting to use pandas to perform data analysis on a flat source of data. Specifically, what I'm attempting to accomplish is the equivalent of a Union All query in SQL.
I am using the read_csv() method to input the data and the output has unique integer indices and approximately 30+ columns.
Of these columns, several contain identifying information, whilst others contain data.
In total, the first 6 columns contain identifying information which uniquely identifies an entry. Following these 6 columns there is a range of columns (A, B, ... etc.) which hold the values. Some of these columns are linked together in sets; for example, (A,B,C) belong together, as do (D,E,F).
However, (D,E,F) are also related to (A,B,C) as follows ((A,D),(B,E),(C,F)).
What I am attempting to do is take my data set which has as follows:
(id1,id2,id3,id4,id5,id6,A,B,C,D,E,F)
and return the following
((id1,id2,id3,id4,id5,id6,A,B,C),
(id1,id2,id3,id4,id5,id6,D,E,F))
Here, as A and D are linked they are contained within the same column.
(Note, this is a simplification, there are approximately 12 million unique combinations in the total dataset)
I have been attempting to use the merge, concat and join functions to no avail. I feel like I am missing something crucial as in an SQL database I can simply perform a union all query (which is quite slow admittedly) to solve this issue.
I have no working sample code at this stage.
Another way of writing this problem, based upon some of the pandas docs:
left = key, lval
right = key, rval
merge(left, right, on=key) = key, lval, rval
Instead I want:
left = key, lval
right = key, rval
union(left, right) = key, lval
                     key, rval
I'm not sure if a new indexing key value would need to be created for this.
I have been able to accomplish what I initially asked for.
It did require a bit of massaging of column names, however.
Solution (using pseudo code):
Set up dataframes with the relevant data. e.g.
left = (id1,id2,id3,id4,id5,id6,A,B,C)
right = (id1,id2,id3,id4,id5,id6,D,E,F)
middle = (id1,id2,id3,id4,id5,id6,G,H,I)
Note, here, that for my dataset this resulted in my having non-unique indexing keys for each of the ids. That is, a key is present for each row in left and right.
Rename the column names.
col_names = [id1,id2,id3,id4,id5,id6,val1,val2,val3]
left.columns = col_names
right.columns = col_names
middle.columns = col_names
Concatenate these
pieces = [left, right, middle]
new_df = concat(pieces)
Now, this will create a new dataframe which contains x unique indexing values and 3x entries. This isn't quite ideal, but it will do for now; the major shortfall is that you cannot uniquely access a single entry row anymore, they come in triples. To access the data you can create a new dataframe based on the unique id values.
e.g.
check_df = new_df[(new_df['id1'] == 'id1') & (new_df['id2'] == 'id2')]  # ... etc.
print check_df
key, id1, id2, id3, id4, id5, id6, A, B, C
key, id1, id2, id3, id4, id5, id6, D, E, F
key, id1, id2, id3, id4, id5, id6, G, H, I
Now, this isn't quite ideal but it's the format I needed for some of my other analysis. It may not be applicable for all parties.
If anyone has a better solution please do share it; I'm relatively new to using pandas with Python.
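For what it's worth, a minimal runnable sketch of the steps above, using the column names from the simplified example and a placeholder file name (two value sets only, for brevity):

import pandas as pd

id_cols = ['id1', 'id2', 'id3', 'id4', 'id5', 'id6']
df = pd.read_csv('source.csv')  # placeholder file name

# slice out each linked set of value columns alongside the id columns
left = df[id_cols + ['A', 'B', 'C']].copy()
right = df[id_cols + ['D', 'E', 'F']].copy()

# give both pieces the same column names so they stack cleanly
col_names = id_cols + ['val1', 'val2', 'val3']
left.columns = col_names
right.columns = col_names

# the pandas equivalent of UNION ALL: stack the pieces vertically
new_df = pd.concat([left, right])  # keeps the original index, so each id row appears once per piece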
