I have to parse a very complex dump (whatever it is). I have done the parsing in Python. Since the parsed data is huge, I have to load it into a database (SQL). I have also done this. Now I have to compare the data that is in SQL.
Specifically, I have to compare the data of the 1st dump with the data of the 2nd dump. Both dumps have the same fields (attributes), but the values of those fields may differ, so I have to detect that change. For this I have to do the comparison, but I don't know how to do all of this using Python as my front end.
If you don't have MINUS or EXCEPT, there is also this, which will show all non-matching rows using a UNION ALL/GROUP BY trick (the alias src is used because table is a reserved word):
SELECT MAX(src), data1, data2
FROM (
    SELECT 'foo1' AS src, foo1.data1, foo1.data2 FROM foo1
    UNION ALL
    SELECT 'foo2' AS src, foo2.data1, foo2.data2 FROM foo2
) AS X
GROUP BY data1, data2
HAVING COUNT(*) = 1
ORDER BY data1, data2
I have a general-purpose table-compare stored procedure which can also do a more complex table compare with left, right, and inner joins, a monetary threshold (or threshold percentage), and subset criteria.
Why not do the 'detect change' part in SQL? Something like:
select foo.data1, foo.data2 from foo where foo.id = 'dump1'
minus
select foo.data1, foo.data2 from foo where foo.id = 'dump2'
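If the comparison has to be driven from Python (the front end mentioned in the question), here is a minimal sketch, assuming sqlite3, a single table foo that holds both dumps with a dump-label column id as in the query above, and EXCEPT as the portable spelling of MINUS; the file name is hypothetical:
import sqlite3

conn = sqlite3.connect("dumps.db")  # hypothetical database file
cur = conn.cursor()

# rows present in dump1 but missing (or changed) in dump2
cur.execute("""
    SELECT data1, data2 FROM foo WHERE id = 'dump1'
    EXCEPT
    SELECT data1, data2 FROM foo WHERE id = 'dump2'
""")
for row in cur.fetchall():
    print(row)

conn.close()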
SUMMARY:
How to query against values from different dataframe columns, with table.column_name combinations, in SQLAlchemy using the or_ statement.
I'm working on a SQLAlchemy project where I pull down the valid columns of a dataframe and enter them all into SQLAlchemy's filter. I've successfully got it running so that it enters all entries of a column, using the head of the column, like this:
qry = qry.filter(or_(*[getattr(Query_Tbl,column_head).like(x) \
for x in (df[column_head].dropna().values)]))
This produced the pattern I was looking for of (tbl.column1 like a OR tbl.column1 like b...) AND- etc.
However, there are groups of dataframe columns that need to be placed together: the columns are different, but their conditions still need to fall within the same or_ group,
i.e. (the desired result)
(tbl.col1 like a OR tbl.col1 like b OR tbl.col2 like c OR tbl.col2 like d OR tbl.col3 like e...) etc.
My latest attempt was to sub-group the columns I needed grouped together, then repeat the previous style inside those groups like:
qry = qry.filter(or_((*[getattr(Query_Tbl, set_id[0]).like(x) \
for x in (df[set_id[0]].dropna().values)]),
(*[getattr(Query_Tbl, set_id[1]).like(y) \
for y in (df[set_id[1]].dropna().values)]),
(*[getattr(Query_Tbl, set_id[2]).like(z) \
for z in (df[set_id[2]].dropna().values)])
))
Here set_id is a list of 3 strings corresponding to column1, column2, and column3, so I get the designated groups. However, this simply produces:
(What I'm actually getting)
(tbl.col1 like a OR tbl.col1 like b..) AND (tbl.col2 like c OR tbl.col2 like d...) AND (tbl.col3 like e OR...)
Is there a better way to go about this in SQLAlchemy to get the result I want, or would it be better to find a way of feeding column values from Pandas directly into getattr() to work them into my existing code?
Thank you for reading and in advance for your help!
It appears I was having issues with the way the dataframe was formatted, and I was reading column names into groups differently than I intended. This pattern works for anyone who wants to process multiple df columns into the same OR statements.
I apologize for the confusion; if anyone has any comments or questions on the subject, I will help others with this type of issue.
Alternatively, I found a much cleaner answer. Since SQLAlchemy's or_ function can be used with a variable column if you use Python's built-in getattr() function, you only need to create (column, value) pairs, whereby you can unpack both in a loop.
for group in [group_2, group_3]:
    # columns from this group that actually appear in the dataframe
    set_id = list(set(df.columns.values) & set(group))
    if len(set_id) > 1:
        set_tuple = list()
        for column in set_id:
            for value in df[column].dropna().values:
                set_tuple.append((column, value))
        print(set_tuple)
        # one or_ spanning every (column, value) pair in the group
        qry = qry.filter(or_(*[getattr(Query_Tbl, id).like(x) for id, x in set_tuple]))
        df = df.drop(group, axis=1)
If you know which columns need to be grouped in the or_ statement, you can put them into lists and iterate through them. Inside those, you create a list of tuples with the (column, value) pairs you need. Then within the or_ function you unpack the columns and values in a loop and assign them accordingly. The code is much easier to read and much more compact. I found this to be a more robust solution than explicitly writing out cases for the group sizes.
I have a database table with 3 columns (A, B, C). I want to add some rows to the table, and for that I am going to take input from the user with a 'textentrydialog' like this: https://pastebin.com/0JYm5x6e. The problem is that I want to add multiple rows to the table for multiple values of 'A' while the values of B and C stay the same. For example:
B = Ram
C = Aam
A = s,t,k
So the values in table should insert in this way:
(s,Ram,Aam)
(t,Ram,Aam)
(k,Ram,Aam)
Can someone please help with how I can do this insert?
Here is a proposal that is able to create the output you have shown from the input you have shown.
Note that I assume you insist on this way of inputting the data, which implies using a single table.
If you can accept different input, I recommend using two tables,
one with (id, A, C) and one with (id, B), and then querying with a join using(id).
An MCVE for this is at the end of the answer. It contains some additional test cases, which I made up to demonstrate that it does not only give output for the given input, trying to guess obvious use cases.
Query:
select A, group_concat(B), C
from toy
group by A,C;
Output:
Mar|t,u|Aam
Ram|s,t,k|Aam
Ram|k,s,m|Maa
MCVE:
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE toy (A varchar(10), B varchar(10), C varchar(10));
INSERT INTO toy VALUES('Ram','s','Aam');
INSERT INTO toy VALUES('Ram','t','Aam');
INSERT INTO toy VALUES('Ram','k','Aam');
INSERT INTO toy VALUES('Mar','t','Aam');
INSERT INTO toy VALUES('Mar','u','Aam');
INSERT INTO toy VALUES('Ram','k','Maa');
INSERT INTO toy VALUES('Ram','s','Maa');
INSERT INTO toy VALUES('Ram','m','Maa');
COMMIT;
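If the inserts themselves are driven from Python with the values collected from the text entry dialog, here is a minimal sketch, assuming sqlite3 and a table with columns (A, B, C) as in the question; the table and file names are hypothetical:
import sqlite3

conn = sqlite3.connect("mydata.db")  # hypothetical database file
cur = conn.cursor()

a_values = "s,t,k".split(",")  # e.g. the comma-separated values entered for A
b_value = "Ram"
c_value = "Aam"

# one row per value of A, with B and C repeated
cur.executemany(
    "INSERT INTO mytable (A, B, C) VALUES (?, ?, ?)",
    [(a, b_value, c_value) for a in a_values],
)
conn.commit()
conn.close()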
I am trying to get the records from my database where studentID and lessonDate are equal to specific values. The studentID part seems to work fine, but lessonDate does not. Because of date formats, I have converted all dates to strings before putting them into the database, and I have set up the database file so that the field is a text field. I am trying to get the lesson name and rating that a student got for all exercises that they performed on a particular date. Database diagram: https://docs.google.com/drawings/d/16IqlDN2iqzVCeaUUGgeCe1R98yhUI7vQswH0tK39cAk/edit?usp=sharing
I am certain that the StudentID is correct, as I use it in other parts of this function. I am certain that the date is correct, as the 'date' variable used in the 4th line results in the correct output being written to the file; I originally had the date in the SQL query as this variable, but it didn't work. I have tried printing the lessons where this student is the foreign key and the date is '8-10-2016'. I really have no idea why this is happening. Any suggestion or hypothesis would be greatly appreciated.
template = ("{0:50}{1:50} \n")
print(lessonDateSet)
for date in list(lessonDateSet):
    target.write("Lesson date: {} \n".format(date))
    target.write("\n")
    target.write(template.format("Exercise name:", "rating:"))
    self.cur.execute("""SELECT b.roadExerciseName, a.rating
        FROM lessonExercises a LEFT JOIN roadExerciseInfo b ON a.roadExerciseID=b.roadExerciseID
        LEFT JOIN lessons c ON c.lessonID=a.lessonID
        WHERE c.StudentID = {0} AND c.lessonDate = {1}""".format(studentInfo[0][0], "8-10-2016"))
    fetch = self.cur.fetchall()
    print(fetch, "fetch")
'fetch' is an empty list after this.
I have double and triple checked my data; my data is definitely correct.
Your parameters are not being quoted correctly.
This is why you should not use string interpolation to add data into your queries. You should use the db-api's parameter substitution instead:
self.cur.execute("""SELECT b.roadExerciseName, a.rating
FROM lessonExercises a LEFT JOIN roadExerciseInfo b ON a.roadExerciseID=b.roadExerciseID
LEFT JOIN lessons c ON c.lessonID=a.lessonID
WHERE c.StudentID = ? AND c.lessonDate = ?""",
[studentInfo[0][0], "8-10-2016"])
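Since the surrounding code loops over lessonDateSet, the hard-coded date can also be replaced by the loop variable and still passed as a bound parameter. A sketch, assuming the date variable holds strings in the same format the dates were stored with:
self.cur.execute("""SELECT b.roadExerciseName, a.rating
    FROM lessonExercises a LEFT JOIN roadExerciseInfo b ON a.roadExerciseID=b.roadExerciseID
    LEFT JOIN lessons c ON c.lessonID=a.lessonID
    WHERE c.StudentID = ? AND c.lessonDate = ?""",
    [studentInfo[0][0], date])  # date is the loop variable from the for loop above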
All I want is the count from TableA grouped by a column from TableB, but of course I need the item from TableB that each count is associated with. Better explained with code:
TableA and B are Model objects.
I'm trying to follow this syntax as best I can.
Trying to run this query:
sq = session.query(TableA).join(TableB).\
group_by(TableB.attrB).subquery()
countA = func.count(sq.c.attrA)
groupB = func.first(sq.c.attrB)
print session.query(countA, groupB).all()
But it gives me an AttributeError (sq does not have attrB)
I'm new to SA and I find it difficult to learn. (links to recommended educational resources welcome!)
When you make a subquery out of a select statement, the columns that can be accessed from it must be in the columns clause. Take for example a statement like:
select x, y from mytable where z=5
If we wanted to make a subquery, then GROUP BY 'z', this would not be legal SQL:
select * from (select x, y from mytable where z=5) as mysubquery group by mysubquery.z
Because 'z' is not in the columns clause of "mysubquery" (it's also illegal since 'x' and 'y' should be in the GROUP BY as well, but that's a different issue).
SQLAlchemy works the same exact way. When you say query(..).subquery(), or use the alias() function on a core selectable construct, it means you're wrapping your SELECT statement in parenthesis, giving it a (usually generated) name, and giving it a new .c. collection that has only those columns that are in the "columns" clause, just like real SQL.
So here you'd need to ensure that TableB, at least the column you're dealing with externally, is available. You can also limit the columns clause to just those columns you need:
sq = session.query(TableA.attrA, TableB.attrB).join(TableB).\
group_by(TableB.attrB).subquery()
countA = func.count(sq.c.attrA)
groupB = func.first(sq.c.attrB)
print session.query(countA, groupB).all()
Note that the above query probably only works on MySQL, as in general SQL it's illegal to reference any columns that aren't part of an aggregate function, or part of the GROUP BY, when grouping is used. MySQL has a more relaxed (and sloppy) system in this regard.
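As a more portable alternative (a sketch, not the original answer's code), you can skip the subquery entirely and keep the aggregate and the GROUP BY in the same SELECT, which is legal in standard SQL; this assumes TableA has an id primary key and that the join between the two mapped classes is configured:
from sqlalchemy import func

# one row per attrB value with the count of matching TableA rows;
# the aggregate and the GROUP BY live in the same SELECT
counts = session.query(TableB.attrB, func.count(TableA.id)).\
    join(TableA).\
    group_by(TableB.attrB).all()
# counts is a list of (attrB, count) tuples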
edit: if you also want zero counts for values that have no rows:
import collections
letter_count = collections.defaultdict(int)
for count, letter in session.query(func.count(MyClass.id), MyClass.attr).group_by(MyClass.attr):
    letter_count[letter] = count
for letter in ["A", "B", "C", "D", "E", ...]:
    print "Letter %s has %d elements" % (letter, letter_count[letter])
Note that letter_count[someletter] defaults to zero if not otherwise populated.
Suppose I have many columns. If two of the columns match and their values are exactly the same across rows, then those rows are duplicates.
ID | title | link | size | author
Say, if link and size are the same for 2 or more rows, then those rows are duplicates.
How do I get those duplicates into a list and process them?
This will return all records that have dups:
SELECT theTable.*
FROM theTable
INNER JOIN (
SELECT link, size
FROM theTable
GROUP BY link, size
HAVING count(ID) > 1
) dups ON theTable.link = dups.link AND theTable.size = dups.size
I like the subquery because I can do things like select all but the first or last (very easy to turn into a delete query then).
Example: select all duplicate records EXCEPT the one with the max ID:
SELECT theTable.*
FROM theTable
INNER JOIN (
SELECT link, size, max(ID) as maxID
FROM theTable
GROUP BY link, size
HAVING count(ID) > 1
) dups ON theTable.link = dups.link
AND theTable.size = dups.size
AND theTable.ID <> dups.maxID
Assuming that none of id, link, or size can be NULL, and that the id field is the primary key, this gives you the ids of duplicate rows. Beware that the same id can be in the results several times if there are three or more rows with identical link and size values.
select a.id, b.id
from tbl a, tbl b
where a.id < b.id
and a.link = b.link
and a.size = b.size
After you remove the duplicates from the MySQL table, you can add a unique index
to the table so no more duplicates can be inserted:
create unique index theTable_index on theTable (link,size);
If you want to do it exclusively in SQL, some kind of self-join of the table (on equality of link and size) is required, and can be accompanied by different kinds of elaboration. Since you mention Python as well, I assume you want to do the processing in Python; in that case, the simplest approach is to build an iterator on a SELECT * FROM thetable ORDER BY link, size and process it with itertools.groupby, using as the key an operator.itemgetter for those two fields; this will present natural groupings of each bunch of 1+ rows with identical values for the fields in question.
I can elaborate on either option if you clarify where you want to do your processing and ideally provide an example of the kind of processing you DO want to perform!
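A minimal sketch of that groupby option, assuming sqlite3, the column order ID, title, link, size, author from the question, and a hypothetical database file name:
import sqlite3
from itertools import groupby
from operator import itemgetter

conn = sqlite3.connect("mydata.db")  # hypothetical database file
cur = conn.cursor()
cur.execute("SELECT ID, title, link, size, author FROM theTable ORDER BY link, size")

# rows sharing the same (link, size) arrive consecutively after the ORDER BY,
# so groupby yields one group per (link, size) combination
key = itemgetter(2, 3)  # positions of link and size in the SELECT list
for (link, size), rows in groupby(cur.fetchall(), key):
    rows = list(rows)
    if len(rows) > 1:
        # these rows are duplicates of each other; process them here
        print(link, size, [r[0] for r in rows])

conn.close()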