select rows with EXCEPT in sqlite3 - python

I have a database with a table that contains the columns Name, Award, Winner (1 means won and 0 means did not win), and some other columns that are irrelevant to this question.
I want to make a dataframe with the names of people who were nominated for an actress award (all awards with "actress" in the name count) but never won, using sqlite3 in Python.
These are the first five rows of the dataframe:
Unnamed: 0 CeremonyNumber CeremonyYear CeremonyMonth CeremonyDay FilmYear Award Winner Name FilmDetails
0 0 1 1929 5 16 1927 Actor 1 Emil Jannings The Last Command
1 1 1 1929 5 16 1927 Actor 0 Richard Barthelmess The Noose
2 2 1 1929 5 16 1927 Actress 1 Janet Gaynor 7th Heaven
3 3 1 1929 5 16 1927 Actress 0 Louise Dresser A Ship Comes In
4 4 1 1929 5 16 1927 Actress 0 Gloria Swanson Sadie Thompson
I tried this query, but it did not give the correct result.
query = '''
select Name
from oscars
where Award like "Actress%"
except select Name
from oscars
where Award like "Actress%" and Winner == 1
'''
The outcome of this query should be a dataframe like this:
Name
0 Abigail Breslin
1 Adriana Barraza
2 Agnes Moorehead
3 Alfre Woodard
4 Ali MacGraw

To select all the actresses who were nominated for the award but never won, you should use an AND condition rather than EXCEPT. Something like this should work:
SELECT Name FROM oscars WHERE Award LIKE 'Actress%' AND Winner = 0
Refer to the sqlite docs at https://www.sqlite.org/index.html for more information.
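As a sanity check, the suggested query can be run end to end with sqlite3 and pandas; a minimal sketch, loading just the sample rows above into an in-memory table (the table name oscars is taken from the question):

```python
import sqlite3

import pandas as pd

# Load the sample rows into an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
df = pd.DataFrame({
    "Award": ["Actor", "Actor", "Actress", "Actress", "Actress"],
    "Winner": [1, 0, 1, 0, 0],
    "Name": ["Emil Jannings", "Richard Barthelmess", "Janet Gaynor",
             "Louise Dresser", "Gloria Swanson"],
})
df.to_sql("oscars", conn, index=False)

# Actress nominations that did not win.
query = "SELECT Name FROM oscars WHERE Award LIKE 'Actress%' AND Winner = 0"
result = pd.read_sql_query(query, conn)
print(result)  # Louise Dresser, Gloria Swanson
```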

Python pandas merge map with multiple values xlookup

I have a dataframe of actor names:
df1
actor_id actor_name
1 Brad Pitt
2 Nicole Kidman
3 Matthew Goode
4 Uma Thurman
5 Ethan Hawke
And another dataframe of movies that the actors were in:
df2
actor_id actor_movie movie_revenue_m
1 Once Upon a Time in Hollywood 150
2 The Others 50
2 Moulin Rouge 200
3 Stoker 75
4 Kill Bill 125
5 Gattaca 85
I want to merge the two dataframes together to show the actors with their movie names and movie revenues, so I use the merge function:
df3 = df1.merge(df2, on = 'actor_id', how = 'left')
df3
actor_id actor_name actor_movie movie_revenue_m
1 Brad Pitt Once Upon a Time in Hollywood 150
2 Nicole Kidman The Others 50
2 Nicole Kidman Moulin Rouge 200
3 Matthew Goode Stoker 75
4 Uma Thurman Kill Bill 125
5 Ethan Hawke Gattaca 85
But this pulls in all movies, so Nicole Kidman gets duplicated, and I only want to show one movie per actor. How can I merge the dataframes without "duplicating" my list of actors?
How would I merge the movie title that is alphabetically first?
How would I merge the movie title with the highest revenue?
Thank you!
One way is to continue with the merge and then filter the result set.
movie title that is alphabetically first
# sort by name and movie, then pick the first row while grouping by actor
df3.sort_values(['actor_name', 'actor_movie']).groupby('actor_id', as_index=False).first()
actor_id actor_name actor_movie movie_revenue_m
0 1 Brad Pitt Once Upon a Time in Hollywood 150
1 2 Nicole Kidman Moulin Rouge 200
2 3 Matthew Goode Stoker 75
3 4 Uma Thurman Kill Bill 125
4 5 Ethan Hawke Gattaca 85
movie title with the highest revenue
# sort by name and revenue (descending), group by actor and pick the first row
df3.sort_values(['actor_name', 'movie_revenue_m'], ascending=[True, False]).groupby('actor_id', as_index=False).first()
actor_id actor_name actor_movie movie_revenue_m
0 1 Brad Pitt Once Upon a Time in Hollywood 150
1 2 Nicole Kidman Moulin Rouge 200
2 3 Matthew Goode Stoker 75
3 4 Uma Thurman Kill Bill 125
4 5 Ethan Hawke Gattaca 85
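Putting it together as a runnable sketch (frames rebuilt from the question; the revenue column keeps df2's name movie_revenue_m after the merge):

```python
import pandas as pd

# Reconstruct the two sample frames from the question.
df1 = pd.DataFrame({
    "actor_id": [1, 2, 3, 4, 5],
    "actor_name": ["Brad Pitt", "Nicole Kidman", "Matthew Goode",
                   "Uma Thurman", "Ethan Hawke"],
})
df2 = pd.DataFrame({
    "actor_id": [1, 2, 2, 3, 4, 5],
    "actor_movie": ["Once Upon a Time in Hollywood", "The Others",
                    "Moulin Rouge", "Stoker", "Kill Bill", "Gattaca"],
    "movie_revenue_m": [150, 50, 200, 75, 125, 85],
})

df3 = df1.merge(df2, on="actor_id", how="left")

# Alphabetically first movie per actor.
first_alpha = (df3.sort_values(["actor_name", "actor_movie"])
                  .groupby("actor_id", as_index=False).first())

# Highest-revenue movie per actor.
top_revenue = (df3.sort_values(["actor_name", "movie_revenue_m"],
                               ascending=[True, False])
                  .groupby("actor_id", as_index=False).first())
```

The sort-then-group trick works because groupby(...).first() keeps the first row of each group in the frame's current row order, so whatever you sorted to the top of each group survives.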

Pivot table rank by Name(Index) and Title(Column)

I have a dataset that looks like this:
The count represents the number of times they worked.
Title Name Count
Coach Bob 4
teacher sam 5
driver mark 8
Coach tina 10
teacher kate 3
driver frank 2
I want to create a table, which I think will have to be a pivot, that groups names and counts under each title, sorted by the number of times worked, so for example the output would look like this:
Coach teacher driver
tina 10 sam 5 mark 8
Bob 4 kate 3 frank 2
I am familiar with general pivot table code, but I think I'm going to need something a bit more involved.
DF_PIV = pd.pivot_table(DF, values=['count'], index=['title', 'Name'], columns=['title'],
aggfunc=np.max)
I get an error, ValueError: Grouper for 'view_title' not 1-dimensional, but I don't think I'm even on the right track here.
You can try:
(df.set_index(['Title', df.groupby('Title').cumcount()])
.unstack(0)
.astype(str)
.T
.groupby(level=1).agg(' '.join)
.T)
Output:
Title Coach driver teacher
0 Bob 4 mark 8 sam 5
1 tina 10 frank 2 kate 3
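The chain above can be run as a self-contained sketch (frame rebuilt from the sample data):

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["Coach", "teacher", "driver", "Coach", "teacher", "driver"],
    "Name": ["Bob", "sam", "mark", "tina", "kate", "frank"],
    "Count": [4, 5, 8, 10, 3, 2],
})

out = (df.set_index(["Title", df.groupby("Title").cumcount()])
         .unstack(0)                      # Titles become a column level
         .astype(str)
         .T
         .groupby(level=1).agg(" ".join)  # join "Name Count" per Title
         .T)
print(out)
```

The cumcount index gives each Title's rows positions 0, 1, ...; unstack then lines those positions up across Titles, and the join glues each Name to its Count as a single cell.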

Get latest value looked up from other dataframe

My first data frame:
product=pd.DataFrame({
'Product_ID':[101,102,103,104,105,106,107,101],
'Product_name':['Watch','Bag','Shoes','Smartphone','Books','Oil','Laptop','New Watch'],
'Category':['Fashion','Fashion','Fashion','Electronics','Study','Grocery','Electronics','Electronics'],
'Price':[299.0,1350.50,2999.0,14999.0,145.0,110.0,79999.0,9898.0],
'Seller_City':['Delhi','Mumbai','Chennai','Kolkata','Delhi','Chennai','Bengalore','New York']
})
My second data frame has the transactions:
customer=pd.DataFrame({
'id':[1,2,3,4,5,6,7,8,9],
'name':['Olivia','Aditya','Cory','Isabell','Dominic','Tyler','Samuel','Daniel','Jeremy'],
'age':[20,25,15,10,30,65,35,18,23],
'Product_ID':[101,0,106,0,103,104,0,0,107],
'Purchased_Product':['Watch','NA','Oil','NA','Shoes','Smartphone','NA','NA','Laptop'],
'City':['Mumbai','Delhi','Bangalore','Chennai','Chennai','Delhi','Kolkata','Delhi','Mumbai']
})
I want Price from the first data frame to appear in the merged dataframe, the common key being 'Product_ID'. Note that Product_ID 101 has two prices, 299.00 and 9898.00. I want the latter one (9898.0) in the merged data set, since it is the latest price.
Currently my code is not giving the right answer; it returns both rows:
customerpur = pd.merge(customer,product[['Price','Product_ID']], on="Product_ID", how = "left")
customerpur
id name age Product_ID Purchased_Product City Price
0 1 Olivia 20 101 Watch Mumbai 299.0
1 1 Olivia 20 101 Watch Mumbai 9898.0
There is no explicit timestamp, so I assume row order in the dataframe reflects recency. You can drop duplicates at the end:
customerpur.drop_duplicates(subset = ['id'], keep = 'last')
result:
id name age Product_ID Purchased_Product City Price
1 1 Olivia 20 101 Watch Mumbai 9898.0
2 2 Aditya 25 0 NA Delhi NaN
3 3 Cory 15 106 Oil Bangalore 110.0
4 4 Isabell 10 0 NA Chennai NaN
5 5 Dominic 30 103 Shoes Chennai 2999.0
6 6 Tyler 65 104 Smartphone Delhi 14999.0
7 7 Samuel 35 0 NA Kolkata NaN
8 8 Daniel 18 0 NA Delhi NaN
9 9 Jeremy 23 107 Laptop Mumbai 79999.0
Please note the keep = 'last' argument, since we keep only the last price registered.
If you care about performance or the dataset is huge, deduplicate before merging instead:
product = product.drop_duplicates(subset = ['Product_ID'], keep = 'last')
In your data frame there is no indicator of the latest entry, so you might need to first remove the earlier entry for Product_ID 101 from the product dataframe, as follows:
result_product = product.drop_duplicates(subset=['Product_ID'], keep='last')
This keeps the last entry per Product_ID, and you can then do the merge as:
pd.merge(result_product, customer, on='Product_ID')
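The dedupe-then-merge route can be sketched end to end with a trimmed version of the frames (only the columns and rows needed to show the effect):

```python
import pandas as pd

# Trimmed versions of the question's frames.
product = pd.DataFrame({
    "Product_ID": [101, 103, 101],
    "Product_name": ["Watch", "Shoes", "New Watch"],
    "Price": [299.0, 2999.0, 9898.0],
})
customer = pd.DataFrame({
    "id": [1, 5],
    "name": ["Olivia", "Dominic"],
    "Product_ID": [101, 103],
})

# Keep only the latest (last-listed) price per product, then merge.
latest = product.drop_duplicates(subset=["Product_ID"], keep="last")
out = customer.merge(latest[["Product_ID", "Price"]], on="Product_ID", how="left")
print(out)  # Olivia gets 9898.0, the last-listed price for Product_ID 101
```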

Pandas, Dataframe, conditional sum of column for each row

I am new to Python and trying to move some of my work from Excel to Python, and wanted an Excel SUMIFS equivalent in pandas, for example something like:
SUMIFS(F:F, D:D, "<="&C2, B:B, B2, F:F, ">"&0)
In my case, I have 6 columns: a unique trade ID, an issuer, a trade date, a release date, a trader, and a quantity. I want a column that shows the sum of the quantity available for release at each row, something like the below:
A B C D E F G
ID Issuer TradeDate ReleaseDate Trader Quantity SumOfAvailableRelease
1 Horse 1/1/2012 13/3/2012 Amy 7 0
2 Horse 2/2/2012 15/5/2012 Dave 2 0
3 Horse 14/3/2012 NaN Dave -3 7
4 Horse 16/5/2012 NaN John -4 9
5 Horse 20/5/2012 10/6/2012 John 2 9
6 Fish 6/6/2013 20/6/2013 John 11 0
7 Fish 25/6/2013 9/9/2013 Amy 4 11
8 Fish 8/8/2013 15/9/2013 Dave 5 11
9 Fish 25/9/2013 NaN Amy -3 20
Usually, in Excel, I just pull the SUMIFS formula down the whole column and it works; I am not sure how to do this in Python.
Many thanks!
What you could do is use df.where, for example:
Qdf = df.where(df["Quantity"] >= 5)
and then do your sum. I don't know exactly what you want to do, since I have no Excel knowledge, but I hope this helps.
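For the row-wise conditional sum the SUMIFS formula actually describes (sum Quantity over rows with the same Issuer, a ReleaseDate on or before the current row's TradeDate, and a positive Quantity), one sketch is an apply over rows; dates are parsed with dayfirst=True to match the d/m/Y sample:

```python
import pandas as pd

# Rebuild the sample data (NaN release dates become NaT).
df = pd.DataFrame({
    "ID": range(1, 10),
    "Issuer": ["Horse"] * 5 + ["Fish"] * 4,
    "TradeDate": ["1/1/2012", "2/2/2012", "14/3/2012", "16/5/2012", "20/5/2012",
                  "6/6/2013", "25/6/2013", "8/8/2013", "25/9/2013"],
    "ReleaseDate": ["13/3/2012", "15/5/2012", None, None, "10/6/2012",
                    "20/6/2013", "9/9/2013", "15/9/2013", None],
    "Quantity": [7, 2, -3, -4, 2, 11, 4, 5, -3],
})
df["TradeDate"] = pd.to_datetime(df["TradeDate"], dayfirst=True)
df["ReleaseDate"] = pd.to_datetime(df["ReleaseDate"], dayfirst=True)

def available(row):
    # SUMIFS(F:F, D:D, "<="&C2, B:B, B2, F:F, ">"&0) equivalent:
    # same Issuer, ReleaseDate on or before this row's TradeDate, Quantity > 0.
    mask = ((df["Issuer"] == row["Issuer"])
            & (df["ReleaseDate"] <= row["TradeDate"])
            & (df["Quantity"] > 0))
    return df.loc[mask, "Quantity"].sum()

df["SumOfAvailableRelease"] = df.apply(available, axis=1)
print(list(df["SumOfAvailableRelease"]))  # [0, 0, 7, 9, 9, 0, 11, 11, 20]
```

Rows with a missing ReleaseDate (NaT) never satisfy the <= comparison, so they are excluded automatically, just as blank cells are in Excel.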

How to perform groupby and mean on categorical columns in Pandas

I'm working on a dataset called gradedata.csv in Python pandas, where I've created a new binned column called 'status': 'Pass' if grade > 70 and 'Fail' if grade <= 70. Here are the first five rows of the dataset:
fname lname gender age exercise hours grade \
0 Marcia Pugh female 17 3 10 82.4
1 Kadeem Morrison male 18 4 4 78.2
2 Nash Powell male 18 5 9 79.3
3 Noelani Wagner female 14 2 7 83.2
4 Noelani Cherry female 18 4 15 87.4
address status
0 9253 Richardson Road, Matawan, NJ 07747 Pass
1 33 Spring Dr., Taunton, MA 02780 Pass
2 41 Hill Avenue, Mentor, OH 44060 Pass
3 8839 Marshall St., Miami, FL 33125 Pass
4 8304 Charles Rd., Lewis Center, OH 43035 Pass
Now, how do I compute the mean hours of exercise of female students with a 'status' of 'Pass'?
I've used the code below, but it isn't working.
print(df.groupby('gender', 'status')['exercise'].mean())
I'm new to pandas. Can anyone help me solve this?
You are very close. Note that your groupby key must be one of: a mapping, a function, a label, or a list of labels. In this case, you want a list of labels. For example:
res = df.groupby(['gender', 'status'])['exercise'].mean()
You can then extract your desired result via pd.Series.get:
query = res.get(('female', 'Pass'))
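The fix can be checked on a toy frame (the data here is invented; column names follow the question):

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["female", "male", "female", "female"],
    "status": ["Pass", "Pass", "Fail", "Pass"],
    "exercise": [3, 4, 2, 5],
})

# Pass a LIST of column labels, not two positional arguments.
res = df.groupby(["gender", "status"])["exercise"].mean()

# Mean exercise for female students who passed: (3 + 5) / 2
mean_female_pass = res.get(("female", "Pass"))
print(mean_female_pass)  # 4.0
```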
