Select using sub-query in Pandas - python

I am new to Pandas and created the following example to illustrate a problem I would like to solve.
Data
Consider the following dataframe:
import pandas as pd

df = pd.DataFrame({'Person': ['Adam', 'Adam', 'Cesar', 'Diana', 'Diana', 'Diana', 'Erika', 'Erika'],
                   'Belonging': ['House', 'Car', 'Car', 'House', 'Car', 'Bike', 'House', 'Car'],
                   'Value': [300, 10, 12, 450, 15, 2, 600, 11],
                   })
Which looks like this:
Person Belonging Value
0 Adam House 300
1 Adam Car 10
2 Cesar Car 12
3 Diana House 450
4 Diana Car 15
5 Diana Bike 2
6 Erika House 600
7 Erika Car 11
Question
How do I find the Value of a Person's Car(s), if they have a House valued at more than 400?
The result I am looking for is this:
Person Belonging Value
4 Diana Car 15
7 Erika Car 11
How can I achieve this in Pandas, and is there something similar to sub-queries?
Sub-query
In SQL there is something called a sub-query. Perhaps there is something similar in Pandas.
SELECT *
FROM df
WHERE person IN
      (SELECT person
       FROM df
       WHERE belonging='House' AND value>400)
  AND belonging='Car';
person belonging value
---------- ---------- ----------
Diana Car 15
Erika Car 11

Here is an approach you can use that is very similar to the SQL statement.
Start by finding the people with houses with value over 400:
persons = df.loc[(df['Belonging'] == 'House') & (df['Value'] > 400), 'Person']
This will return a series with "Diana" and "Erika".
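For reference, persons looks like this (3 and 6 are the index labels of the matching House rows):
3    Diana
6    Erika
Name: Person, dtype: object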
Then find the cars for such people:
df[df['Person'].isin(persons) & (df['Belonging'] == 'Car')]
This will return your expected result.
Using a join is also possible with merge(), which might be more efficient than using isin() for a large dataset:
df_join = df.merge(persons, on='Person')
And then you can filter to find the cars:
df_join[df_join['Belonging'] == 'Car']
This will also return your expected result.
One different approach to this problem is to pivot the data by turning the belongings into columns, so you'd have a single row per person with all their belongings listed.
You can use pivot_table() to get this data into a relatively flat dataframe:
df_pivot = df.pivot_table(values='Value', index='Person', columns='Belonging', fill_value=-1)
At that point, you can find the value of the cars for people with houses worth more than 400 with:
df_pivot.loc[df_pivot['House'] > 400, 'Car']
Note that this last expression returns a Series rather than a DataFrame, since Person has been turned into the index. The pivoted dataframe is really useful if you want to gather more information about a person: having each person in a single row makes it easy to access all the data related to that person.
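For reference, with the sample data above df_pivot comes out as follows, with -1 standing in for belongings a person does not have:
Belonging  Bike  Car  House
Person
Adam         -1   10    300
Cesar        -1   12     -1
Diana         2   15    450
Erika        -1   11    600
and the selection above returns:
Person
Diana    15
Erika    11
Name: Car, dtype: int64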

print(df[df.Person.isin(df.loc[(df.Belonging == 'House') & (df.Value > 400), 'Person']) & (df.Belonging == 'Car')])
Prints:
Person Belonging Value
4 Diana Car 15
7 Erika Car 11

Consider a set-based approach (similar to SQL) with merge and query, retaining your WHERE clauses:
final_df = (
    df.query("Belonging == 'Car'")
      .merge(df.query("Belonging == 'House' & Value > 400"),
             on="Person", suffixes=["_Car", "_House"])
)
#   Person Belonging_Car  Value_Car Belonging_House  Value_House
# 0  Diana           Car         15           House          450
# 1  Erika           Car         11           House          600
Or without the house columns:
final_df = (
    df.query("Belonging == 'Car'")
      .merge(df.query("Belonging == 'House' & Value > 400")
               .reindex(["Person"], axis="columns"),
             on="Person")
)
#   Person Belonging  Value
# 0  Diana       Car     15
# 1  Erika       Car     11

Related

In pandas, how to groupby and apply/transform on each whole group (NOT aggregation)?

I've looked into agg/apply/transform after groupby, but none of them seem to meet my need.
Here is an example DF:
df_seq = pd.DataFrame({
    'person': ['Tom', 'Tom', 'Tom', 'Lucy', 'Lucy', 'Lucy'],
    'day': [1, 2, 3, 1, 4, 6],
    'food': ['beef', 'lamb', 'chicken', 'fish', 'pork', 'venison']
})
person,day,food
Tom,1,beef
Tom,2,lamb
Tom,3,chicken
Lucy,1,fish
Lucy,4,pork
Lucy,6,venison
The day column shows that each person consumes food in sequential order.
Now I would like to group by the person column and create a DataFrame containing the food pairs of neighboring days (as shown below).
Note the day column is only for illustration here, so its values should not be used. It only indicates that the food column is in sequential order. In my real data, it's a datetime column.
person,day,food,food_next
Tom,1,beef,lamb
Tom,2,lamb,chicken
Lucy,1,fish,pork
Lucy,4,pork,venison
At the moment, I can only do this with a for-loop to iterate through all users. It's very slow.
Is it possible to use a groupby and apply/transform to achieve this, or any vectorized operations?
Create the new column with DataFrameGroupBy.shift and then remove rows with missing values in food_next using DataFrame.dropna:
df = (df_seq.assign(food_next=df_seq.groupby('person')['food'].shift(-1))
            .dropna(subset=['food_next']))
print (df)
person day food food_next
0 Tom 1 beef lamb
1 Tom 2 lamb chicken
3 Lucy 1 fish pork
4 Lucy 4 pork venison
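To see what the intermediate step produces, here is the shifted column on its own; the NaN in each person's last row is what dropna then removes:
print(df_seq.groupby('person')['food'].shift(-1))
0       lamb
1    chicken
2        NaN
3       pork
4    venison
5        NaN
Name: food, dtype: object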
This might be a slightly patchy answer, and it doesn't perform an aggregation in the standard sense.
First, a small querying function that, given a name and a day, will return the first result (assuming the data is pre-sorted) that matches the parameters, and failing that, returns some default value:
def get_next_food(df, person, day):
    results = df.query(f"`person` == '{person}' and `day` > {day}")
    if len(results) > 0:
        return results.iloc[0]['food']
    else:
        return "Mystery"
You can use this as follows:
get_next_food(df_seq, "Tom", 1)
> 'lamb'
Now, we can use this in an apply statement, to populate a new column with the results of this function applied row-wise:
df_seq['next_food'] = df_seq.apply(lambda x: get_next_food(df_seq, x['person'], x['day']), axis=1)
>
person day food next_food
0 Tom 1 beef lamb
1 Tom 2 lamb chicken
2 Tom 3 chicken Mystery
3 Lucy 1 fish pork
4 Lucy 4 pork venison
5 Lucy 6 venison Mystery
Give it a try; I'm not convinced you'll see a vast performance improvement, but it'd be interesting to find out.
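One caveat worth adding (my note, not part of the answer above): the iloc[0] lookup relies on each person's rows being sorted by day, so if that isn't guaranteed in your real data, sort first:
df_seq = df_seq.sort_values(['person', 'day']).reset_index(drop=True)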

Update a column value based on a filter - python

I have two datasets, say df1 and df:
df1
df1 = pd.DataFrame({'ids': [101,102,103],'vals': ['apple','java','python']})
ids vals
0 101 apple
1 102 java
2 103 python
df
df = pd.DataFrame({'TEXT_DATA': [u'apple a day keeps doctor away', u'apple tree in my farm', u'python is not new language', u'Learn python programming', u'java is second language']})
TEXT_DATA
0 apple a day keeps doctor away
1 apple tree in my farm
2 python is not new language
3 Learn python programming
4 java is second language
What I want to do is update the column values based on the filtered data and map the matched data to a new column, such that my output is:
TEXT_DATA NEW_COLUMN
0 apple a day keeps doctor away 101
1 apple tree in my farm 101
2 python is not new language 103
3 Learn python programming 103
4 java is second language 102
I tried matching using
df[df['TEXT_DATA'].str.contains("apple")]
Is there any way I can do this?
You could do something like this:
my_words = {'python': 103, 'apple': 101, 'java': 102}
for word in my_words.keys():
    df.loc[df['TEXT_DATA'].str.contains(word, na=False), 'NEW_COLUMN'] = my_words[word]
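With the example data this fills NEW_COLUMN for every row, though the column is created as float64 (101.0, 103.0, ...) because unmatched rows hold NaN until their keyword's pass of the loop. If you need integers, convert afterwards:
df['NEW_COLUMN'] = df['NEW_COLUMN'].astype(int)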
First, you need to extract the values in df1['vals']. Then, create a new column and add the extraction result to the new column. And finally, merge both dataframes.
extr = '|'.join(x for x in df1['vals'])
df['vals'] = df['TEXT_DATA'].str.extract('('+ extr + ')', expand=False)
newdf = pd.merge(df, df1, on='vals', how='left')
To select only the relevant columns from the result:
newdf[['TEXT_DATA','ids']]
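which, with the example data, gives exactly the requested output:
                       TEXT_DATA  ids
0  apple a day keeps doctor away  101
1          apple tree in my farm  101
2     python is not new language  103
3       Learn python programming  103
4        java is second language  102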
You could use a cartesian product of both dataframes and then select the relevant rows and columns.
tmp = df.assign(key=1).merge(df1.assign(key=1), on='key').drop(columns='key')
result = tmp.loc[tmp.apply(func=(lambda x: x.vals in x.TEXT_DATA), axis=1)]\
            .drop(columns='vals').reset_index(drop=True)
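As a side note, on pandas 1.2 or newer the dummy key is no longer needed; the same cartesian product can be written with the built-in cross join:
tmp = df.merge(df1, how='cross')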

How to run a groupby based on result of other/previous groupby?

Let's assume you are selling a product globally and you want to set up a sales office somewhere in a major city. Your decision will be based purely on sales numbers.
This will be your (simplified) sales data:
df = {
    'Product': 'Chair',
    'Country': ['USA', 'USA', 'China', 'China', 'China', 'China', 'India',
                'India', 'India', 'India', 'India', 'India', 'India'],
    'Region': ['USA_West', 'USA_East', 'China_West', 'China_East', 'China_South',
               'China_South', 'India_North', 'India_North', 'India_North',
               'India_West', 'India_West', 'India_East', 'India_South'],
    'City': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M'],
    'Sales': [1000, 1000, 1200, 200, 200, 200, 500, 350, 350, 100, 700, 50, 50]
}
dff = pd.DataFrame.from_dict(df)
dff
Based on the data you should go for City "G".
The logic should go like this:
1) Find country with Max(sales)
2) in that country, find region with Max(sales)
3) in that region, find city with Max(sales)
I tried groupby('Product', 'City').apply(lambda x: x.nlargest(1)), but this doesn't work: it would propose city "C", the city with the highest sales globally, but China is not the country with the highest sales.
I probably have to go through several loops of groupby. Based on the result, filter the original dataframe and do a groupby again on the next level.
To add to the complexity, you sell other products too (not just 'Chairs', but also other furniture). You would have to store the results of each iteration (like country with Max(sales) per product) somewhere and then use it in the next iteration of the groupby.
Do you have any ideas, how I could implement this in pandas/python?
The idea is to aggregate the sum at each level and take the top value with Series.idxmax, which is then used to filter the next level by boolean indexing:
max_country = dff.groupby('Country')['Sales'].sum().idxmax()
max_region = dff[dff['Country'] == max_country].groupby('Region')['Sales'].sum().idxmax()
max_city = dff[dff['Region'] == max_region].groupby('City')['Sales'].sum().idxmax()
print (max_city)
G
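The question also mentions selling other products; a sketch along the same lines (my extension, not part of the answer above) runs the same idxmax chain once per product group:
results = {}
for product, grp in dff.groupby('Product'):
    max_country = grp.groupby('Country')['Sales'].sum().idxmax()
    in_country = grp[grp['Country'] == max_country]
    max_region = in_country.groupby('Region')['Sales'].sum().idxmax()
    in_region = in_country[in_country['Region'] == max_region]
    max_city = in_region.groupby('City')['Sales'].sum().idxmax()
    results[product] = max_city
print(results)
# {'Chair': 'G'}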
One way is to add groupwise totals, then sort your dataframe. This goes beyond your requirement by ordering all your data using your preference logic:
df = pd.DataFrame.from_dict(df)
factors = ['Country', 'Region', 'City']
for factor in factors:
    df[f'{factor}_Total'] = df.groupby(factor)['Sales'].transform('sum')
res = df.sort_values([f'{x}_Total' for x in factors], ascending=False)
print(res.head(5))
City Country Product Region Sales Country_Total Region_Total \
6 G India Chair India_North 500 2100 1200
7 H India Chair India_North 350 2100 1200
8 I India Chair India_North 350 2100 1200
10 K India Chair India_West 700 2100 800
9 J India Chair India_West 100 2100 800
City_Total
6 500
7 350
8 350
10 700
9 100
So for the most desirable you can use res.iloc[0], for the second res.iloc[1], etc.

Python data science: find movies with the highest female ratings

I am working on an assignment for my Data Science class. I just need help getting started, as I'm having trouble understanding how to use pandas to group and select DISTINCT values.
I need to find the movies with the HIGHEST RATINGS by FEMALES. My code returns movies with rating = 5 and gender = 'F', but it repeats the same movie over and over, since there is more than one user. I'm not sure how to show just the movie, the count of 5-star ratings, and gender = F. Below is my code:
import pandas as pd
import os
m = pd.read_csv('movies.csv')
u = pd.read_csv('users.csv')
r = pd.read_csv('ratings.csv')
ur = pd.merge(u,r)
data = pd.merge(m,ur)
df = pd.DataFrame(data)
top10 = df.loc[(df.gender == 'F')&(df.rating == 5)]
print(top10)
the data files can be downloaded here
I just need some help getting started; there's a lot more to the homework, but once I figure this out I can do the rest. Just need a jump-start. Thank you very much.
mv_id title genres rating user_id gender
1 Toy Story (1995) Animation|Children's|Comedy 5 1 F
2 Jumanji (1995) Adventure|Children's|Fantasy 5 2 F
3 Grumpier Old Men (1995) Comedy|Romance 5 3 F
4 Waiting to Exhale (1995) Comedy|Drama 5 4 F
5 Father of the Bride Part II (1995) Comedy 5 5 F
I would try to do the filtering operation on as little data as possible. To select 5-star ratings of female users, there's no need for the movie metadata (movies.csv). It can be done on the ur data, which is easier than on the df.
# filter the data in `ur`
f_5s_ratings = ur.loc[(ur.gender == 'F')&(ur.rating == 5)]
# count rows per `movie_id`
abs_num_f_5s_ratings = f_5s_ratings.groupby("movie_id").size()
In abs_num_f_5s_ratings you now have a Series counting the total number of 5-star ratings by female users per movie_id:
movie_id
1 253
2 15
3 14
...
If you join that data on the key movie_id with m as a new column (I'll leave it as an exercise to you), you can then sort by this value to get your top 10 movies with absolute number of 5-star ratings by females.
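A hedged sketch of that exercise, assuming the id column is also called movie_id in m (the sample output above shows mv_id, so adjust the key to your actual files):
f_5s_counts = abs_num_f_5s_ratings.reset_index(name='f_5s_count')
top10 = (m.merge(f_5s_counts, on='movie_id')
          .sort_values('f_5s_count', ascending=False)
          .head(10))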

Creating a dictionary of categoricals in SQL and aggregating them in Python

I have a rather "cross-platform" question. I hope it is not too general.
One of my tables, say customers, consists of my customer id's and their associated demographic information. Another table, say transaction, contains all purchases from the customers in the respective shops.
I am interested in analyzing basket compositions together with demographics in python. Hence, I would like a dataframe with the shops as columns and the summed amounts for each customer at those shops.
For clarity,
select *
from customer
where id=1 or id=2
gives me
id age gender
1 35 MALE
2 57 FEMALE
and
select *
from transaction
where id=1 or id=2
gives me
customer_id shop amount
1 2 250
1 2 500
2 3 100
2 7 200
2 11 125
Which should end up in a (preferably) Pandas dataframe as
id age gender shop_2 shop_3 shop_7 shop_11
1 35 MALE 750 0 0 0
2 57 FEMALE 0 100 200 125
Such that the last columns are the aggregated baskets of the customers.
I have tried to create a python dictionary of the purchases and amounts for each customer in SQL in the following way:
select customer_id, array_agg(concat(cast(shop as varchar), ' : ', cast(amount as varchar))) as basket
from transaction
group by customer_id
Resulting in
id basket
1 ['2 : 250', '2 : 500']
2 ['3 : 100', '7 : 200', '11 : 125']
which could easily be joined on the customer table.
However, this solution is not optimal, since the values inside the [] are strings rather than integers. Hence, it takes a lot of manipulation and looping in python to get it into the format I want.
Is there any way where I can aggregate the purchases in SQL making it easier for python to read and aggregate into columns?
One simple solution would be to do the aggregation in pandas using pivot_table on the second dataframe and then merge with the first:
df2 = df2.pivot_table(columns='shop', values='amount', index='customer_id',
                      aggfunc='sum', fill_value=0).reset_index()
df = pd.merge(df1, df2, left_on='id', right_on='customer_id')
Resulting dataframe:
id age gender 2 3 7 11
1 35 MALE 750 0 0 0
2 57 FEMALE 0 100 200 125
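To get the shop_2, shop_3, ... column names from the question rather than bare shop numbers, the pivoted columns can be prefixed before the reset_index (a small addition to the snippet above):
df2 = (df2.pivot_table(columns='shop', values='amount', index='customer_id',
                       aggfunc='sum', fill_value=0)
          .add_prefix('shop_')
          .reset_index())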
