How to enrich a dataframe by adding columns based on a specific condition - python

I have two different datasets:
users:
+-------+---------+--------+
|user_id| movie_id|timestep|
+-------+---------+--------+
| 100 | 1000 |20200728|
| 101 | 1001 |20200727|
| 101 | 1002 |20200726|
+-------+---------+--------+
movies:
+--------+---------+--------------------------+
|movie_id| title | genre |
+--------+---------+--------------------------+
| 1000 |Toy Story|Adventure|Animation|Chil..|
| 1001 | Jumanji |Adventure|Children|Fantasy|
| 1002 | Iron Man|Action|Adventure|Sci-Fi |
+--------+---------+--------------------------+
How can I get a dataset in the following format? I want to build each user's taste profile so I can compare different users by a similarity score.
+-------+------+---------+---------+--------+-----+
|user_id|Action|Adventure|Animation|Children|Drama|
+-------+------+---------+---------+--------+-----+
|  100  |  0   |    1    |    1    |    1   |  0  |
|  101  |  1   |    1    |    0    |    1   |  0  |
+-------+------+---------+---------+--------+-----+

Where df is the movies dataframe and dfu is the users dataframe:
Split the 'genre' column into lists with pandas.Series.str.split, then use pandas.DataFrame.explode to turn each list element into its own row, replicating the index values.
pandas.merge the two dataframes on 'movie_id'.
Use pandas.DataFrame.groupby on 'user_id' and 'genre' and aggregate with count.
Shape the final dataframe:
.unstack converts the grouped dataframe from long to wide format
.fillna replaces NaN with 0
.astype changes the numeric values from float to int
Tested in python 3.10, pandas 1.4.3
import pandas as pd

# data
movies = {'movie_id': [1000, 1001, 1002],
          'title': ['Toy Story', 'Jumanji', 'Iron Man'],
          'genre': ['Adventure|Animation|Children', 'Adventure|Children|Fantasy', 'Action|Adventure|Sci-Fi']}
users = {'user_id': [100, 101, 101],
         'movie_id': [1000, 1001, 1002],
         'timestep': [20200728, 20200727, 20200726]}

# set up dataframes
df = pd.DataFrame(movies)
dfu = pd.DataFrame(users)

# split the genre column strings at '|' to make lists
df.genre = df.genre.str.split('|')

# explode the lists in genre
df = df.explode('genre', ignore_index=True)

# merge df with dfu
dfm = pd.merge(dfu, df, on='movie_id')

# groupby, count and unstack
final = dfm.groupby(['user_id', 'genre'])['genre'].count().unstack(level=1).fillna(0).astype(int)
# display(final)
genre Action Adventure Animation Children Fantasy Sci-Fi
user_id
100 0 1 1 1 0 0
101 1 2 0 1 1 1
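To compare users by a similarity score, one option (not part of the original answer, just a sketch) is cosine similarity over the rows of final; scipy is assumed to be available:

from scipy.spatial.distance import pdist, squareform

# pairwise cosine similarity between the genre-count profiles in `final`
sim = 1 - squareform(pdist(final.values, metric='cosine'))
sim_df = pd.DataFrame(sim, index=final.index, columns=final.index)
# sim_df.loc[100, 101] is the similarity between user 100 and user 101

sklearn.metrics.pairwise.cosine_similarity(final) would give the same matrix.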

Related

How to assign a new column to a dataframe based on comparison between other columns?

In a one-sheet Excel file that I created through SQL, I have 3 columns that represent letter ratings. The rating values may differ between Rating_1, Rating_2, and Rating_3, but they can still rank to the same value.
I am trying to create a new column in my Excel file that takes these 3 letter ratings and pulls the middle rating.
ranking (1 lowest) | Rating_1 | Rating_2 | Rating_3 | NEW_COLUMN
------------------ | -------- | -------- | -------- | --------------
3                  | A+       | AA       | Aa       | middle(rating)
2                  | B+       | BB       | Bb       | middle(rating)
1                  | Fa       | Fb       | Fc       | middle(rating)
There are three scenarios I need to account for:
if all three ratings differ, pick the rating that is neither the highest nor the lowest of Rating_1, Rating_2, and Rating_3
if all three ratings are the same, pick the rating from Rating_1
if 2 of the ratings are the same but one is different, pick the minimum rating
I created a dataframe :
df = pd.DataFrame(
    {"Rating_1": ["A+", "B+", "Fa"],
     "Rating_2": ["AA", "BB", "Fb"],
     "Rating_3": ["Aa", "Bb", "Fc"]}
)
df["NEW COLUMN"] = {insert logic here}
Or is it easier to create a new DF that filters down from the original DF?
With the following toy dataframe:
import pandas as pd
df = pd.DataFrame(
    {
        "Rating_1": ["A+", "Cc", "Aa"],
        "Rating_2": ["AA", "Cc", "Aa"],
        "Rating_3": ["BB", "Cc", "Bb"],
    }
)
print(df)
# Output
Rating_1 Rating_2 Rating_3
0 A+ AA BB
1 Cc Cc Cc
2 Aa Aa Bb
Here is one way to do it using Python sets to check conditions:
# First condition: all three ratings differ -> take the middle one
df["Middle_rating"] = df.apply(
    lambda x: sorted([x["Rating_1"], x["Rating_2"], x["Rating_3"]])[1]
    if len(set([x["Rating_1"], x["Rating_2"], x["Rating_3"]])) == 3
    else "",
    axis=1,
)
# Second condition: all three ratings are the same -> take Rating_1
df["Middle_rating"] = df.apply(
    lambda x: x["Rating_1"]
    if len(set([x["Rating_1"], x["Rating_2"], x["Rating_3"]])) == 1
    else x["Middle_rating"],
    axis=1,
)
# Third condition: exactly two ratings are equal -> take the minimum (worst) rating
ratings = {
    rating: i
    for i, rating in enumerate(["A+", "AA", "Aa", "B+", "BB", "Bb", "C+", "CC", "Cc"])
}  # ratings ordered from best (A+: 0) to worst (Cc: 8)
df["Middle_rating"] = df.apply(
    lambda x: max(x["Rating_1"], x["Rating_2"], x["Rating_3"], key=ratings.get)
    if len(
        set([ratings[x["Rating_1"]], ratings[x["Rating_2"]], ratings[x["Rating_3"]]])
    )
    == 2
    else x["Middle_rating"],
    axis=1,
)
Then:
print(df)
# Output
Rating_1 Rating_2 Rating_3 Middle_rating
0 A+ AA BB AA
1 Cc Cc Cc Cc
2 Aa Aa Bb Bb
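If you prefer a single pass, the three conditions can also be folded into one helper applied row-wise. This is just a sketch, not part of the original answer, and it assumes every rating appears in the ratings ordering defined above:

def middle_rating(row, order=ratings):
    vals = [row["Rating_1"], row["Rating_2"], row["Rating_3"]]
    uniq = set(vals)
    if len(uniq) == 3:                   # all differ -> middle by rating order
        return sorted(vals, key=order.get)[1]
    if len(uniq) == 1:                   # all the same -> Rating_1
        return row["Rating_1"]
    return max(vals, key=order.get)      # two equal, one different -> worst rating

df["Middle_rating"] = df.apply(middle_rating, axis=1)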

How to split a column of dictionaries into two pandas columns of different types?

I have a dataframe with 2 columns (plus index) like this, it has around 14,000 lines.
Employee | RecordID
{'Id': 185, 'Title': 'Full Name'} | 9
I'd like to split the columns like this:
Id | Title | RecordID
185 | 'Full Name' | 9
I tried to use this solution:
df2 = pd.DataFrame(data_df["Employee"].values.tolist(), index=data_df.index)  # <- error
data_df = pd.concat([data_df, df2], axis=1).drop(column, axis=1)
but it gives this error on the df2 line
*** AttributeError: 'float' object has no attribute 'keys'
I have 2 theories: one, it's because I have different value types in the Employee dictionary; and two, there are 3 records that have an empty Employee, like this:
Employee | RecordID
nan | 7051
I need to keep those 3 records without an employee record and show their record Id, and in the final data_df show empty columns for employee id and employee name.
So in summary:
INPUT
Employee | RecordID
{'Id': 185, 'Title': 'Full Name'} | 9
nan | 7051
EXPECTED OUTPUT
Id | Title | RecordID
185 | 'Full Name' | 9
nan | nan | 7051
I made it work using data_df["Employee"].apply(pd.Series) but it's painfully slow.
Is there a way, without using pd.Series, to split a column of dictionaries (where the dictionaries have different value types and there are NaN values) into separate columns of the parent pandas dataframe?
Thanks,
You can do
data_df1 = data_df.dropna()
df2 = pd.DataFrame(data_df1["Employee"].values.tolist(), index=data_df1.index)
data_df = data_df.join(df2, how='left')
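A quick check with a toy frame matching the question (the column values are just assumptions):

import numpy as np
import pandas as pd

data_df = pd.DataFrame({
    "Employee": [{'Id': 185, 'Title': 'Full Name'}, np.nan],
    "RecordID": [9, 7051],
})

data_df1 = data_df.dropna()
df2 = pd.DataFrame(data_df1["Employee"].values.tolist(), index=data_df1.index)
out = data_df.join(df2, how='left').drop(columns="Employee")
print(out)
#    RecordID     Id      Title
# 0         9  185.0  Full Name
# 1      7051    NaN        NaN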

find rows that share values

I have a pandas dataframe that look like this:
df = pd.DataFrame({'name': ['bob', 'tim', 'jane', 'john', 'andy'],
                   'favefood': [['kfc', 'mcd', 'wendys'], ['mcd'], ['mcd', 'popeyes'], ['wendys', 'kfc'], ['tacobell', 'innout']]})
-------------------------------
name | favefood
-------------------------------
bob | ['kfc', 'mcd', 'wendys']
tim | ['mcd']
jane | ['mcd', 'popeyes']
john | ['wendys', 'kfc']
andy | ['tacobell', 'innout']
For each person, I want to find out how many favefood's of other people overlap with their own.
I.e., for each person I want to find out how many other people have a non-empty intersection with them.
The resulting dataframe would look like this:
------------------------------
name | overlap
------------------------------
bob | 3
tim | 2
jane | 2
john | 1
andy | 0
The problem is that I have about 2 million rows of data. The only way I can think of doing this would be through a nested for-loop - i.e. for each person, go through the entire dataframe to see what overlaps (this would be extremely inefficient). Would there be anyway to do this more efficiently using pandas notation? Thanks!
Logic behind it
s = df['favefood'].explode().str.get_dummies().groupby(level=0).sum()
s.dot(s.T).ne(0).sum(axis=1)-1
Out[84]:
0 3
1 2
2 2
3 1
4 0
dtype: int64
df['overlap'] = s.dot(s.T).ne(0).sum(axis=1) - 1
Method from sklearn
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
s = pd.DataFrame(mlb.fit_transform(df['favefood']), columns=mlb.classes_, index=df.index)
s.dot(s.T).ne(0).sum(axis=1)-1
0 3
1 2
2 2
3 1
4 0
dtype: int64
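With ~2 million rows, the dense name-by-name dot product above will not fit in memory. A sparse variant is sketched below (my own addition, not part of the original answer; it is still quadratic in the worst case, but only non-zero overlaps are stored):

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(sparse_output=True)
m = mlb.fit_transform(df['favefood'])   # sparse people x foods indicator matrix
co = m @ m.T                            # sparse co-occurrence counts between people
# number of stored (non-zero) overlaps per row, minus the person themselves
df['overlap'] = co.getnnz(axis=1) - 1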

How to create a new dataframe based on value_counts of a column in another dataframe but with certain conditions on other columns?

I have a pandas data-frame of tickets raised on a group of servers like this:
a b c Users Problem
0 data data data User A Server Down
1 data data data User B Server Down
2 date data data User C Memory Full
3 date data data User C Swap Full
4 date data data User D Unclassified
5 date data data User E Unclassified
6 data data data User B RAM Failure
I need to create another dataframe, like the one below, with the data grouped by ticket type and the count of tickets raised by users A and B separately, plus a single column with the count for all other users.
Expected new Dataframe:
+---------------+--------+--------+-------------+
| Type Of Error | User A | User B | Other Users |
+---------------+--------+--------+-------------+
| Server Down | 50 | 60 | 150 |
+---------------+--------+--------+-------------+
| Memory Full | 40 | 50 | 20 |
+---------------+--------+--------+-------------+
| Swap Full | 10 | 20 | 15 |
+---------------+--------+--------+-------------+
| Unclassified | 10 | 20 | 50 |
+---------------+--------+--------+-------------+
I've tried .value_counts(), but it gives the total count per ticket type. I need it broken down by user.
If the user is not User A or User B, change it to Other Users with Series.where, then use crosstab:
df['Users'] = df['Users'].where(df['Users'].isin(['User A','User B']), 'Other Users')
df = pd.crosstab(df['Problem'], df['Users'])[['User A','User B','Other Users']]
print(df)
Users User A User B Other Users
Problem
Memory Full 0 0 1
RAM Failure 0 1 0
Server Down 1 1 0
Swap Full 0 0 1
Unclassified 0 0 2
You could use pivot_table which is great at using aggregate functions:
users = df.Users.copy()
users[~users.isin(['User A', 'User B'])] = 'Other Users'
df.pivot_table(index='Problem', columns=users, aggfunc='count', values='a',
               fill_value=0).reindex(['User A', 'User B', 'Other Users'], axis=1)
It gives:
Users User A User B Other Users
Problem
Memory Full 0 0 1
RAM Failure 0 1 0
Server Down 1 1 0
Swap Full 0 0 1
Unclassified 0 0 2
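Both answers can be reproduced with a toy frame like the one below (the a/b/c values are placeholders, an assumption on my part). Note that the first answer overwrites df['Users'] in place, so rebuild the frame before trying the second one:

import pandas as pd

df = pd.DataFrame({
    'a': ['data'] * 7, 'b': ['data'] * 7, 'c': ['data'] * 7,
    'Users': ['User A', 'User B', 'User C', 'User C', 'User D', 'User E', 'User B'],
    'Problem': ['Server Down', 'Server Down', 'Memory Full', 'Swap Full',
                'Unclassified', 'Unclassified', 'RAM Failure'],
})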

Searching through a database for partial and full integer matches

I'm trying to search through a dataframe with a column that can have one or more integer values, to match one or more given integers.
The integers in the database have a '-' in between. For example:
--------------------------------------------------
| Customer 1 |1124 |
--------------------------------------------------
| Customer 2 |1124-1123 |
--------------------------------------------------
| Customer 3 |1124-1234-1642 |
--------------------------------------------------
| Customer 3 |1213-1234-1642 |
--------------------------------------------------
The objective here is to do a partial and full match, and to find out how many integers didn't match.
So, for example, let's say I want to find all customers with 1124; the output would look like this (going off the example I provided):
--------------------------------------------------
| Customer 1 |1124 |None
--------------------------------------------------
| Customer 2 |1124-1123 |1
--------------------------------------------------
| Customer 3 |1124-1234-1642 |2
--------------------------------------------------
Thanks ahead of time!
Use set
define x as the test set
make s a series of sets
s - x creates a series of differences
(s - x).str.len() are the sizes of the differences
s & x creates a series of per-row intersections with x; checking its .str.len() > 0 gives a boolean mask indicating whether there is an intersection, or in this case, whether x is in s
x = {'1124'}
s = df['col2'].str.split('-').apply(set)
df.assign(col3=(s - x).str.len())[(s & x).str.len() > 0]
col1 col2 col3
0 Customer 1 1124 0
1 Customer 2 1124-1123 1
2 Customer 3 1124-1234-1642 2
Setup
df = pd.DataFrame({
    'col1': ['Customer 1', 'Customer 2', 'Customer 3', 'Customer 3'],
    'col2': ['1124', '1124-1123', '1124-1234-1642', '1213-1234-1642']
})
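To search for more than one integer at once, the same pattern should work with a larger test set (a sketch; the values are kept as strings):

x = {'1124', '1123'}
s = df['col2'].str.split('-').apply(set)
df.assign(col3=(s - x).str.len())[(s & x).str.len() > 0]
#          col1            col2  col3
# 0  Customer 1            1124     0
# 1  Customer 2       1124-1123     0
# 2  Customer 3  1124-1234-1642     2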
