Set operations on cell level - python

Let's say I have a DataFrame with values in one column being a set:
df = pd.DataFrame([{'song': 'Despacito', 'genres': {'pop','rock'}},
{'song': 'We will rock you', 'genres': {'rock'}},
{'song': 'Fur Eliza', 'genres': {'piano'}}])
print(df)
song genres
0 Despacito {rock, pop}
1 We will rock you {rock}
2 Fur Eliza {piano}
How do I select rows with genres overlapping with expected genres? For instance, I would expect
df[~df['genres'].intersection({'rock', 'metal'})]
to return first two songs:
song genres
0 Despacito {rock, pop}
1 We will rock you {rock}
Obviously, this will fail, because Series does not have an intersection() method, but you get the idea.
Is there a way to implement this with a pandas DataFrame, or is a DataFrame not the right structure for my goal?

Use isdisjoint method with Series.map:
df = df[~df['genres'].map({'rock', 'metal'}.isdisjoint)]
print (df)
song genres
0 Despacito {rock, pop}
1 We will rock you {rock}
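For reference, here is a minimal self-contained sketch of the same filter, assuming every cell in genres really holds a Python set. The names wanted, overlapping and overlapping_alt are only illustrative. It also shows an equivalent spelling with the & set operator, which may read closer to the intersection() idea from the question.

```python
import pandas as pd

df = pd.DataFrame([
    {'song': 'Despacito', 'genres': {'pop', 'rock'}},
    {'song': 'We will rock you', 'genres': {'rock'}},
    {'song': 'Fur Eliza', 'genres': {'piano'}},
])

wanted = {'rock', 'metal'}  # illustrative name for the expected genres

# isdisjoint is True when two sets share no element, so negate it to keep overlaps
overlapping = df[~df['genres'].map(wanted.isdisjoint)]

# Same rows: a non-empty intersection is truthy
overlapping_alt = df[df['genres'].map(lambda genres: bool(genres & wanted))]

print(overlapping['song'].tolist())
```

Both versions call a Python function once per row, so neither is vectorized. On large frames that per-row cost is worth keeping in mind.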

Related

Long to wide format using a dictionary

I would like to make a long to wide transformation of my dataframe, starting from
match_id player goals home
1 John 1 home
1 Jim 3 home
...
2 John 0 away
2 Jim 2 away
...
ending up with:
match_id player_1 player_2 player_1_goals player_2_goals player_1_home player_2_home ...
1 John Jim 1 3 home home
2 John Jim 0 2 away away
...
Since I'm going to have columns with new names, I thought that maybe I should try to build a dictionary for that, where the outer key is the match id, like so:
dict = {1: {
'player_1': 'John',
'player_1_goals':1,
'player_1_home': 'home',
'player_2': 'Jim',
'player_2_goals':3,
'player_2_home': 'home'
},
2: {
'player_1': 'John',
'player_1_goals':0,
'player_1_home': 'away',
'player_2': 'Jim',
'player_2_goals':2,
'player_2_home': 'away'
},
}
and then:
pd.DataFrame.from_dict(dict).T
In the real case scenario, however, the number of players will vary and I can't hardcode it.
Is this the best way of doing this using dictionaries? If so, how could I build this dict and populate it from my original pandas dataframe?
It looks like you want to pivot the dataframe. The problem is that there is no column in your dataframe that "enumerates" the players for you. If you add such a column via the assign() method, then pivot() becomes easy.
Up to that point, this is an ordinary long-to-wide pivot. The only difference is that you need the column names in a specific format, with the string "player" prepended to each column name. The set_axis() call below does that.
(df
 .assign(
     ind=df.groupby('match_id').cumcount().add(1).astype(str)
 )
 # pivot() arguments are keyword-only in pandas 2.0 and later
 .pivot(index='match_id', columns='ind', values=['player', 'goals', 'home'])
 .pipe(lambda x: x.set_axis([
     '_'.join([c, i]) if c == 'player' else '_'.join(['player', i, c])
     for (c, i) in x
 ], axis=1))
 .reset_index()
)
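To try the approach end to end, here is a small runnable sketch. The sample frame is typed in from the table in the question, and the names numbered and wide are only illustrative. It flattens the column names with a list comprehension instead of set_axis(), which is just a different spelling of the same renaming step.

```python
import pandas as pd

df = pd.DataFrame({
    'match_id': [1, 1, 2, 2],
    'player': ['John', 'Jim', 'John', 'Jim'],
    'goals': [1, 3, 0, 2],
    'home': ['home', 'home', 'away', 'away'],
})

# Number the players within each match: '1', '2', ...
numbered = df.assign(ind=df.groupby('match_id').cumcount().add(1).astype(str))

# One row per match, one (field, player number) column pair per value
wide = numbered.pivot(index='match_id', columns='ind',
                      values=['player', 'goals', 'home'])

# Flatten the two-level column labels into names like player_1_goals
wide.columns = [
    f'player_{i}' if field == 'player' else f'player_{i}_{field}'
    for field, i in wide.columns
]
wide = wide.reset_index()
print(wide)
```

Because the player number comes from cumcount(), nothing is hardcoded: a match with more players simply yields more columns, and matches with fewer players get NaN in the unused ones.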

How to transform a column of string lists into one row per string and split the dataframe by string?

I have a dataframe with a column containing list of strings.
id sentence category
0 "I love basketball and dunk to the basket" ['basketball']
1 "I am playing football and basketball tomorrow " ['football', 'basketball']
I would like to do 2 things:
Transform the category column so that every element of the list becomes a string on its own row, with the same id and sentence
Have one dataframe per category
Expected output for step 1):
id sentence category
0 "I love basketball and dunk to the basket" 'basketball'
1 "I am playing football and tomorrow basketball" 'football'
1 "I am playing football and tomorrow basketball" 'basketball'
Expected output for step 2):
DF_1
id sentence category
0 "I love basketball and dunk to the basket" 'basketball'
1 "I am playing football and tomorrow basketball" 'basketball'
DF_2
id sentence category
1 "I am playing football and tomorrow basketball" 'football'
How can I do this? Looping over each row and checking the length of each list would work, but is there a faster, more elegant way?
You could explode "category"; then groupby:
out = [g for _, g in df.explode('category').groupby('category')]
Then if you print the items in out:
for i in out:
    print(i, end='\n\n')
you'll see:
id sentence category
0 0 I love basketball and dunk to the basket basketball
1 1 I am playing football and basketball tomorrow basketball
id sentence category
1 1 I am playing football and basketball tomorrow football
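If you would rather look up each sub-dataframe by its category than by its position in a list, a dictionary comprehension over the same groupby works too. This is a minimal sketch, assuming the category column already holds real Python lists. The names exploded and by_category are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1],
    'sentence': ['I love basketball and dunk to the basket',
                 'I am playing football and basketball tomorrow'],
    'category': [['basketball'], ['football', 'basketball']],
})

# One row per list element; id and sentence are repeated
exploded = df.explode('category')

# Map each category name to its own sub-dataframe
by_category = {name: group for name, group in exploded.groupby('category')}

print(by_category['football'])
```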
You'll need two tools: explode and groupby.
First, let's prepare the data and use literal_eval to turn the category strings into real lists, so that explode will work:
import pandas as pd
from io import StringIO
from ast import literal_eval
csvfile = StringIO(
"""id\tsentence\tcategory
0\t"I love basketball and dunk to the basket"\t["basketball"]
1\t"I am playing football and basketball tomorrow "\t["football", "basketball"]""")
df = pd.read_csv(csvfile, sep = '\t', engine='python')
df.loc[:, 'category'] = df.loc[:, 'category'].apply(literal_eval)
Then explode on the category column:
df = df.explode('category')
Finally, you can treat the groupby object like a dictionary, or store the sub-dataframes in a list:
dg = df.groupby('category')
list_dg = []
for n, g in dg:
    list_dg.append(g)
In my opinion, it is better to stick with dg if possible.
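As a rough illustration of sticking with the groupby object, get_group() pulls out one category on demand and size() gives the row count per category. The frame below is an already exploded version of the sample data, typed in by hand for the sketch.

```python
import pandas as pd

# Hand-typed, already exploded sample data
df = pd.DataFrame({
    'id': [0, 1, 1],
    'sentence': ['I love basketball and dunk to the basket',
                 'I am playing football and basketball tomorrow',
                 'I am playing football and basketball tomorrow'],
    'category': ['basketball', 'football', 'basketball'],
})

dg = df.groupby('category')

football = dg.get_group('football')  # sub-dataframe for one category
sizes = dg.size()                    # number of rows per category

print(football)
print(sizes)
```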

Find the most frequent string value in a whole dataframe / Pandas

I have a dataframe with 4 columns each containing actor names.
The actors appear in several columns, and I want to find the actor or actress who appears most often in the whole dataframe.
I used mode, but it doesn't work: it gives me the most frequent actor in each column.
I would strongly advise you to use the Counter class in python. Thereby, you can simply add whole rows and columns into the object. The code would look like this:
import pandas as pd
from collections import Counter
# Artificially creating DataFrame
actors = [
["Will Smith","Johnny Depp","Johnny Depp","Johnny Depp"],
["Will Smith","Morgan Freeman","Morgan Freeman","Morgan Freeman"],
["Will Smith","Mila Kunis","Mila Kunis","Mila Kunis"],
["Will Smith","Charlie Sheen","Charlie Sheen","Charlie Sheen"],
]
df = pd.DataFrame(actors)
# Creating counter
counter = Counter()
# inserting the whole row into the counter
for _, row in df.iterrows():
    counter.update(row)
print("counter object:")
print(counter)
# We show the two most common actors
for actor, occurrences in counter.most_common(2):
    print("Actor {} occurred {} times".format(actor, occurrences))
The output would look like this (actors with equal counts may appear in a different order):
counter object:
Counter({'Will Smith': 4, 'Morgan Freeman': 3, 'Johnny Depp': 3, 'Mila Kunis': 3, 'Charlie Sheen': 3})
Actor Will Smith occurred 4 times
Actor Morgan Freeman occurred 3 times
The Counter object solves your problem quite fast, but be aware that counter.update expects an iterable of items, such as a list or a row. Do not update it with a bare string: a string is itself an iterable of characters, so the counter would count the single characters.
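A quick sketch of that pitfall, using two throwaway counters (the names chars and names are only illustrative):

```python
from collections import Counter

# Updating with a bare string iterates over its characters
chars = Counter()
chars.update("Will")
print(chars['l'])  # 2, because "Will" contains the letter l twice

# Wrapping the name in a list counts the whole string once
names = Counter()
names.update(["Will Smith"])
print(names["Will Smith"])  # 1
```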
Use stack and value_counts to get the entire list of actors/actresses:
df.stack().value_counts()
Using @Ofi91's setup:
# Artificially creating DataFrame
actors = [
["Will Smith","Johnny Depp","Johnny Depp","Johnny Depp"],
["Will Smith","Morgan Freeman","Morgan Freeman","Morgan Freeman"],
["Will Smith","Mila Kunis","Mila Kunis","Mila Kunis"],
["Will Smith","Charlie Sheen","Charlie Sheen","Charlie Sheen"],
]
df = pd.DataFrame(actors)
df.stack().value_counts()
Output:
Will Smith 4
Morgan Freeman 3
Johnny Depp 3
Charlie Sheen 3
Mila Kunis 3
dtype: int64
To find the actor with the most appearances:
df.stack().value_counts().idxmax()
Output:
'Will Smith'
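One caveat: if several actors tie for first place, idxmax() reports only one of them. As a hedged alternative, calling mode() on the stacked Series returns every value that shares the highest count. The sketch below reuses the same sample data. The name top is only illustrative.

```python
import pandas as pd

actors = [
    ["Will Smith", "Johnny Depp", "Johnny Depp", "Johnny Depp"],
    ["Will Smith", "Morgan Freeman", "Morgan Freeman", "Morgan Freeman"],
    ["Will Smith", "Mila Kunis", "Mila Kunis", "Mila Kunis"],
    ["Will Smith", "Charlie Sheen", "Charlie Sheen", "Charlie Sheen"],
]
df = pd.DataFrame(actors)

# stack() flattens all columns into one Series; mode() keeps all tied winners
top = df.stack().mode()
print(top.tolist())
```

This is also why mode did not work in the question: called on the DataFrame itself, it works column by column, whereas after stack() it sees every cell at once.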
Assume your data frame holds one actor name per cell.
First, stack all columns into one column named actors:
df1 = df.stack().reset_index(drop=True).to_frame('actors')
Now take the value_counts of the actors column:
df2 = df1['actors'].value_counts()
Here you go: the resulting Series has the actor names as its index and the number of occurrences as its values, sorted from most to least frequent.
Happy Analysis!!!

Conditional Filling in Missing Values in a Pandas Data frame using non-conventional means

TLDR; How can I improve my code and make it more pythonic?
Hi,
One of the interesting challenge(s) we were given in a tutorial was the following:
"There are X missing entries in the data frame with an associated code but a 'blank' entry next to the code. This is a random occurrence across the data frame. Using your knowledge of pandas, map each missing 'blank' entry to the associated code."
So this looks like the following:
|code| |name|
001 Australia
002 London
...
001 <blank>
My approach I have used is as follows:
Loop through entire dataframe and identify areas with blanks "". Replace all blanks via copying the associated correct code (ordered) to the dataframe.
code_names = [ "",
'Economic management',
'Public sector governance',
'Rule of law',
'Financial and private sector development',
'Trade and integration',
'Social protection and risk management',
'Social dev/gender/inclusion',
'Human development',
'Urban development',
'Rural development',
'Environment and natural resources management'
]
df_copy = df_.copy()
# Looks through each code name, and if it is empty, stores the proper name in its place
for x in range(len(df_copy.mjtheme_namecode)):
    for y in range(len(df_copy.mjtheme_namecode[x])):
        if(df_copy.mjtheme_namecode[x][y]['name'] == ""):
            df_copy.mjtheme_namecode[x][y]['name'] = code_names[int(df_copy.mjtheme_namecode[x][y]['code'])]
limit = 25
counter = 0
for x in range(len(df_copy.mjtheme_namecode)):
    for y in range(len(df_copy.mjtheme_namecode[x])):
        print(df_copy.mjtheme_namecode[x][y])
        counter += 1
    if(counter >= limit):
        break
While the above approach works - is there a better, more pythonic way of achieving what I'm after? I feel the approach I have used is very clunky due to my skills not being very well developed.
Thank you!
Method 1:
One way to do this would be to replace all your "" blanks with NaN, sort the dataframe by code and name, and forward-fill with ffill():
Starting with this:
>>> df
code name
0 1 Australia
1 2 London
2 1
You can apply the following:
import numpy as np
# ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas
new_df = (df.replace({'name':{'':np.nan}})
            .sort_values(['code', 'name'])
            .ffill()
            .sort_index())
>>> new_df
code name
0 1 Australia
1 2 London
2 1 Australia
Method 2:
This is more convoluted, but will work as well:
Using groupby, first, and squeeze, you can create a pd.Series that maps the codes to non-blank names, and use .map to apply that series to your code column:
df['name'] = (df['code']
.map(
df.replace({'name':{'':np.nan}})
.sort_values(['code', 'name'])
.groupby('code')
.first()
.squeeze()
))
>>> df
code name
0 1 Australia
1 2 London
2 1 Australia
Explanation: The pd.Series map that this creates looks like this:
code
1 Australia
2 London
And it works because it gets the first instance for every code (via the groupby), sorted in such a manner that the NaNs are last. So as long as each code is associated with a name, this method will work.
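A more compact variant of the same idea, offered as a sketch and not as a drop-in replacement: mask the blanks to NaN, then let groupby(...).transform('first') broadcast the first non-missing name of each code back onto every row. It assumes, as above, that every code has at least one non-blank name. The three-row frame mirrors the small example used in this answer.

```python
import pandas as pd

df = pd.DataFrame({'code': [1, 2, 1],
                   'name': ['Australia', 'London', '']})

# Turn blank strings into missing values
names = df['name'].mask(df['name'] == '')

# 'first' skips missing values, so each row gets its code's first real name
df['name'] = names.groupby(df['code']).transform('first')

print(df)
```

No sorting is needed here, because first already ignores NaN within each group.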

Split a column value to multiple columns pandas /python

I am new to Python/Pandas and have a data frame with two columns, one a series and the other a string.
I am looking to split the contents of the first column (the series) into multiple columns. I would appreciate your input on this.
This is my current dataframe content
Songdetails Density
0 ["'t Hof Van Commerce", "Chance", "SORETGR12AB... 4.445323
1 ["-123min.", "Try", "SOERGVA12A6D4FEC55"] 3.854437
2 ["10_000 Maniacs", "Please Forgive Us (LP Vers... 3.579846
3 ["1200 Micrograms", "ECSTACY", "SOKYOEA12AB018... 5.503980
4 ["13 Cats", "Please Give Me Something", "SOYLO... 2.964401
5 ["16 Bit Lolitas", "Tim Likes Breaks (intermez... 5.564306
6 ["23 Skidoo", "100 Dark", "SOTACCS12AB0185B85"] 5.572990
7 ["2econd Class Citizen", "For This We'll Find ... 3.756746
8 ["2tall", "Demonstration", "SOYYQZR12A8C144F9D"] 5.472524
The desired output is SONG, ARTIST, SONG ID, DENSITY, i.e. the song details split into columns.
For example, for this sample row:
SONG DETAILS DENSITY
8 ["2tall", "Demonstration", "SOYYQZR12A8C144F9D"] 5.472524
SONG ARTIST SONG ID DENSITY
2tall Demonstration SOYYQZR12A8C144F9D 5.472524
Thanks
The following worked for me:
In [275]:
pd.DataFrame(data = list(df['Song details'].values), columns = ['Song', 'Artist', 'Song Id'])
Out[275]:
Song Artist Song Id
0 2tall Demonstration SOYYQZR12A8C144F9D
1 2tall Demonstration SOYYQZR12A8C144F9D
For your data, please try: pd.DataFrame(data = list(df['Songdetails'].values), columns = ['SONG', 'ARTIST', 'SONG ID'])
Thank you. I had to insert the Density column into the new data frame, and was able to achieve what I needed:
df2 = pd.DataFrame(series.apply(lambda x: pd.Series(x.split(','))))
df2.insert(3,'Density',finaldf['Density'])
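Putting the two steps together, here is a minimal sketch that splits the lists and keeps Density in one go. It assumes each Songdetails cell is a real Python list of three items. If the cells are strings that merely look like lists, they would need to be parsed first. The two sample rows are copied from the question, and the names parts and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Songdetails': [['-123min.', 'Try', 'SOERGVA12A6D4FEC55'],
                    ['2tall', 'Demonstration', 'SOYYQZR12A8C144F9D']],
    'Density': [3.854437, 5.472524],
})

# Expand each list into its own columns, keeping the original row index
parts = pd.DataFrame(df['Songdetails'].tolist(),
                     columns=['SONG', 'ARTIST', 'SONG ID'],
                     index=df.index)

# Attach Density by index alignment
result = parts.join(df['Density'])
print(result)
```

Passing index=df.index keeps the rows aligned even when the original frame does not have a default 0..n-1 index.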
