Find the most frequent string value in a whole dataframe / Pandas - python

I have a dataframe with 4 columns, each containing actor names.
The same actors appear in several columns, and I want to find the actor or actress who appears most often across the whole dataframe.
I tried mode, but it doesn't work: it gives me the most frequent actor in each column separately.

I would strongly advise you to use Python's Counter class: you can feed whole rows and columns of the frame into the object. The code would look like this:
import pandas as pd
from collections import Counter

# Artificially creating a DataFrame
actors = [
    ["Will Smith", "Johnny Depp", "Johnny Depp", "Johnny Depp"],
    ["Will Smith", "Morgan Freeman", "Morgan Freeman", "Morgan Freeman"],
    ["Will Smith", "Mila Kunis", "Mila Kunis", "Mila Kunis"],
    ["Will Smith", "Charlie Sheen", "Charlie Sheen", "Charlie Sheen"],
]
df = pd.DataFrame(actors)

# Creating the counter
counter = Counter()

# Feeding each whole row into the counter
for _, row in df.iterrows():
    counter.update(row)

print("counter object:")
print(counter)

# Show the two most common actors
for actor, occurrences in counter.most_common(2):
    print("Actor {} occurred {} times".format(actor, occurrences))
The output would look like this:
counter object:
Counter({'Will Smith': 4, 'Morgan Freeman': 3, 'Johnny Depp': 3, 'Mila Kunis': 3, 'Charlie Sheen': 3})
Actor Will Smith occurred 4 times
Actor Morgan Freeman occurred 3 times
The Counter solves your problem quite quickly, but be aware that counter.update expects an iterable of items. You should not update it with a plain string: if you do, the counter counts the individual characters.
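For completeness, the same counts can be produced without an explicit loop by flattening all cells into one iterable first (a minimal sketch on the same toy data):

```python
import pandas as pd
from collections import Counter

actors = [
    ["Will Smith", "Johnny Depp", "Johnny Depp", "Johnny Depp"],
    ["Will Smith", "Morgan Freeman", "Morgan Freeman", "Morgan Freeman"],
    ["Will Smith", "Mila Kunis", "Mila Kunis", "Mila Kunis"],
    ["Will Smith", "Charlie Sheen", "Charlie Sheen", "Charlie Sheen"],
]
df = pd.DataFrame(actors)

# Flatten the whole frame into a 1-D array and count every cell at once
counter = Counter(df.to_numpy().ravel())
print(counter.most_common(1))  # [('Will Smith', 4)]
```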

Use stack and value_counts to get the entire list of actors/actresses:
df.stack().value_counts()
Using @Ofi91's setup:
import pandas as pd

# Artificially creating a DataFrame
actors = [
    ["Will Smith", "Johnny Depp", "Johnny Depp", "Johnny Depp"],
    ["Will Smith", "Morgan Freeman", "Morgan Freeman", "Morgan Freeman"],
    ["Will Smith", "Mila Kunis", "Mila Kunis", "Mila Kunis"],
    ["Will Smith", "Charlie Sheen", "Charlie Sheen", "Charlie Sheen"],
]
df = pd.DataFrame(actors)
df.stack().value_counts()
Output:
Will Smith 4
Morgan Freeman 3
Johnny Depp 3
Charlie Sheen 3
Mila Kunis 3
dtype: int64
To find the actor with the most appearances:
df.stack().value_counts().idxmax()
Output:
'Will Smith'
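Incidentally, the original attempt with mode also works once the frame is stacked into a single Series, since mode then looks across all columns at once (a sketch on the same data):

```python
import pandas as pd

actors = [
    ["Will Smith", "Johnny Depp", "Johnny Depp", "Johnny Depp"],
    ["Will Smith", "Morgan Freeman", "Morgan Freeman", "Morgan Freeman"],
    ["Will Smith", "Mila Kunis", "Mila Kunis", "Mila Kunis"],
    ["Will Smith", "Charlie Sheen", "Charlie Sheen", "Charlie Sheen"],
]
df = pd.DataFrame(actors)

# mode on the stacked Series considers every cell in the frame
print(df.stack().mode()[0])  # Will Smith
```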

Let's consider your data frame to be like the one in your question.
First we stack all columns into one column:
df1 = df.stack().reset_index(drop=True).to_frame('actors')
Now, take the value_counts of the actors column:
df2 = df1['actors'].value_counts().sort_values(ascending=False)
There you go: the resulting data frame has each actor name and its number of occurrences in the data frame.
Happy Analysis!!!
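Putting both steps together on the toy data from the other answers (the column name 'actors' is just illustrative; note that value_counts already sorts in descending order):

```python
import pandas as pd

actors = [
    ["Will Smith", "Johnny Depp", "Johnny Depp", "Johnny Depp"],
    ["Will Smith", "Morgan Freeman", "Morgan Freeman", "Morgan Freeman"],
    ["Will Smith", "Mila Kunis", "Mila Kunis", "Mila Kunis"],
    ["Will Smith", "Charlie Sheen", "Charlie Sheen", "Charlie Sheen"],
]
df = pd.DataFrame(actors)

# Stack all columns into a single 'actors' column
df1 = df.stack().reset_index(drop=True).to_frame('actors')

# Count occurrences of each actor (most frequent first)
df2 = df1['actors'].value_counts()
print(df2.head(2))
```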

Related

Split dataframe by date column

I have a dataframe that I would like to split into multiple dataframes using the value in my Date column. Ideally, I would like to split my dataframe by decades. Do I need to use np.array_split method or is there a method that does not require numpy?
My Dataframe looks like a larger version of this:
Date Name
0 1746-06-02 Borcke (#p1)
1 1746-09-02 Jordan (#p31)
2 1747-06-02 Sa Majesté (#p32)
3 1752-01-26 Maupertuis (#p4)
4 1755-06-02 Jordan (#p31)
And so I would ideally want in this scenario two data frames like these:
Date Name
0 1746-06-02 Borcke (#p1)
1 1746-09-02 Jordan (#p31)
2 1747-06-02 Sa Majesté (#p32)
Date Name
0 1752-01-26 Maupertuis (#p4)
1 1755-06-02 Jordan (#p31)
Building on @mozway's answer for getting the decades:
import pandas as pd
import math

d = {
    "Date": [
        "1746-06-02",
        "1746-09-02",
        "1747-06-02",
        "1752-01-26",
        "1755-06-02",
    ],
    "Name": [
        "Borcke (#p1)",
        "Jordan (#p31)",
        "Sa Majesté (#p32)",
        "Maupertuis (#p4)",
        "Jordan (#p31)",
    ],
}
df = pd.DataFrame(d)
df["years"] = df["Date"].str.extract(r"(^\d{4})", expand=False).astype(int)
df["decades"] = (df["years"] / 10).apply(math.floor) * 10
dfs = [g for _, g in df.groupby(df["decades"])]
Use groupby, you can generate a list of DataFrames:
dfs = [g for _, g in df.groupby(df['Date'].str.extract(r'(^\d{3})', expand=False))]
Or, validating the dates:
dfs = [g for _,g in df.groupby(pd.to_datetime(df['Date']).dt.year//10)]
If you prefer a dictionary for indexing by decade:
dfs = dict(list(df.groupby(pd.to_datetime(df['Date']).dt.year//10*10)))
NB. I initially missed that you wanted decades, not years. I updated the answer. The logic remains unchanged.
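With the dictionary variant, each decade's frame can then be looked up directly by its start year (a sketch using the sample data from the question):

```python
import pandas as pd

d = {
    "Date": ["1746-06-02", "1746-09-02", "1747-06-02", "1752-01-26", "1755-06-02"],
    "Name": ["Borcke (#p1)", "Jordan (#p31)", "Sa Majesté (#p32)",
             "Maupertuis (#p4)", "Jordan (#p31)"],
}
df = pd.DataFrame(d)

# Floor each year to a multiple of 10 and group, keyed by decade
dfs = dict(list(df.groupby(pd.to_datetime(df['Date']).dt.year // 10 * 10)))
print(dfs[1740])  # the three 1740s rows
print(dfs[1750])  # the two 1750s rows
```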

Splitting column by multiple custom delimiters in Python

I need to split a column called Creative where each cell contains samples such as:
pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)
Where each two-letter code preceding each parenthesized section is the title of the desired column, and the codes are the same in every row. The only data that changes is what is inside the parentheses. I want the data to look like:
pn    io   ta   pt    cn    cs
2021  302  Yes  Blue  John  Doe
I tried
df[['Creative', 'Creative Size']] = df['Creative'].str.split('cs(',expand=True)
and
df['Creative Size'] = df['Creative Size'].str.replace(')','')
but got the error `error: missing ), unterminated subpattern at position 2`, which I assume has something to do with regular expressions.
Is there an easy way to split these? Thanks.
Use extract with named capturing groups:
import pandas as pd
# toy example
df = pd.DataFrame(data=[["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)"]], columns=["Creative"])
# extract with a named capturing group
res = df["Creative"].str.extract(
    r"pn\((?P<pn>\d+)\)io\((?P<io>\d+)\)ta\((?P<ta>\w+)\)pt\((?P<pt>\w+)\)cn\((?P<cn>\w+)\)cs\((?P<cs>\w+)\)",
    expand=True,
)
print(res)
Output
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
I'd use regex to generate a list of dictionaries via comprehensions. The idea is to create a list of dictionaries, each representing a row of the desired dataframe, then construct a dataframe out of it. It can be built in one nested comprehension:
import re

# [^)]+ keeps each match inside its own pair of parentheses (a greedy .+ would
# swallow everything up to the last closing paren)
rows = [{r[0]: r[1] for r in re.findall(r'(\w{2})\(([^)]+)\)', c)} for c in df['Creative']]
subtable = pd.DataFrame(rows)
for col in subtable.columns:
    df[col] = subtable[col].values
Basically, I regex search for instances of ab(*) and capture the two-letter prefix and the contents of the parenthesis and store them in a list of tuples. Then I create a dictionary out of the list of tuples, each of which is essentially a row like the one you display in your question. Then, I put them into a data frame and insert each of those columns into the original data frame. Let me know if this is confusing in any way!
Try with extractall:
names = df["Creative"].str.extractall(r"(.*?)\(.*?\)").loc[0][0].tolist()
output = df["Creative"].str.extractall(r"\((.*?)\)").unstack()[0].set_axis(names, axis=1)
>>> output
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
1 2020 301 No Red Jane Doe
Input df:
df = pd.DataFrame({"Creative": ["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)",
"pn(2020)io(301)ta(No)pt(Red)cn(Jane)cs(Doe)"]})
We can use str.findall to extract matching column name-value pairs
pd.DataFrame(map(dict, df['Creative'].str.findall(r'(\w+)\((\w+)')))
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
Using regular expressions, different way of packaging final DataFrame:
import re
import pandas as pd
txt = 'pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)'
data = list(zip(*re.findall(r'([^(]+)\(([^)]+)\)', txt)))
df = pd.DataFrame([data[1]], columns=data[0])

Rename specific rows in pandas

I would like to rename two jobs in my dataset to "pastry". I created a dictionary with the new name as the key and the previous categories as a list:
import pandas as pd

# Data for an artificial dataframe
salary = [100, 200, 125, 400, 200]
job = ["pastry Commis ", "line cook", "pastry Commis", "pastry chef", "line cook"]

# New categories
cat_ac = {"pastry": ["pastry Commis", "pastry chef"]}

df_test = pd.DataFrame({"salary": salary, "job": job})
df_test.head()
And then
df_test.loc[df_test["job"].isin(cat_ac[list(cat_ac.keys())[0]]), "job"] = list(cat_ac.keys())[0]
df_test
Everything works fine on this small dataset, but when I do the same on my 40k rows of data, all the lines corresponding to the jobs "pastry Commis" and "pastry chef" just disappear, and so does the new category "pastry":
# We read the lines with the new category
df.loc[df["job"].isin(["pastry"]), "job"]
Out: Series([], Name: job, dtype: object)
# We read the lines with the previous categories
df.loc[df["job"].isin(cat_baking[list(cat_baking.keys())[0]]), "job"]
Out: Series([], Name: job, dtype: object)
Any idea what the problem could be?
You can use:
df_test.job.replace({i: k for k, v in cat_ac.items() for i in v})
0 pastry Commis
1 line cook
2 pastry
3 pastry
4 line cook
Note: I think the first record has a trailing space, so it did not get replaced; this matches the behavior of your working solution. You can deal with such cases using str.strip(), though.
Use your dict of replacements to replace using regex patterns:
for k, v in cat_ac.items():
    pat = '|'.join(v)
    df_test['job'] = df_test['job'].str.replace(pat, k, regex=True)
You can also do it using np.where:
import numpy as np
df_test['job'] = np.where((df_test['job'].str.contains('pastry Commis')) | (df_test['job'].str.contains('pastry chef')), 'pastry', df_test['job'])
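Since the 40k-row data apparently contains stray whitespace (e.g. "pastry Commis "), stripping before matching avoids the silent misses. A sketch combining str.strip with the isin approach from the question:

```python
import pandas as pd

salary = [100, 200, 125, 400, 200]
job = ["pastry Commis ", "line cook", "pastry Commis", "pastry chef", "line cook"]
cat_ac = {"pastry": ["pastry Commis", "pastry chef"]}
df_test = pd.DataFrame({"salary": salary, "job": job})

# Strip whitespace first, then match against the old category names
mask = df_test["job"].str.strip().isin(cat_ac["pastry"])
df_test.loc[mask, "job"] = "pastry"
print(df_test["job"].tolist())
# ['pastry', 'line cook', 'pastry', 'pastry', 'line cook']
```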

Creating new column with pieces of 3 columns

I would like to create a new column in a dataframe containing pieces of 3 different columns: the first 5 letters of the last name (after removing non-alphabetic characters) if it is that long, else just the last name; then the first 2 letters of the first name; with a code appended to the end.
The code below doesn't work, but that's where I am, and it isn't close to working:
df['namecode'] = df.Last.str.replace('[^a-zA-Z]', '')[:5]+df.First.str.replace('[^a-zA-Z]', '')[:2]+str(jr['code'])
Name  lastname  code  namecode
jeff  White     0989  Whiteje0989
Zach  Bunt      0798  Buntza0798
ken   Black     5764  Blackke5764
Here is one approach.
Use pandas str.slice instead of trying to do string indexing.
For example:
import pandas as pd

df = pd.DataFrame(
    {
        'First': ['jeff', 'zach', 'ken'],
        'Last': ['White^', 'Bun/t', 'Bl?ack'],
        'code': ['0989', '0798', '5764']
    }
)

print(df['Last'].str.replace('[^a-zA-Z]', '', regex=True).str.slice(0, 5)
      + df['First'].str.slice(0, 2) + df['code'])
#0    Whiteje0989
#1     Buntza0798
#2    Blackke5764
#dtype: object

Loading a random sample from CSV with pandas

I have a CSV of the format
Team, Player
What I want to do is apply a filter to the field Team, then take a random subset of 3 players from EACH team.
So for instance, my CSV looks like :
Man Utd, Ryan Giggs
Man Utd, Paul Scholes
Man Utd, Paul Ince
Man Utd, Danny Pugh
Liverpool, Steven Gerrard
Liverpool, Kenny Dalglish
...
I want to end up with an XLS consisting of 3 random players from each team, and only 1 or 2 in the case where there are fewer than 3, e.g.:
Man Utd, Paul Scholes
Man Utd, Paul Ince
Man Utd, Danny Pugh
Liverpool, Steven Gerrard
Liverpool, Kenny Dalglish
I started out using XLRD, my original post is here.
I am now trying to use Pandas as I believe this will be more flexible into the future.
So, in pseudocode what I want to do is:
foreach(team in csv)
print random 3 players + team they are assigned to
I've been looking through Pandas and trying to find the best approach to doing this, but I can't find anything similar to what I want to do (it's a difficult thing to Google!). Here's my attempt so far :
import pandas as pd
from collections import defaultdict
import csv as csv

columns = defaultdict(list)  # each value in each column is appended to a list

with open('C:\\Users\\ADMIN\\Desktop\\CSV_1.csv') as f:
    reader = csv.DictReader(f)  # read rows into a dictionary format
    for row in reader:  # read a row as {column1: value1, column2: value2,...}
        print(row)
        #for (k, v) in row.items():  # go over each column name and value
        #    columns[k].append(v)    # append the value into the appropriate list
        #                            # based on column name k
I have commented out the last two lines, as I am not really sure if they are needed. I now see each row being printed, so I just need to select 3 random rows for each football team (or 1 or 2 in the case where there are fewer).
How can I accomplish this? Any tips/tricks?
Thanks.
First, use the better-optimised read_csv:
import pandas as pd
df = pd.read_csv('C:\\Users\\ADMIN\\Desktop\\CSV_1.csv')
Now, as a toy example, take a random subset of a column by sampling random positions (replace 'x' with your own column, e.g. the players of one team):
In []:
import numpy as np
df = pd.DataFrame()
df['x'] = np.arange(0, 10, 1)
df['y'] = np.arange(0, 10, 1)
df['x'] = df['x'].astype(str)
df['y'] = df['y'].astype(str)
df['x'].iloc[np.random.randint(0, len(df), 10)][:3]
Out[]:
0    0
3    3
7    7
Name: x, dtype: object
This will make you more familiar with pandas; however, starting with version 0.16.x, there is a built-in DataFrame.sample method:
df = pd.DataFrame(data)

# Randomly sample 70% of your dataframe
df_07 = df.sample(frac=0.7)

# Randomly sample 7 elements from your dataframe
df_7 = df.sample(n=7)
For either approach above, you can get the rest of the rows by doing:
df_rest = df.loc[~df.index.isin(df_07.index)]
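To get at most 3 random players per team, as the question asks, sample can be applied per group. A sketch on the question's data (min handles teams with fewer than 3 players):

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Man Utd", "Man Utd", "Man Utd", "Man Utd",
             "Liverpool", "Liverpool"],
    "Player": ["Ryan Giggs", "Paul Scholes", "Paul Ince", "Danny Pugh",
               "Steven Gerrard", "Kenny Dalglish"],
})

# Sample up to 3 players from each team; smaller teams keep all their players
picks = (df.groupby("Team", group_keys=False)
           .apply(lambda g: g.sample(n=min(len(g), 3))))
print(picks)
```

The result could then be written out with to_excel to get the XLS mentioned in the question (this needs an Excel writer such as openpyxl installed).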
