Extracting values as a dictionary from dataframe based on list - python

I have a dataframe whose columns each contain unique values:
df1 = pd.DataFrame([["Phys","Shane","NY"],["Chem","Mark","LA"],
["Maths","Jack","Mum"],["Bio","Sam","CT"]],
columns = ["cls1","cls2","cls3"])
print(df1)
cls1 cls2 cls3
0 Phys Shane NY
1 Chem Mark LA
2 Maths Jack Mum
3 Bio Sam CT
And a list l1:
l1=["Maths","Bio","Shane","Mark"]
print(l1)
['Maths', 'Bio', 'Shane', 'Mark']
Now I want to retrieve the columns of the dataframe that contain elements of the list, together with the matching elements from each column.
Expected Output:
{'cls1' : ['Maths','Bio'], 'cls2': ['Shane','Mark']}
The code I have:
cls = []
for cols in df1.columns:
    mask = df1[cols].isin(l1)
    if mask.any():
        cls.append(cols)
print(cls)
The output of above code:
['cls1', 'cls2']
I'm struggling to get the elements common to the dataframe and the list and convert them into a dictionary.
Any suggestions are welcome.
Thanks.

Use DataFrame.isin to build a mask, replace the non-matching values with NaN by boolean indexing, and reshape with stack (dropna removes any NaN placeholders that remain):
df = df1[df1.isin(l1)].stack().dropna()
print (df)
0 cls2 Shane
1 cls2 Mark
2 cls1 Maths
3 cls1 Bio
dtype: object
Finally, build the dictionary with a dict comprehension over the groups of the second index level (the column names):
d = {k:v.tolist() for k,v in df.groupby(level=1)}
print(d)
{'cls1': ['Maths', 'Bio'], 'cls2': ['Shane', 'Mark']}
Another solution:
d = {}
for cols in df1.columns:
    mask = df1[cols].isin(l1)
    if mask.any():
        d[cols] = df1.loc[mask, cols].tolist()
print(d)
{'cls1': ['Maths', 'Bio'], 'cls2': ['Shane', 'Mark']}
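The loop-based approach can also be wrapped in a small reusable function. This is only a sketch: the name values_by_column is made up for illustration, and it assumes the lookup values are hashable so they can be put in a set.

```python
import pandas as pd

def values_by_column(df, values):
    """Map each column name to the entries of that column found in `values`.

    Hypothetical helper, not part of pandas. Columns without any match
    are left out of the result.
    """
    wanted = set(values)  # assumes hashable values
    result = {}
    for col in df.columns:
        # Keep the matching entries in their original row order
        matches = df.loc[df[col].isin(wanted), col].tolist()
        if matches:
            result[col] = matches
    return result

df1 = pd.DataFrame([["Phys", "Shane", "NY"], ["Chem", "Mark", "LA"],
                    ["Maths", "Jack", "Mum"], ["Bio", "Sam", "CT"]],
                   columns=["cls1", "cls2", "cls3"])
l1 = ["Maths", "Bio", "Shane", "Mark"]

result = values_by_column(df1, l1)
print(result)
```

Because the columns are visited in order, the keys of the dictionary follow the column order of the dataframe.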

Related

How to select dataframe in dictionary of dataframes that contains a column with specific substring

I have a dictionary of dataframes df_dict. I then have a substring "blue". I want to identify the name of the dataframe in my dictionary of dataframes that has at least one column that has a name containing the substring "blue".
I am thinking of trying something like:
for df in df_dict:
    if df.columns.contains('blue'):
        return df
    else:
        pass
However, I am not sure if a for loop is necessary here. How can I find the name of the dataframe I am looking for in my dictionary of dataframes?
I think a loop (or a comprehension) is necessary to iterate over the items of the dictionary:
df1 = pd.DataFrame({"aa_blue": [1,2,3],
'col':list('abc')})
df2 = pd.DataFrame({"f": [1,2,3],
'col':list('abc')})
df3 = pd.DataFrame({"g": [1,2,3],
'bluecol':list('abc')})
df_dict = {'df1_name' : df1, 'df2_name' : df2, 'df3_name' : df3}
out = [name for name, df in df_dict.items() if df.columns.str.contains('blue').any()]
print (out)
['df1_name', 'df3_name']
Or:
out = [name for name, df in df_dict.items() if any('blue' in y for y in df.columns)]
print (out)
['df1_name', 'df3_name']
For a list of the DataFrames themselves, use either of:
out = [df for name, df in df_dict.items() if df.columns.str.contains('blue').any()]
# or, equivalently:
out = [df for name, df in df_dict.items() if any('blue' in y for y in df.columns)]
print (out)
[ aa_blue col
0 1 a
1 2 b
2 3 c, g bluecol
0 1 a
1 2 b
2 3 c]
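If this lookup is needed in several places, it could be wrapped in a function. The sketch below uses a made-up name, names_with_column_substring, and converts column labels with str() on the assumption that some labels may not be strings.

```python
import pandas as pd

def names_with_column_substring(df_dict, substring):
    """Return the keys of df_dict whose dataframe has a column containing substring.

    Hypothetical helper for illustration only.
    """
    return [name for name, df in df_dict.items()
            if any(substring in str(col) for col in df.columns)]

df_dict = {
    "df1_name": pd.DataFrame({"aa_blue": [1, 2, 3], "col": list("abc")}),
    "df2_name": pd.DataFrame({"f": [1, 2, 3], "col": list("abc")}),
    "df3_name": pd.DataFrame({"g": [1, 2, 3], "bluecol": list("abc")}),
}

matches = names_with_column_substring(df_dict, "blue")
# First matching name, or None when nothing matches
first = next(iter(matches), None)
print(matches, first)
```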

Creating and storing a custom sort procedure as a function

I have the following pandas dataframe:
PLAYER GRP
Mike F3.03
Max F2.01
El G7.99
Billy G7.09
Steve B13.99
Vecna F3.03
I need to sort the dataframe by the grp column: first by letter, then by the number before the period, then by the number after the period, with all sorts ascending. If there is a tie (see Vecna and Mike), it should sort by player, ascending. Note that the actual use case does have leading zeros for numbers after the period but not before.
My desired end result is a table sorted as follows:
PLAYER GRP
Steve B13.99
Max F2.01
Mike F3.03
Vecna F3.03
Billy G7.09
El G7.99
Can anybody provide me with a way to do this? Ideally, I would like to store the procedure as a function that I can then use on any dataframe with a group column that has the same structure as above.
Sample Data:
import pandas as pd
df = pd.DataFrame({'player': ['Mike', 'Max', 'El', 'Billy', 'Steve', 'Vecna'],
'grp': ['F3.03', 'F2.01', 'G7.99', 'G7.09', 'B13.99', 'F3.03']})
Current Work
I'm pretty new to Python, so this is what I've got so far.
I've been able to isolate each of the respective elements into its own list, but not sure where to go from here:
import re
grp = df['grp']
letters = [re.sub(r'[^a-zA-Z]', '', let) for let in grp]
numbers = [re.sub(r'[^0-9]', '', num.split('.')[0]) for num in grp]
numbers_new = [int(num) for num in [re.sub(r'[^0-9]', '', num_new.split('.')[1]) for num_new in grp]]
list(zip(list(df['player']), grp, letters, numbers, numbers_new))
def sort_custom(d: pd.DataFrame,
                primary: str = 'grp',
                secondary: str | list = None,
                inplace: bool = False) -> pd.DataFrame | None:
    """
    Pass a DataFrame containing a LetterNumber column to sort by it.
    Defaults to 'grp' columns.
    Optional, pass a column or list of other columns to also sort by.
    inplace keyword is also possible.
    """
    if not inplace:
        d = d.copy()
    cols = ['l', 'v']
    d[cols] = pd.concat([d[primary].str[0],
                         d[primary].str[1:]
                         .astype(float)], axis=1).to_numpy()
    if secondary:
        if isinstance(secondary, list):
            d.sort_values(cols + secondary, inplace=True)
        else:
            d.sort_values(cols + [secondary], inplace=True)
    else:
        d.sort_values(cols, inplace=True)
    d.drop(cols, axis=1, inplace=True)
    return d if not inplace else None
df = pd.DataFrame({'player': ['Mike', 'Max', 'El', 'Billy', 'Steve', 'Vecna'],
'grp': ['F3.03', 'F2.01', 'G7.99', 'G7.09', 'B13.99', 'F3.03']})
sort_custom(d=df, primary='grp', secondary='player', inplace=True)
print(df)
Output:
player grp
4 Steve B13.99
1 Max F2.01
0 Mike F3.03
5 Vecna F3.03
3 Billy G7.09
2 El G7.99
The issue with just sorting by ['grp', 'player'] is the following:
df2 = pd.DataFrame({'player': ['Bob', 'Joe'], 'grp':['A12.09', 'A2.09']})
print(df2.sort_values(['grp', 'player']))
Output:
player grp
0 Bob A12.09
1 Joe A2.09
Here, according to string sorting, A12.09 < A2.09, but we want A12.09 > A2.09.
print(sort_custom(df2))
Output:
player grp
1 Joe A2.09
0 Bob A12.09
Personally, I think the better way of doing this would be permanently splitting the letter and number value and changing everything you do to work with that:
df['num'] = df.grp.str[1:].astype(float)
df['grp'] = df.grp.str[0]
df = df.sort_values(['grp', 'num', 'player'])
print(df)
Output:
player grp num
4 Steve B 13.99
1 Max F 2.01
0 Mike F 3.03
5 Vecna F 3.03
3 Billy G 7.09
2 El G 7.99
You can always combine them again (note that converting the float back to a string drops trailing zeros, so 3.10 would become '3.1'):
df.grp = df.grp + df.num.astype(str)
If, say, the player names are not all capitalized:
df = pd.DataFrame({'player': ['Mike', 'Max', 'El', 'Billy', 'Steve', 'Vecna', 'adam'],
'grp': ['F3.03', 'F2.01', 'G7.99', 'G7.09', 'B13.99', 'F3.03', 'F3.03']})
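You can use df.sort_values with a key function to make the sort case-insensitive. Note that grp is still compared as a plain string here, so this does not handle cases such as A12.09 versus A2.09.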
df.sort_values(by=['grp','player'],key=lambda col: col.str.lower(),ignore_index=True)
player grp
0 Steve B13.99
1 Max F2.01
2 adam F3.03
3 Mike F3.03
4 Vecna F3.03
5 Billy G7.09
6 El G7.99
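As another option, the three parts of grp can be pulled out with a regular expression and used as separate numeric sort keys, which matches the "letter, then number before the period, then number after the period" description. This is a sketch under assumptions: the function name sort_by_group and the helper column names are invented, every grp value is assumed to look like letters, digits, a period, digits, and the digits after the period are assumed to have a fixed width (so comparing them as integers is safe).

```python
import pandas as pd

def sort_by_group(df, group_col="grp", tie_breakers=("player",)):
    """Return a copy of df sorted by letter, integer part, fractional part.

    Hypothetical helper. Raises if a value does not match the expected
    pattern, because the unmatched parts cannot be converted to int.
    """
    parts = df[group_col].str.extract(r"^([A-Za-z]+)(\d+)\.(\d+)$")
    helper = ["_letter", "_major", "_minor"]  # temporary sort keys
    out = df.assign(_letter=parts[0],
                    _major=parts[1].astype(int),
                    _minor=parts[2].astype(int))
    out = out.sort_values(helper + list(tie_breakers))
    return out.drop(columns=helper)

df = pd.DataFrame({"player": ["Mike", "Max", "El", "Billy", "Steve", "Vecna"],
                   "grp": ["F3.03", "F2.01", "G7.99", "G7.09", "B13.99", "F3.03"]})
df2 = pd.DataFrame({"player": ["Bob", "Joe"], "grp": ["A12.09", "A2.09"]})

sorted_df = sort_by_group(df)
sorted_df2 = sort_by_group(df2)
print(sorted_df)
print(sorted_df2)
```

The function leaves the input untouched and keeps the original index labels on the sorted rows; call reset_index(drop=True) on the result if a fresh index is wanted.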

How to create a dictionary out of my specific dataframe?

I have a dataframe df with a column names:
names
phil/andy
allen
john/william/chris
john
I want to turn it into a sort of "dictionary" (a pandas dataframe) with a unique random number for each name:
name value
phil 1
andy 2
allen 3
john 4
william 5
chris 6
How can I do that? This dataframe is just a sample, so I need a function that does the same thing on a very large dataframe.
Here you go.
import numpy as np
import pandas as pd
# Original pd.DataFrame
d = {'phil': [1],
'phil/andy': [2],
'allen': [3],
'john/william/chris': [4],
'john': [5]
}
df = pd.DataFrame(data=d)
# Append all names to a list
names = []
for col in df.columns:
names = names + col.split("/")
# Remove duplicated names from the list
names = [i for n, i in enumerate(names) if i not in names[:n]]
# Create DF
df = pd.DataFrame(
    # Random numbers
    np.random.choice(
        len(names),         # Draw from 0 .. len(names) - 1
        size = len(names),  # One value per name
        replace = False     # No repeats, so the numbers are unique
    ),
    # Index names
    index = names,
    # Column names
    columns = ['Rand value']
)
If you want a dictionary instead of a pd.DataFrame, you can apply d = df.T.to_dict() at the end (or df['Rand value'].to_dict() for a flat name-to-number mapping). If you want the numbers 0,1,2,3,...,n instead of random numbers, replace the np.random.choice(...) call with range(len(names)).
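When the names sit in a column, as in the question, rather than in the column labels, the same result can be reached with vectorized string operations. This is a sketch that assumes the column is called names and that "/" is the only separator; the result column names name and value follow the desired output in the question.

```python
import pandas as pd

df = pd.DataFrame({"names": ["phil/andy", "allen", "john/william/chris", "john"]})

# Split each cell on "/", put every name on its own row,
# and keep only the first occurrence of each name
unique_names = (df["names"].str.split("/")
                .explode()
                .drop_duplicates()
                .reset_index(drop=True))

# Sequential ids 1..n, one per unique name
lookup = pd.DataFrame({"name": unique_names,
                       "value": list(range(1, len(unique_names) + 1))})
print(lookup)
```

For random but still unique numbers, the value column could instead be filled from numpy.random.default_rng().permutation(len(unique_names)).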

How to make a convert set in list to row

Original df:
list1 = ['apple','lemon']
list2 = [[('taste','sweet'),('sweetness','5')],[('taste','sour'),('sweetness','0')]]
df = pd.DataFrame(list(zip(list1,list2)), columns=['fruit', 'description'])
df.head()
Desired output:
list3 = ['apple','lemon']
list4 = ['sweet','sour']
list5 = ['5','0']
df2 = pd.DataFrame(list(zip(list3,list4,list5)), columns=['fruit', 'taste', 'sweetness'])
df2.head()
What I have tried so far seems 'weird': converting the column to a string and removing the punctuation piece by piece before converting it into rows:
df['description'] = df['description'].astype(str)
df['description'] = df['description'].str[1:-1]
df['description'] = df['description'].str.replace("(", "", regex=False)
df.head()
Is there a better way to convert the list into the desired rows and columns?
Thanks
Create dictionaries from the lists of tuples and pass them to the DataFrame constructor, using DataFrame.pop to extract the column, then add the result to the original with DataFrame.join:
L = [dict(y) for y in df.pop('description')]
df = df.join(pd.DataFrame(L, index=df.index))
print (df)
fruit taste sweetness
0 apple sweet 5
1 lemon sour 0

how to create a new empty array column in dataframe

I want to create a new column with each item as an empty array in dataframe,
for example, the dataframe is like:
Index Name
0 Mike
1 Tom
2 Lucy
I want to create a new column, like this:
Index Name Scores
0 Mike []
1 Tom []
2 Lucy []
Because I need to append values in the arrays in the new column. How should I code?
Solution using just Python (a list comprehension gives each row its own list):
import pandas as pd
df = pd.DataFrame({'Name': ['Mike', 'Tom', 'Lucy']})
df['Scores'] = [[] for _ in range(len(df))]
print(df)
Output:
Name Scores
0 Mike []
1 Tom []
2 Lucy []
The solution using np.empty function:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Index': [0,1,2], 'Name': ['Mike', 'Tom', 'Lucy']})
df['Scores'] = np.empty((len(df), 0)).tolist()
print(df)
The output:
Index Name Scores
0 0 Mike []
1 1 Tom []
2 2 Lucy []
(len(df), 0) is the shape of the new array: len(df) rows with zero elements each, so tolist() yields one separate empty list per row.
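One thing to watch when building a column of empty lists that will be appended to later: multiplying a list, as in [[]] * n, repeats a reference to one and the same inner list, so appending to one row changes every row. The sketch below shows the difference using plain Python lists and then a dataframe column built with a list comprehension; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Mike", "Tom", "Lucy"]})

# Pitfall: all three entries are the same list object
shared = [[]] * len(df)
shared[0].append(90)
print(shared)  # every entry now contains 90

# Safe: a list comprehension creates a new list for each row
df["Scores"] = [[] for _ in range(len(df))]
df.at[0, "Scores"].append(90)   # only Mike's list changes
df.at[2, "Scores"].append(75)   # only Lucy's list changes
scores = df["Scores"].tolist()
print(df)
```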
