How to create a new empty array column in a dataframe - Python

I want to create a new column in a dataframe where each item is an empty array.
For example, the dataframe looks like this:
Index Name
0 Mike
1 Tom
2 Lucy
I want to add a new column so that it looks like this:
Index Name Scores
0 Mike []
1 Tom []
2 Lucy []
I need to append values to the arrays in the new column. How should I code this?

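Solution using plain Python. A list comprehension gives each row its own list; [[]] * len(df) would put the same list object in every row, so appending to one row would change all of them:
import pandas as pd
df = pd.DataFrame({'Name': ['Mike', 'Tom', 'Lucy']})
# One independent empty list per row
df['Scores'] = [[] for _ in range(len(df))]
print(df)
Output:
Name Scores
0 Mike []
1 Tom []
2 Lucy []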

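Solution using the np.empty function:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Index': [0,1,2], 'Name': ['Mike', 'Tom', 'Lucy']})
# pd.np is deprecated and no longer available in recent pandas versions; import numpy directly
df['Scores'] = np.empty((len(df), 0)).tolist()
print(df)
Output:
Index Name Scores
0 0 Mike []
1 1 Tom []
2 2 Lucy []
(len(df), 0) is the shape tuple of the new array: len(df) rows and 0 columns, so tolist() yields one empty list per row.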
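Since the goal is to append to the lists afterwards, it matters whether each row holds its own list. The sketch below (the variable names shared and independent are only illustrative) contrasts list multiplication, which repeats one list object, with a list comprehension, which creates a separate list per row:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Mike', 'Tom', 'Lucy']})

# [[]] * n repeats a reference to ONE list object n times
shared = df.copy()
shared['Scores'] = [[]] * len(shared)
shared.at[0, 'Scores'].append(90)   # mutates the list every row points to

# A list comprehension builds a new list for every row
independent = df.copy()
independent['Scores'] = [[] for _ in range(len(independent))]
independent.at[0, 'Scores'].append(90)   # only row 0 changes

print(shared['Scores'].tolist())
print(independent['Scores'].tolist())
```

With the shared version the appended value shows up in every row, which is rarely what is wanted.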

Related

Drop rows in a pandas dataframe by criteria from another dataframe

I have the following dataframe containing scores for a competition, as well as a column that counts the entry number for each person.
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Jim', 'John','Jim', 'John','Jim','John','Jim','John','Jim','Jack','Jack','Jack','Jack'],'Score': [10,8,9,3,5,0, 1, 2,3, 4,5,6,8,9]})
df['Entry_No'] = df.groupby(['Name']).cumcount() + 1
df
Then I have another table that stores data on the maximum number of entries that each person can have:
df2 = pd.DataFrame({'Name': ['John', 'Jim', 'Jack'],'Limit': [2,3,1]})
df2
I am trying to drop rows from df where the entry number is greater than the Limit according to each person in df2 so that my expected output is this:
If there are any ideas on how to help me achieve this that would be fantastic! Thanks
You can use pandas.merge to create another dataframe and filter its rows by your condition:
df3 = pd.merge(df, df2, on="Name", how="left")
df3[df3["Entry_No"] <= df3["Limit"]][df.columns].reset_index(drop=True)
Name Score Entry_No
0 John 10 1
1 Jim 8 1
2 John 9 2
3 Jim 3 2
4 Jim 0 3
5 Jack 5 1
I used how="left" to keep the order of df and reset_index(drop=True) to reset the index of the resulting dataframe.
You could join the 2 dataframes, and then drop with a condition:
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Jim', 'John','Jim', 'John','Jim','John','Jim','John','Jim','Jack','Jack','Jack','Jack'],'Score': [10,8,9,3,5,0, 1, 2,3, 4,5,6,8,9]})
df['Entry_No'] = df.groupby(['Name']).cumcount() + 1
df2 = pd.DataFrame({'Name': ['John', 'Jim', 'Jack'],'Limit': [2,3,1]})
df2 = df2.set_index('Name')
df = df.join(df2, on='Name')
df.drop(df[df.Entry_No>df.Limit].index, inplace = True)
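This gives the expected rows. Note that df now also contains the joined Limit column, which you can remove with df.drop(columns='Limit').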
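Another possible approach, sketched here as an alternative to merge and join, is to look up each person's limit with Series.map and filter with a boolean mask. The names limit_by_name, allowed and result are illustrative and not from the original answers:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jim', 'John', 'Jim', 'John', 'Jim', 'John', 'Jim',
             'John', 'Jim', 'Jack', 'Jack', 'Jack', 'Jack'],
    'Score': [10, 8, 9, 3, 5, 0, 1, 2, 3, 4, 5, 6, 8, 9],
})
df['Entry_No'] = df.groupby('Name').cumcount() + 1
df2 = pd.DataFrame({'Name': ['John', 'Jim', 'Jack'], 'Limit': [2, 3, 1]})

# Series indexed by name, used as a lookup table for the limits
limit_by_name = df2.set_index('Name')['Limit']

# Limit that applies to each row of df, aligned with df's index
allowed = df['Name'].map(limit_by_name)

# Keep only the rows whose entry number is within the limit
result = df[df['Entry_No'] <= allowed].reset_index(drop=True)
print(result)
```

This keeps the original columns and row order without adding a Limit column to df.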

How to create a dictionary out of my specific dataframe?

I have a dataframe df with a single column, names:
names
phil/andy
allen
john/william/chris
john
I want to turn it into a sort of "dictionary" (a pandas dataframe) with a unique random number for each name:
name value
phil 1
andy 2
allen 3
john 4
william 5
chris 6
How can I do that? This dataframe is only a sample, so I need a function that does the same thing for a very large dataframe.
Here you go.
import numpy as np
import pandas as pd
# Original pd.DataFrame
d = {'phil': [1],
     'phil/andy': [2],
     'allen': [3],
     'john/william/chris': [4],
     'john': [5]
     }
df = pd.DataFrame(data=d)
# Append all names to a list
names = []
for col in df.columns:
    names = names + col.split("/")
# Remove duplicated names from the list
names = [i for n, i in enumerate(names) if i not in names[:n]]
# Create DF
df = pd.DataFrame(
    # Random numbers
    np.random.choice(
        len(names),  # Length
        size = len(names),  # Shape
        replace = False  # Unique random numbers
    ),
    # Index names
    index = names,
    # Column names
    columns = ['Rand value']
)
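If you want a dictionary instead of a pd.DataFrame, you can also apply d = df.T.to_dict() at the end; this nests each value under the column name, while df['Rand value'].to_dict() gives a flat name-to-number mapping. If you want the numbers 0, 1, 2, ..., n-1 instead of random numbers, you can replace the np.random.choice() call with range(len(names)).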
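The answer above stores the names as column headers. If the names sit in a column called names, as in the question, a vectorized sketch could look like the following. The names unique_names, rng and lookup are illustrative, and the fixed seed is only there to make the example repeatable:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'names': ['phil/andy', 'allen', 'john/william/chris', 'john']})

# Split on "/", put each name on its own row, and drop repeats
unique_names = (
    df['names']
    .str.split('/')
    .explode()
    .drop_duplicates()
    .reset_index(drop=True)
)

# A random permutation of 1..n gives every name a unique number
rng = np.random.default_rng(0)
lookup = pd.DataFrame({
    'name': unique_names,
    'value': rng.permutation(len(unique_names)) + 1,
})
print(lookup)
```

Because the split, explode and de-duplication steps are pandas operations rather than a Python loop, this should also scale to a large dataframe.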

python-pandas: new column based on index?

I have a df similar to the one below:
name age sex
1 john 12 m
2 mary 13 f
3 joseph 12 m
4 maria 14 f
How can I make a new column based on the index? For example, for index 1 and 2 I want the label cluster1, and for index 3 and 4 the label cluster2, like so:
name age sex label
1 john 12 m cluster1
2 mary 13 f cluster1
3 joseph 12 m cluster2
4 maria 14 f cluster2
Should I use something like (df.index.isin([1, 2])) == 'cluster1'? I think it's not possible to do df['target'] = (df.index.isin([1, 2])) == 'cluster1', assuming that the label column doesn't exist in the beginning.
I think this is what you are looking for? You can use lists for different clusters to make your labels arbitrary in this way.
import pandas as pd
data = {'name':['bob','sue','mary','steve'], 'age':[11, 23, 53, 44]}
df = pd.DataFrame(data)
print(df)
df['label'] = 0
cluster1 = [0, 3]
cluster2 = [1, 2]
df.loc[cluster1, 'label'] = 1
df.loc[cluster2, 'label'] = 2
#another way
#df.iloc[cluster1, df.columns.get_loc('label')] = 1
#df.iloc[cluster2, df.columns.get_loc('label')] = 2
print(df)
output:
name age
0 bob 11
1 sue 23
2 mary 53
3 steve 44
name age label
0 bob 11 1
1 sue 23 2
2 mary 53 2
3 steve 44 1
You can initialize the column with any value. You can either use one of the cluster values (so that you only have to set the other cluster manually instead of both), or use None so that, after assigning labels, you can easily check that you didn't miss any rows.
If the assignment to clusters is truly arbitrary I don't think you'll be able to automate it much more than this.
Is this the solution you are looking for? I doubled the data so you can try different group sizes. If you call create_label(df, 3) instead of create_label(df, 2), it labels the rows three at a time, which gives you a parametric solution.
import pandas as pd
df = pd.DataFrame({'name': ['john', 'mary', 'joseph', 'maria', 'john', 'mary', 'joseph', 'maria'],
                   'age': [12, 13, 12, 14, 12, 13, 12, 14],
                   'sex': ['m', 'f','m', 'f', 'm', 'f','m', 'f']})
df.index = df.index + 1
df['label'] = None  # placeholder column, filled in below
def create_label(data, each_row):
    label_col = data.columns.get_loc('label')
    i = 0  # position of the first row in the current group
    j = 1  # number of the current label
    while i < len(data):
        # Positional assignment avoids chained indexing
        data.iloc[i: i + each_row, label_col] = 'label' + str(j)
        i += each_row
        j += 1
    return data
df_new = create_label(df, 2)
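The fixed-size grouping that create_label performs can also be sketched without an explicit loop, by integer-dividing each row position by the group size. The name block_number is illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['john', 'mary', 'joseph', 'maria',
                            'john', 'mary', 'joseph', 'maria'],
                   'age': [12, 13, 12, 14, 12, 13, 12, 14]})
df.index = df.index + 1

each_row = 2  # number of consecutive rows that share a label

# Row positions 0, 1, 2, ... integer-divided by the group size give 0, 0, 1, 1, ...
block_number = np.arange(len(df)) // each_row + 1

# A plain list is assigned by position, so the 1-based index does not matter
df['label'] = ['label' + str(n) for n in block_number]
print(df)
```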
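For a small dataframe you can use the code below. Passing index=df.index keeps the labels aligned with the rows, because the example index starts at 1 rather than 0:
Label = pd.Series(['cluster1','cluster1','cluster2','cluster2'], index=df.index)
df['label'] = Label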
You can use a for loop and a list to build the new column with the desired data:
import pandas as pd
df = pd.read_csv("dataset.csv")
list1 = []
for i in range(len(df.name)):
    if i < 2:
        list1.append('cluster1')
    else:
        list1.append('cluster2')
label = pd.Series(list1)
df['label'] = label
You can simply use iloc and assign the values for the column:
import pandas as pd
df = pd.read_csv('test.txt', sep=r'\+', engine="python")
df["label"] = ""  # adds empty "label" column
label_col = df.columns.get_loc("label")
# Assign through df.iloc directly; chained df["label"].iloc[...] may not update df
df.iloc[0:2, label_col] = "cluster1"
df.iloc[2:4, label_col] = "cluster2"
Since the values do not follow a certain order, as per your comments, you'd have to assign each "cluster" value manually.
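When the assignment is arbitrary, one option is to keep it in a dictionary and look the labels up from the index. This is only a sketch using the sample data from the question; clusters and index_to_label are illustrative names:

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['john', 'mary', 'joseph', 'maria'],
     'age': [12, 13, 12, 14],
     'sex': ['m', 'f', 'm', 'f']},
    index=[1, 2, 3, 4],
)

# Which index values belong to which cluster
clusters = {'cluster1': [1, 2], 'cluster2': [3, 4]}

# Invert the mapping: index value -> cluster label
index_to_label = {idx: label
                  for label, members in clusters.items()
                  for idx in members}

# Index values missing from the dictionary would become NaN
df['label'] = df.index.map(index_to_label)
print(df)
```

Any row left as NaN afterwards is one that was not assigned to a cluster.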

Filling a dataframe from a dictionary keys and values: efficient way

I have the following dataframe as an example.
df_test = pd.DataFrame(data=0, index=["green","yellow","red"], columns=["bear","dog","cat"])
I have the following dictionary, whose keys and values correspond to the index and columns of my dataframe.
d = {"green":["bear","dog"], "yellow":["bear"], "red":["bear"]}
I filled my dataframe according to the keys and values of the dictionary, using:
for k, v in d.items():
    for x in v:
        df_test.loc[k, x] = 1
My problem here is that the dataframe and the dictionary I'm working with are very large and it took too much time to compute. Is there a more efficient way to do it? Maybe iterating over rows in the dataframe instead of keys and values in the dictionary?
Because performance is important, use MultiLabelBinarizer:
d = {"green":["bear","dog"], "yellow":["bear"], "red":["bear"]}
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(list(d.values())),
                  columns=mlb.classes_,
                  index=list(d.keys()))
print (df)
bear dog
green 1 1
yellow 1 0
red 1 0
And then add missing columns and index labels by DataFrame.reindex:
df_test = df.reindex(columns=df_test.columns, index=df_test.index, fill_value=0)
print (df_test)
bear dog cat
green 1 1 0
yellow 1 0 0
red 1 0 0
Use get_dummies():
# convert dict to a Series
s = pd.Series(d)
# explode your list into columns and get dummies
df = pd.get_dummies(s.apply(pd.Series), prefix='', prefix_sep='')
bear dog
green 1 1
yellow 1 0
red 1 0
update
# convert dict to a Series
s = pd.Series(d)
# create a new data frame
df = pd.DataFrame(s.values.tolist(), index=s.index)
# get_dummies
new_df = pd.get_dummies(df, prefix='', prefix_sep='')
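If adding scikit-learn as a dependency is not desirable, a pandas-only sketch of the same idea is to explode the dictionary into (key, value) pairs and count them with pd.crosstab. The names s and indicator are illustrative, and this has not been benchmarked against the answers above:

```python
import pandas as pd

df_test = pd.DataFrame(data=0, index=["green", "yellow", "red"],
                       columns=["bear", "dog", "cat"])
d = {"green": ["bear", "dog"], "yellow": ["bear"], "red": ["bear"]}

# One row per (key, value) pair: the index holds the keys, the values hold the animals
s = pd.Series(d).explode()

# Count how often each pair occurs (here 0 or 1)
indicator = pd.crosstab(s.index, s.values)

# Restore the original row/column order and add labels that never occur
df_test = indicator.reindex(index=df_test.index,
                            columns=df_test.columns,
                            fill_value=0)
print(df_test)
```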

Extracting values as a dictionary from dataframe based on list

I have a dataframe with unique values in each column:
df1 = pd.DataFrame([["Phys","Shane","NY"],["Chem","Mark","LA"],
                    ["Maths","Jack","Mum"],["Bio","Sam","CT"]],
                   columns = ["cls1","cls2","cls3"])
print(df1)
cls1 cls2 cls3
0 Phys Shane NY
1 Chem Mark LA
2 Maths Jack Mum
3 Bio Sam CT
And a list l1:
l1=["Maths","Bio","Shane","Mark"]
print(l1)
['Maths', 'Bio', 'Shane', 'Mark']
Now I want to retrieve the columns of the dataframe that contain elements from the list, together with the matching elements.
Expected Output:
{'cls1' : ['Maths','Bio'], 'cls2': ['Shane','Mark']}
The code I have:
cls = []
for cols in df1.columns:
    mask = df1[cols].isin(l1)
    if mask.any():
        cls.append(cols)
print(cls)
The output of above code:
['cls1', 'cls2']
I'm struggling to get common elements from dataframe and list to convert it into dictionary.
Any suggestions are welcome.
Thanks.
Use DataFrame.isin to build a mask, replace non-matching values with NaN by indexing, and reshape with stack:
df = df1[df1.isin(l1)].stack()
print (df)
0 cls2 Shane
1 cls2 Mark
2 cls1 Maths
3 cls1 Bio
dtype: object
Finally, create the dictionary of lists with a dict comprehension:
d = {k:v.tolist() for k,v in df.groupby(level=1)}
print(d)
{'cls2': ['Shane', 'Mark'], 'cls1': ['Maths', 'Bio']}
Another solution:
d = {}
for cols in df1.columns:
    mask = df1[cols].isin(l1)
    if mask.any():
        d[cols] = df1.loc[mask, cols].tolist()
print(d)
{'cls2': ['Shane', 'Mark'], 'cls1': ['Maths', 'Bio']}
