I have 2 DataFrames:
import pandas as pd

df = pd.DataFrame({'Ages': [20, 22, 57], 'Label': [1, 1, 2]})
label_df = pd.DataFrame({'Label': [1, 2, 3], 'Description': ['Young', 'Old', 'Very Old']})
I want to replace the label values in df with the descriptions from label_df.
Wanted result:
df = pd.DataFrame({'Ages':[20, 22, 57], 'Label':['Young','Young','Old']})
Use Series.map with a Series created from label_df by setting Label as the index:
df['Label'] = df['Label'].map(label_df.set_index('Label')['Description'])
print(df)
Ages Label
0 20 Young
1 22 Young
2 57 Old
Simply use merge:
df['Label'] = df.merge(label_df,on='Label')['Description']
Ages Label
0 20 Young
1 22 Young
2 57 Old
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging
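One caveat (my addition, not part of the answer above): assigning df.merge(...)['Description'] back relies on the merged result keeping df's row order and length, and an inner merge silently drops rows whose label is missing from label_df. A left merge is the safer sketch:

```python
import pandas as pd

df = pd.DataFrame({'Ages': [20, 22, 57], 'Label': [1, 1, 2]})
label_df = pd.DataFrame({'Label': [1, 2, 3],
                         'Description': ['Young', 'Old', 'Very Old']})

# A left merge keeps every row of df (unmatched labels become NaN)
# instead of dropping rows when a label is missing from label_df.
merged = df.merge(label_df, on='Label', how='left')
df['Label'] = merged['Description']
print(df)
```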
How can I change the shape of my multiindexed dataframe from:
to something like this, but with all cell values, not only those of the first index:
I have tried to do it, but somehow I receive only the dataframe as above with this code:
numbers = [100, 50, 20, 10, 5, 2, 1]
dfj = {}  # collect one Series per number
for number in numbers:
    dfj[number] = df['First_column_value_name'].xs(key=number, level='Second_multiindex_column_name')
list_of_columns_position = []
for number in numbers:
    R_string = '{}_R'.format(number)
    list_of_columns_position.append(R_string)
df_positions_as_columns = pd.concat(dfj.values(), ignore_index=True, axis=1)
df_positions_as_columns.columns = list_of_columns_position
Split your first column into 2 parts, then join the result with the second column, and finally pivot your dataframe:
Setup:
data = {'A': ['TLM_1/100', 'TLM_1/50', 'TLM_1/20',
'TLM_2/100', 'TLM_2/50', 'TLM_2/20'],
'B': [11, 12, 13, 21, 22, 23]}
df = pd.DataFrame(data)
print(df)
# Output:
A B
0 TLM_1/100 11
1 TLM_1/50 12
2 TLM_1/20 13
3 TLM_2/100 21
4 TLM_2/50 22
5 TLM_2/20 23
>>> df[['B']].join(df['A'].str.split('/', expand=True)) \
.pivot(index=0, columns=1, values='B') \
.rename_axis(index=None, columns=None) \
.add_suffix('_R')
100_R 20_R 50_R
TLM_1 11 13 12
TLM_2 21 23 22
Use a regular expression to split the label column into two columns a and b, then group by column a and unstack the grouping.
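That regex-based idea can be sketched like this (the column names a and b follow the description above; the str.extract pattern is my own choice):

```python
import pandas as pd

df = pd.DataFrame({'A': ['TLM_1/100', 'TLM_1/50', 'TLM_1/20',
                         'TLM_2/100', 'TLM_2/50', 'TLM_2/20'],
                   'B': [11, 12, 13, 21, 22, 23]})

# Split the label into the part before and after the slash.
parts = df['A'].str.extract(r'(?P<a>.+)/(?P<b>\d+)')

# Group B by column a, then unstack column b into the columns axis.
out = (df['B'].groupby([parts['a'], parts['b']]).first()
       .unstack('b')
       .add_suffix('_R'))
print(out)
```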
I have a df similar to the one below:
name age sex
1 john 12 m
2 mary 13 f
3 joseph 12 m
4 maria 14 f
How can I make a new column based on the index? For example, for indices 1 and 2 I want them to have the label 1, and for indices 3 and 4 I want them to be labeled 2, like so:
name age sex label
1 john 12 m cluster1
2 mary 13 f cluster1
3 joseph 12 m cluster2
4 maria 14 f cluster2
Should I use something like (df.index.isin([1, 2])) == 'cluster1'? I think it's not possible to do df['target'] = (df.index.isin([1, 2])) == 'cluster1', assuming that label doesn't exist in the beginning.
I think this is what you are looking for. You can use lists for the different clusters, which keeps the labels arbitrary.
import pandas as pd
data = {'name':['bob','sue','mary','steve'], 'age':[11, 23, 53, 44]}
df = pd.DataFrame(data)
print(df)
df['label'] = 0
cluster1 = [0, 3]
cluster2 = [1, 2]
df.loc[cluster1, 'label'] = 1
df.loc[cluster2, 'label'] = 2
#another way
#df.iloc[cluster1, df.columns.get_loc('label')] = 1
#df.iloc[cluster2, df.columns.get_loc('label')] = 2
print(df)
output:
name age
0 bob 11
1 sue 23
2 mary 53
3 steve 44
name age label
0 bob 11 1
1 sue 23 2
2 mary 53 2
3 steve 44 1
You can let the initial column value be anything. So you can either have it be one of the cluster values (then you only have to set the other cluster manually instead of both), or you can have it be None so you can easily check, after assigning labels, that you didn't miss any rows.
If the assignment to clusters is truly arbitrary, I don't think you'll be able to automate it much more than this.
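The None-initialization check described above can be sketched as:

```python
import pandas as pd

df = pd.DataFrame({'name': ['bob', 'sue', 'mary', 'steve'],
                   'age': [11, 23, 53, 44]})

# Start with no label so unassigned rows are easy to spot afterwards.
df['label'] = None
df.loc[[0, 3], 'label'] = 1   # cluster 1
df.loc[[1, 2], 'label'] = 2   # cluster 2

# Verify that every row received a label.
assert df['label'].notna().all()
print(df)
```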
Is this the solution you are looking for? I doubled the data so you can try different sequences. Here, if you write create_label(df, 3) instead of 2, it will iterate 3 rows at a time, giving you a parametric solution.
import pandas as pd
df = pd.DataFrame({'name': ['john', 'mary', 'joseph', 'maria', 'john', 'mary', 'joseph', 'maria'],
'age': [12, 13, 12, 14, 12, 13, 12, 14],
'sex': ['m', 'f','m', 'f', 'm', 'f','m', 'f']})
df.index = df.index + 1
df['label'] = None
def create_label(data, each_row):
    i = 0
    j = 1
    while i < len(data):
        # assign the next chunk of rows in place, avoiding chained indexing
        data.iloc[i:i + each_row, data.columns.get_loc('label')] = 'label' + str(j)
        i += each_row
        j += 1
    return data
df_new = create_label(df, 2)
For a small data frame or dataset you can use the code below (note .values, so the values are assigned by position rather than aligned by index, since the example's index starts at 1):
Label = pd.Series(['cluster1', 'cluster1', 'cluster2', 'cluster2'])
df['label'] = Label.values
You can use a for loop and a list to build the new column with the desired data:
import pandas as pd
df = pd.read_csv("dataset.csv")
list1 = []
for i in range(len(df.name)):
    if i < 2:
        list1.append('cluster1')
    else:
        list1.append('cluster2')
label = pd.Series(list1)
df['label'] = label
You can simply use iloc and assign the values for the columns:
import pandas as pd
df = pd.read_csv('test.txt',sep='\+', engine = "python")
df["label"] = ""  # adds empty "label" column
df.iloc[0:2, df.columns.get_loc("label")] = "cluster1"
df.iloc[2:4, df.columns.get_loc("label")] = "cluster2"
Since the values do not follow a certain order, as per your comments, you'd have to assign each "cluster" value manually.
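If the manual assignment grows, one way to keep it all in one place is a dict from index to label (a sketch of my own, not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame({'name': ['john', 'mary', 'joseph', 'maria'],
                   'age': [12, 13, 12, 14],
                   'sex': ['m', 'f', 'm', 'f']},
                  index=[1, 2, 3, 4])

# One dict holds the entire manual index-to-cluster assignment.
labels = {1: 'cluster1', 2: 'cluster1', 3: 'cluster2', 4: 'cluster2'}
df['label'] = df.index.map(labels)
print(df)
```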
Apologies for the crappy title...
Say I have two pandas dataframes concerning field sampling locations. DF1 contains sample ID, coordinates, year of recording etc. DF2 contains a meteorological variable, with values provided per year as columns:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data={'ID': [10, 20, 30], 'YEAR': [1980, 1981, 1991]}, index=[1, 2, 3])
df2 = pd.DataFrame(data=np.random.randint(0, 100, size=(3, 11)),
                   columns=['year_{0}'.format(x) for x in range(1980, 1991)],
                   index=[10, 20, 30])
print(df1)
> ID YEAR
1 10 1980
2 20 1981
3 30 1991
print(df2)
> year_1980 year_1981 ... year_1990
10 48 61 ... 53
20 68 69 ... 21
30 76 37 ... 70
Note how the plot IDs from DF1 correspond to DF2.index, and also how DF1's sampling years extend beyond the coverage of DF2. I'd like to add to DF1, as a new column, the value from DF2 corresponding to the YEAR column in DF1. What I have so far is:
def grab(df, plot_id, yr):
    try:
        out = df.loc[plot_id, 'year_{}'.format(yr)]
    except KeyError:
        out = -99
    return out

df1['meteo_val'] = df1.apply(lambda row: grab(df2, row['ID'], row['YEAR']), axis=1)
print(df1)
> ID YEAR meteo_val
1 10 1980 48
2 20 1981 69
3 30 1991 -99
This works, but it seems to take an awfully long time to compute. I'm looking for a smarter, quicker approach. Any suggestions?
Setup
np.random.seed(0)
df1 = pd.DataFrame(data = {'ID': [10, 20, 30], 'YEAR': [1980, 1981, 1991]}, index=[1,2,3])
df2 = pd.DataFrame(data= np.random.randint(0,100,size=(3, 11)),
columns=['year_{0}'.format(x) for x in range(1980, 1991)],
index=[10, 20, 30])
Solution with DataFrame.lookup:
mapper = df1.assign(YEAR='year_' + df1['YEAR'].astype(str))
c1 = mapper['ID'].isin(df2.index)
c2 = mapper['YEAR'].isin(df2.columns)
mapper = mapper.loc[c1 & c2]
df1.loc[c1 & c2, 'meteo_val'] = df2.lookup(mapper['ID'], mapper['YEAR'])
df1['meteo_val'] = df1['meteo_val'].fillna(-99)
ID YEAR meteo_val
1 10 1980 44.0
2 20 1981 88.0
3 30 1991 -99.0
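Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; an equivalent using positional indexing (a sketch under the same setup) is:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame({'ID': [10, 20, 30], 'YEAR': [1980, 1981, 1991]},
                   index=[1, 2, 3])
df2 = pd.DataFrame(np.random.randint(0, 100, size=(3, 11)),
                   columns=['year_{0}'.format(x) for x in range(1980, 1991)],
                   index=[10, 20, 30])

mapper = df1.assign(YEAR='year_' + df1['YEAR'].astype(str))
mask = mapper['ID'].isin(df2.index) & mapper['YEAR'].isin(df2.columns)
mapper = mapper[mask]

# Positional row/column indices replace the removed DataFrame.lookup.
rows = df2.index.get_indexer(mapper['ID'])
cols = df2.columns.get_indexer(mapper['YEAR'])
df1.loc[mask, 'meteo_val'] = df2.to_numpy()[rows, cols]
df1['meteo_val'] = df1['meteo_val'].fillna(-99)
print(df1)
```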
Alternative with DataFrame.join and DataFrame.stack
df1 = df1.join(df2.set_axis(df2.columns.str.split('_').str[1].astype(int),
axis=1).stack().rename('meteo_val'),
on = ['ID', 'YEAR'], how='left').fillna(-99)
I have df1
df1 = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}
df1 = pd.DataFrame(df1)
and I have another df2
df2 = {'Name': ['krish', 'jack', 'Tom', 'nick']}
df2 = pd.DataFrame(df2)
df2['Name'] contains exactly the same values as df1's, but in a different order.
I want to fill df2['Age'] based on df1.
If I use df2['Age'] = df1['Age'], the values are filled but wrong.
How can I map those values onto df2 from df1 correctly?
Thank you
Use:
df2 = df2.merge(df1,on='Name')
df2
Name Age
0 krish 19
1 jack 18
2 Tom 20
3 nick 21
Set Name as index and reindex based on df2:
df1.set_index('Name').reindex(df2.Name).reset_index()
Name Age
0 krish 19
1 jack 18
2 Tom 20
3 nick 21
Or, for better performance, we can use pd.Categorical here:
df1['Name'] = pd.Categorical(df1.Name, categories=df2.Name)
df1.sort_values('Name', inplace=True)
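To complete the picture (the final positional copy back into df2 is my addition, not part of the answer above):

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'nick', 'krish', 'jack'],
                    'Age': [20, 21, 19, 18]})
df2 = pd.DataFrame({'Name': ['krish', 'jack', 'Tom', 'nick']})

# Sorting by a Categorical whose categories follow df2['Name']
# puts df1's rows into df2's order.
df1['Name'] = pd.Categorical(df1.Name, categories=df2.Name)
df1 = df1.sort_values('Name')

# With the rows aligned positionally, copy the ages across.
df2['Age'] = df1['Age'].to_numpy()
print(df2)
```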
This is my first question on this forum, sorry if my English is not so good!
I want to add a row to a DataFrame only if a specific column doesn't already contain a specific value. Let say I write this :
df = pd.DataFrame([['Mark', 9], ['Laura', 22]], columns=['Name', 'Age'])
new_friend = pd.DataFrame([['Alex', 23]], columns=['Name', 'Age'])
df = df.append(new_friend, ignore_index=True)
print(df)
Name Age
0 Mark 9
1 Laura 22
2 Alex 23
Now I want to add another friend, but I want to make sure I don't already have a friend with the same name. Here is what I'm actually doing:
new_friend = pd.DataFrame([['Mark', 16]], columns=['Name', 'Age'])
df = df.append(new_friend, ignore_index=True)
print(df)
Name Age
0 Mark 9
1 Laura 22
2 Alex 23
3 Mark 16
then :
df = df.drop_duplicates(subset='Name', keep='first')
df = df.reset_index(drop=True)
print(df)
Name Age
0 Mark 9
1 Laura 22
2 Alex 23
Is there another way of doing this something like :
if name in column 'Name':
    don't add friend
else:
    add friend
Thank you!
if 'Mark' in list(df['Name']):
    print('Mark already in DF')
else:
    print('Mark not in DF')
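Putting that membership check together with the add (pd.concat is used here because DataFrame.append was deprecated and later removed; the add_friend helper is my own naming):

```python
import pandas as pd

df = pd.DataFrame([['Mark', 9], ['Laura', 22]], columns=['Name', 'Age'])

def add_friend(df, name, age):
    # Only add the row if the name is not already present.
    if name in df['Name'].values:
        return df
    new_friend = pd.DataFrame([[name, age]], columns=['Name', 'Age'])
    return pd.concat([df, new_friend], ignore_index=True)

df = add_friend(df, 'Alex', 23)   # added
df = add_friend(df, 'Mark', 16)   # skipped: 'Mark' already exists
print(df)
```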