I have 2 DataFrames:
import pandas as pd

df = pd.DataFrame({'Ages': [20, 22, 57], 'Label': [1, 1, 2]})
label_df = pd.DataFrame({'Label': [1, 2, 3], 'Description': ['Young', 'Old', 'Very Old']})
I want to replace the label values in df with the descriptions from label_df.
Wanted result:
df = pd.DataFrame({'Ages':[20, 22, 57], 'Label':['Young','Young','Old']})
Use Series.map with a Series created from label_df by setting Label as the index:
df['Label'] = df['Label'].map(label_df.set_index('Label')['Description'])
print(df)
Ages Label
0 20 Young
1 22 Young
2 57 Old
Simply use merge:
df['Label'] = df.merge(label_df,on='Label')['Description']
Ages Label
0 20 Young
1 22 Young
2 57 Old
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging
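One caveat (my addition, not part of the answer above): assigning df.merge(...)['Description'] back relies on the merged result keeping df's row order and length, and an inner merge silently drops rows whose label is missing from label_df. A left merge is the safer sketch:

```python
import pandas as pd

df = pd.DataFrame({'Ages': [20, 22, 57], 'Label': [1, 1, 2]})
label_df = pd.DataFrame({'Label': [1, 2, 3],
                         'Description': ['Young', 'Old', 'Very Old']})

# A left merge keeps every row of df (unmatched labels become NaN)
# instead of dropping rows when a label is missing from label_df.
merged = df.merge(label_df, on='Label', how='left')
df['Label'] = merged['Description']
print(df)
```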
How can I change the shape of my multiindexed dataframe from:
to something like this, but with all cell values, not only those of the first index:
I have tried to do it, but somehow I receive only the dataframe as above with this code:
numbers = [100, 50, 20, 10, 5, 2, 1]
dfj = {}  # collect one Series per number
for number in numbers:
    dfj[number] = df['First_column_value_name'].xs(key=number, level='Second_multiindex_column_name')
list_of_columns_position = []
for number in numbers:
    R_string = '{}_R'.format(number)
    list_of_columns_position.append(R_string)
df_positions_as_columns = pd.concat(dfj.values(), ignore_index=True, axis=1)
df_positions_as_columns.columns = list_of_columns_position
Split your first column into 2 parts, then join the result with the second column, and finally pivot your dataframe:
Setup:
data = {'A': ['TLM_1/100', 'TLM_1/50', 'TLM_1/20',
'TLM_2/100', 'TLM_2/50', 'TLM_2/20'],
'B': [11, 12, 13, 21, 22, 23]}
df = pd.DataFrame(data)
print(df)
# Output:
A B
0 TLM_1/100 11
1 TLM_1/50 12
2 TLM_1/20 13
3 TLM_2/100 21
4 TLM_2/50 22
5 TLM_2/20 23
>>> df[['B']].join(df['A'].str.split('/', expand=True)) \
.pivot(index=0, columns=1, values='B') \
.rename_axis(index=None, columns=None) \
.add_suffix('_R')
100_R 20_R 50_R
TLM_1 11 13 12
TLM_2 21 23 22
Use a regular expression to split the label column into two columns a and b, then group by column a and unstack the grouping.
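That regex-based idea can be sketched like this (the column names a and b follow the description above; the str.extract pattern is my own choice):

```python
import pandas as pd

df = pd.DataFrame({'A': ['TLM_1/100', 'TLM_1/50', 'TLM_1/20',
                         'TLM_2/100', 'TLM_2/50', 'TLM_2/20'],
                   'B': [11, 12, 13, 21, 22, 23]})

# Split the label into the part before and after the slash.
parts = df['A'].str.extract(r'(?P<a>.+)/(?P<b>\d+)')

# Group B by column a, then unstack column b into the columns axis.
out = (df['B'].groupby([parts['a'], parts['b']]).first()
       .unstack('b')
       .add_suffix('_R'))
print(out)
```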
I have a df similar to the one below:
name age sex
1 john 12 m
2 mary 13 f
3 joseph 12 m
4 maria 14 f
How can I make a new column based on the index? For example, for indices 1 and 2 I want them to have the label 1, and for indices 3 and 4 I want them to be labeled 2, like so:
name age sex label
1 john 12 m cluster1
2 mary 13 f cluster1
3 joseph 12 m cluster2
4 maria 14 f cluster2
Should I use something like (df.index.isin([1, 2])) == 'cluster1'? I think it's not possible to do df['target'] = (df.index.isin([1, 2])) == 'cluster1', assuming that label doesn't exist in the beginning.
I think this is what you are looking for. You can use lists for the different clusters, which keeps the labels arbitrary.
import pandas as pd
data = {'name':['bob','sue','mary','steve'], 'age':[11, 23, 53, 44]}
df = pd.DataFrame(data)
print(df)
df['label'] = 0
cluster1 = [0, 3]
cluster2 = [1, 2]
df.loc[cluster1, 'label'] = 1
df.loc[cluster2, 'label'] = 2
#another way
#df.iloc[cluster1, df.columns.get_loc('label')] = 1
#df.iloc[cluster2, df.columns.get_loc('label')] = 2
print(df)
output:
name age
0 bob 11
1 sue 23
2 mary 53
3 steve 44
name age label
0 bob 11 1
1 sue 23 2
2 mary 53 2
3 steve 44 1
You can let the initial column value be anything. So you can either have it be one of the cluster values (then you only have to set the other cluster manually instead of both), or you can have it be None so you can easily check, after assigning labels, that you didn't miss any rows.
If the assignment to clusters is truly arbitrary, I don't think you'll be able to automate it much more than this.
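The None-initialization check described above can be sketched as:

```python
import pandas as pd

df = pd.DataFrame({'name': ['bob', 'sue', 'mary', 'steve'],
                   'age': [11, 23, 53, 44]})

# Start with no label so unassigned rows are easy to spot afterwards.
df['label'] = None
df.loc[[0, 3], 'label'] = 1   # cluster 1
df.loc[[1, 2], 'label'] = 2   # cluster 2

# Verify that every row received a label.
assert df['label'].notna().all()
print(df)
```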
Is this the solution you are looking for? I doubled the data so you can try different sequences. Here, if you write create_label(df, 3) instead of 2, it will iterate 3 rows at a time, giving you a parametric solution.
import pandas as pd
df = pd.DataFrame({'name': ['john', 'mary', 'joseph', 'maria', 'john', 'mary', 'joseph', 'maria'],
'age': [12, 13, 12, 14, 12, 13, 12, 14],
'sex': ['m', 'f','m', 'f', 'm', 'f','m', 'f']})
df.index = df.index + 1
df['label'] = None
def create_label(data, each_row):
    i = 0
    j = 1
    while i < len(data):
        # assign the next chunk of rows in place, avoiding chained indexing
        data.iloc[i:i + each_row, data.columns.get_loc('label')] = 'label' + str(j)
        i += each_row
        j += 1
    return data
df_new = create_label(df, 2)
For a small data frame or dataset you can use the code below (note .values, so the values are assigned by position rather than aligned by index, since the example's index starts at 1):
Label = pd.Series(['cluster1', 'cluster1', 'cluster2', 'cluster2'])
df['label'] = Label.values
You can use a for loop and a list to build the new column with the desired data:
import pandas as pd
df = pd.read_csv("dataset.csv")
list1 = []
for i in range(len(df.name)):
    if i < 2:
        list1.append('cluster1')
    else:
        list1.append('cluster2')
label = pd.Series(list1)
df['label'] = label
You can simply use iloc and assign the values for the columns:
import pandas as pd
df = pd.read_csv('test.txt',sep='\+', engine = "python")
df["label"] = ""  # adds empty "label" column
df.iloc[0:2, df.columns.get_loc("label")] = "cluster1"
df.iloc[2:4, df.columns.get_loc("label")] = "cluster2"
Since the values do not follow a certain order, as per your comments, you'd have to assign each "cluster" value manually.
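If the manual assignment grows, one way to keep it all in one place is a dict from index to label (a sketch of my own, not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame({'name': ['john', 'mary', 'joseph', 'maria'],
                   'age': [12, 13, 12, 14],
                   'sex': ['m', 'f', 'm', 'f']},
                  index=[1, 2, 3, 4])

# One dict holds the entire manual index-to-cluster assignment.
labels = {1: 'cluster1', 2: 'cluster1', 3: 'cluster2', 4: 'cluster2'}
df['label'] = df.index.map(labels)
print(df)
```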
Apologies for the crappy title...
Say I have two pandas dataframes concerning field sampling locations. DF1 contains sample ID, coordinates, year of recording etc. DF2 contains a meteorological variable, with values provided per year as columns:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data={'ID': [10, 20, 30], 'YEAR': [1980, 1981, 1991]}, index=[1, 2, 3])
df2 = pd.DataFrame(data=np.random.randint(0, 100, size=(3, 11)),
                   columns=['year_{0}'.format(x) for x in range(1980, 1991)],
                   index=[10, 20, 30])
print(df1)
> ID YEAR
1 10 1980
2 20 1981
3 30 1991
print(df2)
> year_1980 year_1981 ... year_1990
10 48 61 ... 53
20 68 69 ... 21
30 76 37 ... 70
Note how the plot IDs from DF1 correspond to DF2.index, and also how DF1's sampling years extend beyond the coverage of DF2. I'd like to add to DF1, as a new column, the value from DF2 corresponding to the YEAR column in DF1. What I have so far is:
def grab(df, plot_id, yr):
    try:
        out = df.loc[plot_id, 'year_{}'.format(yr)]
    except KeyError:
        out = -99
    return out

df1['meteo_val'] = df1.apply(lambda row: grab(df2, row['ID'], row['YEAR']), axis=1)
print(df1)
> ID YEAR meteo_val
1 10 1980 48
2 20 1981 69
3 30 1991 -99
This works, but it seems to take an awfully long time to compute. I'm looking for a smarter, quicker approach. Any suggestions?
Setup
np.random.seed(0)
df1 = pd.DataFrame(data = {'ID': [10, 20, 30], 'YEAR': [1980, 1981, 1991]}, index=[1,2,3])
df2 = pd.DataFrame(data= np.random.randint(0,100,size=(3, 11)),
columns=['year_{0}'.format(x) for x in range(1980, 1991)],
index=[10, 20, 30])
Solution with DataFrame.lookup:
mapper = df1.assign(YEAR='year_' + df1['YEAR'].astype(str))
c1 = mapper['ID'].isin(df2.index)
c2 = mapper['YEAR'].isin(df2.columns)
mapper = mapper.loc[c1 & c2]
df1.loc[c1 & c2, 'meteo_val'] = df2.lookup(mapper['ID'], mapper['YEAR'])
df1['meteo_val'] = df1['meteo_val'].fillna(-99)
ID YEAR meteo_val
1 10 1980 44.0
2 20 1981 88.0
3 30 1991 -99.0
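Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; an equivalent using positional indexing (a sketch under the same setup) is:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df1 = pd.DataFrame({'ID': [10, 20, 30], 'YEAR': [1980, 1981, 1991]},
                   index=[1, 2, 3])
df2 = pd.DataFrame(np.random.randint(0, 100, size=(3, 11)),
                   columns=['year_{0}'.format(x) for x in range(1980, 1991)],
                   index=[10, 20, 30])

mapper = df1.assign(YEAR='year_' + df1['YEAR'].astype(str))
mask = mapper['ID'].isin(df2.index) & mapper['YEAR'].isin(df2.columns)
mapper = mapper[mask]

# Positional row/column indices replace the removed DataFrame.lookup.
rows = df2.index.get_indexer(mapper['ID'])
cols = df2.columns.get_indexer(mapper['YEAR'])
df1.loc[mask, 'meteo_val'] = df2.to_numpy()[rows, cols]
df1['meteo_val'] = df1['meteo_val'].fillna(-99)
print(df1)
```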
Alternative with DataFrame.join and DataFrame.stack
df1 = df1.join(df2.set_axis(df2.columns.str.split('_').str[1].astype(int),
axis=1).stack().rename('meteo_val'),
on = ['ID', 'YEAR'], how='left').fillna(-99)
I have df1
df1 = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}
df1 = pd.DataFrame(df1)
and I have another df2
df2 = {'Name': ['krish', 'jack', 'Tom', 'nick']}
df2 = pd.DataFrame(df2)
df2['Name'] contains exactly the same values as df1's, but in a different order.
I want to fill df2['Age'] based on df1.
If I use df2['Age'] = df1['Age'], the values are filled but wrong.
How can I map those values onto df2 from df1 correctly?
Thank you
Use:
df2 = df2.merge(df1,on='Name')
df2
Name Age
0 krish 19
1 jack 18
2 Tom 20
3 nick 21
Set Name as index and reindex based on df2:
df1.set_index('Name').reindex(df2.Name).reset_index()
Name Age
0 krish 19
1 jack 18
2 Tom 20
3 nick 21
Or, for better performance, we can use pd.Categorical here:
df1['Name'] = pd.Categorical(df1.Name, categories=df2.Name)
df1.sort_values('Name', inplace=True)
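To complete the picture (the final positional copy back into df2 is my addition, not part of the answer above):

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'nick', 'krish', 'jack'],
                    'Age': [20, 21, 19, 18]})
df2 = pd.DataFrame({'Name': ['krish', 'jack', 'Tom', 'nick']})

# Sorting by a Categorical whose categories follow df2['Name']
# puts df1's rows into df2's order.
df1['Name'] = pd.Categorical(df1.Name, categories=df2.Name)
df1 = df1.sort_values('Name')

# With the rows aligned positionally, copy the ages across.
df2['Age'] = df1['Age'].to_numpy()
print(df2)
```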
This is my first question on this forum, sorry if my English is not so good!
I want to add a row to a DataFrame only if a specific column doesn't already contain a specific value. Let say I write this :
df = pd.DataFrame([['Mark', 9], ['Laura', 22]], columns=['Name', 'Age'])
new_friend = pd.DataFrame([['Alex', 23]], columns=['Name', 'Age'])
df = df.append(new_friend, ignore_index=True)
print(df)
Name Age
0 Mark 9
1 Laura 22
2 Alex 23
Now I want to add another friend, but I want to make sure I don't already have a friend with the same name. Here is what I'm actually doing:
new_friend = pd.DataFrame([['Mark', 16]], columns=['Name', 'Age'])
df = df.append(new_friend, ignore_index=True)
print(df)
Name Age
0 Mark 9
1 Laura 22
2 Alex 23
3 Mark 16
then :
df = df.drop_duplicates(subset='Name', keep='first')
df = df.reset_index(drop=True)
print(df)
Name Age
0 Mark 9
1 Laura 22
2 Alex 23
Is there another way of doing this something like :
if name in column 'Name':
    don't add friend
else:
    add friend
Thank you!
if 'Mark' in list(df['Name']):
    print('Mark already in DF')
else:
    print('Mark not in DF')
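Putting that membership check together with the add (pd.concat is used here because DataFrame.append was deprecated and later removed; the add_friend helper is my own naming):

```python
import pandas as pd

df = pd.DataFrame([['Mark', 9], ['Laura', 22]], columns=['Name', 'Age'])

def add_friend(df, name, age):
    # Only add the row if the name is not already present.
    if name in df['Name'].values:
        return df
    new_friend = pd.DataFrame([[name, age]], columns=['Name', 'Age'])
    return pd.concat([df, new_friend], ignore_index=True)

df = add_friend(df, 'Alex', 23)   # added
df = add_friend(df, 'Mark', 16)   # skipped: 'Mark' already exists
print(df)
```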