How not to iterate over pandas Dataframe - python

I have a pretty simple problem that I could solve just by iterating over the rows of a DataFrame. But I've read that this is never good practice, so I'm wondering how to avoid that step.
Dummy DataFrame
In this example I'd like to automatically give a new name to fruits that are special, according to a conventional rule (as shown in the code below).
This default name should only be applied if the fruit is special and 'Logic name' is still unknown.
In Python I would write something like this:
for idx in range(len(df['Fruit'])):
    if df.loc[idx, 'Logic name'] == 'unknown' and df.loc[idx, 'Special'] == 'yes':
        df.loc[idx, 'Logic name'] = df.loc[idx, 'color'] + df.loc[idx, 'Fruit'][2:]
The final result is this
Final Dataframe
How would you avoid iteration in this case?

Use numpy.where with a combined condition on "Special" and "Logic name":
import numpy as np
df['Logic name'] = np.where(df['Special'].eq('yes') & df['Logic name'].eq('unknown'),
                            df['color'] + df['Fruit'].str[2:],
                            df['Logic name'])
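
A minimal sketch of how this behaves, assuming a stand-in for the dummy DataFrame in the screenshot (column names taken from the question, values invented):
import numpy as np
import pandas as pd

# Assumed stand-in for the dummy DataFrame shown in the question
df = pd.DataFrame({
    'Fruit': ['f_apple', 'f_pear', 'f_plum'],
    'color': ['red', 'green', 'purple'],
    'Special': ['yes', 'no', 'yes'],
    'Logic name': ['unknown', 'unknown', 'custom'],
})

mask = df['Special'].eq('yes') & df['Logic name'].eq('unknown')
df['Logic name'] = np.where(mask, df['color'] + df['Fruit'].str[2:], df['Logic name'])
print(df)
# Only the first row gets a new name ('redapple'); the others keep their existing values.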

Related

Changing column values for a value in an adjacent column in the same dataframe using Python

I am quite new to Python programming.
I am working with the following dataframe:
Before
Note that column "FBgn" contains a mix of FBgn and FBtr string values. I would like to replace the FBtr values with the FBgn values provided in the adjacent column called "## FlyBase_FBgn", while keeping the existing FBgn values in column "FBgn". Keep in mind that I am only showing a portion of the dataframe (in reality it has 1432 rows). How would I do that? I tried the replace() method from pandas, but it did not work.
This is actually what I would like to have:
After
Thanks a lot!
With Pandas, you could try:
df.loc[df["FBgn"].str.contains("FBtr"), "FBgn"] = df["## FlyBase_FBgn"]
Welcome to Stack Overflow. Next time, please provide more info, including your code; it is always helpful.
Please see the code below; I think you need something similar:
import pandas as pd

# ignore dict1, I just wanted to recreate your df
dict1 = {"FBgn": ['FBtr389394949', 'FBgn3093840', 'FBtr000025'],
         "FBtr": ['FBgn546466646', '', 'FBgn15565555']}
df = pd.DataFrame(dict1)  # recreating your dataframe

# print the original df
print(df)

# function to replace the values
def replace_values(df):
    for i in range(len(df)):
        if 'tr' in df.loc[i, 'FBgn']:
            df.loc[i, 'FBgn'] = df.loc[i, 'FBtr']
    return df

df = replace_values(df)

# print the new df
print(df)

Creating new dataframe by appending rows from an old dataframe

I'm trying to create a dataframe by selecting rows that meet only specific conditions from a different dataframe.
Technicians can only select one of several values for Column 1 using a dropdown menu, so I want to match one specific value there. Column 2, however, is a free-text entry, so for it I'm looking for two specific keywords with any type of spelling/case.
I want all columns from the rows in the new dataframe.
Any help or insight would be much appreciated.
import pandas as pd

df = pd.read_excel(r'File.xlsx', sheet_name='Sheet1')

filter = ['x', 'y']
columns = df.columns
data = pd.DataFrame(columns=columns)

for row in df.iterrows():
    if 'Column 1' == 'a':
        row.data.append()
    elif df['Column 2'].str.contains('filter', case='false'):
        row.data.append()

print(data.head())
In general, it's best to have a vectorized solution, so I'll put my solution as follows (there are many ways to do this; this is one that came to mind). Here, you can use a simple boolean mask to filter out the rows you don't want, since you've already clearly defined your criteria (Column 1 equals 'a', or Column 2 contains one of your keywords, case-insensitively).
By itself, df['Column 1'] == 'a' creates a boolean Series with the structure [True, False, True, True, ...], where each entry says whether the condition holds for the corresponding row. Once you have that, you can simply index back into the original frame with df[df['Column 1'] == 'a'] to return the filtered rows.
Of course, since you have two conditions here (which follow an "or" clause), you can feed both into the boolean mask, combining them with | and wrapping each condition in parentheses, e.g. df[(df['Column 1'] == 'a') | df['Column 2'].str.contains('x|y', case=False)].
I'm not at my development computer, so this might not work as expected due to a couple minor issues, but this is the general idea. This line should replace your entire df.iterrows block. Hope this helps :)
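
A minimal sketch of what that replacement could look like, assuming the column names from the question and a case-insensitive search for the two keywords (joined into a single regex; the sample frame and keyword values are assumptions):
import pandas as pd

# Assumed keywords the question refers to as 'x' and 'y'
keywords = ['x', 'y']
pattern = '|'.join(keywords)  # "x|y" -> matches either keyword

# df = pd.read_excel(r'File.xlsx', sheet_name='Sheet1')
df = pd.DataFrame({
    'Column 1': ['a', 'b', 'a'],
    'Column 2': ['contains X somewhere', 'nothing relevant', 'also nothing'],
})

# Keep rows where Column 1 is 'a' OR Column 2 contains one of the keywords,
# ignoring case; note the parentheses around each condition and the use of |
mask = (df['Column 1'] == 'a') | df['Column 2'].str.contains(pattern, case=False, na=False)
data = df[mask]
print(data.head())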

Pandas Dataframe subset not working as expected

This seemingly simple exercise is throwing me off track; I'm sure it's something simple that I'm missing.
Let's say I have a dataframe
datas = pd.DataFrame({'age': [10, 20, 30],
                      'name': ['John', 'Mark', 'Lisa']})
I now want to subset the dataframe by the name 'Mark' so I did:
if (datas['name'] == 'Mark').any():
    datas.loc[datas['name'] == 'Mark']
else:
    print('no')
Expected result is
age name
20 Mark
but I get the original dataframe back again, please assist.
I've looked at several posts but none seems to help.
Example post I looked at: Check if string is in a pandas dataframe
I think you need to assign back to the original DataFrame if you want to overwrite it with the subset:
datas = datas.loc[datas['name'] == 'Mark']
Or assign to new variable, e.g. df1:
df1 = datas.loc[datas['name'] == 'Mark']
Next, if the data are processed further and the output is assigned to a new variable like df1, it is necessary to use DataFrame.copy to prevent SettingWithCopyWarning:
df1 = datas.loc[datas['name'] == 'Mark'].copy()
If you modify values in df1 later, you will find that the modifications do not propagate back to the original data (datas), and that pandas warns about it.
Did you mean to print the subset? Right now your code doesn't change anything.
if (datas['name'] == 'Mark').any():
    print(datas.loc[datas['name'] == 'Mark'])
else:
    print('no')
You can even change your dataset in one line:
datas = datas[datas['name']=='Mark']

Vectorized Flag Assignment in Dataframe

I have a dataframe with observations possessing a number of codes. I want to compare the codes present in a row with a list. If any codes are in that list, I wish to flag the row. I can accomplish this using the itertuples method as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'cd1': ['abc1', 'abc2', 'abc3', 'abc4', 'abc5'],
                   'cd2': ['abc3', 'abc4', 'abc5', 'abc6', ''],
                   'cd3': ['abc10', '', '', '', '']})
code_flags = ['abc1','abc6']
# initialize flag column
df['flag'] = 0
# itertuples method
for row in df.itertuples():
    if any(df.iloc[row.Index, 1:4].isin(code_flags)):
        df.at[row.Index, 'flag'] = 1
The output correctly adds a flag column with the appropriate flags, where 1 indicates a flagged entry.
However, on my actual use case, this takes hours to complete. I have attempted to vectorize this approach using numpy.where.
df['flag'] = 0 # reset
df['flag'] = np.where(any(df.iloc[:,1:4].isin(code_flags)),1,0)
This appears to evaluate everything to the same value. I think I'm confused about how the vectorization treats the index: I can remove the colon and comma and write df.iloc[1:4] and obtain the same result.
Am I misunderstanding the where function? Is my indexing incorrect and causing a True evaluation for all cases? Is there a better way to do this?
Use np.where with DataFrame.any(axis=1) rather than the built-in any(..). Iterating over a DataFrame (which is what the built-in any does) yields its column labels, which are non-empty strings, so the condition is always truthy and every row ends up flagged. DataFrame.any(axis=1) instead checks each row of the boolean frame:
df['flag'] = np.where(df.iloc[:, 1:4].isin(code_flags).any(axis=1), 1, 0)
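A quick sanity check, reusing the df and code_flags defined in the question (a sketch; the astype(int) variant is just another way to get 0/1):
import numpy as np

# Row-wise check: is any of the code columns (cd1..cd3) in code_flags?
row_has_flag = df.iloc[:, 1:4].isin(code_flags).any(axis=1)

df['flag'] = np.where(row_has_flag, 1, 0)
# Equivalent without np.where:
# df['flag'] = row_has_flag.astype(int)

print(df['flag'].tolist())  # [1, 0, 0, 1, 0] for the example data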

pandas iterrows() two data frame

I'm doing something that I know that I shouldn't be doing. I'm doing a for loop within a for loop (it sounds even more horrible, as I write it down.) Basically, what I want to do, theoretically, using two dataframes is something like this:
for index, row in df_2.iterrows():
    for index_1, row_1 in df_1.iterrows():
        if row['column_1'] == row_1['column_1'] and row['column_2'] == row_1['column_2'] and row['column_3'] == row_1['column_2']:
            row['column_4'] = row_1['column_4']
There has got to be a (better) way to do something like this. Please help!
As pointed out by @Andy Hayden in is it possible to do fuzzy match merge with python pandas?, you can use difflib's get_close_matches function to create new join columns.
import difflib
df_2['fuzzy_column_1'] = df_2['column_1'].apply(lambda x: difflib.get_close_matches(x, df_1['column_1'])[0])
# Do same for all other columns
Now you can apply an inner join using the pandas merge function.
result_df = df_1.merge(df_2, left_on=['column_1', 'column_2', 'column_3'],
                       right_on=['fuzzy_column_1', 'fuzzy_column_2', 'fuzzy_column_3'])
You can use the drop function to remove the unwanted helper columns afterwards, as in the sketch below.
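Putting the pieces together, a rough sketch (the sample frames are invented for illustration; it assumes a close match always exists, otherwise get_close_matches returns an empty list and the [0] lookup fails):
import difflib
import pandas as pd

# Assumed stand-ins for df_1 and df_2 (values invented for illustration)
df_1 = pd.DataFrame({'column_1': ['apple', 'pear'], 'column_2': ['red', 'green'],
                     'column_3': ['small', 'large'], 'column_4': [1, 2]})
df_2 = pd.DataFrame({'column_1': ['aple', 'pear'], 'column_2': ['red', 'green'],
                     'column_3': ['small', 'large']})

# Build a fuzzy join key in df_2 for each of the three columns
for col in ['column_1', 'column_2', 'column_3']:
    df_2[f'fuzzy_{col}'] = df_2[col].apply(
        lambda x: difflib.get_close_matches(x, df_1[col])[0]  # assumes a close match exists
    )

# Inner join df_1's original columns against df_2's fuzzy key columns
result_df = df_1.merge(
    df_2,
    left_on=['column_1', 'column_2', 'column_3'],
    right_on=['fuzzy_column_1', 'fuzzy_column_2', 'fuzzy_column_3'],
)

# Drop the helper columns once the join is done
result_df = result_df.drop(columns=['fuzzy_column_1', 'fuzzy_column_2', 'fuzzy_column_3'])
print(result_df)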
