I'm doing something that I know I shouldn't be doing: a for loop within a for loop (it sounds even worse as I write it down). Basically, using two dataframes, what I want to do, theoretically, is something like this:
for index, row in df_2.iterrows():
    for index_1, row_1 in df_1.iterrows():
        if row['column_1'] == row_1['column_1'] and row['column_2'] == row_1['column_2'] and row['column_3'] == row_1['column_3']:
            row['column_4'] = row_1['column_4']
There has got to be a (better) way to do something like this. Please help!
As pointed out by @Andy Hayden in "is it possible to do fuzzy match merge with python pandas?", you can use difflib's get_close_matches function to create new join columns.
import difflib
df_2['fuzzy_column_1'] = df_2['column_1'].apply(lambda x: difflib.get_close_matches(x, df_1['column_1'])[0])
# Do same for all other columns
Now you can apply an inner join using the pandas merge function.
result_df = df_1.merge(df_2, left_on=['column_1', 'column_2', 'column_3'], right_on=['fuzzy_column_1', 'fuzzy_column_2', 'fuzzy_column_3'])
You can then use the drop function to remove the unwanted helper columns.
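For reference, here is a minimal, self-contained sketch of that approach on made-up data (the column names column_1 to column_4 and the fuzzy_column_* helpers come from the answer above; all values are invented for illustration):

import difflib
import pandas as pd

df_1 = pd.DataFrame({
    'column_1': ['apple', 'banana', 'cherry'],
    'column_2': ['red', 'yellow', 'red'],
    'column_3': ['small', 'long', 'small'],
    'column_4': [1, 2, 3],
})
df_2 = pd.DataFrame({
    'column_1': ['appl', 'bananna', 'chery'],
    'column_2': ['red', 'yellow', 'red'],
    'column_3': ['small', 'long', 'small'],
})

# Build fuzzy join keys on df_2 by taking the closest match found in df_1.
for col in ['column_1', 'column_2', 'column_3']:
    df_2['fuzzy_' + col] = df_2[col].apply(
        lambda x: difflib.get_close_matches(x, df_1[col])[0])

result_df = df_1.merge(
    df_2,
    left_on=['column_1', 'column_2', 'column_3'],
    right_on=['fuzzy_column_1', 'fuzzy_column_2', 'fuzzy_column_3'])

# Drop the helper columns once the merge is done.
result_df = result_df.drop(columns=['fuzzy_column_1', 'fuzzy_column_2', 'fuzzy_column_3'])
print(result_df)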
I have a pretty simple problem I could solve just by iterating over rows of a dataframe. But I read it's never a good practice, so I'm wondering how to avoid this step.
Dummy DataFrame
In this example I'd like to automatically give a new name to fruits that are special, according to a conventional rule (as shown in the code below).
This default name should only be applied if the fruit is special and 'Logic name' is still unknown.
In python I would write something like this:
for idx in range(len(df['Fruit'])):
    if df.loc[idx]['Logic name'] == 'unknown' and df.loc[idx]['Special'] == 'yes':
        df.loc[idx, 'Logic name'] = df.loc[idx]['color'] + df.loc[idx]['Fruit'][2:]
The final result is this
Final Dataframe
How would you avoid iteration in this case?
Use numpy.where with a combined condition on "Special" and "Logic name":
import numpy as np
df['Logic name'] = np.where(df['Special'].eq('yes') & df['Logic name'].eq('unknown'),
                            df['color'] + df['Fruit'].str[2:],
                            df['Logic name'])
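As a quick sanity check, here is a minimal sketch on a made-up frame (the column names Fruit, color, Special and Logic name come from the question; the values are invented):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['apple', 'banana', 'cherry'],
    'color': ['red', 'yellow', 'dark'],
    'Special': ['yes', 'no', 'yes'],
    'Logic name': ['unknown', 'unknown', 'already_named'],
})

# Only special fruits whose logic name is still unknown get the default name.
df['Logic name'] = np.where(df['Special'].eq('yes') & df['Logic name'].eq('unknown'),
                            df['color'] + df['Fruit'].str[2:],
                            df['Logic name'])
print(df)
# Row 0 becomes 'redple'; rows 1 and 2 are left untouched.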
I am quite new to Python programming.
I am working with the following dataframe:
Before
Note that column "FBgn" contains a mix of FBgn and FBtr string values. I would like to replace the FBtr-containing values with the FBgn values provided in the adjacent column called "## FlyBase_FBgn", while keeping the existing FBgn values in column "FBgn". Keep in mind that I am only showing a portion of the dataframe (in reality there are 1432 rows). How would I do that? I tried Pandas' replace() method, but it did not work.
This is actually what I would like to have:
After
Thanks a lot!
With Pandas, you could try:
df.loc[df["FBgn"].str.contains("FBtr"), "FBgn"] = df["## FlyBase_FBgn"]
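A minimal sketch on made-up values (the column names FBgn and ## FlyBase_FBgn come from the question) showing the boolean-mask assignment:

import pandas as pd

df = pd.DataFrame({
    'FBgn': ['FBtr389394949', 'FBgn3093840', 'FBtr000025'],
    '## FlyBase_FBgn': ['FBgn546466646', 'FBgn3093840', 'FBgn15565555'],
})

# Wherever "FBgn" holds an FBtr identifier, take the value from "## FlyBase_FBgn" instead.
df.loc[df["FBgn"].str.contains("FBtr"), "FBgn"] = df["## FlyBase_FBgn"]
print(df)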
Welcome to Stack Overflow. Next time, please provide more information, including your code; it is always helpful.
Please see the code below; I think you need something similar.
import pandas as pd

# ignore dict1, I just wanted to recreate your df
dict1 = {"FBgn": ['FBtr389394949', 'FBgn3093840', 'FBtr000025'], "FBtr": ['FBgn546466646', '', 'FBgn15565555']}
df = pd.DataFrame(dict1)  # recreating your dataframe

# print df
print(df)

# function to replace the values
def replace_values(df):
    for i in range(len(df)):
        if 'tr' in df['FBgn'][i]:
            df.loc[i, 'FBgn'] = df.loc[i, 'FBtr']
    return df

df = replace_values(df)

# print new df
print(df)
I would like to do the following in pandas which I would do in SQL:
SELECT * FROM table WHERE field = value
I was thinking I could use something similar to an apply or map with a similar interface. Something like:
def filter_func(row):
    if row['name'] == 'Bob':
        return True
    else:
        return False

df.filter(filter_func, axis=1)
Similar to how I can do:
df['new_col'] = df.apply(apply_func, axis=1)
Is there a way to do something similar so that it only returns the rows where name='Bob' ?
The strangest thing is the pandas filter function says:
Note that this routine does not filter a dataframe on its contents. The filter is applied to the labels of the index.
That seems to me like quite a useless way to make use of a filter?
Check with
df_filter = df[df['name'] == 'Bob']
For the SQL IN operation we have isin:
#SELECT * FROM table WHERE field IN ('A','B')
df_filter = df[df['name'].isin(['A', 'B'])]
filter is badly named; it filters on the labels (for example selecting columns), or is used when we do a groupby filter.
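A small runnable sketch on a made-up frame (the column name name and the values 'Bob', 'A', 'B' come from the question and answer; everything else is invented):

import pandas as pd

df = pd.DataFrame({'name': ['Bob', 'A', 'Carol', 'B'],
                   'age': [30, 25, 41, 37]})

# SELECT * FROM table WHERE name = 'Bob'
bob_rows = df[df['name'] == 'Bob']

# SELECT * FROM table WHERE name IN ('A', 'B')
in_rows = df[df['name'].isin(['A', 'B'])]

print(bob_rows)
print(in_rows)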
I have a dataframe that has 100 variables, var1--var100. I want to bring var40, var20, and var30 to the front, with the other variables remaining in their original order. I've searched online; methods like
1: df[[var40, var20, var30, var1....]]
2: columns= [var40, var20, var30, var1...]
all require specifying all the variables in the dataframe. With 100 variables in my dataframe, how can I do it efficiently?
I am a SAS user; in SAS, we can use a retain statement before the set statement to achieve this. Is there an equivalent way in Python too?
Thanks
Consider reindex with a conditional list comprehension:
first_cols = ['var40', 'var20', 'var30']
df = df.reindex(first_cols + [col for col in df.columns if col not in first_cols],
                axis='columns')
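A quick demo on a made-up frame with only five columns (var1 to var5 are invented and stand in for the 100 columns in the question):

import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5]], columns=['var1', 'var2', 'var3', 'var4', 'var5'])

first_cols = ['var4', 'var2', 'var3']
df = df.reindex(first_cols + [col for col in df.columns if col not in first_cols],
                axis='columns')

print(df.columns.tolist())
# ['var4', 'var2', 'var3', 'var1', 'var5']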
I have a dataframe with many rows. I am appending a column using data produced from a custom function, like this:
import numpy
df['new_column'] = numpy.vectorize(fx)(df['col_a'], df['col_b'])
# takes 180964.377 ms
It is working fine, what I am trying to do is speed it up. There is really only a small group of unique combinations of col_a and col_b. Many of the iterations are redundant. I was thinking maybe pandas would just figure that out on its own but I don't think that is the case. Consider this:
print(len(df.index))  # prints 127255
df_unique = df.copy().drop_duplicates(['col_a', 'col_b'])
print(len(df_unique.index))  # prints 9834
I also convinced myself of the possible speedup by running this:
df_unique['new_column'] = numpy.vectorize(fx)(df_unique['col_a'], df_unique['col_b'])
# takes 14611.357 ms
Since there is a lot of redundant data, what I am trying to do is update the large dataframe (df, 127255 rows) while running the fx function only the minimum number of times (9834 times). This is because of all the duplicate rows for col_a and col_b. Of course this means that there will be multiple rows in df with the same values for col_a and col_b, but that is OK; the other columns of df are different and make each row unique.
Before I create a normal iterative for loop to loop through the df_unique dataframe and do a conditional update on df, I wanted to ask if there was a more "pythonic" neat way of doing this kind of update. Thanks a lot.
** UPDATE **
I created the simple for loop mentioned above, like this:
df = ...
df_unique = df.copy().drop_duplicates(['col_a', 'col_b'])
df_unique['new_column'] = np.vectorize(fx)(df_unique['col_a'], df_unique['col_b'])
for index, row in df_unique.iterrows():
    df.loc[(df['col_a'] == row['col_a']) & (df['col_b'] == row['col_b']), 'new_column'] = row['new_column']
# takes 165971.890 ms
So with this for loop there may be a slight performance increase but not nearly what I would have expected.
FYI
This is the fx function. It queries a MySQL database.
from datetime import datetime, timedelta
import pandas

def fx(d):
    exp_date = datetime.strptime(d.col_a, '%m/%d/%Y')
    if exp_date.weekday() == 5:
        exp_date -= timedelta(days=1)
    # engine is a SQLAlchemy engine defined elsewhere
    p = pandas.read_sql("select stat from table where a = '%s' and b_date = '%s';" % (d.col_a, exp_date.strftime('%Y-%m-%d')), engine)
    if len(p.index) == 0:
        return None
    else:
        return p.iloc[0].stat
UPDATE:
If you can manage to read the three columns ['stat', 'a', 'b_date'] belonging to table table into a tab DF, then you could merge it like this:
tab = pd.read_sql('select stat,a,b_date from table', engine)
df.merge(tab, left_on=[...], right_on=[...], how='left')
OLD answer:
You can merge/join your precalculated df_unique DF with the original df DF:
df['new_column'] = df.merge(df_unique, on=['col_a','col_b'], how='left')['new_column']
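A minimal sketch of that merge-back idea on made-up data (col_a, col_b and new_column come from the question; the dummy fx here just concatenates the two key values and only stands in for the expensive database lookup):

import numpy as np
import pandas as pd

def fx(a, b):
    # stand-in for the expensive lookup
    return a + '-' + b

df = pd.DataFrame({'col_a': ['x', 'x', 'y', 'y', 'x'],
                   'col_b': ['1', '2', '1', '1', '1'],
                   'other': range(5)})

# Run fx only once per unique (col_a, col_b) pair ...
df_unique = df[['col_a', 'col_b']].drop_duplicates()
df_unique['new_column'] = np.vectorize(fx)(df_unique['col_a'], df_unique['col_b'])

# ... then spread the results back onto the full frame via a left merge.
df['new_column'] = df.merge(df_unique, on=['col_a', 'col_b'], how='left')['new_column']
print(df)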
MaxU's answer may already be what you want. But I'll show another approach which may be a bit faster (I didn't measure it).
I assume that:
df[['col_a', 'col_b']] is sorted so that all identical entries are in consecutive rows (it's important)
df has a unique index (if not, you may create some temporary unique index).
I'll use the fact that df_unique.index is a subset of df.index.
# (keep='first' is actually default)
df_unique = df[['col_a', 'col_b']].drop_duplicates(keep='first').copy()
# You may try .apply instead of np.vectorize (I think it may be faster):
df_unique['result'] = df_unique.apply(fx, axis=1)
# Main part:
df['result'] = df_unique['result'] # uses 2.
df['result'].fillna(method='ffill', inplace=True) # uses 1.
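And a minimal sketch of this index-alignment plus forward-fill idea, again on made-up, pre-sorted data with a dummy fx standing in for the real lookup (I use the newer .ffill() spelling here):

import pandas as pd

def fx(row):
    # stand-in for the expensive lookup; takes a row because of .apply(axis=1)
    return row['col_a'] + '-' + row['col_b']

# Already sorted on (col_a, col_b), with a unique default index.
df = pd.DataFrame({'col_a': ['x', 'x', 'x', 'y', 'y'],
                   'col_b': ['1', '1', '2', '1', '1'],
                   'other': range(5)})

df_unique = df[['col_a', 'col_b']].drop_duplicates(keep='first').copy()
df_unique['result'] = df_unique.apply(fx, axis=1)

# Assign by index (only the first row of each duplicate group gets a value), then forward-fill.
df['result'] = df_unique['result']
df['result'] = df['result'].ffill()
print(df)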