I have a scenario where I need to transform the values of a particular column based on the value present in another column in the same row, and the value in another dataframe.
Example-
print(parent_df)
school location modifed_date
0 school_1 New Delhi 2020-04-06
1 school_2 Kolkata 2020-04-06
2 school_3 Bengaluru 2020-04-06
3 school_4 Mumbai 2020-04-06
4 school_5 Chennai 2020-04-06
print(location_df)
school location
0 school_10 New Delhi
1 school_20 Kolkata
2 school_30 Bengaluru
3 school_40 Mumbai
4 school_50 Chennai
As per this use case, I need to transform the school names in parent_df based on the location column in the same df and the location column in location_df.
To achieve this transformation, I wrote the following method.
def transform_school_name(row, location_df):
    name_alias = location_df[location_df['location'] == row['location']]
    if len(name_alias) > 0:
        return location_df.school.iloc[0]
    else:
        return row['school']
And this is how I am calling this method
parent_df['school'] = parent_df.apply(UtilityMethods.transform_school_name, args=(self.location_df,), axis=1)
The issue is that for just 46K records, the entire transformation takes around 2 minutes, which is too slow. How can I improve the performance of this solution?
EDITED
Following is the actual scenario I am dealing with, wherein a small transformation needs to be done before we can replace the value in the original column. I am not sure if this can be done within the replace() method mentioned in one of the answers below.
print(parent_df)
school location modifed_date type
0 school_1 _pre_New Delhi_post 2020-04-06 Govt
1 school_2 _pre_Kolkata_post 2020-04-06 Private
2 school_3 _pre_Bengaluru_post 2020-04-06 Private
3 school_4 _pre_Mumbai_post 2020-04-06 Govt
4 school_5 _pre_Chennai_post 2020-04-06 Private
print(location_df)
school location type
0 school_10 New Delhi Govt
1 school_20 Kolkata Private
2 school_30 Bengaluru Private
Custom Method code
def transform_school_name(row, location_df):
    location_values = row['location'].split('_')
    name_alias = location_df[location_df['location'] == location_values[1]]
    name_alias = name_alias[name_alias['type'] == location_df['type']]
    if len(name_alias) > 0:
        return location_df.school.iloc[0]
    else:
        return row['school']
This is the actual scenario I need to handle, so using the replace() method won't help.
You can use map/replace:
parent_df['school'] = parent_df.location.replace(location_df.set_index('location')['school'])
Output:
school location modifed_date
0 school_10 New Delhi 2020-04-06
1 school_20 Kolkata 2020-04-06
2 school_30 Bengaluru 2020-04-06
3 school_40 Mumbai 2020-04-06
4 school_50 Chennai 2020-04-06
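For completeness, the map variant mentioned above works the same way; a minimal sketch (note that, unlike replace, map leaves unmatched locations as NaN, so they are filled back from the original column):
mapping = location_df.set_index('location')['school']
parent_df['school'] = parent_df['location'].map(mapping).fillna(parent_df['school'])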
IIUC, this is more of a regex issue, as the pattern doesn't match exactly. First extract the required pattern, create a mapping from the locations in parent_df to those in location_df, and map the values.
pat = '.*?' + '(' + '|'.join(location_df['location']) + ')' + '.*?'
mapping = parent_df['location'].str.extract(pat)[0].map(location_df.set_index('location')['school'])
parent_df['school'] = mapping.combine_first(parent_df['school'])
parent_df
school location modifed_date type
0 school_10 _pre_New Delhi_post 2020-04-06 Govt
1 school_20 _pre_Kolkata_post 2020-04-06 Private
2 school_30 _pre_Bengaluru_post 2020-04-06 Private
3 school_4 _pre_Mumbai_post 2020-04-06 Govt
4 school_5 _pre_Chennai_post 2020-04-06 Private
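One caveat to the pattern above (an addition, not part of the original answer): if any location name contains regex metacharacters, it should be escaped when building the alternation, e.g.:
import re
pat = '.*?(' + '|'.join(re.escape(loc) for loc in location_df['location']) + ').*?'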
As I understood the edited task, the following update is to be performed:
for each row in parent_df,
find a row in location_df with a matching location (a part of the location column) and type,
if found, overwrite the school column in parent_df with the school from the row just found.
To do it, proceed as follows:
Step 1: Generate a MultiIndex to locate school names by city and
school type:
ind = pd.MultiIndex.from_arrays([parent_df.location.str
                                 .split('_', expand=True)[2], parent_df.type])
For your sample data, the result is:
MultiIndex([('New Delhi', 'Govt'),
( 'Kolkata', 'Private'),
('Bengaluru', 'Private'),
( 'Mumbai', 'Govt'),
( 'Chennai', 'Private')],
names=[2, 'type'])
Don't worry about the strange first-level name (2); it will disappear soon.
Step 2: Generate a list of "new" locations:
locList = location_df.set_index(['location', 'type']).school[ind].tolist()
The result is:
['school_10', 'school_20', 'school_30', nan, nan]
For the first 3 schools a match has been found; for the last 2, nothing.
Step 3: Perform the actual update with the above list, through "non-null"
mask:
parent_df.school = parent_df.school.mask(pd.notnull(locList), locList)
Execution speed
Due to the use of vectorized operations and index lookups, my code
runs substantially faster than applying a function to each row.
Example: I replicated your parent_df 10,000 times and checked the
execution time of your code (actually, a slightly changed version,
described below) and mine with %timeit.
To allow repeated execution, I changed both versions so that they set
a school_2 column, leaving school unchanged.
Your code ran in 34.9 s, whereas mine took only 161 ms, more than 200
times faster.
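For reference, a sketch of such a benchmark setup in IPython (names taken from this thread; Series.reindex is used for the lookup so the sketch also runs on newer pandas, where bracket indexing with missing labels raises a KeyError; timings will vary by machine):
big_df = pd.concat([parent_df] * 10_000, ignore_index=True)

def vectorized_lookup(df, loc_df):
    # same steps as above: build the (city, type) MultiIndex, look up schools, mask-update
    ind = pd.MultiIndex.from_arrays([df.location.str.split('_', expand=True)[2], df.type])
    found = loc_df.set_index(['location', 'type']).school.reindex(ind).to_numpy()
    return df.school.mask(pd.notnull(found), found)

%timeit big_df.apply(transform_school_name, args=(location_df,), axis=1)  # row-wise apply
%timeit vectorized_lookup(big_df, location_df)                            # vectorized lookup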
Yet quicker version
If parent_df has a default index (consecutive numbers starting from 0),
then the whole operation can be performed with a single instruction:
parent_df.school = location_df.set_index(['location', 'type']).school[
pd.MultiIndex.from_arrays(
[parent_df.location.str.split('_', expand=True)[2],
parent_df.type])
]\
.reset_index(drop=True)\
.combine_first(parent_df.school)
Steps:
location_df.set_index(...) - Set index to 2 "criterion" columns.
.school - Leave only school column (with the above index).
[...] - Retrieve from it elements indicated by the MultiIndex
defined inside.
pd.MultiIndex.from_arrays( - Create the MultiIndex.
parent_df.location.str.split('_', expand=True)[2] - The first level
of the MultiIndex - the "city" part from location.
parent_df.type - The second level of the MultiIndex - type.
reset_index(...) - Change the MultiIndex into a default index
(now the index is just the same as in parent_df).
combine_first(...) - Overwrite NaN values in the result generated
so far with original values from school.
parent_df.school = - Save the result back in school column.
For test purposes, to check the execution speed, you can replace it
with parent_df['school_2'].
According to my assessment, the execution time is about 9% shorter than
for my original solution.
Corrections to your code
Take a look at location_values[1]. It retrieves the pre segment, whereas
actually the next segment (the city name) should be retrieved.
There is no need to create a temporary result based on the first condition
and then narrow it down by filtering with the second condition.
Both your conditions (for equality of location and type) can be
performed in a single instruction, so that execution time is a bit
shorter.
The value returned in the "positive" case should be from name_alias,
not location_df.
So if for some reason you wanted to stick with your code, change the
respective fragment to:
name_alias = location_df[location_df['location'].eq(location_values[2]) &
                         location_df['type'].eq(row.type)]
if len(name_alias) > 0:
    return name_alias.school.iloc[0]
else:
    return row['school']
If I'm reading the question correctly, what you're implementing with the apply method is a kind of join operation. Pandas excels at vectorized operations, and its C-based implementation of join ('merge') is almost certainly faster than a Python / apply based one. Therefore, I would try the following solution:
parent_df["location_short"] = parent_df.location.str.split("_", expand=True)[2]
parent_df = pd.merge(parent_df, location_df, how="left", left_on=["location_short", "type"],
                     right_on=["location", "type"], suffixes=["", "_by_location"])
parent_df.loc[parent_df.school_by_location.notna(), "school"] = \
parent_df.loc[parent_df.school_by_location.notna(), "school_by_location"]
As far as I could understand, it produces what you're looking for.
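After this update, the helper columns added for the merge can be dropped again (a small follow-up, not part of the original answer; the column names follow from the merge call and suffixes above):
parent_df = parent_df.drop(columns=["location_short", "location_by_location", "school_by_location"])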
This is an extension to a question I posted earlier: Python Sum lookup dynamic array table with df column
I'm currently investigating a way to efficiently map a decision variable to a dataframe. The main DF and the lookup table will be dynamic in length (+15,000 lines and +20 lines, respectively). Thus was hoping not to do this with a loop, but happy to hear suggestions.
The DF (DF1) will mostly look like the following, where I would like to lookup/search for the decision.
Where the decision value is found on a separate DF (DF0).
For example: the first DF1["ValuesWhereXYcomefrom"] value is 6.915, which falls in the 3.8 <= value < 7.4 interval of the key table, and thus the corresponding DF0["Decision"] value is -1. The process then repeats until every line is mapped to a decision.
I was thinking to use the python bisect library, but have not prevailed to any working solution and also working with a loop. Now I'm wondering if I am looking at the problem incorrectly as mapping and looping 15k lines is time consuming.
Example Main Data (DF1):
time  Value0     Value1       Value2       ValuesWhereXYcomefrom  Value_toSum   Decision Map
1     41.43      6.579482077                                      0.00531021
2     41.650002  6.756817908  46.72466411  6.915187703            0.001200456   -1
3     41.700001  6.221966706  11.64727001  1.871959552            0.000959257   -1
4     41.740002  6.230847055  46.92753343  7.531485368            0.006228989    1
5     42         6.637399856  8.031374656  1.210018204            0.010238095   -1
6     42.43      7.484894608  16.24547568  2.170434793            -0.007777563  -1
7     42.099998  7.595291765  38.73871244  5.100358702            0.003562993   -1
8     42.25      7.567457423  37.07538953  4.899319211            0.01088755    -1
9     42.709999  8.234795546  64.27986403  7.805884636            0.005151042    1
10    42.93      8.369526407  24.72700129  2.954408659            -0.003028209  -1
11    42.799999  8.146653099  61.52243361  7.55186613             0              1
Example KeyTable (DF0):
ValueX       ValueY       SUM          Decision
0.203627201  3.803627201  0.040294925  -1
3.803627201  7.403627201  0.031630668  -1
7.403627201  11.0036272   0.011841521   1
Here's how I would go about this, assuming your first DataFrame is called df and your second is decision:
import numpy as np

def map_func(x):
    # return the Decision of the first key-table row whose ValueY upper bound exceeds x
    for i in range(len(decision)):
        try:
            if x < decision["ValueY"].iloc[i]:
                return decision["Decision"].iloc[i]
        except Exception:
            return np.nan

df["decision"] = df["ValuesWhereXYcomefrom"].apply(lambda x: map_func(x))
This will create a new column in your DataFrame called "decision" that contains the looked-up value. You can then just query it:
df.decision.iloc[row]
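Since the question is explicitly about efficiency, a vectorized alternative is possible; a minimal sketch, assuming the frames are named df and decision as above and that decision's ValueY bounds are sorted ascending (np.searchsorted applies the same strict x < ValueY test as map_func, but without a Python-level loop):
import numpy as np

vals = df["ValuesWhereXYcomefrom"].to_numpy()
bounds = decision["ValueY"].to_numpy()
idx = np.searchsorted(bounds, vals, side="right")   # first interval whose upper bound exceeds the value
in_range = idx < len(bounds)                        # values beyond the last bound get NaN
decisions = decision["Decision"].to_numpy()
df["decision"] = np.where(in_range, decisions[np.clip(idx, 0, len(bounds) - 1)], np.nan)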
I have a question regarding text file handling. My text file prints as one column. The column has data scattered throughout the rows and visually looks great and somewhat uniform; however, it is still just one column. Ultimately, I'd like to append the row where the keyword is found to the end of the previous row, until the data is one long row. Then I'll use str.split() to cut sections into columns as I need.
In Excel (code below, top) I took this same text file, removed headers, aligned left, and performed searches for keywords. When found, Excel has a nice feature called Offset where you can place or append the cell value basically anywhere using Offset(x, y).Value from the active-cell start position. Once done, I would delete the row. This allowed me to get the data into a tabular column format that I could work with.
What I Need:
The below Python code will cycle down through each row looking for the keyword 'Address:'. This part of the code works. Once it finds the keyword, the next line should append the row to the end of the previous row. This is where my problem is. I cannot find a way to get the active row number into a variable so I can use it in place of the word [index] for the active row, or [index-1] for the previous row.
Excel Code of similar task
Do
    Set Rng = WorkRng.Find("Address", LookIn:=xlValues)
    If Not Rng Is Nothing Then
        Rng.Offset(-1, 2).Value = Rng.Value
        Rng.Value = ""
    End If
Loop While Not Rng Is Nothing
Python Equivalent
import pandas as pd
from pandas import DataFrame, Series

file = {'Test': ['Last Name: Nobody', 'First Name: Tommy', 'Address: 1234 West Juniper St.',
                 'Fav Toy', 'Notes', 'Time Slot']}
df = pd.DataFrame(file)
Test
0 Last Name: Nobody
1 First Name: Tommy
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I've tried the following:
for line in df.Test:
    if line.startswith('Address:'):
        df.loc[[index-1],:].values = df.loc[index-1].values + ' ' + df.loc[index].values
        # Line above does not work with index statement
    else:
        pass

# df.loc[[1],:] = df.loc[1].values + ' ' + df.loc[2].values  # copies row 2 at the end of row 1,
# works with static row numbers only
# df.drop([2,0], inplace=True)  # Deletes row from df
Expected output:
Test
0 Last Name: Nobody
1 First Name: Tommy Address: 1234 West Juniper St.
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I am trying to wrap my head around the whole Series vectorization approach, but I'm still stuck trying loops that I'm semi-familiar with. If there is a way to achieve this, please point me in the right direction.
As always, I appreciate your time and your knowledge. Please let me know if you can help with this issue.
Thank You,
Use Series.shift on Test, then Series.str.startswith to create a boolean mask, then boolean indexing with this mask to update the values in the Test column:
s = df['Test'].shift(-1)
m = s.str.startswith('Address', na=False)
df.loc[m, 'Test'] += (' ' + s[m])
Result:
Test
0 Last Name: Nobody
1 First Name: Tommy Address: 1234 West Juniper St.
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
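If, as in the Excel workflow where the found cell is cleared and its row deleted afterwards, the now-duplicated Address row should also be removed, a follow-up along these lines would drop it (an assumption based on the question's Excel code and commented-out df.drop, not part of the answer above):
df = df[~df['Test'].str.startswith('Address:')].reset_index(drop=True)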
TLDR; How can I improve my code and make it more pythonic?
Hi,
One of the interesting challenges we were given in a tutorial was the following:
"There are X missing entries in the data frame with an associated code but a 'blank' entry next to the code. This is a random occurrence across the data frame. Using your knowledge of pandas, map each missing 'blank' entry to the associated code."
So this looks like the following:
|code| |name|
001 Australia
002 London
...
001 <blank>
The approach I have used is as follows:
Loop through the entire dataframe and identify entries with blanks (""), then replace each blank by copying in the correct name associated with its code.
code_names = [ "",
'Economic management',
'Public sector governance',
'Rule of law',
'Financial and private sector development',
'Trade and integration',
'Social protection and risk management',
'Social dev/gender/inclusion',
'Human development',
'Urban development',
'Rural development',
'Environment and natural resources management'
]
df_copy = df_.copy()
# Looks through each code name, and if it is empty, stores the proper name in its place
for x in range(len(df_copy.mjtheme_namecode)):
    for y in range(len(df_copy.mjtheme_namecode[x])):
        if df_copy.mjtheme_namecode[x][y]['name'] == "":
            df_copy.mjtheme_namecode[x][y]['name'] = code_names[int(df_copy.mjtheme_namecode[x][y]['code'])]
limit = 25
counter = 0
for x in range(len(df_copy.mjtheme_namecode)):
    for y in range(len(df_copy.mjtheme_namecode[x])):
        print(df_copy.mjtheme_namecode[x][y])
        counter += 1
        if counter >= limit:
            break
While the above approach works, is there a better, more pythonic way of achieving what I'm after? I feel the approach I have used is very clunky because my skills are not very well developed yet.
Thank you!
Method 1:
One way to do this would be to replace all your "" blanks with NaN, sort the dataframe by code and name, and use fillna(method='ffill'):
Starting with this:
>>> df
code name
0 1 Australia
1 2 London
2 1
You can apply the following:
new_df = (df.replace({'name': {'': np.nan}})
            .sort_values(['code', 'name'])
            .fillna(method='ffill')
            .sort_index())
>>> new_df
code name
0 1 Australia
1 2 London
2 1 Australia
Method 2:
This is more convoluted, but will work as well:
Using groupby, first, and squeeze, you can create a pd.Series to map the codes to non-blank names, and use .map to map that series to your code column:
df['name'] = (df['code']
              .map(df.replace({'name': {'': np.nan}})
                     .sort_values(['code', 'name'])
                     .groupby('code')
                     .first()
                     .squeeze()))
>>> df
code name
0 1 Australia
1 2 London
2 1 Australia
Explanation: The pd.Series map that this creates looks like this:
code
1 Australia
2 London
And it works because it gets the first instance for every code (via the groupby), sorted in such a manner that the NaNs are last. So as long as each code is associated with a name, this method will work.
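As a side note, the question's mjtheme_namecode column actually holds lists of {'code': ..., 'name': ...} dicts rather than a flat code/name frame; a minimal sketch closer to that structure (names assumed from the question, not from the answer above) builds a code-to-name mapping once from the non-blank entries and fills the blanks in place, without relying on the hard-coded code_names list:
all_entries = [entry for row in df_copy.mjtheme_namecode for entry in row]
code_to_name = {e['code']: e['name'] for e in all_entries if e['name']}

for row in df_copy.mjtheme_namecode:
    for entry in row:
        if entry['name'] == "":
            entry['name'] = code_to_name[entry['code']]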
I have the following two databases:
url='https://raw.githubusercontent.com/108michael/ms_thesis/master/rgdp_catcode.merge'
df=pd.read_csv(url, index_col=0)
df.head(1)
naics catcode GeoName Description ComponentName year GDP state
0 22 E1600',\t'E1620',\t'A4000',\t'E5000',\t'E3000'... Alabama Utilities Real GDP by state 2004 5205 AL
url='https://raw.githubusercontent.com/108michael/ms_thesis/master/mpl.Bspons.merge'
df1=pd.read_csv(url, index_col=0)
df1.head(1)
state year unemployment log_diff_unemployment id.thomas party type date bills id.fec years_exp session name disposition catcode
0 AK 2006 6.6 -0.044452 1440 Republican sen 2006-05-01 s2686-109 S2AK00010 39 109 National Cable & Telecommunications Association support C4500
Regarding df, I had to manually input the catcode values. I think that is why the formatting is off. What I would like is to simply have the values without the \t prefix. I want to merge the dfs on catcode, state, year. I made a test earlier wherein a df1.catcode with only one value per cell was matched with the values in another df.catcode that had more than one value per cell and it worked.
So technically, all I need to do is lose the \t before each consecutive value in df.catcode, but additionally, if anyone has ever done a merge of this sort before, any 'caveats' learned through experience would be appreciated. My merge code looks like this:
mplmerge=pd.merge(df1,df, on=(['catcode', 'state', 'year']), how='left' )
I think this can be done with the regex method, I'm looking at the documentation now.
Cleaning catcode column in df is rather straightforward:
catcode_fixed = df.catcode.str.findall('[A-Z][0-9]{4}')
This will produce a series with a list of catcodes in every row:
catcode_fixed.head(3)
Out[195]:
0 [E1600, E1620, A4000, E5000, E3000, E1000]
1 [X3000, X3200, L1400, H6000, X5000]
2 [X3000, X3200, L1400, H6000, X5000]
Name: catcode, dtype: object
If I understand correctly what you want, then you need to "ungroup" these lists. Here is the trick, in short:
catcode_fixed = catcode_fixed.apply(pd.Series).stack()
catcode_fixed.index = catcode_fixed.index.droplevel(-1)
So, we've got (note the index values):
catcode_fixed.head(12)
Out[206]:
0 E1600
0 E1620
0 A4000
0 E5000
0 E3000
0 E1000
1 X3000
1 X3200
1 L1400
1 H6000
1 X5000
2 X3000
dtype: object
Now, dropping the old catcode and joining in the new one:
df.drop('catcode',axis = 1, inplace = True)
catcode_fixed.name = 'catcode'
df = df.join(catcode_fixed)
By the way, you may also need to use df1.reset_index() when merging the data frames.
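As a side note (an addition, not part of the original answer): in newer pandas versions (0.25 and later), Series.explode performs the same "ungrouping" of the list column in one step, with the original index values repeated just as in the stack/droplevel trick:
catcode_fixed = df.catcode.str.findall('[A-Z][0-9]{4}').explode()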
(I suck at titling these questions...)
So I've gotten 90% of the way through a very laborious learning process with pandas, but I have one thing left to figure out. Let me show an example (actual original is a comma-delimited CSV that has many more rows):
Name Price Rating URL Notes1 Notes2 Notes3
Foo $450 9 a.com/x NaN NaN NaN
Bar $99 5 see over www.b.com Hilarious Nifty
John $551 2 www.c.com Pretty NaN NaN
Jane $999 8 See Over in Notes Funky http://www.d.com Groovy
The URL column can say many different things, but they all include "see over," and do not indicate with consistency which column to the right includes the site.
I would like to do a few things here: first, move websites from any Notes column to URL; second, collapse all Notes columns to one column with a new line between them. So this (NaNs replaced with empty strings, since pandas makes me fill them in order to use them in df.loc):
Name Price Rating URL Notes1
Foo $450 9 a.com/x
Bar $99 5 www.b.com Hilarious
Nifty
John $551 2 www.c.com Pretty
Jane $999 8 http://www.d.com Funky
Groovy
I got partway there by doing this:
df['URL'] = df['URL'].fillna('')
df['Notes1'] = df['Notes1'].fillna('')
df['Notes2'] = df['Notes2'].fillna('')
df['Notes3'] = df['Notes3'].fillna('')
to_move = df['URL'].str.lower().str.contains('see over')
df.loc[to_move, 'URL'] = df['Notes1']
What I don't know is how to find the Notes column with either www or .com. If I, for example, try to use my above method as a condition, e.g.:
if df['Notes1'].str.lower().str.contains('www'):
    df.loc[to_move, 'URL'] = df['Notes1']
I get back ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). But adding .any() or .all() has the obvious flaw that they don't give me what I'm looking for: with any, e.g., every line that meets the to_move requirement in URL will get whatever's in Notes1. I need the check to occur row by row. For similar reasons, I can't even get started collapsing the Notes columns (and I don't know how to check for non-null empty-string cells, either, a problem I created at this point).
As it stands, I know I also have to move Notes2 into Notes1, Notes3 into Notes2, and '' into Notes3 when the first condition is satisfied, because I don't want the leftover URLs in the Notes columns. I'm sure pandas has easier routes than what I'm doing, because it's pandas, and when I try to do anything with pandas, I find out that it can be done in one line instead of my 20...
(PS, I don't care if the empty columns Notes2 and Notes3 are left over, b/c I'm not using them in my CSV import in the next step, though I can always learn more than I need)
UPDATE: So I figured out a crummy verbose solution using my non-pandas python logic one step at a time. I came up with this (same first five lines above, minus the df.loc line):
url_in1 = df['Notes1'].str.contains('\.com')
url_in2 = df['Notes2'].str.contains('\.com')
to_move = df['URL'].str.lower().str.contains('see-over')
to_move1 = to_move & url_in1
to_move2 = to_move & url_in2
df.loc[to_move1, 'URL'] = df.loc[url_in1, 'Notes1']
df.loc[url_in1, 'Notes1'] = df['Notes2']
df.loc[url_in1, 'Notes2'] = ''
df.loc[to_move2, 'URL'] = df.loc[url_in2, 'Notes2']
df.loc[url_in2, 'Notes2'] = ''
(Lines are moved around and to_move is repeated in the actual code.) I know there has to be a more efficient method... This also doesn't collapse the Notes columns, but that should be easy using the same method, except that I still don't know a good way to find the empty strings.
I'm still learning pandas, so some parts of this code may not be so elegant, but the general idea is: get all Notes columns, find all URLs in there, combine them with the URL column, and then concatenate the remaining notes into the Notes1 column:
import pandas as pd
import numpy as np
import pandas.core.strings as strings
# Just to get the first non-null occurrence
def geturl(s):
    try:
        return next(e for e in s if not pd.isnull(e))
    except:
        return np.NaN
df = pd.read_csv("d:/temp/data2.txt")
dfnotes = df[[e for e in df.columns if 'Notes' in e]]
# Notes1 Notes2 Notes3
# 0 NaN NaN NaN
# 1 www.b.com Hilarious Nifty
# 2 Pretty NaN NaN
# 3 Funky http://www.d.com Groovy
dfurls = dfnotes.apply(lambda x: x.str.contains('\.com'), axis=1)
dfurls = dfurls.fillna(False).astype(bool)
# Notes1 Notes2 Notes3
# 0 False False False
# 1 True False False
# 2 False False False
# 3 False True False
turl = dfnotes[dfurls].apply(geturl, axis=1)
df['URL'] = np.where(turl.isnull(), df['URL'], turl)
df['Notes1'] = dfnotes[~dfurls].apply(lambda x: strings.str_cat(x[~x.isnull()], sep=' '), axis=1)
del df['Notes2']
del df['Notes3']
df
# Name Price Rating URL Notes1
# 0 Foo $450 9 a.com/x
# 1 Bar $99 5 www.b.com Hilarious Nifty
# 2 John $551 2 www.c.com Pretty
# 3 Jane $999 8 http://www.d.com Funky Groovy
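One caveat (an addition, not part of the original answer): pandas.core.strings is an internal module, so the same concatenation of the remaining notes can be done with plain str.join over the non-null values instead, e.g.:
df['Notes1'] = dfnotes[~dfurls].apply(lambda x: ' '.join(x.dropna().astype(str)), axis=1)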