I need to use a DataFrame as a lookup table on columns that are not part of the index. For example (this is a simple one just to illustrate):
import pandas as pd
westcoast = pd.DataFrame([['Washington','Olympia'],['Oregon','Salem'],
['California','Sacramento']],
columns=['state','capital'])
print westcoast
state capital
0 Washington Olympia
1 Oregon Salem
2 California Sacramento
It's easy to look up and get a Series as output:
westcoast[westcoast.state=='Oregon'].capital
1 Salem
Name: capital, dtype: object
but I want to obtain the string 'Salem':
westcoast[westcoast.state=='Oregon'].capital.values[0]
'Salem'
and the .values[0] seems somewhat clunky... is there a better way?
(FWIW: my real data has maybe 50 rows at most, but lots of columns, so if I do set an index column, no matter what column I choose, there will be a lookup operation like this that is not based on an index, and the relatively small number of rows means that I don't care if it's O(n) lookup.)
Yes, you can use Series.item if the lookup will always return exactly one element from the Series:
westcoast.loc[westcoast.state=='Oregon', 'capital'].item()
Exceptions can be handled if the lookup returns nothing, or returns one or more values and you only need the first item:
import numpy as np
s = westcoast.loc[westcoast.state=='Oregon', 'capital']
s = np.nan if s.empty else s.iat[0]
print (s) #Salem
s = westcoast.loc[westcoast.state=='New York', 'capital']
s = np.nan if s.empty else s.iat[0]
print (s)
nan
A more general way to handle the exceptions, since there are 3 possible output scenarios:
westcoast = pd.DataFrame([['Washington','Olympia'],['Oregon','Salem'],
['California','Sacramento'],['Oregon','Portland']],
columns=['state','capital'])
print (westcoast)
state capital
0 Washington Olympia
1 Oregon Salem
2 California Sacramento
3 Oregon Portland
s = westcoast.loc[westcoast.state=='Oregon', 'capital']
#if no value is returned
if s.empty:
    s = 'no match'
#if exactly one value is returned
elif len(s) == 1:
    s = s.item()
#if multiple values are returned, build a list of values
else:
    s = s.tolist()
print (s)
['Salem', 'Portland']
It is possible to create a lookup function:
def look_up(a):
    s = westcoast.loc[westcoast.state==a, 'capital']
    #no match
    if s.empty:
        return np.nan
    #exactly one match
    elif len(s) == 1:
        return s.item()
    #multiple matches: return a list of values
    else:
        return s.tolist()
print (look_up('Oregon'))
['Salem', 'Portland']
print (look_up('California'))
Sacramento
print (look_up('New York'))
nan
If you are going to do frequent lookups of this sort, then it pays to make state the index:
state_capitals = westcoast.set_index('state')['capital']
print(state_capitals['Oregon'])
# Salem
With an index, each lookup is O(1) on average, whereas westcoast['state']=='Oregon' requires O(n) comparisons. Of course, building the index is also O(n), so you would need to do many lookups for this to pay off.
At the same time, once you have state_capitals the syntax is simple and dict-like. That might be reason enough to build state_capitals.
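A small aside on missing keys: with the index in place, state_capitals['New York'] would raise a KeyError, but Series.get returns a default instead (a sketch, using the state_capitals Series built above):
print(state_capitals['Oregon'])                   # Salem
print(state_capitals.get('New York'))             # None instead of a KeyError
print(state_capitals.get('New York', 'no match')) # or supply an explicit default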
I have two dataframes: one with 54k rows and 1 column, and another with 139k rows and 3 columns. I need to check whether each value of a column in the first dataframe lies between the values of two columns in the second dataframe, and if it does, replace that value in the first dataframe with the corresponding string value from the second dataframe.
I tried doing it with simple for loops and if/else statements, but the number of iterations is huge and my cell is taking forever to run. I've attached some snippets below; if there is a better way to rewrite that particular part of the code, it would be a great help. Thanks in advance.
First DataFrame:
ip_address_to_clean
IP_Address_clean
0 815237196
1 1577685417
2 979279225
3 3250268602
4 2103448748
... ...
54208 4145673247
54209 1344187002
54210 3156712153
54211 1947493810
54212 2872038579
54213 rows × 1 columns
Second DataFrame:
ip_boundaries_file
country lower_bound_ip_address_clean upper_bound_ip_address_clean
0 Australia 16777216 16777471
1 China 16777472 16777727
2 China 16777728 16778239
3 Australia 16778240 16779263
4 China 16779264 16781311
... ... ... ...
138841 Hong Kong 3758092288 3758093311
138842 India 3758093312 3758094335
138843 China 3758095360 3758095871
138844 Singapore 3758095872 3758096127
138845 Australia 3758096128 3758096383
138846 rows × 3 columns
Code I've written :
ip_address_to_clean_copy = ip_address_to_clean.copy()
o_ip = ip_address_to_clean['IP_Address_clean'].values
l_b = ip_boundaries_file['lower_bound_ip_address_clean'].values
for i in range(len(o_ip)):
    for j in range(len(l_b)):
        if (ip_address_to_clean['IP_Address_clean'][i] > ip_boundaries_file['lower_bound_ip_address_clean'][j]) and (ip_address_to_clean['IP_Address_clean'][i] < ip_boundaries_file['upper_bound_ip_address_clean'][j]):
            ip_address_to_clean_copy['IP_Address_clean'][i] = ip_boundaries_file['country'][j]
            #print(ip_address_to_clean_copy['IP_Address_clean'][i])
            #print(i)
This works (I tested it on small tables).
replacement1 = [None]*3758096384
replacement2 = []
for _, row in ip_boundaries_file.iterrows():
    a, b, c = row['lower_bound_ip_address_clean'], row['upper_bound_ip_address_clean'], row['country']
    replacement1[a+1:b] = [len(replacement2)]*(b-a-1)
    replacement2.append(c)
ip_address_to_clean_copy['IP_Address_clean'] = ip_address_to_clean_copy['IP_Address_clean'].apply(lambda x: replacement2[replacement1[x]] if (x < len(replacement1) and replacement1[x] is not None) else x)
I tweaked the lambda function to keep the original ip if it's not in the replacement table.
Notes:
Compared to my comment, I added the replacement2 table to hold the actual strings, and put the indexes in replacement1 to make it more memory efficient.
This is based on one of the methods to sort a list in O(n) when you know the contained values are bounded.
Example:
Inputs:
ip_address_to_clean = pd.DataFrame([10,33,2,179,2345,123], columns = ['IP_Address_clean'])
ip_boundaries_file = pd.DataFrame([['China',1,12],
['Australia', 20,40],
['China',2000,3000],
['France', 100,150]],
columns = ['country', 'lower_bound_ip_address_clean',
'upper_bound_ip_address_clean'])
Output:
ip_address_to_clean_copy
# Out[13]:
# IP_Address_clean
# 0 China
# 1 Australia
# 2 China
# 3 179
# 4 China
# 5 France
As I mentioned in another comment, here's another script that performs a binary search on the 2nd DataFrame; it works in O(n log(p)), which is slower than the script above but consumes far less memory!
def replace(n, df):
    if len(df) == 0:
        return n
    i = len(df)//2
    if df.iloc[i]['lower_bound_ip_address_clean'] < n < df.iloc[i]['upper_bound_ip_address_clean']:
        return df.iloc[i]['country']
    elif len(df) == 1:
        return n
    else:
        if n <= df.iloc[i]['lower_bound_ip_address_clean']:
            #search the rows before the middle one (iloc[:i], not iloc[:i-1], so row i-1 is not skipped)
            return replace(n, df.iloc[:i])
        else:
            return replace(n, df.iloc[i+1:])
ip_address_to_clean_copy['IP_Address_clean'] = ip_address_to_clean['IP_Address_clean'].apply(lambda x: replace(x,ip_boundaries_file))
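One assumption worth making explicit: the binary search only works if ip_boundaries_file is sorted by the lower bound (the sample data above already is). A one-line safeguard, as a sketch:
ip_boundaries_file = ip_boundaries_file.sort_values('lower_bound_ip_address_clean').reset_index(drop=True)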
This is my first time posting a question, so take it easy on me if I don't know the Stack Overflow norms for asking questions.
Attached is a snippet of what I am trying to accomplish in my side project. I want to compare a user input with a database .xlsx file that was imported with pandas.
I want to compare the user input with the database column ['Component']; if that component is there, the code should grab the properties associated with it.
comp_loc = r'C:\Users\ayubi\Documents\Python Files\Chemical_Database.xlsx'
data = pd.read_excel(comp_loc)
print(data)
LK = input('What is the Light Key?: ') #Answer should be Benzene in this case
if LK == data['Component'].any():
    Tcrit = data['TC, (K)']
    Pcrit = data['PC, (bar)']
    A = data['A']
    B = data['B']
    C = data['C']
    D = data['D']
else:
    print('False')
Results
Component TC, (K) PC, (bar) A B C D
0 Benzene 562.2 48.9 -6.983 1.332 -2.629 -3.333
1 Toluene 591.8 41.0 -7.286 1.381 -2.834 -2.792
What is the Light Key?: Benzene
False
Please let me know if you have any questions.
I do appreciate your help!
You can do this by taking advantage of indices and using the df.loc accessor in pandas:
# set index to Component column for convenience
data = data.set_index('Component')
LK = input('What is the Light Key?: ') #Answer should be Benzene in this case
# check if your lookup is in the index
if LK in data.index:
    # grab the row by the index using .loc
    row = data.loc[LK]
    # if the column name has spaces, you need to access as key
    Tcrit = row['TC, (K)']
    Pcrit = row['PC, (bar)']
    # if the column name doesn't have a space, you can access as attribute
    A = row.A
    B = row.B
    C = row.C
    D = row.D
else:
    print('False')
This is a great case for an Index. Set 'Component' as the index, and then you can use a very fast loc call to look up the data. Instead of the if-else, use a try-except: a KeyError tells you that the LK doesn't exist, without the slower step of first checking whether it's in the index.
I also highly suggest you keep the values as a single Series instead of floating around as 6 different variables. It's simple to access each item by the Series index, i.e. Series['A'].
df = df.set_index('Component')

def grab_data(df, LK):
    try:
        return df.loc[LK]
    except KeyError:
        return False
grab_data(df, 'Benzene')
#TC, (K) 562.200
#PC, (bar) 48.900
#A -6.983
#B 1.332
#C -2.629
#D -3.333
#Name: Benzene, dtype: float64
grab_data(df, 'foo')
#False
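As a usage sketch of the point above about keeping everything in one Series (names taken from the question), individual properties can then be pulled off the returned Series:
props = grab_data(df, 'Benzene')
if props is not False:
    # access each property by its label
    Tcrit = props['TC, (K)']
    A = props['A']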
I have a scenario where I need to transform the values of a particular column based on the value present in another column in the same row, and the value in another dataframe.
Example-
print(parent_df)
school location modifed_date
0 school_1 New Delhi 2020-04-06
1 school_2 Kolkata 2020-04-06
2 school_3 Bengaluru 2020-04-06
3 school_4 Mumbai 2020-04-06
4 school_5 Chennai 2020-04-06
print(location_df)
school location
0 school_10 New Delhi
1 school_20 Kolkata
2 school_30 Bengaluru
3 school_40 Mumbai
4 school_50 Chennai
As per this use case, I need to transform the school names in parent_df based on the location column in the same df and the location property in location_df.
To achieve this transformation, I wrote the following method.
def transform_school_name(row, location_df):
    name_alias = location_df[location_df['location'] == row['location']]
    if len(name_alias) > 0:
        return location_df.school.iloc[0]
    else:
        return row['school']
And this is how I am calling this method
parent_df['school'] = parent_df.apply(UtilityMethods.transform_school_name, args=(self.location_df,), axis=1)
The issue is that for just 46K records, the entire transformation takes around 2 minutes, which is too slow. How can I improve the performance of this solution?
EDITED
Following is the actual scenario I am dealing with, wherein a small transformation is needed before we can replace the value in the original column. I am not sure if this can be done within the replace() method mentioned in one of the answers below.
print(parent_df)
school location modifed_date type
0 school_1 _pre_New Delhi_post 2020-04-06 Govt
1 school_2 _pre_Kolkata_post 2020-04-06 Private
2 school_3 _pre_Bengaluru_post 2020-04-06 Private
3 school_4 _pre_Mumbai_post 2020-04-06 Govt
4 school_5 _pre_Chennai_post 2020-04-06 Private
print(location_df)
school location type
0 school_10 New Delhi Govt
1 school_20 Kolkata Private
2 school_30 Bengaluru Private
Custom Method code
def transform_school_name(row, location_df):
    location_values = row['location'].split('_')
    name_alias = location_df[location_df['location'] == location_values[1]]
    name_alias = name_alias[name_alias['type'] == location_df['type']]
    if len(name_alias) > 0:
        return location_df.school.iloc[0]
    else:
        return row['school']
This is the actual scenario I need to handle, so using the replace() method won't help.
You can use map/replace:
parent_df['school'] = parent_df.location.replace(location_df.set_index('location')['school'])
Output:
school location modifed_date
0 school_10 New Delhi 2020-04-06
1 school_20 Kolkata 2020-04-06
2 school_30 Bengaluru 2020-04-06
3 school_40 Mumbai 2020-04-06
4 school_50 Chennai 2020-04-06
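For completeness, since the answer says map/replace, here is the map variant as a sketch; unmatched locations become NaN, so the original school name is kept as a fallback:
lookup = location_df.set_index('location')['school']
parent_df['school'] = parent_df['location'].map(lookup).fillna(parent_df['school'])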
IIUC, this is more of a regex issue, as the pattern doesn't match exactly. First extract the required pattern, create a mapping from the locations in parent_df to the schools in location_df, then map the values.
pat = '.*?' + '(' + '|'.join(location_df['location']) + ')' + '.*?'
mapping = parent_df['location'].str.extract(pat)[0].map(location_df.set_index('location')['school'])
parent_df['school'] = mapping.combine_first(parent_df['school'])
parent_df
school location modifed_date type
0 school_10 _pre_New Delhi_post 2020-04-06 Govt
1 school_20 _pre_Kolkata_post 2020-04-06 Private
2 school_30 _pre_Bengaluru_post 2020-04-06 Private
3 school_4 _pre_Mumbai_post 2020-04-06 Govt
4 school_5 _pre_Chennai_post 2020-04-06 Private
As I understood the edited task, the following update is to be performed: for each row in parent_df, find a row in location_df with a matching location (a part of the location column) and type; if found, overwrite the school column in parent_df with the school from the row just found.
To do it, proceed as follows:
Step 1: Generate a MultiIndex to locate school names by city and
school type:
ind = pd.MultiIndex.from_arrays([parent_df.location.str
.split('_', expand=True)[2], parent_df.type])
For your sample data, the result is:
MultiIndex([('New Delhi', 'Govt'),
( 'Kolkata', 'Private'),
('Bengaluru', 'Private'),
( 'Mumbai', 'Govt'),
( 'Chennai', 'Private')],
names=[2, 'type'])
Don't worry about the strange first-level name (2); it will disappear soon.
Step 2: Generate a list of "new" locations:
locList = location_df.set_index(['location', 'type']).school[ind].tolist()
The result is:
['school_10', 'school_20', 'school_30', nan, nan]
For first 3 schools something has been found, for last 2 - nothing.
Step 3: Perform the actual update with the above list, through a "non-null" mask:
parent_df.school = parent_df.school.mask(pd.notnull(locList), locList)
Execution speed
Due to the use of vectorized operations and lookup by index, my code runs substantially faster than applying a function to each row.
Example: I replicated your parent_df 10,000 times and checked the execution time of your code (actually, a slightly changed version, described below) and mine with %timeit.
To allow repeated execution, I changed both versions so that they set a school_2 column, and school remains unchanged.
Your code ran in 34.9 s, whereas mine took only 161 ms, 261 times faster.
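A rough sketch of how such a timing comparison can be set up (the 10,000x replication follows the description above; vectorized_update is a placeholder name for the MultiIndex-based version):
big_parent = pd.concat([parent_df] * 10000, ignore_index=True)
%timeit big_parent.apply(transform_school_name, args=(location_df,), axis=1)
%timeit vectorized_update(big_parent, location_df)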
Yet quicker version
If parent_df has a default index (consecutive numbers starting from 0),
then the whole operation can be performed with a single instruction:
parent_df.school = location_df.set_index(['location', 'type']).school[
        pd.MultiIndex.from_arrays(
            [parent_df.location.str.split('_', expand=True)[2],
             parent_df.type])
    ]\
    .reset_index(drop=True)\
    .combine_first(parent_df.school)
Steps:
location_df.set_index(...) - Set the index to the 2 "criterion" columns.
.school - Leave only the school column (with the above index).
[...] - Retrieve from it the elements indicated by the MultiIndex defined inside.
pd.MultiIndex.from_arrays( - Create the MultiIndex.
parent_df.location.str.split('_', expand=True)[2] - The first level of the MultiIndex: the "city" part of location.
parent_df.type - The second level of the MultiIndex: type.
reset_index(...) - Change the MultiIndex into a default index (now the index is just the same as in parent_df).
combine_first(...) - Overwrite NaN values in the result generated so far with the original values from school.
parent_df.school = - Save the result back in the school column.
For test purposes, to check the execution speed, you can change it to parent_df['school_2']. According to my assessment, the execution time is about 9% shorter than for my original solution.
Corrections to your code
Take a look at location_values[1]. It retrieves the pre segment, whereas actually the next segment (the city name) should be retrieved.
There is no need to create a temporary result based on the first condition and then narrow it down by filtering with the second condition. Both of your conditions (equality of location and of type) can be applied in a single instruction, so the execution time is a bit shorter.
The value returned in the "positive" case should come from name_alias, not location_df.
So if for some reason you want to stay with your code, change the respective fragment to:
name_alias = location_df[location_df['location'].eq(location_values[2]) &
                         location_df['type'].eq(row.type)]
if len(name_alias) > 0:
    return name_alias.school.iloc[0]
else:
    return row['school']
If I'm reading the question correctly, what you're implementing with the apply method is a kind of join operation. Pandas excels at vectorized operations, and its C-based implementation of join (merge) is almost certainly faster than a Python/apply based one. Therefore, I would try the following solution:
parent_df["location_short"] = parent_df.location.str.split("_", expand=True)[2]
parent_df = pd.merge(parent_df, location_df, how = "left", left_on=["location_short", "type"],
right_on=["location", "type"], suffixes = ["", "_by_location"])
parent_df.loc[parent_df.school_by_location.notna(), "school"] = \
parent_df.loc[parent_df.school_by_location.notna(), "school_by_location"]
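Optionally, the helper columns created above can be dropped afterwards (a sketch; location_short comes from the split and school_by_location from the merge suffix):
parent_df = parent_df.drop(columns=['location_short', 'school_by_location'])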
As far as I could understand, it produces what you're looking for.
Overview: I am working with pandas dataframes of census information; while they only have two columns, they are several hundred thousand rows long. One column is a census block ID number and the other is a 'place' value, which is unique to the city in which that census block resides.
Example Data:
BLOCKID PLACEFP
0 60014001001000 53000
1 60014001001001 53000
...
5844 60014099004021 53000
5845 60014100001000
5846 60014100001001
5847 60014100001002 53000
Problem: As shown above, there are several place values that are blank even though they have a census block ID in their corresponding row. In several instances, a census block ID that is missing a place value is located within the same city as the surrounding blocks that do have a place value, especially if the bookend place values are the same. As shown above with index 5844 through 5847, those two blocks are located within the same general area as the surrounding blocks, but just seem to be missing the place value.
Goal: I want to be able to go through this dataframe, find these instances and fill in the missing place value, based on the place value before the missing value and the place value that immediately follows.
Current State & Obstacle: I wrote a loop that goes through the dataframe to correct these issues, shown below.
current_state_blockid_df = pandas.DataFrame({'BLOCKID':[60014099004021,60014100001000,60014100001001,60014100001002,60014301012019,60014301013000,60014301013001,60014301013002,60014301013003,60014301013004,60014301013005,60014301013006],
                                             'PLACEFP': [53000,'','',53000,11964,'','','','','','',11964]})
for i in current_state_blockid_df.index:
    if current_state_blockid_df.loc[i, 'PLACEFP'] == '':
        #Get value before blank
        prior_place_fp = current_state_blockid_df.loc[i - 1, 'PLACEFP']
        next_place_fp = ''
        _n = 1
        # Find the end of the blank section
        while next_place_fp == '':
            next_place_fp = current_state_blockid_df.loc[i + _n, 'PLACEFP']
            if next_place_fp == '':
                _n += 1
        # if the blanks could likely be in the same city, assign them the city's place value
        if prior_place_fp == next_place_fp:
            for _i in range(1, _n):
                current_state_blockid_df.loc[_i, 'PLACEFP'] = prior_place_fp
However, as expected, it is very slow when dealing with hundreds of thousands of rows of data. I have considered using a ThreadPoolExecutor to split up the work, but I haven't quite figured out the logic I'd use to get that done. One possibility to speed it up slightly is to eliminate the check for where the gap ends and instead just fill it in with whatever the previous place value was before the blanks. While that may end up being my go-to, there's still a chance it's too slow, and ideally I'd like it to fill in only if the before and after values match, eliminating the possibility of a block being mistakenly assigned. If someone has another suggestion as to how this could be achieved quickly, it would be very much appreciated.
You can use shift to help speed up the process. However, this doesn't solve for cases where there are multiple blanks in a row.
df['PLACEFP_PRIOR'] = df['PLACEFP'].shift(1)
df['PLACEFP_AFTER'] = df['PLACEFP'].shift(-1)
criteria1 = df['PLACEFP'].isnull()
criteria2 = df['PLACEFP_PRIOR'] == df['PLACEFP_AFTER']
df.loc[criteria1 & criteria2, 'PLACEFP'] = df.loc[criteria1 & criteria2, 'PLACEFP_PRIOR']
If you end up needing to iterate over the dataframe, use df.itertuples. You can access the column values of each row via dot notation (row.column_name) and the index label via row.Index.
for row in df.itertuples():
    # logic goes here, e.g. row.Index, row.BLOCKID, row.PLACEFP
Using your dataframe as defined
def fix_df(current_state_blockid_df):
    df_with_blanks = current_state_blockid_df[current_state_blockid_df['PLACEFP'] == '']
    df_no_blanks = current_state_blockid_df[current_state_blockid_df['PLACEFP'] != '']
    sections = {}
    last_i = 0
    grouping = []
    for i in df_with_blanks.index:
        if i - 1 == last_i:
            grouping.append(i)
            last_i = i
        else:
            last_i = i
            if len(grouping) > 0:
                sections[min(grouping)] = {'indexes': grouping}
                grouping = []
            grouping.append(i)
    if len(grouping) > 0:
        sections[min(grouping)] = {'indexes': grouping}
    for i in sections.keys():
        sections[i]['place'] = current_state_blockid_df.loc[i-1, 'PLACEFP']
    l = []
    for i in sections:
        for x in sections[i]['indexes']:
            l.append(sections[i]['place'])
    df_with_blanks['PLACEFP'] = l
    final_df = pandas.concat([df_with_blanks, df_no_blanks]).sort_index(axis=0)
    return final_df
df = fix_df(current_state_blockid_df)
print(df)
Output:
BLOCKID PLACEFP
0 60014099004021 53000
1 60014100001000 53000
2 60014100001001 53000
3 60014100001002 53000
4 60014301012019 11964
5 60014301013000 11964
6 60014301013001 11964
7 60014301013002 11964
8 60014301013003 11964
9 60014301013004 11964
10 60014301013005 11964
11 60014301013006 11964
I have two dataframes: a df of actors with a feature that is a list of movie identifier numbers for films they've worked on, and a df of movies with an identifier number that will show up in an actor's list if the actor was in that movie.
I've attempted to iterate through the movies dataframe, which does produce results but is too slow.
It seems like iterating through the list of movies from the actors dataframe would result in less looping, but I've been unable to save the results.
Here is the actors dataframe:
print(actors[['primaryName', 'knownForTitles']].head())
primaryName knownForTitles
0 Rowan Atkinson tt0109831,tt0118689,tt0110357,tt0274166
1 Bill Paxton tt0112384,tt0117998,tt0264616,tt0090605
2 Juliette Binoche tt1219827,tt0108394,tt0116209,tt0241303
3 Linda Fiorentino tt0110308,tt0119654,tt0088680,tt0120655
4 Richard Linklater tt0243017,tt1065073,tt2209418,tt0405296
And the movies dataframe:
print(movies[['tconst', 'primaryTitle']].head())
tconst primaryTitle
0 tt0001604 The Fatal Wedding
1 tt0002467 Romani, the Brigand
2 tt0003037 Fantomas: The Man in Black
3 tt0003593 Across America by Motor Car
4 tt0003830 Detective Craig's Coup
As you can see, the movies['tconst'] identifier shows up in a list in the actors dataframe.
My very slow iteration through the movie dataframe is as follows:
def add_cast(movie_df, actor_df):
    results = movie_df.copy()
    length = len(results)
    #create an empty feature
    results['cast'] = ""
    #iterate through the movie identifiers
    for index, value in results['tconst'].iteritems():
        #create a new dataframe containing all the cast associated with the movie id
        cast = actor_df[actor_df['knownForTitles'].str.contains(value)]
        #check to see if the 'primaryName' list is empty
        if len(list(cast['primaryName'].values)) != 0:
            #set the new movie 'cast' feature equal to a list of the cast names
            results.loc[index]['cast'] = list(cast['primaryName'].values)
        #logging
        if index % 1000 == 0:
            logging.warning(f'Results location: {index} out of {length}')
        #delete cast df to free up memory
        del cast
    return results
This generates some results but is not fast enough to be useful. One observation is that, by creating a new dataframe of all the actors who have the movie identifier in their knownForTitles, this list can be put into a single feature of the movies dataframe.
Whereas in my attempt to loop through the actors dataframe below, I don't seem to be able to write the items into the movies dataframe:
def actors_loop(movie_df, actor_df):
    results = movie_df.copy()
    length = len(actor_df)
    #create an empty feature
    results['cast'] = ""
    #iterate through all actors
    for index, value in actor_df['knownForTitles'].iteritems():
        #skip empties
        if str(value) == r"\N":
            logging.warning(f'skipping: {index} with a value of {value}')
            continue
        #generate a list of movies that this actor has been in
        cinemetography = [x.strip() for x in value.split(',')]
        #iterate through every movie the actor has been in
        for movie in cinemetography:
            #pull out the movie info if it exists
            movie_info = results[results['tconst'] == movie]
            #continue if empty
            if len(movie_info) == 0:
                continue
            #set the cast variable equal to the actor name
            results[results['tconst'] == movie]['cast'] = (actor_df['primaryName'].loc[index])
            #delete the df to save space ?maybe
            del movie_info
        #logging
        if index % 1000 == 0:
            logging.warning(f'Results location: {index} out of {length}')
    return results
So if I run the above code, I get a very fast result, but the 'cast' field remains empty.
I figured out the problem I was having with the def actors_loop(movie_df, actor_df) function. The problem is that
results[results['tconst'] == movie]['cast'] = (actor_df['primaryName'].loc[index])
is setting the value on a copy of the results dataframe. It would be better to use the df.at[] method (or the older df.set_value() method).
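As an illustration of that fix (a sketch reusing the names from the loop above), the assignment could locate the row position first and then write with .at:
matches = results.index[results['tconst'] == movie]
if len(matches) > 0:
    # write through .at on the actual frame instead of a chained-indexing copy
    results.at[matches[0], 'cast'] = actor_df['primaryName'].loc[index]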
I also figured out a much faster solution to the problem: rather than iterating through two dataframes with nested loops, it is better to iterate through each once. So I created a list of tuples:
def actor_tuples(actor_df):
    tuples = []
    for index, value in actor_df['knownForTitles'].iteritems():
        cinemetography = [x.strip() for x in value.split(',')]
        for movie in cinemetography:
            tuples.append((actor_df['primaryName'].loc[index], movie))
    return tuples
This created a list of tuples of the following form:
[('Fred Astaire', 'tt0043044'),
('Lauren Bacall', 'tt0117057')]
I then created a dictionary mapping movie identifier numbers to index positions (from the movie dataframe), which took this form:
{'tt0000009': 0,
'tt0000147': 1,
'tt0000335': 2,
'tt0000502': 3,
'tt0000574': 4,
'tt0000615': 5,
'tt0000630': 6,
'tt0000675': 7,
'tt0000676': 8,
'tt0000679': 9}
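One way such a mapping can be built (a sketch, assuming the movies dataframe from earlier with its default integer index):
movie_index_map = {tconst: idx for idx, tconst in movies['tconst'].items()}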
I then used the below function to iterate through the actor tuples and use the movie identifier as the key in the movie dictionary, this returned the correct movie index, which I used to add the actor name tuple to the target dataframe:
def add_cast(movie_df, Atuples, Mtuples):
    results_df = movie_df.copy()
    results_df['cast'] = ''
    counter = 0
    total = len(Atuples)
    for tup in Atuples:
        #this passes the movie ID into the movie_dict that will return an index
        try:
            movie_index = Mtuples[tup[1]]
            if results_df.at[movie_index, 'cast'] == '':
                results_df.at[movie_index, 'cast'] += tup[0]
            else:
                results_df.at[movie_index, 'cast'] += ',' + tup[0]
        except KeyError:
            pass
        #logging
        counter += 1
        if counter % 1000000 == 0:
            logging.warning(f'Index {counter} out of {total}, {counter/total}% finished')
    return results_df
It ran in 10 minutes (making 2 sets of tuples, then the adding function) for 16.5 million actor tuples. The results are below:
0 tt0000009 Miss Jerry 1894 Romance
1 tt0000147 The Corbett-Fitzsimmons Fight 1897 Documentary,News,Sport
2 tt0000335 Soldiers of the Cross 1900 Biography,Drama
3 tt0000502 Bohemios 1905 \N
4 tt0000574 The Story of the Kelly Gang 1906 Biography,Crime,Drama
cast
0 Blanche Bayliss,Alexander Black,William Courte...
1 Bob Fitzsimmons,Enoch J. Rector,John L. Sulliv...
2 Herbert Booth,Joseph Perry,Orrie Perry,Reg Per...
3 Antonio del Pozo,El Mochuelo,Guillermo Perrín,...
4 Bella Cola,Sam Crewes,W.A. Gibson,Millard John...
Thank you stack overflow!