I have a dataframe with 10k movie names and 40k actor names.
I'm trying to build a graph with nx (networkx), but the plot becomes unreadable because of the actor names, so I want to replace the names with numbers. Some of these actors appeared in multiple movies, which means they occur more than once. I want to map every actor to a number, like 'Leslie Howard' = '1', and so on. I tried some loops and lists but failed. I also want a dictionary so I can check which number corresponds to which actor. Can you help me?
You could get all unique names of the column, build a dictionary from them, and then use map to replace the values with the numbers. At the same time you have the dictionary to check which actor a number refers to.
all_names = df['Actor_Name'].unique()
dic = dict((v,k) for k,v in enumerate(all_names))
df['Actor_Name'] = df['Actor_Name'].map(dic)
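A minimal runnable sketch of the idea, using toy data (the names and the 'Actor_Name' column are just for illustration). Inverting the dictionary gives you the reverse lookup the question asks for:

```python
import pandas as pd

# Toy data; column name matches the snippet above
df = pd.DataFrame({'Actor_Name': ['Leslie Howard', 'Bette Davis', 'Leslie Howard']})

all_names = df['Actor_Name'].unique()
dic = {name: i for i, name in enumerate(all_names)}  # name -> number
df['Actor_Name'] = df['Actor_Name'].map(dic)

# Invert the mapping to look up which actor a number refers to
inv = {v: k for k, v in dic.items()}
print(df['Actor_Name'].tolist())  # [0, 1, 0]
print(inv[0])                     # Leslie Howard
```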
You can just use factorize:
df['Movie_name'] = df['Movie_name'].factorize()[0]
df['Actor_name'] = df['Actor_name'].factorize()[0]
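Note that factorize returns a pair: the integer codes and the unique values, so the second element already serves as the lookup table back to names. A small sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({'Actor_name': ['Leslie Howard', 'Bette Davis', 'Leslie Howard']})

codes, uniques = df['Actor_name'].factorize()
df['Actor_name'] = codes

print(codes.tolist())  # [0, 1, 0]
print(uniques[0])      # Leslie Howard -- recover the name from a code
```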
Convert the column to type category and use the integer codes via .cat.codes:
df['Actor_Name'] = df['Actor_Name'].astype('category').cat.codes
I have a bunch of keywords stored in a 620x2 pandas dataframe. I think I need to treat each entry as its own set, where semicolons separate elements, so we end up with 1240 sets. Then I'd like to be able to search how many times keywords of my choosing appear together. For example, I'd like to figure out how many times 'computation theory' and 'critical infrastructure' appear together as a subset of these sets, in any order. Is there any straightforward way I can do this?
Use .loc to find whether the keywords appear together.
Do this after you have split the data into 1240 sets. It isn't clear whether you want to create new columns or keep the columns as they are.
# create a filter for keyword 1
filter_keyword_1 = (df['column_name'].str.contains('critical infrastructure'))
# create a filter for keyword 2
filter_keyword_2 = (df['column_name'].str.contains('computation theory'))
# you can create more filters with the same construction as above.
# To check the number of times both the keywords appear
len(df.loc[filter_keyword_1 & filter_keyword_2])
# To see the dataframe
subset_df = df.loc[filter_keyword_1 & filter_keyword_2]
.loc selects the conditional data frame. If you have only one condition, you can simply use subset_df = df[df['column_name'].str.contains('string')].
Do the column split or any other processing before you build the filters, or re-run the filters after processing.
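The filtering above can be sketched end to end on toy data (the column name 'column_name' and the rows here are hypothetical):

```python
import pandas as pd

# Hypothetical keyword strings; semicolons separate keywords as in the question
df = pd.DataFrame({'column_name': [
    'computation theory; critical infrastructure',
    'critical infrastructure; smart grids',
    'computation theory; automata',
]})

filter_keyword_1 = df['column_name'].str.contains('critical infrastructure')
filter_keyword_2 = df['column_name'].str.contains('computation theory')

# Number of rows where both keywords appear
print(len(df.loc[filter_keyword_1 & filter_keyword_2]))  # 1
```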
Not sure if this counts as straightforward, but it works. keyword_list is the list of paired keywords you want to search for.
df['Author Keywords'] = df['Author Keywords'].fillna('').str.split(r';\s*').apply(set)
df['Index Keywords'] = df['Index Keywords'].fillna('').str.split(r';\s*').apply(set)
df.apply(lambda col: col.apply(lambda s: all(kw in s for kw in keyword_list))).sum().sum()
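A self-contained version of the set-based approach, on a hypothetical two-row frame with the question's two column names (the real frame is 620x2):

```python
import pandas as pd

keyword_list = ['computation theory', 'critical infrastructure']

# Toy data for illustration; NaN stands in for a missing keyword cell
df = pd.DataFrame({
    'Author Keywords': ['computation theory; critical infrastructure', None],
    'Index Keywords':  ['smart grids', 'critical infrastructure; computation theory'],
})

# Each cell becomes a set of keywords
df['Author Keywords'] = df['Author Keywords'].fillna('').str.split(r';\s*').apply(set)
df['Index Keywords'] = df['Index Keywords'].fillna('').str.split(r';\s*').apply(set)

# Count the cells whose set contains every keyword in keyword_list
count = df.apply(lambda col: col.apply(lambda s: all(kw in s for kw in keyword_list))).sum().sum()
print(count)  # 2
```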
Newbie here. Just as the title says, I have a list of dataframes (each dataframe is a class of students). All dataframes have the same columns. I have made certain columns global.
BINARY_CATEGORIES = ['Gender', 'SPED', '504', 'LAP']
for example. These are yes/no or male/female categories, and I have already changed all of the data to be 1's and 0's for these columns. There are several other columns which I want to ignore as I iterate.
I am trying to accept the list of classes (dataframes) into my function and perform calculations on each dataframe using only my BINARY_CATEGORIES list of columns. This is what I've got, but it isn't making it through all of the classes and/or all of the columns.
def bal_bin_cols(classes):
    i = 0
    c = 0
    for x in classes:
        total_binary = classes[c][BINARY_CATEGORIES[i]].sum()
        print(total_binary)
        i += 1
        c += 1
Eventually I need a new dataframe from all of this, with the sums corresponding to the categories and the respective classes. print(total_binary) is just a placeholder/debugger. I don't yet have the code that will populate the dataframe from the results of the above, but I'd like the classes as the index and the totals as the columns.
I know there's probably a vectorized way to do this, or enumerate, or groupby, but I will take a fix to my loop. I've been stuck forever. Please help.
Try something like:
First, create a dictionary:
d={
'male':1,
'female':0,
'yes':1,
'no':0
}
Finally, use replace(), which accepts the dictionary directly:
df[BINARY_CATEGORIES] = df[BINARY_CATEGORIES].replace(d)
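The original loop advances both counters at once, so it pairs class 0 with category 0, class 1 with category 1, and so on, instead of summing every category for every class. A sketch of a fixed version that builds the requested result frame (the two classes here are hypothetical; classes is assumed to be a list of DataFrames, as in the question):

```python
import pandas as pd

BINARY_CATEGORIES = ['Gender', 'SPED', '504', 'LAP']

# Two hypothetical classes, already encoded as 0/1
class_a = pd.DataFrame({'Gender': [1, 0, 1], 'SPED': [0, 0, 1],
                        '504': [0, 1, 0], 'LAP': [1, 1, 1]})
class_b = pd.DataFrame({'Gender': [0, 0], 'SPED': [1, 0],
                        '504': [1, 1], 'LAP': [0, 0]})
classes = [class_a, class_b]

def bal_bin_cols(classes):
    # One row of sums per class; the row index identifies the class
    return pd.DataFrame([c[BINARY_CATEGORIES].sum() for c in classes])

totals = bal_bin_cols(classes)
print(totals)
```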
This is my first post to the coding community, so I hope I get the right level of detail in my request for help!
Background info:
I want to repeat (loop) a command on a df using a variable that contains a list of options. While the series 'amenity_options' contains a simple list of specific items (say, only the four amenities in the example below), the df is a large data frame with many other items. My goal is to run the operation below for each item in 'amenity_options' until the end of the list.
amenity_options = ['cafe','bar','cinema','casino'] # this is a series type with multiple options
df = df[df['amenity'] == amenity_options]  # my attempt to select the first value in the series (e.g. cafe) out of a dataframe that contains such a column name
df.to_excel('{}_amenity.xlsx'.format('amenity'))  # wish to save the result (e.g. cafe_amenity) as a separate file
Desired result: I wish to loop steps one and two for each and every item in the list (e.g. cafe, bar, cinema...), so that I end up with separate Excel files. Any thoughts?
What @Rakesh suggested is correct, you probably just need one more step.
df = df[df['amenity'].isin(amenity_options)]
for key, g in df.groupby('amenity'):
    g.to_excel('{}_amenity.xlsx'.format(key))
After you call groupby() on your df, you get 4 groups, so you can loop over them directly.
key is the group key ('cafe', 'bar', etc.) and g is the sub-dataframe filtered down to that key.
Seems like you just need a simple for loop:
for amenity in amenity_options:
    df[df['amenity'] == amenity].to_excel(f"{amenity}_amenity.xlsx")
Sorry for yet another question but I'm very new at python.
I have reaction time data for my go/no-go conditions. I have put them into a dictionary called rts with two keys, "go" and "nogo". I have worked out how to separate each numpy array row within these conditions, as each row is a participant (there are 20 participants). I've managed to print out the mean and standard deviation for each participant as a table. This is the code below:
for row in range(20):
    go_row = rts["go"][row, :]
    nogo_row = rts["nogo"][row, :]
    participant = row + 1
    print("{} {:.2f} {:.2f} {:.2f} {:.2f}".format(participant,
          go_row.mean(), go_row.std(), nogo_row.mean(), nogo_row.std()))
What I'm struggling to do is make a variable with each of the mean values for each participant. I want to do this as I want to create a histogram showing the distribution in performance across participants. Any help would be appreciated.
IIUC you want a list:
means_participant = []
for row in range(20):
    go_row = rts["go"][row, :]
    means_participant.append(go_row.mean())
Store the values for each row in a dictionary, then add the dictionaries to a list that can be looped over later. This can be condensed, but I left it spelled out for clarity.
values = []
for row in range(20):
    go_row = rts["go"][row, :]
    nogo_row = rts["nogo"][row, :]
    d = {}
    d['participant'] = row + 1
    d['go_row_mean'] = go_row.mean()
    d['go_row_std'] = go_row.std()
    d['nogo_row_mean'] = nogo_row.mean()
    d['nogo_row_std'] = nogo_row.std()
    values.append(d)
The dictionary would be unnecessary if you only wanted one of the values, such as go_row.mean(), and didn't care about matching the means in the list back up with a participant.
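Once values is a list of dicts, turning it into a DataFrame makes the participant histogram straightforward. A runnable sketch using randomly generated reaction times in place of the real data (the 20x50 shape is an assumption):

```python
import numpy as np
import pandas as pd

# Hypothetical reaction-time data: 20 participants x 50 trials per condition
rng = np.random.default_rng(0)
rts = {'go': rng.random((20, 50)), 'nogo': rng.random((20, 50))}

values = []
for row in range(20):
    go_row = rts['go'][row, :]
    nogo_row = rts['nogo'][row, :]
    values.append({'participant': row + 1,
                   'go_row_mean': go_row.mean(),
                   'nogo_row_mean': nogo_row.mean()})

summary = pd.DataFrame(values)
# summary['go_row_mean'].hist() would then draw the distribution across participants
print(summary.shape)  # (20, 3)
```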
I have a pandas DataFrame with columns Teacher_ID and Student_ID. I also have dicts for each, TDict and SDict, giving, say, the grade in which each teacher teaches and the grade each student is enrolled in, with their ID numbers as the keys.
I want to create a new column in my DataFrame referencing the information in the dicts. But when I try to create a column with a formula something like TDict[Teacher_ID] + SDict[Student_ID], I get an error message telling me that "'Series' objects are mutable, thus they cannot be hashed."
What's the approved way around this? Do I have to copy the ID's into new columns, replace the values in those columns with the dict values, and then work from there? I'm guessing there's a better way....
If I understand you correctly then you can simply call map:
df['Teaching_grade'] = df['Teacher_ID'].map(TDict)
df['Student_grade'] = df['Student_ID'].map(SDict)
This will perform the lookup and assign the values to the new columns.
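A small sketch with hypothetical IDs and grade dicts, including the element-wise sum the question attempted with TDict[Teacher_ID] + SDict[Student_ID]:

```python
import pandas as pd

# Hypothetical IDs and grade lookups
df = pd.DataFrame({'Teacher_ID': [101, 102], 'Student_ID': [1, 2]})
TDict = {101: 3, 102: 5}
SDict = {1: 3, 2: 4}

df['Teaching_grade'] = df['Teacher_ID'].map(TDict)
df['Student_grade'] = df['Student_ID'].map(SDict)

# map operates element-wise, so the new columns can be combined directly
df['Grade_sum'] = df['Teaching_grade'] + df['Student_grade']
print(df['Grade_sum'].tolist())  # [6, 9]
```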