Using dicts to look up values for DataFrame variables - python

I have a pandas DataFrame with columns Teacher_ID and Student_ID. I also have dicts for each, TDict and SDict, giving, say, the grade in which each teacher teaches and the grade each student is enrolled in, with their ID numbers as the keys.
I want to create a new column in my DataFrame referencing the information in the dicts. But when I try to create a column with a formula something like TDict[Teacher_ID] + SDict[Student_ID], I get an error message telling me that "'Series' objects are mutable, thus they cannot be hashed."
What's the approved way around this? Do I have to copy the IDs into new columns, replace the values in those columns with the dict values, and then work from there? I'm guessing there's a better way...

If I understand you correctly then you can simply call map:
df['Teaching_grade'] = df['Teacher_ID'].map(TDict)
df['Student_grade'] = df['Student_ID'].map(SDict)
This will perform the lookup and assign the values to the new columns.
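If you also need the combined value the original formula was aiming for, a minimal sketch (with hypothetical example data standing in for the real TDict, SDict and df) would map both columns and then add the results:
import pandas as pd

# hypothetical example data; the real TDict, SDict and df come from the question
TDict = {101: 3, 102: 5}
SDict = {201: 3, 202: 4}
df = pd.DataFrame({'Teacher_ID': [101, 102], 'Student_ID': [201, 202]})

# map() does the per-row dict lookup, so no Series is ever used as a dict key
df['Teaching_grade'] = df['Teacher_ID'].map(TDict)
df['Student_grade'] = df['Student_ID'].map(SDict)
df['Grade_sum'] = df['Teaching_grade'] + df['Student_grade']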

Related

How do I change value names to numbers in dataframe?

I have a dataframe which has 10k movie names and 40k actor names.
The reason is that I'm trying to make a graph with nx, but the graphic becomes unreadable because of the actors' names. So I want to change their names to numbers. Some of these actors played in multiple movies, which means they appear more than once. I want to change all of these actors to numbers, like 'Leslie Howard' = '1' and so on. I tried some loops and lists but failed. I also want to make a dictionary so I can check which number corresponds to which actor. Can you help me?
You could get all the unique names in the column, generate a dictionary, and then use map to change the values to numbers. At the same time you keep the dictionary to check which actor a number refers to.
all_names = df['Actor_Name'].unique()
dic = dict((v,k) for k,v in enumerate(all_names))
df['Actor_Name'] = df['Actor_Name'].map(dic)
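Since dic maps each name to its number, a small sketch of the reverse lookup the question mentions (number back to actor), continuing from dic above, could be:
# invert the name -> number mapping so you can look actors up by number
num_to_name = {num: name for name, num in dic.items()}
print(num_to_name[1])   # e.g. 'Leslie Howard', depending on the order of unique()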
You can also just use factorize:
df['Movie_name'] = df['Movie_name'].factorize()[0]
df['Actor_name'] = df['Actor_name'].factorize()[0]
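Note that factorize also returns the array of unique values, which can double as the number-to-name lookup the question asks for; a minimal sketch:
# codes are the integer labels; uniques holds the original names in code order
codes, uniques = df['Actor_name'].factorize()
df['Actor_name'] = codes
actor_lookup = dict(enumerate(uniques))   # e.g. {0: 'Leslie Howard', 1: ...}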
Convert the column to the category dtype and get the integer codes with .cat.codes:
df['Actor_Name'] = df['Actor_Name'].astype('category').cat.codes
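If you take this route and still want the number-to-actor dictionary, one possible sketch is to keep the categories around before overwriting the column:
cats = df['Actor_Name'].astype('category')
code_to_name = dict(enumerate(cats.cat.categories))   # codes index into the ordered categories
df['Actor_Name'] = cats.cat.codes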

Create a column in dataframe with name of an existing array (initial 4 letters of array name)

I would like to create a column in a dataframe whose name comes from the name of an array. For example, if the array's name is "customer", then the name of the column should be "cust_prop" (the initial 4 letters of the array's name). Is there any way to do this?
Your question is a bit unclear, but presuming you are asking how to turn the string "customer" into "cust_prop", that's easy enough:
Str = "customer"
NewStr = Str[0:4] + "_prop"
You might need to do some extra checking for shorter strings, but I don't know what behaviour you would want in that case.
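For example, a minimal sketch of such a check (the helper name, padding character and default lengths are just assumptions) might look like:
def make_col_name(array_name, prefix_len=4, suffix="_prop"):
    # pad very short names so the prefix always has prefix_len characters (assumed behaviour)
    prefix = array_name[:prefix_len].ljust(prefix_len, "x")
    return prefix + suffix

print(make_col_name("customer"))   # cust_prop
print(make_col_name("id"))         # idxx_prop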
If you mean something else, please post some code examples of what you have tried.
You didn't really describe where you get the array name from, so I'll just assume you have it in a variable:
array_name = 'customer'
To slice only the first four characters and use them:
new_col_name = f'{array_name[0:4]}_prop'
df[new_col_name] = 1
here I "created" a new column in existing dataframe df, and put value of 1 to the entire column. Instead, you can create a series with any value you want:
series = pd.Series(name=new_col_name, data=array_customer)
Here I created a series with the desired name, and assumed you have an array_customer variable which holds the array.
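To attach that named series to a dataframe as a column, one possible sketch (with hypothetical data, assuming the array lines up with the dataframe's index) is:
import pandas as pd

array_name = 'customer'
new_col_name = f'{array_name[:4]}_prop'       # 'cust_prop', as built above
array_customer = [10, 20, 30]                 # hypothetical data
df = pd.DataFrame({'existing': [1, 2, 3]})    # hypothetical dataframe

series = pd.Series(name=new_col_name, data=array_customer)
# concat joins on the index, so the series name becomes the new column label
df = pd.concat([df, series], axis=1)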

Iterate through list of dataframes, performing calculations on certain columns of each dataframe, resulting in new dataframe of the results

Newbie here. Just as the title says, I have a list of dataframes (each dataframe is a class of students). All dataframes have the same columns. I have made certain columns global.
BINARY_CATEGORIES = ['Gender', 'SPED', '504', 'LAP']
for example. These are yes/no or male/female categories, and I have already changed all of the data to 1s and 0s for these columns. There are several other columns which I want to ignore as I iterate.
I am trying to accept the list of classes (dataframes) into my function and perform calculations on each dataframe using only my BINARY_CATEGORIES list of columns. This is what I've got, but it isn't making it through all of the classes and/or all of the columns.
def bal_bin_cols(classes):
    i = 0
    c = 0
    for x in classes:
        total_binary = classes[c][BINARY_CATEGORIES[i]].sum()
        print(total_binary)
        i += 1
        c += 1
Eventually I need a new dataframe with all of the sums, corresponding to the categories and the respective classes. print(total_binary) is just a placeholder/debugger. I don't yet have the code that will populate the dataframe from the results of the above, but I'd like it to have the classes as the index and the totals per category as the columns.
I know there's probably a vectorized way to do this, or enum, or groupby, but I will take a fix to my loop. I've been stuck forever. Please help.
Try something like:
First, create a dictionary:
d = {
    'male': 1,
    'female': 0,
    'yes': 1,
    'no': 0
}
Finally, use replace():
df[BINARY_CATEGORIES] = df[BINARY_CATEGORIES].replace(d.keys(), d.values(), regex=True)
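To then get to the question's actual goal (one row of sums per class), a minimal sketch, reusing the question's bal_bin_cols name and a hypothetical class_names parameter, could be:
import pandas as pd

BINARY_CATEGORIES = ['Gender', 'SPED', '504', 'LAP']

def bal_bin_cols(classes, class_names=None):
    # sum the 0/1 columns of every class dataframe and stack the results as rows
    totals = [cls[BINARY_CATEGORIES].sum() for cls in classes]
    return pd.DataFrame(totals, index=class_names)

# hypothetical usage: result has one row per class and one column per category
# summary = bal_bin_cols(list_of_class_dfs, class_names=['Period 1', 'Period 2'])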

How to make the columns of a dataframe variable

I want to make the columns of Salary_Data_split variable, depending on Sal_name (type: list), where:
Sal_name = ['Success_S_1', 'Failure_S_1', 'Success_S_2', 'Failure_S_2','Success_S_4', 'Failure_S_4','Success_S_7', 'Failure_S_7','Success_S_8', 'Failure_S_8']
and Salary_Data_split must be as follows: it contains Salary plus the existing columns listed in Sal_name, like:
Salary_Data_split = data[["Salary",'Success_S_1', 'Failure_S_1', 'Success_S_2', 'Failure_S_2','Success_S_4', 'Failure_S_4','Success_S_7', 'Failure_S_7','Success_S_8', 'Failure_S_8']]
I have tried this code but it doesn't work:
Salary_Data_split = data[["Salary", Sal_name]]
Please always include example data in your posts. It's also important to always include error messages. That way, your question is a lot clearer. I am guessing data is your dataframe with columns Sal_name and Salary, which you want to combine in Sal_data_split?
data['sal_Data_Split'] = [data['Salary'], data['Sal_name']]
This will put the columns Salary and Sal_name in a list, resulting in a nested list if data['Sal_name'] is a list itself. The way you assigned Salary_Data_split = data[["Salary", Sal_name]] in your original post just indexes two columns of the dataframe at once. You also forgot the quotation marks around Sal_name, if that is what you meant.
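If Sal_name is a plain Python list of column names rather than a column, a minimal sketch of the selection the question seems to be after is to concatenate the lists before indexing:
# build one flat list of column labels, then select them all at once
Salary_Data_split = data[["Salary"] + Sal_name]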

How to convert Multilevel Dictionary with Irregular Data to Desired Format

Dict = {'Things': {'Car': 'Lambo',
                   'Home': 'NatureVilla',
                   'Gadgets': {'Laptop': {'Programs': {'Data': 'Excel',
                                                       'Officework': 'Word',
                                                       'Coding': {'Python': 'PyCharm',
                                                                  'Java': 'Eclipse',
                                                                  'Others': 'SublimeText'},
                                                       'Wearables': 'SamsungGear',
                                                       'Smartphone': 'Nexus'},
                                          'clothes': 'ArmaaniSuit',
                                          'Bags': 'TravelBags'}}}}
d = {(i,j,k,l,m,n): Dict[i][j][k][l][m][n]
     for i in Dict.keys()
     for j in Dict[i].keys()
     for k in Dict[j].keys()
     for l in Dict[k].keys()
     for m in Dict[l].keys()
     for n in Dict[n].keys()
     }
mux = pd.MultiIndex.from_tuples(d.keys())
df = pd.DataFrame(list(d.values()), index=mux)
print (df)
What I have already done:
I tried to MultiIndex this irregular data using pandas, but I am getting a KeyError at 'Car'. Then I tried to handle the exception and pass it, but that results in a SyntaxError. So maybe I lost the direction. If there is any other module or way to index this irregular data and put it in a table somehow, I'd like to know. I have a chunk of raw data like this.
What I am trying to do:
I want to use this data for display in a QTableView from PyQt5 (I'm making a program with a GUI).
Conditions:
This data keeps updating every hour from an API.
What I have thought till now:
Maybe I can append all this data to MySQL. But then, when this data updates from the API, only the values will change; the rest of the keys will stay the same. And it will require more space.
References:
How to convert a 3-level dictionary to a desired format?
How to build a MultiIndex Pandas DataFrame from a nested dictionary with lists
Any Help will be appreciated. Thanks for reading the question.
Your data is not actually a 6-level dictionary like the dictionary in the 3-level example you referenced. The difference is that your dictionary has data on multiple different levels, e.g. the 'Lambo' value is on the second level of the hierarchy with key ('Things','Car'), but the 'Eclipse' value is on the sixth level with key ('Things','Gadgets','Laptop','Programs','Coding','Java').
If you want to 'flatten' your structure, you will need to decide what to do with the 'missing' key values at deeper levels for values like 'Lambo'.
By the way, maybe this is not actually the right solution for your problem; a more appropriate UI widget like a TreeView might suit this kind of hierarchical data better. But I will try to address your exact question directly.
Unfortunately, there seems to be no easy way to reference values at all the different levels uniformly in one simple dict or list comprehension.
Just look at your 'value extractor' (Dict[i][j][k][l][m][n]): there are no values of i,j,k,l,m,n that allow you to get 'Lambo', because to get a Lambo you just need Dict['Things']['Car'] (ironically, in real life it can also be difficult to get a Lambo :-) ).
One straightforward way to solve your task is:
extract the second-level data, extract the third-level data, and so on, then combine them together.
E.g. to extract second level values you can write something like this:
val_level2 = {(k1, k2): Dict[k1][k2]
              for k1 in Dict
              for k2 in Dict[k1]
              if isinstance(Dict[k1], dict) and
              not isinstance(Dict[k1][k2], dict)}
but if you want to combine it later with sixth-level values, you will need to add some padding to your key tuples:
val_level2 = {(k1, k2, '', '', '', ''): Dict[k1][k2]
              for k1 in Dict
              for k2 in Dict[k1]
              if isinstance(Dict[k1], dict) and
              not isinstance(Dict[k1][k2], dict)}
Later you can combine everything together with something like:
d = {}
d.update(val_level2)
d.update(val_level3)
But usually the most organic way to work with hierarchical data is to use some recursion, like this:
def flatten_dict(d, key_prefix, max_deep):
    return [(tuple(key_prefix + [k] + [''] * (max_deep - len(key_prefix))), v)
            for k, v in d.items() if not isinstance(v, dict)] + \
           sum([flatten_dict(v, key_prefix + [k], max_deep)
                for k, v in d.items() if isinstance(v, dict)], [])
And later with code like this:
d={k:v for k,v in flatten_dict(Dict,[],5)}
mux = pd.MultiIndex.from_tuples(d.keys())
df = pd.DataFrame(list(d.values()), index=mux)
df.reset_index()
With your data, this actually gives me the expected flattened result.
P.S. According to https://www.python.org/dev/peps/pep-0008/#prescriptive-naming-conventions, lowercase_with_underscores is preferred for variable names; CapWords is for classes. So src_dict would be much better than Dict in your case.
Your information looks a lot like JSON, and that's probably what the API is returning. If that's the case, and you are turning it into a dictionary, then you might be better off using Python's json library or even pandas' built-in read_json.
Pandas read json
Python's json
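As a minimal sketch of that idea (assuming the API hands you JSON text shaped like the dictionary above), you could parse it and let pandas flatten the nesting into dotted column names:
import json
import pandas as pd

raw = '{"Things": {"Car": "Lambo", "Home": "NatureVilla"}}'   # hypothetical API response
data = json.loads(raw)

# json_normalize flattens nested dicts into columns like 'Things.Car'
df = pd.json_normalize(data)
print(df.columns.tolist())   # ['Things.Car', 'Things.Home']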
