Tuple key Dictionary to Table/Graph 3-dimensional - python

I have a dictionary like this:
d = {(100, 22, 123): '55%', (110, 24, 123): '58%'}
The elements of each tuple are (x, y, z), for example, and the value is the error rate of something. I want to print that dictionary, but I'm not clear how to do it or which format would make the information easiest to read (maybe: x - y - z - Rate).
I found Converting Dictionary to Dataframe with tuple as key, but I don't think it fits what I want, and I can't understand it.
Thank you

You can use Series with reset_index, then set new column names:
import pandas as pd
d = {(100,22,123):'55%',(110,24,123):'58%'}
df = pd.Series(d).reset_index()
df.columns = ['a','b','c', 'd']
print (df)
     a   b    c    d
0  100  22  123  55%
1  110  24  123  58%
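If you prefer to build the x - y - z - Rate layout from the question directly, here is a minimal alternative sketch (the column names 'x', 'y', 'z', 'Rate' are my choice, not part of the original answer):
import pandas as pd

d = {(100, 22, 123): '55%', (110, 24, 123): '58%'}
# unpack each (x, y, z) key and its rate value into one row
df = pd.DataFrame([(x, y, z, rate) for (x, y, z), rate in d.items()],
                  columns=['x', 'y', 'z', 'Rate'])
print(df)
#      x   y    z Rate
# 0  100  22  123  55%
# 1  110  24  123  58%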

Related

Obtain a view of a DataFrame using the loc method

I am trying to obtain a view of a pandas dataframe using the loc method but it is not working as expected when I am modifying the original DataFrame.
I want to extract a row/slice of a DataFrame using the loc method so that when a modification is done to the DataFrame, the slice reflects the change.
Let's have a look at this example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID':np.arange(0,5,2), 'a':np.arange(3), 'b':np.arange(3)}).set_index('ID')
df
    a  b
ID
0   0  0
2   1  1
4   2  2
Now I create a slice using loc:
slice1 = df.loc[[2],]
slice1
    a  b
ID
2   1  1
Then I modify the original DataFrame:
df.loc[2, 'b'] = 9
df
    a  b
ID
0   0  0
2   1  9
4   2  2
But unfortunately our slice does not reflect this modification as I would be expecting for a view:
slice1
    a  b
ID
2   1  1
My expectation:
    a  b
ID
2   1  9
I found an ugly fix using a mix of iloc and loc but I hope there is a nicer way to obtain the result I am expecting.
Thank you for your help.
Disclaimer: This is not an answer.
I tried testing how overwriting values behaves with chained assignment vs .loc, referring to the pandas documentation link shared by @Quang Hoang above.
This is what I tried:
dfmi = pd.DataFrame([list('abcd'),
                     list('efgh'),
                     list('ijkl'),
                     list('mnop')],
                    columns=pd.MultiIndex.from_product([['one', 'two'],
                                                        ['first', 'second']]))
df1 = dfmi['one']['second']
df2 = dfmi.loc[:, ('one', 'second')]
Output of both df1 and df2:
0 b
1 f
2 j
3 n
Iteration 1:
value = ['z', 'x', 'c', 'v']
dfmi['one']['second'] = value
Output df1:
0 z
1 x
2 c
3 v
Iteration 2:
value = ['z', 'x', 'c', 'v']
dfmi.loc[:, ('one', 'second')] = value
Output df2:
0 z
1 x
2 c
3 v
Assigning the new values changes the result in both cases.
The documentation says:
Quote 1: 'method 2 (.loc) is much preferred over method 1 (chained [])'
Quote 2:
'Outside of simple cases, it’s very hard to predict whether __getitem__ (used by the chained option) will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees), and therefore whether the subsequent __setitem__ will modify dfmi or a temporary object that gets thrown out immediately afterward.'
I am not able to understand the explanation above. If the value in dfmi can change (as in my case) but may not change (as in Benoit's case), then which approach should be used to obtain the result? I'm not sure if I'm missing a point here.
Looking for help
The reason the slice didn't reflect the changes you made in the original dataframe is because you created the slice first.
When you create a slice, you create a "copy" of a slice of the data. You're not directly linking the two.
The short answer is that you have two options: 1) change the original df first, then create the slice, or 2) don't slice at all, and do your operations on the original df using .loc or .iloc.
The memory addresses of your dataframe and your slice are different, so changes in the dataframe won't be reflected in the slice.
The answer is to change the value in the dataframe and then slice it.
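A minimal sketch of option 1 (modify first, then slice), reusing the example data from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': np.arange(0, 5, 2),
                   'a': np.arange(3),
                   'b': np.arange(3)}).set_index('ID')
df.loc[2, 'b'] = 9          # modify the original first
slice1 = df.loc[[2], :]     # then take the slice
print(slice1)               # now shows b == 9 for ID 2, as expected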

How to create a summary row in Pandas from audit fields

I am trying to derive a single row based on original input and then various changes to individual column values at different points in time. I have simplified the list below.
I have read in some data into my dataframe as so:
   A  B  C  D  E
0  h  h  h  h  h
1  x
2     y  1
3        2  3
Row 0 - "h" represents my original record.
Rows 1-3 are changes over time to specific columns.
I would like to create a single "result row" that would look something like:
'x', 'y', '2', '3', 'h'
Is there a simple way to do this with Pandas and Python without excessive looping?
You can get it as a list like so:
>>> [df[s][df[s].last_valid_index()] for s in df]
['x', 'y', 2, 3, 'h']
If you need it appended as a row with a name, you need to give it an index and then append it, like so:
df.append(pd.Series(temp, index=df.columns, name='total'))
# note, this returns a new object
# where 'temp' is the output of the code above
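Note that DataFrame.append was removed in pandas 2.0; a pd.concat equivalent of the line above could look like this (a sketch; the variable names are mine):
import pandas as pd

temp = [df[s][df[s].last_valid_index()] for s in df]
summary = pd.Series(temp, index=df.columns, name='total')
df = pd.concat([df, summary.to_frame().T])  # also returns a new object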
You can just try:
# if the blanks are empty strings rather than NaN, convert them first
#df = df.replace({'': np.nan})
df.ffill().iloc[[-1], :]
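A minimal end-to-end sketch of that approach, assuming the blanks in the sample data are NaN:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['h', 'x', np.nan, np.nan],
                   'B': ['h', np.nan, 'y', np.nan],
                   'C': ['h', np.nan, 1, 2],
                   'D': ['h', np.nan, np.nan, 3],
                   'E': ['h', np.nan, np.nan, np.nan]})
# forward-fill so the last row holds each column's most recent value,
# then keep only that last row
print(df.ffill().iloc[[-1], :])
#    A  B  C  D  E
# 3  x  y  2  3  h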

how to assign value to the pandas column?

I have a DataFrame, say one column is:
{'university': ['A', 'B', 'A', 'C']}
I want to change the column into:
{'university': [1, 2, 1, 3]}
According to an imaginary dict:
{'A':1,'B':2,'C':3}
how to get this done?
PS: I solved the original problem; it was something about my own computer's settings.
I have changed the question accordingly to be more helpful.
I think you need map with the dict d:
df.university = df.university.map(d)
If you need to encode the values as an enumerated type or categorical variable, use factorize:
df.university = pd.factorize(df.university)[0] + 1
Sample:
d = {'A':1,'B':2,'C':3}
df = pd.DataFrame({'university':['A','B','A','C']})
df['a'] = df.university.map(d)
df['b'] = pd.factorize(df.university)[0] + 1
print (df)
  university  a  b
0          A  1  1
1          B  2  2
2          A  1  1
3          C  3  3
I'll try to rewrite your function:
def given_value(column):
    columnlist = column.drop_duplicates()
    # reset to a default monotonic increasing index (0, 1, 2, ...)
    columnlist = columnlist.reset_index(drop=True)
    #print (columnlist)
    # swap index and values into a new Series, columnlist_rev
    columnlist_rev = pd.Series(columnlist.index, index=columnlist.values)
    # map by columnlist_rev
    column = column.map(columnlist_rev)
    return column

print (given_value(df.university))
0 0
1 1
2 0
3 2
Name: university, dtype: int64
AttributeError: 'DataFrame' object has no attribute 'column'
Your answer is written in the exception message! A DataFrame object doesn't have an attribute called column, which means you can't call DataFrame.column at any point in your code. I believe your problem exists outside of what you have posted here, likely somewhere near the part where you imported the data as a DataFrame for the first time. My guess is that when you were naming the columns, you did something like df.column = [university] instead of df.columns = [university]. The s matters. If you read the traceback closely, you'll be able to figure out precisely which line is throwing the error.
Also, in your posted function, you do not need the parameter df as it is not used at any point during the process.
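A minimal sketch reproducing the error the traceback describes (illustrative only; the column name is reused from the question above):
import pandas as pd

df = pd.DataFrame({'university': ['A', 'B', 'A', 'C']})
print(df.columns)  # works: Index(['university'], dtype='object')
df.column          # raises AttributeError: 'DataFrame' object has no attribute 'column'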

Creating a New Pandas Grouped Object

In some transformations, I seem to be forced to break from the Pandas dataframe grouped object, and I would like a way to return to that object.
Given a dataframe of time series data, if one groups by one of the values in the dataframe, we are given an underlying dictionary from key to dataframe.
Being forced to make a Python dict from this, the structure cannot be converted back into a Dataframe using the .from_dict() because the structure is key to dataframe.
The only way to go back to Pandas without some hacky column renaming is, to my knowledge, by converting it back to a grouped object.
Is there any way to do this?
If not, how would I convert a dictionary of key-to-DataFrame back into a Pandas data structure?
EDIT: ADDING SAMPLE:
from numpy.random import randn
rng = pd.date_range('1/1/2000', periods=10, freq='10m')
df = pd.DataFrame({'a': pd.Series(randn(len(rng)), index=rng),
                   'b': pd.Series(randn(len(rng)), index=rng)})
# now we have a dataframe with 'a' and 'b' time series
df_dict = {}
for k, v in df.groupby('a'):
    df_dict[k] = v
# now we apply some transformation that cannot be applied via aggregate, transform, or apply
# how do we get this back into a groupby object?
If I understand OP's question correctly, you want to group a dataframe by some key(s), do different operations on each group (possibly generating new columns, etc.) and then go back to the original dataframe.
Modifying your example (grouping by random integers instead of floats, which are usually unique):
np.random.seed(200)
rng = pd.date_range('1/1/2000', periods=10, freq='10m')
df = pd.DataFrame({'a': pd.Series(np.random.randn(len(rng)), index=rng),
                   'b': pd.Series(np.random.randn(len(rng)), index=rng)})
df['group'] = np.random.randint(3,size=(len(df)))
Usually, if I need a single value per column per group, I'll do this (for example, sum of 'a', mean of 'b'):
In [10]: df.groupby('group').aggregate({'a':np.sum, 'b':np.mean})
Out[10]:
              a         b
group
0     -0.214635 -0.319007
1      0.711879  0.213481
2      1.111395  1.042313

[3 rows x 2 columns]
However, if I need a series for each group,
In [19]: def func(sub_df):
   ....:     sub_df['c'] = sub_df['a'] * sub_df['b'].shift(1)
   ....:     return sub_df
   ....:
In [20]: df.groupby('group').apply(func)
Out[20]:
                   a         b  group         c
2000-01-31 -1.450948  0.073249      0       NaN
2000-11-30  1.910953  1.303286      2       NaN
2001-09-30  0.711879  0.213481      1       NaN
2002-07-31 -0.247738  1.017349      2 -0.322874
2003-05-31  0.361466  1.911712      2  0.367737
2004-03-31 -0.032950 -0.529672      0 -0.002414
2005-01-31 -0.221347  1.842135      2 -0.423151
2005-11-30  0.477257 -1.057235      0 -0.252789
2006-09-30 -0.691939 -0.862916      2 -1.274646
2007-07-31  0.792006  0.237631      0 -0.837336

[10 rows x 4 columns]
I'm guessing you want something like the second example, but the original question wasn't very clear, even with your example.
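If you do end up holding a plain dict of key -> DataFrame, here is a sketch for stitching it back together with pd.concat and re-grouping (my suggestion, not part of the original answer):
import numpy as np
import pandas as pd

np.random.seed(200)
df = pd.DataFrame({'a': np.random.randn(10), 'b': np.random.randn(10)})
df['group'] = np.random.randint(3, size=len(df))

# dict of group key -> (possibly transformed) sub-DataFrame
df_dict = {k: v for k, v in df.groupby('group')}

# pd.concat stitches the pieces back into one DataFrame; the dict keys
# become an extra index level, which droplevel removes
combined = pd.concat(df_dict).droplevel(0)

# a fresh groupby then gives you a GroupBy object again
regrouped = combined.groupby('group')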

Replacing column values in a pandas DataFrame

I'm trying to replace the values in one column of a dataframe. The column ('female') only contains the values 'female' and 'male'.
I have tried the following:
w['female']['female']='1'
w['female']['male']='0'
But I receive an exact copy of the previous results.
I would ideally like to get some output which resembles the following loop element-wise.
if w['female'] == 'female':
    w['female'] = '1'
else:
    w['female'] = '0'
I've looked through the gotchas documentation (http://pandas.pydata.org/pandas-docs/stable/gotchas.html) but cannot figure out why nothing happens.
Any help will be appreciated.
If I understand right, you want something like this:
w['female'] = w['female'].map({'female': 1, 'male': 0})
(Here I convert the values to numbers instead of strings containing numbers. You can convert them to "1" and "0", if you really want, but I'm not sure why you'd want that.)
The reason your code doesn't work is because using ['female'] on a column (the second 'female' in your w['female']['female']) doesn't mean "select rows where the value is 'female'". It means to select rows where the index is 'female', of which there may not be any in your DataFrame.
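A tiny sketch of that label-versus-value distinction (illustrative only):
import pandas as pd

w = pd.DataFrame({'female': ['female', 'male', 'female']})
s = w['female']          # a Series indexed by the row labels 0, 1, 2
# s['female'] would look up the row *label* 'female' (a KeyError here),
# not the rows whose value is 'female'; selecting by value needs a mask:
print(s[s == 'female'])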
You can edit a subset of a dataframe by using loc:
df.loc[<row selection>, <column selection>]
In this case:
w.loc[w.female != 'female', 'female'] = 0
w.loc[w.female == 'female', 'female'] = 1
w.female.replace(to_replace=dict(female=1, male=0), inplace=True)
See pandas.DataFrame.replace() docs.
Slight variation:
w.female.replace(['male', 'female'], [1, 0], inplace=True)
This should also work:
w.female[w.female == 'female'] = 1
w.female[w.female == 'male'] = 0
This is very compact:
w['female'][w['female'] == 'female']=1
w['female'][w['female'] == 'male']=0
Another good one:
w['female'] = w['female'].replace(regex='female', value=1)
w['female'] = w['female'].replace(regex='male', value=0)
You can also use apply with .get, i.e.:
w['female'] = w['female'].apply({'male': 0, 'female': 1}.get)
w = pd.DataFrame({'female':['female','male','female']})
print(w)
Dataframe w:
female
0 female
1 male
2 female
Using apply to replace values from the dictionary:
w['female'] = w['female'].apply({'male':0, 'female':1}.get)
print(w)
Result:
female
0 1
1 0
2 1
Note: apply with a dictionary's .get should only be used when all possible column values are defined in the dictionary; otherwise it will produce empty (None) entries for values not in the dictionary.
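If you instead want unmapped values to keep their original value, a small sketch using .get's default argument (my suggestion, not part of the answer above):
mapping = {'male': 0, 'female': 1}
# .get(x, x) falls back to the original value when x is not in the mapping
w['female'] = w['female'].apply(lambda x: mapping.get(x, x))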
Using Series.map with Series.fillna
If your column contains strings other than only female and male, Series.map will fail in this case, since it returns NaN for unmapped values.
That's why we have to chain it with fillna:
Example of why .map fails:
df = pd.DataFrame({'female':['male', 'female', 'female', 'male', 'other', 'other']})
female
0 male
1 female
2 female
3 male
4 other
5 other
df['female'].map({'female': '1', 'male': '0'})
0 0
1 1
2 1
3 0
4 NaN
5 NaN
Name: female, dtype: object
For the correct method, we chain map with fillna, so we fill the NaN with values from the original column:
df['female'].map({'female': '1', 'male': '0'}).fillna(df['female'])
0 0
1 1
2 1
3 0
4 other
5 other
Name: female, dtype: object
Alternatively there is the built-in function pd.get_dummies for these kinds of assignments:
w['female'] = pd.get_dummies(w['female'],drop_first = True)
This gives you a data frame with two columns, one for each value that occurs in w['female'], of which you drop the first (because you can infer it from the one that is left). The new column is automatically named as the string that you replaced.
This is especially useful if you have categorical variables with more than two possible values. This function creates as many dummy variables as are needed to distinguish between all cases. Be careful that you don't assign the entire data frame to a single column; instead, if w['female'] could be 'male', 'female' or 'neutral', do something like this:
w = pd.concat([w, pd.get_dummies(w['female'], drop_first = True)], axis = 1)
w.drop('female', axis = 1, inplace = True)
Then you are left with two new columns giving you the dummy coding of 'female' and you got rid of the column with the strings.
w.replace({'female':{'female':1, 'male':0}}, inplace = True)
The above code will replace 'female' with 1 and 'male' with 0, only in the column 'female'
There is also a function in pandas called factorize which you can use to automatically do this type of work. It converts labels to numbers: ['male', 'female', 'male'] -> [0, 1, 0]. See this answer for more information.
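A quick factorize sketch (illustrative only):
import pandas as pd

labels = pd.Series(['male', 'female', 'male'])
codes, uniques = pd.factorize(labels)
# codes   -> array([0, 1, 0])
# uniques -> Index(['male', 'female'], dtype='object')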
w.female = np.where(w.female=='female', 1, 0)
in case someone is looking for a numpy solution. This is useful for replacing values based on a condition; both the if and else conditions are inherent in np.where(). The solutions that use df.replace() may not be feasible if the column contains many unique values in addition to 'male', all of which should be replaced with 0.
Another solution is to use df.where() and df.mask() in succession. This is because neither of them implements an else condition.
w.female.where(w.female=='female', 0, inplace=True) # replace where condition is False
w.female.mask(w.female=='female', 1, inplace=True) # replace where condition is True
dic = {'female':1, 'male':0}
w['female'] = w['female'].replace(dic)
.replace takes a dictionary argument in which you may define whatever mapping you want or need.
I think the answers should point out which type of object you get back from each of the methods suggested above: a Series or a DataFrame.
When you select a column with w.female or w['female'] you get back a Series, while w[['female']] (note the double brackets) returns a single-column DataFrame.
Both Series and DataFrame have a .replace method, so it works either way; map, on the other hand, is a Series method, and a row selection such as .loc[label] or .iloc[position] likewise returns a Series.
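A tiny type check illustrating the difference (sketch):
import pandas as pd

w = pd.DataFrame({'female': ['female', 'male']})
print(type(w['female']))    # <class 'pandas.core.series.Series'>
print(type(w[['female']]))  # <class 'pandas.core.frame.DataFrame'>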
To answer the question more generically, so that it applies to more use cases than just what the OP asked, consider this solution. I used jfs's solution to help me. Here, we create two functions that feed each other and can be used whether or not you know the exact replacements.
import numpy as np
import pandas as pd
class Utility:

    @staticmethod
    def rename_values_in_column(column: pd.Series, name_changes: dict = None) -> pd.Series:
        """
        Renames the distinct values in a column. If no dictionary is provided for the exact
        name changes, it will default to <column_name>_count. Ex. female_1, female_2, etc.

        :param column: The column in your dataframe you would like to alter.
        :param name_changes: A dictionary of the old values to the new values you would like to change.
            Ex. {1234: "User A"} would change all occurrences of 1234 to the string "User A"
            and leave the other values as they were. By default, this is an empty dictionary.
        :return: The same column with the replaced values.
        """
        name_changes = name_changes if name_changes else {}
        new_column = column.replace(to_replace=name_changes)
        return new_column

    @staticmethod
    def create_unique_values_for_column(column: pd.Series, except_values: list = None) -> dict:
        """
        Creates a dictionary where the key is the existing column item and the value is the
        new item to replace it. The returned dictionary can then be passed to the
        rename_values_in_column function above to rename all the distinct values in a column.

        Ex. a column named "statement" with values ["I", "am", "old"] would return
        {"I": "statement_1", "am": "statement_2", "old": "statement_3"}

        If you would like a value to remain the same, list it in except_values.
        Ex. with except_values = ["I", "am"], the same column would return
        {"old": "statement_3"}

        :param column: A pandas Series for the column with the values to replace.
        :param except_values: A list of values you do not want to have changed.
        :return: A dictionary that maps the old values to their respective new values.
        """
        except_values = except_values if except_values else []
        column_name = column.name
        distinct_values = np.unique(column)
        name_mappings = {}
        count = 1
        for value in distinct_values:
            if value not in except_values:
                name_mappings[value] = f"{column_name}_{count}"
            count += 1
        return name_mappings
For the OP's use case, it is simple enough to just use
w["female"] = Utility.rename_values_in_column(w["female"], name_changes = {"female": 0, "male":1}
However, it is not always so easy to know all of the different unique values within a data frame that you may want to rename. In my case, the string values for a column are hashed values so they hurt the readability. What I do instead is replace those hashed values with more readable strings thanks to the create_unique_values_for_column function.
df["user"] = Utility.rename_values_in_column(
df["user"],
Utility.create_unique_values_for_column(df["user"])
)
This will change my user column values from ["1a2b3c", "a12b3c", "1a2b3c"] to ["user_1", "user_2", "user_1"]. Much easier to compare, right?
If you have only two classes, you can use the equality operator. For example:
df = pd.DataFrame({'col1':['a', 'a', 'a', 'b']})
df['col1'].eq('a').astype(int)
# (df['col1'] == 'a').astype(int)
Output:
0 1
1 1
2 1
3 0
Name: col1, dtype: int64
