Groupby selecting certain columns - python

I follow the example here: (https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#flexible-apply)
Data:
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
Group by 'A' but select column 'C', then apply:
grouped = df.groupby('A')['C']
def f(group):
    return pd.DataFrame({'original': group,
                         'demeaned': group - group.mean()})
grouped.apply(f)
Everything is OK, but when I group by 'A' and select columns 'C' and 'D', I cannot get it to work:
grouped = df.groupby('A')[['C', 'D']]
for name, val in grouped:
    print(name)
    print(val)
grouped.apply(f)
So what am I doing wrong here?
Thank you
Phan

When you select a single column (['C']) you get a pandas.Series, but when you select multiple columns ([['C', 'D']]) you get a pandas.DataFrame - and that needs different code in f().
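The difference is easy to check directly; a minimal sketch (the data mirrors the question's frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": np.random.randn(4),
    "D": np.random.randn(4),
})

# Single label -> SeriesGroupBy: each group is a Series
single = df.groupby("A")["C"].get_group("foo")
print(type(single))   # <class 'pandas.core.series.Series'>

# List of labels -> DataFrameGroupBy: each group is a DataFrame
multi = df.groupby("A")[["C", "D"]].get_group("foo")
print(type(multi))    # <class 'pandas.core.frame.DataFrame'>
```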
It could be
grouped = df.groupby('A')[['C', 'D']]
def f(group):
    return pd.DataFrame({
        'original_C': group['C'],
        'original_D': group['D'],
        'demeaned_C': group['C'] - group['C'].mean(),
        'demeaned_D': group['D'] - group['D'].mean(),
    })
grouped.apply(f)
Result:
original_C original_D demeaned_C demeaned_D
0 -0.122789 0.216775 -0.611724 1.085802
1 -0.500153 0.912777 -0.293509 0.210248
2 0.875879 -1.582470 0.386944 -0.713443
3 -0.250717 1.770375 -0.044073 1.067846
4 1.261891 0.177318 0.772956 1.046345
5 0.130939 -0.575565 0.337582 -1.278094
6 -1.121481 -0.964481 -1.610417 -0.095454
7 1.551176 -2.192277 1.062241 -1.323250
Because with two columns you already have a DataFrame, you can also write it more concisely without converting to pd.DataFrame():
def f(group):
    group[['demeaned_C', 'demeaned_D']] = group - group.mean()
    return group
or, more generally:
def f(group):
    for col in group.columns:
        group[f'demeaned_{col}'] = group[col] - group[col].mean()
    return group
BTW:
If you use [['C']] instead of ['C'], you also get a DataFrame instead of a Series, and you can use the last version of f().
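For example, a small sketch: with [['C']] each group keeps the DataFrame shape even for a single column, so the generic loop version of f() works unchanged:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": np.random.randn(4),
})

def f(group):
    # add a demeaned companion column for every column in the group
    for col in group.columns:
        group[f'demeaned_{col}'] = group[col] - group[col].mean()
    return group

# [['C']] -> one-column DataFrame groups, not Series
out = df.groupby('A')[['C']].apply(f)
print(out.columns.tolist())  # ['C', 'demeaned_C']
```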

Related

finding values in dictionary based on their key

I'm trying to find the values of keys based on their first 3 letters. I have three different categories of subjects that I have to get the grades from, stored as values with the subject as the key. The categories are ECO, GEO, and INF. As there are multiple subjects, I want to get the values from every key starting with ECO, GEO or INF.
subject={"INFO100":"A"}
(subject.get("INF"))
With this method I don't get the value; I have to use the whole key. Is there a workaround? I want the values separately so I can calculate their GPA based on their field of study :)
You need to iterate over the pairs, filtering on the key and keeping the value:
subject = {"INFO100": "A", "INF0200": "B", "ECO1": "C"}
grades_inf = [v for k, v in subject.items() if k.startswith("INF")]
print(grades_inf) # ['A', 'B']
grades_eco = [v for k, v in subject.items() if k.startswith("ECO")]
print(grades_eco) # ['C']
As said in the comments, the purpose of a dictionary is to have unique keys. Indexing is extremely fast because it uses hash tables. By searching for parts of the keys you have to loop, and you lose the benefit of hashing.
Why don't you store your data in a nested dictionary?
subject = {'INF': {"INFO100": "A", "INFO200": "B"},
           'OTH': {"OTHER100": "C", "OTHER200": "D"},
           }
Then access:
# all subitems
subject['INF']
# given item
subject['INF']['INFO100']
For understanding purposes, you can create a function that returns a list of grades, like:
def getGradesBySubject(grades, search_subject):
    return [grade for subject, grade in grades.items() if subject.startswith(search_subject)]
I'd suggest using a master dict object that contains a mapping of the three-letter subjects like ECO, GEO, to all subject values. For example:
subject = {"INFO100": "A",
"INFO200": "B",
"GEO100": "D",
"ECO101": "B",
"GEO003": "C",
"INFO101": "C"}
master_dict = {}
for k, v in subject.items():
    master_dict.setdefault(k[:3], []).append(v)
print(master_dict)
# now you can access it like: master_dict['INF']
Output:
{'INF': ['A', 'B', 'C'], 'GEO': ['D', 'C'], 'ECO': ['B']}
If you want to eliminate duplicate grades for a subject, or just as an alternate approach, I'd also suggest a defaultdict:
from collections import defaultdict
subject = {"INFO100": "A",
"INFO300": "A",
"INFO200": "B",
"GEO100": "D",
"ECO101": "B",
"GEO003": "C",
"GEO102": "D",
"INFO101": "C"}
master_dict = defaultdict(set)
for k, v in subject.items():
    master_dict[k[:3]].add(v)
print(master_dict)
defaultdict(<class 'set'>, {'INF': {'B', 'A', 'C'}, 'GEO': {'D', 'C'}, 'ECO': {'B'}})
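Putting it together for the GPA use case from the question, here is a sketch; the A=5 ... E=1 point scale is an assumption, so adjust it to your actual grading system:

```python
# Hypothetical grade-to-point scale (an assumption, not from the question)
points = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}

subject = {"INFO100": "A", "INFO200": "B", "ECO101": "B", "GEO100": "D"}

# Group point values by the 3-letter field prefix
master = {}
for course, grade in subject.items():
    master.setdefault(course[:3], []).append(points[grade])

# Average per field of study
gpa = {field: sum(g) / len(g) for field, g in master.items()}
print(gpa)  # {'INF': 4.5, 'ECO': 4.0, 'GEO': 2.0}
```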

How to get the name of a nested list? [duplicate]

This question already has answers here:
How can I get the name of an object?
(18 answers)
Closed 1 year ago.
I'm wondering if something like this is possible. Let's suppose this snippet of code:
area = ["A","B","C"]
level = ["L1","L2","L3"]
sector = [area, level]
print(sector)
print(sector[1])
Output:
Print 1: [['A', 'B', 'C'], ['L1', 'L2', 'L3']]
Print 2: ['L1', 'L2', 'L3']
The first print is OK for me. It shows the lists and their elements.
However, for the second print I would like to have the name of the list instead of its elements. In this case level
Is that possible?
What you can do, though, is use a dictionary:
di = {"area": ["A", "B", "C"], "level": ["L1", "L2", "L3"]}
di["area"]
Output :
["A", "B", "C"]
You could compare the id:
for sec in sector:
    if id(sec) == id(area):
        print('area')
    elif id(sec) == id(level):
        print('level')
etc.
However, this is a dubious way to go. Why do you need this?
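As a side note, `id(x) == id(y)` is usually spelled with the `is` operator, which compares object identity directly; a sketch:

```python
area = ["A", "B", "C"]
level = ["L1", "L2", "L3"]
sector = [area, level]

# `is` checks identity (same object), not equality of contents
names = ['area' if sec is area else 'level' for sec in sector]
print(names)  # ['area', 'level']
```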
Make it into a Dictionary of Lists instead of a List of Lists:
area = ["A","B","C"]
level = ["L1","L2","L3"]
sectors = {
    "area": area,
    "level": level
}
print(sectors["level"])
print(sectors["area"])
""" some other useful dict methods below """
print(sectors.keys())
print(sectors.values())
print(sectors.items())
You can use a dictionary for your use case, making the variable names the keys and the lists the values.
Try this:
area = ["A","B","C"]
level = ["L1","L2","L3"]
sector = {"area":area, "level":level}
print(list(sector.values()))
print(list(sector.keys()))
Outputs:
[['A', 'B', 'C'], ['L1', 'L2', 'L3']]
['area', 'level']

Sorting styled dataframe return keyError in Pandas

I would like to group by and sort the index while styling a dataframe. However, the code raises an error:
KeyError: ('Other', 'B')
May I know what the issue is here?
The code to reproduce the above error:
import pandas as pd
import numpy as np
dict_map=dict(group_one=['D','GG','G'],group_two=['A','C','E','F'])
vv=np.random.randn(5, 4)
# ['foo', '*', 'bar','ff']
nn = np.array([['foo', '*', 'bar', 'ff'], ['foo', '*', 'bar', '**'],
               ['foo', '*', 'bar', '**'], ['foo', '*', 'bar', 'ff'],
               ['foo', '*', '**', 'ff']])
arrays = [["bar", "bar", "baz", "baz"],
          ["one", "two", "one", "two"]]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df = pd.DataFrame(nn, index=["A", "B", "C","D",'G'], columns=index)
df = df.rename_axis(index=['my_ch']).reset_index()
d = {i:k for k,v in dict_map.items() for i in v}
out = df.assign(Group=df.xs("my_ch",axis=1).map(d).fillna('Other'))
def highlight_(s):
    return np.select(
        condlist=[s.str.contains(r'\*\*'), s.str.contains(r'\*')],
        choicelist=['background-color:green', 'background-color:purple'],
        default='')
df=out.style.apply(highlight_)
df.data=df.data.set_index(['Group', 'my_ch'])
df.data=df.data.sort_index(level=0)
df.to_excel('n1test.xlsx')
Please note that, in the actual use case, sorting index level 0 is required.
This should work:
import pandas as pd
import numpy as np
dict_map = dict(group_one=["D", "GG", "G"],
                group_two=["A", "C", "E", "F"])
vv = np.random.randn(5, 4)
nn = np.array(
    [
        ["foo", "*", "bar", "ff"],
        ["foo", "*", "bar", "**"],
        ["foo", "*", "bar", "**"],
        ["foo", "*", "bar", "ff"],
        ["foo", "*", "**", "ff"],
    ]
)
arrays = [["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df = pd.DataFrame(nn, index=["A", "B", "C", "D", "G"], columns=index)
df = df.rename_axis(index=["my_ch"]).reset_index()
d = {i: k for k, v in dict_map.items() for i in v}
out = df.assign(Group=df.xs("my_ch", axis=1).map(d).fillna("Other"))
def highlight_(s):
    return np.select(
        condlist=[s.str.contains(r"\*\*"), s.str.contains(r"\*")],
        choicelist=["background-color:green", "background-color:purple"],
        default=None,
    )
(
    out.set_index(["Group", "my_ch"])
    .sort_index(level=0)
    .style.apply(highlight_)
    .to_excel("n1test.xlsx")
)
The main difference is to sort and set the index on the data first, and only then apply the Styler and save it as an Excel file; modifying df.data after the Styler has been created is what leads to the KeyError. The whole expression is enclosed in parentheses instead of using line-break continuations.

Assign value based on lookup dictionary in a multilevel column Pandas

The objective is to assign value to column Group based on the comparison between the value in column my_ch and look-up dict dict_map
The dict_map is defined as
dict_map=dict(group_one=['B','D','GG','G'],group_two=['A','C','E','F'])
Whereas the df is as below
first my_ch bar ... foo qux
second one two ... two one two
0 A 0.037718 0.089609 ... 0.202885 0.706059 -2.280754
1 B 0.578452 0.039445 ... -0.153135 0.178715 -0.040345
2 C 2.139270 1.104547 ... 0.989953 -0.280724 -0.739488
3 D 0.733355 0.227912 ... -1.359441 0.761619 -1.119464
4 G -1.565185 -1.070280 ... 0.458847 1.072471 1.724417
This comparison should produce the output as below
first Group my_ch bar ... foo qux
second one two ... two one two
0 group_two A 0.037718 0.089609 ... 0.202885 0.706059 -2.280754
1 group_one B 0.578452 0.039445 ... -0.153135 0.178715 -0.040345
2 group_two C 2.139270 1.104547 ... 0.989953 -0.280724 -0.739488
3 group_one D 0.733355 0.227912 ... -1.359441 0.761619 -1.119464
4 group_one G -1.565185 -1.070280 ... 0.458847 1.072471 1.724417
My impression was that this could be achieved simply via the line
df[('Group', slice(None))] = df.loc[:, ('my_ch', slice(None))].apply(lambda x: dict_map.get(x))
However, this raises an error:
TypeError: unhashable type: 'Series'
I'm thinking of converting the Series into a DataFrame to bypass this issue, but I wonder if there is a more reasonable way of solving it.
The full code to reproduce the above error is
import pandas as pd
import numpy as np
dict_map=dict(group_one=['B','D','GG','G'],group_two=['A','C','E','F'])
arrays = [["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
["one", "two", "one", "two", "one", "two", "one", "two"]]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df = pd.DataFrame(np.random.randn(5, 8), index=["A", "B", "C","D",'G'], columns=index)
df = df.rename_axis(index=['my_ch']).reset_index()
df[('Group', slice(None))] = df.loc[:, ('my_ch', slice(None))].apply(lambda x: dict_map.get(x))
Edit:
df['Group']=df['my_ch'].apply(lambda x: dict_map.get(x))
produces a Group column of None:
first my_ch bar baz ... foo qux Group
second one two one ... two one two
0 A 1.220946 0.714748 0.053371 ... -1.743287 0.400862 -1.066441 None
1 B 0.606736 0.844995 0.579328 ... -0.472185 1.102245 0.454315 None
2 C 1.666148 -0.333102 1.950425 ... -0.021484 3.178110 -0.176937 None
3 D -0.673474 2.263407 -0.074996 ... -0.605594 1.410987 -1.253847 None
4 G 0.652557 2.271662 -0.569529 ... -0.549246 -0.021359 -0.532386 None
Slice out the my_ch column using df.xs, then map after reversing the dict:
d = {i:k for k,v in dict_map.items() for i in v}
out = df.assign(Group=df.xs("my_ch",axis=1).map(d))
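The reversed dict turns the group lists inside out, so each member maps to its group name; a minimal sketch (labels match the question, the plain Series is just for illustration):

```python
import pandas as pd

dict_map = dict(group_one=['B', 'D', 'GG', 'G'], group_two=['A', 'C', 'E', 'F'])

# Reverse the mapping: member -> group name
d = {i: k for k, v in dict_map.items() for i in v}
print(d['B'])  # group_one

# .map() accepts a dict; labels missing from d would become NaN
ch = pd.Series(['A', 'B', 'C', 'D', 'G'])
print(ch.map(d).tolist())
# ['group_two', 'group_one', 'group_two', 'group_one', 'group_one']
```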

Zipping List of Pandas DataFrames Yields Unexpected Results

Can somebody explain the following code?
import pandas as pd
a = pd.DataFrame({"col1": [1,2,3], "col2": [2,3,4]})
b = pd.DataFrame({"col3": [1,2,3], "col4": [2,3,4]})
list(zip(*[a,b]))
Output:
[('col1', 'col3'), ('col2', 'col4')]
The zip function returns an iterator of tuples:
a = ("John", "Charles", "Mike")
b = ("Jenny", "Christy", "Monica", "Vicky")
x = zip(a, b)
#use the tuple() function to display a readable version of the result:
print(tuple(x))
With [a, b] inside zip you iterate over the DataFrames themselves, and iterating over a DataFrame yields its column labels, not its values.
You can also combine any pair of columns (16 possible combinations), e.g.:
d = list(zip(a['col1'],b['col4']))
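The underlying reason: iterating over a DataFrame yields its column labels, not its rows, so zip pairs up column names; a sketch:

```python
import pandas as pd

a = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
b = pd.DataFrame({"col3": [1, 2, 3], "col4": [2, 3, 4]})

print(list(a))          # ['col1', 'col2'] -- iteration gives column names
print(list(zip(a, b)))  # [('col1', 'col3'), ('col2', 'col4')]

# To pair the actual row values, zip the columns themselves
print(list(zip(a['col1'], b['col4'])))  # [(1, 2), (2, 3), (3, 4)]
```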
