Assign value based on lookup dictionary in a multilevel column Pandas - python

The objective is to assign a value to the column Group based on a comparison between the value in column my_ch and the look-up dict dict_map.
The dict_map is defined as
dict_map = dict(group_one=['B', 'D', 'GG', 'G'], group_two=['A', 'C', 'E', 'F'])
and the df is as below:
first my_ch bar ... foo qux
second one two ... two one two
0 A 0.037718 0.089609 ... 0.202885 0.706059 -2.280754
1 B 0.578452 0.039445 ... -0.153135 0.178715 -0.040345
2 C 2.139270 1.104547 ... 0.989953 -0.280724 -0.739488
3 D 0.733355 0.227912 ... -1.359441 0.761619 -1.119464
4 G -1.565185 -1.070280 ... 0.458847 1.072471 1.724417
This comparison should produce the output below:
first Group my_ch bar ... foo qux
second one two ... two one two
0 group_two A 0.037718 0.089609 ... 0.202885 0.706059 -2.280754
1 group_one B 0.578452 0.039445 ... -0.153135 0.178715 -0.040345
2 group_two C 2.139270 1.104547 ... 0.989953 -0.280724 -0.739488
3 group_one D 0.733355 0.227912 ... -1.359441 0.761619 -1.119464
4 group_one G -1.565185 -1.070280 ... 0.458847 1.072471 1.724417
My impression was that this could be achieved simply via the line
df[('Group', slice(None))] = df.loc[:, ('my_ch', slice(None))].apply(lambda x: dict_map.get(x))
However, the interpreter returns an error:
TypeError: unhashable type: 'Series'
I'm thinking of converting the Series into a DataFrame to bypass this issue, but I wonder whether there is a more reasonable way of solving it.
The full code to reproduce the above error is
import pandas as pd
import numpy as np

dict_map = dict(group_one=['B', 'D', 'GG', 'G'], group_two=['A', 'C', 'E', 'F'])
arrays = [["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
          ["one", "two", "one", "two", "one", "two", "one", "two"]]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df = pd.DataFrame(np.random.randn(5, 8), index=["A", "B", "C", "D", "G"], columns=index)
df = df.rename_axis(index=['my_ch']).reset_index()
df[('Group', slice(None))] = df.loc[:, ('my_ch', slice(None))].apply(lambda x: dict_map.get(x))
Edit:
df['Group'] = df['my_ch'].apply(lambda x: dict_map.get(x))
produces a Group column full of None:
first my_ch bar baz ... foo qux Group
second one two one ... two one two
0 A 1.220946 0.714748 0.053371 ... -1.743287 0.400862 -1.066441 None
1 B 0.606736 0.844995 0.579328 ... -0.472185 1.102245 0.454315 None
2 C 1.666148 -0.333102 1.950425 ... -0.021484 3.178110 -0.176937 None
3 D -0.673474 2.263407 -0.074996 ... -0.605594 1.410987 -1.253847 None
4 G 0.652557 2.271662 -0.569529 ... -0.549246 -0.021359 -0.532386 None

The Nones appear because dict_map is keyed by group name rather than by channel label, so reverse the dict first; then slice out the my_ch column with df.xs and map:
d = {i:k for k,v in dict_map.items() for i in v}
out = df.assign(Group=df.xs("my_ch",axis=1).map(d))
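If df.xs hands back a one-column DataFrame rather than a Series (which can happen when the second column level survives the cross-section), a tuple column key is a simple workaround. A minimal sketch, assuming the reset_index() above left an empty string as the second level of the my_ch column:
d = {i: k for k, v in dict_map.items() for i in v}  # reverse the mapping: label -> group name
df[('Group', '')] = df[('my_ch', '')].map(d)        # Series.map, so no unhashable-Series error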

Related

Groupby selecting certain columns

I follow the example here: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#flexible-apply
Data:
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
Group by 'A' while selecting column 'C', then apply:
grouped = df.groupby('A')['C']

def f(group):
    return pd.DataFrame({'original': group,
                         'demeaned': group - group.mean()})

grouped.apply(f)
Everything is OK, but when I try to group by 'A' and select columns 'C' and 'D', I cannot get it to work:
grouped = df.groupby('A')[['C', 'D']]
for name, val in grouped:
    print(name)
    print(val)

grouped.apply(f)
So what am I doing wrong here?
Thank you
Phan
When you select a single column (['C']) you get a pandas.Series, but when you select multiple columns ([['C', 'D']]) you get a pandas.DataFrame, and that needs different code in f().
It could be
grouped = df.groupby('A')[['C', 'D']]

def f(group):
    return pd.DataFrame({
        'original_C': group['C'],
        'original_D': group['D'],
        'demeaned_C': group['C'] - group['C'].mean(),
        'demeaned_D': group['D'] - group['D'].mean(),
    })

grouped.apply(f)
Result:
original_C original_D demeaned_C demeaned_D
0 -0.122789 0.216775 -0.611724 1.085802
1 -0.500153 0.912777 -0.293509 0.210248
2 0.875879 -1.582470 0.386944 -0.713443
3 -0.250717 1.770375 -0.044073 1.067846
4 1.261891 0.177318 0.772956 1.046345
5 0.130939 -0.575565 0.337582 -1.278094
6 -1.121481 -0.964481 -1.610417 -0.095454
7 1.551176 -2.192277 1.062241 -1.323250
Because with two columns you already have a DataFrame, you can also write it more concisely without converting to pd.DataFrame():
def f(group):
    group[['demeaned_C', 'demeaned_D']] = group - group.mean()
    return group
or, more generally:
def f(group):
    for col in group.columns:
        group[f'demeaned_{col}'] = group[col] - group[col].mean()
    return group
BTW: if you use [['C']] instead of ['C'] then you also get a DataFrame instead of a Series, and you can use the last version of f().
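To see the Series/DataFrame distinction directly, here is a quick check reusing the df defined above:
grouped_series = df.groupby('A')['C']     # each group is a pandas.Series
grouped_frame = df.groupby('A')[['C']]    # each group is a pandas.DataFrame
for name, g in grouped_series:
    print(name, type(g))   # <class 'pandas.core.series.Series'>
for name, g in grouped_frame:
    print(name, type(g))   # <class 'pandas.core.frame.DataFrame'>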

Zipping List of Pandas DataFrames Yields Unexpected Results

Can somebody explain the following code?
import pandas as pd
a = pd.DataFrame({"col1": [1,2,3], "col2": [2,3,4]})
b = pd.DataFrame({"col3": [1,2,3], "col4": [2,3,4]})
list(zip(*[a,b]))
Output:
[('col1', 'col3'), ('col2', 'col4')]
The zip function returns tuples, pairing items from each iterable:
a = ("John", "Charles", "Mike")
b = ("Jenny", "Christy", "Monica", "Vicky")
x = zip(a, b)
# use the tuple() function to display a readable version of the result:
print(tuple(x))
With [a, b] inside zip you iterate over the DataFrames themselves, and iterating over a DataFrame yields its column labels, which is why you get pairs of column names.
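A quick check of that fact, using the DataFrames a and b from the question:
list(a)          # ['col1', 'col2']  (iterating a DataFrame yields its column labels)
list(zip(a, b))  # [('col1', 'col3'), ('col2', 'col4')]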
You can also pair any two columns element-wise (with four columns across the two frames there are 16 possible pairings), e.g.:
d = list(zip(a['col1'], b['col4']))

Pandas integrate over columns per each row

In a simplified dataframe:
import pandas as pd
df1 = pd.DataFrame({'350': [7.898167, 6.912074, 6.049002, 5.000357, 4.072320],
                    '351': [8.094912, 7.090584, 6.221289, 5.154516, 4.211746],
                    '352': [8.291657, 7.269095, 6.393576, 5.308674, 4.351173],
                    '353': [8.421007, 7.374317, 6.496641, 5.403691, 4.439815],
                    '354': [8.535562, 7.463452, 6.584512, 5.485725, 4.517310],
                    '355': [8.650118, 7.552586, 6.672383, 4.517310, 4.594806]},
                   index=[1, 2, 3, 4, 5])
int_range = df1.columns.astype(float)
a = 0.005
b = 0.837
I would like to solve an equation that the original post attached as an image. Judging from the code below, it is
INT = (1 / (a*b)) * ∫ I(x) * x dx
where I is the values in the data frame and x is the int_range values, so in this case from 350 to 355 with dx = 1. a and b are constants.
I need to get one integrated value per row as output.
For now I do something like this, but I'm not sure it's correct:
from scipy import integrate

dict_INT = {}
for index, row in df1.iterrows():
    func = df1.loc[index] * df1.loc[index].index.astype('float')
    x = df1.loc[index].index.astype('float')
    dict_INT[index] = integrate.trapz(func, x)

df_out = pd.DataFrame(dict_INT, index=['INT']).T
df_fin = df_out / (a * b)
This is the final sum I get per row:
1 3.505796e+06
2 3.068796e+06
3 2.700446e+06
4 2.199336e+06
5 1.840992e+06
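For reference, the same row-wise trapezoidal integral can be written without the explicit loop; a minimal sketch, assuming the df1, a, and b defined above:
import numpy as np

x = df1.columns.astype(float).to_numpy()
# trapezoidal rule over I(x)*x, applied to every row at once
integral = np.trapz(df1.to_numpy() * x, x, axis=1)
df_fin = pd.Series(integral / (a * b), index=df1.index, name='INT')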
I solved this by first converting the dataframe to a dict, then applying your equation to each item in each row, and writing those values to a dict of lists using collections.defaultdict. I will break it down:
import pandas as pd
from collections import defaultdict

df1 = pd.DataFrame({'350': [7.898167, 6.912074, 6.049002, 5.000357, 4.072320],
                    '351': [8.094912, 7.090584, 6.221289, 5.154516, 4.211746],
                    '352': [8.291657, 7.269095, 6.393576, 5.308674, 4.351173],
                    '353': [8.421007, 7.374317, 6.496641, 5.403691, 4.439815],
                    '354': [8.535562, 7.463452, 6.584512, 5.485725, 4.517310],
                    '355': [8.650118, 7.552586, 6.672383, 4.517310, 4.594806]},
                   index=[1, 2, 3, 4, 5])
int_range = df1.columns.astype(float)
a = 0.005
b = 0.837
dx = 1

df_dict = df1.to_dict()  # convert df to dict for easier operations
integrated_dict = {}     # initialize empty dict
d = defaultdict(list)    # initialize empty dict of lists for tuples later
integrated_list = []
for k, v in df_dict.items():  # unpack df dict of dicts; k is the column header
    for x, y in v.items():    # x is the row index, y is the cell value
        integrated_list.append((k, (float(k) * float(y) * float(dx)) / (a * b)))  # store a list of tuples

for x, y in integrated_list:  # group the calculated values by column header
    d[x].append(y)

d = {k: tuple(v) for k, v in d.items()}    # lists to tuples
integrated_df = pd.DataFrame.from_dict(d)  # to df
integrated_df['Sum'] = integrated_df.iloc[:, :].sum(axis=1)
output (updated to include sum):
350 351 352 353 354 \
0 660539.653524 678928.103226 697410.576822 710302.382557 722004.527599
1 578070.704898 594694.141935 611402.972521 622015.269056 631317.086738
2 505890.250896 521785.529032 537763.142652 547984.294624 556969.473835
3 418189.952210 432314.245161 446512.126165 455795.202628 464025.483871
4 340576.344086 353243.212903 365976.797133 374493.356033 382109.376344
355 Sum
0 733761.502987 4.202947e+06
1 640661.416965 3.678162e+06
2 565996.646356 3.236389e+06
3 383188.781362 2.600026e+06
4 389762.516129 2.206162e+06

How to combine multiple columns from a pandas df into a list

How can you combine multiple columns from a dataframe into a list?
Input:
df = pd.DataFrame(np.random.randn(10000, 7), columns=list('ABCDEFG'))
If I wanted to create a list from column A I would perform:
df1 = df['A'].tolist()
But if I wanted to combine numerous columns into this list, it would not be practical to write df['A','B','C'...'Z'].tolist().
I have tried the following, but it just adds the column headers to a list.
df1 = list(df.columns)[0:8]
Example input:
A B C D E F G
0 0.787576 0.646178 -0.561192 -0.910522 0.647124 -1.388992 0.728360
1 0.265409 -1.919283 -0.419196 -1.443241 -2.833812 -1.066249 0.553379
2 0.343384 0.659273 -0.759768 0.355124 -1.974534 0.399317 -0.200278
Intended Output:
[0.787576, 0.646178, -0.561192, -0.910522, 0.647124, -1.388992, 0.728360,
0.265409, -1.919283, -0.419196, -1.443241, -2.833812, -1.066249, 0.553379,
0.343384, 0.659273, -0.759768, 0.355124, -1.974534, 0.399317, -0.200278]
Is this what you are looking for?
lst = df.values.tolist()
flat_list = [item for x in lst for item in x]
print(flat_list)
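Equivalently, NumPy can do the flattening in one step; a small sketch that preserves the same row-by-row order as the intended output:
flat_list = df.to_numpy().ravel().tolist()  # row-major (row-by-row) order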
You can use to_dict:
df = pd.DataFrame(np.random.randn(10, 10), columns=list('ABCDEFGHIJ'))
df.to_dict('list')  # the 'l' shorthand was removed in pandas 2.0
Out[1036]:
{'A': [-0.5611441440595607,
-0.3785906500723589,
-0.19480328695097676,
-0.7472526275034221,
-2.4232786057647457,
0.10506614562827334,
0.4968179288412277,
1.635737019365132,
-1.4286421753281746,
0.4973223222844811],
'B': [-1.0550082961139444,
-0.1420067090193365,
0.30130476834580633,
1.1271866812852227,
0.38587456174846285,
-0.531163142682951,
-1.1335754634118729,
0.5975963084356348,
-0.7361022807495443,
1.4329395663140427],
...}
Or, adding .values.tolist():
df[list('ABC')].values.tolist()
Out[1041]:
[[0.09552771302434987, 0.18551596484768904, -0.5902249875268607],
[-1.5285190712746388, 1.2922627021799646, -0.8347422966138306],
[-0.4092028716404067, -0.5669107267579823, 0.3627970727410332],
[-1.3546346273319263, -0.9352316948439341, 1.3568726575880614],
[-1.3509518030469496, 0.10487182694997808, -0.6902134363370515]]
Edit: np.concatenate(df[list('ABC')].T.values.tolist()) gives the same selection flattened column by column.

How to replace elements of a DataFrame from other indicated columns

I have a DataFrame like:
df = pd.DataFrame([{'v1': 'a', 'v2': 'b', 'v3': '1'},
                   {'v1': '2', 'v2': 'c', 'v3': 'd'}])
or
v1 v2 v3
0 a b 1
1 2 c d
When the contents of a column/row is '1', '2' or '3', I would like to replace its contents with the corresponding item from the column indicated. I.e., in the first row, column v3 has value "1" so I would like to replace it with the value of the first element in column v1. Doing this for both rows, I should get:
v1 v2 v3
0 a b a
1 c c d
I can do this with the following code:
for i in range(3):
    for j in range(3):
        df.loc[df['v%d' % (i+1)] == ('%d' % (j+1)), 'v%d' % (i+1)] = \
            df.loc[df['v%d' % (i+1)] == ('%d' % (j+1)), 'v%d' % (j+1)]
Is there a less cumbersome way to do this?
df.apply(lambda row: [row['v'+v] if 'v'+v in row else v for v in row], 1)
This iterates over each row and replaces any value v with the value in column named 'v'+v if that column exists, otherwise it does not change the value.
output:
v1 v2 v3
0 a b a
1 c c d
Note that this will not limit the replacements to digits only. For example, if you have a column named 'va', it will replace all cells that contain 'a' with the value in the 'va' column in that row. To limit the columns that you can replace from, you can define a list of acceptable column names. For example, let's say you only wanted to make replacements from column v1:
acceptable_columns = ['v1']
df.apply(lambda row: [row['v'+v] if 'v'+v in acceptable_columns else v for v in row], 1)
output:
v1 v2 v3
0 a b a
1 2 c d
EDIT
It was pointed out that the answer above throws an error if you have non-string types in your dataframe. You can avoid this by explicitly converting each cell value to a string:
df.apply(lambda row: [row['v'+str(v)] if 'v'+str(v) in row else v for v in row], 1)
ORIGINAL (INCORRECT) ANSWER BELOW
note that the answer below only applies when the values to replace are on a diagonal (which is the case in the example but that was not the question asked ... my bad)
You can do this with pandas' replace method and numpy's diag method:
First select the values to replace, these will be the digits 1 to the length of your dataframe:
to_replace = [str(i) for i in range(1,len(df)+1)]
Then select values that each should be replaced with, these will be the diagonal of your data frame:
import numpy as np
replace_with = np.diag(df)
Now you can do the actual replacement:
df.replace(to_replace, replace_with)
which gives:
v1 v2 v3
0 a b a
1 c c d
And of course if you want the whole thing as a one liner:
df.replace([str(i) for i in range(1,len(df)+1)], np.diag(df))
Add the inplace=True keyword arg to replace if you want to do the replacement in place.
I see 2 options.
Loop over the columns and then over the mapping
mapping = {'1': 'v1', '3': 'v3', '2': 'v2'}
df1 = df.copy()
for column_name, column in df1.items():  # iteritems() was removed in pandas 2.0; items() is equivalent
    for k, v in mapping.items():
        df1.loc[column == k, column_name] = df1.loc[column == k, v]

df1
v1 v2 v3
0 a b a
1 c c d
Loop over the columns, then loop over all the 'hits'
df2 = df.copy()
for column_name, column in df2.items():
    hits = column.isin(mapping.keys())
    for idx, item in column[hits].items():  # Series.iteritems() is also gone; items() works the same
        df2.loc[idx, column_name] = df2.loc[idx, mapping[item]]

df2
v1 v2 v3
0 a b a
1 c c d
If you've chosen a way, you could reduce the two nested for-loops to one for-loop with itertools.product, as sketched below.
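A minimal sketch of that reduction, assuming the df and mapping defined above (it matches the nested-loop results for this example):
from itertools import product

df3 = df.copy()
for column_name, (k, v) in product(df3.columns, mapping.items()):
    mask = df3[column_name] == k
    df3.loc[mask, column_name] = df3.loc[mask, v]

df3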
I made this:
df = pd.DataFrame([{'v1': 'a', 'v2': 'b', 'v3': '1'},
                   {'v1': '2', 'v2': 'c', 'v3': 'd'}])

def replace_col(row, columns, col_num_dict={1: 'v1', 2: 'v2', 3: 'v3'}):
    for col in columns:
        x = getattr(row, col)
        try:
            x = int(x)
            if x in col_num_dict:
                setattr(row, col, getattr(row, col_num_dict[x]))
        except ValueError:
            pass
    return row

df = df.apply(replace_col, axis=1, args=(df.columns,))
It applies the function replace_col on every row. The row object's attributes which correspond to its columns get replaced with the right value from the same row. It looks a bit complicated due to the multiple set/get attribute functions, but it does exactly what is needed without too much overhead.
you can modify the data before converting it to a df:
data = [{'v1': 'a', 'v2': 'b', 'v3': '1'}, {'v1': '2', 'v2': 'c', 'v3': 'd'}]
mapping = {'1': 'v1', '3': 'v3', '2': 'v2'}
for idx, line in enumerate(data):
    for item in line:
        try:
            int(line[item])  # raises ValueError for non-numeric values, skipping them
            data[idx][item] = data[idx][mapping[line[item]]]
        except Exception:
            pass

which gives:
[{'v1': 'a', 'v2': 'b', 'v3': 'a'}, {'v1': 'c', 'v2': 'c', 'v3': 'd'}]
