The objective is to assign a value to column Group based on comparing the value in column my_ch against the look-up dict dict_map.
The dict_map is defined as
dict_map = dict(group_one=['B','D','GG','G'], group_two=['A','C','E','F'])
whereas the df is as below:
first my_ch bar ... foo qux
second one two ... two one two
0 A 0.037718 0.089609 ... 0.202885 0.706059 -2.280754
1 B 0.578452 0.039445 ... -0.153135 0.178715 -0.040345
2 C 2.139270 1.104547 ... 0.989953 -0.280724 -0.739488
3 D 0.733355 0.227912 ... -1.359441 0.761619 -1.119464
4 G -1.565185 -1.070280 ... 0.458847 1.072471 1.724417
This comparison should produce the output as below
first Group my_ch bar ... foo qux
second one two ... two one two
0 group_two A 0.037718 0.089609 ... 0.202885 0.706059 -2.280754
1 group_one B 0.578452 0.039445 ... -0.153135 0.178715 -0.040345
2 group_two C 2.139270 1.104547 ... 0.989953 -0.280724 -0.739488
3 group_one D 0.733355 0.227912 ... -1.359441 0.761619 -1.119464
4 group_one G -1.565185 -1.070280 ... 0.458847 1.072471 1.724417
My impression was that this could be achieved simply via the line
df[('Group', slice(None))] = df.loc[:, ('my_ch', slice(None))].apply(lambda x: dict_map.get(x))
However, the interpreter returns an error:
TypeError: unhashable type: 'Series'
I'm thinking of converting the Series into a DataFrame to bypass this issue, but I wonder whether there is a more reasonable way of solving it.
The full code to reproduce the above error is
import pandas as pd
import numpy as np

dict_map = dict(group_one=['B', 'D', 'GG', 'G'], group_two=['A', 'C', 'E', 'F'])
arrays = [["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
          ["one", "two", "one", "two", "one", "two", "one", "two"]]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df = pd.DataFrame(np.random.randn(5, 8), index=["A", "B", "C", "D", "G"], columns=index)
df = df.rename_axis(index=['my_ch']).reset_index()
df[('Group', slice(None))] = df.loc[:, ('my_ch', slice(None))].apply(lambda x: dict_map.get(x))
Edit:
df['Group'] = df['my_ch'].apply(lambda x: dict_map.get(x))
produced a Group column full of None:
first my_ch bar baz ... foo qux Group
second one two one ... two one two
0 A 1.220946 0.714748 0.053371 ... -1.743287 0.400862 -1.066441 None
1 B 0.606736 0.844995 0.579328 ... -0.472185 1.102245 0.454315 None
2 C 1.666148 -0.333102 1.950425 ... -0.021484 3.178110 -0.176937 None
3 D -0.673474 2.263407 -0.074996 ... -0.605594 1.410987 -1.253847 None
4 G 0.652557 2.271662 -0.569529 ... -0.549246 -0.021359 -0.532386 None
Slice out the my_ch column using df.xs, then map after reversing the dict:
d = {i:k for k,v in dict_map.items() for i in v}
out = df.assign(Group=df.xs("my_ch",axis=1).map(d))
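For completeness, a runnable sketch of the same idea end to end. It reverses dict_map and then uses a plain Series.map; addressing the flat column as the tuple ('my_ch', '') is an assumption about how reset_index pads a name into MultiIndex columns:

```python
import numpy as np
import pandas as pd

# rebuild the question's frame
dict_map = dict(group_one=['B', 'D', 'GG', 'G'], group_two=['A', 'C', 'E', 'F'])
arrays = [["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
          ["one", "two", "one", "two", "one", "two", "one", "two"]]
index = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=["first", "second"])
df = pd.DataFrame(np.random.randn(5, 8), index=list("ABCDG"), columns=index)
df = df.rename_axis(index=['my_ch']).reset_index()

# invert the dict so each letter maps to its group name
d = {i: k for k, v in dict_map.items() for i in v}

# reset_index pads the flat column out to ('my_ch', ''), so a plain
# Series.map works once we address it by the full tuple
df[('Group', '')] = df[('my_ch', '')].map(d)
```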
I follow the example here: (https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#flexible-apply)
Data:
df = pd.DataFrame(
{
"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": ["one", "one", "two", "three", "two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8),
}
)
Group by 'A' while selecting column 'C', then apply:
grouped = df.groupby('A')['C']
def f(group):
    return pd.DataFrame({'original': group,
                         'demeaned': group - group.mean()})
grouped.apply(f)
Everything is OK, but when I try to group by 'A' while selecting columns 'C' and 'D', I cannot succeed:
grouped = df.groupby('A')[['C', 'D']]
for name, val in grouped:
    print(name)
    print(val)
grouped.apply(f)
So what am I doing wrong here?
Thank you
Phan
When you select a single column (['C']) you get a pandas.Series, but when you select multiple columns ([['C', 'D']]) you get a pandas.DataFrame - and that needs different code in f().
It could be
grouped = df.groupby('A')[['C', 'D']]
def f(group):
    return pd.DataFrame({
        'original_C': group['C'],
        'original_D': group['D'],
        'demeaned_C': group['C'] - group['C'].mean(),
        'demeaned_D': group['D'] - group['D'].mean(),
    })
grouped.apply(f)
Result:
original_C original_D demeaned_C demeaned_D
0 -0.122789 0.216775 -0.611724 1.085802
1 -0.500153 0.912777 -0.293509 0.210248
2 0.875879 -1.582470 0.386944 -0.713443
3 -0.250717 1.770375 -0.044073 1.067846
4 1.261891 0.177318 0.772956 1.046345
5 0.130939 -0.575565 0.337582 -1.278094
6 -1.121481 -0.964481 -1.610417 -0.095454
7 1.551176 -2.192277 1.062241 -1.323250
Because with two columns you already have a DataFrame, you can also write it more briefly without converting to pd.DataFrame():
def f(group):
    group[['demeaned_C', 'demeaned_D']] = group - group.mean()
    return group
or, more generally:
def f(group):
    for col in group.columns:
        group[f'demeaned_{col}'] = group[col] - group[col].mean()
    return group
BTW: if you use [['C']] instead of ['C'], then you also get a DataFrame instead of a Series, and you can use the last version of f().
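A small sketch illustrating the Series-vs-DataFrame difference (data taken from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": np.random.randn(8),
    "D": np.random.randn(8),
})

# single label -> each group is a Series
for name, group in df.groupby('A')['C']:
    assert isinstance(group, pd.Series)

# a list of labels (even a one-element list) -> each group is a DataFrame
for name, group in df.groupby('A')[['C']]:
    assert isinstance(group, pd.DataFrame)
```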
Can somebody explain the following code?
import pandas as pd
a = pd.DataFrame({"col1": [1,2,3], "col2": [2,3,4]})
b = pd.DataFrame({"col3": [1,2,3], "col4": [2,3,4]})
list(zip(*[a,b]))
Output:
[('col1', 'col3'), ('col2', 'col4')]
The zip function returns an iterator of tuples, pairing items until the shortest input is exhausted:
a = ("John", "Charles", "Mike")
b = ("Jenny", "Christy", "Monica", "Vicky")
x = zip(a, b)
#use the tuple() function to display a readable version of the result:
print(tuple(x))
With *[a, b] inside zip, the DataFrames themselves are iterated, and iterating a DataFrame yields its column labels - which is why the output pairs up the column names rather than the cell values.
You can also combine values across columns (16 possible pairings over the four columns in total), e.g.:
d = list(zip(a['col1'],b['col4']))
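A minimal sketch contrasting the two behaviours (same frames as in the question):

```python
import pandas as pd

a = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
b = pd.DataFrame({"col3": [1, 2, 3], "col4": [2, 3, 4]})

# iterating a DataFrame yields its column labels, so zipping the
# frames themselves pairs up column names...
names = list(zip(*[a, b]))

# ...while zipping columns pairs up the actual cell values
values = list(zip(a['col1'], b['col4']))
```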
In a simplified dataframe:
import pandas as pd
df1 = pd.DataFrame({'350': [7.898167, 6.912074, 6.049002, 5.000357, 4.072320],
                    '351': [8.094912, 7.090584, 6.221289, 5.154516, 4.211746],
                    '352': [8.291657, 7.269095, 6.393576, 5.308674, 4.351173],
                    '353': [8.421007, 7.374317, 6.496641, 5.403691, 4.439815],
                    '354': [8.535562, 7.463452, 6.584512, 5.485725, 4.517310],
                    '355': [8.650118, 7.552586, 6.672383, 4.517310, 4.594806]},
                   index=[1, 2, 3, 4, 5])
int_range = df1.columns.astype(float)
a = 0.005
b = 0.837
I would like to solve an equation which is attached as an image below:
I is equal to the values in the data frame; x is the int_range values, so in this case from 350 to 355 with dx=1.
a and b are optional constants.
I need a dataframe as output with one value per row.
For now I do something like this, but I'm not sure it's correct:
from scipy import integrate  # needed for integrate.trapz

dict_INT = {}
for index, row in df1.iterrows():
    func = df1.loc[index] * df1.loc[index].index.astype('float')
    x = df1.loc[index].index.astype('float')
    dict_INT[index] = integrate.trapz(func, x)

df_out = pd.DataFrame(dict_INT, index=['INT']).T
df_fin = df_out / (a * b)
This is the final sum I get per row:
1 3.505796e+06
2 3.068796e+06
3 2.700446e+06
4 2.199336e+06
5 1.840992e+06
I solved this by first converting the dataframe to a dict, then applying your equation to each item per row, and writing the values back using collections.defaultdict. I will break it down:
import pandas as pd
from collections import defaultdict
df1 = pd.DataFrame({'350': [7.898167, 6.912074, 6.049002, 5.000357, 4.072320],
                    '351': [8.094912, 7.090584, 6.221289, 5.154516, 4.211746],
                    '352': [8.291657, 7.269095, 6.393576, 5.308674, 4.351173],
                    '353': [8.421007, 7.374317, 6.496641, 5.403691, 4.439815],
                    '354': [8.535562, 7.463452, 6.584512, 5.485725, 4.517310],
                    '355': [8.650118, 7.552586, 6.672383, 4.517310, 4.594806]},
                   index=[1, 2, 3, 4, 5])
int_range = df1.columns.astype(float)
a = 0.005
b = 0.837
dx = 1
df_dict = df1.to_dict()  # convert df to dict for easier operations
integrated_dict = {}     # initialize empty dict
d = defaultdict(list)    # initialize empty dict of lists for tuples later
integrated_list = []

for k, v in df_dict.items():  # unpack df dict of dicts
    for x, y in v.items():    # unpack each column dict (x is the index, y is the value)
        integrated_list.append((k, (float(k) * float(y) * float(dx)) / (a * b)))  # store a list of tuples

for x, y in integrated_list:  # key is the column header, value is the new integrated calc
    d[x].append(y)

d = {k: tuple(v) for k, v in d.items()}    # unpack to multiple values
integrated_df = pd.DataFrame.from_dict(d)  # to df
integrated_df['Sum'] = integrated_df.iloc[:, :].sum(axis=1)
output (updated to include sum):
350 351 352 353 354 \
0 660539.653524 678928.103226 697410.576822 710302.382557 722004.527599
1 578070.704898 594694.141935 611402.972521 622015.269056 631317.086738
2 505890.250896 521785.529032 537763.142652 547984.294624 556969.473835
3 418189.952210 432314.245161 446512.126165 455795.202628 464025.483871
4 340576.344086 353243.212903 365976.797133 374493.356033 382109.376344
355 Sum
0 733761.502987 4.202947e+06
1 640661.416965 3.678162e+06
2 565996.646356 3.236389e+06
3 383188.781362 2.600026e+06
4 389762.516129 2.206162e+06
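For comparison, the whole computation can also be done in one vectorized trapezoid call, without converting to dicts; this is a sketch that should reproduce the question's per-row sums (the NumPy 2.0 rename of trapz to trapezoid is handled with a fallback):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'350': [7.898167, 6.912074, 6.049002, 5.000357, 4.072320],
                    '351': [8.094912, 7.090584, 6.221289, 5.154516, 4.211746],
                    '352': [8.291657, 7.269095, 6.393576, 5.308674, 4.351173],
                    '353': [8.421007, 7.374317, 6.496641, 5.403691, 4.439815],
                    '354': [8.535562, 7.463452, 6.584512, 5.485725, 4.517310],
                    '355': [8.650118, 7.552586, 6.672383, 4.517310, 4.594806]},
                   index=[1, 2, 3, 4, 5])
a, b = 0.005, 0.837

# np.trapz was renamed np.trapezoid in NumPy 2.0
trapz = getattr(np, "trapezoid", None) or np.trapz

x = df1.columns.astype(float).to_numpy()
# trapezoidal rule on x * I(x) along each row, then divide by a*b
result = pd.Series(trapz(df1.to_numpy() * x, x, axis=1) / (a * b),
                   index=df1.index, name='INT')
```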
How can you combine multiple columns from a dataframe into a list?
Input:
df = pd.DataFrame(np.random.randn(10000, 7), columns=list('ABCDEFG'))
If I wanted to create a list from column A I would perform:
df1 = df['A'].tolist()
But if I wanted to combine numerous columns into this list, it wouldn't be efficient to write df['A','B','C'...'Z'].tolist().
I have tried the following, but it just adds the column headers to a list.
df1 = list(df.columns)[0:8]
Intended input:
A B C D E F G
0 0.787576 0.646178 -0.561192 -0.910522 0.647124 -1.388992 0.728360
1 0.265409 -1.919283 -0.419196 -1.443241 -2.833812 -1.066249 0.553379
2 0.343384 0.659273 -0.759768 0.355124 -1.974534 0.399317 -0.200278
Intended Output:
[0.787576, 0.646178, -0.561192, -0.910522, 0.647124, -1.388992, 0.728360,
0.265409, -1.919283, -0.419196, -1.443241, -2.833812, -1.066249, 0.553379,
0.343384, 0.659273, -0.759768, 0.355124, -1.974534, 0.399317, -0.200278]
Is this what you are looking for?
lst = df.values.tolist()
flat_list = [item for x in lst for item in x]
print(flat_list)
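An equivalent sketch without the explicit list comprehension, using NumPy's ravel (which flattens row by row, matching the intended output order); the two-row frame here is a trimmed stand-in for the question's data:

```python
import pandas as pd

# trimmed stand-in for the question's random frame
df = pd.DataFrame([[0.787576, 0.646178], [0.265409, -1.919283]],
                  columns=list('AB'))

# to_numpy() gives the underlying 2-D array; ravel() flattens it row-major
flat = df.to_numpy().ravel().tolist()
```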
You can use to_dict:
df = pd.DataFrame(np.random.randn(10, 10), columns=list('ABCDEFGHIJ'))
df.to_dict('l')
Out[1036]:
{'A': [-0.5611441440595607,
-0.3785906500723589,
-0.19480328695097676,
-0.7472526275034221,
-2.4232786057647457,
0.10506614562827334,
0.4968179288412277,
1.635737019365132,
-1.4286421753281746,
0.4973223222844811],
'B': [-1.0550082961139444,
-0.1420067090193365,
0.30130476834580633,
1.1271866812852227,
0.38587456174846285,
-0.531163142682951,
-1.1335754634118729,
0.5975963084356348,
-0.7361022807495443,
1.4329395663140427],
...}
Or add values.tolist():
df[list('ABC')].values.tolist()
Out[1041]:
[[0.09552771302434987, 0.18551596484768904, -0.5902249875268607],
[-1.5285190712746388, 1.2922627021799646, -0.8347422966138306],
[-0.4092028716404067, -0.5669107267579823, 0.3627970727410332],
[-1.3546346273319263, -0.9352316948439341, 1.3568726575880614],
[-1.3509518030469496, 0.10487182694997808, -0.6902134363370515]]
Edit: np.concatenate(df[list('ABC')].T.values.tolist()) (note this flattens column by column).
I have a DataFrame like:
df = pd.DataFrame([{'v1':'a', 'v2':'b', 'v3':'1'},
{'v1':'2', 'v2':'c', 'v3':'d'}])
or
v1 v2 v3
0 a b 1
1 2 c d
When the contents of a column/row is '1', '2' or '3', I would like to replace its contents with the corresponding item from the column indicated. I.e., in the first row, column v3 has value "1" so I would like to replace it with the value of the first element in column v1. Doing this for both rows, I should get:
v1 v2 v3
0 a b a
1 c c d
I can do this with the following code:
for i in range(3):
    for j in range(3):
        df.loc[df['v%d' % (i+1)] == ('%d' % (j+1)), 'v%d' % (i+1)] = \
            df.loc[df['v%d' % (i+1)] == ('%d' % (j+1)), 'v%d' % (j+1)]
Is there a less cumbersome way to do this?
df.apply(lambda row: [row['v'+v] if 'v'+v in row else v for v in row], 1)
This iterates over each row and replaces any value v with the value in column named 'v'+v if that column exists, otherwise it does not change the value.
output:
v1 v2 v3
0 a b a
1 c c d
Note that this will not limit the replacements to digits only. For example, if you have a column named 'va', it will replace all cells that contain 'a' with the value in the 'va' column in that row. To limit the columns that you can replace from, you can define a list of acceptable column names. For example, let's say you only wanted to make replacements from column v1:
acceptable_columns = ['v1']
df.apply(lambda row: [row['v'+v] if 'v'+v in acceptable_columns else v for v in row], 1)
output:
v1 v2 v3
0 a b a
1 2 c d
EDIT
It was pointed out that the answer above throws an error if you have non-string types in your dataframe. You can avoid this by explicitly converting each cell value to a string:
df.apply(lambda row: [row['v'+str(v)] if 'v'+str(v) in row else v for v in row], 1)
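A self-contained version of that edited one-liner; result_type='expand' is an assumption needed on newer pandas, where a returned list would otherwise stay wrapped in a Series:

```python
import pandas as pd

df = pd.DataFrame([{'v1': 'a', 'v2': 'b', 'v3': '1'},
                   {'v1': '2', 'v2': 'c', 'v3': 'd'}])

# str(v) guards against non-string cell values; 'v' + str(v) in row
# tests membership against the row's index labels (the column names)
out = df.apply(lambda row: [row['v' + str(v)] if 'v' + str(v) in row else v
                            for v in row],
               axis=1, result_type='expand')
out.columns = df.columns
```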
ORIGINAL (INCORRECT) ANSWER BELOW
Note that the answer below only applies when the values to replace lie on a diagonal (which is the case in the example, but that was not the question asked ... my bad).
You can do this with pandas' replace method and numpy's diag method:
First select the values to replace, these will be the digits 1 to the length of your dataframe:
to_replace = [str(i) for i in range(1,len(df)+1)]
Then select values that each should be replaced with, these will be the diagonal of your data frame:
import numpy as np
replace_with = np.diag(df)
Now you can do the actual replacement:
df.replace(to_replace, replace_with)
which gives:
v1 v2 v3
0 a b a
1 c c d
And of course if you want the whole thing as a one liner:
df.replace([str(i) for i in range(1,len(df)+1)], np.diag(df))
Add the inplace=True keyword arg to replace if you want to do the replacement in place.
I see 2 options.
Loop over the columns and then over the mapping
mapping = {'1': 'v1', '2': 'v2', '3': 'v3'}
df1 = df.copy()
for column_name, column in df1.items():
    for k, v in mapping.items():
        df1.loc[column == k, column_name] = df1.loc[column == k, v]
df1
v1 v2 v3
0 a b a
1 c c d
Loop over the columns, then loop over all the 'hits'
df2 = df.copy()
for column_name, column in df2.items():
    hits = column.isin(mapping.keys())
    for idx, item in column[hits].items():
        df2.loc[idx, column_name] = df2.loc[idx, mapping[item]]
df2
v1 v2 v3
0 a b a
1 c c d
Once you've chosen an approach, you could reduce the two nested for-loops to one with itertools.product.
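That itertools.product collapse might look like the following sketch; note the masks are taken from the original frame, which is safe here because no replacement value is itself a digit:

```python
import itertools
import pandas as pd

df = pd.DataFrame([{'v1': 'a', 'v2': 'b', 'v3': '1'},
                   {'v1': '2', 'v2': 'c', 'v3': 'd'}])
mapping = {'1': 'v1', '2': 'v2', '3': 'v3'}

df3 = df.copy()
# one loop over all (column, mapping entry) pairs instead of two nested loops;
# masks come from the unmodified df, writes go into the copy
for (column_name, column), (k, v) in itertools.product(df.items(), mapping.items()):
    df3.loc[column == k, column_name] = df3.loc[column == k, v]
```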
I made this:
df = pd.DataFrame([{'v1':'a', 'v2':'b', 'v3':'1'},
{'v1':'2', 'v2':'c', 'v3':'d'}])
def replace_col(row, columns, col_num_dict={1: 'v1', 2: 'v2', 3: 'v3'}):
    for col in columns:
        x = getattr(row, col)
        try:
            x = int(x)
            if int(x) in col_num_dict.keys():
                setattr(row, col, getattr(row, col_num_dict[int(x)]))
        except ValueError:
            pass
    return row

df = df.apply(replace_col, axis=1, args=(df.columns,))
It applies the function replace_col on every row. The row object's attributes which correspond to its columns get replaced with the right value from the same row. It looks a bit complicated due to the multiple set/get attribute functions, but it does exactly what is needed without too much overhead.
You can modify the data before converting to a df:
data = [{'v1': 'a', 'v2': 'b', 'v3': '1'}, {'v1': '2', 'v2': 'c', 'v3': 'd'}]
mapping = {'1': 'v1', '2': 'v2', '3': 'v3'}
for idx, line in enumerate(data):
    for item in line:
        try:
            int(line[item])
            data[idx][item] = data[idx][mapping[line[item]]]
        except Exception:
            pass
[{'v1': 'a', 'v2': 'b', 'v3': 'a'}, {'v1': 'c', 'v2': 'c', 'v3': 'd'}]