Can somebody explain the following code?
import pandas as pd
a = pd.DataFrame({"col1": [1,2,3], "col2": [2,3,4]})
b = pd.DataFrame({"col3": [1,2,3], "col4": [2,3,4]})
list(zip(*[a,b]))
Output:
[('col1', 'col3'), ('col2', 'col4')]
The zip function returns an iterator of tuples, pairing up the elements of its arguments positionally:
a = ("John", "Charles", "Mike")
b = ("Jenny", "Christy", "Monica", "Vicky")
x = zip(a, b)
# use the tuple() function to display a readable version of the result:
print(tuple(x))
Iterating over a DataFrame yields its column names, so with [a, b] inside zip (the * unpacks the list into two separate arguments) you get the column names of the two dataframes, paired positionally.
You can also pair column values element-wise, e.g.:
d = list(zip(a['col1'], b['col4']))
Since there are four columns in total, there are 4 × 4 = 16 possible ordered column pairings you could build this way.
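As a sketch of what is going on (using the dataframes from the question): iterating a DataFrame yields column names, and itertools.product enumerates the 16 ordered column pairings mentioned above:

```python
import itertools
import pandas as pd

a = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
b = pd.DataFrame({"col3": [1, 2, 3], "col4": [2, 3, 4]})

# Iterating over a DataFrame yields its column names...
print(list(a))                    # ['col1', 'col2']

# ...so zip(*[a, b]) pairs the column names of the two frames positionally.
print(list(zip(*[a, b])))         # [('col1', 'col3'), ('col2', 'col4')]

# Pairing the values of two chosen columns element-wise:
d = list(zip(a['col1'], b['col4']))
print(d)                          # [(1, 2), (2, 3), (3, 4)]

# All 16 ordered pairings of the four column names:
pairs = list(itertools.product(list(a) + list(b), repeat=2))
print(len(pairs))                 # 16
```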
I am trying to create a new column holding the key/value pairs extracted from the dict in one column, keeping only the keys that appear in the list in another column.
Sample Data:
names name_dicts
['Mary', 'Joe'] {'Mary':123, 'Ralph':456, 'Joe':789}
Expected Result:
names name_dicts new_col
['Mary', 'Joe'] {'Mary':123, 'Ralph':456, 'Joe':789} {'Mary':123, 'Joe':789}
I have attempted to use ast to convert the name_dicts column to a column of true dictionaries.
This function errored out with a "cannot convert string" error
(col here is the df['name_dicts'] column):
def get_name_pairs(col):
    for k, v in col.items():
        if k.isin(df['names']):
            return
Using a list comprehension and operator.itemgetter:
from operator import itemgetter
df['new_col'] = [dict(zip(l, itemgetter(*l)(d)))
                 for l, d in zip(df['names'], df['name_dicts'])]
output:
names name_dicts new_col
0 [Mary, Joe] {'Mary': 123, 'Ralph': 456, 'Joe': 789} {'Mary': 123, 'Joe': 789}
Input used:
df = pd.DataFrame({'names': [['Mary', 'Joe']],
                   'name_dicts': [{'Mary': 123, 'Ralph': 456, 'Joe': 789}]
                   })
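One caveat worth noting (my own observation, not part of the answer above): itemgetter with a single key returns a bare value rather than a 1-tuple, so the dict(zip(...)) pattern would fail on a row whose names list has exactly one entry:

```python
from operator import itemgetter

d = {'Mary': 123, 'Ralph': 456, 'Joe': 789}

# Two or more keys: itemgetter returns a tuple.
print(itemgetter('Mary', 'Joe')(d))   # (123, 789)

# A single key: itemgetter returns the bare value, which zip() cannot iterate.
print(itemgetter('Mary')(d))          # 123
```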
You can apply a lambda function with dictionary comprehension at row level to get the values from the dict in second column based on the keys in the list of first column:
# If col values are stored as strings:
import ast
for col in df:
    df[col] = df[col].apply(ast.literal_eval)

df['new_col'] = df.apply(lambda x: {k: x['name_dicts'].get(k, 0) for k in x['names']},
                         axis=1)
# If you want to include only key/value pairs whose key is in
# both the list and the dictionary, replace the lambda above with:
# lambda x: {k: x['name_dicts'][k] for k in x['names'] if k in x['name_dicts']}
names ... new_col
0 [Mary, Joe] ... {'Mary': 123, 'Joe': 789}
[1 rows x 3 columns]
PS: ast.literal_eval runs without error on the sample data you posted with the above code.
Your function needs only a small change, and then you can use it with .apply():
import pandas as pd

df = pd.DataFrame({
    'names': [['Mary', 'Joe']],
    'name_dicts': [{'Mary': 123, 'Ralph': 456, 'Joe': 789}],
})

def filter_data(row):
    result = {}
    for key, val in row['name_dicts'].items():
        if key in row['names']:
            result[key] = val
    return result

df['new_col'] = df.apply(filter_data, axis=1)
print(df.to_string())
Result:
names name_dicts new_col
0 [Mary, Joe] {'Mary': 123, 'Ralph': 456, 'Joe': 789} {'Mary': 123, 'Joe': 789}
EDIT:
If name_dicts holds the string "{'Mary':123, 'Ralph':456, 'Joe':789}" rather than a dict, then you can replace ' with " to get valid JSON, which you can convert to a dictionary using json.loads:
import json
df['name_dicts'] = df['name_dicts'].str.replace("'", '"').apply(json.loads)
Or parse it directly as a Python literal:
import ast
df['name_dicts'] = df['name_dicts'].apply(ast.literal_eval)
Or, as a last resort (eval executes arbitrary code, so prefer the options above):
df['name_dicts'] = df['name_dicts'].apply(eval)
Full code:
import pandas as pd

df = pd.DataFrame({
    'names': [['Mary', 'Joe']],
    'name_dicts': ["{'Mary':123, 'Ralph':456, 'Joe':789}"],  # strings
})

#import json
#df['name_dicts'] = df['name_dicts'].str.replace("'", '"').apply(json.loads)
#df['name_dicts'] = df['name_dicts'].apply(eval)
import ast
df['name_dicts'] = df['name_dicts'].apply(ast.literal_eval)

def filter_data(row):
    result = {}
    for key, val in row['name_dicts'].items():
        if key in row['names']:
            result[key] = val
    return result

df['new_col'] = df.apply(filter_data, axis=1)
print(df.to_string())
multipliers = {'A' : 5, 'B' : 10, 'C' : 15, 'D' : 20}
df = pd.util.testing.makeDataFrame() # a random df with columns A,B,C,D
f = lambda x, col: multipliers[col] * x
Is there a non-loop pandas way to apply f to each column, something like df.apply(f, axis=0, ...)? What I can achieve with a loop is:
df2 = df.copy()
for c in df.columns:
    df2[c] = f(df[c], c)
(The real f is more complex than the example above; please treat f as a black-box function of two variables: arg1 is the data, arg2 is the column name.)
Use a lambda function, and to pass the column name use x.name:
np.random.seed(2022)
multipliers = {'A' : 5, 'B' : 10, 'C' : 15, 'D' : 20}
df = pd.util.testing.makeDataFrame() # a random df with columns A,B,C,D
f = lambda x, col: multipliers[col] * x
df2 = df.copy()
for c in df.columns:
    df2[c] = f(df[c], c)
print (df2.head())
A B C D
9CTWXXW3ys 2.308860 6.375789 5.362095 -23.354181
yq1PHBltEO 2.876024 1.950080 15.772909 -13.776645
lWtMioDq6A -11.206739 17.691500 -12.175996 25.957264
lEHcq1pxLr -6.510434 -6.004475 14.084401 13.999673
xvL04Y66tm -3.827731 -3.104207 -4.111277 1.440596
df2 = df.apply(lambda x: f(x, x.name))
print (df2.head())
A B C D
9CTWXXW3ys 2.308860 6.375789 5.362095 -23.354181
yq1PHBltEO 2.876024 1.950080 15.772909 -13.776645
lWtMioDq6A -11.206739 17.691500 -12.175996 25.957264
lEHcq1pxLr -6.510434 -6.004475 14.084401 13.999673
xvL04Y66tm -3.827731 -3.104207 -4.111277 1.440596
You can convert your dictionary to a Series and turn your function into a vectorized operation. For example:
df * pd.Series(multipliers)
You can also use the transform method, which accepts a dict of functions:
def func(var):
    # return your function
    return lambda x: x * var

df.transform({k: func(v) for k, v in multipliers.items()})
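A quick sketch (my own check, using a small hand-made frame instead of the deprecated pd.util.testing.makeDataFrame) showing that the loop, apply with x.name, the vectorized multiply, and transform all agree:

```python
import pandas as pd

multipliers = {'A': 5, 'B': 10, 'C': 15, 'D': 20}
df = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0],
                   'C': [5.0, 6.0], 'D': [7.0, 8.0]})
f = lambda x, col: multipliers[col] * x

# 1) explicit loop over columns
looped = df.copy()
for c in df.columns:
    looped[c] = f(df[c], c)

# 2) apply, passing the column name via x.name
applied = df.apply(lambda x: f(x, x.name))

# 3) vectorized multiply by a Series aligned on column labels
vectorized = df * pd.Series(multipliers)

# 4) transform with a dict of per-column functions
def func(var):
    return lambda x: x * var

transformed = df.transform({k: func(v) for k, v in multipliers.items()})

assert looped.equals(applied) and looped.equals(vectorized) and looped.equals(transformed)
```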
I followed the example here: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#flexible-apply
Data:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
Group by 'A' but select only column 'C', then apply:
grouped = df.groupby('A')['C']

def f(group):
    return pd.DataFrame({'original': group,
                         'demeaned': group - group.mean()})

grouped.apply(f)
Everything is OK, but when I group by 'A' and select columns 'C' and 'D', I cannot get it to work:
grouped = df.groupby('A')[['C', 'D']]
for name, val in grouped:
    print(name)
    print(val)

grouped.apply(f)
So what am I doing wrong here?
Thank you
Phan
When you select a single column (['C']) you get a pandas.Series, but when you select multiple columns ([['C', 'D']]) you get a pandas.DataFrame, and that needs different code in f().
It could be:
grouped = df.groupby('A')[['C', 'D']]

def f(group):
    return pd.DataFrame({
        'original_C': group['C'],
        'original_D': group['D'],
        'demeaned_C': group['C'] - group['C'].mean(),
        'demeaned_D': group['D'] - group['D'].mean(),
    })

grouped.apply(f)
Result:
original_C original_D demeaned_C demeaned_D
0 -0.122789 0.216775 -0.611724 1.085802
1 -0.500153 0.912777 -0.293509 0.210248
2 0.875879 -1.582470 0.386944 -0.713443
3 -0.250717 1.770375 -0.044073 1.067846
4 1.261891 0.177318 0.772956 1.046345
5 0.130939 -0.575565 0.337582 -1.278094
6 -1.121481 -0.964481 -1.610417 -0.095454
7 1.551176 -2.192277 1.062241 -1.323250
Because with two columns you already have a DataFrame, you can also write it more concisely without constructing a new pd.DataFrame():
def f(group):
    group[['demeaned_C', 'demeaned_D']] = group - group.mean()
    return group
or, more generally:
def f(group):
    for col in group.columns:
        group[f'demeaned_{col}'] = group[col] - group[col].mean()
    return group
BTW: If you use [['C']] instead of ['C'], then you also get a DataFrame instead of a Series, and you can use the last version of f().
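A minimal sketch of that single- vs double-bracket distinction (using a cut-down version of the question's data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": np.random.randn(4),
    "D": np.random.randn(4),
})

# Single brackets: each group is a Series.
s = df.groupby('A')['C'].get_group('foo')
print(type(s).__name__)     # Series

# Double brackets: each group is a DataFrame, even with one column.
g = df.groupby('A')[['C']].get_group('foo')
print(type(g).__name__)     # DataFrame
```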
I have a pandas dataframe like so:
A
a
b
c
d
I am trying to create a python dictionary which would look like this:
df_dict = {'a':0, 'b':1, 'c':2, 'd':3}
What I've tried:
df.reset_index(inplace=True)
df = {x : y for x in df['A'] for y in df['index']}
But the df is 75k rows long and it is taking a while now; I am not even sure this produces the result I need. Is there a neat, fast way of achieving this?
Use dict with zip and range:
d = dict(zip(df['A'], range(len(df))))
print (d)
{'a': 0, 'b': 1, 'c': 2, 'd': 3}
You can do it like this:
# creating an example dataframe with 75,000 rows
import uuid
df = pd.DataFrame({"col": [str(uuid.uuid4()) for _ in range(75000)]})

# your bit (note the order: value as key, index as value)
{v: i for i, v in df.reset_index().values}
It runs in seconds.
You could convert the series to a list and use enumerate:
lst = { x: i for i, x in enumerate(df['A'].tolist()) }
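As a quick sanity check (my own sketch, using the small frame from the question), all of these approaches produce the expected mapping:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c', 'd']})

# dict + zip + range
d1 = dict(zip(df['A'], range(len(df))))

# comprehension over reset_index().values (value as key, index as value)
d2 = {v: i for i, v in df.reset_index().values}

# enumerate over the column converted to a list
d3 = {x: i for i, x in enumerate(df['A'].tolist())}

assert d1 == d2 == d3 == {'a': 0, 'b': 1, 'c': 2, 'd': 3}
```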
How can you combine multiple columns from a dataframe into a list?
Input:
df = pd.DataFrame(np.random.randn(10000, 7), columns=list('ABCDEFG'))
If I wanted to create a list from column A I would perform:
df1 = df['A'].tolist()
But if I wanted to combine numerous columns into this list, it wouldn't be efficient to write df['A','B','C'...'Z'].tolist().
I have tried the following, but it just puts the column headers into a list:
df1 = list(df.columns)[0:8]
Intended input:
A B C D E F G
0 0.787576 0.646178 -0.561192 -0.910522 0.647124 -1.388992 0.728360
1 0.265409 -1.919283 -0.419196 -1.443241 -2.833812 -1.066249 0.553379
2 0.343384 0.659273 -0.759768 0.355124 -1.974534 0.399317 -0.200278
Intended Output:
[0.787576, 0.646178, -0.561192, -0.910522, 0.647124, -1.388992, 0.728360,
0.265409, -1.919283, -0.419196, -1.443241, -2.833812, -1.066249, 0.553379,
0.343384, 0.659273, -0.759768, 0.355124, -1.974534, 0.399317, -0.200278]
Is this what you are looking for?
lst = df.values.tolist()
flat_list = [item for x in lst for item in x]
print(flat_list)
You can use to_dict:
df = pd.DataFrame(np.random.randn(10, 10), columns=list('ABCDEFGHIJ'))
df.to_dict('l')
Out[1036]:
{'A': [-0.5611441440595607,
-0.3785906500723589,
-0.19480328695097676,
-0.7472526275034221,
-2.4232786057647457,
0.10506614562827334,
0.4968179288412277,
1.635737019365132,
-1.4286421753281746,
0.4973223222844811],
'B': [-1.0550082961139444,
-0.1420067090193365,
0.30130476834580633,
1.1271866812852227,
0.38587456174846285,
-0.531163142682951,
-1.1335754634118729,
0.5975963084356348,
-0.7361022807495443,
1.4329395663140427],
...}
Or add values.tolist():
df[list('ABC')].values.tolist()
Out[1041]:
[[0.09552771302434987, 0.18551596484768904, -0.5902249875268607],
[-1.5285190712746388, 1.2922627021799646, -0.8347422966138306],
[-0.4092028716404067, -0.5669107267579823, 0.3627970727410332],
[-1.3546346273319263, -0.9352316948439341, 1.3568726575880614],
[-1.3509518030469496, 0.10487182694997808, -0.6902134363370515]]
Edit: np.concatenate(df[list('ABC')].values).tolist() flattens it into a single list in row order, matching the intended output.
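A small sketch of my own (with a 2×3 frame) showing the flattening options and the element order each produces:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

# Row-major (matches the intended output: all of row 0, then row 1):
flat_rows = [item for row in df.values.tolist() for item in row]
print(flat_rows)                    # [1, 3, 5, 2, 4, 6]

# Same order with numpy:
print(df.values.ravel().tolist())   # [1, 3, 5, 2, 4, 6]

# Column-major (all of column A, then B, then C):
print(np.concatenate(df.T.values.tolist()).tolist())  # [1, 2, 3, 4, 5, 6]
```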