I have a pandas DataFrame with values in a number of columns (say two, for simplicity) and a column of column names that I want to use to pick values from the other columns:
import pandas as pd
import numpy as np
np.random.seed(1337)
df = pd.DataFrame(
{"a": np.arange(10), "b": 10 - np.arange(10), "c": np.random.choice(["a", "b"], 10)}
)
which gives
> df['c']
0 b
1 b
2 a
3 a
4 b
5 b
6 b
7 a
8 a
9 a
Name: c, dtype: object
That is, I want the first and second elements to be picked from column b, the third from a and so on.
This works:
def pick_vals_from_cols(df, col_selector):
    condlist = np.row_stack(col_selector.map(lambda x: x == df.columns))
    values = np.select(condlist.transpose(), df.values.transpose())
    return values
> pick_vals_from_cols(df, df["c"])
array([10, 9, 2, 3, 6, 5, 4, 7, 8, 9], dtype=object)
But it just feels so fragile and clunky. Is there a better way to do this?
lookup
df.lookup(df.index, df.c)
array([10, 9, 2, 3, 6, 5, 4, 7, 8, 9])
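Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so the call above fails on current versions. Below is a minimal sketch of the factorize-based replacement suggested in the pandas documentation. It assumes the frame from the question; the c column is hard-coded to match the listing above rather than relying on the random seed, and the name picked is only illustrative.

```python
import numpy as np
import pandas as pd

# Same frame as in the question, with column "c" written out explicitly
df = pd.DataFrame(
    {"a": np.arange(10), "b": 10 - np.arange(10), "c": list("bbaabbbaaa")}
)

# codes[i] is the position of df["c"][i] within the unique labels `cols`
codes, cols = pd.factorize(df["c"])

# Reorder the value columns to match `cols`, then pick one cell per row
picked = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), codes]
print(picked)
```

This keeps the lookup vectorized: one fancy-indexing step instead of a Python-level loop over rows.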
Comprehension
But why when you have lookup?
[df.at[t] for t in df.c.items()]
[10, 9, 2, 3, 6, 5, 4, 7, 8, 9]
Bonus Hack
Not intended for actual use
[*map(df.at.__getitem__, zip(df.index, df.c))]
[10, 9, 2, 3, 6, 5, 4, 7, 8, 9]
Because df.get_value is deprecated
[*map(df.get_value, df.index, df.c)]
FutureWarning: get_value is deprecated and will be removed in a future release. Please use .at[] or .iat[] accessors instead
[10, 9, 2, 3, 6, 5, 4, 7, 8, 9]
Related
It seems like a simple question, but I need your help.
For example, I have df:
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 3, 1, 8, 9, 6, 7, 4, 6]
How can I group 'x' into the ranges 1 to 5 and 6 to 10, and calculate the mean 'y' value for these two bins?
I expect to get new df like:
x_grpd = [5, 10]
y_grpd = [3, 6.4]
The range of 'x' is given as an example. Ideally, I want to be able to set any int value to get a different number of bins.
You can use cut and groupby.mean:
bins = [5, 10]
df2 = (df
       .groupby(pd.cut(df['x'], [0]+bins,
                       labels=bins,
                       right=True))
       ['y'].mean()
       .reset_index()
      )
Output:
x y
0 5 3.0
1 10 6.4
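The question also asks for a way to set the number of bins with a single int. A minimal sketch, assuming equal-width bins are acceptable: pd.cut accepts an integer for bins and splits the range of x into that many intervals. The helper name bin_means is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "y": [2, 1, 3, 1, 8, 9, 6, 7, 4, 6],
})

def bin_means(frame, n_bins):
    """Mean of y within n_bins equal-width intervals of x (hypothetical helper)."""
    intervals = pd.cut(frame["x"], bins=n_bins)
    # observed=True keeps only the bins that actually contain rows
    return frame.groupby(intervals, observed=True)["y"].mean()

means = bin_means(df, 2)
print(means)
```

With n_bins=2 this reproduces the two groups above; the index holds the interval each mean belongs to rather than the upper edge.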
df1 and df2 are of different sizes. Set the df1(row1, 'Z') value to df2(row2, 'C') value when df1(row1, 'A') is equal to df2(row2, 'B').
What is the recommended way to implement df1['Z'] = df2['C'] if df1['A']==df2['B']?
df1 = pd.DataFrame({'A': ['foo', 'bar', 'test'], 'b': [1, 2, 3], 'c': [3, 4, 5]})
df2 = pd.DataFrame({'B': ['foo', 'baz'], 'C': [3, 1]})
df1
A b c
0 foo 1 3
1 bar 2 4
2 test 3 5
df2
B C
0 foo 3
1 baz 1
After change
df1
A b c Z
0 foo 1 3 3
1 bar 2 4 NaN
2 test 3 5 NaN
What if multiple assignments are required under multiple conditions? Is iterating over rows recommended, as shown below?
for i, row in df1.iterrows():
    if <condition(s)>:
        do assignment(s): df.at[i, 'hjk']=something
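You can use numpy.where with the condition that df1.A equals df2.B: where it is true, take df2.C, otherwise keep df1.Z. Note that this compares rows with the same index label, so it assumes df1 and df2 have the same length and index, as in the sample below: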
np.where(df1.A.eq(df2.B), df2.C, df1.Z)
Assign the result above to df1.Z.
SAMPLE:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A': np.random.randint(5,10,20), 'Z': np.random.randint(5,10,20)})
df2 = pd.DataFrame({'C': np.random.randint(5,10,20), 'B': np.random.randint(5,10,20)})
>>>df1.Z.values
Out[41]: array([7, 6, 7, 7, 6, 8, 9, 7, 6, 6, 7, 6, 8, 7, 8, 8, 9, 6, 7, 7])
>>> np.where(df1.A.eq(df2.B), df2.C, df1.Z)
Out[42]: array([7, 6, 6, 7, 6, 8, 9, 7, 6, 9, 7, 8, 8, 7, 8, 8, 9, 6, 7, 7])
I would try map:
df1['Z'] = df1['A'].map(dict(zip(df2['B'],df2['C'])))
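When the two frames differ in size, as in the question, a left merge is another option. This is a sketch using the question's df1 and df2; the names lookup and result are illustrative, and it assumes the keys in df2['B'] are unique (duplicate keys would duplicate the matching rows of df1).

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["foo", "bar", "test"], "b": [1, 2, 3], "c": [3, 4, 5]})
df2 = pd.DataFrame({"B": ["foo", "baz"], "C": [3, 1]})

# Rename df2's columns so the key matches df1 and the value lands in Z
lookup = df2.rename(columns={"B": "A", "C": "Z"})

# how="left" keeps every row of df1; rows without a match get NaN in Z
result = df1.merge(lookup, on="A", how="left")
print(result)
```

The same pattern extends to several value columns at once, which avoids iterating over rows when there are multiple assignments.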
I can't really wrap my head around this problem.
Say I have two arrays:
A = [2, 7, 4, 3, 9, 4, 2, 6]
B = [1, 1, 1, 4, 4, 7, 7, 7]
What I'm trying to do: if a value is repeated in array B (for example, 1 appears 3 times), the corresponding values in array A are summed and appended to another array (say C).
So for the two arrays above, C would be:
C = [13, 12, 12]
As a side note, the application I'd use this for takes timestamps from a database as array B, so once a day has passed, that value won't be repeated in the array.
Any help is appreciated!
Here is a solution without pandas, using only itertools groupby:
from itertools import groupby
C = [sum( a for a,_ in g) for _,g in groupby(zip(A,B),key = lambda x: x[1])]
yields:
[13, 12, 12]
I would use pandas for this
Say you put those arrays in a DataFrame. This does the job:
df = pd.DataFrame(
    {
        'A': [2, 7, 4, 3, 9, 4, 2, 6],
        'B': [1, 1, 1, 4, 4, 7, 7, 7]
    }
)
df.groupby('B').sum()
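The groupby call above returns a DataFrame indexed by B. To get the plain list C from the question, one possible sketch selects column A and converts the result. Passing sort=False keeps groups in order of first appearance, which is an assumption about what the asker wants.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [2, 7, 4, 3, 9, 4, 2, 6],
        "B": [1, 1, 1, 4, 4, 7, 7, 7],
    }
)

# Sum A within each distinct value of B, in order of first appearance
C = df.groupby("B", sort=False)["A"].sum().tolist()
print(C)
```

Unlike itertools.groupby, pandas groups all rows with an equal key even when they are not adjacent.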
If you want pure python solution, you can use itertools.groupby:
from itertools import groupby
A = [2, 7, 4, 3, 9, 4, 2, 6]
B = [1, 1, 1, 4, 4, 7, 7, 7]
out = []
for _, g in groupby(zip(A, B), lambda k: k[1]):
    out.append(sum(v for v, _ in g))
print(out)
Prints:
[13, 12, 12]
This is my dataframe:
date ids
0 2011-04-23 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
1 2011-04-24 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2 2011-04-25 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
3 2011-04-26 Nan
4 2011-04-27 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
5 2011-04-28 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
I want to replace NaN with []. How can I do that? fillna([]) did not work. I even tried replace(np.nan, []), but it gives an error:
TypeError('Invalid "to_replace" type: \'float\'',)
My approach is similar to #hellpanderrr's, but tests for list-ness rather than using isnan:
df['ids'] = df['ids'].apply(lambda d: d if isinstance(d, list) else [])
I originally tried using pd.isnull (or pd.notnull) but, when given a list, that returns the null-ness of each element.
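A small sketch of both points, using a made-up three-row frame in place of the one from the question (the name elementwise is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ids": [[0, 1], np.nan, [2]]})

# On a list, pd.isnull works element by element and returns an array,
# so it cannot answer "is this cell missing?" for list-valued cells
elementwise = pd.isnull([0, np.nan])
print(elementwise)

# isinstance sidesteps that: anything that is not a list becomes a new []
df["ids"] = df["ids"].apply(lambda d: d if isinstance(d, list) else [])
print(df)
```

Because the lambda builds a new [] on every call, each filled cell gets its own list object.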
After a lot of head-scratching I found this method that should be the most efficient (no looping, no apply), just assigning to a slice:
isnull = df.ids.isnull()
df.loc[isnull, 'ids'] = [ [[]] * isnull.sum() ]
The trick was to construct your list of [] of the right size (isnull.sum()), and then enclose it in a list: the value you are assigning is a 2D array (1 column, isnull.sum() rows) containing empty lists as elements.
A simple solution would be:
df['ids'].fillna("").apply(list)
As noted by #timgeb, this requires df['ids'] to contain lists or nan only.
You can first use loc to locate all rows that have a nan in the ids column, and then loop through these rows using at to set their values to an empty list:
for row in df.loc[df.ids.isnull(), 'ids'].index:
    df.at[row, 'ids'] = []
>>> df
date ids
0 2011-04-23 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
1 2011-04-24 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
2 2011-04-25 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
3 2011-04-26 []
4 2011-04-27 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
5 2011-04-28 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
Surprisingly, passing a dict with empty lists as values seems to work for Series.fillna, but not DataFrame.fillna - so if you want to work on a single column you can use this:
>>> df
A B C
0 0.0 2.0 NaN
1 NaN NaN 5.0
2 NaN 7.0 NaN
>>> df['C'].fillna({i: [] for i in df.index})
0 []
1 5
2 []
Name: C, dtype: object
The solution can be extended to DataFrames by applying it to every column.
>>> df.apply(lambda s: s.fillna({i: [] for i in df.index}))
A B C
0 0 2 []
1 [] [] 5
2 [] 7 []
Note: for large Series/DataFrames with few missing values, this might create an unreasonable amount of throwaway empty lists.
Tested with pandas 1.0.5.
Another solution using numpy:
df.ids = np.where(df.ids.isnull(), pd.Series([[]]*len(df)), df.ids)
Or using combine_first:
df.ids = df.ids.combine_first(pd.Series([[]]*len(df)))
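One caveat with [[]] * n, which several of the answers here use: it repeats a reference to a single list object, so the filled cells may all share that one list. The sketch below shows the aliasing in plain Python and a comprehension-based fill that gives each cell its own list. The frame is a made-up stand-in for the one in the question.

```python
import numpy as np
import pandas as pd

# [[]] * n repeats a reference to ONE list object n times
shared = [[]] * 3
shared[0].append("x")   # every position now shows the appended item

# A comprehension builds a separate list for each position
fresh = [[] for _ in range(3)]
fresh[0].append("x")    # only the first position changes

df = pd.DataFrame({"ids": [[0, 1], np.nan, np.nan]})
df["ids"] = [v if isinstance(v, list) else [] for v in df["ids"]]
df.at[1, "ids"].append(99)  # row 2 keeps its own, still empty, list
print(df)
```

If the empty lists are never mutated in place, the shared-reference versions are harmless; the distinction matters only once code appends to them.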
Without assignments:
1) Assuming we have only floats and integers in our dataframe
import math
df.apply(lambda x:x.apply(lambda x:[] if math.isnan(x) else x))
2) For any dataframe
import math
def isnan(x):
    # `long` no longer exists in Python 3, and math.isnan does not accept complex numbers
    return isinstance(x, (int, float)) and math.isnan(x)
df.apply(lambda x:x.apply(lambda x:[] if isnan(x) else x))
Maybe not the shortest or most optimized solution, but I think it is pretty readable:
# Masking-in nans
mask = df['ids'].isna()
# Filling nans with a list-like string and literally-evaluating such string
df.loc[mask, 'ids'] = df.loc[mask, 'ids'].fillna('[]').apply(eval)
Maybe more dense:
df['ids'] = [[] if type(x) != list else x for x in df['ids']]
This one-liner is probably faster:
df['ids'].fillna('DELETE').apply(lambda x : [] if x=='DELETE' else x)
Another solution that is explicit:
# use apply to only replace the nulls with the list
df.loc[df.ids.isnull(), 'ids'] = df.loc[df.ids.isnull(), 'ids'].apply(lambda x: [])
Create a function that checks your condition, if not, it returns an empty list/empty set etc.
Then apply that function to the variable, but also assigning the new calculated variable to the old one or to a new variable if you wish.
aa = pd.DataFrame({'d': [1, 1, 2, 3, 3, np.nan], 'r': [3, 5, 5, 5, 5, 'e']})
def check_condition(x):
    # NaN > 0 is False, so missing values fall through to the empty list
    if x > 0:
        return x
    else:
        return list()
aa['d'] = aa.d.apply(check_condition)
You can try this:
df.fillna(df.notna().applymap(lambda x: x or []))
I am sitting in front of a probably very simple problem. I have two pandas DataFrames with some common Indices, like so:
import pandas as pd
x = pd.DataFrame(index=[1, 2, 3, 4],
data={'d': [5, 5, 5, 5]})
y = pd.DataFrame(index=[3, 4, 5, 6],
data={'d': [6, 6, 6, 6]})
What I now want to do is update x with y. To me, this means three things:
The indices 1, 2 are only in x and not in y. Keep the values from x.
The indices 3, 4 are common indices in x and y. Update the values with the new info from y.
The indices 5, 6 are only in y. Add them with their respective values to x.
In total, the result should look like this:
x = pd.DataFrame(index=[1, 2, 3, 4, 5, 6],
data={'d': [5, 5, 6, 6, 6, 6]})
Thinking in terms of Python dictionaries, I tried x.update(y), which does steps 1 and 2 but not step 3.
I am confident that this is a one-liner, but I just cannot find it.
Addendum
I mentioned dictionaries (with the index as key); the approach there would look like this:
a = {1: 5,
     2: 5,
     3: 5,
     4: 5}
b = {3: 6,
     4: 6,
     5: 6,
     7: 6}
a.update(b)
Afterwards, a is:
{1: 5, 2: 5, 3: 6, 4: 6, 5: 6, 7: 6}
You can call combine_first with y as the caller: values from y take precedence, and the rows that are missing from y are filled in from x:
In [75]:
y.combine_first(x)
Out[75]:
d
1 5
2 5
3 6
4 6
5 6
6 6
You can't use update to achieve what you want, as it only updates existing values:
In [79]:
x.update(y)
x
Out[79]:
d
1 5
2 5
3 6
4 6
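For completeness, a self-contained sketch of the combine_first approach on the frames from the question (the name result is illustrative). Depending on the pandas version, the d column may come back as float rather than int, because the two indexes are aligned before the values are combined.

```python
import pandas as pd

x = pd.DataFrame(index=[1, 2, 3, 4], data={"d": [5, 5, 5, 5]})
y = pd.DataFrame(index=[3, 4, 5, 6], data={"d": [6, 6, 6, 6]})

# y's values win on shared labels; labels found only in x are filled from x,
# and the result covers the union of both indexes
result = y.combine_first(x)
print(result)
```

Unlike x.update(y), this returns a new frame instead of modifying x in place, so assign it back to x if that is the goal.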