Pandas update values using loc with repeated indices - python

All,
I have a dataframe with repeated indices. I'm trying to update the values using the index for all rows with that index. Here is an example of what I have
name x
t
0 A 5
0 B 2
1 A 7
2 A 5
2 B 9
2 C 3
"A" is present at every time. I want to replace "x" with the current value of "x", minus the value of "x" for "A" at that time. The tricky part is to get an array or dataframe that is, in this case,
array([5, 5, 7, 5, 5, 5])
which is the value for "A", but repeated for each timestamp. I can then subtract this from df['x']. My working solution is below.
temp = df[df['name'] == 'A']
d = dict(zip(temp.index, temp['x']))
df['x'] = df['x'] - df.index.to_frame()['t'].replace(d)
name x
t
0 A 0
0 B -3
1 A 0
2 A 0
2 B 4
2 C -2
This works, but feels a bit hacky, and I can't help but think there is a better (and much faster) solution...

You can use reindex:
df.x -= df.loc[df.name == 'A', 'x'].reindex(df.index).values
df
Out[362]:
name x
t
0 A 0
0 B -3
1 A 0
2 A 0
2 B 4
2 C -2

Group by the .cumsum() of where name == 'A', and subtract the first value in each group from the rest:
df['x'] = df.groupby((df.name == 'A').cumsum())['x'].apply(lambda s: s.sub(s.iloc[0]))
name x
t
0 A 0
0 B -3
1 A 0
2 A 0
2 B 4
2 C -2
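Another option, sketched below: since the "A" rows have one value per timestamp, their x values form a Series keyed by t, and Index.map can broadcast that Series back over the repeated index without building a dict by hand (this assumes "A" really is unique per timestamp):

```python
import pandas as pd

# reconstruct the example frame with the repeated 't' index
df = pd.DataFrame(
    {"name": ["A", "B", "A", "A", "B", "C"], "x": [5, 2, 7, 5, 9, 3]},
    index=pd.Index([0, 0, 1, 2, 2, 2], name="t"),
)

# Series of A's x values, indexed by the (unique) timestamps
a = df.loc[df["name"] == "A", "x"]

# Index.map aligns each repeated 't' label with A's value at that time
df["x"] = df["x"] - df.index.map(a).to_numpy()
print(df)
```

This replaces the zip/replace machinery of the original solution with a single positional mapping over the index.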

Related

How to use lambda function on a pandas data frame via map/apply where lambda takes different values for each column

The idea is to transform a data frame in the fastest way according to the values specific to each column.
For simplicity, here is an example where each element of a column is compared to the mean of the column it belongs to and replaced with 1 if greater than mean(column) or 0 otherwise.
In [26]: df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
In [27]: df
Out[27]:
0 1 2
0 1 2 3
1 4 5 6
In [28]: df.mean().values.tolist()
Out[28]: [2.5, 3.5, 4.5]
The snippet below is not real code; it is just to exemplify the desired behavior. I used the apply method, but it can be whatever works fastest.
In [29]: f = lambda x: 0 if x < means else 1
In [30]: df.apply(f)
In [27]: df
Out[27]:
0 1 2
0 0 0 0
1 1 1 1
This is a toy example but the solution has to be applied to a big data frame, therefore, it has to be fast.
Cheers!
You can create a boolean mask of the dataframe by comparing each element with the mean of that column. It can be easily achieved using
df > df.mean()
0 1 2
0 False False False
1 True True True
Since True equates to 1 and False to 0, a boolean dataframe can be easily converted to integer using astype.
(df > df.mean()).astype(int)
0 1 2
0 0 0 0
1 1 1 1
If you need the output to be some strings rather than 0 and 1, use np.where, which works as np.where(condition, value_if_true, value_if_false):
pd.DataFrame(np.where(df > df.mean(), 'm', 'n'))
0 1 2
0 n n n
1 m m m
Edit: addressing a question from the comments: what if m and n are column-dependent?
df = pd.DataFrame(np.arange(12).reshape(4,3))
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
pd.DataFrame(np.where(df > df.mean(), df.min(), df.max()))
0 1 2
0 9 10 11
1 9 10 11
2 0 1 2
3 0 1 2

Select rows which have only zeros in columns

I want to select the rows in a dataframe which have zero in every column in a list of columns, e.g. this df:
In:
df = pd.DataFrame([[1,2,3,6], [2,4,6,8], [0,0,3,4],[1,0,3,4],[0,0,0,0]],columns =['a','b','c','d'])
df
Out:
a b c d
0 1 2 3 6
1 2 4 6 8
2 0 0 3 4
3 1 0 3 4
4 0 0 0 0
Then:
In:
mylist = ['a','b']
selection = df.loc[df['mylist']==0]
selection
I would like to see:
Out:
a b c d
2 0 0 3 4
4 0 0 0 0
Should be simple but I'm having a slow day!
You'll need to determine whether all columns of a row have zeros or not. Given a boolean mask, use DataFrame.all(axis=1) to do that.
df[df[mylist].eq(0).all(1)]
a b c d
2 0 0 3 4
4 0 0 0 0
Note that if you wanted to find rows with zeros in every column, remove the subsetting step:
df[df.eq(0).all(1)]
a b c d
4 0 0 0 0
Using reduce and NumPy's logical_and:
The point of this is to eliminate the need to create new Pandas objects and simply produce the mask we are looking for using the data where it sits.
from functools import reduce
df[reduce(np.logical_and, (df[c].values == 0 for c in mylist))]
a b c d
2 0 0 3 4
4 0 0 0 0
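In the same spirit of working on the raw data, the whole mask can also be built in one comparison on the underlying NumPy array; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 6], [2, 4, 6, 8], [0, 0, 3, 4], [1, 0, 3, 4], [0, 0, 0, 0]],
    columns=["a", "b", "c", "d"],
)
mylist = ["a", "b"]

# compare the raw array to zero, collapse across the chosen columns,
# then index the frame once with the resulting boolean mask
mask = (df[mylist].to_numpy() == 0).all(axis=1)
selection = df[mask]
print(selection)
```

This keeps rows 2 and 4, matching the expected output above.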

pandas filter and apply

Hello I have the following data frame (df):
Group Value
A 1
A 2
A 3
B -1
B 2
B 3
I would like to convert all of group B to negative values if they aren't already (i.e. multiply by -1).
df[df['group'] == 'B', 'value'].apply(... if value less than 0 then -1*value)
Please let me know the correct way to go about this in pandas framework. Thank you
In [85]: df.loc[df.Group.eq('B') & df.Value.gt(0), 'Value'] *= -1
In [86]: df
Out[86]:
Group Value
0 A 1
1 A 2
2 A 3
3 B -1
4 B -2
5 B -3
A different way using mask and np.sign
df.assign(Value=df.Value.mask(df.Group == 'B', -np.sign(df.Value) * df.Value))
Group Value
0 A 1
1 A 2
2 A 3
3 B -1
4 B -2
5 B -3
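A third option, as a sketch: negate the absolute value, which forces every B row negative regardless of its current sign and needs no separate positivity check:

```python
import pandas as pd

df = pd.DataFrame({"Group": list("AAABBB"), "Value": [1, 2, 3, -1, 2, 3]})

# negate the absolute value: already-negative B rows are unchanged,
# positive B rows flip sign
mask = df["Group"] == "B"
df.loc[mask, "Value"] = -df.loc[mask, "Value"].abs()
print(df)
```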

Pandas Insert a row above the Index and the Series data in a Dataframe

I've been through several trials; nothing seems to work so far.
I have tried df.insert(0, "XYZ", 555), which seemed to work until it did not, for reasons I am not certain of.
I understand that the issue is that the Index is not considered a Series, and so df.iloc[0] does not allow you to insert data directly above the Index column.
I've also tried manually adding a first index with the value "XYZ" to the list of indices in the definition of the dataframe, but nothing has worked.
Thanks for your help.
A B C D are my columns. range(5) is my index. I am trying to obtain the below, for an arbitrary row starting with type, followed by a list of strings. Thanks.
A B C D
type 'string1' 'string2' 'string3' 'string4'
0
1
2
3
4
If you use Timestamps as the Index, adding a custom single row with its own custom index will throw an error:
ValueError: Cannot add integral value to Timestamp without offset. I am guessing it's due to the difference in the operand types, as when subtracting an integer from a Timestamp. How could I fix this in a generic manner? Thanks!
If you want to insert a row before the first row, you can do it this way:
data:
In [57]: df
Out[57]:
id var
0 a 1
1 a 2
2 a 3
3 b 5
4 b 9
adding one row:
In [58]: df.loc[df.index.min() - 1] = ['z', -1]
In [59]: df
Out[59]:
id var
0 a 1
1 a 2
2 a 3
3 b 5
4 b 9
-1 z -1
sort index:
In [60]: df = df.sort_index()
In [61]: df
Out[61]:
id var
-1 z -1
0 a 1
1 a 2
2 a 3
3 b 5
4 b 9
optionally reset your index :
In [62]: df = df.reset_index(drop=True)
In [63]: df
Out[63]:
id var
0 z -1
1 a 1
2 a 2
3 a 3
4 b 5
5 b 9
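The comment above notes that df.index.min() - 1 breaks when the index holds Timestamps (or anything without integer arithmetic). A sketch that sidesteps index arithmetic entirely: build the new row as its own one-row frame with whatever label you like, then concatenate it on top.

```python
import pandas as pd

df = pd.DataFrame({"id": ["a", "a", "a", "b", "b"], "var": [1, 2, 3, 5, 9]})

# build the new row as a one-row frame with an arbitrary index label;
# 'first' here is just a placeholder label, not anything special
new_row = pd.DataFrame([["z", -1]], columns=df.columns, index=["first"])

# concat prepends it without doing any arithmetic on the existing index
df = pd.concat([new_row, df])
print(df)
```

Because no arithmetic is performed on the existing labels, this works the same whether the original index is integers, strings, or Timestamps.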

Using a column with a boolean to access other columns

I have a pandas dataframe like the following:
A B C
1 2 1
3 4 0
5 2 0
5 3 1
And would like to get the value from A if the value of C is 1 and the value of B if C is zero. How would I do this? Ultimately I'd like to end up with a vector with the values of A if C is one and B if C is 0 which would be [1,4,2,5]
Assuming you mean "from A if the value of C is 1 and from B if the value of C is 0", which makes sense given your intended output, I might use Series.where:
>>> df
A B C
0 1 2 1
1 3 4 0
2 5 2 0
3 5 3 1
>>> df.A.where(df.C, df.B)
0 1
1 4
2 2
3 5
dtype: int64
which is read "make a series using values of A if the corresponding value of C is true, otherwise use the corresponding value of B". Here since 1 is true we can just use df.C, but we could use df.C == 1 or df.C*5+3 < 4 or any other boolean Series.
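An equivalent spelling with numpy.where, which returns a plain array rather than a Series; a minimal sketch of the same selection:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 3, 5, 5], "B": [2, 4, 2, 3], "C": [1, 0, 0, 1]})

# pick A where C is truthy, otherwise B
result = np.where(df["C"], df["A"], df["B"])
print(result)
```

This yields the vector [1, 4, 2, 5] asked for in the question; use Series.where instead if you want to keep the original index attached to the result.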
