Find a series in dataframe and replace it with original row - python

I have the dataframe df below, but some rows with D4 set to True were causing an issue in my custom ordering. As a temporary workaround, I stored those rows in a list, deliberately set their D4 values to False, and then sorted with my custom ordering.
Index D1 D2 D3 D4 D5
0 8 5 0 False True
1 45 35 0 True False
2 35 10 1 False True
3 40 5 0 True False
4 12 10 5 False False
5 18 15 13 False True
6 25 15 5 True False
7 35 10 11 False True
8 95 50 0 False False
hacked_rows = []
def hack_d4(row):
    if row['D3'] in [0, 1]:
        row['D4'] = False
        hacked_rows.append(row)
    return row
df_hacked = df.apply(lambda x: hack_d4(x), axis=1)
ordered_df = order_df(df_hacked) # Returns same df with some rows in custom order.
In short, I need to revert the ordered_df below to the original df with the help of the hacked_rows list. Row order is not important; only the hacked rows should be restored to their original values.
Index D1 D2 D3 D4 D5
0 0 8 5 0 False True
2 2 35 10 1 False True
3 3 40 5 0 False False
1 1 45 35 0 False False
5 5 18 15 13 False True
4 4 12 10 5 False False
7 7 35 10 11 False True
8 8 95 50 0 False False
6 6 25 15 5 True False
Now that I am done with the custom ordering, I want to restore the rows stored in hacked_rows to their original values in the dataframe, but I am not sure how to put them back.
I tried the code below for one row, but no luck; it throws a TypeError:
item = hacked_rows[0]
item = item.drop('D3')
ordered_df.loc[item] # But this line is throwing error.
Note: I am open to a different approach for temporarily replacing the True values.

I think the error is when you create the data frame again.
hacked_rows = []
def hack_d4(row):
    if row['D3'] in [0, 1]:
        row['D4'] = False
        hacked_rows.append(row)
    return row
df = df.apply(lambda x: hack_d4(x), axis=1)
ordered_df = pd.DataFrame(df) # code update
df
Index D1 D2 D3 D4 D5
0 0 8 5 0 False True
1 1 45 35 0 False False
2 2 35 10 1 False True
3 3 40 5 0 False False
4 4 12 10 5 False False
5 5 18 15 13 False True
6 6 25 15 5 True False
7 7 35 10 11 False True
8 8 95 50 0 False False
Update:
Following your comment, I understood that you want to convert hacked_rows into a data frame, so I added the code below.
new_df = pd.DataFrame(index=[], columns=[])
for i in hacked_rows:
    new_df = pd.concat([new_df, pd.Series(i)], axis=1, ignore_index=True)
new_df.stack().unstack(level=1).T
Index D1 D2 D3 D4 D5
1 0 8 5 0 False True
2 1 45 35 0 False False
3 2 35 10 1 False True
4 3 40 5 0 False False
5 8 95 50 0 False False
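Since the question is open to a different approach, one option is to skip the list of rows entirely: keep a copy of the original D4 column and, once sorting is done, assign it back so that pandas aligns it on the index. The following is a minimal sketch, assuming the dataframe from the question; the sort_values call is only a hypothetical stand-in for order_df, which is not shown in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "D1": [8, 45, 35, 40, 12, 18, 25, 35, 95],
    "D2": [5, 35, 10, 5, 10, 15, 15, 10, 50],
    "D3": [0, 0, 1, 0, 5, 13, 5, 11, 0],
    "D4": [False, True, False, True, False, False, True, False, False],
    "D5": [True, False, True, False, False, True, False, True, False],
})

# Remember the original D4 values before touching them.
original_d4 = df["D4"].copy()

# Temporarily force D4 to False where D3 is 0 or 1 (same rule as hack_d4).
df_hacked = df.copy()
df_hacked.loc[df_hacked["D3"].isin([0, 1]), "D4"] = False

# Hypothetical stand-in for order_df(df_hacked): any reordering of the rows.
ordered_df = df_hacked.sort_values("D1")

# Column assignment aligns on the index, so each row gets its own
# original D4 value back regardless of the new row order.
ordered_df["D4"] = original_d4
```

Because the restore step relies on index alignment, it assumes the custom ordering keeps the original index labels, as the ordered_df shown in the question does.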

Related

Pandas, create column using previous new column value

I am using Python and have the following Pandas Dataframe:
idx result grouping
1 False
2 True
3 True
4 False
5 True
6 True
7 True
8 False
9 True
10 True
11 True
12 True
What I would like is to apply the following logic:
If the result is False, then I want grouping to be the idx value.
If the result is True, then I want grouping to be the previous grouping value.
So the end result will be:
idx result grouping
1 False 1
2 True 1
3 True 1
4 False 4
5 True 4
6 True 4
7 True 4
8 False 8
9 True 8
10 True 8
11 True 8
12 True 8
I have tried all sorts to get this working from using the Pandas shift() command to using lambda, but I am just not getting it.
I know I could iterate through the dataframe and perform the calculation but there has to be a better method.
Examples of what I have tried without success:
df['grouping'] = df['idx'] if not df['result'] else df['grouping'].shift(1)
df['grouping'] = df.apply(lambda x: x['idx'] if not x['result'] else x['grouping'].shift(1), axis=1)
Many Thanks for any assistance you can provide.
Mask the True values in idx, then forward fill:
df['grouping'] = df['idx'].mask(df['result']).ffill(downcast='infer')
idx result grouping
0 1 False 1
1 2 True 1
2 3 True 1
3 4 False 4
4 5 True 4
5 6 True 4
6 7 True 4
7 8 False 8
8 9 True 8
9 10 True 8
10 11 True 8
11 12 True 8
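As far as I know, the downcast argument of ffill is deprecated in recent pandas releases, so here is a hedged variant of the same mask-then-forward-fill idea that avoids it. It is a sketch that assumes the first row has result == False (otherwise the leading values stay NaN and the integer cast fails).

```python
import pandas as pd

df = pd.DataFrame({
    "idx": range(1, 13),
    "result": [False, True, True, False, True, True,
               True, False, True, True, True, True],
})

# Keep idx only where result is False; every other row becomes NaN.
anchors = df["idx"].where(~df["result"])

# Forward fill carries the last anchor down, then cast back to integers.
df["grouping"] = anchors.ffill().astype("int64")

print(df)
```

Series.where keeps values where the condition holds, which makes it the mirror image of the mask call used in the answer above.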

Adding True / False values to a pandas dataframe from a condition on other dataframe

I have two dataframes:
a = pd.DataFrame({'id': [10, 20, 30, 40, 50, 60, 70]})
b = pd.DataFrame({'id': [10, 30, 40, 70]})
print(a)
print(b)
# a
id
0 10
1 20
2 30
3 40
4 50
5 60
6 70
# b
id
0 10
1 30
2 40
3 70
I am trying to add an extra column to a that shows whether each id is present in b, like so:
# a
id present
0 10 True
1 20 False
2 30 True
3 40 True
4 50 False
5 60 False
6 70 True
What I've tried:
a.join(b,rsuffix='a')
# and then thought I'd replace nans with False and values with True
# but it does not return what I expect as it joined row by row
id ida
0 10 10.000
1 20 30.000
2 30 40.000
3 40 70.000
4 50 nan
5 60 nan
6 70 nan
Then I added:
a.join(b,rsuffix='a', on='id')
But did not get what I expected as well:
id ida
0 10 nan
1 20 nan
2 30 nan
3 40 nan
4 50 nan
5 60 nan
6 70 nan
I also tried a['present'] = b['id'].isin(a['id']), but that did not return what I expect either:
id present
0 10 True
1 20 True
2 30 True
3 40 True
4 50 NaN
5 60 NaN
6 70 NaN
How can I have an extra column in a denoting if id is present in b with True / False statements?
You are close; you need to test a['id'] against b['id'] with Series.isin:
a['present'] = a['id'].isin(b['id'])
print (a)
id present
0 10 True
1 20 False
2 30 True
3 40 True
4 50 False
5 60 False
6 70 True
With merge, it is possible to pass indicator=True in a left join and test whether the _merge column equals 'both':
a['present'] = a.merge(b, on='id', how='left', indicator=True)['_merge'].eq('both')
print (a)
id present
0 10 True
1 20 False
2 30 True
3 40 True
4 50 False
5 60 False
6 70 True
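The two approaches above can be checked against each other in a self-contained script. This is only a sketch using the sample frames from the question; it assumes the id values in b are unique, because duplicate ids in b would add rows to the merged result and the merge-based column would no longer line up with a.

```python
import pandas as pd

a = pd.DataFrame({"id": [10, 20, 30, 40, 50, 60, 70]})
b = pd.DataFrame({"id": [10, 30, 40, 70]})

# Approach 1: membership test, evaluated row by row on a['id'].
present_isin = a["id"].isin(b["id"])

# Approach 2: left join with an indicator column, then test for 'both'.
merged = a.merge(b, on="id", how="left", indicator=True)
present_merge = merged["_merge"].eq("both")

a["present"] = present_isin
print(a)
print(present_isin.equals(present_merge))
```

For a plain membership flag, isin is usually the simpler choice; the merge variant is mainly useful when other columns of b are needed as well.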

Trying to sort a pandas dataframe by a number column, but getting strange output

I have a dataframe (called df) with a length of 460 that looks like this
index Position T/F
0 1 True
1 2 False
4 3 False
8 4 False
9 18 True
13 5 False
And I would like to sort it by 'Position' so that the whole dataframe looks like this
index Position T/F
0 1 True
1 2 False
4 3 False
8 4 False
13 5 False
20 6 False
28 7 True
I have attempted to use
df = df.sort_values('Position', ascending=True)
However, that outputs a rather bizarre dataframe with this form
index Position T/F
0 1 True
52 10 False
456 100 False
470 101 False
477 102 False
...
59 11 False
666 110 False
644 111 True
...
1 2 False
You get the idea. I'm not sure why it's sorting it like this, but I would like to figure out how to fix this issue so that I can output the desired DataFrame
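The Position column seems to hold strings, so it is sorted lexicographically ('10' comes before '2'). Convert it to integers first:
df['Position'] = df['Position'].astype(int)
Then do the sorting.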
df = df.sort_values('Position', ascending=True)
Output:
index Position T/F
0 1 True
1 2 False
4 3 False
8 4 False
13 5 False
20 6 False
28 7 True
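To illustrate the difference, here is a small sketch with made-up sample values (not the asker's 460-row data) that sorts the same column first as strings and then as numbers. It uses pd.to_numeric, which is an alternative to astype(int) and is my choice here, not part of the answer above.

```python
import pandas as pd

# Hypothetical sample: Position stored as text, as it often is after
# reading a file with mixed or quoted values.
df = pd.DataFrame({
    "Position": ["1", "10", "100", "2", "11", "3"],
    "T/F": [True, False, False, False, False, False],
})

# Sorting strings compares character by character.
as_text = df.sort_values("Position")["Position"].tolist()

# Convert to a numeric dtype, then sort again.
df["Position"] = pd.to_numeric(df["Position"])
as_numbers = df.sort_values("Position")["Position"].tolist()

print(as_text)
print(as_numbers)
```

Checking df.dtypes before sorting is a quick way to confirm whether a column holds strings (dtype object) or numbers.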

Pandas dataframe: propagate True values if timestamp is identical

Best described by an example. Input is
ts val
0 10 False
1 20 True
2 20 False
3 30 True
4 40 False
5 40 False
6 40 False
7 60 True
8 60 False
desired output is
ts val
0 10 False
1 20 True
2 20 True
3 30 True
4 40 False
5 40 False
6 40 False
7 60 True
8 60 True
The idea is as follows: if there is at least one True value inside a ts cluster (i.e. rows with the same ts value), set val to True for all rows that share that exact timestamp.
You can use groupby on column 'ts', and then apply using .any() to determine whether any of val is True in the cluster/group.
import pandas as pd
# your data
# =====================
print(df)
Out[58]:
ts val data
0 10 False 0.3332
1 20 True -0.6877
2 20 False -0.6004
3 30 True 0.1922
4 40 False 0.2472
5 40 False -0.0117
6 40 False 0.8607
7 60 True -1.1464
8 60 False 0.0698
# processing
# =====================
# as suggested by @DSM, transform is the best way to do it
df['val'] = df.groupby('ts')['val'].transform(any)
Out[61]:
ts val data
0 10 False 0.3332
1 20 True -0.6877
2 20 True -0.6004
3 30 True 0.1922
4 40 False 0.2472
5 40 False -0.0117
6 40 False 0.8607
7 60 True -1.1464
8 60 True 0.0698
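A self-contained version of the same idea is sketched below, using only the ts and val columns from the question. It passes the string "any" rather than the Python builtin, since, as far as I know, recent pandas versions warn when a builtin is passed to transform; treat that detail as an assumption about your pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    "ts": [10, 20, 20, 30, 40, 40, 40, 60, 60],
    "val": [False, True, False, True, False, False, False, True, False],
})

# For each ts group, compute whether any val is True and broadcast
# that single answer back to every row of the group.
df["val"] = df.groupby("ts")["val"].transform("any")

print(df)
```

transform returns a result with the same index as the input, which is why it can be assigned straight back to the column.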

pandas assigning series view to a series view doesn't work?

I'm trying to take a slice view from a series (logically indexed by a conditional), process it then assign the result back to that logically-indexed slice.
The LHS and RHS in the assign are Series with matching indices, but the assign ends up being scalar for some unknown reason (see bottom). How to get the desired assign? (I checked SO and pandas 0.11.0 doc for anything related).
import numpy as np
import pandas as pd
# A dataframe with sample data and some boolean conditional
df = pd.DataFrame(data={'x': range(1,20)})
df['cond'] = df.x.apply(lambda xx: ((xx%3)==1) )
# Create a new col and selectively assign to it... elsewhere being NaN...
df['newcol'] = np.nan
# This attempted assign to a view of the df doesn't work (in reality the RHS expression would actually be a return value from somefunc)
df.ix[df.cond, df.columns.get_loc('newcol')] = 2* df.ix[df.cond, df.columns.get_loc('x')]
# yet a scalar assign does...
df.ix[df.cond, df.columns.get_loc('newcol')] = 99.
# Likewise bad trying to use -df.cond as the logical index:
df.ix[-df.cond, df.columns.get_loc('newcol')] = 2* df.ix[-df.cond, df.columns.get_loc('x')]
Currently I just get a stupid scalar assign:
>>> df.ix[-df.cond, df.columns.get_loc('newcol')] = 2* df.ix[-df.cond, df.columns.get_loc('x')]
>>> df
x cond newcol
0 1 True NaN
1 2 False 4
2 3 False 4
3 4 True NaN
4 5 False 4
5 6 False 4
6 7 True NaN
7 8 False 4
8 9 False 4
9 10 True NaN
10 11 False 4
11 12 False 4
12 13 True NaN
13 14 False 4
14 15 False 4
15 16 True NaN
16 17 False 4
17 18 False 4
18 19 True NaN
In [21]: df = pd.DataFrame(data={'x': range(1,20)})
In [22]: df['cond'] = df.x.apply(lambda xx: ((xx%3)==1) )
In [23]: df
Out[23]:
x cond
0 1 True
1 2 False
2 3 False
3 4 True
4 5 False
5 6 False
6 7 True
7 8 False
8 9 False
9 10 True
10 11 False
11 12 False
12 13 True
13 14 False
14 15 False
15 16 True
16 17 False
17 18 False
18 19 True
In [24]: df['newcol'] = 2*df.loc[df.cond, 'x']
In [25]: df
Out[25]:
x cond newcol
0 1 True 2
1 2 False NaN
2 3 False NaN
3 4 True 8
4 5 False NaN
5 6 False NaN
6 7 True 14
7 8 False NaN
8 9 False NaN
9 10 True 20
10 11 False NaN
11 12 False NaN
12 13 True 26
13 14 False NaN
14 15 False NaN
15 16 True 32
16 17 False NaN
17 18 False NaN
18 19 True 38
In [10]: def myfunc(df_):
....: return 2 * df_
....:
In [26]: df['newcol'] = myfunc(df.ix[df.cond, df.columns.get_loc('newcol')])
In [27]: df
Out[27]:
x cond newcol
0 1 True 4
1 2 False NaN
2 3 False NaN
3 4 True 16
4 5 False NaN
5 6 False NaN
6 7 True 28
7 8 False NaN
8 9 False NaN
9 10 True 40
10 11 False NaN
11 12 False NaN
12 13 True 52
13 14 False NaN
14 15 False NaN
15 16 True 64
16 17 False NaN
17 18 False NaN
18 19 True 76
I found this workaround:
tmp = pd.Series(np.repeat(np.nan, len(df)))
tmp[-cond] = 2* df.loc[df.cond, 'x']
df['newcol'] = tmp
Strangely, the following sometimes works (assigning the slice to the entire Series)
(but fails with a more complex RHS with AssertionError: Length of values does not match length of index)
(According to pandas doc, the RHS Series indexes are supposed to get aligned to the LHS, well at least if the LHS is a dataframe - but not if it's a Series? Is this a bug?)
>>> df['newcol'] = 2* df.loc[df.cond, 'x']
>>> df
x cond newcol
0 1 True 2
1 2 False NaN
2 3 False NaN
3 4 True 8
4 5 False NaN
5 6 False NaN
6 7 True 14
7 8 False NaN
8 9 False NaN
9 10 True 20
10 11 False NaN
11 12 False NaN
12 13 True 26
13 14 False NaN
14 15 False NaN
15 16 True 32
16 17 False NaN
17 18 False NaN
18 19 True 38
Jeff, what's weird is we can assign to df['newcol'] (which is supposed to be a copy not a view, right?)
when we do:
df['newcol'] = 2* df.loc[df.cond, 'x']
but not when we do the same with the RHS coming from a fn:
def myfunc(df_):
"""Some func transforming and returning said Series slice"""
return 2* df_
df['newcol'] = myfunc( df.ix[df.cond, df.columns.get_loc('newcol')] )
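The thread above uses .ix, which was removed in pandas 1.0. For readers on a current version, the following is a hedged sketch of the same conditional assignment written with .loc and with ~ for boolean negation. The column name neg is a hypothetical addition for illustration, and myfunc is the placeholder function from the discussion.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(1, 20)})
df["cond"] = df["x"] % 3 == 1
df["newcol"] = np.nan


def myfunc(s):
    """Placeholder transform returning a Series with the same index."""
    return 2 * s


# Boolean row mask plus column label on both sides: the right-hand
# Series is aligned to the selected rows by index.
df.loc[df["cond"], "newcol"] = myfunc(df.loc[df["cond"], "x"])

# Negated mask: use ~ on a boolean Series.
df["neg"] = np.nan
df.loc[~df["cond"], "neg"] = myfunc(df.loc[~df["cond"], "x"])

print(df.head(6))
```

Rows not selected by the mask keep their NaN, so the two columns end up filled in complementary positions.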
