Compare consecutive rows and delete based on condition - python

I would like to compare consecutive rows in column one and delete rows based on this condition:
If 2 or more consecutive rows are the same, keep them.
If a row is different from both the previous and the next row, delete it.
Example df:
a = [['A', 'B', 'C'], ['A', 'B', 'C'], ['B', 'B', 'C'],['C', 'B', 'C'],['C', 'B', 'C'],['C', 'B', 'C']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
print output would be:
one two three
0 A B C
1 A B C
2 B B C
3 C B C
4 C B C
5 C B C
Expected output would be:
one two three
0 A B C
1 A B C
3 C B C
4 C B C
5 C B C
So the line from index 2 will be deleted.
I've tried using shift but I am stuck, because the way I am doing it now also deletes rows that should be kept. Can someone please tell me a better way of doing this? Or maybe how to apply shift but ignore the first and last row?
#First I take only the one column
df = df['one']
#Then apply shift
df.loc[df.shift(-1) == df]
With the above code I get the following, which is not correct because it also deletes rows 1 and 5:
0 A
3 C
4 C

Try shifting up and down:
mask = (df.one == df.one.shift(-1)) | (df.one == df.one.shift(1))
adj_df = df[mask]
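To see why a single shift is not enough, here is a self-contained sketch on the question's data that builds the two comparisons separately. The names same_as_next and same_as_prev are illustrative and not part of the answer.

```python
import pandas as pd

rows = [['A', 'B', 'C'], ['A', 'B', 'C'], ['B', 'B', 'C'],
        ['C', 'B', 'C'], ['C', 'B', 'C'], ['C', 'B', 'C']]
df = pd.DataFrame(rows, columns=['one', 'two', 'three'])

# True where a row matches the row below it (never true for the last row of a run)
same_as_next = df['one'] == df['one'].shift(-1)
# True where a row matches the row above it (never true for the first row of a run)
same_as_prev = df['one'] == df['one'].shift(1)

# A row belongs to a run of length >= 2 if it matches either neighbour
adj_df = df[same_as_next | same_as_prev]
print(adj_df)
```

Each comparison alone misses one end of every run, which is why the two masks are combined with |.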

You could use shift in both directions (the all condition checks that every column matches, not only column one):
df[(df.shift(1) == df).all(axis=1) | (df.shift(-1) == df).all(axis=1)]
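If runs should be filtered by an arbitrary minimum length rather than by looking only at direct neighbours, a different technique (not taken from the answers above) is to label each run and count its size. This is a sketch; run_id, run_len and min_len are hypothetical names.

```python
import pandas as pd

rows = [['A', 'B', 'C'], ['A', 'B', 'C'], ['B', 'B', 'C'],
        ['C', 'B', 'C'], ['C', 'B', 'C'], ['C', 'B', 'C']]
df = pd.DataFrame(rows, columns=['one', 'two', 'three'])

min_len = 2  # assumed threshold: keep runs with at least this many rows

# A new run starts wherever the value differs from the previous row
run_id = (df['one'] != df['one'].shift()).cumsum()
# Size of the run that each row belongs to
run_len = df['one'].groupby(run_id).transform('size')

result = df[run_len >= min_len]
print(result)
```

With min_len = 2 this keeps the same rows as the two-direction shift; raising min_len tightens the filter without changing the rest of the code.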

Related

Have two lists of column names and want to do a calculation on the paired columns

I am hoping to create a few new columns for 'data'.
The first created column is a/b, the second c/d, and the third e/f.
col1 is a list of names of the original columns.
The output df should look like this:
a b c d e f res_a res_c res_e
1 2 3 4 2 3 0.5 0.75 2/3
res_a is a divided by b: a = 1, b = 2, therefore res_a = 1/2 = 0.5.
res_c is c divided by d: c = 3, d = 4, therefore res_c = 3/4 = 0.75.
My code looks like this now, but I can't get a/b, c/d, and e/f:
col1 = ['a', 'b', 'c']
col2 = ['d', 'e', 'f']
for col in cols2:
    data[f'res_{col}'] = np.round(data[col1]/ data[col2],decimals=2)
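You could also use pandas.IndexSlice to pick alternate columns with list-slicing syntax:
cix = pd.IndexSlice
# Divide the underlying arrays: the two slices have different column labels,
# so a label-aligned division in recent pandas would produce only NaN
df[['res_a', 'res_c', 'res_e']] = np.divide(df.loc[:, cix['a'::2]].to_numpy(), df.loc[:, cix['b'::2]].to_numpy())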
print(df)
# a b c d e f res_a res_c res_e
# 0 1 2 3 4 2 3 0.5 0.75 0.666667
You can read more about the pandas slicers in the docs
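A self-contained sketch of the slicing idea, assuming a one-row frame with the values from the question. The quotient is computed on NumPy arrays so that it is positional, since the two slices carry different column labels (a, c, e versus b, d, f). The names numer, denom and res are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3], 'd': [4], 'e': [2], 'f': [3]})

numer = df.loc[:, 'a'::2]  # every second column starting at 'a': a, c, e
denom = df.loc[:, 'b'::2]  # every second column starting at 'b': b, d, f

# Positional division on the raw arrays, then label the result columns
res = pd.DataFrame(
    numer.to_numpy() / denom.to_numpy(),
    index=df.index,
    columns=[f'res_{name}' for name in numer.columns],
)
df = df.join(res)
print(df)
```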
Use zip() to loop over two lists in parallel.
cols1 = ['a', 'c', 'e']
cols2 = ['b', 'd', 'f']
for c1, c2 in zip(cols1, cols2):
    data[f'res_{c1}'] = np.round(data[c1] / data[c2], decimals=2)
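For completeness, a self-contained run of the loop on a one-row frame with the question's values. The frame itself is an assumption, since the question does not show how data is built.

```python
import numpy as np
import pandas as pd

# Assumed input: one row with the values quoted in the question
data = pd.DataFrame({'a': [1], 'b': [2], 'c': [3], 'd': [4], 'e': [2], 'f': [3]})

cols1 = ['a', 'c', 'e']  # numerators
cols2 = ['b', 'd', 'f']  # denominators

# zip pairs a with b, c with d, e with f
for c1, c2 in zip(cols1, cols2):
    data[f'res_{c1}'] = np.round(data[c1] / data[c2], decimals=2)

print(data[['res_a', 'res_c', 'res_e']])
```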

Change this row and previous row vectorised in pandas

I have a dataframe in which column 'last' holds the previous row's value of column 'this'. I want to match the column 'this' against the values in a list, e.g. ['b', 'c'], and on such a match change the preceding row's 'this', as well as this row's 'last', to the value 'd'.
For example, I want to change this:
this last
a NaN
b a
a b
c a
a c
Into this:
this last
d NaN
b d
d b
c d
a c
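This is straightforward if iterating, but too slow:
# items() replaces iteritems(), which was removed in pandas 2.0
for i, v in df['this'].items():
    if v in ['b', 'c']:
        # assumes the default RangeIndex, so label i is also a position
        df.loc[df.index[i - 1], 'this'] = 'd'
        df.loc[i, 'last'] = 'd'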
I believe this can be done by assigning df.this.shift(-1) to column 'last', however I'm not sure how to do this when I'm matching values in the list ['b', 'c']. How can I do this without iterating?
df
this last
0 a NaN
1 b a
2 a b
3 c a
4 a c
You can use isin to get a boolean mask of the rows whose 'this' value belongs to the list (l1), and set 'last' to 'd' on those rows. Then shift the mask up by one row to set the 'this' value of each preceding row to 'd'.
l1 = ['b', 'c']
this_in_l1 = df['this'].isin(l1)
df.loc[this_in_l1, 'last'] = 'd'
df.loc[this_in_l1.shift(-1, fill_value=False), 'this'] = 'd'
df
this last
0 d NaN
1 b d
2 d b
3 c d
4 a c
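The same two steps can be wrapped in a small helper. This is a sketch: mark_matches, its parameters, and the choice to work on a copy are assumptions and not part of the answer.

```python
import numpy as np
import pandas as pd

def mark_matches(frame, values, marker='d'):
    """Return a copy where rows whose 'this' is in `values` get `marker` in
    'last', and the row just before each match gets `marker` in 'this'."""
    out = frame.copy()
    hit = out['this'].isin(values)
    out.loc[hit, 'last'] = marker
    # Shift the mask up one row so it points at the row preceding each match
    out.loc[hit.shift(-1, fill_value=False), 'this'] = marker
    return out

df = pd.DataFrame({'this': ['a', 'b', 'a', 'c', 'a'],
                   'last': [np.nan, 'a', 'b', 'a', 'c']})
result = mark_matches(df, ['b', 'c'])
print(result)
```

The mask is computed once, before either column is modified, so writing 'd' into 'this' cannot affect which rows count as matches.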

Function Value with Combination(or Permutation) of Variables and Assign to Dataframe

I have n variables. Suppose n equals 3 in this case. I want to apply one function to all of the combinations(or permutations, depending on how you want to solve this) of variables and store the result in the same row and column in dataframe.
a = 1
b = 2
c = 3
indexes = ['a', 'b', 'c']
df = pd.DataFrame({x:np.nan for x in indexes}, index=indexes)
If I apply sum(the function can be anything), then the result that I want to get is like this:
a b c
a 2 3 4
b 3 4 5
c 4 5 6
I can only think of iterating all the variables, apply the function one by one, and use the index of the iterators to set the value in the dataframe. Is there any better solution?
You can use apply and return a pd.Series for that effect. In such cases, pandas uses the series indices as columns in the resulting dataframe.
s = pd.Series({"a": 1, "b": 2, "c": 3})
s.apply(lambda x: x+s)
Just note that the operation you do is between an element and a series.
I believe you need a broadcast sum of an array created from the variables, if performance is important:
a = 1
b = 2
c = 3
indexes = ['a', 'b', 'c']
arr = np.array([a,b,c])
df = pd.DataFrame(arr + arr[:, None], index=indexes, columns=indexes)
print (df)
a b c
a 2 3 4
b 3 4 5
c 4 5 6
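Broadcasting works when the function is a NumPy operation. For an arbitrary Python function of two values, a nested comprehension over all ordered pairs fills the same table, at the cost of speed. In this sketch, combine is a hypothetical stand-in for the real function.

```python
import pandas as pd

values = {'a': 1, 'b': 2, 'c': 3}

def combine(x, y):
    # Hypothetical pairwise function; replace with the real one
    return x + y

# Outer list iterates over rows, inner list over columns
df = pd.DataFrame(
    [[combine(x, y) for y in values.values()] for x in values.values()],
    index=list(values),
    columns=list(values),
)
print(df)
```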

Merge pandas dataframe with overwrite of columns

What is the quickest way to merge two Python data frames in this manner?
I have two data frames with similar structures (both have a primary key id and some value columns).
What I want to do is merge the two data frames based on id, with values from the second overwriting those of the first. Is there a way to do this with pandas operations? How I've implemented it right now is as coded below:
import pandas as pd
a = pd.DataFrame({'id': [1,2,3], 'letter': ['a', 'b', 'c']})
b = pd.DataFrame({'id': [1,3,4], 'letter': ['A', 'C', 'D']})
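# Key each row's record by its id ('records' is the orient name, 'id' the column label)
a_dict = {e['id']: e for e in a.to_dict('records')}
b_dict = {e['id']: e for e in b.to_dict('records')}
c_dict = a_dict.copy()
# Records from b overwrite records from a that share an id
c_dict.update(b_dict)
c = pd.DataFrame(list(c_dict.values()))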
Here, c would be equivalent to
pd.DataFrame({'id': [1,2,3,4], 'letter':['A','b', 'C', 'D']})
id letter
0 1 A
1 2 b
2 3 C
3 4 D
combine_first
If 'id' is your primary key, then use it as your index.
b.set_index('id').combine_first(a.set_index('id')).reset_index()
id letter
0 1 A
1 2 b
2 3 C
3 4 D
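One behaviour worth keeping in mind: combine_first only falls back to the other frame where the calling frame has a missing value, so a missing value in b does not overwrite the value in a. A minimal sketch, where the None in b is an assumption added for illustration:

```python
import pandas as pd

a = pd.DataFrame({'id': [1, 2, 3], 'letter': ['a', 'b', 'c']})
# Hypothetical variant of b with a missing letter for id 3
b = pd.DataFrame({'id': [1, 3, 4], 'letter': ['A', None, 'D']})

# b takes priority; gaps in b (and ids missing from b) are filled from a
c = b.set_index('id').combine_first(a.set_index('id')).reset_index()
print(c)
```

Here id 3 keeps 'c' from a, because b has no value to overwrite it with.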
merge with groupby
a.merge(b, 'outer', 'id').groupby(lambda x: x.split('_')[0], axis=1).last()
id letter
0 1 A
1 2 b
2 3 C
3 4 D
One way is as follows:
append dataframe a to dataframe b
drop duplicates based on id
sort values on remaining by id
reset index and drop older index
You can try:
import pandas as pd
a = pd.DataFrame({'id': [1,2,3], 'letter': ['a', 'b', 'c']})
b = pd.DataFrame({'id': [1,3,4], 'letter': ['A', 'C', 'D']})
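# DataFrame.append was removed in pandas 2.0; pd.concat([b, a]) stacks the rows of a below those of b
c = pd.concat([b, a]).drop_duplicates(subset='id').sort_values('id').reset_index(drop=True)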
print(c)
Try this
c = pd.concat([a, b], axis=0).sort_values('letter').drop_duplicates('id', keep='first').sort_values('id')
c.reset_index(drop=True, inplace=True)
print(c)
id letter
0 1 A
1 2 b
2 3 C
3 4 D
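Sorting on 'letter' works here because uppercase letters sort before lowercase ones. A sketch that instead relies only on concatenation order, so rows from b win regardless of their values (this variant is an alternative, not one of the answers above):

```python
import pandas as pd

a = pd.DataFrame({'id': [1, 2, 3], 'letter': ['a', 'b', 'c']})
b = pd.DataFrame({'id': [1, 3, 4], 'letter': ['A', 'C', 'D']})

c = (
    pd.concat([a, b])                              # rows of b come after rows of a
      .drop_duplicates(subset='id', keep='last')   # for a repeated id, keep b's row
      .sort_values('id')
      .reset_index(drop=True)
)
print(c)
```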

How to Access Element of Pandas Series that is a List

I have a DataFrame series that contains a list of strings in each row. I'd like to create another series holding the last string in each row's list.
So one row may have a list e.g
['a', 'b', 'c', 'd']
I'd like to create another pandas series made up of the last element of each row's list, normally accessed with a -1 index, in this case 'd'. The lists for each observation (i.e. row) are of varying length. How can this be done?
I believe you need indexing with str, which works with all iterables:
df = pd.DataFrame({'col':[['a', 'b', 'c', 'd'],['a', 'b'],['a'], []]})
df['last'] = df['col'].str[-1]
print (df)
col last
0 [a, b, c, d] d
1 [a, b] b
2 [a] a
3 [] NaN
strings are iterables too:
df = pd.DataFrame({'col':['abcd','ab','a', '']})
df['last'] = df['col'].str[-1]
print (df)
col last
0 abcd d
1 ab b
2 a a
3 NaN
Why not turn the list column into a separate info dataframe? You can then use the index to join it back:
Infodf=pd.DataFrame(df.col.values.tolist(),index=df.index)
Infodf
Out[494]:
0 1 2 3
0 a b c d
1 a b None None
2 a None None None
3 None None None None
I think I overlooked the question, and both PiR and Jez provided valuable suggestions that helped me reach the final result.
# axis must be passed by keyword in recent pandas
Infodf.ffill(axis=1).iloc[:,-1]
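Put together, the wide-frame approach can be run as follows. This is a sketch assuming the same four example lists as in the first answer; wide and last are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'col': [['a', 'b', 'c', 'd'], ['a', 'b'], ['a'], []]})

# One column per list position; shorter lists are padded with missing values
wide = pd.DataFrame(df['col'].tolist(), index=df.index)

# Forward-fill along each row so the last column holds the last real element
last = wide.ffill(axis=1).iloc[:, -1]
print(last)
```

The row built from the empty list has nothing to fill from, so its result stays missing.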
