I have a dataframe where column 'last' holds the previous row's value of column 'this'. I want to match column 'this' against a list of values, e.g. ['b', 'c'], and on a match set the preceding row's 'this', as well as the matching row's 'last', to the value 'd'.
For example, I want to change this:
  this last
0    a  NaN
1    b    a
2    a    b
3    c    a
4    a    c
Into this:
  this last
0    d  NaN
1    b    d
2    d    b
3    c    d
4    a    c
This is straightforward if iterating, but too slow:
for i, v in df['this'].iteritems():
    if v in ['b', 'c']:
        df['this'].iloc[i - 1] = 'd'
        df['last'].iloc[i] = 'd'
I believe this can be done by assigning df.this.shift(-1) to column 'last', however I'm not sure how to do this when I'm matching values in the list ['b', 'c']. How can I do this without iterating?
df
this last
0 a NaN
1 b a
2 a b
3 c a
4 a c
You can use isin to get a boolean index of the rows whose 'this' value belongs to the list (l1), and populate the corresponding 'last' values with 'd'. Shifting that boolean index up one row then marks each preceding row, whose 'this' value is set to 'd':
l1 = ['b', 'c']
this_in_l1 = df['this'].isin(l1)
df.loc[this_in_l1, 'last'] = 'd'
df.loc[this_in_l1.shift(-1, fill_value=False), 'this'] = 'd'
df
this last
0 d NaN
1 b d
2 d b
3 c d
4 a c
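For reference, the same update can also be written with numpy.where; this is just an equivalent sketch of the boolean-mask approach above, assuming the original unmodified df and the same list l1:

import numpy as np

l1 = ['b', 'c']
mask = df['this'].isin(l1)

# rows whose 'this' is in the list get 'last' set to 'd'
df['last'] = np.where(mask, 'd', df['last'])
# rows preceding a match (mask shifted up one row) get 'this' set to 'd'
df['this'] = np.where(mask.shift(-1, fill_value=False), 'd', df['this'])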
Related
I have a large dataset called pop and want to return the only 2 rows that have the same value in column 'J'. I do not know what rows have the same value and do not know what the common value is... I want to return these two rows.
Without knowing the common value, this code is not helpful:
pop.loc[pop['X'] == some_value]
I tried this but it returned the entire dataset:
pop.query('X' == 'X')
Any input is appreciated...
You can use .value_counts() and take the first index element; the counts are sorted in descending order, so the first index is the most common value.
I'll use some dummy data here:
In [2]: df = pd.DataFrame(['a', 'b', 'c', 'd', 'b', 'f'], columns=['X'])
In [3]: df
Out[3]:
X
0 a
1 b
2 c
3 d
4 b
5 f
In [4]: wanted_value = df['X'].value_counts().index[0]
In [5]: wanted_value
Out[5]: 'b'
In [6]: df[df['X'] == wanted_value]
Out[6]:
X
1 b
4 b
For reference, df['X'].value_counts() is:
b 2
a 1
c 1
d 1
f 1
Name: X, dtype: int64
Thanks, I figured out another way that seemed a bit easier...
pop['X'].value_counts()
The top entry was 21 with a count of 2, indicating 21 was the duplicated value; all remaining values had a count of 1, i.e. no duplicates.
pop.loc[pop['X'] == 21]
returned the 2 rows with the duplicated value in column X.
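If the goal is just to return the rows that share a value, without first inspecting the counts, duplicated with keep=False also does this in one step; a minimal sketch on the dummy data from above (column name 'X' assumed, as in the code shown):

import pandas as pd

df = pd.DataFrame({'X': ['a', 'b', 'c', 'd', 'b', 'f']})

# keep=False marks every occurrence of a repeated value, not just the later ones
print(df[df['X'].duplicated(keep=False)])
#    X
# 1  b
# 4  b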
I would like to compare consecutive rows in column 'one' and delete rows based on this condition:
if 2 or more consecutive rows are the same, keep them;
if a row is different from both the previous and the next row, delete it.
Example df:
a = [['A', 'B', 'C'], ['A', 'B', 'C'], ['B', 'B', 'C'],['C', 'B', 'C'],['C', 'B', 'C'],['C', 'B', 'C']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
print output would be:
  one two three
0   A   B     C
1   A   B     C
2   B   B     C
3   C   B     C
4   C   B     C
5   C   B     C
Expected output would be:
  one two three
0   A   B     C
1   A   B     C
3   C   B     C
4   C   B     C
5   C   B     C
So the line from index 2 will be deleted.
I've tried using shift, but I'm stuck: the way I'm doing it now, it also deletes the first and last rows. Can someone please tell me a better way of doing this? Or maybe how to apply shift but ignore the first and last rows?
# First I take only the 'one' column
df = df['one']
# Then apply shift
df.loc[df.shift(-1) == df]
With the above code I get this, which is not correct because it also deletes the first and last row:
0 A
3 C
4 C
Try shifting up and down:
mask = (df.one == df.one.shift(-1)) | (df.one == df.one.shift(1))
adj_df = df[mask]
You could use shift in both directions (and you need an all condition to check that all the columns are the same):
df[(df.shift(1) == df).all(axis=1) | (df.shift(-1) == df).all(axis=1)]
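As a quick check, applying the mask to the example DataFrame from the question keeps every row except index 2; a sketch assuming that same df:

import pandas as pd

a = [['A', 'B', 'C'], ['A', 'B', 'C'], ['B', 'B', 'C'],
     ['C', 'B', 'C'], ['C', 'B', 'C'], ['C', 'B', 'C']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

# a row survives if it equals its neighbour above or its neighbour below
mask = (df.one == df.one.shift(-1)) | (df.one == df.one.shift(1))
print(df[mask])
#   one two three
# 0   A   B     C
# 1   A   B     C
# 3   C   B     C
# 4   C   B     C
# 5   C   B     C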
I have a list of lists and a dataframe df:
test_list = [['A', 'B', 'C'], ['A', 'B', 'D'], ['A', 'B', 'E'], ['F', 'G']]
and the dataframe is:
  ID
0  B
1  C
2  D
3  E
The elements of the list of lists represent a hierarchy. I want to create a new column in the dataframe whose value is each element's parent.
My final DataFrame should look like this:
value parent
B A
C B
D B
E B
I have a very large dataset, and test_list is also very large.
As per my comments on using a dictionary, here's the code.
import pandas as pd

test_list = [["A", "B", "C"], ["A", "B", "D"], ["A", "B", "E"], ["F", "G"]]

parents = {}
for sublist in test_list:
    for n, elem in enumerate(sublist):
        if n != 0:
            parents[elem] = prev
        prev = elem

df = pd.DataFrame([list(parents.keys()), list(parents.values())]).T
df.columns = ['element', 'parent']
df.set_index('element', inplace=True)
print(df)
giving the following output.
parent
element
B A
C B
D B
E B
G F
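To attach the parent back onto the question's frame with its ID column, Series.map over the same lookup dictionary is a natural follow-up; a short sketch, assuming the dictionary built above (parents) and using df_ids as an illustrative name for the question's frame:

df_ids = pd.DataFrame({'ID': ['B', 'C', 'D', 'E']})

# map each ID to its parent; any ID missing from the dictionary would become NaN
df_ids['parent'] = df_ids['ID'].map(parents)
print(df_ids)
#   ID parent
# 0  B      A
# 1  C      B
# 2  D      B
# 3  E      B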
You could use a dictionary. Here is a working example:
df = pd.DataFrame({'ID': ['B', 'C', 'D', 'E']})
test_list = [['A', 'B', 'C'], ['A', 'B', 'D'], ['A', 'B', 'E'], ['F', 'G']]

parent = {}
for element in test_list:
    for i in range(len(element) - 1):
        parent[element[i + 1]] = element[i]

df['parent'] = [parent[x] for x in df['ID']]
In [1]: print(df)
  ID parent
0  B      A
1  C      B
2  D      B
3  E      B
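One note on this design: the list comprehension raises a KeyError if an ID has no parent in test_list (for example a root such as 'A'). If that can happen in the large dataset, Series.map degrades more gracefully by leaving NaN; a hedged sketch, with df_extra as a purely illustrative frame:

df_extra = pd.DataFrame({'ID': ['A', 'B', 'C']})   # 'A' is a root with no parent

df_extra['parent'] = df_extra['ID'].map(parent)    # 'A' -> NaN instead of a KeyError
print(df_extra)
#   ID parent
# 0  A    NaN
# 1  B      A
# 2  C      B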
I have a DataFrame series in which each row contains a list of strings. I'd like to create another series holding the last string of each row's list.
So one row may have a list, e.g.
['a', 'b', 'c', 'd']
I'd like to create another pandas series made up of the last element of each row's list, normally accessed with a -1 index, in this case 'd'. The lists vary in length from row to row. How can this be done?
I believe you need indexing with str; it works with all iterables:
df = pd.DataFrame({'col':[['a', 'b', 'c', 'd'],['a', 'b'],['a'], []]})
df['last'] = df['col'].str[-1]
print (df)
col last
0 [a, b, c, d] d
1 [a, b] b
2 [a] a
3 [] NaN
strings are iterables too:
df = pd.DataFrame({'col':['abcd','ab','a', '']})
df['last'] = df['col'].str[-1]
print (df)
col last
0 abcd d
1 ab b
2 a a
3 NaN
Why not turn the list column into an info dataframe? Then you can use the index for a join:
Infodf=pd.DataFrame(df.col.values.tolist(),index=df.index)
Infodf
Out[494]:
0 1 2 3
0 a b c d
1 a b None None
2 a None None None
3 None None None None
I think I overlooked the question, and both PiR and Jez provided valuable suggestions to help me achieve the final result.
Infodf.ffill(axis=1).iloc[:, -1]
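For completeness, a small sketch of what that forward-fill trick returns for the widened frame built above (assuming the same Infodf):

# forward-fill across the columns, then take the last column:
# each row's last non-null value, i.e. the last element of the original list
last = Infodf.ffill(axis=1).iloc[:, -1]
print(last)   # d, b, a, and None for the empty list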
I want to, at the same time, create a new column in a pandas dataframe and set its first value to a list.
I want to transform this dataframe
df = pd.DataFrame.from_dict({'a':[1,2],'b':[3,4]})
a b
0 1 3
1 2 4
into this one
a b c
0 1 3 [2,3]
1 2 4 NaN
I tried:
df.loc[0, 'c'] = [2,3]
df.loc[0, 'c'] = np.array([2,3])
df.loc[0, 'c'] = [[2,3]]
df.at[0,'c'] = [2,3]
df.at[0,'d'] = [[2,3]]
It does not work.
How should I proceed?
If the first element of a series is a list, then the series must be of type object (not the most efficient for numerical computations). This should work, however.
df = df.assign(c=None)
df.loc[0, 'c'] = [2, 3]
>>> df
a b c
0 1 3 [2, 3]
1 2 4 None
If you really need the remaining values of column c to be NaNs instead of None, use this:
df.loc[1:, 'c'] = np.nan
The problem seems to have something to do with the dtype of the c column. If you convert it to 'object', you can use iat, loc, or set_value to set a cell to a list.
df2 = (
    df.assign(c=np.nan)
      .assign(c=lambda x: x.c.astype(object))
)
df2.set_value(0, 'c', [2, 3])
Out[86]:
a b c
0 1 3 [2, 3]
1 2 4 NaN
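A caveat for newer pandas: DataFrame.set_value was removed in pandas 1.0, so on recent versions the same idea can be written with .at once the column exists with object dtype; a minimal sketch under that assumption:

import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({'a': [1, 2], 'b': [3, 4]})

# create the column as object dtype so a single cell can hold a list
df['c'] = pd.Series([np.nan] * len(df), index=df.index, dtype=object)
df.at[0, 'c'] = [2, 3]
print(df)
#    a  b       c
# 0  1  3  [2, 3]
# 1  2  4     NaN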