Comparison of two columns - Python

How can I find the same values in the columns regardless of their position?
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': ['A', 'B', 'C', 'D', 'E', np.nan, 'H'],
                   'two': ['B', 'E', 'C', np.nan, np.nan, 'H', 'L']})
The result I want to get:
three
0 B
1 E
2 C
3 H

The exact logic is unclear; you can try:
out = pd.DataFrame({'three': sorted(set(df['one'].dropna())
                                    & set(df['two'].dropna()))})
output:
  three
0     B
1     C
2     E
3     H
Or maybe you want to keep the items (and order) of column two?
out = (df.loc[df['two'].isin(df['one'].dropna()), 'two']
         .to_frame(name='three')
       )
output:
  three
0     B
1     E
2     C
5     H
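If you want the fresh 0..3 index shown in the question (the rows above keep their original positions), you can reset it afterwards; a small optional step, not part of the original answer:
out = out.reset_index(drop=True)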

Try this (note that a set has no defined order, so the row order may vary; pandas also refuses a raw set as data, hence the list conversion):
out = pd.DataFrame(list(set(df['one']).intersection(df['two'])),
                   columns=['Three']).dropna()
print(out)
Output:
  Three
1     C
2     H
3     E
4     B

Related

Concatenate/combine two columns into one for Pandas dataframe when axis = 0

I have a dataframe:
import numpy as np
import pandas as pd

d_test = {
    'c1': ['a', 'b', np.nan, 'c'],
    'c2': ['d', np.nan, 'e', 'f'],
    'test': [1, 2, 3, 4],
}
df_test = pd.DataFrame(d_test)
And I want to concatenate columns c1 and c2 into one and get the following resulting dataframe:
a 1
b 2
c 4
d 1
e 3
f 4
I tried to use
pd.concat([df_test.c1, df_test.c2], axis=0)
to generate such a column, but have no idea how to keep the 'test' column as well during concatenation.
Use melt:
df_test.melt('test').dropna()[['value', 'test']]
result:
  value  test
0     a     1
1     b     2
3     c     4
4     d     1
6     e     3
7     f     4
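If you'd rather stay with pd.concat as in the question, one option (a sketch, not the only way) is to stack each value column alongside 'test' and concatenate the parts:
parts = [df_test[[col, 'test']].rename(columns={col: 'value'})
         for col in ['c1', 'c2']]
out = pd.concat(parts, axis=0).dropna(subset=['value'])
print(out)
which should print something like:
  value  test
0     a     1
1     b     2
3     c     4
0     d     1
2     e     3
3     f     4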

How to add padded rows of 0 to a pandas dataframe?

I have a df in the following form
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 1, 2, 2, 3, 3, 4],
                   'col2': ['a', 'b', 'c', 'a', 'b', 'a', 'b', 'a'],
                   'col3': ['x', 'y', 'z', 'p', 'q', 'r', 's', 't']})
   col1 col2 col3
0     1    a    x
1     1    b    y
2     1    c    z
3     2    a    p
4     2    b    q
5     3    a    r
6     3    b    s
7     4    a    t
df2 = df.groupby(['col1', 'col2'])['col3'].sum()
df2
col1  col2
1     a    x
      b    y
      c    z
2     a    p
      b    q
3     a    r
      b    s
4     a    t
Name: col3, dtype: object
Now I want to add padded 0 rows for each col1 group where a, b, c, or d is missing, so the expected output should be:
col1  col2
1     a    x
      b    y
      c    z
      d    0
2     a    p
      b    q
      c    0
      d    0
3     a    r
      b    s
      c    0
      d    0
4     a    t
      b    0
      c    0
      d    0
Use unstack + reindex + stack:
out = (
    df2.unstack(fill_value=0)
       .reindex(columns=['a', 'b', 'c', 'd'], fill_value=0)
       .stack()
)
out:
col1  col2
1     a    x
      b    y
      c    z
      d    0
2     a    p
      b    q
      c    0
      d    0
3     a    r
      b    s
      c    0
      d    0
4     a    t
      b    0
      c    0
      d    0
dtype: object
Complete Working Example:
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 1, 1, 2, 2, 3, 3, 4],
    'col2': ['a', 'b', 'c', 'a', 'b', 'a', 'b', 'a'],
    'col3': ['x', 'y', 'z', 'p', 'q', 'r', 's', 't']
})
df2 = df.groupby(['col1', 'col2'])['col3'].sum()
out = (
    df2.unstack(fill_value=0)
       .reindex(columns=['a', 'b', 'c', 'd'], fill_value=0)
       .stack()
)
print(out)
Here's another way using pd.MultiIndex.from_product, then reindex:
mindx = pd.MultiIndex.from_product([df2.index.levels[0], [*'abcd']])
df2.reindex(mindx, fill_value=0)
Output:
col1
1     a    x
      b    y
      c    z
      d    0
2     a    p
      b    q
      c    0
      d    0
3     a    r
      b    s
      c    0
      d    0
4     a    t
      b    0
      c    0
      d    0
Name: col3, dtype: object
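If you also want the second index level named (the output above keeps only the col1 name), pd.MultiIndex.from_product accepts a names argument; an optional tweak:
mindx = pd.MultiIndex.from_product([df2.index.levels[0], [*'abcd']],
                                   names=['col1', 'col2'])
df2.reindex(mindx, fill_value=0)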

How to transform dataframe to from-to pairs?

If I have a dataframe:
>>> import pandas as pd
>>> df = pd.DataFrame([
... ['A', 'B', 'C', 'D'],
... ['E', 'B', 'C']
... ])
>>> df
   0  1  2     3
0  A  B  C     D
1  E  B  C  None
>>>
I should transform the dataframe to a two-column format:
x, y
-----
A, B
B, C
C, D
E, B
B, C
For each row, from left to right, take two neighboring values and make a pair of them.
It is a kind of from-to relation if you consider each row as a path.
How do I do the transformation?
We can do explode with zip:
s = pd.DataFrame(df.apply(lambda x: list(zip(x.dropna()[:-1], x.dropna()[1:])),
                          axis=1).explode().tolist())
Out[336]:
   0  1
0  A  B
1  B  C
2  C  D
3  E  B
4  B  C
Update
s = df.apply(lambda x: list(zip(x.dropna()[:-1], x.dropna()[1:])), axis=1).explode()
s = pd.DataFrame(s.tolist(), index=s.index)
s
Out[340]:
   0  1
0  A  B
0  B  C
0  C  D
1  E  B
1  B  C
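If you prefer the 0..n-1 index of the first variant, the original row labels kept by this version can be dropped afterwards:
s = s.reset_index(drop=True)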
Pre-preparing the data could help too:
import pandas as pd

inp = [['A', 'B', 'C', 'D'],
       ['E', 'B', 'C']]

# Convert beforehand: build the consecutive pairs for each row, then flatten
inp2 = [[[i[k], i[k + 1]] for k in range(len(i) - 1)] for i in inp]
inp2 = inp2[0] + inp2[1]

df = pd.DataFrame(inp2)
print(df)
Output:
   0  1
0  A  B
1  B  C
2  C  D
3  E  B
4  B  C
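On Python 3.10+ you could also build the pairs with itertools.pairwise before constructing the DataFrame. A minimal sketch, starting from the question's original df:
from itertools import pairwise

pairs = [list(p)
         for _, row in df.iterrows()
         for p in pairwise(row.dropna())]
out = pd.DataFrame(pairs, columns=['x', 'y'])
print(out)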

Latest values based on time column

I have mydf below, which I have sorted on a dummy time column and the id:
mydf = pd.DataFrame(
    {
        'id': ['A', 'B', 'B', 'C', 'A', 'C', 'A'],
        'time': [1, 4, 3, 5, 2, 6, 7],
        'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g']
    }
).sort_values(['id', 'time'], ascending=False)
mydf
  id  time val
5  C     6   f
3  C     5   d
1  B     4   b
2  B     3   c
6  A     7   g
4  A     2   e
0  A     1   a
I want to add a column (last_val) which, for each unique id, holds the previous val based on the time column. Entries for which there is no last_val can be dropped. The output in this example would look like:
mydf
  id  time val last_val
5  C     6   f        d
1  B     4   b        c
6  A     7   g        e
4  A     2   e        a
Any ideas?
Use DataFrameGroupBy.shift after sort_values(['id', 'time'], ascending=False) (already done in the question), and then remove rows with missing values with DataFrame.dropna:
mydf['last_val'] = mydf.groupby('id')['val'].shift(-1)
mydf = mydf.dropna(subset=['last_val'])
A similar solution, which instead removes the last row of each id with duplicated:
mydf['last_val'] = mydf.groupby('id')['val'].shift(-1)
mydf = mydf[mydf['id'].duplicated(keep='last')]
print (mydf)
  id  time val last_val
5  C     6   f        d
1  B     4   b        c
6  A     7   g        e
4  A     2   e        a
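Complete working example (a self-contained version of the first solution):
import pandas as pd

mydf = pd.DataFrame(
    {
        'id': ['A', 'B', 'B', 'C', 'A', 'C', 'A'],
        'time': [1, 4, 3, 5, 2, 6, 7],
        'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g']
    }
).sort_values(['id', 'time'], ascending=False)

# Within each id (already sorted by time descending), the previous val
# sits one row below, so shift(-1) pulls it up; rows without one are dropped.
mydf['last_val'] = mydf.groupby('id')['val'].shift(-1)
mydf = mydf.dropna(subset=['last_val'])
print(mydf)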

How do I use groupby on continuous similar values for a pandas dataframe?

Let's say we have a dataframe called df:
A  B
1  a
1  b
1  c
2  d
2  e
1  f
1  g
I'd like to use groupby to create the following :
1: [a,b,c]
2: [d,e]
1: [f,g]
Currently, if I use something along the lines of
{k: list(v) for k,v in df.groupby("A")["B"]}
I get
1: [a,b,c,f,g]
2: [d,e]
I'd like the separation to be based on the data being both similar and consecutive.
You can group by a helper Series: compare column A with its shifted self (shift) and take the cumulative sum (cumsum), so each run of equal consecutive values gets its own number:
print (df["A"].ne(df["A"].shift()).cumsum())
0    1
1    1
2    1
3    2
4    2
5    3
6    3
Name: A, dtype: int32
df = df["B"].groupby(df["A"].ne(df["A"].shift()).cumsum()).apply(list).reset_index()
print (df)
   A          B
0  1  [a, b, c]
1  2     [d, e]
2  3     [f, g]
For dict:
d = {k: list(v) for k, v in df['B'].groupby(df["A"].ne(df["A"].shift()).cumsum())}
print (d)
{1: ['a', 'b', 'c'], 2: ['d', 'e'], 3: ['f', 'g']}
d = df["B"].groupby(df["A"].ne(df["A"].shift()).cumsum()).apply(list).to_dict()
print (d)
{1: ['a', 'b', 'c'], 2: ['d', 'e'], 3: ['f', 'g']}
EDIT1:
s = df["B"].groupby([df['A'], df["A"].ne(df["A"].shift()).cumsum()]).apply(list)
d = s.groupby(level=0).apply(lambda x: x.tolist() if len(x) > 1 else x.iat[0]).to_dict()
print (d)
{1: [['a', 'b', 'c'], ['f', 'g']], 2: ['d', 'e']}
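Note that a plain dict cannot hold the key 1 twice, which is why the desired "1: [a,b,c] ... 1: [f,g]" output needs nesting or another container. If you'd rather keep the duplicate keys in order, a list of (key, values) pairs is one option; a minimal sketch using the same cumsum trick:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 1, 1],
                   'B': list('abcdefg')})

groups = df['A'].ne(df['A'].shift()).cumsum()
runs = [(sub['A'].iat[0], list(sub['B'])) for _, sub in df.groupby(groups)]
print(runs)   # [(1, ['a', 'b', 'c']), (2, ['d', 'e']), (1, ['f', 'g'])]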
