How to add padded rows of 0 to a pandas dataframe? - python

I have a df in the following form
import pandas as pd
df = pd.DataFrame({'col1' : [1,1,1,2,2,3,3,4],
'col2' : ['a', 'b', 'c', 'a', 'b', 'a', 'b', 'a'],
'col3' : ['x', 'y', 'z', 'p','q','r','s','t']
})
col1 col2 col3
0 1 a x
1 1 b y
2 1 c z
3 2 a p
4 2 b q
5 3 a r
6 3 b s
7 4 a t
df2 = df.groupby(['col1','col2'])['col3'].sum()
df2
col1 col2
1 a x
b y
c z
2 a p
b q
3 a r
b s
4 a t
Now I want to add padded 0 rows to each of col1 index where a , b, c, d is missing , so expected output should be
col1 col2
1 a x
b y
c z
d 0
2 a p
b q
c 0
d 0
3 a r
b s
c 0
d 0
4 a t
b 0
c 0
d 0

Use unstack + reindex + stack:
out = (
df2.unstack(fill_value=0)
.reindex(columns=['a', 'b', 'c', 'd'], fill_value=0)
.stack()
)
out:
col1 col2
1 a x
b y
c z
d 0
2 a p
b q
c 0
d 0
3 a r
b s
c 0
d 0
4 a t
b 0
c 0
d 0
dtype: object
Complete Working Example:
import pandas as pd
df = pd.DataFrame({
'col1': [1, 1, 1, 2, 2, 3, 3, 4],
'col2': ['a', 'b', 'c', 'a', 'b', 'a', 'b', 'a'],
'col3': ['x', 'y', 'z', 'p', 'q', 'r', 's', 't']
})
df2 = df.groupby(['col1', 'col2'])['col3'].sum()
out = (
df2.unstack(fill_value=0)
.reindex(columns=['a', 'b', 'c', 'd'], fill_value=0)
.stack()
)
print(out)

Here's another way using pd.MultiIndex.from_product, then reindex:
mindx = pd.MultiIndex.from_product([df2.index.levels[0], [*'abcd']])
df2.reindex(mindx, fill_value=0)
Output:
col1
1 a x
b y
c z
d 0
2 a p
b q
c 0
d 0
3 a r
b s
c 0
d 0
4 a t
b 0
c 0
d 0
Name: col3, dtype: object

Related

Comparison of two columns

How can I find the same values in the columns regardless of their position?
df = pd.DataFrame({'one':['A','B', 'C', 'D', 'E', np.nan, 'H'],
'two':['B', 'E', 'C', np.nan, np.nan, 'H', 'L']})
The result I want to get:
three
0 B
1 E
2 C
3 H
The exact logic is unclear, you can try:
out = pd.DataFrame({'three': sorted(set(df['one'].dropna())
&set(df['two'].dropna()))})
output:
three
0 B
1 C
2 E
3 H
Or maybe you want to keep the items of col two?
out = (df.loc[df['two'].isin(df['one'].dropna()), 'two']
.to_frame(name='three')
)
output:
three
0 B
1 E
2 C
5 H
Try this:
df = pd.DataFrame(set(df['one']).intersection(df['two']), columns=['Three']).dropna()
print(df)
Output:
Three
1 C
2 H
3 E
4 B

How to create multi-relational edge-list from pandas dataframe?

I have a pandas data frame like this:
from itertools import *
from pandas as pd
d = {'col1': ['a', 'b','c','d','a','b','d'], 'col2': ['XX','XX','XY','XX','YY','YY','XY']}
df_rel = pd.DataFrame(data=d)
df_rel
col1 col2
0 a XX
1 b XX
2 c XY
3 d XX
4 a YY
5 b YY
6 d XY
The unique nodes are:
uniq_nodes = df_rel['col1'].unique()
uniq_nodes
array(['a', 'b', 'c', 'd'], dtype=object)
For each Relationship the source (Src) and destination (Dst) can be generated:
df1 = pd.DataFrame(
data=list(combinations(uniq_nodes, 2)),
columns=['Src', 'Dst'])
df1
Src Dst
0 a b
1 a c
2 a d
3 b c
4 b d
5 c d
I need the new dataframe newdf based on the shared elements in col2 of df_rel. The Relationship column comes from the col2. Thus the desire dataframe with edgelist will be:
newdf
Src Dst Relationship
0 a b XX
1 a b YY
2 a d XX
3 c d XY
Is there any fastest way to achieve this? The original dataframe has 30,000 rows.
I took this approach. It works but still not very fast for the large dataframe.
from itertools import *
from pandas as pd
d = {'col1': ['a', 'b','c','d','a','b','d'], 'col2': ['XX','XX','XY','XX','YY','YY','XY']}
df_rel = pd.DataFrame(data=d)
df_rel
col1 col2
0 a XX
1 b XX
2 c XY
3 d XX
4 a YY
5 b YY
6 d XY
uniq_nodes = df_rel['col1'].unique()
uniq_nodes
array(['a', 'b', 'c', 'd'], dtype=object)
df1 = pd.DataFrame(
data=list(combinations(unique_nodes, 2)),
columns=['Src', 'Dst'])
filter1 = df_rel['col1'].isin(df1['Src'])
src_df = df_rel[filter1]
src_df.rename(columns={'col1':'Src'}, inplace=True)
filter2 = df_rel['col1'].isin(df1['Dst'])
dst_df = df_rel[filter2]
dst_df.rename(columns={'col1':'Dst'}, inplace=True)
new_df = pd.merge(src_df,dst_df, on = "col2",how="inner")
print ("after removing the duplicates")
new_df = new_df.drop_duplicates()
print(new_df.shape)
print ("after removing self loop")
new_df = new_df[new_df['Src'] != new_df['Dst']]
new_df = new_df[new_df['Src'] != new_df['Dst']]
new_df.rename(columns={'col2':'Relationship'}, inplace=True)
print(new_df.shape)
print (new_df)
Src Relationship Dst
0 a XX b
1 a XX d
3 b XX d
5 c XY d
6 a YY b
You need to loop through your df1 rows, and find the rows from df_rel that matches the df1['Src'] and df1['Dst'] columns. Once you have the df1['col2'] values of Src and Dst, compare them and if they match create a row in newdf. Try this - check if it performs for large datasets
Data setup (same as yours):
d = {'col1': ['a', 'b', 'c', 'd', 'a', 'b', 'd'], 'col2': ['XX', 'XX', 'XY', 'XX', 'YY', 'YY', 'XY']}
df_rel = pd.DataFrame(data=d)
uniq_nodes = df_rel['col1'].unique()
df1 = pd.DataFrame(data=list(combinations(uniq_nodes, 2)), columns=['Src', 'Dst'])
Code:
newdf = pd.DataFrame(columns=['Src','Dst','Relationship'])
for i, row in df1.iterrows():
src = (df_rel[df_rel['col1'] == row['Src']]['col2']).to_list()
dst = (df_rel[df_rel['col1'] == row['Dst']]['col2']).to_list()
for x in src:
if x in dst:
newdf = newdf.append(pd.Series({'Src': row['Src'], 'Dst': row['Dst'], 'Relationship': x}),
ignore_index=True, sort=False)
print(newdf)
Result:
Src Dst Relationship
0 a b XX
1 a b YY
2 a d XX
3 b d XX
4 c d XY

How to transform dataframe to from-to pairs?

If I have a dataframe:
>>> import pandas as pd
>>> df = pd.DataFrame([
... ['A', 'B', 'C', 'D'],
... ['E', 'B', 'C']
... ])
>>> df
0 1 2 3
0 A B C D
1 E B C None
>>>
I shoudl transform the dataframe to two columns format:
x, y
-----
A, B
B, C
C, D
E, B
B, C
For each row, from left to right, take two neighbor values and make a pair of it.
It is kind of from-to if you consider each row as a path.
How to do the transformation?
We can do explode with zip
s=pd.DataFrame(df.apply(lambda x : list(zip(x.dropna()[:-1],x.dropna()[1:])),axis=1).explode().tolist())
Out[336]:
0 1
0 A B
1 B C
2 C D
3 E B
4 B C
Update
s=df.apply(lambda x : list(zip(x.dropna()[:-1],x.dropna()[1:])),axis=1).explode()
s=pd.DataFrame(s.tolist(),index=s.index)
s
Out[340]:
0 1
0 A B
0 B C
0 C D
1 E B
1 B C
Pre-preparing the data could help too:
import pandas as pd
inp = [['A', 'B', 'C', 'D'],
['E', 'B', 'C']]
# Convert beforehand
inp2 = [[[i[k], i[k+1]] for k in range(len(i)-1)] for i in inp]
inp2 = inp2[0] + inp2[1]
df = pd.DataFrame(inp2)
print(df)
Output:
0 1
0 A B
1 B C
2 C D
3 E B
4 B C

Latest values based on time column

I have mydf below, which I have sorted on a dummy time column and the id:
mydf = pd.DataFrame(
{
'id': ['A', 'B', 'B', 'C', 'A', 'C', 'A'],
'time': [1, 4, 3, 5, 2, 6, 7],
'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g']
}
).sort_values(['id', 'time'], ascending=False)
mydf
id time val
5 C 6 f
3 C 5 d
1 B 4 b
2 B 3 c
6 A 7 g
4 A 2 e
0 A 1 a
I want to add a column (last_val) which, for each unique id, holds the latest val based on the time column. Entries for which there is no last_val can be dropped. The output in this example would look like:
mydf
id time val last_val
5 C 6 f d
1 B 4 b c
6 A 7 g e
4 A 2 e a
Any ideas?
Use DataFrameGroupBy.shift after sort_values(['id', 'time'], ascending=False) (already in question) and then remove rows with missing values by DataFrame.dropna:
mydf['last_val'] = mydf.groupby('id')['val'].shift(-1)
mydf = mydf.dropna(subset=['last_val'])
Similar solution, only removed last duplicated rows by id column:
mydf['last_val'] = mydf.groupby('id')['val'].shift(-1)
mydf = mydf[mydf['id'].duplicated(keep='last')]
print (mydf)
id time val last_val
5 C 6 f d
1 B 4 b c
6 A 7 g e
4 A 2 e a

Python/Pandas: Selecting dataframe rows with multiple conditions - col 1 matches col 2 OR col 3

If I have a df like so:
dfdict = {'1': ['a', 'a', 'a', 'b'], '2': ['a', 'b', 'c', 'a'], '3': ['b', 'a', 'd', 'c']}
df1 = pd.DataFrame(dfdict)
1 2 3
0 a a b
1 a b a
2 a c d
3 b a c
I want to save only the rows where col 1 matches 2 OR 1 matches 3. In this case, rows 0 and 1 would be saved:
1 2 3
0 a a b
1 a b a
I tried:
df2 = df1.loc[df1['1'] == df1['2'] & df1['1'] == df1['3']]
but I get error TypeError: unsupported operand type(s) for &: 'str' and 'bool'.
I would also like to get the other lines where col 1 does NOT match 2 OR 3, i.e. rows 2 and 3, in a separate df.
Option 1
eq, fixing your code,
df1[df1['1'].eq(df1['2']) | df1['1'].eq(df1['3'])]
1 2 3
0 a a b
1 a b a
Option 2
np.vectorize
f = np.vectorize(lambda x, y, z: x in (y, z))
df[f(df1['1'], df1['2'], df1['3'])]
1 2 3
0 a a b
1 a b a

Categories

Resources