I have a pandas data frame like this:
from itertools import combinations
import pandas as pd
d = {'col1': ['a', 'b','c','d','a','b','d'], 'col2': ['XX','XX','XY','XX','YY','YY','XY']}
df_rel = pd.DataFrame(data=d)
df_rel
col1 col2
0 a XX
1 b XX
2 c XY
3 d XX
4 a YY
5 b YY
6 d XY
The unique nodes are:
uniq_nodes = df_rel['col1'].unique()
uniq_nodes
array(['a', 'b', 'c', 'd'], dtype=object)
For each Relationship the source (Src) and destination (Dst) can be generated:
df1 = pd.DataFrame(
    data=list(combinations(uniq_nodes, 2)),
    columns=['Src', 'Dst'])
df1
Src Dst
0 a b
1 a c
2 a d
3 b c
4 b d
5 c d
I need a new dataframe newdf based on the shared elements in col2 of df_rel; the Relationship column comes from col2. The desired dataframe with the edge list would be:
newdf
Src Dst Relationship
0 a b XX
1 a b YY
2 a d XX
3 c d XY
Is there a faster way to achieve this? The original dataframe has 30,000 rows.
I took the approach below. It works, but it is still not very fast on a large dataframe.
from itertools import combinations
import pandas as pd
d = {'col1': ['a', 'b','c','d','a','b','d'], 'col2': ['XX','XX','XY','XX','YY','YY','XY']}
df_rel = pd.DataFrame(data=d)
df_rel
col1 col2
0 a XX
1 b XX
2 c XY
3 d XX
4 a YY
5 b YY
6 d XY
uniq_nodes = df_rel['col1'].unique()
uniq_nodes
array(['a', 'b', 'c', 'd'], dtype=object)
df1 = pd.DataFrame(
    data=list(combinations(uniq_nodes, 2)),
    columns=['Src', 'Dst'])
filter1 = df_rel['col1'].isin(df1['Src'])
# rename() returns a new frame, avoiding the SettingWithCopyWarning
# that an inplace rename on a filtered slice triggers
src_df = df_rel[filter1].rename(columns={'col1': 'Src'})
filter2 = df_rel['col1'].isin(df1['Dst'])
dst_df = df_rel[filter2].rename(columns={'col1': 'Dst'})
new_df = pd.merge(src_df, dst_df, on='col2', how='inner')
print ("after removing the duplicates")
new_df = new_df.drop_duplicates()
print(new_df.shape)
print ("after removing self loop")
new_df = new_df[new_df['Src'] != new_df['Dst']]
new_df.rename(columns={'col2':'Relationship'}, inplace=True)
print(new_df.shape)
print (new_df)
Src Relationship Dst
0 a XX b
1 a XX d
3 b XX d
5 c XY d
6 a YY b
You need to loop through the df1 rows and find the rows from df_rel that match the df1['Src'] and df1['Dst'] columns. Once you have the col2 values for Src and Dst, compare them, and if they match, create a row in newdf. Try this, and check whether it performs well on large datasets:
Data setup (same as yours):
from itertools import combinations
import pandas as pd

d = {'col1': ['a', 'b', 'c', 'd', 'a', 'b', 'd'], 'col2': ['XX', 'XX', 'XY', 'XX', 'YY', 'YY', 'XY']}
df_rel = pd.DataFrame(data=d)
uniq_nodes = df_rel['col1'].unique()
df1 = pd.DataFrame(data=list(combinations(uniq_nodes, 2)), columns=['Src', 'Dst'])
Code:
rows = []
for i, row in df1.iterrows():
    src = df_rel[df_rel['col1'] == row['Src']]['col2'].to_list()
    dst = df_rel[df_rel['col1'] == row['Dst']]['col2'].to_list()
    for x in src:
        if x in dst:
            # DataFrame.append was removed in pandas 2.0, so collect
            # the rows in a list and build the frame once at the end
            rows.append({'Src': row['Src'], 'Dst': row['Dst'], 'Relationship': x})
newdf = pd.DataFrame(rows, columns=['Src', 'Dst', 'Relationship'])
print(newdf)
Result:
Src Dst Relationship
0 a b XX
1 a b YY
2 a d XX
3 b d XX
4 c d XY
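A vectorized alternative, not from the answer above but a sketch of the same idea: self-merge df_rel on col2, then keep only pairs with Src < Dst, which drops self-loops and mirrored duplicates in one step. This assumes the node labels are comparable with < (true for the strings here), so each unordered pair has a canonical orientation:

```python
import pandas as pd

d = {'col1': ['a', 'b', 'c', 'd', 'a', 'b', 'd'],
     'col2': ['XX', 'XX', 'XY', 'XX', 'YY', 'YY', 'XY']}
df_rel = pd.DataFrame(data=d)

# Self-merge on the shared col2 value: every pair of nodes that share
# a relationship meets here, including (x, x) and both orientations.
merged = df_rel.merge(df_rel, on='col2', suffixes=('_src', '_dst'))

# Keeping only Src < Dst removes self-loops and mirrored duplicates.
newdf = (merged[merged['col1_src'] < merged['col1_dst']]
         .rename(columns={'col1_src': 'Src', 'col1_dst': 'Dst',
                          'col2': 'Relationship'})
         [['Src', 'Dst', 'Relationship']]
         .sort_values(['Src', 'Dst'])
         .reset_index(drop=True))
print(newdf)
```

On 30,000 rows this should beat the iterrows loop by a wide margin, since the pair generation happens inside a single hash merge instead of Python-level iteration.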
Related
I have a df in the following form
import pandas as pd
df = pd.DataFrame({'col1' : [1,1,1,2,2,3,3,4],
'col2' : ['a', 'b', 'c', 'a', 'b', 'a', 'b', 'a'],
'col3' : ['x', 'y', 'z', 'p','q','r','s','t']
})
col1 col2 col3
0 1 a x
1 1 b y
2 1 c z
3 2 a p
4 2 b q
5 3 a r
6 3 b s
7 4 a t
df2 = df.groupby(['col1','col2'])['col3'].sum()
df2
col1 col2
1 a x
b y
c z
2 a p
b q
3 a r
b s
4 a t
Now I want to add padded 0 rows to each col1 index where a, b, c, or d is missing, so the expected output should be:
col1 col2
1 a x
b y
c z
d 0
2 a p
b q
c 0
d 0
3 a r
b s
c 0
d 0
4 a t
b 0
c 0
d 0
Use unstack + reindex + stack:
out = (
df2.unstack(fill_value=0)
.reindex(columns=['a', 'b', 'c', 'd'], fill_value=0)
.stack()
)
out:
col1 col2
1 a x
b y
c z
d 0
2 a p
b q
c 0
d 0
3 a r
b s
c 0
d 0
4 a t
b 0
c 0
d 0
dtype: object
Complete Working Example:
import pandas as pd
df = pd.DataFrame({
'col1': [1, 1, 1, 2, 2, 3, 3, 4],
'col2': ['a', 'b', 'c', 'a', 'b', 'a', 'b', 'a'],
'col3': ['x', 'y', 'z', 'p', 'q', 'r', 's', 't']
})
df2 = df.groupby(['col1', 'col2'])['col3'].sum()
out = (
df2.unstack(fill_value=0)
.reindex(columns=['a', 'b', 'c', 'd'], fill_value=0)
.stack()
)
print(out)
Here's another way using pd.MultiIndex.from_product, then reindex:
mindx = pd.MultiIndex.from_product([df2.index.levels[0], [*'abcd']])
df2.reindex(mindx, fill_value=0)
Output:
col1
1 a x
b y
c z
d 0
2 a p
b q
c 0
d 0
3 a r
b s
c 0
d 0
4 a t
b 0
c 0
d 0
Name: col3, dtype: object
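If the second level's name matters downstream, the same from_product approach can pass names= explicitly; without it the reindexed result comes back with an unnamed second level. A small sketch of that variant:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 1, 2, 2, 3, 3, 4],
                   'col2': ['a', 'b', 'c', 'a', 'b', 'a', 'b', 'a'],
                   'col3': ['x', 'y', 'z', 'p', 'q', 'r', 's', 't']})
df2 = df.groupby(['col1', 'col2'])['col3'].sum()

# names= keeps both index level names on the reindexed result
mindx = pd.MultiIndex.from_product([df2.index.levels[0], list('abcd')],
                                   names=['col1', 'col2'])
out = df2.reindex(mindx, fill_value=0)
print(out)
```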
If I have a dataframe:
>>> import pandas as pd
>>> df = pd.DataFrame([
... ['A', 'B', 'C', 'D'],
... ['E', 'B', 'C']
... ])
>>> df
0 1 2 3
0 A B C D
1 E B C None
>>>
I would like to transform the dataframe into a two-column format:
x, y
-----
A, B
B, C
C, D
E, B
B, C
For each row, from left to right, take two neighboring values and make a pair of them.
It is a kind of from-to relation if you consider each row as a path.
How can this transformation be done?
We can do this with explode plus zip:
s = pd.DataFrame(
    df.apply(lambda x: list(zip(x.dropna()[:-1], x.dropna()[1:])), axis=1)
      .explode()
      .tolist()
)
Out[336]:
0 1
0 A B
1 B C
2 C D
3 E B
4 B C
Update
s=df.apply(lambda x : list(zip(x.dropna()[:-1],x.dropna()[1:])),axis=1).explode()
s=pd.DataFrame(s.tolist(),index=s.index)
s
Out[340]:
0 1
0 A B
0 B C
0 C D
1 E B
1 B C
Preparing the data beforehand can help too:
import pandas as pd
inp = [['A', 'B', 'C', 'D'],
['E', 'B', 'C']]
# Convert beforehand
inp2 = [[[i[k], i[k+1]] for k in range(len(i)-1)] for i in inp]
inp2 = inp2[0] + inp2[1]
df = pd.DataFrame(inp2)
print(df)
Output:
0 1
0 A B
1 B C
2 C D
3 E B
4 B C
I have the following pandas DataFrame:
import pandas as pd
df = pd.DataFrame({
'code': ['eq150', 'eq150', 'eq152', 'eq151', 'eq151', 'eq150'],
'reg': ['A', 'C', 'H', 'P', 'I', 'G'],
'month': ['1', '2', '4', '2', '1', '1']
})
df
code reg month
0 eq150 A 1
1 eq150 C 2
2 eq152 H 4
3 eq151 P 2
4 eq151 I 1
5 eq150 G 1
Expected output:
1 2 3 4
eq150 A, G C
eq152 H
eq151 I P
If you want the output to include the empty 3 column as well:
all_cols = list(map(str, range(df.month.astype(int).min(),
                               df.month.astype(int).max() + 1)))
df_cols = list(df.month.unique())
add_cols = list(set(all_cols) - set(df_cols))
df = df.pivot_table(
    index='code',
    columns='month',
    aggfunc=','.join
).reg.rename_axis(None).rename_axis(None, axis=1).fillna('')
for col in add_cols:
    df[col] = ''
df = df[all_cols]
df
df
1 2 3 4
eq150 A,G C
eq151 I P
eq152 H
Use pivot_table with DataFrame.reindex to add the missing months:
df['month'] = df['month'].astype(int)
r = range(df['month'].min(), df['month'].max() + 1)
df1 = (df.pivot_table(index='code',
                      columns='month',
                      values='reg',
                      aggfunc=','.join,
                      fill_value='')
         .reindex(r, fill_value='', axis=1))
print (df1)
month 1 2 3 4
code
eq150 A,G C
eq151 I P
eq152 H
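pd.crosstab is a thin wrapper around pivot_table for exactly this code-by-month shape, and it combines with the same reindex idea. A sketch of that variant:

```python
import pandas as pd

df = pd.DataFrame({
    'code': ['eq150', 'eq150', 'eq152', 'eq151', 'eq151', 'eq150'],
    'reg': ['A', 'C', 'H', 'P', 'I', 'G'],
    'month': ['1', '2', '4', '2', '1', '1']
})
df['month'] = df['month'].astype(int)

# reindex fills in the months that never occur (here month 3)
# as empty columns after the aggregation.
out = (pd.crosstab(df['code'], df['month'],
                   values=df['reg'], aggfunc=','.join)
       .reindex(columns=range(df['month'].min(), df['month'].max() + 1))
       .fillna(''))
print(out)
```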
I have mydf below, which I have sorted on a dummy time column and the id:
mydf = pd.DataFrame(
{
'id': ['A', 'B', 'B', 'C', 'A', 'C', 'A'],
'time': [1, 4, 3, 5, 2, 6, 7],
'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g']
}
).sort_values(['id', 'time'], ascending=False)
mydf
id time val
5 C 6 f
3 C 5 d
1 B 4 b
2 B 3 c
6 A 7 g
4 A 2 e
0 A 1 a
I want to add a column (last_val) which, for each unique id, holds the latest val based on the time column. Entries for which there is no last_val can be dropped. The output in this example would look like:
mydf
id time val last_val
5 C 6 f d
1 B 4 b c
6 A 7 g e
4 A 2 e a
Any ideas?
Use DataFrameGroupBy.shift after sort_values(['id', 'time'], ascending=False) (already done in the question), then remove rows with missing values with DataFrame.dropna:
mydf['last_val'] = mydf.groupby('id')['val'].shift(-1)
mydf = mydf.dropna(subset=['last_val'])
A similar solution, only removing each id's last row via duplicated(keep='last') instead of dropna:
mydf['last_val'] = mydf.groupby('id')['val'].shift(-1)
mydf = mydf[mydf['id'].duplicated(keep='last')]
print (mydf)
id time val last_val
5 C 6 f d
1 B 4 b c
6 A 7 g e
4 A 2 e a
In this sample dataframe, which contains 3 variables:
import pandas as pd

data = {'A':['m', 'f', 'm', 'm'],
'B':['y', 'y', 'n', 'n'],
'C':['ab','bc','cd','ef'] }
# Create DataFrame
df = pd.DataFrame(data)
df
A B C
0 m y ab
1 f y bc
2 m n cd
3 m n ef
After some manipulations, the above dataframe becomes:
data1 = {'x0_m':[1,0,1,1],
'x0_f':[0,1,0,0],
'x1_y':[1,1,0,0],
'x1_n':[0,0,1,1],
'x2_ab':[1,0,0,0],
'x2_bc':[0,1,0,0],
'x2_cd':[0,0,1,0],
'x2_ef':[0,0,0,1]}
# Create DataFrame
df1 = pd.DataFrame(data1)
df1
x0_m x0_f x1_y x1_n x2_ab x2_bc x2_cd x2_ef
0 1 0 1 0 1 0 0 0
1 0 1 1 0 0 1 0 0
2 1 0 0 1 0 0 1 0
3 1 0 0 1 0 0 0 1
I want to replace the "x0" variables with the column names in the original dataframe. For example, "x0_m" and "x0_f" should become "A_m", "A_f" respectively.
I have identified two steps for this procedure:
Step 1: create a dictionary that maps the x variables to the corresponding column names. I tried this:
list_num = ['x%s' % (i) for i in range(3)]
list_num
['x0', 'x1', 'x2']
Extracting the column names from the original dataframe df:
features = list(df.columns)
features
['A', 'B', 'C']
Then I tried to create a dictionary (avoiding the name dict, which shadows the built-in):
dct = {x: features for x in list_num}
dct
{'x0': ['A', 'B', 'C'], 'x1': ['A', 'B', 'C'], 'x2': ['A', 'B', 'C']}
But, that is not what I want. I'm expecting:
{'x0': 'A', 'x1': 'B', 'x2': 'C'}
How do I get the desired output?
Step 2: replace part of the column names in df1 using the dictionary created above.
For this part I'm completely lost and need help.
You can use the method str.replace():
df1.columns = (
df1.columns
.str.replace('x0', 'A')
.str.replace('x1', 'B')
.str.replace('x2', 'C')
)
or using a dictionary:
dct = {'x0': 'A', 'x1': 'B', 'x2': 'C'}
for k, v in dct.items():
    df1.columns = df1.columns.str.replace(k, v)
Step 2:
import pandas as pd
import numpy as np
data1 = {'x0_m':[1,0,1,1],
'x0_f':[0,1,0,0],
'x1_y':[1,1,0,0],
'x1_n':[0,0,1,1],
'x2_ab':[1,0,0,0],
'x2_bc':[0,1,0,0],
'x2_cd':[0,0,1,0],
'x2_ef':[0,0,0,1]}
df1 = pd.DataFrame(data1)
colnames = list(df1.columns)
new_names = {'x0': 'A', 'x1': 'B', 'x2': 'C'}
for key, value in new_names.items():
    colnames = [col.replace(key, value) for col in colnames]
df1.columns = colnames
df1
A_m A_f B_y B_n C_ab C_bc C_cd C_ef
0 1 0 1 0 1 0 0 0
1 0 1 1 0 0 1 0 0
2 1 0 0 1 0 0 1 0
3 1 0 0 1 0 0 0 1
Just use a dictionary comprehension together with zip:
mapping = {col: feature for col, feature in zip(list_num, features)}
>>> mapping
{'x0': 'A', 'x1': 'B', 'x2': 'C'}
To modify the columns in your second dataframe:
new_cols = []
for col in df1:
    a, b = col.split('_')
    new_cols.append('_'.join([mapping.get(a, a), b]))
df1.columns = new_cols
>>> new_cols
['A_m', 'A_f', 'B_y', 'B_n', 'C_ab', 'C_bc', 'C_cd', 'C_ef']