I have the following pandas DataFrame:
import pandas as pd
df = pd.DataFrame({
'code': ['eq150', 'eq150', 'eq152', 'eq151', 'eq151', 'eq150'],
'reg': ['A', 'C', 'H', 'P', 'I', 'G'],
'month': ['1', '2', '4', '2', '1', '1']
})
df
code reg month
0 eq150 A 1
1 eq150 C 2
2 eq152 H 4
3 eq151 P 2
4 eq151 I 1
5 eq150 G 1
Expected Output:
1 2 3 4
eq150 A, G C
eq152 H
eq151 I P
If you want the output to include the empty column 3 as well:
all_cols = list(map(str, range(df.month.astype(int).min(),
                               df.month.astype(int).max() + 1)))
df_cols = list(df.month.unique())
add_cols = list(set(all_cols)-set(df_cols))
df = df.pivot_table(
index='code',
columns='month',
aggfunc=','.join
).reg.rename_axis(None).rename_axis(None, axis=1).fillna('')
for col in add_cols: df[col] = ''
df = df[all_cols]
df
1 2 3 4
eq150 A,G C
eq151 I P
eq152 H
Use pivot_table with DataFrame.reindex to add the missing months:
df['month'] = df['month'].astype(int)
r = range(df['month'].min(), df['month'].max() + 1)
df1 = (df.pivot_table(index='code',
columns='month',
values='reg',
aggfunc=','.join,
fill_value='')
.reindex(r, fill_value='', axis=1))
print (df1)
month 1 2 3 4
code
eq150 A,G C
eq151 I P
eq152 H
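If the `code`/`month` axis labels in the printed output are unwanted, they can be stripped with `rename_axis`, as in this small self-contained sketch of the same approach:

```python
import pandas as pd

df = pd.DataFrame({
    'code': ['eq150', 'eq150', 'eq152', 'eq151', 'eq151', 'eq150'],
    'reg': ['A', 'C', 'H', 'P', 'I', 'G'],
    'month': ['1', '2', '4', '2', '1', '1']
})
df['month'] = df['month'].astype(int)
r = range(df['month'].min(), df['month'].max() + 1)

df1 = (df.pivot_table(index='code', columns='month', values='reg',
                      aggfunc=','.join, fill_value='')
         .reindex(r, fill_value='', axis=1)
         .rename_axis(None)             # drop the 'code' index name
         .rename_axis(None, axis=1))    # drop the 'month' columns name
print(df1)
```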
I have a DataFrame of home/away pairs, and the desired result is this:
id name
1 A
2 B
3 C
4 D
5 E
6 F
7 G
8 H
Currently I do it this way:
import pandas as pd
df = pd.DataFrame({'home_id': ['1', '3', '5', '7'],
'home_name': ['A', 'C', 'E', 'G'],
'away_id': ['2', '4', '6', '8'],
'away_name': ['B', 'D', 'F', 'H']})
id_col = pd.concat([df['home_id'], df['away_id']])
name_col = pd.concat([df['home_name'], df['away_name']])
result = pd.DataFrame({'id': id_col, 'name': name_col})
result = result.sort_index().reset_index(drop=True)
print(result)
But this approach relies on the index to reorder the rows, which can produce wrong results when there are duplicate indexes.
How can I interleave the column values so the order is always:
the home of the 1st row, then the away of the 1st row, then the home of the 2nd row, then the away of the 2nd row, and so on...
try this:
out = pd.DataFrame(df.values.reshape(-1, 2), columns=['ID', 'Name'])
print(out)
>>>
ID Name
0 1 A
1 2 B
2 3 C
3 4 D
4 5 E
5 6 F
6 7 G
7 8 H
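Note that the reshape(-1, 2) trick silently depends on the DataFrame's columns being in the order home_id, home_name, away_id, away_name. A slightly more defensive sketch selects the columns explicitly first, so a reordered frame still works:

```python
import pandas as pd

df = pd.DataFrame({'home_id': ['1', '3', '5', '7'],
                   'home_name': ['A', 'C', 'E', 'G'],
                   'away_id': ['2', '4', '6', '8'],
                   'away_name': ['B', 'D', 'F', 'H']})

# fix the column order explicitly before reshaping into (id, name) pairs
cols = ['home_id', 'home_name', 'away_id', 'away_name']
out = pd.DataFrame(df[cols].to_numpy().reshape(-1, 2), columns=['ID', 'Name'])
print(out)
```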
Similar to Python's zip, you can iterate through both dataframes:
home = pd.DataFrame(df[['home_id', 'home_name']].values, columns=('id', 'name'))
away = pd.DataFrame(df[['away_id', 'away_name']].values, columns=('id', 'name'))
def zip_dataframes(df1, df2):
    rows = []
    for i in range(len(df1)):
        rows.append(df1.iloc[i, :])
        rows.append(df2.iloc[i, :])
    return pd.concat(rows, axis=1).T
zip_dataframes(home, away)
id name
0 1 A
0 2 B
1 3 C
1 4 D
2 5 E
2 6 F
3 7 G
3 8 H
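A vectorized alternative to the row-by-row loop is a sketch using pd.concat with keys: each frame gets a tag (0 = home, 1 = away), and sorting by (original row, tag) interleaves the rows without any Python-level iteration.

```python
import pandas as pd

df = pd.DataFrame({'home_id': ['1', '3', '5', '7'],
                   'home_name': ['A', 'C', 'E', 'G'],
                   'away_id': ['2', '4', '6', '8'],
                   'away_name': ['B', 'D', 'F', 'H']})

home = df[['home_id', 'home_name']].set_axis(['id', 'name'], axis=1)
away = df[['away_id', 'away_name']].set_axis(['id', 'name'], axis=1)

# keys tag the source frame; swaplevel + sort_index orders by
# (original row, home/away tag), i.e. home then away per row
out = (pd.concat([home, away], keys=[0, 1])
         .swaplevel()
         .sort_index()
         .reset_index(drop=True))
print(out)
```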
You can do this using pd.wide_to_long with a little column header renaming:
import pandas as pd
df = pd.DataFrame({'home_id': ['1', '3', '5', '7'],
'home_name': ['A', 'C', 'E', 'G'],
'away_id': ['2', '4', '6', '8'],
'away_name': ['B', 'D', 'F', 'H']})
dfr = df.rename(columns=lambda x: '_'.join(x.split('_')[::-1])).reset_index()
df_out = (pd.wide_to_long(dfr, ['id', 'name'], 'index', 'No', sep='_', suffix='.*')
.reset_index(drop=True)
.sort_values('id'))
df_out
Output:
id name
0 1 A
4 2 B
1 3 C
5 4 D
2 5 E
6 6 F
3 7 G
7 8 H
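One caveat, since id holds strings: sort_values('id') sorts lexicographically, so '10' would land before '2'. If ids can go past 9, a hedged fix is to sort with a numeric key (the sample frame here is hypothetical, just to show the difference):

```python
import pandas as pd

df_out = pd.DataFrame({'id': ['1', '10', '2'], 'name': ['A', 'B', 'C']})

# plain sort_values('id') would give '1', '10', '2';
# the key casts to int before comparing
df_sorted = df_out.sort_values('id', key=lambda s: s.astype(int))
print(df_sorted)
```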
I have a df in the following form
import pandas as pd
df = pd.DataFrame({'col1' : [1,1,1,2,2,3,3,4],
'col2' : ['a', 'b', 'c', 'a', 'b', 'a', 'b', 'a'],
'col3' : ['x', 'y', 'z', 'p','q','r','s','t']
})
col1 col2 col3
0 1 a x
1 1 b y
2 1 c z
3 2 a p
4 2 b q
5 3 a r
6 3 b s
7 4 a t
df2 = df.groupby(['col1','col2'])['col3'].sum()
df2
col1 col2
1 a x
b y
c z
2 a p
b q
3 a r
b s
4 a t
Now I want to add zero-padded rows to each col1 group wherever a, b, c, or d is missing, so the expected output should be:
col1 col2
1 a x
b y
c z
d 0
2 a p
b q
c 0
d 0
3 a r
b s
c 0
d 0
4 a t
b 0
c 0
d 0
Use unstack + reindex + stack:
out = (
df2.unstack(fill_value=0)
.reindex(columns=['a', 'b', 'c', 'd'], fill_value=0)
.stack()
)
out:
col1 col2
1 a x
b y
c z
d 0
2 a p
b q
c 0
d 0
3 a r
b s
c 0
d 0
4 a t
b 0
c 0
d 0
dtype: object
Complete Working Example:
import pandas as pd
df = pd.DataFrame({
'col1': [1, 1, 1, 2, 2, 3, 3, 4],
'col2': ['a', 'b', 'c', 'a', 'b', 'a', 'b', 'a'],
'col3': ['x', 'y', 'z', 'p', 'q', 'r', 's', 't']
})
df2 = df.groupby(['col1', 'col2'])['col3'].sum()
out = (
df2.unstack(fill_value=0)
.reindex(columns=['a', 'b', 'c', 'd'], fill_value=0)
.stack()
)
print(out)
Here's another way using pd.MultiIndex.from_product, then reindex:
mindx = pd.MultiIndex.from_product([df2.index.levels[0], [*'abcd']])
df2.reindex(mindx, fill_value=0)
Output:
col1
1 a x
b y
c z
d 0
2 a p
b q
c 0
d 0
3 a r
b s
c 0
d 0
4 a t
b 0
c 0
d 0
Name: col3, dtype: object
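For completeness, a self-contained run of the from_product approach, rebuilding df2 from the question's data:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 1, 2, 2, 3, 3, 4],
                   'col2': ['a', 'b', 'c', 'a', 'b', 'a', 'b', 'a'],
                   'col3': ['x', 'y', 'z', 'p', 'q', 'r', 's', 't']})
df2 = df.groupby(['col1', 'col2'])['col3'].sum()

# build the full (col1, letter) grid, then fill missing combinations with 0
mindx = pd.MultiIndex.from_product([df2.index.levels[0], [*'abcd']])
out = df2.reindex(mindx, fill_value=0)
print(out)
```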
I have 2 dataframes:
df1 = pd.DataFrame({'A':[1,2,3,4],
'B':[5,6,7,8],
'D':[9,10,11,12]})
and
df2 = pd.DataFrame({'type':['A', 'B', 'C', 'D', 'E'],
'color':['yellow', 'green', 'red', 'pink', 'black'],
'size':['S', 'M', 'L', 'S', 'M']})
I want to map information from df2 onto the column headers of df1, so each matching header also carries its color and size.
How can I do this? Many thanks :)
Use rename with values aggregated by DataFrame.agg:
df1 = pd.DataFrame({'A1':[1,2,3,4],
'B':[5,6,7,8],
'D':[9,10,11,12]})
s = df2.set_index('type', drop=False).agg(','.join, axis=1)
df1 = df1.rename(columns=s)
print (df1)
A1 B,green,M D,pink,S
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
For the parenthesized format, e.g. A (yellow,S), a bit more processing is needed:
s = df2.set_index('type').agg(','.join, axis=1).add(')').radd('(')
s = s.index +' ' + s
df1 = df1.rename(columns=s)
print (df1)
A (yellow,S) B (green,M) D (pink,S)
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
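Here is a self-contained run of the plain comma-joined rename, using the question's original df1 with column A so that every header matches:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4],
                    'B': [5, 6, 7, 8],
                    'D': [9, 10, 11, 12]})
df2 = pd.DataFrame({'type': ['A', 'B', 'C', 'D', 'E'],
                    'color': ['yellow', 'green', 'red', 'pink', 'black'],
                    'size': ['S', 'M', 'L', 'S', 'M']})

# map each type to 'type,color,size'; rename matches headers by that mapping
s = df2.set_index('type', drop=False).agg(','.join, axis=1)
df1 = df1.rename(columns=s)
print(df1)
```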
I have a pandas data frame like this:
from itertools import combinations
import pandas as pd
d = {'col1': ['a', 'b','c','d','a','b','d'], 'col2': ['XX','XX','XY','XX','YY','YY','XY']}
df_rel = pd.DataFrame(data=d)
df_rel
col1 col2
0 a XX
1 b XX
2 c XY
3 d XX
4 a YY
5 b YY
6 d XY
The unique nodes are:
uniq_nodes = df_rel['col1'].unique()
uniq_nodes
array(['a', 'b', 'c', 'd'], dtype=object)
For each Relationship the source (Src) and destination (Dst) can be generated:
df1 = pd.DataFrame(
data=list(combinations(uniq_nodes, 2)),
columns=['Src', 'Dst'])
df1
Src Dst
0 a b
1 a c
2 a d
3 b c
4 b d
5 c d
I need the new dataframe newdf based on the shared elements in col2 of df_rel. The Relationship column comes from col2. Thus the desired dataframe, an edge list, will be:
newdf
Src Dst Relationship
0 a b XX
1 a b YY
2 a d XX
3 c d XY
Is there a faster way to achieve this? The original dataframe has 30,000 rows.
I took this approach. It works but still not very fast for the large dataframe.
from itertools import combinations
import pandas as pd
d = {'col1': ['a', 'b','c','d','a','b','d'], 'col2': ['XX','XX','XY','XX','YY','YY','XY']}
df_rel = pd.DataFrame(data=d)
df_rel
col1 col2
0 a XX
1 b XX
2 c XY
3 d XX
4 a YY
5 b YY
6 d XY
uniq_nodes = df_rel['col1'].unique()
uniq_nodes
array(['a', 'b', 'c', 'd'], dtype=object)
df1 = pd.DataFrame(
data=list(combinations(uniq_nodes, 2)),
columns=['Src', 'Dst'])
filter1 = df_rel['col1'].isin(df1['Src'])
src_df = df_rel[filter1].copy()  # copy to avoid SettingWithCopyWarning on rename
src_df.rename(columns={'col1': 'Src'}, inplace=True)
filter2 = df_rel['col1'].isin(df1['Dst'])
dst_df = df_rel[filter2].copy()
dst_df.rename(columns={'col1': 'Dst'}, inplace=True)
new_df = pd.merge(src_df,dst_df, on = "col2",how="inner")
print ("after removing the duplicates")
new_df = new_df.drop_duplicates()
print(new_df.shape)
print ("after removing self loop")
new_df = new_df[new_df['Src'] != new_df['Dst']]
new_df.rename(columns={'col2':'Relationship'}, inplace=True)
print(new_df.shape)
print (new_df)
Src Relationship Dst
0 a XX b
1 a XX d
3 b XX d
5 c XY d
6 a YY b
You need to loop through the df1 rows and find the rows in df_rel that match the df1['Src'] and df1['Dst'] columns. Once you have the col2 values for Src and Dst, compare them, and if they match, create a row in newdf. Try this, and check whether it performs well enough on large datasets.
Data setup (same as yours):
d = {'col1': ['a', 'b', 'c', 'd', 'a', 'b', 'd'], 'col2': ['XX', 'XX', 'XY', 'XX', 'YY', 'YY', 'XY']}
df_rel = pd.DataFrame(data=d)
uniq_nodes = df_rel['col1'].unique()
df1 = pd.DataFrame(data=list(combinations(uniq_nodes, 2)), columns=['Src', 'Dst'])
Code:
rows = []
for i, row in df1.iterrows():
    src = df_rel[df_rel['col1'] == row['Src']]['col2'].to_list()
    dst = df_rel[df_rel['col1'] == row['Dst']]['col2'].to_list()
    for x in src:
        if x in dst:
            # collect plain dicts; DataFrame.append was removed in pandas 2.0
            rows.append({'Src': row['Src'], 'Dst': row['Dst'], 'Relationship': x})
newdf = pd.DataFrame(rows, columns=['Src', 'Dst', 'Relationship'])
print(newdf)
Result:
Src Dst Relationship
0 a b XX
1 a b YY
2 a d XX
3 b d XX
4 c d XY
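Since the question asks about speed on ~30,000 rows, a fully vectorized sketch self-merges df_rel on col2 and keeps only ordered pairs. The assumption here is that an edge a-b equals b-a, so keeping Src < Dst both de-duplicates the pairs and drops self-loops:

```python
import pandas as pd

d = {'col1': ['a', 'b', 'c', 'd', 'a', 'b', 'd'],
     'col2': ['XX', 'XX', 'XY', 'XX', 'YY', 'YY', 'XY']}
df_rel = pd.DataFrame(data=d)

# pair every two rows that share a col2 value, then keep Src < Dst
m = df_rel.merge(df_rel, on='col2', suffixes=('_src', '_dst'))
newdf = (m[m['col1_src'] < m['col1_dst']]
         .rename(columns={'col1_src': 'Src', 'col1_dst': 'Dst',
                          'col2': 'Relationship'})
         [['Src', 'Dst', 'Relationship']]
         .drop_duplicates()
         .reset_index(drop=True))
print(newdf)
```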
In this sample dataframe which contains 3 variables:
import pandas as pd

data = {'A':['m', 'f', 'm', 'm'],
'B':['y', 'y', 'n', 'n'],
'C':['ab','bc','cd','ef'] }
# Create DataFrame
df = pd.DataFrame(data)
df
A B C
0 m y ab
1 f y bc
2 m n cd
3 m n ef
After some manipulations, the above dataframe becomes:
data1 = {'x0_m':[1,0,1,1],
'x0_f':[0,1,0,0],
'x1_y':[1,1,0,0],
'x1_n':[0,0,1,1],
'x2_ab':[1,0,0,0],
'x2_bc':[0,1,0,0],
'x2_cd':[0,0,1,0],
'x2_ef':[0,0,0,1]}
# Create DataFrame
df1 = pd.DataFrame(data1)
df1
x0_m x0_f x1_y x1_n x2_ab x2_bc x2_cd x2_ef
0 1 0 1 0 1 0 0 0
1 0 1 1 0 0 1 0 0
2 1 0 0 1 0 0 1 0
3 1 0 0 1 0 0 0 1
I want to replace the "x0" variables with the column names in the original dataframe. For example, "x0_m" and "x0_f" should become "A_m", "A_f" respectively.
I have identified two steps for this procedure:
Step 1: create a dictionary which will include variables x's and the corresponding column names. I tried this:
list_num = ['x%s' % (i) for i in range(3)]
list_num
['x0', 'x1', 'x2']
Extracting the column names from the original dataframe df:
features = list(df.columns)
features
['A', 'B', 'C']
Then i tried to create a dictionary:
dict = {x: features for x in list_num}
dict
{'x0': ['A', 'B', 'C'], 'x1': ['A', 'B', 'C'], 'x2': ['A', 'B', 'C']}
But, that is not what I want. I'm expecting:
{'x0': 'A', 'x1': 'B', 'x2': 'C'}
How can I get the desired output?
STEP2: Replace a part of the columns in df1 with the help of the dictionary created above.
For this part I'm completely lost. Need help.
You can use the method str.replace():
df1.columns = (
df1.columns
.str.replace('x0', 'A')
.str.replace('x1', 'B')
.str.replace('x2', 'C')
)
or using a dictionary (named dct rather than dict to avoid shadowing the built-in):
for k, v in dct.items():
df1.columns = df1.columns.str.replace(k, v)
Step2:
import pandas as pd
import numpy as np
data1 = {'x0_m':[1,0,1,1],
'x0_f':[0,1,0,0],
'x1_y':[1,1,0,0],
'x1_n':[0,0,1,1],
'x2_ab':[1,0,0,0],
'x2_bc':[0,1,0,0],
'x2_cd':[0,0,1,0],
'x2_ef':[0,0,0,1]}
df1 = pd.DataFrame(data1)
colnames = list(df1.columns)
new_names = {'x0': 'A', 'x1': 'B', 'x2': 'C'}
for key, value in new_names.items():
colnames = [col.replace(key, value) for col in colnames]
df1.columns = colnames
df1
A_m A_f B_y B_n C_ab C_bc C_cd C_ef
0 1 0 1 0 1 0 0 0
1 0 1 1 0 0 1 0 0
2 1 0 0 1 0 0 1 0
3 1 0 0 1 0 0 0 1
Just use a dictionary comprehension together with zip:
mapping = {col: feature for col, feature in zip(list_num, features)}
>>> mapping
{'x0': 'A', 'x1': 'B', 'x2': 'C'}
To modify the columns in your second dataframe:
new_cols = []
for col in df1:
a, b = col.split('_')
new_cols.append('_'.join([mapping.get(a, a), b]))
df1.columns = new_cols
>>> new_cols
['A_m', 'A_f', 'B_y', 'B_n', 'C_ab', 'C_bc', 'C_cd', 'C_ef']