I have a dataframe like so:
id type count_x count_y count_z sum_x sum_y sum_z
1 A 12 1 6 34 43 25
1 B 4 5 8 12 37 28
Now I want to transform it by grouping by id and type, then transforming from wide to long like so:
id type variable value calc
1 A x 12 count
1 A y 1 count
1 A z 6 count
1 B x 4 count
1 B y 5 count
1 B z 8 count
1 A x 34 sum
1 A y 43 sum
1 A z 25 sum
1 B x 12 sum
1 B y 37 sum
1 B z 28 sum
How can I achieve this?
Try using melt, then split the variable column on the underscore:
res = pd.melt(df,id_vars=['id', 'type'])
res[['calc', 'variable']] = res.variable.str.split('_', expand=True)
    id type variable  value   calc
0    1    A        x     12  count
1    1    B        x      4  count
2    1    A        y      1  count
3    1    B        y      5  count
4    1    A        z      6  count
5    1    B        z      8  count
6    1    A        x     34    sum
7    1    B        x     12    sum
8    1    A        y     43    sum
9    1    B        y     37    sum
10   1    A        z     25    sum
11   1    B        z     28    sum
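If you also need the rows in the exact order shown in the question (all the count rows first, grouped by type), a sort on the result should do it; just a sketch, reusing the res frame from above:
res = (res.sort_values(['calc', 'type', 'variable'])
          [['id', 'type', 'variable', 'value', 'calc']]
          .reset_index(drop=True))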
Update:
Using stack:
df1 = (df.set_index(['id', 'type'])
         .stack()
         .rename('value')
         .reset_index())
df1 = (df1.drop('level_2', axis=1)
          .join(df1['level_2'].str.split('_', n=1, expand=True)
                              .rename(columns={0: 'calc', 1: 'variable'})))
    id type  value   calc variable
0    1    A     12  count        x
1    1    A      1  count        y
2    1    A      6  count        z
3    1    A     34    sum        x
4    1    A     43    sum        y
5    1    A     25    sum        z
6    1    B      4  count        x
7    1    B      5  count        y
8    1    B      8  count        z
9    1    B     12    sum        x
10   1    B     37    sum        y
11   1    B     28    sum        z
You can use a combination of melt and split()
df = pd.DataFrame({'id': [1,1], 'type': ['A', 'B'], 'count_x':[12,4], 'count_y': [1,5], 'count_z': [6,8], 'sum_x': [34, 12], 'sum_y': [43, 37], 'sum_z': [25, 28]})
df_melt = df.melt(id_vars=['id', 'type'])
df_melt[['calc', 'variable']] = df_melt['variable'].str.split("_", expand=True)
df_melt
id type variable value calc
0 1 A x 12 count
1 1 B x 4 count
2 1 A y 1 count
3 1 B y 5 count
4 1 A z 6 count
5 1 B z 8 count
6 1 A x 34 sum
7 1 B x 12 sum
8 1 A y 43 sum
9 1 B y 37 sum
10 1 A z 25 sum
11 1 B z 28 sum
Assuming your pandas DataFrame is df_wide, you can get the desired result in df_long as:
df_long = df_wide.melt(id_vars=['id', 'type'], value_vars=['count_x', 'count_y', 'count_z', 'sum_x', 'sum_y', 'sum_z'])
df_long['calc'] = df_long['variable'].apply(lambda x: x.split('_')[0])
df_long['variable'] = df_long['variable'].apply(lambda x: x.split('_')[1])
You could reshape the data using pivot_longer from pyjanitor:
df.pivot_longer(index=['id', 'type'],
                # names of the new columns; note the order:
                # the first value from the split is assigned to `calc`,
                # the second goes to `variable`
                names_to=['calc', 'variable'],
                values_to='value',
                # delimiter in the original column names
                names_sep='_')
id type calc variable value
0 1 A count x 12
1 1 B count x 4
2 1 A count y 1
3 1 B count y 5
4 1 A count z 6
5 1 B count z 8
6 1 A sum x 34
7 1 B sum x 12
8 1 A sum y 43
9 1 B sum y 37
10 1 A sum z 25
11 1 B sum z 28
I want to insert separator rows between the groups of values in the "alpha" column, like this:
Start:
alpha beta gamma
A     1    0
A     1    1
B     1    0
B     1    1
B     1    0
C     1    1
End:
alpha beta gamma
A     1    0
A     1    1
X     X    X
B     1    0
B     1    1
B     1    0
X     X    X
C     1    1
Thanks for help <3
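For reference, the answers below assume a frame built from the Start table; one way to construct it (beta/gamma assumed to be plain ints):
import pandas as pd
df = pd.DataFrame({'alpha': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'beta':  [1, 1, 1, 1, 1, 1],
                   'gamma': [0, 1, 0, 1, 0, 1]})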
You can try
out = (df.groupby('alpha')
         .apply(lambda g: pd.concat([g, pd.DataFrame([['X', 'X', 'X']], columns=df.columns)]))
         .reset_index(drop=True)[:-1])
print(out)
alpha beta gamma
0 A 1 0
1 A 1 1
2 X X X
3 B 1 0
4 B 1 1
5 B 1 0
6 X X X
7 C 1 1
Assuming a range index as in the example, you can use:
# get indices in between 2 groups
idx = df['alpha'].ne(df['alpha'].shift(-1).ffill())
df2 = pd.concat([df, df[idx].assign(**{c: 'X' for c in df})]).sort_index(kind='stable')
Or without groupby and sort_index:
idx = df['alpha'].ne(df['alpha'].shift(-1).ffill())
df2 = df.loc[df.index.repeat(idx+1)]
df2.loc[df2.index.duplicated()] = 'X'
Output:
alpha beta gamma
0 A 1 0
1 A 1 1
1 X X X
2 B 1 0
3 B 1 1
4 B 1 0
4 X X X
5 C 1 1
NB. add reset_index(drop=True) to get a new index
You can do:
dfx = pd.DataFrame({'alpha':['X'],'beta':['X'],'gamma':['X']})
df = df.groupby('alpha', as_index=False).apply(lambda x: pd.concat([x, dfx])).reset_index(drop=True)
Output:
alpha beta gamma
0 A 1 0
1 A 1 1
2 X X X
3 B 1 0
4 B 1 1
5 B 1 0
6 X X X
7 C 1 1
8 X X X
To avoid adding an [X, X, X] row at the end, you can check the index first, like:
df.groupby('alpha', as_index=False).apply(
    lambda x: pd.concat([x, dfx])
    if x.index[-1] != df.index[-1] else x).reset_index(drop=True)
I have a dataframe which looks like this:
df = pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9],'sid':['X','Y','X','Z','X','Y','X','Y','Z'], 'cl':[0,1,1,0,0,1,0,0,1]})
df
id sid cl
0 1 X 0
1 2 Y 1
2 3 X 1
3 4 Z 0
4 5 X 0
5 6 Y 1
6 7 X 0
7 8 Y 0
8 9 Z 1
What I want to do is first group by 'sid' and take the count of the 'cl' values; the cl value with the highest count should become the value for all rows in that group.
So for the dataframe df and sid = 'X', the cl values are 0, 1, 0, 0. Since 0 is the most frequently occurring value, all cl values under sid "X" should be updated to 0; similarly for 'Y' it should be 1, and for 'Z' both 0 and 1 occur once, so either value can be chosen.
The resulting data frame should look like:
id sid cl cl_new
0 1 X 0 0
1 2 Y 1 1
2 3 X 1 0
3 4 Z 0 0
4 5 X 0 0
5 6 Y 1 1
6 7 X 0 0
7 8 Y 0 1
8 9 Z 1 0
You can use groupby.transform + Series.mode:
df["cl_new"] = df.groupby("sid")["cl"].transform(lambda x: x.mode()[0])
print(df)
Prints:
id sid cl cl_new
0 1 X 0 0
1 2 Y 1 1
2 3 X 1 0
3 4 Z 0 0
4 5 X 0 0
5 6 Y 1 1
6 7 X 0 0
7 8 Y 0 1
8 9 Z 1 0
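One thing to be aware of: Series.mode returns the modal values in sorted order, so for a tie like sid 'Z' (one 0 and one 1) the [0] above always picks the smaller value; a quick sketch:
pd.Series([1, 0]).mode()[0]   # -> 0, even though 1 appears first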
Group by sid, count the values per group and take the index of the maximum occurrence, then broadcast it back to your original dataframe with transform:
df['cl_new'] = df.groupby('sid')['cl'].transform(lambda x: x.value_counts().idxmax())
>>> df
id sid cl cl_new
0 1 X 0 0
1 2 Y 1 1
2 3 X 1 0
3 4 Z 0 0
4 5 X 0 0
5 6 Y 1 1
6 7 X 0 0
7 8 Y 0 1
8 9 Z 1 0
I have a dataframe with one column and I would like to get a Dataframe with N columns all of which will be identical to the first one. I can simply duplicate it by:
df[['new column name']] = df[['column name']]
but I have to make more than 1000 identical columns, so doing it one column at a time doesn't work.
One important thing: the figures in the columns should change; for instance, if the first column is 0, the nth column should be n and the previous one n-1.
If it's a single column, you can transpose it, replicate it with pd.concat, and then transpose back to the original format. This avoids looping and should be faster. You can then change the column names in a second step, without touching all the data in the dataframe, which would be the most expensive part performance-wise:
import pandas as pd
df = pd.DataFrame({'Column':[1,2,3,4,5]})
Original dataframe:
Column
0 1
1 2
2 3
3 4
4 5
df = pd.concat([df.T]*1000).T
Output:
Column Column Column Column ... Column Column Column Column
0 1 1 1 1 ... 1 1 1 1
1 2 2 2 2 ... 2 2 2 2
2 3 3 3 3 ... 3 3 3 3
3 4 4 4 4 ... 4 4 4 4
4 5 5 5 5 ... 5 5 5 5
[5 rows x 1000 columns]
df.columns = ['Column'+'_'+str(i) for i in range(1000)]
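If, as the edit to the question asks, the values should also change per column (first column 0, nth column n), one hedged follow-up is to broadcast a per-column offset over the replicated frame, e.g.:
import numpy as np
# assumes the replicated numeric frame from above (5 rows x 1000 columns);
# column i gets i added to every value
df = df + np.arange(df.shape[1])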
Say that you have a df with a column 'company_name' that consists of 8 companies:
df = pd.DataFrame({"company_name":{"0":"Telia","1":"Proximus","2":"Tmobile","3":"Orange","4":"Telefonica","5":"Verizon","6":"AT&T","7":"Koninklijke"}})
company_name
0 Telia
1 Proximus
2 Tmobile
3 Orange
4 Telefonica
5 Verizon
6 AT&T
7 Koninklijke
You can use a loop and range to determine how many identical columns to create, and do:
for i in range(0, 1000):
    df['company_name'+str(i)] = df['company_name']
which results in the shape of the df:
df.shape
(8, 1001)
i.e. it replicated the same column 1000 times. The names of the duplicated columns are the original name with an integer appended at the end:
'company_name', 'company_name0', 'company_name1', 'company_name2','company_name..N'
df
A B C
0 x x x
1 y x z
Duplicate column "C" 5 times using df.assign:
n = 5
df2 = df.assign(**{f'C{i}': df['C'] for i in range(1, n+1)})
df2
A B C C1 C2 C3 C4 C5
0 x x x x x x x x
1 y x z z z z z z
Set n to 1000 to get your desired output.
You can also directly assign the result back:
df[[f'C{i}' for i in range(1, n+1)]] = df[['C']*n].to_numpy()
df
A B C C1 C2 C3 C4 C5
0 x x x x x x x x
1 y x z z z z z z
I think the most efficient approach is to index with DataFrame.loc instead of using an explicit loop:
n = 3
new_df = df.loc[:, ['column_duplicate']*n +
                   df.columns.difference(['column_duplicate']).tolist()]
print(new_df)
column_duplicate column_duplicate column_duplicate other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19
If you want to add a suffix:
suffix_tup = ('a', 'b', 'c')
not_dup_cols = df.columns.difference(['column_duplicate']).tolist()
new_df = (df.loc[:, ['column_duplicate']*len(suffix_tup) +
                    not_dup_cols]
            .set_axis(list(map(lambda suffix: f'column_duplicate_{suffix}',
                               suffix_tup)) +
                      not_dup_cols, axis=1)
          )
print(new_df)
column_duplicate_a column_duplicate_b column_duplicate_c other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19
or add an index
n = 3
not_dup_cols = df.columns.difference(['column_duplicate']).tolist()
new_df = (df.loc[:, ['column_duplicate']*n +
                    not_dup_cols]
            .set_axis(list(map(lambda x: f'column_duplicate_{x}', range(n))) +
                      not_dup_cols, axis=1)
          )
print(new_df)
column_duplicate_0 column_duplicate_1 column_duplicate_2 other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19
I have a dataframe df=
Type ID QTY_1 QTY_2 RES_1 RES_2
X 1 10 15 y N
X 2 12 25 N N
X 3 25 16 Y Y
X 4 14 62 N Y
X 5 21 75 Y Y
Y 1 10 15 y N
Y 2 12 25 N N
Y 3 25 16 Y Y
Y 4 14 62 N N
Y 5 21 75 Y Y
I want the result to be two different dataframes, one per QTY column, keeping only the rows that have a Y in the corresponding RES column.
Below is my expected result
df1=
Type ID QTY_1
X 1 10
X 3 25
X 5 21
Y 1 10
Y 3 25
Y 5 21
df2 =
Type ID QTY_2
X 3 16
X 4 62
X 5 75
Y 3 16
Y 5 75
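For reference, a frame matching the table above (note the lowercase 'y' values, which the answers below have to handle) could be built like this:
import pandas as pd
df = pd.DataFrame({'Type':  list('XXXXXYYYYY'),
                   'ID':    [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
                   'QTY_1': [10, 12, 25, 14, 21, 10, 12, 25, 14, 21],
                   'QTY_2': [15, 25, 16, 62, 75, 15, 25, 16, 62, 75],
                   'RES_1': ['y', 'N', 'Y', 'N', 'Y', 'y', 'N', 'Y', 'N', 'Y'],
                   'RES_2': ['N', 'N', 'Y', 'Y', 'Y', 'N', 'N', 'Y', 'N', 'Y']})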
You can do this:
df1 = df[['Type', 'ID', 'QTY_1']].loc[df.RES_1.isin(['Y', 'y'])]
df2 = df[['Type', 'ID', 'QTY_2']].loc[df.RES_2.isin(['Y', 'y'])]
or
df1 = df[['Type', 'ID', 'QTY_1']].loc[df.RES_1.str.lower() == 'y']
df2 = df[['Type', 'ID', 'QTY_2']].loc[df.RES_2.str.lower() == 'y']
Output:
>>> df1
Type ID QTY_1
0 X 1 10
2 X 3 25
4 X 5 21
5 Y 1 10
7 Y 3 25
9 Y 5 21
>>> df2
Type ID QTY_2
2 X 3 16
3 X 4 62
4 X 5 75
7 Y 3 16
9 Y 5 75
Use a dictionary
It's good practice to use a dictionary for a variable number of variables. Although in this case there may be only a couple of categories, you benefit from organized data. For example, you can access RES_1 data via dfs[1].
dfs = {i: df.loc[df['RES_'+str(i)].str.lower() == 'y', ['Type', 'ID', 'QTY_'+str(i)]]
       for i in range(1, 3)}
print(dfs)
{1: Type ID QTY_1
0 X 1 10
2 X 3 25
4 X 5 21
5 Y 1 10
7 Y 3 25
9 Y 5 21,
2: Type ID QTY_2
2 X 3 16
3 X 4 62
4 X 5 75
7 Y 3 16
9 Y 5 75}
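If the number of QTY_/RES_ pairs can vary, the range doesn't have to be hard-coded; a sketch, assuming every QTY_i has a matching RES_i:
n_pairs = sum(c.startswith('QTY_') for c in df.columns)
dfs = {i: df.loc[df[f'RES_{i}'].str.lower() == 'y', ['Type', 'ID', f'QTY_{i}']]
       for i in range(1, n_pairs + 1)}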
You need:
df1 = df.loc[(df['RES_1']=='Y') | (df['RES_1']=='y')].drop(['QTY_2', 'RES_1', 'RES_2'], axis=1)
df2 = df.loc[(df['RES_2']=='Y') | (df['RES_2']=='y')].drop(['QTY_1', 'RES_1', 'RES_2'], axis=1)
print(df1)
print(df2)
Output:
Type ID QTY_1
0 X 1 10
2 X 3 25
4 X 5 21
5 Y 1 10
7 Y 3 25
9 Y 5 21
Type ID QTY_2
2 X 3 16
3 X 4 62
4 X 5 75
7 Y 3 16
9 Y 5 75
I have a dataframe (df_temp) which is like the following:
ID1 ID2
0 A X
1 A X
2 A Y
3 A Y
4 A Z
5 B L
6 B L
What I need is to add a column which shows the cumulative number of unique values of ID2 for each ID1, so something like:
ID1 ID2 CumUniqueIDs
0 A X 1
1 A X 1
2 A Y 2
3 A Y 2
4 A Z 3
5 B L 1
6 B L 1
I've tried:
df_temp['CumUniqueIDs'] = df_temp.groupby(by=['ID1'])['ID2'].nunique().cumsum() + 1
But this simply fills CumUniqueIDs with NaN.
Not sure what I'm doing wrong here! Any help much appreciated!
You can use groupby() + transform() + factorize():
In [12]: df['CumUniqueIDs'] = df.groupby('ID1')['ID2'].transform(lambda x: pd.factorize(x)[0]+1)
In [13]: df
Out[13]:
ID1 ID2 CumUniqueIDs
0 A X 1
1 A X 1
2 A Y 2
3 A Y 2
4 A Z 3
5 B L 1
6 B L 1
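This works because pd.factorize labels values by order of first appearance within each group, which is exactly the running count of distinct ID2 values seen so far, e.g.:
pd.factorize(['X', 'X', 'Y', 'Y', 'Z'])[0] + 1   # array([1, 1, 2, 2, 3])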
By using the category dtype:
df.groupby(['ID1']).ID2.apply(lambda x : x.astype('category').cat.codes.add(1))
Out[551]:
0 1
1 1
2 2
3 2
4 3
5 1
6 1
Name: ID2, dtype: int8
Then assign it back:
df['CumUniqueIDs']=df.groupby(['ID1']).ID2.apply(lambda x : x.astype('category').cat.codes.add(1))
df
Out[553]:
ID1 ID2 CumUniqueIDs
0 A X 1
1 A X 1
2 A Y 2
3 A Y 2
4 A Z 3
5 B L 1
6 B L 1
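Small caveat: cat.codes numbers the categories in sorted order rather than by order of first appearance, so it only matches the factorize answer when the values happen to appear in sorted order (as they do here), e.g.:
pd.Series(['Z', 'X']).astype('category').cat.codes.add(1).tolist()   # [2, 1]; factorize would give [1, 2]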