I have three lists, [1,4,3], [2,5,6], [9,8,7], which refer to column positions in a dataframe. I'm using each list to slice the dataframe into a smaller dataframe for batch data processing. After the processing, I want to recombine the smaller dataframes into the original dataframe, preserving the order of the columns.
df_1 = df.iloc[:,list1]
#carry out preprocessing
df_2 = df.iloc[:,list2]
#carry out preprocessing
df_3 = df.iloc[:,list3]
#carry out preprocessing
#join the frames back together
frames = [df_1,df_2,df_3]
df = pd.concat(frames, axis = 1)
Is there a simple way to concat and preserve the original order of the series? i.e. [1,2,3,4,5,6,7,8,9]
I think not; you need sort_index to sort by column names:
df = pd.concat(frames, axis = 1).sort_index(axis=1)
If you want them sorted by index positions:
L = list1 + list2 + list3
df1 = pd.concat(frames, axis = 1).reindex(columns=df.columns[sorted(L)])
Or sort inside iloc:
df_1 = df.iloc[:,sorted(list1)]
#carry out preprocessing
df_2 = df.iloc[:,sorted(list2)]
#carry out preprocessing
df_3 = df.iloc[:,sorted(list3)]
#carry out preprocessing
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,10)), columns=list('EFGHIJABCD'))
print (df)
E F G H I J A B C D
0 8 8 3 7 7 0 4 2 5 2
1 2 2 1 0 8 4 0 9 6 2
2 4 1 5 3 4 4 3 7 1 1
3 7 7 0 2 9 9 3 2 5 8
4 1 0 7 6 2 0 8 2 5 1
list1 = [1,4,3]
list2 = [2,5,6]
list3 = [9,8,7]
df_1 = df.iloc[:,list1]
#carry out preprocessing
df_2 = df.iloc[:,list2]
#carry out preprocessing
df_3 = df.iloc[:,list3]
#carry out preprocessing
#join the frames back together
frames = [df_1,df_2,df_3]
L = list1 + list2 + list3
df1 = pd.concat(frames, axis = 1).reindex(columns=df.columns[sorted(L)])
print (df1)
F G H I J A B C D
0 8 3 7 7 0 4 2 5 2
1 2 1 0 8 4 0 9 6 2
2 1 5 3 4 4 3 7 1 1
3 7 0 2 9 9 3 2 5 8
4 0 7 6 2 0 8 2 5 1
df2 = pd.concat(frames, axis = 1).sort_index(axis=1)
print (df2)
A B C D F G H I J
0 4 2 5 2 8 3 7 7 0
1 0 9 6 2 2 1 0 8 4
2 3 7 1 1 1 5 3 4 4
3 3 2 5 8 7 0 2 9 9
4 8 2 5 1 0 7 6 2 0
EDIT:
If the column names are the same as the values in list L:
L.sort()
df = df[L]
Or:
df = df[sorted(L)]
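For instance, a minimal sketch assuming integer column labels that match the positions in L:
import numpy as np
import pandas as pd

np.random.seed(100)
#columns here are the integers 0..9 rather than letters
df = pd.DataFrame(np.random.randint(10, size=(5,10)))
L = [1,4,3] + [2,5,6] + [9,8,7]
#label-based selection, equivalent to positional selection in this case
df = df[sorted(L)]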
I am trying to sum every row of a dataframe with a series.
I have a dataframe with 107 rows and 42 columns, and a series of length 42. I would like to add the series to every row of the dataframe, so that the same number is added to each column across all rows. I tried df.add(series), but the result was a dataframe with 107 rows and 84 columns, all NaN values.
For example
dataframe:
Index a b c
d 1 2 3
e 4 5 6
f 7 8 9
g 0 0 0
series: 1 2 3
result would be
Index a b c
d 2 4 6
e 5 7 9
f 8 10 12
g 1 2 3
You can use DataFrame.add or + with a NumPy array if the Series index values differ from the column names (a plain Series is aligned on labels, which is why df.add(series) produced the union of both label sets, 84 all-NaN columns):
#sample data from the question
df = pd.DataFrame({'a':[1,4,7,0],
                   'b':[2,5,8,0],
                   'c':[3,6,9,0]}, index=list('defg'))
s = pd.Series([1,2,3])
#to_numpy() drops the index, so values are added positionally
df = df.add(s.to_numpy())
#alternative
#df = df + s.to_numpy()
print (df)
a b c
d 2 4 6
e 5 7 9
f 8 10 12
g 1 2 3
Or assign the column names to the Series index, so that add aligns on labels:
s = pd.Series([1,2,3])
s.index = df.columns
df = df.add(s)
#alternative
#df = df + s
print (df)
a b c
d 2 4 6
e 5 7 9
f 8 10 12
g 1 2 3
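For reference, a minimal sketch of why the original df.add(series) call produced all-NaN values (assuming the Series had its default integer index):
df = pd.DataFrame([[1, 2, 3]], columns=['a','b','c'])
s = pd.Series([1, 2, 3])
#add aligns on labels: column names {'a','b','c'} vs Series index {0,1,2}
#nothing overlaps, so the result holds the union of both label sets, all NaN
print (df.add(s))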
I have a dataframe with one column and I would like to get a Dataframe with N columns all of which will be identical to the first one. I can simply duplicate it by:
df[['new column name']] = df[['column name']]
but I have to make more than 1000 identical columns, which is why that doesn't work. One important thing: the figures in the columns should change, for instance if the first column is 0, the nth column is n and the previous one is n-1.
If it's a single column, you can transpose it, replicate it with pd.concat, and transpose back to the original format. This avoids looping and should be faster. You can then change the column names in a second line, without touching the data itself, which would be the most expensive part performance-wise:
import pandas as pd
df = pd.DataFrame({'Column':[1,2,3,4,5]})
Original dataframe:
Column
0 1
1 2
2 3
3 4
4 5
df = pd.concat([df.T]*1000).T
Output:
Column Column Column Column ... Column Column Column Column
0 1 1 1 1 ... 1 1 1 1
1 2 2 2 2 ... 2 2 2 2
2 3 3 3 3 ... 3 3 3 3
3 4 4 4 4 ... 4 4 4 4
4 5 5 5 5 ... 5 5 5 5
[5 rows x 1000 columns]
df.columns = ['Column'+'_'+str(i) for i in range(1000)]
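If, as the question adds, the nth copy should also be offset by n, here is a sketch using NumPy broadcasting (the Column_i names and sample values are assumptions for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Column':[1,2,3,4,5]})
n = 1000
#broadcasting: copy i gets i added to the original values
values = df['Column'].to_numpy()[:, None] + np.arange(n)
df = pd.DataFrame(values, columns=['Column_' + str(i) for i in range(n)])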
Say that you have a df with a column named 'company_name' that consists of 8 companies:
df = pd.DataFrame({"company_name":{"0":"Telia","1":"Proximus","2":"Tmobile","3":"Orange","4":"Telefonica","5":"Verizon","6":"AT&T","7":"Koninklijke"}})
company_name
0 Telia
1 Proximus
2 Tmobile
3 Orange
4 Telefonica
5 Verizon
6 AT&T
7 Koninklijke
You can use a loop and range to determine how many identical columns to create, and do:
for i in range(0,1000):
    df['company_name'+str(i)] = df['company_name']
which results in the shape of the df:
df.shape
(8, 1001)
i.e. it replicated the same column 1000 times. The names of the duplicated columns are the original name plus an increasing integer at the end:
'company_name', 'company_name0', 'company_name1', 'company_name2','company_name..N'
df
A B C
0 x x x
1 y x z
Duplicate column "C" 5 times using df.assign:
n = 5
df2 = df.assign(**{f'C{i}': df['C'] for i in range(1, n+1)})
df2
A B C C1 C2 C3 C4 C5
0 x x x x x x x x
1 y x z z z z z z
Set n to 1000 to get your desired output.
You can also directly assign the result back:
df[[f'C{i}' for i in range(1, n+1)]] = df[['C']*n].to_numpy()
df
A B C C1 C2 C3 C4 C5
0 x x x x x x x x
1 y x z z z z z z
I think the most efficient approach is to index with DataFrame.loc instead of using an outer loop:
#sample data reconstructed from the output below
df = pd.DataFrame({'column_duplicate': range(10), 'other': range(10, 20)})
n = 3
new_df = df.loc[:, ['column_duplicate']*n +
                   df.columns.difference(['column_duplicate']).tolist()]
print(new_df)
column_duplicate column_duplicate column_duplicate other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19
If you want to add a suffix:
suffix_tup = ('a', 'b', 'c')
not_dup_cols = df.columns.difference(['column_duplicate']).tolist()
new_df = (df.loc[:, ['column_duplicate']*len(suffix_tup) +
                    not_dup_cols]
            .set_axis(list(map(lambda suffix: f'column_duplicate_{suffix}',
                               suffix_tup)) +
                      not_dup_cols, axis=1)
          )
print(new_df)
column_duplicate_a column_duplicate_b column_duplicate_c other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19
or add an index
n = 3
not_dup_cols = df.columns.difference(['column_duplicate']).tolist()
new_df = (df.loc[:, ['column_duplicate']*n +
                    not_dup_cols]
            .set_axis(list(map(lambda x: f'column_duplicate_{x}', range(n))) +
                      not_dup_cols, axis=1)
          )
print(new_df)
column_duplicate_0 column_duplicate_1 column_duplicate_2 other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19
Given dataframe:
df = pd.DataFrame({'a':[1,2,4,5,6,8],
                   'b':[5,6,4,8,9,6],
                   'c':[6,3,3,7,8,4],
                   'd':[1,2,3,8,7,3],
                   'e':[3,2,4,4,6,2],
                   'f':[3,2,6,4,5,5]})
I want to divide/split df into several parts (2, 3, 4, ..., n parts).
Desired output:
df1 =
a b c d e f
0 1 5 6 1 3 3
1 2 6 3 2 2 2
df2 =
a b c d e f
2 4 4 3 3 4 6
3 5 8 7 8 4 4
df3 =
a b c d e f
4 6 9 8 7 6 5
5 8 6 4 3 2 5
UPDATE:
The real data does not divide into equal parts! It is 4351 rows × 3 columns.
Use qcut to split. How you want to store the pieces afterwards is up to you:
import pandas as pd
gp = df.groupby(pd.qcut(range(df.shape[0]), 3)) # N = 3
d = {f'df{i+1}': x[1] for i, x in enumerate(gp)}
d['df1']
# a b c d e f
#0 1 5 6 1 3 3
#1 2 6 3 2 2 2
Assuming your DataFrame can be evenly divided into n chunks:
import numpy as np

n = 3
dfs = [df.loc[i] for i in np.split(df.index, n)]
dfs is a list containing 3 dataframes.
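Given the update that the real data (4351 rows) does not divide evenly, a sketch using np.array_split, which, unlike np.split, allows unequal chunk sizes:
import numpy as np

n = 3
#the first chunks get one extra row each when len(df) is not divisible by n,
#e.g. 4351 rows -> chunks of 1451, 1450 and 1450 rows
dfs = [df.iloc[idx] for idx in np.array_split(np.arange(len(df)), n)]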
I have a data frame with a multi index and one column.
The index fields are type and amount; the column is called count.
I would like to add a column that multiplies amount and count
df2 = df.groupby(['type','amount']).count().copy()
# I then dropped all columns but one and renamed it to "count"
df2['total_amount'] = df2['count'].multiply(df2['amount'], axis='index')
doesn't work. I get a KeyError on 'amount'.
How do I access a part of the multi index to use it in calculations?
Use GroupBy.transform to get a Series of aggregated values with the same size as the original df (values may therefore repeat):
count = df.groupby(['type','amount'])['type'].transform('count')
df['total_amount'] = df['amount'].multiply(count, axis='index')
print (df)
A amount C D E type total_amount
0 a 4 7 1 5 a 8
1 b 5 8 3 3 a 5
2 c 4 9 5 6 a 8
3 d 5 4 7 9 b 10
4 e 5 2 1 2 b 10
5 f 4 3 0 4 b 4
Or:
df = pd.DataFrame({'A':list('abcdef'),
                   'amount':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'type':list('aaabbb')})
print (df)
A amount C D E type
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
df2 = df.groupby(['type','amount'])['type'].count().to_frame('count')
df2['total_amount'] = df2['count'].mul(df2.index.get_level_values('amount'))
print (df2)
count total_amount
type amount
a 4 2 8
5 1 5
b 4 1 4
5 2 10
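An alternative sketch for using an index level in calculations: move the levels back into columns with reset_index, compute, then restore the index:
df3 = df2.reset_index()
df3['total_amount'] = df3['count'] * df3['amount']
df3 = df3.set_index(['type','amount'])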
How can I merge the following two data frames on columns A and B:
df1
A B C
1 2 3
2 8 2
4 7 9
df2
A B C
5 6 7
2 8 9
The result should contain only the rows that match on both columns:
df3
A B C
2 8 2
2 8 9
You can concatenate them and drop the ones that are not duplicated:
conc = pd.concat([df1, df2])
conc[conc.duplicated(subset=['A', 'B'], keep=False)]
Out:
A B C
1 2 8 2
1 2 8 9
If you have duplicates,
df1
Out:
A B C
0 1 2 3
1 2 8 2
2 4 7 9
3 4 7 9
4 2 8 5
df2
Out:
A B C
0 5 6 7
1 2 8 9
3 5 6 4
4 2 8 10
You can keep track of the duplicated ones via boolean arrays:
cols = ['A', 'B']
bool1 = df1[cols].isin(df2[cols].to_dict('list')).all(axis=1)
bool2 = df2[cols].isin(df1[cols].to_dict('list')).all(axis=1)
pd.concat([df1[bool1], df2[bool2]])
Out:
A B C
1 2 8 2
4 2 8 5
1 2 8 9
4 2 8 10
A solution with Index.intersection: select the matching rows in both DataFrames by loc, and finally concat them together:
df1.set_index(['A','B'], inplace=True)
df2.set_index(['A','B'], inplace=True)
idx = df1.index.intersection(df2.index)
print (idx)
MultiIndex(levels=[[2], [8]],
labels=[[0], [0]],
names=['A', 'B'],
sortorder=0)
df = pd.concat([df1.loc[idx],df2.loc[idx]]).reset_index()
print (df)
A B C
0 2 8 2
1 2 8 9
Here is a less efficient method that preserves duplicates, but it involves two merges/joins:
# create a merged DataFrame with variables C_x and C_y with the C values
temp = pd.merge(df1, df2, how='inner', on=['A', 'B'])
# join columns A and B to a stacked DataFrame with the Cs on index
temp[['A', 'B']].join(
    pd.DataFrame({'C': temp[['C_x', 'C_y']].stack()
                       .reset_index(level=1, drop=True)})).reset_index(drop=True)
This returns
A B C
0 2 8 2
1 2 8 9
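A related sketch that also preserves duplicates: semi-join each frame against the unique matching key pairs, then concat (df1 and df2 with columns A, B, C as in the question):
keys = df1[['A','B']].merge(df2[['A','B']]).drop_duplicates()
df3 = pd.concat([df1.merge(keys), df2.merge(keys)], ignore_index=True)
print (df3)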