I am trying to sum every row of a dataframe with a series.
I have a dataframe with [107 rows and 42 columns] and a series of length 42. I would like to sum every row with the series such that every column in the dataframe would have the same number added to it. I tried df.add(series) but the result was a dataframe with 107 rows and 84 columns with all NaN values.
For example
dataframe:
Index a b c
d 1 2 3
e 4 5 6
f 7 8 9
g 0 0 0
series: 1 2 3
result would be
Index a b c
d 2 4 6
e 5 7 9
f 8 10 12
g 1 2 3
You can use DataFrame.add or + with a NumPy array if the Series index values differ from the column names:
s = pd.Series([1,2,3])
df = df.add(s.to_numpy())
#alternative
#df = df + s.to_numpy()
print (df)
a b c
d 2 4 6
e 5 7 9
f 8 10 12
g 1 2 3
s = pd.Series([1,2,3])
s.index = df.columns
df = df.add(s)
#alternative
#df = df + s
print (df)
a b c
d 2 4 6
e 5 7 9
f 8 10 12
g 1 2 3
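For context, a minimal sketch of why the original df.add(series) call produced all-NaN columns: pandas aligns on labels, and the Series' default integer index (0, 1, 2, ...) shares no labels with the DataFrame's column names, so the result is the union of both label sets with nothing to add (hypothetical small data, same idea as the 42-column case):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 4], 'b': [2, 5], 'c': [3, 6]}, index=['d', 'e'])
s = pd.Series([1, 2, 3])  # index 0, 1, 2 - no overlap with df.columns

# Label alignment: columns become the union {a, b, c, 0, 1, 2}, all NaN
misaligned = df.add(s)

# Positional addition via the underlying array works as intended
aligned = df.add(s.to_numpy())

print(misaligned.shape)   # (2, 6)
print(aligned)
```

This is also why the 42-column frame came back with 84 columns: 42 original labels plus 42 unmatched integer labels.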
Related
Given dataframe:
df = pd.DataFrame({'a':[1,2,4,5,6,8],
'b':[5,6,4,8,9,6],
'c':[6,3,3,7,8,4],
'd':[1,2,3,8,7,3],
'e':[3,2,4,4,6,2],
'f':[3,2,6,4,5,5]})
I want to divide/split df into several parts (2, 3, 4, ... n parts)
Desired output:
df1 =
a b c d e f
0 1 5 6 1 3 3
1 2 6 3 2 2 2
df2 =
a b c d e f
2 4 4 3 3 4 6
3 5 8 7 8 4 4
df3 =
a b c d e f
4 6 9 8 7 6 5
5 8 6 4 3 2 5
UPDATED
The real data does not divide evenly!
Real data: 4351 rows × 3 columns
Use qcut to split; how you store the results afterwards is up to you:
import pandas as pd
gp = df.groupby(pd.qcut(range(df.shape[0]), 3)) # N = 3
d = {f'df{i+1}': x[1] for i, x in enumerate(gp)}
d['df1']
# a b c d e f
#0 1 5 6 1 3 3
#1 2 6 3 2 2 2
Assuming your DataFrame can be evenly divided into n chunks:
import numpy as np

n = 3
dfs = [df.loc[i] for i in np.split(df.index, n)]
dfs is a list containing 3 dataframes.
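For the updated case where the row count (e.g. 4351) is not evenly divisible, np.array_split accepts unequal chunks where np.split would raise; a sketch with a small hypothetical frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(7)})  # 7 rows, not divisible by 3

n = 3
# array_split hands back n position arrays whose sizes differ by at most 1
dfs = [df.iloc[idx] for idx in np.array_split(np.arange(len(df)), n)]

print([len(x) for x in dfs])  # [3, 2, 2]
```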
I have a data frame with a multi index and one column.
Index fields are type and amount, the column is called count
I would like to add a column that multiplies amount and count
df2 = df.groupby(['type','amount']).count().copy()
# I then dropped all columns but one and renamed it to "count"
df2['total_amount'] = df2['count'].multiply(df2['amount'], axis='index')
doesn't work. I get a key error on amount.
How do I access a part of the multi index to use it in calculations?
Use GroupBy.transform to get a Series of aggregated values with the same length as the original df (so values may repeat), then multiply:
count = df.groupby(['type','amount'])['type'].transform('count')
df['total_amount'] = df['amount'].multiply(count, axis='index')
print (df)
A amount C D E type total_amount
0 a 4 7 1 5 a 8
1 b 5 8 3 3 a 5
2 c 4 9 5 6 a 8
3 d 5 4 7 9 b 10
4 e 5 2 1 2 b 10
5 f 4 3 0 4 b 4
Or:
df = pd.DataFrame({'A':list('abcdef'),
'amount':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'type':list('aaabbb')})
print (df)
A amount C D E type
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
df2 = df.groupby(['type','amount'])['type'].count().to_frame('count')
df2['total_amount'] = df2['count'].mul(df2.index.get_level_values('amount'))
print (df2)
count total_amount
type amount
a 4 2 8
5 1 5
b 4 1 4
5 2 10
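Equivalently, you can avoid get_level_values by turning the MultiIndex levels back into ordinary columns with reset_index; a sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'A': list('abcdef'),
                   'amount': [4, 5, 4, 5, 5, 4],
                   'type': list('aaabbb')})

# Aggregate, then flatten the (type, amount) MultiIndex into regular
# columns so 'amount' is addressable like any other column
df2 = (df.groupby(['type', 'amount'])['type']
         .count()
         .to_frame('count')
         .reset_index())
df2['total_amount'] = df2['count'] * df2['amount']
print(df2)
```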
I'm trying to create a reusable function in Python 2.7 (pandas) to bin categorical variables, i.e. group low-frequency categories as 'other'. Can someone help me create a function for the below? col1, col2, etc. are different categorical variable columns.
##Reducing categories by binning categorical variables - column1
a = df.col1.value_counts()
#get top 5 values of index
vals = a[:5].index
df['col1_new'] = df.col1.where(df.col1.isin(vals), 'other')
df = df.drop(['col1'],axis=1)
##Reducing categories by binning categorical variables - column2
a = df.col2.value_counts()
#get top 6 values of index
vals = a[:6].index
df['col2_new'] = df.col2.where(df.col2.isin(vals), 'other')
df = df.drop(['col2'],axis=1)
You can use:
df = pd.DataFrame({'A':list('abcdefabcdefabffeg'),
'D':[1,3,5,7,1,0,1,3,5,7,1,0,1,3,5,7,1,0]})
print (df)
A D
0 a 1
1 b 3
2 c 5
3 d 7
4 e 1
5 f 0
6 a 1
7 b 3
8 c 5
9 d 7
10 e 1
11 f 0
12 a 1
13 b 3
14 f 5
15 f 7
16 e 1
17 g 0
def replace_under_top(df, c, n):
a = df[c].value_counts()
#get top n values of index
vals = a[:n].index
#assign columns back
df[c] = df[c].where(df[c].isin(vals), 'other')
#rename processed column
df = df.rename(columns={c : c + '_new'})
return df
Test:
df1 = replace_under_top(df, 'A', 3)
print (df1)
A_new D
0 other 1
1 b 3
2 other 5
3 other 7
4 e 1
5 f 0
6 other 1
7 b 3
8 other 5
9 other 7
10 e 1
11 f 0
12 other 1
13 b 3
14 f 5
15 f 7
16 e 1
17 other 0
df2 = replace_under_top(df, 'D', 4)
print (df2)
A D_new
0 other 1
1 b 3
2 other 5
3 other 7
4 e 1
5 f other
6 other 1
7 b 3
8 other 5
9 other 7
10 e 1
11 f other
12 other 1
13 b 3
14 f 5
15 f 7
16 e 1
17 other other
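Note that the function above mutates the passed-in frame, which is why the second call's output already shows 'other' in column A. A sketch of a copy-based variant (same top-n logic, but repeated calls leave the original frame untouched; the sample data is hypothetical):

```python
import pandas as pd

def replace_under_top(df, c, n):
    # Work on a copy so repeated calls do not mutate the caller's frame
    out = df.copy()
    top = out[c].value_counts().index[:n]  # n most frequent values
    out[c] = out[c].where(out[c].isin(top), 'other')
    return out.rename(columns={c: c + '_new'})

df = pd.DataFrame({'A': list('aabbbcc')})
df1 = replace_under_top(df, 'A', 1)
print(df['A'].tolist())       # original unchanged
print(df1['A_new'].tolist())
```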
I have three lists, [1,4,3] , [2,5,6] , [9,8,7] which refer to a dataframe's series indices. I'm using each list to slice the dataframe into a smaller dataframe for batch data processing. After the processing, I want to recombine the dataframes into the original dataframe, preserving the order of the columns.
df_1 = df.iloc[:,list1]
#carry out preprocessing
df_2 = df.iloc[:,list2]
#carry out preprocessing
df_3 = df.iloc[:,list3]
#carry out preprocessing
#join the frames back together
frames = [df_1,df_2,df_3]
df = pd.concat(frames, axis = 1)
Is there a simple way to concat and preserve the original order of the series? i.e. [1,2,3,4,5,6,7,8,9]
I think not; you need sort_index to sort the column names:
df = pd.concat(frames, axis = 1).sort_index(axis=1)
If you want them sorted by index positions:
L = list1 + list2 + list3
df1 = pd.concat(frames, axis = 1).reindex(columns=df.columns[sorted(L)])
Or sort inside iloc:
df_1 = df.iloc[:,sorted(list1)]
#carry out preprocessing
df_2 = df.iloc[:,sorted(list2)]
#carry out preprocessing
df_3 = df.iloc[:,sorted(list3)]
#carry out preprocessing
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,10)), columns=list('EFGHIJABCD'))
print (df)
E F G H I J A B C D
0 8 8 3 7 7 0 4 2 5 2
1 2 2 1 0 8 4 0 9 6 2
2 4 1 5 3 4 4 3 7 1 1
3 7 7 0 2 9 9 3 2 5 8
4 1 0 7 6 2 0 8 2 5 1
list1 = [1,4,3]
list2 = [2,5,6]
list3 = [9,8,7]
df_1 = df.iloc[:,list1]
#carry out preprocessing
df_2 = df.iloc[:,list2]
#carry out preprocessing
df_3 = df.iloc[:,list3]
#carry out preprocessing
#join the frames back together
frames = [df_1,df_2,df_3]
L = list1 + list2 + list3
df1 = pd.concat(frames, axis = 1).reindex(columns=df.columns[sorted(L)])
print (df1)
F G H I J A B C D
0 8 3 7 7 0 4 2 5 2
1 2 1 0 8 4 0 9 6 2
2 1 5 3 4 4 3 7 1 1
3 7 0 2 9 9 3 2 5 8
4 0 7 6 2 0 8 2 5 1
df2 = pd.concat(frames, axis = 1).sort_index(axis=1)
print (df2)
A B C D F G H I J
0 4 2 5 2 8 3 7 7 0
1 0 9 6 2 2 1 0 8 4
2 3 7 1 1 1 5 3 4 4
3 3 2 5 8 7 0 2 9 9
4 8 2 5 1 0 7 6 2 0
EDIT:
If the column names are the same as the values in list L:
L.sort()
df = df[L]
Or:
df = df[sorted(L)]
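A sketch of that edit with hypothetical integer column labels: when the labels are literally the integers in list1/list2/list3, reordering is plain label-based selection:

```python
import pandas as pd

# Columns are labeled by the integers themselves, not positions
df = pd.DataFrame([[10, 20, 30]], columns=[4, 1, 3])
list1 = [1, 4, 3]

out = df[sorted(list1)]  # select by label in ascending order
print(out.columns.tolist())  # [1, 3, 4]
```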
I have a dataset D with columns from A to Z, 26 columns in total. I have done some tests and identified which columns are useful to me in a series S.
D #Dataset with columns from A - Z
S
B 0.78
C 1.04
H 2.38
S has the columns and a value associated with each, so I now know their importance and would like to keep only those columns in the dataset, e.g. (B, C, D). How can I do it?
IIUC you can use:
cols = ['B','C','D']
df = df[cols]
Or if the column names are the values of a Series:
S = pd.Series(['B','C','D'])
df = df[S]
Sample:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
S = pd.Series(['B','C','D'])
print (S)
0 B
1 C
2 D
dtype: object
print (df[S])
B C D
0 4 7 1
1 5 8 3
2 6 9 5
Or if they are the index values:
S = pd.Series([1,2,3], index=['B','C','D'])
print (S)
B 1
C 2
D 3
dtype: int64
print (df[S.index])
B C D
0 4 7 1
1 5 8 3
2 6 9 5
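A related option is DataFrame.filter, which silently skips labels that are absent from the frame, so it will not raise a KeyError if S mentions a column the dataset lacks (hypothetical 'Z' below illustrates this):

```python
import pandas as pd

df = pd.DataFrame({'A': [1], 'B': [4], 'C': [7], 'D': [1]})
S = pd.Series([1, 2, 3], index=['B', 'C', 'Z'])  # 'Z' not in df

# filter keeps only the labels that actually exist, in the given order
kept = df.filter(items=S.index)
print(kept.columns.tolist())  # ['B', 'C']
```

By contrast, df[S.index] would raise on the missing 'Z', so filter is the forgiving choice when the importance series may be stale.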