I have a NxM dataframe and a NxL numpy matrix. I'd like to add the matrix to the dataframe to create L new columns by simply appending the columns and rows the same order they appear. I tried merge() and join(), but I end up with errors:
assign() keywords must be strings
and
columns overlap but no suffix specified
respectively.
Is there a way I can add a numpy matrix as dataframe columns?
You can turn the matrix into a datframe and use concat with axis=1:
For example, given a dataframe df and a numpy array mat:
>>> df
a b
0 5 5
1 0 7
2 1 0
3 0 4
4 6 4
>>> mat
array([[0.44926098, 0.29567859, 0.60728561],
[0.32180566, 0.32499134, 0.94950085],
[0.64958125, 0.00566706, 0.56473627],
[0.17357589, 0.71053224, 0.17854188],
[0.38348102, 0.12440952, 0.90359566]])
You can do:
>>> pd.concat([df, pd.DataFrame(mat)], axis=1)
a b 0 1 2
0 5 5 0.449261 0.295679 0.607286
1 0 7 0.321806 0.324991 0.949501
2 1 0 0.649581 0.005667 0.564736
3 0 4 0.173576 0.710532 0.178542
4 6 4 0.383481 0.124410 0.903596
Setup
df = pd.DataFrame({'a': [5,0,1,0,6], 'b': [5,7,0,4,4]})
mat = np.random.rand(5,3)
Using join:
df.join(pd.DataFrame(mat))
a b 0 1 2
0 5 5 0.884061 0.803747 0.727161
1 0 7 0.464009 0.447346 0.171881
2 1 0 0.353604 0.912781 0.199477
3 0 4 0.466095 0.136218 0.405766
4 6 4 0.764678 0.874614 0.310778
If there is the chance of overlapping column names, simply supply a suffix:
df = pd.DataFrame({0: [5,0,1,0,6], 1: [5,7,0,4,4]})
mat = np.random.rand(5,3)
df.join(pd.DataFrame(mat), rsuffix='_')
0 1 0_ 1_ 2
0 5 5 0.783722 0.976951 0.563798
1 0 7 0.946070 0.391593 0.273339
2 1 0 0.710195 0.827352 0.839212
3 0 4 0.528824 0.625430 0.465386
4 6 4 0.848423 0.467256 0.962953
Related
I would like to drop the [] for a given df
df=pd.DataFrame(dict(a=[1,2,4,[],5]))
Such that the expected output will be
a
0 1
1 2
2 4
3 5
Edit:
or to make thing more interesting, what if we have two columns and some of the cell is with [] to be dropped.
df=pd.DataFrame(dict(a=[1,2,4,[],5],b=[2,[],1,[],6]))
One way is to get the string repr and filter:
df = df[df['a'].map(repr)!='[]']
Output:
a
0 1
1 2
2 4
4 5
For multiple columns, we could apply the above:
out = df[df.apply(lambda c: c.map(repr)).ne('[]').all(axis=1)]
Output:
a b
0 1 2
2 4 1
4 5 6
You can't use equality directly as pandas will try to align a Series and a list, but you can use isin:
df[~df['a'].isin([[]])]
output:
a
0 1
1 2
2 4
4 5
To act on all columns:
df[~df.isin([[]]).any(1)]
output:
a b
0 1 2
2 4 1
4 5 6
Given a df
a b ngroup
0 1 3 0
1 1 4 0
2 1 1 0
3 3 7 2
4 4 4 2
5 1 1 4
6 2 2 4
7 1 1 4
8 6 6 5
I would like to compute the summation of multiple columns (i.e., a and b) grouped by the column ngroup.
In addition, I would like to count the number of element for each of the group.
Based on these two condition, the expected output as below
a b nrow_same_group ngroup
3 8 3 0
7 11 2 2
4 4 3 4
6 6 1 5
The following code should do the work
import pandas as pd
df=pd.DataFrame(list(zip([1,1,1,3,4,1,2,1,6,10],
[3,4,1,7,4,1,2,1,6,1],
[0,0,0,2,2,4,4,4,5])),columns=['a','b','ngroup'])
grouped_df = df.groupby(['ngroup'])
df1 = grouped_df[['a','b']].agg('sum').reset_index()
df2 = df['ngroup'].value_counts().reset_index()
df2.sort_values('index', axis=0, ascending=True, inplace=True, kind='quicksort', na_position='last')
df2.reset_index(drop=True, inplace=True)
df2.rename(columns={'index':'ngroup','ngroup':'nrow_same_group'},inplace=True)
df= pd.merge(df1, df2, on=['ngroup'])
However, I wonder whether there exist built-in pandas that achieve something similar, in single line.
You can do it using only groupby + agg.
import pandas as pd
df=pd.DataFrame(list(zip([1,1,1,3,4,1,2,1,6,10],
[3,4,1,7,4,1,2,1,6,1],
[0,0,0,2,2,4,4,4,5])),columns=['a','b','ngroup'])
res = (
df.groupby('ngroup', as_index=False)
.agg(a=('a','sum'), b=('b', 'sum'),
nrow_same_group=('a', 'size'))
)
Here the parameters passed to agg are tuples whose first element is the column to aggregate and the second element is the aggregation function to apply to that column. The parameter names are the labels for the resulting columns.
Output:
>>> res
ngroup a b nrow_same_group
0 0 3 8 3
1 2 7 11 2
2 4 4 4 3
3 5 6 6 1
First aggregate a, b with sum then calculate size of each group and assign this to nrow_same_group column
g = df.groupby('ngroup')
g.sum().assign(nrow_same_group=g.size())
a b nrow_same_group
ngroup
0 3 8 3
2 7 11 2
4 4 4 3
5 6 6 1
I tried to create a data frame df using the below code :
import numpy as np
import pandas as pd
index = [0,1,2,3,4,5]
s = pd.Series([1,2,3,4,5,6],index= index)
t = pd.Series([2,4,6,8,10,12],index= index)
df = pd.DataFrame(s,columns = ["MUL1"])
df["MUL2"] =t
print df
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
While trying to create the same data frame using the below syntax, I am getting a wierd output.
df = pd.DataFrame([s,t],columns = ["MUL1","MUL2"])
print df
MUL1 MUL2
0 NaN NaN
1 NaN NaN
Please explain why the NaN is being displayed in the dataframe when both the Series are non empty and why only two rows are getting displayed and no the rest.
Also provide the correct way to create the data frame same as has been mentioned above by using the columns argument in the pandas DataFrame method.
One of the correct ways would be to stack the array data from the input list holding those series into columns -
In [161]: pd.DataFrame(np.c_[s,t],columns = ["MUL1","MUL2"])
Out[161]:
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
Behind the scenes, the stacking creates a 2D array, which is then converted to a dataframe. Here's what the stacked array looks like -
In [162]: np.c_[s,t]
Out[162]:
array([[ 1, 2],
[ 2, 4],
[ 3, 6],
[ 4, 8],
[ 5, 10],
[ 6, 12]])
If remove columns argument get:
df = pd.DataFrame([s,t])
print (df)
0 1 2 3 4 5
0 1 2 3 4 5 6
1 2 4 6 8 10 12
Then define columns - if columns not exist get NaNs column:
df = pd.DataFrame([s,t], columns=[0,'MUL2'])
print (df)
0 MUL2
0 1.0 NaN
1 2.0 NaN
Better is use dictionary:
df = pd.DataFrame({'MUL1':s,'MUL2':t})
print (df)
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
And if need change columns order add columns parameter:
df = pd.DataFrame({'MUL1':s,'MUL2':t}, columns=['MUL2','MUL1'])
print (df)
MUL2 MUL1
0 2 1
1 4 2
2 6 3
3 8 4
4 10 5
5 12 6
More information is in dataframe documentation.
Another solution by concat - DataFrame constructor is not necessary:
df = pd.concat([s,t], axis=1, keys=['MUL1','MUL2'])
print (df)
MUL1 MUL2
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
5 6 12
A pandas.DataFrame takes in the parameter data that can be of type ndarray, iterable, dict, or dataframe.
If you pass in a list it will assume each member is a row. Example:
a = [1,2,3]
b = [2,4,6]
df = pd.DataFrame([a, b], columns = ["Col1","Col2", "Col3"])
# output 1:
Col1 Col2 Col3
0 1 2 3
1 2 4 6
You are getting NaN because it expects index = [0,1] but you are giving [0,1,2,3,4,5]
To get the shape you want, first transpose the data:
data = np.array([a, b]).transpose()
How to create a pandas dataframe
import pandas as pd
a = [1,2,3]
b = [2,4,6]
df = pd.DataFrame(dict(Col1=a, Col2=b))
Output:
Col1 Col2
0 1 2
1 2 4
2 3 6
This sounds a bit weird, but I think that's exactly what I needed now:
I got several pandas dataframes that contains columns with float numbers, for example:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
Now I want to add a column, with only one row, and the value is equal to the average of column 'a', in this case, is 3.0. So the new dataframe will looks like this:
a b c average
0 0 1 2 3.0
1 3 4 5
2 6 7 8
And all the rows below are empty.
I've tried things like df['average'] = np.mean(df['a']) but that give me a whole column of 3.0. Any help will be appreciated.
Assign a series, this is cleaner.
df['average'] = pd.Series(df['a'].mean(), index=df.index[[0]])
Or, even better, assign with loc:
df.loc[df.index[0], 'average'] = df['a'].mean().item()
Filling NaNs is straightforward, you can do
df['average'] = df['average'].fillna('')
df
a b c average
0 0 1 2 3
1 3 4 5
2 6 7 8
Can do something like:
df['average'] = [np.mean(df['a'])]+['']*(len(df)-1)
Here is a full example:
import pandas as pd
import numpy as np
df = pd.DataFrame(
[(0,1,2), (3,4,5), (6,7,8)],
columns=['a', 'b', 'c'])
print(df)
a b c
0 0 1 2
1 3 4 5
2 6 7 8
df['average'] = ''
df['average'][0] = df['a'].mean()
print(df)
a b c average
0 0 1 2 3
1 3 4 5
2 6 7 8
The question was originally asked here as a comment but could not get a proper answer as the question was marked as a duplicate.
For a given pandas.DataFrame, let us say
df = DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3, 5]})
df
A B
0 5 1
1 6 2
2 3 3
3 4 5
How can we select rows from a list, based on values in a column ('A' for instance)
For instance
# from
list_of_values = [3,4,6]
# we would like, as a result
# A B
# 2 3 3
# 3 4 5
# 1 6 2
Using isin as mentioned here is not satisfactory as it does not keep order from the input list of 'A' values.
How can the abovementioned goal be achieved?
One way to overcome this is to make the 'A' column an index and use loc on the newly generated pandas.DataFrame. Eventually, the subsampled dataframe's index can be reset.
Here is how:
ret = df.set_index('A').loc[list_of_values].reset_index(inplace=False)
# ret is
# A B
# 0 3 3
# 1 4 5
# 2 6 2
Note that the drawback of this method is that the original indexing has been lost in the process.
More on pandas indexing: What is the point of indexing in pandas?
Use merge with helper DataFrame created by list and with column name of matched column:
df = pd.DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3,5]})
list_of_values = [3,6,4]
df1 = pd.DataFrame({'A':list_of_values}).merge(df)
print (df1)
A B
0 3 3
1 6 2
2 4 5
For more general solution:
df = pd.DataFrame({'A' : [5,6,5,3,4,4,6,5], 'B':range(8)})
print (df)
A B
0 5 0
1 6 1
2 5 2
3 3 3
4 4 4
5 4 5
6 6 6
7 5 7
list_of_values = [6,4,3,7,7,4]
#create df from list
list_df = pd.DataFrame({'A':list_of_values})
print (list_df)
A
0 6
1 4
2 3
3 7
4 7
5 4
#column for original index values
df1 = df.reset_index()
#helper column for count duplicates values
df1['g'] = df1.groupby('A').cumcount()
list_df['g'] = list_df.groupby('A').cumcount()
#merge together, create index from column and remove g column
df = list_df.merge(df1).set_index('index').rename_axis(None).drop('g', axis=1)
print (df)
A B
1 6 1
4 4 4
3 3 3
5 4 5
1] Generic approach for list_of_values.
In [936]: dff = df[df.A.isin(list_of_values)]
In [937]: dff.reindex(dff.A.map({x: i for i, x in enumerate(list_of_values)}).sort_values().index)
Out[937]:
A B
2 3 3
3 4 5
1 6 2
2] If list_of_values is sorted. You can use
In [926]: df[df.A.isin(list_of_values)].sort_values(by='A')
Out[926]:
A B
2 3 3
3 4 5
1 6 2