How can I concatenate two Series into one DataFrame?
For example, I have two Series:
a = pd.Series([1, 2, 3])
b = pd.Series([4, 5, 6])
And I want to get a DataFrame like:
pd.DataFrame([[1,4], [2,5], [3,6]])
Shortest would be:
pd.DataFrame([a,b]).T
Or:
pd.DataFrame(zip(a,b))
0 1
0 1 4
1 2 5
2 3 6
Or use concat:
>>> pd.concat([a,b],axis=1)
0 1
0 1 4
1 2 5
2 3 6
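If you want named columns rather than the default 0 and 1, one small variation (sketched here; the labels 'a' and 'b' are just illustrative) is to pass keys to concat:
# the keys become the column labels when concatenating along axis=1
pd.concat([a, b], axis=1, keys=['a', 'b'])
   a  b
0  1  4
1  2  5
2  3  6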
Or join:
>>> a.to_frame().join(b.to_frame(name=1))
0 1
0 1 4
1 2 5
2 3 6
Another, possibly faster, solution uses NumPy:
import numpy as np
pd.DataFrame(np.vstack((a, b)).T)
I would like to drop the rows whose value is an empty list [] from a given df:
df = pd.DataFrame(dict(a=[1, 2, 4, [], 5]))
The expected output would be:
a
0 1
1 2
2 4
3 5
Edit:
Or, to make things more interesting: what if we have two columns, and any cell containing [] should cause the row to be dropped?
df = pd.DataFrame(dict(a=[1, 2, 4, [], 5], b=[2, [], 1, [], 6]))
One way is to get the string repr and filter:
df = df[df['a'].map(repr)!='[]']
Output:
a
0 1
1 2
2 4
4 5
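Note that the filter keeps the original index (0, 1, 2, 4). If you prefer the consecutive 0..n index shown in the question's expected output, a reset_index(drop=True) after filtering should give it:
out = df[df['a'].map(repr) != '[]'].reset_index(drop=True)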
For multiple columns, we could apply the above:
out = df[df.apply(lambda c: c.map(repr)).ne('[]').all(axis=1)]
Output:
a b
0 1 2
2 4 1
4 5 6
You can't use equality directly as pandas will try to align a Series and a list, but you can use isin:
df[~df['a'].isin([[]])]
output:
a
0 1
1 2
2 4
4 5
To act on all columns:
df[~df.isin([[]]).any(axis=1)]
output:
a b
0 1 2
2 4 1
4 5 6
I am looking to update the values in a pandas series that satisfy a certain condition and take the corresponding value from another column.
Specifically, I want to look at the subcluster column and if the value equals 1, I want the record to update to the corresponding value in the cluster column.
For example:
Cluster  Subcluster
3        1
3        2
3        1
3        4
4        1
4        2
Should result in this:
Cluster  Subcluster
3        3
3        2
3        3
3        4
4        4
4        2
I've been trying to use apply and a lambda function, but can't seem to get it to work properly. Any advice would be greatly appreciated. Thanks!
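For reference, a minimal sketch of the example data (column names taken from the tables above):
import pandas as pd

df = pd.DataFrame({'Cluster':    [3, 3, 3, 3, 4, 4],
                   'Subcluster': [1, 2, 1, 4, 1, 2]})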
You can use np.where:
import numpy as np
df['Subcluster'] = np.where(df['Subcluster'].eq(1), df['Cluster'], df['Subcluster'])
Output:
Cluster Subcluster
0 3 3
1 3 2
2 3 3
3 3 4
4 4 4
5 4 2
In your case, try mask:
df.Subcluster.mask(lambda x: x == 1, df.Cluster, inplace=True)
df
Out[12]:
Cluster Subcluster
0 3 3
1 3 2
2 3 3
3 3 4
4 4 4
5 4 2
Or
df.loc[df.Subcluster==1,'Subcluster'] = df['Cluster']
Really all you need here is .loc with a boolean mask (you don't even need a separate mask variable; you could apply it inline, as shown after the output below):
import numpy as np
import pandas as pd

df = pd.DataFrame({'cluster': np.random.randint(0, 10, 10),
                   'subcluster': np.random.randint(0, 3, 10)})
df.to_clipboard(sep=',')
df at this point
,cluster,subcluster
0,8,0
1,5,2
2,6,2
3,6,1
4,8,0
5,1,1
6,0,0
7,6,0
8,1,0
9,3,1
create and apply the mask (you could do this all in one line)
mask = df.subcluster == 1
df.loc[mask,'subcluster'] = df.loc[mask,'cluster']
df.to_clipboard(sep=',')
final output:
,cluster,subcluster
0,8,0
1,5,2
2,6,2
3,6,6
4,8,0
5,1,1
6,0,0
7,6,0
8,1,0
9,3,3
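For completeness, the one-line version mentioned above (mask built inline) would look like:
df.loc[df.subcluster == 1, 'subcluster'] = df['cluster']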
Here's the lambda you couldn't write. In the lambda, x corresponds to a row (because of axis=1), so you can use it to refer to specific columns of that row.
df['Subcluster'] = df.apply(lambda x: x['Cluster'] if x['Subcluster'] == 1 else x['Subcluster'], axis = 1)
And the output:
Cluster Subcluster
0 3 3
1 3 2
2 3 3
3 3 4
4 4 4
5 4 2
I am using apply to leverage one dataframe to manipulate a second dataframe and return results. Here is a simplified example that I realize could be more easily answered with "in" logic, but for now let's keep the use of .apply() as a constraint:
import pandas as pd
df1 = pd.DataFrame({'Name':['A','B'],'Value':range(1,3)})
df2 = pd.DataFrame({'Name':['A']*3+['B']*4+['C'],'Value':range(1,9)})
def filter_df(x, df):
    return df[df['Name'] == x['Name']]
df1.apply(filter_df, axis=1, args=(df2, ))
Which is returning:
0 Name Value
0 A 1
1 A 2
2 ...
1 Name Value
3 B 4
4 B 5
5 ...
dtype: object
What I would like to see instead is one formatted DataFrame with Name and Value headers, like the following. All advice appreciated!
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7
In my opinion, this cannot be done with apply alone; you need pandas.concat:
result = pd.concat(df1.apply(filter_df, axis=1, args=(df2,)).to_list())
print(result)
Output
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7
I have an N×M dataframe and an N×L numpy matrix. I'd like to add the matrix to the dataframe as L new columns, simply appending the columns in the same row order they appear. I tried merge() and join(), but I end up with errors:
assign() keywords must be strings
and
columns overlap but no suffix specified
respectively.
Is there a way I can add a numpy matrix as dataframe columns?
You can turn the matrix into a dataframe and use concat with axis=1:
For example, given a dataframe df and a numpy array mat:
>>> df
a b
0 5 5
1 0 7
2 1 0
3 0 4
4 6 4
>>> mat
array([[0.44926098, 0.29567859, 0.60728561],
[0.32180566, 0.32499134, 0.94950085],
[0.64958125, 0.00566706, 0.56473627],
[0.17357589, 0.71053224, 0.17854188],
[0.38348102, 0.12440952, 0.90359566]])
You can do:
>>> pd.concat([df, pd.DataFrame(mat)], axis=1)
a b 0 1 2
0 5 5 0.449261 0.295679 0.607286
1 0 7 0.321806 0.324991 0.949501
2 1 0 0.649581 0.005667 0.564736
3 0 4 0.173576 0.710532 0.178542
4 6 4 0.383481 0.124410 0.903596
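Note that concat aligns on the index, so this relies on df having the default 0..N-1 RangeIndex. If it doesn't, one way (as a sketch; the column names 'c', 'd', 'e' are just placeholders) is to build the new frame with the same index and, optionally, column names:
pd.concat([df, pd.DataFrame(mat, index=df.index, columns=['c', 'd', 'e'])], axis=1)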
Setup
df = pd.DataFrame({'a': [5,0,1,0,6], 'b': [5,7,0,4,4]})
mat = np.random.rand(5,3)
Using join:
df.join(pd.DataFrame(mat))
a b 0 1 2
0 5 5 0.884061 0.803747 0.727161
1 0 7 0.464009 0.447346 0.171881
2 1 0 0.353604 0.912781 0.199477
3 0 4 0.466095 0.136218 0.405766
4 6 4 0.764678 0.874614 0.310778
If there is the chance of overlapping column names, simply supply a suffix:
df = pd.DataFrame({0: [5,0,1,0,6], 1: [5,7,0,4,4]})
mat = np.random.rand(5,3)
df.join(pd.DataFrame(mat), rsuffix='_')
0 1 0_ 1_ 2
0 5 5 0.783722 0.976951 0.563798
1 0 7 0.946070 0.391593 0.273339
2 1 0 0.710195 0.827352 0.839212
3 0 4 0.528824 0.625430 0.465386
4 6 4 0.848423 0.467256 0.962953
I'm trying to replace rows in a dataframe with rows from another dataframe whenever they share the same value in a common column ('no' below).
Here is the first dataframe:
index no foo
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
and the second dataframe:
index no foo
0 2 aaa
1 3 bbb
2 22 3
3 33 4
4 44 5
5 55 6
I'd like my result to be
index no foo
0 0 1
1 1 2
2 2 aaa
3 3 bbb
4 4 5
5 5 6
An inner merge between the two dataframes returns the correct rows, but I'm having trouble inserting them at the correct index in the first dataframe.
Any help would be greatly appreciated.
Thank you.
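For reference, a minimal sketch of the two frames (named df1 and df2 as in the first answer below, and treating the index column as the regular row index):
import pandas as pd

df1 = pd.DataFrame({'no': [0, 1, 2, 3, 4, 5],
                    'foo': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'no': [2, 3, 22, 33, 44, 55],
                    'foo': ['aaa', 'bbb', 3, 4, 5, 6]})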
This should work as well; the r['foo_y'] == r['foo_y'] comparison is just a NaN check (NaN is not equal to itself), so rows without a match in df2 keep their original foo_x value:
df1['foo'] = pd.merge(df1, df2, on='no', how='left').apply(lambda r: r['foo_y'] if r['foo_y'] == r['foo_y'] else r['foo_x'], axis=1)
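Equivalently, a sketch that skips the row-wise apply and fills the unmatched values directly (assuming 'no' is unique in df2, so the left merge keeps one row per row of df1):
merged = pd.merge(df1, df2, on='no', how='left')
df1['foo'] = merged['foo_y'].fillna(merged['foo_x'])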
You could use apply, though there is probably a better way than this:
In [67]:
# define a function that takes a row and tries to find a match
# (in this answer, df is the first frame and df1 is the second)
def func(x):
    # check whether the 'no' value matches; test the length of the resulting series
    if len(df1.loc[df1.no == x.no, 'foo']) > 0:
        return df1.loc[df1.no == x.no, 'foo'].values[0]  # return the first matching value
    else:
        return x.foo  # no match, so keep the existing value

# call apply with a lambda, row-wise (axis=1 means row-wise)
df.foo = df.apply(lambda row: func(row), axis=1)
df
Out[67]:
index no foo
0 0 0 1
1 1 1 2
2 2 2 aaa
3 3 3 bbb
4 4 4 5
5 5 5 6
[6 rows x 3 columns]