I'd like to be able to do one hot encoding on a data frame based on levels from another data frame.
For instance, in the example below data provides the levels for two variables. Based on those levels only, I want to create dummy variables in data2.
How can I go about this?
import pandas as pd

# unique levels (A, B for VAR1, and X, Y, Z for VAR2) in
# this dataset determine the possible levels for the following dataset
data = {'VAR1': ['A', 'A', 'A', 'A', 'B', 'B'],
        'VAR2': ['X', 'Y', 'Y', 'Y', 'X', 'Z']}
frame = pd.DataFrame(data)

# data2 contains the same variables as data, but might or might not
# contain the same levels
data2 = {'VAR1': ['A', 'C'],
         'VAR2': ['X', 'Y']}
frame2 = pd.DataFrame(data2)

# after applying one-hot encoding to data2, this is what it should look like
data_final = {
    'A': ['1', '0'],
    'B': ['0', '0'],
    'X': ['1', '0'],
    'Y': ['0', '1'],
    'Z': ['0', '0'],
}
frame_final = pd.DataFrame(data_final)
There are probably a lot of ways to achieve this. For whatever reason I'm drawn to this approach:
In [74]: part = pd.concat([pd.get_dummies(frame2[x]) for x in frame2], axis=1)
In [75]: part
Out[75]:
   A  C  X  Y
0  1  0  1  0
1  0  1  0  1
You can see we are already almost there; the only missing columns are those that don't show up anywhere in frame2, B and Z. Again, there would be multiple ways to get these added in (I'd be curious to hear of any you think are more suitable), but I wanted to use the reindex_axis method. To use this, we need another index containing all the possible values.
In [76]: idx = pd.Index(np.ravel(frame.values)).unique()
In [77]: idx
Out[77]: array(['A', 'X', 'Y', 'B', 'Z'], dtype=object)
Finally reindex and fill the NaNs with 0:
In [78]: part.reindex_axis(idx, axis=1).fillna(0)
Out[78]:
   A  X  Y  B  Z
0  1  1  0  0  0
1  0  0  1  0  0
You can sort if necessary.
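Note that reindex_axis has since been deprecated and removed in newer pandas versions; the same idea works with reindex. A minimal sketch, assuming the frame and frame2 definitions above and numpy imported as np:
import numpy as np

idx = pd.Index(np.ravel(frame.values)).unique()
part = pd.concat([pd.get_dummies(frame2[col]) for col in frame2], axis=1)
# reindex(columns=...) plays the role of the removed reindex_axis(idx, axis=1);
# fill_value=0 creates the levels (B, Z) that never appear in frame2
frame2_final = part.reindex(columns=idx, fill_value=0)
frame2_final = frame2_final[sorted(frame2_final.columns)]  # optional: sort the columns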
Related problem:
Given a large data set (3 million rows x 6 columns) what's the fastest way to join values of columns in a single pandas data frame, based on the rows where the mask is true?
My current solution:
import pandas as pd
import numpy as np

# Note: real data will be 3 million rows x 6 columns
df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                   'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                   'd0': ['a', 'x', 'a', '1'],
                   'd1': ['b', 'x', 'b', '2'],
                   'd2': ['c', 'x', np.nan, '3']})
#print(df)
msg_text_filter = ['msg0', 'msg2']
columns = df.columns.drop(df.columns[:3])
column_join = ["d0"]
mask = df['msg'].isin(msg_text_filter)
df.replace(np.nan,'',inplace=True)
# THIS IS SLOW, HOW TO SPEED UP?
df['d0'] = np.where(
    mask,
    df[['d0', 'd1', 'd2']].agg(''.join, axis=1),
    df['d0']
)
df.loc[mask, columns] = np.nan
print(df)
IMHO you can save a lot of time by using
df[['d0', 'd1', 'd2']].sum(axis=1)
instead of
df[['d0', 'd1', 'd2']].agg(''.join, axis=1)
And I think instead of using np.where you could just do:
df.loc[mask, 'd0'] = df.loc[mask, ['d0', 'd1', 'd2']].sum(axis=1)
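Putting both suggestions together on the example frame (just a sketch, not benchmarked; it keeps the NaN-to-empty-string replacement from the original code so that .sum can concatenate the strings):
mask = df['msg'].isin(msg_text_filter)
df.replace(np.nan, '', inplace=True)
# summing object (string) columns concatenates them row-wise,
# avoiding the slower Python-level ''.join inside agg
df.loc[mask, 'd0'] = df.loc[mask, ['d0', 'd1', 'd2']].sum(axis=1)
df.loc[mask, columns] = np.nan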
This question is related to How to concatenate combinations of rows from two different dataframes? but with a minor twist.
I have two dataframes with a common column. I want to create a new dataframe whose columns are the common column plus the concatenation of the two dataframes' columns.
The resulting dataframe will have all possible combinations (cartesian product?) between rows of the two datasets that have the same value in the common column.
The two original datasets are:
df1 = pd.DataFrame({'common': ['x', 'y', 'y'], 'A': ['1', '2', '3']})
df2 = pd.DataFrame({'common': ['x', 'x', 'y'], 'B': ['a', 'b', 'c']})
and the resulting dataset would be:
df3 = pd.DataFrame({'common': ['x', 'x', 'y', 'y'],
                    'A': ['1', '1', '2', '3'],
                    'B': ['a', 'b', 'c', 'c']})
Use pandas' merge:
df1 = pd.DataFrame({'common': ['x', 'y', 'y'], 'A': ['1', '2', '3']})
df2 = pd.DataFrame({'common': ['x', 'x', 'y'], 'B': ['a', 'b', 'c']})
df3 = pd.merge(df1, df2, on='common')
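For the frames above, the default inner join already produces the per-key cartesian product asked for; printing the result should give roughly:
print(df3)
#   common  A  B
# 0      x  1  a
# 1      x  1  b
# 2      y  2  c
# 3      y  3  c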
I have a pandas data frame that looks like:
col11  col12
X      ['A']
Y      ['A', 'B', 'C']
Z      ['C', 'A']
And another one that looks like:
col21  col22
'A'    'alpha'
'B'    'beta'
'C'    'gamma'
I would like to replace col12 based on col22 in an efficient way and get, as a result:
col31  col32
X      ['alpha']
Y      ['alpha', 'beta', 'gamma']
Z      ['gamma', 'alpha']
One solution is to use an indexed series as a mapper with a list comprehension:
import pandas as pd
df1 = pd.DataFrame({'col1': ['X', 'Y', 'Z'],
                    'col2': [['A'], ['A', 'B', 'C'], ['C', 'A']]})
df2 = pd.DataFrame({'col21': ['A', 'B', 'C'],
                    'col22': ['alpha', 'beta', 'gamma']})
s = df2.set_index('col21')['col22']
df1['col2'] = [list(map(s.get, i)) for i in df1['col2']]
Result:
  col1                  col2
0    X               [alpha]
1    Y  [alpha, beta, gamma]
2    Z        [gamma, alpha]
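If df2 does not cover every code that can appear in the lists, s.get fills in None for the missing ones. A small variant (a sketch; keeping the original code as the fallback is an assumption about the desired behaviour):
df1['col2'] = [[s.get(code, code) for code in codes] for codes in df1['col2']]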
I'm not sure it's the most efficient way, but you can turn your DataFrame into a dict and then use apply to map the keys to the values:
Assuming your first DataFrame is df1 and the second is df2:
df_dict = dict(zip(df2['col21'], df2['col22']))
df3 = pd.DataFrame({"31":df1['col11'], "32": df1['col12'].apply(lambda x: [df_dict[y] for y in x])})
or as #jezrael suggested with nested list comprehension:
df3 = pd.DataFrame({"31":df1['col11'], "32": [[df_dict[y] for y in x] for x in df1['col12']]})
note: df3 has a default index
  31                    32
0  X               [alpha]
1  Y  [alpha, beta, gamma]
2  Z        [gamma, alpha]
I'm trying to simplify pandas and python syntax when executing a basic Pandas operation.
I have 4 columns:
a_id
a_score
b_id
b_score
I create a new label called doc_type based on the following:
a >= b, doc_type: a
b > a, doc_type: b
I'm struggling with how to handle, in pandas, the case where a exists but b doesn't; in that case a needs to be the label. Right now it falls through to the else branch and returns b.
I had to add two additional comparisons, which may not be efficient at scale since I already compare the data just before. I'm looking for a way to improve this.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a_id': ['A', 'B', 'C', 'D', '', 'F', 'G'],
    'a_score': [1, 2, 3, 4, '', 6, 7],
    'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
    'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, None],
})
print(df)
# Replace empty strings with NaN
df = df.apply(lambda x: x.str.strip() if isinstance(x, str) else x).replace('', np.nan)

m_score = df['a_score'] >= df['b_score']
m_doc = (df['a_id'].isnull() & df['b_id'].isnull())

# Calculate higher score
df['doc_id'] = df.apply(lambda row: row['a_id'] if row['a_score'] >= row['b_score'] else row['b_id'], axis=1)

# Select type based on higher score
df['doc_type'] = np.where(m_score, 'a', np.where(m_doc, np.nan, 'b'))

# Additional lines looking for improvement:
df['doc_type'].loc[df['a_id'].isnull() & df['b_id'].notnull()] = 'b'
df['doc_type'].loc[df['a_id'].notnull() & df['b_id'].isnull()] = 'a'
print(df)
Use numpy.where, assuming your logic is:
Both exist, the doc_type will be the one with higher score;
One missing, the doc_type will be the one not null;
Both missing, the doc_type will be null;
Added an extra edge case at the last line:
import numpy as np
df = df.replace('', np.nan)
df['doc_type'] = np.where(df.b_id.isnull() | (df.a_score >= df.b_score),
                          np.where(df.a_id.isnull(), None, 'a'), 'b')
df
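For the sample frame above (after the empty strings have been replaced with NaN), this should yield:
print(df['doc_type'].tolist())
# ['a', 'a', 'b', 'b', 'b', 'a', 'a']
# row 4 falls through to 'b' because a_id/a_score are missing,
# row 6 gets 'a' because b_id is missing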
Not sure I fully understand all conditions or if this has any particular edge cases, but I think you can just do an np.argmax on the columns and swap the values for 'a' or 'b' when you're done:
In [21]: import numpy as np
In [22]: df['doc_type'] = pd.Series(np.argmax(df[["a_score", "b_score"]].values, axis=1)).replace({0: 'a', 1: 'b'})
In [23]: df
Out[23]:
  a_id a_score b_id  b_score doc_type
0    A       1    a     0.10        a
1    B       2    b     0.20        a
2    C       3    c     3.10        b
3    D       4    d     4.10        b
4            2    e     5.00        b
5    F            f     5.99        a
6    G       7          NaN         a
Use the apply method in pandas with a custom function, trying it out on your dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'a_id': ['A', 'B', 'C', 'D', '', 'F', 'G'],
    'a_score': [1, 2, 3, 4, '', 6, 7],
    'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
    'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, None],
})
df = df.replace('', np.nan)

def func(row):
    # both scores missing -> no label
    if np.isnan(row.a_score) and np.isnan(row.b_score):
        return np.nan
    # only a present -> 'a'
    elif np.isnan(row.b_score) and not np.isnan(row.a_score):
        return 'a'
    # only b present -> 'b'
    elif not np.isnan(row.b_score) and np.isnan(row.a_score):
        return 'b'
    # both present: the higher score wins, ties go to 'a'
    elif row.a_score >= row.b_score:
        return 'a'
    elif row.b_score > row.a_score:
        return 'b'

df['doc_type'] = df.apply(func, axis=1)
You can make the function as complicated as you need, include any number of comparisons, and add more conditions later if you need to.
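If the row-wise apply becomes too slow on a larger frame, the same branching can be written with np.select, which checks the conditions as vectorized masks in order. A sketch under the same assumptions (empty strings already replaced with NaN, numpy imported as np):
both_missing = df.a_score.isnull() & df.b_score.isnull()
conditions = [
    df.b_score.isnull(),        # only a present
    df.a_score.isnull(),        # only b present
    df.a_score >= df.b_score,   # both present, a wins ties
]
df['doc_type'] = np.select(conditions, ['a', 'b', 'a'], default='b')
df.loc[both_missing, 'doc_type'] = np.nan  # both missing -> no label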
I have this data frame:
>> df = pd.DataFrame({'Place' : ['A', 'A', 'B', 'B', 'C', 'C'], 'Var' : ['All', 'French', 'All', 'German', 'All', 'Spanish'], 'Values' : [250, 30, 120, 12, 200, 112]})
>> df
  Place  Values      Var
0     A     250      All
1     A      30   French
2     B     120      All
3     B      12   German
4     C     200      All
5     C     112  Spanish
It has a repeating pattern of two rows for every Place. I want to reshape it so it's one row per Place and the Var column becomes two columns, one for "All" and one for the other value.
Like so:
Place  All  Language  Value
A      250  French       30
B      120  German       12
C      200  Spanish     112
A pivot table would make a column for each unique value, and I don't want that.
What's the reshaping method for this?
Because the data appears in an alternating pattern, we can conceptualize the transformation in two steps.
Step 1:
Go from
a,a,a
b,b,b
To
a,a,a,b,b,b
Step 2: drop redundant columns.
The following solution applies reshape to the values of the DataFrame; the arguments to reshape are (-1, df.shape[1] * 2), which say "give me a frame that has twice as many columns and as many rows as you can manage".
Then I hardwired the column indexes for the filter, [0, 1, 4, 5], based on your data layout. The resulting numpy array has 4 columns, so we pass it to the DataFrame constructor along with the correct column names.
It is an unreadable solution that depends on the df layout and produces columns in the wrong order:
import pandas as pd
df = pd.DataFrame({'Place' : ['A', 'A', 'B', 'B', 'C', 'C'], 'Var' : ['All', 'French', 'All', 'German', 'All', 'Spanish'], 'Values' : [250, 30, 120, 12, 200, 112]})
df = pd.DataFrame(df.values.reshape(-1, df.shape[1] * 2)[:, [0, 1, 4, 5]],
                  columns=['Place', 'All', 'Value', 'Language'])
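The hard-wired indexes assume the Place, Values, Var column order shown in the question's printout; newer pandas keeps the dict insertion order (Place, Var, Values), so it can be safer to pin the order by name first. A variant of the same idea (a sketch, restarting from the original data and using the names raw and wide to avoid clashing with the df rebuilt above):
raw = pd.DataFrame({'Place': ['A', 'A', 'B', 'B', 'C', 'C'],
                    'Var': ['All', 'French', 'All', 'German', 'All', 'Spanish'],
                    'Values': [250, 30, 120, 12, 200, 112]})
values = raw[['Place', 'Values', 'Var']].values              # pin the column order explicitly
wide = pd.DataFrame(values.reshape(-1, 6)[:, [0, 1, 4, 5]],
                    columns=['Place', 'All', 'Value', 'Language'])
wide = wide[['Place', 'All', 'Language', 'Value']]           # match the requested column order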
A different approach:
df = pd.DataFrame({'Place' : ['A', 'A', 'B', 'B', 'C', 'C'], 'Var' : ['All', 'French', 'All', 'German', 'All', 'Spanish'], 'Values' : [250, 30, 120, 12, 200, 112]})
df1 = df.set_index('Place').pivot(columns='Var')
df1.columns = df1.columns.droplevel()
df1 = df1.set_index('All', append=True).stack().reset_index()
print(df1)
Output:
  Place    All      Var      0
0     A  250.0   French   30.0
1     B  120.0   German   12.0
2     C  200.0  Spanish  112.0
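To finish matching the column names asked for in the question, the last two columns can be renamed (a small addition to the snippet above):
df1 = df1.rename(columns={'Var': 'Language', 0: 'Value'})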