Fastest way to join column values in a pandas dataframe? - python

Problem:
Given a large data set (3 million rows x 6 columns), what's the fastest way to join the values of several columns in a single pandas dataframe, based on the rows where a mask is true?
My current solution:
import pandas as pd
import numpy as np

# Note: real data will be 3 million rows x 6 columns.
df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                   'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                   'd0': ['a', 'x', 'a', '1'],
                   'd1': ['b', 'x', 'b', '2'],
                   'd2': ['c', 'x', np.nan, '3']})
#print(df)
msg_text_filter = ['msg0', 'msg2']
columns = df.columns.drop(df.columns[:3])
column_join = ["d0"]
mask = df['msg'].isin(msg_text_filter)
df.replace(np.nan, '', inplace=True)

# THIS IS SLOW, HOW TO SPEED UP?
df['d0'] = np.where(
    mask,
    df[['d0', 'd1', 'd2']].agg(''.join, axis=1),
    df['d0']
)
df.loc[mask, columns] = np.nan
print(df)

IMHO you can save a lot of time by using
df[['d0', 'd1', 'd2']].sum(axis=1)
instead of
df[['d0', 'd1', 'd2']].agg(''.join, axis=1)
since on object-dtype columns sum(axis=1) concatenates the strings without calling a Python-level ''.join per row. And instead of using np.where you could just assign to the masked rows directly, so the concatenation only runs where it is needed:
df.loc[mask, 'd0'] = df.loc[mask, ['d0', 'd1', 'd2']].sum(axis=1)
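
Putting both suggestions together, a minimal runnable sketch against the sample frame from the question:

import pandas as pd
import numpy as np

df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                   'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                   'd0': ['a', 'x', 'a', '1'],
                   'd1': ['b', 'x', 'b', '2'],
                   'd2': ['c', 'x', np.nan, '3']})
msg_text_filter = ['msg0', 'msg2']
mask = df['msg'].isin(msg_text_filter)
df.replace(np.nan, '', inplace=True)

# Join only the masked rows: string sum(axis=1) concatenates
# object-dtype columns row-wise.
df.loc[mask, 'd0'] = df.loc[mask, ['d0', 'd1', 'd2']].sum(axis=1)
# Clear the joined columns on those rows, as in the question.
df.loc[mask, ['d1', 'd2']] = np.nan
print(df)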

Related

wide_to_long function raises the error [stubname can't be identical to a column name]

The wide_to_long call below doesn't work; can anyone help? Thanks!
import pandas as pd

ori_df = pd.DataFrame([['a', '1'], ['w:', 'z'], ['t', '6'], ['f:', 'z'], ['a', '2']],
                      columns=['type', 'value'])
ori_df['id'] = ori_df.index
pd.wide_to_long(ori_df, ['type', 'value'], i='id', j='amount')
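
The error happens because wide_to_long expects the wide columns to be stubname plus a suffix, and a stubname must not equal a column name exactly. A sketch of one way around it, assuming the intent is simply to stack the type/value pairs: rename the columns so they carry a numeric suffix, after which the stubnames 'type' and 'value' are valid prefixes:

import pandas as pd

ori_df = pd.DataFrame([['a', '1'], ['w:', 'z'], ['t', '6'], ['f:', 'z'], ['a', '2']],
                      columns=['type', 'value'])
ori_df['id'] = ori_df.index

# 'type0' and 'value0' match stub + numeric suffix, which is the
# shape wide_to_long expects.
ori_df = ori_df.rename(columns={'type': 'type0', 'value': 'value0'})
print(pd.wide_to_long(ori_df, ['type', 'value'], i='id', j='amount'))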

Trying to do a left join of two datasets but getting strange results

To make this as clear as possible I started with a simple example. I created two small dataframes:
dummy_data1 = {
    'id': ['1', '2', '3', '4', '5'],
    'Feature1': ['A', 'C', 'E', 'G', 'I'],
    'Feature2': ['B', 'D', 'F', 'H', 'J']}
df1 = pd.DataFrame(dummy_data1, columns=['id', 'Feature1', 'Feature2'])

dummy_data2 = {
    'id': ['1', '2', '6', '7', '8'],
    'Feature3': ['K', 'M', 'O', 'Q', 'S'],
    'Feature4': ['L', 'N', 'P', 'R', 'T']}
df2 = pd.DataFrame(dummy_data2, columns=['id', 'Feature3', 'Feature4'])
If I apply either df_merge = pd.merge(df1, df2, on='id', how='outer') or df_merge = df1.merge(df2, how='left', left_on='id', right_on='id'), I get the desired output.
Now I am trying to apply the same technique to two large datasets that have the same number of rows. All I want to do is join the columns together into one large dataframe. Each dataframe has 512573 rows, but when I apply
df_merge = orig_data_updated.merge(demographic_data1, how='left', left_on='Location+Type', right_on='Location+Type')
the length magically becomes 3596301, which should be impossible. My question is simple: how do I do a left join on two dataframes such that the number of rows stays the same and I just join the columns together?
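
This row blow-up is the classic symptom of duplicate join keys: a left join emits one output row per matching left/right pair, so every repeated 'Location+Type' value on the right multiplies the matching left rows. A sketch of how to diagnose and fix it, reusing the dataframe names from the question:

# Count repeated keys on each side; a nonzero right-side count
# means the left join can return more rows than the left frame.
print(orig_data_updated['Location+Type'].duplicated().sum())
print(demographic_data1['Location+Type'].duplicated().sum())

# If the right side should be one row per key, deduplicate it first.
dedup = demographic_data1.drop_duplicates(subset='Location+Type')
df_merge = orig_data_updated.merge(dedup, how='left', on='Location+Type',
                                   validate='many_to_one')  # raises if keys still repeat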

How to concatenate combinations of rows *with a given condition* from two different dataframes? [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
This question is related to How to concatenate combinations of rows from two different dataframes? but with a minor twist.
I have two dataframes with a common column. I want to create a new dataframe whose column names are the common column plus the two dataframes' columns.
The resulting dataframe will have all possible combinations (cartesian product?) between rows of the two datasets that have the same value in the common column.
The two original datasets are:
df1 = pd.DataFrame({'common': ['x', 'y', 'y'], 'A': ['1', '2', '3']})
df2 = pd.DataFrame({'common': ['x', 'x', 'y'], 'B': ['a', 'b', 'c']})
and the resulting dataset would be:
df3 = pd.DataFrame({'common': ['x', 'x', 'y', 'y'],
                    'A': ['1', '1', '2', '3'],
                    'B': ['a', 'b', 'c', 'c']})
Use pandas' merge:
df1 = pd.DataFrame({'common': ['x', 'y', 'y'], 'A': ['1', '2', '3']})
df2 = pd.DataFrame({'common': ['x', 'x', 'y'], 'B': ['a', 'b', 'c']})
df3 = pd.merge(df1, df2, on='common')
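For reference, this yields every within-key combination, matching the expected df3:
print(df3)
  common  A  B
0      x  1  a
1      x  1  b
2      y  2  c
3      y  3  c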

Reindex DataFrame Columns by Label Series

I have a Series of Labels
pd.Series(['L1', 'L2', 'L3'], ['A', 'B', 'A'])
and a dataframe
pd.DataFrame([[1,2], [3,4]], ['I1', 'I2'], ['A', 'B'])
I'd like to have a dataframe with columns ['L1', 'L2', 'L3'] with the column data from 'A', 'B', 'A' respectively. Like so...
pd.DataFrame([[1,2,1], [3,4,3]], ['I1', 'I2'], ['L1', 'L2', 'L3'])
in a nice pandas way.
Since you mention reindex:
s = pd.Series(['L1', 'L2', 'L3'], ['A', 'B', 'A'])
df = pd.DataFrame([[1,2], [3,4]], ['I1', 'I2'], ['A', 'B'])
# rename(columns=s.to_dict()) would collapse the duplicated 'A' key
# (a dict keeps only one mapping per key), so assign the new labels
# positionally with set_axis instead:
df.reindex(s.index, axis=1).set_axis(s.values, axis=1)
Out[598]:
    L1  L2  L3
I1   1   2   1
I2   3   4   3
This will produce a dataframe of the shape you described:
import pandas as pd

data = [['A', 'B', 'A', 'A', 'B', 'B'],
        ['B', 'B', 'B', 'A', 'B', 'B'],
        ['A', 'B', 'A', 'B', 'B', 'B']]
columns = ['L1', 'L2', 'L3', 'L4', 'L5', 'L6']
pd.DataFrame(data, columns=columns)
You can use the loc accessor:
s = pd.Series(['L1', 'L2', 'L3'], ['A', 'B', 'A'])
df = pd.DataFrame([[1,2], [3,4]], ['I1', 'I2'], ['A', 'B'])
res = df.loc[:, s.index]
print(res)
    A  B  A
I1  1  2  1
I2  3  4  3
Or the iloc accessor with columns.get_loc:
res = df.iloc[:, s.index.map(df.columns.get_loc)]
Both methods allow accessing duplicate labels / locations, in the same vein as NumPy arrays.
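If you also want the L1/L2/L3 labels asked for in the question on top of either result, one follow-up option (a sketch; set_axis simply overwrites the column labels with the Series values) is:
res = df.loc[:, s.index].set_axis(s.values, axis=1)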

Dummy variables from levels of other data frame

I'd like to be able to do one hot encoding on a data frame based on levels from another data frame.
For instance, in the example below data provides the levels for two variables. Based on those levels only, I want to create dummy variables in data2.
How can I go about this?
import pandas as pd

# Unique levels (A, B for VAR1, and X, Y, Z for VAR2) in
# this dataset determine the possible levels for the following dataset.
data = {'VAR1': ['A', 'A', 'A', 'A', 'B', 'B'],
        'VAR2': ['X', 'Y', 'Y', 'Y', 'X', 'Z']}
frame = pd.DataFrame(data)

# data2 contains the same variables as data, but might or might not
# contain the same levels.
data2 = {'VAR1': ['A', 'C'],
         'VAR2': ['X', 'Y']}
frame2 = pd.DataFrame(data2)

# After applying one-hot encoding to data2, this is what it should look like.
data_final = {
    'A': ['1', '0'],
    'B': ['0', '0'],
    'X': ['1', '0'],
    'Y': ['0', '1'],
    'Z': ['0', '0'],
}
frame_final = pd.DataFrame(data_final)
There are probably a lot of ways to achieve this. For whatever reason I'm drawn to this approach:
In [74]: part = pd.concat([pd.get_dummies(frame2[x]) for x in frame2], axis=1)
In [75]: part
Out[75]:
   A  C  X  Y
0  1  0  1  0
1  0  1  0  1
You can see we are already almost there; the only missing columns are those that don't show up anywhere in frame2, B and Z. Again there would be multiple ways to get these added in (I'd be curious to hear of any you think are more suitable), but I wanted to use reindexing (the reindex_axis method used originally has since been removed from pandas in favor of reindex). To use this, we need another index containing all the possible values.
In [76]: idx = pd.Index(np.ravel(frame.values)).unique()
In [77]: idx
Out[77]: array(['A', 'X', 'Y', 'B', 'Z'], dtype=object)
Finally reindex and fill the NaNs with 0:
In [78]: part.reindex(idx, axis=1).fillna(0)
Out[78]:
   A  X  Y  B  Z
0  1  1  0  0  0
1  0  0  1  0  0
You can sort if necessary.
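
On newer pandas, a sketch of the same idea using categorical dtypes (assuming, as here, that each variable's levels come from the matching column of frame): casting frame2's columns to categoricals whose categories come from frame makes get_dummies emit every expected level and zero out unseen values like 'C'.

import pandas as pd

frame = pd.DataFrame({'VAR1': ['A', 'A', 'A', 'A', 'B', 'B'],
                      'VAR2': ['X', 'Y', 'Y', 'Y', 'X', 'Z']})
frame2 = pd.DataFrame({'VAR1': ['A', 'C'],
                       'VAR2': ['X', 'Y']})

coded = frame2.copy()
for col in coded:
    # Values outside the allowed categories (like 'C') become NaN,
    # so they encode as all-zero rows.
    coded[col] = pd.Categorical(coded[col], categories=frame[col].unique())

part = pd.concat([pd.get_dummies(coded[col]) for col in coded], axis=1)
print(part.astype(int))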
