Fastest way to compare all rows of a DataFrame - python

I have written a program (in Python 3.6) that tries to map the columns of a user's csv/excel file to a template xls I have. So far so good, but part of this process has to be processing the user's data, which are contacts. For example, I want to delete duplicates, merge data, etc. To do this I need to compare every row to all other rows, which is costly. Every user csv I read has ~2000-4000 rows, but I want it to be efficient for even more rows. I have stored the data in a pd.DataFrame.
Is there a more efficient way to do the comparisons besides brute force?
Thanks

First, what code have you tried?
That said, deleting duplicates is very easy in pandas. Example below:
import pandas as pd
import numpy as np
# Creating the Test DataFrame below -------------------------------
dfp = pd.DataFrame({'A' : [np.NaN, np.NaN, 3, 4, 5, 5, 3, 1, 5, np.NaN],
                    'B' : [1, 0, 3, 5, 0, 0, np.NaN, 9, 0, 0],
                    'C' : ['AA1233445', 'A9875', 'rmacy', 'Idaho Rx', 'Ab123455',
                           'TV192837', 'RX', 'Ohio Drugs', 'RX12345', 'USA Pharma'],
                    'D' : [123456, 123456, 1234567, 12345678, 12345, 12345,
                           12345678, 123456789, 1234567, np.NaN],
                    'E' : ['Assign', 'Unassign', 'Assign', 'Ugly', 'Appreciate',
                           'Undo', 'Assign', 'Unicycle', 'Assign', 'Unicorn']})
print(dfp)
#Output Below----------------
A B C D E
0 NaN 1.0 AA1233445 123456.0 Assign
1 NaN 0.0 A9875 123456.0 Unassign
2 3.0 3.0 rmacy 1234567.0 Assign
3 4.0 5.0 Idaho Rx 12345678.0 Ugly
4 5.0 0.0 Ab123455 12345.0 Appreciate
5 5.0 0.0 TV192837 12345.0 Undo
6 3.0 NaN RX 12345678.0 Assign
7 1.0 9.0 Ohio Drugs 123456789.0 Unicycle
8 5.0 0.0 RX12345 1234567.0 Assign
9 NaN 0.0 USA Pharma NaN Unicorn
# Select all records whose value in column A duplicates an earlier row.
# keep='first' leaves the first occurrence unmarked, so it is excluded here.
df2 = dfp[dfp.duplicated(['A'], keep='first')]
#output
A B C D E
1 NaN 0.0 A9875 123456.0 Unassign
5 5.0 0.0 TV192837 12345.0 Undo
6 3.0 NaN RX 12345678.0 Assign
8 5.0 0.0 RX12345 1234567.0 Assign
9 NaN 0.0 USA Pharma NaN Unicorn
If you want a new dataframe with no dupes, checking across all columns, use the tilde. The ~ operator is logical NOT, so it inverts the boolean mask returned by duplicated (see the official pandas documentation on duplicated).
df2 = dfp[~dfp.duplicated(keep='first')]
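A side note on the broader question (a minimal sketch, not part of the original answer): drop_duplicates is the direct shortcut for the pattern above, and when you need to merge rows that share a key, grouping on that key replaces the O(n^2) all-pairs comparison with a single hash-based pass. The choice of 'C' as the key here is purely illustrative.
import pandas as pd

# Equivalent to dfp[~dfp.duplicated(keep='first')], checking all columns:
deduped = dfp.drop_duplicates(keep='first')

# Merge rows sharing a key without comparing every pair of rows;
# groupby hashes the key column in one pass:
merged = dfp.groupby('C', sort=False, as_index=False).first()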

Related

Performing operations on column with nan's without removing them

I currently have a data frame like so:
treated  control
9.5      9.6
10       5
6        0
6        6
I want to get a log 2 ratio between treated and control, i.e. log2(treated/control). However, math.log2() breaks due to the 0 values in the control column (a zero division). Ideally, I would like to get the log 2 ratio using method chaining, e.g. a df.assign(), and simply put NaNs where it is not possible, like so:
treated  control  log_2_ratio
9.5      9.6      -0.00454
10       5        0.301
6        0        nan
6        6        0
I have managed to do this in an extremely round-about way, where I have:
- made a column ratio which is treated/control
- done new_df = df.dropna() on this dataframe
- applied the log 2 ratio to this
- left joined it back to the original df
As always, any help is very much appreciated :)
You need to replace the inf with nan:
df.assign(log_2_ratio=np.log2(df['treated'].div(df['control'])).replace(np.inf, np.nan))
Output:
treated control log_2_ratio
0 9.5 9.6 -0.015107
1 10.0 5.0 1.000000
2 6.0 0.0 NaN
3 6.0 6.0 0.000000
To avoid the subsequent replacement you can go through an explicit condition instead (bearing in mind that a multiplication or division involving zero always results in 0, which np.where treats as False):
df.assign(log_2_ratio=lambda x: np.where(x.treated * x.control, np.log2(x.treated/x.control), np.nan))
Out[22]:
treated control log_2_ratio
0 9.5 9.6 -0.015107
1 10.0 5.0 1.000000
2 6.0 0.0 NaN
3 6.0 6.0 0.000000
Stick with the numpy log functions and you'll get an inf in the cells where the division doesn't work. That seems like a better choice than nan anyway.
>>> df["log_2_ratio"] = np.log2(df.treated/df.control)
>>> df
treated control log_2_ratio
0 9.5 9.6 -0.015107
1 10.0 5.0 1.000000
2 6.0 0.0 inf
3 6.0 6.0 0.000000
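One caveat worth adding (not from the answers above): if treated can also be 0, the ratio is 0 and np.log2(0) yields -inf, so replacing both signs of infinity is the safer variant:
df["log_2_ratio"] = np.log2(df.treated / df.control).replace([np.inf, -np.inf], np.nan)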

Forward fill on custom value in pandas dataframe

I am looking to perform a forward fill on some dataframe columns.
The ffill method replaces missing values (NaN) with the last filled value.
In my case, I would like to perform a forward fill with the difference that I don't want to trigger on NaN but on a specific value (say "*").
Here's an example
import pandas as pd
import numpy as np
d = [{"a": 1, "b": 10},
     {"a": 2, "b": "*"},
     {"a": 3, "b": "*"},
     {"a": 4, "b": "*"},
     {"a": np.nan, "b": 50},
     {"a": 6, "b": 60},
     {"a": 7, "b": 70}]
df = pd.DataFrame(d)
with df being
a b
0 1.0 10
1 2.0 *
2 3.0 *
3 4.0 *
4 NaN 50
5 6.0 60
6 7.0 70
The expected result should be
a b
0 1.0 10
1 2.0 10
2 3.0 10
3 4.0 10
4 NaN 50
5 6.0 60
6 7.0 70
Replacing "*" with np.nan and then calling ffill would also apply the fill to column a.
Since my data has hundreds of columns, I was wondering if there is a more efficient way than looping over all columns, checking whether each contains "*", then replacing and forward filling.
You can use df.mask with df.isin and df.replace:
df.mask(df.isin(['*']),df.replace('*',np.nan).ffill())
a b
0 1.0 10
1 2.0 10
2 3.0 10
3 4.0 10
4 NaN 50
5 6.0 60
6 7.0 70
I think you're going in the right direction, but here's a complete solution. What I'm doing is 'marking' the original NaN values, then replacing "*" with NaN, using ffill, and then putting the original NaN values back.
df = df.replace(np.NaN, "<special>").replace("*", np.NaN).ffill().replace("<special>", np.NaN)
output:
a b
0 1.0 10.0
1 2.0 10.0
2 3.0 10.0
3 4.0 10.0
4 NaN 50.0
5 6.0 60.0
6 7.0 70.0
And here's an alternative solution that does the same thing, without the 'special' marking:
original_nan = df.isna()              # remember where the real NaNs were
df = df.replace("*", np.NaN).ffill()  # treat "*" as missing and forward fill
df[original_nan] = np.NaN             # restore the original NaNs

Dataframe operation

I have a dataframe populated with zeros and values different from zero. For each row I want to apply the following condition:
If the value in a given cell is different from zero AND the value in the cell to the right is zero, then put the same value in the cell to the right.
The example would be the following:
This is one of the rows in the dataframe now:
[0,0,0,20,0,0,0,33,3,0,5,0,0,0,0,0]
The function would convert it to the following:
[0,0,0,20,20,20,20,33,3,3,5,5,5,5,5,5]
I want to apply this to the whole dataframe.
Your help would be much appreciated!
Thank you.
Since you imply you are using Pandas, I would leverage a bit of the built-in muscle in the library.
import pandas as pd
import numpy as np
s = pd.Series([0,0,0,20,0,0,0,33,3,0,5,0,0,0,0,0])
s.replace(0, np.NaN, inplace=True)  # treat zeros as missing
s = s.ffill()                       # carry the last non-zero value forward
Output:
0 NaN
1 NaN
2 NaN
3 20.0
4 20.0
5 20.0
6 20.0
7 33.0
8 3.0
9 3.0
10 5.0
11 5.0
12 5.0
13 5.0
14 5.0
15 5.0
dtype: float64
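To cover the whole dataframe row-wise, as the question asks, the same idea works with axis=1; the trailing fillna(0) restores the leading zeros that had nothing to carry forward. A minimal sketch, assuming all columns are numeric:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0,0,0,20,0,0,0,33,3,0,5,0,0,0,0,0]])
filled = df.replace(0, np.nan).ffill(axis=1).fillna(0)
# each row becomes 0,0,0,20,20,20,20,33,3,3,5,5,5,5,5,5 (as floats)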

Drop Duplicates and Add Values Pandas

I have a dataframe below. I would like to drop the duplicates, but append the E column value of each dropped duplicate to the record that is kept.
import pandas as pd
import numpy as np
dfp = pd.DataFrame({'A' : [np.NaN, np.NaN, 3, 4, 5, 5, 3, 1, 6, 7],
                    'B' : [1, 1, 3, 5, 0, 0, np.NaN, 9, 0, 0],
                    'C' : ['AA1233445', 'AA1233445', 'rmacy', 'Idaho Rx', 'Ab123455',
                           'TV192837', 'RX', 'Ohio Drugs', 'RX12345', 'USA Pharma'],
                    'D' : [123456, 123456, 1234567, 12345678, 12345, 12345,
                           12345678, 123456789, 1234567, np.NaN],
                    'E' : ['Assign', 'Allign', 'Hello', 'Ugly', 'Appreciate',
                           'Undo', 'Testing', 'Unicycle', 'Pharma', 'Unicorn']})
print(dfp)
I'm grabbing all the duplicates:
df2 = dfp.loc[(dfp['A'].duplicated(keep=False))].copy()
A B C D E
0 NaN 1.0 AA1233445 123456.0 Assign
1 NaN 1.0 AA1233445 123456.0 Allign
2 3.0 3.0 rmacy 1234567.0 Hello
4 5.0 0.0 Ab123455 12345.0 Appreciate
5 5.0 0.0 TV192837 12345.0 Undo
6 3.0 NaN RX 12345678.0 Testing
and would like my outcome to be:
A B C D E
0 NaN 1.0 AA1233445 123456.0 Assign Allign
2 3.0 3.0 rmacy 1234567.0 Hello Testing
4 5.0 0.0 Ab123455 12345.0 Appreciate Undo
I know I need to use dfp.loc[(dfp['A'].duplicated(keep='last'))].copy() to grab the first occurrence, but I'm failing to set the value of the E column to include the other duplicated values.
I'm thinking I need to try something like:
df3 = dfp.loc[(dfp['A'].duplicated(keep='last'))].copy()
df3['E'] = df3['E'] + dfp.loc[(dfp['A'].duplicated(keep=False).copy()),'E']
but my output is:
A B C D E
0 NaN 1.0 AA1233445 123456.0 AssignAssign
2 3.0 3.0 rmacy 1234567.0 HelloHello
4 5.0 0.0 Ab123455 12345.0 AppreciateAppreciate
I'm stumped. Am I overcomplicating it? How can I get the output I'm looking for, so that I can later drop all the duplicates except the first, but 'save' the values of the dropped records in the E column?
Define functions to use in agg and use them within groupby. In order to get groupby to work with NaN, I converted the key to strings then back to floats.
f = {c: ' '.join if c == 'E' else 'first' for c in ['B', 'C', 'D', 'E']}
dfp.groupby(
    dfp.A.astype(str), sort=False
).agg(f).reset_index().eval(
    'A = @pd.to_numeric(A, "coerce").values',
    inplace=False
)
A B C D E
0 NaN 1.0 AA1233445 123456.0 Assign Allign
1 3.0 3.0 rmacy 1234567.0 Hello Testing
2 4.0 5.0 Idaho Rx 12345678.0 Ugly
3 5.0 0.0 Ab123455 12345.0 Appreciate Undo
4 1.0 9.0 Ohio Drugs 123456789.0 Unicycle
5 6.0 0.0 RX12345 1234567.0 Pharma
6 7.0 0.0 USA Pharma NaN Unicorn
Limiting it to just the duplicated rows:
f = {c: ' '.join if c == 'E' else 'first' for c in ['B', 'C', 'D', 'E']}
d1 = dfp[dfp.duplicated('A', keep=False)]
d2 = d1.groupby(d1.A.astype(str), sort=False).agg(f).reset_index()
d2.A = d2.A.astype(float)
d2
A B C D E
0 NaN 1.0 AA1233445 123456.0 Assign Allign
1 3.0 3.0 rmacy 1234567.0 Hello Testing
2 5.0 0.0 Ab123455 12345.0 Appreciate Undo
Here is my ugly solution:
In [263]: (dfp.reset_index()
...: .assign(A=dfp.A.fillna(-1))
...: .groupby('A')
...: .filter(lambda x: len(x) > 1)
...: .groupby('A', as_index=False)
...: .apply(lambda x: x.head(1).assign(E=x.E.str.cat(sep=' ')))
...: .replace({'A':{-1:np.nan}})
...: .set_index('index'))
...:
Out[263]:
A B C D E
index
0 NaN 1.0 AA1233445 123456.0 Assign Allign
2 3.0 3.0 rmacy 1234567.0 Hello Testing
4 5.0 0.0 Ab123455 12345.0 Appreciate Undo
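With newer pandas (the dropna parameter was added to groupby in version 1.1), the string round-trip can be skipped entirely; a minimal sketch under that version assumption:
f = {'B': 'first', 'C': 'first', 'D': 'first', 'E': ' '.join}
out = dfp.groupby('A', dropna=False, sort=False).agg(f).reset_index()
# NaN keys form their own group, so A survives as float with NaN intact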

I am trying to fill all NaN values in rows with number data types to zero in pandas

I have a DataFrame with a mixture of string and float columns. The float columns all still hold whole numbers and were only changed to floats because there were missing values. I want to fill all the NaNs in the numeric columns with zero while leaving the NaNs in the string columns. Here is what I have currently.
df.select_dtypes(include=['int', 'float']).fillna(0, inplace=True)
This doesn't work, and I think it is because .select_dtypes() returns a view of the DataFrame, so the .fillna() doesn't work. Is there a method similar to this to fill all the NaNs in only the float columns?
Use either DF.combine_first (does not act inplace):
df.combine_first(df.select_dtypes(include=[np.number]).fillna(0))
or DF.update (modifies inplace):
df.update(df.select_dtypes(include=[np.number]).fillna(0))
The reason fillna fails is that DF.select_dtypes returns a completely new dataframe; although it forms a subset of the original DF, it is not really a part of it. It behaves as a completely new entity in itself, so any modifications done to it will not affect the DF it derives from.
Note that np.number selects all numeric types.
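A quick demonstration of the update route on a toy frame (the frame here is an assumption, not from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': [np.nan, 'x'], 'qty': [np.nan, 1.0]})
df.update(df.select_dtypes(include=[np.number]).fillna(0))
# 'qty' NaN becomes 0.0; the string column 'name' keeps its NaN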
Your pandas.DataFrame.select_dtypes approach is good; you've just got to cross the finish line:
>>> df = pd.DataFrame({'A': [np.nan, 'string', 'string', 'more string'], 'B': [np.nan, np.nan, 3, 4], 'C': [4, np.nan, 5, 6]})
>>> df
A B C
0 NaN NaN 4.0
1 string NaN NaN
2 string 3.0 5.0
3 more string 4.0 6.0
Don't try to perform the in-place fillna here (there's a time and place for inplace=True, but this is not one). You're right in that what's returned by select_dtypes is basically a copy. Create a new dataframe called filled and join the filled (or "fixed") columns back with your original data:
>>> filled = df.select_dtypes(include=['int', 'float']).fillna(0)
>>> filled
B C
0 0.0 4.0
1 0.0 0.0
2 3.0 5.0
3 4.0 6.0
>>> df = df.join(filled, rsuffix='_filled')
>>> df
A B C B_filled C_filled
0 NaN NaN 4.0 0.0 4.0
1 string NaN NaN 0.0 0.0
2 string 3.0 5.0 3.0 5.0
3 more string 4.0 6.0 4.0 6.0
Then you can drop whatever original columns you had to keep only the "filled" ones:
>>> df.drop([x[:x.find('_filled')] for x in df.columns if '_filled' in x], axis=1, inplace=True)
>>> df
A B_filled C_filled
0 NaN 0.0 4.0
1 string 0.0 0.0
2 string 3.0 5.0
3 more string 4.0 6.0
Consider a dataframe like this
col1 col2 col3 id
0 1 1 1 a
1 0 NaN 1 a
2 NaN 1 1 NaN
3 1 0 1 b
You can select the numeric columns and fillna
num_cols = df.select_dtypes(include=[np.number]).columns
df[num_cols]=df.select_dtypes(include=[np.number]).fillna(0)
col1 col2 col3 id
0 1 1 1 a
1 0 0 1 a
2 0 1 1 NaN
3 1 0 1 b
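As a compact alternative (using the same num_cols as above), fillna also accepts a dict mapping column names to fill values, so the per-column fill fits in one call:
df = df.fillna({col: 0 for col in num_cols})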
