Merge duplicated cells instead of dropping them [duplicate] - python

I have a panda dataframe (here represented using excel):
Now I would like to delete all dublicates (1) of a specific row (B).
How can I do it ?
For this example, the result would look like that:

You can use duplicated for boolean mask and then set NaNs by loc, mask or numpy.where:
df.loc[df['B'].duplicated(), 'B'] = np.nan
df['B'] = df['B'].mask(df['B'].duplicated())
df['B'] = np.where(df['B'].duplicated(), np.nan,df['B'])
Alternative if need remove duplicates rows by B column:
df = df.drop_duplicates(subset=['B'])
Sample:
df = pd.DataFrame({
'B': [1,2,1,3],
'A':[1,5,7,9]
})
print (df)
A B
0 1 1
1 5 2
2 7 1
3 9 3
df.loc[df['B'].duplicated(), 'B'] = np.nan
print (df)
A B
0 1 1.0
1 5 2.0
2 7 NaN
3 9 3.0
df = df.drop_duplicates(subset=['B'])
print (df)
A B
0 1 1
1 5 2
3 9 3

Related

How do you generate a rolling count the number of rows that are duplicated in Pandas? [duplicate]

I come from a sql background and I use the following data processing step frequently:
Partition the table of data by one or more fields
For each partition, add a rownumber to each of its rows that ranks the row by one or more other fields, where the analyst specifies ascending or descending
EX:
df = pd.DataFrame({'key1' : ['a','a','a','b','a'],
'data1' : [1,2,2,3,3],
'data2' : [1,10,2,3,30]})
df
data1 data2 key1
0 1 1 a
1 2 10 a
2 2 2 a
3 3 3 b
4 3 30 a
I'm looking for how to do the PANDAS equivalent to this sql window function:
RN = ROW_NUMBER() OVER (PARTITION BY Key1 ORDER BY Data1 ASC, Data2 DESC)
data1 data2 key1 RN
0 1 1 a 1
1 2 10 a 2
2 2 2 a 3
3 3 3 b 1
4 3 30 a 4
I've tried the following which I've gotten to work where there are no 'partitions':
def row_number(frame,orderby_columns, orderby_direction,name):
frame.sort_index(by = orderby_columns, ascending = orderby_direction, inplace = True)
frame[name] = list(xrange(len(frame.index)))
I tried to extend this idea to work with partitions (groups in pandas) but the following didn't work:
df1 = df.groupby('key1').apply(lambda t: t.sort_index(by=['data1', 'data2'], ascending=[True, False], inplace = True)).reset_index()
def nf(x):
x['rn'] = list(xrange(len(x.index)))
df1['rn1'] = df1.groupby('key1').apply(nf)
But I just got a lot of NaNs when I do this.
Ideally, there'd be a succinct way to replicate the window function capability of sql (i've figured out the window based aggregates...that's a one liner in pandas)...can someone share with me the most idiomatic way to number rows like this in PANDAS?
you can also use sort_values(), groupby() and finally cumcount() + 1:
df['RN'] = df.sort_values(['data1','data2'], ascending=[True,False]) \
.groupby(['key1']) \
.cumcount() + 1
print(df)
yields:
data1 data2 key1 RN
0 1 1 a 1
1 2 10 a 2
2 2 2 a 3
3 3 3 b 1
4 3 30 a 4
PS tested with pandas 0.18
Use groupby.rank function.
Here the working example.
df = pd.DataFrame({'C1':['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
df
C1 C2
a 1
a 2
a 3
b 4
b 5
df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
df
C1 C2 RANK
a 1 1
a 2 2
a 3 3
b 4 1
b 5 2
You can do this by using groupby twice along with the rank method:
In [11]: g = df.groupby('key1')
Use the min method argument to give values which share the same data1 the same RN:
In [12]: g['data1'].rank(method='min')
Out[12]:
0 1
1 2
2 2
3 1
4 4
dtype: float64
In [13]: df['RN'] = g['data1'].rank(method='min')
And then groupby these results and add the rank with respect to data2:
In [14]: g1 = df.groupby(['key1', 'RN'])
In [15]: g1['data2'].rank(ascending=False) - 1
Out[15]:
0 0
1 0
2 1
3 0
4 0
dtype: float64
In [16]: df['RN'] += g1['data2'].rank(ascending=False) - 1
In [17]: df
Out[17]:
data1 data2 key1 RN
0 1 1 a 1
1 2 10 a 2
2 2 2 a 3
3 3 3 b 1
4 3 30 a 4
It feels like there ought to be a native way to do this (there may well be!...).
You can use transform and Rank together Here is an example
df = pd.DataFrame({'C1' : ['a','a','a','b','b'],
'C2' : [1,2,3,4,5]})
df['Rank'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.rank())
df
Have a look at Pandas Rank method for more information
pandas.lib.fast_zip() can create a tuple array from a list of array. You can use this function to create a tuple series, and then rank it:
values = {'key1' : ['a','a','a','b','a','b'],
'data1' : [1,2,2,3,3,3],
'data2' : [1,10,2,3,30,20]}
df = pd.DataFrame(values, index=list("abcdef"))
def rank_multi_columns(df, cols, **kw):
data = []
for col in cols:
if col.startswith("-"):
flag = -1
col = col[1:]
else:
flag = 1
data.append(flag*df[col])
values = pd.lib.fast_zip(data)
s = pd.Series(values, index=df.index)
return s.rank(**kw)
rank = df.groupby("key1").apply(lambda df:rank_multi_columns(df, ["data1", "-data2"]))
print rank
the result:
a 1
b 2
c 3
d 2
e 4
f 1
dtype: float64

Create pandas dataframe columns using column names and elements [duplicate]

How can one idiomatically run a function like get_dummies, which expects a single column and returns several, on multiple DataFrame columns?
With pandas 0.19, you can do that in a single line :
pd.get_dummies(data=df, columns=['A', 'B'])
Columns specifies where to do the One Hot Encoding.
>>> df
A B C
0 a c 1
1 b c 2
2 a b 3
>>> pd.get_dummies(data=df, columns=['A', 'B'])
C A_a A_b B_b B_c
0 1 1.0 0.0 0.0 1.0
1 2 0.0 1.0 0.0 1.0
2 3 1.0 0.0 1.0 0.0
Since pandas version 0.15.0, pd.get_dummies can handle a DataFrame directly (before that, it could only handle a single Series, and see below for the workaround):
In [1]: df = DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
...: 'C': [1, 2, 3]})
In [2]: df
Out[2]:
A B C
0 a c 1
1 b c 2
2 a b 3
In [3]: pd.get_dummies(df)
Out[3]:
C A_a A_b B_b B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
Workaround for pandas < 0.15.0
You can do it for each column seperate and then concat the results:
In [111]: df
Out[111]:
A B
0 a x
1 a y
2 b z
3 b x
4 c x
5 a y
6 b y
7 c z
In [112]: pd.concat([pd.get_dummies(df[col]) for col in df], axis=1, keys=df.columns)
Out[112]:
A B
a b c x y z
0 1 0 0 1 0 0
1 1 0 0 0 1 0
2 0 1 0 0 0 1
3 0 1 0 1 0 0
4 0 0 1 1 0 0
5 1 0 0 0 1 0
6 0 1 0 0 1 0
7 0 0 1 0 0 1
If you don't want the multi-index column, then remove the keys=.. from the concat function call.
Somebody may have something more clever, but here are two approaches. Assuming you have a dataframe named df with columns 'Name' and 'Year' you want dummies for.
First, simply iterating over the columns isn't too bad:
In [93]: for column in ['Name', 'Year']:
...: dummies = pd.get_dummies(df[column])
...: df[dummies.columns] = dummies
Another idea would be to use the patsy package, which is designed to construct data matrices from R-type formulas.
In [94]: patsy.dmatrix(' ~ C(Name) + C(Year)', df, return_type="dataframe")
Unless I don't understand the question, it is supported natively in get_dummies by passing the columns argument.
The simple trick I am currently using is a for-loop.
First separate categorical data from Data Frame by using select_dtypes(include="object"),
then by using for loop apply get_dummies to each column iteratively
as I have shown in code below:
train_cate=train_data.select_dtypes(include="object")
test_cate=test_data.select_dtypes(include="object")
# vectorize catagorical data
for col in train_cate:
cate1=pd.get_dummies(train_cate[col])
train_cate[cate1.columns]=cate1
cate2=pd.get_dummies(test_cate[col])
test_cate[cate2.columns]=cate2

Pandas: Split columns into multiple columns by two delimiters

I have data like this
ID INFO
1 A=2;B=2;C=5
2 A=3;B=4;C=1
3 A=1;B=3;C=2
I want to split the Info columns into
ID A B C
1 2 2 5
2 3 4 1
3 1 3 2
I can split columns with one delimiter by using
df['A'], df['B'], df['C'] = df['INFO'].str.split(';').str
then split again by = but this seems to not so efficient in case I have many rows and especially when there are so many field that cannot be hard-coded beforehand.
Any suggestion would be greatly welcome.
You could use named groups together with Series.str.extract. In the end concat back the 'ID'. This assumes you always have A=;B=;and C= in a line.
pd.concat([df['ID'],
df['INFO'].str.extract('A=(?P<A>\d);B=(?P<B>\d);C=(?P<C>\d)')], axis=1)
# ID A B C
#0 1 2 2 5
#1 2 3 4 1
#2 3 1 3 2
If you want a more flexible solution that can deal with cases where a single line might be 'A=1;C=2' then we can split on ';' and partition on '='. pivot in the end to get to your desired output.
### Starting Data
#ID INFO
#1 A=2;B=2;C=5
#2 A=3;B=4;C=1
#3 A=1;B=3;C=2
#4 A=1;C=2
(df.set_index('ID')['INFO']
.str.split(';', expand=True)
.stack()
.str.partition('=')
.reset_index(-1, drop=True)
.pivot(columns=0, values=2)
)
# A B C
#ID
#1 2 2 5
#2 3 4 1
#3 1 3 2
#4 1 NaN 2
Browsing a Series is much faster that iterating across the rows of a dataframe.
So I would do:
pd.DataFrame([dict([x.split('=') for x in t.split(';')]) for t in df['INFO']], index=df['ID']).reset_index()
It gives as expected:
ID A B C
0 1 2 2 5
1 2 3 4 1
2 3 1 3 2
It should be faster than splitting twice dataframe columns.
values = [dict(item.split("=") for item in value.split(";")) for value in df.INFO]
df[['a', 'b', 'c']] = pd.DataFrame(values)
This will give you the desired output:
ID INFO a b c
1 a=1;b=2;c=3 1 2 3
2 a=4;b=5;c=6 4 5 6
3 a=7;b=8;c=9 7 8 9
Explanation:
The first line converts every value to a dictionary.
e.g.
x = 'a=1;b=2;c=3'
dict(item.split("=") for item in x.split(";"))
results in :
{'a': '1', 'b': '2', 'c': '3'}
DataFrame can take a list of dicts as an input and turn it into a dataframe.
Then you only need to assign the dataframe to the columns you want:
df[['a', 'b', 'c']] = pd.DataFrame(values)
Another solution is Series.str.findAll to extract values and then apply(pd.Series):
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
df = df.drop("INFO", 1)
Details:
df = pd.DataFrame([[1, "A=2;B=2;C=5"],
[2, "A=3;B=4;C=1"],
[3, "A=1;B=3;C=2"]],
columns=["ID", "INFO"])
print(df.INFO.str.findall(r'=(\d+)'))
# 0 [2, 2, 5]
# 1 [3, 4, 1]
# 2 [1, 3, 2]
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
print(df)
# ID INFO A B C
# 0 1 A=2;B=2;C=5 2 2 5
# 1 2 A=3;B=4;C=1 3 4 1
# 2 3 A=1;B=3;C=2 1 3 2
# Remove INFO column
df = df.drop("INFO", 1)
print(df)
# ID A B C
# 0 1 2 2 5
# 1 2 3 4 1
# 2 3 1 3 2
Another solution :
#split on ';'
#explode
#then split on '='
#and pivot
df_INFO = (df.INFO
.str.split(';')
.explode()
.str.split('=',expand=True)
.pivot(columns=0,values=1)
)
pd.concat([df.ID,df_INFO],axis=1)
ID A B C
0 1 2 2 5
1 2 3 4 1
2 3 1 3 2

Pandas: Check if a value exist in a column, create a new column, exist add 1 if not add 0

I have a DataFrame
df = columnA=[1,2,3,4,5,6]
columnB=['Apple AA','Banana BB',NaN,'Strawberry DD',NaN,'Blueberry EE']
I want to create a new column if columnB contains values
df = columnA=[1,2,3,4,5,6]
columnB=['Apple AA','Banana BB',NaN,'Strawberry DD',NaN,'Blueberry EE']
columnC=[1,1,0,1,0,1]
My code:
df[columnC] = df[columnB].map(lambda x: 1 if len(x) > 0 else 0 if len(x) == 0)
Or
columnC = np.repeat(0, df.shape[0]
for i in df:
if len(df[columnB]) > 0:
df[columnC] = 1
Neither are working.
You can use .notnull() to test if your values are not NaN
import pandas as pd
import numpy as np
df = pd.DataFrame({'columnA':[1,2,3,4,5,6],
'columnB':['Apple AA','Banana BB',np.NaN,'Strawberry DD',np.NaN,'Blueberry EE']})
df['columnC'] = df['columnB'].notnull()*1
The multiplication by 1 is used to convert booleans to binary values.
Also be careful not to forget quotes around your column names.
Here is the code:
df = pd.DataFrame({'A': [1,2,3,4,5,6], 'B': ['Apple', 'Ba', np.nan, 'St', np.nan, 'e']})
df['C'] = df['B'].isna()
A B C
0 1 Apple False
1 2 Ba False
2 3 NaN True
3 4 St False
4 5 NaN True
5 6 e False
Then convert the boolean value to 0 1
df['C'] = df['C'].apply(lambda x: 1 if not x else 0)
A B C
0 1 Apple 1
1 2 Ba 1
2 3 NaN 0
3 4 St 1
4 5 NaN 0
5 6 e 1
Could you please try following, using np.where from numpy.
df['C']=np.where(df['B'].notnull(),1,0)
Output will be as follows.
A B C
1 Apple AA 1
2 Banana BB 1
3 NaN None 0
4 Strawberry DD 1
5 NaN None 0
6 Blueberry EE 1

Delete pandas column if column name begins with a number

I have a pandas DataFrame with about 200 columns. Roughly, I want to do this
for col in df.columns:
if col begins with a number:
df.drop(col)
I'm not sure what are the best practices when it comes to handling pandas DataFrames, how should I handle this? Will my pseudocode work, or is it not recommended to modify a pandas dataframe in a for loop?
I think simpliest is select all columns which not starts with number by filter with regex - ^ is for start of string and \D is for not number:
df1 = df.filter(regex='^\D')
Similar alternative:
df1 = df.loc[:, df.columns.str.contains('^\D')]
Or inverse condition and select numbers:
df1 = df.loc[:, ~df.columns.str.contains('^\d')]
df1 = df.loc[:, ~df.columns.str[0].str.isnumeric()]
If want use your pseudocode:
for col in df.columns:
if col[0].isnumeric():
df = df.drop(col, axis=1)
Sample:
df = pd.DataFrame({'2A':list('abcdef'),
'1B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D3':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df)
1B 2A C D3 E F
0 4 a 7 1 5 a
1 5 b 8 3 3 a
2 4 c 9 5 6 a
3 5 d 4 7 9 b
4 5 e 2 1 2 b
5 4 f 3 0 4 b
df1 = df.filter(regex='^\D')
print (df1)
C D3 E F
0 7 1 5 a
1 8 3 3 a
2 9 5 6 a
3 4 7 9 b
4 2 1 2 b
5 3 0 4 b
An alternative can be this:
columns = [x for x in df.columns if not x[0].isdigit()]
df = df[columns]

Categories

Resources