Pandas, duplicate a row based on a condition - python

I have a dataframe like this -
What I want to do is, whenever there is 'X' in Col3, that row should get duplicated and 'X' should be changed to 'Z'. The result must look like this -
I did try a few approaches, but nothing worked!
Can somebody please guide on how to do this.

You can filter first by boolean indexing and set Z to Col3 by DataFrame.assign, join with original with concat, sorting index by DataFrame.sort_index with stabble algo mergesort and last create default RangeIndex by DataFrame.reset_index with drop=True:
df = pd.DataFrame({
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'Col3':list('aXcdXf'),
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
df = (pd.concat([df, df[df['Col3'].eq('X')].assign(Col3 = 'Z')])
.sort_index(kind='mergesort')
.reset_index(drop=True))
print (df)
B C Col3 D E F
0 4 7 a 1 5 a
1 5 8 X 3 3 a
2 5 8 Z 3 3 a
3 4 9 c 5 6 a
4 5 4 d 7 9 b
5 5 2 X 1 2 b
6 5 2 Z 1 2 b
7 4 3 f 0 4 b

Related

pandas drop last group element

I have a DataFrame df = pd.DataFrame({'col1': ["a","b","c","d","e", "f","g","h"], 'col2': [1,1,1,2,2,3,3,3]}) that looks like
Input:
col1 col2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 3
6 g 3
7 h 3
I want to drop the last row bases off of grouping "col2" which would look like...
Expected Output:
col1 col2
0 a 1
1 b 1
3 d 2
5 f 3
6 g 3
I wrote df.groupby('col2').tail(1) which gets me what I want to delete but when I try to write df.drop(df.groupby('col2').tail(1)) I get an axis error. What would be a solution to this
Look like duplicated would work:
df[df.duplicated('col2', keep='last') |
(~df.duplicated('col2', keep=False)) # this is to keep all single-row groups
]
Or with your approach, you should drop the index:
# this would also drop all single-row groups
df.drop(df.groupby('col2').tail(1).index)
Output:
col1 col2
0 a 1
1 b 1
3 d 2
5 f 3
6 g 3
try this:
df.groupby('col2', as_index=False).apply(lambda x: x.iloc[:-1,:]).reset_index(drop=True)

Delete pandas column if column name begins with a number

I have a pandas DataFrame with about 200 columns. Roughly, I want to do this
for col in df.columns:
if col begins with a number:
df.drop(col)
I'm not sure what are the best practices when it comes to handling pandas DataFrames, how should I handle this? Will my pseudocode work, or is it not recommended to modify a pandas dataframe in a for loop?
I think simpliest is select all columns which not starts with number by filter with regex - ^ is for start of string and \D is for not number:
df1 = df.filter(regex='^\D')
Similar alternative:
df1 = df.loc[:, df.columns.str.contains('^\D')]
Or inverse condition and select numbers:
df1 = df.loc[:, ~df.columns.str.contains('^\d')]
df1 = df.loc[:, ~df.columns.str[0].str.isnumeric()]
If want use your pseudocode:
for col in df.columns:
if col[0].isnumeric():
df = df.drop(col, axis=1)
Sample:
df = pd.DataFrame({'2A':list('abcdef'),
'1B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D3':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df)
1B 2A C D3 E F
0 4 a 7 1 5 a
1 5 b 8 3 3 a
2 4 c 9 5 6 a
3 5 d 4 7 9 b
4 5 e 2 1 2 b
5 4 f 3 0 4 b
df1 = df.filter(regex='^\D')
print (df1)
C D3 E F
0 7 1 5 a
1 8 3 3 a
2 9 5 6 a
3 4 7 9 b
4 2 1 2 b
5 3 0 4 b
An alternative can be this:
columns = [x for x in df.columns if not x[0].isdigit()]
df = df[columns]

How to apply a condition to pandas iloc

I select columns 2 - end from a pandas DataFrame with iloc as
d=c.iloc[:,2:]
now how can I apply a condition to this selection? For example, if column1==1.
You can use DataFrame.iloc if need filter first column select by position, : means here select all rows:
c[c.iloc[:, 0] == 1]
Sample:
c = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (c)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
df = c[c.iloc[:, 3] == 1]
print (df)
A B C D E F
0 a 4 7 1 5 a
4 e 5 2 1 2 b
This is referred to as mixed indexing in that you want to index by boolean results in rows and position in columns. I'd use loc in order to take advantage of boolean indexing for the rows. But that implies that you need column names values for the column slice.
d.loc[d.column1 == 1, d.columns[2:]]
If your column names are not unique then you can resort to the dreaded chained index.
d.loc[d.column1 == 1].iloc[:, 2:]
What might also be intuitive is to use query afterwards:
d.iloc[:, 2:].query('column1 == 1')

Using isin() to determine what should be printed

Right now I have two dataframes (data1 and data2)
I would like to print a column of string values in the dataframe called data1, based on whether the ID exists in both data2 and data1.
What I am doing now gives me a boolean list (True or False if the ID exists in the both dataframes but not the column of strings).
print(data2['id'].isin(data1.id).to_string())
yields
0 True
1 True
2 True
3 True
4 True
5 True
Any ideas would be appreciated.
Here is a sample of data1
'user_id', 'id', 'rating', 'unix_timestamp'
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
And data2 contains something like this
'id', 'title', 'release_date',
'video_release_date', 'imdb_url'
37|Nadja (1994)|01-Jan-1994||http://us.imdb.com/M/title-exact?Nadja%20(1994)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0
38|Net, The (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Net,%20The%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|1|0|0
39|Strange Days (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Strange%20Days%20(1995)|0|1|0|0|0|0|1|0|0|0|0|0|0|0|0|1|0|0|0
If all values of ids are unique:
I think you need merge with inner join. For data2 select only id column, on parameter should be omit, because joining on all columns - here only id:
df = pd.merge(data1, data2[['id']])
Sample:
data1 = pd.DataFrame({'id':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3]})
print (data1)
B C id
0 4 7 a
1 5 8 b
2 4 9 c
3 5 4 d
4 5 2 e
5 4 3 f
data2 = pd.DataFrame({'id':list('frcdeg'),
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],})
print (data2)
D E id
0 1 5 f
1 3 3 r
2 5 6 c
3 7 9 d
4 1 2 e
5 0 4 g
df = pd.merge(data1, data2[['id']])
print (df)
B C id
0 4 9 c
1 5 4 d
2 5 2 e
3 4 3 f
If id are duplicated in one or another Dataframe use another answer, also added similar solutions:
df = data1[data1['id'].isin(set(data1['id']) & set(data2['id']))]
ids = set(data1['id']) & set(data2['id'])
df = data2.query('id in #ids')
df = data1[np.in1d(data1['id'], np.intersect1d(data1['id'], data2['id']))]
Sample:
data1 = pd.DataFrame({'id':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3]})
print (data1)
B C id
0 4 7 a
1 5 8 b
2 4 9 c
3 5 4 d
4 5 2 e
5 4 3 f
data2 = pd.DataFrame({'id':list('fecdef'),
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],})
print (data2)
D E id
0 1 5 f
1 3 3 e
2 5 6 c
3 7 9 d
4 1 2 e
5 0 4 f
df = data1[data1['id'].isin(set(data1['id']) & set(data2['id']))]
print (df)
B C id
2 4 9 c
3 5 4 d
4 5 2 e
5 4 3 f
EDIT:
You can use:
df = data2.loc[data1['id'].isin(set(data1['id']) & set(data2['id'])), ['title']]
ids = set(data1['id']) & set(data2['id'])
df = data2.query('id in #ids')[['title']]
df = data2.loc[np.in1d(data1['id'], np.intersect1d(data1['id'], data2['id'])), ['title']]
You can compute the set intersection of the two columns -
ids = set(data1['id']).intersection(data2['id'])
Or,
ids = np.intersect1d(data1['id'], data2['id'])
Next, query/filter out relevant rows.
data1.loc[data1['id'].isin(ids), 'id']

How to extract rows in a pandas dataframe NOT in a subset dataframe

I have two dataframes. DF and SubDF. SubDF is a subset of DF. I want to extract the rows in DF that are NOT in SubDF.
I tried the following:
DF2 = DF[~DF.isin(SubDF)]
The number of rows are correct and most rows are correct,
ie number of rows in subDF + number of rows in DF2 = number of rows in DF
but I get rows with NaN values that do not exist in the original DF
Not sure what I'm doing wrong.
Note: the original DF does not have any NaN values, and to double check I did DF.dropna() before and the result still produced NaN
You need merge with outer join and boolean indexing, because DataFrame.isin need values and index match:
DF = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (DF)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
SubDF = pd.DataFrame({'A':[3],
'B':[6],
'C':[9],
'D':[5],
'E':[6],
'F':[3]})
print (SubDF)
A B C D E F
0 3 6 9 5 6 3
#return no match
DF2 = DF[~DF.isin(SubDF)]
print (DF2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
DF2 = pd.merge(DF, SubDF, how='outer', indicator=True)
DF2 = DF2[DF2._merge == 'left_only'].drop('_merge', axis=1)
print (DF2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
Another way, borrowing the setup from #jezrael:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
sub = pd.DataFrame({'A':[3],
'B':[6],
'C':[9],
'D':[5],
'E':[6],
'F':[3]})
extract_idx = list(set(df.index) - set(sub.index))
df_extract = df.loc[extract_idx]
The rows may not be sorted in the original df order. If matching order is required:
extract_idx = list(set(df.index) - set(sub.index))
idx_dict = dict(enumerate(df.index))
order_dict = dict(zip(idx_dict.values(), idx_dict.keys()))
df_extract = df.loc[sorted(extract_idx, key=order_dict.get)]

Categories

Resources