Pandas dataframe isolate row below a keyword

Pandas dataframe isolate row below a keyword - python

I have a 1 column dataframe
df = pd.read_csv(txt_file, header=None)
I am trying to search for a string in the column and then return the row after
key_word_df = df[df[0].str.contains("KeyWord")]
I dont know how you can then every each time the keyword is found, isolate the row below it and assign to a new df.

You could use the .shift method on an indexer. I've split this into multiple lines to demonstrate what's happening, but you could do the operation in a one-liner for brevity in practice.
import pandas as pd
# 1. Dummy DataFrame with strings
In [1]: df = pd.DataFrame(["one", "two", "one", "two", "three"], columns=["text",])
# 2. Create the indexer, use `shift` to move the values down one and `fillna` to remove NaN values
In [2]: idx = df["text"].str.contains("one").shift(1).fillna(False)
In [3]: idx
Out [3]:
0 False
1 True
2 False
3 True
4 False
Name: text, dtype: bool
# 3. Use the indexer to show the next row from the matched values:
In: [4] df[idx]
Out: [4]
text
1 two
3 two

You can use the shift function. Here's an example
import pandas as pd
df = pd.DataFrame({'word': ['hello', 'ice', 'kitten', 'hello', 'foo', 'bar', 'hello'],
'val': [1,2,3,4,5,6,7]})
val word
0 1 hello
1 2 ice
2 3 kitten
3 4 hello
4 5 foo
5 6 bar
6 7 hello
keyword = 'hello'
df[(df['word']==keyword).shift(1).fillna(False)]
val word
1 2 ice
4 5 foo

Here is one way.
Get the index of the rows that match your condition. Then use .loc to get the matching index + 1.
Consider the following example:
df = pd.DataFrame({0: ['KeyWord', 'foo', 'bar', 'KeyWord', 'blah']})
print(df)
# 0
#0 KeyWord
#1 foo
#2 bar
#3 KeyWord
#4 blah
Apply the mask, and get the rows of the index + 1:
key_word_df = df.loc[df[df[0].str.contains("KeyWord")].index + 1, :]
print(key_word_df)
# 0
#1 foo
#4 blah

Related

excluding specific rows in pandas dataframe

I am trying to create a new dataframe selecting only those rows which a specific column value does not start with a capital S.
I have tried the following options:
New_dataframe = dataframe.loc[~dataframe.column.str.startswith(('S'))]
filter = dataframe['column'].astype(str).str.contains(r'^\S')
New_dataframe = dataframe[~filter]
However both options return an empty dataframe. Does anybody have a better solution?

Your code works well:
dataframe = pd.DataFrame({'ColA': ['Start', 'Hello', 'World', 'Stop'],
'ColB': [3, 4, 5, 6]})
New_dataframe = dataframe.loc[~df['ColA'].str.startswith('S')]
print(New_dataframe)
Output:
>>> New_dataframe
ColA ColB
1 Hello 4
2 World 5
>>> dataframe
ColA ColB
0 Start 3
1 Hello 4
2 World 5
3 Stop 6

Rename duplicate column name by order in Pandas

I have a dataframe, df, where I would like to rename two duplicate columns in consecutive order:
Data
DD Nice Nice Hello
0 1 1 2
Desired
DD Nice1 Nice2 Hello
0 1 1 2
Doing
df.rename(columns={"Name": "Name1", "Name": "Name2"})
I am running the rename function, however, because both column names are identical, the results are not desirable.

Here's an approach with groupby:
s = df.columns.to_series().groupby(df.columns)
df.columns = np.where(s.transform('size')>1,
df.columns + s.cumcount().add(1).astype(str),
df.columns)
Output:
DD Nice1 Nice2 Hello
0 0 1 1 2

You could use an itertools.count() counter and a list expression to create new column headers, then assign them to the data frame.
For example:
>>> import itertools
>>> df = pd.DataFrame([[1, 2, 3]], columns=["Nice", "Nice", "Hello"])
>>> df
Nice Nice Hello
0 1 2 3
>>> count = itertools.count(1)
>>> new_cols = [f"Nice{next(count)}" if col == "Nice" else col for col in df.columns]
>>> df.columns = new_cols
>>> df
Nice1 Nice2 Hello
0 1 2 3
(Python 3.6+ required for the f-strings)
EDIT: Alternatively, per the comment below, the list expression can replace any label that may contain "Nice" in case there are unexpected spaces or other characters:
new_cols = [f"Nice{next(count)}" if "Nice" in col else col for col in df.columns]

You can use:
cols = pd.Series(df.columns)
dup_count = cols.value_counts()
for dup in cols[cols.duplicated()].unique():
cols[cols[cols == dup].index.values.tolist()] = [dup + str(i) for i in range(1, dup_count[dup]+1)]
df.columns = cols
Input:
col_1 Nice Nice Nice Hello Hello Hello
col_2 1 2 3 4 5 6
Output:
col_1 Nice1 Nice2 Nice3 Hello1 Hello2 Hello3
col_2 1 2 3 4 5 6
Setup to generate duplicate cols:
df = pd.DataFrame(data={'col_1':['Nice', 'Nice', 'Nice', 'Hello', 'Hello', 'Hello'], 'col_2':[1,2,3,4, 5, 6]})
df = df.set_index('col_1').T

Create new dataframe column from the values of 2 other columns

I have 2 columns in my data frame. At any one instance (row), at least one of the columns has a string value in it, it is possible that the other column has NoneType in it or another string.
I want to create a 3rd column that, in the case where one of the columns is a NoneType, will take the value of the string. And in the case where both are strings, will take the concatenation of the two.
How can I do this?
column1 column2 column3
0 hello None hello
1 None goodbye goodbye
2 hello goodbye hello, goodbye

Series.str.cat
Use na_rep='' so joins with missing values do not result in NaN for the entire row. Then strip any excess separators that were joined due to missing data (assuming separator characters also don't start or end any of your words).
import pandas as pd
df = pd.DataFrame({'column1': ['hello', None, 'hello'],
'column2': [None, 'goodbye', 'goodbye']})
sep = ', '
df['column3'] = (df['column1'].str.cat(df['column2'], sep=sep, na_rep='')
.str.strip(sep))
print(df)
column1 column2 column3
0 hello None hello
1 None goodbye goodbye
2 hello goodbye hello, goodbye
With many columns, where there might be streaks of missing data in the middle, the above doesn't work to remove the excess separators. Instead you could use a slow lambda along the rows. We join all values after dropping the nulls:
df['column3'] = df.apply(lambda row: ', '.join(row.dropna()), axis=1)

Solution
You could replace all the NaNs with an empty string and then conact the columns (A and B) to create column C.
df2 = df.fillna('')
df['C'] = df2.A.str.strip() + df2.B.str.strip(); #del df2;
print(df)
Output:
A B C=A+B
0 1 3 13
1 2 None 2
2 dog dog dogdog
3 None None
4 snake 20 snake20
5 cat None cat
Dummy Data
d = {
'A': ['1', '2', 'dog', None, 'snake', 'cat'],
'B': ['3', None, 'dog', None, '20', None]
}
df = pd.DataFrame(d)
print(df)
Output:
A B
0 1 3
1 2 None
2 dog dog
3 None None
4 snake 20
5 cat None

How to give duplicated columns distinct names in Pandas [duplicate]

I have several columns named the same in a df. I need to rename them but the problem is that the df.rename method renames them all the same way. How I can rename the below blah(s) to blah1, blah4, blah5?
df = pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns = ['blah','blah2','blah3','blah','blah']
df
# blah blah2 blah3 blah blah
# 0 0 1 2 3 4
# 1 5 6 7 8 9
Here is what happens when using the df.rename method:
df.rename(columns={'blah':'blah1'})
# blah1 blah2 blah3 blah1 blah1
# 0 0 1 2 3 4
# 1 5 6 7 8 9

Starting with Pandas 0.19.0 pd.read_csv() has improved support for duplicate column names
So we can try to use the internal method:
In [137]: pd.io.parsers.ParserBase({'names':df.columns})._maybe_dedup_names(df.columns)
Out[137]: ['blah', 'blah2', 'blah3', 'blah.1', 'blah.2']
Since Pandas 1.3.0:
pd.io.parsers.base_parser.ParserBase({'names':df.columns, 'usecols':None})._maybe_dedup_names(df.columns)
This is the "magic" function:
def _maybe_dedup_names(self, names):
# see gh-7160 and gh-9424: this helps to provide
# immediate alleviation of the duplicate names
# issue and appears to be satisfactory to users,
# but ultimately, not needing to butcher the names
# would be nice!
if self.mangle_dupe_cols:
names = list(names) # so we can index
counts = {}
for i, col in enumerate(names):
cur_count = counts.get(col, 0)
if cur_count > 0:
names[i] = '%s.%d' % (col, cur_count)
counts[col] = cur_count + 1
return names

I was looking to find a solution within Pandas more than a general Python solution.
Column's get_loc() function returns a masked array if it finds duplicates with 'True' values pointing to the locations where duplicates are found. I then use the mask to assign new values into those locations. In my case, I know ahead of time how many dups I'm going to get and what I'm going to assign to them but it looks like df.columns.get_duplicates() would return a list of all dups and you can then use that list in conjunction with get_loc() if you need a more generic dup-weeding action
'''UPDATED AS-OF SEPT 2020'''
cols=pd.Series(df.columns)
for dup in df.columns[df.columns.duplicated(keep=False)]:
cols[df.columns.get_loc(dup)] = ([dup + '.' + str(d_idx)
if d_idx != 0
else dup
for d_idx in range(df.columns.get_loc(dup).sum())]
)
df.columns=cols
blah blah2 blah3 blah.1 blah.2
0 0 1 2 3 4
1 5 6 7 8 9
New Better Method (Update 03Dec2019)
This code below is better than above code. Copied from another answer below (#SatishSK):
#sample df with duplicate blah column
df=pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns=['blah','blah2','blah3','blah','blah']
df
# you just need the following 4 lines to rename duplicates
# df is the dataframe that you want to rename duplicated columns
cols=pd.Series(df.columns)
for dup in cols[cols.duplicated()].unique():
cols[cols[cols == dup].index.values.tolist()] = [dup + '.' + str(i) if i != 0 else dup for i in range(sum(cols == dup))]
# rename the columns with the cols list.
df.columns=cols
df
Output:
blah blah2 blah3 blah.1 blah.2
0 0 1 2 3 4
1 5 6 7 8 9

You could use this:
def df_column_uniquify(df):
df_columns = df.columns
new_columns = []
for item in df_columns:
counter = 0
newitem = item
while newitem in new_columns:
counter += 1
newitem = "{}_{}".format(item, counter)
new_columns.append(newitem)
df.columns = new_columns
return df
Then
import numpy as np
import pandas as pd
df=pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns=['blah','blah2','blah3','blah','blah']
so that df:
blah blah2 blah3 blah blah
0 0 1 2 3 4
1 5 6 7 8 9
then
df = df_column_uniquify(df)
so that df:
blah blah2 blah3 blah_1 blah_2
0 0 1 2 3 4
1 5 6 7 8 9

You could assign directly to the columns:
In [12]:
df.columns = ['blah','blah2','blah3','blah4','blah5']
df
Out[12]:
blah blah2 blah3 blah4 blah5
0 0 1 2 3 4
1 5 6 7 8 9
[2 rows x 5 columns]
If you want to dynamically just rename the duplicate columns then you could do something like the following (code taken from answer 2: Index of duplicates items in a python list):
In [25]:
import collections
dups = collections.defaultdict(list)
dup_indices=[]
col_list=list(df.columns)
for i, e in enumerate(list(df.columns)):
dups[e].append(i)
for k, v in sorted(dups.items()):
if len(v) >= 2:
dup_indices = v
for i in dup_indices:
col_list[i] = col_list[i] + ' ' + str(i)
col_list
Out[25]:
['blah 0', 'blah2', 'blah3', 'blah 3', 'blah 4']
You could then use this to assign back, you could also have a function to generate a unique name that is not present in the columns prior to renaming.

duplicated_idx = dataset.columns.duplicated()
duplicated = dataset.columns[duplicated_idx].unique()
rename_cols = []
i = 1
for col in dataset.columns:
if col in duplicated:
rename_cols.extend([col + '_' + str(i)])
else:
rename_cols.extend([col])
dataset.columns = rename_cols

Thank you #Lamakaha for the solution. Your idea gave me a chance to modify it and make it workable in all the cases.
I am using Python 3.7.3 version.
I tried your piece of code on my data set which had only one duplicated column i.e. two columns with same name. Unfortunately, the column names remained As-Is without being renamed. On top of that I got a warning that "get_duplicates() is deprecated and same will be removed in future version". I used duplicated() coupled with unique() in place of get_duplicates() which did not yield the expected result.
I have modified your piece of code little bit which is working for me now for my data set as well as in other general cases as well.
Here are the code runs with and without code modification on the example data set mentioned in the question along with results:
df=pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns=['blah','blah2','blah3','blah','blah']
df
cols=pd.Series(df.columns)
for dup in df.columns.get_duplicates():
cols[df.columns.get_loc(dup)]=[dup+'.'+str(d_idx) if d_idx!=0 else dup for d_idx in range(df.columns.get_loc(dup).sum())]
df.columns=cols
df
f:\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning:
'get_duplicates' is deprecated and will be removed in a future
release. You can use idx[idx.duplicated()].unique() instead
Output:
blah blah2 blah3 blah blah.1
0 0 1 2 3 4
1 5 6 7 8 9
Two of the three "blah"(s) are not renamed properly.
Modified code
df=pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns=['blah','blah2','blah3','blah','blah']
df
cols=pd.Series(df.columns)
for dup in cols[cols.duplicated()].unique():
cols[cols[cols == dup].index.values.tolist()] = [dup + '.' + str(i) if i != 0 else dup for i in range(sum(cols == dup))]
df.columns=cols
df
Output:
blah blah2 blah3 blah.1 blah.2
0 0 1 2 3 4
1 5 6 7 8 9
Here is a run of modified code on some another example:
cols = pd.Series(['X', 'Y', 'Z', 'A', 'B', 'C', 'A', 'A', 'L', 'M', 'A', 'Y', 'M'])
for dup in cols[cols.duplicated()].unique():
cols[cols[cols == dup].index.values.tolist()] = [dup + '_' + str(i) if i != 0 else dup for i in range(sum(cols == dup))]
cols
Output:
0 X
1 Y
2 Z
3 A
4 B
5 C
6 A_1
7 A_2
8 L
9 M
10 A_3
11 Y_1
12 M_1
dtype: object
Hope this helps anybody who is seeking answer to the aforementioned question.

Since the accepted answer (by Lamakaha) is not working for recent versions of pandas, and because the other suggestions looked a bit clumsy, I worked out my own solution:
def dedupIndex(idx, fmt=None, ignoreFirst=True):
# fmt: A string format that receives two arguments:
# name and a counter. By default: fmt='%s.%03d'
# ignoreFirst: Disable/enable postfixing of first element.
idx = pd.Series(idx)
duplicates = idx[idx.duplicated()].unique()
fmt = '%s.%03d' if fmt is None else fmt
for name in duplicates:
dups = idx==name
ret = [ fmt%(name,i) if (i!=0 or not ignoreFirst) else name
for i in range(dups.sum()) ]
idx.loc[dups] = ret
return pd.Index(idx)
Use the function as follows:
df.columns = dedupIndex(df.columns)
# Result: ['blah', 'blah2', 'blah3', 'blah.001', 'blah.002']
df.columns = dedupIndex(df.columns, fmt='%s #%d', ignoreFirst=False)
# Result: ['blah #0', 'blah2', 'blah3', 'blah #1', 'blah #2']

Here's a solution that also works for multi-indexes
# Take a df and rename duplicate columns by appending number suffixes
def rename_duplicates(df):
import copy
new_columns = df.columns.values
suffix = {key: 2 for key in set(new_columns)}
dup = pd.Series(new_columns).duplicated()
if type(df.columns) == pd.core.indexes.multi.MultiIndex:
# Need to be mutable, make it list instead of tuples
for i in range(len(new_columns)):
new_columns[i] = list(new_columns[i])
for ix, item in enumerate(new_columns):
item_orig = copy.copy(item)
if dup[ix]:
for level in range(len(new_columns[ix])):
new_columns[ix][level] = new_columns[ix][level] + f"_{suffix[tuple(item_orig)]}"
suffix[tuple(item_orig)] += 1
for i in range(len(new_columns)):
new_columns[i] = tuple(new_columns[i])
df.columns = pd.MultiIndex.from_tuples(new_columns)
# Not a MultiIndex
else:
for ix, item in enumerate(new_columns):
if dup[ix]:
new_columns[ix] = item + f"_{suffix[item]}"
suffix[item] += 1
df.columns = new_columns

I just wrote this code it uses a list comprehension to update all duplicated names.
df.columns = [x[1] if x[1] not in df.columns[:x[0]] else f"{x[1]}_{list(df.columns[:x[0]]).count(x[1])}" for x in enumerate(df.columns)]

Created a function with some tests so it should be drop in ready; this is a little different than Lamakaha's excellent solution since it renames the first appearance of a duplicate column:
from collections import defaultdict
from typing import Dict, List, Set
import pandas as pd
def rename_duplicate_columns(df: pd.DataFrame) -> pd.DataFrame:
"""Rename column headers to ensure no header names are duplicated.
Args:
df (pd.DataFrame): A dataframe with a single index of columns
Returns:
pd.DataFrame: The dataframe with headers renamed; inplace
"""
if not df.columns.has_duplicates:
return df
duplicates: Set[str] = set(df.columns[df.columns.duplicated()].tolist())
indexes: Dict[str, int] = defaultdict(lambda: 0)
new_cols: List[str] = []
for col in df.columns:
if col in duplicates:
indexes[col] += 1
new_cols.append(f"{col}.{indexes[col]}")
else:
new_cols.append(col)
df.columns = new_cols
return df
def test_rename_duplicate_columns():
df = pd.DataFrame(data=[[1, 2]], columns=["a", "b"])
assert rename_duplicate_columns(df).columns.tolist() == ["a", "b"]
df = pd.DataFrame(data=[[1, 2]], columns=["a", "a"])
assert rename_duplicate_columns(df).columns.tolist() == ["a.1", "a.2"]
df = pd.DataFrame(data=[[1, 2, 3]], columns=["a", "b", "a"])
assert rename_duplicate_columns(df).columns.tolist() == ["a.1", "b", "a.2"]

We can just assign each column a different name.
Suppoese duplicate column name is like = [a,b,c,d,d,c]
Then just create a list of name what you want to assign:
C = [a,b,c,d,D1,C1]
df.columns = c
This works for me.

This is my solution:
cols = [] # for tracking if we alread seen it before
new_cols = []
for col in df.columns:
cols.append(col)
count = cols.count(col)
if count > 1:
new_cols.append(f'{col}_{count}')
else:
new_cols.append(col)
df.columns = new_cols

Here's an elegant solution:
Isolate a dataframe with only the repeated columns (looks like it will be a series but it will be a dataframe if >1 column with that name):
df1 = df['blah']
For each "blah" column, give it a unique number
df1.columns = ['blah_' + str(int(x)) for x in range(len(df1.columns))]
Isolate a dataframe with all but the repeated columns:
df2 = df[[x for x in df.columns if x != 'blah']]
Merge back together on indices:
df3 = pd.merge(df1, df2, left_index=True, right_index=True)
Et voila:
blah_0 blah_1 blah_2 blah2 blah3
0 0 3 4 1 2
1 5 8 9 6 7

Getting the integer index of a Pandas DataFrame row fulfilling a condition?

I have the following DataFrame:
a b c
b
2 1 2 3
5 4 5 6
As you can see, column b is used as an index. I want to get the ordinal number of the row fulfilling ('b' == 5), which in this case would be 1.
The column being tested can be either an index column (as with b in this case) or a regular column, e.g. I may want to find the index of the row fulfilling ('c' == 6).

Use Index.get_loc instead.
Reusing #unutbu's set up code, you'll achieve the same results.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.arange(1,7).reshape(2,3),
columns = list('abc'),
index=pd.Series([2,5], name='b'))
>>> df
a b c
b
2 1 2 3
5 4 5 6
>>> df.index.get_loc(5)
1

You could use np.where like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(1,7).reshape(2,3),
columns = list('abc'),
index=pd.Series([2,5], name='b'))
print(df)
# a b c
# b
# 2 1 2 3
# 5 4 5 6
print(np.where(df.index==5)[0])
# [1]
print(np.where(df['c']==6)[0])
# [1]
The value returned is an array since there could be more than one row with a particular index or value in a column.

With Index.get_loc and general condition:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.arange(1,7).reshape(2,3),
columns = list('abc'),
index=pd.Series([2,5], name='b'))
>>> df
a b c
b
2 1 2 3
5 4 5 6
>>> df.index.get_loc(df.index[df['b'] == 5][0])
1

The other answers based on Index.get_loc() do not provide a consistent result, because this function will return in integer if the index consists of all unique values, but it will return a boolean mask array if the index does not consist of unique values. A more consistent approach to return a list of integer values every time would be the following, with this example shown for an index with non-unique values:
df = pd.DataFrame([
{"A":1, "B":2}, {"A":2, "B":2},
{"A":3, "B":4}, {"A":1, "B":3}
], index=[1,2,3,1])
If searching based on index value:
[i for i,v in enumerate(df.index == 1) if v]
[0, 3]
If searching based on a column value:
[i for i,v in enumerate(df["B"] == 2) if v]
[0, 1]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas dataframe isolate row below a keyword - python

Related

excluding specific rows in pandas dataframe

Rename duplicate column name by order in Pandas

Create new dataframe column from the values of 2 other columns

How to give duplicated columns distinct names in Pandas [duplicate]

Getting the integer index of a Pandas DataFrame row fulfilling a condition?

Categories

Resources