I have a data frame with columns of strings and integers.
In one of the string columns I want to search every item for a specific substring, let's say "abc", and delete the row if the substring exists. How do I do that? It sounds easy, but somehow I struggle with this.
The substring is always the last three characters.
I tried the following:
df1 = df.drop(df[df.Hostname[-4:]== "abc"])
which gives me
UserWarning: Boolean Series key will be reindexed to match DataFrame index
so I tried to modify the values in that column and filter out all values that do not have "abc" at the end:
red = [c for c in df.Hostname[-4:] if c != 'abc']
which gives me
KeyError('%s not in index' % objarr[mask])
What am I doing wrong?
Thanks for your help!
Use boolean indexing: select the last 4 (or 3) characters of column Hostname with str, and change the condition from == to !=:
df1 = df[df.Hostname.str[-4:] != "abc"]
Or maybe:
df1 = df[df.Hostname.str[-3:] != "abc"]
Sample:
df = pd.DataFrame({'Hostname':['k abc','abc','dd'],
                   'b':[1,2,3],
                   'c':[4,5,6]})
print (df)
Hostname b c
0 k abc 1 4
1 abc 2 5
2 dd 3 6
df1 = df[df.Hostname.str[-3:] != "abc"]
print (df1)
Hostname b c
2 dd 3 6
str.endswith also works if you need to check the last characters:
df1 = df[~df.Hostname.str.endswith("abc")]
print (df1)
Hostname b c
2 dd 3 6
EDIT:
If you need to check whether abc appears anywhere in the last 4 characters and then remove those rows, first slice the values with str and then use str.contains:
df1 = df[~df.Hostname.str[-4:].str.contains('abc')]
print (df1)
Hostname b c
2 dd 3 6
EDIT1:
For a default index add reset_index - Python counts from 0, so the index values are 0, 1, 2, ...:
df1 = df[df.Hostname.str[-3:] != "abc"].reset_index(drop=True)
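Applied to the sample above, only the last row survives and it is renumbered from 0:
print (df1)
Hostname b c
0 dd 3 6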
Related
I'm trying to copy data from different columns to a particular column in the same DataFrame.
Index  col1A  col2A  colB  list  CT  CW  CH
0      1      :
1      b
2      2
3      3d
But prior to that I wanted to check whether those columns (col1A, col2A, colB) exist in the DataFrame, group the ones that are present, and move the grouped data to the relevant columns (CT, CH, etc.), like:
   CH  CW  CT
0  1       1
1  b       b
2  2       2
3  3d      3d
I did:
col_list1 = ['ColA','ColB','ColC']
test1 = any([i in df.columns for i in col_list1])
if test1 == True:
    df['CH'] = df['Col1A'] + df['Col2A']
    df['CT'] = df['ColB']
but this code throws a KeyError.
I want it to ignore the columns that are not present and add only those that are present.
IIUC, you can use a Python set or Series.isin to find the common columns:
cols = list(set(col_list1) & set(df.columns))
# or
cols = df.columns[df.columns.isin(col_list1)]
df['CH'] = df[cols].sum(axis=1)
Instead of concatenating the columns with +, collect the ones that exist into a list and sum them element-wise with np.sum:
df['CH'] = np.sum([df[c] for c in col_list1 if c in df], axis=0)
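A fuller sketch that applies the same guard to both target columns (the source names Col1A, Col2A and ColB are taken from the code in the question; adjust them to your real headers):
# sum only the source columns that actually exist in df
ch_sources = [c for c in ['Col1A', 'Col2A'] if c in df.columns]
if ch_sources:
    df['CH'] = df[ch_sources].sum(axis=1)
# plain copy for CT, guarded the same way
if 'ColB' in df.columns:
    df['CT'] = df['ColB']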
I have a dataframe df which has 4 columns: 'A', 'B', 'C', 'D'.
I have to search for a substring in each column and return the matching rows of the complete dataframe, keeping their order. For example, if the substring is found in column B in rows 3, 4 and 5, then my final df would have
those 3 rows. For this I am using df[df['A'].str.contains('string_to_search')] and it's working fine, but one of the columns has each element as a list of strings, like column B:
A B C D
0 asdfg [asdfgh, cvb] asdfg nbcjsh
1 fghjk [ertyu] fghhjk yrewf
2 xcvb [qwerr, hjklk, bnm] cvbvb gjfsjgf
3 ertyu [qwert] ertyhhu ertkkk
so df[df['A'].str.contains('string_to_search')] is not working for column B. Please suggest how I can search in this column while maintaining the order of the complete dataframe.
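For reference, a frame like the one in the question can be built as follows (a sketch; the cells in B are assumed to be Python lists of strings):
import pandas as pd

df = pd.DataFrame({'A': ['asdfg', 'fghjk', 'xcvb', 'ertyu'],
                   'B': [['asdfgh', 'cvb'], ['ertyu'],
                         ['qwerr', 'hjklk', 'bnm'], ['qwert']],
                   'C': ['asdfg', 'fghhjk', 'cvbvb', 'ertyhhu'],
                   'D': ['nbcjsh', 'yrewf', 'gjfsjgf', 'ertkkk']})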
Column B contains lists, so you need the in statement:
df1 = df[df['B'].apply(lambda x: 'cvb' in x)]
print (df1)
A B C D
0 asdfg [asdfgh, cvb] asdfg nbcjsh
If you want to use str.contains, apply str.join first; that also makes it possible to search for substrings:
df1 = df[df['B'].str.join(' ').str.contains('er')]
print (df1)
A B C D
1 fghjk [ertyu] fghhjk yrewf
2 xcvb [qwerr, hjklk, bnm] cvbvb gjfsjgf
3 ertyu [qwert] ertyhhu ertkkk
If you want to search in all columns:
df2 = df[df.assign(B=df['B'].str.join(' '))
           .apply(' '.join, axis=1)
           .str.contains('g')]
print (df2)
A B C D
0 asdfg [asdfgh, cvb] asdfg nbcjsh
1 fghjk [ertyu] fghhjk yrewf
2 xcvb [qwerr, hjklk, bnm] cvbvb gjfsjgf
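An equivalent sketch that searches all columns without joining them first ORs the per-column masks (the four column names from the question are hard-coded here):
mask = (df['A'].str.contains('g')
        | df['B'].apply(lambda x: any('g' in i for i in x))
        | df['C'].str.contains('g')
        | df['D'].str.contains('g'))
df2 = df[mask]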
list = ['abc', 'def_1', 'xyz_8']
An example row of the df is below:
abc_1 abc_99 def_1 def_2 xyz_8 xyz_1
2 1 1 2 2 3
I would like to scan the df and select only some of its columns based on the list. A list element can be a substring of the column name. For example, column abc_1 will be included since abc is a substring of it, but xyz_1 is not included since xyz_1 is not an element of the list and no list element is a substring of xyz_1.
I want a df['sum'] = 6 (or 2+1+1+2) for that row.
filter / str.contains
You can use filter or str.contains, both of which support regex:
L = ['abc', 'def_1', 'xyz_8']
# courtesy of #JonClements
df['result'] = df.filter(regex='|'.join(L)).sum(1)
# original
df['result'] = df.iloc[:, df.columns.str.contains('|'.join(L))].sum(1)
print(df)
abc_1 abc_99 def_1 def_2 xyz_8 xyz_1 result
0 2 1 1 2 2 3 6
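Note that '|'.join(L) builds an unanchored pattern and pandas applies re.search to each column name, so 'abc' would also match a hypothetical column such as 'xabc_1'. If only prefix matches are intended, a sketch that anchors and escapes the pattern:
import re

pattern = '^(?:' + '|'.join(map(re.escape, L)) + ')'
df['result'] = df.filter(regex=pattern).sum(1)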
I want something like this.
Index Sentence
0 I
1 want
2 like
3 this
Keyword Index
want 1
this 3
I tried with df.index("Keyword") but it's not giving the result for all the rows. It would be really helpful if someone could solve this.
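Assuming the first table is the input frame (with Index as an ordinary column) and the second is the desired output, the sample can be built like this (a sketch):
import pandas as pd

df = pd.DataFrame({'Index': [0, 1, 2, 3],
                   'Sentence': ['I', 'want', 'like', 'this']})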
Use isin with boolean indexing; nothing else is needed:
df = df[df['Sentence'].isin(['want', 'this'])]
print (df)
Index Sentence
1 1 want
3 3 this
EDIT: If you need to compare against another column:
df = df[df['Sentence'].isin(df['Keyword'])]
#another DataFrame df2
#df = df[df['Sentence'].isin(df2['Keyword'])]
And if you need the index values:
idx = df.index[df['Sentence'].isin(df['Keyword'])]
#alternative
#idx = df[df['Sentence'].isin(df['Keyword'])].index
I'm trying to add a column to my dataframe (DF) based on another column's value and whether that value is in my DF or not.
Example:
>>> d = { 'one' : pd.Series(['aa', 'bb', 'cc', 'aa-01', 'bb-02', 'dd']) }
>>> df = pd.DataFrame(d)
>>> df
one
0 aa
1 bb
2 cc
3 aa-01
4 bb-02
5 dd
I would like to add a column that is True when I can find another element equal to the current element with -01 or -02 appended.
Example: in this dataframe only the elements 'aa' and 'bb' have counterparts with the appended suffix ('aa-01' and 'bb-02'), so only 'aa' and 'bb' will have the value True in the new column.
Expected result:
>>> expected_df
one two
0 aa True
1 bb True
2 cc False
3 aa-01 False
4 bb-02 False
5 dd False
I believe I have to use isin() with apply(), but I can't figure out a way to modify the row and use isin at the same time within the function passed as argument to apply.
Use str.endswith to check for strings ending with the given suffixes and create a boolean mask. Then strip the last three characters from the masked values and feed the result to isin:
mask = df['one'].str.endswith(('-01','-02'))
df['two'] = df['one'].isin(df[mask].squeeze().str[:-3])
df
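The squeeze call turns the single-column selection into a Series; an equivalent sketch that indexes the column directly avoids it:
mask = df['one'].str.endswith(('-01', '-02'))
df['two'] = df['one'].isin(df.loc[mask, 'one'].str[:-3])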