How to select DataFrame columns based on partial matching? - python

I was struggling this afternoon to find a way of selecting a few columns of my Pandas DataFrame by checking for the occurrence of a certain pattern in their names (labels?).
I had been looking for something like contains or isin for ndarrays / pd.Series, but had no luck.
This frustrated me quite a bit, as I was already checking the columns of my DataFrame for occurrences of specific string patterns, as in:
hp = ~(df.target_column.str.contains('some_text') | df.target_column.str.contains('other_text'))
df_cln= df[hp]
However, no matter how I banged my head, I could not apply .str.contains() to the object returned by df.columns - which is an Index - nor the one returned by df.columns.values - which is an ndarray. This works fine for what is returned by the "slicing" operation df[column_name], i.e. a Series, though.
My first solution involved a for loop and the creation of a help list:
ll = []
for a in df.columns:
    if a.startswith('start_exp1') or a.startswith('start_exp2'):
        ll.append(a)
df[ll]
(one could apply any of the str functions, of course)
Then, I found the map function and got it to work with the following code:
import re
sel = df.columns.map(lambda x: bool(re.search('your_regex', x)))
df[df.columns[sel]]
Of course in the first solution I could have performed the same kind of regex checking, because I can apply it to the str data type returned by the iteration.
I am very new to Python and never really programmed anything so I am not too familiar with speed/timing/efficiency, but I tend to think that the second method - using a map - could potentially be faster, besides looking more elegant to my untrained eye.
I am curious to know what you think of it, and what possible alternatives would be. Given my level of noobness, I would really appreciate if you could correct any mistakes I could have made in the code and point me in the right direction.
Thanks,
Michele
EDIT : I just found the Index method Index.to_series(), which returns - ehm - a Series to which I could apply .str.contains('whatever').
However, this is not quite as powerful as a true regex, and I could not find a way of passing the result of Index.to_series().str to the re.search() function..

Selecting columns by partial string match can be done simply with filter:
df.filter(like='hello') # select columns which contain the word hello
To select rows whose index labels contain a partial string, pass axis=0 to filter:
df.filter(like='hello', axis=0)
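As a rough sketch of how filter behaves, assuming a small made-up frame (the column names below are invented for illustration): like= does plain substring matching, while regex= keeps labels for which re.search finds a match.

```python
import pandas as pd

# Hypothetical frame; the column names are made up for this example.
df = pd.DataFrame({
    "start_exp1_a": [1, 2],
    "start_exp2_b": [3, 4],
    "other": [5, 6],
})

# like= keeps labels that contain the given substring.
by_substring = df.filter(like="exp1")

# regex= keeps labels for which re.search finds a match.
by_regex = df.filter(regex=r"^start_exp[12]")

print(list(by_substring.columns))
print(list(by_regex.columns))
```

Since filter only looks at labels, it never inspects the data in the columns themselves.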

Your solution using map is very good. If you really want to use str.contains, it is possible to convert Index objects to Series (which have the str.contains method):
In [1]: df
Out[1]:
x y z
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
In [2]: df.columns.to_series().str.contains('x')
Out[2]:
x True
y False
z False
dtype: bool
In [3]: df[df.columns[df.columns.to_series().str.contains('x')]]
Out[3]:
x
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
UPDATE I just read your last paragraph. From the documentation, str.contains allows you to pass a regex by default (str.contains('^myregex'))
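In recent pandas versions the .str accessor is also available on the Index itself, so the to_series() step can usually be skipped. A minimal sketch, with invented column names:

```python
import pandas as pd

# Hypothetical frame; the column names are made up for this example.
df = pd.DataFrame({"start_exp1": [1], "start_exp2": [2], "end_exp1": [3]})

# Boolean mask over the column labels; the pattern is treated as a regex.
mask = df.columns.str.contains(r"^start_exp\d")

# Select all rows and only the columns where the mask is True.
selected = df.loc[:, mask]
print(selected.columns.tolist())
```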

I think df.keys().tolist() is the thing you're searching for.
A tiny example:
import pandas as pd
d = pd.DataFrame({'somename': [1,2,3], 'othername': [4,5,6]})
names = d.keys().tolist()
for n in names:
    print(n)
    print(type(n))
Output:
somename
<class 'str'>
othername
<class 'str'>
Then with the strings you got, you can do any string operation you want.
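For instance, the prefix check from the question can be written as a list comprehension over the column names. This is only a sketch, and the column names are invented for illustration:

```python
import pandas as pd

# Hypothetical frame; the column names are made up for this example.
d = pd.DataFrame({"start_exp1_x": [1], "start_exp2_y": [2], "misc": [3]})

# str.startswith accepts a tuple of prefixes, so one call covers both cases.
prefixes = ("start_exp1", "start_exp2")
wanted = [name for name in d.columns if name.startswith(prefixes)]

# Index the frame with the list of matching names.
subset = d[wanted]
print(wanted)
```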

Related

How to change the values of certain rows in a column of a MultiIndex Dataframe in Pandas

I have a dataframe (df) with the following columns:
print(df.columns)
['A','B','C','D','E']
And let's assume all the columns have numbers as data.
Then I select some of the columns to become indexes
Index = ['A','B','C']
df.set_index(Index).sort_index()
and I use it this way for some analysis. At some point I need to change the rows of column 'E' when index 'C' has certain values, for instance something like:
df.loc[df[(slice(None,None),slice(None,None),slice(5,10))], 'E' ] = 6
Which, obviously, doesn't work. I have tried a bunch of different approaches: using tuples and slices for the index as shown in my line above, rearranging the index levels so I can use a single slice (moving 'C' to the first level), trying .xs (cross section), etc., and I cannot do it. (I have been looking into the documentation of .loc, .xs, etc.) I can't find an example that does exactly this, nor a conclusive answer that it is not possible. Right now I was able to do the following:
df.reset_index(inplace=True) # turn it back into a normal DataFrame
df.loc[(df['C'] >= 5) & (df['C'] <= 10),'E'] = 6 # modify normally based on column data
df = df.set_index(Index).sort_index() # bring it back to a MultiIndex
But this doesn't seem right. It would seem to me that indexes should be able to be sliced somehow, I just can't find how. Perhaps I'm not searching for the correct terms on Google. If anyone could give me a hand or point me in the right direction I'd greatly appreciate it.
You can use df.index.get_level_values('C')--which returns an index array of the values--like below.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,10,size=(100, 5)), columns=list('ABCDE'))
df = df.set_index(['A','B','C']).sort_index()
df.loc[(df.index.get_level_values('C') <= 10) & (df.index.get_level_values('C') >= 5), 'E'] = 6
print(df)
Results:
D E
A B C
0 0 6 3 6
2 0 6 1
7 2 6
3 6 5 6
9 1 6
... .. ..
9 3 3 5 0
6 6 6
4 3 5 7
7 6 6
6 8 6 6
Note: The parentheses around both comparisons are required. & binds more tightly than <= and >=, so without them the expression is parsed differently and raises an error.
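Since the question asked about slicing the index directly, here is a sketch of the same update written with pd.IndexSlice. It assumes the MultiIndex is sorted, and it uses deterministic stand-in data rather than the random frame above:

```python
import pandas as pd

# Stand-in data: every (A, B, C) combination appears exactly once.
index = pd.MultiIndex.from_product(
    [[0, 1], [0, 1], range(12)], names=["A", "B", "C"]
)
df = pd.DataFrame({"D": 0, "E": 0}, index=index).sort_index()

# Any A, any B, and C from 5 to 10 (label slices include both ends).
idx = pd.IndexSlice
df.loc[idx[:, :, 5:10], "E"] = 6

# Collect the C values of the rows that were changed.
changed = df.index.get_level_values("C")[(df["E"] == 6).to_numpy()]
print(sorted(changed.unique().tolist()))
```

The boolean-mask version with get_level_values does not need a sorted index, which makes it the more forgiving of the two.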

pandas how to multiply all elements of same column in python

I know it is a simple answer but I couldn't find it anywhere. I need to show the product of all values of a single column in Python.
Here's the dataframe:
VALUE
0 2
1 3
2 1
3 3
4 1
The output should give me 2 * 3 * 1 * 3 * 1 = 18
Try prod
df.VALUE.prod()
Out[345]: 18
To add to the previous answer you can use df.product(axis=0) as well.
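Both suggestions can be tried on the sample data from the question. A short sketch:

```python
import pandas as pd

df = pd.DataFrame({"VALUE": [2, 3, 1, 3, 1]})

# Product of a single column: returns a scalar.
single = df["VALUE"].prod()

# Product of every column at once: returns a Series indexed by column name.
per_column = df.product(axis=0)

print(single)
print(per_column)
```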

How to drop parentheses within column or data frame

df =
A B
1 5
2 6)
(3 7
4 8
To remove parentheses I did:
df.A = df.A.str.replace(r"\(.*\)","")
But no result. I have checked a lot of replies here, but still same result.
I would appreciate help removing parentheses from the whole data set, or at least from a column.
to remove parentheses from the whole data set
With regex character class [...] :
In [15]: df.apply(lambda s: s.str.replace(r'[()]', '', regex=True))
Out[15]:
A B
0 1 5
1 2 6
2 3 7
3 4 8
Or the same with df.replace(r'[()]', '', regex=True) which is a more concise way.
If you want regex, you can use r"[()]" instead of alternation groups, as long as you need to replace only one character at a time.
df.A = df.A.str.replace(r"[()]", "", regex=True)
I find it easier to read and alter if needed.
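A self-contained sketch of both approaches, using stand-in string data similar to the question's. Note that str.replace treats the pattern as a literal string by default since pandas 2.0, so regex=True is passed explicitly here:

```python
import pandas as pd

# Stand-in data with stray parentheses stored as strings.
df = pd.DataFrame({"A": ["1", "2", "(3", "4"], "B": ["5", "6)", "7", "8"]})

# Whole frame: regex=True makes the pattern a regular expression.
cleaned = df.replace(r"[()]", "", regex=True)

# Single column: the character class matches "(" or ")" one at a time.
col_a = df["A"].str.replace(r"[()]", "", regex=True)

print(cleaned)
```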

Dynamically accessing a pandas dataframe column

Consider this simple example
import pandas as pd
df = pd.DataFrame({'one' : [1,2,3],
'two' : [1,0,0]})
df
Out[9]:
one two
0 1 1
1 2 0
2 3 0
I want to write a function that takes as inputs a dataframe df and a column mycol.
Now this works:
df.groupby('one').two.sum()
Out[10]:
one
1 1
2 0
3 0
Name: two, dtype: int64
this works too:
def okidoki(df,mycol):
    return df.groupby('one')[mycol].sum()
okidoki(df, 'two')
Out[11]:
one
1 1
2 0
3 0
Name: two, dtype: int64
but this FAILS
def megabug(df,mycol):
    return df.groupby('one').mycol.sum()
megabug(df, 'two')
AttributeError: 'DataFrameGroupBy' object has no attribute 'mycol'
What is wrong here?
I am worried that okidoki uses some chaining that might create some subtle bugs (https://pandas.pydata.org/pandas-docs/stable/indexing.html#why-does-assignment-fail-when-using-chained-indexing).
How can I still keep the syntax groupby('one').mycol? Can the mycol string be converted to something that might work that way?
Thanks!
You pass a string as the second argument. In effect, you're trying to do something like:
df.'two'
Which is invalid syntax. If you're trying to dynamically access a column, you'll need to use the index notation, [...] because the dot/attribute accessor notation doesn't work for dynamic access.
Dynamic access on its own is possible. For example, you can use getattr (but I don't recommend this, it's an antipattern):
In [674]: df
Out[674]:
one two
0 1 1
1 2 0
2 3 0
In [675]: getattr(df, 'one')
Out[675]:
0 1
1 2
2 3
Name: one, dtype: int64
Dynamically selecting by attribute from a groupby call can be done, something like:
In [677]: getattr(df.groupby('one'), mycol).sum()
Out[677]:
one
1 1
2 0
3 0
Name: two, dtype: int64
But don't do it. It is a horrid anti pattern, and much more unreadable than df.groupby('one')[mycol].sum().
I think you need [] to select a column by name, which is the general solution for selecting columns, because selection by attribute has many exceptions:
You can use this access only if the index element is a valid python identifier, e.g. s.1 is not allowed. See here for an explanation of valid identifiers.
The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed.
Similarly, the attribute will not be available if it conflicts with any of the following list: index, major_axis, minor_axis, items, labels.
In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will access the corresponding element or column.
def megabug(df,mycol):
    return df.groupby('one')[mycol].sum()
print (megabug(df, 'two'))
one
1 1
2 0
3 0
Name: two, dtype: int64
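The same idea extends to passing the grouping key as a string too. The sketch below uses a hypothetical helper name, group_sum. The bracket selection here only reads data, so the chained-assignment concern from the question does not apply:

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [1, 0, 0]})

def group_sum(frame, key, value_col):
    """Sum value_col within each group of key (both given as strings)."""
    # Bracket indexing works with any column name held in a variable.
    return frame.groupby(key)[value_col].sum()

result = group_sum(df, "one", "two")
print(result)
```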

Pandas overwrite values in column selectively based on condition from another column

I have a dataframe in pandas with four columns. The data consists of strings. Sample:
A B C D
0 2 asicdsada v:cVccv u
1 4 ascccaiiidncll v:cVccv:ccvc u
2 9 sca V:c u
3 11 lkss v:cv u
4 13 lcoao v:ccv u
5 14 wuduakkk V:ccvcv: u
I want to replace the string 'u' in Col D with the string 'a' if Col C in that row contains the substring 'V' (case sensitive).
Desired outcome:
A B C D
0 2 asicdsada v:cVccv a
1 4 ascccaiiidncll v:cVccv:ccvc a
2 9 sca V:c a
3 11 lkss v:cv u
4 13 lcoao v:ccv u
5 14 wuduakkk V:ccvcv: a
I prefer to overwrite the value already in Column D, rather than assign two different values, because I'd like to selectively overwrite some of these values again later, under different conditions.
It seems like this should have a simple solution, but I cannot figure it out, and haven't been able to find a fully applicable solution in other answered questions.
df.ix[1]["D"] = "a"
changes an individual value.
df.ix[:]["C"].str.contains("V")
returns a series of booleans, but I am not sure what to do with it. I have tried many many combinations of .loc, apply, contains, re.search, and for loops, and I get either errors or replace every value in column D. I'm a novice with pandas/python so it's hard to know whether my syntax, methods, or conceptualization of what I even need to do are off (probably all of the above).
As you've already tried, use str.contains to get a boolean Series, and then use .loc to say "change these rows and the D column". For example:
In [5]: df.loc[df["C"].str.contains("V"), "D"] = "a"
In [6]: df
Out[6]:
A B C D
0 2 asicdsada v:cVccv a
1 4 ascccaiiidncll v:cVccv:ccvc a
2 9 sca V:c a
3 11 lkss v:cv u
4 13 lcoao v:ccv u
5 14 wuduakkk V:ccvcv: a
(Avoid using .ix: it was deprecated in pandas 0.20 and removed in pandas 1.0.)
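Because .loc assigns in place, the same column can be overwritten again later under a different condition, as the question intends. In the sketch below the second rule is hypothetical and only there to illustrate the pattern:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [2, 4, 9, 11, 13, 14],
    "B": ["asicdsada", "ascccaiiidncll", "sca", "lkss", "lcoao", "wuduakkk"],
    "C": ["v:cVccv", "v:cVccv:ccvc", "V:c", "v:cv", "v:ccv", "V:ccvcv:"],
    "D": ["u"] * 6,
})

# First pass: rows whose C contains an uppercase V get "a".
df.loc[df["C"].str.contains("V"), "D"] = "a"

# Second pass (hypothetical rule): among the rows already set to "a",
# those whose C starts with "V" are overwritten with "b".
starts_with_v = df["C"].str.startswith("V")
df.loc[(df["D"] == "a") & starts_with_v, "D"] = "b"

print(df["D"].tolist())
```

Each comparison is wrapped in parentheses before combining with &, since & binds more tightly than ==.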
