Pandas overwrite values in column selectively based on condition from another column - python

I have a dataframe in pandas with four columns. The data consists of strings. Sample:
    A               B             C  D
0   2       asicdsada       v:cVccv  u
1   4  ascccaiiidncll  v:cVccv:ccvc  u
2   9             sca           V:c  u
3  11            lkss          v:cv  u
4  13           lcoao         v:ccv  u
5  14        wuduakkk      V:ccvcv:  u
I want to replace the string 'u' in Col D with the string 'a' if Col C in that row contains the substring 'V' (case sensitive).
Desired outcome:
    A               B             C  D
0   2       asicdsada       v:cVccv  a
1   4  ascccaiiidncll  v:cVccv:ccvc  a
2   9             sca           V:c  a
3  11            lkss          v:cv  u
4  13           lcoao         v:ccv  u
5  14        wuduakkk      V:ccvcv:  a
I prefer to overwrite the value already in Column D, rather than assign two different values, because I'd like to selectively overwrite some of these values again later, under different conditions.
It seems like this should have a simple solution, but I cannot figure it out, and haven't been able to find a fully applicable solution in other answered questions.
df.ix[1]["D"] = "a"
changes an individual value.
df.ix[:]["C"].str.contains("V")
returns a Series of booleans, but I am not sure what to do with it. I have tried many combinations of .loc, apply, contains, re.search, and for loops, and I get either errors or end up replacing every value in column D. I'm a novice with pandas/Python, so it's hard to know whether my syntax, methods, or conceptualization of what I even need to do are off (probably all of the above).

As you've already tried, use str.contains to get a boolean Series, and then use .loc to say "change these rows and the D column". For example:
In [5]: df.loc[df["C"].str.contains("V"), "D"] = "a"
In [6]: df
Out[6]:
    A               B             C  D
0   2       asicdsada       v:cVccv  a
1   4  ascccaiiidncll  v:cVccv:ccvc  a
2   9             sca           V:c  a
3  11            lkss          v:cv  u
4  13           lcoao         v:ccv  u
5  14        wuduakkk      V:ccvcv:  a
(Avoid using .ix: it was deprecated long ago and has been removed from modern pandas entirely; use .loc or .iloc instead.)
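Since the goal is to overwrite some of these values again later under different conditions, note that each masked .loc assignment is independent and only touches the rows its mask selects. A minimal sketch (the second condition is made up purely for illustration):

import pandas as pd

df = pd.DataFrame({
    'A': [2, 4, 9, 11, 13, 14],
    'B': ['asicdsada', 'ascccaiiidncll', 'sca', 'lkss', 'lcoao', 'wuduakkk'],
    'C': ['v:cVccv', 'v:cVccv:ccvc', 'V:c', 'v:cv', 'v:ccv', 'V:ccvcv:'],
    'D': ['u'] * 6,
})

# First pass: rows whose C contains a capital 'V' get 'a' in D
# (str.contains is case sensitive by default).
df.loc[df['C'].str.contains('V'), 'D'] = 'a'

# A later pass overwrites selectively again; this hypothetical condition
# only touches row 3, leaving every other D value as it is.
df.loc[df['B'].str.startswith('lk'), 'D'] = 'b'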

Related

How to change the values of certain rows in a column of a MultiIndex DataFrame in Pandas

I have a dataframe (df) with the following columns:
print(df.columns)
['A','B','C','D','E']
And let's assume all the columns have numbers as data.
Then I select some of the columns to become indexes
Index = ['A','B','C']
df = df.set_index(Index).sort_index()
and I use it this way for some analysis. At some point I need to change the rows of column 'E' where index 'C' has certain values, for instance something like:
df.loc[df[(slice(None,None),slice(None,None),slice(5,10))], 'E' ] = 6
Which, obviously, doesn't work. I have tried a bunch of different approaches: using tuples and slices for the index as shown in the line above, rearranging the indexes so I can use a single slice (moving 'C' to the first level), trying .xs (cross section), etc., and I cannot do it (I have been looking into the documentation of .loc, .xs, and so on). I don't find an example that does exactly this, nor a conclusive answer that it is not possible. Right now I am able to do the following:
df.reset_index(inplace=True)  # turn it back into a normal DataFrame
df.loc[(df['C'] >= 5) & (df['C'] <= 10), 'E'] = 6  # modify normally, based on column data
df = df.set_index(Index).sort_index()  # bring back the MultiIndex
But this doesn't seem right. It would seem to me that indexes should be sliceable somehow; I just can't find how. Perhaps I'm not searching for the correct terms on Google. If anyone could give me a hand or point me in the right direction, I'd greatly appreciate it.
You can use df.index.get_level_values('C'), which returns an Index of the values at that level, like below.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 10, size=(100, 5)), columns=list('ABCDE'))
df = df.set_index(['A','B','C']).sort_index()
df.loc[(df.index.get_level_values('C') <= 10) & (df.index.get_level_values('C') >= 5), 'E'] = 6
print(df)
Results:
       D  E
A B C
0 0 6  3  6
  2 0  6  1
    7  2  6
  3 6  5  6
    9  1  6
...   .. ..
9 3 3  5  0
    6  6  6
  4 3  5  7
    7  6  6
  6 8  6  6
Note: the parentheses around both .get_level_values() comparisons are required, because otherwise the expression is ambiguous and will throw an error.
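For completeness, a sorted MultiIndex can also be sliced directly with pd.IndexSlice, which is close to what the asker originally tried. A sketch under the same setup (the index must be sorted, which sort_index() guarantees):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=(100, 5)), columns=list('ABCDE'))
df = df.set_index(['A', 'B', 'C']).sort_index()

# Take all values of levels A and B, and labels 5 through 10 (inclusive)
# of level C, then assign to column E for those rows.
idx = pd.IndexSlice
df.loc[idx[:, :, 5:10], 'E'] = 6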

How to eliminate just pair duplicates in a Pandas Dataframe?

After reading most of the questions related to pair duplicates, none addresses the following issue:
Given a Df:
Letter
0 a
1 b
2 c
3 d
4 a
5 b
6 a
7 a
8 a
Eliminate only pairs of duplicates. For example: as the Df has 5 a's, the solution is to eliminate the first two pairs of a's and leave the last a (order is important). The two b's are simply eliminated because they form a pair. The resulting Df would look like this:
Letter
2 c
3 d
8 a
I hope the issue is clear. Thanks!
You can first get rid of letters with an even number of rows, then use drop_duplicates.
(df.groupby('Letter')
   .filter(lambda x: len(x) % 2 > 0)  # keep letters appearing an odd number of times
   .drop_duplicates(keep='last'))     # then keep only each letter's last occurrence
Out[174]:
Letter
2 c
3 d
8 a
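Note that drop_duplicates(keep='last') works here because the filter step leaves only letters with an odd count, so exactly one row should survive per remaining letter. The rule "keep the last count % 2 rows of each group" can also be written directly; a sketch:

import pandas as pd

df = pd.DataFrame({'Letter': list('abcdabaaa')})

size = df.groupby('Letter')['Letter'].transform('size')
# cumcount(ascending=False) numbers rows from the end of each group
# (last row = 0), so this keeps one row for odd-sized groups and none
# for even-sized ones.
from_end = df.groupby('Letter').cumcount(ascending=False)
print(df[from_end < size % 2])  # rows 2 (c), 3 (d) and 8 (a)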

Python Pandas - filtering df by the number of unique values within a group

Here is an example of the data I'm working on (as a pandas df):
index  inv  Rev_stream  Bill_type  Net_rev
1      1    A           Original   -24.77
2      1    B           Original   -24.77
3      2    A           Original   -409.33
4      2    B           Original   -409.33
5      2    C           Original   -409.33
6      2    D           Original   -409.33
7      3    A           Original   -843.11
8      3    A           Rebill      279.5
9      3    B           Original   -843.11
10     4    A           Rebill      279.5
11     4    B           Original   -843.11
12     5    B           Rebill      279.5
How could I filter this df so as to keep only the lines where an inv/Rev_stream combo has both Original and Rebill values of Bill_type? In the example above it would be only the lines with index 7 and 8.
Is there an easy way to do it, without iterating over the whole dataframe and building dictionaries of invoice+RevStream : Bill_type?
What I'm looking for is some kind of
df = df[df[['inv','Rev_stream']]['Bill_type'].unique().len() == 2]
Unfortunately the code above doesn't work.
Thanks in advance.
You can group your data by inv and Rev_stream columns and then check for each group if both Original and Rebill are in the Bill_type values and filter based on the condition:
(df.groupby(['inv', 'Rev_stream'])
.filter(lambda g: 'Original' in g.Bill_type.values and 'Rebill' in g.Bill_type.values))
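A set-based variant of the same test, which reads a little closer to the requirement "has both"; the sample frame below is a trimmed-down, made-up version of the asker's data:

import pandas as pd

df = pd.DataFrame({
    'inv': [1, 1, 3, 3],
    'Rev_stream': ['A', 'B', 'A', 'A'],
    'Bill_type': ['Original', 'Original', 'Original', 'Rebill'],
    'Net_rev': [-24.77, -24.77, -843.11, 279.5],
})

# Keep only the groups whose Bill_type values include both labels.
out = df.groupby(['inv', 'Rev_stream']).filter(
    lambda g: {'Original', 'Rebill'}.issubset(g['Bill_type'])
)
print(out)  # only the inv 3 / Rev_stream A rows survive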

How to select DataFrame columns based on partial matching?

I was struggling this afternoon to find a way of selecting a few columns of my Pandas DataFrame by checking the occurrence of a certain pattern in their name (label?).
I had been looking for something like contains or isin for np.ndarray / pd.Series objects, but got no luck.
This frustrated me quite a bit, as I was already checking the columns of my DataFrame for occurrences of specific string patterns, as in:
hp = ~(df.target_column.str.contains('some_text') | df.target_column.str.contains('other_text'))
df_cln= df[hp]
However, no matter how I banged my head, I could not apply .str.contains() to the object returned by df.columns (an Index), nor to the one returned by df.columns.values (an ndarray). It works fine for what is returned by the "slicing" operation df[column_name], i.e. a Series, though.
My first solution involved a for loop and the creation of a help list:
ll = []
for a in df.columns:
    if a.startswith('start_exp1') or a.startswith('start_exp2'):
        ll.append(a)
df[ll]
(one could apply any of the str functions, of course)
Then, I found the map function and got it to work with the following code:
import re
sel = df.columns.map(lambda x: bool(re.search('your_regex', x)))
df[df.columns[sel]]
Of course in the first solution I could have performed the same kind of regex checking, because I can apply it to the str data type returned by the iteration.
I am very new to Python and never really programmed anything, so I am not too familiar with speed/timing/efficiency, but I tend to think that the second method, using map, could potentially be faster, besides looking more elegant to my untrained eye.
I am curious to know what you think of it, and what possible alternatives would be. Given my level of noobness, I would really appreciate if you could correct any mistakes I could have made in the code and point me in the right direction.
Thanks,
Michele
EDIT: I just found the Index method Index.to_series(), which returns (ehm) a Series, to which I could apply .str.contains('whatever').
However, this is not quite as powerful as a true regex, and I could not find a way of passing the result of Index.to_series().str to the re.search() function.
Selecting columns by partial string match can be done simply via:
df.filter(like='hello') # select columns which contain the word hello
And to select rows by partial string match, you can pass axis=0 to filter:
df.filter(like='hello', axis=0)
Your solution using map is very good. If you really want to use str.contains, it is possible to convert Index objects to Series (which have the str.contains method):
In [1]: df
Out[1]:
   x  y  z
0  0  0  0
1  1  1  1
2  2  2  2
3  3  3  3
4  4  4  4
5  5  5  5
6  6  6  6
7  7  7  7
8  8  8  8
9  9  9  9
In [2]: df.columns.to_series().str.contains('x')
Out[2]:
x     True
y    False
z    False
dtype: bool
In [3]: df[df.columns[df.columns.to_series().str.contains('x')]]
Out[3]:
   x
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9
UPDATE: I just read your last paragraph. From the documentation, str.contains accepts a regex by default (str.contains('^myregex')).
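Since the EDIT asks about true regexes: df.filter also accepts one directly via its regex keyword, which covers the startswith case from the question without any loop. A small self-contained sketch (the column names are made up):

import pandas as pd

df = pd.DataFrame(columns=['start_exp1_a', 'start_exp2_b', 'other'])

# Keep the columns whose names match the regex, i.e. start with either prefix.
print(df.filter(regex=r'^(start_exp1|start_exp2)').columns.tolist())
# ['start_exp1_a', 'start_exp2_b']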
I think df.keys().tolist() is the thing you're searching for.
A tiny example:
from pandas import DataFrame as df
d = df({'somename': [1, 2, 3], 'othername': [4, 5, 6]})
names = d.keys().tolist()
for n in names:
    print(n)
    print(type(n))
Output:
somename
<class 'str'>
othername
<class 'str'>
Then with the strings you got, you can do any string operation you want.
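For example, a regex scan over that list of names, with a made-up pattern, might look like:

import re
import pandas as pd

d = pd.DataFrame({'somename': [1, 2, 3], 'othername': [4, 5, 6]})

# Keep only the columns whose names match the (hypothetical) pattern.
matching = [n for n in d.keys().tolist() if re.search(r'^some', n)]
print(d[matching])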

How to df.groupby(cols).apply(my_func) for some columns, while leaving a few columns untouched?

Suppose I have a Pandas dataframe df that has columns a, b, c, d, ..., z. And I want to df.groupby('a').apply(my_func) for columns d-z, while leaving columns 'b' and 'c' unchanged. How can I do that?
I notice Pandas can apply different functions to different columns by passing a dict. But I have a long column list and just want a parameter or trick that tells Pandas to bypass some columns and apply my_func() to the rest of the columns (otherwise I'd have to build a long dict).
One simple (and general) approach is to create a view of the dataframe with the subset you are interested in (or, for your case, a view with all columns except the ones you want to ignore), and then use apply on that view.
In [116]: df
Out[116]:
     a  b         c   d        f
0  one  3  0.493808  40      bob
1  two  8  0.150585  50    alice
2  one  6  0.641816  56  michael
3  two  5  0.935653  56      joe
4  one  1  0.521159  48     kate
Use your favorite methods to create the view you need. You could select a range of columns like so: df_view = df.loc[:, 'b':'d'], but the following might be more useful for your scenario:
# I want all columns except two
cols = df.columns.tolist()
mycols = [x for x in cols if x not in ['a', 'f']]
df_view = df[mycols]
Apply your function to that view. (Note this doesn't yet change anything in df.)
In [158]: df_view.apply(lambda x: x / 2)
Out[158]:
   b         c   d
0  1  0.246904  20
1  4  0.075293  25
2  3  0.320908  28
3  2  0.467827  28
4  0  0.260579  24
Update the df using update()
In [156]: df.update(df_view.apply(lambda x: x/2))
In [157]: df
Out[157]:
     a  b         c   d        f
0  one  1  0.246904  20      bob
1  two  4  0.075293  25    alice
2  one  3  0.320908  28  michael
3  two  2  0.467827  28      joe
4  one  0  0.260579  24     kate
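If the question really is about a groupwise my_func rather than a plain transformation, the same column-selection trick combines with groupby().transform(), which hands each group's columns to the function and keeps the frame's shape. A sketch with a made-up, shape-preserving my_func (demeaning within each group):

import pandas as pd

df = pd.DataFrame({
    'a': ['one', 'two', 'one', 'two'],
    'b': [1, 2, 3, 4],
    'c': [5, 6, 7, 8],
    'd': [10.0, 20.0, 30.0, 40.0],
    'e': [1.0, 2.0, 3.0, 4.0],
})

def my_func(x):
    return x - x.mean()  # hypothetical per-group function

# Apply my_func group by group to every column except 'a', 'b' and 'c',
# and assign the result straight back; 'b' and 'c' stay untouched.
target_cols = [col for col in df.columns if col not in ['a', 'b', 'c']]
df[target_cols] = df.groupby('a')[target_cols].transform(my_func)
print(df)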
