Replacing values in a column for a subset of rows - python

I have a dataframe with multiple columns. I would like to replace the value in a column called Discriminant, but only for a few rows, whenever a condition is met in another column called ids. I have tried various methods; the most common seems to be the .loc method, but for some reason it doesn't work for me.
Here are the variations that I am unsuccessfully trying:
encodedid - variable used for condition checking
indices - variable used for subsetting the dataframe (starts from zero)
Variation 1:
df[df.ids == encodedid].loc[df.ids==encodedid, 'Discriminant'].values[indices] = 'Y'
Variation 2:
df[df['ids'] == encodedid].iloc[indices,:].set_value('questionid','Discriminant', 'Y')
Variation 3:
df.loc[df.ids==encodedid, 'Discriminant'][indices] = 'Y'
Variation 3 has been particularly disappointing: most posts on SO suggest it should work, but it gives me the following error:
ValueError: [ 0 1 2 3 5 6 7 8 10 11 12 13 14 16 17 18 19 20 21 22 23] not contained in the index
Any pointers will be highly appreciated.

You are slicing too much. Try something like this:
indexer = df[df.ids == encodedid].index
df.loc[indexer, 'Discriminant'] = 'Y'
.loc[] takes a row indexer and a column indexer, and you can set the value of that slice directly using = 'what you need'.
Looking at your problem, you might want to set the value for two columns at the same time, such as:
indexer = df[df.ids == encodedid].index
column_list = ['Discriminant', 'questionid']
df.loc[indexer, column_list] = 'Y'
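If you also need to honor the question's indices variable, here is a minimal sketch, assuming indices holds zero-based positions within the matching subset (the data below is hypothetical, standing in for the question's frame):
import pandas as pd

df = pd.DataFrame({'ids': ['a', 'b', 'a', 'a'],
                   'Discriminant': ['N', 'N', 'N', 'N']})
encodedid = 'a'
indices = [0, 2]  # zero-based positions within the matching subset

# Labels of the rows where the condition holds
subset_index = df.index[df['ids'] == encodedid]

# Pick only the wanted positions of that subset, then assign by label
df.loc[subset_index[indices], 'Discriminant'] = 'Y'
print(df)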

Maybe something like this. I don't have a dataframe to test it, but:
import numpy as np

df['Discriminant'] = np.where(df['ids'] == 'some_condition', 'replace', df['Discriminant'])


How to get a single value (only) from a dataframe in Python

I have a dataframe df_my that looks like this:
id name age major
----------------------------------------
0 1 Mark 34 English
1 2 Tom 55 Art
2 3 Peter 31 Science
3 4 Mohammad 23 Math
4 5 Mike 47 Art
...
I am trying to get the value of major (only).
I used this, and it works fine when I know the id of the record:
df_my["major"][3]
returns
"Math"
Great, but I want to get the major for a variable record.
I used
i = 3
df_my.loc[df_my["id"]==i]["major"]
and also used
i = 3
df_my[df_my["id"]==i]["major"]
but they both return
3    Math
which includes the record index too.
How can I get the major only and nothing else?
You could use squeeze:
i = 3
out = df.loc[df['id']==i,'major'].squeeze()
Another option is iat:
out = df.loc[df['id']==i,'major'].iat[0]
Output:
'Science'
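Note that squeeze collapses to a scalar only when the selection contains exactly one element; if several rows match, it returns the Series unchanged, whereas iat[0] always takes the first match (and raises if there is none).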
I also stumbled over this problem, from a slightly different angle:
df = pd.DataFrame({'First Name': ['Kumar'],
'Last Name': ['Ram'],
'Country': ['India'],
'num_var': 1})
>>> df.loc[(df['First Name'] == 'Kumar'), "num_var"]
0 1
Name: num_var, dtype: int64
>>> type(df.loc[(df['First Name'] == 'Kumar'), "num_var"])
<class 'pandas.core.series.Series'>
So it returns a Series (albeit a Series with only one element). If you access through the index, you receive the integer.
df.loc[0, "num_var"]
1
type(df.loc[0, "num_var"])
<class 'numpy.int64'>
The answer on how to select the respective single value was already given above. However, I think it is interesting to note that accessing through an index always gives a single value, whereas accessing through a condition returns a Series. This is because an index identifies exactly one row, whereas a condition can match several rows.
If one of the columns of your dataframe is the natural primary index for those data, then it's usually a good idea to make pandas aware of it by setting the index accordingly:
df_my.set_index('id', inplace=True)
Now you can easily get just the major value for any id value i:
df_my.loc[i, 'major']
Note that for i = 3, the output is 'Science', which is expected, as noted in the comments to your question above.
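For completeness, a minimal end-to-end sketch of this approach, using the data from the question:
import pandas as pd

df_my = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                      'name': ['Mark', 'Tom', 'Peter', 'Mohammad', 'Mike'],
                      'age': [34, 55, 31, 23, 47],
                      'major': ['English', 'Art', 'Science', 'Math', 'Art']})

# Make 'id' the index once, then look up by label
df_my.set_index('id', inplace=True)

i = 3
print(df_my.loc[i, 'major'])  # -> Science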

Python: How to pass the current row and the next row to the DataFrame.apply() method?

I have a DataFrame with thousands of rows. Its structure is as below:
A B C D
0 q 20 'f'
1 q 14 'd'
2 o 20 'a'
I want to compare the A column of the current row and the next row. If those values are equal, I want to write the smaller B value into the D column of the row with the greater B value, and then remove the row whose B value was moved. It's like a swap process.
A B C D
0 q 20 'f' 14
1 o 20 'a'
I have thousands of rows, and the iloc, loc, and at methods are slow. I would at least like to use the DataFrame apply method. I tried some code samples, but they didn't work.
I want to do something as below:
DataFrame.apply(lambda row: self.compare(row, next(row)), axis=1)
I have a compare method, but I couldn't figure out how to pass the next row to it. How can I pass it to the method? I am also open to faster pandas solutions.
Best not to do that with apply as it will be slow; you can look at using shift, e.g.
df['A_shift'] = df['A'].shift(1)
df['Is_Same'] = 0
df.loc[df.A_shift == df.A, 'Is_Same'] = 1
Gets a bit more complicated if you're doing the shift within groups, but still possible.
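For what it's worth, here is a vectorized sketch of the whole swap using shift. It assumes, as in the question's example, that matching rows are adjacent and the later row holds the smaller B value (column names taken from the question):
import pandas as pd

df = pd.DataFrame({'A': ['q', 'q', 'o'],
                   'B': [20, 14, 20],
                   'C': ['f', 'd', 'a']})

# Rows whose A equals the *next* row's A
same_as_next = df['A'].eq(df['A'].shift(-1))

# Move the next row's B into D where the pair matches...
df['D'] = df['B'].shift(-1).where(same_as_next)

# ...then drop the row whose B value was moved
df = df[~same_as_next.shift(1, fill_value=False)].reset_index(drop=True)
print(df)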

Cell-wise calculations in a Pandas Dataframe

I have what I'm sure is a fundamental lack of understanding about how dataframes work in Python. I am sure this is an easy question, but I have looked everywhere and can't find a good explanation. I am trying to understand why sometimes dataframe calculations seem to run on a row-by-row (or cell by cell) basis, and sometimes seem to run for an entire column... For example:
data = {'Name':['49-037-23094', '49-029-21476', '49-029-20812', '49-041-21318'], 'Depth':[20, 21, 7, 18]}
df = pd.DataFrame(data)
df
Which gives:
Name Depth
0 49-037-23094 20
1 49-029-21476 21
2 49-029-20812 7
3 49-041-21318 18
Now I know I can do:
df['DepthDouble']=df['Depth']*2
And get:
Name Depth DepthDouble
0 49-037-23094 20 40
1 49-029-21476 21 42
2 49-029-20812 7 14
3 49-041-21318 18 36
Which is what I would expect. But this doesn't always work, and I'm trying to understand why. For example, I am trying to run this code to modify the name:
df['newName']=''.join(re.findall('\d',str(df['Name'])))
which gives:
Name Depth DepthDouble \
0 49-037-23094 20 40
1 49-029-21476 21 42
2 49-029-20812 7 14
3 49-041-21318 18 36
newName
0 04903723094149029214762490292081234904121318
1 04903723094149029214762490292081234904121318
2 04903723094149029214762490292081234904121318
3 04903723094149029214762490292081234904121318
So it is taking all the values in my name column, removing the dashes, and concatenating them. Of course, I'd just like it to be a new name column exactly the same as the original "Name" column, but without the dashes.
So, can anyone help me understand what I am doing wrong here? I Don't understand why sometimes Dataframe calculations for one column are done row by row (e.g., the Depth Doubled column) and sometimes Python seems to take all values in the entire column and run the calculation (e.g., the newName column).
Surely the way to get around this isn't by making a loop for every index in the df to force it to run individually for each row for a given column?
If the output you're looking for is:
Name Depth newName
0 49-037-23094 20 4903723094
1 49-029-21476 21 4902921476
2 49-029-20812 7 4902920812
3 49-041-21318 18 4904121318
The way to get this is:
df['newName'] = df['Name'].map(lambda name: ''.join(re.findall(r'\d', name)))
map is like apply but specifically for Series objects. Since you're applying it only to the Name column, you are operating on a Series.
If the lambda part is confusing, an equivalent way to write it is:
def find_digits(name):
    return ''.join(re.findall(r'\d', name))

df['newName'] = df['Name'].map(find_digits)
The equivalent operation with a traditional for loop is (using pd.concat, since Series.append was removed in pandas 2.0):
newNameSeries = pd.Series(dtype=str)
for name in df['Name']:
    digits = pd.Series([''.join(re.findall(r'\d', name))])
    newNameSeries = pd.concat([newNameSeries, digits], ignore_index=True)
df = pd.concat([df, newNameSeries.rename('newName')], axis=1)
While there might be a slightly cleaner way to write the loop, you can see how much simpler the first approach is by comparison. It is also faster. As you have already indicated you know, avoid for loops when using pandas.
The issue is that str(df['Name']) converts the entire Name column of your DataFrame into one single string. What you want instead is one of pandas' own string methods, which are applied to every single element of the column.
For example, you could use pandas' replace method for strings:
import pandas as pd
data = {'Name':['49-037-23094', '49-029-21476', '49-029-20812', '49-041-21318'], 'Depth':[20, 21, 7, 18]}
df = pd.DataFrame(data)
df['newName'] = df['Name'].str.replace('-', '')
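For reference, the result on the question's data should look like this:
print(df)
#            Name  Depth     newName
# 0  49-037-23094     20  4903723094
# 1  49-029-21476     21  4902921476
# 2  49-029-20812      7  4902920812
# 3  49-041-21318     18  4904121318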

Pandas - how to filter a dataframe by regex comparisons on multiple column values

I have a dataframe like the following, where everything is formatted as a string:
df
property value count
0 propAb True 10
1 propAA False 10
2 propAB blah 10
3 propBb 3 8
4 propBA 4 7
5 propCa 100 4
I am trying to find a way to filter the dataframe by applying a series of regex-style rules to both the property and value columns together.
For example, some sample rules may be like the following:
"if property starts with 'propA' and value is not 'True', drop the row".
Another rule may be something more mathematical, like:
"if property starts with 'propB' and value < 4, drop the row".
Is there a way to accomplish something like this without having to iterate over all rows each time for every rule I want to apply?
You still have to apply each rule (how else?), but let pandas handle the rows. Also, instead of removing the rows that you do not like, keep the rows that you do. Here's an example of how the first two rules can be applied:
rule1 = df.property.str.startswith('propA') & (df.value != 'True')
df = df[~rule1] # Keep everything that does NOT match
rule2 = df.property.str.startswith('propB') & (df.value < 4)
df = df[~rule2] # Keep everything that does NOT match
By the way, the second rule will not work as written, because value is a string column, not a numeric one.
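If you do want the numeric rule, one way is to convert first (a sketch; errors='coerce' turns non-numeric strings into NaN, and NaN comparisons are False, so those rows simply never match the rule):
import pandas as pd

numeric_value = pd.to_numeric(df['value'], errors='coerce')
rule2 = df['property'].str.startswith('propB') & (numeric_value < 4)
df = df[~rule2]  # Keep everything that does NOT match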
For the first one:
df = df.drop(df[(df.property.str.startswith('propA')) & (df.value != 'True')].index)
and the other one (converting value to numbers first, since the column holds strings):
df = df.drop(df[(df.property.str.startswith('propB')) & (pd.to_numeric(df.value, errors='coerce') < 4)].index)

Python: how to rank an element in a list?

Let's say I have an UNORDERED DataFrame:
df = pandas.DataFrame({'A': [6, 2, 3, 5]})
I have an input:
input = 3
I want to find the rank of my input in the list. Here:
expected_rank_in_df(input) = 2
# Because 2 < 3 < 5 < 6
Assumption : The input is always included in the dataframe. So for example, I will not find the position of "4" in this df.
My first idea was to use df.rank(), as in Pandas rank by column value:
df.rank()
But it seems overkill to me, as I don't need to rank the whole column. Or maybe it's not?
If you know for sure that the input is in the column, the rank is simply the number of values below it, plus one:
(df['A'] < input).sum() + 1
Does that make sense? If you intend to call this multiple times, it may be worth just sorting the column, but this is probably faster if you only care about a few inputs.
You can get the first position of the matched value with numpy.where, using a boolean mask on the sorted values:
import numpy as np

a = 3
print(np.where(np.sort(df['A']) == a)[0][0] + 1)
2
If default RangeIndex:
a = 3
print(df['A'].sort_values().reset_index(drop=True).eq(a).idxmax() + 1)
2
Another idea is to count the True values with sum:
print(df['A'].le(a).sum())
2
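As a quick self-contained check of the counting idea (a sketch using the question's data; rank_in_df is just a helper name echoing the question's expected_rank_in_df):
import pandas as pd

df = pd.DataFrame({'A': [6, 2, 3, 5]})

def rank_in_df(a):
    # ascending rank = how many values are <= a, assuming a occurs in the column
    return int(df['A'].le(a).sum())

print(rank_in_df(3))  # 2, because 2 < 3 < 5 < 6
print(rank_in_df(5))  # 3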
