Pandas assignment vs inplace=True on .loc? [duplicate] - python

I have tried many times, but it seems that 'replace' does not work after using 'loc'.
For example, I want to replace 'conlumn_b' with a regex for the rows where the 'conlumn_a' value is 'apple'.
Here is my sample code:
df.loc[df['conlumn_a'] == 'apple', 'conlumn_b'].replace(r'^11*', 'XXX',inplace=True, regex=True)
Example:
conlumn_a conlumn_b
apple 123
banana 11
apple 11
orange 33
The result that I expected for the 'df' is:
conlumn_a conlumn_b
apple 123
banana 11
apple XXX
orange 33
Has anyone run into this issue of needing 'replace' with a regex after 'loc'?
Or do you have some other good solutions?
Thank you so much for your help!

inplace=True works on the object that it was applied on.
When you call .loc, you're slicing your dataframe object to return a new one.
>>> id(df)
4587248608
And,
>>> id(df.loc[df['conlumn_a'] == 'apple', 'conlumn_b'])
4767716968
Now, calling an in-place replace on this new slice will apply the replace operation, updating the new slice itself, and not the original.
Also note that you're calling replace on a column of ints, so nothing is going to happen anyway, because regular expressions only work on strings.
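To make that concrete, here's a minimal sketch (using a small throwaway frame, called demo here just for illustration) showing that an in-place replace on the slice leaves the original untouched:
import pandas as pd

# throwaway frame, purely to illustrate the point above
demo = pd.DataFrame({'conlumn_a': ['apple', 'banana'], 'conlumn_b': ['11', '22']})

sliced = demo.loc[demo['conlumn_a'] == 'apple', 'conlumn_b']  # a brand-new object
sliced.replace('11', 'XXX', inplace=True)                     # mutates only the slice

print(demo)  # 'conlumn_b' still shows '11' for the apple row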
Here's what I offer you as a workaround. Don't use regex at all.
m = df['conlumn_a'] == 'apple'
df.loc[m, 'conlumn_b'] = df.loc[m, 'conlumn_b'].replace(11, 'XXX')
df
conlumn_a conlumn_b
0 apple 123
1 banana 11
2 apple XXX
3 orange 33
Or, if you need regex based substitution, then -
df.loc[m, 'conlumn_b'] = df.loc[m, 'conlumn_b']\
.astype(str).replace('^11$', 'XXX', regex=True)
Although, this converts your column to an object column.

I'm going to borrow from a recent answer of mine. This technique is a general-purpose strategy for updating a dataframe in place:
df.update(
    df.loc[df['conlumn_a'] == 'apple', 'conlumn_b']
    .replace(r'^11$', 'XXX', regex=True)
)
df
conlumn_a conlumn_b
0 apple 123
1 banana 11
2 apple XXX
3 orange 33
Note that all I did was remove the inplace=True and instead wrapped it in the pd.DataFrame.update method.

I think you need to filter on both sides:
m = df['conlumn_a'] == 'apple'
df.loc[m,'conlumn_b'] = df.loc[m,'conlumn_b'].astype(str).replace(r'^(11+)','XXX',regex=True)
print (df)
conlumn_a conlumn_b
0 apple 123
1 banana 11
2 apple XXX
3 orange 33

How to use delimiter on column where some instances do not have the delimiter, python

I have a column called sub that I want to split into sub_1 and sub_2.
An example of sub looks like this:
banana/apple,
banana,
apple
where some records have the delimiter '/' and some don't.
What I am trying to do is:
if there is a delimiter like in the first example above, split so:
sub_1 -> banana and sub_2 -> apple
if there is no delimiter like in the second two examples, then it would look like:
sub_1 -> banana
sub_1 -> apple
I tried this code:
df[['sub_1', 'sub_2']] = df['sub'].str.split('/', expand=True)
However I get this error:
ValueError: Columns must be same length as key
which I am guessing is because some rows do not have a delimiter. I'm wondering if there is a quick way to default those values to the first column, if anyone here has run into this issue before.
Thanks for any direction.
Assuming your dataframe looks like this:
>>> df
sub
0 banana/apple
1 banana
2 apple
Then you just need to split with expand:
>>> df[["sub_1", "sub_2"]] = df["sub"].str.split("/", expand=True)
>>> df
sub sub_1 sub_2
0 banana/apple banana apple
1 banana banana None
2 apple apple None
Or, using pd.Series, if you want NaNs instead of None:
>>> df[["sub_1", "sub_2"]] = df["sub"].str.split("/").apply(pd.Series)
>>> df
sub sub_1 sub_2
0 banana/apple banana apple
1 banana banana NaN
2 apple apple NaN
Try this:
df = pd.DataFrame({"sub": ["banana/apple", "banana", "apple"]})
res = df.join(df.pop("sub").str.split("/", expand=True))
res.columns = ["sub_1", "sub_2"]
print(res)
sub_1 sub_2
0 banana apple
1 banana None
2 apple None

How to filter Pandas Dataframe rows which contains any string from a list?

I have dataframe that has values like those:
A B
["I need avocado" "something"]
["something" "I eat margarina"]
And I want to find rows that:
In any column of the row, the column's value is contained in a list.
for example, for the list:
["apple","avocado","bannana"]
And only this line should match:
["I need avocado" "something"]
This line doesn't work:
dataFiltered[dataFiltered[col].str.contains(*includeKeywords)]
Returns:
{TypeError}unsupported operand type(s) for &: 'str' and 'int'
What should I do?
Setup
df = pd.DataFrame(dict(
    A=['I need avocado', 'something', 'useless', 'nothing'],
    B=['something', 'I eat margarina', 'eat apple', 'more nothing']
))
includeKeywords = ["apple", "avocado", "bannana"]
Problem
A B
0 I need avocado something # True 'avocado' in A
1 something I eat margarina
2 useless eat apple # True 'apple' in B
3 nothing more nothing
Solution
pandas.DataFrame.stack to make df a Series and enable us to use the pandas.Series.str accessor functions
pandas.Series.str.contains with '|'.join(includeKeywords)
pandas.Series.any with argument level=0 because we added a level to the index when we stacked
df[df.stack().str.contains('|'.join(includeKeywords)).any(level=0)]
A B
0 I need avocado something
2 useless eat apple
Details
This produces a regex search string. In regex, '|' means or. So for a regex search, this says match 'apple', 'avocado', or 'bannana'
kwstr = '|'.join(includeKeywords)
print(kwstr)
apple|avocado|bannana
Stacking will flatten our DataFrame
df.stack()
0  A   I need avocado
   B        something
1  A        something
   B  I eat margarina
2  A          useless
   B        eat apple
3  A          nothing
   B     more nothing
dtype: object
Fortunately, the pandas.Series.str.contains method can handle regex and it will produce a boolean Series
df.stack().str.contains(kwstr)
0  A     True
   B    False
1  A    False
   B    False
2  A    False
   B     True
3  A    False
   B    False
dtype: bool
At which point we can cleverly use pandas.Series.any by suggesting it only care about level=0
mask = df.stack().str.contains(kwstr).any(level=0)
mask
0 True
1 False
2 True
3 False
dtype: bool
By using level=0 we preserved the original index in the resulting Series. This makes it perfect for filtering df
df[mask]
A B
0 I need avocado something
2 useless eat apple
Take advantage of the any() function and use a list comprehension in a df.apply()
df = pd.DataFrame(["I need avocado","I eat margarina"])
print(df)
# 0
# 0 I need avocado
# 1 I eat margarina
includeKeywords = ["apple","avocado","bannana"]
print(df[df.apply(lambda r: any([kw in r[0] for kw in includeKeywords]), axis=1)])
# 0
# 0 I need avocado
To make this a bit clearer, you basically need to make a mask that returns True/False for each row
mask = [any([kw in r for kw in includeKeywords]) for r in df[0]]
print(mask)
# [True, False]
Then you can use that mask to print the selected rows in your DataFrame:
print(df[mask])
# 0
# 0 I need avocado
I am showing you both ways because, while the df.apply() method is handy for a one-liner, it is really slow compared to a standard list comprehension. So if you have a small enough set, feel free to use df.apply(). Otherwise, I'd suggest the Python comprehension over the pandas method.
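If you want to check the speed claim yourself, here's a rough sketch with timeit (absolute numbers will vary by machine, and the gap grows with bigger frames):
import timeit

apply_time = timeit.timeit(
    lambda: df[df.apply(lambda r: any(kw in r[0] for kw in includeKeywords), axis=1)],
    number=100)
listcomp_time = timeit.timeit(
    lambda: df[[any(kw in r for kw in includeKeywords) for r in df[0]]],
    number=100)
print(apply_time, listcomp_time)  # the list comprehension path is typically much faster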

sum occurrences of a string in pandas dataframe

I have to count and sum totals over a dataframe, but with a condition:
fruit days_old
apple 4
apple 5
orange 1
orange 5
I have to count with the condition that a fruit is over 3 days old. So the output I need is
2 apples and 1 orange
I thought I would have to use an apply function, but I would have to save each fruit type to a variable or something. I'm sure there's an easier way.
ps. I've been looking but I don't see a clear way to create tables here with proper spacing. The only thing that's clear is to not copy and paste with tabs!
One way is to use pd.Series.value_counts:
res = df.loc[df['days_old'] > 3, 'fruit'].value_counts()
# apple 2
# orange 1
# Name: fruit, dtype: int64
Using pd.DataFrame.apply is inadvisable as this will result in an inefficient loop.
You can use value_counts():
In [120]: df[df.days_old > 3]['fruit'].value_counts()
Out[120]:
apple 2
orange 1
Name: fruit, dtype: int64
I wanted in on the variation party.
pd.factorize + np.bincount
f, u = pd.factorize(df.fruit)
pd.Series(
    np.bincount(f, df.days_old > 3).astype(int), u
)
apple 2
orange 1
dtype: int64
The value_counts() methods described by @jpp and @chrisz are great. Just to post another strategy, you can use groupby:
df[df.days_old > 3].groupby('fruit').size()
# fruit
# apple 2
# orange 1
# dtype: int64

Pandas str.contains - Search for multiple values in a string and print the values in a new column [duplicate]

This question already has answers here:
Filter pandas DataFrame by substring criteria
(17 answers)
Closed 3 years ago.
I just started coding in Python and want to build a solution where you would search a string to see if it contains a given set of values.
I've found a similar solution in R which uses the stringr library: Search for a value in a string and if the value exists, print it all by itself in a new column
The following code seems to work, but I also want to output the three values that I'm looking for, and this solution will only output one value:
#Inserting new column
df.insert(5, "New_Column", np.nan)
#Searching old column
df['New_Column'] = np.where(df['Column_with_text'].str.contains('value1|value2|value3', case=False, na=False), 'value', 'NaN')
------ Edit ------
So I realised I didn't give that good of an explanation, sorry about that.
Below is an example where I match fruit names in a string, and depending on whether it finds any matches in the string it will print out either true or false in a new column. Here's my question: instead of printing out true or false, I want to print out the name it found in the string, e.g. apples, oranges etc.
import pandas as pd
import numpy as np
text = [('I want to buy some apples.', 0),
        ('Oranges are good for the health.', 0),
        ('John is eating some grapes.', 0),
        ('This line does not contain any fruit names.', 0),
        ('I bought 2 blueberries yesterday.', 0)]
labels = ['Text','Random Column']
df = pd.DataFrame.from_records(text, columns=labels)
df.insert(2, "MatchedValues", np.nan)
foods =['apples', 'oranges', 'grapes', 'blueberries']
pattern = '|'.join(foods)
df['MatchedValues'] = df['Text'].str.contains(pattern, case=False)
print(df)
Result
Text Random Column MatchedValues
0 I want to buy some apples. 0 True
1 Oranges are good for the health. 0 True
2 John is eating some grapes. 0 True
3 This line does not contain any fruit names. 0 False
4 I bought 2 blueberries yesterday. 0 True
Wanted result
Text Random Column MatchedValues
0 I want to buy some apples. 0 apples
1 Oranges are good for the health. 0 oranges
2 John is eating some grapes. 0 grapes
3 This line does not contain any fruit names. 0 NaN
4 I bought 2 blueberries yesterday. 0 blueberries
You need to set the regex flag (to interpret your search as a regular expression):
whatIwant = df['Column_with_text'].str.contains('value1|value2|value3',
                                                case=False, regex=True)
df['New_Column'] = np.where(whatIwant, df['Column_with_text'], np.nan)
------ Edit ------
Based on the updated problem statement, here is an updated answer:
You need to define a capture group in the regular expression using parentheses and use the extract() function to return the values found within the capture group. The lower() function deals with any uppercase letters:
df['MatchedValues'] = df['Text'].str.lower().str.extract('(' + pattern + ')', expand=False)
Here is one way:
foods =['apples', 'oranges', 'grapes', 'blueberries']
def matcher(x):
    for i in foods:
        if i.lower() in x.lower():
            return i
    else:
        return np.nan
df['Match'] = df['Text'].apply(matcher)
# Text Match
# 0 I want to buy some apples. apples
# 1 Oranges are good for the health. oranges
# 2 John is eating some grapes. grapes
# 3 This line does not contain any fruit names. NaN
# 4 I bought 2 blueberries yesterday. blueberries

Using Panda's groupby just to drop repeated items

I'm sure this is a basic question, but I am unable to find the correct path here.
Let's suppose a dataframe like this, telling how many fruits each person eats per week:
Name Fruit Amount
1 Jack Lemon 3
2 Mary Banana 6
3 Sophie Lemon 1
4 Sophie Cherry 10
5 Daniel Banana 2
6 Daniel Cherry 4
Let's suppose now that I just want to create a bar plot with matplotlib, to show the total amount of each fruit eaten per week in the whole town. To do that, I must groupby the fruits.
In his book, the pandas author describes groupby as the first part of a split-apply-combine operation:
So, first of all, groupby transforms the DataFrame into a DataFrameGroupBy object. Then, using a method such as sum, the result is combined into a new DataFrame object. Perfect, I can create my fruit plot now.
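For example, a rough sketch of that whole split-apply-combine for my fruit totals would be something like this (rebuilding the sample data so it runs on its own, and assuming matplotlib is installed for the plot):
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jack', 'Mary', 'Sophie', 'Sophie', 'Daniel', 'Daniel'],
    'Fruit': ['Lemon', 'Banana', 'Lemon', 'Cherry', 'Banana', 'Cherry'],
    'Amount': [3, 6, 1, 10, 2, 4],
})

# split on Fruit, apply sum to Amount, combine back into a plain Series
totals = df.groupby('Fruit')['Amount'].sum()  # Banana 8, Cherry 14, Lemon 4
totals.plot.bar()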
But the problem I'm facing is what happens when I do not want to sum, diff or apply any operation at all to each group's members. What happens when I just want to use groupby to keep a DataFrame with only one row per fruit type (of course, for an example as simple as this one, I could just get a list of fruits with unique, but that's not the point)?
If I do that, the return of groupby is a DataFrameGroupBy object, and many operations which work with DataFrame do not with DataFrameGroupBy.
This problem, which I'm sure is pretty simple to avoid, is giving me a lot of headaches. How can I get a DataFrame from groupby without having to apply any aggregation function? Is there a different workaround, without even using groupby, that I'm missing due to being lost in translation?
If you just want some row per group, you can use a combination of groupby + first() + reset_index - it will retain the first row per group:
import pandas as pd
df = pd.DataFrame({'a': [1, 1, 2], 'b': [1, 2, 3]})
>>> df.groupby(df.a).first().reset_index()
a b
0 1 1
1 2 3
This bit makes me think this could be the answer you are looking for:
Is there a different workaround without even using groupby
If you just want to drop duplicated rows based on Fruit, .drop_duplicates is the way to go.
df.drop_duplicates(subset='Fruit')
Name Fruit Amount
1 Jack Lemon 3
2 Mary Banana 6
4 Sophie Cherry 10
You have limited control over which rows are preserved; see the docstring.
This is faster and more readable than groupby + first.
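As for the limited control mentioned above, the keep parameter is roughly the extent of it; a quick sketch of the options:
df.drop_duplicates(subset='Fruit', keep='first')  # the default: keep the first row per fruit
df.drop_duplicates(subset='Fruit', keep='last')   # keep the last row per fruit instead
df.drop_duplicates(subset='Fruit', keep=False)    # drop every fruit that appears more than once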
IIUC you could use pivot_table, which will return a DataFrame:
In [140]: df.pivot_table(index='Fruit')
Out[140]:
Amount
Fruit
Banana 4
Cherry 7
Lemon 2
In [141]: type(df.pivot_table(index='Fruit'))
Out[141]: pandas.core.frame.DataFrame
If you want to keep first element you could define your function and pass it to aggfunc argument:
In [144]: df.pivot_table(index='Fruit', aggfunc=lambda x: x.iloc[0])
Out[144]:
Amount Name
Fruit
Banana 6 Mary
Cherry 10 Sophie
Lemon 3 Jack
If you don't want your Fruit to be an index you could also use reset_index:
In [147]: df.pivot_table(index='Fruit', aggfunc=lambda x: x.iloc[0]).reset_index()
Out[147]:
Fruit Amount Name
0 Banana 6 Mary
1 Cherry 10 Sophie
2 Lemon 3 Jack
