Faster way to iterate through a Pandas DataFrame? - python

I have a list of strings, let's say:
fruit_list = ["apple", "banana", "coconut"]
And I have a Pandas DataFrame such as:
import pandas as pd
data = [['Apple farm', 10], ['Banana field', 15], ['Coconut beach', 14], ['corn field', 10]]
df = pd.DataFrame(data, columns = ['fruit_source', 'value'])
And I want to populate a new column based on a text search of the existing column 'fruit_source': whichever element of the list matches the text should be written into the new column. One way of writing it is:
df["fruit"] = NaN
for index, row in df.iterrows():
for fruit in fruit_list:
if fruit in row['fruit_source']:
df.loc[index,'fruit'] = fruit
else:
df.loc[index,'fruit'] = "fruit not found"
This populates the dataframe with a new column recording which fruit each fruit source matched.
When expanding this to a larger dataframe, though, the iteration becomes a performance problem: as more rows are introduced, the work explodes, because every row also iterates through the whole list.
Is there more of an efficient method that can be done?

You can let Pandas do the work like so:
# Prime series with the "fruit not found" value
df['fruit'] = "fruit not found"
for fruit in fruit_list:
    # Generate boolean series of rows matching the fruit
    mask = df['fruit_source'].str.contains(fruit, case=False)
    # Replace those rows in-place with the name of the fruit
    df['fruit'].mask(mask, fruit, inplace=True)
print(df) will then say
    fruit_source  value            fruit
0     Apple farm     10            apple
1   Banana field     15           banana
2  Coconut beach     14          coconut
3     corn field     10  fruit not found
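One caveat worth noting: str.contains treats its pattern as a regular expression by default. The fruit names here are plain words, so it makes no difference, but if your search terms could contain regex metacharacters, you can pass regex=False. A minimal sketch of the same loop with that guard (using a plain assignment instead of inplace=True):
for fruit in fruit_list:
    # regex=False does a plain substring match, so metacharacters are safe
    mask = df['fruit_source'].str.contains(fruit, case=False, regex=False)
    df['fruit'] = df['fruit'].mask(mask, fruit)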

Use str.extract with a regex pattern to avoid a loop:
import re
pattern = fr"({'|'.join(fruit_list)})"
df['fruit'] = df['fruit_source'].str.extract(pattern, flags=re.IGNORECASE) \
.fillna('fruit not found')
Output:
>>> df
    fruit_source  value            fruit
0     Apple farm     10            Apple
1   Banana field     15           Banana
2  Coconut beach     14          Coconut
3     corn field     10  fruit not found
>>> pattern
'(apple|banana|coconut)'
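Note that extract keeps whatever casing appears in the text ('Apple', not 'apple'). If you want the result normalized to the lowercase names in fruit_list, one option (a sketch) is to lowercase the column before extracting, which also makes the IGNORECASE flag unnecessary:
df['fruit'] = (df['fruit_source'].str.lower()
                                 .str.extract(pattern, expand=False)
                                 .fillna('fruit not found'))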

Related

Most efficient way to find and replace 'None' cells with value from a similar cell pair in dataframe?

I'm trying to find the most efficient way to fill in cells with a value of 'None' in a dataframe based on a rule.
If column 'A' has a specific value, such as 'Apple', and column 'B' is 'None', I want to find another row where column 'A' is also 'Apple' and column 'B' is not empty, and copy that column 'B' value over.
For example, if the input is this:
Column A    Column B
Apple       None
Apple       None
Orange      Soda
Banana      None
Apple       Pie
Banana      Bread
Orange      None
Then it should output this:
Column A    Column B
Apple       Pie
Apple       Pie
Orange      Soda
Banana      Bread
Apple       Pie
Banana      Bread
Orange      Soda
You can assume that for a particular Column A & B pair, it will always be the same (e.g. for every 'Apple' in Column A, there will either be 'None' or 'Pie' in Column B).
I tried the below, and it seems to work on this small test dataset, but I'm wondering if someone could suggest a more efficient method that I could use on my actual dataset (~100K rows).
for ind in data.index:
    if data['Column B'][ind] == 'None':
        temp_A = data['Column A'][ind]
        for ind2 in data.index:
            if (data['Column A'][ind2] == temp_A) & (data['Column B'][ind2] != 'None'):
                data['Column B'][ind] = data['Column B'][ind2]
                break
Filter the rows without 'None' values, drop duplicates, and convert them to a Series indexed by Column A; then fill the 'None' rows via Series.map:
m = df['Column B'].ne('None')
s = df[m].drop_duplicates('Column A').set_index('Column A')['Column B']
df.loc[~m, 'Column B'] = df.loc[~m, 'Column A'].map(s)
print(df)
  Column A Column B
0    Apple      Pie
1    Apple      Pie
2   Orange     Soda
3   Banana    Bread
4    Apple      Pie
5   Banana    Bread
6   Orange     Soda
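An alternative sketch (not from the answer above): the same fill can be done with a groupby transform, replacing 'None' with NaN so that groupby's 'first', which skips nulls, broadcasts each group's real value back to every row. A group containing only 'None' would stay NaN.
import numpy as np

tmp = df['Column B'].replace('None', np.nan)
# 'first' skips NaN, so each group's first real value is broadcast to all its rows
df['Column B'] = tmp.groupby(df['Column A']).transform('first')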

How to check if multiple words are in a string of a dataframe value based on 2 lists, then assign a value to a new column

I have a dataframe with a column of products that are extremely inconsistent, with more extra words than necessary. I would like to check each cell against two lists of necessary keywords: first check whether any word from the first list is present; if true, check whether any word from the second list is present. If both are true, I would like to assign a value to that row in the column product_clean. If false, the value remains nan.
list1: ['fruit', 'FRUIT', 'Fruit', 'FRUit']
list2: ['banana', 'strawberry', 'cherry']
df:
product                   product_clean
fruit 10% banana SPECIAL  nan
FRUit strawberry 99OFF    nan
milk                      nan
jam                       nan
cherry FRUIT Virginia     nan
df_DESIRED:
product                   product_clean
fruit 10% banana SPECIAL  Fruit
FRUit strawberry 99OFF    Fruit
milk                      nan
jam                       nan
cherry FRUIT Virginia     Fruit
I think the simplest is to check all values by both conditions and chain them by & for bitwise AND:
m = (df['product'].str.contains("|".join(list1))
     & df['product'].str.contains("|".join(list2)))
df['product_clean'] = np.where(m, 'Fruit', np.nan)
Or:
df.loc[m, 'product_clean'] = 'Fruit'
print(df)
                    product product_clean
0  fruit 10% banana SPECIAL         Fruit
1    FRUit strawberry 99OFF         Fruit
2                      milk           nan
3                       jam           nan
4     cherry FRUIT Virginia         Fruit
If you need to test by the first condition first and only then by the second, that is possible with:
m1 = df['product'].str.contains("|".join(list1))
m = df.loc[m1, 'product'].str.contains("|".join(list2)).reindex(df.index, fill_value=False)
df['product_clean'] = np.where(m, 'Fruit', np.nan)
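For reference, here is a self-contained version of the first approach; the imports and the DataFrame construction are the only additions, with names taken from the question:
import numpy as np
import pandas as pd

list1 = ['fruit', 'FRUIT', 'Fruit', 'FRUit']
list2 = ['banana', 'strawberry', 'cherry']
df = pd.DataFrame({'product': ['fruit 10% banana SPECIAL',
                               'FRUit strawberry 99OFF',
                               'milk',
                               'jam',
                               'cherry FRUIT Virginia']})

m = (df['product'].str.contains("|".join(list1))
     & df['product'].str.contains("|".join(list2)))
df['product_clean'] = np.where(m, 'Fruit', np.nan)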

How to groupby and calculate new field with python pandas?

I'd like to group by a specific column within a data frame called 'Fruit' and calculate the percentage of rows for that particular fruit that are 'Good'.
See below for my initial dataframe
import pandas as pd
df = pd.DataFrame({'Fruit': ['Apple','Apple','Banana'], 'Condition': ['Good','Bad','Good']})
Dataframe
Fruit Condition
0 Apple Good
1 Apple Bad
2 Banana Good
See below for my desired output data frame
Fruit Percentage
0 Apple 50%
1 Banana 100%
Note: Because there is 1 "Good" Apple and 1 "Bad" Apple, the percentage of Good Apples is 50%.
See below for my attempt which is overwriting all the columns
groupedDF = df.groupby('Fruit')
groupedDF.apply(lambda x: x[(x['Condition'] == 'Good')].count()/x.count())
See below for resulting table, which seems to calculate percentage but within existing columns instead of new column:
         Fruit  Condition
Fruit
Apple      0.5        0.5
Banana     1.0        1.0
We can compare Condition with eq, take advantage of the fact that True counts as 1 and False as 0 when processed as numbers, and take the groupby mean over Fruit:
new_df = (
    df['Condition'].eq('Good').groupby(df['Fruit']).mean().reset_index()
)
new_df:
    Fruit  Condition
0   Apple        0.5
1  Banana        1.0
We can further map to a format string and rename to get output into the shown desired output:
new_df = (
    df['Condition'].eq('Good')
    .groupby(df['Fruit']).mean()
    .map('{:.0%}'.format)  # Change to Percent Format
    .rename('Percentage')  # Rename Column to Percentage
    .reset_index()         # Restore RangeIndex and make Fruit a Column
)
new_df:
    Fruit Percentage
0   Apple        50%
1  Banana       100%
Naturally, further manipulations can be done as well.
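For comparison, an equivalent route (a sketch, not from the answer above) adds the boolean as a column first and lets groupby do the rest:
new_df = (df.assign(Good=df['Condition'].eq('Good'))
            .groupby('Fruit', as_index=False)['Good'].mean()
            .rename(columns={'Good': 'Percentage'}))
new_df['Percentage'] = new_df['Percentage'].map('{:.0%}'.format)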

In Pandas, creating a new data frame using data filtered by a list of lists

So, I've looked around quite a bit and I haven't been able to find an answer to this problem. I apologize if it is indeed out there.
I have a DF that looks like this:
a = pd.DataFrame({'Name': ['apple', 'banana', 'orange', 'apple', 'banana', 'orange'],
                  'Units': [2, 4, 6, 5, 4, 3]})
I also have a list of lists like this:
b = [['apple', 'banana'],['orange']]
The goal is to group apple and banana into one row and orange into another, with their respective units summed. The Name shown will be the first item in the sublist (no sublist will have duplicates).
Here's what I want the output df to look like:
output = pd.DataFrame({'Name': ['apple', 'orange'],
                       'Units': [15, 9]})
Here's where I am right now:
for fruit in a['Name']:
    for sublist in b:
        if fruit in sublist:
            XYZ = pd.concat([XYZ,
                             pd.DataFrame({'Name': sublist[0],
                                           'Units': a[a.Name == fruit]['Units'].sum()},
                                          index=[0])],
                            axis=1)
XYZ is an empty data frame with columns Name and Units that I am trying to populate with the results. I don't really understand how to create a data frame when the fruit is in a sublist, along with the sum of its Units.
Any thoughts? :D
Edit: sublists can be anywhere from 1 to 300 items. The code here is just a MWE of a much larger data wrangling problem. Apologies for not mentioning this.
Indeed you can do this in one line:
sum_a = a.replace({"banana": "apple"}).groupby("Name", as_index=False).sum()
IIUC, it is better to re-create the object rather than change the original df, since replace loses the information about banana: once you replace banana with apple, the output only contains a combined apple row. Building a dict keeps each item's information:
d = {','.join(x): a.loc[a.Name.isin(x), 'Units'].sum() for x in b}
pd.Series(d)
apple,banana    15    # here you don't lose the information about each item in the list
orange           9
dtype: int64
Using pd.Series.isin and boolean indexing:
pd.DataFrame([(l[0], a.Units[a.Name.isin(l)].sum()) for l in b], columns=['Name', 'Units'])
Name Units
0 apple 15
1 orange 9
Another solution would be to make a function which returns both the name and sum value.
from operator import itemgetter

first = itemgetter(0)

def make_rows(cols, df):
    for col in cols:
        name = first(col)
        val = df.loc[df.Name.str.contains('|'.join(col), regex=True), 'Units'].sum()
        yield name, val

df1 = pd.DataFrame(make_rows(b, a), columns=a.columns)
print(df1)
Name Units
0 apple 15
1 orange 9
Alternatively, with functools.partial:
from functools import partial

def make_rows(df, col):
    name = first(col)
    val = df.loc[df.Name.str.contains('|'.join(col), regex=True), 'Units'].sum()
    return name, val

p = partial(make_rows, a)
pd.DataFrame(list(map(p, b)), columns=a.columns)
Name Units
0 apple 15
1 orange 9
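Since the edit says sublists can hold up to 300 items, a plain dict lookup is another option that scales well (a sketch; the mapping dict is an illustrative helper, not from any answer above): map every name to the first element of its sublist, then group and sum.
# Illustrative helper: every name points at its sublist's first element
mapping = {name: group[0] for group in b for name in group}
output = (a.assign(Name=a['Name'].map(mapping))
           .groupby('Name', as_index=False)['Units'].sum())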

Pandas str.contains - Search for multiple values in a string and print the values in a new column [duplicate]

This question already has answers here: Filter pandas DataFrame by substring criteria (17 answers). Closed 3 years ago.
I just started coding in Python and want to build a solution where you search a string to see if it contains a given set of values.
I found a similar solution in R which uses the stringr library: Search for a value in a string and if the value exists, print it all by itself in a new column.
The following code seems to work, but I also want to output the three values that I'm looking for, and this solution will only output one value:
#Inserting new column
df.insert(5, "New_Column", np.nan)
#Searching old column
df['New_Column'] = np.where(df['Column_with_text'].str.contains('value1|value2|value3', case=False, na=False), 'value', 'NaN')
------ Edit ------
So I realised I didn't give that good of an explanation, sorry about that.
Below is an example where I match fruit names in a string; depending on whether it finds any matches, it will print out either true or false in a new column. Here's my question: instead of printing out true or false, I want to print out the name it found in the string, e.g. apples, oranges etc.
import pandas as pd
import numpy as np
text = [('I want to buy some apples.', 0),
        ('Oranges are good for the health.', 0),
        ('John is eating some grapes.', 0),
        ('This line does not contain any fruit names.', 0),
        ('I bought 2 blueberries yesterday.', 0)]
labels = ['Text', 'Random Column']
df = pd.DataFrame.from_records(text, columns=labels)
df.insert(2, "MatchedValues", np.nan)
foods =['apples', 'oranges', 'grapes', 'blueberries']
pattern = '|'.join(foods)
df['MatchedValues'] = df['Text'].str.contains(pattern, case=False)
print(df)
Result
Text Random Column MatchedValues
0 I want to buy some apples. 0 True
1 Oranges are good for the health. 0 True
2 John is eating some grapes. 0 True
3 This line does not contain any fruit names. 0 False
4 I bought 2 blueberries yesterday. 0 True
Wanted result
Text Random Column MatchedValues
0 I want to buy some apples. 0 apples
1 Oranges are good for the health. 0 oranges
2 John is eating some grapes. 0 grapes
3 This line does not contain any fruit names. 0 NaN
4 I bought 2 blueberries yesterday. 0 blueberries
str.contains already interprets its pattern as a regular expression (regex=True is the default), so you can build the mask directly and then keep the text only where it matched (note np.where needs all three arguments):
whatIwant = df['Column_with_text'].str.contains('value1|value2|value3',
                                                case=False, regex=True)
df['New_Column'] = np.where(whatIwant, df['Column_with_text'], np.nan)
------ Edit ------
Based on the updated problem statement, here is an updated answer:
You need to define a capture group in the regular expression using parentheses, and use the extract() function to return the values found within the capture group. The lower() function deals with any upper-case letters:
df['MatchedValues'] = df['Text'].str.lower().str.extract('(' + pattern + ')', expand=False)
Here is one way:
foods = ['apples', 'oranges', 'grapes', 'blueberries']

def matcher(x):
    for i in foods:
        if i.lower() in x.lower():
            return i
    else:  # for/else: runs only when the loop finishes without a match
        return np.nan

df['Match'] = df['Text'].apply(matcher)
# Text Match
# 0 I want to buy some apples. apples
# 1 Oranges are good for the health. oranges
# 2 John is eating some grapes. grapes
# 3 This line does not contain any fruit names. NaN
# 4 I bought 2 blueberries yesterday. blueberries
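If a row could mention several foods and you want all of them rather than just the first, str.findall is one possibility (a sketch; it returns a possibly empty list per row rather than a scalar):
import re

df['AllMatches'] = df['Text'].str.findall('|'.join(foods), flags=re.IGNORECASE)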
