Updates to Python pandas dataframe rows do not update the dataframe?

I just discovered that iterating the rows of a pandas dataframe, and making updates to each row, does not update the dataframe! Is this expected behaviour, or does one need to do something to the row first so the update reflects in the parent dataframe?
I know one could update the dataframe directly in the loop, or with a simple recalculation on the column in this simple/contrived example, but my question is about the fact that iterrows() seems to provide copies of the rows rather than references to the actual rows in the dataframe. Is there a reason for this?
import pandas as pd

fruit = {"Fruit": ['Apple', 'Avocado', 'Banana', 'Strawberry', 'Grape'],
         "Color": ['Red', 'Green', 'Yellow', 'Pink', 'Green'],
         "Price": [45, 90, 60, 37, 49]}
df = pd.DataFrame(fruit)

for index, row in df.iterrows():
    row['Price'] = row['Price'] * 2
    print(row['Price'])  # the price is doubled here as expected

print(df['Price'])  # the original values of Price in the dataframe are unchanged

You are storing the changes in row['Price'] but not actually saving them back to the dataframe df. You can verify that row is a separate object from the dataframe:
id(row) == id(df)
which returns False. Also, for better efficiency you shouldn't loop at all but simply re-assign. Replace the for loop with:
df['New Price'] = df['Price'] * 2

You are running into the subtleties of copies versus the original object. What you update in the loop is a copy of the row, not the original Series.
You should access the DataFrame directly instead:
for index, row in df.iterrows():
    df.loc[index, 'Price'] = row['Price'] * 2
But the real way to perform such operations is a vectorized one:
df['Price'] = df['Price'].mul(2)
Or:
df['Price'] *= 2
Output:
Fruit Color Price
0 Apple Red 90
1 Avocado Green 180
2 Banana Yellow 120
3 Strawberry Pink 74
4 Grape Green 98
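As to the "is there a reason for this?" part: a DataFrame stores its data column by column, and a single row can mix dtypes, so iterrows() must construct a brand-new Series for each row; that object cannot be a view into the parent frame. If you do want to update inside a loop, here is a minimal sketch using df.at for fast scalar writes (the small frame is just an illustration):
import pandas as pd

df = pd.DataFrame({"Fruit": ['Apple', 'Grape'], "Price": [45, 49]})
for index, row in df.iterrows():
    # row is a freshly built Series; mutating it never reaches df,
    # so write back through the parent frame instead
    df.at[index, 'Price'] = row['Price'] * 2
print(df['Price'])  # 90, 98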

Related

Keep rows according to condition in Pandas

I am looking for code to find rows that match a condition and keep those rows.
In the image example, I wish to keep all the apples with amt1 >= 5 and amt2 < 5. I also want to keep the bananas with amt1 >= 1 and amt2 < 5 (highlighted red in the image). There are many other fruits in the list that I have to filter for (maybe about 10 fruits).
[image example omitted]
Currently, I am filtering them individually (i.e. creating a dataframe that filters out the red and small apples, another dataframe that filters out the green and big bananas, and using concat to join the dataframes together afterwards). However, this process takes a long time to run because the dataset is huge. I am looking for a faster way, like filtering within the dataframe itself without having to create new dataframes. I also have to use column indices instead of column names, as the column names change according to the date.
Hopefully what I said makes sense. Would appreciate any help!
I am not quite sure I understand your requirements, because it is not clear to me how the conditions for the rows to keep are formulated.
One thing you can use to combine multiple criteria for selecting data is the query method of the dataframe:
import pandas as pd

df = pd.DataFrame([
        ['Apple', 5, 1],
        ['Apple', 4, 2],
        ['Orange', 3, 3],
        ['Banana', 2, 4],
        ['Banana', 1, 5]],
    columns=['Fruits', 'Amt1', 'Amt2'])
df.query('(Fruits == "Apple" & (Amt1 >= 5 & Amt2 < 5)) | (Fruits == "Banana" & (Amt1 >= 1 & Amt2 < 5))')
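Since the column names change with the date and only the positions are stable, one possible sketch (assuming the fruit and amount columns sit at positions 0, 1 and 2) builds the query string from df.columns; the backticks let query handle arbitrary column names:
fruit, amt1, amt2 = df.columns[0], df.columns[1], df.columns[2]
df.query(f'(`{fruit}` == "Apple" & `{amt1}` >= 5 & `{amt2}` < 5) | '
         f'(`{fruit}` == "Banana" & `{amt1}` >= 1 & `{amt2}` < 5)')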
You might use filter combined with itertuples in the following way:
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [10, 20, 30, 40, 50]})

def keep(row):
    # row is a namedtuple: row[0] is the Index, row[1] is column x
    return row[0] >= 2 and row[1] <= 40

df_filtered = pd.DataFrame(filter(keep, df.itertuples())).set_index("Index")
print(df_filtered)
which gives the output:
x y
Index
2 3 30
3 4 40
4 5 50
Explanation: keep is a function which should return True for rows to keep and False for rows to jettison. .itertuples() provides an iterable of named tuples, which are fed to filter, which selects the records where keep evaluates to True; these selected rows are used to create a new DataFrame. After that is done, I set the index so that Index corresponds to the original DataFrame. Depending on your use case, you might elect not to set the index.
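For comparison, the usual vectorized boolean mask gives the same result without a Python-level loop. Note that row[0] in keep is the Index, so the equivalent mask filters on df.index:
df_filtered = df[(df.index >= 2) & (df['x'] <= 40)]
print(df_filtered)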

how to vectorize a for loop on pandas dataframe?

I am working with data of about 200,000 rows. In one column of the pandas DataFrame, some of the values are an empty list, while most are lists with several values.
What I want to do is replace the empty lists with this list:
[[close*0.95,close*0.94]]
where close is the close value in that row of the table. The for loop that I use is this one:
for i in range(1, len(data3.index)):
    close = data3.close[data3.index == data3.index[i]].values[0]
    sell_list = data3.sell[data3.index == data3.index[i]].values[0]
    buy_list = data3.buy[data3.index == data3.index[i]].values[0]
    if len(sell_list) == 0:
        data3.loc[data3.index[i], "sell"].append([[close*1.05, close*1.06]])
    if len(buy_list) == 0:
        data3.loc[data3.index[i], "buy"].append([[close*0.95, close*0.94]])
I tried to make it work with multithreading, but since I need to read the whole table to do the next step, I can't split the data. I hope you can help me build some kind of lambda function to apply to the df, or something similar; I am not very skilled at this. Thanks for reading!
The expected output in the "buy" column for a row that had an empty list should be [[[11554, 11566]]].
Example data:
import pandas as pd

df = pd.DataFrame({'close': [11763, 21763, 31763],
                   'buy': [[], [[[21763, 21767]]], []]})
close buy
0 11763 []
1 21763 [[[21763, 21767]]]
2 31763 []
You could do it like this:
# Create a boolean mask (a bit faster than df['buy'].apply(len) == 0).
# Assumes there are no NaNs in the column. If you have NaNs, use Series.apply.
m = [len(l) == 0 for l in df['buy'].tolist()]
# Create triple-nested lists and assign.
df.loc[m, 'buy'] = list(df.loc[m, ['close', 'close']].mul([0.95, 0.94]).to_numpy()[:, None][:, None])
print(df)
Result:
close buy
0 11763 [[[11174.85, 11057.22]]]
1 21763 [[[21763, 21767]]]
2 31763 [[[30174.85, 29857.219999999998]]]
Some explanation:
m is a boolean mask that selects the rows of the DataFrame with an empty list in the 'buy' column:
m = [len(l) == 0 for l in df['buy'].tolist()]
# Or (a bit slower): apply the len() function to all lists in the column.
m = df['buy'].apply(len) == 0
print(m)
0 True
1 False
2 True
Name: buy, dtype: bool
We can use this mask to select where to calculate the values.
df.loc[m, ['close', 'close']].mul([0.95, 0.94]) duplicates the 'close' column and calculates the vectorised product of all the (close, close) pairs with (0.95, 0.94) to obtain (close*0.95, close*0.94) in each row of the resulting array.
[:, None][:, None] is just a trick to create two additional axes on the resulting array. This is required since you want triple nested lists ([[[]]]).
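If readability matters more than raw speed, a plain list comprehension over the two columns produces the same triple-nested result without the axis tricks; a minimal sketch, assuming the column holds Python lists:
df['buy'] = [buy if len(buy) > 0 else [[[close * 0.95, close * 0.94]]]
             for close, buy in zip(df['close'], df['buy'])]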

Pandas adding additional values between two row-values in a dataframe with number constraint

I have a data frame. Under the same index, I have "early_date" & "late_date" values, which are of int dtype. I want to create additional values in between the "early_date" & "late_date" row-values and stack the generated values into new rows between them.
Here is how I did it:
import pandas as pd

df = pd.DataFrame({'index': [1, 1, 2, 2, 3, 3],
                   'variable': ['early_date', 'late_date']*3,
                   'value': [201952, 202001, 202002, 202004, 202006, 202012]})

# This is what your data looks like unmelted
df_p = df.pivot(index='index', columns='variable', values='value').reset_index()
df_p.columns.name = ''
df_p['new'] = [list(range(x, y+1)) for x, y in zip(df_p.pop('early_date'), df_p.pop('late_date'))]
This is the result:
In the column "new", the filling between 201952 & 202001 in index 1 has become 201952, 201953, 201954, ..., 202000, 202001.
However, the "new" column actually represents years and week numbers. In the index 1 case, nothing should be filled in between 201952 & 202001, and the result should be [201952, 202001], since week 52 is the end of the year.
What can I do to handle these cases?
IIUC, you can add a condition in your list comprehension:
df_p['new'] = [list(range(x, y+1)) if str(x)[-2:] != '52' else [x, y]
               for x, y in zip(df_p.pop('early_date'), df_p.pop('late_date'))]
print(df_p)
index new
0 1 [201952, 202001]
1 2 [202002, 202003, 202004]
2 3 [202006, 202007, 202008, 202009, 202010, 20201...
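If the generated values should actually become new rows, as the question describes, DataFrame.explode turns each list into one row per element (ignore_index needs pandas 1.1+); a small sketch:
out = df_p.explode('new', ignore_index=True)
print(out)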

Extract text between brackets and create rows for each bit of text

In a pandas dataframe I need to extract text between square brackets and output that text as a new column. I need to do this at the "StudyID" level and create new rows for each bit of text extracted.
Here is a simplified example dataframe
import pandas as pd

data = {
    "studyid": ['101', '101', '102', '103'],
    "Question": ["Q1", "Q2", "Q1", "Q3"],
    "text": ['I love [Bananas] and also [oranges], and [figs]',
             'Yesterday I ate [Apples]',
             '[Grapes] are my favorite fruit',
             '[Mandarins] taste like [oranges] to me'],
}
df2 = pd.DataFrame(data)
I worked out a solution (see the code below; if you run it, it shows the wanted output), however it is very long, with many steps. I want to know if there is a much shorter way of doing this.
You will see that I used str.findall() for the regex. I originally tried str.extractall(), which outputs the extracted text to a dataframe, but I didn't know how to include the "studyid" and "Question" columns in the dataframe generated by extractall(), so I resorted to str.findall().
Here is my code (I know it is clunky) - how can I reduce the number of steps? Thanks in advance for your help!
# Step 1: Use Regex to pull out the text between the square brackets
df3 = pd.DataFrame(df2['text'].str.findall(r"(?<=\[)([^]]+)(?=\])").tolist())
# Step 2: Merge the extracted text back with the original data
df3 = df2.merge(df3, left_index=True, right_index=True)
# Step 3: Transpose the wide file to a long file (e.g. panel)
df4 = pd.melt(df3, id_vars=['studyid', 'Question'], value_vars=[0, 1, 2])
# Step 4: Delete rows with None in the value column
indexNames = df4[df4['value'].isnull()].index
df4.drop(indexNames, inplace=True)
# Step 5: Sort the data by the StudyID and Question
df4.sort_values(by=['studyid', 'Question'], inplace=True)
# Step 6: Drop unwanted columns
df4.drop(['variable'], axis=1, inplace=True)
# Step 7: Reset the index and drop the old index
df4.reset_index(drop=True, inplace=True)
df4
If it is possible to assign the output of Series.str.findall back to the column, you can use DataFrame.explode; at the end, DataFrame.reset_index with drop=True is used for a unique index:
df2['text'] = df2['text'].str.findall(r"(?<=\[)([^]]+)(?=\])")
df4 = df2.explode('text').reset_index(drop=True)
A solution with Series.str.extractall: remove the second level of the MultiIndex and finally use DataFrame.join to append back to the original:
s = (df2.pop('text').str.extractall(r"(?<=\[)([^]]+)(?=\])")[0]
.reset_index(level=1, drop=True)
.rename('text'))
df4 = df2.join(s).reset_index(drop=True)
print (df4)
studyid Question text
0 101 Q1 Bananas
1 101 Q1 oranges
2 101 Q1 figs
3 101 Q2 Apples
4 102 Q1 Grapes
5 103 Q3 Mandarins
6 103 Q3 oranges
You can "squeeze" your code to a single instruction:
df2[['studyid', 'Question']].join(df2['text'].str.findall(
r'\[([^]]+)\]').explode().rename('value'))
Even the regex can be simplified: no need for lookbehind / lookahead. Just place both brackets before / after the capturing group.
If you need to, save this result under a variable (e.g. df4 = ...).
Note: In your solution you named the last column in the final result (df4) as value, so I repeated it in my solution. But if you want to change this name, replace 'value' in my solution with another name of your choice.

Pandas updating values in a column using a lookup dictionary

I have a column in a Pandas dataframe that I want to use to look up a value of cost in a lookup dictionary.
The idea is that I will update the existing column if the item is there, and if not, the column will be left unchanged.
All the methods and solutions I have seen so far seem to create a new column, such as the apply and assign methods, but it is important that I preserve the existing data.
Here is my code:
import pandas as pd

lookupDict = {'Apple': 1, 'Orange': 2, 'Kiwi': 3, 'Lemon': 8}
df1 = pd.DataFrame({'Fruits': ['Apple', 'Banana', 'Kiwi', 'Cheese'],
                    'Pieces': [6, 3, 5, 7],
                    'Cost': [88, 55, 65, 55]})
What I want to achieve is to look up the items in the Fruits column and, if the item is there, update the Cost column with the dictionary value multiplied by the number of pieces.
For example, for Apple the cost is 1 in the lookup dictionary, and in the dataframe the number of pieces is 6, therefore the Cost column will be updated from 88 to 6*1 = 6. The next item is Banana, which is not in the lookup dictionary, so the cost in the original dataframe will be left unchanged. The same logic applies to the rest of the items.
The only way I can think of achieving this is to separate the lists from the dataframe, iterate through them, and then add them back into the dataframe when I'm finished. I am wondering if it is possible to act on the values in the dataframe without using separate lists?
From other responses I imagine I have to use loc indicators such as the following (but this is not working, and I don't want to create a new column):
df1.loc[df1.Fruits in lookupDict,'Cost'] = lookupDict[df1.Fruits] * lookupD[df1.Pieces]
I have also tried map, but it overwrites all the content of the existing column:
df1['Cost'] = df1['Fruits'].map(lookupDict)*df1['Pieces']
EDIT:
I have been able to achieve it with the following iteration; however, I am still curious if there is a cleaner way to achieve this:
# Iteration method
for i, x in zip(df1['Fruits'], range(len(df1.index))):
    fruit = df1.loc[x, 'Fruits']
    if fruit in lookupDict:
        newCost = lookupDict[fruit] * df1.loc[x, 'Pieces']
        print(newCost)
        df1.loc[x, 'Cost'] = newCost
If I understood correctly:
mask = df1['Fruits'].isin(lookupDict.keys())
df1.loc[mask, 'Cost'] = df1.loc[mask, 'Fruits'].map(lookupDict) * df1.loc[mask, 'Pieces']
Result:
In [29]: df1
Out[29]:
Cost Fruits Pieces
0 6 Apple 6
1 55 Banana 3
2 15 Kiwi 5
3 55 Cheese 7
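As a footnote, the asker's own map attempt, which overwrote the whole column, can also be repaired without a mask by falling back to the existing values with fillna; a minimal sketch (note the result dtype becomes float because of the intermediate NaNs):
df1['Cost'] = (df1['Fruits'].map(lookupDict) * df1['Pieces']).fillna(df1['Cost'])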
