Pandas updating values in a column using a lookup dictionary - python

I have a column in a Pandas dataframe that I want to use to look up a cost value in a lookup dictionary.
The idea is that I will update an existing column if the item is there, and if not, the existing value will be left as it is.
All the methods and solutions I have seen so far, such as the apply and assign methods, seem to create a new column, but it is important that I preserve the existing data.
Here is my code:
import pandas as pd

lookupDict = {'Apple': 1, 'Orange': 2, 'Kiwi': 3, 'Lemon': 8}
df1 = pd.DataFrame({'Fruits': ['Apple', 'Banana', 'Kiwi', 'Cheese'],
                    'Pieces': [6, 3, 5, 7],
                    'Cost': [88, 55, 65, 55]})
What I want to achieve is lookup the items in the fruit column and if the item is there I want to update the cost column with the dictionary value multiplied by the number of pieces.
For example, for Apple the cost is 1 from the lookup dictionary, and in the dataframe the number of pieces is 6, therefore the Cost column will be updated from 88 to (6*1) = 6. The next item is Banana, which is not in the lookup dictionary, therefore the cost in the original dataframe will be left unchanged. The same logic will be applied to the rest of the items.
The only way I can think of achieving this is to separate the lists from the dataframe, iterate through them and then add them back into the dataframe when I'm finished. I am wondering if it would be possible to act on the values in the dataframe without using separate lists?
From other responses I imagine I have to use loc indexing, such as the following (but this is not working and I don't want to create a new column):
df1.loc[df1.Fruits in lookupDict,'Cost'] = lookupDict[df1.Fruits] * lookupD[df1.Pieces]
I have also tried map, but it overwrites all the content of the existing column:
df1['Cost'] = df1['Fruits'].map(lookupDict)*df1['Pieces']
EDIT*******
I have been able to achieve it with the following using iteration, however I am still curious if there is a cleaner way to achieve this:
# Iteration method
for i, x in zip(df1['Fruits'], range(len(df1.index))):
    fruit = df1.loc[x, 'Fruits']
    if fruit in lookupDict:
        newCost = lookupDict[fruit] * df1.loc[x, 'Pieces']
        print(newCost)
        df1.loc[x, 'Cost'] = newCost

If I understood correctly:
mask = df1['Fruits'].isin(lookupDict.keys())
df1.loc[mask, 'Cost'] = df1.loc[mask, 'Fruits'].map(lookupDict) * df1.loc[mask, 'Pieces']
Result:
In [29]: df1
Out[29]:
   Cost  Fruits  Pieces
0     6   Apple       6
1    55  Banana       3
2    15    Kiwi       5
3    55  Cheese       7
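A variant that avoids building the mask, assuming the same df1 and lookupDict as above: compute the mapped cost and fall back to the existing Cost wherever the fruit has no entry in the dictionary:
df1['Cost'] = (df1['Fruits'].map(lookupDict) * df1['Pieces']).fillna(df1['Cost'])
# Note: Cost becomes a float column here because of the intermediate NaNs for Banana and Cheese.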

Related

Keep rows according to condition in Pandas

I am looking for code to find rows that match a condition and keep those rows.
In the image example, I wish to keep all the apples with amt1 >= 5 and amt2 < 5. I also want to keep the bananas with amt1 >= 1 and amt2 < 5 (highlighted red in the image). There are many other fruits in the list that I have to filter for (maybe about 10 fruits).
[image example]
Currently, I am filtering it individually (i.e. creating a dataframe that filters out the red and small apples and another dataframe that filters out the green and big bananas, and using concat to join the dataframes together afterwards). However, this process takes a long time to run because the dataset is huge. I am looking for a faster way (like filtering in the dataframe itself without having to create new dataframes). I also have to use column indices instead of column names, as the column names change according to the date.
Hopefully what I said makes sense. Would appreciate any help!
I am not quite sure I understand your requirements, because it is not clear to me how the conditions for the rows to keep are formulated.
One thing you can use to combine multiple criteria for selecting data is the query method of the dataframe:
import pandas as pd

df = pd.DataFrame([
    ['Apple', 5, 1],
    ['Apple', 4, 2],
    ['Orange', 3, 3],
    ['Banana', 2, 4],
    ['Banana', 1, 5]],
    columns=['Fruits', 'Amt1', 'Amt2'])

df.query('(Fruits == "Apple" & (Amt1 >= 5 & Amt2 < 5)) | (Fruits == "Banana" & (Amt1 >= 1 & Amt2 < 5))')
You might use filter combined with itertuples in the following way:
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [10, 20, 30, 40, 50]})

def keep(row):
    # row is a namedtuple from itertuples(): row[0] is the Index, row[1] is x, row[2] is y
    return row[0] >= 2 and row[1] <= 40

df_filtered = pd.DataFrame(filter(keep, df.itertuples())).set_index("Index")
print(df_filtered)
gives output
       x   y
Index
2      3  30
3      4  40
4      5  50
Explanation: keep is a function which should return True for rows to keep and False for rows to jettison. .itertuples() provides an iterable of namedtuples, which are fed to filter, which selects the records where keep evaluates to True; these selected rows are used to create a new DataFrame. After that is done, I set the index so that Index corresponds to the original DataFrame. Depending on your use case you might elect not to set the index.
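Since the asker also mentions having to address columns by position rather than by name, a vectorised boolean-mask variant of the same two-condition example (a sketch, reusing the toy Fruits/Amt1/Amt2 frame from the query answer above) could look like this:
import pandas as pd

df = pd.DataFrame([
    ['Apple', 5, 1],
    ['Apple', 4, 2],
    ['Orange', 3, 3],
    ['Banana', 2, 4],
    ['Banana', 1, 5]],
    columns=['Fruits', 'Amt1', 'Amt2'])

# Address the columns by position because the names change with the date.
fruit, amt1, amt2 = df.iloc[:, 0], df.iloc[:, 1], df.iloc[:, 2]
mask = ((fruit == 'Apple') & (amt1 >= 5) & (amt2 < 5)) | \
       ((fruit == 'Banana') & (amt1 >= 1) & (amt2 < 5))
print(df[mask])  # keeps rows 0 and 3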

Updates to Python pandas dataframe rows do not update the dataframe?

I just discovered that iterating the rows of a pandas dataframe, and making updates to each row, does not update the dataframe! Is this expected behaviour, or does one need to do something to the row first so the update reflects in the parent dataframe?
I know one could update the dataframe directly in the loop, or with a simple recalculation on the column in this simple/contrived example, but my question is about the fact that iterrows() seems to provide copies of the rows rather than references to the actual rows in the dataframe. Is there a reason for this?
import pandas as pd

fruit = {"Fruit": ['Apple', 'Avacado', 'Banana', 'Strawberry', 'Grape'],
         "Color": ['Red', 'Green', 'Yellow', 'Pink', 'Green'],
         "Price": [45, 90, 60, 37, 49]}
df = pd.DataFrame(fruit)

for index, row in df.iterrows():
    row['Price'] = row['Price'] * 2
    print(row['Price'])  # the price is doubled here as expected

print(df['Price'])  # the original values of price in the dataframe are unchanged
You are storing the changes in row['Price'] but not actually saving them back to the dataframe df. You can test this with:
id(row) == id(df)
Which returns False. Also, for better efficiency you shouldn't loop, but rather simply re-assign. Replace the for loop with:
df['New Price'] = df['Price'] * 2
You are entering the subtleties of copies versus the original object. What you update in the loop is a copy of the row, not the original Series.
You should have used a direct access to the DataFrame:
for index, row in df.iterrows():
    df.loc[index, 'Price'] = row['Price'] * 2
But the real way to perform such operations should be a vectorized one:
df['Price'] = df['Price'].mul(2)
Or:
df['Price'] *= 2
Output:
        Fruit   Color  Price
0       Apple     Red     90
1     Avacado   Green    180
2      Banana  Yellow    120
3  Strawberry    Pink     74
4       Grape   Green     98
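To see the copy behaviour directly, here is a small check (not part of either answer, reusing the df defined in the question):
for index, row in df.iterrows():
    row['Price'] = 0           # writes to a fresh copy; df is untouched
print(df['Price'].tolist())    # still [45, 90, 60, 37, 49]

for index, row in df.iterrows():
    df.at[index, 'Price'] = row['Price'] * 2   # writing through df persists
print(df['Price'].tolist())    # [90, 180, 120, 74, 98]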

How to efficiently count the number of smaller elements for every element in another column?

I have the following df
name created_utc
0 t1_cqug90j 1430438400
1 t1_cqug90k 1430438400
2 t1_cqug90z 1430438400
3 t1_cqug91c 1430438401
4 t1_cqug91e 1430438401
... ... ...
in which column name contains only unique values. I would like to create a dictionary whose keys are the elements of column name. The value for each such key is the number of elements in column created_utc strictly smaller than the created_utc of that key. My expected result is something like
{'t1_cqug90j': 6, 't1_cqug90k': 0, 't1_cqug90z': 3, ...}
In this case, there are 6 elements in column created_utc strictly smaller than 1430438400, which is the corresponding value of t1_cqug90j. I can write a loop to generate such a dictionary. However, the loop is not efficient in my case with more than 3 million rows.
Could you please elaborate on a more efficient way?
import pandas as pd
import numpy as np
df = pd.read_csv('https://raw.githubusercontent.com/leanhdung1994/WebMining/main/df1.csv', header = 0)[['name', 'created_utc']]
df
Update: I posted the question How to efficiently count the number of larger elements for every elements in another column? and received a great answer there. However, I'm not able to modify the code into this case. It would be great if there is an efficient code that can be adapted for both cases, i.e. "strictly larger" and "strictly smaller".
I think you need sort_index with descending sorting applied to your previous answer's approach:
count_utc = df.groupby('created_utc').size().sort_index(ascending=False)
print (count_utc)
created_utc
1430438401 2
1430438400 3
dtype: int64
cumulative_counts = count_utc.shift(fill_value=0).cumsum()
output = dict(zip(df['name'], df['created_utc'].map(cumulative_counts)))
print (output)
{'t1_cqug90j': 2, 't1_cqug90k': 2, 't1_cqug90z': 2, 't1_cqug91c': 0, 't1_cqug91e': 0}
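For the "strictly smaller" direction asked in the question, the same shift/cumsum idea works with an ascending sort; a minimal sketch on the five sample rows:
import pandas as pd

df = pd.DataFrame({
    'name': ['t1_cqug90j', 't1_cqug90k', 't1_cqug90z', 't1_cqug91c', 't1_cqug91e'],
    'created_utc': [1430438400, 1430438400, 1430438400, 1430438401, 1430438401]})

# For each timestamp, the cumulative count of all strictly smaller timestamps.
counts = df.groupby('created_utc').size().sort_index(ascending=True)
smaller = counts.shift(fill_value=0).cumsum()

output = dict(zip(df['name'], df['created_utc'].map(smaller).tolist()))
print(output)
# {'t1_cqug90j': 0, 't1_cqug90k': 0, 't1_cqug90z': 0, 't1_cqug91c': 3, 't1_cqug91e': 3}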

Fill up another column based on another columns unique value

I have this CSV data (an example):
I have 5000 rows of zip codes along with other columns, but only 34 of the zip codes are unique. I have to take each zip code and hit another API to get the median income, but how can I fill in the median income column for the other rows that share a duplicate zip code?
N.B: Didn't find anything related to my case.
You want to use transform, which returns a result with the same index as the original object, filled with the transformed values.
You will need to write a function that takes a zip code and returns the median value. See this example:
import pandas as pd

def get_med(zip_code):
    # This would be your get call to the API.
    # Here, `zip_code` is a Series, use `.iloc[0]`
    # to get the value of the group.
    return zip_code.iloc[0] * 100

df = pd.DataFrame({"zip": [1, 2, 3, 1, 1]})
df["med_income"] = df.groupby("zip")["zip"].transform(get_med)

#    zip  med_income
# 0    1         100
# 1    2         200
# 2    3         300
# 3    1         100
# 4    1         100
Alternatively you could generate all the median values in a dict and then map that back onto the DataFrame:
# Here `get_median` is assumed to take a single zip code (a scalar) and return its median income.
medians = {zip_code: get_median(zip_code) for zip_code in df["zip"].unique()}
df["med_income"] = df["zip"].map(medians)
I believe you're looking for pandas map. So let's suppose the output of this second API is a dictionary (assuming you can get it in that form):
# Get unique zip codes to use as input to the API
zip_codes = df['Zip'].unique()
# Let's suppose you get an output like this
zip_dict = {46234: 1500, 46250: 2000, 46280: 1200} # and so on...
So, you can map the zip code to the Median Income like this:
df['Median Income'] = df['Zip'].map(zip_dict)
where df is your dataframe.
From what I understood, you want to get the unique values of the zipcodes? If yes, then you can use
df.yourColumn.unique()

How to extract values from a Pandas DataFrame, rather than a Series (without referencing the index)?

I am trying to return a specific item from a Pandas DataFrame via conditional selection (and do not want to have to reference the index to do so).
Here is an example:
I have the following dataframe:
Code Colour Fruit
0 1 red apple
1 2 orange orange
2 3 yellow banana
3 4 green pear
4 5 blue blueberry
I enter the following code to search for the code for blueberries:
df[df['Fruit'] == 'blueberry']['Code']
This returns:
4 5
Name: Code, dtype: int64
which is of type:
pandas.core.series.Series
but what I actually want to return is the number 5 of type:
numpy.int64
which I can do if I enter the following code:
df[df['Fruit'] == 'blueberry']['Code'][4]
i.e. referencing the index to give the number 5, but I do not want to have to reference the index!
Is there another syntax that I can deploy here to achieve the same thing?
Thank you!...
Update:
One further idea is this code:
df[df['Fruit'] == 'blueberry']['Code'][df[df['Fruit']=='blueberry'].index[0]]
However, this does not seem particularly elegant (and it references the index). Is there a more concise and precise method that does not need to reference the index or is this strictly necessary?
Thanks!...
Let's try this:
df.loc[df['Fruit'] == 'blueberry','Code'].values[0]
Output:
5
First, use .loc to access the values in your dataframe using boolean indexing for row selection and the column label for column selection. Then convert the returned Series to an array of values with .values; since there is only one value in that array, you can use index [0] to get the scalar value from that single-element array.
Referencing an index is a requirement (unless you use next()^), since a pd.Series is not guaranteed to have exactly one value.
You can use pd.Series.values to extract the values as an array. This also works if you have multiple matches:
res = df.loc[df['Fruit'] == 'blueberry', 'Code'].values
# array([5], dtype=int64)
df2 = pd.concat([df]*5)
res = df2.loc[df2['Fruit'] == 'blueberry', 'Code'].values
# array([5, 5, 5, 5, 5], dtype=int64)
To get a list from the numpy array, you can use .tolist():
res = df.loc[df['Fruit'] == 'blueberry', 'Code'].values.tolist()
Both the array and the list versions can be indexed intuitively, e.g. res[0] for the first item.
^ If you are really opposed to using index, you can use next() to iterate:
next(iter(res))
You can also set your 'Fruit' column as an index:
df_fruit_index = df.set_index('Fruit')
and then extract the value from the 'Code' column based on the fruit you choose:
df_fruit_index.loc['blueberry','Code']
Easiest solution: convert pandas.core.series.Series to integer!
my_code = int(df[df['Fruit'] == 'blueberry']['Code'])
print(my_code)
Outputs:
5
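Note that on recent pandas versions, calling int() on a single-element Series may emit a deprecation warning; two more explicit spellings that avoid both the index label and the cast (assuming the same df) are:
code = df.loc[df['Fruit'] == 'blueberry', 'Code'].iloc[0]   # positional, no index label needed
# or, when exactly one match is expected:
code = df.loc[df['Fruit'] == 'blueberry', 'Code'].item()
print(code)  # 5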
