I am looking for code to find rows that match a condition and keep those rows.
For example, I wish to keep all the apples with amt1 >= 5 and amt2 < 5. I also want to keep the bananas with amt1 >= 1 and amt2 < 5. There are many other fruits in the list that I have to filter for (maybe about 10 fruits).
Currently, I am filtering each fruit individually (i.e. creating a dataframe that filters out the red and small apples, another dataframe that filters out the green and big bananas, and using concat to join the dataframes together afterwards). However, this process takes a long time to run because the dataset is huge. I am looking for a faster way (like filtering within the dataframe itself, without having to create new dataframes). I also have to use column indices instead of column names, as the column names change according to the date.
Hopefully what I said makes sense. Would appreciate any help!
I am not quite sure I understand your requirements, because it is not clear to me how the conditions for the rows to keep are formulated.
One thing you can use to combine multiple criteria for selecting data is the query method of the dataframe:
import pandas as pd
df = pd.DataFrame([
    ['Apple', 5, 1],
    ['Apple', 4, 2],
    ['Orange', 3, 3],
    ['Banana', 2, 4],
    ['Banana', 1, 5]],
    columns=['Fruits', 'Amt1', 'Amt2'])
df.query('(Fruits == "Apple" & (Amt1 >= 5 & Amt2 < 5)) | (Fruits == "Banana" & (Amt1 >= 1 & Amt2 < 5))')
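Since you mentioned that the column names change with the date, you can also look the labels up by position and build an ordinary boolean mask. A minimal sketch, assuming the fruit column comes first and the two amount columns follow it, using the df defined above:
# labels fetched by position, so the filter survives column renames
fruit, amt1, amt2 = df.columns[0], df.columns[1], df.columns[2]
mask = (((df[fruit] == 'Apple') & (df[amt1] >= 5) & (df[amt2] < 5))
        | ((df[fruit] == 'Banana') & (df[amt1] >= 1) & (df[amt2] < 5)))
df_kept = df[mask]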
You might use filter combined with itertuples in the following way:
import pandas as pd
df = pd.DataFrame({"x":[1,2,3,4,5],"y":[10,20,30,40,50]})
def keep(row):
    # row is a namedtuple whose first field is the Index,
    # so row[0] is the original index and row[1] is column x
    return row[0] >= 2 and row[1] <= 40
df_filtered = pd.DataFrame(filter(keep,df.itertuples())).set_index("Index")
print(df_filtered)
gives output
x y
Index
2 3 30
3 4 40
4 5 50
Explanation: keep is a function which should return True for rows to keep and False for rows to jettison. .itertuples() provides an iterable of namedtuples, which are fed to filter, which selects the records where keep evaluates to True; these selected rows are then used to create a new DataFrame. After that is done, I set the index so that it corresponds to the original DataFrame. Depending on your use case, you might elect not to set the index.
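On a huge dataset, though, a vectorized boolean mask will usually be much faster than a Python-level filter over itertuples, since the comparisons run column-wise rather than once per row. A minimal sketch of the same selection as the keep example above (index >= 2 and x <= 40):
# same rows as the filter/itertuples version, computed column-wise
df_filtered = df[(df.index >= 2) & (df['x'] <= 40)]
print(df_filtered)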
I just discovered that iterating the rows of a pandas dataframe, and making updates to each row, does not update the dataframe! Is this expected behaviour, or does one need to do something to the row first so the update reflects in the parent dataframe?
I know one could update the dataframe directly in the loop, or with a simple recalculation on the column in this simple/contrived example, but my question is about the fact that iterrows() seems to provide copies of the rows rather than references to the actual rows in the dataframe. Is there a reason for this?
import pandas as pd

fruit = {"Fruit": ['Apple', 'Avacado', 'Banana', 'Strawberry', 'Grape'],
         "Color": ['Red', 'Green', 'Yellow', 'Pink', 'Green'],
         "Price": [45, 90, 60, 37, 49]}
df = pd.DataFrame(fruit)
for index, row in df.iterrows():
    row['Price'] = row['Price'] * 2
    print(row['Price'])  # the price is doubled here as expected

print(df['Price'])  # the original values of price in the dataframe are unchanged
You are storing the changes as row['Price'] but not actually saving them back to the dataframe df. You can go ahead and test this by using:
id(row) == id(df)
which returns False. Also, for better efficiency you shouldn't loop, but rather simply re-assign. Replace the for loop with:
df['New Price'] = df['Price'] * 2
You are entering the subtleties of copies versus the original object. What you update in the loop is a copy of the row, not the original Series.
You should have used direct access to the DataFrame:
for index, row in df.iterrows():
    df.loc[index, 'Price'] = row['Price'] * 2
But the real way to perform such operations should be a vectorized one:
df['Price'] = df['Price'].mul(2)
Or:
df['Price'] *= 2
Output:
Fruit Color Price
0 Apple Red 90
1 Avacado Green 180
2 Banana Yellow 120
3 Strawberry Pink 74
4 Grape Green 98
I want to find duplicate items across two columns in Excel. So for example, my Excel consists of:
list_A list_B
0 ideal ideal
1 brown colour
2 blue blew
3 red red
I checked the pandas documentation and tried the duplicated method, but I simply don't know why it keeps saying "DataFrame is empty". It finds both columns and, I guess, iterates over them, but why doesn't it find the values and compare them?
I also tried using iterrows but honestly don't know how to implement it.
When running the code I get this output:
Empty DataFrame
Columns: [list_A, list_B]
Index: []
import pandas as pd
pt = pd.read_excel(r"C:\Users\S531\Desktop\pt.xlsx")
dfObj = pd.DataFrame(pt)
doubles = dfObj[dfObj.duplicated()]
print(doubles)
The output I'm looking for is:
list_A list_B
0 ideal ideal
3 red red
Final solved code looks like this:
import pandas as pd
pt = pd.read_excel(r"C:\Users\S531\Desktop\pt.xlsx")
doubles = pt[pt['list_A'] == pt['list_B']]
print(doubles)
The term "duplicate" is usually used to mean rows that are exact duplicates of previous rows (see the documentation of pd.DataFrame.duplicate).
What you are looking for is just the rows where these two columns are equal. For that, you want:
doubles = pt[pt['list_A'] == pt['list_B']]
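To see the difference, here is a minimal sketch contrasting the two approaches (re-creating the data inline instead of reading your Excel file):
import pandas as pd

pt = pd.DataFrame({'list_A': ['ideal', 'brown', 'blue', 'red'],
                   'list_B': ['ideal', 'colour', 'blew', 'red']})

# duplicated() flags rows that repeat an earlier row in full;
# no row here repeats another, so this selection is empty
print(pt[pt.duplicated()])

# comparing the two columns flags rows where both cells match
print(pt[pt['list_A'] == pt['list_B']])  # rows 0 and 3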
I have a dataframe like the following, where everything is formatted as a string:
df
property value count
0 propAb True 10
1 propAA False 10
2 propAB blah 10
3 propBb 3 8
4 propBA 4 7
5 propCa 100 4
I am trying to find a way to filter the dataframe by applying a series of regex-style rules to both the property and value columns together.
For example, some sample rules may be like the following:
"if property starts with 'propA' and value is not 'True', drop the row".
Another rule may be something more mathematical, like:
"if property starts with 'propB' and value < 4, drop the row".
Is there a way to accomplish something like this without having to iterate over all rows each time for every rule I want to apply?
You still have to apply each rule (how else?), but let pandas handle the rows. Also, instead of removing the rows that you do not like, keep the rows that you do. Here's an example of how the first two rules can be applied:
rule1 = df.property.str.startswith('propA') & (df.value != 'True')
df = df[~rule1] # Keep everything that does NOT match
rule2 = df.property.str.startswith('propB') & (df.value < 4)
df = df[~rule2] # Keep everything that does NOT match
By the way, the second rule will not work because value is not a numeric column.
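If you do need the numeric rule, one option is to convert on the fly. A sketch using pd.to_numeric with errors='coerce', so non-numeric strings like 'blah' become NaN and simply fail the comparison:
# NaN < 4 is False, so non-numeric values never match rule2
rule2 = df.property.str.startswith('propB') & (pd.to_numeric(df.value, errors='coerce') < 4)
df = df[~rule2]  # keep everything that does NOT match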
For the first one (note that startswith on a column needs the .str accessor, and the values are strings, so compare against 'True'):
df = df.drop(df[df.property.str.startswith('propA') & (df.value != 'True')].index)
and the other one (converting value to numbers first, since the column holds strings):
df = df.drop(df[df.property.str.startswith('propB') & (pd.to_numeric(df.value, errors='coerce') < 4)].index)
I am trying to return a specific item from a Pandas DataFrame via conditional selection (and do not want to have to reference the index to do so).
Here is an example:
I have the following dataframe:
Code Colour Fruit
0 1 red apple
1 2 orange orange
2 3 yellow banana
3 4 green pear
4 5 blue blueberry
I enter the following code to search for the code for blueberries:
df[df['Fruit'] == 'blueberry']['Code']
This returns:
4 5
Name: Code, dtype: int64
which is of type:
pandas.core.series.Series
but what I actually want to return is the number 5 of type:
numpy.int64
which I can do if I enter the following code:
df[df['Fruit'] == 'blueberry']['Code'][4]
i.e. referencing the index to give the number 5, but I do not want to have to reference the index!
Is there another syntax that I can deploy here to achieve the same thing?
Thank you!...
Update:
One further idea is this code:
df[df['Fruit'] == 'blueberry']['Code'][df[df['Fruit']=='blueberry'].index[0]]
However, this does not seem particularly elegant (and it references the index). Is there a more concise and precise method that does not need to reference the index or is this strictly necessary?
Thanks!...
Let's try this:
df.loc[df['Fruit'] == 'blueberry','Code'].values[0]
Output:
5
First, use .loc to access the values in your dataframe, using boolean indexing for row selection and the index label for column selection. Then convert the returned series to an array of values; since there is only one value in that array, you can use the index [0] to get the scalar value out of that single-element array.
Referencing the index is a requirement (unless you use next()^), since a pd.Series is not guaranteed to have only one value.
You can use pd.Series.values to extract the values as an array. This also works if you have multiple matches:
res = df.loc[df['Fruit'] == 'blueberry', 'Code'].values
# array([5], dtype=int64)
df2 = pd.concat([df]*5)
res = df2.loc[df2['Fruit'] == 'blueberry', 'Code'].values
# array([5, 5, 5, 5, 5], dtype=int64)
To get a list from the numpy array, you can use .tolist():
res = df.loc[df['Fruit'] == 'blueberry', 'Code'].values.tolist()
Both the array and the list versions can be indexed intuitively, e.g. res[0] for the first item.
^ If you are really opposed to using index, you can use next() to iterate:
next(iter(res))
You can also set your 'Fruit' column as an index
df_fruit_index = df.set_index('Fruit')
and extract the value from the 'Code' column based on the fruit you choose
df_fruit_index.loc['blueberry','Code']
Easiest solution: convert the pandas.core.series.Series to an integer!
my_code = int(df[df['Fruit'] == 'blueberry']['Code'])
print(my_code)
Outputs:
5
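Note that in recent pandas versions, calling int() on a single-element Series is deprecated. A sketch of an equivalent that makes the single-match assumption explicit, using Series.item(), which returns the lone scalar and raises ValueError if the selection does not hold exactly one element:
# raises ValueError unless exactly one row matches
my_code = df.loc[df['Fruit'] == 'blueberry', 'Code'].item()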
I have a column in a Pandas dataframe that I want to use to look up a value of cost in a lookup dictionary.
The idea is that I will update an existing column if the item is there and if not the column will be left blank.
All the methods and solutions I have seen so far seem to create a new column, such as the apply and assign methods, but it is important that I preserve the existing data.
Here is my code:
import pandas as pd

lookupDict = {'Apple': 1, 'Orange': 2, 'Kiwi': 3, 'Lemon': 8}
df1 = pd.DataFrame({'Fruits': ['Apple', 'Banana', 'Kiwi', 'Cheese'],
                    'Pieces': [6, 3, 5, 7],
                    'Cost': [88, 55, 65, 55]})
What I want to achieve is lookup the items in the fruit column and if the item is there I want to update the cost column with the dictionary value multiplied by the number of pieces.
For example for Apple the cost is 1 from the lookup dictionary, and in the dataframe the number of pieces is 6, therefore the cost column will be updated from 88 to (6*1) = 6. The next item is banana which is not in the lookup dictionary, therefore the cost in the original dataframe will be left unchanged. The same logic will be applied to the rest of the items.
The only way I can think of achieving this is to separate the lists from the dataframe, iterate through them, and then add them back into the dataframe when I'm finished. I am wondering if it would be possible to act on the values in the dataframe without using separate lists?
From other responses I imagine I have to use the loc indicators, such as the following (but this is not working, and I don't want to create a new column):
df1.loc[df1.Fruits in lookupDict,'Cost'] = lookupDict[df1.Fruits] * lookupD[df1.Pieces]
I have also tried map, but it overwrites all the content of the existing column:
df1['Cost'] = df1['Fruits'].map(lookupDict)*df1['Pieces']
EDIT*******
I have been able to achieve it with the following iteration; however, I am still curious whether there is a cleaner way to achieve this:
# Iteration method
for x in range(len(df1.index)):
    fruit = df1.loc[x, 'Fruits']
    if fruit in lookupDict:
        newCost = lookupDict[fruit] * df1.loc[x, 'Pieces']
        print(newCost)
        df1.loc[x, 'Cost'] = newCost
If I understood correctly:
mask = df1['Fruits'].isin(lookupDict.keys())
df1.loc[mask, 'Cost'] = df1.loc[mask, 'Fruits'].map(lookupDict) * df1.loc[mask, 'Pieces']
Result:
In [29]: df1
Out[29]:
Cost Fruits Pieces
0 6 Apple 6
1 55 Banana 3
2 15 Kiwi 5
3 55 Cheese 7
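If you prefer to skip the explicit mask, an equivalent one-liner is possible (a sketch: map yields NaN for fruits missing from the dictionary, and fillna restores the original cost for those rows; note the column may be upcast to float in the process):
# recompute cost where the fruit is known, keep the old cost elsewhere
df1['Cost'] = (df1['Fruits'].map(lookupDict) * df1['Pieces']).fillna(df1['Cost'])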