Create new pandas column based on conditions stated in a table - python

I have an Excel table called rules_table where each row represents a rule, with a column giving the resulting category when that rule is true:
Legs  Eyes  Color  Description  Category
8     6     NaN    Small        Spider
4     2     Black  Friendly     Dog
2     NaN   NaN    Tall         Human
I.e. ignoring the NaNs, the table encodes the rules shown in this pseudocode:
If Legs == 8 & Eyes == 6 & Description.contains("Small") then Category = "Spider"
If Legs == 4 & Eyes == 2 & Color == "Black" & Description.contains("Friendly") then Category = "Dog"
If Legs == 2 & Description.contains("Tall") then Category = "Human"
I also have another table called data_table with the same format as the rules_table, except it is missing the Category column and usually does not contain NaNs:
Legs  Eyes  Color   Description
8     6     Brown   The creature is a small...
13    2     Orange  This is...
4     2     Black   This friendly creature...
2     2     White   The creature here is tall...
1     11    Yellow  The creature here is...
My goal is to add the category of the rules_table to the data_table whenever the rule applies, such that executing the code:
complete_table = my_function(rules_table, data_table)
Yields the complete_table:
Legs  Eyes  Color   Description                   Category
8     6     Brown   The creature is a small...    Spider
13    2     Orange  This is...                    NaN
4     2     Black   This friendly creature...     Dog
2     2     White   The creature here is tall...  Human
1     11    Yellow  The creature here is...       NaN
I am currently loading both tables as pandas DataFrames, but I am open to all options. Note that I have millions of rows, so efficiency is important to consider (but not critical).
I have tried two approaches:
Approach 1:
I have tried to join/merge the tables and make a work-around function for executing the "Description.contains" part of the rule. However, the NaNs are making it tricky for me, and I am not sure how I should work around that.
Approach 2:
I have tried iterating over each row of the rules_table and then creating a list of filters and a list of desired values, which I then use together with np.select. However, I cannot figure out how to programmatically construct executable code, and therefore end up with strings I cannot use as intended.
Do you have a suggestion for how I may proceed here? I am getting a bit stuck.
I can share code if you want, but I am getting stuck on a more fundamental level than just syntax.

If you are familiar with SQL, this problem would be easily solved with its flexible JOIN conditions. In MS SQL Server, you could solve your problem like this:
SELECT d.*, r.Category
FROM data_table d
LEFT JOIN rules_table r ON (d.Legs = r.Legs)
    AND (d.Eyes = r.Eyes OR r.Eyes IS NULL)
    AND (d.Color = r.Color OR r.Color IS NULL)
    AND (CHARINDEX(r.Description, d.Description) != 0)
Unfortunately, pandas' joins (DataFrame.join and pd.merge) are nowhere near as flexible. One way to overcome this is to first perform a cross join and then filter the intermediate result:
def my_function(rules_table, data_table):
    # Make a unique number for each row
    # To prevent changing the original data_table, we make a new copy
    new_data_table = data_table.assign(RowNumber=range(len(data_table)))

    # Join every row in new_data_table to every row in rules_table
    # We will filter for the matches later
    tmp = new_data_table.merge(rules_table, how='cross', suffixes=('', '_rules'))

    # Filter for the matches
    match = (
        ( tmp['Legs']  == tmp['Legs_rules'] ) &
        ((tmp['Eyes']  == tmp['Eyes_rules'] ) | tmp['Eyes_rules'].isna()) &
        ((tmp['Color'] == tmp['Color_rules']) | tmp['Color_rules'].isna()) &
        tmp.apply(lambda row: row['Description_rules'].lower() in row['Description'].lower(), axis=1)
    )

    # Perform another left join to produce final result
    result = new_data_table.merge(tmp.loc[match, ['RowNumber', 'Category']], how='left', on='RowNumber')
    return result.drop(columns='RowNumber')
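For completeness, the np.select idea from Approach 2 also works without constructing executable strings: build one boolean Series per rule (treating a NaN in the rule as "no constraint") and hand the list of masks plus the list of categories to np.select. A rough sketch, assuming the column names from the post (apply_rules is just an illustrative name, not a pandas function):
import numpy as np
import pandas as pd

def apply_rules(rules_table, data_table):
    masks, categories = [], []
    for _, rule in rules_table.iterrows():
        # start with an all-True mask and narrow it down per non-NaN rule field
        mask = pd.Series(True, index=data_table.index)
        for col in ['Legs', 'Eyes', 'Color']:
            if pd.notna(rule[col]):
                mask &= data_table[col] == rule[col]
        if pd.notna(rule['Description']):
            mask &= data_table['Description'].str.contains(rule['Description'], case=False, regex=False, na=False)
        masks.append(mask)
        categories.append(rule['Category'])
    # np.select picks the category of the first matching rule; None marks rows with no match
    return data_table.assign(Category=np.select(masks, categories, default=None))
This loops over the rules (a handful) rather than the rows (millions), so it should scale reasonably well, although I have not benchmarked it against the cross-join version.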

Related

How to (elegantly) add single values and rows to a DataFrame?

Imagine the following DataFrame.
import pandas as pd
animal_sizes = pd.DataFrame({"Animal": ["Horse", "Mouse"],
                             "Size": ["Large", "Small"]})
Animal  Size
Horse   Large
Mouse   Small
I want to add another row for "Dog". If I understand correctly, I have to first create another DataFrame and then concatenate the new and the existing DataFrame.
pd.concat([animal_sizes,
           pd.DataFrame({"Animal": ["Dog"],
                         "Size": ["Medium"]})]
          )
Animal  Size
Horse   Large
Mouse   Small
Dog     Medium
This doesn't seem terribly elegant. Is there a simpler way? I imagine something like animal_sizes.append_row(["Dog", "Medium"]).
Imagine I only want to add another value to the Animal column. (Perhaps I haven't measured the size yet.) Again, pd.concat with an explicit empty (or NaN) value for the Size column seems awkward:
pd.concat([animal_sizes,
           pd.DataFrame({"Animal": ["Crow"], "Size": [""]})])
Animal  Size
Horse   Large
Mouse   Small
Crow
Is there a simpler solution? I'm looking for something like animal_sizes["Animal"].append_value("Crow").
I know about DataFrame.append (see this fine answer), but not only is it deprecated, it also expects you to spell out the column for each new row value. This makes it slightly unwieldy for my taste.
animal_sizes.append({"Animal": "Crow"}, ignore_index=True)
Are there more elegant solutions for this very simple problem?
I recommend defining an appropriate index (animals in this case) and using it to insert new rows by name. Use dictionaries to add incomplete rows.
import pandas as pd
animal_sizes = pd.DataFrame({"Animal": ["Horse", "Mouse"],
                             "Size": ["Large", "Small"],
                             "othercol": ["A", "B"]}
                            ).set_index("Animal")
animal_sizes.loc["Dog"] = {"othercol": "C"}
animal_sizes.loc["Elephant"] = ["verylarge", "D"]
animal_sizes.loc["unspecifiedanimal"] = {}
print(animal_sizes)
# result:
                        Size othercol
Animal
Horse                  Large        A
Mouse                  Small        B
Dog                      NaN        C
Elephant           verylarge        D
unspecifiedanimal        NaN      NaN
Adding an existing animal replaces a row. This may or may not be intended behavior. If the goal is to blindly dump rows into the table while accepting duplicates, the best solution might still be concat.
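For that concat route, a minimal sketch with a simplified Animal-indexed frame (only the Size column) shows that duplicate labels are simply appended rather than overwritten:
import pandas as pd

animal_sizes = pd.DataFrame({"Size": ["Large", "Small"]},
                            index=pd.Index(["Horse", "Mouse"], name="Animal"))
new_rows = pd.DataFrame({"Size": ["Medium", "Medium"]},
                        index=pd.Index(["Dog", "Dog"], name="Animal"))

# concat just stacks the rows, so both "Dog" rows survive as duplicates
animal_sizes = pd.concat([animal_sizes, new_rows])
print(animal_sizes)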
Solution for a default RangeIndex, always inserting new rows at the end of the DataFrame:
Use DataFrame.loc with a list whose length matches the number of columns - the new index label is simply the current number of rows:
animal_sizes.loc[len(animal_sizes)] = ["Dog", "Medium"]
print (animal_sizes)
Animal Size
0 Horse Large
1 Mouse Small
2 Dog Medium
If you also need to specify the column names:
animal_sizes.loc[len(animal_sizes)] = {"Animal": "Dog", "Size": "Medium"}
print (animal_sizes)
Animal Size
0 Horse Large
1 Mouse Small
2 Dog Medium
You can add a single row to a Pandas DataFrame using the .loc indexing method:
animal_sizes.loc[len(animal_sizes)] = ["Dog", "Medium"]
To add a single value to the Animal column, you can create a new one-row DataFrame and concatenate the DataFrames:
animal_sizes['Size'] = animal_sizes['Size'].astype(str)
animal_sizes = pd.concat([animal_sizes, pd.DataFrame({"Animal": ["Crow"], "Size": [""]})], sort=False)
Note that you need to cast the Size column to a string data type to accommodate the empty string.

Keep rows according to condition in Pandas

I am looking for a code to find rows that matches a condition and keep those rows.
In the image example, I wish to keep all the apples with amt1 >= 5 and amt2 < 5. I also want to keep the bananas with amt1 >= 1 and amt2 < 5 (highlighted red in image). There are many other fruits in the list that I have to filter for (maybe about 10 fruits).
image example
Currently, I am filtering it individually (i.e. creating a dataframe that filters out the red and small apples, and another dataframe that filters out the green and big bananas, and using concat to join the dataframes together afterwards). However, this process takes a long time to run because the dataset is huge. I am looking for a faster way (like filtering it in the dataframe itself without having to create new dataframes). I also have to use column index instead of column names, as the column names change according to the date.
Hopefully what I said makes sense. Would appreciate any help!
I am not quite sure I understand your requirements because I don't understand how the conditions for the rows to keep are formulated.
One thing you can use to combine multiple criteria for selecting data is the query method of the dataframe:
import pandas as pd
df = pd.DataFrame([
    ['Apple', 5, 1],
    ['Apple', 4, 2],
    ['Orange', 3, 3],
    ['Banana', 2, 4],
    ['Banana', 1, 5]],
    columns=['Fruits', 'Amt1', 'Amt2'])
df.query('(Fruits == "Apple" & (Amt1 >= 5 & Amt2 < 5)) | (Fruits == "Banana" & (Amt1 >= 1 & Amt2 < 5))')
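Since the question mentions that column names change and only positions are stable, the same condition can also be expressed against column positions with iloc and a plain boolean mask; a sketch assuming the fruit name and the two amounts sit in the first three columns:
# assumed positions: 0 = fruit name, 1 = Amt1, 2 = Amt2
fruit, amt1, amt2 = df.iloc[:, 0], df.iloc[:, 1], df.iloc[:, 2]
mask = ((fruit == "Apple") & (amt1 >= 5) & (amt2 < 5)) | \
       ((fruit == "Banana") & (amt1 >= 1) & (amt2 < 5))
print(df[mask])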
You might use filter combined with itertuples in the following way:
import pandas as pd
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [10, 20, 30, 40, 50]})

def keep(row):
    return row[0] >= 2 and row[1] <= 40

df_filtered = pd.DataFrame(filter(keep, df.itertuples())).set_index("Index")
print(df_filtered)
gives output
x y
Index
2 3 30
3 4 40
4 5 50
Explanation: keep is a function which should return True for rows to keep and False for rows to jettison. .itertuples() provides an iterable of tuples, which is fed to filter, which selects the records where keep evaluates to True; these selected rows are used to create a new DataFrame. After that is done, I set the index so that Index corresponds to the original DataFrame. Depending on your use case, you might elect not to set the index.
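For large frames, a plain boolean mask is usually much faster than filtering Python-level tuples; an equivalent of the keep condition above (in itertuples, row[0] is the index and row[1] is the x column) would be:
# vectorized version of keep(): index >= 2 and x <= 40
print(df[(df.index >= 2) & (df["x"] <= 40)])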

How Do I Generate a List of Items Not Shared Between Two Dataframes

Basically, I have a single list of a bunch of unique items that are categorized by color (Items). I do some stuff and generate a dataframe with selected combinations of these unique items (Combinations). My goal is to make a list of the items from the original list that do not appear in the Combinations dataframe. Ideally, I'd like to check all four color columns, but for my initial test, I just selected the "Red" column.
import pandas as pd
Items = pd.DataFrame({'Id': ["6917529336306454104", "6917529268375577150", "6917529175831101427",
                             "6917529351156928903", "6917529249201580539", "6917529246740186376",
                             "6917529286870790429", "6917529212665335174", "6917529206310658443",
                             "6917529207434353786", "6917529309798817021", "6917529352287607192",
                             "6917529268327711171", "6917529316674574229"],
                      'Type': ['Red', 'Blue', 'Green', 'Cyan', 'Red', 'Blue', 'Blue', 'Blue', 'Blue',
                               'Green', 'Green', 'Green', 'Cyan', 'Cyan']})
Items = Items.set_index('Id', drop=True)

#Do stuff

Combinations = pd.DataFrame({
    'Red': ["6917529336306454104", "6917529336306454104", "6917529336306454104", "6917529336306454104"],
    'Blue': ["6917529268375577150", "6917529286870790429", "6917529206310658443", "6917529206310658443"],
    'Green': ["6917529175831101427", "6917529207434353786", "6917529309798817021", "6917529309798817021"],
    'Cyan': ["6917529351156928903", "6917529268327711171", "6917529351156928903", "6917529268327711171"],
    'Other': [12, 15, 18, 32]
})
My first attempt was using the line below, but this raises the execution error "KeyError: 'Id'". A forum post indicated that the drop=True in the set_index might resolve it, but that didn't seem to work in my case.
UnusedItems = ~Items[Items['Id'].isin(list(Combinations['Red']))]
I attempted to work around it by using this line. While it executes, it generates an empty dataframe. Just by inspection, item 6917529249201580539 should be returned when considering the "Red" column. Considering all Combination columns, items 6917529249201580539, 6917529246740186376, 6917529212665335174, and 6917529316674574229 should be returned as unused.
UnusedItems = ~Items[Items.iloc[:,0].isin(list(Combinations['Red']))]
I'd appreciate any ideas or guidance. Thanks.
Use .melt() on Combinations, then change both into sets and subtract:
set(Items.index) - set(Combinations.melt().value)
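Note that melt() on the full frame also sweeps in the numeric Other column; that happens to be harmless here because those values never collide with the Ids, but restricting the melt to the four Id columns is safer. A sketch using the Items and Combinations frames above:
# only melt the four color columns before taking the set difference
used = set(Combinations.iloc[:, :4].melt()["value"])
unused = set(Items.index) - used
print(unused)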
One option would be to grab the first 4 columns from Combinations with iloc and reformat to long form with stack:
(Combinations.iloc[:, :4].stack()
             .droplevel(0).rename_axis(index='Type').reset_index(name='Id'))
Type Id
0 Red 6917529336306454104
1 Blue 6917529268375577150
2 Green 6917529175831101427
3 Cyan 6917529351156928903
4 Red 6917529336306454104
5 Blue 6917529286870790429
6 Green 6917529207434353786
7 Cyan 6917529268327711171
8 Red 6917529336306454104
9 Blue 6917529206310658443
10 Green 6917529309798817021
11 Cyan 6917529351156928903
12 Red 6917529336306454104
13 Blue 6917529206310658443
14 Green 6917529309798817021
15 Cyan 6917529268327711171
Then perform an Anti-Join with Items, reset_index to get the 'Id' column back from the index, merge with indicator, and query to filter out values that are present in both frames, then drop the indicator column:
UnusedItems = Items.reset_index().merge(
    Combinations.iloc[:, :4].stack()
                .droplevel(0).rename_axis(index='Type').reset_index(name='Id'),
    how='outer',
    indicator='I').query('I != "both"').drop(columns='I')
UnusedItems:
Id Type
8 6917529249201580539 Red
9 6917529246740186376 Blue
11 6917529212665335174 Blue
17 6917529352287607192 Green
20 6917529316674574229 Cyan

What is the best way to process a list of numerical codes as descriptions in Pandas?

Here the dataset:
df = pd.read_csv('https://data.lacity.org/api/views/d5tf-ez2w/rows.csv?accessType=DOWNLOAD')
The problem:
I have a pandas dataframe of traffic accidents in Los Angeles.
Each accident has a column of mo_codes which is a string of numerical codes (which I converted into a list of codes). Here is a screenshot:
I also have a dictionary of mo_codes description for each respective mo_code and loaded in the notebook.
Now, using the code below I can combine the numeric code with the description:
mo_code_list_final = []
for i in range(20):
    for j in df.mo_codes.iloc[i]:
        print(i, mo_code_dict[j])
So, I haven't added this as a column to Pandas yet. I wanted to ask if there is a better way to solve the problem I have which is, how best to add the textual description in pandas as a column.
Also, is there an easier way to process this with a pandas function like .assign instead of the for loop. Maybe a list comprehension to process the mo_codes into a new dataframe with the description?
Thanks in advance.
ps. if there is a technical word for this type of problem, pls let me know.
import pandas
codes = {0: 'Test1', 1: 'test 2', 2: 'test 3', 3: 'test 4'}
df1 = pandas.DataFrame([["red", [0, 1, 2], 5], ["blue", [3, 1], 6]], columns=[0, 'codes', 2])

# first explode the list into its own rows
df2 = df1['codes'].apply(pandas.Series).stack().astype(int).reset_index(level=1, drop=True).to_frame('codes').join(df1[[0, 2]])

# now use map to apply the text descriptions
df2['desc'] = df2['codes'].map(codes)
print(df2)
"""
   codes     0  2    desc
0      0   red  5   Test1
0      1   red  5  test 2
0      2   red  5  test 3
1      3  blue  6  test 4
1      1  blue  6  test 2
"""
I figured out how to finally do this. However, I found the answer in Javascript but the same concept applies.
You simply create a dictionary of mocodes and its string value.
export const mocodesDict = {
"0100": "Suspect Impersonate",
"0101": "Aid victim",
"0102": "Blind",
"0103": "Crippled",
...
}
After that, it's as simple as doing this:
mocodesDict[item]
where item is the code you want to convert.

Pandas updating values in a column using a lookup dictionary

I have a column in a Pandas dataframe that I want to use to look up a value of cost in a lookup dictionary.
The idea is that I will update an existing column if the item is there and if not the column will be left blank.
All the methods and solutions I have seen so far seem to create a new column, such as apply and assign methods, but it is important that I preserve the existing data.
Here is my code:
lookupDict = {'Apple': 1, 'Orange': 2, 'Kiwi': 3, 'Lemon': 8}
df1 = pd.DataFrame({'Fruits': ['Apple', 'Banana', 'Kiwi', 'Cheese'],
                    'Pieces': [6, 3, 5, 7],
                    'Cost': [88, 55, 65, 55]},)
What I want to achieve is lookup the items in the fruit column and if the item is there I want to update the cost column with the dictionary value multiplied by the number of pieces.
For example for Apple the cost is 1 from the lookup dictionary, and in the dataframe the number of pieces is 6, therefore the cost column will be updated from 88 to (6*1) = 6. The next item is banana which is not in the lookup dictionary, therefore the cost in the original dataframe will be left unchanged. The same logic will be applied to the rest of the items.
The only way I can think of achieving this is to separate the lists from the dataframe, iterate through them and then add them back into the dataframe when I'm finished. I am wondering if it would be possible to act on the values in the dataframe without using separate lists??
From other responses I imagine I have to use the loc indicators such as the following (but this is not working, and I don't want to create a new column):
df1.loc[df1.Fruits in lookupDict,'Cost'] = lookupDict[df1.Fruits] * lookupD[df1.Pieces]
I have also tried to map but it overwrites all the content of the existing column:
df1['Cost'] = df1['Fruits'].map(lookupDict)*df1['Pieces']
EDIT*******
I have been able to achieve it with the following using iteration, however I am still curious if there is a cleaner way to achieve this:
# Iteration method
for i, x in zip(df1['Fruits'], range(len(df1.index))):
    fruit = df1.loc[x, 'Fruits']
    if fruit in lookupDict:
        newCost = lookupDict[fruit] * df1.loc[x, 'Pieces']
        print(newCost)
        df1.loc[x, 'Cost'] = newCost
If I understood correctly:
mask = df1['Fruits'].isin(lookupDict.keys())
df1.loc[mask, 'Cost'] = df1.loc[mask, 'Fruits'].map(lookupDict) * df1.loc[mask, 'Pieces']
Result:
In [29]: df1
Out[29]:
Cost Fruits Pieces
0 6 Apple 6
1 55 Banana 3
2 15 Kiwi 5
3 55 Cheese 7
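An equivalent one-liner avoids building the mask by letting map produce NaN for unknown fruits and then restoring the original Cost with fillna (same lookupDict and df1 as above):
# NaN from map marks fruits missing from lookupDict; fillna keeps the old Cost there
df1['Cost'] = (df1['Fruits'].map(lookupDict) * df1['Pieces']).fillna(df1['Cost'])
Note that Cost ends up as float because of the intermediate NaNs.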
