I have the Iris dataset which looks something like:
1,3,1,1,0
1,1,1,1,0
1,3,1,1,0
1,2,1,1,0
1,3,1,1,0
1,2,1,1,0
2,2,2,2,1
2,2,2,2,1
2,2,2,2,1
2,1,2,2,1
1,1,2,2,1
2,1,2,2,1
2,2,3,4,2
2,1,3,4,2
3,1,3,4,2
2,1,3,4,2
2,1,3,4,2
3,1,3,4,2
I am showing just 18 rows here, but there are 150 rows in total. The first 4 columns give the 4 attribute values and the fifth column gives the class.
So 3,1,3,4,2 means: if att_1=3, att_2=1, att_3=3 and att_4=4, then class=2.
Now I have written 2 classifier algorithms that try to extract rules from this dataset.
The 1st algorithm (implemented in C and Python) gives this output:
*,*,1,*,0
*,*,2,2,1
*,*,*,3,1
*,*,*,4,2
*,*,3,*,2
With these 5 rows I tried to preserve all the characteristics of the main dataset of 150 rows. Here * stands for "don't care": *,*,2,2,1 simply means that if attributes 3 and 4 both equal 2, then regardless of the values of attributes 1 and 2 the class will be 1.
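To be concrete about the matching semantics, here is a minimal sketch of my own (assuming rules and rows are parsed into lists of strings):

def rule_matches(rule_attrs, row_attrs):
    # a '*' matches any value; otherwise the values must be equal
    return all(r == '*' or r == v for r, v in zip(rule_attrs, row_attrs))

print(rule_matches(['*', '*', '2', '2'], ['2', '2', '2', '2']))  # True  -> class 1
print(rule_matches(['*', '*', '2', '2'], ['1', '3', '1', '1']))  # False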
The 2nd algorithm (implemented in C and Python) gives this output:
*,*,1,*,0
*,*,2,2,1
2,*,*,3,1
*,2,2,*,1
*,*,3,4,2
*,1,*,4,2
*,*,3,2,2
3,*,*,*,2
1,*,*,3,2
Now I took the union of these 2 rule sets and got this outcome:
*,*,2,2,1
*,*,1,*,0
*,2,2,*,1
*,*,3,2,2
*,*,3,4,2
3,*,*,*,2
*,*,*,3,1
1,*,*,3,2
*,*,*,4,2
*,*,3,*,2
*,1,*,4,2
2,*,*,3,1
Now the question on my mind is this: there are 12 rules in that union file, but maybe 3 or 4 of the most effective rules would be enough to get a clear view of the initial Iris dataset with 150 rows. So my goal is to find the top 5 most effective rules from the union above. Basically, I derived those rules from the initial Iris dataset, and now I want to recover the initial dataset from the best possible generated rules. Is this problem NP-hard or NP-complete, and why?
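To make "most effective" concrete, one scoring idea of my own (not part of either algorithm) is to count how many dataset rows each rule classifies correctly; a sketch, assuming dataset and union_rules are lists of rows like the ones above, each a list of strings with the class last:

def coverage(rule, dataset):
    # count dataset rows whose attributes match the rule and whose class agrees
    *rule_attrs, rule_class = rule
    return sum(
        1
        for *row_attrs, row_class in dataset
        if row_class == rule_class
        and all(r == '*' or r == v for r, v in zip(rule_attrs, row_attrs))
    )

# rank the union rules by correct coverage and keep the top 5
top5 = sorted(union_rules, key=lambda rule: coverage(rule, dataset), reverse=True)[:5]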
I have an Excel table called rules_table where each row represents a rule, with a column giving the resulting category when that rule is true:
Legs  Eyes  Color  Description  Category
8     6     NaN    Small        Spider
4     2     Black  Friendly     Dog
2     NaN   NaN    Tall         Human
I.e., ignoring the NaNs, the table encodes rules as shown in this pseudocode:
If Legs == 8 & Eyes == 6 & Description.contains("Small") then Category = "Spider"
If Legs == 4 & Eyes == 2 & Color == "Black" & Description.contains("Friendly") then Category = "Dog"
If Legs == 2 & Description.contains("Tall") then Category = "Human"
I also have another table called data_table with the same format as the rules_table, except it is missing the Category column and usually does not contain NaNs:
Legs  Eyes  Color   Description
8     6     Brown   The creature is a small...
13    2     Orange  This is...
4     2     Black   This friendly creature...
2     2     White   The creature here is tall...
1     11    Yellow  The creature here is...
My goal is to add the Category from the rules_table to the data_table whenever a rule applies, such that executing the code:
complete_table = my_function(rules_table, data_table)
yields the complete_table:
Legs  Eyes  Color   Description                   Category
8     6     Brown   The creature is a small...    Spider
13    2     Orange  This is...                    NaN
4     2     Black   This friendly creature...     Dog
2     2     White   The creature here is tall...  Human
1     11    Yellow  The creature here is...       NaN
I am currently loading both tables as pandas dataframes, but am open to all options. Note that I have millions of rows, so efficiency is important to consider (but not critical).
I have tried two approaches:
Approach 1:
I have tried to join/merge the tables and make a work-around function for executing the "Description.contains" part of the rule. However, the NaNs are making it tricky for me, and I am not sure how to work around that.
Approach 2:
I have tried iterating over each row of the rules_table, creating a list of filters and a list of desired values which I then use together with np.select. However, I cannot figure out how to programmatically construct executable code, and therefore end up with strings I cannot use as intended.
Do you have a suggestion for how I might proceed here? I am getting a bit stuck.
I can share code if you want, but I am stuck on a more fundamental level than just syntax.
If you are familiar with SQL, this problem would be easily solved with its flexible JOIN statements. In MS SQL Server, you could solve it like this:
SELECT d.*, r.Category
FROM data_table d
LEFT JOIN rules_table r ON (d.Legs = r.Legs)
    AND (d.Eyes = r.Eyes OR r.Eyes IS NULL)
    AND (d.Color = r.Color OR r.Color IS NULL)
    AND (CHARINDEX(r.Description, d.Description) != 0)
Unfortunately, pandas's joins (as implemented by DataFrame.join and pd.merge) are nowhere near as flexible. One way to overcome this is to first perform a cross join and then filter the intermediate result:
def my_function(rules_table, data_table):
    # Make a unique number for each row.
    # To prevent changing the original data_table, we make a new copy.
    new_data_table = data_table.assign(RowNumber=range(len(data_table)))

    # Join every row in new_data_table to every row in rules_table.
    # We will filter for the matches later.
    tmp = new_data_table.merge(rules_table, how='cross', suffixes=('', '_rules'))

    # Filter for the matches
    match = (
        ( tmp['Legs'] == tmp['Legs_rules'] ) &
        ((tmp['Eyes'] == tmp['Eyes_rules'] ) | tmp['Eyes_rules'].isna()) &
        ((tmp['Color'] == tmp['Color_rules']) | tmp['Color_rules'].isna()) &
        tmp.apply(lambda row: row['Description_rules'].lower() in row['Description'].lower(), axis=1)
    )

    # Perform another left join to produce the final result
    result = new_data_table.merge(tmp.loc[match, ['RowNumber', 'Category']], how='left', on='RowNumber')
    return result.drop(columns='RowNumber')
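A quick usage sketch on toy frames (the data here is made up for illustration; note that how='cross' requires pandas 1.2+):

import pandas as pd
import numpy as np

rules_table = pd.DataFrame({
    'Legs': [8, 4, 2],
    'Eyes': [6, 2, np.nan],
    'Color': [np.nan, 'Black', np.nan],
    'Description': ['Small', 'Friendly', 'Tall'],
    'Category': ['Spider', 'Dog', 'Human'],
})
data_table = pd.DataFrame({
    'Legs': [8, 13],
    'Eyes': [6, 2],
    'Color': ['Brown', 'Orange'],
    'Description': ['The creature is a small...', 'This is...'],
})

complete_table = my_function(rules_table, data_table)
print(complete_table)  # the first row gets Category 'Spider', the second NaN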
I have a dataset in a pandas dataframe with 9 features and 249 rows. I would like to get the covariance matrix of the 9 features (resulting in a 9 x 9 matrix); however, when I use the df.cov() function, I only get a 3 x 3 matrix. What am I doing wrong here?
Thanks!
Below is my code snippet:
# perform data preprocessing
# only keep players with MPG of at least 20 and only select the required columns
MPG_df = df.loc[df['MPG'] >= 20]
processed_df = MPG_df[["FT%", "2P%", "3P%", "PPG", "RPG", "APG", "SPG", "BPG", "TOPG"]]
processed_df
And when I attempt to get the covariance matrix using the code below, I only get a 3 x 3 matrix:
# desired result
cov_processed_df = pandas.DataFrame(processed_df, columns=['FT%', '2P%', '3P%', 'PPG', 'RPG', 'APG', 'SPG', 'BPG', 'TOPG']).cov()
cov_processed_df
Thanks!
The excluded columns are probably non-numeric (even though they look numeric!). Try:
cov_processed_df = processed_df.astype(float).cov()
To see the data types of the original df, you may run:
print(processed_df.dtypes)
If you see "object" appearing in the result, then those columns are non-numeric. (Even a single non-numeric value in a column makes the whole column non-numeric.)
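If astype(float) fails because a column contains stray non-numeric entries, a more forgiving sketch (assuming you would rather coerce bad values to NaN than raise an error) is:

import pandas

# convert every column to numeric, turning unparseable entries into NaN
numeric_df = processed_df.apply(pandas.to_numeric, errors='coerce')
cov_processed_df = numeric_df.cov()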
Here is the dataset:
df = pd.read_csv('https://data.lacity.org/api/views/d5tf-ez2w/rows.csv?accessType=DOWNLOAD')
The problem:
I have a pandas dataframe of traffic accidents in Los Angeles.
Each accident has a mo_codes column, which is a string of numerical codes (which I converted into a list of codes).
I also have a dictionary with a description for each respective mo_code, loaded in the notebook.
Now, using the code below I can combine the numeric code with the description:
mo_code_list_final = []
for i in range(20):
    for j in df.mo_codes.iloc[i]:
        print(i, mo_code_dict[j])
So, I haven't added this as a column to the dataframe yet. I wanted to ask if there is a better way to solve my problem, which is how best to add the textual descriptions as a column in pandas.
Also, is there an easier way to process this with a pandas function like .assign instead of the for loop? Maybe a list comprehension to process the mo_codes into a new dataframe with the descriptions?
Thanks in advance.
PS: if there is a technical term for this type of problem, please let me know.
import pandas
codes = {0:'Test1',1:'test 2',2:'test 3',3:'test 4'}
df1 = pandas.DataFrame([["red",[0,1,2],5],["blue",[3,1],6]],columns=[0,'codes',2])
# first explode the list into its own rows
df2 = (df1['codes']
       .apply(pandas.Series)
       .stack()
       .astype(int)
       .reset_index(level=1, drop=True)
       .to_frame('codes')
       .join(df1[[0, 2]]))
# now use map to apply the text descriptions
df2['desc'] = df2['codes'].map(codes)
print(df2)
"""
codes 0 2 desc
0 0 red 5 Test1
0 1 red 5 test 2
0 2 red 5 test 3
1 3 blue 6 test 4
1 1 blue 6 test 2
"""
I finally figured out how to do this. I found the answer in JavaScript, but the same concept applies.
You simply create a dictionary mapping each mo_code to its string value.
export const mocodesDict = {
"0100": "Suspect Impersonate",
"0101": "Aid victim",
"0102": "Blind",
"0103": "Crippled",
...
}
After that, it's as simple as doing
mocodesDict[item]
where item is the code you want to convert.
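Since the question is about pandas, the same idea in Python is just a dict lookup as well; a sketch, assuming mo_codes has already been parsed into lists of code strings and mo_code_dict is the dictionary from the question:

# map every code in each row's list to its description
# (.get returns None for unknown codes instead of raising a KeyError)
df['mo_descriptions'] = df['mo_codes'].apply(
    lambda code_list: [mo_code_dict.get(code) for code in code_list]
)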
I have a point cloud of 6 million x, y and z points that I need to process. I need to look for specific points within these 6 million xyz points, and I have been using the pandas df.isin() function to do it. I first save the 6 million points into a pandas dataframe (named point_cloud), and the specific points I need to look for into a dataframe as well (named specific_point). I only have two specific points to look out for, so the output of df.isin() should show 2 True values, but it is showing 3 instead.
In order to prove that 3 True values is wrong, I actually iterated through the 6 million points looking for the two specific points using iterrows(). The result was indeed 2 True values. So why is df.isin() showing 3 instead of the correct result of 2?
I have tried this, which results in true_count being 3:
label = (point_cloud['x'].isin(specific_point['x'])
         & point_cloud['y'].isin(specific_point['y'])
         & point_cloud['z'].isin(specific_point['z'])).astype(int).to_frame()
true_count = 0
for index, t_f in label.iterrows():
    if int(t_f.values) == int(1):
        true_count += 1
print(true_count)
I have tried this as well, which also results in true_count being 3:
true_count = 0
for t_f in (point_cloud['x'].isin(specific_point['x'])
            & point_cloud['y'].isin(specific_point['y'])
            & point_cloud['z'].isin(specific_point['z'])).values:
    if t_f == True:
        true_count += 1
Lastly, I tried the most inefficient way of iterating through the 6 million points using iterrows(), but this gives the correct value for true_count, which is 2:
true_count = 0
for index_sp, sp in specific_point.iterrows():
    for index_pc, pc in point_cloud.iterrows():
        if sp['x'] == pc['x'] and sp['y'] == pc['y'] and sp['z'] == pc['z']:
            true_count += 1
print(true_count)
Does anyone know why df.isin() is behaving this way? Or have I overlooked something?
Combining isin on multiple columns with & does not check the dataframe row by row; it is more like checking each column against the whole list of values, i.e. the Cartesian product of the lists.
So what you can do is:
checked = point_cloud.merge(specific_point, on=['x', 'y', 'z'], how='inner')
For example, if you have two lists l1 = [1, 2] and l2 = [3, 4], isin will return True for any row matching [1, 3], [1, 4], [2, 3], or [2, 4].
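A minimal reproduction of the false positive (the coordinates are made up for illustration):

import pandas as pd

point_cloud = pd.DataFrame({'x': [1, 2, 1], 'y': [3, 4, 4], 'z': [5, 6, 5]})
specific_point = pd.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6]})

# (1, 4, 5) is not one of the specific points, but each of its coordinates
# appears somewhere in specific_point, so the column-wise isin flags it too
mask = (point_cloud['x'].isin(specific_point['x'])
        & point_cloud['y'].isin(specific_point['y'])
        & point_cloud['z'].isin(specific_point['z']))
print(mask.sum())  # 3

# merging on all three columns compares whole rows and gives the right count
print(len(point_cloud.merge(specific_point, on=['x', 'y', 'z'], how='inner')))  # 2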