I am trying to "prettify" a confusion matrix, which is just a 2D NumPy array.
What I want is to add "legends": one above the columns saying "Pred" and one next to the rows saying "Actual",
something like this:
             pred
             0    1
           ---------
          0| 123    2
Actual     |
          1|  17  200
(It would be perfect if "Actual" were rotated, but that's just a minor thing.)
I have the following lines for creating the DataFrame without the meta-headers:
from sklearn.metrics import confusion_matrix
import pandas as pd

conf_mat = confusion_matrix(y_true=y_true, y_pred=y_pred)
conf_mat = pd.DataFrame(conf_mat)
...  # missing lines for the last part
I have thought about using a MultiIndex, but I cannot really make it work.
Use:
conf_mat = pd.DataFrame(conf_mat)
Solution with MultiIndex:
conf_mat.columns = pd.MultiIndex.from_product([['pred'], conf_mat.columns])
conf_mat.index = pd.MultiIndex.from_product([['Actual'], conf_mat.index])
print(conf_mat)

         pred     
            0    1
Actual 0  123    2
       1   17  200
Or a solution with index and column names (but note that some pandas operations may remove this metadata):
conf_mat = conf_mat.rename_axis(index='Actual', columns='pred')
print(conf_mat)

pred      0    1
Actual          
0       123    2
1        17  200
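For reference, here is a self-contained sketch of the MultiIndex approach, assuming confusion_matrix comes from sklearn and using made-up labels just so the example runs end to end:

from sklearn.metrics import confusion_matrix
import pandas as pd

# hypothetical labels, only here to make the example runnable
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]

conf_mat = pd.DataFrame(confusion_matrix(y_true=y_true, y_pred=y_pred))
conf_mat.columns = pd.MultiIndex.from_product([['pred'], conf_mat.columns])
conf_mat.index = pd.MultiIndex.from_product([['Actual'], conf_mat.index])
print(conf_mat)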
I'm currently trying to get the mean() of a group in my dataframe (tdf), but I have a mix of NaN values and filled values in my dataset. An example is shown below:
Test #   a    b
1        1    1
1        2    NaN
1        3    2
2        4    3
My code needs to take this dataset, and make a new dataset containing the mean, std, and 95% interval of the set.
i = 0
num_timeframes = 2  # writing this in for example's sake
new_df = pd.DataFrame(columns=tdf.columns)
while i < num_timeframes:
    results = tdf.loc[tdf["Test #"] == i].groupby(["Test #"]).mean()
    new_df = pd.concat([new_df, results])
    results = tdf.loc[tdf["Test #"] == i].groupby(["Test #"]).std()
    new_df = pd.concat([new_df, results])
    results = 2 * tdf.loc[tdf["Test #"] == i].groupby(["Test #"]).std()
    new_df = pd.concat([new_df, results])
    new_df['Test #'] = new_df['Test #'].fillna(i)  # fill out test number values
    i += 1
For simplicity, I will show the desired output for the first pass of the while loop, calculating only the mean (the problem impacts every row, however). The expected output for the mean of Test # 1 is shown below:
Test #   a    b
1        2    1.5
However, columns which contain any NaN rows end up with a mean of NaN, resulting in the output shown below:
Test #   a    b
1        2    NaN
I have tried passing skipna=True, but got an error stating that mean() doesn't have a skipna argument. I'm really at a loss here, because it was my understanding that df.mean() ignores NaN rows by default. I have limited experience with Python, so any help is greatly appreciated.
Use the following:
DataFrame.mean(axis=None, skipna=True)
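For what it's worth, here is a minimal sketch (with made-up numbers matching the question's table) showing that groupby().mean() skips NaN by default:

import numpy as np
import pandas as pd

tdf = pd.DataFrame({"Test #": [1, 1, 1, 2],
                    "a": [1, 2, 3, 4],
                    "b": [1, np.nan, 2, 3]})

# NaN values are ignored by default (skipna=True)
print(tdf.groupby("Test #").mean())
#           a    b
# Test #
# 1       2.0  1.5
# 2       4.0  3.0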
I eventually solved this by removing the groupby function entirely (I was looking through it and realized I had no reason to call groupby here other than to benefit from groupby keeping my columns in the correct orientation). Figured I'd post my fix in case anyone ever comes across this.
for i in range(num_timeframes):
    results = tdf.loc[tdf["Test #"] == i].mean()
    results = pd.concat([results, tdf.loc[tdf["Test #"] == i].std()], axis=1)
    results = pd.concat([results, 2 * tdf.loc[tdf["Test #"] == i].std()], axis=1)
    results = results.transpose()
    results["Test #"] = i
    new_df = pd.concat([new_df, results])
    new_df.loc[new_df.shape[0]] = [None] * len(new_df.columns)
All I had to do was transpose my results, because df.mean() returns a Series indexed by the column names (so the result looks "flipped"), which is likely why I had tried using groupby in the first place.
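To illustrate the orientation issue, a short sketch reusing the hypothetical tdf from the example above:

# .mean() on a DataFrame returns a Series indexed by the column names
row_stats = tdf.loc[tdf["Test #"] == 1].mean()
# Test #    1.0
# a         2.0
# b         1.5
# dtype: float64

# concatenating such Series along axis=1 therefore puts the statistics in
# columns and the original column names in the index, hence the .transpose()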
Noticed something very strange in pandas. My dataframe (with 3 rows and 3 columns) looks like the table built in the first answer below.
When I try to extract ID and Name (separated by underscore) into their own columns using the command below, it gives me an error:
df[['ID','Name']] = df.apply(lambda x: get_first_last(x['ID_Name']), axis=1, result_type='broadcast')
Error is:
ValueError: cannot broadcast result
Here's the interesting part though... When I delete the "From_To" column from the original dataframe, performing the same df.apply() to split ID_Name works perfectly fine, and I get the new ID and Name columns as expected.
I have checked a lot of SO answers but none seem to help. What did I miss here?
P.S. get_first_last is a very simple function like this:
def get_first_last(s):
    str_lis = s.split("_")
    return [str_lis[0], str_lis[1]]
From the docs of pandas.DataFrame.apply:
'broadcast' : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.
So the problem is that the original shape of your dataframe is (3, 3), while the result of your apply function has only 2 columns, so you have a mismatch. That also explains why, when you delete the "From_To" column, the new shape is (3, 2) and you have a match.
You can use 'expand' instead of 'broadcast' and you will have your expected result.
table = [
    ['1_john', 23, 'LoNDon_paris'],
    ['2_bob', 34, 'Madrid_milan'],
    ['3_abdellah', 26, 'Paris_Stockhom']
]
df = pd.DataFrame(table, columns=['ID_Name', 'Score', 'From_to'])

df[['ID', 'Name']] = df.apply(lambda x: get_first_last(x['ID_Name']), axis=1, result_type='expand')
hope this helps !!
It's definitely not a good use case for apply; you should rather do:
df[["ID", "Name"]]=df["ID_Name"].str.split("_", expand=True, n=1)
Which for your data will output (I took only the first 2 columns from your data frame):
ID_Name Score ID Name
0 1_john 23 1 john
1 2_bob 34 2 bob
2 3_janet 45 3 janet
Now, n=1 is just in case you have multiple _ (e.g. as part of the name), to make sure you return at most 2 columns (otherwise the above code would fail).
For instance, if we slightly modify your data, we get the following output:
ID_Name Score ID Name
0 1_john 23 1 john
1 2_bob_jr 34 2 bob_jr
2 3_janet 45 3 janet
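A small sketch of that modification (the names are taken from the output above; the extra "jr" is only there to show the effect of n=1):

import pandas as pd

df = pd.DataFrame({"ID_Name": ["1_john", "2_bob_jr", "3_janet"],
                   "Score": [23, 34, 45]})
# n=1 splits only on the first underscore, so "bob_jr" stays intact in Name
df[["ID", "Name"]] = df["ID_Name"].str.split("_", expand=True, n=1)
print(df)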
I would like to calculate a sum of variables for a given day. Each day contains a different calculation, but all the days use the variables consistently.
There is a df which specifies my variables and a df which specifies how calculations will change depending on the day.
How can I create a new column containing answers from these different equations?
import pandas as pd
import numpy as np
conversion = [["a",5],["b",1],["c",10]]
conversion_table = pd.DataFrame(conversion,columns=['Variable','Cost'])
data1 = [[1,"3a+b"],[2,"c"],[3,"2c"]]
to_solve = pd.DataFrame(data1,columns=['Day','Q1'])
desired = [[1,16],[2,10],[3,20]]
desired_table=pd.DataFrame(desired,columns=['Day','Q1 solved'])
I have separated my variables and equations by row. Can I loop through these equations to find non-numerics and re-assign them?
# separate out equations and values
for var in conversion_table["Variable"]:
    cost = (conversion_table.loc[conversion_table['Variable'] == var, 'Cost']).mean()

for row in to_solve["Q1"]:
    equation = row
A simple suggestion: perhaps you need to rewrite part of your code. Not sure if you want something like this:
a = 5
b = 1
c = 10

# Rewrite each equation so that it is readable by Python,
# e.g. replace 3a+b with 3*a+b
data1 = [[1, "3*a+b"],
         [2, "c"],
         [3, "2*c"]]

desired_table = pd.DataFrame(data1, columns=['Day', 'Q1'])
desired_table['Q1 solved'] = desired_table['Q1'].apply(lambda x: eval(x))
desired_table
Output:
Day Q1 Q1 solved
0 1 3*a+b 16
1 2 c 10
2 3 2*c 20
If it's possible to have the equations changed to equations with explicit *, then you could do this.
Get the mapping from the conversion table:
mapping = dict(zip(conversion_table['Variable'], conversion_table['Cost']))
Then eval each expression, replacing every variable with its numeric value from the mapping:
desired_table['Q1 solved'] = to_solve['Q1'].map(
    lambda x: eval(''.join([str(mapping[i]) if i.isalpha() else str(i) for i in x]))
)
0 16
1 10
2 20
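Putting the two steps together, a self-contained sketch (with the equations rewritten to use explicit *, as this answer assumes):

import pandas as pd

conversion_table = pd.DataFrame([["a", 5], ["b", 1], ["c", 10]],
                                columns=["Variable", "Cost"])
to_solve = pd.DataFrame([[1, "3*a+b"], [2, "c"], [3, "2*c"]],
                        columns=["Day", "Q1"])

mapping = dict(zip(conversion_table["Variable"], conversion_table["Cost"]))

# replace each letter with its cost, leave everything else as-is, then eval
to_solve["Q1 solved"] = to_solve["Q1"].map(
    lambda x: eval("".join(str(mapping[ch]) if ch.isalpha() else ch for ch in x))
)
print(to_solve)
#    Day     Q1  Q1 solved
# 0    1  3*a+b         16
# 1    2      c         10
# 2    3    2*c         20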
Currently I am working on a database and trying to sort my rows with pandas. I have a column called 'sessionkey' which refers to a session, so each row can be assigned to a session. I tried to separate the data into these sessions.
Furthermore, there can be duplicated rows. I tried to drop those with the drop_duplicates function from pandas.
df = pd.read_csv((path_of_data+'part-00000-9d3e32a7-87f8-4218-bed1-e30855ce6f0c-c000.csv'), keep_default_na=False, engine='python')
tmp = df['sessionkey'].values #I want to split data into different sessions
tmp = np.unique(tmp)
df.set_index('sessionkey', inplace=True)
watching = df.loc[tmp[10]].drop_duplicates(keep='first') #here I pick one example
print(watching.sort_values(by =['eventTimestamp', 'eventClickSequenz']))
print(watching.info())
I would have thought that this works fine, but when I tried to check my results by printing out my split dataframe, the output looked very odd to me. For example, printing the shape of the DataFrame says 38 rows x 4 columns. But when I print the same DataFrame there are clearly more than 38 rows, and there are still duplicates in it.
I already tried to split the data by using unique indices:
comparison = pd.DataFrame()
for index, item in enumerate(df['sessionkey'].values):
    if item == tmp:
        comparison = comparison.append(df.iloc[index])
comparison.drop_duplicates(keep='first', inplace=True)
print(comparison.sort_values(by=['eventTimestamp']))
But the Problem is still the same.
The output also seems to follow a pattern. Let's say we have 38 entries. Then pandas returns entries 1-37 and then appends entries 2-38. So the last one is left out, and then the whole list is shifted and printed again.
When I look at the underlying NumPy values there are just 38 different rows. So is this a problem with the print function in pandas? Is there an error in my code? Does pandas have a problem with non-unique indexes?
EDIT:
Okay I figured out what the problem is. I wanted to look at a long dataframe so I used:
pd.set_option('display.max_rows', -1)
Now we can use some sample data:
data = np.array([[119, 0], [119, 1], [119, 2]])
columns = ['sessionkey', 'event']
df = pd.DataFrame(data, columns = columns)
print(df)
Printed it now looks like this:
sessionkey event
0 119 0
1 119 1
1 119 1
2 119 2
Although I expected it to look like this:
sessionkey event
0 119 0
1 119 1
2 119 2
I thought my DataFrame had the wrong shape, but this is not the case.
So the event in the middle gets printed twice. Is this a bug or the intended output?
So drop_duplicates() doesn't look at the index when getting rid of rows; instead it looks at the whole row. But it does have a useful subset kwarg which allows you to specify which columns to use.
You can try the following
df = pd.read_csv((path_of_data+'part-00000-9d3e32a7-87f8-4218-bed1-e30855ce6f0c-c000.csv'), keep_default_na=False, engine='python')
print(df.shape)
print(df["sessionkey"].nunique())  # number of unique sessions

df_unique = df.drop_duplicates(subset=["sessionkey"], keep='first')

# these two numbers should be the same
print(df_unique.shape)
print(df_unique["sessionkey"].nunique())
It sounds like you want to drop_duplicates based on the index; by default, drop_duplicates drops based on the column values. To do that, try:
df.loc[~df.index.duplicated()]
This should only select index values which are not duplicated
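A minimal sketch of that, using made-up session data in the spirit of the question (the duplicated index value is just for illustration):

import pandas as pd

df = pd.DataFrame({"event": [0, 1, 2]}, index=[119, 119, 120])
df.index.name = "sessionkey"

# keep only the first row for each index value
print(df.loc[~df.index.duplicated()])
#             event
# sessionkey
# 119             0
# 120             2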
I used your sample code.
data = np.array([[119, 0], [119, 1], [119, 2]])
columns = ['sessionkey', 'event']
df = pd.DataFrame(data, columns = columns)
print(df)
And I got your expected outcome.
sessionkey event
0 119 0
1 119 1
2 119 2
After I set the max_rows option, as you did:
pd.set_option('display.max_rows', -1)
I got the incorrect outcome.
sessionkey event
0 119 0
1 119 1
1 119 1
2 119 2
The problem might be the "-1" setting. The doc states that "None" will set max rows to unlimited. I am unsure what "-1" will do in a parameter that takes positive integers or None as acceptable values.
Try
pd.set_option('display.max_rows', None)
I'm sure there must be a quick fix for this, but I can't find an answer with a good explanation. I'm looking to iterate over a dataframe and build a crosstab for each pair of columns with pandas. I have subsetted 2 columns from the original data and removed rows with unsuitable data. With the remaining data I am looking to do a crosstab, to ultimately build a contingency table for a chi-squared (ChiX) test. Here is my code:
my_data = pd.read_csv(DATA_MATRIX, index_col=0)  # GET DATA
AM = pd.DataFrame(columns=my_data.columns, index=my_data.columns)  # INITIATE DF TO HOLD ChiX RESULT

for c1 in my_data.columns:
    for c2 in my_data.columns:
        sample_df = pd.DataFrame(my_data, columns=[c1, c2])  # make df to do ChiX on
        sample_df = sample_df[(sample_df[c1] != 0.5) | (sample_df[c2] != 0.5)].dropna()  # remove unsuitable rows
        contingency = pd.crosstab(sample_df[c1], sample_df[c2])  ## This doesn't work?
        # DO ChiX AND STORE P-VALUE IN 'AM': CODE STILL TO WRITE
The dataframe contains the values 0.0, 0.5, and 1.0. The 0.5 values are missing data, so I am removing these rows before making the contingency table; the remaining values that I wish to build the contingency tables from are all either 0.0 or 1.0. I have checked that the code works up to this point. The error printed to the console is:
ValueError: If using all scalar values, you must pass an index
Can anyone explain why this doesn't work, or help to solve it in any way? Or, even better, provide an alternative way to do a ChiX test on the columns? That would be very helpful, thanks in advance!
EDIT: example of the structure of the first few rows of sample_df
col1 col2
sample1 1 1
sample2 1 1
sample3 0 0
sample4 0 0
sample5 0 0
sample6 0 0
sample7 0 0
sample8 0 0
sample9 0 0
sample10 0 0
sample11 0 0
sample12 1 1
A crosstab between two identical entities is meaningless. pandas is going to tell you:
ValueError: The name col1 occurs multiple times, use a level number
Meaning it assumes you're passing two different columns from a multi-indexed dataframe with the same name.
In your code, you're iterating over columns in a nested loop, so the situation arises where c1 == c2, so pd.crosstab errors out.
The fix would involve adding an if check and skipping that iteration if the columns are equal. So, you'd do:
for c1 in my_data.columns:
    for c2 in my_data.columns:
        if c1 == c2:
            continue
        ...  # rest of your code
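For the remaining step (the ChiX test itself, which the question left as a TODO), here is a hedged sketch assuming scipy is available; it reuses the 0.5-filtering logic from the question and scipy.stats.chi2_contingency for the test:

import pandas as pd
from scipy.stats import chi2_contingency  # assumes scipy is installed

for c1 in my_data.columns:
    for c2 in my_data.columns:
        if c1 == c2:
            continue  # crosstab of a column with itself raises the error above
        sample_df = pd.DataFrame(my_data, columns=[c1, c2])
        sample_df = sample_df[(sample_df[c1] != 0.5) | (sample_df[c2] != 0.5)].dropna()
        contingency = pd.crosstab(sample_df[c1], sample_df[c2])
        chi2, p, dof, expected = chi2_contingency(contingency)
        AM.loc[c1, c2] = p  # store the p-value in the result matrix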