I'm sure there must be a quick fix for this, but I can't find an answer with a good explanation. I'm looking to iterate over a dataframe and build a crosstab for each pair of columns with pandas. I have subsetted 2 columns from the original data and removed rows with unsuitable data. With the remaining data I want to build a crosstab, and ultimately a contingency table, to do a chi-squared (ChiX) test. Here is my code:
my_data = pd.read_csv(DATA_MATRIX, index_col=0) #GET DATA
AM = pd.DataFrame(columns=my_data.columns, index = my_data.columns) #INITIATE DF TO HOLD ChiX-result
for c1 in my_data.columns:
    for c2 in my_data.columns:
        sample_df = pd.DataFrame(my_data, columns=[c1,c2]) #make df to do ChiX on
        sample_df = sample_df[(sample_df[c1] != 0.5) | (sample_df[c2] != 0.5)].dropna() # remove unsuitable rows
        contingency = pd.crosstab(sample_df[c1], sample_df[c2]) ##This doesn't work?
        # DO ChiX AND STORE P-VALUE IN 'AM': CODE STILL TO WRITE
The dataframe contains the values 0.0, 0.5, 1.0. The '0.5' marks missing data, so I am removing these rows before making the contingency table; the remaining values that I wish to build the contingency tables from are all either 0.0 or 1.0. I have checked that the code works up to this point. The error printed to the console is:
ValueError: If using all scalar values, you must pass an index
Can anyone explain why this doesn't work, or help solve it in any way? Even better, an alternative way to do a ChiX test on the columns would be very helpful. Thanks in advance!
EDIT: example of the structure of the first few rows of sample_df
col1 col2
sample1 1 1
sample2 1 1
sample3 0 0
sample4 0 0
sample5 0 0
sample6 0 0
sample7 0 0
sample8 0 0
sample9 0 0
sample10 0 0
sample11 0 0
sample12 1 1
A crosstab between two identical entities is meaningless. pandas is going to tell you:
ValueError: The name col1 occurs multiple times, use a level number
Meaning it assumes you're passing two different columns from a multi-indexed dataframe with the same name.
In your code, you're iterating over columns in a nested loop, so the situation arises where c1 == c2, so pd.crosstab errors out.
The fix would involve adding an if check and skipping that iteration if the columns are equal. So, you'd do:
for c1 in my_data.columns:
    for c2 in my_data.columns:
        if c1 == c2:
            continue
        ... # rest of your code
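Since the question also asks for a way to run the test itself, here is a minimal sketch of one possible approach. It assumes SciPy is available, uses a small made-up `my_data` frame, and assumes the intent is to drop a row when either column holds the 0.5 placeholder (hence `&` rather than `|`). `itertools.combinations` visits each unordered pair once, so the `c1 == c2` case never comes up.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical stand-in for my_data: values are 0.0, 0.5 (missing) or 1.0.
my_data = pd.DataFrame(
    {
        "col1": [1, 1, 0, 0, 0.5, 1, 0, 1],
        "col2": [1, 0, 0, 1, 1, 1, 0, 0.5],
        "col3": [0, 1, 0, 1, 0, 1, 0, 1],
    }
)

# Square frame of p-values; the diagonal stays NaN because a column is
# never tested against itself.
p_values = pd.DataFrame(np.nan, index=my_data.columns, columns=my_data.columns)

for c1, c2 in combinations(my_data.columns, 2):
    pair = my_data[[c1, c2]].dropna()
    # Assumption: a row is unsuitable if either value is the 0.5 placeholder.
    pair = pair[(pair[c1] != 0.5) & (pair[c2] != 0.5)]
    contingency = pd.crosstab(pair[c1], pair[c2])
    if contingency.shape != (2, 2):
        continue  # one of the columns has a single level left; skip the pair
    chi2, p, dof, expected = chi2_contingency(contingency)
    p_values.loc[c1, c2] = p_values.loc[c2, c1] = p

print(p_values)
```

With counts this small the chi-squared approximation is poor, so the toy numbers only illustrate the mechanics.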
I'm currently trying to get the mean() of a group in my dataframe (tdf), but I have a mix of NaN values and filled values in my dataset. An example is shown below:
Test #  a  b
1       1  1
1       2  NaN
1       3  2
2       4  3
My code needs to take this dataset, and make a new dataset containing the mean, std, and 95% interval of the set.
i = 0
num_timeframes = 2 #writing this in for example sake
new_df = pd.DataFrame(columns = tdf.columns)
while i < num_timeframes:
    results = tdf.loc[tdf["Test #"] == i].groupby(["Test #"]).mean()
    new_df = pd.concat([new_df,results])
    results = tdf.loc[tdf["Test #"] == i].groupby(["Test #"]).std()
    new_df = pd.concat([new_df,results])
    results = 2*tdf.loc[tdf["Test #"] == i].groupby(["Test #"]).std()
    new_df = pd.concat([new_df,results])
    new_df['Test #'] = new_df['Test #'].fillna(i) #fill out test number values
    i+=1
For simplicity, I will show the desired output on the first pass of the while loop, calculating only the mean. The problem affects every row, however. The expected output for the mean of Test # 1 is shown below:
Test #  a  b
1       2  1.5
However, columns which contain any NaN rows are calculating the entire mean as NaN resulting in the output shown below
Test #  a  b
1       2  NaN
I have tried passing skipna=True, but got an error stating that mean doesn't have a skipna argument. I'm really at a loss here because it was my understanding that df.mean() ignores NaN rows by default. I have limited experience with Python, so any help is greatly appreciated.
Use the following:
DataFrame.mean( axis=None, skipna=True)
I eventually solved this by removing the groupby function entirely (I was looking through it and realized I had no reason to call groupby here other than benefit from groupby keeping my columns in the correct orientation). Figured I'd post my fix in case anyone ever comes across this.
for i in range(num_timeframes):
    results = tdf.loc[tdf["Test #"] == i].mean()
    results = pd.concat([results, tdf.loc[tdf["Test #"] == i].std()], axis = 1)
    results = pd.concat([results, 2*tdf.loc[tdf["Test #"] == i].std()], axis = 1)
    results = results.transpose()
    results["Test #"] = i
    new_df = pd.concat([new_df,results])
    new_df.loc[new_df.shape[0]] = [None]*len(new_df.columns)
All I had to do was transpose my results, because df.mean() returns a Series indexed by column name, so the concatenated results come out flipped. That is likely why I had tried using groupby in the first place.
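For comparison, a minimal sketch of the groupby route, using the example table from the question rebuilt by hand. On numeric columns, `groupby(...).agg(["mean", "std"])` skips NaN by default. If a column was read in as `object` dtype (an assumption about what went wrong above, not something the question confirms), converting it with `pd.to_numeric` first may be needed.

```python
import numpy as np
import pandas as pd

# The example data from the question, typed in by hand.
tdf = pd.DataFrame(
    {
        "Test #": [1, 1, 1, 2],
        "a": [1, 2, 3, 4],
        "b": [1, np.nan, 2, 3],
    }
)

# One pass over all groups; NaN entries are ignored within each group.
summary = tdf.groupby("Test #").agg(["mean", "std"])

# Same "2 * std" interval width the question uses.
two_std = 2 * tdf.groupby("Test #").std()

print(summary)
print(two_std)
```

Groups with a single row (Test # 2 here) get NaN for the standard deviation, since a sample standard deviation needs at least two values.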
'azdias' is a dataframe holding my main dataset, and its metadata (feature summary) lies in the dataframe 'feat_info'. For every column, 'feat_info' lists the values that stand for NaN.
Ex: column1 has [-1,0] as its NaN values. So my job is to find these -1 and 0 values in column1 and replace them with NaN.
azdias dataframe:
feat_info dataframe:
I have tried the following in a Jupyter notebook.
def NAFunc(x, miss_unknown_list):
    x_output = x
    for i in miss_unknown_list:
        try:
            miss_unknown_value = float(i)
        except ValueError:
            miss_unknown_value = i
        if x == miss_unknown_value:
            x_output = np.nan
            break
    return x_output
for cols in azdias.columns.tolist():
    NAList = feat_info[feat_info.attribute == cols]['missing_or_unknown'].values[0]
    azdias[cols] = azdias[cols].apply(lambda x: NAFunc(x, NAList))
Question 1: I am trying to impute NaN values, but my code is very slow. I wish to speed up execution.
I have attached sample of both dataframes:
azdias_sample
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST
0 -1 2 1 2.0 3
1 -1 1 2 5.0 1
2 -1 3 2 3.0 1
3 2 4 2 2.0 4
4 -1 3 1 5.0 4
feat_info_sample
attribute information_level type missing_or_unknown
AGER_TYP person categorical [-1,0]
ALTERSKATEGORIE_GROB person ordinal [-1,0,9]
ANREDE_KZ person categorical [-1,0]
CJT_GESAMTTYP person categorical [0]
FINANZ_MINIMALIST person ordinal [-1]
If the azdias dataset is obtained from read_csv or similar IO functions, the na_values keyword argument can be used to specify column-specific missing value representations to make sure the returned data frame already has in-place NaN values from the very beginning. The sample code is shown in the following.
from ast import literal_eval
feat_info.set_index("attribute", inplace=True)
# A more concise but less efficient alternative is
# na_dict = feat_info["missing_or_unknown"].apply(literal_eval).to_dict()
na_dict = {attr: literal_eval(val) for attr, val in feat_info["missing_or_unknown"].items()}
df_azdias = pd.read_csv("azdias.csv", na_values=na_dict)
As for the data type, there is no built-in NaN representation for integer data types. Hence a float data type is needed. If the missing values are imputed using fillna, the downcast argument can be specified to make the returned series or data frame have an appropriate data type.
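As a side note on the dtype point, recent pandas versions also ship a nullable integer extension dtype, `"Int64"` (capital I), which can hold missing values as `<NA>`. A small sketch with a made-up `codes` series:

```python
import pandas as pd

# Hypothetical column where -1 encodes "missing".
codes = pd.Series([-1, 2, -1, 3])

# Masking introduces NaN, which forces the default int64 column to float64.
as_float = codes.mask(codes == -1)

# The nullable extension dtype keeps whole numbers and stores gaps as <NA>.
as_nullable = as_float.astype("Int64")

print(as_float.dtype)     # float64
print(as_nullable.dtype)  # Int64
```

Whether the extension dtype is worth it depends on what downstream code expects; some libraries still want plain float arrays.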
Try using the DataFrame's replace method. How about this?
for c in azdias.columns.tolist():
    replace_list = feat_info[feat_info['attribute'] == c]['missing_or_unknown'].values
    azdias[c] = azdias[c].replace(to_replace=list(replace_list), value=np.nan)
A couple things I'm not sure about without being able to execute your code:
In your example, you used .values[0]. Don't you want all the values?
I'm not sure if it's necessary to do to_replace=list(replace_list), it may work to just use to_replace=replace_list.
In general, I recommend thinking to yourself "surely Pandas has a function to do this for me." Often, they do. For performance with Pandas generally, avoid looping over and setting things. Vectorized methods tend to be much faster.
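A minimal sketch of the vectorized idea for data that is already in memory, using a few columns from the samples in the question. It assumes `missing_or_unknown` is stored as strings such as `"[-1,0]"` and that every code is a numeric literal that `literal_eval` can parse; both are assumptions about the real file.

```python
from ast import literal_eval

import pandas as pd

# Hand-typed subset of the sample frames from the question.
azdias = pd.DataFrame(
    {
        "AGER_TYP": [-1, -1, -1, 2, -1],
        "ALTERSKATEGORIE_GROB": [2, 1, 3, 4, 3],
        "CJT_GESAMTTYP": [2.0, 5.0, 3.0, 2.0, 5.0],
    }
)
feat_info = pd.DataFrame(
    {
        "attribute": ["AGER_TYP", "ALTERSKATEGORIE_GROB", "CJT_GESAMTTYP"],
        "missing_or_unknown": ["[-1,0]", "[-1,0,9]", "[0]"],
    }
)

# Parse each code list once, instead of once per cell.
na_map = {
    attr: literal_eval(codes)
    for attr, codes in zip(feat_info["attribute"], feat_info["missing_or_unknown"])
}

# One vectorized isin/mask per column replaces the per-element apply.
for col, codes in na_map.items():
    azdias[col] = azdias[col].mask(azdias[col].isin(codes))

print(azdias)
```

The loop still runs once per column, but each step works on a whole column at a time, which is usually where the speed-up comes from.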
I have two pandas dataframes that look about the same but with different information stored in them. My question will be about how to compare the two dataframes to ensure column and row match before performing some analysis and to obtain a third dataframe of the correlation between the two.
df1 (50x14492):
TYPE GENRE1 GENRE2
Name1 .0945 .0845
Name2 .9074 NaN
Name3 1 0
and df2 (50x14492):
TYPE GENRE1 GENRE2
Name1 .9045 .895
Name2 .074 1
Name3 .5 .045
Hoped-for result df3, not yet obtained (50x14492):
TYPE GENRE1 GENRE2
Name1 spearmanr(.0945,.9045) spearmanr(.0845,.895)
Name2 spearmanr(.9074,.074) spearmanr(NaN,1)
Name3 spearmanr(1,.5) spearmanr(0,.045)
I'd like to compare df1.GENRE1.Name1 to df2.GENRE1.Name1 but am getting lost in the implementation. In order to do this I have the following code:
for key1, value1 in df1.iteritems():
    for key2, value2 in df2.iteritems():
        if key1 == key2:
            # this gets me to df1.GENRE1 == df2.GENRE1
            for newkey1, newval1 in key1.iterrows():
                for newkey2, newval2 in key2.iterrows():
                    if newkey1 == newkey2:
                        # this does not seem to get me to df1.GENRE1.Name1 == df2.GENRE1.Name1
                        scipy.stats.spearmanr(newval1, newval2)
This is allowing me to compare df1.GENRE1 and df2.GENRE1 but I am not sure how to get to the next logical step of also ensuring that df1.GENRE1.Name1 == df2.GENRE1.Name1. Another way to put it, I am unsure of how to ensure the rows match now that I have the columns.
NOTE:
I have tried to use spearmanr on the full two dataframes as such:
corr, p_val = scipy.stats.spearmanr(df1, df2, axis=0, nan_policy='omit')
but rather than getting a new dataframe of the same size (50x14492) I am getting a table back that's 100x100.
Similarly if I use:
corr, p_val = scipy.stats.spearmanr(df1['GENRE1'], df2['GENRE1'], axis=0, nan_policy='omit')
I get the correlation of the two columns as a whole, rather than each row of that column. (Which would be of size 1X14492)
Your question is a bit convoluted. Are you trying to get the correlation between the two Genre columns?
If so you can simply call the correlation on the two columns in the DataFrame:
scipy.stats.spearmanr(df1['GENRE1'], df2['GENRE1'])
After reading your comment and edits, it appears you want the correlation row-wise. That's a simple CS problem but you should know that you're not going to get anything meaningful out of taking the correlation between two values. It'll just be undefined or 1. Anyway, this should populate df3 as you requested above:
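from scipy.stats import spearmanr
df3 = pd.DataFrame(index=df1.index)
# each call sees a single pair of values, so the coefficient is undefined
df3['GENRE1'] = [spearmanr([a], [b])[0] for a, b in zip(df1['GENRE1'], df2['GENRE1'])]
df3['GENRE2'] = [spearmanr([a], [b])[0] for a, b in zip(df1['GENRE2'], df2['GENRE2'])]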
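If what is really wanted is one coefficient per column (or per row) rather than per cell, `DataFrame.corrwith` may be a better fit: it aligns the two frames on their index and column labels before correlating. A minimal sketch with made-up frames that reuse the column names from the question (SciPy is assumed to be installed, since pandas relies on it for the Spearman method):

```python
import pandas as pd

names = ["Name1", "Name2", "Name3", "Name4", "Name5"]

# Hypothetical data: same labels in both frames, different values.
df1 = pd.DataFrame(
    {"GENRE1": [0.1, 0.2, 0.3, 0.4, 0.5], "GENRE2": [0.5, 0.4, 0.3, 0.2, 0.1]},
    index=names,
)
df2 = pd.DataFrame(
    {"GENRE1": [1, 2, 3, 4, 5], "GENRE2": [1, 2, 3, 4, 5]},
    index=names,
)

# One Spearman coefficient per shared column, computed over matching rows.
per_column = df1.corrwith(df2, method="spearman")
print(per_column)
```

Passing `axis=1` correlates matching rows across the shared columns instead, which only becomes informative when there are many columns per row.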
I have a dataframe like this:
Application|Category|Feature|Scenario|Result|Exec_Time
A1|C1|F1|scenario1|PASS|2.3
A1|C1|F1|scenario2|FAIL|20.3
A2|C1|F3|scenario3|PASS|12.3
......
The outcome I am looking for is a pivot with the count of results by Feature and also the sum of exec times, like this:
Application|Category|Feature|Count of PASS|Count of FAIL|SumExec_Time
A1|C1|F1|200|12|45.62
A1|C1|F2|90|0|15.11
A1|C2|F3|97|2|33.11
I got individual dataframes to get the pivots of result counts and the sum of execution time by feature but I am not able to merge those dataframes to get my final expected outcome.
dfr = pd.pivot_table(df,index=["Application","Category","Feature"],
values=["Final_Result"],aggfunc=[len])
dft = pd.pivot_table(df,index=["Application","Category","Feature"],
values=["Exec_time_mins"],aggfunc=[np.sum])
You don't need to merge results here, you can create this with a single pivot_table or groupby/apply. I don't have your data but does this get you what you want?
pivot = pd.pivot_table(df, index=["Application","Category","Feature"],
values = ["Final_Result", "Exec_time_mins"],
aggfunc = [len, np.sum])
#Count total records, number of FAILs and total time.
df2 = df.groupby(by=['Application','Category','Feature']).agg({'Result':[len,lambda x: len(x[x=='FAIL'])],'Exec_Time':sum})
#rename columns
df2.columns=['Count of PASS','Count of FAIL','SumExec_Time']
#calculate number of pass
df2['Count of PASS']-=df2['Count of FAIL']
#reset index
df2.reset_index(inplace=True)
df2
Out[1197]:
Application Category Feature Count of PASS Count of FAIL SumExec_Time
0 A1 C1 F1 1 1 22.6
1 A2 C1 F3 1 0 12.3
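Another way to get the same table, sketched under the assumption that the columns are named `Result` and `Exec_Time` as in the sample rows: build boolean helper columns and use named aggregation, so the PASS count does not have to be derived by subtraction. The helper names `is_pass` and `is_fail` are made up for this example.

```python
import pandas as pd

# The three sample rows from the question.
df = pd.DataFrame(
    {
        "Application": ["A1", "A1", "A2"],
        "Category": ["C1", "C1", "C1"],
        "Feature": ["F1", "F1", "F3"],
        "Scenario": ["scenario1", "scenario2", "scenario3"],
        "Result": ["PASS", "FAIL", "PASS"],
        "Exec_Time": [2.3, 20.3, 12.3],
    }
)

summary = (
    df.assign(is_pass=df["Result"].eq("PASS"), is_fail=df["Result"].eq("FAIL"))
    .groupby(["Application", "Category", "Feature"], as_index=False)
    .agg(
        **{
            # Summing booleans counts the True values.
            "Count of PASS": ("is_pass", "sum"),
            "Count of FAIL": ("is_fail", "sum"),
            "SumExec_Time": ("Exec_Time", "sum"),
        }
    )
)

print(summary)
```

The dictionary unpacking is only there because the output column names contain spaces; with identifier-style names the keyword form `agg(n_pass=("is_pass", "sum"))` reads more cleanly.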
In the first dataframe, the last two columns (shift_one and shift_two) can be thought of as a guess of a potential true coordinate. Call this df1.
df1:
p_one p_two dist shift_one shift_two
0 Q8_CB Q2_C d_6.71823_Angs 26.821 179.513
1 Q8_CD Q2_C d_4.72003_Angs 179.799 179.514
....
In the second dataframe, call this df2, I have a dataframe of experimental observed coordinates which I denote peaks. It simply is just the coordinates and one more column that is for how intense the signal was, this just needs to be along for the ride.
df2:
A B C
0 31.323 25.814 251106
1 26.822 26.083 690425
2 27.021 179.34 1409596
3 54.362 21.773 1413783
4 54.412 20.163 862750
....
I am aiming for a method where each guess in df1 is looked up in df2, within a range of 0.300 of the initial guess in df1. I then want the result returned in a new dataframe, let's say df3. In this case, we notice there is a match between row 0 of df1 and row 2 of df2.
desired output, df3:
p_one p_two dist shift_one shift_two match match1 match2 match_inten
0 Q8_CB Q2_C d_6.71823_Angs 26.821 179.513 TRUE 27.021 179.34 1409596
1 Q8_CD Q2_C d_4.72003_Angs 179.799 179.514 NaN NaN NaN NaN
....
I have attempted a few things:
(1) O'Reilly suggests dealing with bounds in a list in Python by using lambda or def (p. 78 of Python in a Nutshell). So I define a bound function like this.
def bounds (value, l=low, h=high)
I was then thinking that I could just add a new column, following the logic used here (https://stackoverflow.com/a/14717374/3767980).
df1['match'] = ((df2['A'] + 0.3 <= df1['shift_one']) or (df2['A'] + 0.3 => df1['shift_one'])
--I'm really struggling with this statement
Next I would just pull the values, which should be trivial.
(2) make new columns for the upper and lower limit, then run a conditional to see if the value is between the two columns.
Finally:
(a) Do you think I should stay in pandas, or should I move over to NumPy or SciPy, or just traditional Python arrays/lists? I was also thinking of a regular Python list of lists. I'm wary of NumPy since I have text too; is NumPy exclusive to numbers/matrices?
(b) Any help would be appreciated. I used biopython for phase_one and phase_two, pandas for phase_three, and I'm not quite sure for this final phase here what is the best library to use.
(c) It is probably fairly obvious that I'm an amateur programmer.
The following assumes that the columns to compare have the same names.
def temp(row):
    index = df2[((row-df2).abs() < .3).all(axis=1)].index
    return df2.loc[index[0], :] if len(index) else [None]*df2.shape[1]
Eg.
df1 = pd.DataFrame([[1,2],[3,4], [5,6]], columns=["d1", "d2"])
df2 = pd.DataFrame([[1.1,1.9],[3.2,4.3]], columns=["d1", "d2"])
df1.apply(temp, axis=1)
produces
d1 d2
0 1.1 1.9
1 3.2 4.3
2 NaN NaN
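For the frames in the question, where the column names differ (`shift_one`/`shift_two` against `A`/`B`), a NumPy broadcasting version of the same idea might look like the sketch below. It assumes the 0.3 tolerance applies to each coordinate separately and keeps only the first matching peak per guess. The output column names follow the desired df3 in the question, except that `match` is False rather than NaN when nothing is found.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question.
df1 = pd.DataFrame(
    {
        "p_one": ["Q8_CB", "Q8_CD"],
        "p_two": ["Q2_C", "Q2_C"],
        "shift_one": [26.821, 179.799],
        "shift_two": [179.513, 179.514],
    }
)
df2 = pd.DataFrame(
    {
        "A": [31.323, 26.822, 27.021, 54.362, 54.412],
        "B": [25.814, 26.083, 179.34, 21.773, 20.163],
        "C": [251106, 690425, 1409596, 1413783, 862750],
    }
)

tol = 0.3
guesses = df1[["shift_one", "shift_two"]].to_numpy()  # shape (n_guesses, 2)
peaks = df2[["A", "B"]].to_numpy()                    # shape (n_peaks, 2)

# within[i, j] is True when peak j lies within tol of guess i on both axes.
within = (np.abs(guesses[:, None, :] - peaks[None, :, :]) <= tol).all(axis=2)
has_match = within.any(axis=1)
first = within.argmax(axis=1)  # index of the first matching peak (0 if none)

# Pull the matched peak rows, blanking out guesses without any match.
matched = np.where(has_match[:, None], df2.to_numpy(dtype=float)[first], np.nan)
matches = pd.DataFrame(
    matched, columns=["match1", "match2", "match_inten"], index=df1.index
)
df3 = df1.assign(match=has_match).join(matches)

print(df3)
```

The comparison builds an n_guesses by n_peaks array, which is fine for thousands of rows but grows quickly; for much larger inputs a spatial index would likely scale better.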