Comparing two pandas dataframes on column and row - python

I have two pandas dataframes that have the same structure but different information stored in them. My question is how to compare the two dataframes to ensure the columns and rows match before performing some analysis, and how to obtain a third dataframe of the correlation between the two.
df1 (50x14492):
TYPE   GENRE1  GENRE2
Name1  .0945   .0845
Name2  .9074   NaN
Name3  1       0
and df2 (50x14492):
TYPE   GENRE1  GENRE2
Name1  .9045   .895
Name2  .074    1
Name3  .5      .045
Hoped-for result df3, as yet unobtained (50x14492):
TYPE   GENRE1                    GENRE2
Name1  spearmanr(.0945, .9045)   spearmanr(.0845, .895)
Name2  spearmanr(.9074, .074)    spearmanr(NaN, 1)
Name3  spearmanr(1, .5)          spearmanr(0, .045)
I'd like to compare df1.GENRE1.Name1 to df2.GENRE1.Name1 but am getting lost in the implementation. In order to do this I have the following code:
for key1, value1 in df1.iteritems():
    for key2, value2 in df2.iteritems():
        if key1 == key2:
            # this gets me to df1.GENRE1 == df2.GENRE1
            for newkey1, newval1 in value1.iteritems():
                for newkey2, newval2 in value2.iteritems():
                    if newkey1 == newkey2:
                        # this does not seem to get me to df1.GENRE1.Name1 == df2.GENRE1.Name1
                        scipy.stats.spearmanr(newval1, newval2)
This is allowing me to compare df1.GENRE1 and df2.GENRE1 but I am not sure how to get to the next logical step of also ensuring that df1.GENRE1.Name1 == df2.GENRE1.Name1. Another way to put it, I am unsure of how to ensure the rows match now that I have the columns.
NOTE:
I have tried to use spearmanr on the full two dataframes as such:
corr, p_val = scipy.stats.spearmanr(df1, df2, axis=0, nan_policy='omit')
but rather than getting a new dataframe of the same size (50x14492) I am getting a table back that's 100x100.
Similarly if I use:
corr, p_val = scipy.stats.spearmanr(df1['GENRE1'], df2['GENRE1'], axis=0, nan_policy='omit')
I get the correlation of the two columns as a whole, rather than each row of that column. (Which would be of size 1X14492)

Your question is a bit convoluted. Are you trying to get the correlation between the two Genre columns?
If so you can simply call the correlation on the two columns in the DataFrame:
scipy.stats.spearmanr(df1['GENRE1'], df2['GENRE1'])
After reading your comment and edits, it appears you want the correlation row-wise. That's a simple CS problem but you should know that you're not going to get anything meaningful out of taking the correlation between two values. It'll just be undefined or 1. Anyway, this should populate df3 as you requested above:
from scipy.stats import spearmanr

df3 = pd.DataFrame(index=df1.index)
# Each pair is a single observation, so its Spearman correlation is undefined (NaN).
df3['GENRE1'] = [spearmanr([a], [b])[0] for a, b in zip(df1['GENRE1'], df2['GENRE1'])]
df3['GENRE2'] = [spearmanr([a], [b])[0] for a, b in zip(df1['GENRE2'], df2['GENRE2'])]
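If what you ultimately want is one correlation per Name (a single value per row, computed across all of the genre columns), a hedged alternative is pandas' corrwith; this sketch assumes a reasonably recent pandas (0.24+) and that both dataframes carry exactly the same index and column labels:
# Sanity check that rows and columns line up before comparing.
assert df1.index.equals(df2.index) and df1.columns.equals(df2.columns)
# One Spearman correlation per row, computed across the matching columns.
row_corr = df1.corrwith(df2, axis=1, method='spearman')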

Related

How to retrieve rows matching a criteria from the multi dimensional numpy array?

I have a multidimensional NumPy array read from a CSV file. I want to retrieve rows matching a certain column in the data set dynamically.
My current array is
[[LIMS_AY60_51X, AY60_51X_61536153d7cdc55.png, 857.61389, 291.227, NO, 728.322,865.442]
[LIMS_AY60_52X, AY60_52X_615f6r53d7cdc55.png, 867.61389, 292.227, NO, 728.322,865.442]
[LIMS_AY60_53X, AY60_53X_615ft153d7cdc55.png, 877.61389, 293.227, NO, 728.322,865.442]
[LIMS_AY60_54X, AY60_54X_615u6153d7cdc55.png, 818.61389, 294.227, NO, 728.322,865.442]
[LIMS_AY60_55X, AY60_55X_615f615od7cdc55.png, 847.61389, 295.227, NO, 728.322,865.442]......]
I would like to use the np.where method to extract the rows matching the following criterion (second column value equal to 'AY60_52X_615f6r53d7cdc55.png'):
np.where ((vals == (:,'AY60_52X_615f6r53d7cdc55.png',:,:,:,:,:)).all(axis=1))
This one has an error due to syntax.
File "<ipython-input-31-a28fe9729cd4>", line 3
np.where ((vals == (:,'AY60_52X_615f6r53d7cdc55.png',:,:,:,:,:)).all(axis=1))
^
SyntaxError: invalid syntax
Any help is appreciated
If you're dealing with CSV files and tabular data handling, I'd recommend using Pandas.
Here's very briefly how that would work in your case (df is the usual variable name for a Pandas DataFrame, hence df).
df = pd.read_csv('datafile.csv')
print(df)
results in the output
code filename value1 value2 yesno anothervalue yetanothervalue
0 LIMS_AY60_51X AY60_51X_61536153d7cdc55.png 857.61389 291.227 NO 728.322 865.442
1 LIMS_AY60_52X AY60_52X_615f6r53d7cdc55.png 867.61389 292.227 NO 728.322 865.442
2 LIMS_AY60_53X AY60_53X_615ft153d7cdc55.png 877.61389 293.227 NO 728.322 865.442
3 LIMS_AY60_54X AY60_54X_615u6153d7cdc55.png 818.61389 294.227 NO 728.322 865.442
4 LIMS_AY60_55X AY60_55X_615f615od7cdc55.png 847.61389 295.227 NO 728.322 865.442
Note that the very first column is called the index. It is not in the CSV file, but automatically added by Pandas. You can ignore it here.
The column names are thought-up by me; usually, the first row of the CSV file will have column names, and otherwise Pandas will default to naming them something like "Unnamed: 0", "Unnamed: 1", "Unnamed: 2" etc.
Then, for the actual selection, you do
df['filename'] == 'AY60_52X_615f6r53d7cdc55.png'
which results in
0 False
1 True
2 False
3 False
4 False
Name: filename, dtype: bool
which is a one-dimensional pandas object, called a Series. Again, it has an index column, but more importantly, the second column shows for which rows the comparison is true.
You can assign the result to a variable instead, and use that variable to access the rows that have True, as follows:
selection = df['filename'] == 'AY60_52X_615f6r53d7cdc55.png'
print(df[selection])
which yields
code filename value1 value2 yesno anothervalue yetanothervalue
1 LIMS_AY60_52X AY60_52X_615f6r53d7cdc55.png 867.61389 292.227 NO 728.322 865.442
Note that in this case, Pandas is smart enough to figure out whether you want to access a particular column (df['filename']) or a selection of rows (df[selection]). More complicated ways of accessing a dataframe are possible, but you'll have to read up on that.
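For example, .loc lets you combine a row condition with a column label in one step (a small illustration only, using one of the made-up column names from above):
# Select just the 'value1' column for the rows whose filename matches.
df.loc[df['filename'] == 'AY60_52X_615f6r53d7cdc55.png', 'value1']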
You can merge some things together, and with the reading of the CSV file, it's just two lines:
df = pd.read_csv('datafile.csv')
df[ df['filename'] == 'AY60_52X_615f6r53d7cdc55.png' ]
which I think is a bit nicer than using purely NumPy. Essentially, use NumPy only when you are really dealing with (multi-dimensional) array data. Not when dealing with records / tabular structured data, as in your case. (Side note: under the hood, Pandas uses a lot of NumPy, so the speed is the same; it's largely a nicer interface with some extra functionality.)
You can do it like this using numpy:
selected_row = a[np.any(a == 'AY60_52X_615f6r53d7cdc55.png', axis=1)]
Output:
>>> selected_row
array([['LIMS_AY60_52X', 'AY60_52X_615f6r53d7cdc55.png', '867.61389', '292.227', 'NO', '728.322', '865.442']], dtype='<U32')
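If you want the match restricted to the second column specifically (closer to what the np.where attempt was aiming at), a boolean mask on just that column should also work; this sketch assumes vals is the 2-D string array from the question:
import numpy as np

# Rows whose second column equals the requested filename.
selected_row = vals[vals[:, 1] == 'AY60_52X_615f6r53d7cdc55.png']
# Or, with np.where, to get the matching row indices first:
row_idx = np.where(vals[:, 1] == 'AY60_52X_615f6r53d7cdc55.png')[0]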

Dataframe - Remove similar rows based on two columns

I have the following dataset:
This dataset shows the correlation between two columns (source and destination) on the left.
If you look at rows 3 and 42, you will find they are the same; only the column positions are swapped, which does not affect the correlation. I want to remove row 42. But this dataset has many such rows of similar values. I need a general algorithm to remove these similar rows and keep only the unique ones.
As the correlation_value seems to be the same, the operation should be commutative, so whatever the value, you just have to focus on the first two columns: sort the tuple and remove duplicates.
# You can probably replace 'sorted' by 'set'
key = df[['source_column', 'destination_column']] \
        .apply(lambda x: tuple(sorted(x)), axis='columns')
out = df.loc[~key.duplicated()]
>>> out
source_column destination_column correlation_Value
0 A B 1
2 C E 2
3 D F 4
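If you prefer the set-based variant hinted at in the comment above, frozenset (hashable and order-insensitive) can serve as the key; a sketch under the same column names:
# frozenset treats (A, B) and (B, A) as the same key.
key = df[['source_column', 'destination_column']].apply(frozenset, axis='columns')
out = df.loc[~key.duplicated()]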
You could try a self join. Without a code example, it's hard to answer, but something like this maybe:
df.merge(df, left_on="source_column", right_on="destination_column")
You can follow that up with a call to drop_duplicates.

Is there a way to allocate sorted values in a dataframe to groups based on alternating elements?

I have a Pandas DataFrame like:
COURSE BIB# COURSE 1 COURSE 2 STRAIGHT-GLIDING MEAN PRESTASJON
1 2 20.220 22.535 19.91 21.3775 1.073707
0 1 21.235 23.345 20.69 22.2900 1.077332
This is from a pilot and the DataFrame may be much longer when we perform the real experiment. Now that I have calculated the performance for each BIB#, I want to allocate them into two different groups based on their performance. I have therefore written the following code:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
This sorts values in the DataFrame. Now I want to assign even rows to one group and odd rows to another. How can I do this?
I am not sure exactly what I am looking for. I have looked at the documentation for the random module in Python, but that is not what I need. I have seen some questions/posts pointing to a scikit-learn stratification function, but I don't know if that is a good choice. Alternatively, is there a way to create a loop that accomplishes this? I appreciate your help.
Here is a figure to illustrate what I want to accomplish.
How about this:
threshold = 0.5
df1['group'] = df1['PRESTASJON'] > threshold
Or if you want values for your groups:
df1['group'] = np.where(df1['PRESTASJON'] > threshold, 'A', 'B')
Here, 'A' will be assigned to the 'group' column where PRESTASJON exceeds the threshold, otherwise 'B'.
UPDATE: Per OP's update on the post, if you want to group them alternately into two groups:
# sort your dataframe based on the PRESTASJON column
df1 = df1.sort_values(by='PRESTASJON')
# create a new column with default value 'A' and assign every other row to 'B'
df1['group'] = 'A'
df1.iloc[1::2, -1] = 'B'
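From there you can work with the two groups separately, for example (a small usage sketch):
group_a = df1[df1['group'] == 'A']
group_b = df1[df1['group'] == 'B']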
Are you splitting the dataframe alternatingly? If so, you can do:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
for i, d in df1.groupby(np.arange(len(df1)) % 2):
    print(f'group {i}')
    print(d)
Another way without groupby:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
mask = np.arange(len(df1)) %2
group1 = df1.loc[mask==0]
group2 = df1.loc[mask==1]

ValueError when using pandas' crosstab

I'm sure there must be a quickfix for this but I can't find an answer with a good explanation. I'm looking to iterate over a dataframe and build a crosstab for each pair of columns with pandas. I have subsetted 2 cols from the original data and removed rows with unsuitable data. With the remaining data I am looking to do a crosstab to ultimately build a contingency table to do a ChiX test. Here is my code:
my_data = pd.read_csv(DATA_MATRIX, index_col=0) #GET DATA
AM = pd.DataFrame(columns=my_data.columns, index = my_data.columns) #INITIATE DF TO HOLD ChiX-result
for c1 in my_data.columns:
    for c2 in my_data.columns:
        sample_df = pd.DataFrame(my_data, columns=[c1, c2])  # make df to do ChiX on
        sample_df = sample_df[(sample_df[c1] != 0.5) | (sample_df[c2] != 0.5)].dropna()  # remove unsuitable rows
        contingency = pd.crosstab(sample_df[c1], sample_df[c2])  # This doesn't work?
        # DO ChiX AND STORE P-VALUE IN 'AM': CODE STILL TO WRITE
The dataframe contains the values 0.0, 0.5, and 1.0. The 0.5 entries are missing data, so I am removing these rows before making the contingency table; the remaining values that I want to build the contingency tables from are all either 0.0 or 1.0. I have checked that the code works up to this point. The error printed to the console is:
ValueError: If using all scalar values, you must pass an index
Can anyone explain why this doesn't work, or help to solve it in any way? Or, even better, suggest an alternative way to do a ChiX test on the columns? That would be very helpful, thanks in advance!
EDIT: example of the structure of the first few rows of sample_df
col1 col2
sample1 1 1
sample2 1 1
sample3 0 0
sample4 0 0
sample5 0 0
sample6 0 0
sample7 0 0
sample8 0 0
sample9 0 0
sample10 0 0
sample11 0 0
sample12 1 1
A crosstab between two identical entities is meaningless. pandas is going to tell you:
ValueError: The name col1 occurs multiple times, use a level number
Meaning it assumes you're passing two different columns from a multi-indexed dataframe with the same name.
In your code, you're iterating over columns in a nested loop, so the situation arises where c1 == c2, so pd.crosstab errors out.
The fix would involve adding an if check and skipping that iteration if the columns are equal. So, you'd do:
for c1 in my_data.columns:
    for c2 in my_data.columns:
        if c1 == c2:
            continue
        ...  # rest of your code
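For the chi-squared step itself (the part marked "CODE STILL TO WRITE"), a hedged sketch using scipy.stats.chi2_contingency might look like this, storing each p-value in AM:
from scipy.stats import chi2_contingency

# Inside the loop, after building the contingency table for c1 != c2:
chi2, p_value, dof, expected = chi2_contingency(contingency)
AM.loc[c1, c2] = p_value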

In Python, how do I select the columns of a dataframe satisfying a condition on the number of NaN?

I hope someone could help me. I'm new to Python, and I have a dataframe with 111 columns and over 40 000 rows. All the columns contain NaN values (some columns contain more NaN's than others), so I want to drop those columns having at least 80% of NaN values. How can I do this?
To solve my problem, I tried the following code
df1=df.apply(lambda x : x.isnull().sum()/len(x) < 0.8, axis=0)
The function x.isnull().sum()/len(x) is to divide the number of NaN in the column x by the length of x, and the part < 0.8 is to choose those columns containing less than 80% of NaN.
The problem is that when I run this code I only get the names of the columns together with the boolean "True" but I want the entire columns, not just the names. What should I do?
You could do this:
filt = df.isnull().sum()/len(df) < 0.8
df1 = df.loc[:, filt]
You want to achieve two things. First, you have to find all columns which contain less than 80% NaNs. Second, you want to discard the others from your DataFrame.
To get a pandas Series indicating whether each column should be kept, you can do:
df1 = df.isnull().sum(axis=0) < 0.8*df.shape[0]
(Btw. you have a typo in your question. You should drop the ==True as it always tests whether 0.5==True)
This will give True for all columns to keep, as .isnull() gives True (or 1) if an element is NaN and False (or 0) if it is a valid number. Then .sum(axis=0) sums along the columns, giving the number of NaNs in each column. The comparison then checks whether that number is less than 80% of the number of rows.
For the second task, you can use this result to index your columns:
df = df[df.columns[df1]]
or as suggested in the comments by doing:
df.drop(df.columns[df1==False], axis=1, inplace=True)
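As a side note, pandas' dropna with a thresh argument can express the same idea in one call (a sketch; double-check the behaviour right at the 80% boundary for your data):
import math

# Keep only columns with more than 20% non-NaN values,
# i.e. drop columns that are at least 80% NaN.
min_non_nan = math.floor(0.2 * len(df)) + 1
df1 = df.dropna(axis=1, thresh=min_non_nan)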
