Plotting 3 cols of pandas dataframe as heatmap - python

I'm almost certain that this is a duplicate, but I'm having trouble finding the answer in a reasonable amount of time.
I have a dataframe with the three columns below:
    COLS  FUNCS     FLUFF
32     1      1   3.24707
33     1      2  14.89260
34     1      3  48.60840
35     1      4  73.68160
36     2      1   4.19922
37     2      2  64.89260
38     2      3  87.91500
39     2      4  91.01560
40     4      1  23.58400
41     4      2  87.89060
42     4      3  95.38570
43     4      4  98.33980
44     8      1  34.47270
45     8      2  95.43460
46     8      3  99.04790
47     8      4  99.80470
I want to plot a heat map of these data with COLS on the horizontal axis and FUNCS on the vertical axis with cells that are scaled according to FLUFF. I don't want to use seaborn. I want to use matplotlib and/or pandas exclusively.
If you also have some insight on how to achieve a logarithmic color scheme, that would be great too.

df.set_index(['COLS', 'FUNCS']).FLUFF.unstack(0).pipe(plt.imshow)
should do it for you (with matplotlib.pyplot imported as plt).
As cel mentioned in the comments, if your data is actually sparse, you might want to do a .reindex to insert all the rows and columns, filling the NaNs appropriately.
For the log scale have a look at http://matplotlib.org/api/ticker_api.html#matplotlib.ticker.LogFormatter
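Putting the pivot and a log color scale together, a minimal sketch using matplotlib.colors.LogNorm on a subset of the question's data (the Agg backend is just so it runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.colors import LogNorm

df = pd.DataFrame({
    "COLS":  [1, 1, 1, 1, 2, 2, 2, 2],
    "FUNCS": [1, 2, 3, 4, 1, 2, 3, 4],
    "FLUFF": [3.24707, 14.89260, 48.60840, 73.68160,
              4.19922, 64.89260, 87.91500, 91.01560],
})

# Pivot so FUNCS runs down the vertical axis and COLS across the horizontal
grid = df.set_index(["COLS", "FUNCS"]).FLUFF.unstack(0)

fig, ax = plt.subplots()
im = ax.imshow(grid, norm=LogNorm())  # logarithmic color scaling
fig.colorbar(im, ax=ax)
ax.set_xlabel("COLS")
ax.set_ylabel("FUNCS")
```

LogNorm maps the colors on a log scale directly, so no manual transformation of FLUFF is needed.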

Related

Comparing Columns in a Pandas Dataframe

I have a pandas data frame with racing results.
    Place  BibNum   Time
0       1       2   5:50
1       2       4   8:09
2       3       7  10:27
3       4       3  11:12
4       5       1  12:13
...
34      1       5   2:03
35      2       9   4:35
36      3       7   5:36
What I would like to know is how can I get a count of how many times the BibNum showed up where the Place was 1, 2, 3 etc?
I know that I can do a "value_counts" but that is for how many times it shows up in a single column. I also looked into using numpy "where" but that is using a conditional like greater than or less than.
IIUC, this is what you need:
out = df.groupby(['Place','BibNum']).size()
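For example, on a small frame shaped like the question's (values made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Place":  [1, 2, 3, 1, 2, 3],
    "BibNum": [2, 4, 7, 5, 9, 7],
})

# Count how many times each BibNum showed up at each Place
out = df.groupby(["Place", "BibNum"]).size()
print(out)
```

The result is a Series indexed by (Place, BibNum) pairs; here BibNum 7 appears twice at Place 3. Appending .unstack(fill_value=0) would turn it into a Place-by-BibNum count table if that reads better.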

How to sort by multiple columns for all values and not just duplicates - python

I have a pandas dataframe whose values I need to sort (ascending) by two columns, with the output being a "middle ground" of the two columns.
An example is shown below. When I use sort_values it sorts by the first column and considers the second one only for duplicate values. I, however, need to get the row that has the combination of lower values for both columns (which is the 3rd one in the output below).
test = pd.DataFrame({'file':[1,2,3,4,5,6], 'rmse':[66,41,43,39,40,42], 'var':[44,177,201,321,349,379]})
test.sort_values(by=['rmse', 'var'], ascending=[True, True])
Output :
   file  rmse  var
3     4    39  321   <--- First row given by `sort_values`
4     5    40  349
1     2    41  177   <--- Row that I need
5     6    42  379
2     3    43  201
0     1    66   44
I'm not sure how to phrase my question properly in English so please tell me if I need to make my question more clear.
IIUC, let's use rank, mean, and argsort:
test.iloc[test[['var', 'rmse']].rank().mean(axis=1).argsort()]
Output:
   file  rmse  var
1     2    41  177
3     4    39  321
0     1    66   44
4     5    40  349
2     3    43  201
5     6    42  379
Details: rank the values in each column, average the ranks for each row, then sort by the mean ranks to determine row order.
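The intermediate steps, spelled out on the question's test frame:

```python
import pandas as pd

test = pd.DataFrame({'file': [1, 2, 3, 4, 5, 6],
                     'rmse': [66, 41, 43, 39, 40, 42],
                     'var':  [44, 177, 201, 321, 349, 379]})

# Per-column ranks: 1 = smallest value within that column
ranks = test[['var', 'rmse']].rank()

# Average the two ranks per row; the lowest mean rank is the best
# "middle ground" of both columns (ties share the same mean rank)
mean_rank = ranks.mean(axis=1)

# Reorder the rows by mean rank
ordered = test.iloc[mean_rank.argsort()]
print(ordered)
```

Row file=2 (ranks 2 and 3, mean 2.5) and row file=4 (ranks 4 and 1, mean 2.5) tie for first, which is why both beat file=5 (mean 3.5).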
I've tried the df.sort_values methods, but instead of that you can try a for loop like this (note that this sorts each column independently, so the values in a row no longer correspond to each other):
import pandas as pd
test = pd.DataFrame({'file':[1,2,3,4,5,6], 'rmse':[66,41,43,39,40,42], 'var':[44,177,201,321,349,379]})
for i in test:
    test[i] = sorted(test[i])
print(test)
Output :
   file  rmse  var
0     1    39   44
1     2    40  177
2     3    41  201
3     4    42  321
4     5    43  349
5     6    66  379

Combine two dataframes and pick first entry based on common column [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I have two dataframes like
df1
sub_id  Weight
     1      56
     2      67
     3      81
     5      73
     9      59
df2
sub_id  Text
     1  He is normal.
     1  person is healthy.
     1  has strong immune power.
     3  She is over weight.
     3  person is small.
     9  Looks good.
     5  Not well.
     5  Need to be tested.
By combining these two data frames I need to get the result below
(when there are multiple rows for a sub_id in the second df, I need to pick the first text and combine it with the first df as below).
merge_df
sub_id  Weight  Text
     1      56  He is normal.
     2      67  NaN
     3      81  She is over weight.
     5      73  Not well.
     9      59  Looks good.
Can anyone help me out?
Thanks in advance.
Here you go:
print(pd.merge(df1, df2.drop_duplicates(subset='sub_id'),
               on='sub_id',
               how='outer'))
Output
   sub_id  Weight                 Text
0       1      56        He is normal.
1       2      67                  NaN
2       3      81  She is over weight.
3       5      73            Not well.
4       9      59          Looks good.
To keep the last duplicate instead, use the parameter keep='last':
print(pd.merge(df1, df2.drop_duplicates(subset='sub_id', keep='last'),
               on='sub_id',
               how='outer'))
Output
   sub_id  Weight                      Text
0       1      56  has strong immune power.
1       2      67                       NaN
2       3      81          person is small.
3       5      73        Need to be tested.
4       9      59               Looks good.
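A runnable version of the approach, with the question's frames transcribed:

```python
import pandas as pd

df1 = pd.DataFrame({"sub_id": [1, 2, 3, 5, 9],
                    "Weight": [56, 67, 81, 73, 59]})
df2 = pd.DataFrame({"sub_id": [1, 1, 1, 3, 3, 9, 5, 5],
                    "Text": ["He is normal.", "person is healthy.",
                             "has strong immune power.", "She is over weight.",
                             "person is small.", "Looks good.",
                             "Not well.", "Need to be tested."]})

# drop_duplicates keeps the first Text per sub_id before merging;
# the outer merge keeps sub_id 2 even though it has no Text (NaN)
merge_df = pd.merge(df1, df2.drop_duplicates(subset="sub_id"),
                    on="sub_id", how="outer")
print(merge_df)
```

An outer merge is what preserves sub_id 2 from df1; a plain (inner) merge would drop it because it never appears in df2.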

Calculating median for values in a column that match a condition

I am new to Pandas.
My dataset:
df
 A  B
10  1
15  2
65  3
54  2
51  2
96  1
I am trying to add new column C and calculate the median for values that are in the same group defined by column B.
Expected result:
df
 A  B   C
10  1  53
15  2  51
65  3  65
54  2  51
51  2  51
96  1  53
What I've tried:
df_final['C'] = df_final.groupby('B')['A'].transform('median')
I do get an answer, but because the DataFrame is big I am unsure whether my code behaves correctly. Could someone tell me if this is the right way to achieve this?
You can use:
df_final['C'] = df_final.groupby('B')['A'].transform('median')
As provided in comments.
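A runnable sketch on the question's sample data; transform broadcasts each group's median back onto every row of that group, so C has the same length as A and B:

```python
import pandas as pd

df_final = pd.DataFrame({"A": [10, 15, 65, 54, 51, 96],
                         "B": [1, 2, 3, 2, 2, 1]})

# For each group defined by B, compute the median of A and
# broadcast it back to every row of the group
df_final["C"] = df_final.groupby("B")["A"].transform("median")
print(df_final)
```

Group B=1 has A values 10 and 96 (median 53), B=2 has 15, 54, 51 (median 51), and B=3 has just 65. This is the idiomatic way to do it, and groupby/transform scales fine to large frames.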

How to enumerate values in a column according to their size?

I have a pandas data frame in which one of the columns contains real values. I would like to have a new column in this data frame that contains integer numbers indicating what place the real number from another column takes. For example, 1 would mean that the real number from the column with real numbers is the largest one and 2 would mean the second largest and so on.
DataFrame has a rank method:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 100, 10)})
df['rank'] = df['a'].rank(ascending=False)
    a  rank
0  16     8
1  91     1
2  58     4
3  36     6
4  15     9
5  69     3
6  35     7
7  78     2
8  48     5
9   5    10
Make sure you check out the optional method keyword, which sets the behavior in case of equal values.
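For example, method controls what happens on ties (a small sketch):

```python
import pandas as pd

s = pd.Series([10, 20, 20, 30])

# 'average' (the default): tied values share the mean of their ranks
print(s.rank(ascending=False).tolist())                  # [4.0, 2.5, 2.5, 1.0]

# 'min': every tied value gets the lowest rank in the tie group
print(s.rank(ascending=False, method='min').tolist())    # [4.0, 2.0, 2.0, 1.0]

# 'first': ties broken by order of appearance
print(s.rank(ascending=False, method='first').tolist())  # [4.0, 2.0, 3.0, 1.0]
```

If whole-number places like 1, 2, 3 are wanted with no shared ranks, method='first' is the one to use.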
