I want to compute the correlation on this DataFrame, but instead of displaying it the way shown below, I want to rank the values from lowest to largest.
import pandas as pd
import numpy as np
rs = np.random.RandomState(1)
df = pd.DataFrame(rs.rand(9, 8))
corr = df.corr()
corr.style.background_gradient().set_precision(2)
0 1 2 3 4 5 6 7
0 1 0.42 0.031 -0.16 -0.35 0.23 -0.22 0.4
1 0.42 1 -0.24 -0.55 0.011 0.3 -0.26 0.23
2 0.031 -0.24 1 0.29 0.44 0.29 0.23 0.25
3 -0.16 -0.55 0.29 1 -0.33 -0.42 0.58 -0.37
4 -0.35 0.011 0.44 -0.33 1 0.46 0.074 0.19
5 0.23 0.3 0.29 -0.42 0.46 1 -0.41 0.71
6 -0.22 -0.26 0.23 0.58 0.074 -0.41 1 -0.66
7 0.4 0.23 0.25 -0.37 0.19 0.71 -0.66 1
You can use sort_values:
import pandas as pd
import numpy as np
rs = np.random.RandomState(1)
df = pd.DataFrame(rs.rand(9, 8))
corr = df.corr()
print(corr)
print(corr.sort_values(by=0, axis=1, ascending=False))  # sort columns by the values in row 0, largest first; use ascending=True for lowest to largest
Results:
0 1 2 3 4 5 6 7
0 1.000000 0.418246 0.030692 -0.160001 -0.352993 0.230069 -0.216804 0.395662
1 0.418246 1.000000 -0.244115 -0.549013 0.010745 0.299203 -0.262351 0.232681
2 0.030692 -0.244115 1.000000 0.288011 0.435907 0.285408 0.225205 0.253840
3 -0.160001 -0.549013 0.288011 1.000000 -0.326950 -0.415688 0.578549 -0.366539
4 -0.352993 0.010745 0.435907 -0.326950 1.000000 0.455738 0.074293 0.193905
5 0.230069 0.299203 0.285408 -0.415688 0.455738 1.000000 -0.413383 0.708467
6 -0.216804 -0.262351 0.225205 0.578549 0.074293 -0.413383 1.000000 -0.664207
7 0.395662 0.232681 0.253840 -0.366539 0.193905 0.708467 -0.664207 1.000000
0 1 7 5 2 3 6 4
0 1.000000 0.418246 0.395662 0.230069 0.030692 -0.160001 -0.216804 -0.352993
1 0.418246 1.000000 0.232681 0.299203 -0.244115 -0.549013 -0.262351 0.010745
2 0.030692 -0.244115 0.253840 0.285408 1.000000 0.288011 0.225205 0.435907
3 -0.160001 -0.549013 -0.366539 -0.415688 0.288011 1.000000 0.578549 -0.326950
4 -0.352993 0.010745 0.193905 0.455738 0.435907 -0.326950 0.074293 1.000000
5 0.230069 0.299203 0.708467 1.000000 0.285408 -0.415688 -0.413383 0.455738
6 -0.216804 -0.262351 -0.664207 -0.413383 0.225205 0.578549 1.000000 0.074293
7 0.395662 0.232681 1.000000 0.708467 0.253840 -0.366539 -0.664207 0.193905
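If you instead want a single ranked list of every pairwise correlation, from lowest to largest, a minimal sketch (keeping only the upper triangle so each pair appears once):
import numpy as np

# mask the diagonal and lower triangle, then flatten and sort
pairs = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
ranked = pairs.stack().sort_values()  # lowest to largest
print(ranked)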
I have a data frame and I am trying to create a new data frame consisting of the ratios between each pair of columns of the original data frame.
I tried the logic below.
df_new = pd.concat([df[df.columns.difference([col])].div(df[col], axis=0)
                    .add_suffix('/R') for col in df.columns], axis=1)
Output is:
B/R C/R D/R A/R C/R D/R A/R B/R D/R A/R B/R C/R
0 0.46 1.16 0.78 2.16 2.50 1.69 0.86 0.40 0.68 1.28 0.59 1.48
1 1.05 1.25 1.64 0.95 1.19 1.55 0.80 0.84 1.30 0.61 0.64 0.77
2 1.56 2.78 2.78 0.64 1.79 1.79 0.36 0.56 1.00 0.36 0.56 1.00
3 0.54 2.23 0.35 1.86 4.14 0.64 0.45 0.24 0.16 2.89 1.56 6.44
However, I am facing two issues here. One is that I get both A/B and B/A, which are not needed and which increase the number of columns. Is there a way to get only A/B and eliminate B/A?
The second issue is the naming of the columns with the add_suffix method, which does not convey which column is divided by which. Is there a way to create column names like A/B for column A divided by column B?
Use itertools.combinations and divide the column pairs in a dictionary comprehension:
df = pd.DataFrame({
'A':[5,3,6,9,2,4],
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,8],
})
from itertools import combinations
L = {f'{a}/{b}': df[a].div(df[b]) for a, b in combinations(df.columns, 2)}
df = pd.concat(L, axis=1)
print(df)
A/B A/C A/D B/C B/D C/D
0 1.25 0.714286 5.000000 0.571429 4.000000 7.000000
1 0.60 0.375000 1.000000 0.625000 1.666667 2.666667
2 1.50 0.666667 1.200000 0.444444 0.800000 1.800000
3 1.80 2.250000 1.285714 1.250000 0.714286 0.571429
4 0.40 1.000000 2.000000 2.500000 5.000000 2.000000
5 1.00 1.333333 0.500000 1.333333 0.500000 0.375000
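Because pd.concat receives a dict, the f'{a}/{b}' keys become the new column names, which solves the naming issue, and combinations (unlike permutations) yields each unordered pair exactly once, so B/A never appears.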
I'm looking to create a rolling grouped cumulative sum across two dataframes. I can get the result via iteration, but wanted to see if there was a more intelligent way.
I need the 5-row block of A to roll through the rows of B and accumulate. Think of it as a rolling balance built from a block of contributions and rolling returns.
So here's the calculation for C; at each row of B, the trailing columns show the running balances of the 5-row windows active at that row:
A B
1 100.00 1 0.01 101.00
2 110.00 2 0.02 215.22 102.00
3 120.00 3 0.03 345.28 218.36 103.00
4 130.00 4 0.04 494.29 351.89 221.52 104.00
5 140.00 5 0.05 666.00 505.99 358.60 224.70 105.00
6 0.06 684.75 517.91 365.38 227.90 106.00
7 0.07 703.97 530.06 372.25 231.12
8 0.08 723.66 542.43 379.21
9 0.09 743.85 555.04
10 0.10 764.54
C Row 5
Beginning Balance Contribution Return Ending Balance
0.00 100.00 0.01 101.00
101.00 110.00 0.02 215.22
215.22 120.00 0.03 345.28
345.28 130.00 0.04 494.29
494.29 140.00 0.05 666.00
C Row 6
Beginning Balance Contribution Return Ending Balance
0.00 100.00 0.02 102.00
102.00 110.00 0.03 218.36
218.36 120.00 0.04 351.89
351.89 130.00 0.05 505.99
505.99 140.00 0.06 684.75
Here's what the source data looks like:
A B
1 100.00 1 0.01
2 110.00 2 0.02
3 120.00 3 0.03
4 130.00 4 0.04
5 140.00 5 0.05
6 0.06
7 0.07
8 0.08
9 0.09
10 0.10
Here is the desired result:
C
1 NaN
2 NaN
3 NaN
4 NaN
5 666.00
6 684.75
7 703.97
8 723.66
9 743.85
10 764.54
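One way to avoid explicit iteration, as a sketch (assuming A and B are Series indexed 1-5 and 1-10 as above): slide a 5-row window over B with rolling.apply and fold the block of A through each window:
import pandas as pd

A = pd.Series([100.0, 110.0, 120.0, 130.0, 140.0], index=range(1, 6))
B = pd.Series([0.01, 0.02, 0.03, 0.04, 0.05,
               0.06, 0.07, 0.08, 0.09, 0.10], index=range(1, 11))

def window_balance(returns):
    # fold the 5 contributions through one 5-return window:
    # each step adds a contribution, then applies that step's return
    bal = 0.0
    for contrib, r in zip(A.to_numpy(), returns):
        bal = (bal + contrib) * (1 + r)
    return bal

# rolling(len(A)) hands each 5-row window of B to the function;
# rows before the window fills come back as NaN, matching the desired C
C = B.rolling(len(A)).apply(window_balance, raw=True)
print(C)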
I bet this question has been answered a number of times but I am struggling to find a definitive solution.
I need to delete dataframe rows based on a greater-or-equal condition. Because of the float64 type I am not able to satisfy the "equal" part of the condition. Splitting the condition into two seems cumbersome and not very pandorable. Can someone help me find a solution?
Thanks.
Dataframe:
Sg Sw temp_S Krg Krw Pc
0 0.00 1.00 -5.263158e-02 0.000000 0.650000 0.000000
1 0.05 0.95 -4.382459e-17 0.000000 0.650000 0.000000
2 0.10 0.90 5.263158e-02 0.000000 0.593548 0.095790
3 0.15 0.85 1.052632e-01 0.000000 0.537097 0.107775
4 0.20 0.80 1.578947e-01 0.000000 0.480645 0.122121
5 0.25 0.75 2.105263e-01 0.000000 0.424194 0.139496
6 0.30 0.70 2.631579e-01 0.000000 0.367742 0.160837
7 0.35 0.65 3.157895e-01 0.000000 0.311290 0.187397
8 0.36 0.64 3.263158e-01 0.000000 0.300000 0.193483
9 0.40 0.60 3.684211e-01 0.014167 0.230400 0.221009
Slicing:
print(object.sc_df[object.sc_df['Sg'].values > 0.05])
Output:
Sg Sw temp_S Krg Krw Pc
2 0.10 0.90 0.052632 0.000000 0.593548 0.095790
3 0.15 0.85 0.105263 0.000000 0.537097 0.107775
4 0.20 0.80 0.157895 0.000000 0.480645 0.122121
5 0.25 0.75 0.210526 0.000000 0.424194 0.139496
6 0.30 0.70 0.263158 0.000000 0.367742 0.160837
7 0.35 0.65 0.315789 0.000000 0.311290 0.187397
8 0.36 0.64 0.326316 0.000000 0.300000 0.193483
9 0.40 0.60 0.368421 0.014167 0.230400 0.221009
As you can see, row 1 (Sg = 0.05) is missing. What would be the best way to satisfy the "equal" part of the condition?
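A minimal sketch, assuming the intent is "greater than or approximately equal" and df stands for sc_df from the question: combine the strict comparison with numpy.isclose, which absorbs the float64 representation error:
import numpy as np

# keep rows where Sg > 0.05 or Sg equals 0.05 up to floating-point tolerance
mask = (df['Sg'] > 0.05) | np.isclose(df['Sg'], 0.05)
print(df[mask])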
I have a dataframe as below. I want to add a column based on the column 'need' (for example, in row zero the need is 1, so select the part1 value, -0.17). I have pasted the dataframe I want below. Thanks.
df = pd.DataFrame({
'date': [20130101,20130101, 20130103, 20130104, 20130105, 20130107],
'need':[1,3,2,4,3,1],
'part1':[-0.17,-1.03,1.59,-0.05,-0.1,0.9],
'part2':[0.67,-0.03,1.95,-3.25,-0.3,0.6],
'part3':[0.7,-3,1.5,-0.25,-0.37,0.62],
'part4':[0.24,-0.44,1.335,-0.45,-0.57,0.92]
})
date need output part1 part2 part3 part4
0 20130101 1 -0.17 -0.17 0.67 0.70 0.240
1 20130101 3 -3.00 -1.03 -0.03 -3.00 -0.440
2 20130103 2 1.95 1.59 1.95 1.50 1.335
3 20130104 4 -0.45 -0.05 -3.25 -0.25 -0.450
4 20130105 3 -0.37 -0.10 -0.30 -0.37 -0.570
5 20130107 1 0.90 0.90 0.60 0.62 0.920
Use DataFrame.lookup:
df['new'] = df.lookup(df.index, 'part' + df['need'].astype(str))
print(df)
date need part1 part2 part3 part4 new
0 20130101 1 -0.17 0.67 0.70 0.240 -0.17
1 20130101 3 -1.03 -0.03 -3.00 -0.440 -3.00
2 20130103 2 1.59 1.95 1.50 1.335 1.95
3 20130104 4 -0.05 -3.25 -0.25 -0.450 -0.45
4 20130105 3 -0.10 -0.30 -0.37 -0.570 -0.37
5 20130107 1 0.90 0.60 0.62 0.920 0.90
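Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0; on newer versions, use a positional NumPy lookup like the solution below.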
NumPy solution; it requires the part columns to be numbered consecutively from 1 and sorted in increasing order, as in the sample:
df['new'] = df.filter(like='part').values[np.arange(len(df)), df['need'] - 1]
print(df)
date need part1 part2 part3 part4 new
0 20130101 1 -0.17 0.67 0.70 0.240 -0.17
1 20130101 3 -1.03 -0.03 -3.00 -0.440 -3.00
2 20130103 2 1.59 1.95 1.50 1.335 1.95
3 20130104 4 -0.05 -3.25 -0.25 -0.450 -0.45
4 20130105 3 -0.10 -0.30 -0.37 -0.570 -0.37
5 20130107 1 0.90 0.60 0.62 0.920 0.90
This should work as well:
df['new'] = df.iloc[:, 1:].apply(lambda row: row['part'+str(int(row['need']))], axis=1)
date need part1 part2 part3 part4 new
0 20130101 1 -0.17 0.67 0.70 0.240 -0.17
1 20130101 3 -1.03 -0.03 -3.00 -0.440 -3.00
2 20130103 2 1.59 1.95 1.50 1.335 1.95
3 20130104 4 -0.05 -3.25 -0.25 -0.450 -0.45
4 20130105 3 -0.10 -0.30 -0.37 -0.570 -0.37
5 20130107 1 0.90 0.60 0.62 0.920 0.90
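Note that a row-wise apply like this is typically much slower than the lookup or NumPy solutions above, because it loops over rows in Python instead of operating on whole arrays.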
I am trying to create a histogram on a continuous-value column, Trip_distance, in a large 1.4M-row pandas dataframe. I wrote the following code:
fig = plt.figure(figsize=(17,10))
trip_data.hist(column="Trip_distance")
plt.xlabel("Trip_distance",fontsize=15)
plt.ylabel("Frequency",fontsize=15)
plt.xlim([0.0,100.0])
#plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
But all the values show the same frequency in the plot, which shouldn't be the case. What's wrong with the code?
Test data:
VendorID lpep_pickup_datetime Lpep_dropoff_datetime Store_and_fwd_flag RateCodeID Pickup_longitude Pickup_latitude Dropoff_longitude Dropoff_latitude Passenger_count Trip_distance Fare_amount Extra MTA_tax Tip_amount Tolls_amount Ehail_fee improvement_surcharge Total_amount Payment_type Trip_type
0 2 2015-09-01 00:02:34 2015-09-01 00:02:38 N 5 -73.979485 40.684956 -73.979431 40.685020 1 0.00 7.8 0.0 0.0 1.95 0.0 NaN 0.0 9.75 1 2.0
1 2 2015-09-01 00:04:20 2015-09-01 00:04:24 N 5 -74.010796 40.912216 -74.010780 40.912212 1 0.00 45.0 0.0 0.0 0.00 0.0 NaN 0.0 45.00 1 2.0
2 2 2015-09-01 00:01:50 2015-09-01 00:04:24 N 1 -73.921410 40.766708 -73.914413 40.764687 1 0.59 4.0 0.5 0.5 0.50 0.0 NaN 0.3 5.80 1 1.0
3 2 2015-09-01 00:02:36 2015-09-01 00:06:42 N 1 -73.921387 40.766678 -73.931427 40.771584 1 0.74 5.0 0.5 0.5 0.00 0.0 NaN 0.3 6.30 2 1.0
4 2 2015-09-01 00:00:14 2015-09-01 00:04:20 N 1 -73.955482 40.714046 -73.944412 40.714729 1 0.61 5.0 0.5 0.5 0.00 0.0 NaN 0.3 6.30 2 1.0
5 2 2015-09-01 00:00:39 2015-09-01 00:05:20 N 1 -73.945297 40.808186 -73.937668 40.821198 1 1.07 5.5 0.5 0.5 1.36 0.0 NaN 0.3 8.16 1 1.0
6 2 2015-09-01 00:00:52 2015-09-01 00:05:50 N 1 -73.890877 40.746426 -73.876923 40.756306 1 1.43 6.5 0.5 0.5 0.00 0.0 NaN 0.3 7.80 1 1.0
7 2 2015-09-01 00:02:15 2015-09-01 00:05:34 N 1 -73.946701 40.797321 -73.937645 40.804516 1 0.90 5.0 0.5 0.5 0.00 0.0 NaN 0.3 6.30 2 1.0
8 2 2015-09-01 00:02:36 2015-09-01 00:07:20 N 1 -73.963150 40.693829 -73.956787 40.680531 1 1.33 6.0 0.5 0.5 1.46 0.0 NaN 0.3 8.76 1 1.0
9 2 2015-09-01 00:02:13 2015-09-01 00:07:23 N 1 -73.896820 40.746128 -73.888626 40.752724 1 0.84 5.5 0.5 0.5 0.00 0.0 NaN 0.3 6.80 2 1.0
Trip_distance column:
0 0.00
1 0.00
2 0.59
3 0.74
4 0.61
5 1.07
6 1.43
7 0.90
8 1.33
9 0.84
10 0.80
11 0.70
12 1.01
13 0.39
14 0.56
Name: Trip_distance, dtype: float64
After 100 bins: (plot image not shown)
EDIT:
After your comments this actually makes perfect sense: you don't get a histogram bar for each distinct value because there are 1.4 million rows and only ten discrete buckets, so apparently each bucket holds close to 10% of the data (to within what you can see in the plot).
A quick rerun of your data:
In [25]: df.hist(column='Trip_distance')
It plots absolutely fine.
The df.hist function comes with an optional keyword argument bins=10, which buckets the data into discrete bins. With only 10 bins and a more or less homogeneous distribution across hundreds of thousands of rows, you might not be able to see the differences between the ten bins in your low-resolution plot:
In [34]: df.hist(column='Trip_distance', bins=50)
Here's another way to plot the data. It involves turning the datetime into an index, which might help you with future slicing:
#convert column to datetime
trip_data['lpep_pickup_datetime'] = pd.to_datetime(trip_data['lpep_pickup_datetime'])
#turn the datetime to an index
trip_data.index = trip_data['lpep_pickup_datetime']
#Plot
trip_data['Trip_distance'].plot(kind='hist')
plt.show()
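If the distribution is heavily skewed (e.g. mostly short trips), more bins plus a log-scaled y-axis can make the shape visible. A minimal sketch, assuming trip_data is loaded as above:
import matplotlib.pyplot as plt

ax = trip_data['Trip_distance'].plot(kind='hist', bins=100, figsize=(17, 10))
ax.set_xlabel('Trip_distance', fontsize=15)
ax.set_ylabel('Frequency', fontsize=15)
ax.set_xlim(0.0, 100.0)
ax.set_yscale('log')  # a log scale reveals the long right tail
plt.show()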