Pandas Qcut, rounding values, Python

I want to create a key for the qcut bins I have from my data set.
Below I have split the data from the 'Total' column into ten bins, dropped the duplicates, and sorted the values so I can see the bin edges in order. This is what the bins look like without using 'precision'.
bin_key = pd.qcut(df['Total'], 10).drop_duplicates().sort_values()
bin_key.reset_index(drop=True, inplace=True)
bin_key
Output:
0 (11.199, 7932.26]
1 (7932.26, 15044.289]
2 (15044.289, 22709.757]
3 (22709.757, 32762.481]
4 (32762.481, 43491.146]
5 (43491.146, 55728.56]
6 (55728.56, 72823.314]
7 (72823.314, 100161.814]
8 (100161.814, 156406.846]
9 (156406.846, 1310448.18]
I want to round the values to the nearest thousand. Using precision=-3 it looks like this:
bin_key = pd.qcut(df['Total'], 10, precision=-3).drop_duplicates().sort_values()
bin_key.reset_index(drop=True, inplace=True)
bin_key
Output
0 (-1000.0, 8000.0]
1 (8000.0, 15000.0]
2 (15000.0, 23000.0]
3 (23000.0, 33000.0]
4 (33000.0, 43000.0]
5 (43000.0, 56000.0]
6 (56000.0, 73000.0]
7 (73000.0, 100000.0]
8 (100000.0, 156000.0]
9 (156000.0, 1310000.0]
How can I round to 0 rather than -1000?
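One workaround (a sketch, not from the original thread): grab the raw decile edges with retbins=True, round them to the nearest thousand yourself, pin the lowest edge to 0, and re-bin with pd.cut. The data below is synthetic, standing in for the question's 'Total' column:
import numpy as np
import pandas as pd

# synthetic stand-in for the question's 'Total' column
rng = np.random.default_rng(0)
df = pd.DataFrame({'Total': rng.lognormal(mean=10.5, sigma=1.2, size=1000)})

# retbins=True also returns the raw decile edges
_, edges = pd.qcut(df['Total'], 10, retbins=True)

# round the edges to the nearest thousand, then pin the lowest edge at 0
# (qcut pads the first edge slightly below the data minimum, which is why
# precision=-3 rounds it down to -1000)
edges = np.round(edges / 1000) * 1000
edges[0] = 0

# re-bin against the cleaned edges; np.unique drops any edges that collided
# after rounding, and include_lowest keeps the minimum in the first bin
bin_key = pd.cut(df['Total'], bins=np.unique(edges), include_lowest=True)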

Related

Why do I lose numerical precision when extracting element from list in python?

I have a pandas dataframe that looks like this:
data
0 [26.113017616106, 106.948066803935, 215.488217...
1 [26.369709448639, 106.961107298101, 215.558911...
2 [26.261267444521, 106.991763898421, 215.384122...
3 [26.285746968657, 106.912377030428, 215.287348...
4 [26.155342026996, 106.825440402654, 215.114619...
5 [26.159917638984, 106.819720887669, 215.117593...
6 [26.023564401739, 106.843056508808, 215.129947...
7 [26.1155342027, 106.828185769847, 215.15991763...
8 [26.028826355525, 106.841912605811, 215.146190...
9 [26.015099519561, 106.824296499657, 215.130404...
I am trying to extract the element at index 1 from the Series of lists using this code:
[x[1] for x in df.data]
and I get this result:
0 106.948067
1 106.961107
2 106.991764
3 106.912377
4 106.825440
5 106.819721
6 106.843057
7 106.828186
8 106.841913
9 106.824296
Why do I lose precision and what can I do to keep it?
By default, pandas displays floating-point values with 6 digits of precision; the underlying floats are unchanged, only the display is truncated.
You can control the display precision with pandas' set_option, e.g.
pd.set_option('display.precision', 12)
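A quick way to convince yourself nothing was actually lost (made-up values in the same shape as the question's data):
import pandas as pd

df = pd.DataFrame({'data': [[26.113017616106, 106.948066803935]]})
val = df['data'].iloc[0][1]
print(val)            # 106.948066803935 -- the stored value is intact
print(f'{val:.12f}')  # string formatting shows it too, without touching pandas options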

Pandas reorder rows of dataframe

I stumbled upon a very peculiar problem in Pandas. I have this dataframe:
,time,id,X,Y,theta,Vx,Vy,ANGLE_FR,DANGER_RAD,RISK_RAD,TTC_DAN_LOW,TTC_DAN_UP,TTC_STOP,SIM
0,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,2.0,3
1,1600349033921620000,1,18.5371406,-14.224917,0,-0.0113912,1.443597,20,0.5,0.9,-1,7,2.0,3
2,1600349033921650000,2,19.808648100000006,-6.778450599999998,0,0.037289,-1.0557937,20,0.5,0.9,-1,7,2.0,3
3,1600349033921670000,3,22.1796988,-5.7078115999999985,0,0.2585675,-1.2431861000000002,20,0.5,0.9,-1,7,2.0,3
4,1600349033921670000,4,20.757325,-16.115366,0,-0.2528627,0.7889673,20,0.5,0.9,-1,7,2.0,3
5,1600349033921690000,5,20.9491012,-17.7806833,0,0.5062633,0.9386511,20,0.5,0.9,-1,7,2.0,3
6,1600349033921690000,6,20.6225258,-5.5344404,0,-0.1192678,-0.7889041,20,0.5,0.9,-1,7,2.0,3
7,1600349033921700000,7,21.8077004,-14.736984,0,-0.0295737,1.3084618,20,0.5,0.9,-1,7,2.0,3
8,1600349033954560000,0,23.206789800000006,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,2.0,3
9,1600349033954570000,1,18.555421300000006,-13.7440508,0,0.0548418,1.4426004,20,0.5,0.9,-1,7,2.0,3
10,1600349033954570000,2,19.8409748,-7.126075500000002,0,0.0969802,-1.0428747,20,0.5,0.9,-1,7,2.0,3
11,1600349033954580000,3,22.3263185,-5.9586202,0,0.4398591,-0.752425,20,0.5,0.9,-1,7,2.0,3
12,1600349033954590000,4,20.7154136,-15.842398800000002,0,-0.12573430000000002,0.8189016,20,0.5,0.9,-1,7,2.0,3
13,1600349033954590000,5,21.038901,-17.4111883,0,0.2693992,1.108485,20,0.5,0.9,-1,7,2.0,3
14,1600349033954600000,6,20.612499,-5.810969,0,-0.030080400000000007,-0.8295869,20,0.5,0.9,-1,7,2.0,3
15,1600349033954600000,7,21.7872537,-14.3011986,0,-0.0613401,1.3073578,20,0.5,0.9,-1,7,2.0,3
16,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,2
17,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,2
18,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,2
This is the input file.
Please note that id always runs from 0 to 7 and then repeats, and the time column increases in sequential steps (which implies each row's time should be greater than or equal to the previous row's).
I would like to reorder the rows of the dataframe as shown below.
,time,id,X,Y,theta,Vx,Vy,ANGLE_FR,DANGER_RAD,RISK_RAD,TTC_DAN_LOW,TTC_DAN_UP,TTC_STOP,SIM
0,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.0,2
1,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.0,2
2,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.0,2
3,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,1
4,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,1
5,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,1
6,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,2
7,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,2
8,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,2
9,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,3
10,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,3
11,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,3
This is the desired result.
Please note that I need to reorder the dataframe rows based on these columns: id, time, ANGLE_FR, DANGER_RAD, RISK_RAD, TTC_DAN_LOW, TTC_DAN_UP, TTC_STOP, SIM.
As you can see from the desired result, the dataframe needs to be reordered so that the time column runs from smallest to largest, and the same holds for the rest of the columns: id, SIM, ANGLE_FR, DANGER_RAD, RISK_RAD, TTC_DAN_LOW, TTC_DAN_UP, TTC_STOP.
I tried to sort by several columns without success. I also tried to use groupby, but failed.
Could you help me solve this problem? Any suggestions are welcome.
P.S.
I have pasted the dataframes as plain text so they can be read easily with the read_clipboard function, to keep this easily reproducible.
I am attaching a picture as well.
What did you try when you sorted by several columns? For instance:
In [10]: df.sort_values(['id', 'time', 'ANGLE_FR', 'DANGER_RAD', 'RISK_RAD', 'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP', 'SIM'])
Out[10]:
Unnamed: 0 time id X Y theta Vx Vy ANGLE_FR DANGER_RAD RISK_RAD TTC_DAN_LOW TTC_DAN_UP TTC_STOP SIM
0 0 1600349033921610000 0 23.2644 -7.1409 0 0.0210 -1.1414 20 0.5 0.9 -1 7 2 3
8 8 1600349033954560000 0 23.2068 -7.5171 0 -0.1728 -1.1285 20 0.5 0.9 -1 7 2 3
1 1 1600349033921620000 1 18.5371 -14.2249 0 -0.0114 1.4436 20 0.5 0.9 -1 7 2 3
9 9 1600349033954570000 1 18.5554 -13.7441 0 0.0548 1.4426 20 0.5 0.9 -1 7 2 3
2 2 1600349033921650000 2 19.8086 -6.7785 0 0.0373 -1.0558 20 0.5 0.9 -1 7 2 3
How about this:
groupby_cols = ['ANGLE_FR', 'DANGER_RAD', 'RISK_RAD', 'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP', 'SIM']
# applying the identity re-emits the rows in grouped (sorted-key) order
df = df.groupby(groupby_cols, group_keys=False).apply(lambda g: g).reset_index(drop=True)
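If the goal is the exact ordering shown in the desired result (all rows for one id together, grouped by the run-configuration columns, and chronological within each run), a plain sort_values with id first and time last may be closer; a sketch, assuming the question's CSV sample has been copied to the clipboard:
import pandas as pd

# the question's CSV sample; the first unnamed column is the old index
df = pd.read_clipboard(sep=',', index_col=0)

# id first, then the per-run configuration columns, then time, so each
# id's trajectory stays in chronological order within a run
key = ['id', 'ANGLE_FR', 'DANGER_RAD', 'RISK_RAD',
       'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP', 'SIM', 'time']
df = df.sort_values(key).reset_index(drop=True)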

Merging two dataframes based on index

I've been on this all night, and just can't figure it out, even though I know it should be simple. So, my sincerest apologies for the following incantation from a sleep-deprived fellow:
So, I have four fields: Employee ID, Name, Station and Shift (ID is a non-null integer; the rest are strings or null).
I have about 10 dataframes, all indexed by ID, each containing only two columns: either (Name and Station) or (Name and Shift).
Now, of course, I want to combine all of this into one dataframe with a unique row for each ID.
But I'm really frustrated at this point (especially because I can't find a way to directly check how many unique indices my final dataframe ends up with).
After messing around with some very ugly ways of using .merge(), I finally found .concat(). But it keeps making multiple rows per ID; when I check in Excel, the indices are like Table1/1234, Table2/1234, etc. One row has the shift, the other has the station, which is precisely what I'm trying to avoid.
How do I compile all my data into one dataframe with exactly one row per ID? Preferably without using 9 different merge statements, as I have to scale up later.
If I understand your question correctly, this should do what you want.
For example, with these 3 dataframes:
In [1]: df1
Out[1]:
0 1 2
0 3.588843 3.566220 6.518865
1 7.585399 4.269357 4.781765
2 9.242681 7.228869 5.680521
3 3.600121 3.931781 4.616634
4 9.830029 9.177663 9.842953
5 2.738782 3.767870 0.925619
6 0.084544 6.677092 1.983105
7 5.229042 4.729659 8.638492
8 8.575547 6.453765 6.055660
9 4.386650 5.547295 8.475186
In [2]: df2
Out[2]:
0 1
0 95.013170 90.382886
2 1.317641 29.600709
4 89.908139 21.391058
6 31.233153 3.902560
8 17.186079 94.768480
In [3]: df
Out[3]:
0 1 2
0 0.777689 0.357484 0.753773
1 0.271929 0.571058 0.229887
2 0.417618 0.310950 0.450400
3 0.682350 0.364849 0.933218
4 0.738438 0.086243 0.397642
5 0.237481 0.051303 0.083431
6 0.543061 0.644624 0.288698
7 0.118142 0.536156 0.098139
8 0.892830 0.080694 0.084702
9 0.073194 0.462129 0.015707
You can do
pd.concat([df,df1,df2], axis=1)
This produces
In [6]: pd.concat([df,df1,df2], axis=1)
Out[6]:
0 1 2 0 1 2 0 1
0 0.777689 0.357484 0.753773 3.588843 3.566220 6.518865 95.013170 90.382886
1 0.271929 0.571058 0.229887 7.585399 4.269357 4.781765 NaN NaN
2 0.417618 0.310950 0.450400 9.242681 7.228869 5.680521 1.317641 29.600709
3 0.682350 0.364849 0.933218 3.600121 3.931781 4.616634 NaN NaN
4 0.738438 0.086243 0.397642 9.830029 9.177663 9.842953 89.908139 21.391058
5 0.237481 0.051303 0.083431 2.738782 3.767870 0.925619 NaN NaN
6 0.543061 0.644624 0.288698 0.084544 6.677092 1.983105 31.233153 3.902560
7 0.118142 0.536156 0.098139 5.229042 4.729659 8.638492 NaN NaN
8 0.892830 0.080694 0.084702 8.575547 6.453765 6.055660 17.186079 94.768480
9 0.073194 0.462129 0.015707 4.386650 5.547295 8.475186 NaN NaN
For more details you might want to see the documentation for pd.concat.
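As for directly checking how many unique indices the final dataframe ends up with (the check the question says was hard to find), the index itself answers that; a small follow-on to the example above:
combined = pd.concat([df, df1, df2], axis=1)
print(combined.index.nunique())  # number of distinct IDs
print(combined.index.is_unique)  # True only if there is exactly one row per ID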
Just a tip: putting simple illustrative data in your question always helps in getting an answer.

pandas plotting skipping xtick labels

I am trying to display two pandas Series objects together, which works, except that not all of the labels are displayed.
I am trying to plot the two Series together like this:
plt.figure()
sns.set_style('ticks')
ts86['Gene'].value_counts().plot(kind='area')
l97['Gene'].value_counts().plot(kind='area')
sns.despine(offset=10)
But only one of the indexes is displayed.
Here are the two Series that I have:
one
TIIIh 25
TET2-2 24
IDH2 15
TIIIa 14
TIIIb 12
TIIIj 11
TIIIp 9
p53-1 9
SF3B1 8
TIIIe 8
KRAS-1 7
TIIIo 6
TIIId 6
TET2-1 6
GATA1 5
p53-3 5
HRAS 5
NRAS-2 4
IDH1 4
TIIIq 4
JAK2 4
TIIIc 4
TIIIf 3
TIIIg 3
TIIIm 3
KRAS-2 3
p53-2 3
TIIIk 3
TIIIn 2
DNMT3a 1
and
two
p53-1 17
p53-2 2
NRAS-2 2
p53-3 1
KRAS-2 1
Your output graph shows the value_counts of the 2 dataframes, but the index orders are no longer the same, so there is no way to show meaningful xticks at this point (e.g. the highest count in df1 is TIIIh while that of df2 is p53-1, yet you are trying to plot them together while preserving the order).
Let's simply merge df1 and df2 first, turning each value_counts result into a frame keyed by the gene name (I named TIIIh and so on id to use as the merge key):
ts86_counts = ts86['Gene'].value_counts().rename_axis('id').reset_index()
l97_counts = l97['Gene'].value_counts().rename_axis('id').reset_index()
# both count columns are named 'Gene', so the merge suffixes them _x and _y
combi = pd.merge(ts86_counts, l97_counts, on='id', how='left')
combi = combi.set_index('id')
And then, plot each column and show all xticks:
ax = combi['Gene_x'].plot(kind='area', figsize=(10, 3))
combi['Gene_y'].plot(kind='area', figsize=(10, 3))
ax.set_xticks(range(combi.shape[0]))
ax.set_xticklabels(combi.index, rotation=90)
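One caveat with the left join above: genes that appear only in the first set end up as NaN in Gene_y, which the area plot may not handle gracefully; filling them with zero first is a reasonable guard (an assumption about the intended semantics, not part of the original answer):
combi['Gene_y'] = combi['Gene_y'].fillna(0)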
Now you get a plot with all of the gene labels shown on the x-axis.
Hope this helps.

Drop pandas dataframe row based on max value of a column

I have a Dataframe like so:
p_rel y_BET sq_resid
1 0.069370 41.184996 0.292942
2 0.116405 43.101090 0.010953
3 0.173409 44.727748 0.036832
4 0.225629 46.681293 0.540616
5 0.250682 46.980616 0.128191
6 0.294650 47.446113 0.132367
7 0.322530 48.078038 0.235047
How do I get rid of the fourth row, since it has the max value of sq_resid? Note: the max will change from dataset to dataset, so just removing the 4th row isn't enough.
I have tried several things; for example, I can blank out the max value, which leaves the dataframe like below, but I haven't been able to remove the whole row.
p_rel y_BET sq_resid
1 0.069370 41.184996 0.292942
2 0.116405 43.101090 0.010953
3 0.173409 44.727748 0.036832
4 0.225629 46.681293 NaN
5 0.250682 46.980616 0.128191
6 0.294650 47.446113 0.132367
7 0.322530 48.078038 0.235047
You could just filter the df like so:
In [255]:
df.loc[df['sq_resid']!=df['sq_resid'].max()]
Out[255]:
p_rel y_BET sq_resid
1 0.069370 41.184996 0.292942
2 0.116405 43.101090 0.010953
3 0.173409 44.727748 0.036832
5 0.250682 46.980616 0.128191
6 0.294650 47.446113 0.132367
or drop using idxmax, which returns the index label of the row containing the max value (only the first such row, if there are ties):
In [257]:
df.drop(df['sq_resid'].idxmax())
Out[257]:
p_rel y_BET sq_resid
1 0.069370 41.184996 0.292942
2 0.116405 43.101090 0.010953
3 0.173409 44.727748 0.036832
5 0.250682 46.980616 0.128191
6 0.294650 47.446113 0.132367
7 0.322530 48.078038 0.235047
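If several rows tie for the maximum, the two approaches differ: the boolean filter removes every tied row, while drop with idxmax removes only the first. To drop all tied rows by label instead, something like this (a sketch, not from the original answer) would work:
df.drop(df.index[df['sq_resid'] == df['sq_resid'].max()])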
