How to obtain a subset of a dataframe based on column values? - python

I have a dataframe with the following columns and values:
pop_b
CT (mm) A B C D adultos_perc min max class_center Y
0 100- 110 40 0 0 0 0.000000 100 110 105 inf
1 110-120 72 0 0 0 0.000000 110 120 115 inf
2 120-130 108 12 0 0 0.100000 120 130 125 2.197225
3 130-140 112 41 7 0 0.300000 130 140 135 0.847298
4 140-150 92 70 18 4 0.500000 140 150 145 0.000000
5 150-160 60 98 34 7 0.698492 150 160 155 -0.840129
6 160-170 27 105 36 16 0.853261 160 170 165 -1.760409
7 170-180 0 87 38 21 1.000000 170 180 175 -inf
8 180-190 0 45 28 7 1.000000 180 190 185 -inf
9 190-200 0 15 9 6 1.000000 190 200 195 -inf
10 200-210 0 7 3 2 1.000000 200 210 205 -inf
11 210-220 0 4 2 2 1.000000 210 220 215 -inf
12 220-230 0 6 3 2 1.000000 220 230 225 -inf
13 230-240 0 8 3 2 1.000000 230 240 235 -inf
I want to create a new dataframe containing only the rows whose "Y" value is neither 'inf' nor '-inf'.
The dataframe has the following dtypes:
CT (mm) object
A int64
B int64
C int64
D int64
adultos_perc float64
min int64
max int64
class_center int64
Y float64
dtype: object

You could use between:
out = df[df['Y'].between(-float('inf'), float('inf'), inclusive='neither')]
or gt and lt wrappers chained together with &:
out = df[df['Y'].gt(-float('inf')) & df['Y'].lt(float('inf'))]
Output:
CT(mm) A B C D adultos_perc min max class_center Y
2 120-130 108 12 0 0 0.100000 120 130 125 2.197225
3 130-140 112 41 7 0 0.300000 130 140 135 0.847298
4 140-150 92 70 18 4 0.500000 140 150 145 0.000000
5 150-160 60 98 34 7 0.698492 150 160 155 -0.840129
6 160-170 27 105 36 16 0.853261 160 170 165 -1.760409
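As a further option, NumPy's isfinite can build the same mask in a single call. The sketch below uses a small hypothetical frame (only two of the original columns, values picked from the question) rather than the full pop_b data. Note that isfinite also excludes NaN, as do the comparisons above.

```python
import numpy as np
import pandas as pd

# Hypothetical, shortened stand-in for the dataframe in the question
df = pd.DataFrame({
    "class_center": [105, 125, 145, 175],
    "Y": [np.inf, 2.197225, 0.0, -np.inf],
})

# np.isfinite is False for inf, -inf and NaN, True for ordinary numbers
mask = np.isfinite(df["Y"])
out = df[mask]
print(out)
```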

Related

FB Prophet daily prediction does not give accurate results for missing values

My dataframe (df) contains two inputs, UnitShrtDescr and SchShrtDescr.
For a particular UnitShrtDescr and SchShrtDescr, the model must predict the next value. But my data contains many missing dates (the output for the in-between dates is 0).
During prediction, Prophet predicts a value for every single day without treating the in-between dates as empty. How can I resolve this?
>df #(main dataframe)
>
UnitShrtDescr SchShrtDescr y ds id
8110 50 93 1 2011-12-01 243
3437 29 87 1 2011-12-21 133
6867 43 75 1 2011-12-23 204
1102 8 23 1 2011-12-28 36
5271 36 14 1 2011-12-28 166
... ... ... ... ... ...
13138 83 0 1 2018-05-18 390
14424 92 3 1 2018-05-18 432
11556 69 0 1 2018-05-18 334
11767 69 5 1 2018-05-18 338
4458 30 102 1 2018-05-18 141
15950 rows × 5 columns
code:
model = Prophet(daily_seasonality=True)
model.add_regressor("UnitShrtDescr")
model.add_regressor("SchShrtDescr")
model.fit(df)
The regressor values I want to predict for are UnitShrtDescr=40 and SchShrtDescr=93, so I built the future dataframe with make_future_dataframe:
future = model.make_future_dataframe(periods=100, include_history=False)
future["UnitShrtDescr"]=40
future["SchShrtDescr"]=93
The previous values for UnitShrtDescr=40 and SchShrtDescr=93 were:
>dfx[(dfx['UnitShrtDescr']==40) & (dfx['SchShrtDescr']==93)].tail(10)
>
UnitShrtDescr SchShrtDescr y ds id
6293 40 93 1 2018-02-27 189
6294 40 93 3 2018-02-28 189
6295 40 93 1 2018-03-17 189
6296 40 93 1 2018-03-29 189
6297 40 93 1 2018-03-30 189
6298 40 93 4 2018-03-31 189
6299 40 93 1 2018-04-26 189
6300 40 93 1 2018-04-27 189
6301 40 93 4 2018-04-30 189
6302 40 93 1 2018-05-16 189
Please note that the gaps between dates are large, which means y is 0 for the dates in between.
So when I make a prediction, the in-between dates should also be predicted as 0.
But in this case it predicts a non-zero y for every day, without treating the in-between y values as 0:
output = model.predict(future)
>output[['ds','yhat']].head(10)
>
ds yhat
0 2018-05-19 2.959505
1 2018-05-20 2.631181
2 2018-05-21 2.418850
3 2018-05-22 2.411914
4 2018-05-23 2.386383
5 2018-05-24 2.444841
6 2018-05-25 2.409294
7 2018-05-26 2.937428
8 2018-05-27 2.588136
9 2018-05-28 2.358953
Please suggest changes or a better alternative for my case.
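One possible direction, offered only as a sketch: if the missing dates really do mean y = 0, those zeros can be written into the training data explicitly before fitting, so the model sees them. The example below uses pandas only and a tiny hypothetical series for a single UnitShrtDescr/SchShrtDescr pair (three dates taken from the tail shown above). Whether this improves the Prophet forecast is an assumption that would need checking on the real data.

```python
import pandas as pd

# Hypothetical observations for one (UnitShrtDescr, SchShrtDescr) pair
obs = pd.DataFrame({
    "ds": pd.to_datetime(["2018-04-26", "2018-04-27", "2018-04-30"]),
    "y": [1, 1, 4],
})

# Every calendar day between the first and last observation
full_range = pd.date_range(obs["ds"].min(), obs["ds"].max(), freq="D")

# Reindex onto the daily range; days without an observation get y = 0
daily = (
    obs.set_index("ds")["y"]
    .reindex(full_range, fill_value=0)
    .rename_axis("ds")
    .reset_index()
)
print(daily)
```

The resulting frame keeps the ds and y column names used in the question, with one row per day.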

Read online excel file with a specific sheet and only selected columns

I have to read the CTG.xls file with pandas from the following path:
https://archive.ics.uci.edu/ml/machine-learning-databases/00193/.
From this file I have to select the sheet Data, and from that sheet the columns K to AT, so that in the end I have a dataset with these columns:
["LB","AC","FM","UC","DL","DS","DP","ASTV","MSTV","ALTV" ,"MLTV" ,"Width","Min","Max" ,"Nmax","Nzeros","Mode","Mean" ,"Median" ,"Variance" ,"Tendency" ,"CLASS","NSP"]
How can I do this using the read function in pandas?
Use:
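import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00193/CTG.xls'
# read the 'Data' sheet, ignoring the last 3 rows of the sheet
df = pd.read_excel(url, sheet_name='Data', skipfooter=3)
# drop the columns whose auto-generated name contains 'Unnamed'
df = df.drop(columns=df.filter(like='Unnamed').columns)
# promote the first row to column names, then remove it from the data
df.columns = df.iloc[0].to_list()
df = df[1:].reset_index(drop=True)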
Output
LB AC FM UC DL DS DP ASTV MSTV ALTV MLTV Width Min Max Nmax Nzeros Mode Mean Median Variance Tendency CLASS NSP
0 120 0 0 0 0 0 0 73 0.5 43 2.4 64 62 126 2 0 120 137 121 73 1 9 2
1 132 0.00638 0 0.00638 0.00319 0 0 17 2.1 0 10.4 130 68 198 6 1 141 136 140 12 0 6 1
2 133 0.003322 0 0.008306 0.003322 0 0 16 2.1 0 13.4 130 68 198 5 1 141 135 138 13 0 6 1
3 134 0.002561 0 0.007682 0.002561 0 0 16 2.4 0 23 117 53 170 11 0 137 134 137 13 1 6 1
4 132 0.006515 0 0.008143 0 0 0 16 2.4 0 19.9 117 53 170 9 0 137 136 138 11 1 2 1
... ... ... ... ... ... .. .. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ..
2121 140 0 0 0.007426 0 0 0 79 0.2 25 7.2 40 137 177 4 0 153 150 152 2 0 5 2
2122 140 0.000775 0 0.006971 0 0 0 78 0.4 22 7.1 66 103 169 6 0 152 148 151 3 1 5 2
2123 140 0.00098 0 0.006863 0 0 0 79 0.4 20 6.1 67 103 170 5 0 153 148 152 4 1 5 2
2124 140 0.000679 0 0.00611 0 0 0 78 0.4 27 7 66 103 169 6 0 152 147 151 4 1 5 2
2125 142 0.001616 0.001616 0.008078 0 0 0 74 0.4 36 5 42 117 159 2 1 145 143 145 1 0 1 1
[2126 rows x 23 columns]
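The clean-up steps in this answer (drop the 'Unnamed' columns, then promote the first row to the header) can be tried offline on a small frame. The frame below is a hypothetical stand-in for the raw sheet, not the real CTG data; col_a and col_b are made-up placeholder headers. Separately, read_excel also accepts a usecols argument with Excel column letters (for example "K:AT"), which may be an alternative, though it has not been checked against this particular file.

```python
import pandas as pd

# Hypothetical raw sheet: placeholder headers, real column names in row 0
raw = pd.DataFrame(
    [["x", "LB", "AC", "y"],
     [None, 120, 0.0, None],
     [None, 132, 0.00638, None]],
    columns=["Unnamed: 0", "col_a", "col_b", "Unnamed: 3"],
)

# Drop every column whose name contains 'Unnamed'
df = raw.drop(columns=raw.filter(like="Unnamed").columns)

# Use the first row as the header, then remove it from the data
df.columns = df.iloc[0].to_list()
df = df[1:].reset_index(drop=True)
print(df)
```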

Plotting in pivot table using label

My dataset:
df
Month 1 2 3 4 5 Label
Name
A 120 80.5 120 105.5 140 0
B 80 110 98.5 105 100 1
C 150 90.5 105 120 190 2
D 100 105 98.5 110 120 1
...
To draw a plot by Month, I transposed the dataframe:
df = df.T
df
Name A B C D
Month
1 120 80 150 100
2 80.05 110 90.5 105
3 130 98.5 105 98.005
4 105.5 105 120 110
5 140 100 190 120
Label 0.00 1.0 2.0 1.000
Ultimately, I want to draw a plot with Month on the x-axis and the value on the y-axis.
However, I have two questions.
Q1.
After transposing, the dtype of 'Label' changes (int -> float).
Can only the 'Label' row be kept as int?
The output I want:
df = df.T
df
Name A B C D
Month
1 120 80 150 100
2 80.05 110 90.5 105
3 130 98.5 105 98.005
4 105.5 105 120 110
5 140 100 190 120
Label 0 1 2 1
Q2.
Q1 is really a step towards Q2.
When drawing the plot, I want to group the lines by label (like seaborn's hue).
Is there a way to do such grouping when plotting from the pivot table above?
(Either matplotlib or seaborn is fine.)
The label doesn't have to be int, so if Q2 can be solved directly, there is no need to answer Q1.
Thank you for reading.
Q2: You need to reshape the values, e.g. with DataFrame.melt, so that hue can be used:
df1 = df.reset_index().melt(['Name','Label'])
print (df1)
sns.stripplot(data=df1,hue='Label',x='Name',y='value')
Q1: pandas does not support this, because each column holds a single dtype. Even if you convert the Label row to int, the values are stored as floats again:
df = df.T
df.loc['Label', :] = df.loc['Label', :].astype(int)
print (df)
Name A B C D
1 120.0 80.0 150.0 100.0
2 80.5 110.0 90.5 105.0
3 120.0 98.5 105.0 98.5
4 105.5 105.0 120.0 110.0
5 140.0 100.0 190.0 120.0
Label 0.0 1.0 2.0 1.0
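The upcast can be reproduced on a two-row sample. The sketch below is an illustration with made-up variable names (labels, wide, mixed); it also shows one possible workaround, which is to keep the integer labels in their own Series and transpose only the month columns.

```python
import pandas as pd

# Shortened, hypothetical version of the frame in the question
df = pd.DataFrame(
    {1: [120.0, 80.0], 2: [80.5, 110.0], "Label": [0, 1]},
    index=pd.Index(["A", "B"], name="Name"),
)

# Keep the integer labels separately; pop removes the column from df
labels = df.pop("Label")

# Transposing only the float month columns causes no dtype change
wide = df.T

# Transposing floats and ints together upcasts everything to float
mixed = df.assign(Label=labels).T
print(mixed.dtypes)
```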
EDIT:
df1 = df.reset_index().melt(['Name','Label'], var_name='Month')
print (df1)
Name Label Month value
0 A 0 1 120.0
1 B 1 1 80.0
2 C 2 1 150.0
3 D 1 1 100.0
4 A 0 2 80.5
5 B 1 2 110.0
6 C 2 2 90.5
7 D 1 2 105.0
8 A 0 3 120.0
9 B 1 3 98.5
10 C 2 3 105.0
11 D 1 3 98.5
12 A 0 4 105.5
13 B 1 4 105.0
14 C 2 4 120.0
15 D 1 4 110.0
16 A 0 5 140.0
17 B 1 5 100.0
18 C 2 5 190.0
19 D 1 5 120.0
sns.lineplot(data=df1,hue='Label',x='Month',y='value')
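For a version that runs without seaborn, the sketch below rebuilds a two-month sample of the data, melts it the same way, and computes the per-label mean for each month. With its default settings, sns.lineplot draws the mean of the values sharing an x and hue value, so this table approximates what the lines show. The name per_label is made up for the illustration.

```python
import pandas as pd

# Two-month, hypothetical subset of the data in the question
df = pd.DataFrame(
    {1: [120.0, 80.0, 150.0, 100.0],
     2: [80.5, 110.0, 90.5, 105.0],
     "Label": [0, 1, 2, 1]},
    index=pd.Index(["A", "B", "C", "D"], name="Name"),
)

# Long format: one row per (Name, Label, Month) combination
df1 = df.reset_index().melt(["Name", "Label"], var_name="Month")

# Mean value per month for each label (rows: Month, columns: Label)
per_label = df1.groupby(["Label", "Month"])["value"].mean().unstack("Label")
print(per_label)
```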

Find the maximum value for each streak of numbers in another column in pandas

I have a dataframe like this:
df = pd.DataFrame({'dir': [1,1,1,1,0,0,1,1,1,0], 'price':np.random.randint(100,200,10)})
dir price
0 1 100
1 1 150
2 1 190
3 1 194
4 0 152
5 0 151
6 1 131
7 1 168
8 1 112
9 0 193
I want a new column that shows the maximum price within each run where dir is 1, resetting whenever dir is 0.
My desired outcome looks like this:
dir price max
0 1 100 194
1 1 150 194
2 1 190 194
3 1 194 194
4 0 152 NaN
5 0 151 NaN
6 1 131 168
7 1 168 168
8 1 112 168
9 0 193 NaN
Use transform with max for filtered rows:
# label each run of consecutive equal values with its own group id
g = df['dir'].ne(df['dir'].shift()).cumsum()
# mask that keeps only the rows where dir is 1
m = df['dir'] == 1
df['max'] = df[m].groupby(g)['price'].transform('max')
print (df)
dir price max
0 1 100 194.0
1 1 150 194.0
2 1 190 194.0
3 1 194 194.0
4 0 152 NaN
5 0 151 NaN
6 1 131 168.0
7 1 168 168.0
8 1 112 168.0
9 0 193 NaN
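A self-contained variant with the prices from the question hard-coded (instead of random) is sketched below. It differs slightly from the answer: the maximum is computed for every run and the rows with dir equal to 0 are blanked afterwards with where, which gives the same column for this data.

```python
import pandas as pd

df = pd.DataFrame({
    "dir":   [1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
    "price": [100, 150, 190, 194, 152, 151, 131, 168, 112, 193],
})

# A new group id starts whenever dir differs from the previous row
g = df["dir"].ne(df["dir"].shift()).cumsum()

# Rows that belong to a streak of 1s
m = df["dir"].eq(1)

# Per-run maximum, kept only where dir is 1 (NaN elsewhere)
df["max"] = df.groupby(g)["price"].transform("max").where(m)
print(df)
```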

Adding a row from a dataframe into another by matching columns with NaN values in row pandas python

The Scenario:
I have two dataframes, fc0 and yc0. fc0 is a cluster, and yc0 is another dataframe that needs to be merged into fc0.
The Nature of data is as follows:
fc0
uid 1 2 3 4 5 6
234 235 4.000000 4.074464 4.128026 3.973045 3.921663 4.024864
235 236 3.524208 3.125669 3.652112 3.626923 3.524318 3.650589
236 237 4.174080 4.226267 4.200133 4.150983 4.124157 4.200052
yc0
iid uid 1 2 5 6 9 15
0 944 5.0 3.0 4.0 3.0 3.0 5.0
The Twist
fc0 has 1682 columns and yc0 has only a few hundred values. I need the yc0 row to go into fc0, matched by column.
In my haste to resolve it, I even tried yc0.reset_index(inplace=True), but that didn't really help.
Expected Output
uid 1 2 3 4 5 6
234 235 4.000000 4.074464 4.128026 3.973045 3.921663 4.024864
235 236 3.524208 3.125669 3.652112 3.626923 3.524318 3.650589
236 237 4.174080 4.226267 4.200133 4.150983 4.124157 4.200052
944 5.0 3.0 NaN NaN 4.0 3.0 3.0
References
Link1 Tried this, but landed up inserting NaN values for 1st 16 Columns and rest of the data shifted by that many columns
Link2 Couldn't match column keys, besides I tried it for row.
Link3 Merging doesn't match the columns in it.
Link4 Concatenation doesn't work that way.
Link5 Same issues with Join.
EDIT 1
fc0.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 235 entries, 234 to 468
Columns: 1683 entries, uid to 1682
dtypes: float64(1682), int64(1)
memory usage: 3.0 MB
and
yc0.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Columns: 336 entries, uid to 1007
dtypes: float64(335), int64(1)
memory usage: 2.7 KB
Here's an MCVE. Does this small sample data show the functionality that you are expecting?
df1 = pd.DataFrame(np.random.randint(0,100,(5,4)), columns=list('ABCE'))
A B C E
0 81 57 54 88
1 63 63 74 10
2 13 89 88 66
3 90 81 3 31
4 66 93 55 4
df2 = pd.DataFrame(np.random.randint(0,100,(5,4)), columns=list('BCDE'))
B C D E
0 93 48 62 25
1 24 97 52 88
2 53 50 21 13
3 81 27 7 81
4 10 21 77 19
df_out = pd.concat([df1,df2])
print(df_out)
Output:
A B C D E
0 81.0 57 54 NaN 88
1 63.0 63 74 NaN 10
2 13.0 89 88 NaN 66
3 90.0 81 3 NaN 31
4 66.0 93 55 NaN 4
0 NaN 93 48 62.0 25
1 NaN 24 97 52.0 88
2 NaN 53 50 21.0 13
3 NaN 81 27 7.0 81
4 NaN 10 21 77.0 19
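Applied to frames shaped like fc0 and yc0, the same idea might look like the sketch below. The values and the reduced set of columns are hypothetical, and restricting the result to fc0's columns is an assumption based on the expected output in the question. pd.concat aligns on column names, so columns missing from yc0 become NaN in the appended row.

```python
import pandas as pd

# Hypothetical, much smaller stand-ins for fc0 and yc0
fc0 = pd.DataFrame(
    {"uid": [235, 236], "1": [4.0, 3.5], "2": [4.1, 3.1], "3": [4.1, 3.7]},
    index=[234, 235],
)
yc0 = pd.DataFrame({"uid": [944], "1": [5.0], "2": [3.0], "9": [3.0]})

# Stack the rows (columns are matched by name), then keep only fc0's columns
out = pd.concat([fc0, yc0], ignore_index=True)[fc0.columns]
print(out)
```

Here ignore_index=True renumbers the rows; leave it out to keep the original index labels.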
