Convert a specific string to a numeric value in pandas (Python)

I am trying to analyze some rainfall data. An example of the data looks like this:
10 18/05/2016 26.9 40 20.8 34 52.2 20.8 46.5 45
11 19/05/2016 25.5 32 0.3 41.6 42 0.3 56.3 65.2
12 20/05/2016 8.5 29 18.4 9 36 18.4 28.6 46
13 21/05/2016 24.5 18 TRACE 3.5 17 TRACE 4.4 40
14 22/05/2016 0.6 18 0 6.5 14 0 8.6 20
15 23/05/2016 3.5 9 0.6 4.3 14 0.6 7 15
16 24/05/2016 3.6 25 T 3 12 T 14.9 9
17 25/05/2016 25 21 2.2 25.6 50 2.2 25 9
The rainfall data contain the specific strings 'TRACE' and 'T' (both meaning a non-measurable rainfall amount). For analysis, I would like to convert these strings to 1.0 (float). The desired data should look like this, so the values can be plotted as a line diagram:
10 18/05/2016 26.9 40 20.8 34 52.2 20.8 46.5 45
11 19/05/2016 25.5 32 0.3 41.6 42 0.3 56.3 65.2
12 20/05/2016 8.5 29 18.4 9 36 18.4 28.6 46
13 21/05/2016 24.5 18 1.0 3.5 17 1.0 4.4 40
14 22/05/2016 0.6 18 0 6.5 14 0 8.6 20
15 23/05/2016 3.5 9 0.6 4.3 14 0.6 7 15
16 24/05/2016 3.6 25 1.0 3 12 1.0 14.9 9
17 25/05/2016 25 21 2.2 25.6 50 2.2 25 9
Can someone point me in the right direction?

You can use df.replace, and then convert the numeric columns to float using df.astype (otherwise their dtype stays object, and operations on those columns would still suffer from performance issues):
df = df.replace('^T(RACE)?$', 1.0, regex=True)
df.iloc[:, 1:] = df.iloc[:, 1:].astype(float) # converting object columns to floats
This will replace all T or TRACE elements with 1.0.
Output:
10 18/05/2016 26.9 40 20.8 34.0 52.2 20.8 46.5 45.0
11 19/05/2016 25.5 32 0.3 41.6 42.0 0.3 56.3 65.2
12 20/05/2016 8.5 29 18.4 9.0 36.0 18.4 28.6 46.0
13 21/05/2016 24.5 18 1 3.5 17.0 1 4.4 40.0
14 22/05/2016 0.6 18 0 6.5 14.0 0 8.6 20.0
15 23/05/2016 3.5 9 0.6 4.3 14.0 0.6 7.0 15.0
16 24/05/2016 3.6 25 1 3.0 12.0 1 14.9 9.0
17 25/05/2016 25.0 21 2.2 25.6 50.0 2.2 25.0 9.0
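A self-contained sketch of the whole flow (the inline sample and the read step are assumptions; a real file would go through pd.read_csv the same way):
import pandas as pd
from io import StringIO

# A few rows of the sample data; the leading field is used as the index,
# so the date ends up as column 0 and the readings follow it
data = """13 21/05/2016 24.5 18 TRACE 3.5 17 TRACE 4.4 40
14 22/05/2016 0.6 18 0 6.5 14 0 8.6 20
16 24/05/2016 3.6 25 T 3 12 T 14.9 9"""

df = pd.read_csv(StringIO(data), sep=r'\s+', header=None, index_col=0)

df = df.replace('^T(RACE)?$', 1.0, regex=True)  # 'T'/'TRACE' -> 1.0
cols = df.columns[1:]                           # everything after the date
df[cols] = df[cols].astype(float)
print(df.dtypes)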

Use replace with a dict:
df = df.replace({'T':1.0, 'TRACE':1.0})
Then, if necessary, convert the columns to float:
cols = df.columns.difference(['Date', 'other columns that should not be converted'])
df[cols] = df[cols].astype(float)
Sample:
df = df.replace({'T':1.0, 'TRACE':1.0})
cols = df.columns.difference(['Date','a'])
df[cols] = df[cols].astype(float)
print(df)
a Date 2 3 4 5 6 7 8 9
0 10 18/05/2016 26.9 40.0 20.8 34.0 52.2 20.8 46.5 45.0
1 11 19/05/2016 25.5 32.0 0.3 41.6 42.0 0.3 56.3 65.2
2 12 20/05/2016 8.5 29.0 18.4 9.0 36.0 18.4 28.6 46.0
3 13 21/05/2016 24.5 18.0 1.0 3.5 17.0 1.0 4.4 40.0
4 14 22/05/2016 0.6 18.0 0.0 6.5 14.0 0.0 8.6 20.0
5 15 23/05/2016 3.5 9.0 0.6 4.3 14.0 0.6 7.0 15.0
6 16 24/05/2016 3.6 25.0 1.0 3.0 12.0 1.0 14.9 9.0
7 17 25/05/2016 25.0 21.0 2.2 25.6 50.0 2.2 25.0 9.0
print(df.dtypes)
a int64
Date object
2 float64
3 float64
4 float64
5 float64
6 float64
7 float64
8 float64
9 float64
dtype: object

Extending the answer from @jezrael, you can replace and convert to floats in a single statement (this assumes the first column is Date and the remaining columns are the desired numeric ones):
df.iloc[:, 1:] = df.iloc[:, 1:].replace({'T':1.0, 'TRACE':1.0}).astype(float)
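If the columns may contain other stray non-numeric strings besides T/TRACE, a hedged variant uses pd.to_numeric with errors='coerce', turning anything unparseable into NaN instead of raising (this assumes the same df layout as above):
import pandas as pd

cols = df.columns[1:]  # everything after the date column
df[cols] = (df[cols]
            .replace({'T': 1.0, 'TRACE': 1.0})
            .apply(pd.to_numeric, errors='coerce'))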

Related

Create a single column from multiple columns in a dataframe

I have a dataframe with 5 columns: M1, M2, M3, M4 and M5. Each column contains floating-point values. Now I want to combine the data of all 5 columns into one column.
I tried
cols = list(df.columns)
df_new['Total'] = []
df_new['Total'] = [df_new['Total'].append(df[i], ignore_index=True) for i in cols]
But this doesn't produce the expected result.
I'm using Python 3.8.5 and Pandas 1.1.2.
Here's a part of my df
M1 M2 M3 M4 M5
0 5 12 20 26
0.5 5.5 12.5 20.5 26.5
1 6 13 21 27
1.5 6.5 13.5 21.5 27.5
2 7 14 22 28
2.5 7.5 14.5 22.5 28.5
10 15 22 30 36
10.5 15.5 22.5 30.5 36.5
11 16 23 31 37
11.5 16.5 23.5 31.5 37.5
12 17 24 32 38
12.5 17.5 24.5 32.5 38.5
And this is what I'm expecting
0
0.5
1
1.5
2
2.5
10
10.5
11
11.5
12
12.5
5
5.5
6
6.5
7
7.5
15
15.5
16
16.5
17
17.5
12
12.5
13
13.5
14
14.5
22
22.5
23
23.5
24
24.5
20
20.5
21
21.5
22
22.5
30
30.5
31
31.5
32
32.5
26
26.5
27
27.5
28
28.5
36
36.5
37
37.5
38
38.5
Just make use of the concat() method and a generator expression:
import pandas as pd
result = pd.concat((df[x] for x in df.columns), ignore_index=True)
Now if you print result, you will get the desired output.
Performance (concat() vs unstack()): a timing comparison was shown here as a chart.
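A minimal runnable sketch, using the first few rows of the sample data (and, as an assumption, df.unstack() as the alternative the performance note compares against):
import pandas as pd

# First few rows of the sample data, column names from the question
df = pd.DataFrame({'M1': [0, 0.5, 1],
                   'M2': [5, 5.5, 6],
                   'M3': [12, 12.5, 13],
                   'M4': [20, 20.5, 21],
                   'M5': [26, 26.5, 27]})

# Column-wise concatenation: all of M1, then all of M2, and so on
result = pd.concat((df[x] for x in df.columns), ignore_index=True)
print(result)

# Equivalent via unstack, which also flattens column by column
result2 = df.unstack().reset_index(drop=True)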

How to style my dataframe by column with conditions?

I want to paint the share price cell green if it is higher than the target price, and red if it is lower than the alert price. My code is not working; it keeps raising errors.
This is the code that I use
temp_df.style.apply(lambda x: ["background: red" if v < x.iloc[:,1:] and x.iloc[:,1:] != 0 else "" for v in x], subset=['Share Price'], axis = 0)
temp_df.style.apply(lambda x: ["background: green" if v > x.iloc[:,2:] and x.iloc[:,2:] != 0 else "" for v in x], subset=['Share Price'], axis = 0)
Can anyone give me an idea on how to do it?
Index Share Price Alert/Entry Target
0 622.0 424.0 950.0
1 6880.0 5200.0 7450.0
2 62860.0 40000.0 60000.0
3 7669.0 5500.0 8000.0
4 5295.0 3500.0 5500.0
5 227.0 165.0 250.0
6 3970.0 3200.0 4250.0
7 1300.0 850.0 1650.0
8 8480.0 6500.0 8500.0
9 11.3 0.0 0.0
10 66.0 58.0 75.0
11 7.3 6.4 9.6
12 114.8 75.0 130.0
13 172.3 90.0 0.0
14 2.6 2.4 3.2
15 76.8 68.0 85.0
16 19.6 15.4 21.0
17 21.9 11.0 18.6
18 35.4 29.0 42.0
19 12.5 9.2 0.0
20 15.5 0.0 0.0
21 449.8 0.0 0.0
22 4.3 3.6 5.0
23 47.4 40.0 55.0
24 0.6 0.5 0.6
25 49.2 45.0 72.0
26 13.9 0.0 0.0
27 3.0 2.4 4.5
28 2.4 1.8 4.2
29 54.0 0.0 0.0
30 293.5 100.0 250.0
31 190000.0 140000.0 220000.0
32 52200.0 46000.0 58000.0
33 100500.0 75000.0 115000.0
34 4.9 3.8 6.5
35 0.2 0.0 0.0
36 1430.0 980.0 1450.0
37 1585.0 0.0 0.0
38 15.6 11.0 18.0
39 3.3 2.8 6.0
40 52.5 45.0 68.0
41 46.5 35.0 0.0
42 193.6 135.0 0.0
43 122.8 90.0 0.0
44 222.6 165.0 265.0
Provided that "Index" is also a column:
temp_df.style.apply(
    lambda x: ["background: green" if (i == 1 and v > x.iloc[3] and x.iloc[3] != 0)
               else ("background: red" if (i == 1 and v < x.iloc[2]) else "")
               for i, v in enumerate(x)],
    axis=1)
Here i picks out the Share Price column to be styled (column position 1).
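For comparison, a sketch that handles both conditions in one pass with a named row function (the zero checks mirror the answer above; the column names are taken from the question's table):
import pandas as pd

def highlight_share_price(row):
    """Style only the 'Share Price' cell of each row."""
    styles = pd.Series('', index=row.index)
    if row['Target'] != 0 and row['Share Price'] > row['Target']:
        styles['Share Price'] = 'background: green'
    elif row['Alert/Entry'] != 0 and row['Share Price'] < row['Alert/Entry']:
        styles['Share Price'] = 'background: red'
    return styles

styled = temp_df.style.apply(highlight_share_price, axis=1)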

Dropping multiple columns in pandas at once

I have a data set consisting of 135 columns. I am trying to drop the columns in which more than 60% of the data is empty; there are roughly 40 such columns. I wrote a function to drop these empty columns, but I am getting a "not contained in axis" error. Could someone help me solve this, or suggest another way to drop these 40 columns at once?
My function:
list_drop = df.isnull().sum()/(len(df))
def empty(df):
    if list_drop > 0.5:
        df.drop(list_drop, axis=1, inplace=True)
    return df
The other method I tried:
df.drop(df.count()/len(df)<0.5,axis=1,inplace=True)
You could use isnull + sum and then use the mask to filter df.columns.
m = df.isnull().sum(0) / len(df) < 0.6
df = df[df.columns[m]]
Demo
df
A B C
0 29.0 NaN 26.6
1 NaN NaN 23.3
2 23.0 94.0 28.1
3 35.0 168.0 43.1
4 NaN NaN 25.6
5 32.0 88.0 31.0
6 NaN NaN 35.3
7 45.0 543.0 30.5
8 NaN NaN NaN
9 NaN NaN 37.6
10 NaN NaN 38.0
11 NaN NaN 27.1
12 23.0 846.0 30.1
13 19.0 175.0 25.8
14 NaN NaN 30.0
15 47.0 230.0 45.8
16 NaN NaN 29.6
17 38.0 83.0 43.3
18 30.0 96.0 34.6
m = df.isnull().sum(0) / len(df) < 0.3 # 0.3 as an example
m
A False
B False
C True
dtype: bool
df[df.columns[m]]
C
0 26.6
1 23.3
2 28.1
3 43.1
4 25.6
5 31.0
6 35.3
7 30.5
8 NaN
9 37.6
10 38.0
11 27.1
12 30.1
13 25.8
14 30.0
15 45.8
16 29.6
17 43.3
18 34.6
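An equivalent sketch using dropna with its thresh parameter (this assumes "more than 60% empty" means keeping a column only when at least 40% of its values are non-null):
import pandas as pd

# Keep columns having at least 40% non-null values,
# i.e. drop those with more than 60% missing
df = df.dropna(axis=1, thresh=int(len(df) * 0.4))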

Pandas DataFrame: create a 14-day moving average, but show simple averages for the first 14 days of data?

I have a pandas dataframe similar to this.
score avg
date
1/1/2017 0 0
1/2/2017 1 0.5
1/3/2017 2 1
1/4/2017 3 1.5
1/5/2017 4 2
1/6/2017 5 2.5
1/7/2017 6 3
1/8/2017 7 3.5
1/9/2017 8 4
1/10/2017 9 4.5
1/11/2017 10 5
1/12/2017 11 5.5
1/13/2017 12 7.5
1/14/2017 13 6.5
1/15/2017 14 7.5
1/16/2017 15 8.5
1/17/2017 16 9.5
1/18/2017 17 10.5
1/19/2017 18 11.5
1/20/2017 19 12.5
1/21/2017 20 13.5
1/22/2017 21 14.5
1/23/2017 22 15.5
1/24/2017 23 16.5
1/25/2017 24 17.5
1/26/2017 25 18.5
1/27/2017 26 19.5
1/28/2017 27 20.5
1/29/2017 28 21.5
Basically I am looking to create a 14-day rolling average of the data, but instead of showing NaNs for the first 14 days, I want to show simple averages: for example, the average on day 2 is the average of days 1 and 2, the average on day 10 is the average of days 1-10, etc. How would I go about doing this without having to create the averages manually? Thanks for the help!
What you need is rolling with min_periods=1 as a parameter:
df['avg2'] = df.rolling(14, min_periods=1)['score'].mean()
Output:
date score avg avg2
0 2017-01-01 0 0.0 0.0
1 2017-01-02 1 0.5 0.5
2 2017-01-03 2 1.0 1.0
3 2017-01-04 3 1.5 1.5
4 2017-01-05 4 2.0 2.0
5 2017-01-06 5 2.5 2.5
6 2017-01-07 6 3.0 3.0
7 2017-01-08 7 3.5 3.5
8 2017-01-09 8 4.0 4.0
9 2017-01-10 9 4.5 4.5
10 2017-01-11 10 5.0 5.0
11 2017-01-12 11 5.5 5.5
12 2017-01-13 12 7.5 6.0
13 2017-01-14 13 6.5 6.5
14 2017-01-15 14 7.5 7.5
15 2017-01-16 15 8.5 8.5
16 2017-01-17 16 9.5 9.5
17 2017-01-18 17 10.5 10.5
18 2017-01-19 18 11.5 11.5
19 2017-01-20 19 12.5 12.5
20 2017-01-21 20 13.5 13.5
21 2017-01-22 21 14.5 14.5
22 2017-01-23 22 15.5 15.5
23 2017-01-24 23 16.5 16.5
24 2017-01-25 24 17.5 17.5
25 2017-01-26 25 18.5 18.5
26 2017-01-27 26 19.5 19.5
27 2017-01-28 27 20.5 20.5
28 2017-01-29 28 21.5 21.5
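Why this works: with min_periods=1 the window simply grows until it reaches 14 rows, so the leading rows are plain cumulative means, the same as expanding(). A quick verification sketch (with a stand-in series):
import numpy as np
import pandas as pd

s = pd.Series(range(29), dtype=float)  # stand-in for the score column
rolling_avg = s.rolling(14, min_periods=1).mean()
expanding_avg = s.expanding().mean()

# Identical for the first 14 rows; after that the window starts sliding
assert np.allclose(rolling_avg.iloc[:14], expanding_avg.iloc[:14])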

If statement in grouped Pandas dataframe

I have a dataset that contains columns of year, Julian day, hour, and temperature. I have grouped the data by year and day, and now want to perform an operation on the temperature data IF each day contains 24 hours' worth of data. Then, I want to create a DataFrame with year, Julian day, max temperature and min temperature. However, I'm not sure of the syntax to make sure this condition is met. Any help would be appreciated. My code is below:
df = pd.read_table(data, skiprows=1, sep='\t', usecols=(0, 3, 4, 6),
                   names=['year', 'jday', 'hour', 'temp'], na_values=-999.9)
g = df.groupby(['year', 'jday'])
if ...:  # the grouped year and day has 24 hours' worth of data
    maxt = g.aggregate({'temp': np.max})
    mint = g.aggregate({'temp': np.min})
else:
    continue
And some sample data (goes from 1942-2015):
Year Month Day Julian Hour Wind TempC DewC Pressure RH
1942 9 24 267 9 2.1 18.5 15.2 1014.2 81.0
1942 9 24 267 10 2.1 23.5 14.6 1014.6 57.0
1942 9 24 267 11 3.6 25.2 12.4 1014.2 45.0
1942 9 24 267 12 3.6 26.8 11.9 1014.2 40.0
1942 9 24 267 13 2.6 27.4 11.9 1014.2 38.0
1942 9 24 267 14 2.1 28.0 11.3 1013.5 35.0
1942 9 24 267 15 4.1 29.1 9.1 1013.5 29.0
1942 9 24 267 16 4.1 29.1 10.7 1013.5 32.0
1942 9 24 267 17 4.6 29.1 13.0 1013.9 37.0
1942 9 24 267 18 3.6 25.7 12.4 1015.2 44.0
1942 9 24 267 19 0.0 23.0 16.3 1015.2 66.0
1942 9 24 267 20 2.6 22.4 15.7 1015.9 66.0
1942 9 24 267 21 2.1 20.2 16.3 1016.3 78.0
1942 9 24 267 22 3.1 20.2 14.6 1016.9 70.0
1942 9 24 267 23 2.6 19.6 15.2 1017.6 76.0
1942 9 25 268 0 3.1 18.5 13.5 1018.3 73.0
1942 9 25 268 1 2.6 16.9 13.0 1018.3 78.0
1942 9 25 268 2 4.1 15.7 5.2 1021.0 50.0
1942 9 25 268 3 4.1 15.2 4.1 1020.7 47.0
1942 9 25 268 4 3.1 14.1 5.8 1021.3 57.0
1942 9 25 268 5 3.1 13.0 5.8 1021.3 62.0
1942 9 25 268 6 2.1 13.0 5.2 1022.4 59.0
1942 9 25 268 7 2.1 12.4 1.9 1022.4 49.0
1942 9 25 268 8 3.6 13.5 5.8 1024.7 60.0
1942 9 25 268 9 4.6 15.7 3.5 1025.1 44.0
1942 9 25 268 10 4.1 17.4 1.3 1025.4 34.0
1942 9 25 268 11 2.6 18.5 3.0 1025.4 36.0
1942 9 25 268 12 2.1 19.1 0.8 1025.1 29.0
1942 9 25 268 13 2.6 19.6 2.4 1024.7 32.0
1942 9 25 268 14 4.1 20.7 4.6 1023.4 35.0
1942 9 25 268 15 3.6 21.3 4.1 1023.7 32.0
1942 9 25 268 16 1.5 21.3 4.6 1023.4 34.0
1942 9 25 268 17 5.1 20.7 7.4 1023.4 42.0
1942 9 25 268 18 5.1 19.1 8.5 1023.0 50.0
1942 9 25 268 19 3.6 18.0 9.6 1022.7 58.0
1942 9 25 268 20 3.1 16.3 9.6 1023.0 65.0
1942 9 25 268 21 1.5 15.2 11.3 1023.0 78.0
1942 9 25 268 22 1.5 14.6 11.3 1023.0 81.0
1942 9 25 268 23 2.1 14.1 10.7 1024.0 80.0
I assume that no ['year', 'jday'] group contains non-integer hours, so we can simply use the number of unique hours in the group as the condition.
import pandas as pd
def get_min_max_by_date(df_group):
    if len(df_group['hour'].unique()) < 24:
        new_df = pd.DataFrame()
    else:
        year = df_group['year'].unique()[0]
        j_day = df_group['jday'].unique()[0]
        min_temp = df_group['temp'].min()
        max_temp = df_group['temp'].max()
        new_df = pd.DataFrame({'year': [year],
                               'julian_day': [j_day],
                               'min_temp': [min_temp],
                               'max_temp': [max_temp]}, index=[0])
    return new_df
df = pd.read_table(data,
                   skiprows=1,
                   sep='\t',
                   usecols=(0, 3, 4, 6),
                   names=['year', 'jday', 'hour', 'temp'],
                   na_values=-999.9)
final_df = df.groupby(['year', 'jday'],
                      as_index=False).apply(get_min_max_by_date)
final_df = final_df.reset_index()
I don't have time to test this right now, but this should get you started.
I would start by grouping on day alone, and then iterate over the groups, checking the unique hours in each group. You can use set to find the unique hours for each measurement day and compare with a full day's worth of hours, {0, 1, 2, ..., 23}:
a_full_day = set(range(24))
#data_out = {}
gb = df.groupby(['jday'])  # only group by day
for day, inds in gb.groups.items():
    if set(df.loc[inds, 'hour']) == a_full_day:
        maxt = df.loc[inds, 'temp'].max()
        #data_out[day] = {}
        #data_out[day]['maxt'] = maxt
        # etc
I added some commented lines suggesting how you might want to store the output.
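On current pandas, a more compact sketch of the same idea uses groupby().filter to keep only the complete days and then aggregates (column names as read in above):
import pandas as pd

# Keep only (year, jday) groups that contain all 24 distinct hours
complete = df.groupby(['year', 'jday']).filter(
    lambda g: g['hour'].nunique() == 24)

# Min and max temperature per complete day
result = (complete.groupby(['year', 'jday'])['temp']
          .agg(['min', 'max'])
          .reset_index())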
