I have a dataframe with 5 columns: M1, M2, M3, M4 and M5. Each column contains floating-point values. I want to combine the data of all 5 columns into one column.
I tried
cols = list(df.columns)
df_new['Total'] = []
df_new['Total'] = [df_new['Total'].append(df[i], ignore_index=True) for i in cols]
But I'm getting an error.
I'm using Python 3.8.5 and Pandas 1.1.2.
Here's a part of my df:
M1 M2 M3 M4 M5
0 5 12 20 26
0.5 5.5 12.5 20.5 26.5
1 6 13 21 27
1.5 6.5 13.5 21.5 27.5
2 7 14 22 28
2.5 7.5 14.5 22.5 28.5
10 15 22 30 36
10.5 15.5 22.5 30.5 36.5
11 16 23 31 37
11.5 16.5 23.5 31.5 37.5
12 17 24 32 38
12.5 17.5 24.5 32.5 38.5
And this is what I'm expecting:
0
0.5
1
1.5
2
2.5
10
10.5
11
11.5
12
12.5
5
5.5
6
6.5
7
7.5
15
15.5
16
16.5
17
17.5
12
12.5
13
13.5
14
14.5
22
22.5
23
23.5
24
24.5
20
20.5
21
21.5
22
22.5
30
30.5
31
31.5
32
32.5
26
26.5
27
27.5
28
28.5
36
36.5
37
37.5
38
38.5
Just use the concat() method with a generator expression:
import pandas as pd
result = pd.concat((df[x] for x in df.columns), ignore_index=True)
Now if you print result, you will get your desired output.
Performance (concat() vs. unstack()):
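For reference, df.unstack().reset_index(drop=True) produces the same column-by-column stack as the concat() call above. A minimal timing sketch, assuming a larger random frame (absolute numbers will vary by machine and data size):

import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark frame, not from the question.
big = pd.DataFrame(np.random.rand(100_000, 5),
                   columns=['M1', 'M2', 'M3', 'M4', 'M5'])

t_concat = timeit.timeit(
    lambda: pd.concat((big[c] for c in big.columns), ignore_index=True),
    number=10)
t_unstack = timeit.timeit(
    lambda: big.unstack().reset_index(drop=True),
    number=10)

print(f"concat:  {t_concat:.3f}s")
print(f"unstack: {t_unstack:.3f}s")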
I want to paint the Share Price cell green if it is higher than the Target price and red if it is lower than the Alert price, but my code is not working; it keeps raising errors.
This is the code that I use
temp_df.style.apply(lambda x: ["background: red" if v < x.iloc[:,1:] and x.iloc[:,1:] != 0 else "" for v in x], subset=['Share Price'], axis = 0)
temp_df.style.apply(lambda x: ["background: green" if v > x.iloc[:,2:] and x.iloc[:,2:] != 0 else "" for v in x], subset=['Share Price'], axis = 0)
Can anyone give me an idea on how to do it?
Index Share Price Alert/Entry Target
0 622.0 424.0 950.0
1 6880.0 5200.0 7450.0
2 62860.0 40000.0 60000.0
3 7669.0 5500.0 8000.0
4 5295.0 3500.0 5500.0
5 227.0 165.0 250.0
6 3970.0 3200.0 4250.0
7 1300.0 850.0 1650.0
8 8480.0 6500.0 8500.0
9 11.3 0.0 0.0
10 66.0 58.0 75.0
11 7.3 6.4 9.6
12 114.8 75.0 130.0
13 172.3 90.0 0.0
14 2.6 2.4 3.2
15 76.8 68.0 85.0
16 19.6 15.4 21.0
17 21.9 11.0 18.6
18 35.4 29.0 42.0
19 12.5 9.2 0.0
20 15.5 0.0 0.0
21 449.8 0.0 0.0
22 4.3 3.6 5.0
23 47.4 40.0 55.0
24 0.6 0.5 0.6
25 49.2 45.0 72.0
26 13.9 0.0 0.0
27 3.0 2.4 4.5
28 2.4 1.8 4.2
29 54.0 0.0 0.0
30 293.5 100.0 250.0
31 190000.0 140000.0 220000.0
32 52200.0 46000.0 58000.0
33 100500.0 75000.0 115000.0
34 4.9 3.8 6.5
35 0.2 0.0 0.0
36 1430.0 980.0 1450.0
37 1585.0 0.0 0.0
38 15.6 11.0 18.0
39 3.3 2.8 6.0
40 52.5 45.0 68.0
41 46.5 35.0 0.0
42 193.6 135.0 0.0
43 122.8 90.0 0.0
44 222.6 165.0 265.0
Provided that "Index" is also a column:
temp_df.style.apply(
    lambda x: ["background: green" if (i == 1 and v > x.iloc[3] and x.iloc[3] != 0)
               else ("background: red" if (i == 1 and v < x.iloc[2]) else "")
               for i, v in enumerate(x)],
    axis=1)
i picks out the Share Price column (position 1), so only that cell is styled; x.iloc[2] and x.iloc[3] are the row's Alert/Entry and Target values.
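If you prefer named logic over a one-liner, here is a sketch of an equivalent row-wise styler, assuming the column names shown in the sample data:

import pandas as pd

def highlight_share_price(row):
    # Start with no styling for any cell in the row.
    styles = pd.Series('', index=row.index)
    if row['Target'] != 0 and row['Share Price'] > row['Target']:
        styles['Share Price'] = 'background: green'
    elif row['Share Price'] < row['Alert/Entry']:
        styles['Share Price'] = 'background: red'
    return styles

styled = temp_df.style.apply(highlight_share_price, axis=1)

Because the function returns a Series indexed like the row, only the Share Price cell ever receives a background, and rows with a zero Target never turn green.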
I have a data set consisting of 135 columns. I am trying to drop the columns in which more than 60% of the data is empty; there are roughly 40 such columns. I wrote a function to drop these empty columns, but I am getting a "Not contained in axis" error. Could someone help me solve this, or suggest another way to drop these 40 columns at once?
My function:
list_drop = df.isnull().sum()/(len(df))
def empty(df):
    if list_drop > 0.5:
        df.drop(list_drop, axis=1, inplace=True)
    return df
Another method I tried:
df.drop(df.count()/len(df)<0.5,axis=1,inplace=True)
You could use isnull + sum and then use the mask to filter df.columns.
m = df.isnull().sum(0) / len(df) < 0.6
df = df[df.columns[m]]
Demo
df
A B C
0 29.0 NaN 26.6
1 NaN NaN 23.3
2 23.0 94.0 28.1
3 35.0 168.0 43.1
4 NaN NaN 25.6
5 32.0 88.0 31.0
6 NaN NaN 35.3
7 45.0 543.0 30.5
8 NaN NaN NaN
9 NaN NaN 37.6
10 NaN NaN 38.0
11 NaN NaN 27.1
12 23.0 846.0 30.1
13 19.0 175.0 25.8
14 NaN NaN 30.0
15 47.0 230.0 45.8
16 NaN NaN 29.6
17 38.0 83.0 43.3
18 30.0 96.0 34.6
m = df.isnull().sum(0) / len(df) < 0.3 # 0.3 as an example
m
A False
B False
C True
dtype: bool
df[df.columns[m]]
C
0 26.6
1 23.3
2 28.1
3 43.1
4 25.6
5 31.0
6 35.3
7 30.5
8 NaN
9 37.6
10 38.0
11 27.1
12 30.1
13 25.8
14 30.0
15 45.8
16 29.6
17 43.3
18 34.6
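As an alternative to building the mask yourself, dropna with a thresh expresses roughly the same cutoff; a sketch assuming the 60% rule from the question (watch the rounding at the exact boundary):

# Keep only columns with at least 40% non-null values,
# i.e. drop those that are more than 60% empty.
df = df.dropna(axis=1, thresh=int(len(df) * 0.4))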
I have a pandas dataframe similar to this.
score avg
date
1/1/2017 0 0
1/2/2017 1 0.5
1/3/2017 2 1
1/4/2017 3 1.5
1/5/2017 4 2
1/6/2017 5 2.5
1/7/2017 6 3
1/8/2017 7 3.5
1/9/2017 8 4
1/10/2017 9 4.5
1/11/2017 10 5
1/12/2017 11 5.5
1/13/2017 12 7.5
1/14/2017 13 6.5
1/15/2017 14 7.5
1/16/2017 15 8.5
1/17/2017 16 9.5
1/18/2017 17 10.5
1/19/2017 18 11.5
1/20/2017 19 12.5
1/21/2017 20 13.5
1/22/2017 21 14.5
1/23/2017 22 15.5
1/24/2017 23 16.5
1/25/2017 24 17.5
1/26/2017 25 18.5
1/27/2017 26 19.5
1/28/2017 27 20.5
1/29/2017 28 21.5
Basically I am looking to create a 14-day rolling average of the data, but instead of showing NaNs for the first 13 days, I want to show simple averages of the days available so far. For example, the average on day 2 is the average of days 1 and 2, the average on day 10 is the average of days 1-10, etc. How would I go about doing this without having to manually create the averages? Thanks for the help!
What you need is rolling with min_periods=1 as a parameter:
df['avg2'] = df.rolling(14, min_periods=1)['score'].mean()
Output:
date score avg avg2
0 2017-01-01 0 0.0 0.0
1 2017-01-02 1 0.5 0.5
2 2017-01-03 2 1.0 1.0
3 2017-01-04 3 1.5 1.5
4 2017-01-05 4 2.0 2.0
5 2017-01-06 5 2.5 2.5
6 2017-01-07 6 3.0 3.0
7 2017-01-08 7 3.5 3.5
8 2017-01-09 8 4.0 4.0
9 2017-01-10 9 4.5 4.5
10 2017-01-11 10 5.0 5.0
11 2017-01-12 11 5.5 5.5
12 2017-01-13 12 7.5 6.0
13 2017-01-14 13 6.5 6.5
14 2017-01-15 14 7.5 7.5
15 2017-01-16 15 8.5 8.5
16 2017-01-17 16 9.5 9.5
17 2017-01-18 17 10.5 10.5
18 2017-01-19 18 11.5 11.5
19 2017-01-20 19 12.5 12.5
20 2017-01-21 20 13.5 13.5
21 2017-01-22 21 14.5 14.5
22 2017-01-23 22 15.5 15.5
23 2017-01-24 23 16.5 16.5
24 2017-01-25 24 17.5 17.5
25 2017-01-26 25 18.5 18.5
26 2017-01-27 26 19.5 19.5
27 2017-01-28 27 20.5 20.5
28 2017-01-29 28 21.5 21.5
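For the first 13 rows this behaves exactly like an expanding (cumulative) mean, because min_periods=1 lets the window shrink at the start. A small sketch of the equivalence, using a made-up score column:

import pandas as pd

df = pd.DataFrame({'score': range(29)})
rolled = df['score'].rolling(14, min_periods=1).mean()
expanded = df['score'].expanding().mean()

# The two agree until the 14-row window fills up.
assert rolled.head(14).equals(expanded.head(14))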
I have a dataset that contains columns of year, julian day, hour and temperature. I have grouped the data by year and day, and now want to perform an operation on the temperature data IF each day contains 24 hours' worth of data. Then I want to create a DataFrame with year, julian day, max temperature and min temperature. However, I'm not sure of the syntax to make sure this condition is met. Any help would be appreciated. My code is below:
df = pd.read_table(data,skiprows=1,sep='\t',usecols=(0,3,4,6),names=['year','jday','hour','temp'],na_values=-999.9)
g = df.groupby(['year','jday'])
if ...:  # the grouped year and day has 24 hours' worth of data
    maxt = g.aggregate({'temp': np.max})
    mint = g.aggregate({'temp': np.min})
else:
    continue
And some sample data (goes from 1942-2015):
Year Month Day Julian Hour Wind TempC DewC Pressure RH
1942 9 24 267 9 2.1 18.5 15.2 1014.2 81.0
1942 9 24 267 10 2.1 23.5 14.6 1014.6 57.0
1942 9 24 267 11 3.6 25.2 12.4 1014.2 45.0
1942 9 24 267 12 3.6 26.8 11.9 1014.2 40.0
1942 9 24 267 13 2.6 27.4 11.9 1014.2 38.0
1942 9 24 267 14 2.1 28.0 11.3 1013.5 35.0
1942 9 24 267 15 4.1 29.1 9.1 1013.5 29.0
1942 9 24 267 16 4.1 29.1 10.7 1013.5 32.0
1942 9 24 267 17 4.6 29.1 13.0 1013.9 37.0
1942 9 24 267 18 3.6 25.7 12.4 1015.2 44.0
1942 9 24 267 19 0.0 23.0 16.3 1015.2 66.0
1942 9 24 267 20 2.6 22.4 15.7 1015.9 66.0
1942 9 24 267 21 2.1 20.2 16.3 1016.3 78.0
1942 9 24 267 22 3.1 20.2 14.6 1016.9 70.0
1942 9 24 267 23 2.6 19.6 15.2 1017.6 76.0
1942 9 25 268 0 3.1 18.5 13.5 1018.3 73.0
1942 9 25 268 1 2.6 16.9 13.0 1018.3 78.0
1942 9 25 268 2 4.1 15.7 5.2 1021.0 50.0
1942 9 25 268 3 4.1 15.2 4.1 1020.7 47.0
1942 9 25 268 4 3.1 14.1 5.8 1021.3 57.0
1942 9 25 268 5 3.1 13.0 5.8 1021.3 62.0
1942 9 25 268 6 2.1 13.0 5.2 1022.4 59.0
1942 9 25 268 7 2.1 12.4 1.9 1022.4 49.0
1942 9 25 268 8 3.6 13.5 5.8 1024.7 60.0
1942 9 25 268 9 4.6 15.7 3.5 1025.1 44.0
1942 9 25 268 10 4.1 17.4 1.3 1025.4 34.0
1942 9 25 268 11 2.6 18.5 3.0 1025.4 36.0
1942 9 25 268 12 2.1 19.1 0.8 1025.1 29.0
1942 9 25 268 13 2.6 19.6 2.4 1024.7 32.0
1942 9 25 268 14 4.1 20.7 4.6 1023.4 35.0
1942 9 25 268 15 3.6 21.3 4.1 1023.7 32.0
1942 9 25 268 16 1.5 21.3 4.6 1023.4 34.0
1942 9 25 268 17 5.1 20.7 7.4 1023.4 42.0
1942 9 25 268 18 5.1 19.1 8.5 1023.0 50.0
1942 9 25 268 19 3.6 18.0 9.6 1022.7 58.0
1942 9 25 268 20 3.1 16.3 9.6 1023.0 65.0
1942 9 25 268 21 1.5 15.2 11.3 1023.0 78.0
1942 9 25 268 22 1.5 14.6 11.3 1023.0 81.0
1942 9 25 268 23 2.1 14.1 10.7 1024.0 80.0
I assume that no ['year', 'jday'] group contains duplicate or non-integer hours, so we can use the number of unique hours in the group as the condition.
import pandas as pd

def get_min_max_by_date(df_group):
    if len(df_group['hour'].unique()) < 24:
        new_df = pd.DataFrame()
    else:
        year = df_group['year'].unique()[0]
        j_day = df_group['jday'].unique()[0]
        min_temp = df_group['temp'].min()
        max_temp = df_group['temp'].max()
        new_df = pd.DataFrame({'year': [year],
                               'julian_day': [j_day],
                               'min_temp': [min_temp],
                               'max_temp': [max_temp]}, index=[0])
    return new_df
df = pd.read_table(data,
                   skiprows=1,
                   sep='\t',
                   usecols=(0, 3, 4, 6),
                   names=['year', 'jday', 'hour', 'temp'],
                   na_values=-999.9)

final_df = df.groupby(['year', 'jday'],
                      as_index=False).apply(get_min_max_by_date)
final_df = final_df.reset_index()
I don't have time to test this right now, but this should get you started.
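If you'd rather avoid constructing a DataFrame inside apply, here is a more compact sketch under the same assumption (hours are unique within a day): filter out incomplete days first, then aggregate in one pass.

# Keep only ['year', 'jday'] groups that have all 24 hours,
# then take min/max temperature per group.
complete = df.groupby(['year', 'jday']).filter(
    lambda g: g['hour'].nunique() == 24)
final_df = (complete.groupby(['year', 'jday'])['temp']
                    .agg(min_temp='min', max_temp='max')
                    .reset_index())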
I would start by grouping on day alone, and then iterate over the groups, checking the unique hours in each group. You can use set to find the unique hours for each measurement day and compare with a full day's worth of hours, {0, 1, 2, ..., 23}.
a_full_day = set(range(24))
# data_out = {}
gb = df.groupby(['jday'])  # only group by day
for day, inds in gb.groups.items():  # .iteritems() is Python 2 only
    if set(df.loc[inds, 'hour']) == a_full_day:  # .ix is removed; use .loc
        maxt = df.loc[inds, 'temp'].max()
        # data_out[day] = {}
        # data_out[day]['maxt'] = maxt
        # etc.
I added some commented lines suggesting how you might want to store the output.
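If you do fill data_out as the comments suggest, one hypothetical way to turn it into a DataFrame afterwards:

import pandas as pd

# Assumes data_out maps day -> {'maxt': ..., 'mint': ...} per the comments above.
result = pd.DataFrame.from_dict(data_out, orient='index')
result.index.name = 'jday'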