What's wrong with this code to conditionally count Pandas dataframe columns? - python

I have the following data:
Data:
ObjectID,Date,Price,Vol,Mx
101,2017-01-01,,145,203
101,2017-01-02,,155,163
101,2017-01-03,67.0,140,234
101,2017-01-04,78.0,130,182
101,2017-01-05,58.0,178,202
101,2017-01-06,53.0,134,204
101,2017-01-07,52.0,134,183
101,2017-01-08,62.0,148,176
101,2017-01-09,42.0,152,193
101,2017-01-10,80.0,137,150
I want to add a new column called CheckCount counting the values in the Vol and Mx columns IF they are greater than 150. I have written the following code:
Code:
import pandas as pd
Observations = pd.read_csv("C:\\Users\\Observations.csv", parse_dates=['Date'], index_col=['ObjectID', 'Date'])
Observations['CheckCount'] = (Observations[['Vol', 'Mx']]>150).count(axis=1)
print(Observations)
Unfortunately, it counts every value (the result is always 2) rather than only the values greater than 150. What is wrong with my code?
Current Result:
ObjectID,Date,Price,Vol,Mx,CheckCount
101,2017-01-01,,145,203,2
101,2017-01-02,,155,163,2
101,2017-01-03,67.0,140,234,2
101,2017-01-04,78.0,130,182,2
101,2017-01-05,58.0,178,202,2
101,2017-01-06,53.0,134,204,2
101,2017-01-07,52.0,134,183,2
101,2017-01-08,62.0,148,176,2
101,2017-01-09,42.0,152,193,2
101,2017-01-10,80.0,137,150,2
Desired Result:
ObjectID,Date,Price,Vol,Mx,CheckCount
101,2017-01-01,,145,203,1
101,2017-01-02,,155,163,2
101,2017-01-03,67.0,140,234,1
101,2017-01-04,78.0,130,182,1
101,2017-01-05,58.0,178,202,2
101,2017-01-06,53.0,134,204,1
101,2017-01-07,52.0,134,183,1
101,2017-01-08,62.0,148,176,1
101,2017-01-09,42.0,152,193,2
101,2017-01-10,80.0,137,150,0

Are you looking for the following? Your count(axis=1) counts non-NA values, and the boolean frame returned by the comparison only ever contains True/False (never NaN), so every row counts 2. Summing the booleans counts the True values instead:
df['CheckCount'] = df[['Vol','Mx']].gt(150).sum(axis=1)
Output:
ObjectID Date Price Vol Mx CheckCount
0 101 2017-01-01 NaN 145 203 1
1 101 2017-01-02 NaN 155 163 2
2 101 2017-01-03 67.0 140 234 1
3 101 2017-01-04 78.0 130 182 1
4 101 2017-01-05 58.0 178 202 2
5 101 2017-01-06 53.0 134 204 1
6 101 2017-01-07 52.0 134 183 1
7 101 2017-01-08 62.0 148 176 1
8 101 2017-01-09 42.0 152 193 2
9 101 2017-01-10 80.0 137 150 0
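The same fix applied to the question's own frame (a sketch, reusing the read_csv call from the question):
import pandas as pd

Observations = pd.read_csv("C:\\Users\\Observations.csv", parse_dates=['Date'], index_col=['ObjectID', 'Date'])
# The comparison yields a boolean frame; summing across each row counts the True values.
Observations['CheckCount'] = (Observations[['Vol', 'Mx']] > 150).sum(axis=1)
print(Observations)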

Related

Pandas read csv, always get 1 column

Edit: note that I have already searched for this problem, but nothing has worked for me.
First line of data, 109 different fields for one line:
15/12/2022,13:53:27,Off,0,0.00,19.9,22.6,19.6,1,Normal,Operator,Not Fitted,14,83:04:21,34:23:28,28:04:51,0,0,0,3025,0,3551,3535,3446,240,0,239,0,0,Not Fitted,125.11:37:20,44.23:11:47,0,0,0,0,0,0,0,21,2,0,0,21.8,0.0,0.0,23.2,21,26,34,1,66,133,8,60,5,74.16:01:01,23.02:02:40,0,0,0,0,0,0,0,25,2.8,0,0,21.4,0.0,0.0,22.2,21,24.1,32,2,64,133,8,28,1,122.22:39:33,43.18:38:50,0,0,0,0,0,0,0,23,1.6,0,0,21.4,0.0,0.0,22.5,21.2,24.1,32,2,64,133,8,28,1,No Alarms
So, in this case it's comma-delimited. But when I try
df = pd.read_csv(path, sep=',', error_bad_lines=False, engine='python')
or even different combinations of options, I always get one column out.
16:02:29 On 4554 0.00 23.5 36.8 21.1 1 Normal Operator Not Fitted 14 83:06:30 35:01:19 28:06:27 0 0 0 3025 0 3502 3413 2911 245 0 1579 0 0 Not Fitted 125.13:45:20 45.01:01:51 98 4025 98.3 96 2627 0 0 12 4.4 0 0 27 0.0 0.0 39.1 24.4 39.6 51 0 67 133 9 124 5 74.18:09:01 23.03:52:44 98 4018 98.1 100 2746 0 0 17 5.5 0 0 25.1 0.0 0.0 32.3 23.6 34.6 51 0 67 133 9 124 5 123.00:47:33 43.20:28:54 97 4003 97.8 101 2767 0 0 16 4.6 0 0 25.4 0.0 0.0 32.2 23.9 34.1 51 0 67 133 9 124 5 No Alarms Present
[3944 rows x 1 columns]
It's meant to have 70+ columns, but whatever I do I get the same result.
I am trying to use Pandas so I can incorporate the data into another program that also uses it.
Both the library and Python are up to date.
Any help is appreciated.
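When read_csv returns a single column even with sep=',', the raw file often is not what it looks like in an editor: a different delimiter (tab or semicolon), quoting around whole lines, or an unexpected encoding. A small diagnostic sketch (assuming the file path is in path, as in the snippet above):
import pandas as pd

# Peek at the raw first line to see what actually separates the fields.
with open(path, 'rb') as f:
    print(f.readline())

# Let pandas sniff the delimiter itself (sep=None requires the python engine).
df = pd.read_csv(path, sep=None, engine='python')
print(df.shape)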

How can I group dates into pandas

Datos
2015-01-01 58
2015-01-02 42
2015-01-03 41
2015-01-04 13
2015-01-05 6
... ...
2020-06-18 49
2020-06-19 41
2020-06-20 23
2020-06-21 39
2020-06-22 22
2000 rows × 1 columns
I have this df, made up of a single column whose data represents the average temperature of each day over a range of years. I would like to get the maximum for each day of the year (treating the year as having 365 days) and obtain a df similar to this:
Datos
1 40
2 50
3 46
4 8
5 26
... ...
361 39
362 23
363 23
364 37
365 25
365 rows × 1 columns
Forgive my ignorance and thank you very much for the help.
You can do this:
df['Date'] = pd.to_datetime(df['Date'])
# Daily maxima (a no-op if there is already one reading per day), then label each date with its day of year.
df = df.groupby(by=pd.Grouper(key='Date', freq='D')).max().reset_index()
df['Day'] = df['Date'].dt.dayofyear
print(df)
Date Temp Day
0 2015-01-01 58.0 1
1 2015-01-02 42.0 2
2 2015-01-03 41.0 3
3 2015-01-04 13.0 4
4 2015-01-05 6.0 5
... ... ... ...
1995 2020-06-18 49.0 170
1996 2020-06-19 41.0 171
1997 2020-06-20 23.0 172
1998 2020-06-21 39.0 173
1999 2020-06-22 22.0 174
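That only produces the daily values with a day-of-year label; to reach the 365-row frame asked for, one more groupby over the day of year is needed (a sketch, assuming the temperature column is named Temp as in the printed output above):
df_max = df.groupby('Day')['Temp'].max().to_frame('Datos')
print(df_max)  # one row per day of year, the maximum across all years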
Make a new column from the date index (the Datos column holds the temperatures, so the day of year has to come from the dates, which appear to be the index):
df["day of year"] = pd.to_datetime(df.index).dayofyear
Then
df.groupby("day of year").max()

Pandas DataFrame mean of data in columns occurring before certain date time

I have a dataframe with client IDs and their expenses for 2014-2018. I want the mean of the expenses per ID, but only the years before a certain date may be taken into account when calculating the mean (the 'Date' column dictates which year columns count towards the mean).
Example: for index 0 (ID 12), the date is '2016-03-08', so the mean should be taken over the columns 'y_2014' and 'y_2015', giving 111.0 for this index. If the date is too early (e.g. in 2014 or earlier in this case), NaN should be returned (see indices 6 and 9).
Desired output:
y_2014 y_2015 y_2016 y_2017 y_2018 Date ID mean
0 100.0 122.0 324 632 NaN 2016-03-08 12 111.0
1 120.0 159.0 54 452 541.0 2015-04-09 96 120.0
2 NaN 164.0 687 165 245.0 2016-02-15 20 164.0
3 180.0 421.0 512 184 953.0 2018-05-01 73 324.25
4 110.0 654.0 913 173 103.0 2017-08-04 84 559.0
5 130.0 NaN 754 124 207.0 2016-07-03 26 130.0
6 170.0 256.0 843 97 806.0 2013-02-04 87 NaN
7 140.0 754.0 95 101 541.0 2016-06-08 64 447
8 80.0 985.0 184 84 90.0 2019-03-05 11 284.6
9 96.0 65.0 127 130 421.0 2014-05-14 34 NaN
The code below is what I tried.
Tried code:
import pandas as pd
import numpy as np


df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
"y_2014": [100,120,np.nan,180,110,130,170,140,80,96],
"y_2015": [122,159,164,421,654,np.nan,256,754,985,65],
"y_2016": [324,54,687,512,913,754,843,95,184,127],
"y_2017": [632,452,165,184,173,124,97,101,84,130],
"y_2018": [np.nan,541,245,953,103,207,806,541,90,421],
"Date": ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',
 '2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})

print(df)

# the years from columns
data = df.filter(like='y_')
data_years = data.columns.str.extract('(\d+)')[0].astype(int)

# the years from Date
years = pd.to_datetime(df.Date).dt.year.values


df['mean'] = data.where(data_years<years[:,None]).mean(1)
print(df)
-> ValueError: Lengths must match to compare
Solved: one possible answer to my own question
import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
"y_2014": [100,120,np.nan,180,110,130,170,140,80,96],
"y_2015": [122,159,164,421,654,np.nan,256,754,985,65],
"y_2016": [324,54,687,512,913,754,843,95,184,127],
"y_2017": [632,452,165,184,173,124,97,101,84,130],
"y_2018": [np.nan,541,245,953,103,207,806,541,90,421],
"Date": ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',
'2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})
#Subset from original df to calculate mean
subset = df.loc[:,['y_2014', 'y_2015', 'y_2016', 'y_2017', 'y_2018']]
#An expense value only counts towards the mean once that year has passed,
#so '2015-01-01' is chosen for the 'y_2014' column in the subset (and so on) to compare with the 'Date' column
subset.columns = ['2015-01-01', '2016-01-01', '2017-01-01', '2018-01-01', '2019-01-01']

s = subset.columns[0:].values < df.Date.values[:,None]
t = s.astype(float)
t[t == 0] = np.nan
df['mean'] = (subset.iloc[:,0:]*t).mean(1)

print(df)
#Additionally: the sum of expenses before the date in the 'Date' column
df['sum'] = (subset.iloc[:,0:]*t).sum(1)

print(df)
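For reference, the approach from the original attempt also works once both sides of the comparison are plain NumPy arrays; the ValueError comes from comparing a length-5 pandas object against a (10, 1) array. A sketch, reusing data, data_years and years from the attempt above (it reproduces the desired output for this sample data):
mask = data_years.to_numpy() < years[:, None]   # shape (10, 5): expense year strictly before the year of 'Date'
df['mean'] = data.where(mask).mean(axis=1)      # rows with no eligible year come out as NaN
print(df)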

Python fill zeros in a timeseries dataframe

I have a list of dates and a dataframe. The dataframe has an id column and other value columns, but not every id has a row for every date. I want to fill zeros in all columns for the ids and dates where there is no data. Let me show you by example:
date id clicks conv rev
2019-01-21 234 34 1 10
2019-01-21 235 32 0 0
2019-01-24 234 56 2 20
2019-01-23 235 23 3 30
date list is like this:
[2019-01-01, 2019-01-02,2019-01-03 ....2019-02-28]
What I want is to add zeros for all the missing dates in the dataframe for all ids. So the resultant df should look like:
date id clicks conv rev
2019-01-01 234 0 0 0
2019-01-01 235 0 0 0
. . . .
. . . .
2019-01-21 234 34 1 10
2019-01-21 235 32 0 0
2019-01-22 234 0 0 0
2019-01-22 235 0 0 0
2019-01-23 234 0 0 0
2019-01-23 235 0 0 0
2019-01-24 234 56 2 20
2019-01-23 235 23 3 30
. . . .
2019-02-28 0 0 0 0
Use set_index + reindex with the Cartesian product of dates and ids. Here I'll create the dates with pd.date_range to save some typing, and make sure the dates are datetime:
import pandas as pd

df['date'] = pd.to_datetime(df.date)
# Every (date, id) pair that should appear in the result.
my_dates = pd.date_range('2019-01-01', '2019-02-28', freq='D')
idx = pd.MultiIndex.from_product([my_dates, df.id.unique()], names=['date', 'id'])
# Reindex onto the full grid; missing rows become NaN, which fillna turns into 0.
df = df.set_index(['date', 'id']).reindex(idx).fillna(0).reset_index()
Output: df
date id clicks conv rev
0 2019-01-01 234 0.0 0.0 0.0
1 2019-01-01 235 0.0 0.0 0.0
...
45 2019-01-23 235 23.0 3.0 30.0
46 2019-01-24 234 56.0 2.0 20.0
47 2019-01-24 235 0.0 0.0 0.0
...
115 2019-02-27 235 0.0 0.0 0.0
116 2019-02-28 234 0.0 0.0 0.0
117 2019-02-28 235 0.0 0.0 0.0
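Note that fillna(0) leaves the value columns as floats (hence the 0.0 entries above). If integer columns are wanted back, they can be cast afterwards, e.g.:
df[['clicks', 'conv', 'rev']] = df[['clicks', 'conv', 'rev']].astype(int)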

Python - Statsmodels.tsa.seasonal_decompose - missing values in head and tail of dataframe

I have the following dataframe, that I'm calling "sales_df":
Value
Date
2004-01-01 0
2004-02-01 173
2004-03-01 225
2004-04-01 230
2004-05-01 349
2004-06-01 258
2004-07-01 270
2004-08-01 223
... ...
2015-06-01 218
2015-07-01 215
2015-08-01 233
2015-09-01 258
2015-10-01 252
2015-11-01 256
2015-12-01 188
2016-01-01 70
I want to separate its trend from its seasonal component and for that I use statsmodels.tsa.seasonal_decompose through the following code:
import pandas as pd
import statsmodels.api as sm

decomp = sm.tsa.seasonal_decompose(sales_df.Value)
df = pd.concat([sales_df, decomp.trend], axis=1)
df.columns = ['sales', 'trend']
This is getting me this:
sales trend
Date
2004-01-01 0 NaN
2004-02-01 173 NaN
2004-03-01 225 NaN
2004-04-01 230 NaN
2004-05-01 349 NaN
2004-06-01 258 NaN
2004-07-01 270 236.708333
2004-08-01 223 248.208333
2004-09-01 243 251.250000
... ... ...
2015-05-01 270 214.416667
2015-06-01 218 215.583333
2015-07-01 215 212.791667
2015-08-01 233 NaN
2015-09-01 258 NaN
2015-10-01 252 NaN
2015-11-01 256 NaN
2015-12-01 188 NaN
2016-01-01 70 NaN
Note that there are 6 NaN's in the start and in the end of the Trend's series.
So I ask, is that right? Why is that happening?
This is expected: when the filt argument is not specified (as here), seasonal_decompose estimates the trend with a centered (symmetric) moving average whose window equals the seasonal period, which is inferred from the time series' frequency. For monthly data the period is 12, so the first and last 6 observations have no complete window and their trend values are NaN.
https://searchcode.com/codesearch/view/86129185/
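If NaN-free ends are needed, newer statsmodels versions accept an extrapolate_trend argument that linearly extends the trend into the edges (worth checking against the installed version; a sketch):
decomp = sm.tsa.seasonal_decompose(sales_df.Value, extrapolate_trend='freq')
print(decomp.trend.head())  # no leading NaNs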
