I have a weather dataset with many years of data, as below:
Date        rainfallInMon
2009-01-01  0.0
2009-01-02  0.03
2009-01-03  0.05
2009-01-04  0.05
2009-01-05  0.06
...
2009-01-29  0.2
2009-01-30  0.21
2009-01-31  0.21
2009-02-01  0.0
2009-02-02  0.0
...
I am trying to get the daily rainfall from this monthly cumulative data, i.e. subtracting the previous day's value within each month. For example:
Date        rainfallDaily
2009-01-01  0.0
2009-01-02  0.03
2009-01-03  0.02
...
2009-01-29  0.01
2009-01-30  0.0
...
Thanks for your efforts in advance.
Because there are many years of data, use Series.dt.to_period to create month periods, so that months are distinguished by year as well:
df['rainfallDaily'] = (df.groupby(df['Date'].dt.to_period('m'))['rainfallInMon']
                         .diff()
                         .fillna(0))
Or use Grouper:
df['rainfallDaily'] = (df.groupby(pd.Grouper(freq='M', key='Date'))['rainfallInMon']
                         .diff()
                         .fillna(0))
print(df)
Date rainfallInMon rainfallDaily
0 2009-01-01 0.00 0.00
1 2009-01-02 0.03 0.03
2 2009-01-03 0.05 0.02
3 2009-01-04 0.05 0.00
4 2009-01-05 0.06 0.01
5 2009-01-29 0.20 0.14
6 2009-01-30 0.21 0.01
7 2009-01-31 0.21 0.00
8 2009-02-01 0.00 0.00
9 2009-02-02 0.00 0.00
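For reference, here is a self-contained sketch of the groupby/diff approach using a small slice of the data above (the DataFrame construction is an assumption for illustration):

```python
import pandas as pd

# Small slice of the cumulative monthly data from the question
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2009-01-29", "2009-01-30", "2009-01-31",
        "2009-02-01", "2009-02-02",
    ]),
    "rainfallInMon": [0.20, 0.21, 0.21, 0.00, 0.00],
})

# Diff within each (year, month) period so February restarts from its own first day
df["rainfallDaily"] = (
    df.groupby(df["Date"].dt.to_period("m"))["rainfallInMon"]
      .diff()
      .fillna(0)
)
print(df)
```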
Try:
# Convert to datetime if it's not already the case
df['Date'] = pd.to_datetime(df['Date'])
df['rainfallDaily'] = df.resample('M', on='Date')['rainfallInMon'].diff().fillna(0)
print(df)
# Output
Date rainfallInMon rainfallDaily
0 2009-01-01 0.00 0.00
1 2009-01-02 0.03 0.03
2 2009-01-03 0.05 0.02
3 2009-01-04 0.05 0.00
4 2009-01-05 0.06 0.01
5 2009-01-29 0.20 0.14
6 2009-01-30 0.21 0.01
7 2009-01-31 0.21 0.00
8 2009-02-01 0.00 0.00
9 2009-02-02 0.00 0.00
I am trying to fill the dataframe under a certain condition but I cannot find the appropriate solution. I have a somewhat larger dataframe, but let's say my pandas dataframe looks like this:
   0     1     2     3     4     5
0.32  0.40  0.60  1.20  3.40  0.00
0.17  0.12  0.00  1.30  2.42  0.00
0.31  0.90  0.80  1.24  4.35  0.00
0.39  0.00  0.90  1.50  1.40  0.00
And I want to update the values so that once 0.00 appears in a row (as in rows 2 and 4), all the values from there to the end of the row become 0.00. Something like this:
   0     1     2     3     4     5
0.32  0.40  0.60  1.20  3.40  0.00
0.17  0.12  0.00  0.00  0.00  0.00
0.31  0.90  0.80  1.24  4.35  0.00
0.39  0.00  0.00  0.00  0.00  0.00
I have tried with
for t in range(1, T-1):
    data = np.where(df[t-1]==0, 0, df[t])
and several other ways, but I couldn't get what I want.
Thanks!
Try as follows:
Select from df with df[df.eq(0)]. This gives us all the zeros, with NaN everywhere else.
Now apply df.ffill along axis=1. This propagates each zero through to the end of its row.
Finally, change the dtype to bool by chaining df.astype, turning all zeros into False and all NaN values into True.
We feed the result to df.where: for all True values we keep df itself, and for all False values we insert 0.
df = df.where(df[df.eq(0)].ffill(axis=1).astype(bool), 0)
print(df)
0 1 2 3 4 5
0 0.32 0.40 0.6 1.20 3.40 0.0
1 0.17 0.12 0.0 0.00 0.00 0.0
2 0.31 0.90 0.8 1.24 4.35 0.0
3 0.39 0.00 0.0 0.00 0.00 0.0
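An alternative sketch (not from the original answer) that reads a little more directly: flag the zeros, carry the flag to the end of each row with cummax, then mask:

```python
import pandas as pd

# Data from the question
df = pd.DataFrame([
    [0.32, 0.40, 0.60, 1.20, 3.40, 0.00],
    [0.17, 0.12, 0.00, 1.30, 2.42, 0.00],
    [0.31, 0.90, 0.80, 1.24, 4.35, 0.00],
    [0.39, 0.00, 0.90, 1.50, 1.40, 0.00],
])

# eq(0) marks the zeros; cummax(axis=1) keeps the mark True once it is seen,
# so every cell from the first zero onward gets masked to 0
df = df.mask(df.eq(0).cummax(axis=1), 0)
print(df)
```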
I am trying to plot the data shown below in a normalised way, in order to have the maximum value on the y-axis equal to 1.
Dataset:
%_F %_M %_C %_D Label
0 0.00 0.00 0.08 0.05 0.0
1 0.00 0.00 0.00 0.14 0.0
2 0.00 0.00 0.10 0.01 1.0
3 0.01 0.01 0.07 0.05 1.0
4 0.00 0.00 0.07 0.14 0.0
6 0.00 0.00 0.07 0.05 0.0
7 0.00 0.00 0.05 0.68 0.0
8 0.00 0.00 0.03 0.09 0.0
9 0.00 0.00 0.04 0.02 0.0
10 0.00 0.00 0.06 0.02 0.0
I tried as follows:
cols_to_norm = ["%_F", "%_M", "%_C", "%_D"]
df[cols_to_norm] = df[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
but I am not completely sure about the output.
In fact, if I plot as follows
df.pivot_table(index='Label').plot.bar()
I get a different result. I think it is because the first code does not take the Label index into account.
There are multiple techniques to normalize. This one uses native pandas:
import io
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(io.StringIO(""" %_F %_M %_C %_D Label
0 0.00 0.00 0.08 0.05 0.0
1 0.00 0.00 0.00 0.14 0.0
2 0.00 0.00 0.10 0.01 1.0
3 0.01 0.01 0.07 0.05 1.0
4 0.00 0.00 0.07 0.14 0.0
6 0.00 0.00 0.07 0.05 0.0
7 0.00 0.00 0.05 0.68 0.0
8 0.00 0.00 0.03 0.09 0.0
9 0.00 0.00 0.04 0.02 0.0
10 0.00 0.00 0.06 0.02 0.0"""), sep=r"\s+")
fig, ax = plt.subplots(2, figsize=[10,6])
df2 = (df-df.min())/(df.max()-df.min())
df.plot(ax=ax[0], kind="line")
df2.plot(ax=ax[1], kind="line")
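Note that the snippet above rescales Label as well. If Label is a class indicator that should stay untouched, one sketch (assuming only the percentage columns need scaling) restricts the min-max normalization to those columns:

```python
import pandas as pd

# Hypothetical frame with the same shape as the question's data
df = pd.DataFrame({
    "%_F": [0.00, 0.01, 0.00],
    "%_M": [0.00, 0.01, 0.00],
    "%_C": [0.08, 0.07, 0.10],
    "%_D": [0.05, 0.05, 0.01],
    "Label": [0.0, 1.0, 1.0],
})

cols_to_norm = ["%_F", "%_M", "%_C", "%_D"]
# Min-max scale each feature column to [0, 1]; Label is left as-is
df[cols_to_norm] = (df[cols_to_norm] - df[cols_to_norm].min()) / (
    df[cols_to_norm].max() - df[cols_to_norm].min()
)
print(df)
```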
I have a problem when I try to transpose my dataframe. The process works, but I don't understand why the transposition doesn't include the last column. This is my original dataframe:
ASH700936D_M-SCIS East 2.07 -0.30 -0.27 0.00 0.00 0.00 0.00 0.19
ASH700936D_M-SCIS North 1.93 0.00 0.00 -0.15 0.09 0.04 -0.27 0.12
ASH700936D_M-SCIS Up 31.59 -40.09 -1.48 15.31 1.03 0.00 0.00 0.65
ASH701945E_M-SCIS East 2.66 0.00 0.00 0.00 0.00 0.00 0.00 0.17
ASH701945E_M-SCIS North -0.91 0.00 0.00 -0.21 0.08 0.13 -0.44 0.12
ASH701945E_M-SCIS Up 5.45 3.31 0.11 0.00 0.00 -0.18 -0.18 0.41
LEIAR20-LEIM East -1.34 0.04 0.06 0.00 0.00 0.03 -0.05 0.05
LEIAR20-LEIM North -0.39 0.04 0.07 0.00 0.00 0.01 -0.06 0.03
LEIAR20-LEIM Up 0.58 0.00 0.00 0.10 0.04 0.20 0.02 0.13
LEIAR25.R3-LEIT East -0.39 0.00 0.00 0.02 0.28 -0.00 -0.31 0.08
LEIAR25.R3-LEIT North -0.65 0.00 0.00 0.09 -0.11 -0.10 0.22 0.05
LEIAR25.R3-LEIT Up 2.02 -8.52 -1.15 2.62 0.63 0.00 0.00 0.27
LEIAR25.R4-LEIT East 0.79 0.00 0.00 0.02 0.22 -0.05 -0.16 0.12
LEIAR25.R4-LEIT North -0.36 0.00 0.00 0.00 0.00 0.04 0.03 0.05
LEIAR25.R4-LEIT Up 15.11 -20.38 0.03 7.53 0.07 0.00 0.00 0.32
My transposed dataframe is:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
ASH700936D_M-SCIS ASH700936D_M-SCIS ASH700936D_M-SCIS ASH701945E_M-SCIS ASH701945E_M-SCIS ASH701945E_M-SCIS LEIAR20-LEIM LEIAR20-LEIM LEIAR20-LEIM LEIAR25.R3-LEIT LEIAR25.R3-LEIT LEIAR25.R3-LEIT LEIAR25.R4-LEIT LEIAR25.R4-LEIT
East North Up East North Up East North Up East North Up East North
2.07 1.93 31.59 2.66 -0.91 5.45 -1.34 -0.39 0.58 -0.39 -0.65 2.02 0.79 -0.36
-0.30 0.0 -40.09 0.0 0.0 3.31 0.04 0.04 0.0 0.0 0.0 -8.52 0.0 0.0
-0.27 0.0 -1.48 0.0 0.0 0.11 0.06 0.07 0.0 0.0 0.0 -1.15 0.0 0.0
0.00 -0.15 15.31 0.0 -0.21 0.0 0.0 0.0 0.1 0.02 0.09 2.62 0.02 0.0
0.00.1 0.09 1.03 0.0 0.08 0.0 0.0 0.0 0.04 0.28 -0.11 0.63 0.22 0.0
0.00.2 0.04 0.0 0.0 0.13 -0.18 0.03 0.01 0.2 -0.0 -0.1 0.0 -0.05 0.04
0.00.3 -0.27 0.0 0.0 -0.44 -0.18 -0.05 -0.06 0.02 -0.31 0.22 0.0 -0.16 0.03
0.19 0.12 0.65 0.17 0.12 0.41 0.05 0.03 0.13 0.08 0.05 0.27 0.12 0.05
As you can see, at the end of the transposed file one of "LEIAR25.R4-LEIT" is missing.
This is my script:
df = pd.read_csv('original_dataframe.txt', sep='\s*', index_col=None, engine='python')
df_transposed = df.transpose()
df_transposed.to_csv('transposta_bozza.txt', sep=' ')
with open('transposta_bozza.txt', 'r') as f:
    with open('transposta.txt', 'w') as r:
        for line in f:
            data = line.split()
            df = '{0[0]:<20}{0[1]:<20}{0[2]:<20}{0[3]:<20}{0[4]:<20}{0[5]:<20}{0[6]:<20}{0[7]:<20}{0[8]:<20}{0[9]:<20}{0[10]:<20}{0[11]:<20}{0[12]:<20}{0[13]:<20}'.format(data)
            r.write("%s\n" % df)
I have to say that if I try to add {0[14]:<20} to the format string, it raises "IndexError: Index out of range".
I think there is some problem with the index but I don't know what to do!
you don't have to write a temporary file, try out this code:
with open('transposta.txt', 'w') as fp:
    for i, row in df_transposed.iterrows():
        fp.write(''.join('{:<20}'.format(x) for x in row))
        fp.write('\n')
I wasn't able to spot the problem with your code, but hope this will help
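If the goal is just an aligned text dump, a simpler sketch (an alternative, not the original approach) lets pandas do the padding via to_string, shown here with a tiny hypothetical stand-in for the antenna table:

```python
import pandas as pd

# Hypothetical two-row stand-in for the calibration table in the question
df = pd.DataFrame(
    [["East", 2.07, -0.30], ["North", 1.93, 0.00]],
    index=["ASH700936D_M-SCIS", "ASH700936D_M-SCIS"],
)

df_transposed = df.transpose()
with open("transposta.txt", "w") as fp:
    # to_string pads every column, so no manual '{:<20}' format string is needed
    fp.write(df_transposed.to_string())
```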
I have a dataframe that looks like the one below, except much longer. Ultimately, Var, Type, and Level combined represent unique entries. I want to divide the other entries in the dataframe by the Unexposed entries, according to the appropriate grouping (e.g., 'Any All Exposed' would be divided by 'Any All Unexposed', whereas 'Any Existing Exposed' would be divided by 'Any Existing Unexposed').
Var Type Level Metric1 Metric2 Metric3
Any All Unexposed 34842 30783 -12
Any All Exposed 54167 54247 0.15
Any All LowExposure 20236 20311 0.37
Any All MediumExposure 15254 15388 0.87
Any All HighExposure 18677 18548 0.7
Any New Unexposed 0 23785 0
Any New Exposed 0 43030 0
Any New LowExposure 0 16356 0
Any New MediumExposure 0 12213 0
Any New HighExposure 0 14461 0
Any Existing Unexposed 34843 6998 -80
Any Existing Exposed 54167 11217 -80
Any Existing LowExposure 20236 3955 -81
Any Existing MediumExposure 15254 3175 -79
Any Existing HighExposure 18677 4087 -78
The most straightforward way to do this, I think, would be creating a multiindex, but I've tried a variety of methods to no avail (normally receiving an error that it can't divide on a non-unique index).
An expected result would be something like, where in every row is divided by the Unexposed row according to the var and type values.
Var Type Level Metric1 Metric2 Metric3 MP1 MP2 MP3
Any All Unexposed 34842 30783 -12 1.00 1.00 1.00
Any All Exposed 54167 54247 0.15 1.55 1.76 -0.01
Any All LowExposure 20236 20311 0.37 0.58 0.66 -0.03
Any All MediumExposure 15254 15388 0.87 0.44 0.50 -0.07
Any All HighExposure 18677 18548 0.7 0.54 0.60 -0.06
Any New Unexposed 0 23785 0 0.00 1.00 0.00
Any New Exposed 0 43030 0 0.00 1.81 0.00
Any New LowExposure 0 16356 0 0.00 0.69 0.00
Any New MediumExposure 0 12213 0 0.00 0.51 0.00
Any New HighExposure 0 14461 0 0.00 0.61 0.00
Any Existing Unexposed 34843 6998 -80 1.00 1.00 1.00
Any Existing Exposed 54167 11217 -80 1.55 1.60 1.00
Any Existing LowExposure 20236 3955 -81 0.58 0.57 1.01
Any Existing MediumExposure 15254 3175 -79 0.44 0.45 0.99
Any Existing HighExposure 18677 4087 -78 0.54 0.58 0.98
To divide every row in each Var/Type grouping by a specific Level, use groupby and divide.
For example, to divide by Unexposed, as in your example output:
def divide_by(g, denom_lvl):
    cols = ["Metric1", "Metric2", "Metric3"]
    num = g[cols]
    denom = g.loc[g.Level==denom_lvl, cols].iloc[0]
    return num.divide(denom).fillna(0).round(2)

df.groupby(['Var','Type']).apply(divide_by, denom_lvl='Unexposed')
Output:
Metric1 Metric2 Metric3
0 1.00 1.00 1.00
1 1.55 1.76 -0.01
2 0.58 0.66 -0.03
3 0.44 0.50 -0.07
4 0.54 0.60 -0.06
5 0.00 1.00 0.00
6 0.00 1.81 0.00
7 0.00 0.69 0.00
8 0.00 0.51 0.00
9 0.00 0.61 0.00
10 1.00 1.00 1.00
11 1.55 1.60 1.00
12 0.58 0.57 1.01
13 0.44 0.45 0.99
14 0.54 0.58 0.98
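For a runnable check, here is a minimal self-contained version of the same groupby/apply idea using a two-group subset of the question's numbers (group_keys=False keeps the original row index):

```python
import pandas as pd

# Two-group subset of the question's data
df = pd.DataFrame({
    "Var":   ["Any"] * 4,
    "Type":  ["All", "All", "Existing", "Existing"],
    "Level": ["Unexposed", "Exposed", "Unexposed", "Exposed"],
    "Metric1": [34842, 54167, 34843, 54167],
    "Metric2": [30783, 54247, 6998, 11217],
})

def divide_by(g, denom_lvl):
    cols = ["Metric1", "Metric2"]
    # Denominator row: the group's Unexposed entry
    denom = g.loc[g.Level == denom_lvl, cols].iloc[0]
    return g[cols].divide(denom).fillna(0).round(2)

out = df.groupby(["Var", "Type"], group_keys=False).apply(divide_by, denom_lvl="Unexposed")
print(out)
```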
I'm not sure if I got it correctly. Would something like this do the trick?
You can parse all unique combinations and perform the division.
var_col = df['Var'].unique()
type_col = df['Type'].unique()
for i in var_col:
    for j in type_col:
        mask = (df['Var'] == i) & (df['Type'] == j)
        result = df[mask & (df['Level'] == 'Exposed')] / df[mask & (df['Level'] == 'Unexposed')]
...
I have the following df in pandas:
df:
DATE STOCK DATA1 DATA2 DATA3
01/01/12 ABC 0.40 0.88 0.22
04/01/12 ABC 0.50 0.49 0.13
07/01/12 ABC 0.85 0.36 0.83
10/01/12 ABC 0.28 0.12 0.39
01/01/13 ABC 0.86 0.87 0.58
04/01/13 ABC 0.95 0.39 0.87
07/01/13 ABC 0.60 0.25 0.56
10/01/13 ABC 0.15 0.28 0.69
01/01/11 XYZ 0.94 0.40 0.50
04/01/11 XYZ 0.65 0.19 0.81
07/01/11 XYZ 0.89 0.59 0.69
10/01/11 XYZ 0.12 0.09 0.18
01/01/12 XYZ 0.25 0.94 0.55
04/01/12 XYZ 0.07 0.22 0.67
07/01/12 XYZ 0.46 0.08 0.54
10/01/12 XYZ 0.04 0.03 0.94
...
I want to group by the stocks, sort by date and then for specified columns (in this case DATA1 and DATA3), I want to get the last four items summed (TTM data).
The output would look like this:
DATE STOCK DATA1 DATA2 DATA3 DATA1_TTM DATA3_TTM
01/01/12 ABC 0.40 0.88 0.22 NaN NaN
04/01/12 ABC 0.50 0.49 0.13 NaN NaN
07/01/12 ABC 0.85 0.36 0.83 NaN NaN
10/01/12 ABC 0.28 0.12 0.39 2.03 1.56
01/01/13 ABC 0.86 0.87 0.58 2.49 1.92
04/01/13 ABC 0.95 0.39 0.87 2.94 2.66
07/01/13 ABC 0.60 0.25 0.56 2.69 2.39
10/01/13 ABC 0.15 0.28 0.69 2.55 2.70
01/01/11 XYZ 0.94 0.40 0.50 NaN NaN
04/01/11 XYZ 0.65 0.19 0.81 NaN NaN
07/01/11 XYZ 0.89 0.59 0.69 NaN NaN
10/01/11 XYZ 0.12 0.09 0.18 2.59 2.18
01/01/12 XYZ 0.25 0.94 0.55 1.90 2.23
04/01/12 XYZ 0.07 0.22 0.67 1.33 2.09
07/01/12 XYZ 0.46 0.08 0.54 0.89 1.94
10/01/12 XYZ 0.04 0.03 0.94 0.82 2.70
...
My approach so far has been to sort by date, then group, then iterate through each group; if there are 3 events older than the current event, I sum. Also, I want to check whether the dates fall within 1 year. Can anyone offer a better way in Python? Thank you.
Added: As a clarification for the 1 year part, let's say you take the last four dates and it goes 1/1/1993, 4/1/12, 7/1/12, 10/1/12 -- a data error. I wouldn't want to sum those four. I would want that one to say NaN.
For this I think you can use transform and rolling_sum. Starting from your dataframe, I might do something like:
>>> df["DATE"] = pd.to_datetime(df["DATE"]) # switch to datetime to ease sorting
>>> df = df.sort(["STOCK", "DATE"])
>>> rsum_columns = "DATA1", "DATA3"
>>> grouped = df.groupby("STOCK")[rsum_columns]
>>> new_columns = grouped.transform(lambda x: pd.rolling_sum(x, 4))
>>> df[new_columns.columns + "_TTM"] = new_columns
>>> df
DATE STOCK DATA1 DATA2 DATA3 DATA1_TTM DATA3_TTM
0 2012-01-01 00:00:00 ABC 0.40 0.88 0.22 NaN NaN
1 2012-04-01 00:00:00 ABC 0.50 0.49 0.13 NaN NaN
2 2012-07-01 00:00:00 ABC 0.85 0.36 0.83 NaN NaN
3 2012-10-01 00:00:00 ABC 0.28 0.12 0.39 2.03 1.57
4 2013-01-01 00:00:00 ABC 0.86 0.87 0.58 2.49 1.93
5 2013-04-01 00:00:00 ABC 0.95 0.39 0.87 2.94 2.67
6 2013-07-01 00:00:00 ABC 0.60 0.25 0.56 2.69 2.40
7 2013-10-01 00:00:00 ABC 0.15 0.28 0.69 2.56 2.70
8 2011-01-01 00:00:00 XYZ 0.94 0.40 0.50 NaN NaN
9 2011-04-01 00:00:00 XYZ 0.65 0.19 0.81 NaN NaN
10 2011-07-01 00:00:00 XYZ 0.89 0.59 0.69 NaN NaN
11 2011-10-01 00:00:00 XYZ 0.12 0.09 0.18 2.60 2.18
12 2012-01-01 00:00:00 XYZ 0.25 0.94 0.55 1.91 2.23
13 2012-04-01 00:00:00 XYZ 0.07 0.22 0.67 1.33 2.09
14 2012-07-01 00:00:00 XYZ 0.46 0.08 0.54 0.90 1.94
15 2012-10-01 00:00:00 XYZ 0.04 0.03 0.94 0.82 2.70
[16 rows x 7 columns]
I don't know what you're asking by "Also, I want to check to see if the dates fall within 1 year", so I'll leave that alone.
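One caveat: pd.rolling_sum was removed in later pandas releases. In modern pandas the same transform can be written with .rolling (a sketch, assuming a recent version):

```python
import pandas as pd

# One-stock subset of the question's data
df = pd.DataFrame({
    "STOCK": ["ABC"] * 5,
    "DATA1": [0.40, 0.50, 0.85, 0.28, 0.86],
})

# Trailing four-period sum per stock; the first three rows lack a full window
df["DATA1_TTM"] = df.groupby("STOCK")["DATA1"].transform(
    lambda x: x.rolling(4).sum()
)
print(df)
```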