Python: Transpose a dataframe and the result is incomplete

I have a problem when I try to transpose my dataframe: the process mostly works, but I don't understand why the transposition doesn't include the last column. This is my original dataframe:
ASH700936D_M-SCIS East 2.07 -0.30 -0.27 0.00 0.00 0.00 0.00 0.19
ASH700936D_M-SCIS North 1.93 0.00 0.00 -0.15 0.09 0.04 -0.27 0.12
ASH700936D_M-SCIS Up 31.59 -40.09 -1.48 15.31 1.03 0.00 0.00 0.65
ASH701945E_M-SCIS East 2.66 0.00 0.00 0.00 0.00 0.00 0.00 0.17
ASH701945E_M-SCIS North -0.91 0.00 0.00 -0.21 0.08 0.13 -0.44 0.12
ASH701945E_M-SCIS Up 5.45 3.31 0.11 0.00 0.00 -0.18 -0.18 0.41
LEIAR20-LEIM East -1.34 0.04 0.06 0.00 0.00 0.03 -0.05 0.05
LEIAR20-LEIM North -0.39 0.04 0.07 0.00 0.00 0.01 -0.06 0.03
LEIAR20-LEIM Up 0.58 0.00 0.00 0.10 0.04 0.20 0.02 0.13
LEIAR25.R3-LEIT East -0.39 0.00 0.00 0.02 0.28 -0.00 -0.31 0.08
LEIAR25.R3-LEIT North -0.65 0.00 0.00 0.09 -0.11 -0.10 0.22 0.05
LEIAR25.R3-LEIT Up 2.02 -8.52 -1.15 2.62 0.63 0.00 0.00 0.27
LEIAR25.R4-LEIT East 0.79 0.00 0.00 0.02 0.22 -0.05 -0.16 0.12
LEIAR25.R4-LEIT North -0.36 0.00 0.00 0.00 0.00 0.04 0.03 0.05
LEIAR25.R4-LEIT Up 15.11 -20.38 0.03 7.53 0.07 0.00 0.00 0.32
My transposed dataframe is:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
ASH700936D_M-SCIS ASH700936D_M-SCIS ASH700936D_M-SCIS ASH701945E_M-SCIS ASH701945E_M-SCIS ASH701945E_M-SCIS LEIAR20-LEIM LEIAR20-LEIM LEIAR20-LEIM LEIAR25.R3-LEIT LEIAR25.R3-LEIT LEIAR25.R3-LEIT LEIAR25.R4-LEIT LEIAR25.R4-LEIT
East North Up East North Up East North Up East North Up East North
2.07 1.93 31.59 2.66 -0.91 5.45 -1.34 -0.39 0.58 -0.39 -0.65 2.02 0.79 -0.36
-0.30 0.0 -40.09 0.0 0.0 3.31 0.04 0.04 0.0 0.0 0.0 -8.52 0.0 0.0
-0.27 0.0 -1.48 0.0 0.0 0.11 0.06 0.07 0.0 0.0 0.0 -1.15 0.0 0.0
0.00 -0.15 15.31 0.0 -0.21 0.0 0.0 0.0 0.1 0.02 0.09 2.62 0.02 0.0
0.00.1 0.09 1.03 0.0 0.08 0.0 0.0 0.0 0.04 0.28 -0.11 0.63 0.22 0.0
0.00.2 0.04 0.0 0.0 0.13 -0.18 0.03 0.01 0.2 -0.0 -0.1 0.0 -0.05 0.04
0.00.3 -0.27 0.0 0.0 -0.44 -0.18 -0.05 -0.06 0.02 -0.31 0.22 0.0 -0.16 0.03
0.19 0.12 0.65 0.17 0.12 0.41 0.05 0.03 0.13 0.08 0.05 0.27 0.12 0.05
As you can see, at the end of the transposed file one "LEIAR25.R4-LEIT" column is missing.
This is my script:
df = pd.read_csv('original_dataframe.txt', sep='\s*', index_col=None, engine='python')
df_transposed = df.transpose()
df_transposed.to_csv('transposta_bozza.txt', sep=' ')
with open('transposta_bozza.txt', 'r') as f:
    with open('transposta.txt', 'w') as r:
        for line in f:
            data = line.split()
            df = '{0[0]:<20}{0[1]:<20}{0[2]:<20}{0[3]:<20}{0[4]:<20}{0[5]:<20}{0[6]:<20}{0[7]:<20}{0[8]:<20}{0[9]:<20}{0[10]:<20}{0[11]:<20}{0[12]:<20}{0[13]:<20}'.format(data)
            r.write("%s\n" % df)
I should add that if I try to add {0[14]:<20} to the format string, I get "IndexError: Index out of range".
I think there is some problem with the index, but I don't know how to fix it!

You don't have to write a temporary file; try out this code:
with open('transposta.txt', 'w') as fp:
    for i, row in df_transposed.iterrows():
        fp.write(''.join('{:<20}'.format(x) for x in row))
        fp.write('\n')
I wasn't able to spot the problem with your code, but I hope this helps.
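One possible explanation, offered as a sketch rather than a certain diagnosis: pd.read_csv defaults to header=0, so with a headerless file the first data row is consumed as column names (row labels like 0.00.1 in the transposed output look like pandas de-duplicating those numeric "names"), and the hand-written format string only prints fields 0-13, silently dropping the last field of the longer data lines. Passing header=None keeps every row; a minimal sketch with hypothetical data:

```python
import io
import pandas as pd

# Hypothetical headerless file: three data rows, like the antenna table above.
data = "A x 1 2\nB y 3 4\nC z 5 6\n"

# Default header=0 consumes the first row as column names -> only 2 rows left.
df_default = pd.read_csv(io.StringIO(data), sep=r'\s+')
print(df_default.transpose().shape[1])  # 2 columns after transposing

# header=None keeps all three rows, so the transpose has 3 columns.
df_fixed = pd.read_csv(io.StringIO(data), sep=r'\s+', header=None)
print(df_fixed.transpose().shape[1])  # 3 columns after transposing
```

With header=None, pandas assigns numeric column labels and no data row is lost.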

Related

Pandas function to subtract a cumulative column monthly

I have a weather dataset with many years of data, as below:
Date         rainfallInMon
2009-01-01   0.0
2009-01-02   0.03
2009-01-03   0.05
2009-01-04   0.05
2009-01-05   0.06
...          ...
2009-01-29   0.2
2009-01-30   0.21
2009-01-31   0.21
2009-02-01   0.0
2009-02-02   0.0
...          ...
I am trying to get the daily rainfall by subtracting the previous day's cumulative value within each month. For example:
Date         rainfallDaily
2009-01-01   0.0
2009-01-02   0.03
2009-01-03   0.02
...          ...
2009-01-29   0.01
2009-01-30   0.0
...          ...
Thanks for your efforts in advance.
Because there are many years of data, use Series.dt.to_period to create month periods, so that months are distinguished across years:
df['rainfallDaily'] = (df.groupby(df['Date'].dt.to_period('m'))['rainfallInMon']
                         .diff()
                         .fillna(0))
Or use Grouper:
df['rainfallDaily'] = (df.groupby(pd.Grouper(freq='M', key='Date'))['rainfallInMon']
                         .diff()
                         .fillna(0))
print (df)
Date rainfallInMon rainfallDaily
0 2009-01-01 0.00 0.00
1 2009-01-02 0.03 0.03
2 2009-01-03 0.05 0.02
3 2009-01-04 0.05 0.00
4 2009-01-05 0.06 0.01
5 2009-01-29 0.20 0.14
6 2009-01-30 0.21 0.01
7 2009-01-31 0.21 0.00
8 2009-02-01 0.00 0.00
9 2009-02-02 0.00 0.00
Try:
# Convert to datetime if it's not already the case
df['Date'] = pd.to_datetime(df['Date'])
df['rainfallDaily'] = df.resample('M', on='Date')['rainfallInMon'].diff().fillna(0)
print(df)
# Output
Date rainfallInMon rainfallDaily
0 2009-01-01 0.00 0.00
1 2009-01-02 0.03 0.03
2 2009-01-03 0.05 0.02
3 2009-01-04 0.05 0.00
4 2009-01-05 0.06 0.01
5 2009-01-29 0.20 0.14
6 2009-01-30 0.21 0.01
7 2009-01-31 0.21 0.00
8 2009-02-01 0.00 0.00
9 2009-02-02 0.00 0.00
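The to_period approach can be exercised in a self-contained sketch; the dates and values below are a hypothetical miniature of the question's data:

```python
import pandas as pd

# Hypothetical cumulative monthly rainfall, mirroring the question's data.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2009-01-29', '2009-01-30', '2009-01-31',
                            '2009-02-01', '2009-02-02']),
    'rainfallInMon': [0.20, 0.21, 0.21, 0.00, 0.00],
})

# Group by month period so the diff restarts at each month boundary.
df['rainfallDaily'] = (df.groupby(df['Date'].dt.to_period('m'))['rainfallInMon']
                         .diff()
                         .fillna(0))
print(df)
```

The February rows get 0.0 rather than a negative diff against January's cumulative total, which is the whole point of grouping before diffing.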

How to add sum() and mean() values above the df column values in the same line?

Suppose we have a df with a sum() value as in the DataFrame below (thanks so much for @jezrael's answer here). Now the sum value is in the first line and the avg value is in the second line, but it's ugly. How can I put the sum value and avg value on the same line, with the index name Total, and place it in the first line as below?
# Total 27.56 25.04 -1.31
The pandas code is as below:
df.columns=['value_a','value_b','name','up_or_down','difference']
df1 = df[['value_a','value_b']].sum().to_frame().T
df2 = df[['difference']].mean().to_frame().T
df = pd.concat([df1,df2, df], ignore_index=True)
df
              value_a  value_b  name  up_or_down  difference
project_name
0               27.56    25.04   NaN         NaN         NaN
1                 NaN      NaN   NaN         NaN       -1.31
2021-project11 0.43 0.48 2021-project11 up 0.05
2021-project1 0.62 0.56 2021-project1 down -0.06
2021-project2 0.51 0.47 2021-project2 down -0.04
2021-porject3 0.37 0.34 2021-porject3 down -0.03
2021-porject4 0.64 0.61 2021-porject4 down -0.03
2021-project5 0.32 0.25 2021-project5 down -0.07
2021-project6 0.75 0.81 2021-project6 up 0.06
2021-project7 0.60 0.60 2021-project7 down 0.00
2021-project8 0.85 0.74 2021-project8 down -0.11
2021-project10 0.67 0.67 2021-project10 down 0.00
2021-project9 0.73 0.73 2021-project9 down 0.00
2021-project11 0.54 0.54 2021-project11 down 0.00
2021-project12 0.40 0.40 2021-project12 down 0.00
2021-project13 0.76 0.77 2021-project13 up 0.01
2021-project14 1.16 1.28 2021-project14 up 0.12
2021-project15 1.01 0.94 2021-project15 down -0.07
2021-project16 1.23 1.24 2021-project16 up 0.01
2022-project17 0.40 0.36 2022-project17 down -0.04
2022-project_11 0.40 0.40 2022-project_11 down 0.00
2022-project4 1.01 0.80 2022-project4 down -0.21
2022-project1 0.65 0.67 2022-project1 up 0.02
2022-project2 0.75 0.57 2022-project2 down -0.18
2022-porject3 0.32 0.32 2022-porject3 down 0.00
2022-project18 0.91 0.56 2022-project18 down -0.35
2022-project5 0.84 0.89 2022-project5 up 0.05
2022-project19 0.61 0.48 2022-project19 down -0.13
2022-project6 0.77 0.80 2022-project6 up 0.03
2022-project20 0.63 0.54 2022-project20 down -0.09
2022-project8 0.59 0.55 2022-project8 down -0.04
2022-project21 0.58 0.54 2022-project21 down -0.04
2022-project10 0.76 0.76 2022-project10 down 0.00
2022-project9 0.70 0.71 2022-project9 up 0.01
2022-project22 0.62 0.56 2022-project22 down -0.06
2022-project23 2.03 1.74 2022-project23 down -0.29
2022-project12 0.39 0.39 2022-project12 down 0.00
2022-project24 1.35 1.55 2022-project24 up 0.20
project25 0.45 0.42 project25 down -0.03
project26 0.53 NaN project26 down NaN
project27 0.68 NaN project27 down NaN
Thanks so much for any advice
Use DataFrame.agg with a dictionary of aggregate functions:
df.columns=['value_a','value_b','name','up_or_down','difference']
df1 = df.agg({'value_a':'sum', 'value_b':'sum', 'difference':'mean'}).to_frame('Total').T
df = pd.concat([df1,df])
print (df.head())
value_a value_b difference name up_or_down
Total 27.56 25.04 -0.035405 NaN NaN
2021-project11 0.43 0.48 0.050000 2021-project11 up
2021-project1 0.62 0.56 -0.060000 2021-project1 down
2021-project2 0.51 0.47 -0.040000 2021-project2 down
2021-porject3 0.37 0.34 -0.030000 2021-porject3 down
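A self-contained sketch of the agg approach, using a hypothetical two-project frame (the names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical miniature of the question's data.
df = pd.DataFrame({'value_a': [0.43, 0.62],
                   'value_b': [0.48, 0.56],
                   'name': ['p1', 'p2'],
                   'up_or_down': ['up', 'down'],
                   'difference': [0.05, -0.06]},
                  index=['p1', 'p2'])

# One aggregated row: sums for the value columns, mean for the difference.
total = df.agg({'value_a': 'sum', 'value_b': 'sum',
                'difference': 'mean'}).to_frame('Total').T
df = pd.concat([total, df])
print(df)
```

Because sum and mean land in a single Series before the transpose, the Total row occupies one line, which is what the question asks for.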

Normalising data for plotting

I am trying to plot the data shown below in a normalised way, in order to have the maximum value on the y-axis equal to 1.
Dataset:
%_F %_M %_C %_D Label
0 0.00 0.00 0.08 0.05 0.0
1 0.00 0.00 0.00 0.14 0.0
2 0.00 0.00 0.10 0.01 1.0
3 0.01 0.01 0.07 0.05 1.0
4 0.00 0.00 0.07 0.14 0.0
6 0.00 0.00 0.07 0.05 0.0
7 0.00 0.00 0.05 0.68 0.0
8 0.00 0.00 0.03 0.09 0.0
9 0.00 0.00 0.04 0.02 0.0
10 0.00 0.00 0.06 0.02 0.0
I tried as follows:
cols_to_norm = ["%_F", "%_M", "%_C", "%_D"]
df[cols_to_norm] = df[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
but I am not completely sure about the output.
In fact, if I plot as follows
df.pivot_table(index='Label').plot.bar()
I get a different result. I think it is because the first snippet does not take the Label index into account.
There are multiple techniques to normalize data.
This one uses native pandas:
import io
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(io.StringIO(""" %_F %_M %_C %_D Label
0 0.00 0.00 0.08 0.05 0.0
1 0.00 0.00 0.00 0.14 0.0
2 0.00 0.00 0.10 0.01 1.0
3 0.01 0.01 0.07 0.05 1.0
4 0.00 0.00 0.07 0.14 0.0
6 0.00 0.00 0.07 0.05 0.0
7 0.00 0.00 0.05 0.68 0.0
8 0.00 0.00 0.03 0.09 0.0
9 0.00 0.00 0.04 0.02 0.0
10 0.00 0.00 0.06 0.02 0.0"""), sep="\s+")
fig, ax = plt.subplots(2, figsize=[10,6])
df2 = (df-df.min())/(df.max()-df.min())
df.plot(ax=ax[0], kind="line")
df2.plot(ax=ax[1], kind="line")
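As a sanity check on the min-max scaling, a small sketch (column names assumed as in the question, values invented) confirms that each normalized column spans exactly [0, 1]:

```python
import pandas as pd

# Hypothetical rows shaped like the question's dataset.
df = pd.DataFrame({'%_F': [0.00, 0.01, 0.00],
                   '%_M': [0.00, 0.01, 0.00],
                   '%_C': [0.08, 0.07, 0.10],
                   '%_D': [0.05, 0.05, 0.01]})

cols_to_norm = ['%_F', '%_M', '%_C', '%_D']
# Min-max scaling: each column's minimum maps to 0, its maximum to 1.
df[cols_to_norm] = df[cols_to_norm].apply(
    lambda x: (x - x.min()) / (x.max() - x.min()))

print(df[cols_to_norm].min().tolist(), df[cols_to_norm].max().tolist())
```

Note that scaling the Label column too (as df2 above does) is harmless when Label is already 0/1, since min-max leaves those values unchanged.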

Python (Pandas) - Create a column by matching a column's values into the dataframe

I have the below assumed dataframe
a b c d e F
0.02 0.62 0.31 0.67 0.27 a
0.30 0.07 0.23 0.42 0.00 a
0.82 0.59 0.34 0.73 0.29 a
0.90 0.80 0.13 0.14 0.07 d
0.50 0.62 0.94 0.34 0.53 d
0.59 0.84 0.95 0.42 0.54 d
0.13 0.33 0.87 0.20 0.25 d
0.47 0.37 0.84 0.69 0.28 e
Column F represents the columns of the dataframe.
For each row, I want to use the value in column F to find the relevant column of the dataframe and return the selected values in one column.
The outcome will look like this:
a b c d e F To_Be_Filled
0.02 0.62 0.31 0.67 0.27 a 0.02
0.30 0.07 0.23 0.42 0.00 a 0.30
0.82 0.59 0.34 0.73 0.29 a 0.82
0.90 0.80 0.13 0.14 0.07 d 0.14
0.50 0.62 0.94 0.34 0.53 d 0.34
0.59 0.84 0.95 0.42 0.54 d 0.42
0.13 0.33 0.87 0.20 0.25 d 0.20
0.47 0.37 0.84 0.69 0.28 e 0.28
I am able to identify each case with the below, but I am not sure how to do it across the whole dataframe.
test.loc[test.iloc[:, 5] == 'a', test.columns == 'a']
Many thanks in advance.
You can use lookup:
df['To_Be_Filled'] = df.lookup(np.arange(len(df)), df['F'])
df
Out:
a b c d e F To_Be_Filled
0 0.02 0.62 0.31 0.67 0.27 a 0.02
1 0.30 0.07 0.23 0.42 0.00 a 0.30
2 0.82 0.59 0.34 0.73 0.29 a 0.82
3 0.90 0.80 0.13 0.14 0.07 d 0.14
4 0.50 0.62 0.94 0.34 0.53 d 0.34
5 0.59 0.84 0.95 0.42 0.54 d 0.42
6 0.13 0.33 0.87 0.20 0.25 d 0.20
7 0.47 0.37 0.84 0.69 0.28 e 0.28
np.arange(len(df)) can be replaced with df.index.
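Worth noting: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer versions, the same row-wise pick can be sketched with plain NumPy indexing (miniature data invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's frame.
df = pd.DataFrame({'a': [0.02, 0.90], 'b': [0.62, 0.80],
                   'c': [0.31, 0.13], 'd': [0.67, 0.14],
                   'e': [0.27, 0.07], 'F': ['a', 'd']})

vals = df[['a', 'b', 'c', 'd', 'e']]
# For each row, pick the value in the column named by F.
df['To_Be_Filled'] = vals.to_numpy()[np.arange(len(df)),
                                     vals.columns.get_indexer(df['F'])]
print(df['To_Be_Filled'].tolist())  # [0.02, 0.14]
```

Selecting only the value columns first keeps the intermediate array numeric instead of object-typed.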

Group by - select most recent 4 events

I have the following df in pandas:
df:
DATE STOCK DATA1 DATA2 DATA3
01/01/12 ABC 0.40 0.88 0.22
04/01/12 ABC 0.50 0.49 0.13
07/01/12 ABC 0.85 0.36 0.83
10/01/12 ABC 0.28 0.12 0.39
01/01/13 ABC 0.86 0.87 0.58
04/01/13 ABC 0.95 0.39 0.87
07/01/13 ABC 0.60 0.25 0.56
10/01/13 ABC 0.15 0.28 0.69
01/01/11 XYZ 0.94 0.40 0.50
04/01/11 XYZ 0.65 0.19 0.81
07/01/11 XYZ 0.89 0.59 0.69
10/01/11 XYZ 0.12 0.09 0.18
01/01/12 XYZ 0.25 0.94 0.55
04/01/12 XYZ 0.07 0.22 0.67
07/01/12 XYZ 0.46 0.08 0.54
10/01/12 XYZ 0.04 0.03 0.94
...
I want to group by the stocks, sort by date and then for specified columns (in this case DATA1 and DATA3), I want to get the last four items summed (TTM data).
The output would look like this:
DATE STOCK DATA1 DATA2 DATA3 DATA1_TTM DATA3_TTM
01/01/12 ABC 0.40 0.88 0.22 NaN NaN
04/01/12 ABC 0.50 0.49 0.13 NaN NaN
07/01/12 ABC 0.85 0.36 0.83 NaN NaN
10/01/12 ABC 0.28 0.12 0.39 2.03 1.56
01/01/13 ABC 0.86 0.87 0.58 2.49 1.92
04/01/13 ABC 0.95 0.39 0.87 2.94 2.66
07/01/13 ABC 0.60 0.25 0.56 2.69 2.39
10/01/13 ABC 0.15 0.28 0.69 2.55 2.70
01/01/11 XYZ 0.94 0.40 0.50 NaN NaN
04/01/11 XYZ 0.65 0.19 0.81 NaN NaN
07/01/11 XYZ 0.89 0.59 0.69 NaN NaN
10/01/11 XYZ 0.12 0.09 0.18 2.59 2.18
01/01/12 XYZ 0.25 0.94 0.55 1.90 2.23
04/01/12 XYZ 0.07 0.22 0.67 1.33 2.09
07/01/12 XYZ 0.46 0.08 0.54 0.89 1.94
10/01/12 XYZ 0.04 0.03 0.94 0.82 2.70
...
My approach so far has been to sort by date, then group, then iterate through each group; if there are 3 events older than the current event, I sum. Also, I want to check that the dates fall within 1 year. Can anyone offer a better way in Python? Thank you.
Added: As a clarification for the 1 year part, let's say you take the last four dates and it goes 1/1/1993, 4/1/12, 7/1/12, 10/1/12 -- a data error. I wouldn't want to sum those four. I would want that one to say NaN.
For this I think you can use transform and rolling_sum. Starting from your dataframe, I might do something like:
>>> df["DATE"] = pd.to_datetime(df["DATE"]) # switch to datetime to ease sorting
>>> df = df.sort(["STOCK", "DATE"])
>>> rsum_columns = "DATA1", "DATA3"
>>> grouped = df.groupby("STOCK")[rsum_columns]
>>> new_columns = grouped.transform(lambda x: pd.rolling_sum(x, 4))
>>> df[new_columns.columns + "_TTM"] = new_columns
>>> df
DATE STOCK DATA1 DATA2 DATA3 DATA1_TTM DATA3_TTM
0 2012-01-01 00:00:00 ABC 0.40 0.88 0.22 NaN NaN
1 2012-04-01 00:00:00 ABC 0.50 0.49 0.13 NaN NaN
2 2012-07-01 00:00:00 ABC 0.85 0.36 0.83 NaN NaN
3 2012-10-01 00:00:00 ABC 0.28 0.12 0.39 2.03 1.57
4 2013-01-01 00:00:00 ABC 0.86 0.87 0.58 2.49 1.93
5 2013-04-01 00:00:00 ABC 0.95 0.39 0.87 2.94 2.67
6 2013-07-01 00:00:00 ABC 0.60 0.25 0.56 2.69 2.40
7 2013-10-01 00:00:00 ABC 0.15 0.28 0.69 2.56 2.70
8 2011-01-01 00:00:00 XYZ 0.94 0.40 0.50 NaN NaN
9 2011-04-01 00:00:00 XYZ 0.65 0.19 0.81 NaN NaN
10 2011-07-01 00:00:00 XYZ 0.89 0.59 0.69 NaN NaN
11 2011-10-01 00:00:00 XYZ 0.12 0.09 0.18 2.60 2.18
12 2012-01-01 00:00:00 XYZ 0.25 0.94 0.55 1.91 2.23
13 2012-04-01 00:00:00 XYZ 0.07 0.22 0.67 1.33 2.09
14 2012-07-01 00:00:00 XYZ 0.46 0.08 0.54 0.90 1.94
15 2012-10-01 00:00:00 XYZ 0.04 0.03 0.94 0.82 2.70
[16 rows x 7 columns]
I don't know what you're asking by "Also, I want to check to see if the dates fall within 1 year", so I'll leave that alone.
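pd.rolling_sum was later removed from pandas; on modern versions the same TTM sums can be sketched with groupby plus Series.rolling (the frame below is a hypothetical miniature of the question's ABC data):

```python
import pandas as pd

# Hypothetical one-stock miniature of the question's data.
df = pd.DataFrame({
    'DATE': pd.to_datetime(['2012-01-01', '2012-04-01', '2012-07-01',
                            '2012-10-01', '2013-01-01']),
    'STOCK': ['ABC'] * 5,
    'DATA1': [0.40, 0.50, 0.85, 0.28, 0.86],
    'DATA3': [0.22, 0.13, 0.83, 0.39, 0.58],
})

df = df.sort_values(['STOCK', 'DATE'])
for col in ['DATA1', 'DATA3']:
    # Rolling 4-quarter sum within each stock; the first 3 rows stay NaN.
    df[col + '_TTM'] = df.groupby('STOCK')[col].transform(
        lambda s: s.rolling(4).sum())
print(df)
```

The rolling window never crosses stock boundaries because the rolling sum runs inside each groupby group, matching the transform/rolling_sum combination in the answer above.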
