Taking the difference between the first and last rows and appending the result as a new row - python

I want the result of this line of code df.diff(periods=len(df)-1) in a new row of my data frame.
The line of code above calculates the difference between the first and the last row. What I want is to append a new row with that result, so that later I can compare the trend of my data in percentages. I have explained the final goal in case a more straightforward approach exists.
My data
#MetaHash 0x 1inch 88mph AC \
Time
14:00 16/03/2021 196876.0 162052086.0 279895846.0 850387.0 713496.0
14:02 16/03/2021 271687.0 150463819.0 281510814.0 850387.0 714325.0
14:49 16/03/2021 362927.0 164764136.0 278248431.0 862865.0 688467.0
And these are the results obtained after applying that line of code:
#MetaHash 0x 1inch 88mph AC \
Time
14:00 16/03/2021 NaN NaN NaN NaN NaN
14:02 16/03/2021 NaN NaN NaN NaN NaN
17:15 17/03/2021 NaN NaN NaN NaN NaN
11:46 18/03/2021 362810.0 270883348.0 115643691.0 1833585.0 -312283.0 # I want this row appended to the data frame shown above.

Try using the concat() method:
import pandas as pd
df = pd.concat((df, df.diff(periods=len(df) - 1).dropna()))
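A minimal, self-contained sketch of the approach above, using made-up numbers shaped like the question's data (only two of the columns):

```python
import pandas as pd

# Hypothetical stand-in for the asker's frame (two columns, three readings).
df = pd.DataFrame(
    {"#MetaHash": [196876.0, 271687.0, 362927.0],
     "0x": [162052086.0, 150463819.0, 164764136.0]},
    index=["14:00 16/03/2021", "14:02 16/03/2021", "14:49 16/03/2021"],
)

# diff(periods=len(df)-1) is NaN everywhere except the last row, which
# holds last row minus first row; dropna() keeps only that row.
delta = df.diff(periods=len(df) - 1).dropna()
result = pd.concat((df, delta))

print(result)
```

Note that the appended row keeps the timestamp of the last reading as its index, so renaming that index entry may be worthwhile before computing percentages.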

Related

How to sum the previous and next row while skipping NaN in Pandas

I have one column in my Dataframe and I am trying to calculate the energy loss with a formula. The problem is that each time I want to use only the two nearest rows whose values are not NaN.
Energy is the input column; I am looking for something like the loss column below.
Energy  loss
NaN     NaN
NaN     NaN
NaN     NaN
4       NaN
NaN     NaN
3       1/2(4^2-3^2)
NaN     NaN
11      NaN
3       1/2(3^2-11^2)
NaN     NaN
14      NaN
I tried a custom lambda function but was not able to access the next row.
Try something like this:
import pandas as pd

df = pd.DataFrame({'Energy': [4, None, 3, None, 11, 3, None, 14]})
energy = df.Energy.dropna()

def my_loss(series):
    return 1 / 2 * (series.iloc[0] ** 2 - series.iloc[1] ** 2)

loss = energy.rolling(2).apply(my_loss)
df['loss'] = loss[1::2]  # skip every other result so each pair of valid rows is used once
Basically you apply your custom function over a rolling window of the Energy column with the NaNs dropped, and then merge the result back into your df.

Python Pandas read_csv quote issue, impossible to separate data

I'm working with several csv files in Pandas. I changed some data names in the original csv file and saved it. Then I restarted and reloaded my Jupyter notebook, but now every dataframe I load from that data source looks something like this:
Department Zone Element Product Year Unit Value
0 U1,"Z3","ODD 2.a.1... NaN NaN NaN NaN NaN NaN
1 U1,"Z3","ODD 2.a.1... NaN NaN NaN NaN NaN NaN
2 U1,"Z5","ODD 2.a.1... NaN NaN NaN NaN NaN NaN
3 U1,"Z6","ODD 2.a.1... NaN NaN NaN NaN NaN NaN
4 U1,"Z9","ODD 2.a.1... NaN NaN NaN NaN NaN NaN
I tried to use sep=',', encoding='UTF-8-SIG', quotechar='"', quoting=0, engine='python' but got the same issue. I don't know how to parse the csv, because even when I created a new csv from the data (without the quotes and with ; as separator) the same issue appears...
The csv has 321 rows; here is an example with the problem: https://www.cjoint.com/c/LDCmfvq06R6
and the original csv file that Pandas reads without problems: https://www.cjoint.com/c/LDCmlweuR66
I think the problem is with the quotes in the file.
import csv
import pandas as pd

df = pd.read_csv('LDCmfvq06R6_FAOSTAT.csv',
                 quotechar='"',
                 delimiter=',',
                 quoting=csv.QUOTE_NONE,
                 on_bad_lines='skip')
# strip the leftover quote characters from every column
for i, col in enumerate(df.columns):
    df.iloc[:, i] = df.iloc[:, i].str.replace('"', '')
df.head()
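The effect of quoting=csv.QUOTE_NONE can be seen on a small inline sample (hypothetical data shaped like the broken file, not the asker's actual csv):

```python
import csv
import io

import pandas as pd

# A sample whose fields are individually wrapped in quotes,
# mimicking the malformed export described above.
raw = 'Department,Zone,Element\nU1,"Z3","ODD 2.a.1"\nU1,"Z5","ODD 2.a.1"\n'

# QUOTE_NONE tells the parser to treat quote characters as ordinary text,
# so the commas split the fields even around the quoted parts.
df = pd.read_csv(io.StringIO(raw), delimiter=',', quoting=csv.QUOTE_NONE)

# Remove the leftover quote characters column by column.
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].str.replace('"', '')

print(df)
```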

Pandas - instead of dropping rows with NaN values I want to keep those rows and drop the others in a particular column

I have several columns in my df; one is error. If that column has a value in a row (the error message value is always 99), I want to remove those rows and keep the ones that are NaN.
df:
error  date        count
99     nan         nan
nan    2022-02-01  234
nan    2022-02-02  34643
99     nan         nan
nan    2022-03-02  23425
99     nan         nan
I know how to drop rows that are NaN, but I want to do the opposite for the error column.
A more general solution than the one proposed by enke is:
df = df[df.error.isna()]
This way you retain only the rows with NaN in the error column,
regardless of the error value in the original DataFrame.
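A quick runnable sketch with a hypothetical frame shaped like the one above:

```python
import numpy as np
import pandas as pd

# Hypothetical frame shaped like the question's data.
df = pd.DataFrame({
    'error': [99, np.nan, np.nan, 99],
    'date': [np.nan, '2022-02-01', '2022-02-02', np.nan],
    'count': [np.nan, 234, 34643, np.nan],
})

# Keep only the rows whose error is NaN, i.e. drop every row
# that carries an error code, whatever its value.
clean = df[df['error'].isna()]

print(clean)
```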

pandas read_excel adds fractional seconds that don't appear in original xlsx file

I am reading an excel spreadsheet into pandas as:
input_df: pd.DataFrame = pd.read_excel(data_filename, engine='openpyxl')
Here's a screenshot of the beginning of the excel file:
However, when I examine the dataframe, fractional parts are added to two out of the three time columns.
Out[6]:
Real Time Current(nA) Unnamed: 2 Unnamed: 3 Sensor 4 Time Sensor 4 Current nA Unnamed: 6 FS Time FS Value
0 11:58:03.111700 119.400 NaN NaN 10:53:39 119.428 NaN 10:43:12 101.0
1 11:58:04.681197 119.439 NaN NaN 10:53:40.795800 119.474 NaN 10:44:06 103.0
2 11:58:07.246866 119.417 NaN NaN 10:53:43.214300 119.447 NaN 10:51:36 88.0
3 11:58:09.388763 119.416 NaN NaN 10:53:45.294400 119.439 NaN 10:53:39 88.0
4 11:58:11.454134 119.411 NaN NaN 10:53:47.302400 119.451 NaN 11:06:58 83.0
These don't appear in the original excel file as evidenced by the screenshot below:
I have no idea where these fractions come from. They don't appear in the original file. Why is this happening, and how can I read in the correct times?
OK. This was my fault. It turns out that there really is a fractional part in the timestamps; Google Sheets just needed to be configured to show it. In summary, the xlsx file and the pandas dataframe do agree.
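Not part of the original answer, but if the fractional seconds are unwanted once the values are in pandas, one option is to floor the parsed timestamps to whole seconds (a sketch with made-up values like the 'Real Time' column above):

```python
import pandas as pd

# Hypothetical column holding timestamps with fractional seconds.
s = pd.to_datetime(pd.Series(['11:58:03.111700', '11:58:04.681197']))

# Truncate to whole seconds; the fractional part is simply dropped.
truncated = s.dt.floor('s')

print(truncated.dt.time.tolist())
```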

Can I separate the values of a dictionary into multiple columns and still be able to plot them?

I want to separate the values of a dictionary into multiple columns and still be able to plot them. At the moment all the values are in one column.
Concretely, I would like to split the different values in each list, using the length of the longest list as the number of columns. For the shorter lists I would like to fill the gaps with something like 'NA' so I can still plot the result in seaborn.
This is the dictionary that I used:
dictio = {'seq_7009': [6236.9764, 6367.049999999999], 'seq_418': [3716.3642000000004, 3796.4124000000006], 'seq_9143_unamb': [4631.958999999999], 'seq_2888': [5219.3359, 5365.4089], 'seq_1101': [4287.7417, 4422.8254], 'seq_107': [5825.695099999999, 5972.8073], 'seq_6946': [5179.3118, 5364.420900000001], 'seq_6162': [5531.503199999999, 5645.577399999999], 'seq_504': [4556.920899999999, 4631.959], 'seq_3535': [3396.1715999999997, 3446.1969999999997, 5655.896546], 'seq_4077': [4551.9108, 4754.0073,4565.987654,5668.9999976], 'seq_1626_unamb': [3724.3894999999998]}
This is the code for the dataframe:
df = pd.Series(dictio)
test = pd.DataFrame({'ID': df.index, 'Value': df.values})
seq_107 [5825.695099999999, 5972.8073]
seq_1101 [4287.7417, 4422.8254]
seq_1626_unamb [3724.3894999999998]
seq_2888 [5219.3359, 5365.4089]
seq_3535 [3396.1715999999997, 3446.1969999999997, 5655....
seq_4077 [4551.9108, 4754.0073, 4565.987654, 5668.9999976]
seq_418 [3716.3642000000004, 3796.4124000000006]
seq_504 [4556.920899999999, 4631.959]
seq_6162 [5531.503199999999, 5645.577399999999]
seq_6946 [5179.3118, 5364.420900000001]
seq_7009 [6236.9764, 6367.049999999999]
seq_9143_unamb [4631.958999999999]
Thanks in advance for the help!
Convert the Value column to a list of lists and load it into a new dataframe; afterwards, call plot. Something like this -
df = pd.DataFrame(test.Value.tolist(), index=test.ID)
df
0 1 2 3
ID
seq_107 5825.6951 5972.8073 NaN NaN
seq_1101 4287.7417 4422.8254 NaN NaN
seq_1626_unamb 3724.3895 NaN NaN NaN
seq_2888 5219.3359 5365.4089 NaN NaN
seq_3535 3396.1716 3446.1970 5655.896546 NaN
seq_4077 4551.9108 4754.0073 4565.987654 5668.999998
seq_418 3716.3642 3796.4124 NaN NaN
seq_504 4556.9209 4631.9590 NaN NaN
seq_6162 5531.5032 5645.5774 NaN NaN
seq_6946 5179.3118 5364.4209 NaN NaN
seq_7009 6236.9764 6367.0500 NaN NaN
seq_9143_unamb 4631.9590 NaN NaN NaN
df.plot()
