My pandas DataFrame has a column "timeStamp" whose elements are of type datetime.datetime. I'm trying to take the difference between consecutive rows of this column to obtain the time spent in seconds. I use the following piece of code for it.
df["Time"] = df["timeStamp"].diff(0).dt.total_seconds()
Generally it works fine; however, I keep getting 0.0 as the result of this operation in quite a few instances, even when that shouldn't be the case.
Example values that result in 0.0:
import pandas as pd
import datetime
import numpy as np
df = pd.DataFrame({'S.No.': [1, 2, 3, 4], 'ABC': [datetime.datetime(2019,2,25,11,49,50), datetime.datetime(2019,2,25,11,50,0),datetime.datetime(2019,2,25,11,50,7),datetime.datetime(2019,2,25,11,50,12)]})
df["Time"] = df["ABC"].diff(0).dt.seconds
print df
Note: using python2.7
Try this:
print(df["timestamp"].diff().fillna(0).dt.seconds)
0 0
1 10
2 7
3 5
df['difference']=df["timestamp"].diff().fillna(0).dt.seconds
print(df)
timestamp difference
0 2019-02-25 11:49:50 0
1 2019-02-25 11:50:00 10
2 2019-02-25 11:50:07 7
3 2019-02-25 11:50:12 5
Use
df["Time"] = df["timeStamp"].diff().dt.total_seconds()
instead.
The argument to diff specifies how many rows above the current row to take the difference with. You're passing 0, so you're subtracting each value from itself, which always gives 0. By leaving it empty, it uses the default value of 1, i.e. the difference with the row directly above.
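To see the effect of the periods argument concretely, here is a minimal sketch using two of the sample timestamps from the question:

```python
import datetime

import pandas as pd

s = pd.Series([datetime.datetime(2019, 2, 25, 11, 49, 50),
               datetime.datetime(2019, 2, 25, 11, 50, 0)])

# diff(0): each row minus itself -- always a zero timedelta
print(s.diff(0).dt.total_seconds().tolist())  # [0.0, 0.0]

# diff() (default periods=1): each row minus the previous row
print(s.diff().dt.total_seconds().tolist())   # [nan, 10.0]
```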
Related
I have a csv that has incorrect datetime formatting. I've worked out how to convert those values into the format I need, but now I need to reassign all the values in a column to the new, converted values.
For example, I'm hoping that there's something I can put into the following FOR loop that will insert the values back into the dataframe at the correct location:
for i in df[df.columns[1]]:
    t = pd.Timestamp(i)
    short_date = t.date().strftime('%m/%d/%Y').lstrip('0')
    # Insert back into dataframe?
As always, your help is very much appreciated!
Part of the dataframe in question:
Created Date
2019-02-27 22:55:16
2019-01-29 22:57:12
2018-11-29 00:13:31
2019-01-30 21:35:15
2018-12-20 21:14:45
2018-11-01 16:20:15
2019-04-11 16:38:07
2019-01-24 00:23:17
2018-12-21 19:30:10
2018-12-19 22:33:04
2018-11-07 19:54:19
2019-05-10 21:15:00
In the simplest, but most instructive, possible terms:
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
df
# x y
# 0 1 4
# 1 2 5
# 2 3 6
df[:] = df[:].astype(float)
df
# x y
# 0 1.0 4.0
# 1 2.0 5.0
# 2 3.0 6.0
Let pandas do the work for you.
Or, for only one column:
df.x = df.x.astype(float)
df
# x y
# 0 1.0 4
# 1 2.0 5
# 2 3.0 6
You'll, of course, replace astype(float) with your own conversion; since .date().strftime(...) operates on a single Timestamp rather than a whole column, that would be something like df.x.apply(lambda t: pd.Timestamp(t).date().strftime('%m/%d/%Y').lstrip('0')).
To reassign a column, no need for a loop. Something like this should work:
df["column"] = new_column
new_column is either a Series of matching length, or something that can be broadcast¹ to that length. You can find more details in the docs.
That said, if pd.Timestamp can already parse your data, there is no need for "formatting". The formatting is not associated with a timestamp instance. You can choose a particular formatting when you are converting to string with something like df["timestamp"].dt.strftime("%m/%d/%Y").
On the other hand, if you want to change the precision of your timestamp, you can do something like this:
df["timestamp"] = df["timestamp"].dt.floor("D")
Here, all time information will be truncated to a resolution of days (older pandas also accepted astype("datetime64[D]") for this, but recent versions reject it). Again, all this and more is discussed in the docs.
¹ Broadcasting is a concept from numpy where you can operate between different but compatibly shaped arrays. Again, everything is covered in the docs.
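A minimal sketch of both assignment forms described above (the column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

df["b"] = df["a"] * 2  # Series of matching length
df["c"] = 0            # scalar, broadcast to every row

print(df)
#    a  b  c
# 0  1  2  0
# 1  2  4  0
# 2  3  6  0
```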
Thank you all for your help. All of the answers were helpful, but the answer I ended up using was as follows:
import pandas as pd
df[df.columns[0]] = pd.to_datetime(df[df.columns[0]]).dt.strftime('%m/%d/%Y')
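For reference, a self-contained version of this solution on two of the sample values above (note that %m/%d/%Y keeps the leading zero; the lstrip('0') from the original loop would have removed it):

```python
import pandas as pd

df = pd.DataFrame({"Created Date": ["2019-02-27 22:55:16", "2018-11-29 00:13:31"]})

# Parse the strings to datetimes, then format the whole column at once.
df[df.columns[0]] = pd.to_datetime(df[df.columns[0]]).dt.strftime("%m/%d/%Y")
print(df[df.columns[0]].tolist())  # ['02/27/2019', '11/29/2018']
```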
I have a csv that is read by my python code and a dataframe is created using pandas.
CSV file is in following format
1 1.0
2 99.0
3 20.0
7 63
My code calculates the percentile and wants to find all rows that have the value in 2nd column greater than 60.
df = pd.read_csv(io.BytesIO(body), error_bad_lines=False, header=None, encoding='latin1', sep=',')
percentile = df.iloc[:, 1:2].quantile(0.99) # Selecting 2nd column and calculating percentile
criteria = df[df.iloc[:, 1:2] >= 60.0]
While my percentile code works fine, the criteria to find all rows whose second-column value is greater than 60 returns
NaN NaN
NaN NaN
NaN NaN
NaN NaN
Can you please help me find the error?
Just correct the condition inside criteria. Since the second column has index 1, you should write df.iloc[:, 1].
Example:
import pandas as pd
import numpy as np

b = np.array([[1, 2, 3, 7], [1, 99, 20, 63]])
df = pd.DataFrame(b.T)  # just creating the dataframe
criteria = df[df.iloc[:, 1] >= 60]
print(criteria)
Why?
The cause lies in the type of the condition expression. Let's inspect.
Case 1:
type( df.iloc[:,1]>= 60 )
Returns pandas.core.series.Series, so it gives
df[ df.iloc[:,1]>= 60 ]
#out:
   0   1
1  2  99
3  7  63
Case 2:
type( df.iloc[:,1:2]>= 60 )
Returns a pandas.core.frame.DataFrame, and gives
df[ df.iloc[:,1:2]>= 60 ]
#out:
    0     1
0 NaN   NaN
1 NaN  99.0
2 NaN   NaN
3 NaN  63.0
Therefore I think it changes the way the index is processed.
Always keep in mind that 3 is a scalar, while 3:4 is a slice.
For more info, it is always good to take a look at the official docs: Pandas indexing.
Your indexing is a bit off: you only have two columns [0, 1], and you are interested in selecting just the one with index 1. As @applesoup mentioned, the following is enough:
criteria = df[df.iloc[:, 1] >= 60.0]
However, I would consider naming columns and just referencing based on name. This will allow you to avoid any mistakes in case your df structure changes, e.g.:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 7], 'b': [1.0, 99.0, 20.0, 63.]})
criteria = df[df['b'] >= 60.0]
People here seem more interested in coming up with alternative solutions than in digging into the code to find out what's really wrong. I will adopt a diametrically opposed strategy!
The problem with your code is that you are indexing your DataFrame df by another DataFrame. Why? Because you use slices instead of integer indexing.
df.iloc[:, 1:2] >= 60.0 # Return a DataFrame with one boolean column
df.iloc[:, 1] >= 60.0 # Return a Series
df.iloc[:, [1]] >= 60.0 # Return a DataFrame with one boolean column
So correct your code by using :
criteria = df[df.iloc[:, 1] >= 60.0]  # Don't slice!
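The three forms can be checked directly; a small sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({0: [1, 2, 3, 7], 1: [1.0, 99.0, 20.0, 63.0]})

# Integer indexing returns a Series; slices and lists return a DataFrame.
print(type(df.iloc[:, 1] >= 60.0).__name__)    # Series
print(type(df.iloc[:, 1:2] >= 60.0).__name__)  # DataFrame
print(type(df.iloc[:, [1]] >= 60.0).__name__)  # DataFrame

# A Series mask keeps whole rows; a DataFrame mask blanks out cells with NaN.
criteria = df[df.iloc[:, 1] >= 60.0]
print(criteria[1].tolist())  # [99.0, 63.0]
```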
A very simple example just for understanding.
The goal is to calculate the values of a pandas DataFrame column depending on the results of a rolling function from another column.
I have the following DataFrame:
import numpy as np
import pandas as pd
s = pd.Series([1,2,3,2,1,2,3,2,1])
df = pd.DataFrame({'DATA':s, 'POINTS':0})
df
Note: I don't even know how to format the Jupyter Notebook results in the Stackoverflow edit window, so I copy and paste the image, I beg your pardon.
The DATA column shows the observed data; the POINTS column, initialized to 0, is used to collect the output of a "rolling" function applied to DATA column, as explained in the following.
Set a window = 4
nwin = 4
Just for the example, the "rolling" function calculates the max.
Now let me use a drawing to explain what I need.
For every iteration, the rolling function calculates the maximum of the data in the window; then the POINTS value at the index of the max DATA is incremented by 1.
The final result is:
Can you help me with the python code?
I really appreciate your help.
Thank you in advance for your time,
Gilberto
P.S. Can you also suggest how to copy and paste Jupyter Notebook formatted cell to Stackoverflow edit window? Thank you.
IIUC the explanation by #IanS (thanks again!), you can do
In [75]: np.array([df.DATA.rolling(4).max().shift(-i) == df.DATA for i in range(4)]).T.sum(axis=1)
Out[75]: array([0, 0, 3, 0, 0, 0, 3, 0, 0])
To update the column:
In [78]: df = pd.DataFrame({'DATA':s, 'POINTS':0})
In [79]: df.POINTS += np.array([df.DATA.rolling(4).max().shift(-i) == df.DATA for i in range(4)]).T.sum(axis=1)
In [80]: df
Out[80]:
DATA POINTS
0 1 0
1 2 0
2 3 3
3 2 0
4 1 0
5 2 0
6 3 3
7 2 0
8 1 0
import pandas as pd

s = pd.Series([1, 2, 3, 2, 1, 2, 3, 2, 1])
df = pd.DataFrame({'DATA': s, 'POINTS': 0})
df.POINTS = df.DATA.rolling(4).max().shift(-1)
df.POINTS = (df.POINTS * (df.POINTS == df.DATA)).fillna(0)
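An explicit loop over the windows (slower than the vectorized answer, but easy to follow; window size 4 as in the question, and idxmax takes the first maximum on ties) produces the same POINTS column:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 2, 1, 2, 3, 2, 1])
df = pd.DataFrame({'DATA': s, 'POINTS': 0})

nwin = 4
for end in range(nwin, len(df) + 1):
    window = df['DATA'].iloc[end - nwin:end]
    # idxmax returns the label of the (first) maximum in the window
    df.loc[window.idxmax(), 'POINTS'] += 1

print(df['POINTS'].tolist())  # [0, 0, 3, 0, 0, 0, 3, 0, 0]
```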
So I got a pandas DataFrame with a single column and a lot of data.
I need to access each of the element, not to change it (with apply()) but to parse it into another function.
When looping through the DataFrame it always stops after the first one.
If I convert it to a list first, then my numbers are all in brackets (e.g. [12] instead of 12), thus breaking my code.
Does anyone see what I am doing wrong?
import pandas as pd
def go_trough_list(df):
    for number in df:
        print(number)
df = pd.read_csv("my_ids.csv")
go_trough_list(df)
df looks like:
1
0 2
1 3
2 4
dtype: object
[Finished in 1.1s]
Edit: I found one mistake. My first value is recognized as a header.
So I changed my code to:
df = pd.read_csv("my_ids.csv",header=None)
But with
for ix in df.index:
    print(df.loc[ix])
I get:
0 1
Name: 0, dtype: int64
0 2
Name: 1, dtype: int64
0 3
Name: 2, dtype: int64
0 4
Name: 3, dtype: int64
Edit: here is my solution, thanks to jezrael and Nick!
First I added header=None because my data has no header.
Then I changed my function to:
def go_through_list(df):
    new_list = df[0].apply(my_function, parameter=par1)
    return new_list
And it works perfectly! Thank you again guys, problem solved.
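For completeness, Series.apply forwards keyword arguments on to the function, which is what makes the parameter=par1 form above work; my_function here is just a hypothetical stand-in for the real parsing function:

```python
import pandas as pd

def my_function(x, parameter):
    # hypothetical stand-in for the real parsing function
    return x * parameter

df = pd.DataFrame({0: [1, 2, 3]})
new_list = df[0].apply(my_function, parameter=10)
print(new_list.tolist())  # [10, 20, 30]
```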
You can use the index as in other answers, and also iterate through the df and access the row like this:
for index, row in df.iterrows():
    print(row['column'])
however, I suggest solving the problem differently if performance is of any concern. Also, if there is only one column, it is more correct to use a Pandas Series.
What do you mean by parse it into another function? Perhaps take the value, and do something to it and create it into another column?
I need to access each of the element, not to change it (with apply()) but to parse it into another function.
Perhaps this example will help:
import pandas as pd
df = pd.DataFrame([20, 21, 12])
def square(x):
    return x**2

df['new_col'] = df[0].apply(square)  # can use a lambda here nicely
You can convert the column (a Series) to a list with tolist:
for x in df['Colname'].tolist():
    print x
Sample:
import pandas as pd
df = pd.DataFrame({'a': pd.Series( [1, 2, 3]),
'b': pd.Series( [4, 5, 6])})
print df
a b
0 1 4
1 2 5
2 3 6
for x in df['a'].tolist():
    print x
1
2
3
If you have only one column, use iloc for selecting first column:
for x in df.iloc[:,0].tolist():
    print x
Sample:
import pandas as pd
df = pd.DataFrame({1: pd.Series( [2, 3, 4])})
print df
1
0 2
1 3
2 4
for x in df.iloc[:,0].tolist():
    print x
2
3
4
This can work too, but it is not the recommended approach, because 1 can be a number or a string and it can raise a KeyError:
for x in df[1].tolist():
    print x
2
3
4
Say you have one column named 'myColumn', and you have an index on the dataframe (which is automatically created with read_csv). Try using the .loc function:
for ix in df.index:
    print(df.loc[ix]['myColumn'])
My question relates to how I would make calculations for each row in a pandas dataframe, but on slices of each row, and then output the resulting calculations as a new dataframe that I can save as a txt file.
For example, let's say I want to output a dataframe that has the mean values (for each row) of the data in columns 0, 1 and 2, and a mean value for columns 3, 4 and 5.
I found how to slice columns and this is what I came up with so far (just running it on row 0).
for i in df:
    if i == 0:
        a = df.ix[:,0:3].mean()
        b = df.ix[:,3::].mean()
        print a, b
output is something like this:
0 0.000002
1 0.000001
2 0.000001
3 0.000002
dtype: float64 3 0.000002
4 0.000001
5 0.000001
6 0.000002
7 0.000001
dtype: float64
My questions are:
1) I don't understand this output, since I expected only two numbers: the mean of the first slice (a) and the mean of the second slice (b). Where am I going wrong, or is this not the right way to approach this task?
2) How can I store the result in a new dataframe and save it as a txt file?
You don't need any loops. With pandas, if you're looping, you're probably doing something very wrong. Just select all the rows and the subset of columns with the iloc attribute, and call the mean method with axis=1:
import pandas
import numpy
numpy.random.seed(0)
df = pandas.DataFrame(numpy.round(numpy.random.normal(size=(10, 5)),2))
means = pandas.DataFrame(df.iloc[:, :3].mean(axis=1), columns=['means'])
print(means)
      means
0  1.046667
1 -0.060000
2  0.783333
3  0.536667
4 -0.346667
5 -0.530000
6 -0.120000
7  0.863333
8 -1.393333
9 -0.303333
You have to explicitly make means a dataframe since the mean method returns a series.
To save it as a tab-delimited text file, use: means.to_csv('means.txt', sep='\t')
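Extending the same idea to both slices from the question (the mean of columns 0-2 and the mean of the remaining columns) and saving the result; the output column names here are made up:

```python
import numpy
import pandas

numpy.random.seed(0)
df = pandas.DataFrame(numpy.round(numpy.random.normal(size=(10, 5)), 2))

means = pandas.DataFrame({
    'mean_a': df.iloc[:, :3].mean(axis=1),  # columns 0, 1, 2
    'mean_b': df.iloc[:, 3:].mean(axis=1),  # remaining columns
})
means.to_csv('means.txt', sep='\t')  # tab-delimited text file
print(means.head())
```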