Calculating row-wise delta values in a DataFrame - python

I'm trying to calculate what I am calling "delta values", meaning the amount that has changed between two consecutive rows.
For example
A | delta_A
1 | 0
2 | 1
5 | 3
9 | 4
I managed to do that starting with this code (basically copied from a MATLAB program I had):
df = df.assign(delta_A=np.zeros(len(df.A)))
df['delta_A'][0] = 0 # start at 'no-change'
df['delta_A'][1:] = df.A[1:].values - df.A[:-1].values
This generates the dataframe correctly and seems to have no other negative effects.
However, I think there is something wrong with that approach, because I get these messages:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
.../__main__.py:5: SettingWithCopyWarning
So, I didn't really understand what that link was trying to say, and I found this post
Adding new column to existing DataFrame in Python pandas
The latest edit to the answer there says to use this code, but I have already used that syntax...
df1 = df1.assign(e=pd.Series(np.random.randn(sLength)).values)
So, the question is: is .loc the way to go, or what is the more correct way to get that column?

It seems you need diff and then replace NaN with 0:
df['delta_A'] = df.A.diff().fillna(0).astype(int)
A delta_A
0 0 0
1 4 4
2 7 3
3 8 1
Alternative solution with assign
df = df.assign(delta_A=df.A.diff().fillna(0).astype(int))
A delta_A
0 0 0
1 4 4
2 7 3
3 8 1
Another solution, if you need to replace only the first NaN value:
df['delta_A'] = df.A.diff()
df.loc[df.index[0], 'delta_A'] = 0
print (df)
A delta_A
0 0 0.0
1 4 4.0
2 7 3.0
3 8 1.0
Your solution can be modified with iloc, but I think it's better to use the diff function:
df['delta_A'] = 0 # convert all values to 0
df['delta_A'].iloc[1:] = df.A[1:].values - df.A[:-1].values
#also works
#df['delta_A'][1:] = df.A[1:].values - df.A[:-1].values
print (df)
A delta_A
0 0 0
1 4 4
2 7 3
3 8 1
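For completeness, a minimal self-contained sketch using the numbers from the question (assuming column A holds 1, 2, 5, 9 as in the example above):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 5, 9]})

# diff() computes the row-to-row difference; the first row has no
# previous value, so it comes back as NaN and is replaced with 0 here
df['delta_A'] = df['A'].diff().fillna(0).astype(int)
print(df)
#    A  delta_A
# 0  1        0
# 1  2        1
# 2  5        3
# 3  9        4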

Related

Compare all columns value to another one with Pandas

I am having trouble with Pandas.
I'm trying to compare each value in a row to another one.
In the attached link you will be able to see a slice of my dataframe.
For each date I have the daily variation of some stocks.
I want to compare each stock variation to the variation of the columns labelled 'CAC 40'.
If the value is greater, I want to turn it into 1, or 0 if it is lower.
This should return a dataframe filled only with 1 or 0 so I can then summarize by columns.
I have tried the apply method but this doesn't work.
It returns a pandas Series (attached below):
def compare_to_cac(row):
    for i in row:
        if row[i] >= row['CAC 40']:
            return 1
        else:
            return 0

data2 = data.apply(compare_to_cac, axis=1)
Can someone please help me out?
I worked with this data (column names are not important here, only the CAC 40 one is):
A B CAC 40
0 0 2 9
1 1 3 9
2 2 4 1
3 3 5 2
4 4 7 2
With just a for loop:
for column in df.columns:
    if column == "CAC 40":
        continue
    condition = [df[column] > df["CAC 40"]]
    value = [1]
    df[column] = np.select(condition, value, default=0)
Which gives me this result:
A B CAC 40
0 0 0 9
1 0 0 9
2 1 1 1
3 1 1 2
4 1 1 2
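For reference, a fully vectorised sketch that produces the same 0/1 frame without an explicit loop (a sketch only; it assumes every column other than 'CAC 40' is numeric, as in the sample data):

import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [2, 3, 4, 5, 7],
                   'CAC 40': [9, 9, 1, 2, 2]})

# compare every other column against the 'CAC 40' column row by row,
# then cast the booleans to 0/1; the benchmark column stays untouched
cols = df.columns.drop('CAC 40')
df[cols] = df[cols].gt(df['CAC 40'], axis=0).astype(int)
print(df)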

Find the number of previous consecutive occurrences of a value different from the current row value in a pandas dataframe

Assume that we have the following pandas dataframe:
df = pd.DataFrame({'x':[0,0,1,0,0,0,0],'y':[1,1,1,1,1,1,0],'z':[0,1,1,1,0,0,1]})
x y z
0 0 1 0
1 0 1 1
2 1 1 1
3 0 1 1
4 0 1 0
5 0 1 0
6 0 0 1
The whole dataframe is filled with either 1 or 0. Looking at each column separately, if the current row value is different from the previous value, I need to count the number of previous consecutive values:
   x  y  z
0
1        1
2  2
3  1
4        3
5
6     6  2
I tried to write a lambda function and apply it to entire dataframe, but I failed. Any idea?
Let's try this:
def f(col):
    x = (col != col.shift().bfill())
    s = x.cumsum()
    return s.groupby(s).transform('count').shift().where(x)

df.apply(f).fillna('')
Output:
   x  y  z
0
1        1
2  2
3  1
4        3
5
6     6  2
Details:
Use apply to run a custom function on each column of the dataframe.
Find the spots where the value changes in the column, then use cumsum to create groups of consecutive values, then groupby and transform to attach a count to each record, and finally mask the values in the column with where so that only the change spots keep their counts.
You can try the following, where you identify the "runs" first and get their lengths. You only get an entry where the value switches, so it is the lengths of the runs except the last one.
import pandas as pd
import numpy as np

def func(x, missing=np.NaN):
    runs = np.cumsum(np.append(0, np.diff(x) != 0))
    switches = np.where(np.diff(x) != 0)[0] + 1
    out = np.repeat(missing, len(x))
    out[switches] = np.bincount(runs)[:-1]
    # thanks to Scott, see comments below
    ##out[switches] = pd.value_counts(runs, sort=False)[:-1]
    return out
df.apply(func)
x y z
0 NaN NaN NaN
1 NaN NaN 1.0
2 2.0 NaN NaN
3 1.0 NaN NaN
4 NaN NaN 3.0
5 NaN NaN NaN
6 NaN 6.0 2.0
It might be faster with a good implementation of run-length encoding, but I am not too familiar with that in Python.
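For what it's worth, here is a run-length style sketch of that idea; the helper name rle_counts and its exact layout are my own, but on the sample frame it reproduces the NaN-filled output shown above:

import numpy as np
import pandas as pd

def rle_counts(col, missing=np.nan):
    v = col.to_numpy()
    # positions where a new run starts (value differs from the previous one)
    change = np.flatnonzero(np.diff(v) != 0) + 1
    starts = np.concatenate(([0], change))
    # length of every run, including the final one
    lengths = np.diff(np.concatenate((starts, [len(v)])))
    out = np.full(len(v), missing)
    # at each switch point, write the length of the run that just ended
    out[change] = lengths[:-1]
    return pd.Series(out, index=col.index)

df.apply(rle_counts)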

Wrong number of items passed - Pandas and Transform

Good Morning,
I have the following dataframe:
print(df)
a b
1 6
1 4
4 5
4 2
...
And I would like to get:
print(final_df)
a b c
1 6 2
1 4 2
4 5 3
4 2 3
...
I tried using:
df["c"] = df.groupby("a")["b"].transform(np.diff)
And it works on a small test set with two rows, but whenever I try to run it on the whole dataset, it returns:
ValueError: Wrong number of items passed 0, placement implies 1
How can I create final_df?
I guess this is your problem, explained here:
diff vs np.diff
And this might give you what you want:
df["c"] = df.groupby("a")["b"].diff().abs().fillna(0)

Pandas messing up the date column on DataFrame.from_csv

I'm using Python 3.4.0 with pandas==0.16.2. I have soccer team results in CSV files that have the following columns: date, at, goals.scored, goals.lost, result. The 'at' column can have one of three values (H, A, N), which indicate whether the game took place at the team's home stadium, away, or at a neutral venue, respectively. Here's the head of one such file:
date,at,goals.scored,goals.lost,result
16/09/2014,A,0,2,2
01/10/2014,H,4,1,1
22/10/2014,A,2,1,1
04/11/2014,H,3,3,0
26/11/2014,H,2,0,1
09/12/2014,A,4,1,1
25/02/2015,H,1,3,2
17/03/2015,A,2,0,1
19/08/2014,A,0,0,0
When I load this file into pandas.DataFrame in the usual way:
import pandas as pd
aTeam = pd.DataFrame.from_csv("teamA-results.csv")
the first two columns 'date' and 'at' seem to be treated as one and I get a malformed data frame like this one:
aTeam.dtypes
at object
goals.scored int64
goals.lost int64
result int64
dtype: object
aTeam
at goals.scored goals.lost result
date
2014-09-16 A 0 2 2
2014-01-10 H 4 1 1
2014-10-22 A 2 1 1
2014-04-11 H 3 3 0
2014-11-26 H 2 0 1
...
The code block does not clearly reflect the corruption, so I attached the screenshot from the Jupyter notebook:
As you can see, the 'date' and 'at' columns seem to be treated as one column of object type:
aTeam['at']
date
2014-09-16 A
2014-01-10 H
2014-10-22 A
2014-04-11 H
2014-11-26 H
2014-09-12 A
Initially I thought the lack of quotes around the date was causing this problem, so I added those, but it did not help at all. I then quoted all the values in the 'at' column, which still did not solve the problem. I tried single and double quotes in the CSV file. Interestingly, using no quotes or double quotes around the values in 'date' and 'at' produced the same results as you can see above. Single quotes were interpreted as part of the value in the 'at' column, but not in the date column:
Adding the parse_dates=True param did not have any effect on the data frame.
I did not have such issues when I was working with these CSV files in R. I will appreciate any help on this one.
I can replicate your issue using from_csv; the issue is that it uses column 0 as the index, so passing index_col=None will work:
index_col : int or sequence, default 0
Column to use for index. If a sequence is given, a MultiIndex is used. Different default from read_table
import pandas as pd
aTeam = pd.DataFrame().from_csv("in.csv",index_col=None)
Output:
date at goals.scored goals.lost result
0 16/09/2014 A 0 2 2
1 01/10/2014 H 4 1 1
2 22/10/2014 A 2 1 1
3 04/11/2014 H 3 3 0
4 26/11/2014 H 2 0 1
5 09/12/2014 A 4 1 1
6 25/02/2015 H 1 3 2
7 17/03/2015 A 2 0 1
8 19/08/2014 A 0 0 0
Or, using .read_csv works correctly and is probably what you wanted, based on the fact you were trying quotechar, which is a valid arg:
import pandas as pd
aTeam = pd.read_csv("in.csv")
Output:
date at goals.scored goals.lost result
0 16/09/2014 A 0 2 2
1 01/10/2014 H 4 1 1
2 22/10/2014 A 2 1 1
3 04/11/2014 H 3 3 0
4 26/11/2014 H 2 0 1
5 09/12/2014 A 4 1 1
6 25/02/2015 H 1 3 2
7 17/03/2015 A 2 0 1
8 19/08/2014 A 0 0 0
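If the dates should also come back as real datetimes (they are day-first, dd/mm/yyyy, in the sample), read_csv can do that directly; a minimal sketch, assuming the file name from the question:

import pandas as pd

# parse_dates converts the 'date' column to datetime64;
# dayfirst=True is needed because the sample uses dd/mm/yyyy dates
aTeam = pd.read_csv('teamA-results.csv', parse_dates=['date'], dayfirst=True)
print(aTeam.dtypes)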
I have no problem here (using Python 2.7, pandas 0.16.2). I used the text you pasted, created a .csv file on my desktop, and loaded it in pandas using two methods:
import pandas as pd
a = pd.read_csv('test.csv')
b = pd.DataFrame.from_csv('test.csv')
>>>a
date at goals.scored goals.lost result
0 16/09/2014 A 0 2 2
1 01/10/2014 H 4 1 1
2 22/10/2014 A 2 1 1
3 04/11/2014 H 3 3 0
4 26/11/2014 H 2 0 1
5 09/12/2014 A 4 1 1
6 25/02/2015 H 1 3 2
7 17/03/2015 A 2 0 1
8 19/08/2014 A 0 0 0
>>>b
at goals.scored goals.lost result
date
2014-09-16 A 0 2 2
2014-01-10 H 4 1 1
2014-10-22 A 2 1 1
2014-04-11 H 3 3 0
2014-11-26 H 2 0 1
2014-09-12 A 4 1 1
2015-02-25 H 1 3 2
2015-03-17 A 2 0 1
2014-08-19 A 0 0 0
You can see the different behavior between the read_csv and from_csv commands in how the index is handled. However, I do not see anything like the problem you mentioned. You could always try further defining the read_csv parameters, though I doubt that will make a substantial difference.
Could you confirm that your 'date' and 'at' columns are being smashed together by querying the dataframe via aTeam['at'] and seeing what that yields?

computing sum of pandas dataframes

I have two dataframes that I want to add bin-wise. That is, given
dfc1 = pd.DataFrame(list(zip(range(10),np.zeros(10))), columns=['bin', 'count'])
dfc2 = pd.DataFrame(list(zip(range(0,10,2), np.ones(5))), columns=['bin', 'count'])
which gives me this
dfc1:
bin count
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
6 6 0
7 7 0
8 8 0
9 9 0
dfc2:
bin count
0 0 1
1 2 1
2 4 1
3 6 1
4 8 1
I want to generate this:
bin count
0 0 1
1 1 0
2 2 1
3 3 0
4 4 1
5 5 0
6 6 1
7 7 0
8 8 1
9 9 0
where I've added the count columns where the bin columns matched.
In fact, it turns out that I only ever add 1 (that is, count in dfc2 is always 1). So an alternate version of the question is "given an array of bin values (dfc2.bin), how can I add one to each of their corresponding count values in dfc1?"
My only solution thus far feels grossly inefficient (and slightly unreadable in the end): doing an outer join between the two bin columns, thus creating a third dataframe on which I do a computation and then project out the unneeded column.
Suggestions?
First set bin to be the index in both dataframes; then you can use add. fill_value is needed to indicate that zero should be used if a bin is missing from a dataframe:
dfc1 = dfc1.set_index('bin')
dfc2 = dfc2.set_index('bin')
result = pd.DataFrame.add(dfc1, dfc2, fill_value=0)
Pandas automatically sums up rows with equal index.
By the way, if you need to perform such an operation frequently, I strongly recommend numpy.bincount, which even allows repeated bin indices inside one dataframe.
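A minimal sketch of that idea with the example frames above (it assumes dfc1.bin is exactly 0..len(dfc1)-1, as in the example):

import numpy as np
import pandas as pd

dfc1 = pd.DataFrame(list(zip(range(10), np.zeros(10))), columns=['bin', 'count'])
dfc2 = pd.DataFrame(list(zip(range(0, 10, 2), np.ones(5))), columns=['bin', 'count'])

# np.bincount counts how often each bin value occurs in dfc2 (repeats are
# handled automatically); minlength pads the result to cover every bin in dfc1
dfc1['count'] += np.bincount(dfc2['bin'], minlength=len(dfc1))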
Since the dfc1 index is the same as your "bin" value, you could simply do the following:
dfc1.iloc[dfc2.bin].cnt += 1
Notice that I renamed your "count" column to "cnt" since count is a pandas builtin, which can cause confusion and errors!
As an alternative to @Alleo's answer, you can use the method combineAdd to simply add the two dataframes together, with set_index applied at the same time, provided that their indexes are matched by bin:
dfc1.set_index('bin').combineAdd(dfc2.set_index('bin')).reset_index()
bin count
0 0 1
1 1 0
2 2 1
3 3 0
4 4 1
5 5 0
6 6 1
7 7 0
8 8 1
9 9 0
