I would like to get dataframe subsets in a "rolling" manner.
I tried several things without success; here is an example of what I would like to do. Let's consider this DataFrame:
df
var1 var2
0 43 74
1 44 74
2 45 66
3 46 268
4 47 66
I would like to create a new column with the following function which performs a conditional sum:
def func(x):
    tmp = (x["var1"] * (x["var2"] == 74)).sum()
    return tmp
and calling it like this
df["newvar"] = df.rolling(2, min_periods=1).apply(func)
That would mean the function is applied to each rolling sub-DataFrame, not to each row or column separately.
It would return
var1 var2 newvar
0 43 74 43 # 43
1 44 74 87 # 43 * 1 + 44 * 1
2 45 66 44 # 44 * 1 + 45 * 0
3 46 268 0 # 45 * 0 + 46 * 0
4 47 66 0 # 46 * 0 + 47 * 0
Is there a pythonic way to do this?
This is just an example, but the condition (always based on the sub-DataFrame values) depends on more than two columns.
Updated comment
@unutbu posted a great answer to a very similar question here, but it appears that his answer is based on pd.rolling_apply, which passes the index to the function. I'm not sure how to replicate this with the current DataFrame.rolling.apply method.
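One way to replicate the index-passing behaviour (a sketch of my own, not taken from the linked answer, and it assumes a default RangeIndex): roll over a helper Series of positional indices and use each window's endpoints to slice the full DataFrame:
import numpy as np
import pandas as pd
df = pd.DataFrame({"var1": [43, 44, 45, 46, 47],
                   "var2": [74, 74, 66, 268, 66]})
def window_func(idx):
    # idx holds the window's positional indices; slice the full rows from df
    sub = df.iloc[int(idx[0]):int(idx[-1]) + 1]
    return (sub["var1"] * (sub["var2"] == 74)).sum()
positions = pd.Series(np.arange(len(df)), index=df.index)
df["newvar"] = positions.rolling(2, min_periods=1).apply(window_func, raw=True)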
Original answer
It appears that the variable passed to the function through apply is a NumPy array of each column (one at a time), not a DataFrame, so unfortunately you do not have access to any other columns.
But what you can do is use boolean logic to create a temporary column based on whether var2 is 74 or not, and then use the rolling method.
df['new_var'] = df.var2.eq(74).mul(df.var1).rolling(2, min_periods=1).sum()
var1 var2 new_var
0 43 74 43.0
1 44 74 87.0
2 45 66 44.0
3 46 268 0.0
4 47 66 0.0
The temporary column is based on the first half of the code above.
df.var2.eq(74).mul(df.var1)
# or equivalently with operators
# (df['var2'] == 74) * df['var1']
0 43
1 44
2 0
3 0
4 0
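Since the real condition depends on more than two columns, the same pattern extends by combining boolean Series before multiplying; the condition below is made up purely for illustration:
# hypothetical multi-column condition: var2 == 74 and var1 < 45
cond = df['var2'].eq(74) & df['var1'].lt(45)
df['new_var2'] = cond.mul(df['var1']).rolling(2, min_periods=1).sum()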
Finding the type of the variable passed to apply
It's very important to know what is actually being passed to the applied function, and I can't always remember, so if I am unsure I print out the variable along with its type to make clear what object I am dealing with. See this example with your original DataFrame.
def foo(x):
    print(x)
    print(type(x))
    return x.sum()
df.rolling(2, min_periods=1).apply(foo)
Output
[ 43.]
<class 'numpy.ndarray'>
[ 43. 44.]
<class 'numpy.ndarray'>
[ 44. 45.]
<class 'numpy.ndarray'>
[ 45. 46.]
<class 'numpy.ndarray'>
[ 46. 47.]
<class 'numpy.ndarray'>
[ 74.]
<class 'numpy.ndarray'>
[ 74. 74.]
<class 'numpy.ndarray'>
[ 74. 66.]
<class 'numpy.ndarray'>
[ 66. 268.]
<class 'numpy.ndarray'>
[ 268. 66.]
<class 'numpy.ndarray'>
The trick is to define a function that has access to your entire DataFrame. Then you do a roll on any column and call apply(), passing in that function. The function will have access to the window data, which is a subset of one DataFrame column; from that subset you can extract the index range you should be looking at. (This assumes that your index is strictly increasing, so the usual integer index will work, as well as most time series.) You can then use that index range to access the entire DataFrame with all its columns.
def dataframe_roll(df):
    def my_fn(window_series):
        window_df = df[(df.index >= window_series.index[0]) & (df.index <= window_series.index[-1])]
        # rolling.apply expects a scalar back, so reduce the combined columns
        return (window_df["col1"] + window_df["col2"]).sum()
    return my_fn
df["result"] = df["any_col"].rolling(24).apply(dataframe_roll(df), raw=False)
Here's how you get dataframe subsets in a rolling manner:
for df_subset in df.rolling(2):
    print(type(df_subset), '\n', df_subset)
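On pandas versions where this iteration is supported (1.1+), each window is a sub-DataFrame, so the question's func can be applied window by window; a sketch:
# each window is a sub-DataFrame, so the original func works unchanged
df["newvar"] = [func(window) for window in df.rolling(2)]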
Related
I have a file called data that looks like this:
Some Text Information (lines 1-6 in file)
1 22 23
2 44 44
3 55 55
4 66 66
5 77 77
What I'm trying to achieve is something like this:
[[ 22. 23.]
[ 44. 44.]
[ 55. 55.]
[ 66. 66.]
[ 77. 77.]]
The issue I'm having is that the code I'm using doesn't properly split the data from the file. It ends up looking like this:
[ 1 22 23
0 2 44 44
1 3 55 55, Empty DataFrame
Columns: [1 6734 1453]
Index: [], 1 22 23
2 4 44 44
3 5 55 55
4 6 66 66
5 7 77 77
EOF]
Here's the code I'm using:
def loadFile(filename):
    df1 = pd.read_fwf(filename, skiprows=6)
    df1 = np.split(df1, [2, 2])
    print('The data points:\n {}'.format(df1[:5]))
I understand the parameters of the split function. For instance, [2,2] should create two sub-arrays from my DataFrame, and my axis is 0. However, why does it not properly split the array?
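As an aside on the np.split behaviour itself: passing [2, 2] actually produces three pieces, one of them empty, which is exactly the Empty DataFrame visible in the output above. A quick sketch:
import numpy as np
x = np.arange(5)
# splits into x[:2], the empty x[2:2], and x[2:]
print(np.split(x, [2, 2]))
# [array([0, 1]), array([], dtype=...), array([2, 3, 4])]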
You can read the file into a pandas DataFrame and access its values attribute. Assuming "Some Text Information" is not the header:
import pandas as pd
df = pd.read_table(filepath, sep='\t', index_col=0, skiprows=6, header=None)
df.values # gives you the numpy ndarray
This should use the first column as the index. You might need to remove the sep argument to let read_table figure the separator out, or try other separators. If the row index ends up in your data, slice it away to get the desired result. Use something like:
df.iloc[:,1:].values
Do not use read_fwf, let pandas figure out the structure of your table:
df = pd.read_csv("yourfile", skiprows=6, header=None, sep=r'\s+')
To elaborate on ManKind_008's answer:
Your explicit line numbers are the problem. Pandas interprets these as valid data.
Using ManKind_008's solution does properly set the index column, but since your line numbers start at one you end up with a DataFrame like:
pd.read_fwf('test.csv', header=None, index_col=0, skiprows=6)
1 2
0
1 22 23
2 44 44
3 55 55
4 66 66
5 77 77
Instead I suggest you read in all of your data using:
pd.read_fwf('test.csv', header=None, skiprows=6).iloc[:, 1:]
1 2
0 22 23
1 44 44
2 55 55
3 66 66
4 77 77
This leaves you with what you seem to need. The iloc call drops the first column of data (your line numbers).
From here the df.values command will give you:
array([[22, 23],
[44, 44],
[55, 55],
[66, 66],
[77, 77]])
If you don't want an np.array, you can convert it to nested lists with its tolist() method (plain list() would give you a list of row arrays instead).
Pandas seems to resist efforts to use DataFrame index values as if they are column values. As a result I am often copying them into a column so that I can reference them for calculations. Is this a good practice? Or am I missing a "correct" way to reference index values?
Consider the following example:
import random
import numpy as np
import pandas as pd
j = [(a, b) for a in ['A','B','C'] for b in random.sample(range(1, 100), 5)]
i = pd.MultiIndex.from_tuples(j, names=['Name','Num'])
df = pd.DataFrame(np.random.randn(15), i, columns=['Vals'])
Now suppose I want to add a column 'SmallestNum' to the DataFrame that lists the smallest index Num for each associated index Name.
Presently the only way I can find to get this to work (assuming that the MultiIndex is large and I don't have it handy as tuples) is to:
First: Copy both index levels into columns of the DataFrame:
df['NameCol'] = df.index.get_level_values(0)
df['NumCol'] = df.index.get_level_values(1)
Otherwise, I can't figure out how I would get the smallest Num value for each Name. At least now I can via:
smallest = pd.DataFrame(df.groupby(['Name'])['NumCol'].min())
Finally, I can merge these data back into the DataFrame as a new column, but only because I can reference the NameCol:
df.merge(smallest.rename(columns={'NumCol' : 'SmallestNum'}), how='left', right_index=True, left_on=['NameCol'])
So is there a way to do this without creating the NameCol and NumCol column copies of the MultiIndex values?
This works:
## get smallest values per Name
vals = df.reset_index(level=1).groupby('Name')['Num'].min()
## map the values to df
df['SmallestNum'] = pd.Series(df.index.get_level_values(0)).map(vals).values
You can use transform:
np.random.seed(456)
j = [(a, b) for a in ['A','B','C'] for b in np.random.randint(1, 100, size=5)]
i = pd.MultiIndex.from_tuples(j, names=['Name','Num'])
df = pd.DataFrame(np.random.randn(15), i, columns=['Vals'])
print (df)
Vals
Name Num
A 28 1.180140
44 0.984257
90 1.835646
43 -1.886823
29 0.424763
B 80 -0.433105
61 -0.166838
46 0.754634
38 1.966975
93 0.200671
C 40 0.742752
82 -1.264271
12 -0.112787
78 0.667358
70 0.357900
df['SmallestNum'] = df.reset_index(level=1).groupby('Name')['Num'].transform('min').values
Or:
df['SmallestNum'] = df.groupby('Name').transform(lambda x: x.index.get_level_values(1).min())
print (df)
Vals SmallestNum
Name Num
A 28 1.180140 28
44 0.984257 28
90 1.835646 28
43 -1.886823 28
29 0.424763 28
B 80 -0.433105 38
61 -0.166838 38
46 0.754634 38
38 1.966975 38
93 0.200671 38
C 40 0.742752 12
82 -1.264271 12
12 -0.112787 12
78 0.667358 12
70 0.357900 12
If we have the data set:
import pandas as pd
a = pd.DataFrame({"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]})
b = pd.DataFrame({"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]})
How does one create a correlation matrix, in which the y-axis represents "a" and the x-axis represents "b"?
The aim is to see the correlations between the matching columns of the two datasets.
If you don't mind a NumPy-based vectorized solution, you can use corr2_coeff from this solution post on computing the correlation coefficient between two multi-dimensional arrays -
corr2_coeff(a.values.T,b.values.T).T # func from linked solution post.
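For completeness, here is a sketch of corr2_coeff along the lines of the linked post (row-wise Pearson correlation between two 2-D arrays):
import numpy as np
def corr2_coeff(A, B):
    # subtract row-wise means
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]
    # sums of squares across rows
    ssA = (A_mA ** 2).sum(1)
    ssB = (B_mB ** 2).sum(1)
    # correlation of every row of A with every row of B
    return A_mA.dot(B_mB.T) / np.sqrt(np.outer(ssA, ssB))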
Sample run -
In [621]: a
Out[621]:
A B C D E
0 34 54 56 0 78
1 12 87 78 23 12
2 78 35 0 72 31
3 84 25 14 56 0
4 26 82 13 14 34
In [622]: b
Out[622]:
A B C D E
0 45 45 98 0 24
1 24 87 52 23 12
2 65 65 32 1 65
3 65 52 32 365 3
4 65 12 12 53 65
In [623]: corr2_coeff(a.values.T,b.values.T).T
Out[623]:
array([[ 0.71318502, -0.5923714 , -0.9704441 , 0.48775228, -0.07401011],
[ 0.0306753 , -0.0705457 , 0.48801177, 0.34685977, -0.33942737],
[-0.26626431, -0.01983468, 0.66110713, -0.50872017, 0.68350413],
[ 0.58095645, -0.55231196, -0.32053858, 0.38416478, -0.62403866],
[ 0.01652716, 0.14000468, -0.58238879, 0.12936016, 0.28602349]])
This achieves exactly what you want:
from scipy.stats import pearsonr
# create a new DataFrame where the values for the indices and columns
# align on the diagonals
c = pd.DataFrame(columns=a.columns, index=a.columns)
# since we know set(a.columns) == set(b.columns), we can just iterate
# through the columns in a (a more robust way would be to iterate through
# the intersection of the two sets of columns, in case your actual
# DataFrames' columns don't match up)
for col in a.columns:
    correl_signif = pearsonr(a[col], b[col])  # correlation of those two Series
    correl = correl_signif[0]  # grab the actual Pearson R value from the tuple
    c.loc[col, col] = correl   # assign to the diagonal for that column
Edit: Well, it achieved exactly what you wanted until the question was modified. It can easily be changed, though:
c = pd.DataFrame(columns=a.columns, index=a.columns)
for col in c.columns:
    for idx in c.index:
        correl_signif = pearsonr(a[col], b[idx])
        correl = correl_signif[0]
        c.loc[idx, col] = correl
c is now this:
Out[16]:
A B C D E
A 0.713185 -0.592371 -0.970444 0.487752 -0.0740101
B 0.0306753 -0.0705457 0.488012 0.34686 -0.339427
C -0.266264 -0.0198347 0.661107 -0.50872 0.683504
D 0.580956 -0.552312 -0.320539 0.384165 -0.624039
E 0.0165272 0.140005 -0.582389 0.12936 0.286023
I use this function, which breaks the computation down with NumPy:
def corr_ab(a, b):
    a_ = a.values
    b_ = b.values
    ab = a_.T.dot(b_)
    n = len(a)
    sums_squared = np.outer(a_.sum(0), b_.sum(0))
    stds_squared = np.outer(a_.std(0), b_.std(0))
    return pd.DataFrame((ab - sums_squared / n) / stds_squared / n,
                        a.columns, b.columns)
demo
corr_ab(a, b)
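Up to floating-point noise this reproduces the table shown for c above: the covariance term divides by n and std(0) is the population standard deviation, so the normalizations are consistent and the entries match the Pearson coefficients.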
Do you have to use pandas? This seems doable via NumPy rather easily. Did I understand the task incorrectly?
import numpy
X = {"A":[34,12,78,84,26], "B":[54,87,35,25,82], "C":[56,78,0,14,13], "D":[0,23,72,56,14], "E":[78,12,31,0,34]}
Y = {"A":[45,24,65,65,65], "B":[45,87,65,52,12], "C":[98,52,32,32,12], "D":[0,23,1,365,53], "E":[24,12,65,3,65]}
for key, value in X.items():
    # numpy.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation
    print("correlation stats for %s is %s" % (key, numpy.corrcoef(value, Y[key])))
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[11,11,22,22],'mask':[0,0,0,1],'values':np.arange(10,30,5)})
df
A mask values
0 11 0 10
1 11 0 15
2 22 0 20
3 22 1 25
Now how can I group by A, keep the column names intact, and yet put a custom function's result into Z:
def calculate_df_stats(dfs):
    mask_ = list(dfs['mask'])
    mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
    return mean
df['Z'] = df.groupby('A').agg(calculate_df_stats) # does not work
and generate:
A mask values Z
0 11 0 10 12.5
1 22 0 20 25
Whatever I do, it only replaces the values column with the masked mean.
And can your solution be applied to a function of two columns, returning the result in a new column?
Thanks!
Edit:
To clarify further: let's say I have a table like this in MySQL:
SELECT * FROM `Reader_datapoint` WHERE `wavelength` = '560'
LIMIT 200;
which gives me such result:
http://pastebin.com/qXiaWcJq
If I run now this:
SELECT *, avg(action_value) FROM `Reader_datapoint` WHERE `wavelength` = '560'
group by `reader_plate_ID`;
I get:
datapoint_ID plate_ID coordinate_x coordinate_y res_value wavelength ignore avg(action_value)
193 1 0 0 2.1783 560 NULL 2.090027083333334
481 2 0 0 1.7544 560 NULL 1.4695583333333333
769 3 0 0 2.0161 560 NULL 1.6637885416666673
How can I replicate this behaviour in pandas? Note that all the column names stay the same, the first value is taken, and the new column is added.
If you want the original columns in your result, you can first calculate the grouped and aggregated DataFrame (but you will have to aggregate your original columns in some way; I took the first occurrence as an example):
>>> df = pd.DataFrame({'A':[11,11,22,22],'mask':[0,0,0,1],'values':np.arange(10,30,5)})
>>>
>>> grouped = df.groupby("A")
>>>
>>> result = grouped.agg('first')
>>> result
mask values
A
11 0 10
22 0 20
and then add a column 'Z' to that result by applying your function on the groupby result 'grouped':
>>> def calculate_df_stats(dfs):
...     mask_ = list(dfs['mask'])
...     mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
...     return mean
...
>>> result['Z'] = grouped.apply(calculate_df_stats)
>>>
>>> result
mask values Z
A
11 0 10 12.5
22 0 20 20.0
In your function definition you can always use more columns (just refer to them by name) to compute the result, as shown below.
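For instance, a minimal sketch of a function over two columns (the weighting scheme is made up purely for illustration):
# mean of 'values' weighted by the complement of 'mask'
def unmasked_mean(dfs):
    weights = 1 - dfs['mask']
    return (dfs['values'] * weights).sum() / weights.sum()
result['W'] = grouped.apply(unmasked_mean)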
I would like to force matrix multiplication "orientation" using Python pandas, for DataFrame against DataFrame, DataFrame against Series, and Series against Series.
As an example, I tried the following code:
import pandas
t = pandas.Series([1, 2])
print(t.T.dot(t))
Which outputs: 5
But I expect this:
[1 2
2 4]
Pandas is great, but this inability to do matrix multiplications the way I want is the most frustrating part, so any help would be greatly appreciated.
PS: I know pandas tries to implicitly use the index to find the right way to compute the matrix product, but it seems this behavior can't be switched off!
Here:
In [1]: import pandas
In [2]: import numpy as np
In [3]: t = pandas.Series([1, 2])
In [4]: np.outer(t, t)
Out[4]:
array([[1, 2],
       [2, 4]])
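If you want to keep pandas labels on the result, the outer product can be wrapped back into a DataFrame (a sketch; reusing t's index for both axes is my own choice):
import pandas as pd
pd.DataFrame(np.outer(t, t), index=t.index, columns=t.index)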
Anyone coming to this now may want to consider pandas.Series.to_frame(). It's kind of clunky.
Here's the original question's example:
import pandas as pd
t = pd.Series([1, 2])
col = t.to_frame()    # column vector, shape (2, 1)
row = t.to_frame().T  # row vector, shape (1, 2)
col.dot(row)          # equivalently: t.to_frame().dot(t.to_frame().T)
Which yields:
In [3]: t.to_frame().dot(t.to_frame().T)
Out[3]:
0 1
0 1 2
1 2 4
Solution found by y-p:
https://github.com/pydata/pandas/issues/3344#issuecomment-16533461
from numpy.random import randint
from pandas import DataFrame
from pandas.util.testing import makeCustomDataframe as mkdf
a = mkdf(3, 5, data_gen_f=lambda r, c: randint(1, 100))
b = mkdf(5, 3, data_gen_f=lambda r, c: randint(1, 100))
c = DataFrame(a.values.dot(b.values), index=a.index, columns=b.columns)
print(a)
print(b)
print(c)
assert (a.iloc[0, :].values * b.iloc[:, 0].values.T).sum() == c.iloc[0, 0]
C0 C_l0_g0 C_l0_g1 C_l0_g2 C_l0_g3 C_l0_g4
R0
R_l0_g0 39 87 88 2 65
R_l0_g1 59 14 76 10 65
R_l0_g2 93 69 4 29 58
C0 C_l0_g0 C_l0_g1 C_l0_g2
R0
R_l0_g0 76 88 11
R_l0_g1 66 73 47
R_l0_g2 78 69 15
R_l0_g3 47 3 40
R_l0_g4 54 31 31
C0 C_l0_g0 C_l0_g1 C_l0_g2
R0
R_l0_g0 19174 17876 7933
R_l0_g1 15316 13503 4862
R_l0_g2 16429 15382 7284
The assert here is just a sanity check that this is indeed a correct matrix multiplication.
The key here is this line:
c=DataFrame(a.values.dot(b.values),index=a.index,columns=b.columns)
What this does is compute the dot product of a and b, but force the resulting DataFrame c to take a's index and b's columns. This turns the dot product into a proper matrix multiplication, pandas-style, since you keep the index and columns (you lose the columns of a and the index of b, but this is semantically correct: in a matrix multiplication you sum over those dimensions, so it would be meaningless to keep them).
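The pattern can be wrapped in a small helper (a sketch, not part of the pandas API):
import pandas as pd
def matmul_df(a, b):
    # value-level matrix product, keeping a's index and b's columns
    return pd.DataFrame(a.values.dot(b.values), index=a.index, columns=b.columns)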
This is a bit awkward but seems simple enough, if it is consistent with the rest of the API (I still have to test what the result will be with Series x DataFrame and Series x Series; I will post my findings here).
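For the Series cases mentioned above, a sketch along the same lines (the shapes and label choices are assumptions):
import numpy as np
import pandas as pd
s = pd.Series([1, 2])
m = pd.DataFrame([[1, 2], [3, 4]])
# Series x DataFrame: force a row vector times a matrix
row_times_matrix = pd.DataFrame(s.values.reshape(1, -1).dot(m.values), columns=m.columns)
# Series x Series: the outer product, as with np.outer earlier
outer = pd.DataFrame(np.outer(s, s), index=s.index, columns=s.index)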