Multi-indexed dataframe: Setting values - python

I already asked a related question earlier, but I didn't want to start a comment-and-edit discussion. So here is, boiled down, what the answer to my earlier question led me to ask. Consider
import pandas as pd
from numpy import arange
from numpy import random
index = pd.MultiIndex.from_product([arange(0,3), arange(10,15)], names=['A', 'B'])
df = pd.DataFrame(columns=['test'], index=index)
someValues = random.randint(0, 10, size=5)
df.loc[0, 'test'], df.loc[0,:] and df.ix[0] all create a representation of a part of the data frame, the first one being a Series and the other two being df slices. However
df.ix[0] = df.loc[0,'test'] = someValues sets the value for the df
df.loc[0,'test'] = someValues gives an error ValueError: total size of new array must be unchanged
df.loc[0,:] = someValues is being ignored. No error, but the df does not contain the numpy array.
I skimmed the docs but found no clear, systematic explanation of what is going on with MultiIndexes in general. So far, I guess that "if the view is a Series, you can set values", and "otherwise, god knows what happens".
Could someone shed some light on the logic? Moreover, is there some deep meaning behind this or are these just constraints due to how it is set up?

These are all with 0.13.1
These are not all 'slice' representations at all.
This is a Series.
In [50]: df.loc[0,'test']
Out[50]:
B
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
Name: test, dtype: object
These are DataFrames (and the same)
In [51]: df.loc[0,:]
Out[51]:
test
B
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
[5 rows x 1 columns]
In [52]: df.ix[0]
Out[52]:
test
B
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
[5 rows x 1 columns]
This is trying to assign the wrong shape (it looks like it should work, but if you have multiple columns then it won't, that is why this is not allowed)
In [54]: df.ix[0] = someValues
ValueError: could not broadcast input array from shape (5) into shape (5,1)
This works because it knows how to broadcast
In [56]: df.loc[0,:] = someValues
In [57]: df
Out[57]:
test
A B
0 10 4
11 3
12 4
13 2
14 8
1 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
2 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
[15 rows x 1 columns]
This works fine
In [63]: df.loc[0,'test'] = someValues+1
In [64]: df
Out[64]:
test
A B
0 10 5
11 4
12 5
13 3
14 9
1 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
2 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
[15 rows x 1 columns]
As does this
In [66]: df.loc[0,:] = someValues+1
In [67]: df
Out[67]:
test
A B
0 10 5
11 4
12 5
13 3
14 9
1 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
2 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
[15 rows x 1 columns]
Not clear where you generated the cases in your question. I think the logic is pretty straightforward and consistent (there were several inconsistencies in prior versions, however).
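For readers on a recent pandas release, where .ix is no longer available, the two assignments shown above can be sketched with .loc alone. This is a minimal sketch, not the answerer's code: some_values is a fixed stand-in for the random array, and the frame is created with dtype=float (an assumption) so the column is numeric from the start.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [np.arange(0, 3), np.arange(10, 15)], names=['A', 'B']
)
# Assumption: a float column instead of the object column in the question.
df = pd.DataFrame(columns=['test'], index=index, dtype=float)
some_values = np.array([4, 3, 4, 2, 8])  # stand-in for the random values

# Series-shaped target (5 rows of one column): a 1-D array of length 5 fits.
df.loc[0, 'test'] = some_values

# DataFrame-shaped target (5 rows x 1 column): pass a matching (5, 1) array.
df.loc[1, :] = (some_values + 1).reshape(-1, 1)
```

Matching the shape of the value to the shape of the selection avoids relying on how a particular pandas version broadcasts.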

Related

How to set a value of a panda dataframe between two indices?

I would like to set values in a pandas dataframe based on the values of another column. In a nutshell, if I wanted to set the entries of a column my_column of a pandas dataframe pd where another column, my_interesting_column, is between 10 and 30, I would like to do something like:
start_index=pd.find_closest_index_where_pd["my_interesting_column"].is_closest_to(10)
end_index=pd.find_closest_index_where_pd["my_interesting_column"].is_closest_to(30)
pd["my_column"].between(start_index, end_index)= some_value
As a simple illustration, suppose I have the following dataframe
df = pd.DataFrame(np.arange(10, 20), columns=list('A'))
df["B"]=np.nan
>>> df
A B
0 10 NaN
1 11 NaN
2 12 NaN
3 13 NaN
4 14 NaN
5 15 NaN
6 16 NaN
7 17 NaN
8 18 NaN
9 19 NaN
How can I do something like
df.where(df["A"].is_between(13,16))= 5
So that the end results looks like
>>> df
A B
0 10 NaN
1 11 NaN
2 12 NaN
3 13 5
4 14 5
5 15 5
6 16 5
7 17 NaN
8 18 NaN
9 19 NaN
pd.loc[start_idx:end_idx, 'my_column'] = some_value
I think this is what you are looking for
df.loc[(df['A'] >= 13) & (df['A'] <= 16), 'B'] = 5
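The same boolean-mask assignment can also be written with Series.between, which is inclusive on both ends by default. A short sketch using the example frame from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(10, 20)})
df['B'] = np.nan

# Boolean mask: True where 13 <= A <= 16 (both bounds included).
mask = df['A'].between(13, 16)

# Assign only to column B on the rows selected by the mask.
df.loc[mask, 'B'] = 5
```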

understanding rolling correlation in pandas

I am trying to understand how pandas.rolling_corr actually calculates rolling correlations. So far I have always done it with numpy. I would prefer to use pandas for its speed and ease of use, but I cannot reproduce the rolling correlation I used to get.
I start with two numpy arrays:
c = np.array([1,2,3,4,5,6,7,8,9,8,7,6,5,4,3,2,1])
d = np.array([8,9,8])
Now I want to calculate the cross-correlation for each length-3 window of my array c. I define a rolling window function:
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
and calculate the correlation between each of my generated windows and the second original dataset. This approach works just fine:
for win in rolling_window(c, len(d)):
    print(np.correlate(win, d))
Outputs:
[50]
[75]
[100]
[125]
[150]
[175]
[200]
[209]
[200]
[175]
[150]
[125]
[100]
[75]
[50]
If I attempt to solve it with pandas:
a = pd.DataFrame([1,2,3,4,5,6,7,8,9,8,7,6,5,4,3,2,1])
b = pd.DataFrame([8,9,8])
no matter if I use DataFrame rolling_corr:
a.rolling(window=3, center=True).corr(b)
or Pandas rolling_corr:
pd.rolling_corr(a, b, window=1, center=True)
I just get a bunch of NaNs:
0
0 NaN
1 0.0
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
Can someone give me a hand? I am able to solve the problem with numpy by flattening the numpy array obtained from converting the pandas DataFrame
a.values.ravel()
However, I would like to solve the calculation entirely with pandas. I have searched the documentation but haven't found the answer I am looking for. What am I missing or not understanding?
Thank you very much in advance.
D.
The computation you're trying to do can be thought of as operating on the following dataframe:
pd.concat([a, b], axis=1)
0 0
0 1 8
1 2 9
2 3 8
3 4 NaN
4 5 NaN
5 6 NaN
6 7 NaN
7 8 NaN
8 9 NaN
9 8 NaN
10 7 NaN
11 6 NaN
12 5 NaN
13 4 NaN
14 3 NaN
15 2 NaN
16 1 NaN
If you're using window=3, it correlates the first three values in b with the first 3 values in a, leaving the rest with NaN, and placing the value in the center of the window (center=True).
You can try:
pd.rolling_apply(a, window=3, func=lambda x: np.correlate(x, b[0]))
Output:
0
0 NaN
1 NaN
2 50
3 75
4 100
5 125
6 150
7 175
8 200
9 209
10 200
11 175
12 150
13 125
14 100
15 75
16 50
You can add center=True here too if you'd like.
(I'm using pandas 0.17.0)
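The top-level pd.rolling_apply function was removed in later pandas releases. A rough equivalent with the .rolling(...).apply(...) method might look like the sketch below. It assumes c is held as a Series and d as a plain numpy array, which is a choice made here for illustration.

```python
import numpy as np
import pandas as pd

c = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 8, 7, 6, 5, 4, 3, 2, 1])
d = np.array([8, 9, 8])

# np.correlate on two equal-length arrays returns a length-1 array,
# so take element 0 to give rolling.apply the scalar it expects.
# raw=True passes each window as a numpy array instead of a Series.
result = c.rolling(window=len(d)).apply(
    lambda w: np.correlate(w, d)[0], raw=True
)
```

The first two entries are NaN because the window is not yet full. Note that this is the sliding dot product from the numpy version, not a Pearson correlation coefficient.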

Working with NaN values in multiple columns in Pandas

I have multiple datasets with different number of rows and same number of columns.
I would like to find NaN values in each column. For example, consider these two datasets:
dataset1 : dataset2:
a b a b
1 10 2 11
2 9 3 12
3 8 4 13
4 nan nan 14
5 nan nan 15
6 nan nan 16
I want to find NaN values in columns a and b of both datasets:
if a NaN occurs in column b, remove every row that has it; if it occurs in column a, fill those values with 0.
This is my code snippet:
a=pd.notnull(data['a'].values.any())
b= pd.notnull((data['b'].values.any()))
if a:
    data = data.dropna(subset=['a'])
if b:
    data[['a']] = data[['a']].fillna(value=0)
which does not work properly.
You just need fillna and dropna without control flow
data = data.dropna(subset=['b']).fillna(0)
Pass your condition to a dict
df=df.fillna({'a':0,'b':np.nan}).dropna()
You do not need 'b' here
df=df.fillna({'a':0}).dropna()
EDIT :
df.fillna({'a':0}).dropna()
Out[1319]:
a b
0 2.0 11
1 3.0 12
2 4.0 13
3 0.0 14
4 0.0 15
5 0.0 16
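As a sketch of how the two rules from the question combine, the snippet below rebuilds both example datasets and applies one helper to each. The helper name clean is made up for illustration.

```python
import numpy as np
import pandas as pd

dataset1 = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                         'b': [10, 9, 8, np.nan, np.nan, np.nan]})
dataset2 = pd.DataFrame({'a': [2, 3, 4, np.nan, np.nan, np.nan],
                         'b': [11, 12, 13, 14, 15, 16]})

def clean(data):
    # Drop rows where b is missing, then fill missing a values with 0.
    return data.dropna(subset=['b']).fillna({'a': 0})

clean1 = clean(dataset1)  # rows with NaN in b are removed
clean2 = clean(dataset2)  # NaN in a becomes 0, no rows removed
```

Restricting dropna to column b and fillna to column a keeps the two rules independent, so no if/else check is needed.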

Add column in dataframe from list

I have a dataframe with some columns like this:
A B C
0
4
5
6
7
7
6
5
The possible range of values in A are only from 0 to 7.
Also, I have a list of 8 elements like this:
List=[2,5,6,8,12,16,26,32]  # There are only 8 elements in this list
If the element in column A is n, I need to insert the n th element from the List in a new column, say 'D'.
How can I do this in one go without looping over the whole dataframe?
The resulting dataframe would look like this:
A B C D
0 2
4 12
5 16
6 26
7 32
7 32
6 26
5 16
Note: The dataframe is huge and iteration is the last option. But I can also arrange the elements in 'List' in any other data structure, such as a dict, if necessary.
Just assign the list directly:
df['new_col'] = mylist
Alternative
Convert the list to a series or array and then assign:
se = pd.Series(mylist)
df['new_col'] = se.values
or
df['new_col'] = np.array(mylist)
IIUC, if you make your (unfortunately named) List into an ndarray, you can simply index into it naturally.
>>> import numpy as np
>>> m = np.arange(16)*10
>>> m[df.A]
array([ 0, 40, 50, 60, 150, 150, 140, 130])
>>> df["D"] = m[df.A]
>>> df
A B C D
0 0 NaN NaN 0
1 4 NaN NaN 40
2 5 NaN NaN 50
3 6 NaN NaN 60
4 15 NaN NaN 150
5 15 NaN NaN 150
6 14 NaN NaN 140
7 13 NaN NaN 130
Here I built a new m, but if you use m = np.asarray(List), the same thing should work: the values in df.A will pick out the appropriate elements of m.
Note that if you're using an old version of numpy, you might have to use m[df.A.values] instead-- in the past, numpy didn't play well with others, and some refactoring in pandas caused some headaches. Things have improved now.
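Applied to the actual values from the question, the array lookup could be sketched as follows. The name lookup is chosen here for illustration, and .to_numpy() is used to hand numpy a plain integer array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 4, 5, 6, 7, 7, 6, 5]})
lookup = np.asarray([2, 5, 6, 8, 12, 16, 26, 32])

# Fancy indexing: each value n in column A picks lookup[n],
# in one vectorized step with no Python-level loop.
df['D'] = lookup[df['A'].to_numpy()]
```

This relies on every value in A being a valid position in the array (0 to 7 here); an out-of-range value would raise an IndexError.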
A solution improving on the great one from @sparrow.
Let df be your dataset, and mylist the list with the values you want to add to the dataframe.
Let's suppose you want to call your new column simply new_column.
First make the list into a Series:
column_values = pd.Series(mylist)
Then use the insert function to add the column. This function has the advantage of letting you choose the position at which the column is placed.
In the following example we place the new column in the first position from the left (by setting loc=0):
df.insert(loc=0, column='new_column', value=column_values)
First let's create the dataframe you had, I'll ignore columns B and C as they are not relevant.
df = pd.DataFrame({'A': [0, 4, 5, 6, 7, 7, 6,5]})
And the mapping that you desire:
mapping = dict(enumerate([2,5,6,8,12,16,26,32]))
df['D'] = df['A'].map(mapping)
Done!
print(df)
Output:
A D
0 0 2
1 4 12
2 5 16
3 6 26
4 7 32
5 7 32
6 6 26
7 5 16
Old question, but I always try to use the fastest code!
I had a huge list of 69 million uint64 values. np.array() was fastest for me.
df['hashes'] = hashes
Time spent: 17.034842014312744
df['hashes'] = pd.Series(hashes).values
Time spent: 17.141014337539673
df['key'] = np.array(hashes)
Time spent: 10.724546194076538
You can also use df.assign:
In [1559]: df
Out[1559]:
A B C
0 0 NaN NaN
1 4 NaN NaN
2 5 NaN NaN
3 6 NaN NaN
4 7 NaN NaN
5 7 NaN NaN
6 6 NaN NaN
7 5 NaN NaN
In [1560]: mylist = [2,5,6,8,12,16,26,32]
In [1567]: df = df.assign(D=mylist)
In [1568]: df
Out[1568]:
A B C D
0 0 NaN NaN 2
1 4 NaN NaN 5
2 5 NaN NaN 6
3 6 NaN NaN 8
4 7 NaN NaN 12
5 7 NaN NaN 16
6 6 NaN NaN 26
7 5 NaN NaN 32

Iterrows a rolling sum

I have a pandas dataframe
from pandas import DataFrame, Series
where each row corresponds to one case, and each column corresponds to one month. I want to perform a rolling sum over each 12 month period. Seems simple enough, but I'm getting stuck with
result = [x for x.rolling_sum(12) in df.iterrows()]
result = [x for x.rolling_sum(12) in df.T.iteritems()]
SyntaxError: can't assign to function call
a = []
for x in df.iterrows():
    s = x.rolling_sum(12)
    a.append(s)
AttributeError: 'tuple' object has no attribute 'rolling_sum'
I think perhaps what you are looking for is
pd.rolling_sum(df, 12, axis=1)
In which case, no list comprehension is necessary. The axis=1 parameter causes Pandas to compute a rolling sum over rows of df.
For example,
import numpy as np
import pandas as pd
ncols, nrows = 13, 2
df = pd.DataFrame(np.arange(ncols*nrows).reshape(nrows, ncols))
print(df)
# 0 1 2 3 4 5 6 7 8 9 10 11 12
# 0 0 1 2 3 4 5 6 7 8 9 10 11 12
# 1 13 14 15 16 17 18 19 20 21 22 23 24 25
print(pd.rolling_sum(df, 12, axis=1))
prints
0 1 2 3 4 5 6 7 8 9 10 11 12
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 66 78
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 222 234
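pd.rolling_sum was removed in later pandas releases. One way to get the same row-wise result with the .rolling method is to transpose, roll down the columns, and transpose back. This is a sketch of that workaround, not the answerer's code:

```python
import numpy as np
import pandas as pd

ncols, nrows = 13, 2
df = pd.DataFrame(np.arange(ncols * nrows).reshape(nrows, ncols))

# After transposing, each original row is a column, so a default
# rolling window runs along what used to be the row direction.
result = df.T.rolling(window=12).sum().T
```

The result has the same shape as df, with NaN in the first 11 columns of each row where the window is incomplete.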
Regarding your list comprehension:
You've got the parts of the list comprehension in the wrong order. Try:
result = [expression for x in df.iterrows()]
See the docs for more about list comprehensions.
The basic form of a list comprehension is
[expression for variable in sequence]
And the resultant list is equivalent to result after Python executes:
result = []
for variable in sequence:
    result.append(expression)
See this link for full syntax for list comprehensions.
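The AttributeError in the question comes from df.iterrows() yielding (index, row) tuples rather than rows. A small sketch of a comprehension that unpacks the tuple, using made-up column names and a plain sum in place of the rolling sum:

```python
import pandas as pd

# Hypothetical two-month frame, one row per case.
df = pd.DataFrame({'jan': [1, 2], 'feb': [3, 4]})

# iterrows() yields (index, Series) pairs; unpack and discard the index.
row_totals = [row.sum() for _, row in df.iterrows()]
```

For large frames a vectorized call is usually preferable to iterating row by row.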
