Iterrows a rolling sum - python

I have a pandas dataframe
from pandas import DataFrame, Series
where each row corresponds to one case, and each column corresponds to one month. I want to perform a rolling sum over each 12 month period. Seems simple enough, but I'm getting stuck with
result = [x for x.rolling_sum(12) in df.iterrows()]
result = [x for x.rolling_sum(12) in df.T.iteritems()]
SyntaxError: can't assign to function call
a = []
for x in df.iterrows():
s = x.rolling_sum(12)
a.append(s)
AttributeError: 'tuple' object has no attribute 'rolling_sum'

I think perhaps what you are looking for is
pd.rolling_sum(df, 12, axis=1)
In which case, no list comprehension is necessary. The axis=1 parameter causes Pandas to compute a rolling sum over rows of df.
For example,
import numpy as np
import pandas as pd
ncols, nrows = 13, 2
df = pd.DataFrame(np.arange(ncols*nrows).reshape(nrows, ncols))
print(df)
#     0   1   2   3   4   5   6   7   8   9  10  11  12
# 0   0   1   2   3   4   5   6   7   8   9  10  11  12
# 1  13  14  15  16  17  18  19  20  21  22  23  24  25
print(pd.rolling_sum(df, 12, axis=1))
prints
     0    1    2    3    4    5    6    7    8    9   10   11   12
0  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   66   78
1  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  222  234
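Note: the top-level pd.rolling_sum function was deprecated in pandas 0.18 and removed in later releases. A sketch of the same computation with the current rolling API, transposing so the window runs across each original row (this avoids the rolling axis=1 keyword, which newer pandas deprecates):

```python
import numpy as np
import pandas as pd

ncols, nrows = 13, 2
df = pd.DataFrame(np.arange(ncols * nrows).reshape(nrows, ncols))

# Transpose, roll a 12-wide window down the (now row-wise) values,
# then transpose back so the result lines up with the original frame.
result = df.T.rolling(12).sum().T
print(result)
```

The first 11 columns are NaN because a full 12-value window is not yet available there.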
Regarding your list comprehension:
You've got the parts of the list comprehension in the wrong order. Try:
result = [expression for x in df.iterrows()]
See the docs for more about list comprehensions.
The basic form of a list comprehension is
[expression for variable in sequence]
And the resultant list is equivalent to result after Python executes:
result = []
for variable in sequence:
result.append(expression)
See this link for full syntax for list comprehensions.
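For completeness, a sketch of a working version of the original list comprehension: iterrows() yields (index, row) tuples, so the tuple must be unpacked, and each row is a Series with its own rolling method (shown here with the modern API rather than the removed pd.rolling_sum):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(26).reshape(2, 13))

# Unpack the (index, row) tuple; each row is a Series, so it has .rolling().
result = [row.rolling(12).sum() for _, row in df.iterrows()]
```

The vectorised rolling call above is still preferable; this only shows where the original attempts went wrong.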

Related

Fixing Indexing when Appending Dataframes

I am appending three CSVs:
df = pd.read_csv("places_1.csv")
temp = pd.read_csv("places_2.csv")
df = df.append(temp)
temp = pd.read_csv("places_3.csv")
df = df.append(temp)
print(df.head(20))
the joined table looks like:
location device_count population
0 A 11 NaN
1 B 12 NaN
2 C 13 NaN
3 D 14 NaN
4 E 15 NaN
0 F 21 NaN
1 G 22 NaN
2 H 23 NaN
3 I 24 NaN
4 J 25 NaN
0 K 31 NaN
1 L 32 NaN
2 M 33 NaN
3 N 34 NaN
4 O 35 NaN
As you can see the indices are not unique.
When I run this loop, which uses iloc to fill the population column with device_count * 2:
df2 = df.copy()
for index, row in df.iterrows():
df.iloc[index, df.columns.get_loc('population')] = row['device_count'] * 2
I get the following erroneous result:
location device_count population
0 A 11 62.0
1 B 12 64.0
2 C 13 66.0
3 D 14 68.0
4 E 15 70.0
0 F 21 NaN
1 G 22 NaN
2 H 23 NaN
3 I 24 NaN
4 J 25 NaN
0 K 31 NaN
1 L 32 NaN
2 M 33 NaN
3 N 34 NaN
4 O 35 NaN
Each CSV reuses the index labels of the first CSV, so the loop keeps overwriting the same rows.
I have also tried creating a new column of integers and calling df.set_index(). That did not work.
Any tips?
First, use ignore_index. Second, don't use append; use pd.concat([temp1, temp2, temp3], ignore_index=True) instead.
As others have stated, you can use ignore_index, and you probably should use pd.concat here. Alternatively, for other situations where you are not combining DataFrames, you can also use df = df.reset_index(drop=True) to change the indices after the fact.
Additionally, you should avoid using iterrows() for reasons listed in the docs here. Using the following works way better:
df.loc[:, 'population'] = df.loc[:, 'device_count'].astype('int') * 2
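A minimal sketch of the concat-based fix, with small inline frames standing in for the three CSV files from the question:

```python
import pandas as pd

# Stand-ins for pd.read_csv("places_1.csv") etc. from the question.
parts = [
    pd.DataFrame({'location': ['A', 'B'], 'device_count': [11, 12]}),
    pd.DataFrame({'location': ['F', 'G'], 'device_count': [21, 22]}),
    pd.DataFrame({'location': ['K', 'L'], 'device_count': [31, 32]}),
]

# ignore_index=True discards each part's 0-based index and builds a fresh one,
# so every row gets a unique label.
df = pd.concat(parts, ignore_index=True)

# Vectorised column arithmetic replaces the iterrows() loop entirely.
df['population'] = df['device_count'] * 2
```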

How to set a value of a pandas dataframe between two indices?

I would like to set a value in a pandas DataFrame based on the values of another column. In a nutshell, for example, if I wanted to set the entries of a column my_column of a DataFrame pd where another column, my_interesting_column, is between 10 and 30, I would like to do something like:
start_index=pd.find_closest_index_where_pd["my_interesting_column"].is_closest_to(10)
end_index=pd.find_closest_index_where_pd["my_interesting_column"].is_closest_to(30)
pd["my_column"].between(start_index, end_index) = some_value
As a simple illustration, suppose I have the following dataframe
df = pd.DataFrame(np.arange(10, 20), columns=list('A'))
df["B"]=np.nan
>>> df
A B
0 10 NaN
1 11 NaN
2 12 NaN
3 13 NaN
4 14 NaN
5 15 NaN
6 16 NaN
7 17 NaN
8 18 NaN
9 19 NaN
How can I do something like
df.where(df["A"].is_between(13,16))= 5
So that the end results looks like
>>> df
A B
0 10 NaN
1 11 NaN
2 12 NaN
3 13 5
4 14 5
5 15 5
6 16 5
7 17 NaN
8 18 NaN
9 19 NaN
pd.loc[start_idx:end_idx, 'my_column'] = some_value
I think this is what you are looking for
df.loc[(df['A'] >= 13) & (df['A'] <= 16), 'B'] = 5
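A sketch of the same mask written with Series.between, which is inclusive on both ends by default and reads a little closer to the pseudocode in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(10, 20)})
df['B'] = np.nan

# Rows where A is in [13, 16] (inclusive) get B = 5.
df.loc[df['A'].between(13, 16), 'B'] = 5
```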

Remove lesser than K consecutive NaNs from pandas DataFrame

I am working with time-series data. I am having trouble removing runs of consecutive NaNs whose length is less than or equal to a threshold from a DataFrame column. I tried looking at some links like:
Identifying consecutive NaN's with pandas : Identifies where consecutive NaNs are present and what is count.
Pandas: run length of NaN holes : Outputs run Length encoding for NaNs
There are many more along these lines, but none of them actually explains how to remove the NaNs after identifying them.
I found one similar solution but that is in R :
How to remove more than 2 consecutive NA's in a column?
I want solution in Python.
So here is the example:
Here is my dataframe column:
a
0 36.45
1 35.45
2 NaN
3 NaN
4 NaN
5 37.21
6 35.63
7 36.45
8 34.65
9 31.45
10 NaN
11 NaN
12 36.71
13 35.55
14 NaN
15 NaN
16 NaN
17 NaN
18 37.71
If k = 3, my output should be:
a
0 36.45
1 35.45
2 37.21
3 35.63
4 36.45
5 34.65
6 31.45
7 36.71
8 35.55
9 NaN
10 NaN
11 NaN
12 NaN
13 37.71
How can I go about removing runs of consecutive NaNs whose length is less than or equal to some threshold k?
There are a few ways, but this is how I've done it:
Determine groups of consecutive numbers using a neat cumsum trick
Use groupby + transform to determine the size of each group
Identify groups of NaNs that are within the threshold
Filter them out with boolean indexing.
k = 3
i = df.a.isnull()
m = ~(df.groupby(i.ne(i.shift()).cumsum().values).a.transform('size').le(k) & i)
df[m]
a
0 36.45
1 35.45
5 37.21
6 35.63
7 36.45
8 34.65
9 31.45
12 36.71
13 35.55
14 NaN
15 NaN
16 NaN
17 NaN
18 37.71
You can add a df = df[m].reset_index(drop=True) step at the end if you want a monotonically increasing integer index.
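A breakdown of the cumsum trick on a shorter series, showing each intermediate step (same logic as above, just unrolled for inspection):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [36.45, 35.45, np.nan, np.nan, np.nan, 37.21,
                         35.63, np.nan, np.nan, np.nan, np.nan, 37.71]})
k = 3

i = df['a'].isna()                          # True on NaN rows
groups = i.ne(i.shift()).cumsum()           # id increments whenever NaN-ness flips
sizes = df.groupby(groups.values)['a'].transform('size')  # run length per row
m = ~(sizes.le(k) & i)                      # drop rows in NaN runs of length <= k
out = df[m].reset_index(drop=True)
```

Here the run of 3 NaNs is removed (3 <= k) while the run of 4 NaNs survives.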
You can group on the non-NaN values to count the consecutive NaNs that follow each one.
k = 3
(
df.groupby(pd.notna(df.a).cumsum())
.apply(lambda x: x.dropna() if pd.isna(x.a).sum() <= k else x)
.reset_index(drop=True)
)
Out[375]:
a
0 36.45
1 35.45
2 37.21
3 35.63
4 36.45
5 34.65
6 31.45
7 36.71
8 35.55
9 NaN
10 NaN
11 NaN
12 NaN
13 37.71

Add column in dataframe from list

I have a dataframe with some columns like this:
A B C
0
4
5
6
7
7
6
5
The possible range of values in A are only from 0 to 7.
Also, I have a list of 8 elements like this:
List = [2, 5, 6, 8, 12, 16, 26, 32]  # there are only 8 elements in this list
If the element in column A is n, I need to insert the n-th element from the List in a new column, say 'D'.
How can I do this in one go without looping over the whole dataframe?
The resulting dataframe would look like this:
A B C D
0 2
4 12
5 16
6 26
7 32
7 32
6 26
5 16
Note: The dataframe is huge and iteration is the last option. But I can also arrange the elements in 'List' in any other data structure, like a dict, if necessary.
Just assign the list directly:
df['new_col'] = mylist
Alternative
Convert the list to a series or array and then assign:
se = pd.Series(mylist)
df['new_col'] = se.values
or
df['new_col'] = np.array(mylist)
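One caveat to the direct assignments above, sketched here on the question's data: a plain list is assigned positionally and must have exactly one element per row (a length mismatch raises ValueError), whereas assigning a Series aligns on index labels instead.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 4, 5, 6, 7, 7, 6, 5]})
mylist = [2, 5, 6, 8, 12, 16, 26, 32]

# Positional assignment: element 0 of the list goes to row 0, and so on.
# This requires len(mylist) == len(df).
df['new_col'] = mylist
```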
IIUC, if you make your (unfortunately named) List into an ndarray, you can simply index into it naturally.
>>> import numpy as np
>>> m = np.arange(16)*10
>>> m[df.A]
array([ 0, 40, 50, 60, 150, 150, 140, 130])
>>> df["D"] = m[df.A]
>>> df
A B C D
0 0 NaN NaN 0
1 4 NaN NaN 40
2 5 NaN NaN 50
3 6 NaN NaN 60
4 15 NaN NaN 150
5 15 NaN NaN 150
6 14 NaN NaN 140
7 13 NaN NaN 130
Here I built a new m, but if you use m = np.asarray(List), the same thing should work: the values in df.A will pick out the appropriate elements of m.
Note that if you're using an old version of numpy, you might have to use m[df.A.values] instead-- in the past, numpy didn't play well with others, and some refactoring in pandas caused some headaches. Things have improved now.
A solution improving on the great one from @sparrow.
Let df, be your dataset, and mylist the list with the values you want to add to the dataframe.
Let's suppose you want to call your new column simply, new_column
First make the list into a Series:
column_values = pd.Series(mylist)
Then use the insert function to add the column. This function has the advantage of letting you choose the position in which to place the column.
In the following example we will position the new column in the first position from left (by setting loc=0)
df.insert(loc=0, column='new_column', value=column_values)
First let's create the dataframe you had, I'll ignore columns B and C as they are not relevant.
df = pd.DataFrame({'A': [0, 4, 5, 6, 7, 7, 6,5]})
And the mapping that you desire:
mapping = dict(enumerate([2,5,6,8,12,16,26,32]))
df['D'] = df['A'].map(mapping)
Done!
print(df)
Output:
A D
0 0 2
1 4 12
2 5 16
3 6 26
4 7 32
5 7 32
6 6 26
7 5 16
Old question, but I always try to use the fastest code!
I had a huge list of 69 million uint64 values. np.array() was fastest for me.
df['hashes'] = hashes
Time spent: 17.034842014312744
df['hashes'] = pd.Series(hashes).values
Time spent: 17.141014337539673
df['key'] = np.array(hashes)
Time spent: 10.724546194076538
You can also use df.assign:
In [1559]: df
Out[1559]:
A B C
0 0 NaN NaN
1 4 NaN NaN
2 5 NaN NaN
3 6 NaN NaN
4 7 NaN NaN
5 7 NaN NaN
6 6 NaN NaN
7 5 NaN NaN
In [1560]: mylist = [2,5,6,8,12,16,26,32]
In [1567]: df = df.assign(D=mylist)
In [1568]: df
Out[1568]:
A B C D
0 0 NaN NaN 2
1 4 NaN NaN 5
2 5 NaN NaN 6
3 6 NaN NaN 8
4 7 NaN NaN 12
5 7 NaN NaN 16
6 6 NaN NaN 26
7 5 NaN NaN 32

Multi-indexed dataframe: Setting values

I already asked a related question earlier, but I didn't want to start a comment-and-edit discussion. So here is, boiled down, what the answer to my earlier question led me to ask. Consider
import pandas as pd
from numpy import arange
from scipy import random
index = pd.MultiIndex.from_product([arange(0,3), arange(10,15)], names=['A', 'B'])
df = pd.DataFrame(columns=['test'], index=index)
someValues = random.randint(0, 10, size=5)
df.loc[0, 'test'], df.loc[0,:] and df.ix[0] all create a representation of a part of the data frame, the first one being a Series and the other two being df slices. However
df.ix[0] = df.loc[0,'test'] = someValues sets the value for the df
df.loc[0,'test'] = someValues gives an error ValueError: total size of new array must be unchanged
df.loc[0,:] = someValues is being ignored. No error, but the df does not contain the numpy array.
I skimmed the docs but found no clear, systematic explanation of what is going on with MultiIndexes in general. So far, my guess is "if the view is a Series, you can set values", and "otherwise, god knows what happens".
Could someone shed some light on the logic? Moreover, is there some deep meaning behind this or are these just constraints due to how it is set up?
These are all with 0.13.1
These are not all 'slice' representations at all.
This is a Series.
In [50]: df.loc[0,'test']
Out[50]:
B
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
Name: test, dtype: object
These are DataFrames (and the same)
In [51]: df.loc[0,:]
Out[51]:
test
B
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
[5 rows x 1 columns]
In [52]: df.ix[0]
Out[52]:
test
B
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
[5 rows x 1 columns]
This is trying to assign the wrong shape (it looks like it should work, but with multiple columns it would not, which is why it is not allowed)
In [54]: df.ix[0] = someValues
ValueError: could not broadcast input array from shape (5) into shape (5,1)
This works because it knows how to broadcast
In [56]: df.loc[0,:] = someValues
In [57]: df
Out[57]:
test
A B
0 10 4
11 3
12 4
13 2
14 8
1 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
2 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
[15 rows x 1 columns]
This works fine
In [63]: df.loc[0,'test'] = someValues+1
In [64]: df
Out[64]:
test
A B
0 10 5
11 4
12 5
13 3
14 9
1 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
2 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
[15 rows x 1 columns]
As does this
In [66]: df.loc[0,:] = someValues+1
In [67]: df
Out[67]:
test
A B
0 10 5
11 4
12 5
13 3
14 9
1 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
2 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
[15 rows x 1 columns]
It's not clear how you generated the cases in your question. I think the logic is pretty straightforward and consistent (there were several inconsistencies in prior versions, however).
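For readers on a current pandas: .ix was removed in pandas 1.0, but the Series-shaped assignment from the question works with .loc alone. A sketch of the setup under a modern version (behavior in 0.13.1, as discussed above, differed in places):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([np.arange(3), np.arange(10, 15)],
                                   names=['A', 'B'])
df = pd.DataFrame(columns=['test'], index=index)
someValues = np.arange(5)

# Scalar first level plus a single column selects a length-5 Series,
# so a 5-element array broadcasts cleanly into it.
df.loc[0, 'test'] = someValues
```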
