Add column in dataframe from list - python

I have a dataframe with some columns like this:
A B C
0
4
5
6
7
7
6
5
The possible range of values in A is only from 0 to 7.
Also, I have a list of 8 elements like this:
List = [2, 5, 6, 8, 12, 16, 26, 32]  # there are only 8 elements in this list
If the element in column A is n, I need to insert the element at index n from the List into a new column, say 'D'.
How can I do this in one go without looping over the whole dataframe?
The resulting dataframe would look like this:
A B C D
0 2
4 12
5 16
6 26
7 32
7 32
6 26
5 16
Note: The dataframe is huge and iteration is the last option. But I can also arrange the elements in 'List' in any other data structure, like a dict, if necessary.

Just assign the list directly:
df['new_col'] = mylist
Alternative
Convert the list to a series or array and then assign:
se = pd.Series(mylist)
df['new_col'] = se.values
or
df['new_col'] = np.array(mylist)
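A minimal runnable sketch of the above (df and mylist are stand-ins; the list must have exactly one entry per row):
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [0, 4, 5, 6, 7, 7, 6, 5]})
mylist = [2, 5, 6, 8, 12, 16, 26, 32]

# pairs list elements with rows positionally, in order
df['new_col'] = mylist
Note that this pairs values with rows in order; it does not perform the lookup through column A that the question asks for (see the ndarray and map answers below for that).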

IIUC, if you make your (unfortunately named) List into an ndarray, you can simply index into it naturally.
>>> import numpy as np
>>> m = np.arange(16)*10
>>> m[df.A]
array([ 0, 40, 50, 60, 150, 150, 140, 130])
>>> df["D"] = m[df.A]
>>> df
A B C D
0 0 NaN NaN 0
1 4 NaN NaN 40
2 5 NaN NaN 50
3 6 NaN NaN 60
4 15 NaN NaN 150
5 15 NaN NaN 150
6 14 NaN NaN 140
7 13 NaN NaN 130
Here I built a new m, but if you use m = np.asarray(List), the same thing should work: the values in df.A will pick out the appropriate elements of m.
Note that if you're using an old version of numpy, you might have to use m[df.A.values] instead: in the past, numpy didn't play well with others, and some refactoring in pandas caused some headaches. Things have improved now.
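Applied to the question's own data, a minimal sketch (assuming the 8-element List from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 4, 5, 6, 7, 7, 6, 5]})
lookup = np.asarray([2, 5, 6, 8, 12, 16, 26, 32])

# fancy indexing: each value in A picks the element at that position
df['D'] = lookup[df['A'].values]
This produces the D column from the question: [2, 12, 16, 26, 32, 32, 26, 16].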

A solution improving on the great one from @sparrow.
Let df be your dataframe, and mylist the list with the values you want to add to it.
Let's suppose you want to call your new column simply new_column.
First make the list into a Series:
column_values = pd.Series(mylist)
Then use the insert function to add the column. This function has the advantage of letting you choose the position in which to place the column.
In the following example we will position the new column in the first position from the left (by setting loc=0):
df.insert(loc=0, column='new_column', value=column_values)
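One caveat: when value is a Series, pandas aligns it on the index, so with a non-default index the values can land in unexpected rows or become NaN. Passing the plain list keeps the insertion purely positional (a sketch under that assumption):
# a plain list is inserted positionally, with no index alignment
df.insert(loc=0, column='new_column', value=mylist)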

First, let's create the dataframe you had; I'll ignore columns B and C as they are not relevant.
import pandas as pd

df = pd.DataFrame({'A': [0, 4, 5, 6, 7, 7, 6, 5]})
And the mapping that you desire:
mapping = dict(enumerate([2,5,6,8,12,16,26,32]))
df['D'] = df['A'].map(mapping)
Done!
print(df)
Output:
A D
0 0 2
1 4 12
2 5 16
3 6 26
4 7 32
5 7 32
6 6 26
7 5 16
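One usage note: map leaves any value of A that is missing from the dict as NaN; if a default is needed, it can be chained on (the -1 below is just an arbitrary placeholder):
df['D'] = df['A'].map(mapping).fillna(-1)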

Old question, but I always try to use the fastest code!
I had a huge list with 69 million uint64 values; np.array() was fastest for me.
df['hashes'] = hashes
Time spent: 17.034842014312744
df['hashes'] = pd.Series(hashes).values
Time spent: 17.141014337539673
df['hashes'] = np.array(hashes)
Time spent: 10.724546194076538
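For reference, a rough sketch of how such a timing comparison can be reproduced; hashes, the 69-million-element list, and the absolute times are specific to that machine, so a smaller stand-in list is used here:
import time
import numpy as np
import pandas as pd

hashes = list(range(1_000_000))        # stand-in for the real uint64 list
df = pd.DataFrame(index=range(len(hashes)))

start = time.perf_counter()
df['hashes'] = np.array(hashes)        # the fastest variant above
print('Time spent:', time.perf_counter() - start)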

You can also use df.assign:
In [1559]: df
Out[1559]:
A B C
0 0 NaN NaN
1 4 NaN NaN
2 5 NaN NaN
3 6 NaN NaN
4 7 NaN NaN
5 7 NaN NaN
6 6 NaN NaN
7 5 NaN NaN
In [1560]: mylist = [2,5,6,8,12,16,26,32]
In [1567]: df = df.assign(D=mylist)
In [1568]: df
Out[1568]:
A B C D
0 0 NaN NaN 2
1 4 NaN NaN 5
2 5 NaN NaN 6
3 6 NaN NaN 8
4 7 NaN NaN 12
5 7 NaN NaN 16
6 6 NaN NaN 26
7 5 NaN NaN 32
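A side note: assign returns a new dataframe rather than mutating in place, and it also accepts callables, so the question's lookup through column A can be expressed in one chained call (a sketch reusing the dict-mapping idea from the earlier answer):
df = df.assign(D=lambda d: d['A'].map(dict(enumerate(mylist))))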

Related

Pandas filter rows based on certain number of certain columns being NaN

I have a data set like this:
seq S01-T01 S01-T02 S01-T03 S02-T01 S02-T02 S02-T03 S03-T01 S03-T02 S03-T03
A NaN 4 5 NaN 4 7 NaN 6 8
B 7 2 9 2 1 9 2 1 1
C NaN 4 4 2 4 NaN 2 6 8
D 5 NaN NaN 2 5 9 NaN 1 1
I want to remove the rows where at least three of the columns marked 'T01' are NaN
So the output would be:
seq S01-T01 S01-T02 S01-T03 S02-T01 S02-T02 S02-T03 S03-T01 S03-T02 S03-T03
B 7 2 9 2 1 9 2 1 1
C NaN 4 4 2 4 NaN 2 6 8
D 5 NaN NaN 2 5 9 NaN 1 1
Because row A has NaN in S01-T01, S02-T01, and S03-T01. Row D also has three NaNs, but it is kept because I am only interested in removing rows that have >=3 NaN specifically in the columns whose names contain T01.
I know this could be simple to do, I wrote:
import sys
import pandas as pd
df = pd.read_csv('data.csv',sep=',')
print(df.columns.str.contains['T01'])
To first get all of the cells with T01 in them, and then I was going to count them.
I got the error:
print(df.columns.str.contains['T01'])
TypeError: 'method' object is not subscriptable
Then I thought about iterating through the rows and counting instead e.g.:
for index, row in df.iterrows():
    if 'T01' in row:
        print(row)
This runs without error but prints nothing to screen. Could someone demonstrate a better way to do this?
If you select only the 'T01' columns, you can take the row-wise sum of nulls and keep only the rows where that sum is less than 3.
df.loc[df[[x for x in df if 'T01' in x]].isnull().sum(1).lt(3)]
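A hedged alternative for the column selection: df.filter(like='T01') picks out the same columns without the list comprehension, which some find more readable:
df[df.filter(like='T01').isnull().sum(axis=1).lt(3)]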

Pandas shift values in a column over intervening rows

I have a pandas data frame as shown below. One column has values with intervening NaN cells. The values are to be shifted ahead by one so that each replaces the next value that follows, with the last value being lost. The intervening NaN cells have to remain. I tried using .shift(), but since I never know how many NaN rows intervene, it would mean a separate calculation for each shift. Is there a better approach?
IIUC, you may just groupby by non-na values, and shift them.
df['y'] = df.y.groupby(pd.isnull(df.y)).shift()
x y
0 A NaN
1 A NaN
2 A NaN
3 B 5.0
4 B NaN
5 B NaN
6 B NaN
7 C 10.0
8 C NaN
9 C NaN
10 C NaN
Another way:
s = df['y'].notnull()
df.loc[s,'y'] = df.loc[s,'y'].shift()
It would be easier to test if you paste your text data instead of the picture.
Input:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': list('AAABBBBCCCC'),
                   'y': [5, np.nan, np.nan, 10, np.nan, np.nan, np.nan,
                         20, np.nan, np.nan, np.nan]})
output:
x y
0 A NaN
1 A NaN
2 A NaN
3 B 5.0
4 B NaN
5 B NaN
6 B NaN
7 C 10.0
8 C NaN
9 C NaN
10 C NaN

Pandas: take whichever column is not NaN

I am working with a fairly messy data set that comes from individual csv files with slightly different column names. It would be too onerous to rename the columns in the csv files, partly because I am still discovering all the variations, so I am looking to determine, for a set of columns in a given row, which field is not NaN, and to carry that forward to a new column. Is there a way to do that?
Case in point. Let's say I have a data frame that looks like this:
Index A B
1 15 NaN
2 NaN 11
3 NaN 99
4 NaN NaN
5 12 14
Let's say my desired output from this is to create a new column C such that my data frame will look like the following:
Index A B C
1 15 NaN 15
2 NaN 11 11
3 NaN 99 99
4 NaN NaN NaN
5 12 14 12 (so giving priority to A over B)
How can I accomplish this?
For a dataframe with an arbitrary number of columns, you can back fill the rows (.bfill(axis=1)) and take the first column (.iloc[:, 0]):
import pandas as pd

df = pd.DataFrame({
    'A': [15, None, None, None, 12],
    'B': [None, 11, 99, None, 14],
    'C': [10, None, 10, 10, 10]})
df['D'] = df.bfill(axis=1).iloc[:, 0]
>>> df
A B C D
0 15 NaN 10 15
1 NaN 11 NaN 11
2 NaN 99 10 99
3 NaN NaN 10 10
4 12 14 10 12
If you just have 2 columns, the cleanest way would be to use where: df.A.where(condition, other) keeps the values of df.A where the condition is true and substitutes other where it is false (for some reason it took me a while to wrap my head around this).
In [2]: df.A.where(df.A.notnull(),df.B)
Out[2]:
0 15.0
1 11.0
2 99.0
3 NaN
4 12.0
Name: A, dtype: float64
If you have more than two columns, it might be simpler to use max or min; these ignore the null values, however you'll lose the "column precedence" you want:
In [3]: df.max(axis=1)
Out[3]:
0 15.0
1 11.0
2 99.0
3 NaN
4 14.0
dtype: float64
pandas.DataFrame.update:
df['updated'] = np.nan
for col in df.columns:
    # each column's non-NaN values overwrite 'updated' in turn,
    # so later columns take precedence
    df['updated'].update(df[col])
Try this (this method allows the flexibility of giving preference to particular columns without relying on column order).
Using @Alexander's setup:
df["D"] = df["B"]
df["D"] = df['D'].fillna(df['A'].fillna(df['B'].fillna(df['C'])))
A B C D
0 15.0 NaN 10.0 15.0
1 NaN 11.0 NaN 11.0
2 NaN 99.0 10.0 99.0
3 NaN NaN 10.0 10.0
4 12.0 14.0 10.0 14.0
Or you could use 'df.apply' to give priority to column A.
def func1(row):
    A = row['A']
    B = row['B']
    # NaN never compares equal to anything (even itself),
    # so test with pd.isna rather than ==
    if pd.isna(A):
        if pd.isna(B):
            y = float('nan')
        else:
            y = B
    else:
        y = A
    return y

df['C'] = df.apply(func1, axis=1)
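For the two-column case with priority given to A, pandas also has a built-in that expresses this directly; a minimal sketch:
# take A where it is not NaN, otherwise fall back to B
df['C'] = df['A'].combine_first(df['B'])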

Multi-indexed dataframe: Setting values

I already asked a related question earlier, but I didn't want to start a comment-and-edit discussion. So here is, boiled down, what the answer to my earlier question led me to ask. Consider
import pandas as pd
from numpy import arange
from scipy import random
index = pd.MultiIndex.from_product([arange(0,3), arange(10,15)], names=['A', 'B'])
df = pd.DataFrame(columns=['test'], index=index)
someValues = random.randint(0, 10, size=5)
df.loc[0, 'test'], df.loc[0,:] and df.ix[0] all create a representation of a part of the data frame, the first one being a Series and the other two being df slices. However:
df.ix[0] = someValues sets the values in the df
df.loc[0,'test'] = someValues gives an error: ValueError: total size of new array must be unchanged
df.loc[0,:] = someValues is silently ignored: no error, but the df does not contain the numpy array.
I skimmed the docs, but there was no clear, logical, systematic explanation of what is going on with MultiIndexes in general. So far, my guess is that "if the view is a Series, you can set values" and "otherwise, god knows what happens".
Could someone shed some light on the logic? Moreover, is there some deep meaning behind this or are these just constraints due to how it is set up?
These are all with 0.13.1
These are not all 'slice' representations at all.
This is a Series.
In [50]: df.loc[0,'test']
Out[50]:
B
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
Name: test, dtype: object
These are DataFrames (and the same)
In [51]: df.loc[0,:]
Out[51]:
test
B
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
[5 rows x 1 columns]
In [52]: df.ix[0]
Out[52]:
test
B
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
[5 rows x 1 columns]
This is trying to assign the wrong shape (it looks like it should work, but if you have multiple columns it won't; that is why this is not allowed):
In [54]: df.ix[0] = someValues
ValueError: could not broadcast input array from shape (5) into shape (5,1)
This works because it knows how to broadcast
In [56]: df.loc[0,:] = someValues
In [57]: df
Out[57]:
test
A B
0 10 4
11 3
12 4
13 2
14 8
1 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
2 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
[15 rows x 1 columns]
This works fine
In [63]: df.loc[0,'test'] = someValues+1
In [64]: df
Out[64]:
test
A B
0 10 5
11 4
12 5
13 3
14 9
1 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
2 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
[15 rows x 1 columns]
As does this
In [66]: df.loc[0,:] = someValues+1
In [67]: df
Out[67]:
test
A B
0 10 5
11 4
12 5
13 3
14 9
1 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
2 10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
[15 rows x 1 columns]
Not clear where you generated the cases in your question. I think the logic is pretty straightforward and consistent (there were several inconsistencies in prior versions, however).
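A versioning note, since this answer predates pandas 1.0: .ix has since been removed, and label-based access on the outer level of a MultiIndex now goes through .loc (a sketch of the equivalents, assuming current behavior still matches):
df.loc[0]                        # replaces df.ix[0]: the slice with outer label A == 0
df.loc[0, 'test'] = someValues   # sets the five values for A == 0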

Iterrows a rolling sum

I have a pandas dataframe
from pandas import DataFrame, Series
where each row corresponds to one case and each column corresponds to one month. I want to perform a rolling sum over each 12-month period. Seems simple enough, but I'm getting stuck with
result = [x for x.rolling_sum(12) in df.iterrows()]
result = [x for x.rolling_sum(12) in df.T.iteritems()]
SyntaxError: can't assign to function call
a = []
for x in df.iterrows():
    s = x.rolling_sum(12)
    a.append(s)
AttributeError: 'tuple' object has no attribute 'rolling_sum'
I think perhaps what you are looking for is
pd.rolling_sum(df, 12, axis=1)
In which case, no list comprehension is necessary. The axis=1 parameter causes Pandas to compute the rolling sum across the columns of each row of df.
For example,
import numpy as np
import pandas as pd
ncols, nrows = 13, 2
df = pd.DataFrame(np.arange(ncols*nrows).reshape(nrows, ncols))
print(df)
# 0 1 2 3 4 5 6 7 8 9 10 11 12
# 0 0 1 2 3 4 5 6 7 8 9 10 11 12
# 1 13 14 15 16 17 18 19 20 21 22 23 24 25
print(pd.rolling_sum(df, 12, axis=1))
prints
0 1 2 3 4 5 6 7 8 9 10 11 12
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 66 78
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 222 234
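A versioning note: pd.rolling_sum was removed in later pandas releases; a hedged modern equivalent of the same computation transposes instead, since the axis argument to .rolling was itself deprecated in pandas 2.x:
# rolling window across columns: transpose, roll over rows, transpose back
print(df.T.rolling(12).sum().T)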
Regarding your list comprehension:
You've got the parts of the list comprehension in the wrong order. Try:
result = [expression for x in df.iterrows()]
See the docs for more about list comprehensions.
The basic form of a list comprehension is
[expression for variable in sequence]
And the resultant list is equivalent to result after Python executes:
result = []
for variable in sequence:
    result.append(expression)
See this link for full syntax for list comprehensions.
