Transforming data frame into Series creates NA's - python

I've loaded a DataFrame and tried to create a pd.Series from it:
data = pd.read_csv(filepath_or_buffer = "train.csv", index_col = 0)
data.columns
Index([u'qid1', u'qid2', u'question1', u'question2'], dtype='object')
These are the columns in the DataFrame: qid1 is the ID of question1 and qid2 is the ID of question2.
Also, there are no NaN values in my DataFrame:
data.question1.isnull().sum()
0
I want to create a pandas.Series from the first questions, with qid1 as the index:
question1 = pd.Series(data.question1, index = data.qid1)
question1.isnull().sum()
68416
And now there are 68416 null values in my Series. Where is my mistake?

Pass the raw values so the Series constructor doesn't try to align:
question1 = pd.Series(data.question1.values, index = data.qid1)
The problem here is that the question1 column has its own index, so the constructor tries to align on it during construction.
Example:
In [12]:
df = pd.DataFrame({'a':np.arange(5), 'b':list('abcde')})
df
Out[12]:
   a  b
0  0  a
1  1  b
2  2  c
3  3  d
4  4  e
In [13]:
s = pd.Series(df['a'], index = df['b'])
s
Out[13]:
b
a   NaN
b   NaN
c   NaN
d   NaN
e   NaN
Name: a, dtype: float64
In [14]:
s = pd.Series(df['a'].values, index = df['b'])
s
Out[14]:
b
a    0
b    1
c    2
d    3
e    4
dtype: int32
Effectively, you're reindexing your existing column with the passed-in index; because no index values match, you get NaN.
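Put differently, pd.Series(existing_series, index=new_index) behaves like existing_series.reindex(new_index). A minimal sketch demonstrating this on the example frame above (and, commented out, an alternative for the original data that sidesteps alignment with set_index):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(5), 'b': list('abcde')})

# Passing an existing Series plus a new index reindexes rather than relabels,
# so this gives the same all-NaN result as pd.Series(df['a'], index=df['b']):
s = df['a'].reindex(df['b'])
print(s.isnull().sum())  # 5

# For the original question, setting the index before selecting the column
# avoids the problem without reaching for .values:
# question1 = data.set_index('qid1')['question1']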

Related

Pandas DataFrame efficiently split one column into multiple

I have a dataframe similar to this:
data = {"col_1": [0, 1, 2],
"col_2": ["abc", "defg", "hi"]}
df = pd.DataFrame(data)
Visually:
   col_1 col_2
0      0   abc
1      1  defg
2      2    hi
What I'd like to do is split up each character in col_2 and append each character as a new column to the dataframe.
Example iterative method:
def get_chars(string):
    chars = []
    for char in string:
        chars.append(char)
    return chars

char_df = pd.DataFrame()
for i in range(len(df)):
    char_arr = get_chars(df.loc[i, "col_2"])
    temp_df = pd.DataFrame(char_arr).T
    char_df = pd.concat([char_df, temp_df], ignore_index=True, axis=0)

df = pd.concat([df, char_df], ignore_index=True, axis=1)
Which results in the correct form:
   0     1  2  3    4    5
0  0   abc  a  b    c  NaN
1  1  defg  d  e    f    g
2  2    hi  h  i  NaN  NaN
But I believe iterating through the dataframe like this is very inefficient, so I want to find a faster (ideally vectorised) solution.
In reality, I'm not really splitting up strings, but the point of this question is to find a way to efficiently process one column, and return many.
If you need performance, use the DataFrame constructor and convert the values to lists:
df = df.join(pd.DataFrame([list(x) for x in df['col_2']], index=df.index))
Or:
df = df.join(pd.DataFrame(df['col_2'].apply(list).tolist(), index=df.index))
print (df)
   col_1 col_2  0  1     2     3
0      0   abc  a  b     c  None
1      1  defg  d  e     f     g
2      2    hi  h  i  None  None
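If you want descriptive column names instead of the default 0..3, DataFrame.add_prefix can be chained on before the join. A small sketch (the char_ prefix is an arbitrary choice):

import pandas as pd

df = pd.DataFrame({"col_1": [0, 1, 2],
                   "col_2": ["abc", "defg", "hi"]})

# Same constructor-based split, but relabel the generated integer columns:
chars = pd.DataFrame([list(x) for x in df['col_2']], index=df.index).add_prefix('char_')
df = df.join(chars)
print(df.columns.tolist())
# ['col_1', 'col_2', 'char_0', 'char_1', 'char_2', 'char_3']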

Drop nan rows unless string value in separate column - Pandas

I want to drop rows containing NaN values except when a separate column contains a specific string. Using the df below, I want to drop rows with NaN in Code2 and Code3 unless the string A is in Code1.
df = pd.DataFrame({
    'Code1' : ['A','A','B','B','C','C'],
    'Code2' : ['B',np.nan,'A','B',np.nan,'B'],
    'Code3' : ['C',np.nan,'C','C',np.nan,'A'],
})

def dropna(df, col):
    if col == np.nan:
        df = df.dropna()
    return df

df = dropna(df, df['Code2'])
Intended Output:
  Code1 Code2 Code3
0     A     B     C
1     A   NaN   NaN
2     B     A     C
3     B     B     C
4     C     B     A
Use DataFrame.notna + DataFrame.all to perform boolean indexing:
new_df = df[df.Code1.eq('A') | df.notna().all(axis=1)]
print(new_df)
  Code1 Code2 Code3
0     A     B     C
1     A   NaN   NaN
2     B     A     C
3     B     B     C
5     C     B     A
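To see how the mask is built, the two conditions can be inspected separately; a sketch on the posted frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Code1' : ['A','A','B','B','C','C'],
    'Code2' : ['B',np.nan,'A','B',np.nan,'B'],
    'Code3' : ['C',np.nan,'C','C',np.nan,'A'],
})

keep_a = df.Code1.eq('A')          # True wherever Code1 is 'A'
complete = df.notna().all(axis=1)  # True for rows with no NaN at all
print((keep_a | complete).tolist())
# [True, True, True, True, False, True] -> only row 4 is dropped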

combining columns in pandas dataframe

I have the following dataframe:
df = pd.DataFrame({
    'user_a':['A','B','C',np.nan],
    'user_b':['A','B',np.nan,'D']
})
I would like to create a new column called user that holds, for each row, the non-null value from user_a/user_b.
What's the best way to do this for many user columns?
Forward fill missing values along the rows, then select the last column with iloc:
df = pd.DataFrame({
    'user_a':['A','B','C',np.nan,np.nan],
    'user_b':['A','B',np.nan,'D',np.nan]
})
df['user'] = df.ffill(axis=1).iloc[:, -1]
print (df)
  user_a user_b user
0      A      A    A
1      B      B    B
2      C    NaN    C
3    NaN      D    D
4    NaN    NaN  NaN
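Note that ffill(axis=1) keeps the last non-null value in each row. If you wanted the first instead, the mirror image works; a sketch (for this data the two coincide, they differ only when a row holds two different non-null values):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user_a':['A','B','C',np.nan,np.nan],
    'user_b':['A','B',np.nan,'D',np.nan]
})

# bfill pulls values backwards along each row, so column 0 ends up
# holding the first non-null value of the row:
df['user'] = df.bfill(axis=1).iloc[:, 0]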
Alternatively, use the .apply method:
In [24]: df = pd.DataFrame({'user_a':['A','B','C',np.nan],'user_b':['A','B',np.nan,'D']})
In [25]: df
Out[25]:
  user_a user_b
0      A      A
1      B      B
2      C    NaN
3    NaN      D
In [26]: df['user'] = df.apply(lambda x: [i for i in x if not pd.isna(i)][0], axis=1)
In [27]: df
Out[27]:
  user_a user_b user
0      A      A    A
1      B      B    B
2      C    NaN    C
3    NaN      D    D
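One caveat: the [0] lookup raises IndexError if a row is entirely NaN. A guarded variant using next with a default, as a sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'user_a':['A','B','C',np.nan,np.nan],
                   'user_b':['A','B',np.nan,'D',np.nan]})

# next() with a default yields np.nan instead of raising on all-NaN rows:
df['user'] = df.apply(lambda x: next((i for i in x if pd.notna(i)), np.nan), axis=1)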

How to pick out all non-NULL value from multiple columns in Python Dataframe

I have a DataFrame like the one below:
  column-a column-b column-c
0      Nan        A        B
1        A      Nan        C
2      Nan      Nan        C
3        A        B        C
I hope to create a new column-d to capture all non-null values from columns a to c:
  column-d
0      A,B
1      A,C
2        C
3    A,B,C
Thanks!
You need to change the 'Nan' strings to np.nan, then use stack with a groupby join:
df=df.replace('Nan',np.nan)
df.stack().groupby(level=0).agg(','.join)
Out[570]:
0      A,B
1      A,C
2        C
3    A,B,C
dtype: object
#df['column-d']= df.stack().groupby(level=0).agg(','.join)
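To see why this works: stack drops the NaNs and returns a Series with a MultiIndex whose first level is the original row label, so grouping on level=0 joins each row's surviving values. The intermediate result for the posted frame:

df.stack()
#0  column-b    A
#   column-c    B
#1  column-a    A
#   column-c    C
#2  column-c    C
#3  column-a    A
#   column-b    B
#   column-c    C
#dtype: object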
After fixing the NaNs:
df = df.replace('Nan', np.nan)
Collect all non-null values in each row and join them:
df['column-d'] = df.apply(lambda x: ','.join(x[x.notnull()]), axis=1)
#0 A,B
#1 A,C
#2 C
#3 A,B,C
Surprisingly, this solution is somewhat faster than the stack/groupby solution by Wen, at least for the posted dataset.

adding new column to pandas dataframe with values for particular items?

I have this pandas dataframe:
d=pandas.DataFrame([{"a": 1}, {"a": 3, "b": 2}])
and I'm trying to add a new column to it with non-null values only for certain rows, based on their numeric indices. For example, adding a new column "c" only to the first row in d:
# array of row indices
indx = np.array([0])
d.ix[indx]["c"] = "foo"
which should add "foo" as the column "c" value for the first row, and NaN for all other rows. But this doesn't seem to change the DataFrame:
d.ix[np.array([0])]["c"] = "foo"
In [18]: d
Out[18]:
   a    b
0  1  NaN
1  3    2
What am I doing wrong here? How can it be done? Thanks.
In [11]: df = pd.DataFrame([{"a": 1}, {"a": 3, "b": 2}])
In [12]: df['c'] = np.array(['foo',np.nan])
In [13]: df
Out[13]:
   a    b    c
0  1  NaN  foo
1  3    2  nan
If you were assigning a numeric value, the following would work
In [16]: df['c'] = np.nan
In [17]: df.ix[0,'c'] = 1
In [18]: df
Out[18]:
   a    b    c
0  1  NaN    1
1  3    2  NaN
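For context, the original attempt fails because d.ix[indx]["c"] = "foo" is chained indexing: the assignment lands on a temporary copy rather than on d. In current pandas, .ix has been removed entirely; one version-safe way to get the same result is to assign an index-aligned Series, which fills the unselected rows with NaN automatically. A sketch:

import numpy as np
import pandas as pd

d = pd.DataFrame([{"a": 1}, {"a": 3, "b": 2}])
indx = np.array([0])

# A Series aligned on d's index: rows not in indx become NaN on assignment.
d["c"] = pd.Series("foo", index=indx)
print(d)
#    a    b    c
# 0  1  NaN  foo
# 1  3  2.0  NaN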
