I've downloaded a dataset and tried to create a pd.Series from the resulting DataFrame:
data = pd.read_csv(filepath_or_buffer = "train.csv", index_col = 0)
data.columns
Index([u'qid1',u'qid2',u'question1',u'question2'], dtype = 'object')
These are the columns in the DataFrame: qid1 is the ID of question1 and qid2 is the ID of question2.
Also, there are no NaN values in my DataFrame:
data.question1.isnull().sum()
0
I want to create pandas.Series() from first questions with qid1 as index:
question1 = pd.Series(data.question1, index = data.qid1)
question1.isnull().sum()
68416
And now there are 68416 null values in my Series. Where is my mistake?
Pass just the values (a NumPy array, which has no index) so the Series constructor doesn't try to align:
question1 = pd.Series(data.question1.values, index = data.qid1)
The problem here is that the question1 column has its own index, so the constructor is going to try to align on it during construction.
Example:
In [12]:
df = pd.DataFrame({'a':np.arange(5), 'b':list('abcde')})
df
Out[12]:
a b
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e
In [13]:
s = pd.Series(df['a'], index = df['b'])
s
Out[13]:
b
a NaN
b NaN
c NaN
d NaN
e NaN
Name: a, dtype: float64
In [14]:
s = pd.Series(df['a'].values, index = df['b'])
s
Out[14]:
b
a 0
b 1
c 2
d 3
e 4
dtype: int32
Effectively what happens here is that you're reindexing your existing column with the passed-in new index; because there are no index values that match, you get NaN.
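An alternative that sidesteps the alignment issue entirely is to make the ID column the index with set_index and then select the question column; a minimal sketch, assuming the column names from the question:
question1 = data.set_index('qid1')['question1']
This reuses qid1 as the index directly, so no reindexing (and no NaNs) occurs.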
I have a dataframe similar to this:
data = {"col_1": [0, 1, 2],
"col_2": ["abc", "defg", "hi"]}
df = pd.DataFrame(data)
Visually:
col_1 col_2
0 0 abc
1 1 defg
2 2 hi
What I'd like to do is split col_2 into individual characters and append each as a new column to the dataframe.
example iterative method:
def get_chars(string):
    chars = []
    for char in string:
        chars.append(char)
    return chars

char_df = pd.DataFrame()
for i in range(len(df)):
    char_arr = get_chars(df.loc[i, "col_2"])
    temp_df = pd.DataFrame(char_arr).T
    char_df = pd.concat([char_df, temp_df], ignore_index=True, axis=0)

df = pd.concat([df, char_df], ignore_index=True, axis=1)
Which results in the correct form:
0 1 2 3 4 5
0 0 abc a b c NaN
1 1 defg d e f g
2 2 hi h i NaN NaN
But I believe iterating through the dataframe like this is very inefficient, so I want to find a faster (ideally vectorised) solution.
In reality, I'm not really splitting up strings, but the point of this question is to find a way to efficiently process one column, and return many.
If you need performance, use the DataFrame constructor, converting the values to lists:
df = df.join(pd.DataFrame([list(x) for x in df['col_2']], index=df.index))
Or:
df = df.join(pd.DataFrame(df['col_2'].apply(list).tolist(), index=df.index))
print (df)
col_1 col_2 0 1 2 3
0 0 abc a b c None
1 1 defg d e f g
2 2 hi h i None None
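More generally, since the point is to process one column and return many, the same constructor pattern can be wrapped in a small helper; a sketch where expand_col is a hypothetical name:
def expand_col(df, col, func):
    # func maps a single value to a list/tuple of output values;
    # building all rows first and constructing one DataFrame avoids
    # the per-row concat of the iterative version
    expanded = pd.DataFrame([func(x) for x in df[col]], index=df.index)
    return df.join(expanded)

out = expand_col(pd.DataFrame(data), 'col_2', list)  # same result as above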
I want to drop rows containing NaN values except when a separate column contains a specific string. Using the df below, I want to drop rows with NaN in Code2 or Code3 unless the string A is in Code1.
df = pd.DataFrame({
    'Code1': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Code2': ['B', np.nan, 'A', 'B', np.nan, 'B'],
    'Code3': ['C', np.nan, 'C', 'C', np.nan, 'A'],
})
def dropna(df, col):
    if col == np.nan:
        df = df.dropna()
    return df

df = dropna(df, df['Code2'])
Intended Output:
Code1 Code2 Code3
0 A B C
1 A NaN NaN
2 B A C
3 B B C
4 C B A
Use DataFrame.notna + DataFrame.all to perform boolean indexing:
new_df = df[df.Code1.eq('A') | df.notna().all(axis=1)]
print(new_df)
Code1 Code2 Code3
0 A B C
1 A NaN NaN
2 B A C
3 B B C
5 C B A
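If the NaN check should apply only to Code2 and Code3 rather than to every column, the same boolean-indexing pattern can be restricted to those columns; naming the conditions is a readable variant (a sketch):
exempt = df['Code1'].eq('A')                            # keep these rows regardless
complete = df[['Code2', 'Code3']].notna().all(axis=1)   # no NaN in the checked columns
new_df = df[exempt | complete]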
I have the following dataframe:
df = pd.DataFrame({
    'user_a': ['A', 'B', 'C', np.nan],
    'user_b': ['A', 'B', np.nan, 'D']
})
I would like to create a new column called user and have the resulting dataframe:
  user_a user_b user
0      A      A    A
1      B      B    B
2      C    NaN    C
3    NaN      D    D
What's the best way to do this for many users?
Forward-fill the missing values across each row and then select the last column with iloc (an extra all-NaN row is added here to show the behaviour):
df = pd.DataFrame({
    'user_a': ['A', 'B', 'C', np.nan, np.nan],
    'user_b': ['A', 'B', np.nan, 'D', np.nan]
})
df['user'] = df.ffill(axis=1).iloc[:, -1]
print (df)
user_a user_b user
0 A A A
1 B B B
2 C NaN C
3 NaN D D
4 NaN NaN NaN
Alternatively, use the .apply method to take the first non-null value in each row:
In [24]: df = pd.DataFrame({'user_a':['A','B','C',np.nan],'user_b':['A','B',np.nan,'D']})
In [25]: df
Out[25]:
user_a user_b
0 A A
1 B B
2 C NaN
3 NaN D
In [26]: df['user'] = df.apply(lambda x: [i for i in x if not pd.isna(i)][0], axis=1)
In [27]: df
Out[27]:
user_a user_b user
0 A A A
1 B B B
2 C NaN C
3 NaN D D
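A vectorized alternative to apply that also picks the first non-null value in each row is to back-fill across the columns and take the first one (a sketch, mirroring the ffill answer above):
df['user'] = df.bfill(axis=1).iloc[:, 0]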
I have a DataFrame like the one below (note the 'Nan' entries are strings, not real NaN values):
column-a column-b column-c
0 Nan A B
1 A Nan C
2 Nan Nan C
3 A B C
I hope to create a new column-d to capture all non-null values from column-a to column-c:
column-d
0 A,B
1 A,C
2 C
3 A,B,C
Thanks!
You need to change the 'Nan' strings to np.nan, then use stack with a groupby join: stack drops the NaNs and returns a Series indexed by (row label, column label), so grouping on level=0 joins the surviving values of each original row.
df=df.replace('Nan',np.nan)
df.stack().groupby(level=0).agg(','.join)
Out[570]:
0 A,B
1 A,C
2 C
3 A,B,C
dtype: object
To assign the result back to the DataFrame (the grouped result aligns on the original index):
df['column-d'] = df.stack().groupby(level=0).agg(','.join)
After fixing the NaNs:
df = df.replace('Nan', np.nan)
collect all non-null values in each row and join them:
df['column-d'] = df.apply(lambda x: ','.join(x[x.notnull()]), axis=1)
#0 A,B
#1 A,C
#2 C
#3 A,B,C
Surprisingly, this solution is somewhat faster than the stack/groupby solution by Wen, at least for the posted dataset.
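To check the timing claim on your own data, both approaches can be compared with timeit (a sketch; the repetition factor is arbitrary, and the 'Nan' strings are assumed to have been replaced with np.nan already):
import timeit

# build a larger frame from the cleaned columns (before column-d was added)
big = pd.concat([df[['column-a', 'column-b', 'column-c']]] * 10_000, ignore_index=True)
t_stack = timeit.timeit(lambda: big.stack().groupby(level=0).agg(','.join), number=10)
t_apply = timeit.timeit(lambda: big.apply(lambda x: ','.join(x[x.notnull()]), axis=1), number=10)
print(t_stack, t_apply)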
I have this pandas dataframe:
d = pandas.DataFrame([{"a": 1}, {"a": 3, "b": 2}])
and I'm trying to add a new column to it with non-null values only for certain rows, based on their positional indices in an array. For example, adding a new column "c" only to the first row in d:
# array of row indices
indx = np.array([0])
d.ix[indx]["c"] = "foo"
which should add "foo" as the column "c" value for the first row and NaN for all other rows. But this doesn't seem to change the dataframe:
d.ix[np.array([0])]["c"] = "foo"
In [18]: d
Out[18]:
a b
0 1 NaN
1 3 2
What am I doing wrong here? How can it be done? Thanks.
In [11]: df = pd.DataFrame([{"a": 1}, {"a": 3, "b": 2}])
In [12]: df['c'] = np.array(['foo',np.nan])
In [13]: df
Out[13]:
a b c
0 1 NaN foo
1 3 2 nan
If you were assigning a numeric value, the following would work
In [16]: df['c'] = np.nan
In [17]: df.loc[0,'c'] = 1
In [18]: df
Out[18]:
a b c
0 1 NaN 1
1 3 2 NaN
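Note that .ix is deprecated and has been removed in later pandas versions. A minimal modern sketch, assuming pandas >= 1.0, that relies on index alignment to put "foo" in the selected rows and NaN everywhere else:
import numpy as np
import pandas as pd

d = pd.DataFrame([{"a": 1}, {"a": 3, "b": 2}])
indx = np.array([0])
# a one-row Series aligns on d's index; unmatched rows become NaN
d["c"] = pd.Series("foo", index=indx)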