Drop NaN rows unless string value in separate column - Pandas - Python

I want to drop rows containing NaN values except if a separate column contains a specific string. Using the df below, I want to drop rows with NaN in Code2 or Code3 unless Code1 contains the string A.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Code1' : ['A','A','B','B','C','C'],
    'Code2' : ['B',np.nan,'A','B',np.nan,'B'],
    'Code3' : ['C',np.nan,'C','C',np.nan,'A'],
})
def dropna(df, col):
    if col == np.nan:  # note: comparing to np.nan never evaluates True; NaN != NaN
        df = df.dropna()
    return df

df = dropna(df, df['Code2'])
Intended Output:
Code1 Code2 Code3
0 A B C
1 A NaN NaN
2 B A C
3 B B C
5 C B A

Use DataFrame.notna + DataFrame.all to perform boolean indexing:
new_df = df[df.Code1.eq('A') | df.notna().all(axis=1)]
print(new_df)
Code1 Code2 Code3
0 A B C
1 A NaN NaN
2 B A C
3 B B C
5 C B A
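An equivalent way to express the same filter, as a sketch on the same data: split on the Code1 condition and apply dropna with subset only to the non-'A' rows, then restore the original row order:
# keep every 'A' row as-is; dropna only applies to the rest
mask = df['Code1'].eq('A')
new_df = pd.concat([df[mask],
                    df[~mask].dropna(subset=['Code2', 'Code3'])]).sort_index()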

How do I replace pandas rows with values of another dataframe for all instances of the value in the first df?

I have two dataframes:
df1=
A B C
a 1 3
b 2 3
c 2 2
a 1 4
df2=
A B C
a 1 3.5
Now I need to replace all occurrences of a in df1 (2 rows in this case) with the single a row from df2, leaving b and c unchanged. The final dataframe should be:
df_final=
A B C
b 2 3
c 2 2
a 1 3.5
Do you mean:
df_final = pd.concat((df1[df1['A'].ne('a')], df2))
Or if you have several values like a:
list_special = ['a']
df_final = pd.concat((df1[~df1['A'].isin(list_special)], df2))
If df2 just has the average of duplicated values, you can do df1.groupby(["A", "B"]).mean().reset_index()
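As a quick sanity check with the example frames above (the two a rows have C values 3 and 4, which average to df2's 3.5):
df1.groupby(["A", "B"]).mean().reset_index()
#    A  B    C
# 0  a  1  3.5
# 1  b  2  3.0
# 2  c  2  2.0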
Otherwise, you can do something like this:
In [27]: df = df1.groupby(["A", "B"]).first().merge(df2, how="left", on=["A", "B"])
    ...: df["C"] = df["C_y"].fillna(df["C_x"])
    ...: df = df[["A", "B", "C"]]
    ...: df
Out[27]:
A B C
0 a 1 3.5
1 b 2 3.0
2 c 2 2.0
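A sketch of an alternative that avoids the _x/_y suffix handling, assuming the (A, B) pairs in df2 identify exactly the rows whose C should be overwritten:
# collapse duplicates first, then let update() align on the (A, B) index
df_final = df1.drop_duplicates(subset=["A", "B"]).set_index(["A", "B"])
df_final.update(df2.set_index(["A", "B"]))  # overwrites C where df2 has a value
df_final = df_final.reset_index()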

Combining columns in a pandas dataframe

I have the following dataframe:
df = pd.DataFrame({
    'user_a':['A','B','C',np.nan],
    'user_b':['A','B',np.nan,'D']
})
I would like to create a new column called user that takes the first non-null value from user_a and user_b.
What's the best way to do this for many users?
Forward-fill missing values across the columns, then select the last column with iloc:
df = pd.DataFrame({
    'user_a':['A','B','C',np.nan,np.nan],
    'user_b':['A','B',np.nan,'D',np.nan]
})

df['user'] = df.ffill(axis=1).iloc[:, -1]
print(df)
user_a user_b user
0 A A A
1 B B B
2 C NaN C
3 NaN D D
4 NaN NaN NaN
Use the .apply method (note that this raises an IndexError if a row is entirely NaN):
In [24]: df = pd.DataFrame({'user_a':['A','B','C',np.nan],'user_b':['A','B',np.nan,'D']})
In [25]: df
Out[25]:
user_a user_b
0 A A
1 B B
2 C NaN
3 NaN D
In [26]: df['user'] = df.apply(lambda x: [i for i in x if not pd.isna(i)][0], axis=1)
In [27]: df
Out[27]:
user_a user_b user
0 A A A
1 B B B
2 C NaN C
3 NaN D D
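For exactly two columns there is also combine_first, and a row-wise back-fill generalizes to any number of user_* columns; both are sketches assuming you want the first non-null value scanning left to right:
# fill NaNs in user_a with the corresponding values from user_b
df['user'] = df['user_a'].combine_first(df['user_b'])
# or: back-fill across the row and take the first column
df['user'] = df[['user_a', 'user_b']].bfill(axis=1).iloc[:, 0]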

How to combine multiple rows of same category to one in pandas?

I'm trying to get from table 1 to table 2 from the image but I can't seem to get it right. I tried a pivot table to change columns A-D from rows to columns. Then I tried groupby, but instead of giving me one row it messes up my dataframe.
You can fill the null values with the value in the column and drop duplicates. With:
df = pd.DataFrame([["A", np.nan, np.nan, "Y", "Z"],
                   [np.nan, "B", np.nan, "Y", "Z"],
                   [np.nan, np.nan, "C", "Y", "Z"]], columns=list("ABCDE"))
df
A B C D E
0 A NaN NaN Y Z
1 NaN B NaN Y Z
2 NaN NaN C Y Z
df.ffill().bfill().drop_duplicates()
A B C D E
0 A B C Y Z
df.ffill().bfill() gives:
A B C D E
0 A B C Y Z
1 A B C Y Z
2 A B C Y Z
As per your comment, you could define a function that fills the missing values of the first row with the unique value that lies elsewhere in the same column.
def fillna_uniq(df, col):
    if isinstance(col, list):
        for c in col:
            df.loc[df.index[0], c] = df[c].dropna().iloc[0]
    else:
        df.loc[df.index[0], col] = df[col].dropna().iloc[0]
    return df.iloc[[0]]
You could then do:
fillna_uniq(df.copy(), ["B", "C", "D"])
A B C D E F
0 Hello I am lost Pandas Data
It is a bit faster, I think. You can modify your df in place by passing the dataframe directly rather than a copy.
HTH
One way you can do this is using apply and dropna:
Assuming those blanks in your table above are really nulls:
df = pd.DataFrame({'A':['Hello',np.nan,np.nan,np.nan],
                   'B':[np.nan,'I',np.nan,np.nan],
                   'C':[np.nan,np.nan,'am',np.nan],
                   'D':[np.nan,np.nan,np.nan,'lost'],
                   'E':['Pandas']*4,
                   'F':['Data']*4})
print(df)
A B C D E F
0 Hello NaN NaN NaN Pandas Data
1 NaN I NaN NaN Pandas Data
2 NaN NaN am NaN Pandas Data
3 NaN NaN NaN lost Pandas Data
Using apply, you can apply a lambda to each column of the dataframe, first dropping the null values and then taking the max:
df.apply(lambda x: x.dropna().max()).to_frame().T
A B C D E F
0 Hello I am lost Pandas Data
Or if your blanks are really empty strings, then you can do this:
df1 = df.replace(np.nan,'')
df1
A B C D E F
0 Hello Pandas Data
1 I Pandas Data
2 am Pandas Data
3 lost Pandas Data
df1.apply(lambda x: x[x!=''].max()).to_frame().T
A B C D E F
0 Hello I am lost Pandas Data
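Another option, as a sketch assuming the constant columns (E and F here) identify the group: GroupBy.first() returns the first non-null value per column, which collapses the rows in one step.
df.groupby(['E', 'F'], as_index=False).first()
#         E     F      A  B   C     D
# 0  Pandas  Data  Hello  I  am  lost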

Transforming data frame into Series creates NA's

I've downloaded a dataset and tried to create a pd.Series from the resulting DataFrame:
data = pd.read_csv(filepath_or_buffer = "train.csv", index_col = 0)
data.columns
Index([u'qid1',u'qid2',u'question1',u'question2'], dtype = 'object')
These are the columns in the DataFrame: qid1 is the ID of question1 and qid2 is the ID of question2.
Also, there are no NaN values in my DataFrame:
data.question1.isnull().sum()
0
I want to create a pandas.Series from the first questions, with qid1 as the index:
question1 = pd.Series(data.question1, index = data.qid1)
question1.isnull().sum()
68416
And now, there are 68416 Null values in my Series. Where is my mistake?
Pass just the values (a plain array) so the Series constructor doesn't try to align:
question1 = pd.Series(data.question1.values, index = data.qid1)
The problem here is that the question1 column has its own index, so pandas tries to use it during construction.
Example:
In [12]:
df = pd.DataFrame({'a':np.arange(5), 'b':list('abcde')})
df
Out[12]:
a b
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e
In [13]:
s = pd.Series(df['a'], index = df['b'])
s
Out[13]:
b
a NaN
b NaN
c NaN
d NaN
e NaN
Name: a, dtype: float64
In [14]:
s = pd.Series(df['a'].values, index = df['b'])
s
Out[14]:
b
a 0
b 1
c 2
d 3
e 4
dtype: int32
Effectively what happens here is that you're reindexing your existing column with the passed-in index; because no index values match, you get NaN.
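A minimal equivalent sketch: setting the index on the frame before selecting the column sidesteps the alignment entirely, because the values are never reindexed.
question1 = data.set_index('qid1')['question1']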

Comparing two dataframes of different length row by row and adding columns for each row with equal value

I have two dataframes of different length in python pandas like this:
df1:
Column1 Column2 Column3
0 1 a r
1 2 b u
2 3 c k
3 4 d j
4 5 e f

df2:
ColumnA ColumnB
0 1 a
1 1 d
2 1 e
3 2 r
4 2 w
5 3 y
6 3 h
What I am trying to do now is compare Column1 of df1 with ColumnA of df2. For each "hit", where a row in ColumnA of df2 has the same value as a row in Column1 of df1, I want to append a column to df1 containing the value ColumnB of df2 has in the row where the hit was found, so that my result looks like this:
df1:
Column1 Column2 Column3 Column4 Column5 Column6
0 1 a r a d e
1 2 b u r w
2 3 c k y h
3 4 d j
4 5 e f
What I have tried so far was:
for row in df1, df2:
    if df1['Column1'] == df2['ColumnA']:
        print 'yey!'
which gave me an error saying I could not compare two dataframes of different length. So I tried:
for row in df1, df2:
    if df2[df2['ColumnA'].isin(df1['Column1'])]:
        print 'lalala'
    else:
        print 'Nope'
Which "works" in terms that I get an output, but I do not think it iterates over the rows and compares them, since it only prints 'lalala' two times. So I researched some more and found a way to iterate over each row of the dataframe, which is:
for index, row in df1.iterrows():
    print row['Column1']
But I do not know how to use this to compare the columns of the two dataframes and get the output I desire.
Any help on how to do this would be really appreciated.
I recommend using the DataFrame API, which lets you operate on DataFrames in terms of join, merge, groupby, etc. You can find my solution below:
import pandas as pd

df1 = pd.DataFrame({'Column1': [1,2,3,4,5],
                    'Column2': ['a','b','c','d','e'],
                    'Column3': ['r','u','k','j','f']})
df2 = pd.DataFrame({'Column1': [1,1,1,2,2,3,3],
                    'ColumnB': ['a','d','e','r','w','y','h']})

frames = []
for name, group in df2.groupby('Column1'):
    # one-row frame holding the group key (keeps the group's first index)
    buffer_df = pd.DataFrame({'Column1': group['Column1'][:1]})
    i = 0
    for index, value in group['ColumnB'].items():  # iteritems() was removed in pandas 2.0
        buffer_df['Column_' + str(i)] = value
        i += 1
    frames.append(buffer_df)
dfs = pd.concat(frames)  # DataFrame.append was removed in pandas 2.0

result = pd.merge(df1, dfs, how='left', on='Column1')
print(result)
The result is:
Column1 Column2 Column3 Column_0 Column_1 Column_2
0 1 a r a d e
1 2 b u r w NaN
2 3 c k y h NaN
3 4 d j NaN NaN NaN
4 5 e f NaN NaN NaN
P.S. More details:
1) For df2 I produce groups by 'Column1'. Each group is itself a DataFrame. Example below:
Column1 ColumnB
0 1 a
1 1 d
2 1 e
2) For each group I produce a one-row DataFrame buffer_df:
Column1 Column_0 Column_1 Column_2
0 1 a d e
3) After that I concatenate them into the DataFrame dfs:
Column1 Column_0 Column_1 Column_2
0 1 a d e
3 2 r w NaN
5 3 y h NaN
4) In the end I execute a left join of df1 and dfs, obtaining the needed result.
2)* buffer_df is produced iteratively:
step0 (buffer_df = pd.DataFrame({'Column1': group['Column1'][:1]})):
Column1
5 3
step1 (buffer_df['Column_0'] = group['ColumnB'][5]):
Column1 Column_0
5 3 y
step2 (buffer_df['Column_1'] = group['ColumnB'][6]):
Column1 Column_0 Column_1
5 3 y h
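As a sketch of a loop-free alternative, assuming duplicate keys should spread into numbered columns: number the rows within each key with cumcount, pivot them into columns, then left-merge onto df1.
# number duplicates per key: 0, 1, 2, ... within each Column1 value
wide = (df2.assign(n='Column_' + df2.groupby('Column1').cumcount().astype(str))
           .pivot(index='Column1', columns='n', values='ColumnB')
           .reset_index())
result = df1.merge(wide, how='left', on='Column1')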
