This question already has answers here:
How to replace NaN values by Zeroes in a column of a Pandas Dataframe?
(17 answers)
Closed 3 years ago.
I want to create a new column that is 0 where col1 is NA and 1 where it is not missing.
#df
col1
1
3
NaN
5
NaN
6
what I want:
#df
col1  NewCol
1     1
3     1
NaN   0
5     1
NaN   0
6     1
This is what I tried:
df['NewCol']=df['col1'].fillna(0)
df['NewCol']=df['col1'].replace(df['col1'].notnull(), 1)
It seems that the second line is incorrect.
Any suggestion?
You can try:
df['NewCol'] = [*map(int, pd.notnull(df.col1))]
Hope this helps.
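For reference, here is a minimal end-to-end sketch of that idea (the sample data is reconstructed from the question, with the blank rows as NaN):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 3, np.nan, 5, np.nan, 6]})

# 1 where col1 is present, 0 where it is NaN
df['NewCol'] = [*map(int, pd.notnull(df.col1))]
# an equivalent vectorized form would be df['col1'].notna().astype(int)
print(df)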
First you will need to convert all NaNs into 0s. How you do this depends on the scope.
For a single column you can use:
df['DataFrame Column'] = df['DataFrame Column'].fillna(0)
For the whole dataframe you can use:
df = df.fillna(0)
After this, you need to replace all nonzeros with '1's. You could do this like so:
for index, entry in enumerate(df['col']):
    if entry != 0:
        df.loc[index, 'col'] = 1  # .loc avoids chained assignment; assumes a default integer index
Note that this method counts 0 as an empty entry, which may or may not be the desired functionality.
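If treating real zeros as missing is a concern, a vectorized sketch that builds the flag from the NaN mask itself (before any fillna) might look like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 3, np.nan, 5, np.nan, 6]})

# 0 where col1 is NaN, 1 everywhere else; genuine zeros stay 1
df['NewCol'] = np.where(df['col1'].isna(), 0, 1)
print(df)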
This question already has answers here:
How do I get a list of all the duplicate items using pandas in python?
(13 answers)
Closed 2 months ago.
I use pandas.DataFrame.drop_duplicates to find duplicates in a dataframe and remove them, and that works great. However, I would like to know which data has been removed.
Is there a way to save that data in a new list before removing it?
Unfortunately, I found no information on this in the pandas documentation.
Thanks for the answer.
You can use the duplicated function to find which rows are duplicates. By default the first occurrence is marked False and every later occurrence is marked True; filtering the original data with this mask shows you which rows are kept and which are dropped.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
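A minimal sketch of that idea (the 'group'/'value' columns are borrowed from the example in the other answer):
import pandas as pd

df = pd.DataFrame({'group': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
                   'value': [1, 2, 3, 4, 5, 6, 7]})

mask = df.duplicated(subset='group')   # True for every occurrence after the first
removed = df[mask]                     # the rows drop_duplicates would discard
kept = df[~mask]                       # same result as df.drop_duplicates('group')
print(removed)
print(kept)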
You can use duplicated and boolean indexing with groupby.agg to keep the list of duplicates:
m = df.duplicated('group')
dropped = df[m].groupby(df['group'])['value'].agg(list)
print(dropped)
df = df[~m]
print(df)
Output:
# print(dropped)
group
A       [2]
B    [4, 5]
C       [7]
Name: value, dtype: object
# print(df)
  group  value
0     A      1
2     B      3
5     C      6
Used input:
  group  value
0     A      1
1     A      2
2     B      3
3     B      4
4     B      5
5     C      6
6     C      7
This question already has answers here:
Check if Pandas column contains value from another column
(3 answers)
Check if value from one dataframe exists in another dataframe
(4 answers)
Closed 11 months ago.
I have two dataframes and I want to check which values of col1 in df1 also occur in col1 of df2. If a value occurs, col2_new should be 1, otherwise 0. Is it best to do this using a list, i.e. convert df1's column into a list and then loop over the other dataframe's column, or is there a more elegant way?
df1 (before):
index  col1
1      a
2      b
3      c
df2:
index  col1
1      a
2      e
3      b
df1 (after):
index  col1  col2_new
1      a     1
2      b     1
3      c     0
Use Series.isin and convert the resulting boolean mask to integers:
df1['col2_new'] = df1['col1'].isin(df2['col1']).astype(int)
Or, with numpy:
df1['col2_new'] = np.where(df1['col1'].isin(df2['col1']), 1, 0)
This question already has answers here:
Pandas groupby with delimiter join
(2 answers)
Concatenate strings from several rows using Pandas groupby
(8 answers)
Closed 3 years ago.
Given a Pandas Dataframe df, with column names 'Session', and 'List':
Can I group together the 'List' values for the same values of 'Session'?
My Approach
I've tried solving the problem by creating a new dataframe and iterating through the rows of the initial dataframe while maintaining a session counter that I increment whenever I see that the session has changed.
If it hasn't changed, I append that row's List value to the running string, followed by a comma.
Whenever the session changes, I use strip to get rid of the extra trailing comma. A rough sketch of that approach is below.
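A hypothetical reconstruction of that row-by-row approach (the dataframe is the one shown below under "Initial DataFrame"):
import pandas as pd

df = pd.DataFrame({'Session': [1, 1, 1, 2, 2, 3],
                   'List': ['a', 'b', 'c', 'd', 'e', 'f']})

rows = []
current_session, joined = None, ''
for _, row in df.iterrows():
    if row['Session'] != current_session:        # session changed
        if current_session is not None:
            rows.append({'Session': current_session, 'List': joined.strip(',')})
        current_session, joined = row['Session'], ''
    joined += str(row['List']) + ','
if current_session is not None:                   # flush the last session
    rows.append({'Session': current_session, 'List': joined.strip(',')})
result = pd.DataFrame(rows)
print(result)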
Initial DataFrame
Session List
0 1 a
1 1 b
2 1 c
3 2 d
4 2 e
5 3 f
Required DataFrame
Session List
0 1 a,b,c
1 2 d,e
2 3 f
Can someone suggest something more efficient or simple?
Thank you in advance.
Use groupby and agg with reset_index:
>>> df.groupby('Session')['List'].agg(','.join).reset_index()
Session List
0 1 a,b,c
1 2 d,e
2 3 f
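Note that ','.join only works on strings; if List could hold non-string values, a hedged variant would be to cast first:
df.groupby('Session')['List'].agg(lambda s: ','.join(map(str, s))).reset_index()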
This question already has answers here:
Pandas filtering for multiple substrings in series
(3 answers)
Closed 4 years ago.
I tried
df = df[~df['event.properties.comment'].isin(['Extra'])]
The problem is that this only drops a row if the column value is exactly 'Extra', and I need to drop the rows that contain it even as a substring.
Any help?
You can use str.contains to check for a substring, and combine several such checks with an or (|) condition if needed. For your requirement, keep only the rows whose text does not contain 'Extra'.
Considered df
   vals         ids
0     1           ~
1     2       bball
2     3         NaN
3     4  Extra text
df[~df.ids.fillna('').str.contains('Extra')]
Out:
   vals    ids
0     1      ~
1     2  bball
2     3    NaN
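If you need to drop rows that contain any of several substrings, one sketch (the second pattern here is just a placeholder) is to join the patterns into a regex alternation:
import pandas as pd

df = pd.DataFrame({'event.properties.comment': ['keep me', 'Extra text', None, 'Unwanted stuff']})

patterns = ['Extra', 'Unwanted']   # 'Unwanted' is a hypothetical second substring
mask = df['event.properties.comment'].fillna('').str.contains('|'.join(patterns))
df = df[~mask]
print(df)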
This question already has answers here:
How to check whether a pandas DataFrame is empty?
(5 answers)
Closed 5 years ago.
Given a dataframe df, I apply some condition df[condition] and retrieve a subset. I just want to check whether there are any rows in the subset; that would tell me the condition is a valid one.
In [551]: df
Out[551]:
   Col1
0     1
1     2
2     3
3     4
4     5
5     3
6     1
7     2
8     3
What I want to check is something like this:
if df[condition] has rows:
    do something
What is the best way to check whether a filtered dataframe has rows? Here are some methods that don't work:
if df[df.Col1 == 1]: Gives ValueError: The truth value of a DataFrame is ambiguous.
if df[df.Col1 == 1].any(): Also gives ValueError
I suppose I can test the len. Are there other ways?
You could use df.empty:
df_conditional = df.loc[df['column_name'] == some_value]

if not df_conditional.empty:
    ...  # process dataframe results