Pandaic way to check whether a dataframe has any rows [duplicate] - python

This question already has answers here:
How to check whether a pandas DataFrame is empty?
(5 answers)
Closed 5 years ago.
Given a dataframe df, I'd apply some condition df[condition] and retrieve a subset. I just want to check if there are any rows in the subset - this would tell me the condition is a valid one.
In [551]: df
Out[551]:
   Col1
0     1
1     2
2     3
3     4
4     5
5     3
6     1
7     2
8     3
What I want to check is something like this:
if df[condition] has rows:
    do something
What is the best way to check whether a filtered dataframe has rows? Here are some methods that don't work:
if df[df.Col1 == 1]: Gives ValueError: The truth value of a DataFrame is ambiguous.
if df[df.Col1 == 1].any(): Also gives ValueError
I suppose I can test the len. Are there other ways?

You could use df.empty:
df_conditional = df.loc[df['column_name'] == some_value]
if not df_conditional.empty:
    ...  # process dataframe results
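For comparison, a minimal sketch of the common emptiness checks applied to the question's own filter (any of these works; not df.empty is the most explicit):
subset = df[df.Col1 == 1]  # filtered subset from the question
if not subset.empty:       # True when the subset has at least one row
    print('condition matched')
# equivalent checks: len(subset) > 0, or subset.shape[0] > 0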

Related

Pandas Drop Duplicates And Store Duplicates [duplicate]

This question already has answers here:
How do I get a list of all the duplicate items using pandas in python?
(13 answers)
Closed 2 months ago.
I use pandas.DataFrame.drop_duplicates to find duplicates in a dataframe. It removes the duplicates from the dataframe, and that works great. However, I would like to know which data has been removed.
Is there a way to save that data in a new list before removing it?
Unfortunately, I have found no information on this in the pandas documentation.
Thanks in advance.
Use the duplicated function to identify the duplicated rows. By default the first occurrence is marked False and every subsequent occurrence is marked True; filtering the original data with this mask shows you which rows are kept and which are dropped.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
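For instance, a minimal sketch of this approach (the 'group' column and sample data are hypothetical):
import pandas as pd

df = pd.DataFrame({'group': list('AABBBCC'), 'value': [1, 2, 3, 4, 5, 6, 7]})
mask = df.duplicated('group')       # False for first occurrences, True for repeats
removed = df[mask]                  # the rows drop_duplicates would discard
kept = df.drop_duplicates('group')  # equivalent to df[~mask]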
You can use duplicated and boolean indexing with groupby.agg to keep the list of duplicates:
m = df.duplicated('group')
dropped = df[m].groupby(df['group'])['value'].agg(list)
print(dropped)
df = df[~m]
print(df)
Output:
# print(dropped)
group
A       [2]
B    [4, 5]
C       [7]
Name: value, dtype: object
# print(df)
  group  value
0     A      1
2     B      3
5     C      6
Used input:
  group  value
0     A      1
1     A      2
2     B      3
3     B      4
4     B      5
5     C      6
6     C      7

How to select specific row elements according to the length of each element in pandas [duplicate]

This question already has answers here:
filter dataframe rows based on length of column values
(4 answers)
Closed 3 years ago.
I'm trying to drop some rows from my dataframe according to a specific element: if the element's length is different from 7, then the element should be discarded.
for i in range(len(ups_df['Purchase_Order'])):
    if len(ups_df['Purchase_Order'][i]) != 7:
        del ups_df['Purchase_Order'][i]
print(len(ups_df['Purchase_Order']))
The output I get is KeyError: 4.
I would solve this using apply with a lambda that filters rows on the condition, keeping only the rows whose value has a length of 7. You can obviously change or adapt this to fit your needs:
filtered_df = ups_df[ups_df['Purchase_order'].apply(lambda x: len(str(x)) == 7)]
This is an example:
import pandas as pd

data = {'A': [1, 2, 3, 4, 5, 6],
        'Purchase_order': ['aaaa111', 'bbbb222', 'cc34', 'f41', 'dddd444', 'ce30431404']}
ups_df = pd.DataFrame(data)
filtered_df = ups_df[ups_df['Purchase_order'].apply(lambda x: len(str(x)) == 7)]
Original dataframe:
   A Purchase_order
0  1        aaaa111
1  2        bbbb222
2  3           cc34
3  4            f41
4  5        dddd444
5  6     ce30431404
After filtering (dropping the rows that have a length different than 7):
   A Purchase_order
0  1        aaaa111
1  2        bbbb222
4  5        dddd444
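As an aside, a vectorized sketch of the same filter (assuming the values are strings, or after casting them) avoids the per-row lambda:
filtered_df = ups_df[ups_df['Purchase_order'].astype(str).str.len() == 7]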

Python: replace non-NA values [duplicate]

This question already has answers here:
How to replace NaN values by Zeroes in a column of a Pandas Dataframe?
(17 answers)
Closed 3 years ago.
I want to create a new column, replacing NA with 0 and non-missing values with 1.
#df
   col1
0   1.0
1   3.0
2   NaN
3   5.0
4   NaN
5   6.0
what I want:
#df
   col1  NewCol
0   1.0       1
1   3.0       1
2   NaN       0
3   5.0       1
4   NaN       0
5   6.0       1
This is what I tried:
df['NewCol']=df['col1'].fillna(0)
df['NewCol']=df['col1'].replace(df['col1'].notnull(), 1)
It seems that the second line is incorrect.
Any suggestion?
You can try:
df['NewCol'] = [*map(int, pd.notnull(df.col1))]
Hope this helps.
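A more direct variant of the same idea (a sketch; notna returns a boolean mask that casts cleanly to integers):
df['NewCol'] = df['col1'].notna().astype(int)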
First you will need to convert all 'na's into '0's. How you do this will vary by scope.
For a single column you can use:
df['DataFrame Column'] = df['DataFrame Column'].fillna(0)
For the whole dataframe you can use (note that fillna is not in-place by default, so assign the result back):
df = df.fillna(0)
After this, you need to replace all nonzeros with '1's. You could do this like so:
for index, entry in enumerate(df['col']):
    if entry != 0:
        df.loc[index, 'col'] = 1  # use .loc to avoid chained-assignment issues
Note that this method counts 0 as an empty entry, which may or may not be the desired functionality.
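Alternatively, a one-line sketch with numpy.where sidesteps that caveat by testing for missing values directly:
import numpy as np

df['NewCol'] = np.where(df['col1'].isna(), 0, 1)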

Combining pandas rows based on condition [duplicate]

This question already has answers here:
Pandas groupby with delimiter join
(2 answers)
Concatenate strings from several rows using Pandas groupby
(8 answers)
Closed 3 years ago.
Given a Pandas Dataframe df, with column names 'Session', and 'List':
Can I group together the 'List' values for the same values of 'Session'?
My Approach
I've tried solving the problem by creating a new dataframe and iterating through the rows of the initial dataframe while maintaining a session counter that I increment when I see that the session has changed.
If it hasn't changed, I append the List value that corresponds to that row's value, followed by a comma.
Whenever the session changes, I use strip to get rid of the extra trailing comma.
Initial DataFrame
   Session List
0        1    a
1        1    b
2        1    c
3        2    d
4        2    e
5        3    f
Required DataFrame
   Session   List
0        1  a,b,c
1        2    d,e
2        3      f
Can someone suggest something more efficient or simple?
Thank you in advance.
Use groupby with agg and reset_index:
>>> df.groupby('Session')['List'].agg(','.join).reset_index()
   Session   List
0        1  a,b,c
1        2    d,e
2        3      f
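For reference, a self-contained version of the above using the question's data:
import pandas as pd

df = pd.DataFrame({'Session': [1, 1, 1, 2, 2, 3],
                   'List': list('abcdef')})
result = df.groupby('Session')['List'].agg(','.join).reset_index()
print(result)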

How can I drop rows in a dataframe efficiently if a specific column contains a substring [duplicate]

This question already has answers here:
Pandas filtering for multiple substrings in series
(3 answers)
Closed 4 years ago.
I tried
df = df[~df['event.properties.comment'].isin(['Extra'])]
Problem is it would just drop the row if the column contains exactly 'Extra' and I need to drop the ones that contain it even as a substring.
Any help?
You can use str.contains to match substrings instead of exact values, and you can combine several conditions with the or operator (|) if you need to check more than one substring. For your requirement, drop any row whose text contains "Extra" while keeping the rest (including values such as "~").
Considered df
   vals         ids
0     1           ~
1     2       bball
2     3         NaN
3     4  Extra text
df[~df.ids.fillna('').str.contains('Extra')]
Out:
   vals    ids
0     1      ~
1     2  bball
2     3    NaN
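If you need to drop rows containing any of several substrings (as in the linked duplicate), one sketch joins them into a single regex pattern; the substring list here is hypothetical:
import re

substrings = ['Extra', 'Discount']  # hypothetical substrings to drop on
pattern = '|'.join(map(re.escape, substrings))  # e.g. 'Extra|Discount'
df = df[~df['event.properties.comment'].fillna('').str.contains(pattern)]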
