How to compress a pandas dataframe, based on their boolean column value?

How to compress a pandas dataframe, based on their boolean column value? - python

Given a pandas frame like this:
In:
pokemon yes no ignore
Vulpix True False False
Nidorino False True False
Growlithe False False True
Krokorok False True False
Darumaka False False True
Klefki False True False
Croagunk True False False
What is the correct way of getting as a row value the column associated to the pokemon column?:
Out:
pokemon Val
Vulpix yes
Nidorino no
Growlithe ignore
Krokorok no
Darumaka ignore
Klefki no
Croagunk yes
So far I tried with cross tab:
pd.crosstab(df, columns=['yes', 'no', 'ignore'])
However, I am getting a value error:
`ValueError: Shape of passed values`
What is the correct way of getting the previous output?

If those are one-hot encoded such that there is only ever a single 1/True on each row, set_index and dot with the columns.
df = df.set_index('pokemon')
df = df.dot(df.columns)
pokemon
Vulpix yes
Nidorino no
Growlithe ignore
Krokorok no
Darumaka ignore
Klefki no
Croagunk yes
dtype: object
The above is a Series, to get the DataFrame to match your output:
df = df.dot(df.columns).to_frame('Val').reset_index()

Related

How to find error value on pandas dataframe?

While using csv files from excel and read it with pandas data frame, got 1 value that's has symbol such as 2$3.74836730957 while it has to be 243.74836730957 (it seems mistook 4 with $). is there anyways that I could find such as values that I mention before and change it into NaN values on Data Frame?
CSV file:

You can use pd.to_numeric in order to report boolean values that denote whether a particular column has only numerical values. In order to check all columns you can do
df.apply(lambda s: pd.to_numeric(s, errors='coerce').notnull().all())
And the output would look like:
A True
B False
C True
D False
...
dtype: bool
Now if you want to know which specific row(s) are not numerical you can use np.isreal:
df.applymap(np.isreal)
A B C D
item
r1 True True True True
r2 True True True True
r3 True True True False
...

Python Pandas: filtering dataframe based on mask list throwing error

I have a list that contains True/False values. Index of the list is the id. List looks like as follow:
mask:
0000003555CE050 True
0000484541CE90D True
0000486629CE4AA True
0000012748DE8FB True
0000047030DE4E4 True
0000510518CEA7C False
0000510762DE9AC False
0000510486CEF04 False
0000202438DEE9C False
0000510534CE437 False
I have a dataframe which also contains the id column. I want to keep values of those id that has True from the list. I am trying to filter the data frame using following command:
df.loc[:,mask]
However, I got TypeError which is as follows:
TypeError: '(slice(None, None, None), 0000003555CE050 True
0000484541CE90D True
0000486629CE4AA True
0000012748DE8FB True
0000047030DE4E4 True
...
0000510518CEA7C False
0000510762DE9AC False
0000510486CEF04 False
0000202438DEE9C False
0000510534CE437 False
Name: ICP, Length: 92245, dtype: bool)' is an invalid key
I am not sure where am I making the mistake. Could anyone point out where am I making the mistake?

Let us try get_level_values, you need know which level it is , for id column
df = df[df.columns.get_level_values(level=0).isin(mask.index[mask])]

Find rows in a DataFrame that match

Suppose make a data frame with pandas
df = pandas.DataFrame({"a":[1,2,3],"b":[3,4,5]})
Now I have a series
s = pandas.Series([1,3], index=['a','b'])
OK, finally, my question is how can I know s is a item in df especially I need to consider performance?
ps: It's better to return True or False when test s in or not in df.

You want
df.eq(s).all(axis=1).any()
# True
We start with a row-wise comparison,
df.eq(s)
a b
0 True True
1 False False
2 False False
Find all rows which match fully,
df.eq(s).all(axis=1)
0 True
1 False
2 False
dtype: bool
Return True if any matches were found,
df.eq(s).all(axis=1).any()
# True

How to filter by boolean?

I have a series of booleans extracted and I would like to filter a dataframe from this in Pandas but it is returning no results.
Dataframe
Account mphone rphone BPHONE
0 14999201 3931812 8014059 9992222
1 12980801 4444444 3932929 4279999
2 9999999 3279999 4419999 3938888
Here are the series:
df['BPHONE'].str.startswith(tuple(combined_list))
0 False
1 True
2 False
Name: BPHONE, dtype: bool
df['rphone'].str.startswith(tuple(combined_list))
0 True
1 False
2 True
Name: rphone, dtype: bool
Now below, when I try to filter this, I am getting empty results:
df[(df['BPHONE'].str.startswith(tuple(combined_list))) & (df['rphone'].str.startswith(tuple(combined_list)))]
Account mphone rphone BPHONE
Lastly, when I just use one column, it seems to match and filter by row and not column. Any advice here on how to correct this?
df[(df['BPHONE'].str.startswith(tuple(combined_list)))]
Account mphone rphone BPHONE
1 12980801 4444444 3932929 4279999
I thought that this should just populate BPHONE along the column axis and not the row axis. How would I be able to filter this way?
The output wanted is the following:
Account mphone rphone BPHONE
14999201 3931812 8014059 np.nan
12980801 4444444 np.nan 4279999
99999999 3279999 4419999 np.nan
To explain this, rphone shows True, False, True, so only the True numbers should show. Under False it should not show, or show as NAN.

The output you are expecting is not a filtered result, but conditionally replaced result.
condition = df['BPHONE'].str.startswith(tuple(combined_list))
Use np.where
df['BPHONE'] = pd.np.where(condition, df['BPHONE'], pd.np.nan)
OR
df.loc[~condition, 'BPHONE'] = pd.np.nan

All filters you are applying are functioning correctly:
df['BPHONE'].str.startswith(tuple(combined_list))
0 False
1 True #Only this row will be retained
2 False
The combined filter will return:
df[(df['BPHONE'].str.startswith(tuple(combined_list))) & (df['rphone'].str.startswith(tuple(combined_list)))]
First filter Second filter Combined filter
0 False True False #Not retained
1 True False False #Not retained
2 False True False #Not retained

Find values of data frame in another dataframe in python

I have two data frames df_1 contains:
["TP","MP"]
and df_2 contains:
["This is case 12389TP12098","12378MP899" is now resolved","12356DCT is pending"]
I want to use values in df_1 search it in each entry of df_2
and return those which matches. In this case,those two entries which have TP,MP.
I tried something like this.
df_2.str.contains(df_1)

You need to do it separately for each element of df_1. Pandas will help you:
df_1.apply(df_2.str.contains)
Out:
0 1 2
0 True False False
1 False True False
That's a matrix of all combinations. You can pretty it up:
matches = df_1.apply(df_2.str.contains)
matches.index = df_1
matches.columns = df_2
matches
Out:
This is case 12389TP12098 12378MP899 is now resolved 12356DCT is pending
TP True False False
MP False True False

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to compress a pandas dataframe, based on their boolean column value? - python

Related

How to find error value on pandas dataframe?

Python Pandas: filtering dataframe based on mask list throwing error

Find rows in a DataFrame that match

How to filter by boolean?

Find values of data frame in another dataframe in python

Categories

Resources