Dataframe: use grouped values as row names - python

I just started to work with Python (pandas) and now I have my first question.
I have a dataframe with the following columns:
ID A Class
1 True [0,5]
2 False [0,5]
3 True [5,10]
4 False [10,20]
5 True [0,5]
6 False [10,20]
Now I'm looking for a cool solution where I can do something like this:
Class True False
[0,5] 2 1
[5,10] 1 0
[10,20] 0 2
I want to count how many True and False values I have for each Class.
Is there a fast solution? My dataframe could have more than 2 million entries.

You can use pivot_table to do the aggregation. After that, it's just a matter of formatting the column names and index to match your desired output.
# Perform the pivot and aggregation.
df = pd.pivot_table(df, index='Class', columns='A', aggfunc='count', fill_value=0)
# Format column names and index to match desired output.
df.columns = [c[1] for c in df.columns]
df.reset_index(inplace=True)
The resulting output:
Class False True
0 [0,5] 1 2
1 [10,20] 2 0
2 [5,10] 0 1
Edit:
The above solution assumes that the elements of the 'Class' column are strings. If they are lists, you could do the following:
df['Class'] = df['Class'].map(tuple)
# ...run the original pivot_table solution from above here...
df['Class'] = df['Class'].map(list)

With df as your dataframe, I would first use:
g = df.groupby('Class')['A'].value_counts().rename('count').reset_index()
(the rename gives the counts an explicit column name and avoids a name clash with 'A' on older pandas versions)
that returns:
Class A count
0 [0,5] True 2
1 [0,5] False 1
2 [10,20] False 2
3 [5,10] True 1
then I would pivot the above table to get your desired shape:
a = pd.pivot_table(g, index='Class', columns='A', values='count').fillna(0)
This returns:
A False True
Class
[0,5] 1.0 2.0
[10,20] 2.0 0.0
[5,10] 0.0 1.0
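As a side note, pd.crosstab computes the same counts in a single call, which is typically fast even for millions of rows. A minimal sketch, with the sample data rebuilt from the question:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                   'A': [True, False, True, False, True, False],
                   'Class': ['[0,5]', '[0,5]', '[5,10]', '[10,20]', '[0,5]', '[10,20]']})

# crosstab counts how often each (Class, A) combination occurs.
print(pd.crosstab(df['Class'], df['A']))
This prints:
A False True
Class
[0,5] 1 2
[10,20] 2 0
[5,10] 0 1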

Related

df.duplicated() not finding duplicates

I am trying to run this code.
import pandas as pd
df = pd.DataFrame({'A': ['1', '2'],
                   'B': ['1', '2'],
                   'C': ['1', '2']})
print(df.duplicated())
It gives me this output:
0 False
1 False
dtype: bool
I want to know why it is showing index 1 as False and not True.
I'm expecting this output:
0 False
1 True
dtype: bool
I'm using Python 3.11.1 and Pandas 1.4.4
duplicated works on full rows (or on a subset of the columns if the subset parameter is used).
Here you don't have any duplicates:
A B C
0 1 1 1 # this row is unique
1 2 2 2 # this one is also unique
I believe you might want to check duplication column-wise instead:
df.T.duplicated()
Output:
A False
B True
C True
dtype: bool
You are not getting the expected output because you don't have duplicates to begin with. I added duplicate rows to the end of your dataframe, and this is closer to what you are looking for:
import pandas as pd
df = pd.DataFrame({'A': ['1', '2'],
                   'B': ['1', '2'],
                   'C': ['1', '2']})
df = pd.concat([df]*2)
df
A B C
0 1 1 1
1 2 2 2
0 1 1 1
1 2 2 2
df.duplicated(keep='first')
Output:
0 False
1 False
0 True
1 True
dtype: bool
And if you want to mark duplicates the other way around, keeping the last occurrence:
df.duplicated(keep='last')
0 True
1 True
0 False
1 False
dtype: bool
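For completeness, keep=False marks every occurrence of a duplicated row as True, which is handy if you want to drop all copies at once. A small sketch on the same concatenated frame:
import pandas as pd

df = pd.DataFrame({'A': ['1', '2'], 'B': ['1', '2'], 'C': ['1', '2']})
df = pd.concat([df] * 2)

# keep=False flags every row that has a duplicate anywhere in the frame.
print(df.duplicated(keep=False))
Output:
0 True
1 True
0 True
1 True
dtype: bool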

How can I get the index of rows having null values in all columns

I'd like to get the index of rows which have only null values, directly in pandas (Python 3).
Thanks.
Use:
i = df.index[df.isna().all(axis=1)]
For a large DataFrame, this alternative is slower:
i = df[df.isna().all(axis=1)].index
Sample:
df=pd.DataFrame({"a":[np.nan,0,1],
"b":[np.nan,1,np.nan]})
print (df)
a b
0 NaN NaN
1 0.0 1.0
2 1.0 NaN
i = df.index[df.isna().all(axis=1)]
print (i)
Int64Index([0], dtype='int64')
Explanation:
First compare missing values by DataFrame.isna:
print (df.isna())
a b
0 True True
1 False False
2 False True
Then check whether all values per row are True with DataFrame.all:
print (df.isna().all(axis=1))
0 True
1 False
2 False
dtype: bool
And last, filter the index values by boolean indexing.
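Once you have the index, a common follow-up is dropping those all-NaN rows; a minimal sketch reusing the sample frame (equivalent to df.dropna(how='all')):
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.nan, 0, 1],
                   "b": [np.nan, 1, np.nan]})

# Collect the labels of the all-NaN rows, then drop them by label.
i = df.index[df.isna().all(axis=1)]
print(df.drop(i))
Output:
a b
1 0.0 1.0
2 1.0 NaN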

Pandas: Generate a Dataframe column which has values depending on another column of a dataframe

I am trying to generate a pandas DataFrame where a column will have numerical values based on the values of a column in another dataframe. Below is an example:
I want to generate another dataframe based on a column of dataframe df_
ipdb> df_ = pd.DataFrame({'c1':[False, True, False, True]})
ipdb> df_
c1
0 False
1 True
2 False
3 True
Using df_, another dataframe df1 is generated with the columns below.
ipdb> df1
col1 col2
0 0 NaN
1 1 0
2 2 NaN
3 3 1
Here, 'col1' has normal index values and 'col2' has NaN in the rows where 'c1' was False in df_, and sequentially incrementing values where 'c1' is True.
To generate this dataframe, below is what I have tried.
ipdb> df_[df_['c1']==True].reset_index().reset_index()
level_0 index c1
0 0 1 True
1 1 3 True
However, I feel there should be a better way to generate a dataframe with the two columns as in df1.
I think you need cumsum, subtracting 1 to start counting from 0:
df_ = pd.DataFrame({'c1':[False, True, False, True]})
df_['col2'] = df_.loc[df_['c1'], 'c1'].cumsum().sub(1)
print (df_)
c1 col2
0 False NaN
1 True 0.0
2 False NaN
3 True 1.0
Another solution is to count the occurrences of True values with sum, build that many sequential values with numpy.arange, and assign them back to the filtered DataFrame:
import numpy as np
df_.loc[df_['c1'], 'col2'] = np.arange(df_['c1'].sum())
print (df_)
c1 col2
0 False NaN
1 True 0.0
2 False NaN
3 True 1.0
Details:
print (df_['c1'].sum())
2
print (np.arange(df_['c1'].sum()))
[0 1]
Another way to solve this:
df_.loc[df_['c1'], 'col2'] = range(len(df_[df_['c1']]))
Output:
c1 col2
0 False NaN
1 True 0.0
2 False NaN
3 True 1.0
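If the goal is to build df1 exactly as shown in the question, with 'col1' holding the plain index values, here is a minimal sketch combining the ideas above (the names df1, col1 and col2 are taken from the question):
import numpy as np
import pandas as pd

df_ = pd.DataFrame({'c1': [False, True, False, True]})

# 'col1' is just the index; 'col2' gets a running counter where c1 is True.
df1 = pd.DataFrame({'col1': df_.index})
df1.loc[df_['c1'], 'col2'] = np.arange(df_['c1'].sum())
print(df1)
Output:
col1 col2
0 0 NaN
1 1 0.0
2 2 NaN
3 3 1.0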

What is the Right Syntax When Using .notnull() in Pandas?

I want to use .notnull() on several columns of a dataframe to eliminate the rows which contain "NaN" values.
Let's say I have the following df:
A B C
0 1 1 1
1 1 NaN 1
2 1 NaN NaN
3 NaN 1 1
I tried to use this syntax, but it does not work. Do you know what I am doing wrong?
df[[df.A.notnull()],[df.B.notnull()],[df.C.notnull()]]
I get this Error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
What should I do to get the following output?
A B C
0 1 1 1
Any idea?
You can first select the subset of columns with df[['A','B','C']], then apply notnull and check that all values per row in the mask are True:
print (df[['A','B','C']].notnull())
A B C
0 True True True
1 True False True
2 True False False
3 False True True
print (df[['A','B','C']].notnull().all(1))
0 True
1 False
2 False
3 False
dtype: bool
print (df[df[['A','B','C']].notnull().all(1)])
A B C
0 1.0 1.0 1.0
Another solution, from Ayhan's comment, uses dropna:
print (df.dropna(subset=['A', 'B', 'C']))
A B C
0 1.0 1.0 1.0
which is the same as:
print (df.dropna(subset=['A', 'B', 'C'], how='any'))
and means: drop all rows where there is at least one NaN value.
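For contrast, how='all' only drops rows in which every one of the listed columns is NaN. A quick sketch on the sample frame (here no row is NaN in all three columns, so nothing is dropped):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, np.nan],
                   'B': [1, np.nan, np.nan, 1],
                   'C': [1, 1, np.nan, 1]})

# A row is dropped only when A, B and C are all NaN at the same time.
print(df.dropna(subset=['A', 'B', 'C'], how='all'))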
You can apply multiple conditions by combining them with the & operator (this works not only for the notnull() function).
df[(df.A.notnull() & df.B.notnull() & df.C.notnull())]
A B C
0 1.0 1.0 1.0
Alternatively, you can just drop all rows which contain NaN. The original DataFrame is not modified; instead, a copy is returned.
df.dropna()

Drop row in pandas dataframe if any value in the row equals zero

How do I drop a row if any of the values in the row equal zero?
I would normally use df.dropna() for NaN values, but I'm not sure how to do it with "0" values.
I think the easiest way is to look at the rows where all values are not equal to 0:
df[(df != 0).all(1)]
You could make a boolean frame and then use any:
>>> df = pd.DataFrame([[1,0,2],[1,2,3],[0,1,2],[4,5,6]])
>>> df
0 1 2
0 1 0 2
1 1 2 3
2 0 1 2
3 4 5 6
>>> df == 0
0 1 2
0 False True False
1 False False False
2 True False False
3 False False False
>>> df = df[~(df == 0).any(axis=1)]
>>> df
0 1 2
1 1 2 3
3 4 5 6
Although it is late, someone else might find it helpful. I had a similar issue, and the following worked best for me:
df = pd.read_csv(r'your file')
df = df[df['your column name'] != 0]
reference:
Drop rows with all zeros in pandas data frame
see #ikbel benabdessamad
Assume a simple DataFrame as below:
df = pd.DataFrame([1, 2, 0, 3, 4, 0, 9])
Picking the non-zero values turns all zero values into NaN; dropna then removes those rows:
df = df[df != 0].dropna()
df
Output:
0
0 1.0
1 2.0
3 3.0
4 4.0
6 9.0
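Note that the same NaN trick generalizes to frames with several columns, because df[df != 0] masks every zero in the frame. A short sketch on the multi-column example from the earlier answer:
import pandas as pd

df = pd.DataFrame([[1, 0, 2], [1, 2, 3], [0, 1, 2], [4, 5, 6]])

# df[df != 0] turns each zero into NaN; dropna() then removes those rows.
# The surviving values become floats, since NaN is a float.
print(df[df != 0].dropna())
Output:
0 1 2
1 1.0 2.0 3.0
3 4.0 5.0 6.0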
