Select rows of a pandas DataFrame with at most one null entry - python

I need to select rows from a dataframe where at least one of the columns col1 and col2 is not null.
Right now I am trying the following, but it doesn't work:
df=df.loc[(df['Cat1_L2'].isnull()) & (df['Cat2_L3'].isnull())==False]

Setup
(Modifying U8-Forward's data)
import numpy as np
import pandas as pd

df = pd.DataFrame({'Cat1_L2':[1,np.nan,3, np.nan], 'Cat3_L3': [np.nan,3,4, np.nan]})
df
Cat1_L2 Cat3_L3
0 1.0 NaN
1 NaN 3.0
2 3.0 4.0
3 NaN NaN
Indexing with isna + sum
Fixing your code: ensure the number of True cases (each corresponding to a NaN in these columns) is less than 2.
df[df[['Cat1_L2', 'Cat3_L3']].isna().sum(axis=1) < 2]
Cat1_L2 Cat3_L3
0 1.0 NaN
1 NaN 3.0
2 3.0 4.0
dropna with thresh
df.dropna(subset=['Cat1_L2', 'Cat3_L3'], thresh=1)
Cat1_L2 Cat3_L3
0 1.0 NaN
1 NaN 3.0
2 3.0 4.0
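For completeness, the asker's original boolean approach can also be made to work by negating the combined "both null" mask; a minimal sketch, assuming the column names from the setup above:
# Sketch: keep rows where NOT (both columns are null).
# In the original attempt, == binds tighter than &, so the == False
# applied only to the second isnull(), not to the whole condition.
mask_both_null = df['Cat1_L2'].isna() & df['Cat3_L3'].isna()
df.loc[~mask_both_null]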

One way is to loop over every row using itertuples(). Be aware that this is computationally expensive.
1. Create a list that checks your condition for each row using itertuples():
condition_list = []
for row in df.itertuples():
    # pd.notna handles NaN correctly; comparing against None does not,
    # because NaN != None evaluates to True
    if pd.notna(row.Cat1_L2) or pd.notna(row.Cat2_L3):
        condition_list.append(1)
    else:
        condition_list.append(0)
2. Convert the list to a pandas Series:
condition_series = pd.Series(condition_list)
3. Append the series to the original df:
df['condition_column'] = condition_series.values
4. Filter the df:
df_new = df[df.condition_column == 1]
del df_new['condition_column']
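Once the loop-based version works, a vectorized equivalent (a sketch, reusing the asker's column names) avoids the per-row cost entirely:
# Sketch: the same condition without a Python-level loop
df_new = df[df['Cat1_L2'].notna() | df['Cat2_L3'].notna()]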

Related

Pandas DataFrame, group by column into single line items but extend columns by number of occurrences per group

I am trying to reformat a DataFrame into a single line item per categorical group, but my fixed format needs to retain all elements of data associated with the category as new columns.
For example, I have a DataFrame:
dta = {'day':['A','A','A','A','B','C','C','C','C','C'],
       'param1':[100,200,2,3,7,23,43,98,1,0],
       'param2':[1,20,65,3,67,2,3,98,654,5]}
df = pd.DataFrame(dta)
I need to be able to transform/reformat the DataFrame where the data is grouped by the 'day' column (e.g. one row per day) but then has columns generated dynamically according to how many entries are within each category.
For example category C in the 'day' column has 5 entries, meaning for 'day' C you would have 5 param1 values and 5 param2 values.
The associated values for days A and B would be populated with NaN or empty where they do not have entries.
e.g.
dta2 = {'day':['A','B','C'],
        'param1_1':[100,7,23],
        'param1_2':[200,np.nan,43],
        'param1_3':[2,np.nan,98],
        'param1_4':[3,np.nan,1],
        'param1_5':[np.nan,np.nan,0],
        'param2_1':[1,67,2],
        'param2_2':[20,np.nan,3],
        'param2_3':[65,np.nan,98],
        'param2_4':[3,np.nan,654],
        'param2_5':[np.nan,np.nan,5]
        }
df2 = pd.DataFrame(dta2)
Unfortunately this is a predefined format that I have to maintain.
I am aiming to use Pandas as efficiently as possible to minimise deconstructing and reassembling the DataFrame.
You first need to melt, then add a helper column that cumcounts the labels per group, and pivot:
df2 = (
    df.melt(id_vars='day')
      .assign(group=lambda d: d.groupby(['day', 'variable']).cumcount().add(1).astype(str))
      .pivot(index='day', columns=['variable', 'group'], values='value')
)
df2.columns = df2.columns.map('_'.join)
df2 = df2.reset_index()
output:
day param1_1 param1_2 param1_3 param1_4 param1_5 param2_1 param2_2 param2_3 param2_4 param2_5
0 A 100.0 200.0 2.0 3.0 NaN 1.0 20.0 65.0 3.0 NaN
1 B 7.0 NaN NaN NaN NaN 67.0 NaN NaN NaN NaN
2 C 23.0 43.0 98.0 1.0 0.0 2.0 3.0 98.0 654.0 5.0
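If your pandas version predates list-valued columns= in pivot (added in pandas 1.1), a sketch of the same reshape via set_index and unstack:
# Sketch: same idea with set_index + unstack instead of pivot
long = df.melt(id_vars='day')
long['group'] = long.groupby(['day', 'variable']).cumcount().add(1).astype(str)
wide = long.set_index(['day', 'variable', 'group'])['value'].unstack(['variable', 'group'])
wide.columns = wide.columns.map('_'.join)
wide = wide.reset_index()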

select range of values for all columns in pandas dataframe

I have a dataframe 'DF', part of which looks like this:
I want to select only the values between 0 and 0.01, to form a new dataframe (with blanks where the value was over 0.01).
To do this, I tried:
similarity = []
for x in DF:
similarity.append([DF[DF.between(0, 0.01).any(axis=1)]])
simdf = pd.DataFrame(similarity)
simdf.to_csv("similarity.csv")
However, I get the error AttributeError: 'DataFrame' object has no attribute 'between'.
How do I select a range of values and create a new dataframe from them?
Just do the two comparisons:
df_new = df[(df>0) & (df<0.01)]
Example:
import pandas as pd
df = pd.DataFrame({"a":[0,2,4,54,56,4],"b":[4,5,7,12,3,4]})
print(df[(df>5) & (df<33)])
a b
0 NaN NaN
1 NaN NaN
2 NaN 7.0
3 NaN 12.0
4 NaN NaN
5 NaN NaN
If you want a blank string instead of NaN:
df[(df>5) & (df<33)].fillna("")
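As an aside, between does exist, but on Series rather than DataFrame, so the asker's idea works when applied column-wise; a sketch (note between is inclusive of both bounds by default, unlike the strict comparisons above):
# Sketch: Series.between applied to every column builds the mask
simdf = df[df.apply(lambda s: s.between(0, 0.01))]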

Add new column with column names of a table, based on conditions [duplicate]

I have a dataframe as below:
I want to get, for each row, the name of the column that contains 1.
Use DataFrame.dot:
df1 = df.dot(df.columns)
If there are multiple 1s per row:
df2 = df.dot(df.columns + ';').str.rstrip(';')
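A quick sketch with a toy 0/1 frame (the data is illustrative) shows what the dot trick produces: multiplying each 0/1 value by its column name and summing concatenates the names that match per row.
import pandas as pd

# Illustrative frame; dot concatenates matching column names per row
df = pd.DataFrame({'bar': [0, 1, 0], 'spam': [0, 1, 1]})
print(df.dot(df.columns + ';').str.rstrip(';'))
# 0
# 1    bar;spam
# 2        spam
# dtype: object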
Firstly
Your question is very ambiguous and I recommend reading the link in #sammywemmy's comment. If I understand your problem correctly... we'll talk about this mask first:
df.columns[
(df == 1) # mask
.any(axis=0) # mask
]
What's happening? Let's work our way outward, starting from within df.columns[**HERE**]:
(df == 1) makes a boolean mask of the df with True/False(1/0)
.any() as per the docs:
"Returns False unless there is at least one element within a series or along a Dataframe axis that is True or equivalent".
This gives us a handy Series to mask the column names with.
We will use this example to automate for your solution below
Next:
Automate to get an output of (<row index>, [<col name>, <col name>, ...]) wherever there is a 1 in the row values. Although this will be slower on large datasets, it should do the trick:
import pandas as pd
data = {'foo':[0,0,0,0], 'bar':[0, 1, 0, 0], 'baz':[0,0,0,0], 'spam':[0,1,0,1]}
df = pd.DataFrame(data, index=['a','b','c','d'])
print(df)
foo bar baz spam
a 0 0 0 0
b 0 1 0 1
c 0 0 0 0
d 0 0 0 1
# group our df by index, creating a dict that maps each index label
# to its sub-DataFrame (one row each here)
df_dict = dict(
    list(
        df.groupby(df.index)
    )
)
Next step is a for loop that iterates the contents of each df in df_dict, checks them with the mask we created earlier, and prints the intended results:
for k, v in df_dict.items():  # k: index label, v: single-row df
    check = v.columns[(v == 1).any()]
    if len(check) > 0:
        print((k, check.to_list()))
('b', ['bar', 'spam'])
('d', ['spam'])
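The dict-of-frames detour can also be skipped; a sketch of the same per-row result with a single row-wise apply, using the same df as above:
# Sketch: collect matching column names row by row
matches = (df == 1).apply(lambda row: df.columns[row].to_list(), axis=1)
print(matches[matches.str.len() > 0])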
Side note:
You see how I generated sample data that can be easily reproduced? In the future, please try to ask questions with posted sample data that can be reproduced. This way it helps you understand your problem better and it is easier for us to answer it for you.
Getting the column name divides into two cases.
First case: if you want the name in a new column, the condition must match at most one column per row, because only one column name can be stored per row.
import numpy as np
import pandas as pd

data = {'foo':[0,0,3,0], 'bar':[0, 5, 0, 0], 'baz':[0,0,2,0], 'spam':[0,1,0,1]}
df = pd.DataFrame(data)
df = df.replace(0, np.nan)
df
foo bar baz spam
0 NaN NaN NaN NaN
1 NaN 5.0 NaN 1.0
2 3.0 NaN 2.0 NaN
3 NaN NaN NaN 1.0
If you are looking for the min or the max:
max_col = df.idxmax(axis=1)
min_col = df.idxmin(axis=1)
out = df.assign(max=max_col, min=min_col)
out
foo bar baz spam max min
0 NaN NaN NaN NaN NaN NaN
1 NaN 5.0 NaN 1.0 bar spam
2 3.0 NaN 2.0 NaN foo baz
3 NaN NaN NaN 1.0 spam spam
Second case: if your condition can be satisfied by multiple columns (for example, you are looking for columns that contain 1), you need a list, because multiple names cannot fit in a single dataframe cell.
str_con = df.astype(str).apply(lambda x: x.str.contains('1.0', case=False, na=False)).any()
df.columns[str_con]
#output
Index(['spam'], dtype='object') #only spam contains 1
Or, for a numerical condition, columns containing a value greater than 1:
num_con = df.apply(lambda x: x > 1.0).any()
df.columns[num_con]
#output
Index(['foo', 'bar', 'baz'], dtype='object') #these col has higher value than 1
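The same numerical check also works without apply, since comparisons broadcast over the whole frame; a sketch:
# Sketch: direct comparison instead of apply
num_con = (df > 1.0).any()
df.columns[num_con]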
Happy learning

Getting size() or groupby & count to work across all columns

Sorry if this seems repetitive; I've found a lot of close answers using groupby and size, but none that return the column header as the index.
I have the following df (which actually has 340 columns and many rows):
import pandas as pd
data = {'Name_Clean_40_40_Correct':['0','1','0','0'], 'Name_Clean_40_80_Correct':['0','1','1','N/A'],'Name_Clean_40_60_Correct':['N/A','N/A','0','1']}
df_third = pd.DataFrame(data)
I am trying to count the instances of '0', '1', and 'N/A' for each column. So I'd like the index to be the column names and the columns to be '0', '1', and 'N/A'.
I was trying this, but I'm afraid it is very inefficient or incorrect, since it won't complete:
def countx(x, colname):
    df_thresholds = df_third.groupby(colname).count()

for col in df_thresholds.columns:
    df_thresholds[col + '_Count'] = df_third.apply(countx, axis=1, args=(col,))
I can do it for one column but that would be a pain:
df_thresholds = df_third.groupby('Name_Clean_100_100_Correct').count()
df_thresholds = df_thresholds[['Name_Raw']]
df_thresholds = df_thresholds.T
If I understand correctly this should work:
df_third.apply(pd.Series.value_counts)
result:
Name_Clean_40_40_Correct ... Name_Clean_40_60_Correct
0 3.0 ... 1
1 1.0 ... 1
N/A NaN ... 2
BTW: to select only columns containing 'Correct':
df_third.filter(like='Correct')
Transposed form, df_third.apply(pd.Series.value_counts).T:
0 1 N/A
Name_Clean_40_40_Correct 3.0 1.0 NaN
Name_Clean_40_80_Correct 1.0 2.0 1.0
Name_Clean_40_60_Correct 1.0 1.0 2.0
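Both tips combine naturally; a sketch that restricts the count to the 'Correct' columns and puts the column names on the index in one chain:
# Sketch: filter the Correct columns, count values, transpose
counts = df_third.filter(like='Correct').apply(pd.Series.value_counts).T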

Why does a row added using the DataFrame loc function not give the correct result

I tried to insert a new row into a dataframe named 'my_df1' using the my_df1.loc function, but in the result the new row has NaN values.
my_data = {'A':pd.Series([1,2,3]),'B':pd.Series([4,5,6]),'C':('a','b','c')}
my_df1 = pd.DataFrame(my_data)
print(my_df1)
my_df1.loc[3] = pd.Series([5,5,5])
Result displayed is as below
A B C
0 1.0 4.0 a
1 2.0 5.0 b
2 3.0 6.0 c
3 NaN NaN NaN
The reason it is all NaN is that my_df1.loc[3] has index (A, B, C) while pd.Series([5,5,5]) has index (0, 1, 2). When you assign one series to another, pandas aligns on the index and only copies values at common labels, hence the result.
To fix this, do as #anky_91 says, or, if you already have a series, use its values:
my_df1.loc[3] = my_series.values
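A quick sketch of both working variants, using the frame from the question (the index ['A', 'B', 'C'] matches my_df1's columns):
# Sketch: either strip the index or supply a matching one
my_df1.loc[3] = pd.Series([5, 5, 5]).values                   # positional copy
my_df1.loc[3] = pd.Series([5, 5, 5], index=['A', 'B', 'C'])   # aligned copy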
Finally I found out how to add a Series as a row or column to a dataframe
my_data = {'A':pd.Series([1,2,3]),'B':pd.Series([4,5,6]),'C':('a','b','c')}
my_df1 = pd.DataFrame(my_data)
print(my_df1)
Code 1 adds a new column 'D' with values 5, 5, 5 to the dataframe:
my_df1.loc[:,'D'] = pd.Series([5,5,5],index = my_df1.index)
print(my_df1)
Code 2 adds a new row with index 3 and values 3, 4, 3, 4 to the dataframe from Code 1:
my_df1.loc[3] = pd.Series([3,4,3,4],index = ('A','B','C','D'))
print(my_df1)
