Delete the rows that have a single zero in Pandas - python

I have imported an excel sheet using Pandas like this:
w = pd.read_excel(r"C:\Users\lvk\Downloads\Softwares\Prob.xls", header=None)
Once I imported the excel sheet, I need to delete the rows with even a single zero in any column.
Are there any functions in Python to do that?
Please let me know.
Input:
row1: 0 4 3 5
row2: 1 6 5 61
row3: 1 3 6 0
Expected output:
1 6 5 61

Pandas has very powerful interfaces for indexing and selecting data. Among them are the loc accessor, which selects rows, and square brackets, which pass boolean indexing logic to it. Normally you would use your column names to build logical conditions on their values. Since I don't know the index or columns of your Excel data, we will simply loop over every column that is there.
# Look in each column in turn
for col in w.columns:
    # And keep only the rows in w that don't have a 0 in that column
    w = w.loc[w[col] != 0]
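If you'd rather avoid the loop, the same filter can be written as a single vectorized step. This is a minimal sketch assuming every column of w is numeric:

# Keep only the rows in which no column contains a 0
w = w[(w != 0).all(axis=1)]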

Related

How to drop dataframe columns using both a list and a column not in the list?

I am trying to drop pandas columns in the following way. I have a list of columns to drop that will be used many times in my notebook, plus 2 columns that are only referenced once:
drop_cols=['var1','var2']
df = df.drop(columns={'var0',drop_cols})
So basically, I want to drop all columns from the list drop_cols, in addition to a hard-coded "var0" column, all in one swoop. This gives an error. How do I resolve it?
df = df.drop(columns=drop_cols+['var0'])
From what I gather, you have a set of columns you wish to drop from several different dataframes, while also dropping another column unique to one dataframe. The command you used is close, but you can't build the combined list that way. This is how I would approach the problem.
Given a Dataframe of the form:
V0 V1 V2 V3
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
Define a function to merge the column names:
def mergeNames(spc_col, multi_cols):
    rslt = [spc_col]
    rslt.extend(multi_cols)
    return rslt
Then with
drop_cols = ['V1', 'V2']
df.drop(columns=mergeNames('V0', drop_cols))
yields:
V3
0 4
1 8
2 12
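For a one-off drop, plain list concatenation (as in the first answer above) works without the helper. A minimal runnable sketch with the same toy dataframe:

import pandas as pd

df = pd.DataFrame({'V0': [1, 5, 9], 'V1': [2, 6, 10],
                   'V2': [3, 7, 11], 'V3': [4, 8, 12]})
drop_cols = ['V1', 'V2']

# Lists concatenate with +, so the helper is optional here
print(df.drop(columns=['V0'] + drop_cols))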

Python Pandas: Delete all rows of data below first empty cell in column A

I have a csv file that gets mailed to me every day, and I want to write a script to clean up the data before I push it into a database. At the bottom of the csv file are two empty rows (Row 73 & 74 in image) and two rows with some junk data in them (Row 75 & 76 in image), and I need to delete these rows.
To identify the first empty row, it might be helpful to know that Column A will always have data in it until the first empty row (Row 73 in image).
Can you help me figure out how to identify these rows and delete the data in them?
You can check for missing values with Series.isna, create a cumulative sum with Series.cumsum, and keep only the rows where the sum equals 0 via boolean indexing. This solution also works if there is no missing value in the first column.
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['as', 'bf', np.nan, 'vd', 'ss'],
                   'B': [1, 2, 3, 4, 5]})
print (df)
A B
0 as 1
1 bf 2
2 NaN 3
3 vd 4
4 ss 5
df = df[df['A'].isna().cumsum() == 0]
print (df)
A B
0 as 1
1 bf 2
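Applied back to the original CSV problem, a sketch along these lines should remove both the blank rows and the junk rows beneath them; the filename is hypothetical, and I assume column A is the first column of the file:

import pandas as pd

df = pd.read_csv('daily_export.csv')   # hypothetical filename
# Keep only the rows above the first missing value in the first column
df = df[df.iloc[:, 0].isna().cumsum() == 0]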

Transpose all rows in one column of dataframe to multiple columns based on certain conditions

I would like to convert one column of data to multiple columns in dataframe based on certain values/conditions.
Please find the code to generate the input dataframe
df1 = pd.DataFrame({'VARIABLE': ['studyid', 1, 'age_interview', 65,
                                 'Gender', '1.Male', '2.Female',
                                 'Ethnicity', '1.Chinese', '2.Indian', '3.Malay']})
The data is a single VARIABLE column holding the values generated above.
Please note that I may not know the column names in advance, but the data usually follows this format. What I have shown above is sample data; the real data might have around 600-700 columns arranged in this fashion.
What I would like to do is turn each value that starts with a non-digit (a character) into a new column of the dataframe. It can be a new dataframe.
I attempted to write a for loop but failed due to the error below. Can you please help me achieve this outcome?
for i in range(3, len(df1)):
    # str(df1['VARIABLE'][i].contains('^\d'))
    if (df1['VARIABLE'][i].astype(str).contains('^\d') == True):
Through the above loop, I was trying to check whether the first character is a digit: if yes, retain it as a value (e.g. 1, 2, 3), and if it's a character (e.g. Gender, Ethnicity), create a new column. But I guess this is an incorrect and lengthy approach.
For example, in the sample above the new columns would be studyid, age_interview, Gender and Ethnicity, with their values laid out underneath.
Can you please let me know if there is an elegant approach to do this?
You can use groupby to do something like:
m = ~df1['VARIABLE'].str[0].str.isdigit().fillna(True)
new_df = (pd.DataFrame(df1.groupby(m.cumsum()).VARIABLE.apply(list)
                          .values.tolist())
            .set_index(0).T)
print(new_df.rename_axis(None, axis=1))
studyid age_interview Gender Ethnicity
1 1 65 1.Male 1.Chinese
2 None None 2.Female 2.Indian
3 None None None 3.Malay
Explanation: m is a helper series that separates the groups:
print(m.cumsum())
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 4
8 4
9 4
10 4
Then we group this helper series and apply list:
df1.groupby(m.cumsum()).VARIABLE.apply(list)
VARIABLE
1 [studyid, 1]
2 [age_interview, 65]
3 [Gender, 1.Male, 2.Female]
4 [Ethnicity, 1.Chinese, 2.Indian, 3.Malay]
Name: VARIABLE, dtype: object
At this point we have each group as a list with the column name as the first entry.
So we create a dataframe with this and set the first column as index and transpose to get our desired output.
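Putting those pieces together, here is a self-contained sketch of this answer. Casting VARIABLE to string up front is my addition, so the .str accessor behaves predictably on the mixed int/str values:

import pandas as pd

df1 = pd.DataFrame({'VARIABLE': ['studyid', 1, 'age_interview', 65,
                                 'Gender', '1.Male', '2.Female',
                                 'Ethnicity', '1.Chinese', '2.Indian', '3.Malay']})

# True wherever a value does not start with a digit, i.e. at each column name
m = ~df1['VARIABLE'].astype(str).str[0].str.isdigit()

# Collect each group into a list, build a frame from the ragged lists,
# promote each list's first entry to the index, then transpose
new_df = (pd.DataFrame(df1.groupby(m.cumsum()).VARIABLE.apply(list)
                          .values.tolist())
            .set_index(0).T)
print(new_df.rename_axis(None, axis=1))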
Use itertools.groupby and then construct pd.DataFrame:
import pandas as pd
import itertools
l = ['studyid', 1, 'age_interview', 65, 'Gender', '1.Male', '2.Female',
     'Ethnicity', '1.Chinese', '2.Indian', '3.Malay']
l = list(map(str, l))
# Group consecutive items by whether they start with a digit
grouped = [list(g) for k, g in itertools.groupby(l, key=lambda x: x[0].isnumeric())]
# Name groups and value groups alternate, so pair them up with zip
d = {k[0]: v for k, v in zip(grouped[::2], grouped[1::2])}
pd.DataFrame.from_dict(d, orient='index').T
Output:
Gender studyid age_interview Ethnicity
0 1.Male 1 65 1.Chinese
1 2.Female None None 2.Indian
2 None None None 3.Malay

Only getting relevant data from Pandas Dataframe

Brief background: I just started recently using Pandas to read in a csv file of data. I'm able to create a dataframe from reading the csv but now I want to do some calculations using only specific columns of the dataset.
Is there a way to create a new dataframe where I only use rows where the relevant columns are not NA or 0? For example imagine an array that looks like:
blah blah1 blah2 blah3
0 1 1 1 1
1 NA NA 1 NA
2 1 1 1 1
So say I want to do things with the data under columns "blah1" and "blah2", but I only want to use rows 0 and 2, because row 1 has an NA under column "blah".
Is there a simple way of doing this? Thanks!
Edit (Clarifications):
- I don't know ahead of time that I want to drop row 1, thus I need to be able to check for a NA value (and possibly any other placeholder value beyond just whether it is null).
Yes, you can use dropna:
df = df.dropna(axis=0)
and to select columns use this:
df = df[["blah1", "blah2"]]
Now df contains only the columns "blah1" and "blah2" and rows 0 and 2.
EDIT 1
To limit the NaN check to specific columns you can use isnull(); here a row is dropped only when both columns are null:
mask = df[["blah1", "blah2"]].isnull().all(axis=1)
df = df[~mask]
EDIT 2
To filter on a placeholder value instead of NaN, build the mask against that value (column B here is just an example name):
mask = df.B == 'placeholder'
df = df[~mask]
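If the placeholder should be treated like a missing value everywhere, another option is to normalize it first and then reuse the NaN-based filtering. A sketch assuming the literal string 'placeholder' and the column names from the question:

import numpy as np

# Turn the placeholder into a real NaN, then drop rows missing either column
df = df.replace('placeholder', np.nan)
df = df.dropna(subset=["blah1", "blah2"])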

Pandas indexing and accessing columns by names

I am trying to access a pandas dataframe by column name after indexing the df with a specific column, but it returns the wrong column's values.
import pandas as pd
rs = pd.read_csv('rs.txt', header="infer", sep="\t", names=['id', 'exp', 'fov', 'cycle', 'color', 'values'], index_col=2)
rs.cycle.head()
I am indexing the df here with 'fov', and when I access the 'cycle' column it gives me the color column instead. Am I missing something here?
EDIT
The first few lines of the input file are:
6 3 1 G 0.96593
6 3 1 O 0.88007
6 3 1 R 0.94305
6 3 2 B 0.90554
6 3 2 G 0.93146
I think the problem arises because your data file has 5 columns but your names list has 6 elements. To verify, check the first few values in the id column: these will all be 6 if I am right, and the first few items in the exp column will all have the value 3.
To fix this, drop 'id' from the names list and read your input file like so (note that 'fov' is now at position 1, so index_col changes too):
rs = pd.read_csv('rs.txt', header="infer", sep="\t", names=['exp', 'fov', 'cycle', 'color', 'values'], index_col=1)
If you leave out index_col, pandas will automatically insert row identifiers instead.
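As a quick sanity check of that assumption (a tab-separated file with five data columns and no header row), reading it back should now put the cycle numbers, not the color letters, under 'cycle':

import pandas as pd

rs = pd.read_csv('rs.txt', header="infer", sep="\t",
                 names=['exp', 'fov', 'cycle', 'color', 'values'],
                 index_col=1)                # index by 'fov'
print(rs.columns.tolist())                   # ['exp', 'cycle', 'color', 'values']
print(rs.cycle.head())                       # 1 1 1 2 2 for the sample rows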
