Only getting relevant data from Pandas Dataframe - python

Brief background: I recently started using Pandas to read in a csv file of data. I'm able to create a dataframe from the csv, but now I want to do some calculations using only specific columns of the dataset.
Is there a way to create a new dataframe that uses only the rows where the relevant columns are not NA or 0? For example, imagine a dataframe that looks like:
   blah  blah1  blah2  blah3
0     1      1      1      1
1    NA     NA      1     NA
2     1      1      1      1
So say I want to do things with the data under columns "blah1" and "blah2", but I only want to use rows 0 and 2, because row 1 has an NA under the column "blah".
Is there a simple way of doing this? Thanks!
Edit (Clarifications):
- I don't know ahead of time that I want to drop row 1, so I need to be able to check for an NA value (and possibly for other placeholder values, beyond just whether it is null).

Yes, you can use dropna to drop the rows that contain NA values:
df = df.dropna(axis=0)
and to select columns use this:
df = df[["blah1", "blah2"]]
Now df contains only the columns "blah1" and "blah2" and the rows 0 and 2.
EDIT 1
To limit the NaN check to specific columns, you can use isnull(). Building the mask with any(axis=1) drops every row that has an NA in at least one of those columns:
mask = df[["blah1", "blah2"]].isnull().any(axis=1)
df = df[~mask]
EDIT 2
To also drop rows containing a placeholder value, build the mask from a comparison instead (here against the string 'placeholder' in a column B):
mask = df.B == 'placeholder'
df = df[~mask]
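Putting the pieces together, here is a minimal end-to-end sketch (assuming the csv's NA entries were parsed as NaN, and that 0 is the placeholder value to exclude):
import numpy as np
import pandas as pd

df = pd.DataFrame({"blah":  [1, np.nan, 1],
                   "blah1": [1, np.nan, 1],
                   "blah2": [1, 1, 1],
                   "blah3": [1, np.nan, 1]})

# keep the relevant columns, then drop rows with NA in any of them
sub = df[["blah1", "blah2"]].dropna(axis=0)
# additionally drop rows containing the placeholder value 0
sub = sub[(sub != 0).all(axis=1)]
print(sub)  # rows 0 and 2 only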

Related

Pandas. What is the best way to insert additional rows in dataframe based on cell values?

I have a dataframe like this:
id  name  emails
1   a     a#e.com,b#e.com,c#e.com,d#e.com
2   f     f#gmail.com
I need to iterate over the emails: if a row has more than one, create additional rows in the dataframe for the extra emails (which do not correspond to a name), so the result looks like this:
id  name  emails
1   a     a#e.com
2   f     f#gmail.com
3   NaN   b#e.com
4   NaN   c#e.com
5   NaN   d#e.com
What is the best way to do this, apart from iterrows with append or concat? And is it OK to modify the dataframe while iterating over it?
Thanks.
Use Series.str.split to split the values and DataFrame.explode to get one email per row. Then compare the part before # with name and set name to a missing value where they don't match. Finally, sort so the missing names end up at the end of the DataFrame, and assign a fresh range to the id column:
import numpy as np
import pandas as pd

# one row per email, then blank out names that do not match the part before '#'
df = df.assign(emails=df['emails'].str.split(',')).explode('emails')
mask = df['name'].eq(df['emails'].str.split('#').str[0])
df['name'] = np.where(mask, df['name'], np.nan)
# push rows with missing names to the end, then renumber id
df = df.sort_values('name', key=lambda x: x.isna(), ignore_index=True)
df['id'] = range(1, len(df) + 1)
print(df)
   id name       emails
0   1    a      a#e.com
1   2    f  f#gmail.com
2   3  NaN      b#e.com
3   4  NaN      c#e.com
4   5  NaN      d#e.com
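A small variant (a sketch) avoids the numpy call: Series.where keeps values where the mask is True and fills in NaN elsewhere, so the np.where line above could also be written as:
df['name'] = df['name'].where(mask)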

Delete the rows that have a single zero in Pandas

I have imported an excel sheet using Pandas like this:
w = pd.read_excel(r"C:\Users\lvk\Downloads\Softwares\Prob.xls", header=None)
Once I've imported the excel sheet, I need to delete the rows that have even a single zero in any column.
Are there any functions in Python to do that?
Please let me know.
Input:
row1: 0 4 3 5
row2: 1 6 5 61
row3: 1 3 6 0
Expected output:
1 6 5 61
Pandas has very powerful interfaces for indexing and selecting data. Among them is the loc accessor, which accepts boolean indexing logic in square brackets to select rows. Normally you would use your column names to build logical conditions on their values; here I don't know the index or columns of your excel data, so we will just loop through all the columns that are there.
# We are going to look in each column
for col in w.columns:
    # and keep only the rows in w that don't have a 0 in that column
    w = w.loc[w[col] != 0]
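The loop can also be collapsed into a single vectorized filter (a sketch, assuming the columns are numeric): keep only the rows in which every value is non-zero.
# keep rows where no column equals 0
w = w.loc[(w != 0).all(axis=1)]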

select series based on index in a DataFrame in python

I have a pd.Series like
import numpy as np
import pandas as pd

myS = pd.Series(np.arange(1, 11, 1))
I also have a pd.DataFrame like
mydf = pd.DataFrame([[1,2,3],[7,8,9]])
I would like to select values from myS using the values of mydf as index labels, and have the result stored in a dataframe with the same shape as mydf.
So the desired resulting dataframe is pd.DataFrame([[2,3,4],[8,9,10]]).
What is the best way to achieve this?
You can use replace, which treats myS like a dict mapping its index to its values:
yourdf = mydf.replace(myS)
yourdf
Out[174]:
   0  1   2
0  2  3   4
1  8  9  10
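An alternative sketch looks the values up by index instead of replacing them. Note the difference in behavior: replace leaves values that are missing from myS's index untouched, while map would turn them into NaN:
# map each column's values through myS, preserving mydf's shape
yourdf = mydf.apply(lambda col: col.map(myS))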

How to drop columns with NA using cudf?

Pandas:
data = data.dropna(axis = 'columns')
I am trying to do something similar using a cudf dataframe but the apis don't offer this functionality.
My solution is to convert to a pandas df, do the above command, then re-convert to a cudf. Is there a better solution?
cuDF now supports column based dropna, so the following will work:
import cudf

df = cudf.DataFrame({'a': [0, 1, None], 'b': [None, 0, 2], 'c': [1, 2, 3]})
print(df)
      a     b  c
0     0  null  1
1     1     0  2
2  null     2  3
df.dropna(axis='columns')
   c
0  1
1  2
2  3
On older cuDF versions where dropna is not yet implemented, you can instead check the null_count of each column and drop the ones with null_count > 0.
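A minimal sketch of that fallback (assuming the df above; null_count is a cuDF Series property):
# keep only the columns that contain no nulls
df = df[[col for col in df.columns if df[col].null_count == 0]]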

How to add a MultiIndex after loading csv data into a pandas dataframe?

I am trying to add additional index rows to an existing pandas dataframe after loading csv data into it.
So let's say I load my data like this:
columns = ['Relative_Pressure','Volume_STP']
df = pd.read_csv(StringIO(contents), skiprows=4, delim_whitespace=True,index_col=False,header=None)
df.columns = columns
where contents is a string in csv format. The resulting DataFrame then has the plain columns Relative_Pressure and Volume_STP.
For clarity, I would now like to add additional index rows to the columns, as shown in the linked example. However, in the link those multiple index rows are generated right when the DataFrame is created, whereas I want to add rows such as unit or descr to the columns of my existing DataFrame.
How could I do this?
You can create a MultiIndex on the columns by specifically creating the index and then assigning it to the columns separately from reading in the data.
I'll use the example from the link you provided. The first method is to create the MultiIndex when you make the dataframe:
df = pd.DataFrame({('A',1,'desc A'):[1,2,3],('B',2,'desc B'):[4,5,6]})
df.columns.names=['NAME','LENGTH','DESCRIPTION']
df
NAME              A       B
LENGTH            1       2
DESCRIPTION  desc A  desc B
0                 1       4
1                 2       5
2                 3       6
As stated, this is not what you are after. Instead, you can make the dataframe (from your file, for example), build the MultiIndex from a set of lists, and then assign it to the columns:
df = pd.DataFrame({'desc A': [1,2,3], 'desc B': [4,5,6]})
# Output
   desc A  desc B
0       1       4
1       2       5
2       3       6
# Create a multiindex from lists
index = pd.MultiIndex.from_arrays((['A', 'B'], [1, 2], ['desc A', 'desc B']))
# Assign to the columns
df.columns = index
# Output
        A       B
        1       2
   desc A  desc B
0       1       4
1       2       5
2       3       6
# Name the columns
df.columns.names = ['NAME','LENGTH','DESCRIPTION']
# Output
NAME              A       B
LENGTH            1       2
DESCRIPTION  desc A  desc B
0                 1       4
1                 2       5
2                 3       6
There are other ways to construct a MultiIndex, for example from_tuples and from_product. You can read more about MultiIndexes in the documentation.
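For instance, the same columns could be built with from_tuples (a sketch producing an equivalent index):
# one tuple per column, with the level names attached directly
index = pd.MultiIndex.from_tuples(
    [('A', 1, 'desc A'), ('B', 2, 'desc B')],
    names=['NAME', 'LENGTH', 'DESCRIPTION'])
df.columns = index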
