Can I apply lambda function to df column over specific rows? - python

New to python, so bear with me!
I have a function that applies random values to different columns in my df (depending on the column name; each column has specific random value range). For one specific function, I want to apply it to just half of my column. I attempted working with just my even rows to deal with this, but it seems it's not working. When I print out the first few rows, I can tell the column did not update. Would appreciate any help!
Here's what I've attempted:
for col in df.columns:
if 'shirt' in col:
df[col] = df[col].apply(lambda x: np.random.randint(0,5))
elif 'pants' in col:
df[col] = df[col].apply(lambda x: np.random.randint(0,10))
elif 'sweater' in col:
df[col] = df[col].apply(lambda x: np.random.randint(0,50) if df.index.all() % 2 == 0 else np.NaN)
else:
pass
(had to add .all() bc I got a Value Error: The truth value of an array with more than one element is ambiguous error.)
Using the index to apply it to half of my dataframe is what I can think of, but this wouldn't work if I wanted to apply that lambda function to just 60% or 80% of my column and leave the rest as nulls.
This is what I get as a result of the code above:
shirt_count pant_count sweater_count
0 14 3 5
1 18 3 7
2 1 3 5
3 7 1 9
4 2 3 2
Would love any help as I've been staring at my screen for hours!

You can apply separate functions to the even and odd rows of the 'sweater' column:
import pandas as pd
import numpy as np
df = pd.DataFrame({'shirt': [0,0,0,0,0], 'pants': [0,0,0,0,0], 'sweater': [0,0,0,0,0]})
for col in df.columns:
if 'shirt' in col:
df[col] = df[col].apply(lambda x: np.random.randint(0,5))
elif 'pants' in col:
df[col] = df[col].apply(lambda x: np.random.randint(0,10))
elif 'sweater' in col:
# random integers for even rows
df[col][::2] = df[col][::2].apply(lambda x: np.random.randint(0,50))
# na.NaN for odd rows
df[col][1::2] = df[col][1::2].apply(lambda x: np.NaN)
else:
pass
Output:
>>> df
shirt pants sweater
0 4 9 0.0
1 0 1 NaN
2 2 7 28.0
3 2 7 NaN
4 0 8 30.0

Related

dropping rows that has only one non zero value from a pandas dataframe in python

I have a pandas dataframe as shown below:
Pandas Dataframe
I want to drop the rows that has only one non zero value. What's the most efficient way to do this?
Try boolean indexing
# sample data
df = pd.DataFrame(np.zeros((10, 10)), columns=list('abcdefghij'))
df.iloc[2:5, 3] = 1
df.iloc[4:5, 4] = 1
# boolean indexing based on condition
df[df.ne(0).sum(axis=1).ne(1)]
Only rows 2 and 3 are removed because row 4 has two non-zero values and every other row has zero non-zero values. So we drop rows 2 and 3.
df.ne(0).sum(axis=1)
0 0
1 0
2 1
3 1
4 2
5 0
6 0
7 0
8 0
9 0
Not sure if this is the most efficient but I'll try:
df[[col for col in df.columns if (df[col] != 0).sum() == 1]]
2 loops per column here: 1 for checking if != 0 and one more to sum the boolean values up (could break earlier if the second value is found).
Otherwise, you can define a custom function to check without looping twice per column:
def check(column):
already_has_one = False
for value in column:
if value != 0:
if already_has_one:
return False
already_has_one = True
return already_has_one
then:
df[[col for col in df.columns if check(df[col])]]
Which is much faster than the first.
Or like this:
df[(df.applymap(lambda x: bool(x)).sum(1) > 1).values]

Dropping column if more than half of the values are same - Python

I have pandas df which looks like the pic:
enter image description here
I want to delete any column if more than half of the values are the same in the column, and I dont know how to do this
I trid using :pandas.Series.value_counts
but with no luck
You can iterate over the columns, count the occurences of values as you tried with value counts and check if it is more than 50% of your column's data.
n=len(df)
cols_to_drop=[]
for e in list(df.columns):
max_occ=df['id'].value_counts().iloc[0] #Get occurences of most common value
if 2*max_occ>n: # Check if it is more than half the len of the dataset
cols_to_drop.append(e)
df=df.drop(cols_to_drop,axis=1)
You can use apply + value_counts and getting the first value to get the max count:
count = df.apply(lambda s: s.value_counts().iat[0])
col1 4
col2 2
col3 6
dtype: int64
Thus, simply turn it into a mask depending on whether the greatest count is more than half len(df), and slice:
count = df.apply(lambda s: s.value_counts().iat[0])
df.loc[:, count.le(len(df)/2)] # use 'lt' if needed to drop if exactly half
output:
col2
0 0
1 1
2 0
3 1
4 2
5 3
Use input:
df = pd.DataFrame({'col1': [0,1,0,0,0,1],
'col2': [0,1,0,1,2,3],
'col3': [0,0,0,0,0,0],
})
Boolean slicing with a comprension
df.loc[:, [
df.shape[0] // s.value_counts().max() >= 2
for _, s in df.iteritems()
]]
col2
0 0
1 1
2 0
3 1
4 2
5 3
Credit to #mozway for input data.

pandas combine multiple row into one, and update other columns [duplicate]

I have this dataframe and I need to drop all duplicates but I need to keep first AND last values
For example:
1 0
2 0
3 0
4 0
output:
1 0
4 0
I tried df.column.drop_duplicates(keep=("first","last")) but it doesn't word, it returns
ValueError: keep must be either "first", "last" or False
Does anyone know any turn around for this?
Thanks
You could use the panda's concat function to create a dataframe with both the first and last values.
pd.concat([
df['X'].drop_duplicates(keep='first'),
df['X'].drop_duplicates(keep='last'),
])
you can't drop both first and last... so trick is too concat data frames of first and last.
When you concat one has to handle creating duplicate of non-duplicates. So only concat unique indexes in 2nd Dataframe. (not sure if Merge/Join would work better?)
import pandas as pd
d = {1:0,2:0,10:1, 3:0,4:0}
df = pd.DataFrame.from_dict(d, orient='index', columns=['cnt'])
print(df)
cnt
1 0
2 0
10 1
3 0
4 0
Then do this:
d1 = df.drop_duplicates(keep=("first"))
d2 = df.drop_duplicates(keep=("last"))
d3 = pd.concat([d1,d2.loc[set(d2.index) - set(d1.index)]])
d3
Out[60]:
cnt
1 0
10 1
4 0
Use a groupby on your column named column, then reindex. If you ever want to check for duplicate values in more than one column, you can extend the columns you include in your groupby.
df = pd.DataFrame({'column':[0,0,0,0]})
Input:
column
0 0
1 0
2 0
3 0
df.groupby('column', as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[0, -1]]).reset_index(level=0, drop=True)
Output:
column
0 0
3 0

How can I vectorize the apply + filter operation on pandas.DataFrame?

Imagine that I have a Dataframe and the columns are [A,B,C]. There are some different values for each of these columns. And I want to produce one more column D which can be received with the following function:
def produce_column(i):
# Extract current row by index
raw = df.loc[i]
# Extract previous 3 values for the same sub-df which are before i
df_same = df[
(df['A'] == raw.A)
& (df['B'] == raw.B)
].loc[:i].tail(3)
# Check that we have enough values
if df_same.shape[0] != 3:
return False
# Doesn't matter which function is in use, I just need to apply it on the column / columns
diffs = df_same['C'].map(lambda x: x <= 10 and x > 0)
return all(diffs)
df['D'] = df.index.map(lambda x: produce_column(x))
So on each step, I need to get the Dataframe, which have the same set of properties as a row and perform some operations on columns of this Dataframe. I have a few hundred thousands of rows, so this code takes a lot of time to be executed. I think that a good idea is to vectorize the operation, but I don't know how to do that. Maybe there's another way to perform this?
Thanks in advance!
UPD Here's an example
df = pd.DataFrame([(1,2,3), (4,5,6), (7,8,9)], columns=['A','B','C'])
A B C
0 1 2 3
1 4 5 6
2 7 8 9
df['D'] = df.index.map(lambda x: produce_column(x))
A B C D
0 1 2 3 True
1 4 5 6 True
2 7 8 9 False

Data cleaning: Remove 0 value from my dataset having a header and index_col

I have a dataset showing below.
What I would like to do is three things.
Step 1: AA to CC is an index, however, happy to keep in the dataset for the future purpose.
Step 2: Count 0 value to each row.
Step 3: If 0 is more than 20% in the row, which means more than 2 in this case because DD to MM is 10 columns, remove the row.
So I did a stupid way to achieve above three steps.
df = pd.read_csv("dataset.csv", header=None)
df_bool = (df == "0")
print(df_bool.sum(axis=1))
then I got an expected result showing below.
0 0
1 0
2 1
3 0
4 1
5 8
6 1
7 0
So removed the row #5 as I indicated below.
df2 = df.drop([5], axis=0)
print(df2)
This works well even this is not an elegant, kind of a stupid way to go though.
However, if I import my dataset as header=0, then this approach did not work at all.
df = pd.read_csv("dataset.csv", header=0)
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
How come this happens?
Also, if I would like to write a code with loop, count and drop functions, what does the code look like?
You can just continue using boolean_indexing:
First we calculate number of columns and number of zeroes per row:
n_columns = len(df.columns) # or df.shape[1]
zeroes = (df == "0").sum(axis=1)
We then select only rows that have less than 20 % zeroes.
proportion_zeroes = zeroes / n_columns
max_20 = proportion_zeroes < 0.20
df[max_20] # This will contain only rows that have less than 20 % zeroes
One liner:
df[((df == "0").sum(axis=1) / len(df.columns)) < 0.2]
It would have been great if you could have posted how the dataframe looks in pandas rather than a picture of an excel file. However, constructing a dummy df
df = pd.DataFrame({'index1':['a','b','c'],'index2':['b','g','f'],'index3':['w','q','z']
,'Col1':[0,1,0],'Col2':[1,1,0],'Col3':[1,1,1],'Col4':[2,2,0]})
Step1, assigning the index can be done using the .set_index() method as per below
df.set_index(['index1','index2','index3'],inplace=True)
instead of doing everything manually when it comes fo filtering out, you can use the return you got from df_bool.sum(axis=1) in the filtering of the dataframe as per below
df.loc[(df==0).sum(axis=1) / (df.shape[1])>0.6]
index1 index2 index3 Col1 Col2 Col3 Col4
c f z 0 0 1 0
and using that you can drop those rows, assuming 20% then you would use
df = df.loc[(df==0).sum(axis=1) / (df.shape[1])<0.2]
Ween it comes to the header issue it's a bit difficult to answer without seeing the what the file or dataframe looks like

Categories

Resources