pandas groupby where and else - python

I have a dataframe like this:
col1 col2
0 a 100
1 a 200
2 a 150
3 b 1000
4 c 400
5 c 200
What I want to do is group by col1, count the number of occurrences, and if the count is greater than or equal to 2, calculate the mean of col2 for those rows; if not, apply another function. The output should be:
col1 mean
0 a 150
1 b whatever aggregator function returns
2 c 300
I followed @ansev's solution in "pandas groupby count and then conditional mean", however I don't want to replace those rows with NaN; I want to replace them with a value returned from another function, like this:
def aggregator(col1, col2):
    return col1 + col2
Please keep in mind that the actual aggregator function is more complicated and depends on other tables; this one is just to simplify the question.

I'm not sure this is exactly what you need, but you can resort to apply:
def aggregator(x):
    if len(x) == 1:
        return pd.Series((x['col1'] + x['col2'].astype(str)).values)
    else:
        return pd.Series(x['col2'].mean())

df.groupby('col1').apply(aggregator)
Output:
           0
col1
a        150
b      b1000
c        300
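For reference, a self-contained sketch of the same idea, with the sample data rebuilt from the question; the single-row branch below just uses the placeholder concatenation from above, not the real aggregator:

import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'c', 'c'],
                   'col2': [100, 200, 150, 1000, 400, 200]})

def aggregator(group):
    # groups with 2 or more rows get the mean of col2;
    # single-row groups fall back to the custom logic (placeholder here)
    if len(group) >= 2:
        return group['col2'].mean()
    return (group['col1'] + group['col2'].astype(str)).iloc[0]

print(df.groupby('col1').apply(aggregator).reset_index(name='mean'))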

Related

Dropping column if more than half of the values are same - Python

I have a pandas DataFrame that looks like the picture in the original post.
I want to delete any column where more than half of the values are the same, and I don't know how to do this.
I tried using pandas.Series.value_counts, but with no luck.
You can iterate over the columns, count the occurrences of values with value_counts as you tried, and check whether the most common value makes up more than 50% of the column's data.
n = len(df)
cols_to_drop = []
for e in list(df.columns):
    max_occ = df[e].value_counts().iloc[0]  # occurrences of the most common value in this column
    if 2 * max_occ > n:  # check if it is more than half the length of the dataset
        cols_to_drop.append(e)
df = df.drop(cols_to_drop, axis=1)
You can use apply + value_counts, taking the first value to get the max count:
count = df.apply(lambda s: s.value_counts().iat[0])
col1 4
col2 2
col3 6
dtype: int64
Thus, simply turn it into a mask depending on whether the greatest count is more than half len(df), and slice:
count = df.apply(lambda s: s.value_counts().iat[0])
df.loc[:, count.le(len(df)/2)] # use 'lt' if needed to drop if exactly half
output:
col2
0 0
1 1
2 0
3 1
4 2
5 3
Used input:
df = pd.DataFrame({'col1': [0, 1, 0, 0, 0, 1],
                   'col2': [0, 1, 0, 1, 2, 3],
                   'col3': [0, 0, 0, 0, 0, 0],
                   })
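If you prefer to drop the columns instead of slicing, the same counts can feed df.drop; a small follow-up sketch under the same assumptions:

count = df.apply(lambda s: s.value_counts().iat[0])
to_drop = count[count.gt(len(df) / 2)].index  # columns whose most common value covers more than half the rows
df = df.drop(columns=to_drop)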
Boolean slicing with a comprehension:
df.loc[:, [
    df.shape[0] // s.value_counts().max() >= 2
    for _, s in df.items()
]]
col2
0 0
1 1
2 0
3 1
4 2
5 3
Credit to @mozway for the input data.

Faster method of extracting characters for multiple columns in dataframe

I have a pandas DataFrame with multiple columns containing string data in a format like this:
id col1 col2 col3
1 '1:correct' '0:incorrect' '1:correct'
2 '0:incorrect' '1:correct' '1:correct'
What I would like to do is to extract the numeric character before the colon : symbol. The resulting data should look like this:
id col1 col2 col3
1 1 0 1
2 0 1 1
What I have tried is using regex, like following:
colname = ['col1', 'col2', 'col3']
row = len(df)
for col in colname:
    df[col] = df[col].str.findall(r"(\d+):")
    for i in range(0, row):
        df[col].iloc[i] = df[col].iloc[i][0]
    df[col] = df[col].astype('int64')
The second loop selects the first (and only) element of the list created by the regex. I then convert the object dtype to integer. This code basically does what I want, but it is way too slow even for a small dataset with a few thousand rows. I have heard that loops are not very efficient in Python.
Is there a faster, more Pythonic way of extracting numerics in a string and converting it to integers?
Use Series.str.extract to get the first value before : inside DataFrame.apply, processing each column with a lambda function:
colname = ['col1','col2','col3']
f = lambda x: x.str.extract(r"(\d+):", expand=False)
df[colname] = df[colname].apply(f).astype('int64')
print (df)
id col1 col2 col3
0 1 1 0 1
1 2 0 1 1
Another solution: split on : and select the first value:
colname = ['col1','col2','col3']
f = lambda x: x.str.strip("'").str.split(':').str[0]
df[colname] = df[colname].apply(f).astype('int64')
print (df)
id col1 col2 col3
0 1 1 0 1
1 2 0 1 1
Another option is a list comprehension; since this involves strings, it should be fast:
import re

pattern = re.compile(r"\d(?=:)")
result = {key: [int(pattern.search(arr).group(0))
                if isinstance(arr, str)
                else arr
                for arr in value.array]
          for key, value in df.items()}
pd.DataFrame(result)
id col1 col2 col3
0 1 1 0 1
1 2 0 1 1
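All of the snippets above assume a frame shaped like the one in the question; a minimal reconstruction to test against (values taken from the question, without the literal quotes):

import pandas as pd

df = pd.DataFrame({'id': [1, 2],
                   'col1': ['1:correct', '0:incorrect'],
                   'col2': ['0:incorrect', '1:correct'],
                   'col3': ['1:correct', '1:correct']})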

Calculate perc in Pandas Dataframe based on rows having a specific condition for each distinct value in column

I have a dataframe with sample values as given below
col1  col2
A     ['1','2','er']
A     []
B     ['3','4','ac']
B     ['5']
C     []
For each value in col1, I want to calculate the percentage of its rows whose col2 is not an empty list.
I am able to do it if there is a single value in col1. I am looking for a solution to do this iteratively. Thanks for the help.
I believe you need to compare the list lengths with 0 (greater than), convert to numbers and then aggregate with mean:
df1 = df['col2'].str.len().gt(0).view('i1').groupby(df['col1']).mean().reset_index(name='%')
print (df1)
col1 %
0 A 0.5
1 B 1.0
2 C 0.0
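To sanity-check this, the sample frame can be rebuilt from the question; the sketch below uses astype(int) instead of view('i1'), which gives the same mean:

import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B', 'C'],
                   'col2': [['1', '2', 'er'], [], ['3', '4', 'ac'], ['5'], []]})

pct = df['col2'].str.len().gt(0).astype(int).groupby(df['col1']).mean().reset_index(name='%')
print(pct)  # A 0.5, B 1.0, C 0.0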

Unique UUID based on n columns of a Pandas dataframe (to handle duplicates in ElasticSearch)

I am creating a function to set a UUID column based on the values of other columns. What I want is to handle duplicates when indexing dataframes into Elasticsearch. The UUID should always be the same for a given combination of values in those columns.
I am having a problem with the output: the same UUID is generated for every row.
Dataframe
cols = ['col1', 'col2']
data = {'col1': ['Mike', 'Robert', 'Sandy'],
        'col2': ['100', '200', '300']}
df = pd.DataFrame(data)
col1 col2
0 Mike 100
1 Robert 200
2 Sandy 300
Function
import uuid

def create_uuid_on_n_col(df):
    # concat column string values
    concat_col_str_id = df.apply(lambda x: uuid.uuid5(uuid.NAMESPACE_DNS, '_'.join(map(str, x))), axis=1)
    return concat_col_str_id[0]
Output
df['id'] = create_uuid_on_n_col(df[['col1','col2']])
col1 col2 id
0 Mike 100 a17ad043-486f-5eeb-8138-8fa2b10659fd
1 Robert 200 a17ad043-486f-5eeb-8138-8fa2b10659fd
2 Sandy 300 a17ad043-486f-5eeb-8138-8fa2b10659fd
Your function builds a UUID for every row but then returns only the first one (concat_col_str_id[0]), which is why every row ends up with the same value. There's no need to define a separate helper function: the joining of the columns can be vectorized as shown below.
from functools import partial
p = partial(uuid.uuid5, uuid.NAMESPACE_DNS)
df.assign(id=(df.col1 + '_' + df.col2).apply(p))
col1 col2 id
0 Mike 100 a17ad043-486f-5eeb-8138-8fa2b10659fd
1 Robert 200 e520efd5-157a-57ee-84fb-41b9872af407
2 Sandy 300 11208b7c-b99b-5085-ad98-495004e6b043
If you don't want to import partial, then define a function:
def custom_uuid(data):
    val = uuid.uuid5(uuid.NAMESPACE_DNS, data)
    return val
df.assign(id=(df.col1 + '_' + df.col2).apply(custom_uuid))
Your original function can also be fixed as shown below:
def create_uuid_on_n_col(df):
    temp = df.agg('_'.join, axis=1)
    return df.assign(id=temp.apply(custom_uuid))
create_uuid_on_n_col(df[['col1','col2']])
col1 col2 id
0 Mike 100 a17ad043-486f-5eeb-8138-8fa2b10659fd
1 Robert 200 e520efd5-157a-57ee-84fb-41b9872af407
2 Sandy 300 11208b7c-b99b-5085-ad98-495004e6b043
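If the set of key columns varies, one way to generalize this (a sketch along the same lines, not from the original answers, with a hypothetical helper name) is to pass the column list explicitly:

import uuid

def add_uuid(df, cols):
    # join the chosen columns row-wise, then derive a deterministic uuid5 per row
    keys = df[cols].astype(str).agg('_'.join, axis=1)
    return df.assign(id=keys.apply(lambda k: uuid.uuid5(uuid.NAMESPACE_DNS, k)))

add_uuid(df, ['col1', 'col2'])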

pandas multiply other columns by another

I would like to apply a scaling factor to some data:
col1 col2 col3
10 4 5
100 2 3
1000 6 7
Then, i would like the output to be:
col1 col2 col3
10 40 50
100 200 300
1000 6000 7000
I was trying to use a lambda but it kept throwing me errors.
pandas.DataFrame.mul with axis=0
When Pandas operates between a DataFrame and a Series it aligns the index of the Series with the columns of the DataFrame. We can alter that behavior by utilizing the equivalent operation method and passing the axis=0 argument to tell Pandas to align the Series index with the DataFrame index.
df[['col2', 'col3']] = df[['col2', 'col3']].mul(df['col1'], axis=0)
df
col1 col2 col3
0 10 40 50
1 100 200 300
2 1000 6000 7000
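To see why axis=0 matters here: with the default alignment the Series index is matched against the column labels, which yields all NaN. A small illustration, with the sample data rebuilt from the question:

import pandas as pd

df = pd.DataFrame({'col1': [10, 100, 1000],
                   'col2': [4, 2, 6],
                   'col3': [5, 3, 7]})

print(df[['col2', 'col3']] * df['col1'])             # aligns on column labels -> all NaN
print(df[['col2', 'col3']].mul(df['col1'], axis=0))  # aligns on the row index -> scaled values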
A shorter way of doing this:
df.update(df.drop('col1', axis=1).mul(df.col1, axis=0))
Inline and not in place, meaning it produces a copy and leaves the original alone:
df.assign(**df.drop('col1', axis=1).mul(df.col1, axis=0))
col1 col2 col3
0 10 40 50
1 100 200 300
2 1000 6000 7000
Afterthought
I was messing around with this completely hacky way of doing it.
[df.get(c).__imul__(df.col1) for c in [*df][1:]];
Super gross as it depends on a side effect of the comprehension and throws the result of the comprehension away.
Please ignore this!
