I have a dataframe with a column of strings and I would like to add columns with the number of occurrences of each character, sorted from the maximum to the minimum number of occurrences.
The dataframe is very big, so I need an efficient way to calculate it.
Original df:
Item
0 ABABCBF
1 ABABCGH
2 ABABEFR
3 ABABFBF
4 ABACTC3
Wanted df:
Item o1 o2 o3 o4 o5
0 ABABCBF 3 2 1 1 null
1 ABABCGH 2 2 1 1 1
2 ABABEFR 2 2 1 1 1
3 ABABFBF 3 2 2 null null
4 ABACTC3 2 2 1 1 1
I have tried using collections.Counter but I am not able to convert the result into columns of the dataframe:
collections.Counter(df['item'])
Thanks
You can use collections.Counter together with the DataFrame constructor:
import pandas as pd
from collections import Counter

out = df.join(
    pd.DataFrame(sorted(Counter(x).values(), reverse=True) for x in df['Item'])
      .rename(columns=lambda x: f'o{x+1}')
)
print(out)
Output:
Item o1 o2 o3 o4 o5
0 ABABCBF 3 2 1 1.0 NaN
1 ABABCGH 2 2 1 1.0 1.0
2 ABABEFR 2 2 1 1.0 1.0
3 ABABFBF 3 2 2 NaN NaN
4 ABACTC3 2 2 1 1.0 1.0
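As a quick sanity check on a single string (a sketch, not part of the original answer), the inner expression produces the per-row counts sorted from most to least frequent:

from collections import Counter
# 'ABABCBF': B appears 3 times, A twice, C and F once each
sorted(Counter('ABABCBF').values(), reverse=True)
# [3, 2, 1, 1]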
Try:
import json
import pandas as pd
from collections import Counter

df = pd.DataFrame({'Item': ['ABACABDF', 'BACBDFHGAAAA']})

result = df.join(
    pd.DataFrame(
        json.loads(
            df['Item']
            .transform(lambda x: sorted(Counter(x).values(), reverse=True))
            .to_json(orient='records')
        )
    )
    .rename(columns=lambda x: f'o{x+1}')
)
result
Item o1 o2 o3 o4 o5 o6 o7
0 ABACABDF 3 2 1 1 1 NaN NaN
1 BACBDFHGAAAA 5 2 1 1 1 1.0 1.0
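For what it's worth, the to_json/json.loads round trip is probably not necessary; building the frame directly from a list of sorted count lists (a sketch, essentially the same as the first answer above) should give the same result:

counts = [sorted(Counter(x).values(), reverse=True) for x in df['Item']]
result = df.join(pd.DataFrame(counts).rename(columns=lambda x: f'o{x+1}'))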
Try this:
def count_chars(txt: str):
    ser = pd.Series([*txt])
    result = ser.value_counts().tolist()
    return result

result = df.join(
    pd.DataFrame([*df['Item'].apply(count_chars)])
      .rename(columns=lambda x: f'o{x+1}')
)
print(result)
>>>
Item o1 o2 o3 o4 o5
0 ABABCBF 3 2 1 1.0 NaN
1 ABABCGH 2 2 1 1.0 1.0
2 ABABEFR 2 2 1 1.0 1.0
3 ABABFBF 3 2 2 NaN NaN
4 ABACTC3 2 2 1 1.0 1.0
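Since the original dataframe is very big, it may be worth benchmarking the approaches; the Counter-based versions avoid building a pandas Series per row, which usually matters at scale. A rough sketch (in IPython/Jupyter, using the count_chars function from this answer; timings will depend on your data and machine):

from collections import Counter
import pandas as pd

big = pd.DataFrame({'Item': ['ABABCBF', 'ABABCGH', 'ABABEFR'] * 100_000})

%timeit pd.DataFrame(sorted(Counter(x).values(), reverse=True) for x in big['Item'])
%timeit pd.DataFrame([*big['Item'].apply(count_chars)])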
Given this df
from io import StringIO
import pandas as pd
data = StringIO('''gene_variant gene val1 val2 val3
b1 b 1 1 1
b2 b 2 11 1
b3 b 3 11 1
c2 c 1 1 1
t1 t 1 1 1
t2 t 12 2 2
t4 t 12 3 2
t5 t 1 4 3
d2 d 11 1 2
d4 d 11 1 1''')
df = pd.read_csv(data, sep='\t')
How do I get, for each gene, the gene_variant that corresponds to the max value of val1 if that max is not duplicated; if it is duplicated, the gene_variant corresponding to the max value of val2 if that max is not duplicated; and otherwise just the max of val3? I.e., any tie is decided by the next column, up to the third one.
EDIT: The column val2 is only considered if the max value in val1 is duplicated (a tie), and the same goes for val3. Once the max in val1 (or val2) is a tie, the values in that column are no longer considered at all; only the values of one column at a time are compared.
I've been trying solutions based on:
df.groupby('gene').agg(max)
and:
df.groupby('gene').rank('max')
But I can't get there without dropping out into iteration...
The correct answer would be:
b3 3
c2 1
t5 4
d2 2
Thanks in advance!
If you need the maximum only from columns whose values are not duplicated within their group, you can use:
# count the number of unique values per column within each group
df1 = df.groupby('gene').transform('nunique')
# `gene_variant` is unique per row, so its count equals the group size
s = df1.pop('gene_variant')

# keep only columns with no duplicated values within the group (others become NaN);
# option 1: the maximum is then taken over all remaining columns
max1 = df.where(df1.eq(s, axis=0)).max(axis=1)
# option 2: the maximum is taken in column order (first val1, then val2, then val3)
# by back filling the NaNs and selecting the first column
max1 = df.where(df1.eq(s, axis=0)).bfill(axis=1).iloc[:, 0]

# assign the helper column
df = df.assign(max1=max1)
# keep, per group, the original row with the largest max1
df = df.loc[df.groupby('gene', sort=False)['max1'].idxmax(), ['gene_variant','max1']]
print (df)
gene_variant max1
2 b3 3.0
3 c2 1.0
7 t5 4.0
8 d2 2.0
How it works:
df1 = df.groupby('gene').transform('nunique')
s = df1.pop('gene_variant')
print (df.where(df1.eq(s, axis=0)))
gene_variant gene val1 val2 val3
0 NaN NaN 1.0 NaN NaN
1 NaN NaN 2.0 NaN NaN
2 NaN NaN 3.0 NaN NaN
3 NaN NaN 1.0 1.0 1.0
4 NaN NaN NaN 1.0 NaN
5 NaN NaN NaN 2.0 NaN
6 NaN NaN NaN 3.0 NaN
7 NaN NaN NaN 4.0 NaN
8 NaN NaN NaN NaN 2.0
9 NaN NaN NaN NaN 1.0
#max of all columns
print (df.where(df1.eq(s, axis=0)).max(axis=1))
0 1.0
1 2.0
2 3.0
3 1.0
4 1.0
5 2.0
6 3.0
7 4.0
8 2.0
9 1.0
dtype: float64
#back fill NaNs
print (df.where(df1.eq(s, axis=0)).bfill(axis=1))
gene_variant gene val1 val2 val3
0 1.0 1.0 1.0 NaN NaN
1 2.0 2.0 2.0 NaN NaN
2 3.0 3.0 3.0 NaN NaN
3 1.0 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0 NaN
5 2.0 2.0 2.0 2.0 NaN
6 3.0 3.0 3.0 3.0 NaN
7 4.0 4.0 4.0 4.0 NaN
8 2.0 2.0 2.0 2.0 2.0
9 1.0 1.0 1.0 1.0 1.0
# select the first column
print (df.where(df1.eq(s, axis=0)).bfill(axis=1).iloc[:, 0])
0 1.0
1 2.0
2 3.0
3 1.0
4 1.0
5 2.0
6 3.0
7 4.0
8 2.0
9 1.0
Name: gene_variant, dtype: float64
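If you need this repeatedly, the same steps can be wrapped in a small helper (a sketch of option 2, the column-order fallback, from above; max_per_gene is just an illustrative name):

def max_per_gene(frame):
    # per-group count of unique values; gene_variant is unique, so its count equals the group size
    counts = frame.groupby('gene').transform('nunique')
    size = counts.pop('gene_variant')
    # keep only columns with no duplicated values in the group, then take the first one in column order
    best = frame.where(counts.eq(size, axis=0)).bfill(axis=1).iloc[:, 0]
    out = frame.assign(max1=best)
    return out.loc[out.groupby('gene', sort=False)['max1'].idxmax(),
                   ['gene_variant', 'max1']]

print(max_per_gene(df))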
You could use .sort_values() to get the maximum values. If you pass it multiple columns, later columns are used to break ties.
In [9]: df.sort_values(["val1", "val2", "val3"])
Out[9]:
gene_variant gene val1 val2 val3
0 b1 b 1 1 1
3 c2 c 1 1 1
4 t1 t 1 1 1
9 d4 d 1 1 1
8 d2 d 1 1 2
7 t5 t 1 4 3
1 b2 b 2 1 1
5 t2 t 2 2 2
6 t4 t 2 3 2
2 b3 b 3 1 1
Now, in order to do this for each gene you can groupby('gene') and apply a custom function.
In [11]: df.groupby("gene").apply(
...: lambda _df: _df.sort_values(["val1", "val2", "val3"], ascending=False)
...: .head(1)
...: .squeeze()
...: )
Out[11]:
gene_variant gene val1 val2 val3
gene
b b3 b 3 1 1
c c2 c 1 1 1
d d2 d 1 1 2
t t4 t 2 3 2
However, this does not tell you which val column won the tiebreaker, and note that lexicographic sorting is not quite the rule in the question: for gene t it picks t4 rather than t5.
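If you do need the exact rule from the question (only fall back to the next column when the current column's maximum is tied within the group), a plain groupby/apply sketch like the following should reproduce the expected output, at the cost of speed on large frames; pick_variant is just an illustrative name:

def pick_variant(g):
    # walk the value columns left to right and stop at the first one
    # whose maximum is not tied within the group
    for col in ['val1', 'val2', 'val3']:
        mx = g[col].max()
        if (g[col] == mx).sum() == 1:
            return pd.Series({'gene_variant': g.loc[g[col].idxmax(), 'gene_variant'],
                              'max_val': mx})
    # all three columns tied: fall back to the first row with the max of val3
    return pd.Series({'gene_variant': g.loc[g['val3'].idxmax(), 'gene_variant'],
                      'max_val': g['val3'].max()})

result = df.groupby('gene', sort=False).apply(pick_variant)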
Let's say we want to compute the variable D in the dataframe below, based on the time values in variables B and C.
Here, the second row of D is C2 - B1, a difference of 4 minutes;
the third row is C3 - B2 = 4 minutes, and so on.
There is no reference value for the first row of D, so it is NA.
Issue:
We also want an NA value in the first row where the category value in variable A changes from 1 to 2. In other words, the value -183 must be replaced by NA.
A B C D
1 5:43:00 5:24:00 NA
1 6:19:00 5:47:00 4
1 6:53:00 6:23:00 4
1 7:29:00 6:55:00 2
1 8:03:00 7:31:00 2
1 8:43:00 8:05:00 2
2 6:07:00 5:40:00 -183
2 6:42:00 6:11:00 4
2 7:15:00 6:45:00 3
2 7:53:00 7:17:00 2
2 8:30:00 7:55:00 2
2 9:07:00 8:32:00 2
2 9:41:00 9:09:00 2
2 10:17:00 9:46:00 5
2 10:52:00 10:20:00 3
You can use:
import numpy as np
import pandas as pd

# compute the delta in minutes between C and the previous row's B
df['D'] = (pd.to_timedelta(df['C']).sub(pd.to_timedelta(df['B'].shift()))
             .dt.total_seconds().div(60))

# set NaN where the group in A changes
df.loc[df['A'].ne(df['A'].shift()), 'D'] = np.nan
Output:
>>> df
A B C D
0 1 5:43:00 5:24:00 NaN
1 1 6:19:00 5:47:00 4.0
2 1 6:53:00 6:23:00 4.0
3 1 7:29:00 6:55:00 2.0
4 1 8:03:00 7:31:00 2.0
5 1 8:43:00 8:05:00 2.0
6 2 6:07:00 5:40:00 NaN
7 2 6:42:00 6:11:00 4.0
8 2 7:15:00 6:45:00 3.0
9 2 7:53:00 7:17:00 2.0
10 2 8:30:00 7:55:00 2.0
11 2 9:07:00 8:32:00 2.0
12 2 9:41:00 9:09:00 2.0
13 2 10:17:00 9:46:00 5.0
14 2 10:52:00 10:20:00 3.0
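A one-step variant (a sketch) shifts B within each group, so the first row of every group becomes NaN automatically and the explicit mask is no longer needed:

b = pd.to_timedelta(df['B'])
c = pd.to_timedelta(df['C'])
# shifting within groups of A leaves NaT on each group's first row
df['D'] = c.sub(b.groupby(df['A']).shift()).dt.total_seconds().div(60)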
You can use the difference between datetime columns in pandas.
Having
df['B_dt'] = pd.to_datetime(df['B'])
df['C_dt'] = pd.to_datetime(df['C'])
Makes the following possible
>>> df['D'] = (df.groupby('A')
.apply(lambda s: (s['C_dt'] - s['B_dt'].shift()).dt.seconds / 60)
.reset_index(drop=True))
You can always drop these new columns later.
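One caveat with this variant: .dt.seconds is only the seconds component of the timedelta and wraps around for negative differences, so .dt.total_seconds() is usually the safer choice if C can ever be earlier than the shifted B (a sketch of the same call):

df['D'] = (df.groupby('A')
             .apply(lambda s: (s['C_dt'] - s['B_dt'].shift()).dt.total_seconds() / 60)
             .reset_index(drop=True))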
I want to build a data frame with m columns and n rows.
Each row starts at 1 and increments by 1 up to m.
I've tried to find a solution, but I only found one for the columns.
I have also added a figure of a simple case.
Using assign to broadcast the rows in an empty DataFrame:
df = (
    pd.DataFrame(index=range(3))
      .assign(**{f'c{i}': i + 1 for i in range(4)})
)
Output:
c0 c1 c2 c3
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4
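With the question's m and n as variables, the same idea reads (a sketch):

m, n = 4, 3
df = (pd.DataFrame(index=range(n))
        .assign(**{f'c{i}': i + 1 for i in range(m)}))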
You can use np.tile:
import numpy as np
import pandas as pd

m = 4
n = 3
out = pd.DataFrame(np.tile(np.arange(1, m + 1), (n, 1)),
                   columns=[f'c{num}' for num in range(m)])
Output:
   c0  c1  c2  c3
0   1   2   3   4
1   1   2   3   4
2   1   2   3   4
Try this (no additional libraries needed):
df = pd.DataFrame({f'c{i}': [i + 1] * n for i in range(m)})
Result with m = 4 and n = 3:
c0 c1 c2 c3
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4
We can just use np.ones:
m = 4
n = 3
out = pd.DataFrame(np.ones((n, m)) * (np.arange(m) + 1))
Out[139]:
0 1 2 3
0 1.0 2.0 3.0 4.0
1 1.0 2.0 3.0 4.0
2 1.0 2.0 3.0 4.0
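Note that np.ones produces floats, as the output above shows; if you want integers, multiplying an integer ones array keeps the dtype (a sketch):

out = pd.DataFrame(np.ones((n, m), dtype=int) * np.arange(1, m + 1))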
I have the following dataframe, built with this code:
for i in range(int(tower1_base), int(tower1_top)):
    if i not in tower1_not_included_int:
        df = pd.concat([df, pd.DataFrame({"Tower": 1, "Floor": i, "Unit": list("ABCDEFG")})],
                       ignore_index=True)
Result:
Tower Floor Unit
0 1 1.0 A
1 1 1.0 B
2 1 1.0 C
3 1 1.0 D
4 1 1.0 E
5 1 1.0 F
6 1 1.0 G
How can I create another Index column like this?
Tower Floor Unit Index
0 1 1.0 A 1A1
1 1 2.0 B 1B2
2 1 3.0 C 1C3
3 1 4.0 D 1D4
4 1 5.0 E 1E5
5 1 6.0 F 1F6
6 1 7.0 G 1G7
You can simply concatenate the columns as strings:
df['Index'] = df['Tower'].astype(str)+df['Unit']+df['Floor'].astype(int).astype(str)
Outputs this for the first version of your dataframe:
Tower Floor Unit Index
0 1 1.0 A 1A1
1 1 1.0 B 1B1
2 1 1.0 C 1C1
3 1 1.0 D 1D1
4 1 1.0 E 1E1
5 1 1.0 F 1F1
6 1 1.0 G 1G1
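As a side note, an equivalent form using Series.str.cat (a sketch, not from the original answer) reads:

df['Index'] = (df['Tower'].astype(str)
                 .str.cat(df['Unit'])
                 .str.cat(df['Floor'].astype(int).astype(str)))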
Another approach.
I've created a copy of the dataframe with the columns reordered, to make the "melting" (joining of the values) easier.
dfAl = df.reindex(columns=['Tower', 'Unit', 'Floor'])

to_load = []                # list that will hold the new column
vals = dfAl.to_numpy()      # all values extracted as a 2-D array
for sublist in vals:
    # join the row's values as strings, dropping the '.0' from the float Floor
    combs = ''.join(str(int(i)) if isinstance(i, float) else str(i) for i in sublist)
    to_load.append(combs)
df['Index'] = to_load
If you really want the 'Index' column to be a real index, the last step:
df = df.set_index('Index')
print(df)
Tower Floor Unit
Index
1A1 1 1.0 A
1B2 1 2.0 B
1C3 1 3.0 C
1D4 1 4.0 D
1E5 1 5.0 E
1F6 1 6.0 F
1G7 1 7.0 G
I know that the fillna() method can be used to fill NaN in a whole dataframe:
df.fillna(df.mean())  # fill with the mean of each column
How do I limit the mean calculation to the group (and the column) where the NaN is?
Example:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a': pd.Series([1, 1, 1, 2, 2, 2]),
    'b': pd.Series([1, 2, np.NaN, 1, np.NaN, 4])
})
print(df)
Input
a b
0 1 1
1 1 2
2 1 NaN
3 2 1
4 2 NaN
5 2 4
Output (after groupby('a') and replacing NaN with the mean of each group)
a b
0 1 1.0
1 1 2.0
2 1 1.5
3 2 1.0
4 2 2.5
5 2 4.0
IIUC then you can call fillna with the result of groupby on 'a' and transform on 'b':
In [44]:
df['b'] = df['b'].fillna(df.groupby('a')['b'].transform('mean'))
df
Out[44]:
a b
0 1 1.0
1 1 2.0
2 1 1.5
3 2 1.0
4 2 2.5
5 2 4.0
If you have NaN values in multiple columns then I think the following should work:
In [47]:
df.fillna(df.groupby('a').transform('mean'))
Out[47]:
a b
0 1 1.0
1 1 2.0
2 1 1.5
3 2 1.0
4 2 2.5
5 2 4.0
EDIT
In [49]:
df = pd.DataFrame({
    'a': pd.Series([1, 1, 1, 2, 2, 2]),
    'b': pd.Series([1, 2, np.NaN, 1, np.NaN, 4]),
    'c': pd.Series([1, np.NaN, np.NaN, 1, np.NaN, 4]),
    'd': pd.Series([np.NaN, np.NaN, np.NaN, 1, np.NaN, 4])
})
df
Out[49]:
a b c d
0 1 1 1 NaN
1 1 2 NaN NaN
2 1 NaN NaN NaN
3 2 1 1 1
4 2 NaN NaN NaN
5 2 4 4 4
In [50]:
df.fillna(df.groupby('a').transform('mean'))
Out[50]:
a b c d
0 1 1.0 1.0 NaN
1 1 2.0 1.0 NaN
2 1 1.5 1.0 NaN
3 2 1.0 1.0 1.0
4 2 2.5 2.5 2.5
5 2 4.0 4.0 4.0
You get all NaN for 'd' because every value in group 1 is NaN for that column.
We first compute the group means, ignoring the missing values:
group_means = df.groupby('a')['b'].agg(lambda v: np.nanmean(v))
Next, we use groupby again, this time fetching the corresponding values:
df_new = df.groupby('a').apply(lambda t: t.fillna(group_means.loc[t['a'].iloc[0]]))
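Depending on your pandas version, the apply may prepend the group key as an extra index level; passing group_keys=False keeps the original index (a sketch of the same call):

df_new = (df.groupby('a', group_keys=False)
            .apply(lambda t: t.fillna(group_means.loc[t['a'].iloc[0]])))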