I have this dataframe
ORF IDClass genName ORFDesc
0 b186 [1,1,1,0] 'bglS' beta-glucosidase
1 b2202 [1,1,1,0] 'cbhK' carbohydrate kinase
2 b727 [1,1,1,0] 'fucA' L-fuculose phosphate aldolase
3 b1731 [1,1,1,0] 'gabD1' succinate-semialdehyde dehydrogenase
4 b234 [1,1,1,0] 'gabD2' succinate-semialdehyde dehydrogenase
and I need to count how many registers have IDClass = [1,1,1,0], IDClass = [1,2,0,0] etc
Im using he str.count().sum() function but it returns me more ocurrences than registers in my dataset. What am I doing wrong?
Ex:
IN: count = df2.IDClass.str.count('[1,1,1,0]').sum()
OUT: [3924 rows x 4 columns]
21552
If I do:
IN: count = df2.IDClass.str.count('[1,1,1,0]').sum()
OUT: [3924 rows x 4 columns]
0 7
1 7
2 7
3 7
4 7
..
3919 6
3920 6
3921 6
3922 6
3923 6
Any idea?
Thanks is advance,
If your IDClass is string type, you can just do:
df['IDClass'].value_counts()
If that gives an error, it's likely that your IDClass is list type. Then you can use tuple:
df['IDClass'].apply(tuple).value_counts()
Related
I have this dataFrame. Each row of column data is a list containing around 50 data points. And I want to count the number of occurrences of numbers over 50 and over 20.
>>> df['data'].head(10)
0 [33.23, 51.02, 32.01 ...
1 [99.04, 38.06, 39.57...
2 [96.04, 96.72, 401.93...
3 [96.64, 99.15, 99.83...
4 [96.71, 38.93, 53.02....
5 [88.72, 37.61, 39.61...
6 [38.93, 88.72, 37.31...
7 [88.72, 39.61, 35.71...
8 [97.44, 99.04, 88.56....
9 [00.14, 89.61, 39.95...
If we transform the df to dic, it would look like below:
>>> df.to_dict()
{'data': {'row1': [33.23, 51.02, 32.01,...], 'row2': [99.04, 38.06, 39.57,...],'row3': [96.04, 96.72, 401.93,...],'row4'...}}
The expected result i would like to get is a new column called result and it stores the count of values in data column over 50.0 or over 20.0 if no values are over 50.0:
>>> df.show()
data result
0 [33.23, 51.02, 32.01 ... 1
1 [99.04, 38.06, 39.57... 1
2 [96.04, 96.72, 401.93... 3
3 [96.64, 99.15, 99.83... 3
4 [96.71, 38.93, 53.02.... 2
This is the method i used:
pandas_data_frame[result_column] = pandas_data_frame.apply(
lambda row: count_values(row[data]), axis=1)
def count__values(numlist):
count1 = sum(
x >= 50.0 for x in list)
count2 = sum(
x >= 20.0 for x in list)
return count1 if count1 > 0 else count2
However the dataframe can be extremely huge and i was wondering if there is any pandas method to improve the performance? Thanks.
Try with explode and groupby:
df[[">=50", ">=20"]] = (df.explode("data")
.groupby(level=0)["data"]
.agg([lambda x: x.ge(50).sum(),
lambda x: x.ge(20).sum()]
)
)
>>> df
data >=50 >=20
0 [33.23, 51.02, 32.01] 1 3
1 [99.04, 38.06, 39.57] 1 3
2 [96.04, 96.72, 401.93] 3 3
3 [96.64, 99.15, 99.83] 3 3
4 [96.71, 38.93, 53.02] 2 3
5 [88.72, 37.61, 39.61] 1 3
6 [38.93, 88.72, 37.31] 1 3
7 [88.72, 39.61, 35.71] 1 3
8 [97.44, 99.04, 88.56] 3 3
9 [0.14, 89.61, 39.95] 1 2
Can you help on the following task? I have a dataframe column such as:
index df['Q0']
0 1
1 2
2 3
3 5
4 5
5 6
6 7
7 8
8 3
9 2
10 4
11 7
I want to substitute the values in df.loc[3:8,'Q0'] with the values in df.loc[0:2,'Q0'] if df.loc[0,'Q0']!=df.loc[3,'Q0']
The result should look like the one below:
index df['Q0']
0 1
1 2
2 3
3 1
4 2
5 3
6 1
7 2
8 3
9 2
10 4
11 7
I tried the following line:
df.loc[3:8,'Q0'].where(~df.loc[0,'Q0']!=df.loc[3,'Q0']),other=df.loc[0:2,'Q0'],inplace=True)
or
df['Q0'].replace(to_replace=df.loc[3:8,'Q0'], value=df.loc[0:2,'Q0'], inplace=True)
But it doesn't work. Most possible I am doing something wrong.
Any suggestions?
You can use the cycle function:
from itertools import cycle
c = cycle(df["Q0"][0:3])
if df.Q0[0] != df.Q0[3]:
df["Q0"][3:8] = [next(c) for _ in range(5)]
Thanks for the replies. I tried the suggestions but I have some issues:
#adnanmuttaleb -
When I applied the function in a dataframe with more than 1 column (e.g. 12x2 or larger) I notice that the value in df.Q0[8] didn't change. Why?
#jezrael -
When I adjust to your suggestion I get the error:
ValueError: cannot copy sequence with size 5 to array axis with dimension 6
When I change the range to 6, I am getting wrong results
import pandas as pd
from itertools import cycle
data={'Q0':[1,2,3,5,5,6,7,8,3,2,4,7],
'Q0_New':[0,0,0,0,0,0,0,0,0,0,0,0]}
df = pd.DataFrame(data)
##### version 1
c = cycle(df["Q0"][0:3])
if df.Q0[0] != df.Q0[3]:
df['Q0_New'][3:8] = [next(c) for _ in range(5)]
##### version 2
d = cycle(df.loc[0:3,'Q0'])
if df.Q0[0] != df.Q0[3]:
df.loc[3:8,'Q0_New'] = [next(d) for _ in range(6)]
Why we have different behaviors and what corrections need to be made?
Thanks once more guys.
i want to add a dot to each string in dataframe,for example 49454170 become 4945.4170
below frame:
Missed Trades
0 49454170
1 49532878
2 49511387
3 49451350
4 49402211
5 49403961
6 49331707
7 49320696
Here's one approach using str.findall and join back the resulting lists with str.join:
df['Missed Trades'].str.findall(r'(\d{4})').str.join('.')
0 4945.4170
1 4953.2878
2 4951.1387
3 4945.1350
4 4940.2211
5 4940.3961
6 4933.1707
7 4932.0696
Name: MissedTrades, dtype: object
Something like this might work -
df['Missed Trades'] = df['Missed Trades'].astype(int) / 10000
df['Missed Trades'] = df['Missed Trades'].map(lambda x : x[0:4]+'.'+x[4:8])
Missed Trades
0 4945.4170
1 4953.2878
2 4951.1387
3 4945.1350
4 4940.2211
5 4940.3961
6 4933.1707
7 4932.0696
Hello I read in an excel file as a DataFrame whose rows contains multiple values. The shape of the df is like:
Welding
0 65051020 ...
1 66053510 66053550 ...
2 66553540 66553560 ...
3 67053540 67053505 ...
now I want to split each row and write each entry into an own row like
Welding
0 65051020
1 66053510
2 66053550
....
n 67053505
I tried have tried:
[new.append(df.loc[i,"Welding"].split()) for i in range(len(df))]
df2=pd.DataFrame({"Welding":new})
print(df2)
Welding
0 66053510
1 66053550
2 66053540
3 66053505
4 66053551
5 [65051020, 65051010, 65051030, 65051035, 65051...
6 [66053510, 66053550, 66053540, 66053505, 66053...
7 [66553540, 66553560, 66553505, 66553520, 66553...
8 [67053540, 67053505, 67057505]
9 [65051020, 65051010, 65051030, 65051035, 65051...
10 [66053510, 66053550, 66053540, 66053505, 66053...
11 [66553540, 66553560, 66553505, 66553520, 66553...
12 [67053540, 67053505, 67057505]
13 [65051020, 65051010, 65051030, 65051035, 65051...
14 [66053510, 66053550, 66053540, 66053505, 66053...
15 [66553540, 66553560, 66553505, 66553520, 66553...
16 [67053540, 67053505, 67057505]
But this did not return the expected results.
Appreciate each help!
Use split with stack and last to_frame:
df = df['Welding'].str.split(expand=True).stack().reset_index(drop=True).to_frame('Welding')
print (df)
Welding
0 65051020
1 66053510
2 66053550
3 66553540
4 66553560
5 67053540
6 67053505
I have two data frames as below:
Sample_name C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
0 1 1 0.161456 0.033139 0.991840 2.111023 0.846197
1 1 10 0.636140 1.024235 36.333741 16.074662 3.142135
2 1 13 0.605840 0.034337 2.085061 2.125908 0.069698
3 1 14 0.038481 0.152382 4.608259 4.960007 0.162162
4 1 5 0.035628 0.087637 1.397457 0.768467 0.052605
5 1 6 0.114375 0.020196 0.220193 7.662065 0.077727
Sample_name C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
0 1 1 0.305224 0.542488 66.428382 73.615079 10.342252
1 1 10 0.814696 1.246165 73.802644 58.064363 11.179206
2 1 13 0.556437 0.517383 50.555948 51.913547 9.412299
3 1 14 0.314058 1.148754 56.165767 61.261950 9.142128
4 1 5 0.499129 0.460813 40.182454 41.770906 8.263437
5 1 6 0.300203 0.784065 47.359506 52.841821 9.833513
I want to divide the numerical values in the selected cells of the first by the second and I am using the following code:
df1_int.loc[:,'C14-Cer':].div(df2.loc[:,'C14-Cer':])
However, this way I lose the information from the column "Sample_name".
C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
0 0.528977 0.061088 0.014931 0.028677 0.081819
1 0.780831 0.821909 0.492309 0.276842 0.281070
2 1.088785 0.066367 0.041243 0.040951 0.007405
3 0.122529 0.132650 0.082047 0.080964 0.017738
4 0.071381 0.190178 0.034778 0.018397 0.006366
5 0.380993 0.025759 0.004649 0.145000 0.007904
How can I perform the division while keeping the column "Sample_name" in the resulting dataframe?
You can selectively overwrite using loc, the same way that you're already performing the division:
df1_int.loc[:,'C14-Cer':] = df1_int.loc[:,'C14-Cer':].div(df2.loc[:,'C14-Cer':])
This preserves the sample_name col:
In [12]:
df.loc[:,'C14-Cer':] = df.loc[:,'C14-Cer':].div(df1.loc[:,'C14-Cer':])
df
Out[12]:
Sample_name C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
index
0 1 1 0.528975 0.061087 0.014931 0.028677 0.081819
1 1 10 0.780831 0.821910 0.492309 0.276842 0.281070
2 1 13 1.088785 0.066367 0.041243 0.040951 0.007405
3 1 14 0.122528 0.132650 0.082047 0.080964 0.017738
4 1 5 0.071380 0.190179 0.034778 0.018397 0.006366
5 1 6 0.380992 0.025758 0.004649 0.145000 0.007904