counting T/F values for several conditions - python

I am a beginner using pandas.
I'm looking for mutations on several patients. I have 16 different conditions. I simply write a code about it but how can do this by for loop? I try to find the changes on MUT column and set them as True and False. Then try to count the True/False numbers. I have done for only 4.
Can you suggest a simpler way, instead of writing the same code 16 times?
s1=df["MUT"]
A_T= s1.str.contains("A:T")
ATnum= A_T.value_counts(sort=True)
s2=df["MUT"]
A_G=s2.str.contains("A:G")
AGnum=A_G.value_counts(sort=True)
s3=df["MUT"]
A_C=s3.str.contains("A:C")
ACnum=A_C.value_counts(sort=True)
s4=df["MUT"]
A__=s4.str.contains("A:-")
A_num=A__.value_counts(sort=True)

I'm not an expert with pandas, so I don't know if there's a cleaner way of doing this, but perhaps the following might work?
chars = 'TGC-'
nums = {}
for char in chars:
    s = df["MUT"]
    A = s.str.contains("A:" + char)
    num = A.value_counts(sort=True)
    nums[char] = num
ATnum = nums['T']
AGnum = nums['G']
# ...etc
Basically, go through each unique character (T, G, C, -), pull out the counts you need, and stick them in a dictionary. Then, once the loop is finished, you can fetch whatever numbers you need back out of the dictionary.
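If you prefer it tighter still, the same loop collapses into a dictionary comprehension; a sketch equivalent to the code above:
# Build the whole counts dictionary in one expression.
nums = {char: df["MUT"].str.contains("A:" + char).value_counts(sort=True)
        for char in "TGC-"}
ATnum = nums['T']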

Just use value_counts; this will give you a count of every unique value in your column, with no need to create 16 variables:
In [5]:
df = pd.DataFrame({'MUT':np.random.randint(0,16,100)})
df['MUT'].value_counts()
Out[5]:
6 11
14 10
13 9
12 9
1 8
9 7
15 6
11 6
8 5
5 5
3 5
2 5
10 4
4 4
7 3
0 3
dtype: int64
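Applied back to the original MUT column (which holds strings, not the random integers above), the same idea counts every mutation pattern in one pass; a sketch assuming the values contain substrings like "A:T":
# Pull out the "X:Y" mutation pattern and count each distinct one.
df['MUT'].str.extract(r'([ATGC-]:[ATGC-])', expand=False).value_counts()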

Related

Pandas - Equal occurrences of unique type for a column

I have a pandas DataFrame called “DF”. Given an occurrence count N = 100 and column = "Type", I would like to sample 100 rows from that column's population such that the occurrences of each type are equally distributed.
SNo  Type      Difficulty
1    Single    5
2    Single    15
3    Single    4
4    Multiple  2
5    Multiple  14
6    None      7
7    None      4323
For instance, if I specify N = 3, the output must be:
SNo  Type      Difficulty
1    Single    5
3    Multiple  4
6    None      7
If, for a given N, the occurrences of certain types cannot meet the even split, I can randomly increase the count of another type.
I am wondering on how to approach this programmatically. Thanks!
Use groupby.sample (pandas ≥ 1.1) with N divided by the number of types.
NB: this assumes N is a multiple of the number of types if you want strict equality.
N = 3
N2 = N//df['Type'].nunique()
out = df.groupby('Type').sample(n=N2)
Handling an N that is not a multiple of the number of types:
Use the same as above and complete to N with random rows excluding those already selected.
N = 5
N2, R = divmod(N, df['Type'].nunique())
out = df.groupby('Type').sample(n=N2)
out = pd.concat([out, df.drop(out.index).sample(n=R)])
As there is still a chance that you complete with items of the same group, if you really want to ensure sampling from different groups replace the last step with:
out = pd.concat([out, df.drop(out.index).groupby('Type').sample(n=1).sample(n=R)])
Example output:
SNo Type Difficulty
4 5 Multiple 14
6 7 None 4323
2 3 Single 4
3 4 Multiple 2
5 6 None 7
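If you need this more than once, the two steps can be wrapped in a small helper; a sketch under the same assumptions (sample_equal is a made-up name, and each type must still have a row left over for the top-up draw):
import pandas as pd

def sample_equal(df, col, N):
    # Split N evenly across the groups, then top up with R rows
    # drawn one-per-group from the rows not yet selected.
    N2, R = divmod(N, df[col].nunique())
    out = df.groupby(col).sample(n=N2)
    if R:
        extra = df.drop(out.index).groupby(col).sample(n=1)
        out = pd.concat([out, extra.sample(n=R)])
    return out

out = sample_equal(df, 'Type', 5)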

Cluster similar - but not identical - digits in pandas dataframe

I have a pandas dataframe with 2M+ rows. One of the columns, pin, contains a series of 14 digits.
I'm trying to cluster similar — but not identical — digits. Specifically, I want to match the first 10 digits without regard to the final four. The pin column was imported as an int then converted to a string.
Put another way, the first 10 digits should match but the final four shouldn't. Duplicates of exact-matching pins should be dropped.
For example these should all be grouped together:
17101110141403
17101110141892
17101110141763
17101110141199
17101110141788
17101110141851
17101110141831
17101110141487
17101110141914
17101110141843
Desired output:
Biggest cluster | other columns
Second biggest cluster | other columns
...and so on | other columns
I've tried using a combination of groupby and regex without success.
pat2 = '1710111014\d\d\d\d'
pat = '\d\d\d\d\d\d\d\d\d\d\d\d\d\d'
grouped = df2.groupby(df2['pin'].str.extract(pat, expand=False), axis= 1)
and
df.groupby(['pin']).filter(lambda group: re.match > 1)
Here's a link to the original data set: https://datacatalog.cookcountyil.gov/Property-Taxation/Assessor-Parcel-Sales/wvhk-k5uv
It's not clear why you need regex for this. What about the following, assuming pin is stored as a string? (Note that you haven't included your expected output.)
pin
0 17101110141403
1 17101110141892
2 17101110141763
3 17101110141199
4 17101110141788
5 17101110141851
6 17101110141831
7 17101110141487
8 17101110141914
9 17101110141843
df.groupby(df['pin'].str[:10]).size()
pin
1710111014 10
dtype: int64
If you want this information appended back to your original dataframe, you can use
df['size'] = df.groupby(df['pin'].astype(str).str[:10])['pin'].transform(len)
pin size
0 17101110141403 10
1 17101110141892 10
2 17101110141763 10
3 17101110141199 10
4 17101110141788 10
5 17101110141851 10
6 17101110141831 10
7 17101110141487 10
8 17101110141914 10
9 17101110141843 10
Then, assuming you have more columns, you can sort your dataframe by cluster size, biggest first, with
df.sort_values('size', ascending=False)
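Putting the pieces together, a sketch of the whole flow for the linked data set (assuming the column is named pin, as in the question):
# Drop exact-duplicate pins, cluster on the first 10 digits,
# then order the rows so the biggest cluster comes first.
df['pin'] = df['pin'].astype(str)
df = df.drop_duplicates(subset='pin')
df['size'] = df.groupby(df['pin'].str[:10])['pin'].transform('size')
df = df.sort_values('size', ascending=False)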

Why do I lose numerical precision when extracting element from list in python?

I have a pandas dataframe that looks like this:
data
0 [26.113017616106, 106.948066803935, 215.488217...
1 [26.369709448639, 106.961107298101, 215.558911...
2 [26.261267444521, 106.991763898421, 215.384122...
3 [26.285746968657, 106.912377030428, 215.287348...
4 [26.155342026996, 106.825440402654, 215.114619...
5 [26.159917638984, 106.819720887669, 215.117593...
6 [26.023564401739, 106.843056508808, 215.129947...
7 [26.1155342027, 106.828185769847, 215.15991763...
8 [26.028826355525, 106.841912605811, 215.146190...
9 [26.015099519561, 106.824296499657, 215.130404...
I am trying to extract the element at index 1 from each list in the Series using this code:
[x[1] for x in df.data]
and I get this result:
0 106.948067
1 106.961107
2 106.991764
3 106.912377
4 106.825440
5 106.819721
6 106.843057
7 106.828186
8 106.841913
9 106.824296
Why do I lose precision and what can I do to keep it?
By default, pandas displays floating-point values with 6 digits of precision.
You can control the display precision with pandas’ set_option, e.g.
pd.set_option('display.precision', 12)
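Note that nothing is actually lost; only the printed representation is rounded. A minimal check, assuming the column layout above:
first = df.data.iloc[0]        # the list stored in row 0
print(first[1])                # 106.948066803935 -- the full value is intact
print(f'{first[1]:.12f}')      # explicit formatting shows it as well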

Slicing large lists based on input

If I have multiple lists such that
hello = [1,3,5,7,9,11,13]
bye = [2,4,6,8,10,12,14]
and the user inputs 3, is there a way to go back 3 indexes from the end of the lists and start there, to get:
9 10
11 12
13 14
with a tab \t between the values.
If the user inputs 5, the expected output would be:
5 6
7 8
9 10
11 12
13 14
I've tried
for i in range(user_input):
    print(hello[-i-1], '\t', bye[-i-1])
Just use negative indices that start from the end minus the user input (-user_input) and move to the end (-1), something like:
for i in range(-user_input, 0):
    print(hello[i], bye[i], sep='\t')
Another zip solution, but one-lined:
for h, b in zip(hello[-user_input:], bye[-user_input:]):
    print(h, b, sep='\t')
This avoids converting the result of zip to a list, so the only temporaries are the slices of hello and bye. While iterating by index can avoid those temporaries, in practice it's almost always cleaner and faster to take the slice and iterate the values, as repeated indexing is both unpythonic and surprisingly slow in CPython.
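If you want to verify that claim, a quick micro-benchmark sketch (exact numbers vary by machine):
import timeit

hello = list(range(1, 200001, 2))
bye = list(range(2, 200002, 2))
n = 50000

by_index = lambda: [(hello[i], bye[i]) for i in range(-n, 0)]
by_slice = lambda: list(zip(hello[-n:], bye[-n:]))

print('index:', timeit.timeit(by_index, number=100))
print('slice:', timeit.timeit(by_slice, number=100))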
Use negative indexing in the slice.
hello = [1,3,5,7,9,11,13]
print(hello[-3:])
print(hello[-3:-2])
output
[9, 11, 13]
[9]
You can zip the two lists and use itertools.islice to obtain the desired portion of the output:
from itertools import islice
print('\n'.join(map(' '.join, islice(zip(map(str, hello), map(str, bye)), len(hello) - int(input()), len(hello)))))
Given an input of 5, this outputs:
5 6
7 8
9 10
11 12
13 14
You can use zip to return a list of tuples, where the i-th tuple contains the i-th element from each of the argument iterables.
zip_ = list(zip(hello, bye))
for item in zip_[-user_input:]:
    print(item[0], '\t', item[1])
Then use a negative index to slice out what you want.
If you want to analyze the data, I think using pandas.DataFrame may be helpful.
import pandas as pd

INPUT_INDEX = int(input('index='))
df = pd.DataFrame([hello, bye])
df = df.iloc[:, len(df.columns) - INPUT_INDEX:]
for col in df.columns:
    h_value, b_value = df[col].values
    print(h_value, b_value)
console
index=3
9 10
11 12
13 14

Pandas Vectorization with Function on Parts of Column

So I have a dataframe that looks something like this:
df1 = pd.DataFrame([[1,2, 3], [5,7,8], [2,5,4]])
0 1 2
0 1 2 3
1 5 7 8
2 2 5 4
I then have a function called add5 that adds 5 to a number. I'm trying to create a new column in df1 that adds 5 to all the numbers in column 2 that are greater than 3. I want to use vectorization, not apply, as this approach will be expanded to a dataset with hundreds of thousands of entries, where speed will be important. I can do it without the greater-than-3 constraint like this:
df1['3'] = add5(df1[2])
But my goal is to do something like this:
df1['3'] = add5(df1[2]) if df1[2] > 3
Hoping someone can point me in the right direction on this. Thanks!
With Pandas, a function applied explicitly to each row typically cannot be vectorised. Even implicit loops such as pd.Series.apply will likely be inefficient. Instead, you should use true vectorised operations, which lean heavily on NumPy in both functionality and syntax.
In this case, you can use numpy.where:
import numpy as np

df1[3] = np.where(df1[2] > 3, df1[2] + 5, df1[2])
Alternatively, you can use pd.DataFrame.loc in a couple of steps:
df1[3] = df1[2]
df1.loc[df1[2] > 3, 3] = df1[2] + 5
In each case, the term df1[2] > 3 creates a Boolean series, which is then used to mask another series.
Result:
print(df1)
0 1 2 3
0 1 2 3 3
1 5 7 8 13
2 2 5 4 9
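If you want to keep the add5 function from the question in the picture, it slots straight into either form, since it operates elementwise on a Series; a self-contained sketch:
import numpy as np
import pandas as pd

def add5(x):
    return x + 5  # elementwise when x is a Series

df1 = pd.DataFrame([[1, 2, 3], [5, 7, 8], [2, 5, 4]])
df1[3] = np.where(df1[2] > 3, add5(df1[2]), df1[2])
print(df1)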
