I have a df:
import pandas as pd
df.head(20)
id ch start end strand
0 10:100026072-100029645(+) 10 100026072 100029645 +
1 10:110931880-110932381(+) 10 110931880 110932381 +
2 10:110932431-110933096(+) 10 110932431 110933096 +
3 10:111435307-111439556(-) 10 111435307 111439556 -
4 10:115954439-115964883(-) 10 115954439 115964883 -
5 10:115986231-116018509(-) 10 115986231 116018509 -
6 10:116500106-116500762(-) 10 116500106 116500762 -
7 10:116654355-116657389(-) 10 116654355 116657389 -
8 10:117146840-117147002(-) 10 117146840 117147002 -
9 10:126533798-126533971(-) 10 126533798 126533971 -
10 10:127687390-127688824(+) 10 127687390 127688824 +
11 10:19614164-19624369(-) 10 19614164 19624369 -
12 10:42537888-42543687(+) 10 42537888 42543687 +
13 10:61927486-61931038(-) 10 61927486 61931038 -
14 10:70699779-70700206(-) 10 70699779 70700206 -
15 10:76532243-76532565(-) 10 76532243 76532565 -
16 10:79336852-79337034(-) 10 79336852 79337034 -
17 10:79342487-79343173(+) 10 79342487 79343173 +
18 10:79373277-79373447(-) 10 79373277 79373447 -
19 10:82322045-82337358(+) 10 82322045 82337358 +
df.shape
(501, 5)
df.dtypes
id object
ch object
start object
end object
strand object
dtype: object
Question:
I would like to perform multiple operations based on the 'start' and 'end' columns. First, create two additional columns called newstart and newend. The desired operation:
if strand == '+':
    df['newstart'] = end - 27
    df['newend'] = end + 2
elif strand == '-':
    df['newstart'] = start - 3
    df['newend'] = start + 26
How can I do this using pandas? I found the link below but am not sure how to execute it. If anyone can provide pseudocode, I will build on it.
adding multiple columns to pandas simultaneously
You can do it using np.where; 2 lines, but readable:
import numpy as np
df['newstart'] = np.where(df.strand == '+', df.end - 27, df.start - 3)
df['newend'] = np.where(df.strand == '+', df.end + 2, df.start + 26)
id ch start end strand newstart newend
0 10:100026072-100029645(+) 10 100026072 100029645 + 100029618 100029647
1 10:110931880-110932381(+) 10 110931880 110932381 + 110932354 110932383
2 10:110932431-110933096(+) 10 110932431 110933096 + 110933069 110933098
3 10:111435307-111439556(-) 10 111435307 111439556 - 111435304 111435333
4 10:115954439-115964883(-) 10 115954439 115964883 - 115954436 115954465
5 10:115986231-116018509(-) 10 115986231 116018509 - 115986228 115986257
6 10:116500106-116500762(-) 10 116500106 116500762 - 116500103 116500132
7 10:116654355-116657389(-) 10 116654355 116657389 - 116654352 116654381
8 10:117146840-117147002(-) 10 117146840 117147002 - 117146837 117146866
9 10:126533798-126533971(-) 10 126533798 126533971 - 126533795 126533824
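One caveat: your df.dtypes output shows start and end as object (strings), so the arithmetic above will raise a TypeError until those columns are converted to numbers, for example:
# start/end were read in as strings (dtype object); make them numeric first
df['start'] = pd.to_numeric(df['start'])
df['end'] = pd.to_numeric(df['end'])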
If you want to do it in pandas, df.loc is a good candidate:
# fill both columns with the '-' strand values first...
df['newstart'] = df['start'] - 3
df['newend'] = df['start'] + 26
# ...then overwrite the rows where the strand is '+'
subset = df['strand'] == '+'
df.loc[subset, 'newstart'] = df.loc[subset, 'end'] - 27
df.loc[subset, 'newend'] = df.loc[subset, 'end'] + 2
I think it is a good idea to keep using pandas to process your data: it will keep your code consistent, and there is probably a better, shorter way to write the code above.
df.loc is a very useful tool for data lookup and assignment; it is worth fiddling with.
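If you ever need to handle more than two strand values, np.select is a natural extension of the np.where approach; a minimal sketch:
import numpy as np
conditions = [df['strand'] == '+', df['strand'] == '-']
# rows matching neither condition would fall back to np.select's default of 0
df['newstart'] = np.select(conditions, [df['end'] - 27, df['start'] - 3])
df['newend'] = np.select(conditions, [df['end'] + 2, df['start'] + 26])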
Enjoy
Related
I obtained a text output file for my data sample, which reports TE insertion sites in the genome. It looks like this:
sample chr pos strand family order support comment frequency
1 1 4254339 . hAT|9 hAT R - 0,954
1 1 34804000 . Stowaway|41 Stowaway R - 1
1 1 12839440 . Tourist|15 Tourist F - 1
1 1 11521962 . Tourist|10 Tourist R - 1
1 1 28197852 . Tourist|11 Tourist F - 1
1 1 7367886 . Stowaway|36 Stowaway R - 1
1 1 13130538 . Stowaway|36 Stowaway R - 1
1 1 6177708 . hAT|4 hAT F - 1
1 1 3783728 . hAT|20 hAT F - 1
1 1 10332288 . uc|12 uc R - 1
1 1 15780052 . uc|5 uc R - 1
1 1 28309928 . uc|5 uc R - 1
1 1 31010266 . uc|33 uc R - 0,967
1 1 84155 . uc|1 uc F - 1
1 1 3815830 . uc|31 uc R - 0,879
1 1 66241 . Mutator|4 Mutator F - 1
1 1 15709187 . Mutator|4 Mutator F - 0,842
I want to compare it with the bed file representing TE sites for the reference genome. It looks like this:
chr start end family
1 12005 12348 Tourist|7
1 4254339 4254340 hAT|9
1 66241 66528 Mutator|4
1 76762 76849 Stowaway|10
1 81966 82251 Stowaway|39
1 84155 84402 uc|1
1 84714 84841 uc|28
1 13130538 13130540 Stowaway|3
I want to check whether the TE insertions found in my sample occur in the reference. For example, the first TE, hAT|9 at position 4254339 on chromosome 1, should be found in the bed file within the range defined by column 2 (start) and column 3 (end) AND be recognized as the hAT|9 family according to column 4. I tried to do it with pandas but I'm pretty confused. Thanks for the suggestions!
EDIT:
I slightly modified the input files' layout for easier understanding and parsing. Below is my code using pandas with two for loops (thanks @furas for the suggestion).
import pandas as pd

ref_base = pd.read_csv('ref_test.bed', sep='\t')
te_output = pd.read_csv('srr_test1.txt', sep='\t')

result = []
for te in te_output.index:
    te_pos = te_output['pos'][te]
    te_family_sample = te_output['family'][te]
    for ref in ref_base.index:
        te_family_ref = ref_base['family'][ref]
        start = ref_base['start'][ref]
        end = ref_base['end'][ref]
        if te_family_sample == te_family_ref and te_pos in range(start, end):
            # print(te_output["chr"][te], te_output["pos"][te], te_output["family"][te],
            #       te_output["support"][te], te_output["frequency"][te])
            result.append(str(te_output["chr"][te]) + '\t' + str(te_output["pos"][te]) + '\t'
                          + te_output["family"][te] + '\t' + te_output["support"][te] + '\t'
                          + str(te_output["frequency"][te]))
print(result)

# write data to file
resultFile = open("result.txt", 'w')
for r in result:
    resultFile.write(r + '\n')
resultFile.close()
Here is my expected result:
1 4254339 hAT|9 R 0,954
1 84155 uc|1 F 1
1 66241 Mutator|4 F 1
I wrote it as simply as I could, but I would like to find a more efficient solution. Any ideas?
This is a small example of a much bigger dataset.
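One vectorized sketch that reproduces the same family-and-interval test as the nested loops (the half-open comparison below mirrors your range(start, end) check; merge on ['chr', 'family'] instead if the chromosome must match too):
import pandas as pd

ref_base = pd.read_csv('ref_test.bed', sep='\t')
te_output = pd.read_csv('srr_test1.txt', sep='\t')

# pair every sample insertion with every reference TE of the same family,
# then keep the pairs whose position falls inside the reference interval
merged = te_output.merge(ref_base, on='family', suffixes=('', '_ref'))
hits = merged[(merged['pos'] >= merged['start']) & (merged['pos'] < merged['end'])]

hits[['chr', 'pos', 'family', 'support', 'frequency']].to_csv(
    'result.txt', sep='\t', header=False, index=False)
This avoids the Python-level double loop; the cross-pairing is still quadratic in the worst case, but the work happens inside vectorized pandas operations.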
I have a text file like this one below.
Code: 44N
______________
Unit: m
Color: red
Length - Width - Height -
31 - 8 - 6 -
32 - 4 - 3 -
35 - 5 - 6 -
----------------------------------------
Code: 40N
______________
Unit: m
Color: blue
Length - Width - Height -
32 - 3 - 2 -
37 - 2 - 8 -
33 - 1 - 6 -
31 - 5 - 8 -
----------------------------------------
Code: 38N
I would like to get the lines from the line starting with " Length" up to the line starting with "----------------------------------------". I would like to do this every time it happens, and then convert each of these blocks into a dataframe, maybe adding them to a list of dataframes.
For this example, I should get two dataframes like these:
Length Width Height
31 8 6
32 4 3
35 5 6
Length Width Height
32 3 2
37 2 8
33 1 6
31 5 8
I already tried something, but it only saves one block to the list instead of both of them, and then I don't know how to convert them to dataframes.
file = open('test.txt', 'r')
file_str = file.read()

well_list = []

def find_between(data, first, last):
    start = data.index(first)
    end = data.index(last, start)
    return data[start:end]

well_list.append(find_between(file_str, " Length", "----------------------------------------"))
Could anyone help me?
Hey, that shows how tricky parsing data can be. Use the .split() method on strings to do the job. Here is a way to do it.
import pandas as pd
import numpy as np

with open('test.txt', 'r') as f:
    text = f.read()

data_start = 'Length - Width - Height'
data_end = '----------------------------------------'

# split the text into sections containing the data
sections = text.split(data_start)[1:]

# column names and their count for the dataframe
col_names = [c.strip() for c in data_start.split('-')]
num_col = len(col_names)

df_list = []
for s in sections:
    # remove everything after '------'
    s = s.split(data_end)[0]
    # now keep the numbers only
    data = s.split()
    # change string to int and discard '-'
    data = [int(d) for d in data if d != '-']
    # reshape the data (num_rows, num_cols)
    data = np.array(data).reshape((len(data) // num_col, num_col))
    # collect each block's dataframe in a list, as the question asks
    df_list.append(pd.DataFrame(data, columns=col_names))
    print(df_list[-1])
I am reducing the noise in a point cloud dataset from a multibeam scan; the dataset I am currently working on has 87295 rows.
I want to get the mean (or standard deviation) of a range of cells, where each range of cells is 1% of the entire dataframe.
It's been a while since I've used Python/pandas, but this is what I've come up with:
amount_of_rows = df['X'].count()
percentage = 0.01
row1 = int(amount_of_rows * percentage)
row2 = int(amount_of_rows * (percentage * 2))
row3 = int(amount_of_rows * (percentage * 3))
row4...
row5....
This goes on up to row100.
row1 = 872
row2 = 1745
So then I would use these rows to get the average between them like so.
rowmean1 = df[row1:row2].mean()
rowmean2 = df[row2:row3].mean()
rowmean3 = df[row3:row4].mean()
rowmean4...
rowmean5...
I would then use these row means to filter the data based on the value, in the same range.
So rowmean1 is the mean of all the values between rows 872 and 1745.
But before I do that, I would like to know if there is a better way to do it, without copy-pasting my code 100 times. I've tried writing different functions and loops to do so myself, but nothing yielded the desired result.
sample df
X Y Z
1 100980.05 498385.17 -9.15
2 100980.08 498385.13 -9.14
3 100980.11 498385.08 -9.13
4 100980.13 498385.04 -9.12
5 100980.16 498384.99 -9.11
6 100980.19 498384.95 -9.10
7 100980.26 498384.84 -8.56
8 100980.24 498384.86 -9.08
9 100980.28 498384.79 -8.86
10 100980.30 498384.77 -9.06
11 100980.32 498384.73 -9.05
12 100980.35 498384.68 -9.04
13 100980.38 498384.64 -9.03
14 100980.40 498384.59 -9.02
15 100980.43 498384.55 -9.01
16 100980.46 498384.51 -8.99
17 100980.48 498384.47 -8.98
18 100980.51 498384.42 -8.97
19 100980.54 498384.38 -8.96
20 100980.56 498384.34 -8.95
21 100980.59 498384.29 -8.94
22 100980.62 498384.25 -8.92
23 100980.64 498384.21 -8.91
24 100980.67 498384.17 -8.90
25 100980.69 498384.12 -8.89
26 100980.73 498384.07 -8.58
27 100980.75 498384.04 -8.87
28 100980.77 498384.00 -8.86
29 100980.80 498383.96 -8.85
30 100980.82 498383.91 -8.84
Edit
import numpy as np
import pandas as pd

df = pd.DataFrame(...)
df['percentage'] = 1 / len(df)
df['cum'] = df['percentage'].cumsum()
for val in np.arange(0.01, 1.01, 0.01):
    precedent = val - 0.01
    ix = df[(precedent < df['cum']) & (df['cum'] <= val)].index
    this_mean = df.loc[ix, 'X'].mean()
    print(min(ix), '->', max(ix), ':', this_mean)
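A shorter route, as a sketch that assumes the 1% chunks are defined purely by row position: label each row with its bucket and let groupby compute the statistics in one pass.
import numpy as np
import pandas as pd

# label each row with its 1% bucket (0..99) by position
bucket = np.arange(len(df)) * 100 // len(df)

# per-chunk statistics, one row per 1% chunk
chunk_means = df.groupby(bucket)['X'].mean()

# broadcast each chunk's Z statistics back onto its rows for filtering;
# the 2-standard-deviation threshold is only an assumption for illustration
df['Z_mean'] = df.groupby(bucket)['Z'].transform('mean')
df['Z_std'] = df.groupby(bucket)['Z'].transform('std')
filtered = df[(df['Z'] - df['Z_mean']).abs() <= 2 * df['Z_std']]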
I have a data frame with 384 rows (plus an additional dummy one at the beginning).
Each row has 4 variables I entered manually, 3 calculated fields based on those 4 variables, and 3 fields that compare each calculated variable to the row before. Each field can take 1 of two values (basically True/False).
Final goal: I want to arrange the data frame so that each of the 64 possible combinations of the 6 calculated fields (2^6) occurs exactly 6 times (2^6 * 6 = 384).
Each iteration builds a frequency table (pivot), and if any of the counts differs from 6 it breaks and randomizes the order again.
The problem is that there are 384! - 12*6! possible combinations, and my computer has been running the following script for over 4 days without a solution.
import pandas as pd
from numpy import random

# a function that calculates if a row is congruent or in-congruent
def set_cong(df):
    if (df["left"] > df["right"] and df["left_size"] > df["right_size"]) or (df["left"] < df["right"] and df["left_size"] < df["right_size"]):
        return "Cong"
    else:
        return "InC"

# open the file and calculate the basic fields
DF = pd.read_csv("generator.csv")
DF["distance"] = abs(DF.right - DF.left)
DF["CR"] = DF.left > DF.right
DF["Cong"] = DF.apply(set_cong, axis=1)

again = 1
# main loop to try and find an optimal order
while again == 1:
    # make a copy of the DF so it does not have to be reloaded each iteration
    df = DF.copy()
    again = 0
    df["rand"] = [random.randint(low=1, high=100000) for i in range(df.shape[0])]
    # as 3 of the fields are calculated from the previous row, the first row is a
    # dummy and needs to stay first when sorted
    df.loc[0, "rand"] = 0
    Sorted = df.sort_values(['rand'])
    Sorted["Cong_n1"] = Sorted.Cong.eq(Sorted.Cong.shift())
    Sorted["Side_n1"] = Sorted.CR.eq(Sorted.CR.shift())
    Sorted["Dist_n1"] = Sorted.distance.eq(Sorted.distance.shift())
    # here the dummy is deleted
    Sorted = Sorted.drop(0, axis=0)
    grouped = Sorted.groupby(['distance', 'CR', 'Cong', 'Cong_n1', 'Dist_n1', 'Side_n1'])
    for name, group in grouped:
        if group.shape[0] != 6:
            again = 1
            break

Sorted.to_csv("Edos.csv", sep="\t", index=False)
print("bye")
The data frame looks like this:
left right size_left size_right distance cong CR distance_n1 cong_n1 side_n1
1 6 22 44 5 T F dummy dummy dummy
5 4 44 22 1 T T F T F
2 3 44 22 1 F F T F F
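One efficiency note on the script above, as a sketch: the inner for loop over the groups can be replaced by a single vectorized check.
# every one of the 64 combinations must occur exactly 6 times
sizes = Sorted.groupby(['distance', 'CR', 'Cong', 'Cong_n1', 'Dist_n1', 'Side_n1']).size()
again = int((sizes != 6).any())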
I am currently working on this project and have some issues finding the average time taken for each corresponding type. For now, I have my output as shown below after reading my CSV file.
#Following this format (x typeList : timeTakenList)
0 Lift : 5 days, 5:39:00
1 Lift : 5 days, 5:31:00
2 lighting : 3 days, 9:47:00
3 ACMV : 5 days, 5:21:00
4 lighting : 3 days, 9:32:00
.
.
.
How do I calculate the average time taken for each type such that I have the following output?
0 Lift : (5 days, 5:31:00 + 5 days, 5:39:00) / 2
1 lighting : (3 days, 9:47:00 + 3 days, 9:32:00) / 2
2 ACMV : 5 days, 5:21:00
.
.
.
The timeTakenList is calculated by subtracting another column, Acknowledged Date, from the column Completed Date:
timeTakenList = completedDate - acknowledgedDate
There are a ton of other types in the typeList, and I'm trying to avoid writing an if statement for each one, like if typeList[x] == "Lift": then add the times together, etc.
Example of .csv file:
I'm not too sure how else to make my question clearer, but any help is much appreciated.
Check the implementation below.
import re

typeList = ["Lighting", "Lighting", "Air-Con", "Air-Con", "Toilet"]
timeTakenList = ["10hours", "5hours", "2days, 5hrs", "5hours", "4hours"]

def convertTime(total_time):
    # strip the unit words, then parse the remaining "days, hours" numbers
    li = list(map(int, re.sub(r"days|hrs|hours", "", total_time).split(",")))
    if len(li) == 2:
        return li[0] * 24 + li[1]
    return li[0]

def convertDays(total_time):
    days = int(total_time / 24)
    hours = total_time % 24
    if hours.is_integer():
        hours = int(hours)
    if days == 0:
        return str(hours) + "hours"
    return str(days) + "days, " + str(hours) + "hrs"

def avg(numbers):
    return float(sum(numbers)) / max(len(numbers), 1)

avgTime = {}
for i, types in enumerate(typeList):
    if types in avgTime:
        avgTime[types].append(convertTime(timeTakenList[i]))
    else:
        avgTime[types] = [convertTime(timeTakenList[i])]

for types in avgTime:
    print(types + " : " + convertDays(avg(avgTime[types])))
Algo:
1. Strip "hours", "hrs", and "days" from the timeTakenList elements.
2. Convert the elements consisting of days and hours into hours only, for an easier average calculation.
3. Create a hash (dict) with the typeList elements as keys and the converted timeTakenList elements as lists of values.
4. Print the values for each key, converted back to days and hours.
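Since the question already computes real pandas timedeltas (completedDate - acknowledgedDate), a groupby may be simpler than string parsing; a sketch, where the column names 'Type', 'Acknowledged Date' and 'Completed Date' are my assumption from the question:
import pandas as pd

# column names are guessed from the question; adjust them to your CSV headers
df['timeTaken'] = pd.to_datetime(df['Completed Date']) - pd.to_datetime(df['Acknowledged Date'])

# groupby averages each type's Timedelta values without any if chains
print(df.groupby('Type')['timeTaken'].mean())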