I have two tab-delimited files. Each has two columns, one for chromosome and one for position. I want to count the number of positions in file2 that fall within a specified window around each position in file1, then do the same for the next window size, and the next, and so forth.
So if this is the first row from file 1:
1 10500
And this is from file 2:
1 10177
1 10352
1 10616
1 11008
1 11012
1 13110
1 13116
1 13118
1 13273
1 13550
If the window is of size 1000, the following co-ordinates from file 2 fall within this window:
1 10177
1 10352
1 10616
The second window should then be 2000 in size and therefore these co-ordinates fall within this window:
1 10177
1 10352
1 10616
1 11008
1 11012
My files are split by chromosomes so this needs to be taken into account also.
I have started by creating the following function:
#Function counts the number of variants in a window of specified size around a specified
#genomic position
def vCount(nsPos, df, windowSize):
    #If the window minimum is below 0, set it to 0
    if nsPos - (windowSize / 2) < 0:
        low = 0
    else:
        low = nsPos - (windowSize / 2)
    high = nsPos + (windowSize / 2)
    #Calculate the length (i.e. the number) of variants in the subsetted data
    diversity = len(df[(df['POS'] >= low) & (df['POS'] <= high)])
    return diversity
This calculates the number for a single co-ordinate from file 1. However, what I want to do is calculate the number for every co-ordinate and then calculate the average. For this I use the following function, which uses the above function:
#Function to apply the vCount function across the entire positional column of df
def windowAvg(NSdf, variantsDF, window):
    x = NSdf.POS.apply(vCount, df=variantsDF, windowSize=window)
    return x
This also works fine, but as I mentioned above, this needs to be done over multiple genomic windows, and then the means must be plotted. So I have created a function that uses a loop and then plots the result:
import operator
import matplotlib.pyplot as plt

def loopNplot(NSdf, variantsDF, window):
    #Store window indices
    windows = list(range(1, 101))
    #Store diversity
    diversity = []
    #Loop through windows, appending to the list
    for i in range(window, 101000, window):
        #Create a list to hold counts for each chromosome (which we then take the mean of)
        tempList = []
        #Loop through chromosomes (vtable and all_variants are globals)
        for j in NSdf['CHROM']:
            #Subset NS polymorphisms
            subDF = vtable.loc[vtable['CHROM'] == j].copy()
            #Subset all variants
            subVar = all_variants.loc[all_variants['CHROM'] == j].copy()
            x = windowAvg(subDF, subVar, i)
            #Convert series to list
            x = x.tolist()
            #Extend tempList to include x
            tempList.extend(x)
        #Append mean of tempList - counts of diversity - to the list
        diversity.append(sum(tempList) / len(tempList))
    #Copy diversity
    divCopy = list(diversity)
    #Add a new first index of 0
    diversity = [0] + diversity
    #Remove last index
    diversity = diversity[:-1]
    #Subtract to give the number of variants added in each window
    div = list(map(operator.sub, divCopy, diversity))
    plt.scatter(windows, div)
    plt.show()
The issue here is that file1 has 11609 rows, and file2 has 13758644 rows. Running the above function is extremely sluggish and I would like to know if there is any way to optimise or change the script to make things more efficient?
I am more than happy to use other command line tools if Python is not the best way to go about this.
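(For reference, one possible direction for speeding this up would be to sort the file2 positions per chromosome and count window membership with numpy.searchsorted instead of re-subsetting the dataframe for every position. The sketch below is illustrative only; the function and variable names are assumptions, not part of my current code.)
import numpy as np

def window_counts(ns_pos, variant_pos, window_size):
    #ns_pos and variant_pos are 1-D position arrays for a single chromosome;
    #variant_pos must be sorted. Returns one count per position in ns_pos.
    half = window_size / 2
    lo = np.searchsorted(variant_pos, ns_pos - half, side='left')
    hi = np.searchsorted(variant_pos, ns_pos + half, side='right')
    return hi - lo

#Illustrative per-chromosome usage (names are assumptions):
#counts = window_counts(nsdf_chr1['POS'].to_numpy(),
#                       np.sort(variants_chr1['POS'].to_numpy()), 1000)
#mean_count = counts.mean()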
Since my last post lacked information, here is some more detail:
example of my df (the important col):
deviceID: unique ID for the vehicle. Vehicles send data every X minutes.
mileage: the distance moved since the last message (in km)
position_timestamp_measure: Unix timestamp of when the record was created.
deviceID mileage position_timestamp_measure
54672 10 1600696079
43423 20 1600696079
42342 3 1600701501
54672 3 1600702102
43423 2 1600702701
My goal is to validate the mileage by comparing it against the vehicle's maximum speed (which is 80 km/h), calculating the speed of the vehicle from the timestamp and the mileage. The result should then be written back to the original dataset.
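In other words, the per-message check boils down to the following (just a restatement of the rule above, with illustrative variable names):
#mileage is in km, the timestamp gap is in seconds
speed_kmh = mileage_km / (time_gone_sec / 3600)
is_valid = speed_kmh < MAX_SPEED_KMH  #MAX_SPEED_KMH = 80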
What I've done so far is the following:
maxSpeedKMH = 80  #maximum plausible speed in km/h

df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
#create new col and set all values to false
df_ori['valid'] = 0
for group_name, group in df:
    #sort group by time
    group = group.sort_values(by='position_timestamp_measure')
    group = group.reset_index()
    #since I can't validate the first point in the group, I set it to valid
    df_ori.loc[df_ori.index == group.dataIndex.values[0], 'valid'] = 1
    #iterate through each row in the group
    for i in range(1, len(group)):
        timeGoneSec = abs(group.position_timestamp_measure.values[i] - group.position_timestamp_measure.values[i-1])
        timeHours = (timeGoneSec / 60) / 60
        #calculate speed and compare against the maximum
        if (group.mileage.values[i] / timeHours) < maxSpeedKMH:
            df_ori.loc[df_ori.index == group.dataIndex.values[i], 'valid'] = 1

df_ori.valid.value_counts()
It definitely works the way I want it to, but performance is poor. The df contains nearly 700k rows (already cleaned). I am still a beginner and can't figure out a better solution. I would really appreciate any help.
If I got it right, no for-loops are needed here. Here is what I've transformed your code into:
df_ori['dataIndex'] = df_ori.index
#create new col and set all values to false
df_ori['valid'] = 0
df_ori = df_ori.sort_values(['position_timestamp_measure'])
# Subtract the preceding value from the current value, per device
df_ori['timeGoneSec'] = \
    df_ori.groupby('device_id')['position_timestamp_measure'].transform('diff')
# The operation above produces NaN for the first value in each group;
# fill 'valid' with 1 for those rows, following the original code
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1
df_ori['timeHours'] = df_ori['timeGoneSec'] / 3600  # 60*60 = 3600
df_ori['flag'] = (df_ori['mileage'] / df_ori['timeHours']) <= maxSpeedKMH
df_ori.loc[df_ori['flag'], 'valid'] = 1
# Remove helper columns
df_ori = df_ori.drop(columns=['flag', 'timeHours', 'timeGoneSec'])
The basic idea is to use vectorized operations as much as possible and to avoid for loops, especially row-by-row iteration, which can be insanely slow.
Since I can't get the context of your code, please double check the logic and make sure it works as desired.
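For example, a quick sanity check of the vectorized block on the five sample rows from the question could look like this (a minimal sketch; maxSpeedKMH is assumed to be 80 and the column names follow the code above):
import pandas as pd

maxSpeedKMH = 80
df_ori = pd.DataFrame({
    'device_id': [54672, 43423, 42342, 54672, 43423],
    'mileage': [10, 20, 3, 3, 2],
    'position_timestamp_measure': [1600696079, 1600696079, 1600701501,
                                   1600702102, 1600702701],
})
# ...run the vectorized block above on df_ori, then inspect:
# print(df_ori[['device_id', 'mileage', 'valid']].sort_index())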
Overview: I am working with pandas dataframes of census information; they have only two columns but are several hundred thousand rows long. One column is a census block ID number and the other is a 'place' value, which is unique to the city in which that census block resides.
Example Data:
BLOCKID PLACEFP
0 60014001001000 53000
1 60014001001001 53000
...
5844 60014099004021 53000
5845 60014100001000
5846 60014100001001
5847 60014100001002 53000
Problem: As shown above, several place values are blank even though their rows have a census block ID. In many instances, a census block that is missing a place value is located in the same city as the surrounding blocks, especially when the bookend place values match. In the example above, indexes 5844 through 5847 sit in the same general area as the surrounding blocks; the middle two rows just seem to be missing the place value.
Goal: I want to be able to go through this dataframe, find these instances and fill in the missing place value, based on the place value before the missing value and the place value that immediately follows.
Current State & Obstacle: I wrote a loop that goes through the dataframe to correct these issues, shown below.
import pandas

current_state_blockid_df = pandas.DataFrame(
    {'BLOCKID': [60014099004021, 60014100001000, 60014100001001, 60014100001002,
                 60014301012019, 60014301013000, 60014301013001, 60014301013002,
                 60014301013003, 60014301013004, 60014301013005, 60014301013006],
     'PLACEFP': [53000, '', '', 53000, 11964, '', '', '', '', '', '', 11964]})
for i in current_state_blockid_df.index:
    if current_state_blockid_df.loc[i, 'PLACEFP'] == '':
        #Get value before the blank
        prior_place_fp = current_state_blockid_df.loc[i - 1, 'PLACEFP']
        next_place_fp = ''
        _n = 1
        # Find the end of the blank section
        while next_place_fp == '':
            next_place_fp = current_state_blockid_df.loc[i + _n, 'PLACEFP']
            if next_place_fp == '':
                _n += 1
        # if the blanks could likely be in the same city, assign them the city's place value
        if prior_place_fp == next_place_fp:
            for _i in range(i, i + _n):
                current_state_blockid_df.loc[_i, 'PLACEFP'] = prior_place_fp
However, as expected, it is very slow when dealing with hundreds of thousands of rows of data. I have considered using a ThreadPoolExecutor to split up the work, but I haven't quite figured out the logic I'd use to get that done. One way to speed it up slightly would be to drop the check for the end of the gap and just fill blanks with whatever the previous place value was. While that may end up being my go-to, there's still a chance it's too slow, and ideally I'd like to fill in a value only when the before and after values match, eliminating the possibility of a block being mistakenly assigned. If someone has another suggestion as to how this could be achieved quickly, it would be very much appreciated.
You can use shift to help speed up the process. However, this doesn't solve for cases where there are multiple blanks in a row.
df['PLACEFP_PRIOR'] = df['PLACEFP'].shift(1)
df['PLACEFP_SUBS'] = df['PLACEFP'].shift(-1)
criteria1 = df['PLACEFP'].isnull()  # use df['PLACEFP'] == '' if the blanks are empty strings as in the question
criteria2 = df['PLACEFP_PRIOR'] == df['PLACEFP_SUBS']
df.loc[criteria1 & criteria2, 'PLACEFP'] = df.loc[criteria1 & criteria2, 'PLACEFP_PRIOR']
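For runs of more than one blank, one possible vectorized pattern is to forward-fill and back-fill separately and only accept the positions where the two agree (a sketch, assuming the blanks are empty strings as in the question):
import numpy as np

placefp = df['PLACEFP'].replace('', np.nan)
ff = placefp.ffill()   # value carried down from above the gap
bf = placefp.bfill()   # value carried up from below the gap
fill_mask = placefp.isna() & (ff == bf)
df.loc[fill_mask, 'PLACEFP'] = ff[fill_mask]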
If you end up needing to iterate over the dataframe, use df.itertuples. It yields one namedtuple per row; you can access the column values via dot notation (row.column_name) and the index via row.Index.
for row in df.itertuples():
    # logic goes here
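For example, on the example frame from the question, the blank rows could be inspected like this:
for row in current_state_blockid_df.itertuples():
    if row.PLACEFP == '':
        print(row.Index, row.BLOCKID)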
Using your dataframe as defined
def fix_df(current_state_blockid_df):
    df_with_blanks = current_state_blockid_df[current_state_blockid_df['PLACEFP'] == '']
    df_no_blanks = current_state_blockid_df[current_state_blockid_df['PLACEFP'] != '']
    sections = {}
    last_i = 0
    grouping = []
    for i in df_with_blanks.index:
        if i - 1 == last_i:
            grouping.append(i)
            last_i = i
        else:
            last_i = i
            if len(grouping) > 0:
                sections[min(grouping)] = {'indexes': grouping}
                grouping = []
            grouping.append(i)
    if len(grouping) > 0:
        sections[min(grouping)] = {'indexes': grouping}
    for i in sections.keys():
        sections[i]['place'] = current_state_blockid_df.loc[i-1, 'PLACEFP']
    l = []
    for i in sections:
        for x in sections[i]['indexes']:
            l.append(sections[i]['place'])
    df_with_blanks['PLACEFP'] = l
    final_df = pandas.concat([df_with_blanks, df_no_blanks]).sort_index(axis=0)
    return final_df
df = fix_df(current_state_blockid_df)
print(df)
Output:
BLOCKID PLACEFP
0 60014099004021 53000
1 60014100001000 53000
2 60014100001001 53000
3 60014100001002 53000
4 60014301012019 11964
5 60014301013000 11964
6 60014301013001 11964
7 60014301013002 11964
8 60014301013003 11964
9 60014301013004 11964
10 60014301013005 11964
11 60014301013006 11964
I have the following code, which reads a CSV file and then analyzes it. One patient can have more than one illness, and I need to find how many times each illness appears across all patients. But the query given here
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
is so slow that it takes more than 15 mins. Is there a way to make the query faster?
import pandas as pd

raw_data = pd.read_csv(r'C:\Users\omer.kurular\Desktop\Data_Entry_2017.csv')
data = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia", "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax", "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]
illnesses = pd.DataFrame({"Finding_Label": [],
                          "Count_of_Patientes_Having": [],
                          "Count_of_Times_Being_Shown_In_An_Image": []})
ids = raw_data["Patient ID"].drop_duplicates()
index = 0
for ctr in data[:1]:
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = raw_data[raw_data["Finding Labels"].str.contains(ctr)].size / 12
    for i in ids:
        illnesses.at[index, "Count_of_Patientes_Having"] = raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
    index = index + 1
Part of dataframes:
Raw_data
Finding Labels - Patient ID
IllnessA|IllnessB - 1
Illness A - 2
From what I read I understand that ctr stands for the name of a disease.
When you are doing this query:
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
You are not only filtering the rows which have the disease, but also those which belong to a specific patient id. If you have a lot of patients, you will need to run this query a lot of times. A simpler way would be to not filter on the patient id and then take the count of all the rows which have the disease.
This would be:
raw_data[raw_data['Finding Labels'].str.contains(ctr)].size
And in this case since you want the number of rows, len is what you are looking for instead of size (size will be the number of cells in the dataframe).
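As a quick illustration of the difference on a made-up two-column frame (so .size is twice the row count):
import pandas as pd

tmp = pd.DataFrame({'Finding Labels': ['IllnessA|IllnessB', 'IllnessA'],
                    'Patient ID': [1, 2]})
print(len(tmp))   # 2 -> number of rows
print(tmp.size)   # 4 -> rows * columns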
Finally, another source of error in your current code was that you were not accumulating the count across patient ids: you needed to increment illnesses.at[index, "Count_of_Patientes_Having"], not set it to a new value each time.
The code would be something like (for the last few lines), assuming you want to keep the disease name and the index separate:
for index, ctr in enumerate(data[:1]):
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = len(raw_data[raw_data["Finding Labels"].str.contains(ctr)]) / 12
    illnesses.at[index, "Count_of_Patientes_Having"] = len(raw_data[raw_data['Finding Labels'].str.contains(ctr)])
I took the liberty of using enumerate for a more pythonic way of handling indexes. I also don't really know what "Count_of_Times_Being_Shown_In_An_Image" is, but I assumed you had had the same confusion between size and len.
Likely the reason your code is slow is that you are growing a data frame row by row inside a loop, which can involve repeated in-memory copying. This is reminiscent of general-purpose Python rather than Pandas programming, which ideally handles data in blockwise, vectorized operations.
Consider a cross join of your data (assuming a reasonable data size) with the list of illnesses, lining up each Finding Labels value with each illness in the same row, then filter to the rows where the longer string contains the illness. Then run a couple of groupby() calls to return the count and the distinct count by patient.
# CROSS JOIN LIST WITH MAIN DATA FRAME (ALL ROWS MATCHED)
raw_data = (raw_data.assign(key=1)
                    .merge(pd.DataFrame({'ills': ills, 'key': 1}), on='key')
                    .drop(columns=['key'])
            )

# SUBSET BY ILLNESS CONTAINED IN LONGER STRING
raw_data = raw_data[raw_data.apply(lambda x: x['ills'] in x['Finding Labels'], axis=1)]

# CALCULATE GROUP BY count AND distinct count
def count_distinct(grp):
    return (grp.groupby('Patient ID').size()).size

illnesses = pd.DataFrame({'Count_of_Times_Being_Shown_In_An_Image': raw_data.groupby('ills').size(),
                          'Count_of_Patients_Having': raw_data.groupby('ills').apply(count_distinct)})
To demonstrate, consider below with random, seeded input data and output.
Input Data (attempting to mirror original data)
import numpy as np
import pandas as pd

alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
data_tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
ills = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia",
        "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax",
        "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]

np.random.seed(542019)
raw_data = pd.DataFrame({'Patient ID': np.random.choice(data_tools, 25),
                         'Finding Labels': np.core.defchararray.add(
                             np.core.defchararray.add(np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]),
                                                      np.random.choice(ills, 25).astype('str')),
                             np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]))
                         })
print(raw_data.head(10))
# Patient ID Finding Labels
# 0 r xPNPneumothoraxXYm
# 1 python ScSInfiltration9Ud
# 2 stata tJhInfiltrationJtG
# 3 r thLPneumoniaWdr
# 4 stata thYAtelectasis6iW
# 5 sas 2WLPneumonia1if
# 6 julia OPEConsolidationKq0
# 7 sas UFFCardiomegaly7wZ
# 8 stata 9NQHerniaMl4
# 9 python NB8HerniapWK
Output (after running above process)
print(illnesses)
# Count_of_Times_Being_Shown_In_An_Image Count_of_Patients_Having
# ills
# Atelectasis 3 1
# Cardiomegaly 2 1
# Consolidation 1 1
# Effusion 1 1
# Emphysema 1 1
# Fibrosis 2 2
# Hernia 4 3
# Infiltration 2 2
# Mass 1 1
# Nodule 2 2
# Pleural_Thickening 1 1
# Pneumonia 3 3
# Pneumothorax 2 2
Currently learning Python and very new to NumPy and Pandas.
I have pieced together a random number generator with a range. It uses NumPy, and I am unable to isolate each individual result to count how many draws in a row fall within one part of my overall random range.
Goal: Count the iterations of "Random >= 1000" and then add 1 to the appropriate cell that correlates to the tally of iterations. Example in very basic sense:
#Random generator begins... these are first four random generations
Randomiteration0 = 175994 (Random >= 1000)
Randomiteration1 = 1199 (Random >= 1000)
Randomiteration2 = 873399 (Random >= 1000)
Randomiteration3 = 322 (Random < 1000)
#used to +1 to the fourth row of column A in CSV
finalIterationTally = 4
#total times random < 1000 throughout entire session. Placed in cell B1
hits = 1
#Rinse and repeat to custom set generations quantity...
(The logic would then be to +1 to A4 in the spreadsheet. If the iteration tally had been 7, then +1 to A7, etc. So essentially, I am measuring the distance between each "Hit" and the frequency of each distance.)
My current code includes a CSV export portion. I do not need to export each individual random result any longer. I only need to export the frequency of each iteration distance between each hit. This is where I am stumped.
Cheers
import pandas as pd
import numpy as np

#set random generation quantity
generations = int(input("How many generations?\n###:"))

#random range and generator
choices = range(1, 100000)
samples = np.random.choice(choices, size=generations)

#create new column in excel
my_break = 1000000
if generations > my_break:
    n_empty = my_break - generations % my_break
    samples = np.append(samples, [np.nan] * n_empty).reshape((-1, my_break)).T

#export results to CSV
(pd.DataFrame(samples)
   .to_csv('eval_test.csv', index=False, header=False))

#left uncommented if wanting to test 10 generations or so
print(samples)
I believe you are mixing up iterations and generations. It sounds like you want 4 iterations for N generations, but your bottom piece of code does not express the "4" anywhere. If you pull all your variables out to the top of your script, it can help you organize better. Pandas is great for parsing complicated CSVs, but for this case you don't really need it. You probably don't even need NumPy.
import numpy as np

THRESHOLD = 1000
CHOICES = 10000
ITERATIONS = 4
GENERATIONS = 100

choices = range(1, CHOICES)
output = np.zeros(ITERATIONS + 1)
for _ in range(GENERATIONS):
    samples = np.random.choice(choices, size=ITERATIONS)
    count = sum([1 for x in samples if x > THRESHOLD])
    output[count] += 1

output = map(str, map(int, output.tolist()))
with open('eval_test.csv', 'w') as f:
    f.write(",".join(output) + '\n')
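If what you ultimately need is the distance between consecutive hits (rather than the hit count per fixed block of draws), the gaps can also be computed directly with NumPy. This is only a sketch under that reading of the question, with illustrative names:
import numpy as np

THRESHOLD = 1000
GENERATIONS = 100000

samples = np.random.randint(1, 100000, size=GENERATIONS)
hit_positions = np.flatnonzero(samples < THRESHOLD)  # draws that count as a "hit"
gaps = np.diff(hit_positions)   # distance between consecutive hits (assumes at least two hits)
tally = np.bincount(gaps)       # tally[d] = how many gaps of length d occurred
np.savetxt('gap_tally.csv', tally.reshape(1, -1), fmt='%d', delimiter=',')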
I am trying to read from an array I created and return the value inside the array along with the column and row it is found in. This is what I have at the moment:
import pandas as pd
import os
import re

Dir = os.getcwd()
Blks = []
for files in Dir:
    for f in os.listdir(Dir):
        if re.search('txt', f):
            Blks = [each for each in os.listdir(Dir) if each.endswith('.txt')]
print(Blks)

for z in Blks:
    df = pd.read_csv(z, sep=r'\s+', names=['x', 'y', 'z'])
    a = df.pivot(index='y', columns='x', values='z')
    print(a)
OUTPUTS:
x 300.00 300.25 300.50 300.75 301.00 301.25 301.50 301.75
y
200.00 100 100 100 100 100 100 100 100
200.25 100 100 100 100 110 100 100 100
200.50 100 100 100 100 100 100 100 100
x will be my columns and y the rows; the values inside the array correspond to their adjacent column and row. As you can see above, there is an odd 110 value that is 10 above the other values. I'm trying to read the array and return the x (column) and y (row) of the value that differs by 10, by checking the values next to it (top, bottom, right, left) to calculate the difference.
I hope someone can kindly guide me in the right direction, and any beginner tips are appreciated. If it's unclear what I'm asking, please ask; I don't have years of experience with every methodology, as I have only recently started with Python.
You could use DataFrame.iloc (the older DataFrame.ix indexer has been removed from recent versions of pandas) to loop through all the values, row by row and column by column.
oddCoordinates = []
for r in range(df.shape[0]):
    for c in range(df.shape[1]):
        if checkDiffFromNeighbors(df, r, c):
            oddCoordinates.append((r, c))
The row and column of values that are different from the neighbors are listed in oddCoordinates.
To check the difference from the neighbors, you could loop over them and count how many differing values there are:
def checkDiffFromNeighbors(df, r, c):
    #counter of how many neighbors are different
    diffCnt = 0
    #loop over the neighbor rows
    for dr in [-1, 0, 1]:
        r1 = r + dr
        #row should be a valid number
        if r1 >= 0 and r1 < df.shape[0]:
            #loop over columns in row
            for dc in [-1, 0, 1]:
                #either row or column delta should be 0, because we do not allow diagonals
                if dr == 0 or dc == 0:
                    c1 = c + dc
                    #check legal column
                    if c1 >= 0 and c1 < df.shape[1]:
                        if df.iloc[r, c] != df.iloc[r1, c1]:
                            diffCnt += 1
    # if diffCnt==1 then a neighbor is (probably) the odd one
    # otherwise this one is the odd one
    # Note that you could be more strict and require all neighbors to be different
    if diffCnt > 1:
        return True
    else:
        return False
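As a follow-up, the same neighbour check can also be done without explicit Python loops by comparing the grid against shifted copies of itself. This is a sketch of that idea (it flags cells that differ from every neighbour they actually have):
import pandas as pd

def find_odd_cells(a):
    #Return (y, x) labels of cells differing from every available neighbour
    neighbours = [a.shift(1), a.shift(-1),                  # up, down
                  a.shift(1, axis=1), a.shift(-1, axis=1)]  # left, right
    diff_from_all = pd.DataFrame(True, index=a.index, columns=a.columns)
    for n in neighbours:
        #a missing neighbour (at the edge of the grid) does not disqualify a cell
        diff_from_all &= (a != n) | n.isna()
    return list(a[diff_from_all].stack().index)

#On the pivoted frame above this should return [(200.25, 301.0)]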