Split a list up to a maximum number of elements - python

I was wondering if someone could help me with the following problem: I have a text file that I split into rows and columns. The text file contains a variable number of columns, but I would like to split each row into exactly seven columns, no more, no less. To do that, I want to throw everything after the sixth column into a single column.
Example code:
import numpy as np

rot = ['6697 1100.0 90.0 0.0 0.0 6609 !',
       '701 0.0 0.0 83.9 1.5 000 !AFR-AHS IndHS-AFR']
for i in range(len(rot)):
    rot[i] = rot[i].split()
Here, the array 'rot' contains 7 entries in the first row (the ! counts as a separate entry) and 8 in the second row. In both cases, everything after and including the ! should be grouped in the same column.
Many thanks!

You are almost there. split takes (as its second argument) the maximum number of splits to do.
https://docs.python.org/3.8/library/stdtypes.html#str.split
rot = ['6697 1100.0 90.0 0.0 0.0 6609 !',
       '701 0.0 0.0 83.9 1.5 000 !AFR-AHS IndHS-AFR']
for i in range(len(rot)):
    rot[i] = rot[i].split(maxsplit=6)
Note: You want six splits, which results in seven columns. You'll need to do some extra processing if the text can have fewer than seven columns though.
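If short rows are possible, one hedged way to handle them is to pad the split result out to seven entries; padding with empty strings is an assumption here, so use whatever filler your downstream code expects:

rot = ['6697 1100.0 90.0 0.0 0.0 6609 !',
       '701 0.0 0.0 83.9 1.5 000 !AFR-AHS IndHS-AFR']
for i in range(len(rot)):
    cols = rot[i].split(maxsplit=6)        # at most 7 columns
    cols += [''] * (7 - len(cols))         # pad short rows so every row has 7 columns
    rot[i] = cols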

Related

Faster way of finding count of a category over a window function in Python

I have a categorical column (well, a discrete-valued column) and I would like to count the number of rows in each category over a centered sliding window. I am using Python with pandas and numpy to do this. I have something that works, but it is slow and not very elegant.
I was wondering if there was a faster or easier way of doing this. I am running it over around 10,000 rows now and it takes around 20 seconds, which is OK, but I'd like to run it over several hundred thousand rows and up to 1,000,000 rows.
my code so far is as follows:
counted = pd.DataFrame()
for i in df[discrete_column].unique():
    counts = df[discrete_column].rolling(window_size, 0, True).apply(lambda x: np.where(x == i, 1, 0).sum())
    counted[i] = counts
Input would be like this (index, column):
index  discrete_column
58702  65030
58703  65030
58704  65030
58705  65030
58706  65030
58707  30000
58708  30000
58709  30000
58710  30000
Output (this is just a snippet; I used a window size of 20):
index  65030     30000
58703  0.684211  0.315789
58704  0.650000  0.350000
58705  0.600000  0.400000
58706  0.550000  0.450000
58707  0.500000  0.500000
58708  0.450000  0.550000
Each category from the input becomes a column, and the values in the output are the proportions (renormalized to 1 across the row) of each category within that window.
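No answer is posted for this one, but as a hedged sketch of one faster approach (reusing the question's df, discrete_column and window_size names): one-hot encode the categories once with pd.get_dummies, then take a single centered rolling mean. Because exactly one dummy is set per input row, the rolling mean of each dummy column is the proportion of that category inside the window, and the rows sum to 1.

import pandas as pd

# One 0/1 column per category; cast to float so rolling aggregation is numeric.
dummies = pd.get_dummies(df[discrete_column]).astype(float)

# A single centered rolling mean per dummy column gives the per-window
# proportion of each category, replacing the per-category apply() loop.
counted = dummies.rolling(window_size, min_periods=1, center=True).mean()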

Sequential predictions on multiple sequences

I need to predict a sequence of responses based on a sequence of predictors using something like an LSTM. There are multiple sequences, and they are structured such that they cannot be stacked and still make sense.
For example, we might have one sequence with the sequential values
Observation  Location 1x  Location 1y  Location 2 (response)
1            3.8          2.5          9.4
2            3.9          2.7          9.7
and another with the values
Observation  Location 1x  Location 1y  Location 2 (response)
1            9.4          4.6          16.8
2            9.2          4.1          16.2
Observation 2 from the first table and observation 1 from the second table do not follow each other. I then need to predict on an unseen sequence like
Location 1x  Location 1y
5.6          8.4
5.6          8.1
which is also not correlated to the first two, except that the first two should give a guideline on how to predict the sequence.
I've looked into multiple sequence prediction and haven't had much luck. Can anyone give a guideline on what sort of strategies I might use for this problem? Thanks.
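There is no posted answer for this one, but as a hedged sketch of one common strategy (Keras is an assumption here; the question only says "something like an LSTM"): independent sequences are usually kept separate by making each one its own sample along the batch dimension, so the recurrent state never carries over from one sequence to the next.

import numpy as np
from tensorflow import keras

# Toy stand-ins for the two training sequences above:
# X has shape (n_sequences, timesteps, n_features), y has shape (n_sequences, timesteps, 1).
X = np.array([[[3.8, 2.5], [3.9, 2.7]],
              [[9.4, 4.6], [9.2, 4.1]]])
y = np.array([[[9.4], [9.7]],
              [[16.8], [16.2]]])

model = keras.Sequential([
    keras.layers.LSTM(32, return_sequences=True, input_shape=(None, 2)),
    keras.layers.Dense(1),          # one response value per time step
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=200, verbose=0)

# The unseen sequence is just another sample (a batch of one).
X_new = np.array([[[5.6, 8.4], [5.6, 8.1]]])
print(model.predict(X_new))

Sequences of different lengths can be padded and masked; the important part is that unrelated sequences are never concatenated into one long series.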

Using a Data Frame Column as start & another as end for range to calculate mean for another dataframe column

I have two dataframes: one that contains start and end row ids, and another on which I want to calculate the mean of all rows between those coordinates.
First dataframe:
id  Exon region start (bp)  Exon region end (bp)
0   577                     647
1   648                     1601
2   1602                    1670
3   1671                    3229
4   3230                    3304
Second Dataframe:
id  chrom  pos    mean      median  over_1    over_5    over_10  over_15  over_20  over_25  over_30  over_50  over_100  average_exon_coverage
0   1      12141  0.029005  0       0.021939  0.000105  0.0      0.0      0.0      0.0      0.0      0.0      0.0       0.0
1   1      12142  0.029216  0       0.021622  0.000105  0.0      0.0      0.0      0.0      0.0      0.0      0.0       0.0
I have managed to create a column 'average_exon_coverage' in the new dataframe and tried to calculate the mean between the start and end positions, but I am not sure what I am doing wrong. My code is below:
meanList = []
for x in range(exon['Exon region start (bp)'].astype(int), exon['Exon region end (bp)'].astype(int)):
    meanList.append(exomes_avg_mean['mean'])
exomes_avg_mean['average exon coverage'] = numpy.mean(meanList)
meanList = []
I want to take the first column as the start and the second column as the end, calculate the mean over all coordinates between them for each row, and put the results in the column I have created.
Thanks.
Consider the first dataframe (the one with the ranges) as dfRange and the second dataframe as dfData.
Step 1 - Find the shape of dfRange; shape gives you the maximum number of rows.
Step 2 - Use a for loop:
for rowNumber in range(maxRows):
Inside it you can get each row of dfRange and its corresponding start and end values, e.g. dfRange.iloc[rowNumber, 0] gives the Exon region start and dfRange.iloc[rowNumber, 1] gives the Exon region end.
Step 3 - Slice: tempDf = dfData[start:end+1]
Step 4 - Sum up or take the mean of tempDf along whatever axis you want.
Step 5 - Store that result wherever you want.
Step 6 - Loop back for the other rows.
Otherwise, instead of using shape you can iterate directly with
for index, row in dfRange.iterrows():
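Putting those steps together, a minimal sketch of the iterrows() variant. The small frames are placeholders for the question's data, and it assumes, as the dfData[start:end+1] slice above does, that the start/end values index rows of dfData; swap in a filter on 'pos' if they are genomic coordinates instead.

import pandas as pd

# Placeholder frames standing in for the question's dataframes.
dfRange = pd.DataFrame({'Exon region start (bp)': [577, 648, 1602],
                        'Exon region end (bp)':   [647, 1601, 1670]})
dfData = pd.DataFrame({'mean': [0.03] * 4000})

exon_means = []
for index, row in dfRange.iterrows():
    start = int(row['Exon region start (bp)'])
    end = int(row['Exon region end (bp)'])
    tempDf = dfData.iloc[start:end + 1]        # rows between start and end (inclusive)
    exon_means.append(tempDf['mean'].mean())   # average coverage for this exon

dfRange['average_exon_coverage'] = exon_means
print(dfRange)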

How to set the threshold for closest date when comparing two columns with dates

I am extracting rows from two csv files based on the values in two columns. The column names in both files are the same but the values differ. I match the 'Education_Period' column of df22 with the 'Education_Period' column of df33 to find rows with exactly the same 'Education_Period' and the closest 'Program_startDate'. For the closest 'Program_startDate', the code compares the 'Program_startDate' of df22 with the 'Program_startDate' of df33. If more than one row has the same closest date, it chooses one. At the end the output is written to df44 and df55 (as shown in the output tables below). Please note that df22 has fewer records than df33.
Here are the two issues with the Python code below.
Since we are pairing the dataframes by matching, the number of records in df44 and df55 should be equal at the end, but I am getting different counts (df44 has fewer rows than df55). How can this be fixed?
Is there a way to set a threshold for the closest 'Program_startDate'? For example, when I run this code on my huge datasets, I see records where, say, 22/12/2009 is matched with 12/06/2015 as the closest date. I want the closest match to be within 6 months (or any other number I choose as a threshold). Please guide me on how I can do this in the code below.
Input:
df2
Student_IDs Education_Period Waiting_Period Program_startDate
23C 100.5 5.5 29/03/2018
34B 77.2 3.0 12/12/2009
11X 77.2 8.5 14/09/2019
88N 99.9 12.0 20/03/2017
22A 77.2 12.0 30/03/2015
df3
Student_IDs Education_Period Waiting_Period Program_startDate
11X 30.5 40.0 29/03/2018
99Y 77.2 20.0 12/12/2009
88Z 14.1 19.1 14/09/2016
12Z 77.2 15.0 26/06/2018
234M 100.5 19.2 30/03/2015
34M 100.5 44.5 30/04/2018
Output:
df4
Student_IDs Education_Period Waiting_Period Program_startDate
23C 100.5 5.5 29/03/2018
34B 77.2 3.0 12/12/2009
11X 77.2 8.5 14/09/2019
df5
Student_IDs Education_Period Waiting_Period Program_startDate
234M 100.5 19.2 30/04/2018
99Y 77.2 20.0 12/12/2009
12Z 77.2 15.0 26/06/2018
Python Code:
df23 = df22.merge(df33, on='Education_Period', how='inner')
df23['TD'] = (df23['Program_startDate_x'] - df23['Program_startDate_y']).abs()
df23 = df23.sort_values(by=['TD']).reset_index()

df23i = (df23.sort_values(by=['TD']).reset_index()
             .drop_duplicates(subset=['Education_Period', 'Student_IDs_x'])
             .groupby(['Education_Period'])['TD'].idxmin())
df23i = [x[1] if type(x) == tuple else x for x in df23i]

df23j = (df23.sort_values(by=['TD']).reset_index()
             .drop_duplicates(subset=['Education_Period', 'Student_IDs_y'])
             .groupby(['Education_Period'])['TD'].idxmin())
df23j = [x[1] if type(x) == tuple else x for x in df23j]

df44 = (df23.loc[df23i, ['Education_Period', 'Student_IDs_x', 'Waiting_Period_x', 'Program_startDate_x']]
            .reset_index(drop=True)
            .rename(columns={'Student_IDs_x': 'Student_IDs',
                             'Waiting_Period_x': 'Waiting_Period',
                             'Program_startDate_x': 'Program_startDate'})
            .drop_duplicates(subset=['Education_Period', 'Student_IDs']))
df55 = (df23.loc[df23j, ['Education_Period', 'Student_IDs_y', 'Waiting_Period_y', 'Program_startDate_y']]
            .reset_index(drop=True)
            .rename(columns={'Student_IDs_y': 'Student_IDs',
                             'Waiting_Period_y': 'Waiting_Period',
                             'Program_startDate_y': 'Program_startDate'})
            .drop_duplicates(subset=['Education_Period', 'Student_IDs']))
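No answer is posted here, but as a hedged sketch for the threshold part only: once 'TD' holds the absolute date difference (as in the code above), pairs whose gap is too large can be dropped before the idxmin() selection.

import pandas as pd

# Hedged sketch: 183 days stands in for "about 6 months"; it assumes the two
# Program_startDate columns were parsed with pd.to_datetime(..., dayfirst=True)
# before the merge, so that 'TD' is a Timedelta column.
max_gap = pd.Timedelta(days=183)
df23 = df23[df23['TD'] <= max_gap]   # drop pairs whose dates are too far apart

An alternative worth considering is pd.merge_asof with by='Education_Period', direction='nearest' and a tolerance= timedelta, which does the nearest-date pairing and the threshold in one call (it requires both frames to be sorted by the date column).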

Pandas Way of Weighted Average in a Large DataFrame

I have a large dataset (around 8 million rows x 25 columns) in Pandas, and I am struggling to find a way to compute a weighted average over this dataframe which in turn creates another dataframe.
Here is what my dataset looks like (a very simplified version of it):
                   prec  temp
location_id hours
135         1      12.0   4.0
            2      14.0   4.1
            3      14.3   3.5
            4      15.0   4.5
            5      15.0   4.2
            6      15.0   4.7
            7      15.5   5.1
136         1      12.0   4.0
            2      14.0   4.1
            3      14.3   3.5
            4      15.0   4.5
            5      15.0   4.2
            6      15.0   4.7
            7      15.5   5.1
I have a multi-index on [location_id, hours]. I have around 60k locations and 140 hours for each location (making up the 8 million rows).
The rest of the data is numeric (float) or categorical. I have only included 2 columns here, normally there are around 20 columns.
What I want to do is create a new dataframe that is basically a weighted average of this dataframe. The requirements say that 12 of these location_ids should be averaged with specified weights to form the combined_location_id values.
For example, location_ids 1,3,5,7,9,11,13,15,17,19,21,23 with their appropriate weights (coming in separately from another dataframe) should be weighted-averaged to form the data for combined_location_id CL_1.
That is a lot of data to handle and I wasn't able to find a completely Pandas way of solving it. Therefore, I went with a for loop approach. It is extremely slow and I am sure this is not the right way to do it:
def __weighted(self, ds, weights):
    return np.average(ds, weights=weights)

f = {'hours': 'first', 'location_id': 'first',
     'temp': lambda x: self.__weighted(x, weights),
     'prec': lambda x: self.__weighted(x, weights)}

data_frames = []
for combined_location in all_combined_locations:
    mapped_location_ids = combined_location.location_ids
    weights = combined_location.weights_of_location_ids
    data_for_this_combined_location = pd.concat(
        df_data.loc[df_data.index.get_level_values(0) == location_id]
        for location_id in mapped_location_ids)
    data_grouped_by_distance = data_for_this_combined_location.groupby("hours", as_index=False)
    data_grouped_by_distance = data_grouped_by_distance.agg(f)
    data_frames.append(data_grouped_by_distance)

df_combined_location_data = pd.concat(data_frames)
df_combined_location_data.set_index(['location_id', 'hours'], inplace=True)
This works functionally, but the performance and memory consumption are horrible. It takes over 2 hours on my dataset, which is currently not acceptable. The existence of the for loop is an indicator that this could be handled better.
Is there a better/faster way to implement this?
From what I can see, you can remove one loop: instead of concatenating a slice per location_id in mapped_location_ids, select them all at once with
data_for_this_combined_location = df_data.loc[df_data.index.get_level_values(0).isin(mapped_location_ids)]
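For context, a hedged sketch of how that suggestion slots into the loop from the question (all other names are the question's own):

for combined_location in all_combined_locations:
    mapped_location_ids = combined_location.location_ids
    weights = combined_location.weights_of_location_ids
    # One .isin() lookup replaces the per-location_id concat from the question.
    data_for_this_combined_location = df_data.loc[
        df_data.index.get_level_values(0).isin(mapped_location_ids)]
    data_frames.append(
        data_for_this_combined_location.groupby("hours", as_index=False).agg(f))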
