I have a pandas data frame like this:
Subset  Position  Value
1       1         2
1       10        3
1       15        0.285714
1       43        1
1       48        0
1       89        2
1       132       2
1       152       0.285714
1       189       0.133333
1       200       0
2       1         0.133333
2       10        0
2       15        2
2       33        2
2       36        0.285714
2       72        2
2       132       0.133333
2       152       0.133333
2       220       3
2       250       8
2       350       6
2       750       0
How can I get the mean of every x rows, with a step size of y, for the Value column within each subset in pandas?
For example, the mean of every 5 rows with a step size of 2 in each subset, like this (the first window of Subset 1 covers positions 1 through 48, and its mean is (2 + 3 + 0.285714 + 1 + 0) / 5 = 1.2571428):
Subset  Start_position  End_position  Mean
1       1               48            1.2571428
1       15              132           1.0571428
1       48              189           0.8838094
2       1               36            0.8838094
2       15              132           1.2838094
2       36              220           1.110476
2       132             350           3.4533332
Is this what you were looking for?

import pandas as pd

df = pd.DataFrame({'Subset': [1]*10 + [2]*12,
                   'Position': [1, 10, 15, 43, 48, 89, 132, 152, 189, 200,
                                1, 10, 15, 33, 36, 72, 132, 152, 220, 250, 350, 750],
                   'Value': [2, 3, .285714, 1, 0, 2, 2, .285714, .133333, 0,
                             .133333, 0, 2, 2, .285714, 2, .133333, .133333, 3, 8, 6, 0]})

window = 5
step_size = 2
records = []  # collect one dict per window; DataFrame.append is deprecated
for subset in df.Subset.unique():
    subset_df = df[df.Subset == subset].reset_index(drop=True)
    for i in range(0, len(subset_df), step_size):
        window_rows = subset_df.iloc[i:i + window]
        if len(window_rows) < window:  # skip incomplete trailing windows
            continue
        records.append({'Subset': subset,
                        'Start_position': window_rows.Position.iloc[0],
                        'End_position': window_rows.Position.iloc[-1],
                        'Mean': window_rows.Value.mean()})
averaged_df = pd.DataFrame(records, columns=['Subset', 'Start_position', 'End_position', 'Mean'])
Some notes about the code:
It assumes all subsets are in order in the original df (1,1,2,1,2,2 will behave as if it were 1,1,1,2,2,2).
If a trailing group is smaller than a window, it is skipped (e.g. Subset 1, positions 132 to 200, mean 0.60476, is not included).
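If you prefer to avoid the explicit loops, here is a minimal sketch of the same computation built only from long-standing pandas primitives (groupby, rolling, shift); the step is applied by slicing each per-subset result with iloc[::step]:

window, step = 5, 2

def summarize(g):
    # For the window ending at each row: its first position, last position, and mean value
    res = pd.DataFrame({
        'Start_position': g['Position'].shift(window - 1),
        'End_position': g['Position'],
        'Mean': g['Value'].rolling(window).mean(),
    })
    # Rows before the first complete window are NaN; drop them, then take every step-th window
    return res.dropna().iloc[::step]

averaged_df = (df.groupby('Subset')[['Position', 'Value']].apply(summarize)
                 .reset_index(level=0).reset_index(drop=True)
                 .astype({'Start_position': int}))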
A version-specific answer uses pandas.api.indexers.FixedForwardWindowIndexer, introduced in pandas 1.1.0:
>>> window=5
>>> step=2
>>> indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=window)
>>> df2 = df.join(df.Position.shift(-(window-1)), lsuffix='_start', rsuffix='_end')
>>> df2 = df2.assign(Mean=df2.pop('Value').rolling(window=indexer).mean()).iloc[::step]
>>> df2 = df2[df2.Position_start.lt(df2.Position_end)].dropna()
>>> df2['Position_end'] = df2['Position_end'].astype(int)
>>> df2
    Subset  Position_start  Position_end      Mean
0        1               1            48  1.257143
2        1              15           132  1.057143
4        1              48           189  0.883809
10       2               1            36  0.883809
12       2              15           132  1.283809
14       2              36           220  1.110476
16       2             132           350  3.453333
I have a big text file; here is a small example:
>chr9:128683-128744
GGATTTCTTCTTAGTTTGGATCCATTGCTGGTGAGCTAGTGGGATTTTTTGGGGGGTGTTA
>chr16:134222-134283
AGCTGGAAGCAGCGTGGGAATCACAGAATGGCCGAGAACTTAAAGGCTTTGCTTGGCCTGG
>chr16:134226-134287
GGAAGCAGCGTGGGAATCACAGAATGGACGGCCGATTAAAGGCTTTGCTTGGCCTGGATTT
>chr1:134723-134784
AAGTGATTCACCCTGCCTTTCCGACCTTCCCCAGAACAGAACACGTTGATCGTGGGCGATA
>chr16:135770-135831
GCCTGAGCAAAGGGCCTGCCCAGACAAGATTTTTTAATTGTTTAAAAACCGAATAAATGTT
>chr16:135787-135848
GCCCAGACAAGATTTTTTAATTGTTTAAAAACCGAATAAATGTTTTATTTCTAGAAAACTG
>chr16:135788-135849
CCCAGACAAGATTTTTTAATTGTTTAAAAACCGAATAAATGTTTTATTTCTAGAAAACTGT
>chr16:136245-136306
CACTTCACAAATAGAAGGCTGTCAGAGAGACAGGGACAGGCCACACAAGTGTTTCTGCACA
>chr7:146692-146753
GTGTGACCAAAACTTAGGATGTTAGCCGAACTCTCCGTTACTATCATTTTGGATTTCCAGT
>chr8:147932-147993
GGTAAAGGTAAATACATAAACAAACATAAAACCGATCCTATTGTAATTTTGGTTTGTAACT
This file is divided into records, and every record has 2 parts (2 lines): the 1st line, which starts with >, is an ID, and the 2nd line is a sequence of characters. Every sequence is 61 characters long.
I have a short sequence (which is CCGA) and I would like to scan every 2nd line for it. The output should be a text file with 2 columns:
1st column: the position at which the beginning of the short sequence is located (every 2nd line has 61 characters, so the positions run from 1 to 61).
2nd column: the count of the number of times the beginning of the short sequence is located at that specific position.
For instance, in the following sequence of characters, the beginning of the short sequence is at position 49:
GCCTGAGCAAAGGGCCTGCCCAGACAAGATTTTTTAATTGTTTAAAAACCGAATAAATGTT
For the small example, the expected output would look like this:
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 1
22 0
23 0
24 0
25 0
26 1
27 0
28 0
29 0
30 0
31 1
32 4
33 0
34 0
35 0
36 0
37 0
38 0
39 0
40 0
41 0
42 0
43 0
44 0
45 0
46 0
47 0
48 0
49 1
50 0
51 0
52 0
53 0
54 0
55 0
56 0
57 0
58 0
59 0
60 0
61 0
I am trying to do that in Python using the following code, but the output is not what I want:
infile = open('infile.txt', 'r')
ss = 'CCGA'
count = 0
for line in infile:
    if not line.startswith('>'):
        for match in pattern.finder(ss):
            count += 1
            POSITION = pattern.finder(ss)
            COUNT = count
Do you know how to fix it?
The code below uses finditer to find all non-overlapping occurrences of the CCGA pattern, and creates a mapping from the index at which a match begins to the number of times a match has begun at that index.
from re import compile
from collections import defaultdict

pat = compile(r'CCGA')
mapping = defaultdict(int)

with open('infile.txt', 'r') as infile:
    for line in infile:
        if not line.startswith('>'):
            for match in pat.finditer(line):
                mapping[match.start() + 1] += 1

for i in range(1, 62):
    print("{:>2} {:>2}".format(i, mapping[i]))
prints
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 1
22 0
23 0
24 0
25 0
26 1
27 0
28 0
29 0
30 0
31 1
32 4
33 0
34 0
35 0
36 0
37 0
38 0
39 0
40 0
41 0
42 0
43 0
44 0
45 0
46 0
47 0
48 0
49 1
50 0
51 0
52 0
53 0
54 0
55 0
56 0
57 0
58 0
59 0
60 0
61 0
One way to export it to a file would be to use the print function:

with open('outfile.txt', 'w+') as outfile:
    for i in range(1, 62):
        print(i, mapping[i], sep='\t', file=outfile)
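If the file might deviate from the strict two-line pattern (blank lines, or sequences that are not exactly 61 characters), a slightly more defensive variant could pair each ID line with the line that follows it. A sketch, assuming the same infile.txt layout:

from re import compile
from collections import defaultdict

pat = compile(r'CCGA')
mapping = defaultdict(int)

with open('infile.txt') as infile:
    for line in infile:
        if line.startswith('>'):
            seq = next(infile).strip()  # the sequence line follows its ID line
            if len(seq) != 61:          # sanity check against the stated format
                raise ValueError('unexpected sequence length after ' + line.strip())
            for match in pat.finditer(seq):
                mapping[match.start() + 1] += 1  # report 1-based positions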
I have a data frame, df, with two columns, 'xPos' and 'lineNum':
import pandas as pd
data = '''\
xPos  lineNum
40    1
50    1
75    1
90    1
42    2
75    2
110   2
45    3
70    3
95    3
125   3
38    4
56    4
74    4'''
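One way to turn that string into the actual frame (a sketch; the question never shows this step, so the io.StringIO construction is an assumption):

import io

df = pd.read_csv(io.StringIO(data), sep=r'\s+')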
I have created the aggregate data frame for this with the
aggrDF = df.describe(include='all')
command, and I am interested in the minimum of the xPos values, which I get using
minxPos = aggrDF.loc['min', 'xPos']
Desired output

data = '''\
xPos  lineNum  xDiff
40    1        2
50    1        10
75    1        25
90    1        15
42    2        4
75    2        33
110   2        35
45    3        7
70    3        25
95    3        25
125   3        30
38    4        0
56    4        18
74    4        18'''
The logic
I want to compare two consecutive rows of the data frame and calculate a new column based on this logic:

if df['lineNum'] != df['lineNum'].shift(1):
    df['xDiff'] = df['xPos'] - minxPos
else:
    df['xDiff'] = df['xPos'] - df['xPos'].shift(1)

Essentially, I want the new column to hold the difference between the two consecutive rows of xPos, as long as the line number is the same.
If the line number changes, then the xDiff column should hold the difference from the minimum xPos value that I have from the aggregate data frame.
Can you please help? Thanks!
These two lines should do it:

df['xDiff'] = df.groupby('lineNum').diff()['xPos']            # consecutive difference within each lineNum
df.loc[df['xDiff'].isnull(), 'xDiff'] = df['xPos'] - minxPos  # first row of each group: distance from the global min
>>> df
    xPos  lineNum  xDiff
0     40        1    2.0
1     50        1   10.0
2     75        1   25.0
3     90        1   15.0
4     42        2    4.0
5     75        2   33.0
6    110        2   35.0
7     45        3    7.0
8     70        3   25.0
9     95        3   25.0
10   125        3   30.0
11    38        4    0.0
12    56        4   18.0
13    74        4   18.0
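The same idea also fits on one line, using fillna to handle the first row of each group (a sketch, equivalent to the two lines above):

df['xDiff'] = df.groupby('lineNum')['xPos'].diff().fillna(df['xPos'] - minxPos)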
You just need to group by lineNum and apply the condition you have already written down (with the consecutive difference in the else branch, and minxPos from the question for the first row of each group):

import numpy as np

df['xDiff'] = np.concatenate(df.groupby('lineNum')
                               .apply(lambda x: np.where(x['lineNum'] != x['lineNum'].shift(1),
                                                         x['xPos'] - minxPos,
                                                         x['xPos'] - x['xPos'].shift(1)).astype(int))
                               .values)
df
Out[76]:
    xPos  lineNum  xDiff
0     40        1      2
1     50        1     10
2     75        1     25
3     90        1     15
4     42        2      4
5     75        2     33
6    110        2     35
7     45        3      7
8     70        3     25
9     95        3     25
10   125        3     30
11    38        4      0
12    56        4     18
13    74        4     18
What I want to do:
I have a big .csv file, and I want to break it down into many small files based on the common records in the BB column that also contain 1 in the HH column, together with all of the uncommon records that have no value in the BB column and contain 0 in the HH column.
As a result, every output file will contain the records sharing one common value in the BB column (with 1 in the HH column), followed by all of the uncommon records (empty BB column, 0 in the HH column). The file name should be based on the common record of column 2 (BB). Please take a look below for the scenario.
Any suggestion or idea is highly appreciated.
bigFile.csv :
AA  BB  CC   DD  EE  FF  GG   HH
12  53  115  10  3   3   186  1
12  53  01e  23  3   2        1
12  53  0ce  65  1   3        1
12  53  173  73  4   2        1
12  59  115  0   3   3   186  1
12  59  125  0   3   3   186  1
12  61  01e  23  3   2        1
12  61  b6f  0   1   1        1
12  61  b1b  0   6   5   960  1
12      68b  95  3   5   334  0
12      31a  31  2   2        0
12      221  0   4   5        0
12      12b  25  5   4   215  0
12      a10  36  5   1        0
My expected result files would be as follows:
53.csv :
AA  BB  CC   DD  EE  FF  GG   HH
12  53  115  10  3   3   186  1
12  53  01e  23  3   2        1
12  53  0ce  65  1   3        1
12  53  173  73  4   2        1
12      68b  95  3   5   334  0
12      31a  31  2   2        0
12      221  0   4   5        0
12      12b  25  5   4   215  0
12      a10  36  5   1        0
59.csv :
AA  BB  CC   DD  EE  FF  GG   HH
12  59  115  0   3   3   186  1
12  59  125  0   3   3   186  1
12      68b  95  3   5   334  0
12      31a  31  2   2        0
12      221  0   4   5        0
12      12b  25  5   4   215  0
12      a10  36  5   1        0
61.csv :
AA  BB  CC   DD  EE  FF  GG   HH
12  61  01e  23  3   2        1
12  61  b6f  0   1   1        1
12  61  b1b  0   6   5   960  1
12      68b  95  3   5   334  0
12      31a  31  2   2        0
12      221  0   4   5        0
12      12b  25  5   4   215  0
12      a10  36  5   1        0
For the data you have provided, the following script will produce your requested output files. It will perform this operation on ALL CSV files found in the folder:
from itertools import groupby
import glob
import csv
import os

def remove_unwanted(rows):
    # Drop the first two columns and convert literal 'NULL' entries to empty strings
    return [['' if col == 'NULL' else col for col in row[2:]] for row in rows]

output_folder = 'temp'  # make sure this folder exists

# Search for ALL CSV files in the current folder
for csv_filename in glob.glob('*.csv'):
    with open(csv_filename) as f_input:
        basename = os.path.splitext(os.path.basename(csv_filename))[0]  # e.g. bigfile
        csv_input = csv.reader(f_input)
        header = next(csv_input)

        # Create a list of entries with '0' in the last column
        id_list = remove_unwanted(row for row in csv_input if row[7] == '0')

        f_input.seek(0)  # Go back to the start
        header = remove_unwanted([next(csv_input)])

        for k, g in groupby(csv_input, key=lambda x: x[1]):
            if k == '':
                break  # the trailing rows with an empty BB column
            # Format an output file name in the form 'bigfile_53.csv'
            file_name = os.path.join(output_folder, '{}_{}.csv'.format(basename, k))
            with open(file_name, 'wb') as f_output:
                csv_output = csv.writer(f_output)
                csv_output.writerows(header)
                csv_output.writerows(remove_unwanted(g))
                csv_output.writerows(id_list)
This will result in the files bigfile_53.csv, bigfile_59.csv and bigfile_61.csv being created in an output folder called temp. For example, bigfile_53.csv will appear as follows:

CC,DD,EE,FF,GG,HH
115,10,3,3,186,1
01e,23,3,2,,1
0ce,65,1,3,,1
173,73,4,2,,1
68b,95,3,5,334,0
31a,31,2,2,,0
221,0,4,5,,0
12b,25,5,4,215,0
a10,36,5,1,,0

Entries containing the string 'NULL' are converted to an empty string, and the first two columns are removed (as per OP's comment).
Tested in Python 2.7.9
You should look into the csv module. You can read your input file line by line and sort each line according to the BB column. This should be easy to do with a dictionary whose keys are the values in the BB column and whose values are lists containing the information from those rows. You can then write these lists to csv files using the csv module, as sketched below.
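A minimal sketch of that dictionary approach (Python 3 here, unlike the script above; it assumes, as that script does, that HH is the last column and that the shared HH == 0 rows have an empty BB field):

import csv
from collections import defaultdict

groups = defaultdict(list)  # BB value -> the rows belonging to that group
shared = []                 # HH == '0' rows, appended to every output file

with open('bigFile.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        if row[-1] == '0':              # HH column
            shared.append(row)
        else:
            groups[row[1]].append(row)  # key on the BB column

for bb, rows in groups.items():
    with open('{}.csv'.format(bb), 'w', newline='') as f_out:
        writer = csv.writer(f_out)
        writer.writerow(header)
        writer.writerows(rows)
        writer.writerows(shared)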
Hi, I'm relatively new to Python and am currently trying to measure the width of features in an image. The resolution of my image is 1 m, so measuring the width should be easier. I've managed to select certain columns or rows of the image and extract the necessary data using loops and such. My code is below:
subset = imarray[:,::500]#(imarray.shape[1]/2):(imarray.shape[1]/2)+1]
subset[(subset > 0) & (subset <= 17)] = 1
subset[(subset > 17)] = 0

width = []
count = 0
for i in np.arange(subset.shape[1]):
    column = subset[:,i]
    for value in column:
        if (value == 1):
            count += 1
            width.append(count)
            width_arr = np.array(width).astype('uint8')
        else:
            count = 0

final = np.split(width_arr, np.argwhere(width_arr == 1).flatten())
final2 = [x for x in final if x != []]

width2 = []
for array in final2:
    width2.append(max(array))
width2 = np.array(width2).astype('uint8')
print width2
I can't figure out how to split the output up so that it shows the results for each column or row individually. Instead, all I've been able to do is append the data to a single list; here's the output of that:
[ 70 35 4 2 5 36 4 5 2 51 97 4 228 3 21 47 7 21
23 58 126 4 111 2 2 5 3 2 18 15 6 19 3 3 12 15
6 8 2 4 6 88 122 24 14 49 73 57 74 6 179 8 3 2
6 3 184 9 3 19 24 3 2 2 3 255 30 8 191 33 127 5
3 27 112 2 24 2 5 2 10 30 10 6 37 2 38 6 12 17
44 67 23 5 101 10 9 4 6 4 255 136 5 255 255 255 255 26
255 235 148 4 255 199 3 2 114 87 255 109 69 12 41 20 30 57
72 89 32]
So these are the widths of the features in all of the columns, appended together. How do I use my loop, or another method, to split these up into individual numpy arrays representing each column I've sliced out of the original?
It seems like I'm almost there, but I can't figure out that last step, and it's driving me nuts.
Thanks in advance for your help!
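One way to keep the results separated is to measure the runs one column at a time and collect one array per column. A minimal sketch, assuming the same thresholded subset array built in the question:

import numpy as np

def run_lengths(column):
    # Lengths of consecutive runs of 1s in a single 1-D column
    runs, count = [], 0
    for value in column:
        if value == 1:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:  # close a run that reaches the bottom edge of the image
        runs.append(count)
    return np.array(runs)

# One array of feature widths per sampled column
widths_per_column = [run_lengths(subset[:, i]) for i in range(subset.shape[1])]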