Same code works differently depending on file size - python

I am running a simple code to select text from lines in the input file and write that text to an output file.
with open('inputpath', 'r') as vh_datoteka, open('outputpath', 'w') as iz_datoteka:
    for line in vh_datoteka:
        NMEA = line[24:-39]
        iz_datoteka.write(NMEA + '\n')
The data I need to process looks something like this (two lines):
2012-05-01
23:59:59.007;!AIVDM,1,1,0,,33cm>k100013vglDPkW1QSin0000,0*6E;2470028;1;NULL;2012-05-01
21:59:59.007 2012-05-01
23:59:59.007;!AIVDM,1,1,0,,19NSBn001nQ8<7vDhIq43C<2280F,0*07;2470032;1;NULL;2012-05-01
21:59:59.007 ...
Since I have large files to process (~2GB) I first tested the code on a small part of one of the large files (simply copied first 1000 or so lines and saved them into a test file).
The code worked perfectly and I got the results I was looking for:
!AIVDM,1,1,0,,33cm>k100013vglDPkW1QSin0000,0*6E;
!AIVDM,1,1,0,,19NSBn001nQ8<7vDhIq43C<2280F,0*07;
After that I tried using the code on the whole data and got very different outputs:
2 3 : 5 9 : 5 9 . 0 0 7 ; ! A I V D M , 1 , 1 , 0 , , 3 3 c m > k 1 0
0 0 1 3 v g l D P k W 1 Q S i n 0 0 0 0 , 0 * 6 E ; 2 4 7 0 0 2 8 ; 1
; N U L L ; 2 0 1 2 - 3 : 5 9 : 5 9 . 0 0 7 ; ! A I V D M , 1 , 1 ,
0 , , 1 9 N S B n 0 0 1 n Q 8 < 7 v D h I q 4 3 C < 2 2 8 0 F , 0 * 0
7 ; 2 4 7 0 0 3 2 ; 1 ; N U L L ; 2 0 1 2 - ...
I have been trying to figure out the reason for this behaviour, but I have run out of ideas and obviously need help.

Thank you Tobias for your comment.
Apparently the large data files were in UTF-16-LE, which was the problem. I corrected the Python code to read UTF-16 and write UTF-8, and that did the trick.
import codecs

with codecs.open('inputpath', 'r', encoding='utf-16-le') as vh_datoteka, \
        open('outputpath', 'w') as iz_datoteka:
    for line in vh_datoteka:
        NMEA = line[24:-39]
        iz_line = NMEA + '\n'
        iz_datoteka.write(iz_line.encode('utf-8'))
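In Python 3 the same fix no longer needs codecs, since the built-in open() takes an encoding argument directly. A minimal sketch (paths are placeholders, and the slice offsets are the ones from the question):

```python
def convert(src_path, dst_path):
    """Re-encode a UTF-16-LE log to UTF-8, keeping only the NMEA slice."""
    with open(src_path, 'r', encoding='utf-16-le') as vh_datoteka, \
            open(dst_path, 'w', encoding='utf-8') as iz_datoteka:
        for line in vh_datoteka:
            # [24:-39] drops the timestamp prefix and the trailing fields,
            # exactly as in the original code
            iz_datoteka.write(line[24:-39] + '\n')
```

With text-mode files the decoding and encoding happen transparently on read and write, so no explicit .encode() call is needed.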

Related

2D list editing in Python

I am trying to edit a 5 * 5 square matrix in Python, and I initialize every element in this 5 * 5 matrix with the value 0. I initialize the matrix using lists with this code:
h = []
for i in range(5):
    h.append([0, 0, 0, 0, 0])
And now I want to change the matrix to something like this.
4 5 0 0 0
0 4 5 0 0
0 0 4 5 0
0 0 0 4 5
5 0 0 0 4
Here is the piece of code -
i = 0
a = 0
while i < 5:
    h[i][a] = 4
    h[i][a+1] = 5
    a += 1
    i += 1
where h[i][j] is the 2D matrix. But the output always shows something like this -
4 4 4 4 4
4 4 4 4 4
4 4 4 4 4
4 4 4 4 4
4 4 4 4 4
Can you guys tell me what is wrong with it?
Do the update as follows using the modulo operator %:
for i in range(5):
    h[i][i % 5] = 4
    h[i][(i+1) % 5] = 5
The % 5 in the first line isn't strictly necessary, but it underlines the general principle for matrices of various dimensions. Or more generally, for arbitrary dimensions:
for i, row in enumerate(h):
    n = len(row)
    row[i % n] = 4
    row[(i+1) % n] = 5
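For completeness, the modulo version can be run end to end on the question's own 5x5 case, starting from the asker's aliasing-free initialization:

```python
# Build the matrix exactly as in the question: five independent rows.
h = []
for i in range(5):
    h.append([0, 0, 0, 0, 0])

# Place the 4/5 pair on each row; % 5 wraps the 5 around in the last row.
for i in range(5):
    h[i][i % 5] = 4
    h[i][(i + 1) % 5] = 5

for row in h:
    print(*row)
```

This prints the target matrix from the question, including the wrapped last row 5 0 0 0 4.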
Question answered here: 2D list has weird behavior when trying to modify a single value
This should work:
#m = [[0]*5]*5  # Don't do this.
m = []
for i in range(5):
    m.append([0]*5)

i = a = 0
while i < 5:
    m[i][a] = 4
    if a < 4:
        m[i][a+1] = 5
    a += 1
    i += 1

How to read a text file using pandas in Python and split each character/letter of the data frame

text_df:
7 3
Tno
h%n
a #
tA
$c
#T%
ii!
And, I want the file to be as below:
7 3
T n o
h % n
a #
t A
$ c
# T %
i i !
Can anyone help me out with this? I tried the code below, but it is not working:
df = pd.read_csv("location\\text_df.txt", sep='', header=None)
Use pd.read_fwf
pd.read_fwf('location\\text_df.txt', widths=[1,1,1], header=None)
0 1 2
0 7 NaN 3
1 T n o
2 h % n
3 a NaN #
4 t A NaN
5 $ c NaN
6 # T %
7 i i !
Or
pd.read_fwf('location\\text_df.txt', widths=[1,1,1], header=None).fillna('')
0 1 2
0 7 3
1 T n o
2 h % n
3 a #
4 t A
5 $ c
6 # T %
7 i i !
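If the end goal is a text file like the one shown rather than a DataFrame, the read_fwf result can be written straight back out with to_csv. A small sketch wrapping the idea in a function (the function name and paths are made up; widths=[1,1,1] matches the three-character rows in the question):

```python
import pandas as pd

def space_out(src_path, dst_path):
    """Read the fixed-width file and rewrite it with one space
    between the character columns."""
    # Each column is a single character; blanks become NaN, then ''.
    df = pd.read_fwf(src_path, widths=[1, 1, 1], header=None).fillna('')
    # sep=' ' puts the single space between columns in the output file.
    df.to_csv(dst_path, sep=' ', header=False, index=False)
    return df
```

to_csv with sep=' ' and header=False, index=False reproduces the spaced layout line by line.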

Reading different text files and extracting same-index lines [closed]

I am not a Python programmer, but I need to make an input file for a piece of software. I have a file a.txt and a file b.txt, and each line in a.txt corresponds to an "indexes" block in b.txt.
a.txt:
0 0 0 L M L 41 ACC sh 1.008732
1 0 0 L * L 53 NCR sh 1.022706
2 1 1 L M L 18 ACC sh 1.025172
3 2 2 L M L 17 ACC sh 1.017734
4 2 2 L * L 21 NCR sh 1.025410
b.txt:
#indexes: 0 0 0
1 -0.375E+04 0.382E+01
2 -0.375E+04 0.432E+01
3 -0.376E+04 0.353E+01
#indexes: 1 0 0
1 -0.635E+04 0.331E+01
2 -0.235E+04 0.238E+01
#indexes: 2 1 1
1 -0.735E+04 0.093E+01
#indexes: 3 2 2
1 -0.835E+04 0.331E+01
2 -0.035E+04 0.438E+01
#indexes: 4 2 2
1 -0.475E+04 0.331E+01
2 -0.365E+04 0.438E+01
I need to extract the lines with "ACC" in the 8th column of a.txt and store them in a new file a_new.txt.
a_new.txt:
0 0 0 L M L 41 ACC sh 1.008732
2 1 1 L M L 18 ACC sh 1.025172
3 2 2 L M L 17 ACC sh 1.017734
Then read b.txt, find the "indexes" lines, check whether the numbers in each line are the same as in the ACC lines (first 3 columns), and if so store that index block in b_new.txt:
b_new.txt:
#indexes: 0 0 0
1 -0.375E+04 0.382E+01
2 -0.375E+04 0.432E+01
3 -0.376E+04 0.353E+01
#indexes: 2 1 1
1 -0.735E+04 0.093E+01
#indexes: 3 2 2
1 -0.835E+04 0.331E+01
2 -0.035E+04 0.438E+01
I would appreciate it if anybody could help.
It took a couple of minutes to do this:
import re

f = open('a.txt', 'r')
a = f.read()
f.close()

a_new = open('a_new.txt', 'w')
a_new.write('\n'.join(re.findall(r'(^.*ACC.*$)', a, re.M)))
a_new.close()

f = open('b.txt', 'r')
b = f.read()
f.close()

with open('b_new.txt', 'w') as b_new, open('a_new.txt', 'r') as a_new:
    inds = [x.replace(' ', '') for x in re.findall(r'^\s*(\d\s*\d\s*\d)', a_new.read(), re.M)]
    for ind in inds:
        reg = r'(#indexes:\s*{0}\s*{1}\s*{2}[\s\S]*?(?=#indexes|$))'.format(*list(ind))
        matches = re.findall(reg, b)
        b_new.write('\n'.join(matches))
After running it, a_new.txt will look like:
0 0 0 L M L 41 ACC sh 1.008732
2 1 1 L M L 18 ACC sh 1.025172
3 2 2 L M L 17 ACC sh 1.017734
and b_new.txt:
#indexes: 0 0 0
1 -0.375E+04 0.382E+01
2 -0.375E+04 0.432E+01
3 -0.376E+04 0.353E+01
#indexes: 2 1 1
1 -0.735E+04 0.093E+01
#indexes: 3 2 2
1 -0.835E+04 0.331E+01
2 -0.035E+04 0.438E+01
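If the regular expressions feel opaque, the same result can be had with plain line-by-line parsing. A sketch operating on lists of lines (the function name filter_acc is made up; the column positions follow the question's files):

```python
def filter_acc(a_lines, b_lines):
    """Keep ACC rows from a.txt and the matching #indexes blocks from b.txt."""
    a_new, wanted = [], set()
    for line in a_lines:
        cols = line.split()
        # 8th column (index 7) must be ACC; remember its first three columns
        if len(cols) >= 8 and cols[7] == 'ACC':
            a_new.append(line)
            wanted.add(tuple(cols[:3]))
    b_new, keep = [], False
    for line in b_lines:
        if line.startswith('#indexes:'):
            # a header line switches copying on/off for the block that follows
            keep = tuple(line.split()[1:4]) in wanted
        if keep:
            b_new.append(line)
    return a_new, b_new
```

This streams through both files once, so it also stays cheap on large inputs.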

Loop to multiply column i by 2, add column j, and repeat for all pairs of columns in a file [closed]

I need to convert a genotype dosage file into an allelic dosage file.
Input looks like this:
#snp a1 a2 i1 j1 i2 j2 i3 j3
chr6_24000211_D D I3 0 0 0 0 0 0
rs78244999 A G 1 0 1 0 1 0
rs1511479 T C 0 1 1 0 0 1
rs34425199 A C 0 0 0 0 0 0
rs181892770 A G 1 0 1 0 1 0
rs501871 A G 0 1 0.997 0.003 0 1
chr6_24000836_D D I4 0 0 0 0 0 0
chr6_24000891_I I2 D 0 0 0 0 0 1
rs16888446 A C 0 0 0 0 0 0
Columns 1-3 are identifiers. No operations should be performed on these; they just need to be copied as is into the output file. The remaining columns need to be considered as pairs of column i and column j, and the following operation needs to be performed: 2*i + j
Pseudocode
write first three columns of input file to output
for all i and j in the file, write 2*i + j to output
Desired output looks like this:
#snp a1 a2 1 2 3
chr6_24000211_D D I3 0 0 0
rs78244999 A G 2 2 2
rs1511479 T C 1 2 1
rs34425199 A C 0 0 0
rs181892770 A G 2 2 2
rs501871 A G 1 1.997 1
chr6_24000836_D D I4 0 0 0
chr6_24000891_I I2 D 0 0 1
rs16888446 A C 0 0 0
I will be performing this on a number of files with different total columns, so I want the loop to run for (total number of columns - 3)/2 iterations, i.e. until it reaches the last column of the file.
Input files are ~9 million rows by ~10,000 columns, so reading the files into a program such as R is very slow. I am not sure of the most efficient tool for this (awk? Perl? Python?), and as a novice coder I am unsure of where to begin regarding syntax for the solution.
Here's the awk implementation of your posted algorithm, enhanced just slightly to produce the first row you show in your expected output:
$ cat tst.awk
{
    printf "%s %s %s", $1, $2, $3
    c = 0
    for (i=4; i<NF; i+=2) {
        printf " %s", (NR>1 ? 2*$i + $(i+1) : ++c)
    }
    print ""
}
$ awk -f tst.awk file
#snp a1 a2 1 2 3
chr6_24000211_D D I3 0 0 0
rs78244999 A G 2 2 2
rs1511479 T C 1 2 1
rs34425199 A C 0 0 0
rs181892770 A G 2 2 2
rs501871 A G 1 1.997 1
chr6_24000836_D D I4 0 0 0
chr6_24000891_I I2 D 0 0 1
rs16888446 A C 0 0 0
Python version
#!/usr/bin/env python
from itertools import izip_longest

def chunk(sequence, chunk_size=2):
    """
    list(chunk([1,2,3,4], 2)) => [(1,2),(3,4)]
    """
    # Take advantage of the same iterator being consumed
    # multiple times to do the grouping
    return izip_longest(*[iter(sequence)] * chunk_size)

def processor(csv_reader):
    for row in csv_reader:
        # collect the pairs and process them
        processed_pairs = (2*float(i) + float(j) for i, j in chunk(row[3:]))
        # yield back the first 3 elements and the processed pairs
        yield list(i for j in (row[0:3], processed_pairs) for i in j)

if __name__ == '__main__':
    import csv, sys
    with open(sys.argv[1], 'rb') as csvfile:
        source = processor(csv.reader(csvfile, delimiter=' '))
        for line in source:
            print line
This will do as you ask. It expects the input file as a parameter on the command line and will send the output to STDOUT, which you can redirect to a file if you wish.
use strict;
use warnings;

while (<>) {
    my @fields = split;
    my @probs = splice @fields, 3;
    if (/^#/) {
        push @fields, 1 .. @probs / 2;
    }
    else {
        while (@probs >= 2) {
            my ($i, $j) = splice @probs, 0, 2;
            push @fields, $i + $i + $j;
        }
    }
    print "@fields\n";
}
output
#snp a1 a2 1 2 3
chr6_24000211_D D I3 0 0 0
rs78244999 A G 2 2 2
rs1511479 T C 1 2 1
rs34425199 A C 0 0 0
rs181892770 A G 2 2 2
rs501871 A G 1 1.997 1
chr6_24000836_D D I4 0 0 0
chr6_24000891_I I2 D 0 0 1
rs16888446 A C 0 0 0
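For comparison, a Python 3 sketch of the same pairwise transform, operating on a list of lines (transform is a made-up helper name; numbers are formatted with 'g', roughly how awk prints them):

```python
def transform(lines):
    out = []
    for lineno, line in enumerate(lines):
        fields = line.split()
        head, pairs = fields[:3], fields[3:]
        if lineno == 0:
            # header row: replace the i/j labels with column numbers 1..N
            vals = [str(n + 1) for n in range(len(pairs) // 2)]
        else:
            # 2*i + j for each (i, j) pair of dosage columns
            vals = [format(2 * float(i) + float(j), 'g')
                    for i, j in zip(pairs[::2], pairs[1::2])]
        out.append(' '.join(head + vals))
    return out
```

Since it works row by row, it can be dropped into a loop over a file handle without ever holding the ~9-million-row file in memory.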

best way to implement Apriori in python pandas

What is the best way to implement the Apriori algorithm in pandas? So far I am stuck on extracting the patterns using for loops. Everything from the for loop onward does not work. Is there a vectorized way to do this in pandas?
import pandas as pd
import numpy as np

trans = pd.read_table('output.txt', header=None, index_col=0)

def apriori(trans, support=4):
    ts = pd.get_dummies(trans.unstack().dropna()).groupby(level=1).sum()
    # user input
    collen, rowlen = ts.shape
    # max length of items
    tssum = ts.sum(axis=1)
    maxlen = tssum.loc[tssum.idxmax()]
    items = list(ts.columns)
    results = []
    # loop through items
    for c in range(1, maxlen):
        # generate patterns
        pattern = []
        for n in len(pattern):
            # calculate support
            pattern = ['supp'] = pattern.sum/rowlen
            # filter by support level
            Condit = pattern['supp'] > support
            pattern = pattern[Condit]
            results.append(pattern)
    return results

results = apriori(trans)
print results
When I input this with support 3
a b c d e
0
11 1 1 1 0 0
666 1 0 0 1 1
10101 0 1 1 1 0
1010 1 1 1 1 0
414147 0 1 1 0 0
10101 1 1 0 1 0
1242 0 0 0 1 1
101 1 1 1 1 0
411 0 0 1 1 1
444 1 1 1 0 0
it should output something like
Pattern support
a 6
b 7
c 7
d 7
e 3
a,b 5
a,c 4
a,d 4
Assuming I understand what you're after, maybe
import pandas as pd
from itertools import combinations

def get_support(df):
    pp = []
    for cnum in range(1, len(df.columns) + 1):
        for cols in combinations(df, cnum):
            s = df[list(cols)].all(axis=1).sum()
            pp.append([",".join(cols), s])
    sdf = pd.DataFrame(pp, columns=["Pattern", "Support"])
    return sdf
would get you started:
>>> s = get_support(df)
>>> s[s.Support >= 3]
Pattern Support
0 a 6
1 b 7
2 c 7
3 d 7
4 e 3
5 a,b 5
6 a,c 4
7 a,d 4
9 b,c 6
10 b,d 4
12 c,d 4
14 d,e 3
15 a,b,c 4
16 a,b,d 3
21 b,c,d 3
[15 rows x 2 columns]
Add support, confidence, and lift calculation.
def apriori(data, set_length=2):
    import pandas as pd
    from itertools import combinations
    df_supports = []
    dataset_size = len(data)
    for combination_number in range(1, set_length + 1):
        for cols in combinations(data.columns, combination_number):
            supports = data[list(cols)].all(axis=1).sum() * 1.0 / dataset_size
            confidenceAB = data[list(cols)].all(axis=1).sum() * 1.0 / len(data[data[cols[0]] == 1])
            confidenceBA = data[list(cols)].all(axis=1).sum() * 1.0 / len(data[data[cols[-1]] == 1])
            liftAB = confidenceAB * dataset_size / len(data[data[cols[-1]] == 1])
            liftBA = confidenceAB * dataset_size / len(data[data[cols[0]] == 1])
            df_supports.append([",".join(cols), supports, confidenceAB, confidenceBA, liftAB, liftBA])
    df_supports = pd.DataFrame(df_supports, columns=['Pattern', 'Support', 'ConfidenceAB', 'ConfidenceBA', 'liftAB', 'liftBA'])
    # sort_values returns a new DataFrame, so the result must be assigned
    df_supports = df_supports.sort_values(by='Support', ascending=False)
    return df_supports
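The lift used above (confidenceAB * dataset_size / count(B)) is the standard confidence(A->B) / support(B). A toy sanity check of the three formulas in plain Python, on a hypothetical two-item dataset:

```python
# support(A,B) = P(A and B); confidence(A->B) = P(A and B) / P(A);
# lift(A->B) = confidence(A->B) / P(B).
rows = [(1, 1), (1, 0), (1, 1), (0, 1)]      # two items A, B over 4 baskets
n = len(rows)
n_a = sum(a for a, _ in rows)                # baskets containing A -> 3
n_b = sum(b for _, b in rows)                # baskets containing B -> 3
n_ab = sum(1 for a, b in rows if a and b)    # baskets containing both -> 2

support_ab = n_ab / n                        # 2/4 = 0.5
confidence_ab = n_ab / n_a                   # 2/3
lift_ab = confidence_ab * n / n_b            # (2/3) / (3/4) = 8/9
```

A lift below 1, as here, means A and B co-occur less often than independence would predict.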
