I wrote some throwaway code that takes a list of IDs, checks for duplicates, and writes out the deduplicated list. Nothing fancy, just a small part of what I am working on.
I get this weird output. It looks to me like the delimiter is adding spaces where it shouldn't. Does the delimiter go between fields or between lines? Very confused.
r s 9 3 6 4 5 5 4
r s 9 3 1 1 1 7 1
r s 7 8 9 0 2 0 2 5
r s 7 6 5 2 3 3 1
r s 7 2 1 0 4 8
r s 6 9 8 3 2 6 7
r s 6 4 6 5 6 5 7
r s 6 2 9 2 4 2
r s 6 1 9 9 1 1 5 6
Code:
__author__ = 'prumac'
import csv

allsnps = []

def open_file():
    ifile = open('mirnaduplicates.csv', "rb")
    print "open file"
    return csv.reader(ifile)

def write_file():
    with open('mirnaduplicatesremoved.csv', 'w') as fp:
        a = csv.writer(fp, delimiter=' ')
        a.writerows(allsnps)

def checksnp(name):
    if name in allsnps:
        pass
    else:
        allsnps.append(name)

def mymain():
    reader = open_file()
    for r in reader:
        checksnp(r[0])
    print len(allsnps)
    print allsnps
    write_file()

mymain()
.writerows() expects an iterable of rows (lists or tuples). Instead, you are handing it a list of strings, and each string is treated as a sequence of characters.
Put each string in a tuple or list:
a.writerows([val] for val in allsnps)
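The spaced-out output makes sense once you remember that csv.writer treats any iterable row as a sequence of fields, and a string is an iterable of its characters. A minimal sketch (the value here is invented for illustration):
import csv
import sys

w = csv.writer(sys.stdout, delimiter=' ')
w.writerow('rs936')    # a bare string: each character becomes its own field -> r s 9 3 6
w.writerow(['rs936'])  # a one-element list: a single field -> rs936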
Note that you could do this all a little more efficiently:
with open('mirnaduplicates.csv', "rb") as ifile, \
     open('mirnaduplicatesremoved.csv', 'wb') as fp:
    reader = csv.reader(ifile)
    writer = csv.writer(fp, delimiter=' ')
    seen = set()
    seen_add = seen.add
    writer.writerows(row for row in reader
                     if row[0] not in seen and not seen_add(row[0]))
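The not seen_add(row[0]) part is the only subtle bit: set.add() always returns None, so the call inserts the id as a side effect while the expression stays truthy, letting one generator expression do the membership test and the insertion together. Spelled out as a plain loop (same names as above), it is equivalent to:
seen = set()
for row in reader:
    if row[0] not in seen:
        seen.add(row[0])       # remember this id
        writer.writerow(row)   # first occurrence: keep it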
In R, there is a commonly used function called fread for reading tsv/csv/... files.
It has a super useful argument called skip that lets you specify a string; the row in which that string is found is then used as the header (handy if you pass a substring of the column-names row).
I was wondering if there is a similar function in Python, because it seems extremely useful.
Cheers!
A technique I sometimes use (e.g. to filter out faulty data, when none of the other wonderful capabilities of pandas.read_csv() seem to address the case at hand) is to define a subclass of io.TextIOWrapper.
In your case, you could write:
import io

class SkipUntilMatchWrapper(io.TextIOWrapper):
    def __init__(self, f, matcher, include_matching=False):
        super().__init__(f, line_buffering=True)
        self.f = f
        self.matcher = matcher
        self.include_matching = include_matching
        self.has_matched = False

    def read(self, size=None):
        # consume lines until the matcher fires, then hand off to normal reads
        while not self.has_matched:
            line = self.readline()
            if not line:  # EOF reached without a match
                break
            if self.matcher(line):
                self.has_matched = True
                if self.include_matching:
                    return line
        return super().read(size)
Let's try it on a simple example:
import numpy as np

# make an example
with open('sample.csv', 'w') as f:
    print('garbage 1', file=f)
    print('garbage 2', file=f)
    print('and now for some data', file=f)
    print('a,b,c', file=f)
    x = np.random.randint(0, 10, size=(5, 3))
    np.savetxt(f, x, fmt='%d', delimiter=',')
Read:
import pandas as pd

with open('sample.csv', 'rb') as f_orig:
    with SkipUntilMatchWrapper(f_orig, lambda s: 'a,b,c' in s, include_matching=True) as f:
        df = pd.read_csv(f)
>>> df
a b c
0 2 7 8
1 7 3 3
2 3 6 9
3 0 6 0
4 4 0 9
Another way:
with open('sample.csv', 'rb') as f_orig:
    with SkipUntilMatchWrapper(f_orig, lambda s: 'for some data' in s) as f:
        df = pd.read_csv(f)
>>> df
a b c
0 2 7 8
1 7 3 3
2 3 6 9
3 0 6 0
4 4 0 9
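If the file comfortably fits in memory, a plainer alternative (my own sketch, not part of the wrapper approach above) is to locate the header line first and pass its index to pandas through the skiprows argument:
import pandas as pd

with open('sample.csv') as f:
    # index of the first line containing the header text
    header_idx = next(i for i, line in enumerate(f) if 'a,b,c' in line)

df = pd.read_csv('sample.csv', skiprows=header_idx)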
I have a file kind of like this:
===
1 2 3 4
===
2 3 4 5
===
3 4 5 6
and I am trying to make a program that turns the file into this:
p
===
1 2 3 4
p
===
2 3 4 5
p
===
3 4 5 6
Is there any way I could do this in Python?
You can use:
with open('my_file.txt') as fp:
    lines = fp.readlines()

for i, l in enumerate(lines):
    if l == '===\n':
        lines[i] = 'p\n===\n'

with open('my_file.txt', 'w') as fp:
    fp.write(''.join(lines))
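For a large file you can avoid holding all lines in memory by streaming to a second file instead (a sketch; my_file_out.txt is a made-up output name):
with open('my_file.txt') as src, open('my_file_out.txt', 'w') as dst:
    for line in src:
        if line.rstrip('\n') == '===':
            dst.write('p\n')   # insert the marker before each === line
        dst.write(line)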
I wrote some code to delete certain columns (columns 0 through 6) from a bunch of .csv datasets.
import csv
import os

data_path = "C:/Users/hhs/dataset/PSP/Upper/"
save_path = "C:/Users/hhs/Refined/PSP/Upper/"

for filename in os.listdir(data_path):
    data_full_path = os.path.join(data_path, filename)
    save_full_path = os.path.join(save_path, filename)
    with open(data_full_path, "r") as source:
        rdr = csv.reader(source)
        with open(save_full_path, "w") as result:
            wtr = csv.writer(result)
            for line in rdr:
                wtr.writerow((line[7]))
One of the original datasets looks like this:
Normals:0 Normals:1 Normals:2 Points:0 Points:1 Points:2 area cp
-0.69498 0.62377 0.34311 28.829 3.4728 -0.947160 0.25877 -0.094391
-0.73130 0.54405 0.39395 30.082 4.9111 -0.785480 0.23499 -0.261690
-0.74539 0.49691 0.42782 31.210 6.4629 -0.626470 0.20982 -0.330730
-0.75245 0.48322 0.42985 32.359 8.0473 -0.455080 0.19428 -0.221340
-0.77195 0.46254 0.41825 33.546 9.7963 -0.270990 0.19849 -0.086641
-0.78905 0.45241 0.39759 34.737 11.6860 -0.079976 0.18456 -0.022418
-0.79771 0.45422 0.37858 35.915 13.5840 0.118160 0.17047 0.026102
-0.80090 0.45479 0.37198 37.092 15.4810 0.330220 0.15594 0.154880
-0.80260 0.45516 0.36904 38.268 17.3770 0.550100 0.14279 0.316590
-0.80504 0.45774 0.36178 39.444 19.2740 0.769020 0.12996 0.475640
-0.80747 0.46024 0.35383 40.620 21.1710 0.982050 0.11692 0.624090
The result does contain the "cp" values from the last column, which is what I want.
However, the result looks very weird: every digit is placed in a different column.
c p
- 0 . 0 9 4 3 9
- 0 . 2 6 1 6 9
- 0 . 3 3 0 7 3
- 0 . 2 2 1 3 4
- 0 . 0 8 6 6 4
- 0 . 0 2 2 4 1
0 . 0 2 6 1 0 2
0 . 1 5 4 8 8
0 . 3 1 6 5 9
0 . 4 7 5 6 4
0 . 6 2 4 0 9
.
.
.
Why does the result look like this?
Fix two issues in the second loop:
Add newline='' when opening the output file, otherwise blank lines appear between rows (see "CSV file written with Python has blank lines between each row").
Change (line[7]) to [line[7]], since a bare string gets written character by character (see "Why does csvwriter.writerow() put a comma after each character?").
with open(save_full_path, "w", newline='') as result:
    wtr = csv.writer(result, delimiter=',')
    for line in rdr:
        wtr.writerow([line[7]])
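The root cause is that (line[7]) is not a tuple; the parentheses are plain grouping, so writerow still receives a bare string and iterates over its characters. A quick sketch of the difference (value invented):
import csv
import sys

wtr = csv.writer(sys.stdout)
wtr.writerow(('-0.09439'))   # still just a string: one field per character
wtr.writerow(('-0.09439',))  # one-element tuple: a single field
wtr.writerow(['-0.09439'])   # one-element list: a single field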
I have a file "testread.txt" containing the data below.
A
1
2
3
4
BA
5
6
7
8
CB
9
10
11
D
12
13
14
15
I wanted to read and extract the data section-wise and write each section to a different file. E.g.:
1
2
3
4
Write it to File "a.txt"
5
6
7
8
Write it to File "b.txt"
9
10
11
Write it to File "c.txt"
and so on...
A (rough) solution could be put together using:
collections.defaultdict to divide and store items;
numpy.savetxt to save them into files.
import numpy as np
from collections import defaultdict

with open('testread.txt', 'r') as f:
    content = f.readlines()

d = defaultdict(list)
i = 0
for line in content:
    if line == '\n':
        i += 1  # blank line: start a new section
    else:
        d[i].append(line.strip())

for k, v in d.items():
    # v[1:] drops the section header (A, BA, ...)
    np.savetxt('file{}.txt'.format(k), v[1:], delimiter=",", fmt='%s')
and you get:
file0.txt
1
2
3
4
file1.txt:
5
6
7
8
file2.txt:
9
10
11
file3.txt
12
13
14
15
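If the letter filenames from the question (a.txt, b.txt, ...) are preferred over file0.txt and friends, one small tweak (assuming at most 26 sections) is to index into string.ascii_lowercase:
import string

for k, v in d.items():
    # section 0 -> a.txt, section 1 -> b.txt, ...
    np.savetxt('{}.txt'.format(string.ascii_lowercase[k]), v[1:], fmt='%s')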
The idea is to switch to the next output file whenever an empty line is encountered. The code below should do the trick.
files_list = ['a.txt', 'b.txt', 'c.txt']
with open('input.txt') as fpr:
    for f in files_list:
        with open(f, 'w') as fpw:
            for i, line in enumerate(fpr):
                if i == 0:  # skip the section header line (A, BA, ...)
                    continue
                if line.strip():
                    fpw.write(line)
                else:
                    break  # blank line: move on to the next output file
I have a very large (10GB) data file of the form:
A B C D
1 2 3 4
2 2 3 4
3 2 3 4
4 2 3 4
5 2 3 4
1 2 3 4
2 2 3 4
3 2 3 4
4 2 3 4
5 2 3 4
1 2 3 4
2 2 3 4
3 2 3 4
4 2 3 4
5 2 3 4
I would like to read just the B column of the file and rearrange it in the form
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
It takes a very long time to read the data and rearrange it. Could someone give me an efficient method to do this in Python?
This is the code that I used in MATLAB for processing the data:
fid = fopen('hpts.out', 'r');  % Open text file
InputText = textscan(fid, '%s', 1, 'delimiter', '\n');  % Read header lines
HeaderLines = InputText{1}
A = textscan(fid, '%n %n %n %n %n', 'HeaderLines', 1);
t = A{1};
vz = A{4};
L = 1;
for j = 1:1:5000
    for i = 1:1:14999
        V1(j,i) = vz(L);
        L = L + 1;
    end
end
imagesc(V1);
You can use Python for this, but I think this is exactly the sort of job where a shell script is better, since it's a lot shorter & easier:
$ tail -n+2 input_file | awk '{print $2}' | tr '\n' ' ' | fmt -w 10
tail removes the first (header) line;
awk gets the second column;
tr puts it on a single line;
and fmt makes lines a maximum of 10 characters.
Since this is a streaming operation, it should not need much memory, and performance is mostly limited by disk I/O (although shell pipes also introduce some overhead).
Example:
$ tail -n+2 input_file | awk '{print $2}' | tr '\n' ' ' | fmt -w 10
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
This streaming approach should perform well:
from itertools import zip_longest  # izip_longest on Python 2

with open('yourfile', 'r') as fin, open('newfile', 'w') as fout:
    # discard header row
    next(fin)
    # make a generator for the second column
    col2values = (line.split()[1] for line in fin)
    # zip into groups of five;
    # fillvalue makes a partial last row look right.
    for row in zip_longest(*[col2values] * 5, fillvalue=''):
        fout.write(' '.join(row) + '\n')
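The *[col2values]*5 part is the standard grouper idiom: the list contains the same generator object five times, so zip_longest draws five consecutive values for each output row. A tiny sketch of the effect:
from itertools import zip_longest

it = iter(range(7))
print(list(zip_longest(*[it] * 5, fillvalue='')))
# [(0, 1, 2, 3, 4), (5, 6, '', '', '')]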
Don't read the whole file at one time! Read the file line by line:
def read_data():
    with open("filename.txt", 'r') as f:
        next(f)  # skip the header row
        for line in f:
            yield line.split()[1]

with open('file_to_save.txt', 'w') as f:
    for i, data in enumerate(read_data(), 1):
        f.write(data)
        # space between values, newline after every fifth one
        f.write('\n' if i % 5 == 0 else ' ')