pandas.read_csv: skip rows until a certain string is found - Python

In R, there is a commonly used function called fread (from data.table) for reading tsv/csv/... files.
It has a very useful argument called skip: you can pass it a string, and the row in which that string is found is then used as the header (handy if you pass a substring of the column-names row).
Is there a similar option in Python/pandas? It seems extremely useful.
Cheers!

A technique I sometimes use (e.g. to filter faulty data, or when none of the other wonderful capabilities of pandas.read_csv() address the case at hand) is to define an io.TextIOWrapper subclass.
In your case, you could write:
import io

class SkipUntilMatchWrapper(io.TextIOWrapper):
    def __init__(self, f, matcher, include_matching=False):
        super().__init__(f, line_buffering=True)
        self.f = f
        self.matcher = matcher
        self.include_matching = include_matching
        self.has_matched = False

    def read(self, size=None):
        # consume lines until the matcher fires, then hand over to the normal read()
        while not self.has_matched:
            line = self.readline()
            if self.matcher(line):
                self.has_matched = True
                if self.include_matching:
                    return line
        return super().read(size)
Let's try it on a simple example:
import numpy as np

# make an example
with open('sample.csv', 'w') as f:
    print('garbage 1', file=f)
    print('garbage 2', file=f)
    print('and now for some data', file=f)
    print('a,b,c', file=f)
    x = np.random.randint(0, 10, size=(5, 3))
    np.savetxt(f, x, fmt='%d', delimiter=',')
Read:
import pandas as pd

with open('sample.csv', 'rb') as f_orig:
    with SkipUntilMatchWrapper(f_orig, lambda s: 'a,b,c' in s, include_matching=True) as f:
        df = pd.read_csv(f)
>>> df
   a  b  c
0  2  7  8
1  7  3  3
2  3  6  9
3  0  6  0
4  4  0  9
Another way:
with open('sample.csv', 'rb') as f_orig:
    with SkipUntilMatchWrapper(f_orig, lambda s: 'for some data' in s) as f:
        df = pd.read_csv(f)
>>> df
   a  b  c
0  2  7  8
1  7  3  3
2  3  6  9
3  0  6  0
4  4  0  9
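If subclassing feels like overkill, a simpler alternative is to scan the file once for the header line and then let pandas skip everything above it. A minimal sketch, assuming the header row can be identified by a substring such as 'a,b,c':
import pandas as pd

# find the index of the first line that looks like the header
with open('sample.csv') as f:
    for header_idx, line in enumerate(f):
        if 'a,b,c' in line:
            break

# skip everything above the header row and read the rest normally
df = pd.read_csv('sample.csv', skiprows=header_idx)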

Related

Append data from a second file to each line of the first file

My question looks a lot like this post: Append float data at the end of each line in a text file.
But my case is slightly different. I have a .dat file containing over 500 lines.
For each line, I want to append the value of the corresponding line in a second file. That second file contains a single column of 0/1 values.
What I have:
File 1 :     File 2 :
1 2 3 4      0
1 2 3 4      1
1 2 3 4      0
What I want:
File 1 :       File 2 :
1 2 3 4 0      0
1 2 3 4 1      1
1 2 3 4 0      0
What I've already tried:
import re
import numpy as np

Y = np.loadtxt('breastcancerY')

def get_number(_):
    lines = []
    for line in Y:
        print('this is a line', line)
    return " " + str(line) + '\n'

with open("breastcancerX", "r") as f:
    data = f.read()
out = re.sub('\n', get_number, data)
with open("output.txt", "w") as f:
    f.write(out)
When I do that and print the values taken from the file of 0s and 1s, they are all 0; the output does not correspond to my file.
EDIT 1:
Using this code:
# first read the two files into lists of lines
with open("breastcancerY", "r") as f:
    dataY = f.readlines()
with open("breastcancerX", "r") as f:
    dataX = f.readlines()
# then combine lines from the two files into one line
with open("output.dat", "w") as f:
    for X, Y in zip(dataX, dataY):
        f.write(f"{X} {Y}")
It gives me this:
# I don't understand what you want to do with this part
Y = np.loadtxt('breastcancerY')
def get_number(_):
    lines = []
    for line in Y:
        print('this is a line', line)
    return " " + str(line) + '\n'
# I don't understand what you want to do with this part

# first read the two files into lists of lines
with open("breastcancerY", "r") as f:
    dataY = f.readlines()
with open("breastcancerX", "r") as f:
    dataX = f.readlines()
# then combine lines from the two files into one line,
# stripping the trailing newlines first
with open("output.txt", "w") as f:
    for X, Y in zip(dataX, dataY):
        f.write(f"{X.strip()} {Y.strip()}\n")
Use zip, which pairs up the lines of the two files.
Code
with open('file1.txt', 'r') as f1, open('file2.txt', 'r') as f2, open('file3.txt', 'w') as f3:
    for line1, line2 in zip(f1, f2):
        # writes: line from file1 without \n, a space,
        # then the corresponding line from file2
        f3.write(f'{line1.rstrip()} {line2}')
Files
file1.txt
1 2 3 4
1 2 3 4
1 2 3 4
file2.txt
0
1
0
Result: file3.txt
1 2 3 4 0
1 2 3 4 1
1 2 3 4 0
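zip stops at the end of the shorter file; if the two files might differ in length, itertools.zip_longest can pad the missing lines instead. A small sketch along the same lines (the empty-string fill value is an assumption about the desired behaviour):
from itertools import zip_longest

with open('file1.txt') as f1, open('file2.txt') as f2, open('file3.txt', 'w') as f3:
    for line1, line2 in zip_longest(f1, f2, fillvalue=''):
        # pairs lines up even when one file is shorter
        f3.write(f'{line1.rstrip()} {line2.rstrip()}\n')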

Print lines in file until a blank line

I have a file "testread.txt" containing the data below.
A
1
2
3
4

BA
5
6
7
8

CB
9
10
11

D
12
13
14
15
I want to read and extract the data section-wise and write each section to a different file. E.g.:
1
2
3
4
Write it to File "a.txt"
5
6
7
8
Write it to File "b.txt"
9
10
11
Write it to File "c.txt"
and so on...
A (rough) solution can be obtained using:
collections.defaultdict to divide and store items;
numpy.savetxt to save them into files.
import numpy as np
from collections import defaultdict

with open('testread.txt', 'r') as f:
    content = f.readlines()

d = defaultdict(list)
i = 0
for line in content:
    if line == '\n':
        i += 1
    else:
        d[i].append(line.strip())

for k, v in d.items():
    # v[1:] drops the section header (A, BA, ...) before saving
    np.savetxt('file{}.txt'.format(k), v[1:], delimiter=",", fmt='%s')
and you get:
file0.txt
1
2
3
4
file1.txt:
5
6
7
8
file2.txt:
9
10
11
file3.txt
12
13
14
15
The idea is to move on to the next output file whenever an empty line is encountered. The code below should do the trick.
files_list = ['a.txt', 'b.txt', 'c.txt']
fpr = open('input.txt')
for f in files_list:
    with open(f, 'w') as fpw:
        for i, line in enumerate(fpr):
            if i == 0:  # skips the section header line (A, BA, ...)
                continue
            if line.strip():
                fpw.write(line)
            else:
                break
fpr.close()
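An alternative rough sketch using itertools.groupby, under the same assumption as above that sections are separated by blank lines and start with a header line (A, BA, ...):
from itertools import groupby

with open('testread.txt') as f:
    # group the stripped lines into runs of non-empty lines (the sections)
    sections = [list(g) for nonblank, g in groupby((ln.strip() for ln in f), key=bool) if nonblank]

for name, section in zip(['a.txt', 'b.txt', 'c.txt', 'd.txt'], sections):
    with open(name, 'w') as out:
        # section[0] is the header line; keep only the data lines
        out.write('\n'.join(section[1:]) + '\n')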

Reading and Rearranging data in Python

I have a very large (10GB) data file of the form:
A B C D
1 2 3 4
2 2 3 4
3 2 3 4
4 2 3 4
5 2 3 4
1 2 3 4
2 2 3 4
3 2 3 4
4 2 3 4
5 2 3 4
1 2 3 4
2 2 3 4
3 2 3 4
4 2 3 4
5 2 3 4
I would like to read just the B column of the file and rearrange it in the form
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
It takes a very long time to read the data and rearrange it; could someone give me an efficient method to do this in Python?
This is the MATLAB code I used for processing the data:
fid = fopen('hpts.out', 'r');                            % Open text file
InputText = textscan(fid, '%s', 1, 'delimiter', '\n');   % Read header lines
HeaderLines = InputText{1}
A = textscan(fid, '%n %n %n %n %n', 'HeaderLines', 1);
t = A{1};
vz = A{4};
L = 1;
for j = 1:1:5000
    for i = 1:1:14999
        V1(j,i) = vz(L);
        L = L + 1;
    end
end
imagesc(V1);
You can use Python for this, but I think this is exactly the sort of job where a shell script is better, since it's a lot shorter and easier:
$ tail -n+2 input_file | awk '{print $2}' | tr '\n' ' ' | fmt -w 10
tail removes the first (header) line;
awk gets the second column;
tr puts it on a single line;
and fmt makes lines a maximum of 10 characters.
Since this is a streaming operation, it should not take a lot of memory; performance is mostly limited by disk I/O (although shell pipes also introduce some overhead).
Example:
$ tail -n+2 input_file | awk '{print $2}' | tr '\n' ' ' | fmt -w 10
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
This streaming approach should perform well:
from itertools import izip_longest  # itertools.zip_longest on Python 3

with open('yourfile', 'r') as fin, open('newfile', 'w') as fout:
    # discard header row
    next(fin)
    # make generator for second column
    col2values = (line.split()[1] for line in fin)
    # zip into groups of five.
    # fillvalue used to make a partial last row look good.
    for row in izip_longest(*[col2values]*5, fillvalue=''):
        fout.write(' '.join(row) + '\n')
Don't read the whole file at once! Read it line by line:
def read_data():
    with open("filename.txt", 'r') as f:
        next(f)  # skip the header row
        for line in f:
            yield line.split()[1]

with open('file_to_save.txt', 'w') as f:
    for i, data in enumerate(read_data()):
        f.write(data + ' ')
        if (i + 1) % 5 == 0:  # start a new row after every fifth value
            f.write('\n')
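Yet another way to group the streamed values into rows of five is itertools.islice; a sketch reusing the same placeholder file names as above:
from itertools import islice

def second_column(path):
    with open(path) as f:
        next(f)  # skip the header row
        for line in f:
            yield line.split()[1]

values = second_column('yourfile')
with open('newfile', 'w') as out:
    while True:
        row = list(islice(values, 5))
        if not row:
            break
        out.write(' '.join(row) + '\n')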

Count all +1's in the file python

I have the following data:
1 3 4 2 6 7 8 8 93 23 45 2 0 0 0 1
0 3 4 2 6 7 8 8 90 23 45 2 0 0 0 1
0 3 4 2 6 7 8 6 93 23 45 2 0 0 0 1
-1 3 4 2 6 7 8 8 21 23 45 2 0 0 0 1
-1 3 4 2 6 7 8 8 0 23 45 2 0 0 0 1
The above data is in a file. I want to count the number of 1's, 0's and -1's, but only in the 1st column. I am reading the file from standard input, and the only way I could think of is this:
cnt = 0
cnt1 = 0
cnt2 = 0
for line in sys.stdin:
    (t1, <having 15 different variables as that many columns are in files>) = re.split("\s+", line.strip())
    if re.match("+1", t1):
        cnt = cnt + 1
    if re.match("-1", t1):
        cnt1 = cnt1 + 1
    if re.match("0", t1):
        cnt2 = cnt2 + 1
How can I make this better, especially the 15-different-variables part, as that is the only place where those variables are used?
Use collections.Counter:
from collections import Counter

with open('abc.txt') as f:
    c = Counter(int(line.split(None, 1)[0]) for line in f)
print c
Output:
Counter({0: 2, -1: 2, 1: 1})
Here str.split(None, 1) splits the line just once:
>>> s = "1 3 4 2 6 7 8 8 93 23 45 2 0 0 0 1"
>>> s.split(None, 1)
['1', '3 4 2 6 7 8 8 93 23 45 2 0 0 0 1']
NumPy makes it even easier:
>>> import numpy as np
>>> from collections import Counter
>>> Counter(np.loadtxt('abc.txt', usecols=(0,), dtype=np.int))
Counter({0: 2, -1: 2, 1: 1})
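Individual counts can then be read straight off the Counter, e.g. with the first version above:
>>> c[1], c[0], c[-1]
(1, 2, 2)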
If you only want the first column, then only split off the first column, and use a dictionary to store the counts for each value.
count = dict()
for line in sys.stdin:
    (t1, rest) = line.split(' ', 1)
    try:
        count[t1] += 1
    except KeyError:
        count[t1] = 1

for item in count:
    print '%s occurs %i times' % (item, count[item])
Instead of using tuple unpacking, where you need a number of variables exactly equal to the number of parts returned by split(), you can just use the first element of those parts:
parts = re.split("\s+", line.strip())
t1 = parts[0]
or equivalently, simply
t1 = re.split("\s+", line.strip())[0]
import collections

def countFirstColum(fileName):
    res = collections.defaultdict(int)
    with open(fileName) as f:
        for line in f:
            key = line.split(" ")[0]
            res[key] += 1
    return res
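A quick usage example, with the same file name as in the answers above:
counts = countFirstColum('abc.txt')
for value, count in counts.items():
    print value, count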
rows = []
with open('abc.txt') as f:
    for line in f:
        column = line.strip().split(" ")
        rows.append(column)
then you get a 2-dimensional array (a list of rows).
1st column:
for row in rows:
    print row[0]
output:
1
0
0
-1
-1
This is from a script of mine that uses an input file; I checked and it also works with standard input as the infile:
dictionary = {}
for line in someInfile:
    line = line.strip('\n')  # if infile, but you should
    f = line.split()         # do your standard input thing
    dictionary[f[0]] = 0
for line in someInfile:
    line = line.strip('\n')  # if infile, but you should
    f = line.split()         # do your standard input thing
    dictionary[f[0]] += 1
print dictionary

csv writer is adding delimiters in each word

I wrote some throwaway code which takes a list of ids, checks for duplicates and writes out a deduplicated list of ids. Nothing fancy, just a small part of what I am working on.
I get this weird output. It looks to me like the delimiter is adding spaces where it shouldn't. Is the delimiter applied between words or between lines? Very confused.
r s 9 3 6 4 5 5 4
r s 9 3 1 1 1 7 1
r s 7 8 9 0 2 0 2 5
r s 7 6 5 2 3 3 1
r s 7 2 1 0 4 8
r s 6 9 8 3 2 6 7
r s 6 4 6 5 6 5 7
r s 6 2 9 2 4 2
r s 6 1 9 9 1 1 5 6
Code:
__author__ = 'prumac'
import csv

allsnps = []

def open_file():
    ifile = open('mirnaduplicates.csv', "rb")
    print "open file"
    return csv.reader(ifile)

def write_file():
    with open('mirnaduplicatesremoved.csv', 'w') as fp:
        a = csv.writer(fp, delimiter=' ')
        a.writerows(allsnps)

def checksnp(name):
    if name in allsnps:
        pass
    else:
        allsnps.append(name)

def mymain():
    reader = open_file()
    for r in reader:
        checksnp(r[0])
    print len(allsnps)
    print allsnps
    write_file()

mymain()
.writerows() expects a list of lists. Instead, you are handing it a list of strings, and these are treated as sequences of characters.
Put each string in a tuple or list:
a.writerows([val] for val in allsnps)
Note that you could do this all a little more efficiently:
with open('mirnaduplicates.csv', "rb") as ifile, \
     open('mirnaduplicatesremoved.csv', 'wb') as fp:
    reader = csv.reader(ifile)
    writer = csv.writer(fp, delimiter=' ')
    seen = set()
    seen_add = seen.add
    writer.writerows(row for row in reader if row[0] not in seen and not seen_add(row[0]))
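If the generator-plus-seen_add idiom reads as too clever, the same de-duplication can be written as a plain loop (an equivalent sketch in the same Python 2 style):
import csv

seen = set()
with open('mirnaduplicates.csv', "rb") as ifile, \
     open('mirnaduplicatesremoved.csv', 'wb') as fp:
    reader = csv.reader(ifile)
    writer = csv.writer(fp, delimiter=' ')
    for row in reader:
        if row[0] not in seen:
            seen.add(row[0])
            writer.writerow(row)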
