Python: read nth value on every line from text file into array

I have a very large text file containing XYZ data, with each value separated by a single space:
100000 200000 2.5698
200000 200000 1.9863
200000 400000 2.2587
...
I'm looking to create an array of only the last value in each line (i.e. the Z value). What I have so far is:
with open(xyzFile) as f:
    for eachLine in f:
        tmpLine = f.readline()
        print("### tmpLine: {0}".format(tmpLine))
This prints out the first line of the file, as expected:
### tmpLine: 253575 705575 83.710655
How can I grab the third value and iterate to the next line of the text file? I guess I need a for loop somewhere here. I know how to append the value to the array, which would go in between these two processes:
zArray.append(zValue)

You could try using NumPy's loadtxt; documentation is found here. There is a handy usecols argument that you can set to 2 to read only the 3rd column. Using the small bit of data you provided, the following code generates a 1D array of the 3rd column:
import numpy as np
z = np.loadtxt("filename.txt", usecols=2)
print (z)
# output is [ 2.5698 1.9863 2.2587]
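If you later need the X and Y columns as well, usecols also accepts a tuple, and unpack=True splits the result into one array per column (a sketch using the same hypothetical filename):
import numpy as np

# unpack=True transposes the result so each column comes back as its own array
x, y, z = np.loadtxt("filename.txt", usecols=(0, 1, 2), unpack=True)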

Try this, using split and strip (if needed) wisely:
with open(xyzFile) as f:
    for eachLine in f:
        print("### tmpLine: {0}".format(eachLine.strip().split()[-1].strip()))

Read each line and split it on spaces with tmpLine.split(' '); that gives you a list of the values on that line. From that list, fetch the third element: tmpLine.split(' ')[2].
zArray = []
with open(xyzFile) as f:
    for eachLine in f:
        zArray.append(eachLine.split(' ')[2])
Another way, using a list comprehension as suggested by Jon:
with open(xyzFile) as f:
    zArray = [eachline.split(' ')[2] for eachline in f]
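Note that either way the values land as strings. A minimal sketch that converts them to float while reading:
with open(xyzFile) as f:
    zArray = [float(eachline.split()[2]) for eachline in f]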

You can use the csv module:
import csv

with open(xyzFile) as f:
    for row in csv.reader(f, delimiter=" "):
        print(row[-1])
Which yields something like this:
2.5698
1.9863
2.2587

Related

For Loop over a list in Python

I have a train_file.txt which has 3 columns on each row.
For example;
1 10 1
1 12 1
2 64 2
6 17 1
...
I am reading this txt file with
train_data = open("train_file.txt", 'r').readlines()
Then I am trying to get each value with for loop
for eachline in train_data:
    uid, lid, x = eachline.strip().split()
Question: Train data is a huge file that's why I want to just get the first 1000 rows.
I was trying to execute the following code, but I am getting an error ('list' object cannot be interpreted as an integer):
for eachline in range(train_data, 1000):
    uid, lid, x = eachline.strip().split()
It is not necessary to read the entire file at all. You could use enumerate on the file directly and break early or use itertools.islice:
from itertools import islice
train_data = list(islice(open("train_file.txt", 'r'), 1000))
You can also keep using the same file handle to read more data later:
f = open("train_file.txt", 'r')
train_data = list(islice(f, 1000)) # reads first 1000
test_data = list(islice(f, 100)) # reads next 100
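Putting islice together with your parsing loop (a sketch; fields assumed whitespace-separated, as in your sample):
from itertools import islice

with open("train_file.txt", 'r') as f:
    for eachline in islice(f, 1000):  # stops after the first 1000 lines
        uid, lid, x = eachline.strip().split()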
Maybe try changing this line:
train_data = open("train_file.txt", 'r').readlines()
To:
train_data = open("train_file.txt", 'r').readlines()[:1000]
train_data is a list, use slicing:
for eachline in train_data[:1000]:
As the file is "huge", in your words, a better approach is to read just the first 1000 rows (readlines() will read the whole file into memory):
with open("train_file.txt", 'r'):
train_data = []
for idx, line in enumerate(f, start=1):
train_data.append(line.strip.split())
if idx == 1000:
break
Note that data will be str, not int. You probably want to convert them to int.
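For example, a minimal conversion pass over the rows collected above:
train_data = [[int(v) for v in row] for row in train_data]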
You could use enumerate and a break:
for k, line in enumerate(lines):
    if k >= 1000:
        break  # exit the loop after 1000 lines
    # do stuff on the line
I would recommend using the built-in csv library since the data is csv-like (or the pandas one if you're using it), and using with. So something like this:
import csv
from itertools import islice
with open('./test.csv', 'r') as input_file:
    csv_reader = csv.reader(input_file, delimiter=' ')
    rows = list(islice(csv_reader, 1000))
    # Use rows
    print(rows)
You don't need it right now but it will make escaped characters or multiline entries way easier to parse. Also, if there are headers you can use csv.DictReader to include them.
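As a sketch of the DictReader variant, assuming a hypothetical header row uid lid x as the first line of the file:
import csv
from itertools import islice

with open('./test.csv', 'r') as input_file:
    # The first row is consumed as the field names
    csv_reader = csv.DictReader(input_file, delimiter=' ')
    rows = list(islice(csv_reader, 1000))
    print(rows[0]['uid'])  # 'uid' is a hypothetical column name from the header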
Regarding your original code:
The call to readlines() will read all lines at that point, so doing any filtering after won't make a difference.
If you did read it that way, to get the first 1000 lines your for loop should be:
for eachline in train_data[:1000]:
    ...

Read and convert row in text file into list of string

I have a text file data.txt that contains 2 rows of text.
first_row_1 first_row_2 first_row_3
second_row_1 second_row_2 second_row_3
I would like to read the second row of the text file and convert its contents into a list of strings in Python. The list should look like this;
txt_list_str=['second_row_1','second_row_2','second_row_3']
Here is my attempted code;
import csv
with open('data.txt', newline='') as f:
    reader = csv.reader(f)
    row1 = next(reader)
    row2 = next(reader)
    my_list = row2.split(" ")
I got the error AttributeError: 'list' object has no attribute 'split'
I am using python v3.
EDIT: Thanks for all the answers. I am sure all of them works. But can someone tell me what is wrong with my own attempted code? Thanks.
The reason your code doesn't work is that you are trying to use split on a list, but it is meant to be used on a string. Because csv.reader splits on commas by default, each space-separated line comes back as a single-element list, so in your example you would use row2[0] to access the first element of the list.
my_list = row2[0].split(" ")
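Applied to your attempt, the corrected version would be:
import csv

with open('data.txt', newline='') as f:
    reader = csv.reader(f)
    row1 = next(reader)
    row2 = next(reader)
    my_list = row2[0].split(" ")

print(my_list)
# ['second_row_1', 'second_row_2', 'second_row_3']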
Alternatively, if you have access to the numpy library you can use loadtxt.
import numpy as np
f = np.loadtxt("data.txt", dtype=str, skiprows=1)
print (f)
# ['second_row_1' 'second_row_2' 'second_row_3']
The result of this is an array as opposed to a list. You could simply cast the array to a list if you require one:
print (list(f))
#['second_row_1', 'second_row_2', 'second_row_3']
Use open to read the file, e.g.:
>>> fp = open('temp.txt')
Use the file object's built-in iteration via the next method to step through lines, and ignore the first line:
>>> next(fp)
'first_row_1 first_row_2 first_row_3\n'
Store the second line in a variable:
>>> second_line = next(fp)
>>> second_line
'second_row_1 second_row_2 second_row_3'
Use the split string method to get the items as a list. split takes one or zero arguments; if no separator is given, it splits on whitespace:
>>> second_line.split()
['second_row_1', 'second_row_2', 'second_row_3']
And finally close the file.
fp.close()
Note: There are a number of ways to get the respective output, but you should attempt it first, as DavidG said in a comment.
with open("file.txt", "r") as f:
next(f) # skipping first line; will work without this too
for line in f:
txt_list_str = line.split()
print(txt_list_str)
Output
['second_row_1', 'second_row_2', 'second_row_3']

Creating individual columns to write to new csv

Reading a CSV returns the following values:
"1,323104,564382"
"2,322889,564483"
"3,322888,564479"
"4,322920,564425"
"5,322942,564349"
"6,322983,564253"
"7,322954,564154"
"8,322978,564121"
How would I take the " marks off each end of the rows? It seems to make individual columns when I do this.
reader=[[i[0].replace('\'','')] for i in reader]
does not change the file at all
It seems strictly easier to peel the quotes off first, and then feed it to the csv reader, which simply takes any iterable over lines as input.
import csv
import sys

f = open(sys.argv[1])
contents = f.read().replace('"', '')
reader = csv.reader(contents.splitlines())
for x, y, z in reader:
    print(x, y, z)
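Alternatively, a sketch that leans on csv's own quote handling: since each line is one quoted field, csv.reader returns it as a single string you can then split on commas:
import csv
import sys

with open(sys.argv[1], newline='') as f:
    for row in csv.reader(f):
        # row has exactly one element: the unquoted field contents
        x, y, z = row[0].split(',')
        print(x, y, z)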
Assuming every line is wrapped by two double quotes, we can do this:
f = open("filename.csv", "r")
newlines = []
for line in f: # we could use a list comprehension, but for simplicity, we won't.
newlines.append(line[1:-1])
f.close()
f2 = open("filename.csv", "w")
for index, line in enumerate(f2):
f2.write(newlines[index])
f2.close()
[1:-1] uses a slicing operation to take everything from the second character of the string up to (but not including) the last character, the two quotes sitting at indexes 0 and -1.
strip() removes the trailing newline first, so the slice removes the quotes rather than the newline.
Iterating over a file gets you its lines.

Sort a file by first (or second, or else) column in python

This seems a very basic question, but I am new to python, and after spending a long time trying to find a solution on my own, I thought it's time to ask some more advanced people!
So, I have a file (sample):
ENSMUSG00000098737 95734911 95734973 3 miRNA
ENSMUSG00000077677 101186764 101186867 4 snRNA
ENSMUSG00000092727 68990574 68990678 11 miRNA
ENSMUSG00000088009 83405631 83405764 14 snoRNA
ENSMUSG00000028255 145003817 145032776 3 protein_coding
ENSMUSG00000028255 145003817 145032776 3 processed_transcript
ENSMUSG00000028255 145003817 145032776 3 processed_transcript
ENSMUSG00000098481 38086202 38086317 13 miRNA
ENSMUSG00000097075 126971720 126976098 7 lincRNA
ENSMUSG00000097075 126971720 126976098 7 lincRNA
and I need to write a new file with all the same information, but sorted by the first column.
What I use so far is :
from operator import itemgetter

lines = open(my_file, 'r').readlines()
output = open("intermediate_alphabetical_order.txt", 'w')
for line in sorted(lines, key=itemgetter(0)):
    output.write(line)
output.close()
It doesn't return me any error, but just writes the output file exactly as the input file.
I know it is certainly a very basic mistake, but it would be amazing if some of you could tell me what I'm doing wrong!
Thanks a lot!
Edit
I am having trouble with the way I open the file, so the answers concerning already opened arrays don't really help.
The problem you're having is that you're not turning each line into a list. When you read in the file, you're just getting the whole line as a string. You're then sorting by the first character of each line, and this is always the same character in your input, 'E'.
To just sort by the first column, you need to split the first block off and just read that section. So your key should be this:
for line in sorted(lines, key=lambda line: line.split()[0]):
split will turn your line into a list, and then the first column is taken from that list.
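Dropped into your original script, a sketch of the whole thing:
with open(my_file, 'r') as f:
    lines = f.readlines()

with open("intermediate_alphabetical_order.txt", 'w') as output:
    for line in sorted(lines, key=lambda line: line.split()[0]):
        output.write(line)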
If your input file is tab-separated, you can also use the csv module.
import csv
from operator import itemgetter
reader = csv.reader(open("t.txt"), delimiter="\t")
for line in sorted(reader, key=itemgetter(0)):
    print(line)
This sorts by the first column.
Change the number in
key=itemgetter(0)
for sorting by a different column.
Same idea as SuperBiasedMan, but I prefer this approach: if you want another way of sorting (for example, if the first column matches, sort by the second, then the third, etc.), it is more easily implemented:
with open(my_file) as f:
    lines = [line.split(' ') for line in f]

output = open("result.txt", 'w')
for line in sorted(lines):
    output.write(' '.join(line))
output.close()
You can write a function that takes a filename, delimiter and column to sort by using csv.reader to parse the file:
from operator import itemgetter
import csv

def sort_by(fle, col, delim):
    with open(fle) as f:
        r = csv.reader(f, delimiter=delim)
        for row in sorted(r, key=itemgetter(col)):
            yield row

for row in sort_by("your_file", 2, "\t"):
    print(row)
You can do this quickly with pandas as follows, with the data file set up exactly as you show it (i.e., with variable spaces as separators):
import pandas as pd
df = pd.read_csv('csvdata.csv', sep=' ', skipinitialspace=True, header=None)
df.sort_values(by=[0], inplace=True)
df.to_csv('sorted_csvdata.csv', header=None, index=None)
Just to check the result:
with open('sorted_csvdata.csv', 'r') as f:
print(f.read())
ENSMUSG00000028255,145003817,145032776,3,protein_coding
ENSMUSG00000028255,145003817,145032776,3,processed_transcript
ENSMUSG00000028255,145003817,145032776,3,processed_transcript
ENSMUSG00000077677,101186764,101186867,4,snRNA
ENSMUSG00000088009,83405631,83405764,14,snoRNA
ENSMUSG00000092727,68990574,68990678,11,miRNA
ENSMUSG00000097075,126971720,126976098,7,lincRNA
ENSMUSG00000097075,126971720,126976098,7,lincRNA
ENSMUSG00000098481,38086202,38086317,13,miRNA
ENSMUSG00000098737,95734911,95734973,3,miRNA
You can do multi-column sorting by adding additional column numbers to the list in the by=[...] keyword argument.
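For example, a hypothetical two-key sort (first column, then second):
df.sort_values(by=[0, 1], inplace=True)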
Here is another option, similar to some of the ideas above. Basically, mysort is a function that will do the custom sorting for you, which here is based on the first column:
def mysort(line):
    return line.split()[0]
with open("records.txt", "r") as f:
text = f.readlines()
for line in sorted(text, key=mysort):
print line

Read file from line 2 or skip header row

How can I skip the header row and start reading a file from line2?
with open(fname) as f:
    next(f)
    for line in f:
        # do something
f = open(fname,'r')
lines = f.readlines()[1:]
f.close()
If you want the first line and then to perform some operation on the rest of the file, this code will be helpful:
with open(filename, 'r') as f:
    first_line = f.readline()
    for line in f:
        # Perform some operations
If slicing could work on iterators...
from itertools import islice
with open(fname) as f:
    for line in islice(f, 1, None):
        pass
f = open(fname).readlines()
firstLine = f.pop(0) #removes the first line
for line in f:
    ...
To generalize the task of reading multiple header lines and to improve readability, I'd use method extraction. Suppose you wanted to tokenize the first two lines of coordinates.txt to use as header information.
Example
coordinates.txt
---------------
Name,Longitude,Latitude,Elevation, Comments
String, Decimal Deg., Decimal Deg., Meters, String
Euler's Town,7.58857,47.559537,0, "Blah"
Faneuil Hall,-71.054773,42.360217,0
Yellowstone National Park,-110.588455,44.427963,0
Then method extraction allows you to specify what you want to do with the header information (in this example we simply tokenize the header lines based on the comma and return it as a list but there's room to do much more).
def __readheader(filehandle, numberheaderlines=1):
    """Reads the specified number of lines and returns the comma-delimited
    strings on each line as a list"""
    for _ in range(numberheaderlines):
        yield list(map(str.strip, filehandle.readline().strip().split(',')))

with open('coordinates.txt', 'r') as rh:
    # Single header line
    # print(next(__readheader(rh)))
    # Multiple header lines
    for headerline in __readheader(rh, numberheaderlines=2):
        print(headerline)  # Or do other stuff with headerline tokens
Output
['Name', 'Longitude', 'Latitude', 'Elevation', 'Comments']
['String', 'Decimal Deg.', 'Decimal Deg.', 'Meters', 'String']
If coordinates.txt contains another header line, simply change numberheaderlines. Best of all, it's clear what __readheader(rh, numberheaderlines=2) is doing, and we avoid the ambiguity of having to figure out or comment on why the author of the accepted answer uses next() in his code.
If you want to read multiple CSV files starting from line 2, this works like a charm:
import csv

for files in csv_file_list:
    with open(files, 'r') as r:
        next(r)  # skip headers
        rr = csv.reader(r)
        for row in rr:
            # do something
(this is part of Parfait's answer to a different question)
# Open a connection to the file
with open('world_dev_ind.csv') as file:
    # Skip the column names
    file.readline()
    # Initialize an empty dictionary: counts_dict
    counts_dict = {}
    # Process only the first 1000 rows
    for j in range(0, 1000):
        # Split the current line into a list: line
        line = file.readline().split(',')
        # Get the value for the first column: first_col
        first_col = line[0]
        # If the column value is in the dict, increment its value
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1
        # Else, add to the dict and set value to 1
        else:
            counts_dict[first_col] = 1
# Print the resulting dictionary
print(counts_dict)
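As an aside, a minimal sketch of the same counting logic using collections.Counter and itertools.islice:
from collections import Counter
from itertools import islice

with open('world_dev_ind.csv') as file:
    file.readline()  # skip the column names
    counts_dict = Counter(line.split(',')[0] for line in islice(file, 1000))
print(counts_dict)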
