I have about 40 million lines of text to parse through and I want to treat each line as a split string and then ask for multiple slices (or subscripts, whatever they are called) using a list of numbers I generate in a method.
# ...
other_file = open('output.txt','w')
list = [1, 4, 5, 7, ...]
for line in open(input_file):
other_file.write(line.split(',')[i for i in list])
The subscript can't take the generator expression I have shown, but I want to ask the split line for multiple entries in it without having to write out a loop over the list for every line.
I apologize, I know this is a simple answer but I just can't think of it. It's so late!
The csv module can help you:
import csv
reader = csv.reader(open(input_file, 'r'))
writer = csv.writer(open(output_file, 'w'))
fields = (1,4,5,7,...)
for row in reader:
    writer.writerow([row[i] for i in fields])
For further improvement, open the files with context managers.
Don't use list as a variable name; remember there is a builtin called list.
other_file = open('output.txt','w')
lst = [1,4,5,7,...]
for line in open(input_file):
    # strip the newline first so the last field does not carry it along
    fields = line.rstrip('\n').split(',')
    other_file.write(",".join(fields[i] for i in lst) + "\n")
For further improvement, use context managers to open/close the files for you:
from operator import itemgetter
from csv import reader, writer
fields = 1,4,5,7
row_filter = itemgetter(*fields)
with open('inp.txt', 'r') as inp:
    with open('out.txt', 'w') as out:
        writer(out).writerows(map(row_filter, reader(inp)))
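One caveat with the itemgetter approach, sketched here on in-memory data rather than real files: itemgetter returns a bare value instead of a tuple when it is given a single index, so a one-column selection needs special handling. The helper name select_columns and the sample rows are made up for illustration.

```python
import csv
import io
from operator import itemgetter


def select_columns(lines, fields):
    """Yield only the requested columns from each comma-separated line."""
    getter = itemgetter(*fields)
    for row in csv.reader(lines):
        picked = getter(row)
        # itemgetter returns a single value (not a tuple) for one index
        yield [picked] if len(fields) == 1 else list(picked)


# Hypothetical sample data standing in for the 40-million-line input file
sample = io.StringIO("a,b,c,d,e,f,g,h\n1,2,3,4,5,6,7,8\n")
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(select_columns(sample, (1, 4, 5, 7)))
result = out.getvalue()
```

Because select_columns is a generator, rows are handled one at a time, which matters for an input of this size.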
Sorry if this has already been answered before; the searches I have done have not been helpful.
I have a file that stores data as such:
name,number
(Although perhaps not relevant to the question, I will have to add entries to this file. I know how to do this.)
My question is for the pythonic(?) way of analyzing the data and sorting it in ascending order. So if the file was:
alex,30
bob,20
and I have to add the entry
carol, 25
The file should be rewritten as
bob,20
carol,25
alex,30
My first attempt was to store the entire file as a string (by read()) and then split by lines to get a list of strings, procedurally split those strings by a comma, and then create a new list of scores then sort that, but this doesn't seem right and fails because I don't have a way to go "back" once I have the order of scores.
I am unable to use libraries for this program.
Edit:
My first attempt I did not test because all it manages to do is sort a list of the scores; I don't know of a way to get the "entries" back.
file = open("scores.txt" , "r")
data = file.read()
list_data = data.split()
data.append([name,score])
for i in range(len(list_data)):
    list_scores = list_scores.append(list_data[i][1])
list_scores = sorted(list_scores)
As you can see, this gives me an ascending list of scores, but I do not know where to go from here in order to sort the list of name, score entries.
You will just have to write the sorted entries back to some file, using some basic string formatting:
with open('scores.txt') as f_in, open('file_out.txt', 'w') as f_out:
    entries = [(x, int(y)) for x, y in (line.strip().split(',') for line in f_in)]
    entries.append(('carol', 25))
    entries.sort(key=lambda e: e[1])
    for x, y in entries:
        f_out.write('{},{}\n'.format(x, y))
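The part the question was stuck on (getting the names "back" after sorting the scores) goes away if each name stays paired with its score while sorting. A minimal sketch of that idea on in-memory lines, with no imports; the function name add_and_sort is an assumption for illustration, not something from the question.

```python
def add_and_sort(lines, name, number):
    """Return 'name,number' lines sorted by number, including the new entry."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry_name, entry_number = line.split(",")
        entries.append((entry_name.strip(), int(entry_number)))
    entries.append((name, int(number)))
    # Sort the (name, number) pairs together so names stay attached to scores
    entries.sort(key=lambda entry: entry[1])
    return ["{},{}".format(n, v) for n, v in entries]


sorted_lines = add_and_sort(["alex,30\n", "bob,20\n"], "carol", 25)
```

The returned strings can then be written back to the file, one per line.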
I'm going to assume you're capable of putting your data into a .csv file in the following format:
Name,Number
John,20
Jane,25
Then you can use csv.DictReader to read each row into a dictionary, with something like:
import csv
with open('name_age.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    rows = list(reader)
and append a row to the file using
with open('name_age.csv', 'a', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=['Name', 'Number'])
    writer.writerow({'Name': 'Carol', 'Number': 25})
You can then sort the rows using Python's built-in sorted() together with the operator module.
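This is a function that takes a filename and sorts the file in place by the number in the second column:
def sort_file(filename):
    with open(filename, 'r') as f:
        text = f.read()
    # split each line into cells and drop stray spaces such as in "carol, 25"
    lines = [[cell.strip() for cell in i.split(',')] for i in text.splitlines()]
    lines.sort(key=lambda x: int(x[1]))  # compare as numbers, not as strings
    lines = [','.join(i) for i in lines]
    text = '\n'.join(lines)
    with open(filename, 'w') as f:
        f.write(text)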
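When sorting on a column that was read from a text file, keep in mind that the values are strings unless you convert them, and strings sort character by character. A small illustration with made-up scores:

```python
scores = ["9", "10", "100"]

as_text = sorted(scores)               # string comparison, character by character
as_numbers = sorted(scores, key=int)   # numeric comparison

print(as_text)
print(as_numbers)
```

With two-digit scores only, both orderings happen to agree, which is why the problem is easy to miss.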
A CSV returns the following values
"1,323104,564382"
"2,322889,564483"
"3,322888,564479"
"4,322920,564425"
"5,322942,564349"
"6,322983,564253"
"7,322954,564154"
"8,322978,564121"
How would I take the " marks off each end of the rows? It seems to make individual columns when I do this:
reader=[[i[0].replace('\'','')] for i in reader]
but it does not change the file at all.
It seems strictly easier to peel the quotes off first, and then feed it to the csv reader, which simply takes any iterable over lines as input.
import csv
import sys
with open(sys.argv[1]) as f:
    contents = f.read().replace('"', '')
reader = csv.reader(contents.splitlines())
for x, y, z in reader:
    print(x, y, z)
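As a variation on the same idea, and to show why the rows came out as a single column in the first place: csv.reader treats a fully quoted line as one field. Parsing once removes the quotes, and parsing the inner text again splits it. This is only a sketch on in-memory lines copied from the question.

```python
import csv

lines = ['"1,323104,564382"', '"2,322889,564483"']

# Parsed as-is, each quoted line is one field that still contains the commas
as_is = list(csv.reader(lines))

# Parse once to drop the quotes, then parse the inner text again
inner = [row[0] for row in csv.reader(lines)]
split_rows = list(csv.reader(inner))
```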
Assuming every line is wrapped in a pair of double quotes, we can do this:
f = open("filename.csv", "r")
newlines = []
for line in f: # we could use a list comprehension, but for simplicity, we won't.
    # drop the trailing newline first, then the quote at each end
    newlines.append(line.rstrip("\n")[1:-1] + "\n")
f.close()
f2 = open("filename.csv", "w")
for index, line in enumerate(newlines): # loop over the saved lines; f2 is write-only
    f2.write(newlines[index])
f2.close()
[1:-1] is a slice that takes the string from its second character up to, but not including, its last character; the indexes 1 and -1 mark those two positions.
enumerate() is a helper function that turns an iterable into (0, first_element), (1, second_element), ... pairs.
Iterating over a file gets you its lines.
I have a csv file that has each line formatted with the line name followed by 11 pieces of data. Here is an example of a line.
CW1,0,-0.38,2.04,1.34,0.76,1.07,0.98,0.81,0.92,0.70,0.64
There are 12 lines in total, each with a unique name and data.
What I would like to do is extract the first cell from each line and use that to name the corresponding data, either as a variable equal to a list containing that line's data, or maybe as a dictionary, with the first cell being the key.
I am new to working with inputting files, so the farthest I have gotten is to read the file in using the stock solution in the documentation
import csv
path = r'data.csv'
with open(path,'rb') as csvFile:
    reader = csv.reader(csvFile,delimiter=' ')
    for row in reader:
        print(row[0])
I am failing to figure out how to assign each row to a new variable, especially when I am not sure what the variable names will be (this is because the csv file will be created by a user other than myself).
The destination for this data is a tool that I have written. It accepts lists as input such as...
CW1 = [0,-0.38,2.04,1.34,0.76,1.07,0.98,0.81,0.92,0.70,0.64]
so this would be the ideal end solution. If it is easier, and considered better to have the output of the file read be in another format, I can certainly re-write my tool to work with that data type.
As Scironic said in their answer, it is best to use a dict for this sort of thing.
However, be aware that before Python 3.7 dict objects did not guarantee any "order", so the order of the rows could be lost if you use one. If this is a problem, you can use an OrderedDict instead (which is just what it sounds like: a dict that "remembers" the order of its contents):
import csv
from collections import OrderedDict as od
path = r'data.csv'
data = od() # ordered dict object remembers the order in the csv file
with open(path, newline='') as csvFile: # Python 3: text mode with newline=''
    reader = csv.reader(csvFile, delimiter=',') # the sample line is comma-separated
    for row in reader:
        data[row[0]] = row[1:] # Slice the row up into 0 (first item) and 1: (remaining)
Now if you go looping through your data object, the contents will be in the same order as in the csv file:
for d in data.values():
    myspecialtool(*d)
You need to use a dict for these kinds of things (dynamic variables):
import csv
path = r'data.csv'
data = {}
with open(path, newline='') as csvFile: # Python 3: text mode with newline=''
    reader = csv.reader(csvFile, delimiter=',') # the sample line is comma-separated
    for row in reader:
        data[row[0]] = row[1:]
Dicts are especially useful for dynamic variables and are the best way to store things like this. To access a row, you just need to use:
data['CW1']
This solution also means that if you add any extra rows in with new names, you won't have to change anything.
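Since the tool in the question expects lists of numbers rather than lists of strings, the values can be converted while the dict is built. A minimal sketch, assuming every cell after the name parses as a float; the function name load_named_rows is made up, and the sample line is the one from the question.

```python
import csv


def load_named_rows(lines):
    """Map the first cell of each row to the rest of the row as floats."""
    data = {}
    for row in csv.reader(lines):
        if not row:
            continue  # ignore empty lines
        data[row[0]] = [float(cell) for cell in row[1:]]
    return data


sample = ["CW1,0,-0.38,2.04,1.34,0.76,1.07,0.98,0.81,0.92,0.70,0.64"]
data = load_named_rows(sample)
```

Passing an open file object instead of the sample list should work the same way, since csv.reader accepts any iterable of lines.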
If you are desperate to have the variable names in the global namespace and not within a dict, use exec (N.B. IF ANY OF THIS USES INPUT FROM OUTSIDE SOURCES, USING EXEC/EVAL CAN BE HIGHLY DANGEROUS (rm * level) SO MAKE SURE ALL INPUT IS CONTROLLED AND UNDERSTOOD BY YOURSELF).
with open(path, newline='') as csvFile:
    reader = csv.reader(csvFile, delimiter=',')
    for row in reader:
        exec("{} = {}".format(row[0], row[1:]))
In Python, you can use slicing: row[1:] will contain the row except the first element, so you could do:
>>> import csv
>>> d = {}
>>> with open("f") as f:
...     c = csv.reader(f, delimiter=',')
...     for r in c:
...         d[r[0]] = list(map(int, r[1:]))
...
>>> d
{'var1': [1, 3, 1], 'var2': [3, 0, -1]}
Regarding variable variables, check "How do I do variable variables in Python?" or "How to get a variable name as a string in Python?". I would stick to a dictionary, though.
An alternative to using the proper csv library could be as follows:
path = r'data.csv'
with open(path, "r") as csvFile:
    csvRows = csvFile.readlines()
dataRows = [[float(col) for col in row.rstrip("\n").split(",")[1:]] for row in csvRows]
for dataRow in dataRows: # Where dataRow is a list of numbers
    print(dataRow)
You could then call your function where the print call is.
This reads the whole file in and produces a list of lines with trailing newlines. It then removes each newline and splits each row into a list of strings. It skips the initial column and calls float() on each entry, resulting in a list of lists. Note that the first column is discarded, so this only suits you if the row names are not important.
I'm a new Python user.
I have a txt file that will be something like:
3,1,3,2,3
3,2,2,3,2
2,1,3,3,2,2
1,2,2,3,3,1
3,2,1,2,2,3
but there may be fewer or more lines.
I want to import each line as a list.
I know you can do it as such:
filename = 'MyFile.txt'
fin=open(filename,'r')
L1list = fin.readline()
L2list = fin.readline()
L3list = fin.readline()
but since I don't know how many lines I will have, is there another way to create individual lists?
Do not create separate lists; create a list of lists:
results = []
with open('inputfile.txt') as inputfile:
    for line in inputfile:
        results.append(line.strip().split(','))
or better still, use the csv module:
import csv
results = []
with open('inputfile.txt', newline='') as inputfile:
    for row in csv.reader(inputfile):
        results.append(row)
Lists or dictionaries are far superior structures to keep track of an arbitrary number of things read from a file.
Note that either loop also lets you address the rows of data individually without having to read all the contents of the file into memory either; instead of using results.append() just process that line right there.
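To illustrate processing each row as it is read instead of collecting everything, here is a small sketch that finds the longest row while holding only one row at a time. The function name longest_row and the sample lines are assumptions for the example.

```python
import csv


def longest_row(lines):
    """Return the row with the most fields, holding one row at a time."""
    best = []
    for row in csv.reader(lines):
        if len(row) > len(best):
            best = row
    return best


sample = ["3,1,3,2,3", "2,1,3,3,2,2", "3,2"]
widest = longest_row(sample)
```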
Just for completeness' sake, here's the compact one-liner version to read a CSV file into a list in one go:
import csv
with open('inputfile.txt', newline='') as inputfile:
    results = list(csv.reader(inputfile))
Create a list of lists:
with open("/path/to/file") as file:
    lines = []
    for line in file:
        # The rstrip method gets rid of the "\n" at the end of each line
        lines.append(line.rstrip().split(","))
with open('path/to/file') as infile: # try open('...', 'rb') as well
    answer = [line.strip().split(',') for line in infile]
If you want the numbers as ints:
with open('path/to/file') as infile:
    answer = [[int(i) for i in line.strip().split(',')] for line in infile]
lines = []
with open('file') as file:
    for line in file: # a single readline() call would only read the first line
        lines.append(line.rstrip('\n').split(','))
Clarification:
So if my file has 10 lines:
The first line is a heading, so I want to append some text at the end of the first line.
Then I have a list which contains 9 elements.
I want to read that list and append the corresponding element to the end of each line.
So basically list[0] goes to the second line, list[1] to the third line, and so on.
I have a file which is delimited by commas,
something like this:
A,B,C
0.123,222,942
......
Now I want to do something like this:
A,B,C,D #append "D" just once
0.123,222,942,99293
............
This "D" is actually saved in a list so yeah I have this "D"
How do I do this? I mean, I know the naive way,
like going through each line and doing something like
string += str(list[i])
Basically, how do I append something at the end of each line of the file in a Pythonic way? :)
Just create a new file:
data = ['header', 1, 2, 3, 4]
with open("infile", 'r') as inf, open("infile.2", 'w') as outf:
    outf.writelines('%s,%s\n' % (s.strip(), n) for s, n in zip(inf, data))
If you want to "update" the input file, just rename the new one afterwards:
import os
os.unlink("infile")
os.rename("infile.2", "infile")
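On Python 3.3 and later, the unlink-then-rename pair can be collapsed into a single os.replace call, which overwrites the destination if it exists. A hedged sketch of the whole round trip; the function name append_column_in_place is made up, and it assumes there is exactly one value per line (zip stops at the shorter input, so extra lines would be dropped).

```python
import os
import tempfile


def append_column_in_place(path, values):
    """Append one value per line to the file at path, via a temporary file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the temporary file next to the original so the replace
    # does not have to cross filesystems
    with open(path) as src, tempfile.NamedTemporaryFile(
        "w", dir=directory, delete=False
    ) as tmp:
        for line, value in zip(src, values):
            tmp.write("{},{}\n".format(line.rstrip("\n"), value))
    os.replace(tmp.name, path)  # swap the new file in over the old one
```

A call such as append_column_in_place("infile", ['header', 1, 2, 3, 4]) would mirror the example above.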
Short answer: Use the csv module.
Long answer:
import csv
newvalues = [...]
with open("path/to/input.csv") as file:
    data = list(csv.reader(file))
with open("path/to/input.csv", "w") as file:
    writer = csv.writer(file)
    for row, newvalue in zip(data, newvalues):
        row.append(newvalue)
        writer.writerow(row)
Naturally, this depends on the file and newvalues having the same number of entries, since zip stops at the shorter of the two. If this isn't the case, you could use something like itertools.zip_longest to fill in the missing entries with a given value.
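A rough sketch of the zip_longest variant on in-memory rows, assuming that rows without a matching new value should get an empty cell and that surplus values are ignored. The names append_values and _MISSING are made up for the example.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel, so that None stays usable as a real value


def append_values(rows, newvalues, fill=""):
    """Append one new value to each row, padding with fill if values run out."""
    result = []
    for row, value in zip_longest(rows, newvalues, fillvalue=_MISSING):
        if row is _MISSING:
            break  # more values than rows: ignore the extras
        result.append(list(row) + [fill if value is _MISSING else value])
    return result


rows = [["A", "B", "C"], ["0.123", "222", "942"], ["0.5", "1", "2"]]
extended = append_values(rows, ["D", 99293])
```

The resulting rows can be passed straight to csv.writer(...).writerows.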
If you are writing to a different file, we can do it even more easily:
import csv
newvalues = [...]
# "from" is a reserved word in Python, so the file objects need other names
with open("path/to/input.csv") as infile, open("path/to/output.csv", "w") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row, newvalue in zip(reader, newvalues):
        row.append(newvalue)
        writer.writerow(row)
This also has the advantage of not reading the entire file into memory, so for very large files, this is a better solution.