I have two files. One has two columns (call it db) and the other has a single column (call it in). The second column of db is the same type as the column in in, and both files are sorted by that column.
db for example:
RPL24P3 NG_002525
RPLP1P1 NG_002526
RPL26P4 NG_002527
VN2R11P NG_006060
VN2R12P NG_006061
VN2R13P NG_006062
VN2R14P NG_006063
in for example:
NG_002527
NG_006062
I want to read through these files and get the output as follows:
NG_002527: RPL26P4
NG_006062: VN2R13P
That is, I'm iterating over the lines of in and trying to find the matching line in db.
The code I have written for that is:
with open(db_file, 'r') as db, open(sortIn, 'r') as inF, open(out_file, 'w') as outF:
    for line in inF:
        for dbline in db:
            if len(dbline) > 1:
                dbline = dbline.split('\t')
                if line.rstrip('\n') == dbline[db_specifications[0]]:
                    outF.write(dbline[db_specifications[0]] + ': ' + dbline[db_specifications[1]] + '\n')
                    break
*db_specifications isn't relevant to this problem, so I haven't copied the code that builds it; the problem doesn't lie there.
The current code finds a match and writes it as planned for the first line of in, but it won't find any matches for the other lines. I suspect it has to do with break, but I can't figure out what to change.
Since the data in db_file is sorted by the second column, you can use the following code to read the file.
with open("xyz.txt", "r") as db_file, open("abc.txt", "r") as sortIn, open("out.txt", 'w') as outF:
#first read the sortIn file as a list
i_list = [line.strip() for line in sortIn.readlines()]
#for each record read from the file, split the values into key and value
for line in db_file:
t_key,t_val = line.strip().split(' ')
#if value is in i_list file, then write to output file
if t_val in i_list: outF.write(t_val + ': ' + t_key + '\n')
#if value has reached the max value in sort list
#then you don't need to read the db_file anymore
if t_val == i_list[-1]: break
The output file will have the following items:
NG_002527: RPL26P4
NG_006062: VN2R13P
In the above code, we read the sortIn list first and then read db_file line by line. Because the sortIn file is also sorted in ascending order, i_list[-1] holds its largest value, so we can stop reading db_file as soon as that value is reached. This version does less I/O than the one below.
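(A small optional tweak, not part of the original answer: if sortIn is large, putting its keys in a set makes each membership test constant-time; the list is kept only for the i_list[-1] sentinel.)
with open("xyz.txt", "r") as db_file, open("abc.txt", "r") as sortIn, open("out.txt", "w") as outF:
    i_list = [line.strip() for line in sortIn]
    i_set = set(i_list)    # O(1) membership tests instead of O(n) list scans
    last_key = i_list[-1]  # largest key, since sortIn is sorted ascending
    for line in db_file:
        t_key, t_val = line.strip().split(' ')
        if t_val in i_set:
            outF.write(t_val + ': ' + t_key + '\n')
        if t_val == last_key:
            break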
===========
Previous answer submission:
Based on how the data is stored in db_file, it looks like we have to read the entire file to check against the sortIn file. If the values in db_file were sorted by the second column, we could stop reading the file once the last item from sortIn was found.
With the assumption that we need to read all records from the files, see if the below code works for you.
with open("xyz.txt", "r") as db_file, open("abc.txt", "r") as sortIn, open("out.txt", 'w') as outF:
#read the db_file and convert it into a dictionary
d_list = dict([line.strip().split(' ') for line in db_file.readlines()])
#read the sortIn file as a list
i_list = [line.strip() for line in sortIn.readlines()]
#check if the value of each value in d_list is one of the items in i_list
out_list = [v + ': '+ k for k,v in d_list.items() if v in i_list]
#out_list is your final list that needs to be written into a file
#now read out_list and write each item into the file
for i in out_list:
outF.write(i + '\n')
The output file will have the following items:
NG_002527: RPL26P4
NG_006062: VN2R13P
To help you, I have also printed the contents of d_list, i_list, and out_list.
The contents in d_list will look like this:
{'RPL24P3': 'NG_002525', 'RPLP1P1': 'NG_002526', 'RPL26P4': 'NG_002527', 'VN2R11P': 'NG_006060', 'VN2R12P': 'NG_006061', 'VN2R13P': 'NG_006062', 'VN2R14P': 'NG_006063'}
The contents in i_list will look like this:
['NG_002527', 'NG_006062']
The contents that get written into the outF file from out_list will look like this:
['NG_002527: RPL26P4', 'NG_006062: VN2R13P']
I was able to solve the problem by inserting the following line before the break statement:
line = next(inF)
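(For completeness, a sketch of the merge-join pattern this fix is reaching for: a single pass over each file, assuming both files are sorted lexicographically on the shared key, that db is tab-separated, and that the file names are placeholders.)
with open('db.txt') as db, open('in.txt') as inF, open('out.txt', 'w') as outF:
    # generator of (name, key) rows from db, skipping blank lines
    rows = (line.rstrip('\n').split('\t') for line in db if line.strip())
    row = next(rows, None)
    for line in inF:
        key = line.strip()
        # advance db until its key catches up with the current in key
        while row is not None and row[1] < key:
            row = next(rows, None)
        if row is not None and row[1] == key:
            outF.write(key + ': ' + row[0] + '\n')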
I have a test.txt file that contains:
yellow.blue.purple.green
red.blue.red.purple
And I'd like output.txt to contain just the second and third part of each line, like this:
blue.purple
blue.red
Here is my Python code:
with open('test.txt', 'r') as file1, open('output.txt', 'w') as file2:
    for line in file1:
        file2.write(line.partition('.')[2] + '\n')
but the result is:
blue.purple.green
blue.red.purple
How is it possible to take only the second and third part of each line?
Thanks
You may want
with open('test.txt', 'r') as file1, open('output.txt', 'w') as file2:
    for line in file1:
        file2.write(".".join(line.split('.')[1:3]) + '\n')
When you apply split('.') to a line such as yellow.blue.purple.green, you get a list of values:
["yellow", "blue", "purple", "green"]
By slicing with [1:3], you get the second and third items. (Your original partition('.') splits only at the first dot and returns everything after it as one string, which is why the trailing parts stayed in your output.)
First I created a .txt file with the same data that you entered in your original .txt file, using the 'w' mode to write it, as you already know. I also create an empty list that we will use to store the data and later write to the output.txt file.
output_write = []

with open('test.txt', 'w') as file_object:
    file_object.write('yellow.' + 'blue.' + 'purple.' + 'green')
    file_object.write('\nred.' + 'blue.' + 'red.' + 'purple')
Next I opened the file I created in 'r' mode to read the data back, as you already know. To get the output you wanted, I read each line of the file in a for loop and split the line on the period ('.') to get a list of the items in it. Since you want the second and third items, I store read[1:3] in a variable called new_read, join those two pieces back together with a dot in a variable called output_data, and finally append output_data to the empty list we created earlier.
with open('test.txt', 'r') as file_object:
    for line in file_object:
        read = line.split('.')
        new_read = read[1:3]
        output_data = new_read[0] + '.' + new_read[1]
        output_write.append(output_data)
Lastly we can write this data to a file called 'output.txt' as you noted earlier.
with open('output.txt', 'w') as file_object:
    file_object.write(output_write[0])
    file_object.write('\n' + output_write[1])

print(output_write[0])
print(output_write[1])
Lastly I print the data just to check the output:
blue.purple
blue.red
Suppose I have a big file, file.txt, with around 300,000 lines of data. I want to split it based on a certain key: the location. See file.txt below:
Line 1: U0001;POUNDS;**CAN**;1234
Line 2: U0001;POUNDS;**USA**;1234
Line 3: U0001;POUNDS;**CAN**;1234
Line 100000: U0001;POUNDS;**CAN**;1234
The locations are limited to 10-15 different nations, and I need to put every record for a particular country into its own file. How can I do this task in Python?
Thanks for help
This will run with very low memory overhead as it writes each line as it reads it.
Algorithm:
open input file
read a line from input file
get country from line
if new country then open file for country
write the line to country's file
loop if more lines
close files
Code:
outfiles = {}
with open('file.txt', 'r') as infile:
    try:
        for line in infile:
            country = line.split(';')[2].strip('*')
            if country not in outfiles:
                outfiles[country] = open(country + '.txt', 'w')
            outfiles[country].write(line)
    finally:
        for outfile in outfiles.values():
            outfile.close()
with open("file.txt") as f:
content = f.readlines()
# you may also want to remove whitespace characters like `\n` at the end of each line
text = [x.strip() for x in content]
x = [i.split(";") for i in text]
x.sort(key=lambda x: x[2])
from itertools import groupby
from operator get itemgetter
y = groupby(x, itemgetter(2))
res = [(i[0],[j for j in i[1]]) for i in y]
for country in res:
with open(country[0]+".txt","w") as writeFile:
writeFile.writelines("%s\n" % ';'.join(l) for l in country[1])
This will group the lines by country. Hope it helps!
Looks like what you have is a csv file. csv stands for comma-separated values, but any file that uses a different delimiter (in this case a semicolon ;) can be treated like a csv file.
We'll use the Python csv module to read the file in, and then write a file for each country.
import csv
from collections import defaultdict

d = defaultdict(list)

with open('file.txt', newline='') as f:
    r = csv.reader(f, delimiter=';')
    for line in r:
        d[line[2]].append(line)

for country in d:
    with open('{}.txt'.format(country), 'w', newline='') as outfile:
        w = csv.writer(outfile, delimiter=';')
        for line in d[country]:
            w.writerow(line)
# the formatting function for the filename used for saving
outputFileName = "{}.txt".format
# alternative:
##import time
##outputFileName = lambda loc: "{}_{}.txt".format(loc, time.asctime())

# make a dictionary indexed by location; each entry holds the new file content for that location
sortedByLocation = {}

f = open("file.txt", "r")
# iterate over each line and look at the column with the location
for l in f.readlines():
    line = l.split(';')
    # the third field (indices begin with 0) is the location abbreviation;
    # lower-case it, because on a case-insensitive filesystem a file named with upper-case
    # characters gets overwritten by the lower-case one, while Python treats the two as distinct
    location = line[2].lower().strip()
    # get the previous lines for this location and store the extended text back
    tmp = sortedByLocation.get(location, "")
    sortedByLocation[location] = tmp + l.strip() + '\n'
f.close()

# save a file for each location
for location, text in sortedByLocation.items():
    with open(outputFileName(location), "w") as f:
        f.write(text)
I have an odd csv file that has header values and their corresponding data arranged as below:
,,,Completed Milling Job,,,,,, # row 1
,,,,Extended Report,,,,,
,,Job Spec numerical control,,,,,,,
Job Number,3456,,,,,, Operator Id,clipper,
Coder Machine Name,Caterpillar,,,,,,Job Start time,3/12/2013 6:22,
Machine type,Stepper motor,,,,,,Job end time,3/12/2013 9:16,
I need to extract the data from this structure and create another csv file with the structure below:
Status,Job Number,Coder Machine Name,Machine type, Operator Id,Job Start time,Job end time,,, # header
Completed Milling Job,3456,Caterpillar,Stepper motor,clipper,3/12/2013 6:22,3/12/2013 9:16,,, # data row
If you notice, there is a new header column called 'Status', whose value comes from the first row of the csv file. The rest of the column names in the output file are extracted from the original file.
Any thoughts will be greatly appreciated - thanks
Assuming the files are all exactly like that (at least in terms of caps) this should work, though I can only guarantee it on the exact data you have supplied:
#!/usr/bin/python
import glob
from sys import argv

g = open(argv[2], 'w')
g.write("Status,Job Number,Coder Machine Name,Machine type, Operator Id,Job Start time,Job end time\n")
for fname in glob.glob(argv[1]):
    with open(fname) as f:
        status = f.readline().strip().strip(',')
        f.readline()  # extended report not needed
        f.readline()  # job spec numerical control not needed
        s = f.readline()
        job_no = s.split('Job Number,')[1].split(',')[0]
        op_id = s.split('Operator Id,')[1].strip().strip(',')
        s = f.readline()
        machine_name = s.split('Coder Machine Name,')[1].split(',')[0]
        start_t = s.split('Job Start time,')[1].strip().strip(',')
        s = f.readline()
        machine_type = s.split('Machine type,')[1].split(',')[0]
        end_t = s.split('Job end time,')[1].strip().strip(',')
        g.write(",".join([status, job_no, machine_name, machine_type, op_id, start_t, end_t]) + "\n")
g.close()
It takes a glob argument (like Job*.data) and an output filename and should construct what you need. Just save it as 'so.py' or something and run it as python so.py <data_files_wildcarded> output.csv
Here is a solution that should work on any CSV files that follow the same pattern as what you showed. That is a seriously nasty format.
I got interested in the problem and worked on it during my lunch break. Here's the code:
COMMA = ','
NEWLINE = '\n'

def _kvpairs_from_line(line):
    line = line.strip()
    values = [item.strip() for item in line.split(COMMA)]
    i = 0
    while i < len(values):
        if not values[i]:
            i += 1  # advance past empty value
        else:
            # yield pair of values
            yield (values[i], values[i+1])
            i += 2  # advance past pair

def kvpairs_by_column_then_row(lines):
    """
    Given a series of lines, where each line is comma-separated values
    organized as key/value pairs like so:

    key_1,value_1,key_n+1,value_n+1,...
    key_2,value_2,key_n+2,value_n+2,...
    ...
    key_n,value_n,key_n+n,value_n+n,...

    Yield up key/value pairs taken from the first column, then from the
    second column, and so on.
    """
    pairs = [_kvpairs_from_line(line) for line in lines]
    done = [False for _ in pairs]
    while not all(done):
        for i in range(len(pairs)):
            if not done[i]:
                try:
                    key_value_tuple = next(pairs[i])
                    yield key_value_tuple
                except StopIteration:
                    done[i] = True
STATUS = "Status"
columns = [STATUS]
d = {}
with open("data.csv", "rt") as f:
# get an iterator that lets us pull lines conveniently from file
itr = iter(f)
# pull first line and collect status
line = next(itr)
lst = line.split(COMMA)
d[STATUS] = lst[3]
# pull next lines and make sure the file is what we expected
line = next(itr)
assert "Extended Report" in line
line = next(itr)
assert "Job Spec numerical control" in line
# pull all remaining lines and save in a list
lines = [line.strip() for line in f]
for key, value in kvpairs_by_column_then_row(lines):
columns.append(key)
d[key] = value
with open("output.csv", "wt") as f:
# write column headers line
line = COMMA.join(columns)
f.write(line + NEWLINE)
# write data row
line = COMMA.join(d[key] for key in columns)
f.write(line + NEWLINE)
I write the result of a sqlite3 query to a csv file, like:
2221,5560081.75998,7487177.66,237.227573347,0.0,5.0,0.0
2069,5559223.00998,7486978.53,237.245992308,0.0,5.0,0.0
10001,5560080.63053,7487182.53076,237.227573347,0.0,5.0,0.0
1,50.1697105444,20.8112828879,214.965341376,5.0,-5.0,0.0
2,50.1697072935,20.8113209177,214.936598128,5.0,-5.0,0.0
10002,50.1697459029,20.8113995467,214.936598128,5.0,-5.0,0.0
1,50.1697105444,20.8112828879,214.965341376,-5.0,-5.0,0.0
2,50.1697072935,20.8113209177,214.936598128,-5.0,-5.0,0.0
10003,50.1697577958,20.8112608051,214.936598128,-5.0,-5.0,0.0
My first general question is: how do I pick every nth line of a csv or txt file with Python?
My specific problem is: how do I remove the last three columns from two out of every three lines of the csv file, leaving every third line unchanged?
The output would be like:
2221,5560081.75998,7487177.66,237.227573347
2069,5559223.00998,7486978.53,237.245992308
10001,5560080.63053,7487182.53076,237.227573347,0.0,5.0,0.0
1,50.1697105444,20.8112828879,214.965341376
2,50.1697072935,20.8113209177,214.936598128
10002,50.1697459029,20.8113995467,214.936598128,5.0,-5.0,0.0
1,50.1697105444,20.8112828879,214.965341376
2,50.1697072935,20.8113209177,214.936598128
10003,50.1697577958,20.8112608051,214.936598128,-5.0,-5.0,0.0
I've tried inter alia with:
import csv

fi = open('file.csv', 'r')
for i, row in enumerate(csv.reader(fi, delimiter=',', skipinitialspace=True)):
    if i % 3 == 2:
        print(row[0:])
    else:
        print(row[0], row[1], row[2], row[3])
To retrieve the nth line it's easiest to iterate through the file, but you can use the linecache module to grab it directly.
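For example, a minimal linecache sketch (line numbers are 1-based, and getline returns an empty string for a missing line):
import linecache

# grab the 5th line of file.csv directly, without iterating
fifth = linecache.getline('file.csv', 5)
print(fifth.rstrip('\n'))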
To answer the other part, assuming you want a new csv file with the desired qualities:
import csv

my_file = []
with open('file.csv', 'r') as fi:
    for i, row in enumerate(csv.reader(fi, delimiter=',', skipinitialspace=True)):
        if i % 3 == 2:
            my_file.append(row)
        else:
            my_file.append(row[:-3])

# if you want to save a new csv file
with open('new_file.csv', 'w', newline='') as new_fi:
    new_fi_writer = csv.writer(new_fi, delimiter=',')
    for line in my_file:
        new_fi_writer.writerow(line)

# alternatively (if you just want to print the lines)
for line in my_file:
    print(' '.join(line))
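For the general "every nth line" part of the question, itertools.islice is another concise option; a small sketch that pulls every third line, starting from the first:
from itertools import islice

with open('file.csv') as f:
    # islice(iterable, start, stop, step): here lines 0, 3, 6, ...
    for line in islice(f, 0, None, 3):
        print(line.rstrip('\n'))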
I've got a problem trying to replace keys (from a dictionary) in a file with their corresponding values. More details: an input file called event_from_picks looks like:
EVENT2593
EVENT2594
EVENT2595
EVENT41025
EVENT2646
EVENT2649
Also, my dictionary, created by reloc_event_coords_dic(), looks like:
{'EVENT23595': ['36.9828 -34.0538 138.1554'], 'EVENT2594': ['41.2669 -33.0179 139.2269'], 'EVENT2595': ['4.7500 -32.7926 138.1523'], 'EVENT41025': ['16.2453 -32.9552 138.2604'], 'EVENT2646': ['5.5949 -32.4923 138.1866'], 'EVENT2649': ['7.9533 -31.8304 138.6966']}
What I'd like to end up with is a new file with the values instead of the keys. In this case, a new file called receiver.in, which will look like:
36.9828 -34.0538 138.1554
41.2669 -33.0179 139.2269
4.7500 -32.7926 138.1523
16.2453 -32.9552 138.2604
5.5949 -32.4923 138.1866
7.9533 -31.8304 138.6966
My (wrong) function so far is below; I know I must have a problem with the loops, but I can't figure out what:
def converted_lines():
    file_out = open('receiver.in', 'w')
    converted_lines = []
    event_dict = reloc_event_coords_dic()
    data_line = event_dict.items()  # takes data as ('EVENT31933', ['10.1230 -32.8294 138.1718'])
    for element in data_line:
        for item in element:
            event_number = element[0]  # gets event number
            coord_line = event_dict.get(event_number, None)
    with open('event_from_picks', 'r') as file_in:
        for line in file_in:
            if line.startswith(" "):
                continue
            if event_number:
                converted_lines.append("%s" % coord_line)
    file_out.writelines(converted_lines)
Thanks for reading!
Just do the following:
with open('receiver.in', 'w') as f:
    f.writelines(v[0] + '\n' for v in reloc_event_coords_dic().values())
Your first loop just leaves the last pair in the coord_line variable.
Better do:
event_dict = reloc_event_coords_dic()
with open('event_from_picks', 'r') as file_in:
    with open('receiver.in', 'w') as file_out:
        for in_line in file_in:
            file_out.write(event_dict[in_line.strip()][0] + '\n')
(untested, but you should get the logic).