Reading a file and parsing it into sections - Python

Okay, so I have a file that contains an ID number followed by a name, like this:
10 alex de souza
11 robin van persie
9 serhat akin
I need to read this file and break each record up into 2 fields the id, and the name. I need to store the entries in a dictionary where ID is the key and the name is the satellite data. Then I need to output, in 2 columns, one entry per line, all the entries in the dictionary, sorted (numerically) by ID. dict.keys and list.sort might be helpful (I guess). Finally the input filename needs to be the first command-line argument.
Thanks for your help!
I have this so far, but I can't get any further.
fin = open("ids","r") #Read the file
for line in fin: #Split lines
    string = str.split()
    if len(string) > 1: #Separate names and grades
        id = map(int, string[0]
        name = string[1:]
        print(id, name) #Print results

We need sys.argv to get the command-line argument (careful: the name of the script is always element 0 of that list).
Now we open the file (no error handling; you should add that) and read in the lines individually. Each entry in the list lines is now a 'number firstname secondname' string.
Then create an empty dictionary out and loop over the individual strings in lines, splitting each one on whitespace and storing the result in the temporary variable tmp (which is now a list of strings: ['number', 'firstname', 'secondname']).
Following that we just fill the dictionary, using the number as key and the space-joined rest of the names as value.
To print the dictionary sorted, loop over the keys returned by sorted(out), using the key=int option for numerical sorting. Then print the id (the number) followed by the corresponding value, looked up in the dictionary by the string form of the id.
import sys
try:
    infile = sys.argv[1]
except IndexError:
    # raw_input, not input: on Python 2.7 input() would evaluate what is typed
    infile = raw_input('Enter file name: ')
with open(infile, 'r') as file:
    lines = file.readlines()
out = {}
for fullstr in lines:
    tmp = fullstr.split()
    out[tmp[0]] = ' '.join(tmp[1:])
for id in sorted(out, key=int):
    print id, out[str(id)]
This works for python 2.7 with ASCII-strings. I'm pretty sure that it should be able to handle other encodings as well (German Umlaute work at least), but I can't test that any further. You may also want to add a lot of error handling in case the input file is somehow formatted differently.
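For Python 3, the same approach might be sketched as follows. This is only an illustration: the function names parse_records and format_records are made up here, IDs are stored as integers (so sorted() orders them numerically without key=int), and the column width of 4 is an arbitrary choice. A non-numeric ID would raise ValueError, which real code would probably want to handle.

```python
import sys


def parse_records(lines):
    """Map each integer ID to the name that follows it on the line."""
    records = {}
    for line in lines:
        parts = line.split(None, 1)  # split at the first run of whitespace only
        if len(parts) < 2:
            continue  # skip blank or incomplete lines
        records[int(parts[0])] = parts[1].strip()
    return records


def format_records(records):
    """Return one 'ID name' line per entry, sorted numerically by ID."""
    return ["{:<4}{}".format(key, records[key]) for key in sorted(records)]


def main(argv):
    # argv[0] is the script name, argv[1] the input filename
    with open(argv[1], "r") as handle:
        records = parse_records(handle)
    for row in format_records(records):
        print(row)
```

Calling main(sys.argv) at the bottom of the script would then take the filename from the first command-line argument.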

Just a suggestion, this code is probably simpler than the other code posted:
import sys
with open(sys.argv[1], "r") as handle:
    lines = handle.readlines()
data = dict([i.strip().split(' ', 1) for i in lines])
for idx in sorted(data, key=int):
    print idx, data[idx]

Related

Need to read csv files (when csv file is multiple input files) in Python

I have a school assignment that is asking me to write a program that first reads in the name of an input file and then reads the file using the csv.reader() method. The file contains a list of words separated by commas. The program should output the words and their frequencies (the number of times each word appears in the file) without any duplicates.
I have been able to figure out how to do this somewhat for one specific input file, but the program needs to be able to read multiple input files. This is what I have so far:
with open('input1.csv', 'r') as input1file:
    csv_reader = csv.reader(input1file, delimiter = ',')
    for row in csv_reader:
        new_row = set(row)
    for m in new_row:
        count = row.count(m)
        print(m, count)
This is what I get:
woman 1
man 2
Cat 1
Hello 1
boy 2
cat 2
dog 2
hey 2
hello 1
This works (almost) for the input1 file, except it changes the order each time I run it.
And I need it to work for two other input files?
sample CSV
hello,cat,man,hey,dog,boy,Hello,man,cat,woman,dog,Cat,hey,boy
See the code below for an example; I've commented it so you can see what it does and why.
The order differs between runs in your implementation because you use a set, and a set is unordered by definition.
Also note that your implementation passes over each row twice: once to turn it into a set, and once more to count. Besides this, if the file contains more than one row, your logic fails, because the counting part is only reached after the last line of the file has been read.
import csv
def count_things(filename):
    with open(filename) as infile:
        csv_reader = csv.reader(infile, delimiter = ',')
        result = {}
        for row in csv_reader:
            # go over the row by element
            for element in row:
                # does it exist already?
                if element in result:
                    # if yes, increase count
                    result[element] += 1
                else:
                    # if no, add and set count to 1
                    result[element] = 1
    # sorting, explained in detail here:
    # https://stackoverflow.com/a/613218/9267296
    return {k: v for k, v in sorted(result.items(), key=lambda item: item[1], reverse=True)}
    # you could just return unsorted result by using:
    # return result
for key, value in count_things("input1.csv").items():
    # iterate over items() by key/value pairs
    # see this link:
    # https://www.w3schools.com/python/python_dictionaries_access.asp
    print(key, value)
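Since the question also asks for the program to work with any input file and to keep a stable order, here is a hedged alternative sketch using collections.Counter from the standard library instead of a hand-built dictionary. The function names count_words and report are made up for this example, and the stable first-seen order relies on Python 3.7 or later, where dictionaries (and therefore Counter) keep insertion order.

```python
import csv
from collections import Counter


def count_words(filename):
    """Count every comma-separated word in the file, across all rows."""
    counts = Counter()
    with open(filename, newline="") as infile:
        for row in csv.reader(infile):
            counts.update(row)  # adds 1 for each word in this row
    return counts


def report(filename):
    # Words come out in the order they were first seen in the file
    for word, count in count_words(filename).items():
        print(word, count)
```

Calling report(input()) would read the file name from the user first, so the same code can be run against input1.csv or any other file.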

Reading and taking specific file contents in a list in python

I have a file containing:
name: Sam
placing: 2
quote: I'll win.
name: Jamie
placing: 1
quote: Be the best.
and I want to read the file through python and append specific contents into a list. I want my first list to contain:
rank = [['Sam', 2],['Jamie', 1]]
and second list to contain:
quo = ['I'll win','Be the best']
First off, I start reading the file with:
def read_file():
    filename = open("player.txt","r")
    playerFile = filename
    player = [] #first list
    quo = [] #second list
    for line in playerFile: #going through each line
        line = line.strip().split(':') #strip new line
        print(line) #checking purpose
        player.append(line[1]) #index out of range
        player.append(line[2])
        quo.append(line[3])
I'm getting an index out of range in the first append. I have split by ':' but I can't seem to access it.
When you do line = line.strip().split(':') with line = "name: Sam", you get ['name', ' Sam'], so the first append should work.
The second one, player.append(line[2]), will not work.
As zython said in the comments, you need to know the format of the file; a blank line or any other change in the file can make your script fail.
You should analyze the file differently:
If you can rely on "name" and "quote" always being present in each player's data, you should look for these field names.
For example:
for line in file:
    # Run on each line and insert to player list only the lines with "name" in it
    if ("name" in line):
        # Line with "name" was found - do what you need with it
        player.append(line.split(":")[1])
A few problems:
The program attempts to read three lines' worth of data in a single iteration of the for loop. That won't work, because the loop and the split call parse only a single line per iteration. It takes three loop iterations to read a single entry from your file.
The program needs handling for blank lines. Generally, when reading files like this, you want a lot of error handling, because the data is usually not formatted perfectly. My suggestion is to check for blank lines, where line has only a single value which is an empty string. When you see that, ignore the line.
The program needs to collect the first and second lines of each entry, put those into a temporary array, then append the temporary array to player. So you'll need to declare that temporary array above, populate it first with the name field, next with the placing field, and finally append it to player.
Zero-based indexing. Remember that the first item of a list is list[0], not list[1].
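The points above could be put together roughly like this. It is a sketch under assumptions, not a definitive implementation: the function name parse_players is made up, each entry is assumed to list name before placing, and the placing is converted to int because the question asks for [['Sam', 2], ['Jamie', 1]].

```python
def parse_players(lines):
    """Build rank ([name, placing] pairs) and quo (quotes) from 'key: value' lines."""
    rank, quo = [], []
    entry = []  # temporary list for the entry currently being read
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        key, _, value = line.partition(":")  # split at the first colon only
        key, value = key.strip(), value.strip()
        if key == "name":
            entry = [value]
        elif key == "placing":
            entry.append(int(value))
            rank.append(entry)
        elif key == "quote":
            quo.append(value)
    return rank, quo
```

With the file from the question, this could be called as rank, quo = parse_players(open("player.txt")), ideally inside a with block so the file gets closed.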
I think you are confused about how to check a line and add its content to one of two lists based on what it contains. You could use in to check which kind of line you are on. This works assuming your text file is the same as given in the question.
rank, quo = [], []
for line in playerFile:
    splitted = line.strip().split(": ")  # strip() drops the trailing newline
    if "name" in line:
        name = splitted[1]
    elif "placing" in line:
        rank.append([name, splitted[1]])
    elif "quote" in line:
        quo.append(splitted[1])
print(rank) # [['Sam', '2'], ['Jamie', '1']]
print(quo) # ["I'll win.", 'Be the best.']
Try this code:
def read_file():
    filename = open("player.txt", "r")
    playerFile = filename
    player = []
    rank = []
    quo = []
    for line in playerFile:
        value = line.strip().split(": ")
        if "name" in line:
            player.append(value[1])
        if "placing" in line:
            player.append(value[1])
        if "quote" in line:
            quo.append(value[1])
            rank.append(player)
            player = []
    print(rank)
    print(quo)
read_file()

Python reading file problems

highest_score = 0
g = open("grades_single.txt","r")
arrayList = []
for line in highest_score:
    if float(highest_score) > highest_score:
        arrayList.extend(line.split())
g.close()
print(highest_score)
Hello, I wondered if anyone could help me; I'm having problems here. I have to read in a file which contains 3 lines. The first line is of no use, and neither is the third. The second contains a list of letters, which I have to pull out (for instance all the As, all the Bs, all the Cs, all the way up to G); there are multiple occurrences of each letter. I have to be able to count how many of each there are through this program. I'm very new to this, so please bear with me if the code I've written is wrong. I just wondered if anyone could point me in the right direction on how to pull out these letters on the second line and count them. I then have to do a mathematical function with these letters, but I hope to work that out for myself.
Sample of the data:
GTSDF60000
ADCBCBBCADEBCCBADGAACDCCBEDCBACCFEABBCBBBCCEAABCBB
*
You do not read the contents of the file. To do so, use the .read() or .readlines() method on your opened file. .readlines() reads each line of a file separately, like so:
g = open("grades_single.txt","r")
filecontent = g.readlines()
Since it is good practice to close your file as soon as you have read its contents, follow directly with:
g.close()
Another option would be:
with open("grades_single.txt","r") as g:
    content = g.readlines()
The with statement closes the file for you (so you don't need to call .close() this way).
Since you need the contents of the second line only, you can choose that one directly:
content = g.readlines()[1]
.readlines() doesn't strip the newline (usually \n) from a line, so you still have to do so:
content = g.readlines()[1].strip('\n')
The .count() method lets you count items in a list or in a string. So you could do:
dct = {}
for item in content:
    dct[item] = content.count(item)
This can be written more compactly with a dictionary comprehension:
dct = {item:content.count(item) for item in content}
Finally, you can get the highest score and print it:
highest_score = max(dct.values())
print(highest_score)
.values() returns the values of a dictionary, and max returns the largest of them.
Thus the code that does what you're looking for could be:
with open("grades_single.txt","r") as g:
    content = g.readlines()[1].strip('\n')
dct = {item:content.count(item) for item in content}
highest_score = max(dct.values())
print(highest_score)
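Calling .count() once per character rescans the whole line each time. As a hedged alternative, collections.Counter from the standard library tallies every letter in a single pass. The helper name count_grades is made up for this sketch, and it assumes, as in the question, that the letters sit on the second line.

```python
from collections import Counter


def count_grades(lines):
    """Count each character on the second line of the input."""
    lines = list(lines)
    if len(lines) < 2:
        raise ValueError("expected at least two lines")
    return Counter(lines[1].strip())
```

With the file from the question this could be used as counts = count_grades(g) inside a with open("grades_single.txt") as g: block; counts["A"] then gives the number of As, and max(counts.values()) the highest count.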
highest_score = 0
arrayList = []
with open("grades_single.txt") as f:
    # a file object cannot be indexed, so read the lines into a list first
    arrayList.extend(f.readlines()[1].strip())
print (arrayList)
This puts the characters of the second line of that file into arrayList and prints them. You can then do whatever you want with that list.
import re
# opens the file in read mode (and closes it automatically when done)
with open('my_file.txt', 'r') as opened_file:
    # Temporarily stores all lines of the file here.
    all_lines_list = []
    for line in opened_file.readlines():
        all_lines_list.append(line)
# This is the selected pattern.
# It basically means "match a single character from a to g"
# and ignores upper or lower case
pattern = re.compile(r'[a-g]', re.IGNORECASE)
# Which line i want to choose (assuming you only need one line chosen)
line_num_i_need = 2
# (1 is deducted since the first element in python has index 0)
matches = re.findall(pattern, all_lines_list[line_num_i_need-1])
print('\nMatches found:')
print(matches)
print('\nTotal matches:')
print(len(matches))
You might want to check regular expressions in case you need some more complex pattern.
To count the occurrences of each letter I used a dictionary instead of a list. With a dictionary, you can access each letter count later on.
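d = {}
g = open("grades_single.txt", "r")
for i,line in enumerate(g):
    if i == 1:
        holder = list(line.strip())
g.close()
for letter in holder:
    d[letter] = holder.count(letter)
# items() and calling format() on the string work on both Python 2 and 3
for key,value in d.items():
    print("{},{}".format(key,value))
Outputs (the order of the lines may differ between Python versions):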
A,9
C,15
B,15
E,4
D,5
G,1
F,1
One can treat the first line specially (and in this case ignore it) with next inside try: except StopIteration:. In this case, where you only want the second line, follow with another next instead of a for loop.
with open("grades_single.txt") as f:
    try:
        next(f) # discard 1st line
        line = next(f)
    except StopIteration:
        raise ValueError('file does not even have two lines')
# now use line

Referring to a list of names using Python

I am new to Python, so please bear with me.
I can't get this little script to work properly:
genome = open('refT.txt','r')
datafile - a reference genome with a bunch (2 million) of contigs:
Contig_01
TGCAGGTAAAAAACTGTCACCTGCTGGT
Contig_02
TGCAGGTCTTCCCACTTTATGATCCCTTA
Contig_03
TGCAGTGTGTCACTGGCCAAGCCCAGCGC
Contig_04
TGCAGTGAGCAGACCCCAAAGGGAACCAT
Contig_05
TGCAGTAAGGGTAAGATTTGCTTGACCTA
The file is opened:
cont_list = open('dataT.txt','r')
a list of contigs that I want to extract from the dataset listed above:
Contig_01
Contig_02
Contig_03
Contig_05
My hopeless script:
for line in cont_list:
    if genome.readline() not in line:
        continue
    else:
        a=genome.readline()
        s=line+a
        data_out = open ('output.txt','a')
        data_out.write("%s" % s)
        data_out.close()
input('Press ENTER to exit')
The script successfully writes the first three contigs to the output file, but for some reason it doesn't seem able to skip "contig_04", which is not in the list, and move on to "Contig_05".
I might seem a lazy bastard for posting this, but I've spent all afternoon on this tiny bit of code -_-
I would first try to generate an iterable which gives you a tuple (contig, genome):
def pair(file_obj):
    for line in file_obj:
        yield line, next(file_obj)
Now, I would use that to get the desired elements:
wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}
with open('filename') as fin:
    pairs = pair(fin)
    while wanted:
        p = next(pairs)
        name = p[0].strip()  # drop the trailing newline before comparing
        if name in wanted:
            # write to output file, store in a list, or dict, ...
            wanted.discard(name)  # sets have discard()/remove(), not forget()
I would recommend several things:
Try using with open(filename, 'r') as f instead of f = open(...)/f.close(). with will handle the closing for you. It also encourages you to handle all of your file IO in one place.
Try to read in all the contigs you want into a list or other structure. It is a pain to have many files open at once. Read all the lines at once and store them.
Here's some example code that might do what you're looking for
from itertools import zip_longest  # named izip_longest on Python 2
# Read in contigs from file and store in list
contigs = []
with open('dataT.txt', 'r') as contigfile:
    for line in contigfile:
        contigs.append(line.rstrip()) #rstrip() removes '\n' from EOL
# Read through genome file, open up an output file
with open('refT.txt', 'r') as genomefile, open('out.txt', 'w') as outfile:
    # Nifty way to sort through fasta files 2 lines at a time
    for name, seq in zip_longest(*[genomefile]*2):
        # compare the contig name to your list of contigs
        if name.rstrip() in contigs:
            outfile.write(name) #optional. remove if you only want the seq
            outfile.write(seq)
Here's a pretty compact approach to get the sequences you'd like.
def get_sequences(data_file, valid_contigs):
    sequences = []
    with open(data_file) as cont_list:
        for line in cont_list:
            if line.startswith(valid_contigs):
                # next() works on Python 2 and 3; file.next() is Python 2 only
                sequence = next(cont_list).strip()
                sequences.append(sequence)
    return sequences
if __name__ == '__main__':
    valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
    # refT.txt is the genome file from the question
    sequences = get_sequences('refT.txt', valid_contigs)
    print(sequences)
This utilizes the ability of startswith() to accept a tuple as a parameter and check for any matches. If the line matches what you want (a desired contig), it will grab the next line and append it to sequences after stripping out the unwanted whitespace characters.
From there, writing the sequences grabbed to an output file is pretty straightforward.
Example output:
['TGCAGGTAAAAAACTGTCACCTGCTGGT',
'TGCAGGTCTTCCCACTTTATGATCCCTTA',
'TGCAGTGTGTCACTGGCCAAGCCCAGCGC',
'TGCAGTAAGGGTAAGATTTGCTTGACCTA']
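Writing the selected contigs to an output file, as mentioned above, could look something like the following sketch. The function names extract_contigs and write_contigs are made up for illustration. It uses a set with exact name matching instead of startswith(), on the assumption that with around 2 million contigs a prefix match could also pick up longer names that merely begin with a wanted name.

```python
def extract_contigs(genome_lines, wanted):
    """Yield (name, sequence) for each wanted contig in a name/sequence-per-line file."""
    wanted = set(wanted)  # exact-name lookup in constant time
    lines = iter(genome_lines)
    for name in lines:
        seq = next(lines, "")  # the sequence is the line right after the name
        if name.strip() in wanted:
            yield name.strip(), seq.strip()


def write_contigs(genome_path, list_path, out_path):
    # Read the wanted contig names, skipping blank lines
    with open(list_path) as listfile:
        wanted = [line.strip() for line in listfile if line.strip()]
    with open(genome_path) as genome, open(out_path, "w") as out:
        for name, seq in extract_contigs(genome, wanted):
            out.write("{}\n{}\n".format(name, seq))
```

For the files in the question, the call would be write_contigs('refT.txt', 'dataT.txt', 'output.txt').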

Write dictionary values (list) to output file - Python

I am trying to print values(a list) from a dictionary to the third column of another file that contains the dictionary key in the first column. I would like the list of values to print in the third column of the output file with a space separating each value. I know my problem lies somewhere in the fact that Python can't write things that aren't strings and that the list is separated by a "," but I am new to programming and am not sure how to accomplish this - any help is much appreciated, thanks!
The GtfFile.txt is a 10 column file (sep = '\t') which I generate the dictionary from, using the Term (functional category) as the key and the Gene names as the values. Several genes have more than one Term attributed to them and are repeated as new lines for each term. There are varying numbers of genes associated with each Term as well, and thus I generate a list as the value for each Term. THIS PART OF MY SCRIPT APPEARS TO BE WORKING AS I WOULD LIKE IT TO!
The FuncEnr_terms.txt is a 2 column file (sep ='\t') which consists of a Term in the first column and a description of the term in the 2 column. My desired output file would be to duplicate this file with a third column that contains the Genes associated with the Term separated by a space. WRITING THIS TO THE OUTPUT FILE IS WHERE MY PROBLEM LIES.
Below is my code:
#!/usr/bin/env python
import sys
from collections import defaultdict
if len(sys.argv) != 4 :
    print("Usage: GeneSetFileGen.py <GtfFile.txt> <FuncEnr_terms.txt> <OutputFile.txt>")
    sys.exit(0)
OutFileName = sys.argv[3]
OutFile = open(OutFileName, 'w')
TermGeneDic = defaultdict(list)
with open(sys.argv[1], 'r') as f :
    for line in f :
        line = line.strip()
        line = line.split('\t')
        Term = line[8]
        Gene = line[0]
        TermGeneDic[Term].append(Gene)
#write output file
with open(sys.argv[2], 'r') as f :
    for line in f :
        line = line.strip()
        Term, Des = line.split('\t')
        OutFile.write(Term + '\t' + Des + '\t' + str(TermGeneDic[Term]) + '\n')
OutFile.close
If I understand what you require correctly then what you need is to replace this expression:
str(TermGeneDic[Term])
with something like:
" ".join(TermGeneDic[Term])
A couple of pointers on your code: your code will be incomprehensible to anyone else if you don't follow PEP 8 conventions fairly closely. This means no CamelCase except for class names.
Secondly, reusing a variable is generally bad, and a sign that you should just chain those method calls. It's especially bad when you have a variable like line whose type you actually change.
Thirdly, brackets (parentheses) are mandatory for calling a method or function: OutFile.close without them does not close the file.
Fourthly, you join the elements of a list into a string with '\t'.join(termgenes[term]).
Finally, use templating to generate long strings; it ends up being easier to work with.
Your code should look like:
import sys
from collections import defaultdict
if len(sys.argv) != 4 :
    print("Usage: GeneSetFileGen.py <GtfFile.txt> <FuncEnr_terms.txt> <OutputFile.txt>")
    sys.exit(0)
progname,gtffilename,funcencrfilename,outfilename = sys.argv
termgenes = defaultdict(list)
with open(gtffilename, 'r') as gtf :
    for line in gtf:
        linefields = line.strip().split('\t')
        term, gene = linefields[8],linefields[0]
        termgenes[term].append(gene)
#write output file
with open(funcencrfilename, 'r') as funcencrfile, open(outfilename, 'w') as outfile:
    for line in funcencrfile:
        term, des = line.strip().split('\t')
        # the % operator needs its arguments as one tuple
        outfile.write('%s\t%s\t%s\n' % (term, des, '\t'.join(termgenes[term])))
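To see why str() on the list gave the unwanted output, and what join() and %-templating produce instead, here is a small standalone sketch. The gene names, term and description are hypothetical values chosen only for illustration.

```python
genes = ["geneA", "geneB", "geneC"]  # hypothetical gene names

as_repr = str(genes)       # list repr: brackets, quotes and commas included
as_text = " ".join(genes)  # just the names, separated by single spaces

# %-templating takes its values as one tuple
row = "%s\t%s\t%s\n" % ("term1", "a description", as_text)
```

Here as_repr is the bracketed form the original script wrote to the file, while as_text is the space-separated form the question asks for in the third column.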
