Output content in CSV format python - python

My script processes a few files located in different paths, and I want to write the results to a CSV file in Python.
For example:
%> script_name <file_name>
In every file, I have different options to be checked.
For example : file1:
Best_friend : Riky
Mutual_friend : Anuj
Family_friend : Jamie
For example : file2:
Best_friend : Anjelina
Mutual_friend : Mythe
For example : file3:
Best_friend : Mahira
Mutual_friend : Shyna
Dear_frind : Kisty
I want to create CSV in the format
File,Best_friend, Mutual_friend
File1,Riky,Anuj
File2,Anjelina,Mythe
File3,Mahira,shyna
Please help

Well, there are several parts to your question.
You want to be passed several files, read some values from each of them, then write those values to a CSV file.
It helps if you decompose your problem into several successive steps.
First, you need to know how to read the best and mutual friend in a given file. You can do that in a function:
def get_best_mutual(filename):
    # some code
    return (best_friend, mutual_friend)
Then, you can just iterate over all your files to write the values while you collect them:
for filename in list_of_filenames:
    best_friend, mutual_friend = get_best_mutual(filename)
    # write filename, best_friend, mutual_friend in output file
Writing into the file should be easy, I'll not go into the details.
The problem might be to actually get the values from the input files.
When you read a text file, you typically read it line by line. Then you can just look at your line to decide what to do: if it defines either best or mutual friend, save the definition, otherwise do nothing.
Concretely, it might look like:
def get_best_mutual(filename):
    for line in open(filename): # read each line of the file
        key, value = line.split(':', 1) # split the line along the first :
        if key.startswith('Best'):
            best_friend = value
        if key.startswith('Mutual'):
            mutual_friend = value
    return (best_friend, mutual_friend)
Obviously, you'd have to make the code a bit more robust, for example in case a line doesn't have a ':' in it. You might also notice that the value starts with a space and ends with a '\n': you can use value.strip() to fix that. The same goes for the key: if a line starts with a space, the code above will not recognize it.
You also need to decide what to do if a file doesn't have a best_friend, for example.
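Putting those pieces together, a minimal sketch could look like the one below. The names parse_friends and write_friends_csv, and the choice of an empty string for a key that is missing from a file, are my own assumptions and not something given in the question.

```python
import csv

WANTED = ("Best_friend", "Mutual_friend")  # the two columns asked for


def parse_friends(lines):
    """Return a dict with the wanted keys; a missing key maps to ''."""
    found = dict.fromkeys(WANTED, "")
    for line in lines:
        key, sep, value = line.partition(":")
        if not sep:  # no ':' on this line, so skip it
            continue
        key = key.strip()
        if key in found:
            found[key] = value.strip()
    return found


def write_friends_csv(filenames, out):
    """Write one CSV row per input file to the open file object `out`."""
    writer = csv.writer(out)
    writer.writerow(["File", *WANTED])
    for filename in filenames:
        with open(filename) as handle:
            friends = parse_friends(handle)
        writer.writerow([filename, *(friends[key] for key in WANTED)])
```

To use it from the command line you would open the output file with open(name, 'w', newline='') and pass sys.argv[1:] as the list of file names.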

Using csv.DictReader/csv.DictWriter is a more convenient way of handling CSV files.
Hope this will solve your problem:
import sys
import csv
def create_csv(files):
    headers = ['File', 'Best Friend', 'Mutual Friend']
    list1 = []
    for file in files:
        with open(file, 'r') as file_obj:
            dict_temp = {}
            dict_temp['File'] = file
            for line in file_obj:
                key = line.split(':')[0].strip()
                if key == 'Best_friend':
                    dict_temp['Best Friend'] = line.split(':')[1].strip()
                if key == 'Mutual_friend':
                    dict_temp['Mutual Friend'] = line.split(':')[1].strip()
            list1.append(dict_temp)
    print(list1)
    # newline='' lets the csv module control line endings (Python 3)
    with open('result.csv', 'w', newline='') as csv_result:
        writer = csv.DictWriter(csv_result, delimiter=',', fieldnames=headers)
        writer.writeheader()
        for entry in list1:
            writer.writerow(entry)
if __name__ == "__main__":
    create_csv(sys.argv[1:])
You can add/remove the columns in csv just by adding it in dictionary with appropriate key.
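As a hedged illustration of how DictWriter treats missing and extra keys: restval fills in columns a row lacks, and extrasaction='ignore' drops keys that are not in fieldnames (the default is to raise ValueError). The sample rows below are made up from the data in the question, and io.StringIO only stands in for a real file.

```python
import csv
import io

headers = ["File", "Best Friend", "Mutual Friend"]
rows = [
    # "Family Friend" is not a header, so it is dropped by extrasaction="ignore"
    {"File": "file1", "Best Friend": "Riky", "Mutual Friend": "Anuj", "Family Friend": "Jamie"},
    # no mutual friend here, so restval fills the column
    {"File": "file2", "Best Friend": "Anjelina"},
]

buffer = io.StringIO()  # stands in for open('result.csv', 'w', newline='')
writer = csv.DictWriter(buffer, fieldnames=headers, restval="", extrasaction="ignore")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```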

Related

How to skip used line or continue until an unused line when writing to a csv file in python?

I am trying to write information to a csv file. It is inside a for loop that will change the data. Basically what I want it to do is get data put it in a csv file and then it will loop get new data and put it in the same csv file, but it needs to continue where it left off from the last set of data and not replace it.
Here is what I have:
for i in range(0, 11):
    #Calling all the above functions
    soc_auth_requests()
    create_account()
    config_admin_create()
    account_user_create()
    account_activate()
    account_config_DNS_create()
    #Creating the dictionary for the CSV file with the data fields made and modified from before
    #It is necessary this be done after the above method calls to ensure the data field values are correct
    data = {
        'Account_Name': acc_name,
        'Account_Id': acc_id,
        'User_Email': user_email,
        'User_id': user_id
    }
    #Creating a csv file and writing the dictionary titled "data" to it
    outfile = open('Accounts_Details.csv', 'w')
    while('Accounts_Details.csv'.isspace() == False):
        outfile.readline()
    for key, value in sorted(data.items()):
        outfile.write('\t' + str(value) + '\n')
I can't give more than this, But I can confirm it is parsing 10 times and the data is there. How do I skip used lines or just continue where I left off?
Note: It is important that each bit of information is on a newline. Ex:
id1
name1
id2
name2
.
.
.
Note: I have tried many different threads and none of them apply to my situation nor explain enough for me to match them up, so I have been exploring options with no luck.
You can use the append 'a' modifier when you open a file to write where you left off. For example:
outfile = open('Accounts_Details.csv', 'a')
for key, value in sorted(data.items()):
    outfile.write('\t' + str(value) + '\n')
Open the file outside of the for loop, so that it is opened (and truncated) only once:
with open(filepath, 'w') as f:
    for i in range(something):
        d = create_dict()
        for a, b in d.items():
            f.write('{}\t{}\n'.format(a, b))
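Here is a small sketch of both suggestions, assuming each value should end up on its own line as the question asks. The helper write_records and the sample records are hypothetical; the real data would come from the account functions in the question.

```python
import os
import tempfile


def write_records(path, records, mode="w"):
    """Write every value of every record on its own line."""
    with open(path, mode) as outfile:  # opened once, outside the loop
        for record in records:
            for _key, value in sorted(record.items()):
                outfile.write(str(value) + "\n")


# made-up sample data standing in for the real account details
records = [
    {"Account_Id": 1, "Account_Name": "name1"},
    {"Account_Id": 2, "Account_Name": "name2"},
]

path = os.path.join(tempfile.mkdtemp(), "Accounts_Details.csv")
write_records(path, records[:1])            # 'w' starts from an empty file
write_records(path, records[1:], mode="a")  # 'a' continues after the last line

with open(path) as handle:
    written = handle.read().splitlines()
```

Opening with 'w' inside the loop is what wiped the earlier data: every open in that mode truncates the file.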

Parsing a text file with line breaks in python

I have a text file with about 20 entries. They look like this:
~
England
Link: http://imgur.com/foobar.jpg
Capital: London
~
Iceland
Link: http://imgur.com/foobar2.jpg
Capital: Reykjavik
...
etc.
I would like to take these entries and turn them into a CSV.
There is a '~' separating each entry. I'm scratching my head trying to figure out how to go thru line by line and create the CSV values for each country. Can anyone give me a clue on how to go about this?
Use the libraries, Luke :)
I'm assuming your data is well formatted. Most real world data isn't that way. So, here goes a solution.
>>> content.split('~')
['\nEngland\nLink: http://imgur.com/foobar.jpg\nCapital: London\n', '\nIceland\nLink: http://imgur.com/foobar2.jpg\nCapital: Reykjavik\n', '\nEngland\nLink: http://imgur.com/foobar.jpg\nCapital: London\n', '\nIceland\nLink: http://imgur.com/foobar2.jpg\nCapital: Reykjavik\n']
For writing the CSV, Python has standard library functions.
>>> import csv
>>> entries = [entry for entry in content.split('~') if entry.strip()]
>>> csvfile = open('foo.csv', 'w', newline='')  # on Python 2 use open('foo.csv', 'wb')
>>> fieldnames = ['Country', 'Link', 'Capital']
>>> writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
>>> for entry in entries:
...     cols = entry.strip().splitlines()
...     _ = writer.writerow({'Country': cols[0], 'Link': cols[1].split(': ')[1], 'Capital': cols[2].split(': ')[1]})
...
>>> csvfile.close()
If your data is more semi structured or badly formatted, consider using a library like PyParsing.
Edit:
Second column contains URLs, so we need to handle the splits well.
>>> cols[1]
'Link: http://imgur.com/foobar2.jpg'
>>> cols[1].split(':')[1]
' http'
>>> cols[1].split(': ')[1]
'http://imgur.com/foobar2.jpg'
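An alternative that sidesteps the URL problem is str.partition, which splits on the first separator only. The sketch below assumes every entry starts with the country name followed by "Key: value" lines, and that no value itself contains a '~'; parse_entries is a hypothetical helper name.

```python
import csv
import io

content = """~
England
Link: http://imgur.com/foobar.jpg
Capital: London
~
Iceland
Link: http://imgur.com/foobar2.jpg
Capital: Reykjavik
"""


def parse_entries(text):
    """Yield one dict per '~'-separated entry."""
    for chunk in text.split("~"):
        lines = chunk.strip().splitlines()
        if not lines:  # the text before the first '~' is empty
            continue
        entry = {"Country": lines[0]}
        for line in lines[1:]:
            # partition splits on the first ': ' only, so the URL stays whole
            key, _sep, value = line.partition(": ")
            entry[key] = value
        yield entry


buffer = io.StringIO()  # stands in for open('foo.csv', 'w', newline='')
writer = csv.DictWriter(buffer, fieldnames=["Country", "Link", "Capital"])
writer.writeheader()
writer.writerows(parse_entries(content))
```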
The way I would do it is to open the input file for reading with open():
f = open('NameOfFile.extensionType', 'r')
Here "r" is read mode. Use "a" (append) when new data should be added to the end of an existing file without overwriting it, and "w" to start from an empty file; both create the file if it does not exist. A "+" after the letter opens the file for both reading and writing.
After that I would use a for loop like this:
data = []
tmp = []
for line in f:
    line = line.strip() # strip() returns a new string without the trailing newline
    if line == '~':
        if tmp: # skip the empty entry before the first '~'
            data.append(tmp)
        tmp = []
    elif line:
        tmp.append(line)
if tmp: # the last entry is not followed by a '~'
    data.append(tmp)
Now you have all of the data stored in a list, but you could also reformat it as a class object using a slightly different algorithm.
I have never edited CSV files using python, but I believe you can use a loop like this to add the data:
f2 = open('CSVfileName.csv', 'w') #Can change "w" for other needs i.e "a+"
for entry in data:
    for subentry in entry:
        f2.write(str(subentry) + '\n') #Use '\n' to create a new line
From my knowledge of CSV that loop would create a single column of all of the data. At the end remember to close the files in order to save the changes:
f.close()
f2.close()
You could combine the two loops into one in order to save space, but for the sake of explanation I have not.

Using Python to Merge Single Line .dat Files into one .csv file

I am a beginner in the programming world and would like some tips on how to solve a challenge.
Right now I have ~10 000 .dat files each with a single line following this structure:
Attribute1=Value&Attribute2=Value&Attribute3=Value...AttributeN=Value
I have been trying to use python and the CSV library to convert these .dat files into a single .csv file.
So far I was able to write something that would read all files, store the contents of each file in a new line and substitute the "&" to "," but since the Attribute1,Attribute2...AttributeN are exactly the same for every file, I would like to make them into column headers and remove them from every other line.
Any tips on how to go about that?
Thank you!
Since you are a beginner, I prepared some code that works, and is at the same time very easy to understand.
I assume that you have all the files in the folder called 'input'. The code beneath should be in a script file next to the folder.
Keep in mind that this code should be used to understand how a problem like this can be solved. Optimisations and sanity checks have been left out intentionally.
You might want to check additionally what happens when a value is missing in some line, what happens when an attribute is missing, what happens with a corrupted input etc.. :)
Good luck!
import os
# this function splits the attribute=value pairs into two lists
# the first list holds all the attributes
# the second list holds all the values
def getAttributesAndValues(line):
    attributes = []
    values = []
    # first we split the input over the &
    attributeValues = line.split('&')
    for attrVal in attributeValues:
        # we split the attribute=value over the '=' sign
        # the attribute goes to split[0], the value goes to split[1]
        split = attrVal.split('=')
        attributes.append(split[0])
        values.append(split[1])
    # return the attributes list and values list
    return attributes, values
# test the function using the lines beneath so you understand how it works
# line = "Attribute1=Value&Attribute2=Value&Attribute3=Value&AttributeN=Value"
# print(getAttributesAndValues(line))
# this function appends the values of a single input file to the output file
def writeToCsv(inFile='', wfile="outFile.csv", delim=","):
    # the header is needed only if the output file is missing or still empty
    needs_header = not os.path.exists(wfile) or os.path.getsize(wfile) == 0
    with open(inFile, 'r') as f_in, open(wfile, 'a') as f_out:
        # loop through every line in the input file and write its values
        for line in f_in:
            line = line.strip()
            if not line:
                continue # skip blank lines
            header, values = getAttributesAndValues(line)
            if needs_header:
                f_out.write(delim.join(header) + "\n")
                needs_header = False
            f_out.write(delim.join(values) + "\n")
# Read all the .dat files in the input folder
allInputFiles = [name for name in sorted(os.listdir('input')) if name.endswith('.dat')]
# loop through all the files and write values to the csv file
for singleFile in allInputFiles:
    writeToCsv(os.path.join('input', singleFile))
but since the Attribute1,Attribute2...AttributeN are exactly the same
for every file, I would like to make them into column headers and
remove them from every other line.
line = 'Attribute1=Value1&Attribute2=Value2&Attribute3=Value3'
Once, for the first file:
','.join(k for (k,v) in map(lambda s: s.split('='), line.split('&')))
For each file's content:
','.join(v for (k,v) in map(lambda s: s.split('='), line.split('&')))
You may also need to strip the strings; I don't know how clean your input is.
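Since the line looks like a URL query string, the standard library's urllib.parse.parse_qsl may also be worth a try. This is only a sketch under that assumption: parse_qsl decodes percent-escapes and turns '+' into a space, which is wrong if your values contain those characters literally.

```python
from urllib.parse import parse_qsl

line = "Attribute1=Value1&Attribute2=Value2&Attribute3=Value3"

# keep_blank_values=True keeps pairs such as "Attribute4=" instead of dropping them
pairs = parse_qsl(line, keep_blank_values=True)

header = ",".join(key for key, _value in pairs)    # once, for the first file
values = ",".join(value for _key, value in pairs)  # for every file
```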
Put the dat files in a folder called myDats and put this script next to the myDats folder. It writes output.csv next to the script. [That is, you will have output.csv, myDats, and mergeDats.py in the same folder]
mergeDats.py
import csv
import os
rows = []
for name in sorted(os.listdir('myDats')):
    with open(os.path.join('myDats', name), 'r') as f:
        line = f.readline().strip()
    # each pair looks like Attribute=Value; split on the first '=' only
    rows.append(dict(pair.split('=', 1) for pair in line.split('&')))
# in Python 2.x open with 'wb' and drop the newline argument
with open('output.csv', 'w', newline='') as output:
    # the attribute names of the first file become the column headers
    w = csv.DictWriter(output, fieldnames=list(rows[0].keys()))
    w.writeheader()
    w.writerows(rows)

Write dictionary values (list) to output file - Python

I am trying to print values(a list) from a dictionary to the third column of another file that contains the dictionary key in the first column. I would like the list of values to print in the third column of the output file with a space separating each value. I know my problem lies somewhere in the fact that Python can't write things that aren't strings and that the list is separated by a "," but I am new to programming and am not sure how to accomplish this - any help is much appreciated, thanks!
The GtfFile.txt is a 10 column file (sep = '\t') which I generate the dictionary from... using the Gene name as the key and the Term (functional category) as the values. Several genes have more than one Term attributed to them and are repeated as new lines for each term. There are varying numbers of genes associated with each Term as well and thus I generate a list as the key for each Term. THIS PART OF MY SCRIPT APPEARS TO BE WORKING AS I WOULD LIKE IT TO!
The FuncEnr_terms.txt is a 2 column file (sep ='\t') which consists of a Term in the first column and a description of the term in the 2 column. My desired output file would be to duplicate this file with a third column that contains the Genes associated with the Term separated by a space. WRITING THIS TO THE OUTPUT FILE IS WHERE MY PROBLEM LIES.
Below is my code:
#!/usr/bin/env python
import sys
from collections import defaultdict
if len(sys.argv) != 4 :
    print("Usage: GeneSetFileGen.py <GtfFile.txt> <FuncEnr_terms.txt> <OutputFile.txt>")
    sys.exit(0)
OutFileName = sys.argv[3]
OutFile = open(OutFileName, 'w')
TermGeneDic = defaultdict(list)
with open(sys.argv[1], 'r') as f :
    for line in f :
        line = line.strip()
        line = line.split('\t')
        Term = line[8]
        Gene = line[0]
        TermGeneDic[Term].append(Gene)
#write output file
with open(sys.argv[2], 'r') as f :
    for line in f :
        line = line.strip()
        Term, Des = line.split('\t')
        OutFile.write(Term + '\t' + Des + '\t' + str(TermGeneDic[Term]) + '\n')
OutFile.close
If I understand what you require correctly then what you need is to replace this expression:
str(TermGeneDic[Term])
with something like:
" ".join(TermGeneDic[Term])
A couple of pointers on your code: it will be harder for others to read if you don't follow the PEP 8 conventions fairly closely. This means no CamelCase except for class names.
Secondly, reusing a variable is generally bad, and a sign that you should just chain those method calls. It's especially bad when you have a variable like line whose type you actually change.
Thirdly, brackets (parentheses) are mandatory for calling a method or function, so OutFile.close does nothing; it has to be OutFile.close().
Fourthly, you join the elements of a list into a space-separated string with ' '.join(termgenes[term])
Finally, use templating to generate long strings - it ends up being easier to work with.
Your code should look like:
import sys
from collections import defaultdict
if len(sys.argv) != 4:
    print("Usage: GeneSetFileGen.py <GtfFile.txt> <FuncEnr_terms.txt> <OutputFile.txt>")
    sys.exit(1)
progname, gtffilename, funcencrfilename, outfilename = sys.argv
termgenes = defaultdict(list)
with open(gtffilename, 'r') as gtf:
    for line in gtf:
        linefields = line.strip().split('\t')
        term, gene = linefields[8], linefields[0]
        termgenes[term].append(gene)
#write output file
with open(funcencrfilename, 'r') as funcencrfile, open(outfilename, 'w') as outfile:
    for line in funcencrfile:
        term, des = line.strip().split('\t')
        # the values after % must be a tuple, one per %s
        outfile.write('%s\t%s\t%s\n' % (term, des, ' '.join(termgenes[term])))
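To see why str() produced the brackets and commas in the first place, here is a tiny sketch comparing it with join. The term and gene names are made up for illustration.

```python
from collections import defaultdict

term_genes = defaultdict(list)
term_genes["GO:0001"].extend(["geneA", "geneB"])  # made-up term and gene names

as_repr = str(term_genes["GO:0001"])       # the list's repr, brackets and all
as_text = " ".join(term_genes["GO:0001"])  # just the genes, space-separated
missing = " ".join(term_genes["GO:9999"])  # unknown term: empty list, empty string
```

Note that join only accepts strings, so a list holding other types would need something like " ".join(map(str, values)).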

Python: Concise / elegant way to reformat a set of text files?

I have written a Python script to process a set of ASCII files within a given dir. I wonder if there is a more concise and/or "pythonesque" way to do it, without losing readability?
Python Code
import os
import fileinput
import glob
import string
indir='./'
outdir='./processed/'
for filename in glob.glob(indir+'*.asc'): # get a list of input ASCII files to be processed
    fin=open(indir+filename,'r') # input file
    fout=open(outdir+filename,'w') # out: processed file
    lines = iter(fileinput.input([indir+filename])) # iterator over all lines in the input file
    fout.write(next(lines)) # just copy the first line (the header) to output
    for line in lines:
        val=iter(string.split(line,' '))
        fout.write('{0:6.2f}'.format(float(val.next()))), # first value in the line has its own format
        for x in val: # iterate over the rest of the numbers in the line
            fout.write('{0:10.6f}'.format(float(val.next()))), # the rest of the values in the line have a different format
        fout.write('\n')
    fin.close()
    fout.close()
An example:
Input:
;;; This line is the header line
-5.0 1.090074154029272 1.0034662411357929 0.87336062116561186 0.78649408279093869 0.65599958665017222 0.4379879132749317 0.26310799350679176 0.087808018565486673
-4.9900000000000002 1.0890770415316042 1.0025480136545413 0.87256100700428996 0.78577373527626004 0.65539842673645277 0.43758616966566649 0.26286647978335914 0.087727357602906453
-4.9800000000000004 1.0880820021223023 1.0016316956763136 0.87176305623792771 0.78505488659611744 0.65479851808106115 0.43718526271594083 0.26262546925502467 0.087646864773454014
-4.9700000000000006 1.0870890372077564 1.0007172884938402 0.87096676998908273 0.78433753775986659 0.65419986152386733 0.4367851929843618 0.26238496225635727 0.087566540188423345
-4.9600000000000009 1.086098148170821 0.99980479337809591 0.87017214936140763 0.78362168975984026 0.65360245789061966 0.4363859610200459 0.26214495911617541 0.087486383957276398
Processed:
;;; This line is the header line
-5.00 1.003466 0.786494 0.437988 0.087808
-4.99 1.002548 0.785774 0.437586 0.087727
-4.98 1.001632 0.785055 0.437185 0.087647
-4.97 1.000717 0.784338 0.436785 0.087567
-4.96 0.999805 0.783622 0.436386 0.087486
Other than a few minor changes, due to how Python has changed through time, this looks fine.
You're mixing two different styles of next(); the old way was it.next() and the new is next(it). You should use the string method split() instead of going through the string module (that module is there mostly for backwards compatibility with Python 1.x). There's no need to go through the almost useless "fileinput" module, since open file handles are also iterators (that module comes from a time before Python's file handles were iterators).
Edit: As @codeape pointed out, glob() returns the full path. Your code would not have worked if indir was something other than "./". I've changed the following to use the correct listdir/os.path.join solution. I'm also more familiar with the "%" string interpolation than string formatting.
Here's how I would write this in more idiomatic modern Python:
import os
indir = './'
outdir = './processed/'
def reformat(fin, fout):
    fout.write(next(fin)) # just copy the first line (the header) to output
    for line in fin:
        fields = line.split() # split on any whitespace, ignoring the trailing newline
        # Make a format string specific to the number of fields
        fmt = '%6.2f' + ('%10.6f' * (len(fields)-1)) + '\n'
        fout.write(fmt % tuple(map(float, fields)))
# get a list of input ASCII files to be processed
basenames = [name for name in os.listdir(indir) if name.endswith('.asc')]
for basename in basenames:
    input_filename = os.path.join(indir, basename)
    output_filename = os.path.join(outdir, basename)
    with open(input_filename, 'r') as fin, open(output_filename, 'w') as fout:
        reformat(fin, fout)
The Zen of Python says "There should be one-- and preferably only one --obvious way to do it". It's interesting how functions which, over the last 10+ years, were "obviously" the right solution no longer are. :)
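One thing worth checking: the inner loop in the question calls val.next() inside "for x in val", so it consumes two values per iteration, and the "Processed" sample indeed shows only every second value after the first column. If that is intended, slicing says so explicitly. The sketch below is my reading of the sample data, and format_row with its every_other flag is a hypothetical helper.

```python
def format_row(line, every_other=True):
    """Format one data line; optionally keep every second value only."""
    fields = [float(field) for field in line.split()]
    rest = fields[1:]
    if every_other:
        rest = rest[1::2]  # 2nd, 4th, 6th ... value after the first column
    return "%6.2f" % fields[0] + "".join("%10.6f" % value for value in rest)


# first five numbers of the first data line in the question
sample = "-5.0 1.090074154029272 1.0034662411357929 0.87336062116561186 0.78649408279093869"
thinned = format_row(sample)
full = format_row(sample, every_other=False)
```

With every_other=False the function writes all columns, which is what a plain loop over the fields does.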
fin=open(indir+filename,'r') # input file
fout=open(outdir+filename,'w') # out: processed file
#code
fin.close()
fout.close()
can be written as:
with open(indir+filename,'r') as fin, open(outdir+filename,'w') as fout:
    #code
In python 2.6, you can use:
with open(indir+filename,'r') as fin:
    with open(outdir+filename,'w') as fout:
        #code
And the line
lines = iter(fileinput.input([indir+filename]))
is unnecessary. You can just iterate over an open file (fin in your case).
You can also do line.split(' ') instead of string.split(line, ' ')
If you change those things, there is no need to import string and fileinput.
Edit: I didn't know you can use inline code. That's cool
In my build script, I have this code:
inFile = open(sourceFile,'r')
outFile = open(targetFile,'w')
for line in inFile:
    line = doKeywordSubstitution(line)
    outFile.write(line)
inFile.close()
outFile.close()
I don't know of a way to make this any more concise. Putting the line-changing logic in a different function looks neater to me though.
I may be missing the point of your code, but I don't understand why you have lines = iter(fileinput.input([indir+filename])).
I don't understand why do you use: string.split(line, ' ') instead of just line.split(' ').
Well maybe I would write the string-processing part like this:
values = line.split() # split on whitespace; this also drops the trailing newline
values[0] = '{0:6.2f}'.format(float(values[0]))
values[1:] = ['{0:10.6f}'.format(float(v)) for v in values[1:]]
fout.write(' '.join(values) + '\n')
At least for me this looks better but this might be subjective :)
Instead of indir I would use os.curdir. Instead of "./processed" I would do: os.path.join(os.curdir, 'processed').
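If you are on Python 3, pathlib is another way to pair each input file with its output path without concatenating strings. This is only a sketch: output_paths is a hypothetical helper, and the temporary directory with its two files exists purely to make the example runnable.

```python
import tempfile
from pathlib import Path


def output_paths(indir, outdir, pattern="*.asc"):
    """Map every matching input file to a same-named file in outdir."""
    indir, outdir = Path(indir), Path(outdir)
    return {path: outdir / path.name for path in sorted(indir.glob(pattern))}


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.asc").write_text("header\n")
    (Path(tmp) / "notes.txt").write_text("ignored\n")  # does not match *.asc
    pairs = output_paths(tmp, Path(tmp) / "processed")
    names = [(src.name, dst.parent.name) for src, dst in pairs.items()]
```

Path.glob yields full paths, so there is no need to prepend indir again, which is the bug the glob.glob version of the question runs into.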
