Extract text files into multiple columns in Python

I have several text files and I want to extract the values from them into a CSV file.
Each file has the following format:
main cost: 30
additional cost: 5
I managed to do that, but the problem is that I want the values of each file inserted into a different column. I also want the number of text files to be a user argument.
This is what I'm doing now:
import csv, itertools, sys

numFiles = int(sys.argv[1])  # sys.argv values are strings, so convert
d = [[] for x in xrange(numFiles + 1)]
for i in range(numFiles):
    filename = 'mytext' + str(i) + '.text'
    with open(filename, 'r') as in_file:
        for line in in_file:
            items = line.split(' : ')
            num = items[1].split('\n')
            if i == 0:
                d[i].append(items[0])
            d[i+1].append(num[0])
    grouped = itertools.izip(*d[i] * 1)
    if i == 0:
        grouped1 = itertools.izip(*d[i+1] * 1)
with open(outFilename, 'w') as out_file:
    writer = csv.writer(out_file)
    for j in range(numFiles):
        for val in itertools.izip(d[j]):
            writer.writerow(val)
This is what I'm getting now: everything in one column.
main cost
additional cost
30
5
40
10
And I want it to be:
main cost | 30 | 40
additional cost | 5 | 10

You could use a dictionary to do this, where the key will be the "header" you want to use and the value will be a list.
So it would look like someDict = {'main cost': [30, 40], 'additional cost': [5, 10]}.
You can build the dictionary and iterate over it like this:
from collections import OrderedDict

in_file = ['main cost : 30', 'additional cost : 5', 'main cost : 40', 'additional cost : 10']
someDict = OrderedDict()
for line in in_file:
    key, val = line.split(' : ')
    num = int(val)
    if key not in someDict:
        someDict[key] = []
    someDict[key].append(num)

for key in someDict:
    print(key)
    for value in someDict[key]:
        print(value)
The code outputs:
main cost
30
40
additional cost
5
10
Should be pretty straightforward to modify the example to fit your desired output.
I used the example at "append multiple values for one key in Python dictionary", and thanks to @wwii for some suggestions.
I used an OrderedDict since a dictionary won't keep keys in order.
You can run my example at https://ideone.com/myN2ge
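For instance, here is a minimal sketch of that modification (my addition, assuming the csv module and a hypothetical output file out.csv), writing one row per key so each file's value lands in its own column:

import csv
from collections import OrderedDict

# The same structure the loop above builds, written out literally here.
someDict = OrderedDict([('main cost', [30, 40]), ('additional cost', [5, 10])])

with open('out.csv', 'w') as out_file:
    writer = csv.writer(out_file)
    for key, values in someDict.items():
        # One row per key: the header first, then one column per value.
        writer.writerow([key] + values)

This writes "main cost,30,40" and "additional cost,5,10", which matches the layout the question asks for.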

This is how I might do it. It assumes the fields are the same in all the files. Make a list of names, and a dictionary using those field names as keys, with the list of values as the entries. Instead of running on file1.text, file2.text, etc., run the script with file*.text as a command line argument.
#! /usr/bin/env python
import sys

if len(sys.argv) < 2:
    print "Give file names to process, with wildcards"
else:
    FileList = sys.argv[1:]
    FileNum = 0
    outFilename = "myoutput.dat"
    NameList = []
    ValueDict = {}
    for InfileName in FileList:
        Infile = open(InfileName, 'rU')
        for Line in Infile:
            Line = Line.strip('\n')
            Name, Value = Line.split(":")
            if FileNum == 0:
                NameList.append(Name.strip())
            ValueDict[Name] = ValueDict.get(Name, []) + [Value.strip()]
        FileNum += 1  # the last statement in the file loop
        Infile.close()
    # print NameList
    # print ValueDict
    with open(outFilename, 'w') as out_file:
        for N in NameList:
            OutString = "{},{}\n".format(N, ",".join(ValueDict.get(N)))
            out_file.write(OutString)
Output for my four fake files was:
main cost,10,10,40,10
additional cost,25.6,25.6,55.6,25.6
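One caveat: this relies on the shell expanding file*.text before the script sees it. If your shell doesn't do that (e.g. Windows cmd), a small sketch of a workaround (my addition, reusing the FileList name from the script above) is to expand the patterns yourself with the glob module:

import glob
import sys

# Expand each command-line pattern ourselves, so 'file*.text' works even
# when the shell passes it through literally instead of expanding it.
FileList = []
for pattern in sys.argv[1:]:
    FileList.extend(glob.glob(pattern))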

Related

Writing out a list of phrases to a csv file

Following on from an earlier post, I have written some Python code to calculate the frequency of occurrences of certain phrases (contained in the word_list variable, with three examples listed, but there will be many more) in a large number of text files. The code I've written below requires me to take each element of the list and insert it into a string for comparison to each text file. However, the current code only writes the frequencies for the last phrase in the list, rather than all of them, to the relevant columns in a spreadsheet. Is this just an indent issue, with the writerow not placed in the correct position, or is there a logic flaw in my code? Also, is there any way to avoid the list-to-string assignment when comparing the phrases to those in the text files?
word_list = ['in the event of', 'frankly speaking', 'on the other hand']
S = {}
p = 0
k = 0
with open(file_path, 'w+', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["Fohone-K"] + word_list)
    for filename in glob.glob(os.path.join(path, '*.txt')):
        if filename.endswith('.txt'):
            f = open(filename)
            Fohone-K = filename[8:]
            data = f.read()
            # new code section from scratch file
            l = len(word_list)
            for s in range(l):
                phrase = word_list[s]
                S = data.count((phrase))
                if S:
                    #k = k + 1
                    print("'{}' match".format(Fohone-K), S)
                else:
                    print("'{} no match".format(Fohone-K))
                print("\n")
            # for m in word_list:
            if S >= 0:
                print([Fohone-K] + [S])
                writer.writerow([Fohone-K] + [S])
The output currently has a single count per row (shown as a screenshot in the original post), when it needs one column per phrase, matching the header row.
You probably were going for something like this:
import csv, glob, os

word_list = ['in the event of', 'frankly speaking', 'on the other hand']
file_path = 'out.csv'
path = '.'

with open(file_path, 'w+', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["Fohone-K"] + word_list)
    for filename in glob.glob(os.path.join(path, '*.txt')):
        if filename.endswith('.txt'):
            with open(filename) as f:
                postfix = filename[8:]
                content = f.read()
            matches = [content.count(phrase) for phrase in word_list]
            print(f"'{filename}' {'no ' if all(n == 0 for n in matches) else ''}match")
            writer.writerow([postfix] + matches)
The key problem was that you were writing S on each row, which only ever contained a single count. That's fixed here by writing the full list of matches.

Python: How to read key-value pairs from a CSV file?

I have a CSV file with 3 columns, and I want to read the 1st and 3rd columns as a key-value pair. I am doing it like below, but it's not working:
with open(dirName + fileName) as f:
    for line in f:
        (key, value) = line.split(',')
I'm thinking you want something like:
with open(dirName + fileName) as f:
    for line in f:
        fields = line.split(',')
        assert len(fields) == 3
        (key, _, value) = fields
But maybe glance at the csv module.
Any time you're working with csv files use the csv module.
As @Buckeye14Guy says: you should also use pathlib for path manipulations.
And, for fast lookup, you can store key-value pairs in a dictionary, d.
import csv, pathlib

d = {}
your_path = pathlib.PurePath(dirName).joinpath(filename)
with open(your_path, 'r') as f:
    reader = csv.reader(f)
    for line in reader:
        d[line[0]] = line[2]  # dict entry with key = 1st col and value = 3rd col
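Lookups are then constant-time; for example (the key 'some_key' is hypothetical, any value from the first column works):

# .get() returns None instead of raising KeyError for a missing key.
print(d.get('some_key'))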
Try this:
with open(file, 'r+') as text:
    for line in text.readlines():
        (key, _, value) = line.split(',')  # three columns; the middle one is ignored

Python: Removing dupes from large text file

I need my code to remove duplicate lines from a file; at the moment it just reproduces the input file as output. Can anyone see how to fix this? The for loop is not running as I would have liked.
#!usr/bin/python
import os
import sys

# Read the input file
f = open(sys.argv[1]).readlines()
# Print the number of lines in the input file
print "Total lines in the input file", len(f)
# Temporary dictionary to store the unique records/rows
temp = {}
# Counter to count unique items
count = 0
for i in range(0, 9057, 1):
    if i not in temp:  # if the row is not in the dictionary, i.e. it is unique, store it
        temp[f[i]] = 1
        count += 1
    else:  # if the exact row is there, print the duplicate record and don't store it
        print "Duplicate Records", f[i]
        continue
# Once all the records are read, print how many unique records there are
# (you can print all unique records by printing temp)
print "Unique records", count, len(temp)
#f = open("C://Python27//Vendor Heat Map Test 31072015.csv", 'w')
#print f
#f.close()
nf = open("C://Python34//Unique_Data.csv", "w")
for data in temp.keys():
    nf.write(data)
nf.close()
# Written by Gary O'Neill
# Date 03-08-15
This is a much better way to do what you want:
infile_path = 'infile.csv'
outfile_path = 'outfile.csv'

written_lines = set()

with open(infile_path, 'r') as infile, open(outfile_path, 'w') as outfile:
    for line in infile:
        if line not in written_lines:
            outfile.write(line)
            written_lines.add(line)
        else:
            print "Duplicate record: {}".format(line)

print "{} unique records".format(len(written_lines))
This will read one line at a time, so it works even on large files that don't fit into memory. While it's true that if they're mostly unique lines, written_lines will end up being large anyway, it's better than having two copies of almost every line in memory.
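If memory is still a concern, one variation (my own sketch, not part of the original answer) is to store a fixed-size hash of each line instead of the line itself:

import hashlib

seen = set()
with open('infile.csv', 'rb') as infile, open('outfile.csv', 'wb') as outfile:
    for line in infile:
        # 20 bytes per line (a SHA-1 digest), regardless of line length.
        digest = hashlib.sha1(line).digest()
        if digest not in seen:
            outfile.write(line)
            seen.add(digest)

The trade-off is that you can no longer print the duplicate lines themselves, only skip them.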
You should test for the existence of f[i] in temp, not i. Change the line:
if i not in temp:
with
if f[i] not in temp:

How can I merge two or more text files and remove duplicate email addresses with Python?

I have two text files with space-delimited email addresses: newalias.txt and origalias.txt. Essentially these are email alias mappings I want to merge together, but there are duplicates in the first field. I want to favor the line with a matching first field in newalias.txt and drop the duplicate in origalias.txt. Also, drop exact duplicates.
OrigAlias:
sam@example.com sam.smith@example.root.org
jane@example.com jane.maiden@example.root.org
bob@example.com robert.johnson@example.root.org
NewAlias:
sam@example.com samuel.smith@example.root.org
jane@example.com jane.married@example.root.org
bob@example.com robert.johnson@example.root.org
Results:
sam@example.com samuel.smith@example.root.org
jane@example.com jane.married@example.root.org
bob@example.com robert.johnson@example.root.org
I've been studying Python recently and I've done some interesting things, but text parsing is still a challenge for me. I'm still getting familiar with the options in Python.
Edit
I worked on the problem by myself for a while and came up with this:
# Py 3.4.1
# Instructions:
#   Rename current domain mapping export to dmapsOrig.txt
#   Rename whitespace-delimited customer modifications file to dmapsNew.txt
#   Place the two text files and this script in the same directory
#   Run the script: 'python dmapsMerge.py'
from datetime import date

OrigDict = {}     # Create empty dictionaries for processing
NewAddDict = {}
ResultsDict = {}

with open('dmapsOrig.txt', 'r') as file1:  # Populate OrigDict from dmapsOrig.txt
    for x in file1:
        if not x.startswith("#"):  # Ignore commented lines
            dmaps = x.split()
            OrigDict[dmaps[0]] = ''.join(dmaps[1])

with open('dmapsNew.txt', 'r') as file2:  # Populate NewAddDict from dmapsNew.txt
    for y in file2:
        if not y.startswith("#"):  # Ignore commented lines
            newdmaps = y.split()
            NewAddDict[newdmaps[0]] = ''.join(newdmaps[1])

with open('dmapsOrig-formatted-%s.txt' % date.today(), 'wt') as file3:
    file3.write('## Generated on %s' % date.today() + '\n')  # Insert date stamp
    for alias in sorted(OrigDict.keys()):
        file3.write(alias + ' ' + OrigDict[alias] + '\n')  # Write sorted original input

ResultsDict = OrigDict.copy()   # Copy OrigDict keys and values to ResultsDict
ResultsDict.update(NewAddDict)  # Merge new dmaps into the original

with open('dmapsResults-%s.txt' % date.today(), 'wt') as file4:
    file4.write('## Generated on %s' % date.today() + '\n')  # Insert date stamp
    for alias in sorted(ResultsDict.keys()):
        file4.write(alias + ' ' + ResultsDict[alias] + '\n')  # Write merged results

# No explicit close() calls needed: each 'with' block closes its file.
Assuming that your files are not too big, the simplest solution would be to load origalias.txt in memory, then load newalias.txt (updating existing entries if necessary), and dump the merged data.
aliases = {}

with open("origalias.txt") as f:
    for line in f:
        key, val = line.strip().split(" ")
        aliases[key] = val

with open("newalias.txt") as f:
    for line in f:
        key, val = line.strip().split(" ")
        aliases[key] = val

with open("mergedalias.txt", "w") as f:
    for key, val in aliases.items():
        f.write("{} {}\n".format(key, val))
A few keys to the code above:
Using a dict aliases allows you to prevent duplicates, as setting a new value for a key replaces the old value.
Files are iterable (i.e. usable with for), each iteration applies to one line, which is convenient in your scenario.
.strip() removes leading and trailing whitespace; then .split(" ") cuts the string on spaces, and the two components are assigned to key and val respectively.
Note that if a line contains fewer or more than two space-separated parts, the assignment to key, val will raise an exception. Consider using .split(" ", 1) for more tolerant behaviour.
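For example (a made-up line where the address part itself contains a space):

line = "jane@example.com jane x.married@example.root.org\n"
# maxsplit=1 splits on the first space only, so everything after it,
# spaces included, stays in the second field.
key, val = line.strip().split(" ", 1)
# key == "jane@example.com"
# val == "jane x.married@example.root.org"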
Hope this helps.
# Construct a dictionary from the orig file
original_dict = dict([tuple(i.split(' ')) for i in open('origalias.txt')])
# Update it from the new file (this overwrites the values of duplicate keys)
original_dict.update(dict([tuple(i.split(' ')) for i in open('newalias.txt')]))
# Now write to a new file if you want
fp = open('newfile', 'w')
for key, value in original_dict.iteritems():
    fp.write('%s %s\n' % (key, value.strip()))  # strip the newline each value keeps
fp.close()
with open('origalias.txt') as forig, open('newalias.txt') as fnew, open('results.txt', 'w') as fresult:
    dd = {}
    for fn in (forig, fnew):  # first pass loads the original, second pass overwrites with new
        for ln in fn:
            alias, address = ln.strip().split(' ')  # strip the trailing newline first
            dd[alias] = address
    # just write out every element in the dictionary
    for alias, address in dd.iteritems():
        fresult.write('%s %s\n' % (alias, address))

Python sort text file in dictionary

I have a text file that looks something like this:
John Graham 2
Marcus Bishop 0
Bob Hamilton 1
... and like 20 other names.
Each name appears several times, each time with a different number (score) after it.
I need to make a list that shows each name only once, followed by the sum of that name's scores. I need to use a dictionary.
This is what I have done, but it only produces a list that looks like the original text file:
dict = {}
with open('scores.txt', 'r+') as f:
    data = f.readlines()
    for line in data:
        nameScore = line.split()
        print(nameScore)
I don't know how to do the next part.
Here is one option using defaultdict(int):
from collections import defaultdict

result = defaultdict(int)

with open('scores.txt', 'r') as f:
    for line in f:
        key, value = line.rsplit(' ', 1)
        result[key] += int(value.strip())

print result
If the contents of scores.txt is:
John Graham 2
Marcus Bishop 0
Bob Hamilton 1
John Graham 3
Marcus Bishop 10
it prints:
defaultdict(<type 'int'>, {'Bob Hamilton': 1, 'John Graham': 5, 'Marcus Bishop': 10})
UPD (formatting output):
for key, value in result.iteritems():
    print key, value
My first pass would look like:
scores = {}  # Not `dict`. Don't reuse builtin names.

with open('scores.txt', 'r') as f:  # Not "r+" unless you want to write later
    for line in f:
        name, score = line.strip().rsplit(' ', 1)
        score = int(score)
        if name in scores:
            scores[name] = scores[name] + score
        else:
            scores[name] = score

print scores.items()
This isn't exactly how I'd write it, but I wanted to be explicit enough that you could follow along.
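If you also want the output sorted by name, as the question title hints, a small addition (my sketch, reusing the scores dict from above) would be:

# Print one "name total" line per person, in alphabetical order.
for name in sorted(scores):
    print name, scores[name]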
Use the dictionary's .get() method:
dict = {}
with open('file.txt', 'r+') as f:
    data = f.readlines()
    for line in data:
        nameScore = line.split()
        l = len(nameScore)
        n = " ".join(nameScore[:l-1])
        dict[n] = dict.get(n, 0) + int(nameScore[-1])
print dict
Output:
{'Bob Hamilton': 1, 'John Graham': 2, 'Marcus Bishop': 0}
I was in a similar situation and modified Wesley's code to work for my specific case. I had a mapping file, sort.txt, that consisted of different .pdf filenames and numbers indicating the order I wanted them in, based on output from DOM manipulation of a website. I wanted to combine all these separate PDF files into a single PDF while retaining the same order they have on the website, so I wanted to prefix numbers according to their position in the navigation menu.
1054 spellchecking.pdf
1055 using-macros-in-the-editor.pdf
1056 binding-macros-with-keyboard-shortcuts.pdf
1057 editing-macros.pdf
1058 etc........
Here is the Code I came up with:
import os, sys

# A dict with keys being the old filenames and values being the new filenames
mapping = {}

# Read through the mapping file line-by-line and populate 'mapping'
with open('sort.txt') as mapping_file:
    for line in mapping_file:
        # Split the line along whitespace
        # Note: this fails if your filenames have whitespace
        new_name, old_name = line.split()
        mapping[old_name] = new_name

# List the files in the current directory
for filename in os.listdir('.'):
    root, extension = os.path.splitext(filename)
    # Rename: put the number first to allow sorting by name,
    # then append the original filename and extension
    if filename in mapping:
        print "yay"  # to make coding fun
        os.rename(filename, mapping[filename] + filename + extension)
I didn't have a suffix like _full, so I didn't need that code. Other than that it's the same code. I've never really touched Python, so this was a good learning experience for me.
