Parsing file structure based on a specific pattern


I have a text file with multiple lines that are in the order of name, location, website, then 'END' to indicate the end of one person's profile, then again name, location, website, and so on.
I need to add the name as a key to a dictionary and the rest (location, website) as its values.
So if I have a file:
name1
location1
website1
END
name2
location2
website2
END
name3
location3
website3
END
the outcome would be:
dict = {'name1': ['location1','website1'],
'name2': ['location2', 'website2'],
'name3': ['location3', 'website3']}
edit: the value would be a list, sorry about that
I have no idea how to approach this, can someone point me in the right direction?

First, this question seems to rest on a misconception about the structure of a dictionary or, more generally, of associative containers.
The structure of a dict is, in python-like syntax
{
key : whatever_value1,
another_key: whatever_value2,
# ...
}
Second, if you trim the trailing digit from
name1
location1
website1
you naturally arrive at a struct-like ADT for the END-separated individual entries of that file, namely
class Whatever(object):
    def __init__(self, name, location, website):
        self.name = name
        self.location = location
        self.website = website
(your mileage will vary regarding the name of the class)
What you could use, then, is a Python dict that maps a key (most likely the name attribute of each record) to an instance of that type.
To process the input file, you simply read it line by line until you encounter END, and then add a Whatever instance to the dictionary using (e.g.) its name as the key.
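The record-based approach described above can be sketched as follows. This is only a minimal illustration under assumptions: the class name Profile, the helper parse_profiles, and the in-memory sample text are hypothetical, and every record is assumed to hold exactly three lines before END.

```python
import io


class Profile(object):
    """One END-delimited record (hypothetical name for the 'Whatever' class)."""

    def __init__(self, name, location, website):
        self.name = name
        self.location = location
        self.website = website


def parse_profiles(lines):
    """Collect lines until END, then store a Profile keyed by its name."""
    profiles = {}
    current = []
    for raw in lines:
        line = raw.strip()
        if line == "END":
            if len(current) == 3:  # assumption: name, location, website
                name, location, website = current
                profiles[name] = Profile(name, location, website)
            current = []
        elif line:
            current.append(line)
    return profiles


# Stand-in for an open file; any iterable of lines works.
sample = io.StringIO(
    "name1\nlocation1\nwebsite1\nEND\n"
    "name2\nlocation2\nwebsite2\nEND\n"
)
profiles = parse_profiles(sample)
print(profiles["name1"].location)
```

Passing an open file object instead of the StringIO stand-in should work the same way, since the function only iterates over lines.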

Use the fact that "END" delimits each section: itertools.groupby will split the file on END, and we just need to create the key/value pairs as we iterate over the groupby object.
from itertools import groupby
from collections import OrderedDict

with open("test.txt") as f:
    d = OrderedDict((next(v), list(v))
                    for k, v in groupby(map(str.rstrip, f), key=lambda x: x[:3] != "END") if k)
Output:
OrderedDict([('name1', ['location1', 'website1']),
('name2', ['location2', 'website2']),
('name3', ['location3', 'website3'])])
Or use a regular for loop: change the key each time we hit END, storing the lines for each section in a tmp list:
from collections import OrderedDict

with open("test.txt") as f:
    # itertools.imap for python2
    data = map(str.rstrip, f)
    d, tmp, k = OrderedDict(), [], next(data)
    for line in data:
        if line == "END":
            d[k] = tmp
            k, tmp = next(data, ""), []
        else:
            tmp.append(line)
Output will be the same:
OrderedDict([('name1', ['location1', 'website1']),
('name2', ['location2', 'website2']),
('name3', ['location3', 'website3'])])
Both code examples work for sections of any length, not just three lines.

It has been answered, but you can shorten things by using Python's own dict and list comprehensions:
with open(file, 'r') as f:
    triplets = [data.strip().split('\n') for data in f.read().strip().split('END') if data]
    d = {name: [line, site] for name, line, site in triplets}

You can take a slice of four lines at a time from the file without having to load it all into memory. One way to do this is with islice from itertools.
from itertools import islice

data = dict()
with open('file.path') as infile:
    while True:
        batch = tuple(x.strip() for x in islice(infile, 4))
        if not batch:
            break
        name, location, website, end = batch
        data[name] = (location, website)
Verification:
> from pprint import pprint
> pprint(data)
{'name1': ('location1', 'website1'),
'name2': ('location2', 'website2'),
'name3': ('location3', 'website3')}

If you are guaranteed that you will always get this data in this format, then you could do the following:
profiles = {}
name = None
location = None
website = None
count = 0
with open(file, 'r') as f:  # where file is the file name
    for each in f:
        each = each.strip()  # drop the trailing newline so the 'END' comparison works
        count += 1
        if count == 1:
            name = each
        elif count == 2:
            location = each
        elif count == 3:
            website = each
        elif count == 4 and each == 'END':
            count = 0  # reset the counter for the next profile
            profiles[name] = (location, website)  # stored as a tuple: a key maps to one value, not to value1, value2
        else:
            print("Well, something went amiss %i %s" % (count, each))

Related

Sorting and enumerating imported data from txt file (Python)

guys!
I'm trying to do a movie list, with data imported from a txt file that looks like this:
"Star Wars", "Y"
"Indiana Jones", "N"
"Pulp Fiction", "N"
"Fight Club", "Y"
(with Y = watched, and N = haven't seen yet)
I'm trying to sort the list by name, so that it'll look something like:
1. Fight Club (Watched)
2. Indiana Jones (Have Not Watched Yet)
3. Pulp Fiction (Have Not Watched Yet)
4. Star Wars (Watched)
And this is what I have so far:
def sortAlphabetically():
    movie_list = {}
    with open('movies.txt') as f:
        for line in f:
            movie, watched = line.strip().split(',')
            movie_list[movie.strip()] = watched.strip()
            if watched.strip() == '"N"':
                print(movie.strip() + " (Have Not Watched Yet)")
            if watched.strip() == '"Y"':
                print(movie.strip() + " (Watched)")
I found a tutorial and tried adding this code within the function to sort them:
sortedByKeyDict = sorted(movie_list.items(), key=lambda t: t[0])
return sortedByKeyDict
I also tried using from ast import literal_eval to try and remove the quotation marks and then inserting this in the function:
for k, v in movie_list.items():
    movie_list[literal_eval(k)] = v
But neither worked.
What should I try next?
Is it possible to remove the quotation marks?
And how do I go about enumerating?
Thank you so much in advance!

Here you go, this should do it:
filename = './movies.txt'
watched_mapping = {
    'Y': 'Watched',
    'N': 'Have Not Watched Yet'
}
with open(filename) as f:
    content = f.readlines()
movies = []
for line in content:
    name, watched = line.strip().lstrip('"').rstrip('"').split('", "')
    movies.append({
        'name': name,
        'watched': watched
    })
sorted_movies = sorted(movies, key=lambda k: k['name'])
for i, movie in enumerate(sorted_movies, 1):
    print('{}. {} ({})'.format(
        i,
        movie['name'],
        watched_mapping[movie['watched']],
    ))
First we define watched_mapping which simply maps values in your file to values you want printed.
After that we open the file and read all of its lines into a content list.
Next thing to do is parse that list and extract values from it (from each line we must extract the movie name and whether it has been watched or not). We will save those values into another list of dictionaries, each containing the movie name and whether it has been watched.
That's what name, watched = line.strip().lstrip('"').rstrip('"').split('", "') is for: it strips the quotation marks from each end of the line and then splits the line on the '", "' separator in the middle, returning a clean name and watched value.
Next thing to do is sort the list by name value in each dictionary:
sorted_movies = sorted(movies, key=lambda k: k['name'])
After that we simply enumerate the sorted list (starting at 1) and parse it to print out the desired output (using the watched_mapping to print out sentences instead of simple Y and N).
Output:
1. Fight Club (Watched)
2. Indiana Jones (Have Not Watched Yet)
3. Pulp Fiction (Have Not Watched Yet)
4. Star Wars (Watched)
Case insensitive sorting changes:
sorted_movies = sorted(movies, key=lambda k: k['name'].lower())
Simply sort the movies by the lowercase name instead, so that all names are compared as lowercase when sorting.

Your function with a quick fix:
def sortAlphabetically():
    movie_list = []
    with open('movies.txt') as f:
        for line in f:
            movie, watched = line.strip().split(',')
            movie_list.append({
                'name': movie.strip()[1:-1],
                'watched': watched.strip()[1:-1]
            })
    return sorted(movie_list, key=lambda x: x['name'])

Well, I just modified your code. When you use sorted() on a dictionary's items, you get back a list of tuples. All I have done is build another dictionary from that list of tuples (this relies on dicts preserving insertion order, which is guaranteed from Python 3.7 onwards).
def sortAlphabetically():
    movie_list = dict()
    with open('movies.txt') as f:
        for line in f:
            movie, watched = line.strip().split(',')
            # strip the surrounding quotation marks as well as whitespace
            movie_list[movie.strip().strip('"')] = watched.strip().strip('"')
    movie_sorted = sorted(movie_list.items(), key=lambda kv: kv[0])
    movie_list = dict()
    for key, value in movie_sorted:
        movie_list[key] = value
    i = 1
    for key, value in movie_list.items():
        if value == 'Y':
            print("{}. {} {}".format(i, key, "(Watched)"))
        else:
            print("{}. {} {}".format(i, key, "(Have Not Watched Yet)"))
        i += 1
I intentionally kept the code simple for better understanding. Hope this helps :)
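As for removing the quotation marks, one option not used in the answers above is to let the standard csv module parse the quoted fields. The following is a minimal sketch, assuming the file looks exactly like the sample in the question; the in-memory sample and the names labels and lines are illustrative only.

```python
import csv
import io

# Stand-in for open('movies.txt'); the rows are taken from the question.
sample = io.StringIO(
    '"Star Wars", "Y"\n'
    '"Indiana Jones", "N"\n'
    '"Pulp Fiction", "N"\n'
    '"Fight Club", "Y"\n'
)

labels = {"Y": "Watched", "N": "Have Not Watched Yet"}

# skipinitialspace=True makes the reader ignore the space after each comma,
# so the quote that follows is treated as the start of a quoted field.
rows = [row for row in csv.reader(sample, skipinitialspace=True) if row]

lines = [
    "{}. {} ({})".format(i, name, labels[watched])
    for i, (name, watched) in enumerate(sorted(rows), 1)
]
print("\n".join(lines))
```

Because each row is a [name, watched] list, sorted(rows) orders the movies by name first, and enumerate(..., 1) supplies the numbering.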

CSV selecting multiple columns

I have a CSV file that contains a lot of information. I have written a program that counts the distinct values in the 'Feedback' column and how often each one occurs.
My problem now is that, after listing the values in the 'Feedback' column, I want to pull out the entries from another column that correspond to a given 'Feedback' value.
An example of the CSV file is as follows:
Feedback Description Status
Others Fire Proct Complete
Complaints Grass Complete
Compliment Wall Complete
... ... ...
Having counted the 'Feedback' values, I now want to select one of them, say 'Complaints', and show every Description entry that corresponds to 'Complaints'.
Something like this:
Complaints Grass
Complaints Table
Complaints Door
... ...
Following is the code I have so far:
import csv, sys, os, shutil
from collections import Counter

reader = csv.DictReader(open('data.csv'))
result = {}
for row in reader:
    for column, value in row.iteritems():
        result.setdefault(column,[]).append(value)
list = []
for items in result['Feedback']:
    if items == '':
        items = items
    else:
        newitem = items.upper()
        list.append(newitem)
unique = Counter(list)
for k, v in sorted(unique.items()):
    print k.ljust(30),' : ', v
This is only the part that counts what is inside the 'Feedback' column and how often each value occurs.

You could also store a defaultdict() holding a list of entries for each category as follows:
import csv
from collections import Counter, defaultdict

with open('data.csv', 'rb') as f_csv:
    csv_reader = csv.DictReader(f_csv)
    result = {}
    feedback = defaultdict(list)
    for row in csv_reader:
        for column, value in row.iteritems():
            result.setdefault(column, []).append(value)
        feedback[row['Feedback'].upper()].append(row['Description'])
data = []
for items in result['Feedback']:
    if items == '':
        items = items
    else:
        newitem = items.upper()
        data.append(newitem)
unique = Counter(data)
for k, v in sorted(unique.items()):
    print "{:20} : {:5} {}".format(k, v, ', '.join(feedback[k]))
This would display your output as:
COMPLAINTS : 2 Grass, Door
COMPLIMENT : 2 Wall, Table
OTHERS1 : 1 Fire Proct
Or on multiple lines if instead you used:
print "{:20} : {:5}".format(k, v)
print ' ' + '\n '.join(feedback[k])
When using the csv library, you should open your file with rb in Python 2.x. Also avoid using list as a variable name as this overwrites the Python list() function.
Note: It is easier to use format() when printing aligned data.

You can do it with the code at the very end of this snippet, which is derived from the code in your question. I modified how the file is read by using a with statement, which ensures that it is closed when it's no longer needed. I also renamed the variable you had called list, because that name hides the built-in type and is considered by most to be poor programming practice. See PEP 8 - Style Guide for Python Code for more on this and related topics.
For testing purposes, I also added a couple more rows of 'Complaints' type of 'Feedback' items.
import csv
from collections import Counter

with open('information.csv') as csvfile:
    result = {}
    for row in csv.DictReader(csvfile):
        for column, value in row.iteritems():
            result.setdefault(column, []).append(value)
items = [item.upper() for item in result['Feedback']]
unique = Counter(items)
for k, v in sorted(unique.items()):
    print k.ljust(30), ' : ', v
print
for i, feedback in enumerate(result['Feedback']):
    if feedback == 'Complaints':
        print feedback, ' ', result['Description'][i]
Output:
COMPLAINTS : 3
COMPLIMENT : 1
OTHERS : 1
Complaints Grass
Complaints Table
Complaints Door
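Both answers above are written for Python 2. For readers on Python 3, the same grouping idea could be sketched roughly as below. The sketch assumes a comma-delimited file with a header row; the in-memory sample rows and the names descriptions and selected are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Stand-in for open('data.csv', newline=''); the rows are illustrative only.
sample = io.StringIO(
    "Feedback,Description,Status\n"
    "Others,Fire Proct,Complete\n"
    "Complaints,Grass,Complete\n"
    "Compliment,Wall,Complete\n"
    "Complaints,Table,Complete\n"
    "Complaints,Door,Complete\n"
)

# Map each upper-cased Feedback value to the list of its Description entries.
descriptions = defaultdict(list)
for row in csv.DictReader(sample):
    if row["Feedback"]:  # skip rows with an empty Feedback cell
        descriptions[row["Feedback"].upper()].append(row["Description"])

# Frequency table: the count is simply the length of each list.
for feedback in sorted(descriptions):
    items = descriptions[feedback]
    print("{:20} : {:5} {}".format(feedback, len(items), ", ".join(items)))

# Everything that goes with one selected Feedback value.
selected = descriptions["COMPLAINTS"]
```

Keeping the descriptions in a list per category means the frequency count and the per-category listing come from the same structure, so a separate Counter is not strictly needed here.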

Creating a Dictionary from txt

I have a couple of lines inside a text file, and I want a function that turns the first word of each line into a key (the words are separated by spaces), with the rest following as values.
This is what the text contains:
FFFB10 11290 Charlie
1A9345 37659 Delta
221002 93323 Omega
The idea is to turn the first word into a key, but also to arrange the output visually (one row underneath another), so the first word (FFFB10) is the key and the rest are values, meaning:
Entered: FFFB10
Location: 11290
Name: Charlie
I tried with this as a beginning:
def code(codeenter, file):
    for line in file.splitlines():
        if name in line:
            parts = line.split(' ')
But I don't know how to continue (I erased most of the code). Any suggestions?

This assumes you have already extracted a list of lines without the trailing newline characters:
def MakeDict(lines):
    return {key: (location, name) for key, location, name in (line.split() for line in lines)}
This is an ordinary dictionary comprehension with a generator expression. The former is everything inside the curly braces and the latter is inside the last pair of parentheses. line.split() splits a line using whitespace as the delimiter.
Example run:
>>> data = '''FFFB10 11290 Charlie
... 1A9345 37659 Delta
... 221002 93323 Omega'''
>>> lines = data.split('\n')
>>> lines
['FFFB10 11290 Charlie', '1A9345 37659 Delta', '221002 93323 Omega']
>>> def MakeDict(lines):
...     return {key: (location, name) for key, location, name in (line.split() for line in lines)}
...
>>>
>>> MakeDict(lines)
{'FFFB10': ('11290', 'Charlie'), '1A9345': ('37659', 'Delta'), '221002': ('93323', 'Omega')}
How to format the output:
for key, values in MakeDict(lines).items():
    print("Key: {}\nLocation: {}\nName: {}".format(key, *values))

See ForceBru's answer on how to construct the dictionary. Here's the printing part:
for k, (v1, v2) in your_dict.items():
    print("Entered: {}\nLocation: {}\nName: {}\n".format(k, v1, v2))

You can try this:
f = [i.strip('\n').split() for i in open('filename.txt')]
final_dict = {i[0]:i[1:] for i in f}
Assuming the data is structured like this:
FFFB10 11290 Charlie
1A9345 37659 Delta
221002 93323 Omega
Your output will be:
{'FFFB10': ['11290', 'Charlie'], '221002': ['93323', 'Omega'], '1A9345': ['37659', 'Delta']}

You may want to consider using namedtuple.
from collections import namedtuple
code = {}
Code = namedtuple('Code', 'Entered Location Name')
filename = '/Users/ca_mini/Downloads/junk.txt'
with open(filename, 'r') as f:
    for row in f:
        row = row.split()
        code[row[0]] = Code(*row)
>>> code
{'1A9345': Code(Entered='1A9345', Location='37659', Name='Delta'),
'221002': Code(Entered='221002', Location='93323', Name='Omega'),
'FFFB10': Code(Entered='FFFB10', Location='11290', Name='Charlie')}

Smaller program will print out key from values in dictionary, but stops when incorporated into larger function?

So I have a problem.
I am wanting to do something similar to this, where I call out a value, and it prints out the keys associated with that value. And I can even get it working:
def test(pet):
    dic = {'Dog': ['der Hund', 'der Katze'] , 'Cat' : ['der Katze'] , 'Bird': ['der Vogel']}
    items = dic.items()
    key = dic.keys()
    values = dic.values()
    for x, y in items:
        for item in y:
            if item == pet:
                print x
However, when I incorporate this same code format into a larger program it stops working:
def movie(movie):
    file = open('/Users/Danrex/Desktop/Text.txt' , 'rt')
    read = file.read()
    list = read.split('\n')
    actorList=[]
    for item in list:
        actorList = actorList + [item.split(',')]
    actorDict = dict()
    for item in actorList:
        if item[0] in actorDict:
            actorDict[item[0]].append(item[1])
        else:
            actorDict[item[0]] = [item[1]]
    items = actorDict.items()
    for x, y in items:
        for item in y:
            if item == movie:
                print x
I have print(ed) out actorDict, items, x, y, and item and they all seem to follow the same format as the previous code so I can't figure out why this isn't working! So confused. And, please, when you explain it to me do it as if I am a complete idiot, which I probably am.

Cleaning up the code with some more idiomatic Python will sometimes clarify things. This is how I would write it in Python 2.7:
from collections import defaultdict

def movie(movie):
    actorDict = defaultdict(list)
    movie_info_filename = '/Users/Danrex/Desktop/Text.txt'
    with open(movie_info_filename, 'rt') as fin:
        for line_item in fin:
            split_items = line_item.split(',')
            actorDict[split_items[0]].append(split_items[1])
    for actor, actor_info in actorDict.items():
        for info_item in actor_info:
            if info_item == movie:
                print actor
In this case, what mostly boiled out were temporary objects created for making the actorDict. defaultdict creates a dictionary-like object that allows one to specify a function to generate the default value for a key that isn't currently present. See the collections documentation for more info.
What it looks like you're trying to do is print out some actor value for each time they are listed with a particular movie in your text file.
If you're going to check more than one movie, make the actorDict once and reference your movies against that existing actorDict. This will save you trips to disk.
from collections import defaultdict

def make_actor_dict():
    actorDict = defaultdict(list)
    movie_info_filename = '/Users/Danrex/Desktop/Text.txt'
    with open(movie_info_filename, 'rt') as fin:
        for line_item in fin:
            split_items = line_item.split(',')
            actorDict[split_items[0]].append(split_items[1])
    return actorDict  # without this, main() would receive None

def movie(movie, actorDict):
    for actor, actor_info in actorDict.items():
        for info_item in actor_info:
            if info_item == movie:
                print actor

def main():
    actorDict = make_actor_dict()
    movie('Star Wars', actorDict)
    movie('Indiana Jones', actorDict)
If you only care that the actor was in that movie, you don't have to iterate through the movie list manually, you can just check that movie is in actor_info:
def movie(movie, actorDict):
    for actor in actorDict:
        if movie in actorDict[actor]:
            print actor
Of course, you already figured out that the problem was the movie name not being an exact match to the text you read from the file. If you want to allow less-than-exact matches, you should consider normalizing your movie string and your data strings from the file. The string methods strip() and lower() can be really helpful there.
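The normalization suggested above might look something like the sketch below. It is written for Python 3 and rests on assumptions: each line is taken to be "actor,movie", the helper names normalize and actors_in are hypothetical, and the sample lines are made up to show stray spaces and trailing newlines.

```python
from collections import defaultdict


def normalize(text):
    """Trim surrounding whitespace and ignore case (hypothetical helper)."""
    return text.strip().lower()


def make_actor_dict(lines):
    """Map each actor to the set of normalized movie titles."""
    actor_dict = defaultdict(set)
    for line in lines:
        parts = line.split(',')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        actor_dict[parts[0].strip()].add(normalize(parts[1]))
    return actor_dict


def actors_in(movie, actor_dict):
    """Return the actors listed with the given movie, compared after normalizing."""
    wanted = normalize(movie)
    return sorted(actor for actor, movies in actor_dict.items() if wanted in movies)


# Made-up sample lines; note the trailing newlines and the stray spaces.
sample = [
    "Harrison Ford, Star Wars\n",
    "Harrison Ford,Indiana Jones\n",
    "Mark Hamill,star wars \n",
]
actor_dict = make_actor_dict(sample)
print(actors_in("Star Wars", actor_dict))
```

Normalizing both the stored titles and the query means that "Star Wars", " star wars " and "STAR WARS" all match the same entries.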

Stumped - In Python obtaining unique keys with multiple associated values from a list of dictionaries

I'm parsing a csv file to perform some basic data processing. The file that I am working with is a log of user activity to a website formatted as follows:
User ID, Url, Number of Page Loads, Number of Interactions
User ID and Url are strings, Number of Page Loads and Number of Interactions are integers.
I am attempting to determine which url has the best interaction-to-page ratio.
The part I am struggling with is getting unique values and aggregating the results from the columns.
I've written the following code:
import csv
from collections import defaultdict
fields = ["USER","URL","LOADS","ACT"]
file = csv.DictReader(open('file.csv', 'rU'), delimiter=",",fieldnames=fields)
file.next()
dict = defaultdict(int)
for i in dict:
    dict[i['URL']] += int(i['LOADS'])
This works fine. It returns a list of unique urls with the number of total loads by url in a dictionary - { 'URL A' : 1000 , 'URL B' : 500}
The issue is that when I try to add multiple values to the url key, I'm stumped.
I've tried amending the for loop to do:
for i in dict:
    dict[i['URL']] += int(i['LOADS']), int(i['ACT'])
and I receive TypeError: unsupported operand type(s) for +=: 'int' and 'tuple'. Why is the second value considered a tuple?
I tried adding just int(i['ACT']), and it worked fine. The error only appears when I try both values at the same time.
I'm on python 2.6.7; Any ideas on how to do this and why it's considered a tuple?

You are better off using a list as your defaultdict container:
import csv
from collections import defaultdict

d = defaultdict(list)
fields = ["USER","URL","LOADS","ACT"]
with open('file.csv', 'rU') as the_file:
    rows = csv.DictReader(the_file, delimiter=",",fieldnames=fields)
    rows.next()
    for row in rows:
        data = (int(row['LOADS']),int(row['ACT']))
        d[row['URL']].append(data)
Now you have
d['someurl'] = [(5,17),(7,14)]
Now you can do whatever sums you would like, for example, all the loads for a URL:
load_sums = {k:sum(i[0] for i in d[k]) for k in d}
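Since the original goal was to find the url with the best interaction-to-page ratio, the per-URL list of (loads, interactions) tuples can be reduced one step further. The sketch below uses made-up data in the same shape as d; the names ratios and best_url are illustrative.

```python
# Hypothetical data in the shape built above: URL -> list of (loads, interactions)
d = {
    'URL A': [(5, 17), (7, 14)],
    'URL B': [(10, 5)],
}

ratios = {}
for url, pairs in d.items():
    loads = sum(p[0] for p in pairs)
    acts = sum(p[1] for p in pairs)
    if loads:  # avoid dividing by zero for URLs with no page loads
        ratios[url] = float(acts) / loads

# The URL whose total interactions per page load is highest.
best_url = max(ratios, key=ratios.get)
print(best_url, ratios[best_url])
```

The explicit float() keeps the division from truncating on Python 2, which the question mentions using.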

Because int(i['LOADS']), int(i['ACT']) is a tuple:
>>> 1, 2
(1, 2)
If you want to add both variables at the same time, just add them together:
+= int(i['LOADS']) + int(i['ACT'])
Also, you're shadowing the builtin dict and list types. Use different variable names. You won't be able to use the list builtin once you shadow it:
>>> d = {1: 2, 3: 4}
>>> list(d)
[1, 3]
>>> list = 5
>>> list(d)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'int' object is not callable

It's just when I try both values at the same time.
How do you want to "add" them? As their sum?
for i in list:
    dict[i['URL']] += int(i['LOADS']) + int(i['ACT'])
Also, don't use list and dict as variable names.
import csv

fields = ["USER","URL","LOADS","ACT"]
d = {}
with open('file.csv', 'rU') as f:
    csvr = csv.DictReader(f, delimiter=",",fieldnames=fields)
    csvr.next()
    for rec in csvr:
        d[rec['URL']] = d.get(rec['URL'], 0) + int(rec['LOADS']) + int(rec['ACT'])

You could use an object-oriented approach and define a class to hold the information. It's wordier than most of the other answers, but worth considering.
import csv
from collections import defaultdict

class Info(object):
    def __init__(self, loads=0, acts=0):
        self.loads = loads
        self.acts = acts
    def __add__(self, args):  # add a tuple of values
        self.loads += args[0]
        self.acts += args[1]
        return self
    def __repr__(self):
        # __name__ is the attribute that holds the class name
        return '{}(loads={}, acts={})'.format(self.__class__.__name__,
                                              self.loads, self.acts)

summary = defaultdict(Info)
fields = ["USER", "URL", "LOADS", "ACTS"]
with open('urldata.csv', 'rU') as csv_file:
    reader = csv.DictReader(csv_file, delimiter=",", fieldnames=fields)
    reader.next()  # skip header
    for rec in reader:
        summary[rec['URL']] += (int(rec['LOADS']), int(rec['ACTS']))
for url, info in summary.items():
    print '{{{!r}: ({}, {})}}'.format(url, info.loads, info.acts)

