I'm not very experienced with complicated, large-scale parsing in Python. Do you have any tips or guides on how to easily parse multiple text files with different formats, combine them into a single .csv file, and ultimately enter them into a database?
An example of the text files is as follows:
general.txt (Name -- Department (DEPT) Room # [Age])
John Doe -- Management (MANG) 205 [Age: 40]
Equipment: Laptop, Desktop, Printer, Stapler
Experience: Python, Java, HTML
Description: Hardworking, awesome
Mary Smith -- Public Relations (PR) 605 [Age: 24]
Equipment: Mac, PC
Experience: Social Skills
Description: fun to be around
Scott Lee -- Programmer (PG) 403 [Age: 25]
Equipment: Personal Computer
Experience: HTML, CSS, JS
Description: super-hacker
Susan Kim -- Programmer (PG) 504 [Age: 21]
Equipment: Desktop
Experience: Social Skills
Descriptions: fun to be around
Bob Simon -- Programmer (PG) 101 [Age: 29]
Equipment: Pure Brain Power
Experience: C++, C, Java
Description: never comes out of his room
cars.txt (a list of people who own cars by their department/room #)
Programmer: PG 403, PG 101
Management: MANG 205
house.txt
Programmer: PG 504
The final csv should preferably tabulate to something like:
Name | Division | Division Abbreviation | Equipment | Room | Age | Car? | House? |
Scott Lee Programming PG PC 403 25 YES NO
Mary Smith Public Rel. PR Mac, PC 605 24 NO NO
The ultimate goal is to have a database, where searching "PR" would return every row where a person's Department is "PR," etc. There are maybe 30 text files total, each representing one or more columns in a database. Some columns are short paragraphs, which include commas. Around 10,000 rows total. I know Python has a built-in csv module, but I'm not sure where to start, and how to end with just 1 csv. Any help?
It looks like you're looking for someone who will solve a whole problem for you. Here I am :)
The general idea is to parse the general info into a dict (using regular expressions), then append the additional fields to it, and finally write it to CSV. Here's a Python 3.x solution (I think Python 2.7+ should suffice):
import csv
import re


def read_general(fname):
    # Read general info into a dict with 'PR 123'-like keys
    # Regexp that will split a row into a ready-to-use dict
    re_name = re.compile(r'''
        (?P<Name>.+)
        \ --\               # Separator + space
        (?P<Division>.+)
        \                   # Space
        \(
        (?P<Division_Abbreviation>.*)
        \)
        \                   # Space
        (?P<Id>\d+)
        \                   # Space
        \[Age:\             # Space at the end
        (?P<Age>\d+)
        \]
        ''', re.X)
    general = {}
    with open(fname, 'rt') as f:
        for line in f:
            line = line.strip()
            m = re_name.match(line)
            if m:
                # Name line, start a new person
                man = m.groupdict()
                key = '%s %s' % (m.group('Division_Abbreviation'), m.group('Id'))
                general[key] = man
            elif line:
                # Non-empty lines: add values to the current person's dict
                key, value = line.split(': ', 1)
                man[key] = value
    return general


def add_bool_criteria(fname, field, general):
    # Append a field with a YES/NO value
    with open(fname, 'rt') as f:
        yes_keys = set()
        # Phase one, gather all keys
        for line in f:
            line = line.strip()
            _, keys = line.split(': ', 1)
            yes_keys.update(keys.split(', '))
        # Fill data
        for key, man in general.items():  # iteritems() will be faster in Python 2.x
            man[field] = 'YES' if key in yes_keys else 'NO'


def save_csv(fname, general):
    with open(fname, 'wt') as f:
        # Gather field names
        all_fields = set()
        for value in general.values():
            all_fields.update(value.keys())
        # Write to csv (sorted so the column order is deterministic)
        w = csv.DictWriter(f, sorted(all_fields))
        w.writeheader()
        w.writerows(general.values())


def main():
    general = read_general('general.txt')
    add_bool_criteria('cars.txt', 'Car?', general)
    add_bool_criteria('house.txt', 'House?', general)
    from pprint import pprint
    pprint(general)
    save_csv('result.csv', general)


if __name__ == '__main__':
    main()
I wish you a lot of $$$ for this ;)
Side note
CSV is dated; you could use JSON for storage and further use, because it's simpler to work with, more flexible, and human-readable.
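For example, a minimal sketch of swapping save_csv for a JSON dump of the same structure (save_json is just an illustrative name):

import json

def save_json(fname, general):
    # dump the list of person dicts as indented, human-readable JSON
    with open(fname, 'wt') as f:
        json.dump(list(general.values()), f, indent=2)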
You just have a function which parses one file, and returns a list of dictionaries containing {'name': 'Bob Simon', 'age': 29, ...} etc. Then call this on each of your files, extending a master list. Then write this master list of dicts as a CSV file.
More elaborately:
First you need to parse the input files, you'd have a function which takes a file, and returns a list of "things".
def parse_txt(fname):
    f = open(fname)
    people = []
    # Here, parse f. Maybe using a while loop, and calling
    # f.readline() until there is an empty line. Construct a
    # dictionary from each person's block, and append it to people.
    return people
This returns something like:
people = [
    {'name': 'Bob Simon', 'age': 29},
    {'name': 'Susan Kim', 'age': 21},
]
Then, loop over each of your input files (maybe by using os.listdir, or optparse to get a list of args):
allpeople = []
for curfile in args:
    people = parse_txt(fname=curfile)
    allpeople.extend(people)
So allpeople is a long list of all the people from all files.
Finally, you can write this to a CSV file using the csv module (this bit usually involves another function to reorganise the data into a format more compatible with the writer).
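A minimal sketch of that final step, assuming every person dict may have a slightly different set of keys (write_csv is just an illustrative name):

import csv

def write_csv(fname, allpeople):
    # Collect every field name that appears in any record
    fieldnames = sorted({key for person in allpeople for key in person})
    with open(fname, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames, restval='')
        writer.writeheader()
        writer.writerows(allpeople)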
I'll do it backwards: I'll start by loading each of those files like house.txt and cars.txt into its own dict, which could look like:
cars = {'MANG': [205], 'PG': [403, 101]}
Since you said you have around 30 of them, you could easily use a nested dict without making things too complicated:
data = {'house': {'PG': [504]}, 'cars': {...}}
Once the data dict is complete, load general.txt, and while building the dict for each employee (or whatever they are), do a dict look-up to see whether they have a house, a car, etc.
For example, for John Doe (MANG, room 205) you'd check something like:
if 205 in data['cars'].get('MANG', []):
    # ...
and update his dict accordingly. Obviously you don't have to hard-code all the possible look-ups; just build a list like ['house', 'cars', ...] and iterate over it.
At the end you should have a big list of dicts with all the info merged, so just write each one of them to a csv file.
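A rough sketch of that approach, using the file names from the question (the parsing of general.txt itself is assumed to produce one dict per employee):

def load_rooms(fname):
    # Build e.g. {'PG': [403, 101], 'MANG': [205]} from a cars.txt-style file
    rooms = {}
    with open(fname) as f:
        for line in f:
            if not line.strip():
                continue
            _, ids = line.split(':', 1)
            for item in ids.split(','):
                dept, room = item.split()
                rooms.setdefault(dept, []).append(int(room))
    return rooms

data = {'cars': load_rooms('cars.txt'), 'house': load_rooms('house.txt')}

# Later, while building each employee's dict:
# employee['Car?'] = 'YES' if room in data['cars'].get(dept, []) else 'NO'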
Best possible advice: Don't do that.
Your cars and house relations are, ummmm, interesting. Owning a house or a car is an attribute of a person or other entity (company, partnership, joint tenancy, tenancy in common, etc, etc). It is NOT an attribute of a ("division", room) combination. The first fact in your cars file is "A programmer in room 403 owns a car". What happens in the not unlikely event that there are 2 or more programmers in the same room?
The equipment shouldn't be stored as a comma-separated list in a single column.
Don't record age, record date or year of birth.
You need multiple tables in a database, not 1 CSV file. You need to study a book on elementary database design.
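To illustrate only (these table and column names are made up, not something the question specifies), a normalized layout in SQLite might start out like this:

import sqlite3

conn = sqlite3.connect('staff.db')
conn.executescript('''
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name TEXT,
        department_abbrev TEXT,
        room INTEGER,
        birth_year INTEGER
    );
    CREATE TABLE equipment (
        person_id INTEGER REFERENCES person(person_id),
        item TEXT
    );
    CREATE TABLE car_ownership (
        person_id INTEGER REFERENCES person(person_id)
    );
''')
conn.commit()

Each piece of equipment and each ownership fact then becomes its own row, and "who owns a car" is a question about people rather than about rooms.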
Related
So I'm making a Yu-Gi-Oh database program. I have all the information stored in a large text file. Each monster is categorized in the following way:
|Name|NUM 1|DESC 1|TYPE|LOCATION|STARS|ATK|DEF|DESCRIPTION
Here's an actual example:
|A Feather of the Phoenix|37;29;18|FET;YSDS;CP03|Spell Card}Spell||||Discard 1 card. Select from your Graveyard and return it to the top of your Deck.|
So I made a program that searches this large text file by name and it returns the information from the text file without the '|'. Here it is:
with open('TEXT.txt') as fd:
    input=[x.strip('|').split('|') for x in fd.readlines()]
    to_search={x[0]:x for x in input}
    print('\n'.join(to_search[name]))
Now I'm trying to edit my program so I can search for the name of the monster and choose which attribute I want to display. So it'd appear like
A Feather of the Phoenix
Description:
Discard 1 card. Select from your Graveyard and return it to the top of your Deck.
Any clues as to how I can do this?
First, this is a variant dialect of CSV, and can be parsed with the csv module instead of trying to do it manually. For example:
import csv

with open('TEXT.txt') as fd:
    rows = csv.reader(fd, delimiter='|')
    to_search = {row[1]:row for row in rows}
    print('\n'.join(to_search[name]))
You might also prefer to use DictReader, so each row is a dict (keyed off the names in the header row, or manually-specified column names if you don't have one):
with open('TEXT.txt') as fd:
    rows = csv.DictReader(fd, delimiter='|')
    to_search = {row['Name']:row for row in rows}
    print('\n'.join(to_search[name]))
Then, to select a specific attribute:
with open('TEXT.txt') as fd:
    rows = csv.DictReader(fd, delimiter='|')
    to_search = {row['Name']:row for row in rows}
    print(to_search[name][attribute])
However… I'm not sure this is a good design in the first place. Do you really want to re-read the entire file for each lookup? I think it makes more sense to read it into memory once, into a general-purpose structure that you can use repeatedly. And in fact, you've almost got such a structure:
with open('TEXT.txt') as fd:
    monsters = list(csv.DictReader(fd, delimiter='|'))
    monsters_by_name = {monster['Name']: monster for monster in monsters}
Then you can build additional indexes, like a multi-map of monsters by location, etc., if you need them.
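For instance, a multi-map keyed by location could be built with a defaultdict (a sketch, assuming the 'LOCATION' column name from the header above):

from collections import defaultdict

monsters_by_location = defaultdict(list)
for monster in monsters:
    monsters_by_location[monster['LOCATION']].append(monster)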
All this being said, your original code can almost handle what you want already. to_search[name] is a list. If you just build a map from attribute names to indices, you can do this:
attributes = ['Name', 'NUM 1', 'DESC 1', 'TYPE', 'LOCATION', 'STARS', 'ATK', 'DEF', 'DESCRIPTION']
attributes_by_name = {value: idx for idx, value in enumerate(attributes)}
# ...
with open('TEXT.txt') as fd:
    input=[x.strip('|').split('|') for x in fd.readlines()]
    to_search={x[0]:x for x in input}
    attribute_index = attributes_by_name[attribute]
    print(to_search[name][attribute_index])
You could look at the namedtuple class in collections. You will want to make each entry a namedtuple with your fields as attributes. The namedtuple might look like:
from collections import namedtuple

Card = namedtuple('Card', 'name, number, description, whatever_else')
As shown in the collections documentation, namedtuple and csv work well together:
import csv
for card in map(Card._make, csv.reader(open("cards", "rb"))):
    print card.name, card.description  # format however you want here
The mechanics around search can be very complicated. For example, if you want a really fast search built around an exact match, you could build a dictionary for each attribute you're interested in:
name_map = {card.name: card for card in all_cards}
search_result = name_map[name_you_searched_for]
You could also do a startswith search:
possibles = [card for card in all_cards if card.name.startswith(search_string)]
# Here you need to decide what to do with these possibles. In this example I'm just
# snagging the first one, and I'm not handling the possibility that you don't find
# one; you should.
search_result = possibles[0]
I recommend against trying to search the file itself. That is an extremely complex kind of search to do, and it's typically left to database systems. If you need it, consider switching the application to sqlite or another lightweight database.
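If you do go the sqlite route, here's a small sketch of the idea (the column names are illustrative, not taken from your file):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE cards (name TEXT, description TEXT)')
conn.executemany('INSERT INTO cards VALUES (?, ?)',
                 [(card.name, card.description) for card in all_cards])

# Substring search handled by the database instead of Python loops
rows = conn.execute('SELECT name FROM cards WHERE name LIKE ?',
                    ('%' + search_string + '%',)).fetchall()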
How can I delete a specific entry from a BibTeX file based on a cite key using Python? I basically want a function that takes two arguments (path to the BibTeX file and cite key) and deletes the entry that corresponds to the key from the file. I played around with regular expressions but wasn't successful. I also looked a little for BibTeX parsers, but that seems like overkill. In the skeleton function below, the decisive part is content_modified =.
def deleteEntry(path, key):
    # get content of bibtex file
    f = open(path, 'r')
    content = f.read()
    f.close()

    # delete entry from content string
    content_modified =

    # rewrite file
    f = open(path, 'w')
    f.write(content_modified)
    f.close()
Here is an example bibtex file (with spaces in the abstract):
@article{dai2008thebigfishlittlepond,
title = {The {Big-Fish-Little-Pond} Effect: What Do We Know and Where Do We Go from Here?},
volume = {20},
shorttitle = {The {Big-Fish-Little-Pond} Effect},
url = {http://dx.doi.org/10.1007/s10648-008-9071-x},
doi = {10.1007/s10648-008-9071-x},
abstract = {The big-fish-little-pond effect {(BFLPE)} refers to the theoretical prediction that equally able students will have lower academic
self-concepts in higher-achieving or selective schools or programs than in lower-achieving or less selective schools or programs,
largely due to social comparison based on local norms. While negative consequences of being in a more competitive educational
setting are highlighted by the {BFLPE}, the exact nature of the {BFLPE} has not been closely scrutinized. This article provides
a critique of the {BFLPE} in terms of its conceptualization, methodology, and practical implications. Our main argument is that
of the {BFLPE.}},
number = {3},
journal = {Educational Psychology Review},
author = {Dai, David Yun and Rinn, Anne N.},
year = {2008},
keywords = {education, composition by performance, education, peer effect, education, school context, education, social comparison/big-fish{\textendash}little-pond effect},
pages = {283--317},
file = {Dai_Rinn_2008_The Big-Fish-Little-Pond Effect.pdf:/Users/jpl2136/Documents/Literatur/Dai_Rinn_2008_The Big-Fish-Little-Pond Effect.pdf:application/pdf}
}
@book{coleman1966equality,
title = {Equality of Educational Opportunity},
shorttitle = {Equality of educational opportunity},
publisher = {{U.S.} Dept. of Health, Education, and Welfare, Office of Education},
author = {Coleman, James},
year = {1966},
keywords = {\_task\_obtain, education, school context, soz. Ungleichheit, education}
}
EDIT: Here is a solution that I came up with. It's not based on matching the whole BibTeX entry, but instead looks for all the beginnings like @article{dai2008thebigfishlittlepond, and then removes the corresponding entry by slicing the content string.
content_keys = [(m.group(1), m.start(0)) for m in re.finditer(r"@\w{1,20}\{([\w\d-]+),", content)]
idx = [k[0] for k in content_keys].index(key)
# the entry ends where the next entry starts, or at the end of the file for the last entry
end = content_keys[idx + 1][1] if idx + 1 < len(content_keys) else len(content)
content_modified = content[0:content_keys[idx][1]] + content[end:]
As Beni Cherniavsky-Paskin mentioned in the comments, you will have to rely on the fact that your BibTeX entries start at the beginning of a line and that the closing } is also at the start of a line (without any tabs or spaces). Then you can do this:
pattern = re.compile(r"^@\w+\{" + re.escape(key) + r",.*?^\}", re.S | re.M)
content_modified = re.sub(pattern, "", content)
Note the two modifiers: re.S makes the . match line breaks, and re.M makes ^ match at the start of every line (not just at the start of the string).
If you cannot rely on this, then the BibTeX format is simply not a regular language (since it allows nesting of {}, which has to be counted for correct results). There are regex flavors which might still make this task possible (using recursion or balancing groups), but I think Python supports neither of those features. Hence, you would actually have to use a BibTeX parser (which would also make your code a lot more understandable, I guess).
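Dropped into the skeleton from the question, that approach might look like this (still assuming entries start with @ at the beginning of a line and the closing } sits at the start of a line):

import re

def deleteEntry(path, key):
    with open(path, 'r') as f:
        content = f.read()

    # remove the entry whose cite key matches
    pattern = re.compile(r"^@\w+\{" + re.escape(key) + r",.*?^\}", re.S | re.M)
    content_modified = re.sub(pattern, "", content)

    with open(path, 'w') as f:
        f.write(content_modified)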
I'm using Python, and I have a file which has city names and information such as the coordinates and population of each city:
Youngstown, OH[4110,8065]115436
Yankton, SD[4288,9739]12011
966
Yakima, WA[4660,12051]49826
1513 2410
Worcester, MA[4227,7180]161799
2964 1520 604
Wisconsin Dells, WI[4363,8977]2521
1149 1817 481 595
How can I create a function to take the city name and return a list containing the latitude and longitude of the given city?
fin = open ("miles.dat","r")

def getCoordinates
    cities = []
    for line in fin:
        cities.append(line.rstrip())
        for word in line:
            print line.split()
That's what I've tried so far. How could I get the coordinates of a city by looking it up by name, and how can I work with the words of each line rather than the individual letters?
Any help will be much appreciated, thanks all.
I am feeling generous since you responded to my comment and made an effort to provide more info....
Your code example isn't even runnable right now, but from a purely pseudocode standpoint, you have at least the basic concept of the first part right. Normally I would want to parse out the information using a regex, but I think giving you an answer with a regex is beyond what you already know and won't really help you learn anything at this stage. So I will try and keep this example within the realm of the tools with which you seem to already be familiar.
def getCoordinates(filename):
    '''
    Pass in a filename.
    Return a parsed dictionary in the form of:
        {
            city: [lat, lon]
        }
    '''
    fin = open(filename, "r")
    cities = {}

    for line in fin:
        # this is going to split on the comma, and
        # only once, so you get the city, and the rest
        # of the line
        city, extra = line.split(',', 1)

        # we could do a regex, but again, I dont think
        # you know what a regex is and you seem to already
        # understand split. so lets just stick with that

        # this splits on the '[' and we take the right side
        part = extra.split('[')[1]

        # now take the remaining string and split off the left
        # of the ']'
        part = part.split(']')[0]

        # we end up with something like: '4660,12051'
        # so split that string on the comma into a list
        latLon = part.split(',')

        # associate the city with the latLon list in the dictionary
        cities[city] = latLon

    return cities
Even though I have provided a full code solution for you, I am hoping that it will be more of a learning experience with the added comments. Eventually you should learn to do this using the re module and a regex pattern.
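For reference, the regex version hinted at here could look something like this (just a sketch; the pattern assumes lines shaped like 'City, ST[lat,lon]population', and getCoordinatesRe is an illustrative name):

import re

LINE_RE = re.compile(r'^(.+?),\s*\w+\[(\d+),(\d+)\]\d+')

def getCoordinatesRe(filename):
    cities = {}
    with open(filename) as fin:
        for line in fin:
            m = LINE_RE.match(line)
            if m:
                # group 1 is the city name, groups 2 and 3 are lat/lon
                cities[m.group(1)] = [m.group(2), m.group(3)]
    return cities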
I've made this CSV file up to play with. From what I've been told before, I'm pretty sure this CSV file is valid and can be used in this example.
Basically I have this CSV file 'book_list.csv':
name,author,year
Lord of the Rings: The Fellowship of the Ring,J. R. R. Tolkien,1954
Nineteen Eighty-Four,George Orwell,1984
Lord of the Rings: The Return of the King,J. R. R. Tolkien,1954
Animal Farm,George Orwell,1945
Lord of the Rings: The Two Towers, J. R. R. Tolkien, 1954
And I also have this text file 'search_query.txt', whereby I put in keywords or search terms I want to search for in the CSV file:
Lord
Rings
Animal
I've currently come up with some code (with the help of stuff I've read) that allows me to count the number of matching entries. I then have the program write a separate CSV file 'results.csv' which just returns either 'Matching' or ' '.
The program then takes this 'results.csv' file and counts how many 'Matching' results I have and it prints the count.
import csv
import collections

f1 = file('book_list.csv', 'r')
f2 = file('search_query.txt', 'r')
f3 = file('results.csv', 'w')

c1 = csv.reader(f1)
c2 = csv.reader(f2)
c3 = csv.writer(f3)

input = [row for row in c2]

for booklist_row in c1:
    row = 1
    found = False
    for input_row in input:
        results_row = []
        if input_row[0] in booklist_row[0]:
            results_row.append('Matching')
            found = True
            break
        row = row + 1
    if not found:
        results_row.append('')
    c3.writerow(results_row)

f1.close()
f2.close()
f3.close()

d = collections.defaultdict(int)
with open("results.csv", "rb") as info:
    reader = csv.reader(info)
    for row in reader:
        for matches in row:
            matches = matches.strip()
            if matches:
                d[matches] += 1

results = [(matches, count) for matches, count in d.iteritems() if count >= 1]
results.sort(key=lambda x: x[1], reverse=True)

for matches, count in results:
    print 'There are', count, 'matching results'+'.'
In this case, my output returns:
There are 4 matching results.
I'm sure there is a better way of doing this that avoids writing a completely separate CSV file, but this was easier for me to get my head around.
My question is: this code that I've put together only returns how many matching results there are. How do I modify it in order to return the ACTUAL results as well?
i.e. I want my output to return:
There are 4 matching results.
Lord of the Rings: The Fellowship of the Ring
Lord of the Rings: The Return of the King
Animal Farm
Lord of the Rings: The Two Towers
As I said, I'm sure there's a much easier way to do what I already have, so some insight would be helpful. :)
Cheers!
EDIT: I just realized that if my keywords are in lower case, it won't work. Is there a way to avoid case-sensitivity?
Throw away the query file and get your search terms from sys.argv[1:] instead.
Throw away your output file and use sys.stdout instead.
Append matched booklist titles to a result_list. The result_row that you currently have has a rather misleading name. The count that you want is len(result_list). Print that. Then print the contents of result_list.
Convert your query words to lowercase once (before you start reading the input file). As you read each book_list row, convert its title to lowercase. Do your matching with the lowercase query words and the lowercase title.
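A minimal sketch pulling those points together (file name as in the question; search terms come from sys.argv):

import csv
import sys

search_terms = [term.lower() for term in sys.argv[1:]]

result_list = []
with open('book_list.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for booklist_row in reader:
        title = booklist_row[0]
        if any(term in title.lower() for term in search_terms):
            result_list.append(title)

print('There are %d matching results.' % len(result_list))
for title in result_list:
    print(title)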
Overall plan:
Read in the entire book list csv into a dictionary of {title: info}.
Read in the query file. For each keyword, filter the dictionary:
[key for key, value in books.items() if "Lord" in key]
say. Do what you will with the results.
If you want, put the results in another csv.
If you want to deal with casing issues, try turning all the titles to lowercase ("FOO".lower()) when you store them in the dictionary.
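A compact sketch of that plan (titles lowercased to handle the casing issue; file names as in the question):

import csv

with open('book_list.csv') as f:
    books = {row['name']: row for row in csv.DictReader(f)}

with open('search_query.txt') as f:
    keywords = [line.strip().lower() for line in f if line.strip()]

for keyword in keywords:
    matches = [title for title in books if keyword in title.lower()]
    print(matches)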
Recently I had a question regarding data types.
Since then, I've been trying to use NamedTuples (with more or less success).
My problem currently:
- How to import the lines from a file to new tuples,
- How to import the values separated with space/tab(/whatever) into a given part of the tuple?
Like:
Monday 8:00 10:00 ETR_28135 lh1n1522 Computer science 1
Tuesday 12:00 14:00 ETR_28134 lh1n1544 Geography EA 1
The first line should go into tuple[0]. First data: tuple[0].day; second: tuple[0].start; ...and so on.
And when a new line starts (that is, after two TABs (\t)), start a new tuple, like tuple[1].
I use this to separate the data:
with open(Filename) as f:
    for line in f:
        rawData = line.strip().split('\t')
And the rest of the logic is still missing (the filling up of the tuples).
(I know. This question, and the recent one are really low-level ones. However, hope these will help others too. If you feel like it's not a real question, too simple to be a question, etc etc, just vote to close. Thank you for your understanding.)
Such files are usually handled as comma-separated values (CSV), even though the values here are not really separated by commas. Python has a handy library called csv that lets you easily read such files.
Here is a slightly modified example from the docs:
import csv

csv.register_dialect('mycsv', delimiter='\t', quoting=csv.QUOTE_NONE)

with open(filename, 'rb') as f:
    reader = csv.reader(f, 'mycsv')
Usually you work one line at a time. If you need the whole file in a tuple then:
t = tuple(reader)
EDIT
If you need to access fields by name you could use csv.DictReader, but I don't know how exactly that works and I could not test it here.
EDIT 2
Looking at what namedtuples are, I was a bit outdated. There is a nice example of how namedtuple can work with the csv module:
from collections import namedtuple
import csv

EmployeeRecord = namedtuple('EmployeeRecord', 'name, age, title, department, paygrade')

for line in csv.reader(open("employees.csv", "rb")):
    emp = EmployeeRecord._make(line)
    print emp.name, emp.title
If you want to use a NamedTuple, you can use a slightly modified version of the example given in the Python documentation:
from collections import namedtuple
import csv

MyRecord = namedtuple('MyRecord', 'weekday, start, end, code1, code2, title, whatever')

for rec in map(MyRecord._make, csv.reader(open("mycsv.csv", "rb"), delimiter='\t')):
    print rec.weekday
    print rec.title
    # etc...
Here's a compact way of doing such things.
First declare the class of line item:
fields = "dow", "open_time", "close _time", "code", "foo", "subject", "bar"
Item = namedtuple('Item', " ".join(fields))
The next part is inside your loop.
# this is what your raw data looks like after the split:
#raw_data = ['Monday', '8:00', '10:00', 'ETR_28135', 'lh1n1522', 'Computer science', '1']
data_tuple = Item(**dict(zip(fields, raw_data)))
Now slowly:
zip(fields, raw_data) creates a list of pairs, like [("dow", "Monday"), ("open_time", "8:00"),..]
then dict() turns it into a dictionary, like {"dow": "Monday", "open_time": "8:00", ..}
then ** interprets this dictionary as a bunch of keyword parameters to Item constructor, an equivalent of Item(dow="Monday", open_time="8:00",..).
So your items are named tuples, with all values being strings.
Edit:
If order of fields is not going to change, you can do it far easier:
data_tuple = Item(*raw_data)
This uses the fact that order of fields in the file and order of parameters in Item definition match.
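Tied back to the reading loop from the question, the whole thing might look like this (a sketch using the illustrative field names from above; Filename is the variable from the question):

from collections import namedtuple

fields = "dow", "open_time", "close_time", "code", "foo", "subject", "bar"
Item = namedtuple('Item', " ".join(fields))

items = []
with open(Filename) as f:
    for line in f:
        raw_data = line.strip().split('\t')
        if raw_data and raw_data[0]:
            items.append(Item(*raw_data))

# items[0].dow, items[0].open_time, etc. are now available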