So I'm making a Yu-Gi-Oh database program. I have all the information stored in a large text file. Each monster is categorized in the following way:
|Name|NUM 1|DESC 1|TYPE|LOCATION|STARS|ATK|DEF|DESCRIPTION
Here's an actual example:
|A Feather of the Phoenix|37;29;18|FET;YSDS;CP03|Spell Card}Spell||||Discard 1 card. Select from your Graveyard and return it to the top of your Deck.|
So I made a program that searches this large text file by name and it returns the information from the text file without the '|'. Here it is:
with open('TEXT.txt') as fd:
    input=[x.strip('|').split('|') for x in fd.readlines()]
    to_search={x[0]:x for x in input}
    print('\n'.join(to_search[name]))
Now I'm trying to edit my program so I can search for the name of the monster and choose which attribute I want to display. So it'd appear like
A Feather of the Phoenix
Description:
Discard 1 card. Select from your Graveyard and return it to the top of your Deck.
Any clues as to how I can do this?
First, this is a variant dialect of CSV, and can be parsed with the csv module instead of trying to do it manually. For example:
import csv
with open('TEXT.txt') as fd:
    rows = csv.reader(fd, delimiter='|')
    # Each line starts with '|', so row[0] is empty and the name is row[1]
    to_search = {row[1]:row for row in rows}
print('\n'.join(to_search[name]))
You might also prefer to use DictReader, so each row is a dict (keyed off the names in the header row, or manually-specified column names if you don't have one):
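import csv
with open('TEXT.txt') as fd:
    rows = csv.DictReader(fd, delimiter='|')
    to_search = {row['Name']:row for row in rows}
# Joining a dict directly would join its keys, so print field/value pairs instead
for field, value in to_search[name].items():
    print(field, value)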
Then, to select a specific attribute:
with open('TEXT.txt') as fd:
    rows = csv.DictReader(fd, delimiter='|')
    to_search = {row['Name']:row for row in rows}
print(to_search[name][attribute])
However… I'm not sure this is a good design in the first place. Do you really want to re-read the entire file for each lookup? I think it makes more sense to read it into memory once, into a general-purpose structure that you can use repeatedly. And in fact, you've almost got such a structure:
with open('TEXT.txt') as fd:
    monsters = list(csv.DictReader(fd, delimiter='|'))
monsters_by_name = {monster['Name']: monster for monster in monsters}
Then you can build additional indexes, like a multi-map of monsters by location, etc., if you need them.
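A multi-map of monsters by location could be sketched roughly as follows. This is only an illustration under assumptions: the sample rows are made up, and an in-memory string stands in for TEXT.txt so the snippet runs on its own.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the question's layout; io.StringIO stands in for TEXT.txt.
SAMPLE = (
    "|Name|NUM 1|DESC 1|TYPE|LOCATION|STARS|ATK|DEF|DESCRIPTION\n"
    "|Card A|1|X|Spell|Deck||||First sample card.\n"
    "|Card B|2|Y|Monster|Graveyard|4|1200|800|Second sample card.\n"
    "|Card C|3|Z|Monster|Deck|3|900|1000|Third sample card.\n"
)

monsters = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="|"))

# Multi-map: each location maps to a list of all monsters found there.
monsters_by_location = defaultdict(list)
for monster in monsters:
    monsters_by_location[monster["LOCATION"]].append(monster)

names_in_deck = [monster["Name"] for monster in monsters_by_location["Deck"]]
print(names_in_deck)
```

Because the index is built once, later lookups by location do not need to touch the file again.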
All this being said, your original code can almost handle what you want already. to_search[name] is a list. If you just build a map from attribute names to indices, you can do this:
attributes = ['Name', 'NUM 1', 'DESC 1', 'TYPE', 'LOCATION', 'STARS', 'ATK', 'DEF', 'DESCRIPTION']
attributes_by_name = {value: idx for idx, value in enumerate(attributes)}
# ...
with open('TEXT.txt') as fd:
    input=[x.strip('|').split('|') for x in fd.readlines()]
to_search={x[0]:x for x in input}
# attribute is the name of the field to display, e.g. 'DESCRIPTION'
attribute_index = attributes_by_name[attribute]
print(to_search[name][attribute_index])
You could look at the namedtuple class in collections. You will want to make each entry a namedtuple with your fields as attributes. The namedtuple might look like:
from collections import namedtuple
Card = namedtuple('Card', 'name, number, description, whatever_else')
As shown in the collections documentation, namedtuple and csv work well together:
import csv
# In Python 3, open the file in text mode with newline="" for the csv module
with open("cards", newline="") as f:
    for card in map(Card._make, csv.reader(f)):
        print(card.name, card.description)  # format however you want here
The mechanics around search can be very complicated. For example, if you want a really fast search built around an exact match, you could build a dictionary for each attribute you're interested in:
name_map = {card.name: card for card in all_cards}
search_result = name_map[name_you_searched_for]
You could also do a startswith search:
possibles = [card for card in all_cards if card.name.startswith(search_string)]
# here you need to decide what to do with these possibles, in this example, I'm just snagging the first one, and I'm not handling the possibility that you don't find one, you should.
search_result = possibles[0]
I recommend against trying to search the file itself. That kind of search is complex to implement well and is typically left to database systems. If you need it, consider switching the application to sqlite or another lightweight database.
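For a rough idea of what the sqlite route might look like, here is a minimal sketch using the standard sqlite3 module. The table name, column names, and the second sample card are all assumptions made for illustration, and an in-memory database is used so the snippet runs without any files.

```python
import sqlite3

# Hypothetical schema and sample rows; a real program would load them from the text file.
sample_cards = [
    ("A Feather of the Phoenix", "Spell",
     "Discard 1 card. Select from your Graveyard and return it to the top of your Deck."),
    ("Hypothetical Dragon", "Monster", "A made-up card used only for this example."),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cards (name TEXT PRIMARY KEY, type TEXT, description TEXT)")
conn.executemany("INSERT INTO cards VALUES (?, ?, ?)", sample_cards)

# Exact-match lookup by name, with the value passed as a bound parameter.
row = conn.execute(
    "SELECT description FROM cards WHERE name = ?",
    ("A Feather of the Phoenix",),
).fetchone()
description = row[0] if row else None

# Prefix search, similar in spirit to the startswith example.
prefix_matches = [
    r[0] for r in conn.execute(
        "SELECT name FROM cards WHERE name LIKE ? ORDER BY name", ("A %",)
    )
]
conn.close()
print(description)
print(prefix_matches)
```

Note that LIKE in SQLite ignores case for ASCII letters by default, so this prefix search is slightly looser than str.startswith.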
I need to convert a CSV file into a list of dictionaries without importing CSV or other external libraries for a project I am doing for class.
Attempt
I am able to get the keys using header line but when I try to extract the values it goes row by row instead of column by column and starts in the wrong place. However when I append it to the list it goes back to starting at the right place. However I am unsure of how to connect the keys to the correct column in the list.
CSV file
This is the CSV file I am using, I am only using the descriptions portion up to the first comma.
I tried using a for loop in order to cycle through each key but it seems to go row by row and I don't know how to change it.
If anybody could steer me in the right direction it would be very appreciated.
CSV sample - sample is not saving correctly but it has the three headers on top and then the three matching information below and so on.
(Code,Name,State)\n
(ACAD,Acadia National Park,ME)\n
(ARCH,Arches National Park,UT)\n
(BADL, Badlands National Park,SD)\n
I read your question and am posting code based on what I understood from it. You should learn to post your code in the question; it is a mandatory skill. Always open a file using a "with" block. I made a demo CSV file with two rows of records. The following code fetches all the rows as a list of dictionaries.
def readParksFile(fileName="national_parks.csv"):
    with open(fileName) as infile:
        column_names = infile.readline()
        # Strip the trailing newline so the last key is 'Description', not 'Description\n'
        keys = column_names.strip().split(",")
        number_of_columns = len(keys)
        list_of_dictionaries = []
        data = infile.readlines()
    # The with block has already closed the file, so no close() call is needed
    list_of_rows = []
    for row in data:
        list_of_rows.append(row.strip().split(","))
    for item in list_of_rows:
        row_as_a_dictionary = {}
        for i in range(number_of_columns):
            row_as_a_dictionary[keys[i]] = item[i]
        list_of_dictionaries.append(row_as_a_dictionary)
    for i in range(len(list_of_dictionaries)):
        print(list_of_dictionaries[i])
    return list_of_dictionaries
readParksFile()
Output:
{'Code': 'cell1', 'Name': 'cell2', 'State': 'cell3', 'Acres': 'cell4', 'Latitude': 'cell5', 'Longitude': 'cell6', 'Date': 'cell7', 'Description': 'cell8'}
{'Code': 'cell11', 'Name': 'cell12', 'State': 'cell13', 'Acres': 'cell14', 'Latitude': 'cell15', 'Longitude': 'cell16', 'Date': 'cell17', 'Description': 'cell18'}
I would create a class with a constructor that has the keys from the first row of the CSV as properties. Then create an empty list to store your dictionaries. Then open the file (that is a built-in library so I assume you can use it) and read it line by line. Store the line as a string and use the split method with a comma as the delimiter and store that list in a variable. Call the constructor of your class for each line to construct your dictionary using the indexes of the list from the split method. Before reading the next line, append the dictionary to your list. This is probably not the easiest way to do it but it doesn't use any external libraries (although as others have mentioned, there is a built-in CSV module).
Code:
#Class with constructor
class Park:
    def __init__(self, code, name, state):
        self.code = code
        self.name = name
        self.state = state
#Empty array for storing the dictionaries
parks = []
#Open file
parks_csv = open("parks.csv")
#Skip first line
lines = parks_csv.readlines()[1:]
#Read the rest of the lines
for line in lines:
    #Strip the newline and stray spaces so they don't end up in the values
    parkProperties = [value.strip() for value in line.strip().split(",")]
    newPark = Park(parkProperties[0], parkProperties[1], parkProperties[2])
    parks.append(newPark)
#Print park dictionaries
#It would be easier to parse this using the JSON library
#But since you said you can't use any libraries
for park in parks:
    print(f'{{code: {park.code}, name: {park.name}, state: {park.state}}}')
#Don't forget to close the file
parks_csv.close()
Output:
{code: ACAD, name: Acadia National Park, state: ME}
{code: ARCH, name: Arches National Park, state: UT}
{code: BADL, name: Badlands National Park, state: SD}
This is a short script I've written to refine and validate a large dataset that I have.
# The purpose of this script is the refinement of the job data attained from the
# JSI as it is rendered by the `csv generator` contributed by Luis for purposes
# of presentation on the dashboard map.
import csv
# The number of columns
num_headers = 9
# Remove invalid characters from records
def url_escaper(data):
    for line in data:
        yield line.replace('&','&amp;')
# Be sure to configure input & output files
with open("adzuna_input_THRESHOLD.csv", 'r', newline='') as file_in, open("adzuna_output_GO.csv", 'w', newline='') as file_out:
    csv_in = csv.reader( url_escaper( file_in ) )
    csv_out = csv.writer(file_out)
    # Get rid of rows that have the wrong number of columns
    # and rows that have only whitespace for a columnar value
    for i, row in enumerate(csv_in, start=1):
        if len(row) == num_headers and all(e.strip() for e in row):
            csv_out.writerow(row)
        else:
            print("line %d is malformed" % i)
I have one field that is structured like so:
finance|statistics|lisp
I've seen ways to do this using other utilities like R, but I want to ideally achieve the same effect within the scope of this python code.
Maybe I can iterate over all the characters of all the columnar values, perhaps as a list, and if I see a | I can dispose of the | and all the text that follows it within the scope of the column value.
I think surely it can be achieved with slices as they do here but I don't quite understand how the indices with slices work- and I can't see how I could include this process harmoniously within the cascade of the current script pipeline.
With regex I guess it's something like this
(?:|)(.*)
Why not use string's split method?
In[4]: 'finance|statistics|lisp'.split('|')[0]
Out[4]: 'finance'
It also does not raise an exception when the separator character is absent from the string:
In[5]: 'finance/statistics/lisp'.split('|')[0]
Out[5]: 'finance/statistics/lisp'
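To fit this into the row-filtering loop from the question, one possible approach is a small helper that truncates just the affected column before the row is written. The helper names, the sample row, and the column index below are assumptions for illustration only.

```python
def keep_first_value(field, sep="|"):
    """Return the text before the first separator, or the whole field if there is none."""
    return field.split(sep)[0]


def clean_row(row, column, sep="|"):
    """Return a copy of row with only the given column truncated."""
    cleaned = list(row)
    cleaned[column] = keep_first_value(cleaned[column], sep)
    return cleaned


# Hypothetical row; index 1 is assumed to hold the pipe-separated field.
row = ["Analyst", "finance|statistics|lisp", "London"]
cleaned = clean_row(row, column=1)
print(cleaned)
```

In the script, the write step would then become something like csv_out.writerow(clean_row(row, column)), where column is the index of the field in question.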
I am attempting to combine a collection of 600 text files, each line looks like
Measurement title Measurement #1
ebv-miR-BART1-3p 4.60618701
....
evb-miR-BART1-200 12.8327289
with 250 or so rows in each file. Each file is formatted that way, with the same data headers. What I would like to do is combine the files such that it looks like this
Measurement title Measurement #1 Measurement #2
ebv-miR-BART1-3p 4.60618701 4.110878867
....
evb-miR-BART1-200 12.8327289 6.813287556
I was wondering if there is an easy way in python to strip out the second column of each file, then append it to a master file? I was planning on pulling each line out, then using regular expressions to look for the second column, and appending it to the corresponding line in the master file. Is there something more efficient?
This is a small amount of data for today's desktop computers (around 150,000 measurements), so keeping everything in memory and dumping it to a single file will be easier than any other strategy. If it did not fit in RAM, SQL might be a good approach.
But as it is, you can create a single default dictionary where each element is a list,
read all your files, collect the measurements into this dictionary, and dump it to disk:
# create default list dictionary:
>>> from collections import defaultdict
>>> data = defaultdict(list)
# Read your data into it:
>>> from glob import glob
>>> import csv
>>> for filename in glob("my_directory/*csv"):
...     # pass delimiter=... here if your files are not comma-separated
...     reader = csv.reader(open(filename))
...     # throw away header row:
...     next(reader)
...     for name, value in reader:
...         data[name].append(value)
...
>>> # and record everything down in another file:
...
>>> mydata = open("mydata.csv", "wt")
>>> writer = csv.writer(mydata)
>>> for name, values in sorted(data.items()):
...     writer.writerow([name] + values)
...
>>> mydata.close()
>>>
Use the csv module to read the files in, create a dictionary of the measurement names, and make the values in the dictionary a list of the values from the file.
I don't have comment privileges yet, therefore a separate answer.
jsbueno's answer works really well as long as you're sure that the same measurement IDs occur in every file (order is not important, but the sets should be equal!).
In the following situation:
file1:
measID,meas1
a,1
b,2
file2:
measID,meas1
a,3
b,4
c,5
you would get:
outfile:
measID,meas1,meas2
a,1,3
b,2,4
c,5
instead of the desired:
outfile:
measID,meas1,meas2
a,1,3
b,2,4
c,,5 # measurement c was missing in file1!
I'm using commas instead of spaces as delimiters for better visibility.
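One way to get the padded output might be to keep one dictionary per file and fill in an empty string wherever a measurement is missing. The sketch below uses the two example files from this answer as in-memory strings; the variable names are assumptions, and real code would read the files from disk instead.

```python
import csv
import io

# The two example files from above, held in memory so the sketch is self-contained.
files = {
    "file1": "measID,meas1\na,1\nb,2\n",
    "file2": "measID,meas1\na,3\nb,4\nc,5\n",
}

# One {measurement ID: value} dictionary per input file, in file order.
columns = []
for text in files.values():
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    columns.append({name: value for name, value in reader})

# Union of all IDs, so measurements missing from some files are still listed.
all_names = sorted(set().union(*columns))
merged = [[name] + [column.get(name, "") for column in columns] for name in all_names]

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["measID"] + ["meas%d" % (i + 1) for i in range(len(columns))])
writer.writerows(merged)
print(out.getvalue())
```

Here column.get(name, "") is what produces the empty cell for measurement c in the first file.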
My code is below. Basically, I've got a CSV file and a text file "input.txt". I'm trying to create a Python application which will take the input from "input.txt" and search through the CSV file for a match and if a match is found, then it should return the first column of the CSV file.
import csv
csv_file = csv.reader(open('some_csv_file.csv', 'r'), delimiter = ",")
header = next(csv_file)
data = list(csv_file)
input_file = open("input.txt", "r")
lines = input_file.readlines()
for row in lines:
    inputs = row.strip().split(" ")
    for input in inputs:
        input = input.lower()
        for row in data:
            if any(input in terms.lower() for terms in row):
                print(row[0])
Say my CSV file looks like this:
book title, author
The Rock, Herry Putter
Business Economics, Herry Putter
Yogurt, Daniel Putter
Short Story, Rick Pan
And say my input.txt looks like this:
Herry
Putter
Therefore when I run my program, it prints:
The Rock
Business Economics
The Rock
Business Economics
Yogurt
This is because it searches for all titles with "Herry" first, and then searches all over again for "Putter". So in the end, I have duplicates of the book titles. I'm trying to figure out a way to remove them...so if anyone can help, that would be greatly appreciated.
If original order does not matter, then stick the results into a set first, and then print them out at the end. But, your example is small enough where speed does not matter that much.
Stick the results in a set (which is like a list but only contains unique elements), and print at the end.
Something like:
if any(input in terms.lower() for terms in row):
    if not row[0] in my_set:
        my_set.add(row[0])
During the search stick results into a list, and only add new results to the list after first searching the list to see if the result is already there. Then after the search is done print the list.
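The list-based approach, which keeps the titles in the order they were first found, could look roughly like this. The rows and search terms are taken from the question's example and hard-coded here so the sketch runs without the CSV and input files.

```python
# Sample rows and search terms from the question, hard-coded for illustration.
data = [
    ["The Rock", "Herry Putter"],
    ["Business Economics", "Herry Putter"],
    ["Yogurt", "Daniel Putter"],
    ["Short Story", "Rick Pan"],
]
search_terms = ["herry", "putter"]

titles = []  # results in first-seen order, without duplicates
for term in search_terms:
    for row in data:
        if any(term in field.lower() for field in row):
            # Only add the title if an earlier term has not already found it.
            if row[0] not in titles:
                titles.append(row[0])

for title in titles:
    print(title)
```

The membership test on a list is linear, which is fine at this size; for large result lists, a companion set used only for the "already seen" check would likely be faster.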
First, get the set of search terms you want to look for in a single list. We use set(...) here to eliminate duplicate search terms:
search_terms = set(open("input.txt", "r").read().lower().split())
Next, iterate over the rows in the data table, selecting each one that matches the search terms. Here, I'm preserving the behavior of the original code, in that we search for the case-normalized search term in any column for each row. If you just wanted to search e.g. the author column, then this would need to be tweaked:
results = [row for row in data
if any(search_term in item.lower()
for item in row
for search_term in search_terms)]
Finally, print the results.
for row in results:
    print(row[0])
If you wanted, you could also list the authors or any other info in the table. E.g.:
for row in results:
    print('%30s (by %s)' % (row[0], row[1]))
Recently I had a question regarding data types.
Since then, I've been trying to use NamedTuples (with more or less success).
My problem currently:
- How to import the lines from a file to new tuples,
- How to import the values separated with space/tab(/whatever) into a given part of the tuple?
Like:
Monday 8:00 10:00 ETR_28135 lh1n1522 Computer science 1
Tuesday 12:00 14:00 ETR_28134 lh1n1544 Geography EA 1
First line should go into tuple[0]. First data: tuple[0].day; second: tuple[0].start; ..and so on.
And when the new line starts (that's two TAB (\t), start a new tuple, like tuple[1]).
I use this to separate the data:
with open(Filename) as f:
    for line in f:
        rawData = line.strip().split('\t')
And the rest of the logic is still missing (the filling up of the tuples).
(I know. This question, and the recent one are really low-level ones. However, hope these will help others too. If you feel like it's not a real question, too simple to be a question, etc etc, just vote to close. Thank you for your understanding.)
Such database files are called comma-separated values (CSV) files even though they are not really separated by commas. Python has a handy library called csv that lets you easily read such files.
Here is a slightly modified example from the docs
import csv
csv.register_dialect('mycsv', delimiter='\t', quoting=csv.QUOTE_NONE)
# In Python 3, open in text mode with newline='' rather than 'rb'
with open(filename, newline='') as f:
    reader = csv.reader(f, 'mycsv')
Usually you work one line at a time. If you need the whole file in a tuple then:
t = tuple(reader)
EDIT
If you need to access fields by name you could use csv.DictReader, but I don't know exactly how that works and I could not test it here.
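As a rough sketch of how csv.DictReader might be used on this data: since the file has no header row, the column names are passed in through the fieldnames parameter. The field names below are made up for illustration, one record per line is assumed, and an in-memory string stands in for the real file.

```python
import csv
import io

# Hypothetical column names; the file itself has no header row.
FIELDS = ["day", "start", "end", "code1", "code2", "title", "count"]

# The two sample lines from the question, tab-separated, held in memory.
SAMPLE = (
    "Monday\t8:00\t10:00\tETR_28135\tlh1n1522\tComputer science\t1\n"
    "Tuesday\t12:00\t14:00\tETR_28134\tlh1n1544\tGeography EA\t1\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE), fieldnames=FIELDS, delimiter="\t")
rows = list(reader)  # each row is a dict keyed by the names in FIELDS

for row in rows:
    print(row["day"], row["start"], row["title"])
```

With a real file, io.StringIO(SAMPLE) would be replaced by a file object opened with newline="".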
EDIT 2
Looking at what namedtuples are, I'm a bit outdated. There is a nice example on how namedtuple could work with the csv module:
from collections import namedtuple
EmployeeRecord = namedtuple('EmployeeRecord', 'name, age, title, department, paygrade')
import csv
with open("employees.csv", newline="") as f:
    for line in csv.reader(f):
        emp = EmployeeRecord._make(line)
        print(emp.name, emp.title)
If you want to use a NamedTuple, you can use a slightly modified version of the example given in the Python documentation:
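from collections import namedtuple
# Field names are case-sensitive, so 'weekday' must match rec.weekday below
MyRecord = namedtuple('MyRecord', 'weekday, start, end, code1, code2, title, whatever')
import csv
with open("mycsv.csv", newline="") as f:
    for rec in map(MyRecord._make, csv.reader(f, delimiter='\t')):
        print(rec.weekday)
        print(rec.title)
        # etc...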
Here's a compact way of doing such things.
First declare the class of line item:
from collections import namedtuple
fields = "dow", "open_time", "close_time", "code", "foo", "subject", "bar"
Item = namedtuple('Item', " ".join(fields))
The next part is inside your loop.
# this is what your raw data looks like after the split:
#raw_data = ['Monday', '8:00', '10:00', 'ETR_28135', 'lh1n1522', 'Computer science', '1']
data_tuple = Item(**dict(zip(fields, raw_data)))
Now slowly:
zip(fields, raw_data) pairs each field name with its value, like [("dow", "Monday"), ("open_time", "8:00"),..]
then dict() turns it into a dictionary, like {"dow": "Monday", "open_time": "8:00", ..}
then ** interprets this dictionary as a bunch of keyword parameters to Item constructor, an equivalent of Item(dow="Monday", open_time="8:00",..).
So your items are named tuples, with all values being strings.
Edit:
If order of fields is not going to change, you can do it far easier:
data_tuple = Item(*raw_data)
This uses the fact that order of fields in the file and order of parameters in Item definition match.
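Putting the pieces together, the missing "fill up the tuples" logic from the question might be sketched like this. The field names are hypothetical, one record per line is assumed, and the lines are hard-coded instead of being read from a file so the snippet runs as is.

```python
from collections import namedtuple

# Hypothetical field names for the seven tab-separated columns.
Item = namedtuple("Item", "day start end code1 code2 title count")

# Stand-in for the lines of the file, using the question's sample data.
lines = [
    "Monday\t8:00\t10:00\tETR_28135\tlh1n1522\tComputer science\t1\n",
    "Tuesday\t12:00\t14:00\tETR_28134\tlh1n1544\tGeography EA\t1\n",
]

# One Item per non-empty line; positional unpacking relies on the column order.
items = [Item(*line.strip().split("\t")) for line in lines if line.strip()]

print(items[0].day, items[0].start)
print(items[1].title)
```

Each element of items is then addressable as in the question: items[0].day, items[0].start, and so on. A line with the wrong number of columns would raise a TypeError here, which may or may not be the behavior you want.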