My scenario is as follows: I have a table of data (handful of fields, less than a hundred rows) that I use extensively in my program. I also need this data to be persistent, so I save it as a CSV and load it on start-up. I choose not to use a database because every option (even SQLite) is an overkill for my humble requirement (also - I would like to be able to edit the values offline in a simple way, and nothing is simpler than notepad).
Assume my data looks as follows (in the file it's comma separated without titles, this is just an illustration):
Row | Name | Year | Priority
------------------------------------
1 | Cat | 1998 | 1
2 | Fish | 1998 | 2
3 | Dog | 1999 | 1
4 | Aardvark | 2000 | 1
5 | Wallaby | 2000 | 1
6 | Zebra | 2001 | 3
Notes:
Row may be a "real" value written to the file or just an auto-generated value that represents the row number. Either way it exists in memory.
Names are unique.
Things I do with the data:
Look-up a row based on either ID (iteration) or name (direct access).
Display the table in different orders based on multiple field: I need to sort it e.g. by Priority and then Year, or Year and then Priority, etc.
I need to count instances based on sets of parameters, e.g. how many rows have their year between 1997 and 2002, or how many rows are in 1998 and priority > 2, etc.
I know this "cries" for SQL...
I'm trying to figure out what's the best choice for data structure. Following are several choices I see:
List of row lists:
a = []
a.append( [1, "Cat", 1998, 1] )
a.append( [2, "Fish", 1998, 2] )
a.append( [3, "Dog", 1999, 1] )
...
List of column lists (there will obviously be an API for add_row etc):
a = []
a.append( [1, 2, 3, 4, 5, 6] )
a.append( ["Cat", "Fish", "Dog", "Aardvark", "Wallaby", "Zebra"] )
a.append( [1998, 1998, 1999, 2000, 2000, 2001] )
a.append( [1, 2, 1, 1, 1, 3] )
Dictionary of columns lists (constants can be created to replace the string keys):
a = {}
a['ID'] = [1, 2, 3, 4, 5, 6]
a['Name'] = ["Cat", "Fish", "Dog", "Aardvark", "Wallaby", "Zebra"]
a['Year'] = [1998, 1998, 1999, 2000, 2000, 2001]
a['Priority'] = [1, 2, 1, 1, 1, 3]
Dictionary with keys being tuples of (Row, Field):
Create constants to avoid string searching
NAME=1
YEAR=2
PRIORITY=3
a={}
a[(1, NAME)] = "Cat"
a[(1, YEAR)] = 1998
a[(1, PRIORITY)] = 1
a[(2, NAME)] = "Fish"
a[(2, YEAR)] = 1998
a[(2, PRIORITY)] = 2
...
And I'm sure there are other ways... However each way has disadvantages when it comes to my requirements (complex ordering and counting).
What's the recommended approach?
EDIT:
To clarify, performance is not a major issue for me. Because the table is so small, I believe almost every operation will be in the range of milliseconds, which is not a concern for my application.
Having a "table" in memory that needs lookups, sorting, and arbitrary aggregation really does call out for SQL. You said you tried SQLite, but did you realize that SQLite can use an in-memory-only database?
connection = sqlite3.connect(':memory:')
Then you can create/drop/query/update tables in memory with all the functionality of SQLite and no files left over when you're done. And as of Python 2.5, sqlite3 is in the standard library, so it's not really "overkill" IMO.
Here is a sample of how one might create and populate the database:
import csv
import sqlite3
db = sqlite3.connect(':memory:')
def init_db(cur):
cur.execute('''CREATE TABLE foo (
Row INTEGER,
Name TEXT,
Year INTEGER,
Priority INTEGER)''')
def populate_db(cur, csv_fp):
rdr = csv.reader(csv_fp)
cur.executemany('''
INSERT INTO foo (Row, Name, Year, Priority)
VALUES (?,?,?,?)''', rdr)
cur = db.cursor()
init_db(cur)
populate_db(cur, open('my_csv_input_file.csv'))
db.commit()
If you'd really prefer not to use SQL, you should probably use a list of dictionaries:
lod = [ ] # "list of dicts"
def populate_lod(lod, csv_fp):
rdr = csv.DictReader(csv_fp, ['Row', 'Name', 'Year', 'Priority'])
lod.extend(rdr)
def query_lod(lod, filter=None, sort_keys=None):
if filter is not None:
lod = (r for r in lod if filter(r))
if sort_keys is not None:
lod = sorted(lod, key=lambda r:[r[k] for k in sort_keys])
else:
lod = list(lod)
return lod
def lookup_lod(lod, **kw):
for row in lod:
for k,v in kw.iteritems():
if row[k] != str(v): break
else:
return row
return None
Testing then yields:
>>> lod = []
>>> populate_lod(lod, csv_fp)
>>>
>>> pprint(lookup_lod(lod, Row=1))
{'Name': 'Cat', 'Priority': '1', 'Row': '1', 'Year': '1998'}
>>> pprint(lookup_lod(lod, Name='Aardvark'))
{'Name': 'Aardvark', 'Priority': '1', 'Row': '4', 'Year': '2000'}
>>> pprint(query_lod(lod, sort_keys=('Priority', 'Year')))
[{'Name': 'Cat', 'Priority': '1', 'Row': '1', 'Year': '1998'},
{'Name': 'Dog', 'Priority': '1', 'Row': '3', 'Year': '1999'},
{'Name': 'Aardvark', 'Priority': '1', 'Row': '4', 'Year': '2000'},
{'Name': 'Wallaby', 'Priority': '1', 'Row': '5', 'Year': '2000'},
{'Name': 'Fish', 'Priority': '2', 'Row': '2', 'Year': '1998'},
{'Name': 'Zebra', 'Priority': '3', 'Row': '6', 'Year': '2001'}]
>>> pprint(query_lod(lod, sort_keys=('Year', 'Priority')))
[{'Name': 'Cat', 'Priority': '1', 'Row': '1', 'Year': '1998'},
{'Name': 'Fish', 'Priority': '2', 'Row': '2', 'Year': '1998'},
{'Name': 'Dog', 'Priority': '1', 'Row': '3', 'Year': '1999'},
{'Name': 'Aardvark', 'Priority': '1', 'Row': '4', 'Year': '2000'},
{'Name': 'Wallaby', 'Priority': '1', 'Row': '5', 'Year': '2000'},
{'Name': 'Zebra', 'Priority': '3', 'Row': '6', 'Year': '2001'}]
>>> print len(query_lod(lod, lambda r:1997 <= int(r['Year']) <= 2002))
6
>>> print len(query_lod(lod, lambda r:int(r['Year'])==1998 and int(r['Priority']) > 2))
0
Personally I like the SQLite version better since it preserves your types better (without extra conversion code in Python) and easily grows to accommodate future requirements. But then again, I'm quite comfortable with SQL, so YMMV.
A very old question I know but...
A pandas DataFrame seems to be the ideal option here.
http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.html
From the blurb
Two-dimensional size-mutable, potentially heterogeneous tabular data
structure with labeled axes (rows and columns). Arithmetic operations
align on both row and column labels. Can be thought of as a dict-like
container for Series objects. The primary pandas data structure
http://pandas.pydata.org/
I personally would use the list of row lists. Because the data for each row is always in the same order, you can easily sort by any of the columns by simply accessing that element in each of the lists. You can also easily count based on a particular column in each list, and make searches as well. It's basically as close as it gets to a 2-d array.
Really the only disadvantage here is that you have to know in what order the data is in, and if you change that ordering, you'll have to change your search/sorting routines to match.
Another thing you can do is have a list of dictionaries.
rows = []
rows.append({"ID":"1", "name":"Cat", "year":"1998", "priority":"1"})
This would avoid needing to know the order of the parameters, so you can look through each "year" field in the list.
Have a Table class whose rows is a list of dict or better row objects
In table do not directly add rows but have a method which update few lookup maps e.g. for name
if you are not adding rows in order or id are not consecutive you can have idMap too
e.g.
class Table(object):
def __init__(self):
self.rows = []# list of row objects, we assume if order of id
self.nameMap = {} # for faster direct lookup for row by name
def addRow(self, row):
self.rows.append(row)
self.nameMap[row['name']] = row
def getRow(self, name):
return self.nameMap[name]
table = Table()
table.addRow({'ID':1,'name':'a'})
First, given that you have a complex data retrieval scenario, are you sure even SQLite is overkill?
You'll end up having an ad hoc, informally-specified, bug-ridden, slow implementation of half of SQLite, paraphrasing Greenspun's Tenth Rule.
That said, you are very right in saying that choosing a single data structure will impact one or more of searching, sorting or counting, so if performance is paramount and your data is constant, you could consider having more than one structure for different purposes.
Above all, measure what operations will be more common and decide which structure will end up costing less.
I personally wrote a lib for pretty much that quite recently, it is called BD_XML
as its most fundamental reason of existence is to serve as a way to send data back and forth between XML files and SQL databases.
It is written in Spanish (if that matters in a programming language) but it is very simple.
from BD_XML import Tabla
It defines an object called Tabla (Table), it can be created with a name for identification an a pre-created connection object of a pep-246 compatible database interface.
Table = Tabla('Animals')
Then you need to add columns with the agregar_columna (add_column) method, with can take various key word arguments:
campo (field): the name of the field
tipo (type): the type of data stored, can be a things like 'varchar' and 'double' or name of python objects if you aren't interested in exporting to a data base latter.
defecto (default): set a default value for the column if there is none when you add a row
there are other 3 but are only there for database tings and not actually functional
like:
Table.agregar_columna(campo='Name', tipo='str')
Table.agregar_columna(campo='Year', tipo='date')
#declaring it date, time, datetime or timestamp is important for being able to store it as a time object and not only as a number, But you can always put it as a int if you don't care for dates
Table.agregar_columna(campo='Priority', tipo='int')
Then you add the rows with the += operator (or + if you want to create a copy with an extra row)
Table += ('Cat', date(1998,1,1), 1)
Table += {'Year':date(1998,1,1), 'Priority':2, Name:'Fish'}
#…
#The condition for adding is that is a container accessible with either the column name or the position of the column in the table
Then you can generate XML and write it to a file with exportar_XML (export_XML) and escribir_XML (write_XML):
file = os.path.abspath(os.path.join(os.path.dirname(__file__), 'Animals.xml'))
Table.exportar_xml()
Table.escribir_xml(file)
And then import it back with importar_XML (import_XML) with the file name and indication that you are using a file and not an string literal:
Table.importar_xml(file, tipo='archivo')
#archivo means file
Advanced
This are ways you can use a Tabla object in a SQL manner.
#UPDATE <Table> SET Name = CONCAT(Name,' ',Priority), Priority = NULL WHERE id = 2
for row in Table:
if row['id'] == 2:
row['Name'] += ' ' + row['Priority']
row['Priority'] = None
print(Table)
#DELETE FROM <Table> WHERE MOD(id,2) = 0 LIMIT 1
n = 0
nmax = 1
for row in Table:
if row['id'] % 2 == 0:
del Table[row]
n += 1
if n >= nmax: break
print(Table)
this examples assume a column named 'id'
but can be replaced width row.pos for your example.
if row.pos == 2:
The file can be download from:
https://bitbucket.org/WolfangT/librerias
Related
I'm new to this forum, kindly excuse if the question format is not very good.
I'm trying to fetch rows from database table in mysql and print the same after processing the cols (one of the cols contains json which needs to be expanded). Below is the source and expected output. Would be great if someone can suggest an easier way to manage this data.
Note: I have achieved this with lots of looping and parsing but the challenges are.
1) There is no connection between col_names and data and hence when I am printing the data I don't know the order of the data in the resultset so there is a mismatch in the col title that I print and the data, any means to keep this in sync ?
2) I would like to have the flexibility of changing the order of the columns without much rework.
What is best possible way to achieve this. Have not explored the pandas library as I was not sure if it is really necessary.
Using python 3.6
Sample Data in the table
id, student_name, personal_details, university
1, Sam, {"age":"25","DOL":"2015","Address":{"country":"Poland","city":"Warsaw"},"DegreeStatus":"Granted"},UAW
2, Michael, {"age":"24","DOL":"2016","Address":{"country":"Poland","city":"Toruń"},"DegreeStatus":"Granted"},NCU
I'm querying the database using MySQLdb.connect object, steps below
query = "select * from student_details"
cur.execute(query)
res = cur.fetchall() # get a collection of tuples
db_fields = [z[0] for z in cur.description] # generate list of col_names
Data in variables:
>>>db_fields
['id', 'student_name', 'personal_details', 'university']
>>>res
((1, 'Sam', '{"age":"25","DOL":"2015","Address":{"country":"Poland","city":"Warsaw"},"DegreeStatus":"Granted"}','UAW'),
(2, 'Michael', '{"age":"24","DOL":"2016","Address":{"country":"Poland","city":"Toruń"},"DegreeStatus":"Granted"}','NCU'))
Desired Output:
id, student_name, age, DOL, country, city, DegreeStatus, University
1, 'Sam', 25, 2015, 'Poland', 'Warsaw', 'Granted', 'UAW'
2, 'Michael', 24, 2016, 'Poland', 'Toruń', 'Granted', 'NCU'
A not-too-pythonic way but easy to understand (and maybe you can write a more pythonic soltion) might be:
def unwrap_dict(_input):
res = dict()
for k, v in _input.items():
# Assuming you know there's only one nested level
if isinstance(v, dict):
for _k, _v in v.items():
res[_k] = _v
continue
res[k] = v
return res
all_data = list()
for row in result:
res = dict()
for field, data in zip(db_fields, row):
# Assuming you know personal_details is the only JSON column
if field == 'personal_details':
data = json.loads(data)
if isinstance(data, dict):
extra = unwrap_dict(data)
res.update(extra)
continue
res[field] = data
all_data.append(res)
I am new to Pythonland and I have a question. I have a list as below and want to convert it into a dataframe.
I read on Stackoverflow that it is better to create a dictionary then a list so I create one as follows.
column_names = ["name", "height" , "weight", "grade"] # Actual list has 10 entries
row_names = ["jack", "mick", "nick","pick"]
data = ['100','50','A','107','62','B'] # The actual list has 1640 entries
dic = {key:[] for key in column_names}
dic['name'] = row_names
t = 0
while t< len(data):
dic['height'].append(data[t])
t = t+3
t = 1
while t< len(data):
dic['weight'].append(data[t])
t = t+3
So on and so forth, I have 10 columns so I wrote above code 10 times to complete the full dictionary. Then i convert
it to dataframe. It works perfectly fine, there has to
be a way to do this in shorter way. I don't know how to refer to key of a dictionary with a number. Should it be wrapped to a function. Also, how can I automate adding one to value of t before executing the next loop? Please help me.
You can iterate through columnn_names like this:
dic = {key:[] for key in column_names}
dic['name'] = row_names
for t, column_name in enumerate(column_names):
i = t
while i< len(data):
dic[column_name].append(data[i])
i += 3
Enumerate will automatically iterate through t form 0 to len(column_names)-1
i = 0
while True:
try:
for j in column_names:
d[j].append(data[i])
i += 1
except Exception as er: #So when i value exceed by data list it comes to exception and it will break the loop as well
print(er, "################")
break
The first issue that you have all columns data concatenated to a single list. You should first investigate how to prevent it and have list of lists with each column values in a separate list like [['100', '107'], ['50', '62'], ['A', 'B']]. Any way you need this data structure to proceed efficiently:
cl_count = len(column_names)
d_count = len(data)
spl_data = [[data[j] for j in range(i, d_count, cl_count)] for i in range(cl_count)]
Then you should use dict comprehension. This is a 3.x Python feature so it will not work in Py 2.x.
df = pd.DataFrame({j: spl_data[i] for i, j in enumerate(column_names)})
First, we should understand how an ideal dictionary for a dataframe should look like.
A Dataframe can be thought of in two different ways:
One is a traditional collection of rows..
'row 0': ['jack', 100, 50, 'A'],
'row 1': ['mick', 107, 62, 'B']
However, there is a second representation that is more useful, though perhaps not as intuitive at first.
A collection of columns:
'name': ['jack', 'mick'],
'height': ['100', '107'],
'weight': ['50', '62'],
'grade': ['A', 'B']
Now, here is the key thing to realise, the 2nd representation is more useful
because that is the representation interally supported and used in dataframes.
It does not run into conflict of datatype within a single grouping (each column needs to have 1 fixed datatype)
Across a row representation however, datatypes can vary.
Also, operations can be performed easily and consistently on an entire column
because of this consistency that cant be guaranteed in a row.
So, tl;dr DataFrames are essentially collections of equal length columns.
So, a dictionary in that representation can be easily converted into a DataFrame.
column_names = ["name", "height" , "weight", "grade"] # Actual list has 10 entries
row_names = ["jack", "mick"]
data = [100, 50,'A', 107, 62,'B'] # The actual list has 1640 entries
So, With that in mind, the first thing to realize is that, in its current format, data is a very poor representation.
It is a collection of rows merged into a single list.
The first thing to do, if you're the one in control of how data is formed, is to not prepare it this way.
The goal is a list for each column, and ideally, prepare the list in that format.
Now, however, if it is given in this format, you need to iterate and collect the values accordingly. Here's a way to do it
column_names = ["name", "height" , "weight", "grade"] # Actual list has 10 entries
row_names = ["jack", "mick"]
data = [100, 50,'A', 107, 62,'B'] # The actual list has 1640 entries
dic = {key:[] for key in column_names}
dic['name'] = row_names
print(dic)
Output so far:
{'height': [],
'weight': [],
'grade': [],
'name': ['jack', 'mick']} #so, now, names are a column representation with all correct values.
remaining_cols = column_names[1:]
#Explanations for the following part given at the end
data_it = iter(data)
for row in zip(*([data_it] * len(remaining_cols))):
for i, val in enumerate(row):
dic[remaining_cols[i]].append(val)
print(dic)
Output:
{'name': ['jack', 'mick'],
'height': [100, 107],
'weight': [50, 62],
'grade': ['A', 'B']}
And we are done with the representation
Finally:
import pd
df = pd.DataFrame(dic, columns = column_names)
print(df)
name height weight grade
0 jack 100 50 A
1 mick 107 62 B
Edit:
Some explanation for the zip part:
zip takes any iterables and allows us through iterate through them together.
data_it = iter(data) #prepares an iterator.
[data_it] * len(remaining_cols) #creates references to the same iterator
Here, this is similar to [data_it, data_it, data_it]
The * in *[data_it, data_it, data_it] allows us to unpack the list into 3 arguments for the zip function instead
so, f(*[data_it, data_it, data_it]) is equivalent to f(data_it, data_it, data_it) for any function f.
the magic here is that traversing through an iterator/advancing an iterator will now reflect the change across all references
Putting it all together:
zip(*([data_it] * len(remaining_cols))) will actually allow us to take 3 items from data at a time, and assign it to row
So, row = (100, 50, 'A') in first iteration of zip
for i, val in enumerate(row): #just iterate through the row, keeping index too using enumerate
dic[remaining_cols[i]].append(val) #use indexes to access the correct list in the dictionary
Hope that helps.
If you are using Python 3.x, as suggested by l159, you can use a comprehension dict and then create a Pandas DataFrame out of it, using the names as row indexes:
data = ['100', '50', 'A', '107', '62', 'B', '103', '64', 'C', '105', '78', 'D']
column_names = ["height", "weight", "grade"]
row_names = ["jack", "mick", "nick", "pick"]
df = pd.DataFrame.from_dict(
{
row_label: {
column_label: data[i * len(column_names) + j]
for j, column_label in enumerate(column_names)
} for i, row_label in enumerate(row_names)
},
orient='index'
)
Actually, the intermediate dictionary is a nested dictionary: the keys of the outer dictionary are the row labels (in this case the items of the row_names list); the value associated with each key is a dictionary whose keys are the column labels (i.e., the items in column_names) and values are the correspondent elements in the data list.
The function from_dict is used to create the DataFrame instance.
So, the previous code produces the following result:
height weight grade
jack 100 50 A
mick 107 62 B
nick 103 64 C
pick 105 78 D
Currently i am having an question in python pandas. I want to filter a dataframe using url query string dynamically.
For eg:
CSV:
url: http://example.com/filter?Name=Sam&Age=21&Gender=male
Hardcoded:
filtered_data = data[
(data['Name'] == 'Sam') &
(data['Age'] == 21) &
(data['Gender'] == 'male')
];
I don't want to hard code the filter keys like before because the csv file changes anytime with different column headers.
Any suggestions
The easiest way to create this filter dynamically is probably to use np.all.
For example:
import numpy as np
query = {'Name': 'Sam', 'Age': 21, 'Gender': 'male'}
filters = [data[k] == v for k, v in query.items()]
filter_data = data[np.all(filters, axis=0)]
use df.query. For example
df = pd.read_csv(url)
conditions = "Name == 'Sam' and Age == 21 and Gender == 'Male'"
filtered_data = df.query(conditions)
You can build the conditions string dynamically using string formatting like
conditions = " and ".join("{} == {}".format(col, val)
for col, val in zip(df.columns, values)
Typically, your web framework will return the arguments in a dict-like structure. Let's say your args are like this:
args = {
'Name': ['Sam'],
'Age': ['21'], # Note that Age is a string
'Gender': ['male']
}
You can filter your dataset successively like this:
for key, values in args.items():
data = data[data[key].isin(values)]
However, this is likely not to match any data for Age, which may have been loaded as an integer. In that case, you could load the CSV file as a string via pd.read_csv(filename, dtype=object), or convert to string before comparison:
for key, values in args.items():
data = data[data[key].astype(str).isin(values)]
Incidentally, this will also match multiple values. For example, take the URL http://example.com/filter?Name=Sam&Name=Ben&Age=21&Gender=male -- which leads to the structure:
args = {
'Name': ['Sam', 'Ben'], # There are 2 names
'Age': ['21'],
'Gender': ['male']
}
In this case, both Ben and Sam will be matched, since we're using .isin to match.
Total Python noob here, probably missing something obvious. I've searched everywhere and haven't found a solution yet, so I thought I'd ask for some help.
I'm trying to write a function that will build a nested dictionary from a large csv file. The input file is in the following format:
Product,Price,Cost,Brand,
blue widget,5,4,sony,
red widget,6,5,sony,
green widget,7,5,microsoft,
purple widget,7,6,microsoft,
etc...
The output dictionary I need would look like:
projects = { `<Brand>`: { `<Product>`: { 'Price': `<Price>`, 'Cost': `<Cost>` },},}
But obviously with many different brands containing different products. In the input file, the data is ordered alphabetically by brand name, but I know that it becomes unordered as soon as DictReader executes, so I definitely need a better way to handle the duplicates. The if statement as written is redundant and unnecessary.
Here's the non-working, useless code I have so far:
def build_dict(source_file):
projects = {}
headers = ['Product', 'Price', 'Cost', 'Brand']
reader = csv.DictReader(open(source_file), fieldnames = headers, dialect = 'excel')
current_brand = 'None'
for row in reader:
if Brand != current_brand:
current_brand = Brand
projects[Brand] = {Product: {'Price': Price, 'Cost': Cost}}
return projects
source_file = 'merged.csv'
print build_dict(source_file)
I have of course imported the csv module at the top of the file.
What's the best way to do this? I feel like I'm way off course, but there is very little information available about creating nested dicts from a CSV, and the examples that are out there are highly specific and tend not to go into detail about why the solution actually works, so as someone new to Python, it's a little hard to draw conclusions.
Also, the input csv file doesn't normally have headers, but for the sake of trying to get a working version of this function, I manually inserted a header row. Ideally, there would be some code that assigns the headers.
Any help/direction/recommendation is much appreciated, thanks!
import csv
from collections import defaultdict
def build_dict(source_file):
projects = defaultdict(dict)
headers = ['Product', 'Price', 'Cost', 'Brand']
with open(source_file, 'rb') as fp:
reader = csv.DictReader(fp, fieldnames=headers, dialect='excel',
skipinitialspace=True)
for rowdict in reader:
if None in rowdict:
del rowdict[None]
brand = rowdict.pop("Brand")
product = rowdict.pop("Product")
projects[brand][product] = rowdict
return dict(projects)
source_file = 'merged.csv'
print build_dict(source_file)
produces
{'microsoft': {'green widget': {'Cost': '5', 'Price': '7'},
'purple widget': {'Cost': '6', 'Price': '7'}},
'sony': {'blue widget': {'Cost': '4', 'Price': '5'},
'red widget': {'Cost': '5', 'Price': '6'}}}
from your input data (where merged.csv doesn't have the headers, only the data.)
I used a defaultdict here, which is just like a dictionary but when you refer to a key that doesn't exist instead of raising an Exception it simply makes a default value, in this case a dict. Then I get out -- and remove -- Brand and Product, and store the remainder.
All that's left I think would be to turn the cost and price into numbers instead of strings.
[modified to use DictReader directly rather than reader]
Here I offer another way to satisfy your requirement(different from DSM)
Firstly, this is my code:
import csv
new_dict={}
with open('merged.csv','rb')as csv_file:
data=csv.DictReader(csv_file,delimiter=",")
for row in data:
dict_brand=new_dict.get(row['Brand'],dict())
dict_brand[row['Product']]={k:row[k] for k in ('Cost','Price')}
new_dict[row['Brand']]=dict_brand
print new_dict
Briefly speaking, the main point to solve is to figure out what the key-value pairs are in your requirements. According to your requirement,it can be called as a 3-level-dict,here the key of first level is the value of Brand int the original dictionary, so I extract it from the original csv file as
dict_brand=new_dict.get(row['Brand'],dict())
which is going to judge if there exists the Brand value same as the original dict in our new dict, if yes, it just inserts, if no, it creates, then maybe the most complicated part is the second level or middle level, here you set the value of Product of original dict as the value of the new dict of key Brand, and the value of Product is also the key of the the third level dict which has Price and Cost of the original dict as the value,and here I extract them like:
dict_brand[row['Product']]={k:row[k] for k in ('Cost','Price')}
and finally, what we need to do is just set the created 'middle dict' as the value of our new dict which has Brand as the key.
Finally, the output is
{'sony': {'blue widget': {'Price': '5', 'Cost': '4'},
'red widget': {'Price': '6', 'Cost': '5'}},
'microsoft': {'purple widget': {'Price': '7', 'Cost': '6'},
'green widget': {'Price': '7', 'Cost': '5'}}}
That's that.
I am having a bit of trouble getting started on an assignment. We are issued a tab delineated .txt file with 6 columns of data and around 50 lines of this data. I need help starting a list to store this data in for later recall. Eventually I will need to be able to list all the contents of any particular column and sort it, count it, etc. Any help would be appreciated.
Edit; I really haven't done much besides research on this kinda stuff, I know ill be looking into csv, and i have done single column .txt files before but im not sure how to tackle this situation. How will I give names to the separate columns? how will I tell the program when one row ends and the next begins?
The dataframe structure in Pandas basically does exactly what you want. It's highly analogous to the data frame in R if you're familiar with that. It has built in options for subsetting, sorting, and otherwise manipulating tabular data.
It reads directly from csv and even automatically reads in column names. You'd call:
read_csv(yourfilename,
sep='\t', # makes it tab delimited
header=1) # makes the first row the header row.
Works in Python 3.
Let's say you have a csv like the following.
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
You can read them into a dictionary like so:
>>> import csv
>>> reader = csv.DictReader(open('test.csv','r'), fieldnames= ['col1', 'col2', 'col3', 'col4', 'col5', 'col6'], dialect='excel-tab')
>>> for row in reader:
... print row
{'col6': '6', 'col4': '4', 'col5': '5', 'col2': '2', 'col3': '3', 'col1': '1'}
{'col6': '6', 'col4': '4', 'col5': '5', 'col2': '2', 'col3': '3', 'col1': '1'}
{'col6': '6', 'col4': '4', 'col5': '5', 'col2': '2', 'col3': '3', 'col1': '1'}
{'col6': '6', 'col4': '4', 'col5': '5', 'col2': '2', 'col3': '3', 'col1': '1'}
{'col6': '6', 'col4': '4', 'col5': '5', 'col2': '2', 'col3': '3', 'col1': '1'}
But Pandas library might be better suited for this. http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files
Sounds like a job better suited to a database. You should just use something like PostgreSQLs COPY FROM operation to import the CSV data into a table then use python + SQL for all your sorting, searching and matching needs.
If you feel a real database is overkill there's still options like SQLlite and BerkleyDB which both have python modules.
EDIT: BerkelyDB is deprecated but anydbm is similiar in concept.
I think using a db for 50 lines and 6 colums is overkill, so here's my idea:
from __future__ import print_function
import os
from operator import itemgetter
def get_records_from_file(path_to_file):
"""
Read a tab-deliminated file and return a
list of dictionaries representing the data.
"""
records = []
with open(path_to_file, 'r') as f:
# Use the first line to get names for columns
fields = [e.lower() for e in f.readline().split('\t')]
# Iterate over the rest of the lines and store records
for line in f:
record = {}
for i, field in enumerate(line.split('\t')):
record[fields[i]] = field
records.append(record)
return records
if __name__ == '__main__':
path = os.path.join(os.getcwd(), 'so.txt')
records = get_records_from_file(path)
print('Number of records: {0}'.format(len(records)))
s = sorted(records, key=itemgetter('id'))
print('Sorted: {0}'.format(s))
For storing records for later use, look into Python's pickle library--that'll allow you to preserve them as Python objects.
Also, note I don't have Python 3 installed on the computer I'm using right now, but I'm pretty sure this'll work on Python 2 or 3.