How to read a CSV file into a dictionary - Python

I have a csv file as below:
a,green
a,red
a,blue
b,white
b,black
b,brown
I want to read it into a Python dictionary like this:
{'a':{'green','red','blue'},'b':{'white','black','brown'}}
How can I do this?

I will assume you want a dict with a list of values per key
{'a':['green','red','blue'],'b':['white','black','brown']}
If so, a quick workaround could be something like this:
import csv
# Get all the rows in the format [[k, v], ...]
with open('file_path_here', 'r') as f:
    rows = list(csv.reader(f))
# Get all the unique keys
keys = set(r[0] for r in rows)
# Get a list of values for the given key
def get_values_list(key, _rows):
    return [r[1] for r in _rows if r[0] == key]
# Generate the dict
keys_dict = dict((k, get_values_list(k, rows)) for k in keys)
print(keys_dict)
But I'm pretty sure this solution has a lot of room for improvement if you spend some time on it.
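If the goal is the set-valued dictionary shown in the question, one possible improvement is to build it in a single pass with collections.defaultdict instead of rescanning the rows for every key. The following is a minimal sketch: the function name read_groups and the in-memory sample are illustrative and not part of the original answer.

```python
import csv
import io
from collections import defaultdict

def read_groups(lines):
    """Group the second column by the first column in one pass."""
    groups = defaultdict(set)
    for row in csv.reader(lines):
        if len(row) < 2:  # skip blank or malformed lines
            continue
        key, value = row[0], row[1]
        groups[key].add(value)
    return dict(groups)

# Hypothetical stand-in for open('file_path_here', newline='')
sample = io.StringIO("a,green\na,red\na,blue\nb,white\nb,black\nb,brown\n")
result = read_groups(sample)
print(result)
```

Replacing defaultdict(set) and add with defaultdict(list) and append would give the list-valued form assumed in the answer above.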

Related

Python: Remove value from dictionary and multiply remaining two

I'm working with a large CSV file where each row includes the date and two values. I'm trying to set up a dictionary with the date as the key for the two values. I then need to multiply the two values of each key and record the answer. I have 3000 rows in the file.
Sample:
So far I have the date set as the key for each pair of values. However, it's also reusing the date as a third value for each key. Is there a way to remove this?
Once I've removed it, is there a way to multiply the two values by each other for each key?
This is my code so far:
main_file = "newnoblanks.csv"
import csv
import collections
import pprint
with open(main_file) as fp:
    root = csv.reader(fp, delimiter=',')
    result = collections.defaultdict(list)
    for row in root:
        date = row[0].split(",")[0]
        result[date].append(row)
print ("Result:-")
pprint.pprint(result)
This is my output:
I don't think you even need to use a defaultdict here; just assign the whole row (minus the date) to the key of the dict. You should be able to do
with open(main_file) as fp:
    root = csv.reader(fp, delimiter=',')
    result = dict()
    for row in root:
        date = row[0].split(",")[0]
        result[date] = row[1:]
If you want to get the product of the two values, you could do something like
from functools import reduce
for key in result:
    # the csv module yields strings, so convert to float before multiplying
    result[key] = reduce(lambda x, y: x*y, map(float, result[key]))
I know this has been answered, but I feel there is an alternative worth considering:
import csv
from pprint import pprint
with open('newnoblanks.csv') as fp:
    root = csv.reader(fp)
    result = dict((date, float(a) * float(b)) for date, a, b in root)
pprint(result)
With the following data file:
19/08/2004,49.8458,44994500
20/08/2004,53.80505,23005800
23/08/2004,54.34653,18393200
The output is:
{'19/08/2004': 2242786848.1,
'20/08/2004': 1237828219.29,
'23/08/2004': 999606595.5960001}

Using defaultdict to append a list from an .xlsx file

I'm trying to take an excel file with two fields, ID and xy coordinates and create a dictionary so that each ID is a key to all of the xy coordinate values.
For example, the Excel file looks like this:
[screenshot: http://i.stack.imgur.com/P2agc.png]
but there are more than 900 oID values.
I want the final format to be something like,
[('0',[-121.129247,37.037939,-121.129247,37.037939,-121.056516,36.997779]),
('1',[all, the, coordinates, with, oID, of, 1]), ('2',[all, the, coordinates, with, oID, of, 2]), etc.]
I am trying to use a for statement to iterate through the Excel sheet to populate a list with the first 200 rows, and then put that into a defaultdict.
Here is what I have done so far:
wb=openpyxl.load_workbook('simpleCoordinate.xlsx')
sheet=wb['Sheet1']
from collections import defaultdict
CoordDict = defaultdict(list)
for i in range (1,201,1):
    coordinate_list=[(sheet.cell(row=i,column=1).value, sheet.cell(row=i, column=2).value)]
for oID, xy in coordinate_list:
    CoordDict[oID].append(xy)
print(list(CoordDict.items()))
which returns:
[(11, ['-121.177487,35.49885'])]
That is only the 200th line of the Excel sheet, rather than the whole thing. I'm not sure what I'm doing wrong. Is it something with the for statement? Am I thinking about this in the wrong way? I'm a total newbie to Python, so any advice would be helpful!
You are overwriting coordinate_list 200 times. Instead, create it once before the loop, then append to it with the += operator.
import openpyxl
from collections import defaultdict
wb=openpyxl.load_workbook('simpleCoordinate.xlsx')
sheet=wb['Sheet1']
coordinate_list = list()
for i in range (1,201,1):
    coordinate_list += [(sheet.cell(row=i,column=1).value, sheet.cell(row=i, column=2).value)]
coord_dict = defaultdict(list)
for oid, xy in coordinate_list:
    # append rather than assign, so earlier coordinates for the same ID are kept
    coord_dict[oid].append(xy)
print(list(coord_dict.items()))
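To get closer to the flat list of numbers per ID that the question asks for, the "x,y" strings can be split and converted while grouping. This is only a sketch under assumptions: the helper name group_coordinates is made up, and the sample rows stand in for the (column 1, column 2) pairs read from the sheet.

```python
from collections import defaultdict

def group_coordinates(rows):
    """Map each oID to a flat list of floats parsed from its 'x,y' strings."""
    grouped = defaultdict(list)
    for oid, xy in rows:
        if oid is None or xy is None:  # empty spreadsheet cells
            continue
        grouped[str(oid)].extend(float(part) for part in xy.split(","))
    return dict(grouped)

# Hypothetical rows, shaped like (column 1 value, column 2 value)
rows = [
    (0, "-121.129247,37.037939"),
    (0, "-121.056516,36.997779"),
    (11, "-121.177487,35.49885"),
]
coords = group_coordinates(rows)
print(list(coords.items()))
```

Keeping the grouping separate from the spreadsheet access means the same function works whether the pairs come from openpyxl or from any other source.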

How to create separate Pandas DataFrames for each CSV file and give them meaningful names?

I've searched thoroughly and can't quite find the guidance I am looking for on this issue so I hope this question is not redundant. I have several .csv files that represent raster images. I'd like to perform some statistical analysis on them so I am trying to create a Pandas dataframe for each file so I can slice 'em dice 'em and plot 'em...but I am having trouble looping through the list of files to create a DF with a meaningful name for each file.
Here is what I have so far:
import glob
import os
from pandas import *
#list of .csv files
#I'd like to turn each file into a dataframe
dataList = glob.glob(r'C:\Users\Charlie\Desktop\Qvik\textRasters\*.csv')
#name that I'd like to use for each data frame
nameList = []
for raster in dataList:
    path_list = raster.split(os.sep)
    name = path_list[6][:-4]
    nameList.append(name)
#zip these lists into a dict
dataDct = {}
for k, v in zip(nameList,dataList):
    dataDct[k] = dataDct.get(k,"") + v
dataDct
So now I have a dict where the key is the name I want for each dataframe and the value is the path for read_csv(path):
{'Aspect': 'C:\\Users\\Charlie\\Desktop\\Qvik\\textRasters\\Aspect.csv',
'Curvature': 'C:\\Users\\Charlie\\Desktop\\Qvik\\textRasters\\Curvature.csv',
'NormalZ': 'C:\\Users\\Charlie\\Desktop\\Qvik\\textRasters\\NormalZ.csv',
'Slope': 'C:\\Users\\Charlie\\Desktop\\Qvik\\textRasters\\Slope.csv',
'SnowDepth': 'C:\\Users\\Charlie\\Desktop\\Qvik\\textRasters\\SnowDepth.csv',
'Vegetation': 'C:\\Users\\Charlie\\Desktop\\Qvik\\textRasters\\Vegetation.csv',
'Z': 'C:\\Users\\Charlie\\Desktop\\Qvik\\textRasters\\Z.csv'}
My instinct was to try variations of this:
for k, v in dataDct.iteritems():
    k = read_csv(v)
but that leaves me with a single dataframe, 'k', that is filled with data from the last file read in by the loop.
I'm probably missing something fundamental here but I am starting to spin my wheels on this so I thought I'd ask y'all...any ideas are appreciated!
Cheers.
Are you trying to get all of the data frames separately in a dictionary, one data frame per key? If so, this will leave you with the dict you showed, but with the data frame itself, rather than the file path, stored under each key.
dataDct = {}
for k, v in zip(nameList,dataList):
    dataDct[k] = read_csv(v)
So now, you could do this for example:
dataDct['SnowDepth'][['cola','colb']].plot()
It's unclear why you're overwriting your object here. I think you want either a list or a dict of the dfs:
df_list=[]
for k, v in dataDct.items():
    df_list.append(read_csv(v))
or
df_dict={}
for k, v in dataDct.items():
    df_dict[k] = read_csv(v)
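One fragile spot in the question's code is path_list[6][:-4], which depends on the directory depth and on a three-letter extension. A sketch of a more robust way to derive the names, assuming Windows-style paths like those in the question (the helper name frame_names is hypothetical):

```python
from pathlib import PureWindowsPath

def frame_names(paths):
    """Map each file's base name (without extension) to its full path."""
    return {PureWindowsPath(p).stem: p for p in paths}

paths = [
    r"C:\Users\Charlie\Desktop\Qvik\textRasters\Aspect.csv",
    r"C:\Users\Charlie\Desktop\Qvik\textRasters\SnowDepth.csv",
]
names = frame_names(paths)
print(sorted(names))
```

Each path can then be passed to read_csv as in the answers above, for example {name: read_csv(path) for name, path in names.items()}.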

list of dicts to memory for csv reader

I have a list of dictionaries. For example
l = [{'date': '2014-01-01', 'spent': '$3'}, {'date': '2014-01-02', 'spent': '$5'}]
I want to make this CSV-like, so that I can save it as a CSV if I want to.
I have other code that gets a list by calling splitlines() so I can use csv methods.
For example:
reader = csv.reader(resp.read().splitlines(), delimiter=',')
How can I change my list of dictionaries into a list that looks like a CSV file?
I've been trying to cast the dictionary to a string, but haven't had much luck. It should be something like
"date", "spent"
"2014-01-01", "$3"
"2014-01-02", "$5"
This will also help me print out the list of dictionaries in a nice way for the user.
Update
This is the function I have which made me want to have the list of dicts:
def get_daily_sum(resp):
    rev = ['revenue', 'estRevenue']
    reader = csv.reader(resp.read().splitlines(), delimiter=',')
    first = reader.next()
    for i, val in enumerate(first):
        if val in rev:
            place = i
            break
    else:
        place = None
    if place:
        total = sum(float(r[place]) for r in reader)
    else:
        total = 'Not available'
    return total
so I wanted to total up a column from a list of column names. The problem was that the "revenue" column was not always in the same place.
Is there a better way? I have one object that returns a csv like string, and the other a list of dicts.
You would want to use csv.DictWriter to write the file.
with open('outputfile.csv', 'w', newline='') as fout:
    writer = csv.DictWriter(fout, ['date', 'spent'])
    writer.writeheader()
    for dct in lst_of_dict:
        writer.writerow(dct)
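Since the question asks for a CSV-like result in memory rather than on disk, the same csv.DictWriter can write into an io.StringIO buffer. A minimal sketch, where the function name dicts_to_csv_text is an assumption made for illustration:

```python
import csv
import io

def dicts_to_csv_text(rows, fieldnames):
    """Render a list of dicts as CSV text without touching the filesystem."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

records = [
    {"date": "2014-01-01", "spent": "$3"},
    {"date": "2014-01-02", "spent": "$5"},
]
text = dicts_to_csv_text(records, ["date", "spent"])
print(text)
```

The returned string can be shown to the user, saved later, or fed back into csv.reader via text.splitlines(), which matches the pattern the question already uses.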
A solution using a list comprehension, which should work for any number of keys, but only if all your dicts have the same keys.
keys = list(dicts[0].keys())
l = [[d[key] for key in keys] for d in dicts]
To attach key names for column titles:
l = [keys] + l
This will return a list of lists which can be exported to csv:
import csv
with open("data.csv", "w", newline="") as f:
    csv.writer(f).writerows(l)
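For the get_daily_sum function in the question, where the revenue column is not always in the same position, csv.DictReader may be a simpler fit because it looks columns up by name. The sketch below is one possible rewrite, not the asker's code: the name daily_sum, the candidates parameter, and the sample lines are assumptions, and it returns None rather than 'Not available' when no column matches.

```python
import csv

def daily_sum(lines, candidates=("revenue", "estRevenue")):
    """Sum the first candidate column that appears in the header."""
    reader = csv.DictReader(lines)
    fieldnames = reader.fieldnames or []
    column = next((name for name in candidates if name in fieldnames), None)
    if column is None:
        return None  # no revenue-like column in this file
    return sum(float(row[column]) for row in reader)

# Hypothetical input, shaped like resp.read().splitlines()
sample = ["date,clicks,revenue", "2014-01-01,10,3.5", "2014-01-02,12,5.0"]
total = daily_sum(sample)
print(total)
```

Looking the column up by name also sidesteps the case where the revenue column is the first one, which an index-based truthiness check would treat as missing.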

Need more efficient way to parse out csv file in Python

Here's a sample csv file
id, serial_no
2, 500
2, 501
2, 502
3, 600
3, 601
This is the output I'm looking for (a list of serial_no values within a list of ids):
[2, [500,501,502]]
[3, [600, 601]]
I have implemented my solution but it's too much code and I'm sure there are better solutions out there. Still learning Python and I don't know all the tricks yet.
file = 'test.csv'
data = csv.reader(open(file))
fields = data.next()
for row in data:
    each_row = []
    each_row.append(row[0])
    each_row.append(row[1])
    zipped_data.append(each_row)
for rec in zipped_data:
    if rec[0] not in ids:
        ids.append(rec[0])
for id in ids:
    for rec in zipped_data:
        if rec[0] == id:
            ser_no.append(rec[1])
    tmp.append(id)
    tmp.append(ser_no)
    print tmp
    tmp = []
    ser_no = []
(I've omitted variable initialization for simplicity.)
print tmp gives me the output I mentioned above. I know there's a better, more Pythonic way to do this. It's just too messy! Any suggestions would be great!
import csv
from collections import defaultdict
records = defaultdict(list)
file = 'test.csv'
data = csv.reader(open(file))
fields = next(data)
for row in data:
    records[row[0]].append(row[1])
#sorting by ids since keys don't maintain order
results = sorted(records.items(), key=lambda x: x[0])
print(results)
If the list of serial_nos needs to be unique, just replace defaultdict(list) with defaultdict(set) and records[row[0]].append(row[1]) with records[row[0]].add(row[1]).
Instead of a list, I'd make it a collections.defaultdict(list), and then just call the append() method on the value.
result = collections.defaultdict(list)
for row in data:
    result[row[0]].append(row[1])
Here's a version I wrote, though it looks like there are plenty of answers for this one already.
You might like using csv.DictReader, which gives you easy access to each column by field name (from the header / first line).
#!/usr/bin/python
import csv
myFile = open('sample.csv', newline='')
# first row will be used for field names (by default)
# skipinitialspace drops the space after each comma in the sample file
csvFile = csv.DictReader(myFile, skipinitialspace=True)
myData = {}
for myRow in csvFile:
    myId = myRow['id']
    if myId not in myData: myData[myId] = []
    myData[myId].append(myRow['serial_no'])
for myId in sorted(myData):
    print('%s %s' % (myId, myData[myId]))
myFile.close()
Some observations:
0) file is a built-in (a synonym for open), so it's a poor choice of name for a variable. Further, the variable actually holds a file name, so...
1) The file can be closed as soon as we're done reading from it. The easiest way to accomplish that is with a with block.
2) The first loop appears to go over all the rows, grab the first two elements from each, and make a list with those results. However, your rows already all contain only two elements, so this has no net effect. The CSV reader is already an iterator over rows, and the simple way to create a list from an iterator is to pass it to the list constructor.
3) You proceed to make a list of unique ID values, by manually checking. A list of unique things is better known as a set, and the Python set automatically ensures uniqueness.
4) You have the name zipped_data for your data. This is telling: applying zip to the list of rows would produce a list of columns - and the IDs are simply the first column, transformed into a set.
5) We can use a list comprehension to build the list of serial numbers for a given ID. Don't tell Python how to make a list; tell it what you want in it.
6) Printing the results as we get them is kind of messy and inflexible; better to create the entire chunk of data (then we have code that creates that data, so we can do something else with it other than just printing it and forgetting it).
Applying these ideas, we get:
import csv
filename = 'test.csv'
with open(filename) as in_file:
    data = csv.reader(in_file)
    next(data) # ignore the field labels
    rows = list(data) # read the rest of the rows from the iterator
print([
    # We want a list of all serial numbers from rows with a matching ID...
    [serial_no for row_id, serial_no in rows if row_id == id]
    # for each of the IDs that there is to match, which come from making
    # a set from the first column of the data.
    for id in set(list(zip(*rows))[0])
])
We can probably do even better than this by using the groupby function from the itertools module.
An example using itertools.groupby. This only works if the rows are already grouped by id.
from csv import DictReader
from itertools import groupby
from operator import itemgetter
filename = 'test.csv'
# the context manager ensures that infile is closed when it goes out of scope
with open(filename) as infile:
    # group by id - this requires that the rows are already grouped by id
    groups = groupby(DictReader(infile), key=itemgetter('id'))
    # loop through the groups printing a list for each one
    for i,j in groups:
        print([i, list(map(itemgetter(' serial_no'), j))])
Note the space in front of ' serial_no'. This is because of the space after the comma in the input file.
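If the rows are not guaranteed to arrive grouped, sorting them by id before calling groupby removes that precondition, and skipinitialspace=True avoids the leading-space field name. A sketch under those assumptions (the function name group_serials and the shuffled sample are illustrative):

```python
import csv
import io
from itertools import groupby

def row_id(row):
    """Numeric id used both for sorting and for grouping."""
    return int(row["id"])

def group_serials(lines):
    """Return [[id, [serial_no, ...]], ...] ordered by id."""
    reader = csv.DictReader(lines, skipinitialspace=True)
    rows = sorted(reader, key=row_id)  # groupby only merges adjacent rows
    return [
        [key, [int(row["serial_no"]) for row in group]]
        for key, group in groupby(rows, key=row_id)
    ]

# Hypothetical input with the ids deliberately interleaved
sample = io.StringIO("id, serial_no\n3, 600\n2, 500\n2, 501\n3, 601\n2, 502\n")
grouped = group_serials(sample)
print(grouped)
```

Because Python's sort is stable, serial numbers keep their original relative order within each id.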
