Hey, I wonder if you can help me. I did some research on using eval() to read the lines of my CSV and put them into a dictionary. The problem is that my CSV has 4 pieces of data per line: the name, the first score, the second score and the third score. How would I transfer this data from a CSV into a dictionary within Python, so that later on I can check whether that user's name is in the dictionary and append to it or edit the scores?
I would like the key to be the name, with the scores kept in a list so they can be appended to/deleted from later.
Thanks for your help.
There is a module in the Python standard library that will help you with reading/writing CSV files. Let me assume that your csv file looks like this:
Jim, 45, 78, 90
Mary, 100,98, 99
Molly, 78, 45,46
Mat, 76, 89, 95
Then:
import csv

scores = {}
with open('score.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        scores.setdefault(row[0], []).extend(row[1:])
This will create a dictionary scores with names as keys and a list of scores as values:
{'Mat': [' 76', ' 89', ' 95'], 'Jim': [' 45', ' 78', ' 90'], 'Molly': [' 78', ' 45', '46'], 'Mary': [' 100', '98', ' 99']}
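Note that the scores come through as strings, with the leading spaces from the file preserved. If you would rather store them as numbers, here is a small variation on the same idea (a sketch assuming every score field is an integer):
import csv

scores = {}
with open('score.csv') as f:
    for row in csv.reader(f, skipinitialspace=True):    # drops the space after each comma
        name, *nums = row
        scores.setdefault(name, []).extend(int(n) for n in nums)   # assumes whole-number scores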
import csv
from collections import defaultdict

# Your target is a dictionary {name : [scores]}
scores = defaultdict(list)
with open(csvfilename) as csvfile:
    for row in csv.reader(csvfile):
        scores[row[0]].extend(row[1:])
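One convenience of defaultdict for your follow-up goal (checking for a name so you can append to it): you can append for a new or an existing name without testing membership first. A tiny sketch, with one existing and one made-up name and score:
scores['Jim'].append('88')   # existing name: the score is appended to Jim's list
scores['Zoe'].append('77')   # new name: an empty list is created automatically, then appended to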
I don't think eval is a good tool for this. It is really easy to introduce security vulnerabilities with it, as it will parse and execute whatever you pass it. As an exercise, think about why it may not be okay to execute data from some csv file. Spoiler: your csv file is a serialization format, and the talk "Tom Eastman - Serialization formats are not toys - PyCon 2015" shows the dangers that can lurk there. For bonus insights, look at the source of the collections module we imported defaultdict from and think about why Raymond Hettinger's use of exec there is different from using eval on data.
I don't think eval() is what you want here. eval() reads a string and interprets it as Python code; what you want is simple file I/O manipulation.
import numpy

data = numpy.genfromtxt("filename.csv", delimiter=",", dtype=str)  # non-numpy possibilities available
my_dict = {}
for row in data:
    my_dict[row[0]] = list(row[1:])
If you really, really want to do it with eval: Well, first, you shouldn't, unless you have a very good reason. Just parse the file as CSV, not as Python code. The right way to do that is with the csv module, as in Chris Wesseling's answer (or, if you're already using NumPy or Pandas, using their functions).
But if you really, really, really want to, can you?
Well, sometimes.
The most basic CSV dialect doesn't quote strings, so its lines aren't going to be valid as Python code. And some CSV dialects handle embedded quotes in ways that either aren't valid in Python, or mean something different.
But some dialects do happen to make at least most rows legal, and meaningful, as Python tuple literals consisting of Python str, int, and float literals. And for those dialects, technically, yes, you could parse them with eval, like this:
scores = {}
with open(path) as f:
    for line in f:
        name, *newscores = eval(line)
        scores.setdefault(name, []).extend(newscores)
But again, you shouldn't.
And even if you really, really, really want to do this, you should at least use ast.literal_eval instead; it will handle all the same legal values that eval would, without opening the big gaping security holes (e.g., someone putting __import__('os').system('rm -rf /') in a CSV) and painful-to-debug edge cases.
But even with literal_eval, you don't want it. You want to parse the actual CSV dialect you have, not just treat it as a similar but different language and cross your fingers.
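For illustration only, a sketch of the literal_eval variant mentioned above, on a made-up line from a dialect that quotes its strings (the csv module is still the right tool):
import ast

line = '"Jim", 45, 78, 90'                  # hypothetical row from a quoted-string dialect
name, *newscores = ast.literal_eval(line)   # parses literals only, never executes code
print(name, newscores)                      # Jim [45, 78, 90]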
I'm currently going through the Udacity course on data analysis in python, and we've been using the unicodecsv library.
More specifically we've written the following code which reads a csv file and converts it into a list. Here is the code:
import unicodecsv

def read_csv(filename):
    with open(filename, 'rb') as f:
        reader = unicodecsv.DictReader(f)
        return list(reader)
In order to get my head around this, I'm trying to figure out how the data is represented in the dictionary and the list, and I'm very confused. Can someone please explain it to me?
For example, one thing I don't understand is why the following throws an error:
enrollments['cancel_date']
while the following works fine:
for enrollment in enrollments:
    enrollment['cancel_date'] = parse_date(enrollment['cancel_date'])
Hopefully this question makes sense. I'm just having trouble visualizing how all of this is represented.
Any help would be appreciated.
Thanks.
I too landed here with some troubles related to the course and found this unanswered. I think you have already sorted it out, but I'm answering here so that someone else might find it helpful.
As we all know, dictionaries can be accessed like
dictionary_name['key']
so you would likewise expect
enrollments['cancel_date'] to work as well.
But if you do something like
print enrollments
you will see the structure
[{u'status': u'canceled', u'is_udacity': u'True', ...}, {}, ... {}]
If you notice the brackets, it's a list of dictionaries. You might argue it's a list of lists. Try it:
print enrollments[0][0]
You'll get an error! KeyError.
So it's a collection of dictionaries. How do you access them? Zoom down to any one dictionary (that is, any row of the csv) with enrollments[n].
Now you have a dictionary, and you can freely use the key:
print enrollments[0]['cancel_date']
Now coming to your loop,
for enrollment in enrollments:
    enrollment['cancel_date'] = parse_date(enrollment['cancel_date'])
Here enrollment is the loop variable that takes each element of the iterable enrollments in turn, like enrollments[0], enrollments[1] ... enrollments[n].
So on every iteration enrollment holds one dictionary from enrollments, which is why enrollment['cancel_date'] works while enrollments['cancel_date'] does not.
Lastly I want to add a little more thing which is why I came to the thread.
What is the meaning of the "u" in u'..'? For example: u'cancel_date' = u'11-02-19'.
It means the string is a Unicode string. The u is not part of the string; it is Python notation. Unicode is a standard that defines characters and symbols for all of the world's languages.
This happens mainly because the unicodecsv package does not take on the headache of tracking and converting each item in the csv file; it reads everything as Unicode strings to preserve all characters. That's why Caroline and you defined and used parse_date() and other functions to convert those Unicode strings to the desired datatypes. This is all part of the data wrangling process.
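To make the structure concrete, here is a small sketch with invented values (the field names are the ones visible in the printout above):
# enrollments as unicodecsv.DictReader returns it: a list of dicts, one per csv row
enrollments = [
    {u'status': u'canceled', u'is_udacity': u'True', u'cancel_date': u'2015-01-14'},
    {u'status': u'current', u'is_udacity': u'False', u'cancel_date': u''},
]

print(enrollments[0]['cancel_date'])   # one row, then one field -> u'2015-01-14'
# enrollments['cancel_date']           # TypeError: list indices must be integers, not str

for enrollment in enrollments:
    print(enrollment['status'])        # each enrollment is one dict, so the key lookup works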
Right now, I'm basically running through an excel sheet.
I have about 20 names and then I have 50k total values that match to one of those 20 names, so the excel sheet is 50k rows long, column B showing any random value, and column A showing one of the 20 names.
I'm trying to get a string for each of the names that show all of the values.
Name A: 123,244,123,523,123,5523,12505,142... etc etc.
Name B: 123,244,123,523,123,5523,12505,142... etc etc.
Right now, I created a dictionary and run through the excel sheet, checking if the name is already in the dictionary. If it is, then it does a
strA = strA + "," + foundValue
Then it inserts strA back into the dictionary for that particular name. If the name doesn't exist, it creates that dictionary key and then adds that value to it.
Now, this was working well at first, but it's been about 15 or 20 minutes and only 5k values have been added to the dictionary so far, and it seems to get slower as it keeps running.
I wonder if there is a better or faster way to do this. I was thinking of building a new dictionary every 1k values and then combining them all at the end, but that would be 50 dictionaries total and it sounds complicated... although maybe not, I'm not sure. Maybe it could work better that way; what I have now doesn't seem to.
I DO need the string that shows each value with a comma between each value. That is why I am doing the string thing right now.
There are a number of things that are likely causing your program to run slowly.
String concatenation in python can be extremely inefficient when used with large strings.
Strings in Python are immutable. This fact frequently sneaks up and bites novice Python programmers on the rump. Immutability confers some advantages and disadvantages. In the plus column, strings can be used as keys in dictionaries and individual copies can be shared among multiple variable bindings. (Python automatically shares one- and two-character strings.) In the minus column, you can't say something like, "change all the 'a's to 'b's" in any given string. Instead, you have to create a new string with the desired properties. This continual copying can lead to significant inefficiencies in Python programs.
Considering each string in your example could contain thousands of characters, each time you do a concatenation, python has to copy that giant string into memory to create a new object.
This would be much more efficient:
strings = []
strings.append('string')
strings.append('other_string')
...
','.join(strings)
In your case, instead of each dictionary key storing a massive string, it should store a list, and you would just append each match to the list, and only at the very end would you do a string concatenation using str.join.
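A minimal sketch of that approach, with made-up names and values:
from collections import defaultdict

rows = [("Name A", 123), ("Name B", 244), ("Name A", 523)]   # invented sample data

values_by_name = defaultdict(list)
for name, value in rows:
    values_by_name[name].append(str(value))    # cheap append; no giant string is copied

# join exactly once per name, at the very end
result = {name: ",".join(vals) for name, vals in values_by_name.items()}
print(result)   # {'Name A': '123,523', 'Name B': '244'}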
In addition, printing to stdout is also notoriously slow. If you're printing to stdout on each iteration of your massive 50,000 item loop, each iteration is being held up by the unbuffered write to stdout. Consider only printing every nth iteration, or perhaps writing to a file instead (file writes are normally buffered) and then tailing the file from another terminal.
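For instance, a small sketch of the "print every nth iteration" idea (process() and rows are placeholders for whatever your loop actually does):
for i, row in enumerate(rows):
    process(row)                        # hypothetical per-row work
    if i % 1000 == 0:                   # report progress only occasionally
        print("processed", i, "rows")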
This answer is based on the OP's reply to my comment. I asked what he would do with the dict, suggesting that maybe he doesn't need to build it in the first place. @simon replies:
i add it to an excel sheet, so I take the KEY, which is the name, and put it in A1, then I take the VALUE, which is 1345,345,135,346,3451,35.. etc etc, and put that into A2. then I do the rest of my programming with that information...... but i need those values separated by commas and accessible inside that excel sheet like that!
So it looks like the dict doesn't have to be built after all. Here is an alternative: for each name, create a file, and store those files in a dict:
files = {}
name = 'John'  # let's say
if name not in files:
    files[name] = open(name, 'w')
Then when you loop over the 50k-row excel, you do something like this (pseudo-code):
for row in 50k_rows:
    name, value_string = row.split()  # or whatever
    file = files[name]
    file.write(value_string + ',')  # if it already ends with ',', no need to add
Since your value_string is already comma separated, your file will be csv-like without any further tweaking on your part (except maybe you want to strip the last trailing comma after you're done). Then when you need the values, say, of John, just value = open('John').read().
Now I've never worked with 50k-row excels, but I'd be very surprised if this is not quite a bit faster than what you currently have. Having persistent data is also (well, maybe) a plus.
EDIT:
The approach above is aimed at keeping memory use down. Writing to files is much slower than appending to lists (but probably still faster than recreating many large strings), so prefer lists; if the lists turn out to be huge (which seems likely) and you run into a memory problem (not saying you will), you can fall back to the file approach.
An alternative, similar to lists in performance (at least for the toy test I tried) is to use StringIO:
from io import StringIO  # Python 2: from StringIO import StringIO

string_ios = {'John': StringIO()}  # a dict to store StringIO objects
for value in ['ab', 'cd', 'ef']:
    string_ios['John'].write(value + ',')
print(string_ios['John'].getvalue())
This will output 'ab,cd,ef,'
Instead of building a string that looks like a list, use an actual list and make the string representation you want out of it when you are done.
The proper way is to collect the pieces in lists and join at the end, but if for some reason you want to stick with strings, you can speed up the repeated concatenation: pop the string out of the dict so that there is only one reference to it, which lets CPython's in-place string concatenation optimization kick in.
Demo:
>>> timeit('s = d.pop(k); s = s + "y"; d[k] = s', 'k = "x"; d = {k: ""}')
0.8417842664330237
>>> timeit('s = d[k]; s = s + "y"; d[k] = s', 'k = "x"; d = {k: ""}')
294.2475278390723
It depends on how you have read the excel file, but let's say the lines come in as (name, value) tuples or something similar:
d = {}
for name, foundValue in line_tuples:
    try:
        d[name].append(foundValue)
    except KeyError:
        d[name] = [foundValue]
d = {k: ",".join(v) for k, v in d.items()}
Alternatively using pandas:
import pandas as pd
df = pd.read_excel("some_excel_file.xlsx")
d = df.groupby("A")["B"].apply(lambda x: ",".join(x.astype(str))).to_dict()  # astype(str) in case column B holds numbers
I have a file with over 15k lines, each line having one key and one value. I can modify the file content if any formatting is required for faster reading. Currently I have made the entire file look like a dict and I am doing an eval on it. Is this the best way to read the file, or is there a better approach we can follow? Please suggest.
File mymapfile.txt:
{
'a':'this',
'b':'that',
.
.
.
.
'xyz':'message can have "special" char %s etc '
}
and on this file i am doing eval
f_read = eval(open('mymapfile.txt', 'r').read())
My concern is that my file keeps growing, and values can have quotes, special chars, etc. where we would need to wrap the value in ''' or """. With the dictionary format, even a small syntax error will make eval fail. So is it better to use readlines() without making the file a dict and then build the dict, or is eval faster if we keep the dict in the file? With readlines I can simply write the text on each line split with ':' and not worry about any special characters.
File for readlines:
a:this
b:that
.
.
.
.
xyz:message can have "special" char %s etc
Mahesh24's answer returns a set with values that look like dict entries but are not. His variable also shadows the builtin dict. Use these two lines instead:
s = {i.strip() for i in open('ss.txt', 'r').readlines()}
d = {i.split(':', 1)[0]: i.split(':', 1)[1] for i in s}  # split on the first ':' only
d will then be a dict with the read-in values. A bit of thinking could probably get this into a one-liner. The csv module in the Python standard library will give you more options and robustness, and if your data is in any other standard format, using the appropriate standard library is preferable. The two-liner above will, however, give you a quick and dirty way of doing it. You can change the ':' to a comma or whatever separator your data has.
Assuming you'll stick to json you might want to take a look at ultrajson. It seems to be very fast (even if with a memory penalty) at dumping and loading data.
Here are two articles that have some benchmarks and might help you make a decision:
https://medium.com/@jyotiska/json-vs-simplejson-vs-ujson-a115a63a9e26
http://jmoiron.net/blog/python-serialization/
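If you do go the JSON route, here is a minimal sketch using the standard library json (ujson exposes a very similar dumps/loads interface); the file name is made up:
import json

# write the mapping once as JSON instead of a Python dict literal
with open('mymapfile.json', 'w') as f:
    json.dump({'a': 'this', 'b': 'that'}, f)

# load it back without eval
with open('mymapfile.json') as f:
    mapping = json.load(f)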
Please avoid eval if you only want to load data.
All you need is to read the lines and recognize the key and the value, so your proposed file format:
a:this
b:that
...
is fully suitable.
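A small sketch of reading that format into a dict (assuming the key never contains ':' and each entry fits on one line; the file name is taken from the question):
mapping = {}
with open('mymapfile.txt') as f:
    for line in f:
        line = line.rstrip('\n')
        if not line:
            continue                        # skip blank lines
        key, value = line.split(':', 1)     # split on the first ':' only
        mapping[key] = value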
I'm new to programming, and also to this site, so my apologies in advance for anything silly or "newbish" I may say or ask.
I'm currently trying to write a script in python that will take a list of items and write them into a csv file, among other things. Each item in the list is really a list of two strings, if that makes sense. In essence, the format is [[Google, http://google.com], [BBC, http://bbc.co.uk]], but with different values of course.
Within the CSV, I want this to show up as the first item of each list in the first column and the second item of each list in the second column.
This is the part of my code that I need help with:
with open('integration.csv', 'wb') as f:
    writer = csv.writer(f, delimiter=',', dialect='excel')
    writer.writerows(w for w in foundInstances)
For whatever reason, it seems that the delimiter is being ignored. When I open the file in Excel, each cell has one list. Using the old example, each cell would have "Google, http://google.com". I want Google in the first column and http://google.com in the second. So basically "Google" and "http://google.com", and then below that "BBC" and "http://bbc.co.uk". Is this possible?
Within my code, foundInstances is the list in which all the items are contained. As a whole, the script works fine, but I cannot seem to get this last step. I've done a lot of looking around within stackoverflow and the rest of the Internet, but I haven't found anything that has helped me with this last step.
Any advice is greatly appreciated. If you need more information, I'd be happy to provide you with it.
Thanks!
In your code on pastebin, the problem is here:
foundInstances.append(['http://' + str(num) + 'endofsite' + ', ' + desc])
Here, for each row in your data, you create one string that already has a comma in it. That is not what you need for the csv module. The CSV module makes comma-delimited strings out of your data. You need to give it the data as a simple list of items [col1, col2, col3]. What you are doing is ["col1, col2, col3"], which already has packed the data into a string. Try this:
foundInstances.append(['http://' + str(num) + 'endofsite', desc])
I just tested the code you posted with
foundInstances = [[1,2],[3,4]]
and it worked fine. It definitely produces the output csv in the format
1,2
3,4
So I assume that your foundInstances has the wrong format. If you construct the variable in a complex manner, you could try to add
import pdb; pdb.set_trace()
before the actual variable usage in the csv code. This lets you inspect the variable at runtime with the python debugger. See the Python Debugger Reference for usage details.
As a side note, according to the PEP-8 Style Guide, the name of the variable should be found_instances in Python.
I have a textfile that looks like this:
Thomas Edgarson, Berliner Str 4, 13359 Berlin
Madeleine Jones, Müller Str 5, 15992 Karlsruhe
etc...
It's always two words, followed by a comma, then two words and a number, a comma, and the area code and city. There are no exceptions.
I used
f=open("C:\\Users\\xxxxxx\\Desktop\\useradresses.txt", "r")
text=f.readlines()
f.close()
So now I have a list of all the lines. How can I now search for the area codes in these strings? I need to create a dictionary that looks like this:
{'13359':[('Neuss','Wolfgang'),('Juhnke','Harald')]}
Believe me, I've searched, but couldn't find useful information. To me, the whole idea of searching for something like an arbitrary area code in a string is new and I haven't come across it so far.
I would be happy if you could give me some pointers as to where I should look for tutorials or give me an idea where to start.
dic = {}
with open('filename') as file:
    for name, addr, zcode in (i.split(',') for i in file if i.rstrip()):
        dic.setdefault(zcode.split()[0], []).append(name.split())
Further explanation as Sjoerd asked:
I use a generator expression to break each line into 3 variables: name, addr and zcode. Then I split zcode to get the area code number and use it as the dictionary key.
As the dict may not have that key yet, I use the setdefault method, which sets the key to an empty list before appending the split name.
Loop through the file, reading lines, and split by comma. Then, process each part by splitting by space. Then, add the values to a dictionary.
d = {}
for line in open('useradresses.txt', 'r'):
    if line.strip() == '':
        continue
    (name, strasse, plzort) = line.split(',')
    nachname, vorname = name.split()
    plz, ort = plzort.split()
    if plz in d:
        d[plz].append((nachname, vorname))
    else:
        d[plz] = [(nachname, vorname),]
print d
Python has a lot of libraries dealing with string manipulation, which is what this is. You'll be wanting the re library and the shlex library. I'd suggest the following code:
with open("C:\\Users\\xxxxxx\\Desktop\\useradresses.txt", "r") as f:
for line in f.readlines():
split = shlex.split(line)
mydict[split[6]] = [(split[0], split[1])]
This won't be perfect, it will overwrite identical zip codes, and drops some values. It should point you in the right direction though.