I'm currently going through the Udacity course on data analysis in Python, and we've been using the unicodecsv library.
More specifically, we've written the following code, which reads a CSV file and converts it into a list. Here is the code:
import unicodecsv

def read_csv(filename):
    with open(filename, 'rb') as f:
        reader = unicodecsv.DictReader(f)
        return list(reader)
To get my head around this, I'm trying to figure out how the data is represented in the dictionary and the list, and I'm very confused. Can someone please explain it to me?
For example, one thing I don't understand is why the following throws an error:
enrollments['cancel_date']
While the following works fine:
for enrollment in enrollments:
    enrollment['cancel_date'] = parse_date(enrollment['cancel_date'])
Hopefully this question makes sense. I'm just having trouble visualizing how all of this is represented.
Any help would be appreciated.
Thanks.
I also ended up here because of some trouble with the course and found this unanswered. You have probably solved it by now, but I'm answering so that someone else might find it helpful.
As we all know, a dictionary is accessed like
dictionary_name['key']
so you might expect that
enrollments['cancel_date'] would also work.
But if you do something like
print enrollments
you will see the structure
[{u'status': u'canceled', u'is_udacity': u'True', ...}, {}, ... {}]
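Notice the brackets: the outer [ ] is a list and each { } inside it is a dictionary, so this is a list of dictionaries. You might argue that it is a list of lists. Try it:
print enrollments[0][0]
You'll get an error: a KeyError, because enrollments[0] is a dictionary and 0 is not one of its keys.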
So it is a collection of dictionaries. How do you access them? Pick out any one dictionary (that is, one row of the CSV) with enrollments[n].
Now you have a dictionary, and you can freely use its keys:
print enrollments[0]['cancel_date']
Now, coming to your loop:
for enrollment in enrollments:
    enrollment['cancel_date'] = parse_date(enrollment['cancel_date'])
Here enrollment is the loop variable, which takes each element of enrollments in turn: enrollments[0], enrollments[1], ... up to the last element.
So on every iteration enrollment holds one dictionary from enrollments, which is why enrollment['cancel_date'] works while enrollments['cancel_date'] does not.
Lastly, I want to add one more thing, which is why I came to this thread.
What is the meaning of "u" in u'..'? Example: u'cancel_date' = u'11-02-19'.
The u prefix means the value is a Unicode string. It is not part of the string; it is Python notation. Unicode is a standard that defines the characters and symbols of all the world's languages.
This happens because the unicodecsv package does not try to work out and convert the type of each item in the CSV file. It reads every value as a Unicode string to preserve all characters. That is why Caroline and you defined and used parse_date() and other functions to convert the Unicode strings to the desired data types. This is all part of the data wrangling process.
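To make the list-of-dictionaries structure concrete, here is a minimal sketch. It assumes Python 3 and uses the standard library csv.DictReader in place of unicodecsv, reading from an in-memory sample instead of the course file. The column names come from the output above, but the row values are made up for illustration.

```python
import csv
import io

# Hypothetical two-row sample; the real course file has more columns.
sample = io.StringIO(
    "status,is_udacity,cancel_date\n"
    "canceled,True,2015-01-14\n"
    "current,False,\n"
)

# Same shape as read_csv(): one dictionary per CSV row, collected in a list.
enrollments = list(csv.DictReader(sample))

print(type(enrollments).__name__)       # list
print(enrollments[0]["cancel_date"])    # 2015-01-14

# Indexing the list itself with a key fails; only its elements are dictionaries.
try:
    enrollments["cancel_date"]
except TypeError as exc:
    print("TypeError:", exc)
```

Every value comes back as a string, including the empty cancel_date of the second row, which is why a conversion function such as parse_date() is still needed.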
Related
I got completely lost in figuring out this problem below. Here is the question:
country_population_data.csv
how the csv looks like
extract only the country name and its population from the CSV file (e.g., 'China', 14442161070)
create an empty dictionary named pop_dict. Then read the country_population_data.csv file as a list of lines.
for each line of the records, extract a tuple of country name and population, then store it in the dictionary.
*requirement: create your own function and use it
The answer should look like:
{'China': 14442161070, 'India': 13934090380, ...
My first approach was making a function to extract the required items from the csv file as a tuple, but somehow it did not work out and gave me this error.
AttributeError: '_io.TextIOWrapper' object has no attribute 'split'
#function to split items
with open(csv, 'r') as f:
    def str_to_tuple(f):
        str_splitted = tuple(f.split(","))
        result_tuple = str_splitted.str()[1] + str_splitted.int()[-1]
        return(result_tuple)
    print(str_to_tuple(f))
I was also not sure how to put the extracted values in a new dictionary. Could anyone help me with this question? I have only been learning Python for a couple of weeks, so please bear with my poor code and explanation.
Any feedback, comments, and tips are welcome to help me get used to the Python world!
As this is a question that is part of a course, I will refrain from simply giving you the answer. Instead, I will try to give you some hints, which might help you find the answer by yourself.
My suggestion is that you start with trying to answer the following questions:
How should you call a function in Python? Is that how you do it in your script? Hint: probably not ;) If not, how would you fix this?
What is the type of f (use print(type(f)) to find out)? Did you expect that to be the type of f? Hint: probably not ;) What do you expect the type of f to be? Perhaps we need to call .split(",") on a different variable, one that perhaps doesn't exist yet?
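As a small illustration of the second hint only (not a solution to the exercise), the sketch below uses an in-memory stand-in for the file. The two lines of content and their column layout are made up, since the real CSV layout is not shown here.

```python
import io

# Hypothetical stand-in for the opened CSV file; contents are invented.
f = io.StringIO("China,1\nIndia,2\n")

# The file object itself is not a string, so it has no .split() method.
file_has_split = hasattr(f, "split")
print(type(f).__name__, file_has_split)

# Iterating over the file yields one str per line, and str does have .split().
line_types = [type(line).__name__ for line in f]
print(line_types)
```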
I have a CSV file with 100,000 rows.
Each row in column A is a sentence comprised of both chars and integers.
I want column B to contain only integers.
I want the new columns to be in the same CSV file.
How can I accomplish this?
If I'm understanding your question correctly, I would use .isdigit() to parse the data in column A. I'm frankly not sure what the format of column A is, so I don't know exactly what you would do with this (if you gave more information I could give a more specific answer). Your solution will likely come in a similar form to this:
def find(lines):
    B = []
    for line in lines:
        numbers = [c for c in line if c.isdigit()]
        current = int(''.join(numbers))
        # current is the concatenation of all
        # integers found in column A from left to right
        B.append(current)
    return B
Let me know if this makes sense or is even on the right track for your solution. Once again, without knowing what you're trying to do and what A looks like, I'm not sure what your actual goals are.
EDIT
I'm not going to explain the CSV handling for you, mainly because Python includes a fantastic library for it, the csv module, which is well documented. If you have specific questions about writing CSV files, definitely post them.
It sounds like you essentially want to pull int values out of column A then add them to a new column B. There are definitely many ways to solve this, but the general form of the problem is for each row you'll filter out the int, then you'll add the filtered int into the new column. I'll list a couple:
Regex: You could use a pattern such as [0-9]+ to pull the digit string out of A, then use int(whatever that output is) to cast it to int, then store those values in B. I'm a sucker for a good regular expression and this one is fairly straightforward. Regexr is a great resource to learn about this and test your pattern.
Use an algorithm similar to above: The above algorithm worked before, but I've updated it slightly. Now that it's been updated, it'll return an array of numbers corresponding to the numbers in A from left to right. This is relatively sound, but it doesn't guarantee you have the right integer: if the title has an int in it, it'll mess some things up. It is likely one of the clearer ways of doing this, though.
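A minimal sketch of the regex option, assuming Python 3 and that only the first run of digits in column A is wanted. The function name add_integer_column and the sample sentences are made up for illustration, and the data is read from memory, not from the real file.

```python
import csv
import io
import re

def add_integer_column(rows):
    """Return new rows of [column A, column B], where B is the first integer in A."""
    out = []
    for row in rows:
        match = re.search(r"[0-9]+", row[0])
        # Leave B empty when column A contains no digits at all.
        value = int(match.group()) if match else ""
        out.append([row[0], value])
    return out

# Hypothetical column A contents.
source = io.StringIO("order 42 shipped\nno number here\nroom 7 floor 3\n")
result = add_integer_column(list(csv.reader(source)))

# Write both columns back out as CSV text.
target = io.StringIO()
csv.writer(target).writerows(result)
print(target.getvalue())
```

To keep everything in the same CSV file, one option is to write the result to a temporary file and then replace the original with it, so the file is never read and overwritten at the same time.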
Here is the block to analyse:
('images\\Principales\\Screenshot_1.png', '{"categories":[{"name":"abstract_","score":0.00390625},{"name":"outdoor_","score":0.01171875},{"name":"outdoor_road","score":0.41796875}],"description":{"tags":["road","building","outdoor","scene","street","city","sitting","empty","light","view","driving","red","sign","intersection","green","large","riding","traffic","white","tall","blue","fire"],"captions":[{"text":"a view of a city street","confidence":0.83864323826716347}]},"requestId":"73fc14d5-653f-4a0a-a45a-e7a425580361","metadata":{"width":150,"height":153,"format":"Png"},"color":{"dominantColorForeground":"Grey","dominantColorBackground":"Grey","dominantColors":["Grey"],"accentColor":"274A68","isBWImg":false}}')
I need to extract all the elements after "description", but I don't know how to do that. In fact, I need these elements:
"road", "building","outdoor","scene","street","city","sitting","empty","light","view","driving","red","sign","intersection","green","large","riding","traffic","white","tall","blue","fire"
I've been looking for several minutes already, but I do not understand how to do it! I'm a beginner at working with lists, and I still have a hard time understanding them.
The "for" loop returns only 'images\\Principales\\Screenshot_1.png', and then the big block that is left...
Do you have a solution?
Thanks in advance!
EDIT:
Indeed, it is actually JSON! Thanks to the people who helped me :)
To extract the desired elements contained in the second block, I simply proceeded thus:
import json
ElementSeparate = '{"categories":[{"name":"abstract_","score":0.00390625},{"name":"outdoor_","score":0.01171875},{"name":"outdoor_road","score":0.41796875}],"description":{"tags":["road","building","outdoor","scene","street","city","sitting","empty","light","view","driving","red","sign","intersection","green","large","riding","traffic","white","tall","blue","fire"],"captions":[{"text":"a view of a city street","confidence":0.83864323826716347}]},"requestId":"73fc14d5-653f-4a0a-a45a-e7a425580361","metadata":{"width":150,"height":153,"format":"Png"},"color":{"dominantColorForeground":"Grey","dominantColorBackground":"Grey","dominantColors":["Grey"],"accentColor":"274A68","isBWImg":false}}'
ElementSeparate = json.loads(ElementSeparate)
for a in ElementSeparate['description']['tags']:
    print(a)
To me it looks like you're trying to parse JSON. You should use a JSON parser on the second element of the tuple. You'll get back either a list or a dictionary, and then you'll be able to extract the data that the "description" key holds.
https://docs.python.org/3/library/json.html
I'm pretty new to Python (and the xlrd module), so my code is probably not nearly as compact as it could be. I'm just using it to analyse some data, so it's more important for me to get what I'm doing rather than for me to make the code as compact as possible (though I do hope to improve, so feel free to give me advice on the coding itself, provided you manage to explain it to a 'newbie' :p )
That being said, here's my issue:
Context
I have an xlsx file with data on errors that people made when translating a text. The first column contains a code for the error relative to the text (conceptual errors), the second column contains a code for the translator that made the error. I want to create a dictionary in which the keys are the conceptual error codes, and the values are lists of the different translators that made that conceptual error.
A short fragment from the xlsx (to give you an idea of the codes in the two columns):
1722_Z1_CF5 1722_HT_EV_Z1_F1
1722_Z1_CF1 1722_PE_AL_Z1_F1
1722_Z1_CF9 1722_PE_EVC_Z1_F1
1722_Z1_CF5 1722_PE_LH_Z1_F1
As you can see, the conceptual error '1722_Z1_CF5' has been made by 2 different people ('1722_HT_EV_Z1_F1' and '1722_PE_LH_Z1_F1'). The dictionary for this fragment would look something like:
1722_Z1_CF5: 1722_HT_EV_Z1_F1, 1722_PE_LH_Z1_F1
1722_Z1_CF1: 1722_PE_AL_Z1_F1
1722_Z1_CF9: 1722_PE_EVC_Z1_F1
Code
The code below is what I tried to do to create the dictionary.
def TranslatorsPerError(sheet):
    TotalConceptualErrors(sheet)
    TranslatorsPerError = {}
    for row_index in range(sheet.nrows):
        if sheet.cell(row_index,0).value in ConceptualErrors and sheet.cell(row_index,0).value not in TranslatorsPerError:
            TranslatorsPerError[str(sheet.cell(row_index,0).value)]=[str(sheet.cell(row_index,1).value),]
        if sheet.cell(row_index,0).value in ConceptualErrors and sheet.cell(row_index,0).value in TranslatorsPerError:
            TranslatorsPerError[str(sheet.cell(row_index,0).value)].append(str(sheet.cell(row_index,1).value))
    return TranslatorsPerError
'TotalConceptualErrors' is a function I created that returns a list ('ConceptualErrors') of the conceptual error codes from the first column without duplicates (and it filters out some other information that was also present in the first column, that's why I needed to use this one first).
Problem
The problem is that this function keeps giving me an error: TypeError: argument of type 'Book' is not iterable
I know that problems with iterables can sometimes be solved by casting certain things into a different type, but I'm not sure how I should solve this one. I tried to use 'str()' for different elements, but that didn't solve the problem. Maybe it has something to do with my code, maybe with the nature of dictionaries or xlrd... (looking at the type 'book', my guess would be on the latter).
Any help or feedback on how to fix this would be greatly appreciated. If you need extra information to understand what's going on or what I'm looking for, please ask.
Where is ConceptualErrors being set?
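For reference, here is a minimal sketch of the grouping the question describes, written without xlrd so that it runs on its own. It assumes the rows have already been read into (error code, translator code) pairs, for example from sheet.cell(row_index, 0).value and sheet.cell(row_index, 1).value. The function name translators_per_error and its parameters are made up for illustration. The list of valid codes is passed in explicitly, so the function does not depend on whatever a global name such as ConceptualErrors happens to be bound to; whether that is the actual cause of the TypeError cannot be told from the code shown.

```python
def translators_per_error(rows, conceptual_errors):
    """Group translator codes by conceptual error code."""
    allowed = set(conceptual_errors)
    result = {}
    for error_code, translator_code in rows:
        if error_code not in allowed:
            continue  # skip other information found in the first column
        # setdefault creates the empty list the first time a code is seen,
        # so each translator is appended exactly once.
        result.setdefault(error_code, []).append(translator_code)
    return result

# The fragment from the question, as (column 1, column 2) pairs.
rows = [
    ("1722_Z1_CF5", "1722_HT_EV_Z1_F1"),
    ("1722_Z1_CF1", "1722_PE_AL_Z1_F1"),
    ("1722_Z1_CF9", "1722_PE_EVC_Z1_F1"),
    ("1722_Z1_CF5", "1722_PE_LH_Z1_F1"),
]
conceptual_errors = ["1722_Z1_CF5", "1722_Z1_CF1", "1722_Z1_CF9"]

grouped = translators_per_error(rows, conceptual_errors)
print(grouped["1722_Z1_CF5"])
```

Using a single setdefault call also avoids a subtle issue in the two separate if statements of the original: once the first branch has added a key, the second condition is true as well, so the first translator would be stored twice.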
I'm new to programming, and also to this site, so my apologies in advance for anything silly or "newbish" I may say or ask.
I'm currently trying to write a script in python that will take a list of items and write them into a csv file, among other things. Each item in the list is really a list of two strings, if that makes sense. In essence, the format is [[Google, http://google.com], [BBC, http://bbc.co.uk]], but with different values of course.
Within the CSV, I want this to show up as the first item of each list in the first column and the second item of each list in the second column.
This is the part of my code that I need help with:
with open('integration.csv', 'wb') as f:
    writer = csv.writer(f, delimiter=',', dialect='excel')
    writer.writerows(w for w in foundInstances)
For whatever reason, it seems that the delimiter is being ignored. When I open the file in Excel, each cell has one list. Using the old example, each cell would have "Google, http://google.com". I want Google in the first column and http://google.com in the second. So basically "Google" and "http://google.com", and then below that "BBC" and "http://bbc.co.uk". Is this possible?
Within my code, foundInstances is the list in which all the items are contained. As a whole, the script works fine, but I cannot seem to get this last step. I've done a lot of looking around within stackoverflow and the rest of the Internet, but I haven't found anything that has helped me with this last step.
Any advice is greatly appreciated. If you need more information, I'd be happy to provide you with it.
Thanks!
In your code on pastebin, the problem is here:
foundInstances.append(['http://' + str(num) + 'endofsite' + ', ' + desc])
Here, for each row in your data, you create one string that already has a comma in it. That is not what you need for the csv module. The CSV module makes comma-delimited strings out of your data. You need to give it the data as a simple list of items [col1, col2, col3]. What you are doing is ["col1, col2, col3"], which already has packed the data into a string. Try this:
foundInstances.append(['http://' + str(num) + 'endofsite', desc])
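The difference between the two row shapes can be seen in a small sketch. It assumes Python 3 and writes to an in-memory buffer instead of integration.csv; the helper name to_csv_text is made up for illustration.

```python
import csv
import io

def to_csv_text(rows):
    """Write rows with csv.writer and return the resulting CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()

# One pre-joined string per row: the writer sees a single column and
# quotes it, because the value itself contains the delimiter.
packed = [["Google, http://google.com"], ["BBC, http://bbc.co.uk"]]

# Two separate items per row: the writer produces two columns.
split = [["Google", "http://google.com"], ["BBC", "http://bbc.co.uk"]]

print(to_csv_text(packed))
print(to_csv_text(split))
```

The quoting in the first case is what makes Excel show the whole string in one cell.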
I just tested the code you posted with
foundInstances = [[1,2],[3,4]]
and it worked fine. It definitely produces the output csv in the format
1,2
3,4
So I assume that your foundInstances has the wrong format. If you construct the variable in a complex manner, you could try to add
import pdb; pdb.set_trace()
before the actual variable usage in the csv code. This lets you inspect the variable at runtime with the python debugger. See the Python Debugger Reference for usage details.
As a side note, according to the PEP-8 Style Guide, the name of the variable should be found_instances in Python.