I have a dictionary d of data frames, where the keys are the names and the values are the actual data frames. I have a function that normalizes some of the data frame and spits out a plot, with the title. The function takes in a tuple from d.items() (as the parameter df) so the first (0th) element is the name and the next is the data frame.
I have to do some manipulations on the data frame in the function, and I do so using df[1] without any issues. However, one line is df[1] = df[1].round(2) and this throws the error 'tuple' object does not support item assignment. I have verified that df[1] is a data frame by printing out its type right before this line. Why doesn't this work? It's not a tuple.
That's because your variable is a tuple and you can't assign to a tuple. Tuples are immutable. My understanding of your question:
from pandas import DataFrame
d = {'mydf' : DataFrame({'c1':(1,2),'c2':(4,5)}) } #A dictionary of DFs.
i = list(d.items())[0] #The first item from .items()
i[1] = i[1].round(2) #ERROR
Notice that "i" is a tuple, because that is what .items() returns (tuples). You can't assign to an element of i, even if the element you are overwriting is itself mutable. I know that this sounds strange, because you can do things like this:
x = (7,[1,2,3])
x[1].append(4)
print(x)
The reason this works is that a tuple stores references to its items, not the items themselves. When you access an item (like x[1]), Python follows that reference to the object (in my case a list) and lets you call append on it, because the list is mutable. The tuple itself is unchanged: it still refers to the same list. In your case you are not just accessing i[1]; you are trying to replace the i[1] entry in the tuple, which is not allowed. Instead, assign the result to a new variable, or write it back through the dictionary key with d[i[0]] = i[1].round(2). Hope this makes sense.
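As a rough sketch of the workaround: unpack the tuple into two local names and rebind the local one. To keep the example runnable without pandas, a plain list of floats stands in for the data frame, and the function name normalize is made up for illustration.

```python
# Stand-in for the dictionary of data frames: a plain list of floats is used
# so the sketch runs without pandas (an assumption for illustration only).
d = {"mydf": [1.234, 5.678]}

def normalize(item):
    # Unpack the (name, value) tuple into two local names instead of indexing.
    name, values = item
    # Rebinding a local name is fine; nothing is assigned into the tuple.
    values = [round(v, 2) for v in values]
    return name, values

item = next(iter(d.items()))

try:
    # Same mistake as in the question: assigning into the tuple.
    item[1] = [round(v, 2) for v in item[1]]
except TypeError as exc:
    error_message = str(exc)

name, rounded = normalize(item)
d[name] = rounded  # write the result back through the dictionary key
```

With a real data frame the rebinding line would be values = values.round(2); the tuple handling stays the same.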
I am trying to get my hands dirty by doing some experiments on Data Science using Python and the Pandas library.
Recently I got my hands on a Jupyter notebook and stumbled upon a piece of code whose workings I couldn't figure out.
This is the line
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
The dataset comes with a genres column that contains key-value pairs. The code above removes the keys and replaces everything with only the value. If more than one value exists, a | is inserted as a separator between them, for instance
Comedy | Action | Drama
I want to know how the code actually works! Why does it need the literal_eval from ast? What is the lambda function doing?! Is there a more concise and clean way to write this?
Let's take this one step at a time:
md['genres'].fillna('[]')
This line fills all instances of NA or NaN in the series with '[]'.
.apply(literal_eval)
This applies literal_eval() from the ast package. We can infer from the fact that NA values have been replaced with '[]' that the original series contains string representations of lists, so literal_eval is used to convert these strings to lists.
.apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
This lambda function applies the following logic: If the value is a list, map to a list containing the ['name'] values for each element within the list, otherwise map to the empty list.
The result of the full function, therefore, is to map each element in the series, which in the original DF is a string representation of a list, to a list of the ['name'] values for each element within that list. If the element is either not a list, or NA, then it maps to the empty list.
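The three steps can be sketched on plain Python values. The sample cells below are invented, None stands in for a NaN cell, and extract_names is a hypothetical helper that mirrors the chained calls one element at a time.

```python
from ast import literal_eval

# Hypothetical cell values shaped like the genres column: each is the string
# form of a list of dicts, and None stands in for a missing (NaN) cell.
raw_cells = [
    "[{'id': 35, 'name': 'Comedy'}, {'id': 28, 'name': 'Action'}]",
    None,
]

def extract_names(cell):
    # Step 1 (fillna): replace a missing cell with the string '[]'.
    text = "[]" if cell is None else cell
    # Step 2 (literal_eval): parse the string into real Python objects.
    parsed = literal_eval(text)
    # Step 3 (lambda): keep only the 'name' value of each dict.
    return [g["name"] for g in parsed] if isinstance(parsed, list) else []

genres = [extract_names(cell) for cell in raw_cells]
```

Note that the result for each cell is a Python list of names. Joining them into a single string such as Comedy | Action would be a separate step, for example " | ".join(names).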
You can look at it line by line:
md['genres'] = md['genres'].fillna('[]')
This first line ensures NaN cells are replaced with a string representing an empty list. That's because the genres column is expected to contain lists.
.apply(literal_eval)
The function ast.literal_eval safely evaluates a string containing a Python literal, so each cell becomes an actual list of dictionaries instead of a string. Thanks to that, you can then access keys and values.
.apply(
lambda x: [i['name'] for i in x]
if isinstance(x, list)
else []
)
Now you're just applying some function that will filter your lists. These lists contain dictionaries. The function will return all dictionary values associated with key name within your inputs if they're lists. Otherwise, that'll be an empty list.
I am trying to utilize list comprehension to populate a new list, which is the length of text in a DataFrame column.
So if the text is "electrical engineer", it should output 19 etc. Instead, it just fills the list with None values
I have written out list comprehension below
all_text_length = [all_text_length.append(len(i)) for i in data['all_text']]
I am expecting integer output, but it's None.
As a workaround, I am currently using (successfully)
[all_text_length.append(len(i)) for i in data['all_text']]
Read the documentation on append: it works in-place. There is no returned value. What you've written is essentially
all_text_length = [None for i in data['all_text']]
It appears that you're trying to make a list comprehension to entirely change your list. Try this:
all_text_length = [len(i) for i in data['all_text']]
If you just need the lengths in a convenient form, would it do to form a new column? Simply apply len to the df column.
The value of the expression before the "for" in a list comprehension is what gets added to the list. If you place a call there, like
all_text_length.append(len(i))
the return value of that call will be added. Because .append() doesn't return anything, you get the value None, which is what ends up in your list.
Use the code @Prune recommended and it should work as you want.
You are appending to the same list that you are building with the list comprehension. Since append returns None, you get a list of None values. The code below should work:
all_text_length = list(map(len, data['all_text']))
map is a function that takes another function (first argument) and applies it to every element of an iterable (second argument). In Python 3 it returns a lazy iterator, so wrap it in list() to get a list of the results.
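A small sketch that puts the answers above side by side. The texts list is a made-up stand-in for data['all_text'].

```python
# Hypothetical stand-in for data['all_text'].
texts = ["electrical engineer", "nurse"]

# list.append mutates the list in place and returns None.
result_of_append = [].append(1)

# The comprehension builds the list directly from the expression before "for".
lengths = [len(t) for t in texts]

# map returns an iterator in Python 3, so list() is needed to materialize it.
lengths_via_map = list(map(len, texts))
```

If the lengths belong next to the text, a new data frame column (for example via data['all_text'].str.len() in pandas) is probably the more convenient form.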
I am trying to do something pretty simple but can't seem to get it. I have a dictionary where each value is a list. I am trying to sum each list and assign the result back to the same key as an int. In the code below, the first line doesn't do anything; the second, as the comment says, puts the value back but in a list. Everything else I've tried has given me the error "can only assign an iterable". As far as I know, iterables are anything that can be iterated over, such as a list, and not an int. Why can I only use an iterable, and how can I fix this issue? The dict I'm using is here (https://gist.github.com/ishikawa-rei/53c100449605e370ef66f1c06f15b62e)
for i in dict.values():
    i = sum(i)
    #i[:] = [sum(i) / 3600] # puts answer into dict but as a list
You can use simple dictionary comprehension if your dict values are all lists
{k:sum(v) for k, v in dict.items()}
for i in dikt.keys():
    dikt[i] = sum(dikt[i])
By the way, dict is a built-in type, so it's best not to use it as a variable name.
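To see why the two attempts in the question behave as they do, here is a short sketch on invented data (not the contents of the linked gist). Rebinding the loop variable only changes a local name, and slice assignment requires an iterable on the right-hand side, which is where the error comes from.

```python
# Hypothetical data, not the linked gist.
data = {"a": [3600, 7200], "b": [1800]}

# Rebinding the loop variable only changes the local name, not the dict.
for values in data.values():
    values = sum(values)
unchanged = dict(data)  # snapshot showing the lists are still in place

# Slice assignment replaces the list's contents, so it needs an iterable.
try:
    data["a"][:] = sum(data["a"])
except TypeError as exc:
    slice_error = str(exc)

# Assigning through the key replaces the list with an int.
for key in data:
    data[key] = sum(data[key])
```

Assigning to existing keys while looping over the dictionary is fine here because no keys are added or removed.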
I am struggling with a Pyspark assignment. I am required to get a sum of all the viewing numbers per channels. I have 2 sets of files: 1 showing the show and views per show the other showing the shows and what channel they are shown on (can be multiple).
I have performed a join operation on the 2 files and the result looks like ..
[(u'Surreal_News', (u'BAT', u'11')),
(u'Hourly_Sports', (u'CNO', u'79')),
(u'Hourly_Sports', (u'CNO', u'3')),
I now need to extract the channel as the key and then I think do a reduceByKey to get the sum of views for the channels.
I have written this function to extract the chan as key with the views alongside, which I could then use a reduceByKey function to sum the results. However when I try to display results of below function with collect() I get an "AttributeError: 'tuple' object has no attribute 'split'" error
def extract_chan_views(show_chan_views):
    key_value = show_chan_views.split(",")
    chan_views = key_value[1].split(",")
    chan = chan_views[0]
    views = int(chan_views[1])
    return (chan,views)
Since this is an assignment, I'll try to explain what's going on rather than just doing the answer. Hopefully that will be more helpful!
This actually isn't anything to do with pySpark; it's just a plain Python issue. Like the error is saying, you're trying to split a tuple, when split is a string operation. Instead access them by index. The object you're passing in:
[(u'Surreal_News', (u'BAT', u'11')),
(u'Hourly_Sports', (u'CNO', u'79')),
(u'Hourly_Sports', (u'CNO', u'3')),
is a list of tuples, where the first index is a unicode string and the second is another tuple. You can split them apart like this (I'll annotate each step with comments):
for item in your_list:
    #item = (u'Surreal_News', (u'BAT', u'11')) on iteration one
    first_index, second_index = item #this will unpack the two indices
    #now:
    #first_index = u'Surreal_News'
    #second_index = (u'BAT', u'11')
    first_sub_index, second_sub_index = second_index #unpack again
    #now:
    #first_sub_index = u'BAT'
    #second_sub_index = u'11'
Note that you never had to split on commas anywhere. Also note that the u'11' is a string, not an integer in your data. It can be converted, as long as you're sure it's never malformed, with int(u'11'). Or if you prefer specifying indices to unpacking, you can do the same thing:
first_index, second_index = item
is equivalent to:
first_index = item[0]
second_index = item[1]
Also note that this gets more complicated if you are unsure what form the data will take - that is, if sometimes the objects have two items in them, other times three. In that case unpacking and indexing in a generalized way for a loop require a bit more thought.
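Putting the unpacking together, one possible shape for the extraction function is sketched below. It runs in plain Python on sample records copied from the question; the defaultdict loop is only a stand-in for the map and reduceByKey steps, not PySpark itself.

```python
from collections import defaultdict

# Sample records in the shape shown in the question (views are still strings).
joined = [
    ("Surreal_News", ("BAT", "11")),
    ("Hourly_Sports", ("CNO", "79")),
    ("Hourly_Sports", ("CNO", "3")),
]

def extract_chan_views(show_chan_views):
    # Unpack the outer tuple, then the inner one; no splitting is needed.
    _show, (chan, views) = show_chan_views
    # Convert the view count from string to int so it can be summed.
    return (chan, int(views))

# Plain-Python stand-in for mapping the function and then reducing by key.
totals = defaultdict(int)
for chan, views in map(extract_chan_views, joined):
    totals[chan] += views
```

On an actual RDD the equivalent would presumably be something like joined_rdd.map(extract_chan_views).reduceByKey(lambda a, b: a + b), where joined_rdd is a placeholder name for the result of the join.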
I am not exactly fixing your code, but I faced the same error when I applied a join transformation to two datasets.
Let's say A and B are two RDDs.
c = A.join(B)
c is still an RDD, but each of its elements is a tuple of the form (key, (value_from_A, value_from_B)), not a string, so we cannot call split(",") on an element. Index or unpack the tuple instead.
For example, if D is one such tuple:
E = D[1]  # instead of E = D.split(",")[1]
I have tried to define a function to create a two-tiered dictionary, so it should produce the format
dict = {tier1:{tier2:value}}.
The code is:
def two_tier_dict_init(tier1,tier2,value):
    dict_name = {}
    for t1 in tier1:
        dict_name[t1] = {}
        for t2 in tier2:
            dict_name[t1][t2] = value
    return dict_name
So the following example...
tier1 = ["foo","bar"]
tier2 = ["x","y"]
value = []
foobar_dict = two_tier_dict_init(tier1,tier2,value)
on the face of it produces what I want:
foobar_dict = {'foo':{'x': [],'y':[]},
'bar':{'x': [],'y':[]}}
However, when appending any value like
foobar_dict["foo"]["x"].append("thing")
All values get appended so the result is:
foobar_dict = {'foo':{'x': ["thing"],'y':["thing"]},
'bar':{'x': ["thing"],'y':["thing"]}}
At first I assumed that due to the way my definition builds the dictionary that all values are pointing to the same space in memory, but I could not figure out why this should be the case. I then discovered that if I change the value from an empty list to an integer, when I do the following,
foobar_dict["foo"]["x"] +=1
only the desired value is changed.
I must therefore conclude that it is something to do with the list.append method, but I can not figure it out. What is the explanation?
N.B. I require this function for building large dictionaries of dictionaries where each tier has hundreds of elements. I have also used the same method to build a three-tiered version with the same issue occurring.
You only passed in one list object, and your second-tier dictionary only stored references to that one object.
If you need to store distinct lists, you need to create a new list for each entry. You could use a factory function for that:
def two_tier_dict_init(tier1, tier2, value_factory):
    dict_name = {}
    for t1 in tier1:
        dict_name[t1] = {}
        for t2 in tier2:
            dict_name[t1][t2] = value_factory()
    return dict_name
Then use:
two_tier_dict_init(tier1, tier2, list)
to have it create empty lists. You can use any callable for the value factory here, including a lambda if you want to store an immutable object like a string or an integer:
two_tier_dict_init(tier1, tier2, lambda: "I am shared but immutable")
You could use a dict comprehension to simplify your function:
def two_tier_dict_init(tier1, tier2, value_factory):
    return {t1: {t2: value_factory() for t2 in tier2} for t1 in tier1}
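A quick check of the factory version, using the tier names from the question, suggests that each entry now gets its own list:

```python
def two_tier_dict_init(tier1, tier2, value_factory):
    # value_factory() is called once per entry, so every entry is a new object.
    return {t1: {t2: value_factory() for t2 in tier2} for t1 in tier1}

foobar_dict = two_tier_dict_init(["foo", "bar"], ["x", "y"], list)

# Appending to one entry no longer affects the others.
foobar_dict["foo"]["x"].append("thing")
```

Passing the built-in list as the factory works because calling list() with no arguments returns a fresh empty list each time.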
It happens because you are filling all second-tier dicts with the same list that you passed as value, and all entries are pointing to the same list object.
One solution is to copy the list at each assignment:
dict_name[t1][t2] = value[:]
This only works if you are sure that value is always a list.
Another, more generic solution, which works with any object, including nested lists and dictionaries, is deep copying (this needs import copy):
dict_name[t1][t2] = copy.deepcopy(value)
If you fill the dicts with an immutable object like a number or string, internally all entries would refer to the same object as well, but the undesirable effect would not happen because numbers and strings are immutable.
All the values refer to the same list object. When you call append() on that list object, all of the dictionary values appear to change at the same time.
To create a copy of the list change
dict_name[t1][t2] = value
to
dict_name[t1][t2] = value[:]
or to
dict_name[t1][t2] = copy.deepcopy(value)
The former will make a shallow (i.e. one-level) copy, and the latter will do a deep copy.
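The difference between the two kinds of copy only shows up when the value itself contains mutable objects. A minimal illustration, with a nested list chosen purely for the example:

```python
import copy

# A nested list, chosen to make the difference visible.
value = [[1, 2], [3]]

shallow = value[:]            # new outer list, but the same inner lists
deep = copy.deepcopy(value)   # new outer list and new inner lists

# Mutate an inner list through the original.
value[0].append(99)
```

After the append, shallow[0] also contains 99 because it is the same inner list, while deep[0] is unaffected. For a flat list such as the empty list in the question, value[:] is enough.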
The reason this appears to work with ints is because they are immutable, and augmented assignments (+= and friends) do a name rebind just like ordinary assignment statements (it just might be back to the same object). When you do this:
foobar_dict["foo"]["x"] +=1
you end up replacing the old int object with a different one. ints have no capability to change value in-place, so the addition builds (or, possibly finds, since CPython interns certain ints) a different int with the new value.
So even if foobar_dict["foo"]["x"] and foobar_dict["foo"]["y"] started out with the same int (and they did), adding to one of them makes them now contain different ints.
You can see this difference if you try it out with simpler variables:
>>> a = b = 1
>>> a is b
True
>>> a += 1
>>> a
2
>>> b
1
On the other hand, list is mutable, and calling append doesn't do any rebinding. So, as you suspected, if foobar_dict["foo"]["x"] and foobar_dict["foo"]["y"] are the same list (and they are - check this with is), and you append to it, they are still the same list.
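The identity check mentioned above can be sketched with two small dictionaries. The names lists and ints are illustrative only.

```python
shared = []
lists = {"x": shared, "y": shared}   # both keys refer to one list object

same_list_before = lists["x"] is lists["y"]
lists["x"].append("thing")           # mutates the shared object in place
same_list_after = lists["x"] is lists["y"]

ints = {"x": 0, "y": 0}
ints["x"] += 1                       # rebinds the key to a different int
```

The append is visible through both keys because no rebinding happened, whereas the augmented assignment on the int only changes what the "x" key refers to.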