I am trying to get my hands dirty by doing some experiments on Data Science using Python and the Pandas library.
Recently I got my hands on a Jupyter notebook and stumbled upon a piece of code that I couldn't figure out.
This is the line
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
The dataset comes with a genres column that contains key-value pairs. The code above removes the keys and keeps only the values. If more than one value exists, a | is inserted as a separator between them, for instance:
Comedy | Action | Drama
I want to know how the code actually works! Why does it need the literal_eval from ast? What is the lambda function doing?! Is there a more concise and clean way to write this?
Let's take this one step at a time:
md['genres'].fillna('[]')
This line fills all instances of NA or NaN in the series with '[]'.
.apply(literal_eval)
This applies literal_eval() from the ast module. We can infer from the fact that NA values have been replaced with '[]' that the original series contains string representations of lists, so literal_eval is used to convert these strings to actual lists.
.apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
This lambda function applies the following logic: If the value is a list, map to a list containing the ['name'] values for each element within the list, otherwise map to the empty list.
The result of the full function, therefore, is to map each element in the series, which in the original DF is a string representation of a list, to a list of the ['name'] values for each element within that list. If the element is either not a list, or NA, then it maps to the empty list.
You can look at it line by line:
md['genres'] = md['genres'].fillna('[]')
This first line ensures NaN cells are replaced with a string representing an empty list. That's because the genres column is expected to contain lists.
.apply(literal_eval)
The function ast.literal_eval parses each string into the Python object it represents (here, a list of dictionaries) instead of leaving it as a string. Thanks to that, you can then access keys and values.
.apply(
lambda x: [i['name'] for i in x]
if isinstance(x, list)
else []
)
Now you're just applying some function that will filter your lists. These lists contain dictionaries. The function will return all dictionary values associated with key name within your inputs if they're lists. Otherwise, that'll be an empty list.
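As for a cleaner way to write it, one option is to move the logic into a small named function. The sketch below is only an illustration: the helper name genre_names and the sample cells (including the 'id' keys) are made up for the example, and unlike the original one-liner it returns [] for any non-string cell rather than only for NaN.

```python
from ast import literal_eval


def genre_names(raw):
    """Return the 'name' values from a stringified list of dicts."""
    # Missing cells (None or NaN) and any other non-string value map to [].
    if not isinstance(raw, str):
        return []
    parsed = literal_eval(raw)  # safely turns the string into Python objects
    if not isinstance(parsed, list):
        return []
    return [entry['name'] for entry in parsed]


# Hypothetical cell values standing in for the genres column.
raw_cells = [
    "[{'id': 35, 'name': 'Comedy'}, {'id': 28, 'name': 'Action'}]",
    float('nan'),  # how pandas represents a missing cell
    "[]",
]

names = [genre_names(cell) for cell in raw_cells]
joined = [' | '.join(n) for n in names]  # optional: one string per row
print(names)
print(joined)
```

With pandas, the same helper could presumably be used as md['genres'].apply(genre_names), which replaces the fillna and both apply calls.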
I have a dictionary d of data frames, where the keys are the names and the values are the actual data frames. I have a function that normalizes some of the data frame and spits out a plot, with the title. The function takes in a tuple from d.items() (as the parameter df) so the first (0th) element is the name and the next is the data frame.
I have to do some manipulations on the data frame in the function, and I do so using df[1] without any issues. However, one line is df[1] = df[1].round(2), and this throws the error 'tuple' object does not support item assignment. I have verified that df[1] is a data frame by printing out its type right before this line. Why doesn't this work? It's not a tuple.
That's because your variable is a tuple and you can't assign to a tuple. Tuples are immutable. My understanding of your question:
from pandas import DataFrame
d = {'mydf' : DataFrame({'c1':(1,2),'c2':(4,5)}) } #A dictionary of DFs.
i = list(d.items())[0] #The first item from .items()
i[1] = i[1].round(2) #ERROR
Notice that "i" is a tuple, because .items() yields (key, value) tuples. You can't assign to an element of i, even if the element you are overwriting is itself mutable. I know that this sounds strange, because you can do things like this:
x = (7,[1,2,3])
x[1].append(4)
print(x)
The reason this works is subtle. The tuple stores references to its items, not the items themselves. When you access an item (like x[1]), Python follows that reference to the object (in my case a list) and lets you call append on it, because the list is mutable. In your case, you are not just accessing i[1]; you are trying to replace the reference stored at i[1] in the tuple, and that is what a tuple forbids. Hope this makes sense.
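A common way around this is to unpack the tuple into two names and rebind the second one. The sketch below uses a plain list of floats as a stand-in for the data frame so that it runs without pandas; the names item, name and values are made up for the example.

```python
# Stand-in for one (name, DataFrame) pair as produced by d.items().
item = ("mydf", [1.234, 5.678])

# Replacing a slot of the tuple fails, whatever the slot holds.
try:
    item[1] = [round(v, 2) for v in item[1]]
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Unpack instead, then rebind the local name; the tuple is never modified.
name, values = item
values = [round(v, 2) for v in values]
print(name, values)
```

In the original function the same idea would read name, frame = df followed by frame = frame.round(2).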
I am trying to utilize list comprehension to populate a new list, which is the length of text in a DataFrame column.
So if the text is "electrical engineer", it should output 19 etc. Instead, it just fills the list with None values
I have written out list comprehension below
all_text_length = [all_text_length.append(len(i)) for i in data['all_text']]
I expected a list of integers, but every value is None.
As a workaround, I am currently using (successfully)
[all_text_length.append(len(i)) for i in data['all_text']]
Read the documentation on append: it works in-place. There is no returned value. What you've written is essentially
all_text_length = [None for i in data['all_text']]
It appears that you're trying to make a list comprehension to entirely change your list. Try this:
all_text_length = [len(i) for i in data['all_text']]
If you just need the lengths in a convenient form, would it do to form a new column? Simply apply len to the df column.
The value of the expression before the "for" keyword in the list comprehension is what gets added to the list. If you place a function call there, like
all_text_length.append(len(i))
, the return value of that call will be added. Because .append() has no return statement, it returns None, which is what ends up in your list.
Use the code @Prune recommended and it should work as you want.
You are trying to append to the same list that you are building with the list comprehension. Since append returns None, you are getting None values. The code below should work:
all_text_length = list(map(len, data['all_text']))
map is a function that takes another function (first argument) and applies it to every element in an iterable (second argument). In Python 3 it returns a lazy iterator, so it is wrapped in list() here to get a list of the results.
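To see the difference side by side, here is a small sketch that uses a plain list of strings in place of data['all_text']; the sample strings and variable names are made up for the example.

```python
texts = ["electrical engineer", "nurse", "data analyst"]

# append() mutates the list in place and returns None,
# so the comprehension itself collects only None values.
collected = []
side_effect = [collected.append(len(t)) for t in texts]

# Put the value you want directly in the comprehension.
lengths = [len(t) for t in texts]

# Equivalent with map; list() materialises the lazy iterator.
lengths_via_map = list(map(len, texts))

print(side_effect)
print(collected, lengths, lengths_via_map)
```

This also explains why the workaround in the question works: the appends do happen, it is only the comprehension's own result that is useless.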
Currently I'm trying to sort a list of files whose names are version numbers. For example:
0.0.0.0.py
1.0.0.0.py
1.1.0.0.py
They are all stored in a list. My idea was to use the sort method of the list in combination with a lambda expression. The lambda expression should first remove the .py extension and then split the string by the dots, then cast every number to an integer and sort by them.
I know how I would do this in C#, but I have no idea how to do this with Python. One problem is, how can I sort over multiple criteria? And how do I embed the lambda expression doing this?
Can anyone help me?
Thank you very much!
You can use the key argument of the sorted function:
filenames = [
'1.0.0.0.py',
'0.0.0.0.py',
'1.1.0.0.py'
]
print(sorted(filenames, key=lambda f: list(map(int, f.split('.')[:-1]))))
Result:
['0.0.0.0.py', '1.0.0.0.py', '1.1.0.0.py']
The lambda splits the filename into parts, removes the last part and converts the remaining ones into integers. Then sorted uses this value as the sorting criterion.
Have your key function return a list of items. The sort is lexicographic in that case.
l = [ '1.0.0.0.py', '0.0.0.0.py', '1.1.0.0.py',]
s = sorted(l, key = lambda x: [int(y) for y in x.replace('.py','').split('.')])
print(s)
# read list in from memory and store as variable file_list
sorted(file_list, key = lambda x: list(map(int, x.split('.')[:-1])))
In case you're wondering what is going on here:
Our lambda function first takes the filename and splits it into a list on the periods. Then we take all of the elements of the list except the last one, which is the file extension. Then we apply the 'int' function to every remaining element. The 'sorted' function then orders the filenames by comparing these lists element by element, starting at the first, with ties broken by later elements.
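The sample filenames above happen to sort correctly even as plain strings, so here is a sketch with a two-digit component that shows why the integer key matters. The helper name version_key and the extra filenames are made up for the example.

```python
def version_key(filename):
    """Turn '1.10.0.0.py' into [1, 10, 0, 0] for element-wise comparison."""
    stem = filename[:-len('.py')] if filename.endswith('.py') else filename
    return [int(part) for part in stem.split('.')]


filenames = ['10.0.0.0.py', '2.1.0.0.py', '2.0.0.0.py', '1.1.0.0.py']

plain = sorted(filenames)                        # compares character by character
by_version = sorted(filenames, key=version_key)  # compares the integer lists

print(plain)
print(by_version)
```

Because lists compare element by element, a single key function covers all of the "multiple criteria" at once.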
I have a list of several thousand unordered tuples that are of the format
(mainValue, (value, value, value, value))
Given a main value (which may or may not be present), is there a 'nice' way, other than iterating through every item looking and incrementing a value, where I can produce a list of indexes of tuples that match like this:
index = 0
for destEntry in destList:
    if destEntry[0] == sourceMatch:
        destMatches.append(index)
    index = index + 1
So I can compare the sub values against another set, and remove the best match from the list if necessary.
This works fine, but just seems like python would have a better way!
Edit:
As per the question, when writing the original question, I realised that I could use a dictionary instead of the first value (in fact this list is within another dictionary), but after removing the question, I still wanted to know how to do it as a tuple.
With list comprehension your for loop can be reduced to this expression:
destMatches = [i for i,destEntry in enumerate(destList) if destEntry[0] == sourceMatch]
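You can also use the built-in filter() function [1] to filter your data. Note that this gives you the matching tuples themselves rather than their indexes:
destMatches = filter(lambda destEntry:destEntry[0] == sourceMatch, destList)
[1]: In Python 3, filter returns a lazy filter object rather than a list, so wrap it in list() if you need a list.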
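A small sketch of both approaches on made-up data, to show that enumerate yields the indexes while filter yields the entries. The variable names and sample tuples are assumptions for the example.

```python
# Hypothetical data in the (mainValue, (value, value, value, value)) format.
dest_list = [
    ('a', (1, 2, 3, 4)),
    ('b', (5, 6, 7, 8)),
    ('a', (9, 10, 11, 12)),
]
source_match = 'a'

# Indexes of the matching tuples.
match_indexes = [i for i, entry in enumerate(dest_list)
                 if entry[0] == source_match]

# The matching tuples themselves; list() materialises the filter object.
match_entries = list(filter(lambda entry: entry[0] == source_match, dest_list))

# A main value that is not present simply gives an empty list.
missing = [i for i, entry in enumerate(dest_list) if entry[0] == 'z']

print(match_indexes, match_entries, missing)
```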
I have a list of lists that looks like this:
[['10.2100', '0.93956088E+01'],
['11.1100', '0.96414905E+01'],
['12.1100', '0.98638361E+01'],
['14.1100', '0.12764182E+02'],
['16.1100', '0.16235739E+02'],
['18.1100', '0.11399972E+02'],
['20.1100', '0.76444933E+01'],
['25.1100', '0.37823686E+01'],
['30.1100', '0.23552237E+01'],...]
(here it looks as if it is already ordered, but some of the rest of the elements not included here to avoid a huge list, are not in order)
I want to sort it by the first element of each pair. I have seen several very similar questions, but in all of them the examples use integers. I don't know if that is why, when I use list.sort(key=lambda x: x[0]), sorted(), or the version with operator.itemgetter(0), I get the following:
[['10.2100', '0.93956088E+01'],
['100.1100', '0.33752517E+00'],
['11.1100', '0.96414905E+01'],
['110.1100', '0.25774972E+00'],
['12.1100', '0.98638361E+01'],
['14.1100', '0.12764182E+02'],
['14.6100', '0.14123326E+02'],
['15.1100', '0.15451733E+02'],
['16.1100', '0.16235739E+02'],
['16.6100', '0.15351242E+02'],
['17.1100', '0.14040859E+02'],
['18.1100', '0.11399972E+02'], ...]
Apparently it is sorting by the characters of the first element of each pair rather than by its numeric value.
Is there a way of using list.sort or sorted() to order these pairs by the first element?
Don't use list as a variable name!
some_list.sort(key=lambda x: float(x[0]) )
This converts the first element to a float and compares it numerically instead of alphabetically.
(Note that the cast to float is only used for comparing; the item is still a string in the list.)
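Here is a short sketch using four of the rows from the question, comparing the string ordering with the numeric one. The variable names are made up for the example.

```python
rows = [
    ['110.1100', '0.25774972E+00'],
    ['11.1100', '0.96414905E+01'],
    ['100.1100', '0.33752517E+00'],
    ['10.2100', '0.93956088E+01'],
]

# String comparison goes character by character, so '100...' lands before '11...'.
by_text = sorted(rows, key=lambda r: r[0])

# Converting inside the key compares the numeric values instead.
by_number = sorted(rows, key=lambda r: float(r[0]))

print([r[0] for r in by_text])
print([r[0] for r in by_number])
```

The rows keep their original string values in both results; only the comparison changes.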