I'm working on the pandas tutorial at https://github.com/brandon-rhodes/pycon-pandas-tutorial/blob/master/Exercises-3.ipynb. It has exercises on the cast dataframe, a sample of which is:
There are two commands which are almost identical, except for one small difference, yet one outputs a Series while the other outputs a DataFrame. I don't understand why.
The first code is:
c1 = cast[cast.title == 'The Pink Panther']
c2 = c1.groupby('year')['n'].max()
type(c2)
and it makes c2 a Series. However, if I simply add another pair of square brackets around 'n', as in the following code, I get a DataFrame.
c1 = cast[cast.title == 'The Pink Panther']
c2 = c1.groupby('year')[['n']].max()
type(c2)
Can someone help me understand this? Thanks!
If you pass a list of columns, you get a DataFrame. It doesn't matter how many elements the list has. It would be confusing if it returned a Series just in the case of a one-item list, because sometimes your list might be programmatically generated. For instance, suppose you had:
columns_to_use = [column for blah in blahblah]
x = c1.groupby('year')[columns_to_use]
With the current behavior, you know that x will always be a DataFrame, because columns_to_use is a list. If this were not the case, you might get errors later because you wouldn't know ahead of time whether x would be a Series or DataFrame, so you wouldn't know, e.g., what methods you could call on it in later code.
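A quick illustration of the difference, using a toy frame (the column names here are just for the example):
import pandas as pd

df = pd.DataFrame({'year': [1963, 1963, 2006],
                   'n': [1, 2, 3]})

g = df.groupby('year')

print(type(g['n'].max()))    # <class 'pandas.core.series.Series'>
print(type(g[['n']].max()))  # <class 'pandas.core.frame.DataFrame'>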
Basically, if you pass a Series, np.ndarray, Index, or list to __getitem__ on a DataFrame, you will get back a two-dimensional object (a DataFrame).
Otherwise __getitem__ will attempt to retrieve a column (a Series). This case includes string types, numbers, a custom class, etc.
DataFrameGroupBy behaves similarly to DataFrame: if you pass any of the objects listed above (plus tuples, apparently), you will get a two-dimensional object back (a DataFrame); otherwise it will attempt to retrieve a one-dimensional object (a Series).
In your first code block you are passing a string:
>>> type(c1['year'])
pandas.core.series.Series
In the second code block you pass a list containing a string to __getitem__
>>> type(c1[['year']])
pandas.core.frame.DataFrame
[] has multiple meanings in this case.
Passing a one-element list is generally not very useful, however, except that the column name is printed nicely at the top (a Series still keeps the column name in its name attribute). The primary intent of passing a list to __getitem__ is to key on multiple columns.
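For example, with a toy frame (the columns are made up for illustration):
import pandas as pd

df = pd.DataFrame({'year': [1963, 1963, 2006],
                   'n': [1, 2, 3],
                   'rating': [7.1, 6.5, 8.0]})

# A list selects several columns at once and always yields a DataFrame.
df.groupby('year')[['n', 'rating']].max()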
To see how brackets [] work on a class, check its __getitem__ method.
From the __getitem__ method of pandas.core.frame.DataFrame:
if isinstance(key, (Series, np.ndarray, Index, list)):
    # either boolean or fancy integer index
    return self._getitem_array(key)
elif isinstance(key, DataFrame):
    return self._getitem_frame(key)
elif is_mi_columns:
    return self._getitem_multilevel(key)
else:
    return self._getitem_column(key)
How do I replace the cell values in a column if they contain a number in general, or contain something specific like a comma? I want to replace the whole cell value with something else.
Say, for example, a cell in the column has a comma (meaning it holds more than one thing): I want it to be replaced by text like "ENM".
For a cell that has a number value, I want to replace it with "UNM".
As you have not provided examples of what your expected and current output look like, I'm making some assumptions below. What it seems like you're trying to do is iterate through every value in a column and, if the value meets certain conditions, change it to something else.
Just a general pointer: iterating through dataframes requires some important performance considerations at larger sizes.
Start by defining a function you want to use to check the value:
def has_comma(value):
    # Guard against non-string cells (numbers would raise a TypeError with `in`).
    if isinstance(value, str) and ',' in value:
        return True
    return False
Then use the pandas.Series.replace method to make the change.
for i in df['column_name']:
    if has_comma(i):
        df['column_name'] = df['column_name'].replace([i], 'ENM')
    else:
        df['column_name'] = df['column_name'].replace([i], 'UNM')
Say you have a column, i.e. a pandas Series, called col.
The following code can be used to map values containing a comma to "ENM", as per your example:
col.mask(col.str.contains(','), "ENM")
You can overwrite your original column with this result if that's what you want to do. This approach will be much faster than looping through each element.
For mapping floats to "UNM", as per your example, the following would work:
col.mask(col.apply(isinstance, args=(float,)), "UNM")
Hopefully you get the idea.
See https://pandas.pydata.org/docs/reference/api/pandas.Series.mask.html for more info on masking
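A minimal sketch combining the two steps, assuming col mixes comma-containing strings and floats as in the question (the sample data is made up):
import pandas as pd

col = pd.Series(['a,b', 'c', 1.5, 'd,e'])

# Floats become "UNM", then anything containing a comma becomes "ENM".
result = (col
          .mask(col.apply(isinstance, args=(float,)), "UNM")
          .mask(col.str.contains(',', na=False), "ENM"))
print(result.tolist())   # ['ENM', 'c', 'UNM', 'ENM']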
I am parsing an XMI/XML data structure into a pandas dataframe by first decomposing it into a dictionary. When I encounter a named tuple in a list in my XMI, there appear to be at most two named tuples in the list (although the majority have only one).
To handle this case, I am doing the following:
if val is not None and val:
    if len(val) == 1:
        d['modifiedBegin'] = val[0].begin
        d['modifiedEnd'] = val[0].end
        d['modifiedBegin1'] = None
        d['modifiedEnd1'] = None
    else:
        d['modifiedBegin'] = val[0].begin
        d['modifiedEnd'] = val[0].end
        d['modifiedBegin1'] = val[1].begin
        d['modifiedEnd1'] = val[1].end
My issues with this are: a) I cannot be guaranteed that there are only two named tuples in the list that I am decomposing, and b) this feels cheap, ugly and just plain wrong!
I really would like to come up with a more general solution, especially given item a) above.
My data look like:
val = [Span(xmiID=105682, begin=13352, end=13358, type='org.metamap.uima.ts.Span'), Span(xmiID=105685, begin=13368, end=13374, type='org.metamap.uima.ts.Span')]
I would really much rather parse this out into two separate rows in my dataframe, instead of having more columns. The major issue is that both of these tuples share common data from a larger object that looks like:
Negation(xmiID=142613, id=None, negType='nega', negTrigger='without', modifier=[Span(xmiID=105682, begin=13352, end=13358, type='org.metamap.uima.ts.Span'), Span(xmiID=105685, begin=13368, end=13374, type='org.metamap.uima.ts.Span')])
So, both rows share the attributes negType and negTrigger... what is a more general way of decomposing this to insert into my dataframe? I thought of iterating through the elements when the length of the list was greater than one and then inserting into the dataframe on each iteration, but that seems messy.
My desired outcome would thus be to have a dataframe that looks like (minus the indices and other common junk):
Iterate over the Negation namedtuples:
for each thing in negation.modifier,
add a row using the negation's attributes and the thing's attributes.
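A rough sketch of that loop, with the Span and Negation namedtuples reconstructed from the fields shown in the question (the exact definitions are assumptions):
from collections import namedtuple
import pandas as pd

# Assumed definitions, matching the fields visible in the question's data.
Span = namedtuple('Span', ['xmiID', 'begin', 'end', 'type'])
Negation = namedtuple('Negation', ['xmiID', 'id', 'negType', 'negTrigger', 'modifier'])

negations = [
    Negation(xmiID=142613, id=None, negType='nega', negTrigger='without',
             modifier=[Span(105682, 13352, 13358, 'org.metamap.uima.ts.Span'),
                       Span(105685, 13368, 13374, 'org.metamap.uima.ts.Span')]),
]

rows = []
for neg in negations:
    for span in neg.modifier:                 # one row per Span, however many there are
        rows.append({'negType': neg.negType,
                     'negTrigger': neg.negTrigger,
                     'begin': span.begin,
                     'end': span.end})

df = pd.DataFrame(rows)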
Or, instead of parsing XML to namedtuples to dictionaries, skip the middle part and create a single dictionary - {'begin': [row0, row1, ...], 'end': [row0, row1, ...], 'negTrigger': [row0, row1, ...], 'negType': [row0, row1, ...]} - from the XML.
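The same idea in column-dictionary form, shown here against the negations list from the sketch above (in practice you would append while walking the raw XML instead):
# Build column lists up front, then construct the frame once at the end.
data = {'begin': [], 'end': [], 'negType': [], 'negTrigger': []}
for neg in negations:
    for span in neg.modifier:
        data['begin'].append(span.begin)
        data['end'].append(span.end)
        data['negType'].append(neg.negType)
        data['negTrigger'].append(neg.negTrigger)

df = pd.DataFrame(data)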
I have a dictionary d of data frames, where the keys are the names and the values are the actual data frames. I have a function that normalizes some of the data frame and spits out a plot, with the title. The function takes in a tuple from d.items() (as the parameter df) so the first (0th) element is the name and the next is the data frame.
I have to do some manipulations on the data frame in the function, and I do so using df[1] without any issues. However, one line is df[1] = df[1].round(2) and this throws the error 'tuple' object does not support item assignment. I have verified that df[1] is a data frame by printing out its type right before this line. Why doesn't this work? It's not a tuple.
That's because your variable is a tuple and you can't assign to a tuple. Tuples are immutable. My understanding of your question:
from pandas import DataFrame
d = {'mydf' : DataFrame({'c1':(1,2),'c2':(4,5)}) } #A dictionary of DFs.
i = list(d.items())[0] #The first item from .items()
i[1] = i[1].round(2) #ERROR
Notice that i is a tuple, because that is what .items() yields: (key, value) tuples. You can't assign into i, even if what you are overwriting is itself mutable. I know that this sounds strange, because you can do things like this:
x = (7,[1,2,3])
x[1].append(4)
print(x)
The reason this works is subtle. Essentially, a tuple stores references to its items (not the data themselves). So if you access a tuple's item (like x[1]), Python follows that reference to the underlying object (in my case a list) and lets you call append on it, because the list is mutable. In your case, you are not accessing i[1] by itself; you are trying to overwrite the i[1] entry in the tuple. Hope this makes sense.
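One way around it, reusing the toy d from above, is to unpack the tuple into ordinary names (or write the result back into the dictionary) instead of assigning into the tuple itself:
from pandas import DataFrame

d = {'mydf': DataFrame({'c1': (1, 2), 'c2': (4, 5)})}

name, frame = list(d.items())[0]   # unpack the (key, DataFrame) pair instead
frame = frame.round(2)             # rebinding an ordinary name is fine
d[name] = frame                    # or store the rounded frame back into the dict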
I am trying to get my hands dirty by doing some experiments on Data Science using Python and the Pandas library.
Recently I got my hands on a Jupyter notebook and stumbled upon a piece of code that I couldn't figure out.
This is the line:
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
The dataset comes with a genres column that contains key-value pairs. The code above removes the keys and keeps only the values; if more than one value exists, a | is inserted as a separator between them, for instance
Comedy | Action | Drama
I want to know how the code actually works! Why does it need the literal_eval from ast? What is the lambda function doing?! Is there a more concise and clean way to write this?
Let's take this one step at a time:
md['genres'].fillna('[]')
This line fills all instances of NA or NaN in the series with '[]'.
.apply(literal_eval)
This applies literal_eval() from the ast module. We can infer from the fact that NA values have been replaced with '[]' that the original series contains string representations of lists, so literal_eval is used to convert these strings to lists.
.apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
This lambda function applies the following logic: If the value is a list, map to a list containing the ['name'] values for each element within the list, otherwise map to the empty list.
The result of the full function, therefore, is to map each element in the series, which in the original DF is a string representation of a list, to a list of the ['name'] values for each element within that list. If the element is either not a list, or NA, then it maps to the empty list.
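A small self-contained demonstration of the whole chain (the sample data is made up, but mirrors the stored string-of-list-of-dicts format):
import pandas as pd
from ast import literal_eval

md = pd.DataFrame({'genres': [
    "[{'id': 35, 'name': 'Comedy'}, {'id': 28, 'name': 'Action'}]",
    None,
]})

md['genres'] = (md['genres']
                .fillna('[]')
                .apply(literal_eval)
                .apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else []))

print(md['genres'].tolist())   # [['Comedy', 'Action'], []]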
You can look at it line by line:
md['genres'] = md['genres'].fillna('[]')
This first line ensures NaN cells are replaced with a string representing an empty list. That's because the genres column is expected to contain lists.
.apply(literal_eval)
The ast.literal_eval method is used to actually evaluate the strings into lists of dictionaries, rather than treating them as plain strings. Thanks to that, you can then access keys and values. See the ast.literal_eval documentation for more.
.apply(
    lambda x: [i['name'] for i in x]
    if isinstance(x, list)
    else []
)
Now you're just applying a function that transforms your lists. These lists contain dictionaries. For each input that is a list, the function returns the values associated with the key name in each of its dictionaries. Otherwise, it returns an empty list.
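As for the question's last point, one marginally cleaner option is to pull the logic into a named helper; this is just a sketch that assumes the same md frame as in the question and behaves the same way:
from ast import literal_eval

def extract_genre_names(cell):
    # NaN cells are floats, not strings, so they fall through to the empty list.
    genres = literal_eval(cell) if isinstance(cell, str) else []
    return [g['name'] for g in genres] if isinstance(genres, list) else []

md['genres'] = md['genres'].apply(extract_genre_names)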
import pandas as pd
businesses = pd.read_json(businesses_filepath, lines=True, encoding='utf_8')
restaurantes = businesses['Restaurants' in businesses['categories']]
I would like to remove the rows that do not have Restaurants in the categories column (this column holds lists). However, the code gave the error 'KeyError: False', and I would like to understand why, and how to solve it.
The expression 'Restaurants' in businesses['categories'] returns the boolean value False. This is passed to the brackets indexing operator for the DataFrame businesses which does not contain a column called False and thus raises a KeyError.
What you are looking to do is something called boolean indexing, which works like this:
businesses[businesses['categories'] == 'Restaurants']
If you find that your data contains spelling variations or alternative restaurant-related terms, the following may be of benefit. Essentially, you put your restaurant-related terms in restaurant_lst. The lambda function returns True if any of the items in restaurant_lst are contained within a row of the categories series. The .loc indexer then filters out the rows for which the lambda function returns False.
restaurant_lst = ['Restaurant','restaurantes','diner','bistro']
restaurant = businesses.loc[businesses['categories'].apply(lambda x: any(restaurant_str in x for restaurant_str in restaurant_lst))]
The reason for this is that the Series class implements a custom in operator (__contains__) that checks the index and returns a single boolean, rather than the element-wise boolean Series that == produces. Here's a workaround:
businesses[['Restaurants' in c for c in list(businesses['categories'])]]
Hopefully this helps someone who is looking for a substring match in the column and not a full match.
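Since the categories cells hold lists (as noted in the question), another way to build the mask, assuming the same businesses frame, is with apply plus a membership test, guarding against non-list cells:
# Keep rows whose categories list contains the exact string 'Restaurants'.
mask = businesses['categories'].apply(lambda cats: isinstance(cats, list) and 'Restaurants' in cats)
restaurantes = businesses[mask]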
I think what you meant was:
businesses = businesses.loc[businesses['categories'] == 'Restaurants']
That will only keep rows with the category Restaurants.
None of the answers here actually worked for me,
businesses[businesses['categories'] == 'Restaurants']
obviously won't work since the value in 'categories' is not a string, it's a list, meaning the comparison will always fail.
What does, however, work, is converting the column into tuples instead of strings:
businesses['categories'] = businesses['categories'].apply(tuple)
That allows you to use the standard .loc thing:
businesses.loc[businesses['categories'] == ('Restaurants',)]