Does iloc use the indices or the place of the row - python

I have extracted a few rows from a dataframe into a new dataframe. The new dataframe keeps the old indices. However, when I specified a range on this new dataframe, I used it as if it had new indices starting from zero. Why did that work? Whenever I try to use the old indices, it gives an error.
germany_cases = virus_df_2[virus_df_2['location'] == 'Germany']
germany_cases = germany_cases.iloc[:190]
This is the code. The rows that I extracted from the dataframe virus_df_2 have indices between 16100 and 16590. I wanted to take the first 190 rows. In the second line of code I used iloc[:190] and it worked. However, when I tried to use iloc[16100:16290] it gave an error. What could be the reason?

In pandas there are two indexing attributes, loc and iloc.
iloc is, as you have noticed, indexing based on the position of the rows in the dataframe, so you can reference the row at position n (counting from zero) using iloc[n].
To reference rows by their pandas index labels, which can be manually altered and need not be integers (strings or any other hashable objects, i.e. ones that define the __hash__ method, also work), you should use the loc attribute.
In your case, iloc fails because positions 16100 to 16290 lie outside your dataframe, whose positions start again at 0 after filtering. Strictly speaking, an out-of-range iloc slice returns an empty result, while a single out-of-range position such as iloc[16100] raises an IndexError. Use loc with the old labels instead and it will work; note that a loc slice includes both endpoints.
At first the indexing notation will be hard to grasp, but it can be very helpful in some circumstances, for example when sorting or performing grouping operations.
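To make the distinction concrete, here is a minimal sketch. The small frame and its cases column are made-up stand-ins for the filtered germany_cases data; only the leftover index labels starting at 16100 mirror the question.

```python
import pandas as pd

# Hypothetical stand-in: 5 rows that kept their old labels 16100..16104.
germany_cases = pd.DataFrame(
    {"cases": [10, 20, 30, 40, 50]},
    index=range(16100, 16105),
)

# iloc counts positions from zero, whatever the labels are.
first_three = germany_cases.iloc[:3]

# loc uses the labels; a loc slice includes both endpoints.
same_rows = germany_cases.loc[16100:16102]

# An out-of-range iloc slice returns an empty frame without raising...
empty = germany_cases.iloc[16100:16102]

# ...while a single out-of-range position raises IndexError.
try:
    germany_cases.iloc[16100]
    raised = False
except IndexError:
    raised = True
```

Both first_three and same_rows hold the same three rows; only the way they are addressed differs.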

Quick example that might help:
import pandas as pd
df = pd.DataFrame(
    dict(
        France=[1, 2, 3],
        Germany=[4, 5, 6],
        UK=['x', 'y', 'z'],
    ))
# loc picks the "Germany" column by label; iloc then keeps only position 1
df.loc[:, "Germany"].iloc[1:2]
Out:
1 5
Name: Germany, dtype: int64
Hope I could help.

Related

pandas - rename_axis doesn't work as expected afterwards - why?

I was reading through the pandas documentation (10 minutes to pandas) and came across this example:
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4),
index=dates, columns=['A', 'B', 'C', 'D'])
s = df['A']
s[dates[5]]
# Out[5]: -0.6736897080883706
It's quite logical, but if I try it on my own and set the index name afterwards (example follows), then I can't select data in the same way as with s[dates[5]]. Does someone know why?
e.g.
df = pd.read_csv("xyz.csv").head(100)
s = df['price'] # series with unnamed int index + price
s = s.rename_axis('indexName')
s[indexName[5]] # NameError: name 'indexName' is not defined
Thanks in advance!
Edit: s.index.name returns indexName, despite not working with the call of s[indexName[5]]
You are confusing the name of the index, and the index values.
In your example, the first code chunk runs because dates is a variable, so when you call dates[5] it returns the value at position 5 of the dates object (the sixth date, since positions start at zero), which is a valid index value in the dataframe.
In your own attempt, you are referring to indexName inside your slice (ie. when you try to run s[indexName[5]]), but indexName is not a variable in your environment, so it will throw an error.
The correct way to subset parts of your series or dataframe, is to refer to the actual values of the index, not the name of the axis. For example, if you have a series as below:
s = pd.Series(range(5), index=list('abcde'))
Then the values in the index are a through e, therefore to subset that series, you could use:
s['b']
or:
s.loc['b']
Also note, if you prefer to access elements by location rather than by index value, you can use the .iloc indexer. So to get the second element, you would use:
s.iloc[1] # location 0 is the first element
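As a small sketch of the difference between the index name and the index values, using made-up prices in place of the CSV column from the question:

```python
import pandas as pd

# Hypothetical price data standing in for df['price'] from the CSV.
s = pd.Series([9.5, 7.0, 8.25], name="price")
s = s.rename_axis("indexName")

index_name = s.index.name       # the name of the axis itself
index_values = list(s.index)    # the values you can actually select by

by_label = s.loc[1]             # select using an index value
by_position = s.iloc[1]         # select using a position
```

rename_axis only changes index_name; the index values stay 0, 1, 2, and those are what goes inside the brackets.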
Hope it helps to clarify. I would recommend you continue to work through some introductory pandas tutorials to build up a basic understanding.
First of all, let's understand the example:
s[label] is used to select the element that has that index label.
s is a Series (column A of df).
Its index is the dates.
dates[5] is equal to '2000-01-06', which is the index label of the sixth row of s (position 5, counting from zero), so the result is the value stored under that label.
In your code:
indexName is not defined, so indexName[5] does not represent an index label of your Series.
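The documentation example can be reproduced with predictable values to see what dates[5] refers to. This is only a sketch: the values 0 to 7 replace the random numbers so that each row is easy to identify.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("1/1/2000", periods=8)
# Values 0..7 instead of random numbers, so each row is easy to identify.
s = pd.Series(np.arange(8), index=dates)

label = dates[5]            # position 5 in dates, i.e. the sixth date
by_label = s.loc[label]     # look the value up by that index label
by_position = s.iloc[5]     # the same element, selected by position
```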

Getting not in index error in Pandas Dataframe [duplicate]

I'm struggling to understand the concept behind column naming conventions, given that one of the following attempts to create a new column appears to fail:
from numpy.random import randn
import pandas as pd
df = pd.DataFrame({'a':range(0,10,2), 'c':range(0,1000,200)},
columns=list('ac'))
df['b'] = 10*df.a
df
gives the following result:
Yet, if I were to try to create column b by substituting with the following line, there is no error message, yet the dataframe df remains with only the columns a and c.
df.b = 10*df.a ### rather than the previous df['b'] = 10*df.a ###
What has pandas done and why is my command incorrect?
What you did was add an attribute b to your df:
In [70]:
df.b = 10*df.a
df.b
Out[70]:
0 0
1 20
2 40
3 60
4 80
Name: a, dtype: int32
but we see that no new column has been added:
In [73]:
df.columns
Out[73]:
Index(['a', 'c'], dtype='object')
which means we get a KeyError if we try df['b']. To avoid this ambiguity you should always use square brackets when assigning.
For instance, if you had a column named index or sum or max, then df.index would return the index and not the column named index, and similarly df.sum and df.max would clash with those DataFrame methods.
I strongly advise always using square brackets: it avoids any ambiguity, and the latest IPython is able to resolve column names using square brackets. It's also useful to think of a dataframe as a dict of Series, in which case it makes sense to use square brackets for assigning and returning a column.
Always use square brackets for assigning columns
Dot notation is a convenience for accessing columns in a dataframe. If they conflict with existing properties (e.g. if you had a column named 'max'), then you need to use square brackets to access that column, e.g. df['max']. You also need to use square brackets when the column name contains spaces, e.g. df['max value'].
A DataFrame is just an object which has the usual properties and methods. If you use dot notation for assignment, you are creating a property or method for the dataframe object. So df.val = 2 will assign df with a property val that has a value of two. This is very different from df['val'] = 2 which creates a new column in the dataframe and assigns each element in that column the value of two.
To be safe, using square bracket notation will always provide the correct result.
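A minimal sketch of the two assignments side by side, using the frame from the question. The warnings filter is only there because recent pandas versions emit a UserWarning when a list-like is assigned to a new attribute name.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"a": range(0, 10, 2), "c": range(0, 1000, 200)})

# Dot assignment to a new name only sets an attribute on the object.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.b = 10 * df.a
columns_after_dot = list(df.columns)

# Bracket assignment creates a real column.
df["b"] = 10 * df["a"]
columns_after_brackets = list(df.columns)
```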
As an aside, the columns=list('ac') argument in your constructor call is redundant on current Python versions: it only fixes the column order, and dictionaries have preserved insertion order since Python 3.7. On older versions dictionaries were unordered, so pd.DataFrame({'a': [...], 'b': [...]}) could potentially return a dataframe with columns ['b', 'a'], and assigning column names afterwards with df.columns = ... could then mix up the column headers.
The issue has to do with how properties (attributes) are handled in Python. Python places no restriction on setting new attributes on an object, so for example you could do something like
df.myspecialstuff = ["dog", "cat", 5]
So when you do assignment like
df.b = 10*df.a
It is ambiguous whether you want to add a property or a new column, and a property is set. The easiest way to actually see what is going on with this is to use pdb and step through the code
import pdb
x = df.a
pdb.run("df.a1 = x")
This will step into __setattr__(), whereas pdb.run("df['a2'] = x") will step into __setitem__().
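A lighter check than stepping through pdb is to compare what dot assignment does for an existing column name and for a new name. This is a sketch assuming a reasonably recent pandas; the names a and a1 are just illustrations.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# 'a' is already a column, so dot assignment updates that column.
df.a = [10, 20, 30]

# 'a1' is not a column, so dot assignment only sets a plain attribute.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas warns about this pattern
    df.a1 = [1, 2, 3]

is_column = "a1" in df.columns
is_attribute = hasattr(df, "a1")
```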

What advantages does the iloc function have in pandas and Python

I just began to learn Python and pandas, and I saw the iloc function used in many tutorials. It is always stated that you can use this function to refer to columns and rows in a dataframe. However, you can also do this directly without iloc. Here is an example where both statements yield the same output:
# features is just a dataframe with several rows and columns
features = pd.DataFrame(features_standardized)
y_train = features.iloc[start:end] [[1]]
y_train_noIloc = features [start:end] [[1]]
What is the difference between the two statements, and what advantage do I have when using iloc? I'd appreciate every comment.
Per the pandas docs, iloc provides:
Purely integer-location based indexing for selection by position.
Therefore, as shown in the simplistic examples below, [row, col] indexing is not possible without using loc or iloc, as a KeyError will be thrown.
Example:
# Build a simple, sample DataFrame.
df = pd.DataFrame({'a': [1, 2, 3, 4]})
# No iloc
>>> df[0, 0]
KeyError: (0, 0)
# With iloc:
>>> df.iloc[0, 0]
1
The same logic holds true when using loc and a column name.
What is the difference and when does the indexing work without iloc?
The short answer:
Use loc and/or iloc when indexing on both rows and columns. If you index on rows only or on columns only, you can get away without them; for rows this is referred to as 'slicing'.
However, I see that your example uses [start:end][[1]]. It is generally considered bad practice to chain square brackets in pandas (e.g. [][]), and it is usually a sign that a different (more efficient) approach should be taken; in this case, using iloc.
The longer answer:
Adapting your [start:end] slicing example (shown below), indexing works without iloc when indexing (slicing) on rows only. The following example does not use iloc and returns rows 0 through 2 (the end of the slice is excluded).
df[0:3]
Output:
a
0 1
1 2
2 3
Note the difference between [0:3] and [0, 3]. The former (slicing) uses a colon and returns the rows at positions 0 through 2. The latter uses a comma and is a [row, col] indexer, which requires the use of iloc.
Aside:
The two methods can be combined as shown here, returning rows 0 through 2 of the column at position 0. This is not possible without the use of iloc.
df.iloc[0:3, 0]
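Putting the variants next to each other, as a sketch on the same one-column frame (the loc line is added here for comparison and is not part of the original example):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4]})

sliced = df[0:3]            # plain slice: positions 0, 1, 2 (end excluded)
by_iloc = df.iloc[0:3, 0]   # same rows, restricted to the column at position 0
by_loc = df.loc[0:3, "a"]   # label slice: includes label 3, so four rows

try:
    df[0, 0]                # without loc/iloc this looks for a column named (0, 0)
    raised = False
except KeyError:
    raised = True
```

The loc result is one row longer because label slices include the end label, whereas positional slices stop before the end position.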

Pandas concat: why is the `DataFrame` with duplicated index not working with concat()?

Code example:
a = pd.DataFrame({"a": [1,2,3],}, index=[1,2,2])
b = pd.DataFrame({"b": [1,4,5],}, index=[1,4,5])
pd.concat([a, b], axis=1)
It raises error: ValueError: Shape of passed values is (7, 2), indices imply (5, 2)
What I expected as a result:
Why does it not return like this? concat's default joining is outer so I think my thought is reasonable enough... Am I missing something?
TLDR: Why? I don't really know for sure, but I think it has to do with just the design of the package.
An index in pandas "is like an address, that’s how any data point across the dataframe or series can be accessed. Rows and columns both have indexes, rows indices are called as index and for columns its general column names." source
Now you are doing it with axis=1, i.e. placing the frames side by side and aligning them on the row index. In a, the index label 2 appears twice, which means that we have an address which points to two different values. We can still "access" these values by doing a[a.index == 2]. Do note, however, that the index in a mathematical sense is now not a proper function, because one value maps to two different values (source). I am guessing the implementation was designed so that indices would be injective, surjective, or bijective in order to make it easier to design.
Thus, when attempting to concatenate, pandas wants to match all the indices together where possible and fill in NaNs where not possible. However, as the error says, it thinks the shape based on the indices is (5, 2) because of this address sharing two different values. So why doesn't it work? Because I believe pandas checks the shape the result should have beforehand, and then does the concatenation. In order to check the shape beforehand it looks at the indices, and therefore it breaks when it checks.
Do note too that this does not work with duplicated column names either:
a = pd.DataFrame({"a": [1,2,3], 'b': [9,8,7]}, index=[1,2,2])
b = pd.DataFrame({"b": [1,4,5], 'bx': [1,4,3]}, index=[1,4,5]).rename(columns={'bx': 'b'})
pd.concat([a,b]) # axis=0 is the default
ValueError: Plan shapes are not aligned
Therefore pd.concat appears to need unique labels on the axis it has to align: you can't have two identical column names when you concatenate row-wise, and likewise you can't have duplicated index labels when you concatenate column-wise.
Interestingly, for your original example, pd.concat([a, b], ignore_index=True, axis=1) also raises the same error, leading me to more strongly suspect that pandas is checking the shape before the concatenation.
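A short sketch of the duplicated label on the frames from the question. The last line is an assumption on my part about what result was wanted: it swaps in DataFrame.join as one way to get an outer alignment despite the duplicate, and it is not what pd.concat does.

```python
import pandas as pd

a = pd.DataFrame({"a": [1, 2, 3]}, index=[1, 2, 2])
b = pd.DataFrame({"b": [1, 4, 5]}, index=[1, 4, 5])

has_unique_index = a.index.is_unique    # False: label 2 appears twice
rows_for_label_2 = a[a.index == 2]      # both rows that share label 2

# Assumption: join is used here as a substitute for pd.concat(axis=1).
joined = a.join(b, how="outer")
```

joined has five rows (labels 1, 2, 2, 4, 5), with NaN wherever one of the frames has no value for a label.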

Pandas: Selecting column from data frame

Pandas beginner here. I'm looking to return a full column's data and I've seen a couple of different methods for this.
What is the difference between the two entries below, if any? It looks like they return the same thing.
loansData['int_rate']
loansData.int_rate
The latter is basically syntactic sugar for the former. There are (at least) a couple of gotchas:
If the name of the column is not a valid Python identifier (e.g., if the column name is my column name?!), you must use the former.
Somewhat surprisingly, you can only use the former form to completely correctly add a new column (see, e.g., here).
Example for latter statement:
import pandas as pd
df = pd.DataFrame({'a': range(4)})
df.b = range(4)
>> df.columns
Index([u'a'], dtype='object')
For some reason, though, df.b returns the correct results.
They do return the same thing. The column names in pandas are akin to dictionary keys that refer to a series. The column names themselves are named attributes that are part of the dataframe object.
The first method is preferred as it allows for spaces and other characters that are not legal in Python identifiers.
For a more complete explanation, I recommend you take a look at this article:
http://byumcl.bitbucket.org/bootcamp2013/labs/pd_types.html#pandas-types
Search 'Access using dict notation' to find the examples where they show that these two methods return identical values.
They're the same, but for me the first method is preferred because it also handles spaces and illegal characters in column names. Example:
In [115]:
df = pd.DataFrame(columns=['a', ' a', '1a'])
df
Out[115]:
Empty DataFrame
Columns: [a, a, 1a]
Index: []
In [116]:
print(df.a) # works
print(df[' a']) # works
print(df.1a) # error
File "<ipython-input-116-4fa4129a400e>", line 3
print(df.1a)
^
SyntaxError: invalid syntax
Really, when you use the dot it's trying to find the key as an attribute. If for some reason you have used column names that match an existing attribute, then using the dot will not do what you expect.
Example:
In [121]:
df = pd.DataFrame(columns=['index'], data = np.random.randn(3))
df
Out[121]:
index
0 0.062698
1 -1.066654
2 -1.560549
In [122]:
df.index
Out[122]:
Int64Index([0, 1, 2], dtype='int64')
The above has now shown the index as opposed to the column 'index'
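For completeness, a small sketch showing that square brackets still reach such columns. The column names here are made up to clash with attribute access.

```python
import pandas as pd

# Hypothetical column names chosen to clash with attribute access.
df = pd.DataFrame({"index": [0.5, 1.5, 2.5], "max value": [1, 2, 3]})

row_index = df.index             # the DataFrame's row index, not the column
index_column = df["index"]       # the column named 'index'
spaced_column = df["max value"]  # a name with a space only works in brackets
```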
If you are working on an ML project and want to extract the feature and target variables separately, the code below will be useful.
It builds the feature and target column names as lists by indexing into data.columns; these lists can then be applied to the dataframe. In this code, data is the DataFrame.
len_col=len(data.columns)
total_col=list(data.columns)
Target_col_Y=total_col[-1]
Feature_col_X=total_col[0:-1]
print('The dependent variable is')
print(Target_col_Y)
print('The independent variables are')
print(Feature_col_X)
The output for the same can be obtained as given below:
The dependent variable is
output
The independent variables are
['age', 'job', 'marital', 'education','day_of_week', ... etc]
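The same split can also be done directly on the data with iloc. This is a sketch on a tiny made-up frame, assuming as above that the target is the last column.

```python
import pandas as pd

# Hypothetical data; the last column is treated as the target.
data = pd.DataFrame(
    {"age": [25, 40], "job": ["admin", "tech"], "output": [0, 1]}
)

X = data.iloc[:, :-1]   # every column except the last
y = data.iloc[:, -1]    # the last column, as a Series
```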
