Dictionary like get() on pandas Series with index value like iloc - python

Suppose I have the following pandas Series:
series=pd.Series(data=['A', 'B'], index=[2,3])
I can get the first value using .iloc as series.iloc[0] #output: 'A'
Or, I can just use the get method passing the actual index value like series.get(2) #output: 'A'
But is there any way to pass an iloc-like position to the get method to get the first value in the series?
>>> series.get(0, '')
'' # Expected 'A', the first value for index 0
One way is to reset the index (dropping the old one) and then call get:
>>> series.reset_index(drop=True).get(0, '')
'A'
Is there any other function similar to the get method (or a way to use get itself) that accepts an iloc-like position, without the overhead of resetting the index?
EDIT: Please note that this series comes from masking a dataframe, so it can be empty. That is why I intentionally passed the empty string '' as the default value to get.

Use next with iter if you need the first value and the Series may be empty:
next(iter(series), None)

Is this what you have in mind or not really?
series.get(series.index[0])

What about using head to get the first row and take the value from there?
series.head(1).item() # A
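A minimal sketch of an iloc-style get with a default, assuming a small wrapper is acceptable. The name get_by_position is hypothetical and not part of pandas; it only relies on Series.iloc raising IndexError for an out-of-bounds position, which also covers the empty-Series case from the question.

```python
import pandas as pd


def get_by_position(series, pos, default=None):
    """Hypothetical helper: like Series.get, but `pos` is an iloc-style position."""
    try:
        return series.iloc[pos]
    except IndexError:
        # Raised when pos is out of bounds, which includes any pos on an empty Series.
        return default


series = pd.Series(data=['A', 'B'], index=[2, 3])
empty = series[series == 'Z']  # a mask that matches nothing, so the result is empty

first = get_by_position(series, 0, '')     # value at position 0
missing = get_by_position(empty, 0, '')    # falls back to the default
first_via_iter = next(iter(empty), '')     # the next/iter variant, first value only
```

Unlike next(iter(series), default), the wrapper works for any position, not only the first one, and it does not copy the Series the way reset_index does.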

Related

When creating a boolean series from subsetting a df, what does index-subsetting in the same line filter for?

I'm experimenting with pandas and noticed something that seemed odd.
If you have a boolean series defined as an object, you can then subset that object by index numbers. For example, from the DataFrame 'ah', I create the boolean series 'tah' via
tah = ah['_merge']=='left_only'
This boolean series could be index-subset like this:
tah[0:1]
yielding:
Yet if I tried to do this all in one line
ah['_merge']=='left_only'[0:1]
I get an unexpected output, where the boolean series is neither sliced nor seems to correspond to the subsetted-column:
I've been experimenting and can't seem to determine what the [0:1] in the all-in-one-line version is slicing or filtering for. Any clarification would be appreciated!
Because indexing binds more tightly than ==, the [0:1] is applied to the string literal, not to the comparison. 'left_only'[0:1] evaluates to 'l', so the expression compares every row of the column with 'l'. No value in the column equals 'l' (the values are strings such as 'left_only' or 'both'), so you get a full column of False.
You can use (ah['_merge']=='left_only')[0:1], as MattDMo mentioned in the comments.
Alternatively, use pandas.Series.iloc with a slice to select the elements you need by their position:
(ah['_merge']=='left_only').iloc[0:1]
Both commands return a one-element Series containing True, since the first row of your dataframe has a 'left_only' merge type.
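The precedence issue can be reproduced with a small stand-in frame. The contents of ah below are made up for illustration, since the original data is not shown; only the '_merge' column name comes from the question.

```python
import pandas as pd

# Hypothetical stand-in for the asker's merged frame.
ah = pd.DataFrame({'_merge': ['left_only', 'both', 'left_only']})

# Subscription binds tighter than ==, so the slice applies to the string literal.
sliced_literal = 'left_only'[0:1]

# Equivalent to ah['_merge'] == 'l': compares every row against 'l'.
unparenthesized = ah['_merge'] == 'left_only'[0:1]

# Parentheses force the comparison first, then the result is sliced by position.
parenthesized = (ah['_merge'] == 'left_only').iloc[0:1]
```

Here unparenthesized has one entry per row of ah, while parenthesized has a single entry.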

Does iloc use the indices or the place of the row

I have extracted a few rows from a dataframe into a new dataframe. In this new dataframe the old indices remain. However, when I wanted to select a range from the new dataframe, I used positions starting from zero, as if the indices were new. Why did it work? Whenever I try to use the old indices, it gives an error.
germany_cases = virus_df_2[virus_df_2['location'] == 'Germany']
germany_cases = germany_cases.iloc[:190]
This is the code. The rows that I extracted from the dataframe virus_df_2 have indices between 16100 and 16590. I wanted to take the first 190 rows. In the second line of code I used iloc[:190] and it worked. However, when I tried to use iloc[16100:16290], it gave an error. What could be the reason?
pandas has two indexers for this, loc and iloc.
iloc is, as you have noticed, purely positional: it selects by the order of the rows, counting from zero, so iloc[n] refers to the row at position n regardless of its index label.
To reference rows by their index labels, which can be changed manually and need not be integers (strings or any other hashable objects work too), use loc.
In your case, iloc[16100:16290] asks for positions far beyond the end of your dataframe, which has far fewer rows than that. Those numbers are index labels, not positions, so use loc with them instead. Note that an out-of-range iloc slice on its own returns an empty result rather than raising, so the error probably surfaces as soon as something is done with that empty frame.
The indexing notation is hard to grasp at first, but the distinction is very helpful in some circumstances, for example after sorting or grouping, when labels and positions no longer line up.
Quick example that might help:
import pandas as pd

df = pd.DataFrame(
    dict(
        France=[1, 2, 3],
        Germany=[4, 5, 6],
        UK=['x', 'y', 'z'],
    ))
# loc picks the column by label, iloc then picks the row at position 1
df.loc[:, "Germany"].iloc[1:2]
Out:
1 5
Name: Germany, dtype: int64
Hope I could help.
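To make the difference concrete for the situation in the question, here is a small sketch with a frame whose labels start at 16100. The data is invented for illustration; only the column name 'location' and the label range are taken from the question.

```python
import pandas as pd

# Hypothetical stand-in for the filtered frame: 5 rows labelled 16100..16104.
germany_cases = pd.DataFrame(
    {'location': ['Germany'] * 5, 'cases': [1, 2, 3, 4, 5]},
    index=range(16100, 16105),
)

# Positional: the first three rows, whatever their labels are.
first_three = germany_cases.iloc[:3]

# Label-based: note that loc slices include the end label.
by_label = germany_cases.loc[16100:16102]

# Positions 16100..16101 do not exist in a 5-row frame; the slice is simply empty.
out_of_range = germany_cases.iloc[16100:16102]
```

first_three and by_label select the same rows here, while out_of_range is empty.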

Get a KeyError in Pandas

I am trying to call a function from a different module as below:
module1 - func1: returns a dataframe
module1 - func2(p_df_in_fromfunc1)
function 2:
for i in range(0, len(p_df_in_fromfunc1)):
    # Trying to retrieve row values of individual columns and assign to variables
    v_tmp = p_df_in_fromfunc1.loc[i, "Col1"]
When trying to run the above code, I get the error:
KeyError 0
Could the issue be because I don't have a zero numbered row?
Without knowing more of your code, my guess is that you want positional indexing, so try iloc instead of loc. Since iloc only accepts integer positions, select the column by label first.
Something like:
v_tmp = p_df_in_fromfunc1["Col1"].iloc[i]
You may also have forgotten to close the quote after Col1 in the loc call?
v_tmp = p_df_in_fromfunc1.loc[i,"Col1"]
For retrieving a row for specific columns do:
columns = ['Col1', 'Col2']
df[columns].iloc[index]
If you only want one column, you can simplify it to: df['Col1'].iloc[index]
As per your comment, you do not need to reset the index, you can iterate over the values of your index array: df.index
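A short sketch of why loc[0, ...] fails and of the two ways around it, assuming a frame whose index does not contain the label 0. The frame below is made up; only the column name Col1 comes from the question.

```python
import pandas as pd

# Hypothetical frame whose labels start at 5, so there is no row labelled 0.
df = pd.DataFrame({'Col1': [10, 20, 30]}, index=[5, 6, 7])

try:
    df.loc[0, 'Col1']
    loc_zero_failed = False
except KeyError:
    # loc looks up the label 0, which is not in the index.
    loc_zero_failed = True

# Option 1: iterate by position with iloc.
by_position = [df['Col1'].iloc[i] for i in range(len(df))]

# Option 2: iterate over the actual labels and keep using loc.
by_label = [df.loc[label, 'Col1'] for label in df.index]
```

Both options visit the rows in the same order, and neither needs the index to be reset.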

Can't get index position from list of Dataframes

I'm trying to get the position of a dataframe from a list of dataframes, by using the built-in method index in python. My code is below:
df1 = pd.DataFrame([1, 2, 3])
df2 = pd.DataFrame([4, 5, 6])
df3 = pd.DataFrame([7, 8, 9])
dfs = [df1, df2, df3]
for df in dfs:
    print(dfs.index(df))
Instead of the expected 0, 1 and 2, it prints only 0 and then raises a ValueError:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I know this error typically comes from data frames comparison, and I've tried adding the .all() but to no avail.
If I change the list to something like a list of strings, ints or a mix of them, no problem whatsoever.
I've tried searching around but found nothing on this specific error.
I know I can easily add an extra variable that keeps on adding 1 for each iteration of the list, but I'd really like to understand what I'm doing wrong.
As others have said, you can use enumerate to do what you want.
As to why what you're trying isn't working:
list.index(item) looks for the first element of the list such that element is item or element == item. The first iteration works only because of the identity check: dfs[0] is the very same object as df, so == is never evaluated. Now consider df = dfs[1]. When we call dfs.index(df), Python first compares the first element of dfs with df. In other words we are checking whether dfs[0] == df. If you type this into your interpreter you will find that this gives you a DataFrame of booleans. This is a DataFrame object, but what Python wants to know is whether it should consider this object as True or not. So it needs to convert this DataFrame into a single bool. It tries to do this via bool(dfs[0] == df), which asks pandas to collapse the whole DataFrame into one bool. pandas deliberately refuses, and for good reason: the correct way to do this is ambiguous (should all values be True, or any of them?). So at this point pandas raises the error that you see.
In summary: for index to make sense, the objects must have some notion of equality (==), but DataFrames don't have such a notion, for good reason.
If in a future problem you need to find the index of a DataFrame within a list of DataFrames, you first have to decide on a notion of equality. One sensible such notion is if all the values are the same. Then you can define a function like:
def index(search_dfs, target_df):
    for i, search_df in enumerate(search_dfs):
        if (search_df.values == target_df.values).all():
            return i
    raise ValueError('DataFrame not in list')
index(dfs, dfs[2])
Out: 2
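As a variation on the function above, here is a sketch that uses object identity first and falls back to DataFrame.equals, which also compares shape and dtypes. This is one possible notion of equality rather than the only sensible one, and the name index_of_frame is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame([1, 2, 3])
df2 = pd.DataFrame([4, 5, 6])
df3 = pd.DataFrame([7, 8, 9])
dfs = [df1, df2, df3]

# Reproduce the root cause: an element-wise comparison has no single truth value.
try:
    bool(df1 == df2)
    truth_value_is_ambiguous = False
except ValueError:
    truth_value_is_ambiguous = True


def index_of_frame(frames, target):
    """Hypothetical helper: position of `target` in `frames`."""
    for position, frame in enumerate(frames):
        # Identity is cheap; equals() returns one bool for the whole frame.
        if frame is target or frame.equals(target):
            return position
    raise ValueError('DataFrame not in list')


positions = [index_of_frame(dfs, frame) for frame in dfs]
position_of_copy = index_of_frame(dfs, df2.copy())
```

Because equals returns a plain bool, the if statement never has to turn a DataFrame into a truth value.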
Use enumerate:
for i, df in enumerate(dfs):
    print(i)
I guess you want to do something like this:
dfs = [df1, df2, df3]
for i, df in enumerate(dfs):
    print(i)
And this is not a pandas-related question. It's simply a Python question.

How to get values from a timestamp indexed dataframe

I'm constructing a DataFrame like so:
import datetime
from pandas import DataFrame
dates = [datetime.datetime.today() + datetime.timedelta(days=x) for x in range(0, 2)]
d = DataFrame([[1, 2], [3, 4]], index=dates, columns=['a', 'b'])
I want to get a value like so:
d[d.index[0]]['a']
But I get the following error:
KeyError: Timestamp('2018-04-26 16:08:16.120031')
How come?
If you are trying to get the first element from column 'a', you access it like this:
d.loc[d.index[0], 'a']
The way you have it written now, d[d.index[0]] is trying to get a column with name d.index[0].
It depends what you want to do.
If you just want the first row, you can access it with iloc:
d.iloc[0]['a']
If you want to filter the dataframe for example by the year, you could do:
d.loc[d.index.year == 2018, 'a']
d['a'][d.index[0]]
My confusion came from the fact that DataFrame indexing with [] is column first, and not row first as one would expect from general multi-dimensional data structures. So in order to get the value, one must switch the indices:
dataFrame[column][row]
Thanks @Michael for the hint.
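The access patterns discussed in these answers can be compared side by side. The sketch below uses fixed dates instead of datetime.today() so that it is reproducible; that substitution is an assumption made for illustration.

```python
import pandas as pd

dates = pd.to_datetime(['2018-04-26', '2018-04-27'])
d = pd.DataFrame([[1, 2], [3, 4]], index=dates, columns=['a', 'b'])

# d[key] with a single key looks for a COLUMN named key, hence the KeyError.
try:
    d[d.index[0]]
    column_lookup_failed = False
except KeyError:
    column_lookup_failed = True

via_loc = d.loc[d.index[0], 'a']        # row label, then column label
via_at = d.at[d.index[0], 'a']          # scalar accessor, same labels
via_column_first = d['a'][d.index[0]]   # column first, then row label
via_iloc = d.iloc[0]['a']               # row position, then column label
```

All four successful lookups return the same cell; loc and at do it in a single indexing step, which avoids the chained lookup of the column-first form.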
First of all, you always have to know what type of data you are dealing with. In your case, you create a DatetimeIndex:
DatetimeIndex(['2020-08-25 11:00:00.000307403',
'2020-08-25 11:00:00.000558638',
'2020-08-25 11:00:00.002280412',
'2020-08-25 11:00:00.002440933'])
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
and inside the DatetimeIndex, each element of it is a Timestamp:
2020-08-25 11:00:00.000307403
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
As you are working with a DatetimeIndex, note that df[...] with a single key looks for a column, so passing the Timestamp object raises a KeyError. In my case, indexing with the timestamp string ('2020-08-25 11:00:00.000307403') instead of the Timestamp object (Timestamp('2020-08-25 11:00:00.000307403')) worked.
So, instead of doing:
df[Timestamp('2020-08-25 11:00:00.000307403')]
you should do:
df['2020-08-25 11:00:00.000307403']
I lost about two hours tracking this error down. The easiest way to solve it is to convert the index value to a string. Note that row lookup by date string through plain df[...] has been deprecated in pandas in favour of df.loc[...], so the loc form is the safer spelling.
For your solution:
d.loc[str(d.index[0]), 'a']
should work
