Can't get index position from list of Dataframes - python

I'm trying to get the position of a dataframe in a list of dataframes by using the built-in list method index in Python. My code is below:
df1 = pd.DataFrame([1, 2, 3])
df2 = pd.DataFrame([4, 5, 6])
df3 = pd.DataFrame([7, 8, 9])
dfs = [df1, df2, df3]
for df in dfs:
    print(dfs.index(df))
Instead of printing the expected 0, 1 and 2, it only prints 0, followed by a ValueError:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I know this error typically comes from DataFrame comparisons, and I've tried adding .all(), but to no avail.
If I change the list to something like a list of strings, ints or a mix of them, no problem whatsoever.
I've tried searching around but found nothing on this specific error.
I know I could easily add an extra counter variable that increments on each iteration of the list, but I'd really like to understand what I'm doing wrong.

As others have said, you can use enumerate to do what you want.
As to why what you're trying isn't working:
list.index(item) looks for the first element of the list such that element == item. As an example, consider df = dfs[0]. When we call dfs.index(df), Python first checks whether df == the first element of dfs; in other words, it checks whether df == df. If you type this into your interpreter, you will find that it gives you a DataFrame of Trues, because == on DataFrames is element-wise.
That result is a DataFrame object, but what Python needs to know is whether to treat it as True or not, so it tries to convert the DataFrame into a single bool via bool(df == df). pandas deliberately refuses to perform this conversion, because the correct way to do it is ambiguous (all True? any True?), and instead raises the error that you see.
In summary: for index to make sense, the objects must have a notion of equality (==) that yields a single True/False, but DataFrame equality is element-wise, for good reason.
If in a future problem you need to find the index of a DataFrame within a list of DataFrames, you first have to decide on a notion of equality. One sensible such notion is if all the values are the same. Then you can define a function like:
def index(search_dfs, target_df):
    for i, search_df in enumerate(search_dfs):
        if (search_df.values == target_df.values).all():
            return i
    raise ValueError('DataFrame not in list')
index(dfs, dfs[2])
Out: 2
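If the goal is simply the position of the exact same object (not a value-equal copy), identity comparison with is sidesteps the ambiguity entirely. A minimal sketch, reusing the question's setup:

```python
import pandas as pd

df1 = pd.DataFrame([1, 2, 3])
df2 = pd.DataFrame([4, 5, 6])
df3 = pd.DataFrame([7, 8, 9])
dfs = [df1, df2, df3]

# Compare object identity instead of element-wise equality,
# so no DataFrame is ever coerced to a single bool.
positions = [next(i for i, d in enumerate(dfs) if d is df) for df in dfs]
print(positions)  # [0, 1, 2]
```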

Use enumerate:
for i, df in enumerate(dfs):
    print(i)

I guess you want to do something like this:
dfs = [df1, df2, df3]
for i, df in enumerate(dfs):
    print(i)
This is not a pandas-related question; it's simply a Python question.

When creating a boolean series from subsetting a df, what does index-subsetting in the same line filter for?

I'm experimenting with pandas and noticed something that seemed odd.
If you have a boolean series defined as an object, you can then subset that object by index numbers. For example, from a df 'ah', I create a boolean series 'tah' via:
tah = ah['_merge'] == 'left_only'
This boolean series can be index-subset like this:
tah[0:1]
which yields the first element of the series, as expected.
Yet if I tried to do this all in one line
ah['_merge']=='left_only'[0:1]
I get an unexpected output, where the boolean series is neither sliced nor seems to correspond to the subsetted column.
I've been experimenting and can't seem to determine what, in the all-in-one-line version, [0:1] is slicing/filtering for. Any clarification would be appreciated!
Because of operator precedence, the slice binds to the string first: 'left_only'[0:1] is just the letter 'l'. You are therefore comparing every row to 'l', and since no row equals 'l' (each is 'left_only' or 'both'), the result is a column of False booleans.
You can use (ah['_merge']=='left_only')[0:1] as MattDMo mentioned in the comments.
Alternatively, you can use pandas.Series.iloc with a slice object to select the elements you need based on their position in the Series:
(ah['_merge']=='left_only').iloc[0:1]
Both commands return a one-element Series containing True, since the first row of your dataframe has a 'left_only' merge type.
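The precedence issue can be verified directly. In this sketch, 'ah' is a made-up stand-in dataframe with the same '_merge' column as in the question:

```python
import pandas as pd

# Hypothetical stand-in for the question's `ah` dataframe.
ah = pd.DataFrame({'_merge': ['left_only', 'both', 'left_only']})

# [0:1] binds to the string first: 'left_only'[0:1] == 'l'
unparenthesized = ah['_merge'] == 'left_only'[0:1]
# Parentheses make the slice apply to the boolean Series instead.
parenthesized = (ah['_merge'] == 'left_only')[0:1]

print(unparenthesized.tolist())  # [False, False, False]
print(parenthesized.tolist())    # [True]
```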

pandas - rename_axis doesn't work as expected afterwards - why?

I was reading through the pandas documentation (10 minutes to pandas) and came across this example:
import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4),
                  index=dates, columns=['A', 'B', 'C', 'D'])
s = df['A']
s[dates[5]]
# Out[5]: -0.6736897080883706
That's quite logical, but if I try it on my own data and set the index name afterwards (example follows), then I can't select data in the same way. Does anyone know why?
e.g.
df = pd.read_csv("xyz.csv").head(100)
s = df['price'] # series with unnamed int index + price
s = s.rename_axis('indexName')
s[indexName[5]] # NameError: name 'indexName' is not defined
Thanks in advance!
Edit: s.index.name returns 'indexName', despite the call s[indexName[5]] not working.
You are confusing the name of the index, and the index values.
In your example, the first code chunk runs because dates is a variable, so when you call dates[5] it actually returns the 5th value from the dates object, which is a valid index value in the dataframe.
In your own attempt, you are referring to indexName inside your slice (ie. when you try to run s[indexName[5]]), but indexName is not a variable in your environment, so it will throw an error.
The correct way to subset parts of your series or dataframe, is to refer to the actual values of the index, not the name of the axis. For example, if you have a series as below:
s = pd.Series(range(5), index=list('abcde'))
Then the values in the index are a through e, therefore to subset that series, you could use:
s['b']
or:
s.loc['b']
Also note, if you prefer to access elements by location rather than index value, you can use the .iloc method. So to get the second element, you would use:
s.iloc[1] # location 0 is the first element
Hope it helps to clarify. I would recommend you continue to work through some introductory pandas tutorials to build up a basic understanding.
First of all, let's understand the example:
s[label] selects the element having that index label.
Here s is a Series whose index labels are the dates.
dates[5] is equal to '2000-01-06', which is the index label of the row at position 5 of s, so the result is the value at that label.
In your code, indexName is not a defined variable, so indexName[5] does not refer to an index label of your Series.
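To tie both answers together: rename_axis only sets the index's name; selection still goes through the index values, which you can reach via s.index. A minimal sketch, with made-up prices standing in for the CSV data:

```python
import pandas as pd

# Hypothetical stand-in for df['price'] from the question's CSV.
s = pd.Series([10.0, 11.5, 9.9, 12.3, 8.7, 14.1])
s = s.rename_axis('indexName')

print(s.index.name)   # 'indexName' -- the name was set...
print(s[s.index[5]])  # 14.1 -- ...but selection uses index values
```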

Dictionary like get() on pandas Series with index value like iloc

Suppose I have following pandas' Series:
series = pd.Series(data=['A', 'B'], index=[2, 3])
I can get the first value using .iloc as series.iloc[0] #output: 'A'
Or, I can just use the get method passing the actual index value like series.get(2) #output: 'A'
But is there any way to pass iloc like indexing to get method to get the first value in the series?
>>> series.get(0, '')
'' # Expected 'A', the first value for index 0
One way is to just reset the index (dropping the old one) and then call the get method:
>>> series.reset_index(drop=True).get(0, '')
'A'
Is there any other function similar to get method (or a way to use get method) that allows passing iloc like index, without the overhead of resetting the index?
EDIT: Please be noted that this series comes from dataframe masking so this series can be empty, that is why I intentionally added default value as empty string '' for get method.
Use next with iter if you need the first value to also work with an empty Series:
next(iter(series), None)
Is this what you have in mind or not really?
series.get(series.index[0])
What about using head to get the first row and take the value from there?
series.head(1).item() # A
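Of the approaches above, only the next(iter(...), default) form stays safe when masking leaves the Series empty; series.head(1).item() and series.get(series.index[0]) would both raise on an empty Series. A quick sketch:

```python
import pandas as pd

series = pd.Series(data=['A', 'B'], index=[2, 3])
empty = series[series == 'Z']  # masking can leave nothing behind

# next/iter yields the first value, or the default on an empty Series.
print(next(iter(series), ''))  # 'A'
print(next(iter(empty), ''))   # ''
```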

Encountering ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

I have a function
def cruise_fun(speed, accl, acmx, dcmx):
    count = 0
    index = []
    for i in range(len(speed.dropna())):
        if ((speed[i]>40) & (accl[i]<acmx*0.2) & (accl[i]>dcmx*0.2)):
            count += 1
            index.append(i)
    return count, index
This function is being called in the following statement
cruise_t_all, index_all =cruise_fun(all_data_speed[0], acc_val_all[0], acc_max_all, decc_max_all)
all_data_speed and acc_val_all are two dataframes of 1 column and 38287 rows. acc_max_all and decc_max_all are two float64 values. I have tried to implement solutions provided on Stack Overflow as much as I could. I have used both the and keyword and the & operator. I cannot get around the problem.
You are using pandas in the wrong way. You should not loop over all the rows like you do. You can concatenate the columns provided and then check the conditions:
def cruise_fun(speed, accl, acmx, dcmx):
    df = pd.concat([speed.dropna(), accl], axis=1)
    df.columns = ["speed", "accl"]
    # inclusive="neither" keeps the strict < / > bounds (pandas >= 1.3; older versions spelled this inclusive=False)
    mask = (df["speed"] > 40) & df["accl"].between(dcmx * .2, acmx * .2, inclusive="neither")
    return mask.sum(), df[mask].index
NB: A few assumptions that I make:
I assume that you do not have conflicts for your column names, otherwise the concat will not work and you will need to rename your columns first
I assume that the index from speed.dropna() and accl match but I would not be surprised if it is not the case. You should make sure that this is fine, or better: store everything in the same dataframe
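A quick check of the vectorized version on toy data (the numbers and thresholds are made up; inclusive="neither" is the pandas >= 1.3 spelling of strict bounds):

```python
import pandas as pd

def cruise_fun(speed, accl, acmx, dcmx):
    df = pd.concat([speed.dropna(), accl], axis=1)
    df.columns = ["speed", "accl"]
    # Strict bounds, mirroring the original accl < acmx*0.2 and accl > dcmx*0.2
    mask = (df["speed"] > 40) & df["accl"].between(dcmx * .2, acmx * .2, inclusive="neither")
    return mask.sum(), df[mask].index

speed = pd.Series([50.0, 30.0, 45.0, None])  # one NaN to exercise dropna
accl = pd.Series([0.1, 0.1, 5.0, 0.1])
count, idx = cruise_fun(speed, accl, acmx=10.0, dcmx=-10.0)
print(count, list(idx))  # 1 [0]  -- only row 0 is fast enough with accl in (-2, 2)
```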

Does iloc use the indices or the place of the row

I have extracted a few rows from a dataframe into a new dataframe. In this new dataframe the old indices remain. However, when I want to specify a range from this new dataframe, I have to use new indices starting from zero. Why does that work? Whenever I try to use the old indices, it gives an error.
germany_cases = virus_df_2[virus_df_2['location'] == 'Germany']
germany_cases = germany_cases.iloc[:190]
This is the code. The rows that I extracted from the dataframe virus_df_2 have indices between 16100 and 16590. I wanted to take the first 190 rows. In the second line of code I used iloc[:190] and it worked. However, when I tried to use iloc[16100:16290] it gave an error. What could be the reason?
In pandas there are two attributes, loc and iloc.
iloc is, as you have noticed, indexing based on the position of the rows, so you can reference the nth row using iloc[n].
In order to reference rows using the pandas indexing, which can be manually altered and can not only be integers but also strings or other objects that are hashable (have the __hash__ method defined), you should use loc attribute.
In your case, iloc raises an error because you are trying to access a range that is outside the region defined by your dataframe. You can try loc instead and it will be ok.
At first it will be hard to grasp the indexing notation, but it can be very helpful in some circumstances, like for example sorting or performing grouping operations.
Quick example that might help:
df = pd.DataFrame(
    dict(
        France=[1, 2, 3],
        Germany=[4, 5, 6],
        UK=['x', 'y', 'z'],
    ))
df = df.loc[:, "Germany"].iloc[1:2]
Out:
1 5
Name: Germany, dtype: int64
Hope I could help.
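The position-versus-label distinction from the question can be reproduced with a small frame whose labels start at 16100, mimicking the extracted Germany rows (the 'cases' column is made up):

```python
import pandas as pd

# Labels run 16100..16104, like the rows extracted from virus_df_2.
germany_cases = pd.DataFrame({'cases': range(5)}, index=range(16100, 16105))

# iloc slices by position (end-exclusive); loc slices by label (end-inclusive).
by_position = germany_cases.iloc[:2]['cases'].tolist()
by_label = germany_cases.loc[16100:16101]['cases'].tolist()
print(by_position)  # [0, 1]
print(by_label)     # [0, 1]
```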
