Problems with DataFrame indexing with pandas - python

Using pandas, I have to filter a DataFrame so that it keeps only the rows whose index labels are also present in a vector, which was obtained by performing operations on one of the df's columns. Here's the specific line of code used for that (please do not mind me picking the name 'dataset' instead of 'dataframe' or 'df'):
dataset = dataset.iloc[list(set(dataset.index).intersection(set(vector.index)))]
It worked, and the image attached here shows the df and some of its indexes. However, when I try to access a specific value by index in the new 'dataset', as in the line shown below, I get an error: single positional indexer is out-of-bounds
print(dataset.iloc[:, 21612])
Note: I've also tried the following, to make sure it isn't simply an issue with me not knowing how to use iloc:
print(dataset.iloc[21612, :])
and
print(dataset.iloc[21612])
Do I have to create another column to "mimic" the actual indexes? What am I doing wrong? Please mind that it's necessary for me to make it so the indexes are not changed at all, despite the size of the DataFrame changing. E.g. if the DataFrame originally had 21000 rows and the new one only 15000, I still need to use the number 20999 as an index if it passed the intersection check shown in the first code snippet. Thanks in advance

Try this:
print(dataset.loc[21612, :])
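iloc[] is purely positional, so after you have eliminated some of the original rows, its first (i.e., row) argument must not be greater than len(dataset.index) - 1. loc[] looks rows up by label instead, so the original index values keep working for every row that survived the filter.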
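To make the difference between the two indexers concrete, here is a minimal sketch. The small `dataset` and `vector` objects are made-up stand-ins for the asker's data, and the filter uses `loc` with `Index.intersection`, which is one way to keep the original labels rather than the asker's exact line.

```python
import pandas as pd

# Hypothetical stand-in for the asker's data: 6 rows with the default 0..5 index.
dataset = pd.DataFrame({"value": [10, 20, 30, 40, 50, 60]})

# Hypothetical "vector" that kept only some of the original labels.
vector = pd.Series([1.0, 2.0, 3.0], index=[1, 3, 5])

# Keep the rows whose labels appear in both indexes; the labels are preserved.
common = dataset.index.intersection(vector.index)
filtered = dataset.loc[common]

# Label-based lookup: label 5 still exists even though only 3 rows remain.
by_label = filtered.loc[5, "value"]

# Position-based lookup: the valid positions are now only 0, 1 and 2.
by_position = filtered.iloc[2]["value"]

# Position 5 no longer exists, so iloc raises IndexError.
try:
    filtered.iloc[5]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True
```

Here `by_label` and `by_position` refer to the same row, reached once through its original label and once through its new position.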

Related

Sort dataframe by absolute value without changing value or adding column

I have a dataframe that's the result of importing a csv and then performing a few operations and adding a column that's the difference between two other columns (column 10 - column 9 let's say). I am trying to sort the dataframe by the absolute value of that difference column, without changing its value or adding another column.
I have seen this syntax over and over all over the internet, with indications that it was a success (accepted answers, comments saying "thanks, that worked", etc.). However, I get the error you see below:
df.sort_values(by='Difference', ascending=False, inplace=True, key=abs)
Error:
TypeError: sort_values() got an unexpected keyword argument 'key'
I'm not sure why the syntax that I see working for other people is not working for me. I have a lot more going on with the code and other dataframes, so it's not a pandas import problem I don't think.
I have moved on and just made a new column that is the absolute value of the difference column and sorted by that, and exclude that column from my export to worksheet, but I really would like to know how to get it to work the other way. Any help is appreciated.
I'm using Python 3
df.loc[(df.c - df.b).abs().sort_values(ascending=False).index]
This sorts by the absolute difference between "c" and "b" without creating a new column.
I hope this is what you were looking for.
key is an optional argument of sort_values that was added in pandas 1.1.0, so an older pandas raises the TypeError shown above.
It takes a function that receives a Series and returns a Series of the same shape.
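As a rough illustration of sorting by absolute value on a pandas version that lacks `key`, the sketch below sorts the absolute values separately and reuses their index order. The four-row `Difference` column is invented sample data.

```python
import pandas as pd

# Made-up sample data with a 'Difference' column like the one in the question.
df = pd.DataFrame({"Difference": [3, -7, 1, -2]})

# Sort the absolute values, then reuse the resulting index order on the frame.
order = df["Difference"].abs().sort_values(ascending=False).index
sorted_df = df.loc[order]

# On pandas 1.1.0 or newer, the key argument should give the same ordering:
# sorted_df = df.sort_values(by="Difference", ascending=False, key=abs)
```

The original values are left untouched and no helper column is added; only the row order changes.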

Cannot sort dataframe – the column label is not unique

I merge three dataframes with the first line and try to sort them with the second. This used to work fine, but now I get this error (our company may have updated the python version during this time):
ValueError: The column label 'Areanr' is not unique.
For a multi-index, the label must be a tuple with elements corresponding to each level.
The code looks like this
pref_info4 = pref_info1.append(pref_info2).append(pref_info3)
pref_info4 = pref_info4.sort_values(['Areanr','nr'])
The second line gives the error. When inspecting 'pref_info4' after the first line is done, there is only one column with the label 'Areanr'. Are there some hidden labels that I need to remove? Otherwise it should be unique, right? Each of the original dataframes has columns Areanr and nr, but this worked fine (and I cannot see any bad merging issue when inspecting pref_info4...)
You can try assigning a running id to each column so that the labels are unique, and then sort by the renamed columns.
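Before renaming anything, it may help to check whether a label really is repeated. The sketch below builds a hypothetical frame with a duplicated 'Areanr' label, lists the repeated labels, and then keeps only the first occurrence of each. Dropping repeats is a different choice from the running id suggested above, and it assumes the repeated columns hold the same data.

```python
import pandas as pd

# Hypothetical frame with a repeated 'Areanr' label, built only for illustration.
pref_info4 = pd.DataFrame(
    [[2, 2, 1], [1, 1, 3]],
    columns=["Areanr", "Areanr", "nr"],
)

# List the labels that occur more than once.
duplicated_labels = pref_info4.columns[pref_info4.columns.duplicated()].tolist()

# Keep only the first occurrence of each label, then sort as in the question.
deduplicated = pref_info4.loc[:, ~pref_info4.columns.duplicated()]
result = deduplicated.sort_values(["Areanr", "nr"])
```

If `duplicated_labels` comes back empty on the real data, the cause is probably elsewhere, for example in the column index having more than one level.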

How to append dataframes in Pandas without staggered format

I was able to append dataframes, but as they are added, each one appears at the end of the one previously appended, and so on.
Each dataframe has a different header name.
Here’s what I’ve tried so far:
df1 = df1.append(dforiginal,sort=False, ignore_index=False)
What’s more, every time they are appended, their index is set back to 0. Is it possible to append each dataframe all starting at Index=0?
The screenshots below show what I'm getting (top image) and what I'm trying to accomplish (bottom image).
Thanks.
If I understood you correctly, you want to add columns instead of rows to your DataFrame, don't you?
In any case, the pandas user guide gives a general overview of how to use the append function: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
Moreover, you can reset the index of the result by setting the keyword ignore_index to True.
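The two layouts can be sketched with `pd.concat`, which covers both row-wise and column-wise joining. The frames below are invented, and the side-by-side variant is only a guess at what the missing bottom screenshot showed.

```python
import pandas as pd

# Hypothetical frames with different headers, as described in the question.
df1 = pd.DataFrame({"a": [1, 2, 3]})
dforiginal = pd.DataFrame({"b": [10, 20]}, index=[5, 6])

# Stacking rows: ignore_index=True renumbers the result 0..n-1,
# but frames with different headers still end up staggered.
stacked = pd.concat([df1, dforiginal], ignore_index=True, sort=False)

# Side by side: reset each index first so every frame starts at row 0.
side_by_side = pd.concat(
    [df1.reset_index(drop=True), dforiginal.reset_index(drop=True)],
    axis=1,
)
```

With `axis=1` the shorter frame is padded with NaN at the bottom instead of being pushed below the longer one.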

Not able to delete 'matchId' column from pandas dataframe

I have a dataframe which looks like this
I tried to delete matchId for preprocessing, but no matter what I use to delete it, it outputs this error:
KeyError: "['matchId'] not found in axis"
What you attempted to do (which you should have mentioned in the question) is probably failing because you assume that matchId is a normal column. It is actually the index of the DataFrame, so it cannot be accessed or dropped the way other columns can.
As suggested by anky_91, because of that, you should do
df = df.reset_index(drop=True)
if you want to completely remove the current index from your table. This replaces it with the default 0..n-1 index. To turn it into a regular column instead, just remove drop=True from the above statement.
Your table will always have an index, however, so you cannot completely get rid of it.
You can, however, output the data with
df.values
and this will ignore the index and show just the values as arrays.
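A small sketch of the situation described above, using invented data in which matchId is the index. The column names `kills` and `damage` are placeholders, not taken from the question.

```python
import pandas as pd

# Hypothetical data where matchId is the index rather than a regular column.
df = pd.DataFrame(
    {"kills": [3, 0], "damage": [250.0, 80.5]},
    index=pd.Index(["a1", "b2"], name="matchId"),
)

# Dropping it as a column fails because it is not among df.columns.
try:
    df.drop(columns=["matchId"])
    drop_failed = False
except KeyError:
    drop_failed = True

# Discard the index entirely and fall back to the default 0..n-1 index.
without_match_id = df.reset_index(drop=True)

# Or turn the index into a regular column, which can then be dropped normally.
as_column = df.reset_index()

# The underlying values never include the index.
values = df.to_numpy()
```

`df.to_numpy()` returns the same kind of array as `df.values` and is the spelling the pandas documentation currently recommends.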

How can I specify row order when I use dask.dataframe

I have two dataframes with the same shape.
I tried to convert them to dask dataframes, specifying the same npartitions=50 for both.
However, each dataframe seems to be split into partitions differently, as shown in the image below.
Does anyone know how I can specify how a dataframe should be partitioned?
Here is a guess: the index values appear to be sorted, but one would be numerical and one lexicographical; i.e., I suspect that your dataframe mrt_dask has an index containing strings, not numbers. If this is so, then calling astype before passing it to dask should solve your issue, or perhaps you should change how it is being loaded in the first place.
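The guess above can be checked in plain pandas before dask is involved. The sketch below uses a made-up three-row frame to show how string labels sort differently from integers, and how converting the index with `astype` restores numeric order.

```python
import pandas as pd

# Hypothetical frame whose index holds numbers stored as strings.
df = pd.DataFrame({"x": [1, 2, 3]}, index=["10", "2", "1"])

# String labels sort lexicographically: "1" < "10" < "2".
lexicographic = df.sort_index().index.tolist()

# Converting the index to integers restores numeric order.
df.index = df.index.astype(int)
numeric = df.sort_index().index.tolist()
```

If the real index turns out to hold strings, doing this conversion before handing the frame to dask should make both frames split along comparable boundaries, assuming their index values otherwise match.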
