Subsetting a pandas DataFrame via column number - python

When I want to retrieve the (j+1)-th value from a column of a pandas DataFrame, I can write: df["column_name"].ix[j]
When I check the type of the result, I get:
type(df["column_name"].ix[j]) # str
I want to write shorter code by subsetting with the column index instead. So I write:
df[[i]].ix[j]
However, when I check the type, I get: pandas.core.series.Series
How do I rewrite this so that positional subsetting produces a str?

The double subscripting does something different from what you seem to imply: it returns a DataFrame of the corresponding columns.
As far as I know, the shortest way to do what you're asking (column index i, row index j) is
df.iloc[:, i].ix[j]
(There's the shorter
df.icol(i).ix[j]
but it's deprecated.)
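Note that .ix (and icol) have since been removed from pandas entirely, so on a current version you would use the purely positional indexers instead. A minimal sketch, assuming a toy DataFrame with string columns:

import pandas as pd

df = pd.DataFrame({"a": ["x", "y"], "b": ["u", "v"]})
i, j = 0, 1                 # column index, row index

print(type(df.iloc[j, i]))  # <class 'str'> - single positional lookup
print(df.iat[j, i])         # 'y' - .iat is the faster scalar-only variant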

One way to do this is like so:
df.ix[j][i]
This is kind of funky, though, because the first index is the row and the second is the column, which is the reverse of the usual pandas order. It reads more like matrix indexing than pandas indexing.

Related

Pandas - iterating DataFrame rows efficiently and getting values by column name

Iterating in pandas is notoriously inefficient and best avoided whenever possible (use apply for data manipulation, etc.). In my case, unfortunately, it is unavoidable.
While it is widely known that the most efficient way to iterate is itertuples, indexing the resulting tuple with the column's str name throws the following error:
TypeError: tuple indices must be integers or slices, not str
Some suggest that the solution to this problem is to just switch to iterrows, but as mentioned previously, this is not efficient.
How do I utilize itertuples, while still using the str name of the column to get its row value?
Essentially, one just needs to use the position of the required column instead. Since the first value in each tuple is the Index of the originating DataFrame, take the column's position in the original DataFrame and add one to account for that Index.
import pandas as pd

df = pd.DataFrame(some_data)  # some_data as in the question
col_idx = df.columns.get_loc('col name') + 1  # +1 to account for the tuple's Index
for row in df.itertuples():
    val = row[col_idx]
    print(val)
This solution may not be the most elegant option, but it works :)
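As an aside, if the column name is a valid Python identifier, the namedtuples returned by itertuples support attribute access directly, so the index arithmetic isn't needed. A minimal sketch (col_name here is a stand-in for your actual column):

for row in df.itertuples():
    val = row.col_name             # attribute access on the namedtuple
    # or, when the name is only known at runtime:
    val = getattr(row, 'col_name')

(Column names that aren't valid identifiers, e.g. ones containing spaces, get replaced with positional names like _1, which is why attribute access can't cover every case.)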

Pandas `groupby.aggregate` on `df.index.duplicated()`

Scenario. Assume a pd.DataFrame, loaded from an external source, where each row is a line from a sensor. The index is a DatetimeIndex, and some rows have df.index.duplicated() == True. This means there are lines with the same timestamp coming from different sensors.
Now, applying some logic like df.loc[df.A > 0, 'my_col'] = 1, I ran into ValueError: cannot reindex from a duplicate axis. This can be solved by simply removing the duplicated rows using
df[~df.index.duplicated()]
But I wonder if it would be possible to apply a column-based function during the index de-duplication, e.g. calculating the mean/max/min of columns A/B/C across the duplicated rows.
Is this possible? It's something like a groupby.aggregate on the df.index.duplicated() rows.
Check with describe:
df.groupby(level=0).describe()
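If you want specific aggregations per column rather than the full describe table, groupby(level=0).agg does exactly that. A minimal sketch, assuming columns A, B and C as in the question:

# collapse rows sharing a timestamp with a per-column function
dedup = df.groupby(level=0).agg({'A': 'mean', 'B': 'max', 'C': 'min'})

The resulting index is de-duplicated, so the df.loc assignment from the question works again afterwards.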

pandas: Select one-row data frame instead of series [duplicate]

I have a huge dataframe, and I index it like so:
df.ix[<integer>]
Depending on the index, sometimes this will have only one row of values. Pandas automatically converts this to a Series, which, quite frankly, is annoying because I can't operate on it the same way I can on a DataFrame.
How do I either:
1) Stop pandas from converting and keep it as a dataframe ?
OR
2) easily convert the resulting series back to a dataframe ?
pd.DataFrame(df.ix[<integer>]) does not work because it doesn't keep the original columns. It treats <integer> as the column name and the original columns as the index. Much appreciated.
You can do df.ix[[n]] to get a one-row dataframe of row n.
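On a modern pandas (where .ix no longer exists) the same list-selector trick works with the current indexers, and a Series you already have can be turned back into a one-row DataFrame with .to_frame().T. A minimal sketch (some_label is a hypothetical index label):

one_row = df.iloc[[n]]             # positional; the list selector keeps it a DataFrame
one_row = df.loc[[some_label]]     # label-based equivalent
one_row = df.iloc[n].to_frame().T  # convert an already-extracted Series back

(Note the .to_frame().T route goes through a Series, so mixed dtypes will already have been upcast.)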

Type of item in pandas DataFrame: bug or feature?

If I have a pandas DataFrame
df = pd.read_csv("infile.csv")
where infile looks something like
i1,i2,f1,f2
3,1,0.1,2.0
2,1,0.3,0.5
i.e. two columns of integers and two of floats.
If I query this DataFrame with:
print(type(df["i1"].ix[0]))
the type is (as I would expect it to be!) np.int64.
Whereas if I use:
print(type(df.ix[0]["i1"]))
the type is np.float64.
Is this correct behaviour or a bug?
I guess that this is because:
df.ix[0]
creates a Series object, which ["i1"] then selects from? But still, this is annoying.
As you note yourself, this is indeed expected behaviour, because in df.ix[0]["i1"] you first create a Series for the first row (so all items are upcast to float to get a single dtype), and only then you take the item with label "i1".
The solution is easy: don't use this chained indexing, but combine both look-ups (for row and column) in one indexing call:
df.ix[0, "i1"]
There are also other good reasons to avoid this chained indexing (view/copy problems): http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy
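For reference, the same experiment on a current pandas (where .ix is gone) looks like this; the data matches the question's infile.csv:

import pandas as pd
from io import StringIO

csv = "i1,i2,f1,f2\n3,1,0.1,2.0\n2,1,0.3,0.5"
df = pd.read_csv(StringIO(csv))

print(type(df["i1"].loc[0]))   # numpy.int64   - the column keeps its dtype
print(type(df.loc[0]["i1"]))   # numpy.float64 - the row Series was upcast
print(type(df.loc[0, "i1"]))   # numpy.int64   - one combined lookup, no upcast
print(type(df.at[0, "i1"]))    # numpy.int64   - fast scalar access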
