In a Pandas DataFrame, say cars, I can select and print a single column like this:
# country is a column
print(cars['country'])
However, when I try to do the same thing with a row, it fails:
#US is a row
print(cars['US'])
KeyError: 'US'
Then I tried this and it worked:
print(cars['US':'US'])
So, in a Pandas DataFrame, column labels are keys and row labels are not? Could someone explain the reason for making row selection more complicated than column selection?
In pandas
cars['country']
returns the column as a Series. To slice a row as a Series, the equivalent command would be (assuming 'US' is an index value):
cars.loc['US']
If you do
cars['US':'US']
then you'll get the same data, but it'll be a DataFrame instead of a Series, so not quite equivalent.
Obviously, you can't have both columns and rows be referenced in the same way. If you had a DataFrame of correlations between all the stocks in the S&P 500, then the columns and the index would contain the same labels. So you'd need to know whether df['AAPL'] was referring to 'AAPL' from the index or from the columns. In pandas, df['AAPL'] is the 'AAPL' column, while df.loc['AAPL'] is the 'AAPL' row. You could also do df.loc['AAPL', :] for the row and df.loc[:, 'AAPL'] for the column, if you prefer.
As for why they chose df['AAPL'] to be the column, my best guess is that dates are typically listed as the rows, so it's more common to select a column of data to do some operation on than it is to select a row. However, selecting rows is just as easy as selecting columns, so there really isn't much difference beyond a few keystrokes.
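To make the difference between the three forms concrete, here is a minimal sketch. The data values and the second column are made up for illustration; only the 'country' column and the 'US' row label come from the question.

```python
import pandas as pd

# Hypothetical data; only the 'country' column and the 'US' label are from the question.
cars = pd.DataFrame(
    {"country": ["United States", "Japan"], "cars_per_cap": [809, 588]},
    index=["US", "JPN"],
)

col = cars["country"]      # column lookup by key -> Series
row = cars.loc["US"]       # row lookup by label -> Series
row_df = cars["US":"US"]   # label slice on rows -> one-row DataFrame

print(type(col).__name__, type(row).__name__, type(row_df).__name__)
```

If you want the one-row DataFrame without relying on slice syntax, cars.loc[['US']] (a list inside loc) should give the same shape.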
I have a dataset with information on cities in the United States and I want to give it a two-level index with the state and the city. I've been trying to use the MultiIndex approach in the documentation that goes something like this.
lists = [list(df['state']), list(df['city'])]
tuples = list(zip(*lists))
index = pd.MultiIndex.from_tuples(tuples)
new_df = pd.DataFrame(df,index=index)
The output is a new DataFrame with the correct index but it's full of np.nan values. Any idea what's going on?
When you reindex a DataFrame with a new index, Pandas operates roughly
the following way:
Iterates over the current index.
Checks whether this index value occurs in the new index.
From the "old" (existing) rows, leaves only those with index values
present in the new index.
There can be reordering of rows, to align with the order of the new
index.
If the new index contains values absent in the DataFrame, then
the corresponding row has only NaN values.
Maybe your DataFrame has initially a "standard" index (a sequence
of integers starting from 0)?
In this case no item of the old index is present in the new
index (actually a MultiIndex), so the resulting DataFrame has
all rows full of NaNs.
Maybe you should set the index to the two columns of interest,
i.e. run:
df.set_index(['state', 'city'], inplace=True)
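A small sketch of both behaviours, assuming the DataFrame starts with the default integer index. The rows and the population_m column are invented for illustration; only the 'state' and 'city' column names come from the question.

```python
import pandas as pd

# Hypothetical data; only the 'state' and 'city' column names are from the question.
df = pd.DataFrame({
    "state": ["CA", "CA", "TX"],
    "city": ["Los Angeles", "San Diego", "Austin"],
    "population_m": [3.9, 1.4, 1.0],
})

index = pd.MultiIndex.from_arrays([df["state"], df["city"]])

# Passing index= to the constructor reindexes df. None of the old labels
# (0, 1, 2) occur in the new MultiIndex, so every cell becomes NaN.
all_nan = pd.DataFrame(df, index=index)
print(all_nan.isna().all().all())

# set_index moves the two columns into the index and keeps the data.
fixed = df.set_index(["state", "city"])
print(fixed.loc[("TX", "Austin"), "population_m"])
```

This version uses set_index without inplace=True and binds the result to a new name, which leaves the original df untouched.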
I have data like this in a CSV file, which I am importing into a pandas DataFrame.
I want to collapse the values of the Type column by concatenating its strings into one sentence, placed in the first row next to the Date value, while keeping all other rows and values the same.
As shown below.
Edit:
You can try ffill + transform
df1=df.copy()
df1[['Number', 'Date']]=df1[['Number', 'Date']].ffill()
df1.Type=df1.Type.fillna('')
s=df1.groupby(['Number', 'Date']).Type.transform(' '.join)
df.loc[df.Date.notnull(),'Type']=s
df.loc[df.Date.isnull(),'Type']=''
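Since the sample data from the question is not shown here, the following sketch runs the same steps on a made-up frame. It assumes Number and Date are filled only on the first row of each block, which is my reading of the question; the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Number/Date appear only on the first row of each block.
df = pd.DataFrame({
    "Number": [1.0, np.nan, np.nan, 2.0, np.nan],
    "Date": ["2020-01-01", None, None, "2020-01-02", None],
    "Type": ["a", "b", "c", "d", "e"],
})

# Work on a copy so the blanks in the original Number/Date columns survive.
df1 = df.copy()
df1[["Number", "Date"]] = df1[["Number", "Date"]].ffill()
df1["Type"] = df1["Type"].fillna("")

# One joined string per (Number, Date) block, broadcast to every row of the block.
s = df1.groupby(["Number", "Date"])["Type"].transform(" ".join)

# Keep the joined string on the first row of each block and blank out the rest.
df.loc[df["Date"].notnull(), "Type"] = s
df.loc[df["Date"].isnull(), "Type"] = ""
print(df)
```

The assignment through df.loc aligns s on the row index, so only the rows selected by the mask receive a value.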
I have a table: Table
How would I roll up Group, so that the group numbers don't repeat? I don't want to use df.groupby, as I don't want to summarize the other columns. I just want to avoid repeating item labels, sort of like an Excel pivot table.
Thanks!
In your dataframe it appears that 'Group' is in the index. The purpose of the index is to label each row, so it is unusual to have blank row labels.
You could do this:
df2.reset_index().set_index('Group', append=True).swaplevel(0,1,axis=0)
Or if you really must show blank row indexes you could do this, but you must change the dtype of the index to str.
df1 = df.set_index('Group').astype(str)
df1.index = df1.index.where(~df1.index.duplicated(),[' '])
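Here is a runnable sketch of the blank-label variant on made-up data (only the 'Group' column name is from the question). Note that DataFrame.astype(str) converts the columns rather than the index, so this sketch casts the index itself before blanking the duplicates; treat that as my adaptation, not the code above verbatim.

```python
import pandas as pd

# Hypothetical table; only the 'Group' column name is from the question.
df = pd.DataFrame({
    "Group": [1, 1, 2, 2, 2],
    "Item": ["a", "b", "c", "d", "e"],
    "Value": [10, 20, 30, 40, 50],
})

df1 = df.set_index("Group")
# Cast the index labels to strings so a blank label can sit alongside them.
df1.index = df1.index.astype(str)
# Keep the first occurrence of each label and replace the repeats with a blank.
df1.index = df1.index.where(~df1.index.duplicated(), " ")
print(df1)
```

This only changes how the frame is displayed. After blanking, the labels no longer identify the group of each row, so do any filtering or grouping before this step.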
I have a pandas Series that contains key-value pairs, where the key is the name of a column in my pandas DataFrame and the value is an index in that column of the DataFrame.
For example:
Series:
Series
Then in my DataFrame:
Dataframe
Therefore, I want to extract the value at index 12 from the 'A' column of my DataFrame, which is 435.81. I want to put all these values into another Series, so something like {'A': 435.81, 'AAP': 468.97, ...}
I think this indexing is what you're looking for.
pd.Series(np.diag(df.loc[ser,ser.axes[0]]), index=ser.axes[0])
df.loc allows you to index based on string indices. You get your rows given from the values in ser (first positional argument in df.loc) and you get your column location from the labels of ser (I don't know if there is a better way to get the labels from a series than ser.axes[0]). The values you want are along the main diagonal of the result, so you take just the diagonal and associate them with the column labels.
The indexing I gave before only works if your DataFrame uses integer row indices, or if the data type of your Series values matches the DataFrame row indices. If you have a DataFrame with non-integer row indices, but still want to get values based on integer rows, then use the following (however, all indices from your series must be within the range of the DataFrame, which is not the case with 'AAL' being 1758 and only 12 rows, for example):
pd.Series(np.diag(df.iloc[ser,:].loc[:,ser.axes[0]]), index=ser.axes[0])
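A minimal sketch of the label-based version on made-up numbers (the real Series and DataFrame from the question are not shown, so every value below is hypothetical). It uses ser.index, which is the same object as ser.axes[0], and also shows a positional variant that avoids building the full square block.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names echo the question, the values do not.
df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "AAP": [4.0, 5.0, 6.0], "AAL": [7.0, 8.0, 9.0]},
    index=[10, 11, 12],
)
# Maps column name -> row label to look up in that column.
ser = pd.Series({"A": 12, "AAP": 10, "AAL": 11})

# Rows come from the values of ser, columns from its labels; the wanted
# cells sit on the main diagonal of the resulting square block.
block = df.loc[ser.to_numpy(), ser.index]
result = pd.Series(np.diag(block.to_numpy()), index=ser.index)

# Alternative: translate labels to positions and pick the cells directly.
rows = df.index.get_indexer(ser.to_numpy())
cols = df.columns.get_indexer(ser.index)
direct = pd.Series(df.to_numpy()[rows, cols], index=ser.index)

print(result)
```

The diagonal approach builds an n-by-n block for n lookups, so the positional variant may be preferable when ser is long. Note that get_indexer returns -1 for labels it cannot find, which would silently pick the last row or column, so check for -1 if missing labels are possible.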
Forgive me if the answer is simple. I am a beginner with pandas. Basically, I want to retrieve the label index of a row of my pandas dataframe. I know its integer position.
For example, suppose that I want to get the label index of the last row of my pandas dataframe df. I tried:
df.iloc[-1].index
But that retrieved the column headers of my dataframe, rather than the label index of the last row. How can I get that label index?
Passing a scalar to iloc will return a Series of the last row, putting the columns into the index. Pass iloc a list to return a dataframe which will allow you to grab the index how you normally would.
df.iloc[[-1]].index
You can also grab the index first and then get the last value with df.index[-1]
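The options side by side, on a small made-up frame (labels and values are hypothetical):

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=["a", "b", "c"])

# A scalar position returns the row as a Series whose index is the column names.
as_series_index = list(df.iloc[-1].index)

# A list of positions returns a DataFrame, so .index holds the row label.
as_frame_index = list(df.iloc[[-1]].index)

# Indexing the index directly gives the label as a scalar.
last_label = df.index[-1]

# The row Series also carries its label in .name.
name_label = df.iloc[-1].name

print(as_series_index, as_frame_index, last_label, name_label)
```

For any known integer position i, df.index[i] is probably the most direct way to get the label.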