Why does reindexing a pandas DataFrame give me an empty DataFrame? - python

I have a dataset with information on cities in the United States and I want to give it a two-level index with the state and the city. I've been trying to use the MultiIndex approach in the documentation that goes something like this.
lists = [list(df['state']), list(df['city'])]
tuples = list(zip(*lists))
index = pd.MultiIndex.from_tuples(tuples)
new_df = pd.DataFrame(df,index=index)
The output is a new DataFrame with the correct index but it's full of np.nan values. Any idea what's going on?

When you reindex a DataFrame with a new index, pandas operates roughly the following way:

1. It iterates over the current index.
2. It checks whether each index value occurs in the new index.
3. From the "old" (existing) rows, it keeps only those with index values present in the new index.
4. Rows may be reordered to align with the order of the new index.
5. If the new index contains values absent from the DataFrame, the corresponding rows are filled with NaN.

Your DataFrame probably started with a "standard" index (a sequence of integers starting from 0). In that case, no item of the old index is present in the new index (actually a MultiIndex), so the resulting DataFrame has all rows full of NaNs.

Instead, you should set the index from the two columns of interest, i.e. run:
df.set_index(['state', 'city'], inplace=True)
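A minimal sketch contrasting the two approaches (the sample data is made up; only the state and city column names come from the question):

```python
import pandas as pd

# Sample data with the default RangeIndex (0, 1, 2, ...)
df = pd.DataFrame({
    'state': ['CA', 'CA', 'NY'],
    'city': ['Los Angeles', 'San Diego', 'New York'],
    'population': [3898747, 1386932, 8804190],
})

# Rebuilding the DataFrame with a brand-new MultiIndex aligns on the
# OLD integer index, which shares no labels with the tuples -> all NaN
tuples = list(zip(df['state'], df['city']))
new_df = pd.DataFrame(df, index=pd.MultiIndex.from_tuples(tuples))
print(new_df.isna().all().all())  # True -- every value was lost

# Setting the index from the existing columns keeps the data
fixed = df.set_index(['state', 'city'])
print(fixed.loc[('CA', 'San Diego'), 'population'])  # 1386932
```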

Related

How to convert a pandas dataframe with non unique indexes into a one with unique indexes?

I created a dataframe with some previous operations, but when I query a column name with an index (for example, df['order_number'][0]), multiple rows/records come as output.
The screenshot shows the unique and total indexes of the dataframe. [image: difference in lengths of unique indexes vs. all indexes]
It looks like the rows kept their original index when you merged/joined the df. Try:
df.reset_index()
Could you show a df.head(), for example? Usually, when you consume a data source, if you set the arg index to True, each row will be assigned a unique numerical index.
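A short sketch of the symptom and the fix (the order_number column name comes from the question; the values are made up):

```python
import pandas as pd

# A concat/merge can leave duplicate index labels behind
a = pd.DataFrame({'order_number': [101, 102]})
b = pd.DataFrame({'order_number': [103, 104]})
df = pd.concat([a, b])            # index is now 0, 1, 0, 1 -- not unique

dup = df['order_number'][0]       # label 0 matches TWO rows
print(len(dup))                   # 2

# reset_index assigns a fresh unique RangeIndex
df = df.reset_index(drop=True)
print(df['order_number'][0])      # 101 -- a single scalar again
```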

Extract indices of a dataframe based on a values (provided as an array) from a different column

I have an array as: df1.values = array([1,2,3,4]).
Now, I want to get the indices of df2 where df2.x has the values from df1.values. So for instance, if df2.x.values= [1,3,4,2,5,6], then I want the return to be 1,4,2,3 which are index values of df2 where the values from df1 can be found.
I looked everywhere on stackoverflow and was not able to find how to do this.
If I understand your question, this should work:
import pandas as pd
df1 = pd.DataFrame([1,2,3,4],columns=['x'])
df2 = pd.DataFrame([1,3,4,2,5,6],columns=['x'])
df2['old_index'] = df2.index.values  # save df2's original index as a column
df2.set_index('x').loc[df1['x']]['old_index'].values  # array([0, 3, 1, 2])
Basically, we extract the values of the original index of df2 (these are the return values that you want) as a new column, set the x column as a new index using .set_index (assuming you don't have any missing or duplicate values), and get your return values based on the new index.

Extracting values from pandas DataFrame using a pandas Series

I have a pandas Series that contains key-value pairs, where the key is the name of a column in my pandas DataFrame and the value is an index in that column of the DataFrame.
For example:
[image: the Series]
Then in my DataFrame:
[image: the DataFrame]
Therefore, from my DataFrame I want to extract the value at index 12 from my DataFrame for 'A', which is 435.81 . I want to put all these values into another Series, so something like { 'A': 435.81 , 'AAP': 468.97,...}
I think this indexing is what you're looking for.
import numpy as np
pd.Series(np.diag(df.loc[ser, ser.axes[0]]), index=df.columns)
df.loc allows you to index based on string indices. You get your rows given from the values in ser (first positional argument in df.loc) and you get your column location from the labels of ser (I don't know if there is a better way to get the labels from a series than ser.axes[0]). The values you want are along the main diagonal of the result, so you take just the diagonal and associate them with the column labels.
The indexing I gave before only works if your DataFrame uses integer row indices, or if the data type of your Series values matches the DataFrame row indices. If you have a DataFrame with non-integer row indices, but still want to get values based on integer rows, then use the following (however, all indices from your series must be within the range of the DataFrame, which is not the case with 'AAL' being 1758 and only 12 rows, for example):
pd.Series(np.diag(df.iloc[ser,:].loc[:,ser.axes[0]]), index=df.columns)
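A runnable sketch of the diagonal trick with made-up data (the column names 'A' and 'AAP' and the value 435.81 are taken loosely from the question; everything else is an assumption):

```python
import numpy as np
import pandas as pd

# DataFrame with an integer row index; the Series maps
# column name -> row index within that column
df = pd.DataFrame({'A': [400.00, 435.81], 'AAP': [468.97, 470.00]})
ser = pd.Series({'A': 1, 'AAP': 0})

# df.loc[ser, ser.index] selects rows ser.values and columns ser.index;
# the wanted (row, column) pairs lie on the main diagonal of that block
out = pd.Series(np.diag(df.loc[ser, ser.index]), index=ser.index)
print(out)  # A -> 435.81, AAP -> 468.97
```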

How can I retrieve the label index of a pandas dataframe row given its integer index?

Forgive me if the answer is simplistic. I am a beginner of Pandas. Basically, I want to retrieve the label index of a row of my pandas dataframe. I know the integer index of it.
For example, suppose that I want to get the label index of the last row of my pandas dataframe df. I tried:
df.iloc[-1].index
But that retrieved the column headers of my dataframe, rather than the label index of the last row. How can I get that label index?
Passing a scalar to iloc will return a Series of the last row, putting the columns into the index. Pass iloc a list to return a dataframe which will allow you to grab the index how you normally would.
df.iloc[[-1]].index
You can also grab the index first and then get the last value with df.index[-1]
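A quick sketch of the difference (the labels are made up):

```python
import pandas as pd

df = pd.DataFrame({'val': [10, 20, 30]}, index=['a', 'b', 'c'])

print(df.iloc[-1].index)    # scalar -> Series: gives the COLUMN headers
print(df.iloc[[-1]].index)  # list -> DataFrame: gives the row label 'c'
print(df.index[-1])         # 'c' -- the direct route
```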

Add pandas Series to a DataFrame, preserving index

I have been having some problems adding the contents of a pandas Series to a pandas DataFrame. I start with an empty DataFrame, initialised with several columns (corresponding to consecutive dates).
I would like to then sequentially fill the DataFrame using different pandas Series, each one corresponding to a different date. However, each Series has a (potentially) different index.
I would like the resulting DataFrame to have an index that is essentially the union of each of the Series indices.
I have been doing this so far:
for date in dates:
df[date] = series_for_date
However, my df index corresponds to that of the first Series and so any data in successive Series that correspond to an index 'key' not in the first Series are lost.
Any help would be much appreciated!
Ben
If I understand correctly, you can use concat:
pd.concat([series1,series2,series3],axis=1)
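A minimal sketch with two Series on different indices (the names and values are assumptions):

```python
import pandas as pd

s1 = pd.Series({'a': 1.0, 'b': 2.0}, name='2024-01-01')
s2 = pd.Series({'b': 3.0, 'c': 4.0}, name='2024-01-02')

# axis=1 aligns columns on the UNION of the indices;
# entries missing from a given Series become NaN
df = pd.concat([s1, s2], axis=1)
print(df)
#    2024-01-01  2024-01-02
# a         1.0         NaN
# b         2.0         3.0
# c         NaN         4.0
```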
