Pandas: How to set an index without sorting - python

I have a df I wish to sort by two columns (say, by=['Cost','Producttype']) and index by a different column (df.what_if_cost.abs()).
I am new to Python, so originally I sorted the df twice, keeping the original index the second time. That is too inefficient, even when done inplace. I tried things like set_index and reset_index, but to no avail. Ideally, as described, I want the output df to be sorted by two columns but indexed by a different, third column.
Thanks!
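One possible approach, as a minimal sketch: sort once with sort_values, then call set_index, which does not reorder rows. The sample values below are made up; only the column names come from the question. The absolute values are taken from the already sorted frame because set_index uses the passed values in positional order rather than aligning them by label.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    'Cost': [3, 1, 3, 2],
    'Producttype': ['b', 'a', 'a', 'c'],
    'what_if_cost': [-5.0, 2.0, -1.0, 4.0],
})

# Sort once by the two columns.
sorted_df = df.sort_values(by=['Cost', 'Producttype'])

# Index by the absolute value of the third column, computed from the
# sorted frame so the values line up with the sorted rows.
# set_index leaves the row order untouched.
result = sorted_df.set_index(sorted_df['what_if_cost'].abs())

print(result)
```

The what_if_cost column stays in the frame as well, since the index is built from a separate Series rather than from a column label.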

Related

GroupBy using select columns with apply(list) and retaining other columns of the dataframe

data={'order_num':[123,234,356,123,234,356],'email':['abc#gmail.com','pqr#hotmail.com','xyz#yahoo.com','abc#gmail.com','pqr#hotmail.com','xyz#gmail.com'],'product_code':['rdcf1','6fgxd','2sdfs','34fgdf','gvwt5','5ganb']}
df=pd.DataFrame(data,columns=['order_num','email','product_code'])
My data frame looks something like this: (image of data frame)
For the sake of simplicity, I omitted the other columns while making the example. What I need to do is group by the column called order_num, apply(list) on product_code, sort the groups based on a timestamp column, and retain columns like email as they are.
I tried doing something like:
df.groupby(['order_num', 'email', 'timestamp'])['product_code'].apply(list).sort_values(by='timestamp').reset_index()
Output: (image of expected output)
but I do not wish to group by the other columns. Is there an alternative way to perform the list operation? I tried using transform, but it threw a size mismatch error, and I don't think it's the right way to go either.
If there are many other columns and you need to group by order_num only, use Series.map to fill a new column with the lists, then remove duplicates by the order_num column with DataFrame.drop_duplicates, and finally sort if necessary:
df['product_code']=df['order_num'].map(df.groupby('order_num')['product_code'].apply(list))
df = df.drop_duplicates('order_num').sort_values(by='timestamp')
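As a worked sketch of the two lines above: the question's sample data has no timestamp column, so the timestamp values below are made up for illustration. Everything else follows the question's data.

```python
import pandas as pd

data = {
    'order_num': [123, 234, 356, 123, 234, 356],
    'email': ['abc#gmail.com', 'pqr#hotmail.com', 'xyz#yahoo.com',
              'abc#gmail.com', 'pqr#hotmail.com', 'xyz#gmail.com'],
    # Hypothetical timestamps, not part of the original example.
    'timestamp': pd.to_datetime(['2021-01-03', '2021-01-02', '2021-01-01',
                                 '2021-01-06', '2021-01-05', '2021-01-04']),
    'product_code': ['rdcf1', '6fgxd', '2sdfs', '34fgdf', 'gvwt5', '5ganb'],
}
df = pd.DataFrame(data)

# One list of product codes per order number.
codes_per_order = df.groupby('order_num')['product_code'].apply(list)

# Write the list back onto every row of the matching order.
df['product_code'] = df['order_num'].map(codes_per_order)

# Keep the first row of each order (with its email and timestamp), then sort.
result = df.drop_duplicates('order_num').sort_values(by='timestamp')

print(result)
```

Note that drop_duplicates keeps the first row per order_num, so the email and timestamp in the result are those of each order's first row.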

Why does reindexing a pandas DataFrame give me an empty DataFrame?

I have a dataset with information on cities in the United States and I want to give it a two-level index with the state and the city. I've been trying to use the MultiIndex approach in the documentation that goes something like this.
lists = [list(df['state']), list(df['city'])]
tuples = list(zip(*lists))
index = pd.MultiIndex.from_tuples(tuples)
new_df = pd.DataFrame(df,index=index)
The output is a new DataFrame with the correct index but it's full of np.nan values. Any idea what's going on?
When you reindex a DataFrame with a new index, Pandas operates roughly the following way:
Iterates over the current index.
Checks whether this index value occurs in the new index.
From the "old" (existing) rows, leaves only those with index values present in the new index.
There can be reordering of rows, to align with the order of the new index.
If the new index contains values absent in the DataFrame, then the corresponding row has only NaN values.
Maybe your DataFrame initially has a "standard" index (a sequence of integers starting from 0)? In this case no item of the old index is present in the new index (actually a MultiIndex), so the resulting DataFrame has all rows full of NaNs.
Maybe you should set the index to the two columns of interest, i.e. run:
df.set_index(['state', 'city'], inplace=True)
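To make the label matching concrete, here is a small sketch with made-up data (the population column and all values are assumptions, not from the question). It reindexes with plain integer labels rather than a MultiIndex to keep the example short, but the matching rule described above is the same.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    'state': ['TX', 'TX', 'CA'],
    'city': ['Austin', 'Dallas', 'Fresno'],
    'population': [1, 2, 3],
})

# Reindexing matches labels: 2 and 0 exist in the old index, 5 does not,
# so the row labelled 5 contains only NaN.
partial = df.reindex([2, 0, 5])

# set_index builds the two-level index from the columns and keeps the data.
fixed = df.set_index(['state', 'city'])

print(partial)
print(fixed)
```

Unlike the inplace=True call above, this version returns a new DataFrame and leaves df unchanged.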

Dask: subset (or drop) rows from Dataframe by index

I'd like to take a subset of rows of a Dask dataframe based on a set of index keys. (Specifically, I want to find rows of ddf1 whose index is not in the index of ddf2.)
Both cache.drop([overlap_list]) and diff = cache[should_keep_bool_array] either throw a NotImplementedException or otherwise don't work.
What is the best way to do this?
I'm not sure this is the "best" way, but here's how I ended up doing it:
Create a Pandas DataFrame whose index is the series of index keys I want to keep (e.g., pd.DataFrame(index=overlap_list))
Inner join it with the Dask DataFrame
Another possibility is:
df_index = df.reset_index()
df_index = df_index.drop_duplicates()

Keep rows from a dataframe whose index name is NOT in a given list

So, I have a list with tuples, and a multi-index dataframe. I want to find the rows of the dataframe whose indices are NOT included in the list of tuples, and create a new dataframe with these elements. Any help? Thanx!
You can use isin with a negation to explicitly filter your DataFrame:
new_df = df[~df.index.isin(list_of_tuples)]
Alternatively, use drop to remove the tuples you don't want to be included in the new DataFrame.
new_df = df.drop(list_of_tuples)
From a couple of simple tests, using isin appears to be faster, although drop is a bit more readable.
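A small sketch comparing the two on a toy MultiIndex frame (the level names, values and tuples are made up for illustration):

```python
import pandas as pd

# Hypothetical MultiIndex frame for illustration only.
index = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 2), ('b', 1), ('b', 2)], names=['key', 'num']
)
df = pd.DataFrame({'value': [10, 20, 30, 40]}, index=index)

list_of_tuples = [('a', 2), ('b', 1)]

# Boolean mask: keep rows whose index tuple is NOT in the list.
kept_isin = df[~df.index.isin(list_of_tuples)]

# drop: remove the listed index tuples directly.
kept_drop = df.drop(list_of_tuples)

print(kept_isin)
print(kept_drop)
```

One practical difference: by default drop raises a KeyError if a tuple in the list is not present in the index, whereas the isin mask simply ignores it.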

Splitting Pandas Dataframe with groupby and last

I am working with a pandas dataframe where I want to group by one column, grab the last row of each group (creating a new dataframe), and then drop those rows from the original.
I've done a lot of reading and testing, and it seems that I can't do that as easily as I'd hoped. I can do a kludgy solution, but it seems inefficient and, well, kludgy.
Here's pseudocode for what I wanted to do:
df = pd.DataFrame
last_lines = df.groupby('id').last()
df.drop(last_lines.index)
Creating the last_lines dataframe is fine; it's dropping those rows from the original df that's the issue. The problem is that the original index (from df) is lost when last_lines is created. I looked at filter and transform, but neither seems to address this problem. Is there a good way to split the dataframe into two pieces based on position?
My kludge solution is to iterate over the group iterator and create a list of indexes, then drop those.
grouped = df.groupby('id')
idx_to_remove = []
for _, group in grouped:
    idx_to_remove.append(group.tail(1).index[0])
df.drop(idx_to_remove)
Better suggestions?
If you use .reset_index() first, you'll get the index as a column and you can use .last() on that to get the indices you want.
last_lines = df.reset_index().groupby('A').index.last()
df.drop(last_lines)
Here the index is accessed as .index because "index" is the default name given to this column when you use reset_index. If your index has a name, you'll use that instead.
You can also "manually" grab the last index by using .apply():
last_lines = df.groupby('A').apply(lambda g: g.index[-1])
You'll probably have to do it this way if you're using a MultiIndex (since in that case using .reset_index() would add multiple columns that can't easily be combined back into indices to drop).
Try:
df.groupby('A').apply(lambda x: x.iloc[:-1, :])
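Another option, offered as a sketch rather than as what the answers above propose: groupby(...).tail(1) returns the last row of each group with the original index labels intact, so those labels can be passed straight to drop. The sample data is made up, and the sketch assumes the index labels of df are unique.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    'id': ['a', 'a', 'b', 'b', 'b', 'c'],
    'value': [1, 2, 3, 4, 5, 6],
})

# Last row of each group, still carrying the original index labels.
last_lines = df.groupby('id').tail(1)

# Everything else: drop those labels from the original frame.
# This relies on the index labels being unique.
remaining = df.drop(last_lines.index)

print(last_lines)
print(remaining)
```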
