Dask: subset (or drop) rows from a DataFrame by index - python

I'd like to take a subset of rows of a Dask dataframe based on a set of index keys. (Specifically, I want to find rows of ddf1 whose index is not in the index of ddf2.)
Both cache.drop([overlap_list]) and diff = cache[should_keep_bool_array] either throw a NotImplementedError or otherwise don't work.
What is the best way to do this?

I'm not sure this is the "best" way, but here's how I ended up doing it:
Create a Pandas DataFrame whose index is the set of index keys I want to keep (e.g., pd.DataFrame(index=overlap_list))
Inner join it with the Dask DataFrame
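A minimal sketch of that approach, assuming ddf1 is the Dask DataFrame and overlap_list holds the index keys to keep (both names taken from the question):
import pandas as pd
# a pandas frame that carries nothing but the index keys we want to keep
keep = pd.DataFrame(index=overlap_list)
# inner join on the index: only rows of ddf1 whose index appears in keep survive
subset = ddf1.merge(keep, left_index=True, right_index=True, how='inner')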

Another possibility is:
df_index = df.reset_index()
df_index = df_index.drop_duplicates()

Related

Pandas: How to set an index without sorting

I have a df I wish to sort by two columns (say, by=['Cost','Producttype']) and set the index to be on a different column (df.what_if_cost.abs())
I am new to Python, so originally I sorted the df twice, keeping the original index the second time. That is too inefficient, even when done inplace. I tried things like set_index and reset_index, but to no avail. Ideally, as described, I want the output df to be sorted by two columns but indexed by a different, third column.
Thanks!
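A minimal sketch of one way to do this, assuming the column names from the question and a hypothetical helper column abs_cost: compute the index column first, then sort, then set the index (set_index does not re-sort the rows):
df = (df.assign(abs_cost=df['what_if_cost'].abs())   # abs_cost is a hypothetical helper name
        .sort_values(by=['Cost', 'Producttype'])
        .set_index('abs_cost'))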

How to append rows to a Pandas dataframe, and have it turn multiple overlapping cells (with the same index) into a single value, instead of a series?

I am appending different dataframes to make one set. Occasionally, some values have the same index, so it stores the value as a series. Is there a quick way within Pandas to just overwrite the value instead of storing all the values as a series?
Your question isn't very clear. If you want to resolve the duplicated-index problem, the pd.DataFrame.reset_index() method will probably be enough. But if you end up with duplicate rows when you concat the DataFrames, just use the pd.DataFrame.drop_duplicates() method. Otherwise, share a bit of your code or clarify.
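A short sketch of those two suggestions, assuming df_a and df_b are hypothetical DataFrames being appended:
import pandas as pd
combined = pd.concat([df_a, df_b]).reset_index(drop=True)   # give every row a fresh, unique index
# or, if whole rows are duplicated after the concat:
combined = pd.concat([df_a, df_b]).drop_duplicates()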
I'm not sure the code below is what you're looking for.
Say we have two DataFrames with one column, the same index, and different values, and you want to overwrite the values in one DataFrame with the values from the other. You can do it with a simple loop using the iloc/loc indexers.
import pandas as pd
df_1 = pd.DataFrame({'col_1': ['a', 'b', 'c', 'd']})
df_2 = pd.DataFrame({'col_1': ['q', 'w', 'e', 'r']})
rows = df_1.shape[0]
for idx in range(rows):
    # overwrite df_1's value at this position with df_2's value
    df_1.loc[idx, 'col_1'] = df_2.loc[idx, 'col_1']
Then check df_1; you should get this:
df_1
col_1
0 q
1 w
2 e
3 r
If this isn't what you wanted, let me know so I can help further.
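A vectorized alternative to the loop, under the same assumption that the two frames line up positionally:
df_1['col_1'] = df_2['col_1'].to_numpy()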

pandas max function results in inoperable DataFrame

I have a DataFrame with four columns and want to generate a new DataFrame with only one column containing the maximum value of each row.
Using df2 = df1.max(axis=1) gave me the correct results, but the column is titled 0 and is not operable, meaning I cannot check its data type or change its name, which is critical for further processing. Does anyone know what is going on here? Or better yet, does anyone have a better way to generate this new DataFrame?
It is a Series; for a one-column DataFrame use Series.to_frame:
df2 = df1.max(axis=1).to_frame('maximum')
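For example, with some hypothetical numeric data:
import pandas as pd
df1 = pd.DataFrame({'a': [1, 5], 'b': [3, 2], 'c': [4, 0], 'd': [2, 9]})
df2 = df1.max(axis=1).to_frame('maximum')
print(df2['maximum'].dtype)   # the column now has a name and a checkable dtype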

How to find if a value exists in all rows of a dataframe?

I have an array of unique elements and a dataframe.
I want to find out if the elements in the array exist in all the rows of the dataframe.
p.s- I am new to python.
This is the piece of code I've written.
for i in uniqueArray:
    for index, row in newDF.iterrows():
        if i in row['MKT']:
            # do something to find out if the element i exists in all rows
Also, this way of iterating is quite expensive, is there any better way to do the same?
Thanks in Advance.
Pandas allows you to filter a whole column as if it were Excel:
import pandas
df = pandas.DataFrame(tableData)
Imagine your column names are "Column1", "Column2", etc.
df2 = df[df["Column1"] == "ValueToFind"]
df2 now has only the rows that have "ValueToFind" in df["Column1"]. You can combine several filters with the logical AND (&) and OR (|) operators.
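For instance, a hypothetical combination of two filters (note the parentheses around each condition, and that "Column2" is an assumed column name):
df3 = df[(df["Column1"] == "ValueToFind") & (df["Column2"] > 0)]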
You can try
for i in uniqueArray:
    if newDF['MKT'].str.contains(i).any():
        # do your task
You can use isin() method of pd.Series object.
Assuming you have a data frame named df, and you want to check whether your column 'MKT' includes any items of your uniqueArray:
new_df = df[df.MKT.isin(uniqueArray)].copy()
new_df will only contain the rows where the value of MKT is contained in uniqueArray.
Now do your thing on new_df, and join/merge/concat it to the former df as you wish.
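None of the above checks the "all rows" part directly; here is a hedged sketch of that check, assuming each MKT cell is a string or a collection that supports membership tests:
present_in_all = {i: newDF['MKT'].apply(lambda cell: i in cell).all() for i in uniqueArray}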

Splitting Pandas Dataframe with groupby and last

I am working with a pandas dataframe where I want to group by one column, grab the last row of each group (creating a new dataframe), and then drop those rows from the original.
I've done a lot of reading and testing, and it seems that I can't do that as easily as I'd hoped. I can do a kludgy solution, but it seems inefficient and, well, kludgy.
Here's pseudocode for what I wanted to do:
df = pd.DataFrame
last_lines = df.groupby('id').last()
df.drop(last_lines.index)
Creating the last_lines dataframe is fine; it's dropping those rows from the original df that's an issue. The problem is that the original index (from df) is disconnected when last_lines is created. I looked at filter and transform, but neither seems to address this problem. Is there a good way to split the dataframe into two pieces based on position?
My kludge solution is to iterate over the group iterator and create a list of indexes, then drop those.
grouped = df.groupby('id')
idx_to_remove = []
for _, group in grouped:
    idx_to_remove.append(group.tail(1).index[0])
df.drop(idx_to_remove)
Better suggestions?
If you use .reset_index() first, you'll get the index as a column and you can use .last() on that to get the indices you want.
last_lines = df.reset_index().groupby('id').index.last()
df.drop(last_lines)
Here the index is accessed as .index because "index" is the default name given to this column when you use reset_index. If your index has a name, you'll use that instead.
You can also "manually" grab the last index by using .apply():
last_lines = df.groupby('id').apply(lambda g: g.index[-1])
You'll probably have to do it this way if you're using a MultiIndex (since in that case using .reset_index() would add multiple columns that can't easily be combined back into indices to drop).
Try:
df.groupby('id').apply(lambda x: x.iloc[:-1, :])
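A shorter sketch that yields both pieces while keeping the original index, assuming df has an 'id' column as in the question (groupby(...).tail(1) does not reset the index, so its index can be dropped from df directly):
last_lines = df.groupby('id').tail(1)       # last row of each group, original index preserved
remainder = df.drop(last_lines.index)       # everything else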
