Splitting a Pandas DataFrame with groupby and last - python

I am working with a pandas dataframe where I want to group by one column, grab the last row of each group (creating a new dataframe), and then drop those rows from the original.
I've done a lot of reading and testing, and it seems that I can't do that as easily as I'd hoped. I can put together a kludgy solution, but it seems inefficient and, well, kludgy.
Here's pseudocode for what I wanted to do:
df = pd.DataFrame
last_lines = df.groupby('id').last()
df.drop(last_lines.index)
Creating the last_lines dataframe is fine; it's dropping those rows from the original df that's the issue. The problem is that the original index (from df) is disconnected when last_lines is created. I looked at filter and transform, but neither seems to address this problem. Is there a good way to split the dataframe into two pieces based on position?
My kludge solution is to iterate over the groups, collect a list of indexes, and then drop those:
grouped = df.groupby('id')
idx_to_remove = []
for _, group in grouped:
    idx_to_remove.append(group.tail(1).index[0])
df.drop(idx_to_remove)
Better suggestions?

If you use .reset_index() first, you'll get the index as a column and you can use .last() on that to get the indices you want.
last_lines = df.reset_index().groupby('id').index.last()
df.drop(last_lines)
Here the index is accessed as .index because "index" is the default name given to this column when you use reset_index. If your index has a name, you'll use that instead.
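For instance, a minimal sketch against a toy frame (the column and values below are made up to mirror the question's df; bracket selection is used instead of the .index attribute access just to be explicit):
import pandas as pd

# Toy stand-in for the question's df, grouped by 'id'
df = pd.DataFrame({'id': ['a', 'a', 'b', 'b', 'b'], 'val': range(5)})

# After reset_index(), the old index lives in a column named 'index'
last_lines = df.reset_index().groupby('id')['index'].last()

last_rows = df.loc[last_lines]   # the last row of each group
remaining = df.drop(last_lines)  # everything else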
You can also "manually" grab the last index by using .apply():
last_lines = df.groupby('id').apply(lambda g: g.index[-1])
You'll probably have to do it this way if you're using a MultiIndex (since in that case using .reset_index() would add multiple columns that can't easily be combined back into indices to drop).
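And a corresponding sketch of the apply() variant, using the same toy frame as above:
# One original index label per group, whatever the index type
last_idx = df.groupby('id').apply(lambda g: g.index[-1])
remaining = df.drop(last_idx)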

Try:
df.groupby('id').apply(lambda x: x.iloc[:-1, :])
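As a side note, this returns the kept rows with the group key prepended to the index. An alternative sketch that works directly on the original labels uses groupby().tail(1), which preserves the original index:
# tail(1) keeps the original index labels, so both halves of the split fall out directly
last_lines = df.groupby('id').tail(1)
remaining = df.drop(last_lines.index)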

Related

Get result of merge in a for loop

I have a huge 800k-row dataframe whose key I need to match against another dataframe.
Initially I was looping through my two dataframes and comparing the key values with a condition.
I was told that merge could save time. However, I can't get it to work :(
Overall, here's the code I'm trying to adapt:
import pandas as pd
from tqdm import tqdm

mergeTwo = pd.read_json('merge/mergeUpdate.json')
matches = pd.read_csv('archive/matches.csv')

# First attempt: nested iterrows() loops
for indexOne, value in tqdm(mergeTwo.iterrows()):
    for index, match in matches.iterrows():
        if value["gameid"] == match["gameid"]:
            print(match)

# Second attempt: merge row by row
for index, value in mergeTwo.iterrows():
    test = value.to_frame().merge(matches, on='gameid')
    print(test)
The first approach works fine.
The second one complains about an unknown key (gameid).
Anyone got a solution?
Thanks in advance!
When you iterate over rows, value is a Series, which to_frame() turns into a one-column frame whose index is the original column names. So you need to transpose it to make the second approach work:
for index, value in mergeTwo.iterrows():
    # note the .T after .to_frame()
    test = value.to_frame().T.merge(matches, on='gameid')
    print(test)
But the iteration is unnecessary; a single merge applied to the first frame should be enough:
mergeTwo.merge(matches, on='gameid', how='left')
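For illustration, a minimal sketch with made-up frames (the gameid values and the score/team columns are placeholders, not the real data):
import pandas as pd

# Hypothetical stand-ins for mergeTwo and matches
mergeTwo = pd.DataFrame({'gameid': [1, 2, 3], 'score': [10, 20, 30]})
matches = pd.DataFrame({'gameid': [1, 3], 'team': ['red', 'blue']})

# One vectorized merge replaces the per-row loop entirely
result = mergeTwo.merge(matches, on='gameid', how='left')
print(result)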

Is there a way to reverse the dropping method in pandas?

I'm aware that you can use
df1 = df1[df1['Computer Name'] != 'someNameToBeDropped']
to drop rows that contain a given string.
What if I wanted to do it the other way around, i.e. drop everything except what I have in a list of strings?
Is there a simple hack I haven't noticed?
Try this to keep only the rows whose column value is in the given list:
df = df[df[column].isin(list_of_strings)]
Additionally, to exclude what's in the list:
df = df[~df[column].isin(list_of_values)]
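For example, a minimal sketch (the frame contents and the keep-list here are made up):
import pandas as pd

df1 = pd.DataFrame({'Computer Name': ['alpha', 'beta', 'gamma'],
                    'ip': ['10.0.0.1', '10.0.0.2', '10.0.0.3']})
keep = ['alpha', 'gamma']

kept = df1[df1['Computer Name'].isin(keep)]      # rows whose name is in the list
dropped = df1[~df1['Computer Name'].isin(keep)]  # rows whose name is not in the list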

How to find if a value exists in all rows of a dataframe?

I have an array of unique elements and a dataframe.
I want to find out if the elements in the array exist in all the rows of the dataframe.
P.S. I am new to Python.
This is the piece of code I've written.
for i in uniqueArray:
    for index, row in newDF.iterrows():
        if i in row['MKT']:
            # do something to find out if the element i exists in all rows
Also, this way of iterating is quite expensive; is there a better way to do the same?
Thanks in advance.
Pandas allows you to filter a whole column as if it were Excel:
import pandas
df = pandas.DataFrame(tableData)
Imagine your column names are "Column1", "Column2", etc.
df2 = df[df["Column1"] == "ValueToFind"]
df2 now has only the rows that have "ValueToFind" in df["Column1"]. You can combine several filters with the logical operators & (and) and | (or).
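For instance, a small sketch of combining two filters (the data and column names here are placeholders following the naming above):
import pandas as pd

df = pd.DataFrame({"Column1": ["a", "b", "a"], "Column2": [1, 2, 3]})

# Combine filters with & (and) / | (or); each condition needs its own parentheses
both = df[(df["Column1"] == "a") & (df["Column2"] > 1)]
either = df[(df["Column1"] == "b") | (df["Column2"] > 2)]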
You can try:
for i in uniqueArray:
    if newDF['MKT'].str.contains(i).any():
        # do your task
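Since the question asks whether each element appears in every row, a variant of the same idea (assuming 'MKT' holds plain strings) could collect one boolean per element with .all() instead of .any():
# Hypothetical sketch: True means element i occurs somewhere in 'MKT' for every row
in_all_rows = {i: newDF['MKT'].str.contains(i, regex=False).all() for i in uniqueArray}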
You can use the isin() method of a pd.Series object.
Assuming you have a dataframe named df, you can check whether your column 'MKT' includes any items of your uniqueArray:
new_df = df[df.MKT.isin(uniqueArray)].copy()
new_df will only contain the rows where the value of MKT is contained in uniqueArray.
Now do your work on new_df, and join/merge/concat it back to the original df as you wish.

Dask: subset (or drop) rows from Dataframe by index

I'd like to take a subset of rows of a Dask dataframe based on a set of index keys. (Specifically, I want to find rows of ddf1 whose index is not in the index of ddf2.)
Both cache.drop([overlap_list]) and diff = cache[should_keep_bool_array] either throw a NotImplementedError or otherwise don't work.
What is the best way to do this?
I'm not sure this is the "best" way, but here's how I ended up doing it:
Create a Pandas DataFrame whose index is the series of index keys I want to keep (e.g., pd.DataFrame(index=overlap_list))
Inner join it with the Dask dataframe
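A minimal sketch of that approach (assuming ddf1 is the Dask dataframe from the question and overlap_list holds the index keys to keep):
import pandas as pd

# Small pandas frame whose only content is the index keys to keep
keep = pd.DataFrame(index=overlap_list)

# Inner join on the index keeps exactly those rows of ddf1
subset = ddf1.merge(keep, left_index=True, right_index=True, how='inner')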
Another possibility is:
df_index = df.reset_index()
df_index = df_index.drop_duplicates()

Pandas Column of Lists to Separate Rows

I've got a dataframe that contains analysed news articles, with each row referencing an article and columns with some information about that article (e.g. tone).
One column of that df contains a list of FIPS country codes of the locations that were mentioned in that article.
I want to "extract" these country codes such that I get a dataframe in which each mentioned location has its own row, along with the other columns of the original row in which that location was referenced (there will be multiple rows with the same information, but different locations, as the same article may mention multiple locations).
I tried something like this, but iterrows() is notoriously slow, so is there any faster/more efficient way for me to do this?
Thanks a lot.
'events' is the column that contains the locations
'event_cols' are the columns from the original df that I want to retain in the new df.
'df_events' is the new data frame
for i, row in df.iterrows():
    for location in df.events.loc[i]:
        try:
            df_storage = pd.DataFrame(row[event_cols]).T
            df_storage['loc'] = location
            df_events = df_events.append(df_storage)
        except ValueError as e:
            continue
I would group the DataFrame with groupby(), explode the lists with a combination of apply and a lambda function, and then reset the index and drop the level column that is created to clean up the resulting DataFrame.
df_events = df.groupby(['event_col1', 'event_col2', 'event_col3'])['events']\
    .apply(lambda x: pd.DataFrame(x.values[0]))\
    .reset_index().drop('level_3', axis=1)
In general, I always try to find a way to use apply() before most other methods, because it is often much faster than iterating over each row.
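As a side note, newer pandas versions (0.25+) also have a built-in explode() that does this list-to-rows expansion directly; a hedged sketch, assuming 'events' holds lists and all original columns should be carried along:
# Each list element in 'events' gets its own row; the other columns are repeated
df_events = df.explode('events').rename(columns={'events': 'loc'}).reset_index(drop=True)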
