I have a dataframe that looks like this, where there is a new row per ID if one of the following columns has a value. I'm trying to combine rows on the ID and consolidate all of the remaining columns. I've tried every groupby/agg combination and can't get the right output. There are no conflicting column values: for instance, if ID "1" has an email value in row 0, the remaining rows for that ID will be empty in that column. So I just need it to sum/consolidate, not concatenate or anything.
my current dataframe:
the output i'm looking to achieve:
# fill Nones in string columns with empty string
df[['email', 'status']] = df[['email', 'status']].fillna('')
df = df.groupby('id').agg('max')
If you still want the index as shown in your desired output:
df = df.reset_index(drop=False)
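For example, on a toy frame shaped like the question (the column values here are hypothetical):

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2],
                   'email': ['a@x.com', None, None, 'b@y.com'],
                   'status': [None, 'active', 'new', None]})

# fill Nones, then take the max per id: '' sorts before any
# non-empty string, so the single filled value per group wins
df[['email', 'status']] = df[['email', 'status']].fillna('')
out = df.groupby('id').agg('max').reset_index()

#    id    email  status
# 0   1  a@x.com  active
# 1   2  b@y.com     new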
I have a dataframe with columns that are mostly blanks (null/NaN set to 0) with sporadic number values.
I am trying to compare the last two non-zero values in a dataframe column.
Something like :
df['Column_c'] = df['column_a'].last_non_zero_value > df['column_a'].second_to_last_non_zero_value
This is what the columns look like in excel
You could drop all the rows with missing data using df.dropna(), then pull the column's values out as an array, from which the last two elements are easy to pick out.
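A minimal sketch, assuming zeros (rather than NaN) mark the blanks and the column has at least two non-zero values (hypothetical data):

import pandas as pd

df = pd.DataFrame({'column_a': [0, 3, 0, 5, 2, 0]})

# keep only the non-zero values, then compare the last two
nonzero = df.loc[df['column_a'] != 0, 'column_a']
df['Column_c'] = nonzero.iloc[-1] > nonzero.iloc[-2]   # 2 > 5 -> False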
I have a dataframe with two columns, Stock and DueDate, and I need to select the first row from repeated consecutive entries based on the Stock column.
df:
I am expecting output like below,
Expected output:
My Approach
The approach I tried is to first mark which rows repeat based on the Stock column by creating a new column repeated_yes, and then subset the first row of each repeating block.
I have used the below line of code to create new column "repeated_yes",
ss = df.Stock.ne(df.Stock.shift())
df['repeated_yes'] = ss.groupby(ss.cumsum()).cumcount() + 1
so the new updated dataframe looks like this,
df_new
But I am stuck on subsetting only row numbers 3 and 8 in order to attain the result. If there is any other effective approach, that would be helpful too.
Edited:
Forgot to include the actual full question: if there are any other rows below the last repeating block in the dataframe df, they should not appear in the output.
Chain another mask created by Series.duplicated with keep=False, using & for bitwise AND, and filter with boolean indexing:
ss = df.Stock.ne(df.Stock.shift())
ss1 = ss.cumsum().duplicated(keep=False)
df = df[ss & ss1]
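A quick demo on hypothetical data shaped like the question:

import pandas as pd

df = pd.DataFrame({'Stock': ['A', 'A', 'A', 'B', 'C', 'C'],
                   'DueDate': ['d1', 'd2', 'd3', 'd4', 'd5', 'd6']})

ss = df.Stock.ne(df.Stock.shift())        # True at the start of each consecutive run
ss1 = ss.cumsum().duplicated(keep=False)  # True only for runs longer than one row

print(df[ss & ss1])
#   Stock DueDate
# 0     A      d1
# 4     C      d5

The single-row 'B' run is dropped because its run id is never duplicated, so trailing non-repeating rows produce no output, as the edit requires.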
I have a table:
How would I roll up Group so that the group numbers don't repeat? I don't want to use df.groupby, as I don't want to summarize the other columns. I just want to not repeat item labels, sort of like an Excel pivot table.
Thanks!
In your dataframe it appears that 'Group' is in the index; the purpose of the index is to label each row, so it is unusual and uncommon to have blank row indexes.
You could do this:
df2.reset_index().set_index('Group', append=True).swaplevel(0,1,axis=0)
Or, if you really must show blank row indexes, you could do this; note that you must change the dtype of the index to str.
df1 = df.set_index('Group').astype(str)
df1.index = df1.index.where(~df1.index.duplicated(),[' '])
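A small demo on a hypothetical frame:

import pandas as pd

df = pd.DataFrame({'Group': [1, 1, 2, 2], 'Item': ['a', 'b', 'c', 'd']})
df1 = df.set_index('Group').astype(str)
# blank out every repeated index label, keeping the first of each group
df1.index = df1.index.where(~df1.index.duplicated(), [' '])

#   Item
# 1    a
#      b
# 2    c
#      d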
I have a Pandas dataframe with a text column in the format below. There are some values/text embedded between ## markers. I want to find the text between ## pairs and extract it into a separate column as a list.
##fare_curr.currency####based_fare_90d.price##
htt://www.abcd.lol/abcd-Search?from:##based_best_flight_fare_90d.air##,to:##mbased_90d.water##,departure:##mbased_90d.date_1##TANYT&pas=ch:0Y&mode=search
Consider the above two strings to be two rows of the same column. I want to get a new column with the list [fare_curr.currency, based_fare_90d.price] in the first row and [based_best_flight_fare_90d.air, mbased_90d.water, mbased_90d.date_1] in the second row.
Given this df
df = pd.DataFrame({'data':
    ['##fare_curr.currency####based_fare_90d.price##',
     'htt://www.abcd.lol/abcd-Search?from:##based_best_flight_fare_90d.air##,to:##mbased_90d.water##,departure:##mbased_90d.date_1##TANYT&pas=ch:0Y&mode=search']})
You can get the desired result in a new column using:
df['new'] = pd.Series(df.data.str.extractall('##(.*?)##').unstack().values.tolist())
You get
data new
0 ##fare_curr.currency####based_fare_90d.price## [fare_curr.currency, based_fare_90d.price, None]
1 htt://www.abcd.lol/abcd-Search?from:##based_be... [based_best_flight_fare_90d.air, mbased_90d.wa...
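If the trailing None padding is unwanted, an alternative sketch using str.findall keeps just the matches per row:

df['new'] = df['data'].str.findall(r'##(.*?)##')

# 0              [fare_curr.currency, based_fare_90d.price]
# 1    [based_best_flight_fare_90d.air, mbased_90d.w...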
I'm new to pandas and python, and could definitely use some help.
I have the code below, which almost does what I want. It creates dummy variables for the unique values in a field and indexes them by the unique combinations of the unique values in two other fields.
What I would like is only one row for each unique combination of the fields used for the index. Right now I get multiple rows for, say, 'asset subs end dt' = 10/30/2008 and 'reseller csn' = 55008 if the dummy variable comes up 3 times. I would rather have one row for that combination of index field values, with a 3 in the dummy variable column.
Code:
df = data
df = df.set_index(['ASSET_SUBS_END_DT','RESELLER_CSN'])
Dummies = pd.get_dummies(df['EXPERTISE'])
something like:
df.groupby(level=[0, 1]).EXPERTISE.count()
When you do this groupby, everything with the same index is grouped together. Assuming your data in EXPERTISE is not null, you will get back the count per unique index combination. Try it out for yourself, play around with the results, and see how it can be combined with your existing DataFrame to get the final result you want.
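If you would rather keep the counts spread across the dummy columns, one row per index combination, here is a sketch building on the same idea (using the Dummies frame from the question's code):

# sum the 0/1 dummy columns within each (ASSET_SUBS_END_DT, RESELLER_CSN)
# group; a 3 means that expertise appeared three times for that combination
result = Dummies.groupby(level=[0, 1]).sum()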