Python Pandas Don't Repeat Item Labels - python

I have a table: Table
How would I roll up Group, so that the group numbers don't repeat? I don't want to pd.df.groupby, as I don't want to summarize the other columns. I just want to not repeat item labels, sort of like an Excel pivot table.
Thanks!

In your dataframe it appears that 'Group' is in the index. The purpose of the index is to label each row, so it is unusual to have blank row index values.
You could do this:
df2.reset_index().set_index('Group', append=True).swaplevel(0,1,axis=0)
Or, if you really must show blank row index values, you could do this, but you must first change the dtype of the index to str.
df1 = df.set_index('Group')
df1.index = df1.index.astype(str)  # convert the index itself (not the data columns) to str
df1.index = df1.index.where(~df1.index.duplicated(), ' ')
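As a small illustration of the blanking idea, here is a sketch on a made-up frame (the column names and values are assumptions, not the asker's data). Note that duplicated() flags every repeat of a label, not only consecutive ones, so this assumes the rows are already ordered by Group.

```python
import pandas as pd

# Hypothetical data standing in for the asker's table.
df = pd.DataFrame({"Group": [1, 1, 2, 2, 2], "Item": ["a", "b", "c", "d", "e"]})

out = df.set_index("Group")
# The blanks are strings, so the index has to hold strings too.
out.index = out.index.astype(str)
# Keep the first occurrence of each label and blank out the repeats.
out.index = out.index.where(~out.index.duplicated(), " ")

print(out)
```

The data columns are left untouched, so nothing is summarized or dropped; only the displayed labels change.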

Related

Combining Rows in a Single Dataframe

I have a dataframe that looks like this, where there is a new row per ID if one of the following columns has a value. I'm trying to combine on the ID, and just consolidate all of the remaining columns. I've tried every groupby/agg combination and can't get the right output. There are no conflicting column values. So for instance if ID "1" has an email value in row 0, the remaining rows will be empty in the column. So I just need it to sum/consolidate, not concatenate or anything.
my current dataframe:
the output i'm looking to achieve:
# fill Nones in string columns with empty string
df[['email', 'status']] = df[['email', 'status']].fillna('')
df = df.groupby('id').agg('max')
If you still want the index as shown in your desired output,
df = df.reset_index(drop=False)
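To make the groupby/max step concrete, here is a minimal sketch on invented data (the ids, emails, and statuses are assumptions). It relies on the condition stated in the question: each id has at most one non-empty value per column, so the maximum of one real string and some empty strings is the real string.

```python
import pandas as pd

# Hypothetical input: one row per id for each column that has a value.
df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "email": ["a@x.com", None, None, "b@x.com"],
    "status": [None, "active", "new", None],
})

# Replace missing values with empty strings so they sort below any real value.
df[["email", "status"]] = df[["email", "status"]].fillna("")

# One row per id; max() picks the single non-empty value in each column.
out = df.groupby("id").agg("max").reset_index()

print(out)
```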

Finding first repeated consecutive entries in pandas dataframe

I have a dataframe of two columns Stock and DueDate, where I need to select first row from the repeated consecutive entries based on stock column.
df:
I am expecting output like below,
Expected output:
My Approach
The approach I tried is to first identify which rows repeat based on the Stock column by creating a new column, repeated_yes, and then select only the first row of a run if its rows repeat more than twice.
I have used the below line of code to create new column "repeated_yes",
ss = df.Stock.ne(df.Stock.shift())
df['repeated_yes'] = ss.groupby(ss.cumsum()).cumcount() + 1
so the new updated dataframe looks like this,
df_new
But I am stuck on subsetting only rows 3 and 8 in order to get the result. If there is a more effective approach, it would be helpful.
Edited:
Forgot to include the actual full question,
If there are any other rows below the last row in the dataframe df it should not display any output.
Create another mask with Series.duplicated and keep=False, chain it with & (bitwise AND), and filter with boolean indexing:
ss = df.Stock.ne(df.Stock.shift())
ss1 = ss.cumsum().duplicated(keep=False)
df = df[ss & ss1]
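The two masks can be traced on a small example. This is a sketch with invented stocks and dates (the question's data was not preserved), using the same calls as the answer with more descriptive, hypothetical variable names.

```python
import pandas as pd

# Hypothetical data: B repeats three times in a row, D twice, A and C once.
df = pd.DataFrame({
    "Stock": ["A", "B", "B", "B", "C", "D", "D"],
    "DueDate": pd.date_range("2021-01-01", periods=7, freq="D"),
})

# True on the first row of every run of consecutive equal values.
starts = df["Stock"].ne(df["Stock"].shift())
# cumsum() gives each run an id; duplicated(keep=False) is True for
# every row whose run id occurs more than once, i.e. runs longer than one row.
in_run = starts.cumsum().duplicated(keep=False)

# First row of each run that actually repeats.
result = df[starts & in_run]
print(result)
```

Here the run ids are 1, 2, 2, 2, 3, 4, 4, so only the first B row and the first D row pass both masks.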

Selecting Row vs Column in a DataFrame

In a Pandas DataFrame, say cars, I can select and print a single column like this:
# country is a column
print(cars['country'])
However, when I try to do the same thing with a row, it fails:
#US is a row
print(cars['US'])
KeyError: 'US'
Then I tried this and it worked:
print(cars['US':'US'])
So, in a Pandas DataFrame, column indexes are keys and row indexes are not? Could someone explain the reason for making row selection more complicated than column selection?
In pandas
cars['country']
returns the column as a Series. To slice a row as a Series, the equivalent command would be (assuming 'US' is an index value):
cars.loc['US']
If you do
cars['US':'US']
then you'll get the same data, but it'll be a DataFrame instead of a Series, so not quite equivalent.
Obviously, you can't have both columns and rows be referenced in the same way. If you had a dataframe of correlations between all the stocks in the S&P 500, then your column names and index values would have the same elements. So you'd need to know whether df['AAPL'] was referring to 'AAPL' from the index or from the columns. So df['AAPL'] is the 'AAPL' column, while df.loc['AAPL'] is the 'AAPL' row. You could also do df.loc['AAPL', :] for the row and df.loc[:, 'AAPL'] for the column, if you prefer.
As for why they chose df['AAPL'] to be the column, my best guess is that typically dates are listed as the rows, so it's more common to select a column of data to do some operation on than it is to select a row. However, slicing rows is just as easy as columns, so there really isn't much difference except a few keystrokes.
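The difference between the three spellings can be checked on a tiny frame. This is a sketch; the cars data below is made up for illustration and is not the asker's dataset.

```python
import pandas as pd

# Hypothetical stand-in for the `cars` frame, indexed by country code.
cars = pd.DataFrame(
    {"country": ["United States", "Japan"], "cars_per_cap": [809, 588]},
    index=["US", "JPN"],
)

column = cars["country"]     # plain brackets with a label select a column (Series)
row = cars.loc["US"]         # .loc with a label selects a row (Series)
row_frame = cars["US":"US"]  # a label slice selects rows, returned as a DataFrame

# Plain brackets with a single row label look for a column and fail.
try:
    cars["US"]
    raised = False
except KeyError:
    raised = True

print(type(column).__name__, type(row).__name__, type(row_frame).__name__, raised)
```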

How to print a specific row of a pandas DataFrame?

I have a massive DataFrame, and I'm getting the error:
TypeError: ("Empty 'DataFrame': no numeric data to plot", 'occurred at index 159220')
I've already dropped nulls, and checked dtypes for the DataFrame so I have no guess as to why it's failing on that row.
How do I print out just that row (at index 159220) of the DataFrame?
When you call loc with a scalar value, you get a pd.Series. That series will then have one dtype. If you want to see the row as it is in the dataframe, you'll want to pass an array-like indexer to loc.
Wrap your index value in an additional pair of square brackets:
print(df.loc[[159220]])
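A quick sketch of why the extra brackets matter when hunting for a dtype problem. The two-column frame here is an assumption made for illustration; only the index label 159220 comes from the question.

```python
import pandas as pd

# Hypothetical frame with one numeric and one string column.
df = pd.DataFrame(
    {"price": [1.5, 2.5], "name": ["a", "b"]},
    index=[159219, 159220],
)

as_series = df.loc[159220]   # scalar label: one Series holding the whole row
as_frame = df.loc[[159220]]  # list of labels: one-row DataFrame

print(as_series.dtype)  # a single dtype for the mixed row
print(as_frame.dtypes)  # each column keeps its own dtype
```

With mixed columns the Series falls back to a single common dtype, which hides the per-column types, while the one-row DataFrame still reports them.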
To print a specific row, pandas offers a few indexers:
loc - selects by label, i.e. the row index label and the column name
iloc - the i stands for integer; selects by integer position (row number)
ix - a mix of label and integer indexing (deprecated in pandas 0.20 and removed in pandas 1.0)
How to use them for a specific row
loc
df.loc[row,column]
For the row labeled 0 and all columns
df.loc[0,:]
For the row labeled 0 and a specific column
df.loc[0,'column_name']
iloc
For the first row and all columns
df.iloc[0,:]
For the first row and the first three columns
df.iloc[0,0:3]
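The label/position distinction only becomes visible when the index is not 0, 1, 2, and so on. A minimal sketch, assuming a made-up frame with index labels 100, 200, 300:

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions.
df = pd.DataFrame(
    {"a": [10, 20, 30], "b": [1.0, 2.0, 3.0]},
    index=[100, 200, 300],
)

by_label = df.loc[200, "a"]      # row labeled 200, column named "a"
by_position = df.iloc[0, 0:2]    # first row, first two columns by position

print(by_label)
print(by_position)
```

On this frame df.loc[0, :] would raise a KeyError, because no row is labeled 0, while df.iloc[0, :] returns the row labeled 100.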
Use the ix indexer (on pandas 1.0 and later, where ix no longer exists, use loc instead):
print(df.loc[159220])
If you want to display the row at index 159220 (display is available in IPython/Jupyter):
row = 159220
# To display in a table format (one-row DataFrame)
display(df.loc[row:row])
display(df.iloc[row:row+1])
# To display as a Series
display(df.loc[row])
display(df.iloc[row])
Sounds like you're calling df.plot(). That error indicates that you're trying to plot a frame that has no numeric data. The data types shouldn't affect what you print().
Use print(df.iloc[159220])
You can also index the index and use the result to select row(s) using loc:
row = 159220 # with an integer `row`, the selection below returns a pandas Series
row = [159220] # with a list `row`, the selection below returns a pandas DataFrame
df.loc[df.index[row]]
This is especially useful if you want to select rows by integer-location and columns by name. For example:
rows = 159220
cols = ['col2', 'col6']
df.loc[df.index[rows], cols] # <--- OK
df.iloc[rows, cols] # <--- doesn't work: iloc does not accept column names
df[cols].iloc[rows] # <--- OK but creates an intermediate copy
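A runnable sketch of selecting rows by position and columns by name, on a small invented frame (the index labels and values are assumptions). It also shows the reverse translation with Index.get_indexer, which turns column names into positions for iloc; that variant is an addition, not part of the answer above.

```python
import pandas as pd

# Hypothetical frame with string row labels, so positions and labels differ.
df = pd.DataFrame(
    {"col1": [1, 2, 3], "col2": [4, 5, 6], "col6": [7, 8, 9]},
    index=["x", "y", "z"],
)

pos = 1                    # integer position of the wanted row
cols = ["col2", "col6"]    # wanted columns, by name

# Translate the position to a label, then use label-based loc.
via_loc = df.loc[df.index[pos], cols]

# Or translate the names to positions, then use position-based iloc.
via_iloc = df.iloc[pos, df.columns.get_indexer(cols)]

print(via_loc)
```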

Unique Value Index from two fields

I'm new to pandas and python, and could definitely use some help.
I have the code below, which almost does what I want. It creates dummy variables for the unique values in a field and indexes them by the unique combinations of the unique values in two other fields.
What I would like is only one row for each unique combination of the fields used for the index. Right now I get multiple rows for say 'asset subs end dt' = 10/30/2008 and 'reseller csn' = 55008 if the dummy variable comes up 3 times. I would rather have one row for the combination of index field values with a 3 in the dummy variable column.
Code:
df = data
df = df.set_index(['ASSET_SUBS_END_DT','RESELLER_CSN'])
Dummies=pd.get_dummies(df['EXPERTISE'])
Something like:
df.groupby(level=[0, 1]).EXPERTISE.count()
When you do this groupby, everything with the same index is grouped together. Assuming your data in EXPERTISE is not null, you will get a new Series with unique index values and the count for each index value. Try it out for yourself, play around with the results, and see how it can be combined with your existing DataFrame to get the final result you want.
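One possible way to combine the two pieces, offered as a sketch rather than as the answerer's method: group the dummy columns themselves by the index levels and sum them, which gives one row per index combination with a count in each dummy column. The column names come from the question; the rows, the second date, the second reseller number, and the expertise values are invented.

```python
import pandas as pd

# Hypothetical data: the same (date, reseller) pair appears three times.
data = pd.DataFrame({
    "ASSET_SUBS_END_DT": ["10/30/2008", "10/30/2008", "10/30/2008", "11/15/2008"],
    "RESELLER_CSN": [55008, 55008, 55008, 60001],
    "EXPERTISE": ["X", "X", "X", "Y"],
})

df = data.set_index(["ASSET_SUBS_END_DT", "RESELLER_CSN"])
# Integer dummies so that summing them yields counts.
dummies = pd.get_dummies(df["EXPERTISE"], dtype=int)

# One row per unique index combination, with per-value counts.
counts = dummies.groupby(level=[0, 1]).sum()
print(counts)
```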
