I've completed an analysis and converted the results into a data frame using result.to_frame(). The result looks like this:
The next step is to build a pivot table based on column 1 and sum the counts. To do that, I want to label the first column "query". I used this code, but the first column is not renamed:
df.rename(columns={'query': 'sum of counts'}, inplace=True)
I found that this data frame only has one column, even though I can see two.
It looks like you do indeed have one column, and what you see as the first column is actually your index. If your DataFrame is "df", try the following:
df.shape # this should show (N, 1) where N is the number of records
df['Query'] = df.index # to pull a copy of the index into a column
You should be able to proceed normally then.
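If you'd rather turn the index into a proper column in one step, here is a minimal sketch; it assumes your index holds the query strings and that your counts column is literally named 'counts' (adjust the names to your data):
df = df.rename_axis('query').reset_index()  # name the index, then make it a regular column
summed = df.groupby('query')['counts'].sum()  # hypothetical next step: sum the counts per query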
I have a dataframe that looks like this, where there is a new row per ID if one of the following columns has a value. I'm trying to combine on the ID and consolidate all of the remaining columns. I've tried every groupby/agg combination and can't get the right output. There are no conflicting column values: for instance, if ID "1" has an email value in row 0, the remaining rows for that ID are empty in that column. So I just need it to consolidate, not concatenate.
My current dataframe:
The output I'm looking to achieve:
# fill Nones in string columns with empty string
df[['email', 'status']] = df[['email', 'status']].fillna('')
# 'max' per group picks the single non-empty value in each column
df = df.groupby('id').agg('max')
If you still want id back as a regular column, as shown in the desired output:
df = df.reset_index(drop=False)
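For reference, a minimal runnable sketch of the whole answer with made-up data mirroring the question's shape (the id/email/status values are hypothetical):
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'email': ['a@x.com', None, None, 'b@y.com'],
    'status': [None, 'active', 'active', None],
})

df[['email', 'status']] = df[['email', 'status']].fillna('')
# 'max' on strings is lexicographic, so the single non-empty value beats ''
df = df.groupby('id').agg('max').reset_index(drop=False)
#    id    email  status
# 0   1  a@x.com  active
# 1   2  b@y.com  active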
I am new to using Python with data sets and am trying to exclude a column ("id") from being shown in the output. Wondering how to go about this using the describe() and exclude functions.
describe() works on data types: you can include or exclude columns based on their dtype, not by column name. If your id column is the only one of its data type, then
df.describe(exclude=[datatype])
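For example, if id happened to be the only int64 column (an assumption for illustration; check df.dtypes first):
df.describe(exclude=['int64'])  # summarizes every column except the int64 one(s)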
Or, if you just want to remove specific column(s) before calling describe(), try this:
cols = set(df.columns) - {'id'}
df1 = df[list(cols)]
df1.describe()
Ta-da, it's done. For more info, see the pandas documentation for describe().
You can do that by slicing your original DataFrame to remove the 'id' column. One way is through .iloc. Suppose the column 'id' is the first column of your DataFrame; then you could do this:
df.iloc[:,1:].describe()
The first colon represents the rows, the second the columns.
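If you know the column by name rather than by position, an equivalent sketch is to drop it on the fly (drop returns a copy, so the original df is untouched):
df.drop(columns=['id']).describe()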
Although somebody already responded with an example from the official docs, which is more than enough, I'd just like to add this, since it might help a few people:
If your DataFrame is large (say, hundreds of columns), removing one or two columns might not be enough; instead, create a smaller DataFrame holding only what you're interested in and go from there.
Example of removing two or more columns:
columns_you_dont_want = {'column_1', 'column_2', 'column_3'}
columns_to_keep = set(your_bigger_data_frame.columns) - columns_you_dont_want
your_new_smaller_data_frame = your_bigger_data_frame[list(columns_to_keep)]
your_new_smaller_data_frame.describe()
If your DataFrame is medium or small, you already know every column, and you only need a few of them, just create a new DataFrame and then apply describe().
Here's an example: read a .csv file, then keep a smaller portion of that DataFrame which holds only what you need:
df = pd.read_csv(r'.\docs\project\file.csv')  # raw string so the backslashes aren't treated as escape sequences
df = df[['column_1', 'column_2', 'column_3']]  # keep only the columns you need
df.describe()
Use output.drop(columns=['id']).describe(). Note that describe's exclude parameter filters by dtype, not by column name, so exclude=['id'] would not work.
I have a dataframe of two columns, Stock and DueDate, where I need to select the first row from each run of repeated consecutive entries based on the Stock column.
df:
I am expecting output like below,
Expected output:
My Approach
The approach I tried is to first flag which rows repeat, based on the Stock column, by creating a new column repeated_yes, and then subset the first row of any run that repeats more than once.
I used the line of code below to create the new column "repeated_yes":
# True at the start of each run of consecutive Stock values
ss = df.Stock.ne(df.Stock.shift())
# number each row within its run, starting from 1
df['repeated_yes'] = ss.groupby(ss.cumsum()).cumcount() + 1
so the updated dataframe looks like this,
df_new
But I am stuck on subsetting only rows 3 and 8 in order to obtain the result. If there is any other effective approach, that would be helpful too.
Edited:
Forgot to include the actual full question: if there are any other rows below the last repeated row in the dataframe df, they should not be displayed in the output.
Chain another mask, created by Series.duplicated with keep=False, using & (bitwise AND), and filter with boolean indexing:
ss = df.Stock.ne(df.Stock.shift())
ss1 = ss.cumsum().duplicated(keep=False)
df = df[ss & ss1]
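A minimal runnable sketch with made-up data, to show what the two masks do (the Stock/DueDate values are hypothetical):
import pandas as pd

df = pd.DataFrame({'Stock': ['A', 'A', 'B', 'C', 'C', 'C', 'D'],
                   'DueDate': ['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7']})

# ss: True at the start of each run; ss.cumsum() labels each run
ss = df.Stock.ne(df.Stock.shift())
# ss1: True for rows whose run label occurs more than once, i.e. runs longer than 1
ss1 = ss.cumsum().duplicated(keep=False)
print(df[ss & ss1])
#   Stock DueDate
# 0     A      d1
# 3     C      d4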
For a dataframe which looks like this:
I want to simply set the index to be the Date column, which you see as the first column.
The dataframe comes from an API; I save the data into a csv:
data.to_csv('stocks.csv', header=True, sep=',', mode='a')
data = pd.read_csv('stocks.csv',header=[0,1,2])
data
Preferably I would also like to get rid of the "Unnamed: ..." labels you see in the picture.
Thanks.
I solved it by specifying header=[0,1], index_col=0 in the read_csv call and afterwards converting the dataframe to numeric, since the dtypes got distorted (though I believe that is not always necessary):
data = pd.read_csv('stocks.csv', header=[0,1], index_col=0)
data = data.apply(pd.to_numeric, errors='coerce')
# eventually:
data = data.dropna()
In this fashion I get exactly what I want; namely, I can write e.g.
data['AGN.AS']['High']
and get the high values for a specific stock.
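As an aside, with the two-level column header you can also pull one field for every ticker at once via a cross-section; a small sketch, assuming the field names sit on the second level of the columns:
highs = data.xs('High', axis=1, level=1)  # the 'High' column of every ticker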
I have a large data frame with over 20000 observations. I have a variable called "station" and I need to remove all rows that have only numbers as the station name.
The only code that has worked so far is:
df['station'][~df['station'].str.isnumeric()]
However, this only returns that single column, not the full data frame.
You can create an extra column with .str.isnumeric() to be used later on as a filter:
df['filter'] = df['station'].str.isnumeric()
df_filtered = df[df['filter'] != False]  # .drop(columns=['filter'])
This should return all rows whose station value is not only numbers. After that, you can remove the hash if you wish to drop the filter column and maintain your original structure.
You can do it like so,
df_filtered = df[df['station'].str.isnumeric() == False]
You wouldn't have to add an extra column to your dataframe if you use this.
The inner statement is ultimately a Boolean mask that is applied to the dataframe.
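One caveat on both approaches: if station contains missing values, .str.isnumeric() returns NaN for them, and boolean indexing with NaN in the mask raises an error. A sketch that treats NaN stations as non-numeric, i.e. keeps them (an assumption; adjust as needed):
# fillna(False) marks NaN stations as 'not numeric' so those rows are kept
mask = df['station'].str.isnumeric().fillna(False).astype(bool)
df_filtered = df[~mask]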