I am new to using Python with data sets and am trying to exclude a column ("id") from being shown in the output. How do I go about this using describe() and its exclude option?
describe() works on dtypes: you can include or exclude based on the dtype, not on column names. If your id column has a dtype that no other column shares, then
df.describe(exclude=[datatype])
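For example, assuming for illustration that id is the only integer column in the frame:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'price': [9.5, 7.0, 8.25]})
df.describe(exclude=['int64'])  # id is the only int64 column, so only price is summarised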
or if you just want to remove the column(s) from describe's output, then try this:
cols = set(df.columns) - {'id'}
df1 = df[list(cols)]
df1.describe()
Ta-da, it's done. For more info, see the pandas documentation for describe().
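Note that going through set() does not preserve the original column order. A simpler alternative that keeps the order (plain pandas, just another option) is drop:
df.drop(columns=['id']).describe()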
You can do that by slicing your original DataFrame to remove the 'id' column. One way is through .iloc. Suppose the column 'id' is the first column of your DataFrame; then you could do this:
df.iloc[:,1:].describe()
The first colon represents the rows, the second the columns.
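An alternative that does not depend on 'id' being the first column (a standard boolean-mask idiom, not from the answer above):
df.loc[:, df.columns != 'id'].describe()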
Although somebody responded with an example from the official docs which is more than enough, I'd just like to add this, since it might help a few people:
If your DataFrame is large (say, hundreds of columns), removing one or two columns might not be enough; instead, create a smaller DataFrame holding only what you're interested in and go from there.
Example of removing 2+ columns:
columns_you_dont_want = {'column_1', 'column_2', 'column_3', 'etc'}
columns_to_keep = set(your_bigger_data_frame.columns) - columns_you_dont_want
your_new_smaller_data_frame = your_bigger_data_frame[list(columns_to_keep)]
your_new_smaller_data_frame.describe()
If your DataFrame is medium/small, you already know every column, and you only need a few of them, just create a new DataFrame and then apply describe():
Here is an example that reads a .csv file and then builds a smaller DataFrame holding only what you need:
df = pd.read_csv(r'.\docs\project\file.csv')
df = df[['column_1', 'column_2', 'column_3', 'etc']]
df.describe()
Use output.drop(columns=['id']).describe(). Note that describe's exclude parameter filters by dtype, not by column name, so output.describe(exclude=['id']) would raise an error.
I'm trying to iterate over a large DataFrame that has 32 fields, 1 million plus rows.
What I'm trying to do is iterate over each row and check whether any of the other rows have duplicate information in 30 of the fields, while the other two fields have different information.
I'd then like to store the ID info of the rows that meet these conditions.
So far I've been trying to figure out how to check two rows with the code below. It seems to work when comparing single columns but throws an error when I try more than one column. Could anyone advise on how best to approach this?
for index in range(len(df)):
    for row in range(index, len(df)):
        if df.iloc[index][1:30] == df.iloc[row][1:30]:
            print(df.iloc[index])
As a general rule, you should always try not to iterate over the rows of a DataFrame.
It seems that what you need is the pandas duplicated() method. If you have a list of the 30 columns you want to use to determine duplicate rows, the code looks something like this:
df.duplicated(subset=['col1', 'col2', 'col3']) # etc.
Full example:
# Set up test df
import pandas as pd
from io import StringIO

# Each data row has one more field than the header, so pandas uses the
# first field (One/Two/Three) as the index and 'ID' holds 23/24/25.
sub_df = pd.read_csv(
    StringIO("""ID;col1;col2;col3
One;23;451;42;31
Two;24;451;42;54
Three;25;513;31;31"""
    ),
    sep=";"
)
Find which rows are duplicates in col1 and col2. Note that by default the first occurrence is not marked as a duplicate, but later occurrences are. This behaviour can be changed via the keep parameter, as described in the pandas documentation for duplicated().
mask = sub_df.duplicated(["col1", "col2"])
This is a boolean Series: True for row Two (the later duplicate), False for the others.
Now, filter using the mask.
sub_df["ID"][sub_df.duplicated(["col1", "col2"])]
Of course, you can do the last two steps in one line.
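Coming back to the original problem (30 fields that must match while the other two differ), a sketch along the following lines might work. The names cols_30, field_a, field_b and ID are hypothetical stand-ins for your actual columns:
# cols_30: list of the 30 fields that must match (hypothetical name)
# field_a / field_b: the two fields that must differ (hypothetical names)
dup_mask = df.duplicated(subset=cols_30, keep=False)  # keep=False marks every member of a duplicate group
candidates = df[dup_mask]
# keep only the groups where at least one of the two remaining fields actually differs
result = candidates.groupby(cols_30).filter(
    lambda g: g['field_a'].nunique() > 1 or g['field_b'].nunique() > 1
)
ids = result['ID']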
Scenario. Assume a pd.DataFrame, loaded from an external source, where each row is a line from a sensor. The index is a DateTimeIndex, and some rows have df.index.duplicated()==True. This actually means there are lines with the same timestamp from different sensors.
Now applying some logic, like df.loc[df.A>0, 'my_col'] = 1, I ran into ValueError: cannot reindex from a duplicate axis. This can be solved by simply removing the duplicated rows using
df[~df.index.duplicated()]
But I wonder if it would be possible to actually apply a column-based function during the index de-duplication process, e.g. calculating the mean/max/min of column A/B/C over the duplicated rows.
Is this possible? It's something like a groupby.aggregate on the df.index.duplicated() rows.
Check with describe:
df.groupby(level=0).describe()
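If you want the specific aggregations from the question (mean/max/min of columns A/B/C) rather than the full describe() summary, a sketch using agg, with the column names A, B, C assumed from the question:
# collapse rows that share the same timestamp, one aggregation per column
df.groupby(level=0).agg({'A': 'mean', 'B': 'max', 'C': 'min'})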
I have a large data frame with over 20,000 observations. It has a variable called "station", and I need to remove all rows where the station name consists only of numbers.
The only code that has worked so far is:
df['station'][~df['station'].str.isnumeric()]
However, this only returns that single column rather than the whole data frame.
You can use an extra column with .str.isnumeric() to be used later on as a filter:
df['filter'] = df['station'].str.isnumeric()
df_filtered = df[df['filter'] != False]  # .drop(columns=['filter'])
This should return all rows whose station value is not purely numeric. After that, you can remove the hash if you wish to drop the filter column and keep your original structure.
You can do it like so:
df_filtered = df[df['station'].str.isnumeric()==False]
You wouldn't have to do set operations on your dataframe if you use this.
The internal statements are ultimately a Boolean logic filter that is being applied on the dataframe.
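Equivalently, and a bit more idiomatically (assuming no missing values in station), you can negate the mask with ~ instead of comparing to False:
df_filtered = df[~df['station'].str.isnumeric()]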
I am trying to use a data frame to regroup different kinds of data.
I have a data frame with 3 columns:
one that I define as the index (used in a groupby command)
one holding a parameter, say 'valeur1', for which I want the mean over the rows that share the same index (used a mean command after the groupby)
the last one contains strings; there is only one string for each index, but some cells might contain NaN.
In the end I am trying to get a dataframe with, for each index, the mean of the parameter as well as the string that goes with that index (NaN in the string column is not important). The main issue is that DataFrame.mean() does not work with strings.
The code I used so far is pretty basic :
dataRaw = pd.read_csv('file.csv', sep=';', encoding='latin-1')
data = dataRaw.groupby(index)
databis = data.mean()
Any suggestion would be greatly appreciated.
Thanks !
I think you need to group by multiple columns:
databis = dataRaw.groupby(['index', 'String']).mean()
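One caveat: rows whose string cell is NaN are dropped from a groupby on that column. An alternative sketch that keeps them, aggregating 'valeur1' with mean and taking the first non-null string per index ('index' and 'String' are the column names assumed above):
databis = dataRaw.groupby('index').agg({'valeur1': 'mean', 'String': 'first'})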
I'm handling my data (a sample file is linked below). I wrote my code like this:
complete_data = complete_data.groupby(['STDR_YM_CD', 'TRDAR_CD']).sum().reset_index()
After executing the code, I got a dataframe summed per STDR_YM_CD and TRDAR_CD.
But I want to aggregate the values based on the first three characters of the SVC_INDUTY_CD column.
Here is my data link
http://blogattach.naver.com/c356df6c7f2127fbd539596759bfc1bd1848b453f1/20170316_215_blogfile/khm2963_1489653338468_dtPz6k_csv/test2.csv?type=attachment
Thanks in advance
I'm sure there's a better way, but here is one way you could do it:
complete_data['first_three_temp'] = complete_data['SVC_INDUTY_CD'].str[:3]
complete_data = complete_data.groupby(['STDR_YM_CD', 'TRDAR_CD', 'first_three_temp'], as_index=False).sum()
complete_data.drop('first_three_temp', axis=1, inplace=True)
This will add a temporary column containing only the first three characters of your SVC_INDUTY_CD column. You can then group by it and drop the temporary column afterwards. As I said, I'm sure there's a more efficient way, so I'm not sure whether you'll be limited by the size of your dataset.
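For what it's worth, you can also group by a derived Series directly, without materialising the temporary column (a sketch using the same column names):
prefix = complete_data['SVC_INDUTY_CD'].str[:3]
complete_data = (
    complete_data
    .groupby(['STDR_YM_CD', 'TRDAR_CD', prefix])
    .sum(numeric_only=True)
    .reset_index()
)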