Dropping multiple columns from dataframe - python

I have the following code snippet (dataset: https://www.internationalgenome.org/data-portal/sample):
genome_data = pd.read_csv('../genome')
genome_data_columns = genome_data.columns
genPredict = genome_data[genome_data_columns[genome_data_columns != 'Geuvadis']]
This drops the column Geuvadis. Is there a way I can drop more than one column at once?

You could use DataFrame.drop like genome_data.drop(['Geuvadis', 'C2', ...], axis=1).
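For example, a minimal sketch using the file path from the question ('C2' and 'C3' are placeholder column names):
import pandas as pd
genome_data = pd.read_csv('../genome')
# drop several columns in one call; drop(columns=[...]) is equivalent to axis=1
genPredict = genome_data.drop(columns=['Geuvadis', 'C2', 'C3'])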

Is it OK for you not to read those columns in the first place?
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
The 'usecols' option in read_csv lets you specify the columns of data you want to include in the DataFrame.
Venkatesh-PrasadRanganath's answer is the correct way to drop multiple columns.
But if you want to avoid reading data into memory that you're not going to use, genome_data = pd.read_csv('../genome', usecols=["only", "required", "columns"]) is the syntax to use.
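usecols also accepts a callable, which is handy when it is easier to name the columns you want to leave out (a sketch; 'Geuvadis' and 'C2' are just example names):
import pandas as pd
excluded = {'Geuvadis', 'C2'}
# the callable is evaluated against each column name; return True to keep the column
genome_data = pd.read_csv('../genome', usecols=lambda col: col not in excluded)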

I think @Venkatesh-PrasadRanganath's answer is better, but taking a similar approach to your attempt, this is how I would do it:
1. Identify all of the columns with columns.to_list().
2. Create a list of columns to be excluded.
3. Subtract the columns to be excluded from the full list with list(set() - set()).
4. Select the remaining columns.
genome_data = pd.read_csv('../genome')
all_genome_data_columns = genome_data.columns.to_list()
excluded_genome_data_columns = ['a', 'b', 'c'] #Type in the columns that you want to exclude here.
genome_data_columns = list(set(all_genome_data_columns) - set(excluded_genome_data_columns))
genPredict = genome_data[genome_data_columns]
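Note that the set difference does not preserve the original column order. If order matters, a list comprehension does the same thing while keeping it:
genome_data_columns = [c for c in all_genome_data_columns if c not in excluded_genome_data_columns]
genPredict = genome_data[genome_data_columns]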

Related

Optimal way to create a column by matching two other columns

The first df I have is one that has station codes and names, along with lat/long (not as relevant), like so:
code name latitude longitude
I have another df with start/end dates for travel times. This df has only the station code, not the station name, like so:
start_date start_station_code end_date end_station_code duration_sec
I am looking to add columns that have the name of the start/end stations to the second df by matching the first df "code" and second df "start_station_code" / "end_station_code".
I am relatively new to pandas, and was looking for a way to optimize doing this as my current method takes quite a while. I use the following code:
for j in range(0, len(df_stations)):
    for i in range(0, len(df)):
        if df_stations['code'][j] == df['start_station_code'][i]:
            df['start_station'][i] = df_stations['name'][j]
        if df_stations['code'][j] == df['end_station_code'][i]:
            df['end_station'][i] = df_stations['name'][j]
I am looking for a faster method, any help is appreciated. Thank you in advance.
Use merge. If you are familiar with SQL, merge is the equivalent of a JOIN (pass how='left' if you specifically need a LEFT JOIN; the default is an inner join):
cols = ["code", "name"]
result = (
second_df
.merge(first_df[cols], left_on="start_station_code", right_on="code")
.merge(first_df[cols], left_on="end_station_code", right_on="code")
.rename(columns={"code_x": "start_station_code", "code_y": "end_station_code"})
)
The answer by @Code-Different is very nearly correct. However, the columns to be renamed are the name columns, not the code columns. For neatness you will likely want to drop the additional code columns that get created by the merges. Using your names for the dataframes, df and df_stations, the code needed to produce required_df is:
cols = ["code", "name"]
required_df = (
    df
    .merge(df_stations[cols], left_on="start_station_code", right_on="code")
    .merge(df_stations[cols], left_on="end_station_code", right_on="code")
    .rename(columns={"name_x": "start_station", "name_y": "end_station"})
    .drop(columns=['code_x', 'code_y'])
)
As you may notice, the merges give the dataframe duplicate 'code' columns, which get suffixed automatically; this is the built-in default of the merge command. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html for more detail.
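For illustration, here is a runnable toy version of the above with made-up station codes (only the columns needed to show the joins):
import pandas as pd
df_stations = pd.DataFrame({"code": [10, 20], "name": ["Alpha", "Beta"]})
df = pd.DataFrame({"start_station_code": [10, 20],
                   "end_station_code": [20, 10],
                   "duration_sec": [300, 450]})
cols = ["code", "name"]
required_df = (
    df
    .merge(df_stations[cols], left_on="start_station_code", right_on="code")
    .merge(df_stations[cols], left_on="end_station_code", right_on="code")
    .rename(columns={"name_x": "start_station", "name_y": "end_station"})
    .drop(columns=['code_x', 'code_y'])
)
print(required_df)  # duration_sec plus start_station / end_station name columns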

Creating new dataframe by appending rows from an old dataframe

I'm trying to create a dataframe by selecting rows that meet only specific conditions from a different dataframe.
Technicians can only select one of several values for Column 1 using a dropdown menu, so I want to match one specific value. Column 2, however, is a free-text entry, so I'm looking for two specific keywords in any spelling/case.
I want all columns from the rows in the new dataframe.
Any help or insight would be much appreciated.
import pandas as pd
df = pd.read_excel(r'File.xlsx', sheet_name='Sheet1')
filter = ['x', 'y']
columns=df.columns
data = pd.DataFrame(columns=columns)
for row in df.iterrows():
    if 'Column 1' == 'a':
        row.data.append()
    elif df['Column 2'].str.contains('filter', case = 'false'):
        row.data.append()
print(data.head())
In general, it's best to have a vectorized solution to things, so I'll put my solution as follows (there are many ways to do this, this is one of the ways that came to my head). Here, you can use a simple boolean mask to filter out some specific rows that you don't want, since you've already clearly defined your criteria (df['Column 1'] == 'a' or df['Column 2'].str.contains('filter', case = 'false')).
As such, you can simply create a boolean mask that includes this criteria. By itself, df['Column 1'] == 'a' will create a boolean Series with the structure [True, False, True, True, ...], where each entry says whether the condition holds for that row of the original dataframe. Once you have that, you can simply index back into the original dataframe with df[df['Column 1'] == 'a'] to return your filtered rows.
Of course, since you have two conditions here (which seem to follow an "or" clause), you can simply combine both of them in the boolean mask, such as df[(df['Column 1'] == 'a') | (df['Column 2'].str.contains('filter', case=False))].
I'm not at my development computer, so this might not work as expected due to a couple minor issues, but this is the general idea. This line should replace your entire df.iterrows block. Hope this helps :)
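A runnable sketch of that idea with a tiny made-up dataframe (note the parentheses around each condition, the | for "or", and case=False rather than case='false'; the keyword 'x' stands in for whatever terms you are searching for):
import pandas as pd
df = pd.DataFrame({'Column 1': ['a', 'b', 'a'],
                   'Column 2': ['has X keyword', 'nothing here', 'also nothing']})
mask = (df['Column 1'] == 'a') | (df['Column 2'].str.contains('x', case=False))
data = df[mask]  # replaces the whole df.iterrows() block
print(data.head())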

Python.pandas: how to select rows where objects start with letters 'PL'

I have specific problem with pandas: I need to select rows in dataframe which start with specific letters.
Details: I've imported my data to dataframe and selected columns that I need. I've also narrowed it down to row index I need. Now I also need to select rows in other column where objects START with letters 'pl'.
Is there any solution to select row only based on first two characters in it?
I was thinking about
pl = df['Code'] == pl*
but it won't work due to row indexing. Advise appreciated!
Use startswith for this:
df = df[df['Code'].str.startswith('pl')]
Fully reproducible example for those who want to try it.
import pandas as pd
df = pd.DataFrame([["plusieurs", 1], ["toi", 2], ["plutot", 3]])
df.columns = ["Code", "number"]
df = df[df.Code.str.startswith("pl")] # alternative is df = df[df["Code"].str.startswith("pl")]
If you use a string method on the Series that should return you a true/false result. You can then use that as a filter combined with .loc to create your data subset.
new_df = df.loc[df['Code'].str.startswith('pl')].copy()
The condition is just a filter that you then apply to the dataframe. As the filter you can use the method Series.str.startswith and do
df_pl = df[df['Code'].str.startswith('pl')]
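If the codes can appear in mixed case (the title mentions 'PL') or the column contains missing values, one cautious variant is to normalise the case and tell startswith to treat NaN as False:
df_pl = df[df['Code'].str.lower().str.startswith('pl', na=False)]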

Rename Columns of one DataFrame to Another (R or Python)

I want to rename the columns from one dataframe into the columns of another, while creating a completely new dataframe. I am not really sure how to approach this, which is why I'm asking. I want to take part of the string from one column and reuse it in another; the rest of the string can be fixed values. This can be either in R or Python, it doesn't matter too much.
Such as:
Hm106_120.region001 1813 PKSI_GCF1813 Streptomyces_sp_Hm106
MBT13_26.region001 1813 PKSI_GCF1813 Streptomyces_sp_MBT13
Please see the example in the posted picture ("Table Rename") for a better description.
Thanks for the help :)
df2 = df1.copy()
df2.rename(columns={"GCF No": "GCF"}, inplace=True)
# prefix the GCF number, e.g. 1813 -> PKSI_GCF1813
df2['GCF'] = 'PKSI_GCF' + df2['GCF'].astype(str)
# split the BGC string on '_', e.g. Hm106_120.region001 -> Hm106 / 120.region001
df1[['BGC', 'BGC2']] = df1['BGC'].str.split('_', expand=True)
# build the genome name from the first part of the BGC string
df2['Genome'] = 'Streptomyces_sp_' + df1['BGC'].astype(str)
df2.set_index('GCF', inplace=True)
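As a rough illustration only (the input column names 'BGC' and 'GCF No' are assumed from the snippet above, since the actual table is only shown in the picture):
import pandas as pd
df1 = pd.DataFrame({"BGC": ["Hm106_120.region001", "MBT13_26.region001"],
                    "GCF No": [1813, 1813]})
# running the snippet above on this df1 gives a df2 indexed by GCF, e.g.
# PKSI_GCF1813  Hm106_120.region001  Streptomyces_sp_Hm106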

How to select rows based on categories in a Pandas dataframe

This is really trivial, but I can't believe I have wandered around for an hour and still can't find the answer, so here you are:
df = pd.DataFrame({"cats":["a","b"], "vals":[1,2]})
df.cats = df.cats.astype("category")
df
My problem is how to select the rows whose "cats" column has the category "a". I know that df.loc[df.cats == "a"] will work, but it's based on element-wise equality. Is there a way to select based on the levels of the category?
This works:
df.cats[df.cats=='a']
UPDATE
The question was updated. New solution:
df[df.cats.cat.categories == ['a']]
For those who are trying to filter rows based on a numerical categorical column:
df[df['col'] == pd.Interval(46, 53, closed='right')]
This would keep the rows where the col column has category (46, 53].
This kind of categorical column is common when you discretize numerical columns using the pd.qcut() method.
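A small sketch of that situation (the bin edges come from the data, so they will only match (46, 53] if your data happens to produce those edges):
import pandas as pd
ages = pd.DataFrame({'col': [34, 52, 46, 61, 48]})
ages['bin'] = pd.qcut(ages['col'], q=2)    # each value becomes an Interval category
print(ages['bin'].cat.categories)          # inspect the available intervals
first_bin = ages['bin'].cat.categories[0]
print(ages[ages['bin'] == first_bin])      # rows falling in the first interval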
You can inspect the list of categories using df.cats.cat.categories, which prints:
Index(['a', 'b'], dtype='object')
For this case, to select rows with the category 'a', which is df.cats.cat.categories[0], you just use:
df[df.cats == df.cats.cat.categories[0]]
Using the isin function to create a boolean index is an approach that will extend to multiple categories, similar to R's %in% operator.
# will return desired subset
df[df.cats.isin(['a'])]
# can be extended to multiple categories
df[df.cats.isin(['a', 'b'])]
df[df.cats == df.cats.cat.categories[0]]  # equivalent: filter on the first category level
