Rename Columns of one DataFrame to Another (R or Python)

I want to rename the columns from one dataframe into the columns of another, while creating a completely new dataframe. I am not really sure how to approach this, which is why I am asking. I want to take part of the string from one column and reuse it in another. This can be in either R or Python; it doesn't matter too much. The rest of the string values can be fixed values.
Such as:
Hm106_120.region001 1813 PKSI_GCF1813 Streptomyces_sp_Hm106
MBT13_26.region001 1813 PKSI_GCF1813 Streptomyces_sp_MBT13
Please see the example in the picture posted for a better description: [image: Table Rename]
Thanks for the help :)

df2 = df1.copy()
# rename the GCF number column and prefix its values
df2.rename(columns={"GCF No": "GCF"}, inplace=True)
df2['GCF'] = 'PKSI_GCF' + df2['GCF'].astype(str)
# split e.g. 'Hm106_120.region001' into 'Hm106' and '120.region001'
df1[['BGC', 'BGC2']] = df1['BGC'].str.split('_', n=1, expand=True)
# the strain name (first part of the split) builds the Genome column
df2['Genome'] = 'Streptomyces_sp_' + df1['BGC'].astype(str)
df2.set_index('GCF', inplace=True)
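For reference, here is a minimal runnable sketch of the same idea with invented input (the column names 'BGC' and 'GCF No' are guesses based on the screenshot); it takes the strain name with .str[0] instead of overwriting df1:
import pandas as pd

df1 = pd.DataFrame({'BGC': ['Hm106_120.region001', 'MBT13_26.region001'],
                    'GCF No': [1813, 1813]})

df2 = df1.copy()
df2.rename(columns={"GCF No": "GCF"}, inplace=True)
df2['GCF'] = 'PKSI_GCF' + df2['GCF'].astype(str)
# take the strain name (text before the first '_') without touching df1
df2['Genome'] = 'Streptomyces_sp_' + df1['BGC'].str.split('_').str[0]
print(df2.set_index('GCF'))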

Related

Optimal way to create a column by matching two other columns

The first df I have is one that has station codes and names, along with lat/long (not as relevant), like so:
code name latitude longitude
I have another df with start/end dates for travel times. This df has only the station code, not the station name, like so:
start_date start_station_code end_date end_station_code duration_sec
I am looking to add columns that have the name of the start/end stations to the second df by matching the first df "code" and second df "start_station_code" / "end_station_code".
I am relatively new to pandas, and was looking for a way to optimize doing this as my current method takes quite a while. I use the following code:
for j in range(0, len(df_stations)):
    for i in range(0, len(df)):
        if df_stations['code'][j] == df['start_station_code'][i]:
            df['start_station'][i] = df_stations['name'][j]
        if df_stations['code'][j] == df['end_station_code'][i]:
            df['end_station'][i] = df_stations['name'][j]
I am looking for a faster method, any help is appreciated. Thank you in advance.
Use merge. If you are familiar with SQL, merge is the equivalent of a JOIN (pass how="left" for a LEFT JOIN; the default is an inner join):
cols = ["code", "name"]
result = (
    second_df
    .merge(first_df[cols], left_on="start_station_code", right_on="code")
    .merge(first_df[cols], left_on="end_station_code", right_on="code")
    .rename(columns={"code_x": "start_station_code", "code_y": "end_station_code"})
)
The answer by @Code-Different is very nearly correct. However, the columns to be renamed are the name columns, not the code columns. For neatness you will likely want to drop the additional code columns that get created by the merges. Using your names for the dataframes, df and df_stations, the code needed to produce the required dataframe is:
cols = ["code", "name"]
required_df = (
    df
    .merge(df_stations[cols], left_on="start_station_code", right_on="code")
    .merge(df_stations[cols], left_on="end_station_code", right_on="code")
    .rename(columns={"name_x": "start_station", "name_y": "end_station"})
    .drop(columns=["code_x", "code_y"])
)
As you may notice, the second merge gives the dataframe duplicate 'code' and 'name' columns, which get suffixed automatically with '_x' and '_y'; this is a built-in default of the merge command. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html for more detail.
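If you prefer meaningful names over the automatic '_x'/'_y', merge also takes a suffixes parameter. A small sketch, reusing cols, df and df_stations from above:
result = (
    df
    .merge(df_stations[cols], left_on="start_station_code", right_on="code")
    .merge(df_stations[cols], left_on="end_station_code", right_on="code",
           suffixes=("_start", "_end"))
    .rename(columns={"name_start": "start_station", "name_end": "end_station"})
    .drop(columns=["code_start", "code_end"])
)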

Changing column values for a value in an adjacent column in the same dataframe using Python

I am quite new to Python programming.
I am working with the following dataframe:
[image: Before]
Note that in column "FBgn", there is a mix of FBgn and FBtr string values. I would like to replace the FBtr-containing values with FBgn values provided in the adjacent column called "## FlyBase_FBgn". However, I want to keep the FBgn values in column "FBgn". Maybe keep in mind that I am showing only a portion of the dataframe (reality: 1432 rows). How would I do that? I tried the replace() method from Pandas, but it did not work.
This is actually what I would like to have:
[image: After]
Thanks a lot!
With Pandas, you could try:
df.loc[df["FBgn"].str.contains("FBtr"), "FBgn"] = df["## FlyBase_FBgn"]
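A minimal runnable sketch of that one-liner, with invented IDs (only the two column names come from the question):
import pandas as pd

df = pd.DataFrame({
    "FBgn": ["FBtr0089374", "FBgn0031081", "FBtr0000025"],
    "## FlyBase_FBgn": ["FBgn0062978", "FBgn0031081", "FBgn0051155"],
})

# rows whose FBgn value is actually an FBtr id get the id from the
# adjacent column; rows already holding FBgn ids are left untouched
df.loc[df["FBgn"].str.contains("FBtr"), "FBgn"] = df["## FlyBase_FBgn"]
print(df)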
Welcome to Stack Overflow. Next time, please provide more info, including your code; it is always helpful.
Please see the code below, I think you need something similar:
import pandas as pd

# ignore dict1, I just wanted to recreate your df
dict1 = {"FBgn": ['FBtr389394949', 'FBgn3093840', 'FBtr000025'],
         "FBtr": ['FBgn546466646', '', 'FBgn15565555']}
df = pd.DataFrame(dict1)  # recreating your dataframe
print(df)

# function to replace the values
def replace_values(df):
    for i in range(len(df)):
        if 'tr' in df['FBgn'][i]:
            # .loc avoids assigning to a copy (SettingWithCopyWarning)
            df.loc[i, 'FBgn'] = df.loc[i, 'FBtr']
    return df

df = replace_values(df)
print(df)

Dropping multiple columns from dataframe

I have the following code snippet
(dataset: https://www.internationalgenome.org/data-portal/sample)
genome_data = pd.read_csv('../genome')
genome_data_columns = genome_data.columns
genPredict = genome_data[genome_data_columns[genome_data_columns != 'Geuvadis']]
This drops the column Geuvadis; is there a way I can drop more than one column?
You could use DataFrame.drop like genome_data.drop(['Geuvadis', 'C2', ...], axis=1).
Is it ok for you to not read them in the first place?
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
The ‘usecols’ option in read_csv lets you specify the columns of data you want to include in the DataFrame.
@Venkatesh-PrasadRanganath's answer is the correct way to drop multiple columns.
But if you want to avoid reading data into memory which you're not going to use, genome_data = pd.read_csv('../genome', usecols=["only", "required", "columns"]) is the syntax to use.
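usecols also accepts a callable, which is handy when it is easier to name the columns you want to exclude. A small sketch with placeholder column names:
import pandas as pd

excluded = {"Geuvadis", "another_column"}  # hypothetical names
genome_data = pd.read_csv("../genome", usecols=lambda c: c not in excluded)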
I think @Venkatesh-PrasadRanganath's answer is better, but taking a similar approach to your attempt, this is how I would do it:
Identify all of the columns with columns.to_list().
Create a list of columns to be excluded.
Subtract the columns to be excluded from the full list with list(set() - set()).
Select the remaining columns.
genome_data = pd.read_csv('../genome')
all_genome_data_columns = genome_data.columns.to_list()
excluded_genome_data_columns = ['a', 'b', 'c'] #Type in the columns that you want to exclude here.
genome_data_columns = list(set(all_genome_data_columns) - set(excluded_genome_data_columns))
genPredict = genome_data[genome_data_columns]
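One caveat: going through set() does not preserve the original column order. If order matters, a list comprehension does the same subtraction while keeping it (reusing the variables above):
genome_data_columns = [c for c in all_genome_data_columns
                       if c not in excluded_genome_data_columns]
genPredict = genome_data[genome_data_columns]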

Compare Dataframes of different size and create a new one if condition is met

I need help to figure out the following problem:
I have two(2) dataframes of different sizes. I need to compare the values and, if the condition is met, replace the values in Dataframe 1.
If the value for a Material and Char in Dataframe 1 is "Y", I need to get the "Required or Optional" value from Dataframe 2. If it's Required, I replace the "Y" with "Y_REQD"; otherwise, if it's Optional, I replace the "Y" with "Y_OPT".
I've been using for loops, but the code is getting too complicated, which hints that this may not be the best way.
Thanks in advance.
This is more like a pivot problem: pivot df2, reindex it like df1, then concatenate the strings:
df1 = df1.replace({'Y': 'Y_'}) + df2.pivot(*df2.columns).reindex_like(df1).fillna('')
Mostly agree with @WeNYoBen's answer, but to make it completely right, dataframe2 needs to be revised using df.replace.
Short version:
df1 = df1.replace({'Y': 'Y_'}) + df2.replace({'Rqd': 'REQD', 'Opt': 'OPT'}).pivot(*df2.columns).reindex_like(df1).fillna('')
Long version:
# break the short version into steps
# 1. replace the Rqd/Opt labels
df2 = df2.replace({'Rqd': 'REQD', 'Opt': 'OPT'})
# 2. pivot (keyword arguments; positional pivot args were removed in pandas 2.0)
idx, cols, vals = df2.columns
df2 = df2.pivot(index=idx, columns=cols, values=vals)
# 3. align df2 to df1's index and columns
df2 = df2.reindex_like(df1)
# 4. fill NaN with '' so the string concatenation below works
df2 = df2.fillna('')
# 5. turn 'Y' into 'Y_' in df1 and concatenate with df2
df1 = df1.replace({'Y': 'Y_'}) + df2
Hope it helps.
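For concreteness, a minimal end-to-end sketch with invented data shaped like the question describes (all names here are guesses):
import pandas as pd

# df1: Material-by-Char matrix of 'Y' flags
df1 = pd.DataFrame({'Char1': ['Y', ''], 'Char2': ['', 'Y']},
                   index=['Mat1', 'Mat2'])
# df2: long table marking each Material/Char pair Required or Optional
df2 = pd.DataFrame({'Material': ['Mat1', 'Mat2'],
                    'Char': ['Char1', 'Char2'],
                    'Reqd_Opt': ['Rqd', 'Opt']})

wide = (df2.replace({'Rqd': 'REQD', 'Opt': 'OPT'})
           .pivot(index='Material', columns='Char', values='Reqd_Opt')
           .reindex_like(df1)
           .fillna(''))
print(df1.replace({'Y': 'Y_'}) + wide)  # Y_REQD / Y_OPT where flagged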

Trouble matching and setting values from multiple dataframes

I am trying to match the stop_id in stop_times.csv to the stop_id in stops.csv in order to copy over the stop_lat and stop_lon to their respective columns in stop_times.csv.
Gist files:
stops.csv LINK
stop_times.csv LINK
Here's my code:
import pandas as pd
st = pd.read_csv('csv/stop_times.csv', sep=',')
st.set_index(['trip_id','stop_sequence'])
stops = pd.read_csv('csv/stops.csv')
for i in range(len(st)):
    for x in range(len(stops)):
        if st['stop_id'][i] == stops['stop_id'][x]:
            st['stop_lat'][i] = stops['stop_lat'][x]
            st['stop_lon'][i] = stops['stop_lon'][x]
st.to_csv('csv/stop_times.csv', index=False)
I'm aware that the script is assigning to a copy, but I'm not sure what other method to use, as I'm fairly new to pandas.
You can merge the two DataFrames:
pd.merge(stops, st, on='stop_id')
Since there are stop_lat and stop_lon columns in each, it will give you stop_lat_x/stop_lon_x (the good ones, from stops) and stop_lat_y/stop_lon_y (the always-zero ones). You can then drop or ignore the bad columns and output the resulting DataFrame however you want.
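A sketch of the full round trip under that approach (a slight variation: keeping st on the left and passing suffixes so the zero-filled columns are easy to drop; file paths as in the question):
import pandas as pd

st = pd.read_csv('csv/stop_times.csv')
stops = pd.read_csv('csv/stops.csv')

# pull lat/lon in from stops; st's zero-filled columns get the '_old' suffix
merged = st.merge(stops[['stop_id', 'stop_lat', 'stop_lon']],
                  on='stop_id', suffixes=('_old', ''))
merged = merged.drop(columns=['stop_lat_old', 'stop_lon_old'])
merged.to_csv('csv/stop_times.csv', index=False)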
