I am trying to match the stop_id in stop_times.csv to the stop_id in stops.csv in order to copy over the stop_lat and stop_lon to their respective columns in stop_times.csv.
Gist files:
stops.csv LINK
stop_times.csv LINK
Here's my code:
import pandas as pd
st = pd.read_csv('csv/stop_times.csv', sep=',')
st.set_index(['trip_id','stop_sequence'])
stops = pd.read_csv('csv/stops.csv')
for i in range(len(st)):
for x in range(len(stops)):
if st['stop_id'][i] == stops['stop_id'][x]:
st['stop_lat'][i] = stops['stop_lat'][x]
st['stop_lon'][i] = stops['stop_lon'][x]
st.to_csv('csv/stop_times.csv', index=False)
I'm aware that the script is applying a copy, but I'm not sure what other method to go about doing this, as I'm fairly new to pandas.
You can merge the two DataFrames:
pd.merge(stops, st, on='stop_id')
Since there are stop_lat columns in each, it will give you stop_lat_x (the good one) and stop_lat_y (the always-zero one). You can then remove or ignore the bad column and output the resulting DataFrame however you want.
Related
The first df I have is one that has station codes and names, along with lat/long (not as relevant), like so:
code name latitude longitude
I have another df with start/end dates for travel times. This df has only the station code, not the station name, like so:
start_date start_station_code end_date end_station_code duration_sec
I am looking to add columns that have the name of the start/end stations to the second df by matching the first df "code" and second df "start_station_code" / "end_station_code".
I am relatively new to pandas, and was looking for a way to optimize doing this as my current method takes quite a while. I use the following code:
for j in range(0, len(df_stations)):
for i in range(0, len(df)):
if(df_stations['code'][j] == df['start_station_code'][i]):
df['start_station'][i] = df_stations['name'][j]
if(df_stations['code'][j] == df['end_station_code'][i]):
df['end_station'][i] = df_stations['name'][j]
I am looking for a faster method, any help is appreciated. Thank you in advance.
Use merge. If you are familiar with SQL, merge is equivalent to LEFT JOIN:
cols = ["code", "name"]
result = (
second_df
.merge(first_df[cols], left_on="start_station_code", right_on="code")
.merge(first_df[cols], left_on="end_station_code", right_on="code")
.rename(columns={"code_x": "start_station_code", "code_y": "end_station_code"})
)
The answer by #Code-Different is very nearly correct. However the columns to be renamed are the name columns not the code columns. For neatness you will likely want to drop the additional code columns that get created by the merges. Using your names for the dataframes df and df_station the code needed to produce df_required is:
cols = ["code", "name"]
required_df = (
df
.merge(df_stations[cols], left_on="start_station_code", right_on="code")
.merge(df_stations[cols], left_on="end_station_code", right_on="code")
.rename(columns={"name_x": "start_station", "name_y": "end_station"})
.drop(columns = ['code_x', 'code_y'])
)
As you may notice the merge means that the dataframe acquires duplicate 'code' columns which get suffixed automatically, this is a built in default of the merge command. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html for more detail.
I am trying to replace some missing and incorrect values in my master dataset by filling it in with correct values from two different datasets.
I created a miniature version of the full dataset like so (note the real dataset is several thousand rows long):
import pandas as pd
data = {'From':['GA0251','GA5201','GA5551','GA510A','GA5171','GA5151'],
'To':['GA0201_T','GA5151_T','GA5151_R','GA5151_V','GA5151_P','GA5171_B'],
'From_Latitude':[55.86630869,0,55.85508787,55.85594626,55.85692217,55.85669934],
'From_Longitude':[-4.27138731,0,-4.24126866,-4.24446585,-4.24516129,-4.24358251,],
'To_Latitude':[55.86614756,0,55.85522197,55.85593762,55.85693878,0],
'To_Longitude':[-4.271040979,0,-4.241466534,-4.244607602,-4.244905037,0]}
dataset_to_correct = pd.DataFrame(data)
However, some values in the From lat/long and the To lat/long are incorrect. I have two tables like the one below for each of From and To, which I would like to substitute into the table in place of the two values for that row.
Table of Corrected From lat/long:
data = {'Site':['GA5151_T','GA5171_B'],
'Correct_Latitude':[55.85952791,55.87044558],
'Correct_Longitude':[55.85661767,-4.24358251,]}
correct_to_coords = pd.DataFrame(data)
I would like to match this table to the From column and then replace the From_Latitude and From_Longitude with the correct values.
Table of Corrected To lat/long:
data = {'Site':['GA5201','GA0251'],
'Correct_Latitude':[55.857577,55.86616756],
'Correct_Longitude':[-4.242770,-4.272140979]}
correct_from_coords = pd.DataFrame(data)
I would like to match this table to the To column and then replace the To_Latitude and To_Longitude with the correct values.
Is there a way to match the site in each table to the corresponding From or To column and then replace only the values in the respective columns?
I have tried using code from this answer (Elegant way to replace values in pandas.DataFrame from another DataFrame) but it seems to have no effect on the database.
(correct_to_coords.set_index('Site').rename(columns = {'Correct_Latitude':'To_Latitude'}) .combine_first(dataset_to_correct.set_index('To')))
#zswqa 's answer produces right result, #Anurag Dabas 's doesn't.
Another possible solution, It is a bit faster than merge method suggested above, although both are correct.
dataset_to_correct.set_index("To",inplace=True)
correct_to_coords.set_index("Site",inplace=True)
dataset_to_correct.loc[correct_to_coords.index, "To_Latitude"] = correct_to_coords["Correct_Latitude"]
dataset_to_correct.loc[correct_to_coords.index, "To_Longitude"] = correct_to_coords["Correct_Longitude"]
dataset_to_correct.reset_index(inplace=True)
dataset_to_correct.set_index("From",inplace=True)
correct_from_coords.set_index("Site",inplace=True)
dataset_to_correct.loc[correct_from_coords.index, "From_Latitude"] = correct_from_coords["Correct_Latitude"]
dataset_to_correct.loc[correct_from_coords.index, "From_Longitude"] = correct_from_coords["Correct_Longitude"]
dataset_to_correct.reset_index(inplace=True)
merge = dataset_to_correct.merge(correct_to_coords, left_on='To', right_on='Site', how='left')
merge.loc[(merge.To == merge.Site), 'To_Latitude'] = merge.Correct_Latitude
merge.loc[(merge.To == merge.Site), 'To_Longitude'] = merge.Correct_Longitude
# del merge['Site']
# del merge['Correct_Latitude']
# del merge['Correct_Longitude']
merge = merge.drop(columns = ['Site','Correct_Latitude','Correct_Longitude'])
merge = merge.merge(correct_from_coords, left_on='From', right_on='Site', how='left')
merge.loc[(merge.From == merge.Site), 'From_Latitude'] = merge.Correct_Latitude
merge.loc[(merge.From == merge.Site), 'From_Longitude'] = merge.Correct_Longitude
# del merge['Site']
# del merge['Correct_Latitude']
# del merge['Correct_Longitude']
merge = merge.drop(columns = ['Site','Correct_Latitude','Correct_Longitude'])
merge
lets try dual merge by merge()+pop()+fillna()+drop():
dataset_to_correct=dataset_to_correct.merge(correct_to_coords,left_on='To',right_on='Site',how='left').drop('Site',1)
dataset_to_correct['From_Latitude']=dataset_to_correct.pop('Correct_Latitude').fillna(dataset_to_correct['From_Latitude'])
dataset_to_correct['From_Longitude']=dataset_to_correct.pop('Correct_Longitude').fillna(dataset_to_correct['From_Longitude'])
dataset_to_correct=dataset_to_correct.merge(correct_from_coords,left_on='From',right_on='Site',how='left').drop('Site',1)
dataset_to_correct['To_Latitude']=dataset_to_correct.pop('Correct_Latitude').fillna(dataset_to_correct['To_Latitude'])
dataset_to_correct['To_Longitude']=dataset_to_correct.pop('Correct_Longitude').fillna(dataset_to_correct['To_Longitude'])
I am quite new to Python programming.
I am working with the following dataframe:
Before
Note that in column "FBgn", there is a mix of FBgn and FBtr string values. I would like to replace the FBtr-containing values with FBgn values provided in the adjacent column called "## FlyBase_FBgn". However, I want to keep the FBgn values in column "FBgn". Maybe keep in mind that I am showing only a portion of the dataframe (reality: 1432 rows). How would I do that? I tried the replace() method from Pandas, but it did not work.
This is actually what I would like to have:
After
Thanks a lot!
With Pandas, you could try:
df.loc[df["FBgn"].str.contains("FBtr"), "FBgn"] = df["## FlyBase_FBgn"]
Welcome to stackoverflow. Please next time provide more info including your code. It is always helpful
Please see the code below, I think you need something similar
import pandas as pd
#ignore the dict1, I just wanted to recreate your df
dict1= {"FBgn": ['FBtr389394949', 'FBgn3093840', 'FBtr000025'], "FBtr": ['FBgn546466646', '', 'FBgn15565555']}
df = pd.DataFrame(dict1) #recreating your dataframe
#print df
print(df)
#function to replace the values
def replace_values(df):
for i in range(0, (df.size//2)):
if 'tr' in df['FBgn'][i]:
df['FBgn'][i] = df['FBtr'][i]
return df
df = replace_values(df)
#print new df
print(df)
I have two dataframes of unequal size, one contains cuisine style along with its frequency in the dataset and another is the original dataset which has restaurant name and cuisine corresponding to it. I want to add a new column on the original dataset where the frequency value of each cuisine is displayed from the dataframe containing the frequency data. What is the best way to perform that? I have tried by using merge but that creates NaN values. Please suggest
I tried below code snippet suggested but it did not give me the required result. it generates freq for first row and excludes the other rows for the same 'name' column.
df = df.assign(freq=0)
# get all the cuisine styles in the cuisine df
for cuisine in np.unique(cuisine_df['cuisine_style']):
# get the freq
freq = cuisine_df.loc[cuisine_df['cuisine_style'] == cuisine,
'freq'].values
# update value in main df
df.loc[df['cuisine_style'] == cuisine, 'freq'] = freq
Result dataframe
I re ran the code on your data set and still got the same results. Here is the code I ran.
import pandas as pd
import numpy as np
# used to set 'Cuisine Style' to first 'style' in array of values
def getCusinie(row):
arr = row['Cuisine Style'].split("'")
return arr[1]
# read in data set. Used first col for index and drop nan for ease of use
csv = pd.read_csv('TA_restaurants_curated.csv', index_col=0).dropna()
# get cuisine values
cuisines = csv.apply(lambda row: getCusinie(row), axis=1)
# update dataframe
csv['Cuisine Style'] = cuisines
# json obj to quickly make a new data frame with meaningless frequencies
c = {'Cuisine Style' : np.unique(csv['Cuisine Style']), 'freq': range(113)}
cuisine_df = pd.DataFrame(c)
# add 'freq' column to original Data Frame
csv = csv.assign(freq=0)
# same loop as before
for cuisine in np.unique(cuisine_df['Cuisine Style']):
# get the freq
freq = cuisine_df.loc[cuisine_df['Cuisine Style'] == cuisine,
'freq'].values
# update value in main df
csv.loc[csv['Cuisine Style'] == cuisine, 'freq'] = freq
Output:
As you can see, every column, even duplicates, have been updated. If they still are not being updated I'd check to make sure that the names are actually equal i.e. make sure there isn't any hidden spaces or anything causing issues.
You can read up on selecting and indexing DataFrames here.
Its quite long but you can pick apart what you need, when you need it
I'm importing multiple dataframes and wrote the following process: 1. list of files to be coverted to dataframes + 2. list of names I want for the corresponding dataframes. 3. I combined the list into a dictionary:
tbls = ['tbl1', 'tbl2', 'tbl3']
dbname = ['dfABC', 'dfrand', 'dfXYZ']
dictdf = dict(zip(tbls, dbname))
Then I cycle through tbls to import the dataframes. (getdf below is a short function I wrote that reads the path, sheetname etc. for the excel/csv file in which the table(data) sits and imports the data.
for tbl in tbls:
dictdf[tbl] = getdf(tbl, dfRT, sfsession)
The process works except that the dataframes are written into the dictionary, i.e dfABC in the dictionary is replaced with a dataframe of 65K rows and 27 cols and so on.
What I want is dfABC = dataframe of 65krows and 27 cols. i.e in the above code. I tried:
str(dictdf[tbl]) = getdf(tbl, dfRT, sfsession)
but that gave an error. Is there a way to do this? thanks.
solved using exec and flipping the dictionary (the flip isn't needed to solve):
tbls = ['tbl1', 'tbl2', 'tbl3']
dfa = ['dfABC', 'dfrand', 'dfXYZ']
dictdf = dict(zip(dbname, tbls))
for df in dfs:
tbl = dictdf[df]
exec(f'{df} = getdf(\'{tbl}\', dfRT, sfsession)')
please note #Xukrao and #Yo_Chris's comments on keeping the dfs within the dictionary as a superior solution.
I found this question useful to understand how exec worked: What's the difference between eval, exec, and compile?