I am trying to replace some missing and incorrect values in my master dataset by filling it in with correct values from two different datasets.
I created a miniature version of the full dataset like so (note the real dataset is several thousand rows long):
import pandas as pd
data = {'From':['GA0251','GA5201','GA5551','GA510A','GA5171','GA5151'],
'To':['GA0201_T','GA5151_T','GA5151_R','GA5151_V','GA5151_P','GA5171_B'],
'From_Latitude':[55.86630869,0,55.85508787,55.85594626,55.85692217,55.85669934],
'From_Longitude':[-4.27138731,0,-4.24126866,-4.24446585,-4.24516129,-4.24358251,],
'To_Latitude':[55.86614756,0,55.85522197,55.85593762,55.85693878,0],
'To_Longitude':[-4.271040979,0,-4.241466534,-4.244607602,-4.244905037,0]}
dataset_to_correct = pd.DataFrame(data)
However, some values in the From lat/long and the To lat/long are incorrect. I have two tables like the one below for each of From and To, which I would like to substitute into the table in place of the two values for that row.
Table of Corrected From lat/long:
data = {'Site':['GA5151_T','GA5171_B'],
'Correct_Latitude':[55.85952791,55.87044558],
'Correct_Longitude':[55.85661767,-4.24358251,]}
correct_to_coords = pd.DataFrame(data)
I would like to match this table to the From column and then replace the From_Latitude and From_Longitude with the correct values.
Table of Corrected To lat/long:
data = {'Site':['GA5201','GA0251'],
'Correct_Latitude':[55.857577,55.86616756],
'Correct_Longitude':[-4.242770,-4.272140979]}
correct_from_coords = pd.DataFrame(data)
I would like to match this table to the To column and then replace the To_Latitude and To_Longitude with the correct values.
Is there a way to match the site in each table to the corresponding From or To column and then replace only the values in the respective columns?
I have tried using code from this answer (Elegant way to replace values in pandas.DataFrame from another DataFrame) but it seems to have no effect on the database.
(correct_to_coords.set_index('Site').rename(columns = {'Correct_Latitude':'To_Latitude'}) .combine_first(dataset_to_correct.set_index('To')))
#zswqa 's answer produces right result, #Anurag Dabas 's doesn't.
Another possible solution, It is a bit faster than merge method suggested above, although both are correct.
dataset_to_correct.set_index("To",inplace=True)
correct_to_coords.set_index("Site",inplace=True)
dataset_to_correct.loc[correct_to_coords.index, "To_Latitude"] = correct_to_coords["Correct_Latitude"]
dataset_to_correct.loc[correct_to_coords.index, "To_Longitude"] = correct_to_coords["Correct_Longitude"]
dataset_to_correct.reset_index(inplace=True)
dataset_to_correct.set_index("From",inplace=True)
correct_from_coords.set_index("Site",inplace=True)
dataset_to_correct.loc[correct_from_coords.index, "From_Latitude"] = correct_from_coords["Correct_Latitude"]
dataset_to_correct.loc[correct_from_coords.index, "From_Longitude"] = correct_from_coords["Correct_Longitude"]
dataset_to_correct.reset_index(inplace=True)
merge = dataset_to_correct.merge(correct_to_coords, left_on='To', right_on='Site', how='left')
merge.loc[(merge.To == merge.Site), 'To_Latitude'] = merge.Correct_Latitude
merge.loc[(merge.To == merge.Site), 'To_Longitude'] = merge.Correct_Longitude
# del merge['Site']
# del merge['Correct_Latitude']
# del merge['Correct_Longitude']
merge = merge.drop(columns = ['Site','Correct_Latitude','Correct_Longitude'])
merge = merge.merge(correct_from_coords, left_on='From', right_on='Site', how='left')
merge.loc[(merge.From == merge.Site), 'From_Latitude'] = merge.Correct_Latitude
merge.loc[(merge.From == merge.Site), 'From_Longitude'] = merge.Correct_Longitude
# del merge['Site']
# del merge['Correct_Latitude']
# del merge['Correct_Longitude']
merge = merge.drop(columns = ['Site','Correct_Latitude','Correct_Longitude'])
merge
lets try dual merge by merge()+pop()+fillna()+drop():
dataset_to_correct=dataset_to_correct.merge(correct_to_coords,left_on='To',right_on='Site',how='left').drop('Site',1)
dataset_to_correct['From_Latitude']=dataset_to_correct.pop('Correct_Latitude').fillna(dataset_to_correct['From_Latitude'])
dataset_to_correct['From_Longitude']=dataset_to_correct.pop('Correct_Longitude').fillna(dataset_to_correct['From_Longitude'])
dataset_to_correct=dataset_to_correct.merge(correct_from_coords,left_on='From',right_on='Site',how='left').drop('Site',1)
dataset_to_correct['To_Latitude']=dataset_to_correct.pop('Correct_Latitude').fillna(dataset_to_correct['To_Latitude'])
dataset_to_correct['To_Longitude']=dataset_to_correct.pop('Correct_Longitude').fillna(dataset_to_correct['To_Longitude'])
I am trying to generate a third column in pandas dataframe using two other columns in dataframe. The requirement is very particular to the scenario for which I need to generate the third column data.
The requirement is stated as:
let the dataframe name be df, first column be 'first_name'. second column be 'last_name'.
I need to generate third column in such a manner so that it uses string formatting to generate a particular string and pass it to a function and whatever the function returns should be used as value to third column.
Problem 1
base_string = "my name is {first} {last}"
df['summary'] = base_string.format(first=df['first_name'], last=df['last_name'])
Problem 2
df['summary'] = some_func(base_string.format(first=df['first_name'], last=df['last_name']))
My ultimate goal is to solve problem 2 but for that problem 1 is pre-requisite and as of now I'm unable to solve that. I have tried converting my dataframe values to string but it is not working the way I expected.
You can do apply:
df.apply(lambda r: base_string.format(first=r['first_name'], last=r['last_name']) ),
axis=1)
Or list comprehension:
df['summary'] = [base_string.format(first=x,last=y)
for x,y in zip(df['first_name'], df['last_name'])
And then, for general function some_func:
df['summary'] = [some_func(base_string.format(first=x,last=y) )
for x,y in zip(df['first_name'], df['last_name'])
You could use pandas.DataFrame.apply with axis=1 so your code will look like this:
def mapping_function(row):
#make your calculation
return value
df['summary'] = df.apply(mapping_function, axis=1)
Normally when splitting a value which is a string, one would simply do:
string = 'aabbcc'
small = string[0:2]
And that's simply it. I thought it would be the same thing for a dataframe by doing:
df = df['Column'][Range][Length of Desired value]
df = df['Date'][0:4][2:4]
Note: Every string in the column have the same length and are all integers written as a string data type
If I use the code above the program just throws the Range and takes [2:4] as the range which is weird.
When doing this individually it works:
df2 = df['Column'][index][2:4]
So right now I had to make a loop that goes one by one and append it to a new Dataframe.
To do the operation element wise, you can use apply (see link):
df['Column'][0:4].apply(lambda x : x[2:4])
When you did df2 = df['Column'][0:4][2:4], you are doing the same as df2 = df['Column'][2:4].
You're getting the indexes 0 to 4 of df and then getting the indexes 2 to 4 of that one.
I have two dataframes of unequal size, one contains cuisine style along with its frequency in the dataset and another is the original dataset which has restaurant name and cuisine corresponding to it. I want to add a new column on the original dataset where the frequency value of each cuisine is displayed from the dataframe containing the frequency data. What is the best way to perform that? I have tried by using merge but that creates NaN values. Please suggest
I tried below code snippet suggested but it did not give me the required result. it generates freq for first row and excludes the other rows for the same 'name' column.
df = df.assign(freq=0)
# get all the cuisine styles in the cuisine df
for cuisine in np.unique(cuisine_df['cuisine_style']):
# get the freq
freq = cuisine_df.loc[cuisine_df['cuisine_style'] == cuisine,
'freq'].values
# update value in main df
df.loc[df['cuisine_style'] == cuisine, 'freq'] = freq
Result dataframe
I re ran the code on your data set and still got the same results. Here is the code I ran.
import pandas as pd
import numpy as np
# used to set 'Cuisine Style' to first 'style' in array of values
def getCusinie(row):
arr = row['Cuisine Style'].split("'")
return arr[1]
# read in data set. Used first col for index and drop nan for ease of use
csv = pd.read_csv('TA_restaurants_curated.csv', index_col=0).dropna()
# get cuisine values
cuisines = csv.apply(lambda row: getCusinie(row), axis=1)
# update dataframe
csv['Cuisine Style'] = cuisines
# json obj to quickly make a new data frame with meaningless frequencies
c = {'Cuisine Style' : np.unique(csv['Cuisine Style']), 'freq': range(113)}
cuisine_df = pd.DataFrame(c)
# add 'freq' column to original Data Frame
csv = csv.assign(freq=0)
# same loop as before
for cuisine in np.unique(cuisine_df['Cuisine Style']):
# get the freq
freq = cuisine_df.loc[cuisine_df['Cuisine Style'] == cuisine,
'freq'].values
# update value in main df
csv.loc[csv['Cuisine Style'] == cuisine, 'freq'] = freq
Output:
As you can see, every column, even duplicates, have been updated. If they still are not being updated I'd check to make sure that the names are actually equal i.e. make sure there isn't any hidden spaces or anything causing issues.
You can read up on selecting and indexing DataFrames here.
Its quite long but you can pick apart what you need, when you need it
I have a certain data to clean, it's some keys where the keys have six leading zeros that I want to get rid of, and if the keys are not ending with "ABC" or it's not ending with "DEFG", then I need to clean the currency code in the last 3 indexes. If the key doesn't start with leading zeros, then just return the key as it is.
To achieve this I wrote a function that deals with string as below:
def cleanAttainKey(dirtyAttainKey):
if dirtyAttainKey[0] != "0":
return dirtyAttainKey
else:
dirtyAttainKey = dirtyAttainKey.strip("0")
if dirtyAttainKey[-3:] != "ABC" and dirtyAttainKey[-3:] != "DEFG":
dirtyAttainKey = dirtyAttainKey[:-3]
cleanAttainKey = dirtyAttainKey
return cleanAttainKey
Now I build a dummy data frame to test it but it's reporting errors:
data frame
df = pd.DataFrame({'dirtyKey':["00000012345ABC","0000012345DEFG","0000023456DEFGUSD"],'amount':[100,101,102]},
columns=["dirtyKey","amount"])
I need to get a new column called "cleanAttainKey" in the df, then modify each value in the "dirtyKey" using the "cleanAttainKey" function, then assign the cleaned key to the new column "cleanAttainKey", however it seems pandas doesn't support this type of modification.
# add a new column in df called cleanAttainKey
df['cleanAttainKey'] = ""
# I want to clean the keys and get into the new column of cleanAttainKey
dirtyAttainKeyList = df['dirtyKey'].tolist()
for i in range(len(df['cleanAttainKey'])):
df['cleanAttainKey'][i] = cleanAttainKey(vpAttainKeyList[i])
I am getting the below error message:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
The result should be the same as the df2 below:
df2 = pd.DataFrame({'dirtyKey':["00000012345ABC","0000012345DEFG","0000023456DEFGUSD"],'amount':[100,101,102],
'cleanAttainKey':["12345ABC","12345DEFG","23456DEFG"]},
columns=["dirtyKey","cleanAttainKey","amount"])
df2
Is there any better way to modify the dirty keys and get a new column with the clean keys in Pandas?
Thanks
Here is the culprit:
df['cleanAttainKey'][i] = cleanAttainKey(vpAttainKeyList[i])
When you use extract of the dataframe, Pandas reserves the ability to choose to make a copy or a view. It does not matter if you are just reading the data, but it means that you should never modify it.
The idiomatic way is to use loc (or iloc or [i]at):
df.loc[i, 'cleanAttainKey'] = cleanAttainKey(vpAttainKeyList[i])
(above assumes a natural range index...)