Append to a pd.DataFrame, dynamically allocating any new columns - python

I want to aggregate some API responses into a DataFrame.
The request consistently returns a number of JSON key/value pairs, let's say A, B, C. Occasionally, however, it will return A, B, C, D.
I would like something comparable to SQL's OUTER JOIN, which would simply add the new row while filling any columns missing on either side with NULL or some other placeholder.
The pandas join options insist on imposing a unique suffix for each side; I really don't want this.
Am I looking at this the wrong way?
If there is no easy solution, I could just select a subset of the consistently available columns, but I really wanted to download the lot and do the processing as a separate stage.

You can use pandas.concat, as it provides all the functionality required for your problem. Let this toy example illustrate a possible solution.
import string

import numpy as np
import pandas as pd

# Generates random data as a dict of key/value pairs.
def gen_data(_size):
    keys = list(string.ascii_uppercase)
    return dict((k, [v]) for k, v in zip(np.random.choice(keys, _size),
                                         np.random.randint(1000, size=_size)))

counter = 0
df = pd.DataFrame()
while True:
    if counter > 5:
        break
    # Receive the data
    new_data = gen_data(5)
    # Convert it to a DataFrame object
    new_data = pd.DataFrame(new_data)
    # Append it to the stack; sort=True aligns the columns,
    # so any column absent from one side is filled with NaN
    df = pd.concat((df, new_data), axis=0, sort=True)
    counter += 1

df.reset_index(drop=True, inplace=True)
print(df.to_string())
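For the A/B/C vs A/B/C/D case from the question specifically, here is a minimal sketch (column names assumed from the question) showing that concat behaves like an OUTER JOIN on the columns:

batch1 = pd.DataFrame({'A': [1], 'B': [2], 'C': [3]})
batch2 = pd.DataFrame({'A': [4], 'B': [5], 'C': [6], 'D': [7]})
# Rows from batch1 get NaN in the new 'D' column
combined = pd.concat([batch1, batch2], axis=0, sort=True, ignore_index=True)
print(combined)
#    A  B  C    D
# 0  1  2  3  NaN
# 1  4  5  6  7.0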

Related

Fill specific columns in pandas dataframe rows with values from another dataframe

I am trying to replace some missing and incorrect values in my master dataset by filling it in with correct values from two different datasets.
I created a miniature version of the full dataset like so (note the real dataset is several thousand rows long):
import pandas as pd

data = {'From': ['GA0251', 'GA5201', 'GA5551', 'GA510A', 'GA5171', 'GA5151'],
        'To': ['GA0201_T', 'GA5151_T', 'GA5151_R', 'GA5151_V', 'GA5151_P', 'GA5171_B'],
        'From_Latitude': [55.86630869, 0, 55.85508787, 55.85594626, 55.85692217, 55.85669934],
        'From_Longitude': [-4.27138731, 0, -4.24126866, -4.24446585, -4.24516129, -4.24358251],
        'To_Latitude': [55.86614756, 0, 55.85522197, 55.85593762, 55.85693878, 0],
        'To_Longitude': [-4.271040979, 0, -4.241466534, -4.244607602, -4.244905037, 0]}
dataset_to_correct = pd.DataFrame(data)
However, some values in the From lat/long and the To lat/long are incorrect. For each of From and To I have a table like the ones below, whose values I would like to substitute into the dataset in place of the two values for that row.
Table of corrected To lat/long:
data = {'Site': ['GA5151_T', 'GA5171_B'],
        'Correct_Latitude': [55.85952791, 55.87044558],
        'Correct_Longitude': [55.85661767, -4.24358251]}
correct_to_coords = pd.DataFrame(data)
I would like to match this table to the To column and then replace the To_Latitude and To_Longitude with the correct values.
Table of corrected From lat/long:
data = {'Site': ['GA5201', 'GA0251'],
        'Correct_Latitude': [55.857577, 55.86616756],
        'Correct_Longitude': [-4.242770, -4.272140979]}
correct_from_coords = pd.DataFrame(data)
I would like to match this table to the From column and then replace the From_Latitude and From_Longitude with the correct values.
Is there a way to match the site in each table to the corresponding From or To column and then replace only the values in the respective columns?
I have tried using code from this answer (Elegant way to replace values in pandas.DataFrame from another DataFrame), but it seems to have no effect on the dataframe.
(correct_to_coords.set_index('Site')
 .rename(columns={'Correct_Latitude': 'To_Latitude'})
 .combine_first(dataset_to_correct.set_index('To')))
@zswqa's answer produces the right result; @Anurag Dabas's doesn't.
Another possible solution. It is a bit faster than the merge method suggested below, although both are correct.
dataset_to_correct.set_index("To",inplace=True)
correct_to_coords.set_index("Site",inplace=True)
dataset_to_correct.loc[correct_to_coords.index, "To_Latitude"] = correct_to_coords["Correct_Latitude"]
dataset_to_correct.loc[correct_to_coords.index, "To_Longitude"] = correct_to_coords["Correct_Longitude"]
dataset_to_correct.reset_index(inplace=True)
dataset_to_correct.set_index("From",inplace=True)
correct_from_coords.set_index("Site",inplace=True)
dataset_to_correct.loc[correct_from_coords.index, "From_Latitude"] = correct_from_coords["Correct_Latitude"]
dataset_to_correct.loc[correct_from_coords.index, "From_Longitude"] = correct_from_coords["Correct_Longitude"]
dataset_to_correct.reset_index(inplace=True)
merge = dataset_to_correct.merge(correct_to_coords, left_on='To', right_on='Site', how='left')
merge.loc[(merge.To == merge.Site), 'To_Latitude'] = merge.Correct_Latitude
merge.loc[(merge.To == merge.Site), 'To_Longitude'] = merge.Correct_Longitude
# drop the helper columns brought in by the merge
merge = merge.drop(columns=['Site', 'Correct_Latitude', 'Correct_Longitude'])

merge = merge.merge(correct_from_coords, left_on='From', right_on='Site', how='left')
merge.loc[(merge.From == merge.Site), 'From_Latitude'] = merge.Correct_Latitude
merge.loc[(merge.From == merge.Site), 'From_Longitude'] = merge.Correct_Longitude
# drop the helper columns brought in by the merge
merge = merge.drop(columns=['Site', 'Correct_Latitude', 'Correct_Longitude'])
merge
Let's try a dual merge with merge() + pop() + fillna() + drop():
dataset_to_correct = dataset_to_correct.merge(correct_to_coords, left_on='To', right_on='Site', how='left').drop(columns='Site')
dataset_to_correct['From_Latitude'] = dataset_to_correct.pop('Correct_Latitude').fillna(dataset_to_correct['From_Latitude'])
dataset_to_correct['From_Longitude'] = dataset_to_correct.pop('Correct_Longitude').fillna(dataset_to_correct['From_Longitude'])
dataset_to_correct = dataset_to_correct.merge(correct_from_coords, left_on='From', right_on='Site', how='left').drop(columns='Site')
dataset_to_correct['To_Latitude'] = dataset_to_correct.pop('Correct_Latitude').fillna(dataset_to_correct['To_Latitude'])
dataset_to_correct['To_Longitude'] = dataset_to_correct.pop('Correct_Longitude').fillna(dataset_to_correct['To_Longitude'])
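For completeness, here is a minimal sketch of the same correction using DataFrame.update (my addition, not from either answer): rename the correction columns to the target names first, then let update overwrite only the matching cells in place.

# Fix the To coordinates via update(); alignment is on index and column names
fix_to = correct_to_coords.set_index('Site').rename(
    columns={'Correct_Latitude': 'To_Latitude', 'Correct_Longitude': 'To_Longitude'})
tmp = dataset_to_correct.set_index('To')
tmp.update(fix_to)          # rows without a matching Site are left untouched
dataset_to_correct = tmp.reset_index()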

Create a dictionary from pandas empty dataframe with only column names

I have a pandas data frame with only two column names (a single row, which can also be considered headers). I want to make a dictionary out of this, with the first column name being the key and the second being the value. I already tried the
to_dict() method, but it's not working as it's an empty dataframe.
Example:
df = |Land|Norway|  ->  {'Land': 'Norway'}
I can change the pandas data frame to some other type and find my way around it, but this question is mostly to learn the best/different/efficient approach for this problem.
For now I have this as the solution:
dict(zip(df.iloc[0:0, 0:1], df.iloc[0:0, 1:2]))
Is there any other way to do this?
Here's a simple way: convert the columns to a list, and the list to a dictionary.
def list_to_dict(a):
    it = iter(a)
    ret_dict = dict(zip(it, it))
    return ret_dict

df = pd.DataFrame([], columns=['Land', 'Norway'])
dict_val = list_to_dict(df.columns.to_list())
dict_val  # {'Land': 'Norway'}
Very manual solution
df = pd.DataFrame(columns=['Land', 'Norway'])
df = pd.DataFrame({df.columns[0]: df.columns[1]}, index=[0])
If you have any number of columns and you want each sequential pair to have this transformation, try:
df = pd.DataFrame(dict(zip(df.columns[::2], df.columns[1::2])), index=[0])
Note: You will get an error if your DataFrame does not have at least two columns.
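By contrast, the zip-based pairing degrades gracefully; a small sketch (my note, not from the answer) of how an odd or empty column list behaves:

cols = df.columns.to_list()
# zip stops at the shorter slice, so a trailing unpaired column is simply dropped
pairs = dict(zip(cols[::2], cols[1::2]))   # {} when there are fewer than two columns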

Pandas: merge two dataframes with unequal row and column sizes, based on the explanation below

I have two dataframes of unequal size: one contains each cuisine style along with its frequency in the dataset, and the other is the original dataset, which has a restaurant name and the corresponding cuisine. I want to add a new column to the original dataset where the frequency value of each cuisine is displayed, taken from the dataframe containing the frequency data. What is the best way to perform that? I have tried using merge, but that creates NaN values. Please suggest a solution.
I tried the code snippet suggested below, but it did not give me the required result: it generates freq for the first row and excludes the other rows with the same 'name' value.
df = df.assign(freq=0)
# get all the cuisine styles in the cuisine df
for cuisine in np.unique(cuisine_df['cuisine_style']):
    # get the freq
    freq = cuisine_df.loc[cuisine_df['cuisine_style'] == cuisine, 'freq'].values
    # update value in main df
    df.loc[df['cuisine_style'] == cuisine, 'freq'] = freq
[Screenshot: result dataframe]
I re-ran the code on your data set and still got the same results. Here is the code I ran.
import pandas as pd
import numpy as np

# used to set 'Cuisine Style' to the first style in the array of values
def getCuisine(row):
    arr = row['Cuisine Style'].split("'")
    return arr[1]

# read in the data set; use the first col for the index and drop NaNs for ease of use
csv = pd.read_csv('TA_restaurants_curated.csv', index_col=0).dropna()
# get cuisine values
cuisines = csv.apply(lambda row: getCuisine(row), axis=1)
# update dataframe
csv['Cuisine Style'] = cuisines
# dict to quickly make a new data frame with meaningless frequencies
c = {'Cuisine Style': np.unique(csv['Cuisine Style']), 'freq': range(113)}
cuisine_df = pd.DataFrame(c)
# add 'freq' column to the original data frame
csv = csv.assign(freq=0)
# same loop as before
for cuisine in np.unique(cuisine_df['Cuisine Style']):
    # get the freq
    freq = cuisine_df.loc[cuisine_df['Cuisine Style'] == cuisine, 'freq'].values
    # update value in main df
    csv.loc[csv['Cuisine Style'] == cuisine, 'freq'] = freq
Output: [screenshot of the updated dataframe]
As you can see, every row, even duplicates, has been updated. If they are still not being updated, I'd check that the names are actually equal, i.e. make sure there aren't any hidden spaces or anything else causing issues.
You can read up on selecting and indexing DataFrames here.
It's quite long, but you can pick apart what you need, when you need it.
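For reference, here is a loop-free sketch of the same update (my addition, using the lowercase column names from the question's first snippet): build a Series keyed by cuisine and map it over the main frame.

# assumes cuisine_df has exactly one row per cuisine_style
freq_map = cuisine_df.set_index('cuisine_style')['freq']
df['freq'] = df['cuisine_style'].map(freq_map)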

Python Pandas: modify column values with a function that cleans a string value, and assign the result to a new column

I have some data to clean: keys with six leading zeros that I want to get rid of. In addition, if a key does not end with "ABC" or "DEFG", I need to remove the currency code in its last 3 characters. If a key doesn't start with leading zeros, it should just be returned as it is.
To achieve this I wrote a function that deals with the string, as below:
def cleanAttainKey(dirtyAttainKey):
    # keys without leading zeros are returned unchanged
    if dirtyAttainKey[0] != "0":
        return dirtyAttainKey
    # strip only the leading zeros
    dirtyAttainKey = dirtyAttainKey.lstrip("0")
    # remove the 3-char currency code unless the key already ends with ABC or DEFG
    if not dirtyAttainKey.endswith("ABC") and not dirtyAttainKey.endswith("DEFG"):
        dirtyAttainKey = dirtyAttainKey[:-3]
    return dirtyAttainKey
Now I build a dummy data frame to test it, but it's reporting errors:
df = pd.DataFrame({'dirtyKey': ["00000012345ABC", "0000012345DEFG", "0000023456DEFGUSD"],
                   'amount': [100, 101, 102]},
                  columns=["dirtyKey", "amount"])
I need to get a new column called "cleanAttainKey" in the df: modify each value in "dirtyKey" using the cleanAttainKey function, then assign the cleaned key to the new "cleanAttainKey" column. However, it seems pandas doesn't support this type of modification.
# add a new column in df called cleanAttainKey
df['cleanAttainKey'] = ""
# clean the keys and put them into the new cleanAttainKey column
dirtyAttainKeyList = df['dirtyKey'].tolist()
for i in range(len(df['cleanAttainKey'])):
    df['cleanAttainKey'][i] = cleanAttainKey(dirtyAttainKeyList[i])
I am getting the below error message:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
The result should be the same as df2 below:
df2 = pd.DataFrame({'dirtyKey': ["00000012345ABC", "0000012345DEFG", "0000023456DEFGUSD"],
                    'amount': [100, 101, 102],
                    'cleanAttainKey': ["12345ABC", "12345DEFG", "23456DEFG"]},
                   columns=["dirtyKey", "cleanAttainKey", "amount"])
df2
Is there any better way to modify the dirty keys and get a new column with the clean keys in Pandas?
Thanks
Here is the culprit:
df['cleanAttainKey'][i] = cleanAttainKey(dirtyAttainKeyList[i])
When you take an extract of the dataframe, pandas reserves the ability to return either a copy or a view. That does not matter if you are just reading the data, but it means that you should never modify the extract directly.
The idiomatic way is to use loc (or iloc, or at/iat):
df.loc[i, 'cleanAttainKey'] = cleanAttainKey(dirtyAttainKeyList[i])
(the above assumes a natural RangeIndex...)
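For this particular task there is also a loop-free sketch that sidesteps the copy-versus-view question entirely: apply the cleaning function element-wise and assign the whole column at once.

df['cleanAttainKey'] = df['dirtyKey'].apply(cleanAttainKey)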

Pandas, append column based on unique subset of column values

I have a dataframe with many rows. I am appending a column using data produced from a custom function, like this:
import numpy
df['new_column'] = numpy.vectorize(fx)(df['col_a'], df['col_b'])
# takes 180964.377 ms
It is working fine; what I am trying to do is speed it up. There is really only a small group of unique combinations of col_a and col_b, so many of the iterations are redundant. I was thinking maybe pandas would figure that out on its own, but I don't think that is the case. Consider this:
print(len(df.index))          # prints 127255
df_unique = df.copy().drop_duplicates(['col_a', 'col_b'])
print(len(df_unique.index))   # prints 9834
I also convinced myself of the possible speedup by running this:
df_unique['new_column'] = numpy.vectorize(fx)(df_unique['col_a'], df_unique['col_b'])
# takes 14611.357 ms
Since there is a lot of redundant data, what I am trying to do is update the large dataframe (df, 127255 rows) while only running the fx function the minimum number of times (9834 times), given all the duplicate rows for col_a and col_b. Of course this means there will be multiple rows in df with the same values for col_a and col_b, but that is OK; the other columns of df differ and make each row unique.
Before I create a normal iterative for loop to go through the df_unique dataframe and do a conditional update on df, I wanted to ask if there is a more "pythonic", neat way of doing this kind of update. Thanks a lot.
** UPDATE **
I created the simple for loop mentioned above, like this:
df = ...
df_unique = df.copy().drop_duplicates(['col_a', 'col_b'])
df_unique['new_column'] = np.vectorize(fx)(df_unique['col_a'], df_unique['col_b'])
for index, row in df_unique.iterrows():
    df.loc[(df['col_a'] == row['col_a']) & (df['col_b'] == row['col_b']), 'new_column'] = row['new_column']
# takes 165971.890 ms
So with this for loop there may be a slight performance increase but not nearly what I would have expected.
FYI
This is the fx function. It queries a MySQL database.
from datetime import datetime, timedelta

import pandas

def fx(d):
    exp_date = datetime.strptime(d.col_a, '%m/%d/%Y')
    if exp_date.weekday() == 5:
        exp_date -= timedelta(days=1)
    p = pandas.read_sql("select stat from table where a = '%s' and b_date = '%s';"
                        % (d.col_a, exp_date.strftime('%Y-%m-%d')), engine)
    if len(p.index) == 0:
        return None
    else:
        # the query selects 'stat', so return that column's value
        return p.iloc[0]['stat']
UPDATE:
If you can manage to read your three columns ['stat', 'a', 'b_date'] from the table table into a tab DF, then you could merge it like this:
tab = pd.read_sql('select stat, a, b_date from table', engine)
df.merge(tab, left_on=[...], right_on=[...], how='left')
OLD answer:
You can merge/join your precalculated df_unique DF with the original df DF:
df['new_column'] = df.merge(df_unique, on=['col_a','col_b'], how='left')['new_column']
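One caveat worth noting (my addition): merge returns a frame with a fresh RangeIndex, and the assignment above aligns on index, so it is only safe when df also has a default RangeIndex. A sketch that bypasses index alignment:

df['new_column'] = df.merge(df_unique, on=['col_a', 'col_b'],
                            how='left')['new_column'].to_numpy()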
MaxU's answer may already be what you want. But I'll show another approach, which may be a bit faster (I didn't measure).
I assume that:
1. df[['col_a', 'col_b']] is sorted so that all identical entries are in consecutive rows (this is important);
2. df has a unique index (if not, you may create a temporary unique index).
I'll use the fact that df_unique.index is a subset of df.index.
# (keep='first' is actually the default)
df_unique = df[['col_a', 'col_b']].drop_duplicates(keep='first').copy()
# You may try .apply instead of np.vectorize (I think it may be faster):
df_unique['result'] = df_unique.apply(fx, axis=1)
# Main part:
df['result'] = df_unique['result']   # aligns on index; uses assumption 2
df['result'] = df['result'].ffill()  # fills the gaps downward; uses assumption 1
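If assumption 1 does not hold (identical pairs are scattered rather than consecutive), a plain-Python lookup sketch (my addition) works regardless of row order:

# dict keyed on the (col_a, col_b) pair, built once from the unique frame
lookup = dict(zip(zip(df_unique['col_a'], df_unique['col_b']),
                  df_unique['result']))
df['result'] = [lookup[pair] for pair in zip(df['col_a'], df['col_b'])]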
