JSON object inside Pandas DataFrame - python

I have a JSON object inside a pandas dataframe column, which I want to pull apart and put into other columns. In the dataframe, the JSON object looks like a string containing an array of dictionaries. The array can be of variable length, including zero, or the column can even be null. I've written some code, shown below, which does what I want. The column names are built from two components, the first being the keys in the dictionaries, and the second being a substring from a key value in the dictionary.
This code works okay, but it is very slow when running on a big dataframe. Can anyone offer a faster (and probably simpler) way to do this? Also, feel free to pick holes in what I have done if you see something which is not sensible/efficient/pythonic. I'm still a relative beginner. Thanks heaps.
# Import libraries
import pandas as pd
from IPython.display import display  # Used to display df's nicely in jupyter notebook.
import json

# Set some display options
pd.set_option('max_colwidth', 150)

# Create the example dataframe
print("Original df:")
df = pd.DataFrame.from_dict({'ColA': {0: 123, 1: 234, 2: 345, 3: 456, 4: 567},
                             'ColB': {0: '[{"key":"keyValue=1","valA":"8","valB":"18"},{"key":"keyValue=2","valA":"9","valB":"19"}]',
                                      1: '[{"key":"keyValue=2","valA":"28","valB":"38"},{"key":"keyValue=3","valA":"29","valB":"39"}]',
                                      2: '[{"key":"keyValue=4","valA":"48","valC":"58"}]',
                                      3: '[]',
                                      4: None}})
display(df)
# Create a temporary dataframe to append results to, record by record
dfTemp = pd.DataFrame()

# Step through all rows in the dataframe
for i in range(df.shape[0]):
    # Check whether record is null, or doesn't contain any real data
    if pd.notnull(df.iloc[i, df.columns.get_loc("ColB")]) and len(df.iloc[i, df.columns.get_loc("ColB")]) > 2:
        # Convert the json structure into a dataframe, one cell at a time in the relevant column
        x = pd.read_json(df.iloc[i, df.columns.get_loc("ColB")])
        # The last bit of this string (after the last =) will be used as a key for the column labels
        x['key'] = x['key'].apply(lambda x: x.split("=")[-1])
        # Set this new key to be the index
        y = x.set_index('key')
        # Stack the rows up via a multi-level column index
        y = y.stack().to_frame().T
        # Flatten out the multi-level column index
        y.columns = ['{1}_{0}'.format(*c) for c in y.columns]
        # Give the single record the same index number as the parent dataframe (for the merge to work)
        y.index = [df.index[i]]
        # Append this dataframe on sequentially for each row as we go through the loop
        dfTemp = dfTemp.append(y)

# Merge the new dataframe back onto the original one as extra columns, with index matching the original dataframe
df = pd.merge(df, dfTemp, how='left', left_index=True, right_index=True)
print("Processed df:")
display(df)

First, a general piece of advice about pandas. If you find yourself iterating over the rows of a dataframe, you are most likely doing it wrong.
With this in mind, we can re-write your current procedure using pandas 'apply' method (this will likely speed it up to begin with, as it means far fewer index lookups on the df):
def do_the_thing(row):
    # Check whether record is null, or doesn't contain any real data
    if pd.notnull(row) and len(row) > 2:
        # Convert the json structure into a dataframe
        x = pd.read_json(row)
        # The last bit of this string (after the last =) will be used as a key for the column labels
        x['key'] = x['key'].apply(lambda x: x.split("=")[-1])
        # Set this new key to be the index
        y = x.set_index('key')
        # Stack the rows up via a multi-level column index
        y = y.stack().to_frame().T
        # Flatten out the multi-level column index
        y.columns = ['{1}_{0}'.format(*c) for c in y.columns]
        # We don't need to re-index:
        # y.index = [df.index[i]]
        # and we don't need to append to a temp df; just return the single row
        return y.iloc[0]
    else:
        return pd.Series()

df2 = df.merge(df.ColB.apply(do_the_thing), how='left', left_index=True, right_index=True)
Notice that this returns exactly the same result as before; we haven't changed the logic. The apply method sorts out the indexes, so we can merge directly.
I believe that answers your question in terms of speeding it up and being a bit more idiomatic.
I think you should consider, however, what you want to do with this data structure, and how you might better structure what you're doing.
Given that ColB can be of arbitrary length, you will end up with a dataframe with an arbitrary number of columns. When you come to access these values, that will cause you pain, whatever the purpose is.
Are all entries in ColB important? Could you get away with just keeping the first one? Do you need to know the index of a certain valA value?
These are questions you should ask yourself, then decide on a structure which will allow you to do whatever analysis you need, without having to check a bunch of arbitrary things.
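For example (a sketch only, not part of the original answer), a long or "tidy" layout keeps the column set fixed no matter how many entries ColB holds, assuming df has the ColA/ColB structure from the question:
import json
import pandas as pd

# One row per dictionary in ColB; keys are normalised to the part after '='
records = []
for col_a, raw in zip(df['ColA'], df['ColB']):
    if pd.isnull(raw):
        continue
    for entry in json.loads(raw):
        rec = {'ColA': col_a, 'key': entry['key'].split('=')[-1]}
        rec.update({k: v for k, v in entry.items() if k != 'key'})
        records.append(rec)
long_df = pd.DataFrame(records)
Each (ColA, key) pair is then one row; if a wide layout is genuinely needed later, pivot_table can produce it from this frame.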

Related

Pandas - Find a column with a specific value in the entire dataframe

I have a DataFrame which has a few columns. There is a column with a value that only appears once in the entire dataframe. I want to write a function that returns the column name of the column with that specific value. I can manually find which column it is with the usual data exploration, but since I have multiple dataframes with the same properties, I need to be able to find that column for multiple dataframes. So a somewhat generalized function would be of better use.
The problem is that I don't know beforehand which column is the one I am looking for since in every dataframe the position of that particular column with that particular value is different. Also the desired columns in different dataframes have different names, so I cannot use something like df['my_column'] to extract the column.
Thanks
You'll need to iterate columns and look for the value:
def find_col_with_value(df, value):
    for col in df:
        if (df[col] == value).any():
            return col
This will return the name of the first column that contains value. If value does not exist, it will return None.
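For example (a hypothetical toy frame, not from the question):
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 'needle', 6]})
find_col_with_value(toy, 'needle')   # returns 'b'
find_col_with_value(toy, 'missing')  # returns None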
Check the entire DataFrame for the specific value with eq, use any to see whether it ever appears in each column, then slice the columns (or the DataFrame itself if you want the Series):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.normal(0, 5, (100, 200)),
                  columns=[chr(i + 40) for i in range(200)])
df.loc[5, 'Y'] = 'secret_value'  # Secret value in column 'Y'

df.eq('secret_value').any().loc[lambda x: x].index
# or
df.columns[df.eq('secret_value').any()]

Index(['Y'], dtype='object')
I have another solution:
names = ds.columns
for i in names:
    for j in ds[i]:
        if j == 'your_value':
            print(i)
            break
Here you collect all the column names and then iterate over the whole dataset until the value is found, printing the name of the column where it occurs.

Pandas fill dataframe from another dataframe where the [non-] index doesn't always overlap

I have a bunch of dataframes where I want to pull out a single column from each and merge them into another dataframe with a timestamp column that is not indexed.
So e.g. all the dataframes look like:
[Index]  [time]                 [col1]  [col2]  [etc]
0        2020-04-21T18:00:00Z   1       2       ...
All of the dataframes have a 'time' column and a 'col1' column. Because the 'time' column does not necessarily overlap, I made a new dataframe with a join of all the dataframes (that I added to a dictionary)
di = ...  # dictionary of all the dataframes of interest
fulltimeslist = []
for key in di:
    temptimeslist = di[key]['time'].tolist()
    fulltimeslist.extend(x for x in temptimeslist if x not in fulltimeslist)
datadf = pd.DataFrame()
datadf['time'] = fulltimeslist  # make a new df and add this as a column
(I'm sure there's an easier way to do the above; any suggestions are welcome.) Note that for a number of reasons, translating the ISO datetime format into a datetime and setting that as an index is not ideal.
The dumb way to do what I want is obvious enough:
for key in di:
    datadf[key] = float("NaN")
    tempdf = di[key]  # could skip this probably
    for i in range(len(datadf)):
        if tempdf.time[tempdf.time == datadf.time[i]].index.tolist():
            if len(tempdf.time[tempdf.time == datadf.time[i]].index.tolist()) == 1:  # make sure value only shows up once; could reasonably skip this and put protection in elsewhere
                datadf.loc[i, key] = float(tempdf[colofinterest][tempdf.time[tempdf.time == datadf.time[i]].index.tolist()])
# I guess I could do the above backwards so I loop over only the shorter dataframe to save some time.
This seems needlessly long for Python. I originally tried the pandas merge and join methods but got various KeyErrors when trying them; the same goes for 'in' statements inside the if statements.
For example, I've tried things like:
datadf.join(Nodes_dict[key],datadf['time']==Nodes_dict[key]['time'],how="left").select()
but this fails.
I guess the question boils down to the following steps:
1) Given two dataframes with a column of strings (times in ISO format), find the indexes in the larger one where they match the shorter one (or vice versa).
2) Given that list of indexes, populate a separate column in the larger df using values from the smaller df, but only in the correct spots, with NaN otherwise.
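One way to express those two steps without the nested loop (a sketch only, reusing di, datadf and colofinterest from the code above, and assuming each timestamp appears at most once per source dataframe) would be a Series lookup plus map:
for key in di:
    # build a lookup Series keyed on the time strings of this dataframe
    lookup = di[key].set_index('time')[colofinterest]
    # keep only the first value if a timestamp is duplicated
    lookup = lookup[~lookup.index.duplicated(keep='first')]
    # steps 1) and 2): matching times get the value, everything else stays NaN
    datadf[key] = datadf['time'].map(lookup)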

adding row from one dataframe to another

I am trying to insert or add from one dataframe to another dataframe. I am going through the original dataframe looking for certain words in one column. When I find one of these terms I want to add that row to a new dataframe.
I get the row by using:
entry = df.loc[df['A'] == item]
But when trying to add this row to another dataframe using .add, .insert, .update or other methods, I just get an empty dataframe.
I have also tried adding the column to a dictionary and turning that into a dataframe, but it writes data for the entire row rather than just the column value. So is there a way to add one specific row to a new dataframe from my existing variable?
So the entry is a dataframe containing the rows you want to add?
You can simply concatenate two dataframes using the concat function if both have the same column names:
import pandas as pd
entry = df.loc[df['A'] == item]
concat_df = pd.concat([new_df,entry])
pandas.concat reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
The append function expects a list of rows in this format:
[row_1, row_2, ..., row_N]
where each row is a list representing the value for each column.
So, assuming you're trying to add one row, you should use:
entry = df.loc[df['A'] == item]
df2 = df2.append([entry])
Notice that unlike Python's list.append, DataFrame.append returns a new object and does not modify the object it is called on.
Not sure how large your operations will be, but from an efficiency standpoint, you're better off adding all of the found rows to a list, concatenating them together at once using pandas.concat, and then using concat again to combine the found-entries dataframe with the "insert into" dataframe. This will be much faster than calling concat once per row. If you're searching from a list of items search_keys, then something like:
entries = []
for item in search_keys:
    entry = df.loc[df['A'] == item]
    entries.append(entry)
found_df = pd.concat(entries)
result_df = pd.concat([old_df, found_df])

Pythonic way to apply a filter to a table

I have two tables: a data table and a filter table. I want to apply the filter table to the data table to select only certain records. When the filter table has a # in a column, that filter is ignored. Additionally, multiple selections can be applied using a | separator.
I have achieved this using a for loop with a bunch of & and | conditions. However, given that my filter table is quite large, I was wondering if there is a more efficient way to achieve this. My filter table looks like:
import pandas as pd
import numpy as np

f = {'business': ['FX', 'FX', 'IR', 'IR', 'CR'],
     'A/L': ['A', 'L', 'A', 'L', '#'],
     'Company': ['207|401', '#', '#', '207', '#']}
filter = pd.DataFrame(data=f)
filter
and the data table looks like:
d = {'business': ['FX', 'a', 'CR'],
     'A/L': ['A', 'A', 'L'],
     'Company': ['207', '1', '2']}
data = pd.DataFrame(data=d)
data
and finally the filtering loop looks like:
for counter in range(0, len(filter)):
    businessV = str(filter.iat[counter, 0])
    ALV = str(filter.iat[counter, 1])
    CompanyV = str(filter.iat[counter, 2])

    businessV1 = businessV.split("|", 100)
    ALV1 = ALV.split("|", 100)
    CompanyV1 = CompanyV.split("|", 100)

    businessV2 = ('#' in businessV1) | (data['business'].isin(businessV1))
    ALV2 = ('#' in ALV1) | (data['A/L'].isin(ALV1))
    CompanyV2 = ('#' in CompanyV1) | (data['Company'].isin(CompanyV1))

    final_filter = businessV2 & ALV2 & CompanyV2
    print(final_filter)
I am trying to find a more efficient way to select the first and last rows in the data table using the filters in the filter table.
Specifically, I am wondering how to:
1) handle cases when the filter table has quite a few more columns, and
2) avoid the current approach, which goes through every row in the data table once for each row in the filter table; for large data sets this takes far too much time and does not seem very efficient.
It is a rather complex question. I would start by pre-processing the filter table to have only one value per field by duplicating rows containing '|'. In order to limit the number of useless rows, I would first replace anything containing a '#' and other values with a single '#'.
Once this is done, it is possible to select the rows from the business (data) table with a merge, provided we merge on the columns containing no '#'.
Code could be:
# store the original column names
cols = filter.columns

# remove any alternate value if a '#' is already present:
tosimp = pd.DataFrame({col: filter[col].str.contains('#') &
                            filter[col].str.contains(r'\|')
                       for col in cols})

# add a column to store in a (hashable) tuple the columns with no '#'
filter['wild'] = filter.apply(lambda x: tuple(col for col in cols
                                              if x[col] != '#'), axis=1)

# now explode the fields containing a '|'
tosimp = pd.DataFrame({col: filter[col].str.contains(r'\|')
                       for col in filter.columns})

# again, store in a new column the columns containing a '|'
tosimp['wild'] = filter.apply(lambda x: tuple(col for col in cols
                                              if '|' in filter.loc[x.name, col]),
                              axis=1)

# compute a new filter table with one single value per field (or '#')
# by grouping on tosimp['wild']
dfl = [filter[tosimp['wild'].astype(str) == '()']]
for k, df in filter[tosimp['wild'].astype(str) != '()'].groupby(tosimp['wild']):
    for ix, row in df.iterrows():
        tmp = pd.MultiIndex.from_product([df.loc[ix, col].split('|')
                                          for col in k], names=k).to_frame(None)
        l = len(tmp)
        dfl.append(pd.DataFrame({col: tmp[col] if col in k else [row[col]] * l
                                 for col in filter.columns}))
filter2 = pd.concat(dfl)

# Ok, we can now use that new filter table to filter the business table
result = pd.concat([data.merge(df, on=k, suffixes=('', '_y'),
                               right_index=True)[cols]
                    for k, df in filter2.groupby('wild')]).sort_index()
Limits:
the pre-processing iterates over grouped dataframes and uses an iterrows call: it can take some time on a large filter table
the current algorithm does not handle a row containing '#' in all of its fields. If that is a possible use case, it must be detected before any other processing; in that case every row from the business table will be kept anyway.
Explanation of the pd.concat(...) line:
[... for k, df in filter2.groupby('wild')]: split the filter dataframe into sub-dataframes, each having a different 'wild' value, that is, a different set of non-'#' fields
data.merge(df, on=k, suffixes=('', '_y'), right_index=True): merge each sub-filter dataframe with the data dataframe on the non-'#' fields, that is, select the rows from the data dataframe matching one of these filter rows, keeping the original index of the data dataframe
...[cols]: keep only the relevant fields
pd.concat(...): concatenate all those partial dataframes
... .sort_index(): sort the concatenated dataframe according to its index, which is by construction the index of the original data dataframe
My understanding of your problem is that you want all the first matches for business, A/L with a Company that is specified (or any Company if '#' is used) in the corresponding filter.
I'm supposing that your intended result is a dataframe with just the first matching row of data. When your filter gets large, you can speed things up by using a join operation on the filter and keeping only the first result.
# Split on | so that every option is represented in a single row
filter0 = (filter.set_index(['business', 'A/L'])
                 .Company.str.split('|', expand=True)
                 .stack()
                 .reset_index()
                 .drop('level_2', axis=1)
                 .rename(columns={0: 'Company'}))

# The set of *all* rows in data which are caught by filters with a Company specification
r1 = data.merge(filter0[filter0.Company != '#'])

# The set of *all* rows in data which are caught by filters allowing for *any* Company
r2 = data.merge(filter0[filter0.Company == '#'].drop('Company', axis=1))

# r1 and r2 are not necessarily disjoint, and each one may have multiple rows that pass one filter.
# Take the union, sort on the index to preserve the original ordering,
# then finally drop duplicates of business+A/L, keeping only the first entry
pd.concat([r1, r2]).drop_duplicates(subset=['business', 'A/L'], keep='first')
With regard to your case of handling multiple columns in the filter: a single row in your filter essentially says something along the lines of,
"I want field1=foo AND field2=bar AND field3=baz1 OR field3=baz2 AND field4=qux1 OR field4=qux2."
The main idea is to expand this into multiple rows composed of only AND conditionals, so in this case it would turn into four rows:
field1=foo AND field2=bar AND field3=baz1 AND field4=qux1
field1=foo AND field2=bar AND field3=baz1 AND field4=qux2
field1=foo AND field2=bar AND field3=baz2 AND field4=qux1
field1=foo AND field2=bar AND field3=baz2 AND field4=qux2
In other words, use .split and .stack multiple times, once for each column with an OR condition. This may be slightly inefficient (you might get better speed and code readability using itertools.product somewhere), but your bottleneck is usually in the join operation, so this isn't too much of a worry as far as speed is concerned.
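As a rough sketch (not part of the original answer), the expansion itself could also be written with itertools.product on the filter table from the question; expand_filter here is a hypothetical helper name:
from itertools import product

# Expand each filter row into AND-only rows: every '|' alternative
# becomes its own row via a cartesian product of the split values.
def expand_filter(filter_df):
    rows = []
    for _, row in filter_df.iterrows():
        options = [str(v).split('|') for v in row]  # e.g. '207|401' -> ['207', '401']
        for combo in product(*options):
            rows.append(dict(zip(filter_df.columns, combo)))
    return pd.DataFrame(rows)

expanded = expand_filter(filter)  # one AND-only condition per row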

python pandas difference between df_train["x"] and df_train[["x"]]

I have the following dataset and am reading it from a csv file.
x = [1, 2, 3, 4, 5]
With pandas I can access the array:
df_train = pd.read_csv("train.csv")
x = df_train["x"]
And:
x = df_train[["x"]]
Since both seem to produce the same result, I wonder why the former makes sense but the latter does not. Could you please explain the difference and when to use each?
In pandas, you can slice your data frame in different ways. On a high level, you can choose to select a single column out of a data frame, or many columns.
When you select many columns, you have to slice using a list, and the return is a pandas DataFrame. For example
df[['col1', 'col2', 'col3']] # returns a data frame
When you select only one column, you can pass only the column name, and the return is just a pandas Series
df['col1'] # returns a series
When you do df[['col1']], you get back a DataFrame with only one column. In other words, it's like you're telling pandas "give me all the columns from the following list" and handing it a list with a single column in it. It will filter your df, returning all columns in your list (in this case, a DataFrame with only one column).
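A quick way to see the difference (a small self-contained illustration, not the question's CSV):
import pandas as pd

df_train = pd.DataFrame({'x': [1, 2, 3, 4, 5]})

s = df_train['x']    # pandas Series, shape (5,)
d = df_train[['x']]  # one-column DataFrame, shape (5, 1)

print(type(s).__name__, s.shape)  # Series (5,)
print(type(d).__name__, d.shape)  # DataFrame (5, 1)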
If you want more details on the difference between a Series and a one-column DataFrame, check this thread with very good answers
