Expand Pandas DataFrame Column with JSON Object - python

I'm looking for a clean, fast way to expand a pandas dataframe column which contains a JSON object (essentially a dict of nested dicts), so that I end up with one column for each element of the JSON column, in json-normalized form; however, this needs to retain all of the original dataframe columns as well. In some instances, this dict might have a common identifier I could use to merge with the original dataframe, but not always. For example:
import pandas as pd
import numpy as np

df = pd.DataFrame([
    {
        'col1': 'a',
        'col2': {'col2.1': 'a1', 'col2.2': {'col2.2.1': 'a2.1', 'col2.2.2': 'a2.2'}},
        'col3': '3a'
    },
    {
        'col1': 'b',
        'col2': np.nan,
        'col3': '3b'
    },
    {
        'col1': 'c',
        'col2': {'col2.1': 'c1', 'col2.2': {'col2.2.1': np.nan, 'col2.2.2': 'c2.2'}},
        'col3': '3c'
    }
])
Here is a sample dataframe. As you can see, col2 is either a dict (which itself contains another nested dict) or a null value, and its nested elements are what I would like to be able to access. (For the nulls, I would want to be able to handle them at any level: the entire cell in the dataframe, or just specific elements in the row.) In this case, they have no ID that could link up to the original dataframe. My end goal would essentially be to have this:
final = pd.DataFrame([
    {
        'col1': 'a',
        'col2.1': 'a1',
        'col2.2.col2.2.1': 'a2.1',
        'col2.2.col2.2.2': 'a2.2',
        'col3': '3a'
    },
    {
        'col1': 'b',
        'col2.1': np.nan,
        'col2.2.col2.2.1': np.nan,
        'col2.2.col2.2.2': np.nan,
        'col3': '3b'
    },
    {
        'col1': 'c',
        'col2.1': 'c1',
        'col2.2.col2.2.1': np.nan,
        'col2.2.col2.2.2': 'c2.2',
        'col3': '3c'
    }
])
In my instance, the dict could have up to 50 nested key-value pairs, and I might only need to access a few of them. Additionally, I have about 50 - 100 other columns of data I need to preserve with these new columns (so an end goal of around 100 - 150). So I suppose there might be two methods I'd be looking for--getting a column for each value in the dict, or getting a column for a select few. The former option I haven't yet found a great workaround for; I've looked at some prior answers but found them to be rather confusing, and most threw errors. This seems especially difficult when there are dicts nested inside of the column. To attempt the second solution, I tried the following code:
def get_val_from_dict(row, col, label):
    if pd.isnull(row[col]):
        return np.nan
    norm = pd.json_normalize(row[col])
    try:
        # norm has a single row, so pull out the scalar value
        return norm[label].iloc[0]
    except KeyError:
        return np.nan

needed_cols = ['col2.1', 'col2.2.col2.2.1', 'col2.2.col2.2.2']
for label in needed_cols:
    df[label] = df.apply(get_val_from_dict, args=('col2', label), axis=1)
This seemed to work for this example, and I'm perfectly happy with the output, but for my actual dataframe, which has substantially more data, this seemed a bit slow and, I would imagine, is not a great or scalable solution. Would anyone be able to offer an alternative to this sluggish approach to resolving the issue I'm having?
(Also, apologies for the massive amount of nesting in my naming here.)

Instead of using apply or pd.json_normalize on the column that holds the dictionary, convert the whole dataframe to a dictionary, use pd.json_normalize on that, and finally pick the fields you wish to keep. This works because, while the dictionary column for any given row may be null, the entire row never is.
Example:
# note that this method also prefixes an extra `col2.`
# at the start of the names of the de-nested data,
# which is not present in the example output;
# the column renaming below conforms to your desired names
import re
final_cols = ['col1', 'col2.col2.1', 'col2.col2.2.col2.2.1', 'col2.col2.2.col2.2.2', 'col3']
out = pd.json_normalize(df.to_dict(orient='records'))[final_cols]
out.rename(columns=lambda x: re.sub(r'^col2\.', '', x), inplace=True)
out
# out:
col1 col2.1 col2.2.col2.2.1 col2.2.col2.2.2 col3
0 a a1 a2.1 a2.2 3a
1 b NaN NaN NaN 3b
2 c c1 NaN c2.2 3c
but for my actual dataframe which had substantially more data, this was quite slow
Right now I have 1000 rows of data, each row has about 100 columns, and the column I want to expand has about 50 nested key/value pairs in it. I would expect that the data could scale up to 100k rows with the same number of columns over the next year or so, and so I'm hoping to have a scalable process ready to go at that point.
pd.json_normalize should be faster than your attempt, but it is still not as fast as doing the flattening in pure Python, so you might get more performance if you wrote a custom transform function and constructed the dataframe as below.
out = pd.DataFrame(transform(x) for x in df.to_dict(orient='records'))
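For example, here is a minimal sketch of what such a transform function could look like; this flattening helper is an illustrative assumption on my part, not code from the original answer:

def transform(record, sep='.'):
    # flatten one record's nested dicts into a single flat dict of scalars
    flat = {}
    def _flatten(obj, prefix=''):
        if isinstance(obj, dict):
            for key, value in obj.items():
                _flatten(value, f'{prefix}{key}{sep}')
        else:
            # scalars (including NaN for a missing dict) are kept as-is
            flat[prefix[:-len(sep)]] = obj
    _flatten(record)
    return flat

out = pd.DataFrame(transform(x) for x in df.to_dict(orient='records'))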

Related

Pandas series: Only keep the first entry that contains a given character (comma)

I scraped a table of SEC filings and extracted a specific row as a pandas Series.
The tables are not very standardized in their formatting, which makes scraping quite hard, as unwanted information is extracted as well.
Take for example the following series I scraped from a table:
series = {'A': "3,360,003|", 'B': "(17) |", 'C': "16.8|"}
series = pd.Series(data=series, index=['A', 'B', 'C'])
The only information that is relevant for me is the entries that contain commas. Is there a way to remove all entries of the series that don't contain commas?
There may be cases where there is more than one entry with commas, e.g.
series = {'A': "3,360,003|", 'B': "(17,424,32) |", 'C': "16.8|"}
series = pd.Series(data=series, index=['A', 'B', 'C'])
In this case, the first entry that contains commas should be kept while all others should be removed.
Help is much appreciated.
If you really want to work with Series methods, the approach would be:
series[series.str.contains(',')].iloc[0]
However, this requires checking all elements, just to keep one.
A much more efficient approach (depending on the exact data, there might be edge cases where this isn't true) would be to use filter and next to get the first matching element. This is more than 100 times faster on the provided example.
next(filter(lambda x: ',' in x, series))
Output: '3,360,003|'
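If it is possible that no entry contains a comma, next would raise StopIteration; here is a small sketch of the same idea with a fallback value (using np.nan as the placeholder is my assumption):

import numpy as np
first_with_comma = next(filter(lambda x: ',' in x, series), np.nan)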
Use .str.contains() as a boolean indexer:
s = series[series.str.contains(',', na=False)]

Spark dataframe to dict with set

I'm having an issue with the output of my Spark dataframe. The file can range from a few GB to 50+ GB.
SparkDF = spark.read.format("csv").options(header="true", delimiter="|", maxColumns="100000").load("my_file.csv")
This gives me the correct DF that I want. But as per the requirement, I need to have the column name as the key and all the values of that column in a set as the value.
For example:
d = {'col1': ['1', '2', '3', '4'], 'col2': ['Jean', 'Cecil', 'Annie', 'Maurice'], 'col3': ['test', 'aaa', 'bbb', 'ccc', 'ddd']}
df = pd.DataFrame(data=d)
Should give me at the end:
{'col1': {'1', '2', '3', '4'}, 'col2': {'Jean', 'Cecil', 'Annie', 'Maurice'}, 'col3': {'test', 'aaa', 'bbb', 'ccc', 'ddd'}}
I've implemented the following:
def columnDict(dataFrame):
    colDict = dict(zip(dataFrame.schema.names, zip(*dataFrame.collect())))
    return colDict if colDict else dict.fromkeys(dataFrame.schema.names, ())
However, it returned a dict with tuples as values and not sets as I require.
I would like either to convert the tuples in the dictionary into sets, or to directly get a dictionary of sets as the output of my function.
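For the tuple-to-set conversion itself, a minimal sketch, assuming colDict as built above:

colDict = {key: set(values) for key, values in colDict.items()}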
EDIT:
For the full requirements:
Besides the dictionary mentioned above, there is another one that contains similar data for checking.
This means that the file I load into a Spark DF and transform into a dictionary contains data that must be checked against the other dictionary.
The goal is to check every key from my dict (the loaded file) against the check dictionary: first to see whether it exists, then, if it does, to check whether the values of that key are a subset of the check values.
If I load the check data into a dataframe it would look like this (note that I may not be able to change the fact that it's a dict; I will see if I can switch from a dict to a Spark DF):
df = {'KeyName': ['col1', 'col2', 'col3'], 'ValueName': ['1, 2, 3, 4', 'Jean, Cecil, Annie, Maurice, Annie, Maurice', 'test, aaa, bbb, ccc,ddd,eee']}
df = pd.DataFrame(data=df)
print(df)
KeyName ValueName
0 col1 1, 2, 3, 4
1 col2 Jean, Cecil, Annie, Maurice, Annie, Maurice
2 col3 test, aaa, bbb, ccc,ddd,eee
So in the end, the data in my file should be a subset of a row that has the same KeyName as in my dict.
I'm slightly stuck with legacy code and struggling a little bit to migrate it to Spark on Databricks.
EDIT 2:
Hopefully this will work. I uploaded the two files with modified data:
https://filebin.net/1rnnvqn2b0ww7qc8
FakeData.csv contains the data that I loaded on my side with the above code and must be a subset of the second one.
FakeDataChecker.csv contains the data that is the actual full set available
EDIT 3:
Forgot to add that all empty strings in FakeData should not be taken into account, nor should the ones in FakeDataChecker.
So I'm not sure I have understood your use case perfectly, but let's try a first draft.
From what I understand, you have a first file with all your data, and a checker file with the keys that need to be present in the data for each column; additional keys present in the data should be filtered out.
This can be done with an inner join between your initial data and the data checker. If there aren't too many keys in the data checker, Spark should automatically broadcast the data checker dataframe to optimize the joins.
Here is a first draft of the code; it isn't completely automated yet, pending your first questions and remarks.
First, let's import the needed functions and load the data:
from pyspark.sql.functions import col
from pyspark.sql import Window
spark.sql("set spark.sql.caseSensitive=true")
data = (
    spark
    .read
    .format("csv")
    .options(header=True, delimiter="|", maxColumns="100000")
    .load("FakeData.csv")
    .na.drop()
)
data_checker = (
    spark
    .read
    .format("csv")
    .options(header=True, delimiter="|", maxColumns="100000")
    .load("FakeDataChecker.csv")
    .na.drop(subset=["ValueName"])
)
We drop null values as you need; you can specify the wanted columns with the subset keyword.
Then let's prepare the join dataframes:
data_checker_date = data_checker.filter(col("KeyName") == "DATE").select(col("ValueName").alias("date"))
data_checker_location = data_checker.filter(col("KeyName") == "LOCATION").select(col("ValueName").alias("location"))
data_checker_location_id = data_checker.filter(col("KeyName") == "LOCATIONID").select(col("ValueName").alias("locationid"))
data_checker_type = data_checker.filter(col("KeyName") == "TYPE").select(col("ValueName").alias("type"))
We need to alias the columns during the joins to avoid duplicated column names, and we set the case-sensitive option so that when we drop the lowercase columns we don't drop the initial ones in CAPS.
Finally, we filter out, through inner joins, all keys not present in the data checker:
(
    data
    .join(data_checker_date, data.DATE == data_checker_date.date)
    .join(data_checker_location, data.LOCATION == data_checker_location.location)
    .join(data_checker_location_id, data.LOCATIONID == data_checker_location_id.locationid)
    .join(data_checker_type, data.TYPE == data_checker_type.type)
    .drop("date", "location", "locationid", "type")
    .show()
)
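If Spark does not broadcast the small dataframes on its own, a broadcast hint can be added explicitly; a minimal sketch for one of the joins, using the same dataframes as above:

from pyspark.sql.functions import broadcast

data.join(broadcast(data_checker_date), data.DATE == data_checker_date.date)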
In a next step, we can automate this by retrieving the distinct KeyNames of the columns (e.g. "DATE", "LOCATION", etc.), so that we don't have to copy-paste the code 4 times, or X times in the future.
Something along the lines of:
from pyspark.sql.functions import collect_set

distinct_keynames = data_checker.select(collect_set('KeyName').alias('KeyName')).first()['KeyName']
for keyname in distinct_keynames:
    ...  # implement the logic of chaining the joins
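Here is a minimal sketch of what that loop could look like; the lowercased alias naming is my assumption, since the original answer leaves this part open:

result = data
for keyname in distinct_keynames:
    checker_values = (
        data_checker
        .filter(col("KeyName") == keyname)
        .select(col("ValueName").alias(keyname.lower()))
    )
    # keep only rows whose value for this column exists in the checker
    result = result.join(checker_values, result[keyname] == checker_values[keyname.lower()])
    result = result.drop(keyname.lower())
result.show()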

Pandas DataFrame automatically takes the wrong value as index

I tried to create DataFrames from a JSON file.
I have a list named "Series_participants" containing a part of this JSON file. My list looks like this when I print it:
participantId 1
championId 76
stats {'item0': 3265, 'item2': 3143, 'totalUnitsHeal...
teamId 100
timeline {'participantId': 1, 'csDiffPerMinDeltas': {'1...
spell1Id 4
spell2Id 12
highestAchievedSeasonTier SILVER
dtype: object
<class 'list'>
Then I try to convert this list to a DataFrame like this:
pd.DataFrame(Series_participants)
But pandas uses the values of "stats" and "timeline" as the index for the DataFrame. I expected to have an automatic index range (0, ..., n).
EDIT 1:
participantId championId stats teamId timeline spell1Id spell2Id highestAchievedSeasonTier
0 1 76 3265 100 NaN 4 12 SILVER
I want to have a dataframe with "stats" and "timeline" columns containing dicts of their values, as in the Series display.
What is my error?
EDIT 2:
I have tried to create the DataFrame manually, but pandas didn't take my choices into consideration and ended up using the indexes of the "stats" key of the Series.
Here is my code:
for j in range(0, len(df.participants[0])):
    for i in range(0, len(df.participants[0][0])):
        Series_participants = pd.Series(df.participants[0][i])
        test = {'participantId': Series_participants.values[0],
                'championId': Series_participants.values[1],
                'stats': Series_participants.values[2],
                'teamId': Series_participants.values[3],
                'timeline': Series_participants.values[4],
                'spell1Id': Series_participants.values[5],
                'spell2Id': Series_participants.values[6],
                'highestAchievedSeasonTier': Series_participants.values[7]}
        if j == 0:
            df_participants = pd.DataFrame(test)
        else:
            df_participants.append(test, ignore_index=True)
The double loop is there to parse all "participants" in my JSON file.
LAST EDIT:
I achieved what I wanted with the following code:
for i in range(0, len(df.participants[0])):
    Series_participants = pd.Series(df.participants[0][i])
    df_test = pd.DataFrame(data=[Series_participants.values],
                           columns=['participantId', 'championId', 'stats', 'teamId', 'timeline',
                                    'spell1Id', 'spell2Id', 'highestAchievedSeasonTier'])
    if i == 0:
        df_participants = pd.DataFrame(df_test)
    else:
        df_participants = df_participants.append(df_test, ignore_index=True)

print(df_participants)
Thanks to all for your help!
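As a side note, DataFrame.append has since been removed in recent pandas versions; a minimal sketch of the same loop that collects the rows and concatenates once, assuming the same df.participants structure as above:

cols = ['participantId', 'championId', 'stats', 'teamId', 'timeline',
        'spell1Id', 'spell2Id', 'highestAchievedSeasonTier']
frames = [pd.DataFrame(data=[pd.Series(p).values], columns=cols)
          for p in df.participants[0]]
df_participants = pd.concat(frames, ignore_index=True)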
For efficiency, you should try and manipulate your data as you construct your dataframe rather than as a separate step.
However, to split apart your dictionary keys and values you can use a combination of numpy.repeat and itertools.chain. Here's a minimal example:
df = pd.DataFrame({'A': [1, 2],
                   'B': [{'key1': 'val0', 'key2': 'val9'},
                         {'key1': 'val1', 'key2': 'val2'}],
                   'C': [{'key3': 'val10', 'key4': 'val8'},
                         {'key3': 'val3', 'key4': 'val4'}]})

import numpy as np
from itertools import chain

chainer = chain.from_iterable

lens = df['B'].map(len)
res = pd.DataFrame({'A': np.repeat(df['A'], lens),
                    'B': list(chainer(df['B'].map(lambda x: x.values())))})
res.index = chainer(df['B'].map(lambda x: x.keys()))
print(res)
print(res)
A B
key1 1 val0
key2 1 val9
key1 2 val1
key2 2 val2
If you try to input lists, series or arrays containing dicts into the object constructor, it doesn't recognise what you're trying to do. One way around this is manually setting:
df.at['a', 'b'] = {'x':value}
Note, the above will only work if the columns and indexes are already created in your DataFrame.
Updated per comments: Pandas data frames can hold dictionaries, but it is not recommended.
Pandas is interpreting that you want one index for each of your dictionary keys and then broadcasting the single-item columns across them.
So to help with what you are trying to do, I would recommend reading your dictionaries' items in as columns, which is what data frames are typically used for and are very good at.
Example error due to pandas trying to read in the dictionary as key/value pairs:
df = pd.DataFrame(columns= ['a', 'b'], index=['a', 'b'])
df.loc['a','a'] = {'apple': 2}
returns
ValueError: Incompatible indexer with Series
Per jpp in the comments below (When using the constructor method):
"They can hold arbitrary types, e.g.
df.iat[0, 0] = {'apple': 2}
However, it's not recommended to use Pandas in this way."
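To illustrate the recommendation above of reading the dictionary items in as columns, here is a minimal sketch; the records and values are illustrative, not taken from the question's data:

import pandas as pd

records = [{'participantId': 1, 'stats': {'item0': 3265, 'item2': 3143}},
           {'participantId': 2, 'stats': {'item0': 1055, 'item2': 3006}}]

# each nested key becomes its own column, e.g. stats.item0 and stats.item2
flat = pd.json_normalize(records)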

Use results from two queries with common key to create a dataframe without having to use merge

data set:
df = pd.DataFrame(np.random.randn(5, 4), columns=['A', 'B', 'C', 'D'],index=['abcd','efgh','abcd','abc123','efgh']).reset_index()
s = pd.Series(data=[True,True,False],index=['abcd','efgh','abc123'], name='availability').reset_index()
(Feel free to remove the reset_index bits above; they are simply there to easily provide a different approach to the problem. However, the resulting datasets from the queries I'm running most closely resemble the above.)
I have two separate queries that return data similar to the above. One query pulls a field from a DB table that has one column of information that does not exist in the other. The 'index' column is the common key across both tables.
My result set needs to have the 2nd query's result series injected into the first query's resulting dataframe at a specific column index.
I know that I can simply run:
df = df.merge(s, how='left', on='index')
Then to enforce column order:
df = df[['index', 'A', 'B', 'availability', 'C', 'D']]
I saw that you can do df.insert, but that requires that the series be the same length as the df.
I'm wondering if there is a way to do this without having to run merge and then enforce column order. With my actual dataset, the number of columns is significantly longer. I'd imagine the best solution likely relies on list manipulation, but I'd much rather do something clever with how the dataframe is created in the first place.
df.set_index(['index','id']).index.map(s['availability'])
is returning:
TypeError: 'Series' object is not callable
s is a dataframe with a multi-index and one column, which is a boolean.
df is a dataframe whose columns make up s's multi-index.
IIUC:
In [260]: df.insert(3, 'availability',
                    df['index'].map(s.set_index('index')['availability']))
In [261]: df
Out[261]:
index A B availability C D
0 abcd 1.867270 0.517894 True 0.584115 -0.162361
1 efgh -0.036696 1.155110 True -1.112075 2.005678
2 abcd 0.693795 -0.843335 True -1.003202 1.001791
3 abc123 -1.466148 -0.848055 False -0.373293 0.360091
4 efgh -0.436618 -0.625454 True -0.285795 -0.220717

Pandas: Use iterrows on Dataframe subset

What is the best way to do iterrows with a subset of a DataFrame?
Let's take the following simple example:
import datetime as DT
import pandas as pd

df = pd.DataFrame({
    'Product': list('AAAABBAA'),
    'Quantity': [5, 2, 5, 10, 1, 5, 2, 3],
    'Start': [
        DT.datetime(2013, 1, 1, 9, 0),
        DT.datetime(2013, 1, 1, 8, 5),
        DT.datetime(2013, 2, 5, 14, 0),
        DT.datetime(2013, 2, 5, 16, 0),
        DT.datetime(2013, 2, 8, 20, 0),
        DT.datetime(2013, 2, 8, 16, 50),
        DT.datetime(2013, 2, 8, 7, 0),
        DT.datetime(2013, 7, 4, 8, 0)]})
df = df.set_index(['Start'])
Now I would like to modify a subset of this DataFrame using the iterrows function, e.g.:
for i, row_i in df[df.Product == 'A'].iterrows():
    row_i['Product'] = 'A1'  # actually a more complex calculation
However, the changes do not persist.
Is there any possibility (other than a manual lookup using the index 'i') to make persistent changes on the original DataFrame?
Why do you need iterrows() for this? I think it's always preferable to use vectorized operations in pandas (or numpy):
df.loc[df['Product'] == 'A', 'Product'] = 'A1'
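If the new value really is a more complex calculation rather than a constant, the same vectorized pattern still applies; a minimal sketch, where complex_calc is a hypothetical stand-in for your actual logic:

def complex_calc(product):
    # hypothetical placeholder for the real calculation
    return product + '1'

mask = df['Product'] == 'A'
df.loc[mask, 'Product'] = df.loc[mask, 'Product'].map(complex_calc)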
I guess the best way that comes to my mind is to generate a new vector with the desired result, where you can loop all you want, and then reassign it back to the column:
# make a copy of the column
P = df.Product.copy()
# do the operation, or loop if you really must
P[P == "A"] = "A1"
# reassign to the original df
df["Product"] = P
