Pandas JSON data parsing projection - python

It seems to me that it would be eminently useful for pandas to support projection (selecting or omitting columns) during data parsing.
Many JSON datasets I find have a ton of extraneous fields I don't need, or I need to parse a specific field in the nested structure.
What I do currently is pipe through jq to create a file that contains only the fields I need. This becomes the "cleaned" file.
I would prefer a method where I didn't have to create a new cleaned file every time I want to look at a particular facet or set of facets, but could instead tell pandas to load the JSON path .data.interesting and project only the fields A, B, and C.
As an example:
{
  "data": {
    "not interesting": ["milk", "yogurt", "dirt"],
    "interesting": [{ "A": "moonlanding", "B": "1956", "C": 100000, "D": "meh" }]
  }
}

Unfortunately, it seems like there's no easy way to do it on load, but if you're okay with doing it immediately after...
# drop by index
df.drop(df.columns[[1, 2]], axis=1, inplace=True)
# drop by name
df.drop(['B', 'C'], axis=1, inplace=True)
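If the extra step is acceptable, you can also descend to the path you care about before building the frame at all. A minimal sketch of that load-then-project approach, assuming the JSON above is stored in a hypothetical file named data.json:

import json
import pandas as pd

# "data.json" is a hypothetical file containing the JSON shown in the question
with open("data.json") as fh:
    raw = json.load(fh)

# walk down to .data.interesting, normalize it, and keep only the wanted fields
df = pd.json_normalize(raw["data"]["interesting"])[["A", "B", "C"]]
print(df)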

Related

How to import Excel data in pandas as list?

My Excel worksheet has some columns, one of which is a Python list-like column. If I import this Excel data using pandas.read_excel, is it possible for pandas to recognize that column as a list at this stage or later? I am asking because I have comma-separated values residing in Excel and I want to use pandas' explode() after importing the Excel file.
I tried to wrap the Excel cells with [""] but the importing and exploding did not work as desired. Any guidance?
Thanks!
data = {
    "Name": ["A", "B", "C", "D"],
    "Product Sold": [["Apple", "Banana"], ["Apple", "Pear"], ["Pear"], ["Berry"]],
    "Prices": [[5, 6], [5, 8], [4], [3]],
}
df = pd.DataFrame(data)
df.explode(['Product Sold', 'Prices'])
You could try something like this:
import pandas as pd

data = {
    "Name": "Apple,Pear",
}
df = pd.DataFrame(data, index=[1])
for c in df.columns:
    # only split columns whose cells actually contain a comma
    if df[c].str.contains(',').any():
        df[c] = df[c].apply(lambda x: str(x).split(','))
print(type(df.Name.iloc[0]))
Read in your excel file, then pass it through the for loop above and it should make lists out of comma-delimited cells.
Let me know if it helps.
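Putting it together with the import step, a rough sketch of the whole pipeline, assuming a hypothetical file products.xlsx with comma-delimited cells in the two list-like columns (multi-column explode needs pandas 1.3+ and lists of equal length per row):

import pandas as pd

# "products.xlsx" is a hypothetical file with comma-delimited cells
df = pd.read_excel("products.xlsx")
for c in ["Product Sold", "Prices"]:
    df[c] = df[c].astype(str).str.split(",")
df = df.explode(["Product Sold", "Prices"])
print(df)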

Converting JSON data into a dataframe

I'm analyzing club participation, getting the data as JSON through a URL request. This is the JSON I get and load with json.loads:
df = [{"club_id":"1234", "sum_totalparticipation":227, "level":1, "idsubdatatable":1229, "segment": "club_id==1234;eventName==national%2520participation,eventName==local%2520partipation,eventName==global%2520participation", "subtable":[{"label":"national participation", "sum_events_totalevents":105,"level":2},{"label":"local participation","sum_events_totalevents":100,"level":2},{"label":"global_participation","sum_events_totalevents":22,"level":2}]}]
When I use json_normalize, this is how df looks:
so the specific participations are aggregated and only the sum is available; I need them flattened, with global/national/local participation in separate rows.
Can you help by providing code?
If you want to see the details of the subtable field (which is another list of dictionaries itself), then you can do the following:
...
df = pd.DataFrame(*data)
for i in range(len(df)):
    df.loc[i, 'label'] = df.loc[i, 'subtable']['label']
    df.loc[i, 'sum_events_totalevents'] = df.loc[i, 'subtable']['sum_events_totalevents']
    df.loc[i, 'sublevel'] = int(df.loc[i, 'subtable']['level'])
Note: I purposely renamed the level field inside the subtable to sublevel, because there is already a column named level in the dataframe; this avoids a name conflict.
The data you show us after your json.loads looks quite dirty: some quotes appear to be missing, especially after "segment": "club_id==1234", and the ; separator there does not match the key separator used inside a dict.
Nonetheless, let's consider the data you get is supposed to look like this (a list of dictionaries):
import pandas as pd
data = [{"club_id":"1234", "sum_totalparticipation":227, "level":1, "idsubdatatable":1229, "segment": "club_id==1234;eventName==national%2520participation,eventName==local%2520partipation,eventName==global%2520participation",
"subtable":[{"label":"national participation", "sum_events_totalevents":105,"level":2},{"label":"local participation","sum_events_totalevents":100,"level":2},{"label":"global_participation","sum_events_totalevents":22,"level":2}]}]
You can see the result with rows separated by unpacking your data inside a DataFrame:
df = pd.DataFrame(*data)
This is the table we get:
Hope this helps
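As an alternative sketch (not part of the answer above), pd.json_normalize can expand the subtable list directly via record_path and meta; the record_prefix avoids a clash between the top-level level and the level inside each subtable entry:

import pandas as pd

# flatten the subtable entries into rows, carrying the top-level fields along
flat = pd.json_normalize(
    data,
    record_path="subtable",
    meta=["club_id", "sum_totalparticipation", "level", "idsubdatatable"],
    record_prefix="subtable.",
)
print(flat)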

PySpark problem flattening array with nested JSON and other elements

I'm struggling with the correct syntax to flatten some data.
I have a dlt table with a column (named lorem for the sake of the example) where each row looks like this:
[{"field1": {"field1_1": null, "field1_2": null},
"field2": "blabla", "field3": 13209914,
"field4": {"field4_1": null, "field4_2": null}, "field5": 4}, ...
]
I want my output to create a new table based on the first that basically creates a row per each element in the array I shared above.
Table should look like:
|field1_1|field1_2|field2|field3|field4_1|field4_2|field5|
|:-------|:-------|:-----|:-----|:-------|:-------|:------|
|null|null|blabla|13209914|null|null|4|
However, when I explode like select(explode("lorem")), I do not get the wanted output; instead I only get the exploded column and the other fields, but not anything inside field4.
My question is, in what other way should I be flattening this data?
I can provide a clearer example if needed.
Use withColumn to add the additional columns you need. A simple example:
%%pyspark
from pyspark.sql.functions import col
df = spark.read.json("abfss://somelake@somestorage.dfs.core.windows.net/raw/flattenJson.json")
df2 = df \
    .withColumn("field4_1", col("field4.field4_1")) \
    .withColumn("field4_2", col("field4.field4_2"))
df2.show()
My results:
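For the array case in the question, one common pattern (a sketch, assuming the source DataFrame has the array column lorem with the structure shown) is to explode first and then select the nested struct fields:

from pyspark.sql.functions import col, explode

# explode the array so each element becomes its own row, then pull out nested fields
exploded = df.select(explode(col("lorem")).alias("item"))
flat = exploded.select(
    col("item.field1.field1_1").alias("field1_1"),
    col("item.field1.field1_2").alias("field1_2"),
    col("item.field2").alias("field2"),
    col("item.field3").alias("field3"),
    col("item.field4.field4_1").alias("field4_1"),
    col("item.field4.field4_2").alias("field4_2"),
    col("item.field5").alias("field5"),
)
flat.show()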

Expand Pandas DataFrame Column with JSON Object

I'm looking for a clean, fast way to expand a pandas dataframe column which contains a json object (essentially a dict of nested dicts), so I could have one column for each element in the json column in json normalized form; however, this needs to retain all of the original dataframe columns as well. In some instances, this dict might have a common identifier I could use to merge with the original dataframe, but not always. For example:
import pandas as pd
import numpy as np
df = pd.DataFrame([
{
'col1': 'a',
'col2': {'col2.1': 'a1', 'col2.2': {'col2.2.1': 'a2.1', 'col2.2.2': 'a2.2'}},
'col3': '3a'
},
{
'col1': 'b',
'col2': np.nan,
'col3': '3b'
},
{
'col1': 'c',
'col2': {'col2.1': 'c1', 'col2.2': {'col2.2.1': np.nan, 'col2.2.2': 'c2.2'}},
'col3': '3c'
}
])
Here is a sample dataframe. As you can see, col2 is in each case either a dict (which itself contains another nested dict whose elements I would like to be able to access) or a null value. (For the nulls, I would want to be able to handle them at any level: entire elements in the dataframe, or just specific elements in the row.) In this case, they have no ID that could link up to the original dataframe. My end goal would be essentially to have this:
final = pd.DataFrame([
{
'col1': 'a',
'col2.1': 'a1',
'col2.2.col2.2.1': 'a2.1',
'col2.2.col2.2.2': 'a2.2',
'col3': '3a'
},
{
'col1': 'b',
'col2.1': np.nan,
'col2.2.col2.2.1': np.nan,
'col2.2.col2.2.2': np.nan,
'col3': '3b'
},
{
'col1': 'c',
'col2.1': 'c1',
'col2.2.col2.2.1': np.nan,
'col2.2.col2.2.2': 'c2.2',
'col3': '3c'
}
])
In my instance, the dict could have up to 50 nested key-value pairs, and I might only need to access a few of them. Additionally, I have about 50 - 100 other columns of data I need to preserve with these new columns (so an end goal of around 100 - 150). So I suppose there might be two methods I'd be looking for--getting a column for each value in the dict, or getting a column for a select few. The former option I haven't yet found a great workaround for; I've looked at some prior answers but found them to be rather confusing, and most threw errors. This seems especially difficult when there are dicts nested inside of the column. To attempt the second solution, I tried the following code:
def get_val_from_dict(row, col, label):
    if pd.isnull(row[col]):
        return np.nan
    norm = pd.json_normalize(row[col])
    try:
        return norm[label]
    except KeyError:
        return np.nan

needed_cols = ['col2.1', 'col2.2.col2.2.1', 'col2.2.col2.2.2']
for label in needed_cols:
    df[label] = df.apply(get_val_from_dict, args=('col2', label), axis=1)
This seemed to work for this example, and I'm perfectly happy with the output, but for my actual dataframe which had substantially more data, this seemed a bit slow--and, I would imagine, is not a great or scalable solution. Would anyone be able to offer an alternative to this sluggish approach to resolving the issue I'm having?
(Apologies also for the massive amount of nesting in my naming here. If helpful, I am adding several images of the dataframes below: the original, then the target, and then the current output.)
Instead of using apply or pd.json_normalize on the column that holds a dictionary, convert the whole data frame to a dictionary and use pd.json_normalize on that, finally picking the fields you wish to keep. This works because, while an individual column for a given row may be null, the entire row will not be.
example:
# note that this method also prefixes an extra `col2.`
# at the start of the names of the de-nested data,
# which is not present in the example output;
# the column renaming below restores your desired names.
import re
final_cols = ['col1', 'col2.col2.1', 'col2.col2.2.col2.2.1', 'col2.col2.2.col2.2.2', 'col3']
out = pd.json_normalize(df.to_dict(orient='records'))[final_cols]
out.rename(columns=lambda x: re.sub(r'^col2\.', '', x), inplace=True)
out
# out:
col1 col2.1 col2.2.col2.2.1 col2.2.col2.2.2 col3
0 a a1 a2.1 a2.2 3a
1 b NaN NaN NaN 3b
2 c c1 NaN c2.2 3c
but for my actual dataframe which had substantially more data, this was quite slow
Right now I have 1000 rows of data, each row has about 100 columns, and then the column I want to expand has about 50 nested key/value pairs in it. I would expect that the data could scale up to 100k rows with the same number of columns over the next year or so, and so I'm hoping to have a scalable process ready to go at that point
pd.json_normalize should be faster than your attempt, but it is not faster than doing the flattening in pure Python, so you might get more performance if you write a custom transform function and construct the dataframe as below.
out = pd.DataFrame(transform(x) for x in df.to_dict(orient='records'))
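transform above is a hypothetical helper you would write yourself; a minimal sketch for the col2 layout in the question, keeping only the few nested keys needed:

import numpy as np
import pandas as pd

def transform(record):
    # hypothetical helper: flatten only the nested keys we care about
    col2 = record.get("col2")
    if not isinstance(col2, dict):   # handles np.nan / missing col2
        col2 = {}
    col22 = col2.get("col2.2")
    if not isinstance(col22, dict):
        col22 = {}
    return {
        "col1": record.get("col1"),
        "col2.1": col2.get("col2.1", np.nan),
        "col2.2.col2.2.1": col22.get("col2.2.1", np.nan),
        "col2.2.col2.2.2": col22.get("col2.2.2", np.nan),
        "col3": record.get("col3"),
    }

out = pd.DataFrame(transform(x) for x in df.to_dict(orient="records"))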

Python pandas: How many values of one series are in another?

I have two pandas dataframes:
df1 = pd.DataFrame(
    {
        "col1": ["1", "2", np.nan, "3"],
    }
)
df2 = pd.DataFrame(
    {
        "col1": [2.0, 3.0, 4.0, np.nan],
    }
)
I would like to know how many values of df1.col1 exist in df2.col1. In this case it should be 2, as I want "2" and 2.0 to be seen as equal.
I do have a working solution, but because I think I'll need this more often (and for learning purposes, of course), I wanted to ask whether there is a more comfortable way to do it.
df1.col1[df1.col1.notnull()].isin(df2.col1[df2.col1.notnull()].astype(int).astype(str)).value_counts()
Use Series.dropna and convert to floats if you are working with integers and missing values:
a = df1.col1.dropna().astype(float).isin(df2.col1.dropna()).value_counts()
Or:
a = df1.col1.dropna().isin(df2.col1.dropna().astype(int).astype(str)).value_counts()
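If you only need the number of matches rather than the True/False breakdown, summing the boolean mask gives it directly; a small sketch using the example frames above:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"col1": ["1", "2", np.nan, "3"]})
df2 = pd.DataFrame({"col1": [2.0, 3.0, 4.0, np.nan]})

# number of values of df1.col1 that also appear in df2.col1 (2 in this example)
n_matches = df1.col1.dropna().astype(float).isin(df2.col1.dropna()).sum()
print(n_matches)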
