Spark dataframe to dict with set - python

I'm having an issue with the output of my Spark dataframe. The file can range from a few GB to 50+ GB.
SparkDF = spark.read.format("csv").options(header="true", delimiter="|", maxColumns="100000").load("my_file.csv")
This gives me the DataFrame that I want. But as per the requirements, I need a dictionary with each column name as a key and all of that column's values in a set.
For example:
d = {'col1': ['1', '2', '3', '4'], 'col2': ['Jean', 'Cecil', 'Annie', 'Maurice'], 'col3': ['test', 'aaa', 'bbb', 'ccc', 'ddd']}
df = pd.DataFrame(data=d)
Should give me at the end:
{'col1': {'1', '2', '3', '4'}, 'col2': {'Jean', 'Cecil', 'Annie', 'Maurice'}, 'col3': {'test', 'aaa', 'bbb', 'ccc', 'ddd'}}
I've implemented the following:
def columnDict(dataFrame):
    colDict = dict(zip(dataFrame.schema.names, zip(*dataFrame.collect())))
    return colDict if colDict else dict.fromkeys(dataFrame.schema.names, ())
However, it returns a dict with tuples as values, not sets as I require.
I would like either to convert the tuples in the dictionary into sets, or to have my function return a dictionary of sets directly.
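To illustrate what I mean, post-processing the current output would be something along these lines (just a sketch):
colDict = columnDict(SparkDF)
# Turn each tuple of values into a set
colDictAsSets = {name: set(values) for name, values in colDict.items()}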
EDIT:
For the full requirements:
Beside the dictionary mentioned above, there is another one that contains similar data for checking.
This means that the file I load into a Spark DF and transform into a dictionary contains data that must be checked against the other dictionary.
The goal is to check every key from my dict (the loaded file) against the check dictionary: first to see if it exists, and then, if it does, to check whether its values are a subset of the check values.
If I load the check data in a dataframe it would look like this (note that I may not be able to change the fact that it's a dict; I will see if I can change it from a dict to a Spark DF):
df = {'KeyName': ['col1', 'col2', 'col3'], 'ValueName': ['1, 2, 3, 4', 'Jean, Cecil, Annie, Maurice, Annie, Maurice', 'test, aaa, bbb, ccc,ddd,eee']}
df = pd.DataFrame(data=df)
print(df)
KeyName ValueName
0 col1 1, 2, 3, 4
1 col2 Jean, Cecil, Annie, Maurice, Annie, Maurice
2 col3 test, aaa, bbb, ccc,ddd,eee
So at the end, the data in my file should be a subset of the row that has the same KeyName as my dict key.
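To illustrate the check I have in mind, here is a rough sketch assuming both sides were plain Python dicts of sets (which is not exactly how the check data arrives, see above):
def is_valid(data_dict, check_dict):
    # Every key from the loaded file must exist in the checker,
    # and its values must be a subset of the checker's values.
    for key, values in data_dict.items():
        if key not in check_dict:
            return False
        if not set(values) <= set(check_dict[key]):
            return False
    return True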
I'm slightly stuck with legacy code and I'm struggling a little bit to migrate it to Spark on Databricks.
EDIT 2:
Hopefully this will work. I uploaded the 2 files with modified data:
https://filebin.net/1rnnvqn2b0ww7qc8
FakeData.csv contains the data that I load on my side with the above code, and that must be a subset of the second one.
FakeDataChecker.csv contains the data that is the actual full set available.
EDIT 3:
Forgot to add that empty strings in FakeData should not be taken into account, nor should those in FakeDataChecker.

So I'm not sure I have understood your use case perfectly, but let's try a first draft.
From what I understand, you have a first file with all your data, and a checker file with the keys that need to be present in the data for each column; additional keys present in the data should be filtered out.
This could be done with an inner join between your initial data and the data checker. If there aren't too many keys in the data checker, Spark should automatically broadcast the data checker dataframe for optimized joins.
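If you ever need to force it, pyspark also exposes an explicit broadcast hint; a minimal sketch would be (small_checker_df is a placeholder standing for any of the per-key checker dataframes built below):
from pyspark.sql.functions import broadcast

# Explicitly mark the small checker dataframe for broadcasting in the join;
# small_checker_df stands for one of the per-key checker dataframes defined below
joined = data.join(broadcast(small_checker_df), data.DATE == small_checker_df.date)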
Here is a first draft of the code; it isn't yet completely automated, pending your first questions and remarks.
First let's import the needed functions and the data:
from pyspark.sql.functions import col

spark.sql("set spark.sql.caseSensitive=true")

data = (
    spark
    .read
    .format("csv")
    .options(header=True, delimiter="|", maxColumns="100000")
    .load("FakeData.csv")
    .na.drop()
)

data_checker = (
    spark
    .read
    .format("csv")
    .options(header=True, delimiter="|", maxColumns="100000")
    .load("FakeDataChecker.csv")
    .na.drop(subset=["ValueName"])
)
We drop null values as you need; you can restrict this to specific columns with the subset keyword.
Then let's prepare the dataframes for the joins:
data_checker_date = data_checker.filter(col("KeyName") == "DATE").select(col("ValueName").alias("date"))
data_checker_location = data_checker.filter(col("KeyName") == "LOCATION").select(col("ValueName").alias("location"))
data_checker_location_id = data_checker.filter(col("KeyName") == "LOCATIONID").select(col("ValueName").alias("locationid"))
data_checker_type = data_checker.filter(col("KeyName") == "TYPE").select(col("ValueName").alias("type"))
We need to alias the columns during the joins to avoid duplicated column names, and we set the case-sensitive option so that when we drop the lowercase columns we don't drop the initial ones in CAPS.
Finally, we filter out, through inner joins, all keys not present in the data checker:
(
    data
    .join(data_checker_date, data.DATE == data_checker_date.date)
    .join(data_checker_location, data.LOCATION == data_checker_location.location)
    .join(data_checker_location_id, data.LOCATIONID == data_checker_location_id.locationid)
    .join(data_checker_type, data.TYPE == data_checker_type.type)
    .drop("date", "location", "locationid", "type")
    .show()
)
As a next step, we can automate this by retrieving the distinct KeyNames (e.g. "DATE", "LOCATION", etc.), so that we don't have to copy-paste the code 4 times, or X times in the future.
Something along the lines of:
from pyspark.sql.functions import collect_set

distinct_keynames = data_checker.select(collect_set('KeyName').alias('KeyName')).first()['KeyName']
for keyname in distinct_keynames:
    ...  # implement the logic of chaining joins here
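As a rough sketch of what that loop body could look like (this assumes every distinct KeyName in the checker matches a column name of the data dataframe, and reuses the distinct_keynames computed above):
result = data
for keyname in distinct_keynames:
    # Per-key checker dataframe, aliased in lowercase to avoid name clashes
    checker_values = (
        data_checker
        .filter(col("KeyName") == keyname)
        .select(col("ValueName").alias(keyname.lower()))
    )
    # Inner join keeps only the rows whose value is present in the checker
    result = (
        result
        .join(checker_values, col(keyname) == col(keyname.lower()))
        .drop(keyname.lower())
    )

result.show()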

Related

Normalization and flattening of JSON column in a mixed type dataframe

The dataframe below has columns with mixed types. The column of interest for expansion is "Info". Each row value in this column is a JSON object.
data = {
    'Code': ['001', '002', '003', '004'],
    'Info': [
        '{"id":001,"x_cord":[1,1,1,1],"x_y_cord":[4.703978,-39.601876],"neutral":1,"code_h":"S38A46","group":null}',
        '{"id":002,"x_cord":[2,1,3,1],"x_y_cord":[1.703978,-38.601876],"neutral":2,"code_h":"S17A46","group":"New"}',
        '{"id":003,"x_cord":[1,1,4,1],"x_y_cord":[112.703978,-9.601876],"neutral":4,"code_h":"S12A46","group":"Old"}',
        '{"id":004,"x_cord":[2,1,7,1],"x_y_cord":[6.703978,-56.601876],"neutral":1,"code_h":"S12A46","group":null}'
    ],
    'Region': ['US', 'Pacific', 'Africa', 'Asia']
}
df = pd.DataFrame(data)
I would like to have the headers expanded i.e. have "Info.id","info.x_y_cord","info.neutral" etc as individual columns with corresponding values under them across the dataset. I've tried normalizing them via pd.json_normalize(df["Info"]) iteration but nothing seems to change. Do I need to convert the column to another type first? Can someone point me to the right direction?
The output should be something like this:
data1 = {
    'Code': ['001', '002', '003', '004'],
    'Info.id': ['001', '002', '003', '004'],
    'Info.x_cord': ['[1,1,1,1]', '[2,1,3,1]', '[1,1,4,1]', '[2,1,7,1]'],
    'Info.x_y_cord': ['[4.703978,-39.601876]', '[1.703978,-38.601876]', '[112.703978,-9.601876]', '[6.703978,-56.601876]'],
    'Info.neutral': [1, 2, 4, 1],
    'Info.code_h': ['S38A46', 'S17A46', 'S12A46', 'S12A46'],
    'Info.group': [np.NaN, "New", "Old", np.NaN],
    'Region': ['US', 'Pacific', 'Africa', 'Asia']
}
df_final = pd.DataFrame(data1)
First of all, your JSON strings are not valid because of the ID values: 001 is not valid JSON, so you'll need to pass the "id" value as a string instead. Here's one way to do that:
def id_as_string(matchObj):
    # Adds " around the ID value
    return f"\"id\":\"{matchObj.group(1)}\","

df["Info"] = df["Info"].str.replace(r"\"id\":(\d*),", repl=id_as_string, regex=True)
Once you've done that, you can use pd.json_normalize on your "Info" column after you've loaded the values from the JSON strings using json.loads:
import json
json_part_df = pd.json_normalize(df["Info"].map(json.loads))
After that, just rename the columns and use pd.concat to form the output dataframe:
# Rename columns
json_part_df.columns = [f"Info.{column}" for column in json_part_df.columns]
# Use pd.concat to create output
df = pd.concat([df[["Code", "Region"]], json_part_df], axis=1)

Expand Pandas DataFrame Column with JSON Object

I'm looking for a clean, fast way to expand a pandas dataframe column which contains a json object (essentially a dict of nested dicts), so I could have one column for each element in the json column in json normalized form; however, this needs to retain all of the original dataframe columns as well. In some instances, this dict might have a common identifier I could use to merge with the original dataframe, but not always. For example:
import pandas as pd
import numpy as np

df = pd.DataFrame([
    {
        'col1': 'a',
        'col2': {'col2.1': 'a1', 'col2.2': {'col2.2.1': 'a2.1', 'col2.2.2': 'a2.2'}},
        'col3': '3a'
    },
    {
        'col1': 'b',
        'col2': np.nan,
        'col3': '3b'
    },
    {
        'col1': 'c',
        'col2': {'col2.1': 'c1', 'col2.2': {'col2.2.1': np.nan, 'col2.2.2': 'c2.2'}},
        'col3': '3c'
    }
])
Here is a sample dataframe. As you can see, col2 is a dict in all of these cases which has another nested dict inside of it, or could be a null value, containing nested elements I would like to be able to access. (For the nulls, I would want to be able to handle them at any level--entire elements in the dataframe, or just specific elements in the row.) In this case, they have no ID that could link up to the original dataframe. My end goal would be essentially to have this:
final = pd.DataFrame([
    {
        'col1': 'a',
        'col2.1': 'a1',
        'col2.2.col2.2.1': 'a2.1',
        'col2.2.col2.2.2': 'a2.2',
        'col3': '3a'
    },
    {
        'col1': 'b',
        'col2.1': np.nan,
        'col2.2.col2.2.1': np.nan,
        'col2.2.col2.2.2': np.nan,
        'col3': '3b'
    },
    {
        'col1': 'c',
        'col2.1': 'c1',
        'col2.2.col2.2.1': np.nan,
        'col2.2.col2.2.2': 'c2.2',
        'col3': '3c'
    }
])
In my instance, the dict could have up to 50 nested key-value pairs, and I might only need to access a few of them. Additionally, I have about 50 - 100 other columns of data I need to preserve with these new columns (so an end goal of around 100 - 150). So I suppose there might be two methods I'd be looking for--getting a column for each value in the dict, or getting a column for a select few. The former option I haven't yet found a great workaround for; I've looked at some prior answers but found them to be rather confusing, and most threw errors. This seems especially difficult when there are dicts nested inside of the column. To attempt the second solution, I tried the following code:
def get_val_from_dict(row, col, label):
    if pd.isnull(row[col]):
        return np.nan
    norm = pd.json_normalize(row[col])
    try:
        return norm[label]
    except:
        return np.nan

needed_cols = ['col2.1', 'col2.2.col2.2.1', 'col2.2.col2.2.2']
for label in needed_cols:
    df[label] = df.apply(get_val_from_dict, args=('col2', label), axis=1)
This seemed to work for this example, and I'm perfectly happy with the output, but for my actual dataframe which had substantially more data, this seemed a bit slow--and, I would imagine, is not a great or scalable solution. Would anyone be able to offer an alternative to this sluggish approach to resolving the issue I'm having?
(Also, apologies about the massive amount of nesting in my naming here. If helpful, I am adding several images of the dataframes below: the original, then the target, and then the current output.)
Instead of using apply or pd.json_normalize on the column that has a dictionary, convert the whole dataframe to a dictionary, use pd.json_normalize on that, and finally pick the fields you wish to keep. This works because while the individual column for any given row may be null, the entire row would not be.
Example:
# note that this method also prefixes an extra `col2.`
# at the start of the names of the denested data,
# which is not present in the example output
# the column renaming conforms to your desired name.
import re
final_cols = ['col1', 'col2.col2.1', 'col2.col2.2.col2.2.1', 'col2.col2.2.col2.2.2', 'col3']
out = pd.json_normalize(df.to_dict(orient='records'))[final_cols]
out.rename(columns=lambda x: re.sub(r'^col2\.', '', x), inplace=True)
out
# out:
col1 col2.1 col2.2.col2.2.1 col2.2.col2.2.2 col3
0 a a1 a2.1 a2.2 3a
1 b NaN NaN NaN 3b
2 c c1 NaN c2.2 3c
but for my actual dataframe which had substantially more data, this was quite slow
Right now I have 1000 rows of data, each row has about 100 columns, and then the column I want to expand has about 50 nested key/value pairs in it. I would expect that the data could scale up to 100k rows with the same number of columns over the next year or so, and so I'm hoping to have a scalable process ready to go at that point
pd.json_normalize should be faster than your attempt, but it is not faster than doing the flattening in pure Python, so you might get more performance if you wrote a custom transform function and constructed the dataframe as below.
out = pd.DataFrame(transform(x) for x in df.to_dict(orient='records'))
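For reference, a minimal sketch of such a transform function (the recursive flattening and the separator choice are my own assumptions; note it keeps the leading col2. prefix, just like the json_normalize approach above):
def transform(record, parent_key="", sep="."):
    # Recursively flatten nested dicts into a single-level dict
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(transform(value, parent_key=name, sep=sep))
        else:
            flat[name] = value
    return flat

out = pd.DataFrame(transform(x) for x in df.to_dict(orient='records'))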

Append to a pd.DataFrame, dynamically allocating any new columns

I want to aggregate some API responses into a DataFrame.
The request consistently returns a number of JSON key/value pairs, let's say A, B, C. Occasionally, however, it will return A, B, C, D.
I would like something comparable to SQL's OUTER JOIN, which would simply add the new row while filling the missing values in the corresponding columns with NULL or some other placeholder.
The pandas join options insist upon imposing a unique suffix for each side, and I really don't want this.
Am I looking at this the wrong way?
If there is no easy solution, I could just select a subset of the consistently available columns but I really wanted to download the lot and do the processing as a separate stage.
You can use pandas.concat, as it provides all the functionality required for your problem. Let this toy problem illustrate a possible solution.
import string

import numpy as np
import pandas as pd

# This generates random data with some key and value pairs.
def gen_data(_size):
    keys = list(string.ascii_uppercase)
    return dict((k, [v]) for k, v in zip(np.random.choice(keys, _size), np.random.randint(1000, size=_size)))

counter = 0
df = pd.DataFrame()
while True:
    if counter > 5:
        break
    # Receive the data
    new_data = gen_data(5)
    # Convert it to a dataframe
    new_data = pd.DataFrame(new_data)
    # Append it to the stack
    df = pd.concat((df, new_data), axis=0, sort=True)
    counter += 1

df.reset_index(drop=True, inplace=True)
print(df.to_string())
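Applied to the A, B, C / A, B, C, D case from your question, a minimal sketch of the same idea would be (the response dicts here are made up for illustration):
# Two hypothetical API responses; the second one carries the extra key D
resp1 = {'A': 1, 'B': 2, 'C': 3}
resp2 = {'A': 4, 'B': 5, 'C': 6, 'D': 7}

frames = [pd.DataFrame([resp]) for resp in (resp1, resp2)]
combined = pd.concat(frames, axis=0, sort=True).reset_index(drop=True)
# The missing D value in the first response shows up as NaN
print(combined)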

Pandas read_csv into multiple DataFrames

I have some data in a text file that I am reading into Pandas. A simplified version of the txt read in is:
idx_level1|idx_level2|idx_level3|idx_level4|START_NODE|END_NODE|OtherData...
353386066294006|1142|2018-09-20T07:57:26Z|1|18260004567689|18260005575180|...
353386066294006|1142|2018-09-20T07:57:26Z|2|18260004567689|18260004240718|...
353386066294006|1142|2018-09-20T07:57:26Z|3|18260005359901|18260004567689|...
353386066294006|1142|2018-09-20T07:57:31Z|1|18260004567689|18260005575180|...
353386066294006|1142|2018-09-20T07:57:31Z|2|18260004567689|18260004240718|...
353386066294006|1142|2018-09-20T07:57:31Z|3|18260005359901|18260004567689|...
353386066294006|1142|2018-09-20T07:57:36Z|1|18260004567689|18260005575180|...
353386066294006|1142|2018-09-20T07:57:36Z|2|18260004567689|18260004240718|...
353386066294006|1142|2018-09-20T07:57:36Z|3|18260005359901|18260004567689|...
353386066736543|22|2018-04-17T07:08:23Z||||...
353386066736543|22|2018-04-17T07:08:24Z||||...
353386066736543|22|2018-04-17T07:08:25Z||||...
353386066736543|22|2018-04-17T07:08:26Z||||...
353386066736543|403|2018-07-02T16:55:07Z|1|18260004580350|18260005235340|...
353386066736543|403|2018-07-02T16:55:07Z|2|18260005235340|18260005141535|...
353386066736543|403|2018-07-02T16:55:07Z|3|18260005235340|18260005945439|...
353386066736543|403|2018-07-02T16:55:07Z|4|18260006215338|18260005235340|...
353386066736543|403|2018-07-02T16:55:07Z|5|18260004483352|18260005945439|...
353386066736543|403|2018-07-02T16:55:07Z|6|18260004283163|18260006215338|...
353386066736543|403|2018-07-02T16:55:01Z|1|18260004580350|18260005235340|...
353386066736543|403|2018-07-02T16:55:01Z|2|18260005235340|18260005141535|...
353386066736543|403|2018-07-02T16:55:01Z|3|18260005235340|18260005945439|...
353386066736543|403|2018-07-02T16:55:01Z|4|18260006215338|18260005235340|...
353386066736543|403|2018-07-02T16:55:01Z|5|18260004483352|18260005945439|...
353386066736543|403|2018-07-02T16:55:01Z|6|18260004283163|18260006215338|...
And the code I use to read in is as follows:
mydata = pd.read_csv('/myloc/my_simple_data.txt', sep='|',
                     dtype={'idx_level1': 'int',
                            'idx_level2': 'int',
                            'idx_level3': 'str',
                            'idx_level4': 'float',
                            'START_NODE': 'str',
                            'END_NODE': 'str',
                            'OtherData...': 'str'},
                     parse_dates=['idx_level3'],
                     index_col=['idx_level1', 'idx_level2', 'idx_level3', 'idx_level4'])
What I really want to do is have a separate pandas DataFrame for each unique idx_level1 & idx_level2 value. So in the above example there would be 3 DataFrames, pertaining to idx_level1|idx_level2 values of 353386066294006|1142, 353386066736543|22 and 353386066736543|403 respectively.
Is it possible to read in a text file like this and output each change in idx_level2 to a new Pandas DataFrame, maybe as part of some kind of loop? Alternatively, what would be the most efficient way to split mydata into DataFrame subsets, given that everything I have read suggests that it is inefficient to iterate through a DataFrame.
Read your dataframe as you are currently doing, then groupby and use a list comprehension:
group = mydata.groupby(level=[0,1])
dfs = [group.get_group(x) for x in group.groups]
You can then call your dataframes by doing dfs[0] and so on.
To specifically address your last paragraph, you could create a dict of dfs based on the unique values in the column, using something like:
import copy

df_dict = {}
col_values = df[column].unique()
for value in col_values:
    key = 'df' + str(value)
    df_dict[key] = copy.deepcopy(df)
    df_dict[key] = df_dict[key][df[column] == value]
    df_dict[key].reset_index(inplace=True, drop=True)
where column = 'idx_level2'
Read the table as-is and use groupby, for instance:
data = pd.read_table('/myloc/my_simple_data.txt', sep='|')
groups = dict()
for group, subdf in data.groupby(data.columns[:2].tolist()):
    groups[group] = subdf
Now you have all the sub-dataframes in a dictionary whose keys are a tuple of the two indexers (e.g. (353386066294006, 1142)).
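For example, to pull out one of the sub-dataframes (the key below is taken from your sample data):
sub_df = groups[(353386066294006, 1142)]
print(sub_df.head())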

Multiple columns with the same name in Pandas

I am creating a dataframe from a CSV file. I have gone through the docs and multiple SO posts and links, as I have just started with Pandas, but I didn't get it. The CSV file has multiple columns with the same name, say a.
So after forming the dataframe, when I do df['a'], which value will it return? It does not return all values.
Also, only one of those columns will have a string value; the rest will be None. How can I get that column?
The relevant parameter is mangle_dupe_cols.
From the docs:
mangle_dupe_cols : boolean, default True
    Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'
By default, all of your 'a' columns get named 'a.0'...'a.N' as specified above.
If you used mangle_dupe_cols=False, importing this csv would produce an error.
You can get all of your 'a' columns with
df.filter(like='a')
Demonstration:
from io import StringIO
import pandas as pd
txt = """a, a, a, b, c, d
1, 2, 3, 4, 5, 6
7, 8, 9, 10, 11, 12"""
df = pd.read_csv(StringIO(txt), skipinitialspace=True)
df
df.filter(like='a')
I had a similar issue, not due to reading from csv, but because I had multiple df columns with the same name (in my case 'id'). I solved it by taking df.columns and resetting the column names using a list.
In : df.columns
Out:
Index(['success', 'created', 'id', 'errors', 'id'], dtype='object')
In : df.columns = ['success', 'created', 'id1', 'errors', 'id2']
In : df.columns
Out:
Index(['success', 'created', 'id1', 'errors', 'id2'], dtype='object')
From here, I was able to call 'id1' or 'id2' to get just the column I wanted.
That's what I usually do with my gene expression dataset, where the same gene name can occur more than once because of a slightly different genetic sequence of the same gene:
Create a list of the duplicated columns in my dataframe (this refers to column names which appear more than once):
duplicated_columns_list = []
list_of_all_columns = list(df.columns)
for column in list_of_all_columns:
    if list_of_all_columns.count(column) > 1 and column not in duplicated_columns_list:
        duplicated_columns_list.append(column)

duplicated_columns_list
Use the .index() function, which finds the first duplicated element on each iteration, to add an underscore suffix to it:
for column in duplicated_columns_list:
    list_of_all_columns[list_of_all_columns.index(column)] = column + '_1'
    list_of_all_columns[list_of_all_columns.index(column)] = column + '_2'
This for loop adds an underscore suffix to all of the duplicated columns, so now every column has a distinct name.
This specific code is relevant for columns that appear exactly twice, but it can be modified for columns that appear more than twice in your dataframe.
Finally, rename your columns with the suffixed names:
df.columns = list_of_all_columns
That's it, I hope it helps :)
Similarly to JDenman6 (and related to your question), I had two df columns with the same name (named 'id').
Hence, calling
df['id']
returns 2 columns.
You can use
df.iloc[:,ind]
where ind corresponds to the index of the column according to how they are ordered in the df. You can find the indices using:
indices = [i for i,x in enumerate(df.columns) if x == 'id']
where you replace 'id' with the name of the column you are searching for.
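Then, for instance, the individual columns can be picked out by position (indices comes from the line above):
first_id = df.iloc[:, indices[0]]
second_id = df.iloc[:, indices[1]]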
