I have a large dataset (5 GB) in JSON format in an S3 bucket.
I need to transform the schema of the data and write the transformed data back to S3 using an ETL script.
So I use a crawler to detect the schema and load the data into a PySpark DataFrame, then change the schema. I then iterate over every row in the DataFrame, convert it to a dictionary, remove the null columns, convert the dictionary to a string, and write it back to S3. Here is the code:
# df is the pyspark dataframe
import json
import boto3

columns = df.columns
print(columns)
s3 = boto3.resource('s3')
cnt = 1
for row in df.rdd.toLocalIterator():
    data = row.asDict(True)
    for col_name in columns:
        if data[col_name] is None:
            del data[col_name]
    content = json.dumps(data)
    object = s3.Object('write-test-transaction-transformed', str(cnt)).put(Body=content)
    cnt = cnt + 1
    print(cnt)
I have used toLocalIterator.
Does the above code execute serially? If so, how can I optimize it? Is there a better approach for implementing this logic?
Assuming each row in the dataset is a JSON string:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def drop_null_cols(data):
    import json
    content = json.loads(data)
    for key, value in list(content.items()):
        if value is None:
            del content[key]
    return json.dumps(content)

drop_null_cols_udf = F.udf(drop_null_cols, StringType())
df = spark.createDataFrame(
    ["{\"name\":\"Ranga\", \"age\":25, \"city\":\"Hyderabad\"}",
     "{\"name\":\"John\", \"age\":null, \"city\":\"New York\"}",
     "{\"name\":null, \"age\":31, \"city\":\"London\"}"],
    "string"
).toDF("data")

df.select(
    drop_null_cols_udf("data").alias("data")
).show(10, False)
If the input DataFrame has regular columns and the output only needs the non-null columns as JSON:
df = spark.createDataFrame(
    [('Ranga', 25, 'Hyderabad'),
     ('John', None, 'New York'),
     (None, 31, 'London')],
    ['name', 'age', 'city']
)

df.withColumn(
    "data", F.to_json(F.struct([x for x in df.columns]))
).select(
    drop_null_cols_udf("data").alias("data")
).show(10, False)

# df.write.format("csv").save("s3://path/to/file/")  # save to s3
which results in:
+-------------------------------------------------+
|data |
+-------------------------------------------------+
|{"name": "Ranga", "age": 25, "city": "Hyderabad"}|
|{"name": "John", "city": "New York"} |
|{"age": 31, "city": "London"} |
+-------------------------------------------------+
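To finish the round trip back to S3 from the original question, the single JSON-string column produced above can be written out as plain text files, one document per line. This is only a minimal sketch; the output path is a placeholder, and result_df is simply the select shown above captured in a variable instead of shown:

result_df = df.withColumn(
    "data", F.to_json(F.struct([x for x in df.columns]))
).select(
    drop_null_cols_udf("data").alias("data")
)

# Executors write their partitions in parallel; each output file holds one
# JSON document per line.
result_df.write.mode("overwrite").text("s3://your-bucket/transformed/")  # placeholder path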
I would follow the approach below (written in Scala, but it can be implemented in Python with minimal changes):
Find the dataset count and name it totalCount:
val totalCount = inputDF.count()
Compute count(col) for all of the DataFrame's columns and build a map from each field to its count. Here the count is computed for every column of the input DataFrame.
Please note that count(anyCol) returns the number of rows for which the supplied column is non-null. For example, if a column has 10 rows and 5 of those values are null, then count(column) is 5.
Fetch the first row as a Map[colName, count(colName)], referred to as fieldToCount:
val cols = inputDF.columns.map { inputCol =>
  functions.count(col(inputCol)).as(inputCol)
}

// count(col) returns the number of rows for which the supplied column is non-null.
// count(null) returns 0.
val row = inputDF.select(cols: _*).head()
val fieldToCount = row.getValuesMap[Long](inputDF.columns)
Get the columns to be removed:
Using the map created in the previous step, mark every column whose count is less than totalCount as a column to be removed.
Then select all the columns whose count equals totalCount from the input DataFrame, and save the processed output DataFrame anywhere, in any format, as required.
Please note that this approach removes every column that has at least one null value.
val fieldToBool = fieldToCount.mapValues(_ < totalCount)
val colsToKeep = fieldToBool.filterNot(_._2).keys.toSeq
val processedDF = inputDF.select(colsToKeep.map(col): _*)
// save this processedDF anywhere, in any format, as required
I believe this approach will perform better than the approach you currently have.
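For reference, here is a rough PySpark equivalent of the same idea (one pass to compute per-column non-null counts, then a select of the columns whose count equals the total row count). This is only a sketch, not the original answer's code; input_df stands for the input DataFrame:

import pyspark.sql.functions as F

total_count = input_df.count()

# count(col) ignores nulls, so each value below is the number of
# non-null rows for that column.
counts_row = input_df.select(
    [F.count(F.col(c)).alias(c) for c in input_df.columns]
).head()

# Keep only the columns that contain no nulls at all.
cols_to_keep = [c for c in input_df.columns if counts_row[c] == total_count]
processed_df = input_df.select(*cols_to_keep)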
I solved the above problem.
We can simply query the DataFrame for null values, e.g.
df = df.filter(df.column.isNotNull()), thereby removing all rows where a null is present in that column.
So if there are n columns, we need 2^n queries to filter out all possible combinations. In my case there were 10 columns, so 1024 queries in total, which is acceptable since the SQL queries are parallelized.
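That answer is terse, so what follows is only one possible reading of it: group the rows by which columns are null, and for each of the 2^n combinations keep only the non-null columns before writing. A rough sketch, with a placeholder output path:

from functools import reduce
from itertools import combinations
import pyspark.sql.functions as F

columns = df.columns

# For every possible set of null columns (2^n combinations), keep the rows
# that are null in exactly those columns, drop those columns, and write the
# remaining data out.
for r in range(len(columns) + 1):
    for null_cols in combinations(columns, r):
        kept_cols = [c for c in columns if c not in null_cols]
        predicates = ([F.col(c).isNull() for c in null_cols] +
                      [F.col(c).isNotNull() for c in kept_cols])
        subset = df.filter(reduce(lambda a, b: a & b, predicates))
        if kept_cols:
            subset.select(*kept_cols).write.mode("append").json("s3://your-bucket/out/")  # placeholder path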
I have fed some data into the TravelTime API (https://github.com/traveltime-dev/traveltime-python-sdk), which calculates the time it takes to drive between 2 locations. The result of this (called out) is a dictionary that looks like this:
{'results': [{'search_id': 'arrival_one_to_many',
'locations': [{'id': 'KA8 0EU', 'properties': {'travel_time': 2646}},
{'id': 'KA21 5DT', 'properties': {'travel_time': 392}}],
'unreachable': []}]}
However, I need a table that would look a bit like this:
search_id            id        Travel Time
arrival_one_to_many  KA8 0EU   2646
arrival_one_to_many  KA21 5DT  392
I've tried converting this dictionary to a dataframe using
out_2 = pd.DataFrame.from_dict(out)
This shows as a dataframe with one column called results, so I tried using out_2['results'].str.split(',', expand=True) to split it into multiple columns at the comma delimiters, but got an error:
0
0
NaN
Is anyone able to help me get this dictionary into a readable and usable dataframe/table?
Thanks
@MICHAELKM22 since you are not using all the keys from the dictionary, you won't be able to convert it directly to a dataframe.
First extract the required keys and then convert the result into a dataframe.
df_list = []
for res in data['results']:
    search_id = res['search_id']
    for loc in res['locations']:
        temp_df = {}
        temp_df['search_id'] = search_id
        temp_df['id'] = loc["id"]
        temp_df['travel_time'] = loc["properties"]['travel_time']
        df_list.append(temp_df)

df = pd.DataFrame(df_list)
search_id id travel_time
0 arrival_one_to_many KA8 0EU 2646
1 arrival_one_to_many KA21 5DT 392
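As an aside, pandas.json_normalize can build the same table without an explicit loop; a minimal sketch (json_normalize uses dotted column names for nested fields, hence the rename):

import pandas as pd

df = pd.json_normalize(
    data['results'],
    record_path='locations',  # one row per location
    meta='search_id',         # carry search_id down to each location row
)
df = df.rename(columns={'properties.travel_time': 'travel_time'})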
First, this JSON needs to be parsed to fetch the required values. Once those values are fetched, they can be stored in a dataframe.
Below is the code to parse this JSON (PS: I have saved the JSON in a file) and add the values to a DataFrame.
import json
import pandas as pd

f = open('Json_file.json')
req_file = json.load(f)

df = pd.DataFrame()
for i in req_file['results']:
    dict_new = dict()
    dict_new['searchid'] = i['search_id']
    for j in i['locations']:
        dict_new['location_id'] = j['id']
        dict_new['travel_time'] = j['properties']['travel_time']
        # note: DataFrame.append is deprecated in newer pandas; pd.concat is the replacement
        df = df.append(dict_new, ignore_index=True)

print(df)
Below is the output of the above code:
searchid location_id travel_time
0 arrival_one_to_many KA8 0EU 2646.0
1 arrival_one_to_many KA21 5DT 392.0
Please point out what I am doing wrong, or whether this is a duplicate of another question.
I have 11 columns in my table. I am loading data from a Ceph (AWS) bucket into Postgres, and while doing that I have to filter the data with the conditions below before inserting it into Postgres:
1. Drop the entire row if there is any empty/null value in any column.
2. First name and last name should have more than a single letter. For example, if first name = A or last name = P (either one or both), the entire record/row should be dropped.
3. Zip code should be 5 digits or greater, up to a maximum of 7 digits.
4. First name and last name should not contain suffixes such as Jr, Sr, I, II, etc.; otherwise drop the entire record.
I have managed to execute the first step (I am new to pandas), but I was blocked at the next step, and I believe that finding a solution for step 2 might also help me solve step 3. While doing some quick research on Google, I found that I might be complicating the process by using chunks and might have to use 'concat' to apply the filter across all chunks, or maybe I am wrong; but I am dealing with a huge amount of data and using chunks helps me load the data into Postgres faster.
I am going to paste my code here and describe what I tried, what the output was, and what the expected output should be.
What I tried:
columns = [
    'cust_last_nm',
    'cust_frst_nm',
    'cust_brth_dt',
    'cust_gendr_cd',
    'cust_postl_cd',
    'indiv_entpr_id',
    'TOKEN_1',
    'TOKEN_2',
    'TOKEN_3',
    'TOKEN_4',
    'TOKEN_KEY'
]

def push_to_pg_weekly(key):
    vants = []
    print(key)
    key = _download_s3(key)
    how_many_files_pushed.append(True)
    s = sp.Popen(["wc", "-l", key], stdout=sp.PIPE)
    a, b = s.communicate()
    total_rows = int(a.split()[0])
    rows = 0
    data = pd.read_csv(key, sep="|", header=None, chunksize=100000)
    for chunk in data:
        rows += len(chunk)
        print("Processed rows: ", (float(rows)/total_rows)*100)
        chunk = chunk.dropna(axis=0)  # step 1: drop the rows where at least one element is missing
        index_names = chunk[(len(chunk[0]) <= 1) | (len(chunk[1]) <= 1)].index  # step 2
        chunk.drop(index_names, axis=0)
        chunk.to_csv("/tmp/sample.csv", sep="|", header=None, index=False)
        connection = psycopg2.connect(user=os.environ.get("DATABASE_USER", "USERNAME"),
                                      password=os.environ.get("DATABASE_PASS", "PASSWORD"),
                                      host=os.environ.get("DATABASE_HOST", "cvlpsql.pgsql.com"),
                                      port=5432,
                                      dbname=os.environ.get("DATABASE_NAME", "cvlpsql_db"),
                                      options="-c search_path=DATAVANT_O")
        with connection.cursor() as cursor:
            cursor.copy_from(open('/tmp/sample.csv'), "COVID1", sep='|')
        connection.commit()

def push_to_pg():
    paginator = CLIENT.get_paginator('list_objects')
    pages = paginator.paginate(Bucket=bucket)
    for page in pages:
        if "Contents" in page:
            for obj in page["Contents"]:
                if obj['Key'].startswith('test/covid-2020-11-10-175213') and (obj['Key'].endswith('.txt') or obj['Key'].endswith('.csv')):
                    push_to_pg_weekly(obj['Key'])
                    os.remove(obj['Key'])
    return
Data:
john|doe|1974-01-01|F|606.0|113955973|cC80fi6kHVjKRNgUnATuE8Nn5x/YyoTUdSDY3sUDis4=|2qalDHguJRO9gR66LZcRLSe2SSQSQAIcT9btvaqLnZk=|eLQ9vYAj0aUfMM9smdpXqIh7QRxLBh6wYl6iYkItz6g=|3ktelRCCKf1CHOVBUdaVbjqltxa70FF+9Lf9MNJ+HDU=|cigna_TOKEN_ENCRYPTION_KEY
j|ab|1978-01-01|M|328.0|125135976|yjYaupdG9gdlje+2HdQB+FdEEj6Lh4+WekqEuB1DSvM=|j8VuTUKll7mywqsKrqBnomppGutsoJAR+2IoH/Tq0b8=|6qNP9ch57MlX912gXS7RMg7UfjtaP6by/cR68PbzNmQ=|R5DemSNrFvcevijrktwf3aixOShNU6j7wfahyKeUyzk=|cigna_TOKEN_ENCRYPTION_KEY
j|j|1985-01-01|F|105.0|115144390|fn0r8nVzmDJUihnaQh1SXm1sLOIjzGsPDBskdX4/b+0=|Fh6facONoOiL9hCCA8Q1rtUp9n5h9VBhg2IaX9gjaKI=|NWtnZegpcpgcit2u063zQv3pcEhk4bpKHKFa9hW7LtU=|P3cVOUd6PyYN5tKezdMkVDI62aW8dv+bjIwKtAgX3OM=|cigna_TOKEN_ENCRYPTION_KEY
jh|on|1989-01-01|M|381.0|133794239|PvCWdh+ucgi1WyP5Vr0E6ysTrTZ1gLTQIteXDxZbEJg=|7K3RsfC8ItQtrEQ+MdBGpx6neggYvBvR8nNDMOBTRtU=|nHsF/rJFM/O+HPevTj9cVYwrXS1ou+2/4FelEXTV0Ww=|Jw/nzI/Gu9s6QsgtxTZhTFFBXGLUv06vEewxQbhDyWk=|cigna_TOKEN_ENCRYPTION_KEY
||1969-01-01|M|926.0|135112782|E2sboFz4Mk2aGIKhD4vm6J9Jt3ZSoSdLm+0PCdWsJto=|YSILMFS5sPPZZF/KFroEHV77z1bMeiL/f4FqF2kj4Xc=|tNjgnby5zDbfT2SLsCCwhNBxobSDcCp7ws0zYVme5w4=|kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI=|cigna_TOKEN_ENCRYPTION_KEY
||1978-01-01|M|70.0|170737333|Q8NDJz563UrquOUUz0vD6Es05vIaAD/AfVOef4Mhj24=|k5Q02GVd0nJ6xMs1vHVM24MxV6tZ46HJNKoePcDsyoM=|C9cvHz5n+sDycUecioiWZW8USE6D2dli5gRzo4nOyvY=|z4eNSVNDAjiPU2Sw3VY+Ni1djO5fptl5FGQvfnBodr4=|cigna_TOKEN_ENCRYPTION_KEY
||1996-01-01|M|840.0|91951973|Y4kmxp0qdZVCW5pJgQmvWCfc4URg9oFnv2DWGglfQKM=|RJfyDYJjwuZ1ZDjP+5PA5S2fLS6llFD51Lg+uJ84Tus=|+PXzrKt7O79FehSnL3Q8EjGmnyZVDUfdM4zzHk1ghOY=|gjyVKjunky2Aui3dxzmeLt0U6+vT39/uILMbEiT0co8=|cigna_TOKEN_ENCRYPTION_KEY
||1960-01-01|M|180.0|64496569|80e1CgNJeO8oYQHlSn8zWYL4vVrHSPe9AnK2T2PrdII=|bJl7veT+4MlU4j2mhFpFyins0xeCFWeaA30JUzWsfqo=|0GuhUfbS4xCnCj2ms43wqmGFG5lCnfiIQdyti9moneM=|lq84jO9yhz8f9/DUM0ACVc/Rp+sKDvHznVjNnLOaRo4=|cigna_TOKEN_ENCRYPTION_KEY
||1963-01-01|M|310.0|122732991|zEvHkd5AVT7hZFR3/13dR9KzN5WSulewY0pjTFEov2Y=|eGqNbLoeCN1GJyvgaa01w+z26OtmplcrAY2vxwOZ4Y4=|6q9DPLPK5PPAItZA/x253DvdAWA/r6zIi0dtIqPIu2g=|lOl11DhznPphGQOFz6YFJ8i28HID1T6Sg7B/Y7W1M3o=|cigna_TOKEN_ENCRYPTION_KEY
||2001-01-01|F|650.0|43653178|vv/+KLdhHqUm13bWhpzBexwxgosXSIzgrxZIUwB7PDo=|78cJu1biJAlMddJT1yIzQAH1KCkyDoXiL1+Lo1I2jkw=|9/BM/hvqHYXgfmWehPP2JGGuB6lKmfu7uUsmCtpPyz8=|o/yP8bMzFl6KJ1cX+uFll1SrleCC+8BXmqBzyuGdtwM=|cigna_TOKEN_ENCRYPTION_KEY
Output (data inserted into the Postgres DB):
john|doe|1974-01-01|F|606.0|113955973|cC80fi6kHVjKRNgUnATuE8Nn5x/YyoTUdSDY3sUDis4=|2qalDHguJRO9gR66LZcRLSe2SSQSQAIcT9btvaqLnZk=|eLQ9vYAj0aUfMM9smdpXqIh7QRxLBh6wYl6iYkItz6g=|3ktelRCCKf1CHOVBUdaVbjqltxa70FF+9Lf9MNJ+HDU=|cigna_TOKEN_ENCRYPTION_KEY
j|ab|1978-01-01|M|328.0|125135976|yjYaupdG9gdlje+2HdQB+FdEEj6Lh4+WekqEuB1DSvM=|j8VuTUKll7mywqsKrqBnomppGutsoJAR+2IoH/Tq0b8=|6qNP9ch57MlX912gXS7RMg7UfjtaP6by/cR68PbzNmQ=|R5DemSNrFvcevijrktwf3aixOShNU6j7wfahyKeUyzk=|cigna_TOKEN_ENCRYPTION_KEY
j|j|1985-01-01|F|105.0|115144390|fn0r8nVzmDJUihnaQh1SXm1sLOIjzGsPDBskdX4/b+0=|Fh6facONoOiL9hCCA8Q1rtUp9n5h9VBhg2IaX9gjaKI=|NWtnZegpcpgcit2u063zQv3pcEhk4bpKHKFa9hW7LtU=|P3cVOUd6PyYN5tKezdMkVDI62aW8dv+bjIwKtAgX3OM=|cigna_TOKEN_ENCRYPTION_KEY
jh|on|1989-01-01|M|381.0|133794239|PvCWdh+ucgi1WyP5Vr0E6ysTrTZ1gLTQIteXDxZbEJg=|7K3RsfC8ItQtrEQ+MdBGpx6neggYvBvR8nNDMOBTRtU=|nHsF/rJFM/O+HPevTj9cVYwrXS1ou+2/4FelEXTV0Ww=|Jw/nzI/Gu9s6QsgtxTZhTFFBXGLUv06vEewxQbhDyWk=|cigna_TOKEN_ENCRYPTION_KEY
Expected Output:
john|doe|1974-01-01|F|606.0|113955973|cC80fi6kHVjKRNgUnATuE8Nn5x/YyoTUdSDY3sUDis4=|2qalDHguJRO9gR66LZcRLSe2SSQSQAIcT9btvaqLnZk=|eLQ9vYAj0aUfMM9smdpXqIh7QRxLBh6wYl6iYkItz6g=|3ktelRCCKf1CHOVBUdaVbjqltxa70FF+9Lf9MNJ+HDU=|cigna_TOKEN_ENCRYPTION_KEY
jh|on|1989-01-01|M|381.0|133794239|PvCWdh+ucgi1WyP5Vr0E6ysTrTZ1gLTQIteXDxZbEJg=|7K3RsfC8ItQtrEQ+MdBGpx6neggYvBvR8nNDMOBTRtU=|nHsF/rJFM/O+HPevTj9cVYwrXS1ou+2/4FelEXTV0Ww=|Jw/nzI/Gu9s6QsgtxTZhTFFBXGLUv06vEewxQbhDyWk=|cigna_TOKEN_ENCRYPTION_KEY
Any answers/comments will be very much appreciated, thank you.
The fastest way to do operations like this in pandas is through numpy.where.
e.g. for string length:
data = data[np.where((data['cust_last_nm'].str.len()>1) &
                     (data['cust_frst_nm'].str.len()>1), True, False)]
Note: you can add the postal code condition in the same way. By default the postal codes in your data will be read in as floats, so cast them to string first and then set the length limits:
## string length & postal code conditions together
data = data[np.where((data['cust_last_nm'].str.len()>1) &
                     (data['cust_frst_nm'].str.len()>1) &
                     (data['cust_postl_cd'].astype('str').str.len()>4) &
                     (data['cust_postl_cd'].astype('str').str.len()<8),
                     True, False)]
EDIT:
Since you are working in chunks, change data to chunk and put this inside your loop. Also, since you don't read in headers (header=None), change the column names to their index values. And convert all values to strings before comparison, since otherwise NaN columns will be treated as floats, e.g.:
chunk = chunk[np.where((chunk[0].astype('str').str.len()>1) &
                       (chunk[1].astype('str').str.len()>1) &
                       (chunk[5].astype('str').str.len()>4) &
                       (chunk[5].astype('str').str.len()<8), True, False)]
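Condition 4 from the question (dropping rows whose first or last name is a suffix such as Jr, Sr, I, II) is not covered above. A hedged sketch of one way to bolt it on, with a hypothetical suffix list; this checks for names that are exactly a suffix, so if the suffix can be embedded inside the name (e.g. 'John Jr') you would need str.contains with a regex instead:

# Hypothetical suffix list; extend it to whatever should be excluded.
suffixes = {'jr', 'sr', 'i', 'ii', 'iii', 'iv'}

# Drop rows whose first or last name is exactly one of the suffixes.
name_ok = (~chunk[0].astype('str').str.lower().isin(suffixes) &
           ~chunk[1].astype('str').str.lower().isin(suffixes))
chunk = chunk[name_ok]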
Create a new column in the dataframe with a value for the length:
df['name_length'] = df.name.str.len()
Index using the new column:
df = df[df.name_length > 1]
Create the pandas DataFrame
My data frame is like this:
data = [[6, 1, "False","var_1"], [6, 1, "False","var_2"], [7, 1, "False","var_3"]]
df = pd.DataFrame(data, columns =['CONSTRAINT_ID','CONSTRAINT_NODE_ID','PRODUCT_GRAIN','LEFT_SIDE_TYPE'])
Expected output JSON:
I want to group by the CONSTRAINT_ID column, and the keys should be natural numbers or an index. The LEFT_SIDE_TYPE column values should come as a list:
{
"1": {"CONSTRAINT_NODE_ID ":[1],
"product_grain":False,
"left_side_type":["Variable_1","Variable_2"],
},
"2": {"CONSTRAINT_NODE_ID ":[2],
"product_grain":False,
"left_side_type":["Variable_3"],
}
}
It is likely not the most efficient solution. However, provided a df in the format specified in your original question, the function below will return a str containing a valid JSON string with the desired structure and values.
It filters the df by CONSTRAINT_ID, iterating over each unique value and creating a JSON object with keys 1...n and the desired values from your original question inside the response variable. The implementation uses sets to avoid duplicate values during iteration, converting them to lists before they are added to the response.
import json

def generate_response(df):
    response = dict()
    constraints = df['CONSTRAINT_ID'].unique()
    for i, c in enumerate(constraints):
        temp = {'CONSTRAINT_NODE_ID': set(), 'PRODUCT_GRAIN': None, 'LEFT_SIDE_TYPE': set()}
        for _, row in df[df['CONSTRAINT_ID'] == c].iterrows():
            temp['CONSTRAINT_NODE_ID'].add(row['CONSTRAINT_NODE_ID'])
            temp['PRODUCT_GRAIN'] = row['PRODUCT_GRAIN']
            temp['LEFT_SIDE_TYPE'].add(row['LEFT_SIDE_TYPE'])
        temp['CONSTRAINT_NODE_ID'] = list(temp['CONSTRAINT_NODE_ID'])
        temp['LEFT_SIDE_TYPE'] = list(temp['LEFT_SIDE_TYPE'])
        response[str(i + 1)] = temp
    return json.dumps(response, indent=4)
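A quick way to exercise the function with the sample frame from the question; this is just a usage sketch, not part of the original answer:

import pandas as pd

data = [[6, 1, "False", "var_1"], [6, 1, "False", "var_2"], [7, 1, "False", "var_3"]]
df = pd.DataFrame(data, columns=['CONSTRAINT_ID', 'CONSTRAINT_NODE_ID', 'PRODUCT_GRAIN', 'LEFT_SIDE_TYPE'])

print(generate_response(df))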
I am trying to create a table from JSON data. I have already used json.dumps on my data.
This is what I am trying to export to the table:
label3 = json.dumps({'class': CLASSES[idx],"confidence": str(round(confidence * 100, 1)) + "%","startX": str(startX),"startY": str(startY),"EndX": str(endX),"EndY": str(endY),"Timestamp": now.strftime("%d/%m/%Y, %H:%M")})
I have tried with:
val1 = json.loads(label3)
df = pd.DataFrame(val1)
print(df.T)
The system gives me an error that I must pass an index.
And also with:
val = ast.literal_eval(label3)
val1 = json.loads(json.dumps(val))
print(val1)
val2 = val1["class"][0]["confidence"][0]["startX"][0]["startY"][0]["endX"][0]["endY"][0]["Timestamp"][0]
df = pd.DataFrame(data=val2, columns=["class", "confidence", "startX", "startY", "EndX", "EndY", "Timestamp"])
print(df)
When I try this, the error it gives is that string indices must be integers.
How can I create the index?
Thank you,
There are two ways we can tackle this issue.
Do as directed by the error and pass an index to the DataFrame constructor:
pd.DataFrame(val1, index=list(range(number_of_rows)))  # number_of_rows is 1 in your case
When dumping the data using json.dumps, dump a dictionary that maps each key to a list of values instead of key to a single value. For example:
json.dumps({ 'class': [ CLASSES[idx] ],"confidence": [ ' some confidence ' ] })
I have shortened your example. Notice that I am passing the values as lists of values (even if there is only one value per key).
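A minimal sketch of both options applied to the label3 string from the question; index=[0] is equivalent to passing a one-element range:

import json
import pandas as pd

val1 = json.loads(label3)

# Option 1: a dict of scalars needs an explicit index.
df_a = pd.DataFrame(val1, index=[0])

# Option 2: wrap each scalar in a list so pandas can infer a single row.
df_b = pd.DataFrame({k: [v] for k, v in val1.items()})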
I have a data frame df that reads a JSON file as follows:
df = spark.read.json("/myfiles/file1.json")
df.dtypes shows the following columns and data types:
id - string
Name - struct
address - struct
Phone - struct
start_date - string
years_with_company - int
highest_education - string
department - string
reporting_hierarchy - struct
I want to extract only non-struct columns and create a data frame. For example, my resulting data frame should only have id, start_date, highest_education, and department.
Here is the code I have, which only partially works: I only get the last non-struct column's (department's) values populated. I want all the non-struct columns collected and then converted to a data frame:
names = df.schema.names
for col_name in names:
    if isinstance(df.schema[col_name].dataType, StructType):
        print("Skipping struct column %s " % (col_name))
    else:
        df1 = df.select(col_name).collect()
I'm pretty sure this may not be the best way to do it and I am missing something that I cannot put my finger on, so I would appreciate your help. Thank you.
Use a list comprehension:
cols_filtered = [
c for c in df.schema.names
if not isinstance(df.schema[c].dataType, StructType)
]
Or,
# Thank you #pault for the suggestion!
cols_filtered = [c for c, t in df.dtypes if not t.startswith('struct')]
Now, you can pass the result to df.select.
df2 = df.select(*cols_filtered)