How to make processing of multiple JSON files faster using PySpark?

I have a list of JSON files in Databricks, and what I am trying to do is read each JSON, extract the values needed, and append them as a row to an initially empty pandas dataframe. Each JSON file corresponds to one row of the final dataframe. The initial JSON file list is 50k entries long. What I have built so far is the function below, which does the job perfectly, but it is so slow that I have to split the file list into 5k bins and run each bin separately, at about 30 minutes per bin. I am limited to a 3-node cluster in Databricks.
Any chance you could improve the efficiency of my function? Thanks in advance.
### Create a big dataframe including all json files ###
def jsons_to_pdf(all_paths):
    # Create an empty pandas dataframe (it is defined only with column names)
    pdf = create_initial_pdf(samplefile)

    # Append each row into the above dataframe
    for path in all_paths:
        # Create a spark dataframe
        sdf = sqlContext.read.json(path)
        # Create two lists of extracted values
        init_values = sdf.select("id", "logTimestamp", "otherTimestamp").rdd.flatMap(lambda x: x).collect()
        id_values = sdf.select(sdf["dataPoints"]["value"]).rdd.flatMap(lambda x: x).collect()[0]
        # Append the concatenated list as a row into the initial dataframe
        pdf.loc[len(pdf)] = init_values + id_values

    return pdf
One json file looks like the following:
And what I want to achieve is to have each dataPoints['id'] as a new column, with the corresponding dataPoints['value'] as its value, so as to end up with this:

According to your example, what you want to perform is a pivot, and then transform your data into a pandas dataframe.
The steps are:
collect all your JSONs into one big Spark dataframe,
pivot your data,
transform the result into a pandas dataframe.
Try something like this:
from functools import reduce
import pyspark.sql.functions as F

def jsons_to_pdf(all_paths):
    # Create a big dataframe from all the jsons
    sdf = reduce(
        lambda a, b: a.union(b),
        [sqlContext.read.json(path) for path in all_paths]
    )

    # Select and pivot your data
    pivot_df = sdf.select(
        "imoNo",
        "logTimestamp",
        "payloadTimestamp",
        F.explode("datapoints").alias("datapoint")
    ).groupBy(
        "imoNo",
        "logTimestamp",
        "payloadTimestamp",
    ).pivot(
        "datapoint.id"
    ).sum("datapoint.value")

    # Convert to a pandas dataframe
    pdf = pivot_df.toPandas()

    return pdf
According to your comment, you can replace the list of files all_paths with a generic path and change the way you create sdf:
all_paths = 'abc/*/*/*'  # 3 wildcards: one for year, one for month, one for day

def jsons_to_pdf(all_paths):
    # Create a big dataframe from all the jsons in a single read
    sdf = sqlContext.read.json(all_paths)

This will surely improve performance.
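Putting the two pieces together, here is a minimal sketch of what the revised function could look like. It reuses the column names (imoNo, logTimestamp, payloadTimestamp, datapoints) and the wildcard pattern from the answer above, and assumes sqlContext is available as in the rest of the question:

import pyspark.sql.functions as F

def jsons_to_pdf(json_glob='abc/*/*/*'):
    # One distributed read over every JSON file matching the glob
    sdf = sqlContext.read.json(json_glob)

    # Explode the datapoints array and pivot id -> value, one row per file
    pivot_df = (
        sdf.select(
            "imoNo",
            "logTimestamp",
            "payloadTimestamp",
            F.explode("datapoints").alias("datapoint"),
        )
        .groupBy("imoNo", "logTimestamp", "payloadTimestamp")
        .pivot("datapoint.id")
        .sum("datapoint.value")
    )

    # Only the final, already-pivoted result is collected to the driver
    return pivot_df.toPandas()

This way the 50k files are read and aggregated by Spark in parallel, and pandas only ever sees the final pivoted result.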

Related

How to read multiple csv from folder without concatenating each file

I have a folder, and inside it suppose there are 1,000 .csv files. Now I have to create a data frame based on 50 of these files, so instead of loading them one by one, is there any faster approach available?
I also want each file name to become the name of its data frame.
I tried the method below but it is not working.
# List of file that I want to load out of 1000
path = "..."
file_names = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']
for i in range(0, len(file_names)):
col_names[i] = pd.read_csv(path + col_name[i])
But when I try to read the variable names, nothing is displayed.
Is there any way I can achieve the desired result?
I have checked various articles, but in each of them all the data ends up concatenated at the end, and I want each file to be loaded individually.
IIUC, what you want to do is create multiple dfs and not concatenate them.
In that case you can read each file with read_csv and stack the returned df objects in a list:
your_paths = [...]  # paths to all of the csvs you want
l = [pd.read_csv(i) for i in your_paths]  # this gives you a list of your dfs
l[0]  # one of your dfs
If you want them named, you can build a dict with the names as keys instead.
You can then access them individually, through index slicing or key lookup, depending on the data structure you use.
I would not recommend this approach, though: it is counter-intuitive, and multiple df objects use a little more memory than a single one.
file_names = ['a.csv', 'b.csv', 'c.csv', 'd.csv', 'e.csv']
data_frames = {}

for file_name in file_names:
    df = pd.read_csv(file_name)
    data_frames[file_name.split('.')[0]] = df
Now you can reach any data frame from the data_frames dictionary, e.g. data_frames['a'] to access a.csv.
Try:
import glob

p = glob.glob('folder_path_where_csv_files_stored/*.csv')  # 1. returns a list of all csv files in this folder, no need to type them one by one
d = [pd.read_csv(i) for i in p]  # 2. creates a list of dataframes, one dataframe from each csv file
df = pd.concat(d, axis=0, ignore_index=True)  # 3. creates one dataframe `df` from those dataframes in the list `d`

Skipping every nth row and copy/transform data into another matrix

I've extracted information from 142 different files and stored it in a CSV file with one column, which contains both numbers and text. I want to copy rows 11-145, transform them, and paste them into another file (xlsx or csv, it doesn't matter). Then I want to skip the next 10 rows, copy rows 156-290, transform and paste them, and so on. I have tried the following code:
import numpy as np

overview = np.zeros((145, 135))
for i in original:
    original[i+11:i+145, 1] = overview[1, i+1:i+135]
print(overview)
The original file is the imported file, for which I used pd.read_csv.
pd.read_csv is a function that returns a dataframe.
To select specific rows from a dataframe you can use slicing with .loc:
df.loc[start:stop:step]
So it would look something like this:
df = pd.read_csv(your_file)
new_df = df.loc[11:140]
# transform it as you please
# then convert it to excel or csv
new_df.to_excel("new_file.xlsx")  # or new_df.to_csv("new_file.csv")
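To handle the repeating pattern from the question (take a 135-row block, skip 10 rows, take the next block, and so on), a minimal sketch could look like the following. The block size, gap and start offset come from the question, the transform step is a placeholder, and 0-based .iloc positions are assumed, so adjust the start if your rows are counted from 1:

import pandas as pd

df = pd.read_csv("your_file.csv")  # hypothetical input path

start = 11          # first block starts at row 11
block_size = 135    # rows 11-145 is a 135-row block
gap = 10            # 10 rows are skipped between blocks

blocks = []
while start + block_size <= len(df):
    block = df.iloc[start:start + block_size].copy()
    # transform `block` here as needed
    blocks.append(block.reset_index(drop=True))
    start += block_size + gap   # jump past this block and the 10-row gap

# place the blocks side by side (one column per block) and write them out
out = pd.concat(blocks, axis=1, ignore_index=True)
out.to_excel("new_file.xlsx", index=False)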

Amending dataframe from a generator that reads multiple excel files

My question, ultimately, is: is it possible to amend in place each dataframe produced by a generator of dataframes?
I have a series of excel files in a folder, each containing a table in the same format. Ultimately I want to concatenate all the files into one large dataframe. They all have unique column headers but share the same indices (historical dates, though possibly over different time frames), so I want to concatenate the dataframes aligned by their date. I first created a generator to build a dataframe from the 'Data1' worksheet of each excel file:
all_files = glob.glob(os.path.join(path, "*"))
df_from_each_file = (pd.read_excel(f,'Data1') for f in all_files) #generator comprehension
The below code is the formatting that needs to be done to each dataframe so that I can concatenate them correctly in my final line. I changed the index to the date column but there are also some rows that contain data that is not relevant.
def format_ABS(df):
    df.drop(labels=range(0, 9), axis=0, inplace=True)
    df.set_index(df.iloc[:, 0], inplace=True)
    df.drop(df.columns[0], axis=1, inplace=True)
However, this doesn't work when I place the function inside a generator comprehension (as I am amending all the dataframes in place). The generator produced contains no usable objects. Why doesn't the line below work? Is it because a generator can only be looped through once?
format_df = (format_ABS(x) for x in df_from_each_file)
but
format_ABS(next(df_from_each_file))
does work on each individual dataframe
The final product is then the line below:
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
I have gotten what I wanted by passing index_col=0 in the pd.read_excel line, but it got me thinking about generators and amending dataframes in general.
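For what it's worth, a minimal sketch of how the generator pipeline in the question could be made to work, assuming the same 'Data1' sheet layout: the key point is that format_ABS mutates its argument and returns None, so a generator comprehension over it yields only None values; returning the dataframe (and concatenating along axis=1 to align on the date index, as the question describes) addresses that.

import glob
import os
import pandas as pd

def format_ABS(df):
    # Same cleaning as in the question, but the dataframe is returned
    # instead of relying purely on in-place mutation
    df = df.drop(labels=range(0, 9), axis=0)
    df = df.set_index(df.iloc[:, 0])
    df = df.drop(df.columns[0], axis=1)
    return df

path = "..."  # folder containing the excel files, as in the question
all_files = glob.glob(os.path.join(path, "*"))

# Each item yielded by the generator is now a formatted dataframe,
# so pd.concat receives real dataframes rather than None values
formatted_dfs = (format_ABS(pd.read_excel(f, 'Data1')) for f in all_files)
concatenated_df = pd.concat(formatted_dfs, axis=1)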

Split per attribute

I am trying to read a big CSV and then split it into smaller CSV files, based on the unique values in the column team.
At first I created a new dataframe for each team, so that a new file was generated for each unique value in the team column.
Code:
import pandas as pd
df = pd.read_csv('combined.csv')
df = df[df.team == 'RED']
df.to_csv('RED.csv')
However I want to start from a single dataframe, read all unique 'teams', and create a .txt file for each team, with headers.
Is it possible?
pandas.DataFrame.groupby, when used without an aggregation, returns a groupby object that can be iterated to get the sub-dataframe associated with each group in the groupby column.
The following code will create a file for the data associated with each unique value in the column used to group by.
Use f-strings to create a unique filename for each group.
import pandas as pd

# create the dataframe
df = pd.read_csv('combined.csv')

# group by the desired column and iterate through the groupby object
for group, dataframe in df.groupby('team'):
    # save the dataframe for each group to a csv
    dataframe.to_csv(f'{group}.txt', sep='\t', index=False)

Add calculated columns to each DataFrame inside a Panel without for-loop

I have ~300 .csv files all with the same number of rows and columns for instrumentation data. Since each .csv file represents a day and the structure is the same, I figured it would be best to pull each .csv into a Pandas DataFrame and then throw them into a Panel object to perform faster calculations.
I would like to add additional calculated columns to each DataFrame that is inside the Panel, preferably without a for-loop. I'm attempting to use the apply function to the panel and name the new columns based on the original column name appended with a 'p' (for easier indexing later). Below is the code I am currently using.
import pandas as pd
import numpy as np
import os.path

dir = "data/testsetup1/"
filelist = []

def initializeDataFrames():
    for f in os.listdir(dir):
        if ".csv" in f:
            filelist.append(dir + f)
    dd = {}
    for f in filelist:
        dd[f[len(dir):(len(f) - 4)]] = pd.read_csv(f)
    return pd.Panel(dd)

def newCalculation(pointSeries):
    # test function, more complex functions to follow
    pointSeriesManiuplated = pointSeries.copy()
    percentageMove = 1.0 / float(len(pointSeriesManiuplated))
    return pointSeriesManiuplated * percentageMove

myPanel = initializeDataFrames()
#calculatedPanel = myPanel.join(lambda x: myPanel[x,:,0:17].apply(lambda y:newCalculation(myPanel[x,:,0:17].ix[y])), rsuffix='p')
calculatedPanel = myPanel.ix[:, :, 0:17].join(myPanel.ix[:, :, 0:17].apply(lambda x: newCalculation(x), axis=2), rsuffix='p')
print(calculatedPanel.values)
The code above currently duplicates each DataFrame with the calculated columns instead of appending them to each DataFrame. The apply function I'm using operates on a Series object, which in this case would be a passed column. My question is how can I use the apply function on a Panel such that it calculates new columns and appends them to each DataFrame?
Thanks in advance.
If you want to add a new column via apply simply assign the output of the apply operation to the column you desire:
myPanel['new_column_suffix_p'] = myPanel.apply(newCalculation)
If you want multiple columns you can make a custom function for this:
def calc_new_columns(rowset):
    rowset['newcolumn1'] = calculation1(rowset.columnofinterest)
    rowset['newcolumn2'] = calculation2(rowset.columnofinterest2 + rowset.column3)
    return rowset
myPanel = myPanel.apply(calc_new_columns)
On a broader note: you are manually handling sections of your data frame when it looks like you could do the new-column operation all at once. I would suggest importing the first csv file into a data frame, then looping through the remaining 299 csvs and using DataFrame.append to add them to the original data frame. You would then have one data frame for all the data, which simply needs the calculated columns added.
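A minimal sketch of that single-dataframe approach, using pd.concat in place of the suggested DataFrame.append (which is deprecated in recent pandas). The source_file column is a hypothetical addition to keep track of which csv each row came from, and newCalculation is the test function from the question:

import glob
import os
import pandas as pd

data_dir = "data/testsetup1/"  # same folder as in the question

frames = []
for csv_path in glob.glob(os.path.join(data_dir, "*.csv")):
    day_df = pd.read_csv(csv_path)
    # hypothetical column recording which file (day) each row came from
    day_df["source_file"] = os.path.splitext(os.path.basename(csv_path))[0]
    frames.append(day_df)

# one dataframe holding all ~300 days of data
all_data = pd.concat(frames, ignore_index=True)

# add the calculated columns once, with a 'p' suffix, on the whole dataframe
for col in all_data.columns[:17]:
    all_data[col + "p"] = newCalculation(all_data[col])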
Nit: "dir" is a builtin function; you shouldn't use it as a variable name.
Try using a double transpose:
p = pd.Panel(np.random.rand(4, 10, 17),
             items=pd.date_range('2013/11/10', periods=4),
             major_axis=range(10),
             minor_axis=["col%d" % x for x in range(17)])

pT = p.transpose(2, 1, 0)
pT = pT.join(pT.apply(newCalculation, axis='major'), rsuffix='p')
p = pT.transpose(2, 1, 0)
