Save dictionary as a PySpark DataFrame and load it - Python, Databricks

I have a dictionary as follows:
my_dict = {'a':[12,15.2,52.1],'b':[2.5,2.4,5.2],'c':[1.2,5.3,12]}
I want to save this dictionary in Databricks so that I do not have to regenerate it every time I want to start working with it. Furthermore, I would like to know how to retrieve it and get it back in its original form.
I have tried doing the following:
from itertools import zip_longest
column_names, data = zip(*my_dict.items())
spark.createDataFrame(zip_longest(*data), column_names).show()
and
column_names, data = zip(*dict_brands.items())
spark.createDataFrame(zip(*data), column_names).show()
However, I get the following error:
zip_longest argument #10342 must support iteration
I also do not know how to reload it or upload it. I tried with a sample dataframe (not the same one), as follows:
df.write.format("tfrecords").mode("overwrite").save('/data/tmp/my_df')
And the error is:
Attribute name "my_column" contains invalid character(s)
among " ,;{}()\n\t=". Please use alias to rename it.
Finally, in order to obtain it, I thought about:
my_df = spark.table("my_df") # Get table
df = my_df.toPandas() # Make pd dataframe
and then convert it back into a dictionary, but maybe there is an easier way than making it a dataframe, saving it, retrieving it as a dataframe, and converting it back into a dictionary again.
I would also like to know the computational cost for the solutions, since the actual dataset is very large.

Here is my sample code for achieving what you need, step by step.
Convert a dictionary to a Pandas dataframe
my_dict = {'a':[12,15.2,52.1],'b':[2.5,2.4,5.2],'c':[1.2,5.3,12]}
import pandas as pd
pdf = pd.DataFrame(my_dict)
Convert a Pandas dataframe to a PySpark dataframe
df = spark.createDataFrame(pdf)
To save the PySpark dataframe to a file using Parquet format (the tfrecords format is not supported here).
df.write.format("parquet").mode("overwrite").save('/data/tmp/my_df')
To load the saved file above as a PySpark dataframe.
df2 = spark.read.format("parquet").load('/data/tmp/my_df')
To convert the PySpark dataframe back to a dictionary.
my_dict2 = df2.toPandas().to_dict()
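Note that to_dict() with the default orient returns a nested dictionary keyed by row index, e.g. {'a': {0: 12, 1: 15.2, ...}, ...}. If you want to recover the original {'column': [values]} shape, pass orient='list' (a small sketch using the df2 from the step above):
my_dict2 = df2.toPandas().to_dict('list')  # {'a': [12, 15.2, 52.1], 'b': [...], 'c': [...]}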
The computational cost of the code above depends mainly on the memory needed to hold your actual dataset.

Related

Adding file name column to Dask DataFrame

I have a data set of around 400 CSV files containing a time series of multiple variables (my CSV has a time column and then multiple columns of other variables).
My final goal is to choose some variables and plot those 400 time series in a graph.
In order to do so, I tried to use Dask to read the 400 files and then plot them.
However, from my understanding, in order to actually draw 400 time series and not a single appended data frame, I should group the data by the file name it came from.
Is there any Dask efficient way to add a column to each CSV so I could later groupby my results?
Parquet files are also an option.
For example, I tried to do something like this:
import dask.dataframe as dd
import os
filenames = ['part0.parquet', 'part1.parquet', 'part2.parquet']
df = dd.read_parquet(filenames, engine='pyarrow')
df = df.assign(file=lambda x: filenames[x.index])
df_grouped = df.groupby('file')
I understand that I can use from_delayed() but then I lose all the parallel computation.
Thank you
If you can work with CSV files, then passing the include_path_column option might be sufficient for your purpose:
from dask.dataframe import read_csv
ddf = read_csv("some_path/*.csv", include_path_column="file_path")
print(ddf.columns)
# the list of columns will include `file_path` column
There is no equivalent option for read_parquet, but something similar can be achieved with delayed. Using delayed will not remove parallelism; the code just needs to make sure that the actual computation happens after the delayed tasks are defined.
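For illustration, here is a minimal sketch of that delayed approach for Parquet files; the file names are the ones from the question's example, and the 'file' column name is a placeholder:
import dask.dataframe as dd
import pandas as pd
from dask import delayed
filenames = ['part0.parquet', 'part1.parquet', 'part2.parquet']
@delayed
def load_with_name(path):
    # read one file inside a delayed task and tag it with its source path
    df = pd.read_parquet(path, engine='pyarrow')
    df['file'] = path
    return df
# define all delayed tasks first; nothing is computed yet
parts = [load_with_name(path) for path in filenames]
ddf = dd.from_delayed(parts)
# the groupby (and any later plotting) still runs in parallel across tasks
df_grouped = ddf.groupby('file')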

What code should I use to extract a specific column (with specific data) from a csv file into Python? It can be either pandas or numpy

please see the attached photo (image not reproduced here)
I only need to import a specific column with conditions (such as specific data found in that column). I also need to remove unnecessary columns; dropping them takes too much code. What specific code or syntax is applicable?
How to get a column from a pandas dataframe is answered in Read specific columns from a csv file with csv module?
To quote:
Pandas is spectacular for dealing with csv files, and the following
code would be all you need to read a csv and save an entire column
into a variable:
import pandas as pd
df = pd.read_csv(csv_file)
saved_column = df.column_name #you can also use df['column_name']
So in your case, you just save the filtered data frame in a new variable.
This means you do newdf = data.loc[...... and then use the code snippet above to extract the column you want, for example newdf.continent
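As a small illustrative sketch (the file name, column names, and condition are placeholders, not from the question), you can combine both steps: read only the columns you need with usecols, then keep only the rows that match a condition:
import pandas as pd
df = pd.read_csv('data.csv', usecols=['continent', 'population'])
newdf = df.loc[df['continent'] == 'Asia']  # keep only rows matching the condition
saved_column = newdf['population']  # extract the desired column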

Using PySpark to efficiently combine many small csv files (130,000 with 2 columns in each) into one large frame

This is another follow up to an earlier question I posted How can I merge these many csv files (around 130,000) using PySpark into one large dataset efficiently?
I have the following dataset https://fred.stlouisfed.org/categories/32263/downloaddata/INTRNTL_csv_2.zip
In it, there's a list of files (around 130,000) in the main directory, with their sub-directories listed; so the first cell might be A/AAAAA, and the file would be located at /data/A/AAAAA.csv
The files all have a similar format: the first column is called DATE and the second column is a series, always named VALUE. So first of all, the VALUE column needs to be renamed to the file name in each csv file. Second, the frames need to be full outer joined with each other with DATE as the main index. Third, I want to save the result and be able to load and manipulate it. The result should be roughly N rows (number of dates) by 130,001 columns.
I am trying to full outer join all the files into a single dataframe. I previously tried pandas but ran out of memory when trying to concat the list of files, and someone recommended that I try PySpark instead.
In a previous post I was told that I could do this:
df = spark.read.csv("/kaggle/input/bf-csv-2/BF_csv_2/data/**/*.csv", "date DATE, value DOUBLE")
But all the columns are named VALUE and the frame just becomes two columns, the first column DATE and the second VALUE. It loads quite fast, around 38 seconds, and gives around 3.8 million rows by 2 columns, so I know that it's not doing the full outer join; it's appending the files row-wise.
So I tried the following code:
import pandas as pd
import time
import os
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('spark-dataframe-demo').getOrCreate()
from pyspark.sql import *
from pyspark.sql.functions import col
from pyspark.sql import DataFrame
from pyspark.sql.types import *
filelist = pd.read_excel("/kaggle/input/list/BF_csv_2.xlsx") #list of filenames
firstname = min(filelist.File)
length = len(filelist.File)
dff = spark.read.csv(f"/kaggle/input/bf-csv-2/BF_csv_2/data/" + firstname, inferSchema = True, header = True).withColumnRenamed("VALUE",firstname) #read file and changes name of column to filename
for row in filelist.File.items():
    if row == firstname:
        continue
    print(row[1], length, end='', flush=True)
    df = spark.read.csv(f"/kaggle/input/bf-csv-2/BF_csv_2/data/" + row[1], inferSchema=True, header=True).withColumnRenamed("VALUE", row[1][:-4])
    #df = df.select(col("DATE").alias("DATE"),col("VALUE").alias(row[1][:-4]))
    dff = dff.join(df, ['DATE'], how='full')
    length -= 1
dff.write.save('/kaggle/working/whatever', format='parquet', mode='overwrite')
So to test it, I call df.show() after 3 columns are merged and it's quite fast. But when I try around 25 columns, it takes around 2 minutes. When I try 500 columns, it's next to impossible.
I don't think I'm doing it right. The formatting and everything is correct. But why is it taking so long? How can I use PySpark properly? Are there any better libraries to achieve what I need?
Spark doesn't do anything magical compared to other software. The strength of Spark is parallel processing. Most of the time that means you can use multiple machines to do the work. If you are running Spark locally, you may have the same issues you did when using pandas.
That being said, there might be a way for you to run it locally using Spark because it can spill to disk under certain conditions and does not need to have everything in memory.
I'm not well versed in PySpark, but the approach I'd take is:
Load all the files at once, like you did, using /kaggle/input/bf-csv-2/BF_csv_2/data/**/*.csv
Use the function input_file_name from pyspark.sql.functions, which gives you the path of the source file for each record in your DF (df.select("date", "value", input_file_name().alias("filename")) or similar)
Parse the path into the format you'd like to have as a column (e.g. extract the filename)
The schema should look like date, value, filename at this step
Use the PySpark equivalent of df.groupBy("date").pivot("filename").agg(first("value")). Note: I used first() because I think you have at most one record per (date, filename) pair; a sketch putting these steps together follows below
Also try setting the number of partitions to be equal to the number of dates you have
If you want the output as a single file, do not forget to call repartition(1) before df.write. This step might be problematic depending on data size. You do not need to do it if you plan to keep using Spark for your work, as you could load the data using the same approach as in step 1 (/new_result_data/*.csv)
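Putting the steps above together, a minimal PySpark sketch might look like the following (the input path and schema are the ones from the question; the filename-parsing regular expression and the output path are assumptions):
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, regexp_extract, first
spark = SparkSession.builder.appName('combine-csvs').getOrCreate()
df = spark.read.csv("/kaggle/input/bf-csv-2/BF_csv_2/data/**/*.csv", "DATE DATE, VALUE DOUBLE", header=True)
# keep only the file name (without directory or .csv extension) as a column
df = df.withColumn("filename", regexp_extract(input_file_name(), r"([^/]+)\.csv$", 1))
# one row per DATE, one column per file; first() picks the single value per (DATE, filename) pair
wide = df.groupBy("DATE").pivot("filename").agg(first("VALUE"))
# note: with ~130,000 distinct filenames you may need to raise spark.sql.pivotMaxValues
# or pass the list of filenames explicitly to pivot()
wide.write.parquet("/kaggle/working/combined", mode="overwrite")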

OverflowError with Pandas to_hdf

Python newbie here.
I am trying to save a large data frame into an HDF file with lz4 compression using to_hdf.
I use Windows 10, Python 3, Pandas 20.2
I get the error “OverflowError: Python int too large to convert to C long”.
None of the machine resources are close to their limits (RAM, CPU, SWAP usage)
Previous posts discuss the dtype, but the following example shows that there is some other problem, potentially related to the size?
import numpy as np
import pandas as pd
# sample dataframe to be saved, pardon my French
n=500*1000*1000
df= pd.DataFrame({'col1':[999999999999999999]*n,
'col2':['aaaaaaaaaaaaaaaaa']*n,
'col3':[999999999999999999]*n,
'col4':['aaaaaaaaaaaaaaaaa']*n,
'col5':[999999999999999999]*n,
'col6':['aaaaaaaaaaaaaaaaa']*n})
# works fine
lim=200*1000*1000
df[:lim].to_hdf('df.h5','table', complib= 'blosc:lz4', mode='w')
# works fine
lim=300*1000*1000
df[:lim].to_hdf('df.h5','table', complib= 'blosc:lz4', mode='w')
# Error
lim=400*1000*1000
df[:lim].to_hdf('df.h5','table', complib= 'blosc:lz4', mode='w')
....
OverflowError: Python int too large to convert to C long
I experienced the same issue and it seems that it is indeed connected to the size of the data frame rather than to dtype (I had all the columns stored as strings and was able to store them to .h5 separately).
The solution that worked for me is to save the data frame in chunks using mode='a'.
As suggested in the pandas documentation for mode {'a', 'w', 'r+'}, default 'a': 'a' means append; an existing file is opened for reading and writing, and if the file does not exist it is created.
So the sample code would look something like:
batch_size = 1000
for i, df_chunk in df.groupby(np.arange(df.shape[0]) // batch_size):
    df_chunk.to_hdf('df.h5', 'table', complib='blosc:lz4', mode='a')
As #Giovanni Maria Strampelli pointed out, the answer of #Artem Snorkovenko only saves the last batch. The pandas documentation states the following:
In order to add another DataFrame or Series to an existing HDF file, please use append mode and a different key.
Here is a possible workaround to save all batches (adjusted from the answer of #Artem Snorkovenko):
for i in range(len(df)):
    sr = df.loc[i]  # pandas Series object for the given index
    sr.to_hdf('df.h5', key='table_%i'%i, complib='blosc:lz4', mode='a')
This code saves each Pandas Series object with a different key. Each key is indexed by i.
To load the existing .h5 file after saving, one can do the following:
i = 0
dfdone = False  # if True, all keys in the .h5 file are successfully loaded
srl = []  # list of Series objects
while dfdone == False:
    #print(i)  # this is to see if the code is working properly
    try:  # check whether the current i value exists in the keys of the .h5 file
        sdfr = pd.read_hdf('df.h5', key='table_%i'%i)  # current Series object
        srl.append(sdfr)  # append each Series to a list to create the dataframe in the end
        i += 1  # increment i by 1 after loading the Series object
    except:  # if an error occurs, the current i value exceeds the number of keys; all keys are loaded
        dfdone = True  # terminate the while loop
df = pd.DataFrame(srl)  # generate the dataframe from the list of Series objects
I used a while loop, assuming we do not know the exact length of the dataframe in the .h5 file. If the length is known, a for loop can also be used.
Note that I am not saving dataframes in chunks here, so the loading procedure in its current form is not suitable for chunked saving, where each saved object would be a DataFrame. In my implementation, each saved object is a Series, and the DataFrame is generated from a list of Series. The code I provided can be adjusted to save in chunks and generate a DataFrame from a list of DataFrame objects (a nice starting point can be found in this Stack Overflow entry).
In addition to #tetrisforjeff's answer:
If the df contains object dtypes, reading can lead to errors. I would suggest pd.concat(srl) instead of pd.DataFrame(srl).
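As an alternative, here is a minimal sketch (not from the answers above; the file name, key, and batch size are placeholders) that writes all chunks under a single key using the PyTables 'table' format with append=True, and then reads everything back in one call:
import numpy as np
import pandas as pd
batch_size = 1000000
for _, df_chunk in df.groupby(np.arange(len(df)) // batch_size):
    # format='table' plus append=True keeps adding rows under the same key
    df_chunk.to_hdf('df.h5', key='table', format='table', append=True, complib='blosc:lz4', mode='a')
df_loaded = pd.read_hdf('df.h5', key='table')  # load everything back at once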

Updating a pickle file from dataframe values

I have a huge pickle file which needs to be updated every 3 hrs from a daily data file (a csv file).
There are two fields named TRX_DATE and TIME_STAMP in each, having values like 24/11/2015 and 24/11/2015 10:19:02 respectively (there are also 50 additional fields).
So what I am doing is first reading the huge pickle into a dataframe, then dropping any rows for today's date by comparing with the TRX_DATE field.
Then I read the csv file into another dataframe, append both dataframes, and create a new pickle again.
My script looks like
import pandas as pd
import datetime as dt
import pickle
df = pd.read_pickle('hugedata pickle')
Today = dt.datetime.today()
df = df[(df.TRX_DATE > Today)] #delete any entries for today in main pickle
df1 = pd.read_csv(daily data csv file)
df = df.append(df1,ignore_index=True)
df.to_pickle('same huge data pickle')
The problems are as follows:
1. It takes huge memory as well as time to read that huge pickle.
2. I need to append df1 to df so that only the columns from df remain, excluding any new columns coming from df1. But I am getting new columns with NaN values in many places.
So I need assistance on these things:
1. Is there a way to read only the small csv and append it to the pickle file (or is reading that pickle mandatory)?
2. Can it be done by converting the csv to a pickle and merging the two pickles with the load/dump method (I have actually never used that)?
3. How do I read the time from the TIME_STAMP field, get the data between two timestamps (filtering by TIME_STAMP), and update that into the main pickle? Previously I was filtering by TRX_DATE values.
Is there a better way? Please suggest.
HDF5 is made for what you are trying to do.
import tables  # PyTables is the backend pandas uses for HDF5
import numpy as np
import pandas as pd
from pandas import HDFStore, DataFrame
df.to_hdf('test.h5', key='test1')  # create an hdf5 file
pd.read_hdf('test.h5', key='test1')  # read an hdf5 file
df.to_hdf() defaults to mode='a', i.e. the file is opened in append mode; to append rows under an existing key, also pass format='table' and append=True.
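A minimal sketch of how this could map to the workflow in the question (the file name, key, and csv name are placeholders, and it assumes TRX_DATE is parsed as a datetime; this is an illustration, not part of the original answer):
import pandas as pd
# one-time conversion: store the big frame in queryable 'table' format
df.to_hdf('trx.h5', key='trx', format='table', data_columns=['TRX_DATE'], mode='w')
# every 3 hours: drop today's rows, then append the fresh daily csv
today = pd.Timestamp.today().normalize()
daily = pd.read_csv('daily.csv', parse_dates=['TRX_DATE', 'TIME_STAMP'], dayfirst=True)
with pd.HDFStore('trx.h5') as store:
    store.remove('trx', where='TRX_DATE >= today')  # delete only today's old rows
    store.append('trx', daily)  # columns must match the stored frame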
