I'm reading through a PySpark program I found online.
In the code, we create a Pandas dataframe, convert it to a Spark dataframe, and create a temporary view.
pd_temp = pd.DataFrame(np.random.random(10))
spark_temp = spark.createDataFrame(pd_temp)
spark_temp.createOrReplaceTempView('temp')
Straightforward and understandable.
Later in the code, we read in a flights dataset, and also create a temporary view.
flights = spark.read.csv('data/flights_small.csv', header=True)
flights.name = flights.createOrReplaceTempView('flights') # This is the line I'm not sure about
This is what I'm confused about. What is flights.name doing? There's no name column.
Why not simply assign to flights - what is .name adding?
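For reference, this is the simpler version I have in mind, just creating the view without the extra assignment:

flights = spark.read.csv('data/flights_small.csv', header=True)
flights.createOrReplaceTempView('flights')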
I have been trying to rename a column in a csv file I am working on in Google Colab. The same line of code works for one column name but does not work for the other.
import pandas as pd
import numpy as np
data = pd.read_csv("Daily Bike Sharing.csv",
                   index_col="dteday",
                   parse_dates=True)
dataset = data.loc[:, ["cnt", "holiday", "workingday", "weathersit",
                       "temp", "atemp", "hum", "windspeed"]]
dataset = dataset.rename(columns={'cnt': 'y'})
dataset = dataset.rename(columns={'dteday': 'ds'})
dataset.head(1)
The image below is the dataframe called data.
The image below is dataset.
This image is the final output I get when I try to rename the column.
The column "dteday" is not getting renamed, but "cnt" is being renamed to "y" by the same code. Can someone help me out? I have been racking my brain on this for some time now.
That's because you're setting dteday as your index when reading in the csv, whereas cnt is simply a column. Avoid the index_col argument in read_csv and instead call dataset = dataset.set_index('ds') after renaming.
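A minimal sketch of that suggestion, assuming the same file and column names as in the question:

import pandas as pd

# Read without setting an index, so 'dteday' stays a regular column
data = pd.read_csv("Daily Bike Sharing.csv", parse_dates=["dteday"])
dataset = data[["dteday", "cnt", "holiday", "workingday", "weathersit",
                "temp", "atemp", "hum", "windspeed"]]
# Rename both columns, then promote 'ds' to the index
dataset = dataset.rename(columns={"cnt": "y", "dteday": "ds"})
dataset = dataset.set_index("ds")
dataset.head(1)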
An alternative in which only your penultimate line (trying to rename the index) would need to be changed:
dataset.index.names = ['ds']
You can remove the index_col in the read statement, include 'dteday' in your dataset and then change the column name. You can make that column the index later using df.set_index.
The use case is to append a column to a Parquet dataset and then efficiently re-write it at the same location. Here is a minimal example.
Create a pandas DataFrame and write as a partitioned Parquet dataset.
import pandas as pd

df = pd.DataFrame({
    'id': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c'],
    'value': [0, 1, 2, 3, 4, 5, 6, 7, 8]})
path = r'c:/data.parquet'
df.to_parquet(path=path, engine='pyarrow', compression='snappy',
              index=False, partition_cols=['id'], flavor='spark')
Then load the Parquet dataset as a pyspark view and create a modified dataset as a pyspark DataFrame.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.read.parquet(path).createTempView('data')
sf = spark.sql("SELECT id, value, 0 AS segment FROM data")
At this point the sf data is the same as the df data but with an additional segment column of all zeros. I would like to efficiently overwrite the existing Parquet dataset at path with sf as a Parquet dataset in the same location. Below is what does not work. I would also prefer not to write sf to a new location, delete the old Parquet dataset and rename it, as that does not seem efficient.
# saves existing data and new data
sf.write.partitionBy('id').mode('append').parquet(path)
# immediately deletes existing data then crashes
sf.write.partitionBy('id').mode('overwrite').parquet(path)
My answer in short: you shouldn't :\
One principle of big data (and Spark is for big data) is to never overwrite stuff. Sure, .mode('overwrite') exists, but this is not a correct usage.
My guesses as to why it could (should) fail:
you add a column, so the written dataset has a different format than the one currently stored there. This can create schema confusion
you overwrite the input data while processing. So Spark reads some lines, processes them and overwrites the input files. But those files are still the inputs for other lines to process.
What I usually do in such a situation is create another dataset, and when there is no reason to keep the old one (i.e. when the processing is completely finished), clean it up. To remove files, you can check this post on how to delete hdfs files. It should work for all files accessible by Spark. However it is in Scala, so I'm not sure if it can be adapted to pyspark.
Note that efficiency is not a good reason to overwrite; it does more work than simply writing.
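A rough sketch of that workflow, under two assumptions that are not part of the question: the new dataset can live at a sibling path such as c:/data_v2.parquet, and the old files can be removed through Spark's JVM gateway (sc._jvm and sc._jsc are internal pyspark attributes):

# Write the modified data to a new location instead of overwriting in place
new_path = r'c:/data_v2.parquet'  # hypothetical sibling path
sf.write.partitionBy('id').mode('overwrite').parquet(new_path)

# Once nothing reads the old dataset any more, remove it via the Hadoop
# FileSystem API, which also covers HDFS/S3-style paths
sc = spark.sparkContext
old_path = sc._jvm.org.apache.hadoop.fs.Path(path)
fs = old_path.getFileSystem(sc._jsc.hadoopConfiguration())
fs.delete(old_path, True)  # True = recursive delete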
So I'm trying to create a python script that allows me to perform SQL manipulations on a dataframe (masterfile) I created using pandas. The dataframe draws its contents from the csv files found in a specific folder.
I was able to successfully create everything else, but I am having trouble with the SQL manipulation part. I am trying to use the dataframe as the "database" from which I will pull the data using my SQL query, but I am getting an "AttributeError: 'DataFrame' object has no attribute 'cursor'" error.
I'm not really seeing a lot of examples for pandas.read_sql_query(), so I am having a difficult time understanding how to use my dataframe with it.
import os
import glob
import pandas
os.chdir("SOMECENSOREDDIRECTORY")
all_csv = [i for i in glob.glob('*.{}'.format('csv')) if i != 'Masterfile.csv']
edited_files = []
for i in all_csv:
    df = pandas.read_csv(i)
    df["file_name"] = i.split('.')[0]
    edited_files.append(df)
masterfile = pandas.concat(edited_files, sort=False)
print("Data fields are as shown below:")
print(masterfile.iloc[0])
sql_query = "SELECT Country, file_name as Year, Happiness_Score FROM masterfile WHERE Country = 'Switzerland'"
output = pandas.read_sql_query(sql_query, masterfile)
output.to_csv('data_pull')
I know this part is wrong, but this is the concept I am trying to get to work but don't know how:
output = pandas.read_sql_query(sql_query, masterfile)
I appreciate any help I can get! I am a self-taught python programmer by the way, so I might be missing some general rule or something. Thanks!
Edit: replaced "slice" with "manipulate" because I realized I didn't want to just slice it. Also fixed some alignment issues on my code block.
It is possible to slice a dataframe created through Pandas and SQL. You can use the loc function of pandas to slice a dataframe.
df.loc[rows, columns]
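For instance, the query in the question could be expressed with loc and a boolean mask, as a sketch assuming masterfile has the Country, file_name and Happiness_Score columns:

# Keep rows where Country is 'Switzerland' and select the three columns
output = masterfile.loc[masterfile['Country'] == 'Switzerland',
                        ['Country', 'file_name', 'Happiness_Score']]
# Rename file_name to Year, matching the "file_name as Year" alias in the SQL
output = output.rename(columns={'file_name': 'Year'})
output.to_csv('data_pull')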
I have a dictionary as follows:
my_dict = {'a':[12,15.2,52.1],'b':[2.5,2.4,5.2],'c':[1.2,5.3,12]}
I want to save this dictionary in Databricks so that I do not have to obtain it every time I want to start working with it. Furthermore, I would like to know how to retrieve it and have it in its original form again.
I have tried doing the following:
from itertools import zip_longest
column_names, data = zip(*my_dict.items())
spark.createDataFrame(zip_longest(*data), column_names).show()
and
column_names, data = zip(*dict_brands.items())
spark.createDataFrame(zip(*data), column_names).show()
However, I get the following error:
zip_longest argument #10342 must support iteration
I also do not know how to reload it or upload it. I tried with a sample dataframe (not the same one), as follows:
df.write.format("tfrecords").mode("overwrite").save('/data/tmp/my_df')
And the error is:
Attribute name "my_column" contains invalid character(s)
among " ,;{}()\n\t=". Please use alias to rename it.
Finally, in order to obtain it, I thought about:
my_df = spark.table("my_df") # Get table
df = my_df.toPandas() # Make pd dataframe
and then make it a dictionary, but maybe there is an easier way than turning it into a dataframe, saving it, reading it back as a dataframe and converting it into a dictionary again.
I would also like to know the computational cost for the solutions, since the actual dataset is very large.
Here is my sample code, realizing your needs step by step.
Convert a dictionary to a Pandas dataframe
my_dict = {'a':[12,15.2,52.1],'b':[2.5,2.4,5.2],'c':[1.2,5.3,12]}
import pandas as pd
pdf = pd.DataFrame(my_dict)
Convert a Pandas dataframe to a PySpark dataframe
df = spark.createDataFrame(pdf)
Save the PySpark dataframe to a file using the parquet format. The tfrecords format is not supported here.
df.write.format("parquet").mode("overwrite").save('/data/tmp/my_df')
Load the saved file above as a PySpark dataframe.
df2 = spark.read.format("parquet").load('/data/tmp/my_df')
Convert the PySpark dataframe back to a dictionary.
my_dict2 = df2.toPandas().to_dict()
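If the goal is to get back the exact {key: list} shape of my_dict, one possible refinement is to pass the orient parameter of to_dict (note that Spark does not guarantee row order when reading parquet, so the lists may come back reordered):

# orient='list' returns {'a': [...], 'b': [...], 'c': [...]} like the original dictionary
my_dict2 = df2.toPandas().to_dict(orient='list')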
The computational cost of the code above depends on the memory usage of your actual dataset.
If you load some data, compute a DataFrame, write that to disk and then use the DataFrame later... assuming it isn't still cached in RAM (let's say there wasn't enough), would Spark be smart enough to load the data from disk rather than recompute the DataFrame from the original data?
For example:
df1 = spark.read.parquet('data/df1.parquet')
df2 = spark.read.parquet('data/df2.parquet')
joined = df1.join(df2, df1.id == df2.id)
joined.write.parquet('data/joined.parquet')
computed = joined.select('id', 'total').withColumn('double_total', 2 * joined.total)
computed.write.parquet('data/computed.parquet')
Under the right circumstances, when we store computed, will it load the joined DataFrame from data/joined.parquet or will it always re-compute by loading/joining df1/df2 if it isn't currently caching joined?
The joined dataframe points to df1.join(df2, df1.id == df2.id). As far as I know, the parquet writer will not cause any changes to that reference; therefore, in order to load the parquet data you need to construct a new Spark reader with spark.read.parquet(...).
You can verify the above claim from the DataFrameWriter code (check the parquet/save methods), which returns Unit and does not modify the reference of the source dataframe in any way. Finally, to answer your question: in the above example the joined dataframe will be calculated once for joined.write.parquet('data/joined.parquet') and once for computed.write.parquet('data/computed.parquet').
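If you want the second write to start from the file instead of re-running the join, a sketch (assuming the same paths and columns as in the question) is to read the written result back explicitly before deriving computed:

# Read the materialized join back from disk so later work starts from the file,
# not from the original join plan
joined_from_disk = spark.read.parquet('data/joined.parquet')
computed = joined_from_disk.select('id', 'total').withColumn(
    'double_total', 2 * joined_from_disk.total)
computed.write.parquet('data/computed.parquet')

Alternatively, calling joined.cache() before the first write keeps the join result around (in memory, spilling to disk if needed) so the second action can reuse it.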