I am trying to develop a few data pipelines using Apache Airflow with scheduled Spark jobs.
For one of these pipelines, I am trying to write data from a PySpark DataFrame to MySQL and I keep running into a few problems. This is what my code looks like for now, but I do want to add more transformations to it in the future:
df_tsv = spark.read.csv(tsv_file, sep=r'\t', header=True)
df_tsv.write.jdbc(url=mysql_url, table=mysql_table, mode="append", properties={"user":mysql_user, "password": mysql_password, "driver": "com.mysql.cj.jdbc.Driver" })
This is the exception that keeps getting raised:
java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
The first thing that I want to know is how I can solve the above issue.
Secondly, I would like to know what the best practice is when writing data from Spark to databases like MySQL. For instance, is there an option to make it so that data from a given column in the DataFrame is stored in a specified column in the table? Or should the column names of the table be the same as those of the DataFrame?
The other option that I can think of here is to convert the DataFrame to say, a list of tuples and then use something like the mysql-python-connector to load the data into the database,
rdd = df.rdd
b = rdd.map(tuple)
data = b.collect()
# write data to database using mysql-python-connector
What is the more efficient option here? Are there any other options that I do not know of?
java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver
The first thing that I want to know is how I can solve the above issue.
You need to pass the MySQL JDBC connector JAR when starting your Spark session; see https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html.
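For example, one way is to pull the driver from Maven via the spark.jars.packages config when building the session. A minimal sketch, assuming a recent mysql-connector-java release (the Maven coordinates and version are an assumption; adjust them to match your MySQL server):

from pyspark.sql import SparkSession

# Download the MySQL JDBC driver when the session starts
# (coordinates/version below are an assumption; adjust to your setup).
spark = (
    SparkSession.builder
    .appName("tsv_to_mysql")
    .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.28")
    .getOrCreate()
)

Alternatively, pass the driver JAR explicitly with spark-submit --jars /path/to/mysql-connector-java.jar (or --packages) in the job your Airflow DAG submits.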
Secondly, I would like to know what the best practice is when writing data from Spark to databases like MySQL. For instance, is there an option to make it so that data from a given column in the DataFrame is stored in a specified column in the table? Or should the column names of the table be the same as those of the DataFrame?
Yes, when appending to an existing table the DataFrame column names need to match the table column names. If they differ, rename or alias the DataFrame columns before writing, as in the sketch below.
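A minimal sketch of renaming before the write (the source and target column names here are hypothetical):

from pyspark.sql.functions import col

# Rename DataFrame columns so they line up with the MySQL table's columns
# (both sets of names below are made up for illustration).
df_to_write = df_tsv.select(
    col("tsv_col_a").alias("mysql_col_a"),
    col("tsv_col_b").alias("mysql_col_b"),
)

df_to_write.write.jdbc(
    url=mysql_url,
    table=mysql_table,
    mode="append",
    properties={"user": mysql_user, "password": mysql_password,
                "driver": "com.mysql.cj.jdbc.Driver"},
)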
The other option that I can think of here is to convert the DataFrame to say, a list of tuples and then use something like the mysql-python-connector to load the data into the database,
rdd = df.rdd
b = rdd.map(tuple)
data = b.collect()
# write data to database using mysql-python-connector
No, never do this; it defeats the whole purpose of using Spark (distributed computation), because collect() pulls the entire dataset onto a single driver node. Check out the link above; you will find some good advice on where to start and how to read from and write to a JDBC data source.
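For reference, the distributed write from the question can also be tuned with the standard Spark JDBC options; a sketch (the batchsize and numPartitions values are assumptions to tune for your cluster and database):

(
    df_tsv.write
    .format("jdbc")
    .option("url", mysql_url)
    .option("dbtable", mysql_table)
    .option("user", mysql_user)
    .option("password", mysql_password)
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("batchsize", "10000")      # rows per JDBC batch insert
    .option("numPartitions", "8")      # max parallel connections writing to MySQL
    .mode("append")
    .save()
)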
Related
I am trying to figure out how to iterate through rows in a .CSV file and enter that data into a table in sqlite, but only if the data in that row meets certain criteria.
I am trying to build a database of my personal spending. I have used Python to categorise my spending data, and I now want to enter that data into a database with each category as a different table. This means I need to sort the data and enter it into different tables based on the category of spend.
I looked for quite a long time. Can anyone help?
You need to read the CSV file using pandas and store it in a pandas DataFrame. Then (if you have not already created a database) use the SQLAlchemy library (here is the documentation) to create an engine: engine = sqlalchemy.create_engine('sqlite:///file.db').
Afterwards, you need to write the DataFrame to the SQL database using pandas' to_sql function (documentation): df.to_sql('table_name', engine, index=False). I used index=False to avoid creating a column for the DataFrame's index.
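Putting those pieces together, a minimal sketch (the CSV file name, the 'category' column, and the table name are assumptions about your data):

import pandas as pd
import sqlalchemy

# Read the CSV into a pandas DataFrame
df = pd.read_csv('spending.csv')

# Keep only the rows that meet your criteria, e.g. a hypothetical 'category' column
groceries = df[df['category'] == 'groceries']

# Create (or connect to) the SQLite database and write one table per category
engine = sqlalchemy.create_engine('sqlite:///file.db')
groceries.to_sql('groceries', engine, index=False, if_exists='append')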
I'm currently working in a dev environment in Databricks, using a notebook to apply some Python code to analyse some dummy data (just a few 1,000 rows) held in a database table. I then deploy this to the main environment and run it on the real data (100s of millions of rows).
To start with, I just need values from a single column that meet a certain criterion. In order to get at the data I'm currently doing this:
spk_data = spark.sql("SELECT field FROM database.table WHERE field == 'value'")
data = spk_data.toPandas()
Then the rest of the Python notebook does its thing on that data. This works fine in the dev environment, but when I run it for real it falls over at line 2, saying it's out of memory.
I want to import the data DIRECTLY into the pandas DataFrame and so remove the need to convert from Spark, as I'm assuming that will avoid the error. After a LOT of Googling I still can't work out how; the only thing I've tried that appears syntactically valid is:
data = pd.read_table (r'database.table')
but just get:
'PermissionError: [Errno 13] Permission denied:'
(nb. unfortunately I have no control over the content, form or location of the database I'm querying)
You have to use pd.read_sql_query for this case.
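A sketch, assuming the table is reachable through a SQLAlchemy connection (the connection string and table name below are placeholders):

import pandas as pd
import sqlalchemy

# Placeholder connection string; replace with the real one for your database.
engine = sqlalchemy.create_engine('mysql+pymysql://user:password@host/database')

# read_sql_query runs the SQL on the database and returns a pandas DataFrame;
# pass chunksize=... to get an iterator of smaller DataFrames instead.
data = pd.read_sql_query("SELECT field FROM my_table WHERE field = 'value'", engine)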
Your assumption is very likely to be untrue.
Spark is a distributed computation engine; pandas is a single-node toolset. So when you run a query on millions of rows, it is likely to fail. When you call df.toPandas(), Spark moves all of the data to your driver node, so if it is more than the driver's memory, it is going to fail with an out-of-memory exception. In other words, if your dataset is larger than memory, pandas is not going to work well.
Also, when using pandas on Databricks you are missing all of the benefits of using the underlying cluster. You are just using the driver.
There are two sensible options to solve this:
redo your solution using Spark
use Koalas, which has an API mostly compatible with pandas (see the sketch below)
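A minimal sketch of the Koalas route (on newer Databricks runtimes the same API lives under pyspark.pandas; the query is taken from the question):

import databricks.koalas as ks

# Run the query on the cluster and get back a pandas-like, distributed DataFrame
kdf = ks.sql("SELECT field FROM database.table WHERE field = 'value'")

# Most common pandas operations work here, but stay distributed on the cluster
counts = kdf['field'].value_counts()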
I have a dataset in BigQuery with 100 thousand+ rows and 10 columns. I'm also continuously adding new data to the dataset. I want to fetch the data that has not been processed, process it, and write it back to my table. Currently, I'm fetching it into a pandas DataFrame using the BigQuery Python library and processing it with pandas.
Now I want to update the table with the new pre-processed data. One way of doing it is using a SQL statement and calling the query function of the bigquery.Client() class, or using a job like here.
from google.cloud import bigquery

bqclient = bigquery.Client(
    credentials=credentials,
    project=project_id,
)
query = """UPDATE `dataset.table` SET field_1 = '3' WHERE field_2 = '1'"""
bqclient.query(query)
But it doesn't make sense to create an UPDATE statement for each row.
Another way I found is using the to_gbq function of the pandas-gbq package. The disadvantage of this is that it replaces the whole table.
Question: What is the best way of updating a BigQuery table from a pandas DataFrame?
Google BigQuery is mainly used for data analysis when your data is static and you don't have to update values, since the architecture is basically designed for that kind of workload. Therefore, if you want to update the data, there are some options, but they are very heavy:
The one you mentioned: running a query and updating row by row.
Recreating the table using only the new values.
Appending the new data with a different timestamp.
Using partitioned tables [1] and, if possible, clustered tables [2]. This way, when you want to update the table you can use the partitioning and clustering columns in the UPDATE and the query will be less heavy. Also, you can append the new data into a new partition, say the one for the current day.
If you are using the data for analytical reasons, maybe the best options are 2 and 3, but I always recommend having [1] and [2] (see the sketch after the references below).
[1] https://cloud.google.com/bigquery/docs/querying-partitioned-tables
[2] https://cloud.google.com/bigquery/docs/clustered-tables
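For option 3, a minimal sketch of appending only the newly pre-processed rows with pandas-gbq (the dataset/table name and the processed_at column are assumptions):

import pandas as pd
import pandas_gbq

# df holds only the newly pre-processed rows, not the whole table
df['processed_at'] = pd.Timestamp.utcnow()

# Append to the (ideally partitioned) table instead of replacing it
pandas_gbq.to_gbq(
    df,
    'dataset.table',
    project_id=project_id,
    if_exists='append',
)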
I am pretty new to Elasticsearch, so please forgive me if I am asking a very simple question.
In my workplace we have a proper ELK setup.
Due to the very large volume of data we are only storing 14 days of data, and my question is how I can read the data in Python and later store my analysis in some NoSQL store.
As of now my primary goal is to read the raw data from the Elasticsearch cluster into Python in the form of a DataFrame or any other format.
I want to get it for different time intervals like 1 day, 1 week, 1 month, etc.
I have been struggling with this for the last week.
You can use the code below to achieve that:
from pandasticsearch import DataFrame

# Create a DataFrame object backed by an Elasticsearch index
df = DataFrame.from_es(url='http://localhost:9200', index='indexname')
To get the schema of your index:
df.print_schema()
After that you can perform general DataFrame operations on the df.
If you want to parse the result yourself, then do the following:
from elasticsearch import Elasticsearch
es = Elasticsearch('http://localhost:9200')
result_dict = es.search(index="indexname", body={"query": {"match_all": {}}})
and then finally convert everything into your final DataFrame:
from pandasticsearch import Select
pandas_df = Select.from_dict(result_dict).to_pandas()
I hope it helps.
It depends on how you want to read the data from Elasticsearch. Is it incremental reading, i.e. reading the new data that comes in every day, or is it a bulk read? For the latter, you need to use the bulk API of Elasticsearch in Python, and for the former you can restrict yourself to a simple range query.
Schematic code for reading bulk data: https://gist.github.com/dpkshrma/04be6092eda6ae108bfc1ed820621130
How to use bulk API of ES:
How to use Bulk API to store the keywords in ES by using Python
https://elasticsearch-py.readthedocs.io/en/master/helpers.html#elasticsearch.helpers.bulk
How to use the range query for incremental inserts:
https://martinapugliese.github.io/python-for-(some)-elasticsearch-queries/
How to have Range and Match query in one elastic search query using python?
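A minimal sketch of such a range query for incremental reads (the index and timestamp field names are assumptions):

from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')

# Only fetch documents indexed in the last day; adjust the field and interval as needed.
result_dict = es.search(
    index="indexname",
    body={"query": {"range": {"@timestamp": {"gte": "now-1d/d"}}}},
)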
Since you want to fetch your data for different time intervals, you will need to perform date aggregations as well.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html
How to perform multiple aggregation on an object in Elasticsearch using Python?
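For instance, a date-histogram aggregation that buckets documents per day could look like this (the field name is an assumption, and older Elasticsearch versions use "interval" instead of "calendar_interval"):

# Bucket the documents per day; size=0 skips the raw hits and returns only the buckets
body = {
    "size": 0,
    "aggs": {
        "per_day": {
            "date_histogram": {
                "field": "@timestamp",
                "calendar_interval": "day",
            }
        }
    },
}
buckets = es.search(index="indexname", body=body)["aggregations"]["per_day"]["buckets"]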
Once you issue your Elasticsearch query, your data will be collected in a temporary variable. You can then use a Python client library for a NoSQL database, such as PyMongo for MongoDB, to insert the Elasticsearch data into it.
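For example, a sketch of pushing the hits from result_dict above into MongoDB with PyMongo (the connection string, database, and collection names are assumptions):

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')
collection = client['analysis_db']['es_results']

# Insert the source documents returned by the Elasticsearch query
docs = [hit['_source'] for hit in result_dict['hits']['hits']]
collection.insert_many(docs)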
I am using Python pandas to load data from a MySQL database, change it, and then update another table. There are 100,000+ rows, so the UPDATE queries take some time.
Is there a more efficient way to update the data in the database than to use df.iterrows() and run an UPDATE query for each row?
The problem here is not pandas, it is the UPDATE operations. Each row will fire its own UPDATE query, meaning lots of overhead for the database connector to handle.
You are better off using the df.to_csv('filename.csv') method to dump your DataFrame into a CSV file, then reading that CSV file into your MySQL database using LOAD DATA INFILE.
Load it into a new table, then DROP the old one and RENAME the new one to the old one's name.
Furthermore, I suggest you do the same when loading data into pandas: use the SELECT ... INTO OUTFILE MySQL command and then load that file into pandas using the pd.read_csv() method.
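A minimal sketch of the write path with mysql-connector-python (the connection details, file path, and table names are assumptions, and LOCAL INFILE must be enabled on both the client and the server):

import mysql.connector

# Dump the DataFrame to CSV without the index or header
df.to_csv('/tmp/updates.csv', index=False, header=False)

conn = mysql.connector.connect(
    host='localhost', user='user', password='secret',
    database='mydb', allow_local_infile=True,
)
cur = conn.cursor()

# Bulk-load into a staging table, then swap it in place of the old one
cur.execute("""
    LOAD DATA LOCAL INFILE '/tmp/updates.csv'
    INTO TABLE my_table_new
    FIELDS TERMINATED BY ','
""")
cur.execute("DROP TABLE my_table")
cur.execute("RENAME TABLE my_table_new TO my_table")
conn.commit()
conn.close()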