I want to append a pandas DataFrame (8 columns) to an existing table in Databricks (12 columns), and fill the other 4 columns that can't be matched with None values. Here is what I've tried:
spark_df = spark.createDataFrame(df)
spark_df.write.mode("append").insertInto("my_table")
It threw this error:
ParseException: "\nmismatched input ':' expecting (line 1, pos 4)\n\n== SQL ==\n my_table
It looks like Spark can't handle this operation with mismatched columns. Is there any way to achieve what I want?
I think that the most natural course of action would be a select() transformation to add the missing columns to the 8-column dataframe, followed by a unionAll() transformation to merge the two.
from pyspark.sql import Row
from pyspark.sql.functions import lit
bigrow = Row(a='foo', b='bar')
bigdf = spark.createDataFrame([bigrow])
smallrow = Row(a='foobar')
smalldf = spark.createDataFrame([smallrow])
fitdf = smalldf.select(smalldf.a, lit(None).alias('b'))
uniondf = bigdf.unionAll(fitdf)
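Applied to the original question, the same idea might look roughly like the sketch below (untested; it assumes my_table already exists and that spark_df is the 8-column DataFrame from the question). Casting the NULL literals to the table's column types avoids appending untyped NullType columns, which some table formats reject.
from pyspark.sql.functions import lit

target = spark.table("my_table")
target_types = dict(target.dtypes)   # column name -> Spark type string
missing = [c for c in target.columns if c not in spark_df.columns]

# pad the 8-column frame with typed NULL columns, then reorder to the table's schema
fit_df = spark_df.select(
    *spark_df.columns,
    *[lit(None).cast(target_types[c]).alias(c) for c in missing]
)
fit_df.select(*target.columns).write.mode("append").saveAsTable("my_table")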
Can you try this?
from pyspark.sql import functions as F

df = spark.createDataFrame(pandas_df)
df_table_struct = spark.sql('select * from my_table limit 0')  # empty DataFrame with the table's schema
for col in set(df_table_struct.columns) - set(df.columns):
    df = df.withColumn(col, F.lit(None))
df_table_struct = df_table_struct.unionByName(df)
df_table_struct.write.saveAsTable('my_table', mode='append')
I have a pyspark dataframe with two columns, name and source. All the values in the name column are distinct. The source column holds multiple strings separated by commas (,).
I want to filter out all those rows where any of the strings in the source column contains any value from the whole name column.
I am using the following UDF:
def checkDependentKPI(df, name_list):
    for row in df.collect():
        for src in row["source"].split(","):
            for name in name_list:
                if name in src:
                    return row['name']
    return row['name']
My end goal is to put all such rows at the end of the dataframe. How can I do it?
Sample dataframe:
+-------+---------------+
|   name|         source|
+-------+---------------+
|    dev|prod, sum, diff|
|   prod| dev, diff, avg|
|  stage|     mean, mode|
|balance|   median, mean|
| target| avg, diff, sum|
+-------+---------------+
You can use like() to leverage the SQL LIKE expression without any heavy collect() action or loop checking. Suppose you already have the list of names in a variable called name:
from functools import reduce
from pyspark.sql import functions as func

df.filter(
    reduce(lambda x, y: x | y, [func.col('source').like(f"%{pattern}%") for pattern in name])
).show(20, False)
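The question also asks to move the matching rows to the end of the dataframe rather than drop them. One way to do that, reusing the same expression, could look like this (a sketch, assuming name is the list of names as above):
from functools import reduce
from pyspark.sql import functions as func

match_any = reduce(lambda x, y: x | y, [func.col('source').like(f"%{pattern}%") for pattern in name])
# booleans sort False before True, so the matching rows end up last
df.orderBy(match_any.asc()).show(20, False)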
Maybe this?
from pyspark.sql import functions as psf

test_data = [('dev', 'prod,sum,diff')
    , ('prod', 'dev,diff,avg')
    , ('stage', 'mean,mode')
    , ('balance', 'median,mean')
    , ('target', 'avg,diff,sum')]
df = spark.createDataFrame(test_data, ['kpi_name', 'kpi_source_table'])
df = df.withColumn('kpi_source_table', psf.split('kpi_source_table', ','))

# collect every kpi_name into one array and cross-join it onto each row
df_flat = df.agg(psf.collect_list('kpi_name').alias('flat_kpi'))
df = df.join(df_flat, how='cross')

# rows whose source list shares an element with the name list get a non-empty 'match';
# ordering by 'match' sorts the empty (non-matching) arrays first
df = df.withColumn('match', psf.array_intersect('kpi_source_table', 'flat_kpi'))
display(df.orderBy('match'))
Assuming I have the following multiindex DF
import numpy as np
import pandas as pd
from random import randint

input_id = np.array(['12345'])
docType = np.array(['pre','pub','app','dw'])
docId = np.array(['34455667'])
sec_type = np.array(['bib','abs','cl','de'])
sec_ids = np.array(['x-y','z-k'])

index = pd.MultiIndex.from_product([input_id, docType, docId, sec_type, sec_ids])
content = [str(randint(1, 10)) + '##' + str(randint(1, 10)) for i in range(len(index))]
df = pd.DataFrame(content, index=index, columns=['content'])
df.rename_axis(index=['input_id','docType','docId','secType','sec_ids'], inplace=True)
df
I know that I can query a multiindex DF as follows:
# querying a multiindex DF
idx = pd.IndexSlice
df.loc[idx[:,['pub','pre'],:,'de',:]]
Basically, with the help of pd.IndexSlice I can pass the values I want for each of the index levels. In the above case I want the resulting DF where the second index is 'pub' OR 'pre' and the 4th one is 'de'.
I am looking for a way to pass a range of values to the query, something like the 3rd index level being between 34567 and 45657. Assume those are integers.
Pseudocode: df.loc[idx[:,['pub','pre'],XXXXX,'de',:]]
XXXX = ?
EDIT 1:
The docId index level is of text type; it probably needs to be converted to int first.
It turns out query() is very powerful:
df.query('docType in ["pub","pre"] and ("34455667" <= docId <= "3445568") and (secType=="de")')
Output:
content
input_id docType docId secType sec_ids
12345 pre 34455667 de x-y 2##9
z-k 6##1
pub 34455667 de x-y 6##5
z-k 9##8
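For completeness, the pd.IndexSlice route from the pseudocode also works once the docId level is converted to integers and the index is sorted. A minimal sketch (the range 34567 to 45657 is just the example range from the question):
# convert the docId level to int and sort so that range slicing works
df.index = df.index.set_levels(df.index.levels[2].astype(int), level='docId')
df = df.sort_index()

idx = pd.IndexSlice
df.loc[idx[:, ['pub', 'pre'], 34567:45657, 'de', :], :]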
I am able to load a csv into a pandas dataframe, but it is stuck in a list. How can I load directly into a pandas dataframe from Pydrill, or unlist the pandas dataframe columns and data? I've tried unlisting and it puts everything into a list of lists.
I've used to_dataframe(), but can't seem to find documentation on whether I can use a delimiter. pd.DataFrame doesn't work because of the Pydrill query.
reviews = drill.query("SELECT * FROM hdfs.datasets.`titanic_ML/titanic.csv` LIMIT 1000", timeout=30)
print(reviews)
import pandas as pd
df2 = reviews.to_dataframe()
df2.rename(columns=df2.iloc[0])
headers = df2.iloc[0]
print(headers)
new_df = pd.DataFrame(df2.values[1:], columns=headers)
new_df.head()
The results cast everything into a list.
["pclass","sex","age","sibsp","parch","fare","embarked","survived"]
0 ["3","1","38.0","0","0","7.8958","1","0"]
1 ["1","1","42.0","0","0","26.55","1","0"]
2 ["3","0","9.0","4","2","31.275","1","0"]
3 ["3","1","27.0","0","0","7.25","1","0"]
4 ["1","1","41.0","0","0","26.55","1","0"]
I'd like to get everything into a normal pandas dataframe.
The solution I found was this. It doesn't unlist the dataframe, but it's an alternate solution to the problem:
import psycopg2
import pandas as pd

connect_str = "dbname='dbname' user='dsa_ro_user' host='host database'"
conn = psycopg2.connect(connect_str)

SQL = "SELECT * "
SQL += " FROM train"

df = pd.read_sql(SQL, conn)
df.head()
Try using Table Functions as described in the O'Reilly text, Chapter 4: Querying Delimited Data. This will delimit the file and apply the first row to your columns. Note: because everything is being read as text, you may need to cast your values as floats if you want to do arithmetic in your SELECT or WHERE.
This should get you what you want:
sql="""
SELECT *
FROM table(hdfs.datasets.`/titanic_ML/titanic.csv`(
type => 'text',
extractHeader => true,
fieldDelimiter => ',')
) LIMIT 1000
"""
rows = drill.query(sql, timeout=30)
df = rows.to_dataframe()
df.head()
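Per the note above about everything being read as text, a small follow-up sketch that casts the numeric columns in pandas after loading (column names taken from the sample output earlier; adjust them to your file):
numeric_cols = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'survived']
df[numeric_cols] = df[numeric_cols].astype(float)
df.dtypes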
I have an instance of pyspark.sql.dataframe.DataFrame created using
dataframe = sqlContext.sql("select * from table").
One column is 'arrival_date' and contains a string.
How do I modify this column so as to only take the first 4 characters from it and throw away the rest?
How would I convert the type of this column from string to date?
In graphlab.SFrame, this would be:
dataframe['column_name'] = dataframe['column_name'].apply(lambda x: x[:4] )
and
dataframe['column_name'] = dataframe['column_name'].str_to_datetime()
As stated by Orions, you can't modify a column, but you can override it. Also, you shouldn't need to create a user-defined function, as there is a built-in function for extracting substrings:
from pyspark.sql.functions import *
df = df.withColumn("arrival_date", df['arrival_date'].substr(0, 4))
To convert it to date, you can use to_date, as Orions said:
from pyspark.sql.functions import *
df = df.withColumn("arrival_date", to_date(df['arrival_date'].substr(0, 4)))
However, if you need to specify the format, you should use unix_timestamp:
from pyspark.sql.functions import *
format = 'yyMM'
col = unix_timestamp(df['arrival_date'].substr(0, 4), format).cast('timestamp')
df = df.withColumn("arrival_date", col)
All this can be found in the pyspark documentation.
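As a side note, on more recent Spark versions (2.2+) to_date also accepts a format string directly, so the unix_timestamp detour can be skipped. A sketch reusing the 'yyMM' format assumed above:
from pyspark.sql.functions import to_date

df = df.withColumn("arrival_date", to_date(df['arrival_date'].substr(0, 4), 'yyMM'))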
To extract the first 4 characters from the arrival_date (StringType) column, create a new_df using a UserDefinedFunction (as you cannot modify the columns: they are immutable):
from pyspark.sql.functions import UserDefinedFunction, to_date
from pyspark.sql.types import StringType

old_df = spark.sql("SELECT * FROM table")
udf = UserDefinedFunction(lambda x: str(x)[:4], StringType())
new_df = old_df.select(*[udf(column).alias('arrival_date') if column == 'arrival_date' else column for column in old_df.columns])
And to convert the arrival_date (StringType) column into a DateType column, use the to_date function as shown below:
new_df = old_df.select(old_df.other_cols_if_any, to_date(old_df.arrival_date).alias('arrival_date'))
Sources:
https://stackoverflow.com/a/29257220/2873538
https://databricks.com/blog/2015/09/16/apache-spark-1-5-dataframe-api-highlights.html
I want to filter a Pyspark DataFrame with a SQL-like IN clause, as in
sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * from my_df WHERE field1 IN a')
where a is the tuple (1, 2, 3). I am getting this error:
java.lang.RuntimeException: [1.67] failure: ``('' expected but identifier a found
which is basically saying it was expecting something like '(1, 2, 3)' instead of a.
The problem is I can't manually write the values in a as it's extracted from another job.
How would I filter in this case?
The string you pass to SQLContext is evaluated in the scope of the SQL environment. It doesn't capture the closure. If you want to pass a variable, you'll have to do it explicitly using string formatting:
df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
## 2
Obviously this is not something you would use in a "real" SQL environment due to security considerations, but it shouldn't matter here.
In practice, the DataFrame DSL is a much better choice when you want to create dynamic queries:
from pyspark.sql.functions import col
df.where(col("v").isin({"foo", "bar"})).count()
## 2
It is easy to build and compose and handles all details of HiveQL / Spark SQL for you.
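For example, isin composes with other column expressions using the usual boolean operators. A small sketch against the same toy df (the k > 1 condition is just an illustration):
from pyspark.sql.functions import col

df.where(col("v").isin({"foo", "bar"}) & (col("k") > 1)).show()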
Reiterating what @zero323 mentioned above: we can do the same thing using a list as well (not only a set), like below:
from pyspark.sql.functions import col
df.where(col("v").isin(["foo", "bar"])).count()
Just a little addition/update:
choice_list = ["foo", "bar", "jack", "joan"]
If you want to filter your dataframe "df" so that you keep only the rows where column "v" takes a value from choice_list, then:
from pyspark.sql.functions import col
df_filtered = df.where(col("v").isin(choice_list))
You can also do this for integer columns:
df_filtered = df.filter("field1 in (1,2,3)")
or this for string columns:
df_filtered = df.filter("field1 in ('a','b','c')")
A slightly different approach that worked for me is to filter with a custom filter function.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

def filter_func(a):
    """wrapper function to pass a in udf"""
    def filter_func_(col):
        """filtering function"""
        if col in a.value:
            return True
        return False
    return udf(filter_func_, BooleanType())

# Broadcasting allows passing large variables efficiently
a = sc.broadcast((1, 2, 3))
df = my_df.filter(filter_func(a)(col('field1')))
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Practise').getOrCreate()
df_pyspark = spark.read.csv('datasets/myData.csv', header=True, inferSchema=True)
df_pyspark.createOrReplaceTempView("df")  # we need to create a temp view first
spark.sql("SELECT * FROM df WHERE Departments IN ('IOT','Big Data') ORDER BY Departments").show()