Pyspark changes longs to ints - python

I'm using Pyspark 1.2.1 with Hive. (Upgrading will not happen immediately).
The problem I have is that when I select from a Hive table, and add an index, pyspark changes long values to ints, so I end up with a temp table with a column of type Long, but values of type Integer. (See code below).
My question is: how can I either (a) perform the merge of the index (see code) without changing longs to ints; or (b) add the index in some other way that avoids the problem; or (c) randomize table columns without needing to join?
The underlying problem I'm trying to solve is that I want to randomize the order of certain columns in a hive table, and write that to a new table. This is to make data no longer personally identifiable. I'm doing that by adding an incrementing index to the original table and the randomised columns, then joining on that index.
The table looks like:
primary | longcolumn | randomisecolumn
The code is:
hc = HiveContext(sc)
orig = hc.sql('select * from mytable')
widx = orig.zipWithIndex().map(merge_index_on_row)
sql_context.applySchema(widx, add_index_schema(orig.schema())) \
    .registerTempTable('sani_first')
# At this point sani_first has a column longcolumn with type long,
# but (many of) the values are ints
def merge_index_on_row((row, idx), idx_name=INDEX_COL):
    """
    Row is a SchemaRDD row object; idx is an integer;
    schema is the schema for row with an added index col at the end.
    Returns a version of row applying schema and holding the index in the new row.
    """
    as_dict = row.asDict()
    as_dict[idx_name] = idx
    return Row(**as_dict)
def add_index_schema(schema):
    """
    Take a schema, add a column for an index, and return the new schema.
    """
    return StructType(sorted(schema.fields + [StructField(INDEX_COL, IntegerType(), False)],
                             key=lambda x: x.name))
In the absence of a better solution, I'm going to force the affected columns to long type in the python code. This is...not great.
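For reference, here is a minimal sketch of that fallback, coercing the affected values inside the row-merge step; the LONG_COLS list is an assumption standing in for whichever columns are affected:
LONG_COLS = ['longcolumn']  # assumed: the columns that must stay long

def merge_index_on_row_forced((row, idx), idx_name=INDEX_COL):
    """Same as merge_index_on_row, but explicitly coerce the known long columns."""
    as_dict = row.asDict()
    for col in LONG_COLS:
        if as_dict.get(col) is not None:
            as_dict[col] = long(as_dict[col])  # Python 2 long, matching the Spark 1.2.1 era
    as_dict[idx_name] = idx
    return Row(**as_dict)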

Related

How to dynamically add new columns with the datatypes to the existing Delta table and update the new columns with values

Scenario:
df1 ---> Col1,Col2,Col3 -- which are the columns in the delta table
df2 ---> Col1,Col2,Col3,Col4,Col5 -- which are the columns in the latest refresh table
How do I get the new columns (Col4, Col5 above) with their datatypes dynamically?
How do I alter the existing Delta table to include the new columns (Col4, Col5 above) dynamically and update the new columns' values?
Thanks for your help.
You don't need to perform an explicit ALTER TABLE if you have a Delta table - you just need to use the built-in schema evolution capabilities (blog post): add the mergeSchema option with the value true, and Delta will take care of updating the schema. For example, if I have an initial table with two fields, i1 & i2:
df1 = spark.createDataFrame([[1,2]], schema="i1 int, i2 int")
df1.write.format("delta").mode("overwrite").saveAsTable("test")
then I can update it even if I have more columns:
df2 = spark.createDataFrame([[1, 2, '1']], schema="i1 int, i2 int, v string")
df2.write.format("delta").mode("append") \
    .option("mergeSchema", "true").saveAsTable("test")
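As a quick sanity check (a sketch, assuming the test table created above), reading the table back now shows the merged schema with all three columns:
spark.table("test").printSchema()
# root
#  |-- i1: integer (nullable = true)
#  |-- i2: integer (nullable = true)
#  |-- v: string (nullable = true)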
You can find more information about schema evolution & enforcement in the following blog post and this webinar.

Efficient row comparison in pandas dataframe on incomplete data

I work with incomplete data that also has doubles (duplicate rows), and I need to clear it of doubles, choosing complete rows where available.
For example, this is how the data looks.
I need to search through each row to see whether it's a double (has 'rank' > 1), and whether it is incomplete itself but has some complete doubles.
I'll explain now:
Not every row with 'rank' = 1 has a date in it (which is crucial),
but some of them have doubles ('rank' > 1) which do have a date.
Not every row has a double, and if such a row doesn't have a date, that's OK.
So I need to find the double with the date, if it exists, and write it over the row with rank 1 (or delete the incomplete first row).
In the end I need a DataFrame with no doubles and as many dates as available.
Here's my code with an EXTREMELY inefficient iterative loop; I don't know how to rewrite it with vectorization or the .apply() method:
def test_func(dataframe):
    df = dataframe
    df.iloc[0:0]
    for i in range(0, dataframe.shape[0]):
        if dataframe.iloc[i]['rank'] == 1:
            temp_row = dataframe.iloc[i]
        elif ((dataframe.iloc[i+1]['rank'] > 1) &
              (pd.isna(dataframe.iloc[i]['date']) &
               (~pd.isna(dataframe.iloc[i+1]['date'])))):
            temp_row = dataframe.iloc[i+1]
        df.loc[i] = temp_row
    return df
Hope to find some help! From Russia with love xo.
Assuming that you are grouping by phone and you are interested in populating missing dates, you can use a backward fill with groupby, which will fill the missing dates with the next available non-null date within the group.
test_df['date'] = test_df.groupby(['phone'])['date'].apply(lambda x: x.bfill())
If you need to populate other missing data, just replace 'date' with the relevant column name.
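If the end goal is also to drop the doubles once the dates are filled in, here is a minimal sketch under the same assumptions (columns named phone, date, and rank, with rank == 1 marking the row to keep):
# Backfill missing dates within each phone group from later duplicate rows,
# then keep only the rank-1 row of each group.
test_df['date'] = test_df.groupby('phone')['date'].bfill()
deduped = test_df[test_df['rank'] == 1].reset_index(drop=True)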

Creating a table in MariaDB using a list of column names in Python

I am trying to create a table in mariadb using python. I have all the column names stored in a list as shown below.
collist = ['RR', 'ABPm', 'ABPs', 'ABPd', 'HR', 'SPO']
This is just a sample list; the actual list has 200 items. I am trying to create a table using the above collist elements as columns, with VARCHAR as the datatype for all of them.
This is the code I am using to create a table
for p in collist:
    cur.execute('CREATE TABLE IF NOT EXISTS table1 ({} VARCHAR(45))'.format(p))
The above code executes, but only the first element of the list is added as a column in the table; I cannot see the remaining elements. I'd really appreciate some help with this.
You can build the string in 3 parts and then .join() those together. The middle portion is the column definitions, joining each of the items in the original list. This doesn't seem particularly healthy, both in the number of columns and in the fact that everything is VARCHAR(45), but that's your decision:
collist = ['RR', 'ABPm', 'ABPs', 'ABPd', 'HR', 'SPO']
query = ''.join(["CREATE TABLE IF NOT EXISTS table1 (",
                 ' VARCHAR(45), '.join(collist),
                 ' VARCHAR(45))'])
Because we used join, you need to specify the last column type separately (the third item in the list) to correctly close the query.
NOTE: If the input data comes from user input then this would be susceptible to SQL injection since you are just formatting unknown strings in, to be executed. I am assuming the list of column names is internal to your program.
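For reference, a short usage sketch (assuming cur is the open cursor from the question): print the assembled string to inspect the generated DDL, then execute it once.
# Inspect the generated CREATE TABLE statement, then run it once
print(query)
cur.execute(query)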

sqlite - return all columns for max of one column without repeats

I'm using Python to query a SQL database. I'm fairly new to databases. I've tried looking up this question, but I can't find a similar enough question to get the right answer.
I have a table with multiple columns/rows. I want to find the MAX of a single column, I want ALL columns returned (the entire ROW), and I want only one instance of the MAX. Right now I'm getting ten ROWS returned, because the MAX is repeated ten times. I only want one ROW returned.
The query strings I've tried so far:
sql = 'select max(f) from cbar'
# this returns one ROW, but only a single COLUMN (a single value)
sql = 'select * from cbar where f = (select max(f) from cbar)'
# this returns all COLUMNS, but it also returns multiple ROWS
I've tried a bunch more, but they returned nothing. They weren't right somehow. That's the problem, I'm too new to find the middle ground between my two working query statements.
In SQLite 3.7.11 or later, you can just retrieve all columns together with the maximum value:
SELECT *, max(f) FROM cbar;
But the SQLite library bundled with your Python might be too old. In the general case, you can sort the table by that column and then just read the first row:
SELECT * FROM cbar ORDER BY f DESC LIMIT 1;
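For completeness, a minimal Python sketch of the second query (table and column names as in the question; the database file name is an assumption):
import sqlite3

conn = sqlite3.connect('mydata.db')  # assumed database file
cur = conn.cursor()
cur.execute('SELECT * FROM cbar ORDER BY f DESC LIMIT 1')
row = cur.fetchone()  # one tuple with every column of the max-f row, or None if the table is empty
conn.close()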

Add MySQL query results to R dataframe

I want to convert a MySQL query from a Python script to an analogous query in R. The Python script uses a loop structure to search for specific values using genomic coordinates:
SQL = """SELECT value FROM %s FORCE INDEX (chrs) FORCE INDEX (sites)
WHERE `chrom` = %d AND `site` = %d""" % (Table, Chr, Start)
cur.execute(SQL)
In R, the chromosomes and sites are in a dataframe, and for every row in the dataframe I would like to extract a single value and add it to a new column in the dataframe.
So my current dataframe has a similar structure to the following:
df <- data.frame("Chr"=c(1,1,3,5,5), "Site"=c(100, 200, 400, 100, 300))
The amended dataframe should have an additional column with values from the database (at the corresponding genomic coordinates). The structure should be similar to:
df <- data.frame("Chr"=c(1,1,3,5,5), "Site"=c(100, 200, 400, 100, 300), "Value"=c(1.5, 0, 5, 60, 100))
So far I connected to the database using:
con <- dbConnect(MySQL(),
                 user="root", password="",
                 dbname="MyDataBase")
Rather than loop over each row in my dataframe, I would like to use something that would add the corresponding value to a new column in the existing dataframe.
Update with working solution based on answer below:
library(RMySQL)
library(plyr)  # for ldply

con <- dbConnect(MySQL(),
                 user="root", password="",
                 dbname="MyDataBase")

GetValue <- function(DataFrame, Table){
  queries <- sprintf("SELECT value as value
                      FROM %s FORCE INDEX (chrs) FORCE INDEX (sites)
                      WHERE chrom = %d AND site = %d UNION ALL SELECT 'NA' LIMIT 1",
                     Table, DataFrame$Chr, DataFrame$start)
  res <- ldply(queries, function(query) { dbGetQuery(con, query) })
  DataFrame[, Table] <- res$value
  return(DataFrame)
}

df <- GetValue(df, "TableName")
df <- GetValue(df, "TableName")
Maybe you could do something like this. First build up your queries, then execute them, storing the results in a column of your dataframe. I'm not sure if the do.call("rbind", ...) part is necessary, but that basically takes a bunch of dataframe rows and squishes them together by row into a single dataframe.
queries = sprintf("SELECT value as value FROM %s FORCE INDEX (chrs) FORCE INDEX (sites) WHERE chrom = %d AND site = %d UNION ALL SELECT 0 LIMIT 1", "TableName", df$Chrom, df$Pos)
df$Value = do.call("rbind", lapply(queries, function(query) dbGetQuery(mydb, query)))$value
I played with your SQL a little; my concern with the original is cases where it might return more than one row.
I like the data.table package for this kind of task, as its syntax is inspired by SQL.
require(data.table)
So, here is an example table to match the values against:
table <- data.table(chrom=rep(1:5, each=5),
                    site=rep(100*1:5, times=5),
                    Value=runif(5*5))
Now the SQL query can be translated into something like
# select from table, where chrom=Chr and site=Site, value
Chr <- 2
Site <- 200
table[chrom==Chr & site==Site, Value] # returns data.table
table[chrom==Chr & site==Site, ]$Value # returns numeric
Key (index) the table for quick lookup (assuming the chrom and site combinations are unique):
setkey(table, chrom, site)
table[J(Chr, Site), ]$Value # very fast lookup due to indexed table
Your dataframe as a data.table with two columns, 'Chr' and 'Site', both integer:
df <- data.frame("Chr"=c(1,1,3,5,5), "Site"=c(100, 200, 400, 100, 300))
dt <- as.data.table(df) # adds data.table class to data.frame
setkey(dt, Chr, Site) # index for 'by' and for 'J' join
Match the values and append them as a new column (by reference, so the table is not copied):
# loop over keys Chr and Site and find the match in the table
# select the Value column and create a new column that contains this
dt[, Value:=table[chrom==Chr & site==Site]$Value, by=list(Chr, Site)]
# faster:
dt[, Value:=table[J(Chr, Site)]$Value, by=list(Chr, Site)]
# fastest: in one table merge operation assuming the keys are in the same order
table[J(dt)]
kind greetings
Why don't you use the RMySQL or sqldf package?
With RMySQL, you get MySQL access in R.
With sqldf, you can issue SQL queries on R data structures.
Using either of those, you do not need to reword your SQL query to get the same results.
Let me also mention the data.table package, which lets you do very efficient selects and joins on your data frames after converting them to data tables using as.data.table(your.data.frame). Another good thing about it is that a data.table object is a data.frame at the same time, so all your functions that work on the data frames work on these converted objects, too.
You could easily use the dplyr package. There is even a nice vignette about that - http://cran.rstudio.com/web/packages/dplyr/vignettes/databases.html.
One thing you need to know is:
You can connect to MySQL and MariaDB (a recent fork of MySQL) through
src_mysql(), mediated by the RMySQL package. Like PostgreSQL, you'll
need to provide a dbname, username, password, host, and port.
