Direct flattening of records from dataframe in PySpark - python

I have a question: is there any direct or simpler approach in PySpark/Hive to flatten the data than the one I describe below?
I have a dataset in one table that looks like this:
rdd = sc.parallelize([("123","000"),("456","123"),("789","456"),("111","000"),("999","888")])
df = rdd.toDF(["active_acct","inactive_acct"])
df.createOrReplaceTempView("temp_main_active_accts")
temp_main_active_accts_pd = df.toPandas()
df.show()
+-------------+---------------+
| active_acct | inactive_acct |
+-------------+---------------+
| 123         | 000           |
| 456         | 123           |
| 789         | 456           |
| 111         | 000           |
| 999         | 888           |
+-------------+---------------+
I am expecting the final output to be like:
+----------------+----------------+
| Current_active | all_old_active |
+----------------+----------------+
| 789            | 456,123,000    |
| 111            | 000            |
| 999            | 888            |
+----------------+----------------+
This means 789 is the currently active record, and 456, 123, and 000 were each active at one time or another, which is why you can see the recursive link in the main table.
I first have to get to the latest record, i.e. 789, so that I can follow the link back to the previous credit cards. I have a query that gets to the latest account used and returns records like this:
active_accts = spark.sql("""
    select active_acct, inactive_acct from temp_main_active_accts
    where active_acct not in (select t1.active_acct from temp_main_active_accts t1
                              join temp_main_active_accts t2 on t1.active_acct = t2.inactive_acct)
""")
active_accts.show()
+-------------+---------------+
| active_acct | inactive_acct |
+-------------+---------------+
| 789         | 456           |
| 111         | 000           |
| 999         | 888           |
+-------------+---------------+
Below is the logic to flatten the records with a UDF, but the problem is that it takes a long time to run. I am looking for a simpler solution, SQL or PySpark based, that avoids the UDF implementation.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

old_acc_list = ""

def first_itr(active_acc):
    global old_acc_list
    # Look up the row whose active_acct equals the account we were handed
    qry = """active_acct == '{0}'""".format(active_acc)
    active_acc_pd = temp_main_active_accts_pd.query(qry)
    active_acc_pd = active_acc_pd.drop_duplicates()
    active_acc_pd = active_acc_pd.reset_index(drop=True)
    active_acc_cnt = active_acc_pd.size
    if active_acc_cnt > 0:
        inactive_acc = active_acc_pd['inactive_acct'].astype(str)[0]
        old_acc_list += "," + str(inactive_acc)
        first_itr(inactive_acc)                 # follow the chain to the next older account
    else:
        old_acc_list = old_acc_list.lstrip(",")
    return old_acc_list

def extract_old_accs(active_acc):
    global old_acc_list
    old_acc_list = ""                           # reset the accumulator for every row
    return first_itr(active_acc)

extract_old_acc_udf = udf(extract_old_accs, StringType())
df_final = active_accts.withColumn("all_old_accs", extract_old_acc_udf(col("active_acct")))
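One UDF-free alternative would be an iterative self-join: start from the chain heads in active_accts and repeatedly join back to the main table, concatenating each hop. The sketch below is only an illustration; it assumes each account links back to at most one older account and that chains are short (the range(10) cap is made up, not derived from the data), and it reuses df and active_accts from above.
from pyspark.sql import functions as F

# Start from the chain heads: current account, the link to follow next, and the accumulated list
result = active_accts.select(
    F.col("active_acct").alias("Current_active"),
    F.col("inactive_acct").alias("tail"),
    F.col("inactive_acct").alias("all_old_active"))

# Rename the main table so each pass can look up the next link in the chain
old_links = df.withColumnRenamed("active_acct", "tail").withColumnRenamed("inactive_acct", "next_tail")

for _ in range(10):  # made-up cap on chain length
    result = (result.join(old_links, on="tail", how="left")
                    .withColumn("all_old_active",
                                F.when(F.col("next_tail").isNotNull(),
                                       F.concat_ws(",", "all_old_active", "next_tail"))
                                 .otherwise(F.col("all_old_active")))
                    .withColumn("tail", F.coalesce("next_tail", "tail"))
                    .drop("next_tail"))

result.select("Current_active", "all_old_active").show()
Each pass adds a join and a shuffle, so this sketch only makes sense when the maximum chain length is small and roughly known.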

Related

PySpark search inside very large dataframe

I have a very large dataframe in pyspark. It has over 10 million rows and over 30 columns.
What is the most efficient method to search the entire dataframe for a given list of values and remove the rows that contain any of them?
The given list of values:
list=['1097192','10727550','1098754']
The dataframe (df) is:
+---------+--------------+---------------+---------+------------+
| id | first_name | last_name | Salary | Verifycode |
+---------+--------------+---------------+---------+------------+
| 1986 | Rollie | Lewin | 1097192 | 42254172 | -Remove Row
| 289743 | Karil | Sudron | 2785190 | 3703538 |
| 3864 | Massimiliano | Dallicott | 1194553 | 23292573 |
| 49074 | Gerry | Grinnov | 1506584 | 62291161 |
| 5087654 | Nat | Leatherborrow | 1781870 | 55183252 |
| 689 | Thaine | Tipple | 2150105 | 40583249 |
| 7907 | Myrlene | Croley | 2883250 | 70380540 |
| 887 | Nada | Redier | 2676139 | 10727550 | -Remove Row
| 96533 | Sonny | Bosden | 1050067 | 13110714 |
| 1098754 | Dennie | McGahy | 1804487 | 927935 | -Remove Row
+---------+--------------+---------------+---------+------------+
If it were a smaller dataframe I could use the collect() or toLocalIterator() functions and then iterate over the rows and remove them based on the list values.
Since it is a very large dataframe, what is the best way to solve this?
I have come up with this solution for now, but is there a better way:
from pyspark.sql.functions import col

column_names = df.schema.names
for name in column_names:
    df = df.filter(~col(name).isin(list))
You got the correct approach of filtering the DataFrame using the filter and isin functions. You can use isin if the list is small (a few thousand values, not millions). Also make sure that your dataframe is partitioned into at least 3x the number of CPU cores on the executors; it is important to have enough partitions, because without them parallelism will suffer.
I am more comfortable with Scala, so please take the concept from the code below. You need to build a single Column object that combines the isin check for every column to be filtered on, then pass its negation to dataframe.filter:
from functools import reduce
from pyspark.sql.functions import col

column_names = df.schema.names
# OR together an isin() check for every column into one Column object
col_final = reduce(lambda acc, name: acc | col(name).isin(list),
                   column_names[1:], col(column_names[0]).isin(list))
df = df.filter(~col_final)  # apply the negation of the combined column object
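To act on the partitioning advice above, something along these lines could be used (just a sketch; it assumes spark is the active SparkSession and treats defaultParallelism as a stand-in for the total executor core count):
# Hypothetical repartition to roughly 3x the cluster's default parallelism
df = df.repartition(3 * spark.sparkContext.defaultParallelism)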

Iterate pyspark dataframe rows and apply UDF

I have a dataframe that looks like this:
+--------------+----------+----------+
| partitionCol | orderCol | valueCol |
+--------------+----------+----------+
| A | 1 | 201 |
| A | 2 | 645 |
| A | 3 | 302 |
| B | 1 | 335 |
| B | 2 | 834 |
+--------------+----------+----------+
I want to group by the partitionCol, then within each partition iterate over the rows, ordered by orderCol, and apply some function to calculate a new column based on the valueCol and a cached value.
e.g.
def foo(col_value, cached_value):
    tmp = <some value based on a condition between col_value and cached_value>
    <update the cached_value using some logic>
    return tmp
I understand I need to group by the partitionCol and apply a UDF that will operate on each chunk separately, but I am struggling to find a good way to iterate over the rows and apply the logic I described, to get the desired output of:
+--------------+----------+----------+---------------+
| partitionCol | orderCol | valueCol | calculatedCol |
+--------------+----------+----------+---------------+
| A | 1 | 201 | C1 |
| A | 2 | 645 | C1 |
| A | 3 | 302 | C2 |
| B | 1 | 335 | C1 |
| B | 2 | 834 | C2 |
+--------------+----------+----------+---------------+
I think the best way for you to do that is to apply a UDF on the whole set of data:
from pyspark.sql import functions as F

# first, you create a struct with the order col and the value col
df = df.withColumn("my_data", F.struct(F.col("orderCol"), F.col("valueCol")))
# then you create an array of that new column, one per partitionCol
df = df.groupBy("partitionCol").agg(F.collect_list("my_data").alias("my_data"))
# finally, you apply your function on that array
df = df.withColumn("calculatedCol", my_udf(F.col("my_data")))
But without knowing exactly what you want to do, that is all I can offer.
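The body of my_udf depends entirely on your real condition and cache-update logic, but a hypothetical sketch (the C1/C2 condition and the initial cache value of 0 are placeholders I made up, not your actual rule) could sort the collected structs by orderCol and return one label per element:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

@udf(returnType=ArrayType(StringType()))
def my_udf(rows):
    # rows is the collected array of (orderCol, valueCol) structs for one partitionCol
    cached_value = 0                          # placeholder initial cache
    labels = []
    for row in sorted(rows, key=lambda r: r["orderCol"]):
        label = "C1" if row["valueCol"] > cached_value else "C2"   # placeholder condition
        cached_value = row["valueCol"]        # placeholder cache update
        labels.append(label)
    return labels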

(Python 3.x and SQL) Check if entries exists in multiple tables and convert to binary values

I have a 'master table' that contains just one column of ids from all my other tables. I also have several other tables that contain some of the ids, along with other columns of data. I am trying to iterate through all of the ids for each smaller table, create a new column for that table, check whether each id exists in it, and write a binary entry in the master table (0 if the id doesn't exist, and 1 if the id does exist in the specified table).
That seems pretty confusing, but the application of this is to check if a user exists on the table for a specific date, and keep track of this information day to day.
Right now I am iterating through the dates, and inside each iteration I am iterating through all of the ids to check whether they exist for that date. This is likely going to be incredibly slow, and there is probably a better way to do it. My code looks like this:
def main():
    dates = init()
    id_list = getids()
    print(dates)
    for date in reversed(dates):
        cursor.execute("ALTER TABLE " + table + " ADD " + date + " BIT;")
        cnx.commit()
        for ID in id_list:
            (...)
I know that the next step will be to generate a query using each id that looks something like:
SELECT id FROM [date]_table
WHERE EXISTS (SELECT 1 FROM master_table WHERE master_table.id = [date]_table.id)
I've been stuck on this problem for a couple days and so far I cannot come up with a query that gives a useful result.
For an example, if I had three tables for three days...
Monday:
+------+-----+
| id | ... |
+------+-----+
| 1001 | ... |
| 1002 | ... |
| 1003 | ... |
| 1004 | ... |
| 1005 | ... |
+------+-----+
Tuesday:
+------+-----+
| id | ... |
+------+-----+
| 1001 | ... |
| 1003 | ... |
| 1005 | ... |
+------+-----+
Wednesday:
+------+-----+
| id | ... |
+------+-----+
| 1002 | ... |
| 1004 | ... |
+------+-----+
I'd like to end up with a master table like this:
+------+--------+---------+-----------+
| id | monday | tuesday | wednesday |
+------+--------+---------+-----------+
| 1001 | 1 | 1 | 0 |
| 1002 | 1 | 0 | 1 |
| 1003 | 1 | 1 | 0 |
| 1004 | 1 | 0 | 1 |
| 1005 | 1 | 1 | 0 |
+------+--------+---------+-----------+
Thank you ahead of time for any help with this issue. And since it's sort of a confusing problem, please let me know if there are any additional details I can provide.
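As an aside, one way the inner per-id loop could collapse into a single statement per date is a joined UPDATE. The snippet below is only a sketch: it assumes MySQL, reuses the table, dates, cursor, and cnx objects from the code above, and assumes each daily table really is named <date>_table as in the question.
for date in reversed(dates):
    cursor.execute("ALTER TABLE " + table + " ADD " + date + " BIT DEFAULT 0;")
    # Hypothetical set-based update: flag every master id that exists in that day's table
    cursor.execute(
        "UPDATE " + table + " m"
        " LEFT JOIN " + date + "_table t ON t.id = m.id"
        " SET m." + date + " = (t.id IS NOT NULL);")
    cnx.commit()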

How to efficiently extract unique rows from massive CSV using Python or R

I have a massive CSV (1.4gb, over 1MM rows) of stock market data that I will process using R.
The table looks roughly like this. For each ticker, there are thousands of rows of data.
+--------+------+-------+------+------+
| Ticker | Open | Close | High | Low |
+--------+------+-------+------+------+
| A | 121 | 121 | 212 | 2434 |
| A | 32 | 23 | 43 | 344 |
| A | 121 | 121 | 212 | 2434 |
| A | 32 | 23 | 43 | 344 |
| A | 121 | 121 | 212 | 2434 |
| B | 32 | 23 | 43 | 344 |
+--------+------+-------+------+------+
To make processing and testing easier, I'm breaking this colossus into smaller files using the script mentioned in this question: How do I slice a single CSV file into several smaller ones grouped by a field?
The script would output files such as data_a.csv, data_b.csv, etc.
But, I would also like to create index.csv which simply lists all the unique stock ticker names.
E.g.
+---------+
| Ticker |
+---------+
| A |
| B |
| C |
| D |
| ... |
+---------+
Can anybody recommend an efficient way of doing this in R or Python, when handling a huge filesize?
You could loop through each file, grabbing the index of each and creating a set union of all indices.
import glob
import pandas as pd

tickers = set()
for csvfile in glob.glob('*.csv'):
    data = pd.read_csv(csvfile, index_col=0, header=None)  # or header=0, however your data is set up
    tickers.update(data.index.tolist())

pd.Series(list(tickers)).to_csv('index.csv', index=False)
You can retrieve the index from the file names:
(index <- data.frame(Ticker = toupper(gsub("^.*_(.*)\\.csv",
"\\1",
list.files()))))
## Ticker
## 1 A
## 2 B
write.csv(index, "index.csv")
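If you would rather not split the file first, a chunked pass over the original CSV keeps memory use bounded. This is a sketch only; it assumes the source file is called data.csv and that the ticker column header is literally Ticker:
import pandas as pd

tickers = set()
# Stream the large file in pieces instead of loading it all at once
for chunk in pd.read_csv('data.csv', usecols=['Ticker'], chunksize=500_000):
    tickers.update(chunk['Ticker'].unique())

pd.Series(sorted(tickers), name='Ticker').to_csv('index.csv', index=False)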

In mysql, is it possible to add a column based on values in one column?

I have a MySQL table, data, which has the following columns:
+-------+-----------+----------+
|a | b | c |
+-------+-----------+----------+
| John | 225630096 | 447 |
| John | 225630118 | 491 |
| John | 225630206 | 667 |
| John | 225630480 | 1215 |
| John | 225630677 | 1609 |
| John | 225631010 | 2275 |
| Ryan | 154247076 | 6235 |
| Ryan | 154247079 | 6241 |
| Ryan | 154247083 | 6249 |
| Ryan | 154247084 | 6251 |
+-------+-----------+----------+
I want to add a column d based on the values in a and c (see the expected table below). The value in a is the name of the subject; b is one of its attributes, and c another. If the values of c are within 15 units of each other for a subject, assign them the same cluster number (for example, every value of c for Ryan is within 15 units, so they are all assigned 1); if not, assign them different values, as for John, where each row gets a different value for d.
+-------+-----------+----------+---+
|a | b | c |d |
+-------+-----------+----------+---+
| John | 225630096 | 447 | 1 |
| John | 225630118 | 491 | 2 |
| John | 225630206 | 667 | 3 |
| John | 225630480 | 1215 | 4 |
| John | 225630677 | 1609 | 5 |
| John | 225631010 | 2275 | 6 |
| Ryan | 154247076 | 6235 | 1 |
| Ryan | 154247079 | 6241 | 1 |
| Ryan | 154247083 | 6249 | 1 |
| Ryan | 154247084 | 6251 | 1 |
+-------+-----------+----------+---+
I am not sure whether this can be done in MySQL, but if not I would welcome any Python-based answers as well; in that case I would work on this table in CSV format.
Thanks.
You could use a query with variables:
SELECT a, b, c,
CASE WHEN #last_a != a THEN #d:=1
WHEN (#last_a = a) AND (c>#last_c+15) THEN #d:=#d+1
ELSE #d END d,
#last_a := a,
#last_c := c
FROM
tablename, (SELECT #d:=1, #last_a:=null, #last_c:=null) _n
ORDER BY a, c
Please see fiddle here.
Explanation
I'm using a join between tablename and the subquery (SELECT ...) _n just to initialize some variables (#d is initialized to 1, #last_a to null, #last_c to null).
Then, for every row, I check whether the last encountered a (the one on the previous row) is different from the current a: in that case, set #d to 1 (and return it).
If the last encountered a is the same as the current row's and c is greater than the last encountered c + 15, then increment #d and return its value.
Otherwise, just return #d without incrementing it. This happens when a has not changed and c is not greater than the previous c + 15, or at the first row (because #last_a and #last_c were initialized to null).
To make it work, we need to order by a and c.
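Since the question also welcomes Python answers working from a CSV export, here is a rough pandas equivalent of the same per-subject gap logic (a sketch; data.csv and the bare column names a, b, c are assumptions about the export):
import pandas as pd

# Sort so the "within 15 units of the previous c" comparison matches the SQL ORDER BY a, c
df = pd.read_csv('data.csv').sort_values(['a', 'c']).reset_index(drop=True)

# Within each subject, start a new cluster whenever the gap to the previous c exceeds 15 units
df['d'] = df.groupby('a')['c'].transform(lambda s: s.diff().gt(15).cumsum() + 1)

print(df)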
