I have stock data (6000+ stocks, 100+ GB) saved as an HDF5 file.
Basically, I am trying to translate this pandas code into PySpark. Ideally, I would like to save both the values used for the ranking and the ranks themselves to a file.
agg_df = pd.DataFrame()
for stock in stocks:
    df = pd.read_csv(stock)
    df = my_func(df)  # custom function, the output of which is used for ranking; for simplicity, standard deviation works
    agg_df = pd.concat([agg_df, df], axis=1)  # concatenate as columns, one per stock
agg_df.rank() #.to_csv() <- would like to save ranks for future use
Each data file has the same schema like:
Symbol Open High Low Close Volume
DateTime
2010-09-13 09:30:00 A 29.23 29.25 29.17 29.25 17667.0
2010-09-13 09:31:00 A 29.26 29.34 29.25 29.33 5000.0
2010-09-13 09:32:00 A 29.31 29.36 29.31 29.36 600.0
2010-09-13 09:33:00 A 29.33 29.36 29.30 29.35 7300.0
2010-09-13 09:34:00 A 29.35 29.39 29.31 29.39 3222.0
Desired output (where each number is a rank):
A AAPL MSFT ...etc
DateTime
2010-09-13 09:30:00 1 3 7 ...
2010-09-13 09:31:00 4 5 7 ...
2010-09-13 09:32:00 24 17 99 ...
2010-09-13 09:33:00 7 63 42 ...
2010-09-13 09:34:00 5 4 13 ...
I have read other answers about Window and pyspark.sql, but I am not sure how to apply them to my case, since I seem to need to aggregate across stocks by row before ranking (at least in pandas).
Edit 1: After I read the data into an RDD with rdd = sc.parallelize(data.keys).map(data.read_data), rdd becomes a PipelineRDD, which doesn't have a .select() method. 0xDFDFDFDF's example contains all the data in one dataframe, but I don't think it's a good idea to append everything to one dataframe to do the computation.
Result: I was finally able to solve it. There were two problems: reading the files and performing the calculations.
Regarding reading the files, I initially loaded them from HDF5 using rdd = sc.parallelize(data.keys).map(data.read_data), which resulted in a PipelineRDD that was a collection of pandas dataframes. These needed to be transformed into a Spark dataframe for the solution to work. I ended up converting my HDF5 file to parquet and saving the per-stock files to a separate folder. Then, using
sqlContext = pyspark.sql.SQLContext(sc)
rdd_p = sqlContext.read.parquet(r"D:\parq")
I read all the files into a single dataframe.
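For reference, here is a rough sketch of the conversion step described above; the paths, the one-key-per-stock layout of the HDF5 store, and the output file names are assumptions about my setup, not part of the accepted answer:
import os
import pandas as pd

store_path = r"D:\stocks.h5"   # hypothetical path to the HDF5 store
parquet_dir = r"D:\parq"       # the folder read above with sqlContext.read.parquet

with pd.HDFStore(store_path, mode="r") as store:
    for key in store.keys():                      # one key per stock in this layout
        pdf = store[key]                          # pandas DataFrame for that stock
        pdf = pdf.reset_index()                   # keep DateTime as a normal column
        out = os.path.join(parquet_dir, key.strip("/") + ".parquet")
        pdf.to_parquet(out, index=False)          # needs pyarrow or fastparquet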
After that, I performed the calculation from the accepted answer. Huge thanks to 0xDFDFDFDF for the help.
Extras:
discussion - https://chat.stackoverflow.com/rooms/214307/discussion-between-biarys-and-0xdfdfdfdf
0xDFDFDFDF solution - https://gist.github.com/0xDFDFDFDF/a93a7e4448abc03f606008c7422784d1
Indeed, window functions will do the trick.
I've created a small mock dataset, which should resemble yours.
columns = ['DateTime', 'Symbol', 'Open', 'High', 'Low', 'Close', 'Volume']
data = [('2010-09-13 09:30:00','A',29.23,29.25,29.17,29.25,17667.0),
('2010-09-13 09:31:00','A',29.26,29.34,29.25,29.33,5000.0),
('2010-09-13 09:32:00','A',29.31,29.36,29.31,29.36,600.0),
('2010-09-13 09:34:00','A',29.35,29.39,29.31,29.39,3222.0),
('2010-09-13 09:30:00','AAPL',39.23,39.25,39.17,39.25,37667.0),
('2010-09-13 09:31:00','AAPL',39.26,39.34,39.25,39.33,3000.0),
('2010-09-13 09:32:00','AAPL',39.31,39.36,39.31,39.36,300.0),
('2010-09-13 09:33:00','AAPL',39.33,39.36,39.30,39.35,3300.0),
('2010-09-13 09:34:00','AAPL',39.35,39.39,39.31,39.39,4222.0),
('2010-09-13 09:34:00','MSFT',39.35,39.39,39.31,39.39,7222.0)]
df = spark.createDataFrame(data, columns)
Now, df.show() will give us this:
+-------------------+------+-----+-----+-----+-----+-------+
| DateTime|Symbol| Open| High| Low|Close| Volume|
+-------------------+------+-----+-----+-----+-----+-------+
|2010-09-13 09:30:00| A|29.23|29.25|29.17|29.25|17667.0|
|2010-09-13 09:31:00| A|29.26|29.34|29.25|29.33| 5000.0|
|2010-09-13 09:32:00| A|29.31|29.36|29.31|29.36| 600.0|
|2010-09-13 09:34:00| A|29.35|29.39|29.31|29.39| 3222.0|
|2010-09-13 09:30:00| AAPL|39.23|39.25|39.17|39.25|37667.0|
|2010-09-13 09:31:00| AAPL|39.26|39.34|39.25|39.33| 3000.0|
|2010-09-13 09:32:00| AAPL|39.31|39.36|39.31|39.36| 300.0|
|2010-09-13 09:33:00| AAPL|39.33|39.36| 39.3|39.35| 3300.0|
|2010-09-13 09:34:00| AAPL|39.35|39.39|39.31|39.39| 4222.0|
|2010-09-13 09:34:00| MSFT|39.35|39.39|39.31|39.39| 7222.0|
+-------------------+------+-----+-----+-----+-----+-------+
Here's the solution, which uses the aforementioned window function for rank(). Some reshaping is needed, for which you can use the pivot() function.
from pyspark.sql.window import Window
import pyspark.sql.functions as f
result = (df
.select(
'DateTime',
'Symbol',
f.rank().over(Window().partitionBy('DateTime').orderBy('Volume')).alias('rank')
)
.groupby('DateTime')
.pivot('Symbol')
.agg(f.first('rank'))
.orderBy('DateTime')
)
By calling result.show() you'll get:
+-------------------+----+----+----+
| DateTime| A|AAPL|MSFT|
+-------------------+----+----+----+
|2010-09-13 09:30:00| 1| 2|null|
|2010-09-13 09:31:00| 2| 1|null|
|2010-09-13 09:32:00| 2| 1|null|
|2010-09-13 09:33:00|null| 1|null|
|2010-09-13 09:34:00| 1| 2| 3|
+-------------------+----+----+----+
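If you also want to keep the ranks for later use, as mentioned in the question, you can write result straight out; a minimal sketch (the output paths are placeholders):
# save the ranks for future use (paths are placeholders)
result.write.mode("overwrite").parquet("ranks_parquet")
# or as CSV, similar to the pandas .to_csv() in the question
result.write.mode("overwrite").option("header", True).csv("ranks_csv")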
Make sure you understand the difference between rank(), dense_rank() and row_number() functions, as they behave differently when they encounter equal numbers in a given window - you can find the explanation here.
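For a quick illustration of how they differ on ties, here is a small sketch with made-up values:
from pyspark.sql.window import Window
import pyspark.sql.functions as f

ties = spark.createDataFrame([(10,), (20,), (20,), (30,)], ['Volume'])
w = Window.orderBy('Volume')
ties.select(
    'Volume',
    f.rank().over(w).alias('rank'),              # 1, 2, 2, 4  (gap after the tie)
    f.dense_rank().over(w).alias('dense_rank'),  # 1, 2, 2, 3  (no gap)
    f.row_number().over(w).alias('row_number'),  # 1, 2, 3, 4  (tie broken arbitrarily)
).show()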
I currently have a dataframe that looks something like this
entry_1 entry_2
2022-01-21 2022-02-01
2022-03-23 NaT
2022-04-13 2022-06-06
however I need to vertically stack my two columns to get something like this
entry
2022-01-21
2022-03-23
2022-04-13
2022-02-01
NaT
2022-06-06
I've tried using df['entry'] = df['entry_1'].append(df['entry_2']).reset_index(drop=True) with no success.
I recommend that you use
pd.DataFrame(df.values.ravel(), columns=['all_entries'])
This returns the flattened underlying data as an ndarray. Wrapping that in pd.DataFrame() converts it back to a dataframe with the column name "all_entries".
For more information please visit the pandas doc: https://pandas.pydata.org/docs/reference/api/pandas.Series.ravel.html
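One thing to keep in mind with this approach: ravel() flattens row by row by default, so entry_1 and entry_2 values end up interleaved. If you want all of entry_1 followed by all of entry_2, as in the desired output, you can pass order='F' (column-major). A small sketch using the question's sample values:
import pandas as pd

df = pd.DataFrame({
    'entry_1': ['2022-01-21', '2022-03-23', '2022-04-13'],
    'entry_2': ['2022-02-01', None, '2022-06-06'],
})

# order='F' flattens column by column: all of entry_1 first, then entry_2
stacked = pd.DataFrame(df.values.ravel(order='F'), columns=['all_entries'])
print(stacked)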
You can use concat of series and get the result in a dataframe like:
df['entry_1'] = pd.to_datetime(df['entry_1'])
df['entry_2'] = pd.to_datetime(df['entry_2'])
df_result = pd.DataFrame({
'entry':pd.concat([df['entry_1'], df['entry_2']], ignore_index=True)
})
or
entry_cols = ['entry_1', 'entry_2']
df_result = pd.DataFrame({
'entry':pd.concat([df[col] for col in entry_cols], ignore_index=True)
})
print(df_result)
entry
0 2022-01-21
1 2022-03-23
2 2022-04-13
3 2022-02-01
4 NaT
5 2022-06-06
long-time reader, first-time poster. I've been doing some time tracking for two projects, grouped the data by project and date using pandas, and would like to fill it into an existing Excel template for a client that is sorted by date (y-axis) and project (x-axis). But I'm stumped. I've been struggling to convert the multi-index dataframe into a sorted xlsx file.
Example data I want to sort
|Date | Project | Hours |
|-----------|---------------------------|---------|
|2022-05-09 |Project 1 | 5.50|
|2022-05-09 |Project 1 | 3.75|
|2022-05-11 |Project 2 | 1.50|
|2022-05-11 |Project 2 | 4.75|
etc.
Desired template
|Date |Project 1|Project 2|
|-----------|---------|---------|
|2022-05-09 | 5.5| 3.75|
|2022-05-11 | 4.75| 1.5|
etc...
So far I've tried a very basic iteration using openpyxl that has inserted the dates, but I can't figure out how to
a) rearrange the data in pandas so I can simply insert it or
b) how to write conditionally in openpyxl for a given date and project
# code grouping dates and projects
df = df.groupby(["Date", "Project"])["Hours"].sum()
r = 10 # below the template headers and where I would start inserting time tracked
for date in df.index:
    sheet.cell(row=r, column=1).value = date
    r += 1
I've trawled StackOverflow for answers but am coming up empty. Thanks for any help you can provide.
I think your data sample is not correct. In the 2nd row, instead of 2022-05-09 | Project 1 | 3.75, it should be 2022-05-09 | Project 2 | 3.75. The same goes for the 4th row.
As I understand, your data is in long-format and your output is wide-format. In this case, pd.pivot_table can help:
pd.pivot_table(data=df, columns='Project', index='Date', values='Hours').reset_index()
df.pivot_table(index='Date', columns='Project', values='Hours')
Date        Project 1  Project 2
2022-05-09        5.5       3.75
2022-05-11       4.75        1.5
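To get the pivoted table into the Excel template, one option is a sketch along these lines; the file names, sheet, and start row are assumptions about your template, not something taken from the question:
import pandas as pd
from openpyxl import load_workbook

wide = df.pivot_table(index='Date', columns='Project', values='Hours', aggfunc='sum')

# Option 1: let pandas write a fresh sheet
wide.to_excel('report.xlsx', sheet_name='Hours')       # hypothetical file/sheet names

# Option 2: fill an existing template, starting below its header rows
wb = load_workbook('template.xlsx')                    # hypothetical template file
sheet = wb.active
r = 10                                                 # first data row in the template
for date, row in wide.iterrows():
    sheet.cell(row=r, column=1).value = date
    for c, value in enumerate(row, start=2):           # one column per project
        sheet.cell(row=r, column=c).value = value
    r += 1
wb.save('filled_template.xlsx')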
I came across a problem I thought the smart people at Pandas would've already solved, but I can't seem to find anything, so here I am.
The problem I'm having originates from some bad data, that I expected pandas would be able to filter on reading.
The data looks like this:
Station;Datum;Zeit;Lufttemperatur;Relative Feuchte;Wettersymbol;Windgeschwindigkeit;Windrichtung
9;12.11.2016;08:04;-1.81;86;;;
9;12.11.2016;08:19;-1.66;85.5;;;
9;²;08:34;-1.71;85.6;;;
9;12.11.2016;08:49;-1.91;87.7;;;
9;12.11.2016;09:04;-1.66;86.6;;;
(This file uses the ISO-8859-1 character set; it looks different in UTF-8 etc.) I want to read the second column as dates, so naturally I used
data = pandas.read_csv(file, sep=";", encoding="ISO-8859-1", parse_dates=["Datum"],
date_parser=lambda x: pandas.to_datetime(x, format="%d.%m.%Y"))
which gave
ValueError: time data '²' does not match format '%d.%m.%Y' (match)
Although pandas.read_csv has an input parameter error_bad_lines that looks like it would help my case, it appears all it does is filter out lines that do not have the correct number of columns. I can filter out this particular line in many different ways, but as far as I know they all require loading all the data first, filtering out the rows, and then converting the column to datetime objects; I'd rather do it while reading the file. It seems possible, because when I leave out the date_parser, the file is parsed successfully and the strange character is left as is (although that might cause issues with datetime operations later on).
Is there a way for pandas to filter out rows it can't use the date_parser on while reading the file instead of during post-processing?
You want to use the errors parameter in pandas.to_datetime
date_parser=lambda x: pd.to_datetime(x, format="%d.%m.%Y", errors="coerce")
file = "file.csv"
data = pd.read_csv(
file, sep=";", encoding="ISO-8859-1", parse_dates=["Datum"],
date_parser=lambda x: pd.to_datetime(x, format="%d.%m.%Y", errors="coerce")
)
data
Station Datum Zeit Lufttemperatur Relative Feuchte Wettersymbol Windgeschwindigkeit Windrichtung
0 9 2016-11-12 08:04 -1.81 86.0 NaN NaN NaN
1 9 2016-11-12 08:19 -1.66 85.5 NaN NaN NaN
2 9 NaT 08:34 -1.71 85.6 NaN NaN NaN
3 9 2016-11-12 08:49 -1.91 87.7 NaN NaN NaN
4 9 2016-11-12 09:04 -1.66 86.6 NaN NaN NaN
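As a side note, on newer pandas versions (2.0+) date_parser is deprecated; under that assumption you can get the same effect by reading the column as text and coercing afterwards:
import pandas as pd

data = pd.read_csv(file, sep=";", encoding="ISO-8859-1")
# keep the original day-first format; unparseable values like '²' become NaT
data["Datum"] = pd.to_datetime(data["Datum"], format="%d.%m.%Y", errors="coerce")
data = data.dropna(subset=["Datum"])   # optionally drop the rows that failed to parse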
I am using monotonically_increasing_id() to assign row number to pyspark dataframe using syntax below:
df1 = df1.withColumn("idx", monotonically_increasing_id())
Now df1 has 26,572,528 records, so I was expecting idx values from 0 to 26,572,527.
But when I select max(idx), its value is strangely huge: 335,008,054,165.
What's going on with this function? Is it reliable to use for merging with another dataset that has a similar number of records?
I have some 300 dataframes that I want to combine into a single dataframe: one dataframe contains IDs and the others contain different records corresponding to them row-wise.
Edit: Full examples of the ways to do this and the risks can be found here
From the documentation
A column that generates monotonically increasing 64-bit integers.
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
Thus, it is not like an auto-increment id in RDBs and it is not reliable for merging.
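You can check this bit layout yourself by decoding one of the generated values; for example, the surprising maximum from the question breaks down like this:
idx = 335008054165                            # the max(idx) observed in the question
partition_id = idx >> 33                      # upper bits -> 39
record_in_partition = idx & ((1 << 33) - 1)   # lower 33 bits -> 605077
print(partition_id, record_in_partition)      # 39 605077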
If you need an auto-increment behavior like in RDBs and your data is sortable, then you can use row_number
df.createOrReplaceTempView('df')
spark.sql('select row_number() over (order by some_column) as num, * from df')
+---+-----------+
|num|some_column|
+---+-----------+
| 1| ....... |
| 2| ....... |
| 3| ..........|
+---+-----------+
If your data is not sortable and you don't mind using rdds to create the indexes and then fall back to dataframes, you can use rdd.zipWithIndex()
An example can be found here
In short:
# since you have a dataframe, use the rdd interface to create indexes with zipWithIndex()
df = df.rdd.zipWithIndex()
# return back to dataframe
df = df.toDF()
df.show()
# your data | indexes
+---------------------+---+
| _1 | _2|
+---------------------+---+
|[data col1,data col2]| 0|
|[data col1,data col2]| 1|
|[data col1,data col2]| 2|
+---------------------+---+
You will probably need some more transformations after that to get your dataframe to what you need it to be. Note: not a very performant solution.
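For example, a sketch of one such transformation that flattens the (row, index) pairs back into named columns (here original_df is a placeholder for your dataframe before zipWithIndex):
indexed = (original_df.rdd
           .zipWithIndex()
           .map(lambda pair: pair[0] + (pair[1],))   # Row + index -> one flat tuple
           .toDF(original_df.columns + ["index"]))
indexed.show()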
Hope this helps. Good luck!
Edit:
Come to think of it, you can combine monotonically_increasing_id with row_number:
# create a monotonically increasing id
df = df.withColumn("idx", monotonically_increasing_id())
# then since the id is increasing but not consecutive, it means you can sort by it, so you can use the `row_number`
df.createOrReplaceTempView('df')
new_df = spark.sql('select row_number() over (order by idx) as num, * from df')
Not sure about performance though.
Using API functions, you can simply do the following:
from pyspark.sql.window import Window as W
from pyspark.sql import functions as F
df1 = df1.withColumn("idx", F.monotonically_increasing_id())
windowSpec = W.orderBy("idx")
df1 = df1.withColumn("idx", F.row_number().over(windowSpec))
df1.show()
I hope the answer is helpful
I found the solution by @mkaran useful, but for me there was no ordering column to use with the window function. I wanted to maintain the order of the dataframe's rows as their indexes (what you would see in a pandas dataframe), so the solution in the edit section came in useful. Since it is a good solution (if performance is not a concern), I would like to share it as a separate answer.
from pyspark.sql import functions as F
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.window import Window

df_index = df.withColumn("idx", monotonically_increasing_id())
# Create the window specification
w = Window.orderBy("idx")
# Use row number with the window specification
df_index = df_index.withColumn("index", F.row_number().over(w))
# Drop the temporary increasing id column
df_index = df_index.drop("idx")
Here df is your original dataframe and df_index is the new dataframe with the index column.
Building on @mkaran's answer,
df.coalesce(1).withColumn("idx", monotonically_increasing_id())
Using .coalesce(1) puts the DataFrame into a single partition, so the index column is both monotonically increasing and consecutive. Make sure your data is reasonably sized to fit in one partition, so you avoid potential problems afterwards.
Worth noting that I sorted my DataFrame in ascending order beforehand.
Here's a comparison of what it looked like for me, with and without coalesce, for a summary DataFrame of 50 rows.
df.coalesce(1).withColumn("No", monotonically_increasing_id()).show(60)
|startTimes         |endTimes           |No |
|-------------------|-------------------|---|
|2019-11-01 05:39:50|2019-11-01 06:12:50|0  |
|2019-11-01 06:23:10|2019-11-01 06:23:50|1  |
|2019-11-01 06:26:49|2019-11-01 06:46:29|2  |
|2019-11-01 07:00:29|2019-11-01 07:04:09|3  |
|2019-11-01 15:24:29|2019-11-01 16:04:59|4  |
|2019-11-01 16:23:38|2019-11-01 17:27:58|5  |
|2019-11-01 17:32:18|2019-11-01 17:47:58|6  |
|2019-11-01 17:54:18|2019-11-01 18:00:00|7  |
|2019-11-02 04:42:40|2019-11-02 04:49:20|8  |
|2019-11-02 05:11:40|2019-11-02 05:22:00|9  |
df.withColumn("runNo", monotonically_increasing_id).show(60)
|startTimes         |endTimes           |No         |
|-------------------|-------------------|-----------|
|2019-11-01 05:39:50|2019-11-01 06:12:50|0          |
|2019-11-01 06:23:10|2019-11-01 06:23:50|8589934592 |
|2019-11-01 06:26:49|2019-11-01 06:46:29|17179869184|
|2019-11-01 07:00:29|2019-11-01 07:04:09|25769803776|
|2019-11-01 15:24:29|2019-11-01 16:04:59|34359738368|
|2019-11-01 16:23:38|2019-11-01 17:27:58|42949672960|
|2019-11-01 17:32:18|2019-11-01 17:47:58|51539607552|
|2019-11-01 17:54:18|2019-11-01 18:00:00|60129542144|
|2019-11-02 04:42:40|2019-11-02 04:49:20|68719476736|
|2019-11-02 05:11:40|2019-11-02 05:22:00|77309411328|
If you have a large DataFrame and you want to avoid OOM errors, I suggest using zipWithIndex():
from pyspark.sql.functions import col

df1 = df.rdd.zipWithIndex().toDF()
df2 = df1.select(col("_1.*"), col("_2").alias('increasing_id'))
df2.show()
where df is your initial DataFrame.
More solutions are shown in the Databricks documentation. Be careful with the row_number() function, which moves all the rows into one partition and can cause OutOfMemoryError.
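If your data has a natural grouping column, one way to avoid that single-partition shuffle is to partition the window, at the cost of the numbering restarting within each group; a sketch with hypothetical column names:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# "group_col" and "sort_col" are placeholders for columns in your data
w = Window.partitionBy("group_col").orderBy("sort_col")
df_numbered = df.withColumn("num", F.row_number().over(w))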
To merge dataframes of the same size, use zip on the RDDs:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.master("local").getOrCreate()
df1 = spark.sparkContext.parallelize([(1, "a"),(2, "b"),(3, "c")]).toDF(["id", "name"])
df2 = spark.sparkContext.parallelize([(7, "x"),(8, "y"),(9, "z")]).toDF(["age", "address"])
schema = StructType(df1.schema.fields + df2.schema.fields)
df1df2 = df1.rdd.zip(df2.rdd).map(lambda x: x[0]+x[1])
spark.createDataFrame(df1df2, schema).show()
But note the following from the method's help:
Assumes that the two RDDs have the same number of partitions and the same
number of elements in each partition (e.g. one was made through
a map on the other).
I have a dataset with 'user_name', 'mac', and 'dayte' (day). I would like to GROUP BY ['user_name'], then for each group create a rolling 30-day WINDOW using 'dayte'. Within that rolling 30-day period, I would like to count the distinct number of 'mac' values and add that count to my dataframe. Sample of the data:
user_name mac dayte
0 001j 7C:D1 2017-09-15
1 0039711 40:33 2017-07-25
2 0459 F0:79 2017-08-01
3 0459 F0:79 2017-08-06
4 0459 F0:79 2017-08-31
5 0459 78:D7 2017-09-08
6 0459 E0:C7 2017-09-16
7 133833 18:5E 2017-07-27
8 133833 F4:0F 2017-07-31
9 133833 A4:E4 2017-08-07
I have tried solving this with a pandas dataframe.
df['ct_macs'] = df.groupby(['user_name']).rolling('30d', on='dayte').mac.apply(lambda x:len(x.unique()))
But received the error
Exception: cannot handle a non-unique multi-index!
I tried in PySpark, but received an error as well.
from pyspark.sql import functions as F
#function to calculate number of seconds from number of days
days = lambda i: i * 86400
#convert string timestamp to timestamp type
df= df.withColumn('dayte', df.dayte.cast('timestamp'))
#create window by casting timestamp to long (number of seconds)
w = Window.partitionBy("user_name").orderBy("dayte").rangeBetween(-days(30), 0)
df= df.select("user_name","mac","dayte",F.size(F.denseRank().over(w).alias("ct_mac")))
But received the error
Py4JJavaError: An error occurred while calling o464.select.
: org.apache.spark.sql.AnalysisException: Window function dense_rank does not take a frame specification.;
I also tried
df= df.select("user_name","dayte",F.countDistinct(col("mac")).over(w).alias("ct_mac"))
But count distinct over a Window is apparently not supported in Spark.
I'm open to a purely SQL approach, in either MySQL or SQL Server, but would prefer Python or Spark.
Pyspark
Window functions are limited in the following ways:
A frame can only be defined by rows and not column values
countDistinct cannot be used as a window function
ranking functions such as dense_rank cannot be used with a frame specification
Instead, you can self-join your table.
First let's create the dataframe:
df = sc.parallelize([["001j", "7C:D1", "2017-09-15"], ["0039711", "40:33", "2017-07-25"], ["0459", "F0:79", "2017-08-01"],
["0459", "F0:79", "2017-08-06"], ["0459", "F0:79", "2017-08-31"], ["0459", "78:D7", "2017-09-08"],
["0459", "E0:C7", "2017-09-16"], ["133833", "18:5E", "2017-07-27"], ["133833", "F4:0F", "2017-07-31"],
["133833", "A4:E4", "2017-08-07"]]).toDF(["user_name", "mac", "dayte"])
Now for the join and groupBy:
import pyspark.sql.functions as psf
df.alias("left")\
.join(
df.alias("right"),
(psf.col("left.user_name") == psf.col("right.user_name"))
& (psf.col("right.dayte").between(psf.date_add("left.dayte", -30), psf.col("left.dayte"))),
"leftouter")\
.groupBy(["left." + c for c in df.columns])\
.agg(psf.countDistinct("right.mac").alias("ct_macs"))\
.sort("user_name", "dayte").show()
+---------+-----+----------+-------+
|user_name| mac| dayte|ct_macs|
+---------+-----+----------+-------+
| 001j|7C:D1|2017-09-15| 1|
| 0039711|40:33|2017-07-25| 1|
| 0459|F0:79|2017-08-01| 1|
| 0459|F0:79|2017-08-06| 1|
| 0459|F0:79|2017-08-31| 1|
| 0459|78:D7|2017-09-08| 2|
| 0459|E0:C7|2017-09-16| 3|
| 133833|18:5E|2017-07-27| 1|
| 133833|F4:0F|2017-07-31| 2|
| 133833|A4:E4|2017-08-07| 3|
+---------+-----+----------+-------+
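As an aside, on more recent Spark versions (where collect_set can be used as a window function), another workaround for the missing countDistinct window function is to collect the distinct macs over the same 30-day range frame and take the size of the set; a sketch reusing the window from the question:
import pyspark.sql.functions as psf
from pyspark.sql import Window

days = lambda i: i * 86400
w = (Window.partitionBy("user_name")
           .orderBy(psf.col("dayte").cast("timestamp").cast("long"))
           .rangeBetween(-days(30), 0))

df.withColumn("ct_macs", psf.size(psf.collect_set("mac").over(w))).show()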
Pandas
This works for Python 3:
import pandas as pd
import numpy as np

df["dayte"] = pd.to_datetime(df["dayte"])   # rolling('30D') needs a datetime column
df["mac"] = pd.factorize(df["mac"])[0]
df.groupby('user_name').rolling('30D', on="dayte").mac.apply(lambda x: len(np.unique(x)))