Adding new column in pyspark dataframe - python

I'm trying to add a new column Timezone to my PySpark dataframe:
from timezonefinder import TimezoneFinder
from pyspark.sql.functions import col

tf = TimezoneFinder()
df = df.withColumn("longitude", col("longitude").cast("float"))
df = df.withColumn("Latitude", col("Latitude").cast("float"))
df = df.withColumn("timezone", tf.timezone_at(lng=col("longitude"), lat=col("Latitude")))
I'm getting the error below.
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
The timezonefinder library is used to find a timezone from geocoordinates:
Latitude, longitude = 20.5061, 50.358
tf.timezone_at(lng=longitude, lat=Latitude)
# 'Asia/Riyadh'

You need to use a UDF to pass columns to Python functions:
import pyspark.sql.functions as F

@F.udf('string')
def tfUDF(lng, lat):
    from timezonefinder import TimezoneFinder
    tf = TimezoneFinder()
    return tf.timezone_at(lng=lng, lat=lat)

df = df.withColumn("longitude", F.col("longitude").cast("float"))
df = df.withColumn("Latitude", F.col("Latitude").cast("float"))
df = df.withColumn("timezone", tfUDF(F.col("longitude"), F.col("Latitude")))
df.show()
+--------+---------+-----------+
|Latitude|longitude| timezone|
+--------+---------+-----------+
| 20.5061| 50.358|Asia/Riyadh|
+--------+---------+-----------+
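If per-row UDF overhead becomes a concern on larger data, a vectorized pandas UDF can do the same lookup batch-wise. This is only a sketch of an alternative, assuming Spark 3.0+ with pyarrow installed and reusing the column names from the example above (tz_batch_udf is just an illustrative name):
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.functions import pandas_udf
from timezonefinder import TimezoneFinder

@pandas_udf('string')
def tz_batch_udf(lng: pd.Series, lat: pd.Series) -> pd.Series:
    # One TimezoneFinder instance per batch instead of one call per row
    tf = TimezoneFinder()
    return pd.Series([tf.timezone_at(lng=x, lat=y) for x, y in zip(lng, lat)])

df = df.withColumn("timezone", tz_batch_udf(F.col("longitude"), F.col("Latitude")))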

Related

How can I sort the timestamp from the following data dictionary?

Code:
import pandas as pd
from pycoingecko import CoinGeckoAPI
c=CoinGeckoAPI()
bdata=c.get_coin_market_chart_by_id(id='bitcoin',vs_currency='usd',days=30)
data_=pd.DataFrame(bdata)
print(data_)
data=pd.to_datetime(data_['prices'],unit='ms')
print(data)
Requirement:
I require the output to have 4 columns: Timestamp, Prices, Market_caps, Total_volume, and I want to convert the timestamp format with to_datetime. In the code above I only fetch the bitcoin data from pycoingecko.
You can convert this into a dataframe format like this:
import pandas as pd
from pycoingecko import CoinGeckoAPI
c=CoinGeckoAPI()
bdata=c.get_coin_market_chart_by_id(id='bitcoin',vs_currency='usd',days=30)
prices = pd.DataFrame(bdata['prices'], columns=['TimeStamp', 'Price']).set_index('TimeStamp')
market_caps = pd.DataFrame(bdata['market_caps'], columns=['TimeStamp', 'Market Cap']).set_index('TimeStamp')
total_volumes = pd.DataFrame(bdata['total_volumes'], columns=['TimeStamp', 'Total Volumes']).set_index('TimeStamp')
# combine the separate dataframes
df_market = pd.concat([prices, market_caps, total_volumes], axis=1)
# convert the index to a datetime dtype
df_market.index = pd.to_datetime(df_market.index, unit='ms')
Code adapted from this answer.
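If the timestamp should end up as a regular column (to match the four requested columns) rather than the index, one small follow-up, assuming the df_market frame built above:
# Turn the datetime index back into a normal column
df_market = df_market.reset_index().rename(columns={'TimeStamp': 'Timestamp'})
print(df_market.columns.tolist())  # ['Timestamp', 'Price', 'Market Cap', 'Total Volumes']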
You can extract the timestamp column and convert it to dates as follows, with minimal changes to your code; you can then merge the converted column back into your dataframe:
import pandas as pd
from pycoingecko import CoinGeckoAPI
c=CoinGeckoAPI()
bdata=c.get_coin_market_chart_by_id(id='bitcoin',vs_currency='usd',days=30)
data_=pd.DataFrame(bdata)
print(data_)
#data=pd.to_datetime(data_["prices"],unit='ms')
df = pd.DataFrame([pd.Series(x) for x in data_["prices"]])
df.columns = ["timestamp","data"]
df=pd.to_datetime(df["timestamp"],unit='ms')
print(df)
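To keep both columns and merge the converted timestamps back instead of overwriting the frame, one possible follow-up to the snippet above (a sketch reusing its variable names):
# Keep both columns and convert the timestamp in place
df = pd.DataFrame([pd.Series(x) for x in data_["prices"]])
df.columns = ["timestamp", "data"]
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")
print(df.head())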

Filling out a dataframe column using parallel processing in Python

I am trying to compute a value for each row of a dataframe in parallel using the following code, but I get errors whether I pass individual input ranges or the combination:
#!pip install pyblaze
import itertools
import pyblaze
import pyblaze.multiprocessing as xmp
import pandas as pd

inputs = [range(2),range(2),range(3)]
inputs_list = list(itertools.product(*inputs))
Index = pd.MultiIndex.from_tuples(inputs_list,names={"a", "b", "c"})
df = pd.DataFrame(index = Index)
df['Output'] = 0
print(df)

def Addition(A,B,C):
    df.loc[A,B,C]['Output']=A+B+C
    return df

def parallel(inputs_list):
    tokenizer = xmp.Vectorizer(Addition, num_workers=8)
    return tokenizer.process(inputs_list)

parallel(inputs_list)
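One pitfall with this setup is that worker processes cannot mutate the parent's dataframe in place. A sketch of an alternative using only the standard library (not pyblaze), where each worker returns its value and the results are assigned back in one step:
import itertools
from concurrent.futures import ProcessPoolExecutor

import pandas as pd

inputs = [range(2), range(2), range(3)]
inputs_list = list(itertools.product(*inputs))
index = pd.MultiIndex.from_tuples(inputs_list, names=["a", "b", "c"])
df = pd.DataFrame(index=index)

def addition(args):
    # Pure function: one (a, b, c) tuple in, one value out
    a, b, c = args
    return a + b + c

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(addition, inputs_list))
    # Assign all results at once in the parent process
    df["Output"] = results
    print(df)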

Why can't I search for a row in a pandas df using a date as part of a tuple index?

I am trying to search a pandas df I made which has a tuple as an index. The first part of the tuple is a date and the second part is a forex pair. I've tried a few things but I can't seem to search using a date-formatted string as part of a tuple with .loc or .ix
My df looks like this:
                      Open    Close
(11-01-2018, AEDAUD)  0.3470  0.3448
(11-01-2018, AEDCAD)  0.3415  0.3408
(11-01-2018, AEDCHF)  0.2663  0.2656
(11-01-2018, AEDDKK)  1.6955  1.6838
(11-01-2018, AEDEUR)  0.2277  0.2261
Here is the complete code :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
forex_11 = pd.read_csv('FOREX_20180111.csv', sep=',', parse_dates=['Date'])
forex_12 = pd.read_csv('FOREX_20180112.csv', sep=',', parse_dates=['Date'])
time_format = '%d-%m-%Y'
forex = forex_11.append(forex_12, ignore_index=False)
forex['Date'] = forex['Date'].dt.strftime(time_format)
GBP = forex[forex['Symbol'] == "GBPUSD"]
forex.index = list(forex[['Date', 'Symbol']].itertuples(index=False, name=None))
forex_open_close = pd.DataFrame(np.array(forex[['Open','Close']]), index=forex.index)
forex_open_close.columns = ['Open', 'Close']
print(forex_open_close.head())
print(forex_open_close.ix[('11-01-2018', 'GBPUSD')])
How do I get the row which has index ('11-01-2018', 'GBPUSD') ?
Can you try putting the tuple in a list using brackets?
Like this:
print(forex_open_close.ix[[('11-01-2018', 'GBPUSD')]])
I would recommend using the pandas MultiIndex. In your case you could do the following:
tuples = list(data[['Date', 'Symbol']].itertuples(index=False, name=None))
data.index = pd.MultiIndex.from_tuples(tuples, names=['Date', 'Symbol'])
# And then to index (the dates are the string-formatted '%d-%m-%Y' values from your code)
data.loc[('11-01-2018', 'AEDCAD')]
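Note that .ix has since been removed from pandas, so tuple lookups should go through .loc. A small sketch assuming the MultiIndex built above and the question's string-formatted dates:
# .ix is gone in pandas 1.0+; use .loc with a tuple key on the MultiIndex
row = data.loc[('11-01-2018', 'GBPUSD')]
# Sorting the index first also enables partial lookups by date alone
data = data.sort_index()
rows_for_day = data.loc['11-01-2018']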

Modify all values of a column PySpark dataframe

I am new to PySpark dataframes and used to work with RDDs before. I have a dataframe like this:
date path
2017-01-01 /A/B/C/D
2017-01-01 /X
2017-01-01 /X/Y
And want to convert to the following:
date path
2017-01-01 /A/B
2017-01-01 /X
2017-01-01 /X/Y
Basically, I want to get rid of everything from the third / onward. Previously, with RDDs, I had the following:
from urllib import quote_plus
path_levels = df['path'].split('/')
filtered_path_levels = []
for _level in range(min(df_size, 3)):
    # Take only the top 2 levels of path
    filtered_path_levels.append(quote_plus(path_levels[_level]))
df['path'] = '/'.join(map(str, filtered_path_levels))
Things with pyspark are more complicated I would say. Here is what I have got so far:
path_levels = split(results_df['path'], '/')
filtered_path_levels = []
for _level in range(size(df_size, 3)):
    # Take only the top 2 levels of path
    filtered_path_levels.append(quote_plus(path_levels[_level]))
df['path'] = '/'.join(map(str, filtered_path_levels))
which is giving me the following error:
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Any help regarding this would be much appreciated. Let me know if this needs more information/explanation.
Use a udf:
from urllib import quote_plus
from pyspark.sql.functions import *

@udf
def quote_string_(path, size):
    if path:
        return "/".join(quote_plus(x) for x in path.split("/")[:size])

# size=3 keeps the leading empty string plus two path levels, e.g. '/A/B'
df.withColumn("foo", quote_string_("path", lit(3)))
I resolved my problem using the following code:
from pyspark.sql.functions import split, col, lit, concat
split_col = split(df['path'], '/')
df = df.withColumn('l1_path', split_col.getItem(1))
df = df.withColumn('l2_path', split_col.getItem(2))
df = df.withColumn('path', concat(col('l1_path'), lit('/'), col('l2_path')))
df = df.drop('l1_path', 'l2_path')
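An alternative that avoids hard-coding each level, sketched here on the assumption that Spark 2.4+ is available (for slice): split the path, keep the first parts, and join them back together:
from pyspark.sql.functions import split, slice, concat_ws, col

# '/A/B/C/D'.split('/') -> ['', 'A', 'B', 'C', 'D']; keep the first 3 elements
# (leading empty string plus two levels), then rejoin with '/' to get '/A/B'
df = df.withColumn('path', concat_ws('/', slice(split(col('path'), '/'), 1, 3)))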

How to modify/transform the column of a dataframe?

I have an instance of pyspark.sql.dataframe.DataFrame created using
dataframe = sqlContext.sql("select * from table").
One column is 'arrival_date' and contains a string.
How do I modify this column so as to only take the first 4 characters from it and throw away the rest?
How would I convert the type of this column from string to date?
In graphlab.SFrame, this would be:
dataframe['column_name'] = dataframe['column_name'].apply(lambda x: x[:4] )
and
dataframe['column_name'] = dataframe['column_name'].str_to_datetime()
As stated by Orions, you can't modify a column, but you can override it. Also, you shouldn't need to create a user-defined function, as there is a built-in function for extracting substrings:
from pyspark.sql.functions import *
df = df.withColumn("arrival_date", df['arrival_date'].substr(0, 4))
To convert it to date, you can use to_date, as Orions said:
from pyspark.sql.functions import *
df = df.withColumn("arrival_date", to_date(df['arrival_date'].substr(0, 4)))
However, if you need to specify the format, you should use unix_timestamp:
from pyspark.sql.functions import *
format = 'yyMM'
col = unix_timestamp(df['arrival_date'].substr(0, 4), format).cast('timestamp')
df = df.withColumn("arrival_date", col)
All this can be found in the pyspark documentation.
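For what it's worth, in Spark 2.2 and later to_date also accepts a format string directly, which avoids the unix_timestamp round-trip; a small sketch reusing the 'yyMM' format from above:
from pyspark.sql.functions import to_date

# Spark 2.2+: pass the format straight to to_date
df = df.withColumn("arrival_date", to_date(df['arrival_date'].substr(0, 4), 'yyMM'))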
To extract the first 4 characters from the arrival_date (StringType) column, create a new_df using a UserDefinedFunction (as you cannot modify the columns: they are immutable):
from pyspark.sql.functions import UserDefinedFunction, to_date
from pyspark.sql.types import StringType

old_df = spark.sql("SELECT * FROM table")
udf = UserDefinedFunction(lambda x: str(x)[:4], StringType())
new_df = old_df.select(*[udf(column).alias('arrival_date') if column == 'arrival_date' else column for column in old_df.columns])
And to convert the arrival_date (StringType) column into a DateType column, use the to_date function as shown below:
new_df = old_df.select(old_df.other_cols_if_any, to_date(old_df.arrival_date).alias('arrival_date'))
Sources:
https://stackoverflow.com/a/29257220/2873538
https://databricks.com/blog/2015/09/16/apache-spark-1-5-dataframe-api-highlights.html
