How to replicate Python code in R to find duplicates?

I'm trying to reproduce this Python code in R:
# Sort by user overall rating first
reviews = reviews.sort_values('review_overall', ascending=False)
# Keep the highest rating from each user and drop the rest
reviews = reviews.drop_duplicates(subset= ['review_profilename','beer_name'], keep='first')
and this is what I've written in R:
reviews_df <- df[order(-df$review_overall), ]
library(dplyr)
df_clean <- distinct(reviews_df, review_profilename, beer_name, .keep_all= TRUE)
The problem is that Python gives me 1496263 records while R gives me 1496596 records.
link to dataset: dataset
Can someone help me see my mistake?

Without having some data, it's difficult to help, but you might be looking for:
library(tidyverse)
df_clean <- reviews_df %>%
  arrange(desc(review_overall)) %>%
  distinct(across(c(review_profilename, beer_name)), .keep_all = TRUE)
This code sorts descending by review_overall, then keeps the first row (i.e. the one with the highest review_overall) for each review_profilename + beer_name combination.
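If the counts still differ after that, one way to narrow it down (a quick sketch, assuming both languages read the same CSV and the key columns are named as above) is to check the key columns for missing values and count the distinct combinations directly, since pandas and R can treat empty fields differently when reading the file:
import pandas as pd
reviews = pd.read_csv("beer_reviews.csv")  # hypothetical file name for the linked dataset
# rows with a missing key: pandas turns empty fields into NaN, while R's read.csv may keep "" as a string
print(reviews[['review_profilename', 'beer_name']].isna().sum())
# number of distinct key combinations; this should equal the deduplicated row count in both languages
print(reviews[['review_profilename', 'beer_name']].drop_duplicates().shape[0])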

Related

What is the best practice to reduce/filter streaming data into normal data with the most-used characteristics using PySpark?

I am working on streaming web-server records using PySpark in real time, and I want to reduce/filter the data of a certain period (let's say 1 week, which is 10M records) down to 1M records of sampled data that represents normal data with the most-used characteristics. I tried the following strategies in Python:
Find the most-used usernames, say the top n, like Ali & Eli ----> df['username'].value_counts()
Find the most-used APIs (api) that Ali & Eli each accessed individually.
First we need to filter the records belonging to Ali & Eli, e.g. df_filter_Ali = df[df["username"] == "Ali"], and then find the most-used APIs (api) by Ali ----> df_filter_Ali['normalizedApi'].value_counts(), let's say \a\s\d\ & \a\b\c\
Filter the records of Ali which contain the most-accessed APIs \a\s\d\ & \a\b\c\,
but do this separately for each user/API pair, in other words:
df.filter(username=ali).filter(api=/a).sample(0.1).union(df.filter(username=ali).filter(api=/b).sample(0.1)).union(df.filter(username=pejman).filter(api=/a).sample(0.1)).union(df.filter(username=ali).filter(api=/z).sample(0.1))
.union(df.filter(username=pej or ALI).filter(api=/a,/b, /z)
Then we can expect the other features belonging to these events to reflect the normal data distribution.
I don't think groupby() gives us the right distribution.
# Task1: normal data sampling
import pandas as pd

df = pd.read_csv("df.csv", sep=";")

df1 = []
for first_column in df["username"].value_counts().index[:50]:
    second_column_most_values = df.loc[df["username"] == first_column]["normalizedApi"].value_counts().index
    for second_column in second_column_most_values[:10]:
        sample = df.loc[(df["username"] == first_column) & (df["normalizedApi"] == second_column)].sample(frac=0.1)
        df1.append(sample)
df1 = pd.concat(df1)

df2 = []
for first_column in df["username"].value_counts().index[:50]:
    second_column_most_values = df.loc[df["username"] == first_column]["normalizedApi"].value_counts().index
    user_specific_data = []
    for second_column in second_column_most_values[:10]:
        sample = df.loc[(df["username"] == first_column) & (df["normalizedApi"] == second_column)]
        user_specific_data.append(sample)
    df2.append(pd.concat(user_specific_data).sample(frac=0.1))
df2 = pd.concat(df2)

df3 = []
for first_column in df["username"].value_counts().index[:50]:
    second_column_most_values = df.loc[df["username"] == first_column]["normalizedApi"].value_counts().index
    user_specific_data = []
    for second_column in second_column_most_values[:10]:
        sample = df.loc[(df["username"] == first_column) & (df["normalizedApi"] == second_column)]
        user_specific_data.append(sample)
    df3.append(pd.concat(user_specific_data))
df3 = pd.concat(df3)
df3 = df3.sample(frac=0.1)

sampled_napi_df = pd.concat([df1, df2, df3])
sampled_napi_df = sampled_napi_df.drop_duplicates()
sampled_napi_df = sampled_napi_df.reset_index(drop=True)
I checked the related posts, but I couldn't find anything useful apart from a few: post1, Filtering streaming data to reduce noise, kalman filter, and How correctly reduce stream to another stream, which are C++ or Java solutions!
Edit1: I tried Scala: pick the top 50 usernames, loop over the top 10 APIs each of them accessed, sample/reduce, union everything, and get back a filtered df:
val users = df.groupBy("username").count.orderBy($"count".desc).select("username").as[String].take(50)

val user_apis = users.map { user =>
  val users_apis = df.filter($"username" === user).groupBy("normalizedApi").count
    .orderBy($"count".desc).select("normalizedApi").as[String].take(50)
  (user, users_apis)
}

import org.apache.spark.sql.functions.rand

val df_sampled = user_apis.map { case (user, userApis) =>
  userApis.map { api =>
    df.filter($"username" === user).filter($"normalizedApi" === api).orderBy(rand()).limit(10)
  }.reduce(_ union _)
}.reduce(_ union _)
I still can't figure out how this can be done efficiently in PySpark. Any help would be appreciated.
Edit2:
// desired number of users: 100
val users = df.groupBy("username").count.orderBy($"count".desc).select("username").as[String].take(100)

// desired number of APIs per selected user: 100
val user_apis = users.map { user =>
  val users_apis = df.filter($"username" === user).groupBy("normalizedApi").count
    .orderBy($"count".desc).select("normalizedApi").as[String].take(100)
  (user, users_apis)
}

import org.apache.spark.sql.functions._

val users_and_apis_of_interest = user_apis.toSeq.toDF("username", "apisOfInters")

val normal_df = df.join(users_and_apis_of_interest, Seq("username"), "inner")
  .withColumn("keep", array_contains($"apisOfInters", $"normalizedApi"))
  .filter($"keep" === true)
  .distinct
  .drop("keep", "apisOfInters")
  .sample(true, 0.5)
I think this does what you want in PySpark. I'll confess I didn't run the code, but it does give you the spirit of what I think you need to do.
The important thing is to start avoiding 'collect', because it requires that whatever you are doing fits in memory on the driver, and it's a sign you are doing "small data" things instead of using big-data tools like 'limit'. Where possible, try to do the work with datasets/dataframes, as that will give you the most big-data power.
I do use a window function in this and I've provided a link to help explain what it does.
Again this code hasn't been run but I am fairly certain the spirit of my intent is here. If you provide a runnable data set (in the question) I'll test/run/debug.
from pyspark.sql.functions import count, collect_list, row_number, col, explode
from pyspark.sql.window import Window

top_ten = 10     # keep the 10 most active users
top_apis = 100   # keep the 100 most-used APIs per user

# rank each user's APIs by how often that user called them
windowSpec = Window.partitionBy("username").orderBy(col("count").desc())

users_and_apis_of_interest = (
    df.groupBy("username")
      .agg(
          count("username").alias("count"),
          collect_list("normalizedApi").alias("apis")  # collect all the APIs we need for later
      )
      .sort(col("count").desc())
      .limit(top_ten)
      .select("username", explode("apis").alias("normalizedApi"))  # turn the collected APIs back into rows
      .groupBy("username", "normalizedApi")
      .agg(count("normalizedApi").alias("count"))
      .select(
          "username",
          "normalizedApi",
          row_number().over(windowSpec).alias("row_number")  # row numbers to select the top X APIs per user
      )
      .where(col("row_number") <= top_apis)  # keep only the top APIs for each user
)

normal_df = (
    df.join(users_and_apis_of_interest, ["username", "normalizedApi"])
      .drop("row_number", "count")
      .distinct()
      .sample(True, 0.5)
)

normal_df.show(truncate=False)
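For comparison, a rough PySpark equivalent of the Scala join from the question's edit would look something like the sketch below (also untested, assuming the same df and column names):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# top 100 users by record count
top_users = (df.groupBy("username").count()
               .orderBy(F.desc("count"))
               .limit(100)
               .select("username"))

# top 100 APIs per selected user, ranked by how often that user called them
w = Window.partitionBy("username").orderBy(F.desc("api_count"))
apis_of_interest = (df.join(top_users, "username")
                      .groupBy("username", "normalizedApi")
                      .agg(F.count("*").alias("api_count"))
                      .withColumn("rank", F.row_number().over(w))
                      .filter(F.col("rank") <= 100)
                      .select("username", "normalizedApi"))

# keep only the records of interest and sample them down
normal_df2 = (df.join(apis_of_interest, ["username", "normalizedApi"])
                .sample(withReplacement=False, fraction=0.5))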

Can time series analysis forecast the past?

I'm trying to estimate past data using time series analysis.
Usually time series analysis forecasts the future, but can it go in the opposite direction and "forecast" the past?
The reason I want to do this is that part of the past data is missing.
I'm trying to write down the code in R or Python.
I tried forecast(arima, h=-92) in R. This didn't work.
This is the code I tried in R.
library('ggfortify')
library('data.table')
library('ggplot2')
library('forecast')
library('tseries')
library('urca')
library('dplyr')
library('TSstudio')
library("xts")
df<- read.csv('https://drive.google.com/file/d/1Dt2ZLOCASYIbvviWQkwwgdo2BdmKfl9H/view?usp=sharing')
colnames(df)<-c("date", "production")
df$date<-as.Date(df$date, format="%Y-%m-%d")
CandyXTS<- xts(df[-1], df[[1]])
CandyTS<- ts(df$production, start=c(1972,1),end=c(2017,8), frequency=12 )
ggAcf(CandyTS)
forecast(CandyTS, h=-92)
It is possible. It is called backcasting. You can find some information in this chapter of Forecasting: Principles and Practice.
Basically you need to forecast in reverse. I have added an example based on the code in that chapter and your data; adjust as needed. You create a reverse index and use that to be able to backcast in time. You can use models other than ETS; the principle is the same.
# I downloaded data.
df1 <- readr::read_csv("datasets/candy_production.csv")
colnames(df1) <- c("date", "production")

library(fpp3)

back_cast <- df1 %>%
  as_tsibble() %>%
  mutate(reverse_time = rev(row_number())) %>%
  update_tsibble(index = reverse_time) %>%
  model(ets = ETS(production ~ season(period = 12))) %>%
  # backcast
  forecast(h = 12) %>%
  # add dates in reverse order to the forecast with the same name as in the original dataset
  mutate(date = df1$date[1] %m-% months(1:12)) %>%
  as_fable(index = date, response = "production",
           distribution = "production")

back_cast %>%
  autoplot(df1) +
  labs(title = "Backcast of candy production",
       y = "production")
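If you would rather stay in Python, the same reverse-then-forecast trick can be sketched with statsmodels; the file name, column names, and ARIMA orders below are only illustrative, not tuned:
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# monthly candy production, as in the question
candy = pd.read_csv("candy_production.csv", parse_dates=["date"])
series = candy["production"]

# reverse the series, fit a model, forecast 92 steps "ahead",
# then flip the forecasts so they line up before the first observation
reversed_series = pd.Series(series.values[::-1])
fit = ARIMA(reversed_series, order=(2, 1, 1), seasonal_order=(1, 0, 0, 12)).fit()
backcast = fit.forecast(steps=92)[::-1]  # estimates for the 92 months before the series starts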

Trouble obtaining counts using multiple datetime columns as conditionals

I am attempting to collect counts of occurrences of an id between two time periods in a dataframe. I have a moderately sized dataframe (about 400 unique ids and just short of 1m rows) containing a time of occurrence and an id for the account which caused the occurrence. I am attempting to get a count of occurrences for multiple time periods (1 hour, 6 hours, 1 day, etc.) prior to a specific occurrence and have run into lots of difficulties.
I am using Python 3.7, and for this instance I only have the pandas package loaded. I have tried using for loops and while it likely would have worked (eventually), I am looking for something a bit more efficient time-wise. I have also tried using list comprehension and have run into some errors that I did not anticipate when dealing with datetimes columns. Examples of both are below.
## Sample data
data = {'id':[ 'EAED813857474821E1A61F588FABA345', 'D528C270B80F11E284931A7D66640965', '6F394474B8C511E2A76C1A7D66640965', '7B9C7C02F19711E38C670EDFB82A24A9', '80B409D1EC3D4CC483239D15AAE39F2E', '314EB192F25F11E3B68A0EDFB82A24A9', '68D30EE473FE11E49C060EDFB82A24A9', '156097CF030E4519DBDF84419B855E10', 'EE80E4C0B82B11E28C561A7D66640965', 'CA9F2DF6B82011E28C561A7D66640965', '6F394474B8C511E2A76C1A7D66640965', '314EB192F25F11E3B68A0EDFB82A24A9', 'D528C270B80F11E284931A7D66640965', '3A024345C1E94CED8C7E0DA3A96BBDCA', '314EB192F25F11E3B68A0EDFB82A24A9', '47C18B6B38E540508561A9DD52FD0B79', 'B72F6EA5565B49BBEDE0E66B737A8E6B', '47C18B6B38E540508561A9DD52FD0B79', 'B92CB51EFA2611E2AEEF1A7D66640965', '136EDF0536F644E0ADE6F25BB293DD17', '7B9C7C02F19711E38C670EDFB82A24A9', 'C5FAF9ACB88D4B55AB8196DBFFE5B3C0', '1557D4ECEFA74B40C718A4E5425F3ACB', '68D30EE473FE11E49C060EDFB82A24A9', '68D30EE473FE11E49C060EDFB82A24A9', 'CAF9D8CD627B422DFE1D587D25FC4035', 'C620D865AEE1412E9F3CA64CB86DC484', '47C18B6B38E540508561A9DD52FD0B79', 'CA9F2DF6B82011E28C561A7D66640965', '06E2501CB81811E290EF1A7D66640965', '68EEE17873FE11E4B5B90AFEF9534BE1', '47C18B6B38E540508561A9DD52FD0B79', '1BFE9CB25AD84B64CC2D04EF94237749', '7B20C2BEB82811E28C561A7D66640965', '261692EA8EE447AEF3804836E4404620', '74D7C3901F234993B4788EFA9E6BEE9E', 'CAF9D8CD627B422DFE1D587D25FC4035', '76AAF82EB8C511E2A76C1A7D66640965', '4BD38D6D44084681AFE13C146542A565', 'B8D27E80B82911E28C561A7D66640965' ], 'datetime':[ "24/06/2018 19:56", "24/05/2018 03:45", "12/01/2019 14:36", "18/08/2018 22:42", "19/11/2018 15:43", "08/07/2017 21:32", "15/05/2017 14:00", "25/03/2019 22:12", "27/02/2018 01:59", "26/05/2019 21:50", "11/02/2017 01:33", "19/11/2017 19:17", "04/04/2019 13:46", "08/05/2019 14:12", "11/02/2018 02:00", "07/04/2018 16:15", "29/10/2016 20:17", "17/11/2018 21:58", "12/05/2017 16:39", "28/01/2016 19:00", "24/02/2019 19:55", "13/06/2019 19:24", "30/09/2016 18:02", "14/07/2018 17:59", "06/04/2018 22:19", "25/08/2017 17:51", "07/04/2019 02:24", "26/05/2018 17:41", "27/08/2014 06:45", "15/07/2016 19:30", "30/10/2016 20:08", "15/09/2018 18:45", "29/01/2018 02:13", "10/09/2014 23:10", "11/05/2017 22:00", "31/05/2019 23:58", "19/02/2019 02:34", "02/02/2019 01:02", "27/04/2018 04:00", "29/11/2017 20:35"]}
import pandas as pd
from datetime import timedelta

df = pd.DataFrame(data)
df['datetime'] = pd.to_datetime(df['datetime'], format='%d/%m/%Y %H:%M')  # parse the timestamp strings
df = df.sort_values(['id', 'datetime'], ascending=True)
# for loop attempt
totalAccounts = df['id'].unique()
for account in totalAccounts:
    oneHourCount = 0
    subset = df[df['id'] == account]
    for i in range(len(subset)):
        onehour = subset['datetime'].iloc[i] - timedelta(hours=1)
        for j in range(len(subset)):
            if (subset['datetime'].iloc[j] >= onehour) and (subset['datetime'].iloc[j] < subset['datetime'].iloc[i]):
                oneHourCount += 1

# list comprehension attempt
df['onehour'] = df['datetime'] - timedelta(hours=1)
for account in totalAccounts:
    onehour = sum([1 for x in subset['datetime'] if x >= subset['onehour'] and x < subset['datetime']])
I am getting either 1) an incredibly long runtime with the for loop or 2) a ValueError about the truth value of a Series being ambiguous. I know the issue is dealing with the datetimes, and perhaps it is just going to be slow going, but I want to check here first just to make sure.
So I was able to figure this out using bisection. If you have a similar question please PM me and I'd be more than happy to help.
Solution:
from bisect import bisect_left, bisect_right

# 'keys' is assumed to be the sorted list of occurrence times for the account being processed
left = bisect_left(keys, subset['start_time'].iloc[i])    ## calculated window start time
right = bisect_right(keys, subset['datetime'].iloc[i])    ## actual time of occurrence
count = len(subset['datetime'][left:right])
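A fuller sketch of that bisection idea, assuming the column names from the question and a one-hour window (adjust the window size and boundary handling as needed):
from bisect import bisect_left
import pandas as pd

df['datetime'] = pd.to_datetime(df['datetime'], format='%d/%m/%Y %H:%M')
df = df.sort_values(['id', 'datetime']).reset_index(drop=True)

one_hour_counts = []
for _, subset in df.groupby('id', sort=True):
    keys = subset['datetime'].tolist()          # sorted occurrence times for this id
    for event_time in keys:
        window_start = event_time - pd.Timedelta(hours=1)
        left = bisect_left(keys, window_start)  # first occurrence at or after the window start
        right = bisect_left(keys, event_time)   # occurrences strictly before the event itself
        one_hour_counts.append(right - left)

df['one_hour_count'] = one_hour_counts          # row order matches the frame sorted by id and datetime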

Iterate a piece of code connecting to an API using two variables pulled from two lists

I'm trying to run a script (using the Google Search Console API) over a table of keywords and dates in order to check whether there was an improvement in keyword performance (SEO) after each date.
Since I'm really clueless I'm just guessing and trying, but the Jupyter notebook isn't responding, so I can't even tell if I'm wrong...
The repo I took this code from was made by Josh Carty:
https://github.com/joshcarty/google-searchconsole
I've already read the input table with pd.read_csv (it consists of two columns, 'keyword' and 'date') and made the columns into two separate lists (or maybe it's better to use a dictionary or something else?):
KW_list and
Date_list
I tried:
for i in KW_list and j in Date_list:
    account = searchconsole.authenticate(client_config='client_secrets.json',
                                         credentials='credentials.json')
    webproperty = account['https://www.example.com/']
    report = webproperty.query.range(j, days=-30).filter('query', i, 'contains').get()
    report2 = webproperty.query.range(j, days=30).filter('query', i, 'contains').get()
    df = pd.DataFrame(report)
    df2 = pd.DataFrame(report2)
df
I expect to see a data frame for all the different keywords (keyword1 - stats1, keyword2 - stats2 below it, etc., with no overwriting) for the 30 days before the date in the neighboring cell (in the input file), or at least some response from the Jupyter notebook so I will know what is going on.
Try using the zip function to combine the lists into a list of tuples. This way, the date and the corresponding keyword are combined.
account = searchconsole.authenticate(client_config='client_secrets.json',
                                     credentials='credentials.json')
webproperty = account['https://www.example.com/']

df1 = None
df2 = None
first = True
for (keyword, date) in zip(KW_list, Date_list):
    report = webproperty.query.range(date, days=-30).filter('query', keyword, 'contains').get()
    report2 = webproperty.query.range(date, days=30).filter('query', keyword, 'contains').get()
    if first:
        df1 = pd.DataFrame(report)
        df2 = pd.DataFrame(report2)
        first = False
    else:
        df1 = df1.append(pd.DataFrame(report))
        df2 = df2.append(pd.DataFrame(report2))
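If you are on a newer pandas where DataFrame.append has been removed, or you want to keep track of which rows came from which keyword (the "no overwrite" requirement), a variant is to tag each chunk and concatenate once at the end; a sketch reusing the same query calls:
before_frames = []
after_frames = []
for keyword, date in zip(KW_list, Date_list):
    report = webproperty.query.range(date, days=-30).filter('query', keyword, 'contains').get()
    report2 = webproperty.query.range(date, days=30).filter('query', keyword, 'contains').get()
    chunk_before = pd.DataFrame(report)
    chunk_after = pd.DataFrame(report2)
    chunk_before['keyword'] = keyword   # remember which keyword and date produced these rows
    chunk_after['keyword'] = keyword
    chunk_before['ref_date'] = date
    chunk_after['ref_date'] = date
    before_frames.append(chunk_before)
    after_frames.append(chunk_after)

df1 = pd.concat(before_frames, ignore_index=True)
df2 = pd.concat(after_frames, ignore_index=True)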

pandas - drop rows whose list column contains any value from another list

I have a huge set of data, something like 100k lines, and I am trying to drop a row from a dataframe if the row, which contains a list, contains a value from another dataframe. Here's a small example.
import pandas as pd

has = [['#a'], ['#b'], ['#c, #d, #e, #f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
tweet user
0 [#a] 1
1 [#b] 2
2 [#c, #d, #e, #f] 3
3 [#g] 5
z
0 #d
1 #a
The desired outcome would be
tweet user
0 [#b] 2
1 [#g] 5
Things I've tried
#this seems to work for dropping #a but not #d
for a in range(df.tweet.size):
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a)
#this works for my small scale example but throws an error on my big data
df['tweet'] = df.tweet.apply(', '.join)
test = df[~df.tweet.str.contains('|'.join(df2['z'].astype(str)))]
#the error being "unterminated character set at position 1343770"
#i went to check what was on that line and it returned this
basket.iloc[1343770]
user_id 17060480
tweet [#IfTheyWereBlackOrBrownPeople, #WTF]
Name: 4612505, dtype: object
Any help would be greatly appreciated.
Is ['#c, #d, #e, #f'] one string, or a list like this: ['#c', '#d', '#e', '#f']?
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
A simple solution would be:
screen = set(df2.z.tolist())
to_delete = list()  # this will speed things up by doing only 1 delete
for id, row in df.iterrows():
    if set(row.tweet).intersection(screen):
        to_delete.append(id)
df.drop(to_delete, inplace=True)
Speed comparison (for 10,000 rows):
st = time.time()
screen = set(df2.z.tolist())
to_delete = list()
for id, row in df.iterrows():
    if set(row.tweet).intersection(screen):
        to_delete.append(id)
df.drop(to_delete, inplace=True)
print(time.time()-st)
2.142000198364258

st = time.time()
for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break
print(time.time()-st)
43.99799990653992
For me, your code works if I make several adjustments.
First, iterating with range(df.tweet.size) assumes a contiguous zero-based index; it is more robust to iterate over df.tweet.index instead.
Second, your drop is never applied; use inplace=True for that.
Third, your #d sits inside a single string: '#c, #d, #e, #f' is not a list of tags, so you have to change it into a real list for the check to work.
So if you change that, the following code works fine:
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
for a in df.tweet.index:
    for search in df2.z:
        if search in df.loc[a].tweet:
            df.drop(a, inplace=True)
            break  # we already dropped this row, so no need to check it against the remaining values
This will produce the desired result. Be aware that it is potentially not optimal because nothing is vectorized.
EDIT:
You can turn the strings into a proper list with the following:
from itertools import chain
df.tweet = df.tweet.apply(lambda l: [s.strip() for s in chain(*(elem.split(",") for elem in l))])
This applies a function to each row (assuming each row contains a list with one or more elements): split each element (a string) on commas, strip the surrounding whitespace, and "flatten" all the resulting pieces in that row back into a single list.
EDIT2:
Yes, this is not really performant, but it basically does what was asked. Keep that in mind, and once it is working, try to improve your code (fewer for iterations; do tricks like collecting the indices and then dropping all of them at once).
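As one concrete improvement along those lines, a boolean mask built with a single apply avoids the explicit Python loops entirely; a sketch, assuming the tweet column already holds real lists:
screen = set(df2['z'])

# keep only the rows whose tag list has no overlap with the screen set
mask = df.tweet.apply(lambda tags: not screen.intersection(tags))
df_filtered = df[mask]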
