Automating a Python task

I would like to automate the Python code below so that it can be applied to different dataframes.
df_twitter = pd.read_csv('merged_watsonTwitter.csv')
df_original = pd.read_csv('merged_watsonOriginal.csv')
sample_1_twitter = df_twitter['ID_A'] == "08b56ebc-8eae-41b3-9c86-c79e3be542fd"
sample_1_twitter = df_twitter[sample_1_twitter]
sample_1_original = df_original['ID_B'] == "08b56ebc-8eae-41b3-9c86-c79e3be542fd"
sample_1_original = df_original[sample_1_original]
sample_1_twit_trunc = sample_1_twitter[['raw_score_parent_A','raw_score_child_A']]
sample_1_ori_trunc = sample_1_original[['raw_score_parent_B','raw_score_child_B']]
sample_1_twit_trunc.reset_index(drop=True, inplace=True)
sample_1_ori_trunc.reset_index(drop=True, inplace=True)
sample_1 = pd.concat([sample_1_twit_trunc, sample_1_ori_trunc], axis=1)
sample_1['ID'] = '08b56ebc-8eae-41b3-9c86-c79e3be542fd'
stats.ttest_rel(sample_1['raw_score_child_B'], sample_1['raw_score_child_A'])
In the code above, the ID "08b56ebc-8eae-41b3-9c86-c79e3be542fd" refers to a specific individual. If I want to calculate the t-test for all the samples I have, I need to keep replacing the ID for each person by copying and pasting it into the code above.
Is there a way to automate this process so that these sections:
df_twitter['ID_A'] == "08b56ebc-8eae-41b3-9c86-c79e3be542fd"
df_original['ID_B'] == "08b56ebc-8eae-41b3-9c86-c79e3be542fd"
sample_1['ID'] = '08b56ebc-8eae-41b3-9c86-c79e3be542fd'
could accept all the IDs I have and the entire process would run automatically?
At the end, I would also like to save each result generated by this call:
stats.ttest_rel(sample_1['raw_score_child_B'], sample_1['raw_score_child_A'])

As Klaus mentioned, you need a function that takes arguments. You can simply place your code inside a function. You may want to store your IDs in a list or any other iterable collection, and you can store the t-test results in a list as well.
ids = ["08b56ebc-8eae-41b3-9c86-c79e3be542fd","08b56ebc-8eae-41b3-9c86-c79e3be542f4"]
def runTTest (id,df_twitter,df_original):
sample_1_twitter = df_twitter['ID_A'] == id
sample_1_twitter = df_twitter[sample_1_twitter]
sample_1_original = df_original['ID_B'] == id
sample_1_original = df_original[sample_1_original]
sample_1_twit_trunc =
sample_1_twitter[['raw_score_parent_A','raw_score_child_A']]
sample_1_ori_trunc =
sample_1_original[['raw_score_parent_B','raw_score_child_B']]
sample_1_twit_trunc.reset_index(drop=True, inplace=True)
sample_1_ori_trunc.reset_index(drop=True, inplace=True)
sample_1 = pd.concat([sample_1_twit_trunc, sample_1_ori_trunc], axis=1)
sample_1['ID'] = id
return stats.ttest_rel(sample_1['raw_score_child_B'], sample_1['raw_score_child_A'])
t_test_results=[]
for id in ids:
t_test_results.append(runTTest(id,df_twitter ,df_original))
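To also save each result, a minimal sketch (assuming the ids and t_test_results lists above and that pandas is imported as pd) pairs every ID with its test statistic and p-value and writes the table to a CSV file:
rows = []
for id, result in zip(ids, t_test_results):
    # ttest_rel returns a result object with .statistic and .pvalue attributes
    rows.append({'ID': id, 'statistic': result.statistic, 'pvalue': result.pvalue})
pd.DataFrame(rows).to_csv('t_test_results.csv', index=False)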


What is the best practice to reduce/filter stream data into normal data with the most used characteristics using PySpark?

I am working on streaming web-server records using PySpark in real time, and I want to reduce/filter the data of a certain period (let's say 1 week, which is 10M records) down to 1M records, so that the sampled data represents normal data with the most used characteristics. I tried the following strategies in Python:
find the most used usernames, let's say the top n, such as Ali & Eli ----> df['username'].value_counts()
find the most used APIs (api) that Ali & Eli accessed individually.
First we need to filter the records belonging to Ali & Eli, e.g. df_filter_Ali = df[df["username"] == "Ali"], and then find the most used APIs (api) by Ali ----> df_filter_Ali['api'].value_counts(), let's say \a\s\d\ & \a\b\c\
filter the records of Ali which contain the most accessed APIs \a\s\d\ & \a\b\c\
but do them separately, in other words:
df.filter(username=ali).filter(api=/a).sample(0.1).union(df.filter(username=ali).filter(api=/b).sample(0.1)).union(df.filter(username=pejman).filter(api=/a).sample(0.1)).union(df.filter(username=ali).filter(api=/z).sample(0.1))
.union(df.filter(username=pej or ALI).filter(api=/a,/b, /z)
Then we can expect the other features belonging to these events to be contextualized as the normal data distribution. I don't think groupby() gives us the right distribution.
# Task1: normal data sampling
df = pd.read_csv("df.csv", sep=";")

df1 = []
for first_column in df["username"].value_counts().index[:50]:
    second_column_most_values = df.loc[df["username"] == first_column]["normalizedApi"].value_counts().index
    for second_column in second_column_most_values[:10]:
        sample = df.loc[(df["username"] == first_column) & (df["normalizedApi"] == second_column)].sample(frac=0.1)
        df1.append(sample)
df1 = pd.concat(df1)

df2 = []
for first_column in df["username"].value_counts().index[:50]:
    second_column_most_values = df.loc[df["username"] == first_column]["normalizedApi"].value_counts().index
    user_specific_data = []
    for second_column in second_column_most_values[:10]:
        sample = df.loc[(df["username"] == first_column) & (df["normalizedApi"] == second_column)]
        user_specific_data.append(sample)
    df2.append(pd.concat(user_specific_data).sample(frac=0.1))
df2 = pd.concat(df2)

df3 = []
for first_column in df["username"].value_counts().index[:50]:
    second_column_most_values = df.loc[df["username"] == first_column]["normalizedApi"].value_counts().index
    user_specific_data = []
    for second_column in second_column_most_values[:10]:
        sample = df.loc[(df["username"] == first_column) & (df["normalizedApi"] == second_column)]
        user_specific_data.append(sample)
    df3.append(pd.concat(user_specific_data))
df3 = pd.concat(df3)
df3 = df3.sample(frac=0.1)

sampled_napi_df = pd.concat([df1, df2, df3])
sampled_napi_df = sampled_napi_df.drop_duplicates()
sampled_napi_df = sampled_napi_df.reset_index(drop=True)
I checked the posts in this regard, but I couldn't find any interesting approach except a few posts: post1 and Filtering streaming data to reduce noise, kalman filter, How correctly reduce stream to another stream, which are C++ or Java solutions!
Edit1: I tried to use Scala, pick the top 50 usernames, loop over the top 10 APIs they accessed, reduce/sample, re-union, and return the filtered df:
val users = df.groupBy("username").count.orderBy($"count".desc).select("username").as[String].take(50)

val user_apis = users.map{ user =>
  val users_apis = df.filter($"username" === user).groupBy("normalizedApi").count.orderBy($"count".desc).select("normalizedApi").as[String].take(50)
  (user, users_apis)
}

import org.apache.spark.sql.functions.rand

val df_sampled = user_apis.map{ case (user, userApis) =>
  userApis.map{ api =>
    df.filter($"username" === user).filter($"normalizedApi" === api).orderBy(rand()).limit(10)
  }.reduce(_ union _)
}.reduce(_ union _)
I still can't figure out how this can be done efficiently in PySpark. Any help would be appreciated.
Edit1:
// desired number of users: 100
val users = df.groupBy("username").count.orderBy($"count".desc).select("username").as[String].take(100)

// desired number of APIs the selected users accessed: 100
val user_apis = users.map{ user =>
  val users_apis = df.filter($"username" === user).groupBy("normalizedApi").count.orderBy($"count".desc).select("normalizedApi").as[String].take(100)
  (user, users_apis)
}

import org.apache.spark.sql.functions._

val users_and_apis_of_interest = user_apis.toSeq.toDF("username", "apisOfInters")

val normal_df = df.join(users_and_apis_of_interest, Seq("username"), "inner")
  .withColumn("keep", array_contains($"apisOfInters", $"normalizedApi"))
  .filter($"keep" === true)
  .distinct
  .drop("keep", "apisOfInters")
  .sample(true, 0.5)
I think this does what you want in PySpark. I'll confess I didn't run the code, but it does give you the spirit of what I think you need to do.
The important thing you want to start doing is avoiding 'collect', because it requires that everything you are working with fits in memory on the driver. It's also a sign you are doing "small data" things instead of using big data tools like 'limit'. Where possible, try to use datasets/dataframes to do the work, as that will give you the most big data power.
I do use a window in this, and I've provided a link to help explain what it does.
Again, this code hasn't been run, but I am fairly certain the spirit of my intent is here. If you provide a runnable data set (in the question) I'll test/run/debug it.
from pyspark.sql.functions import count, collect_list, row_number, lit, col, explode
from pyspark.sql.window import Window

# rank each user's APIs by how often they were used (most used first)
windowSpec = Window.partitionBy("username").orderBy(col("count").desc())

top_ten = 10
top_apis = lit(100)

users_and_apis_of_interest = (
    df.groupBy("username")
    .agg(
        count('username').alias("count"),
        collect_list("normalizedApi").alias("apis")  # collect all the apis we need for later
    )
    .sort(col("count").desc())
    .limit(top_ten)
    # turn the apis we collected back into rows
    .select('username', explode("apis").alias("normalizedApi"))
    .groupBy("username", "normalizedApi")
    .agg(count("normalizedApi").alias("count"))
    .select(
        "username",
        "normalizedApi",
        # create row numbers to be able to select the top X apis per user
        row_number().over(windowSpec).alias("row_number")
    )
    .where(col("row_number") <= top_apis)  # filter out anything that isn't a top-X api
)

normal_df = (
    df.join(users_and_apis_of_interest, ["username", "normalizedApi"])
    .drop("row_number", "count")
    .distinct()
    .sample(True, 0.5)
)

normal_df.show(truncate=False)
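As a quick sanity check on the sampling (a sketch, assuming the normal_df and imports from the block above), you can look at how many sampled rows each user ends up with:
normal_df.groupBy("username").count().orderBy(col("count").desc()).show()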

Running Functions with Multiple Arguments Concurrently and Aggregating Complex Results

Set Up
This is part two of a question that I posted regarding accessing results from multiple processes.
For part one click Here: Link to Part One
I have a complex set of data that I need to compare to various sets of constraints concurrently, but I'm running into multiple issues. The first issue is getting results out of my multiple processes, and the second issue is getting anything beyond an extremely simple function to run concurrently.
Example
I have multiple sets of constraints that I need to compare against some data, and I would like to do this concurrently because I have a lot of sets of constraints. In this example I'll just use two sets of constraints.
Jupyter Notebook
Create Some Sample Constraints & Data
# Create a set of constraints
constraints = pd.DataFrame([['2x2x2', 2,2,2],['5x5x5',5,5,5],['7x7x7',7,7,7]],
                           columns=['Name','First', 'Second', 'Third'])
constraints.set_index('Name', inplace=True)

# Create a second set of constraints
constraints2 = pd.DataFrame([['4x4x4', 4,4,4],['6x6x6',6,6,6],['7x7x7',7,7,7]],
                            columns=['Name','First', 'Second', 'Third'])
constraints2.set_index('Name', inplace=True)

# Create some sample data
items = pd.DataFrame([['a', 2,8,2],['b',5,3,5],['c',7,4,7]], columns=['Name','First', 'Second', 'Third'])
items.set_index('Name', inplace=True)
Running Sequentially
If I run this sequentially I get my desired results, but with the data I am actually dealing with it can take over 12 hours. Here is what it looks like run sequentially, so that you know what my desired result looks like.
# Function
def seq_check_constraint(df_constraints_input, df_items_input):
    df_constraints = df_constraints_input.copy()
    df_items = df_items_input.copy()
    df_items['Product'] = df_items.product(axis=1)
    df_constraints['Product'] = df_constraints.product(axis=1)
    for constraint in df_constraints.index:
        df_items[constraint+'Product'] = df_constraints.loc[constraint,'Product']
    for constraint in df_constraints.index:
        for item in df_items.index:
            col_name = constraint+'_fits'
            df_items[col_name] = False
            df_items.loc[df_items['Product'] < df_items[constraint+'Product'], col_name] = True
    df_res = df_items.iloc[:, 7:]
    return df_res

constraint_sets = [constraints, constraints2, ...]
results = {}
counter = 0
for df in constraint_sets:
    res = seq_check_constraint(df, items)
    results['constraints'+str(counter)] = res
    counter += 1
or uglier:
df_res1 = seq_check_constraint(constraints, items)
df_res2 = seq_check_constraint(constraints2, items)
results = {'constraints0':df_res1, 'constraints1': df_res2}
As a result of running these sequentially I end up with DataFrames like the one shown here:
I'd ultimately like to end up with a dictionary or list of the DataFrames, or be able to append the DataFrames all together. The order in which I get the results doesn't matter to me; I just want to have them all together and need to be able to do further analysis on them.
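For that aggregation step, a minimal sketch (assuming the results dictionary built in the sequential example above) that combines every per-constraint result into a single DataFrame could look like this:
combined = pd.concat(results)  # the dict keys become an outer index level, so each row stays traceable to its constraint set
Further analysis can then be run on combined, or on the individual entries of results.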
What I've Tried
So this brings me to my attempts at multiprocessing. From what I understand you can either use Queues or Managers to handle shared data and memory, but I haven't been able to get either to work. I am also struggling to get my function, which takes two arguments, to execute within the Pool at all.
Here is my code as it stands right now using the same sample data from above:
Function
def check_constraint(df_constraints_input, df_items_input):
    df_constraints = df_constraints_input.copy()
    df_items = df_items_input.copy()
    df_items['Product'] = df_items.product(axis=1)  # Mathematical Product
    df_constraints['Product'] = df_constraints.product(axis=1)
    for constraint in df_constraints.index:
        df_items[constraint+'Product'] = df_constraints.loc[constraint,'Product']
    for constraint in df_constraints.index:
        for item in df_items.index:
            col_name = constraint+'_fits'
            df_items[col_name] = False
            df_items.loc[df_items['Product'] < df_items[constraint+'Product'], col_name] = True
    df_res = df_items.iloc[:, 7:]
    return df_res
Jupyter Notebook
df_manager = mp.Manager()
df_ns = df_manager.Namespace()
df_ns.constraint_sets = [constraints, constraints2]
print('---Starting pool---')
if __name__ == '__main__':
    with mp.Pool() as p:
        print('--In the pool--')
        res = p.map_async(mpf.check_constraint, (df_ns.constraint_sets, itertools.repeat(items)))
        print(res.get())
and my current error:
TypeError: check_constraint() missing 1 required positional argument: 'df_items_input'
The easiest way is to create a list of tuples (where each tuple represents one set of arguments to the function) and pass it to starmap.
df_manager = mp.Manager()
df_ns = df_manager.Namespace()
df_ns.constraint_sets = [constraints, constraints2]
print('---Starting pool---')
if __name__ == '__main__':
    with mp.Pool() as p:
        print('--In the pool--')
        check_constraint_args = []
        for constraint in df_ns.constraint_sets:
            check_constraint_args.append((constraint, items))
        # starmap unpacks each tuple into positional arguments and returns the results directly
        res = p.starmap(mpf.check_constraint, check_constraint_args)
        print(res)
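For reference, here is a minimal standalone sketch of the same starmap pattern with a hypothetical two-argument worker (not from the original code), showing how each tuple is unpacked into positional arguments:
from multiprocessing import Pool

def add(a, b):
    return a + b

if __name__ == '__main__':
    with Pool() as p:
        # equivalent to calling add(1, 2) and add(3, 4) in worker processes
        print(p.starmap(add, [(1, 2), (3, 4)]))  # [3, 7]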

how to iterate with groupby in pandas

I have a function minmax that basically iterates over a dataframe of transactions. I want to produce a set of calculations keyed by the id, so accountstart and accountend are the two fields calculated. The intention is to make these calculations by month and account.
So when I do:
df1 = df.loc[df['accountNo']==10]
minmax(df1) works.
What I can't do is:
df.groupby('accountNo').apply(minmax)
When I do:
grouped = df.groupby('accountNo')
for i, j in grouped:
    print(minmax(j))
It does the computation and prints the result, but without the print it complains about a KeyError: -1 coming from itertools. So awkward.
How can I tackle this in pandas?
def minmax(x):
    dfminmax = {}
    accno = set(x['accountNo'])
    accno = repr(accno)
    kgroup = x.groupby('monthStart')['cumsum'].sum()
    maxt = x['startbalance'].max()
    kgroup = pd.DataFrame(kgroup)
    kgroup['startbalance'] = 0
    kgroup['startbalance'][0] = maxt
    kgroup['endbalance'] = 0
    kgroup['accountNo'] = accno
    kgroup['accountNo'] = kgroup['accountNo'].str.strip('{}.0')
    kgroup.reset_index(inplace=True)
    for idx, row in kgroup.iterrows():
        if kgroup.loc[idx,'startbalance'] == 0:
            kgroup.loc[idx,'startbalance'] = kgroup.loc[idx-1,'endbalance']
        if kgroup.loc[idx,'endbalance'] == 0:
            kgroup.loc[idx,'endbalance'] = kgroup.loc[idx,'cumsum'] + kgroup.loc[idx,'startbalance']
    dfminmax['monthStart'].append(kgroup['monthStart'])
    dfminmax['startbalance'].append(kgroup['startbalance'])
    dfminmax['endbalance'].append(kgroup['endbalance'])
    dfminmax['accountNo'].append(kgroup['accountNo'])
    return dfminmax
.agg() applies your function to each column of a group as a Series, whereas .apply() hands your function the whole group. Using .agg, as in df.groupby('accountNo').agg(yourfunction), should yield better results here. Be sure to check out the documentation for details on implementation.
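For illustration, a minimal sketch of the .agg route (with hypothetical column choices, assuming the df from the question) aggregates each column per account:
summary = df.groupby('accountNo').agg(
    startbalance=('startbalance', 'max'),  # highest starting balance per account
    total_cumsum=('cumsum', 'sum'),        # summed cumsum per account
)
Adapting the per-month logic would mean grouping by ['accountNo', 'monthStart'] instead.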

Script keeps showing "SettingWithCopyWarning"

Hello, my problem is that my script keeps showing the message below:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
downcast=downcast
I searched Google for a while regarding this, and it seems like my code is somehow assigning a sliced dataframe to a new variable, which is problematic.
The problem is that I can't find where my code gets problematic.
I tried the copy function and separated the nested functions, but it is not working.
I attached my code below.
from operator import gt, lt
import numpy as np

def case_sorting(file_get, col_get, methods_get, operator_get, value_get):
    ops = {">": gt, "<": lt}
    col_get = str(col_get)
    value_get = int(value_get)
    if methods_get == "|x|":
        new_file = file_get[ops[operator_get](file_get[col_get], value_get)]
    else:
        new_file = file_get[ops[operator_get](file_get[col_get], np.percentile(file_get[col_get], value_get))]
    return new_file
Basically, what I was trying to do was to make a Flask API that takes an Excel file as input and returns a CSV file with some filtering applied. So I defined some functions first.
def get_brandlist(df_input, brand_input):
    if brand_input == "default":
        final_list = (pd.unique(df_input["브랜드"])).tolist()
    else:
        final_list = brand_input.split("/")
    if '브랜드' in final_list:
        final_list.remove('브랜드')
    final_list = [x for x in final_list if str(x) != 'nan']
    return final_list
Then I defined the main function
def select_bestitem(df_data, brand_name, col_name, methods, operator, value):
    # // 2-1 // to remove unnecessary rows and columns with na values
    df_data = df_data.dropna(axis=0 & 1, how='all')
    df_data.fillna(method='pad', inplace=True)
    # // 2-2 // iterate over all rows to find which row contains brand value
    default_number = 0
    for row in df_data.itertuples():
        if '브랜드' in row:
            df_data.columns = df_data.iloc[default_number, :]
            break
        else:
            default_number = default_number + 1
    # // 2-3 // create the list that contains all the target brand names
    brand_list = get_brandlist(df_input=df_data, brand_input=brand_name)
    # // 2-4 // subset the target brands into another dataframe
    df_data_refined = df_data[df_data.iloc[:, 1].isin(brand_list)]
    # // 2-5 // split the dataframe based on the brand name, and apply the input condition
    df_per_brand = {}
    df_per_brand_modified = {}
    for brand_each in brand_list:
        df_per_brand[brand_each] = df_data_refined[df_data_refined['브랜드'] == brand_each]
        file = df_per_brand[brand_each].copy()
        df_per_brand_modified[brand_each] = case_sorting(file_get=file, col_get=col_name, methods_get=methods,
                                                         operator_get=operator, value_get=value)
    # // 2-6 // merge all the remaining dataframes
    df_merged = pd.DataFrame()
    for brand_each in brand_list:
        df_merged = df_merged.append(df_per_brand_modified[brand_each], ignore_index=True)
    final_df = df_merged.to_csv(index=False, sep=',', encoding='utf-8')
    return final_df
I am going to import this function in my app.py later.
I am quite new to coding, so I'm really sorry if my code is hard to understand, but I just really wanted to get rid of this annoying warning message. Thanks in advance for any help :)

counting entries yields a wrong dataframe

So I'm trying to automate the process of getting the number of entries a person has by using pandas.
Here's my code:
st = pd.read_csv('list.csv', na_values=['-'])
auto = pd.read_csv('data.csv', na_values=['-'])
comp = st.Component.unique()
eventname = st.EventName.unique()
def get_summary(ID):
    for com in comp:
        for event in eventname:
            arr = []
            for ids in ID:
                x = len(st.loc[(st.User == str(ids)) & (st.Component == str(com)) & (st.EventName == str(event))])
                arr.append(x)
            auto.loc[:, event] = pd.Series(arr, index=auto.index)
The output I get looks like this:
I ran some manual loops to see the entries for the first four columns, and I counted them manually in the csv file as well. When I put a print statement inside the loop, I can see that it does count the entries correctly, but at some point the values get overwritten with zeros.
What am I missing/doing wrong here?
