Function not returning the pyspark DataFrame - python

I have defined a function which returns a DataFrame that is the intersection of all the DataFrames given as input. However, when I store the output of the function in a variable, nothing gets stored: the variable ends up holding a NoneType object.
from pyspark.sql.functions import col

def intersection(list1, intersection_df, i):
    if (i == 1):
        intersection_df = list1[0]
        print(type(intersection_df))
        intersection(list1, intersection_df, i+1)
    elif (i > len(list1)):
        print(type(intersection_df))
        a = spark.createDataFrame(intersection_df.rdd)
        a.show()
        return a
    else:
        intersection_df = intersection_df.alias('intersection_df')
        tb = list1[i-1]
        tb = tb.alias('tb')
        intersection_df = intersection_df.join(tb, intersection_df['value'] == tb['value']).where(col('tb.value').isNotNull()).select(['intersection_df.value'])
        print(type(intersection_df))
        intersection(list1, intersection_df, i+1)
For example, if I give the following input,
from pyspark.sql.types import StringType

list1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
list2 = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
list3 = [6, 7, 8, 9, 10, 11, 12, 13, 4, 16, 343]

df1 = spark.createDataFrame(list1, StringType())
df2 = spark.createDataFrame(list2, StringType())
df3 = spark.createDataFrame(list3, StringType())

list4 = [df1, df2, df3]
empty_df = []
intersection_df = intersection(list4, empty_df, 1)
I expect the following output to be stored in intersection_df:
+-----+
|value|
+-----+
|    7|
|   11|
|    8|
|    6|
|    9|
|   10|
|    4|
|   12|
|   13|
+-----+

I think you got hit by the curse of recursion.
Problem:
You are calling intersection recursively, but you only return in the elif branch. When that branch finally returns your DataFrame, the result has nowhere to go: the recursive calls in the other two branches discard it (recall: each function call gets its own stack frame), so the outermost call falls through and implicitly returns None.
Solution:
Return the result of the recursive call in the if and else branches as well, e.g. return intersection(list1, intersection_df, i+1) in your if branch. A sketch with the returns added is shown below.
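For reference, here is a minimal sketch of the corrected function (assuming the same spark session and pyspark.sql.functions.col import as above; the spark.createDataFrame(...rdd) round-trip is dropped since the accumulated DataFrame can be returned directly):

from pyspark.sql.functions import col

def intersection(list1, intersection_df, i):
    if i == 1:
        # Seed the accumulator with the first DataFrame, then recurse.
        return intersection(list1, list1[0], i + 1)
    elif i > len(list1):
        # Every DataFrame has been consumed: the accumulator is the answer.
        return intersection_df
    else:
        intersection_df = intersection_df.alias('intersection_df')
        tb = list1[i - 1].alias('tb')
        # Inner join on 'value' keeps only the rows present in both frames.
        intersection_df = (intersection_df
                           .join(tb, intersection_df['value'] == tb['value'])
                           .where(col('tb.value').isNotNull())
                           .select('intersection_df.value'))
        return intersection(list1, intersection_df, i + 1)

With the returns in place, intersection(list4, empty_df, 1) hands the final DataFrame all the way back up the call stack, so the assignment in the question works as expected.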

Related

Python Pandas: Comparison of elements in Dataframe/series

I have a DataFrame in a variable called "myDataFrame" that looks like this:
+------+-------+--------+
| Type | Count | Status |
+------+-------+--------+
| a    | 70    | 0      |
| a    | 70    | 0      |
| b    | 70    | 0      |
| c    | 74    | 3      |
| c    | 74    | 2      |
| c    | 74    | 0      |
+------+-------+--------+
I am using a vectorized approach to process the rows in this DataFrame, since I have about 116 million rows.
So I wrote something like this:
myDataFrame['result'] = processDataFrame(myDataFrame['status'], myDataFrame['Count'])
In my function, I am trying to do this:
def processDataFrame(status, count):
    resultsList = list()
    if status == 0:
        resultsList.append(count + 10000)
    else:
        resultsList.append(count - 10000)
    return resultsList
But I get this for comparison status values:
Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
What am I missing?
We can do this without a self-defined function:
myDataFrame['result'] = np.where(myDataFrame['status'] == 0,
                                 myDataFrame['Count'] + 10000,
                                 myDataFrame['Count'] - 10000)
Update
df.apply(lambda x: processDataFrame(x['Status'], x['Count']), 1)
0    [10070]
1    [10070]
2    [10070]
3    [-9926]
4    [-9926]
5    [10074]
dtype: object
I think your function is not really doing the vectorized part.
When it is called, you pass status = myDataFrame['status'], so when execution reaches the first if, it evaluates myDataFrame['status'] == 0. But that expression is a boolean Series (whether each element of the status column equals 0), so it does not have a single truth value, hence the error. Similarly, even if the condition could be evaluated, resultsList would just get the whole Count column appended, either all plus 10000 or all minus 10000. A minimal reproduction of the error is sketched below.
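For illustration (a hypothetical three-row Series, not from the original post):

import pandas as pd

status = pd.Series([0, 0, 3])
print(status == 0)
# 0     True
# 1     True
# 2    False
# dtype: bool

# Using the comparison directly in an `if` raises:
# ValueError: The truth value of a Series is ambiguous. ...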
Edit:
This keeps your function, but uses the built-in vectorized pandas operations inside it:
def processDataFrame(status, count):
    status_0 = (status == 0)
    output = count.copy()  # copy so the original column is not modified in place
    output[status_0] += 10000
    output[~status_0] -= 10000
    return output
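With that version, the call from the question stays fully vectorized (a sketch, assuming the column is named 'Status' as in the example table):

myDataFrame['result'] = processDataFrame(myDataFrame['Status'], myDataFrame['Count'])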

Pandas DataFrame string replace followed by split and set intersection

I have the following pandas DataFrame:
data = ['18#38#123#23=>21', '18#38#23#55=>35']
d = pd.DataFrame(data, columns = ['rule'])
and I have a list of integers:
r = [18, 55]
and I want to filter rules from the above DataFrame where all the integers in the list r are present in the rule. I tried the following code and failed:
d[d['rule'].str.replace('=>','#').split('#').astype(set).issuperset(set(r))]
How can I achieve the desired filtering with pandas?
You were going in the right direction; you just need to use the apply function instead:
d[d['rule'].str.replace('=>','#').str.split('#').apply(lambda x: set(x).issuperset(set(map(str,r))))]
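Step by step, a sketch reusing d and r from the question:

import pandas as pd

data = ['18#38#123#23=>21', '18#38#23#55=>35']
d = pd.DataFrame(data, columns=['rule'])
r = [18, 55]

# Normalize the separator, then split into lists of number-strings.
tokens = d['rule'].str.replace('=>', '#').str.split('#')
# 0    [18, 38, 123, 23, 21]
# 1    [18, 38, 23, 55, 35]

# The tokens are strings, so compare against the str versions of r.
mask = tokens.apply(lambda x: set(x).issuperset(set(map(str, r))))
print(d[mask])  # keeps only row 1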
Using str.get_dummies
d.rule.str.replace('=>', '#').str.get_dummies(sep='#').loc[:, list(map(str, r))].all(1)
Outputs
0    False
1     True
dtype: bool
Detail:
get_dummies + loc returns
   18  55
0   1   0
1   1   1
My initial instinct would be to use a list comprehension:
df = pd.DataFrame(['18#38#123#23=>21', '188#38#123#23=>21', '#18#38#23#55=>35'], columns=['rule'])

def wrap(n):
    # match n only when it is not embedded in a longer number
    return r'(?<=[^|^\d]){}(?=[^\d])'.format(n)

patterns = [18, 55]
pd.concat([df['rule'].str.contains(wrap(pattern)) for pattern in patterns], axis=1).all(axis=1)
Output:
0    False
1    False
2     True
My approach is similar to @RafaelC's answer, but converts all the strings to int:
new_df = d.rule.str.replace('=>', '#').str.get_dummies(sep='#')
new_df.columns = new_df.columns.astype(int)
has_all = new_df[r].all(1)

# then you can assign a new column on the initial DataFrame
d['new_col'] = 10
d.loc[has_all, 'new_col'] = 100
Output:
+---+------------------+---------+
|   | rule             | new_col |
+---+------------------+---------+
| 0 | 18#38#123#23=>21 | 10      |
| 1 | 18#38#23#55=>35  | 100     |
+---+------------------+---------+

What is the fastest way to conditionally change the values of a dataframe in every index and column?

Is there a way to reduce by a constant number each element of a DataFrame that satisfies a condition on its own value, without using a loop?
For instance, each cell whose value is < 2 has its value reduced by 1.
Thank you very much.
I like to do this with masking.
Here is an inefficient loop doing the same thing on your example:
# Example using a loop
for i in df.index:
    if df.loc[i, 'column'] < 2:
        df.loc[i, 'column'] = df.loc[i, 'column'] - 1
The following code gives the same result, but it will generally be much faster because it does not use a loop.
# Same effect using a mask
mask = (df['column'] < 2)                            # find entries that are less than 2
df.loc[mask, 'column'] = df.loc[mask, 'column'] - 1  # subtract 1 from just those entries
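Since the question asks about every index and column, the same masking idea extends to the whole DataFrame; a sketch, assuming every column is numeric:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 4], 'y': [2, 2, 4], 'z': [3, 2, 4]})

# Subtract 1 everywhere the condition holds, leave the other cells unchanged.
df = df.mask(df < 2, df - 1)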
I am not sure if this is the fastest, but you can use the .apply function:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.array([[1, 2, 3], [2, 2, 2], [4, 4, 4]]),
                  columns=['x', 'y', 'z'])

def conditional_add(x):
    if x > 2:
        return x + 2
    else:
        return x

df['x'] = df['x'].apply(conditional_add)
This will add 2 to the final row of column x, the only value there greater than 2.
More concisely (using the data from Willie's answer), subtracting 2 from every cell that is less than 2:
df - ((df < 2) * 2)

   x  y  z
0 -1  2  3
1  2  2  2
2  4  4  4
In this case I would use the np.where method from the NumPy library.
The method uses the following logic:
np.where(<condition>, <value if true>, <value if false>)
Example:
# import the modules we need
import pandas as pd
import numpy as np

# create an example DataFrame
df = pd.DataFrame({'A': [3, 1, 5, 0.5, 2, 0.2]})

| A   |
|-----|
| 3   |
| 1   |
| 5   |
| 0.5 |
| 2   |
| 0.2 |
# apply np.where with the conditional statement
df['A'] = np.where(df.A < 2, df.A - 1, df.A)

| A    |
|------|
| 3    |
| 0.0  |
| 5    |
| -0.5 |
| 2    |
| -0.8 |

Add +1 to each item in a comma-separated string in pandas dataframe

I have a pandas dataframe structured as follows:
| ID | Start   | Stop    |
|----|---------|---------|
| 1  | 1,2,3,4 | 5,6,7,7 |
| 2  | 100,101 | 200,201 |
| 2 | 100,101 | 200,201 |
For each row in the dataframe, I'd like to add 1 to each value in the Start column. The dtype for the Start column is 'object'.
Desired output looks like this:
| ID | Start   | Stop    |
|----|---------|---------|
| 1  | 2,3,4,5 | 5,6,7,7 |
| 2  | 101,102 | 200,201 |
I've tried the following (and many versions of the following), but get TypeError: cannot concatenate 'str' and 'int' objects:
df['test'] = [str(x + 1) for x in df['Start']]
I tried casting the column to int, but got invalid literal for long() with base 10: '101,102':
df['test'] = [int(x) + 1 for x in df['start'].astype(int)]
I tried converting the field to a list using str.split(), then casting each item as an integer, but that didn't work either.
Thanks in advance!
df['Start'] is the whole Series, so you have to iterate over it and then split each value:
new_series = []
for x in df['Start']:
    value_list = []
    for y in x.rstrip(',').split(','):
        value_list.append(str(int(y) + 1))
    new_series.append(','.join(value_list))
df['test'] = new_series
The error telling you that you cannot concatenate 'str' and 'int' objects means x must be a string. Casting the whole string with int(x) won't work here, though, since x contains commas (e.g. '1,2,3,4'); split on the commas, cast each piece, add 1, and rejoin:
df['test'] = [','.join(str(int(y) + 1) for y in x.split(',')) for x in df['Start']]
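Equivalently, as a sketch using Series.apply rather than a list comprehension:

import pandas as pd

df = pd.DataFrame({'ID': [1, 2],
                   'Start': ['1,2,3,4', '100,101'],
                   'Stop': ['5,6,7,7', '200,201']})

# Split each comma-separated string, increment each piece, and rejoin.
df['Start'] = df['Start'].apply(
    lambda s: ','.join(str(int(v) + 1) for v in s.split(',')))
print(df)
#    ID    Start     Stop
# 0   1  2,3,4,5  5,6,7,7
# 1   2  101,102  200,201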
# This version assumes Start and Stop hold lists of ints rather than comma-separated strings.
df = pd.DataFrame({'Start': [[1, 2, 3, 4], [100, 101]], 'Stop': [[5, 6, 7, 7], [200, 201]]})
df.Start = df.Start.apply(lambda x: [y + 1 for y in x])

How to create a table without using methods or for-loops?

I'm trying to create a 4x3 table without methods or for-loops.
I'd like to use what I learned in class, which is booleans, if-statements, and while-loops.
I want it so that if I input create_table('abcdefghijkl'), it starts at the top-left cell and fills down to the end of the column, then starts again at the top of the next column, and so on, as displayed below:
| a | e | i |
| b | f | j |
| c | g | k |
| d | h | l |
Below is what I have so far. It's not complete. How do I extend the function so that after 4 rows down, the string continues in the next column, starting from the top?
I'm wracking my brain over this.
All the examples I can find online use for loops and methods to create tables like these, but I'd like to implement this one with a while loop.
Thanks in advance!
def create_table(table):
    t = "" + "|" + ""
    i = 0
    while i < 12:
        t = t + " " + "|" + table[i] + " "
        i = i + 1
    print(t)
    return table
Think about it in terms of rows instead of columns. You're writing out a row at a time, not a column at a time, so look at the indices of the individual cells in the original list:
| 0 | 4 | 8 |
| 1 | 5 | 9 |
| 2 | 6 | 10 |
| 3 | 7 | 11 |
Notice each row's cells' indices differ by 4. Find a simple expression for the nth row's cells and the task will become much easier, as you'll essentially be printing out a regular table.
You can translate most for loops to while loops with a simple recipe, so if you figure out how to do it with a for loop, then you are good to go. If you have
for x in s:
    {statements}
Make it
i = 0
while i < len(s):
    x = s[i]
    {statements}
    i += 1
It just won't work for some enumerable types that don't support length and indexing, such as generators.
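For instance, applied to looping over the table string (a hypothetical illustration):

s = 'abcdefghijkl'

# for x in s: print(x)  -- translated with the recipe:
i = 0
while i < len(s):
    x = s[i]
    print(x)
    i += 1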
Because you are printing to the terminal, you would want to think about printing each horizontal row, rather than each vertical column. Try something like:
table = 'abcdefghijkl'
i = 0
while i < 4:
    print("| {} | {} | {} |".format(table[i], table[i+4], table[i+8]))
    i += 1
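which prints the desired layout:
| a | e | i |
| b | f | j |
| c | g | k |
| d | h | l |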
