PySpark and Python - Column is not iterable

I am using Python 3 with Azure Databricks.
I have a dataframe. The column 'BodyJson' is a JSON string that contains one occurrence of 'vmedwifi/' within it. I have added the constant string literal 'vmedwifi' as a column named 'email_type'.
I want to find the start position of the text 'vmedwifi/' within the column 'BodyJson' - all columns are in the same dataframe. My code is below.
I get the error 'Column is not iterable' on the second line of code. Any ideas what I am doing wrong?
# Weak logic to try and identify email addresses
emailDf = inputDf.select('BodyJson').where("BodyJson like('%vmedwifi%#%.%')").withColumn('email_type', lit('vmedwifi'))
b=emailDf.withColumn('BodyJson_Cutdown', substring(emailDf.BodyJson, expr('locate(emailDf.email_type, emailDf.BodyJson)'), 20))
TypeError: Column is not iterable
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<command-536715104422314> in <module>()
12 #emailDf1 = inputDf.select('BodyJson').where("BodyJson like('%#xxx.abc.uk%')")
13
---> 14 b=emailDf.withColumn('BodyJson_Cutdown', substring(emailDf.BodyJson, expr('locate(emailDf.email_type, emailDf.BodyJson)'), 20))
15
16 #inputDf.unpersist()

The issue was twofold: inside the string passed to expr, columns must be referenced by their SQL names (locate(email_type, BodyJson)), not through the Python variable (emailDf.email_type); and the Python substring() helper only accepts plain integers for its pos and len arguments, so handing it a Column triggers 'Column is not iterable'.
I decided to tackle this problem a different way, which got around the issue.

Related

Trying to change a column from object to datetime

I am trying to change an object column to a datetime column.
However, every time I run the code below, I get a TypeError: 'NaTType' object is not callable.
I assume this is due to the blanks in the column, but I am not sure how to resolve it. Removing the rows is not an option here because there are other columns to consider as well.
df['jahreskontakt'] = pd.to_datetime(df['jahreskontakt'], errors='ignore')
Does anybody have any advice? Thanks in advance.
Explanations:
df['jahreskontakt'] #column with yearly contacts by sales team
Values that can be found in the column:
2014-07-01 00:00:00
00:00:00
""
Full error (tried both errors='coerce' and errors='ignore'):
TypeError                                 Traceback (most recent call last)
<ipython-input-122-9d57805d9290> in <module>()
----> 1 df['jahreskontakt'] = pd.to_datetime(df['jahreskontakt'], errors='coerce')
TypeError: 'NaTType' object is not callable
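For reference, a minimal sketch (with an invented two-value column standing in for 'jahreskontakt') of what errors='coerce' does with blanks in a clean session. The 'NaTType' object is not callable message itself usually means pd.to_datetime has been rebound to something else earlier in the session (for example to a NaT value by an accidental assignment), so restarting the kernel before re-running is worth trying:

```python
import pandas as pd

# Invented two-value column standing in for 'jahreskontakt'.
df = pd.DataFrame({"jahreskontakt": ["2014-07-01 00:00:00", ""]})

# In an unmodified session, errors="coerce" converts values that cannot
# be parsed (such as the empty string) to NaT instead of raising.
df["jahreskontakt"] = pd.to_datetime(df["jahreskontakt"], errors="coerce")
```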

How to iterate over dates in Python/MySQL? 'datetime.date' is not iterable

Given a MySQL table with columns ("Title", "Author", "Date"), how do you:
iterate over the database to compare a user-provided date input to the database column "Date"
append matching records to lists
without getting the error "TypeError: argument of type 'datetime.date' is not iterable"? Example code below (Python 3.7):
date = request.form.get("date")
list1 = []
list2 = []
list3 = []
results = db.session.query(Books).all()
for i in results:
    if date in i.date is True:
        list1.append(i.title)
        list2.append(i.author)
        list3.append(i.date)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-28-c4085a31faa3> in <module>()
5 results = db.session.query(Books).all()
6 for i in results:
----> 7 if date in i.date:
8 list1.append(i.title)
9 list2.append(i.author)
TypeError: argument of type 'datetime.date' is not iterable
Use a SQLAlchemy filter to search. Doing database operations in application code performs comparatively poorly.
results = db.session.query(Books).filter(Books.date == date)
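A self-contained sketch of that approach, using an in-memory SQLite database and an invented Books model in place of the Flask-SQLAlchemy setup (assumes SQLAlchemy 1.4+):

```python
import datetime
from sqlalchemy import create_engine, Column, Integer, String, Date
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Books(Base):
    __tablename__ = "books"
    id = Column(Integer, primary_key=True)
    title = Column(String)
    author = Column(String)
    date = Column(Date)

# In-memory SQLite stands in for the real MySQL database.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([
        Books(title="T1", author="A1", date=datetime.date(2020, 1, 1)),
        Books(title="T2", author="A2", date=datetime.date(2021, 5, 5)),
    ])
    session.commit()

    # Let the database do the comparison instead of `date in i.date`.
    wanted = datetime.date(2020, 1, 1)
    matches = session.query(Books).filter(Books.date == wanted).all()
    titles = [b.title for b in matches]
    authors = [b.author for b in matches]
```

In the actual view, the string from request.form.get("date") would first be parsed into a date (for example with datetime.date.fromisoformat) before being used in the filter.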

DataFrame.duplicated() error in function recursion: TypeError: duplicated() got multiple values for argument 'keep'

I am using Python 3 on Jupyter notebook.
Use case:
Process all the records of an excel file.
Problem:
The excel file has duplicate records for the column Login id, but the underlying processing cannot handle a data set with duplicate records for a login id. So I am trying to process the records in batches, filtering the duplicate records into sub data sets with a recursive function.
Test data set:
Python code:
# process withdrawal of duplicate entries with a recursive function
def withdrw_user_balance(withdraw_records):
    # create a new data frame of the duplicate rows
    duplicateRowsDF = withdraw_records.duplicated(['Login'], keep = "first")
    duplicateRowsDF.head(10)
    # remove duplicates from the original data frame
    withdraw_records.drop_duplicates(['Login'], keep='first', inplace=True)
    # process the withdraw request for the non-duplicated rows
    # processWithdraw()
    # update the status
    for row in withdraw_records.itertuples():
        withdraw_records.at[row.Index, 'status'] = 1
    # write the processed data frame to excel
    # writeExcel(withdraw_records)
    # clear the object withdraw_records
    withdraw_records = None
    # if the new dataframe is non-empty, recurse to find more duplicate
    # records; else return
    if duplicateRowsDF.size > 0:
        print("recrusion called")
        withdrw_user_balance(duplicateRowsDF)
    else:
        return True
The next code is to execute the recursive function:
# import the excel file
withdraw_records_excel = pd.read_excel("batch-withdraw-duplicate-login.xlsx")
withdraw_records_excel.tail()
withdraw_records_excel.size
withdrw_user_balance(withdraw_records_excel)
Output:
recrusion called
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-6-bae5054c7199> in <module>
4
5 withdraw_records_excel.size
----> 6 withdrw_user_balance(withdraw_records_excel)
<ipython-input-5-2ecdc243bfb2> in withdrw_user_balance(withdraw_records)
27 if duplicateRowsDF.size > 0:
28 print("recrusion called")
---> 29 withdrw_user_balance(duplicateRowsDF)
30 else:
31 return True;
<ipython-input-5-2ecdc243bfb2> in withdrw_user_balance(withdraw_records)
4 def withdrw_user_balance(withdraw_records):
5 #create new data frame using the
----> 6 duplicateRowsDF = withdraw_records.duplicated(['Login'], keep = "first")
7 duplicateRowsDF.head(10)
8
TypeError: duplicated() got multiple values for argument 'keep'
I think the error occurs because function parameters are passed by reference in Python, so the data frame somehow keeps a reference from the previous function call, which makes the DataFrame.duplicated() method throw the error.
To fix that, I set the DataFrame object to None with withdraw_records = None after processing, but it did not help.
Note that I am a beginner in Python, so I may have wrong information on types and object references.
Thanks for your help.
Finally, I was able to fix the issue in the code. The root cause: DataFrame.duplicated() returns a boolean Series, not a DataFrame, so the recursive call received a Series - and Series.duplicated() accepts only the keep argument, which is why ['Login'] and keep="first" collided into "got multiple values for argument 'keep'".
Rather than
#create new data frame using the
duplicateRowsDF = withdraw_records.duplicated(['Login'], keep = "first")
I need to write:
duplicateRowsDF = withdraw_records[withdraw_records.duplicated(['Login'], keep = "first")]
This returns the sub-DataFrame of duplicate records in each call, which can be passed as the parameter in the next recursive call to create batches of unique records.
I used the PixieDust debugging tool, which helped a lot in identifying the issue. See:
Use the PixieDust debugger with Jupyter notebook
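Putting the fix together, here is a runnable condensed sketch of the corrected recursion. The data frame is invented, and returning the batches stands in for the real withdraw/excel-writing steps:

```python
import pandas as pd

def withdrw_user_balance(withdraw_records):
    # Boolean-index with duplicated() so the recursive call receives a
    # DataFrame of the duplicate rows, never a bare boolean Series.
    duplicates = withdraw_records[withdraw_records.duplicated(['Login'], keep='first')]
    # One batch of unique logins, processed here by setting status = 1.
    batch = withdraw_records.drop_duplicates(['Login'], keep='first').copy()
    batch['status'] = 1
    if duplicates.size > 0:
        return [batch] + withdrw_user_balance(duplicates)
    return [batch]

# Invented stand-in for the excel file: three 'a' rows, one 'b' row.
df = pd.DataFrame({'Login': ['a', 'a', 'a', 'b'], 'status': 0})
batches = withdrw_user_balance(df)
```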

Why is this error occurring when I am using filter in pandas: TypeError: 'int' object is not iterable

When I want to remove some elements that satisfy a particular condition, Python throws the following error:
TypeError Traceback (most recent call last)
<ipython-input-25-93addf38c9f9> in <module>()
4
5 df = pd.read_csv('fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv');
----> 6 df = filter(df,~('-02-29' in df['Date']))
7 '''tmax = []; tmin = []
8 for dates in df['Date']:
TypeError: 'int' object is not iterable
The following is the code :
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv');
df = filter(df,~('-02-29' in df['Date']))
What wrong could I be doing?
Following is sample data
Sample Data
Use df.filter() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html)
Also please attach the csv so we can run it locally.
Another way to do this is to use one of pandas' string methods for Boolean indexing:
df = df[~ df['Date'].str.contains('-02-29')]
You will still have to make sure that all the dates are actually strings first.
Edit:
Seeing the picture of your data, maybe this is what you want (slashes instead of hyphens):
df = df[~ df['Date'].str.contains('/02/29')]
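For context, a self-contained version with invented data. The original error arises because the built-in filter(function, iterable) received the DataFrame as its "function" and ~('-02-29' in df['Date']) as its "iterable": in on a Series tests the index and yields a plain bool, and ~False is the int -1, which cannot be iterated. Boolean indexing with .str.contains performs the elementwise test instead:

```python
import pandas as pd

# Invented two-row sample in place of the real csv.
df = pd.DataFrame({'Date': ['2012-02-29', '2012-03-01'], 'Value': [1, 2]})

# Keep only rows whose Date does not contain the leap-day substring.
df = df[~df['Date'].str.contains('-02-29')]
```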

Python: json_normalize "String indices must be integers" error

I am getting a TypeError: "string indices must be integers" in the following code.
import pandas as pd
import json
from pandas.io.json import json_normalize
full_json_df = pd.read_json('data/world_bank_projects.json')
json_nor = json_normalize(full_json_df, 'mjtheme_namecode')
json_nor.groupby('name')['code'].count().sort_values(ascending=False).head(10)
Output:
TypeError
Traceback (most recent call last)
<ipython-input-28-9401e8bf5427> in <module>()
1 # Find the top 10 major project themes (using column 'mjtheme_namecode')
2
----> 3 json_nor = json_normalize(full_json_df, 'mjtheme_namecode')
4 #json_nor.groupby('name')['code'].count().sort_values(ascending = False).head(10)
TypeError: string indices must be integers
According to the pandas documentation, the data argument of json_normalize must be:
data : dict or list of dicts Unserialized JSON objects
Above, pd.read_json returns a DataFrame, not dicts.
So you can try converting the dataframe to a dictionary using .to_dict(). There are various options for to_dict() as well.
Maybe something like below:
json_normalize(full_json_df.to_dict(), ......)
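To illustrate the point with something runnable, here is a sketch using a small invented record list in the same shape as the 'mjtheme_namecode' field (it assumes pandas >= 1.0, where json_normalize is importable from the top level). The key is that json_normalize receives parsed dicts - e.g. the result of json.load on the file - rather than the DataFrame that read_json returns:

```python
from pandas import json_normalize

# Invented records mirroring the shape of world_bank_projects.json.
data = [
    {"mjtheme_namecode": [{"code": "8", "name": "Human development"},
                          {"code": "11", "name": ""}]},
    {"mjtheme_namecode": [{"code": "8", "name": "Human development"}]},
]

# data is a list of dicts, so the record_path argument works as intended.
json_nor = json_normalize(data, 'mjtheme_namecode')
counts = json_nor.groupby('name')['code'].count().sort_values(ascending=False)
```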
