Pyspark Dataframe: Check if values in date columns are valid

Pyspark Dataframe: Check if values in date columns are valid - python

I have a spark DataFrame imported from a CSV file.
After applying some manipulations (mainly deleting columns/rows), I try to save the new DataFrame to Hadoop which brings up an error message:
ValueError: year out of range
I suspect that some columns of type DateType or TimestampType are corrupted. At least in one column I found an entry with a year '207' - this seems to create issues.
**How can I check if the DataFrame adheres to the required time ranges?
I thought about writing a function that takes the DataFrame and gets for each DateType / TimestampType-Column the minimum and the maximum of values, but I cannot get this to work.**
Any ideas?
PS: In my understanding, spark would always check and enforce the schema. Would this not include a check for minimum/maximum values?

For validating the date, regular Expressions can help.
for example: to validate a date with date format MM-dd-yyyy
step1: make a regular expression for your date format. for MM-dd-yyyy it will be ^(0[1-9]|[12][0-9]|3[01])[- \/.](0[1-9]|1[012])[- \/.](19|20)\d\d$
You can use this code for implementation.
This step will help finding invalid dates which wont parse and cause error.
step2: convert the string to date.
the following code can help
import scala.util.{Try, Failure}
import org.apache.spark.sql.functions.udf
object FormatChecker extends java.io.Serializable {
val fmt = org.joda.time.format.DateTimeFormat forPattern "MM-dd-yyyy"
def invalidFormat(s: String) = Try(fmt parseDateTime s) match {
case Failure(_) => true
case _ => false
}
}
val df = sc.parallelize(Seq(
"01-02-2015", "99-03-2010", "---", "2015-01-01", "03-30-2001")
).toDF("date")
invalidFormat = udf((s: String) => FormatChecker.invalidFormat(s))
df.where(invalidFormat($"date")).count()

Related

Cannot match two values in two different csvs

I am parsing through two separate csv files with the goal of finding matching customerID's and dates to manipulate balance.
In my for loop, at some point there should be a match as I intentionally put duplicate ID's and dates in my csv. However, when parsing and attempting to match data, the matches aren't working properly even though the values are the same.
main.py:
transactions = pd.read_csv(INPUT_PATH, delimiter=',')
accounts = pd.DataFrame(
columns=['customerID', 'MM/YYYY', 'minBalance', 'maxBalance', 'endingBalance'])
for index, row in transactions.iterrows():
customer_id = row['customerID']
date = formatter.convert_date(row['date'])
minBalance = 0
maxBalance = 0
endingBalance = 0
dict = {
"customerID": customer_id,
"MM/YYYY": date,
"minBalance": minBalance,
"maxBalance": maxBalance,
"endingBalance": endingBalance
}
print(customer_id in accounts['customerID'] and date in accounts['MM/YYYY'])
# Returns False
if (accounts['customerID'].equals(customer_id)) and (accounts['MM/YYYY'].equals(date)):
# This section never runs
print("hello")
else:
print("world")
accounts.loc[index] = dict
accounts.to_csv(OUTPUT_PATH, index=False)
Transactions CSV:
customerID,date,amount
1,12/21/2022,500
1,12/21/2022,-300
1,12/22/2022,100
1,01/01/2023,250
1,01/01/2022,300
1,01/01/2022,-500
2,12/21/2022,-200
2,12/21/2022,700
2,12/22/2022,200
2,01/01/2023,300
2,01/01/2023,400
2,01/01/2023,-700
Accounts CSV
customerID,MM/YYYY,minBalance,maxBalance,endingBalance
1,12/2022,0,0,0
1,12/2022,0,0,0
1,12/2022,0,0,0
1,01/2023,0,0,0
1,01/2022,0,0,0
1,01/2022,0,0,0
2,12/2022,0,0,0
2,12/2022,0,0,0
2,12/2022,0,0,0
2,01/2023,0,0,0
2,01/2023,0,0,0
2,01/2023,0,0,0
Expected Accounts CSV
customerID,MM/YYYY,minBalance,maxBalance,endingBalance
1,12/2022,0,0,0
1,01/2023,0,0,0
1,01/2022,0,0,0
2,12/2022,0,0,0
2,01/2023,0,0,0

Where does the problem come from
Your Problem comes from the comparison you're doing with pandas Series, to make it simple, when you do :
customer_id in accounts['customerID']
You're checking if customer_id is an index of the Series accounts['customerID'], however, you want to check the value of the Series.
And in your if statement, you're using the pd.Series.equals method. Here is an explanation of what does the method do from the documentation
This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.
So equals is used to compare between DataFrames and Series, which is different from what you're trying to do.
One of many solutions
There are multiple ways to achieve what you're trying to do, the easiest is simply to get the values from the series before doing the comparison :
customer_id in accounts['customerID'].values
Note that accounts['customerID'].values returns a NumPy array of the values of your Series.
So your comparison should be something like this :
print(customer_id in accounts['customerID'].values and date in accounts['MM/YYYY'].values)
And use the same thing in your if statement :
if (customer_id in accounts['customerID'].values and date in accounts['MM/YYYY'].values):
Alternative solutions
You can also use the pandas.Series.isin function that given an element as input return a boolean Series showing whether each element in the Series matches the given input, then you will just need to check if the boolean Series contain one True value.
Documentation of isin : https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html

It is not clear from the information what does formatter.convert_date function does. but from the example CSVs you added it seems like it should do something like:
def convert_date(mmddyy):
(mm,dd,yy) = mmddyy.split('/')
return mm + '/' + yy
in addition, make sure that data types are also equal
(both date fields are strings and also for customer id)

Delete categorical data from a column and leave the numerical data only ? I want to delete the word "SUBM-" from "SUBM-1245" IN PHYTON

I have a column in phyton which data type is object but I want to change it to integer.
The records on that column show :
SUBM - 4562
SUBM - 4563
and all the information in that column is like that. I want to delete the SUBM - word from the records and apply a similar function like excel "replace with" and I will add 0 to leave that empty with the numerical data only. Can anyone suggest a way to do that ?

If you are working with a column in python, so I assume you are using pandas to parse your table. In this case, you can simply use
df["mycolumn"] = df["mycolumn"].str.replace("SUBM-","")
However, you still have a column of type "object" then. A save way to convert it to numeric is this, where you basically throw away everything that can't be converted to a numeric:
df["mycolumn"] = pd.to_numeric(df["mycolumn"], errors="coerce", downcast="integer")
If you specifically need integer values (float not acceptible for you in case of NaN) you can afterwards fill empty cells with 0 and convert the column to integer:
df["mycolumn"] = df["mycolumn"].fillna(0).map(int) # if you specifically need integers
Alternative is to extract all numeric values using regular expressions. This would automatically return NaN if the expressions do not match (i.e. also when "SUBM-" is not present in your cell)
df["mycolumn"] = df["mycolumn"].str.extract("SUBM-([0-9]*)")

How do I turn string into datetime values and merge

I have a string of yearly data month 1-12, trying to convert it to datetime.month values and then converge it on the main df that already has dt.month values according to some date
usage_12month["MONTH"]= pd.to_datetime(usage_12month["MONTH"])
usage_12month['MONTH'] = usage_12month['MONTH'].dt.month
display(usage_12month)
merge = pd.merge(df,usage_12month, how='left',on='MONTH')
ValueError: Given date string not likely a datetime.
get the error on the 1st line

.dt.month on a datetime returns an int. So I'm assuming you want to convert usage_12month["MONTH"] from a string to an int to be able to merge it with the other df.
There is a simplier way than converting it to a datetime. You could replace the first two lines by usage_12month["MONTH"]= pd.to_numeric(usage_12month["MONTH"]) and it should work.
--
The error you get on the first line is because you don't specify to the to_datetime function how to interpet the string as a datetime (the number in the string could represent a day, an hour...).
To make your way work you have to give a 'format' parameter to the to_datetime function. In your case, your string contains only the month number, so the format string would be '%m' (see https://strftime.org/) : usage_12month["MONTH"]= pd.to_datetime(usage_12month["MONTH"], format = '%m')
When you're supplying the function with a "usual" date fromat like 'yyyy/mm/dd' it guesses how to interpret it, but it is alway better to provide a format to the function.

Manipulate string in python (replace string with part of the string itself)

So I am trying to transform the data I have into the form I can work with. I have this column called "season/ teams" that looks smth like "1989-90 Bos"
I would like to transform it into a string like "1990" in python using pandas dataframe. I read some tutorials about pd.replace() but can't seem to find a use for my scenario. How can I solve this? thanks for the help.
FYI, I have 16k lines of data.
A snapshot of the data I am working with:

To change that field from "1989-90 BOS" to "1990" you could do the following:
df['Yr/Team'] = df['Yr/Team'].str[:2] + df['Yr/Team'].str[5:7]
If the structure of your data will always be the same, this is an easy way to do it.

If the data in your Yr/Team column has a standard format you can extract the values you need based on their position.
import pandas as pd
df = pd.DataFrame({'Yr/Team': ['1990-91 team'], 'data': [1]})
df['year'] = df['Yr/Team'].str[0:2] + df['Yr/Team'].str[5:7]
print(df)
Yr/Team data year
0 1990-91 team 1 1991

You can use pd.Series.str.extract to extract a pattern from a column of string. For example, if you want to extract the first year, second year and team in three different columns, you can use this:
df["year"].str.extract(r"(?P<start_year>\d+)-(?P<end_year>\d+) (?P<team>\w+)")
Note the use of named parameters to automatically name the columns
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html

Pandas formatting column within DataFrame and adding timedelta Index error

I'm trying to use panda to do some analysis on some messaging data and am running into a few problems try to prep the data. It is coming from a database I don't have control of and therefore I need to do a little pruning and formatting before analyzing it.
Here is where I'm at so far:
#select all the messages in the database. Be careful if you get the whole test data base, may have 5000000 messages.
full_set_data = pd.read_sql("Select * from message",con=engine)
After I make this change to the timestamp, and set it as index, I'm no longer and to call to_csv.
#convert timestamp to a timedelta and set as index
#full_set_data[['timestamp']] = full_set_data[['timestamp']].astype(np.timedelta64)
indexed = full_set_data.set_index('timestamp')
indexed.to_csv('indexed.csv')
#extract the data columns I really care about since there as a bunch I don't need
datacolumns = indexed[['address','subaddress','rx_or_tx', 'wordcount'] + [col for col in indexed.columns if ('DATA' in col)]]
Here I need to format the DATA columns, I get a "SettingWithCopyWarning".
#now need to format the DATA columns to something useful by removing the upper 4 bytes
for col in datacolumns.columns:
if 'DATA' in col:
datacolumns[col] = datacolumns[col].apply(lambda x : int(x,16) & 0x0000ffff)
datacolumns.to_csv('data_col.csv')
#now group the data by "interaction key"
groups = datacolumns.groupby(['address','subaddress','rx_or_tx'])
I need to figure out how to get all the messages from a given group. get_group() requires I know key values ahead of time.
key_group = groups.get_group((1,1,1))
#foreach group in groups:
#do analysis
I have tried everything I could think of to fix the problems I'm running into but I cant seem to get around it. I'm sure it's from me misunderstanding/misusing Pandas as I'm still figuring it out.
I looking to solve these issues:
1) Can't save to csv after I add index of timestamp as timedelta64
2) How do I apply a function to a set of columns to remove SettingWithCopyWarning when reformatting DATA columns.
3) How to grab the rows for each group without having to use get_group() since I don't know the keys ahead of time.
Thanks for any insight and help so I can better understand how to properly use Pandas.

Firstly, you can set the index column(s) and parse dates while querying the DB:
indexed = pd.read_sql_query("Select * from message", engine=engine,
parse_dates='timestamp', index_col='timestamp')
Note I've used pd.read_sql_query here rather than pd.read_sql, which is deprecated, I think.
SettingWithCopy warning is due to the fact that datacolumns is a view of indexed, i.e. a subset of it's rows /columns, not an object in it's own right. Check out this part of the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
One way to get around this is to define
datacolumns = indexed[<cols>].copy()
Another would to do
indexed = indexed[<cols>]
which effectively removes the columns you don't want, if you're happy that you won't need them again. You can then manipulate indexed at your leisure.
As for the groupby, you could introduce a columns of tuples which would be the group keys:
indexed['interaction_key'] = zip(indexed[['address','subaddress','rx_or_tx']]
indexed.groupby('interaction_key').apply(
lambda df: some_function(df.interaction_key, ...)
I'm not sure if it's all exactly what you want but let me know and I can edit.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pyspark Dataframe: Check if values in date columns are valid - python

Related

Cannot match two values in two different csvs

Delete categorical data from a column and leave the numerical data only ? I want to delete the word "SUBM-" from "SUBM-1245" IN PHYTON

How do I turn string into datetime values and merge

Manipulate string in python (replace string with part of the string itself)

Pandas formatting column within DataFrame and adding timedelta Index error

Categories

Resources