I am parsing two separate CSV files with the goal of finding matching customer IDs and dates so that I can manipulate the balance.
At some point in my for loop there should be a match, as I intentionally put duplicate IDs and dates in my CSV. However, when I parse the data and try to match it, the matches don't work even though the values are the same.
main.py:
transactions = pd.read_csv(INPUT_PATH, delimiter=',')
accounts = pd.DataFrame(
    columns=['customerID', 'MM/YYYY', 'minBalance', 'maxBalance', 'endingBalance'])
for index, row in transactions.iterrows():
    customer_id = row['customerID']
    date = formatter.convert_date(row['date'])
    minBalance = 0
    maxBalance = 0
    endingBalance = 0
    dict = {
        "customerID": customer_id,
        "MM/YYYY": date,
        "minBalance": minBalance,
        "maxBalance": maxBalance,
        "endingBalance": endingBalance
    }
    print(customer_id in accounts['customerID'] and date in accounts['MM/YYYY'])
    # Returns False
    if (accounts['customerID'].equals(customer_id)) and (accounts['MM/YYYY'].equals(date)):
        # This section never runs
        print("hello")
    else:
        print("world")
        accounts.loc[index] = dict
accounts.to_csv(OUTPUT_PATH, index=False)
Transactions CSV:
customerID,date,amount
1,12/21/2022,500
1,12/21/2022,-300
1,12/22/2022,100
1,01/01/2023,250
1,01/01/2022,300
1,01/01/2022,-500
2,12/21/2022,-200
2,12/21/2022,700
2,12/22/2022,200
2,01/01/2023,300
2,01/01/2023,400
2,01/01/2023,-700
Accounts CSV
customerID,MM/YYYY,minBalance,maxBalance,endingBalance
1,12/2022,0,0,0
1,12/2022,0,0,0
1,12/2022,0,0,0
1,01/2023,0,0,0
1,01/2022,0,0,0
1,01/2022,0,0,0
2,12/2022,0,0,0
2,12/2022,0,0,0
2,12/2022,0,0,0
2,01/2023,0,0,0
2,01/2023,0,0,0
2,01/2023,0,0,0
Expected Accounts CSV
customerID,MM/YYYY,minBalance,maxBalance,endingBalance
1,12/2022,0,0,0
1,01/2023,0,0,0
1,01/2022,0,0,0
2,12/2022,0,0,0
2,01/2023,0,0,0
Where does the problem come from
Your problem comes from the way you compare against a pandas Series. To keep it simple, when you write:
customer_id in accounts['customerID']
you are checking whether customer_id is one of the index labels of the Series accounts['customerID'], whereas you want to check its values.
In your if statement you use the pd.Series.equals method. Here is what the documentation says the method does:
This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.
So equals compares whole Series or DataFrames against each other, which is different from what you're trying to do.
One of many solutions
There are multiple ways to achieve what you're trying to do. The easiest is to get the values from the Series before doing the comparison:
customer_id in accounts['customerID'].values
Note that accounts['customerID'].values returns a NumPy array of the values of your Series.
So your comparison should be something like this:
print(customer_id in accounts['customerID'].values and date in accounts['MM/YYYY'].values)
And use the same thing in your if statement:
if (customer_id in accounts['customerID'].values and date in accounts['MM/YYYY'].values):
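One caveat worth checking: the two `in` tests above look at each column independently, so they can both be true even when the ID and the date sit on different rows. The following is a minimal sketch, using a made-up two-row accounts frame (the index labels 10 and 11 are chosen only to make the index check visible), that contrasts the index check, the values check, and a row-wise check built from a combined boolean mask.

```python
import pandas as pd

# Hypothetical toy data, not the asker's real accounts frame.
accounts = pd.DataFrame(
    {"customerID": [1, 2], "MM/YYYY": ["12/2022", "01/2023"]},
    index=[10, 11],
)
customer_id, date = 2, "12/2022"

# `in` on a Series tests the index labels (10, 11), not the values.
in_index = customer_id in accounts["customerID"]

# `.values` exposes the underlying array, so this tests the values.
in_values = customer_id in accounts["customerID"].values

# Both columns contain a hit, but on different rows.
independent = in_values and date in accounts["MM/YYYY"].values

# Row-wise check: both conditions must hold on the same row.
same_row = bool(
    ((accounts["customerID"] == customer_id) & (accounts["MM/YYYY"] == date)).any()
)

print(in_index, in_values, independent, same_row)
```

If the goal is one account row per (customerID, MM/YYYY) pair, the row-wise mask is probably the check you want.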
Alternative solutions
You can also use the pandas.Series.isin method. Given a list-like of values, it returns a boolean Series showing whether each element of the Series is contained in those values; you then only need to check whether that boolean Series contains at least one True value.
Documentation of isin : https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html
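As a small illustration of the isin approach, here is a sketch on a made-up Series of IDs. Note that isin expects a list-like, so a single value is wrapped in a list.

```python
import pandas as pd

# Hypothetical IDs for illustration only.
ids = pd.Series([1, 1, 2])

mask = ids.isin([2])             # boolean Series, one entry per element
found = bool(mask.any())         # is at least one element equal to 2?
missing = bool(ids.isin([3]).any())

print(mask.tolist(), found, missing)
```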
It is not clear from the question what the formatter.convert_date function does, but from the example CSVs you added it seems it should do something like:
def convert_date(mmddyy):
    (mm, dd, yy) = mmddyy.split('/')
    return mm + '/' + yy
In addition, make sure the data types are equal as well
(both date fields are strings, and likewise for the customer ID).
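To make the dtype point concrete, here is a sketch that reads the transactions from the question with both key columns forced to strings and then builds the expected unique (customerID, MM/YYYY) pairs. It swaps the row-by-row loop for drop_duplicates, which is a different technique from the one in the question, and the convert_date body is the guess from above rather than the asker's actual formatter.

```python
import io
import pandas as pd

csv_text = """customerID,date,amount
1,12/21/2022,500
1,12/21/2022,-300
1,12/22/2022,100
1,01/01/2023,250
1,01/01/2022,300
1,01/01/2022,-500
2,12/21/2022,-200
2,12/21/2022,700
2,12/22/2022,200
2,01/01/2023,300
2,01/01/2023,400
2,01/01/2023,-700
"""

# Read both key columns as strings so later comparisons use one dtype.
transactions = pd.read_csv(
    io.StringIO(csv_text), dtype={"customerID": str, "date": str}
)


def convert_date(mmddyyyy):
    # Assumed behaviour of formatter.convert_date: "MM/DD/YYYY" -> "MM/YYYY".
    mm, _dd, yyyy = mmddyyyy.split("/")
    return f"{mm}/{yyyy}"


transactions["MM/YYYY"] = transactions["date"].map(convert_date)

# Keep the first occurrence of each (customerID, MM/YYYY) pair.
accounts = (
    transactions[["customerID", "MM/YYYY"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(accounts)
```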
I have a date column in my DataFrame, say df_dob, and it looks like this:
id      DOB
23312   31-12-9999
1482    31-12-9999
807     #VALUE!
2201    06-12-1925
653     01/01/1855
108     01/01/1855
768     1967-02-20
What I want to print is a list of unique years, like ['9999', '1925', '1855', '1967'].
Basically, I want to use this list to check whether any unwanted year is present.
I have tried the code pasted below, but I get ValueError: time data 01/01/1855 doesn't match format specified, and I could not resolve it.
df_dob['DOB'] = df_dob['DOB'].replace('01/01/1855 00:00:00', '1855-01-01')
df_dob['DOB'] = pd.to_datetime(df_dob.DOB, format='%Y-%m-%d')
df_dob['DOB'] = df_dob['DOB'].dt.strftime('%Y-%m-%d')
print(np.unique(df_dob['DOB']))
# print(list(df_dob['DOB'].year.unique()))
P.S - when I print df_dob['DOB'], I get values like - 1967-02-20 00:00:00
Can you try this?
df_dob["DOB"] = pd.to_datetime(df_dob["DOB"])
df_dob['YOB'] = df_dob['DOB'].dt.strftime('%Y')
Use pandas' unique for this, and on the year only.
So try:
print(df_dob['DOB'].dt.year.unique())
Also, you don't need to stringify your dates, and you don't need to replace anything; pandas can usually infer the format for you. So your overall code becomes:
df_dob['DOB'] = pd.to_datetime(df_dob.DOB)  # No need to pass format unless there is some specific anomaly
print(df_dob['DOB'].dt.year.unique())
Edit:
Another method:
Since you have an out-of-bounds problem, you can skip the datetime conversion and instead find the first four-digit number in each value using a regex:
df_dob['DOB'].str.extract(r'(\d{4})')[0].unique()
The [0] is needed because unique() is a method of pd.Series, not of DataFrame, so we take the first column of the DataFrame that extract returns.
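Here is the regex approach as a runnable sketch on the sample rows from the question. The dropna() call is an addition of mine so that the #VALUE! row, which has no four-digit run, does not show up as NaN in the result.

```python
import pandas as pd

# Sample rows copied from the question.
df_dob = pd.DataFrame({
    "id": [23312, 1482, 807, 2201, 653, 108, 768],
    "DOB": ["31-12-9999", "31-12-9999", "#VALUE!", "06-12-1925",
            "01/01/1855", "01/01/1855", "1967-02-20"],
})

# extract returns a DataFrame with one column per capture group;
# column 0 holds the first run of four digits found in each value.
first_year = df_dob["DOB"].str.extract(r"(\d{4})")[0]

# Drop rows without a match, then keep unique years in order of appearance.
years = first_year.dropna().unique().tolist()
print(years)  # ['9999', '1925', '1855', '1967']
```

Keep in mind that the pattern matches any four consecutive digits, so a value with a longer digit run would yield its first four digits rather than a real year.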
The first thing you need to know is whether the resulting values (which you said look like 1967-02-20 00:00:00) are datetimes or not. That's as simple as df_dob.info().
If the result shows something like datetime64[ns] for the DOB column, you're good. If not, you'll need to cast it to a datetime. You have a couple of different formats, so that might be part of your problem. Because there are several ways of doing this and it's a separate question, I'm not addressing it here.
We're going to leverage the speed of sets, plus a bit of pandas, and then convert the result back to a list, since that is how you wanted the final version.
years = list({i for i in df_dob['DOB'].dt.year})
As a side note, you can't use [] instead of list() here, as you'll end up with a list whose single element is a set.
That gives you a list, as you indicated. If you want it as a column, you won't get unique values.
Nitish's answer will also work, but it gives you something like: array([9999, 1925, 1855, 1967])
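A small sketch of the set approach, assuming the column has already been converted to datetimes. The three dates here are made up and all fall inside the range pandas can represent, so the out-of-bounds issue does not come up.

```python
import pandas as pd

# Hypothetical, already-clean dates (all in one ISO format).
dob = pd.Series(pd.to_datetime(["1967-02-20", "1925-12-06", "1925-01-01"]))

# The set comprehension removes duplicate years; list() turns it back into a list.
years = list({y for y in dob.dt.year})

# Wrapping the set in [] instead gives a one-element list holding the set.
wrapped = [{y for y in dob.dt.year}]

print(sorted(years), len(wrapped))
```

Sets are unordered, so sort the result if the order of the years matters.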
The DataFrame dataset has two columns, 'Review' and 'Label', and the dtype of 'Label' is int.
I would like to change the numbers in the 'Label' column, so I tried to use replace(), but the values do not change as expected, as you can see in the picture below.
A simple and quick solution (besides replace) is the Series.map() method. Define a dictionary whose keys are the values you want to replace and whose values are the new values you wish to have. Then use an anonymous function (or a normal one) to replace your values:
d = {1: 0, 2: 0, 4: 1, 5: 1}
dataset['label'] = dataset['label'].map(lambda x: d[x])
This will replace 1 and 2 with 0, and 4 and 5 with 1.
I am not sure what your criterion for "well" is, as the replace method will work for you and achieve essentially the same result (and is more optimized than map for replacement purposes).
What might be causing the issue is that replace defaults to inplace=False, so each call returns a new Series and leaves the original unchanged; separate calls do not build on each other unless you store the result. Combine them into one call and assign it back, dataset['label'] = dataset['label'].replace([1, 2, 4, 5], [0, 0, 1, 1]), or pass inplace=True, dataset['label'].replace([1, 2, 4, 5], [0, 0, 1, 1], inplace=True).
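The effect of the inplace=False default can be seen in a short sketch on a made-up frame. The column name 'label' follows the answer above, and the values are invented.

```python
import pandas as pd

# Hypothetical data; 3 is included to show that unlisted values are left alone.
dataset = pd.DataFrame({"label": [1, 2, 3, 4, 5]})

# Without assignment the result is discarded and the column is unchanged.
dataset["label"].replace([1, 2, 4, 5], [0, 0, 1, 1])
unchanged = dataset["label"].tolist()

# Assigning the result back is what actually updates the column.
dataset["label"] = dataset["label"].replace([1, 2, 4, 5], [0, 0, 1, 1])
changed = dataset["label"].tolist()

print(unchanged, changed)
```

Unlike map with d[x], replace leaves values that are not listed (here 3) as they are instead of raising a KeyError.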
I'm building a booking form, and want to allow users to pick a date of booking from available dates in the next 60 days.
I get the next 60 days by:
base = datetime.date.today()
date_list = [base + datetime.timedelta(days=x) for x in range(60)]
Then I subtract already booked dates which are stored in the db:
bookings = list(Booking.objects.all())
primarykeys = []
unav = []
for b in bookings:
    primarykeys.append(b.pk)
for p in primarykeys:
    unav.append(Booking.objects.get(pk=p).booking_date)
for date in unav:
    if date in date_list:
        date_list.remove(date)
Then I change the result into a tuple for the form (not sure if this is right?):
date_list = tuple(date_list)
Then I pass it into the form field as such:
booking_date = forms.ChoiceField(choices=date_list, required=True)
This gives me the error: cannot unpack non-iterable datetime.date object
And now I am stumped... how can I do this? I have a feeling I'm on completely the wrong path.
Thanks in advance
The docs for Django form fields say the following:
choices
Either an iterable of 2-tuples to use as choices for this field, or a callable that returns such an iterable. This argument accepts the
same formats as the choices argument to a model field. See the model
field reference documentation on choices for more details. If the
argument is a callable, it is evaluated each time the field’s form is
initialized. Defaults to an empty list.
It looks like what you're passing is a tuple in this format:
(date object, date object, ...)
But you need to pass something like a list of 2-tuples, with the first element of each tuple being the value stored for each choice, and the second element being the value displayed to the user in the form:
[(date_object, date_string), (date_object, date_string), ...]
Change your code to the following and see if that works for you:
base = datetime.date.today()
date_set = set([base + datetime.timedelta(days=x) for x in range(60)])
booking_dates = set(Booking.objects.all().values_list('booking_date', flat=True))
valid_dates = date_set - booking_dates
date_choices = sorted([(valid_date, valid_date.strftime('%Y-%m-%d')) for valid_date in valid_dates],
                      key=lambda x: x[0])
I've used sets to make it simpler to ensure unique values and to subtract one from the other without multiple for loops. You can use values_list with flat=True to get all the existing booking dates, then create a list of 2-tuples, date_choices, with the actual date object as the value and a string representation, in whatever format you choose using strftime, as the display text.
The choices are then sorted by date, ascending, using sorted on the first element of each tuple, since sets do not preserve order.
Then take a look at this question to see how you can pass these choices into the form from your view, as I don't think it's good to try to set the choices dynamically when defining the Form class itself.
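The set-difference step can be tried without a database. In this sketch the base date is fixed and the booked set is a hard-coded stand-in for the values_list query, so both are assumptions made only to keep the example self-contained.

```python
import datetime

# Fixed base date so the result is reproducible; the answer uses date.today().
base = datetime.date(2024, 1, 1)
date_set = {base + datetime.timedelta(days=x) for x in range(60)}

# Hypothetical stand-in for the booking dates fetched from the database.
booked = {datetime.date(2024, 1, 2), datetime.date(2024, 1, 15)}

# Set difference removes the booked dates in one step.
valid_dates = date_set - booked

# Build (value, label) 2-tuples in ascending date order.
date_choices = [(d, d.strftime("%Y-%m-%d")) for d in sorted(valid_dates)]

print(len(date_choices), date_choices[:2])
```

Each entry is a 2-tuple, which is the shape the choices argument expects.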
I need to define a function that will perform several operations on a dataframe containing a DatetimeIndex. One of these operations is to slice the dataframe based on a period or date passed as one of the function arguments.
When using loc within a code, the slice objects accept different options. For instance:
df.loc['2004']
to slice all rows with dates in 2004
df.loc['2004-01':'2005-02']
to slice all rows with dates between Jan 2004 and Feb 2005
I would like to be able to use only one argument of the function to construct the slice object that goes inside loc[]. Something like:
df.loc[period]
Where period is the variable passed to the function as one of the arguments, and that can be defined in different formats to be correctly interpreted by the function.
I've tried:
Passing a string variable to loc, for instance with a value constructed as "\'2004\'"+':'+"\'2005\'", but it returns a KeyError "'2002':'2010'".
Converting a string to datetime objects using pd.to_datetime. But this results in "2004" converted to Timestamp('2004-01-01 00:00:00')
I've found this answer and this answer to be similar, but not specific to what I need.
I could use two arguments in the function to solve this (something like start_date, end_date) but was wondering if there is any way to achieve it with only one.
The slice built-in should work for this:
# equivalent to df.loc['2004':]
period = slice('2004', None)
df.loc[period]
# equivalent to df.loc['2004-01':'2005-02']
period = slice('2004-01', '2005-02')
df.loc[period]
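Putting this together, here is a sketch of a function that takes a single period argument, which may be either a plain string or a slice object. The function name select_period and the monthly sample frame are hypothetical, made up for illustration.

```python
import pandas as pd

# Hypothetical frame with one row per month start from 2003 through 2006.
idx = pd.date_range("2003-01-01", "2006-12-31", freq="MS")
df = pd.DataFrame({"value": range(len(idx))}, index=idx)


def select_period(frame, period):
    # period can be a partial date string such as "2004"
    # or a slice such as slice("2004-01", "2005-02").
    return frame.loc[period]


year_2004 = select_period(df, "2004")
jan04_feb05 = select_period(df, slice("2004-01", "2005-02"))
from_2005 = select_period(df, slice("2005", None))

print(len(year_2004), len(jan04_feb05), len(from_2005))
```

Because .loc accepts both forms directly, the function needs no branching on the argument type.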