Cannot match two values in two different csvs - python

I am parsing two separate CSV files with the goal of finding matching customerIDs and dates so I can manipulate the balance.
In my for loop there should be a match at some point, since I intentionally put duplicate IDs and dates in my CSVs. However, when parsing and attempting to match the data, the matches never succeed even though the values are the same.
main.py:
transactions = pd.read_csv(INPUT_PATH, delimiter=',')
accounts = pd.DataFrame(
    columns=['customerID', 'MM/YYYY', 'minBalance', 'maxBalance', 'endingBalance'])

for index, row in transactions.iterrows():
    customer_id = row['customerID']
    date = formatter.convert_date(row['date'])
    minBalance = 0
    maxBalance = 0
    endingBalance = 0
    dict = {
        "customerID": customer_id,
        "MM/YYYY": date,
        "minBalance": minBalance,
        "maxBalance": maxBalance,
        "endingBalance": endingBalance
    }
    print(customer_id in accounts['customerID'] and date in accounts['MM/YYYY'])
    # Returns False
    if (accounts['customerID'].equals(customer_id)) and (accounts['MM/YYYY'].equals(date)):
        # This section never runs
        print("hello")
    else:
        print("world")
        accounts.loc[index] = dict

accounts.to_csv(OUTPUT_PATH, index=False)
Transactions CSV:
customerID,date,amount
1,12/21/2022,500
1,12/21/2022,-300
1,12/22/2022,100
1,01/01/2023,250
1,01/01/2022,300
1,01/01/2022,-500
2,12/21/2022,-200
2,12/21/2022,700
2,12/22/2022,200
2,01/01/2023,300
2,01/01/2023,400
2,01/01/2023,-700
Accounts CSV (actual output):
customerID,MM/YYYY,minBalance,maxBalance,endingBalance
1,12/2022,0,0,0
1,12/2022,0,0,0
1,12/2022,0,0,0
1,01/2023,0,0,0
1,01/2022,0,0,0
1,01/2022,0,0,0
2,12/2022,0,0,0
2,12/2022,0,0,0
2,12/2022,0,0,0
2,01/2023,0,0,0
2,01/2023,0,0,0
2,01/2023,0,0,0
Expected Accounts CSV:
customerID,MM/YYYY,minBalance,maxBalance,endingBalance
1,12/2022,0,0,0
1,01/2023,0,0,0
1,01/2022,0,0,0
2,12/2022,0,0,0
2,01/2023,0,0,0

Where does the problem come from
Your problem comes from the comparisons you're doing with pandas Series. Put simply, when you do:
customer_id in accounts['customerID']
you're checking whether customer_id is in the index of the Series accounts['customerID'], whereas you want to check its values.
In your if statement, you're using the pd.Series.equals method. Here is what the documentation says the method does:
This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.
So equals compares whole DataFrames or Series against each other, which is different from what you're trying to do.
One of many solutions
There are multiple ways to achieve what you're trying to do; the easiest is simply to get the values from the Series before doing the comparison:
customer_id in accounts['customerID'].values
Note that accounts['customerID'].values returns a NumPy array of the values of your Series.
So your comparison should be something like this:
print(customer_id in accounts['customerID'].values and date in accounts['MM/YYYY'].values)
And use the same thing in your if statement:
if (customer_id in accounts['customerID'].values and date in accounts['MM/YYYY'].values):
Alternative solutions
You can also use the pandas.Series.isin method, which, given an input, returns a boolean Series showing whether each element in the Series matches it; you then just need to check whether that boolean Series contains at least one True value.
Documentation of isin : https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html
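For illustration, a minimal sketch of that isin-based check, reusing the variable names from the question (combining the two conditions row-wise with & also avoids matching the customer in one row and the date in a different row):
id_match = accounts['customerID'].isin([customer_id])   # boolean Series, True where the ID matches
date_match = accounts['MM/YYYY'].isin([date])            # boolean Series, True where the month matches

if (id_match & date_match).any():
    # at least one existing row has both the same customer and the same month
    print("hello")
else:
    print("world")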

It is not clear from the information given what the formatter.convert_date function does, but from the example CSVs you added it seems like it should do something like:
def convert_date(mmddyy):
    (mm, dd, yy) = mmddyy.split('/')
    return mm + '/' + yy
In addition, make sure that the data types are also equal
(both date fields should be strings, and the same goes for customerID).
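As a rough sketch of that kind of type alignment (assuming you want everything compared as strings; adjust to your real data):
# Option 1: read customerID as a string up front
transactions = pd.read_csv(INPUT_PATH, delimiter=',', dtype={'customerID': str})

# Option 2: cast both sides just before comparing
customer_id = str(row['customerID'])
date = formatter.convert_date(str(row['date']))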

Related

Get unique years from a date column in pandas DataFrame

I have a date column in my DataFrame, say df_dob, and it looks like this:
id       DOB
23312    31-12-9999
1482     31-12-9999
807      #VALUE!
2201     06-12-1925
653      01/01/1855
108      01/01/1855
768      1967-02-20
What I want to print is a list of unique years like ['9999', '1925', '1855', '1967'].
Basically, through this list I just want to check whether some unwanted year is present or not.
I have tried (code pasted below) but I get ValueError: time data 01/01/1855 doesn't match format specified and could not resolve it.
df_dob['DOB'] = df_dob['DOB'].replace('01/01/1855 00:00:00', '1855-01-01')
df_dob['DOB'] = pd.to_datetime(df_dob.DOB, format='%Y-%m-%d')
df_dob['DOB'] = df_dob['DOB'].dt.strftime('%Y-%m-%d')
print(np.unique(df_dob['DOB']))
# print(list(df_dob['DOB'].year.unique()))
P.S. - when I print df_dob['DOB'], I get values like 1967-02-20 00:00:00
Can you try this?
df_dob["DOB"] = pd.to_datetime(df_DOB["Date"])
df_dob['YOB'] = df_dob['DOB'].dt.strftime('%Y')
Use pandas' unique for this, and on the year only. So try:
print(df_dob['DOB'].dt.year.unique())
Also, you don't need to stringify your time. Likewise, you don't need to replace anything; pandas is smart enough to do it for you. So your overall code becomes:
df_dob['DOB'] = pd.to_datetime(df_dob.DOB)  # No need to pass a format if there isn't some specific anomaly
print(df_dob['DOB'].dt.year.unique())
Edit:
Another method:
Since you have an out-of-bounds problem (years like 9999 cannot be represented as pandas datetimes),
another method you can try is to not convert the values to datetime at all, but rather find the four-digit year in each value using a regex.
So,
df_dob['DOB'].str.extract(r'(\d{4})')[0].unique()
The [0] is needed because unique() is a method of pd.Series, not DataFrame, so we take the first (and only) column of the DataFrame that str.extract returns.
The first thing you need to know is whether the resulting values (which you said look like 1967-02-20 00:00:00) are datetimes or not. That's as simple as df_dob.info().
If the result shows something like datetime64[ns] for the DOB column, you're good. If not, you'll need to cast the column to a datetime. You have a couple of different formats, so that might be part of your problem; and because there are several ways of doing the cast and it's a separate question, I'm not addressing it here.
We're going to leverage the speed of sets, plus a bit of pandas, and then convert the result back to a list, since you wanted the final version to be one.
years = list({i for i in df_dob['DOB'].dt.year})
And just a side note: you can't use [...] instead of list(...), as you'd end up with a list whose single element is a set.
That's a list, as you indicated you wanted. If you want it as a column instead, you won't get unique values.
Nitish's answer will also work but gives you something like: array([9999, 1925, 1855, 1967])

Get the index of a value passed to map() in pandas

I have a DataFrame that's read in from a csv. The data has various problems. The one I'm concerned about for this post is that some data is not in the column it should be. For example, '900' is in the zipcode column, or 'RQ' is in the language column when it should be in the nationality column. In some cases, these "misinsertions" are just anomalies and can be converted to NaN. In other cases they indicate that the values have shifted one column to the right or the left, such that the whole row has misinserted data. I want to remove these shifted lines from the DataFrame and try to fix them separately. My proposed solution has been to keep track of the number of bad values in each row as I clean each column. Here is an example with the zipcode column:
def is_zipcode(value: str, regx):
    if regx.match(value):
        return value
    else:
        return nan

regx = re.compile("^[0-9]{5}(?:-[0-9]{4})?$")
df['ZIPCODE'] = df['ZIPCODE'].map(lambda x: is_zipcode(x, regx), na_action='ignore')
I'm doing something like this on every column in the dataframe, depending on the data in that column; e.g. for the 'Nationality' column I'll look up the values in a json file of nationality codes.
What I haven't been able to achieve is to keep count of the bad values in a row. I tried something like this:
def is_zipcode(value: str, regx):
    if regx.match(value):
        return 0
    else:
        return 1

regx = re.compile("^[0-9]{5}(?:-[0-9]{4})?$")
df['badValues'] = df['ZIPCODE'].map(lambda x: is_zipcode(x, regx), na_action='ignore')
df['badValues'] = df['badValues'] + df['Nationalities'].map(is_nationality, na_action='ignore')  # where is_nationality() similarly returns 1 if it is a bad value
And this can work to keep track of the bad values. What I'd like to do is somehow combine the process of cleaning the data and getting the bad values. I'd love to do something like this:
def is_zipcode(value: str, regx):
    if regx.match(value):
        # add 1 to the value of df['badValues'] at the corresponding index
        return value
    else:
        return nan
The problem is that I don't think it's possible to access the index of the value being passed to the map function. I looked at these two questions (one, two) but I didn't see a solution to my issue.
I guess this would do what you want ...
is_zipcode_mask = df['ZIPCODE'].str.match(regex_for_zipcode)
print(len(df[is_zipcode_mask]))
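Building on that idea, here is a rough sketch (not the answerer's exact code) of how the cleaning and the per-row bad-value count could be combined by computing each validity mask once; regex_for_zipcode is the pattern above, and valid_nationality_codes stands in for whatever is loaded from the question's JSON file:
import numpy as np

# Boolean masks: True where the value looks valid
zip_ok = df['ZIPCODE'].str.match(regex_for_zipcode, na=False)
nat_ok = df['Nationalities'].isin(valid_nationality_codes)

# Count the bad values per row by inverting the masks and summing
df['badValues'] = (~zip_ok).astype(int) + (~nat_ok).astype(int)

# Clean the same columns using the same masks
df['ZIPCODE'] = df['ZIPCODE'].where(zip_ok, np.nan)
df['Nationalities'] = df['Nationalities'].where(nat_ok, np.nan)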

Try and Except - Understanding how it is used in finding data specific to dates

I have been trying to understand a piece of code that uses try and except to filter data based on specific/required dates:
required_date = '2021-02-11'
index_for_date = (data_dict['date'] == required_date)
data_filtered_by_date = {}
for key in data_dict.keys():
    try:
        data_filtered_by_date[key] = np.float_(data_dict[key][index_for_date])
    except:
        data_filtered_by_date[key] = data_dict[key][index_for_date]
I do not understand why the try and except would be used, or how the whole code functions. I have researched specifics such as np.float_ and why two sets of brackets are used (e.g. [key][index_for_date] - why are they together?). Hopefully I can get further clarification on this code, as I am very new to Python and my own research hasn't turned up an answer.
Let's start with an explanation of your code. data_dict is a dictionary of data columns, one of which is the 'date' column on which you want to filter. index_for_date = (data_dict['date'] == required_date) constructs a Boolean index over that column (an array which is all False except where it matches the desired date).
You then loop over the columns, which are the values data_dict[key]. For each column you select the entries matching the date using the Boolean index: data_dict[key][index_for_date]. That is why there are two sets of square brackets: the first is dict indexing, the second is Boolean array indexing.
Then, in the try clause, you try to cast the values to floats using np.float_. If this fails (it throws an exception), you fall back to the 'raw', uncast values.
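A tiny self-contained example of the same pattern, with made-up data, may make the mechanics clearer (np.float_ is just an alias for np.float64):
import numpy as np

data_dict = {
    'date':  np.array(['2021-02-10', '2021-02-11', '2021-02-11']),
    'price': np.array(['1.5', '2.5', '3.0']),      # numeric strings: the float cast succeeds
    'city':  np.array(['Oslo', 'Paris', 'Lyon']),  # non-numeric: the cast raises, raw values are kept
}

required_date = '2021-02-11'
index_for_date = (data_dict['date'] == required_date)   # array([False, True, True])

data_filtered_by_date = {}
for key in data_dict.keys():
    try:
        data_filtered_by_date[key] = np.float_(data_dict[key][index_for_date])
    except ValueError:
        data_filtered_by_date[key] = data_dict[key][index_for_date]

# 'price' becomes array([2.5, 3.0]); 'city' stays array(['Paris', 'Lyon'])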

Pandas formatting column within DataFrame and adding timedelta Index error

I'm trying to use pandas to do some analysis on some messaging data and am running into a few problems trying to prep the data. It is coming from a database I don't have control of, and therefore I need to do a little pruning and formatting before analyzing it.
Here is where I'm at so far:
#select all the messages in the database. Be careful if you get the whole test data base, may have 5000000 messages.
full_set_data = pd.read_sql("Select * from message",con=engine)
After I make this change to the timestamp and set it as the index, I'm no longer able to call to_csv.
#convert timestamp to a timedelta and set as index
#full_set_data[['timestamp']] = full_set_data[['timestamp']].astype(np.timedelta64)
indexed = full_set_data.set_index('timestamp')
indexed.to_csv('indexed.csv')
#extract the data columns I really care about since there are a bunch I don't need
datacolumns = indexed[['address','subaddress','rx_or_tx', 'wordcount'] + [col for col in indexed.columns if ('DATA' in col)]]
Here, when I format the DATA columns, I get a "SettingWithCopyWarning".
#now need to format the DATA columns to something useful by removing the upper 4 bytes
for col in datacolumns.columns:
    if 'DATA' in col:
        datacolumns[col] = datacolumns[col].apply(lambda x: int(x, 16) & 0x0000ffff)
datacolumns.to_csv('data_col.csv')
#now group the data by "interaction key"
groups = datacolumns.groupby(['address','subaddress','rx_or_tx'])
I need to figure out how to get all the messages from a given group. get_group() requires I know key values ahead of time.
key_group = groups.get_group((1,1,1))
#foreach group in groups:
#do analysis
I have tried everything I could think of to fix the problems I'm running into, but I can't seem to get around them. I'm sure it's from me misunderstanding/misusing pandas, as I'm still figuring it out.
I'm looking to solve these issues:
1) Can't save to csv after I add the timestamp index as timedelta64.
2) How do I apply a function to a set of columns without getting the SettingWithCopyWarning when reformatting the DATA columns?
3) How do I grab the rows for each group without having to use get_group(), since I don't know the keys ahead of time?
Thanks for any insight and help so I can better understand how to properly use Pandas.
Firstly, you can set the index column(s) and parse dates while querying the DB:
indexed = pd.read_sql_query("Select * from message", con=engine,
                            parse_dates='timestamp', index_col='timestamp')
Note I've used pd.read_sql_query here rather than pd.read_sql, which is deprecated, I think.
The SettingWithCopy warning is due to the fact that datacolumns is a view of indexed, i.e. a subset of its rows/columns, not an object in its own right. Check out this part of the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
One way to get around this is to define
datacolumns = indexed[<cols>].copy()
Another would be to do
indexed = indexed[<cols>]
which effectively removes the columns you don't want, if you're happy that you won't need them again. You can then manipulate indexed at your leisure.
As for the groupby, you could introduce a column of tuples which would be the group keys:
indexed['interaction_key'] = list(zip(indexed['address'], indexed['subaddress'], indexed['rx_or_tx']))
indexed.groupby('interaction_key').apply(
    lambda df: some_function(df.interaction_key, ...))
I'm not sure if it's all exactly what you want but let me know and I can edit.
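For the third question (getting the rows of each group without knowing the keys ahead of time), note that a GroupBy object is itself iterable, so a loop along these lines should work:
groups = datacolumns.groupby(['address', 'subaddress', 'rx_or_tx'])
for key, group in groups:
    # key is the (address, subaddress, rx_or_tx) tuple; group is the matching sub-DataFrame
    print(key, len(group))
    # ...do your per-group analysis on `group` here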

Replacing multiple values within a pandas dataframe cell - python

My problem: I have a pandas dataframe, and one column in particular that I need to process contains values separated by ":". In some cases, some of those values between ":" can be of the form value=value, and these can appear at the start/middle/end of the string. The length of the string can differ from cell to cell as we iterate through the rows, e.g.
clickstream['events']
1:3:5:7=23
23=1:5:1:5:3
9:0:8:6=5:65:3:44:56
1:3:5:4
I have a file which contains the lookup values for these numbers, e.g.
event_no,description,event
1,xxxxxx,login
3,ffffff,logout
5,eeeeee,button_click
7,tttttt,interaction
23,ferfef,click1
output required:
clickstream['events']
login:logout:button_click:interaction=23
click1=1:button_click:login:button_click:logout
Is there a pythonic way of looking up these individual values and replacing them with the event column corresponding to the event_no row, as shown in the output? I have hundreds of events and am trying to work out a smart way of doing this. pd.merge would have done the trick if I had a single value, but I'm struggling to work out how I can work across the values and ignore the "=value" part of the string.
Edit, to ignore missing keys in the dict:
import pandas as pd
EventsDict = {1:'1:3:5:7',2:'23:45:1:5:3',39:'0:8:46:65:3:44:56',4:'1:3:5:4'}
clickstream = pd.Series(EventsDict)
#Keep this as a dictionary
EventsLookup = {1:'login',3:'logout',5:'button_click',7:'interaction'}
def EventLookup(x):
    list1 = [EventsLookup.get(int(item), 'Missing') for item in x.split(':')]
    return ":".join(list1)
clickstream.apply(EventLookup)
Since you are using a full DF and not just a series, use:
clickstream['events'].apply(EventLookup)
Output:
1 login:logout:button_click:interaction
2 Missing:Missing:login:button_click:logout
4 login:logout:button_click:Missing
39 Missing:Missing:Missing:Missing:logout:Missing...
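The question's strings also contain tokens like 7=23 or 23=1 where, judging by the expected output, only the part before the = should be translated. A rough sketch of the same lookup extended to handle that (with EventsLookup built from the full event file) might look like:
def EventLookup(x):
    parts = []
    for item in x.split(':'):
        if '=' in item:
            # translate only the left-hand side, keep "=value" as-is
            key, value = item.split('=', 1)
            parts.append(EventsLookup.get(int(key), 'Missing') + '=' + value)
        else:
            parts.append(EventsLookup.get(int(item), 'Missing'))
    return ":".join(parts)

clickstream['events'] = clickstream['events'].apply(EventLookup)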
