Replacing multiple values within a pandas dataframe cell - python

My problem: I have a pandas dataframe, and the column I need to process contains values separated by ":". In some cases one of those values can itself be of the form value=value, and that can appear at the start, middle, or end of the string. The length of the string can differ from cell to cell as we iterate through the rows, e.g.
clickstream['events']
1:3:5:7=23
23=1:5:1:5:3
9:0:8:6=5:65:3:44:56
1:3:5:4
I have a file which contains the lookup values for these numbers, e.g.
event_no,description,event
1,xxxxxx,login
3,ffffff,logout
5,eeeeee,button_click
7,tttttt,interaction
23,ferfef,click1
output required:
clickstream['events']
login:logout:button_click:interaction=23
click1=1:button_click:login:button_click:logout
Is there a pythonic way of looking up these individual values and replacing them with the event column corresponding to the event_no, as shown in the output? I have hundreds of events and am trying to work out a smart way of doing this. pd.merge would have done the trick if I had a single value, but I'm struggling to work out how to work across the values and ignore the "=value" part of the string.

Edit, to ignore missing keys in the dict:
import pandas as pd

EventsDict = {1: '1:3:5:7', 2: '23:45:1:5:3', 39: '0:8:46:65:3:44:56', 4: '1:3:5:4'}
clickstream = pd.Series(EventsDict)

# Keep this as a dictionary
EventsLookup = {1: 'login', 3: 'logout', 5: 'button_click', 7: 'interaction'}

def EventLookup(x):
    list1 = [EventsLookup.get(int(item), 'Missing') for item in x.split(':')]
    return ":".join(list1)

clickstream.apply(EventLookup)
Since you are using a full DF and not just a series, use:
clickstream['events'].apply(EventLookup)
Output:
1 login:logout:button_click:interaction
2 Missing:Missing:login:button_click:logout
4 login:logout:button_click:Missing
39 Missing:Missing:Missing:Missing:logout:Missing...
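The question also mentions tokens of the form value=value (e.g. 7=23 or 23=1), which the lookup above does not handle. Below is a minimal sketch of one way to extend it; the rule of translating only the left-hand key of an "=" token and leaving the right-hand value untouched is my reading of the required output, not something stated in the original answer, and unknown numbers are kept as-is rather than replaced with 'Missing'.

import pandas as pd

# Lookup table from the question's file (event_no -> event)
EventsLookup = {1: 'login', 3: 'logout', 5: 'button_click', 7: 'interaction', 23: 'click1'}

def lookup_number(num):
    # Unknown numbers fall back to the original string instead of 'Missing'
    return EventsLookup.get(int(num), num)

def event_lookup(cell):
    out = []
    for token in cell.split(':'):
        if '=' in token:
            # Translate only the key on the left of '='; keep the "=value" part untouched
            key, value = token.split('=', 1)
            out.append(lookup_number(key) + '=' + value)
        else:
            out.append(lookup_number(token))
    return ':'.join(out)

clickstream = pd.Series(['1:3:5:7=23', '23=1:5:1:5:3'])
print(clickstream.apply(event_lookup))
# 0          login:logout:button_click:interaction=23
# 1    click1=1:button_click:login:button_click:logout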

Related

Cannot match two values in two different csvs

I am parsing through two separate CSV files with the goal of finding matching customerIDs and dates in order to manipulate the balance.
In my for loop, at some point there should be a match, as I intentionally put duplicate IDs and dates in my CSV. However, when parsing and attempting to match the data, the matches aren't working properly even though the values are the same.
main.py:
transactions = pd.read_csv(INPUT_PATH, delimiter=',')
accounts = pd.DataFrame(
    columns=['customerID', 'MM/YYYY', 'minBalance', 'maxBalance', 'endingBalance'])

for index, row in transactions.iterrows():
    customer_id = row['customerID']
    date = formatter.convert_date(row['date'])
    minBalance = 0
    maxBalance = 0
    endingBalance = 0
    dict = {
        "customerID": customer_id,
        "MM/YYYY": date,
        "minBalance": minBalance,
        "maxBalance": maxBalance,
        "endingBalance": endingBalance
    }

    print(customer_id in accounts['customerID'] and date in accounts['MM/YYYY'])
    # Returns False

    if (accounts['customerID'].equals(customer_id)) and (accounts['MM/YYYY'].equals(date)):
        # This section never runs
        print("hello")
    else:
        print("world")
        accounts.loc[index] = dict

accounts.to_csv(OUTPUT_PATH, index=False)
Transactions CSV:
customerID,date,amount
1,12/21/2022,500
1,12/21/2022,-300
1,12/22/2022,100
1,01/01/2023,250
1,01/01/2022,300
1,01/01/2022,-500
2,12/21/2022,-200
2,12/21/2022,700
2,12/22/2022,200
2,01/01/2023,300
2,01/01/2023,400
2,01/01/2023,-700
Accounts CSV
customerID,MM/YYYY,minBalance,maxBalance,endingBalance
1,12/2022,0,0,0
1,12/2022,0,0,0
1,12/2022,0,0,0
1,01/2023,0,0,0
1,01/2022,0,0,0
1,01/2022,0,0,0
2,12/2022,0,0,0
2,12/2022,0,0,0
2,12/2022,0,0,0
2,01/2023,0,0,0
2,01/2023,0,0,0
2,01/2023,0,0,0
Expected Accounts CSV
customerID,MM/YYYY,minBalance,maxBalance,endingBalance
1,12/2022,0,0,0
1,01/2023,0,0,0
1,01/2022,0,0,0
2,12/2022,0,0,0
2,01/2023,0,0,0
Where does the problem come from
Your problem comes from the comparisons you're doing with pandas Series. To make it simple, when you do:
customer_id in accounts['customerID']
you're checking whether customer_id is in the index of the Series accounts['customerID'], whereas you want to check the values of the Series.
And in your if statement, you're using the pd.Series.equals method. Here is what the documentation says the method does:
This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.
So equals is used to compare whole DataFrames or Series against each other, which is different from what you're trying to do.
One of many solutions
There are multiple ways to achieve what you're trying to do; the easiest is simply to get the values from the Series before doing the comparison:
customer_id in accounts['customerID'].values
Note that accounts['customerID'].values returns a NumPy array of the values of your Series.
So your comparison should be something like this:
print(customer_id in accounts['customerID'].values and date in accounts['MM/YYYY'].values)
And use the same thing in your if statement:
if (customer_id in accounts['customerID'].values and date in accounts['MM/YYYY'].values):
Alternative solutions
You can also use the pandas.Series.isin function, which, given values as input, returns a boolean Series showing whether each element in the Series matches the given input; you then just need to check whether that boolean Series contains at least one True value.
Documentation of isin: https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html
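As a rough sketch of that alternative, reusing the variable names from the question inside its loop (where it would sit is my assumption, not code from the original answer):

already_seen = (accounts['customerID'].isin([customer_id]).any()
                and accounts['MM/YYYY'].isin([date]).any())
if already_seen:
    # row for this customer/month already recorded
    print("hello")
else:
    print("world")
    accounts.loc[index] = dict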
It is not clear from the information given what the formatter.convert_date function does, but from the example CSVs you added it seems like it should do something like:
def convert_date(mmddyy):
    (mm, dd, yy) = mmddyy.split('/')
    return mm + '/' + yy
In addition, make sure that the data types are also equal (both date fields should be strings, and likewise for the customer ID).
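A quick way to verify that point is to print the types on both sides before comparing; the normalisation to str below is just an illustrative assumption using the question's column names:

# Sanity check: the membership test only matches when the types agree
print(accounts['customerID'].dtype, type(customer_id))
print(accounts['MM/YYYY'].dtype, type(date))

# If they disagree, normalise both sides to strings before comparing
accounts['customerID'] = accounts['customerID'].astype(str)
customer_id = str(customer_id)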

Getting NaN Values after Splitting with Boolean Masking

I am trying to split a huge dataframe into smaller dataframes based on values on a specific column.
What I basically did was I created a for loop then assigned each dataframe to a dictionary.
However, when I call the items from the dictionary, all values are NaN except for the cell_id values that I used for splitting.
Why would this happen?
Also I would appreciate if there are more practical ways to do this.
df_sliced_dict = {}
for cell in ex_df['cell_id'].unique():
    df_sliced_dict[cell] = ex_df[ex_df.loc[:, ['cell_id']] == cell]
Replace
df_sliced_dict[cell] = ex_df[ex_df.loc[:, ['cell_id']] == cell]
with
df_sliced_dict[cell] = ex_df[ex_df['cell_id'] == cell]
inside the for-loop and it will work as expected.
The problem is that ex_df.loc[:, ['cell_id']] (or ex_df[['cell_id']]) is a DataFrame, not a Series, and you want a Series to construct your boolean mask.
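On the "more practical ways to do this" part of the question, a common alternative is to let groupby do the splitting in one pass. This is only a sketch, with a toy stand-in for the question's ex_df:

import pandas as pd

# Toy stand-in for the question's ex_df
ex_df = pd.DataFrame({'cell_id': [1, 1, 2, 2], 'value': [10, 20, 30, 40]})

# groupby yields (cell_id, sub-DataFrame) pairs, so the dictionary falls out directly
df_sliced_dict = {cell: sub_df for cell, sub_df in ex_df.groupby('cell_id')}

print(df_sliced_dict[1])
#    cell_id  value
# 0        1     10
# 1        1     20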

Pandas series string manipulation using Python - 1st two chars flip and append to end of string

I have a column (Series) of values in which I'm trying to move characters around, and I'm going nowhere fast! I found some snippets of code to get me where I am but need a "closer". I'm working with one column, datatype str. Each cell's string is a series of numbers. Some are duplicated, and these duplicate numbers have an "n-" in front of the number. The n will change based on how many duplicate number strings are listed; some may have two duplicates, some eight. It doesn't matter, the order should stay the same.
I need to go down through each cell or string, pluck the "n-" from the left of the string, swap the two characters around, and append it to the end of the string. No number sorting needed. The column is 4-5k lines long and looks like the example given all the way down. No other special characters or letters. Also, the duplicate rows will always be together, no matter where they are in the column.
My problem is that the code below actually works: it steps through each string, evaluates it for a dash, then processes the numbers the way I need. However, I have not learned how to get the changes back into my dataframe from a Python for-loop. I was really hoping somebody had a nifty lambda fix or a pandas apply function to address the whole column at once, but I haven't found anything I can tweak to work. I know there is a better way than slowly traversing down through a Series, and I would like to learn.
Two possible fixes needed:
Is there a way to have the code below replace the old df.string value with the newly created df.string value? If so, please let me know.
I've been trying to read up on df.apply using the split feature so I can address the whole column at once. I understand it's the smarter play. Is there a couple of lines of code that would do what I need?
Please let me know what you think. I appreciate the help. Thank you for taking the time.
import re
import pandas as pd
from pandas import DataFrame, Series
import numpy as np

df = pd.read_excel(r"E:\Book2.xlsx")
df.column1 = df.column1.astype(str)

for r in df['column1']:                    # walk down the column
    if bool(re.search('-', r)) != True:    # skip strings without a '-'
        continue
    else:
        a = []  # holds the '-' character
        b = []  # holds the digit characters
        for c in r:
            if c == '-':        # if '-' then hold in a
                a.append(c)
            else:
                b.append(c)     # if digit then hold in b
        t = ''.join(b + a)      # puts '-' at the end of the string
        z = t[1:] + t[:1]       # moves the first char to the end of the string
        r = z                   # only rebinds the loop variable; the dataframe is NOT updated
print(df)
Starting File: Ending File:
column1 column1
41887 41887
1-41845 41845-1
2-41845 41845-2
40905 40905
1-41323 41323-1
2-41323 41323-2
3-41323 41323-3
41778 41778
You can use Series.str.replace():
If we recreate your example with a file containing all your values and retain column1 as the column name:
import pandas as pd

df = pd.read_csv('file.txt', header=None)
df.columns = ['column1']
df['column1'] = df['column1'].str.replace(r'^(\d)-(\d+)', r'\2-\1', regex=True)
print(df)
This gives the desired output, replaces the old column with the new one, and does it all in one step (without loops).
#in
41887
1-41845
2-41845
40905
1-41323
2-41323
3-41323
41778
#out
column1
0 41887
1 41845-1
2 41845-2
3 40905
4 41323-1
5 41323-2
6 41323-3
7 41778
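If you would rather use the apply route the question asks about, a hedged equivalent (assuming each string has at most one leading "n-" prefix, as in the examples) could look like this:

import pandas as pd

df = pd.DataFrame({'column1': ['41887', '1-41845', '2-41845', '40905']})

def move_prefix(s):
    # Move a leading "n-" to the end of the string as "-n"
    if '-' in s:
        prefix, rest = s.split('-', 1)
        return rest + '-' + prefix
    return s

df['column1'] = df['column1'].apply(move_prefix)
print(df)
#    column1
# 0    41887
# 1  41845-1
# 2  41845-2
# 3    40905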

Converting Pandas Series to Set splits values in series with commas

I'm new to pandas. I want to take some strings returned from a pandas Series (a bunch of values under a column named 'lots' in a CSV) and put them in a set. To this end I wrote the following:
setbincsv_df = bincsv_df['lots'].apply(set)
print(setbincsv_df )
But the output resulting from that print statement takes a value in that Series, like "OP", and displays it as 136 {P, O}. Not only does it split the string into characters, it also appears to reverse them.
Bottom 5 items returned:
**"132 {I, F}"
"133 {E, F}"
"134 {W, I}"
"135 {V, H}"
"136 {P, O}"**
I'd expect it to return the value as it was in the series "OP". Why is this happening?
If you use apply, you are applying the set operation to the string in each row.
For example, if you have the word "pull":
print(set("pull"))
{'p','u','l'}
What you probably want is to do set(series):
import pandas as pd

df = pd.DataFrame({'lots': ['ai', 'cd', 'ai', 'drgf']})
print(set(df['lots']))
that outputs
{'cd', 'ai', 'drgf'}
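Applied to the names from the question, that would presumably be:

# Collect the unique values of the 'lots' column into one set (sets are unordered)
lots_set = set(bincsv_df['lots'])
print(lots_set)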

how to locate row in dataframe without headers

I noticed that when using .loc in pandas dataframe, it not only finds the row of data I am looking for but also includes the header column names of the dataframe I am searching within.
So when I try to append the .loc row of data, it includes the data + column headers - I don't want any column headers!
##1st dataframe
df_futures.head(1)
date max min
19990101 2000 1900
##2nd dataframe
df_cash.head(1)
date$ max$ min$
1999101 50 40
##if date is found in dataframe 2, I will collect the row of data
data_to_track = []
for ii in range(len(df_futures['date'])):
    ## date I will try to find in df2
    date_to_find = df_futures['date'][ii]
    ## append the row of data to my list
    data_to_track.append(df_cash.loc[df_cash['Date$'] == date_to_find])
I want the for loop to return just 19990101 50 40
It currently returns 0 19990101 50 40, date$, max$, min$
I agree with other comments regarding the clarity of the question. However, if what you want to get is just a string that contains a particular row's data, then you could use the to_string() method of pandas.
In your case,
Instead of this:
df_cash.loc[df_cash['Date$'] == date_to_find]
You could get a string that includes only the row data, with neither the headers nor the index:
df_cash[df_cash['Date$'] == date_to_find].to_string(header=False, index=False)
Also notice that I dropped the .loc part, which outputs the same result.
If your dataframe has multiple columns and you don't want them joined into a single string (which may bring data-type issues and is potentially problematic if you want to separate them later on), you could use the list() constructor instead, such as:
list(df_cash[df_cash['Date$'] == date_to_find].iloc[0])
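Plugged back into the loop from the question, the list-based variant might look like this; the emptiness check is an added assumption to guard against dates with no match:

match = df_cash[df_cash['Date$'] == date_to_find]
if not match.empty:
    # Append just the row's values, without the index or column headers
    data_to_track.append(list(match.iloc[0]))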
