My dataset looks like this:
Paste_Values AB_IDs AC_IDs AD_IDs
AE-1001-4 AB-1001-0 AC-1001-3 AD-1001-2
AE-1964-7 AB-1964-2 AC-1964-7 AD-1964-1
AE-2211-1 AB-2211-1 AC-2211-3 AD-2211-2
AE-2182-4 AB-2182-6 AC-2182-4 AD-2182-5
I need to compare each value in the Paste_Values column with the other three values in the same row.
For Example:
AE-1001-4 is split into two parts, AE and 1001-4, and we need to check whether 1001-4 is present in the other columns.
If it is not present, we need to create a new column and put the same AE-1001-4 there.
If 1001-4 matches another column, we need to change it into something like 'AE-1001-5' and put that in the new column.
After:
If there is no match, I need to write the value of Paste_Values as is into a newly created column named new_paste_value.
If there is a match (same value) in other columns within the same row, then I need to change the last digit of the Paste_Values value so that the whole value is not the same as any other value in the row, and write that newly generated value into the new_paste_value column.
I need to do this with every row in the data frame.
So the result should look like:
Paste_Values AB_IDs AC_IDs AD_IDs new_paste_value
AE-1001-4 AB-1001-0 AC-1001-3 AD-1001-2 AE-1001-4
AE-1964-7 AB-1964-2 AC-1964-7 AD-1964-1 AE-1964-3
AE-2211-1 AB-2211-1 AC-2211-3 AD-2211-2 AE-2211-4
AE-2182-4 AB-2182-6 AC-2182-4 AD-2182-5 AE-2182-1
How can I do it?
Start by defining a function to be applied to each row of your DataFrame:
def fn(row):
    rr = row.copy()
    v1 = rr.pop('Paste_Values')                       # value from the first column
    if not rr.str.contains(f'{v1[3:]}$').any():
        return v1                                     # no match - keep the original value
    v1a = v1[3:-1]                                    # central part of v1, e.g. '1001-'
    for ch in '1234567890':
        if not rr.str.contains(v1a + ch + '$').any():
            return v1[:-1] + ch                       # first last digit that avoids a clash
    return '????'                                     # no candidate found
A bit of explanation:
The row argument is actually a Series, with index values taken from
column names.
So rr.pop('Paste_Values') removes the first value, which is saved in v1
and the rest remains in rr.
Then v1[3:] extracts the "rest" of v1 (without "AE-")
and str.contains checks whether each element of rr ends with this string (the trailing $ anchors the regex at the end of the value).
With this explanation, the rest of this function should be quite
understandable. If not, execute each individual instruction and
print their results.
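If it helps, here is a small standalone illustration, built from one row of the example data, of what the function sees and what str.contains returns (purely illustrative, not part of the solution):

import pandas as pd

# The Series that fn() receives from df.apply(..., axis=1) for the second example row
row = pd.Series({'Paste_Values': 'AE-1964-7', 'AB_IDs': 'AB-1964-2',
                 'AC_IDs': 'AC-1964-7', 'AD_IDs': 'AD-1964-1'})
rr = row.copy()
v1 = rr.pop('Paste_Values')              # 'AE-1964-7'; rr keeps the other three columns
print(rr.str.contains(f'{v1[3:]}$'))     # AC_IDs ends with '1964-7' -> one True, so there is a clash
print(rr.str.contains(v1[3:-1] + '3$'))  # nothing ends with '1964-3' -> all False, so '3' is a free digit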
And the only thing to do is to apply this function to your DataFrame,
substituting the result to a new column:
df['new_paste_value'] = df.apply(fn, axis=1)
To run a test, I created the following DataFrame:
df = pd.DataFrame(data=[
        ['AE-1001-4', 'AB-1001-0', 'AC-1001-3', 'AD-1001-2'],
        ['AE-1964-7', 'AB-1964-2', 'AC-1964-7', 'AD-1964-1'],
        ['AE-2211-1', 'AB-2211-1', 'AC-2211-3', 'AD-2211-2'],
        ['AE-2182-4', 'AB-2182-6', 'AC-2182-4', 'AD-2182-5']],
    columns=['Paste_Values', 'AB_IDs', 'AC_IDs', 'AD_IDs'])
I received no error on this data, so run a test on the data above.
Maybe the source of your error is somewhere else?
Perhaps your DataFrame also contains other (float) columns that you didn't include in your question.
If that is the case, run my function on a copy of your DataFrame with these "other" columns removed.
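For example, a minimal sketch of that workaround, assuming the ID columns are named exactly as in the question:

# Sketch: apply fn only to the string ID columns and attach the result back
id_cols = ['Paste_Values', 'AB_IDs', 'AC_IDs', 'AD_IDs']   # adjust to your real column names
df['new_paste_value'] = df[id_cols].apply(fn, axis=1)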
I have a DataFrame that's read in from a csv. The data has various problems. The one I'm concerned about for this post is that some data is not in the column it should be. For example, '900' is in the zipcode column, or 'RQ' is in the language column when it should be in the nationality column. In some cases, these "misinsertions" are just anomalies and can be converted to NaN. In other cases they indicate that the values have shifted one column to the right or the left, such that the whole row has misinserted data. I want to remove these shifted rows from the DataFrame and try to fix them separately. My proposed solution has been to keep track of the number of bad values in each row as I clean each column. Here is an example with the zipcode column:
import re
from numpy import nan

def is_zipcode(value: str, regx):
    if regx.match(value):
        return value
    else:
        return nan

regx = re.compile("^[0-9]{5}(?:-[0-9]{4})?$")
df['ZIPCODE'] = df['ZIPCODE'].map(lambda x: is_zipcode(x, regx), na_action='ignore')
I'm doing something like this on every column in the dataframe, depending on the data in that column; e.g. for the 'Nationality' column I'll look up the values in a json file of nationality codes.
What I haven't been able to achieve is to keep count of the bad values in a row. I tried something like this:
def is_zipcode(value: str, regx):
    if regx.match(value):
        return 0
    else:
        return 1

regx = re.compile("^[0-9]{5}(?:-[0-9]{4})?$")
df['badValues'] = df['ZIPCODE'].map(lambda x: is_zipcode(x, regx), na_action='ignore')
df['badValues'] = df['badValues'] + df['Nationalities'].map(is_nationality, na_action='ignore')  # where is_nationality() similarly returns 1 for a bad value
And this can work to keep track of the bad values. What I'd like to do is somehow combine the process of cleaning the data and getting the bad values. I'd love to do something like this:
def is_zipcode(value: str, regx):
    if regx.match(value):
        return value
    else:
        # add 1 to the value of df['badValues'] at the corresponding index
        return nan
The problem is that I don't think it's possible to access the index of the value being passed to the map function. I looked at these two questions (one, two) but I didn't see a solution to my issue.
I guess this would do what you want ...
is_zipcode_mask = df['ZIPCODE'].str.match(regex_for_zipcode)
print(len(df[is_zipcode_mask]))
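If you want to go a step further and combine the cleaning with the counting, a possible sketch (using the column names from your question; note that with na=False missing values are also counted as bad, unlike your na_action='ignore' version) would be:

import numpy as np

zip_regex = r'^[0-9]{5}(?:-[0-9]{4})?$'

# Boolean mask of valid zipcodes; na=False treats missing values as invalid
is_zip = df['ZIPCODE'].str.match(zip_regex, na=False)

# Clean the column and count the bad values in one pass
df['ZIPCODE'] = df['ZIPCODE'].where(is_zip)      # invalid entries become NaN
df['badValues'] = (~is_zip).astype(int)          # 1 where the zipcode was bad

# Repeat per column, e.g. with a similar (hypothetical) nationality mask:
# df['badValues'] += (~is_valid_nationality).astype(int)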
I have a dataframe test. My goal is to search in the column t1 for specific strings, and if it matches a specific string exactly, put that string in the next column over, called t1_selected. The only thing is, I can't get iterrows() to go over the entire dataframe and report results in the respective rows.
for index, row in test.iterrows():
    if any(['ABCD_T1w_MPR_vNav_passive' in row['t1']]):
        #x = ast.literal_eval(row['t1'])
        test.loc[i, 't1_selected'] = str(['ABCD_T1w_MPR_vNav_passive'])
I am only trying to get ABCD_T1w_MPR_vNav_passive to be in the 4th row under t1_selected, while all the other rows will have "not found". The first entry in t1_selected is from the last row under t1, which I didn't include in the screenshot because the dataframe has over 200 rows.
I tried to initialize an empty list to append the output of
import ast
x = ast.literal_eval(row['t1'])
to see if I can put x in there, but the same issue occurred.
Is there anything I am missing?
for index, row in test.iterrows():
    if any(['ABCD_T1w_MPR_vNav_passive' in row['t1']]):
        #x = ast.literal_eval(row['t1'])
        test.loc[index, 't1_selected'] = str(['ABCD_T1w_MPR_vNav_passive'])
Here index is the row the value is written to; with i the DataFrame was not changing.
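As a side note, a vectorized sketch that avoids iterrows() altogether (assuming t1 holds plain strings) could look like this:

import numpy as np

target = 'ABCD_T1w_MPR_vNav_passive'
mask = test['t1'].str.contains(target, regex=False, na=False)   # rows whose t1 mentions the target
test['t1_selected'] = np.where(mask, str([target]), 'not found')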
So I need to group a dataframe by its SessionId, then sort each group by the created time, and afterwards retrieve only the nth row of each group.
But I found that after applying the lambda it becomes a DataFrame instead of a groupby object, hence I cannot use the .nth property:
grouped = df.groupby(['SessionId'])
sorted = grouped.apply(lambda x: x.sort_values(["Created"], ascending = True))
sorted.nth ---> error
Changing the order in which you approach the problem will help in this case: if you first sort and then use groupby, you will get the desired output and you can use the groupby.nth function.
Here is a code snippet to demonstrate the idea:
df = pd.DataFrame({'id':   ['a', 'a', 'a', 'b', 'b', 'b'],
                   'var1': [3, 2, 1, 8, 7, 6],
                   'var2': ['g', 'h', 'i', 'j', 'k', 'l']})

n = 2  # replace with the required row from each group
df.sort_values(['id', 'var1']).groupby('id').nth(n).reset_index()
Assuming id is your SessionId and var1 is the timestamp, this sorts your dataframe by id and then var1, and then picks the nth row from each of these sorted groups. The reset_index() is there just to avoid the resulting multi-index.
If you want to get the last n rows of each group, you can use .tail(n) instead of .nth(n).
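For instance, a small sketch of the .tail(n) variant on the same toy data (just to illustrate the difference, not required for the nth approach):

import pandas as pd

# Same toy data as above
df = pd.DataFrame({'id':   ['a', 'a', 'a', 'b', 'b', 'b'],
                   'var1': [3, 2, 1, 8, 7, 6],
                   'var2': ['g', 'h', 'i', 'j', 'k', 'l']})

# Last 2 rows of each group after sorting; original columns and index are kept
print(df.sort_values(['id', 'var1']).groupby('id').tail(2))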
I have created a small dataset -
n = 2
grouped = df.groupby('SessionId')
pd.concat([grouped.get_group(x).sort_values(by='SortVar').reset_index().loc[[n]]
           for x in grouped.groups], axis=0)
This will return the selected row from each group. Please note that in Python indices start from zero, so n = 2 gives you the 3rd row of the sorted data.
I noticed that when using .loc on a pandas DataFrame, it not only finds the row of data I am looking for but also includes the column headers of the DataFrame I am searching within.
So when I try to append the .loc row of data, it includes the data plus the column headers - I don't want any column headers!
## 1st dataframe
df_futures.head(1)

    date      max   min
    19990101  2000  1900

## 2nd dataframe
df_cash.head(1)

    date$     max$  min$
    19990101  50    40

## if the date is found in dataframe 2, I will collect the row of data
data_to_track = []
for ii in range(len(df_futures['date'])):
    ## date I will try to find in df2
    date_to_find = df_futures['date'][ii]
    ## append the row of data to my list
    data_to_track.append(df_cash.loc[df_cash['Date$'] == date_to_find])
I want the for loop to return just 19990101 50 40
It currently returns 0 19990101 50 40, date$, max$, min$
I agree with other comments regarding the clarity of the question. However, if what you want to get is just a string that contains a particular row's data, then you could use to_string() method of Pandas.
In your case, instead of this:
df_cash.loc[df_cash['Date$'] == date_to_find]
You could get a string that includes only the row data:
df_cash[df_cash['Date$'] == date_to_find].to_string(header=None)
Also notice that I dropped the .loc part, which outputs the same result.
If your dataframe has multiple columns and you don't want them joined into a single string (which may bring data type issues and is potentially problematic if you want to separate them later on), you could use the list() method, such as:
list(df_cash[df_cash['Date$'] == date_to_find].iloc[0])
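To make the difference concrete, here is a tiny sketch on made-up data with the column names from the question:

import pandas as pd

df_cash = pd.DataFrame({'Date$': [19990101], 'max$': [50], 'min$': [40]})
date_to_find = 19990101

row = df_cash[df_cash['Date$'] == date_to_find]
print(row.to_string(header=None))   # the row printed as one line without column headers
print(list(row.iloc[0]))            # [19990101, 50, 40] - the values keep their original dtypes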
My problem: I have a pandas dataframe, and one column in particular that I need to process contains values separated by ":". In some cases, some of those values between ":" can be value=value, and they can appear at the start/middle/end of the string. The length of the string can differ in each cell as we iterate through the rows, e.g.:
clickstream['events']
1:3:5:7=23
23=1:5:1:5:3
9:0:8:6=5:65:3:44:56
1:3:5:4
I have a file which contains the lookup values for these numbers, e.g.:
event_no,description,event
1,xxxxxx,login
3,ffffff,logout
5,eeeeee,button_click
7,tttttt,interaction
23,ferfef,click1
output required:
clickstream['events']
login:logout:button_click:interaction=23
click1=1:button_click:login:button_click:logout
Is there a pythonic way of looking up these individual values and replacing them with the event column corresponding to the event_no row, as shown in the output? I have hundreds of events and am trying to work out a smart way of doing this. pd.merge would have done the trick if I had a single value, but I'm struggling to work out how to work across the values and ignore the "=value" part of the string.
Edit: to ignore missing keys in the dict:
import pandas as pd

EventsDict = {1: '1:3:5:7', 2: '23:45:1:5:3', 39: '0:8:46:65:3:44:56', 4: '1:3:5:4'}
clickstream = pd.Series(EventsDict)

# Keep this as a dictionary
EventsLookup = {1: 'login', 3: 'logout', 5: 'button_click', 7: 'interaction'}

def EventLookup(x):
    list1 = [EventsLookup.get(int(item), 'Missing') for item in x.split(':')]
    return ":".join(list1)

clickstream.apply(EventLookup)
Since you are using a full DF and not just a series, use:
clickstream['events'].apply(EventLookup)
Output:
1 login:logout:button_click:interaction
2 Missing:Missing:login:button_click:logout
4 login:logout:button_click:Missing
39 Missing:Missing:Missing:Missing:logout:Missing...
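If your real events column also contains the key=value tokens from the question (e.g. 7=23), one possible extension of the same idea is to translate only the part before the '=' and leave the rest untouched; this is just a sketch against the toy lookup above, not tested on the real data:

def EventLookupWithValues(x):
    out = []
    for item in x.split(':'):
        key, sep, val = item.partition('=')           # '7=23' -> ('7', '=', '23'); '5' -> ('5', '', '')
        name = EventsLookup.get(int(key), 'Missing')  # translate the event number, if known
        out.append(name + sep + val)                  # keep any '=value' part as-is
    return ':'.join(out)

# e.g. EventLookupWithValues('1:3:5:7=23') -> 'login:logout:button_click:interaction=23'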