I am trying to add a new column to my existing dataset (a pandas DataFrame). This new column contains elements that satisfy a specific condition (please see the code below). I am iterating over rows, but I am not able to change the value of the row based on the condition (rows should get the value row_value[0] = var2 or row_value[0] = varB).
for index, row in sample_dataset.iterrows():
    row_value = ['Missing']
    for var1, var2 in var3:
        if row[0].endswith(var1):
            row_value[0] = var2
            break
    for varA, varB in varC:
        if row[0].endswith(varA):
            row_value[0] = varB
            break
Any help will be greatly appreciated. Thanks
Example:
Original Dataset:
Column
hello_world.a
goodmorning_world.b
bye_world.1
...
Lists are:
var1=['1','2','3']
var2=['11','22','33']
var3=list(zip(var1, var2))
similarly for varA, varB, varC:
varA=['a','b','c']
varB=['aa','bb','cc']
varC=list(zip(varA, varB))
I would like to have something like this:
Expected output
Column New_column
hello_world.a aa
goodmorning_world.b bb
bye_world.1 11
...
So, let's go step by step through your code. First, let's define the dataframe:
import pandas as pd
# create the dataframe
sample_dataset = pd.DataFrame({'Column': ['hello_world.a', 'goodmorning_world.b', 'bye_world.1']})
# create the new column, filled with NaN, which we will fill later
sample_dataset['New_column'] = pd.Series(index=sample_dataset.index, dtype='object')
Note that it is important to specify the dtype of the new column, because what you want to achieve is a column with mixed element types (numbers and strings), and only Python 'object' columns can hold mixed types.
Let's print it to see what it looks like:
print(sample_dataset)
Out:
Column New_column
0 hello_world.a NaN
1 goodmorning_world.b NaN
2 bye_world.1 NaN
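As an aside, here is a minimal sketch (separate from the dataframe above) of why the 'object' dtype matters: it is the one dtype that lets numbers and strings coexist in a single column.

```python
import pandas as pd

# an 'object' Series can hold values of different Python types side by side
s = pd.Series(index=range(2), dtype='object')
s[0] = 11      # a number
s[1] = 'aa'    # a string
print([type(v).__name__ for v in s])  # ['int', 'str']
```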
Now let's move to the rest of your code:
# the other variables you defined
var1=['1','2','3']
var2=['11','22','33']
var3=list(zip(var1, var2))
varA=['a','b','c']
varB=['aa','bb','cc']
varC=list(zip(varA, varB))
# your code
for index, row in sample_dataset.iterrows():
    row_value = ['Missing']
    for var1, var2 in var3:
        if row[0].endswith(var1):
            row_value[0] = var2
            break
    for varA, varB in varC:
        if row[0].endswith(varA):
            row_value[0] = varB
            break
Let's check whether your code did anything to the dataframe:
print(sample_dataset)
Out:
Column New_column
0 hello_world.a NaN
1 goodmorning_world.b NaN
2 bye_world.1 NaN
Nothing seems to have changed, but something actually did: row_value. If I print it after running your code, I get:
print(row_value)
Out:
['11']
This is the most striking mistake, because it shows that your issue is not just with pandas and dataframes but with assignment in general. If you want to modify a variable, you have to assign to that variable. Here, the variable you want to change is your dataframe, which is called sample_dataset, but instead of writing to it you write to row_value in these lines:
row_value[0] = var2
row_value[0] = varB
Which is why, at the end of your code, row_value is no longer ['Missing'] but ['11']: you are changing something, just not your dataframe.
So how do you update the values in the new column of the initial dataframe? Here's how you should do it:
# iterating through rows, this is correct
for index, row in sample_dataset.iterrows():
    # you don't need to define row_value; instead, read the value of 'Column' in the current row
    value = row['Column']
    # you could just do "for var1, var2 in zip(var1, var2)" without defining var3; not a mistake, but it makes the code more concise
    for var1, var2 in var3:
        # using .endswith() will raise an error when you try to apply it to numbers; an alternative that works for both numbers and strings is to simply access the last character with [-1]
        if value[-1] == var1:
            # this is how you change an element in a dataframe, i.e. using .at[index, column]
            # here we set the element at the current index in the column 'New_column' to var2
            sample_dataset.at[index, 'New_column'] = var2
            break
    for varA, varB in varC:
        # same story as before
        if value[-1] == varA:
            sample_dataset.at[index, 'New_column'] = varB
            break
Let's print the dataframe again to check that this works:
print(sample_dataset)
Out:
Column New_column
0 hello_world.a aa
1 goodmorning_world.b bb
2 bye_world.1 11
So this time we did access the dataframe and changed the values of New_column successfully. Go through the code, and if you have any doubts just comment; I can explain it in more detail.
As a final note, if what you want is just to take the last character of each value in the first column and double it in a new column, there are much better ways to do it. For example:
for index, row in sample_dataset.iterrows():
    value = row['Column']
    sample_dataset.at[index, 'New_column'] = value[-1] * 2
Again, if we print it we can see that three lines of code are enough to do the job:
print(sample_dataset)
Out:
Column New_column
0 hello_world.a aa
1 goodmorning_world.b bb
2 bye_world.1 11
This way, you don't need to define varA, varB, varC and all the others, and you don't need breaks or nested for loops. We can even compress the code to a single line using .apply():
sample_dataset['New_column'] = sample_dataset.apply(lambda x: x['Column'][-1]*2, axis=1)
This will again give you the same results as before, but if you had trouble with your original code, this might be something to leave for the future, when you have a bit more confidence.
Also, please note that the last two methods create all-string elements, so even the last element, 11, will be the string '11' and not a number. This might be something you want to avoid; in general, though, it is not a good idea to mix types in a column anyway.
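For completeness, the same doubling can also be done with the vectorized .str accessor, which avoids both the explicit loop and .apply. A sketch using the same sample data (note that this also produces strings):

```python
import pandas as pd

sample_dataset = pd.DataFrame({'Column': ['hello_world.a', 'goodmorning_world.b', 'bye_world.1']})
# take the last character of each string and repeat it twice, column-wide
sample_dataset['New_column'] = sample_dataset['Column'].str[-1] * 2
print(sample_dataset['New_column'].tolist())  # ['aa', 'bb', '11']
```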
Edit
If you want to extract a part of a string according to a specific rule (in this case, everything after the last period), you need to use regular expressions (regex). Regex in Python is implemented in the re library; what you need to do is:
# import the library
import re
# define a pattern of interest; this specific pattern means 'everything after the last period, up to the end of the string'
pattern = r"([^.]*$)"
# now you can extract the final part of each element in your dataframe using re.search
last_part = re.search(pattern, element).groups()[0]
Just to show what it does, let's take a fake value like 'hello_world.com' and apply the regex to it:
print(re.search(pattern, 'hello_world.com').groups()[0])
Out:
com
Now, in your code, you want to replace value[-1] with re.search, so
if value[-1] == var1:
if value[-1] == varA:
should become
if re.search(pattern, value).groups()[0] == var1:
if re.search(pattern, value).groups()[0] == varA:
Remember to add the import for re and to define the pattern at the beginning of your code.
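If you prefer to stay vectorized, the same extraction can be done for the whole column at once with pandas' str.extract, reusing the pattern above. A sketch with the sample data from earlier:

```python
import pandas as pd

df = pd.DataFrame({'Column': ['hello_world.a', 'goodmorning_world.b', 'bye_world.1']})
# extract everything after the last period, for every row at once
df['suffix'] = df['Column'].str.extract(r'([^.]*$)', expand=False)
print(df['suffix'].tolist())  # ['a', 'b', '1']
```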
Related
I have a DataFrame that's read in from a CSV. The data has various problems. The one I'm concerned about for this post is that some data is not in the column it should be. For example, '900' is in the zipcode column, or 'RQ' is in the language column when it should be in the nationality column. In some cases, these "misinsertions" are just anomalies and can be converted to NaN. In other cases they indicate that the values have shifted one column to the right or left, such that the whole row has misinserted data. I want to remove these shifted rows from the DataFrame and try to fix them separately. My proposed solution has been to keep track of the number of bad values in each row as I clean each column. Here is an example with the zipcode column:
def is_zipcode(value: str, regx):
    if regx.match(value):
        return value
    else:
        return nan

regx = re.compile("^[0-9]{5}(?:-[0-9]{4})?$")
df['ZIPCODE'] = df['ZIPCODE'].map(lambda x: is_zipcode(x, regx), na_action='ignore')
I'm doing something like this on every column in the dataframe depending on the data in that column, e.g. for the 'Nationality' column i'll look up the values in a json file of nationality codes.
What I haven't been able to achieve is to keep count of the bad values in a row. I tried something like this:
def is_zipcode(value: str, regx):
    if regx.match(value):
        return 0
    else:
        return 1

regx = re.compile("^[0-9]{5}(?:-[0-9]{4})?$")
df['badValues'] = df['ZIPCODE'].map(lambda x: is_zipcode(x, regx), na_action='ignore')
df['badValues'] = df['badValues'] + df['Nationalities'].map(is_nationality, na_action='ignore')  # where is_nationality() similarly returns 1 for a bad value
And this can work to keep track of the bad values. What I'd like to do is somehow combine the process of cleaning the data and getting the bad values. I'd love to do something like this:
def is_zipcode(value: str, regx):
    if regx.match(value):
        # add 1 to the value of df['badValues'] at the corresponding index
        return value
    else:
        return nan
The problem is that I don't think it's possible to access the index of the value being passed to the map function. I looked at these two questions (one, two) but I didn't see a solution to my issue.
I guess this would do what you want:
is_zipcode_mask = df['ZIPCODE'].str.match(regex_for_zipcode)
print(len(df[is_zipcode_mask]))
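Building on that idea, one way to combine the cleaning and the counting without needing the index inside map is to compute a boolean mask per column and use it for both steps. This is a sketch under the question's column names:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'ZIPCODE': ['12345', '900', '12345-6789']})
# one boolean mask drives both the count and the cleaning
is_zip = df['ZIPCODE'].str.match(r'^[0-9]{5}(?:-[0-9]{4})?$', na=False)
df['badValues'] = (~is_zip).astype(int)  # count the failures first...
df.loc[~is_zip, 'ZIPCODE'] = np.nan      # ...then null out the bad values
print(df['badValues'].tolist())  # [0, 1, 0]
```

For the other columns, build their masks the same way and add `(~mask).astype(int)` onto badValues.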
I have a column (Series) of values in which I'm trying to move characters around, and I'm going nowhere fast! I found some snippets of code to get me where I am, but I need a "closer". I'm working with one column of dtype str. Each string in the column is a series of numbers. Some are duplicated; these duplicate number strings have an "n-" in front of the number, where n changes based on how many duplicate strings are listed. Some may have two duplicates, some eight. It doesn't matter; the order should stay the same.
I need to go down through each string, pluck the "n-" from the left of the string, swap the two characters around, and append it to the end of the string. No number sorting is needed. The column is 4-5k lines long and will look like the example below all the way down. There are no other special characters or letters. Also, the duplicate rows will always be together, no matter where they are in the column.
My problem is that the code below actually works: it steps through each string, checks it for a dash, then processes the numbers the way I need. However, I have not learned how to get the changes back into my dataframe from a Python for-loop. I was really hoping somebody had a nifty lambda fix or a pandas apply function to address the whole column at once, but I haven't found anything I can tweak to work. I know there is a better way than slowly traversing down the Series, and I would like to learn it.
Two possible fixes needed:
Is there a way to have the code below replace the old df.string value with the newly created df.string value? If so, please let me know.
I've been trying to read up on df.apply using the split feature so I can address the whole column at once. I understand it's the smarter play. Is there a couple of lines code that would do what I need?
Please let me know what you think. I appreciate the help. Thank you for taking the time.
import re
import pandas as pd
from pandas import DataFrame, Series
import numpy as np

df = pd.read_excel(r"E:\Book2.xlsx")
df.column1 = df.column1.astype(str)

for r in df['column1']:  # iterate over the column
    if bool(re.search('-', r)) != True:  # test if the string has a '-'
        continue
    else:
        a = []  # holder for '-'
        b = []  # holder for numbers
        for c in r:
            if c == '-':  # if '-' then hold in a
                a.append(c)
            else:
                b.append(c)  # if number then hold in b
        t = ''.join(b + a)  # puts '-' at the end of the string
        z = t[1:] + t[:1]  # moves the first character to the end of the string
        r = z  # assigns the new string to the loop variable (this does NOT update the dataframe)
print(df)
Starting File: Ending File:
column1 column1
41887 41887
1-41845 41845-1
2-41845 41845-2
40905 40905
1-41323 41323-1
2-41323 41323-2
3-41323 41323-3
41778 41778
You can use Series.str.replace():
If we recreate your example with a file containing all your values and retain column1 as the column name:
import pandas as pd

df = pd.read_csv('file.txt')
df.columns = ['column1']
df['column1'] = df['column1'].str.replace(r'(^\d)-(\d+)', r'\2-\1', regex=True)
print(df)
This will give the desired output, replacing the old column with the new one, all in one step (no loops).
#in
41887
1-41845
2-41845
40905
1-41323
2-41323
3-41323
41778
#out
column1
0 41887
1 41845-1
2 41845-2
3 40905
4 41323-1
5 41323-2
6 41323-3
7 41778
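For a self-contained version that can be run without file.txt, the same replacement can be sketched on an in-memory DataFrame (the raw string and regex=True keep newer pandas versions happy):

```python
import pandas as pd

df = pd.DataFrame({'column1': ['41887', '1-41845', '2-41845', '40905']})
# move the leading "n-" counter to the end of the string as "-n"
df['column1'] = df['column1'].str.replace(r'(^\d)-(\d+)', r'\2-\1', regex=True)
print(df['column1'].tolist())  # ['41887', '41845-1', '41845-2', '40905']
```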
My dataset looks like this:
Paste_Values AB_IDs AC_IDs AD_IDs
AE-1001-4 AB-1001-0 AC-1001-3 AD-1001-2
AE-1964-7 AB-1964-2 AC-1964-7 AD-1964-1
AE-2211-1 AB-2211-1 AC-2211-3 AD-2211-2
AE-2182-4 AB-2182-6 AC-2182-7 AD-2182-5
I need to compare each value in the Paste_Values column with the other three values in the same row.
For example:
AE-1001-4 is split into two parts, AE and 1001-4; we need to check whether 1001-4 is present in the other columns or not.
If it is not present, we put the same AE-1001-4 into a new column.
If 1001-4 matches one of the other columns, we change it into e.g. 'AE-1001-5' and put that in the new column.
After:
If there is no match, I need to write the value of Paste_Values as is in the newly created column named new_paste_value.
If there is a match (same value) in other columns within the same row, then I need to change the last digit of the value from the Paste_Values column, so that the whole value is not the same as any other whole value in the row, and write that newly generated value in the new_paste_value column.
I need to do this with every row in the data frame.
So the result should look like:
Paste_Values AB_IDs AC_IDs AD_IDs new_paste_value
AE-1001-4 AB-1001-0 AC-1001-3 AD-1001-2 AE-1001-4
AE-1964-7 AB-1964-2 AC-1964-7 AD-1964-1 AE-1964-3
AE-2211-1 AB-2211-1 AC-2211-3 AD-2211-2 AE-2211-4
AE-2182-4 AB-2182-6 AC-2182-4 AD-2182-5 AE-2182-1
How can I do it?
Start from defining a function to be applied to each row of your DataFrame:
def fn(row):
    rr = row.copy()
    v1 = rr.pop('Paste_Values')  # first value
    if not rr.str.contains(f'{v1[3:]}$').any():
        return v1  # no match
    v1a = v1[3:-1]  # central part of v1
    for ch in '1234567890':
        if not rr.str.contains(v1a + ch + '$').any():
            return v1[:-1] + ch
    return '????'  # no candidate found
A bit of explanation:
The row argument is actually a Series, with index values taken from column names. So rr.pop('Paste_Values') removes the first value, which is saved in v1, while the rest remains in rr. Then v1[3:] extracts the "rest" of v1 (without "AE-"), and str.contains checks, for each element of rr, whether it contains this string at the end position.
With this explanation, the rest of the function should be quite understandable. If not, execute each individual instruction and print its result.
And the only thing left to do is apply this function to your DataFrame, assigning the result to a new column:
df['new_paste_value'] = df.apply(fn, axis=1)
To run a test, I created the following DataFrame:
import pandas as pd

df = pd.DataFrame(data=[
    ['AE-1001-4', 'AB-1001-0', 'AC-1001-3', 'AD-1001-2'],
    ['AE-1964-7', 'AB-1964-2', 'AC-1964-7', 'AD-1964-1'],
    ['AE-2211-1', 'AB-2211-1', 'AC-2211-3', 'AD-2211-2'],
    ['AE-2182-4', 'AB-2182-6', 'AC-2182-4', 'AD-2182-5']],
    columns=['Paste_Values', 'AB_IDs', 'AC_IDs', 'AD_IDs'])
I received no error on this data, so perform a test on the data above.
Maybe the source of your error is in some other place? Perhaps your DataFrame also contains other (float) columns, which you didn't include in your question. If that is the case, run my function on a copy of your DataFrame with those "other" columns removed.
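As a sketch of that last suggestion, you can also keep the extra columns and simply apply fn to the four ID columns only (the price column below is a hypothetical stand-in for such an extra float column):

```python
import pandas as pd

def fn(row):
    rr = row.copy()
    v1 = rr.pop('Paste_Values')
    if not rr.str.contains(f'{v1[3:]}$').any():
        return v1
    v1a = v1[3:-1]
    for ch in '1234567890':
        if not rr.str.contains(v1a + ch + '$').any():
            return v1[:-1] + ch
    return '????'

df = pd.DataFrame({'Paste_Values': ['AE-1964-7'], 'AB_IDs': ['AB-1964-2'],
                   'AC_IDs': ['AC-1964-7'], 'AD_IDs': ['AD-1964-1'],
                   'price': [3.5]})  # hypothetical extra float column
id_cols = ['Paste_Values', 'AB_IDs', 'AC_IDs', 'AD_IDs']
# apply fn only to the string ID columns, leaving the float column untouched
df['new_paste_value'] = df[id_cols].apply(fn, axis=1)
print(df['new_paste_value'].tolist())  # ['AE-1964-3']
```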
I am attempting to pull a substring from a column, in the following way:
target_column:
PE123
DD123-HP123
HP123
373627HP23
I would like to pull the first two letters of every record, except in cases where there is no letter among the first two characters. In that case, pull whatever letters you find in the rest of the string; so in the case of 373627HP23, it will pull HP.
But the problem is with something like DD123-HP123. My loop is pulling the HP instead of the DD.
for index, row in df.iterrows():
    target_value = row['target_column']
    predefined_code = ['HP']
    for code in re.findall("[a-zA-Z]+", target_value):
        if (len(code) != 1) and not (code in predefined_code):
            possible_code = code
What is wrong with my code here?
What is the best code to write a loop so that in the case of something like DD123-HP123, it will pull the DD and not the HP?
I believe you can use extract to return the first matched pattern:
df['new'] = df['target_column'].str.extract("([a-zA-Z]+)")
print (df)
target_column new
0 PE123 PE
1 DD123-HP123 DD
2 HP123 HP
3 373627HP23 HP
I have a dataset that contains columns with strings. One of those columns contains an identifier. Now I want to check whether that identifier follows this pattern, e.g. AB12CD: 2 capital letters, 2 numbers, followed by 2 letters again.
The data is stored in a pandas data frame. I have:
for i in range(0, len(data.columns)):
    if data.columns[i] == 'identifier ':
        pattern = re.compile("[A-Z][A-Z][0-9][0-9][A-Z][A-Z]")
        if pattern.match(data.ix[i, 0]):
            data['identifier Check'] = 'Ok'
        else:
            data['identifier Check'] = 'identifier Format incorrect'
But this is not working. It puts 'Ok' or 'identifier Format incorrect' on every row, depending only on the first row.
Can anybody help me?
Thanks!
Your code doesn't work as expected because data['identifier Check'] = 'Ok' assigns 'Ok' to every row in the identifier Check column.
Your code also scans the DataFrame column-wise instead of row-wise (ie it checks the value in the first row of every column, instead of checking the value in a specific column of every row).
My solution defines a function that returns your required output given a string and a pattern.
This function is called via the apply method that pandas.Series have. In this case it will go over every item in the data['identifier'] column and pass it to the check_identifier function. The result of data['identifier'].apply(check_identifier) is a Series that is then assigned to the newly created identifier Check column in the DataFrame.
import re

# abusing the fact that default arguments are evaluated only once, at function definition
def check_identifier(value, pattern=re.compile("[A-Z][A-Z][0-9][0-9][A-Z][A-Z]")):
    return 'OK' if pattern.match(value) else 'identifier Format incorrect'

data['identifier Check'] = data['identifier'].apply(check_identifier)
An example:
def check_identifier(value, pattern=re.compile("[A-Z][A-Z][0-9][0-9][A-Z][A-Z]")):
    return 'OK' if pattern.match(value) else 'identifier Format incorrect'

df = pd.DataFrame({'a': ['AB12CD', 'AB12Cd']})
print(df)
>> a
0 AB12CD
1 AB12Cd
df['identifier Check'] = df['a'].apply(check_identifier)
print(df)
>> a identifier Check
0 AB12CD OK
1 AB12Cd identifier Format incorrect
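As an alternative sketch without apply, the same labelling can be done with the vectorized str.match plus numpy.where:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': ['AB12CD', 'AB12Cd']})
# build a boolean mask once, then map it to the two labels
mask = df['a'].str.match(r'[A-Z]{2}[0-9]{2}[A-Z]{2}')
df['identifier Check'] = np.where(mask, 'OK', 'identifier Format incorrect')
print(df['identifier Check'].tolist())  # ['OK', 'identifier Format incorrect']
```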