Looping to pull the first two substrings of a column in Python

I am attempting to pull a substring from a column, in the following way:
target_column:
PE123
DD123-HP123
HP123
373627HP23
I would like to pull the first two letters of every record, except in cases where the first two characters are not letters; in that case, pull the first letters found anywhere in the rest of the string. So in the case of 373627HP23, it should pull HP.
The problem is with something like DD123-HP123: my loop pulls the HP instead of the DD.
import re

for index, row in df.iterrows():
    target_value = row['target_column']
    predefined_code = ['HP']
    for code in re.findall("[a-zA-Z]+", target_value):
        if (len(code) != 1) and not (code in predefined_code):
            possible_code = code
What is wrong with my code here?
What is the best code to write a loop so that in the case of something like DD123-HP123, it will pull the DD and not the HP?

I believe you can use str.extract to return the first matched pattern:
df['new'] = df['target_column'].str.extract("([a-zA-Z]+)", expand=False)
print (df)
  target_column new
0         PE123  PE
1   DD123-HP123  DD
2         HP123  HP
3    373627HP23  HP
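As for what is wrong with the original loop: every match that passes the test overwrites possible_code, so when the loop finishes you are left with the last qualifying code (HP) rather than the first (DD). A minimal fix, if you want to keep the loop, is to stop at the first qualifying match:

import re

target_value = 'DD123-HP123'
predefined_code = ['HP']
possible_code = None
for code in re.findall("[a-zA-Z]+", target_value):
    if len(code) != 1 and code not in predefined_code:
        possible_code = code
        break  # keep the first qualifying code instead of the last one
print(possible_code)  # DD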

Related

Find which columns contain a certain value for each row in a dataframe

I have a dataframe, df, shown below. Each row is a story and each column is a word that appears in the corpus of stories. A 0 means the word is absent in the story while a 1 means the word is present.
I want to find which words are present in each story (i.e. col val == 1). How can I go about finding this (preferably without for-loops)?
Thanks!
Assuming you are just trying to look at one story, you can filter for the story (let's say story 34972) and transpose the dataframe with:
df_34972 = df[df.index == 34972].T
and then you can collect the words whose value equals 1 into a list (after the transpose, the words are the index and the single column is named after the story id):
list(df_34972[df_34972[34972] == 1].index)
If you are trying to do this for all stories, the technique is slightly different. From the link that SammyWemmy provided, you can melt() the dataframe and filter for 1 values for each story. From there you can .groupby('index'), where 'index' is the story column created by reset_index() in the example below:
df = df.reset_index().melt(id_vars='index')
df = df[df['value'] == 1]
df.groupby('index')['variable'].apply(list)
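A minimal self-contained sketch of the melt route, using a made-up two-story matrix (the story ids and words here are purely for illustration):

import pandas as pd

# hypothetical story/word indicator matrix
df = pd.DataFrame({'cat': [1, 0], 'dog': [0, 1], 'sun': [1, 1]},
                  index=[34972, 34973])

melted = df.reset_index().melt(id_vars='index')
melted = melted[melted['value'] == 1]
print(melted.groupby('index')['variable'].apply(list))
# 34972    [cat, sun]
# 34973    [dog, sun]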

Pandas series string manipulation using Python - 1st two chars flip and append to end of string

I have a column (series) of values I'm trying to move characters around in, and I'm going nowhere fast! I found some snippets of code to get me where I am, but I need a "closer". I'm working with one column of dtype str. Each string in the column is a series of numbers. Some are duplicated, and these duplicate number strings have an (n-) in front of the number. The (n) will change based on how many duplicate strings are listed: some may have two duplicates, some eight. It doesn't matter; the order should stay the same.
I need to go down through each string, pluck the (n-) from the left of the string, swap the two characters around, and append them to the end of the string. No number sorting is needed. The column is 4-5k lines long and will look like the example given all the way down; there are no other special characters or letters. Also, the duplicate rows will always be together, no matter where they are in the column.
My problem is that the code below actually works: it steps through each string, evaluates it for a dash, then processes the numbers the way I need. However, I have not learned how to get the changes back into my dataframe from a Python for-loop. I was really hoping that somebody had a nifty lambda fix or a pandas apply function to address the whole column at once, but I haven't found anything that I can tweak to work. I know there is a better way than slowly traversing down through a series, and I would like to learn it.
Two possible fixes needed:
Is there a way to have the code below replace the old df.string value with the newly created df.string value? If so, please let me know.
I've been trying to read up on df.apply using the split feature so I can address the whole column at once. I understand it's the smarter play. Are there a couple of lines of code that would do what I need?
Please let me know what you think. I appreciate the help. Thank you for taking the time.
import re
import pandas as pd
import numpy as np

df = pd.read_excel(r"E:\Book2.xlsx")
df.column1 = df.column1.astype(str)

for r in df['column1']:                  # iterate over the column
    if bool(re.search('-', r)) != True:  # test if the string has '-'
        continue
    else:
        a = []  # holder for '-'
        b = []  # holder for numbers
        for c in r:
            if c == '-':     # if '-' then hold in a
                a.append(c)
            else:
                b.append(c)  # if number then hold in b
        t = ''.join(b + a)   # puts '-' at the end of the string
        z = t[1:] + t[:1]    # moves the first char to the end of the string
        r = z                # assigns the new string to the loop variable only
print(df)
Starting File:    Ending File:
column1           column1
41887             41887
1-41845           41845-1
2-41845           41845-2
40905             40905
1-41323           41323-1
2-41323           41323-2
3-41323           41323-3
41778             41778
You can use Series.str.replace():
If we recreate your example with a file containing all your values and retain column1 as the column name:
import pandas as pd

df = pd.read_csv('file.txt', header=None)
df.columns = ['column1']
df['column1'] = df['column1'].str.replace(r'(^\d)-(\d+)', r'\2-\1', regex=True)
print(df)
This will give the desired output: the pattern captures the leading count digit and the digits after the dash, and \2-\1 writes them back in swapped order. It replaces the old column with the new one and does it all in one step, without loops.
#in
41887
1-41845
2-41845
40905
1-41323
2-41323
3-41323
41778
#out
   column1
0    41887
1  41845-1
2  41845-2
3    40905
4  41323-1
5  41323-2
6  41323-3
7    41778
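Since the question also asked about the apply/split route, here is a short sketch of that approach (an alternative to str.replace above, assuming every prefixed value contains exactly one dash):

# split on the first dash and reattach the prefix at the end
def flip(s):
    parts = s.split('-', 1)
    return parts[1] + '-' + parts[0] if len(parts) == 2 else s

df['column1'] = df['column1'].apply(flip)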

New column and iteration over rows

I am trying to add a new column to my existing dataset (a pandas dataframe). This new column contains elements that satisfy a specific condition (please see the code below). I am iterating over rows, but I am not able to change the value of the row based on the condition (rows should get the value row_value[0] = var2 or row_value[0] = varB).
for index, row in sample_dataset.iterrows():
    row_value = ['Missing']
    for var1, var2 in var3:
        if row[0].endswith(var1):
            row_value[0] = var2
            break
    for varA, varB in varC:
        if row[0].endswith(varA):
            row_value[0] = varB
            break
Any help will be greatly appreciated. Thanks
Example:
Original Dataset:
Column
hello_world.a
goodmorning_world.b
bye_world.1
...
Lists are:
var1=['1','2','3']
var2=['11','22','33']
var3=list(zip(var1, var2))
similarly for varA, varB, varC:
varA=['a','b','c']
varB=['aa','bb','cc']
varC=list(zip(varA, varB))
I would like to have something like this:
Expected output
Column               New_column
hello_world.a        aa
goodmorning_world.b  bb
bye_world.1          11
...
So, let's go step by step through your code. First, let's define the dataframe:
import pandas as pd
# create dataframe with nans in the new column you want to fill
sample_dataset = pd.DataFrame({'Column':['hello_world.a','goodmorning_world.b','bye_world.1']})
# create new column which we will fill later
sample_dataset['New_column'] = pd.Series(index = sample_dataset.index, dtype='object')
Note that it is important to specify the type of the new column, because what you want to achieve is the creation of a new column with mixed element types (numbers and strings), and only Python 'object' columns can hold mixed types.
Let's print it to see what it looks like:
print(sample_dataset)
Out:
Column New_column
0 hello_world.a NaN
1 goodmorning_world.b NaN
2 bye_world.1 NaN
Now let's move on to the rest of your code:
# the other variables you defined
var1 = ['1', '2', '3']
var2 = ['11', '22', '33']
var3 = list(zip(var1, var2))
varA = ['a', 'b', 'c']
varB = ['aa', 'bb', 'cc']
varC = list(zip(varA, varB))

# your code
for index, row in sample_dataset.iterrows():
    row_value = ['Missing']
    for var1, var2 in var3:
        if row[0].endswith(var1):
            row_value[0] = var2
            break
    for varA, varB in varC:
        if row[0].endswith(varA):
            row_value[0] = varB
            break
Let's check if your code did something to the dataframe
Out:
Column New_column
0 hello_world.a NaN
1 goodmorning_world.b NaN
2 bye_world.1 NaN
Nothing seems to have changed, but something actually did: row_value. If I print it after running your code, I get:
print(row_value)
Out:
['11']
This is the most striking mistake, because it shows that your issue is not just with pandas and dataframes but with programming in general. If you want to modify a variable, you have to access that variable. Here the variable you want to change is your dataframe, which is called sample_dataset, but instead of assigning to it you assign to row_value in these lines:
row_value[0] = var2
row_value[0] = varB
This is why at the end of your code row_value is no longer ['Missing'] but ['11']: you are changing something, just not your dataframe.
So how to update the values in the new column of the initial dataframe? Here's how you should do it:
# iterating through rows, this is correct
for index, row in sample_dataset.iterrows():
    # you don't need row_value; you want the value of 'Column' in the current row
    value = row['Column']
    # you could also write "for var1, var2 in list(zip(var1, var2))" without
    # defining var3; not a mistake, it just makes the code more concise
    for var1, var2 in var3:
        # .endswith() raises an error on numbers; an alternative that works for
        # both numbers and strings is to take the last character with [-1]
        if value[-1] == var1:
            # this is how you change an element in a dataframe: .at[index, column]
            # here we set the element at the current index in 'New_column' to var2
            sample_dataset.at[index, 'New_column'] = var2
            break
    for varA, varB in varC:
        # same story as before
        if value[-1] == varA:
            sample_dataset.at[index, 'New_column'] = varB
            break
Let's print the dataframe again to check that this works:
print(sample_dataset)
Out:
Column New_column
0 hello_world.a aa
1 goodmorning_world.b bb
2 bye_world.1 11
So this time we did access the dataframe and changed the values of New_column successfully. Go through the code, and if you have doubts just comment; I can explain it in more detail.
As a final note, if what you want is just to take the last character of each string and double it in a new column, there are much better ways to do it. For example:
for index, row in sample_dataset.iterrows():
    value = row['Column']
    sample_dataset.at[index, 'New_column'] = value[-1] * 2
Again, if we print it we can see that three lines of code are enough to do the job:
print(sample_dataset)
Out:
Column New_column
0 hello_world.a aa
1 goodmorning_world.b bb
2 bye_world.1 11
In this way, you don't need to define varA, varB, varC and all the others, and you don't need breaks or nested for-loops. We can even compress the code to a single line using .apply():
sample_dataset['New_column'] = sample_dataset.apply(lambda x: x['Column'][-1]*2, axis=1)
This will give you the same results as before, but if you had trouble with your original code, this might be something to leave for the future, when you have a bit more confidence.
Also note that the last two methods will create all string elements, so even the last value 11 will be a string and not a number. This might be something you want to avoid, in which case you should stick with your original approach; in general, though, it is not a good idea to mix types in a column.
Edit
If you want to extract a part of a string according to a specific rule (in this case, everything after the last period), you can use regular expressions (regex). Regex in Python is implemented in the re library; what you need to do is:
# import library
import re

# define a pattern of interest; this one means
# 'everything from the end of the string back until you find a period'
pattern = r"([^.]*$)"

# now you can extract the final part of each element (one string from your
# column) using re.search
last_part = re.search(pattern, element).groups()[0]
Just to show what it does, let's take a fake value like 'hello_world.com' and apply the regex to it:
print(re.search(pattern, 'hello_world.com').groups()[0])
Out:
com
Now, in your code you want to replace value[-1] with re.search, so
if value[-1] == var1:
if value[-1] == varA:
should become
if re.search(pattern, value).groups()[0] == var1:
if re.search(pattern, value).groups()[0] == varA:
Remember to add the import for re and to define the pattern at the beginning of your code.
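If you prefer to avoid the explicit loop entirely, here is a hedged vectorized alternative (not part of the original answer): build a single suffix-to-replacement mapping and apply it with str.extract and map; fillna restores the 'Missing' default from the original code.

# one dict covering both suffix lists
mapping = dict(var3 + varC)  # {'1': '11', ..., 'a': 'aa', ...}
suffix = sample_dataset['Column'].str.extract(r'([^.]*)$', expand=False)
sample_dataset['New_column'] = suffix.map(mapping).fillna('Missing')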

Match pattern of urls in a pandas column

I am currently working on a data drop that contains a large number of links.
I want to filter the links against a list of websites, so I wrote an array that contains the xxx value of every website:
www.xxx.de/com/whatever
What I want to do is check every column entry against the values in the array.
list = ['forbes','bloomberg',...]
map = df['URL'].match(list)
df['URL'] = df.apply(map)
Something along those lines. I am just not sure how to work with the links in the column, since I have never worked with strings like this before.
Links are in the following format:
www.forbes.com/.../...
Is there any easy way to do this job without using urlparse or something similar?
Thanks a lot for your help!
I believe you need str.extract for the new column:
df = pd.DataFrame({'URL': ['www.forbes.com/.../...',
                           'www.bloomberg.com/something',
                           'www.webpage.com/something']})
L = ['forbes', 'bloomberg']
df['new'] = df['URL'].str.extract("(" + "|".join(L) + ")", expand=False)
print (df)
                           URL        new
0       www.forbes.com/.../...     forbes
1  www.bloomberg.com/something  bloomberg
2    www.webpage.com/something        NaN
But if you want to only filter the rows, use str.contains:
df = df[df['URL'].str.contains("|".join(L))]
print (df)
                           URL
0       www.forbes.com/.../...
1  www.bloomberg.com/something
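One caveat worth noting: str.contains (like the extract above) treats the joined string as a regular expression, so if any entry in L could contain regex metacharacters, escape them first, for example:

import re

pattern = "|".join(map(re.escape, L))
df = df[df['URL'].str.contains(pattern)]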

Pandas: query string where column name contains special characters

I am working with a data frame that has a structure something like the following:
In[75]: df.head(2)
Out[75]:
statusdata participant_id association latency response \
0 complete CLIENT-TEST-1476362617727 seeya 715 dislike
1 complete CLIENT-TEST-1476362617727 welome 800 like
stimuli elementdata statusmetadata demo$gender demo$question2 \
0 Sample B semi_imp complete male 23
1 Sample C semi_imp complete female 23
I want to be able to run a query string against the column demo$gender.
I.e.,
df.query("demo$gender=='male'")
But this has a problem with the $ sign. If I replace the $ sign with another delimiter (like -), the problem persists. Can I fix up my query string to avoid this problem? I would prefer not to rename the columns, as they correspond tightly with other parts of my application.
I really want to stick with a query string as it is supplied by another component of our tech stack and creating a parser would be a heavy lift for what seems like a simple problem.
Thanks in advance.
With recent versions of pandas (0.25 and later), you can escape a column name that contains special characters with backticks (`):
df.query("`demo$gender` == 'male'")
Another possibility is to clean the column names as a previous step in your process, replacing special characters with more appropriate ones. For instance:
(df
 .rename(columns=lambda value: value.replace('$', '_'))
 .query("demo_gender == 'male'")
)
For the interested, here is a simple procedure I used to accomplish the task:
# Identify invalid column names
invalid_column_names = [x for x in list(df.columns.values) if not x.isidentifier()]

# Make replacements in the query and keep track of them
# ('query' below holds the original query string)
# NOTE: This method fails if the frame has columns called REPL_0 etc.
replacements = dict()
for cn in invalid_column_names:
    r = 'REPL_' + str(invalid_column_names.index(cn))
    query = query.replace(cn, r)
    replacements[cn] = r

inv_replacements = {replacements[k]: k for k in replacements.keys()}

df = df.rename(columns=replacements)      # Rename the columns
df = df.query(query)                      # Carry out the query
df = df.rename(columns=inv_replacements)  # Restore the original column names
This amounts to identifying the invalid column names, transforming the query, and renaming the columns. Finally, we perform the query and then translate the column names back.
Credit to @chrisb for their answer that pointed me in the right direction.
The current implementation of query requires the string to be a valid Python expression, so column names must be valid Python identifiers. Your two options are renaming the column or using a plain boolean filter, like this:
df[df['demo$gender'] =='male']
