How can I get specific strings from a column value - python

What I want to do is delete certain parts of a string, keep only the value next to "ACoS", and insert it into a new column.
import pandas as pd
data = [{"Campaign" : "Sf l Spy l Branded l ACoS 20 l Manual NX"}]
df = pd.DataFrame(data)
df.insert(1,"targetAcos", 0)
df["targetAcos"] = df["Campaign"].str.replace(r' l ACoS \(.*)\l', r'\1', regex=True)
print(df["targetAcos"])
But I guess I am kind of bad at this; I couldn't get it to work correctly, so I hope you can explain how to do it.

I think the Pandas function you want to be using here is str.extract:
df["targetAcos"] = df["Campaign"].str.extract(r'\bl ACoS (\d+) l')
Or perhaps a more generic regex would be:
df["targetAcos"] = df["Campaign"].str.extract(r'\bACoS (\d+)\b')

Related

how to get rid of strings in each list of each row in pandas

Say I have a string column in pandas in which each row is made of a list of strings
Class  Student
One    [Adam, Kanye, Alice Stocks, Joseph Matthew]
Two    [Justin Bieber, Selena Gomez]
I want to get rid of all the names in each class wherever the length of the string is more than 8 characters.
So the resulting table would be:
Class  Student
One    Adam, Kanye
Most of the data would be gone because only Adam and Kanye satisfy the condition of len(StudentName)<8
I tried coming up with an apply/filter myself, but it seems that the code is running at the character level instead of the word level; can someone point out where I went wrong?
This is the code:
[[y for y in x if not len(y)>=8] for x in df['Student']]
Check the code below. It seems you are not defining what to split on, hence things are getting split at the character level.
import pandas as pd

df = pd.DataFrame({'Class': ['One', 'Two'],
                   'Student': ['[Adam, Kanye, Alice Stocks, Joseph Matthew]', '[Justin Bieber, Selena Gomez]']})

# Strip the brackets, split on commas, then keep only the short names
df['Filtered_Student'] = (df['Student'].str.replace(r"\[|\]", '', regex=True)
                          .str.split(',')
                          .apply(lambda x: ','.join([i for i in x if len(i) < 8])))
df[df['Filtered_Student'] != '']
Output:
  Class                                      Student Filtered_Student
0   One  [Adam, Kanye, Alice Stocks, Joseph Matthew]      Adam, Kanye
# If they're not actually lists, but strings:
if isinstance(df.Student[0], str):
    df.Student = df.Student.str[1:-1].str.split(', ')

# Apply your filtering logic:
df.Student = df.Student.apply(lambda s: [x for x in s if len(x) < 8])
Output:
  Class        Student
0   One  [Adam, Kanye]
1   Two             []
IIUC, this can be done in a one-liner with np.where:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Class': ['One', 'Two'],
                   'Student': [['Adam', 'Kanye', 'Alice Stocks', 'Joseph Matthew'],
                               ['Justin Bieber', 'Selena Gomez']]})

exploded = df.explode('Student')
exploded.iloc[np.where(exploded.Student.str.len() <= 8)].groupby('Class').agg(list).reset_index()
Output:
Class Student
0 One [Adam, Kanye]
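Note that class Two disappears entirely here, because all of its rows are filtered out before the groupby. If you need to keep empty classes, a per-row filter (a small sketch, assuming the same df as above) preserves them, matching the earlier answer's output:
df['Student'] = df['Student'].apply(lambda names: [n for n in names if len(n) <= 8])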

Convert Multiple Python Lines to a Concurrent DataFrame and Merge with Source Data

I apologize if this is a rudimentary question. I feel like it should be easy, but I cannot figure it out. The code listed below essentially looks at two columns in a CSV file and matches up job titles that have a similarity of at least 0.7. To do this, I use difflib.get_close_matches. However, the output is multiple single lines, and whenever I try to convert to a DataFrame, every single line becomes its own DataFrame and I cannot figure out how to merge/concat them. All code, as well as the current and desired outputs, are below. Any help would be much appreciated.
Current Code is:
import pandas as pd
import difflib

df = pd.read_csv('name.csv')
aLists = list(df['JTs'])
bLists = list(df['JT'])
n = 3
cutoff = 0.7

for aList in aLists:
    best = difflib.get_close_matches(aList, bLists, n, cutoff)
    print(best)
Current Output is:
['SW Engineer']
['Manu Engineer']
[]
['IT Help']
Desired Output is:
Output
0 SW Engineer
1 Manu Engineer
2 (blank)
3 IT Help
Any help would be greatly appreciated!
Here is a simple way to achieve this. I first convert each result to a string, strip the leading and trailing brackets from that string, and then append it to a global list.
import pandas as pd
import difflib

df = pd.read_csv('name.csv')
aLists = list(df['JTs'])
bLists = list(df['JT'])
n = 3
cutoff = 0.7

best = []
for aList in aLists:
    temp = difflib.get_close_matches(aList, bLists, n, cutoff)
    # Strip the list brackets from the string representation
    strippedString = str(temp).lstrip("[").rstrip("]")
    best.append(strippedString)
print(best)
Output
[
"'SW Engineer'",
"'Manu Engineer'",
'',
"'IT Help'"
]
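If the end goal is the desired one-column dataframe from the question, a hedged shortcut (assuming the best list built above) is to wrap it directly; note the inner quotes left over from str() are still present:
output_df = pd.DataFrame(best, columns=['Output'])
print(output_df)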
Here is another, better way to achieve this.
You can simply use numpy to concatenate multiple arrays into a single one, and then convert it back to a plain list if you want.
import pandas as pd
import difflib
import numpy as np

df = pd.read_csv('name.csv')
aLists = list(df['JTs'])
bLists = list(df['JT'])
n = 3
cutoff = 0.7

best = []
for aList in aLists:
    temp = difflib.get_close_matches(aList, bLists, n, cutoff)
    best.append(temp)

# Use concatenate() to join the per-row match lists into a single array
combinedNumpyArray = np.concatenate(best)
# Convert the numpy array back to a plain Python list
normalArray = combinedNumpyArray.tolist()
print(normalArray)
Output
['SW Engineer', 'Manu Engineer', 'IT Help']
Thanks
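One caveat (my observation, not from the original answer): np.concatenate silently drops the empty list for the row with no match, which is why the blank entry from the desired output is missing above. A small sketch that keeps a blank placeholder instead:
best = []
for aList in aLists:
    temp = difflib.get_close_matches(aList, bLists, n, cutoff)
    best.append(temp[0] if temp else '')  # keep '' when there is no match
print(best)  # ['SW Engineer', 'Manu Engineer', '', 'IT Help']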
You could use pandas' .apply() to run your function on each entry. The result can then either be added as a new column or used to create a new dataframe.
For example:
import pandas as pd
import difflib

def get_best_match(word):
    matches = difflib.get_close_matches(word, JT, n, cutoff)
    return matches[0] if matches else None

df = pd.read_csv('name.csv')
JT = df['JT']
n = 3
cutoff = 0.7

df['Output'] = df['JTs'].apply(get_best_match)
Or for a new dataframe:
df_output = pd.DataFrame({'Output' : df['JTs'].apply(get_best_match)})
Giving you:
                      JTs             JT         Output
0       Software Engineer  Manu Engineer    SW Engineer
1  Manufacturing Engineer    SW Engineer  Manu Engineer
2  Human Resource Manager        IT Help           None
3            IT Help Desk              f        IT Help
Or:
Output
0 SW Engineer
1 Manu Engineer
2 None
3 IT Help
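If you would rather see a blank than None, matching the "(blank)" in the desired output, one small tweak (my suggestion, not part of the original answer) is to fill the missing values afterwards:
df['Output'] = df['JTs'].apply(get_best_match).fillna('')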

How do I remove punctuation from a string in a pandas Series

I am trying to remove punctuation from a pandas Series. My problem is that I am unable to iterate over all the lines in the Series. This is the code that I tried, but it is taking forever to run. Note that my dataset is a bit large, around 112 MB (200,000 rows).
import pandas as pd
import string

df = pd.read_csv('let us see.csv')
s = set(string.punctuation)

for st in df.reviewText.str:
    for j in s:
        if j in st:
            df.reviewText = df.reviewText.str.replace(j, '')

df.reviewText = df.reviewText.str.lower()
df['clean_review'] = df.reviewText
print(df.clean_review.tail())
D-E-N's answer is pretty good. I'll just add another solution showing how to improve the performance of your code.
Iterating over a list version of your Series should work faster than your approach.
import pandas as pd
import string

def replace_chars(text, chars):
    for c in chars:
        text = text.replace(c, '')
    return text.lower()

df = pd.read_csv('let us see.csv')
s = set(string.punctuation)

reviewTextList = df.reviewText.astype(str).tolist()
reviewTextList = [replace_chars(x, s) for x in reviewTextList]
df['clean_review'] = reviewTextList
print(df.clean_review.tail())
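For completeness, a fully vectorized sketch (my addition; it assumes the same 'let us see.csv' file and reviewText column) that avoids the Python-level loop by building one regex character class from string.punctuation:
import re
import string

import pandas as pd

df = pd.read_csv('let us see.csv')
# One vectorized pass: strip every punctuation character, then lowercase
pattern = f"[{re.escape(string.punctuation)}]"
df['clean_review'] = df.reviewText.astype(str).str.replace(pattern, '', regex=True).str.lower()
print(df.clean_review.tail())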

How to check the pattern of a column in a dataframe

I have a dataframe which has some id's. I want to check the pattern of those column values.
Here is how the column looks:
id: {ASDH12HK,GHST67KH,AGSH90IL,THKI86LK}
I want to write code that can distinguish characters from numerics in the pattern above and display an output like 'SSSS99SS' as the pattern of the column, where 'S' represents a character and '9' represents a numeric. This is a large dataset, so I can't predefine the positions the characters and numerics will be in; I want the code to work out the positions itself. I am new to python, so any leads will be helpful!
You can try something like:
my_string = "ASDH12HK"

def decode_pattern(my_string):
    my_string = ''.join('9' if s.isdigit() else s for s in my_string)
    my_string = ''.join('S' if s.isalpha() else s for s in my_string)
    return my_string

decode_pattern(my_string)
Output:
'SSSS99SS'
You can apply this to the column in your dataframe as well, as shown below:
import pandas as pd
df = pd.DataFrame(['ASDH12HK','GHST67KH','AGSH90IL','THKI86LK', 'SOMEPATTERN123'], columns=['id'])
df['pattern'] = df['id'].map(decode_pattern)
df
Output:
               id         pattern
0        ASDH12HK        SSSS99SS
1        GHST67KH        SSSS99SS
2        AGSH90IL        SSSS99SS
3        THKI86LK        SSSS99SS
4  SOMEPATTERN123  SSSSSSSSSSS999
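An equivalent regex-based version (my addition, not from the original answer) does the same two passes with re.sub:
import re

def decode_pattern_re(s):
    # Replace every digit with '9' first, then every letter with 'S'
    return re.sub(r'[A-Za-z]', 'S', re.sub(r'\d', '9', s))

print(decode_pattern_re("ASDH12HK"))  # SSSS99SS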
You can use regular expressions:
import re

st = "SSSS99SSSS"
a = re.match("[A-Za-z]{4}[0-9]{2}[A-Za-z]{4}", st)
It will return a match if the string starts with 4 characters followed by 2 numerics and then 4 more characters.
So you can use this in your df to filter the df.
You can use the function findall() from the re module:
import re
text = "ASDH12HK,GHST67KH,AGSH90IL,THKI86LK"
result = re.findall("[A-Za-z]{4}[0-9]{2}[A-Za-z]{2}", text)
print(result)
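With the sample text above, this should print all four ids: ['ASDH12HK', 'GHST67KH', 'AGSH90IL', 'THKI86LK'].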

Is there any regular expression in pandas where we can define the first and last characters, and whatever comes in between

If I have column names like this in a df:
Q24r639606c1: Good Quality
Q24r64500c1: Bad
Q25r64500c1: Amazing
Q24r64500c2: Worst
Q24r5200c1: Nice
with sample rows like:
A A B B
D F C G K
I want to filter the columns which start with "Q24" and have "c1" before the colon ":".
I am trying this, but here I can pass only one string:
Selected_Columns = df.filter(regex = 'Q24r')
Filter using a regex that starts with Q24 (^Q24), then allows anything (.*) until it finds exactly 'c1:'.
import pandas as pd

df = pd.DataFrame(columns=['Q24r639606c1: Good Quality', 'Q24r64500c1: Bad',
                           'Q25r64500c1: Amazing', 'Q24r64500c2: Worst', 'Q24r5200c1: Nice'])
df.filter(regex='^Q24.*c1:').columns
Index(['Q24r639606c1: Good Quality', 'Q24r64500c1: Bad', 'Q24r5200c1: Nice'], dtype='object')
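To select the matching columns themselves rather than just their names, the same regex works with df.filter directly (mirroring the Selected_Columns variable from the question):
Selected_Columns = df.filter(regex='^Q24.*c1:')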
