I am trying to replace
'AMAT_0000006951_10Q_20200726_Filing Section: Risk'
with:
'AMAT 10Q Filing Section: Risk'
However, everything up until 'Filing Section: Risk' will be constantly changing, but the positions stay the same. I just want to pull the characters from position 0 to 5 and from 15 through 19.
df['section'] = df['section'].str.replace(
I'd like to manipulate this, but I'm not sure how.
Any help is much appreciated!
Given your series as s:
s.str.slice(0, 5) + s.str.slice(15, 19)   # if substring-ing
s.str.replace(r'\d{5}', '', regex=True)   # for a 5-length digit string
You may need to adjust your numbers to index properly. If that doesn't work, you probably want to use a regular expression to get rid of some length of numbers (as above, with the example of 5).
Or in a single line to produce the final output you have above:
s.str.replace(r'\d{10}_|\d{8}_', '', regex=True).str.replace('_', ' ')
Though, it might not be wise to replace the underscores. Instead, if they change, explode the data into various columns which can be worked on separately.
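For example, a minimal sketch of that split-into-columns idea, assuming every value splits into the same five underscore-delimited pieces as your example (the column names below are made up):
parts = s.str.split('_', expand=True)
parts.columns = ['ticker', 'cik', 'form', 'date', 'rest']   # hypothetical labels
result = parts['ticker'] + ' ' + parts['form'] + ' ' + parts['rest']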
If you want to replace a fixed length/position of chars, use str.slice_replace:
df['section'] = df['section'].str.slice_replace(6, 14, ' ')
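For the exact example in the question, a sketch that chains two slice_replace calls to produce the target string (note the second call's positions refer to the already-shortened value):
# replace '_0000006951_' (positions 4-15), then '_20200726_' (positions 8-17 of the shortened string)
df['section'] = (df['section'].str.slice_replace(4, 16, ' ')
                              .str.slice_replace(8, 18, ' '))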
Other people would probably use regex to replace pieces in your string. However, I would:
Split the string
Append the piece if it isn't a number
Join the remaining data
Like so:
s = 'AMAT_0000006951_10Q_20200726_Filing Section: Risk'
n = []
for i in s.split('_'):
    try:
        i = int(i)
    except ValueError:
        n.append(i)

print(' '.join(n))
AMAT 10Q Filing Section: Risk
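The same idea can be applied to the whole DataFrame column with apply; a minimal sketch, assuming the column is named 'section' as in the question, and using str.isdigit() in place of the try/except for brevity:
def drop_numeric_pieces(value):
    # keep only the underscore-delimited pieces that are not pure numbers
    return ' '.join(p for p in value.split('_') if not p.isdigit())

df['section'] = df['section'].apply(drop_numeric_pieces)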
Edit:
Re-reading your question, if you are just looking to substring:
Grabbing the first four characters:
s = 'AMAT_0000006951_10Q_20200726_Filing Section: Risk'
print(s[:4])     # print index 0 to 3 == the first four characters
print(s[15:19])  # print index 15 to 18
print(s[15:]) # print index 15 to the end.
If you would like to just replace pieces:
print(s.replace('_', ' '))
You could throw this in one line as well:
print((s[:4] + s[15:19] + s[28:]).replace('_', ' '))
'AMAT 10Q Filing Section: Risk'
I have a bunch of strings in a pandas dataframe that contain numbers in them. I could run the below code and replace them all:
df.feature_col = df.feature_col.str.replace('\d+', ' NUM ')
But what I need to do is replace any 10 digit number with a string like masked_id, any 16 digit numbers with account_number, or any three-digit numbers with yet another string, and so on.
How do I go about doing this?
PS: since my data size is less, a less optimal way is also good enough for me.
Another way is replace with the option regex=True and a dictionary. You can also use somewhat more relaxed match patterns (applied in order) than Tim's:
import pandas as pd

# test data
df = pd.DataFrame({'feature_col': ['this has 1234567',
                                   'this has 1234',
                                   'this has 123',
                                   'this has none']})

# patterns in decreasing length order
# these of course would replace '12345' with 'ID45' :-)
df['feature_col'] = df.feature_col.replace({r'\d{7}': 'ID7',
                                            r'\d{4}': 'ID4',
                                            r'\d{3}': 'ID3'},
                                           regex=True)
Output:
     feature_col
0   this has ID7
1   this has ID4
2   this has ID3
3  this has none
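Because of that caveat, here is a sketch for the lengths the question actually asks about, anchoring each pattern with word boundaries so a longer number is never partially matched by a shorter pattern ('NUM3' is just a made-up label for the three-digit case):
df.feature_col = df.feature_col.replace({r'\b\d{16}\b': 'account_number',
                                         r'\b\d{10}\b': 'masked_id',
                                         r'\b\d{3}\b': 'NUM3'},   # placeholder label
                                        regex=True)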
You could do a series of replacements, one for each length of number:
df.feature_col = df.feature_col.str.replace(r'\b\d{3}\b', ' 3mask ', regex=True)
df.feature_col = df.feature_col.str.replace(r'\b\d{10}\b', 'masked_id', regex=True)
df.feature_col = df.feature_col.str.replace(r'\b\d{16}\b', 'account_number', regex=True)
I have a dataframe with a list of poorly spelled clothing types. I want them all in the same format; for example, I have "trous", "trouse" and "trousers", and I would like to replace the first 2 with "trousers".
I have tried using string.replace, but it seems it's taking the first "trous" and changing it to "trousers" as it should, and when it gets to "trouse" it works also, but when it gets to "trousers" it makes "trousersersers"! I think it's matching the strings which contain trous and trouse and trousers and changing them all.
Is there a way I can limit the string.replace to just look for exactly "trous"?
Here's what I've tried so far. As you can see I have a good few changes to make; most of them work OK, but it's the likes of trousers and t-shirts, which have a few similar changes to be made, that are causing the upset.
newTypes = []
for string in types:
    underwear = string.replace(('UNDERW'), 'UNDERWEAR').replace('HANKY', 'HANKIES').replace('TIECLI', 'TIECLIPS').replace('FRAGRA', 'FRAGRANCES').replace('ROBE', 'ROBES').replace('CUFFLI', 'CUFFLINKS').replace('WALLET', 'WALLETS').replace('GIFTSE', 'GIFTSETS').replace('SUNGLA', 'SUNGLASSES').replace('SCARVE', 'SCARVES').replace('TROUSE ', 'TROUSERS').replace('SHIRT', 'SHIRTS').replace('CHINO', 'CHINOS').replace('JACKET', 'JACKETS').replace('KNIT', 'KNITWEAR').replace('POLO', 'POLOS').replace('SWEAT', 'SWEATERS').replace('TEES', 'T-SHIRTS').replace('TSHIRT', 'T-SHIRTS').replace('SHORT', 'SHORTS').replace('ZIP', 'ZIP-TOPS').replace('GILET ', 'GILETS').replace('HOODIE', 'HOODIES').replace('HOODZIP', 'HOODIES').replace('JOGGER', 'JOGGERS').replace('JUMP', 'SWEATERS').replace('SWESHI', 'SWEATERS').replace('BLAZE ', 'BLAZERS').replace('BLAZER ', 'BLAZERS').replace('WC', 'WAISTCOATS').replace('TTOP', 'T-SHIRTS').replace('TROUS', 'TROUSERS').replace('COAT', 'COATS').replace('SLIPPE', 'SLIPPERS').replace('TRAINE', 'TRAINERS').replace('DECK', 'SHOES').replace('FLIP', 'SLIDERS').replace('SUIT', 'SUITS').replace('GIFTVO', 'GIFTVOUCHERS')
    newTypes.append(underwear)
types = newTypes
Assuming you're okay with not using string.replace(), you can simply do this:
lst = ["trousers", "trous", "trouse"]

for i in range(len(lst)):
    if "trous" in lst[i]:
        lst[i] = "trousers"

print(lst)
# Prints ['trousers', 'trousers', 'trousers']
This checks if the shortest substring, trous, is part of the string, and if so converts the entire string to trousers.
Use a dict for the strings to be replaced:
d = {
    'trous': 'trousers',
    'trouse': 'trousers',
    # ...
}
newtypes = [d.get(string, string) for string in types]
d.get(string, string) will return string unchanged if string is not a key in d.
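A quick usage sketch with a made-up list; only strings that appear as keys in d are changed, everything else passes through untouched:
types = ['trous', 'trouse', 'trousers', 'HANKY']
newtypes = [d.get(string, string) for string in types]
print(newtypes)   # ['trousers', 'trousers', 'trousers', 'HANKY']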
I am looking at the entire transcript of the play Romeo and Juliet, and I want to see how many times 'Romeo' and 'Juliet' appear on the same line within the entire play. AKA: how many different lines in the play have both the words 'Romeo' and 'Juliet' in them?
Note: 'gbdata' is the name of my data aka the entire transcript of the play. For purposes of testing, we might use:
gbdata = '''
Romeo and Juliet # this should count once
Juliet and Romeo, and Romeo, and Juliet # this also should count once
Romeo # this should not count at all
Juliet # this should not count at all
some other string # this should not count at all
'''
The correct answer should be 2, since only the first two lines contain both strings; and more matches within a line don't add to the total count.
This is what I have done so far:
gbdata.count('Romeo' and 'Juliet') # counts 'Juliet's, returning 4
and
gbdata.count('Romeo') + gbdata.count('Juliet') # combines individual counts, returning 8
How can I get the desired output for the above test string, 2?
You can't use str.count() here; it's not built for your purpose, since it doesn't have any concept of "lines" (and the expression 'Romeo' and 'Juliet' simply evaluates to 'Juliet', which is why your first attempt just counts Juliets). That said, given a string, you can break it down into a list of individual lines by splitting on '\n', the newline character.
A very terse approach might be:
count = sum((1 if ('Romeo' in l and 'Juliet' in l) else 0) for l in gbdata.split('\n'))
Expanding that out into a bunch of separate commands might look like:
count = 0
for line in gbdata.split('\n'):
    if 'Romeo' in line and 'Juliet' in line:
        count += 1
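For the test string above, both versions return 2. An equivalent one-liner using splitlines(), which also copes with '\r\n' line endings, might look like:
count = sum('Romeo' in line and 'Juliet' in line for line in gbdata.splitlines())
print(count)   # 2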
I'll give a bit of a snippet of the code I made. Here it is:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = urlopen("http://sports.yahoo.com/nhl/scoreboard?d=2013-01-19")
content = url.read()
soup = BeautifulSoup(content)

def yahooscores():
    for table in soup.find_all('table', class_='player-title'):
        for row in table.find_all('tr'):
            date = None
            for cell in row.find_all('td', class_='yspsctnhdln'):
                for text in cell:
                    date = cell.text
            if date is not None:
                print('%s' % (date) + ", 2013:")
I was trying to go about stripping the words "Scores & Schedules" from the date part of the website, but somehow I could not do it with the .split() and .strip() methods.
So, let me explain what I wish to do, with the above website as an example.
So far, this is what comes out for a date:
Scores & Schedule: Jan 19, 2013:
I just want this:
Jan 19, 2013:
Is there anything in particular I need to know in order to strip those 3 words?
The actual content of cell.text is:
'\nScores & Schedule: Jan 19\n'
... so it makes more sense to get what you need out of that (the last two words) first, and then add ', 2013:' to it, as I think you're trying to do already. A handy feature of split() with no arguments is that it splits on any run of whitespace and ignores leading and trailing whitespace, so probably the most robust way to get what you want is to change your last line to:
print(' '.join(date.split()[-2:]) + ', 2013:')
This splits date into a list of words with .split(), then uses [-2:] to get the last two words in the list, then joins them back together with a space using ' '.join(...), and finally adds ', 2013:' to the end before printing the result.
As a side note, '%s' % (date) in your original version does absolutely nothing: all you're doing is replacing date with itself. It might be worth familiarising yourself with the documentation on percent-formatting so that you understand why.
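A quick way to convince yourself of that:
date = '\nScores & Schedule: Jan 19\n'
print('%s' % (date) == date)   # True: the formatting just reproduces the string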
Keeping it simple:
>>> s = "Scores & Schedule: Jan 19, 2013:"
>>> s.replace("Scores & Schedule:", "")
' Jan 19, 2013:'
date = "Scores & Schedule: Jan 19, 2013:"
There are many options:
date = date[19:]
date = date.replace("Scores & Schedule: ", "")
date = date.split(":")[1].strip()+":"
to name a few.
Just replace the unwanted part with an empty string.
>>> "Scores & Schedule: Jan 19, 2013:".replace("Scores & Schedule:", "")
' Jan 19, 2013:'
How about:
print(date[20:].strip('\n') + ', 2013')
this is assuming that there will ALWAYS be 'Scores & Schedule: ' in the response.
I have a .csv file which looks like:
['NAME' " 'RA_I1'" " 'DEC_I1'" " 'Mean_I1'" " 'Median_I1'" " 'Mode_I1'" ...]"
where this string carries on for (I think) 95 entries, and the entire file is over a thousand rows deep. I want to remove all of the characters [ ' " and just have every entry separated by a single white space (' ').
So far I've tried:
import pandas as pd
df1 = pd.read_table('slap.txt')
for char in df1:
    if char in " '[":
        df1.replace(char, '')

print df1
Where I'm just 'testing' the code to see if it will do what I want it to, it's not. I'd like to implement it on the entire file, but I'm not sure how.
I've checked this old post out, but I'm not quite getting it to work for my purposes. I've also played with the linked post; the only problem with it seems to be that all the entries are spaced twice rather than just once.
This looks like something you ought to be able to grab with a (not particularly pretty) regular expression in the sep argument of read_csv:
In [11]: pd.read_csv(file_name, sep='\[\'|\'\"\]|[ \'\"]+', header=None)
Out[11]:
     0     1      2       3        4          5        6    7
0  NaN  NAME  RA_I1  DEC_I1  Mean_I1  Median_I1  Mode_I1  NaN
You can play about with the regular expression til it truly fits your needs.
To explain this one:
sep = ('\[\' # each line startswith [' (the | means or)
'|\'\"\]' # endswith '"] (at least the one I had)
'|[ \'\"]+') # this is the actual delimiter, the + means at least one, so it's a string of ", ' and space in any order.
You can see this hack has left a NaN column at either end. The main reason this is pretty awful is the inconsistency of your "csv"; I would definitely recommend cleaning it up. Of course, one way to do that is just to read it with pandas and then write it back out with to_csv. If it's generated by someone else... complain (!).
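A sketch of that clean-up, assuming the regular expression above parses your real file correctly (the file names here are made up):
import pandas as pd

df = pd.read_csv('slap.txt', sep=r'\[\'|\'\"\]|[ \'\"]+',
                 header=None, engine='python')
df = df.dropna(axis=1, how='all')   # drop the all-NaN artefact columns at either end
df.to_csv('slap_clean.txt', sep=' ', index=False, header=False)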
Have you tried string.strip(s[, chars])?
http://docs.python.org/2/library/string.html
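For instance, strip with an explicit set of characters removes any of them from both ends of a string; a tiny illustration on one token from the sample row:
entry = " 'RA_I1'\""   # one token from the sample row above
print(entry.strip(" '\"["))   # RA_I1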