I am working on a dataset that looks somewhat like this (using python and pandas):
date text
0 Jul 31 2020 Sentence Numero Uno #cool
1 Jul 31 2020 Second sentence
2 Jul 31 2020 Test sentence 3 #thanks
So I use this bit of code I found online to remove the Hashtags like #cool #thanks as well as make everything lowercase.
for i in range(df.shape[0]) :
df['text'][i] = ' '.join(re.sub("(#[A-Za-z0-9]+)", " ", df['text'][i]).split()).lower()
That works, however I now don't want to delete the hashtags completely but save them in a extra column like this:
date text hashtags
0 Jul 31 2020 sentence numero uno #cool
1 Jul 31 2020 second sentence
2 Jul 31 2020 test sentence 3 #thanks
Can anyone help me with that? How can I do that?
Thanks in advance.
Edit: As some strings contain multiple hashtags it should be stored in the hashtag column as a list.
One possible way to go about this would be the following:
df['hashtag'] = ''
for i in range(len(df)) :
df['hashtag'][i] = ' '.join(re.findall("(#[A-Za-z0-9]+)", df['text'][i]))
df['text'][i] = ' '.join(re.sub("(#[A-Za-z0-9]+)", " ", df['text'][i]).split()).lower()
So, first you create an empty string column called hashtag. Then, in every loop through the rows, you first extract any number of unique hashtags that might exist in the text into the new column. If none exist, you end up with an empty string (you can change that if you like to something else). And then, you replace the hashtag with an empty space, as you were already doing before.
If it happens that in some texts you have more than 1 hashtag, depending on how you want to use the hashtags later, it could be easier to actually store them as a list, instead of " ".join(...). So, if you want to store them as a list, you could replace row 3 with:
df['hashtag'][i] = re.findall("(#[A-Za-z0-9]+)", df['text'][i])
which just returns a list of hashtags.
Use Series.str.findall with Series.str.join:
df['hashtags'] = df['text'].str.lower().str.findall(r"(\#[A-z0-9]+)").str.join(' ')
You can use this string method of pandas:
pattern = r"(\#[A-z0-9]+)"
df['text'].str.extract(pattern, expand=True)
If your string contains multiple matches, you should use str.extractall:
df['text'].str.extractall(pattern)
I added a couple of lines below your code, it should work:
df['hashtags']=''
for i in range(df.shape[0]) :
df['text'][i] = ' '.join(re.sub("(#[A-Za-z0-9]+)", " ", df['text'][i]).split()).lower()
l=df['text'][i].split(0)
s=[k for k in l if k[0]=='#']
if len(s)>=1:
df['hashtags'][i]=' '.join(s)
Use newdf = pd.DataFrame(df.row.str.split('#',1).tolist(),columns = ['text','hashtags']) instead of you for-loop. This will create a new Dataframe. Then you can set df['text']=newdf['text'] and df['hashtags']=newdf['hashtags'].
Related
I am trying to replace
'AMAT_0000006951_10Q_20200726_Filing Section: Risk'
with:
'AMAT 10Q Filing Section: Risk'
However, everything up until Filing Section: Risk will be constantly changing, except for positioning. I just want to pull the characters from position 0 to 5 and from 15 through 19.
df['section'] = df['section'].str.replace(
I'd like to manipulate this but not sure how?
Any help is much appreciated!
Given your series as s
s.str.slice(0, 5) + s.str.slice(15, 19) # if substring-ing
s.str.replace(r'\d{5}', '') # for a 5-length digit string
You may need to adjust your numbers to index properly. If that doesn't work, you probably want to use a regular expression to get rid of some length of numbers (as above, with the example of 5).
Or in a single line to produce the final output you have above:
s.str.replace(r'\d{10}_|\d{8}_', '').str.replace('_', ' ')
Though, it might not be wise to replace the underscores. Instead, if they change, explode the data into various columns which can be worked on separately.
If you want to replace a fix length/position of chars, use str.slice_replace to replace
df['section'] = df['section'].str.slice_replace(6, 14, ' ')
Other people would probably use regex to replace pieces in your string. However, I would:
Split the string
append the piece if it isn't a number
Join the remaining data
Like so:
s = 'AMAT_0000006951_10Q_20200726_Filing Section: Risk'
n = []
for i in s.split('_'):
try:
i = int(i)
except ValueError:
n.append(i)
print(' '.join(n))
AMAT 10Q Filing Section: Risk
Edit:
Re-reading your question, if you are just looking to substring:
Grabbing the first 5 characters:
s = 'AMAT_0000006951_10Q_20200726_Filing Section: Risk'
print(s[:4]) # print index 0 to 4 == first 5
print(s[15:19]) # print index 15 to 19
print(s[15:]) # print index 15 to the end.
If you would like to just replace pieces:
print(s.replace('_', ' '))
you could throw this in one line as well:
print((s[:4] + s[15:19] + s[28:]).replace('_', ' '))
'AMAT 10Q Filing Section: Risk'
I have a dataframe called tweetscrypto and I am trying to remove all the words from the column "text" starting with the character "#" and gather the result in a new column "clean_text". The rest of the words should stay exactly the same:
tweetscrypto['clean_text'] = tweetscrypto['text'].apply(filter(lambda x:x[0]!='#', x.split()))
it does not seem to work. Can somebody help?
Thanks in advance
Please str.replace string starting with #
Sample Data
text
0 News via #livemint: #RBI bars banks from links
1 Newsfeed from #oayments_source: How Africa
2 is that bitcoin? not my thing
tweetscrypto['clean_text']=tweetscrypto['text'].str.replace('(\#\w+.*?)',"")
Still, can capture # without escaping as noted by #baxx
tweetscrypto['clean_text']=tweetscrypto['text'].str.replace('(#\w+.*?)',"")
clean_text
0 News via : bars banks from links
1 Newsfeed from : How Africa
2 is that bitcoin? not my thing
In this case it might be better to define a method rather than using a lambda for mainly readability purposes.
def clean_text(X):
X = X.split()
X_new = [x for x in X if not x.startswith("#")
return ' '.join(X_new)
tweetscrypto['clean_text'] = tweetscrypto['text'].apply(clean_text)
I have a text which looks like an email body as follows.
To: Abc Cohen <abc.cohen#email.com> Cc: <braggis.mathew#nomail.com>,<samanth.castillo#email.com> Hi
Abc, I happened to see your report. I have not seen any abnormalities and thus I don't think we
should proceed to Braggis. I am open to your thought as well. Regards, Abc On Tue 23 Jul 2017 07:22
PM
Tony Stark wrote:
Then I have a list of key words as follows.
no_wds = ["No","don't","Can't","Not"]
yes_wds = ["Proceed","Approve","May go ahead"]
Objective:
I want to first search the text string as given above and if any of the key words as listed above is (or are) present then I want to extract the strings in between those key words. In this case, we have Not and don't keywords matched from no_wds. Also we have Proceed key word matched from yes_wds list. Thus the text I want to be extracted as list as follows
txt = ['seen any abnormalities and thus I don't think we should','think we should']
My approach:
I have tried
re.findall(r'{}(.*){}'.format(re.escape('|'.join(no_wds)),re.escape('|'.join(yes_wds))),text,re.I)
Or
text_f = []
for i in no_wds:
for j in yes_wds:
t = re.findall(r'{}(.*){}'.format(re.escape(i),re.escape(j)),text, re.I)
text_f.append(t)
Didn't get any suitable result. Then I tried str.find() method, there also no success.
I tried to get a clue from here.
Can anybody help in solving this? Any non-regex solution is somewhat I am keen to see, as regex at times are not a good fit. Having said the same, if any one can come up with regex based solution where I can iterate the lists it is welcome.
Loop through the list containing the keys, use the iterator as a splitter (whatever.split(yourIterator)).
EDIT:
I am not doing your homework, but this should get you on your way:
I decided to loop through the splitted at every space list of the message, search for the key words and add the index of the hits into a list, then I used those indexes to slice the message, probably worth trying to slice the message without splitting it, but I am not going to do your homework. And you must find a way to automate the process when there are more indexes, tip: check if the size is even or you are going to have a bad time slicing.
*Note that you should replace the \n characters and find a way to sort the key lists.
message = """To: Abc Cohen <abc.cohen#email.com> Cc: <braggis.mathew#nomail.com>,<samanth.castillo#email.com> Hi
Abc, I happened to see your report. I have not seen any abnormalities and thus I don't think we
should proceed to Braggis. I am open to your thought as well. Regards, Abc On Tue 23 Jul 2017 07:22"""
no_wds = ["No","don't","Can't","Not"]
yes_wds = ["Proceed","Approve","May go ahead"]
splittedMessage = message.split( ' ' )
msg = []
for i in range( 0, len( splittedMessage ) ):
temp = splittedMessage[i]
for j, k in zip( no_wds, yes_wds ):
tempJ = j.lower()
tempK = k.lower()
if( tempJ == temp or tempK == temp ):
msg.append( i )
found = ' '.join( splittedMessage[msg[0]:msg[1]] )
print( found )
I'll give a bit of the snippit of code I made. Here it is:
url = urlopen("http://sports.yahoo.com/nhl/scoreboard?d=2013-01-19")
content = url.read()
soup = BeautifulSoup(content)
def yahooscores():
for table in soup.find_all('table', class_='player-title'):
for row in table.find_all('tr'):
date = None
for cell in row.find_all('td', class_='yspsctnhdln'):
for text in cell:
date = cell.text
if date is not None:
print ('%s' % (date) + ", 2013:")
I was trying to go about stripping the words "Scores & Schedules" from the date part of the website, but I could not somehow do it with the .split() and .strip() methods.
So, let me explain what I wish to do, with the above website as an example.
So far, this is what comes out for a date:
Scores & Schedule: Jan 19, 2013:
I just want this:
Jan 19, 2013:
Is there anything in particular I need to know in order to strip those 3 words?
The actual content of cell.text is:
'\nScores & Schedule: Jan 19\n'
... so it makes more sense to get what you need out of that (the last two words) first, and then add ', 2013:' to it, as I think you're trying to do already. A handy feature of split() is that it automatically strips leading and trailing whitespace, so probably the most robust way to get what you want is to change your last line to:
print(' '.join(date.split()[-2:]) + ', 2013:')
This splits date into a list of words with .split(), then uses [-2:] to get the last two words in the list, then joins them back together with a space using ' '.join(...), and finally adds ', 2013:' to the end before printing the result.
As a side note, '%s' % (date) in your original version does absolutely nothing: all you're doing is replacing date with itself. It might be worth familiarising yourself with the documentation on percent-formatting so that you understand why.
Keeping it simple:
>>> s = "Scores & Schedule: Jan 19, 2013:"
>>> s.replace("Scores & Schedule:", "")
' Jan 19, 2013:'
date = "Scores & Schedule: Jan 19, 2013:"
There are many options:
date = date[19:]
date = date.replace("Scores & Schedule: ", "")
date = date.split(":")[1].strip()+":"
to name a few.
Just replace the unwanted part with an empty string.
>>> "Scores & Schedule: Jan 19, 2013:".replace("Scores & Schedule:", "")
' Jan 19, 2013:'
How about:
print(date[20:].strip('\n') + ', 2013')
this is assuming that there will ALWAYS be 'Scores & Schedule: ' in the response.
In python I need a logic for below scenario I am using split function to this.
I have string which contains input as show below.
"ID674021384 25/01/1986 heloo hi thanks 5 minutes and 25-01-1988."
"ID909900000 25-01-1986 hello 10 minutes."
And output should be as shown below which replace date format to "date" and time format to "time".
"ID674021384 date hello hi thanks time date."
"ID909900000 date hello time."
And also I need a count of date and time for each Id as show below
ID674021384 DATE:2 TIME:1
ID909900000 DATE:1 TIME:1
>>> import re
>>> from collections import defaultdict
>>> lines = ["ID674021384 25/01/1986 heloo hi thanks 5 minutes and 25-01-1988.", "ID909900000 25-01-1986 hello 10 minutes."]
>>> pattern = '(?P<date>\d{1,2}[/-]\d{1,2}[/-]\d{4})|(?P<time>\d+ minutes)'
>>> num_occurences = {line:defaultdict(int) for line in lines}
>>> def repl(matchobj):
num_occurences[matchobj.string][matchobj.lastgroup] += 1
return matchobj.lastgroup
>>> for line in lines:
text_id = line.split(' ')[0]
new_text = re.sub(pattern,repl,line)
print new_text
print '{0} DATE:{1[date]} Time:{1[time]}'.format(text_id, num_occurences[line])
print ''
ID674021384 date heloo hi thanks time and date.
ID674021384 DATE:2 Time:1
ID909900000 date hello time.
ID909900000 DATE:1 Time:1
For parsing similar lines of text, like log files, I often use regular expressions using the re module. Though split() would work well also for separating fields which don't contain spaces and the parts of the date, using regular expressions allows you to also make sure the format matches what you expect, and if need be warn you of a weird looking input line.
Using regular expressions, you could get the individual fields of the date and time and construct date or datetime objects from them (both from the datetime module). Once you have those objects, you can compare them to other similar objects and write new entries, formatting the dates as you like. I would recommend parsing the whole input file (assuming you're reading a file) and writing a whole new output file instead of trying to alter it in place.
As for keeping track of the date and time counts, when your input isn't too large, using a dictionary is normally the easiest way to do it. When you encounter a line with a certain ID, find the entry corresponding to this ID in your dictionary or add a new one to it if not. This entry could itself be a dictionary using dates and times as keys and whose values is the count of each encountered.
I hope this answer will guide you on the way to a solution even though it contains no code.
You could use a couple of regular expressions:
import re
txt = 'ID674021384 25/01/1986 heloo hi thanks 5 minutes and 25-01-1988.'
retime = re.compile('([0-9]+) *minutes')
redate = re.compile('([0-9]+[/-][0-9]+[/-][0-9]{4})')
# find all dates in 'txt'
dates = redate.findall(txt)
print dates
# find all times in 'txt'
times = retime.findall(txt)
print times
# replace dates and times in orignal string:
newtxt = txt
for adate in dates:
newtxt = newtxt.replace(adate, 'date')
for atime in times:
newtxt = newtxt.replace(atime, 'time')
The output looks like this:
Original string:
ID674021384 25/01/1986 heloo hi thanks 5 minutes and 25-01-1988.
Found dates:['25/01/1986', '25-01-1988']
Found times: ['5']
New string:
ID674021384 date heloo hi thanks time minutes and date.
Dates and times found:
ID674021384 DATE:2 TIME:1
Chris