I am working on a dataset that looks somewhat like this (using Python and pandas):
date text
0 Jul 31 2020 Sentence Numero Uno #cool
1 Jul 31 2020 Second sentence
2 Jul 31 2020 Test sentence 3 #thanks
So I use this bit of code I found online to remove hashtags like #cool and #thanks, and to make everything lowercase.
for i in range(df.shape[0]):
    df['text'][i] = ' '.join(re.sub("(#[A-Za-z0-9]+)", " ", df['text'][i]).split()).lower()
That works; however, I now don't want to delete the hashtags completely but save them in an extra column, like this:
date text hashtags
0 Jul 31 2020 sentence numero uno #cool
1 Jul 31 2020 second sentence
2 Jul 31 2020 test sentence 3 #thanks
Can anyone help me with that?
Thanks in advance.
Edit: As some strings contain multiple hashtags, they should be stored in the hashtags column as a list.
One possible way to go about this would be the following:
df['hashtag'] = ''
for i in range(len(df)):
    df['hashtag'][i] = ' '.join(re.findall("(#[A-Za-z0-9]+)", df['text'][i]))
    df['text'][i] = ' '.join(re.sub("(#[A-Za-z0-9]+)", " ", df['text'][i]).split()).lower()
So, first you create an empty string column called hashtag. Then, on each pass through the rows, you first extract any hashtags in the text into the new column. If none exist, you end up with an empty string (you can change that to something else if you like). Then you replace each hashtag with a space, as you were already doing before.
If some texts contain more than one hashtag then, depending on how you want to use the hashtags later, it could be easier to store them as a list instead of joining them with " ".join(...). So, if you want to store them as a list, you could replace the third line with:
df['hashtag'][i] = re.findall("(#[A-Za-z0-9]+)", df['text'][i])
which just returns a list of hashtags.
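For illustration, here is a self-contained version of the list variant, with the sample frame from the question rebuilt inline; it uses .at instead of chained indexing to avoid pandas' SettingWithCopyWarning (a small deviation from the loop above):
import re
import pandas as pd

# sample frame reconstructed from the question
df = pd.DataFrame({
    'date': ['Jul 31 2020'] * 3,
    'text': ['Sentence Numero Uno #cool', 'Second sentence', 'Test sentence 3 #thanks'],
})

df['hashtag'] = ''
for i in range(len(df)):
    # collect the tags first, then strip them -- the order matters
    df.at[i, 'hashtag'] = re.findall("(#[A-Za-z0-9]+)", df.at[i, 'text'])
    df.at[i, 'text'] = ' '.join(
        re.sub("(#[A-Za-z0-9]+)", " ", df.at[i, 'text']).split()).lower()

print(df)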
Use Series.str.findall with Series.str.join:
df['hashtags'] = df['text'].str.lower().str.findall(r"#[A-Za-z0-9]+").str.join(' ')
(Note the class is [A-Za-z0-9], not [A-z0-9]: in ASCII the range A-z also matches the punctuation characters between Z and a.)
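If the tags should also be removed from text, as in the question, a fully vectorized companion might look like the sketch below (same pattern as above; multiple tags per row are kept as a list):
df['hashtags'] = df['text'].str.findall(r"#[A-Za-z0-9]+")  # one list per row
df['text'] = (
    df['text']
    .str.replace(r"#[A-Za-z0-9]+", " ", regex=True)        # drop the tags
    .str.split().str.join(' ')                             # collapse leftover spaces
    .str.lower()
)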
You can use this string method of pandas:
pattern = r"(#[A-Za-z0-9]+)"
df['text'].str.extract(pattern, expand=True)
If your string contains multiple matches, use str.extractall instead:
df['text'].str.extractall(pattern)
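str.extractall returns one row per match, indexed by (original row, match number). To fold the matches back into one list per row, a grouping step along these lines could be used (a sketch; rows with no match come out as NaN):
matches = df['text'].str.extractall(pattern)[0]
df['hashtags'] = matches.groupby(level=0).agg(list)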
I added a couple of lines to your code; note that the hashtags have to be collected before they are stripped out of the text:
df['hashtags'] = ''
for i in range(df.shape[0]):
    l = df['text'][i].split()
    s = [k for k in l if k[0] == '#']
    if len(s) >= 1:
        df['hashtags'][i] = ' '.join(s)
    df['text'][i] = ' '.join(re.sub("(#[A-Za-z0-9]+)", " ", df['text'][i]).split()).lower()
Use newdf = pd.DataFrame(df['text'].str.split('#', n=1).tolist(), columns=['text', 'hashtags']) instead of your for-loop. This will create a new DataFrame; then you can set df['text'] = newdf['text'] and df['hashtags'] = newdf['hashtags']. Note this only handles a single trailing hashtag, and the '#' itself is consumed by the split.
So I've been trying to extract the strings that follow the "•" (bullet) character in a text file, but only for lines that match the pattern below, that is, after the date and time:
09 May 2018 10:37AM • 6PR, Perth (Mornings)
The problem is that the date and time change on each of those lines, so the only common pattern is that there is AM or PM right before the "•".
However, searching for "AM" or "PM" as whole words doesn't match these lines, because the AM/PM is attached to the time.
This is my current code:
for i, s in enumerate(open(file)):
    for words in ['PM', 'AM']:
        if re.findall(r'\b' + words + r'\b', s):
            source = s.split('•')[0]
Any idea how to get around this problem? Thank you.
I guess your regex is the problem here: \bAM\b never matches, because both 7 and A are word characters, so there is no word boundary between them. Match the digits together with the meridiem instead:
for i, s in enumerate(open(file)):
    if re.findall(r'\d{2}[AP]M', s):
        source = s.split('•')[0]
        # 09 May 2018 10:37AM
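Alternatively, the part before the bullet can be captured in a single pass; the sketch below assumes the separator is always '•' and the time always ends in AM or PM:
import re

line = "09 May 2018 10:37AM • 6PR, Perth (Mornings)"
m = re.match(r'(.*\d{1,2}:\d{2}[AP]M)\s*•', line)
if m:
    source = m.group(1)
    # '09 May 2018 10:37AM'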
If you are trying to extract the datetime, try using a regex.
Ex:
import re

s = "09 May 2018 10:37AM • 6PR, Perth (Mornings)"
m = re.search(r"(?P<datetime>\d{2}\s+(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{4}\s+\d{2}:\d{2}(AM|PM))", s)
if m:
    print(m.group("datetime"))
Output:
09 May 2018 10:37AM
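If the goal is an actual datetime object rather than the matched text, the standard library's strptime can parse it (a sketch; the format string mirrors the example above):
from datetime import datetime

dt = datetime.strptime("09 May 2018 10:37AM", "%d %B %Y %I:%M%p")
print(dt)
# 2018-05-09 10:37:00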
I've got a string:
SzczęśliwyNumereknadzień06październikato:
and I want to put a space back after each word, so the final result should look like this:
Szczęśliwy Numerek na dzień 06 października to:
How can I achieve that?
Here is my original string:
Szczęśliwy Numerek na dzień
06 października
to:
Later I removed the whitespace, so my string looked like this:
SzczęśliwyNumereknadzień
06października
to:
And after that, I joined it into a one-line string, so it now looks like this:
SzczęśliwyNumereknadzień06październikato:
Try this. It normalizes the whitespace in the original multi-line string; once all the spaces have been removed, the word boundaries cannot be recovered without a dictionary:
string = """ Szczęśliwy Numerek na dzień
06 października
to:
"""
strings = ' '.join(string.split())
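Printing the result gives the expected one-line form:
print(strings)
# Szczęśliwy Numerek na dzień 06 października to: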
I have a string:
string = '"7807161604","Sat Jan 16 00:00:57 +0000 2010","Global focus begins tonight. Pretty interested to hear more about it.","Madison Alabama","al","17428434","81","51","Sun Nov 16 21:46:24 +0000 2008","243"'
I only want the text "Global focus begins tonight. Pretty interested to hear more about it.", which sits between the 2nd and 3rd delimiters.
If I use:
i = string.split(',', 2)
s = i[2]
j = s.split(',', -7)
print(j[0])
I get the desired output.
But if there is an extra comma inside one of the fields, as shown below:
string = '"7807161604","Sat Jan 16 00:00:57 +0000 2010","Global focus begins tonight. Pretty interested, to hear more about it.","Madison Alabama","al","17428434","81","51","Sun Nov 16 21:46:24 +0000 2008","243"'
then this approach does not work, because the field I need is itself split apart. Can anyone suggest a different approach, or advise where I'm going wrong? Thanks!
You can use Python's built-in csv module to do this:
import csv

j = next(csv.reader([string]))
Now j is a list of the ,-delimited items, and fields that were wrapped in " keep their embedded commas. See j[2].
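Put together with the problem string from the question, a minimal runnable version looks like this:
import csv

string = '"7807161604","Sat Jan 16 00:00:57 +0000 2010","Global focus begins tonight. Pretty interested, to hear more about it.","Madison Alabama","al","17428434","81","51","Sun Nov 16 21:46:24 +0000 2008","243"'
j = next(csv.reader([string]))
print(j[2])
# Global focus begins tonight. Pretty interested, to hear more about it.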
I have a database full of names like:
John Smith
Scott J. Holmes
Dr. Kaplan
Ray's Dog
Levi's
Adrian O'Brien
Perry Sean Smyre
Carie Burchfield-Thompson
Björn Árnason
There are a few foreign names with accents in them that need to be converted to strings with non-accented characters.
I'd like to convert the full names (after stripping characters like ' and -) to user logins like:
john.smith
scott.j.holmes
dr.kaplan
rays.dog
levis
adrian.obrien
perry.sean.smyre
carie.burchfieldthompson
bjorn.arnason
So far I have:
Fullname = Fullname.strip()            # get rid of leading/trailing white space
Fullname = Fullname.lower()            # make everything lower case
...                                    # after bad chars converted/removed
Fullname = Fullname.replace(' ', '.')  # replace spaces with periods
(Strings are immutable, so each result has to be assigned back.)
Take a look at this link [redacted]
Here is the code from the page:
def latin1_to_ascii(unicrap):
    """This replaces Unicode Latin-1 characters with
    something equivalent in 7-bit ASCII. All characters in the standard
    7-bit ASCII range are preserved. In the 8th-bit range all the Latin-1
    accented letters are stripped of their accents. Most symbol characters
    are converted to something meaningful. Anything not converted is deleted.
    """
    xlate = {
        0xc0: 'A', 0xc1: 'A', 0xc2: 'A', 0xc3: 'A', 0xc4: 'A', 0xc5: 'A',
        0xc6: 'Ae', 0xc7: 'C',
        0xc8: 'E', 0xc9: 'E', 0xca: 'E', 0xcb: 'E',
        0xcc: 'I', 0xcd: 'I', 0xce: 'I', 0xcf: 'I',
        0xd0: 'Th', 0xd1: 'N',
        0xd2: 'O', 0xd3: 'O', 0xd4: 'O', 0xd5: 'O', 0xd6: 'O', 0xd8: 'O',
        0xd9: 'U', 0xda: 'U', 0xdb: 'U', 0xdc: 'U',
        0xdd: 'Y', 0xde: 'th', 0xdf: 'ss',
        0xe0: 'a', 0xe1: 'a', 0xe2: 'a', 0xe3: 'a', 0xe4: 'a', 0xe5: 'a',
        0xe6: 'ae', 0xe7: 'c',
        0xe8: 'e', 0xe9: 'e', 0xea: 'e', 0xeb: 'e',
        0xec: 'i', 0xed: 'i', 0xee: 'i', 0xef: 'i',
        0xf0: 'th', 0xf1: 'n',
        0xf2: 'o', 0xf3: 'o', 0xf4: 'o', 0xf5: 'o', 0xf6: 'o', 0xf8: 'o',
        0xf9: 'u', 0xfa: 'u', 0xfb: 'u', 0xfc: 'u',
        0xfd: 'y', 0xfe: 'th', 0xff: 'y',
        0xa1: '!', 0xa2: '{cent}', 0xa3: '{pound}', 0xa4: '{currency}',
        0xa5: '{yen}', 0xa6: '|', 0xa7: '{section}', 0xa8: '{umlaut}',
        0xa9: '{C}', 0xaa: '{^a}', 0xab: '<<', 0xac: '{not}',
        0xad: '-', 0xae: '{R}', 0xaf: '_', 0xb0: '{degrees}',
        0xb1: '{+/-}', 0xb2: '{^2}', 0xb3: '{^3}', 0xb4: "'",
        0xb5: '{micro}', 0xb6: '{paragraph}', 0xb7: '*', 0xb8: '{cedilla}',
        0xb9: '{^1}', 0xba: '{^o}', 0xbb: '>>',
        0xbc: '{1/4}', 0xbd: '{1/2}', 0xbe: '{3/4}', 0xbf: '?',
        0xd7: '*', 0xf7: '/'
    }
    r = ''
    for i in unicrap:
        if ord(i) in xlate:   # dict.has_key() is Python 2 only
            r += xlate[ord(i)]
        elif ord(i) >= 0x80:
            pass              # drop anything unmapped above ASCII
        else:
            r += i
    return r

# This gives an example of how to use latin1_to_ascii().
# It builds a string with all the printable characters in the Latin-1
# character set, then converts the string to plain 7-bit ASCII.
if __name__ == '__main__':
    s = ''.join(chr(c) for c in range(32, 256) if c != 0x7f)
    print('INPUT:')
    print(s)
    print()
    print('OUTPUT:')
    print(latin1_to_ascii(s))
If you are not afraid to install third-party modules, then have a look at the Python port of the Perl module Text::Unidecode (it's also on PyPI).
The module does nothing more than use a lookup table to transliterate the characters. I glanced over the code and it looks very simple, so I suppose it works on pretty much any OS and any Python version (fingers crossed). It's also easy to bundle with your application.
With this module you don't have to create your lookup table manually ( = reduced risk it being incomplete).
The advantage of this module compared to the Unicode normalization technique is this: Unicode normalization does not replace all characters. A good example is a character like "æ". Unicode normalization sees it as "Letter, lowercase" (Ll), which means the normalize method gives you neither a replacement character nor a useful hint. Unfortunately, that character is not representable in ASCII, so you'll get errors.
The mentioned module does a better job here: it actually replaces "æ" with "ae", which is useful and makes sense.
The most impressive thing I've seen is that it goes much further. It even replaces Japanese kana characters mostly properly. For example, it replaces "は" with "ha", which is perfectly fine. It's not fool-proof though, as the current version replaces "ち" with "ti" instead of "chi", so you'll have to handle the more exotic characters with care.
Usage of the module is straightforward (in Python 3, strings are already Unicode, so there is no decode/encode dance):
from unidecode import unidecode

print(unidecode("æは"))
# aeha
Note that I have nothing to do with this module directly. It just happens that I find it very useful.
Edit: The patch I submitted fixed the bug concerning the Japanese kana. I've only fixed the ones I could spot right away; I may have missed some.
The following function is generic:
import unicodedata

def not_combining(char):
    return unicodedata.category(char) != 'Mn'

def strip_accents(text):
    # decompose each character (NFD), then drop the combining marks ('Mn')
    unicode_text = unicodedata.normalize('NFD', text)
    return ''.join(filter(not_combining, unicode_text))

>>> print(strip_accents("déjà"))
deja
>>> print(strip_accents("καλημέρα"))
καλημερα
Obviously, if your input is bytes rather than str, you should know the encoding of your strings and decode first.
I would do something like this:
import re

def alnum_dot(name, replace={}):
    for k, v in replace.items():
        name = name.replace(k, v)
    return re.sub("[^a-z.]", "", name.strip().lower())

print(alnum_dot("Frédrik Holmström", {
    "ö": "o",
    " ": ".",
}))
The second argument is a dict of the characters you want replaced; all characters other than a-z and . that are not replaced will be stripped.
The translate method allows you to delete characters; in Python 3 the deletion table is built with str.maketrans. You can use that to delete arbitrary characters:
Fullname.translate(str.maketrans('', '', "'-\""))
If you want to delete whole classes of characters, you might want to use the re module:
re.sub('[^a-z0-9 ]', '', Fullname.strip().lower())
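As a closing illustration, a hypothetical to_login helper (not code from any answer above) could combine NFD decomposition with the cleanup steps from the question; it inherits the normalization approach's limitation with characters like æ noted earlier:
import re
import unicodedata

def to_login(fullname):
    # strip accents: decompose (NFD), then drop everything outside ASCII
    decomposed = unicodedata.normalize('NFD', fullname)
    ascii_only = decomposed.encode('ascii', 'ignore').decode('ascii')
    # drop apostrophes, hyphens and quotes; lowercase; spaces -> dots
    cleaned = re.sub(r"['\-\"]", "", ascii_only).strip().lower()
    return re.sub(r"\s+", ".", cleaned)

print(to_login("Adrian O'Brien"))   # adrian.obrien
print(to_login("Björn Árnason"))    # bjorn.arnason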