I've got a string:
SzczęśliwyNumereknadzień06październikato:
and I want to make a space after each word. So the final result should look like this:
Szczęśliwy Numerek na dzień 06 października to:
How can I achieve that?
Here is my original string:
Szczęśliwy Numerek na dzień
06 października
to:
Later I removed the whitespace, so my string looked like this:
SzczęśliwyNumereknadzień
06października
to:
After that, I converted it to a single-line string, and it now looks like this:
SzczęśliwyNumereknadzień06październikato:
Try this, starting from the original string rather than the one with the whitespace removed:
string=""" Szczęśliwy Numerek na dzień
06 października
to:
"""
strings=' '.join(string.split())
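To see what the snippet above produces, here is a minimal check. It assumes the original multi-line text is still available; the names `original` and `one_line` are only illustrative.

```python
# Original multi-line text, as given in the question (whitespace intact).
original = """Szczęśliwy Numerek na dzień
06 października
to:
"""

# split() with no arguments splits on any run of whitespace, newlines
# included, and drops empty pieces; join() puts single spaces back.
one_line = " ".join(original.split())
print(one_line)
```

Once all whitespace has been stripped, the word boundaries are generally not recoverable without a word list, so normalising the original text is likely the simpler route.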
I am working on a dataset that looks somewhat like this (using python and pandas):
date text
0 Jul 31 2020 Sentence Numero Uno #cool
1 Jul 31 2020 Second sentence
2 Jul 31 2020 Test sentence 3 #thanks
So I use this bit of code I found online to remove hashtags such as #cool and #thanks and to make everything lowercase.
for i in range(df.shape[0]):
    df['text'][i] = ' '.join(re.sub("(#[A-Za-z0-9]+)", " ", df['text'][i]).split()).lower()
That works; however, I now don't want to delete the hashtags completely but save them in an extra column, like this:
date text hashtags
0 Jul 31 2020 sentence numero uno #cool
1 Jul 31 2020 second sentence
2 Jul 31 2020 test sentence 3 #thanks
Can anyone help me with that? How can I do that?
Thanks in advance.
Edit: As some strings contain multiple hashtags, they should be stored in the hashtags column as a list.
One possible way to go about this would be the following:
df['hashtag'] = ''
for i in range(len(df)):
    df['hashtag'][i] = ' '.join(re.findall("(#[A-Za-z0-9]+)", df['text'][i]))
    df['text'][i] = ' '.join(re.sub("(#[A-Za-z0-9]+)", " ", df['text'][i]).split()).lower()
So, first you create an empty string column called hashtag. Then, on each pass through the loop, you first extract all hashtags that exist in the text into the new column (re.findall does not remove duplicates). If none exist, you end up with an empty string (you can change that to something else if you like). Then you replace each hashtag with a space, as you were already doing before.
If some texts contain more than one hashtag, then depending on how you want to use the hashtags later, it could be easier to store them as a list instead of using " ".join(...). In that case, you could replace the third line with:
df['hashtag'][i] = re.findall("(#[A-Za-z0-9]+)", df['text'][i])
which just returns a list of hashtags.
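The per-row logic above can also be pulled into a small function, which makes it easy to check on its own. This is only a sketch: `split_hashtags` and `HASHTAG` are hypothetical names, and the pattern is the one from the question.

```python
import re

# Same pattern as in the question: '#' followed by letters or digits.
HASHTAG = re.compile(r"#[A-Za-z0-9]+")

def split_hashtags(text):
    """Return (cleaned lowercase text, list of hashtags) for one string."""
    tags = HASHTAG.findall(text)
    # Replace each hashtag with a space, then collapse repeated whitespace.
    cleaned = " ".join(HASHTAG.sub(" ", text).split()).lower()
    return cleaned, tags

rows = ["Sentence Numero Uno #cool", "Second sentence", "Test sentence 3 #thanks"]
results = [split_hashtags(t) for t in rows]
for cleaned, tags in results:
    print(cleaned, tags)
```

Rows without hashtags yield an empty list here, which keeps the column type consistent if the results are later written back to the DataFrame.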
Use Series.str.findall with Series.str.join:
df['hashtags'] = df['text'].str.lower().str.findall(r"(#[A-Za-z0-9]+)").str.join(' ')
You can use this string method of pandas:
pattern = r"(#[A-Za-z0-9]+)"
df['text'].str.extract(pattern, expand=True)
If your string contains multiple matches, you should use str.extractall:
df['text'].str.extractall(pattern)
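As a rough illustration of what extractall gives back and how it might be collapsed into one list per row, here is a small sketch. It assumes pandas is installed; the sample rows and the name `tags` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"text": ["Sentence Numero Uno #cool",
                            "Second sentence",
                            "Two tags #a1 #b2"]})
pattern = r"(#[A-Za-z0-9]+)"

# extractall returns one row per match, indexed by
# (original row label, match number); the unnamed group becomes column 0.
matches = df["text"].str.extractall(pattern)

# Collapse the matches back to one list per original row. Rows without
# any hashtag are simply absent from the result.
tags = matches.groupby(level=0)[0].agg(list)
print(tags)
```

Assigning `tags` back as a column would align on the index, so rows without a match would end up as missing values rather than empty lists.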
I added a few lines to your code; it should work:
df['hashtags'] = ''
for i in range(df.shape[0]):
    # Collect the hashtags first, before they are removed from the text
    words = df['text'][i].split()
    tags = [w for w in words if w.startswith('#')]
    if tags:
        df['hashtags'][i] = ' '.join(tags)
    df['text'][i] = ' '.join(re.sub("(#[A-Za-z0-9]+)", " ", df['text'][i]).split()).lower()
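Use newdf = pd.DataFrame(df['text'].str.split('#', n=1).tolist(), columns=['text', 'hashtags']) instead of your for-loop. This creates a new DataFrame in which everything before the first '#' goes into text and everything after it goes into hashtags (the '#' itself is consumed by the split). Then you can set df['text'] = newdf['text'] and df['hashtags'] = newdf['hashtags'].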
So I've been trying to extract the strings that follow the "dot" character in the text file, but only for lines that follow the pattern as below, that is, after the date and time:
09 May 2018 10:37AM • 6PR, Perth (Mornings)
The problem is for each of those lines, the date and time would change so the only common pattern is that there would be AM or PM right before the "dot".
However, if I search for "AM" or "PM" it wouldn't recognize the lines because the "AM" and "PM" are attached to the time.
This is my current code:
for i, s in enumerate(open(file)):
    for words in ['PM', 'AM']:
        if re.findall(r'\b' + words + r'\b', s):
            source = s.split('•')[0]
Any idea how to get around this problem? Thank you.
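I think your regex is the problem here: \b never matches between the digits and AM/PM, because both sides are word characters. Match the digits together with the suffix instead:
for i, s in enumerate(open(file)):
    if re.findall(r'\d{2}[AP]M', s):
        source = s.split('•')[0]
        # 09 May 2018 10:37AM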
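A short check of both patterns on the sample line from the question may help. It also splits off the part after the bullet, on the assumption that this is the text the question is after; `timestamp` and `station` are illustrative names.

```python
import re

line = "09 May 2018 10:37AM • 6PR, Perth (Mornings)"

# \b needs a word/non-word transition; "7" and "A" are both word
# characters, so there is no boundary before "AM" and nothing is found.
print(re.findall(r"\bAM\b", line))

# Anchoring on the preceding digits matches the time suffix.
time_match = re.search(r"\d{2}[AP]M", line)

if time_match:
    # partition() splits on the first bullet only.
    before, _, after = line.partition("•")
    timestamp, station = before.strip(), after.strip()
    print(timestamp)
    print(station)
```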
If you are trying to extract the datetime, try using a regex.
Ex:
import re
s = "09 May 2018 10:37AM • 6PR, Perth (Mornings)"
m = re.search(r"(?P<datetime>\d{2}\s+(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{4}\s+\d{2}:\d{2}(AM|PM))", s)
if m:
    print(m.group("datetime"))
Output:
09 May 2018 10:37AM
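If the extracted text is needed as an actual date and time rather than a string, it could be handed to datetime.strptime. This is a sketch that assumes full English month names and an English locale; `stamp` and `parsed` are illustrative names.

```python
from datetime import datetime

stamp = "09 May 2018 10:37AM"

# %d day, %B full month name, %Y year, %I hour on a 12-hour clock,
# %M minutes, %p AM/PM. Month names and AM/PM depend on the locale.
parsed = datetime.strptime(stamp, "%d %B %Y %I:%M%p")
print(parsed.isoformat())
```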
I have a string like
"Originally Posted on 09 May, 2016. By query 3 j...."
How can I extract the date using Python?
I tried this code:
dStr = "Originally Posted on 09 May, 2016. By query 3 j...."
date_st = re.findall(r"(\d+\ \w+,)", dStr)
Printing date_st, I got:
['09 May,']
What should I do for the year?
You forgot to add the year after the comma. You only need to append ' \d+\.' to the pattern:
re.findall(r"(\d+\ \w+, \d+\.)", dStr)
You will get this:
['09 May, 2016.']
You were almost there. Just add 4 digits for the year after the comma. It is better to use [a-z]+ instead of \w+ to match the month names, as \w also matches _ and 0-9 (along with letters).
re.findall(r'\d+\s[a-z]+,\s\d{4}', dStr, re.I)
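Once the date text is isolated, it can be turned into a date object if that is more useful than a string. The sketch below reuses the pattern above and assumes the month is always spelled out in full in English.

```python
import re
from datetime import datetime

dStr = "Originally Posted on 09 May, 2016. By query 3 j...."

# search() returns only the first match, which is enough for one date.
match = re.search(r"\d+\s[a-z]+,\s\d{4}", dStr, re.I)
if match:
    # "%d %B, %Y" mirrors the "09 May, 2016" layout.
    parsed = datetime.strptime(match.group(0), "%d %B, %Y")
    print(parsed.date())
```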
I want to deal with those strings like:
"I will meet you at 1st."
"5th... OK, 5th?"
"today is 2nd\n"
"Aug.3rd"
I want to replace the "st|nd|rd|th" suffixes with a corresponding string (actually XML tags), so that "1st, 2nd, 3rd, 4th" are rendered as superscripts:
1<Font Script="super">st</Font>
5<Font Script="super">th</Font>... OK, 5<Font Script="super">th</Font>?
Use the re module to identify the ordinal patterns and replace them (shown here with <sup>; substitute your own tag):
>>> re.sub(r"([0123]?[0-9])(st|th|nd|rd)",r"\1<sup>\2</sup>","Meet you on 5th")
'Meet you on 5<sup>th</sup>'
Regex demo: http://regexr.com/38lao
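Adapted to the Font tag from the question, the same idea might look like this. It is a sketch: `superscript_ordinals` and `ORDINAL` are hypothetical names, and the trailing \b is an added assumption so that only a suffix at the end of a word is wrapped.

```python
import re

# One or more digits followed by an ordinal suffix at the end of a word.
ORDINAL = re.compile(r"(\d+)(st|nd|rd|th)\b")

def superscript_ordinals(text):
    # \1 is the number, \2 the suffix; the tag text comes from the question.
    return ORDINAL.sub(r'\1<Font Script="super">\2</Font>', text)

samples = ["I will meet you at 1st.", "5th... OK, 5th?", "today is 2nd\n", "Aug.3rd"]
converted = [superscript_ordinals(s) for s in samples]
for line in converted:
    print(line)
```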
I have the following string, which I collected from the Twitter streaming API:
"""{"created_at":"Mon Mar 11 20:15:36 +0000 2013","id":311208808837951488,"id_str":"311208808837951488","text":"ALIENS ENTRATE E' IMPORTANTE!!! \n\n\n\nMTV's Musical March Madness ritorna il 18 marzo...Siete pronti A http:\/\/t.co\/ABXEfquTJw via #Hopee_dream","source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":1025970793,"id_str":"1025970793","name":"Tom's Perfection\u2665","screen_name":"_MyGreenEyes_","location":"","url":null,"description":"Angel,don't you cry,i'll meet you on the other side.\u2661","protected":false,"followers_count":387,"friends_count":520,"listed_count":1,"created_at":"Fri Dec 21 08:39:17 +0000 2012","favourites_count":174,"utc_offset":null,"time_zone":null,"geo_enabled":true,"verified":false,"statuses_count":772,"lang":"it","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/a0.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/si0.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/3363059730\/3d791e51eefa800150cd99917abc1d2c_normal.jpeg","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/3363059730\/3d791e51eefa800150cd99917abc1d2c_normal.jpeg","profile_banner_url":"https:\/\/si0.twimg.com\/profile_banners\/1025970793\/1362500832","profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":nul
l,"retweet_count":0,"entities":{"hashtags":[],"urls":[{"url":"http:\/\/t.co\/ABXEfquTJw","expanded_url":"http:\/\/tl.gd\/l9f5j7","display_url":"tl.gd\/l9f5j7","indices":[101,123]}],"user_mentions":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"medium"}"""
I am doing the following operations:
import json
json_string = json_string.strip()
jsn_dict = json.loads(json_string)
print jsn_dict["text"]
gives:
ALIENS ENTRATE E' IMPORTANTE!!!
instead of:
"ALIENS ENTRATE E' IMPORTANTE!!! \n\n\n\nMTV's Musical March Madness ritorna il 18 marzo...Siete pronti A http:\/\/t.co\/ABXEfquTJw via #Hopee_dream"
It looks to me as though the newline characters are causing problems when parsing this string into a Python dictionary.
But I am calling json_string.strip(), which I thought would remove such characters from my string.
What am I doing wrong?
The str.strip() method only removes whitespace characters at the beginning and the end of a string, not anywhere in the middle.
To remove all newlines from a string, you could do:
"some\n\n\nstring".replace("\n", "")
or (Python 2 only; in Python 3, pass {ord("\n"): None} as the single argument instead)
"some\n\n\nstring".translate(None, "\n")
The first one is probably easier to read and understand.
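It may also be worth checking what json.loads actually returns before removing anything. The sketch below uses a shortened, made-up payload rather than the full tweet; in it, the escaped \n sequences are parsed into real newline characters inside the value, so the rest of the text is still there even if a print shows it on later lines.

```python
import json

# A raw string keeps the backslashes, so this is JSON text with escaped
# newlines (a shortened, made-up payload, padded with spaces on both ends).
json_string = r'  {"text": "ALIENS ENTRATE!!! \n\n\n\nMTV via #Hopee_dream"}  '

# strip() only trims the leading and trailing spaces here.
data = json.loads(json_string.strip())
text = data["text"]

# repr() shows the newline characters that are now inside the value.
print(repr(text))

# Collapse every run of whitespace, newlines included, into one space.
flat = " ".join(text.split())
print(flat)
```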
I ran into a similar problem with text being truncated after the \n symbol.
In my case, the issue was because I had accidentally prepended the # symbol to the key I was using to pull the value out of the dict:
print jsn_dict['#text']