WhatsApp chat log parsing with regex - Python

I'm trying to parse a WhatsApp chat log using regex. I have a solution that works for most cases, but I'd like to improve it and don't know how, since I am quite new to regex.
The chat.txt file looks like this:
[06.12.16, 16:46:19] Person One: Wow thats amazing
[06.12.16, 16:47:13] Person Two: Good morning and this goes over multiple
lines as it is a very long message
[06.12.16, 16:47:22] Person Two: ::
My solution so far parses most of these messages correctly; however, I have a few hundred cases where the message starts with a colon, like the last example above. This leads to an unwanted sender value of Person Two: :.
Here is the regex I am working with so far:
pattern = re.compile(r'\[(?P<date>\d{2}\.\d{2}\.\d{2}),\s(?P<time>\d{2}:\d{2}:\d{2})]\s(?P<sender>(?<=\s).*(?::\s*\w+)*(?=:)):\s(?P<message>(?:.+|\n+(?!\[\d{2}\.\d{2}\.\d{2}))+)')
Any advice on how I could get around this bug would be appreciated!

I would pre-process the lines to remove the consecutive colons before applying the regex. So for each line, e.g.
line = '[06.12.16, 16:47:22] Person Two: ::'
line = line.replace("::", "")
which would give:
[06.12.16, 16:47:22] Person Two:
You can then call your regex function on the pre-processed data.
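A minimal sketch of that pre-processing step (assuming the chat is in chat.txt and pattern is the compiled regex from the question; note this also strips any '::' occurring inside message bodies):

import re

with open('chat.txt', encoding='utf-8') as f:
    # Drop the double colons before matching
    text = ''.join(line.replace('::', '') for line in f)

for m in pattern.finditer(text):
    print(m.group('sender'), '->', m.group('message'))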

I encountered similar issues when building a tool to analyze WhatsApp chats.
The main issue is that the format of chat.txt depends on your system language. In German you get 24-hour times like 16:47, but in English it might use AM/PM, and the date format changes for American users.
The library I used has the four regexes below. So far they have covered all the cases that occur (Latin-script languages).
General message parsing:
const regexParser = /^(?:\u200E|\u200F)*\[?(\d{1,4}[-/.] ?\d{1,4}[-/.] ?\d{1,4})[,.]? \D*?(\d{1,2}[.:]\d{1,2}(?:[.:]\d{1,2})?)(?: ([ap]\.? ?m\.?))?\]?(?: -|:)? (.+?): ([^]*)/i;
Filtering system messages:
const regexParserSystem = /^(?:\u200E|\u200F)*\[?(\d{1,4}[-/.] ?\d{1,4}[-/.] ?\d{1,4})[,.]? \D*?(\d{1,2}[.:]\d{1,2}(?:[.:]\d{1,2})?)(?: ([ap]\.? ?m\.?))?\]?(?: -|:)? ([^]+)/i;
Date:
const regexSplitDate = /[-/.] ?/;
Handle attachments, which are wrapped in "< >" even when you export the chat without attachments (e.g. <Media omitted>):
const regexAttachment = /<.+:(.+)>/;
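If you want to use the general parser from Python, a rough translation might be (an untested sketch; JavaScript's [^] "match anything" class becomes [\s\S] in Python):

import re

regex_parser = re.compile(
    r'^(?:\u200e|\u200f)*\[?'
    r'(\d{1,4}[-/.] ?\d{1,4}[-/.] ?\d{1,4})[,.]? \D*?'   # date
    r'(\d{1,2}[.:]\d{1,2}(?:[.:]\d{1,2})?)'              # time
    r'(?: ([ap]\.? ?m\.?))?\]?(?: -|:)? '                # optional am/pm
    r'(.+?): ([\s\S]*)',                                 # sender and message
    re.IGNORECASE,
)

m = regex_parser.match('[06.12.16, 16:47:22] Person Two: ::')
if m:
    date, time, ampm, sender, message = m.groups()  # sender == 'Person Two'

Note that the non-greedy (.+?) for the sender stops at the first ': ', which also sidesteps the leading-colon problem from the question.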

Related

Text manipulation in Outlook using Python

Sometimes when sending a new event invitation for a certain meeting in Outlook I need to mention all the required people for the meeting in the invitation body, due to company conventions. Many times, the names I already sent the invitation to are the very same people I need to write all over again. I found that if I copy those names from the "To..." field, they are pasted in the format of name <mail>; name <mail>; name <mail>, so I wrote this Python function to turn it into a plain list of names separated by a new line with the mail addresses removed:
def format_invitees(string):
    import re
    return ''.join(x.strip(' \n') + '\n' for x in re.sub("[<].*?[>]", "", string).replace(' ; ', ';').split(';')).strip('\n')
Now, is there any good way to integrate this function into an Outlook macro, either by assigning it to a hotkey or by adding it to the right-click menu? I should mention that Python is the only language I know, and I am not allowed to install any external software due to organization policy. Best regards!
I think import re; return re.sub(r" ?<.*?>;? ?","\n",string) is a shorter way of defining the Python function.
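For what it's worth, a quick check of that one-liner on a made-up "To..." field (with a final strip() to drop the trailing newline):

import re

to_field = 'Alice Smith <alice@corp.example>; Bob Jones <bob@corp.example>'
print(re.sub(r" ?<.*?>;? ?", "\n", to_field).strip())
# Alice Smith
# Bob Jones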
But more to your point, follow the instructions at this SO question to enable VBA regex module (given for Word, but applicable in Outlook): How to Use/Enable (RegExp object) Regular Expression using VBA (MACRO) in word
I think the outlook.Recipients property may be useful for getting the names of the people you need to list: https://learn.microsoft.com/en-us/office/vba/api/outlook.recipients

Python 3.6: How do I regex a URL from a .txt?

I need to grab a URL from a text file.
The URL is stored in a string like so: 'URL=http://example.net'.
Is there any way I could grab everything after the = char up until the . in '.net'?
Could I use the re module?
text = """A key feature of effective analytics infrastructure in healthcare is a metadata-driven architecture. In this article, three best practice scenarios are discussed: https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare Automating ETL processes so data analysts have more time to listen and help end users , https://www.google.com/, https://www.facebook.com/, https://twitter.com
code below catches all urls in text and returns urls in list."""
urls = re.findall('(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', text)
print(urls)
output:
[
'https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare',
'https://www.google.com/',
'https://www.facebook.com/',
'https://twitter.com'
]
I don't have much information, but I will try to help with what I've got. I'm assuming that URL= is part of the string; in that case you can do this:
re.findall(r'URL=(.*?)\.', STRINGNAMEHERE)
Let me go into more detail about (.*?)\. — the dot inside the group means any character (except the newline character), the star means zero or more occurrences, and the ? after the star makes the repetition non-greedy, so it matches as little as possible. The brackets place it all into a group, and the final \. matches a literal dot. All this together basically means it will find everything in between URL= and the first .
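A quick demonstration of that pattern on the string from the question:

import re

print(re.findall(r'URL=(.*?)\.', 'URL=http://example.net'))
# ['http://example']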
You don't need regexes (the re module) for such a simple task.
If the string you have is of the form:
'URL=http://example.net'
Then you can solve this using basic Python in numerous ways, one of them being:
file_line = 'URL=http://example.net'
start_position = file_line.find('=') + 1 # this gives you the first position after =
end_position = file_line.find('.')
# this extracts from the start_position up to but not including end_position
url = file_line[start_position:end_position]
Of course, this is just going to extract one URL. Assuming that you're working with a large text from which you want to extract all URLs, you'll want to put this logic into a function so that you can reuse it and build around it (achieve iteration via a while or for loop and, depending on how you iterate, keep track of the position of the last extracted URL, and so on).
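For example, a rough sketch of that loop (a hypothetical helper, assuming each URL is introduced by 'URL=' and ends at the next dot):

def extract_all(text, start_marker='URL=', end_marker='.'):
    urls = []
    pos = 0
    while True:
        start = text.find(start_marker, pos)
        if start == -1:
            break                      # no more markers
        start += len(start_marker)
        end = text.find(end_marker, start)
        if end == -1:
            break                      # unterminated entry
        urls.append(text[start:end])
        pos = end + 1                  # continue after this match
    return urls

print(extract_all('URL=http://example.net and URL=http://foo.org'))
# ['http://example', 'http://foo']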
Word of advice
This question has been answered quite a lot on this forum, by very skilled people, in numerous ways, for instance: here, here, here and here, to a level of detail that would amaze you. And these are not all; I just picked the first few that popped up in my search results.
Given that (at the time of posting this question) you're a new contributor to this site, my friendly advice would be to invest some effort into finding such answers. It's a crucial skill that you can't do without in the world of programming.
Remember that whatever problem you are encountering, there is a very high chance that somebody on this forum has already encountered it and received an answer; you just need to find it.
Please try this. It worked for me.
import re

s = 'url=http://example.net'
print(re.findall(r"=(.*)\.", s)[0])

How to extract questions from a word doc with Python using regex

I am using the docx library to read a Word doc; I am trying to extract only the questions using regex search and match. I have found countless ways of doing it, but I keep getting a "TypeError".
The data I am trying to extract is this:
Will my financial aid pay for housing?
Off Campus Housing - After financial aid applies toward your tuition and fees, any remaining funds will be sent to you as a refund that will either be directly deposited (which can be set up through your account) or mailed to you as a paper check. You can then use the refund to pay your rent. It is important to note that financial aid may not be available when rent is due, so make sure to have a plan in place to pay your rent. Will my financial aid pay for housing?
"financial" "help" "house"
funds "univ oak"
"money" "chisho"
"pay" "chap"
"grant" "laurel"
What are the requirements to receive a room and grant?
How do I pay for my housing?
How do I pay for housing?
If there's also an easier method of exporting the Word doc into a different type of file, that would be great to know. Thank you
I am using regex101, and I've tried the following patterns to match only the sentences that end in a question mark:
".*[?=?]$"
"^(W|w).*[?=?]$"
"^[A-Za-z].*[?=?]$"
import re
import sys
from docx import Document

wordDoc = Document('botDoc.docx')
result = re.search('.*[?=?]$', wordDoc)
print(result)
if result:
    print(result.group(0))

for table in wordDoc.tables:
    for row in table.rows:
        for cell in row.cells:
            print("test")
I expect to save the matching patterns into directories so I can export the data to a csv file
Your error:
result = re.search('.*[?=?]$', wordDoc)
I believe this line is the cause of the problem: search() expects a string as its second parameter, but is receiving a Document object.
What you should do is use the findall() function. search() only finds the first match for a pattern; findall() finds all the matches and returns them as a list of strings, with each string representing one match.
Since you are working with docx, you would have to extract the contents of the docx and use them as the second parameter of findall(). If I remember correctly, this is done by first extracting all the paragraphs and then extracting the text of the individual paragraphs. Refer to this question.
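For a .docx file specifically, a minimal sketch of that idea (python-docx exposes each paragraph's text via the .text attribute):

from docx import Document

wordDoc = Document('botDoc.docx')
# Join the text of every paragraph into one string; search this, not the Document object
text = '\n'.join(p.text for p in wordDoc.paragraphs)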
FYI, the way you would do this for a simple text file is the following:
import re

# Open the file and feed its text into findall(); it returns a list of all the found strings
with open('test.txt', 'r') as f:
    strings = re.findall(r'your pattern', f.read())
Your Regex:
Unfortunately, your regex is not quite correct. Although logically it makes sense to match only sentences that end in a ?, one of your matches is place to pay your rent. Will my financial aid pay for housing?, for example; only the second part of that match is an actual question. So start the match at a capital letter instead of allowing anything. Your regex should be something like:
[A-Z].*\?$
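Applied to the text extracted from the document (text as in the sketch above), with re.MULTILINE so that $ matches at the end of every line, a sketch would be:

import re

questions = re.findall(r'[A-Z].*\?$', text, re.MULTILINE)
print(questions)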

Python JSONDecodeError

I am not too sure what I am doing wrong. I am trying to parse specific content embedded within a JavaScript tag.
This is the output of "s" (for the code below it):
<script type="text/javascript">window._sharedData = {"activity_counts":{"comment_likes":0,"comments":0,"likes":0,"relationships":0,"usertags":0},"config":{"csrf_token":"OIXAF5a6FwMQJj3vCaUQXCGUGL3sFb0Z","viewer":{"allow_contacts_sync":false,"biography":"Follow for the best social media experience. Est. 2014","external_url":null,"full_name":"Social Media Bliztexnetwork","has_profile_pic":true,"id":"6440587166","profile_pic_url":"https://instagram.fbed1-1.fna.fbcdn.net/vp/dd5d8db8ca1645ac8b69fdaf8886184f/5BB11538/t51.2885-19/s150x150/32947488_229940584435561_2806247690365566976_n.jpg","profile_pic_url_hd":"https://instagram.fbed1-1.fna.fbcdn.net/vp/df4d5098687fe594c5b2d9750804941a/5BEC5FC8/t51.2885-19/s320x320/32947488_229940584435561_2806247690365566976_n.jpg","username":"bliztezxxmedia"}},"supports_es6":false,"country_code":"US","language_code":"en","locale":"en_US","entry_data":{"ProfilePage":[{"logging_page_id":"profilePage_7507466602","show_suggested_profiles":false,"graphql":{"user":{"biography":"What a wonderful day!!!","blocked_by_viewer":false,"country_block":false,"external_url":null,"external_url_linkshimmed":null,"edge_followed_by":{"count":17},"followed_by_viewer":true,"edge_follow":{"count":8},"follows_viewer":false,"full_name":"Verna Manning","has_channel":false,"has_blocked_viewer":false,"highlight_reel_count":0,"has_requested_viewer":false,"id":"7507466602","is_private":true,"is_verified":false,"mutual_followers":{"additional_count":-3,"usernames":[]},"profile_pic_url":"https://instagram.fbed1-1.fna.fbcdn.net/vp/96e65311d0a5e79729411bd582592816/5BCC9C5A/t51.2885-19/s150x150/33143922_237271910362316_6290555001760645120_n.jpg","profile_pic_url_hd":"https://instagram.fbed1-1.fna.fbcdn.net/vp/96e65311d0a5e79729411bd582592816/5BCC9C5A/t51.2885-19/s150x150/33143922_237271910362316_6290555001760645120_n.jpg","requested_by_viewer":false,"username":"vernamanning46464","connected_fb_page":null,"edge_felix_combined_post_uploads":{"count":0,"page_info":{"has_next_page":false,"end_cursor":null},"edges":[]},"edge_felix_combined_draft_uploads":{"count":0,"page_info":{"has_next_page":false,"end_cursor":null},"edges":[]},"edge_felix_video_timeline":{"count":0,"page_info":{"has_next_page":false,"end_cursor":null},"edges":[]},"edge_felix_drafts":{"count":0,"page_info":{"has_next_page":false,"end_cursor":null},"edges":[]},"edge_felix_pending_post_uploads":{"count":0,"page_info":{"has_next_page":false,"end_cursor":null},"edges":[]},"edge_felix_pending_draft_uploads":{"count":0,"page_info":{"has_next_page":false,"end_cursor":null},"edges":[]},"edge_owner_to_timeline_media":{"count":2,"page_info":{"has_next_page":false,"end_cursor":"AQAQt_06KHhticevO8Am12l3GJ1CdrZVdUztIDyZN7oXm_IVmr2Clwi844aWh9oe9TU"},"edges":[{"node":{"__typename":"GraphImage","id":"1810494542282448836","edge_media_to_caption":{"edges":[{"node":{"text":"What a sunny 
day!"}}]},"shortcode":"BkgKzGch1_EsxkqWK-4ZjG_XoWfrFxgXIOrZqs0","edge_media_to_comment":{"count":24},"comments_disabled":false,"taken_at_timestamp":1530047789,"dimensions":{"height":1080,"width":1080},"display_url":"https://instagram.fbed1-1.fna.fbcdn.net/vp/d82d797684ce57fef7a9fe87c74d2342/5BCE0CF2/t51.2885-15/s1080x1080/e15/fr/35274418_207295373248007_2552664476088270848_n.jpg","edge_liked_by":{"count":0},"edge_media_preview_like":{"count":0},"gating_info":null,"media_preview":"ACoqnuL0Qj5cMT0Gf1rBdi5LHqasXLK8hZBtB7fz/WoMV1xjZGLZHijFSbaNtXYVwhgaZtq/n2H1q9/Zn/TRfyNVAxUYHGetNyfU1LT6DuXnsnTn09f6VX21tERy/wAZCjt3P41BNaqMGLkHj1pRl0luJrqjM20m2rrwGM7TjPtzTobYynGQB3J7VpdWv0I8ihtpNtdGkdvGnlvtb1PqaZstPQfmf8az9ouzL5X3RmLMGPIxgdulSiYDgZAPNVQOB/nvSjmvP9pLa5fKibzlPTmjzlH0qMKB27H+ZqOn7SXcXKiwZFAzTPNHpVVSc1ISaPaz7i5Uf//Z","owner":{"id":"7507466602"},"thumbnail_src":"https://instagram.fbed1-1.fna.fbcdn.net/vp/7cecb59edaba9f9f7565604eac28d8df/5BC63210/t51.2885-15/s640x640/sh0.08/e35/35274418_207295373248007_2552664476088270848_n.jpg","thumbnail_resources":[{"src":"https://instagram.fbed1-1.fna.fbcdn.net/vp/b499ce5fafa113fe57f7325d86628900/5BE96296/t51.2885-15/s150x150/e15/35274418_207295373248007_2552664476088270848_n.jpg","config_width":150,"config_height":150},{"src":"https://instagram.fbed1-1.fna.fbcdn.net/vp/f124ca8254e24569515be5f3f99ff911/5BE9A3A9/t51.2885-15/s240x240/e15/35274418_207295373248007_2552664476088270848_n.jpg","config_width":240,"config_height":240},{"src":"https://instagram.fbed1-1.fna.fbcdn.net/vp/5c82e7c2ae3905863fe25150fca1f5e4/5BCB4ED1/t51.2885-15/s320x320/e15/35274418_207295373248007_2552664476088270848_n.jpg","config_width":320,"config_height":320},{"src":"https://instagram.fbed1-1.fna.fbcdn.net/vp/e76685c6614c444d8ed5f04efc01435a/5BB6D257/t51.2885-15/s480x480/e15/35274418_207295373248007_2552664476088270848_n.jpg","config_width":480,"config_height":480},{"src":"https://instagram.fbed1-1.fna.fbcdn.net/vp/7cecb59edaba9f9f7565604eac28d8df/5BC63210/t51.2885-15/s640x640/sh0.08/e35/35274418_207295373248007_2552664476088270848_n.jpg","config_width":640,"config_height":640}],"is_video":false}},{"node":{"__typename":"GraphImage","id":"1757529388200541080","edge_media_to_caption":{"edges":[{"node":{"text":"What a nice 
day."}}]},"shortcode":"Bhj_6qyALuYgmy2sPgmUtoBcmcxZWGeyLkM3O00","edge_media_to_comment":{"count":3},"comments_disabled":false,"taken_at_timestamp":1523733851,"dimensions":{"height":1080,"width":1080},"display_url":"https://instagram.fbed1-1.fna.fbcdn.net/vp/16610d58bb6cc90893ffd264f81755c6/5BAD1DBC/t51.2885-15/s1080x1080/e15/fr/30590929_101347367387069_7153309976138612736_n.jpg","edge_liked_by":{"count":1},"edge_media_preview_like":{"count":1},"gating_info":null,"media_preview":"ACoqwgadupijNLn0oAdvNPErDoahopiLkdw6nIPNTfaX9apxDNS7fencLDIAP1qFsZ4qYAx9CDmmMmD1B/GsxkRoqXbnjjP1FN2H2/MVVwHKcVJvqLYf8ml2tSAtbce/vjFIFYeh/Sn0Uhjh7gU8Njjt+dMFLQA4Y9B+HFO+X0/z+VMpaAP/2Q==","owner":{"id":"7507466602"},"thumbnail_src":"https://instagram.fbed1-1.fna.fbcdn.net/vp/a4dfd1c28505301d4c440c95023fbbc7/5BC81D5E/t51.2885-15/s640x640/sh0.08/e35/30590929_101347367387069_7153309976138612736_n.jpg","thumbnail_resources":[{"src":"https://instagram.fbed1-1.fna.fbcdn.net/vp/d69525a20a2e61b2ee8663daf287a8ee/5BB46AD8/t51.2885-15/s150x150/e15/30590929_101347367387069_7153309976138612736_n.jpg","config_width":150,"config_height":150},{"src":"https://instagram.fbed1-1.fna.fbcdn.net/vp/4034b1752e3a4bace405aadd5a35477c/5BCB79E7/t51.2885-15/s240x240/e15/30590929_101347367387069_7153309976138612736_n.jpg","config_width":240,"config_height":240},{"src":"https://instagram.fbed1-1.fna.fbcdn.net/vp/96b684f38cb826f3efd8a7610ed6e9bb/5BEB899F/t51.2885-15/s320x320/e15/30590929_101347367387069_7153309976138612736_n.jpg","config_width":320,"config_height":320},{"src":"https://instagram.fbed1-1.fna.fbcdn.net/vp/51dac35520bc90d7c7253cc331acf561/5BB44919/t51.2885-15/s480x480/e15/30590929_101347367387069_7153309976138612736_n.jpg","config_width":480,"config_height":480},{"src":"https://instagram.fbed1-1.fna.fbcdn.net/vp/a4dfd1c28505301d4c440c95023fbbc7/5BC81D5E/t51.2885-15/s640x640/sh0.08/e35/30590929_101347367387069_7153309976138612736_n.jpg","config_width":640,"config_height":640}],"is_video":false}}]},"edge_saved_media":{"count":0,"page_info":{"has_next_page":false,"end_cursor":null},"edges":[]},"edge_media_collections":{"count":0,"page_info":{"has_next_page":false,"end_cursor":null},"edges":[]}}},"felix_onboarding_video_resources":{"mp4":"/static/videos/felix-onboarding/onboardingVideo.mp4/9d16838ca7f9.mp4","poster":"/static/images/felix-onboarding/onboardingVideoPoster.png/8fdba7cf2120.png"}}]},"gatekeepers":{"ld":true,"rt":true,"sw":true,"vl":true,"seo":true,"seoht":true,"2fac":true,"sf":true,"saa":true,"ai":true},"knobs":{"acct:ntb":0,"cb":0,"captcha":0},"qe":{"dash_for_vod":{"g":"","p":{}},"aysf":{"g":"","p":{}},"bc3l":{"g":"","p":{}},"comment_reporting":{"g":"","p":{}},"direct_conversation_reporting":{"g":"","p":{}},"direct_reporting":{"g":"","p":{}},"reporting":{"g":"","p":{}},"media_reporting":{"g":"","p":{}},"acc_recovery_link":{"g":"","p":{}},"notif":{"g":"","p":{}},"drct_nav":{"g":"","p":{}},"fb_unlink":{"g":"","p":{}},"mobile_stories_doodling":{"g":"","p":{}},"move_comment_input_to_top":{"g":"","p":{}},"mobile_cancel":{"g":"","p":{}},"mobile_search_redesign":{"g":"","p":{}},"show_copy_link":{"g":"control","p":{"show_copy_link_option":"false"}},"mobile_logout":{"g":"","p":{}},"pl_pivot_li":{"g":"control_0423","p":{"show_pivot":"false"}},"pl_pivot_lo":{"g":"","p":{}},"404_as_react":{"g":"","p":{}},"acc_recovery":{"g":"test_with_prefill","p":{"has_prefill":"true"}},"collections":{"g":"","p":{}},"comment_ta":{"g":"","p":{}},"connections":{"g":"control","p":{"has_suggestion_context_in_feed":"false"}},"disc_ppl":{"g":"control_02_27","
p":{"has_follow_all_button":"false","has_pagination":"false"}},"embeds":{"g":"","p":{}},"ebdsim_li":{"g":"control_shadow_0322","p":{"is_shadow_enabled":"false","use_new_ui":"true"}},"ebdsim_lo":{"g":"","p":{}},"empty_feed":{"g":"","p":{}},"bundles":{"g":"","p":{}},"exit_story_creation":{"g":"","p":{}},"gdpr_logged_out":{"g":"","p":{}},"appsell":{"g":"","p":{}},"imgopt":{"g":"control","p":{}},"follow_button":{"g":"test","p":{"is_inline":"true"}},"loggedout":{"g":"","p":{}},"loggedout_upsell":{"g":"test_with_new_loggedout_upsell_content_03_15_18","p":{"has_new_loggedout_upsell_content":"true"}},"us_li":{"g":"Test","p":{"show_related_media":"true"}},"msisdn":{"g":"","p":{}},"bg_sync":{"g":"","p":{}},"onetaplogin":{"g":"default_opt_in","p":{"default_value":"true","during_reg":"true","storage_version":"one_tap_storage_version"}},"onetaplogin_userbased":{"g":"","p":{}},"login_poe":{"g":"","p":{}},"prvcy_tggl":{"g":"","p":{}},"private_lo":{"g":"","p":{}},"profile_photo_nux_fbc_v2":{"g":"launch","p":{"prefill_photo":"true","skip_nux":"false"}},"profile_tabs":{"g":"","p":{}},"push_notifications":{"g":"","p":{}},"reg":{"g":"control_01_10","p":{"has_new_landing_appsells":"false","has_new_landing_page":"false"}},"reg_vp":{"g":"","p":{}},"feed_vp":{"g":"launch","p":{"is_hidden":"true"}},"report_haf":{"g":"","p":{}},"report_media":{"g":"","p":{}},"report_profile":{"g":"test","p":{"is_enabled":"true"}},"save":{"g":"test","p":{"is_enabled":"true"}},"sidecar":{"g":"","p":{}},"sidecar_swipe":{"g":"","p":{}},"su_universe":{"g":"test_login_autocomplete","p":{"use_autocomplete_signup":"true"}},"stale":{"g":"","p":{}},"stories_lo":{"g":"test_03_15","p":{"stories_profile":"true"}},"stories":{"g":"","p":{}},"tp_pblshr":{"g":"","p":{}},"video":{"g":"","p":{}},"gdpr_settings":{"g":"","p":{}},"gdpr_blocking_logout":{"g":"","p":{}},"gdpr_eu_tos":{"g":"","p":{}},"gdpr_row_tos":{"g":"test_05_01","p":{"tos_version":"row"}},"fd_gr":{"g":"control","p":{"show_post_back_button":"false"}},"felix":{"g":"test","p":{"is_enabled":"true"}},"felix_clear_fb_cookie":{"g":"control","p":{"is_enabled":"true","blacklist":"fbsr_124024574287414"}},"felix_creation_duration_limits":{"g":"dogfooding","p":{"minimum_length_seconds":"15","maximum_length_seconds":"600"}},"felix_creation_enabled":{"g":"","p":{"is_enabled":"true"}},"felix_creation_fb_crossposting":{"g":"control","p":{"is_enabled":"false"}},"felix_creation_fb_crossposting_v2":{"g":"control","p":{"is_enabled":"true"}},"felix_creation_validation":{"g":"control","p":{"edit_video_controls":"true"}},"felix_creation_video_upload":{"g":"","p":{}},"felix_early_onboarding":{"g":"","p":{}},"pride":{"g":"test","p":{"enabled":"true","hashtag_whitelist":"lgbt,lesbian,gay,bisexual,transgender,trans,queer,lgbtq,girlslikeus,girlswholikegirls,instagay,pride,gaypride,loveislove,pansexual,lovewins,transequalitynow,lesbiansofinstagram,asexual,nonbinary,lgbtpride,lgbta,lgbti,queerfashion,queers,queerpride,queerlife,marriageequality,pride2018,genderqueer,bi,genderfluid,lgbtqqia,comingout,intersex,transman,transwoman,twospirit,transvisibility,queerart,dragqueen,dragking,dragartist,twomoms,twodads,lesbianmoms,gaydads,gendernonconforming"}},"unfollow_confirm":{"g":"","p":{}},"profile_enhance_li":{"g":"control","p":{"has_tagged":"false"}},"profile_enhance_lo":{"g":"control","p":{"has_tagged":"false"}},"create_tag":{"g":"","p":{}}},"hostname":"www.instagram.com","platform":"ios","rhx_gis":"87a25368813608d393baaa28a0d6afb7","nonce":"zsP4NjzdJRIWmer6K5At1A==","zero_data":{},"rollout_hash":"5f72737283f8","bundle
_variant":"base","probably_has_app":false,"show_app_install":true};</script>
And this is the code I am trying to execute.
s = str(soup.find_all("script", type="text/javascript")[3])
m = re.search(r"(?<=window._sharedData = )(?P<json>.*)(?=</script>)", s)
if m:
    data = json.loads(m.group('json'))
    print(data)
    for i in data['entry_data']["ProfilePage"]:
        for j in i['graphql']['user']['edge_owner_to_timeline_media']['edges']:
            print(j['node']["id"])
Upon running this, I am prompted with the following error:
json.decoder.JSONDecodeError: Extra data: line 1 column 12215 (char 12214)
I am completely lost and have no idea where I am going wrong. All help is appreciated, and thanks in advance to all who contribute!
I think you could update your regex to match the json without the semicolon at the end by adding that to the positive lookahead (?=;</script>):
(?<=window\._sharedData = )(?P<json>.*)(?=;</script>)
For your given example, your code might look like this, without the [3] in the first line:

s = str(soup.find_all("script", type="text/javascript"))
m = re.search(r"(?<=window\._sharedData = )(?P<json>.*)(?=;</script>)", s)
if m:
    data = json.loads(m.group('json'))
    for i in data['entry_data']["ProfilePage"]:
        for j in i['graphql']['user']['edge_owner_to_timeline_media']['edges']:
            print(j['node']["id"])
There's not quite enough here to debug: what you give for s doesn't include the </script>, so the pattern never matches when I run it locally; however, when I append it, it seems to work correctly.
From the error it is clear that the contents of m.group('json') is not actually a valid JSON string, so I suspect you need to work on your regular expression. Try printing out the value of m.group('json') (before attempting to parse it) and feeding it into a JSON validator such as https://jsonlint.com/, which will direct you to where the error lies. Perhaps the line terminates with a ; that you need to strip out, or there is some other issue.
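If the trailing semicolon does turn out to be the culprit, stripping it before parsing is a one-line fix (a sketch, assuming the rest of the captured text is valid JSON):

import json

raw = m.group('json')                  # 'm' is the match object from the code above
data = json.loads(raw.rstrip('; \t'))  # drop the trailing ';' (and stray whitespace) first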

How can I parse email text for components like <salutation><body><signature><reply text> etc?

I'm writing an application that analyzes emails, and it would save me a bunch of time if I could use a Python library that would parse email text down into named components like <salutation><body><signature><reply text> etc.
For example, the following text "Hi Dave,\nLets meet up this Tuesday\nCheers, Tom\n\nOn Sunday, 15 May 2011 at 5:02 PM, Dave Trindall wrote: Hey Tom,\nHow about we get together ..." would be parsed as
Salutation: "Hi Dave,\n"
Body: "Lets meet up this Tuesday\n"
Signature: "Cheers, Tom\n\n"
Reply Text: "On Sunday, 15 May 2011 at 5:02 PM, Dave Trindal wrote: ..."
I know there's no perfect solution for this kind of problem, but even a library that does a good approximation would help. Where can I find one?
https://github.com/Trindaz/EFZP
This provides the functionality posed in the original question, plus fair recognition of email zones as they commonly appear in email written by native English speakers using common email clients like Outlook and Gmail.
If you score each line based on the types of words it contains, you may get a fairly good indication.
E.g. a line with greeting words near the start is the salutation (salutations may also contain phrases that refer to the past tense, e.g. it was good to see you last time).
A body will typically contain words such as "movie", "concert", etc. It will also contain verbs (go to, run, walk, etc.), question marks, and offerings (e.g. want to, can we, should we, prefer...).
Check out http://nodebox.net/code/index.php/Linguistics#verb_conjugation
http://ogden.basic-english.org/
http://osteele.com/projects/pywordnet/
The signature will contain closing words.
If you find a data source that has messages of the structure you want, you could do some frequency analysis to see how often each word occurs in each section.
Each word would get a score [salutation score, body score, signature score, ...].
E.g. "hello" could occur 900 times in the salutation, 10 times in the body, and 3 times in the signature.
This means "hello" would get assigned [900, 10, 3, ...],
and "cheers" might get assigned [10, 3, 100, ...].
Now you will have a large list of about 500,000 words.
Words that don't have a large range aren't useful.
E.g. "catch" might have [100, 101, 80, ...], a range of 21
(it was good to catch up, wanna go catch a fish, catch you later); "catch" can occur anywhere.
Now you can reduce the number of words down to about 10,000.
Now, for each line, give the line a score, also of the form [salutation score, body score, signature score, ...].
This score is calculated by adding the vector scores of each word.
E.g. the sentence "hello cheers for giving me your number" could be:
[900, 10, 3, ...] + [10, 3, 100, ...] + ... = [900+10+..., 10+3+..., 3+100+..., ...]
= [1023, 900, 500, ...], say.
Then, because the biggest number is in the first position (the salutation score), this sentence is a salutation.
Then, if you had to score one of your lines to see which component the line should be in, for each word you would add on its score.
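A toy sketch of that scoring scheme in Python (all word scores are made up for illustration):

SECTIONS = ['salutation', 'body', 'signature']

# word -> [salutation score, body score, signature score]
WORD_SCORES = {
    'hello':  [900, 10, 3],
    'cheers': [10, 3, 100],
    'meet':   [5, 200, 2],
}

def classify_line(line):
    totals = [0, 0, 0]
    for word in line.lower().split():
        scores = WORD_SCORES.get(word.strip('.,!?'), [0, 0, 0])
        for i, s in enumerate(scores):
            totals[i] += s
    # The section with the highest summed score wins
    return SECTIONS[totals.index(max(totals))]

print(classify_line('Hello, cheers for giving me your number'))  # salutation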
Good luck; there is always a trade-off between computational complexity and accuracy. If you can find a good set of words and build a good model to base your calculations on, it will help.
The first approach that comes to mind (not necessarily the best...) would be to start off by using split. Here's a little bit of code:
linearray = emailtext.split('\n')
Now you have an array of strings, each one roughly a paragraph,
so linearray[0] would contain the salutation.
Deciding where the reply text starts is a little more tricky: I noticed that there is a double newline just before it, so maybe do a search for that from the back and hope that the last one indicates the start of the reply text.
Or store some signature words you might expect and search for those from the front, like cheers, regards, and whatever else.
Once you figure out where the signature is, the rest is easy.
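A rough sketch of this approach (the signature words are just examples, and emailtext is assumed to hold the raw message):

SIGNATURE_WORDS = ('cheers', 'regards', 'thanks')

linearray = emailtext.split('\n')
salutation = linearray[0]

# Reply text: assume it starts after the last blank line (double newline)
split_at = emailtext.rfind('\n\n')
reply_text = emailtext[split_at:].lstrip() if split_at != -1 else ''

# Signature: the first line that starts with a known closing word
signature = next((l for l in linearray if l.lower().startswith(SIGNATURE_WORDS)), '')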
Hope this helped!
I actually built a pretty cheap API for this, to parse the contact data from signatures of emails and email chains. It's called SigParser. You can see the Swagger docs for it here.
Basically you send it an 'x-api-key' header with a JSON body like the one below, and it parses all the contacts in the reply chain of an email.
{
    "subject": "Thanks for meeting...",
    "from_address": "bgates@example.com",
    "from_name": "Bill Gates",
    "htmlbody": "<div>Hi, good seeing you the other day.</div><div>--</div><div>Bill Gates</div><div>Cell 777-444-8888</div>LinkedInTwitter",
    "plainbody": "Hi, good seeing you the other day. \r\n--\r\nBill Gates\r\nCell 777-444-8888",
    "date": "Mon, 28 May 2018 23:33:40 +0000 (UTC)"
}
