expected string or buffer when processing text file with python - python

I'm processing a large (120mb) text file from my thunderbird imap directory and attempting to extract to/from info from the headers using mbox and regex. the process runs for a while until I eventually get an exception: "TypeError: expected string or buffer".
The exception references the fifth line of this code:
PAT_EMAIL = re.compile(r"[0-9A-Za-z._-]+\#[0-9A-Za-z._-]+")
temp_list = []
mymbox = mbox("data.txt")
for email in mymbox.values():
from_address = PAT_EMAIL.findall(email["from"])
to_address = PAT_EMAIL.findall(email["to"])
for item in from_address:
temp_list.append(item) #items are added to a temporary list where they are sorted then written to file
I've run the code on other (smaller) files, so I'm guessing the issue is my file. The file appears to be just a bunch of text. Can someone point me in the write direction for debugging this?

There can only be one from address (I think!):
In the following:
from_address = PAT_EMAIL.findall(email["from"])
I have a feeling you're trying to duplicate the work of email.message_from_file and email.utils.parseaddr
from email.utils import parseaddr
>>> s = "Jon Clements <jon#example.com>"
>>> from email.utils import parseaddr
>>> parseaddr(s)
('Jon Clements', 'jon#example.com')
So you can use parseaddr(email['from'])[1] to get the email address and use that.
Similarly, you may wish to look at email.utils.getaddresses to handle to and cc addresses...

Well, I didn't solve the issue but have worked around it for my own purposes. I inserted a try statement so that the iteration just continues past any TypeError. For every thousand email addresses I'm getting about 8 failures, which will suffice. Thanks for your input!
PAT_EMAIL = re.compile(r"[0-9A-Za-z._-]+\#[0-9A-Za-z._-]+")
temp_list = []
mymbox = mbox("data.txt")
for email in mymbox.values():
try:
from_address = PAT_EMAIL.findall(email["from"])
except(TypeError):
print "TypeError!"
try:
to_address = PAT_EMAIL.findall(email["to"])
except(TypeError):
print "TypeError!"
for item in from_address:
temp_list.append(item) #items are added to a temporary list where they are sorted then written to file

Related

python-ldap3 is unable to add user to and existing LDAP group

I am able successfully connect using LDAP3 and retrieve my LDAP group members as below.
from ldap3 import Server, Connection, ALL, SUBTREE
from ldap3.extend.microsoft.addMembersToGroups import ad_add_members_to_groups as addMembersToGroups
>>> conn = Connection(Server('ldaps://ldap.****.com:***', get_info=ALL),check_names=False, auto_bind=False,user="ANT\*****",password="******", authentication="NTLM")
>>>
>>> conn.open()
>>> conn.search('ou=Groups,o=****.com', '(&(cn=MY-LDAP-GROUP))', attributes=['cn', 'objectclass', 'memberuid'])
it returns True and I can see members by printing
conn.entries
>>>
The above line says MY-LDAP-GROUP exists and returns TRUE while searching but throws LDAP group not found when I try to an user to the group as below
>>> addMembersToGroups(conn, ['myuser'], 'MY-LDAP-GROUP')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/****/anaconda3/lib/python3.7/site-packages/ldap3/extend/microsoft/addMembersToGroups.py", line 69, in ad_add_members_to_groups
raise LDAPInvalidDnError(group + ' not found')
ldap3.core.exceptions.LDAPInvalidDnError: MY-LDAP-GROUP not found
>>>
The above line says MY-LDAP-GROUP exists and returns TRUE
Returning True just means that the search succeeded. It doesn't mean that anything was found. Is there anything in conn.entries?
But I suspect your real problem is something different. If this is the source code for ad_add_members_to_groups, then it is expecting the distinguishedName of the group (notice the parameter name group_dn), but you're passing the cn (common name). For example, your code should be something like:
addMembersToGroups(conn, ['myuser'], 'CN=MY-LDAP-GROUP,OU=Groups,DC=example,DC=com')
If you don't know the DN, then ask for the distinguishedName attribute from the search.
A word of warning: that code for ad_add_members_to_groups retrieves all the current members before adding the new member. You might run into performance problems if you're working with groups that have large membership because of that (e.g. if the group has 1000 members, it will load all 1000 before adding anyone). You don't actually need to do that (you can add a new member without looking at the current membership). I think what they're trying to avoid is the error you get when you try to add someone who is already in the group. But I think there are better ways to handle that. It might not matter to you if you're only working with small groups.
After so many trial and errors, I got frustrated and used the older python-ldap library to add existing users. Now my code is a mixture of ldap3 and ldap.
I know this is not what the OP has desired. But this may help someone.
Here the user Dinesh Kumar is already part of a group group1. I am trying to add him
to another group group2 which is successful and does not disturb the existing group
import ldap
import ldap.modlist as modlist
def add_existing_user_to_group(user_name, user_id, group_id):
"""
:return:
"""
# ldap expects a byte string.
converted_user_name = bytes(user_name, 'utf-8')
converted_user_id = bytes(user_id, 'utf-8')
converted_group_id = bytes(group_id, 'utf-8')
# Add all the attributes for the new dn
ldap_attr = {}
ldap_attr['uid'] = converted_user_name
ldap_attr['cn'] = converted_user_name
ldap_attr['uidNumber'] = converted_user_id
ldap_attr['gidNumber'] = converted_group_id
ldap_attr['objectClass'] = [b'top', b'posixAccount', b'inetOrgPerson']
ldap_attr['sn'] = b'Kumar'
ldap_attr['homeDirectory'] = b'/home/users/dkumar'
# Establish connection to server using ldap
conn = ldap.initialize(server_uri, bytes_mode=False)
bind_resp = conn.simple_bind_s("cn=admin,dc=testldap,dc=com", "password")
dn_new = "cn={},cn={},ou=MyOU,dc=testldap,dc=com".format('Dinesh Kumar','group2')
ldif = modlist.addModlist(ldap_attr)
try:
response = conn.add_s(dn_new, ldif)
except ldap.error as e:
response = e
print(" The response is ", response)
conn.unbind()
return response

How do I avoid getting a sporadic KeyError: 'data' when using the Reddit API in python?

I have the following python code that is working ok to use reddit's api and look up the front page of different subreddits and their rising submissions.
from pprint import pprint
import requests
import json
import datetime
import csv
import time
subredditsToScan = ["Arts", "AskReddit", "askscience", "aww", "books", "creepy", "dataisbeautiful", "DIY", "Documentaries", "EarthPorn", "explainlikeimfive", "food", "funny", "gaming", "gifs", "history", "jokes", "LifeProTips", "movies", "music", "pics", "science", "ShowerThoughts", "space", "sports", "tifu", "todayilearned", "videos", "worldnews"]
ofilePosts = open('posts.csv', 'wb')
writerPosts = csv.writer(ofilePosts, delimiter=',')
ofileUrls = open('urls.csv', 'wb')
writerUrls = csv.writer(ofileUrls, delimiter=',')
for subreddit in subredditsToScan:
front = requests.get(r'http://www.reddit.com/r/' + subreddit + '/.json')
rising = requests.get(r'http://www.reddit.com/r/' + subreddit + '/rising/.json')
front.text
rising.text
risingData = rising.json()
frontData = front.json()
print(len(risingData['data']['children']))
print(len(frontData['data']['children']))
for i in range(0, len(risingData['data']['children'])):
author = risingData['data']['children'][i]['data']['author']
score = risingData['data']['children'][i]['data']['score']
subreddit = risingData['data']['children'][i]['data']['subreddit']
gilded = risingData['data']['children'][i]['data']['gilded']
numOfComments = risingData['data']['children'][i]['data']['num_comments']
linkUrl = risingData['data']['children'][i]['data']['permalink']
timeCreated = risingData['data']['children'][i]['data']['created_utc']
writerPosts.writerow([author, score, subreddit, gilded, numOfComments, linkUrl, timeCreated])
writerUrls.writerow([linkUrl])
for j in range(0, len(frontData['data']['children'])):
author = frontData['data']['children'][j]['data']['author'].encode('utf-8').strip()
score = frontData['data']['children'][j]['data']['score']
subreddit = frontData['data']['children'][j]['data']['subreddit'].encode('utf-8').strip()
gilded = frontData['data']['children'][j]['data']['gilded']
numOfComments = frontData['data']['children'][j]['data']['num_comments']
linkUrl = frontData['data']['children'][j]['data']['permalink'].encode('utf-8').strip()
timeCreated = frontData['data']['children'][j]['data']['created_utc']
writerPosts.writerow([author, score, subreddit, gilded, numOfComments, linkUrl, timeCreated])
writerUrls.writerow([linkUrl])
It works well and scrapes the data accurately but it constantly gets interrupted, seemingly randomly, and has a run time crash, saying:
Traceback (most recent call last):
File "dataGather1.py", line 27, in <module>
for i in range(0, len(risingData['data']['children'])):
KeyError: 'data'
I have no idea why this error is occuring on and off and not consistently. I thought maybe I am calling the API too much so it stops me from accessing it so I threw a sleep in my code but that did not help. Any ideas?
When there are no data on the response from the API there are is no key data on the dictionary so you get a keyError on some subreddits. You need to use a try catch
The json you are parsing doesn't contain the 'data' element. Thus you get an error. I think your hunch is correct though. It is probably rate limiting, or that you're asking for hidden/deleted entries.
Reddit is very strict about accessing their API without playing nice. Meaning you should register your app and use a meaningful user-agent to your requets, and you should probably use the python library for this kind of thing: https://praw.readthedocs.io/en/latest/
Without registering it seems to my experience that the direct REST reddit API is even more strict than the 1 request per 2 seconds rule they have (had?).
Python raises a KeyError whenever a dict() object is requested (using the format a = adict[key]) and the key is not in the dictionary.
It seems like when you are getting this error, your data value is empty.
You might just try to get the length of the dictionary before you execute the for loop. If it’s empty, it will just not run. Some interesting error checking here might help.
size = len(risingData)
if size:
for i in range(0,size):
…

Getting KeyError when parsing JSON containing three layers of keys, using Python

I'm building a Python program to parse some calls to a social media API into CSV and I'm running into an issue with a key that has two keys above it in the hierarchy. I get this error when I run the code with PyDev in Eclipse.
Traceback (most recent call last):
line 413, in <module>
main()
line 390, in main
postAgeDemos(monitorID)
line 171, in postAgeDemos
age0To17 = str(i["ageCount"]["sortedAgeCounts"]["ZERO_TO_SEVENTEEN"])
KeyError: 'ZERO_TO_SEVENTEEN'
Here's the section of the code I'm using for it. I have a few other functions built already that work with two layers of keys.
import urllib.request
import json
def postAgeDemos(monitorID):
print("Enter the date you'd like the data to start on")
startDate = input('The date must be in the format YYYY-MM-DD. ')
print("Enter the date you'd like the data to end on")
endDate = input('The date must be in the format YYYY-MM-DD. ')
dates = "&start="+startDate+"&end="+endDate
urlStart = getURL()
authToken = getAuthToken()
endpoint = "/monitor/demographics/age?id=";
urlData = urlStart+endpoint+monitorID+authToken+dates
webURL = urllib.request.urlopen(urlData)
fPath = getFilePath()+"AgeDemographics"+startDate+"&"+endDate+".csv"
print("Connecting...")
if (webURL.getcode() == 200):
print("Connected to "+urlData)
print("This query returns information in a CSV file.")
csvFile = open(fPath, "w+")
csvFile.write("postDate,totalPosts,totalPostsWithIdentifiableAge,0-17,18-24,25-34,35+\n")
data = webURL.read().decode('utf8')
theJSON = json.loads(data)
for i in theJSON["ageCounts"]:
postDate = i["startDate"]
totalDocs = str(i["numberOfDocuments"])
totalAged = str(i["ageCount"]["totalAgeCount"])
age0To17 = str(i["ageCount"]["sortedAgeCounts"]["ZERO_TO_SEVENTEEN"])
age18To24 = str(i["ageCount"]["sortedAgeCounts"]["EIGHTEEN_TO_TWENTYFOUR"])
age25To34 = str(i["ageCount"]["sortedAgeCounts"]["TWENTYFIVE_TO_THIRTYFOUR"])
age35Over = str(i["ageCount"]["sortedAgeCounts"]["THIRTYFIVE_AND_OVER"])
csvFile.write(postDate+","+totalDocs+","+totalAged+","+age0To17+","+age18To24+","+age25To34+","+age35Over+"\n")
print("File printed to "+fPath)
csvFile.close()
else:
print("Server Error, No Data" + str(webURL.getcode()))
Here's a sample of the JSON I'm trying to parse.
{"ageCounts":[{"startDate":"2016-01-01T00:00:00","endDate":"2016-01-02T00:00:00","numberOfDocuments":520813,"ageCount":{"sortedAgeCounts":{"ZERO_TO_SEVENTEEN":3245,"EIGHTEEN_TO_TWENTYFOUR":4289,"TWENTYFIVE_TO_THIRTYFOUR":2318,"THIRTYFIVE_AND_OVER":70249},"totalAgeCount":80101}},{"startDate":"2016-01-02T00:00:00","endDate":"2016-01-03T00:00:00","numberOfDocuments":633709,"ageCount":{"sortedAgeCounts":{"ZERO_TO_SEVENTEEN":3560,"EIGHTEEN_TO_TWENTYFOUR":1702,"TWENTYFIVE_TO_THIRTYFOUR":2786,"THIRTYFIVE_AND_OVER":119657},"totalAgeCount":127705}}],"status":"success"}
Here it is again with line breaks so it's a little easier to read.
{"ageCounts":[{"startDate":"2016-01-01T00:00:00","endDate":"2016-01-02T00:00:00","numberOfDocuments":520813,"ageCount":
{"sortedAgeCounts":{"ZERO_TO_SEVENTEEN":3245,"EIGHTEEN_TO_TWENTYFOUR":4289,"TWENTYFIVE_TO_THIRTYFOUR":2318,"THIRTYFIVE_AND_OVER":70249},"totalAgeCount":80101}},
{"startDate":"2016-01-02T00:00:00","endDate":"2016-01-03T00:00:00","numberOfDocuments":633709,"ageCount":
{"sortedAgeCounts":{"ZERO_TO_SEVENTEEN":3560,"EIGHTEEN_TO_TWENTYFOUR":1702,"TWENTYFIVE_TO_THIRTYFOUR":2786,"THIRTYFIVE_AND_OVER":119657},"totalAgeCount":127705}}],"status":"success"}
I've tried removing the ["sortedAgeCounts"] from in the middle of
age0To17 = str(i["ageCount"]["sortedAgeCounts"]["ZERO_TO_SEVENTEEN"])
but I still get the same error. I've remove the 0-17 section to test the other age ranges and I get the same error for them as well. I tried removing all the underscores from the JSON and then using keys without the underscores.
I've also tried moving the str() to convert to string from the call to where the output is printed but the error persists.
Any ideas? Is this section not actually a JSON key, maybe a problem with the all caps or am I just doing something dumb? Any other code improvements are welcome as well but I'm stuck on this one.
Let me know if you need to see anything else. Thanks in advance for your help.
Edited(This works):
JSON=json.loads(s)
for i in JSON:
print str(JSON[i][0]["ageCount"]["sortedAgeCounts"]["ZERO_TO_SEVENTEEN"])
s is a string which contains the your JSON.

parsing API with Python - how to handle JSON with BOM

I'm using Python 2.7.11 on windows to get JSON data from API (data on trees in Warsaw, Poland, but nevermind that). I want to generate output csv file with all the data provided by the api, for further analysis. I started with a script I used for another project (also discussed here on Stackoverflow and corrected for me by #Martin Taylor).That script didn't work so I tried to modify it using my very basic understanding, googling around and applying pdb debugger. At the moment, the result looks like this:
import pdb
import json
import urllib2
import csv
pdb.set_trace()
url = "https://api.um.warszawa.pl/api/action/datastore_search/?resource_id=ed6217dd-c8d0-4f7b-8bed-3b7eb81a95ba"
myfile = 'C:/dane/drzewa.csv'
csv_myfile = csv.writer(open(myfile, 'wb'))
cols = ['numer_adres', 'stan_zdrowia', 'y_wgs84', 'dzielnica', 'adres', 'lokalizacja', 'wiek_w_dni', 'srednica_k', 'pnie_obwod', 'miasto', 'jednostka', 'x_pl2000', 'wysokosc', 'y_pl2000', 'numer_inw', 'x_wgs84', '_id', 'gatunek_1', 'gatunek', 'data_wyk_pom']
csv_myfile.writerow(cols)
def api_iterate(myfile):
while True:
global url
print url
json_page = urllib2.urlopen(url)
data = json.load(json_page)
json_page.close()
for data_object in data ['result']['records']:
csv_myfile.writerow([data_object[col] for col in cols])
try:
url = data['_links']['next']
except KeyError as e:
break
with open(myfile, 'wb'):
api_iterate(myfile)
I'm a very fresh Python user so I get confused all the time. Now I got to the point when, while reading the objects in json dictionary, I get a Keyerror message associated with the 'x_wgs84' element. I suppose it has something to do with the fact that in the source url this element is preceded by a U+FEFF unicode character. I tried to get around this but I got stuck and would appreciate assistance.
I suspect the code may be corrupt in several other ways - as I mentioned, I'm a very unskilled programmer (yet).
You need to put the key with the unicode character:
To know how to do it, one easy way is to print the keys:
>>> import requests
>>> res = requests.get('https://api.um.warszawa.pl/api/action/datastore_search/?resource_id=ed6217dd-c8d0-4f7b-8bed-3b7eb81a95ba')
>>> data = res.json()
>>> records = data['result']['records']
>>> records[0]
{u'numer_adres': u'', u'stan_zdrowia': u'dobry', u'y_wgs84': u'52.21865', u'y_pl2000': u'5787241.04475524', u'adres': u'ul. ALPEJSKA', u'x_pl2000': u'7511793.96937063', u'lokalizacja': u'Ulica ALPEJSKA', u'wiek_w_dni': u'60', u'miasto': u'Warszawa', u'jednostka': u'Dzielnica Wawer', u'pnie_obwod': u'73', u'wysokosc': u'14', u'data_wyk_pom': u'20130709', u'dzielnica': u'Wawer', u'\ufeffx_wgs84': u'21.172584', u'numer_inw': u'D386200', u'_id': 125435, u'gatunek_1': u'Quercus robur', u'gatunek': u'd\u0105b szypu\u0142kowy', u'srednica_k': u'7'}
>>> records[0].keys()
[u'numer_adres', u'stan_zdrowia', u'y_wgs84', u'y_pl2000', u'adres', u'x_pl2000', u'lokalizacja', u'wiek_w_dni', u'miasto', u'jednostka', u'pnie_obwod', u'wysokosc', u'data_wyk_pom', u'dzielnica', u'\ufeffx_wgs84', u'numer_inw', u'_id', u'gatunek_1', u'gatunek', u'srednica_k']
>>> records[0][u'\ufeffx_wgs84']
u'21.172584'
As you can see, to get your key, you need to write it as '\ufeffx_wgs84' with the unicode character that is causing trouble.
Note: I don't know if you are using python2 or 3, but you might need to put a u before your string declaration in python2 to declare it as unicode string.

Reading saved email file with “.msg” extension, in local disk

How to read email file (saved email to local drive, with “.msg” extension)?
I tried this 2 lines and it doesn't work out.
msg = open('Departure HOUSTON EXPRESS Port NORFOLK.msg', 'r')
print msg.read()
I searched the web for an answer, which gave the below code:
import email
def read_MSG(file):
email_File = open(file)
messagedic = email.Message(email_File)
content_type = messagedic["plain/text"]
FROM = messagedic["From"]
TO = messagedic.getaddr("To")
sujet = messagedic["Subject"]
email_File.close()
return content_type, FROM, TO, sujet
myMSG= read_MSG(r"c:\\myemail.msg")
print myMSG
However it gives an error:
Traceback (most recent call last):
File "C:\Python27\G.py", line 19, in <module>
myMSG= read_MSG(r"c:\\myemail.msg")
File "C:\Python27\G.py", line 10, in read_MSG
messagedic = email.Message(email_File)
TypeError: 'LazyImporter' object is not callable
Some responses on Internet tell it’d better to convert the .msg to .eml before parsing but I am not really sure how.
What would be the best way to read a .msg file?
The code you have now looks to be completely unworkable for what you're trying to accomplish. You need to parse Outlook ".msg" files, which can be done in Python but not using the email module. But if you can use ".eml" files as you mentioned, it will be easier because the email module can read those.
To read .eml files, see email.message_from_file().
In case someone else comes across this like me, almost a decade after the original question:
After trying some different solutions offered here and elsewhere on the internet, I found that the easiest for me was to use extract-msg, which you can install with pip. The readme documentation is limited, but the doc-strings in the actual library is quite comprehensive.
In my case, I needed to read a .msg on disc and specifically save its attachments to disc. Here is some sample code to show how easy this is with extact-msg:
import extract_msg
msg = extract_msg.openMsg('c:/some_folder/some_mail.msg')
sender = msg.sender
subject = msg.subject
body = msg.body
time_received = msg.receivedTime # datetime
attachment_filenames = []
for att in msg.attachments:
att.save(customPath='c:/saved_attachments/')
attachment_filenames.append(att.name)

Categories

Resources