Python 3.6 Mbox to CSV - python

I'm trying to write a script that will convert each email element of an .mbox file into a .csv file. I specifically need the following elements, but if there was a way to "write for each element," that'd be preferred:
To, From, CC'd, BCC'd, Date, Subject, Body
I found a script online that looks to be the start of what I need, along with the documentation for the email module, but I can't find any specifics on how to identify the different attribute options (to, from, cc'd, etc.) or how to write them as unique cell values in a .csv.
Here is sample code I found:
import mailbox
import csv
writer = csv.writer(open("clean_mail_B.csv", "wb"))
for message in mailbox.mbox('Saks.mbox'):
    writer.writerow([message['to'], message['from'], message['date']])

To do that you would first need to determine the complete list of possible keys present in all mailbox items. Then you can use that to write the CSV header.
Next you need to get all the key value pairs from each message using .items(). This can then be converted back into a dictionary and written to your CSV file.
Unfortunately, the mailbox library does not directly expose the message headers as a dictionary; otherwise it would have been possible to write each message directly.
import mailbox
import csv
mbox_file = 'sample.mbox'
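with open('clean_mail_B.csv', 'w', newline='', encoding='utf-8') as f_output:
    # Create a column for the first 30 message payload sections
    fieldnames = {f'Part{part:02}' for part in range(1, 31)}
    # First pass: collect every header name used anywhere in the mailbox
    for message in mailbox.mbox(mbox_file):
        fieldnames.update(message.keys())
    csv_output = csv.DictWriter(f_output, fieldnames=sorted(fieldnames), restval='', extrasaction='ignore')
    csv_output.writeheader()
    # Second pass: write one row per message
    for message in mailbox.mbox(mbox_file):
        items = dict(message.items())
        # get_payload() returns a list of parts only for multipart messages;
        # wrap a single-part payload so it is not iterated character by character
        payloads = message.get_payload()
        if not message.is_multipart():
            payloads = [payloads]
        for part, payload in enumerate(payloads, start=1):
            items[f'Part{part:02}'] = payload
        csv_output.writerow(items)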
A DictWriter is used rather than a standard CSV writer because it copes better when certain messages do not contain all possible header values.
The message payload can be in multiple parts; these are added as separate columns, e.g. Part01, Part02. Normally there should be 1 or 2, but your sample mbox contained one message with a strange signature containing 25.
If a message contains more than 30 payload entries, the extra ones are ignored using extrasaction='ignore'. An alternative approach would be to combine all payloads into a single column.
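The single-column alternative could be sketched roughly as follows. This is only an illustration, not part of the original answer: the helper name combined_payload and the 'Body' column are assumptions, and the sketch keeps only text parts, skipping attachments and other binary content.

```python
import email
from email.message import Message


def combined_payload(message: Message, separator: str = "\n") -> str:
    """Join the decoded text parts of a message into one string.

    Hypothetical helper: the name and the choice to skip non-text
    parts (attachments, images) are assumptions for this sketch.
    """
    chunks = []
    for part in message.walk():
        if part.is_multipart():
            continue  # container parts carry no text of their own
        if part.get_content_maintype() != "text":
            continue  # skip attachments and other binary parts
        raw = part.get_payload(decode=True)  # bytes, transfer encoding undone
        if raw is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(raw.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset label in the message: fall back to UTF-8
            chunks.append(raw.decode("utf-8", errors="replace"))
    return separator.join(chunks)


# Small self-contained demonstration with a made-up two-part message
raw_message = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: demo\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "plain body\n"
    "--XX\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>html body</p>\n"
    "--XX--\n"
)
demo = email.message_from_string(raw_message)
row = dict(demo.items())
row["Body"] = combined_payload(demo)
print(row["Body"])
```

In the script above, items['Body'] = combined_payload(message) would take the place of the inner Part loop, with 'Body' added to fieldnames instead of the thirty Part columns. Messages read from mailbox.mbox are email Message objects, so the same helper should apply to them.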

Related

Selecting interface to sniff in Python using a variable

I need to create a sniff in python using the sniff command, to collect packets entering several interfaces.
When I do it specifying the interfaces with their names with the following command:
sniff(iface=["s1-cpu-eth1","s2-cpu-eth1","s3-cpu-eth1","s4-cpu-eth1"], prn=self.recv)
It works, but if I try to use a variable (this is needed because interfaces can change depending on the context and they can be obtained through a for loop populating a variable), such as:
if_to_sniff="\"s1-cpu-eth1\",\"s2-cpu-eth1\",\"s3-cpu-eth1\",\"s4-cpu-eth1\""
sniff(iface=[if_to_sniff], prn=self.recv)
It doesn't work.
I actually tried several ways, but I always get an error saying that the device doesn't exist. How can I do this?
if_to_sniff="\"s1-cpu-eth1\",\"s2-cpu-eth1\",\"s3-cpu-eth1\",\"s4-cpu-eth1\""
This string looks like a line of CSV, in which case we can use Python's csv reader to parse it for us:
import csv
if_to_sniff="\"s1-cpu-eth1\",\"s2-cpu-eth1\",\"s3-cpu-eth1\",\"s4-cpu-eth1\""
# csv.reader expects a file or a list of strings (like lines in csv file)
reader = csv.reader([if_to_sniff])
# get the first row from our 'csv'
# interfaces will be the 'columns' of that row
# (i.e. split into a list of sub strings)
interfaces = next(reader)
sniff(iface=interfaces, prn=self.recv)
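The original attempt fails because [if_to_sniff] is a list with a single element, the whole quoted string, which is then treated as one (non-existent) device name. If the interface names come out of a loop anyway, it may be simpler to collect them into a list from the start and skip the quoted string entirely. A minimal sketch, assuming the s<N>-cpu-eth1 naming pattern from the question; the helper name cpu_interfaces is made up for illustration:

```python
def cpu_interfaces(switch_count):
    """Build the interface names s1-cpu-eth1 ... s<N>-cpu-eth1.

    Hypothetical helper; the naming pattern is copied from the question.
    """
    return [f"s{number}-cpu-eth1" for number in range(1, switch_count + 1)]


interfaces = cpu_interfaces(4)
print(interfaces)

# The list can then be passed as-is, just like the hard-coded
# version in the question:
# sniff(iface=interfaces, prn=self.recv)
```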

Scraping only select fields from a JSON file

I'm trying to output only the following JSON data fields, but for some reason the entire response is written to the .html file. What am I doing wrong? It should only produce the fields referenced, e.g. title, audio source URL, medium-sized image, etc.
r = urllib.urlopen('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
data = json.loads(r.read().decode('utf-8'))
for post in data['posts']:
    # data.append([post['title'], post['audioSource'], post['image']['medium'], post['excerpt']['long']])
    ([post['title'], post['audioSource'], post['image']['medium'], post['excerpt']['long']])
with io.open('criminal-json.html', 'w', encoding='utf-8') as r:
    r.write(json.dumps(data, ensure_ascii=False))
You need to keep your input data separate from your output data. Your for loop picks out the selected fields but never stores them, and you then dump data, the same variable that holds the full input. Instead, add the selected fields from the input to a separate list that holds the output.
Don't re-use the same variable names. Here is what you want:
import urllib
import json
import io
url = urllib.urlopen('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
data = json.loads(url.read().decode('utf-8'))
posts = []
for post in data['posts']:
    posts.append([post['title'], post['audioSource'], post['image']['medium'], post['excerpt']['long']])
with io.open('criminal-json.html', 'w', encoding='utf-8') as r:
    r.write(json.dumps(posts, ensure_ascii=False))
You are loading the whole json in the variable data, and you are dumping it without changing it. That's the reason why this is happening. What you need to do is put whatever you want into a new variable and then dump it.
Look at this line:
([post['title'], post['audioSource'], post['image']['medium'], post['excerpt']['long']])
It builds a list and immediately discards it, so data remains unchanged. Do what Mark Tolonen suggested and it'll be fine.
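The "put what you want into a new variable, then dump it" idea can be tried without touching the network. The sketch below uses a small made-up dictionary in place of the real API response, and it stores each post as a dict rather than a list; the output key names and the helper name select_fields are assumptions chosen for illustration.

```python
import json


def select_fields(data):
    """Return only the wanted fields for each post in the parsed JSON."""
    return [
        {
            "title": post["title"],
            "audioSource": post["audioSource"],
            "image": post["image"]["medium"],
            "excerpt": post["excerpt"]["long"],
        }
        for post in data["posts"]
    ]


# Made-up sample standing in for the real API response
sample = {
    "posts": [
        {
            "title": "Episode 1",
            "audioSource": "https://example.com/1.mp3",
            "image": {"medium": "https://example.com/1-m.jpg", "large": "x"},
            "excerpt": {"long": "Long text", "short": "Short"},
            "unused": "dropped from the output",
        }
    ],
    "count": 1,
}

selected = select_fields(sample)  # new variable; the input stays untouched
output = json.dumps(selected, ensure_ascii=False)
print(output)
```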

Use only part of an attribute when converting XML to CSV

I am new to Python and am attempting to process some XML into CSV files for later diff validation against the output from a database. The code I have below does a good job of taking the 'tct-id' attributes from the XML and outputting them in a nice column under the heading 'DocumentID', as I need for my validation.
However, the outputs from the database just come as numbers, whereas the outputs from this code include the version number of the XML ID; for example
tct-id="D-TW-0010054;3;"
where I need the ;3; removed so I can validate properly.
This is the code I have; is there any way I can go about rewriting this so it will pre-process the XML snippets to remove that - like only take the first 12 characters from each attribute and write those to the CSV, for example?
from lxml import etree
import csv
xml_fname = 'example.xml'
csv_fname = 'output.csv'
fields = ['tct-id']
xml = etree.parse(xml_fname)
with open(xml_fname) as infile, open(csv_fname, 'w', newline='') as outfile:
    r = csv.DictReader(infile)
    w = csv.DictWriter(outfile, fields, delimiter=';', extrasaction="ignore")
    wtr = csv.writer(outfile)
    wtr.writerow(["DocumentID"])
    for node in xml.xpath("//*[self::StringVariables or self::ElementVariables or self::PubInfo or self::Preface or self::DocHistory or self::Glossary or self::StatusInfo or self::Chapter]"):
        atts = node.attrib
        atts["elm_name"] = node.tag
        w.writerow(node.attrib)
All help is very appreciated.
Assuming there is only one ;3;-style suffix to remove from the tct-id, you can use a regular expression:
import re
tct_id = "D-TW-0010054;3;"
# Find the ';...;' suffix, then strip it from the ID
to_rem = re.findall(r'(;.*;)', tct_id)[0]
tct_id = tct_id.replace(to_rem, '')
Note that I'm using tct_id instead of tct-id because a hyphen is not valid in a Python variable name.
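As a possible alternative to the regular expression, plain str.split can cut the ID at the first semicolon. Unlike taking the first 12 characters, this does not depend on the ID length, and it leaves IDs without a version suffix untouched. The sketch below is only an illustration: the helper name strip_version and the sample values are made up, and an in-memory buffer stands in for the real output file.

```python
import csv
import io


def strip_version(tct_id):
    """Return the part of a tct-id before the first ';'."""
    return tct_id.split(";", 1)[0]


# Made-up attribute values in the format shown in the question
raw_ids = ["D-TW-0010054;3;", "D-TW-0010055;12;", "D-TW-0010056"]

buffer = io.StringIO()  # stands in for the real output file
writer = csv.writer(buffer)
writer.writerow(["DocumentID"])
for raw_id in raw_ids:
    writer.writerow([strip_version(raw_id)])

print(buffer.getvalue())
```

In the question's loop, the same call could be applied to node.attrib['tct-id'] before the row is written.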

Missing double quotes (randomly) when downloading csv file using Oauth2.0 gmail API

I am downloading CSV report attachments from Gmail. I am using Python 3.6.1 and the OAuth 2.0 Gmail API.
There is a date column in the csv file and I hard code its format to '%Y-%m-%d'.
When I download the csv attachment and inspect it as a text file, most times, I get the expected date format as follows (1st 3 columns of 1st 2 lines) -
"date","advertiser","advertiser_id", ...
"2017-05-27","Swiss.com India (UK)","29805", ...
However, on occasion, the quotes from the csv file are missing - I then get it as -
date,advertiser,advertiser_id, ...
27/05/2017,Swiss.com India (UK),29805, ...
In this situation, the date pattern turns out to be '%d/%m/%Y'.
There is no discernible pattern to when a file would be downloaded with the unquoted dates. Most times, if I delete the downloaded file and re-run my script, the quoted attachment is re-downloaded.
Is there a way to set up the attachment download such that the date column is downloaded in the quoted format? Or is there a way to ensure that when I read the csv (using csv.reader) I always get the date column in a certain format?
The specific method I am using to download attachments is given here -
https://developers.google.com/gmail/api/v1/reference/users/messages/attachments/get (Python version). The exact code snippet is -
# Get the body of this part and its keys.
part_body = part['body']
part_body_keys = part_body.keys()
...
if 'data' in part_body_keys:
    a_data = part_body['data']
elif 'attachmentId' in part_body_keys:
    att_id = part_body['attachmentId']
    att = service.users().messages().attachments().get(
        userId=user_id, messageId=message['id'],
        id=att_id).execute()
    a_data = att['data']
else:
    ...
# Decode it appropriately and write it to the file.
file_data = base64.urlsafe_b64decode(a_data.encode('UTF-8'))
...
f = open(file_name, 'wb')
f.write(file_data)
f.close()
The code snippet when reading the csv file is -
infile = open(file_name, mode="r", encoding='ascii', errors='ignore')
filereader = csv.reader(infile)
date_fmt = "%Y-%m-%d"
…
for a_row in filereader:
    …
    try:
        rf_datetime = time.strptime(a_row[0], date_fmt)
        …
Any pointers would be appreciated! This script has become a key component of my business that automates our reporting process and has visibly reduced effort all around.
Regards
Nitin
It looks like the attached csv files themselves are in different formats (or maybe there is a difference between the 'data' and 'attachmentId' cases?).
To be sure, you could download them manually and check them in a text editor.
As for the quotes: for csv it makes no difference whether the fields are quoted or not. A field only needs to be surrounded with quotes when it contains a field separator, a quote character, or a line break. Since you're using a csv reader, this shouldn't matter.
As for the dates, it's probably easiest to check the date format once before the reading loop (using the first data row) and set date_fmt (for parsing) accordingly.
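The "check the format once" idea might look something like the sketch below. It assumes the only two formats in play are the ones shown in the question; the names CANDIDATE_FORMATS, detect_date_format and read_dates are made up for illustration, and in-memory strings stand in for the downloaded files. Note that this approach cannot tell apart genuinely ambiguous formats such as day/month versus month/day.

```python
import csv
import io
import time

# Candidate formats: the two seen in the question. Extend if others appear.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def detect_date_format(value, formats=CANDIDATE_FORMATS):
    """Return the first format in `formats` that parses `value`."""
    for fmt in formats:
        try:
            time.strptime(value, fmt)
        except ValueError:
            continue
        return fmt
    raise ValueError(f"unrecognised date format: {value!r}")


def read_dates(csv_text):
    """Parse the first column of every data row as a date."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    date_fmt = None
    parsed = []
    for row in reader:
        if date_fmt is None:
            # Decide once, from the first data row
            date_fmt = detect_date_format(row[0])
        parsed.append(time.strptime(row[0], date_fmt))
    return parsed


# The two variants from the question, shortened to two columns
quoted = '"date","advertiser"\n"2017-05-27","Swiss.com India (UK)"\n'
unquoted = "date,advertiser\n27/05/2017,Swiss.com India (UK)\n"

print(read_dates(quoted)[0][:3])
print(read_dates(unquoted)[0][:3])
```

Both variants go through the same csv.reader call, which illustrates that the quoting itself does not affect the parsed values; only the date pattern differs.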

Importing JSON in Python and Removing Header

I'm trying to write a simple JSON to CSV converter in Python for Kiva. The JSON file I am working with looks like this:
{"header":{"total":412045,"page":1,"date":"2012-04-11T06:16:43Z","page_size":500},"loans":[{"id":84,"name":"Justine","description":{"languages":["en"], REST OF DATA
The problem is, when I use json.load, I only get the strings "header" and "loans" in data, but not the actual information such as id, name, description, etc. How can I skip over everything until the [? I have a lot of files to process, so I can't manually delete the beginning in each one. My current code is:
import csv
import json
fp = csv.writer(open("test.csv","wb+"))
f = open("loans/1.json")
data = json.load(f)
f.close()
for item in data:
    fp.writerow([item["name"]] + [item["posted_date"]] + OTHER STUFF)
Instead of
for item in data:
use
for item in data['loans']:
The header is stored in data['header'] and data itself is a dictionary, so you'll have to key into it in order to access the data.
data is a dictionary, so for item in data iterates the keys.
You probably want for loan in data['loans']:
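To see the difference between iterating over data and over data['loans'], here is a small self-contained sketch. The JSON below is a made-up sample in the same shape as the question's file (the posted_date values are invented), and an in-memory buffer stands in for test.csv.

```python
import csv
import io
import json

# Made-up sample in the shape shown in the question
raw = """
{"header": {"total": 2, "page": 1, "page_size": 500},
 "loans": [
   {"id": 84, "name": "Justine", "posted_date": "2012-04-01T00:00:00Z"},
   {"id": 85, "name": "Maria", "posted_date": "2012-04-02T00:00:00Z"}
 ]}
"""

data = json.loads(raw)

# Iterating over the dict itself yields only its keys
top_level_keys = list(data)

buffer = io.StringIO()  # stands in for open("test.csv", "w", newline="")
writer = csv.writer(buffer)
writer.writerow(["name", "posted_date"])
for loan in data["loans"]:
    # Each loan is itself a dict holding the actual fields
    writer.writerow([loan["name"], loan["posted_date"]])

print(top_level_keys)
print(buffer.getvalue())
```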
