Processed Enron corpus

The final processed Enron corpus consists of 16 XML files. One file, Corpus.xml (Corpus.xml.gz), contains the corpus itself, and the remaining 15 files contain secondary, derived information. These files can be divided into several groups:

EmailDescriptorList.xml

This file contains descriptors of individual emails, each stored as an <EmailDescriptor> element. Its attributes are:

Id – unique identification number of each email in the corpus; records in the file are sorted by this Id.

Deleted – “True” if the email is a duplicate and should be deleted, otherwise “False”.

Subelements:

<FilePath> - path of the file containing this email, relative to the base folder of the corpus

<Date> - date when the email was sent

<MD5> - MD5 digest taken from \cite{tenborec}

<AuthorList> - list of authors of the email\footnote{If the email is simply forwarded, all forwarding persons are considered authors.}.

<RecipientList> - list of recipients of the email. Both of these lists contain at least one <Address> element whose Id attribute equals the Id number of the sender’s or recipient’s address.
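
As an illustration of how such descriptors might be read, the following is a minimal sketch using Python's standard xml.etree.ElementTree module. The root element name <EmailDescriptorList>, the date format, the sample values, the treatment of Ids as integers and the helper name read_descriptors are assumptions made for the example; only the element and attribute names listed above come from the description.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. The root element name, the date format and all
# values are assumptions; only the element and attribute names are taken
# from the description of EmailDescriptorList.xml.
SAMPLE = """
<EmailDescriptorList>
  <EmailDescriptor Id="1" Deleted="False">
    <FilePath>folder-a/1.txt</FilePath>
    <Date>2001-05-14</Date>
    <MD5>00000000000000000000000000000001</MD5>
    <AuthorList><Address Id="10"/></AuthorList>
    <RecipientList><Address Id="11"/><Address Id="12"/></RecipientList>
  </EmailDescriptor>
  <EmailDescriptor Id="2" Deleted="True">
    <FilePath>folder-b/2.txt</FilePath>
    <Date>2001-05-14</Date>
    <MD5>00000000000000000000000000000001</MD5>
    <AuthorList><Address Id="10"/></AuthorList>
    <RecipientList><Address Id="11"/><Address Id="12"/></RecipientList>
  </EmailDescriptor>
</EmailDescriptorList>
"""


def read_descriptors(xml_text, skip_deleted=True):
    """Return one dict per <EmailDescriptor>, optionally skipping duplicates."""
    root = ET.fromstring(xml_text)
    result = []
    for d in root.iter("EmailDescriptor"):
        # Emails flagged Deleted="True" are duplicates.
        if skip_deleted and d.get("Deleted") == "True":
            continue
        result.append({
            "id": int(d.get("Id")),
            "path": d.findtext("FilePath"),
            "date": d.findtext("Date"),
            "md5": d.findtext("MD5"),
            "authors": [int(a.get("Id")) for a in d.findall("AuthorList/Address")],
            "recipients": [int(a.get("Id")) for a in d.findall("RecipientList/Address")],
        })
    return result


descriptors = read_descriptors(SAMPLE)
```

Skipping records with Deleted set to “True” by default is a design choice of this sketch; pass skip_deleted=False to keep the duplicates.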

AddressList.xml

This file is simply a list of the email addresses found in the corpus. Each address is stored as an <Address> element with the attributes:

Address – text representation of address,

Id – unique identification number.

PersonList.xml

This file contains a list of all Enron employees who participate in the communication recorded in the corpus. Each person is stored as a <Person> element with several attributes:

Name – name of the employee,

EmailAddress – his or her email address,

EmailAddressId – Id number of his or her email address (see file AddressList.xml).
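
Because each person carries both the address text and its Id, the two files can be cross-checked against each other. The sketch below shows one way to do this; the root element names, the sample data (including the deliberately inconsistent second person) and the helper name check_persons are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples; root element names and all values are assumptions.
ADDRESSES = """<AddressList>
  <Address Address="alice@example.com" Id="1"/>
  <Address Address="bob@example.com" Id="2"/>
  <Address Address="carol@example.org" Id="3"/>
</AddressList>"""

PERSONS = """<PersonList>
  <Person Name="Alice Example" EmailAddress="alice@example.com" EmailAddressId="1"/>
  <Person Name="Bob Example" EmailAddress="bob@example.com" EmailAddressId="3"/>
</PersonList>"""


def check_persons(address_xml, person_xml):
    """Return names of persons whose EmailAddress does not match the
    address stored under their EmailAddressId in the address list."""
    addr_by_id = {
        a.get("Id"): a.get("Address")
        for a in ET.fromstring(address_xml).iter("Address")
    }
    mismatches = []
    for p in ET.fromstring(person_xml).iter("Person"):
        if addr_by_id.get(p.get("EmailAddressId")) != p.get("EmailAddress"):
            mismatches.append(p.get("Name"))
    return mismatches


mismatches = check_persons(ADDRESSES, PERSONS)
```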

AllCommunicationList.xml

All communication among all email addresses is stored in this file. Each <Communication> element has attributes that describe point-to-point communication:

SenderAddressId – Id of the address the emails came from,

RecipientAddressId – Id of their destination address,

NumberOfEmails – number of emails exchanged between these two addresses.
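
Records of this kind map naturally onto a weighted directed graph. The following is a minimal sketch of loading them into a nested dictionary; the root element name, the sample values and the helper name load_graph are assumptions, as is the reading that each record counts emails in the sender-to-recipient direction only.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample; the root element name and all values are assumptions.
SAMPLE = """<AllCommunicationList>
  <Communication SenderAddressId="1" RecipientAddressId="2" NumberOfEmails="5"/>
  <Communication SenderAddressId="2" RecipientAddressId="1" NumberOfEmails="3"/>
  <Communication SenderAddressId="1" RecipientAddressId="3" NumberOfEmails="1"/>
</AllCommunicationList>"""


def load_graph(xml_text):
    """Return graph[sender_id][recipient_id] = number of emails."""
    graph = defaultdict(dict)
    for c in ET.fromstring(xml_text).iter("Communication"):
        sender = int(c.get("SenderAddressId"))
        recipient = int(c.get("RecipientAddressId"))
        graph[sender][recipient] = int(c.get("NumberOfEmails"))
    return graph


graph = load_graph(SAMPLE)
# Total number of emails recorded for each sending address.
sent_total = {sender: sum(out.values()) for sender, out in graph.items()}
```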

 

InternalCommunicationList.xml

This file contains data similar to the file described above, but only communication among Enron employees is stored. Hence the attributes are not Ids of email addresses but Ids of persons:

SenderPersonId – Id of the person sending emails,

RecipientPersonId – Id of the person receiving emails,

NumberOfEmails – number of emails exchanged between these two persons.
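
If only the strength of a relationship matters and not its direction, the records can be collapsed into totals per unordered pair of persons. The sketch below is one way to do that; the root element name, the reuse of the <Communication> element name for this file, the sample values and the helper name pair_totals are all assumptions.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample; element names other than the three attributes,
# and all values, are assumptions.
SAMPLE = """<InternalCommunicationList>
  <Communication SenderPersonId="1" RecipientPersonId="2" NumberOfEmails="4"/>
  <Communication SenderPersonId="2" RecipientPersonId="1" NumberOfEmails="6"/>
  <Communication SenderPersonId="3" RecipientPersonId="1" NumberOfEmails="2"/>
</InternalCommunicationList>"""


def pair_totals(xml_text):
    """Sum NumberOfEmails over both directions for each pair of persons."""
    totals = Counter()
    for c in ET.fromstring(xml_text).iter("Communication"):
        # A frozenset ignores the order of sender and recipient.
        pair = frozenset(
            (int(c.get("SenderPersonId")), int(c.get("RecipientPersonId")))
        )
        totals[pair] += int(c.get("NumberOfEmails"))
    return totals


totals = pair_totals(SAMPLE)
```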

TermGlobalList.xml

This file is the global list of terms found in the Enron corpus. The list is sorted alphabetically, and a unique Id is assigned to each term. Terms are stored as <Term> elements with these attributes:

Token – string representation of the term, i.e. the term itself,

TotalFreq – total frequency of the term in the whole corpus,

CorrectWord – “True” if the word is correctly spelled, otherwise “False”.

DocumentTermList.xml

For each document (email) with a given Id (stored as an attribute), the list of all terms included in the document is stored as a <TermList> subelement. The <TermList> element contains <Term> elements with attributes:

Id – Id of a term occurring in the document,

Freq – frequency of the term in the document.
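
A structure like this can be read into a mapping from document Id to term frequencies. The sketch below assumes a root element <DocumentTermList> and a per-document element named <Document>; neither name is given in the description, so the document tag is left as a parameter. The sample values and the helper name load_doc_terms are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; <DocumentTermList> and <Document> are assumed names.
SAMPLE = """<DocumentTermList>
  <Document Id="1">
    <TermList>
      <Term Id="7" Freq="2"/>
      <Term Id="9" Freq="1"/>
    </TermList>
  </Document>
  <Document Id="2">
    <TermList><Term Id="9" Freq="4"/></TermList>
  </Document>
</DocumentTermList>"""


def load_doc_terms(xml_text, doc_tag="Document"):
    """Return {document_id: {term_id: frequency}}."""
    docs = {}
    for doc in ET.fromstring(xml_text).iter(doc_tag):
        docs[int(doc.get("Id"))] = {
            int(t.get("Id")): int(t.get("Freq"))
            for t in doc.findall("TermList/Term")
        }
    return docs


doc_terms = load_doc_terms(SAMPLE)
```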

TermInvertedList.xml

This file is “orthogonal” to DocumentTermList.xml. For each term, identified by its Id, total frequency TotalFreq and spelling correctness CorrectWord, the list of documents containing the term is stored. The list is stored as a <DocList> element with one attribute, DocCount, giving the number of documents in the list. The list contains <Doc> elements with attributes:

Id – Id of a document that contains the term,

Freq – frequency of the term in this document.
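
The sense in which the two files are “orthogonal” can be sketched by inverting a document-to-term mapping into a term-to-document one. The data and the helper name invert below are hypothetical, and computing the total frequency as the sum of per-document frequencies is an assumption about how TotalFreq relates to Freq.

```python
def invert(doc_terms):
    """Turn {doc_id: {term_id: freq}} into {term_id: {doc_id: freq}}."""
    inverted = {}
    for doc_id, terms in doc_terms.items():
        for term_id, freq in terms.items():
            inverted.setdefault(term_id, {})[doc_id] = freq
    return inverted


# Hypothetical per-document term frequencies.
doc_terms = {1: {7: 2, 9: 1}, 2: {9: 4}}
inverted = invert(doc_terms)

# Counterpart of the DocCount attribute: number of documents per term.
doc_count = {term: len(docs) for term, docs in inverted.items()}
# Assumed relation: total frequency as the sum of per-document frequencies.
total_freq = {term: sum(docs.values()) for term, docs in inverted.items()}
```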

TermIDFList.xml

For each term, the IDF is computed according to formula \ref{eq:IDF}. The data are stored as <Term> elements with attributes:

Id – Id of the term,

IDF – IDF of the term.

DocumentTermWeightList.xml

For each document with a given Id, there is a list of terms <TermList> whose child <Term> elements have attributes:

Id – Id of the term,

Weight – weight of the term in the document.
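
The description does not say how the weights are meant to be used. One common use of per-document term weights is comparing documents by cosine similarity, sketched below on two hypothetical weight vectors. The values and the helper name cosine are assumptions, and the vectors are taken to be already loaded as dictionaries from term Id to weight.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term_id: weight}."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical weights for two documents.
doc1 = {7: 3.0, 9: 4.0}
doc2 = {9: 5.0}
similarity = cosine(doc1, doc2)
```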

For stems there is an analogous group of files containing the same kind of information. Their names are similar, with the string “Term” replaced by the string “Stem”, and the names of attributes and elements are likewise changed from “Term” to “Stem”. The meaning of the data stored in the “Stem” files is the same as in the “Term” files.
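
The naming rule can be written down directly. The sketch below derives the stem file names from the term file names by string replacement and checks that the total matches the 16 files mentioned at the beginning; the derived names follow only from the stated rule and have not been checked against the actual distribution.

```python
# File names taken from the sections above.
TERM_FILES = [
    "TermGlobalList.xml",
    "DocumentTermList.xml",
    "TermInvertedList.xml",
    "TermIDFList.xml",
    "DocumentTermWeightList.xml",
]
OTHER_FILES = [
    "EmailDescriptorList.xml",
    "AddressList.xml",
    "PersonList.xml",
    "AllCommunicationList.xml",
    "InternalCommunicationList.xml",
]

# Apply the stated rule: replace "Term" with "Stem" in each name.
STEM_FILES = [name.replace("Term", "Stem") for name in TERM_FILES]

ALL_FILES = ["Corpus.xml"] + OTHER_FILES + TERM_FILES + STEM_FILES
```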