Processed Enron corpus
The final processed Enron corpus consists
of 16 XML files. One file, Corpus.xml (Corpus.xml.gz),
contains the corpus itself, and the
remaining 15 files contain secondary, derived information. These files can be
divided into several groups:
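- Structured information about emails, persons, and email addresses: EmailDescriptorList.xml, AddressList.xml, PersonList.xml
- Information about communication: AllCommunicationList.xml, InternalCommunicationList.xml
- Information about individual terms in the corpus: TermGlobalList.xml, DocumentTermList.xml, TermInvertedList.xml, TermIDFList.xml, DocumentTermWeightList.xml
- Information about individual stems in the corpus: the “Stem” counterparts of the five term files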
EmailDescriptorList.xml
This file contains descriptors of
individual emails, each stored as an <EmailDescriptor> element. Its attributes are:
Id – unique identification number of the
email in the corpus; records in the file are sorted by this Id.
Deleted – “True” if the email is a duplicate
and should be deleted, otherwise “False”.
Subelements:
<FilePath> - path of the file containing this
email, relative to the base folder of the corpus
<Date> - date when the email was sent
<MD5> - MD5 digest taken from
\cite{tenborec}
<AuthorList> - list of authors of the
email\footnote{If the email is simply forwarded, all forwarding persons are
considered authors}.
<RecipientList> - list of recipients
of the email. Both of these lists contain at least one <Address> element
whose Id attribute equals the Id number of the sender’s or recipient’s address.
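As an illustration, the descriptor layout above could be read with Python's standard xml.etree.ElementTree roughly as follows. The root element name <EmailDescriptorList>, the date format, the file paths, and every value in the sample are assumptions made for this sketch; only the element and attribute names described above come from the corpus description.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: root element name, date format, and all values
# are assumptions made for illustration only.
SAMPLE = """<EmailDescriptorList>
  <EmailDescriptor Id="1" Deleted="False">
    <FilePath>some/folder/1.txt</FilePath>
    <Date>2001-05-14</Date>
    <MD5>0123456789abcdef0123456789abcdef</MD5>
    <AuthorList><Address Id="10"/></AuthorList>
    <RecipientList><Address Id="11"/><Address Id="12"/></RecipientList>
  </EmailDescriptor>
  <EmailDescriptor Id="2" Deleted="True">
    <FilePath>some/folder/2.txt</FilePath>
    <Date>2001-05-14</Date>
    <MD5>0123456789abcdef0123456789abcdef</MD5>
    <AuthorList><Address Id="10"/></AuthorList>
    <RecipientList><Address Id="11"/></RecipientList>
  </EmailDescriptor>
</EmailDescriptorList>"""


def address_ids(parent):
    """Collect the Id attributes of all <Address> children as integers."""
    return [int(a.get("Id")) for a in parent.iter("Address")]


def read_descriptors(xml_text):
    """Return one dict per <EmailDescriptor> that is not flagged as deleted."""
    root = ET.fromstring(xml_text)
    emails = []
    for desc in root.iter("EmailDescriptor"):
        if desc.get("Deleted") == "True":
            continue  # duplicates are flagged in the file, so skip them here
        emails.append({
            "id": int(desc.get("Id")),
            "path": desc.findtext("FilePath"),
            "date": desc.findtext("Date"),  # kept as text; real format unknown
            "md5": desc.findtext("MD5"),
            "authors": address_ids(desc.find("AuthorList")),
            "recipients": address_ids(desc.find("RecipientList")),
        })
    return emails


emails = read_descriptors(SAMPLE)
```

Skipping records with Deleted equal to “True” is one possible policy; a reader who wants to study the duplicates themselves would keep them instead.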
AddressList.xml
This file is simply a list of the email addresses
found in the corpus. Each address is stored as an <Address> element
with the attributes:
Address – text representation of the address,
Id – unique identification number.
PersonList.xml
The file contains a list of all Enron employees
who participated in the communication recorded in the corpus. Each person
is stored as a <Person> element with several attributes:
Name – name of the employee,
EmailAddress – his or her email address,
EmailAddressId – Id number of his or her
email address (see file AddressList.xml).
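Because each <Person> carries both the address text and its Id, the two files can be cross-checked against each other. The following is a minimal sketch of such a check; the root element names, the sample names, and the addresses are hypothetical, and the second person is deliberately given an Id that does not resolve.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples: root element names and all values are assumptions.
ADDRESSES = """<AddressList>
  <Address Id="10" Address="alice@example.com"/>
  <Address Id="11" Address="bob@example.com"/>
</AddressList>"""

PERSONS = """<PersonList>
  <Person Name="Alice Example" EmailAddress="alice@example.com" EmailAddressId="10"/>
  <Person Name="Bob Example" EmailAddress="bob@example.com" EmailAddressId="99"/>
</PersonList>"""


def load_addresses(xml_text):
    """Map address Id -> address text."""
    root = ET.fromstring(xml_text)
    return {int(a.get("Id")): a.get("Address") for a in root.iter("Address")}


def unresolved_persons(persons_xml, addresses):
    """Return names whose EmailAddressId does not map to their EmailAddress."""
    root = ET.fromstring(persons_xml)
    return [
        p.get("Name")
        for p in root.iter("Person")
        if addresses.get(int(p.get("EmailAddressId"))) != p.get("EmailAddress")
    ]


addresses = load_addresses(ADDRESSES)
bad = unresolved_persons(PERSONS, addresses)
```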
AllCommunicationList.xml
All communication among all email
addresses is stored in this file. Each <Communication> element has
attributes that describe point-to-point communication:
SenderAddressId – Id of the address the
emails came from,
RecipientAddressId – Id of their destination
address,
NumberOfEmails – number of emails exchanged
between these two addresses.
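The records above can be read as weighted edges of a communication graph. The sketch below loads them into a dictionary and sums the emails sent per address. It assumes that each record is directed from sender to recipient, which the attribute names suggest but the description does not state outright; the root element name and the values are hypothetical.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

# Hypothetical sample: root element name and all values are assumptions.
SAMPLE = """<AllCommunicationList>
  <Communication SenderAddressId="10" RecipientAddressId="11" NumberOfEmails="3"/>
  <Communication SenderAddressId="10" RecipientAddressId="12" NumberOfEmails="1"/>
  <Communication SenderAddressId="11" RecipientAddressId="10" NumberOfEmails="2"/>
</AllCommunicationList>"""


def load_edges(xml_text):
    """Map (sender Id, recipient Id) -> number of emails."""
    edges = {}
    for c in ET.fromstring(xml_text).iter("Communication"):
        key = (int(c.get("SenderAddressId")), int(c.get("RecipientAddressId")))
        edges[key] = int(c.get("NumberOfEmails"))
    return edges


def sent_totals(edges):
    """Sum the email counts over all recipients of each sender."""
    totals = defaultdict(int)
    for (sender, _recipient), count in edges.items():
        totals[sender] += count
    return dict(totals)


edges = load_edges(SAMPLE)
totals = sent_totals(edges)
```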
InternalCommunicationList.xml
This file contains data similar to the file
described above, but only communication among Enron employees is stored. Hence
the attributes hold Ids of persons rather than Ids of email addresses:
SenderPersonId – Id of the person sending
emails,
RecipientPersonId – Id of the person receiving
emails,
NumberOfEmails – number of emails exchanged
between these two persons.
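One use of the person-level data is to find the pairs of employees who exchanged the most emails regardless of direction. The sketch below merges both directions of each pair. It assumes the records are again stored as <Communication> elements, which the description implies but does not state; the root element name and the values are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical sample: element names other than the three attributes,
# and all values, are assumptions.
SAMPLE = """<InternalCommunicationList>
  <Communication SenderPersonId="1" RecipientPersonId="2" NumberOfEmails="5"/>
  <Communication SenderPersonId="2" RecipientPersonId="1" NumberOfEmails="3"/>
  <Communication SenderPersonId="1" RecipientPersonId="3" NumberOfEmails="4"/>
</InternalCommunicationList>"""


def pair_totals(xml_text):
    """Sum email counts per unordered pair of person Ids."""
    totals = Counter()
    for c in ET.fromstring(xml_text).iter("Communication"):
        a = int(c.get("SenderPersonId"))
        b = int(c.get("RecipientPersonId"))
        # Sorting the pair makes (1, 2) and (2, 1) share one key.
        totals[tuple(sorted((a, b)))] += int(c.get("NumberOfEmails"))
    return totals


totals = pair_totals(SAMPLE)
busiest = totals.most_common(1)[0]
```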
TermGlobalList.xml
This file represents the global list of terms
found in the Enron corpus. The list is sorted alphabetically, and a unique
Id was assigned to each term. Terms are stored as <Term> elements with
these attributes:
Token – string representation of the term,
i.e. the term itself,
TotalFreq – total frequency of the term in
the whole corpus,
CorrectWord – “True” if the word is correctly
spelled, otherwise “False”.
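A typical first step with this list is to load the term frequencies and, optionally, drop misspelled tokens. The following sketch uses only the three attributes named above; the root element name and the sample terms are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: root element name and all values are assumptions.
SAMPLE = """<TermGlobalList>
  <Term Token="energy" TotalFreq="120" CorrectWord="True"/>
  <Term Token="enrgy" TotalFreq="2" CorrectWord="False"/>
  <Term Token="meeting" TotalFreq="75" CorrectWord="True"/>
</TermGlobalList>"""


def load_terms(xml_text, only_correct=False):
    """Map Token -> TotalFreq, optionally keeping correctly spelled terms only."""
    terms = {}
    for t in ET.fromstring(xml_text).iter("Term"):
        if only_correct and t.get("CorrectWord") != "True":
            continue
        terms[t.get("Token")] = int(t.get("TotalFreq"))
    return terms


all_terms = load_terms(SAMPLE)
correct_terms = load_terms(SAMPLE, only_correct=True)
# Dicts keep insertion order, so this checks the file order of the tokens.
is_sorted = list(all_terms) == sorted(all_terms)
```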
DocumentTermList.xml
For each document (email), identified by its Id
attribute, the list of all terms in the document is stored as a
<TermList> subelement. The <TermList> element contains <Term> elements
with attributes:
Id – Id of the particular term in the given
document,
Freq – frequency of the term in the given
document.
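The per-document term lists map naturally onto a nested dictionary. The name of the element that wraps each document is not given above, so the sketch below uses a hypothetical <Document> element in its sample and deliberately avoids depending on that name: it walks the children of the root and looks only for the <TermList> subelement. The root element name and the values are assumptions as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: <DocumentTermList> and <Document> are assumed names.
SAMPLE = """<DocumentTermList>
  <Document Id="1">
    <TermList><Term Id="5" Freq="2"/><Term Id="7" Freq="1"/></TermList>
  </Document>
  <Document Id="2">
    <TermList><Term Id="7" Freq="3"/></TermList>
  </Document>
</DocumentTermList>"""


def load_document_terms(xml_text):
    """Map document Id -> {term Id: frequency in that document}."""
    docs = {}
    for doc in ET.fromstring(xml_text):  # direct children of the root
        term_list = doc.find("TermList")
        if term_list is None:
            continue  # not a document record
        docs[int(doc.get("Id"))] = {
            int(t.get("Id")): int(t.get("Freq")) for t in term_list.iter("Term")
        }
    return docs


doc_terms = load_document_terms(SAMPLE)
```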
TermInvertedList.xml
This file is “orthogonal” to the file DocumentTermList.xml.
For each term, given by its Id, total frequency TotalFreq, and correctness of
spelling CorrectWord, the list of documents containing the term is stored. The list is
stored as a <DocList> element with one attribute, DocCount, giving the number
of documents in the list. The list contains <Doc> elements with attributes:
Id – Id of a document that contains the given
term,
Freq – frequency of the given term in this
document.
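The “orthogonal” relation between the two files can be made concrete by inverting a document-to-term mapping into a term-to-document mapping. The sketch below works on plain dictionaries with hypothetical Ids and frequencies rather than on the XML itself. Treating the total frequency as the sum of the per-document frequencies is an assumption that is plausible from the description but not confirmed by it.

```python
from collections import defaultdict

# Hypothetical data shaped like DocumentTermList.xml:
# document Id -> {term Id: Freq}.
doc_terms = {
    1: {5: 2, 7: 1},
    2: {7: 3},
    3: {5: 1, 9: 4},
}


def invert(doc_terms):
    """Return term Id -> {document Id: Freq}, i.e. an inverted list."""
    inverted = defaultdict(dict)
    for doc_id, terms in doc_terms.items():
        for term_id, freq in terms.items():
            inverted[term_id][doc_id] = freq
    return dict(inverted)


inverted = invert(doc_terms)
# Counterpart of DocCount: number of documents that contain each term.
doc_count = {t: len(docs) for t, docs in inverted.items()}
# Assumed counterpart of TotalFreq: sum of the per-document frequencies.
total_freq = {t: sum(docs.values()) for t, docs in inverted.items()}
```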
TermIDFList.xml
For each term, the IDF is computed according to
formula \ref{eq:IDF}. The data are stored as <Term> elements with attributes:
Id – Id of the term,
IDF – IDF of the given term.
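The referenced formula is not reproduced in this section, so the sketch below uses the common textbook definition, the natural logarithm of the number of documents divided by the number of documents containing the term. This is an assumption: the formula actually used for the corpus may differ in logarithm base or smoothing, and the counts shown are hypothetical.

```python
import math


def idf(n_docs, doc_count):
    """Textbook IDF, log(N / df); the corpus formula may differ."""
    if doc_count <= 0 or doc_count > n_docs:
        raise ValueError("doc_count must be between 1 and n_docs")
    return math.log(n_docs / doc_count)


# Hypothetical document counts per term Id in a three-document collection.
n_docs = 3
doc_count = {5: 2, 7: 2, 9: 1}
idf_by_term = {term: idf(n_docs, df) for term, df in doc_count.items()}
```

Under this definition a term that occurs in every document gets an IDF of zero, and rarer terms get larger values.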
DocumentTermWeightList.xml
For each document with a given Id there is a list
of terms, <TermList>, with child <Term> elements that have the attributes:
Id – Id of the term,
Weight – weight of the term in the given
document.
For stems there is an equivalent group of files
containing the same kind of information. They have similar names, with the string “Term”
replaced by “Stem”, and the names of attributes and elements are likewise
changed from “Term” to “Stem”. The meaning of the data stored in the “Stem” files
is the same as in the “Term” files.
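The renaming rule can be applied mechanically to the term file names. The sketch below derives candidate stem file names this way; since the names are only described as similar, the derived names are a guess that should be checked against the actual files.

```python
# Term file names as they appear in this description.
TERM_FILES = [
    "TermGlobalList.xml",
    "DocumentTermList.xml",
    "TermInvertedList.xml",
    "TermIDFList.xml",
    "DocumentTermWeightList.xml",
]


def stem_name(term_name):
    """Replace "Term" with "Stem" in a file, element, or attribute name."""
    return term_name.replace("Term", "Stem")


# Candidate stem file names; not verified against the real corpus.
STEM_FILES = [stem_name(name) for name in TERM_FILES]
```

The same helper would turn an element name such as TermList into StemList when adapting code written for the term files.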