README.txt
^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^

version changes:
^^^^^^^^^^^^^^^^
EMUv1.0.17 added explicit options for headers and sentences: [-h yes] and [-s yes]. Both are described under input parameters.
EMUv1.0.18 updates:
	1. Handles indel mutations. The codes INS, DEL, and INDEL appear in the wtaa column; the nucleotides, amino acids, or number being inserted or deleted appear in the mtaa column
	2. Extracted mutations are not converted to the three-letter amino acid code
	3. Added the column <mutation type>, which is either MISSENSE or INDEL depending on the mutation
	4. The <type> column covers all variant types, depending on the extracted mutation: PROTEIN, DNA, RNA
	5. Use EMU_seq_filter_v1.2.pl to handle new column in EMU output
	6. Removed ABG output file
EMUv1.0.19 updates:
	1. Fixed an error in EMUv1.0.18 affecting two mutation patterns
EMU_seq_filter_v1.2 updates:
	1. Handles <mutation type> column from EMUv1.0.18, EMUv1.0.19
	2. The previous version would replace EMU's <type> column. The seq_filter type column moved after 


A/
Pipeline of Extraction of MUtation:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1. The input of EMU is a text file; each line consists of a tab-delimited triplet of 1) PubMed ID, 2) title and 3) the plain text of the abstract.
2. Run EMU on the input file.
3. Run the SEQ_filter on the extracted mutations.

detailed description:
^^^^^^^^^^^^^^^^^^^^^
A2 EMU:
^^^^^^^^^^^
EMU needs the following files:

hard coded filenames:
AAconversion.pm    %some perl scripts from Trevor.
HUGOGeneNames.txt  %the list of gene names.
Cell_line_list_short.txt   %the list of cell line names that can be confused with mutations, i.e. cell line names that look like mutations.

syntax:
perl EMUv1.0.17.pl -f input1 [-s yes] [-h yes]

input parameters:
1. [-f] argument. The input file name follows the option. input1 - the input file with tab-delimited PubMed ID, title and abstract in plain text form.
2. [-s yes] optional argument. With this option, EMU processes the input text sentence by sentence.
3. [-h yes] optional argument. With this option, the input text file has a header for the columns.  The default is no header.

A3 SEQ_filter:
^^^^^^^^^^^^^^^^^^
the seq_filter parser: 

syntax:
perl EMU_seq_filter.pl <input_file> <output_file>

The input file is the output from EMU.
This step needs an internet connection because it retrieves data from the NCBI server.


Example: 
^^^^^^^^^
Let PCA_abst_mutation.txt be the EMU input file that contains the abstracts:
perl EMUv1.0.17.pl -f PCA_abst_mutation.txt
perl EMUv1.0.17.pl -f PCA_paper.txt -s yes		//applies EMU to full paper text (instead of just the abstract) and processes it by sentences
perl EMU_seq_filter.pl EMU_1.17_HUGO_PCA_abst_mutation.txt EMU_1.17_HUGO_PCA_abst_mutation_SF.txt
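The commands above can also be assembled from a script. What follows is a minimal Python sketch, not part of EMU: the function names are hypothetical, the scripts are assumed to sit in the working directory, and the name of the file that EMU writes is passed in explicitly because this README does not state the naming rule.

```python
import subprocess

def emu_command(script, input_file, by_sentence=False, has_header=False):
    """Build the argument list for one EMU run ([-f], [-s yes], [-h yes])."""
    cmd = ["perl", script, "-f", input_file]
    if by_sentence:
        cmd += ["-s", "yes"]   # process the text sentence by sentence
    if has_header:
        cmd += ["-h", "yes"]   # the input file starts with a header line
    return cmd

def seq_filter_command(script, emu_output, filtered_output):
    """Build the argument list for the seq_filter step."""
    return ["perl", script, emu_output, filtered_output]

def run_pipeline(emu_script, input_file, emu_output,
                 filter_script, filtered_output, **emu_options):
    """Run EMU, then the seq_filter; stop if either script fails."""
    # emu_output is supplied by the caller (assumption: the caller knows
    # which file EMU produced for this input).
    subprocess.run(emu_command(emu_script, input_file, **emu_options), check=True)
    subprocess.run(seq_filter_command(filter_script, emu_output, filtered_output),
                   check=True)
```

Calling run_pipeline("EMUv1.0.17.pl", "PCA_abst_mutation.txt", "EMU_1.17_HUGO_PCA_abst_mutation.txt", "EMU_seq_filter.pl", "EMU_1.17_HUGO_PCA_abst_mutation_SF.txt") would reproduce the first and third commands of the example; the seq_filter step still needs an internet connection.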


specification of the input files:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
All files are tab-delimited.

the input file of EMU has to look like this:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pmid	title	abstract
10021378	Alzheimer's disease: clues from flies and worms.	Presenilin mutations give rise to familial Alzheimer's disease and result in elevated production of amyloid beta peptide. Recent evidence that presenilins act in developmental signalling pathways may be the key to understanding how senile plaques, neurofibrillary tangles and apoptosis are all biochemically linked.
.
.
.
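Before running EMU it can help to check that every line of the input file really holds three tab-separated fields. The following Python sketch does that under the layout shown above; the function name read_emu_input is hypothetical and the check is not part of EMU itself.

```python
import io

def read_emu_input(handle, has_header=False):
    """Yield (pmid, title, abstract) triplets from a tab-delimited handle."""
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if has_header and line_no == 1:
            continue                      # skip the column names
        if not line:
            continue                      # ignore empty lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 tab-separated fields, got {len(fields)}"
            )
        yield tuple(fields)

# Usage with an in-memory sample; a real file opened with open() works the same way.
sample = io.StringIO(
    "pmid\ttitle\tabstract\n"
    "10021378\tAlzheimer's disease: clues from flies and worms.\tPresenilin mutations ...\n"
)
records = list(read_emu_input(sample, has_header=True))
```

A tab inside a title or abstract would split the line into more than three fields, so the sketch reports it rather than guessing which field it belongs to.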

the output of EMU, which is also the input of the fasta check (seq_filter), is:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pmid	organism	mut_pat1	pos_patt	wtaa	mtaa	pos	genes	type
15146458	Humans	 g.4870T>C 		T	C	4870	ANP32A;ANP32C;PC	GENOM
.
.
.
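The table above can be read into named fields for further processing. This is a minimal Python sketch, assuming the file starts with the header line shown above; the function name parse_emu_output is hypothetical. Column names are taken from the file's own header, so additional columns such as <mutation type> would be carried along under whatever name the header gives them.

```python
import csv
import io

def parse_emu_output(handle):
    """Yield one dict per extracted mutation, keyed by the header names."""
    # QUOTE_NONE: quote characters inside mutation patterns are kept as plain text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Trim the padding around values; short rows yield None, treated as empty.
        clean = {k: (v or "").strip() for k, v in row.items() if k is not None}
        # The genes column holds a semicolon-separated list.
        clean["genes"] = [g for g in clean.get("genes", "").split(";") if g]
        yield clean

# Usage with the sample row from this README.
sample = io.StringIO(
    "pmid\torganism\tmut_pat1\tpos_patt\twtaa\tmtaa\tpos\tgenes\ttype\n"
    "15146458\tHumans\t g.4870T>C \t\tT\tC\t4870\tANP32A;ANP32C;PC\tGENOM\n"
)
mutations = list(parse_emu_output(sample))
```

Positions are left as strings because the README does not say whether the pos column is always numeric.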

the output of the seq_filter:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pmid	organism	mut_pat1	pos_patt	wtaa	mtaa	pos	genes	type	fasta_check	gi	gene_name	prot_id
10517877	Humans	 histidine to aspartic acid.	codon 1104	HIS	ASP	1104	ERCC5	PROTEIN	YES	2073	ERCC5	51988900|REV
.
.
.
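A common next step is to keep only the rows the seq_filter marked in the fasta_check column. The sketch below is a Python illustration under two assumptions that this README does not confirm: that the file starts with the header line shown above, and that YES in fasta_check marks a row that passed the sequence check. The function name rows_passing_fasta_check is hypothetical.

```python
import csv
import io

def rows_passing_fasta_check(handle):
    """Return the seq_filter rows whose fasta_check column is YES."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    passed = []
    for row in reader:
        # Assumption: any value other than YES means the check did not pass.
        if (row.get("fasta_check") or "").strip().upper() == "YES":
            passed.append(row)
    return passed

# Usage with the sample row from this README.
header = ("pmid\torganism\tmut_pat1\tpos_patt\twtaa\tmtaa\tpos\tgenes\ttype"
          "\tfasta_check\tgi\tgene_name\tprot_id\n")
sample = io.StringIO(
    header
    + "10517877\tHumans\t histidine to aspartic acid.\tcodon 1104\tHIS\tASP\t1104"
      "\tERCC5\tPROTEIN\tYES\t2073\tERCC5\t51988900|REV\n"
)
confirmed = rows_passing_fasta_check(sample)
```

The gi, gene_name and prot_id values are returned unchanged, since their exact format is not described here.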