Codelets for presentation at Sophia-Antipolis 170201
====================================================

relextr-mitie.py  : relation extraction with MITIE
  ./relextr-mitie.py file.txt
relextr-nltk.py   : relation extraction with NLTK
  ./relextr-nltk.py file.txt
relextr-pattern.py: relation extraction with the Pattern library
  ./relextr-pattern.py file.txt
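The three scripts above differ only in the extraction backend. As a library-free illustration of the pattern-based idea, here is a minimal sketch; the "works for" trigger phrase and the capitalized-name heuristic are invented for the example and are not what the MITIE/NLTK/Pattern backends actually do:

```python
import re

# Minimal pattern-based relation extraction: find (subject, relation, object)
# triples from "X works for Y" sentences. NAME matches runs of capitalized words.
NAME = r'[A-Z][a-z]+(?: [A-Z][a-z]+)*'
WORKS_FOR = re.compile(rf'({NAME}) works for ({NAME})')

def extract_relations(text):
    # Each regex match yields a (subject, object) pair; label it works_for.
    return [(subj, 'works_for', obj) for subj, obj in WORKS_FOR.findall(text)]

rels = extract_relations("Alice Smith works for Acme Corp. Bob works for Initech.")
```

Real extractors replace the regex with a trained model (MITIE) or with chunking over POS-tagged text (NLTK, Pattern), but the triple-producing shape of the output is the same.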

ner-mitie.py: named entity recognition with MITIE
  ./ner-mitie.py file.txt
ner_collection.py: NER on the whole document collection (see jobs_ner.csv)
  ./ner_collection.py
  [produces a cleaned-up list of NER words from each job*.txt file and
   writes its output to jobs_ner.csv]
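In the same spirit, the per-file NER-and-cleanup step can be sketched without MITIE using a capitalization heuristic; the heuristic stands in for the trained model and is an illustrative assumption only:

```python
import re

def naive_ner(text):
    # Heuristic stand-in for a trained NER model: collect capitalized
    # tokens that are not sentence-initial, strip punctuation, dedupe.
    entities = set()
    for sentence in re.split(r'[.!?]\s+', text):
        for tok in sentence.split()[1:]:   # skip the sentence-initial word
            tok = tok.strip('.,;:()')
            if re.fullmatch(r'[A-Z][a-z]+', tok):
                entities.add(tok)
    return sorted(entities)

ents = naive_ner("Acme hires in Paris. The office of Acme is near Nice.")
```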

wn_hyponyms.py: builds the WordNet hyponym graph of a common word
  ./wn_hyponyms.py commonword
  [opens a graphical window showing the hyponym graph of commonword;
   writes the figure to wn_commonword.png]
wn_path.py: finds WordNet graph paths between two common words and their lowest common subsumer (LCS)
  ./wn_path.py commonword1 commonword2
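What wn_path.py computes can be illustrated on a tiny hand-coded hypernym chain; the graph below is a made-up fragment standing in for WordNet, not real WordNet data:

```python
# Tiny hand-coded hypernym map (child -> parent), a stand-in for WordNet.
HYPERNYM = {
    'dog': 'canine', 'cat': 'feline',
    'canine': 'carnivore', 'feline': 'carnivore',
    'carnivore': 'mammal', 'mammal': 'animal',
}

def hypernym_path(word):
    """Chain from a word up to the root of the taxonomy."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path

def lcs(w1, w2):
    """Lowest common subsumer: first ancestor of w2 shared with w1's chain."""
    ancestors = set(hypernym_path(w1))
    for node in hypernym_path(w2):
        if node in ancestors:
            return node
    return None
```

In real WordNet the hypernym relation is a DAG rather than a single chain, so the actual script has to search over multiple paths per word.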

summarize.py: produces a summary of text
  ./summarize.py file.txt
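A minimal frequency-based extractive summarizer, sketched under the assumption that scoring sentences by word frequency is close to what summarize.py does (it may use a different scheme):

```python
import re
from collections import Counter

def summarize(text, n=1):
    # Score each sentence by the total document frequency of its words
    # and keep the n best sentences, in their original order.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'[a-z]+', text.lower()))
    score = lambda s: sum(freq[w] for w in re.findall(r'[a-z]+', s.lower()))
    best = sorted(sorted(range(len(sentences)),
                         key=lambda i: -score(sentences[i]))[:n])
    return ' '.join(sentences[i] for i in best)
```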

modularity_cluster.py: modularity clustering on weighted distance graph
  given a list of keywords about each document in collection (jobs_ner.csv)
  1. compute similarities using WordNet
  2. cut weak similarities to "no edge"
  3. perform modularity clustering on weighted distance graph
     - use optlang to model and solve the clique-constrained modularity MILP
  4. output clustering on console and jobs_modularity.png figure
  ./modularity_cluster.py [file.csv]
  [if no file is given, jobs_ner.csv is used]
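The objective that step 3 maximizes is the weighted modularity; here is a direct evaluation of Q, assuming the standard Newman definition (the actual codelet builds the clique-constrained MILP with optlang instead of evaluating Q directly):

```python
def modularity(edges, community):
    # Weighted modularity Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * [c_i == c_j]
    # edges: {(u, v): weight} with each undirected edge stored once;
    # community: {node: label}.
    deg, two_m = {}, 0.0
    for (u, v), w in edges.items():
        deg[u] = deg.get(u, 0.0) + w
        deg[v] = deg.get(v, 0.0) + w
        two_m += 2.0 * w
    q = 0.0
    for i in deg:
        for j in deg:
            if community[i] == community[j]:
                a = edges.get((i, j), edges.get((j, i), 0.0))
                q += a - deg[i] * deg[j] / two_m
    return q / two_m

# Two unit-weight triangles joined by one edge: Q = 5/14 for the natural split.
edges = {(1, 2): 1, (1, 3): 1, (2, 3): 1, (4, 5): 1, (4, 6): 1, (5, 6): 1, (3, 4): 1}
community = {1: 'a', 2: 'a', 3: 'a', 4: 'b', 5: 'b', 6: 'b'}
```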

modularity.sh: solve max modularity formulation using AMPL/CPLEX
  ./modularity.sh file.dat
  [uses modularity.mod, modularity.run, jobs.dat]

jobs_adjacency_matrix.py: hard-coded adjacency matrix for jobs [just a test]

keywords.py: produces a list of keywords in a document collection
  ./keywords.py
  [produces a list of keywords per document, whose relevance is computed
   over the whole collection; writes its output to jobs_tfidf.csv]
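The TF-IDF weights behind jobs_tfidf.csv can be sketched as follows; the exact weighting scheme is an assumption, and keywords.py may normalize or smooth differently:

```python
import math
from collections import Counter

def tfidf(docs):
    # docs: list of token lists. Returns one {term: weight} dict per
    # document, with tf = term count / doc length and idf = log(N / df).
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return [{t: (c / len(doc)) * math.log(n / df[t])
             for t, c in Counter(doc).items()}
            for doc in docs]

weights = tfidf([['a', 'b'], ['a', 'c']])
```

A term appearing in every document (like 'a' above) gets weight 0, which is exactly why TF-IDF surfaces document-specific keywords.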
kmeans.py: k-means on the document collection job*.txt, using TF-IDF distances on keywords
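A minimal Lloyd's k-means on toy 2-D points; kmeans.py clusters the high-dimensional TF-IDF vectors instead, so this sketch only shows the assign/update iteration:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # Lloyd's algorithm on 2-D points: assign each point to the nearest
    # center, then move each center to the mean of its cluster.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # leave empty clusters' centers untouched
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers, clusters

centers, clusters = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2)
```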

mssc: minimum sum-of-squares clustering (the problem heuristically solved by the k-means algorithm)
   ./mssc_data.py k  # where k>1 is the number of clusters
   [writes the AMPL .dat file mssc-k_C_feat.dat, where
       k=number of clusters, C=|doc collection|, feat=number of feature components;
    reads the document collection in job*.txt & computes the TFIDF matrix]
   ./mssc_randprojdata.py k # where k>1 is the number of clusters
   [writes the AMPL .dat file mssc-rp-k_C_projfeat.dat, where
       k,C are as above and projfeat=number of JLL-projected components;
    reads the document collection in job*.txt, computes the TFIDF matrix
    and randomly projects it]
   cat mssc.run | ampl
     [solves convex MINLP reformulation of MSSC,
      uses mssc-k_C_feat.dat or mssc-rp-k_C_projfeat.dat]
   cat mssc-norm1.run | ampl
     [solves the 1-norm variant of the MSSC reformulation,
      uses mssc-k_C_feat.dat or mssc-rp-k_C_projfeat.dat]
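The JLL projection in mssc_randprojdata.py can be sketched with a Gaussian random matrix; the scaling and matrix choice below are assumptions, and the script may use a different construction:

```python
import math
import random

def random_projection(X, d, seed=0):
    # Project rows of X (lists of floats) from len(X[0]) dimensions down to
    # d, using a Gaussian matrix scaled by 1/sqrt(d) so that squared norms
    # are preserved in expectation (Johnson-Lindenstrauss).
    rng = random.Random(seed)
    n_feat = len(X[0])
    P = [[rng.gauss(0.0, 1.0) / math.sqrt(d) for _ in range(d)]
         for _ in range(n_feat)]
    return [[sum(x[i] * P[i][j] for i in range(n_feat)) for j in range(d)]
            for x in X]

Y = random_projection([[1.0] * 10, [0.0] * 10], 3)
```

Projecting the TFIDF matrix this way shrinks the feature dimension of the .dat file, which is what makes the MSSC MINLP tractable on larger collections.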
      
--------------

job*.txt: the data (job announcements downloaded from three different sources, two of which are fragments)
fr/*.txt: some of the job announcements were originally in French; these are the French originals, and the ones I used were Google-translated

