
An Evaluation of Generative Models for Natural Source Code

University of Alberta, CMPUT 663 Winter 2019 Project

The project proposal can be found here.

Quickstart

virtualenv venv --python=python3
source venv/bin/activate
pip install -r requirements.txt
./main.py

Data

We used Google BigQuery to fetch all Python files from GitHub's publicly available extracts; see gh/README.md for details.

Table details for the snapshot `contents_py_201802snap`:

| Field | Value |
| --- | --- |
| Table Size | 43.0 GB |
| Long Term Storage Size | 43.0 GB |
| Number of Rows | 5,941,484 |
| Creation Time | Feb 18, 2018, 6:12:17 AM |
| Last Modified | Feb 18, 2018, 6:14:19 AM |
| Expiration Time | Never |
| Data Location | US |
| Labels | None |

The table was built with the following query:

```sql
SELECT a.id id, size, content, binary, copies
  , sample_repo_name, sample_path
  , sample_stars_2016, sample_stars_2017
  , sample_stars
FROM (
  SELECT id
    , ARRAY_AGG(a.repo_name ORDER BY c.stars DESC LIMIT 1)[OFFSET(0)] sample_repo_name
    , ARRAY_AGG(path ORDER BY c.stars DESC LIMIT 1)[OFFSET(0)] sample_path
    , ARRAY_AGG(IFNULL(b.stars,0) ORDER BY c.stars DESC LIMIT 1)[OFFSET(0)] sample_stars_2016
    , ARRAY_AGG(IFNULL(b2.stars,0) ORDER BY c.stars DESC LIMIT 1)[OFFSET(0)] sample_stars_2017
    , ARRAY_AGG(IFNULL(c.stars,0) ORDER BY c.stars DESC LIMIT 1)[OFFSET(0)] sample_stars
  FROM `bigquery-public-data.github_repos.files` a
  LEFT JOIN `fh-bigquery.github_extracts.2016_repo_stars` b
    ON a.repo_name = b.repo_name
  LEFT JOIN `fh-bigquery.github_extracts.2017_repo_stars` b2
    ON a.repo_name = b2.repo_name
  LEFT JOIN `fh-bigquery.github_extracts.repo_stars` c
    ON a.repo_name = c.repo_name
  WHERE path LIKE '%.py'
  GROUP BY 1
) a
JOIN `bigquery-public-data.github_repos.contents` b
  ON a.id = b.id
```
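
For reference, rows from a snapshot table like this can be streamed from Python with the `google-cloud-bigquery` client. The sketch below is only illustrative: the project, dataset, and table names are placeholders, and the actual download logic lives under `gh/`.

```python
# Illustrative sketch only: streams a few rows from a saved snapshot table.
# Assumes google-cloud-bigquery is installed and authenticated, and that the
# query results were saved to `your-project.github.contents_py_201802snap`
# (placeholder name).
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT sample_repo_name, sample_path, sample_stars, content
FROM `your-project.github.contents_py_201802snap`
WHERE binary = FALSE AND sample_stars >= 10
ORDER BY sample_stars DESC
LIMIT 100
"""

# QueryJob.result() returns a RowIterator that pages through results lazily.
for row in client.query(query).result():
    print(row.sample_repo_name, row.sample_path, row.sample_stars)
```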

KenLM

# To build KenLM with 10-gram support and train the n-gram model:
cd kenlm
mkdir -p build
cd build
cmake .. -DKENLM_MAX_ORDER=10
make -j 4
cd ../..
./main.py --stream-source-code | ./kenlm/build/bin/lmplz -o 10 -S 80G -T /tmp > ./models/py-10gram.arpa
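
The pipeline above assumes that `./main.py --stream-source-code` writes one whitespace-separated token sequence per source file to stdout, which is the line-oriented input format `lmplz` reads. A rough sketch of such a streamer, built on Python's `tokenize` module (the actual tokenization in `main.py` may differ), could look like:

```python
#!/usr/bin/env python3
# Rough sketch of a source-code streamer for lmplz: one whitespace-separated
# token sequence per Python file. This is an assumption about what
# --stream-source-code does, not the project's actual implementation.
import sys
import tokenize
from pathlib import Path

SKIP = {tokenize.ENCODING, tokenize.NEWLINE, tokenize.NL,
        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def stream_tokens(root="."):
    """Yield one space-joined token line per .py file under `root`."""
    for path in Path(root).rglob("*.py"):
        try:
            with open(path, "rb") as fh:
                toks = [t.string for t in tokenize.tokenize(fh.readline)
                        if t.type not in SKIP]
        except (tokenize.TokenError, SyntaxError, UnicodeDecodeError):
            continue  # skip files that fail to tokenize
        if toks:
            # Collapse internal whitespace (e.g. in multi-line strings) so
            # each file stays on a single line for lmplz.
            yield " ".join(" ".join(toks).split())


if __name__ == "__main__":
    for line in stream_tokens(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(line)
```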
# To install the KenLM Python package:
cd kenlm
# In setup.py, raise the maximum order from 6 to 10, i.e. change
#   ARGS = ['-O3', '-DNDEBUG', '-DKENLM_MAX_ORDER=6', '-std=c++11']
# to
#   ARGS = ['-O3', '-DNDEBUG', '-DKENLM_MAX_ORDER=10', '-std=c++11']
python setup.py install
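
Once installed, the trained ARPA model can be loaded and queried from Python. A minimal sketch, assuming the 10-gram model produced above and a snippet tokenized the same way as the training stream:

```python
# Minimal sketch: score a whitespace-tokenized snippet with the trained model.
# Assumes ./models/py-10gram.arpa exists (built by the lmplz step above) and
# that the snippet uses the same tokenization as the training data.
import kenlm

model = kenlm.Model("./models/py-10gram.arpa")

snippet = "def add ( a , b ) : return a + b"
print("log10 probability:", model.score(snippet, bos=True, eos=True))
print("perplexity       :", model.perplexity(snippet))

# Per-token breakdown: (log10 prob, n-gram order used, out-of-vocabulary flag).
# zip drops the trailing </s> entry that full_scores also yields.
for (logprob, order, oov), tok in zip(model.full_scores(snippet), snippet.split()):
    print(f"{tok:>8}  {logprob:8.3f}  order={order}  oov={oov}")
```

Note that loading a 10-gram ARPA file requires the Python package to be built with `KENLM_MAX_ORDER=10`, which is exactly what the setup.py edit above enables.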