LePetitPrince

This repository contains the code of the "Le Petit Prince" (LPP) project.

Project description

This project is a cross-linguistic study involving NLP and neurolinguistics experts (NeuroSpin, FAIR, INRIA, Cornell, ...). It aims at a better understanding of the cortical bases of language comprehension through computational linguistics.

Data acquisition

The story is segmented into 9 runs:

  • Chapters 1 to 3 --> run 1
  • Chapters 4 to 6 --> run 2
  • Chapters 7 to 9 --> run 3
  • Chapters 10 to 12 --> run 4
  • Chapters 13 to 14 --> run 5
  • Chapters 15 to 19 --> run 6
  • Chapters 20 to 22 --> run 7
  • Chapters 23 to 25 --> run 8
  • Chapters 26 to 27 --> run 9

This study includes:

  • 40 fMRI sessions in English
  • 40 fMRI sessions in French
  • 20 MEG sessions in French, each 90 min long

These data were acquired through passive listening to the audiobook of "Le Petit Prince", divided into 9 runs of approximately 10 min each.

This large dataset will be shared through NeuroVault.

Methodology

To do so, we followed the methodology below.

For fMRI:

Selection and implementation of different Language Models.

Analysis pipeline:

1. Generation of raw features from the text (or audio) of "Le Petit Prince" with the selected models.
2. Concatenation of the raw-feature dataframe with an onset file (the result is called raw-features).
3. Convolution of the newly constructed dataframe with an 'hrf' kernel (the result is called features).
4. Construction of a design matrix by concatenation of the features associated with the different models of interest (the result is called design-matrix).
5. Cross-validated ridge regression between our design matrix and the fMRI data, transformed with Nilearn (the result is called ridge-indiv).
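As a minimal sketch of these steps, the following code mimics the pipeline on toy data. All names and values here (TR, HRF parameters, feature and voxel counts) are illustrative assumptions, not the project's actual settings, and the hand-rolled HRF stands in for the project's kernel:

```python
# Toy sketch of the fMRI pipeline: raw-features -> features ->
# design-matrix -> cross-validated ridge. All settings are illustrative.
import numpy as np
from scipy.stats import gamma
from sklearn.linear_model import RidgeCV

TR = 2.0        # repetition time in seconds (assumption)
n_scans = 300

def hrf_kernel(tr, duration=32.0):
    """Canonical-like HRF: difference of two gamma densities."""
    t = np.arange(0, duration, tr)
    peak = gamma.pdf(t, 6)          # positive response peaking around 6 s
    undershoot = gamma.pdf(t, 16)   # later negative undershoot
    kernel = peak - 0.35 * undershoot
    return kernel / kernel.sum()

rng = np.random.default_rng(0)

# 1-2. raw-features: one column per model output, one row per scan
raw_features = rng.standard_normal((n_scans, 4))

# 3. features: convolve each raw-feature column with the HRF kernel
kernel = hrf_kernel(TR)
features = np.column_stack(
    [np.convolve(raw_features[:, i], kernel)[:n_scans]
     for i in range(raw_features.shape[1])]
)

# 4. design-matrix: concatenate the features of the models of interest
design_matrix = np.hstack([features])   # a single model here

# 5. ridge-indiv: cross-validated ridge between design matrix and fMRI data
bold = rng.standard_normal((n_scans, 10))  # stand-in for voxel time courses
model = RidgeCV(alphas=np.logspace(-1, 3, 10)).fit(design_matrix, bold)
r2 = model.score(design_matrix, bold)
```

In the actual pipeline each of these steps is a separate script writing its output to the matching `derivatives/fMRI/` folder.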

For MEG:

(Not done yet)

Data architecture

Due to the large amount of data and analyses, this project's data, code and derivatives require an intuitive organization. To that end, we created the script create_architecture.py, which sets it up automatically (we will see how to execute the script later).

Here is a glance at the overall architecture:

├── paradigm (experiment information, stimuli)
├── oldstuff (old scripts, personal data/code, ...)
├── code (all the code of all the analysis)
│   ├── MEG (code of the MEG analysis pipeline)
│   ├── fMRI (code of the fMRI analysis pipeline)
│   ├── models (code related to models initialization/training/generation)
│   │   ├── english
│   │   │   ├── LSTM (LSTM framework)
│   │   │   ├── RMS (Framework for wave properties analysis)
│   │   │   ├── WORDRATE (Framework for simple linguistics properties analysis)
│   │   │   ├── lstm_wikikristina_embedding-size_200_nhid_300_nlayers_1_dropout_01.py (instantiation of a LSTM model)
│   │   │   ├── lstm_wikikristina_embedding-size_200_nhid_100_nlayers_3_dropout_01.py (instantiation of a LSTM model)
│   │   │   └── ...
│   │   └── french
│   └── utilities (utilities functions: parameters settings, splitter for CV, ...)
├── data (all the raw data acquired from sources)
│   ├── fMRI (fMRI data, 9 runs per subject)
│   │   └── english
│   │       └── sub-057
│   │           └── func
│   ├── wave (wave files data, 9 runs, data for models training)
│   │   ├── english
│   │   └── french
│   └── text (text data, raw text, division in 9 runs, onsets/offsets for each run, data for models training)
│       ├── english
│       │   ├── lstm_training
│       │   └── onsets-offsets
│       └── french
└── derivatives (results of the code above)
    ├── MEG
    └── fMRI (results from the fMRI pipeline in code/fMRI/)
        ├── design-matrices (concatenation of features associated with different models of interest)
        │   └── english
        ├── features (Raw-features convolved with an 'hrf' kernel)
        │   └── english
        ├── glm-indiv (GLM model fitted on fMRI data with a design-matrix)
        │   └── english
        ├── models (trained models)
        │   └── english
        ├── raw_features (Result of a model generation from the text/wave file of LPP, concatenated with the adequate onsets file)
        │   └── english
        └── ridge-indiv (Ridge model fitted on fMRI data with a design-matrix)
            └── english

To give more insight into the three main parts of the project:

  • code

    • MEG data analysis pipeline
    • fMRI data analysis pipeline
      • raw-features.py (generate raw-features and concatenate them with onsets for a given model)
      • features.py (convolve the raw-features with a 'hrf' kernel for a given model)
      • design-matrices.py (concatenate features of different models of interest)
      • glm-indiv.py (GLM model fitted on fMRI data with a design-matrix)
      • ridge-indiv.py (Ridge model fitted on fMRI data with a design-matrix)
      • dodo.py (this file is the Python analog of a Makefile, used by the doit tool)
    • utilities
      • utils.py (utility functions)
      • settings.py (where we change the parameters)
      • first_level_analysis.py (functions for first level analysis of fMRI)
      • splitter.py (splitter for multi-groups Ridge CV)
    • models
      • *XXXX*: framework associated with a kind of model (e.g. LSTM)
        • model.py (class definition)
        • train.py (train the model)
        • tokenizer.py (adequate tokenizer for the model training and generation)
        • utils.py (utilities functions)
      • *xxxx*: instantiation of a given model class
  • data

    • the fMRI data are organized following the BIDS standard, except for the name of the final file
    • the MEG data should be added in a few months
    • the text of LPP, its division into 9 runs, the original onsets-offsets of LPP, and training data for the models
    • the wave files, i.e. the content of the audiobook, with the TextGrid files and training data for the models
  • derivatives

    • MEG
    • fMRI (every script of code/fMRI/ fills a folder of the same name here, the same goes for code/models/)
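The architecture above mentions splitter.py, a splitter for multi-group ridge cross-validation. Its actual implementation is not shown here; as a hedged sketch of the idea, the 9 runs can serve as groups in a leave-one-run-out split (all names and sizes below are illustrative, not the project's code):

```python
# Hypothetical sketch of a run-wise CV splitter: hold out one of the
# 9 runs at a time so that train and test scans never share a run.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

n_scans_per_run = 50
runs = np.repeat(np.arange(1, 10), n_scans_per_run)  # run label per scan
X = np.zeros((runs.size, 3))                         # stand-in design matrix

splits = list(LeaveOneGroupOut().split(X, groups=runs))
# 9 folds: each held-out fold contains exactly one run
```

Splitting by run rather than by scan avoids leakage between temporally adjacent fMRI samples.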

Executing scripts

If you want to train a given model called model_name.py in a given language and use it in the pipeline, you need to create a module model_name.py in

$LPP/code/models/language/

and define in it the following functions:

  • load: returns the trained model
  • generate: takes as arguments a model, a path to the input run, a language and a textgrid dataframe, and generates raw-features

Then add at the end of the script:

if __name__=='__main__':
    train(model)
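A minimal sketch of such a module could look like the following. Only the load/generate interface comes from the description above; the DummyModel class, its predict method and the column names are hypothetical stand-ins for a real framework's model.py:

```python
# Hypothetical skeleton of code/models/<language>/model_name.py.
# Only the load/generate interface is prescribed by the pipeline;
# everything else here is illustrative.
import pandas as pd

class DummyModel:
    """Stand-in for a real model class defined in a framework's model.py."""
    def predict(self, word):
        return float(len(word))   # toy per-word feature

def train(model):
    """Train the model (a no-op for this stand-in)."""
    return model

def load():
    """Return the trained model used by the pipeline."""
    return DummyModel()

def generate(model, run_path, language, textgrid):
    """Generate raw-features: one feature row per word of the run."""
    values = [model.predict(w) for w in textgrid['word']]
    return pd.DataFrame({'feature': values, 'onset': textgrid['onset']})

if __name__ == '__main__':
    train(load())
```

The pipeline can then import the module by name and call load() and generate() without knowing anything about the model's internals.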

Model training

To train a given model called model_name.py in a given language, just run ($LPP being the path to the LPP project home directory):

cd $LPP
cd code
python models/language/model_name.py

fMRI pipeline

To start the fMRI pipeline analysis, first modify the code/utilities/settings.py file with the parameters, paths, subjects and models that you want to study. Then run:

cd $LPP
cd code/fMRI
doit

Normally, dodo.py will not re-run a file that has already been run unless it has been modified. If you still want to re-run everything, run:

cd $LPP
cd code/fMRI
doit clean
doit forget

Running doit will first create the adequate architecture and then start the fMRI pipeline.

Analysis

Analyses available so far:

  • scatter plot comparison of r2 distributions per ROI in the brain for 2 given models

To run such an analysis, first fill in the analysis.yaml file with the names of the models you want to study and the name of the study that this scatter plot is supposed to illustrate (e.g. syntax VS semantics). Then run the following command line:

cd $LPP
cd code/fMRI
python analysis.py $LPP/code/analysis.yaml

About

Analysis of LePetitPrince project data (fMRI & MEG)

Languages

  • Python 77.0%
  • Jupyter Notebook 23.0%