Substitution cipher cracker (PoC)

Uses a genetic algorithm with single-letter and digram (letter-pair) frequency analysis.

A ciphertext of about 1000 letters is needed for best results.

Only Polish frequency tables are included.

Usage

All applications expect and output UTF-8 text.

subst.py <enc|dec> <key file> Encodes or decodes text using the key from "key file". Feed the plaintext or ciphertext to stdin.
keygen.py Generates a random key and saves it to "random_key.txt".
test.py Generates a random key, encodes the text fed to stdin with it and tries to crack it. When an era completes or SIGINT is received, it compares the found keys to the correct one and displays how many letters were cracked correctly.
crack.py Tries to decrypt the ciphertext. Just feed the text to stdin.
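
For orientation, here is a minimal sketch of what encoding and decoding with a substitution key amounts to. It is not the code from subst.py: the key is assumed to be a plain dict mapping each plaintext letter to its ciphertext letter, and the function names are made up for this example. The actual key file format is not described here.

```python
def encode(text, key):
    """Replace every letter that appears in the key; leave other characters alone."""
    return "".join(key.get(ch, ch) for ch in text)


def decode(text, key):
    """Invert the key (ciphertext letter -> plaintext letter) and apply it."""
    inverse = {cipher: plain for plain, cipher in key.items()}
    return "".join(inverse.get(ch, ch) for ch in text)


# Toy three-letter key, purely illustrative.
key = {"a": "c", "b": "a", "c": "b"}
secret = encode("abc cab", key)
print(secret)               # cab bca
print(decode(secret, key))  # abc cab
```

Characters outside the key (spaces, punctuation) pass through unchanged in this sketch.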

With the included Polish language stats it was able to crack more than 20 letters correctly in 1500 iterations when using a text of 1000+ characters.

PS: If you have gnuplot, use plot_test_stats.sh to get a nice-looking graph of the fitness values during the ongoing cracking process. I do not force file buffer flushing, so refresh the graph only once in a while.

Algo details

Definitions:

cTXT - ciphertext
dec(TXT, K) - decrypt TXT with key K
smST - single letter model statistics
dmST - double letter model statistics
sstats(TXT) - count single letter statistics
dstats(TXT) - count double letter statistics
pcorr(ST1, ST2) - Pearson's sample correlation coefficient between statistics ST1 and ST2
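
The helpers defined above could look roughly like this in Python. This is a sketch under assumptions, not the repository's code: statistics are stored as dicts of relative frequencies, letter pairs are only counted when both characters belong to the alphabet, and pcorr compares the two dicts over the union of their keys, treating missing entries as 0.

```python
from collections import Counter
from math import sqrt


def sstats(text, alphabet):
    """Relative frequency of each alphabet letter in the text."""
    counts = Counter(ch for ch in text if ch in alphabet)
    total = sum(counts.values()) or 1
    return {ch: counts[ch] / total for ch in alphabet}


def dstats(text, alphabet):
    """Relative frequency of each adjacent letter pair seen in the text."""
    pairs = Counter(a + b for a, b in zip(text, text[1:])
                    if a in alphabet and b in alphabet)
    total = sum(pairs.values()) or 1
    return {pair: n / total for pair, n in pairs.items()}


def pcorr(st1, st2):
    """Pearson's sample correlation of two statistics dicts."""
    keys = sorted(set(st1) | set(st2))
    if not keys:
        return 0.0
    xs = [st1.get(k, 0.0) for k in keys]
    ys = [st2.get(k, 0.0) for k in keys]
    mx, my = sum(xs) / len(keys), sum(ys) / len(keys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    # A constant vector has no defined correlation; report 0 in that case.
    return cov / sqrt(var) if var > 0 else 0.0
```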

Steps:

Take key X from current population.
dpTXT = dec(cTXT, X)
ssST = sstats(dpTXT), dsST = dstats(dpTXT)
fitness = (pcorr(ssST, smST) + pcorr(dmST, dsST)) * 50
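
The fitness steps above can be sketched as a single function. Again this is an illustration under assumptions rather than the actual implementation: the key is taken to be a dict from plaintext letter to ciphertext letter, the statistics are raw counts (Pearson's correlation does not depend on scale, so counts and relative frequencies give the same result), and stats is a hypothetical helper that combines sstats and dstats.

```python
from collections import Counter
from math import sqrt


def pcorr(st1, st2):
    """Pearson's sample correlation over the union of keys (missing = 0)."""
    keys = sorted(set(st1) | set(st2))
    if not keys:
        return 0.0
    xs = [st1.get(k, 0) for k in keys]
    ys = [st2.get(k, 0) for k in keys]
    mx, my = sum(xs) / len(keys), sum(ys) / len(keys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var) if var > 0 else 0.0


def stats(text, alphabet):
    """Single-letter and letter-pair counts (sstats and dstats in one pass)."""
    single = Counter(ch for ch in text if ch in alphabet)
    double = Counter(a + b for a, b in zip(text, text[1:])
                     if a in alphabet and b in alphabet)
    return single, double


def fitness(ctxt, key, sm_st, dm_st):
    """Score a candidate key: 100 means both statistics match the model perfectly."""
    inverse = {cipher: plain for plain, cipher in key.items()}
    dp_txt = "".join(inverse.get(ch, ch) for ch in ctxt)   # dec(cTXT, X)
    ss_st, ds_st = stats(dp_txt, set(key))
    return (pcorr(ss_st, sm_st) + pcorr(dm_st, ds_st)) * 50
```

Since each correlation lies between -1 and 1, the resulting fitness ranges from -100 to 100.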

Mutation is just a number of letter swaps within the key. The crossover operator always produces valid keys and is "smart": it uses single letter model stats in the process to speed up convergence.

I decided not to use dictionary checking. The module is left there for educational purposes only ;). It is also better to start with random keys rather than with the naive key.
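
To make the two operators concrete, here is a small sketch. The mutation follows the description above (a number of letter swaps). The crossover shown is NOT the "smart" operator used by the cracker, whose details are not given here; it is a plain permutation-preserving crossover swapped in for illustration. Keys are assumed to be dicts from plaintext letter to ciphertext letter, and both parents are assumed to use the same alphabet.

```python
import random


def mutate(key, swaps=1, rng=random):
    """Return a copy of the key with `swaps` random pairs of mappings exchanged."""
    letters = list(key)
    child = dict(key)
    for _ in range(swaps):
        a, b = rng.sample(letters, 2)
        child[a], child[b] = child[b], child[a]
    return child


def crossover(mother, father, rng=random):
    """Keep the mother's mapping for a random half of the letters and fill
    the remaining letters with the unused ciphertext letters in the father's order.
    The result is always a valid key (a permutation)."""
    letters = list(mother)
    kept = set(rng.sample(letters, len(letters) // 2))
    child = {p: mother[p] for p in kept}
    used = set(child.values())
    spare = [father[p] for p in letters if father[p] not in used]
    for p in letters:
        if p not in child:
            child[p] = spare.pop(0)
    return child
```

Because swaps and the fill-in step only rearrange existing ciphertext letters, neither operator can produce a key that maps two plaintext letters to the same ciphertext letter.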

Naive key computation:

for i = 1 to length(alphabet)
	a = remaining letter with the most occurrences in the ciphertext
	b = remaining letter with the highest frequency in the model stats
	add "b -> a" to the key and remove a and b from their stacks
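
In Python the naive key boils down to pairing two frequency rankings. The sketch below assumes that the model stats are a dict from letter to frequency, that the ciphertext uses the same alphabet as the model, and that ties are broken by the order of letters in that dict; naive_key is a name chosen for this example.

```python
from collections import Counter


def naive_key(ctxt, sm_st):
    """Map the i-th most frequent model letter to the i-th most frequent ciphertext letter."""
    alphabet = list(sm_st)
    counts = Counter(ch for ch in ctxt if ch in sm_st)
    by_cipher = sorted(alphabet, key=lambda ch: counts[ch], reverse=True)
    by_model = sorted(alphabet, key=lambda ch: sm_st[ch], reverse=True)
    # Each pair is one "b -> a" entry: plaintext (model) letter -> ciphertext letter.
    return dict(zip(by_model, by_cipher))


# Toy model: "a" is the most common letter; in the ciphertext "c" is.
print(naive_key("ccccbba", {"a": 0.5, "b": 0.3, "c": 0.2}))
# {'a': 'c', 'b': 'b', 'c': 'a'}
```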

There's probably room for more improvements, such as refining the parameters of the pool, but I have spent enough time on this.

pinkeen/subst-cracker

Folders and files

Latest commit

History

Repository files navigation

Substitution cipher cracker (PoC)

Usage

Algo details

About

Resources

Stars

Watchers

Forks

Languages