Warning: Creating default object from empty value in /home/patricknevindwyer/digilutionary.com/wp-includes/functions.php on line 292
digilutionary.com
Patrick Dwyer

Frequency :: C++

in News, Comparative Programming, C++ by patrick


Deprecated: preg_replace(): The /e modifier is deprecated, use preg_replace_callback instead in /home/patricknevindwyer/digilutionary.com/wp-includes/functions-formatting.php on line 76

This program is part of the Comparative Programming :: Frequency Analysis set of examples.

At the heart of our C++ example is a Map object from the Standard Template Library. The STL Map is an associative container, pairing a key and value that we can set and retrieve using the [] (square bracket) operators. In the case of frequency counting we’ll create a map between the 26 lowercase letters and a double precision floating point number.

To begin using a map, we include the Map definition (line 4), and declare our map (line 12) as a relationship between a char key and a double value. We need to initialize the frequency count of all of the letters to zero. Using a for loop (line 15) we count from 0 to 25, and accessing our freq_count with square brackets set the character with ASCII value i + 97 to 0.0. The lowercase ASCII letters start at 97, so counting from 97 to 122 we iterate over the character values a to z.

With our frequency counting map in place, we need to open our input file and begin reading characters. To access the contents of a file we need an ifstream, or Input File Stream. The definitions needed to work with files are included in lines 1, 2, and 3. Our input stream file is created on line 20 using the input file name from our command line arguments (argv[1]) and the IO Stream directive ios::in. Our file variable now represents an object through which we can access the contents of our file.

We want to read characters from our file until we reach the End Of File, or eof, using a while loop on line 22. The next step is to read a character and record it in our frequency counter. The double arrow (>>) on line 25 is an operator that reads on character from our input stream file and stores in it the character c. Knowing that lowercase ASCII characters have values between 97 and 122, and uppercase ASCII characters have values between 65 and 90, we can test our character c to see if it is a letter. If our character is a letter we just increment the proper character in our map (lines 29 and 35), increment the total number of characters counted (lines 30 and 36), and we can ignore characters that fall out of the range of letters.

When we reach the end of our file, our while loop will exit, and we can close our file (line 39). All that remains is print out our results. Like our C and Java examples, we need to closely control the format of the data we print out, or we’ll start generating numbers with an excessive precision.

To print our results we use an iterator; a construct that steps through the keys and values of our Map. The iterator im is defined on line 42 as starting with the first element of our map

map<char, double>::iterator im = freq_count.begin()

and continues through all of the values in the map using a for loop, until we reach the end of our map

im != freq_count.end()

Printing out our values involves accessing elements of our map via the iterator

cout << im->first
	
and using control settings to alter the format of our output
	
setiosflags(ios::fixed) << setprecision(2)

The two control flags tell the output (cout) that our numbers are fixed width with 2 decimal places of precision, which will generate the output we desire.

C++
01#include <iostream>
02#include <fstream>
03#include <iomanip>
04#include <map>
05
06using namespace std;
07
08int main(int argc, char* argv[]) {
09
10 char c;
11 float total = 0.0;
12 map<char, double> freq_count;
13
14 // initialize our 26 characters to zero
15 for (int i = 0; i < 26; i++) {
16 freq_count[(char)(i + 97)] = 0.0;
17 }
18
19 // open our file
20 ifstream file(argv[1], ios::in);
21
22 while ( ! file.eof() ) {
23
24 // read in each character from the file
25 file >> c;
26 if ( (c >= 65) && (c <= 90) ) {
27
28 // upper case letter
29 freq_count[c + 32]++;
30 total++;
31 }
32 else if ( (c >= 97) && (c <= 122) ) {
33
34 // lowercase letter
35 freq_count[c]++;
36 total++;
37 }
38 }
39 file.close();
40
41 // calculate the frequency of each character and print the result
42 for (map<char, double>::iterator im = freq_count.begin(); im != freq_count.end(); ++im) {
43
44 cout << im->first << ": " << setiosflags(ios::fixed) << setprecision(2) << (im->second / total * 100.0) << std::endl;
45
46 }
47}
[+-] Toggle Line Numbers

Program Source: freq.cc
Text Source: republic.txt
This text was acquired from Project Gutenberg, and
is distributed as per the license at the beginning of the text.

Compiling the example

From the command line:

g++ freq.cc -o freq_cc

Running the example

From the command line:

./freq_cc republic.txt

Frequency :: C

in News, Comparative Programming, C by patrick


Deprecated: preg_replace(): The /e modifier is deprecated, use preg_replace_callback instead in /home/patricknevindwyer/digilutionary.com/wp-includes/functions-formatting.php on line 76

This program is part of the Comparative Programming :: Frequency Analysis set of examples.

In our C example we don’t have a readily available construct akin to the Hashtable or Map of other languages, so we’ll resort to a simple array of 26 values to count our character frequency. With lines 10 through 13 we declare our array of floating point numbers, initializing each value to 0. In the end our calculation of frequency will be based upon a floating point formula, so using decimal (floating point) numbers instead of integers for counting characters saves us a later conversion.

Reading a file is straightforward; we open a file in read mode (r) on line 16, saving the resulting file handle as a File pointer. All of our file operations will use this file handle, starting with the while statement on line 18 that controls our character reading loop. The feof method will return true if the provided file handle is at the End-Of-File, so our while loop continues so long as the end of file has not been reached (notice the ! before feof, which turns our statement into while not end of file).

To count our character frequency, we’ll read the input file one character at a time with the fgetc method, storing each character in the c variable as it is read.

For our frequency analysis we’re working with ASCII encoded files, so a lowercase letter will fall in the numeric range of 97 to 122 (a to z), and an uppercase letter will be in the range 65 to 90 (A to Z). Two if statements can check for a character in the lower or upper case range. In either case (upper or lower case) we need a simple means of recording the letter we’ve read from the file. In the case of an upper case letter, subtracting 65 from the letter value will place it in our array, while subtracting 97 from a lowercase letter will do the same. Each letter a through z and A through Z can now be counted as a number 0 through 25, which maps exactly to our array indices for the char_count array.

Once the counting is complete we close our input file with fclose (line 36), and can begin our calculations of frequency. To start or calculation we need a for loop to move through each value in our frequency array. Counting from 0 to 25 (line 39) we can use a simple formula (line 40) to discover the character frequency for a specific array index. To translate the integer index (our 0 to 25 value) and calculated frequency to a printable string we’ll take advantage of the formatted printing method (printf line 41), where the format string %c: %.2f translates to “a character, followed by a colon, a space, and a floating point number printed to two decimal places”. The remaining arguments to the printf method are a character (our index value plus 97, which is equivalent to the letters a through z), and our calculated frequency. With that our program is complete, and will print out a frequency chart like the following:

a: 8.00
b: 1.51
c: 2.42
d: 3.90
e: 12.84
f: 2.20
.
.
.
t: 9.42
u: 3.02
v: 0.94
w: 2.28
x: 0.15
y: 2.16
z: 0.04
ANSI C
01#include <stdio.h>
02
03int main(int argc, char *argv[]) {
04
05 int c, i;
06 float total = 0.0;
07 float freq;
08
09 // initialize our 26 counters to zero
10 float char_count[26];
11 for (i = 0; i < 26; i++) {
12 char_count[i] = 0.0;
13 }
14
15 // open our file
16 FILE *f = fopen(argv[1], "r");
17
18 while ( !feof(f) ) {
19
20 // check each character in the file
21 c = fgetc(f);
22 if ( (c >=65) && (c <= 90) ) {
23
24 // upper case letter
25 char_count[c - 65] += 1.0;
26 total += 1.0;
27 }
28 else if ( (c >= 97) && (c <= 122) ) {
29
30 // lower case letter
31 char_count[c - 97] += 1.0;
32 total += 1.0;
33 }
34
35 }
36 fclose(f);
37
38 // calculate the frequency of each character and print the result
39 for (i = 0; i < 26; i++) {
40 freq = char_count[i] / total * 100.0;
41 printf("%c: %.2f\n", (char)(i + 97), freq);
42 }
43}
[+-] Toggle Line Numbers

Program Source: freq.c
Text Source: republic.txt
This text was acquired from Project Gutenberg, and
is distributed as per the license at the beginning of the text.

Compiling the example

From the command line:

gcc freq.c -o freq_c

Running the example

From the command line:

./freq_c republic.txt

Frequency :: Java

in News, Java, Comparative Programming by patrick


Deprecated: preg_replace(): The /e modifier is deprecated, use preg_replace_callback instead in /home/patricknevindwyer/digilutionary.com/wp-includes/functions-formatting.php on line 76

This program is part of the Comparative Programming :: Frequency Analysis set of examples.

Our Java example, while longer than most of the other frequency analysis programs, is fairly straight forward in it’s approach. To keep track of the number of character occurances in our text, we want to create a map of characters to numbers, which we’ll increment every time we see a valid character. In lines 11 through 19 we create our Hashtable and initialize it with the letters ‘a’ through ‘z’, associating with each the value ‘0.0′.

We’ll need to reference the letters ‘a’ through ‘z’ twice in our program; once to create our Hashtable, and again to print the results. For this reason we create a string of letters in alphabetical order on line 12. We can iterate through the string with a simple for loop (lines 17 and 44) to access our character counts.

With our Hashtable ready, we can open and read our file. On line 22 we create a BufferedReader to access the contents of our text file line by line. Our while loop (line 26) extracts each line from the file, placing it’s value in the line variable. This while loop will continue until our input file’s readLine method returns null, which signifies the end of the file.

Once we have a line of text from our file, we need to break it down into characters, the fundamental unit of text for our analysis. To do this we use a for loop, starting on line 29, that counts from 0 up to the length of our line. Using the index of our for loop, we can sequentially extract the characters from our line of text (line 30) using the String class’ substring method.

We don’t want to count any random characters we come across; we’re only interested in letters. Line 33 of our program contains a simple Regular Expression to test our extracted character. The pattern [a-zA-Z] matches any lower or upper case character, but nothing else. If our character matches our regular expression, we want to include it in our frequency analysis, so translate the character to lower case (in case it was an upper case character), and increment our character frequency in the Hashtable (line 37). We increment this value by first retrieving the old value using our current character

freq.get(c)

adding 1

freq.get(c) + 1.0

and then inserting our incremented value back into the Hashtable

freq.put(c, freq.get(c) + 1.0)

The last step in our for loop keeps track of the total number of valid characters so far encountered.

Once we’re done with our file, having read each line and extracted each valid character, we need to calculate our character frequencies and print them out (lines 44 to 48).

Using the string of letters from ‘a’ to ‘z’ that we defined earlier, we can get the key and value for each of the characters in our Hashtable. The key is the character itself, a letter from ‘a’ to ‘z’, while the value is how often the character occured in the text file we analyzed, the number we computed in the course of analyzing the line and characters of the file.

With the key and value in hand, it is a simple matter to determine the relative frequency of our character in the file (line 46). Printing the frequency of our character takes a little bit more work. We’d like the output of our program to look like:

a: 8.00
b: 1.51
c: 2.42
d: 3.90
e: 12.84
f: 2.20
.
.
.
t: 9.42
u: 3.02
v: 0.94
w: 2.28
x: 0.15
y: 2.16
z: 0.04

A first attempt at printing our results could be:

System.out.println(key + ": " + perc);

Which correctly prints a letter, followed by a colon, a space, and our frequency, but our frequency is a significantly long number:

a: 7.9990618682967325
b: 1.5108759990916503
c: 2.4194118807679277
d: 3.9029257051809445
.
.
.

We want to limit our frequency to two decimal places. Thankfully there is a simple method in the String class for printing numbers and strings in a controlled manner. We can pass C-Style string formats to the format method to create a better output string (line 47). The %2.2f instructs the format method that our floating point value should be printed with at most two leading digits, and two trailing digits.

While our frequency analysis is done, it’s important to note that our entire program is wrapped in a try-catch block; a construct of the java language intended for exception processing. When working with any type of Input/Output, be it files, networks, or peripherals, there is the possibility that communication will break, or somehow be disrupted. Many Java classes leave it up to the programmer to handle these situations. In our case opening and reading from a file can cause IOExceptions that we need to account for. In our program we don’t attempt to recover from the error, opting instead to print out a program trace of where the error occured (line 53).

Java
01import java.util.*;
02import java.io.*;
03
04public class freq {
05
06 public static void main(String[] args) {
07
08 try {
09
10 // create our map of lowercase letters to integers
11 Hashtable<String, Double> freq = new Hashtable<String, Double>();
12 String set = "abcdefghijklmnopqrstuvwxyz";
13
14 int count = 0;
15
16 // initialize our character counts to 0
17 for ( int i = 0; i < set.length(); i++) {
18 freq.put(set.substring(i, i+1), 0.0);
19 }
20
21 // open our file
22 BufferedReader in = new BufferedReader(new FileReader(args[0]));
23 String line;
24
25 // read in each line from the file
26 while ( (line = in.readLine()) != null) {
27
28 // extract each character from the line
29 for (int i = 0; i < line.length(); i++) {
30 String c = line.substring(i, i+1);
31
32 // try and match our character to a lower or upper case letter
33 if (c.matches("[a-zA-Z]")) {
34
35 // increment the count of our character
36 c = c.toLowerCase();
37 freq.put(c, freq.get(c) + 1.0);
38 count++;
39 }
40 }
41 }
42
43 // calculate the frequency of each of our characters, printing the result
44 for (int i = 0; i < set.length(); i++) {
45 String key = set.substring(i, i + 1);
46 double perc = freq.get(key) / count * 100.0;
47 System.out.println(key + ": " + String.format("%2.2f", perc));
48 }
49
50 } catch (IOException ioe) {
51
52 // catch any problems we have reading the file from disk
53 ioe.printStackTrace();
54 }
55 }
56}
[+-] Toggle Line Numbers

Program Source: freq.java
Text Source: republic.txt
This text was acquired from Project Gutenberg, and
is distributed as per the license at the beginning of the text.

Compiling the example

From the command line:

javac freq.java

Running the example

java freq republic.txt

Comparative Programming : Frequency

in News, Comparative Programming by patrick


Deprecated: preg_replace(): The /e modifier is deprecated, use preg_replace_callback instead in /home/patricknevindwyer/digilutionary.com/wp-includes/functions-formatting.php on line 76

Every human language has a pattern, a statistically predictable series of words, syllables, and letters. This pattern is a powerful analytic tool that is used to decipher lost languages, crack encrypted messages, build software that can speak, write and reason like a human, and transform one language into another.

The most fundamental of these analytic techniques is the frequency analysis; a comparison of how often each letter in a language occurs in a given text. Almost any sufficiently long text in a language will display a similar character frequency as other texts in the same language.

A common puzzle, the simple substitution cipher, can be quickly attacked and solved using frequency analysis. The character frequency of an encrypted text can be compared to the common frequencies of letters in plain texts, often yielding a clue as to the proper substitution for a selected set of characters.

Our example program this time looks at the most basic of operations in using frequency analysis; obtaining the frequency of each letter in a text. Our programs will be given a text file, which they must read and analyze. The output is each character from ‘a’ to ‘z’, and the frequency of that character in the supplied text.

The example input to a program might be:

[program] republic.txt

Here we’ve asked our program to analysis the character frequency of Plato’s “Republic”. A section of the output from our program might be:

a: 8.00
b: 1.51
c: 2.42
d: 3.90
e: 12.84
f: 2.20
.
.
.
t: 9.42
u: 3.02
v: 0.94
w: 2.28
x: 0.15
y: 2.16
z: 0.04

The characters from ‘a’ to ‘z’ should be printed in order, one per line, followed by a colon, a space, and then the frequency of that letter in the text.

As part of our analysis of the text, we need to make sure that our programs are providing us with an accurate assessment of the character frequency. Normally this kind of testing would require us to provide the program with a text of which we already know the frequency. Luckily for us we’re writing a multiple versions of our program, so we can compare the values generated by our different implementations to verify the results of our work. This kind of testing won’t catch high level errors we might have made in designing the frequency analysis algorithm we use for all of our programs, but it will check our calculations.

An extra program is bundled with the example implementations this week; a small tool to compile, run, and compare the results of all of our implementations. This tool, aptly named test.pl will take care of our testing for us:

perl test.pl -program freq -text republic.txt

This will compile our frequency analysis programs, run each of them against the text of Plato’s “Republic”, and provide us with collated results that look like:

a
        8.00 freq.class
        8.00 freq.pl
        8.00 freq.py
        8.00 freq.rb
        8.00 freq_c
        8.00 freq_cc
b
        1.51 freq.class
        1.51 freq.pl
        1.51 freq.py
        1.51 freq.rb
        1.51 freq_c
        1.51 freq_cc
c
        2.42 freq.class
        2.42 freq.pl
        2.42 freq.py
        2.42 freq.rb
        2.42 freq_c
        2.42 freq_cc
.
.
.

With this output we can compare the frequencies as generated by each program. If any program reports values that differ from the others, we’ll see a message telling us that the Calculated value for [letter] deviates from previous values.

Our examples:

Java
Perl
Python
C
C++
Ruby
OCaml

The entire set of source files, including the analysis text and the testing tool, can be downloaded together as a zip file.

Echo :: OCaml

in News, Comparative Programming, OCaml by patrick


Deprecated: preg_replace(): The /e modifier is deprecated, use preg_replace_callback instead in /home/patricknevindwyer/digilutionary.com/wp-includes/functions-formatting.php on line 76

This program is part of the Comparative Programming :: Echo set of examples.

The OCaml (Object-Caml) echo example is in a functional language, far different in style from our other examples. The semantics and syntax of a functional language can seem strange comparaed to an example in a procedural (or object oriented) language like our other programs. Functional languages work with data and mathematically styled functions. The focus in a functional program is on the results of a function or group of functions, rather than state-controlled flow like in a procedural language.

Our echo program defines four functions to complete our command line echo. Two of the functions are recursive: reverse and print_list. The other two functions, main and trim, are non-recursive. The two recursive functions operate on a list provided as a parameter. In this case the list will be the command line arguments (converted from an OCaml Array to an unmutable list on line 18).

Printing our command line arguments falls to the print_list function. This function takes one parameter, a list (line 9), and has two possible operations. If the list is empty (line 11) then nothing happens, and the function returns. If this list is not empty it is broken into two pieces (line 12) the head of the list (the first item), and the remainder of the list (tail). The first item of the list is printed out with a trailing space, while the remainder of the list is passed to a recursive call to print_list. Each call to the print_list function prints the first element in the list, but the recursive call on line 12 happens before the print statement. This means that our list is effectively printed out in reverse order.

We want our command line arguments printed out in the correct order, so we need a way to reverse our list of arguments before passing them to the print function. Our reverse function takes a list as a parameter, and reverses the elements of the list (line 7).

The third function we defined (line 13) is designed to return a list with the leading element removed. The command line arguments to our OCaml program (line 18) include the name of the program, which we don’t want to echo. The trim function removes this element from our list.

The final function in our program is the main function, in which we convert the Array of command arguments (line 18) into a list, and pass the list to our print_list and reverse functions. The final two lines of our program invoke the main method, and once completed print a new line. Our entire program is just a series of functions all triggered through our main function; no loops, whiles or ifs.

OCaml
01(* File echo.ml *)
02open Printf;;
03
04let rec reverse lst =
05 match lst with
06 [] -> []
07 | head :: tail -> reverse tail @ [head];;
08
09let rec print_list lst =
10 match lst with
11 [] -> []
12 | head :: tail -> printf "%s " head :: print_list tail;;
13let trim lst =
14 match lst with
15 [] -> []
16 | head :: tail -> tail;;
17let main () =
18 let a = Array.to_list(Sys.argv) in
19 print_list( reverse( trim(a) ) );;
20
21main ();;
22print_newline();;
[+-] Toggle Line Numbers

Program Source: echo.ml

Compiling and testing the OCaml example requires installation of an OCaml compiler.

Compiling the example

From the command line:

/usr/local/bin/ocamlc -o echo.ocaml echo.ml

Running the example

From the command line:

./echo.ocaml Hello World