
WORK PLAN

 

 

We worked on the project in two phases. In the first phase, we created a blog in which we described each stage of our work, along with the problems encountered; we then prepared the work environment, built our corpus, wrote the script and analysed the results.

 

 

BLOG

​

First of all, we created a blog in which our work on the project is described step by step; we also did some exercises in order to train ourselves. You can visit our blog (hosted on Google Blogger) by searching for پلوریتال.

 

 

 

Work environment

​

1. URLS : Location of our three URL files: english.txt, french.txt and urdu.txt.

2. PROGRAMMES : Location of our script (script.fahad1.sh) and of the input.txt file containing the paths to the input files (./URLS) and to the output (./TABLEAUX/Tableaux.html) of the programme.

3. PAGES-ASPIREES : Location of the pages downloaded from our URLs.

4. DUMP-TEXT : Location of the content of the extracted pages, saved as text files.

5. CONTEXTES : Folder containing the contexts in which our word appears.

6. INDEX : Folder containing the dictionaries of every dump file.

7. TABLEAUX : Location of our HTML table.

8. FICHIERGLOBAUX : Location of the files containing the concatenation of all the dump, context and index files.
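
As an illustration, this tree can be created in one go with mkdir. The sketch below is not taken from our script: the root folder name PROJET and the numbered sub-folders (one per URL file, matching the $j used in the commands further down) are assumptions.

# Sketch only: create the working tree described above; PROJET and the
# numbered sub-folders (one per URL file) are assumptions, not extracts of our script.
mkdir -p PROJET/{URLS,PROGRAMMES,PAGES-ASPIREES,DUMP-TEXT,CONTEXTES,INDEX,TABLEAUX,FICHIERGLOBAUX}
for j in 1 2 3; do
    mkdir -p PROJET/{PAGES-ASPIREES,DUMP-TEXT,CONTEXTES,INDEX}/$j;
done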

 

 

 

 

CORPUS

​

We made a corpus of about 150 URLs, comprising 50 URLs for each language.

In order to gather these URLs, we searched for the keywords "celebration", "fête" and "جشن".
These words appeared mostly in the press and in some forums. This choice enabled us to see on what types of occasions these words are used.

Script writing and Results

​


The first operation that our programme performs is the aspiration of the URLs, with the following command:

 

wget --no-check-certificate -O ./PAGES-ASPIREES/$j/$i.html $ligne;

 

At this stage, our table has three columns: URLs, aspirated pages and retourwget (the return code of wget).
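
To show how this command fits into the script, here is a minimal sketch of the aspiration loop. Only the wget line is taken from above; the way input.txt is read and the counters $j and $i are assumptions made for illustration.

# Sketch of the aspiration loop (assumed structure, not a verbatim extract of script.fahad1.sh).
read dossierURLS < ./PROGRAMMES/input.txt;      # assumption: the first line of input.txt is ./URLS
j=0;
for fichier in $dossierURLS/*.txt; do           # english.txt, french.txt, urdu.txt
    j=$((j+1));
    i=0;
    while read -r ligne; do
        i=$((i+1));
        wget --no-check-certificate -O ./PAGES-ASPIREES/$j/$i.html $ligne;
        retourwget=$?;                          # value shown in the retourwget column
    done < $fichier;
done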

 

 

In the second step, we type the following command:

lynx -dump -nolist -display_charset="$encodage" ./PAGES-ASPIREES/$j/$i.html > ./DUMP-TEXT/$j/$i-utf8.txt;

 

In this way, we obtain the content of the aspirated pages in text format.

 

We then use the following command, which extracts the charset declared in each page and provides the value passed to lynx via -display_charset:

encodageMeta=$(egrep -io "<meta[^>]+>" ./PAGES-ASPIREES/$j/$i.html | egrep -io "charset *=[^ \>]+" | cut -d= -f 2 | tr -d \" | tr -d \' | tr -d \> | tr -d " " | tr -d \/ | sort -u);

which enables us to save the pages in the encoding we want.
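
The detected value then has to be fed back into the lynx command shown earlier. The sketch below illustrates one way of doing this; the fallback to UTF-8 when no charset is declared is our assumption, not a verbatim extract of the script.

# Sketch only: choose the charset passed to lynx via -display_charset.
# $encodageMeta comes from the command above; the UTF-8 fallback is an assumption.
if [ -n "$encodageMeta" ]; then
    encodage=$encodageMeta;
else
    encodage="UTF-8";
fi
lynx -dump -nolist -display_charset="$encodage" ./PAGES-ASPIREES/$j/$i.html > ./DUMP-TEXT/$j/$i-utf8.txt;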

 

The "Contextes" step allows us to keep only the lines of the UTF-8 dump files that we are interested in. In order to identify these lines, we use the command egrep followed by our pattern (motif):

 

egrep -i "([Cc]elebrat)|([Ff][êe]t)|(جشن)" ./DUMP-TEXT/$j/$i-utf8.txt > ./CONTEXTES/$j/$i-utf8.txt;

 

The problem we encountered with this command was that if we separated our Urdu pattern with a space, it did not recognise the word. The "contexte" section also contains two columns: one for the context files and the other for the HTML contexts, which are created by the Perl programme "minigrepmultilingue.pl". Its result is moved into CONTEXTES with the following commands:

 

perl ./PROGRAMMES/minigrepmultilingue-v2.2-regexp/minigrepmultilingue.pl "UTF-8" ./DUMP-TEXT/$j/$i-utf8.txt ./PROGRAMMES/minigrepmultilingue-v2.2-regexp/motif-regexp.txt;
mv ./resultat-extraction.html ./CONTEXTES/$j/$i-utf8.html;

 

The indexes are the dictionaries of the words contained in the UTF-8 dumps. We used the following command to generate these dictionaries for our tables:

egrep -o "\w+" 1.txt | sort | uniq -c | sort -r
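
Inside the script, the same command can be applied to each dump file and its result stored in the INDEX folder. The output file name used below is an assumption.

# Sketch: build the dictionary of one dump file and store it in INDEX.
# $i-index.txt is an assumed name; sort -nr sorts the counts numerically.
egrep -o "\w+" ./DUMP-TEXT/$j/$i-utf8.txt | sort | uniq -c | sort -nr > ./INDEX/$j/$i-index.txt;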

 

 

You can consult our Table menu for more details about the table output.

 

Finally, after making the word clouds, we used the software Le Trameur, which determines the co-occurrence of our word with other words in its context. For this, you can visit the Trameur link in the menu above.
