Computational linguistics blog: 2010

Sunday, November 14, 2010

R: graphs, part2

Load a graph from a data frame. The input file is tab-separated with a header row:

col1 col2 weight
ничего космос 1.0
The first two columns are treated as vertex names, the third as the edge weight.
data<-read.table("/home/bliss/Study/Bauman/Magistr&Disser/Data/dat.txt",header=T,sep="\t")
gr<-graph.data.frame(as.data.frame(data),directed=T)

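Get vertex id by name: which(V(gr)$name=="космос")
Get edge weight using vertex names: E(gr, P=c(which(V(gr)$name=="голубой")-1,which(V(gr)$name=="платок")-1))$weight
Vertex ids start from 0, not 1, in this igraph version (hence the -1 above). igraph 0.6 and later use 1-based ids in R, so drop the -1 there.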
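
For comparison, here is a minimal plain-Python sketch of the same idea: read a tab-separated edge list with a header and look up an edge weight by vertex names. It does not use igraph. The second data row (with weight 2.5) is invented for illustration, and the helper names load_weighted_edges and edge_weight are hypothetical.

```python
import csv
import io

# Sample in the same layout as dat.txt; the second row is made up.
SAMPLE = "col1\tcol2\tweight\nничего\tкосмос\t1.0\nголубой\tплаток\t2.5\n"

def load_weighted_edges(text):
    """Return {source: {target: weight}} from a tab-separated edge list with a header row."""
    graph = {}
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for row in reader:
        graph.setdefault(row["col1"], {})[row["col2"]] = float(row["weight"])
    return graph

def edge_weight(graph, source, target):
    """Look up a directed edge weight by vertex names; None if the edge is absent."""
    return graph.get(source, {}).get(target)

graph = load_weighted_edges(SAMPLE)
print(edge_weight(graph, "голубой", "платок"))
```

Because vertices are keyed by name, no index arithmetic (and no 0-based versus 1-based question) is involved.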

Tuesday, November 9, 2010

R: working with graphs

use igraph

library(igraph)
create empty graph: graph.empty()
create graph with edges (1,2); (2,3); (5,6): graph(c(1,2,2,3,5,6), directed=TRUE)
get nodes number: vcount(graph)
get edges number: ecount(graph)
add/delete edges: add.edges; delete.edges

Shortest paths: http://igraph.sourceforge.net/doc/R/shortest.paths.html

shortest.paths(graph, v=V(graph), mode = c("all", "out", "in"),
      weights = NULL, algorithm = c("automatic", "unweighted",
                                    "dijkstra", "bellman-ford",
                                    "johnson"))
get.shortest.paths(graph, from, to=V(graph), mode = c("all", "out",
      "in"), weights = NULL)
get.all.shortest.paths(graph, from, to = V(graph), mode = c("all", "out", "in")) 
average.path.length(graph, directed=TRUE, unconnected=TRUE)
path.length.hist (graph, directed = TRUE, verbose = igraph.par("verbose"))
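
To make clear what the "dijkstra" option computes, here is a small plain-Python sketch of Dijkstra's algorithm on a dictionary-based graph. It is not igraph's implementation. It assumes non-negative weights, and the unit weights in the example are made up.

```python
import heapq

def dijkstra(adjacency, source):
    """Shortest weighted distances from source; adjacency is {node: {neighbour: weight}}."""
    distances = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > distances.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbour, weight in adjacency.get(node, {}).items():
            candidate = dist + weight
            if candidate < distances.get(neighbour, float("inf")):
                distances[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return distances

# Directed edges (1,2), (2,3), (5,6) as in the graph() example, with unit weights.
example = {1: {2: 1.0}, 2: {3: 1.0}, 5: {6: 1.0}}
print(dijkstra(example, 1))
```

Vertices 5 and 6 are missing from the result because they cannot be reached from vertex 1.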

Monday, November 8, 2010

Python Regular Expressions

E.g. we'd like to parse such html-code using regexp:
<tr><td><font color="#bbbbbb">5587 </font></td><td>изумление</td><td>S</td><td>13.98</td><td>20.65</td>
## <td>#N/A</td><td>#N/A</td><td>#N/A</td><td>#N/A</td><td>#N/A</td><td>#N/A</td><td>#N/A</td><td></td></tr>

The code will be:

import re

# html is assumed to hold the page source.
# Non-greedy: unlike .+, the pattern .+? matches the shortest possible string.
rows = re.finditer(r'(\<tr.+?tr\>)', html)
cell = r'(\<td.+?td\>)'
for row in rows:
    # twelve consecutive <td>...</td> cells, one group per cell
    cells = re.finditer(cell * 12, row.group(1))

Round brackets '(' and ')' define a group; groups can be iterated over or named.
\ - escapes the next character, so it is matched literally
.+ - matches any characters, greedy: takes the longest possible string, which can also be slow on large input
.+? - matches any characters, non-greedy: takes the shortest possible string; the better choice for parsing constructions like
<tr>...</tr>..<tr>...</tr>

'(\<tr.+tr\>)' finds a single match: ({<tr>...</tr>..<tr>...</tr>})
'(\<tr.+?tr\>)' finds two matches: ({<tr>...</tr>},{<tr>...</tr>})
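
The difference between the two patterns can be checked on a short string. This is a minimal sketch in Python 3; the sample html value is made up. If rows can span several lines, re.DOTALL is needed so that '.' also matches newlines.

```python
import re

html = "<tr><td>a</td></tr><tr><td>b</td></tr>"

# Greedy: .+ runs to the last "tr>" in the string, so both rows end up in one match.
greedy = re.findall(r"<tr.+tr>", html)

# Non-greedy: .+? stops at the first "tr>" it can reach, so each row is matched separately.
lazy = re.findall(r"<tr.+?tr>", html)

print(len(greedy))  # 1
print(len(lazy))    # 2
```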

Primitive function to remove html tags:
def remove_tags(html):
    pattern = re.compile('<.*?>')
    result = pattern.sub("", html)
    return result

Find a string in braces that doesn't contain a given symbol (here '}'):
re.finditer('({[^}]+})', text)
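
A short usage sketch for the last two patterns, written for Python 3. The sample row is cut down from the HTML above, the braces example is made up, and the constant names TAG and BRACED are hypothetical.

```python
import re

TAG = re.compile(r"<.*?>")         # any tag, non-greedy
BRACED = re.compile(r"\{[^}]+\}")  # "{", then one or more characters that are not "}", then "}"

def remove_tags(html):
    """Strip everything that looks like an HTML tag."""
    return TAG.sub("", html)

row = '<td><font color="#bbbbbb">5587 </font></td><td>S</td><td>13.98</td>'
stripped = remove_tags(row)
braced = BRACED.findall("a {b} c {d e}")

print(stripped)  # 5587 S13.98
print(braced)    # ['{b}', '{d e}']
```

Note that the cell texts run together once the tags are gone; substitute a separator instead of "" if the cells must stay apart.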

Wednesday, November 3, 2010

Apache error log

Location: /var/log/apache2/error.log

Sunday, October 17, 2010

Hebbian theory

"The general idea is an old one, that any two cells or systems of cells that are repeatedly active at the same time will tend to become 'associated', so that activity in one facilitates activity in the other." (Hebb 1949, p. 70)

"When one cell repeatedly assists in firing another, the axon of the first cell develops synaptic knobs (or enlarges them if they already exist) in contact with the soma of the second cell." (Hebb 1949, p. 63)

"If the inputs to a system cause the same pattern of activity to occur repeatedly, the set of active elements constituting that pattern will become increasingly strongly interassociated. That is, each element will tend to turn on every other element and (with negative weights) to turn off the elements that do not form part of the pattern. To put it another way, the pattern as a whole will become 'auto-associated'. We may call a learned (auto-associated) pattern an engram." (Hebb 1949, p. 44)

Hebb, D.O. (1949). The organization of behavior. New York: Wiley

http://en.wikipedia.org/wiki/Hebbian_theory
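
As a rough illustration of "cells that are repeatedly active at the same time become associated", here is a sketch of the plain Hebbian update commonly given in textbooks: the weight between units i and j grows by rate * a[i] * a[j]. The formula, the learning rate and the three-unit pattern are assumptions for the example and are not taken from the quotes above.

```python
def hebbian_update(weights, activity, rate=0.1):
    """One Hebbian step: w[i][j] += rate * a[i] * a[j] for i != j (no self-connections)."""
    n = len(activity)
    return [
        [weights[i][j] + (rate * activity[i] * activity[j] if i != j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]

weights = [[0.0] * 3 for _ in range(3)]
pattern = [1.0, 1.0, 0.0]  # units 0 and 1 are active together, unit 2 is silent

for _ in range(5):  # present the same pattern repeatedly
    weights = hebbian_update(weights, pattern)

print(weights[0][1], weights[0][2])
```

After five presentations only the link between the two co-active units has grown; with activities coded as +1/-1 instead of 1/0, links to the silent unit would become negative.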

Thursday, October 7, 2010

R: get euclidean distance

use dist()
Example:
vect<-rbind(data_assoc[term,], cl$centers[i,])
dist[i,]<-dist(vect, method="euclidean")

R: get index of array element

use which()

get index of min element in arr: which(arr==min(arr)) (returns all positions in case of ties; which.min(arr) returns only the first one)
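
The two R notes above (dist() and which()) combine naturally into "find the nearest cluster center". A minimal plain-Python sketch of that idea follows; the vectors are made up and the function names are hypothetical. Indices are 0-based here, unlike in R.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_center(vector, centers):
    """Return (index of the closest center, list of all distances)."""
    distances = [euclidean(vector, center) for center in centers]
    return distances.index(min(distances)), distances

index, distances = nearest_center([0.0, 0.0], [[3.0, 4.0], [1.0, 0.0]])
print(index, distances)  # 1 [5.0, 1.0]
```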

Sunday, October 3, 2010

R: making plots

data:
threshold;cues;words;pairs
0;6577;23196;102516
1;5437;10496;29998

data<-read.table("../data", colClasses=c("integer","integer","integer","integer"),header=TRUE,sep=";")
# largest value among the plotted columns, used as the upper y limit so that all three lines fit
max_y<-max(data[,2:4])
plot_colors=c("red","green","blue")
plot(data$threshold,data$cues,xlab="Threshold for number of reactions per cue",ylab="number of cues,words,pairs",col=plot_colors[1],main="Data quantity reduction",type="l",ylim=c(0,max_y))
# lines() adds to the existing plot; axis labels and title are set by plot() only
lines(data$threshold,data$words,col=plot_colors[2])
lines(data$threshold,data$pairs,col=plot_colors[3])

legend("topright", names(data[2:4]), cex=0.8, col=plot_colors, lty=1, lwd=2, bty="n")
dev.copy(png,filename="../Picts/threshold_value_dependence.png",height=600, width=800,bg="white")
dev.off()

Tuesday, September 28, 2010

Django: 404 page

settings.py:
set DEBUG=False

create 404.html page in your templates directory.

That's it!

Google docs online viewer

Very good viewer for your pdfs from Google. Just copy&paste the code:
<iframe src="http://docs.google.com/gview?url=http://infolab.stanford.edu/pub/papers/google.pdf&embedded=true" style="width:600px; height:500px;" frameborder="0"></iframe>

Replace the document URL after url= (http://infolab.stanford.edu/pub/papers/google.pdf in this example) with the URL of your own document.
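
If the snippet is generated from code, the document URL is best percent-encoded before it goes into the url= parameter, and the ampersand escaped inside the HTML attribute. A minimal Python 3 sketch follows. The function name gview_iframe is hypothetical, and whether the viewer accepts an encoded URL in exactly this form is an assumption.

```python
from html import escape
from urllib.parse import quote

def gview_iframe(document_url, width=600, height=500):
    """Build the embed snippet for a given document URL."""
    src = ("http://docs.google.com/gview?url="
           + quote(document_url, safe="")  # percent-encode ':' and '/' as well
           + "&embedded=true")
    return ('<iframe src="%s" style="width:%dpx; height:%dpx;" frameborder="0"></iframe>'
            % (escape(src), width, height))

snippet = gview_iframe("http://infolab.stanford.edu/pub/papers/google.pdf")
print(snippet)
```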

Sunday, September 26, 2010

Postgresql

Create/alter user:

su - postgres
psql
alter user postgres with password 'newpassword'; OR ALTER USER username ENCRYPTED password 'newpassword';

Create database:
create database databaseName;
create database databaseName with encoding 'SQL_ASCII'; OR create database databaseName with encoding 'utf8';
grant all privileges on database databaseName to username;

copy files to remote server via ssh

rsync -e ssh -rvzl ttt user@host:/home/ttt

Friday, September 24, 2010

Installation on Ubuntu Lucid

The problem with locales:
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
set locale:
/usr/share/locales/install-language-pack

Repositories list: /etc/apt/sources.list

get installed pkgs: dpkg --get-selections
dpkg --get-selections|grep python

gcc installation:
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install build-essential
gcc -v
make -v

gmake:
sudo ln -s /usr/bin/make /usr/bin/gmake

svn:
sudo apt-get install subversion

django:
sudo apt-get install apache2 libapache2-mod-python
svn co http://code.djangoproject.com/svn/django/trunk/ django_src
python -c "from distutils.sysconfig import get_python_lib; print get_python_lib()"
ln -s `pwd`/django_src/django YOUR-DIR/django
sudo cp ~/django_src/django/bin/django-admin.py /usr/local/bin

for postgresql:
apt-get install libreadline6
apt-get install libreadline6-dev
apt-get install libz-dev

psycopg2:
sudo apt-get install python-psycopg2

chardet:
wget http://chardet.feedparser.org/download/python2-chardet-2.0.1.tgz
tar -xzvf python2-chardet-2.0.1.tgz
cd python2-chardet-2.0.1
python setup.py install

matplotlib:
sudo apt-get install python-matplotlib
chmod 777 /dir-with-.matplotlib or export HOME=any-dir-with-write-access

PIL:
sudo apt-get install python-imaging

for R:
sudo gedit /etc/apt/sources.list
deb http://rh-mirror.linux.iastate.edu/CRAN/bin/linux/ubuntu lucid/
sudo apt-get update
sudo apt-get install r-base

after that I got problem trying to execute R:
/usr/lib/R/bin/exec/R: symbol lookup error: /usr/local/lib/libreadline.so.6: undefined symbol: PC

the solution:
sudo rm /usr/local/lib/libreadline.*
sudo ldconfig

postgresql:
wget http://wwwmaster.postgresql.org/redir/180/f/source/v9.0.0/postgresql-9.0.0.tar.bz2
bunzip2 postgresql-9.0.0.tar.bz2
tar xvf postgresql-9.0.0.tar
cd postgresql-9.0.0
./configure

Tuesday, September 21, 2010

MySQL->PostgreSQL

Since I had lots of problems with cyrillic encodings in MySQL, I've decided to move to PostgreSQL.
I've changed my SQL schema a bit: the auto-increment PK became serial in PostgreSQL.
Also, I've decided to use Django, so first of all I described my DB model:

from django.db import models

class Cues(models.Model):
    data = models.CharField(max_length=50)

    def __unicode__(self):
            return self.data

class Reacts(models.Model):
    data = models.CharField(max_length=100)

    def __unicode__(self):
            return self.data

class Weights(models.Model):
    cue=models.ForeignKey(Cues)
    react = models.ForeignKey(Reacts)
    weight=models.FloatField()
After that I ran the management commands:

python manage.py validate

python manage.py sqlall avs

python manage.py syncdb

and it automatically created schemas:
CREATE TABLE avs_cues
(
id serial NOT NULL,
data character varying(50) NOT NULL,
CONSTRAINT avs_cues_pkey PRIMARY KEY (id)
)

CREATE TABLE avs_reacts
(
id serial NOT NULL,
data character varying(100) NOT NULL,
CONSTRAINT avs_reacts_pkey PRIMARY KEY (id)
)

CREATE TABLE avs_weights
(
id serial NOT NULL,
cue_id integer NOT NULL,
react_id integer NOT NULL,
weight double precision NOT NULL,
CONSTRAINT avs_weights_pkey PRIMARY KEY (id),
CONSTRAINT avs_weights_cue_id_fkey FOREIGN KEY (cue_id)
      REFERENCES avs_cues (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED,
CONSTRAINT avs_weights_react_id_fkey FOREIGN KEY (react_id)
      REFERENCES avs_reacts (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED
)
WITH (
OIDS=FALSE
);
ALTER TABLE avs_weights OWNER TO postgres;

-- Index: avs_weights_cue_id

-- DROP INDEX avs_weights_cue_id;

CREATE INDEX avs_weights_cue_id
ON avs_weights
USING btree
(cue_id);

-- Index: avs_weights_react_id

-- DROP INDEX avs_weights_react_id;

CREATE INDEX avs_weights_react_id
ON avs_weights
USING btree
(react_id);
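! The problem is that the framework doesn't allow creating a composite key, so the pair (cue_id, react_id) cannot serve as the key here the way it did in the MySQL schema.

Example of a query (returns each cue and the number of reactions for this cue):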
select avs_cues.data, count(react_id) from avs_weights,avs_cues where avs_cues.id=avs_weights.cue_id group by avs_cues.data;
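
To try the query without a PostgreSQL server, it can be run against an in-memory SQLite database. This is only a sketch: sqlite3 stands in for PostgreSQL, the table definitions are simplified, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE avs_cues (id INTEGER PRIMARY KEY, data TEXT NOT NULL);
CREATE TABLE avs_reacts (id INTEGER PRIMARY KEY, data TEXT NOT NULL);
CREATE TABLE avs_weights (
    id INTEGER PRIMARY KEY,
    cue_id INTEGER NOT NULL REFERENCES avs_cues (id),
    react_id INTEGER NOT NULL REFERENCES avs_reacts (id),
    weight REAL NOT NULL
);
""")

# Made-up sample rows: two cues, three reactions, three cue-reaction pairs.
conn.executemany("INSERT INTO avs_cues (id, data) VALUES (?, ?)",
                 [(1, "космос"), (2, "платок")])
conn.executemany("INSERT INTO avs_reacts (id, data) VALUES (?, ?)",
                 [(1, "ничего"), (2, "голубой"), (3, "звезда")])
conn.executemany("INSERT INTO avs_weights (cue_id, react_id, weight) VALUES (?, ?, ?)",
                 [(1, 1, 1.0), (1, 3, 2.0), (2, 2, 1.0)])

query = """
SELECT avs_cues.data, COUNT(react_id)
FROM avs_weights, avs_cues
WHERE avs_cues.id = avs_weights.cue_id
GROUP BY avs_cues.data
"""
counts = dict(conn.execute(query).fetchall())  # {cue: number of reactions}
conn.close()
```

With these rows, counts maps the first cue to 2 and the second to 1.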

Thursday, September 16, 2010

MySQL database schema

I've decided to work with utf-8 encoding instead of cp1251.
First, set it:

SET names 'utf8';
DROP TABLE IF EXISTS weights;

DROP TABLE IF EXISTS cues;
DROP TABLE IF EXISTS reacts;

CREATE TABLE cues
       (
         id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
         data VARCHAR(50) collate utf8_general_ci
       );
CREATE TABLE reacts
       (
         id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
         data VARCHAR(100) collate utf8_general_ci
       );
CREATE TABLE weights
       (
         cue_id INT NOT NULL,
         react_id INT NOT NULL,
         weight DOUBLE NOT NULL,
         FOREIGN KEY (cue_id) REFERENCES cues(id),
         FOREIGN KEY (react_id) REFERENCES reacts(id),
         PRIMARY KEY(cue_id, react_id)
       );

Python:
# -*- coding: utf-8 -*-
import MySQLdb

##connection:
conn = MySQLdb.connect (host = "localhost",
                        user = "user1",
                         passwd = "pass1",
                         db = "db_name")
cursor = conn.cursor ()
cursor.execute ("SET names 'utf8'")
....
# pass the value as a query parameter instead of concatenating it into the SQL string
cursor.execute ("SELECT id FROM cues WHERE data=%s", (cue,))
cue_id=cursor.fetchone ()[0]
..
cursor.close()
conn.close()

Tuesday, September 14, 2010

Python: Working with unicode, cp1251 encodings

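In Python 2, you cannot capitalize a byte string in cp1251 (or any other non-Unicode encoding) or change its case: lower() and upper() typically leave the Cyrillic letters untouched.

To fix it, convert the string to unicode first (here the source is utf-8 encoded):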
>>> print unicode("ПРИВЕТ",'utf-8').lower()
привет
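
The same effect can be reproduced in Python 3, where the distinction is between bytes and str. This is a sketch written for Python 3, unlike the Python 2 line above; the variable names are arbitrary.

```python
raw = "ПРИВЕТ".encode("cp1251")  # bytes in cp1251, as they would come from a legacy file

# bytes.lower() only converts ASCII letters, so the Cyrillic bytes stay unchanged.
unchanged = raw.lower() == raw

# Decode to text first, then change case.
lowered = raw.decode("cp1251").lower()

print(unchanged)            # True
print(lowered == "привет")  # True
```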

Monday, September 13, 2010

MySQL + Python

I've decided to work with database tables instead of text files.
Guide for Ubuntu (9.10, 10.04):
If you already have MySQL and Python installed, you need to:
1. Download MySQLdb files from SourceForge
2. tar -xzvf MySQL-python-1.2.3.tar.gz
3. cd MySQL-python-1.2.3
4. python setup.py build
Some errors may occur here:
File "setup.py", line 5, in <module>
   from setuptools import setup, Extension
ImportError: No module named setuptools
To fix: sudo apt-get install python-setuptools python-dev libmysqlclient15-dev
or EnvironmentError: mysql_config not found
To fix: export PATH=$PATH:/usr/local/mysql/bin
5. Again: python setup.py build
6. sudo python setup.py install
7. Check: python
>>> import MySQLdb

Connect to MySQL using bash:
mysql -h localhost -u root -p

Create database: create database my_db;

Script example:
import MySQLdb

conn = MySQLdb.connect (host = "localhost",
                         user = "root",
                         passwd = "my_ps",
                         db = "my_db")
cursor = conn.cursor ()
cursor.execute ("SELECT VERSION()")
row = cursor.fetchone ()
print "server version:", row[0]
cursor.close ()
conn.close ()
Manual: http://www.kitebird.com/articles/pydbapi.html

Thursday, September 2, 2010

ANN Models for language acquisition

Source: http://www.lsi.upc.edu/~jpoveda/publications/ideal06.pdf
Jordi Poveda, Alfredo Vellido from Technical University of Catalonia (UPC), Barcelona, Spain
Nativists: Chomsky, Fodor and Pinker
Non-nativists: MacWhinney, Bates and Snow (among others)

Probabilistic interpretation of neural networks:

1. MacKay, D. J. C.: Bayesian Methods for Back-propagation Networks. In Domany, E., van Hemmen, J. L., and Schulten, K. (eds.), Models of Neural Networks III, Ch. 6 (1994). Springer, New York.

2. Neal, R.: Bayesian Learning for Neural Networks, PhD thesis (1994). University of Toronto, Canada

3. Bishop, C.: Neural Networks for Pattern Recognition (1995). Oxford University
Press, New York

4. Friston, K.: A Theory of Cortical Responses. Philosophical Transactions of the Royal Society, B, 360(2005)

Artificial Neural Network Architectures for Language Acquisition:
1. Kohonen Self-Organizing Maps
Results: Anderson, B.: Kohonen Neural Networks and Language. Brain and Language, (1999) 86-94
2. Elman's SRN for Building Word Meaning Representations
Description: Distributed representations (i.e. as a vector of weights) for the meaning of a word. The word meaning representations are built from contextual features, by putting the word in relation to its context, as it occurs in a stream of input sentences. This is indeed what Li and Farkas do in the WCD (Word Co-occurrence Detector) subsystem of their DevLex and SOMBIP models
Results: Elman, J. L.: Finding Structure in Time. Cognitive Science, 14 (1990) 179-211

Models:
1. Rumelhart, D., McClelland, J.: On the Learning of the Past Tenses of English Verbs. Parallel distributed processing: Explorations in the microstructure of cognition (1986)
2. TRACE: McClelland, J. L., Elman, J. L.: Interactive Processes in Speech Perception: The TRACE Model. Parallel distributed processing (1986)
3. SOMBIP: A Model of Bilingual Lexicon Acquisition
Li, P., Farkas, I.: A Self-Organizing Connectionist Model of Bilingual Processing.
Bilingual Sentence Processing (2002)
4. DevLex: A Model of Early Lexical Acquisition
Li, P., Farkas, I., MacWhinney, B.: Early Lexical Development in a Self-Organizing Neural Network. Neural Networks (2004)

Monday, August 30, 2010

Brains, Meaning and Corpus Statistics by Tom M. Mitchell (CMU)

Tom Mitchell's web-page: http://www.cs.cmu.edu/~tom/
Based on “Predicting Human Brain Activity Associated with the Meanings of Nouns,”
Mitchell, Shinkareva, Carlson, Chang, Malave, Mason, & Just, Science, 2008.

1. Can we train on word stimuli, then decode picture stimuli?
YES: We can train classifiers when presenting English words, then decode category of picture stimuli, or Portuguese words.
Therefore, the learned neural activation patterns must capture how the brain represents the meaning of input stimulus.
2. Are representations similar across people?
Can we train classifier on data from a collection of people, then decode stimuli for a new person?
YES: We can train on one group of people, and classify fMRI images of new person.
Therefore, seek a theory of neural representations common to all of us (and of how we vary).
3. Can we discover underlying principles of neural representations?
Idea: Predict neural activity from corpus statistics of the stimulus word [Mitchell et al., Science, 2008]

http://www.cs.cmu.edu/%7Etom/pubs/fMRI_public_May2009.pdf

Tuesday, April 27, 2010

Semantic memory/network

Basic ideas were presented by Allan M. Collins; M.R. Quillian (1969). "Retrieval time from semantic memory"

Semantic memory is knowledge of facts
Each node has specific meaning
SNs employ the local representation of concepts
A node's activity may spread along links to activate other nodes (spreading activation)
Spreading activation may lose strength as it travels outward from the source
Distance between 2 nodes (concepts) is related to their relatedness
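
The spreading-activation idea from the list above can be sketched in a few lines of plain Python. The tiny canary/bird/animal network, the decay factor and the cut-off threshold are illustrative assumptions, not values from Collins and Quillian.

```python
def spread_activation(links, source, decay=0.5, threshold=0.1):
    """Spread activation outward from source; each hop multiplies it by decay."""
    activation = {source: 1.0}
    frontier = [source]
    while frontier:
        next_frontier = []
        for node in frontier:
            passed = activation[node] * decay
            if passed < threshold:
                continue  # too weak to travel any further
            for neighbour in links.get(node, ()):
                # keep the strongest activation that has reached each node
                if passed > activation.get(neighbour, 0.0):
                    activation[neighbour] = passed
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return activation

links = {"canary": ["bird"], "bird": ["animal", "canary"], "animal": ["bird"]}
result = spread_activation(links, "canary")
print(result)  # {'canary': 1.0, 'bird': 0.5, 'animal': 0.25}
```

Nodes further from the source end up with lower activation, which mirrors the claim that distance between nodes reflects how related the concepts are.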

J.Sowa "Semantic networks"

Conceptual graphs.
Conceptual graphs (CGs) are a system of logic based on the existential graphs of Charles Sanders Peirce and the semantic networks of artificial intelligence.

Lendaris, G.G. (1992), "A Neural-Network Approach to Implementing Conceptual Graphs," (invited) Chapter 8 in Conceptual Structures, Nagle, et al, Editors, Ellis Horwood

Monday, April 26, 2010

Similar experiments/projects

1.Semantic memory
Where:University of California, Irvine
Who: Steyvers, M., Tenenbaum, J., & Griffiths, T.
Tags: LSI, LDA, semantic memory, probabilistic model, semantic association
Papers:
...Semantic memory
1.Steyvers, M., & Tenenbaum, J. (2005). The Large Scale Structure of Semantic Networks: Statistical Analyses and a Model of Semantic Growth. Cognitive Science, 29(1), 41-78
2.Steyvers, M. & Griffiths, T. (in press). Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, and W. Kintsch (eds), Latent Semantic Analysis: A Road to Meaning. Lawrence Erlbaum
3.Griffiths, T., & Steyvers, M. (2004). Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235
4.Griffiths, T.L., & Steyvers, M. (2002). Prediction and semantic association. In: Advances in Neural Information Processing Systems, 15
5. Griffiths, T.L., & Steyvers, M. (2002). A probabilistic approach to semantic representation. In: Proceedings of the Twenty-Fourth Annual Conference of Cognitive Science Society. George Mason University, Fairfax, VA
...Memory processes
1.Shiffrin, R.M. & Steyvers, M. (1997). A model for recognition memory: REM: Retrieving Effectively from Memory. Psychonomic Bulletin & Review, 4 (2), 145-166
2. Shiffrin, R. M., & Steyvers, M. (1998). The effectiveness of retrieval from memory. In M. Oaksford & N. Chater (Eds.). Rational models of cognition. (pp. 73-95), Oxford, England: Oxford University Press
3. Steyvers, M. (2000). Modeling semantic and orthographic similarity effects on memory for individual words. Dissertation, Psychology Department, Indiana University. Formatted for 55 pages
4. Wagenmakers, E.J.M., Steyvers, M., Raaijmakers, J.G.W., Shiffrin, R.M., van Rijn, H., & Zeelenberg, R. (2004). A Model for Evidence Accumulation in the Lexical Decision Task. Cognitive Psychology, 48, 332-367
5. Steyvers, M., Wagenmakers, E.J.M., Shiffrin, R.M., Zeelenberg, R., & Raaijmakers, J.G.W. (2001). A Bayesian model for the time-course of lexical processing. In: Proceedings of the Fourth International Conference on Cognitive Modeling. George Mason University, Fairfax, VA

2. Free Association experiment (1973-1998)
Where: University of South Florida, University of Kansas
Who: Nelson, D. L., McEvoy, C. L., & Schreiber, T. A.
Tags: free association, data

3. Word associations: Norms for 1,424 Dutch words in a continuous task
Where: University of Leuven, Leuven, Belgium
Who: Simon De Deyne, Gert Storms
Tags: free association, semantic association, experiment
Papers:
1. De Deyne, S. & Storms, G. (2008). Word associations: Norms for 1,424 Dutch words in a continuous task. Behavior Research Methods, 40, 198-205
2. De Deyne, S. & Storms, G. (2008). Word associations: Network and semantic properties. Behavior Research Methods, 40, 213-231
3. Verguts, T., Ameel, E., & Storms, G. (2004). Measures of similarity in models of categorization. Memory & Cognition, 32, 379-389
4. De Deyne, S., Verheyen, S., Ameel, E., Vanpaemel, W., Dry, M., Voorspoels, W., & Storms, G. (2008). Exemplar by feature applicability matrices and other Dutch normative data for semantic concepts. Behavior Research Methods, 40, 1030-1048

4. Matthew Zeigenfuse, Michael Lee research @ UCI
Where: University of California, Irvine
Who: Matthew Zeigenfuse, Michael Lee
Tags: free association, semantic association, stimulus features
Papers:
1. Zeigenfuse, M.D., & Lee, M.D. (in press). Heuristics for choosing features to represent stimuli
2. Zeigenfuse, M.D., & Lee, M.D. (2010). Finding the features that represent stimuli. Acta Psychologica, 133, 283-295
3. Zeigenfuse, M.D., & Lee, M.D. (2008). Finding feature representations of stimuli: Combining feature generation and similarity judgment tasks. In V. Sloutsky, B. Love, & K. McRae (Eds.), Proceedings of the 30th Annual Conference of the Cognitive Science Society, pp. 1825-1830. Austin, TX: Cognitive Science Society

Thursday, February 25, 2010

My project description (ASIS)

ASIS project (in Russian): http://philippovich.ru/Projects/ASIS/index.htm
This free association experiment was organized by Cherkasova, G. in 1998.
Students of different (social & technical) universities were asked to produce one reaction word for each cue word.
The total number of forms: 5000.
The number of words (stimuli) in each form: 100.
Words were generated randomly.

My data consists of 102516 pairs (cue-reaction | frequency).
The total number of distinct cues: 6577