Random Similarities
Euskara Magyar

Aim

To estimate the probability of drawing the same or a similar word by chance in two randomly selected samples, one from Euskara (Basque) and one from Magyar (Hungarian).

Method

Faults/Failures

Notes and Observations


Results

Magyar sample 1 Frequency Analysis
             Initial		     Medial		      Final
	-------------------	-------------------	-------------------					
Phoneme	freq	Rel. freq.	freq	Rel. freq.	freq	Rel. freq.
   a	2	0.019801980	25	0.096899225	8	0.079207921
   b	5	0.049504950	3	0.011627907	0	0.000000000
   c	2	0.019801980	0	0.000000000	2	0.019801980
   d	3	0.029702970	6	0.023255814	8	0.079207921
   e	3	0.029702970	31	0.120155039	0	0.000000000
   f	9	0.089108911	0	0.000000000	0	0.000000000
   g	1	0.009900990	12	0.046511628	9	0.089108911
   h	9	0.089108911	5	0.019379845	0	0.000000000
   i	5	0.049504950	15	0.058139535	2	0.019801980
   j	1	0.009900990	2	0.007751938	2	0.019801980
   k	7	0.069306931	6	0.023255814	8	0.079207921
   l	9	0.089108911	18	0.069767442	9	0.089108911
   m	3	0.029702970	5	0.019379845	4	0.039603960
   n	1	0.009900990	8	0.031007752	3	0.029702970
   o	3	0.029702970	23	0.089147287	0	0.000000000
   p	2	0.019801980	4	0.015503876	1	0.009900990
   r	2	0.019801980	17	0.065891473	8	0.079207921
   s	1	0.009900990	3	0.011627907	5	0.049504950
   t	5	0.049504950	6	0.023255814	15	0.148514851
   u	1	0.009900990	5	0.019379845	0	0.000000000
   v	8	0.079207921	6	0.023255814	0	0.000000000
   y	0	0.000000000	0	0.000000000	0	0.000000000
   z	0	0.000000000	3	0.011627907	6	0.059405941
   á	0	0.000000000	12	0.046511628	0	0.000000000
   é	2	0.019801980	12	0.046511628	2	0.019801980
   í	0	0.000000000	8	0.031007752	0	0.000000000
   ó	1	0.009900990	1	0.003875969	2	0.019801980
   ô	0	0.000000000	0	0.000000000	0	0.000000000
   ő	0	0.000000000	0	0.000000000	0	0.000000000
   ö	2	0.019801980	8	0.031007752	0	0.000000000
   ú	0	0.000000000	2	0.007751938	0	0.000000000
   ű	0	0.000000000	0	0.000000000	0	0.000000000
   ü	1	0.009900990	3	0.011627907	1	0.009900990
   gy	2	0.019801980	1	0.003875969	4	0.039603960
   ly	1	0.009900990	0	0.000000000	0	0.000000000
   ny	1	0.009900990	1	0.003875969	1	0.009900990
   ty	0	0.000000000	0	0.000000000	0	0.000000000
   cs	3	0.029702970	3	0.011627907	1	0.009900990
   zs	1	0.009900990	1	0.003875969	0	0.000000000
   cz	0	0.000000000	0	0.000000000	0	0.000000000
   sz	5	0.049504950	3	0.011627907	0	0.000000000

Totals	101			258			101		Grand Total	460
For example, the relative frequency of medial "a" is its count divided by the Medial column total:
	25/258 = 0.096899225
The other relative frequencies are calculated in the same way, each using the total of its own column.
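
The calculation above can be sketched in Python. This is only an illustration: the helper name relative_frequency and the dictionary layout are assumptions made here, not part of the original analysis. The counts and column totals are copied from the Magyar table, and only two phonemes are shown.

```python
# Column totals from the "Totals" row of the Magyar table.
MAGYAR_TOTALS = {"initial": 101, "medial": 258, "final": 101}

# Raw counts for two phonemes, copied from the table (not the full inventory).
MAGYAR_COUNTS = {
    "a": {"initial": 2, "medial": 25, "final": 8},
    "t": {"initial": 5, "medial": 6, "final": 15},
}


def relative_frequency(count, column_total):
    """Hypothetical helper: a count divided by the total of its own column."""
    return count / column_total


for phoneme, counts in MAGYAR_COUNTS.items():
    for position, count in counts.items():
        rel = relative_frequency(count, MAGYAR_TOTALS[position])
        print(f"{phoneme} {position:8s} {count:3d} {rel:.9f}")
```

Each position uses its own total, so the Initial and Final columns are divided by 101 and the Medial column by 258.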

Euskara sample 2 Frequency Analysis
             Initial		     Medial		      Final
	-------------------	-------------------	-------------------					
Phoneme	freq	Rel. freq.	freq	Rel. freq.	freq	Rel. freq.
   a	18	0.178217822	79	0.146840149	27	0.267326733
   b	5	0.049504950	14	0.026022305	0	0.000000000
   d	3	0.029702970	20	0.037174721	0	0.000000000
   e	13	0.128712871	54	0.100371747	12	0.118811881
   f	0	0.000000000	1	0.001858736	0	0.000000000
   g	7	0.069306931	17	0.031598513	0	0.000000000
   h	3	0.029702970	4	0.007434944	0	0.000000000
   i	9	0.089108911	53	0.098513011	17	0.168316832
   j	5	0.049504950	0	0.000000000	0	0.000000000
   k	6	0.059405941	39	0.072490706	1	0.009900990
   l	5	0.049504950	22	0.040892193	0	0.000000000
   m	3	0.029702970	9	0.016728625	0	0.000000000
   n	6	0.059405941	24	0.044609665	9	0.089108911
   o	3	0.029702970	24	0.044609665	9	0.089108911
   p	1	0.009900990	9	0.016728625	0	0.000000000
   r	0	0.000000000	67	0.124535316	2	0.019801980
   s	3	0.029702970	12	0.022304833	0	0.000000000
   t	1	0.009900990	32	0.059479554	0	0.000000000
   u	2	0.019801980	26	0.048327138	18	0.178217822
   x	2	0.019801980	2	0.003717472	0	0.000000000
   z	5	0.049504950	14	0.026022305	4	0.039603960
   dd	0	0.000000000	0	0.000000000	0	0.000000000
   ts	0	0.000000000	2	0.003717472	0	0.000000000
   tx	1	0.009900990	2	0.003717472	0	0.000000000
   tt	0	0.000000000	0	0.000000000	0	0.000000000
   tz	0	0.000000000	11	0.020446097	2	0.019801980
   ll	0	0.000000000	0	0.000000000	0	0.000000000
   ń	0	0.000000000	1	0.001858736	0	0.000000000

Totals	101			538			101		Grand Total	740

Initial Probabilities
				Model I				Model II	
	Euskara	Magyar		Est. Prob.	Magyar		Est. Prob.
	a	a,á		0.003529066	a,á		0.003529066
	b	b,v		0.006371924	b,p,v		0.007352220
	d	d		0.000882266	d,t		0.002352711
	e	e,é		0.006371924	e,é		0.006371924
	f	f		0.000000000	f		0.000000000
	g	g		0.000686207	g,k		0.005489658
	h	h		0.002646799	h,g		0.002940888
	i	i,í		0.004411332	i,í		0.004411332
	j	j,h,gy,zs,s	0.006862072	j,h,gy,zs,s	0.006862072
	k	k		0.004117243	k,g		0.004705421
	l	l		0.004411332	l		0.004411332
	m	m		0.000882266	m		0.000882266
	n	n		0.000588178	n		0.000588178
	o	o,ó,ö,ő,ô	0.001764533	o,ó,ö,ő,ô	0.001764533
	p	p		0.000196059	p,b,v		0.001470444
	r	r		0.000000000	r		0.000000000
	s	sz		0.001470444	sz,zs,z		0.001764533
	t	t		0.000490148	t,d		0.000784237
	u	u,ú,ü,ű		0.000392118	u,ú,ü,ű		0.000392118
	x	s		0.000196059	s,zs,z		0.000392118
	z	sz		0.002450740	sz,z,zs,s	0.003431036
	dd	gy		0.000000000	gy		0.000000000
	ts	c		0.000000000	c,z		0.000000000
	tx	cs,cz		0.000294089	cs,cz,zs,ty,z	0.000392118
	tt	ty		0.000000000	cs,cz,ty	0.000000000
	tz	c		0.000000000	c,z		0.000000000
	ll	ly		0.000000000	j,ly		0.000000000
	ń	ny		0.000000000	ny		0.000000000

Prob of an initial match	0.049014802			0.060288207
For example, the probability of an Initial match between Euskara <a> and 
Magyar <a> or <á> is approximately
	= prob. (Initial Euskara <a>) * ( prob.(Magyar <a>) + prob. (Magyar <á>) )
	= 0.178217822 * ( 0.019801980 + 0.000000000 )
	= 0.003529066
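
The worked example above extends to the whole Initial table: for each Euskara phoneme, multiply its relative frequency by the summed relative frequencies of the Magyar phonemes it is allowed to match, then add up the rows. The Python sketch below does this for Model I. The names match_probability, EUSKARA_INITIAL, MAGYAR_INITIAL and MODEL_I are assumptions introduced for illustration; the counts (zero rows omitted) and the correspondences are copied from the tables.

```python
# Initial-position counts from the two frequency tables (zero rows omitted).
EUSKARA_INITIAL = {
    "a": 18, "b": 5, "d": 3, "e": 13, "g": 7, "h": 3, "i": 9, "j": 5,
    "k": 6, "l": 5, "m": 3, "n": 6, "o": 3, "p": 1, "s": 3, "t": 1,
    "u": 2, "x": 2, "z": 5, "tx": 1,
}
MAGYAR_INITIAL = {
    "a": 2, "b": 5, "c": 2, "d": 3, "e": 3, "f": 9, "g": 1, "h": 9,
    "i": 5, "j": 1, "k": 7, "l": 9, "m": 3, "n": 1, "o": 3, "p": 2,
    "r": 2, "s": 1, "t": 5, "u": 1, "v": 8, "é": 2, "ó": 1, "ö": 2,
    "ü": 1, "gy": 2, "ly": 1, "ny": 1, "cs": 3, "zs": 1, "sz": 5,
}

# Model I correspondences (Euskara phoneme -> acceptable Magyar phonemes).
MODEL_I = {
    "a": "a,á", "b": "b,v", "d": "d", "e": "e,é", "f": "f", "g": "g",
    "h": "h", "i": "i,í", "j": "j,h,gy,zs,s", "k": "k", "l": "l",
    "m": "m", "n": "n", "o": "o,ó,ö,ő,ô", "p": "p", "r": "r", "s": "sz",
    "t": "t", "u": "u,ú,ü,ű", "x": "s", "z": "sz", "dd": "gy", "ts": "c",
    "tx": "cs,cz", "tt": "ty", "tz": "c", "ll": "ly", "ń": "ny",
}


def match_probability(euskara, magyar, correspondences):
    """Sum over rows of P(Euskara phoneme) * P(any matching Magyar phoneme)."""
    e_total = sum(euskara.values())
    m_total = sum(magyar.values())
    total = 0.0
    for e_phoneme, m_list in correspondences.items():
        p_e = euskara.get(e_phoneme, 0) / e_total
        p_m = sum(magyar.get(m, 0) for m in m_list.split(",")) / m_total
        total += p_e * p_m
    return total


print(f"{match_probability(EUSKARA_INITIAL, MAGYAR_INITIAL, MODEL_I):.9f}")
```

Passing a single-row dictionary such as {"a": "a,á"} as the correspondences reproduces the individual table entry for that row.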

Medial Probabilities
				Model I				Model II	
	Euskara	Magyar		Est. Prob.	Magyar		Est. Prob.
	a	a,á		0.021058471	a,á		0.021058471
	b	b,v		0.000907755	b,p,v		0.001311201
	d	d		0.000864528	d,t		0.001729057
	e	e,é		0.016728625	e,é		0.016728625
	f	f		0.000000000	f		0.000000000
	g	g		0.001469698	g,k		0.002204547
	h	h		0.000144088	h,g		0.000489899
	i	i,í		0.008782168	i,í		0.008782168
	j	j,h,gy,zs,s	0.000000000	j,h,gy,zs,s	0.000000000
	k	k		0.001685830	k,g		0.005057491
	l	l		0.002852944	l		0.002852944
	m	m		0.000324198	m		0.000324198
	n	n		0.001383245	n		0.001383245
	o	o,ó,ö,ő,ô	0.005532982	o,ó,ö,ő,ô	0.005532982
	p	p		0.000259359	p,b,v		0.000842915
	r	r		0.008205815	r		0.008205815
	s	sz		0.000259359	sz,zs,z		0.000605170
	t	t		0.001383245	t,d		0.002766491
	u	u,ú,ü,ű		0.001873145	u,ú,ü,ű		0.001873145
	x	s		0.000043226	s,zs,z		0.000100862
	z	sz		0.000302585	sz,z,zs,s	0.001008616
	dd	gy		0.000000000	gy		0.000000000
	ts	c		0.000000000	c,z		0.000043226
	tx	cs,cz		0.000043226	cs,cz,zs,ty,z	0.000100862
	tt	ty		0.000000000	cs,cz,ty	0.000000000
	tz	c		0.000000000	c,z		0.000237745
	ll	ly		0.000000000	j,ly		0.000000000
	ń	ny		0.000007204	ny		0.000007204

Prob of a medial match		0.074111697			0.083246880

Final Probabilities
				Model I				Model II	
	Euskara	Magyar		Est. Prob.	Magyar		Est. Prob.
	a	a,á		0.021174395	a,á		0.021174395
	b	b,v		0.000000000	b,p,v		0.000000000
	d	d		0.000000000	d,t		0.000000000
	e	e,é		0.002352711	e,é		0.002352711
	f	f		0.000000000	f		0.000000000
	g	g		0.000000000	g,k		0.000000000
	h	h		0.000000000	h,g		0.000000000
	i	i,í		0.003333007	i,í		0.003333007
	j	j,h,gy,zs,s	0.000000000	j,h,gy,zs,s	0.000000000
	k	k		0.000784237	k,g		0.001666503
	l	l		0.000000000	l		0.000000000
	m	m		0.000000000	m		0.000000000
	n	n		0.002646799	n		0.002646799
	o	o,ó,ö,ő,ô	0.001764533	o,ó,ö,ő,ô	0.001764533
	p	p		0.000000000	p,b,v		0.000000000
	r	r		0.001568474	r		0.001568474
	s	sz		0.000000000	sz,zs,z		0.000000000
	t	t		0.000000000	t,d		0.000000000
	u	u,ú,ü,ű		0.001764533	u,ú,ü,ű		0.001764533
	x	s		0.000000000	s,zs,z		0.000000000
	z	sz		0.000000000	sz,z,zs,s	0.004313303
	dd	gy		0.000000000	gy		0.000000000
	ts	c		0.000000000	c,z		0.000000000
	tx	cs,cz		0.000000000	cs,cz,zs,ty,z	0.000000000
	tt	ty		0.000000000	cs,cz,ty	0.000000000
	tz	c		0.000392118	c,z		0.001568474
	ll	ly		0.000000000	j,ly		0.000000000
	ń	ny		0.000000000	ny		0.000000000

Prob of a final match		0.035780806			0.042152730

Estimation of Probabilities
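The probability of finding a random match on a single word, with no 
semantic leeway, is the product of the initial, medial and final match 
probabilities (the three positions are treated as independent):
Model I (little phonetic leeway):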
	= 0.049014802 * 0.074111697 * 0.035780806
	= 0.000129976
	~ 1/7694
Model II (more phonetic leeway):
	= 0.060288207 * 0.083246880 * 0.042152730
	= 0.000211556
	~ 1/4727
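
As a quick check of the arithmetic above, the following Python sketch multiplies the three positional probabilities for each model and converts the result to "1 in N" odds. The names POSITION_PROBS, whole_word_probability and one_in are illustrative assumptions; the input figures are the column totals reported in the three probability tables.

```python
from math import prod

# (initial, medial, final) match probabilities from the tables above.
POSITION_PROBS = {
    "Model I": (0.049014802, 0.074111697, 0.035780806),
    "Model II": (0.060288207, 0.083246880, 0.042152730),
}


def whole_word_probability(position_probs):
    """Product of the positional probabilities (positions treated as independent)."""
    return prod(position_probs)


def one_in(p):
    """Express a probability p as the nearest whole N in '1 in N'."""
    return round(1 / p)


for model, probs in POSITION_PROBS.items():
    p = whole_word_probability(probs)
    print(f"{model}: p = {p:.9f} ~ 1/{one_in(p)}")
```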
If semantic leeway is allowed, i.e. more than one meaning is 
accepted in the search for matching words, then the probability of 
a "match" is calculated as follows:
Probability 	= 1 - (1 - p)^N
			~ N*p when N*p is small (first term of the binomial expansion)
where p = probability of a match with no semantic leeway
and   N = number of acceptable meanings allowed for a match.

 

Semantic leeway and the probability of a random match
		No semantic leeway	10 meanings	100 meanings	1000 meanings
Model I		1/7694			1/770		1/77		1/8
Model II	1/4727			1/473		1/47		1/5
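
The semantic-leeway formula can be explored with a short Python sketch. It compares the exact expression, which is the chance of at least one match in N independent tries, with the N*p approximation. The function names match_with_leeway and linear_estimate are assumptions made for illustration; the values of p are the no-leeway probabilities computed above.

```python
def match_with_leeway(p, n_meanings):
    """Exact probability of at least one match among n_meanings independent tries."""
    return 1 - (1 - p) ** n_meanings


def linear_estimate(p, n_meanings):
    """First-order approximation N*p; it overestimates as N*p grows."""
    return n_meanings * p


# No-leeway probabilities for the two models.
P_NO_LEEWAY = {"Model I": 0.000129976, "Model II": 0.000211556}

for model, p in P_NO_LEEWAY.items():
    for n in (1, 10, 100, 1000):
        exact = match_with_leeway(p, n)
        approx = linear_estimate(p, n)
        print(f"{model:8s} N={n:4d} exact={exact:.6f} (1/{1 / exact:.1f}) N*p={approx:.6f}")
```

Depending on whether the exact expression or the N*p approximation is used, and on how the result is rounded, the last digit of a "1/x" figure can differ by one from the tabulated value.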

Appendix A
Two Random samples of size 101
Random Number	Random Basque		Random Number	Random Magyar				
[pages 1-558]				[pages 299-637]	
26		amaboskarren		564		pislogás
453		osalari			346		divat
544		ziega			405		felé
537		zatigabeki		502		lé
283		hizkabe			495		köpü
128		borrokalari		508		lég
318		itsasbeso		454		irat
14		aipagabetasun		529		tisztít
258		gutxienez		351		egyetem
336		jo			479		gördül
335		jeinutsu		628		világít
404		mendi			546		nyikorog
429		neska			338		csepered
163		eliza			521		hasad
424		naigabe			330		borotva
225		galtze			439		haza
547		zintzurkoi		548		ócska
522		urripen			524		lágy
527		xehatze			328		bocsánat
266		harraska		630		sugár
274		herabe			567		puha
39		ararteko		528		szorít
23		aldikada		478		fogni
334		jauspen			599		tehén
364		labur			619		váltakoz
533		zalekeria		631		viz
369		lapurreta		632		vizsla
458		panpoxtu		451		ihlet
151		duda			513		lök
150		domeka			537		munka
476		sakela			401		olvad
114		bezuza			391		metszés
426		nasai			347		donga
236		gerizatu		483		old
55		artegatzaile		524		lát
320		iturgintza		394		fehér
27		amaldeko		336		csal
36		apalketa		525		öreged
162		ekintza			470		kerek
293		idortu			315		barázda
164		elikatu			633		völgy
46		arkugiltza		489		kisér
27		amaitu			510		lepel
76		azkara			587		szemérem
325		izendun			478		fakad
142		desgiltzapetu		366		végez
358		korain			431		hajlít
515		uhertze			465		karc
499		tiraka			345		dij
200		etorki			369		vádol
390		lurreratu		447		húgy
126		bizkarraulki		485		tart
330		jakingarri		436		has
46		aritz			392		faggat
168		enbeita			434		hang
377		leherdura		590		szimat
427		negarmuxinkatu		616		vág
305		inoren			335		cirok
29		anaide			476		buj
6		afaltoki		299		abból
349		kilimakatu		382		ég
350		kirik			500		lak
88		barreiari		377		erkölcs
354		komeni			568		hág
425		narru			604		tíz
309		iraizte			637		zsír
387		losintxatu		535		mi
28		amiano			411		fivér
363		kuzkur			555		örül
183		erre			573		rombol
506		txakurkeria		490		kolomp
529		zabalkuntza		368		eleven
487		soineko			426		gyilok
228		garbigai		452		illen
257		gutińo			508		ledér
475		saiatu			427		gyök
528		xukadera		335		cím
172		eragozle		551		ordas
314		irtenarazi		341		csomó
211		eztabaidaka		451		igéz
157		egoki			382		éhes
402		meategi			440		hegy
334		jaungoikotu		453		ingó
153		edertu			412		fojt
248		gona			434		hanyag
160		ehogaitz		513		lyuk
312		irin			361		néz
427		negu			416		föld
348		kezkatze		461		jog
177		erdipurdika		488		kilenc
19		alaitu			326		beteg
84		baldindu		364		szegény
442		omen			584		szár
307		ipurdi			615		üzem
34		aobatez			526		rövid
23		aldiz			609		túr
434		odolisuriz		611		undok
185		errenta			418		fül
2		aberekeria		505		locsol
251		gorroto			359		kerít
404		mendeko			305		apaszt
The Random numbers refer to page numbers in the dictionaries.
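
The text does not say how the random page numbers were generated. As a hedged sketch only, the Python below draws 101 page numbers uniformly from a page range, with replacement (an assumption suggested by the repeated page numbers in the listing, e.g. 27 and 46 in the Basque column). The function name draw_pages and the seed parameter are hypothetical, and the choice of a word on each page is not modelled because the source does not describe it.

```python
import random


def draw_pages(first_page, last_page, sample_size=101, seed=None):
    """Draw page numbers uniformly, with replacement, from an inclusive range."""
    rng = random.Random(seed)  # seed is optional; fix it to repeat a draw
    return [rng.randint(first_page, last_page) for _ in range(sample_size)]


# Page ranges as given in the column headings of the appendix.
basque_pages = draw_pages(1, 558)
magyar_pages = draw_pages(299, 637)
print(basque_pages[:5], magyar_pages[:5])
```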




Copyright © 2003

Last updated 31 May 2003