[Topic-models] UTF-8 version of lda-c?

David Mimno mimno at CS.Princeton.EDU
Sun Nov 6 21:02:33 EST 2011


lda-c itself only reads integers, colons, and whitespace, which are
encoded identically in ASCII, Latin-1, and UTF-8, so it shouldn't need
any changes. The place to be aware of character encodings is when you
create the input file, that is, when you map words to integer IDs. Any
modern programming language should have no problem with Unicode. No
tricks or hacks should be necessary.
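
To illustrate the point, here is a minimal sketch in Python of building
lda-c input from UTF-8 text. It assumes the usual lda-c line format
"N id:count id:count ...", where N is the number of distinct terms in
the document. The function name to_ldac_lines and the plain whitespace
split are my own choices for the example, not part of lda-c. With real
files you would read them with open(path, encoding="utf-8").

```python
from collections import Counter


def to_ldac_lines(docs, vocab=None):
    """Turn tokenized documents into lda-c style lines.

    docs  : iterable of token lists (tokens are ordinary Unicode strings)
    vocab : optional dict mapping token -> integer id, extended in place
    Returns (lines, vocab). Every line contains only digits, colons and
    spaces, so the encoding of the original text no longer matters.
    """
    vocab = {} if vocab is None else vocab
    lines = []
    for tokens in docs:
        counts = Counter()
        for tok in tokens:
            # Assign the next free integer id to tokens not seen before.
            term_id = vocab.setdefault(tok, len(vocab))
            counts[term_id] += 1
        fields = [str(len(counts))]
        fields += [f"{i}:{c}" for i, c in sorted(counts.items())]
        lines.append(" ".join(fields))
    return lines, vocab


# Hypothetical example documents; a whitespace split stands in for a
# real tokenizer.
docs = ["über straße über", "naïve café straße"]
lines, vocab = to_ldac_lines(d.split() for d in docs)
for line in lines:
    print(line)
```

The vocab dictionary is what you would write out (in UTF-8) as the
vocabulary file, so that topic word ids can be mapped back to words.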

In Mallet, you can import UTF-8 text by specifying a regular
expression that defines a token. This one matches runs of Unicode
letters and combining marks, and works for most languages that break
on whitespace:
 bin/mallet import-file ... --token-regex '[\p{L}\p{M}]+' ...
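
If you are building input for lda-c yourself, the same token definition
can be approximated outside Mallet. The sketch below is in Python and
uses the standard unicodedata module instead of a regular expression,
since as far as I know the built-in re module does not accept \p{L} or
\p{M}. The function name tokenize is just for the example.

```python
import unicodedata


def tokenize(text):
    """Split text into maximal runs of Unicode letters (L*) and marks (M*).

    This mimics the effect of the token regex [\\p{L}\\p{M}]+ : digits,
    punctuation and whitespace all act as separators.
    """
    tokens, current = [], []
    for ch in text:
        # The first letter of the general category is the major class.
        if unicodedata.category(ch)[0] in ("L", "M"):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


print(tokenize("Die Straße, 2011: naïve café!"))
```

Including marks matters for text that uses combining characters: "e"
followed by U+0301 stays inside a single token instead of splitting the
word.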
-David
On Sat, Nov 5, 2011 at 4:56 AM, Stephan Neuhaus <sten at artdecode.de> wrote:
> Dear list,
>
> I will have to topic-model a number of documents encoded in UTF-8 soon. Does anyone know of a package that does this, or can anyone suggest patches to lda-c?
>
> Thanks,
>
> Stephan

