When I’ve been reading a very bright book on Mahout, Mahout In Action (which is a great hands-in intro to machine learning, as well), one of the examples has caught my attention.
Authors of the book where using well-known K-means clusterization algorithm for finding similar players on stackoverflow.com, where the criterion of similarity was the set of the authors of questions/answers the users were up-/downvoting.
In a very simple words, K-means algorithm iteratively finds clusters of points/vectors, located close to each other, in a multidimensional space. Being applied to the problem of finding similars players in StackOverflow, we assume that every axis in the multi-dimensional space is a user, where the distance from zero is a sum of points, awarded to the questions/answers given by other players (those dimensions are also often called “features”, where the the distance is a “feature weight”).
Obviously, the same approach can be applied to one of the most sophisticated problems in a massively-multiplayer online poker – collusion detection. We’re making a simple assumption that if two or more players have played too much games with each other (taking into account that any of the players could simply have been an active player, who played a lot of games with anyone), they might be in a collusion.
We break a massive set of players into a small, tight clusters (preferably, with 2-8 players in each), using K-means clustering algorithm. In a basic implementation that we will go through further, every user is represented as a vector, where axises are other players that she has played with (and the weight of the feature is the number of games, played together).
Something quite interesting has happened with Lucene. It started as a library, then its developers began adding new projects based on it. They developed another open source project that would add crawling features (among others features) to Lucene. Nutch is in fact a full featured web serach engine that anyone can use or modify. Inspired in some famous papers from Google about Map Reduce and the Google Filesystem, new features to distribute the index where added to Nutch and, eventually, those features became a project by them own: Hadoop. Since then, many projects have being developed over Hadoop. We are in front of a big explosion of open source code that was ignited by the Lucene spark.
All this projects are in some way related to content processing. For all of you interested in search and information retrieval, now we are going to talk about another project for a domain that’s outside the strict confines of search, but can teach you some interest things about content processing.
Recently I have been reading about this new library, Mahout, that provides all those obscure and mysterious machine learning algorithms, together in a library. A lot of modern web sites are using machine learning techniques. These algorithms are rather old and well known, but were popularized recently by its extensive use in social network sites (facebook knowing better than you who could be your best friend) or Google (reading your mind and guessing what you may have wanted to write in the search box) .
In simple words, we can say that a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
For example if you wanted to make a program to recognize captchas you would have:
Role: Sr. Software Engineer
Location: Bellevue, WA
We need only W2 candidates. We can also do H1 transfer
Note: Send me your updated resume at email@example.com
We are seeking engineers excited about building and scaling infrastructure to evolve our analytics platform. As an engineer with our advertising team you will work with technologies to process and organize our data in batch and real-time contexts. The ideal candidate is an autonomous engineer with extensive knowledge of the software development process and an understanding of distributed systems. Familiarity or past experience with Hadoop and related technologies is required.
• Build solutions to measure, forecast, and analyze ad revenue & performance to help product and business teams make informed decisions.
• Empower our data scientists and analysts to apply statistical methodologies and ask complex questions across large data sets.
• Develop, test, deploy, and support applications using Scala, Java, and Ruby.
• Play a key role in influencing and planning the direction of our data infrastructure.
• Create MapReduce jobs to transform and aggregate data in a complex data warehouse environment. • Write custom code and leverage existing tools to collect data from multiple disparate systems.
• Implement processes to ensure data availability, accuracy, and integrity.
• Strong Java development skills.
• Hands-on development mentality with a willingness to troubleshoot and solve complex problems.
• Experience and knowledge of technologies within the Hadoop ecosystem, such as Cascading, Hive, Pig, HBase.
• Experience with AWS a plus.
• Understanding of Linux system internals, administration, and scripting.
• Interest in and willingness to work with machine learning, Mahout, R.
Dinesh Ram Kali.
Human Resource Associate| National Staffing|
Direct: +1 (402) 905 9212
222 South 15th Street, Suite 505N, Omaha, NE 68102