Show Posts

This section allows you to view all posts made by this member. Note that you can only see posts made in areas you currently have access to.


Messages - rashidacse

Pages: 1 2 [3] 4 5 ... 7
31
There was a treasure hunter who was told that there was a treasure trove of diamonds hidden beneath a massive desert of shifting sands. At his disposal was a myriad of tools including satellites, sonars, teams of scientists, and a large sand-moving excavator.

He wondered: Should I learn how to drive the excavator to find treasure and move Big Sand?


Your initial question is actually two questions in one.

You need to find the treasure and move the sand.

The excavator will help both, but it is mostly about moving Big Sand.

1) You don't need Hadoop to get into data mining.
Period. Hadoop is not a treasure-hunting / data-mining tool. It will not help you 'sift the sand' in anything more than rudimentary ways. It is not meant to do analytics. It is closer to the domain of data warehousing than analytics.

2) Hadoop might help if you want to get into Big Data.
Hadoop is a data storage, processing and management ecosystem. If you want to get into the infrastructure side of data, you could learn Hadoop. But remember that Hadoop is just one brand of excavator.  (Albeit an open-source one, at least for now). It is by no means the only brand out there.  As the other answer alluded to, you should focus on the principles and thinking in the space, not the tools.

In my travels I have met some excellent data miners, including a 1st-place Kaggle winner, a few analytics department heads from listed companies, and guys quietly working in the background on things that none of us will ever hear about, because they are a source of competitive advantage that companies will never publicise. Depending on the industry and applications in question, 'big' data can be just one important piece of the puzzle, but the focus certainly isn't on Hadoop per se.

32
Honestly, you shouldn't be looking to learn Data Mining 'with Hadoop', specifically. Data Mining is a broad subject, while Hadoop is a software framework that lets you use distributed computing to implement (most commonly data mining) algorithms over massive data sets; something a single computer or server would fail to do. So assuming that you want to first learn the subject of Data Mining and then go about using Hadoop, here are a few guidelines:

First, you'll need to know the underlying theories and concepts of Data Mining - I would suggest this book - Introduction to Data Mining
It is used by many Universities for their introductory Data Mining courses and is very well structured for beginners to follow, too.
You can supplement this with one of the many open courses online. The following two should be good places to start (I haven't done them myself, but want to)
1)  https://www.edx.org/course/calte...

2) https://www.coursera.org/course/...

Second, assuming you've gone through these resources, a very handy place to learn not just about Hadoop but about all the other tools, frameworks and software for a data scientist is this brilliant course by Jeff Leek and co. at Johns Hopkins - https://class.coursera.org/datas...


33
Teaching & Research Forum / How do I learn data mining?
« on: November 28, 2015, 12:13:16 PM »
Learn data mining for free with Harvard's Data Science Course, CS109. This course was developed by Joe Blitzstein and Hanspeter Pfister in Fall of 2013 and will continue in Fall of 2014.

While this is a "data science" course, I still consider this "data mining" because of the valuable practice in extracting and manipulating data, in addition to creating some common data mining algorithms like recommendation engines or sentiment analysis.

    Python intro, Numpy intro, Matplotlib intro (Homework 0) (solutions)
    Web scraping, data aggregation, plotting, forecasting, model evaluation (Homework 1) (solutions)
    Data manipulation, predictions, evaluations (Homework 2) (solutions)
    Sentiment analysis, predictive modeling, model calibration (Homework 3) (solutions)
    Recommendation engine, mapreduce (Homework 4) (solutions)
    Network analysis and visualization (Homework 5) (solutions)



34
If you are using some sort of decision tree, a quick and dirty way is to throw the features into the model. You can identify the importance of features by their distance to the root: features chosen nearer the root split the data earlier and are generally more important.

For one-dimensional features:
You can also use the area under the receiver operating characteristic (ROC) curve (AUC) to judge the importance of a feature. In short, the curve plots, for each threshold, the false positive rate (FPR) against the true positive rate (TPR). You can imagine that a perfect feature would reach TPR = 1.0 at FPR = 0.0, so its AUC is 1. A useless feature that has a constant value for all data points has a ROC curve that is a straight line between (0,0) and (1,1), so its AUC is 0.5. So the higher the AUC value is, the more relevant the feature is. In statistics it is sometimes called the concordance index.

Another measurement of predictive power is correlation: the correlation between the feature value and the label. If you have already settled on a model, the correlation can instead be taken between the latent continuous variable (for example, in a logistic regression model, this is the logistic function applied to the feature value) and the label. The square of this correlation is actually proportional to R-squared [1].
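To make the AUC idea above concrete, here is a minimal sketch (my own illustration, not from the original answer; the name `feature_auc` is just for this example). It scores a single feature against binary labels via the Mann-Whitney rank identity, which gives the same number as integrating the ROC curve:

```python
import numpy as np
from scipy.stats import rankdata

def feature_auc(feature, labels):
    """AUC of a single feature against binary labels (1 = positive).

    Mann-Whitney identity: AUC is the probability that a randomly chosen
    positive example has a higher feature value than a randomly chosen
    negative one. rankdata gives average ranks to ties, so a constant
    feature scores exactly 0.5.
    """
    ranks = rankdata(feature)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

labels = np.array([0, 0, 0, 1, 1, 1])
perfect = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # separates classes exactly
useless = np.full(6, 1.0)                           # constant feature

print(feature_auc(perfect, labels))  # 1.0
print(feature_auc(useless, labels))  # 0.5
```

A feature that is anti-correlated with the label scores near 0, so in practice you would rank features by max(AUC, 1 - AUC).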

35
Quoting "Structured Machine Learning: Ten Problems for the Next Ten Years" by Pedro Domingos (Professor at University of Washington)

These seem to be the top 10 problems in machine learning research.

    Statistical Predicate Invention
    Predicate invention in ILP and hidden variable discovery in statistical learning are really two faces of the same problem. Researchers in both communities generally agree that this is a key (if not the key) problem for machine learning. Without predicate invention, learning will always be shallow.
    Generalizing Across Domains
    Machine learning has traditionally been defined as generalizing across tasks from the same domain, and in the last few decades we’ve learned to do this quite successfully. However, the glaring difference between machine learners and people is that people can generalize across domains with great ease.
    Learning Many Levels of Structure
    So far, in statistical relational learning (SRL) we have developed algorithms for learning from structured inputs and structured outputs, but not for learning structured internal representations. In both ILP and statistical learning, models typically have only two levels of structure. For example, in support vector machines the two levels are the kernel and the linear combination, and in ILP the two levels are the clauses and their conjunction. While two levels are in principle sufficient to represent any function of interest, they are an extremely inefficient way to represent most functions.
    Deep Combination of Learning and Inference
    Inference is crucial in structured learning, but research on the two has been largely separate to date. This has led to a paradoxical state of affairs where we spend a lot of data and CPU time learning powerful models, but then we have to do approximate inference over them, losing some (possibly much) of that power. Learners need biases and inference needs to be efficient, so efficient inference should be the bias. We should design our learners from scratch to learn the most powerful models they can, subject to the constraint that inference over them should always be efficient (ideally realtime).
    Learning to Map between Representations
    Three major problems in this area are entity resolution (matching objects), schema matching (matching predicates) and ontology alignment (matching concepts).
    Learning in the Large
    Structured learning is most likely to pay off in large domains, because in small ones it is often not too difficult to hand-engineer a “good enough” set of propositional features. So far, for the most part, we have worked on micro-problems (e.g., identifying promoter regions in DNA); our focus should shift increasingly to macro-problems (e.g., modeling the entire metabolic network in a cell).
    Structured Prediction with Intractable Inference
    Max-margin training of structured models like HMMs and PCFGs has become popular in recent years. One of its attractive features is that, when inference is tractable, learning is also tractable. This contrasts with maximum likelihood and Bayesian methods, which remain intractable. However, most interesting AI problems involve intractable inference. How do we optimize margins when inference is approximate? How does approximate inference interact with the optimizer? Can we adapt current optimization algorithms to make them robust with respect to inference errors, or do we need to develop new ones? We need to answer these questions if max-margin methods are to break out of the narrow range of structures they can currently handle effectively.
    Reinforcement Learning with Structured Time
    The Markov assumption is good for controlling the complexity of sequential decision problems, but it is also a straitjacket. In the real world systems have memory, some interactions are fast and some are slow, and long uneventful periods alternate with bursts of activity. We need to learn at multiple time scales simultaneously, and with a rich structure of events and durations. This is more complex, but it may also help make reinforcement learning more efficient. At coarse scales, rewards are almost instantaneous, and RL is easy. At finer scales, rewards are distant, but by propagating rewards across scales we may be able to greatly speed up learning.
    Expanding SRL to Statistical Relational AI
    We should reach out to other subfields of AI, because they have the same problems we do: they have logical and statistical approaches, each solves only a part of the problem, and what is really needed is a combination of the two. We want to apply learning to larger and larger pieces of a complete AI system. For example, natural language processing involves a large number of subtasks (parsing, coreference resolution, word sense disambiguation, semantic role labeling, etc.).
    Learning to Debug Programs
    One area that seems ripe for progress is automated debugging. Debugging is extremely time-consuming, and was one of the original applications of ILP. However, in the early days there was no data for learning to debug, and learners could not get very far. Today we have the Internet and huge repositories of open-source code. Even better, we can leverage mass collaboration. Every time a programmer fixes a bug, we potentially have a piece of training data. If programmers let us automatically record their edits, debugging traces, compiler messages, etc., and send them to a central repository, we will soon have a large corpus of bugs and bug fixes.

36
One potential answer to this question comes from the Analytics 1305 [2] documentation:

    Kernel Density Estimation and Non-parametric Bayes Classifier
    K-Means
    Kernel Principal Components Analysis
    Linear Regression
    Neighbors (Nearest, Farthest, Range, k, Classification)
    Non-Negative Matrix Factorization
    Support Vector Machines
    Dimensionality Reduction
    Fast Singular Value Decomposition
    Decision Tree
    Bootstrapped SVM
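As a taste of one entry in the list above, here is a from-scratch sketch of K-Means (Lloyd's algorithm) in plain NumPy. This is my own toy implementation, not from the Analytics 1305 material, and it deliberately skips empty-cluster handling:

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and
    centroid update until the centers stop moving.
    (No empty-cluster handling; fine for well-separated toy data.)"""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # distance of every point to every center, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# two well-separated blobs
pts = np.array([[0.0, 0.0], [0.1, 0.2], [-0.1, 0.1],
                [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
centers, labels = kmeans(pts, k=2)
```

On this data the algorithm converges in a couple of iterations, assigning the first three points to one cluster and the last three to the other.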

37
Andrew Ng's machine learning class on Coursera involves programming assignments that guide you through implementing a handful of algorithms (e.g. linear / logistic regression) in Matlab - https://www.coursera.org/course/ml.

Note: When you're using these algorithms in practice, you'll probably be using library functions. Considering that, I would value a deep understanding of how they work over learning how to implement your own.
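The course assignments are in Matlab, but the core of such an implementation is small. As a rough illustration (my own sketch in NumPy, not the Coursera assignment code), here is batch gradient descent for logistic regression on a toy 1-D problem:

```python
import numpy as np

def train_logistic(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent on the mean logistic (log) loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid of the linear score
        grad_w = X.T @ (p - y) / len(y)         # gradient of the loss w.r.t. w
        grad_b = (p - y).mean()                 # gradient of the loss w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# toy 1-D data, centered around 0 (centering helps gradient descent):
# class 1 whenever the feature is positive
X = np.array([[-2.5], [-1.5], [-0.5], [0.5], [1.5], [2.5]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = train_logistic(X, y)
preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
print(preds.tolist())  # [0, 0, 0, 1, 1, 1]
```

Library versions (e.g. sklearn's LogisticRegression) add regularization and better optimizers, which is exactly why understanding the mechanics matters more than rolling your own in production.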

38
What are the best sites for coding data mining algorithms in Python?

39
Try the Coursera courses or the Programming Collective Intelligence book.

40
I think there is no single best answer to your question. I have experience with R, Weka and Matlab, and their data mining functionality overlaps to a large extent. Here is my point of view:

Weka: Minimal programming skill is required if you use the GUI. Java is also a powerful language for writing your own algorithms, and details (like input and output) can be handled easily. If you only need to tweak or combine existing algorithms, and you are skilled in Java, then Weka is a good option.

Matlab: Easy to learn, very powerful and comprehensive; you can find nearly every high-level function you need and put them together to satisfy your needs. Speed is good as long as you use vectorized matrix computation. As stated, its syntax is very easy, but I found it sometimes too painful to design a large system with its inadequate language features. Matlab is also very expensive.

R: Free and powerful. More powerful than Matlab if you are doing statistical modeling, but inferior in its general toolbox (e.g. I can't find a comprehensive genetic algorithm toolbox in R).

The bad news is that all of these tools are high-level: if your data is very large, you can't afford to load it all into memory and do the computation there. You must be thoughtful and use some divide-and-conquer, and you often need to write a C extension to handle memory use carefully.

You might also consider Python: libraries like scikit-learn, NumPy, pandas, and SciPy make Python well suited to all kinds of scientific computing, on top of its rich and powerful language features.
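To illustrate the divide-and-conquer point above, here is a sketch (my own, in plain NumPy) of computing the mean and variance one chunk at a time, using the pairwise update of Chan et al., so the full dataset never needs to sit in memory at once:

```python
import numpy as np

def streaming_mean_var(chunks):
    """One-pass mean and (population) variance over chunked data.
    Combines per-chunk statistics with Chan et al.'s pairwise update,
    so only one chunk is held in memory at a time."""
    n, mean, m2 = 0, 0.0, 0.0
    for chunk in chunks:
        cn = len(chunk)
        cmean = chunk.mean()
        cm2 = ((chunk - cmean) ** 2).sum()   # per-chunk sum of squared deviations
        delta = cmean - mean
        total = n + cn
        m2 += cm2 + delta ** 2 * n * cn / total
        mean += delta * cn / total
        n = total
    return mean, m2 / n

rng = np.random.default_rng(42)
data = rng.normal(size=10_000)
mean, var = streaming_mean_var(np.array_split(data, 100))
# agrees with data.mean() and data.var() to floating-point accuracy
```

The same pattern (compute per-chunk summaries, then merge) is what frameworks like Hadoop apply at cluster scale.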

41
Teaching & Research Forum / What is Large Scale Machine Learning?
« on: November 28, 2015, 12:03:41 PM »
The majority of computational problems, including those with explicit
mathematical characterizations such as the NP-complete problems, fall
within the category of those for which the question of how best to
compute them is little understood. These core problems of complexity
theory remain the most fundamental mathematical challenges of computer
science. They are related to foundational questions in quantum physics,
most notably through polynomial time versions of the Turing hypothesis.

In computer systems, major conceptual challenges remain in controlling,
load-balancing and programming large parallel systems, whether they be
multicore devices on a chip or highly distributed systems: how to
design algorithms for multi-core devices that are portable, and run
efficiently on wide classes of architectural designs with widely
varying performance characteristics.

A fundamental question for artificial intelligence is to characterize
the computational building blocks that are necessary for cognition. A
specific challenge is to build on the success of machine learning so as
to cover broader issues in intelligence. This requires, in particular,
a reconciliation between two contradictory characteristics: the apparent
logical nature of reasoning and the statistical nature of learning.

In neuroscience, a wide-open problem remains: how can the weak model of
computation that the cortex apparently offers support cognitive
computation at all? In particular, how can broad sets of primitive
operations be computed robustly and on a significant scale even on such
a weak model?

A challenge in a different direction is understanding how Darwinian
evolution can evolve complex mechanisms in the numbers of generations
and with population sizes that have apparently been available on earth.
Darwin's notions of variation and selection are computational notions
and offer a very direct challenge for quantitative analysis by the
methods of computer science. What is sought is a quantitative
understanding of the possibilities and limitations of the basic
Darwinian process.

42
1) Distributed Algorithms and Data Structures.

This is low-hanging fruit, since we can get cheap access to clusters of commodity machines these days and we have decades of prior research into parallel and distributed computing at our disposal. On the other hand, we are getting more and more data that no longer fits into the memory hierarchy of a single computer, and even if the communication bottlenecks were miraculously resolved, we would always be limited by the physical constraints of a single computing node [1].

For starters see the talk by Neil Conway: Cloud Programming: From Doom and Gloom to BOOM and Bloom: http://neilconway.org/talks/boom...

2) Distributed Databases

See What are the best recommended research topics on databases according to edge technologies and recent research trends?

3) Distributed File Systems

See What are the best resources for learning about distributed file systems?

4) Distributed Numerical Analysis

See What are the best resources for distributed numerical analysis/matrix algorithms?

5) Distributed Convex Optimization

See What are some good resources for learning about distributed optimization?

6) Distributed Stream Processing

See What are some good resources for learning about stream mining? Why?

7) Distributed Machine Learning
e.g. ensemble techniques such as Random Forests, or consensus algorithms with applications in Language Learning

also scaling: What are some introductory resources for learning about large scale machine learning? Why?

8) Distributed inference  and structure learning in Graphical models
e.g. Parallel algorithms for Bayesian networks: Page on iastate.edu

9) Distributed Logic and Coordination Specific Languages
e.g. Coordination languages by Gelernter and Carriero:
http://dl.acm.org/citation.cfm?i... and http://www.lindaspaces.com/book/

10) Distributed Combinatorial Optimization
e.g. Search, Matching and Network Flow problems on big Graphs
http://en.wikipedia.org/wiki/Mat...,
Try online bipartite matching on streaming inputs; it's a fun problem with many applications in bioinformatics (e.g. genome assembly reduces to maximum matching in a bipartite graph; see Cufflinks).
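As a taste of the online matching problem, here is a toy greedy sketch (my own, not from any of the references above). Greedy is 1/2-competitive against the offline optimum; the RANKING algorithm of Karp, Vazirani and Vazirani improves this to 1 - 1/e:

```python
def greedy_online_matching(arrivals):
    """Online bipartite matching: left vertices arrive one by one with
    their right-side neighbour lists, and each is greedily matched to
    its first still-free neighbour (decisions are irrevocable)."""
    taken = set()      # right vertices already matched
    matching = {}
    for left, neighbours in arrivals:
        for right in neighbours:
            if right not in taken:
                taken.add(right)
                matching[left] = right
                break
    return matching

# "a" grabs x, leaving "b" (whose only neighbour is x) unmatched;
# the offline optimum {a: y, b: x, c: z} would match all three.
arrivals = [("a", ["x", "y"]), ("b", ["x"]), ("c", ["y", "z"])]
m = greedy_online_matching(arrivals)
print(m)  # {'a': 'x', 'c': 'y'}
```

The example shows exactly why the online setting is hard: an early, locally fine choice can block a later vertex that had no alternatives.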

11) Distributed artificial intelligence:  Swarm learning and evolution,  e.g. mimicking Bacterial colonies and Microbial intelligence.  Population Genetics (genetic variation, drift, mutation, selection) applied to multiagent systems. Reinforcement learning in a distributed setting.

12) Distributed quantum algorithms, e.g. for Clock Synchronization

13) Network Coding is fun, with some very elegant recent results, such as Fountain code. See: Network Information Theory and Information Theory, Inference and Learning Algorithms (the last chapter on online codes)

14) Decentralized digital currencies (e.g. bitcoin)

15) Compilers for parallel and distributed systems (e.g.  Page on cmu.edu,  X10, Chapel, C-to-gates (C to HDL) and compilers/runtime for heterogeneous systems (e.g. FPGA + CPU + GPU) ), Programming languages for distributed computing systems

16) Distributed operating systems
Why the data center needs an operating system
Grid computing with Plan 9

17) Distributed robotics and sensor networks

43
Teaching & Research Forum / Bioinformatics
« on: November 28, 2015, 11:56:54 AM »
Bio-informatics and other uses of CS in biology, biomedical engineering, and medicine, including systems biology (modeling interactions of multiple systems in a living organism, including immune systems and cancer development), computational biophysics (modeling and understanding mechanical, electrical, and molecular-level interactions inside an organism), and computational neurobiology (understanding how organisms process incoming information and react to it, control their bodies, store information, and think). There is a very large gap between what is known about brain structure and the functional capabilities of a living brain; closing this gap is one of the grand challenges of modern science and engineering.

DNA analysis and genetics have also become computer-based in the last 20 years. Biomedical engineering is another major area of growth, where microprocessor-based systems can monitor vital signs and even administer life-saving medications without waiting for a doctor. Computer-aided design of prosthetics is also very promising.

44
Teaching & Research Forum / Abundant-data applications
« on: November 28, 2015, 11:56:31 AM »
Abundant-data applications, algorithms, and architectures are a meta-topic that includes research avenues such as data mining (quickly finding relatively simple patterns in massive amounts of loosely structured data, evaluating and labeling data, etc.), machine learning (building mathematical models that represent structure and statistical trends in data, with good predictive properties), and hardware architectures that can process more data than is possible today.
