How do I become a data scientist?

Changqi Cai on Quora
Pathan Karimkhan

Becoming a data scientist typically requires a solid foundation in computer science and applications, modeling, statistics, analytics, and math.

What sets the data scientist apart is strong business acumen, coupled with the ability to communicate findings to both business and IT leaders in a way that can influence how an organization approaches a business challenge. Good data scientists will not just address business problems; they will pick the right problems, the ones that have the most value to the organization.

I also believe that in-depth knowledge of data science, machine learning, and NLP helps in solving problems at every level, from the ground up. Four to five years of development experience can build that kind of acumen.

Tools and technologies for big data:

Apache Spark - Apache Spark is an open-source data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications.

Database pipelining -
As you will notice, it is not just about processing the data; a lot of other components are involved. Collection, storage, exploration, ML, and visualization are all critical to the project's success.

Solr - Solr can be used to build a highly scalable data analytics engine that lets customers engage in fast, real-time knowledge discovery.
Solr (pronounced "solar") is an open source enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable. Solr has been described as the most popular enterprise search engine. Solr 4 adds NoSQL features.

S3 - Amazon S3 is an online file storage web service offered by Amazon Web Services. Amazon S3 provides storage through web services interfaces. (Source: Wikipedia)

Hadoop - Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop is an Apache top-level project being built and used by a global community of contributors and users. It is licensed under the Apache License 2.0. (Source: Apache Hadoop)

MapReduce : Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks.
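To make the map, sort, and reduce steps concrete, here is a minimal single-machine sketch of a word count in plain Python. It does not use Hadoop or its API; the function names (map_phase, shuffle_and_sort, reduce_phase, word_count) are made up for illustration, and everything runs sequentially in one process where a real cluster would run the map tasks in parallel.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(chunk):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_and_sort(mapped_pairs):
    """Framework step: sort the map outputs by key and group values per key."""
    ordered = sorted(mapped_pairs, key=itemgetter(0))
    return {key: [count for _, count in group]
            for key, group in groupby(ordered, key=itemgetter(0))}


def reduce_phase(key, values):
    """Reduce task: combine all the values that share one key."""
    return key, sum(values)


def word_count(chunks):
    # Chunks are independent, so a cluster could map them in parallel;
    # this sketch simply loops over them.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle_and_sort(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data science uses data"]))
```

The only parts a job author would normally write are the map and reduce functions; the splitting, sorting, and scheduling belong to the framework.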

Corona :

Corona is a scheduling framework, developed at Facebook, that separates cluster resource management from job coordination. Corona introduces a cluster manager whose only purpose is to track the nodes in the cluster and the amount of free resources. A dedicated job tracker is created for each job, and can run either in the same process as the client (for small jobs) or as a separate process in the cluster (for large jobs).

One major difference from Facebook's previous Hadoop MapReduce implementation is that Corona uses push-based, rather than pull-based, scheduling. After the cluster manager receives resource requests from the job tracker, it pushes the resource grants back to the job tracker. Also, once the job tracker gets resource grants, it creates tasks and then pushes these tasks to the task trackers for running. There is no periodic heartbeat involved in this scheduling, so the scheduling latency is minimized. Ref : Under the Hood: Scheduling MapReduce jobs more efficiently with Corona

HBase : HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection).

Zookeeper - Apache ZooKeeper is a software project of the Apache Software Foundation, providing an open source distributed configuration service, synchronization service, and naming registry for large distributed systems. ZooKeeper was a sub-project of Hadoop but is now a top-level project in its own right.

Hive - Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix. Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.

Mahout - Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily in the areas of collaborative filtering, clustering, and classification. Many of the implementations use the Apache Hadoop platform. Mahout also provides Java libraries for common maths operations (focused on linear algebra and statistics) and primitive Java collections. Mahout is a work in progress; the number of implemented algorithms has grown quickly, but various algorithms are still missing.

Lucene is a collection of search-related and NLP tools, but its core feature is being a search index and retrieval system. It takes data from a store like HBase and indexes it for fast retrieval from a search query. Solr uses Lucene under the hood to provide a convenient REST API for indexing and searching data. ElasticSearch is similar to Solr.

Sqoop is a command-line interface for copying SQL data into a distributed warehouse. It's what you might use to snapshot and copy your database tables to a Hive warehouse every night.

Hue is a web-based GUI to a subset of the above tools. Hue aggregates the most common Apache Hadoop components into a single interface and targets the user experience. Its main goal is to have the users "just use" Hadoop without worrying about the underlying complexity or using a command line.

Pregel and its open source twin Giraph are a way to run graph algorithms on billions of nodes and trillions of edges over a cluster of machines. Notably, the MapReduce model is not well suited to graph processing, so Hadoop/MapReduce are avoided in this model, but HDFS/GFS is still used as a data store.
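The data structure behind a search index of this kind is an inverted index: a map from each term to the documents that contain it. Below is a toy sketch in plain Python, not Lucene's or Solr's actual API; the function names and the whitespace tokenizer are assumptions made for illustration, and real analyzers also handle stemming, stop words, ranking, and much more.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive analyzer)."""
    return text.lower().split()


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    docs = {
        1: "Spark runs on Hadoop",
        2: "Solr uses Lucene",
        3: "Lucene indexes data from Hadoop",
    }
    index = build_index(docs)
    print(search(index, "lucene hadoop"))
```

Lookups touch only the sets for the query terms, which is why retrieval stays fast even when the document collection is large.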
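For a feel of how this model differs from MapReduce, here is a single-process sketch of Pregel-style vertex-centric computation: work proceeds in supersteps, vertices exchange messages along edges, and the run stops when no messages are in flight. It propagates the largest vertex value through the graph. This is plain Python, not the Giraph API; the function name pregel_max and the dict-based graph are assumptions for illustration, and every neighbour is assumed to appear as a key in the graph.

```python
def pregel_max(graph, values, max_supersteps=50):
    """Propagate the largest vertex value to every reachable vertex.

    graph:  dict mapping vertex -> list of out-neighbours
    values: dict mapping vertex -> initial numeric value
    """
    values = dict(values)

    # Superstep 0: every vertex sends its own value to its neighbours.
    inbox = {vertex: [] for vertex in graph}
    for vertex, neighbours in graph.items():
        for neighbour in neighbours:
            inbox[neighbour].append(values[vertex])

    for _ in range(max_supersteps):
        if not any(inbox.values()):
            break  # no messages in flight, so every vertex has halted
        outbox = {vertex: [] for vertex in graph}
        for vertex, messages in inbox.items():
            if not messages:
                continue  # a halted vertex stays idle until a message arrives
            best = max(messages)
            if best > values[vertex]:
                values[vertex] = best
                # Only a vertex whose value changed needs to notify neighbours.
                for neighbour in graph[vertex]:
                    outbox[neighbour].append(best)
        inbox = outbox
    return values


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    print(pregel_max(graph, {"a": 3, "b": 6, "c": 1}))
```

In a real deployment the vertices are partitioned across machines and the message exchange between supersteps is handled by the framework.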

NLTK - The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language. NLTK includes graphical demonstrations and sample data. It is accompanied by a book that explains the underlying concepts behind the language processing tasks supported by the toolkit, plus a cookbook.

NLTK is intended to support research and teaching in NLP or closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning.

For Python -
scikit-learn

Freebase - Freebase is a large collaborative knowledge base consisting of metadata composed mainly by its community members. It is an online collection of structured data harvested from many sources, including individual 'wiki' contributions.

DBpedia : DBpedia (from "DB" for "database") is a project aiming to extract structured content from the information created as part of the Wikipedia project. This structured information is then made available on the World Wide Web. DBpedia allows users to query relationships and properties associated with Wikipedia resources, including links to other related datasets. DBpedia has been described by Tim Berners-Lee as one of the more famous parts of the decentralized Linked Data effort.

Visualization tool -
ggplot in R

Mathematics :

Calculus, statistics, probability, linear algebra, and coordinate geometry

NER - Named Entity Recognition (NER) labels sequences of words in a text that are the names of things, such as person and company names, or gene and protein names.

Faceted search : Faceted search, also called faceted navigation or faceted browsing, is a technique for accessing information organized according to a faceted classification system, allowing users to explore a collection of information by applying multiple filters. A faceted classification system classifies each information element along multiple explicit dimensions, called facets, enabling the classifications to be accessed and ordered in multiple ways rather than in a single, pre-determined, taxonomic order.
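As a rough illustration of the idea, the sketch below filters a small in-memory catalog by selected facet values and then counts the remaining items per facet value, which is what a search UI shows next to each filter. It is plain Python, not Solr's faceting API; the catalog, the facet names, and the helper functions are all hypothetical.

```python
from collections import Counter


def apply_filters(items, filters):
    """Keep the items whose facet values match every selected filter."""
    return [item for item in items
            if all(item.get(facet) == value for facet, value in filters.items())]


def facet_counts(items, facets):
    """Count how many items fall under each value of each facet."""
    return {facet: Counter(item[facet] for item in items if facet in item)
            for facet in facets}


if __name__ == "__main__":
    # Hypothetical catalog: each item is classified along two facets.
    catalog = [
        {"title": "Hadoop guide", "format": "PDF", "topic": "big data"},
        {"title": "Solr notes", "format": "Word", "topic": "search"},
        {"title": "Lucene intro", "format": "PDF", "topic": "search"},
    ]
    pdfs = apply_filters(catalog, {"format": "PDF"})
    print([item["title"] for item in pdfs])
    print(facet_counts(pdfs, ["topic"]))
```

Because each facet is an independent dimension, filters can be applied in any order and combined freely, unlike a single fixed taxonomy.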

Source : Wikipedia, the free encyclopedia

There are a bunch of projects you can work on:

  1. Sentiment analysis for Twitter and web articles - Identify the overall sentiment of web articles, product reviews, movie reviews, and tweets. A lexicon-based approach or machine learning techniques can be used.
  2. Web article classification/summarization - Use clustering/classification techniques to classify web articles, and perform semantic analysis to summarize them.
  3. Recommendation system based on users' social media profiles - Use social media APIs to collect user interests from Facebook, Twitter, etc., and implement a recommendation system based on those interests.
  4. Tweet classification and trend detection - Classify tweets into sports, business, politics, entertainment, etc., and detect trending tweets in those domains.
  5. Movie Review Prediction - Use online movie reviews to predict reviews of new movies.
  6. Summarize Restaurant Reviews - Take a list of reviews about a restaurant, and generate a single English summary for that restaurant.
  7. AutoBot - Build a system that can have a conversation with you. The user types messages, and your system replies based on the user's text. There are many approaches here; for example, you could use a large Twitter corpus and match on language similarity.
  8. Twitter-based news system - Collect tweets for various categories on an hourly or daily basis, identify trending discussions, perform semantic analysis, and create a kind of news system (check the Frrole product).
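As a starting point for the first project, here is a minimal sketch of the lexicon-based approach: count positive and negative words, flipping polarity right after a negation. The word lists are a tiny hypothetical lexicon invented for this example; a real project would load a published sentiment lexicon or train a classifier instead.

```python
import re

# Hypothetical toy lexicon, for illustration only.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "sad"}
NEGATIONS = {"not", "no", "never"}


def sentiment_score(text):
    """Sum +1/-1 per lexicon word, flipping the sign right after a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        if polarity and i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    return score


def classify(text):
    """Turn the numeric score into a coarse label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for review in ["I love this great movie", "The plot was not good"]:
        print(review, "->", classify(review))
```

A sketch like this makes a useful baseline to compare against once you move on to machine learning techniques.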

A few data sets for big data applications that you may use:

  1. Home Page for 20 Newsgroups Data Set - The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
  2. Download Trec (= Text Retrieval Conference) Data Set - Text datasets used in information retrieval and learning in text domains.
  3. World Factbook Download 2013 - The World Factbook provides information on the history, people, government, economy, geography, communications, transportation, military, and transnational issues for 267 world entities.
  4. DBpedia Dataset releases - The DBpedia data set uses a large multi-domain ontology which has been derived from Wikipedia. The English version of the DBpedia 2014 data set describes 4.58 million "things" with 583 million "facts". In addition, DBpedia provides localized versions in 125 languages. All these versions together describe 38.3 million things, out of which 23.8 million overlap (are interlinked) with concepts from the English DBpedia.
  5. http://konect.uni-koblenz.de/net... - KONECT (the Koblenz Network Collection) is a project to collect large network datasets of all types in order to perform research in network science and related fields.
  6. Max-Planck-Institut für Informatik: YAGO - YAGO (Yet Another Great Ontology) is a knowledge base developed at the Max Planck Institute for Computer Science in Saarbrücken. It is automatically extracted from Wikipedia and other sources.
  7. Reuters-21578 Text Categorization Collection Data Set - Machine learning repository.
  8. CSTR Page on ed.ac.uk - CSTR is concerned with research in all areas of speech technology, including speech recognition, speech synthesis, speech signal processing, information access, multimodal interfaces, and dialogue systems. It has many collaborations with the wider community of researchers in speech science, language, cognition, and machine learning for which Edinburgh is renowned.
  9. ConceptNet - ConceptNet is a freely available commonsense knowledge base and natural-language-processing toolkit which supports many practical textual-reasoning tasks over real-world documents right out of the box (without additional statistical training).

Other well-known data sets are MNIST, CIFAR, and ImageNet.
