Locating Mountains and More with Mahout and Public Weather Dataset

Recently I was playing with Mahout and public weather dataset. In this post I will describe how I used Mahout library and weather statistics to fill missing gaps in weather measurements and how I managed to locate steep mountains in US with a little Machine Learning (n.b. we are looking for people with Machine Learning or Data Mining backgrounds – see our jobs).

The idea was to just play and learn something, so the effort I did and the decisions chosen along with the approaches should not be considered as a research or serious thoughts by any means. In fact, things done during this effort may appear too simple and straightforward to some. Read on if you want to learn about the fun stuff you can do with Mahout!

Tools & Data

The data and tools used during this effort are: Apache Mahout project and public weather statistics dataset. Mahout is a machine learning library which provided a handful of machine learning tools. During this effort I used just small piece of this big pie. The public weather dataset is a collection of daily weather measurements (temperature, wind speed, humidity, pressure, &c.) from 9000+ weather stations around the world.

Artificial Problems

The problems I decided to attack (I made up them by myself, just to set some goals, no serious research was intended) were

  • using Mahout’s recommendation tools to fill the missing data points in weather statistics data
  • using Mahout’s “similarity tools” to locate nature’s relief influencers on the weather conditions, like mountains, based on weather statistics data

First would allow to fill in gaps in measurements on some of the weather stations. In the second problem the system would locate physical nature’s barriers and other influencers on the weather. By comparing the findings with the physical map one could judge e.g. about which mountains can isolate weather spreading to certain areas. Please forgive me if I use the wrong terms, but I really find the outcome quite interesting. Read on!

Using Recommendation Engine to Fill Missing Weather Measurements

The core (simple) idea is to use recommendation engine to fill in missing measurements. I used User-based recommendation specifically by treating station as a user, date as an item and the temperature as a preference. I know, many can say this isn’t a good way to approach the problem, but for my purpose of learning this worked well.

The Data

The dataset contains 18 surface meteorological elements, from which I selected just few items to be used. So, I chose to work only with temperature to simplify things, though I understood that it could not be enough to reach any interesting result. Moreover, one would argue that it makes much more sense to use precipitation to locate such objects as mountains which affect them a lot. I chose a simpler path, though with quite a big risk of getting nothing. In order to have ability to iterate fast I also used just 2 years of data of the stations located in US. I actually tried to use only California’s station, but this didn’t work well (at least in the beginning before I get to tuning the system logic). For this first problem I didn’t use any of the stations location and altitude information to get things more interesting.

To evaluate the system I simply divided the dataset into two pieces and used one of each as training sample and another one as evaluation sample. Please see the code below.

Data Cleaning

I had to clean some the data before I could use it for training the recommender. It seems like there were some stations which has different IDs but has same location in the dataset. I cleaned them to have only one station in the same place to avoid increased weighting of the same station. I also cleaned Alaska’s stations and most of those which are not on the continental area as they are usually alone standing and do not have related weather conditions to others and only bring the noise for the recommender.

Preparing Input Data for Mahout

So, in our interpretation stations are users and days (dates) with temperature are items with user preference, but there’s more to that. It makes sense to try to use connection between dates which is just dropped by this simple interpretation: dates may stand close to each other or be far from each other; by comparing the change of the temperature during some period helps to judge about “weather closeness” of two stations. To make use of that for calculating similarity I also added <date, diff with the N days ago> pairs. E.g. for input:

20060101,56
20060102,61
20060103,62

I got these preferences:

20060101_0,56
20060102_0,61
20060102_1,+5/1
20060103_0,62
20060103_1,+1/1
20060103_2,+6/2

I divided the difference with the further standing “N day ago” by N so that they are weighted less. Otherwise difference between far from each other days going to be bigger than from than that from close days, while it is the close days difference which is more interesting actually. For the same purpose I tested with up to 5 extra pairs (diffs with up to 5 previous days).

Moreover, when it comes to comparing the change it may really not matter by how much temperature was changed, it is enough to know that it was changed at least by some value d (onlyDirectionOfChange=true, changeWeight=d in results below). So, e.g. given value d=2.0, comparing change with previous 2 days (prevDays=2 in the results below) the example data above is going to look like this:

20060101_0,56
20060102_0,61
20060102_1,+2/1
20060103_0,62
20060103_1,0
20060103_2,+2/2

Running Recommender Evaluating Results

Once data is prepared and you know which recommender parameters (including similarities and their parameters and such) training recommender and evaluating the results is very simple. Please find below the code for non-distributed recommendation calculation.

DataModel model = new FileDataModel(statsFile);
RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
RecommenderBuilder builder = new RecommenderBuilder() {
  @Override
  public Recommender buildRecommender(DataModel model) throws TasteException {
    UserSimilarity similarity = new EuclideanDistanceSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(neighbors, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    return recommender;
  }
};
return evaluator.evaluate(builder, null, model, 0.95, 1.0);

The code above shows just one of many possible recommender configurations (and it doesn’t show the fact that we use more data for calculating user similarity as explained in the previous section). Since we really care about the numbers we throw away similarities that don’t take into account the preference value, like TanimotoCoefficientSimilarity. During the (limited) tests I ran it appeared that the simplest EuclideanSimilarity worked best.

Result

My (limited) tests showed the following best results with configurations:

SCORE: 2.03936; neighbors: 2, similarity: {onlyDirectionOfChange=true, changeWeight=2.0, prevDays=1}
SCORE: 2.04010; neighbors: 5, similarity: {onlyDirectionOfChange=true, changeWeight=5.0, prevDays=2}
SCORE: 2.04159; neighbors: 2, similarity: {onlyDirectionOfChange=false, prevDays=1}

Where:

  • “score” is the recommendation score, i.e. in our case the average absolute difference in recommended value and actual value
  • “neighbors” is the number of neighbors used (in NearestNUserNeighborhood) to provide recommendation
  • “onlyDirectionOfChange” is whether we use absolute change value when comparing with previous days (value is false) or we comparing with the certain threshold as explained above (value is true)
  • “changeWeight” is the threshold to compare the change with
  • “prevDays” is the number of previous days to compare with

Using Statistics-based Similarity to Locate Weather Influencers

The core (simple) idea is to calculate similarity between stations based on weather statistics and compare it with the physical distance between the stations: if there’s a great difference then assume that there’s something in between such stations that makes the weather noticeably different. Then, the only things we need to define is:

  • how to calculate similarity
  • haw to calculate physical distance
  • what is a *noticeable* difference between weather stats based similarity and physical distance
  • where is this “in between” located

The plan was to use Mahout’s user similarities to calculate the distance between stations based on weather statistics data (taking it as user preferences similar to the first part) and compare it with the physical distance. The evaluation of the results is not very well automated as in previous part, though it could be have I more time for this. To evaluate I just plotted the results and compared this image with the physical map of US.

The Data

The data was used the same as was used in the first part plus physical location of the station, which was used to calculate physical distance between them.

Data Cleaning

Data cleaning had the same logic as for the first part. Though I did more severe cleaning of stations not on the continent and near the shore: we all know that ocean influences the weather a lot, so if I didn’t do that, all shore points would have been considered to be “outliers”. Actually, not only ocean shore-close stations, but most of those near the border of US were removed for the sake of removal algorithm simplicity.

Preparing Input Data for Mahout

The input data were prepared the same way as in first part, the “user preferences” contained <station, temperature> pairs and added <date, diff with the N days ago> pairs. Note, that there are a lot of ways to calculate distance between stations using weather stats, I simply chose the one which would allow me to re-use the same prepared data files from the first part of experiment.

Running the Implementation & Evaluating Results

Let’s have a closer look at how each item from the list in idea overview section was defined.

How to calculate similarity?

As mentioned above I chose simple (similar to the first part) similarity calculation. Simple EuclideanDistanceSimilarity worked well.

How to calculate physical distance?

Physical distance was calculated as Euclidean distance between stations using latitude, longitude and altitude. The longitude was given a much greater weight, because the temperature tends to be affected a lot by longitude coordinate: the further South you go (in Northern Hemisphere where US is) the lower the temperature without any physical relief influencers. Also altitude was given a stronger weight because it has the same strong affect on the temperature.
And, of course to make distance comparable with calculated similarity we need it to be in 0..1 range (with the value close to 1 showing the smallest distance). Hence distance was calculated as 1 / (1 + phys_dist(station1, station2)).

What is a *noticeable* difference between weather stats based similarity and physical distance?

I assumed that there’s an exponential dependency between calculated physical distance of the two stations and similarity calculated from weather stats. I found an average growth rate (of the similarity given the physical distance) using all stations pairs (remember that we have not huge amount of stations, so we can afford that) and used it to detect those pairs that had much greater difference between their physical distance and similarity calculated from weather stats.

Where is this “in between” located?

For those “outlier” pairs detected I put single “physical influencer” point on the map which location is exactly the middle point between those stations (unweighted latitude and longitude were used to calculate it).

Result

Result is better represented and evaluated by comparing the following two images: first created by our system and second being a physical map of US.

red – cleaned “far standing” stations
green – stations which stats were used
blue – found weather “influencers”
circles – some of the cities (Seattle, San-Francisco, Los-Angeles, San-Antonio, Salt Lake City, Miami, New York City, Boston, Chicago)

Compare with:

Note how steep mountains are there in the first image. Of course, there’s noise from lakes, gulfs and other sources. But look how well it drew the relief of California!

What Else?

There are a number of things that can be done to obtain even more interesting results. Just to name a few:
* use not only temperature but other measures, e.g. precipitation as noted in the beginning of the post
* show influencers with different color (heat map) depending on how they affect the weather
* and more

Summary

I touched only a very small piece of the great collection of machine learning tools offered by Mahout and yet managed to get very interesting (and useful?) results. I also used just a small part of the public weather dataset available to everyone and yet it was enough to get meaningful outcome. And there are so many different public datasets available online! There’s so much exciting things one can do with such a powerful tool like Mahout and a pile of data.

@abaranau

Data Engineer Position at Sematext International

If you’ve always wanted to work with Hadoop, HBase, Flume, and friends and build massively scalable, high-throughput distributed systems (like our Search Analytics and SPM), we have a Data Engineer position that is all about that!  If you are interested, please send your resume to jobs@sematext.com.

Responsibilities:

  • Versatile architect and developer – design and build large, high performance,scalable data processing systems using Hadoop, HBase, and other big data technologies
  • DevOps fan –  run and tune large data processing production clusters
  • Tool maker – develop ops and management tools 
  • Open source participant – keep up with development in areas of cloud and distributed computing, NoSQL, Big Data, Analytics, etc.

Pluses:

  • solid Math, Statistics, Machine Learning, or Data Mining is not required but is a big plus
  • experience with Analytics, OLAP, Data Warehouse or related technologies is a big plus
  • ability and desire to expand and lead a data engineering team
  • ability to think both business and engineering
  • ability to build products and services based on observed client needs
  • ability to present in public, at meetups, conferences, etc.
  • ability to contribute to blog.sematext.com
  • active participation in open-source communities
  • desire to share knowledge and teach
  • positive attitude, humor, agility, high integrity, and low ego, attention to detail

Location:

  • New York

We’re small and growing.  Our HQ is in Brooklyn, but our team is spread over 4 continents.  If you follow this blog you know we have deep expertise in search and big data analytics and that our team members are conference speakers, book authors, Apache members, open-source contributors, etc.

Relevant pointers:

Hiring: Data Mining, Analytics, Machine Learning Hackers

If you want to work with search, big data mining, analytics, and machine learning, and you are a positive, proactive, independent creature, please keep reading.We are looking for devops to hack on Sematext’s new products and services, as well as provide services to our growing list of clients.  Working knowledge of Mahout or statistics/machine learning/data mining background would be a major plus.
 

Skills & experience (the more of these you have under your belt the better):

  • Data mining and/or machine learning (Mahout or …)
  • Big data (HBase or Cassandra or Hive or …)
  • Search (Solr or Lucene or Elastic Search or …)

More about an ideal you:

  • You are well organized, disciplined, and efficient
  • You don’t wait to be told what to do and don’t need hand-holding
  • You are reliable, friendly, have a positive attitude, and don’t act like a prima donna
  • You have an eye for detail, don’t like sloppy code, poor spelelling and typous
  • You are able to communicate complex ideas in a clear fashion in English, clean and well designed code, or pretty diagrams

Optional bonus points:

  • You like to write or speak publicly about technologies relevant to what we do
  • You are an open-source software contributor

A few words about us:

We work with search and big data (Lucene, Solr, Nutch, Hadoop, MapReduce, HBase, etc.) on a daily basis and we present at conferences.  Our projects with external clients range from 1 week to several months.  Some clients are small startups, some are large international organizations.  Some are top secret.  New customers knock on our door regularly and this keeps us busy at pretty much all times.  When we are not busy with clients we work on our products.  We run search-lucene.com and search-hadoop.com.  We participate in open-source projects and publish monthly Digest posts that cover Lucene, Solr, Nutch, Mahout, Hadoop, Hive, and HBase.  We don’t write huge spec docs, we work in sprints, we multitask, and try our best to be agile. We send people to conferences, trainings (Hadoop, HBase, Cassandra), and certifications (2 of our team members are Cloudera Certified Hadoop Developers).

We are a small and mostly office-free, highly distributed team spanning 3 continents and 6 countries.  We communicates via email, Skype voice/IM, BaseCamp.  Some of our developers are in Eastern Europe, so we are especially open to new team members being in that area, but we are also interested in good people world-wide, from South America to Far East.

Interested? Please send your resume to jobs @ sematext.com feel free to check out our other positions.

Hiring Search and Data Analytics Engineers

We are growing and looking for smart people to join us either in an “elastic”, on-demand, per-project, or more permanent role:

Lucene/Solr expert who…

  • Has built non-trivial applications with Lucene or Solr or Elastic Search, knows how to tune them, and can design systems for large volume of data and queries
  • Is familiar with (some of the) internals of Lucene or Solr or Elastic Search, at least on the high level (yeah, a bit of an oxymoron)
  • Has a systems/ops bent or knows how to use performance-related UNIX and JVM tools for analyzing disk IO, CPU, GC, etc.

Data Analytics expert who…

  • Has used or built tools to process and analyze large volumes of data
  • Has experience using HDFS and MapReduce, and have ideally also worked with HBase, or Pig, or Hive, or Cassandra, or Voldemort, or Cascading or…
  • Has experience using Mahout or other similar tools
  • Has interest or background in Statistics, or Machine Learning, or Data Mining, or Text Analytics or…
  • Has interest in growing into a Lead role for the Data Analytics team

We like to dream that we can find a person who gets both Search and Data Analytics, and ideally wants or knows how to marry them.

Ideal candidates also have the ability to:

  • Write articles on interesting technical topics (that may or may not relate to Lucene/Solr) on Sematext Blog or elsewhere
  • Create and give technical talks/presentations (at conferences, local user groups, etc.)

Additional personal and professional traits we really like:

  • Proactive and analytical: takes initiative, doesn’t wait to be asked or told what to do and how to do it
  • Self-improving and motivated: acquires new knowledge and skills, reads books, follows relevant projects, keeps up with changes in the industry…
  • Self-managing and organized: knows how to parcel work into digestible tasks, organizes them into Sprints, updates and closes them, keeps team members in the loop…
  • Realistic: good estimator of time and effort (i.e. knows how to multiply by 2)
  • Active in OSS projects: participates in open source community (e.g. mailing list participation, patch contribution…) or at least keeps up with relevant projects via mailing list or some other means
  • Follows good development practices: from code style to code design to architecture
  • Productive, gets stuff done: minimal philosophizing and over-designing

Here are some of the Search things we do (i.e. that you will do if you join us):

  • Work with external clients on their Lucene/Solr projects.  This may involve anything from performance troubleshooting to development of custom components, to designing highly scalable, high performance, fault-tolerant architectures.  See our services page for common requests.
  • Provide Lucene/Solr technical support to our tech support customers
  • Work on search-related products and services

A few words about us:

We work with search and big data (Lucene, Solr, Nutch, Hadoop, MapReduce, HBase, etc.) on a daily basis.  Our projects with external clients range from 1 week to several months.  Some clients are small startups, some are large international organizations.  Some are top secret.  New customers knock on our door regularly and this keeps us busy at pretty much all times.  When we are not busy with clients we work on our products.  We run search-lucene.com and search-hadoop.com.  We participate in open-source projects and publish monthly Digest posts that cover Lucene, Solr, Nutch, Mahout, Hadoop, and HBase.  We don’t write huge spec docs, we work in sprints, we multitask, and try our best to be agile. We send people to conferences, trainings (Hadoop, HBase, Cassandra), and certifications (2 of our team members are Cloudera Certified Hadoop Developers).

We are a small and mostly office-free, highly distributed team that communicates via email, Skype voice/IM, BaseCamp.  Some of our developers are in Eastern Europe, so we are especially open to new team members being in that area, but we are also interested in good people world-wide, from South America to Far East.

Interested? Please send your resume to jobs @ sematext.com.

Follow

Get every new post delivered to your Inbox.

Join 1,695 other followers