Table of Contents
- 1. Research Setting
- 2. Primary Research Question
- 3. Investigation Plan
- 4. Data Access and Visualization Methods
- 5. Preliminary Analysis
- 6. Refinement of Analysis
- 7. Statement of Results
- 8. Discussion of Results
- 9. Statement of Conclusions
- 10. Questions for Further Investigation
1. Research Setting
The research setting for this tutorial is coastal North Carolina, specifically the area known as Onslow Bay. At the Ocean Optics XVIII conference which took place in Montreal, Canada (October 2006) a poster was presented, entitled "Use of Giovanni with Ocean Color Time-Series Project Data for Trend Detection in the Coastal Zone". This poster displayed several examples of chlorophyll time-series generated by Giovanni, which were analyzed using an Excel spreadsheet and supplementary statistical analyses performed by N. Casey. Subsequent to the meeting, the procedure by which the entire analysis could be accomplished with the Excel spreadsheet was determined.
Ocean Optics XVIII Extended Abstract (OO060208)
This tutorial will present in detail how to generate a chlorophyll time-series with Giovanni and how to perform basic statistical analysis of the time-series. The Onslow Bay time-series is quite interesting, and it demonstrates why time-series analysis is quite important for the observation of trends in the coastal zone.
2. Primary Research Question
The primary research question of this tutorial concerns the evaluation of chlorophyll time-series data. The question that will be addressed can be stated as:
What can be learned from an examination of a time-series of chlorophyll concentration generated for a specific coastal area, and how can the time-series be evaluated statistically?
3. Investigation Plan
Our investigation plan has four basic steps:
- Define the research area and the time period to be investigated.
- Generate a time-series of chlorophyll concentration for the region and time-period using Giovanni.
- Import the time-series data into Excel for statistical analysis.
- Generate statistical analysis results in Excel, and examine these results.
4. Data Access and Visualization Methods
The data that will be analyzed are the SeaWiFS Level 3 monthly chlorophyll data available in Giovanni. This data is available at 9 km resolution for the time period September 1997 to present. Giovanni allows designation of the regional boundaries and the time range for the investigation, and provides both a visual plot of the time-series and ASCII (text) output of the data composing the time-series plot.
5. Preliminary Analysis
Onslow Bay is located on the coast of North Carolina. This image from the poster presented at Ocean Optics XVIII (see "Research Setting" for a link to the poster) shows where Onslow Bay is located, and the approximate boundaries of the study area (Region 11). Chlorophyll concentration data for this map were for the period January 1998 to December 2005, which is the time period that will be used for this investigation. Click the image for a larger version TIFF image.
Here is a "close-up" of the Onslow Bay region, generated using the Giovanni area plot function (in the Plot Type menu, this is the first option, actually called "Lat-lon map, Time-averaged").. The latitudeboundaries of the region are 34.5° N and 33.5° N, and the longitude boundaries are 77° W and 76° W. This map is an average of all of the monthly data between Jaunary 1998 and December 2005.
The next step is to generate the monthly time-series for the region. Once the regional boundaries and the time period have been defined for the area map above, it's very simple to go back to the Giovanni interface, select"Time-Series, Area-averaged" in the Plot Type menu, and then generate the time-series plot. Here is the time-series plot:
Examination of this time-series immediately reveals some interesting details. The first detail is the very high chlorophyll concentration in February 1998 (the second point in the time-series). The second detail is a general tendency toward higher values later in the time-series (if that high point in February 1998 is ignored). Are those details important? That's where statistical analysis becomes a vital tool for investigation.
6. Refinement of Analysis
Now that the time-series has been generated in Giovanni, the next step is to put the time-series data into an Excel spreadsheet.
First, and very importantly, the "Data Analysis Took Pak Add-In" must be "added in" to Excel. To do this, click on "Tools", click "Add-Ins", click the "Analysis Tool Pak" box, then click "OK". The spreadsheet will ask if you want to install it. Click "OK". The Data Analysis Tool Pak should install automatically. When this is finished, "Data Analysis" should now appear in the drop-down menu under "Tools". Click it. A window will open that shows the names of many different statistical analysis functions. Slide down to find "Regression". This is the function that will be used when the data in the spreadsheet is ready.
Now follow the step-by-step procedure below. (Step 1 is already completed if the Data Analysis Tool Pak has just been installed.)
--- 1. Open a new Excel spreadsheet.
--- 2. Return to the Giovanni interface, and instead of clicking the "Generate Plot" button to make the time-series plot, click the "ASCII data" button. In the window that showed the time-series plot, the plot will be replaced with the time-series data. The image below shows what the output looks like. It only shows the first 12 months of the data, not all 96 months.
--- 3. Drag-and-click on the data (don't include the header information at the top) all the way to the bottom. Copy-and-paste the data into a text editor (using Notepad on a Windows computer is an easy way to do this). Save the file with an appropriate name (it helps to put your file in a directory location that will be easy to find for the next step.).
--- 4. The next step is to import the data into Excel. Start Excel and open a new spreadsheet. Click "Data", then in the drop-down menu click "Import External Data" followed by a click on "Import Data". Find your file and click on it. This will initiate the "Text Import Wizard", and it should look something like this:
Note that this window indicates the data is "Delimited'. Click "Next". In Step 2, click the check mark in the "Tab" box to remove it, and then click in the "Space" box. This should also automatically put a check in the "Treat consecutive delimiters as one" box, and a line will appear separating the month-and-year column from the data values column. Now click "Next". In Step 3, the first column should be highlighted in black. In the "Column Data Format" box, click the "Date" button, then choose "YMD" (for Year, Month, Day) instead of MDY. The window should look just like this:
Now click "Finish" to import the data. The next window will ask where you want to put the data; for a new spreadsheet, the option should be the first cell in the first column, A1. Click "OK". The data should appear in the first two columns of the spreadsheet, with the month and the year in the first column (like "Jun-99") and the data values in the second column. If that's what it looks like, save the spreadsheet!
--- 5. The next step is a little tedious. Click on the first cell in column B. Then click on "Insert", then "Column", to insert a column between the dates and the data. Now each cell next to the dates must be numbered 1-96 (for this example - the basic procedure is to make a column with consecutive numbers for each month). Check to make sure none were missed; every number next to a December cell should be a multiple of 12. [There may be a quick way to consecutively number cells in a column, if you want to be creative. If you figure out how, email me! My email address is at the bottom of this tutorial.] Save the spreadsheet again.
--- 6. The boring steps are done, and now a statistical analysis can be performed. The data will be treated as if it was plotted in an X-Y plot, with Time (months) as the X-axis (independent variable) and chlorophyll concentration as the Y-axis (dependent variable). Choose "Regression" in the Data Analysis Tool Pak as described above. In the "Regression" window, you will now choose the "Input Y Data" and "Input X Data". Click on the small spreadsheet icon at the right of the input line. Then click-and-drag from the top to the bottom of column C, the chlorophyll concentration data column. At the bottom, hit the "Enter" key. This will designate the data in column C as the dependent variable data. Follow the same procedure for the "Input X data", except now click-and-drag from the top to the bottom of column B. Click the "95% confidence level" box, and the window will look like this:
Now click OK. The results of the statistical analysis should appear on a new spreadsheet page. To see the results better, under "Format", click "Columns", then click "Auto-Fit Selection". This will spread the results out so that they are readable. The results will look like this:
Save the spreadsheet.
--- 7. Now for the purpose of this demonstration, one more thing will be done. Return to the spreadsheet page with the data (Sheet 1) and restart the Regression analysis. The same selection procedure will be performed with columns B and C, except instead of starting at cells C1 and B1, start at cells C7 and B7. Once the data ranges have been selected, perform the regression analysis again. The results should now look like this:
Now we can examine the results of the statistical analysis. But first, save the spreadsheet one more time.
In most cases, regression analysis is used to determine how strongly correlated two variables are. For a time-series analysis of chlorophyll concentration, there's no reason for the concentrations to be strongly correlated with time -- the concentrations vary month to month -- so the regression statistics are not very meaningful here. The two statistical quantities that are useful are the slope of the line and the significance (the significance quantity is actually reported two different places in the results).
The slope of the line is reported here under "Coefficients", in the "X variable 1" line. For the full time-series (the first set of results), the slope of the line is -0.0001. For the truncated time-series that excluded the first six months (the second set of results), the slope of the line is +0.0017. That means that with the very high chlorophyll concentration value from February 1998 included, the slope of the line is essentially flat. But if the first six months of 1998 are excluded, there is a very small positive slope.
But is that slope significant? "Significance" means that the trend is real -- i.e., the data do show an actual increasing trend with time, such that this trend would be expected to continue. The significance number is generated by the "t-test". Thenumber used to evaluate trend significance is called the "P-value". The P-value is reported both in the "Significance F" column and in the "P-value" column in the "X Variable 1" line.
Atrend can be called significantat the 95% confidence level; this means that the actual P-value will be less than 0.05. The full time-series fails the significance test miserably; the P-value is 0.871. But when the first six months are excluded, the P-value is much, much smaller -- a value of 0.0025, considerably less than 0.05. So the chlorophyll concentration trend in Onslow Bay commencing in July 1998 is significant, speaking statistically -- but the small value of the slope of the fitted line indicates that it is just a gentle increasing trend.
The brief discussion above necessarily omits a lot of the underlying mathematics used for the statistical analysis. A couple of links are given in Section 10 for more information.
7. Statement of Results
The results of this investigation of a chlorophyll concentration time-series for the Onslow Bay region, January 1998 to December 2005, indicate that:
- A chlorophyll concentration "anomaly" occurred in February 1998, when the average chlororphyll concentration was much higher than for any other month in the time-series.
- The chlorophyll concentration time-series for the full time period, including the high concentration value in February 1998, does not have a statistically significant trend.
- The chlorophyll concentration time-series for a truncated time period of July 1998 to December 2005, which excludes the high value in February 1998, exhibits a small, statistically significant, positive (increasing) trend.
Now that the time-series for Onslow Bay has been generated and statistically analyzed, possible explanations of both the trend and the anomaly may be formulated. The anomaly occurs in a very suspicious period -- the winter of 1997-1998 in the United States and much of the Northern Hemisphere was affected by the largest El Niño in history. One effect of this El Niño was increased rainfall in regions of the United States. Increased rainfall and runoff into Onslow Bay could have contributed nutrients that would have fostered phytoplankton growth (or, because this is close to shore, just the increased turbidity of the water due to sediments carried by the river can lead to falsely high chlorophyll concentration values). So El Niño and increased rainfall are a possible explanation, but more evidence to support this idea must be found.
The gentle positive trend for the rest of the time-series might also be explained by nutrients carried by rivers. Fertilization of agricultural fields leads to higher concentration of nitrogen and phosphorus (N and P) in the runoff from the fields, which is carried by the rivers to the coastal zone. In the coastal zone, higher concentrations of nutrients enhance phytoplankton growth. This is a problem in the outflow zones of many large rivers, where the higher phytoplankton concentrations cause eutrophication -- decay of dead phytoplankton -- and hypoxia or anoxia, little or no oxygen in the water and near the bottom due to the action of bacteria on the dead phytoplankton. If there are increasing nutrient concentrations in the North Carolina rivers that enter Onslow Bay (and nearby), this might be an explanation for the increasing trend in chlorophyll concentrations. To support this explanation, more data is required.
9. Statement of Conclusions
The results of this tutorial research project indicate that chorophyll concentrations in the Onslow Bay region for the period January 1998 to December 2005 exhibit a small positive trend, if the first six months of that period are excluded from the analysis. This exclusion is due to the observation of a very high chlorophyll concentration in February 1998, which significantly affects the statistical evaluation of the time-series data. A possible explanation of the February 1998 high chlorophyll concentration value is increased rainfall related to the strong El Niño event that winter. A possible explanation of the increasing trend in the July 1998 - December 2005 time-series is increased nutrients in rivers provided by agriculture (and also increased urbanization).
(Note: Excel can also be used to plot time-series data and present the data in figures for reports and publications. The figure below is from the Ocean Optics XVIII poster, and it summarizes the statistical conclusions. See the end of Section 10 for some tips on how to do this.)
10. Questions for Further Investigation
Obviously, this research project provides some major questions that deserve additional investigation:
- What was the chief cause of the anomalously high chlorophyll concentrations in February 1998? (Was it increased rainfall, or something else?)
- Are increased nutrient concentrations in rivers entering the North Carolina coastal zone the likeliest cause of the increasing trend in chlorophyll concentrations for the Onslow Bay region? If not, what other explanations can account for the increasing trend?
- In statistical terms, the high chlorophyll concentration in February 1998 could be called an "outlier". There are statistical tests that can be used to evaluate whether a data point is a true outlier, justifying its exclusion from other data that will be analyzed. How can the February 1998 data point be tested against the rest of the time-series data to determine if it should be excluded?
Links for statistical analysis:
Excel for Business Statistics
Time-Series Plotting Tips for Excel:
The time-series can be plotted with the X-Y scatter plot option. The easiest way to do this is to click the Chart Wizard icon (the histogram), choose "XY (Scatter)", and then selectcolumns B and C (the consecutive numbers for the months, and the time-series data) for the X and Y series to be plotted. After the plot is created, click on a data point to get the "Format Data Series" window. Choose "Add Trendline" and select a linear fit, and the line showing the trend will be plotted. If you click on the line, the "Format Trendline" window appears. Under the "Options" tab, choose "Display equation on chart" and the best-fit equation for the line will be shown.
Unfortunately, the X-axis will show numbers instead of months if this is procedure is followed. If the YMD data (column A) is chosen instead of column B, the months will be displayed on the X-axis, but the equation for the best-fit line won't make any sense. The line, however, will still be a correct fit to the data.
Note that you can change the color of the line and the color and shape of the data points to refine the figure. Excel also allows you to label the axes and the figure to produce a final figure.
One other tip -- to get the figure out of Excel, screen capture software (which captures the image on the screen and allows conversion to other image formats, like JPEG or PNG) works better than trying to extract the figure from the spreadsheet.
Email address for comments: