Time Series Analysis with Apple Stock Price Forecasting (November 29)

Why don’t we do something more intriguing now that we have a firm understanding of time series? For those who share my curiosity, stocks offer a captivating experience with their charts, graphs, green and red tickers, and numbers. But, let’s review our understanding of stocks before we get started.

  • What is a Stock? A stock represents ownership in a company. When you buy a stock, you become a shareholder and own a portion of that company.
  • Stock Price and Market Capitalization: The stock price is the current market value of one share. Multiply the stock price by the total number of outstanding shares to get the market capitalization, representing the total value of the company.
  • Stock Exchanges: Stocks are bought and sold on stock exchanges, such as the New York Stock Exchange (NYSE) or NASDAQ. Each exchange has its listing requirements and trading hours.
  • Ticker Symbol: Each stock is identified by a unique ticker symbol. For example, Apple’s ticker symbol is AAPL.
  • Types of Stocks:
    • Common Stocks: Represent ownership with voting rights in a company.
    • Preferred Stocks: Carry priority in receiving dividends but usually lack voting rights.
  • Dividends: Some companies pay dividends, which are a portion of their earnings distributed to shareholders.
  • Earnings and Financial Reports: Companies release quarterly and annual financial reports, including earnings. Positive earnings often lead to a rise in stock prices.
  • Market Index: Market indices, like the S&P 500 or Dow Jones, track the performance of a group of stocks. They give an overall sense of market trends.
  • Risk and Volatility: Stocks can be volatile, and prices can fluctuate based on company performance, economic conditions, or global events.
  • Stock Analysis: People use various methods for stock analysis, including fundamental analysis (company financials), technical analysis (historical stock prices and trading volume), and sentiment analysis (public perception).
  • yfinance: yfinance is a Python library that allows you to access financial data, including historical stock prices, from Yahoo Finance.
  • Stock Prediction Models: Machine learning models, time series analysis, and statistical methods are commonly used for stock price prediction. Common models include ARIMA, LSTM, and linear regression.
  • Risks and Caution: Stock trading involves risks. It’s essential to diversify your portfolio, stay informed, and consider seeking advice from financial experts.

When working with stock data using the yfinance library in Python, the dataset typically consists of historical stock prices and related information.
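As a quick, minimal sketch (assuming the `yfinance` package is installed; the ticker and date range are only examples), this is how such a dataset can be pulled:

```python
import yfinance as yf

# Download roughly one year of daily AAPL data from Yahoo Finance.
# The date range here is illustrative; adjust it to your needs.
aapl = yf.download("AAPL", start="2023-01-01", end="2023-11-29")

# The resulting DataFrame is indexed by date and typically contains
# Open, High, Low, Close, Adj Close and Volume columns.
print(aapl.head())
print(aapl.columns.tolist())
```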

TSFP Further (November 27)

Let’s attempt a thorough analysis of our models today. Residual analysis, as we all know, is a crucial stage in time series modelling to evaluate the goodness of fit and make sure the model assumptions are satisfied. The discrepancies between the values predicted by the model and the observed values are known as residuals.

Here’s how we can perform residual analysis for your AR and MA models (a code sketch follows the list):

  1. Compute Residuals:
    • Calculate the residuals by subtracting the predicted values from the actual values.
  2. Plot Residuals:
    • To visually examine the residuals for trends, patterns, or seasonality, plot them over time. The residuals of a well-fitted model should look random and be centred around zero.
  3. Autocorrelation Function (ACF) of Residuals:
    • To see if there is any remaining autocorrelation, plot the residuals’ ACF. If the ACF plot shows significant spikes, this suggests that not all of the temporal dependencies were captured by the model.
  4. Histogram and Q-Q Plot:
    • Examine the histogram of the residuals and compare it with a normal distribution. To evaluate normality, additionally employ a Q-Q plot. Deviations from normality could indicate that the model’s assumptions have been violated.
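To make the first three steps concrete, here is a minimal sketch with statsmodels; the series and fitted model below are synthetic stand-ins, so every name and number is illustrative (the histogram and Q-Q plot are sketched further below):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf

# Stand-in data and model; in practice use your own series and fitted AR/MA model.
rng = np.random.default_rng(0)
y = pd.Series(rng.normal(size=120).cumsum())
model_fit = ARIMA(y, order=(1, 0, 0)).fit()

# 1. Compute residuals (observed minus fitted values).
residuals = model_fit.resid

# 2. Plot residuals over time: they should look random and centred around zero.
plt.figure(figsize=(10, 4))
plt.plot(residuals)
plt.axhline(0, color="red", linestyle="--")
plt.title("Residuals over time")
plt.show()

# 3. ACF of residuals: significant spikes hint at leftover autocorrelation.
plot_acf(residuals, lags=20)
plt.show()
```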

If you’re wondering why you should compare the histogram of residuals to a normal distribution, or why deviations from normality may indicate that the model assumptions are violated, you’re not alone. Many statistical inference techniques, such as confidence interval estimation and hypothesis testing, assume that the residuals (errors) follow a normal distribution. Deviations from normality can therefore lead to biased estimates and inaccurate conclusions.

The underlying theory of time series models, including ARIMA and SARIMA models, frequently assumes residual normality. If the residuals are not normally distributed, the model may not accurately capture the underlying patterns in the data.

Here’s why deviations from normality might suggest that the model assumptions are violated:

  1. Validity of Confidence Intervals:
    • The normality assumption is critical for constructing valid confidence intervals. If the residuals are not normally distributed, the confidence intervals may be unreliable, resulting in incorrect uncertainty assessments.
  2. Outliers and Skewness:
    • Deviations from normality in the histogram could indicate the presence of outliers or residual skewness. It is critical to identify and address these issues in order to improve the model’s performance.

Let’s run a residual analysis on whatever we’ve been doing with “Analyze Boston” data.

  1. Residuals over time: This plot describes the pattern and behaviour of the model residuals, or the discrepancies between the values that the model predicted and the values that were observed, over the course of the prediction period. It is essential to analyse residuals over time in order to evaluate the model’s performance and spot any systematic trends or patterns that the model may have overlooked. There are a couple of things to look for:
    • Ideally, residuals should appear random and show no consistent pattern over time. A lack of systematic patterns indicates that the model has captured the underlying structure of the data well.
    • Residuals should be centered around zero. If there is a noticeable drift or consistent deviation from zero, it may suggest that the model has a bias or is missing important information.
    • Heteroscedasticity: Look for consistent variability over time in the residuals. Variations in variability, or heteroscedasticity, may be a sign that the model is not accounting for the inherent variability in the data.
    • Outliers: Look for any extreme values or outliers in the residuals. Outliers may indicate unusual events or data points that were not adequately captured by the model.
    • The absence of a systematic pattern suggests that the models are adequately accounting for the variation in the logan_intl_flights data.
    • Residuals being mostly centered around the mean is a good indication. It means that, on average, your models are making accurate predictions. The deviations from the mean are likely due to random noise or unexplained variability.
    • Occasional deviations from the mean are normal and can be attributed to random fluctuations or unobserved factors that are challenging to capture in the model. As long as these deviations are not systematic or consistent, they don’t necessarily indicate a problem.
    • The absence of heteroscedasticity indicates that the models are handling the variability consistently. If the variability changed over time, it could mean that the models struggle during particular periods.
  2. The ACF (Autocorrelation Function) of residuals: demonstrates the relationship between the residuals at various lags. It assists in determining whether, following the fitting of a time series model, any residual temporal structure or autocorrelation exists. The ACF of residuals can be interpreted as follows:
    • No Significant Spikes: If the ACF of the residuals decays rapidly to zero and does not exhibit any significant spikes, the residuals are probably independent and the model has successfully captured the temporal dependencies in the data.
    • Significant Spikes: The presence of significant spikes at specific lags indicates the possibility of residual patterns or autocorrelation. This might point to the need for additional model improvement or the need to take into account different model structures.
    • Since there are no significant spikes in our ACF, it suggests that the model has successfully removed the temporal dependencies from the data.
  3. Histogram and Q-Q plot:
    • Look at the shape of the histogram. It should resemble a bell curve for normality. A symmetric, bell-shaped histogram suggests that the residuals are approximately normally distributed. Check for outliers or extreme values. If there are significant outliers, it may indicate that the model is not capturing certain patterns in the data. A symmetric distribution has skewness close to zero. Positive skewness indicates a longer right tail, and negative skewness indicates a longer left tail.
    • In a Q-Q plot, if the points closely follow a straight line, it suggests that the residuals are normally distributed. Deviations from the line indicate departures from normality. Look for points that deviate from the straight line. Outliers suggest non-normality or the presence of extreme values. Check whether the tails of the Q-Q plot deviate from the straight line. Fat tails or curvature may indicate non-normality.
    • A histogram will not reveal much to us because of the small number of data points we have. Each bar’s height in a histogram indicates how frequently or how many data points are in a given range (bin).
    • See what I mean?
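For reference, here is how the histogram and Q-Q plot discussed above can be produced; `residuals` is again a stand-in (in practice, take it from the fitted model as in the earlier sketch):

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Stand-in residuals; in practice use model_fit.resid from your fitted model.
residuals = np.random.default_rng(1).normal(size=120)

# 4a. Histogram of residuals: should be roughly bell-shaped around zero.
plt.hist(residuals, bins=15, edgecolor="black")
plt.title("Histogram of residuals")
plt.show()

# 4b. Q-Q plot: points close to the straight line suggest approximate normality.
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q plot of residuals")
plt.show()
```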

More Experiments with TSFP (November 22).

If I were you, I would question why we haven’t discussed AR or MA in isolation before turning to S/ARIMA. This is because ARIMA is merely a combination of the two: you can quickly turn an ARIMA into a pure AR model by setting the MA order to zero, and vice versa. Before we get started, let’s take a brief look back. We have seen how the ARIMA model operates and how to determine the model parameters manually through trial and error. I have two questions for you: first, is there a better way to go about doing this? Secondly, having seen models like AR, MA, and their combination ARIMA, how do you choose the best model?

Simply put, an MA model uses past forecast errors (residuals) to predict future values, whereas an AR model uses past values of the time series itself. The AR model assumes that future values are a linear combination of past values, with coefficients representing the weight of each past value; the MA model assumes that future values are a linear combination of past forecast errors.

Choosing Between AR and MA Models:

Understanding the type of data is necessary in order to select between AR and MA models. An AR model could be appropriate if the data shows distinct trends, but MA models are better at capturing transient fluctuations. Model order selection entails temporal dependency analysis using statistical tools such as the Partial AutoCorrelation Function (PACF) and AutoCorrelation Function (ACF). Exploring both AR and MA models and contrasting their performance using information criteria (AIC, BIC) and diagnostic tests may be part of the iterative process.

A thorough investigation of the features of a given dataset, such as temporal dependencies, trends, and fluctuations, is essential for making the best decision. Furthermore, taking into account ARIMA models—which integrate both AR and MA components—offers flexibility for a variety of time series datasets.

In order to produce the most accurate and pertinent model, the selection process ultimately entails a nuanced understanding of the complexities of the data and an iterative refinement approach.

Let’s get back to our data from “Analyze Boston”. To discern the optimal AutoRegressive (AR) and Moving Average (MA) model orders, Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots are employed.
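Here is a minimal sketch of producing those plots with statsmodels; the series below is a synthetic monthly stand-in for `logan_intl_flights`, so replace it with the actual column from the dataset:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Synthetic stand-in for the monthly logan_intl_flights series.
idx = pd.date_range("2013-01-01", periods=84, freq="MS")
rng = np.random.default_rng(0)
logan_intl_flights = pd.Series(
    3000 + 5 * np.arange(84) + 200 * np.sin(np.arange(84) * 2 * np.pi / 12) + rng.normal(0, 50, 84),
    index=idx,
)

# ACF suggests a candidate MA order (q); PACF suggests a candidate AR order (p).
fig, axes = plt.subplots(2, 1, figsize=(10, 6))
plot_acf(logan_intl_flights, lags=20, ax=axes[0])
plot_pacf(logan_intl_flights, lags=20, ax=axes[1])
plt.tight_layout()
plt.show()
```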

Since the first notable spike in our plots occurs at lag 1, we tried modelling the AR component with an order of 1. And guess what? After a brief period of optimism, the forecast stalled at zero. The MA model of the same order produced comparable results, flat-lining at zero as well. For this reason, a comprehensive parameter search is necessary to identify the optimal pairing of AR and MA orders.

As demonstrated in my earlier post, these plots can provide us with important insights and aid in the development of a passable model. However, we can never be too certain of anything, and we don’t always have the time to work through the labor-intensive process of experimenting with the parameters. Why not just have our code handle it?

To maximise the model fit, the grid search methodically assessed different orders under the guidance of the Akaike Information Criterion (AIC). This rigorous analysis identified the most accurate AR and MA models, each of which was then calibrated and used to produce forecasts. I could thus conclude that (0,0,7) and (8,0,0) were the optimal orders for the MA and AR models respectively.
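Here is a rough sketch of that kind of grid search; it reuses the `logan_intl_flights` series from the previous sketch, loops over candidate pure-AR and pure-MA orders, and keeps the one with the lowest AIC (the search ranges are illustrative):

```python
import warnings
from statsmodels.tsa.arima.model import ARIMA

warnings.filterwarnings("ignore")  # ARIMA fits can emit convergence warnings

# Candidate pure AR models (p, 0, 0) and pure MA models (0, 0, q), orders 1-10.
candidate_orders = [(p, 0, 0) for p in range(1, 11)] + [(0, 0, q) for q in range(1, 11)]

best_aic, best_order = float("inf"), None
for order in candidate_orders:
    try:
        result = ARIMA(logan_intl_flights, order=order).fit()
        if result.aic < best_aic:
            best_aic, best_order = result.aic, order
    except Exception:
        continue  # skip orders that fail to converge

print(f"Best order by AIC: {best_order} (AIC = {best_aic:.2f})")
```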

Moving Average Model with TSFP and Analyze Boston (November 20th).

The Moving Average Model, or MA(q), is a time series model that predicts the current observation by taking into account the influence of random or “white noise” terms from the past. It is a part of the larger family of models referred to as ARIMA (Autoregressive Integrated Moving Average) models. Let’s examine the specifics:

  • Important Features of MA(q):
    1. Order (q): The moving average model’s order is indicated by the term “q” in MA(q). It represents the quantity of historical white noise terms taken into account by the model. In MA(1), for instance, the most recent white noise term is taken into account.
    2. White Noise: The current observation is the result of a linear combination of the most recent q white noise terms and the current white noise term. White noise is a sequence of independent and identically distributed random variables with a mean of zero and constant variance.
    3. Mathematical Equation: An MA(q) model’s general form is expressed as follows:
      Y_t = μ + E_t + θ_1·E_(t−1) + θ_2·E_(t−2) + … + θ_q·E_(t−q)
      where Y_t is the current observation,
      μ is the mean of the time series,
      E_t is the white noise term at time t, and
      θ_1, θ_2, …, θ_q are the model parameters, i.e. the weights assigned to the past white noise terms.
  • Key Concepts and Considerations:
    1. Constant Mean (μ): The moving average model is predicated on the time series having a constant mean (μ).
    2. Stationarity: The time series must be stationary in order for MA(q) to be applied meaningfully. Differencing can be used to stabilise the statistical characteristics of the series in the event that stationarity cannot be attained.
    3. Model Identification: The order q is a crucial aspect of model identification. It is ascertained using techniques such as statistical criteria or autocorrelation function (ACF) plots.
  • Application to Time Series Analysis:
    1. Estimation of Parameters: Using statistical techniques like maximum likelihood estimation, the parameters θ_1, θ_2, …, θ_q are estimated from the data.
    2. Model Validation: Diagnostic checks, such as residual analysis and model comparison metrics, are used to assess the MA(q) model’s performance.
    3. Forecasting: Following validation, future values can be predicted using the model. The forecast at time t is based on the observed values and the most recent q white noise terms (from time t−q through t−1).
  • Use Cases:
    1. Capturing Short-Term Dependencies: When recent random shocks have an impact on the current observation, MA(q) models are useful for detecting short-term dependencies in time series data.
    2. Complementing ARIMA Models: To create ARIMA models, which are strong and adaptable in capturing a variety of time series patterns, autoregressive (AR) and differencing components are frequently added to MA(q) models.

Let’s try to fit an MA(1) model to the ‘logan_intl_flights’ time series from Analyze Boston. But before that, it’s important to assess whether the ‘logan_intl_flights’ time series is appropriate for this type of model. The ACF and PACF plots illustrate the relationship between the time series and its lagged values, which aids in determining the possible order of the moving average component (q).
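A minimal sketch of fitting an MA(1) with statsmodels follows; an MA(1) corresponds to an ARIMA order of (0, 0, 1), and the series here is a synthetic stand-in for `logan_intl_flights`:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic stand-in; replace with the actual logan_intl_flights column.
idx = pd.date_range("2013-01-01", periods=84, freq="MS")
logan_intl_flights = pd.Series(
    3000 + np.random.default_rng(0).normal(0, 100, 84), index=idx
)

# An MA(1) model is ARIMA(p=0, d=0, q=1).
ma1_fit = ARIMA(logan_intl_flights, order=(0, 0, 1)).fit()
print(ma1_fit.summary())

# Forecast a few steps ahead; note how quickly an MA(1) reverts to the mean.
print(ma1_fit.forecast(steps=5))
```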

With “TSFP” and Analyze Boston (November 17).

  • In time series analysis, stationarity is an essential concept. A time series that exhibits constant statistical attributes over a given period of time is referred to as stationary. The modelling process is made simpler by the lack of seasonality or trends. Two varieties of stationarity exist:
    • Strict Stationarity: The entire probability distribution of the data is time-invariant.
    • Weak Stationarity: The mean, variance, and autocorrelation structure remain constant over time.
  • Transformations like differencing or logarithmic transformations are frequently needed to achieve stationarity in order to stabilise statistical properties.
  • Let’s check the stationarity of the ‘logan_intl_flights’ time series. Is the average number of international flights constant over time?
  • A visual inspection of the plot will tell you it’s not. But let’s try performing an ADF test.
  • The Augmented Dickey-Fuller (ADF) test is a prominent solution to this problem. This statistical tool rigorously examines whether a unit root is present, which would indicate non-stationarity. The null hypothesis of a unit root is rejected if the p-value is less than the conventional 0.05 threshold, supporting stationarity. Combining domain expertise with this statistical rigour improves our understanding of the dataset’s temporal dynamics; a quick code sketch follows below.
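Here is that sketch of the ADF test with statsmodels; the series is again a synthetic stand-in for `logan_intl_flights`:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Synthetic trending stand-in; replace with the actual logan_intl_flights column.
rng = np.random.default_rng(0)
logan_intl_flights = pd.Series(3000 + 10 * np.arange(84) + rng.normal(0, 50, 84))

# adfuller returns (test statistic, p-value, lags used, n obs, critical values, ...).
adf_stat, p_value, *_ = adfuller(logan_intl_flights.dropna())

print(f"ADF statistic: {adf_stat:.3f}")
print(f"p-value:       {p_value:.3f}")
if p_value < 0.05:
    print("Reject the null hypothesis of a unit root: the series looks stationary.")
else:
    print("Fail to reject the null: the series appears non-stationary.")
```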

How Does Time Series Forecasting Work? (15th November)

I feel compelled to share the insightful knowledge bestowed upon me by this book on time series analysis as I delve deeper into its pages.

Defining Time Series: A time series is a sequence of data points arranged chronologically, comprising measurements or observations made at regular and equally spaced intervals. This type of data finds widespread use in various disciplines such as environmental science, biology, finance, and economics. The primary goal when working with time series is to understand the underlying patterns, trends, and behaviors that may exist in the data over time. Time series analysis involves modeling, interpreting, and projecting future values based on past trends.

Time Series Decomposition: Time series decomposition is a technique for breaking down a time series into its fundamental components: trend, seasonality, and noise. These elements enhance our understanding of data patterns.

– Trend: Represents the long-term movement or direction of the data, helping identify if the series is rising, falling, or staying the same over time.
– Seasonality: Identifies recurring, regular patterns in the data that happen at regular intervals, such as seasonal fluctuations in retail sales.
– Noise (or Residuals): Represents sporadic variations or anomalies in the data not related to seasonality or trends, essentially the unexplained portion of the time series.

Decomposing a time series into these components aids in better comprehending the data’s structure, facilitating more accurate forecasting and analysis.
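As a concrete sketch, statsmodels offers `seasonal_decompose` for exactly this split; the monthly series below is a synthetic stand-in, and `period=12` assumes yearly seasonality in monthly data:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series with a trend and a yearly seasonal pattern.
idx = pd.date_range("2013-01-01", periods=84, freq="MS")
rng = np.random.default_rng(0)
series = pd.Series(
    3000 + 5 * np.arange(84) + 200 * np.sin(np.arange(84) * 2 * np.pi / 12) + rng.normal(0, 50, 84),
    index=idx,
)

# Additive decomposition into trend, seasonal and residual components.
decomposition = seasonal_decompose(series, model="additive", period=12)
decomposition.plot()
plt.show()
```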

Forecasting Project Lifecycle: The entire process of project lifecycle forecasting involves predicting future trends or outcomes using historical data. The lifecycle typically includes stages such as data collection, exploratory data analysis (EDA), model selection, training the model, validation and testing, deployment, monitoring, and maintenance. This iterative process ensures precise and current forecasts, requiring frequent updates and modifications.

Baseline Models: Baseline models serve as simple benchmarks or reference points for more complex models. They provide a minimal level of prediction, helping evaluate the performance of more sophisticated models.

– Mean or Average Baseline: Projects a time series’ future value using the mean of its historical observations.
– Naive Baseline: Forecasts the future value based on the most recent observation.
– Seasonal Baseline: Forecasts future values for time series with a distinct seasonal pattern using the average historical values of the corresponding season.

Random Walk Model: The random walk model is a straightforward but powerful baseline for time series forecasting, assuming that future variations are entirely random. It serves as a benchmark to assess the performance of more advanced models.

Exploring the ‘Economic Indicators’ dataset from Analyze Boston, we can examine the baseline (mean) for Total International flights at Logan Airport. The baseline model computes the historical average, assuming future values will mirror this average. Visualization of the model’s performance against historical trends helps gauge its effectiveness and identify possible shortcomings for further analysis and improvement in forecasting techniques.
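Here is a minimal sketch of such a mean baseline (with a naive baseline for comparison); it reuses the synthetic `series` from the decomposition sketch above as a stand-in for the Logan flights data, and the 80/20 split is arbitrary:

```python
import numpy as np

# Hold out the last 20% of observations to evaluate the baselines.
split = int(len(series) * 0.8)
train, test = series.iloc[:split], series.iloc[split:]

# Mean baseline: predict the historical average for every future point.
mean_forecast = np.full(len(test), train.mean())

# Naive baseline: predict the last observed value for every future point.
naive_forecast = np.full(len(test), train.iloc[-1])

mae_mean = np.mean(np.abs(test.values - mean_forecast))
mae_naive = np.mean(np.abs(test.values - naive_forecast))
print(f"Mean baseline MAE:  {mae_mean:.1f}")
print(f"Naive baseline MAE: {mae_naive:.1f}")
```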

Time series exploration (13th November)

Within the expansive realm of data analysis, a nuanced understanding of the distinctions among various methodologies is imperative. Time series analysis stands out as a specialized expertise tailored for scrutinizing data collected over time. As we juxtapose time series methods against conventional predictive modeling, the distinctive merits and drawbacks of each approach come to light.

Time series analysis, epitomized by models like ARIMA and Prophet, serves as the linchpin for tasks where temporal dependencies shape the narrative. ARIMA employs three sophisticated techniques—moving averages, differencing, and autoregression—to capture elusive trends and seasonality. On the other hand, Prophet, a creation of Facebook, adeptly handles missing data and unexpected events.

In essence, the choice between time series analysis and conventional predictive modeling hinges on the intrinsic characteristics of the available data. When unraveling the intricacies of temporal sequences, time series methods emerge as the superior option. They furnish a tailored approach for discerning and forecasting patterns over time that generic models might overlook. Understanding the strengths and limitations of each data navigation technique aids in selecting the most suitable tool for navigating the data landscape at hand.

Hierarchical Clustering Algorithm applied to the dataset (3rd November)

Clusters produced by Hierarchical Clustering closely resemble K-means clusters. Actually, there are situations where the outcome is precisely the same as k-means clustering. However, the overall procedure is a little different. Agglomerative and Divisive are the two types: the bottom-up strategy is called agglomerative, and the top-down strategy is called divisive. Today, my primary focus was on the Agglomerative approach.

 

I became familiar with dendrograms, where the horizontal axis represents the data points and the vertical axis indicates the Euclidean distance at which clusters are merged. Therefore, the higher the linking lines, the more dissimilar the clusters. By examining how many vertical lines a given threshold cuts in our dendrogram, we can decide what dissimilarity threshold to set; the largest clusters below that threshold are the ones we keep. In a dendrogram, the threshold is typically placed across the greatest vertical distance that you can travel without touching any of the horizontal lines.
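A small sketch of building such a dendrogram with SciPy, using a hypothetical 2-D feature matrix `X`:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical data; replace with the feature matrix from your dataset.
rng = np.random.default_rng(42)
X = rng.normal(size=(30, 2))

# Agglomerative (bottom-up) linkage using Ward's criterion on Euclidean distances.
Z = linkage(X, method="ward")

plt.figure(figsize=(10, 4))
dendrogram(Z)
plt.ylabel("Euclidean distance")
plt.title("Agglomerative clustering dendrogram")
plt.show()
```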

Adding to Clustering: the Elbow Method (1st November)

Extending Clustering will lead to an approach called the elbow method.

I plotted the WCSS curve, which shows how close the points in each group are to one another, to determine the number of clusters we need. It helped me identify the point at which adding another cluster brings only an insignificant improvement; this is known as the “elbow” point.

WCSS, or within-cluster sum of squares, is essentially a gauge of how compact our clusters are: it is the sum of the squared distances between each point and its cluster’s centre. Lower WCSS means tighter clusters, and WCSS typically decreases as the number of clusters increases. The elbow method uses this curve to determine the ideal number of clusters.
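A minimal sketch of the elbow method with scikit-learn, again on a hypothetical feature matrix `X`:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Hypothetical data; replace with the features you are clustering.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# inertia_ is scikit-learn's name for the WCSS of a fitted KMeans model.
wcss = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow method")
plt.show()
```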

Density-Based Spatial Clustering (30 October)

Today, I learned when and how to use Density-Based Spatial Clustering of Applications with Noise, or DBSCAN. Using this clustering algorithm, it is possible to locate densely packed groups of data points in space. Unlike with k-means, we don’t need to specify the number of clusters in advance, and it is able to find clusters of arbitrary shape.
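A short sketch of DBSCAN in scikit-learn on a hypothetical 2-D feature matrix `X`; the `eps` and `min_samples` values are placeholders to tune for your data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data; replace with the features you want to cluster.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))

# eps is the neighbourhood radius; min_samples is the density threshold.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

labels = db.labels_  # label -1 marks points DBSCAN treats as noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Estimated clusters: {n_clusters}, noise points: {list(labels).count(-1)}")
```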

 

To further investigate the difference in average ages between Black and White individuals, I used frequentist statistical methods, such as Welch’s t-test, in addition to the Bayesian approaches.

This frequentist approach, in addition to the Bayesian analyses, consistently and unambiguously shows a significant difference in mean ages. The convergence of results from the frequentist and Bayesian approaches increases my confidence in the observed difference in average ages between the two groups.
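A small sketch of Welch’s t-test with SciPy; the age samples below are synthetic placeholders standing in for the two groups from the dataset:

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical age samples; in practice these come from the shootings dataset.
rng = np.random.default_rng(7)
ages_group_a = rng.normal(loc=32, scale=11, size=500)
ages_group_b = rng.normal(loc=39, scale=13, size=800)

# equal_var=False turns the standard t-test into Welch's t-test.
t_stat, p_value = ttest_ind(ages_group_a, ages_group_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```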

 

Bayesian t-test (25th October)

Because I am certain that there is a statistically significant difference in average age of about seven years, I used a Bayesian t-test strategy to account for this prior knowledge. However, the results showed an interesting discrepancy: the observed difference landed near the tail of the posterior distribution. This discrepancy, and the way it showed how sensitive Bayesian analysis is to the prior specification, did not sit well with me.
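For illustration, here is a minimal Bayesian comparison of group means in the spirit of that analysis, sketched with PyMC under an informative prior of roughly a seven-year gap; the data, prior values, and variable names are all placeholders rather than the actual analysis:

```python
import arviz as az
import numpy as np
import pymc as pm

# Hypothetical age samples standing in for the two groups.
rng = np.random.default_rng(3)
ages_a = rng.normal(32, 11, 500)
ages_b = rng.normal(39, 13, 800)

with pm.Model():
    # Priors encoding the belief of roughly a 7-year difference in means.
    mu_a = pm.Normal("mu_a", mu=32, sigma=5)
    mu_b = pm.Normal("mu_b", mu=39, sigma=5)
    sigma_a = pm.HalfNormal("sigma_a", sigma=20)
    sigma_b = pm.HalfNormal("sigma_b", sigma=20)

    pm.Normal("obs_a", mu=mu_a, sigma=sigma_a, observed=ages_a)
    pm.Normal("obs_b", mu=mu_b, sigma=sigma_b, observed=ages_b)

    pm.Deterministic("diff", mu_b - mu_a)
    trace = pm.sample(1000, tune=1000, progressbar=False)

# Posterior summary of the difference in mean ages.
print(az.summary(trace, var_names=["diff"]))
```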

Drawing the heat map (23rd October)

The difficulty in visualising a large collection of geocoded police shooting incidents is conveying information effectively without overwhelming the viewer. First, I attempted to use the Folium library to create an individual marker on the map for each incident. However, as the dataset grew larger, the computational cost of creating a marker for every data point increased. As a more effective solution to this problem, I decided to use a HeatMap. The HeatMap shows incident concentration more succinctly and the distribution of events on the map more clearly. I improved the heatmap’s readability by adjusting its intensity and spread with settings such as blur and radius.
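A minimal sketch with Folium; the coordinates below are placeholders, and in practice `df` would hold the latitude/longitude of each incident:

```python
import folium
import pandas as pd
from folium.plugins import HeatMap

# Hypothetical incident coordinates; replace with the real dataset.
df = pd.DataFrame({
    "latitude": [42.36, 34.05, 41.88, 29.76],
    "longitude": [-71.06, -118.24, -87.63, -95.37],
})

# Centre the map roughly on the continental US; adjust to your data.
m = folium.Map(location=[39.8, -98.6], zoom_start=4)

# Each entry is a [lat, lon] pair; HeatMap aggregates them into a density layer.
points = df[["latitude", "longitude"]].dropna().values.tolist()
HeatMap(points, radius=10, blur=15).add_to(m)

m.save("shootings_heatmap.html")
```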

Arrange data monthly (16th October)

It occurred to me that police shootings might increase during festivals, which could make for something interesting. I collated the data monthly, and below are the results I found.

 

As you can see from the image, there is a slight increase in counts during March and August, but the increase isn’t large.

 

March->741

August->702
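A small sketch of how such monthly counts can be produced with pandas; the `date` column name and values are placeholders for the incident dates in the dataset:

```python
import pandas as pd

# Hypothetical incident dates; replace with the real date column.
df = pd.DataFrame({"date": ["2021-03-05", "2021-03-18", "2021-08-02", "2021-12-25"]})

# Parse dates and count incidents per calendar month across all years.
df["date"] = pd.to_datetime(df["date"])
monthly_counts = df.groupby(df["date"].dt.month_name()).size()

print(monthly_counts.sort_values(ascending=False))
```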

Arrange data weekly (13th October)

In order to find something interesting, I was curious to collate the data weekly and check for anomalies. However, I couldn’t find any significant anomalies and hence had to drop the idea.