BEE-L Archives

Informed Discussion of Beekeeping Issues and Bee Biology

BEE-L@COMMUNITY.LSOFT.COM

Subject:
From:
James Fischer <[log in to unmask]>
Reply To:
Informed Discussion of Beekeeping Issues and Bee Biology <[log in to unmask]>
Date:
Sat, 20 Feb 2021 15:38:31 -0500
> we can't predict the weather with 
> any degree of accuracy, as it is 
> inherently unpredictable

The National Center for Atmospheric Research (UCAR/NCAR) would beg to differ. Given satellite data and ever-smaller "boxes of air" that are assigned values as pressure gradients move, they can accurately predict the weather 7 days in advance about 80 percent of the time, and their 5-day forecasts are accurate about 90 percent of the time.  Yet they get no credit for modeling the boxes on the newest and best supercomputers, making the boxes ever smaller and the metrics ever more precise.  They bought higher-priced toys than anyone else in computing worldwide for decades, and I assume they still do.

> It means we have to constantly 
> remember we are dealing with 
> probability.

We probably won't.  So many indistinct critiques of "models" have been made here recently, but not a single specific model has even been mentioned yet.  These tools have withstood the test of time, and it is only the misapplication of a less-than-optimal choice of model that should be critiqued, as the established models "just work".  

There are only so many practical model "algorithms" out there in the toolbox, and most are old enough to be very well-known.  Each has a strength, and each has a cost (in terms of computational effort required).  I'll list each, as they are not that mysterious, and see if I can find a bee-related paper on my local drives that uses each technique as a practical example.  There is a lack of "modelers" in bee research, just as there is a lack of statisticians, only worse.

Logistic Regression - 

This is the "S-shaped line" one sees on a graph that divides "true" from "false" results.  The problem with this one is being elegant in one's removal of "noise" and "outliers", so this tool can be misused by charlatans and advocates cosplaying as scientists.  This is a non-linear environment, so the math is more complex than a spreadsheet can hack.  You need SAS, S, R, SPSS, Stata, etc.

A decent "bee use" of this model is "Using Reports of Bee Mortality in the Field to Calibrate Laboratory-Derived Pesticide Risk Indices"  - Environ. Entomol. 37(2): 546-554 (2008)
DOI: 10.1603/0046-225X
https://www.researchgate.net/publication/5432604_Using_Reports_of_Bee_Mortality_in_the_Field_to_Calibrate_Laboratory-Derived_Pesticide_Risk_Indices
https://tinyurl.com/4cvo9ey8

They took both oral and contact toxicity and worked up an interesting metric - "application rate divided by LD50" (so that a heavily-applied chemical with a lower toxicity would be properly weighted for its higher application rate), and then looked at records of how much of each pesticide was reported as used, and how many bee kills were reported.
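If you want to see the moving parts, here is a toy sketch in plain Python.  The "dose" numbers below are invented purely for illustration (they have nothing to do with the paper above), and real work would use R, SAS, etc. as noted - but the S-shaped curve and the fitting loop are the genuine article:

```python
import math

def train_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit w, b for P(y=1|x) = 1/(1+exp(-(w*x+b))) by gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            # nudge the curve toward each observed true/false outcome
            w += lr * (y - p) * x
            b += lr * (y - p)
    return w, b

# Hypothetical "dose" values; 1 = kill observed, 0 = no kill
doses = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
kills = [0,   0,   0,   1,   1,   1]
w, b = train_logistic(doses, kills)

def predict(x):
    """Probability of 'true' at dose x - the S-shaped line."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))
```

The fitted curve slides its steep middle section to sit between the "false" and "true" clusters, which is exactly where the noise-and-outlier games can be played.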

Linear Discriminant Analysis -

This is logistic regression for more than two "classes" of output. You get a mean for every class and a single variance pooled across all the classes.  This works best with "bell curve data", and most things in reality fall on a bell curve.  Outliers get tossed up front, so you take the "filet" of nice meaty data from the center of the bell curve, and sacrifice the edges of the curve on the lab's shrine to Carl Friedrich Gauss.

A good example of "bee use" is Fig 5 in "The genetic origin of honey bee colonies used in the COLOSS Genotype-Environment Interactions Experiment: a comparison of methods"  (JAR 05/14)
DOI: 10.3896/IBRA.1.53.2.02
https://www.researchgate.net/publication/262692503_The_genetic_origin_of_honey_bee_colonies_used_in_the_COLOSS_Genotype-Environment_Interactions_Experiment_a_comparison_of_methods/download
https://tinyurl.com/29vh53qm
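The mechanics fit in a few lines of Python.  The "measurements" below are invented (think of them as some morphometric value for three hypothetical stock groups), but the structure - a mean per class, one pooled variance, pick the class with the best discriminant score - is the real thing:

```python
import math
from collections import defaultdict

def fit_lda(samples):
    """samples: list of (value, class_label).  Returns per-class means,
    one variance pooled across all classes, and class priors."""
    groups = defaultdict(list)
    for v, c in samples:
        groups[c].append(v)
    means = {c: sum(vs) / len(vs) for c, vs in groups.items()}
    # pooled (shared) variance: squared deviations from each class mean
    sq = sum((v - means[c]) ** 2 for v, c in samples)
    var = sq / (len(samples) - len(groups))
    priors = {c: len(vs) / len(samples) for c, vs in groups.items()}
    return means, var, priors

def classify(x, means, var, priors):
    """Pick the class maximizing the linear discriminant score."""
    def score(c):
        mu = means[c]
        return x * mu / var - mu * mu / (2 * var) + math.log(priors[c])
    return max(means, key=score)

# Hypothetical measurements for three groups "A", "B", "C"
data = [(1.0, "A"), (1.2, "A"), (2.0, "B"), (2.1, "B"),
        (3.0, "C"), (3.2, "C")]
means, var, priors = fit_lda(data)
```

A new measurement gets assigned to whichever class mean it sits closest to, weighted by that shared bell-curve width.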

Linear Regression - 

This is the "straight-line" regression, the one that yields the "m" and "b" coefficients in "y = mx + b".  Both banking and marketing types love these, and this simple model may be part of the reason why they run their businesses into a ditch on a regular basis and have to be bailed out by taxpayers.  It is likely the model used most often and most successfully by the general public; for example, it allows one to look at 3 colony weights taken over time and use Excel's "Linear Trendline" graphing tool to predict when that hive will run out of stores, assuming conditions continue much the same as they have been for the 3 weighings.  (So even with this simplest of models, one can screw up if conditions change dramatically.)
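Here is the hive-scale example worked in Python rather than Excel.  The weights are hypothetical, but the arithmetic is exactly the least-squares fit the "Linear Trendline" tool does:

```python
def fit_line(ts, ws):
    """Ordinary least-squares fit of w = m*t + b."""
    n = len(ts)
    tbar = sum(ts) / n
    wbar = sum(ws) / n
    m = sum((t - tbar) * (w - wbar) for t, w in zip(ts, ws)) / \
        sum((t - tbar) ** 2 for t in ts)
    b = wbar - m * tbar
    return m, b

# Hypothetical: hive weight (lbs) above "empty" at weeks 0, 2, 4
weeks   = [0, 2, 4]
weights = [30.0, 26.0, 22.0]
m, b = fit_line(weeks, weights)   # m = -2.0 lbs/week, b = 30.0
weeks_until_empty = -b / m        # line crosses zero at week 15
```

Which is the whole caveat in one line: the week-15 prediction is only as good as the assumption that the hive keeps losing 2 lbs/week.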

Bayes -

Bayesian models are very powerful in that they make the best match possible given the data presented, so they are good for complex schemes with lots of variables, and some missing variables for some samples.  The easiest example is "spam email" detection.  The more messages that are flagged as "spam" by users, the better the dataset of common words and phrases that help the system classify new messages as spam or not.  The drawback is that without lots of data to "train" the Bayesian filter, it will not be very good.  The good news is that it gets smarter and better with age and use.

There are multiple studies suggesting that a "Bayesian filter" model is a very good approximation for the thought process bees use to forage, and to make other decisions that result in the emergent behavior that makes them seem "smart" in human terms.
" Bumblebees Learn to Forage like Bayesians"
American Naturalist, Vol. 174, No. 3 (09/2009)
doi: 10.1086/603629
https://drive.google.com/file/d/1N7yme0z1_iMcGL3kQWGdg0bn03QHAbP7/view?usp=sharing
https://tinyurl.com/jbfq6f3p
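The spam-filter example above is easy to sketch.  This is a naive Bayes classifier with Laplace smoothing, trained on four invented messages - a real filter differs mainly in the size of the training pile, which is exactly the "gets smarter with use" point:

```python
import math
from collections import Counter

def train_nb(messages):
    """messages: list of (text, label).  Count words per label."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in messages:
        for word in text.lower().split():
            counts[label][word] += 1
        totals[label] += 1
    vocab = set(w for c in counts.values() for w in c)
    return counts, totals, vocab

def classify(text, counts, totals, vocab):
    """Pick the label with the highest (log) posterior probability."""
    n = sum(totals.values())
    best, best_score = None, float("-inf")
    for label in counts:
        score = math.log(totals[label] / n)       # prior
        denom = sum(counts[label].values()) + len(vocab)
        for word in text.lower().split():
            # add-one (Laplace) smoothing so unseen words don't zero out
            score += math.log((counts[label][word] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

train = [("win free money now", "spam"),
         ("free prize claim now", "spam"),
         ("meeting agenda for monday", "ham"),
         ("lunch with the team monday", "ham")]
model = train_nb(train)
```

Every newly-flagged message adds to the counters, so the filter improves with age and use, just as described.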

K-Nearest Neighbors (KNN) - 

This model is trained with an existing dataset, which needs to be extensive, and fits each new item into its proper place by computing the Euclidean distance between the new item and its "nearest neighbors" in the dataset, whose known values determine the "value" one is predicting/estimating for it. 

A "bee application" is "Predicting acute contact toxicity of pesticides in honeybees (Apis mellifera) through a k-nearest neighbor model"
Chemosphere 166  01/2017
DOI: 10.1016/j.chemosphere.2016.09.092
https://drive.google.com/file/d/1P7WAJ5qY8hDgmmDpq6hGB6QEMrqgLyCe/view?usp=sharing
https://tinyurl.com/yt8ujftn
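The whole algorithm is short enough to show.  The "pesticide descriptors" here are two made-up numeric features, not anything from the paper above, but the distance-and-vote machinery is the real KNN:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label).
    Vote among the k training items nearest (Euclidean) to the query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-feature chemical descriptors -> toxicity class
train = [((1.0, 1.0), "low"), ((1.2, 0.9), "low"), ((0.8, 1.1), "low"),
         ((3.0, 3.1), "high"), ((3.2, 2.9), "high"), ((2.9, 3.0), "high")]
```

Note there is no "training" step at all - the dataset *is* the model, which is why KNN needs that dataset to be extensive.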

Learning Vector Quantization -

KNN requires massive datasets to be useful. What if you don't have a dataset, but simply want to "train" the system to recognize something that you can discriminate?  LVQ is like KNN, except you "train" the system by rejecting the wrong guesses made by the system, to increase accuracy.  This is a big deal in "image recognition", but there are few areas in science where we know what we are looking for in advance well enough to use this outside of optical applications.
Bees?  Yes, the process can identify and count bees:
"A Study for Identification and Behavioral Tracking of Honeybees in the Observation Hive Using Vector Quantization Method"
Proceedings of Measuring Behavior 2008
CiteSeerX: 10.1.1.600.6783
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.600.6783&rep=rep1&type=pdf
https://tinyurl.com/13e3t3d7
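The "training by correction" idea is the LVQ1 update rule: pull the nearest prototype toward a sample it labels correctly, push it away from one it labels wrongly.  A cartoon version, with made-up 2-D "image feature" points standing in for real pixel data:

```python
import math

def lvq_train(prototypes, data, lr=0.2, epochs=20):
    """prototypes: list of ([x, y], label), modified in place (LVQ1)."""
    for _ in range(epochs):
        for vec, label in data:
            proto = min(prototypes, key=lambda p: math.dist(p[0], vec))
            # reward a right guess, punish a wrong one
            sign = lr if proto[1] == label else -lr
            for i in range(len(vec)):
                proto[0][i] += sign * (vec[i] - proto[0][i])
    return prototypes

def lvq_classify(prototypes, vec):
    return min(prototypes, key=lambda p: math.dist(p[0], vec))[1]

# Two prototypes seeded near hypothetical "bee" / "not-bee" clusters
protos = [([0.0, 0.0], "not-bee"), ([5.0, 5.0], "bee")]
data = [([0.5, 0.4], "not-bee"), ([0.2, 0.8], "not-bee"),
        ([4.5, 4.8], "bee"), ([5.2, 4.9], "bee")]
protos = lvq_train(protos, data)
```

The payoff versus KNN: after training you keep only the handful of prototypes, not the whole dataset.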


Support Vector Machines -

This is taking "Logistic Regression" and expanding the "line" to a "plane", so it adds another dimension to the classification of the data.   The vectors from the input data points to the hyperplane can either "support" the classification (when all the data instances of the same class are on the same side of the hyperplane) or "defy" it (when a data point falls on the wrong side of the hyperplane for its class).
I don’t know of anyone using this technique in any bee-related applications, but one of the more promising ways to employ this "vector" approach is in tandem with a modeling scheme that is based upon the (simplified) foraging behavior of a bee colony.  Not joking, "foraging bees" are a model that other fields use to slog through massive datasets - there's "Artificial Bee Colony", "Global Artificial Bee Colony", and likely more variants that I have not yet read about.  Here's a decent explanation:

"Classification of Cancer Data Using Support Vector Machines with Features Selection Method Based on Global Artificial Bee Colony"
Proceedings of the 3rd International Symposium on Current Progress in Mathematics and Sciences  - 2017
https://doi.org/10.1063/1.5064202
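For the curious, the hinge-loss mechanics of a linear SVM fit in a few lines.  This sketch uses Pegasos-style sub-gradient descent (my choice of method, not anything from the paper above), with the bias term omitted for simplicity and the data invented and already centered:

```python
def svm_train(data, lam=0.01, epochs=100):
    """data: list of ((x0, x1), y) with y in {-1, +1}.
    Pegasos-style sub-gradient descent on the hinge loss."""
    w = [0.0, 0.0]
    t = 0
    for _ in range(epochs):
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)                 # decaying step size
            margin = y * (w[0] * x[0] + w[1] * x[1])
            w = [(1 - eta * lam) * wi for wi in w]  # regularization shrink
            if margin < 1:                          # hinge-loss violator
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def svm_predict(w, x):
    """Which side of the hyperplane (here, a line) does x fall on?"""
    return 1 if w[0] * x[0] + w[1] * x[1] >= 0 else -1

data = [((-1.0, -1.0), -1), ((-1.5, -0.5), -1), ((-0.5, -1.5), -1),
        ((1.0, 1.0), 1), ((1.5, 0.5), 1), ((0.5, 1.5), 1)]
w = svm_train(data)
```

The "margin < 1" test is the key: only points near or beyond the boundary (the support vectors) move the plane; everything else is ignored.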

Random Decision Forests -

This does just what it says - with many "decision trees", one has a "forest", so instead of just one "optimal" route, multiple "suboptimal" routes are also found across the trees, which yields better insight into the key factors that led to the data you have.  It shows levels of influence, and allows one to see what the most significant factors are in one pass.  Again, you need a big dataset.
There's likely no bee-related applications here.  I've never used this one myself for anything.
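Never having used it myself, this sketch is only a cartoon: a "forest" of single-split stumps, each trained on a bootstrap sample with a randomly-chosen feature, voting by majority.  Real implementations grow full trees, but the bagging-plus-random-features idea is the same, and all the data here is invented:

```python
import random
from collections import Counter

def best_stump(data, feature):
    """Find the threshold on one feature minimizing misclassifications."""
    best = None
    for x, _ in data:
        thr = x[feature]
        left = [y for xi, y in data if xi[feature] <= thr]
        right = [y for xi, y in data if xi[feature] > thr]
        lmaj = Counter(left).most_common(1)[0][0]
        rmaj = Counter(right).most_common(1)[0][0] if right else lmaj
        err = sum(y != lmaj for y in left) + sum(y != rmaj for y in right)
        if best is None or err < best[0]:
            best = (err, thr, lmaj, rmaj)
    return (feature, best[1], best[2], best[3])

def train_forest(data, n_trees=15):
    random.seed(1)
    forest = []
    for _ in range(n_trees):
        sample = [random.choice(data) for _ in data]   # bootstrap resample
        feature = random.randrange(len(data[0][0]))    # random feature pick
        forest.append(best_stump(sample, feature))
    return forest

def forest_predict(forest, x):
    votes = Counter()
    for feature, thr, lmaj, rmaj in forest:
        votes[lmaj if x[feature] <= thr else rmaj] += 1
    return votes.most_common(1)[0][0]

data = [((1.0, 5.0), "A"), ((1.2, 4.8), "A"), ((0.9, 5.2), "A"),
        ((4.0, 1.0), "B"), ((4.2, 0.8), "B"), ((3.9, 1.1), "B")]
forest = train_forest(data)
```

Counting how often each feature gets picked for a low-error split is the "level of influence" readout mentioned above.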

Neural Networks and "Deep Neural Networks" -

These are the ones that get all the press these days.  There are dozens of platforms; I'll list some of the freeware ones below, as they are fun to play with, but MATLAB has a very good tutorial guidebook set that makes it likely the best choice for someone jumping in feet first without having gotten one's ticket punched in Math.  The add-on is called the "Neural Network Toolbox".  If you are serious, start with the MATLAB tutorials. I did.
https://www.mathworks.com/campaigns/offers/deep-learning-with-matlab.html
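Platforms aside, the core mechanics fit on one page of plain Python.  Here's a toy 2-input, one-hidden-layer network learning XOR by backpropagation - the classic "hello world" of neural nets; the layer size, learning rate, and epoch count are arbitrary choices for illustration:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def train_xor(seed, hidden=4, lr=0.5, epochs=15000):
    """A tiny 2-4-1 network trained with plain backpropagation."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(hidden)]
    b1 = [rng.uniform(-1, 1) for _ in range(hidden)]
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = rng.uniform(-1, 1)

    def forward(x):
        h = [sigmoid(w1[j][0] * x[0] + w1[j][1] * x[1] + b1[j])
             for j in range(hidden)]
        o = sigmoid(sum(w2[j] * h[j] for j in range(hidden)) + b2)
        return h, o

    for _ in range(epochs):
        for x, y in XOR:
            h, o = forward(x)
            d_o = (o - y) * o * (1 - o)                # output delta
            for j in range(hidden):
                d_h = d_o * w2[j] * h[j] * (1 - h[j])  # hidden delta
                w2[j] -= lr * d_o * h[j]
                w1[j][0] -= lr * d_h * x[0]
                w1[j][1] -= lr * d_h * x[1]
                b1[j] -= lr * d_h
            b2 -= lr * d_o
    return lambda x: forward(x)[1]

# backprop can stall in a local minimum on XOR, so restart with a
# fresh random seed until all four cases come out right
for seed in range(10):
    predict = train_xor(seed)
    if all((predict(x) > 0.5) == bool(y) for x, y in XOR):
        break
```

The "deep" networks in the press are this same loop with many more layers and fancier bookkeeping - which is what the platforms below handle for you.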


The freeware ones are:

caffe  (From Berkeley, so worth a long hard look before picking any other package for anything other than "playing around".)
http://caffe.berkeleyvision.org/

caffe 2
http://caffe2.ai/
has C++ bindings but is optimized for Python; see below for the hegemony of Python in this area

 neuroph 
http://neuroph.sourceforge.net/ 

weka 
http://www.cs.waikato.ac.nz/ml/weka/ 

DL4J 
https://deeplearning4j.org/ 
for folks who use java 

FANN 
http://leenissen.dk/fann/wp/


Python-based tools (the market seems to have decided that finding/training programmers with Python skills is easiest):
Pybrain  http://pybrain.org
Pytorch http://pytorch.org  (lots of good tutorials and discussion forums for this one)
Theano http://deeplearning.net/software/theano
Keras  https://keras.io
Scikit  http://scikit-learn.org
Tensorflow https://www.tensorflow.org  (Google's tool, so take care, as they have a habit of killing everything we like that does not sell ads)
CNTK https://www.microsoft.com/en-us/cognitive-toolkit/  (Microsoft, so don't expect it to stay "free" forever)

             ***********************************************
The BEE-L mailing list is powered by L-Soft's renowned
LISTSERV(R) list management software.  For more information, go to:
http://www.lsoft.com/LISTSERV-powered.html
