Baseline

Welcome to the BASELINE project wiki page!

Baseline was initiated as a project of the EPIDEMIUM Challenge4Cancer, a 6-month hackathon launched by Roche France and La Paillasse on the 5th of November 2015.

The Baseline mathematical concept: cancer risks Y as a function of many variables X, even though not all variables are available in all geographic zones.

Description

Please find here a one-page presentation of the BASELINE project (poster) File:poster_Baseline.pdf

The BASELINE project in a nutshell:

What? "Cancer Risk = f(a,b,...h). Oh, d? Oh, but z too?" Predict cancer incidence/mortality/survival using risk factors from open data sources (with a worldwide scope and a regional granularity).

Why? Cancer is a misunderstood disease. Available health data are underused. Once started for cancer, the approach can be extended to other threatening illnesses and hopefully one day help keep the most important diseases at bay, as well as improve recovery. "Longer, healthier lives."

When? November 2015 - April 2016

How? The process is Big-Data-oriented:
• Collect data
• Extract trends
• Build a tool
• Test the product in real-life settings

Where? "The scope is worldwide. Everybody is welcome." You just need an internet connection to jump in.

Who? Many skills are required:
• Health professionals (general practitioners, public health, oncology, epidemiology)
• Statisticians (data architects, econometricians, actuaries)
• Developers (R/Python, machine learning, data visualization, Web)
• Communication (social marketing, design, newsletter writing)

Please find here a PDF presentation of the BASELINE project.

Contributors

Here is a fraction of the contributors:

Name Edouard Debonneuil Augustin Terlinden Peter-Mikhaël Richard Camille Pouchol Julien Pasquier Dahbia Agher Mayumi Iitsuka Joseph Sébastien

Collects Y and Y data

Assembly of collaborative efforts

Project manager

Collects Y and X data

Organizes the data assembly. Prepared a web tool for anyone to add data Collected and assembles data for Austria Extracted Y for England & Wales E-health project manager

Extracted some Y for Germany for Algeria and for the mortality rates for 16 countries. Presented the project several times

Architect

Collects for Japan

Extracted some Y for Germany
Name Claude Touche Delphine Bertram Rudy Mustafa Thibaut Bideault Joseph x Lam Benoit Choffin Frédéric Planchet
Main Achievement Check results "in real life".

Baseline Médicament

Insight into medicines

Baseline Médicament

ISFA student / DSS coordinator

Collected and assembles X for England & Wales

Investigated CART modeling on the database

ISFA student / coordinator

Collected and assembles X for England & Wales

Investigated PLS regressions on the database

ENSAE student

Full analysis for France

ENSAE student

Full analysis for France

ISFA Professor

NGUYEN Thi Nga Stéphane Loisel Xavier Milhaud dongoclinhttc hado0601 tantran1311 adrien sarr
ISFA student

Collects & assembles X for Australia

ISFA Professor

ISFA Professor

ISFA student

Collects & Assembles X for Germany

ISFA student

Collects & assembles X for the USA

ISFA student

Collects & assembles X for the USA

ISFA student

Collects & assembles X for Canada

ISFA student

Collects & assembles X for Canada

Stephane Loisel, Alexis Bienvenue, Xavier Milhaud (ISFA professors): Supervised and helped ISFA students with their contributions to the Baseline project.

Djalel Benbouzid (UPMC professor and RAMP organizer): Organized RAMPs with Baseline data; invaluable help in preparing the data and starting kits for the RAMPs.

Arthur Charpentier (actuarial professor at the University of Rennes): Proposed the Baseline project as a positive humanitarian project to which data scientists from the Institute of Actuaries could contribute.

Benjamin Schannes (data scientist from the Institute of Actuaries): Worked on solutions to make "very effective but black box models" less of a black box, in order both to reach good Baseline modelling and to deliver public health impact. Won RAMP1.

TRUONG Tai Anh huongisfa4 ngohang Sylvain Durand Elena Starkov Bruno S Seraya Maouche Frederic Kozlowski
ISFA student

Collects & assembles X for Australia

ISFA student

Collects & assembles X for Japan

ISFA student

Collects & assembles X for Japan

Medical perspective

Baseline Medicine

Medical insight on cancers Collect of cancer mortality rates for Germany and the UK Project leader BD4Cancer

R expert

Strong help in moving data to the sql system

Goals

Big challenge, high stakes!

The Baseline project aims at:

• developing and validating an epidemiological model based on aggregated data;
• discovering new risk factors using this model;
• building a large, robust database for the cancer scientific community;
• producing a tool that helps a general practitioner (GP) determine whether a random patient entering her/his office is more or less at risk of cancer than an average patient ("good" cancer behaviours for patients).

Methodology

Data collection

Large efforts have been initiated within this project to select variables (X and Y) and collect data in multiple countries.

Variables standardization

To ensure homogeneity in the coding of variables, we first created a data dictionary to centralize all selected variables. This dictionary serves as a metadata repository and is used by project members to structure data by country (or region). Each variable is identified by a unique code, a measure, a unit and, where available, a short description.
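As an illustration of how such a dictionary can drive unit harmonization, here is a minimal Python sketch. The variable codes, units and conversion factor are hypothetical examples, not entries of the actual Baseline dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    """One data-dictionary entry: unique code, measure, unit, optional description."""
    code: str
    measure: str
    unit: str
    description: str = ""


# Hypothetical entries; the real dictionary uses its own codes and units.
DICTIONARY = {
    "X_ALCOHOL": Variable("X_ALCOHOL", "alcohol consumption per capita", "litres/year"),
    "X_RADON": Variable("X_RADON", "indoor radon concentration", "Bq/m3"),
}

# Hypothetical conversion factors from a source unit to the dictionary unit.
TO_DICTIONARY_UNIT = {
    ("X_ALCOHOL", "centilitres/year"): 0.01,
}


def standardize(code, value, unit):
    """Return `value` expressed in the unit recorded in the dictionary."""
    variable = DICTIONARY[code]  # raises KeyError for an unknown code
    if unit == variable.unit:
        return value
    return value * TO_DICTIONARY_UNIT[(code, unit)]
```

Keeping the conversions next to the dictionary means that an unknown code or an unexpected unit fails loudly instead of silently entering the matrix.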

Data Management

Baseline has collaborated with the BD4Cancer project to integrate the data collected within Baseline into EpidemiumDB, a large database for cancer Big Data and epidemiology research. Scripts (R, Perl, and Python) are provided by BD4Cancer on the Epidemium GitHub to allow connection to and data retrieval from EpidemiumDB.

An SQLite version of the whole database will be available online. Baseline tables have been described on the EpidemiumDB wiki. Cancer risk factors, statistics by country, and other data types were provided in EpidemiumDB by the BD4Cancer team, as tables common to all projects.

Regression & Classification (Logistic meta-regression)

Time dimension

Fill in the blanks (through Bayesian statistics)

Pool variables together (through multivariate statistics)

Optimize variable picking (through machine learning techniques)

Comparison of other historical data with BASELINE

Collect other data

Create a tool which compares with BASELINE

Validate

Comparison of real-life data with BASELINE

E-cohort questionnaire

Prospective analysis

How to start

First of all, welcome! Here are a few instructions aiming at setting up your smooth landing on the BASELINE planet.

Do you know people who might be interested in the project? If so, please send them this email. In particular, do not hesitate to think of universities where you have studied, in France or abroad. Right now we are happy to gather various students from ISFA (Lyon), ENSAE and Ecole Centrale Paris. The Epidemium challenge finishes in May. We are now more than 50 people on the Cancer Baseline project, so we have developed the tools to work easily in large numbers, and it would not be a problem if we were many more (many variables, many countries to investigate). It is therefore still a perfect time to join.

Enroll

It takes about 15 minutes and then you are all set.

Register to EPIDEMIUM.

Register to BASELINE (at the top right of the page, click on the small blue "suivre" (follow) button and then on the small "rejoindre" (join) button that appears next to it).

Register to WIKI (our public communication).

When you receive an email inviting you to join SLACK (our team communication tool), follow the instructions.

If you are blocked somewhere, email cancerbaseline@googlegroups.com and we will accompany you!

You are in!

Select a country which has not been chosen yet and add your name in the right column on the table.

Create your own wiki page by entering the following URL: http://wiki.epidemium.cc/wiki/Baseline/yourslackidentifier. Use your Slack username so we can quickly find you. When initiating the page, click on "create this page". On this wiki page (your page!), you will display the results of your research. Check the following example to better understand how to proceed.

Note the regional subdivisions recorded for your country. They reflect the granularity of the cancer incidence/mortality data we have already collected for all the countries. Note the various cancer risk factors (listed by WHO). Feel free to add risk factors which are not on the list… and let your Slack team know about them.

Then you are all set. Find the value of each listed (and non-listed) risk factor in each region of your country(ies) of interest and fill in a table (follow the US example). Examples of Google searches: radon concentration in Alabama, age at menarche in Loire-Atlantique, etc.

Help

‘Cause you might need some;-)

Post your question on #general on slack

Contact your slack members individually, e.g. edebonneuil, augustin.

Good luck and good work

Cheers

The BASELINE team

Countries of interest - choose your country!

Please add your Slack name in the column "Who?" of the following tables, on the lines where you will contribute.

The Y are specified by age [0-4, 5-9, ..., 80-84, 85+], gender, cancer type, region, and year. For the X, for the sake of simplicity, only by-region data need to be extracted. More is better but not required.
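The age bands above can be generated and assigned mechanically. The following Python sketch assumes five-year bands with an open-ended last band, as listed; the function name is ours.

```python
# The 18 age bands used for the Y: 0-4, 5-9, ..., 80-84, 85+
AGE_BANDS = [f"{low}-{low + 4}" for low in range(0, 85, 5)] + ["85+"]


def age_band(age):
    """Return the five-year band label that contains `age` (in whole years)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age >= 85:
        return "85+"
    low = (age // 5) * 5
    return f"{low}-{low + 4}"
```

A Y value is then keyed by the tuple (age band, gender, cancer type, region, year), while an X value only needs (region) or (region, year).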

ISFA students

Country Regions Who?
Japan Aichi Prefecture, Fukui Prefecture, Hiroshima, Miyagi Prefecture, Nagasaki Prefecture, Osaka Prefecture, Yamagata Prefecture huongisfa4, ngohang
Australia New South Wales, Northern Territory, Queensland, South, Tasmania, Victoria, Western, Capital Territory truongtaianh, nguyennga
England and Wales East of England Region, Merseyside and Cheshire, North Western, Northern and Yorkshire, Oxford Region, South and Western Regions, Thames, Trent, West Midlands, Wales rudy, thibaut
US Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, Florida, Georgia, Hawaii, Idaho, Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Jersey, New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, Wyoming hado0601, tantran1311
Germany Bavaria, Brandenburg, Berlin, Brandenburg, Mecklenburg, Bremen, Hamburg, Mecklenburg-Western Pomerania, Munich, Rhineland-Palatinate, Saarland, Saxony, Schleswig-Holstein, Thüringen, Northrhine-Westphalia dongoclinhttc

Other

Country Regions who? is in charge
Epidemium's start package WHO in charge of investigating data of risk factors and preventive factors (ex: tobacco)
Albania
Algeria Setif dahbia
Armenia
Austria Tyrol, Vorarlberg Camille Pouchol
Azerbaijan
Bahrain Bahraini people only
Belarus
Belgium Antwerp, Flanders Sarr
Belize
Brazil Brasilia, Cuiaba, Goiania, Sao Paulo rudy
Bulgaria
China Guangzhou City, Hong Kong, Jiashan, Nangang District, Shanghai, Zhongshan nguyennga
Costa Rica
Croatia David Banquet
Cyprus David Banquet
Czech Republic David Banquet
Czechoslovakia
Denmark Julien
Dominican Republic
Egypt Gharbiah Romain
Estonia
Finland Ngohang
France Bas-Rhin, Calvados, Cote d'Or, Doubs, Finistere, Haut-Rhin, Herault, Isere, La Martinique, Loire-Atlantique, Manche, Normandy, Loire, Somme, Tarn, Vendee JosephxLam, bchoffin, or?
French Polynesia
The Gambia
Georgia
Gibraltar
Greece
Guatemala mirlaude
Hungary
Iceland Ngohang
India Barshi
India Barshi, Bhopal, Chennai, Chennai, Karunagappally, Mumbai, Nagpur, New Delhi, Poona, Trivandrum
Ireland Ngohang
Israel Jerusalem District, Northern District, Haifa District, Central District, Tel-Aviv District, Southern District Romain
Italy Biella Province, Brescia Province, Ferrara Province, Florence and Prato, Genoa Province, Macerata Province, Milan, Modena Province, Naples, North East Cancer Surveillance Network, Parma Province, Ragusa Province, Reggio Emilia Province, Romagna Region, Salerno Province, Sassari Province, Sondrio, Syracuse Province, Torino, Umbria Region, Varese Province, Veneto Region Thibaut Bideault
Kazakhstan
Korea (Republic of) Busan, Daegu, Daejeon, Gwangju, Incheon, Korea, Jejudo, Seoul, Ulsan truongtaianh
Kuwait
Kyrgyzstan
Latvia
Lebanon Romain
Lithuania
Luxembourg
Malaysia Penang, Sarawak
Malta tantran1311
Mauritius
Mexico Camille Pouchol
Moldova (Republic of)
New Zealand truongtaianh
Nicaragua
Norway dongoclinhttc
Oman
Uganda mirlaude
Pakistan South Karachi
Panama
Paraguay
Philippines Manila, Rizal huongisfa4
Poland Cracow, Kielce, Lower Silesia, Rzeszow, Warsaw City
Portugal Azores, Centre, North, South, Porto, South Regional rudy
Romania ECO: Cluj, Timisoara
Russia (Russian Federation) St Petersburg rudy,
San Marino
Senegal Sarr (? cancer data not found at this stage)
Serbia centre
Singapore huongisfa4
Slovakia (Slovak Republic)
Slovenia
South Africa mirlaude
Spain Albacete, Asturias, Balears, Basque Country, Canary Islands, Cuenca, Girona, Granada, Murcia, Navarra, La Rioja, Tarragona, Zaragoza Adrien Montagnon
Suriname
Sweden dongoclinhttc
Switzerland Geneva, Graubunden and Glarus, Neuchatel, St Gall-Appenzell, Ticino, Valais, Vaud tantran1311
Tajikistan
Thailand Chiang Mai, Lampang, Songkhla dongoclinhttc
The Netherlands Eindhoven, Maastricht nguyennga
Tunisia Centre
Turkey Antalya, Izmir
Turkmenistan
Northern Ireland (part of UK)
Scotland (part of UK) Thibaut Bideault
Ukraine
Uruguay
Uzbekistan
Venezuela
Vietnam? dongoclinhttc
West Africa mirlaude
Zimbabwe Harare

A possible source of information (open data by country):  http://www.datasciencecentral.com/profiles/blogs/great-github-list-of-public-data-sets?overrideMobileRedirect=1

For each country, you should first look for data on known cancer risk factors:

Cancer Risk factors

Identification of cancer risk factors

We identified most known cancer risk factors (n=107) and centralized them within the database of the project (BaselineDB). We used multiple sources to collect these data, including the National Cancer Institute (NCI) website and the International Agency for Research on Cancer (IARC) Monographs.

Validation by an expert.

All risk factors collected within Baseline have been confirmed by an oncologist (Dr Nicolas de Chanaud).

GROUPS: SUB-GROUPS

Age: Age

Arsenic, metals, fibres and dusts: Arsenic; Asbestos (Chrysotile, amosite, crocidolite, tremolite, actinolite, anthophyllite); Beryllium; Cadmium; Chromium; Crystalline Silica (respirable size); Erionite; Hexavalent Chromium Compounds; Indoor Emissions from the Household Combustion of Coal; Leather dust; Nickel; Thorium; Wood Dust

Diet: non-Antioxidants; Artificial sweeteners (e.g. aspartame); non-Calcium; Charred meat; non-Cruciferous vegetables; non-Garlic; Fluoride; non-Tea; non-Vitamin D

Hormones: Starting menstruation early; Going through menopause late; Being older at first pregnancy; Never having given birth

Immunosuppression: Transplants; Human Immunodeficiency Virus (HIV)

Infectious Agents: Human Papillomaviruses (HPVs); Hepatitis C Virus (HCV); Hepatitis B Virus (HBV); Human T-cell Leukemia/Lymphoma Virus Type 1 (HTLV-1); Human Immunodeficiency Virus (HIV); Epstein-Barr Virus (EBV); Human Herpesvirus 8 (HHV8); Merkel Cell Polyomavirus (MCPyV); Helicobacter pylori (H. pylori); Schistosoma hematobium; Opisthorchis viverrini; Clonorchis Sinensis

Radiation: Radon; Neutron; Internalized alpha-particle or beta-particle emitting radionuclides; X-Rays and gamma-radiation; Solar and ultraviolet (UV) radiation

Pharmaceuticals: Azathioprine; Busulfan; Chlorambucil; Chlornaphazine; Ciclosporin; Cyclophosphamide; Diethylstilbestrol; Etoposide associated with cisplatin and bleomycin; Melphalan; Methoxsalen plus ultraviolet A radiation; Methyl-CCNU; MOPP; Phenacetin; Tamoxifen; Thiotepa; Treosulfan; Estrogen-only menopausal therapy; Combined estrogen-progesterone menopausal therapy; Combined estrogen-progesterone contraceptives; Aristolochic Acids (in plants)

Habits: Tobacco smoking; Secondhand tobacco smoke; Smokeless tobacco; N'-nitrosonornicotine and 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone; Betel quid and areca nut; Alcohol; Chinese-style salted fish

Obesity: BMI; weight

Height

Chemical agents and related occupations: 4-aminobiphenyl; Benzidine; Dyes metabolized to benzidine; 4,4'-methylenebis(2-chlorobenzenamine); 2-naphthylamine; Toluidine; Auramine; Magenta; Benzopyrene; Coal gasification; Coal Tar and Coal-Tar Pitch; Coke-Oven Emissions; Mineral Oils: Untreated and Mildly Treated; Shale oils; Soot (Chimney sweeps); Aluminium production (exposure); Aflatoxins; Benzene; Ether; 1,3-Butadiene; Chlorodibenzodioxin, chlorodibenzofuran, chlorobiphenyl; Ethylene Oxide; Formaldehyde; Sulfur mustard; Vinyl Chloride; Strong Inorganic Acid Mists Containing Sulfuric Acid; Mists from strong inorganic acids; Iron and steel founder (exposure); Painter (exposure); Rubber-manufacturing industry (exposure)

New cancer risk factors

Specific drugs might influence the (non) incidence of specific cancers.

[Y|X] matrix on GitHub!

https://github.com/Epidemium/Baseline/tree/master/MATRIX

After three months of intense, distributed data collection, the collected data was assembled at that link. This was quite a challenge, as different variable names and different units had been found for different countries, so we built a dictionary of variables including units, and conversions had to be performed to assemble everything. The resulting matrix has:

• About 250 geographic areas: 98 countries at country level and 5 countries at department/state/region level.
• About 6000 Y: age-standardized cancer mortality risks are available for those areas at different years from 1989 to 2013.
• Most of the time we also have the risks by age band, but we have not assembled them yet: as we rarely have X factors by age, they would mostly bring additional complexity to the analysis.
• About 230 X: in addition to the 109 factors searched for, many other factors emerged, such as unemployment, fast-food expenses or age of marriage.
• Sparse matrix: a few X factors are available for most Y, a few dozen X are available for a few geographic areas, and other X are for now available only in rare geographic areas. It is part of the data science challenge of the project to analyse such data shapes.
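To get a feel for this sparsity, one can compute, for each X factor, the share of matrix lines on which it is available. The Python sketch below uses a made-up four-line extract; the factor names and values are illustrative only.

```python
def coverage(rows, factors):
    """Share of rows (between 0 and 1) on which each factor has a value."""
    total = len(rows)
    return {
        factor: sum(1 for row in rows if row.get(factor) is not None) / total
        for factor in factors
    }


# Made-up extract of a [Y|X] matrix: one dict per (area, year) line.
toy_rows = [
    {"year": 2000, "urban_pct": 75.0, "radon": 60.0},
    {"year": 2005, "urban_pct": 77.0, "radon": None},
    {"year": 2000, "urban_pct": None, "radon": None},
    {"year": 2005},
]

toy_coverage = coverage(toy_rows, ["year", "urban_pct", "radon"])
```

Sorting factors by coverage gives a natural order in which to introduce them into a model, from the most widely available to the rarest.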

https://github.com/Epidemium/Baseline

We have set up a whole environment around the matrix there:

• Data: the input: open data taken online
• Matrix: the data assembled so far into a [Y|X] matrix
• Models: the data analysis performed so far
• Sowhat: corresponding suggestions for public health

Data analysis

Higher prostate cancer in African origins

In parallel with the collection and assembly of data, we took one particular phenomenon as a way to get a feel for the approach.

1. Intriguing phenomenon. Can our approach get it?

Figure (left): A known phenomenon. Age-adjusted prostate cancer mortality rate as a function of date. The red line (black population) is clearly above the others. Source: http://www.cdc.gov/cancer/prostate/images/2012_prostate_race_death.gif

Health topic: In the USA, African Americans have higher cancer incidence and mortality than other ethnic groups (graph on the left). Why is that? We have no idea, but a quick search on PubMed shows many articles, so we could investigate.

Action: We can use this as a "level 0" test of aggregate epidemiology: do we observe higher prostate cancer mortality in US states that have a greater proportion of African Americans? Answer: yes (graph on the right). At that point, we thought that it might come from poor integration of black populations in the USA (?).

Figure (right): Phenomenon found through our approach. In the graph, each point is a state. Similar results are found when the analysis is done by county rather than by state. Of course, we should ideally do a multivariate analysis, but we haven't collected data for that yet. Source: we built the graph from CDC WONDER data.

2. Serendipity: when available data guide us to discoveries!

It happens that in the USA, data on cancer risks is reported for each 'race'. So it was easy: we had a look at prostate cancer risks among African Americans. Surprise: *in every state* the risk of prostate cancer is greater for African Americans (graph on the left: black points above white points). Could it depend on genes or habits that are specific to these populations?

A few days later, looking for research that might be similar to this Baseline project, we found cancer risks per world area. Surprise (graph on the right): where are the highest prostate cancer risks? In Africa and in areas with a strong African origin. This suggests that, whatever the overall conditions of the country, populations of African origin have higher prostate cancer risks. This extra risk is not observed in, say, breast cancer. Source: http://globocan.iarc.fr/Pages/fact_sheets_cancer.aspx

Does this mean that the presence of black populations of African origin should be taken into account when judging an empirical level of cancer mortality? Does it mean that these populations are particularly sensitive to some risk factors? Those are typical questions that will be addressed by building the baseline. Of course, since we haven't investigated the literature on the topic yet, the result may already be well known and such an investigation may not be of high importance. It is simply an appetizer to go further in the project.

RAMP - Rapid Analytics and Model Prototyping

40 data scientists competing for the best model for cancer mortality

This was a one-day event on Saturday February 13th at La Paillasse, organized with UPMC: a competition between data scientists. We dedicated a page to that event (http://wiki.epidemium.cc/wiki/Baseline/Ramp), but in brief:

• 40 data scientists analyzed the preliminary data and tried to model "Y=f(X)", where Y is the mortality rate due to some cancer. We chose a "wide-angle" approach, as opposed to the focused "high prostate cancer mortality risk among African populations in the USA" example, in order to try something different.
• The spirit and energy were impressive. Some very fine models of cancer mortality rates were obtained. However, the best models were "black box models": it is difficult to understand the shape of the Y=f(X) that the computer calibrated. This suggests that the type of data analysis should differ depending on the needs. We should have:
• black box models for optimal prediction
• non-black-box models to decipher particular effects
• more manual data analysis as seen with the prostate cancer example above
• Preparing the RAMP, and drawing its conclusions, led us to think hard about which medical questions to answer with the data, and forced us to ask a precise question. The thoughts gathered so far, until we choose to develop one or several of them, are linked in the next point.
• https://github.com/Epidemium/Baseline/tree/master/SOWHAT/medical_guidance provides views gathered from 6 public health medical doctors in the project. Students from ISFA have since gathered data for 10 of the suggested variables, and the new data has been inserted in the matrix.

A non-black-box model: GLM Aggregate

The model

Suppose for the sake of simplicity that we have three X factors, U, V and W, that U is available for all countries (e.g. year), that V is available for a subset of the countries (e.g. percent urban population), and that W is available for even fewer countries.

The "baseline" approach

We could do successive linear regressions (of logit Y in order to produce mortality probabilities between 0 and 1)

• baseline on all lines for U: logit Y = a + b U (equation 1)
• baseline on the lines where V is defined: logit Y = a' + b' U + c V (equation 2)
• We could suppose b' as given and equal to b
• ...Though, should b' from equation 2 rather be considered as given for equation 1 if V is defined on many lines?
• baseline on the lines where W is defined: logit Y = a'' + b'' U + c' V + d W (equation 3)
• With the same type of choice/question for b'' and c'

The word "baseline" is used here because this was the methodology envisioned on the first day of the project, and it helped find the name "Baseline". We do not mention here the shape of the error component, but of course a larger population should have more weight, and such equations can in fact typically be logistic regressions.
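To make equation 1 concrete, here is a minimal Python sketch of a population-weighted least-squares fit of logit Y on a single factor U. It is a simplification for illustration (a weighted linear regression on the logit scale rather than a full logistic regression), and the toy data are made up.

```python
import math


def logit(p):
    """Map a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Inverse of logit: map a real number back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logit_line(u, y, weights):
    """Weighted least-squares fit of logit(y) = a + b*u; returns (a, b)."""
    z = [logit(p) for p in y]
    total = sum(weights)
    mean_u = sum(w * ui for w, ui in zip(weights, u)) / total
    mean_z = sum(w * zi for w, zi in zip(weights, z)) / total
    sxx = sum(w * (ui - mean_u) ** 2 for w, ui in zip(weights, u))
    sxz = sum(w * (ui - mean_u) * (zi - mean_z)
              for w, ui, zi in zip(weights, u, z))
    b = sxz / sxx
    return mean_z - b * mean_u, b


# Toy data (made up): U is years since 1989, weights are populations in millions.
years = [0, 5, 10, 15, 20]
populations = [5.0, 60.0, 10.0, 80.0, 1.0]
rates = [inv_logit(-6.0 + 0.02 * t) for t in years]
a, b = fit_logit_line(years, rates, populations)
```

Working on the logit scale guarantees that the fitted mortality probabilities, obtained with `inv_logit`, stay between 0 and 1.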

The "Aggregate GLM" approach

With the above approach, if b=b'=b'' and c=c', the equations above end up differing only by the constant, so we can summarize everything in one line:

$$\operatorname{logit}(Y) = (a + a_V 1_V + a_W 1_W) + b\,U + c\,(V 1_V) + d\,(W 1_W)$$

where $1_V$ (respectively $1_W$) equals 1 on the lines where V (respectively W) is defined and 0 elsewhere. This can actually be performed with a single logistic regression. The final $b$ ends up somewhere between the $b$ of equations 1, 2 and 3, depending on where there is the most weight, and the same holds for $c$: it is an elegant and simple improvement of the baseline approach. That is what we decided to call the "Aggregate GLM" approach, and we can of course perform it with more than three X factors.
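The single-regression formulation can be sketched in Python as follows. This is not the project's R code: it is a minimal illustration that builds the design row $(1, 1_V, 1_W, U, V 1_V, W 1_W)$ with missing values set to 0, and fits it by weighted least squares on the logit scale (a simplification of a logistic regression). The toy observations and coefficients are made up.

```python
def design_row(u, v=None, w=None):
    """Row [1, 1_V, 1_W, U, V*1_V, W*1_W]; a missing V or W contributes 0."""
    has_v = 0.0 if v is None else 1.0
    has_w = 0.0 if w is None else 1.0
    return [1.0, has_v, has_w, float(u), (v or 0.0) * has_v, (w or 0.0) * has_w]


def solve(matrix, rhs):
    """Solve a small dense linear system by Gauss-Jordan elimination."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for k in range(col, n + 1):
                    m[r][k] -= factor * m[col][k]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(rows, logit_y, weights):
    """Weighted least squares: solve (X'WX) beta = X'Wz."""
    p = len(rows[0])
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
             for j in range(p)] for i in range(p)]
    xtwz = [sum(w * r[i] * z for r, z, w in zip(rows, logit_y, weights))
            for i in range(p)]
    return solve(xtwx, xtwz)


# Made-up coefficients (a, a_V, a_W, b, c, d) and observations (U, V, W).
TRUE_COEFFICIENTS = [-6.0, 0.3, -0.2, 0.5, 1.0, 2.0]
observations = [
    (0, None, None), (1, None, None), (2, None, None),            # U only
    (0, 0.2, None), (1, 0.5, None), (2, 0.9, None), (3, 0.4, None),  # U and V
    (0, 0.3, 1.0), (1, 0.8, 2.0), (2, 0.1, 0.5), (3, 0.6, 3.0),   # U, V and W
]
rows = [design_row(*obs) for obs in observations]
logit_y = [sum(c * x for c, x in zip(TRUE_COEFFICIENTS, row)) for row in rows]
coefficients = fit(rows, logit_y, [1.0] * len(rows))
```

Because lines without V or W still contribute to the estimate of $b$, no observation is thrown away, which is the point of the approach for a sparse matrix. The solver assumes the design matrix has full rank; it would divide by zero otherwise.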

First results

Some R code to analyze the matrix can be found at https://github.com/Epidemium/Baseline/tree/master/MODELS/GLM_Aggregate

Let us go directly to the first empirical common results:

robustly* associated with more cancer mortality, at population level | fake public suggestions (causality is not proven!)
being a male! and of some Ethnic origins for some cancers | ask males to dress, act and sing like women
alcohol (drinking more on average than average populations) | make alcohol sellers unemployed
long term unemployment | make alcohol sellers employed
higher blood pressure | implant a pedometer to everyone and index taxes on daily walk
more blood cholesterol | implant a pedometer to everyone and index taxes on daily walk
being in a country where women marry young (we guess that the effect must be quite indirect ;) | ask people to relocate

(*) As of February 25th 2016, 60 risk factors have been estimated to be present across sufficiently many geographical areas in the world to run some simple regressions on them. A specific method (GLM Aggregate) was developed to handle the sparsity of the matrix. It was run separately under various stress tests (rotations of variables and of geographical areas, logit regression of Y or linear regression of ln Y, weighting by populations or not), and the variables listed are those that were always associated with the same sign of impact on cancer; hence the name "robust". However:

• No causality is established here. We added fake public suggestions next to each variable so that a casual reader seeing the table does not take it too seriously.
• This is a first modelling attempt. As modelling proceeds, data errors or other issues may be found that could affect those results.
• Many more effects of variables might be found as well: this is just a preliminary result.

We felt that publishing those results online was not problematic because they are largely what one would expect.
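The "robust" criterion described above (same sign of impact in every stress test) can be expressed as a small Python sketch. The coefficient values below are made up, and the function name is ours.

```python
def robust_factors(runs):
    """Keep the factors whose coefficient has the same nonzero sign in every run.

    `runs` is a list of dicts mapping factor name -> estimated coefficient,
    one dict per stress test. Returns a dict factor -> +1 or -1.
    """
    common = set.intersection(*(set(run) for run in runs))
    robust = {}
    for factor in sorted(common):
        signs = {(run[factor] > 0) - (run[factor] < 0) for run in runs}
        if len(signs) == 1 and 0 not in signs:
            robust[factor] = signs.pop()
    return robust


# Made-up coefficients from three hypothetical stress tests.
toy_runs = [
    {"alcohol": 0.40, "tea": -0.10, "unemployment": 0.05},
    {"alcohol": 0.25, "tea": 0.30, "unemployment": 0.02},
    {"alcohol": 0.31, "tea": -0.20, "unemployment": 0.07},
]
toy_robust = robust_factors(toy_runs)
```

This criterion is deliberately conservative: it says nothing about effect size or causality, only that the direction of the association does not flip when the modelling choices change.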

More data analysis

The RAMP-style and Aggregate GLM algorithms can be further enhanced, both empirically and theoretically (e.g. for the latter, the weighting between b, b' and b'' can be refined with credibility theory; another key enhancement is dimension reduction). Rather than having the team impose a direction, let us make it totally open:

4 separate groups of analysis at ISFA

At ISFA, 14 students who strongly contributed to the collection of data so far will now split into four groups, each supervised by a professor. Each group will try to advance independently of the others (or at least differently!). Each group will present an interim report at the end of March, and the four professors will guide them towards an improved analysis for written reports in mid-May.

The main conclusion as of the end of April is that PLS regression is very well suited to the shape of the Baseline data (https://en.wikipedia.org/wiki/Partial_least_squares_regression, in French https://fr.wikipedia.org/wiki/R%C3%A9gression_des_moindres_carr%C3%A9s_partiels). At the time of the closing of the wiki the results aren't finalized, but they are similar to those of the GLM Aggregate regression above. This confirms that the data can be used to identify risk factors, so that if we continue collecting more data we should identify more and more risk factors and better solutions for public health.

RAMP2 - Rapid Analytics and Model Prototyping

40 data scientists competing for the best model for cancer mortality

On Saturday April 30th 2016 at La Paillasse, a second competition of data scientists took place to do "Y=f(X)"

• This time we introduced the age variable, since it is so strongly correlated with cancer risks, which made the dataset to investigate considerably bigger. We also introduced cancer incidence risks rather than modelling mortality only. And we investigated to what degree different types of cancer are correlated, in order to show that the dataset allows various medical questions to be asked.
• Precisely, the question asked was to model digestive cancer mortality risks (intestine, colon, rectum and anus, liver, gallbladder) as a function of age and other types of cancer.
• There were 40 submissions. A big bravo to the Artix41 team for their "NeuralNetworks" model, which provided not only the best result but also the most original one of the day! Their code is at http://www.ramp.studio/2f2b5bb4bd91ccd117abf9123fd503cb67f99ec3/feature_extractor.py and http://www.ramp.studio/2f2b5bb4bd91ccd117abf9123fd503cb67f99ec3/regressor.py

Next

At the end of the challenge, the data analysis does not yet provide strong public health suggestions, apart perhaps from the following:

• age is by far the most important variable, suggesting that cancer is associated with aging above all other factors (smoking, sun, alcohol, etc.), so biomedical research linking aging and cancer is likely to bring solutions (e.g. perhaps thymus atrophy leads to weakened immune systems that insufficiently tackle cancers, but various mechanisms are possible)
• ethnic origin seems to be a medical reality for some cancers, so the health interests of the people concerned should come first; looking, for example, for common risk factors between Afro-Americans, Caribbeans and people in Africa seems important -- something that we hope to achieve with the continuation of this cancer baseline project
• well-known factors are found with models using the already collected data (age, gender, alcohol, blood pressure, cholesterol), so the project must be continued

The next section aims at continuing the project towards much greater achievements.

Building the future

Some frail roots have been laid. We now perform analyses, but we also want to build something robust and, for some part (not all), oriented towards meaningful results, in order to get the best out of the approach:

An IT system to go to another scale

GitHub

The data was put on GitHub (data folder), as well as the code for the "Aggregate GLM" approach (model folder): https://github.com/Epidemium/Baseline

GitHub is indeed a standard way to openly share data and programs, and it allows anyone to contribute. We will continue to regularly provide the matrix there as a CSV or SQL file, together with some code. However, GitHub alone is insufficient for a large collaborative effort, because it is difficult to access.

SQL and EpidemiumDB

In March-April, together with other Epidemium teams, we decided to switch from the (great) Data Science Studio and Excel tools to SQL and EpidemiumDB, in order to use only open-source software and to create synergies between Epidemium projects and their respective data. The data is added to a MySQL server (the description of tables is here: http://wiki.epidemium.cc/wiki/EpidemiumDB#Cat.C3.A9gorie_2_-_Tables_demand.C3.A9es_par_le_projet_Baseline) and it is regularly copied to an SQLite database for data analysis and visualisation.
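As an illustration of the analysis step on the SQLite copy, the following Python sketch rebuilds a small [Y|X] matrix from observations stored in long format. The table name, column names and values are hypothetical and do not reflect the actual EpidemiumDB schema; an in-memory database stands in for the real file.

```python
import sqlite3

# In-memory stand-in for the SQLite copy; the schema below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observation (area TEXT, year INTEGER, variable TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO observation VALUES (?, ?, ?, ?)",
    [
        ("FR", 2010, "Y_mortality", 0.0021),
        ("FR", 2010, "X_alcohol", 11.7),
        ("JP", 2010, "Y_mortality", 0.0018),
    ],
)


def to_matrix(connection, variables):
    """Pivot long-format observations into {(area, year): {variable: value}}.

    Variables without an observation for a given line are left as None.
    """
    matrix = {}
    query = "SELECT area, year, variable, value FROM observation ORDER BY area, year"
    for area, year, variable, value in connection.execute(query):
        row = matrix.setdefault((area, year), {name: None for name in variables})
        if variable in row:
            row[variable] = value
    return matrix


matrix = to_matrix(conn, ["Y_mortality", "X_alcohol"])
```

Storing observations in long format keeps the database schema stable when new X factors are added; the wide matrix is only materialized when an analysis needs it.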

Numerous obstacles (access to the interface to handle the data, then to the right database, then to the right table, then to the right fields, then to the right field formats, then too time-limited slots to run the desired SQL requests, and extremely slow SQL requests compared to the needs and the size of the data to work with -- taking several days to get the [Y|X] matrix from the stored data) have decayed the most medical-related aspects of the project by one to two months. However, the new system should allow to reach the desired target.

As a result, a future action might be to change systems once more, but for the moment we prefer to stick to this technical approach until the next sections are delivered, so that we truly get important results for health worldwide.

A web access for anyone to add data

A tool was created for anyone in the world to add data: number of pens sold in US states, number of medical doctors per inhabitant in Asian countries, etc. A first version of the tool works on the computer of one of its two developers. At the time of writing it does not yet work on the Epidemium server, for reasons still unknown, but we hope to make it work soon. It will be available at http://baseline.epidemium.cc/

It contains an example dataset so that anyone can learn in a few seconds how to add data and can then add their own. It defines a level of intermediate (skilled) users who review the proposed new data and decide whether to validate it for inclusion in the big matrix (if not, the contributor is asked to revise the data, and the reasons are specified).
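The review step described above can be sketched as a small state change on each submission. This is only an illustration under assumptions: the class `Submission`, the function `review` and the status names are hypothetical and do not describe the actual code of the web tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Submission:
    """A dataset proposed by a contributor (hypothetical model)."""
    contributor: str
    dataset_name: str
    status: str = "pending"
    reasons: List[str] = field(default_factory=list)


def review(submission: Submission, approved: bool,
           reasons: Optional[List[str]] = None) -> Submission:
    """Record the decision of an intermediate (skilled) user.

    An approved submission becomes eligible for the big matrix.
    A rejected one is sent back for revision and must state the reasons.
    """
    if approved:
        submission.status = "validated"
        submission.reasons = []
    else:
        if not reasons:
            raise ValueError("a rejection must state its reasons")
        submission.status = "revision_requested"
        submission.reasons = list(reasons)
    return submission


# Usage: a first review asks for a revision, a second one validates.
submission = Submission("contributor_1", "pens_sold_by_us_state")
review(submission, approved=False, reasons=["the year column is missing"])
first_status, first_reasons = submission.status, list(submission.reasons)
review(submission, approved=True)
print(first_status, first_reasons, submission.status)
```

Requiring reasons on rejection mirrors the rule that the contributor is told why the data must be revised.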

Once a first version is operational, it should be tested by the crowd that contributed to Baseline, closely or from afar, and its use will be promoted via social networks. That way, we hope that the Baseline effort will become global.

Guidance for medical and public health applications

As indicated above, as of early May 2016 we have only reached very modest medical and public health results, which nevertheless indicate that the project must be continued:

• age is by far the most important variable, suggesting that cancer is clearly associated with aging above all other factors (smoking, sun, alcohol, etc.), so biomedical research linking aging and cancer could bring solutions (e.g. perhaps thymus atrophy leading to weakened immune systems that insufficiently tackle cancers, but various mechanisms might be involved)
• ethnic origin seems to be a medical reality for some cancers, so the health interest of the people concerned should come first; looking, for example, for common risk factors between African Americans, Caribbeans and people in Africa seems important, something that we hope to achieve with the continuation of this cancer baseline project
• well-known factors are found with models using the already collected data (age, gender, alcohol, blood pressure, cholesterol) so the project must be continued

While the technological developments in the previous section have to be finished before reaching more interesting medical and public health suggestions, we like to keep in mind ideal thoughts on what could be achieved: a brainstorm on how best to proceed:

• http://wiki.epidemium.cc/wiki/Baseline/Thoughts : thoughts on the matter. We put them on a separate page to keep this main page clear; however, we encourage interested readers to read them and to contribute their own thoughts.

Last but not least, anyone who wishes to contribute further to the project can send an email to the core team: cancerbaseline at googlegroups dot com

Timeline

November 12th 2015: first meeting

At La Paillasse, following Orange's presentation on mobile data and health

Attendees:

• The astrophysicists (Iene - former astrophysicist, data scientist, text mining (2/3); Loic - astrophysicist, anonymous data scientist; Vincent - astrophysicist, USA; Esther - former astrophysicist, web developer; Peter - astrophysics, software development)
• Stephania - mobile data & epidemics, from Orange
• The ENSAE engineering students (Peter and Gil)
• Augustin - actuarial science, drug market access
• Vincent - bioinformatician
• Abdel - machine learning
• Pierre-Emmanuel - economist, health statistics
• Naeme - information systems engineer at AP-HP

Meeting held in French

Content:

• Questions about:
• the statistical relevance of the approach (number of departments, number of variables): it is a very new and disconcerting approach for epidemiologists
• how to share the data collection
• by hand, with hashtags and, if possible, automatic extraction (a bit of astrophysics ;)
• possibility of creating various Epidemium projects from the assembled data. E.g. zones of variables with high or low cancer risk, a generalist model...
• start with one country or first build a large database? : develop the techniques on an initially small database and grow it in parallel
• structuring of variables by ontology (LOINC reference terminology)
• how to validate the effects? 1) calibrate on some datasets and validate on others 2) prospective analysis?

November 19th 2015

Attendees: Irene; Loic; Peter; Marika; PE; Augustin; Edouard

Meeting held in French

Content:

• proof of the statistical relevance of epidemiology on aggregated data. Eurêkos!
• a first mini-study on prostate cancer, done during the week, supports the approach on a simple case.
• Reassuring, but it in no way proves the approach in general
• "if there is a tomato effect, the probability is zero that we can find it with these aggregated data"
• challenge accepted: to be tested! ;)
• There must be scientific literature on "epidemiology on aggregated data", right?
• to be looked up
• "with aggregated data, Simpson's paradox shows that you will not be able to know what is good at the individual level"
• "you will have correlations between collective effects; do not talk about causality"
• imagine the number of departments and countries needed
• we are an army, with more than 20 students. Let's collect the data!
• sharing of the data collection
• structuring of variables by ontology
• will be done once we have a first small pool of datasets
• Naeme is preparing LOINC or SNOMED instructions on the wiki; DelphineB suggested MedDRA for cancers
• how to validate the effects
• 1) calibrate on some datasets and validate on others
• 2) NEW: prospective validation at the individual level: eVeDrug and Talk are joining ActuRx to offer physicians an e-cohort questionnaire for their patients, in order to validate certain effects that would come out of Cancer Baseline
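
The Simpson's paradox objection raised in this meeting can be illustrated with a small numerical sketch. The counts below are invented for illustration only and are not Baseline data; the names `counts`, `rate` and `aggregated_rate` are likewise hypothetical. Within each age group the exposed population has the lower case rate, yet the aggregated rate is higher for the exposed population, because the exposed population is mostly old.

```python
# (cases, population) per age group and exposure status; invented numbers.
counts = {
    "young": {"exposed": (6, 87), "unexposed": (36, 270)},
    "old": {"exposed": (71, 263), "unexposed": (25, 80)},
}


def rate(cases, population):
    """Case rate within one group."""
    return cases / population


def aggregated_rate(group):
    """Case rate after pooling all age groups, as aggregated data would show."""
    cases = sum(counts[stratum][group][0] for stratum in counts)
    population = sum(counts[stratum][group][1] for stratum in counts)
    return cases / population


# Within each age group: is the exposed rate lower than the unexposed rate?
within = {
    stratum: rate(*counts[stratum]["exposed"]) < rate(*counts[stratum]["unexposed"])
    for stratum in counts
}

# After aggregation: is the exposed rate still lower?
overall_exposed_lower = aggregated_rate("exposed") < aggregated_rate("unexposed")

print(within)
print(overall_exposed_lower)
```

The direction of the association reverses once the age groups are pooled, which is why correlations found on aggregated data should not be read as individual-level effects.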
Presentation at CRI

Monday 23rd, morning: presentation of the project at the Inserm LIMICS laboratory in Paris

Tuesday 24th, evening: presentation of the project at an Epidemium meet-up at La Paillasse + informal meeting to prepare the presentations:

Wednesday 25th, evening: presentation of the project at the Boucicaut incubator in Paris (along with the ActuRx presentation)

Tuesday December 1st, evening: presentation of the project at the Centre de Recherche Interdisciplinaires (CRI) in Paris

Meet ups at La Paillasse

December 2nd 2015

Online Kickoff Meeting

at noon

Attendees: the ISFA students and their professors, Augustin, Pierre-Michael, Nicolas, Lauridana, Edouard

Summary:

• until the end of 2015, the ISFA students will work on collecting data by country
• the other project members will do the same and prepare the next steps
• for any newcomer:
• register at epidemium.cc and, there, join the project
• register on this wiki and write to cancerbaseline at googlegroups dot com
• join http://cancerbaseline.slack.com/signup
• there, state your IT, medical and language skills in #can-you-speak-zweck and the countries you are in charge of in #data4baseline
• find the data and upload it to DataScienceStudio
• at the end of 2015, the analyses will start from the collected data

Thursday December 3rd, evening: meeting at La Paillasse

December 2015: full-time data collection by ISFA students

Monday December 7th, afternoon: working session in Lyon with the ISFA students, Augustin and Edouard

Tuesday December 8th

• presentation in Roche France offices
• working session in Lyon between ISFA students

Thursday December 10th

• presentation by Claude and Edouard to the patient association Renaloo, Paris
• meeting at La Paillasse and presentation by Augustin

Every Thursday!! January and February 2016

informal discussions

meeting @ La Paillasse

Data collection and assembly in DataScience Studio

Manual extraction from some websites where extraction would otherwise be too complex

Data collection and assembly finalization in Excel for the RAMP

February 13th: RAMP! At La Paillasse

Analysis of the RAMP algorithms

Analysis of the variables collected and the ones that should be additionally collected

March and April 2016: mixture of La Paillasse, email and Skype meetings, with vacation constraints

Presentations to researchers (Salpêtrière, Cochin, Saint-Antoine), presentations alongside teaching activities (Institut des Actuaires) and presentations to friends and interested pharmacists here and there.

Exchanges with public health researchers on a Thursday evening at La Paillasse, who suggested new variables that we added to the matrix

Meetings approximately every week but in complex conditions (few people at La Paillasse, Skype, Google Hangouts) about the technologies to develop (MySQL, PHP, phpMyAdmin, EpidemiumDB) and about the RAMP2. Hundreds of hours were spent on the new technical environment to assemble the new matrix there and prepare for the RAMP2. Preparation of presentations for Challenge4Cancer. During that time, the project was completely split across people for various activities (web tool, SQL data load, SQL assembly, presentations for Challenge4Cancer)

RAMP2! Saturday, April 30th.