Mediarchiver

Published on 19/06/2023

Mediarchiver is a tool to archive and manage medias and archiving them.

Currently, the project is in alpha version. You can find two repository of the project:

the node js one is the oldest https://github.com/laBecasse/youtube-dl-archiver
the rust version, which should replace the previous one https://gitlab.com/maxburon/mediarchiver

Présentation

Mediarchiver (a.k.a. Youtube dl Archiver) est un projet, qui visent à répondre aux problèmes suivants:

Censure de YouTube et plus généralement la disparition de contenu, qui nous ait sur le web,
A linstar de peertube, distribuer le contenu des medias sur le web et diminuer le besoin en ressource pour leur partager
Organiser les médias en renseignant un maximum de leurs caractéristiques indépendamment de leur source en utilisant un schéma générique pour les web vidéos.
Créer des listes des groupes, tags qui permettent d'organiser à sa manière, les médias. Ainsi chacun pourrait faire des recommandations listées comme le fait Téléram pour YT.

Si on considère que Youtube est comme la télé, alors MediaArchiver serait l'enregistreur VHS avec l'étagère pour entreposer les copies.

Projets similaires

https://vookmark.co/

Schéma de données

Je pense refondre un peu le schéma chaotique des médias pour adopter dans un premier temps schema.org. Il faut bien séparer les sources (immuables) de l'archive les deux étant des Creative Works. Ils pourraient être relié par les propriétés suivantes, suivant leur types:

image
audio
video

Il faut aussi ajouter deux types pour différencier l'archive des sources.

Un schéma de données relationelles

Les type d'entité rencontrés dans les archives avec leurs propriétés respectives:

média
- id
- identifiants
- title
- creation date
- upload date
- description
- tags
- creator
- sources
- archives
tag
- name
creator
- name
source
- url
- platform
- miniature
archive
- source
- files
- miniature

Il semble assez normal que chaque type d'entité soit stocké dans une table différentes.

Youtube schema

https://github.com/sushma1395/YouTube-Video-Sharing-Database-Design

Recherche par mots clé

Ajouter les résultats depuis youtube

On peut utiliser ce package, qui semble bien efficace pour faire les requêtes: https://www.npmjs.com/package/ytt

Interface

Lecture

Fonctionnalité - un téléchargement -> une archive

À la première lecture d'un média non archivé, c'est l'occasion de l'archiver. Il faudrait que l'archive se télécharge en même temps, qu'il y a la lecture. Le comportement de cette phase devra s'adapter suivant le type de la source:

lien sans limite de validité, (trouver un moyen de lire un fichier en téléchargement sur le serveur, je pense que temps que le header contient "Content-Length", le navigateur ne téléchargera qu'un fichier partiel et la lecture du média s'arrêtera prématurément). Pour ce cas, je pense que l'on peut distribuer le contenu à l'aide du concept de Partial Content (tuto pour l'implémentation sous node js https://www.codeproject.com/Articles/813480/HTTP-Partial-Content-In-Node-js), mais je ne suis pas sûr de son utilité avec le recul. Il y a une implémentation de proxy avec archivage en directe dans ~/dev/buron.coffee/piping/server.js :

// Initialize all required objects.
var http = require("http");
const https = require('https')
var fs = require("fs");
var path = require("path");
var url = require('url');

const currents = {}

// Give the initial folder. Change the location to whatever you want.
var initFolder = __dirname;

// List filename extensions and MIME names we need as a dictionary. 
var mimeNames = {
  '.css': 'text/css',
  '.html': 'text/html',
  '.js': 'application/javascript',
  '.mp3': 'audio/mpeg',
  '.mp4': 'video/mp4',
  '.ogg': 'application/ogg', 
  '.ogv': 'video/ogg', 
  '.oga': 'audio/ogg',
  '.txt': 'text/plain',
  '.wav': 'audio/x-wav',
  '.webm': 'video/webm'
};


http.createServer(httpListener).listen(8000);

function httpListener (request, response) {
  // We will only accept 'GET' method. Otherwise will return 405 'Method Not Allowed'.
  if (request.method != 'GET') { 
    sendResponse(response, 405, {'Allow' : 'GET'}, null);
    return null;
  }
  var filename = path.join(initFolder, path.basename(request.url))

  // // Check if file exists. If not, will return the 404 'Not Found'. 
  // if (!fs.existsSync(filename)) {
  //   sendResponse(response, 404, null, null);
  //   return null;
  // }

  const promise = downloadStarted(request.url) ? Promise.resolve() : downloadURL(request, filename)

  promise.then(proxyRes => {

    if (proxyRes) {
      setCurrent(request.url, proxyRes, filename)
    }
    const current = currents[request.url]
    const stat = fs.statSync(filename);
    var responseHeaders = {};

    var rangeRequest = readRangeHeader(request.headers['range'], current.size);

    // If 'Range' header exists, we will parse it with Regular Expression.
    if (rangeRequest == null) {
      responseHeaders['Content-Type'] = getMimeNameFromExt(path.extname(filename));
      responseHeaders['Content-Length'] = current.size //stat.size;  // File size.
      responseHeaders['Accept-Ranges'] = 'bytes';

      //  If not, will return file directly.
      sendResponse(response, 200, responseHeaders, fs.createReadStream(filename));
      return null;
    }

    var start = rangeRequest.Start;
    var end = rangeRequest.End;

    // If the range can't be fulfilled by the complete file.
    if (start >= current.size || end >= stat.size) {
      // Indicate the acceptable range.
      responseHeaders['Content-Range'] = 'bytes */' + current.size; // File size.

      // Return the 416 'Requested Range Not Satisfiable'.
      sendResponse(response, 416, responseHeaders, null);
      return null;
    }

    if (end >= stat.size) {

    }


    // Indicate the current range.
    responseHeaders['Content-Range'] = 'bytes ' + start + '-' + end + '/' + current.size;
    responseHeaders['Content-Length'] = start == end ? 0 : (end - start + 1);
    responseHeaders['Content-Type'] = getMimeNameFromExt(path.extname(filename));
    responseHeaders['Accept-Ranges'] = 'bytes';
    responseHeaders['Cache-Control'] = 'no-cache';

    // Return the 206 'Partial Content'.
    sendResponse(response, 206,
                 responseHeaders, fs.createReadStream(filename, { start: start, end: end }));
  })
}

function sendResponse(response, responseStatus, responseHeaders, readable) {
  response.writeHead(responseStatus, responseHeaders);

  if (readable == null)
    response.end();
  else
    readable.on('open', function () {
      readable.pipe(response);
    });

  return null;
}

function getMimeNameFromExt(ext) {
  var result = mimeNames[ext.toLowerCase()];

  // It's better to give a default value.
  if (result == null)
    result = 'application/octet-stream';

  return result;
}

const CHUNK_LENGTH = 102400

function readRangeHeader(range, totalLength) {
  /*
   * Example of the method 'split' with regular expression.
   * 
   * Input: bytes=100-200
   * Output: [null, 100, 200, null]
   * 
   * Input: bytes=-200
   * Output: [null, null, 200, null]
   */

  if (range == null || range.length == 0)
    return null;

  var array = range.split(/bytes=([0-9]*)-([0-9]*)/);
  var start = parseInt(array[1]);
  var end = parseInt(array[2]);
  var result = {
    Start: isNaN(start) ? 0 : start,
    End: isNaN(end) ? Math.min(totalLength - 1, start + CHUNK_LENGTH) : end
  };
  //  console.log()
  if (!isNaN(start) && isNaN(end)) {
    result.Start = start;
    result.End = Math.min(totalLength - 1, start + CHUNK_LENGTH);
  }

  if (isNaN(start) && !isNaN(end)) {
    result.Start = totalLength - end;
    result.End = totalLength - 1;
  }

  //  console.log(result)
  return result;
}


function downloadURL(req, filename) {

  const options = {
    host: 'medias-dl.buron.coffee',
    port: 443,
    path: req.url,
    method: 'GET',
    headers: { host: 'medias-dl.buron.coffee',
               'user-agent':
               'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0',
               accept:
               'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
               'accept-language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
               'accept-encoding': 'gzip, deflate',
               connection: 'keep-alive',
               'upgrade-insecure-requests': '1',
               pragma: 'no-cache',
               'cache-control': 'no-cache' }
  }
  console.log(filename)
  const file = fs.createWriteStream(filename);
  return new Promise((resolve, reject) => {
    const request = https.request(options, r => {
      r.pipe(file)
      resolve(r)
      // r.pipe(res, {end: true})
    })

    request.on('error', reject)

    req.pipe(request, {end: true})
  })
}

function setCurrent(url, res, filename) {
  currents[url] = {
    size: parseInt(res.headers['content-length']),
    url: url,
    filename: filename
  }

  return currents[url]
}

function downloadStarted (url) {
  return currents[url]
}

Il suffit de lancer le serveur et de faire une requête avec le même chemin qu'un archive existante:

node server.js
# e.g. open http://localhost:8000/archives/arte.tv%3A%2B7/073938-000-A/L'homme%20a%20mang%C3%A9%20la%20Terre.mp4

lien avec limite de valité (mettre à jour le lien si besoin puis appliquer le point précédent)
webtorrent (1. lancer le téléchargement par torrent sur le serveur 2. envoyer le lien magnet au client avec le tracker du serveur - si existant - 3. lancer le téléchargement côté client et la lecture)

HLS and m3u8 files

idea from the support #+BEGIN_src html <video width="352" height="198" id="hls-example" controls> <source src="https://api.vice.com/v1/transcoder/manifests/video/480/5eeb434d01c79b0089c8ed75_b5db2178-0c6e-4107-8ab2-2831889b8d94/d.m3u8" type="application/x-mpegURL"> </video>

Taille des vidéos

Format et taille des fichiers

Fait à partir de vidéos de YT

format	taille (1min)	débit (Ko/s)
1080p	7Mo-20Mo	70-340
720p(image + audio)	1.5Mo	17
720p	5.5-10Mo	65-170
480p	3.25Mo
360p (image + audio)	1.5Mo	17
360p	2Mo
240p	1Mo-2.5Mo

WebTorrent

Seed à l'initialisation

On peut seeder certaines vidéos au démarrage du serveur, donc depuis le serveur. Après un test, je remarque que:

seeder des vidéos en utilisant Webtorrent (hybrid) avec un seul client (ce qui semble mieux), utilise beaucoup de mémoire dès que l'on souhaite seeder plus d'un dizaine de média
il semble que seeder des vidéos côté serveur ne fonctionne pas toujours. Dans mon cas, j'arrive pas à établir de connection (wire) depuis le serveur (medias-dl.buron.coffee), alors que ça marche bien sur un serveur de dev (même avec un client depuis une autre machine). Est-ce un problème de firewall ?
il y a eu des bugs dans les versions précédentes qui faisait que les médias étaient téléchargés via ytdl bien qu'il était disponibles par webtorrent :/ Ainsi, il y a pour le moment des fichiers avec torrents dont les hashs ne sont pas en accord.

NodeInfo

https://nodeinfo.diaspora.software/ C'est un standard pour renseigner des informations d'un serveur au sein de la fédiverse. Il peut être utilisé pour :

découvrir les noeuds peertube et adapter le téléchargement
définir un nodeinfo pour les instance MediaArchiver pour qu'elles se reconnaissent entre elles.

Méthodes de récupération des méta données

Extraction time code from description

We can see that there are many medias with time codes in their descriptions. This can be used as metadata. Check the following query:

db.medias.count({$and: [{ description: {$regex: /[0-9]:[0-9][0-9]/g}}]}, {title: 1})

Feed RSS

L'objectif est de pouvoir créer des médias à partir d'un flux RSS, parce qu'il contient des informations précieuse comme:

titre du média
site du créateur
nom du créateur
lien du média
les dates

Voici des exemples intéressants:

NPM library : https://www.npmjs.com/package/rss-parser

HTML extraction

HTML to JSON mappings

The point is to build a tool to scrape HTML pages with XPath queries and convert the results into JSON objects, such as we can translate from the xml document:

<music>
  <artists>
    <artist>
      <name>Tally Hall</name>
      <yearFormed>2002</yearFormed>
    </artist>
    <artist>
      <name>Ben Folds Five</name>
      <yearFormed>1993</yearFormed>
    </artist>
  </artists>
</music>

into the following json document:

{
  "artists": [
    {
      "name": "Tally Hall",
      "year": "2002"
    },
    {
      "name": "Ben Folds Five",
      "year": "1993"
    }
  ]
}

using the following kind of query:

{
  "artists":
    {
      "@root": "//music/artists/artist",
      "name": "./name/text()",
      "year": "./yearFormed/text()"
    }
}

We can extend each mapping with site configurations as done by Five Filters with site patterns. At the end, we choose have a mapping for a set of URLs represented by a URL pattern. These mappings should be used to fill information concerning a media, or related entities. For example, one can extract medias from a page of https://www.franceinter.fr using the following mapping:

{
  title: "//head/title/text()",
  rss: "//a[@class='podcast-button rss']/@href",
  medias:
  {
    "@root": "//figure[@class='media-visual']//button[substring(@class, string-length(@class) -string-length('playable') +1) = 'playable']",
    title: "./@data-diffusion-title",
    path: "./@data-diffusion-path",
    url: "./@data-url"
  }
}

Entity explosion

C'est une extension pour navigateur web : https://github.com/99of9/Entity-Explosion

Je ne sais pas comment elle fonctionne dans le détail, mais elle doit être basé sur des requêtes sparql sur la base de wikidata pour récupérer les identifiants correspondants aux pages visitées. Il est possible de faire ça en utilisant les propriétés "Wikidata property for identifier", qui se retrouve généralement en bas de page. À l'heure où j'écris ceci, il y a 6427.

SELECT distinct ?item ?itemLabel 
WHERE 
{
  ?item wdt:P31 ?class.
  ?class wdt:P279* wd:Q19847637.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} limit 10

Extracting wikidata ID and platform from the URL

We can use wikidata which contains a lot of "url formatter" describing URL pattern parsing the identifier of mainstream platforms.

SELECT ?item ?itemLabel ?value
WHERE 
{
  ?item wdt:P1630 ?value.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} LIMIT 10

We can find below the properties for an identifier related to media service. It is quite general, so we can find property to identify different type of media related things like: channel, emission, album, track, podcast, etc.

SELECT distinct ?item ?itemLabel ?formatter
WHERE 
{
  ?item wdt:P31 ?mediaServiceRelatedPropertyClass.
  ?mediaServiceRelatedPropertyClass wdt:P279* wd:Q63873112.
  ?item wdt:P31 ?idPropertyClass.
  ?idPropertyClass wdt:P279* wd:Q19847637.
  ?item wdt:P1630 ?formatter
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} limit 100

Below the properties for an identifier whose type constraint is one of podcast episode, episode, video recording, music video. It is more selective.

SELECT distinct ?item ?itemLabel ?formatter
WHERE 
{
  ?item <http://www.wikidata.org/prop/P2302> ?propConstraint.
  ?propConstraint <http://www.wikidata.org/prop/qualifier/P2308> ?typeClass.
  ?item wdt:P31 ?idPropertyClass.
  ?idPropertyClass wdt:P279* wd:Q19847637.
  ?item wdt:P1630 ?formatter.
  filter(?typeClass in (wd:Q61855877, wd:Q1983062, wd:Q34508, wd:Q193977))
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} limit 100

Even more specific, the following

SELECT distinct ?item ?itemLabel ?formatter
WHERE 
{
  ?item <http://www.wikidata.org/prop/P2302> ?propConstraint.
  ?propConstraint <http://www.wikidata.org/prop/qualifier/P2308> ?typeClass.
  ?item wdt:P31 ?idPropertyClass.
  ?idPropertyClass wdt:P279* wd:Q19847637.
  ?item wdt:P1630 ?formatter.
  filter(?typeClass in (wd:Q61855877, wd:Q1983062, wd:Q34508, wd:Q193977))
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} limit 100

SELECT distinct ?item ?itemLabel
WHERE 
{
  ?item wdt:P31 ?mediaServiceRelatedPropertyClass.
  ?mediaServiceRelatedPropertyClass wdt:P279* wd:Q63873112.
  ?item wdt:P31 ?idPropertyClass.
  ?idPropertyClass wdt:P279* wd:Q19847637.
  #?item wdt:P1630 ?formatter.
  ?item wdt:P8966 ?regexp.
  # ?item wdt:P1793 ?format.
  # BIND(REPLACE(?formatter, "\\$1", ?format) AS ?f) .
  BIND("https://www.youtube.com/@veritasium" AS ?url).
  FILTER(REGEX(?url , ?regexp)).
  #?entity ?item ?url.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} limit 10

Linked Data Extraction

Examples of urls where linked data scraping should be used:

https://www.brut.media/fr/entertainment/anxiete-generalisee-pour-theo-grosjean-les-fetes-de-fin-d-annee-sont-une-epreuve-54ba0782-bae9-421a-a2e3-bd06cebe43ff
https://www.bbc.co.uk/programmes/m000cq72
https://www.lci.fr/politique/greve-contre-la-reforme-des-retraites-elisabeth-borne-assure-qu-il-n-y-aura-pas-de-penurie-de-carburant-2141741.html, we can extract (i) the video title (ii) video description (iii) thumbnail URL (iv) the embed URL (v) the publisher organization IRI (vi) some keywords
https://www.franceculture.fr/emissions/les-cours-du-college-de-france/du-gouvernement-par-les-lois-la-gouvernance-par-les-0, we can extract (i) the radio episode, (ii) its title, (iii) description, (iv) published date, (v) modification date, (vi) its image, (vii) publisher IRI (here, france culture IRI in wikidata for example), (ix) the radio series "Les cours du collègue de France" (xi) the locate language "fr_FR"
https://peertube.parleur.net/videos/watch/02f441a3-3968-4c05-8a93-c921efdbbac2, we can extract (i) the video title (ii) video description (iii) thumbnail URL (iv) the video duration (v) the embed URL (vi) the video plaform of the url: peertube

WikiData

In wikidata, I found the following interesting classes of video, you can find more using this tools:

vlog

SELECT ?vlog ?vlogLabel 
WHERE 
{
  ?vlog wdt:P31 wd:Q674926.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} LIMIT 50

online video

SELECT ?item ?itemLabel 
WHERE 
{
  ?item wdt:P31 wd:Q23058567.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} LIMIT 5

music video

SELECT (COUNT(?item) AS ?count)
WHERE 
{
  ?item wdt:P31 wd:Q193977.
}

YouTube video ID property (~23K entities using it)

SELECT ?item ?itemLabel ?o
WHERE 
{
  ?item wdt:P1651 ?o.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} LIMIT 2

TED talk ID property (2077 entities using it)

SELECT ?item ?itemLabel ?o
WHERE 
{
  ?item wdt:P2613 ?o.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} LIMIT 2

The following for channels, it is important to differentiate channel, through which the media is distributed from the organization publishing the media.

YouTube channel ID is the property defining the YT channel url of an entity,

SELECT (COUNT(?item) AS ?count)
WHERE 
{
  ?item wdt:P2397 ?o.
}

YouTube channel is the class for YouTube channels (239 instances),

SELECT ?item ?itemLabel 
WHERE 
{
  ?item wdt:P31 wd:Q17558136.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} LIMIT 5

Internet Television is a subclass of channel

SELECT ?item ?itemLabel 
WHERE 
{
  ?item wdt:P31 wd:Q841645.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} LIMIT 5

radio channel is also a subclass of channel

SELECT ?item ?itemLabel 
WHERE 
{
  ?item wdt:P31 wd:Q28114677
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} LIMIT 5

The organizations can have many forms and many relations with the medias, for example:

France Culture is a radio station creating and publishing on at least two channels: YT and a radio channel
ThinkerView is a think tank creating and publishing on at least two channels: YT and their website
Un odieux connard is a human
Some person publishes or redistributes some media and are not the author of it.

The medias can also be grouped by playlist, series

SELECT distinct ?item ?itemLabel
WHERE 
{
  ?item wdt:P2378 wd:Q23054661.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} LIMIT 20

SELECT distinct ?concept ?conceptLabel
WHERE 
{
  ?item wdt:P31 ?concept.
  ?item wdt:P23 ?o.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} LIMIT 20

# subclasses of channel
SELECT ?subclass ?subclassLabel 
WHERE 
{
?subclass wdt:P279*  wd:Q733553 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} LIMIT 2

Schema.org and web embedded graphs

France Culture:

{
    "@context": "http://schema.org",
        "@type": "RadioEpisode",
    "headline": "Du gouvernement des hommes : de l&#039;imaginaire horloger à l&#039;ordinateur",
    "image": {
            "@type": "ImageObject",
            "url": "https://cdn.radiofrance.fr/s3/cruiser-production/2017/01/d59d6d55-ea54-4b0b-9df1-73c9a2d2192a/838_fullsizerender_18.jpg",
            "height": 489,
            "width": 870
        },    "datePublished": "2017-01-09",
    "dateModified": "2017-01-09",
    "url": "https://www.franceculture.fr/emissions/les-cours-du-college-de-france/du-gouvernement-par-les-lois-la-gouvernance-par-les-0",
    "publisher": {
        "@type": "Organization",
        "name": "France Culture",
        "logo": {
            "@type": "ImageObject",
            "url": "https://www.franceculture.fr/img/france-culture-amp.png",
            "height": 40,
            "width": 203
        }
    },
            "partOfSeries": {
        "@type": "RadioSeries",
        "name": "Les Cours du Collège de France",
        "productionCompany": "France Culture"
    },
        "description": "Comment l’analyse juridique peut-elle contribuer, demande Alain Supiot, à éclairer les transformations de nos sociétés, travaillées par la globalisation, la révolution numérique et le passage, selon sa formule du &quot;gouvernement par les lois à la gouvernance par les nombres&quot; ?"
}

France inter:

{
   "@context":"https://schema.org",
   "@type":"RadioEpisode",
   "inLanguage":"fr",
   "name":"Agnès Buzyn : &#039;Un retrait complet de la réforme, c’est au-delà de ce que peut entendre un gouvernement&#039;",
   "description":"Agnès Buzyn, ministre des Solidarités et de la Santé, est l&#039;invitée du Grand Entretien de Nicolas Demorand et Léa Salamé à 8h20.",
   "audio":{
      "@type":"AudioObject",
      "contentUrl":"https://media.radiofrance-podcast.net/podcast09/10239-17.12.2019-ITEMA_22232006-0.mp3",
      "name": "L&#039;invité de 8h20 : le grand entretien",
      "duration": "1598",
      "encodingFormat": "mp3",
      "potentialAction":"ListenAction"
   },
   "director":{
      "@type":"Person",
      "@id":"https://www.franceinter.fr/personnes/lea-salame",
      "name":"Léa Salamé, Nicolas Demorand"
   },
   "partOfSeries": {
      "@type":"RadioSeries",
      "name":"L'invité de 8h20 : le grand entretien",
      "url":"emissions/l-invite"
    },
   "productionCompany": {
      "@type":"Organization",
      "name":"France Inter"
   },
   "hasPart":{
      "@type": "NewsArticle",
      "mainEntityOfPage":{
         "@type": "WebPage",
         "@id": "https://www.franceinter.fr/emissions/l-invite-de-8h20-le-grand-entretien/l-invite-de-8h20-le-grand-entretien-17-decembre-2019"
      },
      "headline":"Agnès Buzyn : &#039;Un retrait complet de la réforme, c’est au-delà de ce que peut entendre un gouvernement&#039;",
      "description":"Agnès Buzyn, ministre des Solidarités et de la Santé, est l&#039;invitée du Grand Entretien de Nicolas Demorand et Léa Salamé à 8h20.",
      "articleSection":"",
      "datePublished":"2019-12-17T08:21:20+01:00",
      "dateModified":"2019-12-17T16:02:46+01:00",
      "author":"France Inter",
      "keywords":"Info,retraites,réforme des retraites,hôpitaux,mouvement social,grèves,santé,manifestations,",
      "image": {
         "@type":"ImageObject",
         "url":"https://cdn.radiofrance.fr/s3/cruiser-production/2017/09/6d968f1c-6e2b-4790-be36-bb24afd7e3ce/1200x680_ab.jpg",
         "width":"525",
         "height":"298",
         "copyrightHolder":"Radio France"
      },
      "publisher":{
         "@type": "Organization",
         "name": "France Inter",
         "logo": {
            "@type": "ImageObject",
            "url": "https://www.franceinter.fr/img/logo-fi-AMP.png",
            "width":"525",
            "height":"298"
         }
      },
      "speakable":{
         "@type":"SpeakableSpecification",
         "xpath":[
            "/html/head/title",
            "/html/head/meta[@name='description']/@content"
           ]}
   }
}

Musique

Normalisation du volume des clips vidéos

https://video.stackexchange.com/questions/19527/how-to-modify-audio-level-of-video-according-to-replaygain-info

Music and Video Clip

L'objectif est:

identifier les médias, qui sont des musiques ou des clips musicaux
glaner des méta-données sur les musiques comme des ids musicbrainz, l'artiste, l'album, l'année de sortie.

Scrape the title and the artist

For now (<19/01/2021>), there are 732 medias tagged as music, some music media are still not tagged and medias tagged by music are not real music medias.

For the music video from youtube containing "Auto-generated by Youtube" in their description (precise shape), we can find the title and the artist separated by a middle point in the description also. Here a query to select these videos:

db.medias.find({description: {$regex: /Auto-generated by YouTube/g}}, {title: 1})

We can also use the title shape to split it into title and artist using separator characters like '-' (548 medias):

> db.medias.find({$and: [{tags: 'music'},{ title: {$regex: /([-•]|By)/g}}]}, {title: 1, creator: 1})
{ "_id" : ObjectId("5ed1712fde57bdf07ceb5e2e"), "title" : "Kill The Noise - BLVCK MVGIC [official video]", "creator" : null }
{ "_id" : ObjectId("5ed19fa3de57bdf07ceb5e2f"), "title" : "Takeo Ischi - New Bibi Hendl (Chicken Yodeling) 2011", "creator" : "Takeo Ischi" }
{ "_id" : ObjectId("5eca6e42690f92f4e5f1d05f"), "title" : "WHAT THE CUT #3 - CHIEN, CANARDS ET TAPIS VOLANT", "creator" : "Lea Salonga, Brad Kane" }
{ "_id" : ObjectId("5ea984fab1d47b532f7cb44f"), "title" : "Major Lazer & DJ Snake - Lean On [traduit en français par Peddy]", "creator" : null }
{ "_id" : ObjectId("5eae933ab1d47b532f7cb520"), "title" : "Colleen - The Golden Morning Breaks [Full Album]", "creator" : "Colleen" }
{ "_id" : ObjectId("5eae9f8cb1d47b532f7cb524"), "title" : "Colleen - A flame my love, a frequency [Full Album]", "creator" : "Colleen" }
{ "_id" : ObjectId("5eaed201b1d47b532f7cb525"), "title" : "OVERWERK - Virtue (Official Video)", "creator" : "OVERWERK" }
{ "_id" : ObjectId("5eb5adc1fb1f39cecdbe4334"), "title" : "S+C+A+R+R - The Rest Of My Days (Official Music Video)", "creator" : "S+C+A+R+R" }
{ "_id" : ObjectId("5eb5ba83fb1f39cecdbe4336"), "title" : "Fall Out Boy - This Ain't A Scene, It's An Arms Race (Official Music Video)", "creator" : "Fall Out Boy" }
{ "_id" : ObjectId("5eb5ba92fb1f39cecdbe4337"), "title" : "Fall Out Boy - I Don't Care (Official Music Video)", "creator" : "Fall Out Boy" }
{ "_id" : ObjectId("5eb5d4d4fb1f39cecdbe4339"), "title" : "Oxygen in Moscow (Full Video) - Jean Michel Jarre", "creator" : "Jean-Michel Jarre" }
{ "_id" : ObjectId("5eba5472fb1f39cecdbe433f"), "title" : "Perlee - Charlie’s Song", "creator" : "Perlee" }
{ "_id" : ObjectId("5ebe6824fb1f39cecdbe4342"), "title" : "Le Joueur Franais - Fraianà feat. Barbatuques (Baianá)", "creator" : null }

We can use the creator value, which often is the artist, for the rest of the media (62 medias):

> db.medias.find({$and: [{tags: 'music'},{ title: {$not:{$regex: /([-•]|By)/g}}}, {creator: {$ne: null}}]}, {title: 1, creator: 1})
{ "_id" : ObjectId("5d59a4e12f758bd207a969d0"), "title" : "A Cool CAT in Town [Tape Five ft Brenda Boykin]", "creator" : "Tape Five" }
{ "_id" : ObjectId("5db89e25672c18d1de5164aa"), "title" : "W.A Mozart :The Magic Flute with English subtitle (complete)", "creator" : "René Pape" }
{ "_id" : ObjectId("5dc9b3f25f8291fed7be22b1"), "title" : "Ghostbusters (metal cover by Leo Moracchioli)", "creator" : "Leo" }
{ "_id" : ObjectId("5dc9b44e5f8291fed7be22b2"), "title" : "Feel Good Inc. (metal cover by Leo Moracchioli)", "creator" : "Leo" }
{ "_id" : ObjectId("5dd025633102d1cd0b2796f4"), "title" : "60 Minutes of Funk (peut servir de Blind Test)", "creator" : "Joe Sample" }
{ "_id" : ObjectId("5e89c31665e5ec7310f1ec82"), "title" : "TWO DOOR CINEMA CLUB | UNDERCOVER MARTYN", "creator" : "Two Door Cinema Club" }
{ "_id" : ObjectId("5e8a1ba865e5ec7310f1ec88"), "title" : "Jean Ferrat   La montagne", "creator" : "Jean Ferrat" }
{ "_id" : ObjectId("5e93370159d3345a458fa5d9"), "title" : "Animal Collective's \"Banshee Beat\" (studio)", "creator" : "Animal Collective" }
{ "_id" : ObjectId("5e93374859d3345a458fa5db"), "title" : "Lights Out", "creator" : "Postcards" }
{ "_id" : ObjectId("5e93377959d3345a458fa5dd"), "title" : "The Greatest Fear", "creator" : "Safar" }

Query MusicBrainz

It seems that musicbrainz do not contain the link to individual music or video clip (like Youtube etc).

We can also use the node client : https://github.com/maxkueng/node-musicbrainz

Maybe we can use title instead…

curl -s "https://musicbrainz.org/ws/2/recording/?query=Too%20Many%20Zooz%20-%20Warriors&fmt=json&limit=1"|jq

We can also query for URL in the release relations (found in the Grease Monkey script for Bandcamp import in MB):

curl -s "https://musicbrainz.org/ws/2/url?resource=https%3A%2F%2Fodezenne.bandcamp.com%2Falbum%2Fdolziger-str-2&inc=release-rels&fmt=json"|jq .

Find artists related to an url:

curl -s "https://musicbrainz.org/ws/2/url?resource=https%3A%2F%2Fwww.youtube.com%2Fuser%2Falo2zen&inc=artist-rels&fmt=json"|jq .

Find artist details with their related urls:

curl -s "https://musicbrainz.org/ws/2/artist/e61590ce-9528-4844-9a9e-b0f40bbe71c4?inc=url-rels&fmt=json"|jq .

Wikidata

From the video clip of Thriller hosted in YouTube at: https://www.youtube.com/watch?v=sOnqjkJTMaA, you can query the concepts having whose YouTube video Id is property have this value, using the following SPARQL query:

SELECT distinct ?item ?itemLabel ?conceptLabel
WHERE 
{
  ?item wdt:P31 ?concept.
  ?item wdt:P1651 "sOnqjkJTMaA".
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

But this methods returns empty result on the following links:

The same can be done for YT channel:

SELECT distinct ?item ?itemLabel ?conceptLabel
WHERE 
{
  ?item wdt:P31 ?concept.
  ?item wdt:P2397 "UCIVKqDHYh8eidalfpfip2Wg".
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

And for soundcloud channel:

SELECT distinct ?item ?itemLabel ?id ?conceptLabel
WHERE 
{
  ?item wdt:P31 ?concept.
  ?item wdt:P3040 "katetempest".
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Related projects

mps-youtube

It is a terminal based YT player and downloader https://github.com/mps-youtube/mps-youtube In particular, the album matching capability.

spotify-dl

It is a terminal based downloader for spotify links https://github.com/SwapnilSoni1999/spotify-dl

MPD

Websockify

installation

cd /tmp
git clone https://aur.archlinux.org/websockify.git
cd websockify
makepkg -i

But it seems to not work properly and I think it comes from the fact that websockify do not support string message anymore and mpd is sending string message. (https://github.com/novnc/websockify/issues/365#issuecomment-432270758)

I am getting this kind of error each time, I tried to send a message to mpd websocket bridge.

const socket = new WebSocket('ws://localhost:8800');
socket.binaryType = "arraybuffer";
var bytearray = new Uint8Array( '' );
socket.send(bytearray.buffer);

└─> websockify 8800 192.168.1.71:6600 -v
/usr/lib/python3.9/site-packages/websockify/websocket.py:30: UserWarning: no 'numpy' module, HyBi protocol will be slower
  warnings.warn("no 'numpy' module, HyBi protocol will be slower")
WebSocket server settings:
  - Listen on :8800
  - No SSL/TLS support (no cert file)
  - proxying from :8800 to 192.168.1.71:6600
127.0.0.1: new handler Process
127.0.0.1 - - [19/Jan/2021 16:51:05] "GET / HTTP/1.1" 101 -
127.0.0.1 - - [19/Jan/2021 16:51:05] 127.0.0.1: Plain non-SSL (ws://) WebSocket connection
127.0.0.1 - - [19/Jan/2021 16:51:05] connecting to: 192.168.1.71:6600
127.0.0.1 - - [19/Jan/2021 16:52:01] 192.168.1.71:6600: Closed target
handler exception: 'array.array' object has no attribute 'fromstring'
exception
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/websockify/websockifyserver.py", line 691, in top_new_client
    client = self.do_handshake(startsock, address)
  File "/usr/lib/python3.9/site-packages/websockify/websockifyserver.py", line 619, in do_handshake
    self.RequestHandlerClass(retsock, address, self)
  File "/usr/lib/python3.9/site-packages/websockify/websockifyserver.py", line 99, in __init__
    SimpleHTTPRequestHandler.__init__(self, req, addr, server)
  File "/usr/lib/python3.9/http/server.py", line 653, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/lib/python3.9/socketserver.py", line 720, in __init__
    self.handle()
  File "/usr/lib/python3.9/site-packages/websockify/websockifyserver.py", line 315, in handle
    SimpleHTTPRequestHandler.handle(self)
  File "/usr/lib/python3.9/http/server.py", line 427, in handle
    self.handle_one_request()
  File "/usr/lib/python3.9/site-packages/websockify/websocketserver.py", line 47, in handle_one_request
    super(WebSocketRequestHandlerMixIn, self).handle_one_request()
  File "/usr/lib/python3.9/http/server.py", line 415, in handle_one_request
    method()
  File "/usr/lib/python3.9/site-packages/websockify/websocketserver.py", line 60, in _websocket_do_GET
    self.handle_upgrade()
  File "/usr/lib/python3.9/site-packages/websockify/websockifyserver.py", line 221, in handle_upgrade
    WebSocketRequestHandlerMixIn.handle_upgrade(self)
  File "/usr/lib/python3.9/site-packages/websockify/websocketserver.py", line 87, in handle_upgrade
    self.handle_websocket()
  File "/usr/lib/python3.9/site-packages/websockify/websockifyserver.py", line 259, in handle_websocket
    self.new_websocket_client()
  File "/usr/lib/python3.9/site-packages/websockify/websocketproxy.py", line 134, in new_websocket_client
    self.do_proxy(tsock)
  File "/usr/lib/python3.9/site-packages/websockify/websocketproxy.py", line 232, in do_proxy
    bufs, closed = self.recv_frames()
  File "/usr/lib/python3.9/site-packages/websockify/websockifyserver.py", line 180, in recv_frames
    buf = self.request.recvmsg()
  File "/usr/lib/python3.9/site-packages/websockify/websocket.py", line 393, in recvmsg
    if not self._recv_frames():
  File "/usr/lib/python3.9/site-packages/websockify/websocket.py", line 543, in _recv_frames
    frame = self._decode_hybi(self._recv_buffer)
  File "/usr/lib/python3.9/site-packages/websockify/websocket.py", line 823, in _decode_hybi
    f['payload'] = self._unmask(buf[hlen:(hlen+length)], mask_key)
  File "/usr/lib/python3.9/site-packages/websockify/websocket.py", line 732, in _unmask
    data.fromstring(buf)
AttributeError: 'array.array' object has no attribute 'fromstring'

MPD.js and Bragi-MPD

MPD.js and Bragi-MPD are respectively a js mpd client and a web interface based on MPD.js. They both assume that websockify is working properly with mpd. But I have some trouble to make it works and it seems that I am not the only one: https://github.com/bobboau/Bragi-MPD/pull/13. I even have this error using the branch mentioned in the PR.

others

https://github.com/hbenl/mpc-js-web

Other Web-based API for MPD

YMPD

YMPD utilise une API basé sur les websockets aussi pour contrôler MPD.

Mipod

Mipod utilise une API REST et basée sur les websockets. J'ai testé ça marche :) https://github.com/jotak/mipod

Error with large mp4 file over http and https links

MPD fails to read the entire file around 5/10min with the error:

Jan 15 12:29 : ffmpeg/aac: Input buffer exhausted before END element found
Jan 15 12:29 : ffmpeg: avcodec_send_packet() failed: Invalid data found when processing input
Jan 15 12:29 : ffmpeg/mov,mp4,m4a,3gp,3g2,mj2: stream 0, offset 0xaf5a5a: partial file
Jan 15 12:29 : exception: CURL failed: transfer closed with 70102496 bytes remaining to read
Jan 15 12:29 : player: played "http://medias-dl.buron.coffee/archives/youtube/lTRiuFIWV54/1%20A.M%20Study%20Session%20%F0%9F%93%9A%20-%20%5Blofi%20hip%20hop_chill%20beats%5D.mp4"

No error appears when the file is read from the file system, so the problem comes from the http connection and no from the file itself.

I have also an error when I try to download the file using curl with a slow connection limit-rate:

curl http://medias-dl.buron.coffee/archives/youtube/lTRiuFIWV54/1%20A.M%20Study%20Session%20%F0%9F%93%9A%20-%20%5Blofi%20hip%20hop_chill%20beats%5D.mp4 --limit-rate 40K -v --output /tmp/lofi.mp4

...
{ [5 bytes data]
  6 78.2M    6 5455k    0     0  20479      0  1:06:44  0:04:32  1:02:12     0* transfer closed with 75381124 bytes remaining to read
} [5 bytes data]
 * OpenSSL SSL_write: Relais brisé (pipe), errno 32
 * Failed sending HTTP2 data
  8 78.2M    8 6479k    0     0  24322      0  0:56:12  0:04:32  0:51:40  277k
 * Connection #0 to host medias-dl.buron.coffee left intact
curl: (18) transfer closed with 75381124 bytes remaining to read

It seems here that the error comes from the OpenSSL, but the problem on MPD arises over HTTP and HTTPS connections … Maybe curl and MPD error are unrelated :/

Curl v7.74 doesn't have any error using HTTP.

Using mpd v0.22.3 seems to solve the problem.

Todo for v1

ajouter des liens vers les flux RSS

améliorer l'upload

trouver une solution au timeout lors de l'upload

reprendre l'amélioration proposée sur le state sur la branche play while downloading
ajouter le state pour les erreurs
introduire une route pour interroger le state
améliorer le mediaDB client pour qu'il supporte l'upload avec suivit de download state

corriger que l'upload depuis la recherche fonctionne bien

ajouter le bouton download et radio depuis la page recherche
ajouter à la radio un media non archivée, l'archive et l'ajoute à la radio
cliquer sur le lien n'archive plus automatiquement le media, uniquement les meta données
la lecture d'un media non archivé, lance l'archive puis la lecture

Vérifier que youtube-dl fonctionne sur les playlists

à tester avec : https://www.youtube.com/playlist?list=PLC7CA51D9DBAEC012

vérifier que peertube choisi bien le meilleur format vidéo 360p

supporter offset et limit avec ytt

Le cas Monsieur3D

web archive of the channel http://web.archive.org/web/20160113004332/https://www.youtube.com/user/Monsieur3D/
link of videos archive: https://drive.google.com/drive/folders/1Hjj4dFDNJfm2YS9xkO1a40A1_Gw_Egry
forum où le lien a été diffusé: https://www.jeuxvideo.com/forums/42-51-54796461-1-0-1-0-monsieur-3d-a-supprime-toutes-ses-videos.htm

Ideas

Utiliser jq à la place de MongoDB

L'idée est de supporter à la fois mongoDB et jq. Avec jq, toutes les données sont dans le dossier d'archive. Il faut que la base de données ne supporte que l'ajout de document en temps que mise à jour.

La structure de l'archive pourrait être un simple fichier medias.json contenant tous les medias.

Remarques sur jq

Il y a pas d'optimisation à utiliser limit

Requêtes avec jq:

recherche de mot-clé (ici "Mozinor"):

cat medias.json |jq -s -c '[.[] | select(contains({description: "Mozinor"}) or contains({title: "Mozinor"}) or contains({tags: ["Mozinor"]}) or contains({uploader: "Mozinor"}))| {title: .title, upload_date: .upload_date}]' |jq 'sort_by(.upload_date)'

findByUrl

cat medias.json |jq -s -c '.[] |select(.media_url == "https://www.youtube.com/watch?v=T7gAXMlehsE")| {title: .title, upload_date: .upload_date}'

findByTag

cat medias.json |jq -s -c '.[] | select(contains({tags: ["Mozinor"]}))| {title: .title, upload_date: .upload_date}'

findAll

cat medias.json |jq -s -c 'sort_by(.upload_date)|reverse |limit(10;.[]) | {title: .title, upload_date: .upload_date}'

removeById

Use a brutalism design

inspiring design of Havard film archive

Todo on the current version

TODO solve the problem with torrent

https://medias-dl.buron.coffee/medias/5d5eaaf1713d4ddf79cc53fc

Todos on RUST

TODO ajouter les propriétés suivant aux medias

platform from extractor. Il y en a trop pour les énumérer : https://github.com/blackjack4494/yt-dlc/blob/master/docs/supportedsites.md
duration
contentSize

Structure basique de projet

une collection de medias avec des méta données explorables et éditables
des archives des médias en question avec des sources d'archives
des outils d'extraction de méta données à partir de sources externes
des outils de téléchargement pour créer des archives

Ressources

https://preservetube.com/