MythTV Guide

Project page for MythTV listing provider

Mission Statement

With the departure of Zap2It listings, there are no providers left that accumulate and distribute listings for TV channels. This project aims to create a method to fill that void and to help remove MythTV's dependence on 3rd party providers for television listing data.

Documentation

Network Architecture Ideas

First Proposal

The system should be based on a mesh network, decentralized as far as necessary. However, to jump-start the network, certain centralized servers will be used.

All of these things can be written in interpreted languages like Perl or PHP, and run over standard HTTP with standard Apache installations. I see no reason to either include or reimplement the BitTorrent protocol, since we'll be transferring single small files and the network organization amounts to no more than lists of servers and hashes of data. The clients could be written in C for easy integration into the existing MythTV structure, but that's by no means necessary. It could just as easily be a Python app running as a cron job, or as a daemon, and populating a directory with XML files for XMLTV to use.

I see the following three types of computers in the network, plus a Scraper application:

Server

Analogous to a tracker in BitTorrent. The Server's job is to cache a list of all the other Servers and Super Nodes in the network and also maintain a full cache of data and hashes for distribution. Each Client connects to a Server first (based on the cached list of Servers it has available) to ask for data. The Server then returns an updated list of Super Nodes and Servers to the Client, and either returns the data or redirects the Client to a Super Node to get the data. All new data/hash sets must come from a Server first (no caching upstream from a Super Node).
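
The Server's decision described above could be sketched roughly as follows. This is only an illustration: ServerState, handle_request, the reply fields, and the max_direct cap used to decide when to redirect are all hypothetical names and assumptions, not an agreed protocol.

```python
from dataclasses import dataclass


@dataclass
class ServerState:
    """Hypothetical in-memory state for one Server."""
    servers: list            # URLs of the known Servers
    super_nodes: dict        # Super Node URL -> hash of the data it holds
    current_hash: str        # hash of the newest data set
    current_data: bytes      # the newest data set itself
    max_direct: int = 100    # assumed cap on direct downloads before redirecting
    direct_served: int = 0


def handle_request(state: ServerState, client_hash: str) -> dict:
    """Return fresh node lists plus the data, a redirect, or a no-op."""
    reply = {
        "servers": list(state.servers),
        "super_nodes": list(state.super_nodes),
        "hash": state.current_hash,
    }
    if client_hash == state.current_hash:
        reply["action"] = "up_to_date"
        return reply
    # Only Super Nodes known to hold the newest data are redirect targets.
    fresh = [url for url, h in state.super_nodes.items()
             if h == state.current_hash]
    if fresh and state.direct_served >= state.max_direct:
        reply["action"] = "redirect"
        reply["location"] = fresh[0]
    else:
        state.direct_served += 1
        reply["action"] = "data"
        reply["data"] = state.current_data
    return reply
```

Because the Server tracks which hash each Super Node holds, it never redirects a Client to a stale cache, which is the reason given for routing all requests through a Server first.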

Super Node

Analogous to a seed in BitTorrent. The Super Node maintains a cache of all data and hashes, and a list of all known Servers on the network. It periodically checks with a Server on its list to update its caches and its Server list. When requested, it will serve data or lists to Clients. A Super Node could be a Client itself, as well. Requests from Clients for data must be redirected from a Server. To maintain the integrity of the caches, Clients should not connect to Super Nodes directly (since Servers always have the most up-to-date version of the caches, and will know which Super Nodes also have the most up-to-date version).

Client

Analogous to a peer in BitTorrent. The Client runs as part of a MythTV installation (or other DVR application) and downloads data from either Servers or Super Nodes. It will ask a Server periodically for an update to its data, and will retrieve that data either from the Server or from a Super Node the Server redirects the Client to.
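
A minimal sketch of that Client update cycle, assuming a hypothetical reply format (a dict with "action", "hash", "servers", and either "data" or "location"). The ask and download callables stand in for whatever HTTP calls the real client would make, so the flow can be followed without a network.

```python
def update_listings(servers, local_hash, ask, download):
    """Try each cached Server in turn.

    ask(url, local_hash) -> reply dict (hypothetical format, see above)
    download(url)        -> bytes fetched from a Super Node
    Returns (data or None, updated server list, current hash).
    """
    for url in servers:
        try:
            reply = ask(url, local_hash)
        except OSError:
            continue  # Server unreachable: try the next one on the cached list
        servers = reply.get("servers", servers)
        if reply["action"] == "up_to_date":
            return None, servers, local_hash
        if reply["action"] == "redirect":
            # The Server chose a Super Node for us; never pick one ourselves.
            data = download(reply["location"])
        else:
            data = reply["data"]
        return data, servers, reply["hash"]
    raise RuntimeError("no Server on the cached list could be reached")
```

Run from cron or a daemon loop, the returned server list and hash would be saved locally for the next cycle.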

Scraper

The Scraper is an app that runs continuously, downloading the raw HTML from the source site, parsing it into XML for distribution, creating gzipped versions of the data, and generating hashes of the XML and gzipped data. The Scraper then contacts a Server and uploads the new data. Scrapers could be running on Servers and/or Super Nodes, or as standalone machines.
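
The packaging step (gzip plus hashes) might look something like this sketch. SHA-256 is an assumption, since no hash algorithm has been chosen, and package_listing is a made-up name.

```python
import gzip
import hashlib


def package_listing(xml_text: str) -> dict:
    """Build the plain and gzipped payloads and a hash for each."""
    xml_bytes = xml_text.encode("utf-8")
    # mtime=0 keeps the gzip header constant, so identical XML always
    # produces identical gzipped bytes and therefore an identical hash.
    gz_bytes = gzip.compress(xml_bytes, mtime=0)
    return {
        "xml": xml_bytes,
        "gz": gz_bytes,
        "xml_sha256": hashlib.sha256(xml_bytes).hexdigest(),
        "gz_sha256": hashlib.sha256(gz_bytes).hexdigest(),
    }
```

Fixing the gzip timestamp matters if several Scrapers produce the same snapshot independently: without it, their gzipped files would hash differently even though the listings are identical.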

Component Descriptions

Client

Unsure, so including multiple scenarios. Edit and delete as necessary.

1) Runs at a MythTV installation site, handles querying server(s) for current listings. MythTV setup asks client for current listing data.

2) Runs at a server site and handles distributing XMLTV data between other server nodes. (Covered in AeroIllini's comments below and probably not really a necessary issue, except that it might keep channel listing sites from getting upset about large amounts of scraper traffic from non-revenue-generating clients.)

Server

Handles distributing XMLTV formatted listings.

Scraper

Handles scheduling execution of scraper scripts. Could be on user configurable schedule, on demand, or a combination of both.

Scraper Script

When called, the code will visit a URI, retrieve data, parse the data into XMLTV format, and then persist that data somehow for distribution by servers.
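
The retrieval and HTML parsing are specific to each source site, but the "parse to XMLTV" half could be sketched as below. The function name to_xmltv and the input shapes (a channel dict and a list of show dicts with timezone-aware datetimes) are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def _stamp(dt: datetime) -> str:
    """Format a timezone-aware datetime as an XMLTV timestamp in UTC."""
    return dt.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S +0000")


def to_xmltv(channels: dict, shows: list) -> str:
    """channels: id -> display name; shows: dicts with channel, start,
    stop, title and an optional desc. Returns the XMLTV document as text."""
    tv = ET.Element("tv")
    for chan_id, name in channels.items():
        chan = ET.SubElement(tv, "channel", id=chan_id)
        ET.SubElement(chan, "display-name").text = name
    for show in shows:
        prog = ET.SubElement(
            tv, "programme",
            start=_stamp(show["start"]),
            stop=_stamp(show["stop"]),
            channel=show["channel"],
        )
        ET.SubElement(prog, "title").text = show["title"]
        if show.get("desc"):
            ET.SubElement(prog, "desc").text = show["desc"]
    return ET.tostring(tv, encoding="unicode")
```

Building the tree with ElementTree rather than string concatenation means characters such as "&" in show titles are escaped automatically.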

General Discussion

I think it might be helpful if we pinned down exactly what this project is going to look like before we start deciding on implementation and architecture details. Are we building a fully mesh-network-based bittorrent system with scraping happening at “node” servers? Are we building a central server with load balancing capabilities? Or are we just building a better screen scraper with periodic updates and leaving everything local? Please leave thoughts below.

FYI, I can usually be found in #periapsis on freenode, if you want to talk about this in realtime. I'm typically active during the day Pacific time.

- AeroIllini


Agreed on that AeroIllini. Here are some points that I think will alter the way this project gets constructed. I am thinking that this project may be able to serve this data to other clients, but MythTV would appear to be the intended target for this work, so some of my thoughts center around its needs.

  • A bittorrent system mitigates bandwidth issues, but may not allow for reasonable download times
    • (I guess this depends on how MythTV gets its channel listings (as a background process or on-demand))

Here's what I want to research

  • I'm assuming we should use Unicode from the start (to allow for multiple languages/character sets)
  • How many times per day is channel data retrieved (with different usage scenarios)
  • Is data text only or text + binary (pictures?)
  • What are typical sizes of a particular listing update
    • Is it compressible?
    • Does MythTV have a format it already expects (I would assume so)
    • Should we reuse the existing format MythTV expects (again, I would assume so)

As far as implementation, I like the bittorrent idea, but wonder if that makes the system not useful unless you have lots of nodes. I haven't implemented a bittorrent system yet, so I have questions about avoiding cache poisoning and keeping slow and unresponsive nodes from dragging down parts of the network. In my mind, bittorrent is great for file delivery when the request is treated as asynchronous. This app (when serving a MythTV client) should be quick and transparent. That makes me lean towards a community of mirrored servers that distribute the data.

- pfarrell


Great points, pfarrell. Here are some of my responses:

  • Unicode: definitely yes. Let's make this as international-friendly as possible (including language localization where appropriate).
  • How many times a day is info retrieved? – Well, how many times a day is the screen we're scraping updated? That would determine our update interval. I think we should leave this as a configurable variable, and give the clients the ability to pull an update interval from the servers, in the case that we change our scraping source and the interval changes.
  • I like the idea of pictures (icons, essentially), but I'm not sure where we would get them. Depends on our scraping source.
  • I have not researched size of the XML download, but I imagine that a simple gzip version would be useful, and greatly reduce bandwidth requirements. The size would also vary greatly based on how much information we provide. Additionally, the hash comparison idea I mentioned would cut down on size as well (i.e., when clients update, they first compare a hash with one stored on a server to see if the update is necessary–we could post hashes for both gzipped and plaintext versions of each listing snapshot). I also like the idea of stratifying the listings–people who want just show names and times have one feed they subscribe to, while people who want all the descriptions, icons, credits, etc. have another feed. Icons should also have a hash system so they are only downloaded once–e.g., when a user grabs a snapshot containing icons, the “icon” will be a url+hash; the client compares the hash and only grabs the image from the URL if it's different. That way people aren't downloading the same 250KB image for Heroes every time they refresh their listings.
  • Formats: MythTV already supports XMLTV, so I think that's the format we should stick to. It also includes provisions for icons, as a URL (not as base64 data). Details are available here.
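
As a rough sketch of the icon idea above (url+hash, download only when the hash differs): the names refresh_icons, cache, and fetch are hypothetical, and SHA-256 is assumed since we haven't picked a hash.

```python
import hashlib


def refresh_icons(icons, cache, fetch):
    """icons: (url, sha256 hex) pairs taken from a listing snapshot.
    cache: dict mapping url -> (hash, image bytes), updated in place.
    fetch: callable url -> bytes, standing in for the HTTP download.
    Returns the URLs whose cached image was added or replaced."""
    updated = []
    for url, expected in icons:
        cached = cache.get(url)
        if cached and cached[0] == expected:
            continue  # already have this exact image; skip the download
        data = fetch(url)
        if hashlib.sha256(data).hexdigest() != expected:
            continue  # download does not match the published hash; ignore it
        cache[url] = (expected, data)
        updated.append(url)
    return updated
```

Checking the downloaded bytes against the published hash also gives a small amount of protection against a node serving the wrong image.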

I like the idea of a BitTorrent system. If we do decide to go that route, we would need to include libbittorrent (or similar) in our client, so that transfer is transparent and does not require additional BitTorrent clients or action from the user. This would be possible, but might constrain us to a C++ development environment, at least for the client. We would also need a way to keep people seeding to prevent network slowdowns. However, if the filesizes are small enough (and I guess that they are) I don't think we would see any appreciable gains using a bittorrent system over a network of “supernodes”, i.e., clients who are also serving data and listed on the central server as download points. Basically, it would be a bittorrent tracker system without the overhead of nonstandard ports and multiple file parts. This is all easily done over http.

- AeroIllini


I had wondered about the seeding as well. I'll be straight up with you, if we go with C++, I'm going to have to play more of a junior role as far as coding goes (but there's always so much more to a project anyway ;). Most of my experience lies in interpreted languages with only minor dabblings in C. It sounds like you're leaning towards a bittorrent network distributing XMLTV. My initial feelings were more of a mirrored server environment, but I don't have any problem exploring the bittorrent route.

I've created new sections above that we should modify until we get it fleshed out.


I haven't chimed in until now.. but just wanted to let you all know that I have no programming skills, so will be of no help there. But I should have a computer that I can run a torrent seeder or mirror on if that's what we need. I have a fiber optic connection through Surewest, 20Mbps up and down. I think it would be interesting to see how much momentum this project gains if other Myth users start utilizing it, so let me know if there's anything I can do to help.

- Bobby

This is the best way to proceed. Scripts could run from different systems and update a central database. This database would be what all the MythTV users download (RSS? P2P?), specifying what portions they want.

The scripts should scan multiple guide sites, removing dependence on any single one, plus support manual guide info entry from the community to fill in any gaps and further distance the efforts from perceived legal issues. The best setup would prioritize which sources of information are more reliable than others, so that weaker data would get overwritten by the more accurate information.
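
The prioritizing could be as simple as this sketch, which assumes each entry records which source it came from and that sources carry a numeric reliability rank. merge_listings, the entry fields, and the rank values are all made up for illustration.

```python
def merge_listings(entries, priority):
    """Keep one entry per (channel, start) slot, preferring reliable sources.

    entries:  iterable of dicts with at least channel, start and source keys
    priority: dict mapping source name -> rank (higher is more reliable);
              unknown sources get rank 0
    On a tie, the entry seen first is kept.
    """
    best = {}
    for entry in entries:
        key = (entry["channel"], entry["start"])
        current = best.get(key)
        if current is None or (
            priority.get(entry["source"], 0)
            > priority.get(current["source"], 0)
        ):
            best[key] = entry
    return best
```

For example, with ranks {"site_a": 2, "site_b": 1, "manual": 0}, a hand-entered show fills a gap until a scraped entry for the same slot arrives and overwrites it. Whether manual entries should rank below or above scraped ones is an open question.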

The main concern about all this was the dependency on a single server for the MythTV users (which COULD be eliminated by using P2P networks to grab files based on date ranges).

I REALLY believe this is the way to go. A community site where every channel in the world, as well as internet channels and podcasts, are all in one site database for browsing and fetching. You don't need to “zip code” smart it – users can just select which channels they want and locally map them to channel numbers on their systems.

The legal issues as I see them are not copyright (although that will be claimed). You cannot copyright a listing of shows and description texts. You could copyright the displayed format, but that is not being duplicated here. It falls more into the area of potential theft of service if the information is charged for, violation of terms on site access, or DMCA if more than simple extraction was required (ex: decrypting). In truth, 5000 people worldwide each spending 20 minutes a week entering 3 weeks' worth of a single channel's information by hand would circumvent all of this entirely and serve everyone without any legal issues. From an automation perspective, robots and spiders are a fact of web life.

– Keith


Does the announcement of http://schedulesdirect.org/ negate the need for this project? I am still interested in working on this, but have had some personal stuff going on that has taken up most of my free time (new baby on the way next month :). Is anyone still interested in working on this? I don't think $15 to help kick off their project and a goal of $20 a year sounds exorbitant, but our p2p ideas were pretty cool (and would be harder to shut down).

– Pat F.

 
mythtv_guide.txt · Last modified: 2009/01/04 09:43 (external edit)
 