qddate: a quick and dirty Python parser for dates found during HTML scraping


qddate is a Python 3 library that parses date strings from HTML pages extremely fast. It was created during long-term news aggregation work and analysis of dates in HTML pages found in the wild. It is not intended to have beautiful code, to support as many languages as possible, and so on. Its purpose is to process millions of strings to identify and parse dates. qddate was part of a proprietary “news reconstruction” technology, which is used to automatically create RSS feeds for sites that do not provide them.

If you are looking for more advanced (and slower) date parsing, try dateparser and dateutil.

Documentation

Documentation is built automatically and can be found on Read the Docs.

Features

  • More than 348 date patterns supported (as of the end of 2017)
  • Generic parsing of dates in English, Russian, Spanish, Portuguese and other languages
  • Supports strings with left-aligned dates and supplemental words. Example: “12.03.1999 some text here”
  • Extremely fast: uses pyparsing, hard-coded constants and dirty speed optimization tricks

Limitations

  • Not all languages are supported; more languages will be added on request, given examples
  • Adding new language-based date patterns is not as easy as it is in dateparser, for example
  • Could miss some rarely used date formats
  • Doesn’t support relative dates
  • Doesn’t support calendars

Speed optimization

  • All constants are hard-coded; there are no external settings
  • Uses only datetime (standard library) and pyparsing. There are no other dependencies; all reused code is incorporated into the library itself
  • No regular expressions; pre-generated pyparsing patterns are used instead
  • Intensive pattern filtering using min/max text length filters and common text patterns
  • No settings or data files are loaded from disk
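
The length-filtering idea in the list above can be sketched with the standard library alone. This is a minimal illustration, not qddate's actual code: the `PATTERNS` table, its keys, and the `candidates` and `parse` helpers are hypothetical, and `datetime.strptime` stands in for the pyparsing patterns the library uses.

```python
from datetime import datetime

# Hypothetical pattern table: each entry has a strptime format plus the
# minimum and maximum text length it can possibly match.
PATTERNS = [
    {"key": "iso_date", "format": "%Y-%m-%d", "min": 10, "max": 10},
    {"key": "dotted_date", "format": "%d.%m.%Y", "min": 8, "max": 10},
    {"key": "iso_datetime", "format": "%Y-%m-%d %H:%M:%S", "min": 19, "max": 19},
]


def candidates(text):
    """Return only the patterns whose length bounds admit this text."""
    n = len(text)
    return [p for p in PATTERNS if p["min"] <= n <= p["max"]]


def parse(text):
    """Try the surviving patterns in order; return None if none match."""
    for pattern in candidates(text):
        try:
            return datetime.strptime(text, pattern["format"])
        except ValueError:
            continue
    return None
```

The cheap `len()` comparison discards most patterns before any real matching is attempted, which is where the saving comes from when millions of strings are processed.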

Usage

The easiest way is to use the qddate.DateParser class and its parse function.

class qddate.DateParser(generate=True)

Class that uses pyparsing-based patterns to parse dates

match(text, noprefix=False)

Matches a date/datetime string against the date patterns and returns the pattern and the parsed date if matched. It is not intended for common usage, since on success it returns the date as an array of numbers together with the pattern that matched it.

Parameters:
  • text – Any human-readable string
  • noprefix (bool) – If True, prefix-based filtering of date patterns is not used
Returns:

Returns a dict with ‘values’, an array representing the parsed date, and ‘pattern’, with info about the matched pattern, if successful; otherwise returns None

Return type:

dict.

parse(text, noprefix=False)

Parses date and time from the given date string.

Parameters:
  • text – Any human-readable string
  • noprefix (bool) – If True, prefix-based filtering of date patterns is not used
Returns:

Returns datetime representing parsed date if successful, else returns None

Return type:

datetime.

Dependencies

qddate relies on the following library:

  • pyparsing is a module for advanced text processing.

Supported languages

  • Bulgarian
  • Czech
  • English
  • French
  • German
  • Portuguese
  • Russian
  • Spanish

Thanks

I wrote this date parsing code in 2008 and later updated it only a few times, migrating from regular expressions to pyparsing. Seeing the clean code and documentation of dateparser <https://github.com/scrapinghub/dateparser> motivated me to return to this code, clean it up and share it publicly. I’ve used the same documentation and code style approach, and reused the build scripts and documentation generation style from dateutil. Many thanks to the ScrapingHub team!

Join the chat at https://gitter.im/qddate/Lobby

Using DateParser.match

DateParser is the only entry point for fast date parsing.

An instance of DateParser takes the basic date patterns from qddate.consts and generates an extended list of patterns. This significantly reduces the number of string comparisons. Language selection is not implemented yet, but its absence does not slow down date parsing.

This class wraps around the core qddate functionality.

Warning

It returns raw matched date and raw pattern:

>>> import qddate
>>> dp = qddate.DateParser()
>>> dp.match('11 August 2017')
{'values': (['11', 8, '2017'], {'day': ['11'], 'month': [8], 'year': ['2017']}), 'pattern': {'key': 'dt:date:date_eng1', 'name': 'Date with english month', 'pattern': {W:(0123...) Suppress:(["."]) January | February | March | April | May | June | July | August | September | October | November | December Suppress:(["."]) W:(0123...)}, 'length': {'min': 10, 'max': 20}, 'format': '%d.%b.%Y', 'right': True, 'basekey': 'dt:date:date_eng1'}}
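
To show what turning such a raw result into a datetime could involve, here is a small sketch that works only from the printed output above. The `first` and `to_datetime` helpers are hypothetical, and the mapping is typed in by hand; how to obtain it from the live object returned by match is not shown in this documentation and is left as an assumption.

```python
from datetime import datetime


def first(value):
    """Unwrap a one-element list such as ['11']; pass scalars through."""
    return value[0] if isinstance(value, (list, tuple)) else value


def to_datetime(named):
    """Build a datetime from a mapping with 'day', 'month' and 'year'."""
    return datetime(
        int(first(named["year"])),
        int(first(named["month"])),
        int(first(named["day"])),
    )


# Named values copied by hand from the printed result for '11 August 2017'.
named = {"day": ["11"], "month": [8], "year": ["2017"]}
print(to_datetime(named))
```

In normal use this step is unnecessary, because parse already returns a datetime.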

Popular Formats

The ‘parse’ function mimics the default behavior of dateparser’s ‘parse’ function, except that it is a method of the DateParser class rather than a standalone function.

>>> import qddate
>>> parser = qddate.DateParser()
>>> parser.parse('2012-12-15')
datetime.datetime(2012, 12, 15, 0, 0)
>>> parser.parse('Fri, 12 Dec 2014 10:55:50')
datetime.datetime(2014, 12, 12, 10, 55, 50)
>>> parser.parse('пятница, июля 17, 2015')  # Russian (Friday, 17 July 2015)
datetime.datetime(2015, 7, 17, 0, 0)
>>> parser.parse('Le 8 juillet 2015')  # French (8 July 2015)
datetime.datetime(2015, 7, 8, 0, 0)

Each call tries to parse a date from the given string and detects the language automatically.
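
Because parse returns None for strings it cannot handle, scraped text can be filtered in one pass. The sketch below is an illustration under stated assumptions: `parse_all` and `stub_parse` are hypothetical names, and the stub uses `datetime.strptime` for ISO dates only so that the example runs without qddate installed.

```python
from datetime import datetime


def parse_all(texts, parse):
    """Return (text, datetime) pairs for the strings that `parse` accepts.

    `parse` is any callable that returns a datetime or None.
    """
    found = []
    for text in texts:
        result = parse(text)
        if result is not None:
            found.append((text, result))
    return found


def stub_parse(text):
    """Stand-in parser for this sketch: ISO dates only, None otherwise."""
    try:
        return datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return None


pairs = parse_all(["2012-12-15", "Read more", "2014-12-12"], stub_parse)
```

With qddate installed, passing `qddate.DateParser().parse` in place of `stub_parse` should behave the same way, given the datetime-or-None return value documented above.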
