Sharingan is a tool built on Python 3.6 using OpenCV 3.2 to extract news content as text from a photo of a newspaper and to perform news context extraction.
Note: This is a fun project I started out of curiosity, and it is still under development. It is not yet mature enough to produce very accurate results.
The work can be divided into two tasks:
Image processing and text recognition
Context extraction
Image processing and text extraction
Our region of interest (ROI) is the text content of the page, so some image processing is required to highlight and extract that text from the image. The highlighted text regions then need further processing and cleaning so that noise and false positives do not interfere with OCR.
Edge detection finds the boundaries of objects in an image by analyzing changes in brightness. Here, it is used to segment the image. More precisely, I’ve used the Canny edge detection technique.
3. Dilation
Detecting contours for text at this point would lead to hundreds of nonsensical contours. To get more confident boundary detection I’ve used dilation, which enlarges the white region in the image, i.e. the size of the foreground object. Informally, it leaks white pixels into their neighborhood so that the text area turns into a more solid-looking block.
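To make the "leaking" idea concrete, here is a minimal pure-Python sketch of dilation with a 3x3 square neighborhood on a tiny binary grid. It is only an illustration of the operation; in OpenCV the same job is done by `cv2.dilate`, and the kernel size and iteration count used in Sharingan are not stated in this post.

```python
def dilate(grid, iterations=1):
    """Dilate a binary grid (lists of 0/1) using a 3x3 square neighborhood."""
    rows, cols = len(grid), len(grid[0])
    for _ in range(iterations):
        out = [[0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                # A pixel turns white if it or any of its 8 neighbors is white.
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < rows and 0 <= cc < cols and grid[rr][cc]:
                            out[r][c] = 1
        grid = out
    return grid


# Two isolated white pixels, standing in for two separate strokes of text.
image = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]

for row in dilate(image, iterations=1):
    print(row)   # two 3x3 blobs, still separated by column 3
for row in dilate(image, iterations=2):
    print(row)   # the blobs have merged into one solid region
```

After enough iterations, nearby strokes fuse into a single blob, which is what makes the later contour step produce a few large regions instead of hundreds of tiny ones.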
4. Finding Contours and Contour Approximation
Contours are found around the white pixels. Contour approximation then approximates a contour shape with another shape that has fewer vertices, depending on the precision we specify. After performing contour approximation I got this:
What I got?
By employing techniques mentioned above, I ended up with these:
Inference
It’s evident that our logic was able to crop out the text content from the page, but it also picked up a few false positives, which can be filtered out in this case with a little tweaking. Also, our logic couldn’t isolate the image content (TODO: fix this).
Clean images for text extraction
I implemented adaptive binary thresholding to clean and highlight the text area.
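The sketch below shows the idea behind mean-based adaptive thresholding in pure Python: each pixel is compared against the mean of its own neighborhood minus a small offset, instead of against one global value. The block size, the offset and the toy image are assumptions for illustration. In OpenCV this is what `cv2.adaptiveThreshold` with `cv2.ADAPTIVE_THRESH_MEAN_C` computes, although its border handling differs from the clipped window used here.

```python
def adaptive_threshold(img, block=3, offset=15):
    """Binarize img (lists of gray values) against a local mean minus offset."""
    rows, cols = len(img), len(img[0])
    half = block // 2
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Neighborhood window, clipped at the image borders.
            window = [
                img[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
            ]
            local_mean = sum(window) / len(window)
            out[r][c] = 255 if img[r][c] > local_mean - offset else 0
    return out


def count_black(img):
    return sum(v == 0 for row in img for v in row)


# Unevenly lit "page": the background brightens from 80 (left) to 200 (right).
page = [[80 + 20 * c for c in range(7)] for _ in range(5)]
page[2][1] -= 60  # a dark ink pixel in the dim part of the page
page[2][5] -= 60  # a dark ink pixel in the bright part of the page

# A single global threshold mistakes the dim background for ink.
global_result = [[255 if v > 128 else 0 for v in row] for row in page]
result = adaptive_threshold(page)

print(count_black(global_result))  # 16
print(count_black(result))         # 2
```

With the local threshold only the two ink pixels come out black, regardless of how bright their surroundings are.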
Text Extraction
I’ve used Tesseract to extract the text from the segmented images.
Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998.
In 2005 Tesseract was open sourced by HP. Since 2006 it is developed by Google.
Manual Mode
Sharingan provides both automatic and manual segmentation modes. Below is a demo of manual segmentation.
Drag and drop the area to crop
Threshold
Context extraction
The phrase structure of a sentence in English is of the form:
S → NP VP
The above rule means that a sentence (S) consists of a Noun Phrase (NP) and a Verb Phrase (VP). We can further define grammar for a Noun Phrase but let’s not get into that :)
A Verb Phrase defines the action performed on or by the object, whereas a Noun Phrase functions as the subject or object of a verb in a sentence. Therefore, NPs can be used to extract the important topics from the sentences.
I’ve used the Brown Corpus in the Natural Language Toolkit (NLTK) for Part Of Speech (POS) tagging of the sentences and defined a custom Context Free Grammar (CFG) for extracting NPs.
“The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on.”
A part-of-speech tagger, or POS-tagger, processes a sequence of words, and attaches a part of speech tag to each word.
>>> import nltk
>>> from nltk import word_tokenize
>>> text = word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]
In my context extractor script, I’ve used unigram as well as bigram POS tagging. A unigram tagger is based on a simple statistical algorithm: each token is assigned the tag that is most likely for that token, as determined by a lookup in the training data. The drawback of unigram tagging is that each token gets its “most likely” tag in isolation, without regard to the larger context of the text.
Therefore, for better results we use an n-gram tagger, whose context is the current token along with the POS tags of the preceding n-1 tokens. The problem with n-gram taggers is data sparsity, which is a pervasive problem in NLP.
“As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data.”
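The difference between the two taggers, and the usual remedy for sparse data (backing off from the bigram tagger to the unigram tagger), can be sketched in pure Python. This is a toy illustration and not the author's script: the five training sentences, their tags and the default tag `NN` are made up for the example. NLTK offers the same idea through `UnigramTagger` and `BigramTagger` with a `backoff` argument.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged training set (hypothetical data).
TRAIN = [
    [("we", "PRP"), ("book", "VBP"), ("a", "DT"), ("flight", "NN")],
    [("the", "DT"), ("book", "NN"), ("is", "VBZ"), ("good", "JJ")],
    [("a", "DT"), ("book", "NN"), ("fell", "VBD")],
    [("they", "PRP"), ("book", "VBP"), ("rooms", "NNS")],
    [("the", "DT"), ("book", "NN"), ("sold", "VBD")],
]


def train_unigram(tagged_sents):
    """Map each word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def train_bigram(tagged_sents):
    """Map each (previous tag, word) context to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = "<s>"  # marks the start of a sentence
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}


def tag(words, unigram, bigram=None, default="NN"):
    """Tag words; try the bigram context first, then back off to unigram."""
    tagged, prev = [], "<s>"
    for word in words:
        choice = bigram.get((prev, word)) if bigram is not None else None
        if choice is None:  # unseen context: fall back to the unigram table
            choice = unigram.get(word, default)
        tagged.append((word, choice))
        prev = choice
    return tagged


unigram = train_unigram(TRAIN)
bigram = train_bigram(TRAIN)
sentence = "they book a flight".split()

print(tag(sentence, unigram))
# [('they', 'PRP'), ('book', 'NN'), ('a', 'DT'), ('flight', 'NN')]
print(tag(sentence, unigram, bigram))
# [('they', 'PRP'), ('book', 'VBP'), ('a', 'DT'), ('flight', 'NN')]
print(tag(["good", "book"], unigram, bigram))
# [('good', 'JJ'), ('book', 'NN')]
```

The unigram tagger always calls "book" a noun because that is its most frequent tag, while the bigram tagger uses the preceding pronoun tag to pick the verb reading. The last call shows the backoff: neither bigram context was seen in training, so both words fall back to their unigram tags.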
I’ve also defined a custom CFG to extract Noun Phrases from the POS tagged list of tokens.
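The grammar itself is not shown in this post, so the sketch below uses a hypothetical, much simpler rule: a Noun Phrase is an optional determiner, followed by any number of adjectives, followed by one or more nouns. It is written in pure Python over an already tagged token list (the tags are assigned by hand here) and only illustrates how NPs can be pulled out of POS-tagged text.

```python
def extract_noun_phrases(tagged):
    """Collect chunks matching: optional DT, any JJ*, then one or more NN*."""
    phrases, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":                           # optional determiner
            j += 1
        while j < n and tagged[j][1].startswith("JJ"):      # adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):      # nouns (NN, NNS, NNP...)
            k += 1
        if k > j:  # at least one noun was found, so this span is an NP
            phrases.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:      # no noun here; move on to the next token
            i += 1
    return phrases


tagged_sentence = [
    ("the", "DT"), ("prime", "JJ"), ("minister", "NN"), ("visited", "VBD"),
    ("a", "DT"), ("flooded", "JJ"), ("village", "NN"), ("in", "IN"),
    ("Kerala", "NNP"),
]

print(extract_noun_phrases(tagged_sentence))
# ['the prime minister', 'a flooded village', 'Kerala']
```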
Applying all this logic to get the keypoints from the text content extracted above gives:
Saturday, March 26, 2016
Introduction
Understanding decorators is one of the trickiest things in Python, and creating one is one of the craftiest. It requires an understanding of a few functional programming concepts, how functions work, the namespace/scope/lifetime of data items and, most importantly, closures.
What is a decorator?
A decorator is a design pattern which allows us to modify the functionality of a function or class without changing the implementation of the function or class being decorated.
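As a first taste, here is a minimal sketch of a decorator. The names `shout` and `greet` are made up for this example; the point is only that `greet` gains new behavior while its own body stays untouched.

```python
import functools


def shout(func):
    """Decorator: wrap func so that its string result is upper-cased."""
    @functools.wraps(func)  # keep the original function's name and docstring
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper


@shout  # equivalent to: greet = shout(greet)
def greet(name):
    return "hello, " + name


print(greet("bruce"))  # HELLO, BRUCE
```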
Beginners, hold on to questions like:
How is it different from calling a separately created function containing added functionality?
What's so cool about decorators?
Is Bruce Wayne the Batman?
I hope these questions will get answered by the end of this post.
Know your functions better
Functions are like any other variables in Python, i.e. we can pass functions as arguments and return functions from other functions. Why is that? Because functions are objects in Python, like everything else.
Consider this:
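A minimal sketch of the idea, with illustrative function names that are assumptions for this example:

```python
def square(x):
    return x * x


def apply_twice(func, value):
    """Takes a function as an argument and calls it twice."""
    return func(func(value))


def make_adder(n):
    """Returns a new function as its return value."""
    def add(x):
        return x + n
    return add


print(isinstance(square, object))  # True
print(apply_twice(square, 3))      # 81

add_five = make_adder(5)           # a function stored in a variable
print(add_five(10))                # 15
```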
Now we know that functions are also objects.
Namespace, scope and lifetime
A namespace, in simple terms, is a collection of names defined in our code, each of which refers to an object. Several namespaces can exist independently of one another; they differ in their scope and lifetime.
Functions create their own namespace and it is accessible directly only in the function definition. This is the scope of the namespace in a function. Similar is the case with any other code segment.
Variables in their local namespace in a function are destroyed when the function ends. This is the lifetime of the variables.
Rule: When accessing a variable, Python looks for it in the local scope first, then in the enclosing scope, and then in the global and built-in scopes.
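The lookup order and the lifetime of local variables can be checked with a short sketch. All names here are made up for the example; the same name `x` is defined at three levels so that each function shows which one it finds.

```python
x = "global"


def outer():
    x = "enclosing"

    def reads_local():
        x = "local"   # found in the local scope first
        return x

    def reads_enclosing():
        return x      # no local x, so the enclosing scope is used

    return reads_local(), reads_enclosing()


def reads_global():
    return x          # no local or enclosing x, so the module-level x is used


print(outer())         # ('local', 'enclosing')
print(reads_global())  # global


def compute():
    scratch_value = 42  # exists only while compute() is running
    return scratch_value


compute()
try:
    scratch_value       # the local name is gone once the function has returned
    lifetime_over = False
except NameError:
    lifetime_over = True
print(lifetime_over)    # True
```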