rough draft #1

2 years ago · 29fce34a18
parent fa1b27113f
commit 29fce34a18
1 changed files with 281 additions and 0 deletions
--- a/_posts/2021-12-11-adventures-of-writing-a-screenreader-in-rust.md
+++ b/_posts/2021-12-11-adventures-of-writing-a-screenreader-in-rust.md
@@ -0,0 +1,281 @@
+---
+title: "From Software Noob To Linux Accessibility Master"
+layout: post
+tags: "atspi, dbus, dbus a11y, accessibility, a11y, linux, linux a11y"
+---
+
+Here are some interesting problems I have faced when working with DBus, AT-SPI (Assistive Technology Service Provider Interface) and the Rust programming language.
+I realize these are fairly unique constraints, and this information is likely only relevant for a select few, but I thought the experience might be worthwhile to write down:
+for my own sanity when I inevitably experience these same issues later and for others who may want to contribute to our new screen reader project [Odilia](https://yggdrasil-sr.github.io).
+
+## DBus
+
+[DBus](https://www.freedesktop.org/wiki/Software/dbus/) is a cool API!
+Well, it's not really an API, but rather a mechanism to share messages across processes in Linux;
+this is generally called IPC or Inter-Process Communication.
+DBus can be used to [send and receive desktop notifications](https://specifications.freedesktop.org/notification-spec/latest/ar01s09.html),
+[shut down your computer](https://www.freedesktop.org/wiki/Software/systemd/dbus/)
+and, for my purposes, [get accessibility events](https://www.freedesktop.org/wiki/Accessibility/Walkthrough/).
+
+### Inner Workings
+
+DBus is an object-oriented approach to IPC.
+It is split up into five main components that work together:
+
+1. Objects
+2. Interfaces
+3. Methods
+4. Properties
+5. Buses
+
+#### Objects, Methods & Properties
+
+Objects are just like the objects you learned about in your CS classes:
+an object is a structure which contains attributes, and methods which can be called on it.
+DBus' objects are very similar, except that attributes are called properties.
+
+Most DBus libraries provide a way for you to use "native objects" (i.e., a Python object, a C++ object, a Rust structure + implementation, etc.); this allows access to DBus methods using the language features available to you.
+So for example, in Python you might write:
+
+```python
+obj = get_a_dbus_object()
+print(obj.get_text()) # using a method
+print(obj.locale) # using a property
+```
+
+This would print out whatever is returned from the object's GetText method and whatever is found in the Locale property.
+Notice that DBus method names are conventionally Pascal case (i.e., capitalized at the first letter of each word), while the binding exposes them in the host language's own style.
+
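To make the naming difference concrete, here is a minimal sketch of how a binding might turn a Pascal case DBus name into the snake case name you call from your own code. `pascal_to_snake` is a hypothetical helper I made up for illustration, not a function from any DBus library, and it is deliberately naive: a name with an acronym in it (say, "GetURI") would come out split letter by letter.

```rust
/// Convert a Pascal case DBus member name (e.g. "GetText") into the
/// snake case form many bindings expose (e.g. "get_text").
/// Hypothetical helper: real bindings may handle acronyms differently.
fn pascal_to_snake(member: &str) -> String {
    let mut out = String::with_capacity(member.len() + 4);
    for (i, ch) in member.chars().enumerate() {
        if ch.is_ascii_uppercase() {
            // Every capital after the first starts a new word.
            if i > 0 {
                out.push('_');
            }
            out.push(ch.to_ascii_lowercase());
        } else {
            out.push(ch);
        }
    }
    out
}

fn main() {
    for member in ["GetText", "Locale", "GetChildren"] {
        println!("{} -> {}", member, pascal_to_snake(member));
    }
}
```

So the `get_text()` you call and the `GetText` that goes over the wire are the same method, just spelled in two different conventions.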
+#### Interfaces
+
+A DBus interface (not to be confused with a Java interface, or a Rust trait) is a definition of a collection of methods and properties.
+For example, the "Text" interface may have a property like "Length" or a method like "GetText".
+So the interface "Text" is just a list of methods and properties all wrapped up together.
+That's it! That simple!
+
+This will come in handy later when we need to check if an object implements a method;
+this way we can check for an entire interface of methods and properties instead of checking for each individually.
+
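Here is a rough sketch of what "checking for an entire interface" looks like, assuming we already have the list of interface names an object reports. `RemoteObject` and `list_item` are stand-ins I invented so the example runs without a DBus connection; the `org.a11y.atspi.*` strings are the AT-SPI interface names as I understand them.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a remote object: only the interface names it reports.
struct RemoteObject {
    interfaces: HashSet<&'static str>,
}

impl RemoteObject {
    /// One membership test answers "does this object have all the
    /// methods and properties of this interface?"
    fn implements(&self, interface: &str) -> bool {
        self.interfaces.contains(interface)
    }
}

/// A made-up list item that exposes text but is not a link.
fn list_item() -> RemoteObject {
    RemoteObject {
        interfaces: HashSet::from(["org.a11y.atspi.Accessible", "org.a11y.atspi.Text"]),
    }
}

fn main() {
    let item = list_item();
    for name in ["org.a11y.atspi.Text", "org.a11y.atspi.Hyperlink"] {
        println!("{}: {}", name, item.implements(name));
    }
}
```

If the check for the Text interface passes, it should be safe to call the Text methods on that object; if it fails, there is no point in trying any of them.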
+#### Buses
+
+A bus' closest equivalent in standard computer science terms would be an IP address.
+A bus address looks like ":1.39"; think of this like a raw IP address (DBus itself calls this a unique connection name).
+Some addresses have names associated with them like "org.a11y.Bus"; think of this like a DNS A record pointing at an IP (bus) address (DBus calls this a well-known name).
+So a bus is just a place to send IPC requests, just like you'd send HTTP requests to a web server at a specific IP/port combination.
+
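Telling the two kinds of names apart is easy, because the "raw IP" style names always start with a colon. A minimal sketch (the `BusName` enum and `classify` function are names I made up for this post, not part of any library):

```rust
/// The two kinds of names described above.
#[derive(Debug, PartialEq)]
enum BusName {
    /// Handed out by the bus for a single connection, e.g. ":1.39" (the "raw IP").
    Unique,
    /// Human-readable alias, e.g. "org.a11y.Bus" (the "DNS record").
    WellKnown,
}

/// Names handed out by the bus begin with ':'; everything else is treated as an alias.
fn classify(name: &str) -> BusName {
    if name.starts_with(':') {
        BusName::Unique
    } else {
        BusName::WellKnown
    }
}

fn main() {
    for name in [":1.39", "org.a11y.Bus"] {
        println!("{} is {:?}", name, classify(name));
    }
}
```

This does not validate the name in any way; it only shows the one distinction that matters for the IP/DNS analogy.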
+### Accessibility Events & Information
+
+Let's assume for a moment that you cannot see anything. You are blind.
+If you try to read an article you obviously cannot see what is on your screen, so you need something to read it to you.
+This technology that reads your screen to you is, uncreatively, called a screenreader, sometimes abbreviated "SR".
+Well how does a screen reader know what is on the screen? How does it know what a button is? And a link?
+How does it know if content has changed or if an alert has been sent?
+
+The first three questions are about accessibility information (i.e., this button contains a certain string of text);
+the last is about accessibility events (an `aria-live` region has been updated, or an alert box has been displayed).
+
+DBus can send these events and information to your process, if you ask for it.
+This is what I need if I'm to create anything like a screenreader.
+
+#### Accessibility Events in Rust
+
+So why "Object:StateChanged\0"? Where does this come from?
+
+The specification that is used to send this information to our DBus connection is called AT-SPI: Assistive Technology Service Provider Interface.
+To clarify: DBus is the general IPC mechanism for processes in Linux;
+AT-SPI is a standard for how to send accessibility information/events over the DBus protocol.
+
+## AT-SPI
+
+AT-SPI is defined by a set of XML files that specify *how* to send data across DBus for accessibility events.
+I'm going to be honest: at first this system *seems* very convoluted and unnecessarily complex.
+Over time though, this system has grown on me as I start to see its "complexities" as a sort of after-effect of the core principle of *simplicity* used within DBus and the specifications which use it.
+
+I have explained previously that DBus has objects and methods just like a native object in Python, C++ or JavaScript.
+
+So let's say we want to implement the most basic thing a screenreader can do: read text.
+Let's suppose we already have an item we want to get the text of.
+Now to get the text of it, we call a method on the interface and pass the path.
+This is abstracted away for us, generally speaking, when using any kind of language-specific DBus binding, but it's better to be explicit in this case.
+
+No problem! We call `item.get_text()` and that's it, right?
+No.
+This is where, again, this "complexity" comes in.
+It looks complex at first, but with time and understanding it will grow on anyone who enjoys the UNIX principles.
+
+So what happens if we do `obj.get_text()`?
+Let's try it on the first list item on my website's [homepage](/):
+
+Here is the excerpt as it is written on the day of writing this article:
+
+> I have three goals in my software development career:
+> 1. Strong adherence to the <a href="https://?">UNIX principles</a> of software design.
+> 2. Security, privacy and anonymity of the internet.
+> 3. Accessibility of technology to the visually impaired.
+
+What would you expect to receive if you ran `get_text()` on the first list item there?
+If you, like me, were a little brainlette, you probably guessed "1. Strong adherence to the UNIX principles of software design."
+Let's find out if this is correct:
+
+```rust
+// `acc` is the accessible object for the first list item
+let text = acc.get_text();
+println!("TEXT: \"{}\"", text);
+
+// $ cargo run
+// TEXT: "1. Strong adherence to the  of software design."
+```
+
+If you read that carefully, you'll see there are what look like three spaces where the UNIX principles link should go.
+This is *extremely* deceptive for two reasons:
+
+1. One of those is *NOT* a space. It's an [Object Replacement Character](https://www.fileformat.info/info/unicode/char/fffc/index.htm), a.k.a. Unicode code point U+FFFC.
+2. It looks like it has just dropped a piece of text without telling us! And without a way to get it back! *Gasp!* Oh the horror!
+
+This is what I thought too.
+But allow me to defend this for a minute.
+
+What if you had something complex like a table, a block quote, an image or even something like a [MathML equation](example) inside the block of text (in our case, inside a list item, but this applies to any piece of text inside another)?
+If you had a table, would you want to read it out?
+With MathML, you might want to say everything upfront, but it would need some amount of processing before it is readable as text.
+And even with a link, there is a reason for this.
+
+If you can see perfectly fine and browse the web like anyone else, with your eyes, you can tell visited links from unvisited ones by their color. A darker color generally indicates a visited link,
+whereas a lighter color generally indicates an unvisited link.
+When a screenreader gets info about a piece of text, it needs to convey that information to its user, like "UNIX principles...link" or "UNIX principles...visited link".
+So if I get the text of some item which contains another, should it include all sub items? What about just links? Should it tell you if the link is visited or not?
+
+All these questions above would introduce additional complexity to answer if being done within a single query.
+This has given me pause in my youthful "the system is broken" angst that generally plagues my thinking;
+instead I see this is a very sober-minded and UNIX-y design principle that I think makes much more sense than the alternative.
+Here are some major advantages of this method:
+
+1. It allows *optional* processing of sub-elements; maybe you don't care what is underneath the element: this saves processing power and complexity.
+2. It allows *custom* processing of sub-elements; you do not have to rely on AT-SPI to tell you what information you want. Perhaps you only need the role of the sub element, not the entire text of it: again, this saves CPU cycles and code complexity.
+3. It allows arbitrary data to be inside any other structural element. This is optimal for HTML, which is built to have more or less arbitrary nesting of elements.
+
+In reality: this is actually genius design!
+My next question is: "If it uses the object replacement character so it can replace the children, then what happens if the object replacement character is actually in the text?"
+Well, with some processing you can actually find out where each child goes, or if the object replacement character is actually written in the text itself.
+How so?
+
+First off, let's get a list of children.
+We can do this with `obj.get_children()`.
+
+```rust
+// the Rust way of awaiting and not caring about the error case is: .await.unwrap()
+println!("CHILDREN: {:?}", obj.get_children().await.unwrap());
+
+// $ cargo run
+// CHILDREN: [(":1.7", Path("/org/a11y/atspi/accessible/193\u{0}")), (":1.7", Path("/org/a11y/atspi/accessible/194\u{0}"))]
+```
+
+You'll notice that the children are merely a list of tuples;
+each tuple only contains, at its core, two strings:
+
+* Sender: A string describing which application has sent the information.
+* Path: A string describing which element is being sent.
+
+The sender, you will notice, looks suspiciously like a bus address.
+This is actually what it is. Each process has a bus address, and it is letting you know where it's coming from.
+The path is a DBus object path: it identifies a new object that we can query over DBus if we want more information about it.
+
+// TODO
+Remember earlier when we used a sender, path and connection to connect to the accessibility bus?
+And later when we created a Proxy to an object that was sent over as part of an event?
+Well, this is the same idea!
+We clone the `Arc<SyncConnection>` connection to use the same connection to talk to DBus, and we use the sender and path to create a proxy on which we can then use the same methods as on the parent!
+
+Pretty cool stuff!
+
+Okay, now back to what I was saying about being able to grab information about children to find out if we need to replace the object replacement characters or not.
+(TODO)
+Here's what we do: there is an [interface](???), we talked about this earlier, called [Hyperlink](???xml);
+the Hyperlink interface can actually tell us what cursor position inside the parent the child occupies.
+Some objects we get over DBus will not support this, but the vast majority will.
+I dislike the fact that it is called Hyperlink. Even though I can see that links are the primary use case,
+I think it's reasonable to say that `StartIndex` and `EndIndex` are not exactly unique to hyperlinks (`<a>` tags).
+Minor criticism aside, there is an opportunity here to match with the parent and find out if and where the child belongs.
+Here's how:
+
+If we get the position of every occurrence of the object replacement character from the parent,
+and check each child to see if its `StartIndex` matches the position of the object replacement character,
+then anytime it matches, that is where the child belongs.
+
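The matching step just described can be sketched in plain Rust, with no DBus involved. Everything here is an assumption for illustration: `Child` is a simplified stand-in for whatever we would get back from the Hyperlink interface, `sample` is made-up data based on the list item from earlier (with one extra U+FFFC at the end that no child claims), and I am assuming `StartIndex` counts characters rather than bytes.

```rust
const OBJECT_REPLACEMENT: char = '\u{FFFC}';

/// Hypothetical, simplified view of a child: a label and the
/// `StartIndex` it reports through the Hyperlink interface.
struct Child {
    label: &'static str,
    start_index: usize,
}

/// For every U+FFFC in `text`, return its character offset and the
/// child whose `start_index` matches that offset, if there is one.
/// `None` means the character was literally written in the text.
fn match_placeholders<'a>(text: &str, children: &'a [Child]) -> Vec<(usize, Option<&'a Child>)> {
    text.chars()
        .enumerate() // character offsets, not byte offsets
        .filter(|&(_, ch)| ch == OBJECT_REPLACEMENT)
        .map(|(offset, _)| (offset, children.iter().find(|c| c.start_index == offset)))
        .collect()
}

/// Made-up data: the list item text plus a trailing U+FFFC with no child.
fn sample() -> (String, Vec<Child>) {
    let text = String::from("1. Strong adherence to the \u{FFFC} of software design. \u{FFFC}");
    let children = vec![Child { label: "UNIX principles", start_index: 27 }];
    (text, children)
}

fn main() {
    let (text, children) = sample();
    for (offset, child) in match_placeholders(&text, &children) {
        match child {
            Some(c) => println!("offset {}: child \"{}\"", offset, c.label),
            None => println!("offset {}: literal U+FFFC in the text", offset),
        }
    }
}
```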
+There is another use for this that I would like to point out,
+and I think it makes a reasonable case for pulling these indices into their own interface, or merging them into Accessible.
+That is: structural navigation.
+
+## Structural Navigation
+
+People who use screenreaders have some special abilities I actually wish browsers implemented by default:
+the ability to jump through the document by specific tags and attributes.
+It's not sophisticated:
+a depth-first search forward or backward looking for the closest heading, link, button, table, etc.
+This is so ingrained in screenreader users that when a page finishes loading,
+it is customary for the screenreader to announce (speak out loud to the user) the number of tables, headings, and visited and unvisited links that are on the page.
+
+If I want to look for the next heading in an HTML document, however,
+I cannot start by just checking all children, because it is fairly common to have various tags embedded in your current tag.
+I need to know which children are after and which are before my caret.
+
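As a rough sketch of that "before or after my caret" question, here is one level of the search, assuming each child can tell us a role and a start offset within its parent. The `Child` struct, the role strings and the `sample` data are all made up for illustration; a real implementation would also have to recurse into children and climb back up to the parent, which this deliberately leaves out.

```rust
/// Hypothetical, flattened view of a child: a role name plus the
/// character offset (`StartIndex`) it occupies in the parent's text.
#[derive(Debug, PartialEq)]
struct Child {
    role: &'static str,
    start_index: usize,
}

/// Closest child with the wanted role strictly after the caret.
fn next_with_role<'a>(children: &'a [Child], caret: usize, role: &str) -> Option<&'a Child> {
    children
        .iter()
        .filter(|c| c.role == role && c.start_index > caret)
        .min_by_key(|c| c.start_index)
}

/// Closest child with the wanted role strictly before the caret.
fn previous_with_role<'a>(children: &'a [Child], caret: usize, role: &str) -> Option<&'a Child> {
    children
        .iter()
        .filter(|c| c.role == role && c.start_index < caret)
        .max_by_key(|c| c.start_index)
}

/// Made-up children of a single parent element.
fn sample() -> Vec<Child> {
    vec![
        Child { role: "link", start_index: 5 },
        Child { role: "heading", start_index: 12 },
        Child { role: "link", start_index: 30 },
        Child { role: "heading", start_index: 48 },
    ]
}

fn main() {
    let children = sample();
    let caret = 20;
    println!("next heading: {:?}", next_with_role(&children, caret, "heading"));
    println!("previous heading: {:?}", previous_with_role(&children, caret, "heading"));
}
```

The only piece of state this needs besides the children themselves is the caret position.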
+## The Carrot 🥕
+
+The caret is the same as your cursor in an input box.
+Type right here and watch as your cursor (aka caret) moves with your typing: <input type="text" placeholder="type here">
+
+The caret, or cursor, is something that most people are only used to seeing in the context of *editable* text,
+but screenreader users enable a special mode in their browser (usually activated with F7) called "caret browsing".
+Caret browsing allows you to navigate through a webpage using a cursor
+even when the text is not editable.
+This is *awesome*!
+I cannot overstate how useful this is to me, if only for the sake of keyboard-driven simplicity and eliminating the mouse as much as possible.
+
+Try it now! You can always turn it off with F7, just the same as enabling it.
+
+This caret can be moved around just like in any run-of-the-mill WYSIWYG (What You See Is What You Get) editor like Word or LibreOffice Writer.
+This is how a screenreader user navigates the web:
+with a cursor.
+They use it to read one character at a time (with left and right arrow), a word at a time (Ctrl+left or right arrow) or entire lines of text (using up and down arrow).
+This becomes, in essence, the active focus of the user: it is always on the cursor (a.k.a. caret).
+
+## Keyboard Input
+
+Keyboard input with accessible applications follows a very complex path, which can be a serious buzzkill when attempting to build a high-performance screenreader.
+Let me show you what the issues are; the assistive technology (screenreader, in this case) will be written as "AT" in this diagram:
+
+```
+Wayland: Kernel -> libinput -> DE/WM -> accessible application -> AT
+X11: Kernel -> Xorg -> DE/WM -> accessible application -> AT
+```
+
+What happens in the case of an inaccessible application?
+It doesn't work, at all. A key press which is sent to an inaccessible application will *not* be sent to an AT application (i.e., a screenreader).
+This is a serious problem that I don't think *should* exist at all.
+Perhaps there is some mechanism I am missing for intercepting these keys before they pass all the way to an application, rather than just hoping the GUI is accessible;
+supposing that this is not the case, we need a system to intercept the keys before they are sent all the way down the stack and send them to the screenreader instead.
+This is needed for two reasons:
+
+1. Performance: it doesn't make sense to send keys that far down the stack, just to hope the application implements accessibility correctly; we should be able to intercept key presses *before* they get to the application.
+2. Control: it is best to be able to control things regardless of whether an application is running or not. Under a system where an application must be accessible to send us keystrokes, a non-responsive application will not send us keystrokes either.
+
+To have full control and maximum performance, we need to intercept the keys at their source.
+
+### rdev
+
+`rdev` is a Rust crate which can (with the "unstable_grab" feature enabled) grab keys from the Linux kernel before they are passed any further down the stack.
+It allows us to consume events if we do not want to also do the default action; for example, in "Browse Mode" a screenreader user will use the letter h to jump between headings within a page;
+normally this would type the letter h, so to stop this from happening we can consume (or "eat") the event so that it isn't sent any further at all.
+
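As I understand it, the callback you hand to `rdev`'s grab function returns an `Option` of the event: return the event to pass it along, or `None` to eat it. The sketch below mirrors only that shape, with a made-up `Key` type and `handle_key` function instead of the real `rdev` types, so that it runs without the crate or any special permissions.

```rust
/// Stand-in for a key event; a real program would use the event type
/// of its input library instead of this made-up enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Key {
    Char(char),
    Enter,
}

/// Decide whether a key continues down the stack.
/// `Some(key)` passes it on untouched; `None` consumes ("eats") it.
fn handle_key(key: Key, browse_mode: bool) -> Option<Key> {
    match (browse_mode, key) {
        // In browse mode, 'h' means "jump to the next heading",
        // so the application must never see it.
        (true, Key::Char('h')) => None,
        // Everything else is none of our business.
        _ => Some(key),
    }
}

fn main() {
    let cases = [
        (true, Key::Char('h')),
        (false, Key::Char('h')),
        (true, Key::Enter),
    ];
    for (browse_mode, key) in cases {
        println!("browse_mode={} {:?} -> {:?}", browse_mode, key, handle_key(key, browse_mode));
    }
}
```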
+INSERT CONNECTION PARAGRAPH
+
+## Pulling It All Together
+
+Now that we have the basics of DBus, AT-SPI, caret browsing and structural navigation, let's put it all together in a final program which can actually accomplish something:
+
+```rust
+// use DBus to get the bus address of the accessibility (a11y) bus.
+// connect to the accessibility bus and ask to receive focus change events
+  // speak the text of the current element, chopping off by line breaks, and including link information
+
+// use odilia-input to get keystrokes
+```
+
+ALMOST DONE
+
+ADD CONCLUSION