Tait Hoyem 2 years ago
---
title: "From Software Noob To Linux Accessibility Master"
layout: post
tags: "atspi, dbus, dbus a11y, accessibility, a11y, linux, linux a11y"
---
Here are some interesting problems I have faced when working with DBus, AT-SPI (Assistive Technology Service Provider Interface) and the Rust programming language.
I realize these are fairly unique constraints, and this information is likely only relevant for a select few, but I thought the experience might be worthwhile to write down:
for my own sanity when I inevitably experience these same issues later and for others who may want to contribute to our new screen reader project [Odilia](https://yggdrasil-sr.github.io).
## DBus
[DBus](https://www.freedesktop.org/wiki/Software/dbus/) is a cool API!
Well, it's not an API, but rather a mechanism to share messages across processes in Linux;
this is generally called IPC, or Inter-Process Communication.
DBus can be used to [send and receive desktop notifications](https://specifications.freedesktop.org/notification-spec/latest/ar01s09.html),
[shut down your computer](https://www.freedesktop.org/wiki/Software/systemd/dbus/)
and, for my purposes, [get accessibility events](https://www.freedesktop.org/wiki/Accessibility/Walkthrough/).
### Inner Workings
DBus is an object-oriented approach to IPC.
It is split up into five main components that work together:
1. Objects
2. Interfaces
3. Methods
4. Properties
5. Buses
#### Objects, Methods & Properties
Objects are just like the objects you learned about in your CS classes:
an object is a structure which contains attributes, along with methods which can be called on it.
DBus objects are very similar, except that attributes are called properties.
Most DBus libraries provide a way for you to use "native objects" (i.e., a Python object, a C++ object, a Rust structure + implementation, etc.); this allows access to DBus methods using the language features available to you.
So for example, in Python you might write:
```python
obj = get_a_dbus_object()
print(obj.get_text()) # using a method
print(obj.locale) # using a property
```
This prints whatever the object's `GetText` method returns, followed by the value of its locale property.
Notice that DBus methods are conventionally written in Pascal case (i.e., the first letter of each word is capitalized), so the binding's `get_text` corresponds to `GetText` on the bus.
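To make the naming difference concrete, here is a minimal sketch of how a binding might map a native snake_case name to the Pascal-case name used on the bus. The `to_dbus_member` helper is hypothetical and written just for this post; real bindings have their own renaming rules.

```rust
/// Hypothetical helper (not part of any DBus crate): turn a snake_case
/// name into the Pascal-case spelling DBus members conventionally use.
fn to_dbus_member(native_name: &str) -> String {
    native_name
        .split('_')
        .filter(|word| !word.is_empty())
        .map(|word| {
            let mut chars = word.chars();
            match chars.next() {
                // Uppercase the first character and keep the rest untouched.
                Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(),
                None => String::new(),
            }
        })
        .collect()
}

fn main() {
    for name in ["get_text", "get_children"] {
        println!("{} -> {}", name, to_dbus_member(name));
    }
}
```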
#### Interfaces
A DBus interface (not to be confused with a Java interface, or a Rust trait) is a named collection of methods and properties.
For example, the "Text" interface may have a property like "Length" or a method like "GetText".
So the interface "Text" is just a list of methods and properties all wrapped up together.
That's it! That simple!
This will come in handy later when we need to check if an object implements a method;
this way we can check for an entire interface of methods and properties instead of checking for each individually.
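As a rough illustration of why checking for an interface is convenient, here is a toy model in plain Rust. `ToyObject` is a stand-in I made up, not a type from any DBus crate; I am assuming the AT-SPI interfaces are named in the `org.a11y.atspi.*` style, and a real object would be asked over the bus instead of holding a local set.

```rust
use std::collections::HashSet;

/// Toy stand-in for a remote DBus object; not a real binding type.
struct ToyObject {
    interfaces: HashSet<&'static str>,
}

impl ToyObject {
    fn new(interfaces: &[&'static str]) -> Self {
        Self {
            interfaces: interfaces.iter().copied().collect(),
        }
    }

    /// One lookup tells us whether the whole group of methods and
    /// properties behind an interface is available.
    fn implements(&self, interface: &str) -> bool {
        self.interfaces.contains(interface)
    }
}

fn main() {
    let paragraph = ToyObject::new(&["org.a11y.atspi.Accessible", "org.a11y.atspi.Text"]);
    if paragraph.implements("org.a11y.atspi.Text") {
        println!("safe to call GetText on this object");
    }
    if !paragraph.implements("org.a11y.atspi.Hyperlink") {
        println!("no Hyperlink members here, so don't ask for them");
    }
}
```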
#### Buses
A bus's closest equivalent in standard computer science terms would be an IP address.
A bus address looks like ":1.39" (strictly speaking, DBus calls this a unique connection name); think of this like a raw IP address.
Some addresses have names associated with them like "org.a11y.Bus" (a well-known name); think of this like a DNS A record pointing at an IP (bus) address.
So a bus is just a place to send IPC requests, just like you'd send HTTP requests to a web server at a specific IP/port combination.
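If the DNS analogy helps, here is a small sketch of the lookup it implies. `ToyBus` and its `resolve` method are invented for illustration; on a real bus, the bus daemon keeps this table and answers the question for you.

```rust
use std::collections::HashMap;

/// Toy model of the name table a bus keeps; not a real DBus API.
struct ToyBus {
    // well-known name -> unique connection name
    well_known: HashMap<String, String>,
}

impl ToyBus {
    fn new(entries: &[(&str, &str)]) -> Self {
        let well_known = entries
            .iter()
            .map(|(name, owner)| (name.to_string(), owner.to_string()))
            .collect();
        Self { well_known }
    }

    /// Like a DNS lookup: names starting with ':' are already "raw"
    /// addresses, anything else is looked up in the table.
    fn resolve<'a>(&'a self, name: &'a str) -> Option<&'a str> {
        if name.starts_with(':') {
            Some(name)
        } else {
            self.well_known.get(name).map(String::as_str)
        }
    }
}

fn main() {
    let bus = ToyBus::new(&[("org.a11y.Bus", ":1.39")]);
    println!("{:?}", bus.resolve("org.a11y.Bus"));
    println!("{:?}", bus.resolve(":1.7"));
}
```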
### Accessibility Events & Information
Let's assume for a moment that you cannot see anything. You are blind.
If you try to read an article you obviously cannot see what is on your screen, so you need something to read it to you.
This technology that reads your screen to you is, uncreatively, called a screenreader, sometimes abbreviated "SR".
Well, how does a screenreader know what is on the screen? How does it know what a button is? And a link?
How does it know if content has changed or if an alert has been sent?
The first set of questions is about accessibility information (i.e., this button contains a certain string of text);
the second set is about accessibility events (an `aria-live` region has been updated, or an alert box has been displayed).
DBus can send these events and information to your process, if you ask for it.
This is what I want if I'm to create anything like a screenreader.
#### Accessibility Events in Rust
So why "Object:StateChanged\0"? Where does this come from?
The specification that is used to send this information to our DBus connection is called AT-SPI: Assistive Technology Service Provider Interface.
To clarify: DBus is the general IPC mechanism for processes in Linux;
AT-SPI is a standard for how to send accessibility information/events over the DBus protocol.
## AT-SPI
AT-SPI is described by a set of XML files that specify *how* to send data across DBus for accessibility events.
I'm going to be honest: at first this system *seems* very convoluted and unnecessarily complex.
Over time though, this system has grown on me as I start to see its "complexities" as a sort of after-effect of the core principle of *simplicity* used within DBus and the specifications which use it.
I have explained previously that DBus has objects and methods just like a native object in Python, C++ or JavaScript.
So let's say we want to implement the most basic thing a screenreader can do: read text.
Let's suppose we already have an item we want to get the text of.
Now to get the text of it, we call a method on the interface and pass the path.
This is abstracted away for us, generally speaking, when using any kind of language-specific DBus binding, but it's better to be explicit in this case.
No problem! We call `item.get_text()` and that's it, right?
No.
This is where, again, this "complexity" comes in.
Again, it starts out this way, but it will grow on anyone who enjoys the idea of the UNIX principles with time and understanding.
So what happens if we do `obj.get_text()`?
Let's try it on the first list item on my website's [homepage](/):
Here is the excerpt as it is written on the day of writing this article:
> I have three goals in my software development career:
> 1. Strong adherence to the <a href="https://?">UNIX principles</a> of software design.
> 2. Security, privacy and anonymity of the internet.
> 3. Accessibility of technology to the visually impaired.
What would you expect to receive if you ran `get_text()` on the first list item there?
If you, like me, were a little brainlette, you probably guessed "1. Strong adherence to the UNIX Principles of software design."
Let's find out if this is correct:
```rust
let text = acc.get_text();
println!("TEXT: \"{}\"", text);
$ cargo run
TEXT: "1. Strong adherence to the   of software design."
```
If you read that carefully, you'll see there are what look like three spaces where the UNIX principles link should go.
This is *extremely* deceptive for two reasons:
1. One of those is *NOT* a space. It's an [Object Replacement Character](https://www.fileformat.info/info/unicode/char/fffc/index.htm), aka Unicode code point U+FFFC.
2. It looks like it has just dropped a piece of text without telling us! And without a way to get it back! *Gasp!* Oh the horror!
This is what I thought too.
But allow me to defend this for a minute.
What if you had something complex like a table, a block quote, an image or even something like a [MathML equation](example) inside the block of text (in our case, inside a list item, but this applies to any piece of text inside another)?
If you had a table, would you want to read it out?
With MathML, you might want to say everything up front, but it would need some amount of processing before it is readable as text.
And even with a link, there is a reason for this.
If you can see perfectly fine and browse the web with your eyes, you can tell visited links from unvisited ones by their color. A darker color generally indicates a visited link,
whereas a lighter color generally indicates an unvisited link.
When a screenreader gets info about a piece of text, it needs to convey that information to its user, like "UNIX principles...link" or "UNIX principles...visited link".
So if I get the text of some item which contains another, should it include all sub items? What about just links? Should it tell you if the link is visited or not?
All these questions above would introduce additional complexity to answer if being done within a single query.
This has given me pause in my youthful "the system is broken" angst that generally plagues my thinking;
instead I see this is a very sober-minded and UNIX-y design principle that I think makes much more sense than the alternative.
Here are some major advantages of this method:
1. It allows *optional* processing of sub-elements; maybe you don't care what is underneath the element: this saves processing power and complexity.
2. It allows *custom* processing of sub-elements; you do not have to rely on AT-SPI to tell you what information you want. Perhaps you only need the role of the sub-element, not the entire text of it: again, this saves CPU cycles and code complexity.
3. It allows arbitrary data to be inside any other structural element. This is optimal for HTML, which is built to have more or less arbitrary nesting of elements.
In reality: this is actually genius design!
My next question is: "If it uses the object replacement character so it can replace the children, then what happens if the object replacement character is actually in the text?"
Well, with some processing you can actually find out where each child goes, or if the object replacement character is actually written in the text itself.
How so?
First off, let's get a list of children.
We can do this with `obj.get_children()`.
```rust
// the Rust way of awaiting and not caring about the error case is: .await.unwrap()
println!("CHILDREN: {:?}", obj.get_children().await.unwrap());
$ cargo run
CHILDREN: [(":1.7", Path("/org/a11y/atspi/accessible/193\u{0}")), (":1.7", Path("/org/a11y/atspi/accessible/194\u{0}"))]
```
You'll notice that the children are merely a list of tuples;
each tuple only contains, at its core, two strings:
* Sender: A string describing which application has sent the information.
* Path: A string describing which element is being sent.
The sender, you will notice, looks suspiciously like a bus address.
This is actually what it is. Each process has a bus address, and it is letting you know where it's coming from.
The path is a (TODO) path to a new object that we can query through DBus if we want more information.
// TODO
Remember earlier when we used a sender, path and connection to connect to the accessibility bus?
And later when we created a Proxy to an object that was sent over as part of an event?
Well, this is the same idea!
We clone the `Arc<SyncConnection>` connection so we keep talking to DBus over the same connection, and we use the sender and path to create a proxy on which we can call the same methods as on the parent!
Pretty cool stuff!
Okay, now back to what I was saying about being able to grab information about children to find out if we need to replace the object replacement characters or not.
(TODO)
Here's what we do: there is an [interface](???), we talked about this earlier, called [Hyperlink](???xml);
the Hyperlink interface can actually tell us what cursor position inside the parent the child occupies.
Some objects we get over DBus will not support this, but the vast majority will.
I dislike the fact that it is called Hyperlink. Even though I can see that links are the primary use case,
I think it's reasonable to say that `StartIndex` and `EndIndex` are not exactly unique to hyperlinks (`<a>` tags).
Minor criticism aside, there is an opportunity here to match each child against the parent and find out if and where the child belongs.
Here's how:
If we get the position of every occurrence of the object replacement character from the parent,
and check each child to see if its `StartIndex` matches the position of the object replacement character,
then anytime it matches, that is where the child belongs.
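Here is a minimal sketch of that matching step in plain Rust, with no DBus involved. The `Child` struct is a stand-in for whatever a real proxy would return, and I am assuming `StartIndex` counts characters (not bytes) from the start of the parent's text. A placeholder with no matching child is treated as a literal object replacement character and left alone.

```rust
/// U+FFFC, the object replacement character.
const OBJECT_REPLACEMENT: char = '\u{FFFC}';

/// Toy child: its own text plus the start index reported for it.
/// A stand-in for illustration, not an AT-SPI type.
struct Child {
    start_index: usize,
    text: String,
}

/// Character index of every object replacement character in `text`.
fn placeholder_positions(text: &str) -> Vec<usize> {
    text.chars()
        .enumerate()
        .filter(|(_, c)| *c == OBJECT_REPLACEMENT)
        .map(|(index, _)| index)
        .collect()
}

/// Replace each placeholder with the text of the child whose start index
/// matches its position; unmatched placeholders are kept as they are.
fn expand(parent_text: &str, children: &[Child]) -> String {
    let placeholders = placeholder_positions(parent_text);
    let mut expanded = String::new();
    for (index, c) in parent_text.chars().enumerate() {
        let child = if placeholders.contains(&index) {
            children.iter().find(|child| child.start_index == index)
        } else {
            None
        };
        match child {
            Some(child) => expanded.push_str(&child.text),
            None => expanded.push(c),
        }
    }
    expanded
}

fn main() {
    let parent = "1. Strong adherence to the \u{FFFC} of software design.";
    let link = Child {
        start_index: 27,
        text: "UNIX principles".to_string(),
    };
    println!("{:?}", placeholder_positions(parent));
    println!("{}", expand(parent, &[link]));
}
```

A real screenreader would probably not paste the child's text in blindly; it would look at the child's role first, so it can say "link" or "visited link" where appropriate.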
There is another use for this that I would like to point out.
I think it makes a reasonable case for pulling these indices into their own interface, or merging them into Accessible.
That is: structural navigation.
## Structural Navigation
People who use screenreaders have some special abilities I actually wish browsers implemented by default:
the ability to jump through the document by specific tags and attributes.
It's not sophisticated;
depth first search forward or backward looking for the closest heading, link, button, table, etc.
This is so ingrained in screenreader users that when a page finishes loading,
it is customary for the screenreader to announce (speak out loud to the user) the number of tables, headings, and visited and unvisited links that are on the page.
If I want to look for the next heading in an HTML document, however,
I cannot start by just checking all children, because it is fairly common to have various tags embedded in your current tag.
I need to know which children are after and which are before my caret.
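To show the shape of that search, here is a small sketch over an in-memory tree. Everything in it (`Role`, `Node`, `find_closest`) is made up for this post, and it assumes the whole tree is already loaded and that we know which node the user is currently on; a real screenreader would have to fetch children over DBus as it goes.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Role {
    Document,
    Heading,
    Paragraph,
    Link,
}

/// Toy accessibility tree node; a stand-in, not an AT-SPI type.
struct Node {
    id: u32,
    role: Role,
    children: Vec<Node>,
}

fn node(id: u32, role: Role, children: Vec<Node>) -> Node {
    Node { id, role, children }
}

/// Depth-first, pre-order walk: visits nodes in document order.
fn flatten<'a>(current: &'a Node, out: &mut Vec<&'a Node>) {
    out.push(current);
    for child in &current.children {
        flatten(child, out);
    }
}

/// Id of the closest node with `role` after (or before) the node `current`.
fn find_closest(root: &Node, current: u32, role: Role, forward: bool) -> Option<u32> {
    let mut order = Vec::new();
    flatten(root, &mut order);
    let position = order.iter().position(|n| n.id == current)?;
    if forward {
        order[position + 1..].iter().find(|n| n.role == role).map(|n| n.id)
    } else {
        order[..position].iter().rev().find(|n| n.role == role).map(|n| n.id)
    }
}

fn main() {
    let page = node(1, Role::Document, vec![
        node(2, Role::Heading, vec![]),
        node(3, Role::Paragraph, vec![node(4, Role::Link, vec![])]),
        node(5, Role::Heading, vec![]),
    ]);
    // Standing on the link (id 4), look for headings in both directions.
    println!("{:?}", find_closest(&page, 4, Role::Heading, true));
    println!("{:?}", find_closest(&page, 4, Role::Heading, false));
}
```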
## The Carrot 🥕
The caret is the same as your cursor in an input box.
Type right here and watch as your cursor (aka caret) moves with your typing: <input type="text" placeholder="type here">
The caret, or cursor, is something that most people are only used to seeing in the context of *editable* text,
but screenreader users enable a special mode in their browser (usually activated with F7) called "caret browsing".
Caret browsing allows you to navigate through a webpage using a cursor
even when the text is not editable.
This is *awesome*!
I cannot overstate how useful this is to me, just for keyboard-driven simplicity's sake and trying to eliminate the mouse as much as possible.
Try it now! You can always turn it off with F7, just the same as enabling it.
This caret can be moved around just like in any run-of-the-mill WYSIWYG (What You See Is What You Get) editor like Word or LibreOffice Writer.
This is how a screenreader user navigates the web:
with a cursor.
They use it to read one character at a time (with left and right arrow), a word at a time (Ctrl+left or right arrow) or entire lines of text (using up and down arrow).
This becomes, in essence, the active focus of the user: it is always on the cursor (a.k.a. caret).
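As a rough sketch of what those movements mean, here is a caret modelled as a character index into a piece of text. The function names are my own, and the word rule (skip the rest of the current word, then any whitespace) is a simplifying assumption; real browsers and editors have more elaborate word boundaries.

```rust
/// Move one character to the right, stopping at the end of the text.
fn next_char(text: &[char], caret: usize) -> usize {
    (caret + 1).min(text.len())
}

/// Move one character to the left, stopping at the start of the text.
fn previous_char(caret: usize) -> usize {
    caret.saturating_sub(1)
}

/// Move to the start of the next word: skip the rest of the current
/// word, then skip the whitespace that follows it.
fn next_word(text: &[char], caret: usize) -> usize {
    let mut position = caret;
    while position < text.len() && !text[position].is_whitespace() {
        position += 1;
    }
    while position < text.len() && text[position].is_whitespace() {
        position += 1;
    }
    position
}

fn main() {
    let text: Vec<char> = "read one word".chars().collect();
    let mut caret = 0;
    // Walk the caret through the text a word at a time.
    while caret < text.len() {
        let end = next_word(&text, caret);
        let word: String = text[caret..end].iter().collect();
        println!("caret at {}: {:?}", caret, word.trim_end());
        caret = end;
    }
}
```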
## Keyboard Input
Keyboard input with accessible applications follows a very complex path, which can be a serious buzzkill when attempting to build a high-performance screenreader.
Let me show you what the issues are; the assistive technology (the screenreader, in this case) will be written as "AT" in this diagram:
```
Wayland: Kernel -> libinput -> DE/WM -> accessible application -> AT
X11: Kernel -> Xorg -> DE/WM -> accessible application -> AT
```
What happens in the case of an inaccessible application?
It doesn't work, at all. A key press which is sent to an inaccessible application will *not* be sent to an AT application (i.e., a screenreader).
This is a serious problem that I don't think *should* exist at all.
Perhaps there is some mechanism I am missing as to how to interrupt these keys before they pass all the way to an application and then just hope the GUI is accessible;
supposing that this is not the case, we need a system to interrupt the keys before they are sent all the way down the stack, then sent to the screenreader.
This is needed for two reasons:
1. Performance: it doesn't make sense to send keys that far down the stack, just to hope the application implements accessibility correctly; we should be able to interrupt key presses *before* they get to the application.
2. Control: it is best to be able to control things regardless of whether an application is running or not. Under a system where an application must be accessible to send us keystrokes, a non-responsive application will not send us keystrokes either.
To have full control and maximum performance, we need to interrupt the keys at their source.
### rdev
`rdev` is a Rust crate which can (with the "unstable_grab" feature enabled) grab keys from the Linux kernel before they are passed any further down the stack.
It allows us to consume events if we do not want to also do the default action; for example, in "Browse Mode" a screenreader user will use the letter h to jump between headings within a page;
normally this would type the letter h, so to stop this from happening we can consume (or "eat") the event so that it isn't sent any further at all.
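The decision itself can be sketched without touching the keyboard at all. The types below (`Mode`, `KeyPress`, `on_key`) are stand-ins I made up, not rdev's own, and "Focus" is just my placeholder name for a mode where keys pass straight through. The idea is that returning nothing swallows the event and returning the event lets it continue.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode {
    /// Single letters are navigation commands.
    Browse,
    /// Hypothetical pass-through mode: keys go to the application.
    Focus,
}

/// Stand-in for a key event; a real one carries much more information.
#[derive(Debug, Clone, Copy, PartialEq)]
struct KeyPress(char);

/// Return `None` to consume ("eat") the event, or `Some(event)` to let
/// it continue down the stack to the application.
fn on_key(mode: Mode, event: KeyPress) -> Option<KeyPress> {
    match (mode, event) {
        (Mode::Browse, KeyPress('h')) => {
            // A real screenreader would jump to the next heading here.
            None
        }
        _ => Some(event),
    }
}

fn main() {
    println!("{:?}", on_key(Mode::Browse, KeyPress('h')));
    println!("{:?}", on_key(Mode::Focus, KeyPress('h')));
}
```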
INSERT CONNECTION PARAGRAPH
## Pulling It All Together
Now that we have the basics of DBus, AT-SPI, caret browsing and structural navigation, let's put it all together in a final program which can actually accomplish something:
```rust
// use DBus to get the bus address of the accessibility (a11y) bus.
// connect to the accessibility bus and ask to receive focus change events
// speak the text of the current element, chopping off by line breaks, and including link information
// use odilia-input to get keystrokes
```
ALMOST DONE
ADD CONCLUSION