Help with ARM Cortex-A53?

Ceegen · Apr 1, 2020

I'm having a hard time finding info on the ARM Cortex-A53, the processor in the Raspberry Pi 3 B+, the 1.4GHz 64-bit quad-core Broadcom type. Would anyone have any idea on how to explain the way the load/store architecture works? Not that I want it explained in detail, but just a few info bits. I was reading the official ARM docs on the instruction set, but they don't seem to have a diagram of the bit patterns and what they would correspond to in terms of representing an instruction? Or in what sequence, how do the cores/threads cycle? What is the starting address of memory that is first read into the Pi from the micro sd card? (test/rhetorical questions). Or am I just way off?
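For what it's worth, the core idea of a load/store architecture fits in a few lines: arithmetic only ever works on registers, and memory is reached only through explicit load and store steps. The sketch below is a toy model in Python, not real A64 code; the addresses, register names, and helper functions are all made up for illustration.

```python
# Toy model of a load/store machine. Arithmetic touches registers only;
# memory is reached only through explicit load and store steps.
# Addresses, register names and helpers are hypothetical.

memory = {0x1000: 7, 0x1008: 35, 0x1010: 0}   # address -> 64-bit word
regs = {"x0": 0, "x1": 0, "x2": 0}

def ldr(rd, addr):
    """Load: copy a word from memory into a register."""
    regs[rd] = memory[addr]

def add(rd, rn, rm):
    """Arithmetic: operands and result are registers, never memory."""
    regs[rd] = (regs[rn] + regs[rm]) & 0xFFFFFFFFFFFFFFFF

def str_(rt, addr):
    """Store: copy a register back out to memory."""
    memory[addr] = regs[rt]

ldr("x0", 0x1000)        # load first operand
ldr("x1", 0x1008)        # load second operand
add("x2", "x0", "x1")    # do the work in registers
str_("x2", 0x1010)       # store the result
print(memory[0x1010])    # 42
```

A real core adds caches, pipelines and addressing modes on top, but the load, operate, store pattern is the part the name refers to.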

As a reference I am currently reading:
https://www.sciencedirect.com/topics/computer-science/load-store-architecture
and
https://www.sciencedirect.com/topics/computer-science/arithmetic-instruction

Are these good sources of information? Seems ARM is pretty tight lipped about details. Someone asked about "programming in hex", someone flamed them for it, the thread got locked. He was talking about something as in this video:

https://www.youtube.com/watch?v=oO8_2JJV0B4

(5m 14s for reference, but this guy's channel is a goldmine for anyone wanting to learn nitty gritty, and this video is amazing).

Thank you for your time and consideration. I now have FreeBSD 12.1 working on my Raspberry Pi 3B+, and it is awesome. Thank you very much to all who contribute; I hope to give back one day.

ralphbsz · Apr 1, 2020

Do a Google search for "ARM instruction set". The first few hits (not on Arm's own website, strangely) are long PDF files that have the complete ARMv7 instruction set. Here's the one I clicked on. I just started reading it, and it gets really boring after 20 pages, but it seems to be complete and well documented. It would definitely be possible to program in assembly or "in hex" (we used to call it in binary) from that document.
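To make "programming in hex" concrete, here is a minimal sketch that builds one instruction word by hand. It uses the 64-bit A64 encoding of `ADD Xd, Xn, #imm12` (the Armv8-A form, not the ARMv7 one from the PDF) with no shift applied. The function names are made up for this example, and register 31 is rejected because it means SP rather than a general register in this encoding.

```python
def encode_add_imm(rd, rn, imm12):
    """Encode A64 `ADD Xd, Xn, #imm12` (64-bit, no shift) as a 32-bit word."""
    if not (0 <= rd <= 30 and 0 <= rn <= 30):
        raise ValueError("this sketch only handles X0..X30")
    if not 0 <= imm12 <= 0xFFF:
        raise ValueError("immediate must fit in 12 bits")
    # 0x91000000 sets sf=1 (64-bit) plus the fixed opcode bits.
    return 0x91000000 | (imm12 << 10) | (rn << 5) | rd

def decode_add_imm(word):
    """Split the same encoding back into its fields."""
    return {
        "sf": (word >> 31) & 1,          # 1 = 64-bit registers
        "imm12": (word >> 10) & 0xFFF,   # unsigned immediate
        "rn": (word >> 5) & 0x1F,        # source register number
        "rd": word & 0x1F,               # destination register number
    }

word = encode_add_imm(rd=0, rn=1, imm12=1)    # add x0, x1, #1
print(hex(word))                              # 0x91000420
print(word.to_bytes(4, "little").hex())       # 20040091, byte order in memory
print(decode_add_imm(word))
```

The second print shows the four bytes as they would sit in memory, since A64 instructions are stored little-endian.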

Ceegen · Apr 1, 2020

32 bit version will work in it. The 3B+ supports Armv8-A (64) and I wanted to focus on that. Do I read it right that you have to "slide" or shift the register as if unfolding the 32+32 into another register? Immediate value + r1 (or whatever)? Is that all that needs to be done to activate 64bit mode? I also read somewhere that the first x amount of kbs read from the memory card go into the video core memory, to the neon processors (as the built-in startup routine), and spits it back out into the L1 for the processors to start on? I think that is how this seems to be forming, as I read more.

https://developer.arm.com/ip-products/processors/cortex-a/cortex-a53
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0500d/DDI0500D_cortex_a53_r0p2_trm.pdf
*page 7 of 635*
Neck deep, no turning back now.

ralphbsz · Apr 1, 2020

Now you are way deeper than my knowledge.

Ceegen · Apr 1, 2020

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0500j/BABDCHHI.html

The following confidential books are only available to licensees:
• Arm® Cortex®-A53 MPCore Processor Cryptography Extension Technical Reference Manual (DDI 0501).
• Arm® Cortex®-A53 MPCore Processor Configuration and Sign-off Guide (DII 0281).
• Arm® Cortex®-A53 MPCore Processor Integration Manual (DIT 0036).
• Arm® AMBA® 5 CHI Protocol Specification (IHI 0050).
• Armv8 AArch32 UNPREDICTABLE behaviors.

Hit the paywall, don't exactly need the cryptography, but (DII 0281), (DIT 0036) and (IHI 0050) seem necessary to progress to the level of manipulating the SoC. Being super skeptical of all things provided by the manufacturer, it is necessary in the tradition of free speech (that BSD was based on) that at some point small potatoes are thrown a bone every now and again after the tech and everything becomes obsolete? By the time I read all this and finally put something together, quantum computers will have been replacing the ones we have now. Technology is exciting and frustrating all at the same time. I have gray hairs now. <3

Ceegen · Apr 1, 2020

Could anyone explain how the physical memory (ram) interfaces with the L2 cache? Where does the other hardware insert their bits into ram at [from physical address] to [end physical address]? Anyone have a chart of that?
For reference, I was reading:
https://wiki.osdev.org/PS2_Keyboard
https://wiki.osdev.org/Printing_to_Screen
It seems the difference between this and what we have in the "bios" of the SoC is just a memory representation of the keyboard/mouse interface in memory, and that the video works very differently now. The NEON video core seems to be able to do two things: Return a value, and, print to screen. So you can, in theory, use one (I think there are two NEON cores in a Raspberry Pi 3B+) of the video core for displaying text while the other does math things for you (floating point stuff mostly). Or both can be dedicated to number crunching and you can go 100% headless (no video at all).

Building a charset, in this case, needs a way of translating keystrokes read from memory into signals sent to the NEON coprocessor to display them as text on the screen. How that translation process works as a memory interface seems like the key to delving deeper? Quite the project. (Previous discussion)

msplsh · Apr 1, 2020

One person gets the paywalled documents and then implements stuff based on it. Often somebody pays them to do it, too. That's how it works.
What's the point of knowing this information, to you?

Ceegen · Apr 1, 2020

msplsh said:
What's the point of knowing this information, to you?

I'm nearly 40 years old and haven't done anything with my life. In my off time, I study things. Would rather educate myself than entertain myself. Just being a user doesn't cater to my interests in life. If Linus Torvalds can do it, why can't I? Learning Python and installing a module doesn't really interest me.

IMO the only thing holding me back right now is a bit of math knowledge I have yet to really master, and that is trigonometry and calculus, which is necessary for graphics and the NEON cores. Everything else I have enough familiarity with to be able to add functionality to a system by transcribing functions into processor commands, in theory as binary instructions as being first read into memory from specific locations on disk, and be able to manipulate a differently planned charset. This seems to be the most important thing to be able to manipulate (the charset) based on what type of processor architecture you are using. Plan on using 5 bits to represent 0-9 instead of 4 bits in a single byte. I base this theory of binary representation based on how the load/store architecture works, as in LD-DO-ST from processor to RAM as instructed in the L1 cache. How they first read those instructions and in what order then translates into all the things.

I just want to have fun, and this is my definition of fun...

msplsh · Apr 1, 2020

Ok, you mention Linus and not being a just "user" which are classes of people that program things, so...

What's the point of knowing this information? What are you going to program?

Ceegen · Apr 1, 2020

Sidenote, the RasPi was donated to me from a friend who was using it as a ham repeater, and he didn't need it anymore. Pretty good stuff, learned a lot learning how to turn the damn thing on in the first place. Going to hook a switch up to it for on/off/reset signals. Want to get another few of the 3B+ models for testing. Have plans on hooking a spinning rust HD to it, it is currently wired directly from a modified usb mini cable to plug it into the RasPi from 5+/G rails.

I turn the whole rig on with a piece of copper wire that bridges the green into a ground, that simulates the "on" signal for power to be sent to the rest of the wires. Disassembled a single mobo connector instead of liberating all the wires from the ATX harness. Yellow circle in the pic is next project: Switch for on/off/reset, custom formed plastic case from injection molding for fancy reasons. Reduction in wire space, I can more easily reduce noise across GPIO with shorter printed connections to bridge two of the same exact model RasPi and put them together, with a heat barrier (RF shielding substance added) between the two boards and extra cooling methods.

Ceegen · Apr 1, 2020

msplsh said:
What are you going to program?

Stuff.
Sorry, but, to elaborate: Perhaps a FreeBSD version that I modify, for instance. Instead of reinventing the wheel, just audit and adapt? Piecing puzzle pieces together of a bigger overall structure, based on movement of information rather than just the manipulation of it. LD/ST on an Arm v8A means you can load information reading right to left, or, left to right. That small difference can probably net huge gains if I'm right, just want to test this theory out.
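One reading of "right to left, or, left to right" is byte order (endianness). Armv8-A can be configured for little-endian or big-endian data accesses. A minimal sketch, assuming that is what is meant, showing how the same four bytes in memory turn into two different 32-bit values depending on the order they are read in:

```python
import struct

# Four bytes as they might sit in memory at increasing addresses.
raw = bytes([0x78, 0x56, 0x34, 0x12])

# "<I" = little-endian unsigned 32-bit, ">I" = big-endian unsigned 32-bit.
little = struct.unpack("<I", raw)[0]
big = struct.unpack(">I", raw)[0]

print(hex(little))   # 0x12345678
print(hex(big))      # 0x78563412
```

Either order moves the same number of bytes, so on its own this is a question of interpretation rather than speed.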

msplsh · Apr 1, 2020

I don't mean to rain on your parade, but "stuff" just kind of indicates this quest isn't really going anywhere.

You aren't going to be able to modify FreeBSD at the assembler level like this, because FreeBSD is compiled. If you wanted to make this kind of optimization, you'd do it at the compiler level. For all you know, Clang might already take advantage of this optimization. Or it doesn't use it because it's too slow. Who knows? Only the compiler people!

Everything is built on layers. You're talking about "transcribing functions into processor commands" but that's what a compiler does. Nobody hand tunes assembler anymore. You could, but silicon is not a fine crafted wood that lasts forever. You already realize this with your quantum computers comment.

Everybody builds on the shoulders of giants. Pick a layer and work within it. Trying to reinvent the whole stack is too complicated for one person and will not end well.

Ceegen · Apr 1, 2020

msplsh said:
Only the compiler people!

I don't think like this.

msplsh · Apr 1, 2020

Don't think like what? Only the compiler people who work on Clang know how Clang works?

ralphbsz · Apr 2, 2020

Ceegen said:
Could anyone explain how the physical memory (ram) interfaces with the L2 cache?

Where does the other hardware insert their bits into ram at [from physical address] to [end physical address]? Anyone have a chart of that?

Now you are outside the realm of the Arm ISA (instruction set), you are into physical implementation. For that you need to talk to the people who make the chip itself, which if I remember right is Broadcom. And Broadcom is, if I remember right, pretty tight-lipped about details. For really good reason: If they were to publish those details openly, their chip would be cloned so fast, they would never be able to sell them, meaning they would make no money, meaning they wouldn't have built the chip in the first place.

I understand that not being able to get documentation is frustrating. But it is the way of the world.

Building a charset, in this case, needs a way of translating keystrokes read from memory into signals sent to the NEON coprocessor to display them as text on the screen. How that translation process works as a memory interface seems like the key to delving deeper?

Your statement here is very hard to understand. You want to understand all the layers between you hitting a key on the keyboard connected to a Raspberry Pi and the pixels on the screen lighting up, and you want to do all that understanding at the detail level of thinking through all the memory reads and writes?

If that's what you want, that's insane. It is WAY too complicated to address at that level of detail. Just as an example: The keyboard itself contains a microprocessor. Which then interfaces to USB (keyboards are connected via USB), which itself is a protocol complex enough to fill a few hundred pages of documentation. Then on the RPi, there is a USB host built into the Broadcom chip. That thing is driven by a software driver, which is probably multiple layers (generic USB in the middle, hardware-specific driver like XHCI below, keyboard driver above), all a few thousand lines of code. Then your keystroke goes through roughly a half dozen layers in the kernel before it ends up in userspace, for example in a shell or editor. The book that describes the kernel (quite abstractly, without details) is by Kirk McKusick and others (I've been reading a few dozen pages every evening), and that alone is a few hundred pages long. Now you are in userspace, some application (shell?) turns the character around and writes it to a device driver interface (probably a pseudo-terminal), from where it goes into the kernel. Many thousands of lines of code later, you end up in the framebuffer video driver, where a font is looked up, and a pixel is toggled.
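The very last step in that chain (look up a font, toggle pixels) is small enough to sketch. This is only an illustration in Python, not how any real framebuffer driver is written; the 8x8 glyph data, the tiny 16x8 "framebuffer", and the `draw_glyph` helper are all made up for the example.

```python
# 8x8 bitmap for the letter "A": one byte per row, most significant bit
# is the leftmost pixel. Glyph data is hypothetical, drawn by hand.
GLYPH_A = [0x18, 0x24, 0x42, 0x42, 0x7E, 0x42, 0x42, 0x00]

WIDTH, HEIGHT = 16, 8
# One entry per pixel; a real framebuffer would be a flat memory region.
framebuffer = [[0] * WIDTH for _ in range(HEIGHT)]

def draw_glyph(fb, glyph, x0, y0):
    """Set the pixels of one glyph with its top-left corner at (x0, y0)."""
    for row, bits in enumerate(glyph):
        for col in range(8):
            if bits & (0x80 >> col):      # test one bit per pixel
                fb[y0 + row][x0 + col] = 1

draw_glyph(framebuffer, GLYPH_A, 4, 0)
for line in framebuffer:
    print("".join("#" if px else "." for px in line))
```

Everything before this step (USB, drivers, the terminal layer) exists to decide which glyph gets drawn and where.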

Just finding all the lines of code in FreeBSD that are involved in that one character would be a whole-semester project for a CS undergraduate class, for a team of 4 students. Actually, I should suggest this to some friends who teach CS in college: find all the lines of code just in the OS that make a character go from keyboard to screen. But in that code, you are ignoring the firmware in the USB interface and the graphics (for which you don't have the source code).

Now go to the complexity of translating all that code into instructions (see instruction set above), and from instructions to register and memory loads and stores, and you are probably into a multi-year project. If it weren't so excruciatingly boring, this would make a great PhD thesis.

I think you need to learn about layering. The only way to understand computers at the microscopic level (of individual memory transactions) is to do it for a very small part of the system, such as the bootloader, or the video driver that goes to the framebuffer.

Ceegen · Apr 6, 2020

ralphbsz said:
I understand that not being able to get documentation is frustrating. But it is the way of the world.

Indeed.

Your statement here is very hard to understand. You want to understand all the layers between you hitting a key on the keyboard connected to a Raspberry Pi and the pixels on the screen lighting up, and you want to do all that understanding at the detail level of thinking through all the memory reads and writes?

Yes.

If that's what you want, that's insane.

Never claimed to be sane lol.

Just finding all the lines of code in FreeBSD that are involved in that one character would be a whole-semester project for a CS undergraduate class, for a team of 4 students.

I am pretty sure I indicated that I'm fine with this being a multi-year project. Don't care if it takes me 5 years or 20 years to do it, I've got time. Going to take me a while to read through all the material posted here, will check back in from time to time. <3

msplsh · Apr 6, 2020

Ceegen said:
By the time I read all this and finally put something together, quantum computers will have been replacing the ones we have now.

This is where you're going to end up.

Start with understanding how the compilers work for ARM.

Also

Ceegen said:
What is the starting address of memory that is first read into the Pi from the micro sd card

This is going to be a RPi "BIOS" question.

GitHub - raspberrypi/firmware: This repository contains pre-compiled binaries of the current Raspberry Pi kernel and modules, userspace libraries, and bootloader/GPU firmware.

Hope you like decompiling.

Ceegen · May 31, 2020

"Start with understanding how the compilers work for ARM."

Much reading lol. About the ISA, which bits do what mostly. There is more material than I thought, just didn't know where to look. Mentioned previously about rotating the bits, think the mnemonic is ROR (rotate right) according to specs, and set those bits to indicate from where in the 32 or 64 bits I send it to. Theoretically changing how the processor reads by restricting arithmetic operations to ring 1, while restricting rotate/shift operations to ring 2, read/write to ring 0? Not necessarily a speed increase, but an advantage in some "secure" state by default? And so long as the address of some known reference point (in memory) never changes (reading from RAM and comparing it with secured/reserved memory in L1 as authentication), it could be ensured that security checks be done at each cycle of read and write for inter-processor control. So learning to address addresses in L1 vs L2 vs RAM is important, too, another method of being able to control the timing of these operations in how the read/write speed affects the decision in making one more calculation or writing on next cycle.
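For reference, a rotate right moves every bit toward the low end and wraps the bits that fall off back in at the top, so nothing is lost. A minimal sketch of that behaviour in Python for 32-bit and 64-bit widths; the `ror` helper is made up here and only models the arithmetic, not the instruction encoding.

```python
def ror(value, shift, width=64):
    """Rotate `value` right by `shift` bits within a `width`-bit register."""
    mask = (1 << width) - 1
    value &= mask
    shift %= width                     # rotating by the full width is a no-op
    return ((value >> shift) | (value << (width - shift))) & mask

print(hex(ror(0x1, 1, 32)))            # 0x80000000
print(hex(ror(0x12345678, 8, 32)))     # 0x78123456
print(hex(ror(0x1, 4, 64)))            # 0x1000000000000000
```

The width matters: the same rotate amount gives a different result in a 32-bit register than in a 64-bit one, because the wrap-around point moves.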

This theoretical processor instruction set alteration could make addresses be fed into the L2 cache [at address location] to be read on next cycle by the processing element in that core so shift operations are only done in registers zero to three (or whatever), with set memory boundaries in L2 that overlap for these communications, as defined by what the L1 dictates to the processor. Learning how to point the processing elements at which place in memory at what times is yet another mystery. I suspect dedicating a portion of processor power to write to RAM (or VRAM since this is an SoC? Correct?) for some print function as an interaction between keyboard and screen (and not the chair for once) as a fundamental part of keyboard or storage-memory (the sd card) to RAM // VideoCore IV // CPU or whatever. Most of that is already set up in some hard-coded process that uses the GPU to boot the CPU, which reads from x-location from storage in y-format? Seems fundamental in fiddling with things. Then there is this whole thing about setting the state of the processor to Thumb-2 vs 32 vs 64?

Gibs transistors plox.
