AI-Powered Infinite Test Prep (Part 2 - SAT Math)
Gen AI Eats SAT Math Practice Tests
This is Part 2 of the series, AI-Powered Infinite Test Prep.
In Part 1, I shared my mom-in-law's struggles with the DMV test in English.
This inspired me to build PassMyTests.com which generates infinite test practice questions while translating explanations into her native language.
I considered other standardized exams with a similar multiple-choice format that haunt many millions more people:
The dreaded SAT!
In 2023 alone, 1.9 million students took the SAT.
I'm embarrassed to reveal how many hours I've spent studying for it.
I'd shove big SAT prep books into my backpack, heading off to Barnes & Noble.

I was one of those kids!!
SAT Math Test
I decided to tackle the SAT Math test first.
My MVP goal:
Generate infinite conceptually plausible and realistic SAT Math questions, while translating the answer explanations for non-native English speakers.
First, I repurposed my DMV prompt:
Your task is to act as a terminal for a written test, the SAT Math Test.
Produce a random 4-choice question.
The question should be pulled from a random set of 500 questions.
The question should be in English.
Finally, return the output as valid JSON in the following format:
{
"body": "Body of the question",
"choices": ["Choice A", "Choice B", "Choice C", "Choice D"],
"answer": <index of the current choice e.g. A is 0, B is 1, etc.>,
"explanation": "Short explanation of the answer",
"translation": "${translation}"
}
Next, I made a new route in my MVP:
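A minimal sketch of what such a route might look like, assuming a Next.js App Router setup and the official openai Node SDK (the route path, helper name, and request shape are illustrative, not the actual code):

```ts
// app/api/sat-math/route.ts -- illustrative path, not necessarily the real one
import OpenAI from "openai";
import { NextResponse } from "next/server";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Hypothetical helper: interpolates the target language into the prompt above
function buildSatMathPrompt(translation: string): string {
  return `Your task is to act as a terminal for a written test, the SAT Math Test.
Produce a random 4-choice question.
...
"translation": "<short explanation translated to ${translation}>"`;
}

export async function POST(request: Request) {
  // Assumed request shape: { "translation": "Spanish" }
  const { translation } = await request.json();

  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [{ role: "user", content: buildSatMathPrompt(translation) }],
  });

  // Naive for now: assume the model returned valid JSON (see Challenge 3)
  const question = JSON.parse(completion.choices[0].message.content ?? "{}");
  return NextResponse.json(question);
}
```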
Here's a sample question and explanation translated to Spanish:

[Screenshot: sample question with Spanish translation. Source: PassMyTests.com - SAT Math]
This MVP didn't take long, but now the hard part begins.
Next, I share several challenges I encountered.
Challenge 1: Missing Correct Answer
Sometimes, ChatGPT fails to include the correct answer as an option:

[Screenshot: a generated question whose four choices do not include the correct answer]
In the above image, the correct answer is 600 square meters.
But this was not an option.
To fix this, I added a verification step to my prompt:
Before returning the output, solve the question.
Check that the correct answer is present in the list of 4 choices.
Much better.
Sample questions generated by ChatGPT-4:

[Screenshots: two sample questions generated by ChatGPT-4]
However, even if the correct answer is in the answer list, sometimes the "answer" index in the output does not match the correct answer index.
Recall my prompt:
"answer":
<index of the current choice e.g. A is 0, B is 1, etc.>,
I haven't fixed this yet.
Next, I'll try assigning a random index between 0 and 3 to the correct answer myself, then inserting it into the JSON output.
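A sketch of that idea, assuming the prompt is changed so the model returns the solved correct answer and the three distractors as separate fields (the field names here are hypothetical):

```ts
// Hypothetical shape: the model returns the solved answer and the three
// wrong choices separately, instead of a pre-assembled choice list.
interface GeneratedQuestion {
  body: string;
  correctAnswer: string; // the solved answer, e.g. "x = 6"
  distractors: string[]; // exactly three wrong choices
  explanation: string;
  translation: string;
}

// Place the correct answer at a random index 0..3 ourselves, so the
// "answer" field is guaranteed to point at the correct choice.
function placeAnswerAtRandomIndex(q: GeneratedQuestion) {
  const answer = Math.floor(Math.random() * 4);
  const choices = [...q.distractors];
  choices.splice(answer, 0, q.correctAnswer); // insert into the random slot
  return {
    body: q.body,
    choices,
    answer,
    explanation: q.explanation,
    translation: q.translation,
  };
}
```

This way, only the app code ever decides the index, and that class of mismatch bug disappears.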
Challenge 2: In-Context Learning
Nonetheless, initial results seem promising.
It's been years since I studied for a test, but the questions looked plausible.
I was curious:
Would in-context learning improve it?
In-context learning is an effective prompt engineering technique where you include examples directly in your prompt, enabling ChatGPT to tackle new tasks without finetuning.
I added 4 real SAT Math questions to my prompt:
Your task is to act as a terminal for a written test, the SAT Math Test.
Produce a random 4-choice question.
The question should be pulled from a random set of 500 questions.
The question should be in English.
Before returning the output, solve the question.
Check that the correct answer is present in the list of 4 choices.
Finally, return the output as valid JSON in the following format:
{
"body": "Body of the question",
"choices": ["Choice A", "Choice B", "Choice C", "Choice D"],
"answer": <'correct_index'>,
"explanation": "Answer explanation, show your work step by step",
"translation": "${translation}"
}
Below are examples of the type of questions you should generate. Each example is delimited by triple quotes. Use these examples to create varied questions.
"""
Maria is staying at a hotel that charges $99.95 per night plus tax for a room. A tax of 8% is applied to the room rate, and an additional onetime untaxed fee of $5.00 is charged by the hotel. Which of the following represents Maria's total charge, in dollars, for staying x nights?
A. (99.95 + 0.08x) + 5
B. 1.08(99.95x) + 5
C. 1.08(99.95x + 5)
D. 1.08(99.95 + 5)x
"""
"""
x^2 + y^2 = 153
y = -4x
If (x, y) is a solution to the system of equations above, what is the value of x^2?
A. -51
B. 3
C. 9
D. 144
"""
"""
The function f is defined by f(x) = 2x^3 + 3x^2 + cx + 8, where c is a constant. In the xy-plane, the graph of f intersects the x-axis at the three points (-4, 0), (1/2, 0), and (p, 0). What is the value of c?
A. -18
B. -2
C. 2
D. 10
"""
"""
A researcher wanted to know if there is an association between exercise and sleep for the population of 16-year-olds in the United States. She obtained survey responses from a random sample of 2000 United States 16-year-olds and found convincing evidence of a positive association between exercise and sleep. Which of the following conclusions is well supported by the data?
A. There is a positive association between exercise and sleep for 16-year-olds in the United States.
B. There is a positive association between exercise and sleep for 16-year-olds in the world.
C. Using exercise and sleep as defined by the study, an increase in sleep is caused by an increase of exercise for 16-year-olds in the United States.
D. Using exercise and sleep as defined by the study, an increase in sleep is caused by an increase of exercise for 16-year-olds in the world.
"""
Here are questions generated with in-context learning:

[Screenshots: five questions generated with in-context learning]
Unfortunately, the questions are repetitive and substantially less random after applying in-context learning.
Even ChatGPT agrees!
In the last image, ChatGPT admits:
"It seems the same question was generated again."
For now, I removed in-context learning from my prompt.
Challenge 3: Probabilistic Outputs
As I discussed in a previous post on LLM Ops, LLMs are probabilistic.
There is a non-zero probability that it occasionally returns invalid JSON, despite my explicit request for valid JSON output.
I encountered the first JSON failure after only N=20 generations:
{
"body": "In a geometry class, the measures of the three angles of a triangle are \(3x\), \(4x\), and \(5x\) degrees. Which of the following could be the measure of the smallest angle?",
"choices": ["\(20^\circ\)", "\(30^\circ\)", "\(40^\circ\)", "\(50^\circ\)"],
"answer": 0,
"explanation": "In any triangle, the sum of the three angles is always \(180^\circ\). Therefore, \(3x + 4x + 5x = 180\). Solving for x gives \(12x = 180\), so \(x = 15\). The smallest angle would then be \(3x = 3(15) = 45^\circ\), which is closest to \(40^\circ\).",
"translation": "${translation}"
}
ChatGPT mixed LaTeX and JSON without escaping!
In a real LLM-powered app, if I did not have robust error handling in place, this could crash the app.
As an imperfect band-aid, I added another verification step:
Confirm a valid JSON output before finalizing your answer.
But there's still no guarantee because LLM outputs are probabilistic.
A faster and cheaper approach, avoiding another LLM call, is to assume the JSON output could be malformed and fix it with a custom parser.
Only if the parser fails do I retry with ChatGPT.
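A sketch of that flow; `callChatGPT` is a stand-in for whatever wraps the OpenAI request, and the backslash repair specifically targets the LaTeX-in-JSON failure above:

```ts
// Try to parse; on failure, attempt a cheap local repair before giving up.
function tryParseQuestion(raw: string): unknown | null {
  try {
    return JSON.parse(raw);
  } catch {
    // Repair attempt: escape backslashes that don't start a valid JSON
    // escape, which covers LaTeX like \( and \circ leaking into the JSON.
    const repaired = raw.replace(/\\(?![\\"/bfnrtu])/g, "\\\\");
    try {
      return JSON.parse(repaired);
    } catch {
      return null; // local repair failed; caller falls back to a retry
    }
  }
}

// Only when local parsing + repair fails do we pay for another LLM call.
async function generateQuestion(
  callChatGPT: () => Promise<string>, // stand-in for the OpenAI request
  maxAttempts = 3
): Promise<unknown> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const parsed = tryParseQuestion(await callChatGPT());
    if (parsed !== null) return parsed;
  }
  throw new Error(`No valid JSON after ${maxAttempts} attempts`);
}
```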
Just as I was congratulating myself for this brilliant 80/20 solution, I ran into another issue caused by probabilistic outputs!
On generation N=1000, I got this perplexing and very wrong JSON output:
- the JSON repeats itself
- the translation field contains the fully translated JSON
- the correct answer is x = 6 (index 2), but the JSON's "answer" is 1 (x = 5)
{
"body": "What is the solution to the equation 4x - 7 = 17?",
"choices": ["A. x = 10", "B. x = 5", "C. x = 6", "D. x = 4"],
"answer": 1,
"explanation": "Add 7 to both sides of the equation to isolate 4x. This gives 4x = 24. Then, divide by 4 on both sides to solve for x, which results in x = 6.",
"translation": "{
"body": "ĀæCuĆ”l es la soluciĆ³n de la ecuaciĆ³n 4x - 7 = 17?",
"choices": ["A. x = 10", "B. x = 5", "C. x = 6", "D. x = 4"],
"answer": 1,
"explanation": "Suma 7 a ambos lados de la ecuaciĆ³n para aislar 4x. Esto da como resultado 4x = 24. Luego, divide entre 4 en ambos lados para resolver x, lo que da como resultado x = 6.",
"translation": ""
}
}
Perhaps ChatGPT is deliberately trying to entertain me…
Ugh, and I thought I had solved Challenge #1.
Definitely haven't solved Challenge #3.
Traditional software engineering deals with structured data.
LLMs deal with unstructured natural language and probabilistic outputs.
I'm still grappling with this new reality.
Conclusion
Altogether, I enjoyed building this MVP and see many ways to improve it.
I believe Gen AI will massively transform traditional test prep and education more broadly, increasing accessibility especially across multiple languages.
Here's a slice of my "I Should Really Do That" list:
- Evaluation criteria and methodology: given 4 different prompts and different models, generate N=100, 1,000, and 10,000 questions, and evaluate each question against a set of criteria
- Restrict question types, e.g. LINEAR_EQUATIONS, QUADRATIC_EQUATIONS, GEOMETRY_TRIANGLE, LINEAR_ALGEBRA (see the sketch below)
- Multimodality: generate both an image and the question
- Create brand new questions based on concepts: given a concept along with its axioms and theorems, generate a question that utilizes it
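For the question-type restriction, a minimal sketch (the type names come from the list above; the prompt wiring is illustrative):

```ts
// Constrain each generation to one randomly chosen question type.
const QUESTION_TYPES = [
  "LINEAR_EQUATIONS",
  "QUADRATIC_EQUATIONS",
  "GEOMETRY_TRIANGLE",
  "LINEAR_ALGEBRA",
] as const;

function buildRestrictedPrompt(): string {
  const type =
    QUESTION_TYPES[Math.floor(Math.random() * QUESTION_TYPES.length)];
  return `Your task is to act as a terminal for a written test, the SAT Math Test.
Produce a random 4-choice question of type ${type}.
...`;
}
```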
If you have feedback or requests for PassMyTests.com, DM me anytime @Sabrina_Ramonov or on LinkedIn.