Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Show HN: Excel to Python Compiler (pyoneer.ai)
79 points by narush on May 23, 2024 | hide | past | favorite | 36 comments
We (me and @aarondia) built a tool to help you turn psuedo-software Excel files into real-software Python. Ideally, Pyoneer helps you automate your manual Excel processes. You can try it today here: https://pyoneer.ai.

How it works:

1. You upload an Excel file

2. We statically parse the Excel file and build a dependency graph of all the cells, tables, formulas, and pivots.

3. We do a graph traversal, and translate nodes as we hit them. We use OpenAI APIs to translate formulas. There’s a bunch of extra work here — because even with the best prompt engineering a fella like me can do, OpenAI sucks at translating formulas (primarily because it doesn’t know what datatypes its dealing with). We augment this translation with a mapping from ranges to variable names and types, which in our experience can improve the percentage of correctly translatable formulas by about 5x.

4. We generate test cases for our translations as well, to make sure the Python process matches your Excel process.

5. We give you back a Jupyter notebook that contains the code we generated.

If there are pieces of the Excel we can’t translate successfully (complex formulas, or pivot tables currently), then we leave them as a TODO in the code. This makes it easy for you to hop in and continue finishing the script.

Who is this for:

Developers who know Python, primarily! Pyoneer might be useful if:

1. You’ve got an Excel file you’re looking to move to Python (usually for speed, size, or maintenance reasons).

2. There’s enough logic contained in the notebook that it’s going to be a hassle for you to just rewrite it from scratch.

3. Or you don’t know the logic that is in the Excel workbook well since you didn’t write it in the first place :)

Post translation, even if Pyoneer doesn't nail it perfectly or translate all the formulas, you'll be able to pop into the notebook and continue cleaning up the TODOs / finish writing the formulas.

What the Alpha launch supports:

Launched early! Currently we’re focused on supporting:

1. Any number of sheets, with any reference structure between them.

2. Cells that translate as variables directly. We’ll translate the formulas to Python code that has the same result, or else we’ll generate a TODO letting you know we failed translating this cell.

3. Tables that translate as Pandas dataframes. We support at most one table per sheet, at the tables must be contigious. If the formulas in a column are consistent, then we will try and translate this as a single pandas statement.

We do not support: pivot tables or complex formulas. When we fail to translate these, we generate TODO statements. We also don’t support graphs or macros - and you won’t see these reflected in the output at all currently.

Why we built this:

We did YCS20 and built an open source tool called Mito(https://trymito.io). It’s been a good journey since then - we’ve scaled revenue and to over 2k Github stars (https://github.com/mito-ds/mito). But fundamentally, Mito is a tool that’s useful for Excel users who wanted to start writing Python code more effectively.

We wanted to take another stab at the Excel -> Python pain point that was more developer focused - that helped developers that have to translate Excel files into Python do this much more quickly. Hence, Pyoneer!

I’ll be in the comments today if you’ve got feedback, criticism, questions, or comments.



Why would you get an LLM to transpile the formula for you? There are already libraries that attempt to do this. Wouldn't your time be better served in expanding these libraries to overcome deficiencies? I don't get why you would throw away all the work done in this space and just offload it to an LLM. It's a technical problem. Imagine if ditched all transpilers and gave it to an LLM and said "good luck!"

https://formulas.readthedocs.io/en/stable/


Copying and pasting from a comment below -- we considered this heavily during our MVP, but in practice "moving an Excel process to Python" does not just mean executing an Excel file through Python. This pretty much doesn't replace your Excel dependence at all.

Consider the following (pretty easy) translation of a simple table:

    # This is not a useful or usable artifact; you're still trapped in Excel
    # but things are just worse now, since it's not on a 2d grid
    A1 = "Prices"
    A2 = 1
    A3 = 2
    B1 = "With Tax"
    B2 = SUM(A1, 10) * 1.3
    B3 = SUM(A2, 10) * 1.3
But this is ultimately is an awful solution for the user: 1. It’s 100% impossible to read. A large Excel file often has 100k+ formulas (many of them with shared structures). This is 100k+ lines of code... 2. It’s impossible to maintain. Yeah, since it lacks all semantic structure, there’s no f* way you’re going to test it or modify it.

To make it really concrete: you can't just transpile a SUMIF or VLOOKUP to a Python implementation of SUMIF or VLOOKUP, and you absolutely can't do this on a cell-by-cell basis.

Rather: we're trying to generate a Python script that appears to be written by an expert developer. To do this, you have to be willing to ditch the Excel formulas / execution engine, do more abstract reasoning over the file (like identifying tables / consistent formulas in columns and translating them as pandas dataframes), and translate them without just relying on matching the Excel exactly.

You want something closer to this:

    df = pd.DataFrame({'Prices': [1, 2]})
    df['With Tax'] = (df['Prices'] + 10) * 1.3
We want parity of outputs, not parity in how we get there!


> We want parity of outputs, not parity in how we get there!

To be clear, though, you don't necessarily have parity of outputs.

It seems strange to me to sacrifice correctness for readable output. I would prefer a deterministic strategy that is always correct and sometimes readable. You could do that by generating an intermediate structure A1=... A2=..., then applying heuristics to say "hey, this enormous column of VLOOKUPs is actually a join", and so on. Maybe LLMs could advise on that, but I'm not sure how you'd check their work...

... Anyway you're the person "in the arena", having actually created something, so well done!


> To be clear, though, you don't necessarily have parity of outputs.

The cool thing is that the Excel file is both the programatic specification of the process as well as the actual output data you want as well. We can check parity of outputs by comparing the data we create with Python to the data in Excel - in practice, Pyoneer generates test cases for tables that do exactly this, even when we can't translate every formula correctly!

> applying heuristics to say "hey, this enormous column of VLOOKUPs is actually a join", and so on.

We do this deterministically currently. The only non-deterministic aspect is formula translation - where we defer to some LLM. Structurally, everything is deterministic though - and here we really do aim for readability (there's a lot more to do here though).


> we're trying to generate a Python script that appears to be written by an expert developer.

But you're not. You're just deferring to an LLM. Try to create some abstractions that can work. For instance there are libraries that unuglify code. Better to start from correct python and build from that than doing the whole shebang with an LLM black box

In your case of 100k formulas verifying the output is correct would be a monumental task especially if it's not deterministic and you can't have library tests. And the human readable format is much harder to deduce about and validate. And getting an answer for one a single input output pair is prone to over fitting esp if you do it recursively to have it fit errors

No offense, but I can't imagine anyone using this for anything serious.


Yes op should use formula ast and excel model via formulas/excel/plot to translate syntactically to make sure its correctness, justifying the effort.


This jumped out for me too. Here seems like a huge wasted opportunity.


Check my comment to OP on this thread - hopefully this is answers why we don't do this!

I will say: we are considering some more approaches from traditional compilation / transpilation. I think these are very compatible!


> We use OpenAI APIs to translate formulas.

It's very uncool that you don't make this very clear on the landing page.


Thanks for the feedback. I've updated the landing page to prominently display this language - see the how it works section.


Why is it “very uncool”?


https://pypi.org/project/formulas/ might help with this without needing the OpenAI part


Thanks for the link! We looked into existing libraries for Excel formula execution - this library as an awesome example!

We considered using this, but to copy from our MVP spec:

The easiest thing to do is to replicate Excel’s execution engine — you can see someone who has done this [here](https://pypi.org/project/xlcalculator/). But this is just evaluating Excel formulas is not what users want when transitioning a process to Excel - they want to be able to ditch Excel (mostly).

The next easiest thing would be to transpile Excel formulas to the following format:

    # This is not a useful or usable artifact; you're still trapped in Excel
    # but things are just worse now, since it's not on a 2d grid
    A1 = "Prices"
    A2 = 1
    A3 = 2
    B1 = "With Tax"
    B2 = SUM(A1, 10) * 1.3
    B3 = SUM(A2, 10) * 1.3
But this is ultimately is an awful solution for the user:

1. It’s 100% impossible to read. A large Excel file often has 100k+ formulas (many of them with shared structures). This is 100k+ lines of code...

2. It’s impossible to maintain. Yeah, since it lacks all semantic structure, there’s no f** way you’re going to test it or modify it.

In general: *we're trying to generate a Python script that appears to be written by an expert developer. To do this, you have to be willing to ditch the Excel formulas / execution engine.


This is really awesome.

> In addition, Content found on or through this Service are the property of Saga Inc. or used with permission. You may not distribute, modify, transmit, reuse, download, repost, copy, or use said Content, whether in whole or in part, for commercial purposes or for personal gain, without express advance written permission from us.

Is the "Content" the results of what is uploaded?


You 100% own all Python code that you download from Pyoneer. Sorry for the lack of clarity here!

You're welcome to: 1. Edit it to fix up places where it can't translate fully. 2. Send it to your colleagues and tell them you wrote it (although... for your own personal morals... maybe don't :) ) 3. Upload it to your companies Github 4. Whatever the hell else you want

To be very clear: this is your code!

The code we generate currently has no dependencies on anything other than pandas, numpy, the Python core library, and the Excel file you uploaded in the first place. This might change as we try and support more Excel features (and so need to do some pivot table mocking), but we're not aiming to lock folks in!

P.S. If you do anything particularly interesting with the code you generate... tell me about it. nate @ sagacollab.com. I'd love to hear P.P.S. I'll update the language on this ASAP.


So is this a one-way only process (I'd assume it is)? This assumes that the once the python engineers who are "tasked" to productionize it, no more changes are allowed to the sheets or do you allow incremental updates?


Yep - this is a one-way process! You can think of it like an eject from Excel, in the best case.

The devs we've worked with so far have the goal of replacing the Excel process - inheriting it from the team that runs it manually, and automating it fully in Python. From them on, changes to the process would run through a more traditional software-development lifecycle, as you would be editing code.

For these devs - this is a feature not a bug! In Excel, version control, testing, and review is pretty much non-existent...

Cool username btw...


Ah thank you. One idea (btw love what you are doing). Have you considered "defering" the python generation process and so there is an intermediate (possibly in-mem) layer that gives you crud access to the underlying DB (ahem excel). Then you could target this to any lang/runtime/backend with performance tradeoffs etc? Bit like a language server?


I stumbled into a career in programming due to originally learning excel in an unrelated field. This project is such a great idea imo and seems like it would be a game changer for anyone that is regularly working between excel and python files for data.

Also my take on the transpilation vs LLM is that you chose the right path. The point of transforming an excel spreadsheet to python is to have better organization and ability to reason about the operations being performed.


Hey Jeff, thanks for the kind words. I'd love to learn more about your experience and transition through that pain point -- shoot me an email at nate @ sagacollab . com if you want to chat.


Suggesting an extreme test case for you:

https://github.com/amzn/computer-vision-basics-in-microsoft-...

(I'm the primary author of the above.)


Sorry "morgtage.xls"? This makes me feel a bit pessimistic about the accuracy of the formulas.


Woof. Always the words you stare at the most that are wrong... will update that demo video when I get the chance, but might be a bit :)

Good thing Pyoneer generates test cases for the formulas it generates! No need to trust my spelling abilities -- your Excel file is the ultimate source of truth.


Very cool, what are your plans for working with external data sources via power query?


Totally on the roadmap, but not sure when yet!

The problem is data gets into these mega-excels through all sorts of funky routes... and I really do mean funky :)

1. PowerQuery: this is defined statically in the notebook so is detectable by Pyoneer. But I don't know a ton about the integration in Python here. I imagine this is doable.

2. Manual data entry: Pyoneer can't detect this from a static Excel file, really - what's the difference between the static Excel sheet and data updates ever time? Oftentimes, users with a lot of manual data entry to "automate this in Python" by turning an Excel file into like a form. Generating a proper web app out of the Excel file would be pretty sweet!

3. Database output copied - aka, copy in a table. This one is sometimes pretty crazy - I've seen Excel workbooks that have SQL queries just copied and pasted into a random cell in the notebook, so you can copy that and run it on some archaic SQL server. And then copy the output back in...

4. Macros: runs an API call, or an SQL query, or pulls (and then formats) data from another Excel sheet. Then put it in the right place. This then requires translating Macros - which are a whole programming language of their own. This is actually pretty high-priority for us right now, based on early feedback from developers who are in the thick of it with big Excel files.

6. Custom plugins. Big finance shops build/buy plugins that pull in data all the time! We haven't really started investigating how to handle these.

5. Other workbooks: at large banks, there's an additional dependency graph of workbooks that rely on eachother across the org. It's epic. There's a single workbook that defines all market holidays, that's used for all excel files that do performance reporting. And then these performance reports feed into other Excel's (by way of direct references, but also by way of copy and pasting, but also by way of uploading/downloading through a database). Support multiple Excel files at once is something we'll have to tackle eventually!

So... there's a lot to do here. We're really early - so we're focused on two primary things right now:

1. Solving the most pressing pain points first. Hence the early launch so we can talk to more folks and prioritize better. I've got a reasonable idea since I've done so much of this work myself, but every finance shop does things different...

2. Leaving good TODOs when we can't translate something. Currently, we can't translate pivot tables or complex formulas -- but we generate TODOs for these so you can go back and fill them in with the Python skills you have (and maybe the help of ChatGPT).

We're aiming to just give you a Python script. So if we don't translate the data pull how you want... you can just edit the notebook :)


What customer discovery have you done so far?

> turn psuedo-software Excel files into real-software Python

I'm curious how many people actually have this problem. It strikes me that the corporate environments that are building big hairy excel files probably have locked down IT where running python is a difficult thing to do.

In the environments where you can get some kind of python running, it's probably likely that developers are looped in as part of an efficiency project, and doing some kind of voice of the customer / requirements capture and starting to build from scratch more or less.

There's a vanishing window for stuff like this, if you're a Microsoft shop like 99% of the corporate world I think you are turning those excel files into power apps and powerBI dashboards, before you are hiring python devs.


> What customer discovery have you done so far?

Me and my two cofounders spend the past 4 years working on Mito (https://trymito.io) -- where our customers are primarily large finance shops (including some bulge bracket banks you've heard of) that has a really concrete goal of getting users out of Excel and into Python. It's not every finance shop, but a quite a few are trying to make this transition. This usually means: multi-day Python trainings, a Python support team, a few developers who semi-full-time job is helping transition existing Excel processes to Python.

We built Mito to be a tool for the Excel-first users - we tried to make it easier for them to use their existing spreadsheet skills to write Python. But in working with the developers that support these new Python users, it became clear to us that there's a big pain point around:

1. I'm a dev who was given a big, old Excel file

2. It has a lot of business logic in it, understood by the person who made it, but not by me - who is tasked with turning it into real software

3. I have to spend 100s of hours: trying to understand the file, faithfully replicating the logic, and testing for consistency - to convert this to an Excel process.

I personally have been this developer in quite a few cases - just as support for Mito and helping these Excel users trying to transition to Python. Some Excel files literally take 300+ hours to "rebuild from scratch" in Python. It's often very engaging work, but brutally slow - so we're trying to automate as much as we can with Pyoneer.

> There's a vanishing window for stuff like this, if you're a Microsoft shop like 99% of the corporate world I think you are turning those excel files into power apps and powerBI dashboards, before you are hiring python devs.

I think this is a really fair point! We're not sure exactly what a reasonable business model really looks like, long-term. Right now, we're really focused on finding the developers for-which this is a big pain point, and seeing what we need to prioritize to make their lives better. I'm one of those developers...


Hey, thanks for the reply!

My next question would be how did Mito go?

What MMR did you get to?

How strong was the PMF?, 4 years seems like a long time to test this product, and I'm not sure if the overall market for this is too small. Is it the trap of dogfooding (building the thing you wish you had) without sizing the market?

Mito site says: Trusted by dozens of fortune 500 companies, how penetrated is that really? Is it one dev in each company on the free tier or entire departments/teams using this on the $150/month/user plan?

I built something similar (excel / python space ) but it was really just one feature as part of a larger platform, not something I would build a company around.


Sure thing - thanks for the good thoughts!

We're still working on Mito - it's not a retired product by any means. Pyoneer is just another stab at the same problem for a different user group.

MMR-wise, we scaled to profitability. PMF-wise, we have not reached this. We have large customers who make up the bulk of our revenue who love the product, and use it quite effectively as the basis for their entire Python program, but, transparently, scaling is hard!

> Is it the trap of dogfooding (building the thing you wish you had) without sizing the market?

Very possibly. But I think this is a much bigger pain point at large orgs with legacy processes than you realize, though. Every large bank has an entire development teams that are tasked with transitioning legacy processes out of spreadsheets. We're aiming to improve the efficiency of these developers dramatically.

For some of the spreadsheets I've personally automated, I think this would take 300 hours of work and make it like 5...


From their website it sounds like they know there is a market for this because they’ve been doing it manually for a while:

“Our founding team spent the past decade transitioning Excel files to Python - from startups to insurance companies to bulge bracket banks.”


Yeah, we built this because we wished we had it! I've spent literally thousands of hours reimplementing Excel workbooks in Python as support for our previous shot at this problem - which was a spreadsheet that generates Python code as you edit it.


> if you're a Microsoft shop like 99% of the corporate world I think you are turning those excel files into power apps and powerBI dashboards, before you are hiring python devs.

This is a good incremental decision, but once everyone's done it, perhaps an edge can be found by doing it properly. As much as IT folk love administrating Microsoft products, the products are terrible for the users.


'I'm curious how many people actually have this problem. It strikes me that the corporate environments that are building big hairy excel files probably have locked down IT where running python is a difficult thing to do.'

FWIW I have worked at 3 financial companies (including my current job) where Excel is used heavily and where "desk quants" are allowed to use Python.


Additional thoughts:

How are you modeling the input data? Are you using dataclasses and type annotations? something like https://docs.pydantic.dev/latest/ ?

How/Where is the data stored? Not inside the notebook, right?


Currently:

1. Data remains stored in the excel file. The generated script pulls the raw data directly from the notebook - but it's a single read_xlsx call. So if you want to switch it out for an API call, db read, whatever - it's easy to do so.

2. We model data as primitive Python data types, or, if it's a table, as a pandas dataframe.

Currently, we detect at most one table per sheet, and it's gotta be contiguous. These are pretty huge limitations we'll be relaxing soon -- but we wanted to get something out as soon as it would have been useful to one person -- and in it's current state, this would have helped me with some of my larger Excel automation projects :)


I'll give it a try! Thanks!




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: