• 0 Posts
  • 22 Comments
Joined 2 years ago
Cake day: June 24th, 2023



  • That’s when you get into more of the nuance with tokenization. It’s not a simple lookup table, and the AI does not have access to the original definitions of the tokens. Also, tokens do not map 1:1 onto words, and a word might be broken into several tokens. For example “There’s” might be broken into “There” + “'s”, and “strawberry” might be broken into “straw” + “berry”.

    The reason we often simplify this to “token = word” is that most common words do map to a single token.
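
    To make the splitting idea concrete, here is a toy sketch. It is not how any real model tokenizes: the vocabulary `TOY_VOCAB` and the function `toy_tokenize` are made up for illustration, and it uses a simple greedy longest-match rule, whereas real tokenizers learn their vocabulary from data.

```python
# Hypothetical toy vocabulary; real tokenizers learn theirs from data.
TOY_VOCAB = {"straw", "berry", "There", "'s", "the"}


def toy_tokenize(word, vocab=TOY_VOCAB):
    """Greedy longest-match split of `word` into vocabulary pieces."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shorter ones.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matches: fall back to one character.
            tokens.append(word[i])
            i += 1
    return tokens


print(toy_tokenize("the"))         # a common word stays one token
print(toy_tokenize("strawberry"))  # a longer word is split into pieces
```

    Note that the function only ever sees strings from the vocabulary; nothing in it knows what a “straw” or a “berry” is.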



  • I think it does make sense: it’s a “did this loop exit naturally? If so, do x” construct. This is useful if, for example, you have a loop that checks a condition and breaks when that condition is met, e.g. finding the next matching item in a list. The else clause can then set some default value to indicate that no match was found.

    Imo, the feature can be very useful under certain circumstances, but the syntax is very confusing, and thus it’s almost never a good idea to actually use it in code, since it decreases readability a lot for people not intimately familiar with the language.

    Edit: Now, this is just guessing, but what I assume happens under the hood is that the else clause is executed when the StopIteration exception is received, which happens when next() is called on an exhausted iterator (either empty or fully consumed).
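
    A minimal sketch of the “find an item, else use a default” pattern described above. The function name `find_first_negative` is just an example, not from any library.

```python
def find_first_negative(numbers):
    """Return the first negative number, or None if there is none."""
    for n in numbers:
        if n < 0:
            result = n
            break  # leaving the loop via break skips the else clause
    else:
        # Runs only when the loop finished without hitting break,
        # which includes the case where `numbers` is empty.
        result = None
    return result


print(find_first_negative([3, -1, -5]))
print(find_first_negative([1, 2, 3]))
```

    The same thing can be written with an early `return` inside the loop and a plain `return None` after it, which is usually easier to read for people who don’t know the for/else form.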




  • Wow, this is great! Works perfectly if you only care about the order of the files. However, if you want e.g. the 238th file, or want to know which index file 99993 has, that’s a bit more of a headache.

    You’ll also run into filename length limits quite quickly, since the number of files scales linearly with the number of characters in the filename, rather than exponentially as with the 01 method.
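
    A rough sketch of the linear vs. exponential difference. This assumes, purely for illustration, a scheme where each new name is one character longer than the previous one, compared against fixed-width zero-padded decimal numbers; both function names are made up.

```python
def max_files_growing_names(max_name_length):
    """Assumed scheme: names "a", "aa", "aaa", ... One file per length."""
    return max_name_length


def max_files_zero_padded(max_name_length, base=10):
    """Fixed-width zero-padded numbers: every digit combination is a name."""
    return base ** max_name_length


for length in (2, 5, 10):
    print(length, max_files_growing_names(length), max_files_zero_padded(length))
```

    With five characters, the first scheme tops out at 5 files while the zero-padded one covers 00000 through 99999.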





    1. I imagine that the company would have the burden of proof that any of these criteria are fulfilled.

    2. Third-party rights most likely refers to the use of third-party libraries whose source code isn’t open source and therefore can’t be disclosed, since they aren’t part of the government contract. Security concerns are probably things along the lines of “Making this code open source would disclose classified information about our military capabilities” and such.

    Switzerland is very good at bureaucracy, and I trust that they know how to make policies that actually stick.



  • Comments should describe “why?”, not “how?” or “what?”, and only when the “why?” is not intuitive.

    The problem with comments arises when you update the code but not the comments. This leads to incorrect comments, which might do more harm than no comments at all.

    E.g. Good comment: “This workaround is due to a bug in xyz”

    Bad comment: “Set variable x to value y”

    Note: this only concerns code comments. Docstrings are still a good idea, as long as they are maintained.
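
    To show the difference in actual code, here is a small sketch. The exporter “xyz” and both functions are hypothetical, only there to give the comments something to sit on.

```python
def normalize_path(path):
    """Return `path` with forward slashes as separators."""
    # Good comment ("why"): the hypothetical exporter "xyz" writes
    # Windows-style backslashes even on Linux, so we normalize here.
    return path.replace("\\", "/")


def count_items(items):
    """Return the number of items."""
    # Bad comment ("what"): set count to the length of items
    count = len(items)
    return count


print(normalize_path("data\\raw\\file.csv"))
print(count_items(["a", "b", "c"]))
```

    The first comment tells you something the code can’t; the second one just repeats the line below it and will silently go stale if that line changes.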





  • Being able to handle it and being able to handle it efficiently enough are two very distinct things. The hash function might be able to handle long strings, but it might take several seconds or even minutes to process them, slowing down the application significantly. Imagine a malicious user being able to set a password with millions (or billions!) of characters.

    Therefore, restricting it to a small, but still sufficiently big, number of characters might help prevent DoS attacks without any notable reduction in security for regular users.
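
    A minimal sketch of what such a check could look like, rejecting over-long input before any hashing work is done. The limit `MAX_PASSWORD_LENGTH = 128`, the iteration count and the function name are assumptions picked for the example, not recommendations from any standard.

```python
import hashlib
import os

# Hypothetical limit: generous for real users, far too small for a DoS payload.
MAX_PASSWORD_LENGTH = 128


def hash_password(password, salt=None):
    """Return (salt, digest) for `password`, rejecting over-long input."""
    # Check the length first so we never spend CPU time on a huge string.
    if len(password) > MAX_PASSWORD_LENGTH:
        raise ValueError("password too long")
    if salt is None:
        salt = os.urandom(16)
    # Iteration count is an assumption chosen for this example.
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)
    return salt, digest


salt, digest = hash_password("correct horse battery staple")
print(len(digest))
```

    The important part is only the order: length check first, expensive hashing second.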