How does the tech behind archive.today work in detail? Is there any information out there that goes beyond the Google AI search reply or this HN thread [2]?
If they're under an organised defamation campaign, they're not helping themselves by DDoSing someone else's blog and editing archived pages.
behringer 4 days ago [-]
Is that, itself, true or disinformation?
ndiddy 4 days ago [-]
They did edit archived pages. They temporarily did a find/replace on their archive to replace "Nora Puchreiner" (an alias the site operator uses) with "Jani Patokallio" (the name of the blogger who wrote about archive.today's owner). https://megalodon.jp/2026-0219-1634-10/https://archive.ph:44...
I think Wikipedia made the right decision, you can't trust an archival service for citations if every time the sysop gets in a row they tamper with their database.
This is so ‘early internet beef’ quaint. What next? Are they going to G-line each other?
behringer 3 days ago [-]
It is utterly stupid when you consider that the host needed to replace their username with something to conceal their user accounts.
stuffoverflow 4 days ago [-]
I've not seen any evidence of them editing archived pages BUT the DDoSing of gyrovague.com is true and still actively taking place. The author of that blog is Finnish, which led archive.today to ban all Finnish IPs by serving them endless captcha loops. After solving the first captcha, the page reloads and a JavaScript snippet appears in the source that attempts to spam gyrovague.com with repeated fetches.
verteu 2 days ago [-]
> I've not seen any evidence of them editing archived pages
There is evidence of this in the article you're commenting on.
mmooss 4 days ago [-]
How do you know that? Did you see it (do you have a Finnish IP?)?
stuffoverflow 4 days ago [-]
Yes I have Finnish IP and just before I wrote that post I tested it to make sure it was still happening.
I assume it must be a blanket ban on Finnish IPs, as there have been comments about it on Reddit and none of my friends can get it to work either. 5 different ISPs were tried. So at the very least it seems to affect the majority of Finnish residential connections.
mmooss 4 days ago [-]
> just before I wrote that post I tested it to make sure it was still happening
That's awesome. I wish everyone made sure of their facts. Thanks.
delusional 4 days ago [-]
This is quite an interesting question. For a single datapoint, I happen to have access to a VPN that's supposedly in Finland, and connecting through that didn't make any captcha loop appear on archive.today. The page worked fine.
Now it's obviously possible that my VPN was whitelisted somehow, or that the GeoIP of it is lying. This is just a singular datapoint.
fear-anger-hate 4 days ago [-]
As another datapoint with Finnish IP from Mullvad VPN: CAPTCHA loop and indeed after solving first CAPTCHA this can be found in page source:
It’s also pretty common for VPNs to have exit nodes physically located in different countries to where they report those IPs (to GeoIP databases) as having originated from.
BoredPositron 4 days ago [-]
VPNs usually don't tell you much about residential experiences.
drum55 4 days ago [-]
It was true and visible when reported, yeah.
4 days ago [-]
daymanstep 4 days ago [-]
I've also noticed archive.today injecting suspicious looking ads into archived pages that originally did not have ads.
And that voice is practically shouting, "I AM UNTRUSTWORTHY".
ouhamouch 4 days ago [-]
that is not the worst scream (especially after FBI and Russian trail). better to shout anything than to die in silence
eddythompson80 4 days ago [-]
What kinda logic is that? If you don't want to die in silence, then shout something sensical. But if you're gonna shout garbage, just die in silence.
tolerance 4 days ago [-]
People say they want the old weird web back. Well there’s this.
ouhamouch 4 days ago [-]
The property of the medium: no one would repost or discuss "something sensical".
tolerance 4 days ago [-]
Or some shrewd sort of tactician.
8cvor6j844qw_d6 4 days ago [-]
archive.today works surprisingly well for me, often succeeding where archive.org fails.
archive.org also complies with takedown requests, so it's worth asking: could the organised campaign against archive.today have something to do with it preserving content that someone wants removed?
wolvoleo 4 days ago [-]
They preserve a lot of paywalled content so yeah I'm sure there's enough financial incentives to bother them :(
4 days ago [-]
sieabahlpark 4 days ago [-]
[dead]
iamnothere 4 days ago [-]
There was also the recent news about sites beginning to block the Internet Archive. Feels like we are gearing up for the next phase of the information war.
idiotsecant 4 days ago [-]
[flagged]
pyuser583 4 days ago [-]
Was that written by AI? It sounds like AI, spends lots of time summarizing other posts, and has no listed author. My AI alarm is going off.
KennyBlanken 4 days ago [-]
Ars was caught recently using AI to write articles when the AI hallucinated about a blogger getting harassed by someone using AI agents. The article quoted his blog and all the quotes were nonsense.
mrweasel 4 days ago [-]
Even if something is AI generated, the author and the editor should at least attempt to read the article back. English isn't my native language, so that obviously plays in, but very frequently I find that articles I struggle to read are AI generated; they certainly have that AI feel.
It would be interesting to run the numbers, but I get the feeling that AI generated articles may have a higher LIX score. Authors are then less inclined to "fix" the text, because longer words make them seem smarter.
moron4hire 4 days ago [-]
"Should" and "will" are completely different things. My kids "should" brush their teeth every night without me having to tell them. But they won't.
mrweasel 4 days ago [-]
Sounds like you're suggesting an RFC for journalists and editors :-)
lambda 4 days ago [-]
Yeah, wow. Definitely setting off my AI summary alarm.
girvo 4 days ago [-]
Yeah nearly certainly.
5 days ago [-]
5 days ago [-]
robotnikman 4 days ago [-]
A big fear of mine is something happening to archive.is
There is so much archived there; to lose it all would be a tragedy.
ouhamouch 4 days ago [-]
There are a number of blog posts like
owner-archive-today . blogspot . com
2 years old, like J.P's first post on AT
bdhcuidbebe 4 days ago [-]
They are able to scrape paywalled sites at random, so I'm guessing a residential botnet is used.
khannn 3 days ago [-]
It's funny that residential VPN botnets aren't uncommon now. "Free VPN" if you allow your computer/phone to be an exit point.
pingou 4 days ago [-]
But how do they bypass the paywall? They can't just pretend to be Google by changing the user-agent, this wouldn't work all the time, as some websites also check IPs, and others don't even show the full content to Google.
Nor can they simply be hijacking data with a residential botnet or buying subscriptions themselves. Otherwise, the saved page would contain information about the logged-in user. It would be hard to remove this information, as the code changes all the time, and it would be easy for the website owner to add an invisible element that identifies the user. I suppose they could have different subscriptions and remove everything that isn't identical between the two, but that wouldn't be foolproof.
wbmva 4 days ago [-]
On the network layer, I don't know. But on the WWW layer, archive.today operates accounts that are used to log into websites when they are snapshotted. IIRC, archive.today manipulates the snapshots to hide the fact that someone is logged in, but sometimes fails miserably:
This particular addon is blocked on most western git servers, but can still be installed from Russian git servers. It includes custom paywall-bypassing code for pretty much every news website you could reasonably imagine, or at least those sites that use conditional paywalls (paywalls for humans, no paywalls for big search engines). It won't work on sites like Substack that use properly authenticated content pages, but those sorts of pages don't get picked up by archive.today either.
My guess would be that archive.today loads such an addon with its headless browser and thus bypasses paywalls that way. Even if publishers find a way to detect headless browsers, crawlers can also be written to operate with traditional web browsers where lots of anti-paywall addons can be installed.
wuschel 4 days ago [-]
Wow, did not know about the regional blocking of git servers! Makes me wonder what else is kept from the western audience, and for what reason this blocking is happening.
Thanks for sketching out their approach and for the URI.
pingou 4 days ago [-]
But don't news websites check for ip addresses to make sure they really are from Google bots?
seanhly 4 days ago [-]
Most of them don’t check the IP, it would seem. Google acquires new IPs all the time, plus there are a lot of other search systems that news publishers don’t want to accidentally miss out on. It’s mostly just client side JS hiding the content after a time delay or other techniques like that. I think the proportion of the population using these addons is so low, it would cost more in lost SEO for news publishers to restrict crawling to a subset of IPs.
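A minimal sketch of the "conditional paywall" logic being described (hypothetical code, not any real publisher's; real metering is more elaborate): the server gates content on the User-Agent alone, which is why UA spoofing works on sites that don't also verify crawler IPs.

```javascript
// Hypothetical leaky paywall check: full content for anything claiming to be
// a search crawler (for SEO), a metered limit for everyone else. Because the
// decision keys only on the spoofable User-Agent string, archivers and addons
// that present a crawler UA get the unpaywalled page.
function shouldPaywall(userAgent, articlesReadThisMonth) {
  const looksLikeCrawler = /Googlebot|bingbot|DuckDuckBot/i.test(userAgent || '');
  if (looksLikeCrawler) return false; // crawlers always see the full article
  return articlesReadThisMonth > 3;   // hypothetical free-article meter
}
```

The IP verification mentioned above (reverse-DNS checking that a "Googlebot" request really comes from Google's ranges) is exactly the extra step this sketch omits, and which most publishers apparently skip.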
4 days ago [-]
expedition32 4 days ago [-]
I use this add on. It does get blocked sometimes but they update the rules every couple of weeks.
rkagerer 4 days ago [-]
I thought saved pages sometimes do contain users' IP's?
The way I (loosely) understand it, when you archive a page they send your IP in the X-Forwarded-For header. Some paywall operators render that into the page content served up, which then causes it to be visible to anyone who clicks your archived link and Views Source.
bdhcuidbebe 4 days ago [-]
> But how do they bypass the paywall?
I'm guessing by using a residential botnet and the existing credentials of unknowing "victims" by automating their browsers.
> Otherwise, the saved page would contain information about the logged-in user.
If you read this article, there's plenty of evidence they are manipulating the scraped data.
But I’m just speculating here…
pingou 4 days ago [-]
But in the article they talk about manipulating users' devices to do a DDoS, not to scrape websites. And a user going to the archive website is probably not going to have a subscription, and anyway I'm not sure that simply visiting archive.today would let it exfiltrate much information from any third-party website, since cookies will not be shared.
I guess if they can control a residential botnet more extensively they would be able to do that, but it would still be very difficult to remove login information from the page. The fact that they manipulated the scraped data a few times, for totally unrelated reasons, proves nothing in my opinion.
notpushkin 4 days ago [-]
They do remove the login information for their own accounts (e.g. the one they use for LinkedIn's sign-up wall). Their implementation is not perfect, though, which is how the aliases were leaked in the first place.
4 days ago [-]
celsoazevedo 5 days ago [-]
I don't see the point in doxing anyone, especially those providing a useful service for the average internet user. Just because you can put some info together, it doesn't mean you should.
With this said, I also disagree with turning everyone that uses archive[.]today into a botnet that DDoS sites. Changing the content of archived pages also raises questions about the authenticity of what we're reading.
The site behaves as if it was infected by some malware and the archived pages can't be trusted. I can see why Wikipedia made this decision.
fluoridation 4 days ago [-]
For a very brief time, "doxing" (that is, dropping dox, that is, dropping docs, or documents) used to mean something useful. You gathered information that was not out in public, for example by talking to people or by stealing it, and put it out in the open.
It's very silly to talk about doxing when all someone has done is gather information anyone else can equally easily obtain, just given enough patience and time, especially when it's information the person in question put out there themselves. If it doesn't take any special skills or connections to obtain the information, but only the inclination to actually perform the research on publicly available data, I don't see what has been done that is unethical.
bawolff 4 days ago [-]
Call it stalking or harassment if you prefer. Regardless, it's rude (sometimes illegal) behaviour.
That's no justification for using visitors to your site to do a DDOS.
In the slang of reddit: ESH
fluoridation 4 days ago [-]
It's neither of those. Stalking refers to persistent, unwanted, one-sided interactions with a person such as following, surveilling, calling, or sending messages or gifts. Investigating a person's past or identity doesn't involve any interaction with the physical person. Harassment is persistent attempts to interact with someone after having been asked to stop. Again, an investigation doesn't require any form of interaction.
JoshTriplett 4 days ago [-]
> Harassment is persistent attempts to interact with someone
No, harassment also includes persistent attempts to cause someone grief, whether or not they involve direct interactions with that person.
From Wikipedia:
> Harassment covers a wide range of behaviors of an offensive nature. It is commonly understood as behavior that demeans, humiliates, and intimidates a person.
fluoridation 4 days ago [-]
Doxing in the loose sense could be harassment in certain circumstances, such as if you broadcast a person's home address to an audience with the intent to cause that audience to use that address, even if the address was already out there. In that case, the problem is not the release of information, but the intent you're communicating with the release. It would be the same if you told that audience "you know guys? It's not very difficult to find jdoe's home address if you google his name. I'm not saying anything, I'm just saying." Merely de-pseudonymizing a screen name may or may not be harassment. Divulging that jdoe's real name is John Doe would not have the same implications as if his name was, say, Keanu Reeves.
Because the two are distinct, one can't simply replace "doxing" with "harassment".
JoshTriplett 4 days ago [-]
Generally speaking, every case I've seen of people using the term "doxing" tends to be for the case that specifically is harassment; it has the connotation of using the information, precisely because if you aren't intending to use it there's no good reason for you to have it.
fluoridation 4 days ago [-]
That's just another way the term is used incorrectly.
JoshTriplett 4 days ago [-]
Language evolves. Connotation tends to become definition. Not always the only definition, but connotation becomes the "especially" or the "definition 2", and can become the primary definition over time.
fluoridation 4 days ago [-]
That's not what I mean. If we agree that harassment is wrong and that doxing is not harassment (because not all doxing is harassment), then it's incorrect to say that doxing is wrong. For example, the article from the blog, even if we agree that it is doxing, isn't harassment. The person being discussed is presented in a positive light:
>I for one will be buying Denis/Masha/whoever a well deserved cup of coffee.
Using one term when what is meant is actually the other serves nothing but to sow confusion.
bawolff 4 days ago [-]
You can harass someone while discussing them in a positive light.
And I don't just mean under the colloquial definition, I mean under the legal definition of harassment. In fact, it's fairly common for unwanted "positive" attention to be harassment - e.g. unwanted sexual advances mostly fit that description.
fluoridation 4 days ago [-]
You are generalizing an irrelevant point. What I was getting at is that unlike the usual usage of doxing, it was not a call to go bother that person. I didn't think I needed to make that point this explicitly within the context of this subthread.
bawolff 4 days ago [-]
Which is irrelevant, as that is not a requirement for it to be harassment.
I get that a call to action is a common feature of doxing and it wasn't present here, but it's not a particularly common feature of harassment outside of the context of doxing, and nothing in the definition of harassment requires it.
grimgrin 4 days ago [-]
update the etymology then on wikipedia with your reference
that current etymology is what we’re all talking about obv
allarm 3 days ago [-]
> Language evolves
That's just another way of saying "words don't have meanings". Yes, it evolves, but to preserve the original meanings, that evolution should be slowed down as much as possible to avoid “black is white” effects.
wolvoleo 4 days ago [-]
In this case archive.today has a lot of influence over the information we take in, because of the rise in paywalls. They have the potential to modify the news we absorb at scale.
In that context I don't think the question ("actually, who is providing all this information to me, and what interests drive them?") is misplaced. Maybe we shouldn't look a gift horse in the mouth, but don't forget this could be a Trojan horse as well.
The article brought to light some ties to Russia but probably not ties to its government and its troll farms. Rather an independent and pretty rebellious citizen. That's good to hear. And that's valuable information. I trust the site more after reading the article, not less.
The article could have redacted the names they found, but they were found with public sources, and those sources validate the encountered information (otherwise the results could have been dismissed).
noobermin 4 days ago [-]
Did you read the article? They dug deep, they didn't just do a google search and leave it at that. They drew links between deleted posts and defunct accounts, they compared profile pictures of anonymous profiles.
I'm not defending the archive.today webmaster but it's unfortunately understandable they are angry. Saying what the blogger did was merely point out public information is a gross oversimplification.
fluoridation 4 days ago [-]
Did you read the comment you're replying to? They didn't use any information not publicly available.
noobermin 3 days ago [-]
That is NOT the line for doxxing at all, I don't know why you hang your argument on that aspect. Even institutions that care about secrecy like governments state that documents that aggregate ostensibly public information can raise the classification level of a document above being non-classified. The reasons for this are obvious, essentially aggregated information can lead one to draw conclusions that otherwise are not obvious. That is akin to what the original article by Gyrovague does.
fluoridation 3 days ago [-]
>That is NOT the line for doxxing
Again, did you read my comment? I know what it means now. My point is about highlighting the change in meaning, not about obstinately denying what the word means.
>Even institutions that care about secrecy like governments state [...]
A given organization can have whatever policy it wants with regards to which documents it wants to allow to be made public. It could make all documents printed on non-yellow paper classified. That has nothing to do with the ethics of doxing.
>The reasons for this are obvious, essentially aggregated information can lead one to draw conclusions that otherwise are not obvious.
A secret is not something that's not obvious, it's a datum that's strictly controlled by the people who know it. If I can find some information about your real identity just by searching for it online then it's not a secret; you don't control that piece of information. You've given up that control by divulging the information in a public space where information often remains indefinitely.
lelandbatey 4 days ago [-]
Eh, you can find in public data things like "what is someone's address" based only on their name by looking up public records of mortgage records. That however is quite bad form, and if you did do that, I think it would be pretty unethical.
jsheard 5 days ago [-]
It's also kind of ironic that a site whose whole premise is to preserve pages forever, whether the people involved like it or not, is seeking to take down another site because they are involved and don't like it. Live by the sword, etc.
palmotea 4 days ago [-]
> It's also kind of ironic that a site whose whole premise is to preserve pages forever, whether the people involved like it or not
Oddly, I think archive.today has explicitly said that's not what they're there for, and that people shouldn't rely on their links as a long-term archive.
eviks 4 days ago [-]
Where have they said it?
> Archive.today is a time capsule for web pages!
> It takes a 'snapshot' of a webpage that will always be online even if the original page disappears.
Bypassing paywalls? It actually seems like they've got accounts at many paywalled sites. Shorter term archiving?
Given the unclear ownership situation, it makes sense not to rely on them for anything long term. They could disappear tomorrow.
jMyles 5 days ago [-]
> Changing the content of archived pages also raises questions about the authenticity of what we're reading.
This is absolutely the buried lede of this whole saga, and needs to be the focus of conversation in the coming age.
Sophira 4 days ago [-]
Sites that exist to archive other websites will almost always need to dynamically change the content of the HTML that they're serving in some way or another. (For example, a link that points to the root of the website may need to be changed in order to point to the right location.)
So it doesn't necessarily raise questions about whether the content has been changed or not. The difference is in whether that change is there to make the archive usable - and of course, for archive.today, that's not the case.
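The kind of benign rewriting described above can be illustrated with a minimal sketch (not archive.today's actual implementation): root-relative links like `href="/about"` would otherwise resolve against the archive's own domain, so they have to be rebased onto the snapshot's URL.

```javascript
// Rebase root-relative href/src attributes onto a snapshot base URL, so that
// links inside an archived page resolve within the archive rather than
// pointing at the archive server's own root. Purely illustrative: real
// archivers also rewrite CSS urls, srcset attributes, inline scripts, etc.
function rewriteLinks(html, snapshotBase) {
  // The negative lookahead skips protocol-relative URLs ("//cdn.example.com/...").
  return html.replace(/(href|src)="\/(?!\/)/g, `$1="${snapshotBase}/`);
}
```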
ddtaylor 5 days ago [-]
Did they actually run the DDoS via a script or was this a case of inserting a link and many users clicked it? They are substantially different IMO
dunder_cat 5 days ago [-]
https://news.ycombinator.com/item?id=46624740 has the earliest writeup that I know of. It was running it via a script and intentionally using cache busting techniques to try to increase load on the hosted wordpress infrastructure.
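The cache-busting idea mentioned in that writeup amounts to making every request URL unique, so a CDN or page cache never gets a hit and each fetch reaches the origin. A hypothetical sketch of the technique (not the actual script):

```javascript
// Append a unique query parameter so no two requests share a cache key.
// The same trick is used legitimately to force a fresh fetch of a resource,
// but issued in bulk it defeats the cache layer that would normally absorb
// repeated load, pushing every request through to origin infrastructure.
function cacheBust(url) {
  const u = new URL(url);
  u.searchParams.set('_', `${Date.now()}-${Math.random().toString(36).slice(2)}`);
  return u.toString();
}
```

This is also why uBlock-style blocking (mentioned below) is an effective mitigation: if the script never loads, no requests are generated at all.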
jsheard 5 days ago [-]
> It was running
It still is, uBlocks default lists are killing the script now but if it's allowed to load then it still tries to hammer the other blog.
dunder_cat 5 days ago [-]
Ah good to know. My pi-hole actually was blocking the blog itself since the ublock site list made its way into one of the blocklists I use. But I've been just avoiding links as much as possible because I didn't want to contribute.
RobotToaster 5 days ago [-]
Given the site is hosted on wordpress.com, who don't charge for bandwidth, it seems to have been completely ineffective.
Hamuko 5 days ago [-]
The speculation that I saw was that they'd try to get Wordpress.com to boot him off for being a burden on the overall infrastructure.
This is an impressively unhinged take. I still have no idea what the person is trying to achieve. And I'm sad we're likely going to lose that resource in the future.
noobermin 4 days ago [-]
I understand being mad, but no. Despite knowing that humans are human and get angry at times, this response still leaves a bitter taste in the mouth, and many people will perceive it that way. Changing the content of the archived pages is honestly the worst thing they've done. The "3 Hz DDoS" is funny perhaps, but if it's so harmless, then why even bother? Regardless, tampering with the archives, that is, tainting the content that people appreciate you for, won't sit well with people.
I don't know, I feel like everyone loses here.
4 days ago [-]
walletdrainer 4 days ago [-]
People are now also talking about the weirdo trying to dox him, instead of just about the operator of the website; that doesn't seem like an unreasonable goal.
viraptor 4 days ago [-]
We're talking about both now, at least once a week it seems. Without the DDoS, we'd mostly forget about the blog. I didn't even know about the blog until the DDoS started.
ddtaylor 4 days ago [-]
Seems like they just Streisand Effected themselves, amplifying the message of the "attacker"
ouhamouch 4 days ago [-]
[dead]
chrisjj 5 days ago [-]
As if Wordpress.com was that dumb...
RobotToaster 4 days ago [-]
Mullenweg is dumb, but he seems like the kind of dumb that would try to launch his own attack on archive.today rather than remove the site.
(For those who don't know, he's currently trying to destroy one of the largest WP hosting providers with a bunch of lawsuits)
daedrdev 4 days ago [-]
Are you kidding, it's wordpress
ddtaylor 5 days ago [-]
Thank you this is exactly the information I was looking for.
"You found the smoking gun!"
hexagonwin 5 days ago [-]
they silently ran the DDoS script on their captcha page (which is frequently shown to visitors, even when simply viewing and not archiving a new page)
cardanome 4 days ago [-]
As far as I understand the person behind archive.today might face jail time if they are found out. You shouldn't be surprised that people lash out when you threaten their life.
I don't think the DDOSing is a very good method for fighting back but I can't blame anyone for trying to survive. They are definitely the victim here.
If that blog really doxxed them out of idle curiosity they are an absolute piece of shit. Though I think this is more of a targeted campaign.
pibaker 4 days ago [-]
One thing they always teach you in Crime University is "don't break two laws at the same time." If you have contraband in your car, don't speed or run red lights, because it brings attention, and attention means jail.
In this case, I didn't know that the archive.today people were doxxed until they started the ddos campaign and caught attention. I doubt anyone in this thread knew or cared about the blogger until he was attacked. And now this entire thing is a matter of permanent record on Wikipedia and in the news. archive.today's attempt at silencing the blogger is only bringing them more trouble, not less.
Barbara_Streisand_Mansion.jpg
stuffoverflow 4 days ago [-]
The weird thing is that there was nothing new in that blog post. And on top of that it couldn't conclusively say who the owner of archive.today is, so no one still knows.
ouhamouch 4 days ago [-]
We do not know what was important in that doxx.
Probably nothing, and the DDoS hype was intentional: to distract attention and highlight J.P.'s doxx among the others, making them insignificant.
J.P. might be the only one of the doxxers who could promote their doxx in media, and this made his doxx special, not the content?
Anyway, it made the haystack bigger while keeping the needle the same.
protimewaster 4 days ago [-]
> As far as I understand the person behind archive.today might face jail time if they are found out. You shouldn't be surprised that people lash out when you threaten their life.
One of the really strange things about all of this is that there is a public forum post in which a guy claims to be the site owner. So this whole debacle is this weird mix of people who are angry and saying "clearly the owner doesn't want to be associated with the site" on the one hand, but then on the other hand there's literally a guy who says he's the one that owns the site, so it doesn't seem like that guy is very worried about being associated with it?
It also seems weird to me that it's viewed as inappropriate to report on the results of Googling the guy who said he owns the site, but maybe I'm just out of touch on that topic.
ouhamouch 4 days ago [-]
There are even YouTube videos (of GamerGate-time, thus before AI era) with a guy claiming to be the site owner. A bit difficult to OSINT :)
arboles 4 days ago [-]
> is that there is a public forum post in which a guy claims to be the site owner.
Which forum post? The post mentioned by the blogger, on an F-Secure forum (F-Secure sells cybersecurity products), was a request for support by the owner of archive.today regarding a block of their site. It's arguably not intended as a public statement by the owner of the archive; they were simply careless with their username.
RobotToaster 4 days ago [-]
I don't see how that contradicts anything? He's almost certainly using a nom de guerre.
4 days ago [-]
luckylion 4 days ago [-]
Somebody who a) directs DDOS attacks and b) abuses random visitors' browser for those DDOS attacks is never the victim.
You don't know their motives for running their site, but you do get a clear message about their character by observing their actions, and you'd do well to listen to that message.
cardanome 4 days ago [-]
The character is completely irrelevant to whether they are a victim of doxxing.
They might be the worst person ever but that doesn't matter. People can be good and bad, sometimes the victim sometimes the perpetrator.
Is it morally wrong to doxx someone and cause them to go to jail because they are running an archive website? Yes. It is. It doesn't matter who the person is. It does not matter what their motivations are.
AgentME 4 days ago [-]
There are plenty of cases where the operator of archive.today refused to take down archives of pages with people's identifying information, so it's a huge double standard for them to insist that others not look into their identity using public information.
darkwater 4 days ago [-]
So, we are back at eye for eye and tooth for tooth?
cardanome 4 days ago [-]
No. I literally said
> I don't think the DDOSing is a very good method for fighting back
I am really shocked by the conditional empathy people here are showing. The doxxing isn't less bad just because the reaction to it is bad.
It's like justifying bullying because the person "deserves" it.
darkwater 3 days ago [-]
If the reaction is disproportional (like in this case), you stop being right even if you were initially right.
cardanome 2 days ago [-]
That's some messed-up morality. If you are right, you are right.
Now what you do in reaction might be legally and morally wrong and maybe you need to be punished for that. But that doesn't negate the injustice you suffered. Two wrongs make... two wrongs. One does not negate the other.
fc417fc802 4 days ago [-]
Irrelevant to a determination of fact, yes. But very relevant to the question of whether or not I care about any of this. Bad thing happened to bad person, lots of drama ensued, come rubberneck the various internet slapfights, details at 11. In other news, water is wet.
luxuryballs 4 days ago [-]
This seems like the type of thing that should be on a blockchain, with decentralized nodes validating authenticity; it could support revisions without losing the originals.
nosamu 4 days ago [-]
Has anyone else noticed that some of Archive.today's X/Twitter captures [1] are logged in with an account called "advancedhosters" [2], which is associated with a web hosting company apparently located in Cyprus? The latest post [3] from the account links to a blog post [4] including private communications between the webmaster of Archive.today (using their previously-known "Volth" alias) and a site owner requesting a takedown. Also note that the previous post [5] from the "advancedhosters" account was a link to a pro-Russia, anti-Ukraine article, archived via Archive.today of course. Seems like an interesting lead to untangle.
It could be a donated account. I've noticed archive.whatever also bypasses some paywalls by using legitimate account logins but I doubt there's one person going around subscribing to every news outlet that gets any coverage.
If archive.whatever wasn't so useful to the general public, it'd be hard to distinguish from a criminal operation given the way it operates, unlike say the Internet Archive who goes through all of the proper legal paperwork to be a real nonprofit.
snigsnog 4 days ago [-]
Lead to what?
Kiboneu 4 days ago [-]
That’s what OP wants to find out.
snigsnog 4 days ago [-]
No, what information is he hoping to find? Does he also want to doxx the website owner?
ChocMontePy 5 days ago [-]
I noticed last year that some archived pages are getting altered.
Every Reddit archived page used to have a Reddit username in the top right, but then it disappeared. "Fair enough," I thought. "They want to hide their Reddit username now."
The problem is, they did it retroactively too, removing the username from past captures.
You can see on old Reddit captures where the normal archived page has no username, but when you switch the tab to the Screenshot of the archive it is still there. The screenshot is the original capture and the username has now been removed for the normal webpage version.
When I noticed it, it seemed like such a minor change, but with these latest revelations, it doesn't seem so minor anymore.
palmotea 4 days ago [-]
> When I noticed it, it seemed like such a minor change, but with these latest revelations, it doesn't seem so minor anymore.
That doesn't seem nefarious, though. It makes sense they wouldn't want to reveal whatever accounts they use to bypass blocks, and the logged-in account isn't really meaningful content to an archive consumer.
Now, if they were changing the content of a reddit post or comment, that would be an entirely different matter.
TehCorwiz 4 days ago [-]
If it's not nefarious, why isn't it documented as part of their policies? They're not tracking those changes and making clear it was anonymization. Why not? If they're not tracking and publishing changes to the documents, what's to say they haven't edited other things? The short answer is that without another archived copy we just don't know, and that's what's making people uncomfortable. They also injected malicious JS into the site. What's to stop them from doing that again? Trust and transparency are the name of the game with libraries. I couldn't care less about who they are, but their actions as steward of a collection for posterity fail to encourage my trust.
zymhan 4 days ago [-]
Editing what is billed as an archive defeats the purpose of an "archive".
palmotea 4 days ago [-]
> Editing what is billed as an archive defeats the purpose of an "archive".
No, certain edits are understandable and required. Even archive.org edits its pages (e.g. it sticks banners on them and does a bunch of stuff to make them work like you'd expect).
Even paper archives edit documents (e.g. writing sequence numbers on them, so the ordering doesn't get lost).
Disclosing exactly what account was used to download a particular page is arguably irrelevant information, and may even compromise the work of archiving pages (e.g. if it just opens the account to getting blocked).
ajam1507 4 days ago [-]
The relevant part of the page to archive is the content of the page, not the user account that visited it. Most sane people would consider two archives of the same page, differing only in the user account shown at the top, to be the same page.
maxloh 4 days ago [-]
Don't be surprised by this; there are a lot more edits than you think. For example, CSS is always inlined so that pages render the same as they did when archived.
raincole 4 days ago [-]
CSS inlining happens during the process of archiving, no?
The issue here is to edit archived pages retrospectively.
basch 5 days ago [-]
It seems a lot of people haven't heard of it, but I think it's worth plugging https://perma.cc/, which is really the appropriate tool for something like Wikipedia to be using to archive pages.
It costs money beyond 10 links, which means either a paid subscription or institutional affiliation. This is problematic for an encyclopedia anyone can edit, like Wikipedia.
extraduder_ire 4 days ago [-]
This assumes they can't work out something with Wikipedia to offer it for free (via a wikiforge tool, or a bot) in exchange for the exposure of being the most common archive provider/putting a "used by Wikimedia" logo on their website.
The major reason archive.today was being used is that it also bypassed paywalls, and I don't think perma.cc does that normally.
toomuchtodo 5 days ago [-]
Wikimedia could pay, they have an endowment of ~$144M [1] (as of June 30, 2024). Perma.cc has Archive.org and Cloudflare as supporting partners, and their mission is aligned with Wikimedia [2]. It is a natural complementary fit in the preservation ecosystem. You have to pay for DOIs too, for comparison [3] (starting at $275/year and $1/identifier [4] [5]).
With all of this context shared, the Internet Archive is likely meeting this need without issue, to the best of my knowledge.
[2] https://perma.cc/about ("Perma.cc was built by Harvard’s Library Innovation Lab and is backed by the power of libraries. We’re both in the forever business: libraries already look after physical and digital materials — now we can do the same for links.")
(no affiliation with any entity in scope for this thread)
bawolff 4 days ago [-]
> Organizations that do not qualify for free usage can contact our team to learn about creating a subscription for providing Perma.cc to their users. Pricing is based on the number of users in an organization and the expected volume of link creation.
If pricing is so high that you have to have a call with the marketing team to get a quote, I think it would be a poor use of WMF funds.
Especially because the volume of links and number of users that Wikimedia would entail is probably at least double their entire existing userbase.
Ultimately we are mostly talking about a largely static web host. With legal issues being perhaps the biggest concern. It would probably make more sense for WMF to create their own than to become a perma.cc subscriber.
However for the most part, partnering with archive.org seems to be going well and already has some software integration with wikipedia.
RupertSalt 5 days ago [-]
If the WMF had a dollar for every proposal to spend Endowment-derived funds, their Endowment would double and they could hire one additional grant-writer
Dylan16807 4 days ago [-]
Do you have experience with this? I'd like to hear more, really. I think this is the first time I've seen a suggestion for something new they can spend money on. I usually just see talk about where to spend less.
nine_k 5 days ago [-]
If the endowment is invested so that it brings a very conservative 3% a year, that means it brings $4.32M a year. By doubling that, rather many grant writers could be hired.
erk__ 4 days ago [-]
Well, the last annual report I could find actually says that they got a return of 17.65%, so 3% would be pretty bad.
I switched to Perma.cc earlier this week and have had a mixed experience, to say the least. I think image-heavy pages just error out completely, while still charging me, such as:
And Reddit seemingly blocks their agent. It is open source though.
jsheard 5 days ago [-]
Does Wikipedia really need to outsource this? They already do basically everything else in-house, even running their own CDN on bare metal, I'm sure they could spin up an archiver which could be implicitly trusted. Bypassing paywalls would be playing with fire though.
toomuchtodo 5 days ago [-]
Archive.org is the archiver; rotted links are replaced with Archive.org links by a bot.
Yeah, for historical links it makes sense to fall back on IA's existing archives, but going forward Wikipedia could take their own snapshots of cited pages and substitute them in if/when the original rots. It would be more reliable than hoping IA grabbed it.
A shortcut is to consume the Wikimedia changelog firehose and make these HTTP requests yourself, performing a CDX lookup to see if a recent snapshot was already taken before issuing a capture request (to be polite to the capture worker queue).
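For anyone wanting to try this flow, the Wayback Machine's CDX API and Save Page Now endpoints are public. A minimal sketch of the "check before capturing" logic, with the decision step kept offline so it can be tested against a canned CDX response (the endpoint URLs are real; the helper names are made up):

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
SAVE_ENDPOINT = "https://web.archive.org/save/"

def build_cdx_query(url: str, since: str) -> str:
    """Build a CDX lookup URL asking for at most one capture since `since` (YYYYMMDD)."""
    params = {"url": url, "output": "json", "limit": "1", "from": since}
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

def latest_capture(cdx_json: str):
    """Parse a CDX JSON payload; return (timestamp, original_url) or None.
    With output=json the first row is a header row."""
    rows = json.loads(cdx_json)
    if len(rows) < 2:
        return None
    rec = dict(zip(rows[0], rows[1]))
    return rec["timestamp"], rec["original"]

def next_action(url: str, cdx_json: str) -> str:
    """Reuse an existing snapshot if one exists, else ask Save Page Now for a fresh one."""
    hit = latest_capture(cdx_json)
    if hit:
        ts, original = hit
        return f"https://web.archive.org/web/{ts}/{original}"
    return SAVE_ENDPOINT + url
```

In a real bot you would fetch `build_cdx_query(...)` over HTTP and feed the body into `next_action`, rate-limiting the Save Page Now requests.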
Gander5739 5 days ago [-]
This already happens. Every link added to Wikipedia is automatically archived on the wayback machine.
RupertSalt 5 days ago [-]
[citation needed]
Gander5739 5 days ago [-]
Ironic, I know. I couldn't find where I originally heard this years ago, but the InternetArchiveBot page linked above says "InternetArchiveBot monitors every Wikimedia wiki for new outgoing links" which is probably referring to what I said.
jsheard 5 days ago [-]
I didn't know you can just ask IA to grab a page before their crawler gets to it. In that case yeah it would make sense for Wikipedia to ping them automatically.
extraduder_ire 4 days ago [-]
There's a /save/<url> endpoint that archives the page you point it at.
You can see a text box for it on the right, if you go on the waybackmachine's homepage. I used it yesterday.
ferngodfather 5 days ago [-]
Why wouldn't Wikipedia just capture and host this themselves? Surely it makes more sense to DIY than to rely on a third party.
huslage 5 days ago [-]
Why would they need to own the archive at all? The archive.org infrastructure is built to do this work already. It's outside of WMF's remit to internally archive all of the data it has links to.
RupertSalt 5 days ago [-]
Spammers and pirates just got super excited at that plan!
toomuchtodo 5 days ago [-]
There are various systems in place to defend against them. I recommend against this; poor form against a public good is not welcome.
snigsnog 4 days ago [-]
Archive.org are left wing activists that will agree to censor anything other left wing activists or large companies don't want online.
AlexeyBelov 3 days ago [-]
And you're another disruptive "N days old" account. Troll somewhere else.
Anyone can request anything be removed and they may honor the request: https://help.archive.org/help/how-do-i-request-to-remove-som... they say nothing about only removing things illegal in the US or anything like that, meaning they can and will remove things based on personal judgements about whether it should be archived.
Maken 2 days ago [-]
Unlisting (not even removing) doxxing information about living people is being a left wing activist? Is that where the Overton window lies now?
raincole 4 days ago [-]
> Does Wikipedia really need to outsource this?
I hope so. Archiving is a legal landmine.
IshKebab 4 days ago [-]
Of course they do. If Wikipedia did it themselves they'd immediately get DMCA'd and sued into oblivion.
> Bypassing paywalls would be playing with fire though.
That's the only reason archive.today was used. For non-paywalled stuff you can use the wayback machine.
ouhamouch 5 days ago [-]
[dead]
culi 5 days ago [-]
The 3 listed alternatives there seem to have nothing to do with digital archiving. Here's a better alternative to g2 that doesn't login-wall you:
The word salad about Ukraine, the arms trade, Nazis, and Hunter Biden leaves no doubt the operator is from Russia.
karel-3d 4 days ago [-]
He says elsewhere he comes from right-wing activism, so he could be some hard-right type. But he also says he is outside of US jurisdiction. And the fact that he reacts so violently suggests the original blog post is somehow right. So probably Russia.
dmix 4 days ago [-]
He’s probably being purposefully vague which makes for difficult reading.
frenchtoast8 4 days ago [-]
A bit off topic, but are there any self hosted open source archiving servers people are using for personal usage?
I think ArchiveBox[1] is the most popular. I will give it a shot, but it's a shame they don't support URL rewriting[2], which would be annoying for me. I read a lot of blog and news articles that are split across multiple pages, and it would be nice if that article's "next page" link was a link to the next archived page instead of the original URL.
> Change the original source to something that doesn't need an archive (e.g., a source that was printed on paper), or for which a link to an archive is only a matter of convenience.
They're basically recommending changing verifiable references that can easily be cross-checked and verified to "printed on paper" sources that could likely never be verified by any other Wikipedian, and that can easily be used to introduce falsification and bias that could go unnoticed for extended periods of time.
Honestly, that's all you need to know about Wikipedia.
The "altered" allegation is also disingenuous. The reason archive.org never works is precisely that it doesn't alter the pages enough. There's no evidence that archive.today has altered any actual main content it has archived; altering the hidden fields, usernames, and paywalls, as well as random presentation elements to make the page render properly, doesn't really count as "altered" in my book, yet that's precisely what the allegation amounts to.
Jordan-117 4 days ago [-]
The accusation is not that they alter pages at all -- they obviously need to in order to make some pages readable/functional, bypass paywalls, or hide account names used to do so. The Wayback Machine does something similar with YouTube to make old videos playable.
The allegation here is that they altered page content not just to remove their own alias, but to insert the name of the blogger they were targeting. That moves it from a defensible technical change for accessibility to being part of their bizarre revenge campaign against someone who crossed them.
tonymet 4 days ago [-]
You should add this context to the talk page. You can do it anonymously without login. I wasn’t aware of either side of this allegation, and it’s helpful to understand this context.
tonymet 4 days ago [-]
Are there people who just downvote every comment? How is this a bad suggestion? If people want change on WP, they should contribute to the discussion there.
archive.today is very popular on HN; the opaque, shortened URLs are promoted on HN every day
I can't use archive.today. I tried but gave up. Too many hassles. I might be in the minority, but I know I'm not the only one. As it happens, I have not found any site that I cannot access without it
The most important issue with archive.today though is the person running it, their past and present behaviour. It speaks for itself
Whomever it is, they have lot of info about HN users' reading habits given that archive.today URLs are so heavily promoted by HN submitters, commenters and moderators
1vuio0pswjnm7 4 days ago [-]
Archive.today wants/needs EDNS subnet
"Geolocation" as a justification is ambiguous
Why a need for geolocation
Geolocation can be used for multiple purposes
"DNS performance" is only one purpose
Other purposes might offer the user no benefit, and might even be undesirable for users
As a result, some users don't send EDNS subnet. It's always been optional to send it
Even public resolvers, third party DNS services, like Cloudflare, recognise the tradeoffs for users and allow users to avoid sending it. Popular DNS software makes compiling support for EDNS subnet optional
Archive.today wants/needs EDNS subnet so badly it tries to gather it using a tracking pixel, or it tries to block users who don't send it, e.g., Cloudflare users
Thus, before one even considers all the other behaviour of this website operator, some of which is mentioned in this thread, there is a huge red flag for anyone who pays attention to EDNS subnet
As with almost all websites repeated DNS lookups are not an absolute requirement for successful HTTP requests
There are some IP addresses for archive.{today,is,md,ph,li,...} that have continued to work for years
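For readers unfamiliar with EDNS Client Subnet: it is EDNS option code 8, defined in RFC 7871, and the privacy question is exactly how much of the client's address a resolver forwards upstream. A rough, illustrative sketch of the option body's wire encoding (family, source prefix length, scope prefix length, truncated address), not a full DNS message builder:

```python
import ipaddress
import struct

ECS_OPTION_CODE = 8  # EDNS Client Subnet, RFC 7871

def encode_ecs(prefix: str) -> bytes:
    """Encode an ECS option body for a prefix like '203.0.113.0/24'.
    Wire format: FAMILY(2) | SOURCE PREFIX-LEN(1) | SCOPE PREFIX-LEN(1) | ADDRESS (truncated)."""
    net = ipaddress.ip_network(prefix, strict=False)
    family = 1 if net.version == 4 else 2  # 1 = IPv4, 2 = IPv6
    source_len = net.prefixlen
    # Only as many address bytes as the prefix length requires are sent
    addr_bytes = net.network_address.packed[: (source_len + 7) // 8]
    return struct.pack("!HBB", family, source_len, 0) + addr_bytes

# A privacy-conscious client can send a zero-length source prefix,
# telling the resolver not to forward any subnet information at all:
OPT_OUT = encode_ecs("0.0.0.0/0")
```

This is why "send no ECS" is always an option: the shorter the prefix, the less location information leaks, down to zero.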
belviewreview 4 days ago [-]
I use archive.today all the time. How do you access pages, like for instance on the economist, without it?
201984 4 days ago [-]
With the paywall blocker so good it got banned! You can also get it on Android.
A Russian domain git website hosting just a readme.md and a copy of the MIT license but no source code? Just the extension files?
moho 4 days ago [-]
The author got banned from GitHub and GitLab after DMCA takedowns. The code used to be available there, but I guess he got tired of starting over?
Anyway, extensions are just signed zip files. You can extract them and view the source. BPC sources are not compressed or obfuscated. The extension is evaluated and signed by Mozilla (otherwise it wouldn't install in release-channel Firefox), if you put any stock in that.
ranger_danger 4 days ago [-]
For me, all archive.* links just present an endless captcha loop. I am not using CF DNS or any proxy/VPN, but even if I do try those things, it still doesn't work.
1vuio0pswjnm7 4 days ago [-]
http-request set-header user-agent "Mozilla/5.0 (Linux; Android 14) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.6533.103 Mobile Safari/537.36 Lamarr" if { hdr(host) -m end economist.com }
Years ago I used some other workaround that no longer works, maybe something like amp.economist.com. AMP with text-only browser was a useful workaround for many sites
Workarounds usually don't last forever. Websites change from time to time. This one will stop working at some point
There are some people who for various reasons cannot use archive.today
gpvos 4 days ago [-]
Which utility, extension, tool or language is that?
1vuio0pswjnm7 4 days ago [-]
It's from an haproxy configuration file
This unfamiliarity is why I try to use programs that more HN readers are familiar with, like curl or wget, in HN examples. But I find those programs awkward to use. The examples may contain mistakes. I don't use those programs in real life
For making HTTP requests I use own HTTP generators, TCP clients, and local forward proxies
Given the options (a) run a graphical web browser and enable Javascript to solve an archive.today CAPTCHA that contains some fetch() to DDoS a blogger or (b) add a single line to a configuration file and use whatever client I want, no Javascript required, I choose (b)
If dang and tomhow enforced a policy against paywalled content, it would garner less interest in accessing those pages via third parties. Most news gets reported by multiple outlets in general, so the same discussions would still surface.
wolvoleo 4 days ago [-]
> Whomever it is, they have lot of info about HN users' reading habits given that archive.today URLs are so heavily promoted by HN submitters, commenters and moderators
Anyone interested in the reading habits of HN users can just take a look at news.ycombinator.com ;)
diath 4 days ago [-]
> Whomever it is, they have lot of info about HN users' reading habits given that archive.today URLs are so heavily promoted by HN submitters, commenters and moderators
It's not promoted, it's just used as a paywall bypass so everyone can read the linked article.
fouc 4 days ago [-]
You can change the TLD of any archive.today link if .today doesn't work, for example archive.ph, archive.is, archive.md, etc.
qingcharles 4 days ago [-]
There's a DNS issue between Archive Today and some ISPs which causes their domains not to resolve properly, which is why some people have a lot of trouble using it.
justincormack 4 days ago [-]
It's not "a DNS issue": they are banned in many countries and there are ongoing court cases, so various enforcement mechanisms are used.
The fact is I can't have a discussion about a paywalled article without reading it. Archive.today is popular as a paywall bypass because nobody wants HN to devolve into debate based on a headline where nobody has RTFA.
1vuio0pswjnm7 4 days ago [-]
"archive.today" as used here means the collection of archive.tld domains, where .tld could be ".is", ".md", ".ph", etc.
"promoted" as used here means placing an archive.tld URL at the top of an HN thread so that many HN readers will follow it, or placing these URLs elsewhere in threads
nobody9999 3 days ago [-]
>I can't use archive.today. I tried but gave up. Too many hassles.
What hassles have you experienced?
I use the Archive Page[0] extension which is really easy to use.
The only thing that annoys me about it is the repeated requests (starting about eight or nine months ago) to complete CAPTCHAs.
"The only thing that annoys me about it is the repeated requests (starting about eight or nine months ago) to complete CAPTCHAs"
Why does this annoy you
nobody9999 2 days ago [-]
>Why does this annoy you
Prior to that I was rarely prompted with a CAPTCHA. Now it's every. single. time. I archive something or open an AT link.
Why doesn't that annoy you?
1vuio0pswjnm7 1 days ago [-]
"Why doesn't that annoy you?"
I don't use archive.today. Why would it annoy me
rawling 4 days ago [-]
Is it not possible to create a non-repudiable archive of what a website served, when, entirely locally i.e. not relying on some third party site who might disappear or turn out to be unreliable?
Could you not in theory record the whole TLS transaction? Can it not be replayed later and re-verified?
Up until an old certificate leaks or is broken and you can fake anything "from back when it was valid", I guess.
arboles 4 days ago [-]
I don't know, but archive sites could at least publish hashes of the content at archive time. This could be used to prove an archive wasn't tampered with later. I'm pretty underwhelmed by the Wayback Machine (archive.org), it's no better technically than archive.today.
armchairhacker 3 days ago [-]
How do you ensure the tampered content isn’t re-hashed? Usually if you’re saving the hash in advance, you can save the whole archived page. Otherwise, you can use a regular archive service then hash the archived page yourself.
The only way I know to ensure an archive isn’t tampered is to re-archive it. If you sent a site to archive.today, archive.org, megalodon.jp, and ghostarchive.org, it’s unlikely that all will be tampered in the same way.
arboles 3 days ago [-]
A list of hashes (tuple of [hashed url+date metadata, hashed content]) takes much less disk space than the archive contents themselves. Archive websites could publish the list for all their content so it can be compared against in the future. People would save copies of the list. If you didn't store the list yourself ahead of time, and don't trust a third-party to be "the source of truth", the archive could've uploaded the hashes to the blockchain at archive time:
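The manifest described above can be sketched in a few lines. This is illustrative only, assuming SHA-256 and a simple url+timestamp keying scheme; no archive publishes this exact format:

```python
import hashlib

def manifest_entry(url: str, captured_at: str, content: bytes) -> tuple:
    """One entry in a published integrity list:
    (hash of url + capture timestamp, hash of the archived bytes)."""
    meta = hashlib.sha256(f"{url}|{captured_at}".encode()).hexdigest()
    body = hashlib.sha256(content).hexdigest()
    return (meta, body)

def verify(entry: tuple, url: str, captured_at: str, content: bytes) -> bool:
    """Re-derive both hashes from what the archive serves today and compare.
    Any retroactive edit to the stored page changes the content hash."""
    return manifest_entry(url, captured_at, content) == entry
```

The archive would periodically publish the full list and third parties would mirror it; a later find/replace over the database would be detectable by anyone holding an old copy of the list.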
Unfortunately you can't usefully replay TLS and validate it later, so no, that does not work. The best strategy would probably be a public transparency log, but websites are pretty variable and dynamic, so this would be unlikely to work for many.
octoberfranklin 4 days ago [-]
Actually you can! After all, TLS lacks the deniability features of more advanced cryptosystems (like OTR or Signal).
The technology for doing this is called a Zero Knowledge Proof TLS Oracle:
The 10k-foot view is that you pick the random numbers involved in the TLS handshake in a deterministic way, much like how zk proofs use the Fiat-Shamir transform. In other words, instead of using true randomness, you use some hash of the transcript of the handshake so far (sort of). Since TLS doesn't do client authentication the DH exchange involves randomness from the client.
For all the blockchain haters out there: cryptocurrency is the reason this technology exists. Be thankful.
krick 4 days ago [-]
I believe there are multiple options with different degree of "half-baked"-ness, but can anyone name the best self-hosted version of this service?
Ultimately, what we all use it for is pretty straightforward, and it seems like by now we should have arrived at approximately one best implementation, which could be used both for personal archiving and for internet-facing instances (perhaps even distributed). But I don't know if we have.
robotnikman 4 days ago [-]
I'm wondering the same thing, would be great to have something similar for personal use
seanhly 4 days ago [-]
Curiously, this isn't the first time archive.today was implicated in a DDoS. An HN post from three years back shows some pasted snippets of similar XMLHttpRequest code running on archive.ph (an archive.today synonym site). Post link: https://news.ycombinator.com/item?id=38233062
On that occasion, the target of the attack was a site named northcountrygazette.org, whose owner seems to have never become aware of the attack. The HN commenter noted when they went to the site manually it was incredibly slow, which would suggest the DDoS attempt was effective.
I tried to see if there was anything North Country Gazette had published that the webmaster of archive.today might have taken issue with, and I couldn't find anything in particular. However, the "Gazette" had previously threatened readers with IP logging to prosecute paywall bypassers (https://news.slashdot.org/story/10/10/27/2134236/pay-or-else...), and also blocks archivers in its robots.txt file, indicating it is hostile towards archiving in general.
I can no longer access North Country Gazette, so perhaps it has since gone out of business. I found a few archived posts from its dead website complaining of high server fees. Like the target of this most recent DDoS, June Maxam, the lady behind North Country Gazette, also appears/appeared to be a sleuth.
ouhamouch 4 days ago [-]
[dead]
andai 4 days ago [-]
Sounds like there's a gap in the market for a "commons" archive... maybe powered by something p2p like BitTorrent protocol?
This would have sounded Very Normal in the 2000s... I wonder if we can go back :)
bawolff 4 days ago [-]
P2P is generally bad for this use case. P2P generally only works for keeping popular content around (content gets dropped when the last peer that cares disconnects). If the content were popular it wouldn't need to be archived in the first place.
andai 4 days ago [-]
I think if you take this idea far enough you end up reinventing taxes from first principles.
quotemstr 4 days ago [-]
Imagine a proof-of-space cryptocurrency that encouraged archiving long-tail data.
pwdisswordfishy 3 days ago [-]
/dev/random as a free money printer? Sign me up.
PhilipRoman 4 days ago [-]
IMO there is actually a very low-hanging fruit here: even without P2P or DHTs we could have a URI scheme that consists of a domain and a document hash. It is then up to the user to add alternate mirrors for domains. Aside from privacy, it doesn't really matter who answers these requests since the documents are self-verifying.
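As a sketch of that idea, using a made-up `hashed://` scheme where the path carries a SHA-256 of the document: any mirror's response can be validated locally, so the client doesn't need to trust whoever served it.

```python
import hashlib
from urllib.parse import urlparse

def parse_hashed_uri(uri: str):
    """Split a hypothetical 'hashed://<domain>/<sha256-hex>' URI
    into (preferred domain, expected content hash)."""
    parts = urlparse(uri)
    return parts.netloc, parts.path.lstrip("/")

def verify_document(uri: str, document: bytes) -> bool:
    """Accept the document no matter which mirror served it,
    as long as its hash matches the one embedded in the URI."""
    _, expected = parse_hashed_uri(uri)
    return hashlib.sha256(document).hexdigest() == expected
```

The domain in the URI is then just a hint about where to fetch from first; a user-configured mirror list can answer instead without weakening the integrity check.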
xurukefi 5 days ago [-]
Kinda off-topic, but has anyone figured out how archive.today manages to bypass paywalls so reliably? I've seen people claiming that they have a bunch of paid accounts that they use to fetch the pages, which is, of course, ridiculous. I figured that they have found an (automated) way to imitate Googlebot really well.
jsheard 5 days ago [-]
> I figured that they have found an (automated) way to imitate Googlebot really well.
If a site (or the WAF in front of it) knows what it's doing then you'll never be able to pass as Googlebot, period, because the canonical verification method is a DNS lookup dance which can only succeed if the request came from one of Googlebots dedicated IP addresses. Bingbot is the same.
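The verification dance referred to here is the one Google documents: reverse-resolve the requester's IP, check the resulting hostname ends in googlebot.com or google.com, then forward-resolve that hostname and confirm it maps back to the same IP. A sketch with injectable resolvers so the logic can be exercised without live DNS (the sample IP in the usage note is illustrative):

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_real_googlebot(ip: str,
                      reverse=lambda ip: socket.gethostbyaddr(ip)[0],
                      forward=lambda host: socket.gethostbyname_ex(host)[2]) -> bool:
    """Reverse DNS, suffix check, then forward DNS back to the original IP.
    Spoofing the User-Agent passes none of these steps."""
    try:
        host = reverse(ip)
    except OSError:
        return False
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        return ip in forward(host)
    except OSError:
        return False
```

The forward-confirm step is what defeats attackers who control their own reverse DNS: they can make their IP resolve to `crawl-x.googlebot.com`, but they can't make Google's zone resolve that name back to their IP.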
xurukefi 5 days ago [-]
There are ways to work around this. I've just tested this: I've used the URL inspection tool of Google Search Console to fetch a URL from my website, which I've configured to redirect to a paywalled news article. Turns out the crawler follows that redirect and gives me the full source code of the redirected web site, without any paywall.
That's maybe a bit insane to automate at the scale of archive.today, but I figure they do something along the lines of this. It's a perfect imitation of Googlebot because it is literally Googlebot.
jsheard 5 days ago [-]
I'd file that under "doesn't know what they're doing" because the search console uses a totally different user-agent (Google-InspectionTool) and the site is blindly treating it the same as Googlebot :P
Presumably they are just matching on *Google* and calling it a day.
xurukefi 5 days ago [-]
Sure, but maybe there are other ways to control Googlebot in a similar fashion. Maybe even with a pristine looking User-Agent header.
Aurornis 5 days ago [-]
> which I've configured to redirect to a paywalled news article.
Which specific site with a paywall?
Aurornis 5 days ago [-]
> I've seen people claiming that they have a bunch of paid accounts that they use to fetch the pages, which is, of course, ridiculous.
The curious part is that they allow web scraping of arbitrary pages on demand. So a publisher could put in a lot of requests to archive their own pages and see whether they all come from a single account or a small subset of accounts.
I hope they haven't been stealing cookies from actual users through a botnet or something.
xurukefi 5 days ago [-]
Exactly. If I was an admin of a popular news website I would try to archive some articles and look at the access logs in the backend. This cannot be too hard to figure out.
coppsilgold 4 days ago [-]
You don't even need active measures. If a publisher is serious about tracing traitors there are algorithms for that (which are used by streamers to trace pirates). It's called "Traitor Tracing" in the literature. The idea is to embed watermarks following a specific pattern that would point to a traitor or even a coalition of traitors acting in concert.
It would be challenging to do with text, but is certainly doable with images - and articles contain those.
bawolff 4 days ago [-]
You need that sort of thing (i.e. watermarking) when people are intentionally trying to hide who did it.
In the archive.today case, it looks pretty automated. Surely just adding an html comment would be sufficient.
fc417fc802 4 days ago [-]
If they use paid accounts I would expect them to strip info automatically. An "obvious" way to do that is to diff the output from two separate accounts on separate hardware connecting from separate regions. Streaming services commonly employ per-session randomized steganographic watermarks to thwart such tactics. Thus we should expect major publishers to do so as well.
At which point we still lack a satisfactory answer to the question. Just how is archive.today reliably bypassing paywalls on short notice? If it's via paid accounts you would expect they would burn accounts at an unsustainable rate.
I’m an outsider with experience building crawlers. You can get pretty far with residential proxies and browser fingerprint optimization. Most of the b-tier publishers use RBC and heuristics that can be “worked around” with moderate effort.
quietsegfault 5 days ago [-]
.. but what about subscription only, paywalled sources?
tonymet 5 days ago [-]
Many publishers offer "first one's free".
For those that don't , I would guess archive.today is using malware to piggyback off of subscriptions.
elzbardico 5 days ago [-]
> which is, of course, ridiculous.
Why? In the world of web scraping this is pretty common.
xurukefi 5 days ago [-]
Because it works too reliably. Imagine what that would entail: managing thousands of accounts. You would need to strip the account details from archived pages perfectly. Every time a website changes its code even slightly you are at risk of losing one of your accounts. It would constantly break and would be an absolute nightmare to maintain. I've personally never encountered such a failure on a paywalled news article; archive.today has managed to give me a non-paywalled clean version every single time.
Maybe they use accounts for some special sites. But there is definitely some automated generic magic happening that manages to bypass the paywalls of news outlets. Probably something Googlebot-related, because those websites usually give Google their news pages without a paywall, probably for SEO reasons.
mikkupikku 5 days ago [-]
Using two or more accounts could help you automatically strip account details.
xurukefi 5 days ago [-]
That's actually a really neat idea.
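As a sketch of the two-account idea: fetch the same page with two independent accounts and keep only the spans they agree on; whatever differs (usernames, session IDs, per-account watermarks) is presumed account-specific. A hypothetical implementation using Python's difflib:

```python
import difflib

def strip_account_details(capture_a: str, capture_b: str,
                          placeholder: str = "[redacted]") -> str:
    """Keep only what two independently fetched captures agree on;
    divergent spans are replaced with a fixed placeholder."""
    matcher = difflib.SequenceMatcher(a=capture_a, b=capture_b, autojunk=False)
    out, last_end = [], 0
    for start_a, _start_b, size in matcher.get_matching_blocks():
        if start_a > last_end:  # a span present only with this account
            out.append(placeholder)
        out.append(capture_a[start_a:start_a + size])
        last_end = start_a + size
    return "".join(out)
```

A character-level diff like this is crude (it can leave small stray matches inside redacted regions), but it illustrates why two accounts are enough to strip identifiers automatically, and why per-session steganographic watermarks are the countermeasure.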
wbmva 4 days ago [-]
Do you know where the doxxed info ultimately originates from? It turns out that the archives leaked account names. Try Googling what happened to volth on Github.
permo-w 4 days ago [-]
I could be wrong, but I think I've seen it fail on more obscure sites. But yeah it seems unlikely they're maintaining so many premium accounts. On the other hand they could simply be state-backed. Let's say there are 1000 likely paywalled sites, 20 accounts for each = 20k accounts, $10/month => $200k/month = $2.4m a year. If I were an intelligence agency I'd happily drop that plus costs to own half the archived content on the internet.
Surely it wouldn't be too hard to test. Just set up an unlisted dummy paywall site, archive it a few times, and see what the requests look like.
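That test could be sketched roughly like this (hostnames and ports are made up; in real use you'd bind a public address, point a throwaway domain at it, and submit the URL to the archiver):

```python
import http.server
import threading

# Serve a fake "premium" page and log exactly what the archiver sends,
# so you can see whether requests arrive with a Googlebot User-Agent,
# a subscriber cookie, or something else entirely.
class PaywallProbe(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Record everything identifying about the incoming request.
        print("ip:", self.client_address[0])
        print("ua:", self.headers.get("User-Agent"))
        print("cookie:", self.headers.get("Cookie"))
        body = b"<html><body><p>Fake premium content.</p></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def start_probe(port: int = 0) -> http.server.HTTPServer:
    # port=0 picks a free port; real use would bind a public one.
    server = http.server.HTTPServer(("127.0.0.1", port), PaywallProbe)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```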
Jordan-117 4 days ago [-]
Interesting theory. It would also be a good way to subtly undermine the viability of news outlets, not to mention the insidious potential of altering snapshots at will. OTOH, I'd expect a state-sponsored effort to be more professional in terms of not threatening and smearing some blogger who questioned them.
permo-w 3 days ago [-]
If I were an intelligence agency wanting to throw people off my scent, maybe I'd set up or pay off a blogger to track down my site's "owner" and then do some immature shit in response to absolutely confirm forever that the blogger was right.
Not saying this is true, just saying it could be
behringer 4 days ago [-]
Replace any identifiers like usernames and emails with another string automatically.
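A minimal sketch of that idea: scrub obvious account identifiers from a captured page before storing it. The patterns here are illustrative, not exhaustive; a real archiver would need site-specific rules on top of generic ones like these.

```python
import re

# Generic patterns for account-identifying strings; purely illustrative.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SIGNED_IN = re.compile(r"(signed in as|commenting as:?)\s*\S+", re.IGNORECASE)

def scrub_identifiers(html: str, placeholder: str = "[redacted]") -> str:
    # Replace email addresses outright, and keep the "signed in as" label
    # while swapping the username for a placeholder.
    html = EMAIL.sub(placeholder, html)
    html = SIGNED_IN.sub(lambda m: m.group(1) + " " + placeholder, html)
    return html
```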
cnst 4 days ago [-]
It's because it's actively maintained, and bypassing paywalls is its whole selling point, so they have to be good at it.
They get around rendering issues by "altering" the webpages. It's not uncommon to archive a page and see nothing because of the paywall, but then later the same page is silently fixed. They have a Tumblr where you can ask them questions; at one point it was quite common for people to ask them to fix random specific pages, which they did promptly.
Honestly, you cannot archive a modern page unless you alter it. Yet they're now being attacked under the pretence of "altering" webpages, even though that's never been a secret, and it's technologically impossible to archive without altering.
Jordan-117 4 days ago [-]
There's a pretty massive difference between altering a snapshot to make it archivable/readable and doing it to smear and defame a blogger who wrote about you.
Cider9986 4 days ago [-]
I imagine accounts are the only way that archive.today works on sites like 404media.co that seem to have server-side paywalls. Similarly, twitter has a completely server-side paywall.
layer8 5 days ago [-]
It’s not reliable, in the sense that there are many paywalled sites that it’s unable to archive.
xurukefi 5 days ago [-]
But it is reliable in the sense that if it works for a site, then it usually never fails.
tonymet 4 days ago [-]
no tool is 100% effective. Archive.today is the best one we've seen
comeonbro 4 days ago [-]
There is an enormous amount of stuff that is only on archive.today, including stuff that is otherwise gone forever. A mix of stuff that somebody only ever did archive.today on and not archive.org, and stuff that could only be archived on archive.today because archive.org fails on it.
Anything on twitter post-login-wall for one. A million only-semi-paywalled news articles for others. But mainly an unfathomably long tail.
It was extremely distressing when the admin started(?) behaving badly for this reason. That others are starting to react this way to it is understandable. What a stupid tragedy.
_el1s7 4 days ago [-]
Just went into a rabbit hole looking into this, wow, can't tell if this is just another drama on the weird wide web or something else.
croes 5 days ago [-]
> “I’m glad the Wikipedia community has come to a clear consensus, and I hope this inspires the Wikimedia Foundation to look into creating its own archival service,” he told us.
Hardly possible for Wikimedia to provide a service like archive.today given the legal trouble of the latter.
Strangely naive.
anilakar 5 days ago [-]
> If you want to pretend this never happened – delete your old article and post the new one you have promised. And I will not write “an OSINT investigation” on your Nazi grandfather
From hero to a Kremlin troll in five seconds.
alfiedotwtf 4 days ago [-]
It would be nice if there was a non-dynamic snapshot archive as well as the page itself. That way, if the loaded JavaScript causes it to stop rendering, at least there'd be a static fallback.
nubinetwork 5 days ago [-]
I noticed I've started being redirected to a blank nginx server for archive.is... but only the .is domain, .ph and .today work just fine. I wonder if they ended up on an adblocker or two.
stephen_g 4 days ago [-]
There was some beef the site owner had with Cloudflare where, if you were using Cloudflare DNS, it wouldn't serve anything to you? Is that still happening?
Not sure why it would only be on archive.is and not the others but ‘is’ loads for me.
nubinetwork 4 days ago [-]
Oh maybe... I don't use cloudflare DNS, but maybe one of my rpz zones does something weird...
jl6 4 days ago [-]
Am I reading this right… they tampered with an archived page and then changed it back? How do we know? Is there another archive site that has before and after proof?
They've changed the usernames they use to post under. That's the only "altering" they've been accused of.
BTW, they also alter paywalls and other elements, because otherwise, many websites won't show the main content these days.
It kind of seems like "altered" is the new "hacker" today?
Jordan-117 4 days ago [-]
Specifically, they changed a "commenting as: [their alias]" UI element to "commenting as: [name of the blogger they were fighting with]".
Compare (the changed element is near the very bottom of the page; replace the "[dot]" since these URLs seem to trigger spam filters for some commenters):
>In emails sent to Patokallio after the DDoS began, “Nora” from Archive.today threatened to create a public association between Patokallio’s name and AI porn and to create a gay dating app with Patokallio’s name.
Oh good. That's definitely a reasonable thing to do or think.
The raw sociopathy of some people. Getting doxxed isn't good, but this response is unhinged.
jMyles 5 days ago [-]
It's a reminder how fragile and tenuous are the connections between our browser/client outlays, our societal perceptions of online norms, and our laws.
We live at a moment where it's trivially easy to frame possession of an unsavory (or even illegal) number on another person's storage media, without that person even realizing (and possibly, with some WebRTC craftiness and social engineering, even get them to pass on the taboo payload to others).
oytis 5 days ago [-]
I mean, the admin of archive.today might face jail time if deanonymised, kind of understandable he's nervous. Meanwhile for Patokallio it's just curiosity and clicks
ouhamouch 5 days ago [-]
Those were private negotiations, btw, not public statements.
In response to what? J.P.'s blog had already framed AT as a project grown from a carding forum and pushed his speculations onto Ars Technica, whose parent company just destroyed 12ft and is on to a new victim. The story is full of untold conflicts of interest, covered with a soap opera around the DDoS.
MBCook 5 days ago [-]
Why does it matter it was a private communications?
It’s still a threat isn’t it?
5 days ago [-]
Yossarrian22 5 days ago [-]
Can you elaborate on your point?
ouhamouch 5 days ago [-]
The fight is not about where it is shown and not about what, not about "links in Wikipedia", but about whether News Inc will be able to kill AT, as they did with 12FT.
Yossarrian22 5 days ago [-]
What is News Inc? Are they a funder of Wikipedia(I think Wikipedia didn’t have a parent company so they’re not owners)?
ouhamouch 5 days ago [-]
They are the owner of Ars Technica, which wrote its 3rd (or 4th?) article in a row on AT, painting it in certain colors.
The article about the FBI subpoena that pulled J.P.'s speculations out of the closet was also in Ars Technica, by the same author, and that same article explicitly mentioned how happy they are with 12ft down.
Yossarrian22 5 days ago [-]
… Ars is owned by Conde Nast?
5 days ago [-]
ouhamouch 5 days ago [-]
from the Ars article:
---
US publishers have been fighting web services designed to bypass paywalls. In July, the News/Media Alliance said it secured the takedown of paywall-bypass website 12ft.io. “Following the News/Media Alliance’s efforts, the webhost promptly locked 12ft.io on Monday, July 14th,” the group said. (Ars Technica owner Condé Nast is a member of the alliance.)
---
tetris11 4 days ago [-]
Archive.today's domain registrar is Tucows for anyone wondering
ValentineC 3 days ago [-]
Just curious: is this of any significance?
bjourne 5 days ago [-]
FYI, archive.today is NOT the Internet Archive/Wayback Machine.
super256 4 days ago [-]
I prefer archive.today because the Internet Archive’s Wayback Machine allows retrospective removals of archived pages. If a URL has already been crawled and archived, the site owner can later add that URL to robots.txt and request a re-crawl. Once the crawler detects the updated robots.txt, previously stored snapshots of that page can become inaccessible, even if they were captured before the rule was added.
Unfortunately this happens more often than one would expect.
I found this out when I preserved my very first homepage I made as a child on a free hosting service. I archived it on archive.org, and thought it would stay there forever. Then, in 2017 the free host changed the robots.txt, closed all services, and my treasured memory was forever gone from the internet. ;(
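For concreteness, the retroactive mechanism described here needed nothing more than an ordinary robots.txt on the (possibly repurposed) domain. A hypothetical example, assuming the crawler token the Wayback Machine historically honored:

```text
# Hypothetical robots.txt published by a domain's new owner.
# Under the Wayback Machine's old policy, rules like these were applied
# retroactively: existing snapshots of the disallowed paths became
# inaccessible once the crawler re-read this file.
User-agent: ia_archiver
Disallow: /
```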
pgalvin 4 days ago [-]
This information is now many years out of date - they no longer have this policy.
extraduder_ire 4 days ago [-]
Any idea when that changed? I've been unable to access historical sites in the past because someone parked the domain and had a very restrictive robots.txt on it.
So toward the end of last year, the FBI was after archive.today, presumably either for keeping track of things the current administration doesn't want tracked, or maybe for the paywall thing (on behalf of rich donors/IP owners). https://gizmodo.com/the-fbi-is-trying-to-unmask-the-registra...
That effort appears to have gone nowhere, so now suddenly archive.today commits reputational suicide? I don't suppose someone could look deeper into this please?
> Regarding the FBI’s request, my understanding is that they were seeking some form of offline action from us — anything from a witness statement (“Yes, this page was saved at such-and-such a time, and no one has accessed or modified it since”) to operational work involving a specific group of users. These users are not necessarily associates of Epstein; among our users who are particularly wary of the FBI, there are also less frequently mentioned groups, such as environmental activists or right-to-repair advocates.
> Since no one was physically present in the United States at that time, however, the matter did not progress further.
> You already know who turned this request into a full-blown panic about “the FBI accusing the archive and preparing to confiscate everything.”
> an analysis of existing links has shown that most of its uses can be replaced.
Oh? Do tell!
nobody9999 5 days ago [-]
>> an analysis of existing links has shown that most of its uses can be replaced.
>Oh? Do tell!
They do. In the very next paragraph in fact:
The guidance says editors can remove Archive.today links when the original
source is still online and has identical content; replace the archive link so
it points to a different archive site, like the Internet Archive,
Ghostarchive, or Megalodon; or “change the original source to something that
doesn’t need an archive (e.g., a source that was printed on paper)
chrisjj 5 days ago [-]
[flagged]
Kim_Bruning 5 days ago [-]
> archive.today
Hopeless. Caught tampering with the archive.
The whole situation is not great.
rockskon 2 days ago [-]
Did they? I thought the claim was code was added unrelated to the contents of the archived pages that effectuated a DDOS on someone's blog.
snigsnog 4 days ago [-]
I'd rather deal with this weird feud than not have access to any archived content that people want censored. Defeats the entire purpose of an archive
5 days ago [-]
nobody9999 5 days ago [-]
I just quoted the very next paragraph after the sentence you quoted and asked for clarification.
I did so. You're welcome.
As for the rest, take it up with Jimmy Wiles, not me.
mikehotel 5 days ago [-]
aka Jimbo Wales
nobody9999 4 days ago [-]
Thanks for the correction. I can't type the letter 'a'[0].
I would be surprised if archive.today had something that was not in the Wayback Machine.
chrisjj 5 days ago [-]
Archive.today has just about everything the archived site doesn't want archived. Archive.org doesn't, because it lets sites delete archives.
layman51 4 days ago [-]
I know that sometimes the behavior of each archiver service is a bit different. For example, it's possible that both Archive.today and the Internet Archive say they have a copy of a page, but then when you open up the IA version, you might see that it renders completely differently or not at all. It might be because the webpage has like two scrollbars, or maybe there's a redirect that happens when a link to the page is loaded. I notice this seems to happen on documentation pages that are hosted by Salesforce. It can be a bit of a pain if you want to save a backup copy online of a release note or something like that for everyone to easily reference in the future.
chrisjj 4 days ago [-]
> it's possible that both Archive.today and the Internet Archive say they have a copy of a page, but then when you open up the IA version, you might see that it renders completely differently or not at all
AT archives the page as seen, even including a screenshot.
IA archives the page as loaded, then when you view it, ham-fistedly injects its header bar and executes the source JS. As you'd expect, the result is often wrecked, or tampered.
bombcar 5 days ago [-]
Wayback machine removes archives upon request, so there’s definitely stuff they don’t make publicly available (they may still have it).
super256 4 days ago [-]
You don't even need to make requests if you are the owner of the URL. Robots.txt changes are applied retroactively, which means you can disallow crawls to /abc, request a re-crawl, and all past snapshots matching the new rule will be removed.
zahlman 5 days ago [-]
Trying to search the Wayback machine almost always gives me their made-up 498 error, and when I do get a result the interface for scrolling through dates is janky at best.
ribosometronome 5 days ago [-]
Accounts to bypass paywalls? The audacity to do it?
that_lurker 5 days ago [-]
Oh yeah, those were a thing. As a public organization they can't really do that.
I personally just don't use websites that paywall important information.
eviks 4 days ago [-]
> the community should figure out how to efficiently remove links to archive.today
You're part of the community! Prove him right!
chrisjj 4 days ago [-]
:)
But seriously, removal is simple but replacement is not.
dakolli 4 days ago [-]
The FBI called out archive.today a couple months ago; there's clearly a campaign against them by the USA (4th Reich), which stands principally against any information repository they don't control or have influence over (it's Russian-owned). This is simply donors of the Trump regime who own media companies requesting this, because it's the primary way around paywalls for most people who know about it.
5 days ago [-]
realaaa 3 days ago [-]
wow! But this felt like the end of the story. Here is an LLM summary of the timeline, shared as-is:
---------
Here’s the chronology that the HN thread id=47092006 is about, based on the linked Ars Technica article and related sources.
---
## 1. What “started the argument”?
The core dispute starts from a 2023 blog post by engineer Jani Patokallio on his site Gyrovague, investigating who is behind archive.today. That post, plus later FBI interest, led to:
1. A *GDPR/takedown campaign* against the blog post.
2. An *apparent DDoS* launched from archive.today’s CAPTCHA page against his blog.
3. *Threats* from the archive.today operator (“Nora”) to associate Patokallio’s name with AI porn and other harassment.
4. *Discovery that archive.today had altered archived pages* to insert Patokallio’s name.
5. A *Wikipedia RfC* and decision to deprecate and blacklist archive.today links.
The Hacker News thread you referenced is about the final step: Wikipedia’s decision to remove ~695,000 archive.today links.
---
## 2. Timeline of the situation
```mermaid
timeline
title archive.today – Wikipedia controversy chronology
2012-2015 : Site founded as archive.is; later branded archive.today
2023-08-05 : Patokallio publishes investigation into archive.today’s ownership
2025-10-30 : FBI subpoena to archive.today’s registrar (Tucows)
2025-11-05 : Heise reports FBI subpoena, links to Patokallio’s 2023 post
2026-01-08 : GDPR complaint from “Nora” to Automattic re Patokallio’s post
2026-01-10 : archive.today webmaster emails Patokallio asking for temporary takedown
2026-01-11 : DDoS from archive.today CAPTCHA page against Gyrovague begins
2026-01-14 : First public HN report about weird/DDoS behavior from archive.today
2026-01-21 : gyrovague.com added to DNS blocklists used by ad blockers
2026-01-25 : Email exchange escalates; “Nora” threatens AI porn, “gay dating app”, “Nazi grandfather”
2026-02-01 : Patokallio publishes detailed timeline and DDoS disclosure
2026-02-07 : Wikipedia RfC opens on archive.today links
2026-02-10 : Ars Technica reports on DDoS and Wikipedia considering blacklist
2026-02-19 : DDoS code still present in archive.today CAPTCHA page (per Wikipedia guidance)
2026-02-20 : RfC closed; consensus to deprecate/blacklist archive.today
2026-02-20–21 : Major outlets report Wikipedia’s blacklist; guidance page created
```
So, in terms of your question:
- *What started the argument* was Patokallio’s 2023 investigation into archive.today’s ownership, which later coverage of the FBI subpoena amplified.
- The *direct trigger for Wikipedia’s action* was the combination of:
- The *DDoS* launched from archive.today against his blog.
- The *threats* (AI porn, harassment) against him.
- Evidence that the *archive’s content had been tampered with*, violating Wikipedia’s trust in it as a citation source.
ValveFan6969 4 days ago [-]
[dead]
ValveFan6969 5 days ago [-]
[dead]
Keekgette 4 days ago [-]
[flagged]
attila-lendvai 5 days ago [-]
[flagged]
Permit 4 days ago [-]
> i don't know anything specific about the site or any conflicts involved, yet this smells like a negative PR campaign to me...
What possible value could a comment from someone who has no knowledge of the site or conflict add to this discussion?
attila-lendvai 3 days ago [-]
the sniff test, or gut reaction...
sure, it's worth only as much as of a total stranger, but still.
ChrisArchitect 5 days ago [-]
[flagged]
input_sh 5 days ago [-]
I know I'm arguing with a bot that nobody monitors, but it's already in the fucking post.
ChrisArchitect 4 days ago [-]
You're late. Things have changed.
casey2 5 days ago [-]
Anecdotally I generally see archive.is/archive.today links floating around "stochastic terrorist" sites and other hate cults.
oytis 4 days ago [-]
I see them everywhere where paywalled content is referenced
snigsnog 4 days ago [-]
Shows that it's a great archival service if the most censored people are able to use it without their archives being censored.
TZubiri 5 days ago [-]
They seem totally unrelated to the Internet Archive. They probably only ever got on Wikipedia by leeching off the IA brand and confusing enough people into using them.
Onavo 5 days ago [-]
Wayback machine won't bypass paywall nor pirate content, not to mention they are under US jurisdiction. You can't have your cake and eat it.
krick 4 days ago [-]
Honestly, IMHO archive.today is just so much nicer to use in every aspect than IA, that unless they outright start to distribute malware (I mean, like, via the page itself — otherwise it's pretty much irrelevant), I don't think I'll stop using it.
tl2do 5 days ago [-]
Why not show both? Wikipedia could display archive links alongside original sources, clearly labeled so readers know which is which. This preserves access when originals disappear while keeping the primary source as the main reference.
bawolff 5 days ago [-]
The objection is to this specific archive service, not archiving in general.
AgentME 4 days ago [-]
Wikipedia shouldn't allow links to sites which intentionally falsify archived pages and use their visitors to perform DDOS attacks.
ranger207 5 days ago [-]
They generally do. Random example, citation 349 on the page of George Washington: ""A Brief History of GW"[link]. GW Libraries. Archived[link] from the original on September 14, 2019. Retrieved August 19, 2019."
Gander5739 5 days ago [-]
This will always be done unless the original url is marked as dead or similar.
shevy-java 5 days ago [-]
Anyone has a short summary as to who and why Archive.today acted via DDos? Isn't that something done by malicious actors? Or did others misuse Archive.today?
zeroonetwothree 5 days ago [-]
If you read the linked article it is discussed
alsetmusic 5 days ago [-]
I will no longer donate to Wikipedia as long as this is policy.
jraph 5 days ago [-]
Why? The decision seems reasonable at first sight.
chrisjj 5 days ago [-]
Second sight is advisable in such cases. Fact is, archives are essential to WP integrity and there's no credible alternative to this one.
I see WP is not proposing to run its own.
mook 5 days ago [-]
Wouldn't it be precisely because archives are important that using something known to modify the contents would be avoided?
esseph 5 days ago [-]
> something known to modify the contents would be avoided?
Like Wikipedia?
beej71 4 days ago [-]
No, not like that. There's a difference between a site that:
1) provides a snapshot of another site for archival purposes.
2) provides original content.
You're arguing that since encyclopedias change their content, the Library of Congress should be allowed to change the content of the materials in its stacks.
By modifying its archives, archive.today just flushed its credibility as an archival site. So what is it now?
esseph 4 days ago [-]
> You're arguing that since encyclopedias change their content, the Library of Congress should be allowed to change the content of the materials in its stacks.
As an end user of Wikipedia there are occasions where content has been scrubbed and/or edits hidden. Admins can see some of those, but end users cannot (with various justifications, some excellent/reasonable and some.. nebulous). That's all I'm saying, nothing about Congress or such other nonsense. It seems like an occasion of the pot calling the kettle names from this side of the fence.
beej71 4 days ago [-]
But Wikipedia promises you that it will modify its content. They're transparent about that promise.
An archival site (by default definition) promises you that it will not modify its content. And when it does, it's no longer an archival site.
Wikipedia has never been an archival site and it never will be. archive.today was an archival site, but now it never will be again.
ouhamouch 4 days ago [-]
This is your imaginary archive from the world of pink ponies.
Meanwhile their AMA on Reddit: no promises, no commitment. Just like a Microsoft EULA :)
What I don't see on that page is where they explicitly don't promise to not modify anything in the archive.
chrisjj 4 days ago [-]
> What I don't see on that page is where they explicitly don't promise to not modify anything in the archive.
I'm quoting all of that because it lacks an explicit promise of non-modification /i
Meanwhile seriously, if you were disappointed not to see e.g. "We explicitly don't promise not to modify", then perhaps you should consider why, regardless, this site was trusted enough to get a gazillion links in Wikipedia... and HN.
beej71 4 days ago [-]
> I'm quoting all of that because it lacks an explicit promise of non-modification.
And I'm quoting all of that because it lacks an explicit (or implicit) promise of modification. :)
It was (emphasis on past-tense) so-trusted because it advertises itself as an archival site. (The linked disclaimer is all about it not being a "long-term" archival site. It says it archives pages for latecomers. There is an implication here that it archives them accurately. What use is a site for latecomers if they change the content to be something else?) If they'd said or indicated they would be changing the content to no longer reflect the original site, Wikipedia would not have linked to them because they wouldn't be a credible source.
In any case, now I can't use them to share or use links since we can no longer trust those archives to be untampered. When I share a link to nyt content on archive.today or copy and paste content into email, I'm putting my name on that declaring "nyt printed this". If that's not true, it's my reputation.
Just like it was archive.today's.
esseph 4 days ago [-]
> When I share a link to nyt content on archive.today or copy and paste content into email, I'm putting my name on that declaring "nyt printed this". If that's not true, it's my reputation.
What if the nyt article itself is the problem? How does that square?
ouhamouch 4 days ago [-]
[dead]
chrisjj 5 days ago [-]
Obviously not, since archive.org is encouraged.
huslage 5 days ago [-]
What exactly is credible about archive.today if they are willing to change the archive to meet some desire of the leadership? That's not credible in the least.
chrisjj 5 days ago [-]
A lot more credible than archive.org that lets archives be changed and deleted by the archive targets.
What's your better idea?
josephcsible 4 days ago [-]
Does archive.org really let its archives be changed? That's very different than letting them be deleted from a credibility perspective.
ouhamouch 4 days ago [-]
Yes.
Archive.org snapshots may load javascript from external sites, where the original page had loaded them. That script can change anything on the page. Most often, the domain is expired and hijacked by a parking company, so it just replaces the whole page with ads.
The page "gets changed" every second. It is easy to make an archived page that shows different content depending on the current time, whether you're on Mac or Windows, your locale, or your browser fingerprint, or that has been tailored for you personally.
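The failure mode described here is easy to check for. A sketch that flags externally loaded scripts in a saved snapshot, assuming you have the snapshot's raw HTML:

```python
from html.parser import HTMLParser

# Scan an archived snapshot for <script src=...> tags still pointing at
# external domains. Any such script runs when the snapshot is viewed, so
# whoever controls that domain today (e.g. a parking company squatting an
# expired domain) can rewrite the "archived" page at will.
class ExternalScriptFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src") or ""
            if src.startswith(("http://", "https://", "//")):
                self.external.append(src)

def find_external_scripts(html: str) -> list:
    finder = ExternalScriptFinder()
    finder.feed(html)
    return finder.external
```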
josephcsible 4 days ago [-]
I don't think it's fair to equate running JS that can change the rendered output with the archive server actually changing the HTML it sends back.
ouhamouch 4 days ago [-]
I agree, JS is much worse. Because anyone could create an "untrustworthy" page on archive.org, no hack or admin assistance is required.
chrisjj 4 days ago [-]
Much worse indeed. That's why one should be deeply sceptical of the handful of WP users seeking to replace archive.today with archive.org. AT allows tampering by the archive operator; IA allows tampering by half the planet... including WP editors who'd love that replacement.
RupertSalt 5 days ago [-]
> the archive targets
Isn't there a substantial overlap with the copyright holders?
chrisjj 4 days ago [-]
Overlap?
that_lurker 5 days ago [-]
The operators of archive.today (and the other domains) are doing shady things and the links are not working, so why keep the site around when, for example, the Internet Archive's Wayback Machine works as an alternative to it?
chrisjj 5 days ago [-]
What archive.today links are not working?
> Internet archives wayback machine works as alternative to it.
It is appallingly insecure. It lets archives be altered by page JS and deleted by the page domain owner.
Nonstarter for anything that you actually want to be preserved, especially anything controversial.
chrisjj 4 days ago [-]
No request is needed. Just robots.txt to deliver a bulk removal.
throw0101a 5 days ago [-]
> Fact is, archives are essential to WP integrity and there's no credible alternative to this one.
Yes, they are essential, and that was the main reason for not blacklisting Archive.today. But Archive.today has shown they do not actually provide such a service:
> “If this is true it essentially forces our hand, archive.today would have to go,” another editor replied. “The argument for allowing it has been verifiability, but that of course rests upon the fact the archives are accurate, and the counter to people saying the website cannot be trusted for that has been that there is no record of archived websites themselves being tampered with. If that is no longer the case then the stated reason for the website being reliable for accurate snapshots of sources would no longer be valid.”
How can you trust that the page that Archive.today serves you is an actual archive at this point?
chrisjj 5 days ago [-]
> If ... If ...
Oh dear.
> How can you trust that the page that Archive.today serves you is an actual archive at this point?
Because no-one has shown evidence that it isn't.
rufo 5 days ago [-]
The quote uses ifs because it was written before this was verified, but the Wikipedia thread in question has links to evidence of tampering occurring.
It lands "503 Service Unavailable
No server is available to handle this request."
Gander5739 4 days ago [-]
Apologies, then. The Wayback link works just fine for me, no errors.
prmoustache 4 days ago [-]
> there's no credible alternative to this one.
But this one is not credible either so...
Jordan-117 5 days ago [-]
Did you not read the article? They not only directed a DDOS against a blogger who crossed them, but altered their own archived snapshots to amplify a smear against them. That completely destroys their trustworthiness and credibility as a source of truth.
chrisjj 5 days ago [-]
Sure I read it. But I don't believe everything I read on the internet.
creatonez 4 days ago [-]
The proof is right there for you to see. Denying it is rather wacky.
ouhamouch 5 days ago [-]
Altered snapshots = hide Nora name?
Ars Technica just did the same; they removed Nora from older articles. How can you trust Ars Technica after that?
Jordan-117 4 days ago [-]
They didn't just remove her name, but replaced it with the target's name.
I don't know what you're talking about re: Ars removing her name from old articles.
Jordan-117 4 days ago [-]
Follow-up: maybe you're confusing Ars Technica with Wikipedia, whose admins did redact Nora's last name from discussions? If so, that's a weird equivalence to draw, since the change was disclosed and done to protect personal information, not attack someone else in the process. (Also, "Nora [redacted]" itself seems to be a name lifted from an unrelated person who had merely contacted Archive.today with a takedown request.)
Smartchat 4 days ago [-]
1. I can't post links (I've already tried); my comments with links are getting shadowbanned. Check out Jon Brodkin's article on Ars about AT (not today's, but the previous one, from 6 days ago). Nora's name was there, but now it's silently gone.
2. We learned about Nora's involvement from Patokallio. We learned about Nora's non-involvement... also from Patokallio. They could have reached a settlement with AT that includes hiding Nora's name.
3. Regardless of who Nora is, it is interesting to see the extent of this censorship: so far only gyrovague.com and arstechnica.com, but not tomshardware.com and not tech.yahoo.com. This shows which sites are working closely with the AT defamation campaign, and which are simply copying the news feed.
Jordan-117 4 days ago [-]
Silently? It tells you right there in the article: "Nora [last name redacted]". Maybe they could add a more fulsome explanation in an editor's note but it seems pretty obvious in context.
If AT is appropriating some random person's name as an alias, it seems helpful to report on that publicly in order to expose the practice and help clear up the misinformation.
Smartchat 4 days ago [-]
Silently. Last article. Not today's.
One with title 'Archive.today CAPTCHA page executes DDoS; Wikipedia considers banning site'
Even if they did, so what? There's nothing wrong with a news article removing personal information as a precaution. It's light-years away from altering the content of an archival snapshot in order to target someone else.
Smartchat 4 days ago [-]
Well, that's the only name they removed, even though it didn't stand out among the other names in the investigation. Secondly, it's ironic to do so in an article tagged "Streisand Effect" so perhaps we're witnessing part of the performance. And thirdly, it's strange to blame AT for removing... the same name, and not blame Ars. Immediately accusing... AT of double standards and hypocrisy.
I am lost here. It is definitively an organized defamation campaign.
“You are guilty simply because I am hungry”
Jordan-117 4 days ago [-]
Seems more like Ars trying to avoid piling more attention on the name of a person that isn't actually involved.
And again, the accusation against Archive.today isn't just that they removed their "Nora" alias from a snapshot, but that they replaced it with the name of the blogger they were quarreling with. There's no defensible reason to do that outside of petty revenge (which tracks with the emails and public statements from the Archive.today maintainer).
Smartchat 4 days ago [-]
> Ars trying to avoid piling more attention on the name of a person that isn't actually involved.
Oh, yes, by removing the name in the context of "Streisand Effect".
> petty revenge
How is it "revenge"? Was it a porn page? Or something bad?
It is likely just a funny placeholder name of the same length that came to mind.
--
We could find good and bad motives for both AT and Ars.
The bias against AT was here a priori. Paywall story for Condé Nast, russophobia for the rest.
Jordan-117 4 days ago [-]
They apparently did a find + replace across their database to change the Nora alias to the blogger's name. So any archives of content referencing her would instead point to him, muddying the waters and blaming him for anything she was accused of. Like I said, petty.
The porn smear threats came later, via email.
ouhamouch 4 days ago [-]
[dead]
4 days ago [-]
Larrikin 5 days ago [-]
About how much had you previously donated over the years?
5 days ago [-]
selridge 5 days ago [-]
[flagged]
kmeisthax 5 days ago [-]
[flagged]
paganel 5 days ago [-]
At this point Archive.today provides a better service (all things considered) compared to Wikipedia, at least when it comes to current affairs.
How does the tech behind archive.today work in detail? Is there any information out there that goes beyond the Google AI search reply or this HN thread [2]?
[1] https://algustionesa.com/the-takedown-campaign-against-archi... [2] https://news.ycombinator.com/item?id=42816427
They also tampered with their archive for a few of the social media sites (Twitter, Instagram, Blogger) by changing the name of the signed in account to Jani Patokallio. https://megalodon.jp/2026-0220-0320-05/https://archive.is:44...
I think Wikipedia made the right decision, you can't trust an archival service for citations if every time the sysop gets in a row they tamper with their database.
There is evidence of this in the article you're commenting on.
I assume it must be a blanket ban on Finnish IPs, as there have been comments about it on Reddit and none of my friends can get it to work either. Five different ISPs were tried. So at the very least it seems to affect the majority of Finnish residential connections.
That's awesome. I wish everyone made sure of their facts. Thanks.
Now it's obviously possible that my VPN was whitelisted somehow, or that the GeoIP of it is lying. This is just a singular datapoint.
setInterval(function(){fetch("https://gyrovague.com/tag/"+Math.random().toString(36).subst...",{ referrerPolicy:"no-referrer",mode:"no-cors" });},1400);
https://archive-is.tumblr.com/post/808911640210866176/people...
archive.org also complies with takedown requests, so it's worth asking: could the organised campaign against archive.today have something to do with it preserving content that someone wants removed?
It would be interesting to run the numbers, but I get the feeling that AI-generated articles may have a higher LIX number. Authors are then less inclined to "fix" the text, because longer words make them seem smarter.
There is so much archived there; to lose it all would be a tragedy.
owner-archive-today . blogspot . com
2 years old, like J.P's first post on AT
They also cannot hijack data with a residential botnet or buy subscriptions themselves. Otherwise, the saved page would contain information about the logged-in user. It would be hard to remove this information, as the code changes all the time, and it would be easy for the website owner to add an invisible element that identifies the user. I suppose they could have different subscriptions and remove everything that isn't identical between the two, but that wouldn't be foolproof.
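The "remove everything that isn't identical between the two" idea above can be sketched in a few lines. This is a toy illustration only (hypothetical helper names, line-based diffing): capture the same page under two different accounts and keep only what both captures agree on, dropping anything account-specific.

```javascript
// Toy version of the "two subscriptions" scrubbing idea: keep only lines
// that appear in both captures, so account-specific markup is dropped.
// As noted above, this is not foolproof -- legitimate markup that varies
// between renders (timestamps, ad slots) would be dropped too.
function scrubByIntersection(captureA, captureB) {
  const b = new Set(captureB.split("\n"));
  return captureA.split("\n").filter((line) => b.has(line)).join("\n");
}

const a = "<p>Article text</p>\n<span>user: alice@example.com</span>";
const c = "<p>Article text</p>\n<span>user: bob@example.com</span>";
console.log(scrubByIntersection(a, c)); // "<p>Article text</p>"
```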
https://megalodon.jp/2026-0221-0304-51/https://d914s229qk4kj...
https://archive.is/Y7z4E
The second shows volth's Github notifications. Volth was a major nix-pkgs contributor, but his Github account disappeared.
https://github.com/orgs/community/discussions/58164
This particular addon is blocked on most western git servers, but can still be installed from Russian git servers. It includes custom paywall-bypassing code for pretty much every news website you could reasonably imagine, or at least those sites that use conditional paywalls (paywalls for humans, no paywalls for big search engines). It won't work on sites like Substack that use proper authenticated content pages, but these sorts of pages don't get picked up by archive.today either.
My guess would be that archive.today loads such an addon with its headless browser and thus bypasses paywalls that way. Even if publishers find a way to detect headless browsers, crawlers can also be written to operate with traditional web browsers where lots of anti-paywall addons can be installed.
Thanks for sketching out their approach and for the URI.
https://www.reddit.com/r/Advice/comments/5rbla4/comment/dd5x...
The way I (loosely) understand it, when you archive a page they send your IP in the X-Forwarded-For header. Some paywall operators render that into the page content served up, which then causes it to be visible to anyone who clicks your archived link and Views Source.
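The failure mode described above can be sketched like this. This is a hypothetical server-side template, not archive.today's actual code: some backends interpolate the client IP (taken from X-Forwarded-For) into the HTML they serve, so if an archiver forwards the submitter's IP in that header, it gets baked into the snapshot.

```javascript
// Hypothetical sketch of the X-Forwarded-For leak: a server that echoes
// the forwarded client IP into its HTML. If an archive crawler passes the
// *submitter's* IP in that header, the rendered page -- and thus the
// archived snapshot -- contains it for anyone who views source.
function renderPage(headers) {
  // Proxies may append multiple IPs; the first entry is the original client.
  const xff = (headers["x-forwarded-for"] || "unknown").split(",")[0].trim();
  return `<html><body>
<p>Article content...</p>
<!-- debug: served to ${xff} -->
</body></html>`;
}

// What the crawler effectively captures when it forwards your IP:
const snapshot = renderPage({ "x-forwarded-for": "203.0.113.7" });
console.log(snapshot.includes("203.0.113.7")); // true
```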
I’m guessing by using a residential botnet and using the existing credentials of unwitting "victims" by automating their browsers.
> Otherwise, the saved page would contain information about the logged-in user.
If you read this article, there's plenty of evidence they are manipulating the scraped data.
But I’m just speculating here…
I guess if they can control a residential botnet more extensively they would be able to do that, but it would still be very difficult to remove login information from the page, the fact that they manipulated the scraped data for totally unrelated reasons a few times proves nothing in my opinion.
With this said, I also disagree with turning everyone that uses archive[.]today into a botnet that DDoS sites. Changing the content of archived pages also raises questions about the authenticity of what we're reading.
The site behaves as if it was infected by some malware and the archived pages can't be trusted. I can see why Wikipedia made this decision.
It's very silly to talk about doxing when all someone has done is gather information anyone else could equally easily obtain, given enough patience and time, especially when it's information the person in question put out there themselves. If it doesn't take any special skills or connections to obtain the information, only the inclination to actually perform the research on publicly available data, I don't see what has been done that is unethical.
That's no justification for using visitors to your site to do a DDOS.
In the slang of reddit: ESH
No, harassment also includes persistent attempts to cause someone grief, whether or not they involve direct interactions with that person.
From Wikipedia:
> Harassment covers a wide range of behaviors of an offensive nature. It is commonly understood as behavior that demeans, humiliates, and intimidates a person.
Because the two are distinct, one can't simply replace "doxing" with "harassment".
>I for one will be buying Denis/Masha/whoever a well deserved cup of coffee.
Using one term when what is meant is actually the other serves nothing but to sow confusion.
And I don't just mean under the colloquial definition, I mean under the legal definition of harassment. In fact it's fairly common for unwanted "positive" attention to be harassment - e.g. unwanted sexual advances mostly fit that description.
I get that a call to action is a common feature of doxing and it wasn't present here, but it's not a particularly common feature of harassment outside of the context of doxing, and nothing in the definition of harassment requires it.
that current etymology is what we’re all talking about obv
That's just another way of saying "words don't have meanings". Yes, it evolves, but to preserve the original meanings, that evolution should be slowed down as much as possible to avoid “black is white” effects.
In that context I don't think the question ("actually, who is providing all this information to me and what interests drive them") is one that's misplaced. Maybe we shouldn't look into a gift horse's mouth but don't forget this could be a Trojan horse as well.
The article brought to light some ties to Russia but probably not ties to its government and its troll farms. Rather an independent and pretty rebellious citizen. That's good to hear. And that's valuable information. I trust the site more after reading the article, not less.
The article could have redacted the names they found but they were found with public sources and these sources validate the encountered information (otherwise the results could have been dismissed)
I'm not defending the archive.today webmaster but it's unfortunately understandable they are angry. Saying what the blogger did was merely point out public information is a gross oversimplification.
Again, did you read my comment? I know what it means now. My point is about highlighting the change in meaning, not about obstinately denying what the word means.
>Even institutions that care about secrecy like governments state [...]
A given organization can have whatever policy it wants with regards to which documents it wants to allow to be made public. It could make all documents printed on non-yellow paper classified. That has nothing to do with the ethics of doxing.
>The reasons for this are obvious, essentially aggregated information can lead one to draw conclusions that otherwise are not obvious.
A secret is not something that's not obvious, it's a datum that's strictly controlled by the people who know it. If I can find some information about your real identity just by searching for it online then it's not a secret; you don't control that piece of information. You've given up that control by divulging the information in a public space where information often remains indefinitely.
Oddly, I think archive.today has explicitly said that's not what they're there for, and the people shouldn't rely on their links as a long-term archive.
> Archive.today is a time capsule for web pages!
> It takes a 'snapshot' of a webpage that will always be online even if the original page disappears.
Given the unclear ownership situation, it makes sense not to rely on them for anything long term. They could disappear tomorrow.
This is absolutely the buried lede of this whole saga, and needs to be the focus of conversation in the coming age.
So it doesn't necessarily raise questions about whether the content has been changed or not. The difference is in whether that change is there to make the archive usable - and of course, for archive.today, that's not the case.
It still is. uBlock's default lists are killing the script now, but if it's allowed to load then it still tries to hammer the other blog.
I don't know, I feel like everyone loses here.
(For those who don't know, he's currently trying to destroy one of the largest WP hosting providers with a bunch of lawsuits)
"You found the smoking gun!"
I don't think the DDOSing is a very good method for fighting back but I can't blame anyone for trying to survive. They are definitely the victim here.
If that blog really doxxed them out of idle curiosity they are an absolute piece of shit. Though I think this is more of a targeted campaign.
In this case, I didn't know that the archive.today people were doxxed until they started the ddos campaign and caught attention. I doubt anyone in this thread knew or cared about the blogger until he was attacked. And now this entire thing is a matter of permanent record on Wikipedia and in the news. archive.today's attempt at silencing the blogger is only bringing them more trouble, not less.
Barbara_Streisand_Mansion.jpg
Probably nothing, and the DDoS hype was intentional: to distract attention and highlight J.P.'s doxx among the others, making the rest insignificant.
J.P. might be the only one of the doxxers who could promote their doxx in media, and this made his doxx special, not the content?
Anyway, it made the haystack bigger while keeping the needle the same.
One of the really strange things about all of this is that there is a public forum post in which a guy claims to be the site owner. So this whole debacle is this weird mix of people who are angry and saying "clearly the owner doesn't want to be associated with the site" on the one hand, but then on the other hand there's literally a guy who says he's the one that owns the site, so it doesn't seem like that guy is very worried about being associated with it?
It also seems weird to me that it's viewed as inappropriate to report on the results of Googling the guy who said he owns the site, but maybe I'm just out of touch on that topic.
Which forum post? The post mentioned by the blogger, the post on an F-Secure forum (a company with cybersecurity products) was a request for support by the owner of archive.today regarding a block of their site. It's arguably not intended as a public statement by the owner of the archive, and they were simply careless with their username.
You don't know their motives for running their site, but you do get a clear message about their character by observing their actions, and you'd do well to listen to that message.
They might be the worst person ever but that doesn't matter. People can be good and bad, sometimes the victim sometimes the perpetrator.
Is it morally wrong to doxx someone and cause them to go to jail because they are running an archive website? Yes. It is. It doesn't matter who the person is. It does not matter what their motivations are.
> I don't think the DDOSing is a very good method for fighting back
I am really shocked by the conditional empathy people here are showing. The doxxing isn't less bad just because the reaction to it is bad.
It's like justifying bullying because the person "deserves" it.
Now what you do in reaction might be legally and morally wrong and maybe you need to be punished for that. But that doesn't negate the injustice you suffered. Two wrongs make... two wrongs. One does not negate the other.
[1] https://archive.today/20240714173022/https://x.com/archiveis...
[2] https://x.com/advancedhosters
[3] https://x.com/advancedhosters/status/1731129170091004412
[4] https://lj.rossia.org/users/mopaiv/257.html
[5] https://x.com/advancedhosters/status/1501971277099286539
If archive.whatever wasn't so useful to the general public, it'd be hard to distinguish from a criminal operation given the way it operates, unlike say the Internet Archive who goes through all of the proper legal paperwork to be a real nonprofit.
Every Reddit archived page used to have a Reddit username in the top right, but then it disappeared. "Fair enough," I thought. "They want to hide their Reddit username now."
The problem is, they did it retroactively too, removing the username from past captures.
You can see on old Reddit captures where the normal archived page has no username, but when you switch the tab to the Screenshot of the archive it is still there. The screenshot is the original capture and the username has now been removed for the normal webpage version.
When I noticed it, it seemed like such a minor change, but with these latest revelations, it doesn't seem so minor anymore.
That doesn't seem nefarious, though. It makes sense they wouldn't want to reveal whatever accounts they use to bypass blocks, and the logged-in account isn't really meaningful content to an archive consumer.
Now, if they were changing the content of a reddit post or comment, that would be an entirely different matter.
No, certain edits are understandable and required. Even the archive.org edits its pages (e.g. sticks banners on them and does a bunch of stuff to make them work like you'd expect).
Even paper archives edit documents (e.g. writing sequence numbers on them, so the ordering doesn't get lost).
Disclosing exactly what account was used to download a particular page is arguably irrelevant information, and may even compromise the work of archiving pages (e.g. if it just opens the account to getting blocked).
The issue here is to edit archived pages retrospectively.
mroe https://en.wikipedia.org/wiki/Perma.cc
The major reason archive.today was being used is that it also bypassed paywalls, and I don't think perma.cc does that normally.
With all of this context shared, the Internet Archive is likely meeting this need without issue, to the best of my knowledge.
[1] https://meta.wikimedia.org/wiki/Wikimedia_Endowment
[2] https://perma.cc/about ("Perma.cc was built by Harvard’s Library Innovation Lab and is backed by the power of libraries. We’re both in the forever business: libraries already look after physical and digital materials — now we can do the same for links.")
[3] https://community.crossref.org/t/how-to-get-doi-for-our-jour...
[4] https://www.crossref.org/fees/#annual-membership-fees
[5] https://www.crossref.org/fees/#content-registration-fees
(no affiliation with any entity in scope for this thread)
If pricing is so much that you have to have a call with the marketing team to get a quote, I think it would be a poor use of WMF funds.
Especially because volume of links and number of users that wikimedia would entail is probably double their entire existing userbase at least.
Ultimately we are mostly talking about a largely static web host. With legal issues being perhaps the biggest concern. It would probably make more sense for WMF to create their own than to become a perma.cc subscriber.
However for the most part, partnering with archive.org seems to be going well and already has some software integration with wikipedia.
https://wikimediaendowment.org/annualreports/2023-2024-annua...
https://www.in.gov/nircc/planning/highway/traffic-data/inter...
and Reddit seemingly blocks their agent. It is open source though.
https://meta.wikimedia.org/wiki/InternetArchiveBot
https://github.com/internetarchive/internetarchivebot
Shortcut is to consume the Wikimedia changelog firehose and make these http requests yourself, performing a CDX lookup request to see if a recent snapshot was already taken before issuing a capture request (to be polite to the capture worker queue).
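The "check before capture" flow described above can be sketched against the Wayback Machine's public CDX lookup and Save Page Now endpoints (real endpoints, but rate limits and the rest of the plumbing are out of scope; URL construction only, actual fetches are left to the caller):

```javascript
// Sketch of the polite capture flow: look up the latest snapshot via the
// CDX API first, and only request a fresh capture if it's too old.
function cdxLookupUrl(target) {
  // Ask the CDX server for snapshot rows for `target` as JSON;
  // a negative limit requests rows from the end (most recent).
  const q = new URLSearchParams({ url: target, output: "json", limit: "-1" });
  return `https://web.archive.org/cdx/search/cdx?${q.toString()}`;
}

function savePageUrl(target) {
  // Fetching this URL asks the Wayback Machine to capture `target`.
  return `https://web.archive.org/save/${target}`;
}

// Usage: fetch cdxLookupUrl() first; only hit savePageUrl() if the newest
// snapshot is older than your freshness threshold, to be polite to the
// capture worker queue.
```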
You can see a text box for it on the right, if you go on the waybackmachine's homepage. I used it yesterday.
Anyone can request anything be removed and they may honor the request: https://help.archive.org/help/how-do-i-request-to-remove-som... they say nothing about only removing things illegal in the US or anything like that, meaning they can and will remove things based on personal judgements about whether it should be archived.
I hope so. Archiving is a legal landmine.
> Bypassing paywalls would be playing with fire though.
That's the only reason archive.today was used. For non-paywalled stuff you can use the wayback machine.
https://alternativeto.net/software/freezepage/
https://archive-is.tumblr.com/post/806832066465497088/ladies...
https://archive-is.tumblr.com/post/807584470961111040/it-see...
I think ArchiveBox[1] is the most popular. I will give it a shot, but it's a shame they don't support URL rewriting[2], which would be annoying for me. I read a lot of blog and news articles that are split across multiple pages, and it would be nice if that article's "next page" link was a link to the next archived page instead of the original URL.
1: https://archivebox.io/
2: https://github.com/ArchiveBox/ArchiveBox/discussions/1395
Open source. Self hosted or managed. Native iOS and Android apps.
Its Content Scripts feature allows custom JS scripts that transform saved content, which could be used to do URL rewriting.
* https://omnom.zone/
* https://github.com/asciimoo/omnom
https://en.wikipedia.org/wiki/Wikipedia:Archive.today_guidan...
They're basically recommending changing verifiable references that can easily be cross-checked and verified, to "printed on paper" sources that could likely never be verified by any other Wikipedian, and can easily be used to provide a falsification and bias that could go unnoticed for extended periods of time.
Honestly, that's all you need to know about Wikipedia.
The "altered" allegation is also disingenuous. The reason archive.org never works, is precisely because it doesn't alter the pages enough. There's no evidence that archive.today has altered any actual main content they've archived; altering the hidden fields, usernames and paywalls, as well as random presentation elements to make the page look properly, doesn't really count as "altered" in my book, yet that's precisely what the allegation amounts to.
The allegation here is that they altered page content not just to remove their own alias, but to insert the name of the blogger they were targeting. That moves it from a defensible technical change for accessibility to being part of their bizarre revenge campaign against someone who crossed them.
archive.today is very popular on HN; the opaque, shortened URLs are promoted on HN every day
I can't use archive.today. I tried but gave up. Too many hassles. I might be in the minority but I know I'm not the only one. As it happens, I have not found any site that I cannot access without it
The most important issue with archive.today though is the person running it, their past and present behaviour. It speaks for itself
Whomever it is, they have lot of info about HN users' reading habits given that archive.today URLs are so heavily promoted by HN submitters, commenters and moderators
"Geolocation" as a justication is ambiguous
Why a need for geolocation
Geolocation can be used for multiple purposes
"DNS performance" is only one purpose
Other purposes might offer the user no benefit, and might even be undesirable for users
As a result, some users don't send EDNS subnet. It's always been optional to send it
Even public resolvers, third party DNS services, like Cloudflare, recognise the tradeoffs for users and allow users to avoid sending it. Popular DNS software makes compiling support for EDNS subnet optional
Archive.today wants/needs EDNS subnet so badly that it tries to gather it using a tracking pixel, or it tries to block users who don't send it, e.g., Cloudflare users
Thus, before one even considers all the other behaviour of this website operator, some of which is mentioned in this thread, there is a huge red flag for anyone who pays attention to EDNS subnet
As with almost all websites, repeated DNS lookups are not an absolute requirement for successful HTTP requests
There are some IP addresses for archive.{today,is,md,ph,li,...} that have continued to work for years
https://gitflic.ru/project/magnolia1234/bypass-paywalls-fire...
Anyway, extensions are just signed zip files. You can extract them and view the source. BPC sources are not compressed or obfuscated. The extension is evaluated and signed by Mozilla (otherwise it wouldn't install in release-channel Firefox), if you put any stock in that.
Workarounds usually don't last forever. Websites change from time to time. This one will stop working at some point
There are some people who for various reasons cannot use archive.today
This unfamiliarity is why I try to use programs that more HN readers are familiar with, like curl or wget, in HN examples. But I find those programs awkward to use. The examples may contain mistakes. I don't use those programs in real life
For making HTTP requests I use own HTTP generators, TCP clients, and local forward proxies
Given the options (a) run a graphical web browser and enable Javascript to solve an archive.today CAPTCHA that contains some fetch() to DDoS a blogger or (b) add a single line to a configuration file and use whatever client I want, no Javascript required, I choose (b)
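Option (b) presumably means pinning a known-good address locally so no DNS lookup (and no EDNS client-subnet leak) happens for the name at all. A minimal sketch, assuming you've found a working address; the IP below is a documentation placeholder (198.51.100.0/24 is reserved for examples), not a real archive.today address:

```
# /etc/hosts -- pin a known-good address so the resolver is never consulted
# for this name (placeholder IP; substitute one you have verified yourself)
198.51.100.7  archive.today archive.ph
```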
Anyone interested in the reading habits of HN users can just take a look at news.ycombinator.com ;)
It's not promoted, it's just used as a paywall bypass so everyone can read the linked article.
"promoted" as used here means placing an archive.tld URL at the top of an HN thread so that many HN readers will follow it, or placing these URLs elsewhere in threads
What hassles have you experienced?
I use the Archive Page[0] extension which is really easy to use.
The only thing that annoys me about it is the repeated requests (starting about eight or nine months ago) to complete CAPTCHAs.
[0] https://addons.mozilla.org/en-US/firefox/addon/archive-page/
Why does this annoy you?
Prior to that I was rarely prompted with a CAPTCHA. Now it's every. single. time. I archive something or open an AT link.
Why doesn't that annoy you?
I don't use archive.today. Why would it annoy me
Could you not in theory record the whole TLS transaction? Can it not be replayed later and re-verified?
Up until an old certificate leaks or is broken and you can fake anything "from back when it was valid", I guess.
The only way I know to ensure an archive isn’t tampered is to re-archive it. If you sent a site to archive.today, archive.org, megalodon.jp, and ghostarchive.org, it’s unlikely that all will be tampered in the same way.
https://gwern.net/timestamping
The technology for doing this is called a Zero Knowledge Proof TLS Oracle:
https://eprint.iacr.org/2024/447.pdf
https://tlsnotary.org
The 10k-foot view is that you pick the random numbers involved in the TLS handshake in a deterministic way, much like how zk proofs use the Fiat-Shamir transform. In other words, instead of using true randomness, you use some hash of the transcript of the handshake so far (sort of). Since TLS doesn't do client authentication the DH exchange involves randomness from the client.
For all the blockchain haters out there: cryptocurrency is the reason this technology exists. Be thankful.
Ultimately, what we all use it for is pretty straightforward, and it seems like by now we should've arrived at having approximately one best implementation, which could be used both for personal archiving and for internet-facing instances (perhaps even distributed). But I don't know if we have.
On that occasion, the target of the attack was a site named northcountrygazette.org, whose owner seems to have never become aware of the attack. The HN commenter noted when they went to the site manually it was incredibly slow, which would suggest the DDoS attempt was effective.
I tried to see if there was anything North Country Gazette had published that the webmaster of archive.today might have taken issue with, and I couldn't find anything in particular. However, the "Gazette" had previously threatened readers with IP logging to prosecute paywall bypassers (https://news.slashdot.org/story/10/10/27/2134236/pay-or-else...), and also blocks archivers in its robots.txt file, indicating it is hostile towards archiving in general.
I can no longer access North Country Gazette, so perhaps it has since gone out of business. I found a few archived posts from its dead website complaining of high server fees. Like the target of this most recent DDoS, June Maxam, the lady behind North Country Gazette, also appears/appeared to be a sleuth.
This would have sounded Very Normal in the 2000s... I wonder if we can go back :)
If a site (or the WAF in front of it) knows what it's doing then you'll never be able to pass as Googlebot, period, because the canonical verification method is a DNS lookup dance which can only succeed if the request came from one of Googlebots dedicated IP addresses. Bingbot is the same.
That's maybe a bit insane to automate at the scale of archive.today, but I figure they do something along the lines of this. It's a perfect imitation of Googlebot because it is literally Googlebot.
Presumably they are just matching on *Google* and calling it a day.
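The verification "dance" described above, with the network lookups factored out so the decision logic is self-contained (a real implementation would do a PTR lookup on the client IP and a forward A/AAAA lookup on the returned name, e.g. via Node's dns module):

```javascript
// Sketch of Googlebot verification: the reverse (PTR) name must sit under
// googlebot.com or google.com, AND the forward lookup of that name must
// resolve back to the original IP -- which stops anyone from faking the
// PTR record alone.
function isVerifiedGooglebot(clientIp, ptrName, forwardIps) {
  const okDomain = /\.(googlebot|google)\.com$/.test(ptrName);
  return okDomain && forwardIps.includes(clientIp);
}

console.log(isVerifiedGooglebot("66.249.66.1",
  "crawl-66-249-66-1.googlebot.com", ["66.249.66.1"])); // true
console.log(isVerifiedGooglebot("203.0.113.5",
  "fake.googlebot.com.evil.example", ["203.0.113.5"])); // false
```

A site matching only on a `*Google*` user-agent string, as suggested above, skips both checks, which is exactly why the imitation works.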
Which specific site with a paywall?
The curious part is that they allow web scraping arbitrary pages on demand. So a publisher could put in a lot of arbitrary requests to archive their own pages and see them all coming from a single account or a small subset of accounts.
I hope they haven't been stealing cookies from actual users through a botnet or something.
It would be challenging to do with text, but is certainly doable with images - and articles contain those.
In the archive.today case, it looks pretty automated. Surely just adding an html comment would be sufficient.
At which point we still lack a satisfactory answer to the question: just how is archive.today reliably bypassing paywalls on short notice? If it's via paid accounts, you would expect them to burn accounts at an unsustainable rate.
For those that don't, I would guess archive.today is using malware to piggyback off of subscriptions.
Why? In the world of web scraping this is pretty common.
Maybe they use accounts for some special sites. But there is definitely some automated generic magic happening that manages to bypass the paywalls of news outlets. Probably something Googlebot-related, because those websites usually give Google their news pages without a paywall, probably for SEO reasons.
Surely it wouldn't be too hard to test. Just set up an unlisted dummy paywall site, archive it a few times, and see what the requests look like.
Not saying this is true, just saying it could be
They bypass the rendering issues by "altering" the webpages. It's not uncommon to archive a page, and see nothing because of the paywalls; but then later on, the same page is silently fixed. They have a Tumblr where you can ask them questions; at one point, it's been quite common for everyone to ask them to fix random specific pages, which they did promptly.
Honestly, you cannot archive a modern page, unless you alter it. Yet they're now being attacked under the pretence of "altering" webpages, but that's never been a secret, and it's technologically impossible to archive without altering.
Anything on twitter post-login-wall for one. A million only-semi-paywalled news articles for others. But mainly an unfathomably long tail.
It was extremely distressing when the admin started(?) behaving badly for this reason. That others are starting to react this way to it is understandable. What a stupid tragedy.
Hardly possible for Wikimedia to provide a service like archive.today given the legal trouble of the latter.
Strangely naive.
From hero to a Kremlin troll in five seconds.
Not sure why it would only be on archive.is and not the others but ‘is’ loads for me.
BTW, they also alter paywalls and other elements, because otherwise, many websites won't show the main content these days.
It kind of seems like "altered" is the new "hacker" today?
Compare (the changed element is near the very bottom of the page; replace the "[dot]" since these URLs seem to trigger spam filters for some commenters):
archive [dot] is/gFD6Z
megalodon [dot] jp/2026-0219-1628-23/https://archive.is:443/gFD6Z
Oh good. That's definitely a reasonable thing to do or think.
The raw sociopathy of some people. Getting doxxed isn't good, but this response is unhinged.
We live at a moment where it's trivially easy to frame possession of an unsavory (or even illegal) number on another person's storage media, without that person even realizing (and possibly, with some WebRTC craftiness and social engineering, even get them to pass on the taboo payload to others).
In response: J.P.'s blog had already framed AT as a project grown from a carding forum, and pushed his speculations onto Ars Technica, whose parent company just destroyed 12ft and is on to a new victim. The story is full of untold conflicts of interest, covered over with the soap opera around the DDoS.
It’s still a threat isn’t it?
The article about the FBI subpoena that pulled J.P.'s speculations out of the closet was also in Ars Technica, by the same author, and that same article explicitly mentioned how happy they are that 12ft is down:
> US publishers have been fighting web services designed to bypass paywalls. In July, the News/Media Alliance said it secured the takedown of paywall-bypass website 12ft.io. "Following the News/Media Alliance's efforts, the webhost promptly locked 12ft.io on Monday, July 14th," the group said. (Ars Technica owner Condé Nast is a member of the alliance.)
Unfortunately this happens more often than one would expect.
I found this out when I preserved my very first homepage I made as a child on a free hosting service. I archived it on archive.org, and thought it would stay there forever. Then, in 2017 the free host changed the robots.txt, closed all services, and my treasured memory was forever gone from the internet. ;(
That effort appears to have gone nowhere, so now suddenly archive.today commits reputational suicide? I don't suppose someone could look deeper into this please?
> Regarding the FBI’s request, my understanding is that they were seeking some form of offline action from us — anything from a witness statement (“Yes, this page was saved at such-and-such a time, and no one has accessed or modified it since”) to operational work involving a specific group of users. These users are not necessarily associates of Epstein; among our users who are particularly wary of the FBI, there are also less frequently mentioned groups, such as environmental activists or right-to-repair advocates.
> Since no one was physically present in the United States at that time, however, the matter did not progress further.
> You already know who turned this request into a full-blown panic about “the FBI accusing the archive and preparing to confiscate everything.”
Not sure who he's talking about there.
Oh? Do tell!
>Oh? Do tell!
They do. In the very next paragraph in fact:
Hopeless. Caught tampering with the archive.
The whole situation is not great.
I did so. You're welcome.
As for the rest, take it up with Jimmy Wales, not me.
[0] https://www.youtube.com/watch?v=2ewY8CnFae0&t=56s
AT archives the page as seen, even including a screenshot.
IA archives the page as loaded, then when you view it, ham-fistedly injects its own header bar and executes the source JS. As you'd expect, the result is often wrecked, or effectively tampered.
I personally just don't use websites that paywall important information.
You're part of the community! Prove him right!
But seriously, removal is simple but replacement is not.
---------
Here’s the chronology that the HN thread id=47092006 is about, based on the linked Ars Technica article and related sources.
---
## 1. What “started the argument”?
The core dispute starts from a 2023 blog post by engineer Jani Patokallio on his site Gyrovague, investigating who is behind archive.today. That post, plus later FBI interest, led to:
1. A *GDPR/takedown campaign* against the blog post.
2. An *apparent DDoS* launched from archive.today's CAPTCHA page against his blog.
3. *Threats* from the archive.today operator ("Nora") to associate Patokallio's name with AI porn and other harassment.
4. *Discovery that archive.today had altered archived pages* to insert Patokallio's name.
5. A *Wikipedia RfC* and decision to deprecate and blacklist archive.today links.
The Hacker News thread you referenced is about the final step: Wikipedia’s decision to remove ~695,000 archive.today links.
---
## 2. Timeline of the situation
*(Mermaid timeline omitted: "archive.today – Wikipedia controversy chronology".)*

So, in terms of your question:
- *What started the argument* was Patokallio's 2023 investigation into archive.today's ownership, which later coverage of the FBI subpoena amplified.
- The *direct trigger for Wikipedia's action* was the combination of:
  - The *DDoS* launched from archive.today against his blog.
  - The *threats* (AI porn, harassment) against him.
  - Evidence that the *archive's content had been tampered with*, violating Wikipedia's trust in it as a citation source.
What possible value could a comment from someone who has no knowledge of the site or conflict add to this discussion?
sure, it's worth only as much as of a total stranger, but still.
I see WP is not proposing to run its own.
Like Wikipedia?
1) provides a snapshot of another site for archival purposes.
2) provides original content.
You're arguing that since encyclopedias change their content, the Library of Congress should be allowed to change the content of the materials in its stacks.
By modifying its archives, archive.today just flushed its credibility as an archival site. So what is it now?
As an end user of Wikipedia there are occasions where content has been scrubbed and/or edits hidden. Admins can see some of those, but end users cannot (with various justifications, some excellent/reasonable and some.. nebulous). That's all I'm saying, nothing about Congress or such other nonsense. It seems like an occasion of the pot calling the kettle names from this side of the fence.
An archival site (by default definition) promises you that it will not modify its content. And when it does, it's no longer an archival site.
Wikipedia has never been an archival site and it never will be. archive.today was an archival site, but now it never will be again.
Meanwhile, their AMA on Reddit: no promises, no commitment. Just like a Microsoft EULA :)
https://old.reddit.com/r/DataHoarder/comments/1i277vt/psa_ar...
I'm quoting all of that because it lacks an explicit promise of non-modification /i
Meanwhile seriously, if you were disappointed not to see e.g. "We explicitly don't promise not to modify", then perhaps you should consider why, regardless, this site was trusted enough to get a gazillion links in Wikipedia... and HN.
And I'm quoting all of that because it lacks an explicit (or implicit) promise of modification. :)
It was (emphasis on past-tense) so-trusted because it advertises itself as an archival site. (The linked disclaimer is all about it not being a "long-term" archival site. It says it archives pages for latecomers. There is an implication here that it archives them accurately. What use is a site for latecomers if they change the content to be something else?) If they'd said or indicated they would be changing the content to no longer reflect the original site, Wikipedia would not have linked to them because they wouldn't be a credible source.
In any case, now I can't use them to share or use links, since we can no longer trust those archives to be untampered. When I share a link to nyt content on archive.today, or copy and paste that content into an email, I'm putting my name on it, declaring "nyt printed this". If that's not true, it's my reputation on the line.
Just like it was archive.today's.
What if the nyt article itself is the problem? How does that square?
What's your better idea?
Archive.org snapshots may load JavaScript from the external sites the original page loaded it from. That script can change anything on the page. Most often the domain has expired and been hijacked by a parking company, so it just replaces the whole page with ads.
Example: https://web.archive.org/web/20140701040026/http://echo.msk.r...
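For illustration, here is a small sketch of how one might flag snapshots whose fidelity depends on third-party scripts, which is the failure mode described above. The helper names and sample snapshot are hypothetical; this is stdlib-only and not how the Wayback Machine itself works:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class ScriptScanner(HTMLParser):
    """Collect the hosts of all <script src=...> tags in a snapshot."""
    def __init__(self):
        super().__init__()
        self.domains = set()

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.domains.add(urlparse(src).netloc)

def external_script_hosts(html: str, archive_host: str) -> set:
    """Hosts other than the archive itself that the snapshot will
    execute code from at view time. Any of these can rewrite the page."""
    scanner = ScriptScanner()
    scanner.feed(html)
    return {d for d in scanner.domains if d and d != archive_host}

snapshot = ('<html><head>'
            '<script src="https://web.archive.org/static/toolbar.js"></script>'
            '<script src="https://expired-cdn.example/ads.js"></script>'
            '</head></html>')
print(external_script_hosts(snapshot, "web.archive.org"))
# → {'expired-cdn.example'}
```

If `expired-cdn.example` lapses and is bought by a domain parker, the "archived" page silently becomes whatever the parker serves.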
----
And another example: https://web.archive.org/web/20260219005158/https://time.is/
The page "got changed" every second. It is easy to make an archived page which would show different content depending on current time or whether you have Mac or Windows, or your locale, or browser fingerpring, or been tailored for you personally
Isn't there a substantial overlap with the copyright holders?
> Internet archives wayback machine works as alternative to it.
It is appallingly insecure. It lets archives be altered by the page's JS and deleted by the page's domain owner.
Nonstarter for anything that you actually want to be preserved, especially anything controversial.
Yes, they are essential, and that was the main reason for not blacklisting Archive.today. But Archive.today has shown they do not actually provide such a service:
> “If this is true it essentially forces our hand, archive.today would have to go,” another editor replied. “The argument for allowing it has been verifiability, but that of course rests upon the fact the archives are accurate, and the counter to people saying the website cannot be trusted for that has been that there is no record of archived websites themselves being tampered with. If that is no longer the case then the stated reason for the website being reliable for accurate snapshots of sources would no longer be valid.”
How can you trust that the page that Archive.today serves you is an actual archive at this point?
Oh dear.
> How can you trust that the page that Archive.today serves you is an actual archive at this point?
Because no one has shown evidence that it isn't.
Wikipedia does not have a project page with this exact name.
I assume that is weasel wording for a 404 Not Found.
To https://en.wikipedia.org/wiki/Wikipedia:Requests_for_comment...
I read that up to the first "proof", https://web.archive.org/web/20260218135501/https://www.googl...
It lands "503 Service Unavailable No server is available to handle this request."
But this one is not credible either so...
Ars Technica just did the same: removed Nora from older articles. How can you trust Ars Technica after that?
I don't know what you're talking about re: Ars removing her name from old articles.
2. We learned about Nora's involvement from Patokallio. We learned about Nora's non-involvement... also from Patokallio. They could have reached a settlement with AT that includes hiding Nora's name.
3. Regardless of who Nora is, it is interesting to see the extent of this censorship: so far only gyrovague.com and arstechnica.com, but not tomshardware.com and not tech.yahoo.com. This shows which sites are working closely with the AT defamation campaign, and which are simply copying the news feed.
If AT is appropriating some random person's name as an alias, it seems helpful to report on that publicly in order to expose the practice and help clear up the misinformation.
One with title 'Archive.today CAPTCHA page executes DDoS; Wikipedia considers banning site'
I'll try to add the link with comment edit:
This has Nora's name https://web.archive.org/web/20260210195502/https://arstechni...
The current version has not
I am lost here. It is definitively an organized defamation campaign.
“You are guilty simply because I am hungry”
And again, the accusation against Archive.today isn't just that they removed their "Nora" alias from a snapshot, but that they replaced it with the name of the blogger they were quarreling with. There's no defensible reason to do that outside of petty revenge (which tracks with the emails and public statements from the Archive.today maintainer).
Oh, yes, by removing the name in the context of "Streisand Effect".
> petty revenge
How does it "revenge"? Was it a porn page? Or something bad?
It is likely just a funny placeholder name of the same length that came to mind.
--
We could find good and bad motives for both AT and Ars.
The bias against AT was there a priori. The paywall story for Condé Nast, russophobia for the rest.
The porn smear threats came later, via email.