Illegal Web Scraping Makes Democratized Data Even More Crucial
It is amazing how things progress when one is following a story.
On a number of occasions we have discussed the concept of democratized data. In fact, I view this as Hive's number 1 use case at the moment.
People often say find a need and fill it.
The reality is that, in this era of AI and training large models, data is crucial. Naturally, people are quickly realizing the value of the data on their servers. For those who operate on the Internet, the ability to control that data is taking on added meaning.
With so much money on the line, these entities are taking strides to lock things down.
While technological steps are one thing, it is something else when the law gets involved.
Illegal Web Scraping Screams For Data Democratization
Here is a short video that discusses a case that went in front of a United States court that dealt with data scraping. The accused was found to be guilty. This was a corporate case but does bring up some interesting questions.
https://inleo.io/threads/view/taskmaster4450le/re-leothreads-2mo8o3c7p
One element here is that we are looking at data extraction tied to fraud. Of course, in many parts of the world, there are laws against fraud, regardless of how the information used was acquired.
Leaving that aspect aside, what about the act of designing an automated agent to pull data off websites? What happens if that becomes illegal?
Certainly, this is something that will have to filter through the courts, and every country will be different. However, we have already seen a move toward holding developers responsible for what their software does.
Most will remember the case of the Tornado Cash developer who got 64 months for money laundering. Basically, he designed a privacy application, a cryptocurrency mixer, that obscured where funds came from.
Thus, we cannot call it unreasonable to think that some governments will take such action against scraping. If that is the case, could developers be held responsible?
Democratized Data
The democratization of data solves this problem.
What this means is generating data that is placed in public databases, such as the Hive blockchain, where anyone is free to utilize it. Since nobody owns it, startups can gather the data to train their models.
This is not the case with entities such as Reddit and X, which are locking down their sites. The ability to scrape the Internet is diminishing.
We also have to factor in lawsuits.
OpenAI has been sued by a number of entities for training its models on data claimed to be under copyright. This is going to have to make it through the court system before we know where the rulings stand. Nevertheless, the company potentially faces billions in verdicts.
It is obvious startups cannot withstand this.
So what are they to do?
Actually, a better question is what are we going to do? Do we want a future where Big Tech is the only one with access to data? Is the idea of a handful of mega-corporations being the developers of these models appealing to people?
The answer to this question should dictate future behavior.
If one has no problem with this future, then feeding the massive beasts is no problem. Google, Amazon, X, and Meta will see their databases grow on a daily basis, allowing them to feed the increasing compute they acquire.
On the other hand, if one stands for decentralization and distribution, then these centralized entities are even less appealing.
Web 3.0 = Decentralization
It is no secret that a core tenet of Web 3.0 is the idea of decentralization.
Actually, we are looking at a technology that was brought about with the idea of democratized data from the start. The breakthrough of Bitcoin came from the ability to arrive at consensus without a centralized third party. This means that the ledger, i.e. database, was not under the control of a single entity.
Bitcoin's data, for the most part, is limited to financial transactions. Over the years, other databases have shown up that expand upon this concept. Hive is an example of a permissionless text database.
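To make "permissionless" a bit more concrete, here is a minimal Python sketch of reading recent posts from Hive over its public JSON-RPC interface, with no API key or account. The node URL (https://api.hive.blog) and the tag are assumptions, and the helper names build_request, extract_texts, and fetch_texts are made up for illustration. Treat it as a rough sketch rather than a reference client.

```python
import json
from typing import List
from urllib import request

# Assumption: a public Hive API node; other public nodes should behave the same way.
HIVE_NODE = "https://api.hive.blog"


def build_request(tag: str, limit: int = 5) -> dict:
    """Build a JSON-RPC payload asking for the newest posts under a tag."""
    return {
        "jsonrpc": "2.0",
        "method": "condenser_api.get_discussions_by_created",
        "params": [{"tag": tag, "limit": limit}],
        "id": 1,
    }


def extract_texts(response: dict) -> List[str]:
    """Pull the non-empty post bodies out of a JSON-RPC response."""
    return [post["body"] for post in response.get("result", []) if post.get("body")]


def fetch_texts(tag: str, limit: int = 5, node: str = HIVE_NODE) -> List[str]:
    """Send the request to a node and return the post bodies (needs network access)."""
    payload = json.dumps(build_request(tag, limit)).encode("utf-8")
    req = request.Request(
        node, data=payload, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req, timeout=10) as resp:
        return extract_texts(json.load(resp))
```

Calling fetch_texts("ai") should return the text of the newest posts under that tag. The point is that nothing in the request identifies or authorizes the caller, so no scraping agent is needed.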
We are now seeing permissionless databases grow in importance. Some like to cite how "data is the new oil". If that is the case, the question is who is getting more of the oil.
Is humanity well served by creating another cartel like we see with the physical commodity, only this time in the digital world?
Our success with cartels seems rather clear.
The foundation of the Internet is the database. Everything we do is tied to it. Without databases, we would have nothing on our screen. This applies whether we are discussing Web 2.0 or Web 3.0.
AI training is taking this to another level. We see the value grow, meaning the lead these large entities have keeps growing.
Permissionless databases hold the key to combating this. Even if the law starts to swing in the direction of holding developers responsible, democratized data makes it a meaningless point.
Posted Using InLeo Alpha
It is taking some time for the law to catch up with technology. Copyrighted content and information is subject to being exploited by artificial intelligence as it scrapes the web. It is a lot to keep up with.
That is true.
The MO of governments, knowing they can't stop the masses, is to make an example of people.
When someone is put on trial, and it feeds the media machine, it is a scare tactic that can control those who are thinking about doing the same thing.
Of course, with tech, we are dealing with something global so those in North Korea really do not care what the EU or US says.
Do we want a future where Big Tech is the only one with access to data? Trust me, you wouldn't love witnessing this outcome. I recently had a discussion with a friend about how data may become too expensive for the masses to afford should centralized hands prevail again.
Just look at what Google paid Reddit.
That sums it up.
The concerns you raise are of greater or lesser importance depending on the motivations of the developers. If the motivation is more political, ideological, or non-profit driven, like open source, then it is a hard one to legislate against. Bitcoin itself deflected many attacks because there was no one person or corporation in charge that could be fined or prosecuted; just an elusive Satoshi Nakamoto.
Remember the encryption schemes that the US government outlawed downloading outside of the USA in the 90s? That kinda fizzled, didn't it? If my server is using a VPN and the use of the data collected is neither centralized nor profit driven, then we are likely to see more fizzling, in my opinion.
The downloading issue, much of it tied to copyright, did fizzle for a couple of reasons.
To start, the sheer magnitude of that activity was overwhelming. This is not going to be the case with developers; the numbers are not the same.
A second issue is the fact that we were dealing with "crimes" that were mostly civil, i.e., people being sued or facing fines. When something is tied to a jail sentence, it can be a different story.
They could easily tie this to espionage or something like that.
As for the open, decentralized nature, I agree with you 1000%. That is why we have to get as much data on permissionless networks as possible.
I guess it depends on who you ask. Those who use Meta, Google [Android], and Twitter pretty much know and accept that their data is being used by these companies, and they are ok with it. They believe their data is relatively safe with these big companies. Barring a data hack or leak, the worst that can happen is their email being sold to advertisers. But trusting and using unknown applications is much scarier for them.
There is a big difference between building a scraping agent and a hack.