print(soup.title)    # <title>Example Page</title>
print(soup.h1.text)  # Heading
Searching by class:
paragraf = soup.find("p", class_="paragraf")
print(paragraf.text)  # This is a plain paragraph.
Getting links:
link = soup.find("a")
print(link['href'])  # https://example.com
1.4. Getting multiple elements
from bs4 import BeautifulSoup

html = """
<ul>
  <li>Element 1</li>
  <li>Element 2</li>
  <li>Element 3</li>
</ul>
"""

soup = BeautifulSoup(html, "lxml")
elements = soup.find_all("li")
for element in elements:
    print(element.text)
2. Working with Scrapy
2.1. Installing Scrapy
pip install scrapy
2.2. Creating a new project
scrapy startproject myproject
2.3. The basic structure of Scrapy
spiders: the folder containing the spider files that carry out the scraping.
items.py: defines the structure for storing the scraped data (see the sketch after this list).
pipelines.py: used for saving or cleaning the data.
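As a quick illustration of what items.py holds, here is a minimal sketch; the ProductItem class and its fields are hypothetical examples, not something the generated project contains:

import scrapy

class ProductItem(scrapy.Item):
    # Hypothetical fields for the data we want to scrape
    title = scrapy.Field()
    price = scrapy.Field()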
2.4. Creating a simple spider
Create the spider file:
cd myproject
scrapy genspider example example.com
Write the spider:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Print the page title
        title = response.xpath("//title/text()").get()
        print(f"Title: {title}")
2.5. Searching with XPath and CSS selectors
Via XPath:
response.xpath("//h1/text()").get()  # Get the h1 text
Via CSS selector:
response.css("h1::text").get()  # Get the h1 text
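Two related selector patterns are worth knowing alongside this (standard Scrapy selector usage, shown here as a general sketch rather than part of the example project): .getall() returns every match instead of only the first, and ::attr() / @ select attribute values:

# All h1 texts on the page, not just the first
response.css("h1::text").getall()

# The href attribute of every link, via CSS and via XPath
response.css("a::attr(href)").getall()
response.xpath("//a/@href").getall()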
2.6. Collecting and saving data
Writing to a JSON or CSV file:
scrapy crawl example -o results.json
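Note that -o only exports items the spider yields; the ExampleSpider above merely prints the title, so its parse() would need to yield something for results.json to contain data. A minimal sketch, reusing the spider from section 2.4:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Yield a dict so `scrapy crawl example -o results.json` has items to export
        yield {'title': response.xpath("//title/text()").get()}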
Processing the data through a pipeline, by editing the pipelines.py file:
class MyProjectPipeline:
    def process_item(self, item, spider):
        # Clean the data
        item['title'] = item['title'].strip()
        return item
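A pipeline only runs once it is enabled in settings.py; without this, process_item() above is never called. For the myproject layout used here:

# settings.py
ITEM_PIPELINES = {
    "myproject.pipelines.MyProjectPipeline": 300,  # lower number = runs earlier
}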
Sometimes data is loaded via AJAX. You can fetch this data by sending a request to the AJAX endpoint with Scrapy:
import scrapy

class AjaxSpider(scrapy.Spider):
    name = "ajax_example"
    start_urls = ["https://example.com/ajax"]

    def parse(self, response):
        # Parse the JSON response
        data = response.json()
        for item in data['results']:
            yield {
                'name': item['name'],
                'price': item['price'],
            }
2.2. Fetching dynamic content (working together with Selenium)
Scrapy can be used together with Selenium to scrape pages whose content is rendered dynamically:
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By

class SeleniumSpider(scrapy.Spider):
    name = "selenium_spider"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Chrome()

    def start_requests(self):
        urls = ["https://example.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Load the page in a real browser so its JavaScript runs
        self.driver.get(response.url)
        elements = self.driver.find_elements(By.TAG_NAME, "h1")
        for element in elements:
            print(element.text)
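One thing the sketch above leaves open is shutting the browser down. Scrapy calls a spider's closed() method when the crawl ends, which is a natural place to quit the driver; a minimal addition to the SeleniumSpider class above:

    def closed(self, reason):
        # Called by Scrapy when the spider finishes; quit Chrome so no process leaks
        self.driver.quit()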
2.3. Processing and filtering data
To filter and clean scraped data in Scrapy, use ItemLoader:
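A minimal ItemLoader sketch; ProductItem, its fields, and the loader_example spider name are illustrative assumptions rather than part of the project above:

import scrapy
from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader

class ProductItem(scrapy.Item):
    # Strip whitespace on input; keep only the first matched value on output
    title = scrapy.Field(
        input_processor=MapCompose(str.strip),
        output_processor=TakeFirst(),
    )

class LoaderSpider(scrapy.Spider):
    name = "loader_example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        loader = ItemLoader(item=ProductItem(), response=response)
        loader.add_xpath("title", "//h1/text()")
        yield loader.load_item()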