Python Forum
Scrape for html based on url string and output into csv

Thread: Scrape for html based on url string and output into csv (/thread-31938.html)


Scrape for html based on url string and output into csv - dana - Jan-10-2021

Crawl an email address from a specified website.

I have a list of specific company registration codes in csv format, which is updated on a weekly basis.

I want to crawl the email address of each of those companies from a source website and put the email addresses into a new csv file.

The source addresses where the email needs to be crawled look like this: / (the "q" value is a variable, the company registration code, and each code has a different page where the email is).

Each address string that needs to be crawled is located in the csv file (in the second column, with the header "regcode").
(source table structure: compname | regcode | othercol1 | othercol2) (columns are separated by a semicolon ;)

The email that needs to be crawled is located between these html tags on each page:
<table class="table-info"> <tr>..</tr> <tr>..</tr> <tr>..</tr> <tr>..</tr> <tr>..</tr> <tr>..</tr> <tr>..</tr> <tr>..</tr> <tr>..</tr> <tr>..</tr> <tr> <td class="col-1"><div class="col-1-text">E-mail:</div></td> <td class="col-2"><div class="col-2-text"><a href=""></a></div></td> </tr> </table>
The crawled email should be put into a new csv file called extracted.csv.

The extracted.csv table structure should be as follows:
regcode | email

Explanation: the same company registration code that is used as the crawl string should be put into the new csv file alongside the crawled email address.

This process should be triggered every week, and the automation should only look for new entries that have been added to the csv file.
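
The "new entries only" requirement above could be handled by comparing the source csv against what is already in extracted.csv. This is only a minimal sketch under assumptions: the function names are hypothetical, extracted.csv is assumed to use the same semicolon delimiter and a "regcode;email" header, and in-memory strings stand in for the real files. Scheduling the weekly run (cron, Windows Task Scheduler) is left out.

```python
import csv
import io


def load_done(extracted_file):
    """Return the set of regcodes already present in extracted.csv."""
    reader = csv.DictReader(extracted_file, delimiter=';')
    return {row['regcode'] for row in reader if row.get('regcode')}


def new_regcodes(source_file, done):
    """Yield regcodes from the source csv that are not in `done`."""
    reader = csv.DictReader(source_file, delimiter=';')
    seen = set(done)
    for row in reader:
        code = (row.get('regcode') or '').strip()
        if code and code not in seen:
            seen.add(code)  # also skips duplicates within the same file
            yield code


# In-memory stand-ins for data.csv and extracted.csv (made-up sample rows)
source = io.StringIO(
    'compname;regcode;othercol1;othercol2\n'
    'Acme;10000001;x;y\n'
    'Beta;10000002;x;y\n'
)
extracted = io.StringIO('regcode;email\n10000001;info@acme.example\n')

todo = list(new_regcodes(source, load_done(extracted)))
print(todo)
```

With real files, the two StringIO objects would be replaced by open('data.csv', encoding='utf8', newline='') and open('extracted.csv', encoding='utf8', newline=''), and only the codes in todo would be requested that week.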

RE: Scrape for html based on url string and output into csv - snippsat - Jan-10-2021

So have you tried something?
The task is manageable with basic Python skills and a little look at the tools needed, like Requests, BS4, lxml, and the csv module.
Look at Web-Scraping part-1
Quick hint:
from bs4 import BeautifulSoup

html = '''\
<table class="table-info">
  <tr>
    <td class="col-1"><div class="col-1-text">E-mail:</div></td>
    <td class="col-2"><div class="col-2-text"><a href=""></a></div></td>
  </tr>
</table>'''

soup = BeautifulSoup(html, 'lxml')
>>> mail = soup.select_one('.col-2')
>>> mail
<td class="col-2"><div class="col-2-text"><a href=""></a></div></td>
>>> mail.select_one('a').get('href')

RE: Scrape for html based on url string and output into csv - dana - Jan-11-2021


Thanks for the quick hint :)

I think I need to use Scrapy, because the csv file contains over 100K rows of data / companies and that means over 100K web requests.

I am very new to this, so any help is highly appreciated!

Thanks

RE: Scrape for html based on url string and output into csv - snippsat - Jan-11-2021

(Jan-11-2021, 12:19 AM)dana Wrote: I think I need to use Scrapy, because the csv file contains over 100K rows of data / companies and that means over 100K web requests.
Scrapy could possibly be used for this.
I would start with a smaller test file and just use basic tools like those shown, e.g. BS with lxml (a very fast parser, C speed).
Then see how long it takes on the sample file.
You can also look at the post where I use concurrent.futures to speed it up.

Look at this post for splitting a csv with Pandas and then using it in Scrapy.
The chunked csv from Pandas can also be used with the method I have talked about.
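
As a rough illustration of the concurrent.futures idea, the sketch below runs one function per registration code in a thread pool. It is not the code from the linked post: fetch_all and fake_fetch are hypothetical names, and fake_fetch only stands in for the real "request the page and parse the email" step so the example runs without a network.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(regcodes, fetch_one, max_workers=8):
    """Call fetch_one(regcode) for every code using a thread pool.

    Returns a list of (regcode, result) tuples in input order.
    """
    regcodes = list(regcodes)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the inputs
        results = list(executor.map(fetch_one, regcodes))
    return list(zip(regcodes, results))


def fake_fetch(regcode):
    # Stand-in for the real request + parse step (assumption for the demo)
    return f'info@{regcode}.example'


pairs = fetch_all(['10000001', '10000002'], fake_fetch)
print(pairs)
```

Threads suit this job because the time is mostly spent waiting on the network. A small max_workers value and a test run on a sample file, as suggested above, keep the load on the target site reasonable.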

RE: Scrape for html based on url string and output into csv - dana - Jan-11-2021

So, I started reading the csv file to get the data like so:

import csv

with open('data.csv', encoding='utf8') as csv_file:
    csv_reader = csv.DictReader(csv_file, delimiter=';')
    count = 0

    for row in csv_reader:
        print(row['regcode'])  # the value that should go into the url parameter q
Now I am clueless about how to use each csv row as the request url parameter q.



RE: Scrape for html based on url string and output into csv - snippsat - Jan-12-2021

(Jan-11-2021, 11:49 PM)dana Wrote: Now, I am clueless how to loop the csv row as request url parameter q.
Could you post a sample of the .csv file?
See if this helps.
import csv

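with open('data.csv', encoding='utf8') as csvfile:
    reader = csv.reader(csvfile, delimiter=';')
    next(reader, None)  # skip the header row (compname;regcode;...)
    for row in reader:
        url = f'{row[1]}'  # regcode is the second column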
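
An alternative sketch of the same loop reads the "regcode" column by name and builds the q parameter with urlencode. The real site address is not shown in this thread, so BASE_URL below is a hypothetical placeholder, and build_urls is an illustrative name, not an existing function.

```python
import csv
import io
from urllib.parse import urlencode

# Hypothetical placeholder; the real source address is not given in the thread
BASE_URL = 'https://example.com/search'


def build_urls(csv_file, base_url=BASE_URL):
    """Yield (regcode, url) pairs, one per data row in the csv."""
    reader = csv.DictReader(csv_file, delimiter=';')
    for row in reader:
        regcode = (row.get('regcode') or '').strip()
        if not regcode:
            continue  # skip rows without a registration code
        query = urlencode({'q': regcode})
        yield regcode, f'{base_url}?{query}'


# In-memory stand-in for data.csv (made-up sample rows)
sample = io.StringIO(
    'compname;regcode;othercol1;othercol2\n'
    'Acme;10000001;x;y\n'
    'Beta;10000002;x;y\n'
)
urls = list(build_urls(sample))
for regcode, url in urls:
    print(regcode, url)
```

With Requests, the same result can be had by passing params={'q': regcode} to get() instead of building the string by hand.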

RE: Scrape for html based on url string and output into csv - dana - Jan-12-2021

I understand it so far now. Thanks.

How can I get the scraped email without the "mailto:"?

The e-mail address is located where I referenced in the first post.

<table class="table-info">
    <td class="col-1"><div class="col-1-text">E-mail:</div></td>
    <td class="col-2"><div class="col-2-text"><a href=""></a></div></td>
</table>
So, it should look along this path:
1. find the table with class "table-info"
2. find the div with class "col-2-text"
3. find the hyperlink with mailto:
4. extract only the clean email, without "mailto:" or the text inside the a tags

Any ideas?

RE: Scrape for html based on url string and output into csv - snippsat - Jan-12-2021

When you get a string back (from get('href')), there is nothing more a parser can do,
so use normal Python string methods or regex.
A simple split(':') is all that's needed.
>>> table_info = soup.select_one('.table-info')
>>> mail = table_info.select_one('.col-2 a')
>>> mail = mail.get('href')
>>> mail
>>> mail_clean = mail.split(':')[1]
>>> mail_clean
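
The split(':') approach works for a plain "mailto:" link. If some pages carry extras such as "?subject=..." or have an empty href, a slightly more defensive variant might look like the sketch below. clean_mailto is a hypothetical helper name, and the sample addresses are made up.

```python
from urllib.parse import unquote, urlsplit


def clean_mailto(href):
    """Return the bare address from a mailto: link, or None if there is none."""
    if not href:
        return None  # missing or empty href
    parts = urlsplit(href.strip())
    if parts.scheme != 'mailto':
        return None  # not a mailto link
    # parts.path holds the address; any ?subject=... lands in parts.query
    return unquote(parts.path) or None


print(clean_mailto('mailto:info@example.com'))
print(clean_mailto('mailto:info@example.com?subject=Hello'))
print(clean_mailto(''))
```

Returning None instead of raising makes it easy to write an empty email cell for companies whose page has no address.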

RE: Scrape for html based on url string and output into csv - dana - Jan-12-2021

I put it to the test on a live url, but I am missing something; it shows an error:

from bs4 import BeautifulSoup
from requests import get
page = ""
content = get(page).content
soup = BeautifulSoup(content, "lxml")

table_info = soup.select_one('.table-info')
mail = table_info.select_one('.col-2 a')
mail = mail.get('href')
mail_clean = mail.split(':')[1]
  File "C:\Users\pc\Desktop\python\", line 9, in <module>
    mail = table_info.select_one('.col-2 a')
AttributeError: 'NoneType' object has no attribute 'select_one'

RE: Scrape for html based on url string and output into csv - snippsat - Jan-12-2021

Look at the content you get back, e.g. print(soup).
<noscript>This '
 'site requires Javascript to work,.....
So I don't know if you are just testing this on a server (which makes this more difficult) that may not be needed for this task.
Usually, when a site uses a lot of JavaScript, you can use Selenium.

As this is just a simple test on a server that may not be needed for this task, you can bypass it by passing in the cookie.
from bs4 import BeautifulSoup
from requests import get

page = ""
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
cookies = {'__test': '1bb6e881021f013463740eeb74840b18'}
content = get(page, headers=headers, cookies=cookies).content
soup = BeautifulSoup(content, "lxml")

table_info = soup.select_one('.table-info')
mail = table_info.select_one('.col-2 a')
mail = mail.get('href')
mail_clean = mail.split(':')[1]