December 14, 2022 Computer Tips

Python script to filter the arXiv and get an email daily

Here we share a simple Python script that reads the arXiv daily and emails you the abstracts that match a set of keywords

A researcher’s everyday tasks usually include checking the literature for new results. For physicists, the arXiv is the most common preprint server, where most scientific papers are uploaded. However, the number of papers uploaded every day keeps growing, and picking out the interesting ones becomes a time-consuming task.

Here we provide a simple Python script that reads the arXiv and emails you the abstracts containing keywords of your choice.

 
#!/usr/local/bin/python

# Import the required libraries
import requests
from bs4 import BeautifulSoup
import smtplib
from email.mime.text import MIMEText

# Define keywords. You can add as many as you want
keywords = ['keyword1', 'keyword2', 'keyword3']

# Define the page to be read, download it and parse its HTML
link = "https://arxiv.org/list/cond-mat/new"
page = requests.get(link)
soup = BeautifulSoup( page.content, 'html.parser')

# Extract the information needed and put it in format to be checked
titles = soup.find_all('div', {'class' : 'list-title mathjax'})
abstracts = soup.find_all('p', {'class' : 'mathjax'})
authors = soup.find_all('div', {'class' : 'list-authors'})
refs = soup.find_all('a', {'title' : 'Abstract'})
lines_titles = [title.get_text() for title in titles]
lines_abstracts = [abstract.get_text() for abstract in abstracts]
lines_authors = [author.get_text() for author in authors]
lines_refs = [ref.get_text() for ref in refs]

# Write filtered papers on a file in your local computer.
# In this case /PathToWhateverYouWant/arxiv_summary.txt
filetosend = open('/PathToWhateverYouWant/arxiv_summary.txt','w')

for i in range(len(lines_abstracts)):
    if any(word in lines_abstracts[i] for word in keywords):
        filetosend.write(lines_titles[i].encode('ascii', 'ignore').decode('ascii'))
        filetosend.write(lines_authors[i].encode('ascii', 'ignore').decode('ascii'))
        filetosend.write('Abstract: ')
        filetosend.write(lines_abstracts[i].encode('ascii', 'ignore').decode('ascii'))
        filetosend.write('https://arxiv.org/')
        # The line break sits inside the parentheses of write(), so no backslash is needed
        filetosend.write(lines_refs[i].encode('ascii', 'ignore')
                         .decode('ascii').replace('arXiv:', 'abs/'))
        filetosend.write('\n**********\n')

filetosend.close()

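# Prepare to send email. Read the file as text and generate a plain-text message.
# Open the file by its path: filetosend is a closed file object, not a file name.
fp = open('/PathToWhateverYouWant/arxiv_summary.txt', 'r')
msg = MIMEText(fp.read())
fp.close()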

# Define Subject, From, To and send message
# Use the same email address to send and receive
msg['Subject'] = 'arXiv'
msg['From'] = 'email@youruni'
msg['To'] = 'email@youruni'

# Include your SMTP server address
s = smtplib.SMTP('smtpserver')
s.sendmail('email@youruni', 'email@youruni', msg.as_string())
s.quit()
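
The keyword test in the script above (word in lines_abstracts[i]) is a case-sensitive substring match, so the keyword 'graphene' would not match an abstract that begins with 'Graphene'. If that is not what you want, a case-insensitive variant could look like the minimal sketch below. The helper name matches_keywords and the sample data are hypothetical and are not part of the script above.

```python
def matches_keywords(text, keywords):
    """Return True if any keyword occurs in text, ignoring case."""
    lowered = text.lower()
    return any(word.lower() in lowered for word in keywords)


# Hypothetical sample data standing in for lines_abstracts and keywords
sample_abstracts = [
    'We study Superconductivity in twisted bilayers.',
    'A note on classical spin chains.',
]
sample_keywords = ['superconductivity', 'graphene']

# Indices of the abstracts that contain at least one keyword
selected = [i for i, abstract in enumerate(sample_abstracts)
            if matches_keywords(abstract, sample_keywords)]
print(selected)
```

In the script, the condition inside the loop could then read: if matches_keywords(lines_abstracts[i], keywords):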

This script can be scheduled to run automatically using crontab. For instance, to get your arXiv email every weekday at 12:00, run the command

 
crontab -e

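and add the line below to the file that opens (the script file must be executable, e.g. via chmod +x, because cron runs it directly through its shebang line)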

 
0 12 * * 1-5 /PathToTheAboveScript.py