correct-apricot•3y ago
Twitter scraping by both keyword and profile
Making the API call with one of the filters and then post-processing with the second filter is too computationally intensive and slow for me. Is it possible to make a single API call that filters by both keyword and profile, or can I only do one or the other? Thanks!
11 Replies
correct-apricotOP•3y ago
I see this question is similar to the Facebook scraper post. Is it the same case here, that you cannot filter by both simultaneously in one API call?
Hello @Deleted User, Twitter has its own advanced search. Could you fill in the advanced search form ( https://twitter.com/search-advanced?lang=en ) and then copy and paste the resulting query into the Actor's input? If that does not help, what combination of keywords and profiles are you trying to scrape?
correct-apricotOP•3y ago
For some reason, when I use advanced search with both a user and a keyword on Apify, it only searches by the keyword. Is that supposed to happen?
@Deleted User which specific actor do you use? I just tried Twitter Scraper and 90% of the results are from the user I set on Input with the right keywords.
correct-apricotOP•3y ago
I use the same one. I'm asking whether it's possible to set both a keyword and a user and have the results include only tweets that match both.
Can you give us more specific examples and a step-by-step description of what you are trying to achieve?
correct-apricotOP•3y ago
Sure, so say I want to scrape all tweets by https://twitter.com/JoeBiden containing the word "president". I am currently using this code:
import json
import requests

# api_token and api_endpoint are assumed to be defined earlier in the script.
# Python booleans are True/False (not the JSON literals true/false).
actor_input = {
    "addTweetViewCount": True,
    "addUserInfo": False,
    "browserFallback": False,
    "debugLog": False,
    "extendOutputFunction": "async ({ data, item, page, request, customData, Apify }) => {\n return item;\n}",
    "extendScraperFunction": "async ({ page, request, addSearch, addProfile, addThread, addEvent, customData, Apify, signal, label }) => {\n \n}",
    "fromDate": "2021-11-02",
    "handle": [
        "https://twitter.com/JoeBiden"
    ],
    "handlePageTimeoutSecs": 5000,
    "maxIdleTimeoutSecs": 60,
    "maxRequestRetries": 6,
    "mode": "own",
    "profilesDesired": 10,
    "proxyConfig": {
        "useApifyProxy": True
    },
    "searchTerms": [
        "president"
    ],
    "tweetsDesired": 10000,
    "useAdvancedSearch": True,
    "useCheerio": True
}
headers = {
    'Content-Type': 'application/json; charset=utf-8',
    'Authorization': f'Bearer {api_token}'
}
# Serialize the input to JSON and start the Actor run.
data = json.dumps(actor_input)
response = requests.post(api_endpoint, headers=headers, data=data)
correct-apricotOP•3y ago
However, it looks like the actor is retrieving tweets from any user containing the search term 'president'. I am only interested in tweets from "https://twitter.com/JoeBiden" containing the term 'president'. Thanks!
@Deleted User yes, with this general input I am also receiving a lot of irrelevant results.
That's why I suggested generating the expression from the advanced search form (on the Twitter website) and using it for the searchTerms attribute. The input then looks like this:
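A minimal sketch of what such an input might look like, assuming the advanced search form yields an expression of the form `president (from:JoeBiden)`. The helper `build_search_term` is hypothetical, and leaving out `handle` so that only the search expression drives the run is an assumption rather than confirmed Actor behavior:

```python
import json
from typing import Optional


def build_search_term(keyword: str, handle: str, since: Optional[str] = None) -> str:
    """Combine a keyword and a profile into one Twitter search expression.

    Hypothetical helper: it mimics the query text the advanced search form
    produces, using the `from:` and `since:` search operators.
    """
    parts = [keyword, f"(from:{handle})"]
    if since:
        parts.append(f"since:{since}")
    return " ".join(parts)


# Only the fields relevant to the combined filter are shown; the remaining
# fields from the original input can stay as they were.
actor_input = {
    "searchTerms": [build_search_term("president", "JoeBiden", since="2021-11-02")],
    "tweetsDesired": 10000,
    "useAdvancedSearch": True,
    "proxyConfig": {"useApifyProxy": True},
}

print(json.dumps(actor_input, indent=2))
```

The same `json.dumps(actor_input)` payload can then be posted to the Actor exactly as in the code above; only the contents of `searchTerms` change.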
Now all the results belong to the specified Twitter account.
correct-apricotOP•3y ago
ahh okay, i was wrongly under the impression that the api would have done this for me, thank you so much!