Get the members from a Twitter list with Python

I am trying to create a DataFrame with some data about the European Parliament members. However, I am struggling with the data received when using the tweepy package.
api = tweepy.API(auth)

# Iterate through all members of the owner's list
for member in tweepy.Cursor(api.list_members, 'Europarl_EN', 'all-meps-on-twitter').items():
    m = member
    print(member)
The problem is that I do not know how to get a readable table after this. I also tried this, just to get the names:
lel = api.list_members('Europarl_EN', 'all-meps-on-twitter', -10)
for i in lel:
    print(i.name)
And the output is:
Jaromír Kohlíček
István Ujhelyi
Deli Andor
Maria Grapini
Winkler Gyula
LefterisChristoforou
Mircea Diaconu
Maria Heubuch
Daniel Buda
Marijana Petir
Maite Pagazaurtundúa
Janice Atkinson
Andrew Lewer
Martina Michels
Joachim Starbatty
Peter Jahr
Emil Radev
József Nagy
Quisthoudt-Rowohl
Dominique Bilde
All in all, my intention is to transform lel into a DataFrame, or in the worst case just to get the usernames.
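One way to do that is to collect the fields of interest into a list of dicts and hand it to pandas. This is just a sketch, assuming auth is set up as above and that each member returned by the cursor is a tweepy User object exposing name and screen_name:
import pandas as pd
import tweepy

api = tweepy.API(auth)

# Collect one dict per list member, then build a DataFrame from them
rows = []
for member in tweepy.Cursor(api.list_members, 'Europarl_EN', 'all-meps-on-twitter').items():
    rows.append({'name': member.name, 'username': member.screen_name})

df = pd.DataFrame(rows)
print(df.head())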

Related

How can I use the Google Cloud Natural Language Processing API on a BigQuery table, or any other topic modelling resource?

As mentioned in the title, I have a BigQuery table with 18 million rows; nearly half of them are useless, and I am supposed to assign a topic/niche to each row based on an important column (which has details about a product on a website). I tested the NLP API on a sample of 10,000 rows and it did wonders, but my standard approach iterates over newarr (the important-details column obtained by querying my BigQuery table) and sends only one cell at a time, awaiting the response from the API and appending it to the results array.
Ideally I want to run this operation on all 18 million rows in the minimum time. My per-minute quota has been increased to 3,000 API requests, so that is the maximum I can make, but I can't figure out how to send a batch of 3,000 rows one after another each minute.
for x in newarr:
    i += 1
    results.append(sample_classify_text(x))
sample_classify_text is a function taken straight from the documentation:
# This function will return the category for the text
from google.cloud import language_v1

def sample_classify_text(text_content):
    """
    Classifying Content in a String

    Args:
      text_content The text content to analyze. Must include at least 20 words.
    """
    client = language_v1.LanguageServiceClient()

    # text_content = 'That actor on TV makes movies in Hollywood and also stars in a variety of popular new TV shows.'

    # Available types: PLAIN_TEXT, HTML
    type_ = language_v1.Document.Type.PLAIN_TEXT

    # Optional. If not specified, the language is automatically detected.
    # For list of supported languages:
    # https://cloud.google.com/natural-language/docs/languages
    language = "en"
    document = {"content": text_content, "type_": type_, "language": language}

    response = client.classify_text(request={'document': document})
    # return response.categories

    # Loop through classified categories returned from the API
    for category in response.categories:
        # Get the name of the category representing the document.
        # See the predefined taxonomy of categories:
        # https://cloud.google.com/natural-language/docs/categories
        x = format(category.name)
        return x
        # Get the confidence. Number representing how certain the classifier
        # is that this category represents the provided text.
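One possible way to stay near the 3,000-requests-per-minute quota is to classify the rows in fixed-size batches, run each batch concurrently, and sleep out whatever is left of the minute before starting the next batch. This is only a sketch that reuses sample_classify_text and newarr from above; the batch size and worker count are illustrative, not tuned values:
import time
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 3000   # per-minute request quota from the question
MAX_WORKERS = 50    # illustrative level of concurrency

results = []
for start in range(0, len(newarr), BATCH_SIZE):
    batch = newarr[start:start + BATCH_SIZE]
    batch_started = time.monotonic()

    # Classify the whole batch concurrently; map() preserves input order.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        results.extend(pool.map(sample_classify_text, batch))

    # Wait out the rest of the minute so the next batch stays within quota.
    elapsed = time.monotonic() - batch_started
    if start + BATCH_SIZE < len(newarr) and elapsed < 60:
        time.sleep(60 - elapsed)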

How to preserve entry order when converting a PDF to text in Python?

I am trying to read the text from a PDF file that is part of a generated report. I am able to read the text in the file, but it comes out very garbled. What I want, eventually, is to get each line in the PDF file as an item in a list, but you can see that the field names and entries get all mixed up. An example of the PDF I am trying to import can be found here, and below is the code that I am trying to use to get the lines.
import PyPDF2
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO

filename = 'U:/PLAN/BCUBRICH/Python/Network Plan/Page 1 from AMP380_1741500.pdf'

def getPDFContent(filename):
    content = ""
    p = open(filename, "rb")
    pdf = PyPDF2.PdfFileReader(p)
    num_pages = pdf.getNumPages()
    for i in range(0, num_pages):
        content += pdf.getPage(i).extractText() + '\n'
    # content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

content = getPDFContent(filename)
Here is the output I get:
Out:'''UNITED STATES ENVIRONMENTAL PROTECTION AGENCYAIR QUALITY SYSTEMSITE DESCRIPTION REPORTApr. 25, 2019Site ID: 49-003-0003
Site Name: Brigham City
Local ID: BR
140 W.FISHBURN DRIVE, BRIGHAM CITY, UTStreet Address: City: Brigham City
Utah Zip Code: 84302
State: Box ElderCounty: Monitoring PointLocation Description: SuburbanLocation Setting: Interpolation-MapColl. Method:ResidentialLand Use: 20000819Date Established: Date Terminated: 20190130Last Updated: HQ Eval. Date:Regional Eval. Date: UtahAQCR : Ogden-Clearfield, UTCBSA: Salt Lake City-Provo-Orem, UTCSA: Met. Site ID:Direct Met Site: On-Site Met EquipType Met Site: Dist to Met. Site(m): Local Region: Urban Area: Not in an urban area
EPA Region: Denver
17411City Population: Dir. to CBD: Dist. to City(km): 3000Census Block: 3Block Group: 960701Census Tract: 1Congressional District: Class 1 Area: +41.492707Site Latitude: -112.018863Site Longitude: MountainTime Zone: UTM Zone: UTM Northing: UTM Easting: Accuracy: 60.73
Datum: WGS84
Scale: 24000
Point/Line/Area: Point 1,334.0Vertical Measure(m): 0Vert Accuracy: UnknownVert Datum : Vert Method: Unknown
Owning Agency: 1113 Utah Department Of Environmental Quality SITE COMMENTS SITE FOR OZONE, PM2.5, AND MET ACTIVE MONITOR TYPES Primary Monitor Periods # of Parameter Code Poc Begin Date End Date Monitor Type Monitors 42602 1 20180126 OTHER 2 44201 1 20010501 SLAMS 16 88101 1 20000819 20141231 88101 1 20160101 20161231 88101 1 20180101 88101 3 20170101 20171231 88101 4 20150101 20151231 TANGENT ROADS Road Traffic Traffic Compass Number Road Name Count Year Traffic Volume Source Road Type Sector 1 FISHBURN DRIVE 450 2000 LOCAL ST OR HY S Page 1 of 77
'''
For example, I would like the eighth item in the list to be
State: Utah Zip Code: 84302 County: Box Elder
but what I get is
Utah Zip Code: 84302 State: Box ElderCounty:
These kinds of mix-ups happen throughout the document.
This is merely an explanation of why that happens, not a solution. But it is too long for a comment, so it became an answer...
The reason for this weird order is that the text chunks in the document are drawn in that order.
If you dig into the PDF and look at the content stream, you find this segment responsible for the example line you picked:
/TD <</MCID 12 >>BDC
-47.25 -1.685 Td
(Utah )Tj
28.125 0 Td
[(Zip Code: )-190(84302)]TJ
-32.06 -0 Td
(State: )Tj
EMC
/TD <</MCID 13 >>BDC
56.81 0 Td
(Box Elder)Tj
-5.625 0 Td
(County: )Tj
EMC
You probably don't understand the instructions, but you can see that the strings (in round brackets (...)) come exactly in the order you observe in the output
Utah Zip Code: 84302 State: Box ElderCounty:
instead of the desired
State: Utah Zip Code: 84302 County: Box Elder
The Td instructions in-between make the text insertion point jump back and forth to achieve the different appearance in a viewer.
Apparently your text extraction method merely retrieves the strings from the content stream in the order they are drawn and ignores the actual locations at which they are drawn. For a proper text extraction, therefore, you have to change the method you use. As I don't really know PyPDF2 myself, I cannot say whether this library offers different text extraction methods to turn to or whether you have to resort to a different library.
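One library that rebuilds text from character coordinates rather than from content-stream order is pdfplumber; a minimal sketch, assuming it is installed and using the file path from the question, could look like this:
import pdfplumber

filename = 'U:/PLAN/BCUBRICH/Python/Network Plan/Page 1 from AMP380_1741500.pdf'

lines = []
with pdfplumber.open(filename) as pdf:
    for page in pdf.pages:
        # extract_text() assembles lines from character positions,
        # so words come out in visual reading order rather than draw order.
        text = page.extract_text()
        if text:
            lines.extend(text.splitlines())

for line in lines:
    print(line)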

UNWIND feature in py2neo not reading JSON fully

graph = Graph()
query2 = """
WITH {m} AS document
UNWIND document.lists AS s
UNWIND s.imageurl AS img
UNWIND s.youtubevideourl AS vid
RETURN s
"""
print (graph.cypher.execute(query2,m = m))
I am trying to use UNWIND to read through the full JSON file, but I am only getting through the first part, so I am unable to plot a graph of the full JSON. It was working fine earlier, but since I added YouTube video links, the title of the same page, weblinkurl, and webtitle, I have started facing this problem.
Here is an example of the JSON file I compiled with different links; I am able to read only the first part, but I want to read the full JSON. The sample below has only two parts of the JSON, and I want to read all of it and make the nodes. Please tell me how to do this with UNWIND or otherwise.
[{'Topic': 'Virat_Kohli',
'imagetitle': 'Virat_Kohli_June_2016_(cropped).jpg?width=300',
'imageurl': 'http://commons.wikimedia.org/wiki/Special:FilePath/Virat_Kohli_June_2016_(cropped).jpg?width=300',
'webtitle': 'Virat Kohli Official Website',
'weburl': 'http://www.viratkohli.club/',
'youtubevideotitle': "Virat Kohli Finally Accepts Love For GIRLFRIEND Anushka Sharma On Aamir Khan's Secret Superstar Show - YouTube",
'youtubevideourl': 'https://www.youtube.com/watch?v=zmPh2OQzZqc'},
{'Topic': 'Virat_Kohli',
'webtitle': 'Virat Kohli profile 2017, News and images only on official website of RCB',
'weburl': 'https://www.royalchallengers.com/virat-kohli',
'youtubevideotitle': 'Virat Kohli after losing ICC champions trophy Final - India vs Pakistian - Press Conference 2017 - YouTube',
'youtubevideourl': 'https://www.youtube.com/watch?v=Yf38l1Kx2-I'},
What I am trying to do is this:
graph = Graph()
query2 = """
WITH {j} AS document
UNWIND document.lists AS s
UNWIND s.Topic AS top
UNWIND s.weburl AS url
UNWIND s.imageurl AS img
UNWIND s.youtubevideourl as y
MERGE (c:topicnames {name:s.Topic})
MERGE (sc:images{img:img, type : s.imagetitle})
MERGE (v:weblink{url:url, type : s.webtitle})
MERGE (g:videos{vid:y, type : s.youtubevideotitle})
MERGE (c)-[:IMAGE_LINKS]->(sc)
MERGE (c)-[:WEB_LINKS]->(v)
MERGE (c)-[:VIDEO_LINKS]->(g)
RETURN (c)
"""
print (graph.cypher.execute(query2,j = j))
So I should end up with a single topic node, 5 video-link nodes, 5 weblink nodes and 1 image-link node in Neo4j, but it is only drawing nodes for one part of the JSON.
UNWIND is not reading or converting the other values that share the same keys (Topic, weburl, youtubevideourl), so I want to know why it isn't working and how to fix it.
The JSON file itself is a list of documents, so you don't need to pass a list separately, and you don't need to use UNWIND that many times. Try the program below (and make sure all the keys are present when parsing):
graph = Graph()
query2 = """
UNWIND {j} AS s
MERGE (c:topicnames {name:s.Topic})
MERGE (sc:images{img:s.imageurl, type : s.imagetitle})
MERGE (v:weblink{url:s.weburl, type : s.webtitle})
MERGE (g:videos{vid:s.youtubevideourl, type : s.youtubevideotitle})
MERGE (c)-[:IMAGE_LINKS]->(sc)
MERGE (c)-[:WEB_LINKS]->(v)
MERGE (c)-[:VIDEO_LINKS]->(g)
RETURN (c)
"""
print (graph.cypher.execute(query2,j = j))
Hope this helps!

How can you parse a document stored in the MARC21 format with Python

Yesterday Harvard released open access to all its library metadata (some 12 million records).
I was looking to parse the data and play with it, as the goal of the release was to "support innovation".
I downloaded the 12 GB tarball and unpacked it to find 13 .mrc files of about 800 MB each, in the MARC21 format.
When I looked at the head and tail of the first few files, the data looked very unstructured, even after reading a bit about MARC21.
Here's what the first 4 KB of the first file looks like:
$ head -c 4000 ab.bib.00.20120331.full.mrc
00857nam a2200253 a 4500001001200000005001700012008004100029010001700070020001200087035001600099040001800115043001200133050002500145245011100170260004900281300002100330504004100351610006400392650005300456650003500509700003800544988001300582906000800595002000001-420020606093309.7880822s1985 unr b 000 0 ruso a 86231326 c0.45rub0 aocm18463285 aDLCcDLCdHLS ae-ur-un0 aJN6639.A8bK665 198500aInformat︠s︡ii︠a︡ v rabote partiĭnykh komitetov /c[sostavitelʹ Stepan Ivanovich I︠A︡lovega]. aKiev :bIzd-vo polit. lit-ry Ukrainy,c1985. a206 p. ;c20 cm. aIncludes bibliographical references.20aKomunistychna partii︠a︡ UkraïnyxInformation services. 0aParty committeeszUkrainexInformation services. 0aInformation serviceszUkraine.1 aI︠A︡lovega, Stepan Ivanovich. a20020608 0DLC00418nam a22001335u 4500001001200000005001700012008004100029110003000070245004600100260006000146500005800206988001300264906000700277002000002-220020606093309.7900925|1944 mx |||||||spa|d1 aCampeche (Mexico : State)10aLey del notariado del estado de Campeche.0 a[Campeche]bDepartamento de prensa y publicidad,c1944. aAt head of title: Gobierno constitucional del estado. a20020608 0MH00647nam a2200229M 4500001001200000005001700012008004100029010001700070035001600087040001300103041001100116050003600127100004200163245004100205246005600246260001600302300001900318500001500337650004400352988001300396906000800409002000003-020051201172535.0890331s1902 xx d 000 0 ota a 73960310 0 aocm23499219 aDLCcEYM0 aotaara0 aPJ6636.T8bU5 1973 (Orien Arab)1 aUnsī, Muḥammad ʻAlī ibn Ḥasan.10aQāmūs al-lughah al-ʻUthmānīyah.3 aDarārī al-lāmiʻāt fī muntakhabāt al-lughāt. c[1902 1973] a564 p.c22 cm. aRomanized. 0aTurkish languagevDictionariesxArabic. a20020608 0DLC00878nam a2200253 a 4500001001200000005001700012008004100029010001700070035001600087040001800103043001200121050002300133245012800156246004600284260006300330300004800393500003300441610003200474650005000506700002400556710002300580988001300603906000800616002000004-920020606093309.7880404s1980 yu fa 000 0 scco a 82167322 0 aocm17880048 aDLCcDLCdHLS ae-yu---0 aL53.P783bT75 198000aTrideset pet godina Prosvetnog pregleda, 1945-1980 /c[glavni i odgovorni urednik i urednik publikacije Ružica Petrović].3 a35 godina Prosvetnog pregleda, 1945-1980. aBeograd :bNovinska organizacija Prosvetni pregled,c1980. a146 p., [21] p. of plates :bill. ;c29 cm. aIn Serbo-Croatian (Cyrillic)20aProsvetni pregledxHistory. 0aEducationzYugoslaviaxHistoryy20th century.1 aPetrović, Ružica.2 aProsvetni pregled. a20020608 0DLC00449nam a22001455u 4500001001200000005001700012008004100029245008200070260002800152300001100180440006600191700002600257988001300283906000700296002000005-720020606093309.7900925|1981 pl |||||||pol|d10aZ zagadnień dialektyki i świadomości społecznej /cpod red. K. Ślęczka.0 aKatowice :bUŚ,c1981. a135 p. 0aPrace naukowe Uniwersytetu Śląskiego w Katowicach ;vnr 4621 aŚlęczka, Kazimierz. a20020608 0MH00331nam a22001455u 4500001001200000005001700012008004100029100002200070245002200092250001200114260002800126300001100154988001300165906000700178002000006-520020606093309.7900925|1980 pl |||||||pol|d1 aMencwel, Andrzej.10aWidziane z dołu. aWyd. 1.0 aWarszawa :bPIW,c1980. a166 p. 
a20020608 0MH00746cam a2200241 a 4500001001200000005001700012008004100029010001700070020001500087035001600102040001800118050002400136082001600160100001600176245008000192260007100272300002500343504004100368650003400409650004000443988001300483906000800496002000007-300000000000000.0900123s1990 enk b 001 0 eng a 90031350 a03910368230 aocm21081069 aDLCcDLCdHBS00aHF5439.8b.O35 199
Has anyone ever had to work with MARC21 before? Does it typically look like this, or do I need to parse it differently?
pymarc is the best option to parse MARC21 records using Python (full disclosure: I'm one of its maintainers). If you're unfamiliar with working with MARC21, it's worth reading through some of the specification you linked to on the Library of Congress website. I'd also read through the Working with MARC page on the Code4lib wiki.
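A minimal sketch of reading one of the unpacked files record by record (the file name is taken from the question; 245 is the MARC title statement field):
from pymarc import MARCReader

with open('ab.bib.00.20120331.full.mrc', 'rb') as fh:
    reader = MARCReader(fh)
    for record in reader:
        # A record may have zero or more 245 fields; print each title found
        for field in record.get_fields('245'):
            print(field.format_field())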
You may want to check this out - pymarc
Disclaimer: I'm the author of marcx.
pymarc is a great library. A few operations that I missed in pymarc I implemented as a thin layer on top of it: marcx.
marcx.FatRecord is a small extension of pymarc.Record that adds a few shortcuts. The gist is the twin methods add and remove, a (subfield) value generator itervalues, and a generic test function.
The main benefit is an easier way to iterate over field (or subfield) values. Example:
>>> from marcx import FatRecord; from urllib import urlopen
>>> record = FatRecord(data=urlopen("http://goo.gl/lfJnw9").read())
>>> for val in record.itervalues('100.a', '700.a'):
... print(val)
Hunt, Andrew,
Thomas, David,

Good ways to sort a queryset? - Django

What I'm trying to do is this:
get the 30 Authors with the highest score (Author.objects.order_by('-score')[:30])
then order those authors by last_name
Any suggestions?
What about
import operator
auths = Author.objects.order_by('-score')[:30]
ordered = sorted(auths, key=operator.attrgetter('last_name'))
In Django 1.4 and newer you can order by providing multiple fields.
Reference: https://docs.djangoproject.com/en/dev/ref/models/querysets/#order-by
order_by(*fields)
By default, results returned by a QuerySet are ordered by the ordering tuple given by the ordering option in the model’s Meta. You can override this on a per-QuerySet basis by using the order_by method.
Example:
ordered_authors = Author.objects.order_by('-score', 'last_name')[:30]
The result above will be ordered by score descending, then by last_name ascending. The negative sign in front of "-score" indicates descending order. Ascending order is implied.
I just wanted to illustrate that the built-in (SQL-only) solutions are not always the best ones. At first I thought that because Django's QuerySet order_by method accepts multiple arguments, you could easily chain them:
ordered_authors = Author.objects.order_by('-score', 'last_name')[:30]
But it does not work as you would expect. Case in point: first, a list of presidents sorted by score (selecting the top 5 for easier reading):
>>> auths = Author.objects.order_by('-score')[:5]
>>> for x in auths: print x
...
James Monroe (487)
Ulysses Simpson (474)
Harry Truman (471)
Benjamin Harrison (467)
Gerald Rudolph (464)
Using Alex Martelli's solution which accurately provides the top 5 people sorted by last_name:
>>> for x in sorted(auths, key=operator.attrgetter('last_name')): print x
...
Benjamin Harrison (467)
James Monroe (487)
Gerald Rudolph (464)
Ulysses Simpson (474)
Harry Truman (471)
And now the combined order_by call:
>>> myauths = Author.objects.order_by('-score', 'last_name')[:5]
>>> for x in myauths: print x
...
James Monroe (487)
Ulysses Simpson (474)
Harry Truman (471)
Benjamin Harrison (467)
Gerald Rudolph (464)
As you can see, it is the same result as the first one: last_name only breaks ties between equal scores, so the combined order_by does not re-sort the top 30 by last name, which is what the question asks for.
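If you want to keep both steps in the database, one alternative is to fetch the top-30 primary keys first and then re-query on them; a sketch (two queries, and the explicit list() sidesteps the LIMIT-in-subquery restriction that some backends such as MySQL have):
# Top 30 ids by score, evaluated up front
top_ids = list(
    Author.objects.order_by('-score').values_list('pk', flat=True)[:30]
)
# Re-query those authors and sort them by last name
top_authors_by_name = Author.objects.filter(pk__in=top_ids).order_by('last_name')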
Here's a way that allows for ties at the cut-off score.
author_count = Author.objects.count()
cut_off_score = Author.objects.order_by('-score').values_list('score', flat=True)[min(30, author_count) - 1]
top_authors = Author.objects.filter(score__gte=cut_off_score).order_by('last_name')
You may get more than 30 authors in top_authors this way, and the min(30, author_count) is there in case you have fewer than 30 authors.
