Python: Scraping ID's from JSON - python

This question is a bit of an ask, but it's been giving me a headache all day (as I am fairly new to programming).
Basically I have a huge list of IDs (named pk's) and I need to get all of them, as they are surrounded by other text.
How would I go about retrieving all of the ID's? By the way each ID looks like this:
"pk":12345678
"pk":123456789
The ID is either an 8- or 9-digit number.
Thanks a lot guys, any help would be appreciated!
Editor's note: Asker did post his full json data in a comment to this answer.

ids = [var["pk"]]
where var is the variable of your JSON
If you clarify your JSON a little more I might be able to make this more precise.

I'd just use JSONPath. A simple, but extremely general way to extract all the ids would be this:
>>> from jsonpath import jsonpath
>>> from json import loads
>>> instagram_pop = open("instagram_popular_list.json", "r").read()
>>> instagram_data = loads(instagram_pop)
>>> jsonpath(instagram_data, '$..id')[:3]
[u'234148392791340801_11305924', u'234098919041318605_2364270', u'234153616185741448_1907035']
Of course, since your data is flat, you can get away with a direct loop, such as:
[item['id'] for item in instagram_data['items']]
but I have a feeling you have more struct parsing to do, so I think jsonpath is a more flexible answer.
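If installing jsonpath is not an option, the same "find this key at any depth" idea can be sketched with the standard library alone. This is a minimal sketch, not the asker's actual data: the helper name find_values and the sample document are hypothetical, and only the key name "pk" comes from the question.

```python
import json

def find_values(node, key):
    """Recursively collect every value stored under `key` in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # keep descending, since the value may itself contain the key
            found.extend(find_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_values(item, key))
    return found

# Hypothetical sample; the real document would come from the asker's JSON.
sample = json.loads(
    '{"items": [{"user": {"pk": 12345678}}, {"user": {"pk": 123456789}}]}'
)
pks = find_values(sample, "pk")
print(pks)
```

The walk visits dictionaries and lists only, so strings and numbers that merely contain the text "pk" are ignored.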

Related

How can I split integers from string line?

How can I split out the confirmed, death, and recovered values? I want to add them to separate lists. I tried the isdigit method to find the values in each line. I also tried split('":'), thinking I could pick out the value after '":'. But neither approach works.
https://api.covid19api.com/total/dayone/country/us
I added every line from this page to textlist.
I have edited the question for other users. My problem is solved, thank you.
The list actually contains a string. You need to parse it and then iterate over it to access the required values from it.
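import json
main_list = ['.....']
# json.loads parses a JSON string; the json module has no parse function
data_points = json.loads(main_list[0])
confirmed = []
for single_data_point in data_points:
    # each data point is a dict, so use key access rather than attribute access
    confirmed.append(single_data_point["Confirmed"])
print(confirmed)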
A similar approach can be taken for any other values needed.
Edit:
On a better look at your source, it looks like the initial data is not in the right JSON format to begin with. Some issues I noticed:
Each object which has a Country value does not have its closing }. This is a bigger issue and needs to be resolved first.
The country object starting from the 2nd object has a ' before the object starting. This should not be the case as well.
I suggest you look at how you are initially parsing/creating the list.
Since you gave the valid source of your data it becomes pretty simple:
import urllib.request
import json
data = json.load(urllib.request.urlopen("https://api.covid19api.com/total/dayone/country/turkey"))
confirmed = []
deaths = []
recovered = []
for dataline in data:
    confirmed.append(dataline["Confirmed"])
    deaths.append(dataline["Deaths"])
    recovered.append(dataline["Recovered"])
print("Confirmed:", confirmed)
print("Deaths:", deaths)
print("Recovered:", recovered)
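The same split can be tried without a network connection. The sketch below assumes the response is a JSON array of objects carrying the "Confirmed", "Deaths", and "Recovered" keys used above; the numbers in the sample are made up for illustration.

```python
import json

# Hypothetical two-day sample in the shape the loop above expects.
raw = '''[
  {"Country": "Turkey", "Confirmed": 1, "Deaths": 0, "Recovered": 0},
  {"Country": "Turkey", "Confirmed": 5, "Deaths": 1, "Recovered": 2}
]'''
data = json.loads(raw)

# One list per field, in the same order as the records.
confirmed = [row["Confirmed"] for row in data]
deaths = [row["Deaths"] for row in data]
recovered = [row["Recovered"] for row in data]

print("Confirmed:", confirmed)
print("Deaths:", deaths)
print("Recovered:", recovered)
```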

Read data from CSV and write data to CSV - String to integer

I have a CSV file with 100,000 rows.
Each row in column A is a sentence comprised of both chars and integers.
I want column B to contain only integers.
I want the new columns to be in the same CSV file.
How can I accomplish this?
If I'm understanding your question correctly, I would use .isdigit() to parse the data in column A. I'm frankly not sure what the format of column A is, so I don't know exactly what you would do with this (if you gave more information I could give a more specific answer). Your solution will likely come in a similar form to this:
def find(lines):
    B = []
    for line in lines:
        numbers = [c for c in line if c.isdigit()]
        # guard against lines with no digits, where int('') would raise ValueError
        current = int(''.join(numbers)) if numbers else None
        # current is the concatenation of all
        # integers found in column A from left to right
        B.append(current)
    return B
Let me know if this makes sense or is even on the right track for your solution. Once again, without knowing what you're trying to do, and what A looks like, I'm not sure what your actual goals are.
EDIT
I'm not going to explain the csv stuff for you, mainly because there is a fantastic resource and library for it included in python here. If you have specific questions related to writing csv, definitely post them.
It sounds like you essentially want to pull int values out of column A then add them to a new column B. There are definitely many ways to solve this, but the general form of the problem is for each row you'll filter out the int, then you'll add the filtered int into the new column. I'll list a couple:
Regex: You could use a pattern such as [0-9]+ to pull the digit string out of A, then use int(whatever that output is) to cast to int, then store those values in B. I'm a sucker for a good regular expression and this one is fairly straightforward. Regexr is a great resource to learn about this and test your pattern.
Use an algorithm similar to above: The above algorithm worked before, but I've updated it slightly. Now that it's been updated it'll return an array of numbers correspondent to numbers in A from left to right. This is relatively sound, but it doesn't necessarily guarantee you have the right integer, given that if the title has an int in it, it'll mess some things up. It is likely one of the more clear ways of doing this, though.
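The regex option could look roughly like the sketch below. It assumes each row has the sentence in its first column and that only the first run of digits is wanted; the helper names first_int and add_int_column and the sample rows are hypothetical, and in-memory buffers stand in for the real file.

```python
import csv
import io
import re

def first_int(text):
    """Return the first run of digits in `text` as an int, or None if absent."""
    match = re.search(r"[0-9]+", text)
    return int(match.group()) if match else None

def add_int_column(rows):
    """Given rows like [column_a], return rows like [column_a, column_b]."""
    out = []
    for row in rows:
        value = first_int(row[0])
        # leave column B empty when column A has no digits
        out.append([row[0], "" if value is None else value])
    return out

# Hypothetical sample standing in for the 100,000-row file.
source = io.StringIO("order 42 shipped\nno digits here\nitem 7 of 9\n")
result = add_int_column(csv.reader(source))

target = io.StringIO()
csv.writer(target).writerows(result)
print(target.getvalue())
```

To end up with the new column in the same file, one common approach is to write the result to a temporary file and then replace the original with it, rather than reading and writing the same file at once.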

Python replace "" to \" in text

I want to manipulate my text in python. I will use this text to embed as JavaScript data. I need the text in my text file to display exactly as follows. It should have the format I mention below, not only when it prints.
I have text:
""text""
and I want:
\"text\"
import csv
with open('phase2.2.1.csv', 'w', newline='') as csvFile:
    writer = csv.writer(csvFile)
    for b in batches:
        writer.writerow([b.replace('\n', '').replace('""', '\\"')])
Unfortunately, the above yields
\""text\""
Any help will be much appreciated.
I would suggest:
.replace('""', '\\"')
And it really works, see:
In [8]: x = '""text""'
In [9]: print(x.replace('""', '\\"'))
\"text\"
If what you're trying to generate is JSON-encoded strings, the right way to do that is to use the json module:
text = json.dumps(text)
If you're trying to generate actual JavaScript source code, that's still almost the right answer. JSON is very close to being a subset of JavaScript—a lot closer than a quick&dirty fix for one error you happen to have noticed so far is going to be.
If you actually want to generate correct JS code for any possible string, you have to deal with the corner cases where JSON is not quite a subset of JS. But nobody ever does (it took years before anyone even noticed the difference in the specs).
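A small illustration of the json.dumps suggestion, using a sample string that contains double quotes. The sample value is only for illustration; the point is that the module adds the surrounding quotes and escapes the inner ones, and that the result round-trips.

```python
import json

text = '"text"'               # a Python string whose content is "text", quotes included
encoded = json.dumps(text)    # adds outer quotes and backslash-escapes the inner ones
print(encoded)

# Decoding the encoded form gives back the original string unchanged.
decoded = json.loads(encoded)
print(decoded == text)
```

Because the escaping is done by the encoder, other awkward characters such as newlines and backslashes are handled by the same call.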

Extract elements from a particular list on python

Here is the block to analyse:
('images\\Principales\\Screenshot_1.png', '{"categories":[{"name":"abstract_","score":0.00390625},{"name":"outdoor_","score":0.01171875},{"name":"outdoor_road","score":0.41796875}],"description":{"tags":["road","building","outdoor","scene","street","city","sitting","empty","light","view","driving","red","sign","intersection","green","large","riding","traffic","white","tall","blue","fire"],"captions":[{"text":"a view of a city street","confidence":0.83864323826716347}]},"requestId":"73fc14d5-653f-4a0a-a45a-e7a425580361","metadata":{"width":150,"height":153,"format":"Png"},"color":{"dominantColorForeground":"Grey","dominantColorBackground":"Grey","dominantColors":["Grey"],"accentColor":"274A68","isBWImg":false}}')
I need to extract all elements after "description", but I don't know how to do that. In fact, I need these elements:
"road", "building","outdoor","scene","street","city","sitting","empty","light","view","driving","red","sign","intersection","green","large","riding","traffic","white","tall","blue","fire"
I've been looking for several minutes already, but I do not understand how to do it! I'm a beginner with lists, and I still have a hard time understanding them.
The for loop returns only 'images\\Principales\\Screenshot_1.png', then the big block that is left.
Do you have a solution?
Thanks in advance!
EDIT:
Indeed, it is actually JSON! Thanks to the people who helped me :)
To extract the desired elements contained in the second block, I simply proceeded thus:
import json
ElementSeparate= '{"categories":[{"name":"abstract_","score":0.00390625},{"name":"outdoor_","score":0.01171875},{"name":"outdoor_road","score":0.41796875}],"description":{"tags":["road","building","outdoor","scene","street","city","sitting","empty","light","view","driving","red","sign","intersection","green","large","riding","traffic","white","tall","blue","fire"],"captions":[{"text":"a view of a city street","confidence":0.83864323826716347}]},"requestId":"73fc14d5-653f-4a0a-a45a-e7a425580361","metadata":{"width":150,"height":153,"format":"Png"},"color":{"dominantColorForeground":"Grey","dominantColorBackground":"Grey","dominantColors":["Grey"],"accentColor":"274A68","isBWImg":false}}'
ElementSeparate = json.loads(ElementSeparate)
for a in ElementSeparate['description']['tags']:
    print(a)
To me it looks like you're trying to parse JSON. You should use the JSON parser on the second element of the tuple. You'll get back either a list or a dictionary. Then you'll be able to extract the data held under the "description" key.
https://docs.python.org/3/library/json.html
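As a rough sketch of that advice: unpack the pair, parse the second element, then index into the result. The payload below is a shortened, hypothetical version of the one in the question, kept small so the structure is easy to see.

```python
import json

# Shortened stand-in for the (path, json_string) pair from the question.
block = (
    'images\\Principales\\Screenshot_1.png',
    '{"description": {"tags": ["road", "building", "outdoor"], "captions": []}}',
)

path, payload = block            # first element is the file path, second is JSON text
parsed = json.loads(payload)     # the top level of this payload is a dictionary
tags = parsed["description"]["tags"]

print(path)
print(tags)
```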

How does unicodecsv.DictReader represent a csv file

I'm currently going through the Udacity course on data analysis in python, and we've been using the unicodecsv library.
More specifically we've written the following code which reads a csv file and converts it into a list. Here is the code:
import unicodecsv

def read_csv(filename):
    with open(filename, 'rb') as f:
        reader = unicodecsv.DictReader(f)
        return list(reader)
In order to get my head around this, I'm trying to figure out how the data is represented in the dictionary and the list, and I'm very confused. Can someone please explain it to me.
For example, one thing I don't understand is why the following throws an error
enrollments['cancel_date']
While the following works fine:
for enrollment in enrollments:
    enrollment['cancel_date'] = parse_date(enrollment['cancel_date'])
Hopefully this question makes sense. I'm just having trouble visualizing how all of this is represented.
Any help would be appreciated.
Thanks.
I too landed up here for some troubles related to the course and found this unanswered. However I think you already managed it. Anyway answering here so that someone else might find this helpful.
Like we all know, dictionaries can be accessed like
dictionary_name['key']
and likewise
enrollments['cancel_date'] should also work.
But if you do something like
print enrollments
you will see the structure
[{u'status': u'canceled', u'is_udacity': u'True', ...}, {}, ... {}]
If you notice the brackets, it's a list of dictionaries. You may argue it is a list of lists. Try it.
print enrollments[0][0]
You'll get an error! KeyError.
So, it's a collection of dictionaries. How do you access them? Zoom in on any one dictionary (that is, one row of the csv) with enrollments[n].
Now you have a dictionary, and you can use its keys freely.
print enrollments[0]['cancel_date']
Now coming to your loop,
for enrollment in enrollments:
    enrollment['cancel_date'] = parse_date(enrollment['cancel_date'])
Here enrollment is the loop variable; it takes each element of the iterable enrollments in turn: enrollments[0], enrollments[1] ... enrollments[n-1].
So on every iteration enrollment holds one dictionary from enrollments, which is why enrollment['cancel_date'] works where enrollments['cancel_date'] does not.
Lastly, I want to add one more thing, which is why I came to this thread.
What is the meaning of "u" in u'..' ? Ex: u'cancel_date' = u'11-02-19'.
The u prefix means the value is a Unicode string. It is not part of the string; it is Python 2 notation. Unicode is a standard that assigns a code point to the characters and symbols of the world's writing systems.
This happens because the unicodecsv package does not try to infer and convert the type of each item in the csv file. It reads every field as a Unicode string to preserve all characters. That is why Caroline and you defined and used parse_date() and other functions to convert the Unicode strings to the desired datatype. This is all part of the Data Wrangling process.
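The list-of-dictionaries shape can be reproduced with the standard library's csv.DictReader in Python 3, which lays rows out the same way. This is a sketch under assumptions: the column names and rows are a made-up sample, and parse_date here is a hypothetical stand-in for the course's helper, not its actual definition.

```python
import csv
import io
from datetime import datetime

def parse_date(value):
    """Hypothetical stand-in: empty string -> None, otherwise a datetime."""
    return datetime.strptime(value, "%Y-%m-%d") if value else None

# Made-up sample; the first line supplies the dictionary keys.
raw = io.StringIO(
    "account_key,status,cancel_date\n"
    "448,canceled,2015-01-14\n"
    "700,current,\n"
)
enrollments = list(csv.DictReader(raw))   # one dictionary per csv row

print(enrollments[0]["cancel_date"])      # index the list first, then the key

for enrollment in enrollments:
    # every field arrives as a string, so convert the ones that need a real type
    enrollment["cancel_date"] = parse_date(enrollment["cancel_date"])

print(enrollments[0]["cancel_date"])
```

Indexing the list with a string key, as in enrollments["cancel_date"], raises a TypeError because list indices must be integers.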
