I have extracted the following Twitter data using Tweepy. However, I am not able to fetch data from the "includes" object. I am specifically trying to fetch the URL and description data. I can see in the json_response that both the URL and the description are present.
My data has the following structure:
{
"data": [
{
"attachments": {
"media_keys": [
"3_1376989039262195713"
]
},
"author_id": "964661980551266304",
"created_at": "2021-03-30T20:05:45.000Z",
"id": "1376989044039544836",
"text": "#RichardGrenell I also want to speak out against this FB group who blocked me (after asking me to invite all my friends) for making the point that this recall not be made a MAGA one. \n\nI didn\u2019t stump on the ground for Trump, I did it for my children."
},
{
"attachments": {
"media_keys": [
"3_1376986160963145736",
"3_1376986160988368898",
"3_1376986160963198980",
"3_1376986160954757129"
]
},
"author_id": "1000347213145563136",
"created_at": "2021-03-30T19:54:20.000Z",
"id": "1376986169704071171",
"text": "#Bobbrock8013 #irishson19161 #RandPaul It's ok to question the election of Trump, but if you question Biden's win you are a \"domestic terrorist.\" Does the Biden Admin welcome a discussion of opposing views on policies regarding lockdowns, masks and vaccines? Why is Big Tech censoring conservatives? Fascists censor."
},
{
"attachments": {
"media_keys": [
"3_1376961169450221571"
]
},
"author_id": "328673472",
"created_at": "2021-03-30T18:15:00.000Z",
"id": "1376961171841036291",
"text": "#ByronYork Newsworthy, but Democrats via their minions will likely censor Trump's statement from Twitter, Facebook, CNN, MSNBC, Washington Post, NY Times etc You know our free speech rules now are based on the Democrats' version of what they will ALLOW us Deplorables to say let alone think."
},
{
"author_id": "18774517",
"created_at": "2021-03-30T10:31:58.000Z",
"id": "1376844643837566986",
"text": "RT #BrexitBuster: #EditingMike #LauraHa15799415 I\u2019m old enough to remember when Piers Morgan was Donald J Trump\u2019s number one fanboy. Are yo\u2026"
},
{
"author_id": "52405628",
"created_at": "2021-03-30T10:30:33.000Z",
"id": "1376844286646480899",
"text": "RT #BrexitBuster: #EditingMike #LauraHa15799415 I\u2019m old enough to remember when Piers Morgan was Donald J Trump\u2019s number one fanboy. Are yo\u2026"
},
{
"author_id": "848911132496723969",
"created_at": "2021-03-30T10:30:11.000Z",
"id": "1376844194921250818",
"text": "RT #BrexitBuster: #EditingMike #LauraHa15799415 I\u2019m old enough to remember when Piers Morgan was Donald J Trump\u2019s number one fanboy. Are yo\u2026"
},
{
"attachments": {
"media_keys": [
"3_1376836461601898499",
"3_1376836461614542853"
]
},
"author_id": "848911132496723969",
"created_at": "2021-03-30T09:59:37.000Z",
"id": "1376836504308305921",
"text": "#EditingMike #LauraHa15799415 I\u2019m old enough to remember when Piers Morgan was Donald J Trump\u2019s number one fanboy. Are you?\n\nThen he praised Joe Biden\u2019s speech... until he was offered the chance to pen a vicious hatchet piece for the Daily Mail! Pointing this out earned me a block.\n#shapeshiftingcreep"
},
{
"attachments": {
"media_keys": [
"3_1376821889004363777"
]
},
"author_id": "31308988",
"created_at": "2021-03-30T09:01:34.000Z",
"id": "1376821895811715073",
"text": "A lady sent this to my messenger right before she blocked me because she was mad I typed the names of Trump's sex assault victims"
},
{
"attachments": {
"media_keys": [
"3_1376704749379145731"
]
},
"author_id": "198202008",
"created_at": "2021-03-30T01:16:05.000Z",
"id": "1376704753145643014",
"text": "#moondancer34 #MrCrispyMAGA #lonelymilkshake #EFMoriarty #CBSNews Who is this person who blocked me? A MAGA lover? Guess that\u2019s why. But how ironic that he\u2019s a Trump supporter yet a WA fan when Woody is about as liberal as they get. In fact he donated to Hillary\u2019s campaign so she\u2019d win against Trump. Whatever! \ud83d\ude02\ud83e\udd37\u200d\u2640\ufe0f"
}
],
"includes": {
"media": [
{
"media_key": "3_1376989039262195713",
"type": "photo",
"url": "https://pbs.twimg.com/media/ExwMPFDUYAEHKn0.jpg"
},
{
"media_key": "3_1376986160963145736",
"type": "photo",
"url": "https://pbs.twimg.com/media/ExwJnijWUAgfPlb.jpg"
},
{
"media_key": "3_1376986160988368898",
"type": "photo",
"url": "https://pbs.twimg.com/media/ExwJnipXMAIHmJp.jpg"
},
{
"media_key": "3_1376986160963198980",
"type": "photo",
"url": "https://pbs.twimg.com/media/ExwJnijXIAQ4F_x.jpg"
},
{
"media_key": "3_1376986160954757129",
"type": "photo",
"url": "https://pbs.twimg.com/media/ExwJnihWUAkr8bi.jpg"
},
{
"media_key": "3_1376961169450221571",
"type": "photo",
"url": "https://pbs.twimg.com/media/Exvy416WQAMRlO0.jpg"
},
{
"media_key": "3_1376836461601898499",
"type": "photo",
"url": "https://pbs.twimg.com/media/ExuBd4-WQAMgTTR.jpg"
},
{
"media_key": "3_1376836461614542853",
"type": "photo",
"url": "https://pbs.twimg.com/media/ExuBd5BXMAU2-p_.jpg"
},
{
"media_key": "3_1376821889004363777",
"type": "photo",
"url": "https://pbs.twimg.com/media/Ext0Np0WYAEUBXy.jpg"
},
{
"media_key": "3_1376704749379145731",
"type": "photo",
"url": "https://pbs.twimg.com/media/ExsJrOtWUAMgVxk.jpg"
}
],
"users": [
{
"created_at": "2018-02-17T00:45:13.000Z",
"description": "Congressional Candidate for CA-28 Proud Angeleno/Catholic/Californio by marriage Localist\u2022Centrist\u2022Pragmatist\u2022Realist",
"id": "964661980551266304",
"name": "Beatrice Cardenas",
"username": "RealBetyCardens"
},
{
"created_at": "2018-05-26T12:05:35.000Z",
"description": "Following President Trump .... KAG 2020 \ud83c\uddfa\ud83c\uddf8",
"id": "1000347213145563136",
"name": "Joseph Fong",
"username": "JosephEugeneFo1"
},
{
"created_at": "2011-07-03T20:29:43.000Z",
"description": "Husband, Dad, Granddad, Christian,Army MP Sgt vet, I.U. grad, former banker & retired City Finance Director, Reagan guy. Cancer survivor. \u271d\ufe0f\ud83c\uddfa\ud83c\uddf8",
"id": "328673472",
"name": "Steve B",
"username": "Stevebfrs"
},
{
"created_at": "2009-01-08T19:06:29.000Z",
"description": "a younger Victor Meldrew but interesting - I hope - nice sometimes !",
"id": "18774517",
"name": "NORBET",
"username": "NORBET"
},
{
"created_at": "2009-06-30T14:17:41.000Z",
"description": "Tanglewood and Gretsch",
"id": "52405628",
"name": "FSociety Tom \ud83c\uddea\ud83c\uddfa #FBPE ANTIFA #RESIST #FBPPR #BLM",
"username": "thebdaman"
},
{
"created_at": "2017-04-03T14:52:40.000Z",
"description": "We are the Remain Resistance... popping Brexit bubbles one at a time. Mostly sarcasm, occasionally deadly serious. Love the UK & the EU. Detest racism & Nazis.",
"id": "848911132496723969",
"name": "Brexit Buster",
"username": "BrexitBuster"
},
{
"created_at": "2009-04-15T02:18:58.000Z",
"description": "No DMs !!! \ud83c\udf0a \ud83c\udf0a\nBLM ,Trans lives matter, LGBT \ud83c\udf08\nAlly of all marginalized",
"id": "31308988",
"name": "Stephy Pachuco (Her, She) \ud83c\udf0a\ud83c\udf0a",
"username": "Stephaniespc"
},
{
"created_at": "2010-10-03T16:56:45.000Z",
"description": "How'd you know I was looking at you if you weren't looking at me? \ud83d\udde3Mike Patton \u2615\ufe0fCoffee \ud83d\ude0eWeekends \ud83c\udf0aPolitics \ud83d\ude0dNYC \ud83e\udd96Museum Employee",
"id": "198202008",
"name": "Patti\ud83d\uddfd",
"username": "PattiFromNYC"
}
]
},
"meta": {
"newest_id": "1376989044039544836",
"next_token": "b26v89c19zqg8o3fosqtjm19orv2gber5hh7b0fu7uem5",
"oldest_id": "1376704753145643014",
"result_count": 9
}
}
I can successfully fetch 'id', 'text', 'created_at', and 'author_id' from the "data" object using the following code. However, the code does not retrieve 'url' and 'description' from the "includes" object, which leaves me with two empty columns.
# Create file
csvFile = open("data.csv", "a", newline="", encoding='utf-8')
csvWriter = csv.writer(csvFile)

# Create headers for the data
csvWriter.writerow(
    ['author id', 'created_at', 'id', 'tweet', 'bio', 'image_url'])
csvFile.close()

def append_to_csv(json_response, fileName):
    # A counter variable
    counter = 0

    # Open OR create the target CSV file
    csvFile = open(fileName, "a", newline="", encoding='utf-8')
    csvWriter = csv.writer(csvFile)

    # Loop through each tweet
    for tweet in json_response['data']:
        # We will create a variable for each since some of the keys might not exist for some tweets
        # So we will account for that

        # 1. Author ID
        author_id = tweet['author_id']

        # 2. Time created
        created_at = dateutil.parser.parse(tweet['created_at'])

        # 3. Tweet ID
        tweet_id = tweet['id']

        # 4. Tweet text
        text = tweet['text']

        # 5. description
        if('description' in tweet):
            bio = tweet['users']['description']
        else:
            bio = " "

        # 6. image url
        if ('url' in tweet):
            image_url = tweet['media']['url']
        else:
            image_url = " "

        # Assemble all data in a list
        res = [author_id, created_at, tweet_id, text, bio, image_url]

        # Append the result to the CSV file
        csvWriter.writerow(res)
        counter += 1

    # When done, close the CSV file
    csvFile.close()

    # Print the number of tweets for this iteration
    print("# of Tweets added from this response: ", counter)
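In the structure shown above, 'description' sits under includes.users (where 'id' matches a tweet's 'author_id') and 'url' sits under includes.media (where 'media_key' matches the entries in a tweet's attachments.media_keys), so neither key is ever found on the tweet itself. A minimal sketch of one way to join them, assuming that layout: the helper name build_rows is hypothetical, and joining several image URLs with a space is just one possible choice.

```python
def build_rows(json_response):
    """Return one row per tweet, joined with the author's bio and image URLs."""
    includes = json_response.get("includes", {})
    # Lookup tables: user id -> description, media_key -> url.
    bios = {u["id"]: u.get("description", "") for u in includes.get("users", [])}
    urls = {m["media_key"]: m.get("url", "") for m in includes.get("media", [])}

    rows = []
    for tweet in json_response.get("data", []):
        bio = bios.get(tweet["author_id"], "")
        # Not every tweet has attachments, so fall back to an empty list.
        media_keys = tweet.get("attachments", {}).get("media_keys", [])
        image_url = " ".join(urls[k] for k in media_keys if urls.get(k))
        rows.append([tweet["author_id"], tweet["created_at"], tweet["id"],
                     tweet["text"], bio, image_url])
    return rows
```

Each returned row has the same column order as the CSV header in the question, so it could be passed to csvWriter.writerow unchanged.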
Problem
I have a large JSON file (~700,000 lines, 1.2 GB) containing Twitter data that I need to preprocess for data and network analysis.
During data collection an error happened: ' was used as the string delimiter instead of ". As this does not conform to the JSON standard, the file cannot be processed by R or Python.
Information about the dataset:
Roughly every 500 lines there is a block that starts with meta info plus meta information for the users, etc. Then come the tweets in JSON (field order not stable), each starting with a space, one tweet per line.
This is what I tried so far:
A simple data.replace('\'', '\"') is not possible, as the "text" fields contain tweets which may contain ' or " themselves.
Using regex, I was able to catch some of the instances, but it does not catch everything:
re.compile(r'"[^"]*"(*SKIP)(*FAIL)|\'')
Using literal_eval(data) from the ast module also throws an error.
As the order of the fields and the length of each field are not stable, I am stuck on how to reformat the file so that it conforms to JSON.
Normal sample line of the data (for this line, options one and two would work, but note that some tweets are in non-English languages and contain " or ' themselves):
{'author_id': '1236888827605725186', 'entities': {'mentions': [{'start': 108, 'end': 124, 'username': 'realDonaldTrump'}], 'hashtags': [{'start': 49, 'end': 55, 'tag': 'QAnon'}, {'start': 56, 'end': 66, 'tag': 'ProudBoys'}]}, 'context_annotations': [{'domain': {'id': '10', 'name': 'Person', 'description': 'Named people in the world like Nelson Mandela'}, 'entity': {'id': '799022225751871488', 'name': 'Donald Trump', 'description': 'US President Donald Trump'}}, {'domain': {'id': '35', 'name': 'Politician', 'description': 'Politicians in the world, like Joe Biden'}, 'entity': {'id': '799022225751871488', 'name': 'Donald Trump', 'description': 'US President Donald Trump'}}], 'text': 'RT #NinjaHodon: Here’s an example of the average #QAnon #ProudBoys crackass trash that’s going to vote for #realDonaldTrump. \n\n https://t.…', 'referenced_tweets': [{'type': 'retweeted', 'id': '1315363137240010753'}], 'conversation_id': '1315441338427506689', 'id': '1315441338427506689', 'lang': 'en', 'public_metrics': {'retweet_count': 20, 'reply_count': 0, 'like_count': 0, 'quote_count': 0}, 'created_at': '20201011T23:57:09.000Z', 'source': 'Twitter for Android', 'possibly_sensitive': False}
Reformatted sample line which causes an issue:
{"users": [{"id": "437781219", "username": "HakesJon", "location": `"Wisconsin", "description": "#IndieFictionWriter. Husband. Father. Bearded.\n#BlackLivesMatter #DemilitarizeThePolice #DismantlePolicing", "name": "Jon Hakes", "created_at": "20111215T20:42:41.000Z"}, {"id": "1171947445841997824", "username": "FactNc", "location": "Under Carolina blue skies ", "description": "Defender of truth, justice and the American way. "I never give them hell. I just tell the truth and they think it\'s hell." Harry S. Truman", "name": "NCFactFinder", "created_at": "20190912T00:44:21.000Z"}, {"id": "315041625", "username": "o0rimbuk0o", "description": "Your desire to put pronouns here is not my issue. Get help.\n\n#resist #notmypresident\n#FBiden", "name": "Sick of it", "created_at": "20110611T06:16:11.000Z"}, {"id": "3141427487", "username": "theGeekSheek", "description": "I don't believe in your God. Don't tell me he hates me.", "name": "Chic Geek", "created_at": "20150406T18:34:45.000Z"}, {"id": "1084112678", "username": "KarinBorjeesson", "description": "Love to help people & animals in need. Love music. Fucking hate racists. 
#Anon #OpExposeCPS #BLM #FreePalestine #Yemen #OpSerenaShim #Animalrights #NoDAPL", "name": "AnonyMISSKarin", "created_at": "20130112T20:57:28.000Z"}, {"id": "1003712866011308033", "username": "persian_pesar", "description": "\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200f\u200fبه ستواری و سختی رشک پولاد/\nبه راه عشق سرها داده بر باد/\nقرین بیستون هم\u200cسنگ فرهاد/\nز کرمانشاهیان یاد اینچنین باد\n\u200e#Civil_Environment_Engineer", "name": "persianpesar🏳\u200d🌈", "created_at": "20180604T18:59:30.000Z"}, {"id": "814795859644809217", "username": "Aazadist", "description": "\u200f\u200e#Equality🌐\n\u200e#Humanity🌐\nخواهی نشوی همرنگ ، رسوای جماعت شو", "name": "Aazad 🏳️\u200d🌈 آزاد", "created_at": "20161230T11:30:45.000Z"}, {"id": "790375699638915072", "username": "Isaihstewart", "location": "Los Angeles, CA", "description": "Part time assistant manager at “Sheets and Things”", "name": "Dey got the henessey 🗣", "created_at": "20161024T02:13:46.000Z"}, {"id": "4846243708", "username": "williamvercetti", "location": "Virginia Beach, VA", "description": "vma. art. modelo papi. tpain to the dms.", "name": "William Vercetti", "created_at": "20160125T17:21:50.000Z"}, {"id": "1160723882", "username": "k_cawsey", "location": "Halifax, Nova Scotia", "description": "Chaucer, Malory, Arthur Tolkien. #Dal_English", "name": "Dr. Kathy Cawsey", "created_at": "20130208T17:15:30.000Z"}, {"id": "3789298943", "username": "solomonesther17", "location": "Lagos, Nigeria", "description": "FairBib Legal Practitioners", "name": "Esther Solomon", "created_at": "20150927T04:52:29.000Z"}, {"id": "14860380", "username": "Dejify", "location": "San Francisco", "description": "The Nigerian State is a festering boil that the world can't afford to ignore. 
Because, when it pops, its rancid ooze won't be pleasant nor easy to contain.", "name": "Buhari: Uber Ment (Dèjì Akọ́mọláfẹ́)", "created_at": "20080521T18:57:27.000Z"}, {"id": "1120883223070773248", "username": "Donna780780", "description": "", "name": "Donna Swidley", "created_at": "20190424T02:52:40.000Z"}, {"id": "1253742908487929858", "username": "Neros_sis", "location": "Florida", "description": "", "name": "#Nero's Fiddle GOP has a terrorism problem", "created_at": "20200424T17:50:00.000Z"}, {"id": "585090491", "username": "vickierae562", "location": "The LBC", "description": "That’s Right, I’m a Lefty 🤣 and I don’t feed trolls! #resist #DumpTrump #DitchMitch #LooseLindsey", "name": "Vickie Rae", "created_at": "20120519T21:00:28.000Z"}, {"id": "1262122532607574022", "username": "EmilySi49944255", "description": "", "name": "Skylar Aubrey", "created_at": "20200517T20:47:34.000Z"}, {"id": "1401663176", "username": "mdeHummelchen", "location": "Tief im Westen", "description": "Pflegewissenschaftlerin,Pflegeberaterin,Dozentin,Lächeln und winken...Pro Pflegekammer", "name": "Madame Hummelchen 💙", "created_at": "20130504T07:44:32.000Z"}, {"id": "2381808114", "username": "mommy97giraffe", "location": "Antifa HQs/Mom Division Office", "description": "Follower of Jesus, Mennonite mom&wife, lover of books, world, peo, poetry&art. 
6 autoimmunes&fibro🥄ie Proud Mama Bear of 1gayD & 1pan&autistic son, in 20s🌈💖", "name": "Mennonite Mom(she/her)", "created_at": "20140310T08:51:02.000Z"}, {"id": "2362182011", "username": "rd2glry", "location": "Washington, DC", "description": "", "name": "ateachr", "created_at": "20140224T04:07:21.000Z"}, {"id": "974917494870700032", "username": "GiraffeOld", "location": "Arizona, USA", "description": "", "name": "old man giraffe", "created_at": "20180317T07:56:58.000Z"}, {"id": "830939480", "username": "redz041", "description": "", "name": "Jan Mouzone", "created_at": "20120918T12:18:36.000Z"}, {"id": "3346032292", "username": "kumccaig44", "description": "", "name": "Katrine McCaig", "created_at": "20150625T21:25:21.000Z"}, {"id": "80630279", "username": "LuluTheCalm", "location": "Green Grass & Puddles, Canada", "description": "Mischief in My Eyes & Adventure in My Soul. \nLet's Have a Laugh &, you know, Make the World a Better Place.😎 \nAus/Brit/Cdn🇨🇦", "name": "Lulu 🇨🇦#BeKindBeCalmBeSafe💞 😷 🎏", "created_at": "20091007T17:26:56.000Z"}, {"id": "3252437864", "username": "engelhardterin", "location": "Houston, TX || Lubbock, TX", "description": "24 || Texas Tech || ♀️ || she/her", "name": "Erin Engelhardt", "created_at": "20150622T07:26:28.000Z"}, {"id": "93797267", "username": "mcbeaz", "location": "he/him", "description": "black lives matter.", "name": "mike", "created_at": "20091201T05:28:58.000Z"}, {"id": "2585773107", "username": "michiganington", "location": "Washington, D.C. ", "description": "", "name": "Allyoop", "created_at": "20140606T02:12:33.000Z"}, {"id": "27857135", "username": "JackRayher", "location": "Northport, NY", "description": "Senior Marketing Executive\nLifelong Democrat\n#BidenHarris", "name": "Jack Rayher", "created_at": "20090331T12:12:03.000Z"}, {"id": "1078457644736827392", "username": "RobertCooper58", "description": "Bilingual community advocate. Father of five wonderful kids. 
Lifelong progressive and proud member of #TheDemCoalition. Early supporter of President #JoeBiden.", "name": "Robert Cooper 🌊", "created_at": "20181228T01:08:34.000Z"}, {"id": "206860139", "username": "MariaArtze", "location": "Münster, Deutschland", "description": "Nas trincheiras da ESO\nEmigrante a medio retornar. Womansplainer.\n(Sie vostede)\n\nTrans rights are human rights.", "name": "A Malvada Profe mediovacinada", "created_at": "20101023T22:27:26.000Z"}, {"id": "2903906123", "username": "lm1067", "location": "London, England", "description": "B A FINE ARTIST GRADUATED", "name": "Luis Pais", "created_at": "20141203T15:53:10.000Z"}, {"id": "64119853", "username": "IAM_SHAKESPEARE", "location": "Tweeting from the Grave", "description": "This bot has tweeted the complete works of Shakespeare (in order) 5 times over the last 12years. On hiatus for a bit. Created by #strebel", "name": "Willy Shakes", "created_at": "20090809T05:41:08.000Z"}, {"id": "3176623941", "username": "acastellich", "location": "Chicago, Il.", "description": "Abogado,Restaurantero,Immigrant , UVM. AD1 IPADE MBA. Restaurant Hospitality Industry, Chicago IL.", "name": "Alejandro Castelli", "created_at": "20150417T13:23:17.000Z"}, {"id": "782765390925533185", "username": "Diane_L_Espo", "location": "Florida, USA", "description": "", "name": "DianeEspo 🇺🇲🗽", "created_at": "20161003T02:13:07.000Z"}, {"id": "67471020", "username": "thedcma", "location": "Fort Lauderdale, FL", "description": "🖤💎 Style is the only substance I abuse.💎🖤 I’m just a 🌈 Gay 🐔Hillbilly 🔮Warlock 🛵 Riding a 👨🏻\u200d🎤Vaporwave Fever Dream #blacklivesmatter", "name": "Grace Kelly on Steiroids", "created_at": "20090821T00:32:37.000Z"}, {"id": "78797635", "username": "graciosodiablo", "description": "Too much of a good thing can be bad. So too little of a bad thing must be good. 
160 characters or less of me should be perfect.", "name": "gracioso diabloint", "created_at": "20091001T03:59:16.000Z"}, {"id": "268314713", "username": "philppedurand", "location": "Auxerre", "description": "Je suis une personne gentille je milite pour la PMA. je suis militant communiste je suis aussi à l’association des Rosoirs je suis conseillé quartier", "name": "Philippe durand", "created_at": "20110318T14:37:36.000Z"}, {"id": "37996028", "username": "nicrawhide", "location": "Pinconning Michigan ", "description": "Just your average small town gay with big town sensibility!!", "name": "Nicholas Bean", "created_at": "20090505T19:20:37.000Z"}, {"id": "1236656342674407427", "username": "LadyJayPersists", "location": "Valhalla", "description": "USN Veteran | Shieldmaiden | Mom | Not here for a man, I have one | PTSD Warrior | My mind is a beautiful servant to a dangerous master", "name": "Jax", "created_at": "20200308T14:13:48.000Z"}, {"id": "171183306", "username": "dawndawnB", "location": "United States", "description": "Mrs. B, mother of 2 amazing kids, Substance Abuse Counselor, Volunteer, Music Lover. Born in DC but a VA Lo❤️er!", "name": "nwad", "created_at": "20100726T19:21:24.000Z"}, {"id": "817247846751555587", "username": "me2020_2021", "location": "Brisbane, Queensland", "description": "Proud Aussie, living a wonderful life with my wife, Australian Cricket 🏏👏,😷 🍺🥃 \U0001f9ae Alex", "name": "👀🏳️\u200d🌈 "A girl has no Name”', 'created_at': '20170106T05:54:05.000Z'}, {'id': '879459933988585472', 'username': 'Davecl3069', 'location': 'San Francisco Bay Area', 'description': 'proud of my views, life long learner,& hopefully, that guy!\n#LowerTheFlagForCovidVictims #VoteBlue #BLM #SupportThePlayers #LGBTQ #WeNeedToDoBetter #ResistStill', 'name': 'David', 'created_at': '20170626T22:02:42.000Z'}`
Code used
Regex:
def convert_to_json(file):
    with open(file, "r", encoding="utf-8") as f:
        x = f.read()
        x = x.replace("-", "")
        rx = re.compile(r'"[^"]*"(*SKIP)(*FAIL)|\'')
        decoded = rx.sub('"', x)
literal_eval:
def open_json():
    with open("data.json", "r", encoding="utf-8") as f:
        f.read()
        data = literal_eval(f)
        data = json.loads(str(data))
What I would like to achieve
Reformat the data to conform to JSON (this question) in order to be able to
Build a dataframe with the relevant tweettext, user information and metadata (secondary goal) to be used in further analyses.
Thanks in advance for any suggestions! :)
If the ' characters that cause the problem occur only in the tweets and descriptions,
you could try this:
pre_tweet = "'text': '"
post_tweet = "', 'referenced_tweets':"

with open(file, encoding="utf-8") as f:
    data = f.readlines()

output = []
errors = []
for line in data:
    if pre_tweet in line and post_tweet in line:
        # Split only once so the tweet text stays in one piece
        first_part, rest = line.split(pre_tweet, 1)
        tweet, last_part = rest.split(post_tweet, 1)
        # Use new names here: overwriting pre_tweet/post_tweet would break the next iteration
        head = first_part.replace('\'', '\"') + pre_tweet.replace('\'', '\"')
        tail = post_tweet.replace('\'', '\"') + last_part.replace('\'', '\"')
        output.append(head + tweet + tail)
    else:
        errors.append(line)
If errors is not empty, either the line contains no tweet (you can change the code a little to add such lines to your output),
or what comes after the tweet is not 'referenced_tweets'. In the second case, try to figure out which keys can follow the tweet and modify the code above to handle multiple post_tweet markers.
You can then do the same with the description by replacing pre_tweet and post_tweet with whatever usually comes before and after the description.
The number of possible keys after the tweet or description must be finite, so it may take some time to find all the possibilities, but in the end you should succeed.
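Handling several possible keys after the tweet could be sketched roughly as follows. The contents of POST_KEYS are an assumption for illustration (keys that appear in the question's sample line) and would need to be extended for the real file; the name parse_line is hypothetical. The sketch picks whichever candidate marker occurs first after the tweet text, and it does not convert Python literals such as False or None, so lines containing them come back as None and would still need separate treatment.

```python
import json

PRE = "'text': '"
# Hypothetical list of keys that may directly follow the tweet text.
POST_KEYS = ["referenced_tweets", "conversation_id", "id", "lang"]

def parse_line(line):
    """Return the line parsed as a dict, or None if it cannot be repaired."""
    if PRE not in line:
        return None
    head, rest = line.split(PRE, 1)
    # Find every candidate end marker and keep the one that occurs first.
    found = [(rest.find("', '" + key + "':"), key) for key in POST_KEYS]
    found = [(pos, key) for pos, key in found if pos != -1]
    if not found:
        return None
    pos, key = min(found)
    marker = "', '" + key + "':"
    tweet, tail = rest[:pos], rest[pos + len(marker):]
    # Inside the tweet: drop Python's \' escape and escape bare double quotes.
    tweet = tweet.replace("\\'", "'").replace('"', '\\"')
    fixed = (head.replace("'", '"') + '"text": "' + tweet
             + '", "' + key + '":' + tail.replace("'", '"'))
    try:
        return json.loads(fixed)
    except json.JSONDecodeError:
        return None
```

Lines for which parse_line returns None play the role of the errors list above and can be inspected to discover further keys.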
So I figured out a way to process the corrupt data.
The solution can be found here.
Using ast.literal_eval(input_string) lets me read the corrupt JSON lines in as dictionaries. The only thing to watch is that the input string contains no leading or trailing whitespace, commas, etc.
Example code for reading in the data with ast.literal_eval():
from ast import literal_eval

dictlist = []
with open("inputdata.json", "r", encoding="utf-8") as f:
    for line in f:
        # Use the line from the loop; calling f.readline() here would skip every other line
        x = line.strip()
        data = literal_eval(x)
        dictlist.append(data)
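Once a line has been read as a dictionary, it can be serialized again with json.dumps to obtain valid JSON. The following is a minimal sketch under the assumption that each line holds exactly one Python-literal dictionary, possibly followed by a trailing comma; the function name to_json_lines is hypothetical, and lines that cannot be evaluated are collected rather than raising.

```python
import json
from ast import literal_eval

def to_json_lines(lines):
    """Convert Python-literal dict lines to JSON strings; collect failures."""
    converted, failed = [], []
    for line in lines:
        # Remove surrounding whitespace and a trailing comma, if any.
        text = line.strip().rstrip(",")
        if not text:
            continue  # skip blank lines
        try:
            record = literal_eval(text)
        except (ValueError, SyntaxError):
            failed.append(line)
            continue
        # ensure_ascii=False keeps non-English tweet text readable.
        converted.append(json.dumps(record, ensure_ascii=False))
    return converted, failed
```

Because json.dumps also turns False, True, and None into false, true, and null, Python literals in the input do not need separate handling. Writing the converted strings to a file, one per line, gives a JSON Lines file that standard JSON readers can parse line by line.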
I'm new to python programming.
I have tried hard to avoid these nested for loops, but without success.
My input data looks like this:
[
{
"province_id": "1",
"name": "HCM",
"districts": [
{
"district_id": "1",
"name": "Thu Duc",
"wards": [
{
"ward_id": "1",
"name": "Linh Trung"
},
{
"ward_id": "2",
"name": "Linh Chieu"
}
]
},
{
"district_id": "2",
"name": "Quan 9",
"wards": [
{
"ward_id": "3",
"name": "Hiep Phu"
},
{
"ward_id": "4",
"name": "Long Binh"
}
]
}
]
},
{
"province_id": "2",
"name": "Binh Duong",
"districts": [
{
"district_id": "3",
"name": "Di An",
"wards": [
{
"ward_id": "5",
"name": "Dong Hoa"
},
{
"ward_id": "6",
"name": "Di An"
}
]
},
{
"district_id": "4",
"name": "Thu Dau Mot",
"wards": [
{
"ward_id": "7",
"name": "Hiep Thanh"
},
{
"ward_id": "8",
"name": "Hiep An"
}
]
}
]
}
]
And my code is:
for province in data:
    for district in province['districts']:
        for ward in district['wards']:
            # Execute my function
            print('{}, {}, {}'.format(ward['name'], district['name'], province['name']))
Output
Linh Trung, Thu Duc, HCM
Linh Chieu, Thu Duc, HCM
Hiep Phu, Quan 9, HCM
...
Even though my code is working it looks pretty ugly.
How can I avoid these nested for loops?
Your data structure is naturally nested, but one option you have for neatening your code is to write a generator function for iterating over it:
def all_wards(data):
    for province in data:
        for district in province['districts']:
            for ward in district['wards']:
                yield province, district, ward
This function has the same triply-nested loop in it as you currently have, but everywhere else in your code, you can now iterate over the data structure with a single non-nested loop:
for province, district, ward in all_wards(data):
    print('{}, {}, {}'.format(ward['name'], district['name'], province['name']))
If you prefer to avoid having too much indentation, here's an equivalent way to write the function, similar to #adarian's answer but without creating a temporary list:
def all_wards(data):
    return (
        # The tuple needs its own parentheses inside a generator expression
        (province, district, ward)
        for province in data
        for district in province['districts']
        for ward in district['wards']
    )
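A further benefit of wrapping the loops in a generator is that a search can stop early without break flags in three nested loops. The sketch below illustrates this with next(); the helper find_ward is hypothetical and not part of the question, and it assumes the key names shown in the question's data.

```python
def all_wards(data):
    """Yield (province, district, ward) triples lazily."""
    for province in data:
        for district in province["districts"]:
            for ward in district["wards"]:
                yield province, district, ward

def find_ward(data, ward_id):
    """Return 'ward, district, province' for the first matching ward_id, else None."""
    match = next(
        (triple for triple in all_wards(data) if triple[2]["ward_id"] == ward_id),
        None,
    )
    if match is None:
        return None
    province, district, ward = match
    return "{}, {}, {}".format(ward["name"], district["name"], province["name"])
```

Since the generator is lazy, iteration stops as soon as the first match is found.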
Here is a one-liner version
[
    print("{}, {}, {}".format(ward["name"], district["name"], province["name"]))
    for province in data
    for district in province["districts"]
    for ward in district["wards"]
]
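Note that a list comprehension used only for its print side effect also builds a throwaway list of None values. A variant that keeps the single expression but avoids that list is to join the formatted strings and print once. This is a minimal sketch, assuming the whole output fits comfortably in memory; the name format_wards is hypothetical.

```python
def format_wards(data):
    """Return all 'ward, district, province' lines as one newline-joined string."""
    return "\n".join(
        "{}, {}, {}".format(ward["name"], district["name"], province["name"])
        for province in data
        for district in province["districts"]
        for ward in district["wards"]
    )
```

Calling print(format_wards(data)) then produces the same lines as the nested loops in the question.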
You could do something like this:
def print_district(district, province):
    for ward in district['wards']:
        print('{}, {}, {}'.format(ward['name'], district['name'], province['name']))

def print_province(province):
    for district in province['districts']:
        print_district(district, province)

for province in data:
    print_province(province)