Python ValueError: Length of values does not match length of index

I am carrying out an API search on Scale SERP and, for each search I do, I want to put the output into a column in my dataframe.
#Matches the GET request
api_result = requests.get('https://api.scaleserp.com/search', params, verify=False)
print(type(api_result))
#Stores the result as JSON (a dict)
result = api_result.json()
print(type(result))
#Pulls the list of 'organic_results' out of the JSON output
Results_df = result['organic_results']
#FOR loop to look at each result and select which output from the JSON is wanted
for res in Results_df:
    StartingDataFrame['JSONDump'] = res
api_result is a requests.models.Response, result is a dict, and res is a dict.
I want res to be put into the column Dump. Is this possible?
Updated Code
#Matches the GET request
api_result = requests.get('https://api.scaleserp.com/search', params, verify=False)
#Stores the result as JSON (a dict)
result = api_result.json()
#Pulls the list of 'organic_results' out of the JSON output
Results_df = result['organic_results']
#FOR loop to look at each result and select which output from the JSON is wanted
for res in Results_df:
    Extracted_data = {key: res[key] for key in res.keys()
                      & {'title', 'link', 'snippet_matched', 'date', 'snippet'}}
Extracted_data is a dict and contains the info I need:
{'title': '25 Jun 1914 - Advertising - Trove', 'link': 'https://trove.nla.gov.au/newspaper/article/7280119', 'snippet_matched': ['()', 'charge', 'Dan Whit'], 'snippet': 'I Iron roof, riltibcd II (),. Line 0.139.5. wai at r ar ... Propertb-« entired free of charge. Line 2.130.0 ... AT Dan Whit",\'»\', 6il 02 sturt »L, Prlnce\'»~Brti\'»e,. Line 3.12.0.'}
{'snippet': "Mary Bardwell is in charge of ... I() •. Al'companit'd by: Choppf'd Chitkf'n Li\\f>r Palt·. 1h!iiSC'o Gret'n Salad g iii ... of the overtime as Dan Whit-.", 'title': 'October 16,1980 - Bethlehem Public Library',
'link': 'http://www.bethlehempubliclibrary.org/webapps/spotlight/years/1980/1980-10-16.pdf', 'snippet_matched': ['charge', '()', 'Dan Whit'], 'date': '16 Oct 1980'}
{'snippet': 'CONGRATULATIONS TO DAN WHIT-. TLE ON THE ... jailed and beaten dozens of times. In one of ... ern p()rts ceased. The MIF is not only\xa0...', 'title': 'extensions of remarks - US Government Publishing Office', 'link': 'https://www.gpo.gov/fdsys/pkg/GPO-CRECB-1996-pt5/pdf/GPO-CRECB-1996-pt5-7-3.pdf', 'snippet_matched': ['DAN WHIT', 'jailed', '()'], 'date': '26 Apr 1986'}
{'snippet': 'ILLUSTRATION BY DAN WHIT! By Matt Manning ... ()n the one hand, there are doctors on both ... self-serving will go to jail at the beginning of\xa0...', 'title': 'The BG News May 23, 2007 - ScholarWorks#BGSU - Bowling ...', 'link': 'https://scholarworks.bgsu.edu/cgi/viewcontent.cgi?article=8766&context=bg-news', 'snippet_matched': ['DAN WHIT', '()', 'jail'], 'date': '23 May 2007'}
{'snippet': '$19.95 Charge card number SERVICE HOURS: ... Explorer Advisor Dan Whit- ... lhrr %(OnrwflC or ()utuflrueonlinelfmarketing (arnpaigfl%? 0I - .',
'title': '<%BANNER%> TABLE OF CONTENTS HIDE Section A: Main ...', 'link': 'https://ufdc.ufl.edu/UF00028295/00194', 'snippet_matched': ['Charge', 'Dan Whit', '()'], 'date': 'Listings 1 - 800'}
{'title': 'Lledo Promotional,Bull Nose Morris,Dandy,Desperate Dan ...', 'link': 'https://www.ebay.co.uk/itm/Lledo-Promotional-Bull-Nose-Morris-Dandy-Desperate-Dan-White-Van-/233817683840', 'snippet_matched': ['charges'], 'snippet': 'No additional import charges on delivery. This item will be sent through the Global Shipping Programme and includes international tracking. Learn more- opens\xa0...'}

The problem looks like the length of your organic_results is not the same as the length of your dataframe.
StartingDataFrame['Dump'] = (result['organic_results'])
Here you're setting the whole column Dump equal to organic_results, which is smaller or larger than your already-defined dataframe. I'm not sure what your dataframe already contains, but if you have existing rows you want to keep the values lined up with, you could iterate through the rows and stash a value in each one like this:
StartingDataFrame['Dump'] = [None] * len(StartingDataFrame)
for i, row in StartingDataFrame.iterrows():
    StartingDataFrame.at[i, 'Dump'] = result['organic_results']
Depending on what your data looks like, you could maybe just append it to the dataframe:
StartingDataFrame = StartingDataFrame.append(result['organic_results'],ignore_index=True)
Could you show us a sample of what both data sources look like?
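If the goal is one row per organic result (as in the updated code above), a minimal sketch is to collect each Extracted_data dict into a list and build the column, or a whole frame, from that list; the lengths then match by construction. This assumes StartingDataFrame is meant to hold exactly one row per result, which is a guess about the intended layout:

import pandas as pd

# Collect one dict per organic result
dumps = []
for res in result['organic_results']:
    extracted = {key: res[key] for key in res.keys()
                 & {'title', 'link', 'snippet_matched', 'date', 'snippet'}}
    dumps.append(extracted)

# Option 1: a brand-new frame, one row per result
results_frame = pd.DataFrame(dumps)

# Option 2: attach to an existing frame, assuming len(StartingDataFrame) == len(dumps)
# StartingDataFrame['Dump'] = dumps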

Related

Twitter api not returning 'created_at' info for tweets

When running the following code using the Python Tweepy library:
query = '$MSFT'
start_time = '2019-01-01T00:00:00Z'
end_time = '2019-02-01T00:00:00Z'
max_results = 10
results = client.search_all_tweets(query=query, max_results=max_results, start_time=start_time, end_time=end_time)
I receive a bunch of tweets. However, these tweets only contain the following data:
tweet.edit_history_tweet_ids
tweet.id
tweet.text
Information like 'created_at' or 'geo' is missing. Does anyone have any idea what is going wrong here?
Adding the tweet_fields parameter with the specific fields lets you see created_at, but geo is usually missing from a tweet because the tweeter did not add it.
In most cases you won't see it even if you pass geo.
I can't test search_all_tweets() because I don't have a paid account, but I can test a recent 7-day search, as in the attached code and results.
search_recent_tweets() and search_all_tweets() take almost the same parameters.
Here is the documentation; it is matched to the low-level v2 API endpoint
GET /2/tweets/search/all
import tweepy

bearer_token = "***** your bearer_token *****"
client = tweepy.Client(bearer_token=bearer_token)

query = '$MSFT'
start_time = '2019-01-01T00:00:00Z'
end_time = '2019-02-01T00:00:00Z'
max_results = 10

results = client.search_all_tweets(query=query,
                                   tweet_fields=['created_at', 'geo'],
                                   max_results=max_results,
                                   start_time=start_time,
                                   end_time=end_time)
for result in results.data:
    print(result.data)
import tweepy

bearer_token = "***** your bearer_token *****"
client = tweepy.Client(bearer_token=bearer_token)

query = '$MSFT'
results = client.search_recent_tweets(query=query,
                                      tweet_fields=['created_at', 'geo'],
                                      start_time='2023-01-12T00:03:00Z',
                                      max_results=10)
for result in results.data:
    print(result.data)
Result
>python search.py
{'created_at': '2023-01-16T01:27:49.000Z', 'text': '#excel_ranger $MSFT', 'edit_history_tweet_ids': ['1614796530795298817'], 'id': '1614796530795298817'}
{'created_at': '2023-01-16T01:27:26.000Z', 'text': 'Live day trading, detailed analysis on stocks with entry, exit, stop losses, price targets,mentoring and live news\n\nhttps://xxx/LzwYLosqbb\n\n$BIDU $SAVE $WORK $T $AAPL $C $MSFT $SPY $FB $CHWY $PTON $DIS $F $ADBE $CSCO $MGM $XOM $AMD $ZI $Z $BA $DOW $NET $PROP $OSTK $QCOM $ADM https://xxx/n3T6xDyNNj', 'edit_history_tweet_ids': ['1614796435139993607'], 'id': '1614796435139993607'}
{'created_at': '2023-01-16T01:27:10.000Z', 'text': '$SPY $QQQ $DIS $TSLA $SHOP\n$AMD $AAPL $SQ $AMZN \n$EA $SEDG $MA $V $KO $PYPL $RCL $GOOG $NKLA $DKNG $HD $ROKU $NFLX $FB $GILD $VXX $MSFT\n\nThanks to the discord group for the traders 🙏https://xxx/LzwYLosqbb https://xxx/V3pz7tESOk', 'edit_history_tweet_ids': ['1614796367511035904'], 'id': '1614796367511035904'}
{'created_at': '2023-01-16T01:26:56.000Z', 'text': 'RT #DarthDividend23: Great post by #thedividendclub on IG! \n\n$WMT $APD $JNJ $PG $WBA $GILD $O $MRK $UPS $AAPL $COST $SBUX $TGT $LOW $MSFT $…', 'edit_history_tweet_ids': ['1614796311076769794'], 'id': '1614796311076769794'}
{'created_at': '2023-01-16T01:26:52.000Z', 'text': 'RT #TrendSpider: Do the Markets need a Breather? 🧐\n\nPrepare for the week ahead 📝\n\nTickers Covered:\n$SPY 00:00\n$QQQ 01:58\n$IWM 04:57\n$BTC 07…', 'edit_history_tweet_ids': ['1614796291812327428'], 'id': '1614796291812327428'}
{'created_at': '2023-01-16T01:26:46.000Z', 'text': '🚀Alerted an entry for $METX win of 121%. \nhttps://xxx/LzwYLosqbb\n✅Check us out\n\n🔥\n\n$NAKD $TSLA $AAPL $NIO $SPY $INTC $GE $SNDL $NXTD $KGC $ZOM\n$FCEL\n$FEYE \n$ZM\n$AAL \n$PLTR \n$CCL \n$MSFT \n$PFE\n$PLUG \n$WFC \n$AMD \n$VXX \n$CNSP $QQQ https://xxx/OZFbKDpaHp', 'edit_history_tweet_ids': ['1614796270048010243'], 'id': '1614796270048010243'}
{'created_at': '2023-01-16T01:25:50.000Z', 'text': "STOCK, OPTIONS updates, alerts Free chatroom\nDon't forget to take a trial! \nhttps://xxx/LzwYLosqbb\n\n$SPY $BABA $DVAX $ACB $OSTK $TRIL $LK $CODX $SAVE $GSX $INO $KSS $PENN $NVAX $NIO $AAL $NKLA $MSFT $AAPL $AMZN $TSLA $CCL $BILI $CVNA $DAL $TWTR https://xxx/cT6Lc9Yhpv", 'edit_history_tweet_ids': ['1614796034487603200'], 'id': '1614796034487603200'}
{'created_at': '2023-01-16T01:25:36.000Z', 'text': '$MSFT at $220 feels like stealing candy from a baby.', 'edit_history_tweet_ids': ['1614795972751785985'], 'id': '1614795972751785985'}
{'created_at': '2023-01-16T01:25:16.000Z', 'text': '$DXY Index looks to test 101.60 as odds of less-hawkish Fed bets soar https://xxx/9PYMyxzvpt $CRM $DXY $NIO $BABA $BTC.X $ETH.X $BLK $COIN $BNB.X $AAPL $TSLA $MULN $CEI $SPY $DJIA $QQQ $WMT $MSFT $PFE $MRNA $AZN $ABNB $AMD $BNTX $BA $COP $PDD $COST $GM $META $AMZN $NFLX', 'edit_history_tweet_ids': ['1614795889251422208'], 'id': '1614795889251422208'}
{'created_at': '2023-01-16T01:24:52.000Z', 'text': '$AAPL Trade idea💡\nhttps://xxx/LzwYLosqbb\n\n$AMC $SPY $GME $QQQ $MU $MSFT $AMD $PTON $AMZN $CRM $XLF $XLE $TSLA $AAL https://xxx/rDouPd3hNj', 'edit_history_tweet_ids': ['1614795790412648450'], 'id': '1614795790412648450'}
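If you do want location data for the tweets that actually carry it, Tweepy's Client methods also accept expansions and place_fields. A sketch along those lines, continuing from the code above (untested here, and geo will still be absent for most tweets):

results = client.search_recent_tweets(query=query,
                                      tweet_fields=['created_at', 'geo'],
                                      expansions=['geo.place_id'],
                                      place_fields=['full_name', 'country'],
                                      max_results=10)

# Expanded place objects arrive in results.includes, separate from the tweets
places = {place.id: place.full_name for place in results.includes.get('places', [])}
for tweet in results.data:
    place_id = (tweet.geo or {}).get('place_id')
    print(tweet.created_at, places.get(place_id))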

How can I only parse the first HTML block from multiple blocks, if they all contain the same class-name?

I need to parse info from a site. On this site there are two blocks, "Today" and "Yesterday", and they have the same class name of standard-box standard-list.
How can I parse only the first block (under "Today"), without extracting the info from "Yesterday", if they both have the same class name?
Here is my code:
import requests
from bs4 import BeautifulSoup

url_news = "https://www.123.org/"
response = requests.get(url_news)
soup = BeautifulSoup(response.content, "html.parser")
items = soup.findAll("div", class_="standard-box standard-list")

news_info = []
for item in items:
    news_info.append({
        "title": item.find("div", class_="newstext").text,
        "link": item.find("a", class_="newsline article").get("href")
    })
When running your provided code, I don't get an output for items. However, you said that you do, so:
If you only want to get the data under "Today", you can use .find() instead of .find_all(), since .find() will only return the first found tag -- which is "Today" and not the other tags.
So, instead of:
items = soup.findAll("div", class_="standard-box standard-list")
Use:
items = soup.find("div", class_="standard-box standard-list")
Additionally, to find the link, I needed to access the attribute using tag-name[attribute]. Here is working code:
news_info = []
items = soup.find("div", class_="standard-box standard-list")
for item in items:
    news_info.append(
        {"title": item.find("div", class_="newstext").text, "link": item["href"]}
    )
print(news_info)
Output:
[{'title': 'NIP crack top 3 ranking for the first time in 5 years', 'link': 'https://www.hltv.org/news/32545/nip-crack-top-3-ranking-for-the-first-time-in-5-years'}, {'title': 'Fessor joins Astralis Talent', 'link': 'https://www.hltv.org/news/32544/fessor-joins-astralis-talent'}, {'title': 'Grashog joins AGO', 'link': 'https://www.hltv.org/news/32542/grashog-joins-ago'}, {'title': 'ISSAA parts ways with Eternal Fire', 'link': 'https://www.hltv.org/news/32543/issaa-parts-ways-with-eternal-fire'}, {'title': 'BLAST Premier Fall Showdown Fantasy live', 'link': 'https://www.hltv.org/news/32541/blast-premier-fall-showdown-fantasy-live'}, {'title': 'FURIA win IEM Fall NA, EG claim final Major Legends spot', 'link': 'https://www.hltv.org/news/32540/furia-win-iem-fall-na-eg-claim-final-major-legends-spot'}]
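Another way to scope the search to the first box only (a sketch, not from the original answer, assuming the same markup as above where each a.newsline.article link wraps a div.newstext) is to find the first box and then run find_all inside it:

first_box = soup.find("div", class_="standard-box standard-list")

news_info = []
for link in first_box.find_all("a", class_="newsline article"):
    news_info.append({
        "title": link.find("div", class_="newstext").text,
        "link": link.get("href"),
    })
print(news_info)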

Converting deeply nested JSON response from an API call to pandas dataframe

I am currently having trouble parsing a deeply nested JSON response from an HTTP API call.
My JSON response looks like this:
{'took': 476,
'_revision': 'r08badf3',
'response': {'accounts': {'hits': [{'name': '4002238760',
'display_name': 'Googleglass-4002238760',
'selected_fields': ['Googleglass',
'DDMonkey',
'Papu New Guinea',
'Jonathan Vardharajan',
'4002238760',
'DDMadarchod-INSTE',
None,
'Googleglass',
'0001012556',
'CC',
'Setu Non Standard',
'40022387',
320142,
4651321321333,
1324650651651]},
{'name': '4003893720',
'display_name': 'Swift-4003893720',
'selected_fields': ['Swift',
'DDMonkey',
'Papu New Guinea',
'Jonathan Vardharajan',
'4003893720',
'DDMadarchod-UPTM-RemotexNBD',
None,
'S.W.I.F.T. SCRL',
'0001000110',
'SE',
'Setu Non Standard',
'40038937',
189508,
1464739200000,
1559260800000]},
After I receive the response I store it in a data object and flatten it using json_normalize:
data = response.json()
data = data['response']['accounts']['hits']
data = json_normalize(data)
However, after I normalize it, the selected_fields are still nested inside a single column of my dataframe.
My curl statement looks like this:
curl --data 'query= {"terms":[{"type":"string_attribute","attribute":"Account Type","query_term_id":"account_type","in_list":["Contract"]},{"type":"string","term":"status_group","in_list":["paying"]},{"type":"string_attribute","attribute":"Region","in_list":["DDEU"]},{"type":"string_attribute","attribute":"Country","in_list":["Belgium"]},{"type":"string_attribute","attribute":"CSM Tag","in_list":["EU CSM"]},{"type":"date_attribute","attribute":"Contract Renewal Date","gte":1554057000000,"lte":1561833000000}],"count":1000,"offset":0,"fields":[{"type":"string_attribute","attribute":"DomainName","field_display_name":"Client Name"},{"type":"string_attribute","attribute":"Region","field_display_name":"Region"},{"type":"string_attribute","attribute":"Country","field_display_name":"Country"},{"type":"string_attribute","attribute":"Success Manager","field_display_name":"Client Success Manager"},{"type":"string","term":"identifier","field_display_name":"Account id"},{"type":"string_attribute","attribute":"DeviceSLA","field_display_name":"[FIN] Material Part Number"},{"type":"string_attribute","attribute":"SFDCAccountId","field_display_name":"SFDCAccountId"},{"type":"string_attribute","attribute":"Client","field_display_name":"[FIN] Client Sold-To Name"},{"type":"string_attribute","attribute":"Sold To Code","field_display_name":"[FIN] Client Sold To Code"},{"type":"string_attribute","attribute":"BU","field_display_name":"[FIN] Active BUs"},{"type":"string_attribute","attribute":"Service Type","field_display_name":"[FIN] Service Type"},{"type":"string_attribute","attribute":"Contract Header ID","field_display_name":"[FIN] SAP Contract Header ID"},{"type":"number_attribute","attribute":"Contract Value","field_display_name":"[FIN] ACV - Annual Contract Value","desc":true},{"type":"date_attribute","attribute":"Contract Start Date","field_display_name":"[FIN] Contract Start Date"},{"type":"date_attribute","attribute":"Contract Renewal Date","field_display_name":"[FIN] Contract Renewal Date"}],"scope":"all"}' --header 'app-token:YOUR-TOKEN-HERE' 'https://app.totango.com/api/v1/search/accounts'
So ultimately I want to store the Response in a dataframe along with the field names.
I've had to do this sort of thing a few times in the past (flatten out a nested JSON). I'll explain my process, and you can see if it works, or at least work the code a bit to fit your needs.
1) Take the data response and completely flatten it out using a function. This blog was very helpful when I first had to do this.
2) Then iterate through the flat dictionary that was created to find where each row and column needs to be created, using the numbering embedded in the new key names from the nested parts. There are also keys that are unique/distinct, so they don't have a number to identify them as a "new" row; I account for those in what I called special_cols.
3) As it iterates through those, pull the specified row number (embedded in those flat keys) and construct the dataframe that way.
It sounds complicated, but if you debug and run it line by line, you will see how it works. Nonetheless, I believe it should get you what you need.
data = {'took': 476,
'_revision': 'r08badf3',
'response': {'accounts': {'hits': [{'name': '4002238760',
'display_name': 'Googleglass-4002238760',
'selected_fields': ['Googleglass',
'DDMonkey',
'Papu New Guinea',
'Jonathan Vardharajan',
'4002238760',
'DDMadarchod-INSTE',
None,
'Googleglass',
'0001012556',
'CC',
'Setu Non Standard',
'40022387',
320142,
4651321321333,
1324650651651]},
{'name': '4003893720',
'display_name': 'Swift-4003893720',
'selected_fields': ['Swift',
'DDMonkey',
'Papu New Guinea',
'Jonathan Vardharajan',
'4003893720',
'DDMadarchod-UPTM-RemotexNBD',
None,
'S.W.I.F.T. SCRL',
'0001000110',
'SE',
'Setu Non Standard',
'40038937',
189508,
1464739200000,
1559260800000]}]}}}
import pandas as pd
import re

def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

flat = flatten_json(data)

results = pd.DataFrame()
special_cols = []
columns_list = list(flat.keys())
for item in columns_list:
    try:
        row_idx = re.findall(r'\_(\d+)\_', item)[0]
    except:
        special_cols.append(item)
        continue
    column = re.findall(r'\_\d+\_(.*)', item)[0]
    column = column.replace('_', '')
    row_idx = int(row_idx)
    value = flat[item]
    results.loc[row_idx, column] = value

for item in special_cols:
    results[item] = flat[item]
Output:
print (results.to_string())
name displayname selectedfields0 selectedfields1 selectedfields2 selectedfields3 selectedfields4 selectedfields5 selectedfields6 selectedfields7 selectedfields8 selectedfields9 selectedfields10 selectedfields11 selectedfields12 selectedfields13 selectedfields14 took _revision
0 4002238760 Googleglass-4002238760 Googleglass DDMonkey Papu New Guinea Jonathan Vardharajan 4002238760 DDMadarchod-INSTE NaN Googleglass 0001012556 CC Setu Non Standard 40022387 320142.0 4.651321e+12 1.324651e+12 476 r08badf3
1 4003893720 Swift-4003893720 Swift DDMonkey Papu New Guinea Jonathan Vardharajan 4003893720 DDMadarchod-UPTM-RemotexNBD NaN S.W.I.F.T. SCRL 0001000110 SE Setu Non Standard 40038937 189508.0 1.464739e+12 1.559261e+12 476 r08badf3
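For this particular shape (a list of hits, each carrying a selected_fields list), a shorter sketch is also possible with plain pandas. This is an alternative, not the method above; it assumes every hit has the same number of selected_fields and that data is the dict defined earlier:

import pandas as pd

hits = data['response']['accounts']['hits']
df = pd.DataFrame(hits)

# Expand the selected_fields list column into one numbered column per entry
fields = df['selected_fields'].apply(pd.Series)
fields.columns = ['selectedfields%d' % i for i in fields.columns]

results = pd.concat([df.drop(columns='selected_fields'), fields], axis=1)
results['took'] = data['took']
results['_revision'] = data['_revision']
print(results.to_string())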

Filtering/accessing date in Bio Entrez pubmed pulls with python

I have a list of criteria (names and the date ranges in which papers were published) to use to obtain a list of published papers. I'm using Biopython's Bio.Entrez to obtain papers from Entrez.
I can query and get results by author name, but I'm not figuring out how to deal with the data to get the dates in there. This is what I've done:
from Bio import Entrez

handle = Entrez.esearch(db="pubmed", term="")
result = Entrez.read(handle)
handle.close()
ids = result['IdList']
print ids

#for each id, go through it and pull the summary
for uid in ids:
    handle2 = Entrez.esummary(db="pubmed", id=uid, retmode="xml")
    result2 = Entrez.read(handle2)
    handle2.close()
Now the output looks like this
[{'DOI': '10.1016/j.jmoldx.2013.10.002', 'Title': 'Validation of a next-generation sequencing assay for clinical molecular oncology.', 'Source': 'J Mol Diagn', 'PmcRefCount': 7, 'Issue': '1', 'SO': '2014 Jan;16(1):89-105', 'ISSN': '1525-1578', 'Volume': '16', 'FullJournalName': 'The Journal of molecular diagnostics : JMD', 'RecordStatus': 'PubMed - indexed for MEDLINE', 'ESSN': '1943-7811', 'ELocationID': 'doi: 10.1016/j.jmoldx.2013.10.002', 'Pages': '89-105', 'PubStatus': 'ppublish+epublish', 'AuthorList': ['Cottrell CE', 'Al-Kateb H', 'Bredemeyer AJ', 'Duncavage EJ', 'Spencer DH', 'Abel HJ', 'Lockwood CM', 'Hagemann IS', "O'Guin SM", 'Burcea LC', 'Sawyer CS', 'Oschwald DM', 'Stratman JL', 'Sher DA', 'Johnson MR', 'Brown JT', 'Cliften PF', 'George B', 'McIntosh LD', 'Shrivastava S', 'Nguyen TT', 'Payton JE', 'Watson MA', 'Crosby SD', 'Head RD', 'Mitra RD', 'Nagarajan R', 'Kulkarni S', 'Seibert K', 'Virgin HW 4th', 'Milbrandt J', 'Pfeifer JD'], 'EPubDate': '2013 Nov 6', 'PubDate': '2014 Jan', 'NlmUniqueID': '100893612', 'LastAuthor': 'Pfeifer JD', 'ArticleIds': {'pii': 'S1525-1578(13)00219-5', 'medline': [], 'pubmed': ['24211365'], 'eid': '24211365', 'rid': '24211365', 'doi': '10.1016/j.jmoldx.2013.10.002'}, u'Item': [], 'History': {'received': '2013/02/04 00:00', 'medline': ['2014/08/30 06:00'], 'revised': '2013/08/23 00:00', 'pubmed': ['2013/11/12 06:00'], 'aheadofprint': '2013/11/06 00:00', 'accepted': '2013/10/01 00:00', 'entrez': '2013/11/12 06:00'}, 'LangList': ['English'], 'HasAbstract': 1, 'References': ['J Mol Diagn. 2014 Jan;16(1):7-10. PMID: 24269227'], 'PubTypeList': ['Journal Article'], u'Id': '24211365'}]
I tried looking at using EFetch, which doesn't always have XML output from what I understand. I thought I could filter for dates by parsing through the XML like so:
proj_start = '2009 Jan 01'
proj_start = time.strptime(proj_start, '%Y %b %d')

for paper in results2:
    handle = open(paper)
    record = Entrez.read(handle)
    pub_dat = time.strptime(record["EPubDate"], '%Y %b %d')
I get the error:
Traceback (most recent call last):
File "<ipython-input-39-13bcded12392>", line 2, in <module>
handle = open(paper)
TypeError: coercing to Unicode: need string or buffer, ListElement found
I feel like I'm missing something and that I should be able to feed this directly into the query. I also don't understand why this method doesn't work, even though it seems a harder way to do it. Is there a better way to do this? I tried doing it with xml.etree but got a similar error.
You don't need to open(paper): paper is already a Python dict (basically JSON). If you want the accepted date you can access it like this:
paper['History']['accepted']
'2013/10/01 00:00'
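Two follow-on sketches, neither of them part of the original answer. You can either filter the summaries you already fetched by date in Python, or, assuming the standard E-utilities parameters (which Bio.Entrez passes straight through to esearch), restrict the date range in the query itself with datetype/mindate/maxdate. The author term below is just an illustrative example:

import time
from Bio import Entrez

Entrez.email = "you@example.com"   # hypothetical address; NCBI asks for one

# Option 1: filter the esummary records fetched above by their accepted date
proj_start = time.strptime('2009 Jan 01', '%Y %b %d')
for paper in result2:
    accepted = paper['History']['accepted']          # e.g. '2013/10/01 00:00'
    if time.strptime(accepted, '%Y/%m/%d %H:%M') >= proj_start:
        print(paper['Title'])

# Option 2: let PubMed do the date filtering on publication date
handle = Entrez.esearch(db="pubmed", term="Cottrell CE[Author]",
                        datetype="pdat", mindate="2009/01/01", maxdate="2014/12/31")
ids = Entrez.read(handle)['IdList']
handle.close()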

Parsing Text Structured with Indents in Python

I am getting stuck trying to figure out an efficient way to parse some plaintext that is structured with indents (it comes from a Word doc). Example (note: the indentation below does not render on the mobile version of SO):
Attendance records                       8 F    1921-2010    Box 2
        1921-1927, 1932-1944
        1937-1939,1948-1966,
        1971-1979, 1989-1994, 2010
Number of meetings attended each year    1 F    1991-1994    Box 2
Papers re: Safaris                       10 F   1951-2011    Box 2
        Incomplete; Includes correspondence
        about beginning “Safaris” may also
        include announcements, invitations,
        reports, attendance, and charges; some
        photographs.
        See also: Correspondence and Minutes
So the unindented text is the parent record data and each set of indented text below each parent data line are some notes for that data (which are also split into multiple lines themselves).
So far I have a crude script to parse out the unindented parent lines so that I get a list of dictionary items:
import re

f = open('example_text.txt', 'r')
lines = f.readlines()
records = []

for line in lines:
    if line[0].isalpha():
        processed = re.split('\s{2,}', line)
        title = processed[0]
        rec_id = processed[1]
        years = processed[2]
        location = processed[3]
        records.append({
            "title": title,
            "id": rec_id,
            "years": years,
            "location": location
        })
    elif not line[0].isalpha():
        print "These are the notes, but attaching them to the above records is not clear"

print records
and this produces:
[{'id': '8 F',
'location': 'Box 2',
'title': 'Attendance records',
'years': '1921-2010'},
{'id': '1 F',
'location': 'Box 2',
'title': 'Number of meetings attended each year',
'years': '1991-1994'},
{'id': '10 F',
'location': 'Box 2',
'title': 'Papers re: Safaris',
'years': '1951-2011'}]
But now I want to add the notes to each record, to the effect of:
[{'id': '8 F',
'location': 'Box 2',
'title': 'Attendance records',
'years': '1921-2010',
'notes': '1921-1927, 1932-1944 1937-1939,1948-1966, 1971-1979, 1989-1994, 2010'
},
...]
What's confusing me is that I'm assuming this procedural, line-by-line approach, and I'm not sure if there is a more Pythonic way to do it. I'm more used to scraping webpages, where at least you have selectors; here it's hard to double back and go one by one down the lines. I was hoping someone might be able to shake my thinking loose and provide a fresh view on a better way to attack this.
Update
Just adding the condition suggested by the answer below for the indented lines worked fine:
import re
from pprint import pprint

f = open('example_text.txt', 'r')
lines = f.readlines()
records = []

for line in lines:
    if not line[0].isalpha():
        record['notes'].append(line)
        continue
    processed = re.split('\s{2,}', line)
    title = processed[0]
    rec_id = processed[1]
    years = processed[2]
    location = processed[3]
    record = {"title": title,
              "id": rec_id,
              "years": years,
              "location": location,
              "notes": []}
    records.append(record)

pprint(records)
As you have already solved the parsing of the records, I will only focus on how to read the notes of each one:
records = []

with open('data.txt', 'r') as lines:
    for line in lines:
        if line.startswith('\t'):
            record['notes'].append(line[1:])
            continue
        record = {'title': line, 'notes': []}
        records.append(record)

for record in records:
    print('Record is', record['title'])
    print('Notes are', record['notes'])
    print()
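A small follow-up sketch, not in the original answer, that combines the two: it splits each parent line into the four fields, collects the indented lines, and joins the notes into one string per record as in the desired output above (this assumes the fields in the parent lines are separated by two or more spaces, matching the re.split used earlier):

import re
from pprint import pprint

records = []
with open('example_text.txt', 'r') as f:
    for line in f:
        if line[0].isalpha():
            title, rec_id, years, location = re.split(r'\s{2,}', line.strip())
            record = {"title": title, "id": rec_id, "years": years,
                      "location": location, "notes": []}
            records.append(record)
        elif line.strip():
            record["notes"].append(line.strip())

# Join the collected note lines into one string per record
for record in records:
    record["notes"] = ' '.join(record["notes"])

pprint(records)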
