Why is the index position outputting data from other index's?

Why is the index position outputting data from other index's? - python

I'm trying to create a list of values from two categories in a dataset called style 1 and style 2. However, when I input the code it creates a list that combines data from style 1 and style 2. The weird part is that it's grabbing these catchphrases from somewhere I'm not aware of? This is the dataset that I'm using and I cannot find the catchphrases within it: https://www.kaggle.com/datasets/jessicali9530/animal-crossing-new-horizons-nookplaza-dataset?select=villagers.csv
Here's my code:
style = []
for i in lines[1:]:
vals = i.strip().split(',')
style.append(vals[13])
style.append(vals[14])
print(style)
Here's a small sample of what is being printed:
[' let it go. Then chase it down. What were you thinking?"', 'Active', 'Cool', 'Cool', 'Active', 'Simple', 'Simple', 'Elegant', 'Active', 'Active', 'Simple', 'Simple', 'Cute', 'Cute', 'Gorgeous', 'Elegant', ' water', ' and shelter!"',
As you can see there are these random catchphrases mixed within the style 1 and style 2 values. Not sure why or where it's coming from.

Some lines must contain commas inside a csv field. Usually in csv processing this is handled by putting double quotes around each field, like this:
Hero,Catchphrase,Sidekick
"Lone Ranger","Hi Ho, Silver!","Tonto"
"Superman","Up, Up, and Away!","Lois Lane"
You see that there are commas inside the catchphrase column. Since they are inside the quotes, standard csv processing will handle it correctly.
However, it looks like you are not using standard csv processing. You're just treating each csv data line as a standard string. If you were to do the same with this sample data, the first line will split into four fields, and the second line into five.
Here's a short example of the right way to do it:
import csv
with open("myfile.csv") as f:
reader = csv.reader(f)
for row in reader:
# row[13] and row[14] are handled correctly

Related

Making acronym for every airport in Python list?

How do I create a new column, and write acronyms for each respective airport record using Python on a csv file?
I have a csv file of airports and I want the names of the airports to be in acronym form so that I can display them on a map more compactly, with the airport symbol showing what it is.
An example would be this sample list:
['Bradley Sky Ranch', 'Fire Island Airport', 'Palmer Municipal Airport']
into this: ['B.S.R', 'F.I.A.', 'P.M.A.']
Next, how would you put the '.' period punctuation between each acronym letter?
I think it would be + "." + or something with ".".join?
Lastly, a benefit would be if there is a way to get rid of the word 'Airport' so that every acronym doesn't end with 'A'?
For example, something like .strip 'Airport'... but it's not the main goal.
The numbered list below shows examples of code I have, but I have no coherent solution. So please take only what makes sense, and if it doesn't I would like to learn more effective syntax!
[The original airport data is from the ESRI Living Atlas.] I have a new field/column called 'NameAbbrev' which I want to write the acronyms into, but I did this in ArcPro which contains essentially a black-box interface for calculating new fields.
Sidenote: Why am I posting to SO and not GeoNet if this is map related? Please note that my goal is to use python and am not asking about ArcPy. I think the underlying principle is python-based for operating on a csv file (whereas ArcPy would be operating on a featureclass and you would have to use ESRI-designated functions). And SO reaches a wider audience of python experts.
1) So far, I have come across how to turn a string into an acronym, which works great on a single string, not a list:
Creating acronyms in Python
acronym = "".join(word[0] for word in test.upper().split())
2) and attempted to split the items in a list, or how to do readlines on a csv file based on an example (not mine): Attribute Error: 'list' object has no attribute 'split'
def getQuakeData():
filename = input("Please enter the quake file: ")
# Use with to make sure the file gets closed
with open(filename, "r") as readfile:
# no need for readlines; the file is already an iterable of lines
# also, using generator expressions means no extra copies
types = (line.split(",") for line in readfile)
# iterate tuples, instead of two separate iterables, so no need for zip
xys = ((type[1], type[2]) for type in types)
for x, y in xys:
print(x,y)
getQuakeData()
3) Also, I have been able to use pandas to print out just the column of airport names into a list:
import pandas
colnames = ['OBJECTID', 'POLYGON_ID', 'POLYGON_NM', 'NM_LANGCD', 'FEAT_TYPE', 'DETAIL_CTY', 'FEAT_COD', 'NAME_FIX', 'ORIG_FID', 'NameAbbrev']
data = pandas.read_csv(r'C:\Users\...\AZ_Airports_table.csv', names=colnames)
names = data.NAME_FIX.tolist()
print(names)
#Here is a sample of the list of airport names/print result.
#If you want a sample to demo guidance you could use these names:
#['NAME_FIX', 'Bradley Sky Ranch', 'Fire Island Airport', 'Palmer Municipal Airport', 'Kodiak Airport', 'Nome Airport', 'Kenai Municipal Airport', 'Iliamna Airport', 'Sitka Airport', 'Wrangell Airport', 'Sand Point Airport', 'Unalaska Airport', 'Adak Airport', 'Homer Airport', 'Cold Bay Airport']
4) I've also been able to use search cursor and writerow in the past, but I don't know how exactly to apply these methods. (unrelated example):
with open(outCsv, 'wb') as ouputCsv:
writer = csv.writer(outputCsv)
writer.writerow(fields) # writes header containing list of fields
rows = arcpy.da.SearchCursor(fc, field_names=fields)
for row in rows:
writer.writerow(row) # writes fc contents to output csv
del rows
5) So, I have pieces, but I don't know how to put them all together or if they even fit together. This is my Frankenstein monster of a solution, but it is wrong because it is trying to look at each column!
def getAcronym():
filename = r'C:\Users\...\AZ_Airports_table.csv'
# Use with to make sure the file gets closed
with open(filename, "r") as readfile:
# no need for readlines; the file is already an iterable of lines
# also, using generator expressions means no extra copies
airport = (line.split(",") for line in readfile)
# iterate tuples, instead of two separate iterables, so no need for zip
abbreviation = "".join(word[0] for word in airport.upper().split())
# could also try filter(str.isupper, line)
print(abbreviation)
getAcronym()
Is there a simpler way to combine these ideas and get the acronym column I want? Or is there an alternative way?

This can be done quite simply using a list comprehension, str.join, and filter:
>>> data = ['Bradley Sky Ranch', 'Fire Island Airport', 'Palmer Municipal Airport']
>>> ['.'.join(filter(str.isupper, name)) for name in data]
['B.S.R', 'F.I.A', 'P.M.A']

Shortest answer
You can iterate over each string in the list, by using a for loop, then you can add each result to a new list. It could be turned into a function if you desire.
airports = ['Bradley Sky Ranch', 'Fire Island Airport', 'Palmer Municipal Airport']
air_acronyms = []
for airport in airports:
words = airport.split()
letters = [word[0] for word in words]
air_acronyms.append(".".join(letters))
print(air_acronyms)
output
['B.S.R', 'F.I.A', 'P.M.A']

I don't know and actually not properly understood what you want but as far as I understand you want to generate the acronym of your list of strings with the first character of every word. So what about my below solution with couple of loops? You can use list comprehension or filter or other cool functions of python to achieve what you want further. Let me know if I miss anything.
input = ['Bradley Sky Ranch', 'Fire Island Airport', 'Palmer Municipal Airport']
output = []
for i in input:
j = i.split(' ')
res = ''
for k in j:
res+= k[0] + '.'
output.append(res)
print(output)
Output:
['B.S.R.', 'F.I.A.', 'P.M.A.']

Python 2.7 Can't Write out file from DictReader with DictWriter after re.findall with Regex

I've tried many approaches based on great stack overflow ideas per:
How to write header row with csv.DictWriter?
Writing a Python list of lists to a csv file
csv.DictWriter -- TypeError: __init__() takes at least 3 arguments (4 given)
Python: tuple indices must be integers, not str when selecting from mysql table
https://docs.python.org/2/library/csv.html
python csv write only certain fieldnames, not all
Python 2.6 Text Processing and
Why is DictWriter not Writing all rows in my Dictreader instance?
I tried mapping reader and writer fieldnames and special header parameters.
I built a second layer test from some great multi-column SO articles:
code follows
import csv
import re
t = re.compile('<\*(.*?)\*>')
headers = ['a', 'b', 'd', 'g']
with open('in2.csv', 'rb') as csvfile:
with open('out2.csv', 'wb') as output_file:
reader = csv.DictReader(csvfile)
writer = csv.DictWriter(output_file, headers, extrasaction='ignore')
writer.writeheader()
print(headers)
for row in reader:
row['d'] = re.findall(t, row['d'])
print(row['a'], row['b'], row['d'], row['g'])
writer.writerow(row)
input data is:
a, b, c, d, e, f, g, h
<* number 1 *>, <* number 2 *>, <* number 3 *>, <* number 4 *>, ...<* number 8 *>
<* number 2 *>, <* number 3 *>, <* number 4 *>, ...<* number 8 *>, <* number 9 *>
output data is:
['a', 'b', 'd', 'g' ]
('<* number 1 *>', '<* number 2 *>', ' number 4 ', <* number 7 *>)
('<* number 2 *>', '<* number 3 *>', ' number 5 ', <* number 8 *>)
exactly as desired.
But when I use a rougher data set that has words with blanks, double quotes, and mixes of upper and lower case letters, the printing works at the row level, but the writing does not work entirely.
By entirely, I have been able (I know I'm in epic fail mode here) to actually write one row of the challenging data, but not in that instance, a header and multiple rows. Pretty lame that I can't overcome this hurdle with all the talented articles that I've read.
All four columns fail with either a key error or with a "TypeError: tuple indices must be integers, not str"
I'm obviously not understanding how to grasp what Python needs to make this happen.
The high level is: read in text files with seven observations / columns. Use only four columns to write out; perform the regex on one column. Make sure to write out each newly formed row, not the original row.
I may need a more friendly type of global temp table to read the row into, update the row, then write the row out to a file.
Maybe I'm asking too much of Python architecture to coordinate a DictReader and a DictWriter to read in data, filter to four columns, update the fourth column with a regex, then write out the file with the updated four tuples.
At this juncture, I don't have the time to investigate a parser. I would like to eventually in more detail, since per release of Python (2.7 now, 3.x later) parsers seem handy.
Again, apologize for the complexity of the approach and my lack of understanding of the underpinnings of Python. In R language, the parallel of my shortcomings would be understanding coding at the S4 level, not just the S3 level.
Here is data that is closer to what fails, sorry--I needed to show how the headers are set up, how the file rows coming in are formatted with individual double quotes with quotes around the entire row and how the date is formatted, but not quoted:
stuff_type|stuff_date|stuff_text
""cool stuff"|01-25-2015|""the text stuff <*to test*> to find a way to extract all text that is <*included in special tags*> less than star and greater than star"""
""cool stuff"|05-13-2014|""the text stuff <*to test a second*> to find a way to extract all text that is <*included in extra special tags*> less than star and greater than star"""
""great big stuff"|12-7-2014|"the text stuff <*to test a third*> to find a way to extract all text that is <*included in very special tags*> less than star and greater than star"""
""nice stuff"|2-22-2013|""the text stuff <*to test a fourth ,*> to find a way to extract all text that is <*included in doubly special tags*> less than star and greater than star"""
stuff_type,stuff_date,stuff_text
cool stuff,1/25/2015,the text stuff <*to test*> to find a way to extract all text that is <*included in special tags*> less than star and greater than star
cool stuff,5/13/2014,the text stuff <*to test a second*> to find a way to extract all text that is <*included in extra special tags*> less than star and greater than star
great big stuff,12/7/2014,the text stuff <*to test a third*> to find a way to extract all text that is <*included in very special tags*> less than star and greater than star
nice stuff,2/22/2013,the text stuff <*to test a fourth *> to find a way to extract all text that is <*included in really special tags*> less or greater than star
I plan to retest this, but a Spyder update made my Python console crash this morning. Ugghh. With vanilla Python, the test data above fails with the following code... no need to do the write step...can't even print here... may need the QUOTES.NONE in the dialect.
import csv
import re
t = re.compile('<\*(.*?)\*>')
headers = ['stuff_type', 'stuff_date', 'stuff_text']
with open('C:/Temp/in3.csv', 'rb') as csvfile:
with open('C:/Temp/out3.csv', 'wb') as output_file:
reader = csv.DictReader(csvfile)
writer = csv.DictWriter(output_file, headers, extrasaction='ignore')
writer.writeheader()
print(headers)
for row in reader:
row['stuff_text'] = re.findall(t, row['stuff_text'])
print(row['stuff_type'], row['stuff_date'], row['stuff_text'])
writer.writerow(row)
Error:
can't past the snipping tool image in here ....sorry
KeyError: 'stuff_text'
OK: it might be in the quoting and separation of columns: the data above without quotes printed without a KeyError and now writes to the file correctly: I may have to clean up the file from quote characters before I pull out text with the regex. Any thoughts would be appreciated.
Good question # Andrea Corbellini
The code above generates the following output if I've manually removed the quotes:
stuff_type,stuff_date,stuff_text
cool stuff,1/25/2015,"['to test', 'included in special tags']"
cool stuff,5/13/2014,"['to test a second', 'included in extra special tags']"
great big stuff,12/7/2014,"['to test a third', 'included in very special tags']"
nice stuff,2/22/2013,"['to test a fourth ', 'included in really special tags']"
which is what I want in regards to output. So, thanks for your "lazy" question---I'm the lazy one that should have put this second output as a follow on.
Again, without removing multiple sets of quotation marks, I have KeyError: 'stuff_type'. I apologize that I have attempted to insert the image from a screen capture of the Python with the error, but have not figured out yet how to do that in SO. I used the Images section above, but that seems to point to a file that maybe is uploaded to SO? not inserted?
With #monkut's excellent input below on using ".".join things or literally stuff is getting better.
{['stuff_type', 'stuff_date', 'stuff_text']
('cool stuff', '1/25/2015', 'to test:included in special tags')
('cool stuff', '5/13/2014', 'to test a second:included in extra special tags')
('great big stuff', '12/7/2014', 'to test a third:included in very special tags')
('nice stuff', '2/22/2013', 'to test a fourth :included in really special tags')}
import csv
import re
t = re.compile('<\*(.*?)\*>')
headers = ['stuff_type', 'stuff_date', 'stuff_text']
csv.register_dialect('piper', delimiter='|', quoting=csv.QUOTE_NONE)
with open('C:/Python/in3.txt', 'rb') as csvfile:
with open('C:/Python/out5.csv', 'wb') as output_file:
reader = csv.DictReader(csvfile, dialect='piper')
writer = csv.DictWriter(output_file, headers, extrasaction='ignore')
writer.writeheader()
print(headers)
for row in reader:
row['stuff_text'] = ":".join(re.findall(t, row['stuff_text']))
print(row['stuff_type'], row['stuff_date'], row['stuff_text'])
writer.writerow(row)
Error path follows:
runfile('C:/Python/test quotes with dialect quotes none or quotes filter and special characters with findall regex.py', wdir='C:/Python')
['stuff_type', 'stuff_date', 'stuff_text']
('""cool stuff"', '01-25-2015', 'to test')
Traceback (most recent call last):
File "<ipython-input-3-832ce30e0de3>", line 1, in <module>
runfile('C:/Python/test quotes with dialect quotes none or quotes filter and special characters with findall regex.py', wdir='C:/Python')
File "C:\Users\Methody\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 699, in runfile
execfile(filename, namespace)
File "C:\Users\Methody\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Python/test quotes with dialect quotes none or quotes filter and special characters with findall regex.py", line 20, in <module>
row['stuff_text'] = ":".join(re.findall(t, row['stuff_text']))
File "C:\Users\Methody\Anaconda\lib\re.py", line 177, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
I'll have a find a stronger way to clean up and remove the quotes before processing the regex findall. Probably something row = string.remove(quotes with blanks).

I think findall returns a list, which may be screwing things up, since dictwriter wants a single string value.
row['d'] = re.findall(t, row['d'])
You can use .join to turn the results to a single string value:
row['d'] = ":".join(re.findall(t, row['d']))
Where, here values are joined with, ":". As you mention, though, you may need to clean the values a bit more...
You mentioned there was a problem with using the compiled regex object.
Here's an example of how the compiled regex object is used:
import re
t = re.compile('<\*(.*?)\*>')
text= ('''cool stuff,1/25/2015,the text stuff <*to test*> to find a way to extract all text that'''
''' is <*included in special tags*> less than star and greater than star''')
result = t.findall(text)
This should return the following into result:
['to test', 'included in special tags']

Trying to write a list of dictionaries to csv in Python, running into encoding issues

So I am running into an encoding problem stemming from writing dictionaries to csv in Python.
Here is an example code:
import csv
some_list = ['jalape\xc3\xb1o']
with open('test_encode_output.csv', 'wb') as csvfile:
output_file = csv.writer(csvfile)
for item in some_list:
output_file.writerow([item])
This works perfectly fine and gives me a csv file with "jalapeño" written in it.
However, when I create a list of dictionaries with values that contain such UTF-8 characters...
import csv
some_list = [{'main': ['4 dried ancho chile peppers, stems, veins
and seeds removed']}, {'main': ['2 jalape\xc3\xb1o
peppers, seeded and chopped', '1 dash salt']}]
with open('test_encode_output.csv', 'wb') as csvfile:
output_file = csv.writer(csvfile)
for item in some_list:
output_file.writerow([item])
I just get a csv file with 2 rows with the following entries:
{'main': ['4 dried ancho chile peppers, stems, veins and seeds removed']}
{'main': ['2 jalape\xc3\xb1o peppers, seeded and chopped', '1 dash salt']}
I know I have my stuff written in the right encoding, but because they aren't strings, when they are written out by csv.writer, they are written as-is. This is frustrating. I searched for some similar questions on here and people have mentioned using csv.DictWriter but that wouldn't really work well for me because my list of dictionaries aren't all just with 1 key 'main'. Some have other keys like 'toppings', 'crust', etc. Not just that, I'm still doing more work on them where the eventual output is to have the ingredients formatted in amount, unit, ingredient, so I will end up with a list of dictionaries like
[{'main': {'amount': ['4'], 'unit': [''],
'ingredient': ['dried ancho chile peppers']}},
{'topping': {'amount': ['1'], 'unit': ['pump'],
'ingredient': ['cool whip']}, 'filling':
{'amount': ['2'], 'unit': ['cups'],
'ingredient': ['strawberry jam']}}]
Seriously, any help would be greatly appreciated, else I'd have to use a find and replace in LibreOffice to fix all those \x** UTF-8 encodings.
Thank you!

You are writing dictionaries to the CSV file, while .writerow() expects lists with singular values that are turned into strings on writing.
Don't write dictionaries, these are turned into string representations, as you've discovered.
You need to determine how the keys and / or values of each dictionary are to be turned into columns, where each column is a single primitive value.
If, for example, you only want to write the main key (if present) then do so:
with open('test_encode_output.csv', 'wb') as csvfile:
output_file = csv.writer(csvfile)
for item in some_list:
if 'main' in item:
output_file.writerow(item['main'])
where it is assumed that the value associated with the 'main' key is always a list of values.
If you wanted to persist dictionaries with Unicode values, then you are using the wrong tool. CSV is a flat data format, just rows and primitive columns. Use a tool that can preserve the right amount of information instead.
For dictionaries with string keys, lists, numbers and unicode text, you can use JSON, or you can use pickle if more complex and custom data types are involved. When using JSON, you do want to either decode from byte strings to Python Unicode values, or always use UTF-8-encoded byte strings, or state how the json library should handle string encoding for you with the encoding keyword:
import json
with open('data.json', 'w') as jsonfile:
json.dump(some_list, jsonfile, encoding='utf8')
because JSON strings are always unicode values. The default for encoding is utf8 but I added it here for clarity.
Loading the data again:
with open('data.json', 'r') as jsonfile:
some_list = json.load(jsonfile)
Note that this will return unicode strings, not strings encoded to UTF8.
The pickle module works much the same way, but the data format is not human-readable:
import pickle
# store
with open('data.pickle', 'wb') as pfile:
pickle.dump(some_list, pfile)
# load
with open('data.pickle', 'rb') as pfile:
some_list = pickle.load(pfile)
pickle will return your data exactly as you stored it. Byte strings remain byte strings, unicode values would be restored as unicode.

As you see in your output, you've used a dictionary so if you want that string to be processed you have to write this:
import csv
some_list = [{'main': ['4 dried ancho chile peppers, stems, veins', '\xc2\xa0\xc2\xa0\xc2\xa0 and seeds removed']}, {'main': ['2 jalape\xc3\xb1o peppers, seeded and chopped', '1 dash salt']}]
with open('test_encode_output.csv', 'wb') as csvfile:
output_file = csv.writer(csvfile)
for item in some_list:
output_file.writerow(item['main']) #so instead of [item], we use item['main']
I understand that this is possibly not the code you want as it limits you to call every key main but at least it gets processed now.
You might want to formulate what you want to do a bit better as now it is not really clear (at least to me). For example do you want a csv file that gives you main in the first cell and then 4 dried ...

Get Attribute From Text File in Python

So I'm making a Yu-Gi-Oh database program. I have all the information stored in a large text file. Each monster is chategorized in the following way:
|Name|NUM 1|DESC 1|TYPE|LOCATION|STARS|ATK|DEF|DESCRIPTION
Here's an actual example:
|A Feather of the Phoenix|37;29;18|FET;YSDS;CP03|Spell Card}Spell||||Discard 1 card. Select from your Graveyard and return it to the top of your Deck.|
So I made a program that searches this large text file by name and it returns the information from the text file without the '|'. Here it is:
with open('TEXT.txt') as fd:
input=[x.strip('|').split('|') for x in fd.readlines()]
to_search={x[0]:x for x in input}
print('\n'.join(to_search[name]))
Now I'm trying to edit my program so I can search for the name of the monster and choose which attribute I want to display. So it'd appear like
A Feather of the Phoenix
Description:
Discard 1 card. Select from your Graveyard and return it to the top of your Deck.
Any clues as to how I can do this?

First, this is a variant dialect of CSV, and can be parsed with the csv module instead of trying to do it manually. For example:
with open('TEXT.txt') as fd:
rows = csv.reader(fd, delimiter='|')
to_search = {row[1]:row for row in rows}
print('\n'.join(to_search[name]))
You might also prefer to use DictReader, so each row is a dict (keyed off the names in the header row, or manually-specified column names if you don't have one):
with open('TEXT.txt') as fd:
rows = csv.DictReader(fd, delimiter='|')
to_search = {row['Name']:row for row in rows}
print('\n'.join(to_search[name]))
Then, to select a specific attribute:
with open('TEXT.txt') as fd:
rows = csv.DictReader(fd, delimiter='|')
to_search = {row['Name']:row for row in rows}
print(to_search[name][attribute])
However… I'm not sure this is a good design in the first place. Do you really want to re-read the entire file for each lookup? I think it makes more sense to read it into memory once, into a general-purpose structure that you can use repeatedly. And in fact, you've almost got such a structure:
with open('TEXT.txt') as fd:
monsters = list(csv.DictReader(fd, delimiter='|'))
monsters_by_name = {monster['Name']: monster for monster in monsters}
Then you can build additional indexes, like a multi-map of monsters by location, etc., if you need them.
All this being said, your original code can almost handle what you want already. to_search[name] is a list. If you just build a map from attribute names to indices, you can do this:
attributes = ['Name', 'NUM 1', 'DESC 1', 'TYPE', 'LOCATION', 'STARS', 'ATK', 'DEF', 'DESCRIPTION']
attributes_by_name = {value: idx for idx, value in enumerate(attributes)}
# ...
with open('TEXT.txt') as fd:
input=[x.strip('|').split('|') for x in fd.readlines()]
to_search={x[0]:x for x in input}
attribute_index = attributes_by_name[attributes]
print(to_search[name][attribute_index])

You could look at the namedtuple class in collections. You will want to make each entry a namedtuple with your fields as attributes. The namedtuple might look like:
Card = namedtuple('Card', 'name, number, description, whatever_else')
As shown in the collections documentation, namedtuple and csv work well together:
import csv
for card in map(Card._make, csv.reader(open("cards", "rb"))):
print card.name, card.description # format however you want here
The mechanics around search can be very complicated. For example, if you want a really fast search built around an exact match, you could build a dictionary for each attribute you're interested in:
name_map = {card.name: card for card in all_cards}
search_result = name_map[name_you_searched_for]
You could also do a startswith search:
possibles = [card for card in all_cards if card.name.startswith(search_string)]
# here you need to decide what to do with these possibles, in this example, I'm just snagging the first one, and I'm not handling the possibility that you don't find one, you should.
search_result = possibles[0]
I recommend against trying to search the file itself. This is an extremely complex kind of search to do and is typically left up to database systems to implement this kind of functionality. If you need to do this, consider switching the application to sqlite or another lightweight database.

Transposing datasets with multiple rows in a file into a single row per dataset

Could I please have some pointers to websites where I can read and get the skills to write python code to do the following?
So far I can only find python code that reads structured data into lists and dictionaries. I need to see an example with line processing to merge multiple rows of data to a single row.
Problem
I have datasets in a file, each dataset is enclosed in {}, with one item per row.
I need to transpose all the items of a data set to a single row ie transpose to tabular> Below is an example
Input file:
details_book1{
title,txt, book_book1
author,txt,author_book1
price,txt, price_book1 }
details_book2
{
title,txt, book_book2
author,txt,author_book2
price,txt, price_book2
}
Output Required:
details_book1,book_book1,author_book1,price_book1
details_book2,book_book2,author_book2,price_book2
...
details_bookn,book_bookn,author_bookn,price_bookn

I'm sorry I don't know of particular references, other than just learning about string and list manipulations, for which the python docs aren't too bad, but it could perhaps be as simple as something like this:
lines = [line for line in a.split('\n') if line]
books = []
book = ''
for line in lines:
if '}' in line:
book += ',' + line
book = book.replace('{', ' ').replace('}', ' ')
books.append([x.strip() for x in book.split(',') if x.strip()])
book = ''
else:
book += line + ','
This would create a list of lists of the entitites, and you could loop through the list, pulling out all the elements into variables:
for book, title, a, bookbook, author, b, authorbook, price, c, pricebook in books:
print '%s,%s,%s,%s' % (book, bookbook, authorbook, pricebook)
# result
details_book1,book_book1,author_book1,price_book1
details_book2,book_book2,author_book2,price_book2
This can fail in a few ways, though, and requires that your data match what you've shown so far. In particular, if you have commas in any of the the text, where I split the book variable around commas inside the second list comprehension will split into too many fields, and the unpacking later in the for loop (last example code snippet) will fail.
Also, if a block starts on the same line as the previous block's }, it will fail to cut up the data correctly. There are ways around this, but I wanted to keep things very simple.
Maybe this can help as a starting point.
I suppose you could do this as well:
import re
for book in re.findall('\w.*?{.*?}', a, flags=re.M|re.S):
book = book.replace('\n',',').replace('{',',').replace('}',',')
book = [x.strip() for x in book.split(',') if x.strip()]
print book
This uses a regular expression via the re.findall to find all words followed by any amount of whitespace, and anything at all (non-greedy) between curly braces. This results in a bit of a mess of newlines and missing commas, so then I replace newlines and braces with commas, then use a list comprehension to split around commas, strip whitespace around each split element, and leave out any empty strings that result.
This results in these lists each time in book:
['details_book1', 'title', 'txt', 'book_book1', 'author', 'txt', 'author_book1', 'price', 'txt', 'price_book1']
['details_book2', 'title', 'txt', 'book_book2', 'author', 'txt', 'author_book2', 'price', 'txt', 'price_book2']
Again, splitting around commas is a problem if anything like book titles or txt blurbs have commas in them (but if they do, I don't know how you're able to tell those blurbs apart from the comma-separated bits on each line).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Why is the index position outputting data from other index's? - python

Related

Making acronym for every airport in Python list?

Python 2.7 Can't Write out file from DictReader with DictWriter after re.findall with Regex

Trying to write a list of dictionaries to csv in Python, running into encoding issues

Get Attribute From Text File in Python

Transposing datasets with multiple rows in a file into a single row per dataset

Categories

Resources