Python convert list into split lists - python

so I have been given the task of using an api to pull student records and learnerID's to put into an in house application. The json formatting is dreadful and the only successful way I managed to split students individually is by the last value.
Now I am at the next stumbling block, I need to split these student lists into smaller sections so I implement a for loop as so:
student = request.text.split('"SENMajorNeedsDetails"')
for students in student:
    r = str(student).split(',')
    print (student[0], student[1])
    print (r[0], r[1])
This works perfectly except this puts it all into a single list again and each student record isn't a set length (some have more values/fields than others).
so what I am looking to do is have a list for each student split on the comma, so student1 would equal [learnerID,personID,name,etc...]
this way when I want to reference the learnerID I can call learner1[0]
It is also very possible that I am going about this the wrong way and I should be doing some other form of list comprehension
my step by step process that I am aiming towards is:
pull data from system - DONE
split data into individual students - DONE
take learnerID,name,group of each student and add database entry
I have split step 3 into two stages where one involves my issue above and the second is the create database records
Below is a shortened example of the list item student[0], followed by student[1]. If more is needed then say.
:null},{"LearnerId":XXXXXX,"PersonId":XXXXXX,"LearnerCode":"XXXX-XXXXXX","UPN":"XXXXXXXXXXX","ULN":"XXXXXXXXXX","Surname":"XXXXX","Forename":"XXXXX","LegalSurname":"XXXXX","LegalForename":"XXXXXX","DateOfBirth":"XX/XX/XXXX 00:00:00","Year":"XX","Course":"KS5","DateOfEntry":"XX/XX/XXXX 00:00:00","Gender":"X","RegGroup":"1XX",],
:null},{"LearnerId":YYYYYYY,"PersonId":YYYYYYYY,"LearnerCode":"XXXX-YYYYYYYY","UPN":"YYYYYYYYYY","ULN":"YYYYYYYYYY","Surname":"YYYYYYYY","Forename":"YYYYYY","LegalSurname":"YYYYYY","LegalForename":"YYYYYYY","DateOfBirth":"XX/XX/XXXX 00:00:00","Year":"XX","Course":"KS5","DateOfEntry":"XX/XX/XXXX 00:00:00","Gender":"X","RegGroup":"1YY",],
Sorry, it doesn't like putting it on separate lines
EDIT* changed wording at the end and added a redacted student record

Just to clarify, the resolution to my issue was to learn how to parse JSON properly. This was pointed out by @Patrick Haugh and all credit should go to him for pointing me in the right direction. Second most helpful person was @ArndtJonasson
The problem was that I was manually trying to do the job of the JSON library and I am nowhere near that level of competency yet. As stated originally, it was totally likely that I was going about it in completely the wrong way.
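For anyone landing here with the same problem, a minimal sketch of the JSON-library approach follows. The payload below is made up for illustration (the real API response is not shown in full above), and the field names are taken from the redacted sample; the assumption is that the API returns a valid JSON array of student objects.

```python
import json

# Hypothetical payload: a small, valid JSON array shaped like the redacted sample.
payload = (
    '[{"LearnerId": 101, "Surname": "Smith", "Forename": "Ann", '
    '"RegGroup": "12A", "SENMajorNeedsDetails": null},'
    ' {"LearnerId": 102, "Surname": "Jones", "Forename": "Bob", '
    '"RegGroup": "12B", "SENMajorNeedsDetails": null}]'
)

# json.loads turns the text into a list of dicts, one dict per student.
students = json.loads(payload)

# Pick out only the fields needed for the database entry.
rows = []
for student in students:
    name = student["Forename"] + " " + student["Surname"]
    # .get() returns None instead of raising if a record lacks the field.
    rows.append((student["LearnerId"], name, student.get("RegGroup")))

for row in rows:
    print(row)
```

Because each student is a dict, records with more or fewer fields stop being a problem: fields are looked up by name rather than by position. If the response object comes from the requests library, its json() method should give the same list without the explicit json.loads call.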

Related

Can I amend one data sheet to match another data frame's ID that are almost similar?

I have multiple data frames to compare. My problem is the product IDs. one is set up like:
000-000-000-000
Vs
000-000-000
(gross)
I have looked on here, Reddit, YouTube, and even went deep down the rabbit hole trying .join, .append, and some other methods I've never seen before or don't understand yet. Is there a way (or, even better, some documentation I can read to learn this) to pull the Product ID from the main Excel sheet and compare it to the one(s) that should match? Then I will more than likely change the IDs in place across all sheets. That way I can use those IDs as the index and do a side-by-side comparison of the row data for each ID. Each ID has about 113 values to compare: that's 113 columns for each row, if that makes sense.
Example: (colorful columns is main sheet that the non colored column will be compared to)
additional notes:
The highlighted yellow IDs are "unique", and I won't be changing those. Instead I will write them to a list or something and use an if statement to ignore them when found.
Edit:
So I wrote this code, which is almost exactly what I need.
It takes out the "-" which I apply to all my IDs. Just need to make a list of ID that are unique to skip over on taking away the zeros
dfSS["Product ID"] = dfSS["Product ID"].str.replace("-", "")
Then this will only list the digits up to 9 digits, except the unique IDs
dfSS["Product ID"] = dfSS["Product ID"].str[:9]
Will add the full code below here once i get it to work 100%
I am now trying to figure out how to say something like
lst =[1,2,3,4,5]
if dfSS["Product ID"] not in lst:
    dfSS["Product ID"] = dfSS["Product ID"].str.replace("-", "").str[:9]
This code does not work, but every day I get closer and closer to being able to compare these similar yet different data frames. The lst is just an example standing in for a list of 000-000-000 Product IDs that I do not want to filter at all, but keep in the data frame.
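The "skip the IDs in my list" logic can be sketched in plain Python first. The function name normalize_id, the keep set, and the sample IDs below are all made up for illustration; the transformation itself (strip dashes, keep the first 9 characters) is the one from the code above.

```python
def normalize_id(product_id, keep):
    """Strip dashes and keep the first 9 characters, unless the ID is exempt."""
    if product_id in keep:
        # Exempt ("unique") IDs are returned untouched.
        return product_id
    return product_id.replace("-", "")[:9]


# Hypothetical exempt IDs and sample data.
keep = {"000-000-000"}
ids = ["123-456-789-012", "000-000-000", "987-654-321"]

normalized = [normalize_id(product_id, keep) for product_id in ids]
print(normalized)
```

The if statement in the question fails because it tests the whole column at once rather than one value at a time. In pandas the same per-value function could probably be applied with Series.map, or the exempt rows could be selected with a boolean mask built from Series.isin, but that part is untested here.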
If the ID transformation is predictable, then one option is to use regex for homogenizing IDs. For example if the situation is just removing the first three digits, then something like the following can be used:
df['short_id'] = df['long_id'].str.extract(r'\d\d\d-([\d-]*)')
If the ID transformation is not so predictable (e.g. due to transcription errors or some other noise in the data) then the best option is to first disambiguate the ID transformation using something like recordlinkage, see the example here.
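To see what the pattern in the first option captures, here is a small sketch using the standard re module on single strings. The helper name short_id and the sample IDs are invented for illustration; the pattern is the one from the answer.

```python
import re

# Same pattern as in the str.extract call: skip three digits and a dash,
# then capture the remaining run of digits and dashes.
SHORT_ID = re.compile(r'\d\d\d-([\d-]*)')


def short_id(long_id):
    """Return the captured tail of the ID, or None when the pattern is absent."""
    match = SHORT_ID.search(long_id)
    if match is None:
        return None
    return match.group(1)


print(short_id("123-456-789-012"))
print(short_id("no digits here"))
```

Checking the pattern on a handful of real IDs this way, before running it over the whole column, makes it easier to spot IDs that do not follow the expected layout.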
OK, I solved this for every Product ID, with or without dashes, #, letters, etc.
(\d\d\d-)?[_#\d-]?[a-zA-Z]?
(\d\d\d-)? - This is for the first & second sets of three integers followed by a dash. The ? makes the group optional (zero or one match).
[_#\d-]? - This is for any special chars and additional numbers (optional, zero or one character).
[a-zA-Z]? - This, not sure why, but I had to separate it from the last part because otherwise it wouldn't pick up every letter (optional, zero or one character).
With the above I solved everything I needed for RE.
Where I learned how to improve my RE skills:
RE Documentation
Automate the Boring Stuff- Ch 7
You can test your REs here
Additional way to show this. Put this here to show there is no one way of doing it. RE is super awesome:
(\d{3}-)?[_#\d{3}-]?[a-zA-Z]?

Display text from a key/value pair in a nested dictionary

I'm a novice.
I am trying to print the elements from the Periodic Table to the screen arranged like the table itself. I'm using (' - ') to separate the symbols that I haven't written in the dictionary yet. I'm only using a nested dictionary with two entries total to minimize confusion.
Training Source last exercise.
I asked this question elsewhere and someone (correctly) suggested using str.join(list), but it wasn't part of the tutorial.
I'm trying to teach myself and I want to understand. No schooling, no work, no instructor.
The hints at the bottom of the linked tutorial says:
1."Use a for loop to loop through each element. Pick out the elements' row numbers and column numbers."
2."Use two nested for loops to print either an element's symbol or a series of spaces, depending on how full that row is."
I'd like to solve it this way. Thanks in advance.
Note* No, pre-intermediate, intermediate or advanced code please, the tutorial has only covered code related to variables, strings, numbers, lists, tuples, functions(beginners),if statements, while loops, basic terminal apps and dictionaries.
Lastly I'd like to have the table itself printed with the shape of the real Periodic Table. If you could throw in a bit of that code for a novice it'd really help thanks.
My attempt(wrong):
ptable = {'mercury':{'symbol':'hg','atomic number': '80','row': '6','column': '12','weight':'200.59',}, 'tungsten':{'symbol':'w','atomic number':'74','row':'6','column':'6','weight':'183.84'},}
for line in range(1,7):
    for key in ptable:
        row = int(ptable[key]['row'])
        column = int(ptable[key]['column'])
        if line != row:
            print('-'*18)
        else:
            space = 18 - column
            print('-'*(column-1)+ptable[key]['symbol']+'-'*space)
outputs:
------------------
------------------
------------------
------------------
------------------
------------------
------------------
------------------
------------------
------------------
-----------hg------
-----w------------
The output should have 7 lines as in the Periodic table. It is supposed to display the symbols of each element in the correct place as in the Periodic Table. Since I only have two elements in the library it should show Hg and W in their correct places
The experienced programmers' solution:
for line in range(1, 8): # line will count from 1 to 7
    # display represents the line of elements we will print later
    # hyphens show that the space is empty
    # so we fill the list with hyphens at first
    display = ['-'] * 18
    for key in ptable:
        if int(ptable[key]['row']) == line:
            # if our element is on this line
            # add that element to our list
            display[int(ptable[key]['column']) - 1] = ptable[key]['symbol']
    # str.join(list) will "join" the elements of your list into a new string
    # the string you call it on will go in between all of your elements
    print(''.join(display))
I honestly think this code isn't that hard to understand, and I think trying to change it would only make it more complicated. I'm going to put some links at the end so you can check them out and understand the ''.join() method and the range() function, which you seem not to understand either. You said you wanted to learn Python by yourself and that's a great thing! (I'm doing it too.) But that doesn't mean you have to stick to a tutorial ;). You can go beyond it, skip the parts you don't care about, and come back later when you need them. If you need explanations about methods (like ''.join) or anything else, let me know. Sorry if that doesn't help you ;(.
Links:
The .join() method
The range() function
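For comparison, the two hints from the tutorial (loop over rows, then over columns, and decide cell by cell) can be sketched without str.join, using only loops, if statements, and string concatenation. The function name render_table is made up for this sketch, and the dictionary is a trimmed copy of the one in the question with capitalised symbols.

```python
ptable = {
    'mercury': {'symbol': 'Hg', 'row': '6', 'column': '12'},
    'tungsten': {'symbol': 'W', 'row': '6', 'column': '6'},
}


def render_table(table, rows=7, columns=18):
    """Build one string per table row, with '-' for every empty cell."""
    lines = []
    for row in range(1, rows + 1):
        line = ''
        for column in range(1, columns + 1):
            cell = '-'  # assume the cell is empty until an element claims it
            for key in table:
                element = table[key]
                if int(element['row']) == row and int(element['column']) == column:
                    cell = element['symbol']
            line = line + cell
        lines.append(line)
    return lines


for line in render_table(ptable):
    print(line)
```

The first attempt in the question printed one full line per element per row, which is why it produced twelve lines. Here each row is built once and printed once. Two-letter symbols still make a row one character wider; padding every cell to two characters would be one way to keep the columns aligned.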

Python Lists/ Tuples in terms of sorting

I am struggling to overcome a problem/task in Python and I'm really stuck for ideas. I need to read two lines from a file, sort one of these lines (from multiple files which are determined from user-inputted data) but return both pieces of data, in the context of a running club. The user's average miles per hour will be calculated over a few weeks and stored in a .txt file alongside a user id stored at the beginning of the program. The final section of the program will need to read these files (the user id and the average miles per hour) and sort the average miles per hour while keeping the user id (returning both together, allowing for a summary). I then need to state the top few runners. Any help would be much appreciated. I have not used SQL etc., just line-by-line, standard Python. My code is un-optimized but I'm at 'the home straight' with it now. Also, my friend suggested that I use tuples. I have played with tuples but never properly integrated them, as in all honesty I don't know where to begin. Please excuse any 'elementary mistakes'. I am also finding problems with the saving of variables, as they clash with the operators, which means I cannot globalize them without defining each and every one.
def retrieve():
    global datasave,y ###don't know where to go from this as it does not work
    y=1
    if y>3: #just practiced with 2 'users'
        y=str(y)
        print("All members entered and saved, a comparision of average miles per hour will be intiated")
        file=open(y+".txt",'r') #saves files in which they occur for easy 'read'
        datasave=file.readline(5)
        datasave,y=datasave #gave me a 'cannot assign to operator' error
        y=int(y)
        y=y+1
    else:
        avmphlist=[datasave1,datasave2,datasave3,datasave4,datasave5,datasave6,datasave7,datasave8,datasave9,datasave10]
        sorted(avmphlist)
        print(avmphlist)
        print("Unfortunately ",avmphlist[9]," and ",avmphlist[10],"have not made the team, thank you for your participation")
        print("Congratulations to ",avmphlist[1],", ",avmphlist[2],", ",avmphlist[3],", ",avmphlist[4],", ",avmphlist[5],", ",avmphlist[6],", ",avmphlist[7]," and ",avmphlist[8],)
Start by defining a list of tuples for your data:
runnerData = [("TestName1", 70),("TestName2", 50), ("TestName3", 60)]
Now use the built-in sorted function:
sortedRunnerData = sorted(runnerData, key=lambda data: data[1])
Now you have a sorted list of tuples of your data (ascending). If you need it in descending order just reverse the list:
sortedRunnerData.reverse()
Now the list sortedRunnerData contains the data in descending order.
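Putting the pieces together, a rough sketch of the whole "read, pair up, sort, pick the team" step could look like the following. The two-line file layout (user id on the first line, average mph on the second), the function names, and the sample values are assumptions made for illustration; the strings in files stand in for the contents of the .txt files.

```python
def parse_runner(text):
    """Turn the two lines of one runner's file into a (user_id, mph) tuple."""
    user_id, mph = text.strip().splitlines()[:2]
    return (user_id.strip(), float(mph))


def top_runners(records, count):
    """Sort by speed, fastest first, and split into team and non-team."""
    ranked = sorted(records, key=lambda record: record[1], reverse=True)
    return ranked[:count], ranked[count:]


# Hypothetical file contents, one string per runner file.
files = ["U001\n6.5\n", "U002\n7.25\n", "U003\n5.0\n"]

records = [parse_runner(text) for text in files]
team, missed = top_runners(records, 2)

print("Congratulations to", [user_id for user_id, mph in team])
print("Unfortunately not on the team:", [user_id for user_id, mph in missed])
```

Keeping the id and the speed in one tuple is what lets the id travel with the speed through the sort, so no global variables or numbered datasave names are needed. Passing reverse=True to sorted gives descending order in one step instead of a separate reverse() call.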

How to 'flatten' lines from text file if they meet certain criteria using Python?

To start, I am a complete newcomer to Python and to programming anything other than web languages.
So, I have developed a script using Python as an interface between a piece of Software called Spendmap and an online app called Freeagent. This script works perfectly. It imports and parses the text file and pushes it through the API to the web app.
What I am struggling with is that Spendmap exports multiple lines per order, whereas Freeagent wants one line per order. So I need to add up the cost values from any orders spread across multiple lines and then 'flatten' the lines into one so it can be sent through the API. The 'key' field is the 'PO' field. So if the script sees any matching PO numbers, I want it to flatten them as per above.
This is a 'dummy' example of the text file produced by Spendmap:
5090071648,2013-06-05,2013-09-05,P000001,1133997,223.010,20,2013-09-10,104,xxxxxx,AP
COMMENT,002091
301067,2013-09-06,2013-09-11,P000002,1133919,42.000,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
301067,2013-09-06,2013-09-11,P000002,1133919,359.400,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
301067,2013-09-06,2013-09-11,P000003,1133910,23.690,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
The above has been formatted for easier reading and normally is just one line after the next with no text formatting.
The 'key' or PO field is the first bold item (the fourth field, e.g. P000002) and the second bold/italic item is the cost to be totalled (the sixth field, e.g. 42.000). So if this example was passed through the script, I'd expect the first row to be left alone, the second and third row costs to be added as they're both from the same PO number, and the fourth line to be left alone.
Expected result:
5090071648,2013-06-05,2013-09-05,P000001,1133997,223.010,20,2013-09-10,104,xxxxxx,AP
COMMENT,002091
301067,2013-09-06,2013-09-11,P000002,1133919,401.400,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
301067,2013-09-06,2013-09-11,P000003,1133910,23.690,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
Any help with this would be greatly appreciated and if you need any further details just say.
Thanks in advance for looking!
I won't give you the solution. But you should:
Write and test a regular expression that breaks the line down into its parts, or use the CSV library.
Parse the numbers out so they're decimal numbers rather than strings
Collect the lines up by ID. Perhaps you could use a dict that maps IDs to lists of orders?
When all the input is finished, iterate over that dict and add up all orders stored in that list.
Make a string format function that outputs the line in the expected format.
Maybe feed the output back into the input to test that you get the same result. Second time round there should be no changes, if I understood the problem.
Good luck!
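The steps above can be sketched roughly as follows, using the csv module for splitting and Decimal for the sums. This assumes every order line is followed by exactly one COMMENT line, as in the sample, and that the comment of the first line seen for a PO is the one to keep; the function name flatten_orders is made up for this sketch.

```python
import csv
from decimal import Decimal

SAMPLE = """5090071648,2013-06-05,2013-09-05,P000001,1133997,223.010,20,2013-09-10,104,xxxxxx,AP
COMMENT,002091
301067,2013-09-06,2013-09-11,P000002,1133919,42.000,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
301067,2013-09-06,2013-09-11,P000002,1133919,359.400,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143
301067,2013-09-06,2013-09-11,P000003,1133910,23.690,20,2013-10-31,103,xxxxxx,AP
COMMENT,002143"""


def flatten_orders(text):
    """Merge order lines that share a PO number, summing their costs."""
    lines = text.strip().splitlines()
    orders = {}  # PO number -> first fields seen, running cost, comment line
    # Walk the file two lines at a time: an order line and its COMMENT line.
    for order_line, comment_line in zip(lines[0::2], lines[1::2]):
        fields = next(csv.reader([order_line]))
        po = fields[3]
        cost = Decimal(fields[5])
        if po in orders:
            orders[po]['cost'] += cost
        else:
            orders[po] = {'fields': fields, 'cost': cost, 'comment': comment_line}
    output = []
    for entry in orders.values():  # dicts keep insertion order in Python 3.7+
        fields = entry['fields'][:5] + [str(entry['cost'])] + entry['fields'][6:]
        output.append(','.join(fields))
        output.append(entry['comment'])
    return output


for line in flatten_orders(SAMPLE):
    print(line)
```

Decimal is used rather than float so that the three decimal places survive the addition exactly. Feeding the output back in should return it unchanged, which is the round-trip check suggested in the last step.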
I would use a dictionary to compile the lines, using get(key,0.0) to sum values if they exist already, or start with zero if not:
InputData = """5090071648,2013-06-05,2013-09-05,P000001,1133997,223.010,20,2013-09-10,104,xxxxxx,AP COMMENT,002091
301067,2013-09-06,2013-09-11,P000002,1133919,42.000,20,2013-10-31,103,xxxxxx,AP COMMENT,002143
301067,2013-09-06,2013-09-11,P000002,1133919,359.400,20,2013-10-31,103,xxxxxx,AP COMMENT,002143
301067,2013-09-06,2013-09-11,P000003,1133910,23.690,20,2013-10-31,103,xxxxxx,AP COMMENT,002143"""
OutD = {}
ValueD = {}
for Line in InputData.split('\n'):
    # commas in comments won't matter because we are joining after anyway
    Fields = Line.split(',')
    PO = Fields[3]
    Value = float(Fields[5])
    # set up the output string with a placeholder for .format()
    OutD[PO] = ",".join(Fields[:5] + ["{0:.3f}"] + Fields[6:])
    # add the value to the old value or to zero if it is not found
    ValueD[PO] = ValueD.get(PO,0.0) + Value
# the output is unsorted by default, but you could sort or preserve original order
for POKey in ValueD:
    print(OutD[POKey].format(ValueD[POKey]))
P.S. Yes, I know Capitals are for Classes, but this makes it easier to tell what variables I have defined...

Reading text and assigning a class to data in Python

I've been searching around, and had no luck finding anything answering my question.
Essentially I have a file with the following data:
Title - 19
Artist - Adele
Year released - 2008
1 - Daydreamer, 3:41, 1
2 - Best for Last, 4:19, 5
3 - Chasing Pavements, 3:31, 7
4 - Cold Shoulder, 3:12, 3
Title - El Camino
Artist - The Black Keys
Year released - 2011
1 - Lonely Boy, 3:13, 1
2 - Run Right Back, 3:17, 10
EOF
I know how to create classes, and how to assign an object to a class and values to that object, but I am just about ready to tear my hair out on how it is I'm supposed to process the text. From text, I need to create a title for the album, and assign the album's information to it. There's more else besides that needs to be done, and there are more lines to be read, and I just don't know where to start on this. I've found two "album.py" files via google, and I've been unable to make heads or tails of how to apply the solution to my case.
And yes, this is for a school assignment. I've done some digging around and found some things relevant, but I'm just not understanding it. I'm new to programming in general, and I've made progress but this seems too far over my head.
I know I could reduce this to lists using split (\n\n) and operating on a series of progressively smaller lists, but I am trying to avoid this method at all costs.
EDIT:
For the time being, it's best to assume I know nothing. Though, to answer below question: I can open the file and read it. If its a consistent CSV formatted file, I can write code to process the enclosed data, and create a class structure that uses that data. Right now I'm just having trouble with the first three lines, and the digits immediately below.
APRIL 4 2012:
Okay, I have some code, I've left the comments with respect to it underneath.
def getInput():
    global albums
    raw = open("album.txt","r")
    infile = raw
    raw.close
    text=""
    line = infile.readline()
    while (line != "EOF\n" ):
        text += line
        line=infile.readline()
    text=text.rstrip("\n\n")
    albums=[str(n) for n in text.split("\n\n")]
    return albums
class Album():
    def __init__(self, title, artist, date):
        self.title=title
        self.artist=artist
        self.date=date
        self.track={}
    def addSong(self, TrackID, title, time, ranking):
        self.track+={self}
    def getAlbumLength(self):
        asdf=0
    def getRanking(self):
        asdf=0
def labels(x): #establishes labels per item to be used for Album Classifier
    title=""
    artist=""
    date=""
    for i in range(0,len(albums),1):
        sublist=[str(n) for n in albums[i].split("\n")]
        RANDUMB=len(albums[i])
        title=sublist[0]
        artist=sublist[1]
        date=sublist[2]
        for j in range(0,len(sublist),1):
            song_info = [str(k) for k in sublist[3:].split("," and " - ")]
            TrackID=song_info[0]
            title=song_info[1]
            time=song_info[2]
            ranking=song_info[3]
getInput()
labels(albums)
Personal comments on code:
I was trying to avoid getting it into lists because I anticipated this problem. As the functions are concerned, I have to use every single bloody one, because it's in the assignment requirements... I am displeased because I could probably get around using them. The code is working sufficiently enough, except for the last part of it where I am trying to take the song information. I want to split the song information into lists, which are nested into the album information list. Something like:
[Album title, Artist, Date released,[01,Song,3:44,2],[02,Song,0:01,9]....]
The current code gives me index out of range error as of right now... I am using python3.
TLDR: The substance of my problem has thus changed from one of trying to solve how to go about starting the solution to how to take items in a list and convert them into nested lists.
If you end up editing your question to contain some more specific examples of what is giving you trouble, I will edit this answer. But to address your general question, there are some steps involved to achieving your goal.
Like you said, you need to write a class that reflects the structure you intend to have from this data.
You will need to parse this file, probably line by line. So you have to determine if this file format is consistent. If it is, then you need to determine:
What is the delimiter between each set of data, which will be conformed into a class instance?
What is the delimiter between each field of each line?
When you are looping over each line, you will know that you need to start a new album object whenever you encounter a blank line.
When you know you are starting a new album, you can assume that the first line will be a title, the second an artist, the third, the year, etc.
For each of these lines you will also have to have rules of how to split each one into the data you want. At a basic level it can be a simple set of splits. At a more advanced level you might define regular expressions for each type of line.
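A minimal sketch of those steps, under a few assumptions: albums are separated by a blank line (as the split on "\n\n" in the question suggests), the file ends with an EOF line, and song titles contain no commas. The method name add_song and the helpers parse_albums and build_album are invented here and are not the names the assignment requires.

```python
SAMPLE = """Title - 19
Artist - Adele
Year released - 2008
1 - Daydreamer, 3:41, 1
2 - Best for Last, 4:19, 5

Title - El Camino
Artist - The Black Keys
Year released - 2011
1 - Lonely Boy, 3:13, 1
2 - Run Right Back, 3:17, 10
EOF
"""


class Album:
    def __init__(self, title, artist, date):
        self.title = title
        self.artist = artist
        self.date = date
        self.tracks = []  # one (track_id, title, time, ranking) tuple per song

    def add_song(self, track_id, title, time, ranking):
        self.tracks.append((track_id, title, time, ranking))


def build_album(lines):
    """Create an Album from the non-blank lines belonging to one album."""
    # The first three lines look like 'Label - value'; keep the value part.
    title = lines[0].split(' - ', 1)[1]
    artist = lines[1].split(' - ', 1)[1]
    date = lines[2].split(' - ', 1)[1]
    album = Album(title, artist, date)
    for line in lines[3:]:
        # Song lines look like '1 - Daydreamer, 3:41, 1'.
        track_id, rest = line.split(' - ', 1)
        song_title, time, ranking = [part.strip() for part in rest.split(',')]
        album.add_song(int(track_id), song_title, time, int(ranking))
    return album


def parse_albums(text):
    """Walk the text line by line; a blank line or EOF closes the current album."""
    albums = []
    current = []
    for line in text.splitlines():
        line = line.strip()
        if line == 'EOF':
            break
        if line:
            current.append(line)
        elif current:
            albums.append(build_album(current))
            current = []
    if current:
        albums.append(build_album(current))
    return albums


for album in parse_albums(SAMPLE):
    print(album.title, album.artist, album.date, album.tracks)
```

The index-out-of-range error in the question most likely comes from calling split on a list slice and from `"," and " - "`, which evaluates to just " - ". Splitting once on " - " and then on "," handles the two delimiters separately.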
