Create variable for each unique value over two lists - python

Apologies in advance for the lengthy post. I am only nominally familiar with Python, but I think it might be able to accomplish this task easily. Some background:
I have survey data where respondents were asked to select the two schools they’re considering applying to out of a list of 1500 or so. The data are stored as two variables (one per institution selected – vname “Institution_1”, “Institution_2”) where each value uniquely identifies a particular institution.
Later on, respondents rate the institutions they selected on a 1 to 6 scale on a series of attributes. Each of these ratings is stored as a separate scale variable in the data, and I have two of them, one for each position the institution was selected in. If, for example, Adelphi University is “Institution_1”, then the rating on “Core academics” is stored in variable “Q.32_combined_1”; if Adelphi University is “Institution_2”, then the rating on “Core academics” is stored in variable “Q.36_combined_1”.
I want to combine the ratings for each institution and here’s the SPSS syntax for doing so for this one institution (Adelphi is uniquely identified with a meaningful value of 188429):
DO IF (Institution_1 = 188429).
COMPUTE Adelphi_CoreAcad=Q.32_combined_1.
ELSE IF (Institution_2 = 188429).
COMPUTE Adelphi_CoreAcad=Q.36_combined_1.
END IF.
EXECUTE.
But we have 1,000+ institutions in our data. How can we create a variable for each unique value over these two lists (Institution_1 and Institution_2)?
Is there a way to use Python to create these variables and/or build the SPSS syntax that would work?
Thanks!

Try this. It's rough, since I don't have SPSS, but I think it's what you're asking for. (Note: I'm not sure that what you're asking for is the right thing, but see if it works, and maybe we'll go from there.)
This creates a set of variables named U188429_CoreAcad, etc., where the U is just a leading prefix ("U" for "Unit ID"), 188429 is the unit id, and "CoreAcad" is a made-up string you can change.
I used categories 'CoreAcad', 'PrettyCoeds', 'FootballTeam' and 'Drinking', because if I had it all to do over again, that's how I would have rated schools. (Except for 'CoreAcad,' which was your thing.)
I assumed that your categories were 32-35 for institution 1, and 36-39 for institution 2. You can change those below as well.
I assumed that you can spss.Submit a bunch of lines together. If not, split the string up and submit the lines one at a time.
I commented out "BEGIN PROGRAM", "import spss", "END PROGRAM" because I'm just feeding stuff into a command-line python2.7. Uncomment those for your use.
#BEGIN PROGRAM.
#import spss, spssaux
# According to the internet, unitids are sparse values.
Unit_ids = [
    188429,  # Adelphi
    188430,  # Random #s
    171204,
    100001,
]
Categories = {
    'CoreAcad': ('Q.32_combined_1', 'Q.36_combined_1'),
    'PrettyCoeds': ('Q.33_combined_1', 'Q.37_combined_1'),
    'FootballTeam': ('Q.34_combined_1', 'Q.38_combined_1'),
    'Drinking': ('Q.35_combined_1', 'Q.39_combined_1'),
}
code = """
DO IF (Institution_1 = %(unitid)d).
COMPUTE U%(unitid)d_%(category)s = %(answer1)s.
ELSE IF (Institution_2 = %(unitid)d).
COMPUTE U%(unitid)d_%(category)s = %(answer2)s.
END IF.
EXECUTE.
"""
for unitid in Unit_ids:
    for category, answers in Categories.items():  # .iteritems() on Python 2
        answer1, answer2 = answers
        print(code % locals())
        #spss.Submit(code % locals())
#END PROGRAM.

I suggest a different approach: restructure the data.
First, separate the two institutions into two lines, each with its corresponding ratings:
varstocases /make institution from Institution_1 Institution_2
/make CoreAcad from Q.32_combined_1 Q.36_combined_1
/make otherRting from inst1var inst2var.
You can add another make subcommand for each additional rating that corresponds to each of the two institutions.
At this point your data has one line per single institution and its ratings.
You can now analyze them, e.g.:
means CoreAcad otherRting by institution.
Or you can aggregate by institution to analyze their ratings. For example:
DATASET DECLARE AggByInst.
AGGREGATE /OUTFILE='AggByInst' /BREAK=institution
/MCoreAcad MotherRting =MEAN(CoreAcad otherRting).
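For comparison, here is a minimal sketch of the same restructure-and-aggregate idea in pandas (made-up data; it assumes the survey has been read into a DataFrame with the column names used above):
import pandas as pd

# Hypothetical wide-format data: one row per respondent.
df = pd.DataFrame({
    'Institution_1': [188429, 171204],
    'Institution_2': [100001, 188429],
    'Q.32_combined_1': [5, 4],  # CoreAcad rating for Institution_1
    'Q.36_combined_1': [3, 6],  # CoreAcad rating for Institution_2
})

# Stack the two institution/rating pairs into one long table,
# like VARSTOCASES does.
long_df = pd.concat([
    df[['Institution_1', 'Q.32_combined_1']]
        .set_axis(['institution', 'CoreAcad'], axis=1),
    df[['Institution_2', 'Q.36_combined_1']]
        .set_axis(['institution', 'CoreAcad'], axis=1),
], ignore_index=True)

# Mean rating per institution, analogous to the AGGREGATE step.
print(long_df.groupby('institution')['CoreAcad'].mean())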

Can I use ML/NLP to determine patterns in the type of usernames that are generated?

I have a dataset which has first and last names along with their respective email ids. Some of the email ids follow a certain pattern such as:
Fn1 = John, Ln1 = Jacobs, eid1 = jj@xyz.com
Fn2 = Emily, Ln2 = Queens, eid2 = eq@pqr.com
Fn3 = Harry, Ln3 = Smith, eid3 = hsm@abc.com
The content after @ has no importance for finding the pattern. I want to find out how many people follow a certain pattern, and what that pattern is. Is it possible to do so using NLP and Python?
EXTRA: To know which pattern applies to how many people, can we store examples of each pattern along with its count in an Excel sheet?
You certainly could - e.g., you could try to learn a relationship between your input and output data as
(Fn, Ln) --> eid
and further dissect this relationship into patterns.
However, before hitting the problem with complex tools (especially if you're new to ML/NLP), I'd do further analysis of the data first.
For example, I'd first be curious to see what portion of your data displays the clear patterns you've shown in the examples - using the first character(s) from the individual's first/last name to build the corresponding eid (which could be determined easily programmatically).
Setting aside that portion of the data that satisfies this clear pattern - what does the remainder look like?
Is there another clear, but different, pattern in some of this data?
If there is - I'd then perform the same exercise again - construct a proper filter to collect and set aside data satisfying that pattern - and examine the remainder.
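As a minimal sketch of what that first programmatic filter could look like (assuming addresses have the form local-part@domain, as in your examples):
from collections import Counter

def initials_split(first, last, email):
    # Return (i, j) if the local part is the first i letters of the
    # first name followed by the first j letters of the last name,
    # else None.
    local = email.split('@')[0].lower()
    first, last = first.lower(), last.lower()
    for i in range(1, len(local)):
        if first.startswith(local[:i]) and last.startswith(local[i:]):
            return (i, len(local) - i)
    return None

# Hypothetical records; count how often each (i, j) split occurs.
records = [('John', 'Jacobs', 'jj@xyz.com'),
           ('Emily', 'Queens', 'eq@pqr.com'),
           ('Harry', 'Smith', 'hsm@abc.com')]
counts = Counter(initials_split(fn, ln, eid) for fn, ln, eid in records)
print(counts)  # Counter({(1, 1): 2, (1, 2): 1})
Records whose pattern comes back as None are the remainder to examine next.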
Doing this analysis might help answer at least part of your inquiry rather quickly:
"To know which pattern applies to how many people, can we store examples of each pattern along with its count in an Excel sheet?"
Moreover, it will help determine
a) whether you even need to use more complex tooling (if enough patterns can be easily sieved out this way, is it worth the investment to go heavier?), or
b) if not, which portion of the data to target with heavier tools (the remainder of this process - those not containing simple patterns).

Plotting OpenStreetMap relations does not generate continuous lines

All,
I have been working on an index of all MTB trails worldwide. I'm a Python person so for all steps involved I try to use Python modules.
I was able to grab relations from the OSM overpass API like this:
from OSMPythonTools.overpass import Overpass
overpass = Overpass()
def fetch_relation_coords(relation):
    rel = overpass.query('rel(%s); (._;>;); out;' % relation)
    return rel
rel = fetch_relation_coords("6750628")
I'm choosing this particular relation (6750628) because it is one of several that result in discontinuous (or otherwise erroneous) plots.
I process the "rel" object to get a pandas.DataFrame like this:
elements = pd.DataFrame(rel.toJSON()['elements'])
"elements" looks like this:
The Elements pandas.DataFrame contains rows of the types "relation" (1 in this case), several of the type "way" and many of the type "node". It was my understanding that I would use the "relation" row, "members" column to extract the order of the ways (which point to the nodes), and use that order to make a list of the latitudes and longitudes of the nodes (for later use in leaflet), in the correct order, that is, the order that leads to continuous path on a map.
However, that is not the case. For this particular relation, I end up with the following plot:
If we compare that with the way the relation is displayed on openstreetmap.org itself, we see that it goes wrong (focus on the middle, eastern part of the trail). I have many examples of this happening, although there are also a lot of relations that do display correctly.
So I was wondering, what am I missing? Are there nodes with tags that need to be ignored? I already tried several things, including leaving out nodes with any tags, this does not help. Somewhere my processing is wrong but I don't understand where.
You need to sort the ways inside the relation yourself. Only a few relation types require sorted members, for example some route relations such as route=bus and route=tram. Others may have sorted members, such as route=hiking, route=bicycle etc., but they don't require them. Various other relations, such as boundary relations (type=boundary), usually don't have sorted members.
I'm pretty sure there are already various tools for sorting relation members; obviously this includes the openstreetmap.org website, where this relation is shown correctly. Unfortunately I'm not able to point you to these tools, but I guess a little bit of research will reveal some.
If I opt to just plot the different ways on top of each other, I indeed get a continuous plot (index contains the indexes for all nodes per way):
In the database I would have preferred to have the nodes sorted anyway, because I could use them to make a GPX file on the fly. But I guess I did answer my own question with this approach; thank you @scai for tipping me in this direction.
You could have a look at shapely.ops.linemerge, which seems to be smart enough to chain multiple linestrings even if the directions are inconsistent. For example (adapted from here):
from shapely import geometry, ops
line_a = geometry.LineString([[0,0], [1,1]])
line_b = geometry.LineString([[1,0], [2,5], [1,1]]) # <- switch direction
line_c = geometry.LineString([[1,0], [2,0]])
multi_line = geometry.MultiLineString([line_a, line_b, line_c])
merged_line = ops.linemerge(multi_line)
print(merged_line)
# output:
LINESTRING (0 0, 1 1, 2 5, 1 0, 2 0)
Then you just need to make sure that the endpoints match exactly.
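Applied to the OSM case, a rough sketch might look like this (it assumes elements is the DataFrame built from rel.toJSON()['elements'] above, where node rows carry id/lat/lon and way rows carry a nodes list):
from shapely import geometry, ops

# Map each node id to its coordinates.
node_rows = elements[elements['type'] == 'node']
coords = dict(zip(node_rows['id'], zip(node_rows['lon'], node_rows['lat'])))

# One LineString per way, in whatever order the ways come in.
lines = [geometry.LineString([coords[n] for n in way_nodes])
         for way_nodes in elements.loc[elements['type'] == 'way', 'nodes']]

merged = ops.linemerge(geometry.MultiLineString(lines))
# merged is a single LineString if the ways chain end to end;
# otherwise it stays a MultiLineString wherever gaps remain.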

I want to append text into a new list until the list meets specific strings

I'm pre-processing the Trump-Hillary debate script text to create 3 lists, each of which will contain one of the three speakers' lines.
The entire script is 1,046 list items.
Some of the text is as follows:
for i in range(len(loaded_txt)):
    print("loaded_txt[i]", loaded_txt[i])
loaded_txt[i] TRUMP: No, it's going to totally help you. And one thing we have to do: Repeal and replace the disaster known as Obamacare. It's destroying our country. It's destroying our businesses, our small business and our big businesses. We have to repeal and replace Obamacare.
loaded_txt[i]
loaded_txt[i] You take a look at the kind of numbers that that will cost us in the year '17, it is a disaster. If we don't repeal and replace -- now, it's probably going to die of its own weight. But Obamacare has to go. It's -- the premiums are going up 60 percent, 70 percent, 80 percent. Next year they're going to go up over 100 percent.
loaded_txt[i]
loaded_txt[i] And I'm really glad that the premiums have started -- at least the people see what's happening, because she wants to keep Obamacare and she wants to make it even worse, and it can't get any worse. Bad health care at the most expensive price. We have to repeal and replace Obamacare.
loaded_txt[i]
loaded_txt[i] WALLACE: And, Secretary Clinton, same question, because at this point, Social Security and Medicare are going to run out, the trust funds are going to run out of money. Will you as president entertain -- will you consider a grand bargain, a deal that includes both tax increases and benefit cuts to try to save both programs?
I tried to append each item to TRUMP_script_list if it has "TRUMP:" in it, like this:
TRUMP_script_list = []
for i in range(len(loaded_txt)):
    if "TRUMP:" in loaded_txt[i]:
        TRUMP_script_list.append(loaded_txt[i])
But the problem is the lines without a name.
Text without a name should be Trump's speech if it comes after text with Trump's name, until the list reaches text containing a name other than Trump's (Wallace or Clinton).
I tried a "while" loop that would terminate when a line contained the other names (Wallace, Clinton), but failed to implement it.
How can I implement this algorithm, or is there any other good idea?
Define a function to get the title:
def get_title(text, titles, previous_title):
    for title in titles:
        if title in text:
            return title
    return previous_title
Define the reference dictionary (the three output lists must exist first):
TRUMP_script_list, HILLARY_script_list, WALLACE_script_list = [], [], []
name_script_list = {'TRUMP:': TRUMP_script_list,
                    'HILLARY:': HILLARY_script_list,
                    'WALLACE:': WALLACE_script_list}
titles = set(name_script_list.keys())
title = ''
Iterate through the list in a for loop:
for text in loaded_txt:
    title = get_title(text, titles, title)
    name_script_list[title].append(text)
Basically, the idea is that get_title() takes a series of titles to try, as well as what the last title was. If any of the titles appears in the text, it returns that one; otherwise, it returns the prior title.
I initialized the title as ''. This will work as long as there is a title in the first line of text; if there isn't, it will throw a KeyError. The fix depends on how you want it handled. Do you just not want to consider such a case (it would indicate an error in loaded_txt or in the list of possible titles)? Do you want to set a specific person's name as the default initial title? Do you want to skip lines until the first title appears? There are a number of approaches, and I'm not sure which you would prefer.
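For instance, the skip-lines option is a small change to the loop (a sketch of just that one approach):
for text in loaded_txt:
    title = get_title(text, titles, title)
    if not title:  # no speaker tag seen yet, skip this line
        continue
    name_script_list[title].append(text)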

Defining an algorithm, Python

Recently I started to learn Python.
I need to write a program for a task.
Please point me down the right track:
The program takes two values from a file: the company and the number of people in it.
Next I need to do the following calculation: find the company with the fewest people -> define it as one part of the whole -> compute its ratio to each of the other companies -> count the number of rations over some period of time, uniformly.
First, I don't know what I need to do first. Should I define all the data like "Company = number", or use a dict?
I'm not asking you to solve the problem - I'm asking you to teach me.
B.R.
For your example, if you want to keep it as easy as possible, put all your data in dicts; you don't need to create a database or a file to store the data. For each entity, create a variable as a dict and make the relationships between them with ids, like this (assume we have client and orders entities):
client = {
    1: {"name": "name_1", "email": "name_1@gmail.com"},
    2: {"name": "name_2", "email": "name_2@gmail.com"}
}
orders = {1: {"date": "2016-02-01", "client_id": 1}, ...}
and so on.
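To connect this to your task, here is a minimal sketch under the assumption that the file contains one "company,people" pair per line (the file name and format are made up):
companies = {}
with open('companies.txt') as f:  # hypothetical file
    for line in f:
        name, count = line.strip().split(',')
        companies[name] = int(count)

# Company with the fewest people, and every company's size
# relative to that smallest one.
smallest = min(companies, key=companies.get)
ratios = {name: count / companies[smallest]
          for name, count in companies.items()}
print(smallest, ratios)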

Ensure change of a string doesn't go unnoticed in other spots of a large program

ALL_POSSIBLE_DOG_TOYS is a large list containing the names of every single toy a dog owner can buy for his dog.
ALL_POSSIBLE_DOG_TOYS = ['large_chew_bone', 'treat_ball', .... ]
If the owner has bought a specific toy, it can affect the dog's happiness in various ways depending on some factors (e.g. dog breed, age etc).
Therefore, in some places of my code I need to check whether a specific toy was purchased by the owner, that is, whether the toy is part of the self.toys_purchased list.
e.g.
# Check if a specific toy is bought.
if 'treat_ball' in self.toys_purchased:
    # Do stuff with treat balls.
    # ...
I need to perform the above checks in various locations of a 15k LOC program.
I also want to ensure that if I ever change the name of a toy in ALL_POSSIBLE_DOG_TOYS, I will not forget to also change its name in my if checks.
One way to achieve that is by doing the following:
DOG_TOYS_AS_DICT = {
    'large_chew_bone': 'large_chew_bone',
    'treat_ball': 'treat_ball',
    ...
}
That is, create a dict with each key being the same as its value.
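Note that the dict could also be generated from the list itself, so it doesn't have to be maintained by hand:
DOG_TOYS_AS_DICT = {toy: toy for toy in ALL_POSSIBLE_DOG_TOYS}
A stale lookup like DOG_TOYS_AS_DICT['treat_ball'] then still raises the desired KeyError after a rename in ALL_POSSIBLE_DOG_TOYS.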
Then use DOG_TOYS_AS_DICT values instead of using directly the actual toy name:
# Check if a specific toy is bought.
if DOG_TOYS_AS_DICT['treat_ball'] in self.toys_purchased:
    # Do stuff with treat balls.
    # ...
This way, if I ever change 'treat_ball' in ALL_POSSIBLE_DOG_TOYS to 'ball_with_treats', I would get a (desired) KeyError in the locations where I check if a specific toy is bought, so that I can change 'treat_ball' to the new string.
Question:
Is there a clearer way to ensure that changes to any of the toy names doesn't go unnoticed in the rest of the program?
One way: edit your code in an IDE such as PyCharm; when you need to rename, use the refactoring menu, which will change the name everywhere in your program.
Why don't you create an external database containing the names of those toys, categories, nicknames, article numbers, ...? There is the Python DB-API that you can use for accessing MySQL, Oracle, ... databases. Or am I missing something in your question?
