I have hundereds of dataframe, let say the name is df1,..., df250, I need to build list by a column of those dataframe. Usually I did manually, but today data is to much, and to prone to mistakes
Here's what I did
list1 = df1['customer_id'].tolist()
list2 = df2['customer_id'].tolist()
..
list250 = df250['customer_id'].tolist()
This is so manual, can we make this in easier way?
The easier way is to take a step back and make sure you put your dataframes in a collection such as list or dict. You can then perform operations easily in a scalable way.
For example:
dfs = {1: df1, 2: df2, 3: df3, ... , 250: df250}
lists = {k: v['customer_id'].tolist() for k, v in dfs.items()}
You can then access the results as lists[1], lists[2], etc.
There are other benefits. For example, you are no longer polluting the namespace, you save the effort of explicitly defining variable names, you can easily store and transport related collections of objects.
Using exec function enables you to execute python code stored in a string:
for i in range(1,251):
s = "list"+str(i)+" = df"+str(i)+"['customer_id'].tolist()"
exec(s)
I'd use next code. In this case there's no need to manually create list of DataFrames.
cust_lists = {'list{}'.format(i): globals()['df{}'.format(i)]['customer_id'].tolist()
for i in range(1, 251)}
Now you can access you lists from cust_lists dict by the name, like this:
`cust_lists['list1']`
or
`list1`
Related
I am experimenting using lambda functions to create a list that contains only the values of a particular key to a list.
I have the following:
names = None
names = list(map(lambda restaurant: dict(name=restaurant['name']
).values(), yelp_restaurants))
names
# This is what I want for the list:
# ['Fork & Fig',
# 'Salt And Board',
# 'Frontier Restaurant',
# 'Nexus Brewery',
# "Devon's Pop Smoke",
# 'Cocina Azul',
# 'Philly Steaks',
# 'Stripes Biscuit']
What I get however, is the following:
[dict_values(['Fork & Fig']),
dict_values(['Salt And Board']),
dict_values(['Frontier Restaurant']),
dict_values(['Nexus Brewery']),
dict_values(["Devon's Pop Smoke"]),
dict_values(['Cocina Azul']),
dict_values(['Philly Steaks']),
dict_values(['Stripes Biscuit'])]
Is there a way to only pass the values, an eliminate the redundant 'dict_values' prefix?
The function you are using to create names is a bit redundant:
names = list(map(lambda restaurant: dict(name=restaurant['name']
).values(), yelp_restaurants))
The control flow that you have outlined is "from a list of dict entries called yelp_restaurants, I want to create a dict of each name and grab the values from each dict and put that in a list."
Why? Don't get hung up on lambda functions yet. Start simple, like a for loop:
names = []
for restaurant in yelp_restaurants:
names.append(restaurant['name'])
Look at how much simpler that is. It does exactly what you want, just get the name and stuff it into a list. You can put that in a list comprehension:
names = [restaurant['name'] for restaurant in yelp_restaurants]
Or, if you really need to use lambda, now it's much easier to see what you actually want to accomplish
names = list(map(lambda x: x['name'], yelp_restaurants))
Remember, the x in lambda x: is each member of the iterable yelp_restaurants, so x is a dict. With that in mind, you are using direct access on name to extract what you want.
I have code that generates a list of 28 dictionaries. It cycles thru 28 files and links data points from each file in the appropriate dictionary. In order to make my code more flexible I wanted to use:
tegDics = [dict() for x in range(len(files))]
But when I run the code the first 27 dictionaries are blank and only the last, tegDics[27], has data. Below is the code including the clumsy, yet functional, code I'm having to use that generates the dictionaries:
x=0
import os
files=os.listdir("DirPath")
os.chdir("DirPath")
tegDics = [{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{}] # THIS WORKS!!!
#tegDics = [dict() for x in range(len(files))] - THIS WON'T WORK!!!
allRads=[]
while x<len(tegDics): # now builds dictionaries
for line in open(files[x]):
z=line.split('\t')
allRads.append(z[2])
tegDics[x][z[2]]=z[4] # pairs catNo with locNo
x+=1
Does anybody know why the more elegant code doesn't work.
Since you're using x within the list comprehension, it will no longer be zero by the time you reach the while loop - it will be len(files)-1 instead. I suggest changing the variable you use to something else. It's traditional to use a single underscore for a value you don't care about.
tegDics = [dict() for _ in range(len(files))]
It could be useful to eliminate your use of x entirely. It's customary in python to iterate directly over the objects in a sequence, rather than using a counter variable. You might do something like:
for tegDic in tegDics:
#do stuff with tegDic here
Although it's slightly trickier in your case, since you want to simultaneously iterate through tegDics and files at the same time. You can use zip to do that.
import os
files=os.listdir("DirPath")
os.chdir("DirPath")
tegDics = [dict() for _ in range(len(files))]
allRads=[]
for file, tegDic in zip(files,tegDics):
for line in open(file):
z=line.split('\t')
allRads.append(z[2])
tegDic[z[2]]=z[4] # pairs catNo with locNo
Anyway there is a simplest way imho:
taegDics = [{}]*len(files)
The list comprehension is a great structure for generalising working with lists in such a way that the creation of lists can be managed elegantly. Is there a similar tool for managing Dictionaries in Python?
I have the following functions:
# takes in 3 lists of lists and a column specification by which to group
def custom_groupby(atts, zmat, zmat2, col):
result = dict()
for i in range(0, len(atts)):
val = atts[i][col]
row = (atts[i], zmat[i], zmat2[i])
try:
result[val].append(row)
except KeyError:
result[val] = list()
result[val].append(row)
return result
# organises samples into dictionaries using the groupby
def organise_samples(attributes, z_matrix, original_z_matrix):
strucdict = custom_groupby(attributes, z_matrix, original_z_matrix, 'SecStruc')
strucfrontdict = dict()
for k, v in strucdict.iteritems():
strucfrontdict[k] = custom_groupby([x[0] for x in strucdict[k]],
[x[1] for x in strucdict[k]], [x[2] for x in strucdict[k]], 'Front')
samples = dict()
for k in strucfrontdict:
samples[k] = dict()
for k2 in strucfrontdict[k]:
samples[k][k2] = dict()
samples[k][k2] = custom_groupby([x[0] for x in strucfrontdict[k][k2]],
[x[1] for x in strucfrontdict[k][k2]], [x[2] for x in strucfrontdict[k][k2]], 'Back')
return samples
It seems like this is unwieldy. There being elegant ways to do almost everything in Python, I'm inclined to think I'm using Python wrongly.
More importantly, I'd like to be able to generalise this function better so that I can specify how many "layers" should be in the dictionary (without using several lambdas and approaching the problem in a Lisp style). I would like a function:
# organises samples into a dictionary by specified columns
# number of layers could also be assumed by number of criterion
def organise_samples(number_layers, list_of_strings_for_column_ids)
Is this possible to do in Python?
Thank you! Even if there isn't a way to do it elegantly in Python, any suggestions towards making the above code more elegant would be really appreciated.
::EDIT::
For context, the attributes object, z_matrix, and original_zmatrix are all lists of Numpy arrays.
Attributes might look like this:
Type,Num,Phi,Psi,SecStruc,Front,Back
11,181,-123.815,65.4652,2,3,19
11,203,148.581,-89.9584,1,4,1
11,181,-123.815,65.4652,2,3,19
11,203,148.581,-89.9584,1,4,1
11,137,-20.2349,-129.396,2,0,1
11,163,-34.75,-59.1221,0,1,9
The Z-matrices might both look like this:
CA-1, CA-2, CA-CB-1, CA-CB-2, N-CA-CB-SG-1, N-CA-CB-SG-2
-16.801, 28.993, -1.189, -0.515, 118.093, 74.4629
-24.918, 27.398, -0.706, 0.989, 112.854, -175.458
-1.01, 37.855, 0.462, 1.442, 108.323, -72.2786
61.369, 113.576, 0.355, -1.127, 111.217, -69.8672
Samples is a dict{num => dict {num => dict {num => tuple(attributes, z_matrix)}}}, having one row of the z-matrix.
The list comprehension is a great structure for generalising working with lists in such a way that the creation of lists can be managed elegantly. Is there a similar tool for managing Dictionaries in Python?
Have you tries using dictionary comprehensions?
see this great question about dictionary comperhansions
I need a data structure of this form:
test_id1: resultValue, timeChecked
test_id2: resultValue, timeChecked
test_id3: resultValue, timeChecked
test_id4: resultValue, timeChecked
...
up until now I have dealt with it by just using a dictionary with key value for the id and the result. But I would like to add the time I checked.
What would be the best way to do this? Can I make the value a tuple? Or is a new class better suited in this case?
What would my class need to look like to accommodate the above?
One lightweight alternative to making a class that both 1) retains the simplicity of a tuple and 2) allows named (as opposed to positional) field access is namedtuple.
from collections import namedtuple
Record = namedtuple("Record", ["resultValue", "timeChecked"])
s = {"test_id1": Record("res1", "time1"), "test_id2": Record("res2", "time2")}
You can now use the values in this dict as if you had a class with the resultValue and timeChecked fields defined...
>>> s["test_id1"].resultValue
'res1'
...or as simple tuples:
>>> a, b = s["test_id1"]
>>> print a, b
res1 time1
Can I make the value a tuple?
Yes, and I'm not really sure why you don't want to. If it's to make look-ups easier to read, you can even make the value another dictionary.
{"test_id1" : (resultValue, timeChecked)}
or
{"test_id1" : {"resultValue": resultValue, "timeChecked" : timeChecked }}
I think a whole new class is overkill for this situation.
You could also use a pandas DataFrame for your whole structure. It's probably waaaay overkill unless you're planning to do some serious data processing, but it's a nice thing to know about and directly matches your problem.
I would like to build up a list using a for loop and am trying to use a slice notation. My desired output would be a list with the structure:
known_result[i] = (record.query_id, (align.title, align.title,align.title....))
However I am having trouble getting the slice operator to work:
knowns = "output.xml"
i=0
for record in NCBIXML.parse(open(knowns)):
known_results[i] = record.query_id
known_results[i][1] = (align.title for align in record.alignment)
i+=1
which results in:
list assignment index out of range.
I am iterating through a series of sequences using BioPython's NCBIXML module but the problem is adding to the list. Does anyone have an idea on how to build up the desired list either by changing the use of the slice or through another method?
thanks zach cp
(crossposted at [Biostar])1
You cannot assign a value to a list at an index that doesn't exist. The way to add an element (at the end of the list, which is the common use case) is to use the .append method of the list.
In your case, the lines
known_results[i] = record.query_id
known_results[i][1] = (align.title for align in record.alignment)
Should probably be changed to
element=(record.query_id, tuple(align.title for align in record.alignment))
known_results.append(element)
Warning: The code above is untested, so might contain bugs. But the idea behind it should work.
Use:
for record in NCBIXML.parse(open(knowns)):
known_results[i] = (record.query_id, None)
known_results[i][1] = (align.title for align in record.alignment)
i+=1
If i get you right you want to assign every record.query_id one or more matching align.title. So i guess your query_ids are unique and those unique ids are related to some titles. If so, i would suggest a dictionary instead of a list.
A dictionary consists of a key (e.g. record.quer_id) and value(s) (e.g. a list of align.title)
catalog = {}
for record in NCBIXML.parse(open(knowns)):
catalog[record.query_id] = [align.title for align in record.alignment]
To access this catalog you could either iterate through:
for query_id in catalog:
print catalog[query_id] # returns the title-list for the actual key
or you could access them directly if you know what your looking for.
query_id = XYZ_Whatever
print catalog[query_id]