I have a dictionary (in python), where the keys are animal names, and the values are sets that contain gene names. Not all the animals have all the genes.
There are about 108 genes (of which I have a list) and 15 species. There are 28 genes common to all animals.
I would like to plot the presence of a gene in an animal for every animal and gene.
For example:
d = {'dog': {'tnfa', 'tlr1'}, 'cat': {'myd88', 'tnfa', 'map2k2'}}
The plot I'd like would look something like this:
dog cat
tnfa x x
myd88 x
tlr1 x
map2k2 x
It would be nice if I could group the animals with the most number of genes together too. But that's optional.
Do you have any suggestions for an approach I could take?
Let's try this:
import pandas as pd

d = {'dog': {'tnfa', 'tlr1'}, 'cat': {'myd88', 'tnfa'}}
df = pd.DataFrame.from_dict(d, orient='index')
df.stack().reset_index()\
  .drop('level_1', axis=1).assign(Value='x')\
  .set_index([0, 'level_0'])['Value']\
  .unstack().rename_axis('gene')\
  .rename_axis('animal', axis=1)
Output:
animal cat dog
gene
myd88 x None
tlr1 None x
tnfa x x
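As an aside, the same matrix can also be built without the stack/unstack dance, by turning each gene set into a {gene: 'x'} mapping and letting the DataFrame constructor align the index. This is just a sketch of an alternative, not a correction of the above:

```python
import pandas as pd

d = {'dog': {'tnfa', 'tlr1'}, 'cat': {'myd88', 'tnfa'}}

# Map each animal to {gene: 'x'} and let the DataFrame constructor align
# the genes on the index; cells for missing genes come out as NaN.
df = pd.DataFrame({animal: dict.fromkeys(genes, 'x') for animal, genes in d.items()})
df = df.fillna('').rename_axis('gene').rename_axis('animal', axis=1)
print(df)
```

The fillna('') replaces the NaN cells with empty strings so the printed table matches the asked-for layout.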
Using pandas crosstab will get you the matrix you are looking for:
import pandas as pd

d = {'dog': ['tnfa', 'tlr1'], 'cat': ['myd88', 'tnfa']}

# data munging: flatten the dict into a Series of genes indexed by animal
df = pd.DataFrame(d).stack()
df.index = df.index.droplevel(0)

# create and format the crosstab
ct = pd.crosstab(df.index, df.values)
ct.index.name = "animal"
ct.columns.name = "gene"
ct = ct.replace([0, 1], ["", "x"])
ct = ct.T
print(ct)
Results in
animal cat dog
gene
myd88 x
tlr1 x
tnfa x x
Not really sure about the grouping - do you mean by number of genes or common genes? Probably need some more examples as well for that one.
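If the grouping just means ordering the animals so the ones with the most genes come first, a rough sketch in the same crosstab spirit (using the asker's original unequal-length data) could be:

```python
import pandas as pd

d = {'dog': ['tnfa', 'tlr1'], 'cat': ['myd88', 'tnfa', 'map2k2']}

# Flatten the dict into two aligned lists so crosstab can count pairs,
# even though the animals have different numbers of genes.
animals = [animal for animal, genes in d.items() for _ in genes]
genes = [gene for gene_list in d.values() for gene in gene_list]
ct = pd.crosstab(pd.Series(genes, name='gene'), pd.Series(animals, name='animal'))

# Reorder the animal columns so the animal with the most genes comes first.
ct = ct[ct.sum().sort_values(ascending=False).index]
ct = ct.replace([0, 1], ['', 'x'])
print(ct)
```

Here ct.sum() counts genes per animal column, and the sorted index is used to reorder the columns before the 0/1 values are swapped for blanks and x's.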
A pure Python solution:
Instead of using pandas, this solution just uses simple for-loops and the str.ljust method to print a neat table.
The code loops through each animal and gets that animal's genes. Then, for each row already in the table, if the first value of that row is one of the genes, an 'x' is appended to the end of that row to mark that this animal has the gene, and the gene is removed from the set so it doesn't create its own row later; otherwise, an empty string is appended to fill that cell of the table.
Then, for each remaining gene (one that wasn't already in the table), a new row is created containing that gene, one empty cell for each animal already processed (['']*index), and an 'x' to show that the current animal has it.
Finally, a header row holding the animal names from the dict is inserted at the beginning.
Here's the code:
d = {'dog': {'tnfa', 'tlr1'}, 'cat': {'myd88', 'tnfa', 'map2k2'}}

table = []
cellWidth = 0
for index, animal in enumerate(d.keys()):
    cellWidth = max(cellWidth, len(animal))
    genes = set(d[animal])  # copy, so the input dict's sets aren't mutated
    for row in table:
        if row[0] in genes:
            row.append('x')
            genes.remove(row[0])
        else:
            row.append('')
    for gene in genes:
        cellWidth = max(cellWidth, len(gene))
        table.append([gene] + [''] * index + ['x'])
table.insert(0, [''] + list(d.keys()))
for r in table:
    print(''.join(c.ljust(cellWidth + 1) for c in r))
and the result is what is wanted:
cat dog
map2k2 x
tnfa x x
myd88 x
tlr1 x
Update:
I have added a variable, cellWidth, which stores the length of the longest animal or gene name; the max() function keeps that bookkeeping short. In the final print, each cell is padded to one more than that width so there is some breathing room.
Related
I've stumbled upon some intricate data and I want to present it completely differently.
Currently, my dataframe has a default (numeric) index and 3 columns: sequence (which stores sentences), labels (a list containing 20 different strings) and scores (again a list, of length 20, that corresponds to the labels list: the ith element in the scores list is the score of the ith element in the labels list).
The labels list is sorted via the scores list: if label j has the highest score in row i, then j shows up first in that row's labels list. So, essentially, it's sorted by the scores list.
I want to paint a different picture: use the entries of the labels list as my new columns, and as values use the corresponding entries of the scores list.
For example, if this is is how my current dataframe looks like:
import pandas as pd

d = {'sentence': ['Hello, my name is...', 'I enjoy reading books'],
     'labels': [['Happy', 'Sad'], ['Sad', 'Happy']],
     'score': [['0.9', '0.1'], ['0.8', '0.2']]}
df = pd.DataFrame(data=d)
df
I want to keep the first column which is the sentence, but then use the labels like the rest of the columns and fill it with the value of the corresponding scores.
An example output would be then:
new_format_d = {'sentence': ['Hello, my name is...', 'I enjoy reading books'], 'Happy': ['0.9', '0.2'], 'Sad': ['0.1', '0.8']}
new_format_df = pd.DataFrame(data=new_format_d)
new_format_df
Is there an "easy" way to execute that?
I was finally able to solve it using a NumPy array hack.
First, convert the lists to NumPy arrays:
import numpy as np

df['labels'] = df['labels'].map(np.array)
df['score'] = df['score'].map(np.array)
Then loop over the labels and add each label as a column, one at a time, filling in its corresponding scores with a boolean mask:
for label in df['labels'][0]:
    df[label] = df[['labels', 'score']].apply(lambda x: x['score'][x['labels'] == label][0], axis=1)
My suggestion is to change your dictionary if you can. First find the indices of the Happy and Sad from labels:
happy_index = [internal_list.index('Happy') for internal_list in d['labels']]
sad_index = [internal_list.index('Sad') for internal_list in d['labels']]
Then add new keys named Happy and Sad to your dictionary:
d['Happy'] = [d['score'][cnt][index] for cnt, index in enumerate(happy_index)]
d['Sad'] = [d['score'][cnt][index] for cnt, index in enumerate(sad_index)]
Finally, delete your redundant keys and convert it to dataframe:
del d['labels']
del d['score']
df = pd.DataFrame(d)
sentence Happy Sad
0 Hello, my name is... 0.9 0.1
1 I enjoy reading books 0.2 0.8
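For what it's worth, on pandas 1.3+ (where df.explode accepts a list of columns) the whole reshape can also be sketched without touching the dictionary at all:

```python
import pandas as pd

d = {'sentence': ['Hello, my name is...', 'I enjoy reading books'],
     'labels': [['Happy', 'Sad'], ['Sad', 'Happy']],
     'score': [['0.9', '0.1'], ['0.8', '0.2']]}
df = pd.DataFrame(data=d)

# Explode both list columns together so each (label, score) pair gets its
# own row, then pivot the labels back out as columns.
wide = (df.explode(['labels', 'score'])
          .pivot(index='sentence', columns='labels', values='score')
          .reset_index())
print(wide)
```

This generalises to all 20 labels with no per-label indexing, since pivot discovers the column names from the exploded labels.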
I'm new to Pandas, and I'm having a horrible time figuring out datasets.
I have a csv file I've read in using pandas.read_csv, dogData, that looks as follows (a copy/pasteable sample is at the bottom of this question):
The column names are dog breeds, the first line [0] refers to the size of the dogs, and beyond that there's a bunch of numerical values. The very first column has string description that I need to keep, but isn't relevant to the question. The last column for each size category contains separate "Average" values. (Note that it changed the "Average" columns to "Average.1", "Average.2" and so on, to take care of them not being unique)
Basically, I want to "group" by the first row - so all "small" dog values will be averaged except the "small" average column, and so on. The result would look something like this:
The existing "Average" columns should not be included in the new average being calculated. The existing "Average" columns for each size don't need to be altered at all. All "small" breed values should be averaged, all "medium" breed values should be averaged, and so on (actual file is much larger then the sample I showed here).
There's no guarantee the breeds won't be altered, and no guarantee the "sizes" will remain the same / always be included ("Small" could be left out, for example).
EDIT: After Joe Ferndz's comment, I've updated my code and have something slightly closer to working, but the actual adding of the columns is still giving me trouble.
import pandas as pd

dogData = pd.read_csv("dogdata.csv", header=[0, 1])
dogData.columns = dogData.columns.map("_".join)
totalVal = ""
count = 0
for col in dogData:
    if "Unnamed" in col:
        continue  # to skip starting columns
    if "Average" not in col:
        totalVal += dogData[col]
        count += 1
    else:
        # this is where I'd calculate the average, then reset count and totalVal;
        # right now, because the addition isn't working, I haven't figured that out
        break
print(totalVal)
Now, this code is technically getting the correct values... but it won't let me add them numerically (hence why totalVal is a string right now). It gives me a string of the correct numbers concatenated together, but won't let me convert them to floats to actually add them.
I've tried doing float(dogData[col]) for the totalVal addition line - it gives me a TypeError: cannot convert the series to <class float>
I've tried keeping it as a string, putting in "," between the numbers, then doing totalVal.split(",") to separate them, then convert and add... but obviously that doesn't work either, because AttributeError: 'Series' has no attribute 'split'
These errors make sense to me and I understand why it's happening, but I don't know what the correct method for doing this is. dogData[col] gives me all the values for every row at once, which is what I want, but I don't know how to then store that and add it in the next iteration of the loop.
Here's a copy/pastable sample of data:
,Corgi,Yorkie,Pug,Average,Average,Dalmation,German Shepherd,Average,Great Dane,Average
,Small,Small,Small,Small,Medium,Large,Large,Large,Very Large,Very Large
Words,1,3,3,3,2.4,3,5,7,7,7
Words1,2,2,4,4,2.2,4,4,6,8,8
Words2,2,1,5,3,2.5,5,3,8,9,6
Words3,1,4,4,2,2.7,6,6,5,6,9
You have to do a few tricks to get this to work.
Step 1: You need to read the csv file and use first two rows as header. It will create a MultiIndex column list.
Step 2: You need to join them together with say an _.
Step 3: Then rename the specific columns as per your requirement like S-Average, M-Average, ....
Step 4: Find out how many columns have dog name + Small.
Step 5: Compute the value for Small. Per your requirement: sum(columns with Small) / count(columns with Small).
Steps 6-7: Do the same for Large.
Steps 8-9: Do the same for Very Large.
This will give you the final list. If you want the columns to be in specific order, then you can change the order.
Step 10: Change the order for the dataframe
import pandas as pd

df = pd.read_csv('abc.txt', header=[0, 1], index_col=0)
df.columns = df.columns.map('_'.join)
df.rename(columns={'Average_Small': 'S-Average',
                   'Average_Medium': 'M-Average',
                   'Average_Large': 'L-Average',
                   'Average_Very Large': 'Very L-Average'}, inplace=True)

idx = [i for i, x in enumerate(df.columns) if x.endswith('_Small')]
if idx:
    df['Small'] = (df.iloc[:, idx].sum(axis=1) / len(idx)).round(2)
    df.drop(df.columns[idx], axis=1, inplace=True)

idx = [i for i, x in enumerate(df.columns) if x.endswith('_Large')]
if idx:
    df['Large'] = (df.iloc[:, idx].sum(axis=1) / len(idx)).round(2)
    df.drop(df.columns[idx], axis=1, inplace=True)

idx = [i for i, x in enumerate(df.columns) if x.endswith('_Very Large')]
if idx:
    df['Very_Large'] = (df.iloc[:, idx].sum(axis=1) / len(idx)).round(2)
    df.drop(df.columns[idx], axis=1, inplace=True)

df = df[['Small', 'S-Average', 'M-Average', 'L-Average', 'Very L-Average', 'Large', 'Very_Large']]
print(df)
The output of this will be:
Small S-Average M-Average ... Very L-Average Large Very_Large
Words 2.33 3 2.4 ... 7 4.0 7.0
Words1 2.67 4 2.2 ... 8 4.0 8.0
Words2 2.67 3 2.5 ... 6 4.0 9.0
Words3 3.00 2 2.7 ... 9 6.0 6.0
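A rougher but more generic sketch of the same computation leans on the MultiIndex itself, so the size categories are discovered from the header rather than renamed by hand (it reproduces only the new per-size averages, not the existing Average columns; the io.StringIO is just to make the sample self-contained):

```python
import io

import pandas as pd

csv = """,Corgi,Yorkie,Pug,Average,Average,Dalmation,German Shepherd,Average,Great Dane,Average
,Small,Small,Small,Small,Medium,Large,Large,Large,Very Large,Very Large
Words,1,3,3,3,2.4,3,5,7,7,7
Words1,2,2,4,4,2.2,4,4,6,8,8
Words2,2,1,5,3,2.5,5,3,8,9,6
Words3,1,4,4,2,2.7,6,6,5,6,9
"""
df = pd.read_csv(io.StringIO(csv), header=[0, 1], index_col=0)

# Keep only the breed columns (drop the per-size Average columns), then
# average them grouped by the size level (level 1) of the column index.
breeds = df.loc[:, df.columns.get_level_values(0) != 'Average']
means = breeds.T.groupby(level=1).mean().T.round(2)
print(means)
```

Because the sizes come from the second header row, this keeps working if "Small" is absent or new sizes are added; only the 'Average' label is assumed fixed.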
I currently have a pandas DataFrame df:
paper reference
2171686 p84 r51
3816503 p41 r95
4994553 p112 r3
2948201 p112 r61
2957375 p32 r41
2938471 p65 r41
...
Here, each row of df shows the relationship of citation between paper and reference (where paper cites reference).
I need the following numbers for my analysis:
Frequency of elements of paper in df
When two elements from paper are randomly selected, the number of references they cite in common
For number 1, I performed the following:
df_count = df.groupby(['paper'])['paper'].count()
For number 2, I performed the operation that returns pairs of elements in paper that cite the same element in reference:
from collections import defaultdict

pair = []
d = defaultdict(list)
for idx, row in df.iterrows():
    d[row['reference']].append(row['paper'])
for ref, lst in d.items():
    for i in range(len(lst)):
        for j in range(i + 1, len(lst)):
            pair.append([lst[i], lst[j], ref])
pair is a list that consists of three elements: first two elements are the pair of paper, and the third element is from reference that both paper elements cite. Below is what pair looks like:
[['p88','p7','r11'],
['p94','p33','r11'],
['p75','p33','r43'],
['p5','p12','r79'],
...]
I would like to retrieve a DataFrame in the following format:
paper1 freq1 paper2 freq2 common
p17 4 p45 3 2
p5 2 p8 5 2
...
where paper1 and paper2 represent the first two elements of each list of pair, freq1 and freq2 represent the frequency count of each paper done by df_count, and common is a number of reference both paper1 and paper2 cite in common.
How can I retrieve my desired dataset (in the desired format) from df, df_count, and pair?
I think this can be solved using only pandas.DataFrame.merge and groupby. I am not sure whether this is the most efficient way, though.
First, generate common reference counts:
# Merge the dataframe with itself to generate pairs
# Note that we merge only on reference, i.e. we generate each and every pair
df_pairs = df.merge(df, on=["reference"])
# Dataframe contains duplicate pairs of form (p1, p2) and (p2, p1), remove duplicates
df_pairs = df_pairs[df_pairs["paper_x"] < df_pairs["paper_y"]]
# Now group by pairs, and count the rows
# This will give you the number of common references per each paper pair
# reset_index is necessary to get each row separately
df_pairs = df_pairs.groupby(["paper_x", "paper_y"]).count().reset_index()
df_pairs.columns = ["paper1", "paper2", "common"]
Second, generate number of references per paper (you already got this):
df_refs = df.groupby(["paper"]).count().reset_index()
df_refs.columns = ["paper", "freq"]
Third, merge the two DataFrames:
# Note that we merge twice to get the count for both papers in each pair
df_all = df_pairs.merge(df_refs, how="left", left_on="paper1", right_on="paper")
df_all = df_all.merge(df_refs, how="left", left_on="paper2", right_on="paper")
# Get necessary columns and rename them
df_all = df_all[["paper1", "freq_x", "paper2", "freq_y", "common"]]
df_all.columns = ["paper1", "freq1", "paper2", "freq2", "common"]
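Putting the three steps together on a tiny made-up sample (the paper/reference rows below are invented purely to exercise the code):

```python
import pandas as pd

# A small sample in the same shape as df from the question (made-up rows).
df = pd.DataFrame({
    'paper':     ['p1', 'p1', 'p2', 'p2', 'p3'],
    'reference': ['r1', 'r2', 'r1', 'r2', 'r1'],
})

# Step 1: self-merge on reference, keep each unordered pair once, count commons.
df_pairs = df.merge(df, on=['reference'])
df_pairs = df_pairs[df_pairs['paper_x'] < df_pairs['paper_y']]
df_pairs = df_pairs.groupby(['paper_x', 'paper_y']).count().reset_index()
df_pairs.columns = ['paper1', 'paper2', 'common']

# Step 2: reference counts per paper.
df_refs = df.groupby(['paper']).count().reset_index()
df_refs.columns = ['paper', 'freq']

# Step 3: merge the frequencies in for both members of each pair.
df_all = df_pairs.merge(df_refs, how='left', left_on='paper1', right_on='paper')
df_all = df_all.merge(df_refs, how='left', left_on='paper2', right_on='paper')
df_all = df_all[['paper1', 'freq_x', 'paper2', 'freq_y', 'common']]
df_all.columns = ['paper1', 'freq1', 'paper2', 'freq2', 'common']
print(df_all)
```

On this sample, p1 and p2 share both r1 and r2 (common = 2), while p3 shares only r1 with each of them.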
I have column in a Pandas dataframe that I want to use to lookup a value of cost in a lookup dictionary.
The idea is that I will update an existing column if the item is there and if not the column will be left blank.
All the methods and solutions I have seen so far seem to create a new column, such as the apply and assign methods, but it is important that I preserve the existing data.
Here is my code:
import pandas as pd

lookupDict = {'Apple': 1, 'Orange': 2, 'Kiwi': 3, 'Lemon': 8}
df1 = pd.DataFrame({'Fruits': ['Apple', 'Banana', 'Kiwi', 'Cheese'],
                    'Pieces': [6, 3, 5, 7],
                    'Cost': [88, 55, 65, 55]})
What I want to achieve is lookup the items in the fruit column and if the item is there I want to update the cost column with the dictionary value multiplied by the number of pieces.
For example for Apple the cost is 1 from the lookup dictionary, and in the dataframe the number of pieces is 6, therefore the cost column will be updated from 88 to (6*1) = 6. The next item is banana which is not in the lookup dictionary, therefore the cost in the original dataframe will be left unchanged. The same logic will be applied to the rest of the items.
The only way I can think of achieving this is to separate the lists from the dataframe, iterate through them and then add them back into the dataframe when I'm finished. I am wondering if it would be possible to act on the values in the dataframe without using separate lists??
From other responses I image I have to use the loc indicators such as the following: (But this is not working and I don't want to create a new column)
df1.loc[df1.Fruits in lookupDict,'Cost'] = lookupDict[df1.Fruits] * lookupD[df1.Pieces]
I have also tried to map but it overwrites all the content of the existing column:
df1['Cost'] = df1['Fruits'].map(lookupDict)*df1['Pieces']
EDIT:
I have been able to achieve it with the following using iteration, however I am still curious if there is a cleaner way to achieve this:
# Iteration method
for x in range(len(df1.index)):
    fruit = df1.loc[x, 'Fruits']
    if fruit in lookupDict:
        newCost = lookupDict[fruit] * df1.loc[x, 'Pieces']
        print(newCost)
        df1.loc[x, 'Cost'] = newCost
If I understood correctly:
mask = df1['Fruits'].isin(lookupDict.keys())
df1.loc[mask, 'Cost'] = df1.loc[mask, 'Fruits'].map(lookupDict) * df1.loc[mask, 'Pieces']
Result:
In [29]: df1
Out[29]:
Cost Fruits Pieces
0 6 Apple 6
1 55 Banana 3
2 15 Kiwi 5
3 55 Cheese 7
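A mask-free variant of the same idea, for reference (note that the multiplication introduces NaN for unmatched fruits and promotes Cost to float):

```python
import pandas as pd

lookupDict = {'Apple': 1, 'Orange': 2, 'Kiwi': 3, 'Lemon': 8}
df1 = pd.DataFrame({'Fruits': ['Apple', 'Banana', 'Kiwi', 'Cheese'],
                    'Pieces': [6, 3, 5, 7],
                    'Cost': [88, 55, 65, 55]})

# map() gives NaN for fruits missing from the dict; fillna then falls
# back to the existing Cost for exactly those rows.
df1['Cost'] = (df1['Fruits'].map(lookupDict) * df1['Pieces']).fillna(df1['Cost'])
print(df1)
```

Both approaches leave unmatched rows (Banana, Cheese) untouched; the isin mask version above has the small advantage of preserving the integer dtype.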
Apologies for the messy title: Problem as follows:
I have some data frame of the form:
df1 =
Entries
0 "A Level"
1 "GCSE"
2 "BSC"
I also have a data frame of the form:
df2 =
Secondary Undergrad
0 "A Level" "BSC"
1 "GCSE" "BA"
2 "AS Level" "MSc"
I have a function which searches each entry in df1, looking for the words in each column of df2. The words that match are saved (Words_Present):
import re

def word_search(df, group, words):
    Yes, No = 0, 0
    Words_Present = []
    for i in words:
        match_object = re.search(i, df)
        if match_object:
            Words_Present.append(i)
            Yes = 1
        else:
            No = 0
    if Yes == 1:
        Attribute = 1
    return Attribute
I apply this function over all entries in df1, and all columns in df2, using the following iteration:
for i in df2:
    terms = df2[i].values.tolist()
    df1[i] = df1['course'][0:1].apply(lambda x: word_search(x, i, terms))
This yields an output df which looks something like:
df1 =
Entries Secondary undergrad
0 "A Level" 1 0
1 "GCSE" 1 0
2 "AS Level" 1 0
I want to amend the word_search function to output the Words_Present list as well as the Attribute, and put these into a new column, so that my eventual df1 array looks like:
Desired dataframe:
Entries Secondary Words Found undergrad Words Found
0 "A Level" 1 "A Level" 0
1 "GCSE" 1 "GCSE" 0
2 "AS Level" 1 "AS Level" 0
If I do:
def word_search(df, group, words):
    Yes, No = 0, 0
    Words_Present = []
    for i in words:
        match_object = re.search(i, df)
        if match_object:
            Words_Present.append(i)
            Yes = 1
        else:
            No = 0
    if Yes == 1:
        Attribute = 1
    if Yes == 0:
        Attribute = 0
    return Attribute, Words_Present
My function therefore now has multiple outputs. So applying the following:
for i in df2:
    terms = df2[i].values.tolist()
    df1[i] = df1['course'][0:1].apply(lambda x: word_search(x, i, terms))
My Output Looks like this:
Entries Secondary undergrad
0 "A Level" [1,"A Level"] 0
1 "GCSE" [1, "GCSE"] 0
2 "AS Level" [1, "AS Level"] 0
The output of .apply() here is a single pandas Series, so it just shoves everything into one cell of df1[i], where i = Secondary.
Is it possible to split the output of .apply into two separate columns, as shown in the desired dataframe?
I have consulted many questions, but none seem to deal directly with yielding multiple columns when the function contained within the apply statement has multiple outputs:
Applying function with multiple arguments to create a new pandas column
Create multiple columns in Pandas Dataframe from one function
Apply pandas function to column to create multiple new columns?
For example, I have also tried:
for i in df2:
    terms = df2[i].values.tolist()
    [df1[i], df1[i] + "Present"] = pd.concat([df1['course'][0:1].apply(lambda x: word_search(x, i, terms))])
but this simply yields errors such as:
raise ValueError('Length of values does not match length of ' 'index')
Is there a way to use apply, but still extract the extra information directly into multiple columns?
Many thanks, apologies for the length.
The direct answer to your question is yes: use the apply method of the DataFrame object, so you'd be doing df1.apply().
However, for this problem, and anything in pandas in general, try to vectorise rather than iterate through rows -- it's faster and cleaner.
It looks like you are trying to classify Entries into Secondary or Undergrad, and saving the keyword used to make the match. If you assume that each element of Entries has no more than one keyword match (i.e. you won't run into 'GCSE A Level'), you can do the following:
df = df1.copy()
df['secondary_words_found'] = df.Entries.str.extract('(A Level|GCSE|AS Level)', expand=False)
df['undergrad_words_found'] = df.Entries.str.extract('(BSC|BA|MSc)', expand=False)
df['secondary'] = df.secondary_words_found.notnull() * 1
df['undergrad'] = df.undergrad_words_found.notnull() * 1
EDIT:
In response to your issue with having many more categories and keywords, you can continue in the spirit of this solution by using an appropriate for loop and doing '(' + '|'.join(df2['Undergrad'].values) + ')' inside the extract method.
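That loop could be sketched roughly as follows (the lowercase indicator column names are my own choice, and re.escape guards against keywords that contain regex metacharacters):

```python
import re

import pandas as pd

df1 = pd.DataFrame({'Entries': ['A Level', 'GCSE', 'BSC']})
df2 = pd.DataFrame({'Secondary': ['A Level', 'GCSE', 'AS Level'],
                    'Undergrad': ['BSC', 'BA', 'MSc']})

# For each category column in df2, build one alternation pattern from its
# keywords and extract any match in a single vectorised pass.
for col in df2.columns:
    pattern = '(' + '|'.join(re.escape(w) for w in df2[col]) + ')'
    df1[col + '_words_found'] = df1.Entries.str.extract(pattern, expand=False)
    df1[col.lower()] = df1[col + '_words_found'].notnull() * 1
print(df1)
```

This scales to however many category columns df2 has, with no per-keyword code.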
However, if you have exact matches, you can do everything by a combination of pivots and joins:
keywords = df2.stack().to_frame('Entries').reset_index().drop('level_0', axis=1).rename(columns={'level_1': 'category'})
df = df1.merge(keywords, how='left')
for colname in df.category.dropna().unique():
    df[colname] = (df.category == colname) * 1  # Your indicator variable
    df.loc[df.category == colname, colname + '_words_found'] = df.loc[df.category == colname, 'Entries']
The first line 'pivots' your table of keywords into a 2-column dataframe of keywords and categories. Your keyword column must be the same as the column in df1; in SQL, this would be called the foreign key that you are going to join these tables on.
Also, you generally want to avoid having duplicate indexes or columns, which in your case, was Words Found in the desired dataframe!
For the sake of completeness, if you insisted on using the apply method, you would iterate over each row of the DataFrame; your code would look something like this:
secondary_words = df2.Secondary.values
undergrad_words = df2.Undergrad.values

def classify(s):
    if s.Entries in secondary_words:
        return pd.Series({'Entries': s.Entries, 'Secondary': 1, 'secondary_words_found': s.Entries, 'Undergrad': 0, 'undergrad_words_found': ''})
    elif s.Entries in undergrad_words:
        return pd.Series({'Entries': s.Entries, 'Secondary': 0, 'secondary_words_found': '', 'Undergrad': 1, 'undergrad_words_found': s.Entries})
    else:
        return pd.Series({'Entries': s.Entries, 'Secondary': 0, 'secondary_words_found': '', 'Undergrad': 0, 'undergrad_words_found': ''})

df = df1.apply(classify, axis=1)
This second version will only work in the cases you want it to if the element in Entries is exactly the same as its corresponding element in df2. You certainly don't want to do this, as it's messier, and will be noticeably slower if you have a lot of data to work with.