dictionary values to excel columns - python

I would like to convert a dictionary of key-value pairs to an Excel file, where the column names match the values in the dictionary.
For example, I have an Excel file with the column names a, b, c, d, e, f, g and h.
I have a dictionary like:
{1:['c','d'],2:['a','h'],3:['a','b','b','f']}.
I need the output to be:
  a b c d e f g h
1     1 1
2 1             1
3 1 2       1
The 1, 2, 3 are the keys from the dictionary; the rest of the columns could be either 0 or null.
I have tried splitting the dictionary and am getting:
1 = ['c','d']
2 = ['a','h']
3 = ['a','b','b','f']
but I don't know how to match this against the columns of the Excel file.

Your problem can be solved with pandas and collections (there may exist a more efficient solution):
import pandas as pd
from collections import Counter
d = {1:['c','d'], 2:['a','h'], 3:['a','b','b','f']}  # your dictionary from above
series = pd.Series(d)                 # convert the dict into a Series
counts = series.apply(Counter)        # count items row-wise
counts = counts.apply(pd.Series)      # expand the counters into columns
table = counts.fillna(0).astype(int)  # fill the gaps and make the counts integer
print(table)
#    a  b  c  d  f  h
# 1  0  0  1  1  0  0
# 2  1  0  0  0  0  1
# 3  1  2  0  0  1  0
It is not clear what type of output you expect, so I leave it to you to convert the DataFrame to the output of your choice.
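If you also want the empty e and g columns and an actual Excel file, you can reindex the result and call to_excel. A minimal sketch building on the code above (the output file name is hypothetical, and to_excel needs an engine such as openpyxl installed):
import pandas as pd
from collections import Counter

d = {1: ['c', 'd'], 2: ['a', 'h'], 3: ['a', 'b', 'b', 'f']}
table = pd.Series(d).apply(Counter).apply(pd.Series).fillna(0).astype(int)
# reindex to the full column set so e and g appear as zero-filled columns
table = table.reindex(columns=list('abcdefgh'), fill_value=0)
table.to_excel('output.xlsx')  # hypothetical file name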

A simple solution based only on standard lists and dictionaries. It generates a 2D list, which is then easy to convert into a CSV file that can be loaded by Excel.
d = {1:['c','d'], 2:['a','h'], 3:['a','b','b','f']}
cols = dict((c, n) for n, c in enumerate('abcdefgh'))  # column name -> column index
rows = dict((k, n) for n, k in enumerate(d))           # dict key -> row index (keys are ints, not strings)
table = [[0 for col in cols] for row in rows]
for row, values in d.items():
    for col in values:
        table[rows[row]][cols[col]] += 1
print(table)
# output:
# [[0, 0, 1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 1], [1, 2, 0, 0, 0, 1, 0, 0]]
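To turn the 2D list into a CSV that Excel can open, the standard csv module is enough. A minimal sketch reusing d and table from above (the file name is hypothetical):
import csv

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([''] + list('abcdefgh'))  # header row: blank corner plus column names
    for key, counts in zip(d, table):         # d's keys and table's rows are in the same order
        writer.writerow([key] + counts)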

Related

Python: merge columns with same name, keeping minimum value

I have a big matrix, like this:
df:
   A  A  A  B  B  ...   (column names)
A  2  4  5  9  2
A  6  8  7  6  4
A  5  2  6  4  5
B  3  4  1  3  4
B  4  5  3  1  4
.
.
(row names)
I would like to merge the columns with the same name, finding the minimum value. At the end I would like to have a matrix like this:
df_min:
   A  B  ...   (column names)
A  2  2
A  6  4
A  2  4
B  1  3
B  3  1
.
.
(row names)
My intention, afterwards (outside the scope of this question), is to merge the rows as well. Desired outcome:
df_min:
   A  B  ...   (column names)
A  2  2
B  1  1
.
.
(row names)
I tried this:
df_min = df.groupby(df.columns, axis=1).agg(np.min)
But it didn't work; it removed some rows (for example, removing row A entirely)... EDIT: Apparently it worked fine, but I had two columns with different names because one had trailing whitespace. These methods also reorder the columns, which confused me.
Simply groupby on the level=0 for each axis:
df.groupby(level=0, axis=1).min()
output:
   A  B
A  2  2
A  6  4
A  2  4
B  1  3
B  3  1
both axes:
df.groupby(level=0, axis=1).min().groupby(level=0).min()
output:
   A  B
A  2  2
B  1  1
Alternatively, use a single groupby through a stack/unstack:
df.stack().groupby(level=[0,1]).min().unstack()
output:
   A  B
A  2  2
B  1  1
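For reference, a self-contained sketch reconstructing the question's frame (values taken from the example above) and applying both reductions; note that axis=1 in groupby is deprecated in recent pandas, where df.T.groupby(level=0).min().T is the equivalent:
import pandas as pd

df = pd.DataFrame([[2, 4, 5, 9, 2],
                   [6, 8, 7, 6, 4],
                   [5, 2, 6, 4, 5],
                   [3, 4, 1, 3, 4],
                   [4, 5, 3, 1, 4]],
                  index=list('AAABB'), columns=list('AAABB'))

col_min = df.groupby(level=0, axis=1).min()  # merge duplicate columns
both_min = col_min.groupby(level=0).min()    # then merge duplicate rows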
EDIT
A numpy-only solution
I'm assuming that you have a list associating names to column indices; e.g., for the first code sample you provided, something like
column_names = ['A', 'A', 'A', 'B', 'B']
and that your data type is single-precision floating point. In this scenario, you can do something like the following:
unique_column_names = list(dict.fromkeys(column_names))  # unique column names, preserving original order
df_min = np.empty((df.shape[0], len(unique_column_names)), dtype=np.float32)  # allocate output array
for i, column_name in enumerate(unique_column_names):  # iterate over unique column names
    column_indices = [idx for idx in range(df.shape[1]) if column_names[idx] == column_name]  # all columns sharing this name
    tmp = df[:, column_indices]          # extract the columns named column_name
    df_min[:, i] = np.amin(tmp, axis=1)  # take the min by row and save the result
Then, if you want to repeat the process by row, assuming you have another list named row_names associating row indices with names:
unique_row_names = list(dict.fromkeys(row_names))  # unique row names, preserving order
df_final = np.empty((len(unique_row_names), len(unique_column_names)), dtype=np.float32)  # allocate final output
for j, row_name in enumerate(unique_row_names):  # iterate over unique row names
    row_indices = [idx for idx in range(df.shape[0]) if row_names[idx] == row_name]  # all rows sharing this name
    tmp = df_min[row_indices, :]            # extract rows named row_name from the column-reduced matrix
    df_final[j, :] = np.amin(tmp, axis=0)   # take the min by column and save the result
The column-name and row-name association lists for the final output are unique_column_names and unique_row_names.
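With the question's data (assuming the row labels are also ['A', 'A', 'A', 'B', 'B']), the two loops above reduce the matrix as expected:
import numpy as np

df = np.array([[2, 4, 5, 9, 2],
               [6, 8, 7, 6, 4],
               [5, 2, 6, 4, 5],
               [3, 4, 1, 3, 4],
               [4, 5, 3, 1, 4]], dtype=np.float32)
column_names = ['A', 'A', 'A', 'B', 'B']
row_names = ['A', 'A', 'A', 'B', 'B']
# after running both loops:
# df_min   -> [[2. 2.] [6. 4.] [2. 4.] [1. 3.] [3. 1.]]
# df_final -> [[2. 2.] [1. 1.]]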

Populate a dataframe column and rows using keys and values from a dictionary, row by row

Context: I'm trying to insert the keys from a dictionary as columns and its values as rows using the from_dict method, but it doesn't seem to be working. There is a bit more context in the comments in the code below:
for line in range(1, 20):
    print("New line")
    substitutions = "C241T,C3037T,C14408T,A23403G,C27046T,G28881A,G28882A,G28883C"  # for simplicity, we'll use this input for the substitutions variable
    #substitutions = final_df.iloc[line, 2]
    snv = []  # empty list that will reset for every line
    for content in substitutions.split(","):
        reference = content[0]      # get the 1st character; e.g. in "C241T" it retrieves "C"
        substitution = content[-1]  # get the last character; e.g. in "C241T" it retrieves "T"
        output = "{0}>{1}".format(reference, substitution)  # the desired column label, e.g. "C>T"
        snv.append(output)  # append the desired column label to the list
    dictionary = dict()  # dictionary to hold the counts for each output in the snv list
    for key in snv:
        dictionary[key] = dictionary.get(key, 0) + 1
    for key in dictionary.keys():
        if key not in pca_df.columns:  # if the key is not a column of the dataframe, then add it
            pca_df.from_dict(dictionary)  # make the keys the columns and place the counts on the respective line
This would be the desired output:
EDIT: the pca_df has this format and I'd like to populate it with the desired output:
  seqName          clade
0 Wuhan/Hu-1/2019  19A
1 sample_1         20B
2 sample_2         20A
...
If substitutions is this "C241T,C3037T,C14408T,A23403G,C27046T,G28881A,G28882A,G28883C" (sample_1) then the output on the dataframe should be :
  seqName          clade  C>T  A>G  G>A  G>C  T>C  C>A  G>T  A>T
0 Wuhan/Hu-1/2019  19A      0    0    0    0    0    0    0    0
1 sample_1         20B      4    1    2    1    0    0    0    0
Then iterate to the next line (sample_2 with substitutions as "C241T,C3037T,C14408T,A23403G,C29144T") and do the same:
  seqName          clade  C>T  A>G  G>A  G>C  T>C  C>A  G>T  A>T
0 Wuhan/Hu-1/2019  19A      0    0    0    0    0    0    0    0
1 sample_1         20B      4    1    2    1    0    0    0    0
2 sample_2         20B      4    1    0    0    0    0    0    0
etc.
Any help is very welcome! I'm fairly new to python so the code might not be the best.
Adding this piece of code fixed the problem:
for key, value in dictionary.items():
    pca_df.loc[line, key] = value
I'd still like to see other (quicker/better) solutions if anyone is interested. :) Doing this for 20k lines took 57 seconds, and I might need to do it for millions of lines, so this definitely needs to be optimized.
A general pointer:
You're iterating through the same data multiple times:
String->List
List->Dict
Dict->DataFrame
Either make your changes directly to the dataframe, or make a dictionary in one pass and then use pandas to convert it straight to a dataframe.
Pseudocode:
import pandas as pd

data_dict = {}
for seq in dataset:
    # unclear what each row looks like, but assuming [seqName, clade, "substitutions"]
    row_dict = {}
    for item in seq[2].split(","):
        # no need to create a separate list; build the "C>T"-style key in one line
        value = item[0] + ">" + item[-1]
        # create or increment the dictionary entry
        row_dict[value] = row_dict.get(value, 0) + 1
    # now add each row_dict to data_dict
    data_dict[seq[0]] = row_dict
# now build the dataframe; it will fill missing values with NaN
data_frame = pd.DataFrame.from_dict(data_dict, orient="index")
You could look at defaultdict or Counters (specialized dictionaries that make counting like you're doing easier).
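Following that suggestion, here is a hedged sketch of a single-pass Counter version; it assumes final_df holds the substitution strings in a column named 'substitutions' (a hypothetical name based on the question's commented-out code) and that pca_df shares the same RangeIndex:
from collections import Counter
import pandas as pd

# one Counter per row, built in a single pass over the strings
counters = final_df['substitutions'].apply(
    lambda s: Counter("{0}>{1}".format(m[0], m[-1]) for m in s.split(","))
)
snv_df = pd.DataFrame(counters.tolist()).fillna(0).astype(int)
pca_df = pd.concat([pca_df, snv_df], axis=1)  # assumes matching row indices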

Pandas: select column with most unique values

I have a pandas DataFrame and want to select the column with the most unique values.
I already filtered the unique values with nunique(). How can I now choose the column with the highest nunique()?
This is my code so far:
numeric_columns = df.select_dtypes(include=(int or float))
unique = []
for column in numeric_columns:
    unique.append(numeric_columns[column].nunique())
I later need to filter all the columns of my dataframe depending on this column (the one with the most uniques).
Use DataFrame.select_dtypes with np.number, then get the per-column counts with DataFrame.nunique and take the column with the maximal value via Series.idxmax:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 2, 2, 2], 'c': list('abcd')})
print(df)
#    a  b  c
# 0  1  1  a
# 1  2  2  b
# 2  3  2  c
# 3  4  2  d

numeric = df.select_dtypes(include=np.number)
nu = numeric.nunique().idxmax()
print(nu)  # a
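The follow-up filtering depends on your criterion; as a hypothetical example, keeping only the rows where the most-unique column exceeds its median:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 2, 2, 2], 'c': list('abcd')})
best = df.select_dtypes(include=np.number).nunique().idxmax()  # 'a'
filtered = df[df[best] > df[best].median()]  # keeps rows 2 and 3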

Mapping Values of key in Dictionary present in dataframe columns

I have a dataframe A with a column 'col_1' whose values are A and B, and I am trying to map those values to the entries of a dictionary.
DataFrame A:

  col_1
0     A
1     A
2     A
3     B
4     B

and the dictionary:

{"A": [1, 2, 3], "B": [1, 2]}
and I want the output like this
Dataframe:

col_1  Values
A      1
A      2
A      3
B      1
B      2
Any help will be highly appreciated, thanks.
I tried to frame your problem properly:
df = pd.DataFrame({"col_1":["A","A","A","B","B"]})
Printing df gives us your dataframe shown in the image above:
print(df)
col_1
0 A
1 A
2 A
3 B
4 B
Here is your dictionary:
dict1 = {"A":[1,2,3], "B":[1,2]}
I created an empty list to hold the elements, stacked up the list as you requested, and finally created a new column called values and wrote the list into it:
values1 = []
for key, value_list in dict1.items():
    for item in value_list:
        value_item = key + " " + str(item)
        values1.append(value_item)
df["values"] = values1
Printing df results in:

  col_1 values
0     A    A 1
1     A    A 2
2     A    A 3
3     B    B 1
4     B    B 2
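If the Values column should hold just the numbers, as in the desired output, the dataframe can also be built directly from the dictionary. A sketch assuming the rows should follow the dictionary's order:
import pandas as pd

dict1 = {"A": [1, 2, 3], "B": [1, 2]}
# one (key, value) row per list element
df = pd.DataFrame(
    [(key, value) for key, values in dict1.items() for value in values],
    columns=["col_1", "Values"],
)
#   col_1  Values
# 0     A       1
# 1     A       2
# 2     A       3
# 3     B       1
# 4     B       2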

Find index value of a dataframe by comparing with another series

I am having a problem extracting the index values from a dataframe by comparing a dataframe column's values with another list.
list=[a,b,c,d]
Dataframe (the list is compared with column X):

   X  Y  Z
0  a  r  t
1  e  t  y
2  c  f  h
3  d  r  t
4  b  g  q
this should return the index values like
   X
0  a
4  b
2  c
3  d
I tried this method
z=dataframe.loc[(dataframe['X'] == list)]
You should use isin as you are comparing to a list of elements:
dataframe = pd.DataFrame(columns = ['X','Y','Z'])
dataframe['X'] = ['a','e','c','d','b']
dataframe['Y'] = ['r','t','f','r','g']
dataframe['Z'] = ['t','y','h','y','k']
mylist = ['a','b','c','d']
(Always post a way to create your dataframe in your question; it will be faster to answer.)
dataframe[dataframe['X'].isin(mylist)].X
0 a
2 c
3 d
4 b
Name: X, dtype: object
You need to use isin:
Make sure your list is a list of strings, then use dropna to get rid of unwanted rows and columns.
lst = ['a', 'b', 'c', 'd']  # named lst to avoid shadowing the built-in list
df[df.isin(lst)].dropna(how='all').dropna(axis=1)
Or, if you only want to compare with column X:
df.X[df.X.isin(lst)]
Output:
   X
0  a
2  c
3  d
4  b
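Both answers return the matching rows in the dataframe's original order; if you want them ordered by the list, as in the desired output, one option is to sort by the list positions. A sketch reusing dataframe and mylist from the first answer (Series.sort_values(key=...) needs pandas 1.1+):
order = {value: position for position, value in enumerate(mylist)}
result = dataframe.X[dataframe.X.isin(mylist)]
result = result.sort_values(key=lambda s: s.map(order))
print(result)
# 0    a
# 4    b
# 2    c
# 3    d
# Name: X, dtype: object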
