I've got an Excel file / pandas DataFrame that looks like this:
+------+--------+
| ID | 2nd ID |
+------+--------+
| ID_1 | R_1 |
| ID_1 | R_2 |
| ID_2 | R_3 |
| ID_3 | |
| ID_4 | R_4 |
| ID_5 | |
+------+--------+
How can I transform it into a Python dictionary? I want my result to look like:
{'ID_1':['R_1','R_2'],'ID_2':['R_3'],'ID_3':[],'ID_4':['R_4'],'ID_5':[]}
What should I do to obtain it?
If you need to remove the missing values (the IDs with no second ID), use Series.dropna inside a lambda in GroupBy.apply:
d = df.groupby('ID')['2nd ID'].apply(lambda x: x.dropna().tolist()).to_dict()
print(d)
{'ID_1': ['R_1', 'R_2'], 'ID_2': ['R_3'], 'ID_3': [], 'ID_4': ['R_4'], 'ID_5': []}
Or use the fact that np.nan == np.nan returns False in a list comprehension to filter out the missing values (see also the warning in the docs for more explanation):
d = df.groupby('ID')['2nd ID'].apply(lambda x: [y for y in x if y == y]).to_dict()
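As a quick check of that fact (a minimal demonstration):
import numpy as np

# NaN is the only common value that is not equal to itself,
# so `y == y` is False exactly for the missing entries
print(np.nan == np.nan)  # False
print('R_1' == 'R_1')    # True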
If you need to remove empty strings instead:
d = df.groupby('ID')['2nd ID'].apply(lambda x: [y for y in x if y != '']).to_dict()
Apply a function over the rows of the DataFrame that appends each value to your dict. apply does not modify the DataFrame in place; it is used here purely for its side effect of building the dictionary.
d = {k: [] for k in df.ID.unique()}  # note: dict.fromkeys(keys, []) would make every key share one list

def func(x):
    # skip blank/NaN second IDs so the empty lists stay empty
    if pd.notna(x["2nd ID"]):
        d[x.ID].append(x["2nd ID"])

# apply will return a series of Nones; we only want its side effect
df.apply(func, axis=1)
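A self-contained version for checking the result (a sketch, assuming NaN marks the blank cells of the sample frame):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["ID_1", "ID_1", "ID_2", "ID_3", "ID_4", "ID_5"],
    "2nd ID": ["R_1", "R_2", "R_3", np.nan, "R_4", np.nan],
})

d = {k: [] for k in df.ID.unique()}
df.apply(lambda x: d[x.ID].append(x["2nd ID"]) if pd.notna(x["2nd ID"]) else None, axis=1)
print(d)
# {'ID_1': ['R_1', 'R_2'], 'ID_2': ['R_3'], 'ID_3': [], 'ID_4': ['R_4'], 'ID_5': []}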
Edit:
I asked it on Gitter and #gurukiran07 gave me an answer: what you are trying to do is the reverse of the explode function.
s = pd.Series([[1, 2, 3], [4, 5]])
0 [1, 2, 3]
1 [4, 5]
dtype: object
exploded = s.explode()
0 1
0 2
0 3
1 4
1 5
dtype: object
exploded.groupby(level=0).agg(list)
0 [1, 2, 3]
1 [4, 5]
dtype: object
Example data:
| alcoholism | diabetes | handicapped | hypertension | new col                 |
| ---------- | -------- | ----------- | ------------ | ----------------------- |
| 1          | 0        | 1           | 0            | alcoholism, handicapped |
| 0          | 1        | 0           | 1            | diabetes, hypertension  |
| 0          | 1        | 0           | 0            | diabetes                |
If any of the above columns has the value 1, then I need the new column to contain the names of those columns only; if all are zero, it should return no condition.
I tried to do it with the code below:
problems = ['alcoholism', 'diabetes', 'handicapped', 'hypertension']
m1 = df[problems].isin([1])
mask = m1 | (m1.loc[~m1.any(axis=1)])
df['sp_name'] = mask.mul(problems).apply(lambda x: [i for i in x if i], axis=1)
But it returns the data with brackets like [handicapped, alcoholism].
The issue is that I can't do value counts, as the zero-value rows show up as empty [] and will not be plotted.
I still don't understand your ultimate goal, or how this will be useful in plotting, but all you're really missing is using str.join to combine each list into the string you want. That said, the way you've gotten there involves unnecessary steps. First, multiply the DataFrame by its own column names:
df * df.columns
alcoholism diabetes handicapped hypertension
0 alcoholism handicapped
1 diabetes hypertension
2 diabetes
Then you can apply the same as you did:
(df * df.columns).apply(lambda row: [i for i in row if i], axis=1)
0 [alcoholism, handicapped]
1 [diabetes, hypertension]
2 [diabetes]
dtype: object
Then you just need to include a string join in the function you supply to apply. Here's a complete example:
import pandas as pd
df = pd.DataFrame({
'alcoholism': [1, 0, 0],
'diabetes': [0, 1, 1],
'handicapped': [1, 0, 0],
'hypertension': [0, 1, 0],
})
df['new_col'] = (
(df * df.columns)
.apply(lambda row: ', '.join([i for i in row if i]), axis=1)
)
print(df)
alcoholism diabetes handicapped hypertension new_col
0 1 0 1 0 alcoholism, handicapped
1 0 1 0 1 diabetes, hypertension
2 0 1 0 0 diabetes
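The question also asked to return no condition when every indicator is zero. One way to get that (a sketch, assuming the literal string 'no condition' is wanted) relies on ', '.join(...) producing an empty, falsy string for all-zero rows:
df['new_col'] = (
    (df * df.columns)
    .apply(lambda row: ', '.join([i for i in row if i]) or 'no condition',
           axis=1)
)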
I found this solution helpful:
df['new_col'] = df.iloc[:, :-1].dot(df.add_suffix(",").columns[:-1]).str[:-1]
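A sketch of why the dot trick works, on a frame holding only the four indicator columns (so no iloc slicing is needed): multiplying a 0/1 value by a string repeats it 0 or 1 times, and the object-dtype dot product concatenates the surviving names.
import pandas as pd

df = pd.DataFrame({
    'alcoholism':   [1, 0, 0],
    'diabetes':     [0, 1, 1],
    'handicapped':  [1, 0, 0],
    'hypertension': [0, 1, 0],
})

# 0 * 'name,' == '' and 1 * 'name,' == 'name,'; summing each row
# concatenates the names of the 1-columns, and str[:-1] strips
# the trailing comma
df['new_col'] = df.dot(df.columns + ',').str[:-1]
print(df['new_col'].tolist())
# ['alcoholism,handicapped', 'diabetes,hypertension', 'diabetes']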
I have a pandas dataframe with a few columns. I want to convert one of the string columns into an array of strings with fixed length.
Here is how the current table looks:
+-----+--------------------+--------------------+
|col1 | col2 | col3 |
+-----+--------------------+--------------------+
| 1 |Marco | LITMATPHY |
| 2 |Lucy | NaN |
| 3 |Andy | CHMHISENGSTA |
| 4 |Nancy | COMFRNPSYGEO |
| 5 |Fred | BIOLIT |
+-----+--------------------+--------------------+
How can I split the strings in "col3" into arrays of strings of length 3, as follows?
PS: There can be blanks or NaN in col3, and they should be replaced with an empty array.
+-----+--------------------+----------------------------+
|col1 | col2 | col3 |
+-----+--------------------+----------------------------+
| 1   |Marco               | ['LIT','MAT','PHY']        |
| 2 |Lucy | [] |
| 3 |Andy | ['CHM','HIS','ENG','STA'] |
| 4 |Nancy | ['COM','FRN','PSY','GEO'] |
| 5 |Fred | ['BIO','LIT'] |
+-----+--------------------+----------------------------+
Use textwrap.wrap:
import textwrap
df['col3'].apply(lambda x: textwrap.wrap(x, 3) if pd.notna(x) else [])
If there are strings whose lengths aren't a multiple of 3, the remaining letters end up in a shorter final chunk. If you only want strings of length 3, you can chain one more apply to drop those:
df['col3'].apply(lambda x: textwrap.wrap(x, 3) if pd.notna(x) else []).\
    apply(lambda x: x[:-1] if x and len(x[-1]) % 3 != 0 else x)
Another way is this:
import pandas as pd
import numpy as np
df = pd.DataFrame({"col3":["LITMATPHY",np.nan,"CHMHISENGSTA","COMFRNPSYGEO","BIOLIT"]})
def split_str(s):
lst=[]
for i in range(0,len(s),3):
lst.append(s[i:i+3])
return lst
df["col3_result"] = df["col3"].apply(lambda x: [] if pd.isna(x) else split_str(s=x))
# Output
col3 col3_result
0 LITMATPHY [LIT, MAT, PHY]
1 NaN []
2 CHMHISENGSTA [CHM, HIS, ENG, STA]
3 COMFRNPSYGEO [COM, FRN, PSY, GEO]
4 BIOLIT [BIO, LIT]
Using only pandas (plus NumPy for the NaN), we can do:
import numpy as np
import pandas as pd

df = pd.DataFrame(['LITMATPHY', np.nan, '', 'CHFDIOSFF', 'CHFIOD', 'FHDIFOSDFJKL'], columns=['col3'])
def to_list(string, n):
if string != string:  # True only for NaN, which is not equal to itself
lst = []
else:
lst = [string[i:i+n] for i in range(0, len(string), n)]
return lst
df['new_col3'] = df['col3'].apply(lambda x: to_list(x, 3))
Output:
col3 new_col3
0 LITMATPHY [LIT, MAT, PHY]
1 NaN []
2 []
3 CHFDIOSFF [CHF, DIO, SFF]
4 CHFIOD [CHF, IOD]
5 FHDIFOSDFJKL [FHD, IFO, SDF, JKL]
I have two DataFrames, df and df1, and both contain file paths, like this:
df = pd.DataFrame({"X1": ['f','f','o','o','b','b'],
"X2": ['fb/FOO1/bar0.wav','fb/FOO1/bar1.wav','fb/FOO2/bar2.wav','fb/FOO2/bar3.wav','fb/FOO3/bar4.wav','fb/FOO3/bar5.wav']})
X1 X2
0 f fb/FOO1/bar0.wav
1 f fb/FOO1/bar1.wav
2 o fb/FOO2/bar2.wav
3 o fb/FOO2/bar3.wav
4 b fb/FOO3/bar4.wav
5 b fb/FOO3/bar5.wav
and another DataFrame:
df1 = pd.DataFrame({"X1": ['b','o','b','f','o','f'],
"X2": ['fb1/FOO3/bar5.opus','fb1/FOO2/bar2.opus','fb1/FOO3/bar4.opus','fb1/FOO1/bar1.opus','fb1/FOO2/bar3.opus','fb1/FOO1/bar0.opus']})
X1 X2
0 b fb1/FOO3/bar5.opus
1 o fb1/FOO2/bar2.opus
2 b fb1/FOO3/bar4.opus
3 f fb1/FOO1/bar1.opus
4 o fb1/FOO2/bar3.opus
5 f fb1/FOO1/bar0.opus
Now I want to sort the second DataFrame df1's X2 column (file paths) according to the first DataFrame df's file paths, so that the output looks like this:
X1 X2
0 f fb1/FOO1/bar0.opus
1 f fb1/FOO1/bar1.opus
2 o fb1/FOO2/bar2.opus
3 o fb1/FOO2/bar3.opus
4 b fb1/FOO3/bar4.opus
5 b fb1/FOO3/bar5.opus
You might create a sorter dictionary, which allows you to sort your values with a custom key:
#the following creates a key from the name part of the filepath (could also have been done with regex)
sorter_dict = dict(zip(df.X2.apply(lambda x : x.split('/')[-1].split('.')[0]),df.index))
#{'bar0': 0, 'bar1': 1, 'bar2': 2, 'bar3': 3, 'bar4': 4, 'bar5': 5}
#on df1, let's create a temp col with the name part of the filepath
df1['temp'] = df1.X2.apply(lambda x : x.split('/')[-1].split('.')[0])
#and apply our sorter dict
df1['sorter'] = df1.temp.map(sorter_dict)
#at the end, simply sort
df1 = df1.sort_values('sorter')
#and delete the unnecessary cols
del df1['temp'], df1['sorter']
Output
| X1 | X2 |
|:-----|:-------------------|
| f | fb1/FOO1/bar0.opus |
| f | fb1/FOO1/bar1.opus |
| o | fb1/FOO2/bar2.opus |
| o | fb1/FOO2/bar3.opus |
| b | fb1/FOO3/bar4.opus |
| b | fb1/FOO3/bar5.opus |
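The same idea can be written in one step with the key argument of sort_values (a sketch; key= needs pandas 1.1+, and the basename extraction is the same split as above):
# map each basename to its position in df's ordering
order = {p.split('/')[-1].split('.')[0]: i for i, p in enumerate(df['X2'])}

df1 = df1.sort_values(
    'X2',
    key=lambda s: s.map(lambda p: order[p.split('/')[-1].split('.')[0]]),
).reset_index(drop=True)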
This could work if the file path names are a consistent length within the DataFrames. Simply create a new column with the part that you want to sort by, sort by that column, and then drop the new column:
df['X3'] = df['X2'].astype(str).str[3:-4]
df1['X3'] = df1['X2'].astype(str).str[4:-5]
df1 = df1.set_index('X3')
df1 = df1.reindex(index=df['X3'])
df1 = df1.reset_index()
df1 = df1.drop('X3', axis = 1)
df = df.drop('X3', axis = 1)
df1
I have a DataFrame that contains a column, let's call it "names", which holds the names of other columns. I would like to add a new column that, for each row, has the value from the column named in that row's "names" entry.
Example:
Input dataframe:
pd.DataFrame.from_dict({"a": [1, 2, 3,4], "b": [-1,-2,-3,-4], "names":['a','b','a','b']})
| a   | b   | names |
| --- | --- | ----- |
| 1   | -1  | 'a'   |
| 2   | -2  | 'b'   |
| 3   | -3  | 'a'   |
| 4   | -4  | 'b'   |
Output dataframe:
pd.DataFrame.from_dict({"a": [1, 2, 3,4], "b": [-1,-2,-3,-4], "names":['a','b','a','b'], "new_col":[1,-2,3,-4]})
| a   | b   | names | new_col |
| --- | --- | ----- | ------- |
| 1   | -1  | 'a'   | 1       |
| 2   | -2  | 'b'   | -2      |
| 3   | -3  | 'a'   | 3       |
| 4   | -4  | 'b'   | -4      |
You can use lookup:
df['new_col'] = df.lookup(df.index, df.names)
df
# a b names new_col
#0 1 -1 a 1
#1 2 -2 b -2
#2 3 -3 a 3
#3 4 -4 b -4
EDIT
lookup has been deprecated, here's the currently recommended solution:
import numpy as np

idx, cols = pd.factorize(df['names'])
df['new_col'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
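To see what the two pieces do (a sketch on the example frame): factorize turns the names column into integer codes plus the array of unique names, and the fancy index then picks one value per row.
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [-1, -2, -3, -4],
                   "names": ['a', 'b', 'a', 'b']})

idx, cols = pd.factorize(df['names'])
print(idx)   # [0 1 0 1] -> code of each row's target column
print(cols)  # Index(['a', 'b'], dtype='object')

# reindex orders the columns as in `cols`; row i then takes column idx[i]
print(df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx])
# [ 1 -2  3 -4]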
Because DataFrame.lookup is deprecated as of Pandas 1.2.0, the following is what I came up with using DataFrame.melt:
df['new_col'] = (
    df.melt(id_vars='names', value_vars=['a', 'b'], ignore_index=False)
      .query('names == variable')
      .loc[df.index, 'value']
)
Output:
>>> df
a b names new_col
0 1 -1 a 1
1 2 -2 b -2
2 3 -3 a 3
3 4 -4 b -4
Can this be simplified? For correctness, the index must not be ignored.
Additional reference:
Looking up values by index/column labels (archive)
Solution using pd.factorize (from https://github.com/pandas-dev/pandas/issues/39171#issuecomment-773477244):
idx, cols = pd.factorize(df['names'])
df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
With the straightforward and easy solution (lookup) deprecated, another alternative to the pandas-based ones proposed here is to convert df into a numpy array and use numpy indexing:
df['new_col'] = df.values[df.index.get_indexer(df['names'].index), df.columns.get_indexer(df['names'])]
Let me explain what this does. df.values is a numpy array based on the DataFrame. As numpy arrays have to be indexed numerically, we need to use the get_indexer function to convert the pandas row and column index names to index numbers that can be used with numpy:
>>> df.index.get_indexer(df['names'].index)
array([0, 1, 2, 3], dtype=int64)
>>> df.columns.get_indexer(df['names'])
array([0, 1, 0, 1], dtype=int64)
(In this case, where the row index is already numerical, you could get away with simply using df.index as the first argument inside the bracket, but this does not work generally.)
Here's a short solution using df.melt and df.merge:
df.merge(df.melt(var_name='names', ignore_index=False), on=[None, 'names'])
Outputs:
key_0 a b names value
0 0 1 -1 a 1
1 1 2 -2 b -2
2 2 3 -3 a 3
3 3 4 -4 b -4
There's a redundant key_0 column which you need to drop with df.drop.
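Putting it together, a sketch that builds on the merge above and names the result new_col:
df = (
    df.merge(df.melt(var_name='names', ignore_index=False), on=[None, 'names'])
      .drop(columns='key_0')
      .rename(columns={'value': 'new_col'})
)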
If I have two lists
l1 = ['A', 'B']
l2 = [1, 2]
what is the most elegant way to get a pandas DataFrame that looks like:
+-----+-----+-----+
| | l1 | l2 |
+-----+-----+-----+
| 0 | A | 1 |
+-----+-----+-----+
| 1 | A | 2 |
+-----+-----+-----+
| 2 | B | 1 |
+-----+-----+-----+
| 3 | B | 2 |
+-----+-----+-----+
Note, the first column is the index.
Use product from itertools:
>>> from itertools import product
>>> pd.DataFrame(list(product(l1, l2)), columns=['l1', 'l2'])
l1 l2
0 A 1
1 A 2
2 B 1
3 B 2
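product generalizes to any number of lists, so the same pattern extends directly; for example, with a hypothetical third list l3:
>>> l3 = ['x', 'y']
>>> pd.DataFrame(list(product(l1, l2, l3)), columns=['l1', 'l2', 'l3'])
  l1  l2 l3
0  A   1  x
1  A   1  y
2  A   2  x
3  A   2  y
4  B   1  x
5  B   1  y
6  B   2  x
7  B   2  y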
As an alternative you can use pandas' cartesian_product (may be more useful with large numpy arrays):
In [11]: lp1, lp2 = pd.core.reshape.util.cartesian_product([l1, l2])
In [12]: pd.DataFrame(dict(l1=lp1, l2=lp2))
Out[12]:
l1 l2
0 A 1
1 A 2
2 B 1
3 B 2
This seems a little messy to read into a DataFrame with the correct orient...
Note: in older pandas versions, cartesian_product was located at pd.tools.util.cartesian_product.
You can also use the sklearn library, which uses a NumPy-based approach:
from sklearn.utils.extmath import cartesian
df = pd.DataFrame(cartesian((l1, l2)), columns=['l1', 'l2'])
For more verbose but possibly more efficient variants see Numpy: cartesian product of x and y array points into single array of 2D points.
You can use the merge function with how='cross' (available since pandas 1.2):
df1 = pd.DataFrame(l1, columns=['l1'])
df2 = pd.DataFrame(l2, columns=['l2'])
df1.merge(df2, how='cross')
Output:
l1 l2
0 A 1
1 A 2
2 B 1
3 B 2