I have a pandas dataframe with a few columns. I want to convert one of the string columns into an array of fixed-length strings.
Here is how the current table looks:
+-----+--------------------+--------------------+
|col1 | col2 | col3 |
+-----+--------------------+--------------------+
| 1 |Marco | LITMATPHY |
| 2 |Lucy | NaN |
| 3 |Andy | CHMHISENGSTA |
| 4 |Nancy | COMFRNPSYGEO |
| 5 |Fred | BIOLIT |
+-----+--------------------+--------------------+
How can I split the string in "col3" into an array of strings of length 3, as follows?
PS: There can be blanks or NaN in col3, and they should be replaced with an empty array.
+-----+--------------------+----------------------------+
|col1 | col2 | col3 |
+-----+--------------------+----------------------------+
| 1 |Marco | ['LIT','MAT','PHY'] |
| 2 |Lucy | [] |
| 3 |Andy | ['CHM','HIS','ENG','STA'] |
| 4 |Nancy | ['COM','FRN','PSY','GEO'] |
| 5 |Fred | ['BIO','LIT'] |
+-----+--------------------+----------------------------+
Use textwrap.wrap:
import textwrap
import pandas as pd

df['col3'].apply(lambda x: textwrap.wrap(x, 3) if pd.notna(x) else [])
If there are strings whose lengths aren't a multiple of 3, the remaining letters end up in a shorter final chunk. If you only want strings of length 3, you can apply one more step to drop that chunk (the extra `x and` guard keeps the empty lists from the NaN rows from raising an IndexError):
df['col3'].apply(lambda x: textwrap.wrap(x, 3) if pd.notna(x) else [])\
          .apply(lambda x: x[:-1] if x and len(x[-1]) != 3 else x)
Another way is this:
import pandas as pd
import numpy as np
df = pd.DataFrame({"col3":["LITMATPHY",np.nan,"CHMHISENGSTA","COMFRNPSYGEO","BIOLIT"]})
def split_str(s):
    lst = []
    for i in range(0, len(s), 3):
        lst.append(s[i:i+3])
    return lst
df["col3_result"] = df["col3"].apply(lambda x: [] if pd.isna(x) else split_str(s=x))
# Output
col3 col3_result
0 LITMATPHY [LIT, MAT, PHY]
1 NaN []
2 CHMHISENGSTA [CHM, HIS, ENG, STA]
3 COMFRNPSYGEO [COM, FRN, PSY, GEO]
4 BIOLIT [BIO, LIT]
Using only pandas, we can do:
df = pd.DataFrame(['LITMATPHY', np.nan, '', 'CHFDIOSFF', 'CHFIOD', 'FHDIFOSDFJKL'], columns=['col3'])
def to_list(string, n):
    if string != string:  # True only if string is NaN
        lst = []
    else:
        lst = [string[i:i+n] for i in range(0, len(string), n)]
    return lst
df['new_col3'] = df['col3'].apply(lambda x: to_list(x, 3))
Output:
col3 new_col3
0 LITMATPHY [LIT, MAT, PHY]
1 NaN []
2 []
3 CHFDIOSFF [CHF, DIO, SFF]
4 CHFIOD [CHF, IOD]
5 FHDIFOSDFJKL [FHD, IFO, SDF, JKL]
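A vectorized alternative is also possible; here is a sketch using Series.str.findall (filling NaN with an empty string first makes those rows come out as empty lists):
df['new_col3'] = df['col3'].fillna('').str.findall('.{1,3}')
Each match of the regex .{1,3} is a chunk of up to three characters; use '.{3}' instead if trailing chunks shorter than 3 should be dropped.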
Example data:
| alcoholism | diabetes | handicapped | hypertension | new col                 |
| ---------- | -------- | ----------- | ------------ | ----------------------- |
| 1          | 0        | 1           | 0            | alcoholism, handicapped |
| 0          | 1        | 0           | 1            | diabetes, hypertension  |
| 0          | 1        | 0           | 0            | diabetes                |
If any of the above columns has value = 1, then I need the new column to contain the names of those columns only,
and if all are zero, return 'no condition'.
I tried to do it with the code below:
problems = ['alcoholism', 'diabetes', 'hypertension', 'handicapped']
m1 = df[problems].isin([1])
mask = m1 | (m1.loc[~m1.any(axis=1)])
df['sp_name'] = mask.mul(problems).apply(lambda x: [i for i in x if i], axis=1)
But it returns the data with brackets, like [handicapped, alcoholism].
The issue is that I can't do value_counts, as the all-zero rows show as empty [] and will not be plotted.
I still don't understand your ultimate goal, or how this will be useful in plotting, but all you're really missing is using str.join to combine each list into the string you want. That said, the way you've gotten there involves unnecessary steps. First, multiply the DataFrame by its own column names:
df * df.columns
alcoholism diabetes handicapped hypertension
0 alcoholism handicapped
1 diabetes hypertension
2 diabetes
Then you can apply the same as you did:
(df * df.columns).apply(lambda row: [i for i in row if i], axis=1)
0 [alcoholism, handicapped]
1 [diabetes, hypertension]
2 [diabetes]
dtype: object
Then you just need to include a string join in the function you supply to apply. Here's a complete example:
import pandas as pd
df = pd.DataFrame({
    'alcoholism': [1, 0, 0],
    'diabetes': [0, 1, 1],
    'handicapped': [1, 0, 0],
    'hypertension': [0, 1, 0],
})
df['new_col'] = (
    (df * df.columns)
    .apply(lambda row: ', '.join([i for i in row if i]), axis=1)
)
print(df)
alcoholism diabetes handicapped hypertension new_col
0 1 0 1 0 alcoholism, handicapped
1 0 1 0 1 diabetes, hypertension
2 0 1 0 0 diabetes
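Since the stated goal is counting and plotting: all-zero rows come out of the join as empty strings, so a possible follow-up (the 'no condition' label is taken from your description) is:
counts = df['new_col'].replace('', 'no condition').value_counts()
print(counts)
counts.plot(kind='bar')  # optional: bar chart of the condition combinations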
Alternatively, as a one-liner:
# dot of the 0/1 columns with comma-suffixed column names concatenates the names of the 1-columns; the [:-1] slices exclude new_col itself, and .str[:-1] trims the trailing comma
df['new_col'] = df.iloc[:, :-1].dot(df.add_suffix(",").columns[:-1]).str[:-1]
I found this solution helpful.
 | name1 | name2 | tot |
 +-------+-------+-----+
1| A     | B     | 3   |
2| C     | A     | 3   |
3| B     | D     | 4   |
4| A     | E     | 2   |
5| B     | C     | 5   |
 +-------+-------+-----+
I want to select rows based on the previous rows: a row should be selected when one of its "letters" (in name1 or name2) appears at least 2 times in the rows above it, and those rows have tot >= 3.
In this example I want to select:
A E 2
B C 5
because in the 4th row we have A (name1), which appears in the 1st and 2nd rows with a tot >= 3;
and the B C 5 row, because B appears in the 1st and 3rd rows with a tot >= 3.
PS: I want to create another dataset based on these new results.
You can build a cache using collections.defaultdict:
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({'name1': list('ACBAB'), 'name2': list('BADEC'), 'tot': [3, 3, 4, 2, 5]})
seen = defaultdict(int)  # every new key will be initialized with 0

keep = []
for row in df.itertuples():
    keep.append(
        (seen[row.name1] > 1) |
        (seen[row.name2] > 1)
    )
    if row.tot >= 3:
        # we can do this safely without risk of KeyError because `seen` is a defaultdict
        seen[row.name1] += 1
        seen[row.name2] += 1

out = df[keep]
Output
name1 name2 tot
3 A E 2
4 B C 5
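Since you mention creating another dataset from these results, a small follow-up sketch:
new_df = df[keep].reset_index(drop=True)  # independent dataframe with a fresh 0..n-1 index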
I've got an Excel/pandas dataframe/file looking like this:
+------+--------+
| ID | 2nd ID |
+------+--------+
| ID_1 | R_1 |
| ID_1 | R_2 |
| ID_2 | R_3 |
| ID_3 | |
| ID_4 | R_4 |
| ID_5 | |
+------+--------+
How can I transform it into a Python dictionary? I want my result to be like:
{'ID_1':['R_1','R_2'],'ID_2':['R_3'],'ID_3':[],'ID_4':['R_4'],'ID_5':[]}
What should I do to obtain it?
If you need to remove missing values (for IDs with no existing value), use Series.dropna in a lambda function with GroupBy.apply:
d = df.groupby('ID')['2nd ID'].apply(lambda x: x.dropna().tolist()).to_dict()
print(d)
{'ID_1': ['R_1', 'R_2'], 'ID_2': ['R_3'], 'ID_3': [], 'ID_4': ['R_4'], 'ID_5': []}
Or use the fact that np.nan == np.nan returns False in a list comprehension to filter out missing values (see also the warning in the docs for more explanation):
d = df.groupby('ID')['2nd ID'].apply(lambda x: [y for y in x if y == y]).to_dict()
If you need to remove empty strings:
d = df.groupby('ID')['2nd ID'].apply(lambda x: [y for y in x if y != '']).to_dict()
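If both NaN values and empty strings can occur, the two filters can be combined in one comprehension (a variant of the above):
d = df.groupby('ID')['2nd ID'].apply(lambda x: [y for y in x if y == y and y != '']).to_dict()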
Apply a function over the rows of the dataframe that appends each value to your dict. apply is used here purely for its side effect, so the dictionary is filled as it runs. Note that dict.fromkeys(df.ID.unique(), []) would make every key share one and the same list, so build the dict with a comprehension instead:
d = {k: [] for k in df.ID.unique()}  # one independent list per key

def func(x):
    if pd.notna(x["2nd ID"]):  # skip NaN so IDs without values keep an empty list
        d[x.ID].append(x["2nd ID"])

# will return a series of Nones
df.apply(func, axis=1)
Edit:
I asked on Gitter and @gurukiran07 gave me an answer. What you are trying to do is the reverse of the explode function:
s = pd.Series([[1, 2, 3], [4, 5]])
0 [1, 2, 3]
1 [4, 5]
dtype: object
exploded = s.explode()
0 1
0 2
0 3
1 4
1 5
dtype: object
exploded.groupby(level=0).agg(list)
0 [1, 2, 3]
1 [4, 5]
dtype: object
I have a dataframe that looks like this:
Col1 | Col2 | Col1 | Col3 | Col1 | Col4
a | d | | h | a | p
b | e | b | i | b | l
| l | a | l | | a
l | r | l | a | l | x
a | i | a | w | | i
| c | | i | r | c
d | o | d | e | d | o
Col1 is repeated multiple times in the dataframe. In each Col1, there is missing information. I need to create a new column that has all of the information from each Col1 occurrence.
How can I create a column with the complete information and then delete the previous duplicate columns?
Some information may be missing from multiple columns. This script is also meant to be used in the future when there could be one, three, five, or any number of duplicated Col1 columns.
The desired output looks like this:
Col2 | Col3 | Col4 | Col1
d | h | p | a
e | i | l | b
l | l | a | a
r | a | x | l
i | w | i | a
c | i | c | r
o | e | o | d
I have been looking over this question but it is not clear to me how I could keep the desired Col1 with complete values. I could delete multiple columns of the same name but I need to first create a column with complete information.
First, replace empty values in your columns with NaN, as below:
import numpy as np
df = df.replace(r'^\s*$', np.nan, regex=True)
Then you could use groupby followed by first():
df.groupby(level=0, axis=1).first()
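For illustration, a minimal sketch on a frame with duplicate 'Col1' labels (the sample values are abbreviated; note that on recent pandas versions axis=1 groupby is deprecated, and transposing is one workaround):
import pandas as pd
import numpy as np

df = pd.DataFrame(
    [['a', 'd', np.nan, 'h', 'a', 'p'],
     ['b', 'e', 'b', 'i', 'b', 'l'],
     [np.nan, 'l', 'a', 'l', np.nan, 'a']],
    columns=['Col1', 'Col2', 'Col1', 'Col3', 'Col1', 'Col4'],
)
merged = df.groupby(level=0, axis=1).first()   # first non-missing value per column name
# on pandas >= 2.1: merged = df.T.groupby(level=0).first().T
print(merged)
#   Col1 Col2 Col3 Col4
# 0    a    d    h    p
# 1    b    e    i    l
# 2    a    l    l    a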
Maybe something like this is what you are looking for:
col_list = list(set(df.columns))
dicts = {}
for col in col_list:
    val = list(filter(None, set(df.filter(like=col).stack().reset_index()[0].str.strip(' ').tolist())))
    dicts[col] = val

max_len = max(len(v) for v in dicts.values())
pd.DataFrame({k: pd.Series(v[:max_len]) for k, v in dicts.items()})
output
Col3 Col4 Col1 Col2
0 h i d d
1 w l b r
2 i c r i
3 l x l l
4 a p a o
5 e o NaN c
6 NaN a NaN e
I have a dataframe like:
|column1 |
|a,b,c |
|d,b |
|a & b,c |
and I'd like to have it like this:
column_a | column_b | column_c | column_d | column_a & b
1        | 1        | 1        | 0        | 0
0        | 1        | 0        | 1        | 0
1        | 1        | 1        | 0        | 1
similar to get_dummies, except that I have multiple strings per cell.
I don't believe there are repeat strings in a cell, so no '2's.
Any help would be greatly appreciated!
You could start with something like this:
data = '''|column1 |
|a,b,c |
|d,b |
|a & b,c |'''
rows = [r.strip() for r in data.replace('\n', '').split('|')[3:] if r.strip() != '']

values = []
for r in rows:
    values += r.split(',')
values = set(values)

print(' | '.join(['column_' + v for v in values]))
for r in rows:
    output = ''
    for v in values:
        if v in r:
            output += '1'
        else:
            output += '0'
        output += ' | '
    print(output)
You'll have to use some string formatting to make it look pretty, but this should get you started.
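In pandas itself, Series.str.get_dummies does most of this in one call. A sketch (note one difference from the desired output above: it treats 'a & b' as a single label, so column_a and column_b are not also flagged for that row):
import pandas as pd

df = pd.DataFrame({'column1': ['a,b,c', 'd,b', 'a & b,c']})
dummies = df['column1'].str.get_dummies(sep=',').add_prefix('column_')
print(dummies)
#    column_a  column_a & b  column_b  column_c  column_d
# 0         1             0         1         1         0
# 1         0             0         1         0         1
# 2         0             1         0         1         0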