I have a dataframe and I am trying, whenever the col1 string value contains ":", to split it, take the part after the colon, and put it into col2, like this:
df['col1'] = df['col1'].astype(str)
df['col2'] = df['col1'].astype(str)
for i, row in df.iterrows():
    if ":" in row['col1']:
        row['col2'] = row['col1'].split(":")[1] + " " + "in Person"
        row['col1'] = 'j'
It seems to work on a sample dataframe like this, but it doesn't change the result in the original dataframe:
import pandas as pd
d = {'col1': ['a:b', 'ac'], 'col2': ['z 26', 'y 25']}
df = pd.DataFrame(data=d)
print(df)
  col1         col2
0    j  b in Person
1   ac         y 25
What am I doing wrong, and what are the alternatives for this condition?
For the extracting part, try:
df['col2'] = df.col1.str.extract(r':(.+)', expand=False).add(' ').add(df.col2, fill_value='')
# Output
col1 col2
0 a:b b z 26
1 ac y 25
I'm not sure if I understand the replacing correctly, but here is a try:
df.loc[df.col1.str.contains(':'), 'col1'] = 'j'
# Output
col1 col2
0 j b z 26
1 ac y 25
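As for what went wrong in the loop: iterrows() yields a copy of each row, so assigning into row never writes back to df. A minimal sketch of the same transformation done through .loc instead (assuming the 'j' replacement and the " in Person" suffix from the question):

```python
import pandas as pd

d = {'col1': ['a:b', 'ac'], 'col2': ['z 26', 'y 25']}
df = pd.DataFrame(data=d)

# Write back through df.loc instead of mutating the row copies from iterrows()
mask = df['col1'].str.contains(':')
df.loc[mask, 'col2'] = df.loc[mask, 'col1'].str.split(':').str[1] + ' in Person'
df.loc[mask, 'col1'] = 'j'
print(df)
```

Rows without a colon are left untouched because every assignment goes through the boolean mask.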
The code below creates multiple empty dataframes named from the report2 list. They are then populated with a filtered version of an existing dataframe called dfsource.
With a nested for loop, I'd like to filter each of these dataframes using a list of values, but the inner loop does not work as shown.
import pandas as pd
report=['A','B','C']
suffix='_US'
report2=[s + suffix for s in report]
print (report2) #result: ['A_US', 'B_US', 'C_US']
source = {'COL1': ['A','B','C'], 'COL2': ['D','E','F']}
dfsource=pd.DataFrame(source)
print(dfsource)
df_dict = {}
for i in report2:
    df_dict[i]=pd.DataFrame()
    for x in report:
        df_dict[i]=dfsource.query('COL1==x')
        #df_dict[i]=dfsource.query('COL1=="A"') #Example, this works filtering for value A but not what I need.
print(df_dict['A_US'])
print(df_dict['B_US'])
print(df_dict['C_US'])
You can reference a variable in a query by using @:
df_dict[i]=dfsource.query('COL1==@x')
So the total code looks like this
import pandas as pd
report=['A','B','C']
suffix='_US'
report2=[s + suffix for s in report]
print (report2) #result: ['A_US', 'B_US', 'C_US']
source = {'COL1': ['A','B','C'], 'COL2': ['D','E','F']}
dfsource=pd.DataFrame(source)
print(dfsource)
df_dict = {}
for i in report2:
    df_dict[i]=pd.DataFrame()
    for x in report:
        df_dict[i]=dfsource.query('COL1==@x')
        #df_dict[i]=dfsource.query('COL1=="A"') #Example, this works filtering for value A but not what I need.
print(df_dict['A_US'])
print(df_dict['B_US'])
print(df_dict['C_US'])
which outputs
COL1 COL2
0 A D
1 B E
2 C F
COL1 COL2
2 C F
COL1 COL2
2 C F
COL1 COL2
2 C F
However, I think you want to create a new dictionary keyed on the i and x of each list. Then you can move the creation of the dataframe into the second for loop and create a new key for each iteration.
import pandas as pd
report=['A','B','C']
suffix='_US'
report2=[s + suffix for s in report]
print (report2) #result: ['A_US', 'B_US', 'C_US']
source = {'COL1': ['A','B','C'], 'COL2': ['D','E','F']}
dfsource=pd.DataFrame(source)
print(dfsource)
df_dict = {}
for i in report2:
    for x in report:
        new_key = x + i
        df_dict[new_key]=pd.DataFrame()
        df_dict[new_key]=dfsource.query('COL1==@x')
for item in df_dict.items():
    print(item)
Outputs 9 unique dataframes which are filtered based on whatever x value was passed.
('AA_US', COL1 COL2
0 A D)
('BA_US', COL1 COL2
1 B E)
('CA_US', COL1 COL2
2 C F)
('AB_US', COL1 COL2
0 A D)
('BB_US', COL1 COL2
1 B E)
('CB_US', COL1 COL2
2 C F)
('AC_US', COL1 COL2
0 A D)
('BC_US', COL1 COL2
1 B E)
('CC_US', COL1 COL2
2 C F)
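If the goal is simply one filtered frame per report value, the inner loop can be dropped entirely; a dict comprehension with plain boolean masking is a compact alternative (a sketch, equivalent under that assumption):

```python
import pandas as pd

report = ['A', 'B', 'C']
suffix = '_US'
source = {'COL1': ['A', 'B', 'C'], 'COL2': ['D', 'E', 'F']}
dfsource = pd.DataFrame(source)

# One filtered dataframe per report value, keyed 'A_US', 'B_US', 'C_US'
df_dict = {r + suffix: dfsource[dfsource['COL1'] == r] for r in report}
print(df_dict['B_US'])
```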
I want to acquire all rows in a DataFrame where the length of any column value is shorter than 2.
For example:
df = pd.DataFrame({"col1":["a","ab",""],"col2":["bc","abc", "a"]})
col1 col2
0 a bc
1 ab abc
2 a
How to get this output:
col1 col2
0 a bc
2 a
Let's try stack to reshape, then use str.len to compute the lengths and build a boolean mask with lt + a per-row any:
df[df.stack().str.len().lt(2).groupby(level=0).any()]
col1 col2
0 a bc
2 a
You can use the str.len() method of a pandas Series:
for col in df.columns:
    df[col] = df[col][df[col].str.len() < 3]
df = df.dropna()
A list comprehension could help here :
df.loc[[not any(len(word) > 2 for word in entry)
for entry in df.to_numpy()]
]
col1 col2
0 a bc
2 a
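Along the same lines, the reshape can be avoided by computing the lengths column-wise and reducing across rows (a sketch of that variant):

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", "ab", ""], "col2": ["bc", "abc", "a"]})

# Length of every cell, then keep rows where any cell is shorter than 2
lengths = df.apply(lambda s: s.str.len())
print(df[lengths.lt(2).any(axis=1)])
```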
Below is my script for a generic data frame in Python using pandas. I am hoping to split a certain column of the data frame into new columns, while respecting the original orientation of the items in the original column.
Please see below for clarity. Thank you in advance!
My script:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': ['x,y,z', 'a,b', 'c']})
print(df)
Here's what I want
df = pd.DataFrame({'col1': ['x',np.nan,np.nan],
'col2': ['y','a',np.nan],
'col3': ['z','b','c']})
print(df)
Here's what I get
df = pd.DataFrame({'col1': ['x','a','c'],
'col2': ['y','b',np.nan],
'col3': ['z',np.nan,np.nan]})
print(df)
You can use the justify function from this answer with Series.str.split:
dfn = pd.DataFrame(
justify(df['col1'].str.split(',', expand=True).to_numpy(),
invalid_val=None,
axis=1,
side='right')
).add_prefix('col')
col0 col1 col2
0 x y z
1 None a b
2 None None c
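The justify function itself comes from the linked answer and is not reproduced there; a minimal reconstruction (an approximation, not the original code) that pushes each row's valid entries to one side looks like:

```python
import numpy as np

def justify(a, invalid_val=np.nan, axis=1, side='left'):
    """Push the valid entries of `a` to one side along `axis`, pad with invalid_val."""
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)  # sorts False first, True last
    if side in ('up', 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val, dtype=object)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
```

With invalid_val=None and side='right' this reproduces the right-aligned output shown above.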
Here is a way of tweaking the split:
max_delim = df['col1'].str.count(',').max() # count the max occurrence of `,`
delim_to_add = max_delim - df['col1'].str.count(',') # difference of each count from the max
# multiply the delimiter and add it to series, followed by split
df[['col1','col2','col3']] = (df['col1'].radd([','*i for i in delim_to_add])
.str.split(',',expand=True).replace('',np.nan))
print(df)
col1 col2 col3
0 x y z
1 NaN a b
2 NaN NaN c
Try something like
s=df.col1.str.count(',')
#(s.max()-s).map(lambda x : x*',')
#0
#1 ,
#2 ,,
Name: col1, dtype: object
(s.max()-s).map(lambda x : x*',').add(df.col1).str.split(',',expand=True)
0 1 2
0 x y z
1 a b
2 c
import pandas
d = {'col1': [25,20,30],
'col2': [25,20,30],
'col3': [25,20,30],
'col4': [25,39,11]
}
df = pandas.DataFrame(data=d)
How would I loop over this data frame, add col1 + col2 + col3 + col4, and if the sum does not equal 100, take each value at that index and replace it with value/(col1+col2+col3+col4)? This way, when you sum col1 + col2 + col3 + col4, it will add up to 100 for that index.
So, for example, for index 0, when you add col1 + col2 + col3 + col4 it equals 100, therefore go to the next index; however, for index 1 it adds up to 99, so take 20/99 and make it the new value of that position, etc.
expected output:
d = {'col1': [25,20/99,30/101],
'col2': [25,20/99,30/101],
'col3': [25,20/99,30/101],
'col4': [25,39/99,11/101]
}
df = pandas.DataFrame(data=d)
Here is a vectorized version:
import numpy as np

c = df.sum(axis=1).ne(100)
vals = np.where(c.to_numpy()[:, None], df.div(df.sum(axis=1), axis=0), df)
new_df = pd.DataFrame(vals, index=df.index, columns=df.columns)
# for overwriting the original df, use: df[:] = vals
print(new_df)
col1 col2 col3 col4
0 25.00000 25.00000 25.00000 25.000000
1 0.20202 0.20202 0.20202 0.393939
2 0.29703 0.29703 0.29703 0.108911
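A pandas-only variant of the same idea, touching only the rows whose sum differs from 100 (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'col1': [25, 20, 30], 'col2': [25, 20, 30],
                   'col3': [25, 20, 30], 'col4': [25, 39, 11]})

sums = df.sum(axis=1)
mask = sums.ne(100)
new_df = df.astype(float)           # float up front so division assigns cleanly
new_df.loc[mask] = df.loc[mask].div(sums[mask], axis=0)
print(new_df)
```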
This achieves what you want by first generating each column as a list of its own:
col = [d[row][i] for row in d]
The process you describe is then applied:
if sum(col) != 100:
    newcol = [n/sum(col) for n in col]
and then the column can be re-inserted. Final product:
for i in range(0, 3):
    col = [d[row][i] for row in d]
    if sum(col) != 100:
        newcol = [n/sum(col) for n in col]
    else:
        newcol = col.copy()
    for row in d:
        d[row][i] = newcol[int(row[-1:])-1]
I ended up using this method to resolve my question (skipping rows that already sum to 100, per the expected output):
for i in range(len(df)):
    x = df.loc[i,'col1'] + df.loc[i,'col2'] + df.loc[i,'col3'] + df.loc[i,'col4']
    if x != 100:
        for j in range(0, 4):
            df.iloc[i, j] = df.iloc[i, j] / x
This question is an extension of Pandas conditional creation of a series/dataframe column.
If we had this dataframe:
Col1 Col2
1 A Z
2 B Z
3 B X
4 C Y
5 C W
and we wanted to do the equivalent of:
if Col2 in ('Z','X') then Col3 = 'J'
else if Col2 = 'Y' then Col3 = 'K'
else Col3 = {value of Col1}
How could I do that?
You can use loc with isin and finally fillna:
df.loc[df.Col2.isin(['Z','X']), 'Col3'] = 'J'
df.loc[df.Col2 == 'Y', 'Col3'] = 'K'
df['Col3'] = df.Col3.fillna(df.Col1)
print (df)
Col1 Col2 Col3
1 A Z J
2 B Z J
3 B X J
4 C Y K
5 C W C
Try np.where, which has the shape: outcome = np.where(condition, value_if_true, value_if_false)
df["Col3"] = np.where(df['Col2'].isin(['Z','X']), "J", np.where(df['Col2'].isin(['Y']), 'K', df['Col1']))
Col1 Col2 Col3
1 A Z J
2 B Z J
3 B X J
4 C Y K
5 C W C
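When there are more than two branches, np.select keeps the conditions flat instead of nesting np.where calls; a sketch of the same example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': list('ABBCC'), 'Col2': list('ZZXYW')},
                  index=range(1, 6))

# Conditions and choices are parallel lists; default covers the else branch
conditions = [df['Col2'].isin(['Z', 'X']), df['Col2'] == 'Y']
df['Col3'] = np.select(conditions, ['J', 'K'], default=df['Col1'])
print(df)
```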
A simple (but likely inefficient) way can be useful when you have multiple if conditions, for example when you are trying to put values into (say) four buckets based on quartiles.
df holds your data, col1 has the values, and col2 should hold the bucketed values (1, 2, 3, 4).
quart has the 25%, 50% and 75% bounds.
Try this:
Create a dummy list: dummy = []
Iterate through the data frame with: for index, row in df.iterrows():
Set up the if conditions like: if row[col1] <= quart[0]: # 25%
Append the proper value to dummy under the if: dummy.append(1)
The nested if-elif can take care of all the remaining buckets, each appending its value to dummy.
Add dummy as a column: df[col2] = dummy
You can find the quartiles via A = df.describe() and then print(A[col1])
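The steps above can be sketched as follows (the sample data and the col1/col2 names are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({'col1': [5, 12, 27, 41, 60, 78, 90, 99]})
quart = df['col1'].quantile([0.25, 0.5, 0.75]).tolist()  # the 25%/50%/75% bounds

dummy = []
for index, row in df.iterrows():
    if row['col1'] <= quart[0]:      # 25%
        dummy.append(1)
    elif row['col1'] <= quart[1]:    # 50%
        dummy.append(2)
    elif row['col1'] <= quart[2]:    # 75%
        dummy.append(3)
    else:
        dummy.append(4)

df['col2'] = dummy
print(df)
```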