Iterate and input data into a column in a pandas dataframe - python

I have a pandas dataframe with a column drawn from a small set of strings. Let's call the column 'A'; every value in it is one of string_1, string_2, string_3.
Now, I want to add another column and fill it with numeric values that correspond to the strings.
I created a dictionary
d = { 'string_1' : 1, 'string_2' : 2, 'string_3': 3}
I then initialized the new column:
df['B'] = pd.Series(index=df.index)
Now, I want to fill it with the integer values. I can look up the values associated with the strings in the dictionary with:
for s in df['A']:
    n = d[s]
That works fine, but when I try plain df['B'] = n inside the for-loop to fill the new column, it doesn't work, and I haven't been able to figure out the right pandas indexing.

If I understand you correctly, you can just call map:
df['B'] = df['A'].map(d)
This will perform the lookup and fill the values you are looking for.
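For reference, a minimal runnable sketch (the sample frame here is made up to mirror the question):
import pandas as pd

# hypothetical sample data mirroring the question
df = pd.DataFrame({'A': ['string_1', 'string_2', 'string_3', 'string_2']})
d = {'string_1': 1, 'string_2': 2, 'string_3': 3}

df['B'] = df['A'].map(d)
print(df)
#           A  B
# 0  string_1  1
# 1  string_2  2
# 2  string_3  3
# 3  string_2  2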

Rather than initializing an empty column first, you can populate it directly with apply:
df['B'] = df['A'].apply(d.get)
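Note that both map(d) and apply(d.get) leave strings missing from the dictionary as NaN/None. If you want a default value instead, a minimal sketch (the 0 default is an assumed placeholder):
df['B'] = df['A'].map(d).fillna(0)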

Related

How to check whether a value exists in a pandas dataframe column whose values are lists

I have a pandas dataframe whose PathDSC column contains values of type list. How can I filter the dataframe to the rows where 'd6d4e77e-b8ec-467a-ba06-1c6079aa2d82' appears in any of those lists?
I tried
def get_projects_belongs_to_root_project(project_df, root_project_id):
    filter_project_df = project_df.query("root_project_id in PathDSC")
but it didn't work; I got an empty dataframe.
Assuming the values of the PathDSC column are lists of strings, you can check row-wise whether each list contains the wanted value and build a mask with Series.apply. Then select only those rows using boolean indexing.
def get_projects_belongs_to_root_project(project_df, root_project_id):
    mask = project_df['PathDSC'].apply(lambda lst: root_project_id in lst)
    filter_project_df = project_df[mask]
    # ...
Alternatively, if the PathDSC values are strings rather than lists, a substring match works:
root_project_id = 'd6d4e77e-b8ec-467a-ba06-1c6079aa2d82'
df = df[df['PathDSC'].str.contains(root_project_id)]
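A small end-to-end sketch of the mask approach, using made-up data:
import pandas as pd

df = pd.DataFrame({'PathDSC': [['a1b2', 'd6d4e77e-b8ec-467a-ba06-1c6079aa2d82'], ['c3d4']]})
root_project_id = 'd6d4e77e-b8ec-467a-ba06-1c6079aa2d82'
mask = df['PathDSC'].apply(lambda lst: root_project_id in lst)
print(df[mask])  # keeps only the first row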

How to clean dataframe column filled with names using Python?

I have the following dataframe:
df = pd.DataFrame( columns = ['Name'])
df['Name'] = ['Aadam','adam','AdAm','adammm','Adam.','Bethh','beth.','beht','Beeth','Beth']
I want to clean the column in order to achieve the following:
df['Name Corrected'] = ['adam','adam','adam','adam','adam','beth','beth','beth','beth','beth']
df
Cleaned names are based on the following reference table:
ref = pd.DataFrame( columns = ['Cleaned Names'])
ref['Cleaned Names'] = ['adam','beth']
I am aware of fuzzy matching but I'm not sure if that's the most efficient way of solving the problem.
You can try:
lst = ['adam', 'beth']
out = pd.concat([df['Name'].str.contains(x, case=False).map({True: x}) for x in lst], axis=1)
df['Name corrected'] = out.bfill(axis=1).iloc[:, 0]
# Finally:
df['Name corrected'] = df['Name corrected'].ffill()
# but note: under certain conditions ffill() can propagate wrong values
Explanation:
lst = ['adam', 'beth']
# create a list of reference words
out = pd.concat([df['Name'].str.contains(x, case=False).map({True: x}) for x in lst], axis=1)
# check whether the 'Name' column contains each word from the list, one word at a time;
# str.contains gives a boolean Series, and .map({True: x}) turns True into the word itself
# and False into NaN; the resulting Series are concatenated along axis=1 into a DataFrame
df['Name corrected'] = out.bfill(axis=1).iloc[:, 0]
# backward-fill along axis=1 and take the first column
# Finally:
df['Name corrected'] = df['Name corrected'].ffill()
# forward-fill the remaining missing values
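Since the question mentions fuzzy matching: a minimal sketch using the standard library's difflib, as an alternative to substring matching (the 0.6 cutoff is an assumption you may need to tune):
from difflib import get_close_matches

import pandas as pd

df = pd.DataFrame({'Name': ['Aadam', 'adam', 'AdAm', 'adammm', 'Adam.',
                            'Bethh', 'beth.', 'beht', 'Beeth', 'Beth']})
ref = ['adam', 'beth']

def closest(name):
    # best fuzzy match from ref, or None if nothing clears the cutoff
    matches = get_close_matches(name.lower(), ref, n=1, cutoff=0.6)
    return matches[0] if matches else None

df['Name Corrected'] = df['Name'].apply(closest)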

Create new column in DataFrame from a conditional of a list

I have a DataFrame like:
df = pd.DataFrame({'A' : (1,2,3), 'B': ([0,1],[3,4],[6,0])})
I want to create a new column called test that displays a 1 if a 0 exists within each list in column B. The results hopefully would look like:
df = pd.DataFrame({'A' : (1,2,3), 'B': ([0,1],[3,4],[6,0]), 'test': (1,0,1)})
With a dataframe that contains strings rather than lists, I have created additional columns from the string values using the following:
df.loc[df['B'].str.contains(0),'test']=1
When I try this with the example df, I generate a
TypeError: first argument must be string or compiled pattern
I also tried converting the list to a string, but that only retained the first integer in the list. I suspect that I need something else to access the elements within the list but don't know how to do it. Any suggestions?
This should do it for you:
import numpy as np

df['test'] = np.where(df['B'].apply(lambda x: 0 in x), 1, 0)
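Equivalently, without numpy, the boolean Series can be cast directly:
df['test'] = df['B'].apply(lambda x: 0 in x).astype(int)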

split, map data in two columns in pandas data frame

I want to split data in two columns from a data frame and construct new columns using this data.
My data frame is,
dfc = pd.DataFrame({
    "A": ["GT:DP:RO:QR:AO:QA:GL", "GT:DP:RO:QR:AO:QA:GL", "GT:DP:RO:QR:AO:QA:GL",
          "GT:DP:GL", "GT:DP:GL"],
    "B": ["0/1:71:43:1363:28:806:-71.1191,0,-121.278", "0/1:71:43:1363:28:806:-71.1191,0,-121.278",
          "0/1:71:43:1363:28:806:-71.1191,0,-121.278", "1/1:49:-103.754,0,-3.51307",
          "1/1:49:-103.754,0,-3.51307"]
})
I want individual columns named GT, DP, RO, QR, AO, QA, GL, with the values taken from column B.
We can split the two columns using a = dfc.A.str.split(":", expand=True) and b = dfc.B.str.split(":", expand=True) to get two individual data frames. These can be merged with c = pd.merge(a, b, left_index=True, right_index=True) to get all the desired data, but not in the expected format.
Any suggestions? I think a better way might be to use split on both columns A and B and then create a dict column with values from A as keys and B as values. That column could then be converted to a data frame.
Thanks
Use an OrderedDict to preserve the order after creating a dict mapping of the two concerned columns of the dataframe split on the sep ":", flattened to a list.
Feed this to the dataframe constructor later.
from collections import OrderedDict
L = dfc.apply(
    lambda x: OrderedDict(zip(x['A'].split(':'), x['B'].split(':'))), axis=1).tolist()
pd.DataFrame(L)
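On Python 3.7+ a plain dict already preserves insertion order, so the OrderedDict import can be dropped; an equivalent sketch:
L = dfc.apply(lambda x: dict(zip(x['A'].split(':'), x['B'].split(':'))), axis=1).tolist()
pd.DataFrame(L)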
I'm going to split everything by ':', but I have 2 columns. If I stack first, I get a Series on which I can more easily use str.split.
I now have a split Series that I can group by level=0, which is the original index.
I zip and dict to get Series-like structures with the original column A entries as the indices and B as the values.
unstack and I'm done.
gb = dfc.stack().str.split(':').groupby(level=0)
gb.apply(lambda x: dict(zip(*x))).unstack()

How to iterate over columns of pandas dataframe to run regression

I have this code using Pandas in Python:
all_data = {}
for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')

prices = DataFrame({tic: data['Adj Close'] for tic, data in all_data.items()})
returns = prices.pct_change()
I know I can run a regression like this:
regs = sm.OLS(returns.FIUIX,returns.FSTMX).fit()
but how can I do this for each column in the dataframe? Specifically, how can I iterate over columns, in order to run the regression on each?
Specifically, I want to regress each other ticker symbol (FIUIX, FSAIX and FSAVX) on FSTMX, and store the residuals for each regression.
I've tried various versions of the following, but nothing I've tried gives the desired result:
resids = {}
for k in returns.keys():
    reg = sm.OLS(returns[k], returns.FSTMX).fit()
    resids[k] = reg.resid
Is there something wrong with the returns[k] part of the code? How can I use the k value to access a column? Or else is there a simpler approach?
for column in df:
    print(df[column])
You can use items() (iteritems() in older pandas; it was removed in pandas 2.0):
for name, values in df.items():
    print('{name}: {value}'.format(name=name, value=values[0]))
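Tying this back to the regression in the question, a minimal sketch (assuming statsmodels is imported as sm and returns is the frame from the question):
import statsmodels.api as sm

resids = {}
for name, series in returns.items():
    if name == 'FSTMX':
        continue  # don't regress FSTMX on itself
    resids[name] = sm.OLS(series, returns['FSTMX'], missing='drop').fit().resid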
This answer covers iterating over selected columns as well as all columns in a DF.
df.columns gives an Index containing all the column names in the DF. That isn't very helpful if you want to iterate over all the columns, but it comes in handy when you want to iterate over a subset of your choosing only.
We can use Python's list slicing to slice df.columns according to our needs. For example, to iterate over all columns but the first one, we can do:
for column in df.columns[1:]:
    print(df[column])
Similarly, to iterate over all the columns in reversed order, we can do:
for column in df.columns[::-1]:
    print(df[column])
We can iterate over all the columns in a lot of cool ways using this technique. Also remember that you can get the indices of all columns easily using:
for ind, column in enumerate(df.columns):
    print(ind, column)
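Pairing the position with the name also lets you mix positional and label access; a hypothetical one-liner:
first_values = {column: df.iloc[0, ind] for ind, column in enumerate(df.columns)}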
You can index dataframe columns by position using iloc (the older ix accessor has been removed from pandas).
df1.iloc[:, 1]
This returns the second column, for example (0 would be the first).
df1.iloc[0]
This returns the first row.
df1.iloc[0, 1]
This is the value at the intersection of row 0 and column 1.
And so on. So you can enumerate() returns.keys() and use the number to index the dataframe.
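A quick sketch with a made-up frame:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(df1.iloc[:, 1])  # second column ('b')
print(df1.iloc[0])     # first row
print(df1.iloc[0, 1])  # 3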
A workaround is to transpose the DataFrame and iterate over the rows.
for column_name, column in df.transpose().iterrows():
    print(column_name)
Using a list comprehension, you can get all the column names (header):
[column for column in df]
Based on the accepted answer, if an index corresponding to each column is also desired:
for i, column in enumerate(df):
    print(i, df[column])
The above df[column] type is Series, which can simply be converted into a numpy ndarray:
import numpy as np
for i, column in enumerate(df):
    print(i, np.asarray(df[column]))
I'm a bit late, but here's how I did this. The steps:
Create a list of all columns
Use itertools to take x-column combinations
Append each result's R-squared value to a results dataframe, along with the list of excluded columns
Sort the result DF in descending order of R-squared to see which combination is the best fit.
This is the code I used on a DataFrame called aft_tmt. Feel free to extrapolate to your use case.
import pandas as pd
import statsmodels.formula.api as smf
import itertools

# setting options to print without truncating output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

# This section gets the column names of the DF and removes some columns which I don't want to use as predictors.
itercols = aft_tmt.columns.tolist()
itercols.remove("sc97")
itercols.remove("sc")
itercols.remove("grc")
itercols.remove("grc97")
print(itercols)
len(itercols)
# results DF
regression_res = pd.DataFrame(columns=["Rsq", "predictors", "excluded"])

# change 9 to the number of columns you want to combine from N columns.
# Possibly run an outer loop from 0 to N/2?
for x in itertools.combinations(itercols, 9):
    lmstr = "+".join(x)
    m = smf.ols(formula="sc ~ " + lmstr, data=aft_tmt)
    f = m.fit()
    excluded = "+".join(y for y in itercols if y not in x)
    # DataFrame.append was removed in pandas 2.0, so build rows with pd.concat
    regression_res = pd.concat([regression_res,
                                pd.DataFrame([[f.rsquared, lmstr, excluded]],
                                             columns=["Rsq", "predictors", "excluded"])])

regression_res.sort_values(by="Rsq", ascending=False)
I landed on this question as I was looking for a clean iterator of columns only (Series, no names).
Unless I am mistaken, there is no such thing, which, if true, is a bit annoying. In particular, one would sometimes like to assign a few individual columns (Series) to variables, e.g.:
x, y = df[['x', 'y']] # does not work
There is df.items() that gets close, but it gives an iterator of tuples (column_name, column_series). Interestingly, there is a corresponding df.keys() which returns df.columns, i.e. the column names as an Index, so a, b = df[['x', 'y']].keys() assigns properly a='x' and b='y'. But there is no corresponding df.values(), and for good reason, as df.values is a property and returns the underlying numpy array.
One (inelegant) way is to do:
x, y = (v for _, v in df[['x', 'y']].items())
but it's less pythonic than I'd like.
Most of these answers are going via the column name, rather than iterating the columns directly. They will also have issues if there are multiple columns with the same name. If you want to iterate the columns, I'd suggest:
for series in (df.iloc[:, i] for i in range(df.shape[1])):
    ...
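This matters with duplicate column names, where label access returns a DataFrame rather than a Series; a small sketch:
import pandas as pd

# hypothetical frame with a duplicated column name
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'a'])
# df['a'] would return a two-column DataFrame here;
# positional iteration still yields one Series at a time
for series in (df.iloc[:, i] for i in range(df.shape[1])):
    print(series.tolist())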
Assuming an X feature matrix and a (multi-column) y label: move the label columns into the index, then split off values and index:
columns = [c for c in _df.columns if c in ['col1', 'col2', 'col3']]  # or '..c not in..'
_df.set_index(columns, inplace=True)
print(_df.index)
X, y = _df.iloc[:, :4].values, _df.index.values
