I have a pandas DataFrame that has a number of columns (about 20) containing string objects. I'm looking for a simple way to add all of the columns together into one new column, but have so far been unsuccessful, e.g.:
for i in df.columns:
    df['newcolumn'] = df['newcolumn'] + '/' + df.ix[:, i]
This results in an empty DataFrame column 'newcolumn' instead of the concatenated column.
I'm new to pandas, so any help would be much appreciated.
df['newcolumn'] = df.apply(''.join, axis=1)
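Since the question wanted a '/'-separated result, a minimal sketch (with made-up column names, and assuming every column already holds strings):

```python
import pandas as pd

df = pd.DataFrame({"a": ["x", "y"], "b": ["1", "2"], "c": ["p", "q"]})

# join every column in each row with "/"; all columns must already be strings
df["newcolumn"] = df.apply("/".join, axis=1)
```

This avoids the empty-column problem in the question, which comes from seeding the loop with a nonexistent 'newcolumn' (adding anything to NaN stays NaN).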
Wondering if someone could please help me with this:
I have a pandas df with a rather large number of columns (over 50). I'd like to remove duplicates based on a subset (columns 2 to 50).
I've been trying to use df.drop_duplicates(subset=["col1","col2",....]), but I'm wondering if there is a way to pass the column indices instead, so I don't have to write out all the column headers to consider for the drop, i.e. something along the lines of df.drop_duplicates(subset = [2:])
Thanks upfront
You can slice df.columns like:
df.drop_duplicates(subset = df.columns[2:])
Or:
df.drop_duplicates(subset = df.columns[2:].tolist())
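A minimal sketch with hypothetical column names, keeping the first row of each duplicate combination of everything from the third column on:

```python
import pandas as pd

df = pd.DataFrame({
    "id":   [1, 2, 3],
    "col1": ["a", "a", "b"],
    "col2": [10, 10, 20],
    "col3": ["x", "x", "y"],
})

# deduplicate on columns 2 onwards (here: col2, col3) without naming them
deduped = df.drop_duplicates(subset=df.columns[2:])
```

Rows 1 and 2 share (10, "x") in the subset, so only the first survives.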
I have an array of unique elements and a dataframe.
I want to find out if the elements in the array exist in every row of the dataframe.
P.S. I am new to Python.
This is the piece of code I've written.
for i in uniqueArray:
    for index, row in newDF.iterrows():
        if i in row['MKT']:
            # do something to find out if the element i exists in all rows
Also, this way of iterating is quite expensive; is there a better way to do the same?
Thanks in Advance.
Pandas allows you to filter a whole column, much as you would in Excel:
import pandas
df = pandas.DataFrame(tableData)
Imagine your column names are "Column1", "Column2", etc.:
df2 = df[df["Column1"] == "ValueToFind"]
df2 now has only the rows that have "ValueToFind" in df["Column1"]. You can combine several filters with the AND (&) and OR (|) logical operators.
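A small sketch of such filtering, with hypothetical data; note each condition needs its own parentheses when combined with `&` or `|`:

```python
import pandas as pd

df = pd.DataFrame({
    "Column1": ["ValueToFind", "other", "ValueToFind"],
    "Column2": [1, 2, 3],
})

# single filter: rows where Column1 matches
only_match = df[df["Column1"] == "ValueToFind"]

# combined filter: AND two conditions with &, parenthesizing each one
combined = df[(df["Column1"] == "ValueToFind") & (df["Column2"] > 1)]
```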
You can try

for i in uniqueArray:
    # a Series has no .contains(); use the .str accessor instead
    if newDF['MKT'].str.contains(i).all():  # use .any() for "at least one row"
        # do your task
You can use the isin() method of a pd.Series object.
Assuming you have a data frame named df and want to check whether your column 'MKT' includes any items of your uniqueArray:
new_df = df[df.MKT.isin(uniqueArray)].copy()
new_df will only contain the rows where the value of MKT is contained in uniqueArray.
Now do your thing on new_df, and join/merge/concat it to the former df as you wish.
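A runnable sketch of the isin() approach, with made-up market codes:

```python
import pandas as pd

uniqueArray = ["US", "EU"]
df = pd.DataFrame({"MKT": ["US", "ASIA", "EU"], "val": [1, 2, 3]})

# keep only the rows whose MKT value appears in uniqueArray
new_df = df[df.MKT.isin(uniqueArray)].copy()
```

The .copy() makes new_df independent of df, so later assignments don't raise SettingWithCopyWarning.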
I am 'translating' Python code to PySpark. I would like to use an existing column as the index for a dataframe. I did this in Python using pandas; the small piece of code below shows what I did. Thanks for helping.
df.set_index('colx', drop=False, inplace=True)
# sort the index
df.sort_index(inplace=True)
I expect the result to be a dataframe with 'colx' as index.
Add an index to the PySpark dataframe as a column and use it:
rdd_df = df.rdd.zipWithIndex()
df_index = rdd_df.toDF()
# and extract the columns
df_index = df_index.withColumn('colA', df_index['_1'].getItem("colA"))
df_index = df_index.withColumn('colB', df_index['_1'].getItem("colB"))
This is not how it works with Spark; no such concept exists.
One can add a column via zipWithIndex by converting the DF to an RDD and back, but that is a new column, so it's not the same thing.
I have to create around 800 dummy columns in a dataframe, all of which hold null values.
I don't want to use df.withColumn('x', lit(None)) for individual columns as there are many columns.
I tried map(lambda x: df.withColumn(x, lit(None)), column_list) but it's not working.
Writing the snippet below also looks like a bad approach:
for column in columns:
    df = df.withColumn(column, lit(None))
Can someone suggest the best way to do this?
The only approach you haven't listed that I can think of is to use the underlying rdd: map each row to itself plus (None,) * len(columns).

from pyspark.sql.types import StructType, StructField, NullType

schema = StructType(df.schema.fields + [StructField(c, NullType()) for c in columns])
df = df.rdd.map(lambda row: tuple(row) + (None,) * len(columns)).toDF(schema=schema)
I am using python 2.7 with dask
I have a dataframe with one column of tuples that I created like this:
table[col] = table.apply(lambda x: (x[col1], x[col2]), axis=1, meta=pd.DataFrame)
I want to convert this tuple column back into two separate columns.
In pandas I would do it like this:
table[[col1,col2]] = table[col].apply(pd.Series)
The point of doing so is that a dask dataframe does not support a multi index, and I want to group by multiple columns, so I wish to create a column of tuples that gives me a single index containing all the values I need (please ignore efficiency vs. a multi index, since there is not yet full support for this in dask dataframe).
When I try to unpack the tuple column with dask using this code:
rxTable[["a","b"]] = rxTable["tup"].apply(lambda x: s(x), meta = pd.DataFrame, axis = 1)
I get this error
AttributeError: 'Series' object has no attribute 'columns'
when I try
rxTable[["a","b"]] = rxTable["tup"].apply(dd.Series, axis = 1, meta = pd.DataFrame)
I get the same
How can I take a column of tuples and convert it to two columns, like I do in pandas with no problem?
Thanks
The best I found so far is converting to a pandas dataframe, converting the column there, and then going back to dask:
df1 = df.compute()
df1[["a", "b"]] = df1["c"].apply(pd.Series)
df = dd.from_pandas(df1, npartitions=1)
This will work well. If the df is too big for memory, you can either:
1. Compute only the wanted column, convert it into two columns, and then merge the split results back into the original df.
2. Split the df into chunks, convert each chunk and append it to an HDF5 file, then use dask to read the entire HDF5 file into a dask dataframe.
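The middle step of that round trip, splitting the tuple column in plain pandas, can be sketched like this (the frame and column names are hypothetical):

```python
import pandas as pd

df1 = pd.DataFrame({"c": [(1, "a"), (2, "b")]})

# expand each tuple into a two-column frame and assign both columns at once
df1[["a", "b"]] = df1["c"].apply(pd.Series)
```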
I found this methodology works well and avoids converting the Dask DataFrame to Pandas:
df['a'] = df['tup'].str.partition(sep)[0]
df['b'] = df['tup'].str.partition(sep)[2]
where sep is whatever delimiter you were using in the column to separate the two elements.
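A minimal sketch of that approach, assuming the column holds strings whose two parts are joined by an underscore (this works on string columns, not columns of actual tuple objects):

```python
import pandas as pd

sep = "_"
df = pd.DataFrame({"tup": ["1_a", "2_b"]})

# partition splits on the first sep: index 0 is the left part, 2 the right
df["a"] = df["tup"].str.partition(sep)[0]
df["b"] = df["tup"].str.partition(sep)[2]
```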