I have a pandas DataFrame that has a number of columns (about 20) containing string objects. I'm looking for a simple way to add all of the columns together into one new column, but have so far been unsuccessful, e.g.:
for i in df.columns:
    df['newcolumn'] = df['newcolumn'] + '/' + df.ix[:, i]
This results in an empty DataFrame column 'newcolumn' instead of the concatenated column.
I'm new to pandas, so any help would be much appreciated.
df['newcolumn'] = df.apply(''.join, axis=1)
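Since the question wanted a '/'-separated result, a minimal sketch (with made-up column names, and assuming every column already holds strings):

```python
import pandas as pd

df = pd.DataFrame({"a": ["x", "y"], "b": ["1", "2"], "c": ["p", "q"]})

# join every column in each row with "/"; all columns must already be strings
df["newcolumn"] = df.apply("/".join, axis=1)
```

This avoids the empty-column problem in the question, which comes from seeding the loop with a nonexistent 'newcolumn' (adding anything to NaN stays NaN).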
Wondering if someone could please help me with this:
I have a pandas df with a rather large number of columns (over 50). I'd like to remove duplicates based on a subset (columns 2 to 50).
I've been trying to use df.drop_duplicates(subset=["col1","col2",....]), but I'm wondering if there is a way to pass the column indices instead, so I don't have to write out all the column headers to consider for the drop, i.e. something along the lines of df.drop_duplicates(subset = [2:])
Thanks upfront
You can slice df.columns like:
df.drop_duplicates(subset = df.columns[2:])
Or:
df.drop_duplicates(subset = df.columns[2:].tolist())
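A minimal sketch with hypothetical column names, keeping the first row of each duplicate combination of everything from the third column on:

```python
import pandas as pd

df = pd.DataFrame({
    "id":   [1, 2, 3],
    "col1": ["a", "a", "b"],
    "col2": [10, 10, 20],
    "col3": ["x", "x", "y"],
})

# deduplicate on columns 2 onwards (here: col2, col3) without naming them
deduped = df.drop_duplicates(subset=df.columns[2:])
```

Rows 1 and 2 share (10, "x") in the subset, so only the first survives.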
I have an array of unique elements and a dataframe.
I want to find out if the elements in the array exist in every row of the dataframe.
P.S. I am new to Python.
This is the piece of code I've written.
for i in uniqueArray:
    for index, row in newDF.iterrows():
        if i in row['MKT']:
            # do something to find out if the element i exists in all rows
Also, this way of iterating is quite expensive; is there a better way to do the same?
Thanks in Advance.
Pandas allows you to filter a whole column, much as you would in Excel:
import pandas
df = pandas.DataFrame(tableData)
Imagine your column names are "Column1", "Column2", etc.:
df2 = df[df["Column1"] == "ValueToFind"]
df2 now has only the rows that have "ValueToFind" in df["Column1"]. You can combine several filters with the AND (&) and OR (|) logical operators.
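A small sketch of such filtering, with hypothetical data; note each condition needs its own parentheses when combined with `&` or `|`:

```python
import pandas as pd

df = pd.DataFrame({
    "Column1": ["ValueToFind", "other", "ValueToFind"],
    "Column2": [1, 2, 3],
})

# single filter: rows where Column1 matches
only_match = df[df["Column1"] == "ValueToFind"]

# combined filter: AND two conditions with &, parenthesizing each one
combined = df[(df["Column1"] == "ValueToFind") & (df["Column2"] > 1)]
```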
You can try

for i in uniqueArray:
    # a Series has no .contains(); use the .str accessor instead
    if newDF['MKT'].str.contains(i).all():  # use .any() for "at least one row"
        # do your task
You can use the isin() method of a pd.Series object.
Assuming you have a data frame named df and want to check whether your column 'MKT' includes any items of your uniqueArray:
new_df = df[df.MKT.isin(uniqueArray)].copy()
new_df will only contain the rows where the value of MKT is contained in uniqueArray.
Now do your thing on new_df, and join/merge/concat it to the former df as you wish.
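A runnable sketch of the isin() approach, with made-up market codes:

```python
import pandas as pd

uniqueArray = ["US", "EU"]
df = pd.DataFrame({"MKT": ["US", "ASIA", "EU"], "val": [1, 2, 3]})

# keep only the rows whose MKT value appears in uniqueArray
new_df = df[df.MKT.isin(uniqueArray)].copy()
```

The .copy() makes new_df independent of df, so later assignments don't raise SettingWithCopyWarning.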
I am 'translating' Python code to PySpark. I would like to use an existing column as the index for a dataframe. I did this in Python using pandas; the small piece of code below shows what I did. Thanks for helping.
df.set_index('colx', drop=False, inplace=True)
# sort the index
df.sort_index(inplace=True)
I expect the result to be a dataframe with 'colx' as index.
Add an index to the PySpark dataframe as a column and use it:
rdd_df = df.rdd.zipWithIndex()
df_index = rdd_df.toDF()
# and extract the columns
df_index = df_index.withColumn('colA', df_index['_1'].getItem("colA"))
df_index = df_index.withColumn('colB', df_index['_1'].getItem("colB"))
This is not how it works with Spark; no such concept exists.
One can add a column via zipWithIndex by converting the DF to an RDD and back, but that is a new column, so it's not the same thing.
I have to create around 800 dummy columns in a dataframe, all of which hold null values.
I don't want to use df.withColumn('x', lit(None)) for individual columns as there are many columns.
I tried map(lambda x: df.withColumn(x, lit(None)), column_list) but it's not working.
Writing the snippet below also looks like a bad approach:
for column in columns:
    df = df.withColumn(column, lit(None))
Can someone suggest the best way to do this?
The only approach you haven't listed that I can think of is to use the underlying rdd: map each row to itself plus (None,) * len(columns).

from pyspark.sql.types import StructType, StructField, NullType

schema = StructType(df.schema.fields + [StructField(c, NullType()) for c in columns])
df = df.rdd.map(lambda row: tuple(row) + (None,) * len(columns)).toDF(schema=schema)
I am using python 2.7 with dask
I have a dataframe with one column of tuples that I created like this:
table[col] = table.apply(lambda x: (x[col1], x[col2]), axis=1, meta=pd.DataFrame)
I want to convert this tuple column back into two separate columns.
In pandas I would do it like this:
table[[col1,col2]] = table[col].apply(pd.Series)
The point of doing so is that a dask dataframe does not support a multi index, and I want to group by multiple columns, so I wish to create a column of tuples that gives me a single index containing all the values I need (please ignore efficiency vs. a multi index, since there is not yet full support for this in dask dataframe).
When I try to unpack the tuple column with dask using this code:
rxTable[["a","b"]] = rxTable["tup"].apply(lambda x: s(x), meta = pd.DataFrame, axis = 1)
I get this error
AttributeError: 'Series' object has no attribute 'columns'
when I try
rxTable[["a","b"]] = rxTable["tup"].apply(dd.Series, axis = 1, meta = pd.DataFrame)
I get the same
How can I take a column of tuples and convert it to two columns, like I do in pandas with no problem?
Thanks
The best I found so far is converting to a pandas dataframe, converting the column there, and then going back to dask:
df1 = df.compute()
df1[["a", "b"]] = df1["c"].apply(pd.Series)
df = dd.from_pandas(df1, npartitions=1)
This will work well. If the df is too big for memory, you can either:
1. Compute only the wanted column, convert it into two columns, and then merge the split results back into the original df.
2. Split the df into chunks, convert each chunk and append it to an HDF5 file, then use dask to read the entire HDF5 file into a dask dataframe.
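The middle step of that round trip, splitting the tuple column in plain pandas, can be sketched like this (the frame and column names are hypothetical):

```python
import pandas as pd

df1 = pd.DataFrame({"c": [(1, "a"), (2, "b")]})

# expand each tuple into a two-column frame and assign both columns at once
df1[["a", "b"]] = df1["c"].apply(pd.Series)
```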
I found this methodology works well and avoids converting the Dask DataFrame to Pandas:
df['a'] = df['tup'].str.partition(sep)[0]
df['b'] = df['tup'].str.partition(sep)[2]
where sep is whatever delimiter you were using in the column to separate the two elements.
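A minimal sketch of that approach, assuming the column holds strings whose two parts are joined by an underscore (this works on string columns, not columns of actual tuple objects):

```python
import pandas as pd

sep = "_"
df = pd.DataFrame({"tup": ["1_a", "2_b"]})

# partition splits on the first sep: index 0 is the left part, 2 the right
df["a"] = df["tup"].str.partition(sep)[0]
df["b"] = df["tup"].str.partition(sep)[2]
```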