Change spark dataframe columns name

Hi, I have a Spark DataFrame in the format below:
| id | Name | Round_1_id |Round_1_name|Round_2_id|Round_2_name|
| ---| ------|------------|------------|----------|------------|
| 12 | ABC | 45 |BCD | 34 | HRF |
There are not just two rounds; there are 10 rounds in total.
I want to rename only the Round columns, as shown below:
| id | Name | Round_1_identity | Round_1_Fullname | Round_2_identity | Round_2_Fullname |
| --- | --- | --- | --- | --- | --- |
| 12 | ABC | 45 | BCD | 34 | HRF |
Only the column names that contain "Round" should be changed.
I am trying the code below, but it is not working:
rename_col={"id":"identity","name":"Fullname"}
for c in df.columns:
    if 'Round' in c:
        for key,value in rename_col.items():
            df1=df.replace(key,value)
Please help me with this; it would be very helpful.

For each column whose name contains "Round", take the suffix after the last underscore, look up its replacement in the dict, and rename the column with withColumnRenamed.
See the code below:
rename_col = {"id":"identity", "name":"Fullname"}
for col in df.columns:
    if "Round" in col:
        key = col.split("_")[-1]
        new_col_name = col.replace(key, rename_col[key])
        df = df.withColumnRenamed(col, new_col_name)
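The renaming logic can also be kept separate from Spark so that it is easy to check on a plain list of names. This is a minimal sketch, assuming the part after the last underscore is what needs replacing; the helper name `rename_round_columns` is hypothetical, and suffixes missing from the dict are left unchanged instead of raising a `KeyError`.

```python
def rename_round_columns(columns, rename_col):
    """Return new column names, changing only the suffix of 'Round' columns."""
    new_names = []
    for col in columns:
        # Split on the last underscore: "Round_1_id" -> ("Round_1", "_", "id")
        prefix, _, suffix = col.rpartition("_")
        if "Round" in col and suffix in rename_col:
            new_names.append(f"{prefix}_{rename_col[suffix]}")
        else:
            new_names.append(col)
    return new_names


rename_col = {"id": "identity", "name": "Fullname"}
columns = ["id", "Name", "Round_1_id", "Round_1_name", "Round_2_id", "Round_2_name"]
new_columns = rename_round_columns(columns, rename_col)
print(new_columns)

# With a Spark DataFrame, all columns could then be renamed in one call:
# df = df.toDF(*rename_round_columns(df.columns, rename_col))
```

Renaming everything at once with `toDF` avoids creating one intermediate DataFrame per column, which may matter little for 20 columns but keeps the plan shorter.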

Related

How to combine two pandas dataset based on multiple conditions?

I want to combine two datasets in Python based on multiple conditions using pandas.
The two datasets have different numbers of rows.
The first one contains almost 300k entries, while the second one contains almost 1000 entries.
More specifically, the first dataset, "A", has the following columns:
Path | Line | Severity | Vulnerability | Name | Text | Title
An instance of the content of "A" is this:
src.bla.bla.class.java| 24| medium| Logging found| hr.kravarscan.enchantedfortress_15| description| Enchanted Fortress
The second dataset, "B", contains the following columns:
Class | Path | DTWC | DR | DW | IDFP
An instance of the content in "B" is this:
y.x.bla.MainActivity | com.lucao.limpazap_11| 0 | 0 | 0 | 0
I want to combine these two datasets as follows:
If A['Name'] is equal to B['Path'] AND B['Class'] is contained in A['Path']
Then
Merge the two rows into another data frame "C"
An output example is the following:
Suppose that A contains:
src.bla.bla.class.java| 24| medium| Logging found| hr.kravarscan.enchantedfortress_15| description| Enchanted Fortress|
and B contains:
com.bla.class | hr.kravarscan.enchantedfortress_15| 0 | 0 | 0 | 0
the output should be the following:
src.bla.bla.class.java| 24| medium| Logging found| hr.kravarscan.enchantedfortress_15| description| Enchanted Fortress| com.bla.class | hr.kravarscan.enchantedfortress_15| 0 | 0 | 0 | 0

I'm not sure this is the best or most efficient way, but I have tested it and it works. The approach is straightforward: loop over both dataframes and apply the desired conditions.
Suppose dataset A is df_a and dataset B is df_b.
First, add a suffix to every column of df_a and df_b so that the two rows can be combined later without name clashes (both have a Path column).
df_a.columns= [i+'_A' for i in df_a.columns]
df_b.columns= [i+'_B' for i in df_b.columns]
Then we can apply this nested loop and collect the matching rows:
rows = []
# Iterate through df_a
for idx_A, v_A in df_a.iterrows():
    # Iterate through df_b
    for idx_B, v_B in df_b.iterrows():
        # Apply the condition
        if v_A['Name_A'] == v_B['Path_B'] and v_B['Class_B'] in v_A['Path_A']:
            # Merge both rows into a single dict and collect it
            rows.append({**v_A.to_dict(), **v_B.to_dict()})
# Build df_c once at the end (DataFrame.append was removed in pandas 2.0)
df_c = pd.DataFrame(rows)
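A nested loop compares every pair of rows (roughly 300k × 1000 here), which can be slow. One possible alternative, sketched below, is to let `merge` handle the equality condition and then filter the joined rows on the substring condition. The sample values are made up for illustration and are not taken from the real datasets.

```python
import pandas as pd

# Tiny illustrative frames (hypothetical values).
df_a = pd.DataFrame({
    "Path": ["src.com.bla.Main.java", "src.com.other.Util.java"],
    "Line": [24, 31],
    "Name": ["app_15", "app_15"],
})
df_b = pd.DataFrame({
    "Class": ["com.bla.Main", "com.bla.Main"],
    "Path": ["app_15", "app_11"],
    "DTWC": [0, 1],
})

# Step 1: the equality condition becomes a regular inner join
# (Name in A equals Path in B). Suffixes avoid the Path/Path clash.
merged = df_a.add_suffix("_A").merge(
    df_b.add_suffix("_B"), left_on="Name_A", right_on="Path_B", how="inner"
)

# Step 2: the substring condition is checked only on the joined rows.
mask = [cls in path for cls, path in zip(merged["Class_B"], merged["Path_A"])]
df_c = merged[mask].reset_index(drop=True)
print(df_c)
```

The join still produces one row per matching Name/Path pair, so memory use depends on how many rows share the same key.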

Python Pandas: How do I sumproduct by rows with an if condition?

I know there are some questions on Stack Overflow about sumproduct, but the solutions are not working for me. I am also new to pandas.
For each row, I want to compute a sumproduct of certain columns, but only if column['2020'] != 0.
I used the code below, but I get this error:
IndexError: ('index 2018 is out of bounds for axis 0 with size 27', 'occurred at index 0')
Please help. Thank you.
# df_copy is my dataframe
column_list=[2018,2019]
weights=[6,9]
def test(df_copy):
    if df_copy[2020]!=0:
        W_Avg=sum(df_copy[column_list]*weights)
    else:
        W_Avg=0
    return W_Avg
df_copy['sumpr']=df_copy.apply(test, axis=1)
df_copy
| 2020 | 2018 | 2019 | sumpr |
| --- | --- | --- | --- |
| 0 | 100 | 20 | 0 |
| 1 | 30 | 10 | 270 |
| 3 | 10 | 10 | 150 |
I am sorry if the table doesn't look like a table. I can't create a table properly in Stackoverflow.
Basically, for a particular row, if
2020 = 2,
2018 = 30,
2019 = 10,
then sumpr = 30 * 6 + 10 * 9 = 270

Your column names are most likely strings, not integers.
To confirm it, run df_copy.columns and you should receive something like:
Index(['2020', '2018', '2019'], dtype='object')
(note apostrophes surrounding column names).
So change your column list to:
column_list = ['2018', '2019']
In your function change also the column name to a string:
df_copy['2020']
Then your code should run.
You can also use a more concise version (it needs import numpy as np):
df_copy['sumpr'] = np.where(df_copy['2020'] != 0,
                            (df_copy[column_list] * weights).sum(axis=1), 0)
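To check the vectorised version end to end, here is a small self-contained sketch that rebuilds the three sample rows from the question, assuming the column names are strings as described above.

```python
import numpy as np
import pandas as pd

# Sample rows from the question, with string column names.
df_copy = pd.DataFrame({
    "2020": [0, 1, 3],
    "2018": [100, 30, 10],
    "2019": [20, 10, 10],
})
column_list = ["2018", "2019"]
weights = [6, 9]

# Multiply each selected column by its weight, then sum across each row.
weighted = (df_copy[column_list] * weights).sum(axis=1)

# Keep the weighted sum only where the 2020 column is non-zero.
df_copy["sumpr"] = np.where(df_copy["2020"] != 0, weighted, 0)
print(df_copy)
```

Note that `weighted` is computed for every row and `np.where` only selects between it and 0, which is fine here because the computation has no side effects.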

add values in Pandas DataFrame

I want to add values to a DataFrame, and I want to write clean code (short and fast). I really want to improve my coding skills.
Suppose that we have a DataFrame and 3 values:
df=pd.DataFrame({"Name":[],"ID":[],"LastName":[]})
value1="ema"
value2=023123
value3="Perez"
I can write:
df.append([value1,value2,value3])
but the output creates a new column instead, like this:
0 | Name | ID | LastName
ema | nan | nan | nan
023123 | nan | nan| nan
Perez | nan | nan | nan
I want the following output, with the cleanest possible code:
Name | ID | LastName
ema | 023123 | Perez
Is there a way to do this without appending the values one by one? (I want the shortest and fastest code.)

You can convert the values to dict then use append
df.append(dict(zip(['Name', 'ID', 'LastName'],[value1,value2,value3])), ignore_index=True)
Name ID LastName
0 ema 23123.0 Perez
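DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the call above fails on current versions. A minimal sketch of two replacements follows; it assumes the ID is stored as a string so the leading zero survives, and the second row ("ana", "000042", "Lopez") is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"Name": [], "ID": [], "LastName": []})
value1, value2, value3 = "ema", "023123", "Perez"  # ID as a string (assumption)

# Single row: assign to the next free integer label.
df.loc[len(df)] = [value1, value2, value3]

# Several rows: build them as a DataFrame and concatenate once.
more_rows = pd.DataFrame([["ana", "000042", "Lopez"]], columns=df.columns)
df = pd.concat([df, more_rows], ignore_index=True)
print(df)
```

Adding rows one at a time copies data on every call, so for many rows it is usually better to collect them first and call `pd.concat` or the `DataFrame` constructor once.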

Here is the explanation:
First, put your three values into a list:
values=[value1,value2,value3]
and create a variable to use as the index while looping later:
i = 0
Then use the code below:
for column in df.columns:
    df.loc[0,column] = values[i]
    i+=1
Iterating over df.columns gives the name of each column in the DataFrame,
and df.loc[0,column] = values[i] writes the i-th value into row 0 of that column.

How to add new row in pandas dataframe? [duplicate]

I have an existing dataframe which I need to add an additional column to which will contain the same value for every row.
Existing df:
Date, Open, High, Low, Close
01-01-2015, 565, 600, 400, 450
New df:
Name, Date, Open, High, Low, Close
abc, 01-01-2015, 565, 600, 400, 450
I know how to append an existing series / dataframe column. But this is a different situation, because all I need is to add the 'Name' column and set every row to the same value, in this case 'abc'.

df['Name']='abc' will add the new column and set all rows to that value:
In [79]:
df
Out[79]:
Date, Open, High, Low, Close
0 01-01-2015, 565, 600, 400, 450
In [80]:
df['Name'] = 'abc'
df
Out[80]:
Date, Open, High, Low, Close Name
0 01-01-2015, 565, 600, 400, 450 abc

You can use insert to specify where you want the new column to be. In this case, I use 0 to place the new column at the left.
df.insert(0, 'Name', 'abc')
Name Date Open High Low Close
0 abc 01-01-2015 565 600 400 450

Summing up what the others have suggested, and adding a third way. You can:
use assign(**kwargs), which returns a new DataFrame and leaves df unchanged:
df.assign(Name='abc')
access the new column (it will be created) and set it:
df['Name'] = 'abc'
use insert(loc, column, value, allow_duplicates=False), which modifies df in place:
df.insert(0, 'Name', 'abc')
The loc argument (0 <= loc <= len(columns)) is the index the column will have after the insertion. For example, the code above inserts Name as the 0-th column, i.e. before the current first column, so it becomes the new first column (indexing starts from 0).
All these methods also let you add a new column from a Series (just substitute a Series for the 'abc' value above).
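The three options can be compared side by side. The sketch below uses the single sample row from the question; the variable names `with_assign`, `in_place`, and `inserted` are only there to keep the results apart.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["01-01-2015"], "Open": [565], "High": [600],
                   "Low": [400], "Close": [450]})

# 1. assign returns a new DataFrame; df itself is not modified.
with_assign = df.assign(Name="abc")

# 2. Item assignment modifies the frame in place; the column goes last.
in_place = df.copy()
in_place["Name"] = "abc"

# 3. insert also works in place and lets you choose the position.
inserted = df.copy()
inserted.insert(0, "Name", "abc")

print(list(with_assign.columns))
print(list(inserted.columns))
```

Only `insert` controls the column position directly; with the other two, the new column is appended at the end.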

A one-liner works:
df['Name'] = 'abc'
This creates a Name column and sets every row to 'abc'.

I want to draw more attention to a portion of @michele-piccolini's answer.
I strongly believe that .assign is the best solution here. In the real world, these operations are not in isolation, but in a chain of operations. And if you want to support a chain of operations, you should probably use the .assign method.
Here is an example using snowfall data at a ski resort (but the same principles would apply to say ... financial data).
This code reads like a recipe of steps. Both assignment (with =) and .insert make this much harder:
raw = pd.read_csv('https://github.com/mattharrison/datasets/raw/master/data/alta-noaa-1980-2019.csv',
                  parse_dates=['DATE'])
def clean_alta(df):
    return (df
            .loc[:, ['STATION', 'NAME', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'DATE',
                     'PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN', 'TOBS']]
            .groupby(pd.Grouper(key='DATE', freq='W'))
            .agg({'PRCP': 'sum', 'TMAX': 'max', 'TMIN': 'min', 'SNOW': 'sum', 'SNWD': 'mean'})
            .assign(LOCATION='Alta',
                    T_RANGE=lambda w_df: w_df.TMAX-w_df.TMIN)
    )
clean_alta(raw)
Notice the line .assign(LOCATION='Alta', that creates a column with a single value in the middle of the rest of the operations.

One Line did the job for me.
df['New Column'] = 'Constant Value'
df['New Column'] = 123

You can simply do the following (this assumes df has the default 0..n-1 index, because the new Series is aligned on index labels):
df['New Col'] = pd.Series(["abc" for x in range(len(df.index))])

This single line will work.
df['name'] = 'abc'

The append method has been deprecated since pandas 1.4.0.
So use the method above only if you are working with an actual pandas DataFrame object:
df["column"] = "value"
Or, if you are setting a value on a view or copy of a DataFrame, use concat() or assign().
This way the new Series has the same index as the original DataFrame, so the values line up on the exact rows.
# adds a new column in view `where_there_is_one` named
# `client` with value `display_name`
# `df` remains unchanged
df = pd.DataFrame({"number": ([1]*5 + [0]*5 )})
where_there_is_one = df[ df["number"] == 1]
where_there_is_one = pd.concat([
        where_there_is_one,
        pd.Series(["display_name"]*df.shape[0],
                  index=df.index,
                  name="client")
    ],
    join="inner", axis=1)
# Or use assign
where_there_is_one = where_there_is_one.assign(client = "display_name")
Output:
where_there_is_one:
|   | number | client |
| --- | --- | --- |
| 0 | 1 | display_name |
| 1 | 1 | display_name |
| 2 | 1 | display_name |
| 3 | 1 | display_name |
| 4 | 1 | display_name |
df:
|   | number |
| --- | --- |
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 0 |
| 6 | 0 |
| 7 | 0 |
| 8 | 0 |
| 9 | 0 |

OK, I have a similar situation, but instead of a fixed value such as 'abc' in df['Name']='abc',
I want to take the value for the new column from somewhere else in the CSV file.
As you can see from the picture, df is not cleaned yet, but I want to create two columns: one holding "ADI dms rivoli" in every row and another holding "December 2019" in every row. I hope this is clear; it was hard to explain, sorry.

Python - Groupby a DataFrameGroupBy object

I have a pandas DataFrame in Python to which I am applying a groupby. Then I want to apply another groupby + sum to the previous result. To be more specific, first I am doing:
check_df = data_df.groupby(['hotel_code', 'dp_id', 'market', 'number_of_rooms'])[['market', 'number_of_rooms']]
And then I want to do:
check_df = check_df.groupby(['market'])['number_of_rooms'].sum()
So, I am getting the following error:
AttributeError: Cannot access callable attribute 'groupby' of 'DataFrameGroupBy'
objects, try using the 'apply' method
My initial data look like that:
hotel_code | market | number_of_rooms | ....
---------------------------------------------
001 | a | 200 | ...
001 | a | 200 |
002 | a | 300 | ...
Notice that I may have duplicates of pairs like (a - 200); that's why I need the first groupby.
What I want in the end is something like that:
Market | Rooms
--------------
a | 3000
b | 250
I'm just trying to translate the following sql query into python:
select a.market, sum(a.number_of_rooms)
from (
select market, number_of_rooms
from opinmind_dev..cg_mm_booking_dataset_full
group by hotel_code, market, number_of_rooms
) as a
group by market ;
Any ideas how I can fix that? If you need any more info, let me know.
ps. I am new to Python and data science

IIUC, instead of:
check_df = data_df.groupby(['hotel_code', 'dp_id', 'market', 'number_of_rooms'])[['market', 'number_of_rooms']]
You should simply do:
check_df = data_df.drop_duplicates(subset=['hotel_code', 'dp_id', 'market', 'number_of_rooms'])\
.loc[:, ['market', 'number_of_rooms']]\
.groupby('market')\
.sum()
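The deduplicate-then-aggregate idea can be tried on a few rows. The sketch below uses made-up values shaped like the sample in the question (the `dp_id` values and the hotel in market `b` are assumptions).

```python
import pandas as pd

# Hypothetical sample: hotel 001 appears twice with identical values.
data_df = pd.DataFrame({
    "hotel_code": ["001", "001", "002", "003"],
    "dp_id": [1, 1, 2, 3],
    "market": ["a", "a", "a", "b"],
    "number_of_rooms": [200, 200, 300, 250],
})

rooms_per_market = (
    data_df
    # Inner query: one row per distinct combination of the key columns.
    .drop_duplicates(subset=["hotel_code", "dp_id", "market", "number_of_rooms"])
    # Outer query: total rooms per market.
    .groupby("market")["number_of_rooms"]
    .sum()
)
print(rooms_per_market)
```

Here `drop_duplicates` plays the role of the inner `GROUP BY` in the SQL query, since that subquery has no aggregate and only removes repeated combinations.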

df = pd.DataFrame({'Market': [1,1,1,2,2,2,3,3], 'Rooms':range(8), 'C':np.random.rand(8)})
Market Rooms C
0 1 0 0.187793
1 1 1 0.325284
2 1 2 0.095147
3 2 3 0.296781
4 2 4 0.022262
5 2 5 0.201078
6 3 6 0.160082
7 3 7 0.683151
You need to move the column selection away from the grouped DataFrame. Either of the following should work.
df.groupby('Market').sum()[['Rooms']]
df[['Rooms']].groupby(df['Market']).sum()
Rooms
Market
1 3
2 12
3 13
If you select using ['Rooms'] instead of [['Rooms']] you will get a Series instead of a DataFrame.
The resulting DataFrames use Market as their index. If you want to convert it to a regular column, call reset_index() on the result:
df.groupby('Market').sum()[['Rooms']].reset_index()
Market Rooms
0 1 3
1 2 12
2 3 13

If I understand your question correctly, you could simply do:
data_df.groupby('Market').agg({'Rooms': np.sum})
or
data_df.groupby(['Market'], as_index=False).agg({'Rooms': np.sum})
For example:
data_df = pd.DataFrame({'Market' : ['A','B','C','B'],
                        'Hotel' : ['H1','H2','H4','H5'],
                        'Rooms' : [20,40,50,34]
                       })
data_df.groupby('Market').agg({'Rooms': np.sum})
