add values in Pandas DataFrame - python

I want to add values to a DataFrame, but I want the code to be clean (short and fast). I really want to improve my coding skills.
Suppose that we have a DataFrame and 3 values:
df=pd.DataFrame({"Name":[],"ID":[],"LastName":[]})
value1="ema"
value2=23123  # a leading zero (023123) is a SyntaxError in Python 3
value3="Perez"
I can write:
df.append([value1,value2,value3])
but that creates a new column instead of filling the existing ones, like this:
0 | Name | ID | LastName
ema | nan | nan | nan
023123 | nan | nan| nan
Perez | nan | nan | nan
I want the following output, with the cleanest code possible:
Name | ID | LastName
ema | 023123 | Perez
Is there a way to do this without appending the values one by one? (I want the shortest, fastest code.)

You can convert the values to a dict and then use append (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0):
df.append(dict(zip(['Name', 'ID', 'LastName'],[value1,value2,value3])), ignore_index=True)
Name ID LastName
0 ema 23123.0 Perez
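On pandas 2.0 and later, `DataFrame.append` no longer exists, so the call above raises an `AttributeError`. A minimal sketch of the same idea with `pd.concat`, assuming the three values from the question with the ID kept as the integer 23123:

```python
import pandas as pd

df = pd.DataFrame({"Name": [], "ID": [], "LastName": []})
value1, value2, value3 = "ema", 23123, "Perez"

# Build a one-row DataFrame keyed by the existing column names,
# then concatenate it onto the (empty) frame.
new_row = pd.DataFrame([dict(zip(df.columns, [value1, value2, value3]))])
df = pd.concat([df, new_row], ignore_index=True)
print(df)
```

Some pandas versions emit a FutureWarning when one of the concatenated frames is empty; the result is still a single row in the existing three columns.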

Here is the explanation:
First put your 3 values into a list
values=[value1,value2,value3]
and create a variable to use as the index marker when looping later
i = 0
Then use the code below
for column in df.columns:
    df.loc[0,column] = values[i]
    i+=1
for column in df.columns iterates over the names of all the columns in the DataFrame,
and df.loc[0,column] = values[i] writes the value at position i of the list into row 0 of that column.
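The loop can also be collapsed into a single assignment, because `df.loc[label] = some_list` fills a whole row when the list has one entry per column. A small sketch, assuming the same three values (ID as the integer 23123):

```python
import pandas as pd

df = pd.DataFrame({"Name": [], "ID": [], "LastName": []})
values = ["ema", 23123, "Perez"]

# len(df) is the next free integer label, so this adds one row;
# the list must have exactly one entry per column.
df.loc[len(df)] = values
print(df)
```

This is convenient for a handful of rows. For many rows it is usually cheaper to collect them in a list first and build the DataFrame once.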

Related

How to combine two pandas dataset based on multiple conditions?

I want to combine two datasets in Python based on multiple conditions using pandas.
The two datasets have different numbers of rows.
The first one contains almost 300k entries, while the second one contains almost 1000 entries.
More specifically, the first dataset, "A", has the following information:
Path | Line | Severity | Vulnerability | Name | Text | Title
An instance of the content of "A" is this:
src.bla.bla.class.java| 24| medium| Logging found| hr.kravarscan.enchantedfortress_15| description| Enchanted Fortress
While the second dataset: "B" contains the following information:
Class | Path | DTWC | DR | DW | IDFP
An instance of the content in "B" is this:
y.x.bla.MainActivity | com.lucao.limpazap_11| 0 | 0 | 0 | 0
I want to combine these two datasets as follows:
If A['Name'] is equal to B['Path'] AND B['Class'] is contained in A['Path']
Then
merge the two rows into another data frame "C"
An output example is the following:
Suppose that A contains:
src.bla.bla.class.java| 24| medium| Logging found| hr.kravarscan.enchantedfortress_15| description| Enchanted Fortress|
and B contains:
com.bla.class | hr.kravarscan.enchantedfortress_15| 0 | 0 | 0 | 0
the output should be the following:
src.bla.bla.class.java| 24| medium| Logging found| hr.kravarscan.enchantedfortress_15| description| Enchanted Fortress| com.bla.class | hr.kravarscan.enchantedfortress_15| 0 | 0 | 0 | 0
I'm not sure if this is the best or the most efficient way, but I have tested it and it worked. My answer is pretty straightforward: we loop over the two dataframes and apply the desired conditions.
Suppose dataset A is df_a and dataset B is df_b.
First we have to add a suffix to every column of df_a and df_b so that both rows can be combined later.
df_a.columns= [i+'_A' for i in df_a.columns]
df_b.columns= [i+'_B' for i in df_b.columns]
And then we can apply this for loop
rows = []
# Iterate through df_a
for (idx_A, v_A) in df_a.iterrows():
    # Iterate through df_b
    for (idx_B, v_B) in df_b.iterrows():
        # Apply the condition
        if v_A['Name_A']==v_B['Path_B'] and v_B['Class_B'] in v_A['Path_A']:
            # Cast both series to dictionaries and merge them into one dict
            c_dict= {**v_A.to_dict(), **v_B.to_dict()}
            # Collect the merged row (DataFrame.append was removed in pandas 2.0)
            rows.append(c_dict)
# Build df_c once from the collected rows
df_c= pd.DataFrame(rows)
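The nested loop compares every row of A with every row of B (roughly 300k × 1000 pairs). A possible alternative, sketched here on invented toy data, is to let `merge` handle the equality condition and then filter on the substring condition. The sample paths and values below are hypothetical; only the column names come from the question.

```python
import pandas as pd

# Toy stand-ins for datasets A and B (values are made up for illustration)
df_a = pd.DataFrame({
    "Path": ["src/com.bla.class/File.java", "src/other/File.java"],
    "Line": [24, 7],
    "Name": ["hr.kravarscan.enchantedfortress_15",
             "hr.kravarscan.enchantedfortress_15"],
})
df_b = pd.DataFrame({
    "Class": ["com.bla.class"],
    "Path": ["hr.kravarscan.enchantedfortress_15"],
    "DTWC": [0],
})

# Step 1: the equality condition (A.Name == B.Path) as an inner join.
# The overlapping column "Path" becomes Path_A and Path_B.
merged = df_a.merge(df_b, left_on="Name", right_on="Path",
                    suffixes=("_A", "_B"))

# Step 2: the substring condition (B.Class contained in A.Path), per row
contains = [cls in path
            for cls, path in zip(merged["Class"], merged["Path_A"])]
df_c = merged[contains].reset_index(drop=True)
print(df_c)
```

Only columns present in both frames receive a suffix, so the column names differ slightly from the loop version, which suffixes every column.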

Change spark dataframe columns name

Hi, I have a Spark data frame in the format below:
| id | Name | Round_1_id |Round_1_name|Round_2_id|Round_2_name|
| ---| ------|------------|------------|----------|------------|
| 12 | ABC | 45 |BCD | 34 | HRF |
There are not just two rounds; there are 10 rounds in total.
I want to change the column names as shown below, but only for the round columns:
| id | Name | Round_1_identity | Round_1_Fullname | Round_2_identity | Round_2_Fullname |
| --- | --- | --- | --- | --- | --- |
| 12 | ABC | 45 | BCD | 34 | HRF |
Only the column names that contain "Round" should be changed.
I am trying the code below, but it is not working:
rename_col={"id":"identity","name":"Fullname"}
for c in df.columns:
    if 'Round' in c:
        for key,value in rename_col.items():
            df1=df.replace(key,value)
Please help me with this. It would be very helpful.
You can conditionally find the column names, replace the trailing part with the value from the dict to get the new name, and use withColumnRenamed to rename the columns.
See the code below
rename_col = {"id":"identity", "name":"Fullname"}
for col in df.columns:
    if "Round" in col:
        # the part after the last underscore: "id" or "name"
        key = col.split("_")[-1]
        new_col_name = col.replace(key, rename_col[key])
        df = df.withColumnRenamed(col, new_col_name)
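The renaming rule itself can be checked without a Spark session. The sketch below applies it to a plain list of column names; the helper name `renamed` is hypothetical, and the final comment shows one way the result might be applied to a real Spark DataFrame.

```python
rename_col = {"id": "identity", "name": "Fullname"}

def renamed(col: str) -> str:
    """Return the new name for a Round_* column, or the name unchanged."""
    prefix, _, suffix = col.rpartition("_")
    if col.startswith("Round") and suffix in rename_col:
        return f"{prefix}_{rename_col[suffix]}"
    return col

columns = ["id", "Name", "Round_1_id", "Round_1_name",
           "Round_2_id", "Round_2_name"]
new_columns = [renamed(c) for c in columns]
print(new_columns)

# With a Spark DataFrame, all names could then be set in one call:
# df = df.toDF(*[renamed(c) for c in df.columns])
```

Replacing only the suffix after the last underscore avoids touching other occurrences of "id" or "name" earlier in a column name, and columns with an unknown suffix are left as they are instead of raising a `KeyError`.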

Subsetting data with a column condition

I have a dataframe which contains Dates, Visitor_ID and Pages columns. In the Pages column there are different row-wise entries for each date. Please refer to the table below to understand the data.
| Dates | Visitor_ID | Pages |
|:------ |:---------:| -----: |
| 10/1/2021 | 1 | xy |
| 10/1/2021 | 1 | step2 |
| 10/1/2021 | 1 | xx |
| 10/1/2021 | 1 | NetBanking |
| 10/1/2021 | 2 | step1 |
| 10/1/2021 | 2 | xy |
| 10/1/2021 | 3 | step1 |
| 10/1/2021 | 3 | NetBanking |
| 11/1/2021 | 4 | step1 |
| 12/1/2021 | 4 | NetBanking |
Desired output:
| Date | Visitor_ID |
|:------ |:---------:|
| 10/1/2021 | 1 |
| 10/1/2021 | 3 |
The output should be a subset of the actual data: if, for the same Visitor_ID, a page containing the string "step" appears before a page containing the string "NetBanking" on the same date, then return that Visitor_ID.
To initialise your dataframe you could do:
import pandas as pd
columns = ["Dates", "Visitor_ID", "Pages"]
records = [
    ["10/1/2021", 1, "xy"],
    ["10/1/2021", 1, "step2"],
    ["10/1/2021", 1, "NetBanking"],
    ["10/1/2021", 2, "step1"],
    ["10/1/2021", 2, "xy"],
    ["10/1/2021", 3, "step1"],
    ["10/1/2021", 3, "NetBanking"],
    ["11/1/2021", 4, "step1"],
    ["12/1/2021", 4, "NetBanking"],
]
data = pd.DataFrame.from_records(records, columns=columns)
data["Dates"] = pd.DatetimeIndex(data["Dates"])
index_names = columns[:2]
data.set_index(index_names, drop=True, inplace=True)
Note that I have left out your third line in the records, otherwise I cannot reproduce your desired output. I have made this a multi-index data frame in order to easily loop over the groups 'date/visitor'. The structure of the dataframe looks like:
print(data)
Pages
Dates Visitor_ID
2021-10-01 1 xy
1 step2
1 NetBanking
2 step1
2 xy
3 step1
3 NetBanking
2021-11-01 4 step1
2021-12-01 4 NetBanking
Now to select the customers from the same date and from the same group, I am going to loop over these groups and use 2 masks to select the required records:
for date_time, data_per_date in data.groupby(level=0):
    # group on the second index level (Visitor_ID) within each date
    for visitor, data_per_visitor in data_per_date.groupby(level=1):
        # select the column with the Pages
        pages = data_per_visitor["Pages"].str
        # make 2 boolean masks, for the records with step and netbanking
        has_step = pages.contains("step")
        has_netbanking = pages.contains("NetBanking")
        # to get the records after each 'step' record, apply a diff on 'has_step'
        # Convert to int first for the correct result
        # each diff with outcome -1 fulfills this requirement. Make a
        # mask based on this requirement
        diff_step = has_step.astype(int).diff()
        records_after_step = diff_step == -1
        # combine the 2 masks to create your final mask to make a selection
        mask = records_after_step & has_netbanking
        # select the records and print to screen
        selection = data_per_visitor[mask]
        if not selection.empty:
            print(selection.reset_index()[index_names])
This gives the following output (one small frame per matching visitor):
Dates Visitor_ID
0 2021-10-01 1
Dates Visitor_ID
0 2021-10-01 3
EDIT:
I was reading your question again. The solution above assumes that only a 'NetBanking' record directly following a 'step' record is valid. That is why I thought your example input did not correspond to your desired output. However, if you allow rows in between an occurrence of 'step' and the first 'NetBanking', the solution does not work. In that case, it is better to explicitly iterate over the rows of your dataframe per date and visitor ID. An example would then be:
for date_time, data_per_date in data.groupby(level=0):
    # group on the second index level (Visitor_ID) within each date
    for visitor, data_per_visitor in data_per_date.groupby(level=1):
        after_step = False
        index_selection = list()
        # work on a copy with a plain integer index
        data_per_visitor = data_per_visitor.reset_index()
        for index, records in data_per_visitor.iterrows():
            page = records["Pages"]
            if "step" in page and not after_step:
                after_step = True
            if "NetBanking" in page and after_step:
                index_selection.append(index)
                after_step = False
        selection = data_per_visitor.reindex(index_selection)
        if not selection.empty:
            print(selection.reset_index()[index_names])
Normally I would not recommend using 'iterrows', as it is really slow, but in this case I don't see an easy alternative. The output of the second algorithm is the same as that of the first for my data. If you do include the third line from your example data, the second algorithm still gives the same output.
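One possible way to avoid `iterrows` is a grouped cumulative maximum: within each date/visitor group, mark every row from the first "step" row onward, then keep the "NetBanking" rows that carry that mark. This is a sketch under the assumption that any "NetBanking" row after a "step" row on the same date counts, and that each visitor should be reported once per date.

```python
import pandas as pd

# Same records as in the question, including the "xx" row
records = [
    ["10/1/2021", 1, "xy"], ["10/1/2021", 1, "step2"],
    ["10/1/2021", 1, "xx"], ["10/1/2021", 1, "NetBanking"],
    ["10/1/2021", 2, "step1"], ["10/1/2021", 2, "xy"],
    ["10/1/2021", 3, "step1"], ["10/1/2021", 3, "NetBanking"],
    ["11/1/2021", 4, "step1"], ["12/1/2021", 4, "NetBanking"],
]
df = pd.DataFrame(records, columns=["Dates", "Visitor_ID", "Pages"])

is_step = df["Pages"].str.contains("step")
is_netbanking = df["Pages"].str.contains("NetBanking")

# Within each date/visitor group, flag every row at or after the first "step" row
step_seen = (is_step.astype(int)
             .groupby([df["Dates"], df["Visitor_ID"]])
             .cummax()
             .astype(bool))

result = (df.loc[is_netbanking & step_seen, ["Dates", "Visitor_ID"]]
          .drop_duplicates()
          .reset_index(drop=True))
print(result)
```

Visitor 4 is excluded because the "step" and "NetBanking" rows fall on different dates and therefore in different groups.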

How to merge different column under one column in Pandas

I have a dataframe which is sparse and looks something like this:
Conti_mV_XSCI_140|Conti_mV_XSCI_12|Conti_mV_XSCI_76|Conti_mV_XSCO_11|Conti_mV_XSCO_203|Conti_mV_XSCO_75
1 | nan | nan | 12 | nan | nan
nan | 22 | nan | nan | 13 | nan
nan | nan | 9 | nan | nan | 31
As you can see, XSCI is present in 3 header names; the only difference is a random number (_140, _12, _76) added at the end.
This is not correct. The column names should be like this: Conti_mV_XSCI, Conti_mV_XSCO.
The final column (without any random number) should hold the values from all three columns it was spread across (for example, XSCI was spread across XSCI_140, XSCI_12 and XSCI_76).
The final dataframe should look something like this:
Conti_mV_XSCI| Conti_mV_XSCO
1 | 12
22 | 13
9 | 31
Notice that the first value of XSCI comes from the first column, XSCI_140, the second value comes from the second XSCI column, and so on. The same holds for XSCO.
The issue is that I have to do this for all columns starting with certain prefixes, such as "Conti_mV", "IDD_PowerUp_mA", etc.
My issue:
I am having a hard time cleaning up the header names because, as soon as I remove the random number from the end, pandas throws an error about duplicate columns. It is also not elegant.
It would be a great help if anyone could help me. Please comment if anything is not clear.
I need a new dataframe with one column (where there were 3) that combines the data from them.
Thanks.
First if necessary convert all columns to numeric:
df = df.apply(pd.to_numeric, errors='coerce')
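To group the columns by their name with the trailing number split off from the right, and sum each group (each row has only one non-missing value per group, so the sum returns that value):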
df = df.groupby(lambda x: x.rsplit('_', 1)[0], axis=1).sum()
print (df)
Conti_mV_XSCI Conti_mV_XSCO
0 1.0 12.0
1 22.0 13.0
2 9.0 31.0
If you need to filter the columns manually:
df['Conti_mV_XSCI'] = df.filter(like='XSCI').sum(axis=1)
df['Conti_mV_XSCO'] = df.filter(like='XSCO').sum(axis=1)
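Recent pandas versions warn about `axis=1` in `groupby`, so a variant that transposes first may be useful. The sketch below rebuilds the sample frame from the question and uses `first()`, which takes the first non-missing value in each group instead of a sum; treat it as one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Conti_mV_XSCI_140": [1, np.nan, np.nan],
    "Conti_mV_XSCI_12": [np.nan, 22, np.nan],
    "Conti_mV_XSCI_76": [np.nan, np.nan, 9],
    "Conti_mV_XSCO_11": [12, np.nan, np.nan],
    "Conti_mV_XSCO_203": [np.nan, 13, np.nan],
    "Conti_mV_XSCO_75": [np.nan, np.nan, 31],
})

# Transpose so the column names become the index, group the names with the
# trailing "_<number>" removed, take the first non-NaN value, transpose back.
merged = (df.T
          .groupby(lambda name: name.rsplit("_", 1)[0])
          .first()
          .T)
print(merged)
```

Unlike `sum`, `first()` leaves a row as NaN when every column in the group is missing, instead of turning it into 0.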
EDIT: One idea for summing only the columns whose names start with a prefix from a given list:
cols = ['IOZH_Pat_uA', 'IOZL_Pat_uA', 'Power_Short_uA', 'IDDQ_uA']
for c in cols:
    # here ^ is for start of string
    columns = df.filter(regex=f'^{c}')
    df[c] = columns.sum(axis=1)
    # drop the original columns that were just summed
    df = df.drop(columns=columns.columns)
print (df)
Try (note axis=1, so that the sum runs across the columns of each row):
df['Conti_mV_XSCI']=df.filter(regex='XSCI').sum(axis=1)
df['Conti_mV_XSCO']=df.filter(regex='XSCO').sum(axis=1)
Edit:
You can fill the NaN values with zeroes before the above operations.
df=df.fillna(0)
This will add a column Conti_mV_XSCI with the first non-nan entry for any column whose name begins with Conti_mV_XSCI
from math import isnan
df['Conti_mV_XSCI'] = df.filter(regex=("Conti_mV_XSCI.*")).apply(lambda row: [_ for _ in row if not isnan(_)][0], axis=1)
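The same "first non-NaN value per row" idea can also be expressed without a Python-level lambda, by back-filling along each row and taking the first column. A minimal sketch, assuming the three XSCI columns from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Conti_mV_XSCI_140": [1, np.nan, np.nan],
    "Conti_mV_XSCI_12": [np.nan, 22, np.nan],
    "Conti_mV_XSCI_76": [np.nan, np.nan, 9],
})

# Back-fill along each row so the first column ends up holding the
# first non-NaN value of that row, then keep only that column.
first_valid = (df.filter(regex=r"^Conti_mV_XSCI_\d+$")
               .bfill(axis=1)
               .iloc[:, 0])
df["Conti_mV_XSCI"] = first_valid
print(df["Conti_mV_XSCI"])
```

A row in which every source column is NaN stays NaN here, whereas indexing `[0]` on an empty list would raise an `IndexError`.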
You can use the pivot_longer function from pyjanitor; at the moment you have to install the latest development version from GitHub:
# install the latest dev version of pyjanitor
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor
(df.pivot_longer(names_to=".value",
                 names_pattern=r"(.+)_\d+")
   .dropna())
Conti_mV_XSCI Conti_mV_XSCO
0 1.0 12.0
4 22.0 13.0
8 9.0 31.0
The names_pattern regex captures the part of each column name before the trailing number, and ".value" tells pivot_longer to use that captured part as the header of the resulting column.

How to add new row in pandas dataframe? [duplicate]

I have an existing dataframe which I need to add an additional column to which will contain the same value for every row.
Existing df:
Date, Open, High, Low, Close
01-01-2015, 565, 600, 400, 450
New df:
Name, Date, Open, High, Low, Close
abc, 01-01-2015, 565, 600, 400, 450
I know how to append an existing series / dataframe column. But this is a different situation, because all I need is to add the 'Name' column and set every row to the same value, in this case 'abc'.
df['Name']='abc' will add the new column and set all rows to that value:
In [79]:
df
Out[79]:
Date, Open, High, Low, Close
0 01-01-2015, 565, 600, 400, 450
In [80]:
df['Name'] = 'abc'
df
Out[80]:
Date, Open, High, Low, Close Name
0 01-01-2015, 565, 600, 400, 450 abc
You can use insert to specify where you want the new column to be. In this case, I use 0 to place the new column at the left.
df.insert(0, 'Name', 'abc')
Name Date Open High Low Close
0 abc 01-01-2015 565 600 400 450
Summing up what the others have suggested, and adding a third way.
You can use:
assign(**kwargs):
    df.assign(Name='abc')
Access the new column series (it will be created) and set it:
    df['Name'] = 'abc'
insert(loc, column, value, allow_duplicates=False):
    df.insert(0, 'Name', 'abc')
where the argument loc (0 <= loc <= len(columns)) allows you to insert the column where you want.
'loc' gives you the index that your column will be at after the insertion. For example, the code above inserts the column Name as the 0-th column, i.e. it will be inserted before the first column, becoming the new first column. (Indexing starts from 0.)
All these methods allow you to add a new column from a Series as well (just substitute the 'abc' default argument above with the series).
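The practical difference between the three is whether the original frame is modified and where the column lands. A short sketch on a one-row frame (the second column name, `Name2`, is hypothetical and used only because `insert` refuses duplicate names by default):

```python
import pandas as pd

df = pd.DataFrame({"Date": ["01-01-2015"], "Open": [565], "Close": [450]})

# assign returns a new DataFrame and leaves df untouched
with_assign = df.assign(Name="abc")
assert "Name" not in df.columns

# item assignment modifies df in place; the column is added at the end
df["Name"] = "abc"

# insert modifies df in place; the column is placed at position 0
df.insert(0, "Name2", "abc")

print(list(with_assign.columns))
print(list(df.columns))
```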
Single liner works
df['Name'] = 'abc'
Creates a Name column and sets all rows to abc value
I want to draw more attention to a portion of @michele-piccolini's answer.
I strongly believe that .assign is the best solution here. In the real world, these operations are not in isolation, but in a chain of operations. And if you want to support a chain of operations, you should probably use the .assign method.
Here is an example using snowfall data at a ski resort (but the same principles would apply to say ... financial data).
This code reads like a recipe of steps. Both assignment (with =) and .insert make this much harder:
raw = pd.read_csv('https://github.com/mattharrison/datasets/raw/master/data/alta-noaa-1980-2019.csv',
                  parse_dates=['DATE'])
def clean_alta(df):
    return (df
        .loc[:, ['STATION', 'NAME', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'DATE',
                 'PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN', 'TOBS']]
        .groupby(pd.Grouper(key='DATE', freq='W'))
        .agg({'PRCP': 'sum', 'TMAX': 'max', 'TMIN': 'min', 'SNOW': 'sum', 'SNWD': 'mean'})
        .assign(LOCATION='Alta',
                T_RANGE=lambda w_df: w_df.TMAX-w_df.TMIN)
    )
clean_alta(raw)
Notice the line .assign(LOCATION='Alta', that creates a column with a single value in the middle of the rest of the operations.
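For readers who cannot download the CSV, the same chaining pattern can be tried on a few invented rows. The values below are made up; only the column names and the shape of the chain follow the example above.

```python
import pandas as pd

# Three hypothetical daily observations spanning two calendar weeks
raw = pd.DataFrame({
    "DATE": pd.to_datetime(["2020-01-06", "2020-01-07", "2020-01-13"]),
    "TMAX": [30, 35, 28],
    "TMIN": [10, 12, 5],
    "SNOW": [2.0, 0.0, 4.0],
})

weekly = (raw
    .groupby(pd.Grouper(key="DATE", freq="W"))
    .agg({"TMAX": "max", "TMIN": "min", "SNOW": "sum"})
    # a constant column and a derived column, added mid-chain
    .assign(LOCATION="Alta",
            T_RANGE=lambda w_df: w_df.TMAX - w_df.TMIN)
)
print(weekly)
```

Because `assign` returns a new DataFrame, it slots into the chain without an intermediate variable, which is the point the answer makes.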
One Line did the job for me.
df['New Column'] = 'Constant Value'
df['New Column'] = 123
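You can simply do the following (passing index=df.index so that the values align even when df does not have a default 0..n-1 index):
df['New Col'] = pd.Series(["abc" for x in range(len(df.index))], index=df.index)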
This single line will work.
df['name'] = 'abc'
The append method has been deprecated since pandas 1.4.0 (and was removed in 2.0).
So use the above method only if you are working with an actual pandas DataFrame object:
df["column"] = "value"
Or, if you are setting a value on a view or a copy of a DataFrame, use concat() or assign():
This way the new Series has the same index as the original DataFrame, so it matches on the exact rows.
# adds a new column in view `where_there_is_one` named
# `client` with value `display_name`
# `df` remains unchanged
df = pd.DataFrame({"number": ([1]*5 + [0]*5 )})
where_there_is_one = df[ df["number"] == 1]
where_there_is_one = pd.concat([
        where_there_is_one,
        pd.Series(["display_name"]*df.shape[0],
                  index=df.index,
                  name="client")
    ],
    join="inner", axis=1)
# Or use assign
where_there_is_one = where_there_is_one.assign(client = "display_name")
Output:
where_there_is_one:
| 0 | number | client |
| --- | --- | --- |
| 0 | 1 | display_name |
| 1 | 1 | display_name |
| 2 | 1 | display_name |
| 3 | 1 | display_name |
| 4 | 1 | display_name |
df:
| 0 | number |
| --- | --- |
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 0 |
| 6 | 0 |
| 7 | 0 |
| 8 | 0 |
| 9 | 0 |
OK, all, I have a similar situation here. If I use this code: df['Name']='abc'
then instead of 'abc' I want to take the value for the new column from somewhere else in the CSV file.
As you can see from the picture, df is not cleaned yet, but I want to create 2 columns: one containing "ADI dms rivoli" in every row, and another containing "December 2019" in every row. I hope this is clear; it was hard to explain, sorry.
