I am incredibly new to the pandas Python module and have a problem I'm trying to solve. Take the following dataframe as an example. This was read in from a .csv where "link" is the column header for the last three columns:
summary link link.1 link.2
0 test PCR-12345 PCR-54321 PCR-65432
1 test2 NaN NaN NaN
2 test3 DR-1234 PCR-1244 NaN
3 test4 PCR-4321 DR-4321 NaN
My goal is to update the dataframe to the following:
summary link link.1 link.2
0 test NaN NaN NaN
1 test2 NaN NaN NaN
2 test3 DR-1234 NaN NaN
3 test4 NaN DR-4321 NaN
So the criterion is basically: if the column header starts with "link" AND the value is a string that starts with "PCR-", replace it with an empty/NaN value.
How do I loop through each row's values, check the header and value, and replace the value when that criterion is satisfied?
Let's try pd.Series.str.startswith and pd.Series.mask:
# columns starting with `link`
cols = df.columns[df.columns.str[:4]=='link']
# for each `link` column, mask the `PCR` with `NaN`:
df[cols] = df[cols].apply(lambda x: x.mask(x.str.startswith('PCR')==True) )
Output:
summary link link.1 link.2
0 test NaN NaN NaN
1 test2 NaN NaN NaN
2 test3 DR-1234 NaN NaN
3 test4 NaN DR-4321 NaN
Here is another way. It uses str.startswith() to find the columns that start with "link", then where() to keep only the values whose prefix is "DR":
cols = df.columns.str.startswith('link')
df.loc[:,cols] = df.loc[:,cols].where(df.loc[:,cols].replace(r'[-].*','',regex=True).eq('DR'))
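(The replace(r'[-].*', '', regex=True) step strips everything from the first dash onward, so .eq('DR') is effectively a prefix test; where() then keeps the DR- values and turns everything else, including the PCR- ones, into NaN.)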
I used @Quang Hoang's answer. I also needed to make sure the headers were written back out to a csv with their original values. I did that by first grabbing the original header:
with open('test.csv') as f:
    orig_header = f.readline()

orig_header = orig_header.split(",")
orig_header[-1] = orig_header[-1].strip()  # get rid of the trailing newline
I then went ahead and did the data manipulation with Quang's suggestion. After that I wrote it back out to a csv with the original header:
df.to_csv('test_updated.csv', index = False, header=orig_header)
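An alternative sketch that avoids re-reading the file, assuming the only dots in your headers are the ".1", ".2" suffixes pandas appends to duplicated column names, is to strip those suffixes before writing (pandas will happily write duplicate column labels to csv):

# strip the suffix pandas added to the duplicated "link" headers
df.columns = [c.split('.')[0] if c.startswith('link') else c for c in df.columns]
df.to_csv('test_updated.csv', index=False)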
I want to create a simple matrix where I have as index the names of the software requirements and as columns all the software test cases in the project.
Where one SWRS is covered by one SWTS, I need to place "something" (for example a cross).
In my code draft, I create an empty dataframe and then iterate to place the cross:
import pandas as pd

struct = {
    "swrslist": ["swrs1", "swrs2", "swrs3", "swrs4"],
    "swtslist": ["swts1", "swts2", "swts3", "swts4", "swts5", "swts6"],
    "mapping":
    {
        "swrs1": ["swts1", "swts3", "swts4"],
        "swrs2": ["swts2", "swts3", "swts5"],
        "swrs4": ["swts1", "swts3", "swts5"]
    }
}

if __name__ == "__main__":
    df = pd.DataFrame(index=pd.Index(pd.Series(struct["swrslist"])),
                      columns=pd.Index(struct["swtslist"]))
    print(df)

    for key in struct["mapping"].keys():
        for elem in struct["mapping"][key]:
            print(key, elem)
            df.at[key, elem] = "x"

    print(df)
    df.to_excel("mapping.xlsx")
The output is the following:
swts1 swts2 swts3 swts4 swts5 swts6
swrs1 x NaN x x NaN NaN
swrs2 NaN x x NaN x NaN
swrs3 NaN NaN NaN NaN NaN NaN
swrs4 x NaN x NaN x NaN
I know that creating an empty dataframe and then iterating is not efficient.
I tried to create the dataframe as follows:
df = pd.DataFrame(struct["mapping"], index=pd.Index(pd.Series(struct["swrslist"])),
                  columns=pd.Index(struct["swtslist"]))
but it creates an empty dataframe:
swts1 swts2 swts3 swts4 swts5 swts6
swrs1 NaN NaN NaN NaN NaN NaN
swrs2 NaN NaN NaN NaN NaN NaN
swrs3 NaN NaN NaN NaN NaN NaN
swrs4 NaN NaN NaN NaN NaN NaN
Furthermore, in the future I plan to provide different values depending on whether a SWTS is passed, failed or not executed.
How can I create the dataframe efficiently, rather than iterating over the "mapping" entries?
Though I used a for loop too, how about this?
df = pd.DataFrame(index=pd.Index(pd.Series(struct["swrslist"])), columns=pd.Index(struct["swtslist"]))

for key, value in struct["mapping"].items():
    df.loc[key, value] = "x"
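(As a side note, the earlier pd.DataFrame(struct["mapping"], ...) attempt comes out all-NaN because the dict keys become column labels swrs1/swrs2/swrs4, which never match the swts labels requested via columns=.)

If you want to drop the explicit loop entirely, here is a hedged sketch (the pairs and matrix names are mine, not from the question): flatten the mapping into (requirement, test case) pairs, pivot with pd.crosstab, then reindex so unmapped entries such as swrs3 and swts6 still appear:

import numpy as np
import pandas as pd

# flatten the mapping dict into (requirement, test case) pairs
pairs = pd.DataFrame(
    [(req, tc) for req, tcs in struct["mapping"].items() for tc in tcs],
    columns=["swrs", "swts"],
)

# pivot to a 0/1 matrix, keep the unmapped rows/columns, then map 1 -> "x" and 0 -> NaN
matrix = (
    pd.crosstab(pairs["swrs"], pairs["swts"])
      .reindex(index=struct["swrslist"], columns=struct["swtslist"], fill_value=0)
      .replace({1: "x", 0: np.nan})
)

If you later need per-pair pass/fail/not-executed markers, the same pairs frame can carry a status column and be pivoted with DataFrame.pivot instead of crosstab.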
I have a simple problem that probably has a simple solution but I couldn't find it anywhere. I have the following multiindex column Dataframe:
mux = pd.MultiIndex.from_product([['A','B','C'], ['Datetime', 'Str', 'Ret']])
dfr = pd.DataFrame(columns=mux)
| A | B | C |
|Datetime|Str|Ret|Datetime|Str|Ret|Datetime|Str|Ret|
I need to add values one by one at the end of a specific subcolumn. For example, add one value at the end of column A sub-column Datetime and leave the rest of the row as it is, then add another value to column B sub-column Str and again leave the rest of the values in the same row untouched, and so on. So my questions are: Is it possible to target individual locations in this type of Dataframe? How? And is it possible to append not a full row but an individual value, always at the end after the previous value, without knowing where the end is? Thank you so much for your answers.
IIUC, you can use .loc:
idx = len(dfr) # get the index of the next row after the last one
dfr.loc[idx, ('A', 'Datetime')] = pd.to_datetime('2021-09-24')
dfr.loc[idx, ('B', 'Str')] = 'Hello'
dfr.loc[idx, ('C', 'Ret')] = 4.3
Output:
>>> dfr
A B C
Datetime Str Ret Datetime Str Ret Datetime Str Ret
0 2021-09-24 00:00:00 NaN NaN NaN Hello NaN NaN NaN 4.3
Update
I mean, for example, when I have a different number of values in different columns (for example 6 values in column A-Str but only 4 in column B-Datetime) but I don't really know how many. In that case what I need is to add the next value in that column after the last one, so I need to know the index of the last non-NaN value of that particular column so I can use it in your answer. If I use len(dfr) while trying to add a value to the column that only has 4 values, it will end up in the 7th row instead of the 5th row, because one of the columns may have more values than the others.
You can do it easily using last_valid_index. Create a convenient function append_to_col to append values in place in your dataframe:
def append_to_col(col, val):
    idx = dfr[col].last_valid_index()
    dfr.loc[idx + 1 if idx is not None else 0, col] = val
# Fill your dataframe
append_to_col(('A', 'Datetime'), '2021-09-24')
append_to_col(('A', 'Datetime'), '2021-09-25')
append_to_col(('B', 'Str'), 'Hello')
append_to_col(('C', 'Ret'), 4.3)
append_to_col(('C', 'Ret'), 8.2)
append_to_col(('A', 'Datetime'), '2021-09-26')
Output:
>>> dfr
A B C
Datetime Str Ret Datetime Str Ret Datetime Str Ret
0 2021-09-24 NaN NaN NaN Hello NaN NaN NaN 4.3
1 2021-09-25 NaN NaN NaN NaN NaN NaN NaN 8.2
2 2021-09-26 NaN NaN NaN NaN NaN NaN NaN NaN
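(Note this approach assumes dfr keeps its default 0..n RangeIndex, so that idx + 1 addresses the next empty row; last_valid_index() returns None while a column is still entirely NaN, hence the fallback to 0.)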
I have a df which has NaN in it. When I run df.dropna() it drops all rows which have a NaN value, but when I try df.dropna(thresh=2) nothing happens and no row gets deleted. Can someone please explain why this is occurring?
This is how I have changed the values to NaN:
for col in df.columns:
    df.loc[df[col] == '?', col] = np.nan
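(A loop isn't strictly required here; assuming the '?' cells are exact string matches, a single replace call does the same thing:)

import numpy as np
df = df.replace('?', np.nan)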
The first pic shows the total number of rows and columns,
the second pic is after I used df.dropna(),
and the third pic is after using df.dropna(thresh=2).
thresh=2 says the row must have at least 2 valid / non NaN values otherwise delete that row.
In the given screenshots there are 13 columns.
So, to remove the rows which have more than 2 NaN values, thresh should be thresh=11 (at least 11 non-NaN values per row).
In this way pandas will remove all the rows where it finds more than 2 NaN values.
Hope this helps!
As far as I know, thresh counts the non-NaN values in each row, whatever their type; a row is dropped only when it has fewer than thresh of them. E.g.:
import pandas as pd

data = [{'id': 1, 'name': 'John'},
        {'id': 2, 'name': 'Aaron', 'phone': 43242123213, 'age': 32},
        {'id': 3, 'name': 'alan'}]
df = pd.DataFrame(data)
OUTPUT:
age id name phone
0 NaN 1 John NaN
1 32.0 2 Aaron 4.324212e+10
2 NaN 3 alan NaN
>>> df.dropna(thresh=2)
This drops nothing, because every row still has at least 2 non-NaN values. But if I remove the name 'alan' at index 2, that row is left with only one non-NaN value (the id), so df.dropna(thresh=2) will delete it.
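A quick illustrative check on the frame above is to count the non-NaN values per row and compare against thresh:

>>> df.notna().sum(axis=1)
0    2
1    4
2    2
dtype: int64

Every row meets thresh=2, so nothing is dropped.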
I want to delete all the rows in a dataframe.
The reason I want to do this is so that I can reconstruct the dataframe with an iterative loop. I want to start with a completely empty dataframe.
Alternatively, I could create an empty df from just the column / type information if that is possible
Here's another method if you have an existing DataFrame that you'd like to empty without recreating the column information:
df_empty = df[0:0]
df_empty is a DataFrame with zero rows but with the same column structure as df
The latter is possible and strongly recommended - "inserting" rows row-by-row is highly inefficient. A sketch could be
>>> import numpy as np
>>> import pandas as pd
>>> index = np.arange(0, 10)
>>> df = pd.DataFrame(index=index, columns=['foo', 'bar'])
>>> df
Out[268]:
foo bar
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 NaN NaN
If you have an existing DataFrame with the columns you want, extract the column names into a list, then create an empty DataFrame with those column names.
# Creating DataFrame from a CSV file with desired headers
csv_a = "path/to/my.csv"
df_a = pd.read_csv(csv_a)
# Extract column names into a list
names = [x for x in df_a.columns]
# Create empty DataFrame with those column names
df_b = pd.DataFrame(columns=names)
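(A slightly shorter sketch of the same idea, passing the existing columns directly:)

df_b = pd.DataFrame(columns=df_a.columns)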
df.drop(df.index, inplace=True)
This line will delete all rows, while keeping the column names.
You can also use head:
df_empty = df.head(0)
Old thread, but I found another way:
df_final=df_dup[0:0].copy(deep=True)
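Finally, addressing the "from just the column / type information" part of the question: a minimal sketch (the column names and dtypes below are placeholders) that builds an empty, typed frame without any source DataFrame:

import pandas as pd

schema = {"foo": "float64", "bar": "object"}  # placeholder names and dtypes
df_empty = pd.DataFrame({col: pd.Series(dtype=dt) for col, dt in schema.items()})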
I am new to Stack Overflow and pandas for Python. I found part of my answer in the post Looking to merge two Excel files by ID into one Excel file using Python 2.7
However, I also want to merge or combine columns from the two excel files with the same name. I thought the following post would have my answer but I guess it's not titled correctly: Merging Pandas DataFrames with the same column name
Right now I have the code:
import pandas as pd
file1 = pd.read_excel("file1.xlsx")
file2 = pd.read_excel("file2.xlsx")
file3 = file1.merge(file2, on="ID", how="outer")
file3.to_excel("merged.xlsx")
file1.xlsx
ID,JanSales,FebSales,test
1,100,200,cars
2,200,500,
3,300,400,boats
file2.xlsx
ID,CreditScore,EMMAScore,test
2,good,Watson,planes
3,okay,Thompson,
4,not-so-good,NA,
what I get is merged.xlsx
ID,JanSales,FebSales,test_x,CreditScore,EMMAScore,test_y
1,100,200,cars,NaN,NaN,
2,200,500,,good,Watson,planes
3,300,400,boats,okay,Thompson,
4,NaN,NaN,,not-so-good,NaN,
what I want is merged.xlsx
ID,JanSales,FebSales,CreditScore,EMMAScore,test
1,100,200,NaN,NaN,cars
2,200,500,good,Watson,planes
3,300,400,okay,Thompson,boats
4,NaN,NaN,not-so-good,NaN,NaN
In my real data, there are 200+ columns that correspond to the "test" column in my example. I want the program to find these columns with the same names in both file1.xlsx and file2.xlsx and combine them in the merged file.
OK, here is a more dynamic way. After merging, we assume that clashes will occur and result in 'column_name_x' and 'column_name_y' suffixes.
So first figure out the common column names and remove 'ID' from this list:
In [51]:
common_columns = list(set(list(df1.columns)) & set(list(df2.columns)))
common_columns.remove('ID')
common_columns
Out[51]:
['test']
Now we can iterate over this list to create the new column, using where to conditionally take the value from whichever of the clashing columns is not null.
In [59]:
for col in common_columns:
    df3[col] = df3[col+'_x'].where(df3[col+'_x'].notnull(), df3[col+'_y'])
df3
Out[59]:
ID JanSales FebSales test_x CreditScore EMMAScore test_y test
0 1 100 200 cars NaN NaN NaN cars
1 2 200 500 NaN good Watson planes planes
2 3 300 400 boats okay Thompson NaN boats
3 4 NaN NaN NaN not-so-good NaN NaN NaN
[4 rows x 8 columns]
Then just to finish off drop all the extra columns:
In [68]:
clash_names = [elt+suffix for elt in common_columns for suffix in ('_x','_y') ]
clash_names
df3.drop(labels=clash_names, axis=1,inplace=True)
df3
Out[68]:
ID JanSales FebSales CreditScore EMMAScore test
0 1 100 200 NaN NaN cars
1 2 200 500 good Watson planes
2 3 300 400 okay Thompson boats
3 4 NaN NaN not-so-good NaN NaN
[4 rows x 6 columns]
The list-comprehension snippet above is from this post: Prepend prefix to list elements with list comprehension
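As a more compact alternative (a sketch, not part of the answer above): since every shared column other than 'ID' just needs its values coalesced, combine_first can do the whole job in one step. Note it prefers file1's value whenever both files have one, and it may reorder the columns:

file3 = (file1.set_index('ID')
              .combine_first(file2.set_index('ID'))
              .reset_index())
file3.to_excel("merged.xlsx", index=False)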