combine data in pandas - python

I have a pandas dataframe like this:
index integer_2_x integer_2_y
0 49348 NaN
1 26005 NaN
2 5 NaN
3 NaN 26
4 26129 NaN
5 129 NaN
6 NaN 26
7 NaN 17
8 60657 NaN
9 17031 NaN
I want to make a third column that looks like this, taking the numeric value from whichever of the first two columns is not NaN. How do I do this?
index integer_2_z
0 49348
1 26005
2 5
3 26
4 26129
5 129
6 26
7 17
8 60657
9 17031

One way is to use the update function.
import pandas as pd
import numpy as np
# some artificial data
# ========================
df = pd.DataFrame({'X':[10,20,np.nan,40,np.nan], 'Y':[np.nan,np.nan,30,np.nan,50]})
print(df)
X Y
0 10 NaN
1 20 NaN
2 NaN 30
3 40 NaN
4 NaN 50
# processing
# =======================
df['Z'] = df['X']
# for every missing value in column Z, replace it with value in column Y
df['Z'].update(df['Y'])
print(df)
X Y Z
0 10 NaN 10
1 20 NaN 20
2 NaN 30 30
3 40 NaN 40
4 NaN 50 50
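An equivalent one-liner is Series.combine_first, which fills the caller's NaNs from the other series (a sketch on the same artificial data):
# keep X where it is not NaN, fall back to Y otherwise
df['Z'] = df['X'].combine_first(df['Y'])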

I used Series.combine, documented at http://pandas.pydata.org/pandas-docs/stable/basics.html#general-dataframe-combine
import pandas as pd
import numpy as np
df = pd.read_csv("data", sep=r"\s+") # cut and pasted your data into a 'data' file
df["integer_2_z"] = df["integer_2_x"].combine(df["integer_2_y"], lambda x, y: np.where(pd.isnull(x), y, x))
Output
index integer_2_x integer_2_y integer_2_z
0 0 49348 NaN 49348
1 1 26005 NaN 26005
2 2 5 NaN 5
3 3 NaN 26 26
4 4 26129 NaN 26129
5 5 129 NaN 129
6 6 NaN 26 26
7 7 NaN 17 17
8 8 60657 NaN 60657
9 9 17031 NaN 17031
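The same result can also be had without combine by applying np.where to the whole columns at once (a sketch, assuming the frame read above):
# vectorized: pick y where x is NaN, otherwise keep x
df["integer_2_z"] = np.where(df["integer_2_x"].isnull(),
                             df["integer_2_y"],
                             df["integer_2_x"])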

Maybe you can simply use the fillna function.
import pandas as pd
import numpy as np

# Creating the DataFrame
df = pd.DataFrame({'integer_2_x': [49348, 26005, 5, np.nan, 26129, 129, np.nan, np.nan, 60657, 17031],
                   'integer_2_y': [np.nan, np.nan, np.nan, 26, np.nan, np.nan, 26, 17, np.nan, np.nan]})
# Using fillna to fill a new column
df['integer_2_z'] = df['integer_2_x'].fillna(df['integer_2_y'])
# Printing the result below; you can also drop the x and y columns if they are no longer required
print(df)
integer_2_x integer_2_y integer_2_z
0 49348 NaN 49348
1 26005 NaN 26005
2 5 NaN 5
3 NaN 26 26
4 26129 NaN 26129
5 129 NaN 129
6 NaN 26 26
7 NaN 17 17
8 60657 NaN 60657
9 17031 NaN 17031
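If the x and y columns are indeed no longer required, dropping them afterwards is a one-liner:
df = df.drop(columns=['integer_2_x', 'integer_2_y'])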

Related

Remove pandas row that is based on previous row

I have the following dataframe, in which the value should be increasing. Originally the dataframe has some unknown values.
index  value
0      1
1
2
3      2
4
5
6
7      4
8
9
10     3
11     3
12
13
14
15     5
Based on the assumption that the value should be increasing, I would like to remove the values at indices 10 and 11. This would be the desired dataframe:
index  value
0      1
1
2
3      2
4
5
6
7      4
8
9
12
13
14
15     5
Thank you very much
Assuming NaN in the empty cells (if not, temporarily replace them with NaN), use boolean indexing:
# if not NaNs uncomment below
# and use s in place of df['value'] afterwards
# s = pd.to_numeric(df['value'], errors='coerce')
# is the cell empty?
m1 = df['value'].isna()
# are the values strictly increasing?
m2 = df['value'].ge(df['value'].cummax())
out = df[m1|m2]
Output:
index value
0 0 1.0
1 1 NaN
2 2 NaN
3 3 2.0
4 4 NaN
5 5 NaN
6 6 NaN
7 7 4.0
8 8 NaN
9 9 NaN
12 12 NaN
13 13 NaN
14 14 NaN
15 15 5.0
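For reference, a self-contained version of this answer, rebuilding the question's frame with NaN in the empty cells (the values are taken from the question's table):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'index': range(16),
    'value': [1, np.nan, np.nan, 2, np.nan, np.nan, np.nan, 4,
              np.nan, np.nan, 3, 3, np.nan, np.nan, np.nan, 5],
})

m1 = df['value'].isna()                    # empty cells are kept
m2 = df['value'].ge(df['value'].cummax())  # values that keep the running max are kept
out = df[m1 | m2]                          # drops indices 10 and 11
print(out)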
Try this:
def del_df(df):
    df_no_na = df.dropna().reset_index(drop=True)
    num_tmp = df_no_na['value'][0]  # first value which is not NaN
    del_index_list = []  # indices to delete
    for row_index in range(1, len(df_no_na)):
        if df_no_na['value'][row_index] > num_tmp:  # increasing
            num_tmp = df_no_na['value'][row_index]  # to compare the following two values
        else:  # not increasing (same or decreasing)
            del_index_list.append(df_no_na['index'][row_index])  # index to delete
    df_goal = df.drop([df.index[i] for i in del_index_list])
    return df_goal
Output:
index value
0 0 1.0
1 1 NaN
2 2 NaN
3 3 2.0
4 4 NaN
5 5 NaN
6 6 NaN
7 7 4.0
8 8 NaN
9 9 NaN
12 12 NaN
13 13 NaN
14 14 NaN
15 15 5.0
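For completeness, the output above comes from calling the helper on the question's frame, e.g. print(del_df(df)). Note that it relies on the frame having an explicit 'index' column, as in the question's data.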

Insert rows from dataframeB to DataframeA with keys and without Merge

I have a dataframe with thousands of records:
ID to from Date price Type
1 69 18 2/2020 10 A
2 11 12 2/2020 5 A
3 18 10 3/2020 4 B
4 10 11 3/2020 10 A
5 12 69 3/2020 4 B
6 12 20 3/2020 3 B
7 69 21 3/2020 3 A
The output that I want is:
ID to from Date price Type ID to from Date price Type
1 69 18 2/2020 4 A 5 12 69 3/2020 4 B
1' 69 18 2/2020 6 A NaN NaN NaN NaN NaN NaN
2 11 12 2/2020 5 A NaN NaN NaN NaN NaN NaN
4 10 11 3/2020 4 A 3 18 10 3/2020 4 B
4' 10 11 3/2020 6 A NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN 6 12 20 3/2020 3 B
7 69 21 3/2020 3 A NaN NaN NaN NaN NaN NaN
The idea is to iterate over the rows: if the type is B, put the row next to the first record with type A whose to equals its from. If the prices are equal that's fine; if not, split the row with the higher price, and the new row's price is the difference left after subtraction.
I split the dataframe into types A and B, and I'm trying to iterate over both of them:
grp = df.groupby('Type')
transformed_df_list = []
for idx, frame in grp:
    frame.reset_index(drop=True, inplace=True)
    transformed_df_list.append(frame.copy())

A = pd.DataFrame(transformed_df_list[0])
B = pd.DataFrame(transformed_df_list[1])

for i, row in A.iterrows():
    for j, row1 in B.iterrows():
        if row['to'] == row1['from']:
            if row['price'] == row1['price']:
                row_df = pd.DataFrame([row1])

output = pd.merge(A, B, how='left', left_on=['to'], right_on=['from'])
The problem is that with the merge function I get several duplicate rows, and I can't check the price to split the row. Is there a way to insert the B rows into the A dataframe without the merge function?
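No answer is recorded here, but a rough sketch of the matching-and-splitting logic described above might look like the following. This is a hypothetical approach, assuming each B row matches at most one A row and that, as in the sample, the A row carries the higher price when they differ; the resulting pairs could then be laid side by side to build the wide output.
import pandas as pd

# assumes df is the question's frame with columns ID, to, from, Date, price, Type
A = df[df['Type'] == 'A'].copy()
B = df[df['Type'] == 'B'].copy()

pairs = []      # (A-side row or None, B-side row or None)
used_b = set()  # IDs of B rows already matched

for _, a in A.iterrows():
    candidates = B[(B['from'] == a['to']) & (~B['ID'].isin(used_b))]
    if candidates.empty:
        pairs.append((a, None))  # A row with no B partner
        continue
    b = candidates.iloc[0]
    used_b.add(b['ID'])
    if a['price'] == b['price']:
        pairs.append((a, b))     # prices agree, pair as-is
    else:
        matched = a.copy()
        matched['price'] = b['price']            # matched part takes the B price
        rest = a.copy()
        rest['price'] = a['price'] - b['price']  # leftover price goes on a split row
        pairs.append((matched, b))
        pairs.append((rest, None))

# B rows that never found an A partner
for _, b in B[~B['ID'].isin(used_b)].iterrows():
    pairs.append((None, b))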

pivot table in specific intervals pandas

I have a one column dataframe which looks like this:
Neive Bayes
0 8.322087e-07
1 3.213342e-24
2 4.474122e-28
3 2.230054e-16
4 3.957606e-29
5 9.999992e-01
6 3.254807e-13
7 8.836033e-18
8 1.222642e-09
9 6.825381e-03
10 5.275194e-07
11 2.224289e-06
12 2.259303e-09
13 2.014053e-09
14 1.755933e-05
15 1.889681e-04
16 9.929193e-01
17 4.599619e-05
18 6.944654e-01
19 5.377576e-05
I want to pivot it to wide format, but in specific intervals: the first 9 rows should make up the 9 columns of the first row, and the pattern should continue until the final table has 9 columns and 9 times fewer rows than now. How would I achieve this?
Using pivot_table:
df.pivot_table(columns=df.index % 9, index=df.index // 9, values='Neive Bayes')
0 1 2 3 4 \
0 8.322087e-07 3.213342e-24 4.474122e-28 2.230054e-16 3.957606e-29
1 6.825381e-03 5.275194e-07 2.224289e-06 2.259303e-09 2.014053e-09
2 6.944654e-01 5.377576e-05 NaN NaN NaN
5 6 7 8
0 0.999999 3.254807e-13 8.836033e-18 1.222642e-09
1 0.000018 1.889681e-04 9.929193e-01 4.599619e-05
2 NaN NaN NaN NaN
Construct a MultiIndex, then set_index and unstack:
iix = pd.MultiIndex.from_arrays([np.arange(df.shape[0]) // 9,
                                 np.arange(df.shape[0]) % 9])
df_wide = df.set_index(iix)['Neive Bayes'].unstack()
Output:
0 1 2 3 4 \
0 8.322087e-07 3.213342e-24 4.474122e-28 2.230054e-16 3.957606e-29
1 6.825381e-03 5.275194e-07 2.224289e-06 2.259303e-09 2.014053e-09
2 6.944654e-01 5.377576e-05 NaN NaN NaN
5 6 7 8
0 0.999999 3.254807e-13 8.836033e-18 1.222642e-09
1 0.000018 1.889681e-04 9.929193e-01 4.599619e-05
2 NaN NaN NaN NaN
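A plain numpy alternative is to pad the column to a multiple of 9 and reshape (a sketch, assuming the single-column frame from the question):
import numpy as np
import pandas as pd

vals = df['Neive Bayes'].to_numpy()
n = -(-len(vals) // 9) * 9        # round the length up to a multiple of 9
padded = np.full(n, np.nan)       # pad the tail with NaN
padded[:len(vals)] = vals
df_wide = pd.DataFrame(padded.reshape(-1, 9))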

Appending Pandas DataFrame in a loop

Let's say I have a df such as this:
df = pd.DataFrame({'A': [1,2,3,4,5], 'A_z': [2,3,4,5,6], 'B': [3,4,5,6,7], 'B_z': [4,5,6,7,8],
'C': [5,6,7,8,9], 'C_z': [6,7,8,9,10]})
Which looks like this:
A A_z B B_z C C_z
0 1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9
4 5 6 7 8 9 10
What I'm looking to do is create a new df and, for each letter (A, B, C), append that letter's two columns of data vertically, so that it looks like this:
Letter Letter_z
0 1 2
1 2 3
2 3 4
3 4 5
4 5 6
5 3 4
6 4 5
7 5 6
8 6 7
9 7 8
10 5 6
11 6 7
12 7 8
13 8 9
14 9 10
As far as I'm concerned something like this should work fine:
for col in df.columns:
    if col[-1] != 'z':
        new_df = new_df.append(df[[col, col + '_z']])
However this results in the following mess:
A A_z B B_z C C_z
0 1.0 2.0 NaN NaN NaN NaN
1 2.0 3.0 NaN NaN NaN NaN
2 3.0 4.0 NaN NaN NaN NaN
3 4.0 5.0 NaN NaN NaN NaN
4 5.0 6.0 NaN NaN NaN NaN
0 NaN NaN 3.0 4.0 NaN NaN
1 NaN NaN 4.0 5.0 NaN NaN
2 NaN NaN 5.0 6.0 NaN NaN
3 NaN NaN 6.0 7.0 NaN NaN
4 NaN NaN 7.0 8.0 NaN NaN
0 NaN NaN NaN NaN 5.0 6.0
1 NaN NaN NaN NaN 6.0 7.0
2 NaN NaN NaN NaN 7.0 8.0
3 NaN NaN NaN NaN 8.0 9.0
4 NaN NaN NaN NaN 9.0 10.0
What am I doing wrong? Any help would be really appreciated, cheers.
EDIT:
After the kind help from jezrael, the renaming of the columns in his answer got me thinking about a possible way to do it using my original train of thought.
I can now also achieve the new df I want using the following:
for col in df:
    if col[-1] != 'z':
        d = df[[col, col + '_z']]
        d.columns = ['Letter', 'Letter_z']
        new_df = new_df.append(d)
The differing column names were clearly what was causing the problem, which is something I wasn't aware of at the time. Hope this helps someone.
One idea is to use Series.str.split with expand=True to build a MultiIndex, then use rename to avoid NaNs and produce the final column names, reshape with DataFrame.stack, sort into the correct order with DataFrame.sort_index, and finally remove the MultiIndex:
df.columns = df.columns.str.split('_', expand=True)
df = df.rename(columns=lambda x:'Letter_z' if x == 'z' else 'Letter', level=1)
df = df.stack(0).sort_index(level=[1,0]).reset_index(drop=True)
print (df)
Letter Letter_z
0 1 2
1 2 3
2 3 4
3 4 5
4 5 6
5 3 4
6 4 5
7 5 6
8 6 7
9 7 8
10 5 6
11 6 7
12 7 8
13 8 9
14 9 10
Or, if possible, simplify the problem by reshaping all non-z values into one column and all z values into another with numpy.ravel:
m = df.columns.str.endswith('_z')
a = df.loc[:, ~m].to_numpy().T.ravel()
b = df.loc[:, m].to_numpy().T.ravel()
df = pd.DataFrame({'Letter': a,'Letter_z': b})
print (df)
Letter Letter_z
0 1 2
1 2 3
2 3 4
3 4 5
4 5 6
5 3 4
6 4 5
7 5 6
8 6 7
9 7 8
10 5 6
11 6 7
12 7 8
13 8 9
14 9 10
You can use the function concat and a list comprehension:
cols = df.columns[~df.columns.str.endswith('_z')]
func = lambda x: 'letter_z' if x.endswith('_z') else 'letter'
pd.concat([df.filter(like=i).rename(func, axis=1) for i in cols])
or
cols = df.columns[~df.columns.str.endswith('_z')]
pd.concat([df.filter(like=i).set_axis(['letter', 'letter_z'], axis=1, inplace=False) for i in cols])
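If a fresh 0..n-1 row index is wanted, as in the desired output above, concat's ignore_index flag takes care of it:
out = pd.concat(
    [df.filter(like=i).set_axis(['letter', 'letter_z'], axis=1) for i in cols],
    ignore_index=True,  # renumber rows instead of repeating 0..4
)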

How to set a value of a pandas dataframe between two indices?

I would like to set a value in a pandas dataframe based on the values of another column. In a nutshell, for example, if I wanted to set the values in a column my_column of a pandas dataframe pd where another column, my_interesting_column, is between 10 and 30, I would like to do something like:
start_index=pd.find_closest_index_where_pd["my_interesting_column"].is_closest_to(10)
end_index=pd.find_closest_index_where_pd["my_interesting_column"].is_closest_to(30)
pd["my_column"].between(star_index, end_index)= some_value
As a simple illustration, suppose I have the following dataframe
df = pd.DataFrame(np.arange(10, 20), columns=list('A'))
df["B"]=np.nan
>>> df
A B
0 10 NaN
1 11 NaN
2 12 NaN
3 13 NaN
4 14 NaN
5 15 NaN
6 16 NaN
7 17 NaN
8 18 NaN
9 19 NaN
How can I do something like
df.where(df["A"].is_between(13,16))= 5
So that the end result looks like:
>>> df
A B
0 10 NaN
1 11 NaN
2 12 NaN
3 13 5
4 14 5
5 15 5
6 16 5
7 17 NaN
8 18 NaN
9 19 NaN
pd.loc[start_idx:end_idx, 'my_column'] = some_value
I think this is what you are looking for:
df.loc[(df['A'] >= 13) & (df['A'] <= 16), 'B'] = 5
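Series.between builds the same mask more compactly (both endpoints are inclusive by default):
df.loc[df['A'].between(13, 16), 'B'] = 5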
