How do I perform the below operation on a pandas DataFrame in Python?

import pandas as pd

df1 = pd.DataFrame({
    'Year': ["1A", "2A", "3A", "4A", "5A"],
    'Tval1': [1, 9, 8, 1, 6],
    'Tval2': [34, 56, 67, 78, 89]
})
I want to reshape it so that the second value column is moved under each row: 1A keeps the value from Tval1, a new row 1B gets the value from Tval2, and so on.

The idea is to extract the numbers from the Year column, set new names for the value columns, and reshape with DataFrame.stack:
df1['Year'] = df1['Year'].str.extract(r'(\d+)', expand=False)
df = df1.set_index('Year')
# add letters by length of columns, works for 1 to 26 columns (A-Z)
import string
df.columns = list(string.ascii_uppercase[:len(df.columns)])
# here this is the same as
# df.columns = ['A', 'B']
df = df.stack().reset_index(name='Val')
df['Year'] = df['Year'] + df.pop('level_1')
print(df)
Year Val
0 1A 1
1 1B 34
2 2A 9
3 2B 56
4 3A 8
5 3B 67
6 4A 1
7 4B 78
8 5A 6
9 5B 89
Another idea with DataFrame.melt:
df = (df1.replace({'Year': {'A': ''}}, regex=True)
         .rename(columns={'Tval1': 'A', 'Tval2': 'B'})
         .melt('Year'))
df['Year'] = df['Year'] + df.pop('variable')
print(df)
Year value
0 1A 1
1 2A 9
2 3A 8
3 4A 1
4 5A 6
5 1B 34
6 2B 56
7 3B 67
8 4B 78
9 5B 89
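Note that melt stacks all the A values before the B values; if you need the interleaved 1A, 1B, 2A, 2B order of the first approach, a follow-up sort on the numeric part works (my addition, and the key argument needs pandas 1.1+):

df = df.sort_values('Year',
                    key=lambda s: s.str.extract(r'(\d+)', expand=False).astype(int),
                    ignore_index=True)
print(df)

The sort is stable, so within each number the A row stays before the B row.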

Try the code below. I split the frame into two dataframes, then concatenated them after changing the 'A' suffix in Year to 'B' for the second one.
import pandas as pd
df = pd.DataFrame(data=dict(Year=['1A', '2A', '3A'], val1=[1, 2, 3], val2=[4,5,6]))
df1 = df.drop(columns=['val2'])
df2 = df.drop(columns=['val1'])
columns = ['Year', 'val']
df1.columns = columns
df2.columns = columns
df2['Year'] = df2['Year'].str.replace('A', 'B')
pd.concat([df1, df2]).reset_index(drop=True)
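For completeness, a sketch with pd.wide_to_long, which keys on the shared Tval stub of the value columns; the suffix-to-letter mapping below is my own choice for the labels:

import pandas as pd

df1 = pd.DataFrame({
    'Year': ["1A", "2A", "3A", "4A", "5A"],
    'Tval1': [1, 9, 8, 1, 6],
    'Tval2': [34, 56, 67, 78, 89]
})
# wide_to_long keys on the numeric suffix of the Tval columns
out = pd.wide_to_long(df1, stubnames='Tval', i='Year', j='suffix').reset_index()
# rebuild the Year labels: digits from the old Year plus A/B from the suffix
out['Year'] = out['Year'].str.extract(r'(\d+)', expand=False) + out['suffix'].map({1: 'A', 2: 'B'})
out = out.drop(columns='suffix').rename(columns={'Tval': 'Val'})
print(out)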

Related

Convert Nx1 pandas dataframe with single 1xM array-containing column to M columns in Pandas dataframe

This is the current dataframe I have: it is Nx1, with each cell containing a NumPy array.
print(df)
age
0 [35, 34, 55, 56]
1 [25, 34, 35, 66]
2 [45, 35, 53, 16]
.
.
.
N [45, 35, 53, 16]
I would like to ravel the values of each cell into new columns, one per element.
# do conversion
print(df)
age1 age2 age3 age4
0 35 34 55 56
1 25 34 35 66
2 45 35 53 16
.
.
.
N 45 35 53 16
You can reconstruct the dataframe from the lists, and customize the column names with:
# rebuild the frame from the list of arrays, then shift the column labels to start at 1
df = pd.DataFrame(df.age.values.tolist())
df.columns += 1
df = df.add_prefix('age')
print(df)
age1 age2 age3 age4
0 35 34 55 56
1 25 34 35 66
...
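One side note of mine: the DataFrame constructor pads shorter lists with NaN, so the approach above also degrades gracefully if the arrays happen to be ragged:

pd.DataFrame([[1, 2], [3, 4, 5]])  # row 0 gets NaN in the third column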
Here is another alternative:
import pandas as pd
df = pd.DataFrame({'age': [[35, 34, 55, 54], [1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]})
df['age_aux'] = df['age'].astype(str).str.split(',')
for i in range(4):
    df['age_' + str(i)] = df['age_aux'].str.get(i).map(lambda x: x.lstrip('[').rstrip(']'))
df = df.drop(columns=['age', 'age_aux'])
print(df)
Output:
age_0 age_1 age_2 age_3
0 35 34 55 54
1 1 2 3 4
2 5 6 7 8
3 9 10 11 12
You can create the DataFrame with the constructor for better performance, and change the column names via rename with f-strings:
df1 = (pd.DataFrame(df.age.values.tolist(), index=df.index)
         .rename(columns=lambda x: f'age{x+1}'))
Another variation is to apply pd.Series to the column and massage the column names:
df = pd.DataFrame({"age": [[1, 2, 3, 4], [2, 3, 4, 5]]})
df = df["age"].apply(pd.Series)
df.columns = ["age1", "age2", "age3", "age4"]
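Another option worth sketching (not from the original answers): stack the arrays with NumPy first and build the frame in one shot, which avoids the per-row overhead of apply; this assumes every cell holds the same number of values:

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [[35, 34, 55, 56], [25, 34, 35, 66], [45, 35, 53, 16]]})
arr = np.vstack(df['age'].to_numpy())  # one 2-D array from the per-row lists
out = pd.DataFrame(arr, index=df.index, columns=[f'age{i+1}' for i in range(arr.shape[1])])
print(out)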

How to initialize or change a dataframe according to my index length?

I have an index that looks like:
MyIndex
11
12
13
and a dataframe which might be longer than my index (they could be equal in some situations):
OldIndex c1
0 00
1 01
2 02
3 03
4 04
I want to fit the dataframe to the index by dropping the extra rows at the tail:
MyIndex c1
11 00
12 01
13 02
Is there any simple solution? It would be better if I could achieve this without creating a new dataframe.
You can try this:
my_idx = pd.Series([11, 12, 13], name='MyIndex')
df = pd.DataFrame({'OldIndex': [0, 1, 2, 3, 4],
                   'c1': ['00', '01', '02', '03', '04']}).set_index('OldIndex')
l = len(my_idx)
df = df.iloc[:l].set_index(my_idx)
c1
MyIndex
11 00
12 01
13 02
A simple way to do it would be this...
assuming you have:
df1 = pd.DataFrame(index=[0,1,2])
df2 = pd.DataFrame({'c1':[1,2,3,4,5]},index=[0,1,2,3,4])
df1['c1'] = df2['c1'].values[:len(df1.index)]
output:
>>> df1
c1
0 1
1 2
2 3
without creating a new df...
say ind = pd.Index([0,1,2])
df2, df2.index = df2.iloc[:len(ind)], ind
(the tuple assignment first rebinds df2 to the sliced frame, then sets the new index on it)
output:
>>> df2
c1
0 1
1 2
2 3
You have to create the desired index (as a pd.Index) and then set this index on the new df (a subset based on the length of the new index):
myindex = pd.Index(['I1','I2','I3'])
df = df.iloc[:len(myindex)].set_index(myindex)
Output:
c1
I1 00
I2 01
I3 02
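If you really want to avoid rebinding the name to a new frame, a minimal sketch with an in-place drop (my addition; it assumes the row labels are unique):

import pandas as pd

df = pd.DataFrame({'c1': ['00', '01', '02', '03', '04']})
my_idx = pd.Index([11, 12, 13], name='MyIndex')
df.drop(df.index[len(my_idx):], inplace=True)  # drop the extra tail rows in place
df.index = my_idx                              # relabel the remaining rows
print(df)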

Subtract two columns in dataframe

My df looks as follows:
Index Country Val1 Val2 ... Val10
1 Australia 1 3 ... 5
2 Bambua 12 33 ... 56
3 Tambua 14 34 ... 58
I'd like to subtract Val1 from Val10 for each country, so the output looks like:
Country Val10-Val1
Australia 4
Bambua 23
Tambua 24
So far I've got:
def myDelta(row):
    data = row[['Val10', 'Val1']]
    return pd.Series({'Delta': np.subtract(data)})

def runDeltas():
    myDF = getDF() \
        .apply(myDelta, axis=1) \
        .sort_values(by=['Delta'], ascending=False)
    return myDF
runDeltas results in this error:
ValueError: ('invalid number of arguments', u'occurred at index 9')
What's the proper way to fix this?
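For reference, the error comes from calling np.subtract with a single argument, while it needs two. A minimal fix of the original functions might look like this; getDF below is a stand-in for the question's own helper:

import numpy as np
import pandas as pd

def getDF():
    # stand-in for the question's helper
    return pd.DataFrame([["Australia", 1, 3, 5],
                         ["Bambua", 12, 33, 56],
                         ["Tambua", 14, 34, 58]],
                        columns=["Country", "Val1", "Val2", "Val10"])

def myDelta(row):
    # np.subtract needs both operands; passing one Series caused the ValueError
    return pd.Series({'Delta': np.subtract(row['Val10'], row['Val1'])})

def runDeltas():
    return (getDF()
            .apply(myDelta, axis=1)
            .sort_values(by=['Delta'], ascending=False))

print(runDeltas())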
Given the following dataframe:
import pandas as pd
df = pd.DataFrame([["Australia", 1, 3, 5],
                   ["Bambua", 12, 33, 56],
                   ["Tambua", 14, 34, 58]],
                  columns=["Country", "Val1", "Val2", "Val10"])
It comes down to a simple broadcasting operation:
>>> df["Val1"] - df["Val10"]
0 -4
1 -44
2 -44
dtype: int64
You can also store this into a new column with:
>>> df['Val_1_minus_10'] = df['Val1'] - df['Val10']
>>> df
Country Val1 Val2 Val10 Val_1_minus_10
0 Australia 1 3 5 -4
1 Bambua 12 33 56 -44
2 Tambua 14 34 58 -44
Using this as the df:
df = pd.DataFrame([["Australia", 1, 3, 5],
                   ["Bambua", 12, 33, 56],
                   ["Tambua", 14, 34, 58]],
                  columns=["Country", "Val1", "Val2", "Val10"])
You can also do the subtraction and put it into a new column as follows.
>>> df['Val_Diff'] = df['Val10'] - df['Val1']
Country Val1 Val2 Val10 Val_Diff
0 Australia 1 3 5 4
1 Bambua 12 33 56 44
2 Tambua 14 34 58 44
You can do this by using a lambda function and assigning the result to a new column:
df['Val10-Val1'] = df.apply(lambda x: x['Val10'] - x['Val1'], axis=1)
print(df)
You can also use the pandas.DataFrame.assign function, e.g.:
import numpy as np
import pandas as pd
df = pd.DataFrame([["Australia", 1, 3, 5],
                   ["Bambua", 12, 33, 56],
                   ["Tambua", 14, 34, 58]],
                  columns=["Country", "Val1", "Val2", "Val10"])
df = df.assign(Val10_minus_Val1 = df['Val10'] - df['Val1'])
The best part of assign is that you can add as many assignments as you wish, e.g. getting both the difference and then the log of it:
df = df.assign(Val10_minus_Val1=df['Val10'] - df['Val1'],
               log_result=lambda x: np.log(x.Val10_minus_Val1))
Results:
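(reconstructed below from the code above, since the original table didn't survive; log values rounded)
     Country  Val1  Val2  Val10  Val10_minus_Val1  log_result
0  Australia     1     3      5                 4    1.386294
1     Bambua    12    33     56                44    3.784190
2     Tambua    14    34     58                44    3.784190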
Though it's an old question, pandas allows subtracting two DataFrames or Series using pandas.DataFrame.subtract:
import pandas as pd
df = pd.DataFrame([["Australia", 1, 3, 5],
                   ["Bambua", 12, 33, 56],
                   ["Tambua", 14, 34, 58]],
                  columns=["Country", "Val1", "Val2", "Val10"])
df["Val1"].subtract(df["Val2"])
Output:
0 -2
1 -21
2 -20
dtype: int64
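A small extra not in the original answer: subtract (alias sub) also accepts an axis argument, so one column can be subtracted from several at once:

# subtract Val1 row-wise from both Val2 and Val10
df[['Val2', 'Val10']].sub(df['Val1'], axis=0)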
You can also use eval here:
In [12]: df.eval('Val10_minus_Val1 = Val10-Val1', inplace=True)
In [13]: df
Out[13]:
Country Val1 Val2 Val10 Val10_minus_Val1
0 Australia 1 3 5 4
1 Bambua 12 33 56 44
2 Tambua 14 34 58 44
Since inplace=True you don't have to assign it back to df.
What I faced today makes me keen to share it with you. As people mentioned above, you can simply use:
df['Val10-Val1'] = df['Val10']-df['Val1']
but sometimes you might need to use the apply function, in which case remember axis=1:
df['Val10-Val1'] = df.apply(lambda row: row['Val10'] - row['Val1'], axis=1)

Combine rows with different names for a column in pandas (Python)

I have a sample data set:
import pandas as pd
df = {
    'columA': ['1A', '2A', '3A', '4A', '5A', '6A'],
    'count': [1, 12, 34, 52, '3', 2],
    'columnB': ['a', 'dd', 'dd', 'ee', 'd', 'f']
}
df = pd.DataFrame(df)
it looks like this:
columA columnB count
1A a 1
2A dd 12
3A dd 34
4A ee 52
5A d 3
6A f 2
Update: the combined 2A and 3A name should be something arbitrary like 'SAB' or '2A plus 3A', etc.; I used '2A|3A' as the example and it confused some people.
I want to sum up the count of rows 2A and 3A and give the combined row the name SAB.
desired output:
columA columnB count
1A a 1
SAB dd 46
4A ee 52
5A d 3
6A f 2
We can use a groupby on columnB:
df = {'columA': ['1A', '2A', '3A', '4A', '5A', '6A'],
      'count': [1, 12, 34, 52, '3', 2],
      'columnB': ['a', 'dd', 'dd', 'ee', 'd', 'f']}
df = pd.DataFrame(df)
df.groupby('columnB').agg({'count': 'sum', 'columA': 'sum'})
columA count
columnB
a 1A 1
d 5A 3
dd 2A3A 46
ee 4A 52
f 6A 2
If you're concerned about the combined columA label, you can pass a custom aggregation function like so:
def join_by_pipe(s):
    return '|'.join(s)
df.groupby('columnB').agg({'count': 'sum', 'columA': join_by_pipe})
columA count
columnB
a 1A 1
d 5A 3
dd 2A|3A 46
ee 4A 52
f 6A 2
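To get exactly the desired output, with only 2A and 3A merged under the arbitrary label SAB and the other rows untouched, here is a minimal sketch of one way (it assumes those are the only rows to combine):

# relabel the rows to merge, then group on both columns preserving order
df['columA'] = df['columA'].replace({'2A': 'SAB', '3A': 'SAB'})
out = (df.groupby(['columA', 'columnB'], sort=False, as_index=False)
         .agg({'count': 'sum'}))
print(out)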

How to filter this Python dataframe

Greetings. I am trying to get the smallest-size dataframe that contains only valid (fully populated) rows.
import pandas as pd
import random

columns = ['x0', 'y0']
df_ = pd.DataFrame(index=range(0, 30), columns=columns)
df_ = df_.fillna(0)

columns1 = ['x1', 'y1']
df = pd.DataFrame(index=range(0, 11), columns=columns1)
for index, row in df.iterrows():
    df.loc[index, "x1"] = random.randint(1, 100)
    df.loc[index, "y1"] = random.randint(1, 100)
df_ = df_.combine_first(df)

df = pd.DataFrame(index=range(0, 17), columns=columns1)
for index, row in df.iterrows():
    df.loc[index, "x2"] = random.randint(1, 100)
    df.loc[index, "y2"] = random.randint(1, 100)
df_ = df_.combine_first(df)
In this example the result should keep rows 0 to 10 and filter out the rest.
I thought of keeping a counter to track the minimum row count, or using pandasql, or maybe there is a trick to get this information (the size of each dataframe) directly from the dataframe. In practice I will be appending 500+ files of various sizes and using the result for some analysis, so performance is a consideration.
-student of python
If you want to drop the rows which have NaNs, use dropna (here, these are rows 0 through 10):
In [11]: df_.dropna()
Out[11]:
x0 x1 x2 y0 y1 y2
0 0 49 58 0 68 2
1 0 2 37 0 19 71
2 0 26 95 0 12 17
3 0 87 5 0 70 69
4 0 84 77 0 70 92
5 0 71 98 0 22 5
6 0 28 95 0 70 15
7 0 31 19 0 24 31
8 0 9 37 0 55 29
9 0 30 53 0 15 45
10 0 8 61 0 74 41
However, a cleaner, more efficient, and faster way to do this entire process is to update just those first rows (I'm assuming the random integer stuff is just you generating some example dataframes).
Let's store your DataFrames in a list:
In [20]: import numpy as np; import pandas as pd

In [21]: df1 = pd.DataFrame([[1, 2], [np.nan, 4]], columns=['a', 'b'])
In [22]: df2 = pd.DataFrame([[1, 2], [5, 6], [7, 8]], columns=['a', 'c'])
In [23]: dfs = [df1, df2]
Take the minimum length:
In [24]: m = min(len(df) for df in dfs)
First create an empty DataFrame with the desired rows and columns:
In [25]: from functools import reduce  # stdlib import needed on Python 3

In [26]: columns = reduce(lambda x, y: y.columns.union(x), dfs, pd.Index([]))

In [27]: res = pd.DataFrame(index=np.arange(m), columns=columns)
To do this efficiently we're going to use update, making these changes in place, on just this one DataFrame*:
In [28]: for df in dfs:
    ...:     res.update(df)
In [29]: res
Out[29]:
a b c
0 1 2 2
1 5 4 6
*If we didn't do this, or were using combine_first or similar, we'd most likely have lots of copying (new DataFrames being created), which will slow things down.
Note: combine_first doesn't offer an inplace flag... you could use combine but this is also more complicated (as well as less efficient). It's also quite straightforward to use where (and manually update), which IIRC is what combine does under the hood.
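Putting this together for the 500+ files mentioned in the question, a hedged sketch of the whole pipeline; read_csv and the file list are my assumptions, since the question doesn't say what format the files are in:

from functools import reduce
import numpy as np
import pandas as pd

paths = ['file1.csv', 'file2.csv']  # hypothetical file list
dfs = [pd.read_csv(p) for p in paths]

m = min(len(df) for df in dfs)  # the shortest file sets the row count
columns = reduce(lambda x, y: y.columns.union(x), dfs, pd.Index([]))
res = pd.DataFrame(index=np.arange(m), columns=columns)
for df in dfs:
    res.update(df)  # fills matching cells without copying frames around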
