Pandas, Python - merging columns with same key, but with different values

From my for-loop, the resulting lists are as follows:
# These lists are plain Python lists, already ordered/structured.
key = [1234, 2345, 2223, 6578, 9976]
index0 = [1, 4, 6, 3, 4, 5, 6, 2, 1]
index1 = [4, 3, 2, 1, 6, 8, 5, 3, 1]
index2 = [9, 4, 6, 4, 3, 2, 1, 4, 1]
How do I merge them all into a table with pandas? Below is the expectation.
key | index0 | index1 | index2
1234 | 1 | 4 | 9
2345 | 4 | 3 | 4
... | ... | ... | ...
9976 | 1 | 1 | 1
I tried using pandas, but ran into an error about the data type. I then set the dtype to int64 and to int32, but still hit the same data-type error.
And an optional question: should I have assembled a table from similar list data with SQL instead? I am just learning SQL with MySQL and wonder whether it would have been more convenient than pandas for record keeping and persistent storage.

Just wrap each list in a pd.Series and pass them to pd.DataFrame in a dict. Unlike plain lists, Series of different lengths are aligned on their index rather than raising a length error, so the shorter key column is simply padded with NaN:
dct = {
    'key': pd.Series(key),
    'index0': pd.Series(index0),
    'index1': pd.Series(index1),
    'index2': pd.Series(index2),
}
df = pd.DataFrame(dct)
Output:
>>> df
key index0 index1 index2
0 1234.0 1 4 9
1 2345.0 4 3 4
2 2223.0 6 2 6
3 6578.0 3 1 4
4 9976.0 4 6 3
5 NaN 5 8 2
6 NaN 6 5 1
7 NaN 2 3 4
8 NaN 1 1 1
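Note that the key column comes out as float64 because NaN forces an upcast. If you want to keep integers, one option is pandas' nullable integer dtype; a minimal sketch, assuming pandas >= 0.24 (where the Int64 extension dtype was introduced):
dct = {
    'key': pd.Series(key, dtype='Int64'),  # nullable integers: missing values show as <NA>
    'index0': pd.Series(index0),
    'index1': pd.Series(index1),
    'index2': pd.Series(index2),
}
df = pd.DataFrame(dct)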

Here is another way:
First load data into a dictionary:
d = dict(key=[1234, 2345, 2223, 6578, 9976],
         index0=[1, 4, 6, 3, 4, 5, 6, 2, 1],
         index1=[4, 3, 2, 1, 6, 8, 5, 3, 1],
         index2=[9, 4, 6, 4, 3, 2, 1, 4, 1])
Then convert to a df:
df = pd.DataFrame({k: pd.Series(v) for k, v in d.items()})
Output:
key index0 index1 index2
0 1234.0 1 4 9
1 2345.0 4 3 4
2 2223.0 6 2 6
3 6578.0 3 1 4
4 9976.0 4 6 3
5 NaN 5 8 2
6 NaN 6 5 1
7 NaN 2 3 4
8 NaN 1 1 1
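A related one-liner sketch, assuming the same dict d: pd.DataFrame.from_dict with orient='index' also pads unequal-length lists with NaN, so transposing gives the same table:
df = pd.DataFrame.from_dict(d, orient='index').transpose()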

Related

Why does pd.rolling and .apply() return multiple outputs from a function returning a single value?

I'm trying to create a rolling function that:
Divides two DataFrames, each with 3 columns.
Calculates the mean of each row of the output from step 1.
Sums the averages from step 2.
This could be done with df.iterrows(), i.e. looping through each row, but that would be inefficient on larger datasets. My objective is therefore to create a pd.rolling function that does this much faster.
What I need help with is understanding why my approach below returns multiple values while the function I'm using only returns a single value.
EDIT: I have updated the question with the code that produces my desired output.
This is the test dataset I'm working with:
#import libraries
import pandas as pd
import numpy as np
#create two dataframes
values = {'column1': [7,2,3,1,3,2,5,3,2,4,6,8,1,3,7,3,7,2,6,3,8],
          'column2': [1,5,2,4,1,5,5,3,1,5,3,5,8,1,6,4,2,3,9,1,4],
          'column3': [3,6,3,9,7,1,2,3,7,5,4,1,4,2,9,6,5,1,4,1,3]}
df1 = pd.DataFrame(values)
df2 = pd.DataFrame([[2,3,4],[3,4,1],[3,6,1]])
print(df1)
print(df2)
column1 column2 column3
0 7 1 3
1 2 5 6
2 3 2 3
3 1 4 9
4 3 1 7
5 2 5 1
6 5 5 2
7 3 3 3
8 2 1 7
9 4 5 5
10 6 3 4
11 8 5 1
12 1 8 4
13 3 1 2
14 7 6 9
15 3 4 6
16 7 2 5
17 2 3 1
18 6 9 4
19 3 1 1
20 8 4 3
0 1 2
0 2 3 4
1 3 4 1
2 3 6 1
One method to achieve my desired output, looping through each row:
RunningSum = []
for index, rows in df1.iterrows():
    if index > 3:
        Div = abs(((df2 / df1.iloc[index-3+1:index+1].reset_index(drop=True).values) - 1) * 100)
        Average = Div.mean(axis=0)
        SumOfAverages = np.sum(Average)
        RunningSum.append(SumOfAverages)

# printing my desired output values
print(RunningSum)
[330.42328042328046,
212.0899470899471,
152.06349206349208,
205.55555555555554,
311.9047619047619,
209.1269841269841,
197.61904761904765,
116.94444444444444,
149.72222222222223,
430.0,
219.51058201058203,
215.34391534391537,
199.15343915343914,
159.6031746031746,
127.6984126984127,
326.85185185185185,
204.16666666666669]
However, this would be time-consuming on large datasets. Therefore, I've tried to create a function that can be applied to a pd.rolling() object.
def SumOfAverageFunction(vals):
    Div = df2 / vals.reset_index(drop=True)
    Average = Div.mean(axis=0)
    SumOfAverages = np.sum(Average)
    return SumOfAverages

RunningSum = df1.rolling(window=3, axis=0).apply(SumOfAverageFunction)
The problem here is that my function returns multiple outputs. How can I solve this?
print(RunningSum)
column1 column2 column3
0 NaN NaN NaN
1 NaN NaN NaN
2 3.214286 4.533333 2.277778
3 4.777778 3.200000 2.111111
4 5.888889 4.416667 1.656085
5 5.111111 5.400000 2.915344
6 3.455556 3.933333 5.714286
7 2.866667 2.066667 5.500000
8 2.977778 3.977778 3.063492
9 3.555556 5.622222 1.907937
10 2.750000 4.200000 1.747619
11 1.638889 2.377778 3.616667
12 2.986111 2.005556 5.500000
13 5.333333 3.075000 4.750000
14 4.396825 5.000000 3.055556
15 2.174603 3.888889 2.148148
16 2.111111 2.527778 1.418519
17 2.507937 3.500000 3.311111
18 2.880952 3.000000 5.366667
19 2.722222 3.370370 5.750000
20 2.138889 5.129630 5.666667
After reordering the operations, your calculation can be simplified:
BASE = df2.sum(axis=0) / 3
BASE_series = pd.Series({k: v for k, v in zip(df1.columns, BASE)})
result = df1.rdiv(BASE_series, axis=1).sum(axis=1)
print(np.around(result[4:], 3))
Outputs:
4 5.508
5 4.200
6 2.400
7 3.000
...
If you don't want to calculate anything before index 4, then change df1 to:
df1.iloc[4:].rdiv(...
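As to why the rolling version returns multiple outputs: DataFrame.rolling(...).apply(...) calls the function once per column, passing each column's window as a separate Series, so you get one value per column instead of one value per window. If you need the original window-wise computation over all columns at once, here is a vectorized sketch using NumPy's sliding_window_view (an assumption: NumPy >= 1.20, with df1/df2 as defined in the question):
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# all 3-row windows of df1, shaped (n_windows, 3 rows, 3 columns)
windows = sliding_window_view(df1.values, window_shape=3, axis=0).transpose(0, 2, 1)
div = np.abs((df2.values / windows - 1) * 100)  # broadcast df2 over every window
out = div.mean(axis=1).sum(axis=1)              # row means per window, then column sum
print(out[2:])  # windows ending at index 4 onward, matching the loop output above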

Change a pandas dataframe shape and flatten it out

I have a data source (a CSV file) which is in this shape.
Sample raw data is as follows:
id stage D1 D2 D3 D4 D5 D6
1 base A
1 s1 2 2 4 5
1 s2 3 3 6 7
2 base AA
2 s1 5 3 4 3
2 s2 3 3 2 4
2 s3 2 2 3 6
3 base B
3 s1 4 4 4 5
4 base BC
The first column is an ID, and all rows with the same ID belong to the same experiment.
When I read it into pandas, I need to flatten it into this shape:
id stage D1 D2 D3_s1 D4_s1 D5_s1 D6_s1 D3_s2 D4_s2 D5_s2 D6_s2 D3_s3 D4_s3 D5_s3 D6_s3
1 base A 2 2 4 5 3 3 6 7
2 base AA 5 3 4 3 3 3 2 4 2 2 3 6
3 base B 4 4 4 5
4 base BC
What is the best way to do this in Python?
As a C/C++ programmer, I started using several loops to go over each cell and build a new dataframe with the required shape (still not successful!).
I believe there should be a better way than iterating over all rows and columns.
My questions:
What is the best way to do this in Python?
How can I detect that D2 is blank so I can drop it?
Assuming you already read the data into a DataFrame:
Split it into 2 dataframes: base (containing rows with stage = base) and other
Unstack the second dataframe and change the column names
Recombine the two
The code
is_base = df['stage'] == 'base'
base = df.loc[is_base, 'id':'D2'].set_index('id')
other = df.loc[~is_base, ['id','stage','D3','D4','D5','D6']].set_index(['id', 'stage'])
other = other.unstack()
other.columns = other.columns.get_level_values(0) + '_' + other.columns.get_level_values(1)
# Reset index if needed
final = pd.merge(base, other, left_index=True, right_index=True)
As you're a C++ programmer, you'll be happy to know that a lot of the core routines in pandas are actually written in C and Cython for performance reasons.
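An alternative sketch for the unstack step using pivot; an assumption here is pandas >= 1.1, where pivot accepts a list of value columns:
other = (df[df['stage'] != 'base']
         .pivot(index='id', columns='stage', values=['D3', 'D4', 'D5', 'D6']))
other.columns = [f'{d}_{s}' for d, s in other.columns]  # flatten the MultiIndex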
We can use two filters and unstack a MultiIndex.
s = df1[df1['stage'].ne('base')]
s1 = s.set_index(['id','stage']).stack().unstack([-1,-2])
s1.columns = [f'{x}_{y}' for x,y in s1.columns]
# to match your output we flatten the multi index.
print(s1)
D1_s1 D2_s1 D3_s1 D4_s1 D1_s2 D2_s2 D3_s2 D4_s2 D1_s3 D2_s3 D3_s3 D4_s3
id
1 2 2 4 5 3 3 6 7 NaN NaN NaN NaN
2 5 3 4 3 3 3 2 4 2 2 3 6
3 4 4 4 5 NaN NaN NaN NaN NaN NaN NaN NaN
Then we filter on the base rows and join on the id column.
df2 = df1.loc[df1['stage'].eq('base'), ['id','stage','D1','D2']].set_index('id').join(s1)
As for dropping D2 if it's blank, a simple if will do:
if df2['D2'].isna().all():
    df2 = df2.drop(columns='D2')
print(df2)
stage D1 D1_s1 D2_s1 D3_s1 D4_s1 D1_s2 D2_s2 D3_s2 D4_s2 D1_s3 D2_s3 \
id
1 base A 2 2 4 5 3 3 6 7 NaN NaN
2 base AA 5 3 4 3 3 3 2 4 2 2
3 base B 4 4 4 5 NaN NaN NaN NaN NaN NaN
4 base BC NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
D3_s3 D4_s3
id
1 NaN NaN
2 3 6
3 NaN NaN
4 NaN NaN
You could turn it into a NumPy array, then flatten and reshape it, like this:
data = pd.read_csv('your_file.csv').values  # placeholder file name
data = data.flatten()
data = data.reshape(new_shape)  # placeholder shape; note reshape returns a new array
Be aware that this only rearranges the raw values and ignores the grouping by id, so the unstack-based answers above are safer for this problem.

Compute difference between values in dataframe column

I have this dataframe:
a b c d
4 7 5 12
3 8 2 8
1 9 3 5
9 2 6 4
I want column 'd' to become the difference between the n-th value of column 'a' and the (n+1)-th value of column 'a'.
I tried this, but it doesn't run:
for i in data.index-1:
    data.iloc[i]['d'] = data.iloc[i]['a'] - data.iloc[i+1]['a']
Can anyone help me?
Basically what you want is diff.
df = pd.DataFrame.from_dict({"a":[4,3,1,9]})
df["d"] = df["a"].diff(periods=-1)
print(df)
Output
a d
0 4 1.0
1 3 2.0
2 1 -8.0
3 9 NaN
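For reference, an equivalent sketch with shift, which makes the row alignment explicit (same df as above):
df["d"] = df["a"] - df["a"].shift(-1)  # current row minus the following row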
Let's try a simple way:
import numpy as np

df = pd.DataFrame.from_dict({'a': [2, 4, 8, 15]})
diff = []
for i in range(len(df) - 1):
    diff.append(df['a'][i+1] - df['a'][i])  # note: a[n+1] - a[n], the opposite sign of diff(periods=-1)
diff.append(np.nan)
df['d'] = diff
print(df)
a d
0 2 2.0
1 4 4.0
2 8 7.0
3 15 NaN

Adding rows in dataframe based on values of another dataframe

I have the following two dataframes. Please note that 'amt' is grouped by 'id' in both dataframes.
df1
id code amt
0 A 1 5
1 A 2 5
2 B 3 10
3 C 4 6
4 D 5 8
5 E 6 11
df2
id code amt
0 B 1 9
1 C 12 10
I want to add a row to df2 for every id of df1 not contained in df2. For example, as ids A, D and E are not contained in df2, I want to add a row for each of them. The appended row should contain the id missing from df2, a null value for the code attribute, and the amt value stored in df1.
The result should be something like this:
id code name
0 B 1 9
1 C 12 10
2 A nan 5
3 D nan 8
4 E nan 11
I would highly appreciate any guidance on this.
By using pd.concat:
df = df1.drop(columns='code').drop_duplicates()
df[~df.id.isin(df2.id)]
pd.concat([df2, df[~df.id.isin(df2.id)]], axis=0).rename(columns={'amt': 'name'}).reset_index(drop=True)
Out[481]:
name code id
0 9 1.0 B
1 10 12.0 C
2 5 NaN A
3 8 NaN D
4 11 NaN E
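The columns came back in a different order because older pandas sorted columns during concat. A small sketch to make the order explicit, assuming the concatenated frame is assigned to result:
result = pd.concat([df2, df[~df.id.isin(df2.id)]], axis=0, sort=False).rename(columns={'amt': 'name'}).reset_index(drop=True)
result = result[['id', 'code', 'name']]  # enforce the desired column order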
Drop duplicates from df1, append df2, drop more duplicates, then append again:
df2.append(
    df1.drop_duplicates('id').append(df2)
       .drop_duplicates('id', keep=False).assign(code=np.nan),
    ignore_index=True
)
id code amt
0 B 1.0 9
1 C 12.0 10
2 A NaN 5
3 D NaN 8
4 E NaN 11
A slight variation:
m = ~np.in1d(df1.id.values, df2.id.values)
d = ~df1.duplicated('id').values
df2.append(df1[m & d].assign(code=np.nan), ignore_index=True)
id code amt
0 B 1.0 9
1 C 12.0 10
2 A NaN 5
3 D NaN 8
4 E NaN 11
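Both answers above use DataFrame.append, which was removed in pandas 2.0. A sketch of the same idea with pd.concat, assuming the df1/df2 from the question:
import numpy as np
import pandas as pd

missing = df1.drop_duplicates('id')
missing = missing[~missing['id'].isin(df2['id'])].assign(code=np.nan)
result = pd.concat([df2, missing], ignore_index=True)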

How to remove an ugly row in a pandas DataFrame

So I am filling dataframes from 2 different files. While those 2 files should have the same structure (the values should be different, though), the resulting dataframes look different. When printing them I get:
a b c d
0 70402.14 70370.602112 0.533332 98
1 31362.21 31085.682726 1.912552 301
... ... ... ... ...
753919 64527.16 64510.008206 0.255541 71
753920 58077.61 58030.943621 0.835758 152
a b c d
index
0 118535.32 118480.657338 0.280282 47
1 49536.10 49372.999416 0.429902 86
... ... ... ... ...
753970 52112.95 52104.717927 0.356051 116
753971 37044.40 36915.264944 0.597472 165
So in the second dataframe there is that "index" row that doesn't make any sense to me, and it causes trouble in my subsequent code. I neither wrote the code that fills the files into the dataframes nor created those files. So I am mainly interested in checking whether such a row exists and how I might remove it. Does anyone have an idea about this?
The second dataframe has an index level named "index".
You can remove the name with
df.index.name = None
For example,
In [126]: df = pd.DataFrame(np.arange(15).reshape(5,3))
In [128]: df.index.name = 'index'
In [129]: df
Out[129]:
0 1 2
index
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
In [130]: df.index.name = None
In [131]: df
Out[131]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
The dataframe might have picked up the name "index" if you used reset_index and set_index like this:
In [138]: df.reset_index()
Out[138]:
index 0 1 2
0 0 0 1 2
1 1 3 4 5
2 2 6 7 8
3 3 9 10 11
4 4 12 13 14
In [140]: df.reset_index().set_index('index')
Out[140]:
0 1 2
index
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
Index is just the first column - it numbers the rows by default, but you can change it in a number of ways (e.g. filling it with values from one of the columns).
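A non-mutating sketch of the same fix; an assumption here is pandas >= 0.24, where rename_axis(None) clears the axis name:
df = df.rename_axis(None)  # returns a copy with the index name removed
Or, to check for the stray name first, as the question asks:
if df.index.name is not None:
    df.index.name = None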
