Python/Pandas DataFrame with leapfrog assigned columns - python

I would like to ask for help with a problem.
I am revisiting an old issue and decided to work on some improvements.
I am building a DataFrame for later analysis from multiple Excel files.
Each file always contains multiple columns, which are transposed to rows and then added to the DataFrame.
So far so good.
However, I then generate additional columns whose names are based on entries read from Excel.
I am having trouble finding an elegant solution for what is surely a trivial problem.
Example:
dat1 = {'A':[3, 1], 'B':[4, 1]}
df_1 = pd.DataFrame(data = dat1)
this is DataFrame df_1:
A B
0 a3 b4
1 a1 b1
dat2 = {'U':[9,9], 'Y':[2,2]}
df_2 = pd.DataFrame(data = dat2)
this is DataFrame df_2:
U Y
0 u9 y2
1 u9 y2
The desired output is to assign values to the DataFrame by column name for multiple entries (that is, assign one complete DataFrame to another):
dat3 = {'A':[], 'U':[], 'B':[], 'Y':[]}
df_3 = pd.DataFrame(data = dat3)
this is DataFrame df_3:
A U B Y
0 a3 u9 b4 y2
1 a1 u9 b1 y2
At the moment I am experimenting with all the join/merge/concat functions, but none of them can do it by itself.
I could create a new DataFrame or assign by some index, but this seems like overkill. The main column name list is built separately in another function.
Is there a simple way that I am missing?
Many thanks in advance for your time, consideration and help.
Best Regards
Jan

You should use the concat method to concatenate the data frames, as follows:
First, create the data frames:
import pandas as pd
dat1 = {'A':[3, 1], 'B':[4, 1]}
df_1 = pd.DataFrame(data = dat1)
df_1
output:
A B
0 3 4
1 1 1
dat2 = {'U':[9,9], 'Y':[2,2]}
df_2 = pd.DataFrame(data = dat2)
df_2
U Y
0 9 2
1 9 2
Then use the concat method:
df_3 = pd.concat([df_1, df_2], axis=1)
df_3
output:
A B U Y
0 3 4 9 2
1 1 1 9 2
The last step is to rearrange the df_3 columns so that the output matches the one shown in your question:
df_3 = df_3[['A', 'U', 'B', 'Y']]
df_3
output:
A U B Y
0 3 9 4 2
1 1 9 1 2
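If the column order should not be typed by hand, the interleaved order can be derived from the two frames. This is a minimal sketch, assuming both frames have the same number of columns; the name `order` is introduced here only for illustration:

```python
from itertools import chain

import pandas as pd

df_1 = pd.DataFrame({'A': [3, 1], 'B': [4, 1]})
df_2 = pd.DataFrame({'U': [9, 9], 'Y': [2, 2]})

# Pair up the column names (A, U), (B, Y) and flatten them into A, U, B, Y.
order = list(chain.from_iterable(zip(df_1.columns, df_2.columns)))

# Concatenate side by side, then select the columns in the interleaved order.
df_3 = pd.concat([df_1, df_2], axis=1)[order]
print(df_3)
```

Note that zip stops at the shorter input, so frames with different column counts would need extra handling.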

Related

combining two dataframes into one new dataframe in a zig zag/zipper way

I have df1 and df2, and I want to create a new data frame df3 such that the first record of df3 is the first record of df1, the second record of df3 is the first record of df2, and so on in the same manner.
I tried many methods with pandas but didn't find an answer.
Is there any way to achieve this?
You can create a column with an incremental id (even numbers for one frame and odd numbers for the other):
import numpy as np
import pandas as pd
df1['unique_id'] = np.arange(0, df1.shape[0]*2,2)
df2['unique_id'] = np.arange(1, df2.shape[0]*2,2)
and then concatenate them (DataFrame.append was removed in pandas 2.0) and sort by this column:
df3 = pd.concat([df1, df2])
df3 = df3.sort_values(by=['unique_id'])
after which you can drop the column you created:
df3 = df3.drop(columns=['unique_id'])
You could do it this way:
import pandas as pd
df1 = pd.DataFrame({'A':[3,3,4,6], 'B':['a1','b1','c1','d1']})
df2 = pd.DataFrame({'A':[5,4,6,1], 'B':['a2','b2','c2','d2']})
dfff = pd.DataFrame()
for i in range(0,4):
    dfx = pd.concat([df1.iloc[i].T, df2.iloc[i].T])
    dfff = pd.concat([dfff, dfx])
# A stable sort on the index keeps each df1 row ahead of the matching df2 row
print(pd.concat([df1, df2]).sort_index(kind='mergesort'))
Which gives
A B
0 3 a1
0 5 a2
1 3 b1
1 4 b2
2 4 c1
2 6 c2
3 6 d1
3 1 d2
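The same zipper pattern can be written as one chained expression. This is a sketch, assuming both frames share the default 0..n-1 index; resetting the index at the end is an extra step added here so that df3 gets unique row labels:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [3, 3, 4, 6], 'B': ['a1', 'b1', 'c1', 'd1']})
df2 = pd.DataFrame({'A': [5, 4, 6, 1], 'B': ['a2', 'b2', 'c2', 'd2']})

df3 = (
    pd.concat([df1, df2])            # df1 rows first, then df2 rows
      .sort_index(kind='mergesort')  # stable sort: ties keep the df1-then-df2 order
      .reset_index(drop=True)        # replace the duplicated labels with 0..2n-1
)
print(df3)
```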

How to append multiple columns values into a single column without append function?

I have data consisting of 16310 columns x 6000 rows. I want to append all column values into one column. For example:
c1 c2
2 3
5 4
1 2
I want output like this:
c1
2
5
1
3
4
2
I have done this using the append function and it works fine.
acc_y_c = acc_y[0]
for i in range(1, len(acc_y.columns)):
    acc_y_c = acc_y_c.append(acc_y[i])
But it takes too much time because, as I said above, the data consists of 16310 columns x 6000 rows.
Is there any method that takes less time than the one above?
Do you want .melt()?
df = pd.DataFrame({'c1': [2, 5, 1], 'c2': [3, 4, 2]})
df.melt(value_name='c1')
# returns:
variable c1
0 c1 2
1 c1 5
2 c1 1
3 c2 3
4 c2 4
5 c2 2
Timed Example:
data = [[np.random.randint(10) for _ in range(6000)] for _ in range(16310)]
df = pd.DataFrame(data)
Melt:
%%time
df_melt = df.melt()
>>> Wall time: 978 ms
Append:
%%time
acc_y_c = df[0]
for i in range(1, len(df.columns)):
    acc_y_c = acc_y_c.append(df[i])
>>> Wall time: 19min 15s
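If only the stacked values are needed, without the extra "variable" column that melt adds, a column-major flatten of the underlying array is another option. This is a sketch on the small example from the question, assuming all columns share one dtype; the name `stacked` is introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({'c1': [2, 5, 1], 'c2': [3, 4, 2]})

# order='F' walks the array column by column, so all of c1 comes before c2.
stacked = pd.Series(df.to_numpy().ravel(order='F'), name='c1')
print(stacked)
```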

How to rename the rows in dataframe using pandas read (Python)?

I want to rename rows in a Python program (Spyder 3, Python 3.6). At this point I have something like this:
import pandas as pd
data = pd.read_csv(filepath, delim_whitespace = True, header = None)
Before that I wanted to rename my columns:
data.columns = ['A', 'B', 'C']
It gave me something like this:
A B C
0 1 n 1
1 1 H 0
2 2 He 1
3 3 Be 2
But now, I want to rename rows. I want:
A B C
n 1 n 1
H 1 H 0
He 2 He 1
Be 3 Be 2
How can I do it? The main idea is to rename every row created by pd.read_csv with the value in column B. I tried something like this:
for rows in data:
data.rename(index={0:'df.loc(index, 'B')', 1:'one'})
but it's not working.
Any ideas? Maybe just replace the row labels with column B? How?
I think you need set_index with rename_axis:
df1 = df.set_index('B', drop=False).rename_axis(None)
Solution with rename and dictionary:
df1 = df.rename(dict(zip(df.index, df['B'])))
print (dict(zip(df.index, df['B'])))
{0: 'n', 1: 'H', 2: 'He', 3: 'Be'}
If the index is the default RangeIndex, the solution can be:
df1 = df.rename(dict(enumerate(df['B'])))
print (dict(enumerate(df['B'])))
{0: 'n', 1: 'H', 2: 'He', 3: 'Be'}
Output:
print (df1)
A B C
n 1 n 1
H 1 H 0
He 2 He 1
Be 3 Be 2
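For reference, a self-contained sketch of both variants. The frame is built by hand here instead of being read from a file, which is an assumption made only so the snippet runs on its own:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3],
                   'B': ['n', 'H', 'He', 'Be'],
                   'C': [1, 0, 1, 2]})

# Variant 1: use column B as the index, keep the column, drop the index name.
df1 = df.set_index('B', drop=False).rename_axis(None)

# Variant 2: map each old row label to the matching value in column B.
df2 = df.rename(index=dict(zip(df.index, df['B'])))

print(df1)
print(df2)
```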
EDIT:
If you don't want column B, the solution is read_csv with the index_col parameter:
import pandas as pd
from io import StringIO
temp=u"""1 n 1
1 H 0
2 He 1
3 Be 2"""
#after testing replace 'StringIO(temp)' with 'filename.csv'
#pd.compat.StringIO was removed from pandas; sep=r'\s+' replaces the deprecated delim_whitespace=True
df = pd.read_csv(StringIO(temp), sep=r'\s+', header=None, index_col=[1])
print (df)
0 2
1
n 1 1
H 1 0
He 2 1
Be 3 2
I normally rename the rows in my dataset by following these steps.
import pandas as pd
df=pd.read_csv("zzzz.csv")
#in a dataframe it is hard to change the names of our rows so,
df = df.transpose()
#this changes all the rows to columns (transpose returns a new frame, so assign it)
df.columns=["","",.....]
# make sure the length of this list equals the number of columns, i.e. don't skip any names.
#Once you are done renaming them:
df = df.transpose()
#We get our original dataset with changed row names.
Just put the column names into "names" when reading:
import pandas as pd
df = pd.read_csv('filename.csv', names=["colname A", "colname B"])

reshape a pandas dataframe

Suppose a dataframe like this one:
df = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]], columns = ['A', 'B', 'A1', 'B1'])
I would like to have a dataframe which looks like:
What does not work:
new_rows = int(df.shape[1]/2) * df.shape[0]
new_cols = 2
df.values.reshape(new_rows, new_cols, order='F')
Of course I could loop over the data and build a new list of lists, but there must be a better way. Any ideas?
The pd.wide_to_long function is built almost exactly for this situation, where you have many of the same variable prefixes that end in a different digit suffix. The only difference here is that your first set of variables doesn't have a suffix, so you will need to rename your columns first.
The only issue with pd.wide_to_long is that, unlike melt, it must have an identification variable, i. reset_index is used to create this uniquely identifying column, which is dropped later. I think this might get corrected in the future.
df1 = df.rename(columns={'A':'A1', 'B':'B1', 'A1':'A2', 'B1':'B2'}).reset_index()
pd.wide_to_long(df1, stubnames=['A', 'B'], i='index', j='id')\
.reset_index()[['A', 'B', 'id']]
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
You can use lreshape, and numpy.repeat for the id column:
a = [col for col in df.columns if 'A' in col]
b = [col for col in df.columns if 'B' in col]
df1 = pd.lreshape(df, {'A' : a, 'B' : b})
df1['id'] = np.repeat(np.arange(len(df.columns) // 2), len (df.index)) + 1
print (df1)
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
EDIT:
lreshape is currently undocumented, and it is possible it might be removed (along with pd.wide_to_long).
A possible solution is merging all 3 functions into one, maybe melt, but that is not implemented yet. Maybe in some new version of pandas. Then my answer will be updated.
I solved this in 3 steps:
1. Make a new dataframe df2 holding only the data you want to be added to the initial dataframe df.
2. Delete the data from df that will be added below (and that was used to make df2).
3. Append df2 to df.
Like so:
# step 1: create new dataframe (copy so that renaming does not touch df)
df2 = df[['A1', 'B1']].copy()
df2.columns = ['A', 'B']
# step 2: delete that data from original
df = df.drop(columns=["A1", "B1"])
# step 3: append (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df = pd.concat([df, df2], ignore_index=True)
Note that you need to specify ignore_index=True so the appended rows get new index values rather than keeping their old index.
Your end result should be your original dataframe with the data rearranged like you wanted:
In [16]: df
Out[16]:
A B
0 1 2
1 5 6
2 9 10
3 3 4
4 7 8
5 11 12
Use pd.concat() like so:
#Split into separate tables
df_1 = df[['A', 'B']]
df_2 = df[['A1', 'B1']]
df_2.columns = ['A', 'B'] # Make column names line up
# Add the ID column
df_1 = df_1.assign(id=1)
df_2 = df_2.assign(id=2)
# Concatenate
pd.concat([df_1, df_2])
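The reshape attempt from the question can also be made to work directly in NumPy. This is a rough sketch, assuming the columns are ordered group by group (A, B, A1, B1, ...) with exactly two columns per group; the names `blocks`, `stacked` and `out` are introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
                  columns=['A', 'B', 'A1', 'B1'])

n_rows = len(df)
# View the values as (row, group, column-within-group).
blocks = df.to_numpy().reshape(n_rows, -1, 2)
# Move the group axis to the front so each group's rows stay together,
# then flatten back to two columns.
stacked = blocks.transpose(1, 0, 2).reshape(-1, 2)
out = pd.DataFrame(stacked, columns=['A', 'B'])
print(out)
```

The plain reshape with order='F' fails because it interleaves values across all four columns, whereas the transpose keeps each (A, B) pair intact.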

Element-wise average and standard deviation across multiple dataframes

Data:
Multiple dataframes of the same format (same columns, an equal number of rows, and no points missing).
How do I create a "summary" dataframe that contains an element-wise mean for every element? How about a dataframe that contains an element-wise standard deviation?
A B C
0 -1.624722 -1.160731 0.016726
1 -1.565694 0.989333 1.040820
2 -0.484945 0.718596 -0.180779
3 0.388798 -0.997036 1.211787
4 -0.249211 1.604280 -1.100980
5 0.062425 0.925813 -1.810696
6 0.793244 -1.860442 -1.196797
A B C
0 1.016386 1.766780 0.648333
1 -1.101329 -1.021171 0.830281
2 -1.133889 -2.793579 0.839298
3 1.134425 0.611480 -1.482724
4 -0.066601 -2.123353 1.136564
5 -0.167580 -0.991550 0.660508
6 0.528789 -0.483008 1.472787
You can create a panel of your DataFrames and then compute the mean and SD along the items axis (pd.Panel was removed in pandas 0.25, so this runs only on older versions):
df1 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
df3 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
p = pd.Panel({n: df for n, df in enumerate([df1, df2, df3])})
>>> p.mean(axis=0)
A B C
0 -0.024284 -0.622337 0.581292
1 0.186271 0.596634 -0.498755
2 0.084591 -0.760567 -0.334429
3 -0.833688 0.403628 0.013497
4 0.402502 -0.017670 -0.369559
5 0.733305 -1.311827 0.463770
6 -0.941334 0.843020 -1.366963
7 0.134700 0.626846 0.994085
8 -0.783517 0.703030 -1.187082
9 -0.954325 0.514671 -0.370741
>>> p.std(axis=0)
A B C
0 0.196526 1.870115 0.503855
1 0.719534 0.264991 1.232129
2 0.315741 0.773699 1.328869
3 1.169213 1.488852 1.149105
4 1.416236 1.157386 0.414532
5 0.554604 1.022169 1.324711
6 0.178940 1.107710 0.885941
7 1.270448 1.023748 1.102772
8 0.957550 0.355523 1.284814
9 0.582288 0.997909 1.566383
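A Panel-free sketch of the same element-wise computation: stack the frames with pd.concat and group on the shared row index. The seeded generator, the `frames` list and the 7x3 shape are assumptions chosen for illustration; std here is the sample standard deviation (ddof=1), which is the groupby default.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frames = [
    pd.DataFrame(rng.standard_normal((7, 3)), columns=['A', 'B', 'C'])
    for _ in range(3)
]

# Stack the frames vertically; every row label 0..6 now appears once per frame.
stacked = pd.concat(frames)

# Grouping on the row label collects the matching row from every frame.
by_row = stacked.groupby(level=0)
mean_df = by_row.mean()
std_df = by_row.std()  # sample standard deviation (ddof=1)

print(mean_df)
print(std_df)
```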
One simple solution here is to simply concatenate the existing dataframes into a single dataframe while adding an ID variable to track the original source:
dfa = pd.DataFrame( np.random.randn(2,2), columns=['a','b'] ).assign(id='a')
dfb = pd.DataFrame( np.random.randn(2,2), columns=['a','b'] ).assign(id='b')
df = pd.concat([dfa, dfb])
a b id
0 -0.542652 1.609213 a
1 -0.192136 0.458564 a
0 -0.231949 -0.000573 b
1 0.245715 -0.083786 b
So now you have two 2x2 dataframes combined into a single dataframe with four rows. The 'id' column identifies the source dataframe, so you haven't lost any generality and can select on 'id' to do the same thing you would with any single dataframe, e.g. df[ df['id'] == 'a' ].
But now you can also use groupby on the row index to apply any pandas method such as mean() or std() on an element-by-element basis:
df.reset_index().groupby('index').mean(numeric_only=True)
a b
index
0 0.198164 -0.811475
1 0.639529 0.812810
The following solution worked for me.
average_data_frame = (dataframe1 + dataframe2 ) / 2
Or, if you have more than two dataframes, say n of them collected in a list (called dataframes here), then
average_data_frame = dataframes[0]
for i in range(1, n):
    average_data_frame = average_data_frame + dataframes[i]
average_data_frame = average_data_frame / n
Once you have the average, you can go on to the standard deviation. If you are looking for a "true Pythonic" approach, you should follow the other answers. But if you are looking for a quick, working solution, this is it.
