I have a DataFrame composed of 3760 rows. I want to split it into 10 parts of equal length and then use each part as a column of a new DataFrame.
A way that I found to do this is:
alfa = np.array_split(dff, 10)
caa = pd.concat([alfa[0].reset_index(drop=True), alfa[1].reset_index(drop=True), alfa[2].reset_index(drop=True), alfa[3].reset_index(drop=True),
alfa[4].reset_index(drop=True), alfa[5].reset_index(drop=True), alfa[6].reset_index(drop=True), alfa[7].reset_index(drop=True),
alfa[8].reset_index(drop=True), alfa[9].reset_index(drop=True)], axis=1)
Not very cool, not very efficient.
Then I tried
teta = pd.concat(np.array_split(dff, 10), axis=1, ignore_index=True)
But it doesn't work as I wanted: the result still has 3760 rows and is mostly NaN, because each piece keeps its original row labels and concat aligns on them.
I assume that is because ignore_index works on axis 1 (the column labels), not on the row index.
Is there a better way to do it?
You can use a list comprehension to concat your columns. This code assumes your column is named init_col:
chunks = 10
cols = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
parts = np.array_split(dff, chunks)
out = pd.concat(
    [parts[i]
     .reset_index(drop=True)
     .rename(columns={"init_col": cols[i]})
     for i in range(chunks)],
    axis=1
)
It seems the original DataFrame is really just a single column? In that case, perhaps you could use numpy.reshape:
new_df = pd.DataFrame(dff.to_numpy().reshape(10,-1).T, columns=dff.columns.tolist()*10)
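For what it's worth, a quick check on dummy data (assuming a single column named init_col, as in the answer above) confirms the reshape gives the expected 376 x 10 shape, with the first 376 rows becoming the first column:
import numpy as np
import pandas as pd

dff = pd.DataFrame({'init_col': np.arange(3760)})  # stand-in single-column frame
new_df = pd.DataFrame(dff.to_numpy().reshape(10, -1).T,
                      columns=dff.columns.tolist() * 10)
print(new_df.shape)  # (376, 10)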
I have a question regarding using one dataframe (called pricing), which has 30 rows, and another called nop, which has 1000 rows.
The pricing dataframe has a column called Price whose data I am hoping to copy over to the nop dataframe, and I want every row in the nop dataframe populated with the Price value from the pricing dataframe. They are matched by the Id column.
I was thinking of doing a for loop within a for loop but I am thinking that there might be an easier way to do this.
nop['Price'] = ''
for i in range(len(nop)):
    for j in range(len(pricing)):
        if nop['Id'][i] == pricing['Id'][j]:
            nop['Price'][i] = pricing['Price'][j]
Thanks!
You could use the join method to join the two dataframes on a common key, which is the Id column in your case. To demonstrate, I have created two dataframes with id as the common key and performed the join as in the code below.
import pandas as pd
pricing_df = pd.DataFrame({'id':[1,2,3,4,5], 'price':[10,20,40,21,99]})
nop = pd.DataFrame({'id':[3,4,1,2,5,1,2,2,5,3,1,4,4,1,4],
'product': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O']})
pricing_df
nop
Calling the join method on the joining key yields the results below.
nop.join(pricing_df.set_index('id'), on='id').sort_values('id')
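The same lookup can also be written without an explicit join call: map against a Series indexed by id, or do a left merge. A minimal sketch using the demo frames above:
# look up each id in a price-by-id Series
nop['price'] = nop['id'].map(pricing_df.set_index('id')['price'])

# equivalently, a left merge expresses the same idea:
# nop = nop.merge(pricing_df, on='id', how='left')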
I have just started learning pandas and this is really my first question here, so please don't mind if it's too basic!
When should I use df.index.name and when df.index.names?
I would really appreciate knowing the difference and their applications.
Many Thanks
name returns the name of the Index or MultiIndex.
names returns a list of the level names of an Index (just one level) or a MultiIndex (more than one level).
Index:
df1 = pd.DataFrame(columns=['a', 'b', 'c']).set_index('a')
print(df1.index.name, df1.index.names)
# a ['a']
MultiIndex:
df2 = pd.DataFrame(columns=['a', 'b', 'c']).set_index(['a', 'b'])
print(df2.index.name, df2.index.names)
# None ['a', 'b']
df2.index.name = 'my_multiindex'
print(df2.index.name, df2.index.names)
# my_multiindex ['a', 'b']
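If you later want to rename the levels, assigning to names (or using rename_axis) works the same way for an Index and a MultiIndex; a small sketch continuing the example above:
df2.index.names = ['first', 'second']
print(df2.index.names)
# ['first', 'second']

# rename_axis returns a renamed copy and accepts a single name or a list of names
df2 = df2.rename_axis(index=['x', 'y'])
print(df2.index.names)
# ['x', 'y']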
I have an issue writing to Excel after merging.
Outfile1 = r'k:\dir1\outfile1.xlsx'
DF0 = ...  # DataFrame with columns ['A', 'B', 'C']
DF1 = ...  # DataFrame with columns ['A', 'B', 'D']
DF2 = DF0.merge(DF1, on=['A', 'B'])
DF2.to_excel(Outfile1, engine='xlsxwriter')
Excel file has the following columns:
'A' 'B' 'A' 'B' 'C' 'D'; the second 'A' & 'B' are blank.
What am I doing wrong? I only want 'A', 'B', 'C', 'D' in the spreadsheet.
This should do it:
import pandas as pd

# data
data = {'letters': ['a', 'e', 'c', 'g', 'h', 'b']}
data1 = {'letters': ['a', 'd', 'b', 'e', 'f']}

# data to DataFrames
df0 = pd.DataFrame(data)
df1 = pd.DataFrame(data1)

# merge
df2 = df0.merge(df1, how='outer')
Now they are merged without duplicates, but out of order. Use sort_values to correct this:
df2 = df2.sort_values(['letters'])
print(df2)
Conclusion: after a merge, it is recommended to reset the index; otherwise reindex can sometimes create extra columns in the output.
Step 1: dataset
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
df2 = pd.DataFrame([[7, 7, 7], [1, 2, 8]], columns=['A', 'B', 'D'])
df_merge = df1.merge(df2, on=['A', 'B'])
Step 2: reset_index before reordering the dataframe through reindex.
my_col_order = ['D', 'B', 'C']
df_merge.reset_index(inplace=True)  # after a merge it is recommended to reset the index,
# otherwise reindex can sometimes create extra columns in the output
df_5 = df_merge.reindex(my_col_order, axis='columns')
I have the following code which I would like to speed up.
EDIT: we would like the columns in 'colsi' to be shifted within each group defined by the columns in 'colsj'. Pandas lets us shift multiple columns at once by vectorizing over 'colsi', so I loop through each group column and perform the vectorized shifts, then fill the NAs with the medians of the columns in 'colsi'. The reindex is just to create new blank columns before they are assigned. The issue is that I have many group columns and looping through each one is becoming time consuming.
EDIT2: My goal is to engineer new columns from the lag within each group. I have many group columns and many columns to be shifted: 'colsi' contains the columns to be shifted and 'colsj' contains the group columns. I am able to vectorize over 'colsi', but looping through each group column in 'colsj' is still time consuming.
colsi = ['a', 'b', 'c']
colsj = ['d', 'e', 'f']
med = df[colsi].median()
for j in colsj:
    newcols = [j + i + '_n' for i in colsi]
    newmed = med.copy()
    newmed.index = newcols
    df = df.reindex(columns=df.columns.tolist() + newcols)
    df[newcols] = df.groupby(j)[colsi].shift()
    df[newcols] = df[newcols].fillna(newmed)
Parallelization seems to be a good way to do it. Leaning on this code, I attempted the following but it didn't work:
from multiprocessing.pool import ThreadPool

pool = ThreadPool(processes=3)
colsi = ['a', 'b', 'c']
colsj = ['d', 'e', 'f']
med = df[colsi].median()

def funct(j):
    newcols = [j + i + '_n' for i in colsi]
    newmed = med.copy()
    newmed.index = newcols
    df = df.reindex(columns=df.columns.tolist() + newcols)
    df[newcols] = df.groupby(j)[colsi].shift()
    df[newcols] = df[newcols].fillna(newmed)

for j in colsj:
    pool.apply_async(funct, (j))
I do not have any knowledge of how to go about parallel processing, so I am not sure what's missing here. Please advise.
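For reference, one way the attempt above could be restructured (a sketch only, assuming df, colsi and colsj as in the question): have the worker build and return the shifted columns instead of assigning to df, since that assignment only rebinds a local name inside the function, and attach everything in the main thread. Note that threads may not speed up CPU-bound pandas work much because of the GIL; collecting the results and concatenating once is the more reliable win.
from multiprocessing.pool import ThreadPool
import pandas as pd

colsi = ['a', 'b', 'c']
colsj = ['d', 'e', 'f']
med = df[colsi].median()

def funct(j):
    # build and return the new columns rather than mutating df here
    newcols = [j + i + '_n' for i in colsi]
    shifted = df.groupby(j)[colsi].shift()
    shifted.columns = newcols
    return shifted.fillna(med.set_axis(newcols))

with ThreadPool(processes=3) as pool:
    results = pool.map(funct, colsj)

df = pd.concat([df] + results, axis=1)  # attach all new columns in one pass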
I am using Pandas to select columns from a dataframe, olddf. Let's say the variable names are 'a', 'b', 'c', 'startswith1', 'startswith2', 'startswith3', ..., 'startswith10'.
My approach was to create a list of all variables with a common starting value.
filter_col = [col for col in list(olddf) if col.startswith('startswith')]
I'd like to then select columns within that list as well as others, by name, so I don't have to type them all out. However, this doesn't work:
newdf = olddf['a','b',filter_col]
And this doesn't either:
newdf = olddf[['a','b'],filter_col]
I'm a newbie so this is probably pretty simple. Is the reason this doesn't work because I'm mixing a list improperly?
Thanks.
Use
newdf = olddf[['a','b']+filter_col]
since adding lists concatenates them:
In [264]: ['a', 'b'] + ['startswith1']
Out[264]: ['a', 'b', 'startswith1']
Alternatively, you could use the filter method:
newdf = olddf.filter(regex=r'^(startswith|[ab])')
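A quick sanity check of the filter approach on a toy frame with hypothetical column names matching the question:
import pandas as pd

olddf = pd.DataFrame(columns=['a', 'b', 'c', 'startswith1', 'startswith2'])
print(list(olddf.filter(regex=r'^(startswith|[ab])').columns))
# ['a', 'b', 'startswith1', 'startswith2']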