How can I merge a split array into a new DataFrame? - python

I have a DataFrame composed of 3760 rows. I want to split it into 10 parts of equal length and then use each part as a column of a new DataFrame.
A way that I found to do this is:
alfa = np.array_split(dff, 10)
caa = pd.concat([alfa[0].reset_index(drop=True), alfa[1].reset_index(drop=True),
                 alfa[2].reset_index(drop=True), alfa[3].reset_index(drop=True),
                 alfa[4].reset_index(drop=True), alfa[5].reset_index(drop=True),
                 alfa[6].reset_index(drop=True), alfa[7].reset_index(drop=True),
                 alfa[8].reset_index(drop=True), alfa[9].reset_index(drop=True)],
                axis=1)
Not very cool, not very efficient.
Then I tried
teta = pd.concat(np.array_split(dff, 10), axis=1, ignore_index=True)
But it doesn't work as I wanted, since it gives me this:
I assume that is because ignore_index works along axis 1 (the columns) rather than on the row index.
Is there a better way to do it?

You can use a list comprehension to concat your columns. This code assumes your original column name is init_col:
chunks = 10
cols = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
parts = np.array_split(dff, chunks)   # split once, rather than on every loop iteration
out = pd.concat(
    [parts[i]
         .reset_index(drop=True)
         .rename(columns={"init_col": cols[i]})
     for i in range(chunks)],
    axis=1
)

It seems the original DataFrame is essentially just a single column (an array)? In that case, perhaps you could use numpy.reshape:
new_df = pd.DataFrame(dff.to_numpy().reshape(10,-1).T, columns=dff.columns.tolist()*10)
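For example, a quick sketch with a made-up 12-row single-column frame (the column name x and the sizes are just for illustration), split into 3 columns the same way reshape(10, -1) does above:
import pandas as pd

dff = pd.DataFrame({'x': range(12)})   # hypothetical single-column frame
new_df = pd.DataFrame(dff.to_numpy().reshape(3, -1).T,
                      columns=dff.columns.tolist() * 3)
# the first 4 rows become column 0, the next 4 become column 1, and so on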

Related

How do I match data from 2 dataframes with different numbers of rows and input values from one into the other in Jupyter?

I have a question regarding using one dataframe (called pricing), which has 30 rows, and another called nop, which has 1000 rows.
The pricing dataframe has a column called Price with data that I am hoping to copy over to the nop dataframe, and I want all rows of the nop dataframe populated with the Price value from the pricing dataframe. They are matched by the column Id.
I was thinking of doing a for loop within a for loop but I am thinking that there might be an easier way to do this.
nop['Price'] = ''
for i in nop:
    for j in pricing:
        if nop['Id'][i] == pricing['Id'][j]:
            nop['Price'][i] = pricing['Price'][j]
        else:
            j+1
Thanks!
You could use the join method to join the two dataframes on a common key, which is the id column in your case. To demonstrate, I have created two dataframes with id as the common key and performed the join as in the code below:
import pandas as pd
pricing_df = pd.DataFrame({'id':[1,2,3,4,5], 'price':[10,20,40,21,99]})
nop = pd.DataFrame({'id':[3,4,1,2,5,1,2,2,5,3,1,4,4,1,4],
'product': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O']})
pricing_df
nop
Calling the join method on the joining key yields the results below:
nop.join(pricing_df.set_index('id'), on='id').sort_values('id')
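An alternative sketch, assuming the same two frames as above: look up each id's price with Series.map, which avoids the join entirely.
nop['price'] = nop['id'].map(pricing_df.set_index('id')['price'])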

Difference between index.name and index.names in pandas

I have just started learning pandas and this is really my first question here, so please don't mind if it's too basic!
When should I use df.index.name and when df.index.names?
I would really appreciate knowing the difference and their application.
Many Thanks
name returns the name of the Index or MultiIndex.
names returns a list of the level names of an Index (just one level) or a MultiIndex (more than one level).
Index:
df1 = pd.DataFrame(columns=['a', 'b', 'c']).set_index('a')
print(df1.index.name, df1.index.names)
# a ['a']
MultiIndex:
df2 = pd.DataFrame(columns=['a', 'b', 'c']).set_index(['a', 'b'])
print(df2.index.name, df2.index.names)
# None ['a', 'b']
df2.index.name = 'my_multiindex'
print(df2.index.name, df2.index.names)
# my_multiindex ['a', 'b']
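As a small follow-up sketch (continuing the df1/df2 examples above, with made-up level names), names is also writable, with one entry per level:
df1.index.names = ['alpha']            # plain Index: one level name
df2.index.names = ['alpha', 'beta']    # MultiIndex: one name per level
print(df1.index.name, df2.index.names)
# alpha ['alpha', 'beta']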

Excel shows duplicate merge columns

I have an issue writing to Excel after merging.
Outfile1 = r'k:\dir1\outfile1.xlsx'
DF0 = ['A', 'B', 'C']
DF1 = ['A', 'B', 'D']
DF2 = DF0.merge(DF1, on=['A', 'B'])
DF2.to_excel(Outfile1, engine='xlsxwriter')
The Excel file has the following columns:
'A', 'B', 'A', 'B', 'C', 'D'; the second 'A' and 'B' are blank.
What am I doing wrong? I only want 'A', 'B', 'C', 'D' in the spreadsheet.
This should do it:
import pandas as pd
# data
data = {'letters': ['a', 'e', 'c', 'g', 'h', 'b']}
data1 = {'letters': ['a', 'd', 'b', 'e', 'f']}
# data to DataFrames
df0 = pd.DataFrame(data)
df1 = pd.DataFrame(data1)
# merge
df2 = df0.merge(df1, how='outer')
Now they are merged without duplicates, but out of order. Use sort_values to correct this:
df2 = df2.sort_values(['letters'])
print(df2)
Conclusion: after a merge, it is recommended to reset the index; otherwise reindex sometimes creates extra columns in the output.
Step 1: dataset
df1=pd.DataFrame([[1,2,3],[4,5,6]],columns=['A','B','C'])
df2=pd.DataFrame([[7,7,7],[1,2,8]],columns=['A','B','D'])
df_merge=df1.merge(df2,on=['A','B'])
Step 2: reset_index before reordering the dataframe through reindex.
my_col_order = ['D', 'B', 'C']
df_merge.reset_index(inplace=True)  # after a merge, reset the index; otherwise reindex can create extra columns in the output
df_5 = df_merge.reindex(my_col_order, axis='columns')
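For reference, a minimal sketch of the intended merge with actual DataFrames (the data is made up); writing with index=False keeps the index out of the sheet, so only A, B, C and D appear:
import pandas as pd

df0 = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'D': [7, 8]})
df2 = df0.merge(df1, on=['A', 'B'])
df2.to_excel('outfile1.xlsx', engine='xlsxwriter', index=False)
# the sheet contains exactly the columns A, B, C, D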

Parallel Processing of Loop of Pandas Columns

I have the following code which I would like to speed up.
EDIT: we would like the columns in 'colsi' to be shifted by the group columns in 'colsj'. Pandas allows us to shift multiple columns at once through vectorization of 'colsi'. I loop through each group column and perform the vectorized shifts. Then I fill the NAs with the medians of the columns in 'colsi'. The reindex is just to create new blank columns before they are assigned. The issue is that I have many group columns and looping through each is becoming time-consuming.
EDIT2: My goal is to engineer new columns by the lag of each group. I have many group columns and many columns to be shifted. 'colsi' contains the columns to be shifted. 'colsj' contains the group columns. I am able to vectorize 'colsi', but looping through each group column in 'colsj' is still time consuming.
colsi = ['a', 'b', 'c']
colsj = ['d', 'e', 'f']
med = df[colsi].median()
for j in colsj:
    newcols = [j + i + '_n' for i in colsi]
    newmed = med.copy()
    newmed.index = newcols
    df = df.reindex(columns=df.columns.tolist() + newcols)
    df[newcols] = df.groupby(j)[colsi].shift()
    df[newcols] = df[newcols].fillna(newmed)
Parallelization seems to be a good way to do it. Leaning on this code, I attempted the following but it didn't work:
from multiprocessing.pool import ThreadPool
pool = ThreadPool(processes=3)

colsi = ['a', 'b', 'c']
colsj = ['d', 'e', 'f']
med = df[colsi].median()

def funct(j):
    newcols = [j + i + '_n' for i in colsi]
    newmed = med.copy()
    newmed.index = newcols
    df = df.reindex(columns=df.columns.tolist() + newcols)
    df[newcols] = df.groupby(j)[colsi].shift()
    df[newcols] = df[newcols].fillna(newmed)

for j in colsj:
    pool.apply_async(funct, (j))
I do not have any knowledge of how to go about parallel processing, so I am not sure what's missing here. Please advise.
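One way this might be restructured (a sketch under the same column names, not a tested solution): have the worker return the new columns instead of assigning into the shared df inside the function, collect the results from the pool, and concat once at the end.
from multiprocessing.pool import ThreadPool
import pandas as pd

colsi = ['a', 'b', 'c']
colsj = ['d', 'e', 'f']
med = df[colsi].median()

def shift_by_group(j):
    # build the lagged columns for one group column and return them
    newcols = [j + i + '_n' for i in colsi]
    shifted = df.groupby(j)[colsi].shift()
    shifted.columns = newcols
    return shifted.fillna(dict(zip(newcols, med)))

with ThreadPool(processes=3) as pool:
    results = pool.map(shift_by_group, colsj)

df = pd.concat([df] + results, axis=1)
Whether this actually speeds things up depends on how much of the groupby/shift work releases the GIL; it is a sketch of the structure rather than a guaranteed win.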

Pandas dataframe column selection

I am using Pandas to select columns from a dataframe, olddf. Let's say the variable names are 'a', 'b', 'c', 'startswith1', 'startswith2', 'startswith3', ..., 'startswith10'.
My approach was to create a list of all variables with a common starting value.
filter_col = [col for col in list(olddf) if col.startswith('startswith')]
I'd like to then select columns within that list as well as others, by name, so I don't have to type them all out. However, this doesn't work:
newdf = olddf['a','b',filter_col]
And this doesn't either:
newdf = olddf[['a','b'],filter_col]
I'm a newbie so this is probably pretty simple. Is the reason this doesn't work because I'm mixing a list improperly?
Thanks.
Use
newdf = olddf[['a','b']+filter_col]
since adding lists concatenates them:
In [264]: ['a', 'b'] + ['startswith1']
Out[264]: ['a', 'b', 'startswith1']
Alternatively, you could use the filter method:
newdf = olddf.filter(regex=r'^(startswith|[ab])')
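For instance, a small sketch with made-up column names:
import pandas as pd

olddf = pd.DataFrame(columns=['a', 'b', 'c', 'startswith1', 'startswith2'])
print(olddf.filter(regex=r'^(startswith|[ab])').columns.tolist())
# ['a', 'b', 'startswith1', 'startswith2']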
