I have a pandas dataframe whose shape is (4628,).
How do I change the shape of the dataframe to (4628, 1)?
You probably have a Series; you can turn it into a DataFrame with Series.to_frame:
import pandas as pd

s = pd.Series(range(10))
out = s.to_frame('col_name') # or .to_frame()
print(s.shape)
(10,)
print(out.shape)
(10, 1)
I don't know how you obtained that Series, but if you selected it with a single label like df['a'], you can pass a list of labels instead, like df[['a']], to get a DataFrame.
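For example, with a hypothetical column 'a' (a minimal sketch):

import pandas as pd

df = pd.DataFrame({'a': range(5), 'b': range(5)})

print(df['a'].shape)    # (5,)   -> a Series
print(df[['a']].shape)  # (5, 1) -> a one-column DataFrame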
You can use NumPy's reshape() on the underlying values. It's easy to use. Run the code below and you'll see the shape change from (4628,) to (4628, 1).
import pandas as pd

df = pd.DataFrame(range(1, 4629))
print(df)

df = df.values.reshape(-1)   # flatten to a 1-D NumPy array
print(df.shape)

df = df.reshape(-1, 1)       # reshape to a single column
print(df.shape)
Results:
......
[4628 rows x 1 columns]
(4628,)
(4628, 1)
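Since the result is a NumPy array rather than a DataFrame, you can wrap it back into a DataFrame if you need one (a small sketch; the column name 'value' is just an example):

out = pd.DataFrame(df, columns=['value'])   # df is the (4628, 1) array at this point
print(out.shape)
(4628, 1)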
Related
I have two DataFrames storing NumPy arrays. I would like to concatenate all NumPy arrays from DataFrame 1 with those from DataFrame 2. How can I achieve this?
A possible solution could look like this:
import numpy as np

def concat_df(df, other_df):
    for column in df.columns:
        # iterrows() yields copies, so write the result back via df.at
        for (idx, row1), (_, row2) in zip(df.iterrows(), other_df.iterrows()):
            # np.concatenate expects a sequence of arrays
            df.at[idx, column] = np.concatenate([row1[column], row2[column]])
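A small usage sketch with toy data, assuming the concat_df above (the column name 'arr' is only an example):

import numpy as np
import pandas as pd

df = pd.DataFrame({'arr': [np.array([1, 2]), np.array([3, 4])]})
other_df = pd.DataFrame({'arr': [np.array([5]), np.array([6])]})

concat_df(df, other_df)
print(df['arr'].tolist())   # expected: [array([1, 2, 5]), array([3, 4, 6])]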
IIUC:
try:
out = pd.Series(np.concatenate([df['column name'].values, other_df['column name'].values]))
OR
out = df['column name'].append(other_df['column name'], ignore_index=True)
OR
out = pd.Series(np.hstack([df['column name'].values, other_df['column name'].values]))
Now if you print out, you will get the required Series.
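Note that Series.append was deprecated and then removed in pandas 2.0, so on recent versions the second option can be written with pd.concat instead (a small sketch):

out = pd.concat([df['column name'], other_df['column name']], ignore_index=True)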
If they have the same columns, you can use pd.concat:
new_df = pd.concat([df, other_df])
I want to concatenate two data frames of the same length, by adding a column to the first one (df).
But because certain df rows are being filtered, it seems the index isn't matching.
import io
import pandas as pd

df = pd.read_csv(io.StringIO(uploaded['customer.csv'].decode('utf-8')), sep=";")
df["Margin"] = df["Sales"] - df["Cost"]
df = df.loc[df["Margin"] > -100000]
df = df.loc[df["Sales"] > 1000]
df.reindex()
df
This returns the filtered DataFrame.
So this operation:
customerCluster = pd.concat([df, clusters], axis = 1, ignore_index= True)
print(customerCluster)
is returning a result where the rows don't line up.
So, I've tried reindex and the ignore_index=True argument, as you can see in the code snippets above.
Thanks for all the answers. If anyone encounters the same problem, the solution I found was this:
customerID = df["CustomerID"]
customerID = customerID.reset_index(drop=True)
df = df.reset_index(drop=True)
So, basically, the indexes of both data frames are now matching, thus:
customerCluster = pd.concat((customerID, clusters), axis = 1)
This will concatenate correctly the two data frames.
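To see why the reset is needed, here is a minimal sketch with toy data (the column names are only illustrative): pd.concat(..., axis=1) aligns on the index, so rows dropped by the filter leave gaps that show up as NaN unless both objects share the same index.

import pandas as pd

df = pd.DataFrame({'Sales': [500, 2000, 3000]})   # index 0, 1, 2
df = df.loc[df['Sales'] > 1000]                   # only index 1, 2 remain
clusters = pd.Series([0, 1], name='Cluster')      # index 0, 1

# Mismatched indexes: the result has NaN where the labels don't overlap
print(pd.concat([df, clusters], axis=1))

# Resetting both indexes lines the rows up positionally
print(pd.concat([df.reset_index(drop=True),
                 clusters.reset_index(drop=True)], axis=1))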
I am trying the following example, where I need to group on columns:
import pandas as pd
import numpy as np
y = pd.DataFrame(np.random.randint(0, 10, (20, 30)).astype(float),
                 columns=pd.MultiIndex.from_tuples(
                     list(zip(np.arange(30),
                              np.random.randint(0, 10, (30,))))))
y.T.groupby(level = 1).agg(lambda x: np.std(x)/np.mean(x))
and it works. However, the following returns an error:
y.groupby(level = 1, axis = 1).agg(lambda x: np.std(x)/np.mean(x))
Am I missing something?
Update: the following works when the aggregations are taken separately:
y.groupby(level = 1, axis = 1).agg(np.std)/\
y.groupby(level = 1, axis = 1).agg(np.mean)
The groupby function is applied column-wise to your dataframe; however, when the dataframe is transposed, rows become columns and vice versa.
This wouldn't be an issue if it weren't for the fact that your rows and columns aren't both MultiIndexes. However, since you're treating your row index as a MultiIndex via the level=1 argument, you're getting that error.
Also, if you're trying to group by rows, you should use axis=0:
y.groupby(y.index, axis = 0).agg(lambda x: np.std(x)/np.mean(x))
Can someone explain what is wrong with this pandas concat code, and why the data frame remains empty? I am using the Anaconda distribution, and as far as I remember it was working before.
You want to use this form:
result = pd.concat([dataframe, series], axis=1)
pd.concat(...) doesn't modify the original dataframe "in place"; it returns the concatenated result, so you'll want to assign that result somewhere, e.g.:
>>> import pandas as pd
>>> s = pd.Series([1,2,3])
>>> df = pd.DataFrame()
>>> df = pd.concat([df, s], axis=1) # We assign the result back into df
>>> df
0
0 1
1 2
2 3
I'm using the Pandas library for remote sensing time series analysis. Eventually I would like to save my DataFrame to csv using chunk sizes, but I run into a little issue. My code generates 6 NumPy arrays that I convert to Pandas Series. Each of these Series contains a lot of items:
>>> prcpSeries.shape
(12626172,)
I would like to add the Series into a Pandas DataFrame (df) so I can save them chunk by chunk to a csv file.
d = {'prcp': pd.Series(prcpSeries),
     'tmax': pd.Series(tmaxSeries),
     'tmin': pd.Series(tminSeries),
     'ndvi': pd.Series(ndviSeries),
     'lstm': pd.Series(lstmSeries),
     'evtm': pd.Series(evtmSeries)}

df = pd.DataFrame(d)

outFile = 'F:/data/output/run1/_' + str(i) + '.out'
df.to_csv(outFile, header=False, chunksize=1000)

d = None
df = None
But my code gets stuck at the following line, giving a MemoryError:
df = pd.DataFrame(d)
Any suggestions? Is it possible to fill the Pandas DataFrame chunk by chunk?
If you know that each of these is the same length, then you could create the DataFrame directly from one of the arrays and then append each remaining column:
df = pd.DataFrame(prcpSeries, columns=['prcp'])
df['tmax'] = tmaxSeries
...
Note: you can also use the to_frame method, which lets you optionally pass a name (useful if the Series doesn't have one):
df = prcpSeries.to_frame(name='prcp')
However, if they are variable length then this will lose some data (any arrays which are longer than prcpSeries). An alternative here is to create each as a DataFrame and then perform an outer join (using concat):
df1 = pd.DataFrame(prcpSeries, columns=['prcp'])
df2 = pd.DataFrame(tmaxSeries, columns=['tmax'])
...
df = pd.concat([df1, df2, ...], join='outer', axis=1)
For example:
In [21]: dfA = pd.DataFrame([1,2], columns=['A'])
In [22]: dfB = pd.DataFrame([1], columns=['B'])
In [23]: pd.concat([dfA, dfB], join='outer', axis=1)
Out[23]:
   A    B
0  1    1
1  2  NaN
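As for filling the frame chunk by chunk: if the full DataFrame never fits in memory, one alternative (a rough sketch, assuming all six arrays have the same length and that outFile is defined as in the question) is to build and write the frame one slice at a time, appending to the CSV:

import pandas as pd

chunk_size = 1_000_000
n = len(prcpSeries)   # assuming all six arrays share this length

for start in range(0, n, chunk_size):
    stop = start + chunk_size
    chunk = pd.DataFrame({'prcp': prcpSeries[start:stop],
                          'tmax': tmaxSeries[start:stop],
                          'tmin': tminSeries[start:stop],
                          'ndvi': ndviSeries[start:stop],
                          'lstm': lstmSeries[start:stop],
                          'evtm': evtmSeries[start:stop]})
    # overwrite on the first chunk, then append
    chunk.to_csv(outFile, mode='w' if start == 0 else 'a',
                 header=False, index=False)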