I have two DataFrames storing numpy arrays. I would like to concatenate all numpy arrays from DataFrame 1 with those from DataFrame 2. How can I achieve this?
A possible solution could look like this:
import numpy as np

def concat_df(df, other_df):
    for column in df.columns.values:
        for (idx, row1), (_, row2) in zip(df.iterrows(), other_df.iterrows()):
            # np.concatenate takes a sequence of arrays; write back with .at,
            # since the rows yielded by iterrows() are copies
            df.at[idx, column] = np.concatenate([row1[column], row2[column]])
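A minimal usage sketch with two single-column frames of toy arrays (the column name 'a' and the data are just placeholders):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.array([1, 2]), np.array([3, 4])]})
other_df = pd.DataFrame({'a': [np.array([5]), np.array([6, 7])]})

concat_df(df, other_df)
print(df['a'].tolist())
# [array([1, 2, 5]), array([3, 4, 6, 7])]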
IIUC:
try:
out=pd.Series(np.concatenate([df['column name'].values, other_df['column name'].values]))
OR
out=df['column name'].append(other_df['column name'],ignore_index=True)  # Series.append is deprecated in newer pandas; prefer pd.concat
OR
out=pd.Series(np.hstack([df['column name'].values,other_df['column name'].values]))
Now if you print(out) you will get the required Series.
If they have the same columns you can use pd.concat.
new_df = pd.concat([df, other_df])
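For instance, a minimal sketch with two frames that share a column 'a' of numpy arrays (names and data are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.array([1, 2]), np.array([3, 4])]})
other_df = pd.DataFrame({'a': [np.array([5, 6])]})

# Stacks the rows of both frames into one DataFrame
new_df = pd.concat([df, other_df], ignore_index=True)
print(new_df)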
I have a pandas DataFrame whose shape is (4628,).
How do I change its shape to (4628, 1)?
You might actually have a Series; you can turn it into a DataFrame with Series.to_frame:
s = pd.Series(range(10))
out = s.to_frame('col_name') # or .to_frame()
print(s.shape)
(10,)
print(out.shape)
(10, 1)
I don't know how you got that Series; if you selected it with a single label like df['a'], try passing a list of labels instead, like df[['a']].
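For example, with a hypothetical column 'a':

import pandas as pd

df = pd.DataFrame({'a': range(10)})
print(df['a'].shape)    # (10,)   -> a single label gives a Series
print(df[['a']].shape)  # (10, 1) -> a list of labels keeps a DataFrame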
You can use reshape() on the underlying numpy values. Run the code below and you'll see the shape change from (4628,) to (4628, 1).
import pandas as pd

df = pd.DataFrame(range(1, 4629))
print(df)

arr = df.values.reshape(-1)  # the underlying numpy array, flattened to 1-D
print(arr.shape)

arr = arr.reshape(-1, 1)     # reshaped into a single column
print(arr.shape)
Results:
......
[4628 rows x 1 columns]
(4628,)
(4628, 1)
I've written the following simple function, but it runs rather slowly.
For df1 I'm using a DataFrame without any columns (only an index), and for df2 I'm using a DataFrame with 10 columns; I'm trying to add one of df2's columns, along with its values, to df1 where the indices match.
def add_labels_column(df1, df2):
    for idx1 in df1.index:
        for idx2 in df2.index:
            if idx1 == idx2:
                df1['Finding Labels'] = df2['Finding Labels']
    return df1
I'm looking for a faster solution, perhaps using pandas and/or numpy. I'm new to Python, pandas and numpy.
It's hard to answer your question without sample data, but one possible option is an index-aligned assignment (note that Series.update works in place and returns None, so its result should not be assigned back):
df1['Finding Labels'] = df2['Finding Labels']
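A small sketch of how that alignment behaves, with made-up indices and a made-up label column:

import pandas as pd

df1 = pd.DataFrame(index=['img1', 'img2', 'img3'])  # only an index, no columns
df2 = pd.DataFrame({'Finding Labels': ['A', 'B', 'C']}, index=['img2', 'img3', 'img4'])

# Assignment aligns on the index; rows of df1 without a match in df2 get NaN
df1['Finding Labels'] = df2['Finding Labels']
print(df1)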
I have the following situation:
I have multiple tables that look like this:
table1 = pd.DataFrame([[0,1],[0,1],[0,1],[0,1],[0,1]], columns=['v1','v2'])
I have one DataFrame in which each element refers to one of these tables, something like this:
df = pd.DataFrame([table1, table2, table3, table4], columns=['tablename'])
I need to create a new column in df that contains, for each table, the values that I get from np.polyfit(table1['v1'],table1['v2'],1)
I have tried to do the following
for x in df['tablename']:
    df.loc[:,'fit_result'] = np.polyfit(x['v1'],x['v2'],1)
but it raises:
TypeError: string indices must be integers
Is there a way to do it? or am I writing something that makes no sense?
Note: in fact, these tables are HUGE and contain more than two columns.
You can try something like this
import numpy as np
import pandas as pd
table1 = pd.DataFrame([[0.0,0.0],[1.0,0.8],[2.0,0.9],[3.0,0.1],[4.0,-0.8],[5.0,-1.0]], columns=['table1_v1','table1_v2'])
df = pd.DataFrame([['some','random'],['values','here']], columns=['example_1','example_2'])
def fit_result(v1, v2):
    return np.polyfit(v1, v2, 1)

# In this demo every row gets the fit of table1; repeat per table as needed
df['fit_result'] = df.apply(lambda row: fit_result(table1['table1_v1'].values, table1['table1_v2'].values), axis=1)
df.head()
Output
example_1 example_2 fit_result
0 some random [-0.3028571428571428, 0.7571428571428572]
1 values here [-0.3028571428571428, 0.7571428571428572]
You only need to do this over all your DataFrames and concat all of them at the end:
df_col = pd.concat([df1, df2], axis=1) (https://www.datacamp.com/community/tutorials/joining-dataframes-pandas)
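If the 'tablename' column actually holds the table objects themselves rather than strings (the TypeError suggests it currently holds strings), a more direct variant of the same idea is to apply the fit per stored table. A minimal sketch with made-up data:

import numpy as np
import pandas as pd

table1 = pd.DataFrame({'v1': [0, 1, 2, 3], 'v2': [0.0, 0.8, 0.9, 0.1]})
table2 = pd.DataFrame({'v1': [0, 1, 2, 3], 'v2': [1.0, 0.5, 0.2, -0.3]})
df = pd.DataFrame({'tablename': [table1, table2]})

# Fit a degree-1 polynomial per stored table and keep the coefficient pairs
df['fit_result'] = df['tablename'].apply(lambda t: np.polyfit(t['v1'], t['v2'], 1))
print(df['fit_result'])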
I have an array of dataframes dfs = [df0, df1, ...]. Each one of them has a date column of varying size (some dates might be in one dataframe but not the other).
What I'm trying to do is this:
pd.concat(dfs).groupby("date", as_index=False).sum()
But with date no longer being a column but an index (dfs = [df.set_index("date") for df in dfs]).
I've seen you can pass df.index to groupby (.groupby(df.index)) but df.index might not include all the dates.
How can I do this?
The goal here is to call .sum() on the groupby, so I'm not tied to using groupby nor concat if there's an alternative method to do so.
If I understand correctly, maybe you want something like this:
df = pd.concat(dfs)
df.groupby(df.index).sum()
Here's a small example:
tmp1 = pd.DataFrame({'date':['2019-09-01','2019-09-02','2019-09-03'],'value':[1,1,1]}).set_index('date')
tmp2 = pd.DataFrame({'date':['2019-09-01','2019-09-02','2019-09-04','2019-09-05'],'value':[2,2,2,2]}).set_index('date')
df = pd.concat([tmp1,tmp2])
df.groupby(df.index).sum()
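With those two frames this should print something like:

            value
date
2019-09-01      3
2019-09-02      3
2019-09-03      1
2019-09-04      2
2019-09-05      2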
I have a pandas Series that I would like to aggregate in three different ways. The series is as follows:
import pandas as pd
timestamps = [1,1,1,2,3,3,3,4]
quantities = [10,0,2,6,7,2,8,0]
series = pd.Series(quantities, index=timestamps)
Clearly the timestamps have 3 values of 1, 1 value of 2, 3 values of 3 and 1 value of 4. I would like to generate the following series:
1. Sum of the duplicate index values:
pd.Series([12,6,17,0], index=[1,2,3,4])
2. Median of the duplicate index values:
pd.Series([2,6,7,0], index=[1,2,3,4])
3. The number of duplicate index values:
pd.Series([3,1,3,1], index=[1,2,3,4])
In numpy I would achieve this using a unique_elements_to_indices method:
from typing import Dict
import numpy as np
def unique_elements_to_indices(array: np.array) -> Dict:
    mapping = {}
    for unique_element in np.unique(array):
        mapping[unique_element] = np.where(array == unique_element)[0]
    return mapping
... and then I would loop through the unique_elements and use np.where to locate the quantities for that given unique_element.
Is there a way to achieve this quickly in pandas, please?
Thanks.
It's possible to use the functions sum and median for separate outputs with the parameter level=0 to aggregate by index (note that the level parameter is deprecated in newer pandas versions in favour of groupby):
print (series.sum(level=0))
print (series.median(level=0))
But generally, aggregate by index with groupby and a function:
print (series.groupby(level=0).sum())
print (series.groupby(level=0).median())
# the difference between count and size is that count excludes NaN values
print (series.groupby(level=0).size())
print (series.groupby(level=0).count())
If you need it all together in a new DataFrame, use GroupBy.agg with a list of aggregate functions:
print(series.groupby(level=0).agg(['sum', 'median', 'size']))
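Which should produce something like:

   sum  median  size
1   12     2.0     3
2    6     6.0     1
3   17     7.0     3
4    0     0.0     1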
You could use .groupby for this:
import pandas as pd
timestamps = [1,1,1,2,3,3,3,4]
quantities = [10,0,2,6,7,2,8,0]
sr = pd.Series(quantities, index=timestamps)
print(sr.groupby(sr.index).sum())
print(sr.groupby(sr.index).median())
print(sr.groupby(sr.index).count())
When you are working with the pandas library, it is advisable to convert your data into a DataFrame. The easiest way is as below:
timestamps = [1,1,1,2,3,3,3,4]
quantities = [10,0,2,6,7,2,8,0]
d = {'quantities': quantities, 'timestamps': timestamps}
df = pd.DataFrame(d)
df.groupby('timestamps').sum().reset_index()
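Which should produce something like:

   timestamps  quantities
0           1          12
1           2           6
2           3          17
3           4           0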
In a similar way you can also use other aggregation functions. Let me know if this works for you.