I want to concatenate two data frames of the same length by adding a column to the first one (df).
But because certain df rows are filtered out, the indexes don't seem to match.
import io
import pandas as pd

df = pd.read_csv(io.StringIO(uploaded['customer.csv'].decode('utf-8')), sep=";")
df["Margin"] = df["Sales"] - df["Cost"]
df = df.loc[df["Margin"] > -100000]
df = df.loc[df["Sales"] > 1000]
df = df.reindex()  # reindex() with no arguments just returns a copy; it does not renumber the rows
df
This returns:
So this operation:
customerCluster = pd.concat([df, clusters], axis = 1, ignore_index= True)
print(customerCluster)
Is returning:
So I've tried reindex and the argument ignore_index=True, as you can see in the code snippet above.
Thanks for all the answers. If anyone encounters the same problem, the solution I found was this:
customerID = df["CustomerID"]
customerID = customerID.reset_index(drop=True)
df = df.reset_index(drop=True)
So, basically, the indexes of both data frames now match, thus:
customerCluster = pd.concat((customerID, clusters), axis = 1)
This concatenates the two data frames correctly.
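To see why the indexes stopped matching, here is a minimal sketch with made-up data (clusters stands in for the second frame from the question): pd.concat with axis=1 aligns rows by index label, so a filtered frame with gaps in its index produces NaN rows until the index is reset.

```python
import pandas as pd

df = pd.DataFrame({"CustomerID": [1, 2, 3, 4], "Sales": [500, 2000, 3000, 800]})
df = df.loc[df["Sales"] > 1000]               # keeps only the rows at index labels 1 and 2
clusters = pd.Series([0, 1], name="Cluster")  # fresh 0..1 index from the clustering step

# aligned by label: labels 0, 1, 2 only partially overlap, so NaNs appear
misaligned = pd.concat([df, clusters], axis=1)
print(misaligned)

# after resetting the index, both frames share labels 0..1 and line up row for row
aligned = pd.concat([df.reset_index(drop=True), clusters], axis=1)
print(aligned)
```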
I'm in awe that this isn't working because I've done this a hundred times. I want to stack two dataframes vertically and pandas is creating duplicate columns and refusing to put the data in the right columns.
df1 looks like this:
df2 looks like this:
then I run this:
frames = [df1,df2]
final = pd.concat(frames, ignore_index = True, axis = 0)
final
and get 6 columns instead of 3 like this:
I have no idea why two dataframes with identical column names and data types would not simply stack on top of each other. Any help appreciated.
Thanks.
update:
Big thanks to @Naveed: there was trailing whitespace in one of the dataframe's column names.
Here is the final fix:
df2.columns = [x.strip() for x in df2.columns]
frames = [df1,df2]
final = pd.concat(frames,ignore_index = True, axis = 0)
final
Try
Check the column names; there might be whitespace that causes columns to misalign after the concat.
display(df1.columns, df2.columns)
# make axis=0 and remove ignore_index
final = pd.concat(frames, axis = 0)
final
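A small sketch of the failure mode with made-up frames: a single trailing space in one column name is enough to make concat treat it as a brand-new column.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
df2 = pd.DataFrame({"a ": [4], "b": [5], "c": [6]})  # note the trailing space in "a "

# "a" and "a " are different labels, so the result has 4 columns, not 3
bad = pd.concat([df1, df2], ignore_index=True)
print(bad.columns.tolist())  # ['a', 'b', 'c', 'a ']

# stripping the names makes the labels identical again
df2.columns = df2.columns.str.strip()
good = pd.concat([df1, df2], ignore_index=True)
print(good.columns.tolist())  # ['a', 'b', 'c']
```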
I have two pandas dataframes with the following data (in csv):
#list1
poke_id,symbol
0,BTC
1,ETB
2,USDC
#list2
5,SOL
6,XRP
I am able to concatenate them into one dataframe using the following code:
df = pd.concat([df1, df2], ignore_index = True)
df = df.reset_index(drop = True)
df['poke_id'] = df.index
df = df[['poke_id','symbol']]
which gives me the output: (in csv)
poke_id,symbol
0,BTC
1,ETB
2,USDC
3,SOL
4,XRP
Is there any other way to do the same? Recomputing the whole data frame of ~4000 entries just to add ~100 more seems pointless and cumbersome. How can I make it take list 1 (dataframe 1), pick the highest poke_id, and just continue with i + 1 for the entries in list 2?
Your solution is good; it is possible to simplify it:
df = pd.concat([df1, df2], ignore_index = True).rename_axis('poke_id').reset_index()
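To illustrate with made-up frames matching the question's data (dropping the old poke_id column first, since reset_index cannot insert a column that already exists):

```python
import pandas as pd

df1 = pd.DataFrame({"poke_id": [0, 1, 2], "symbol": ["BTC", "ETB", "USDC"]})
df2 = pd.DataFrame({"poke_id": [5, 6], "symbol": ["SOL", "XRP"]})

# ignore_index renumbers rows 0..n-1; rename_axis names that index,
# and reset_index turns it into the new poke_id column
df = (pd.concat([df1, df2], ignore_index=True)
        .drop(columns="poke_id")
        .rename_axis("poke_id")
        .reset_index())
print(df)
```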
Use indexes to get the data you want from the dataframe. This is not efficient if you want large amounts of data, but it allows you to take specific rows from the dataframe.
I have multiple dataframes with the same columns but different values that look like that
Product 1 Dataframe
Here's the code that generated them
import pandas as pd
d1 = {"Year":[2018,2019,2020],"Quantity": [10,20,30], "Price": [100,200,300]}
df_product1 = pd.DataFrame(data=d1)
d2 = {"Year":[2018,2019,2020],"Quantity": [20,20,50], "Price": [120,110,380]}
df_product2 = pd.DataFrame(data=d2)
d3 = {"Year":[2018,2019,2020],"Quantity": [40,20,70], "Price": [1000,140,380]}
df_product3 = pd.DataFrame(data=d3)
I merge two dataframes, specifying suffixes, like so:
df_total = df_product1.merge(df_product2,on="Year", suffixes = ("_Product1","_Product2"))
And I get
First Merged Dataframe
However, when I merge another dataframe to the result above using:
df_total = df_total.merge(df_product3,on="Year", suffixes = ("_Product","_Product3"))
I get
Final Merged Dataframe
Where there is no suffix for the third product.
I would like the last two columns of the dataframe to be Quantity_Product3, Price_Product3 instead of just Quantity and Price.
Let me know if it is possible or if I need to approach the problem from a completely different angle.
Why you don't get the result you want
It's explained in the docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
suffixes : list-like, default is ("_x", "_y")
A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively. Pass a value of None instead of a string to indicate that the column name from left or right should be left as-is, with no suffix. At least one of the values must not be None.
Suffixes are added to overlapping column names.
See this example - suffixes are added to column b, because both dataframes have a column b, but not to columns a and c, as they are unique and not in common between the two dataframes.
import numpy as np
import pandas as pd

df1 = pd.DataFrame(columns=['a', 'b'], data=np.random.rand(10, 2))
df2 = pd.DataFrame(columns=['b', 'c'], data=np.random.rand(10, 2), index=np.arange(5, 15))
# equivalent to an inner join on the indices
out = pd.merge(df1, df2, how='inner', left_index=True, right_index=True)
A crude solution
Why don't you just rename the columns manually? Not elegant but effective
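A sketch of that manual rename using the product frames from the question (only products 1 and 3, for brevity): once the columns are renamed up front, merge finds no overlapping names and never needs to invent suffixes.

```python
import pandas as pd

d1 = {"Year": [2018, 2019, 2020], "Quantity": [10, 20, 30], "Price": [100, 200, 300]}
d3 = {"Year": [2018, 2019, 2020], "Quantity": [40, 20, 70], "Price": [1000, 140, 380]}

# rename the shared columns explicitly before merging
df_total = pd.DataFrame(d1).rename(
    columns={"Quantity": "Quantity_Product1", "Price": "Price_Product1"})
df_product3 = pd.DataFrame(d3).rename(
    columns={"Quantity": "Quantity_Product3", "Price": "Price_Product3"})

df_total = df_total.merge(df_product3, on="Year")
print(df_total.columns.tolist())
# ['Year', 'Quantity_Product1', 'Price_Product1', 'Quantity_Product3', 'Price_Product3']
```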
A possible alternative
The table you are trying to build looks like a pivot. I would look into normalising all your dataframes, concatenating them, then running a pivot on the result.
Depending on your case, this may well be more convoluted and could well be overkill. I mention it because I want to bring your attention to the concepts of pivoting/unpivoting (stacking/unstacking/normalising) data.
The code below takes a df which looks similar to yours and normalises it. For simpler cases you can use pandas.melt(). I don't have the exact data of your example but this should be a good starting point.
import numpy as np
import pandas as pd

def random_dates(start, end, n, unit='D', seed=None):
    ndays = (end - start).days + 1
    return start + pd.to_timedelta(
        np.random.randint(0, ndays, n), unit=unit
    )

df = pd.DataFrame()
mysize = 20
df['key'] = np.arange(0, mysize)
df['A_value'] = np.random.randint(0, 10000, mysize)
df['A_date'] = random_dates(pd.to_datetime('2010-01-01'), pd.to_datetime('2019-01-01'), mysize)
df['B_value'] = np.random.randint(-5000, 5000, mysize)
df['B_date'] = random_dates(pd.to_datetime('2005-01-01'), pd.to_datetime('2015-01-01'), mysize)
df['C_value'] = np.random.randint(-10000, 10000, mysize)
df['C_date'] = random_dates(pd.to_datetime('2000-01-01'), pd.to_datetime('2019-01-01'), mysize)

# normalise: one row per (key, source, metric)
df2 = df.set_index('key', drop=True, verify_integrity=True)
df2 = df2.stack().reset_index()
df2.columns = ['key', 'rawsource', 'rawvalue']
df2['source'] = df2['rawsource'].apply(lambda x: x[0:1])
df2['metric'] = df2['rawsource'].apply(lambda x: x[2:])
df2 = df2.drop(['rawsource'], axis=1)

# pivot back: one row per (key, source), one column per metric
df_piv = (df2.pivot_table(index=['key', 'source'], columns='metric',
                          values='rawvalue', aggfunc='first')
             .reset_index()
             .rename_axis(None, axis=1))
I am trying to get data from a list of dfs into one df.
I am using the following code
main_df = pd.DataFrame(columns=data.columns[-len_data+1:], index=is_refs)
for index, d in enumerate(dict_lists):
    df = pd.DataFrame(d)
    df = df.reindex(is_refs)
    main_df = main_df.append(df[df.columns])  # DataFrame.append is deprecated (removed in pandas 2.0)
The problem is that it returns the following DF. Clearly I don't want repeated items in the index and just want the financial data to fit into my rows. Any idea how to do this?
df_col = df[df.columns]
print(df_col)
main_df = main_df.join(df_col)  # join aligns on the index instead of stacking rows
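As an aside, DataFrame.append was deprecated and later removed in pandas 2.0; the usual replacement is to collect the frames in a list and call pd.concat once. A minimal sketch with made-up stand-ins for dict_lists and is_refs:

```python
import pandas as pd

# hypothetical stand-ins for the question's dict_lists and is_refs
dict_lists = [{"2020": [100], "2021": [120]}, {"2020": [80], "2021": [90]}]
is_refs = ["revenue", "costs"]

# build each frame, then concatenate once instead of repeated .append calls
frames = [pd.DataFrame(d, index=[ref]) for d, ref in zip(dict_lists, is_refs)]
main_df = pd.concat(frames)
print(main_df)
```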
RE: Add missing dates to pandas dataframe, a previously asked question
import pandas as pd
import numpy as np
idx = pd.date_range('09-01-2013', '09-30-2013')
df = pd.DataFrame(data = [2,10,5,1], index = ["09-02-2013","09-03-2013","09-06-2013","09-07-2013"], columns = ["Events"])
df.index = pd.DatetimeIndex(df.index); #question (1)
df = df.reindex(idx, fill_value=np.nan)
print(df)
In the above script, what does the command noted as question (1) do? If you leave this command out of the script, the df will be re-indexed, but the data portion of the original df will not be retained. As there is no reference to the df data in the DatetimeIndex command, why is the data from the starting df lost?
Short answer: df.index = pd.DatetimeIndex(df.index) converts the string index of df to a DatetimeIndex.
You have to make the distinction between different types of indexes. In
df = pd.DataFrame(data = [2,10,5,1], index = ["09-02-2013","09-03-2013","09-06-2013","09-07-2013"], columns = ["Events"])
you have an index containing strings. When using
df.index = pd.DatetimeIndex(df.index);
you convert this standard index with strings to an index with datetimes (a DatetimeIndex). So the values of these two types of indexes are completely different.
Now you reindex with
idx = pd.date_range('09-01-2013', '09-30-2013')
df = df.reindex(idx)
where idx is an index with datetimes. When you reindex the original df (with its string index), there are no matching index values, so none of the original df's column values are retained. When you reindex the converted df (with its datetime index), the labels match, so the column values at those indexes are retained.
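A runnable sketch of that distinction, reusing the question's data: counting the non-NaN values before and after the conversion shows exactly which reindex keeps the data.

```python
import pandas as pd

idx = pd.date_range('09-01-2013', '09-30-2013')
df = pd.DataFrame(data=[2, 10, 5, 1],
                  index=["09-02-2013", "09-03-2013", "09-06-2013", "09-07-2013"],
                  columns=["Events"])

# string labels never equal datetime labels, so every value is lost
before = df.reindex(idx)
print(before["Events"].notna().sum())  # 0

# after the conversion, the four labels match and their values survive
df.index = pd.DatetimeIndex(df.index)
after = df.reindex(idx)
print(after["Events"].notna().sum())  # 4
```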
See also http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.reindex.html