I have a DataFrame like this:
import pandas as pd
df = pd.DataFrame(data= {"x": [1,2,3,4],"y":[5,6,7,8],"i":["a.0","a.1","a.0","a.1"]}).set_index("i")
df
Out:
x y
i
a.0 1 5
a.1 2 6
a.0 3 7
a.1 4 8
and I want to rename the index based on a column condition:
df.loc[df["y"] > 6].rename(index=lambda x: x + ">6")
which gives me:
x y
i
a.0>6 3 7
a.1>6 4 8
I tried it with inplace=True, but it does not work: df.loc[...] returns a new object, so the in-place rename only changes that temporary copy and df stays the same.
df.loc[df["y"] > 6].rename(index=lambda x: x + ">6", inplace=True)
I could only get it done by resetting the index, changing the i-column values via apply, and setting the index again:
df1 = df.reset_index()
df1.loc[df1["y"]>6, "i"] = df1.loc[df1["y"]>6, "i"].apply(lambda x: x+ ">6" )
df1.set_index("i", inplace=True)
df1
Out:
x y
i
a.0 1 5
a.1 2 6
a.0>6 3 7
a.1>6 4 8
But this is so complicated.
Do you know if there is an easier way?
How about trying this?
import numpy as np
df.index=np.where(df['y']>6, df.index+'>6', df.index)
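Put together with the sample frame from the question, a runnable sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [5, 6, 7, 8],
                   "i": ["a.0", "a.1", "a.0", "a.1"]}).set_index("i")
# Rewrite only the labels whose row satisfies the condition
df.index = np.where(df["y"] > 6, df.index + ">6", df.index)
print(df)
```

df keeps all four rows; only the labels of the rows with y > 6 change.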
I'm trying to combine values of multiple columns into a single column. Suppose I have a csv with the following data
col1,col2,col3,col4
1,2,3,4
6,2,4,6
2,5,6,2
I want it to become a single column with the values concatenated separated by a blank space
col1
1 2 3 4
6 2 4 6
2 5 6 2
The number of columns is 2000+ so having the columns statically concatenated will not do.
I have no idea why you would want such a design, but you can aggregate across axis=1:
df.astype(str).agg(' '.join, axis=1).to_frame('col')
col
0 1 2 3 4
1 6 2 4 6
2 2 5 6 2
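As a self-contained check with the sample data:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 6, 2], "col2": [2, 2, 5],
                   "col3": [3, 4, 6], "col4": [4, 6, 2]})
# Cast every cell to str, then join each row's cells with a space
out = df.astype(str).agg(" ".join, axis=1).to_frame("col")
print(out)
```

This scales to any number of columns, since nothing is named statically.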
I would try using pandas. This reads the csv, finds all of the column names, and then concatenates the values of each row across all columns, saving the result back to df.
import pandas as pd
df = pd.read_csv('test.csv')
cols = df.columns
df = df[cols].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
The output for this csv file
c1,c2,c3
1,2,3
4,5,6
7,8,9
Is
Index(['c1', 'c2', 'c3'], dtype='object')
0 1 2 3
1 4 5 6
2 7 8 9
The Index(['c1', 'c2', 'c3'], dtype='object') line is just cols, the column names, printed before the joined rows.
Setting things up:
import numpy as np
import pandas as pd
#generating random int dataframe
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
First case (by hand):
str_df1 = df.iloc[:, 0].apply(str) + " " + df.iloc[:, 1].apply(str) + " " + df.iloc[:, 2].apply(str) + " " + df.iloc[:, 3].apply(str)
Second case (generic):
str2_df = df.iloc[:, 0].apply(str)
for i in range(1, df.shape[1]):
    str2_df += " " + df.iloc[:, i].apply(str)
Hope I have helped.
I want to merge 3 columns into a single column. I have tried changing the column types. However, I could not do it.
For example, I have 3 columns such as A: {1,2,4}, B:{3,4,4}, C:{1,1,1}
Output expected: ABC Column {131, 241, 441}
My inputs are like this:
df['ABC'] = df['A'].map(str) + df['B'].map(str) + df['C'].map(str)
df.head()
ABC {13.01.0 , 24.01.0, 44.01.0}
The type of ABC seems to be object, and I could not change it via str or int.
df['ABC'].apply(str)
Also, I realized that there are NaN values in A, B, C column. Is it possible to merge these even with NaN values?
# Example
import pandas as pd
import numpy as np
df = pd.DataFrame()
# Considering NaN's in the data-frame
df['colA'] = [1,2,4, np.NaN,5]
df['colB'] = [3,4,4,3,np.NaN]
df['colC'] = [1,1,1,4,1]
# Using pd.isna() to check for NaN values in the columns
df['colA'] = df['colA'].apply(lambda x: x if pd.isna(x) else str(int(x)))
df['colB'] = df['colB'].apply(lambda x: x if pd.isna(x) else str(int(x)))
df['colC'] = df['colC'].apply(lambda x: x if pd.isna(x) else str(int(x)))
# Filling the NaN values with a blank space
df = df.fillna('')
# Transform columns into string
df = df.astype(str)
# Concatenating all together
df['ABC'] = df.sum(axis=1)
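Run end to end, the steps above give the following (a runnable sketch; ''.join across each row is an explicit alternative to the df.sum(axis=1) concatenation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"colA": [1, 2, 4, np.nan, 5],
                   "colB": [3, 4, 4, 3, np.nan],
                   "colC": [1, 1, 1, 4, 1]})
# str(int(x)) strips the ".0" that the float (NaN-capable) dtype adds
for c in df.columns:
    df[c] = df[c].apply(lambda x: x if pd.isna(x) else str(int(x)))
# NaN cells collapse to empty strings, so they simply vanish from the result
df["ABC"] = df.fillna("").agg("".join, axis=1)
print(df["ABC"].tolist())
```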
A workaround for your NaN problem could look like this, but note that NaN now becomes 0:
import numpy as np
df = pd.DataFrame({'A': [1,2,4, np.nan], 'B':[3,4,4,4], 'C':[1,np.nan,1, 3]})
df = df.replace(np.nan, 0, regex=True).astype(int).applymap(str)
df['ABC'] = df['A'] + df['B'] + df['C']
output
A B C ABC
0 1 3 1 131
1 2 4 0 240
2 4 4 1 441
3 0 4 3 043
I have 2 dataframes containing the same indexes and the same column names (10 columns).
For example:
from df1
A B C
1 0 4 8
2 5 6 9
3 2 5 1
from df2:
A B C
1 9 4 5
2 1 4 2
3 5 5 1
I want to plot on the same graph, column A from df1 vs column A from df2, column B from df1 vs column B from df2, etc..and this for every column.
How could I do that with pandas and matplotlib?
This is one way to do it:
import pandas as pd
import matplotlib.pyplot as plt
d1 = {'A':[0,5,2],'B':[4,6,5],'C':[8,9,1]}
d2 = {'A':[9,1,5],'B':[4,4,5],'C':[5,2,1]}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
df1_a = df1['A'].tolist()
df1_b = df1['B'].tolist()
df2_a = df2['A'].tolist()
df2_b = df2['B'].tolist()
plt.plot(df1_a, df1_b, 'r')
plt.plot(df2_a, df2_b, 'b')
plt.show()
Assuming that df1 and df2 are your dataframes, you could use the code below, which loops over all columns and saves a plot for each one as well.
import matplotlib.pyplot as plt
import pandas as pd
for column in df1.columns:
    x = df1[column]
    y = df2[column]
    if len(x) != len(y):
        x_ind = x.index
        y_ind = y.index
        common_ind = x_ind.intersection(y_ind)
        x = x[common_ind]
        y = y[common_ind]
    plt.scatter(x, y)
    plt.savefig("plot" + column + ".png")
    plt.clf()
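If you want everything on one figure, as the question asks, here is a sketch that overlays one labelled scatter series per column (assuming equal-length frames; the Agg backend is only there so it runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, needed only for headless runs
import matplotlib.pyplot as plt
import pandas as pd

df1 = pd.DataFrame({"A": [0, 5, 2], "B": [4, 6, 5], "C": [8, 9, 1]})
df2 = pd.DataFrame({"A": [9, 1, 5], "B": [4, 4, 5], "C": [5, 2, 1]})

fig, ax = plt.subplots()
for column in df1.columns:
    # x from df1, y from df2, one series per shared column name
    ax.scatter(df1[column], df2[column], label=column)
ax.legend()
fig.savefig("all_columns.png")
```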
Hope this helps!
df = pd.DataFrame({'x':[1,2,3,4,5,6],'y':[7,8,9,10,11,12],'z':['a','a','a','b','b','b']})
i = pd.Index([0,3,5,10,20])
The indices in i are from a larger dataframe, and df is a subset of that larger dataframe. So there will be indices in i that will not be in df. When I do
df.groupby('z').aggregate({'y':lambda x: sum(x.loc[i])}) #I know I can just use .aggregate({'y':sum}), this is just an example to illustrate my problem
I get this output
y
z
a NaN
b NaN
as well as a warning message
__main__:1: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
How can I avoid this warning message and get the correct output? In my example the only valid indices for df are [0,3,5] so the expected output is:
y
z
a 7 #"sum" of index 0
b 22 #sum of index [3,5]
EDIT
The answers here work great but they do not allow different types of aggregation of x and y columns. For example, let's say I want to sum all elements of x, but for y only sum the elements in index i:
df.groupby('z').aggregate({'x':sum, 'y': lambda x: sum(x.loc[i])})
this is the desired output:
y x
z
a 7 6
b 22 15
Edit for updated question:
df.groupby('z').agg({'x':'sum','y':lambda r: r.reindex(i).sum()})
Output:
x y
z
a 6 7
b 15 22
Use reindex to select only the labels in i, then dropna to remove the all-NaN rows produced by labels of i that aren't in df, then groupby and agg:
df.reindex(i).dropna(how='all').groupby('z').agg({'y':'sum'})
Or, since groupby drops NaN group keys anyway, you don't really need the dropna:
df.reindex(i).groupby('z').agg({'y':'sum'})
Output:
y
z
a 7.0
b 22.0
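Self-contained, with the data from the question:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6],
                   "y": [7, 8, 9, 10, 11, 12],
                   "z": ["a", "a", "a", "b", "b", "b"]})
i = pd.Index([0, 3, 5, 10, 20])
# Labels 10 and 20 become all-NaN rows; their z is NaN, so groupby drops them
out = df.reindex(i).groupby("z").agg({"y": "sum"})
print(out)
```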
Use the intersection of df.index and i to get only the matched labels, and then process the data as needed:
print (df.loc[df.index.intersection(i)])
x y z
0 1 7 a
3 4 10 b
5 6 12 b
df = df.loc[df.index.intersection(i)].groupby('z').agg({'y':'sum'})
#comment alternative
#df = df.loc[df.index.isin(i)].groupby('z').agg({'y':'sum'})
print (df)
y
z
a 7
b 22
EDIT:
df1 = df.groupby('z').aggregate({'x':sum, 'y': lambda x: sum(x.loc[x.index.intersection(i)])})
#comment alternative
#df1 = df.groupby('z').aggregate({'x':sum, 'y': lambda x: sum(x.loc[x.index.isin(i)])})
print (df1)
x y
z
a 6 7
b 15 22
Suppose a dataframe like this one:
df = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]], columns = ['A', 'B', 'A1', 'B1'])
I would like to have a dataframe where columns A1 and B1 are stacked beneath A and B, i.e. two columns A and B with six rows.
What does not work:
new_rows = int(df.shape[1]/2) * df.shape[0]
new_cols = 2
df.values.reshape(new_rows, new_cols, order='F')
Of course I could loop over the data and make a new list of lists, but there must be a better way. Any ideas?
The pd.wide_to_long function is built almost exactly for this situation, where you have many of the same variable prefixes that end in a different digit suffix. The only difference here is that your first set of variables don't have a suffix, so you will need to rename your columns first.
The only issue with pd.wide_to_long is that, unlike melt, it must have an identification variable, i. reset_index is used to create this uniquely identifying column, which is dropped later. I think this might get corrected in the future.
df1 = df.rename(columns={'A':'A1', 'B':'B1', 'A1':'A2', 'B1':'B2'}).reset_index()
pd.wide_to_long(df1, stubnames=['A', 'B'], i='index', j='id')\
.reset_index()[['A', 'B', 'id']]
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
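End to end, as a runnable sketch (note the rename that also gives the first A/B pair a numeric suffix):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
                  columns=["A", "B", "A1", "B1"])
# Every stub needs a suffix: A/B become A1/B1, the old A1/B1 become A2/B2
df1 = df.rename(columns={"A": "A1", "B": "B1", "A1": "A2", "B1": "B2"}).reset_index()
out = (pd.wide_to_long(df1, stubnames=["A", "B"], i="index", j="id")
         .reset_index()[["A", "B", "id"]])
print(out)
```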
You can use lreshape, with numpy.repeat for the id column:
a = [col for col in df.columns if 'A' in col]
b = [col for col in df.columns if 'B' in col]
df1 = pd.lreshape(df, {'A' : a, 'B' : b})
df1['id'] = np.repeat(np.arange(len(df.columns) // 2), len(df.index)) + 1
print (df1)
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
EDIT:
lreshape is currently undocumented, and it is possible it might be removed (along with pd.wide_to_long). A possible solution is merging all 3 functions into one - maybe melt - but it is not implemented yet. Maybe in some new version of pandas; then my answer will be updated.
I solved this in 3 steps:
Make a new dataframe df2 holding only the data you want to be added to the initial dataframe df.
Delete the data from df that will be added below (and that was used to make df2).
Append df2 to df.
Like so:
# step 1: create new dataframe
df2 = df[['A1', 'B1']]
df2.columns = ['A', 'B']
# step 2: delete that data from original
df = df.drop(["A1", "B1"], axis=1)
# step 3: append
df = df.append(df2, ignore_index=True)
Note how when you do df.append() you need to specify ignore_index=True so the appended rows get a fresh index rather than keeping their old one.
Your end result should be your original dataframe with the data rearranged like you wanted:
In [16]: df
Out[16]:
A B
0 1 2
1 5 6
2 9 10
3 3 4
4 7 8
5 11 12
Use pd.concat() like so:
#Split into separate tables
df_1 = df[['A', 'B']]
df_2 = df[['A1', 'B1']]
df_2.columns = ['A', 'B'] # Make column names line up
# Add the ID column
df_1 = df_1.assign(id=1)
df_2 = df_2.assign(id=2)
# Concatenate
pd.concat([df_1, df_2])
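Run with the sample frame, this sketch produces the same stacked result (ignore_index=True added so the final index runs 0 to 5, and set_axis used so the original slice is not mutated):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
                  columns=["A", "B", "A1", "B1"])
df_1 = df[["A", "B"]].assign(id=1)
# set_axis renames the columns on a copy, leaving df itself untouched
df_2 = df[["A1", "B1"]].set_axis(["A", "B"], axis=1).assign(id=2)
out = pd.concat([df_1, df_2], ignore_index=True)
print(out)
```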