I have two dataframes:
The first one looks like this:
variable
entry
subentry
0
1
X
2
Y
3
Z
and the second one looks like:
variable
entry
subentry
0
1
A
2
B
I would like to merge the two dataframe such that I get:
variable
entry
subentry
0
1
X
2
Y
3
Z
1
1
A
2
B
Simply using df1.append(df2, ignore_index=True) gives
variable
0
X
1
Y
2
Z
3
A
4
B
In other words, it collapses the multindex into a single index. Is there a way around this?
Edit: Here is a code sinppet that will reproduce the problem:
arrays = [
np.array([0,0,0]),
np.array([0,1,2]),]
arrays_2 = [
np.array([0,0]),
np.array([0,1]),]
df1 = pd.DataFrame(np.random.randn(3, 1), index=arrays)
df2 = pd.DataFrame(np.random.randn(2, 1), index=arrays_2)
df = df1.append(df2, ignore_index=True)
print(df)
Edit: In practice, I am looking ao combine N dataframes, each with a different number of "entry" rows. So I am looking for an approach that will not rely on me knowing the exact of the dataframes I am combining.
One way try:
pd.concat([df1, df2], keys=[0,1]).droplevel(1)
Output:
0
0 0 -0.439749
1 -0.478744
2 0.719870
1 0 -1.055648
1 -2.007242
Use pd.concat to concat the dataframes together and since entry is the same of both, use keys parameter to create a new level with the naming you want your level to be. Finally, go back and drop the old index level (where the value was the same).
Related
I'm trying to copy data from different columns to a particular column in the same DataFrame.
Index
col1A
col2A
colB
list
CT
CW
CH
0
1
:
1
b
2
2
3
3d
But prior to that I wanted to search if those columns(col1A,col2A,colB) exist in the DataFrame and group those columns which are present and move the grouped data to relevant columns(CT,CH,etc) like,
CH
CW
CT
0
1
1
1
b
b
2
2
2
3
3d
3d
I did,
col_list1 = ['ColA','ColB','ColC']
test1 = any([ i in df.columns for i in col_list1 ])
if test1==True:
df['CH'] = df['Col1A'] +df['Col2A']
df['CT'] = df['ColB']
this code is throwing me a keyerror
.
I want it to ignore columns that are not present and add only those that are present
IIUC, you can use Python set or Series.isin to find the common columns
cols = list(set(col_list1) & set(df.columns))
# or
cols = df.columns[df.columns.isin(col_list1)]
df['CH'] = df[cols].sum(axis=1)
Instead of just concatenating the columns with +, collect them into a list and use sum with axis=1:
df['CH'] = np.sum([df[c] for c in cl if c in df], axis=1)
I have this dataframe and I need to drop all duplicates but I need to keep first AND last values
For example:
1 0
2 0
3 0
4 0
output:
1 0
4 0
I tried df.column.drop_duplicates(keep=("first","last")) but it doesn't word, it returns
ValueError: keep must be either "first", "last" or False
Does anyone know any turn around for this?
Thanks
You could use the panda's concat function to create a dataframe with both the first and last values.
pd.concat([
df['X'].drop_duplicates(keep='first'),
df['X'].drop_duplicates(keep='last'),
])
you can't drop both first and last... so trick is too concat data frames of first and last.
When you concat one has to handle creating duplicate of non-duplicates. So only concat unique indexes in 2nd Dataframe. (not sure if Merge/Join would work better?)
import pandas as pd
d = {1:0,2:0,10:1, 3:0,4:0}
df = pd.DataFrame.from_dict(d, orient='index', columns=['cnt'])
print(df)
cnt
1 0
2 0
10 1
3 0
4 0
Then do this:
d1 = df.drop_duplicates(keep=("first"))
d2 = df.drop_duplicates(keep=("last"))
d3 = pd.concat([d1,d2.loc[set(d2.index) - set(d1.index)]])
d3
Out[60]:
cnt
1 0
10 1
4 0
Use a groupby on your column named column, then reindex. If you ever want to check for duplicate values in more than one column, you can extend the columns you include in your groupby.
df = pd.DataFrame({'column':[0,0,0,0]})
Input:
column
0 0
1 0
2 0
3 0
df.groupby('column', as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[0, -1]]).reset_index(level=0, drop=True)
Output:
column
0 0
3 0
I have two dataframes.
feelingsDF with columns 'feeling', 'count', 'code'.
countryDF with columns 'feeling', 'countryCount'.
How do I make another dataframe that takes the columns from countryDF and combines it with the code column in feelingsDF?
I'm guessing you would need to somehow use same feeling column in feelingsDF to combine them and match sure the same code matches the same feeling.
I want the three columns to appear as:
[feeling][countryCount][code]
You are joining the two dataframes by the column 'feeling'. Assuming you only want the entries in 'feeling' that are common to both dataframes, you would want to do an inner join.
Here is a similar example with two dfs:
x = pd.DataFrame({'feeling': ['happy', 'sad', 'angry', 'upset', 'wow'], 'col1': [1,2,3,4,5]})
y = pd.DataFrame({'feeling': ['okay', 'happy', 'sad', 'not', 'wow'], 'col2': [20,23,44,10,15]})
x.merge(y,how='inner', on='feeling')
Output:
feeling col1 col2
0 happy 1 23
1 sad 2 44
2 wow 5 15
To drop the 'count' column, select the other columns of feelingsDF, and then sort by the 'countryCount' column. Note that this will leave your index out of order, but you can reindex the combined_df afterwards.
combined_df = feelingsDF[['feeling', 'code']].merge(countryDF, how='inner', on='feeling').sort_values('countryCount')
# To reset the index after sorting:
combined_df = combined_df.reset_index(drop=True)
You can join two dataframes using pd.merge. Assuming that you want to join on the feeling column, you can use:
df= pd.merge(feelingsDF, countryDF, on='feeling', how='left')
See documentation for pd.merge to understand how to use the on and how parameters.
feelingsDF = pd.DataFrame([{'feeling':1,'count':10,'code':'X'},
{'feeling':2,'count':5,'code':'Y'},{'feeling':3,'count':1,'code':'Z'}])
feeling count code
0 1 10 X
1 2 5 Y
2 3 1 Z
countryDF = pd.DataFrame([{'feeling':1,'country':'US'},{'feeling':2,'country':'UK'},{'feeling':3,'country':'DE'}])
feeling country
0 1 US
1 2 UK
2 3 DE
df= pd.merge(feelingsDF, countryDF, on='feeling', how='left')
feeling count code country
0 1 10 X US
1 2 5 Y UK
2 3 1 Z DE
I have an array of numbers (I think the format makes it a pivot table) that I want to turn into a "tidy" data frame. For example, I start with variable 1 down the left, variable 2 across the top, and the value of interest in the middle, something like this:
X Y
A 1 2
B 3 4
I want to turn that into a tidy data frame like this:
V1 V2 value
A X 1
A Y 2
B X 3
B Y 4
The row and column order don't matter to me, so the following is totally acceptable:
value V1 V2
2 A Y
4 B Y
3 B X
1 A X
For my first go at this, which was able to get me the correct final answer, I looped over the rows and columns. This was terribly slow, and I suspected that some machinery in Pandas would make it go faster.
It seems that melt is close to the magic I seek, but it doesn't get me all the way there. That first array turns into this:
V2 value
0 X 1
1 X 2
2 Y 3
3 Y 4
It gets rid of my V1 variable!
Nothing is special about melt, so I will be happy to read answers that use other approaches, particularly if melt is not much faster than my nested loops and another solution is. Nonetheless, how can I go from that array to the kind of tidy data frame I want as the output?
Example dataframe:
df = pd.DataFrame({"X":[1,3], "Y":[2,4]},index=["A","B"])
Use DataFrame.reset_index with DataFrame.rename_axis and then DataFrame.melt. If you want order columns we could use DataFrame.reindex.
new_df = (df.rename_axis(index = 'V1')
.reset_index()
.melt('V1',var_name='V2')
.reindex(columns = ['value','V1','V2']))
print(new_df)
Another approach DataFrame.stack:
new_df = (df.stack()
.rename_axis(index = ['V1','V2'])
.rename('value')
.reset_index()
.reindex(columns = ['value','V1','V2']))
print(new_df)
value V1 V2
0 1 A X
1 3 B X
2 2 A Y
3 4 B Y
to names names there is another alternative like commenting #Scott Boston in the comments
Melt is a good approach, but it doesn't seem to play nicely with identifying the results by index. You can reset the index first to move it to its own column, then use that column as the id col.
test = pd.DataFrame([[1,2],[3,4]], columns=['X', 'Y'], index=['A', 'B'])
X Y
A 1 2
B 3 4
test = test.reset_index()
index X Y
0 A 1 2
1 B 3 4
test.melt('index',['X', 'Y'], 'prev cols')
index prev cols value
0 A X 1
1 B X 3
2 A Y 2
3 B Y 4
I have a dataframe with a variable number of columns and with are handled inside MultiIndex for the columns. I'm trying to add several columns into the same MultiIndex structure
I've tried to add the new columns like if I would if there was only one column but it doesn't work
I have tried this:
df = pd.DataFrame(np.random.rand(4,2), columns=pd.MultiIndex.from_tuples([('plus_zero', 'A'), ('plus_zero', 'B')]))
df['plus_one'] = df['plus_zero'] + 1
But I get ValueError: Wrong number of items passed 2, placement implies 1.
The original df should look like this:
plus_zero
A B
0 0.602891 0.701130
1 0.395749 0.960206
2 0.268238 0.140606
3 0.165802 0.971707
And the result I want:
plus_zero plus_one
A B A B
0 0.602891 0.701130 1.602891 1.701130
1 0.395749 0.960206 1.395749 1.960206
2 0.268238 0.140606 1.268238 1.140606
3 0.165802 0.971707 1.165802 1.971707
Using pd.concat:
You must specify the names of the new columns and the axis=1 or axis='columns'
pd.concat([df.loc[:,'plus_zero'],df.loc[:,'plus_zero']+1],
keys=['plus_zero','plus_one'],
axis=1)
plus_zero plus_one
A B A B
0 0.049735 0.013907 1.049735 1.013907
1 0.782054 0.449790 1.782054 1.449790
2 0.148571 0.172844 1.148571 1.172844
3 0.875560 0.393258 1.875560 1.393258