I want to merge two dataframes in pandas: one is 126720 rows and 3 columns, the other is 1280 rows and 3 columns. The column names of both dataframes are the same. I tried to merge them row-wise, but the result has NaN values because the number of rows in the two dataframes differs. What I want is for them to be stacked one under the other, with no NaN, i.e. a dataframe with 128000 rows and 3 columns. Can anyone point me in the right direction?
Here is the code I tried, which led me to NaN:
import pandas as pd

df = pd.read_csv('.csv')
df1 = pd.read_csv('.csv')
result = pd.concat([df1, df], ignore_index=True)
Now the dataframe is 128000 rows and 4 columns. I cannot show the whole dataset because it is large, so here is a smaller example of df and df1:
df:

col1 | col2 | col3
-----|------|-----
 11  |  13  |  15
 12  |  14  |  16

df1:

col1 | col2 | col3
-----|------|-----
  1  |   3  |   6
  2  |   4  |   7
  3  |   5  |   8
What I want after merging:

result:

col1 | col2 | col3
-----|------|-----
  1  |   3  |   6
  2  |   4  |   7
  3  |   5  |   8
 11  |  13  |  15
 12  |  14  |  16
I think my problem is that the row indexes of the two dataframes overlap; that is why I am getting NaN when I merge them. So how can I change the row index before merging?
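A hedged sketch of the usual fix: pd.concat([...], ignore_index=True) already rebuilds the row index, so overlapping indexes cannot produce NaN by themselves. NaN plus a stray fourth column usually means the CSVs carry a saved index column, or that the column names differ slightly (e.g. trailing whitespace). Assuming the extra column is a saved index, passing index_col=0 to read_csv fixes both (the file names below are placeholders, since the question left them blank):

import pandas as pd

# 'big.csv' / 'small.csv' are placeholder paths
# index_col=0 assumes the first CSV column is a saved index
df = pd.read_csv('big.csv', index_col=0)
df1 = pd.read_csv('small.csv', index_col=0)

# ignore_index=True renumbers the rows 0..127999
result = pd.concat([df1, df], ignore_index=True)
print(result.shape)  # expected: (128000, 3)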
I am looking for an efficient and elegant way in Pandas to remove "duplicate" rows in a DataFrame that have exactly the same value set but in different columns.
I am ideally looking for a vectorized way to do this, as I can already identify very inefficient ways using pandas.DataFrame.iterrows().
Say my DataFrame is:
source | target
-------|-------
   1   |   2
   2   |   1
   4   |   3
   2   |   7
   3   |   4
I want it to become:
source | target
-------|-------
   1   |   2
   4   |   3
   2   |   7
import numpy as np
import pandas as pd

df = df[~pd.DataFrame(np.sort(df.values, axis=1)).duplicated()]
source target
0 1 2
2 4 3
3 2 7
Explanation:
np.sort(df.values, axis=1) sorts the values within each row (axis=1), so (2, 1) becomes (1, 2):
array([[1, 2],
[1, 2],
[3, 4],
[2, 7],
[3, 4]], dtype=int64)
Then we build a DataFrame from the sorted array and mark the non-duplicated rows by negating duplicated() with ~:
~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()
0 True
1 False
2 True
3 True
4 False
dtype: bool
Using this boolean Series as a mask gives the final output:
source target
0 1 2
2 4 3
3 2 7
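Put together as a self-contained sketch (the sample frame is reconstructed from the question's table):

import numpy as np
import pandas as pd

df = pd.DataFrame({'source': [1, 2, 4, 2, 3],
                   'target': [2, 1, 3, 7, 4]})

# sort each row so (2, 1) and (1, 2) compare equal, then keep only
# the rows whose sorted values have not been seen before
mask = ~pd.DataFrame(np.sort(df.values, axis=1)).duplicated()
print(df[mask])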
I don't know how to describe my problem in words, so I will just model it.
Problem modeling:
Say we have two dataframes df1 and df2 with the same columns:
df1
idx | col1 | col2 | col3 | col4
---------------------------------
0 | 1 | -100 | 2 | -100
df2
idx | col1 | col2 | col3 | col4
---------------------------------
0 | 12 | 23 | 34 | 45
Given these two dataframes, the result should be:
df_result
idx | col1 | col2 | col3 | col4
---------------------------------
0 | 1 | 23 | 2 | 45
I.e. we get df1 with every -100 substituted by the corresponding value from df2.
Question: How can I do it without a for-loop? In particular, is there an operation in pandas, or on two lists of the same size, that does what we need?
PS: I can do it with a for loop, but it will be much slower.
You can use this:
df1[df1==-100] = df2
This is how it works step-by-step:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.array([[1,-100,2,-100],[-100,3,-100,-100]]), columns=['col1','col2','col3','col4'])
df1
col1 col2 col3 col4
1 -100 2 -100
-100 3 -100 -100
df2 = pd.DataFrame(np.array([[12,23,34,45],[1,2,3,4]]), columns=['col1','col2','col3','col4'])
df2
col1 col2 col3 col4
12 23 34 45
1 2 3 4
Using boolean indexing, df1 == -100 gives a mask of the positions to replace:
df1==-100
col1 col2 col3 col4
False True False True
True False True True
Wherever the mask is True, the corresponding value of df2 is assigned:
df1[df1==-100]=df2
df1
col1 col2 col3 col4
1 23 2 45
1 3 3 4
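A side note (my addition, not part of the answer above): DataFrame.mask expresses the same substitution without mutating df1, returning a new frame instead. A minimal sketch using the one-row frames from the problem statement:

import pandas as pd

df1 = pd.DataFrame([[1, -100, 2, -100]], columns=['col1', 'col2', 'col3', 'col4'])
df2 = pd.DataFrame([[12, 23, 34, 45]], columns=['col1', 'col2', 'col3', 'col4'])

# mask() replaces values where the condition is True, taking the
# replacements from df2 (aligned on index and columns)
result = df1.mask(df1 == -100, df2)
print(result)  # -> 1  23  2  45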
I have a df which looks like this:
       | a | b
-------|---|---
apple  | 7 | 2
google | 8 | 8
swatch | 6 | 6
merc   | 7 | 8
other  | 8 | 9
I want to select a given row by name, say "apple", and move it to a new location, say -1 (second-to-last row).
Desired output:
       | a | b
-------|---|---
google | 8 | 8
swatch | 6 | 6
merc   | 7 | 8
apple  | 7 | 2
other  | 8 | 9
Are there any functions available to achieve this?
Use Index.difference to remove the value and numpy.insert to put it back at the new position; finally, use DataFrame.reindex or DataFrame.loc to reorder the rows:
import numpy as np

a = 'apple'
idx = np.insert(df.index.difference([a], sort=False), -1, a)
print(idx)
Index(['google', 'swatch', 'merc', 'apple', 'other'], dtype='object')
df = df.reindex(idx)
#alternative
#df = df.loc[idx]
print(df)
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9
Another approach uses pd.Index.insert() and pd.Index.drop_duplicates():
df.reindex(df.index.insert(-1,'apple').drop_duplicates(keep='last'))
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9
I'm not aware of any built-in function, but one approach would be to manipulate the index only, then use the new index to re-order the DataFrame (assumes all index values are unique):
name = 'apple'
position = -1
new_index = [i for i in df.index if i != name]
new_index.insert(position, name)
df = df.loc[new_index]
Results:
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9
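For completeness, a self-contained version of the index-manipulation approach (the frame is reconstructed from the question's sample data):

import pandas as pd

df = pd.DataFrame({'a': [7, 8, 6, 7, 8],
                   'b': [2, 8, 6, 8, 9]},
                  index=['apple', 'google', 'swatch', 'merc', 'other'])

name, position = 'apple', -1            # row to move and its target slot
new_index = [i for i in df.index if i != name]
new_index.insert(position, name)        # list.insert puts 'apple' before 'other'
print(df.loc[new_index])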
Currently I am working on a clustering problem, and I have a problem with copying values from one dataframe back to the original dataframe.
   | CustomerID | Date       | Time     | TotalSum | CohortMonth | CohortIndex
---|------------|------------|----------|----------|-------------|------------
 0 | 17850.0    | 2017-11-29 | 08:26:00 | 15.30    | 2017-11-01  | 1
 1 | 17850.0    | 2017-11-29 | 08:26:00 | 20.34    | 2017-11-01  | 1
 2 | 17850.0    | 2017-11-29 | 08:26:00 | 22.00    | 2017-11-01  | 1
 3 | 17850.0    | 2017-11-29 | 08:26:00 | 20.34    | 2017-11-01  | 1
And the dataframe with values (clusters) to copy:
CustomerID | Cluster
-----------|--------
12346.0    | 1
12346.0    | 1
12346.0    | 1
Please help me with the problem: how do I copy the Cluster values from the second dataframe into the first, matching on CustomerID?
I tried the code like this:
df.merge(ic,left_on='CustomerID',right_on='Cluster',how='left').drop('CustomerID',1).fillna('')
But it doesn't work and I get an error...
Besides that, I tried a version of the code like:
df, ic = [d.reset_index(drop=True) for d in (df, ic)]
ic.join(df[['CustomerID']])
But I get the same error, or an error like 'CustomerID' not in df...
Sorry if the question is unclear or badly formatted... it is my first question on Stack Overflow. Thank you all.
UPDATE
I have tried this:
df1 = df.merge(ic, left_on='CustomerID', right_on='Cluster', how='left')
if ic['CustomerID'].values != df1['CustomerID_x'].values:
    df1.Cluster = ic.Cluster
else:
    df1.Cluster = 'NaN'
But I've got different clusters for the same customer:
   | CustomerID_x | Date       | Time     | TotalSum | CohortMonth | CohortIndex | CustomerID_y | Cluster
---|--------------|------------|----------|----------|-------------|-------------|--------------|--------
 0 | 17850.0      | 2017-11-29 | 08:26:00 | 15.30    | 2017-11-01  | 1           | NaN          | 1.0
 1 | 17850.0      | 2017-11-29 | 08:26:00 | 20.34    | 2017-11-01  | 1           | NaN          | 0.0
 2 | 17850.0      | 2017-11-29 | 08:26:00 | 22.00    | 2017-11-01  | 1           | NaN          | 1.0
 3 | 17850.0      | 2017-11-29 | 08:26:00 | 20.34    | 2017-11-01  | 1           | NaN          | 2.0
 4 | 17850.0      | 2017-11-29 | 08:26:00 | 20.34    | 2017-11-01  | 1           | NaN          | 1.0
Given what you've written, I think you want:
>>> df1 = pd.DataFrame({"CustomerID": [17850.0] * 4, "CohortIndex": [1,1,1,1] })
>>> df1
CustomerID CohortIndex
0 17850.0 1
1 17850.0 1
2 17850.0 1
3 17850.0 1
>>> df2
CustomerID Cluster
0 12346.0 1
1 17850.0 1
2 12345.0 1
>>> pd.merge(df1, df2, how='left', on='CustomerID')
CustomerID CohortIndex Cluster
0 17850.0 1 1
1 17850.0 1 1
2 17850.0 1 1
3 17850.0 1 1
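One caveat worth adding (my note, not part of the answer above): if ic repeats a CustomerID, as in the question's sample, a left merge multiplies the matching rows of df. Deduplicating the cluster table first avoids the blow-up. A sketch with invented Cluster values:

import pandas as pd

# frames reconstructed loosely from the question;
# the Cluster values here are made up for illustration
df = pd.DataFrame({'CustomerID': [17850.0] * 4,
                   'TotalSum': [15.30, 20.34, 22.00, 20.34]})
ic = pd.DataFrame({'CustomerID': [12346.0, 12346.0, 17850.0],
                   'Cluster': [1, 1, 0]})

clusters = ic.drop_duplicates(subset='CustomerID')  # one row per customer
print(df.merge(clusters, on='CustomerID', how='left'))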
I am trying to convert a table containing string columns and array columns to a table with string columns only.
Here is how the current table looks:
+-----+--------------------+--------------------+
|col1 | col2 | col3 |
+-----+--------------------+--------------------+
| 1 |[2,3] | [4,5] |
| 2 |[6,7,8] | [8,9,10] |
+-----+--------------------+--------------------+
How can I get the expected result, like this:
+-----+--------------------+--------------------+
|col1 | col2 | col3 |
+-----+--------------------+--------------------+
| 1 | 2 | 4 |
| 1 | 3 | 5 |
| 2 | 6 | 8 |
| 2 | 7 | 9 |
| 2 | 8 | 10 |
+-----+--------------------+--------------------+
The confusion comes from mixing scalar columns and list columns.
Under the assumption that, for every row, col2 and col3 have the same length, we can first turn the scalar column into a list column and then concatenate:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2],
                   'col2': [[2, 3], [6, 7, 8]],
                   'col3': [[4, 5], [8, 9, 10]]})
# First, we turn all columns into list columns
df['col1'] = df['col1'].apply(lambda x: [x]) * df['col2'].apply(len)
# Then we concatenate the lists
df.apply(np.concatenate)
Output:
col1 col2 col3
0 1 2 4
1 1 3 5
2 2 6 8
3 2 7 9
4 2 8 10
Convert the list columns to flat numpy arrays, repeat col1 to match the per-row list lengths, and finally build a new DataFrame. np.concatenate and np.repeat handle lists of unequal length, like the [2, 3] and [6, 7, 8] in the sample:

lens = df['col2'].apply(len)                 # per-row list lengths: 2, 3
col1 = np.repeat(df['col1'].values, lens)    # repeat each col1 value accordingly
vals1 = np.concatenate(df['col2'].values)    # flatten the lists
vals2 = np.concatenate(df['col3'].values)
df = pd.DataFrame(np.column_stack((col1, vals1, vals2)), columns=df.columns)
print(df)

   col1  col2  col3
0     1     2     4
1     1     3     5
2     2     6     8
3     2     7     9
4     2     8    10
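For what it's worth, on pandas 1.3 or newer the same reshape is a one-liner with DataFrame.explode, which accepts a list of columns to explode in lockstep (a sketch, assuming col2 and col3 are equal-length per row, as in the question):

import pandas as pd

df = pd.DataFrame({'col1': [1, 2],
                   'col2': [[2, 3], [6, 7, 8]],
                   'col3': [[4, 5], [8, 9, 10]]})

# explode both list columns together; ignore_index renumbers the rows
result = df.explode(['col2', 'col3'], ignore_index=True)
print(result)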