I want to intersect two Pandas DataFrames (1 and 2) based on two columns (A and B) present in both. However, I would like the returned DataFrame to contain only the data of the first DataFrame, omitting any rows that are not found in the second.
So for example:
Dataframe 1:
A | B | Extra | Columns | In | 1 |
----------------------------------
1 | 2 | Extra | Columns | In | 1 |
1 | 3 | Extra | Columns | In | 1 |
1 | 5 | Extra | Columns | In | 1 |
Dataframe 2:
A | B | Extra | Columns | In | 2 |
----------------------------------
1 | 3 | Extra | Columns | In | 2 |
1 | 4 | Extra | Columns | In | 2 |
1 | 5 | Extra | Columns | In | 2 |
should return:
A | B | Extra | Columns | In | 1 |
----------------------------------
1 | 3 | Extra | Columns | In | 1 |
1 | 5 | Extra | Columns | In | 1 |
Is there a way I can do this simply?
You can use df.merge:
df = df1.merge(df2, on=['A', 'B'], how='inner').drop('2', axis=1)
how='inner' is the default; it is written out here only to make the merge behavior explicit. Note that with the example column names, the merge also produces suffixed copies of df2's overlapping columns (Extra_y, Columns_y, In_y), which you would need to drop as well.
As @piRSquared suggested, you can merge against only the key columns of df2, which avoids pulling in its extra columns entirely:
df1.merge(df2[['A', 'B']], how='inner')
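For reference, a minimal runnable sketch of this second approach, using small frames shaped like the example above (the extra column names here are illustrative):
import pandas as pd

df1 = pd.DataFrame({'A': [1, 1, 1], 'B': [2, 3, 5], 'extra1': ['x', 'y', 'z']})
df2 = pd.DataFrame({'A': [1, 1, 1], 'B': [3, 4, 5], 'extra2': ['p', 'q', 'r']})

# Keep only df1's rows whose (A, B) pair also appears in df2
result = df1.merge(df2[['A', 'B']], on=['A', 'B'], how='inner')
print(result)
#    A  B extra1
# 0  1  3      y
# 1  1  5      z
If df2 can contain duplicate (A, B) pairs, merge against df2[['A', 'B']].drop_duplicates() instead, so rows of df1 are not duplicated.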
I have a Pandas DataFrame created using the cross-tabulation function:
df = pd.crosstab(db['Age Category'], db['Category'])
| Age Category | A | B | C | D |
|--------------|---|----|----|---|
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
| Above 51 | 0 | 3 | 0 | 6 |
df.dtypes gives me
Age Category
A int64
B int64
C int64
D int64
dtype: object
But when I write this to MySQL, I do not get the first column.
The output in MySQL is shown below:
| A | B | C | D |
|---|----|----|---|
| 2 | 2 | 4 | 1 |
| 7 | 11 | 12 | 5 |
| 3 | 5 | 5 | 2 |
| 2 | 4 | 1 | 7 |
| 0 | 1 | 3 | 2 |
| 0 | 0 | 2 | 3 |
| 0 | 3 | 0 | 6 |
I want to write to MySQL with the first column included.
I have created the connection using SQLAlchemy and PyMySQL:
engine = create_engine('mysql+pymysql://[user]:[passwd]@[host]:[port]/[database]')
and I am writing using DataFrame.to_sql():
df.to_sql(name='demo', con=engine, if_exists='replace', index=False)
but this is not giving me the first column in MySQL.
Thank you for your time and consideration.
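A likely cause: pd.crosstab puts the Age Category labels into the DataFrame's index, and index=False tells to_sql to drop the index. A minimal sketch of two ways to keep that column, assuming the same engine and table name as above:
# Option 1: let to_sql write the index as a column
df.to_sql(name='demo', con=engine, if_exists='replace', index=True)

# Option 2: turn the index into a regular column first, then skip the index
df.reset_index().to_sql(name='demo', con=engine, if_exists='replace', index=False)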
I am reading data from a text file in Python using pandas. There are no header values (column names) assigned to the data in the text file. I want to reshape the data into a readable form. The problem I am facing is variable column lengths.
For example, in my text file I have:
1,2,3,4,5,Hello,7,8
1,2,3,4,5,7,8,
1,2,3,4,5,7,8,
1,2,3,4,5,Hello,7,8,
Now when I create a DataFrame, I want to make sure that NaN is written in place of Hello in the rows where that value is not present. In the end, after assigning column names and rearranging, the DataFrame should look like:
1,2,3,4,5,Hello,7,8
1,2,3,4,5,"NA,"7,8,
1,2,3,4,5,"NA",7,8,
1,2,3,4,5,Hello,7,8,
Answer to the updated question, and also a generalized solution for such cases.
import numpy as np
import pandas as pd

# assumes df is the DataFrame read from the text file (header=None)
focus_col_idx = 5   # the column where NaN should appear in the expected output
last_idx = df.shape[1] - 1

# Fetch the index of rows which have NaN in the last column
idx = df[df[last_idx].isnull()].index

# Shift the column values right for those rows
df.iloc[idx, focus_col_idx + 1:] = df.iloc[idx, focus_col_idx:last_idx].values

# Put NaN in the focus column for those rows
df.iloc[idx, focus_col_idx] = np.nan

df
+---+----+---+---+---+---+-------+---+-----+
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+---+----+---+---+---+---+-------+---+-----+
| 0 | 1 | 2 | 3 | 4 | 5 | Hello | 7 | 8.0 |
| 1 | 1 | 2 | 3 | 4 | 5 | NaN | 7 | 8.0 |
| 2 | 1 | 2 | 3 | 4 | 5 | NaN | 7 | 8.0 |
| 3 | 1 | 2 | 3 | 4 | 5 | Hello | 7 | 8.0 |
+---+----+---+---+---+---+-------+---+-----+
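Note that this generalized version still assumes the misalignment comes from a single focus column and shows up as a null in the last column; rows missing values in more than one position would need a different rule.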
Answer to the previous data
Assuming only one column has a missing value (say the 2nd column, as per your previous data), here's a quick solution:
df = pd.read_csv('SO.txt', sep=',', header=None)
df
+---+---+---+---+---+------+
| | 0 | 1 | 2 | 3 | 4 |
+---+---+---+---+---+------+
| 0 | A | B | C | D | E |
| 1 | A | C | D | E | None |
+---+---+---+---+---+------+
# Fetching the index of rows which have None in last column
idx = df[df[4].isnull()].index
idx
# Int64Index([1], dtype='int64')
# Shifting the column values for those rows with index idx
df.iloc[idx,2:] = df.iloc[idx,1:4].values
df
+---+---+---+---+---+---+
| | 0 | 1 | 2 | 3 | 4 |
+---+---+---+---+---+---+
| 0 | A | B | C | D | E |
| 1 | A | C | C | D | E | # <- Notice the shifting.
+---+---+---+---+---+---+
# Putting NaN for second column where row index is idx
df.iloc[idx, 1] = np.nan
# Final output
df
+---+---+-----+---+---+---+
| | 0 | 1 | 2 | 3 | 4 |
+---+---+-----+---+---+---+
| 0 | A | B | C | D | E |
| 1 | A | NaN | C | D | E |
+---+---+-----+---+---+---+
I have some data like this:
df = pd.DataFrame({'code': ['a', 'a', 'a', 'b', 'b', 'c'],
                   'value': [1, 2, 3, 4, 2, 1]})
+-------+------+-------+
| index | code | value |
+-------+------+-------+
| 0 | a | 1 |
+-------+------+-------+
| 1 | a | 2 |
+-------+------+-------+
| 2 | a | 3 |
+-------+------+-------+
| 3 | b | 4 |
+-------+------+-------+
| 4 | b | 2 |
+-------+------+-------+
| 5 | c | 1 |
+-------+------+-------+
I want to add a column that contains the max value for each code:
| index | code | value | max |
|-------|------|-------|-----|
| 0 | a | 1 | 3 |
| 1 | a | 2 | 3 |
| 2 | a | 3 | 3 |
| 3 | b | 4 | 4 |
| 4 | b | 2 | 4 |
| 5 | c | 1 | 1 |
Is there any way to do this with pandas?
Use GroupBy.transform for new column of aggregated values:
df['max'] = df.groupby('code')['value'].transform('max')
You can try this as well:
df["max"] = df["code"].apply(lambda c: df.loc[df["code"] == c, "value"].max())
I have the following DataFrame
| name | number |
|------|--------|
| a | 1 |
| a | 1 |
| a | 1 |
| b | 2 |
| b | 2 |
| b | 2 |
| c | 3 |
| c | 3 |
| c | 3 |
| d | 4 |
| d | 4 |
| d | 4 |
I wish to merge all the rows by name, adding up their number values and keeping the totals in line with the names.
Desired output:
| name | number |
|------|--------|
| a | 3 |
| b | 6 |
| c | 9 |
| d | 12 |
It seems you need groupby with a sum aggregation:
df = df.groupby('name', as_index=False)['number'].sum()
# or
# df = df.groupby('name')['number'].sum().reset_index()
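Here as_index=False keeps name as a regular column in the result, matching the desired output; without it, name becomes the index, which is what the reset_index() variant undoes.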
Assuming DataFrame is your table name:
SELECT name, SUM(number) AS number FROM DataFrame GROUP BY name
Insert the result after deleting the original rows.