This question already has answers here:
How to select all columns whose names start with X in a pandas DataFrame
(11 answers)
Closed 2 years ago.
I have a dataframe with multiple columns and different headers.
I want to filter the dataframe to keep only the columns that start with the letter I. Some of my column headers have the letter i but start with a different letter.
Is there a way to do this?
I tried using df.filter but for some reason, it's not case sensitive.
You can use df.filter with the regex parameter:
df.filter(regex=r'(?i)^i')
This will return all columns whose names start with i or I; the inline flag (?i) makes the match case-insensitive.
Example below:
Let's consider this input dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 20, (5, 4)),
                  columns=['itest', 'Itest', 'another', 'anothericol'])
print(df)
itest Itest another anothericol
0 1 4 14 17
1 17 10 14 1
2 16 18 10 7
3 10 12 17 14
4 6 15 17 19
With df.filter
print(df.filter(regex=r'(?i)^i'))
itest Itest
0 1 4
1 17 10
2 16 18
3 10 12
4 6 15
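If you instead want a strictly case-sensitive match (only columns starting with an uppercase I), drop the (?i) flag; a minimal sketch:

```python
import numpy as np
import pandas as pd

# Same example frame as above
df = pd.DataFrame(np.random.randint(0, 20, (5, 4)),
                  columns=['itest', 'Itest', 'another', 'anothericol'])

# Without the (?i) flag the regex is case-sensitive,
# so only 'Itest' matches '^I'
print(df.filter(regex=r'^I'))
```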
This question already has answers here:
Concatenate strings from several rows using Pandas groupby
(8 answers)
Closed 2 years ago.
I have a dataframe structured like this:
df_have = pd.DataFrame({'id':[1,1,2,3,4,4], 'desc':['yes','no','chair','bird','person','car']})
How can I get something like this:
df_want = pd.DataFrame({'id':[1,2,3,4], 'desc':['yes no','chair','bird','person car']})
Use groupby().apply:
df_have.groupby('id', as_index=False)['desc'].apply(' '.join)
Output:
id desc
0 1 yes no
1 2 chair
2 3 bird
3 4 person car
Alternatively, you can use agg with groupby:
df = df_have.groupby('id',as_index=False)[['desc']].agg(' '.join)
id desc
0 1 yes no
1 2 chair
2 3 bird
3 4 person car
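Both answers give the same result. Since agg accepts any callable, the same pattern also works with a different separator; a small sketch (the comma separator is just an illustration):

```python
import pandas as pd

df_have = pd.DataFrame({'id': [1, 1, 2, 3, 4, 4],
                        'desc': ['yes', 'no', 'chair', 'bird', 'person', 'car']})

# Join the desc strings within each id group with a comma instead of a space
df_want = df_have.groupby('id', as_index=False)['desc'].agg(', '.join)
print(df_want)
```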
I have a dataframe called 'data':
USER VALUE
XOXO 21
ABC-1 2
ABC-1B 4
ABC-2 4
ABC-2B 6
PEPE 12
I want to combine 'ABC-1' with 'ABC-1B' into a single row using the first USER name and then averaging the two values to arrive here:
USER VALUE
XOXO 21
ABC-1 3
ABC-2 5
PEPE 12
The dataframe may not be in order, and it also contains unrelated rows that don't need averaging. I only want to average the two rows where the USER 'XXX-X' matches 'XXX-XB'.
data = pd.DataFrame({'USER':['XOXO','ABC-1','ABC-1B','ABC-2','ABC-2B', 'PEPE'], 'VALUE':[21,2,4,4,6,12]})
Let's try stripping the trailing B from the user names, then grouping:
df = data.copy()
df.USER = df.USER.str.replace(r'(-\d)B', r'\1', regex=True)
df = df.groupby("USER", as_index=False, sort=False).VALUE.mean()
print(df)
USER VALUE
0 XOXO 21
1 ABC-1 3
2 ABC-2 5
3 PEPE 12
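Putting the answer together as a self-contained sketch (note the raw string and the explicit regex=True, which newer pandas versions require for pattern replacement):

```python
import pandas as pd

data = pd.DataFrame({'USER': ['XOXO', 'ABC-1', 'ABC-1B', 'ABC-2', 'ABC-2B', 'PEPE'],
                     'VALUE': [21, 2, 4, 4, 6, 12]})

# Normalise 'ABC-1B' to 'ABC-1' so each pair shares a key,
# then average the VALUEs within each key
data['USER'] = data['USER'].str.replace(r'(-\d)B$', r'\1', regex=True)
result = data.groupby('USER', as_index=False, sort=False)['VALUE'].mean()
print(result)
```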
I have two data sets with different numbers of rows and columns, but with common ids.
Question: I want to combine both dataframes into a new dataframe that has the same number of rows as df1 plus an extra Age column, with the age values filled in according to the id.
Example:
data = [[1,'Alex',10],[2,'Bob',12],[3,'Clarke',13],[1,'Alex',13],[4,'Jim',13], [3,'Clarke',13]]
df1 = pd.DataFrame(data,columns=['id', 'Name','Score'],dtype=int)
data2 = [[1, 20],[2, 22],[3, 19],[4, 21]]
df2 = pd.DataFrame(data2,columns=['id','Age'],dtype=int)
No clue where to start
New to python, please help!
Expected Output:
id Name Score Age
0 1 Alex 10 20
1 2 Bob 12 22
2 3 Clarke 13 19
3 1 Alex 13 20
4 4 Jim 13 21
5 3 Clarke 13 19
Try this one:
>>> pd.merge(df1, df2, on="id")
id Name Score Age
0 1 Alex 10 20
1 1 Alex 13 20
2 2 Bob 12 22
3 3 Clarke 13 19
4 3 Clarke 13 19
5 4 Jim 13 21
Try "merge".
You should be able to join the two dataframes by writing:
combined_data = df1.merge(df2, on="id")
The merge function combines the tables, and the on parameter determines which column to merge them on.
You can use the merge function to combine two dataframes if they have at least one column in common. In your case it's the id, so we merge 'on' id like so:
data = [[1,'Alex',10],[2,'Bob',12],[3,'Clarke',13],[1,'Alex',13],[4,'Jim',13], [3,'Clarke',13]]
df1 = pd.DataFrame(data,columns=['id', 'Name','Score'],dtype=int)
data2 = [[1, 20],[2, 22],[3, 19],[4, 21]]
df2 = pd.DataFrame(data2,columns=['id','Age'],dtype=int)
merged_df = df1.merge(df2, on="id")
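One caveat: the merged output shown above comes back sorted by id, while the expected output keeps df1's original row order. A left merge preserves the left frame's key order, so a sketch that reproduces the expected output exactly:

```python
import pandas as pd

data = [[1, 'Alex', 10], [2, 'Bob', 12], [3, 'Clarke', 13],
        [1, 'Alex', 13], [4, 'Jim', 13], [3, 'Clarke', 13]]
df1 = pd.DataFrame(data, columns=['id', 'Name', 'Score'])
df2 = pd.DataFrame([[1, 20], [2, 22], [3, 19], [4, 21]], columns=['id', 'Age'])

# how='left' keeps every df1 row in its original position
merged = df1.merge(df2, on='id', how='left')
print(merged)
```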
This question already has answers here:
How to delete rows from a pandas DataFrame based on a conditional expression [duplicate]
(6 answers)
How do you filter pandas dataframes by multiple columns?
(10 answers)
Closed 4 years ago.
I want to drop rows with zero value in specific columns
>>> df
salary age gender
0 10000 23 1
1 15000 34 0
2 23000 21 1
3 0 20 0
4 28500 0 1
5 35000 37 1
Some of the data in the salary and age columns is missing, recorded as 0.
The third column, gender, is a binary variable where 1 means male and 0 means female, so a 0 there is not missing data.
I want to drop the rows where either salary or age is missing, so I can get:
>>> df
salary age gender
0 10000 23 1
1 15000 34 0
2 23000 21 1
3 35000 37 1
Option 1
You can filter your dataframe using pd.DataFrame.loc:
df = df.loc[~((df['salary'] == 0) | (df['age'] == 0))]
Option 2
Or a smarter way to implement your logic:
df = df.loc[df['salary'] * df['age'] != 0]
This works because if either salary or age are 0, their product will also be 0.
Option 3
The following method can be easily extended to several columns:
df.loc[(df[['salary', 'age']] != 0).all(axis=1)]
Explanation
In all 3 cases, Boolean arrays are generated which are used to index your dataframe.
All these methods can be further optimised by using numpy representation, e.g. df['salary'].values.
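As a concrete sketch of that last point, here is Option 3 rewritten over the underlying numpy array:

```python
import pandas as pd

df = pd.DataFrame({'salary': [10000, 15000, 23000, 0, 28500, 35000],
                   'age': [23, 34, 21, 20, 0, 37],
                   'gender': [1, 0, 1, 0, 1, 1]})

# Boolean mask over the raw numpy values: True where neither column is 0
mask = (df[['salary', 'age']].values != 0).all(axis=1)
print(df[mask])
```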