I have a df that looks like the one below, but much bigger. Some of the dates under the lastDate column are incorrect, and they are only incorrect if there is a value in the fixedDate column right next to them.
import pandas as pd

dff = pd.DataFrame(
    {"lastDate": ['2016-3-27', '2016-4-11', '2016-3-27', '2016-3-27', '2016-5-25', '2016-5-31'],
     "fixedDate": ['2016-1-3', '', '2016-1-18', '2016-4-5', '2016-2-27', ''],
     "analyst": ['John Doe', 'Brad', 'John', 'Frank', 'Claud', 'John Doe']})
The first table is what I have, and the second is what I'd like to have after the loop.
First convert these columns to datetime dtypes:
for col in ['fixedDate', 'lastDate']:
    df[col] = pd.to_datetime(df[col])
Then you could use
mask = pd.notnull(df['fixedDate'])
df.loc[mask, 'lastDate'] = df['fixedDate']
For example,
import pandas as pd
df = pd.DataFrame( {"lastDate":['2016-3-27', '2016-4-11', '2016-3-27', '2016-3-27', '2016-5-25', '2016-5-31'], "fixedDate":['2016-1-3', '', '2016-1-18', '2016-4-5', '2016-2-27', ''], "analyst":['John Doe', 'Brad', 'John', 'Frank', 'Claud', 'John Doe'] })
for col in ['fixedDate', 'lastDate']:
df[col] = pd.to_datetime(df[col])
mask = pd.notnull(df['fixedDate'])
df.loc[mask, 'lastDate'] = df['fixedDate']
print(df)
yields
analyst fixedDate lastDate
0 John Doe 2016-01-03 2016-01-03
1 Brad NaT 2016-04-11
2 John 2016-01-18 2016-01-18
3 Frank 2016-04-05 2016-04-05
4 Claud 2016-02-27 2016-02-27
5 John Doe NaT 2016-05-31
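As a side note (not part of the answer above): once both columns are datetimes, the same fix can be written without an explicit mask, since fillna falls back to lastDate wherever fixedDate is NaT. A minimal sketch:

# Equivalent one-liner, assuming both columns were already converted
# with pd.to_datetime: keep fixedDate where present, else lastDate.
df['lastDate'] = df['fixedDate'].fillna(df['lastDate'])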
Good Morning,
This is my code:
import pandas as pd

data = {'Names_Males_GroupA': ['Robert', 'Andrew', 'Gordon', 'Steve'],
        'Names_Females_GroupA': ['Brenda', 'Sandra', 'Karen', 'Megan'],
        'Name_Males_GroupA': ['David', 'Patricio', 'Noe', 'Daniel']}
df = pd.DataFrame(data)
df
Since Name_Males_GroupA has an error (it is missing an 's'), I need to move all of its values to the correct column, which is Names_Males_GroupA. In other words, I want to add the names David, Patricio, Noe and Daniel below the names Robert, Andrew, Gordon and Steve.
After that I can delete the wrong column.
Thank you.
If I understand you correctly, you can try
df = pd.concat([df.iloc[:, :2], df.iloc[:, 2].to_frame('Names_Males_GroupA')], ignore_index=True)
print(df)
Names_Males_GroupA Names_Females_GroupA
0 Robert Brenda
1 Andrew Sandra
2 Gordon Karen
3 Steve Megan
4 David NaN
5 Patricio NaN
6 Noe NaN
7 Daniel NaN
I would break them apart and put them back together with a pd.concat
import pandas as pd

data = {'Names_Males_GroupA': ['Robert', 'Andrew', 'Gordon', 'Steve'],
        'Names_Females_GroupA': ['Brenda', 'Sandra', 'Karen', 'Megan'],
        'Name_Males_GroupA': ['David', 'Patricio', 'Noe', 'Daniel']}
df = pd.DataFrame(data)
# Slice out the misnamed column, rename it so the pieces line up, then stack.
df1 = df[['Name_Males_GroupA', 'Names_Females_GroupA']].copy()
df1.columns = ['Names_Males_GroupA', 'Names_Females_GroupA']
df = df[['Names_Males_GroupA', 'Names_Females_GroupA']]
pd.concat([df, df1])
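One caveat with this version (my note, not the original answer's): pd.concat keeps the original row labels, so the result carries the 0-3 index twice. If you want a fresh RangeIndex, as in the first answer, pass ignore_index=True:

# Renumber the rows 0..7 instead of repeating the 0..3 labels twice.
pd.concat([df, df1], ignore_index=True)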
I keep running into this use case and I haven't found a good solution. I am asking for a solution in Python, but a solution in R would also be helpful.
I've been getting data that looks something like this:
import pandas as pd
data = {'Col1': ['Bob', '101', 'First Street', '',
                 'Sue', '102', 'Second Street', '',
                 'Alex', '200', 'Third Street', '']}
df = pd.DataFrame(data)
             Col1
0             Bob
1             101
2    First Street
3
4             Sue
5             102
6   Second Street
7
8            Alex
9             200
10   Third Street
11
The pattern in my real data does repeat like this. Sometimes there is a blank row (or more than 1), and sometimes there are not any blank rows. The important part here is that I need to convert this column into a row.
I want the data to look like this.
Name Address Street
0 Bob 101 First Street
1 Sue 102 Second Street
2 Alex 200 Third Street
I have tried playing around with this, but nothing has worked. My thought was to iterate through a few rows at a time, assign the values to the appropriate column, and just build a data frame row by row.
x = len(df['Col1'])
holder = pd.DataFrame()
new_df = pd.DataFrame()
while x < 4:
    temp = df.iloc[:5]
    holder['Name'] = temp['Col1'].iloc[0]
    holder['Address'] = temp['Col1'].iloc[1]
    holder['Street'] = temp['Col1'].iloc[2]
    new_df = pd.concat([new_df, holder])
    df = temp[5:]
    df.reset_index()
    holder = pd.DataFrame()
    x = len(df['Col1'])
new_df.head(10)
In R,
data <- data.frame(
Col1 = c('Bob', '101', 'First Street', '', 'Sue', '102', 'Second Street', '', 'Alex' , '200', 'Third Street', '')
)
k<-which(grepl("Street", data$Col1) == TRUE)
j <- k-1
i <- k-2
data.frame(
Name = data[i,],
Adress = data[j,],
Street = data[k,]
)
   Name Address        Street
1   Bob     101  First Street
2   Sue     102 Second Street
3  Alex     200  Third Street
Or, if the street does not always end with "Street" but the address is always a number, you can also try
j <- which(apply(data, 1, function(x) !is.na(as.numeric(x)) ))
i <- j-1
k <- j+1
In Python 3, you can convert your DataFrame into an array and then reshape it.
n = df.shape[0]
df2 = pd.DataFrame(
data=df.to_numpy().reshape((n//4, 4), order='C'),
columns=['Name', 'Address', 'Street', 'Empty'])
For your sample data, this produces:
Name Address Street Empty
0 Bob 101 First Street
1 Sue 102 Second Street
2 Alex 200 Third Street
If you like you can remove the last column:
df2 = df2.drop(['Empty'], axis=1)
Name Address Street
0 Bob 101 First Street
1 Sue 102 Second Street
2 Alex 200 Third Street
As a one-liner:
df2 = pd.DataFrame(data=df.to_numpy().reshape((df.shape[0]//4, 4), order='C'),
                   columns=['Name', 'Address', 'Street', 'Empty']).drop(['Empty'], axis=1)
Name Address Street
0 Bob 101 First Street
1 Sue 102 Second Street
2 Alex 200 Third Street
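The question mentions that the number of blank rows varies and can even be zero, in which case a fixed 4-column reshape breaks. A minimal sketch of a workaround, assuming every record is always exactly Name, Address, Street and blanks only ever appear between records:

# Drop the blank separator rows first, then reshape by 3 fields per record.
vals = df.loc[df['Col1'] != '', 'Col1'].to_numpy()
df3 = pd.DataFrame(vals.reshape(-1, 3), columns=['Name', 'Address', 'Street'])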
In Python, I believe this may help you.
import pandas as pd

data = {'Col1': ['Bob', '101', 'First Street', '',
                 'Sue', '102', 'Second Street', '',
                 'Alex', '200', 'Third Street', '']}

# Keep the first three values of every group of four
# (the fourth is the blank separator).
var = list(data.values())[0]
var2 = []
for aux in range(int(len(var)/4)):
    var2.append(var[aux*4: aux*4+3])
data = pd.DataFrame(var2, columns=['Name', 'Address', 'Street'])
print(data)
Another R solution, this one based on the tidyverse package. The example data frame data is from Park's post (https://stackoverflow.com/a/69833814/7669809).
library(tidyverse)
data2 <- data %>%
mutate(ID = cumsum(Col1 %in% "")) %>%
filter(!Col1 %in% "") %>%
group_by(ID) %>%
mutate(Type = case_when(
row_number() == 1L ~"Name",
row_number() == 2L ~"Address",
row_number() == 3L ~"Street",
TRUE ~NA_character_
)) %>%
pivot_wider(names_from = "Type", values_from = "Col1") %>%
ungroup()
data2
# # A tibble: 3 x 4
# ID Name Address Street
# <int> <chr> <chr> <chr>
# 1 0 Bob 101 First Street
# 2 1 Sue 102 Second Street
# 3 2 Alex 200 Third Street
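For pandas users, here is a rough translation of the same cumulative-sum idea (a sketch under the same assumptions as above: records are separated by blank rows and always hold exactly three fields):

import pandas as pd

df = pd.DataFrame({'Col1': ['Bob', '101', 'First Street', '',
                            'Sue', '102', 'Second Street', '',
                            'Alex', '200', 'Third Street', '']})

blank = df['Col1'].eq('')
out = (df.assign(ID=blank.cumsum())   # label each record by counting separators
         .loc[~blank]                 # drop the blank rows
         .assign(field=lambda d: d.groupby('ID').cumcount()
                                  .map({0: 'Name', 1: 'Address', 2: 'Street'}))
         .pivot(index='ID', columns='field', values='Col1')
         .reset_index(drop=True)[['Name', 'Address', 'Street']])
print(out)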
The values of the DataFrame are reshaped into an unknown number of rows and 4 columns; slicing then takes the first 3 columns of the whole array and converts them into a DataFrame; finally, the columns of the DataFrame are renamed with set_axis.
result = pd.DataFrame(df.values.reshape(-1, 4)[:, :-1])\
.set_axis(['Name', 'Address', 'Street'], axis=1)
result
>>>
Name Address Street
0 Bob 101 First Street
1 Sue 102 Second Street
2 Alex 200 Third Street
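Like the reshape answer above, this assumes the frame length is an exact multiple of 4 (three fields plus one blank per record). A quick sanity check before reshaping doesn't hurt:

# Guard against a trailing record that is missing its blank separator row.
assert len(df) % 4 == 0, f"expected groups of 4 rows, got {len(df)} rows"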
I have to extract the rows of a pandas dataframe whose 'Date of birth' column value occurs in a list of dates.
import pandas as pd
df = pd.DataFrame({'Name': ['Jack', 'Mary', 'David', 'Bruce', 'Nick', 'Mark', 'Carl', 'Sofie'],
'Date of birth': ['1973', '1999', '1995', '1992/1991', '2000', '1969', '1994', '1989/1990']})
dates = ['1973', '1992', '1969', '1989']
new_df = df.loc[df['Date of birth'].isin(dates)]
print(new_df)
    Name Date of birth
0   Jack          1973
1   Mary          1999
2  David          1995
3  Bruce     1992/1991
4   Nick          2000
5   Mark          1969
6   Carl          1994
7  Sofie     1989/1990
Eventually I get the table below. As you can see, Bruce's and Sofie's rows are absent, since their values are followed by a slash and another value. How should I split up these values so that rows like these are matched as well?
Name Date of birth
0 Jack 1973
5 Mark 1969
You could use str.contains:
import pandas as pd
df = pd.DataFrame({'Name': ['Jack', 'Mary', 'David', 'Bruce', 'Nick', 'Mark', 'Carl', 'Sofie'],
'Date of birth': ['1973', '1999', '1995', '1992/1991', '2000', '1969', '1994', '1989/1990']})
dates = ['1973', '1992', '1969', '1989']
new_df = df.loc[df['Date of birth'].str.contains(rf"\b{'|'.join(dates)}\b")]
print(new_df)
Output
Name Date of birth
0 Jack 1973
3 Bruce 1992/1991
5 Mark 1969
7 Sofie 1989/1990
The string rf"\b{'|'.join(dates)}\b" is a regex pattern that will match any of string that contains any of the dates.
I like @DaniMesejo's way better, but here is an approach that splits up the values and stacks them:
# Note: max(level=0) was removed in recent pandas; group on the index instead.
df[df['Date of birth'].str.split('/', expand=True).stack().isin(dates).groupby(level=0).max()]
Output:
Name Date of birth
0 Jack 1973
3 Bruce 1992/1991
5 Mark 1969
7 Sofie 1989/1990
I'm trying to apply to my pandas dataframe something similar to R's tidyr::spread. I saw people using pd.pivot in some places, but so far I have had no success.
So in this example I have the following dataframe DF:
df = pd.DataFrame({'action_id' : [1,2,1,4,5],
'name': ['jess', 'alex', 'jess', 'cath', 'mary'],
'address': ['house', 'house', 'park', 'park', 'park'],
'date': [ '01/01', '02/01', '03/01', '04/01', '05/01']})
Here is what it looks like:
OK, so what I want is a multi-index pivot table with 'action_id' and 'name' as the index, "spreading" the address column and filling it with the 'date' column. My df would then look like this:
What I tried was:
df.pivot(index = ['action_id', 'name'], columns = 'address', values = 'date')
And I got the error TypeError: MultiIndex.name must be a hashable type
Does anyone know what I am doing wrong?
You do not need to specify the index in pd.pivot. The following will work:
import pandas as pd
df = pd.DataFrame({'action_id' : [1,2,1,4,5],
'name': ['jess', 'alex', 'jess', 'cath', 'mary'],
'address': ['house', 'house', 'park', 'park', 'park'],
'date': [ '01/01', '02/01', '03/01', '04/01', '05/01']})
df = pd.concat([df, pd.pivot(data=df, index=None, columns='address', values='date')], axis=1) \
.reset_index(drop=True).drop(['address','date'], axis=1)
print(df)
action_id name house park
0 1 jess 01/01 NaN
1 2 alex 02/01 NaN
2 1 jess NaN 03/01
3 4 cath NaN 04/01
4 5 mary NaN 05/01
And to arrive at what you want, you need to do a groupby
df = df.groupby(['action_id','name']).agg({'house':'first','park':'first'}).reset_index()
print(df)
action_id name house park
0 1 jess 01/01 03/01
1 2 alex 02/01 NaN
2 4 cath NaN 04/01
3 5 mary NaN 05/01
Don't forget to accept the answer if it helped you.
Another option:
df2 = df.set_index(['action_id','name','address']).date.unstack().reset_index()
df2.columns.name = None
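For completeness, pivot_table collapses the duplicate (action_id, name) pairs in one step, and recent pandas versions (1.1+) also accept a list of index columns in df.pivot directly. A sketch of the pivot_table route:

# aggfunc='first' keeps the (string) date for each action_id/name/address combo.
df2 = (df.pivot_table(index=['action_id', 'name'], columns='address',
                      values='date', aggfunc='first')
         .reset_index())
df2.columns.name = None
print(df2)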
I have a data frame, df, like this:
import pandas as pd

data = {'A': ['Jason (121439)', 'Molly (194439)', 'Tina (114439)', 'Jake (127859)', 'Amy (122579)'],
        'B': ['Bob (127439)', 'Mark (136489)', 'Tyler (121443)', 'John (126259)', 'Anna(174439)'],
        'C': ['Jay (121596)', 'Ben (12589)', 'Toom (123586)', 'Josh (174859)', 'Al(121659)'],
        'D': ['Paul (123839)', 'Aaron (124159)', 'Steve (161899)', 'Vince (179839)', 'Ron (128379)']}
df = pd.DataFrame(data)
And I want to create a new data frame with one column holding the name and another column holding the number between parentheses, which would look like this:
data2 = {'Name': ['Jason ', 'Molly ', 'Tina ', 'Jake ', 'Amy '],
'ID#': ['121439', '194439', '114439', '127859', '122579']}
result = pd.DataFrame(data2)
I tried different things, but none of them worked:
1)
List_name=pd.DataFrame()
List_id=pd.DataFrame()
List_both=pd.DataFrame(columns=["Name","ID"])
for i in df.columns:
    left = df[i].str.split("(", 1).str[0]
    right = df[i].str.split("(", 1).str[1]
    List_name = List_name.append(left)
    List_id = List_id.append(right)
List_both = pd.concat([List_name, List_id], axis=1)
List_both
2) Applying a function to all cells:
Names = lambda x: x.str.split("(",1).str[0]
IDS = Names = lambda x: x.str.split("(",1).str[1]
But I was wondering how to do it so that the output is stored in a data frame that looks like result...
You can use stack followed by str.extract.
(df.stack()
.str.strip()
.str.extract(r'(?P<Name>.*?)\s*\((?P<ID>.*?)\)$')
.reset_index(drop=True))
Name ID
0 Jason 121439
1 Bob 127439
2 Jay 121596
3 Paul 123839
4 Molly 194439
5 Mark 136489
6 Ben 12589
7 Aaron 124159
8 Tina 114439
9 Tyler 121443
10 Toom 123586
11 Steve 161899
12 Jake 127859
13 John 126259
14 Josh 174859
15 Vince 179839
16 Amy 122579
17 Anna 174439
18 Al 121659
19 Ron 128379
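If, as in the desired result, only column A is wanted, the same extract can be applied to that single column without stacking (same regex as above, my variation on the answer):

# Extract Name and ID from column 'A' only.
result = df['A'].str.strip().str.extract(r'(?P<Name>.*?)\s*\((?P<ID>.*?)\)$')
print(result)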