I need to merge two pandas data frames using a column which contains numerical values.
For example, the two data frames could be like the following ones:
data frame "a"
a1 b1
0 "x" 13560
1 "y" 193309
2 "z" 38090
3 "k" 37212
data frame "b"
a2 b2
0 "x" 13,56
1 "y" 193309
2 "z" 38,09
3 "k" 37212
What I need to do is merge a with b on column b1/b2.
The problem is that, as you can see, some values of data frame b are a little different. First of all, b's values are not integers but strings, and furthermore, the values which end with 0 are "rounded" (13560 --> 13,56).
What I've tried to do is replace the comma and then cast to int, but it doesn't work; more precisely, this procedure doesn't add back the missing zero.
This is the code that I've tried:
b['b2'] = b['b2'].str.replace(",", "")
b['b2'] = b['b2'].astype(np.int64) # np is numpy
Is there any procedure that I can use to fix this problem?
I believe you need to create a boolean mask to specify which values have to be multiplied:
# or add the parameter thousands=',' to read_csv, as suggested by @Inder
b['b2'] = b['b2'].str.replace(",", "", regex=True).astype(np.int64)
# in this data, values below 10000 are the "rounded" ones; restore the trailing zero
mask = b['b2'] < 10000
b['b2'] = np.where(mask, b['b2'] * 10, b['b2'])
print (b)
a2 b2
0 x 13560
1 y 193309
2 z 38090
3 k 37212
Correcting the column first with apply and a lambda function:
# a value containing a comma lost its trailing zero, so multiply it by 10
b.b2 = b.b2.apply(lambda x: int(x.replace(',', '')) * 10 if ',' in x else int(x))
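With b2 repaired, the merge itself is straightforward; a minimal sketch, assuming a and b are the frames shown above (an inner equality join on b1/b2):
merged = a.merge(b, left_on='b1', right_on='b2')
print(merged)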
I have a 1st DataFrame with column 'X' as:
X
A468593-3
A697269-2
A561044-2
A239882 04
and a 2nd DataFrame with column 'Y' as:
Y
000A561044
000A872220
I would like to match substrings from both columns, requiring a minimum number of characters (for example 7; only alphanumeric characters should be considered for matching, with all special characters excluded).
So my output DataFrame should be like this:
X
A561044-2
Any possible solution would be highly appreciated.
Thanks in advance.
IIUC, and assuming that all values of Y start with three zeros, you can slice Y with [3:] to remove the leading zeros. Then, you can join these values with |. Finally, you can create your mask using contains, which checks whether a series contains a specified pattern (in your case you would have something like 'A|B' and check whether a value contains 'A' or 'B'). This mask can then be used to filter your other data frame.
Code:
import pandas as pd
df1 = pd.DataFrame({"X": ["A468593-3", "A697269-2", "A561044-2", "A239882 04"]})
df2 = pd.DataFrame({"Y": ["000A561044", "000A872220"]})
mask = df1["X"].str.contains(f'({"|".join(df2["Y"].str[3:])})')
df1.loc[mask]
Output:
X
2 A561044-2
If you have values in Y that do not start with exactly three zeros, you can instead use this function to clean your column by stripping all leading numeric characters.
def remove_first_numerics(s):
    counter = 0
    while s[counter].isnumeric():
        counter += 1
    return s[counter:]
df_test = pd.DataFrame({"A": ["01Abd3Dc", "3Adv3Dc", "d31oVgZ", "10dZb1B", "CCcDx10"]})
df_test["A"].apply(lambda s: remove_first_numerics(s))
Output:
0 Abd3Dc
1 Adv3Dc
2 d31oVgZ
3 dZb1B
4 CCcDx10
Name: A, dtype: object
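If the characters to strip are always leading digits, the same cleanup can be done without a Python-level loop; a small sketch using pandas' Series.str.lstrip, which strips any of the listed characters from the left end:
df_test["A"].str.lstrip("0123456789")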
I read in a file and created a DataFrame from it. The problem is that not all of the information was separated properly, and the rows are not all the same length. I have a df that has 1600 columns, but I do not need them all; I specifically need the information that is 3 columns away from a particular string in one of the columns. For example:
In the 1st row, column number 1000 has a value of ['HFOBR'], and I need the column value 3 away from it.
In the 2nd row, the column with ['PQOBR'] might be number 799, but I still need the value 3 columns away.
In the 3rd row, the column might be number 400 with ['BBSOBR'], but I still need the value 3 columns away.
And so on. I really am trying to search each row for the partial string OBR, then take the value 3 columns over from it and put that value in a new df with a column of its own.
Here is the code I used to create the dataframe in the first place, where I read in an HL7 file and tried to convert it to a DataFrame; the HL7 messages are not all the same length, which is causing part of the problem I am having:
import hl7  # python-hl7
import pandas as pd

message = []
parsed_msg = []
with open(filename) as msgs:
    start = False
    for line in msgs.readlines():
        if line[:3] == 'MSH':
            if start:
                parsed_msg = hl7.parse_batch(msg)
                # print(parsed_msg)
                start = False
                message += parsed_msg
            msg = line
            start = True
        else:
            msg += line
    # the loop above never flushes the last message; parse it here too
    if start:
        message += hl7.parse_batch(msg)
df = pd.DataFrame(message)
Sample data:
df = pd.DataFrame([["HFOBR", "foo", "a", "b", "c"], ["foo", "PQOBR", "a", "b", "c"]])
df
0 1 2 3 4
0 HFOBR foo a b c
1 foo PQOBR a b c
Define a function to find the value offset by 3 columns from the first column containing a string with "OBR":
import numpy as np

def find_left_value(row):
    obr_col_idx = np.where(row.str.contains("OBR"))[0]
    left_col_idx = obr_col_idx + 3
    return row[left_col_idx].iloc[0]
Apply this function to your dataframe:
df['result'] = df.apply(find_left_value, axis=1)
Resulting dataframe:
0 1 2 3 4 result
0 HFOBR foo a b c b
1 foo PQOBR a b c c
FYI: making sample data like this that people can test answers on will help you 1) define your problem more clearly, and 2) get answers.
I have a DataFrame with thousands of rows. Its structure is as below:
A B C D
0 q 20 'f'
1 q 14 'd'
2 o 20 'a'
I want to compare the A column of the current row and the next row. If those values are equal, I want to take the lower B value and put it into the D column of the row with the greater B value. Then I want to remove the row whose B value was moved. It's like a swap process.
A B C D
0 q 20 'f' 14
1 o 20 'a'
I have thousands of rows, and the iloc, loc, and at methods work slowly. At the very least I want to use the DataFrame apply method. I tried some code samples, but they didn't work.
I want to do something like below:
DataFrame.apply(lambda row: self.compare(row, next(row)), axis=1)
I have a compare method, but I couldn't pass the next row to it. How can I pass it to the method? I am also open to hearing faster pandas solutions.
Best not to do that with apply as it will be slow; you can look at using shift, e.g.
df['A_shift'] = df['A'].shift(1)
df['Is_Same'] = 0
df.loc[df.A_shift == df.A, 'Is_Same'] = 1
Gets a bit more complicated if you're doing the shift within groups, but still possible.
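Building on that, here is a minimal sketch of the whole swap with shift, assuming matching rows are adjacent and the first row of each pair holds the greater B value (as in the example):
import pandas as pd

df = pd.DataFrame({"A": ["q", "q", "o"],
                   "B": [20, 14, 20],
                   "C": ["f", "d", "a"]})

# rows whose A matches the next row's A get that next row's B in a new D column
same_as_next = df["A"].eq(df["A"].shift(-1))
df.loc[same_as_next, "D"] = df["B"].shift(-1)

# drop the rows whose B value was moved up (their A matches the previous row's A)
out = df[~df["A"].eq(df["A"].shift(1))].reset_index(drop=True)
print(out)
#    A   B  C     D
# 0  q  20  f  14.0
# 1  o  20  a   NaN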
I am struggling to understand how df.apply() exactly works.
My problem is as follows: I have a dataframe df. Now I want to search several columns for certain strings. If the string is found in any of the columns, I want to add a "label" (in a new column) for each row where it is found.
I am able to solve the problem with map and applymap (see below).
However, I would expect the better solution to be apply, as it applies a function to an entire column.
Question: Is this not possible using apply? Where is my mistake?
Here are my solutions for using map and applymap.
import re
import pandas as pd

df = pd.DataFrame([list("ABCDZ"), list("EAGHY"), list("IJKLA")], columns=["h1", "h2", "h3", "h4", "h5"])
Solution using map
def setlabel_func(column):
    return df[column].str.contains("A")

mask = sum(map(setlabel_func, ["h1", "h5"]))
df.loc[mask == 1, "New Column"] = "Label"  # .ix is gone from modern pandas; use .loc
Solution using applymap
mask = df[["h1", "h5"]].applymap(lambda el: bool(re.match("A", el))).T.any()
df.loc[mask, "New Column"] = "Label"
For apply I don't know how to pass the two columns into the function, or maybe I don't understand the mechanics at all ;-)
def setlabel_func(column):
    return df[column].str.contains("A")

df.apply(setlabel_func(["h1", "h5"]), axis=1)
The above gives me this error:
'DataFrame' object has no attribute 'str'
Any advice? Please note that the search function in my real application is more complex and requires a regex, which is why I use .str.contains in the first place.
Another solution is to use DataFrame.any to get at least one True per row:
print (df[['h1', 'h5']].apply(lambda x: x.str.contains('A')))
h1 h5
0 True False
1 False False
2 False True
print (df[['h1', 'h5']].apply(lambda x: x.str.contains('A')).any(axis=1))
0 True
1 False
2 True
dtype: bool
df['new'] = np.where(df[['h1','h5']].apply(lambda x: x.str.contains('A')).any(axis=1),
                     'Label', '')
print (df)
h1 h2 h3 h4 h5 new
0 A B C D Z Label
1 E A G H Y
2 I J K L A Label
mask = df[['h1', 'h5']].apply(lambda x: x.str.contains('A')).any(axis=1)
df.loc[mask, 'New'] = 'Label'
print (df)
h1 h2 h3 h4 h5 New
0 A B C D Z Label
1 E A G H Y NaN
2 I J K L A Label
pd.DataFrame.apply iterates over each column, passing the column as a pd.Series to the function being applied. In your case, the function you're trying to apply doesn't lend itself to being used in apply.
Do this instead to get your idea to work
mask = df[['h1', 'h5']].apply(lambda x: x.str.contains('A').any(), axis=1)
df.loc[mask, 'New Column'] = 'Label'
h1 h2 h3 h4 h5 New Column
0 A B C D Z Label
1 E A G H Y NaN
2 I J K L A Label
IIUC you can do it this way:
In [23]: df['new'] = np.where(df[['h1','h5']].apply(lambda x: x.str.contains('A')).sum(axis=1) > 0,
                              'Label', '')
In [24]: df
Out[24]:
h1 h2 h3 h4 h5 new
0 A B C D Z Label
1 E A G H Y
2 I J K L A Label
Others have given good alternative methods. Here is a way to use apply "row-wise" (axis=1) to get your new column indicating the presence of "A" across a bunch of columns.
When you are passed a row, you can just join its strings together into one big string and then use a membership test ("in"); see below. Here I am combining all columns, but you can do it with just h1 and h5 easily.
df = pd.DataFrame([list("ABCDZ"),list("EAGHY"), list("IJKLA")], columns = ["h1","h2","h3","h4", "h5"])
def dothat(row):
    sep = ""
    return "A" in sep.join(row['h1':'h5'])
df['NewColumn'] = df.apply(dothat,axis=1)
This just squashes each row into one string (e.g. ABCDZ) and looks for "A". It is not that efficient, though: if you want to quit the first time the string is found, combining all the columns can be a waste of time. You could easily change the function to look column by column and quit (return True) when it finds a hit, as in the sketch below.
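A minimal sketch of that short-circuiting variant (the name dothat_short_circuit is made up here):
def dothat_short_circuit(row):
    # check each column in turn and stop at the first hit
    for val in row['h1':'h5']:
        if "A" in val:
            return True
    return False

df['NewColumn'] = df.apply(dothat_short_circuit, axis=1)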
As part of trying to learn pandas I'm trying to reshape a spreadsheet. After removing non-zero values I need to get some data from a single column.
For the sample columns below, I want to find the most effective way of finding the row and column index of the cell that contains the value date, and to get the value next to it (e.g. here it would be 38477).
In practice this would be a much bigger DataFrame; the date row could change, and it may not always be in the first column.
What is the best way to find out where date is in the array and return the value in the adjacent cell?
Thanks
          0        1        2        4        5        7        8       10
1 some title
2 date 38477
5 cat1 cat2 cat3 cat4
6 a b c d e f g
8 Z 167.9404 151.1389 346.197 434.3589 336.7873 80.52901 269.1486
9 X 220.683 56.0029 73.73679 428.8939 483.7445 251.1877 243.7918
10 C 433.0189 390.1931 251.6636 418.6703 12.21859 113.093 136.28
12 V 226.0135 418.1141 310.2038 153.9018 425.7491 73.08073 277.5065
13 W 295.146 173.2747 2.187459 401.6453 51.47293 175.387 397.2021
14 S 306.9325 157.2772 464.1394 216.248 478.3903 173.948 328.9304
15 A 19.86611 73.11554 320.078 199.7598 467.8272 234.0331 141.5544
This really just reformats a lot of the iteration you are doing to make it clearer and takes advantage of pandas' ability to easily select, etc.
First, we need a dummy dataframe (with date in the last row, and columns explicitly ordered the way you have in your setup):
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, np.nan],
                   "B": [5, 3, np.nan, 3, "date"],
                   "C": [np.nan, 2, 1, 3, 634]})[["A", "B", "C"]]
A clear way to do it is to find the row and then enumerate over the row to find date:
row = df[df.apply(lambda x: (x == "date").any(), axis=1)].values[0]  # will be an array
for i, val in enumerate(row):
    if val == "date":
        print(row[i + 1])
        break
If your spreadsheet only has a few non-numeric columns, you could go by column, check for date and get a row and column index (this may be faster because it searches by column rather than by row, though I'm not sure)
# gives you column labels, which are `True` if at least one entry has `date` in it
# (have to check `kind`, otherwise you get an error)
col_result = df.apply(lambda x: x.dtype.kind == "O" and (x == "date").any())

# select only the columns where this is True (there should be one entry) and get the label
column = col_result[col_result].index[0]
col_index = df.columns.get_loc(column)

# will be True where the column contains date (.icol was removed; .iloc[:, i] is the replacement)
row_selector = df.iloc[:, col_index] == "date"
print(df[row_selector].iloc[:, col_index + 1].values)
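If you prefer not to special-case column dtypes, a compact alternative is to locate the cell with numpy and step one column to the right; a sketch, assuming "date" occurs exactly once in the frame:
rows, cols = np.where(df.values == "date")
print(df.iloc[rows[0], cols[0] + 1])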