I'm trying to do something that I think should be a one-liner, but am struggling to get it right.
I have a large dataframe, we'll call it lg, and a small dataframe, we'll call it sm. Each dataframe has a start and an end column, and multiple other columns all of which are identical between the two dataframes (for simplicity, we'll call all of those columns type). Sometimes, sm will have the same start and end as lg, and if that is the case, I want sm's type to overwrite lg's type.
Here's the setup:
import pandas as pd

lg = pd.DataFrame({'start': [1, 2, 3, 4], 'end': [5, 6, 7, 8], 'type': ['a', 'b', 'c', 'd']})
sm = pd.DataFrame({'start': [9, 2, 3], 'end': [10, 6, 11], 'type': ['e', 'f', 'g']})
...note that the only matching ['start', 'end'] combo is [2, 6]
My desired output:
start end type
0 1 5 a
1 2 6 f # where sm['type'] overwrites lg['type'] because of matching ['start','end']
2 3 7 c
3 3 11 g # where there is no overwrite because 'end' does not match
4 4 8 d
5 9 10 e # where this row is added from sm
I've tried multiple versions of .merge(), merge_ordered(), etc. but to no avail. I've actually gotten it to work with merge_ordered() and drop_duplicates() only to realize that it was simply dropping the duplicate that was earlier in the alphabet, not because it was from sm.
You can try setting the start and end columns as the index and then using combine_first:
sm.set_index(['start', 'end']).combine_first(lg.set_index(['start', 'end'])).reset_index()
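As a quick sanity check, here's a minimal runnable sketch using the frames from the question (the inline comments show the result this should produce, matching the desired output above):

import pandas as pd

lg = pd.DataFrame({'start': [1, 2, 3, 4], 'end': [5, 6, 7, 8], 'type': ['a', 'b', 'c', 'd']})
sm = pd.DataFrame({'start': [9, 2, 3], 'end': [10, 6, 11], 'type': ['e', 'f', 'g']})

# sm takes priority wherever its ('start', 'end') pair matches lg; lg fills in everywhere else
out = sm.set_index(['start', 'end']).combine_first(lg.set_index(['start', 'end'])).reset_index()
print(out)
#    start  end type
# 0      1    5    a
# 1      2    6    f
# 2      3    7    c
# 3      3   11    g
# 4      4    8    d
# 5      9   10    e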
Imagine you have a PySpark data frame df with three columns: A, B, C. I want to keep the rows of the data frame where the value of B does not exist anywhere in column C.
Example:
A B C
a 1 2
b 2 4
c 3 6
d 4 8
would return
A B C
a 1 2
c 3 6
What I tried
df.filter(~df.B.isin(df.C))
I also tried making the values of B into a list, but that takes a significant amount of time.
The problem is how you're using isin. For better or worse, isin can't actually handle another PySpark Column object as an input; it needs an actual collection. So one thing you could do is convert your column to a list:
col_values = df.select("C").rdd.flatMap(lambda x: x).collect()
df.filter(~df.B.isin(col_values))
Performance-wise, though, this is obviously not ideal, as your driver node is now in charge of manipulating the entire contents of the single column you've just loaded into memory. You could use a left anti join to get the result you need without having to turn anything into a list and without losing the efficiency of Spark's distributed computing:
df0 = df[["C"]].withColumnRenamed("C", "B")  # every value of C, renamed so the join key lines up
df.join(df0, "B", "leftanti").show()         # keep only the rows of df whose B has no match in C
Thanks to Emma in the comments for her contribution.
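In case it helps, here's a self-contained sketch of the left anti join approach; the SparkSession setup and the toy rows mirror the example at the top of the question and are not part of the original answer:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("a", 1, 2), ("b", 2, 4), ("c", 3, 6), ("d", 4, 8)],
    ["A", "B", "C"],
)

# Rename C to B so the join key lines up, then keep only the rows of df
# whose B value has no match anywhere in C (that's what "leftanti" does)
df0 = df.select("C").withColumnRenamed("C", "B")
df.join(df0, "B", "leftanti").show()
# Expected: only the rows with A = 'a' and A = 'c' remain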
So, I have indexes in the _range_ data frame. I want to use them to look up rows in the test dataframe and extract those values into a new data frame.
My current code is:
d = []
for index in _range_.index:
    d.append(test.loc[[index], :])
_range_ data set:
a
2334 0.097946
3345 0.098201
3357 0.091249
3486 0.098214
5862 0.097946
6873 0.098201
6885 0.091249
7014 0.098214
_test_ data set:
0 1 2 3 4 5
0 4.187268 4.261664 4.329495 4.458864 3.071192 3.652938
You could join the two dataframes together on their common index with how='inner', then keep only the test columns.
cols = test.columns
df = _range_.join(test, how='inner')
df = df[cols]
If there is any overlap in column names between the two dataframes, pass lsuffix='_l' (or something similar) to join so that the _range_ columns are suffixed and then dropped by the column selection.
I'm unable to test this code against your example, though; it might be worth reading over this for future posts: https://stackoverflow.com/help/minimal-reproducible-example
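For illustration only, here's the same idea on small made-up frames (the _range_ rows are the first two from the question; the test values are invented so that the indexes overlap):

import pandas as pd

# Toy stand-ins: _range_ holds the indexes of interest, test holds the rows to extract
_range_ = pd.DataFrame({'a': [0.097946, 0.098201]}, index=[2334, 3345])
test = pd.DataFrame({0: [1.0, 2.0, 3.0], 1: [4.0, 5.0, 6.0]}, index=[2334, 3345, 9999])

cols = test.columns
df = _range_.join(test, how='inner')  # keeps only the rows whose index appears in both frames
df = df[cols]                         # drop the _range_ columns, keep only the test columns
print(df)
#         0    1
# 2334  1.0  4.0
# 3345  2.0  5.0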
I have what I'm sure is a fundamental lack of understanding about how dataframes work in Python. I am sure this is an easy question, but I have looked everywhere and can't find a good explanation. I am trying to understand why sometimes dataframe calculations seem to run on a row-by-row (or cell by cell) basis, and sometimes seem to run for an entire column... For example:
import re
import pandas as pd

data = {'Name': ['49-037-23094', '49-029-21476', '49-029-20812', '49-041-21318'], 'Depth': [20, 21, 7, 18]}
df = pd.DataFrame(data)
df
Which gives:
Name Depth
0 49-037-23094 20
1 49-029-21476 21
2 49-029-20812 7
3 49-041-21318 18
Now I know I can do:
df['DepthDouble']=df['Depth']*2
And get:
Name Depth DepthDouble
0 49-037-23094 20 40
1 49-029-21476 21 42
2 49-029-20812 7 14
3 49-041-21318 18 36
Which is what I would expect. But this doesn't always work, and I'm trying to understand why. For example, I am trying to run this code to modify the name:
df['newName'] = ''.join(re.findall(r'\d', str(df['Name'])))
which gives:
Name Depth DepthDouble \
0 49-037-23094 20 40
1 49-029-21476 21 42
2 49-029-20812 7 14
3 49-041-21318 18 36
newName
0 04903723094149029214762490292081234904121318
1 04903723094149029214762490292081234904121318
2 04903723094149029214762490292081234904121318
3 04903723094149029214762490292081234904121318
So it is taking all the values in my name column, removing the dashes, and concatenating them. Of course, I'd just like it to be a new name column exactly the same as the original "Name" column, but without the dashes.
So, can anyone help me understand what I am doing wrong here? I don't understand why some dataframe calculations for one column are done row by row (e.g., the DepthDouble column) while sometimes Python seems to take all the values in the entire column and run the calculation once (e.g., the newName column).
Surely the way around this isn't writing a loop over every index in the df to force it to run individually for each row of a given column?
If the output you're looking for is:
Name Depth newName
0 49-037-23094 20 4903723094
1 49-029-21476 21 4902921476
2 49-029-20812 7 4902920812
3 49-041-21318 18 4904121318
The way to get this is:
df['newName'] = df['Name'].map(lambda name: ''.join(re.findall(r'\d', name)))
map is like apply but specifically for Series objects. Since you're applying a function to only the Name column, you are operating on a Series.
If the lambda part is confusing, an equivalent way to write it is:
def find_digits(name):
    return ''.join(re.findall(r'\d', name))

df['newName'] = df['Name'].map(find_digits)
The equivalent operation in traditional for loops is:
pieces = []
for name in df['Name']:
    pieces.append(pd.Series([''.join(re.findall(r'\d', name))]))
newNameSeries = pd.concat(pieces, ignore_index=True).rename('newName')
pd.concat([df, newNameSeries], axis=1)
While there might be a slightly cleaner way to write the loop, you can see how much simpler the first approach is compared to using for loops. It's also faster. As you've already indicated you know, avoid for loops when using pandas.
The issue is that with str(df['Name']) you are converting the entire Name column of your DataFrame into one single string (index and all). What you want to do instead is use one of pandas' own string methods (the .str accessor), which is applied to every single element of the column.
For example, you could use pandas' replace method for strings:
import pandas as pd
data = {'Name':['49-037-23094', '49-029-21476', '49-029-20812', '49-041-21318'], 'Depth':[20, 21, 7, 18]}
df = pd.DataFrame(data)
df['newName'] = df['Name'].str.replace('-', '')
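If you'd rather stay close to the original regex instead of stripping the dashes, the same vectorized idea can be sketched with the .str.findall / .str.join pair (this is just an alternative, not part of the original answer):

# Pull out every digit of each Name, element-wise, and glue them back together
df['newName'] = df['Name'].str.findall(r'\d').str.join('')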
My DataFrame has a string in the first column, and a number in the second one:
GEOSTRING IDactivity
9 wydm2p01uk0fd2z 2
10 wydm86pg6r3jyrg 2
11 wydm2p01uk0fd2z 2
12 wydm80xfxm9j22v 2
39 wydm9w92j538xze 4
40 wydm8km72gbyuvf 4
41 wydm86pg6r3jyrg 4
42 wydm8mzt874p1v5 4
43 wydm8mzmpz5gkt8 5
44 wydm86pg6r3jyrg 5
45 wydm8w1q8bjfpcj 5
46 wydm8w1q8bjfpcj 5
What I want to do is manipulate this DataFrame so that I end up with a list object containing one string per distinct "IDactivity" value, where each string is built from the 5th character of every "GEOSTRING" value in that group.
So in this case, I have 3 different "IDactivity" values, and my list object will contain 3 strings that look like this:
['2828', '9888', '8888']
where, again, the characters you see in each string are the 5th character of each "GEOSTRING" value.
What I'm asking for is a solution, or an approach, that doesn't involve an overly complicated for loop and is as efficient as possible, since I have to manipulate lots of data. I'd like it to be clean and fast.
I hope it's clear enough.
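For reference, the sample frame above can be rebuilt like this (a sketch; df is the name the first answer below assumes, while the second answer calls it dframe):

import pandas as pd

df = pd.DataFrame({
    'GEOSTRING': ['wydm2p01uk0fd2z', 'wydm86pg6r3jyrg', 'wydm2p01uk0fd2z', 'wydm80xfxm9j22v',
                  'wydm9w92j538xze', 'wydm8km72gbyuvf', 'wydm86pg6r3jyrg', 'wydm8mzt874p1v5',
                  'wydm8mzmpz5gkt8', 'wydm86pg6r3jyrg', 'wydm8w1q8bjfpcj', 'wydm8w1q8bjfpcj'],
    'IDactivity': [2, 2, 2, 2, 4, 4, 4, 4, 5, 5, 5, 5],
}, index=[9, 10, 11, 12, 39, 40, 41, 42, 43, 44, 45, 46])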
This can be done easily as a one-liner (and it's pretty fast, too):
result = df.groupby('IDactivity')['GEOSTRING'].apply(lambda x:''.join(x.str[4])).tolist()
This groups the dataframe by the values of IDactivity, then selects the 5th character (index 4) of each corresponding GEOSTRING string and joins it with the other characters from the same group. Finally, the tolist() method turns the output into a list rather than a pandas Series.
output:
['2828', '9888', '8888']
Documentation:
pandas.groupby
pandas.apply
Here's a solution involving a temp column, and taking inspiration for the key operation from this answer:
# create a temp column with the character we want from each string
dframe['Temp'] = dframe['GEOSTRING'].apply(lambda x: x[4])
# groupby ID and then concatenate using a sneaky call to .sum()
dframe.groupby('IDactivity')['Temp'].sum().tolist()
Result:
['2828', '9888', '8888']
When I use apply with a user-defined function in pandas, it looks like Python is creating an additional index level. How can I get rid of it? Here is my code:
import numpy as np
import pandas as pd

def fnc(group):
    x = group.C.values
    out = x[np.where(x < 0)]
    return pd.DataFrame(out)

data = pd.DataFrame({'A': np.random.randint(1, 3, 10),
                     'B': 3,
                     'C': np.random.normal(0, 1, 10)})

data.groupby(by=['A', 'B']).apply(fnc).reset_index()
There is this weird Level_2 index created. Is there a way to avoid creating it when running my function?
A B level_2 0
0 1 3 0 -1.054134802
1 1 3 1 -0.691996447
2 2 3 0 -1.068693768
3 2 3 1 -0.080342046
4 2 3 2 -0.181869799
As it stands, you have no way to avoid level_2 appearing. This is because the result of your function for each group is a dataframe with several rows in it: pandas is cool enough to understand that you want to broadcast these rows across the grouped keys, yet it keeps the returned dataframe's index as an additional level to guarantee coherent output data. So explicitly dropping that last level (level=-1) at the end of your processing is expected.
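For instance, a minimal sketch of that explicit drop, using the data frame from the question (droplevel(-1) is one way; reset_index(level=-1, drop=True) is another):

res = data.groupby(by=['A', 'B']).apply(fnc)
# Drop the innermost level (the index of each frame returned by fnc), then flatten A and B
res = res.droplevel(-1).reset_index()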
If you want to avoid resetting that extra index but still have some post-processing, another way would be to call transform instead of apply, and have fnc return a vector the same length as the whole group, with np.nan in the positions you want to exclude. Then your dataframe will not have the extra level, but you'll need to call dropna() afterwards.
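Here's a minimal sketch of that transform route, assuming x.where(x < 0) is an acceptable way to NaN out the values you want to exclude (this code is not from the original answer):

# transform returns a result aligned row-for-row with data, so no extra index level appears
masked = data.groupby(['A', 'B'])['C'].transform(lambda x: x.where(x < 0))

# Replace C with the masked values, then drop the rows that were NaN'd out
result = data.assign(C=masked).dropna(subset=['C']).reset_index(drop=True)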