how to join 2 rows based on column value - python

I have a dataframe (shown in a picture that is not reproduced here), and based on the col_3 value I want to extract a second dataframe (also shown in a picture).
I tried:
df1 = df[df['col_8'] == 2]
df2 = df[df['col_8'] == 3]
df3 = pd.merge(df1, df2, on=['col_3'], how = 'inner')
But because there is only one row with col_3 = 252, that row is dropped by the merge.
How can I fix this, and which function should I use to extract the dataframe above?

What are you trying to do?
In your picture, col_3 only has values of 2 and 3. You split the dataframe on the condition col_3 = 2 or 3, and then you want to merge the pieces.
So you are trying to slice a dataframe and then rejoin it as it was? Why?

I think this is happening because your df2 is empty, since no rows satisfy df['col_8'] == 3. An inner join is the intersection of the two sets, so when you merge with an empty df2 it returns nothing.
I think you are trying to do this:
df2 = df[df['col_8_3'] == 3]
Then the inner join should work and produce one row.
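If the goal is to keep the lone col_3 = 252 row instead of losing it, one option is to change the join type. The sketch below is only an illustration: the sample values and the `val` column are made up, because the original pictures are not available. It shows that an inner merge drops a key that exists on only one side, while a left merge keeps it and fills the missing side with NaN.

```python
import pandas as pd

# Hypothetical data: col_3 = 252 appears only with col_8 == 2.
df = pd.DataFrame({
    "col_3": [250, 250, 251, 251, 252],
    "col_8": [2, 3, 2, 3, 2],
    "val": ["a", "b", "c", "d", "e"],
})

df1 = df[df["col_8"] == 2]
df2 = df[df["col_8"] == 3]

# Inner join: only keys present in BOTH frames survive, so 252 is dropped.
inner = pd.merge(df1, df2, on="col_3", how="inner", suffixes=("_2", "_3"))

# Left join: every row of df1 is kept; unmatched columns from df2 become NaN.
left = pd.merge(df1, df2, on="col_3", how="left", suffixes=("_2", "_3"))

print(inner)
print(left)
```

Using how="outer" instead would also keep keys that appear only in df2.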

Related

Join 2 columns of a dataframe based on syntax of values in the 2 columns

I have a Python dataframe and I am trying to combine the cells in the first 2 columns IF the first column value is a string with letters, and the second column value has the syntax of parentheses-single digit-parentheses.
e.g. this is the current layout:
0     1    2
text  (5)  moretext
This is what I want the result to be:
0         1
text (5)  moretext
I tried using the str.join() function but it's not working for me.
df1 = df.iloc[:, 0:1].str.join(r'(\(\d\))')
please let me know how I can write this, thank you
I believe join is supposed to join lists (held inside one column) into a string, not merge several columns into a single column (https://pandas.pydata.org/docs/reference/api/pandas.Series.str.join.html)
I might not have understood your problem completely but maybe this could work :
idx = df[df[0].str.contains(r'\w') & df[1].str.contains(r'\(\d\)')].index.values  # find the indices that match your criteria (raw strings avoid invalid-escape warnings)
df1 = pd.DataFrame()
df1[0] = df[0][idx] + ' ' + df[1][idx] # merges values of your columns for the proper indices
df1[1] = df[2][idx]
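A self-contained variant of the same idea, as a rough sketch: build a boolean mask and combine the two columns only where it holds. The sample rows and the names `mask` and `combined` are assumptions for illustration, and `str.fullmatch` needs pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical sample: only the first row has "(digit)" in column 1.
df = pd.DataFrame([
    ["text", "(5)", "moretext"],
    ["other", "word", "more"],
])

# True where column 0 contains a letter AND column 1 is exactly "(d)".
mask = df[0].str.contains(r"[A-Za-z]") & df[1].str.fullmatch(r"\(\d\)")

# Keep column 0 as-is where the mask is False, otherwise append column 1.
combined = df[0].where(~mask, df[0] + " " + df[1])

print(combined.tolist())
```

Rows that do not match the pattern are left unchanged, so nothing is silently dropped.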

Joining two dataframes on subvalue of the key column

I am currently trying to join / merge two df on the column Key, where in df1 the key is a standalone value such as 5, but in df2, the key can consist of multiple values such as [5,6,13].
For example like this:
df1 = pd.DataFrame({'key': [["5","6","13"],["10","7"],["6","8"]]})
df2 = pd.DataFrame({'sub_key': ["5","10","6"]})
However, my df are a lot bigger and consist of many columns, so an efficient solution would be great.
As a result I would like to have a table like this:
Key1  Key2
5     5,6,13
10    10,7
and so on ....
I already tried to apply this approach to my code, but it didn't work:
df1['join'] = 1
df2['join'] = 1
merged= df1.merge(df2, on='join').drop('join', axis=1)
df2.drop('join', axis=1, inplace=True)
merged['match'] = merged.apply(lambda x: x.key(x.sub_key), axis=1).ge(0)
I also tried to split and explode the column and join on single values. The problem there was that not all column values were split correctly, and I would need to combine everything back into one cell once joined.
Help would be much appreciated!
If you only want to match the first key:
df1['sub_key'] = df1.key.str[0]
df1.merge(df2)
If you want to match ANY key:
df3 = df1.explode('key').rename(columns={'key':'sub_key'})
df3 = df3.join(df1)
df3.merge(df2)
Edit: First version had a small bug, fixed it.
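For reference, here is the "match ANY key" approach as a runnable sketch on the sample frames from the question. The names `exploded` and `matched` and the final comma-joining step are additions for illustration, chosen to get close to the Key1/Key2 table the asker described.

```python
import pandas as pd

df1 = pd.DataFrame({"key": [["5", "6", "13"], ["10", "7"], ["6", "8"]]})
df2 = pd.DataFrame({"sub_key": ["5", "10", "6"]})

# One row per list element; the original index is repeated for each element.
exploded = df1.explode("key").rename(columns={"key": "sub_key"})

# Join back on the (repeated) index to recover the full list next to each element.
exploded = exploded.join(df1)

# Keep only the elements that appear in df2.
matched = exploded.merge(df2, on="sub_key")

# Turn each list back into a single comma-separated cell.
matched["key"] = matched["key"].str.join(",")

print(matched)
```

Note that "6" matches two different lists here, so it appears twice in the result.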

Python Pandas showing change in position between two dataframes

I am reading two dataframes, looking at one column, and then showing the difference in position between the two dataframes as -1, +1, etc.
I have tried the following code, but it only shows 0 in Position Change when there should be a difference for British Airways and Ryanair.
first = pd.read_csv("C:\\Users\\airma\\PycharmProjects\\Vatsim_Stats\\Vatsim_stats\\Base.csv", encoding='unicode_escape')
df1 = pd.DataFrame(first, columns=['airlines', 'Position'])
second = pd.read_csv("C:\\Users\\airma\\PycharmProjects\\Vatsim_Stats\\Vatsim_stats\\Base2.csv", encoding='unicode_escape')
df2 = pd.DataFrame(second, columns=['airlines', 'Position'])
df1['Position Change'] = np.where(df1['airlines'] == df2['airlines'], 0, df1['Position'] - df2['Position'])
I have also tried the following code, but I keep getting ValueError: cannot reindex from a duplicate axis
df1.set_index('airlines', drop=False) # Set index to cross reference by (icao)
df2.set_index('airlines', drop=False)
df2['Position Change'] = df1[['Position']].sub(df2['Position'], axis=0)
df2 = df2.reset_index(drop=True)
pd.set_option('display.precision', 0)
Base csv looks like this -
and Base2 csv looks like this -
As you can see, British Airways is in position 3 in the Base csv and position 4 in the Base2 csv, but when I run the code it just shows 0 and does not do the math between the two dataframes.
I have been stuck on this for days now and would be so grateful for any help.
I would like to offer an easier way based on columns, values and an if-statement.
It is probably not very useful for a big dataframe, but it can give you the information you expect.
first = pd.read_csv("C:\\Users\\airma\\PycharmProjects\\Vatsim_Stats\\Vatsim_stats\\Base.csv", encoding='unicode_escape')
df1 = pd.DataFrame(first, columns=['airlines', 'Position'])
second = pd.read_csv("C:\\Users\\airma\\PycharmProjects\\Vatsim_Stats\\Vatsim_stats\\Base2.csv", encoding='unicode_escape')
df2 = pd.DataFrame(second, columns=['airlines', 'Position'])
I agree that my answer did not match your question.
Now, if I understand correctly, you want to create a new column in the DataFrame that gives you -1 if the values in the same column of the two DataFrames differ and 1 if they match.
This should help:
key = "Name_Of_Column"
new = []
for i in range(len(df1)):
    if df1[key][i] != df2[key][i]:
        new.append(-1)
    else:
        new.append(1)
df3 = pd.DataFrame({"Diff": new})  # build the new column from a dictionary
df1 = pd.concat([df1, df3], axis=1)  # attach it as a column (DataFrame.append added rows and was removed in pandas 2.0)
print(df1)
I am giving you an alternative; I am not sure whether it is what you are after, but it is just an idea.
After reading the two csv files and selecting the columns you require, why not join the two dataframes on the 'airlines' column? That will merge the two dataframes with 'airlines' as the key.
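A minimal sketch of that merge idea follows. The airline lists are made up, apart from British Airways being 3rd in Base and 4th in Base2 as stated in the question, and the `_base` / `_base2` suffixes are arbitrary names. Merging on 'airlines' aligns rows by name rather than by row number, which is what the np.where attempt was missing.

```python
import pandas as pd

# Hypothetical stand-ins for Base.csv and Base2.csv.
base = pd.DataFrame({
    "airlines": ["Delta", "easyJet", "British Airways", "Ryanair"],
    "Position": [1, 2, 3, 4],
})
base2 = pd.DataFrame({
    "airlines": ["Delta", "easyJet", "Ryanair", "British Airways"],
    "Position": [1, 2, 3, 4],
})

# Align the two tables by airline name; overlapping columns get suffixes.
merged = base.merge(base2, on="airlines", suffixes=("_base", "_base2"))

# Old position minus new position: negative means the airline dropped.
merged["Position Change"] = merged["Position_base"] - merged["Position_base2"]

print(merged)
```

If an airline appears more than once in either file, the merge produces one row per combination, so duplicates are worth checking first.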

Pandas - contains from other DF

I have 2 dataframes:
DF A:
and DF B:
I need to check, for every row in DFA['item'], whether it contains any of the values in DFB['original']; if it does, I want to add a new column DFA['my'] holding the corresponding value from DFB['my'].
So here is the result I need:
I thought of converting DFB['original'] into a list and then using a regex, but that way I won't get the matching value from column 'my'.
OK, maybe not the best solution, but it seems to work.
I did a cartesian join and then checked which records contain the data needed:
dfa['join'] = 1
dfb['join'] = 1
dfFull = dfa.merge(dfb, on='join').drop('join' , axis=1)
dfFull['match'] = dfFull.apply(lambda x: x['original'] in x['item'], axis=1)  # use x['item']: x.item resolves to the Series.item method, not the column
dfFull[dfFull['match']]
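The same cartesian-join idea can be written without the helper 'join' column by using merge(how="cross"), available in pandas 1.2 and later. This is a sketch only: the sample values are invented because the original tables were images, and the names `pairs`, `matched` and `result` are illustrative.

```python
import pandas as pd

# Hypothetical stand-ins for DF A and DF B.
dfa = pd.DataFrame({"item": ["red apple pie", "green pear", "plain bread"]})
dfb = pd.DataFrame({"original": ["apple", "pear"], "my": ["A", "P"]})

# Every row of dfa paired with every row of dfb.
pairs = dfa.merge(dfb, how="cross")

# Keep the pairs where 'original' occurs as a substring of 'item'.
is_match = [orig in item for orig, item in zip(pairs["original"], pairs["item"])]
matched = pairs[is_match]

# Attach the matched 'my' value back onto dfa; unmatched items get NaN.
result = dfa.merge(matched[["item", "my"]], on="item", how="left")

print(result)
```

A cross join grows as len(dfa) * len(dfb), so this is best suited to small lookup tables.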

How can I get data from a dataframe based on matching criteria with pandas?

I have a list called 'common_numbers'. This list has numbers (in str format) that will match with some of the numbers in a data frame that are in the 4th column. So for example:
common_numbers = ['512', '653', '950']
(example row in a data frame) df = expeditious, tree, www.stackflow.com, 512, asn
data frame example:
0 0,1,2,3,4
1 host,ip,FQDN,asn,asnOrgName
2 barracuda,208.92.204.42,barracuda.godsgarden.com,17359,exampleorgName
The commonality in common_numbers and the data frame in this example is 512. Thus, the value I want to retrieve is www.stackflow.com from the data frame.
I tried:
wanted_data =[]
if i in common_values:
print("Match found.. generating fqdn..")
for i in df_is_not:
wanted_data.append(df.loc[df[2].isin(common_values)])
print(wanted_fqdn_data)
It returns:
Columns: [0, 1, 2, 3, 4]
Index: [], Empty DataFrame
What am I doing wrong? How can I fix this? Thanks so much. With the example I gave above I'm expecting to get:
print(wanted_data)
>>>['www.stackflow.com']
Try this
df1 = pd.DataFrame(['512', '653', '950'])
df2 = pd.DataFrame([['expeditious', 'tree', 'www.stackflow.com', '512', 'asn'],
                    ['barracuda', '208.92.204.42', 'barracuda.godsgarden.com', '17359', 'exampleorgName']],
                   columns=['c1', 'c2', 'c3', 'c4', 'c5'])
df3 = df2.merge(df1, left_on=['c4'], right_on=[0], how='inner', left_index=False)[['c3']]
df3
The result will be
c3
0 www.stackflow.com
You have the right idea, there just really isn't a need for a loop in this case.
If you ultimately just want to pull out the third column of every row where the fourth column is in a list you have, then you can do the following:
df = df[df[3].isin(common_numbers)]
wanted_data = list(df[2])
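As a self-contained illustration of the two lines above, here is a sketch using the rows from the question with integer column labels 0 to 4. The astype(str) call is an added precaution: isin compares values as they are, so if the numbers were read in as integers, a list of strings would match nothing.

```python
import pandas as pd

common_numbers = ["512", "653", "950"]

# Rows taken from the question; columns are labelled 0..4 by default.
df = pd.DataFrame([
    ["expeditious", "tree", "www.stackflow.com", "512", "asn"],
    ["barracuda", "208.92.204.42", "barracuda.godsgarden.com", "17359", "exampleorgName"],
])

# Filter on column 3 (the numbers), then take column 2 (the FQDN).
matches = df[3].astype(str).isin(common_numbers)
wanted_data = df.loc[matches, 2].tolist()

print(wanted_data)  # ['www.stackflow.com']
```

The attempt in the question filtered on df[2], which holds the FQDN rather than the number, so nothing could match.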
Hopefully, this answers your question.
