I have two dataframes df1 and df2. df1 is a dataframe with various columns and df2 is a dataframe with only one column col2, which is a list of words.
It is obviously wrong, but my code so far is: df1["col_new"] = df1[df1["col1"]].str.contains(df2["col2"])
Basically, I want to create a new column called col_new in df1 that has copied values from col2 in df2 if the values are partial matches to values in col1 in df1.
For example, if col2 = "apple" and col1 = "im.apple3", then I want to copy or assign the value "apple" to col_new and so on.
Another question I have is how to find the index/position of the second uppercase letter in a string in col1 in df1.
I found a similar question on here and wrote this code: df["sec_upper"] = df["col1"].apply(lambda x: re.search("[A-Z]+{2}",x).span())[1] but I get an error saying "multiple repeat at position 6".
Can someone please help me out? Thank you in advance!
EDIT2: First problem solved. Can anyone please help me with the second problem?
EDIT1:
Example dataframes:
df1
col1
im.apple3
Cookiemm
Hi_World123
df2
col2
apple
cookie
world
candy
soda
Expected output:
col1 new_col sec_upper
im.apple3 apple NaN
Cookiemm cookie NaN
Hi_World123 world 4
Try this:
df1['new_col'] = df1['col1'].str.lower().str.extract(f"({'|'.join(df2['col2'])})")
Output:
col1 new_col
0 im.apple3 apple
1 Cookiemm cookie
2 Hi_World123 world
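For the second problem (the position of the second uppercase letter), one possible sketch is below. It assumes the expected output counts positions from 1, since "W" in "Hi_World123" is at 0-based index 3 and the question shows 4; the helper name second_upper_pos is made up for illustration. The "multiple repeat" error comes from stacking two quantifiers ("+" and "{2}") in the pattern, so this version matches single uppercase letters and picks the second match instead.

```python
import re

import pandas as pd

df1 = pd.DataFrame({"col1": ["im.apple3", "Cookiemm", "Hi_World123"]})


def second_upper_pos(text):
    """Return the 1-based position of the second uppercase ASCII letter, or NaN."""
    matches = list(re.finditer(r"[A-Z]", text))
    if len(matches) < 2:
        return float("nan")  # fewer than two uppercase letters
    # .start() is 0-based; add 1 to match the 1-based position assumed above
    return matches[1].start() + 1


df1["sec_upper"] = df1["col1"].apply(second_upper_pos)
print(df1)
```

If a 0-based index is wanted instead, drop the "+ 1".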
How can I use pandas to update / combine / merge a dataframe (df1) with values from another dataframe (df2), so that df1 gets a new column (col3) with values from df2.col2? In other words, df1 holds the current month's values and I would like df1 to also have a column from df2, which holds last month's values.
Any insights on this are appreciated; thank you SO.
df1:
key date col1 col2
key1 feb-01 df1_val01 df1_val02
key2 feb-01 df1_val11 df1_val12
df2:
key date col1 col2
key1 jan-01 df2_val01 df2_val02
key2 jan-01 df2_val11 df2_val12
desired df:
key date col1 col2 col3
key1 feb-01 df1_val01 df1_val02 df2_val01
key2 feb-01 df1_val11 df1_val12 df2_val12
Merging, joining, and concatenating dataframes can be tricky. One of the simplest ways I've found is to make the shared "key" column into an index: take the column you need from df2, rename it (to "col3" in your case), and join it onto df1 on the "key" column.
In your case, it would look like:
right_df = df2[["key", "col2"]].set_index("key")
right_df = right_df.rename(columns={"col2": "col3"})
new_df = df1.join(right_df, on="key")
(I didn't test this code. This is from memory. Let me know if it fails miserably and I'll see if I can fix it.)
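The same idea can also be written with merge. This is a minimal sketch built from the example frames in the question; it follows the question's wording (col3 comes from df2.col2), so key1 gets df2_val02 here, which differs from the desired table as posted.

```python
import pandas as pd

df1 = pd.DataFrame({
    "key": ["key1", "key2"],
    "date": ["feb-01", "feb-01"],
    "col1": ["df1_val01", "df1_val11"],
    "col2": ["df1_val02", "df1_val12"],
})
df2 = pd.DataFrame({
    "key": ["key1", "key2"],
    "date": ["jan-01", "jan-01"],
    "col1": ["df2_val01", "df2_val11"],
    "col2": ["df2_val02", "df2_val12"],
})

# Keep only the key and last month's col2, renamed so it does not clash with df1.col2
last_month = df2[["key", "col2"]].rename(columns={"col2": "col3"})

# A left merge keeps every row of df1, even if a key is missing from df2
new_df = df1.merge(last_month, on="key", how="left")
print(new_df)
```

With how="left", keys that exist only in df1 end up with NaN in col3 rather than being dropped.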
I am using python with pandas imported to manipulate some data from a csv file I have. Just playing around to try and learn something new.
I have the following data frame:
I would like to group the data by Col1 so that I get the following result, which is a groupby on Col1 with Col3 and Col4 multiplied together.
I have been watching some youtube videos and reading some similar questions on stack overflow but I am having trouble. So far I have the following which involves creating a new Col to hold the result of Col3 x Col4:
df['Col5'] = df.Col3 * df.Col4
gf = df.groupby(['col1', 'Col5'])
You can use a solution that does not create a new column: multiply the two columns, group the result by df['Col1'], and aggregate with sum:
gf = (df.Col3 * df.Col4).groupby(df['Col1']).sum().reset_index(name='Col2')
print (gf)
Col1 Col2
0 12345 38.64
1 23456 2635.10
2 45678 419.88
Another solution is to set Col1 as the index with set_index, multiply the columns with prod, and finally sum per index value by grouping on level=0:
gf = df.set_index('Col1')[['Col3','Col4']].prod(axis=1).groupby(level=0).sum().reset_index(name='Col2')
Almost, but you are grouping by too many columns in the end. Try:
gf = df.groupby('Col1')['Col5'].sum()
Or to get it as a dataframe, rather than Col1 as an index (I'm judging that this is what you want from your image), include as_index=False in your groupby:
gf = df.groupby('Col1', as_index=False)['Col5'].sum()
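To compare the approaches side by side, here is a small self-contained sketch. The original data was posted as an image, so the values below are made up purely for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical data: the question's actual values are not available
df = pd.DataFrame({
    "Col1": [12345, 12345, 23456, 45678],
    "Col3": [2, 3, 10, 4],
    "Col4": [1.5, 2.0, 0.5, 2.5],
})

# Approach 1: multiply on the fly and group the product by Col1
by_product = (df["Col3"] * df["Col4"]).groupby(df["Col1"]).sum().reset_index(name="Col2")

# Approach 2: store the product in a helper column, then group by Col1 only
by_column = (
    df.assign(Col5=df["Col3"] * df["Col4"])
      .groupby("Col1", as_index=False)["Col5"]
      .sum()
)

print(by_product)
print(by_column)
```

Both give one row per Col1 value; they differ only in whether the intermediate product is kept as a column.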
I am trying to get one result per 'name' with all of the latest data, unless the column is blank. In R I would have used group_by, sorted by timestamp and selected the latest value for each column. I tried to do that here and got very confused. Can someone explain how to do this in Python? In the example below my goal is:
col2 date name
1 4 2018-03-27 15:55:29 bil #latest timestamp with the latest non-blank col4 value
Here's my code so far:
import pandas as pd
d = {'name':['bil','bil','bil'],'date': ['2018-02-27 14:55:29', '2018-03-27 15:55:29', '2018-02-28 19:55:29'], 'col2': [3,'', 4]}
df2 = pd.DataFrame(data=d)
print(df2)
grouped = df2.groupby(['name']).sum().reset_index()
print(grouped)
sortedvals=grouped.sort_values(['date'], ascending=False)
print(sortedvals)
Here's one way:
df3 = df2[df2['col2'] != ''].sort_values('date', ascending=False).drop_duplicates('name')
# col2 date name
# 2 4 2018-02-28 19:55:29 bil
However, the dataframe you provided and output you desire seem to be inconsistent.
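If the goal is instead the latest non-blank value per column (which would combine the latest date with the latest non-blank col2), a possible sketch is below. It assumes blanks are empty strings, as in the question, and relies on GroupBy.last returning the last non-null entry of each column within a group.

```python
import pandas as pd

d = {
    "name": ["bil", "bil", "bil"],
    "date": ["2018-02-27 14:55:29", "2018-03-27 15:55:29", "2018-02-28 19:55:29"],
    "col2": [3, "", 4],
}
df2 = pd.DataFrame(data=d)
df2["date"] = pd.to_datetime(df2["date"])

# Turn blank strings into NaN so they are skipped when picking the last value
cleaned = df2.assign(col2=df2["col2"].mask(df2["col2"] == ""))

# Sort chronologically, then take the last non-null value of every column per name
latest = cleaned.sort_values("date").groupby("name", as_index=False).last()
print(latest)
```

Note that the resulting row can mix values from different original rows, which is the point here but may not be what you want in every case.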
Is it possible to reorder pandas.DataFrame rows (based on an index) inplace?
I am looking for something like this:
df.reorder(idx, axis=0, inplace=True)
Here idx is not equal to df.index, but it is of the same type and contains the same elements in a different order. The output is df reordered according to the new idx.
I have not found anything in the documentation and I have failed to use reindex_axis, which made me hope it was possible because of this passage:
A new object is produced unless the new index is equivalent to the
current one and copy=False
I might have misunderstood what equivalent index means in this context.
Try using the reindex function (note that this is not inplace):
>>> import pandas as pd
>>> df = pd.DataFrame({'col1':[1,2,3],'col2':['test','hi','hello']})
>>> df
col1 col2
0 1 test
1 2 hi
2 3 hello
>>> df = df.reindex([2,0,1])
>>> df
col1 col2
2 3 hello
0 1 test
1 2 hi
Col1 Col2
0 APT UB0
1 AK0 UUP
2 IL2 PB2
3 OIU U5B
4 K29 AAA
My data frame looks similar to the above data. I'm trying to change the values in Col1 if the corresponding values in Col2 contain the letter "B". If the value in Col2 contains "B", then I want to add "-B" to the end of the value in Col1.
Ultimately I want Col1 to look like this:
Col1
0 APT-B
1 AK0
2 IL2-B
.. ...
I have an idea of how to approach it... but I'm somewhat confused because I know my code is incorrect. In addition there are NaN values in my actual code for Col1... which will definitely give an error when I'm trying to do val += "-B" since it's not possible to add a string and a float.
for value in dataframe['Col2']:
    if "Z" in value:
        for val in dataframe['Col1']:
            val += "-B"
Does anyone know how to fix/solve this?
Rather than using a loop, let's use pandas directly:
import pandas as pd
df = pd.DataFrame({'Col1': ['APT', 'AK0', 'IL2', 'OIU', 'K29'], 'Col2': ['UB0', 'UUP', 'PB2', 'U5B', 'AAA']})
df.loc[df.Col2.str.contains('B'), 'Col1'] += '-B'
print(df)
Output:
Col1 Col2
0 APT-B UB0
1 AK0 UUP
2 IL2-B PB2
3 OIU-B U5B
4 K29 AAA
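Since the question mentions NaN values in the real data, here is a hedged variant that skips missing values explicitly. The NaN entries below are hypothetical, added only to illustrate the case; na=False makes str.contains treat a missing Col2 as "no match", and the notna() check leaves a missing Col1 untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in both columns
df = pd.DataFrame({
    "Col1": ["APT", "AK0", np.nan, "OIU", "K29"],
    "Col2": ["UB0", "UUP", "PB2", "U5B", np.nan],
})

# Rows where Col2 contains "B" (missing Col2 counts as False) and Col1 is present
mask = df["Col2"].str.contains("B", na=False) & df["Col1"].notna()

# Append the suffix only on the selected rows
df.loc[mask, "Col1"] += "-B"
print(df)
```

Row 2 keeps its NaN in Col1 even though its Col2 contains "B", and row 4 is unchanged because its Col2 is missing.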
You have too many "for" loops in your code. You just need to iterate over the rows once, and for any row satisfying your condition you make the change.
for idx, row in df.iterrows():
    if 'B' in row['Col2']:
        df.loc[idx, 'Col1'] = str(df.loc[idx, 'Col1']) + '-B'
edit: I used str to convert the previous value in Col1 to a string before appending, since you said you sometimes have non-string values there. If this doesn't work for you, please post your test data and results.
You can use a lambda expression. If 'B' is in Col2, then '-B' gets appended to Col1. The end result is assigned back to Col1.
df['Col1'] = df.apply(lambda x: x.Col1 + ('-B' if 'B' in x.Col2 else ''), axis=1)
>>> df
Col1 Col2
0 APT-B UB0
1 AK0 UUP
2 IL2-B PB2
3 OIU-B U5B
4 K29 AAA