Concatenating a string with part of another string in Python

Problem:
I have a csv file with partial city names in one column (usually missing the first one or two letters). Another column in the same file holds other information and often ends with the one or two missing letters.
e.g.
Column 1    Column 3
w York      word word Ne
My plan is to keep a separate CSV file of valid city names and perform the Python equivalent of a VLOOKUP before and after concatenating, so that the script only concatenates when the value does not already match a valid city.
I am stuck on how to actually pull one or two characters from the end of each string in column 3 (a substring, repeated down the whole column) and prepend them to the string in column 1; I already know how to execute the rest of my idea.
Here is a general script for concatenation using Pandas:
pd.concat([col1, col2.set_axis(col1.index[-len(col2):], inplace=False)], axis=1)
Would the addition of a -2 resolve the issue? i.e.
pd.concat([col1, col2.set_axis(col1.index[-len(col2)-2:], inplace=False)], axis=1)
Thank you!

If you decide on Pandas, after loading the csv into a dataframe you can take the last one or two characters of column 3 and prepend them to column 1 like this:
df_city_names['col3'].map(lambda x: str(x)[-2:]) + df_city_names['col1']
Slicing with [-2:] is safe for short values: a one-character string returns itself and an empty string returns an empty string, so no explicit length check is needed.
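A fuller sketch of the look-up-then-concatenate idea described in the question. The file and column names here (data.csv, valid_cities.csv, city, col1, col3) are hypothetical stand-ins for your own:

import pandas as pd

df = pd.read_csv("data.csv")                           # file with the partial city names
valid = set(pd.read_csv("valid_cities.csv")["city"])   # lookup file of valid city names

def repair(row):
    # Keep the value if it already matches a valid city name
    if row["col1"] in valid:
        return row["col1"]
    # Otherwise try prepending the last 2, then the last 1, characters of col3
    for n in (2, 1):
        candidate = str(row["col3"])[-n:] + str(row["col1"])
        if candidate in valid:
            return candidate
    return row["col1"]  # leave unresolved values unchanged

df["col1"] = df.apply(repair, axis=1)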

Related

Pandas deleting partly duplicate rows with wrong values in specific columns

I have a large dataframe from a csv file with a few dozen columns. I concatenated a second csv file with exactly the same structure onto the original, but one particular column in the second file may have incorrect values. I want to delete the duplicate rows that have the wrong value in this one column, without risking deleting the correct rows. For example, in the data below the last row should be removed. (The specimen names (Albert, etc.) are unique.)
0 Albert alive
1 Newton alive
2 Galileo alive
3 Copernicus dead
4 Galileo dead
...
Any help would be greatly appreciated!
You could use this to number each occurrence of a name and so determine whether it appears more than once:
df['RN'] = df.groupby(['Name']).cumcount() + 1
You can also add more columns to the groupby if you want to put further constraints on what counts as a duplicate:
df['RN'] = df.groupby(['Name', 'Another Column']).cumcount() + 1
The advantage of this approach is the control it gives you over the selection: keep the first occurrence of each name with df.loc[df['RN'] == 1], or pull out the later duplicates with df.loc[df['RN'] > 1].
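Put together, a minimal sketch on the example data, assuming (as in the example) that the first occurrence of each name is the correct one; the Status column name is assumed, since the question does not give one:

import pandas as pd

df = pd.DataFrame({
    "Name": ["Albert", "Newton", "Galileo", "Copernicus", "Galileo"],
    "Status": ["alive", "alive", "alive", "dead", "dead"],  # hypothetical column name
})

# Number each occurrence of a name: 1 for the first, 2 for the second, ...
df["RN"] = df.groupby(["Name"]).cumcount() + 1

# Keep only the first occurrence of each name and drop the helper column
deduped = df.loc[df["RN"] == 1].drop(columns="RN")
print(deduped)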

Is there a way to parse a single column into multiple columns using python?

I'm new to python still and am learning. I'm working with a csv file and would like to parse a single column into multiple columns, splitting the values of that one column across several new columns. Here's an example:
Old column          New columns
column 1            column 1   column 2   column 3
value 1             Value 1    Value 2    Value 3
value 2
value 3
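One possible pandas reading of this (a sketch, assuming every three consecutive rows of the old column form one row of the new table; the column count n is an assumption):

import pandas as pd

df = pd.DataFrame({"column 1": ["value 1", "value 2", "value 3"]})

n = 3  # assumed number of target columns
# Reshape each group of n consecutive values into one row of n columns
reshaped = pd.DataFrame(
    df["column 1"].to_numpy().reshape(-1, n),
    columns=[f"column {i + 1}" for i in range(n)],
)
print(reshaped)

If the column instead holds delimited strings (e.g. "Value 1,Value 2,Value 3"), df["column 1"].str.split(",", expand=True) does the splitting in one call.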

Get rid of initial spaces at specific cells in Pandas

I am working with a big dataset (more than 2 million rows x 10 columns) that has a column with string values that were filled oddly. Some rows start and end with many space characters, while others don't.
What I have looks like this:
col1
0 (spaces)string(spaces)
1 (spaces)string(spaces)
2 string
3 string
4 (spaces)string(spaces)
I want to get rid of those spaces at the beginning and at the end and get something like this:
col1
0 string
1 string
2 string
3 string
4 string
Normally, for a small dataset, I would use a for loop (I know it's far from optimal), but here that's not an option given the time it would take.
How can I use the power of pandas to avoid a for loop here?
Thanks!
edit: I can't get rid of all the whitespaces since the strings contain spaces.
df['col1'] = df['col1'].str.strip()
should help. .str.strip() is vectorized, so it avoids both an explicit for loop and a row-by-row apply, and it only trims leading and trailing whitespace, so the spaces inside your strings survive.
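A quick sketch of the behavior on toy data:

import pandas as pd

df = pd.DataFrame({"col1": ["   some string   ", "some string", "  some string "]})

# Trims leading/trailing whitespace for the whole column at once;
# interior spaces are left alone.
df["col1"] = df["col1"].str.strip()
print(df["col1"].tolist())  # ['some string', 'some string', 'some string']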

Match all values of a string column in one dataframe against another dataframe's string column

I have two pandas dataframes:
Dataframe 1:
ITEM ID TEXT
1 some random words
2 another word
3 blah
4 random words
Dataframe 2:
INDEX INFO
1 random
3 blah
I would like to match the values from the INFO column (of dataframe 2) with the TEXT column of dataframe 1. If there is a match, I would like to see a new column with a "1".
Something like this:
ITEM ID TEXT MATCH
1 some random words 1
2 another word
3 blah 1
4 random words 1
I was able to create a match per value of the INFO column that I'm looking for with this line of code:
dataframe1.loc[dataframe1['TEXT'].str.contains('blah'), 'MATCH'] = '1'
However, my real dataframe 2 has 5000 rows, so I cannot manually copy-paste this for every value. Basically I'm looking for something like this:
dataframe1.loc[dataframe1['TEXT'].str.contains('Dataframe2[INFO]'), 'MATCH'] = '1'
I hope someone can help, thanks!
Give this a shot:
Code:
dfA['MATCH'] = dfA['TEXT'].apply(lambda x: min(len([y for y in dfB['INFO'] if y in x]), 1))
Output:
ITEM ID TEXT MATCH
0 1 some random words 1
1 2 another word 0
2 3 blah 1
3 4 random words 1
It's a 0 if it's not a match, but that's easy enough to weed out.
There may be a better / faster native solution, but it gets the job done by iterating over both the 'TEXT' column and the 'INFO'. Depending on your use case, it may be fast enough.
Looks like .map() in lieu of .apply() would work just as well. Could make a difference in timing, again, based on your use case.
Updated to take into account string contains instead of exact match...
You could also take the unique values from the INFO column of the second dataframe, join them into a single regex alternation, and test the first dataframe's TEXT column with Series.str.contains (DataFrame.eval cannot call .str methods, so this is done directly):
import re
pattern = '|'.join(re.escape(s) for s in df2['INFO'].unique())
df1['MATCH'] = df1['TEXT'].str.contains(pattern).astype(int)
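For completeness, a runnable sketch of that approach on the example data:

import re
import pandas as pd

df1 = pd.DataFrame({"ITEM ID": [1, 2, 3, 4],
                    "TEXT": ["some random words", "another word", "blah", "random words"]})
df2 = pd.DataFrame({"INDEX": [1, 3], "INFO": ["random", "blah"]})

# Build one pattern such as "random|blah"; re.escape guards regex metacharacters
pattern = "|".join(re.escape(s) for s in df2["INFO"].unique())
df1["MATCH"] = df1["TEXT"].str.contains(pattern).astype(int)
print(df1)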

pandas DataFrame conditional string split

I have a column of influenza virus names within my DataFrame. Here is a representative sampling of the name formats present:
(A/Egypt/84/2001(H1N2))
A/Brazil/1759/2004(H3N2)
A/Argentina/126/2004
I am only interested in getting out A/COUNTRY/NUMBER/YEAR from the strain names, e.g. A/Brazil/1759/2004. I have tried doing:
df['Strain Name'] = df['Original Name'].str.split("(")
However, if I access .str[0], I miss case #1, and if I use .str[1], I miss cases #2 and #3.
Is there a solution that works for all three cases? Or is there some way to apply a condition in string splits, without iterating over each row in the data frame?
So, based on EdChum's recommendation, I'll post my answer here.
Minimal data frame required for tackling this problem:
Index Strain Name Year
0 (A/Egypt/84/2001(H1N2)) 2001
1 A/Brazil/1759/2004(H3N2) 2004
2 A/Argentina/126/2004 2004
Code for getting the strain names only, without parentheses or anything else inside the parentheses:
df['Strain Name'] = df['Strain Name'].str.split('(').apply(lambda x: max(x, key=len))
This works for the particular cases shown here because the trick relies on the isolate's strain name being the longest piece after splitting on the opening parenthesis ("(").
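An alternative sketch that extracts the A/COUNTRY/NUMBER/YEAR pattern directly with a regular expression, assuming the names always follow that shape:

import pandas as pd

df = pd.DataFrame({"Strain Name": [
    "(A/Egypt/84/2001(H1N2))",
    "A/Brazil/1759/2004(H3N2)",
    "A/Argentina/126/2004",
]})

# Capture TYPE/COUNTRY/NUMBER/YEAR whether or not parentheses wrap the name
df["Strain Name"] = df["Strain Name"].str.extract(r"([A-Z]/[^/()]+/\d+/\d{4})", expand=False)
print(df["Strain Name"].tolist())
# ['A/Egypt/84/2001', 'A/Brazil/1759/2004', 'A/Argentina/126/2004']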
