How to create a new column by manipulating another column? pandas - python

I am trying to make a new column based on different criteria. I want to add characters to the string depending on the starting characters of the column value.
An example of the data:
RH~111~header~120~~~~~~~ball
RL~111~detailed~12~~~~~hat
RA~111~account~13~~~~~~~~~car
I want to change the rows starting with RH and RL, but not the ones starting with RA, so the result should look like:
RH~111~header~120~~1~~~~~ball
RL~111~detailed~12~~cancel~~~ball
RA~111~account~12~~~~~~~~~ball
I have attempted to use str.split, but it doesn't seem to actually split the string up:
(np.where(~df['1'].str.startswith('RH'),
          df['1'].str.split('~').str[5],
          df['1']))
This references the correct column but doesn't split it where I expected, and I can't seem to get further than this. I feel like I am not really going about this the right way.

Define a function to replace element number pos in the arr list:
def repl(arr, pos):
    arr[pos] = '1' if arr[0] == 'RH' else 'cancel'
    return '~'.join(arr)
Then perform the substitution:
df[0] = df[0].mask(df[0].str.match('^R[HL]'),
                   df[0].str.split('~').apply(repl, pos=5))
Details:
str.match ensures that only the proper elements are substituted.
df[0].str.split('~') splits the column of strings into a column
of lists (resulting from splitting of each string).
apply(repl, pos=5) computes the value to substitute.
I assumed that you have a DataFrame with a single column, so its column
name is 0 (an integer), instead of '1' (a string).
If this is not the case, change the column name in the code above.
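Putting it together, here is a self-contained sketch of this approach, using a made-up single-column DataFrame built from the sample rows above (the integer column name 0 is the assumption noted above):

import pandas as pd

# Hypothetical single-column frame mirroring the sample rows from the question.
df = pd.DataFrame({0: ['RH~111~header~120~~~~~~~ball',
                       'RL~111~detailed~12~~~~~hat',
                       'RA~111~account~13~~~~~~~~~car']})

def repl(arr, pos):
    # Replace element number `pos` of the split list, then join it back together.
    arr[pos] = '1' if arr[0] == 'RH' else 'cancel'
    return '~'.join(arr)

# Rows starting with RH or RL are rewritten; RA rows are left untouched.
df[0] = df[0].mask(df[0].str.match('^R[HL]'),
                   df[0].str.split('~').apply(repl, pos=5))

print(df[0].tolist())
# ['RH~111~header~120~~1~~~~~ball',
#  'RL~111~detailed~12~~cancel~~~hat',
#  'RA~111~account~13~~~~~~~~~car']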


Why is re not removing some values from my list?

I'm asking more out of curiosity at this point since I found a work-around, but it's still bothering me.
I have a list of dataframes (x) that all have the same column names. I'm trying to use pandas and re to make a list of the subset of column names that have the format
"D(number) S(number)"
so I wrote the following function:
def extract_sensor_columns(x):
    sensor_name = list(x[0].columns)
    for j in sensor_name:
        if bool(re.match('D(\d+)S(\d+)', j)) == False:
            sensor_name.remove(j)
    return sensor_name
The list that I'm generating has 103 items (98 wanted items, 5 unwanted items). This function removes three of the five columns that I want to get rid of, but keeps the columns labeled 'Pos' and 'RH'. I generated the sensor_name list outside of the function and tested the truth value of
bool(re.match('D(\d+)S(\d+)', sensor_name[j]))
for all five of the items that I wanted to get rid of, and they all evaluated to False. The other thing I tried was changing the conditional to ==True, which, even more strangely, gave me 54 items (all of the unwanted column names and half of the wanted column names).
If I rewrite the function to add the column names that have a given format (rather than remove column names that don't follow the format), I get the list I want.
def extract_sensor_columns(x):
    sensor_name = []
    for j in list(x[0].columns):
        if bool(re.match('D(\d+)S(\d+)', j)) == True:
            sensor_name.append(j)
    return sensor_name
Why is the first block of code acting so strangely?
In general, do not modify a list while iterating over it. The problem is that in the first (wrong) case you remove elements of the very iterable you are looping over, whereas in the second (correct) case you append the desired elements to a new, empty list.
Consider this:
arr = list(range(10))

for el in arr:
    print(el)

for i, el in enumerate(arr):
    print(el)
    arr.remove(arr[i+1])
The second loop only prints the even numbers, because every next element is removed while iterating.
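For completeness, the same filter can be written without mutating anything, as a sketch equivalent to the working second function from the question:

import re

def extract_sensor_columns(x):
    # Build a new list of the matching names instead of removing entries
    # from the list being iterated over.
    return [j for j in x[0].columns if re.match(r'D(\d+)S(\d+)', j)]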

Is there a way to iterate through a specific column and replace cell values in pandas?

How do I replace the cell values in a column if they contain a number in general, or contain something specific like a comma? I want to replace the whole cell value with something else.
For example, if a cell in the column has a comma (meaning it holds more than one thing), I want it to be replaced by text like "ENM".
For a cell that has a number value, I want to replace it with 'UNM'.
As you have not provided examples of what your expected and current output look like, I'm making some assumptions below. What it seems like you're trying to do is iterate through every value in a column and if the value meets certain conditions, change it to something else.
Just a general pointer. Iterating through dataframes requires some important considerations for larger sizes. Read through this answer for more insight.
Start by defining a function you want to use to check the value:
def has_comma(value):
    if ',' in value:
        return True
    return False
Then use the pandas.Series.replace method to make the change.
for i in df['column_name']:
    if has_comma(i):
        df['column_name'] = df['column_name'].replace([i], 'ENM')
    else:
        df['column_name'] = df['column_name'].replace([i], 'UNM')
Say you have a column, i.e. a pandas Series, called col.
The following code can be used to map values containing a comma to "ENM", as per your example:
col.mask(col.str.contains(','), "ENM")
You can overwrite your original column with this result if that's what you want to do. This approach will be much faster than looping through each element.
For mapping floats to "UNM", as per your example, the following would work:
col.mask(col.apply(isinstance, args=(float,)), "UNM")
Hopefully you get the idea.
See https://pandas.pydata.org/docs/reference/api/pandas.Series.mask.html for more info on masking
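As a minimal sketch of chaining the two masks, on a made-up mixed column (the sample values here are assumptions, not from the question):

import pandas as pd

# Hypothetical column mixing comma-separated strings and a number.
col = pd.Series(['a,b', 'plain', 3.14, 'x,y,z'])

result = (col.mask(col.apply(isinstance, args=(float,)), 'UNM')
             .mask(col.str.contains(',', na=False), 'ENM'))

print(result.tolist())  # ['ENM', 'plain', 'UNM', 'ENM']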

Python Dataframe find the index of a second occurrence of a string

I have seen that similar questions have been posted, but the solutions there do not work for me because, I believe, I am working with a column in a dataframe.
I have a column which has a string in it. I find the first occurrence of a term. This works. I then want to find the second occurrence of that term. This doesn't work.
My code
import pandas as pd
data = {"Text" : ["['one', 'one two']","['two one', 'three']"]}
df = pd.DataFrame(data)
#yes the data is in a list in a column but I treat it as a string
#finding if "one" is in the string - works
df["Ones"] = df.Text.str.find("one")
#finding if "one" is in the string another time as in the first row
df["NextOnes"] = df.Text.str.find("one",df.Ones +1)
The line for "NextOnes" returns NAs. In my real code, it returns blanks. If I replace the reference to the column with a number, such as 2, then this returns the correct value. However this value needs to be dynamic.
I have just got the needle-haystack approach below to work, but building in for loops seems inefficient here:
for i in range(len(df)):
    df["Next"][i] = find_nth(str(df.Text[i]), "one", 2)
You could try using the find method from str, passing the index of the previous match as the starting point (the second parameter to find).
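A minimal sketch of that idea on the question's own data, using a row-wise apply because Series.str.find only accepts a scalar start position:

import pandas as pd

data = {"Text": ["['one', 'one two']", "['two one', 'three']"]}
df = pd.DataFrame(data)

# Position of the first occurrence (vectorised, as in the question).
df["Ones"] = df.Text.str.find("one")

# Position of the second occurrence: fall back to Python's str.find row by row,
# passing the previous match index + 1 as the starting point.
df["NextOnes"] = df.apply(lambda r: r.Text.find("one", r.Ones + 1), axis=1)

print(df)
# Rows with no second occurrence get -1, the usual str.find convention.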

How can multiple matches be applied in a dataframe?

I have a performance issue with the following problem:
There is a dataframe of shape (3_000_000, 6), call it A, and another of shape (72_000, 6), call it B. To keep it simple, suppose both of them have only string columns.
In the B dataframe, some fields may contain a ? (question mark) in the value.
For example, in the CITY column: New ?ork instead of New York. The task is to find the matching string in the A dataframe.
So another example:
B Dataframe:

CITY      ADDRESS  ADDRESS_TYPE
New ?ork  D?ck     str?et

A Dataframe:

CITY      ADDRESS  ADDRESS_TYPE
New York  Duck     street
My plan was to use multiprocessing to iterate over the B dataframe and, for every row, apply a multistep filter, where first I filter with A = A[A.CITY.str.match(city) == True], where city is a regular expression built from the value, e.g. New ?ork => New .ork.
So first I pre-filter on the city, then on the address, and so on...
matched_rows = A[A.CITY.str.match(city) == True]
matched_rows = matched_rows[matched_rows.ADDRESS.str.match(address) == True]
...
Besides that, there is another really important thing: there is a unique ID column (not the index) in the A dataframe that uniquely identifies the rows.
ABC123 - New York - Duck - Street
It's really important to find the appropriate row of the A dataframe for every row of B.
The solution works but its performance is horrible, and I don't really know how else to approach it.
Using 5 cores brought the runtime down to ~200 minutes, but I hope there is a better solution to this problem.
So, my question is: is there a better, more performant way to apply multiple matches to a dataframe?
As indicated by #dwhswenson, the only strategy that comes to mind is to reduce the size of the dataframes you test, which is not so much a programming problem as a data management problem. This is going to depend on your dataset and what kind of work you want to do, but one naive strategy would be to store indices of rows in which column values start with 'a', 'b', etc. and then select a dataframe to match over based on the query string. So you'd do something like
import itertools

alphabet = 'abcdefghijklmnopqrstuvwxyz'
keys = list(itertools.chain.from_iterable(([c, l] for c in A.columns) for l in alphabet))

A_indexes = {}
for k in keys:
    begins_with = lambda x: (x[0] == k[1]) or (x[0] == k[1].upper())
    A_indexes[k[0], k[1]] = A[k[0]].loc[A[k[0]].apply(begins_with)].index
Then make a function that takes a column name and a string to match and returns a view of A that contains only rows whose entries for that column begin with the same letter as the string to match:
def get_view(column, string_to_match):
    # The lookup key is the (column, first letter) tuple used when building A_indexes.
    return A.loc[A_indexes[column, string_to_match[0].lower()]]
You'd have to double the number of indices of course for the case in which the first letter of the string to match is a wildcard, but this is only an example of the kind of thing you could do to slice the dataset before doing a regex on every row.
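A hypothetical usage sketch, assuming A has the CITY column from the example and the '?' wildcard has already been translated to '.':

# Narrow A down to rows whose CITY starts with 'n'/'N' before the full regex,
# then continue the match chain from the question on the smaller frame.
city_pattern = 'New .ork'                      # 'New ?ork' with '?' mapped to '.'
candidates = get_view('CITY', city_pattern)
matched_rows = candidates[candidates.CITY.str.match(city_pattern) == True]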
If you want to work with the unique ID above, you could make a more complex index lookup dictionary for views into A:
import itertools

alphabet = 'abcdefghijklmnopqrstuvwxyz'
number_of_fields_in_ID = 6
separator = ' - '
keys = itertools.product(alphabet, repeat=number_of_fields_in_ID)

def check_ID(id, key):
    # True if the first letter of each ID field matches the corresponding
    # letter of the key.
    id = id.split(separator)
    matches_key = True
    for i, first_letter in enumerate(key):
        if id[i][0].lower() != first_letter:
            matches_key = False
            break
    return matches_key

A_indexes = {}
for k in keys:
    A_indexes[k] = A.loc[A['unique_ID'].apply(check_ID, args=(k,))].index
This will apply the kludgy function check_ID to your 3e6 element series A['unique_ID'] 26**number_of_fields_in_ID times, indexing the dataframe once per iteration (this is more than 4.5e5 iterations if the ID has 4 fields), so there may be significant up-front cost in compute time depending on your dataset. Whether or not this is worth it depends first on what you want to do afterwards (are you just doing a single set of queries to make a second dataset and you're done, or do you need to do a lot of arbitrary queries in the future?) and second on whether the first letters of each ID field are roughly evenly distributed over the alphabet (or decimal digits if you know a field is numeric, or both if it's alphanumeric). If you just have two or three cities, for instance, you wouldn't make indexes for the whole alphabet in that field. But again, doing the lookup this way is naive - if you go this route you'd come up with a lookup method that's tailored to your data.
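A hypothetical usage sketch (the query ID is made up; its number of fields must match number_of_fields_in_ID used above):

# The lookup key is the tuple of lower-cased first letters of each ID field,
# in field order.
query_id = 'abc123 - new york - duck - street - foo - bar'   # made-up 6-field ID
key = tuple(field[0].lower() for field in query_id.split(separator))
candidates = A.loc[A_indexes[key]]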

What does this anonymous split function do?

narcoticsCrimeTuples = narcoticsCrimes.map(lambda x:(x.split(",")[0], x))
I have a CSV I am trying to parse by splitting on commas and the first entry in each array of strings is the primary key.
I would like to get the key on a separate line (or just separate) from the value when calling narcoticsCrimeTuples.first()[1]
My current understanding is 'split x by commas, take the first part of each split [0], and return that as the new x', but I'm pretty sure that middle part is not right because the number inside the [] can be anything and returns the same result.
Your variable is named "narcoticsCrimeTuples", so you seem to be expected to get a "tuple".
Your two values of the tuple are the first column of the CSV x.split(",")[0] and the entire line x.
I would like to get the key on a separate line
Not really clear why you want that...
(or just separate) from the value when calling narcoticsCrimeTuples.first()[1]
Well, when you call .first(), you get the entire tuple. [0] is the first column, and [1] would be the corresponding line of the CSV, which also contains the [0] value.
If you do narcoticsCrimes.flatMap(lambda x: x.split(",")), then all the values will be separated.
For example, in the word count example...
textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1))
Judging by the syntax, it seems like you are using PySpark. If that's true, you're mapping over your RDD and creating a (key, row) tuple for each row, the key being the first element in a comma-separated list of items. Doing narcoticsCrimeTuples.first() will just give you the first record.
See an example here:
https://gist.github.com/amirziai/5db698ea613c6857d72e9ce6189c1193
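A minimal PySpark sketch of the idea (the RDD name comes from the question; the sample lines are made up):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
narcoticsCrimes = sc.parallelize(["ID1,2016-01-01,NARCOTICS",
                                  "ID2,2016-01-02,NARCOTICS"])

# (key, row) tuples: the key is the first comma-separated field,
# the value is the full original line.
narcoticsCrimeTuples = narcoticsCrimes.map(lambda x: (x.split(",")[0], x))

first = narcoticsCrimeTuples.first()
print(first[0])  # 'ID1' -> the key on its own
print(first[1])  # 'ID1,2016-01-01,NARCOTICS' -> the whole original line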
