Converting a string to code
Noteworthy points:
I'm new to coding and am testing various things to learn;
i.e. yes, I'm sure there are better ways to achieve what I am trying to do;
I would like to know any alternative / more efficient methods, however;
I would also still like to know how to convert string to code to achieve my goal with this technique
So far I have looked around the forum and on Google, and seen a few topics on this, none of which I could make work here, or which precisely answer the question from my perspective, including ones using eval and exec.
The Scenario
I have a dataframe: london with 23 columns
I want to create a dataframe showing all rows with 'NaN' values
I have tried to use .isnull(), but it appears to only work on a single column at a time
I am trying to achieve my desired result by using | to return any rows in any columns where .isnull() returns True
An example of this working with just two columns is:
london[(london['Events'].isnull() | london['Max Gust SpeedKm/h'].isnull())]
However, I need to achieve this result with all 23 columns, so I have attempted to complete this with some code.
Attempted Solution
Creating a string containing all of the column headers
i.e. london[(london['Column Header'].isnull() followed by | and then the next column
Then using this string within the container shown in the working example above
i.e. london[(string)]
I have managed to create the string I need using the following:
string = []
for i in london.columns.values:
    string.append("london['" + i + "'].isnull()")
    string.append(" | ")
del string[-1]
final_string = "".join(string)
And finally when I try to implement the final step, I cannot work out how to convert this string into usable code.
For example:
now = eval(final_string)
london[now]
Resulting in:
NotImplementedError: 'Call' nodes are not implemented
Thank you in advance.
This is the easiest way to select the rows in your dataframe with NaN values:
df[pd.isnull(df).any(axis=1)]
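For instance, on a small throwaway frame (my own toy data, not the question's london), this picks out exactly the rows containing at least one NaN:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [4, 5, np.nan], 'c': [7, 8, 9]})

# rows 1 and 2 each contain a NaN somewhere; row 0 does not
print(df[pd.isnull(df).any(axis=1)])
#      a    b  c
# 1  NaN  5.0  8
# 2  3.0  NaN  9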
string = []
for i in london.columns.values:
    string.append(london[i].isnull())
london[0 < sum(string)]
Since isnull() gives you only 1s and 0s, and you are looking for rows with at least one 1, you can collect each column's mask in a list and sum them. Wherever the sum is greater than zero, at least one column in that row is null, so 0 < sum(string) is a boolean mask you can use to index london.
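A minimal, self-contained sketch of the summed-masks idea (toy data standing in for london):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [4, 5, np.nan]})

masks = [df[c].isnull() for c in df.columns]  # one boolean Series per column
# sum() adds the Series elementwise, giving the per-row count of nulls;
# 0 < count is True exactly where some column is null
print(df[0 < sum(masks)])  # selects rows 1 and 2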
Related
I have a set with strings inside a column in a Pandas DataFrame:
                                x
A   {'string1, string2, string3'}
B   {'string4, string5, string6'}
I need to get the length of each set and ideally create a new column with the results
                                x   x_length
A   {'string1, string2, string3'}          3
B            {'string4, string5'}          2
I don't know why, but everything I tried so far always returns the length of the set as 1.
Here's what I've tried:
df['x_length'] = df['x'].str.len()
df['x_length'] = df['x'].apply(lambda x: len(x))
Custom function from another post:
def to_1D(series):
    return pd.Series([len(x) for _list in series for x in _list])

to_1D(df['x'])
This function returns the number of characters in the whole set, not the length of the set.
I've even tried to convert the set to a list and tried the same functions, but still got the wrong results.
I feel like I'm very close to the answer, but I can't seem to figure it out.
I don't know why, but everything I tried so far always returns the length of the set as 1.
{'string1, string2, string3'} and {'string4, string5, string6'} are sets holding a single str each (delimited by ') rather than sets with 3 strs each (which would be {'string1', 'string2', 'string3'} and {'string4', 'string5', 'string6'} respectively). So there is a problem somewhere earlier in your pipeline which produces sets with a single element rather than several. Once you find and eliminate that problem, your functions should start working as intended.
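A quick way to see the difference (a standalone sketch, not your actual data):

s_one = {'string1, string2, string3'}        # ONE element: a single comma-separated str
s_three = {'string1', 'string2', 'string3'}  # THREE separate str elements
print(len(s_one), len(s_three))              # 1 3

# If the single-string form really is what you are stuck with, one possible
# (hypothetical) repair before measuring lengths:
# df['x_length'] = df['x'].apply(lambda s: len(next(iter(s)).split(', ')))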
How do I replace the cell values in a column if they contain a number in general, or contain something specific like a comma? I want to replace the whole cell value with something else.
Say, for example, a cell contains a comma (meaning it holds more than one thing): I want the whole value replaced by text like "ENM".
For a cell that holds a number value, I want to replace it with 'UNM'.
As you have not provided examples of what your expected and current output look like, I'm making some assumptions below. What it seems like you're trying to do is iterate through every value in a column and if the value meets certain conditions, change it to something else.
Just a general pointer. Iterating through dataframes requires some important considerations for larger sizes. Read through this answer for more insight.
Start by defining a function you want to use to check the value:
def has_comma(value):
    if ',' in value:
        return True
    return False
Then use the pandas.DataFrame.replace method to make the change.
for i in df['column_name']:
    if has_comma(i):
        df['column_name'] = df['column_name'].replace([i], 'ENM')
    else:
        df['column_name'] = df['column_name'].replace([i], 'UNM')
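A quick check of that loop on toy data (assuming a purely string column, since ',' in value raises a TypeError on numbers):

import pandas as pd

df = pd.DataFrame({'column_name': ['x,y', 'plain', 'a,b,c']})

for i in df['column_name']:
    if has_comma(i):
        df['column_name'] = df['column_name'].replace([i], 'ENM')
    else:
        df['column_name'] = df['column_name'].replace([i], 'UNM')

print(df['column_name'].tolist())  # ['ENM', 'UNM', 'ENM']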
Say you have a column, i.e. a pandas Series, called col.
The following code can be used to map values with comma to "ENM" as per your example
col.mask(col.str.contains(','), "ENM")
You can overwrite your original column with this result if that's what you want to do. This approach will be much faster than looping through each element.
For mapping floats to "UNM" as per your example the following would work
col.mask(col.apply(isinstance, args=(float,)), "UNM")
Hopefully you get the idea.
See https://pandas.pydata.org/docs/reference/api/pandas.Series.mask.html for more info on masking
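Putting both masks together on a small mixed column (my own toy data):

import pandas as pd

col = pd.Series(['one,two', 'single', 3.14])

is_float = col.apply(isinstance, args=(float,))  # True only for 3.14
has_comma = col.astype(str).str.contains(',')    # True only for 'one,two'

print(col.mask(is_float, 'UNM').mask(has_comma, 'ENM').tolist())
# ['ENM', 'single', 'UNM']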
New Python user here, so please pardon my ignorance if my approach seems completely off.
I am having troubles filtering rows of a column based off of their Character/Number format.
Here's an example of the DataFrame and Series
df = {'a': [1, 2, 4, 5, 6], 'b': [7, 8, 9, 10], 'target': ['ABC1234', 'ABC123', '123ABC', '7KZA23']}
The column I am looking to filter is the "target" column based on their character/number combos and I am essentially trying to make a dict like below
{'ABC1234': counts_of_format
'ABC123': counts_of_format
'123ABC': counts_of_format
'any_other_format': counts_of_format}
Here's my progress so far:
col = df['target'].astype('string')
abc1234_pat = '^[A-Z]{3}[0-9]{4}'
matches = re.findall(abc1234_pat, col)
I keep getting this error:
TypeError: expected string or bytes-like object
I've double checked the dtype and it comes back as string. I've researched the TypeError, and the only solutions I can find involve converting it to a string.
Any insight or suggestion on what I might be doing wrong, or if this is simply the wrong approach to this problem, will be greatly appreciated!
Thanks in advance!
I am trying to create a dict that returns how many times the different character/number combos occur. For example, how many times do 3 characters followed by 4 numbers occur, and so on.
(Your problem would have been understood earlier and more easily had you stated this in the question post itself rather than in a comment.)
By characters, you mean letters; by numbers, you mean digits.
abc1234_pat = '^[A-Z]{3}[0-9]{4}'
Since you want to count occurrences of all character/number combos, this approach of using one concrete pattern would not lead very far. I suggest transforming the targets to a canonical form which serves as the key of your desired dict, e.g. substitute every letter with C and every digit with N (using your terms).
Of the many ways to tackle this, one is using str.translate together with a class which does the said transformation.
class classify():
    def __getitem__(self, key):
        return ord('C' if chr(key).isalpha() else 'N' if chr(key).isdigit() else None)

occ = df.target.str.translate(classify()).value_counts()  # .to_dict()
Note that this will purposely raise an exception if target contains non-alphanumeric characters.
You can convert the resulting Series to a dict with .to_dict() if you like.
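For illustration, running this against the sample targets from the question (reusing the classify class above; dict ordering may vary since all counts here are 1):

import pandas as pd

df = pd.DataFrame({'target': ['ABC1234', 'ABC123', '123ABC', '7KZA23']})

occ = df.target.str.translate(classify()).value_counts()
print(occ.to_dict())
# {'CCCNNNN': 1, 'CCCNNN': 1, 'NNNCCC': 1, 'NCCCNN': 1}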
I have a dataframe where each series is filled with 0s and 1s as follows:
flagdf=pd.DataFrame({'a':[1,0,1,0,1,0,0,1], 'b':[0,0,1,0,1,0,1,0]})
Now, depending on some analysis I have done, I need to change some 0s to 1s. So the final dataframe will be:
final=pd.DataFrame({'a':[1,1,1,0,1,1,1,1], 'b':[1,1,1,0,1,1,1,1]})
The results of the analysis which shows which 0s have to be changed are stored in a second dataframe built with a multi-index:
       first  last
a 1        1     1
  5        5     6
b 0        0     1
  5        5     5
  7        7     7
For each 'a' and 'b' I have the first and the last indexes of the 0s I need to change.
First question: The second index in the multi-index dataframe is equal to the series 'first'. I was initially trying to use it directly, but I found it easier to handle two series rather than an index and a series. Am I missing something?
Here is the code to do the job:
def change_one_value_one_column(flagdf, col_name, event):
    flagdf[col_name].iloc[event] = 1

def change_val_column(col_name, tochange, flagdf):
    col_tochange = tochange.ix[col_name]
    tomod = col_tochange[['first', 'last']].values
    iter_tomod = [xrange(el[0], el[1] + 1) for el in tomod]
    [change_one_value_one_column(flagdf, col_name, event) for iterel in iter_tomod for event in iterel]

[change_val_column(col_name, tochange, flagdf) for col_name in flagdf.columns]
Second question: I genuinely think that a list comprehension is always good, but in cases like this, when I write a function specifically for a list comprehension, I have some doubts. Is it truly the best thing to do?
Third question: I think the code is quite pythonic, but I am not proud of it because of the last list comprehension, which runs over the series of the dataframe: using the method apply would look better to my eyes (but I'm not sure how to do it). Nonetheless, is there any real reason (apart from elegance) why I should work to change it?
To answer the part about exhausting an iterator, I think you have a few pythonic choices (all of which I prefer over a list comprehension):
# the easiest, and most readable
for col_name in flagdf.columns:
    change_val_column(col_name)

# consume/exhaust an iterator using built-in any (assuming each call returns None)
any(change_val_column(col_name) for col_name in flagdf.columns)

# use itertools' consume recipe
consume(change_val_column(col_name) for col_name in flagdf.columns)
See consume recipe from itertools.
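For reference, here is the consume recipe from the itertools documentation, copied so the snippet above is runnable:

from collections import deque
from itertools import islice

def consume(iterator, n=None):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    if n is None:
        # feed the entire iterator into a zero-length deque
        deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)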
However, when doing this kind of thing in numpy/pandas, you should be asking yourself "can I vectorize / use indexing here?". If you can, your code will usually be both faster and more readable.
I think in this case you'll be able to remove one level of loops by doing something like:
def change_val_column(col_name, tochange, flagdf):
    col_tochange = tochange.ix[col_name]  # Note: you're accessing index not column here??
    tomod = col_tochange[['first', 'last']].values
    for i, j in tomod:
        flagdf.loc[i:j, col_name] = 1
You may even be able to remove the for loop, but it's not obvious how / what the intention is here...
If I'm staying in python and iterating over rows, I prefer using zip/izip as a first pass.
for col, start, end in izip(tochange.index.get_level_values(0), tochange['first'], tochange['last']):
    flagdf.loc[start:end, col] = 1
Simple and fast.
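Putting it all together with the frames from the question (using Python 3's built-in zip in place of izip); the tochange construction below is my guess at the multi-index layout shown in the question:

import pandas as pd

flagdf = pd.DataFrame({'a': [1, 0, 1, 0, 1, 0, 0, 1], 'b': [0, 0, 1, 0, 1, 0, 1, 0]})
tochange = pd.DataFrame(
    {'first': [1, 5, 0, 5, 7], 'last': [1, 6, 1, 5, 7]},
    index=pd.MultiIndex.from_tuples([('a', 1), ('a', 5), ('b', 0), ('b', 5), ('b', 7)]))

for col, start, end in zip(tochange.index.get_level_values(0), tochange['first'], tochange['last']):
    flagdf.loc[start:end, col] = 1

print(flagdf)  # matches the 'final' frame from the question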
I am working with Python and I am new to it. I am looking for a way to take a string and split it into two smaller strings. An example of the string is below:
wholeString = '102..109'
And what I am trying to get is:
a = '102'
b = '109'
The information will always be separated by two periods like shown above, but the number of characters before and after can range anywhere from 1 - 10 characters in length. I am writing a loop that counts characters before and after the periods and then makes a slice based on those counts, but I was wondering if there was a more elegant way that someone knew about.
Thanks!
Try this:
a, b = wholeString.split('..')
It'll put each value into the corresponding variable.
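For example, with the string from the question:

wholeString = '102..109'
a, b = wholeString.split('..')
print(a)  # '102'
print(b)  # '109'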
Look at the str.split method.
split_up = [s.strip() for s in wholeString.split("..")]
This code will also strip off leading and trailing whitespace so you are just left with the values you are looking for. split_up will be a list of these values.