Getting length of set inside pandas column - python

I have a set with strings inside a column in a Pandas DataFrame:
x
A {'string1, string2, string3'}
B {'string4, string5, string6'}
I need to get the length of each set and ideally create a new column with the results:
x x_length
A {'string1, string2, string3'} 3
B {'string4, string5'} 2
I don't know why, but everything I tried so far always returns the length of the set as 1.
Here's what I've tried:
df['x_length'] = df['x'].str.len()
df['x_length'] = df['x'].apply(lambda x: len(x))
Custom function from another post:
def to_1D(series):
    return pd.Series([len(x) for _list in series for x in _list])

to_1D(df['x'])
This function returns the number of characters in the whole set, not the length of the set.
I've even tried to convert the set to a list and tried the same functions, but still got the wrong results.
I feel like I'm very close to the answer, but I can't seem to figure it out.

I don't know why but everything I tried so far always returns the length of the set as 1.
{'string1, string2, string3'} and {'string4, string5, string6'} are sets each holding a single str (the whole thing is one quoted string) rather than sets of 3 str each (which would be {'string1', 'string2', 'string3'} and {'string4', 'string5', 'string6'} respectively). So the problem lies somewhere earlier, in whatever produced sets with a single element instead of several. Once you find and eliminate that problem, your functions should start working as intended.
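As a stopgap, here is a minimal sketch (assuming each cell really is a set wrapping one comma-separated string, as shown above) that rebuilds proper sets and then takes their lengths:
import pandas as pd

# Hypothetical reconstruction of the data shown in the question
df = pd.DataFrame({'x': [{'string1, string2, string3'},
                         {'string4, string5, string6'}]},
                  index=['A', 'B'])

# Split the single comma-separated string in each cell into a real set of strings
df['x'] = df['x'].apply(lambda s: {part.strip() for part in next(iter(s)).split(',')})

# Now the original attempts behave as expected
df['x_length'] = df['x'].apply(len)
print(df)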

Related

Is it possible to hard declare a variable in Python?

I am trying to use a variable inside a substructure. I guess the variable should be of integer data type, and I am trying to add a loop here, but my data type is a list since it contains multiple integers.
INV_match_id = [['3749052'], ['3749522']]
from statsbombpy import sb
for x in range(2):
    match = INV_match_id[x]
    match_db = sb.events(match_id=match)
    print(match)
I have tried to extract the data one by one using another variable, but it still ends up declared as a list. Whenever I give a direct value to "match" it works; for example, if I add a line match=12546 the substructure takes the value properly.
Next thing I want to try is hard declaring the "match" variable as an integer. Any input is appreciated. I am pretty new to Python.
Edit: Adding this solution from @quamrana here.
"So, to answer your original question: Is it possible to hard declare a variable in Python?, the answer is No. Variables in python are just references to objects. Objects can be of whatever type they want to be."
You said: " I want to loop and take the numbers one by one."
Did you mean this:
for match in INV_match_id:
    match_db = sb.events(match_id=match)
I don't know what you want to do with match_db.
Update:
"that single number is also declared as a list. like this- ['125364']"
Well if match == ['125364'] then it depends on whether you want: "125364" or 125364. I assume the latter since you talk a lot about integers:
for match in INV_match_id:
    match = int(match[0])
    match_db = sb.events(match_id=match)
Next Update:
So you have: INV_match_id = ['3749052','3749522']
This means that the list is a list of strings, so the code changes to this:
for match in INV_match_id:
    match_db = sb.events(match_id=int(match))
Your original code was making match into a list of the digits of each number. (eg match = [1,2,5,3,6,4])
Reversionary Update:
This time we have: INV_match_id = [['3749052'],['3749522']]
that just means going back to the second version of my code above:
for match in INV_match_id:
    match = int(match[0])
    match_db = sb.events(match_id=match)
It's as simple as:
from statsbombpy import sb

INV_match_id = [['3749052'], ['3749522']]
for e in INV_match_id:
    match_db = sb.events(match_id=e[0])
    print(match_db)
You have a list of lists, albeit the sub-lists only contain one item each.
match_id can be either a string or an int.
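As a quick sanity check, a hypothetical sketch relying only on the sb.events call and the claim above (the id values are the ones from the question):
from statsbombpy import sb

# Per the note above, both forms should be accepted
match_db = sb.events(match_id='3749052')
match_db = sb.events(match_id=3749052)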

How to Filter Rows in a DataFrame Based on a Specific Number of Characters and Numbers

New Python user here, so please pardon my ignorance if my approach seems completely off.
I am having troubles filtering rows of a column based off of their Character/Number format.
Here's an example of the DataFrame and Series
df = pd.DataFrame({'a': [1, 2, 4, 5], 'b': [7, 8, 9, 10], 'target': ['ABC1234', 'ABC123', '123ABC', '7KZA23']})
The column I am looking to filter is the "target" column, based on the character/number combos, and I am essentially trying to make a dict like the one below:
{'ABC1234': counts_of_format,
 'ABC123': counts_of_format,
 '123ABC': counts_of_format,
 'any_other_format': counts_of_format}
Here's my progress so far:
import re

col = df['target'].astype('string')
abc1234_pat = '^[A-Z]{3}[0-9]{4}'
matches = re.findall(abc1234_pat, col)
I keep getting this error:
TypeError: expected string or bytes-like object
I've double-checked the dtype and it comes back as string. I've researched the TypeError and the only solutions I can find involve converting it to a string.
Any insight or suggestion on what I might be doing wrong, or if this is simply the wrong approach to this problem, will be greatly appreciated!
Thanks in advance!
I am trying to create a dict that returns how many times the different character/number combos occur. For example, how many times do 3 characters followed by 4 numbers occur, and so on.
(Your problem would have been understood earlier and more easily had you stated this in the question post itself rather than in a comment.)
By characters, you mean letters; by numbers, you mean digits.
abc1234_pat = '^[A-Z]{3}[0-9]{4}'
Since you want to count occurrences of all character/number combos, this approach of using one concrete pattern would not lead very far. I suggest transforming the targets to a canonical form which serves as the key of your desired dict, e.g. substituting every letter with C and every digit with N (using your terms).
Of the many ways to tackle this, one is using str.translate together with a class which performs said transformation.
class classify():
    def __getitem__(self, key):
        return ord('C' if chr(key).isalpha() else 'N' if chr(key).isdigit() else None)

occ = df.target.str.translate(classify()).value_counts()  # .to_dict()
Note that this will purposely raise an exception if target contains non-alphanumeric characters.
You can convert the resulting Series to a dict with .to_dict() if you like.
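For reference, a small sketch of what this yields on the sample targets from the question (the counts shown are illustrative, assuming only those four values):
import pandas as pd

df = pd.DataFrame({'target': ['ABC1234', 'ABC123', '123ABC', '7KZA23']})

class classify():
    def __getitem__(self, key):
        return ord('C' if chr(key).isalpha() else 'N' if chr(key).isdigit() else None)

occ = df.target.str.translate(classify()).value_counts()
print(occ.to_dict())
# e.g. {'CCCNNNN': 1, 'CCCNNN': 1, 'NNNCCC': 1, 'NCCCNN': 1}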

How to create new column by manipulating another column? pandas

I am trying to make a new column depending on different criteria. I want to add characters to the string dependent on the starting characters of the column.
An example of the data:
RH~111~header~120~~~~~~~ball
RL~111~detailed~12~~~~~hat
RA~111~account~13~~~~~~~~~car
I want to change those starting with RH and RL, but not the ones starting with RA. So I want it to look like:
RH~111~header~120~~1~~~~~ball
RL~111~detailed~12~~cancel~~~ball
RA~111~account~12~~~~~~~~~ball
I have attempted to use str.split, but it doesn't seem to actually be splitting the string up:
(np.where(~df['1'].str.startswith('RH'),
          df['1'].str.split('~').str[5],
          df['1']))
This is referencing the correct columns but not splitting where I thought it would, and I can't seem to get further than this. I feel like I am not really going about this the right way.
Define a function that replaces element number pos in the list arr:
def repl(arr, pos):
    arr[pos] = '1' if arr[0] == 'RH' else 'cancel'
    return '~'.join(arr)
Then perform the substitution:
df[0] = df[0].mask(df[0].str.match('^R[HL]'),
                   df[0].str.split('~').apply(repl, pos=5))
Details:
str.match ensures that only the matching elements are substituted.
df[0].str.split('~') splits the column of strings into a column of lists (one list resulting from splitting each string).
apply(repl, pos=5) computes the value to substitute.
I assumed that you have a DataFrame with a single column, so its column name is 0 (an integer) instead of '1' (a string).
If this is not the case, change the column name in the code above.
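A minimal end-to-end sketch of the above (assuming a single-column DataFrame built from the sample rows in the question):
import pandas as pd

df = pd.DataFrame({0: ['RH~111~header~120~~~~~~~ball',
                       'RL~111~detailed~12~~~~~hat',
                       'RA~111~account~13~~~~~~~~~car']})

def repl(arr, pos):
    arr[pos] = '1' if arr[0] == 'RH' else 'cancel'
    return '~'.join(arr)

df[0] = df[0].mask(df[0].str.match('^R[HL]'),
                   df[0].str.split('~').apply(repl, pos=5))
print(df[0].tolist())
# RH and RL rows get '1' / 'cancel' in field 5; the RA row is left untouched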

Pyspark tuple object has no attribute split

I am struggling with a Pyspark assignment. I am required to get a sum of all the viewing numbers per channel. I have 2 sets of files: one showing the show and the views per show, the other showing the shows and what channel they are shown on (can be multiple).
I have performed a join operation on the 2 files and the result looks like ..
[(u'Surreal_News', (u'BAT', u'11')),
(u'Hourly_Sports', (u'CNO', u'79')),
(u'Hourly_Sports', (u'CNO', u'3')),
I now need to extract the channel as the key and then, I think, do a reduceByKey to get the sum of views for the channels.
I have written this function to extract the channel as the key with the views alongside, which I could then feed into a reduceByKey to sum the results. However, when I try to display the results of the function below with collect() I get an "AttributeError: 'tuple' object has no attribute 'split'" error:
def extract_chan_views(show_chan_views):
    key_value = show_chan_views.split(",")
    chan_views = key_value[1].split(",")
    chan = chan_views[0]
    views = int(chan_views[1])
    return (chan, views)
Since this is an assignment, I'll try to explain what's going on rather than just doing the answer. Hopefully that will be more helpful!
This actually isn't anything to do with pySpark; it's just a plain Python issue. As the error says, you're trying to split a tuple, but split is a string operation. Instead, access the elements by index. The object you're passing in:
[(u'Surreal_News', (u'BAT', u'11')),
(u'Hourly_Sports', (u'CNO', u'79')),
(u'Hourly_Sports', (u'CNO', u'3')),
is a list of tuples, where the first index is a unicode string and the second is another tuple. You can split them apart like this (I'll annotate each step with comments):
for item in your_list:
    # item = (u'Surreal_News', (u'BAT', u'11')) on iteration one
    first_index, second_index = item  # this will unpack the two indices
    # now:
    # first_index = u'Surreal_News'
    # second_index = (u'BAT', u'11')
    first_sub_index, second_sub_index = second_index  # unpack again
    # now:
    # first_sub_index = u'BAT'
    # second_sub_index = u'11'
Note that you never had to split on commas anywhere. Also note that the u'11' is a string, not an integer in your data. It can be converted, as long as you're sure it's never malformed, with int(u'11'). Or if you prefer specifying indices to unpacking, you can do the same thing:
first_index, second_index = item
is equivalent to:
first_index = item[0]
second_index = item[1]
Also note that this gets more complicated if you are unsure what form the data will take - that is, if sometimes the objects have two items in them, other times three. In that case unpacking and indexing in a generalized way for a loop require a bit more thought.
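Putting that together, a hypothetical rewrite of the extraction function (the name joined_rdd is an assumption, not from the original post) that indexes into the joined tuples instead of splitting strings, followed by the reduceByKey step the asker mentions:
def extract_chan_views(show_chan_views):
    # show_chan_views looks like (u'Surreal_News', (u'BAT', u'11'))
    show, (chan, views) = show_chan_views
    return (chan, int(views))

# joined_rdd is the RDD produced by the join described above
channel_totals = joined_rdd.map(extract_chan_views).reduceByKey(lambda a, b: a + b)
print(channel_totals.collect())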
I am not exactly resolving your code, but I faced the same error when I applied a join transformation on two datasets.
Let's say A and B are two RDDs:
c = A.join(B)
We may think we can treat the elements of c like strings, but they are tuples, so we cannot perform any split(",") kind of operations on them; the values have to be accessed by index instead.
If we want to access a tuple, let's say D is one of those tuples:
E = D[1]  # instead of E = D.split(",")[1]

How do I convert a string into code in Python?

Converting a string to code
Noteworthy points:
I'm new to coding and am testing various things to learn;
i.e. yes, I'm sure there are better ways to achieve what I am trying to do;
I would like to know any alternative / more efficient methods, however;
I would also still like to know how to convert string to code to achieve my goal with this technique
So far I have looked around the forum and on Google and seen a few topics on this, none of which I could make work here or which precisely answer the question from my perspective, including using eval and exec.
The Scenario
I have a dataframe: london with 23 columns
I want to create a dataframe showing all rows with 'NaN' values
I have tried to use .isnull(), but it appears to only work on a single column at a time
I am trying to achieve my desired result by using | to return any rows in any columns where .isnull() returns True
An example of this working with just two columns is:
london[(london['Events'].isnull() | london['Max Gust SpeedKm/h'].isnull())]
However, I need to achieve this result with all 23 columns, so I have attempted to complete this with some code.
Attempted Solution
Creating a string containing all of the column headers
i.e. london[(london['Column Header'].isnull() followed by | and then the next column
Then using this string within the container shown in the working example above
i.e. london[(string)]
I have managed to create the string I need using the following:
string = []
for i in (london.columns.values):
    string.append("london['" + i + "'].isnull()")
    string.append(" | ")
del string[-1]
final_string = "".join(string)
And finally when I try to implement the final step, I cannot work out how to convert this string into usable code.
For example:
now = eval(final_string)
london[now]
Resulting in:
NotImplementedError: 'Call' nodes are not implemented
Thank you in advance.
This is the easiest way to select the rows in your dataframe with NaN values:
df[pd.isnull(df).any(axis=1)]
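For example, a small sketch with made-up data (df standing in for london, with only two of the 23 columns):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Events': ['Rain', np.nan, 'Fog'],
                   'Max Gust SpeedKm/h': [50, 60, np.nan]})

# Rows where any column holds NaN
print(df[pd.isnull(df).any(axis=1)])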
string = []
for i in (london.columns.values):
    string.append(london[i].isnull())
london[0 < sum(string)]
Since the .isnull() results contain only 1s and 0s and you are looking for at least one 1, you can just collect them in the list and sum them. If the row-wise sum is greater than zero, the condition is True; otherwise it is False, so you can use it to index london after that.
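A quick sketch of that idea, using the same made-up df as above:
# One boolean Series per column
masks = [df[col].isnull() for col in df.columns]

# Element-wise sum of the booleans, then keep rows where at least one is True
row_has_nan = sum(masks) > 0
print(df[row_has_nan])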
