I have the following command output:
data = """
abcd11 11
abcd12 12
abcd13 13
abcd21 14
abcd22 15
abcd23 16
abcd101 17
abcd102 18
abcd103 19
... so on
abcd501 1
abcd502 2
"""
Condition 1: the numbers (which are strings in the data) must be in the range 1 to 255, i.e. they must not exceed 255.
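(A quick sketch for checking that condition over the parsed data, skipping the non-numeric "... so on" placeholder line:

nums = [tok[1] for tok in (line.split() for line in data.splitlines()) if len(tok) > 1]
assert all(1 <= int(n) <= 255 for n in nums if n.isdigit()), "some number is outside 1-255"
)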
Code:
import sys

# Check whether abcd401, abcd402, abcd403 already exist
check = set()
for line in data.split("\n"):
    if len(line.split()) > 1:
        line = line.strip()
        check.add(line.split()[0])
if "abcd401" not in check and "abcd402" not in check and "abcd403" not in check:
    print "Not exist"
else:
    print "It already exists. Program exit"
    sys.exit()
Now I need to assign a number to abcd401, abcd402 and abcd403, with each number between 1 and 255.
I could always assign abcd401 = 1, abcd402 = 2, abcd403 = 3, but I need to fill 1-255, then start over at 1-255, and so on. Please help.
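A minimal sketch of the cycling idea with just the standard library, before the pandas answer below (the name tuple here is hypothetical; itertools.cycle handles the wrap-around from 255 back to 1):

from itertools import cycle

existing = {line.split()[0] for line in data.splitlines() if len(line.split()) > 1}
next_id = cycle(range(1, 256))  # yields 1, 2, ..., 255, 1, 2, ...
for name in ("abcd401", "abcd402", "abcd403"):
    if name not in existing:
        data += "%s %d\n" % (name, next(next_id))

Note this simply cycles; it does not avoid numbers already used in the data.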
I am solving the problem of adding a line if it does not already exist in the data (where data is a multiline string). This can be done easily with pandas, and the solution scales well if you use pandas. I did not focus on randomizing the number assigned to each line; I just cycle from 1 to 255 and, once the last index is reached, start again from 1. :) That part you can take care of yourself.
My quick solution is:
import cStringIO as io
import pandas as pd
from itertools import cycle

text_data = """abcd11 11
abcd12 12
abcd13 13
abcd21 14
abcd22 15
abcd23 16
abcd101 17
abcd102 18
abcd103 19
abcd501 1
abcd502 2"""

def get_next_id():
    # Yield 1, 2, ..., 255, then wrap around to 1 again.
    cyc = cycle(range(1, 256))
    for i in cyc:
        yield i

next_id = get_next_id()

def load_data():
    content = io.StringIO(text_data)
    df = pd.read_csv(content, header=None, sep=r"\s+", names=["txt", "num"])
    print "Content is loaded to pandas dataframe\n", df
    return df

def add_line_to_df(txt, df):
    # Append a new row with the next cyclic id.
    idx = next_id.next()
    df2 = pd.DataFrame([[txt, idx]], columns=["txt", "num"])
    df.loc[len(df.index)] = df2.iloc[0]
    return df

def insert_valid_line(line, df):
    if line in pd.Series(df["txt"]).values:
        print "{}: already existed.".format(line)
    else:
        print "{}: adding to existing df.".format(line)
        add_line_to_df(line, df)

def main():
    df = load_data()
    new_texts = ["abcd501", "abcd502", "abcd402", "abcd403"]
    for txt in new_texts:
        print "-" * 20
        insert_valid_line(txt, df)
    print "-" * 20
    print df
    # At this point df holds all the data.

if __name__ == '__main__':
    main()
The output looks like this:
Content is loaded to pandas dataframe
txt num
0 abcd11 11
1 abcd12 12
2 abcd13 13
3 abcd21 14
4 abcd22 15
5 abcd23 16
6 abcd101 17
7 abcd102 18
8 abcd103 19
9 abcd501 1
10 abcd502 2
--------------------
abcd501: already existed.
--------------------
abcd502: already existed.
--------------------
abcd402: adding to existing df.
--------------------
abcd403: adding to existing df.
--------------------
txt num
0 abcd11 11
1 abcd12 12
2 abcd13 13
3 abcd21 14
4 abcd22 15
5 abcd23 16
6 abcd101 17
7 abcd102 18
8 abcd103 19
9 abcd501 1
10 abcd502 2
11 abcd402 1
12 abcd403 2
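A side note, since the code above is Python 2 (cStringIO, print statements, generator .next()): under Python 3 the equivalents would be, roughly:

import io                       # io.StringIO replaces cStringIO.StringIO
from itertools import cycle

next_id = cycle(range(1, 256))  # cycle() is already an iterator; no wrapper generator needed
print(next(next_id))            # next(it) replaces it.next()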
Python newbie here.
Imagine a csv file that looks something like this:
(...except that in real life, there are 20 distinct names in the Person column, and each Person has 300-500 rows. Also, there are multiple data columns, not just one.)
What I want to do is randomly flag 10% of each Person's rows and mark this in a new column. I came up with a ridiculously convoluted way to do this--it involved creating a helper column of random numbers and all sorts of unnecessarily complicated jiggery-pokery. It worked, but was crazy. More recently, I came up with this:
import pandas as pd

df = pd.read_csv('source.csv')
df['selected'] = ''
names = list(df['Person'].unique())  # gets list of unique names
for name in names:
    df_temp = df[df['Person'] == name]
    samp = int(len(df_temp) / 10)  # I want to sample 10% for each name
    df_temp = df_temp.sample(samp)
    df_temp['selected'] = 'bingo!'  # a new column to mark the rows I've randomly selected
    df = df.merge(df_temp, how='left', on=['Person', 'data'])
    df['temp'] = [f"{a} {b}" for a, b in zip(df['selected_x'], df['selected_y'])]
    # Note: initially instead of the line above, I tried the line below, but it didn't work too well:
    # df['temp'] = df['selected_x'] + df['selected_y']
    df = df[['Person', 'data', 'temp']]
    df = df.rename(columns={'temp': 'selected'})
    df['selected'] = df['selected'].str.replace('nan', '').str.strip()  # cleans up the column
As you can see, essentially I'm pulling out a temporary DataFrame for each Person, using DF.sample(number) to do the randomising, then using DF.merge to get the 'marked' rows back into the original DataFrame. And it involved iterating through a list to create each temporary DataFrame...and my understanding is that iterating is kind of lame.
There's got to be a more Pythonic, vectorising way to do this, right? Without iterating. Maybe something involving groupby? Any thoughts or advice much appreciated.
EDIT: Here's another way that avoids merge...but it's still pretty clunky:
import pandas as pd
import numpy as np
import math

# SETUP TEST DATA:
y = ['Alex'] * 2321 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281
z = ['xyz'] * len(y)
df = pd.DataFrame({'persons': y, 'data': z})

percent = 10  # CHANGE AS NEEDED

# Add a 'helper' column with random numbers
df['rand'] = np.random.random(df.shape[0])
df = df.sample(frac=1)  # this shuffles the data, just to show order doesn't matter

# CREATE A HELPER LIST
helper = df.groupby('persons')['rand'].count().reset_index().values.tolist()
for row in helper:
    df_temp = df[df['persons'] == row[0]][['persons', 'rand']]
    lim = math.ceil(len(df_temp) * percent * 0.01)
    # threshold: smallest rand value inside each person's top `percent`%
    row.append(df_temp.nlargest(lim, 'rand')['rand'].iloc[-1])

def flag(name, num):
    for row in helper:
        if row[0] == name:
            if num >= row[2]:
                return 'yes'
            else:
                return 'no'

df['flag'] = df.apply(lambda x: flag(x['persons'], x['rand']), axis=1)
You could use groupby.sample, either to pick out a sample of the whole dataframe for further processing, or to identify rows of the dataframe to mark if that's more convenient.
import pandas as pd
percentage_to_flag = 0.5
# Toy data: 8 rows, persons A and B.
df = pd.DataFrame(data={'persons':['A']*4 + ['B']*4, 'data':range(8)})
# persons data
# 0 A 0
# 1 A 1
# 2 A 2
# 3 A 3
# 4 B 4
# 5 B 5
# 6 B 6
# 7 B 7
# Pick out random sample of dataframe.
random_state = 41 # Change to get different random values.
df_sample = df.groupby("persons").sample(frac=percentage_to_flag,
                                         random_state=random_state)
# persons data
# 1 A 1
# 2 A 2
# 7 B 7
# 6 B 6
# Mark the random sample in the original dataframe.
df["marked"] = False
df.loc[df_sample.index, "marked"] = True
# persons data marked
# 0 A 0 False
# 1 A 1 True
# 2 A 2 True
# 3 A 3 False
# 4 B 4 False
# 5 B 5 False
# 6 B 6 True
# 7 B 7 True
If you really do not want the sub-sampled dataframe df_sample you can go straight to marking a sample of the original dataframe:
# Mark random sample in original dataframe with minimal intermediate data.
df["marked2"] = False
df.loc[df.groupby("persons")["data"].sample(frac=percentage_to_flag,
                                            random_state=random_state).index,
       "marked2"] = True
# persons data marked marked2
# 0 A 0 False False
# 1 A 1 True True
# 2 A 2 True True
# 3 A 3 False False
# 4 B 4 False False
# 5 B 5 False False
# 6 B 6 True True
# 7 B 7 True True
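One caveat: DataFrame.groupby(...).sample(...) was only added in pandas 1.1, so on older versions you will need an apply-based fallback like the one shown in a later answer below.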
If I understood you correctly, you can achieve this using:
import pandas as pd

df = pd.DataFrame(data={'persons': ['A']*10 + ['B']*10, 'col_1': [2]*20})
percentage_to_flag = 0.5
# For each group, sample 50% of the rows, then flag membership of the sampled index:
a = df.groupby(['persons'])['col_1'].apply(lambda x: pd.Series(x.index.isin(x.sample(frac=percentage_to_flag, random_state=5, replace=False).index))).reset_index(drop=True)
df['flagged'] = a
Input:
persons col_1
0 A 2
1 A 2
2 A 2
3 A 2
4 A 2
5 A 2
6 A 2
7 A 2
8 A 2
9 A 2
10 B 2
11 B 2
12 B 2
13 B 2
14 B 2
15 B 2
16 B 2
17 B 2
18 B 2
19 B 2
Output with 50% flagged rows in each group:
persons col_1 flagged
0 A 2 False
1 A 2 False
2 A 2 True
3 A 2 False
4 A 2 True
5 A 2 True
6 A 2 False
7 A 2 True
8 A 2 False
9 A 2 True
10 B 2 False
11 B 2 False
12 B 2 True
13 B 2 False
14 B 2 True
15 B 2 True
16 B 2 False
17 B 2 True
18 B 2 False
19 B 2 True
This is TMBailey's answer, tweaked so that it works with my Python/pandas version. (I didn't want to edit someone else's answer, but if I'm doing it wrong I'll take this down.) This works really well and really fast!
EDIT: I've updated this based on an additional suggestion by TMBailey to replace frac=percentage_to_flag with n=math.ceil(percentage_to_flag * len(x)). This ensures that rounding doesn't pull the sampled percentage under the percentage_to_flag threshold. (For what it's worth, you can replace it with frac=(math.ceil(percentage_to_flag * len(x)))/len(x) too.)
import pandas as pd
import math
percentage_to_flag = .10
# Toy data:
y = ['Alex'] * 2321 + ['Eddie'] * 876 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281
z = ['xyz'] * len(y)
df = pd.DataFrame({'persons': y, 'data' : z})
df = df.sample(frac = 1) #optional shuffle, just to show order doesn't matter
# Pick out random sample of dataframe.
random_state = 41 # Change to get different random values.
df_sample = df.groupby("persons").apply(lambda x: x.sample(n=(math.ceil(percentage_to_flag * len(x))),random_state=random_state))
#had to use lambda in line above
df_sample = df_sample.reset_index(level=0, drop=True) #had to add this to simplify multi-index DF
# Mark the random sample in the original dataframe.
df["marked"] = False
df.loc[df_sample.index, "marked"] = True
And then to check:
pp = df.pivot_table(index="persons", columns="marked", values="data", aggfunc='count', fill_value=0)
pp.columns = ['no','yes']
pp = pp.append(pp.sum().rename('Total')).assign(Total=lambda d: d.sum(1))
pp['% selected'] = 100 * pp.yes/pp.Total
print(pp)
OUTPUT:
no yes Total % selected
persons
Alex 2088 233 2321 10.038776
Bob 8352 929 9281 10.009697
Chuck 1810 202 2012 10.039761
Doug 30710 3413 34123 10.002051
Eddie 788 88 876 10.045662
Total 43748 4865 48613 10.007611
Works like a charm.
I am trying to implement the 'Bottom-Up Computation' algorithm in data mining (https://www.aaai.org/Papers/FLAIRS/2003/Flairs03-050.pdf).
I need to use the 'pandas' library to create a dataframe and provide it to a recursive function, which should also return a dataframe as output. I am only able to return the final column as output, because I am unable to figure out how to dynamically build a data frame.
Here is the Python program:
import pandas as pd

def project_data(df, d):
    return df.iloc[:, d]

def select_data(df, d, val):
    col_name = df.columns[d]
    return df[df[col_name] == val]

def remove_first_dim(df):
    return df.iloc[:, 1:]

def slice_data_dim0(df, v):
    df_temp = select_data(df, 0, v)
    return remove_first_dim(df_temp)

def buc(df):
    dims = df.shape[1]
    if dims == 1:
        # Base case: only the measure column is left; print its aggregate.
        input_sum = sum(project_data(df, 0))
        print(input_sum)
    else:
        # Recurse on each value of the first dimension...
        dim_vals = set(project_data(df, 0).values)
        for dim_val in dim_vals:
            sub_data = slice_data_dim0(df, dim_val)
            buc(sub_data)
        # ...then aggregate the first dimension out entirely.
        sub_data = remove_first_dim(df)
        buc(sub_data)

data = {'A': [1, 1, 1, 1, 2],
        'B': [1, 1, 2, 3, 1],
        'M': [10, 20, 30, 40, 50]}

df = pd.DataFrame(data, columns=['A', 'B', 'M'])
buc(df)
I get the following output:
30
30
40
100
50
50
80
30
40
But what I need is a dataframe, like this (not necessarily formatted, but a data frame):
A B M
0 1 1 30
1 1 2 30
2 1 3 40
3 1 ALL 100
4 2 1 50
5 2 ALL 50
6 ALL 1 80
7 ALL 2 30
8 ALL 3 40
9 ALL ALL 150
How do I achieve this?
Unfortunately pandas doesn't have built-in functionality for subtotals, so the trick is to calculate them on the side and concatenate them with the original dataframe.
from itertools import combinations
import numpy as np

dim = ['A', 'B']
vals = ['M']

df = pd.concat(
    [df]
    # subtotals:
    + [df.groupby(list(gr), as_index=False)[vals].sum()
       for r in range(len(dim) - 1) for gr in combinations(dim, r + 1)]
    # total:
    + [df.groupby(np.zeros(len(df)))[vals].sum()]
).sort_values(dim).reset_index(drop=True).fillna("ALL")
Output:
A B M
0 1 1 10
1 1 1 20
2 1 2 30
3 1 3 40
4 1 ALL 100
5 2 1 50
6 2 ALL 50
7 ALL 1 80
8 ALL 2 30
9 ALL 3 40
10 ALL ALL 150
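Alternatively, if you want buc itself to build the result dataframe, as the question asks, one option is to thread the already-fixed dimension values through the recursion and collect finished rows in a list. A rough sketch reusing the question's slice_data_dim0 helper (buc_rows and its prefix/rows parameters are my own names, and it assumes the dimension columns come before a single measure column):

def buc_rows(df, prefix=(), rows=None):
    if rows is None:
        rows = []
    dims = df.shape[1]
    if dims == 1:                    # only the measure column is left
        rows.append(list(prefix) + [df.iloc[:, 0].sum()])
    else:
        for dim_val in sorted(set(df.iloc[:, 0].values)):
            buc_rows(slice_data_dim0(df, dim_val), prefix + (dim_val,), rows)
        buc_rows(df.iloc[:, 1:], prefix + ('ALL',), rows)  # aggregate this dim out
    return rows

result = pd.DataFrame(buc_rows(df), columns=['A', 'B', 'M'])
print(result)

Building the rows as plain lists and calling pd.DataFrame once at the end is generally much cheaper than appending to a dataframe inside the recursion, and on the question's sample data it reproduces the desired table, including the ALL/ALL 150 grand total.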
I have a text file with multiple delimiters separating the values; from it I just want to read the pipe-separated values.
The data looks like this, for example:
'
10|10|10|10|10|10|10|10|10;10:10:10,10,10,10 ... etc
'
I want to read only up to the 8 pipe-separated values as a dataframe and ignore the values after the ";", ":" and "," characters. How do I do that?
It would be a two-step process. First, read the CSV with | as the delimiter:
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(
    "10|10|10|10|10|10|10|10|10;10:10:10,10,10,10"
), delimiter='|', header=None)
0 1 2 3 4 5 6 7 8
0 10 10 10 10 10 10 10 10 10;10:10:10,10,10,10
Then update the last column by removing everything from the first ;, , or : onward:
df.iloc[:, -1] = df.iloc[:, -1].str.replace(r'[;,:].*', '', regex=True)
0 1 2 3 4 5 6 7 8
0 10 10 10 10 10 10 10 10 10
If you know the exact character after which everything should be ignored, you can use the comment parameter as follows. Everything after that one-character string will be ignored.
df = pd.read_csv(StringIO(
"10|10|10|10|10|10|10|10|10;10:10:10,10,10,10"
), delimiter='|', header=None, comment=';')
df
0 1 2 3 4 5 6 7 8
0 10 10 10 10 10 10 10 10 10
This is longer than the other proposed solutions, but possibly faster, because it only reads what's needed. It collects the result in a list, but it could be another container type:
s = "10,10,10,10|10|10|10|10|10|10|10|10;10:10:10,10,10,10"

coll = []
start = 0
prevIdx = -1
while True:
    try:
        idx = s.index("|", start)
        if prevIdx >= 0:
            # Take the substring between two consecutive pipes.
            coll.append(int(s[prevIdx + 1:idx]))
        start = idx + 1
        prevIdx = idx
    except ValueError:  # no more "|" found
        break

print(coll)  # ==> [10, 10, 10, 10, 10, 10, 10]
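A simpler variant of the same idea, assuming (as in the question) that only the first 8 pipe-separated fields are wanted and all of the junk sits in the trailing field:

import pandas as pd

line = "10|10|10|10|10|10|10|10|10;10:10:10,10,10,10"
fields = line.split("|")[:8]                 # keep only the first 8 pipe-separated fields
df = pd.DataFrame([[int(f) for f in fields]])
print(df)  # one row with eight columns of 10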
I am trying to replicate a specific method of attributing records to deciles, and the pandas.qcut() function does a good job. My only concern is that there doesn't seem to be a way to assign the uneven remainder to specific bins, as required by the method I am trying to replicate.
This is my example:
import numpy as np
import pandas as pd

num = np.random.rand(153, 1)
my_list = list(map(lambda x: x[0], num))
ser = pd.Series(my_list)
bins = pd.qcut(ser, 10, labels=False)
bins.value_counts()
Which outputs:
9 16
4 16
0 16
8 15
7 15
6 15
5 15
3 15
2 15
1 15
There are 7 bins with 15 and 3 with 16; what I would like to do is specify which bins receive the 16 records:
9 16 <
4 16
0 16
8 15
7 15
6 15
5 15 <
3 15
2 15 <
1 15
Is this possible using pd.qcut?
As there was no answer, and after asking a few people it didn't seem to be possible, I cobbled together a function that does this:
def defined_qcut(df, value_series, number_of_bins, bins_for_extras, labels=False):
    if max(bins_for_extras) > number_of_bins or any(x < 0 for x in bins_for_extras):
        raise ValueError("Attempted to allocate to a bin that doesn't exist")
    base_number, number_of_values_to_allocate = divmod(df[value_series].count(), number_of_bins)
    bins_for_extras = bins_for_extras[:number_of_values_to_allocate]
    if number_of_values_to_allocate == 0:
        df['bins'] = pd.qcut(df[value_series], number_of_bins, labels=labels)
        return df
    elif number_of_values_to_allocate > len(bins_for_extras):
        raise ValueError('There are more values to allocate than the list provided, please select more bins')
    bins = {}
    for i in range(number_of_bins):
        number_of_values_in_bin = base_number
        if i in bins_for_extras:
            number_of_values_in_bin += 1
        bins[i] = number_of_values_in_bin
    df1 = df.copy()
    df1['rank'] = df1[value_series].rank()
    df1 = df1.sort_values(by=['rank'])
    df1['bins'] = 0
    row_to_start_allocate = 0
    row_to_end_allocate = 0
    for bin_number, number_in_bin in bins.items():
        row_to_end_allocate += number_in_bin
        bins.update({bin_number: [number_in_bin, row_to_start_allocate, row_to_end_allocate]})
        row_to_start_allocate = row_to_end_allocate
    conditions = [df1['rank'].iloc[v[1]: v[2]] for k, v in bins.items()]
    series_to_add = pd.Series()
    for idx, series in enumerate(conditions):
        series[series > -1] = idx
        series_to_add = series_to_add.append(series)
    df1['bins'] = series_to_add
    df1 = df1.reset_index()
    return df1
It ain't pretty, but it does the job. You pass in the dataframe, the name of the column with the values, and an ordered list of the bins to which any extra values should be allocated. I'd happily take some advice on how to improve this code.
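A hypothetical call matching the example above (153 random values, 10 bins, extras pushed into bins 9, 5 and 2 as marked). Note the function relies on Series.append, which was removed in pandas 2.0, so this sketch assumes an older pandas:

import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.random.rand(153)})
result = defined_qcut(df, 'value', 10, [9, 5, 2])
print(result['bins'].value_counts())  # bins 9, 5 and 2 get 16 rows; the others get 15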
I have a pandas series like this:
0 $233.94
1 $214.14
2 $208.74
3 $232.14
4 $187.15
5 $262.73
6 $176.35
7 $266.33
8 $174.55
9 $221.34
10 $199.74
11 $228.54
12 $228.54
13 $196.15
14 $269.93
15 $257.33
16 $246.53
17 $226.74
I want to get rid of the dollar sign so I can convert the values to numeric. I made a function to do this:
def strip_dollar(series):
    for number in dollar:
        if number[0] == '$':
            number[0].replace('$', ' ')
    return dollar
This function returns the original series untouched; nothing changes, and I don't know why.
Any ideas about how to get this right?
Thanks in advance.
Use lstrip and convert to floats:
s = s.str.lstrip('$').astype(float)
print (s)
0 233.94
1 214.14
2 208.74
3 232.14
4 187.15
5 262.73
6 176.35
7 266.33
8 174.55
9 221.34
10 199.74
11 228.54
12 228.54
13 196.15
14 269.93
15 257.33
16 246.53
17 226.74
Name: A, dtype: float64
Setup:
s = pd.Series(['$233.94', '$214.14', '$208.74', '$232.14', '$187.15', '$262.73', '$176.35', '$266.33', '$174.55', '$221.34', '$199.74', '$228.54', '$228.54', '$196.15', '$269.93', '$257.33', '$246.53', '$226.74'])
print (s)
0 $233.94
1 $214.14
2 $208.74
3 $232.14
4 $187.15
5 $262.73
6 $176.35
7 $266.33
8 $174.55
9 $221.34
10 $199.74
11 $228.54
12 $228.54
13 $196.15
14 $269.93
15 $257.33
16 $246.53
17 $226.74
dtype: object
Use str.replace("$", "").
Example:
import pandas as pd
df = pd.DataFrame({"Col" : ["$233.94", "$214.14"]})
df["Col"] = pd.to_numeric(df["Col"].str.replace("$", ""))
print(df)
Output:
Col
0 233.94
1 214.14
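One version note: depending on your pandas version, str.replace may treat the pattern as a regular expression by default, and a bare "$" is a regex anchor rather than a literal dollar sign. Passing regex=False makes the intent explicit:

df["Col"] = pd.to_numeric(df["Col"].str.replace("$", "", regex=False))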
CODE:
import pandas as pd

ser = pd.Series(data=['$123', '$234', '$232', '$6767'])

def rmDollar(x):
    return x[1:]

serWithoutDollar = ser.apply(rmDollar)
serWithoutDollar
OUTPUT:
0 123
1 234
2 232
3 6767
dtype: object
Hope it helps!
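As to why the original strip_dollar returned the series untouched: Python strings are immutable, so number[0].replace('$', ' ') builds a new string and immediately discards it (and the loop iterates over dollar rather than over the series argument). A corrected sketch of that approach, plus the cast that the apply-based answer above still needs (its result is dtype object, i.e. still strings):

import pandas as pd

def strip_dollar(series):
    # str.lstrip returns a new string; keep the results instead of discarding them
    return pd.Series([value.lstrip('$') for value in series], index=series.index)

ser = pd.Series(['$233.94', '$214.14'])
print(strip_dollar(ser).astype(float))  # 233.94, 214.14 as float64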