Drop rows in dataframe if length of the name columns <=1 - python

Please point out where i am doing wrong or a duplicate of this question
I have 11 columns in my table, i am loading data from Ceph(AWS) bucket to Postgres and while doing that i have to filter the data with the below conditions before inserting data into Postgres
Drop the entire row if there is any empty/ Null values in any column
First name and last name should have more than a single letter. Ex : first name = A or last name = P, any record either first name or last name or both , entire record/row should be dropped
Zip code should be 5 digit or greater . Max 7 digit
First name and last name records should not have [Jr, Sr, I, II, etc] in it. or drop the entire record
i have managed to execute the first step (new to pandas) but i was blocked at the next step and i believe that it might also help me solve step3 if i find a solution for step2. While doing a quick research in google, I found that i might be complicating the process by using chunks and might have to use 'concat' to apply it for all chunks or may be i am wrong but i am dealing with huge amount of data and using chunks would help me load the data faster into Postgres.
I am going to paste my code here and mention what i tried, what was the output and what would be the expected output
what i tried:
columns = [
'cust_last_nm',
'cust_frst_nm',
'cust_brth_dt',
'cust_gendr_cd',
'cust_postl_cd',
'indiv_entpr_id',
'TOKEN_1',
'TOKEN_2',
'TOKEN_3',
'TOKEN_4',
'TOKEN_KEY'
]
def push_to_pg_weekly(key):
vants = []
print(key)
key = _download_s3(key)
how_many_files_pushed.append(True)
s=sp.Popen(["wc", "-l", key], stdout=sp.PIPE)
a, b = s.communicate()
total_rows = int(a.split()[0])
rows = 0
data = pd.read_csv(key, sep="|", header=None, chunksize=100000)
for chunk in data:
rows += len(chunk)
print("Processed rows: ", (float(rows)/total_rows)*100)
chunk = chunk.dropna(axis=0) #step-1 Drop the rows where at least one element is missing.
index_names = chunk[(len(chunk[0]) <= 1) | (len(chunk[1]) <= 1)].index #step2
chunk.drop(index_names, axis=0)
chunk.to_csv("/tmp/sample.csv", sep="|", header=None, index=False)
connection = psycopg2.connect(user = os.environ.get("DATABASE_USER", “USERNAME”),
password = os.environ.get("DATABASE_PASS", “PASSWORD“),
host = os.environ.get("DATABASE_HOST", "cvlpsql.pgsql.com"),
port = 5432,
dbname = os.environ.get("DATABASE_NAME", "cvlpsql_db"),
options = "-c search_path=DATAVANT_O")
with connection.cursor() as cursor:
cursor.copy_from(open('/tmp/sample.csv'), "COVID1", sep='|')
connection.commit()
def push_to_pg():
paginator = CLIENT.get_paginator('list_objects')
pages = paginator.paginate(Bucket=bucket)
for page in pages:
if "Contents" in page:
for obj in page["Contents"]:
if obj['Key'].startswith('test/covid-2020-11-10-175213') and (obj['Key'].endswith('.txt') or obj['Key'].endswith('.csv')):
push_to_pg_weekly(obj['Key'])
os.remove(obj['Key'])
return
Data:
john|doe|1974-01-01|F|606.0|113955973|cC80fi6kHVjKRNgUnATuE8Nn5x/YyoTUdSDY3sUDis4=|2qalDHguJRO9gR66LZcRLSe2SSQSQAIcT9btvaqLnZk=|eLQ9vYAj0aUfMM9smdpXqIh7QRxLBh6wYl6iYkItz6g=|3ktelRCCKf1CHOVBUdaVbjqltxa70FF+9Lf9MNJ+HDU=|cigna_TOKEN_ENCRYPTION_KEY
j|ab|1978-01-01|M|328.0|125135976|yjYaupdG9gdlje+2HdQB+FdEEj6Lh4+WekqEuB1DSvM=|j8VuTUKll7mywqsKrqBnomppGutsoJAR+2IoH/Tq0b8=|6qNP9ch57MlX912gXS7RMg7UfjtaP6by/cR68PbzNmQ=|R5DemSNrFvcevijrktwf3aixOShNU6j7wfahyKeUyzk=|cigna_TOKEN_ENCRYPTION_KEY
j|j|1985-01-01|F|105.0|115144390|fn0r8nVzmDJUihnaQh1SXm1sLOIjzGsPDBskdX4/b+0=|Fh6facONoOiL9hCCA8Q1rtUp9n5h9VBhg2IaX9gjaKI=|NWtnZegpcpgcit2u063zQv3pcEhk4bpKHKFa9hW7LtU=|P3cVOUd6PyYN5tKezdMkVDI62aW8dv+bjIwKtAgX3OM=|cigna_TOKEN_ENCRYPTION_KEY
jh|on|1989-01-01|M|381.0|133794239|PvCWdh+ucgi1WyP5Vr0E6ysTrTZ1gLTQIteXDxZbEJg=|7K3RsfC8ItQtrEQ+MdBGpx6neggYvBvR8nNDMOBTRtU=|nHsF/rJFM/O+HPevTj9cVYwrXS1ou+2/4FelEXTV0Ww=|Jw/nzI/Gu9s6QsgtxTZhTFFBXGLUv06vEewxQbhDyWk=|cigna_TOKEN_ENCRYPTION_KEY
||1969-01-01|M|926.0|135112782|E2sboFz4Mk2aGIKhD4vm6J9Jt3ZSoSdLm+0PCdWsJto=|YSILMFS5sPPZZF/KFroEHV77z1bMeiL/f4FqF2kj4Xc=|tNjgnby5zDbfT2SLsCCwhNBxobSDcCp7ws0zYVme5w4=|kk25p0lrp2T54Z3B1HM3ZQN0RM63rjqvewrwW5VhYcI=|cigna_TOKEN_ENCRYPTION_KEY
||1978-01-01|M|70.0|170737333|Q8NDJz563UrquOUUz0vD6Es05vIaAD/AfVOef4Mhj24=|k5Q02GVd0nJ6xMs1vHVM24MxV6tZ46HJNKoePcDsyoM=|C9cvHz5n+sDycUecioiWZW8USE6D2dli5gRzo4nOyvY=|z4eNSVNDAjiPU2Sw3VY+Ni1djO5fptl5FGQvfnBodr4=|cigna_TOKEN_ENCRYPTION_KEY
||1996-01-01|M|840.0|91951973|Y4kmxp0qdZVCW5pJgQmvWCfc4URg9oFnv2DWGglfQKM=|RJfyDYJjwuZ1ZDjP+5PA5S2fLS6llFD51Lg+uJ84Tus=|+PXzrKt7O79FehSnL3Q8EjGmnyZVDUfdM4zzHk1ghOY=|gjyVKjunky2Aui3dxzmeLt0U6+vT39/uILMbEiT0co8=|cigna_TOKEN_ENCRYPTION_KEY
||1960-01-01|M|180.0|64496569|80e1CgNJeO8oYQHlSn8zWYL4vVrHSPe9AnK2T2PrdII=|bJl7veT+4MlU4j2mhFpFyins0xeCFWeaA30JUzWsfqo=|0GuhUfbS4xCnCj2ms43wqmGFG5lCnfiIQdyti9moneM=|lq84jO9yhz8f9/DUM0ACVc/Rp+sKDvHznVjNnLOaRo4=|cigna_TOKEN_ENCRYPTION_KEY
||1963-01-01|M|310.0|122732991|zEvHkd5AVT7hZFR3/13dR9KzN5WSulewY0pjTFEov2Y=|eGqNbLoeCN1GJyvgaa01w+z26OtmplcrAY2vxwOZ4Y4=|6q9DPLPK5PPAItZA/x253DvdAWA/r6zIi0dtIqPIu2g=|lOl11DhznPphGQOFz6YFJ8i28HID1T6Sg7B/Y7W1M3o=|cigna_TOKEN_ENCRYPTION_KEY
||2001-01-01|F|650.0|43653178|vv/+KLdhHqUm13bWhpzBexwxgosXSIzgrxZIUwB7PDo=|78cJu1biJAlMddJT1yIzQAH1KCkyDoXiL1+Lo1I2jkw=|9/BM/hvqHYXgfmWehPP2JGGuB6lKmfu7uUsmCtpPyz8=|o/yP8bMzFl6KJ1cX+uFll1SrleCC+8BXmqBzyuGdtwM=|cigna_TOKEN_ENCRYPTION_KEY
output - data inserted into postgresDB:
john|doe|1974-01-01|F|606.0|113955973|cC80fi6kHVjKRNgUnATuE8Nn5x/YyoTUdSDY3sUDis4=|2qalDHguJRO9gR66LZcRLSe2SSQSQAIcT9btvaqLnZk=|eLQ9vYAj0aUfMM9smdpXqIh7QRxLBh6wYl6iYkItz6g=|3ktelRCCKf1CHOVBUdaVbjqltxa70FF+9Lf9MNJ+HDU=|cigna_TOKEN_ENCRYPTION_KEY
j|ab|1978-01-01|M|328.0|125135976|yjYaupdG9gdlje+2HdQB+FdEEj6Lh4+WekqEuB1DSvM=|j8VuTUKll7mywqsKrqBnomppGutsoJAR+2IoH/Tq0b8=|6qNP9ch57MlX912gXS7RMg7UfjtaP6by/cR68PbzNmQ=|R5DemSNrFvcevijrktwf3aixOShNU6j7wfahyKeUyzk=|cigna_TOKEN_ENCRYPTION_KEY
j|j|1985-01-01|F|105.0|115144390|fn0r8nVzmDJUihnaQh1SXm1sLOIjzGsPDBskdX4/b+0=|Fh6facONoOiL9hCCA8Q1rtUp9n5h9VBhg2IaX9gjaKI=|NWtnZegpcpgcit2u063zQv3pcEhk4bpKHKFa9hW7LtU=|P3cVOUd6PyYN5tKezdMkVDI62aW8dv+bjIwKtAgX3OM=|cigna_TOKEN_ENCRYPTION_KEY
jh|on|1989-01-01|M|381.0|133794239|PvCWdh+ucgi1WyP5Vr0E6ysTrTZ1gLTQIteXDxZbEJg=|7K3RsfC8ItQtrEQ+MdBGpx6neggYvBvR8nNDMOBTRtU=|nHsF/rJFM/O+HPevTj9cVYwrXS1ou+2/4FelEXTV0Ww=|Jw/nzI/Gu9s6QsgtxTZhTFFBXGLUv06vEewxQbhDyWk=|cigna_TOKEN_ENCRYPTION_KEY
Expected Output:
john|doe|1974-01-01|F|606.0|113955973|cC80fi6kHVjKRNgUnATuE8Nn5x/YyoTUdSDY3sUDis4=|2qalDHguJRO9gR66LZcRLSe2SSQSQAIcT9btvaqLnZk=|eLQ9vYAj0aUfMM9smdpXqIh7QRxLBh6wYl6iYkItz6g=|3ktelRCCKf1CHOVBUdaVbjqltxa70FF+9Lf9MNJ+HDU=|cigna_TOKEN_ENCRYPTION_KEY
jh|on|1989-01-01|M|381.0|133794239|PvCWdh+ucgi1WyP5Vr0E6ysTrTZ1gLTQIteXDxZbEJg=|7K3RsfC8ItQtrEQ+MdBGpx6neggYvBvR8nNDMOBTRtU=|nHsF/rJFM/O+HPevTj9cVYwrXS1ou+2/4FelEXTV0Ww=|Jw/nzI/Gu9s6QsgtxTZhTFFBXGLUv06vEewxQbhDyWk=|cigna_TOKEN_ENCRYPTION_KEY
Any answers/comments will be very much appriciated, thank you

Fastest way to do operations like this on pandas is through numpy.where.
eg for String length:
data = data[np.where((data['cust_last_nm'].str.len()>1) &
(data['cust_frst_nm'].str.len()>1), True, False)]
Note: you can add postal code condition in same way. by default in your data postal codes will read in as floats, so cast them to string first, and then set length limit:
## string length & postal code conditions together
data = data[np.where((data['cust_last_nm'].str.len()>1) &
(data['cust_frst_nm'].str.len()>1) &
(data['cust_postl_cd'].astype('str').str.len()>4) &
(data['cust_postl_cd'].astype('str').str.len()<8)
, True, False)]
EDIT:
Since you working in chunks, change the data to chunk and put this inside your loop. Also, since you don't import headers (headers=0, change column names to their index values. And convert all values to strings before comparison, since otherwise NaN columns will be treated as floats eg:
chunk = chunk[np.where((chunk[0].astype('str').str.len()>1) &
(chunk[1].astype('str').str.len()>1) &
(chunk[5].astype('str').str.len()>4) &
(chunk[5].astype('str').str.len()<8), True, False)]

Create a new column in the dataframe with a value for the length:
df['name_length'] = df.name.str.len()
Index using the new column:
df = df[df.name_length > 1]

Related

DataFrame returns Value Error after adding auto index

This script needs to query the DC server for events. Since this is done live, each time the server is queried, it returns query results of varying lengths. The log file is long and messy, as most logs are. I need to filter only the event names and their codes and then create a DataFrame. Additionally, I need to add a third column that counts the number of times each event took place. I've done most of it but can't figure out how to fix the error I'm getting.
After doing all the filtering from Elasticsearch, I get two lists - action and code - which I have emulated here.
action_list = ['logged-out', 'logged-out', 'logged-out', 'Directory Service Access', 'Directory Service Access', 'Directory Service Access', 'logged-out', 'logged-out', 'Directory Service Access', 'created-process', 'created-process']
code_list = ['4634', '4634', '4634', '4662', '4662', '4662', '4634', '4634', '4662','4688']
I then created a list that contains only the codes that need to be filtered out.
event_code_list = ['4662', '4688']
My script is as follows:
import pandas as pd
from collections import Counter
#Create a dict that combines action and code
lists2dict = {}
lists2dict = dict(zip(action_list,code_list))
# print(lists2dict)
#Filter only wanted eventss
filtered_events = {k: v for k, v in lists2dict.items() if v in event_code_list}
# print(filtered_events)
index = 1 * pd.RangeIndex(start=1, stop=2) #add automatic index to DataFrame
df = pd.DataFrame(filtered_events,index=index)#Create DataFrame from filtered events
#Create Auto Index
count = Counter(df)
action_count = dict(Counter(count))
action_count_values = action_count.values()
# print(action_count_values)
#Convert Columns to Rows and Add Index
new_df = df.melt(var_name="Event",value_name="Code")
new_df['Count'] = action_count_values
print(new_df)
Up until this point, everything works as it should. The problem is what comes next. If there are no events, the script outputs an empty DataFrame. This works fine. However, if there are events, then we should see the events, the codes, and the number of times each event occurred. The problem is that it always outputs 1. How can I fix this? I'm sure it's something ridiculous that I'm missing.
#If no alerts, create empty DataFrame
if new_df.empty:
empty_df = pd.DataFrame(columns=['Event','Code','Count'])
empty_df['Event'] = ['-']
empty_df['Code'] = ['-']
empty_df['Count'] = ['-']
empty_df.to_html()
html = empty_df.to_html()
with open('alerts.html', 'w') as f:
f.write(html)
else: #else, output alerts + codes + count
new_df.to_html()
html = new_df.to_html()
with open('alerts.html', 'w') as f:
f.write(html)
Any help is appreciated.
It is because you are collecting the result as dictionary - the repeated records are ignored. You lost the record count here: lists2dict = dict(zip(action_list,code_list)).
You can do all these operations very easily on dataframe. Just construct a pandas dataframe from given lists, then filter by code, groupby, and aggregate as count:
df = pd.DataFrame({"Event": action_list, "Code": code_list})
df = df[df.Code.isin(event_code_list)] \
.groupby(["Event", "Code"]) \
.agg(Count = ("Code", len)) \
.reset_index()
print(df)
Output:
Event Code Count
0 Directory Service Access 4662 4
1 created-process 4688 2

Trying to add prefixes to url if not present in pandas df column

I am trying to add prefixes to urls in my 'Websites' Column. I can't figure out how to keep each new iteration of the helper column from overwriting everything from the previous column.
for example say I have the following urls in my column:
http://www.bakkersfinedrycleaning.com/
www.cbgi.org
barstoolsand.com
This would be the desired end state:
http://www.bakkersfinedrycleaning.com/
http://www.cbgi.org
http://www.barstoolsand.com
this is as close as I have been able to get:
def nan_to_zeros(df, col):
new_col = f"nanreplace{col}"
df[new_col] = df[col].fillna('~')
return df
df1 = nan_to_zeros(df1, 'Website')
df1['url_helper'] = df1.loc[~df1['nanreplaceWebsite'].str.startswith('http')| ~df1['nanreplaceWebsite'].str.startswith('www'), 'url_helper'] = 'https://www.'
df1['url_helper'] = df1.loc[df1['nanreplaceWebsite'].str.startswith('http'), 'url_helper'] = ""
df1['url_helper'] = df1.loc[df1['nanreplaceWebsite'].str.startswith('www'),'url_helper'] = 'www'
print(df1[['nanreplaceWebsite',"url_helper"]])
which just gives me a helper column of all www because the last iteration overwrites all fields.
Any direction appreciated.
Data:
{'Website': ['http://www.bakkersfinedrycleaning.com/',
'www.cbgi.org', 'barstoolsand.com']}
IIUC, there are 3 things to fix here:
df1['url_helper'] = shouldn't be there
| should be & in the first condition because 'https://www.' should be added to URLs that start with neither of the strings in the condition. The error will become apparent if we check the first condition after the other two conditions.
The last condition should add "http://" instead of "www".
Alternatively, your problem could be solved using np.select. Pass in the multiple conditions in the conditions list and their corresponding choice list and assign values accordingly:
import numpy as np
s = df1['Website'].fillna('~')
df1['fixed Website'] = np.select([~(s.str.startswith('http') | ~s.str.contains('www')),
~(s.str.startswith('http') | s.str.contains('www'))
],
['http://' + s, 'http://www.' + s], s)
Output:
Website fixed Website
0 http://www.bakkersfinedrycleaning.com/ http://www.bakkersfinedrycleaning.com/
1 www.cbgi.org http://www.cbgi.org
2 barstoolsand.com http://www.barstoolsand.com

Optimize row access and transformation in pyspark

I have a large dataset(5GB) in the form of jason in S3 bucket.
I need to transform the schema of the data, and write back the transformed data to S3 using an ETL script.
So I use a crawler to detect the schema and load the data in pyspark dataframe, and change the schema. Now I iterate over every row in the dataframe and convert it to dictionary. Remove null columns, and then convert the dictionary to string and write back to S3. Following is the code:
#df is the pyspark dataframe
columns = df.columns
print(columns)
s3 = boto3.resource('s3')
cnt = 1
for row in df.rdd.toLocalIterator():
data = row.asDict(True)
for col_name in columns:
if data[col_name] is None:
del data[col_name]
content = json.dumps(data)
object = s3.Object('write-test-transaction-transformed', str(cnt)).put(Body=content)
cnt = cnt+1
print(cnt)
I have used toLocalIterator.
Is the execution of above code performes serially? if yes then how to optimize it? Is there any better approach for execution of above logic?
assuming, each row in the dataset as json string format
import pyspark.sql.functions as F
def drop_null_cols(data):
import json
content = json.loads(data)
for key, value in list(content.items()):
if value is None:
del content[key]
return json.dumps(content)
drop_null_cols_udf = F.udf(drop_null_cols, F.StringType())
df = spark.createDataFrame(
["{\"name\":\"Ranga\", \"age\":25, \"city\":\"Hyderabad\"}",
"{\"name\":\"John\", \"age\":null, \"city\":\"New York\"}",
"{\"name\":null, \"age\":31, \"city\":\"London\"}"],
"string"
).toDF("data")
df.select(
drop_null_cols_udf("data").alias("data")
).show(10,False)
If the input dataframe is having the cols and output only needs to be not null cols json
df = spark.createDataFrame(
[('Ranga', 25, 'Hyderabad'),
('John', None, 'New York'),
(None, 31, 'London'),
],
['name', 'age', 'city']
)
df.withColumn(
"data", F.to_json(F.struct([x for x in df.columns]))
).select(
drop_null_cols_udf("data").alias("data")
).show(10, False)
#df.write.format("csv").save("s3://path/to/file/) -- save to s3
which results
+-------------------------------------------------+
|data |
+-------------------------------------------------+
|{"name": "Ranga", "age": 25, "city": "Hyderabad"}|
|{"name": "John", "city": "New York"} |
|{"age": 31, "city": "London"} |
+-------------------------------------------------+
I'll follow the below approach(written in scala, but can be implemented in python with minimal change)-
Find the dataset count and named it as totalCount
val totalcount = inputDF.count()
Find the count(col) for all the dataframe columns and get the map of fields to their count
Here for all columns of input dataframe, the count is getting computed
Please note that count(anycol) returns the number of rows for which the supplied column are all non-null. For example - if a column has 10 row value and if say 5 values are null then the count(column) becomes 5
Fetch the first row as Map[colName, count(colName)] referred as fieldToCount
val cols = inputDF.columns.map { inputCol =>
functions.count(col(inputCol)).as(inputCol)
}
// Returns the number of rows for which the supplied column are all non-null.
// count(null) returns 0
val row = dataset.select(cols: _*).head()
val fieldToCount = row.getValuesMap[Long]($(inputCols))
Get the columns to be removed
Use the Map created in step#2 here and mark the column having count less than the totalCount as the column to be removed
select all the columns which has count == totalCount from the input dataframe and save the processed output Dataframe anywhere in any format as per requirement.
Please note that, this approach will remove all the column having at least one null value
val fieldToBool = fieldToCount.mapValues(_ < totalcount)
val processedDF = inputDF.select(fieldToBool.filterNot(_._2).map(_.1) :_*)
// save this processedDF anywhere in any format as per requirement
I believe this approach will perform well than the approach you have currently
I solved the above problem.
We can simply query the dataframe for null values.
df = df.filter(df.column.isNotNull()) thereby removing all rows where null is present.
So if there are n columns, We need 2^n queries to filter out all possible combinations. In my case there were 10 columns so total of 1024 queries, which is acceptable as sql queries are parrallelized.

Counting the repeated values in one column base on other column

Using Panda, I am dealing with the following CSV data type:
f,f,f,f,f,t,f,f,f,t,f,t,g,f,n,f,f,t,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,t,t,nowin
t,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
t,f,f,f,t,f,f,f,t,f,t,f,g,f,b,f,f,t,f,f,f,t,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
f,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,nowin
t,f,f,f,t,f,f,f,t,f,t,f,g,f,b,f,f,t,f,f,f,t,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
f,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,win
For this part of the raw data, I was trying to return something like:
Column1_name -- t -- counts of nowin = 0
Column1_name -- t -- count of wins = 3
Column1_name -- f -- count of nowin = 2
Column1_name -- f -- count of win = 1
Based on this idea get dataframe row count based on conditions I was thinking in doing something like this:
print(df[df.target == 'won'].count())
However, this would return always the same number of "wons" based on the last column without taking into consideration if this column it's a "f" or a "t". In other others, I was hoping to use something from Panda dataframe work that would produce the idea of a "group by" from SQL, grouping based on, for example, the 1st and last column.
Should I keep pursing this idea of should I simply start using for loops?
If you need, the rest of my code:
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/chess/king-rook-vs-king-pawn/kr-vs-kp.data"
df = pd.read_csv(url,names=[
'bkblk','bknwy','bkon8','bkona','bkspr','bkxbq','bkxcr','bkxwp','blxwp','bxqsq','cntxt','dsopp','dwipd',
'hdchk','katri','mulch','qxmsq','r2ar8','reskd','reskr','rimmx','rkxwp','rxmsq','simpl','skach','skewr',
'skrxp','spcop','stlmt','thrsk','wkcti','wkna8','wknck','wkovl','wkpos','wtoeg','target'
])
features = ['bkblk','bknwy','bkon8','bkona','bkspr','bkxbq','bkxcr','bkxwp','blxwp','bxqsq','cntxt','dsopp','dwipd',
'hdchk','katri','mulch','qxmsq','r2ar8','reskd','reskr','rimmx','rkxwp','rxmsq','simpl','skach','skewr',
'skrxp','spcop','stlmt','thrsk','wkcti','wkna8','wknck','wkovl','wkpos','wtoeg','target']
# number of lines
#tot_of_records = np.size(my_data,0)
#tot_of_records = np.unique(my_data[:,1])
#for item in my_data:
# item[:,0]
num_of_won=0
num_of_nowin=0
for item in df.target:
if item == 'won':
num_of_won = num_of_won + 1
else:
num_of_nowin = num_of_nowin + 1
print(num_of_won)
print(num_of_nowin)
print(df[df.target == 'won'].count())
#print(df[:1])
#print(df.bkblk.to_string(index=False))
#print(df.target.unique())
#ini_entropy = (() + ())
This could work -
outdf = df.apply(lambda x: pd.crosstab(index=df.target,columns=x).to_dict())
Basically we are going in on each feature column and making a crosstab with target column
Hope this helps! :)

pandas - drop row with list of values, if contains from list

I have a huge set of data. Something like 100k lines and I am trying to drop a row from a dataframe if the row, which contains a list, contains a value from another dataframe. Here's a small time example.
has = [['#a'], ['#b'], ['#c, #d, #e, #f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
tweet user
0 [#a] 1
1 [#b] 2
2 [#c, #d, #e, #f] 3
3 [#g] 5
z
0 #d
1 #a
The desired outcome would be
tweet user
0 [#b] 2
1 [#g] 5
Things i've tried
#this seems to work for dropping #a but not #d
for a in range(df.tweet.size):
for search in df2.z:
if search in df.loc[a].tweet:
df.drop(a)
#this works for my small scale example but throws an error on my big data
df['tweet'] = df.tweet.apply(', '.join)
test = df[~df.tweet.str.contains('|'.join(df2['z'].astype(str)))]
#the error being "unterminated character set at position 1343770"
#i went to check what was on that line and it returned this
basket.iloc[1343770]
user_id 17060480
tweet [#IfTheyWereBlackOrBrownPeople, #WTF]
Name: 4612505, dtype: object
Any help would be greatly appreciated.
is ['#c, #d, #e, #f'] 1 string or a list like this ['#c', '#d', '#e', '#f'] ?
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
simple solution would be
screen = set(df2.z.tolist())
to_delete = list() # this will speed things up doing only 1 delete
for id, row in df.iterrows():
if set(row.tweet).intersection(screen):
to_delete.append(id)
df.drop(to_delete, inplace=True)
speed comparaison (for 10 000 rows):
st = time.time()
screen = set(df2.z.tolist())
to_delete = list()
for id, row in df.iterrows():
if set(row.tweet).intersection(screen):
to_delete.append(id)
df.drop(to_delete, inplace=True)
print(time.time()-st)
2.142000198364258
st = time.time()
for a in df.tweet.index:
for search in df2.z:
if search in df.loc[a].tweet:
df.drop(a, inplace=True)
break
print(time.time()-st)
43.99799990653992
For me, your code works if I make several adjustments.
First, you're missing the last line when putting range(df.tweet.size), either increase this or (more robust, if you don't have an increasing index), use df.tweet.index.
Second, you don't apply your dropping, use inplace=True for that.
Third, you have #d in a string, the following is not a list: '#c, #d, #e, #f' and you have to change it to a list so it works.
So if you change that, the following code works fine:
has = [['#a'], ['#b'], ['#c', '#d', '#e', '#f'], ['#g']]
use = [1,2,3,5]
z = ['#d','#a']
df = pd.DataFrame({'user': use, 'tweet': has})
df2 = pd.DataFrame({'z': z})
for a in df.tweet.index:
for search in df2.z:
if search in df.loc[a].tweet:
df.drop(a, inplace=True)
break # so if we already dropped it we no longer look whether we should drop this line
This will provide the desired result. Be aware of this potentially being not optimal due to missing vectorization.
EDIT:
you can achieve the string being a list with the following:
from itertools import chain
df.tweet = df.tweet.apply(lambda l: list(chain(*map(lambda lelem: lelem.split(","), l))))
This applies a function to each line (assuming each line contains a list with one or more elements): Split each element (should be a string) by comma into a new list and "flatten" all the lists in one line (if there are multiple) together.
EDIT2:
Yes, this is not really performant But basically does what was asked. Keep that in mind and after having it working, try to improve your code (less for iterations, do tricks like collecting the indices and then drop all of them).

Categories

Resources