How to add a conditional 'if' to a 'map(+lambda)' function in Python

I have an example csv file with name 'r2.csv':
Factory | Product_Number | Date      | Avg_Noshow | Walk_Cost | Room_Rev
--------|----------------|-----------|------------|-----------|---------
A       | 1              | 01MAY2017 | 5.6        | 125       | 275
A       | 1              | 02MAY2017 | 0          | 200       | 300
A       | 1              | 03MAY2017 | 6.6        | 150       | 250
A       | 1              | 04MAY2017 | 7.5        | 175       | 325
I would like to read the file and calculate an output. I have the following Python code to read the CSV file and transfer the columns to lists:
# Read csv file
import numpy as np
import scipy.stats as stats
from scipy.stats import poisson, norm
import csv

with open('r2.csv', 'r') as infile:
    reader = csv.DictReader(infile)
    data = {}
    for row in reader:
        for header, value in row.items():
            try:
                data[header].append(value)
            except KeyError:
                data[header] = [value]

# Convert the columns from lists of strings to floats for later calculation.
mu = data['Avg_Noshow']
cs = data['Walk_Cost']
co = data['Room_Rev']
mu = map(float, mu)
cs = map(float, cs)
co = map(float, co)
The part above works fine and reads the data. The following is the function for the calculation.
# The following 'map()' function calculates Overbooking number
Overbooking_Number = map(lambda mu_, cs_, co_: np.ceil(poisson.ppf(co_ / (cs_ + co_), mu_)), mu, cs, co)
data['Overbooking_Number'] = Overbooking_Number
header = 'LOC_ID', 'Prod_Class_ID', 'Date', 'Avg_Noshow', 'Walk_Cost', 'Room_Rev', 'Overbooking_Number'

# Write to an output file
with open("output.csv", 'wb') as resultFile:
    wr = csv.writer(resultFile, quoting=csv.QUOTE_ALL)
    wr.writerow(header)
    z = zip(data['LOC_ID'], data['Prod_Class_ID'], data['Date'], data['Avg_Noshow'], data['Walk_Cost'], data['Room_Rev'], data['Overbooking_Number'])
    for i in z:
        wr.writerow(i)
It works fine as well.
However, how can I calculate and output 'Overbooking_Number' using the above function only when 'Avg_Noshow > 0', and output 'Overbooking_Number = 0' when 'Avg_Noshow = 0'?
For example, the output table may look like below:
Factory | Product_Number | Date      | Avg_Noshow | Walk_Cost | Room_Rev | Overbooking_Number
--------|----------------|-----------|------------|-----------|----------|-------------------
A       | 1              | 01MAY2017 | 5.6        | 125       | 275      | ...
A       | 1              | 02MAY2017 | 0          | 200       | 300      | 0
A       | 1              | 03MAY2017 | 6.6        | 150       | 250      | ...
A       | 1              | 04MAY2017 | 7.5        | 175       | 325      | ...
How can I add a conditional 'if' to my map(+lambda) function?
Thank you!

If I understand correctly, the condition is that mu should be higher than zero. In this case, I think you can simply use Python's "ternary operator" like this:
Overbooking_Number = map(lambda mu_, cs_, co_:
                         np.ceil(poisson.ppf(co_ / (cs_ + co_), mu_))
                         if mu_ > 0 else 0,
                         mu, cs, co)
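One hedged aside, not part of the original answer: under Python 3, map returns a lazy iterator, so you may want to wrap the result in list() before storing it in data and zipping it for the output file. An equivalent list comprehension, assuming mu, cs and co are plain lists of floats:

import numpy as np
from scipy.stats import poisson

# Equivalent sketch using a list comprehension: the conditional stays explicit
# and the result is a plain list in both Python 2 and Python 3.
Overbooking_Number = [
    np.ceil(poisson.ppf(co_ / (cs_ + co_), mu_)) if mu_ > 0 else 0
    for mu_, cs_, co_ in zip(mu, cs, co)
]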

How to extract data from unknown data type returned by BS html parser

https://docs.google.com/document/d/1qqhVYhuwQsR2GOkpcTwhLvX5QUBCj5tv7LYqXAzB2UE/edit?usp=sharing
The document above shows the output of BeautifulSoup after HTML parsing. It is the response from an API POST request. On the website, it renders as a table.
Can anyone tell what the data format is, and why find() and find_all() are not working with it?
# Import libs
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json

# Reading codes from local CSV file on my computer
dp_code = ("code.csv")

# Form Data for passing to the request body
formdata = {'objid': '14'}

# URL
url = "https://www.somewebsite.com"

# Query
for i in dp_code:
    formdata["objid"] = str(i)
    response = requests.request("POST", url, data=formdata, timeout=1500)
    out = response.content
    soup = BeautifulSoup(out, "html.parser")
    json = json.loads(soup.text)
    df = pd.DataFrame(bat["form"])
    df.to_csv(str(i) + ".csv")
Can anyone tell what the data format is...
It looks like either JSON or stringified JSON, which you probably realized, since you're using json.loads. I don't think parsing with BeautifulSoup before parsing the JSON is necessary at all, but I can't be sure without knowing what response.content looks like; in fact, if response.json() works, even json.loads becomes unnecessary.
...the output of BeautifulSoup after HTML parsing...
...and why find() and find_all() are not working with it.
There's not much point in using BeautifulSoup (which is for HTML parsing, as you yourself have noted!) unless the input is in an HTML/lxml/XML format. Otherwise, it tends to just be parsed as a document with a single NavigableString (and that's likely what happened here), so it then looks [to bs4] like there's nothing to find.
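As a hedged illustration of that response.json() route, reusing the url and formdata from your snippet and the 'form' key seen in the document (whether response.json() actually succeeds against your endpoint is an assumption):

import requests

# Hypothetical sketch: parse the response body as JSON directly,
# with no BeautifulSoup and no json.loads.
formdata = {"objid": "14"}
response = requests.post("https://www.somewebsite.com", data=formdata, timeout=30)
payload = response.json()      # raises if the body is not valid JSON
formHtml = payload["form"]     # the embedded HTML string, if the structure matches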
Anyway, I downloaded the document as a .txt file, then read it and extracted the one value [which is an HTML string] with
import json

docContents = open('Unknow Data Type.txt', mode='r', encoding='utf-8-sig').read()
formHtml = json.loads(docContents)['form']
(The encoding took a little bit of trial and error to figure out, but I expect that step will be unnecessary for you, as you have the raw contents.)
After that, formHtml can be parsed like any HTML string with BeautifulSoup; since it's just tables, you can even use pandas.read_html directly, but since you asked about find and find_all, I tried this little example:
formSoup = BeautifulSoup(formHtml, 'html.parser')
for t in formSoup.find_all('table'):
    print('+'*100)
    for r in t.find_all('tr'):
        cols = [c.text for c in r.find_all(['td', 'th'])]
        cols = [f'{c[:10].strip():^12}' for c in cols]  # just formatting
        print(f'| {" | ".join(cols)} |')
    print('+'*100)
It prints the tables as output:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
| DISTRICT: | KASARGOD | LOCAL BODY | G14001-Kum |
| WARD: | 001-ANNADU | POLLING ST | 002-G J B |
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
| Serial No | Name | Guardian's | New House | House Name | Gender / A | ID Card No |
| 1 | | Adru | 1 | | M / 55 | KL/01/002/ |
| 2 | | Battya | 1 | | F / 41 | KL/01/002/ |
| 3 | | | MOIDEEN KU | 289 | C.H.NAGAR | F / 22 | SECID15757 |
| 566 | | MOHAMMED K | 296 | ANNADKA HO | M / 49 | SECID15400 |
| 567 | | MOIDDEEN K | 296 | ANNADKA HO | F / 40 | SECID15400 |
| 568 | | MOHAMMED K | 296 | MUNDRAKOLA | M / 36 | SECID15400 |
| 569 | | RADHA | 381 | MACHAVU HO | M / 23 | SECID15400 |
| 570 | | SHIVAPPA S | 576 | UJJANTHODY | F / 47 | ZII0819813 |
| 571 | | SURESHA K | 826 | KARUVALTHA | F / 33 | JWQ1718857 |
| കൂട്ടിച്ചേ |
| 572 | DIVYA K | SUNDARA K | 182 | BHANDARA V | F / 24 | ZII0767137 |
| 573 | KUNHAMMA | ACHU BELCH | 185 | PODIPALLA | F / 84 | KL/01/002/ |
| 574 | SUJATHA M | KESHAVAN K | 186 | PODIPALLA | F / 48 | JWQ1687797 |
| 575 | SARATH M | SUJATHA M | 186 | PODIPALLA | M / 25 | SECID4BCFE |
| 576 | SAJITH K | SUJATHA M | 186 | PODIPPALLA | M / 21 | ZII3300043 |
| തിരുത്തലുക |
| ഇല്ല |
| ഒഴിവാക്കലു |
| ഇല്ല |
| |
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
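And, as mentioned above, since the payload is just tables, pandas.read_html is a shortcut worth trying; a hedged sketch building on the formHtml string extracted earlier (the output file names are just placeholders):

import io
import pandas as pd

# pandas.read_html parses every <table> in the HTML string into a DataFrame;
# newer pandas versions prefer a file-like object, hence the StringIO wrapper.
tables = pd.read_html(io.StringIO(formHtml))
for i, tbl in enumerate(tables):
    tbl.to_csv(f"table_{i}.csv", index=False)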

How to assign time stamp to the command in Python?

There are several types of commands in the third column of the text file, so I am using regular expressions to count the number of occurrences of each type of command.
For example, ACTIVE occurs 3 times and REFRESH 2 times. I would like to make my program more flexible by also assigning the time to each command.
Since one command can occur more than once, if the script associates each command with its times, users will know which ACTIVE occurs at what time. Any guidance or suggestions are welcome.
The idea is to make the script more flexible.
My code:
import re

a = a_1 = b = b_1 = c = d = e = 0
lines = open("page_stats.txt", "r").readlines()
for line in lines:
    if re.search(r"WRITING_A", line):
        a_1 += 1
    elif re.search(r"WRITING", line):
        a += 1
    elif re.search(r"READING_A", line):
        b_1 += 1
    elif re.search(r"READING", line):
        b += 1
    elif re.search(r"PRECHARGE", line):
        c += 1
    elif re.search(r"ACTIVE", line):
        d += 1
File content:
-----------------------------------------------------------------
| Number | Time | Command | Data |
-----------------------------------------------------------------
| 1 | 0015 | ACTIVE | |
| 2 | 0030 | WRITING | |
| 3 | 0100 | WRITING_A | |
| 4 | 0115 | PRECHARGE | |
| 5 | 0120 | REFRESH | |
| 6 | 0150 | ACTIVE | |
| 7 | 0200 | READING | |
| 8 | 0314 | PRECHARGE | |
| 9 | 0318 | ACTIVE | |
| 10 | 0345 | WRITING_A | |
| 11 | 0430 | WRITING_A | |
| 12 | 0447 | WRITING | |
| 13 | 0503 | PRECHARGE | |
| 14 | 0610 | REFRESH | |
Assuming you want to count the occurrences of each command and store
the timestamps of each command as well, would you please try:
import re

count = {}
timestamps = {}
with open("page_stats.txt", "r") as f:
    for line in f:
        m = re.split(r"\s*\|\s*", line)
        if len(m) > 3 and re.match(r"\d+", m[1]):
            count[m[3]] = count[m[3]] + 1 if m[3] in count else 1
            if m[3] in timestamps:
                timestamps[m[3]].append(m[2])
            else:
                timestamps[m[3]] = [m[2]]

# see the limited result (example)
#print(count["ACTIVE"])
#print(timestamps["ACTIVE"])

# see the results
for key in count:
    print("%-10s: %2d, %s" % (key, count[key], timestamps[key]))
Output:
REFRESH : 2, ['0120', '0610']
WRITING : 2, ['0030', '0447']
PRECHARGE : 3, ['0115', '0314', '0503']
ACTIVE : 3, ['0015', '0150', '0318']
READING : 1, ['0200']
WRITING_A : 3, ['0100', '0345', '0430']
m = re.split(r"\s*\|\s*", line) splits line on a pipe character which
may be preceded and/or followed by blank characters.
Then the list elements m[1], m[2], m[3] are assigned to the Number, Time and Command
columns, in that order.
The condition if len(m) > 3 and re.match(r"\d+", m[1]) skips the
header lines.
Then the dictionary variables count and timestamps are assigned,
incremented or appended to, entry by entry.
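As a hedged aside, not part of the answer above: collections.defaultdict removes the explicit key checks, and the count can be derived from the length of each timestamp list:

import re
from collections import defaultdict

timestamps = defaultdict(list)
with open("page_stats.txt", "r") as f:
    for line in f:
        m = re.split(r"\s*\|\s*", line)
        if len(m) > 3 and re.match(r"\d+", m[1]):
            timestamps[m[3]].append(m[2])   # command -> list of times

for command, times in timestamps.items():
    print("%-10s: %2d, %s" % (command, len(times), times))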

Splitting a csv into multiple csv's depending on what is in column 1 using python

So I currently have a large CSV containing data for a number of events.
Column one contains a number of dates as well as some IDs, one for each event.
Basically, I want to write something in Python so that whenever there is an ID number (AL.....) it creates a new CSV, with the ID number as the title, containing all the data before the next ID number, so I end up with a CSV for each event.
For info, the whole CSV contains 8 columns, but the division into individual CSVs is only based on column one.
Use Python to split a CSV file with multiple headers
I noticed this question is quite similar, but in my case I have AL followed by a different string of numbers each time, and I also want to name the new CSVs after the ID numbers.
You can achieve this using pandas, so let's first generate some data:
import pandas as pd
import numpy as np

def date_string():
    return str(np.random.randint(1, 32)) + "/" + str(np.random.randint(1, 13)) + "/1997"

l = [date_string() for x in range(20)]
l[0] = "AL123"
l[10] = "AL321"

df = pd.DataFrame(l, columns=['idx'])
# -->
| | idx |
|---:|:-----------|
| 0 | AL123 |
| 1 | 24/3/1997 |
| 2 | 8/6/1997 |
| 3 | 6/9/1997 |
| 4 | 31/12/1997 |
| 5 | 11/6/1997 |
| 6 | 2/3/1997 |
| 7 | 31/8/1997 |
| 8 | 21/5/1997 |
| 9 | 30/1/1997 |
| 10 | AL321 |
| 11 | 8/4/1997 |
| 12 | 21/7/1997 |
| 13 | 9/10/1997 |
| 14 | 31/12/1997 |
| 15 | 15/2/1997 |
| 16 | 21/2/1997 |
| 17 | 3/3/1997 |
| 18 | 16/12/1997 |
| 19 | 16/2/1997 |
So, the interesting positions are 0 and 10, as that is where the AL* strings are...
Now, to filter for the AL* strings you can use:
idx = df.index[df['idx'].str.startswith('AL')]  # gets all indices where the value starts with AL
dfs = np.split(df, idx)  # splits the data at those positions
for out in dfs[1:]:
    name = out.iloc[0, 0]
    out.to_csv(name + ".csv", index=False, header=False)  # saves the data
This gives you two csv files named AL123.csv and AL321.csv with the first line being the AL* string.
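The same idea should carry over to the real file; a hedged sketch, assuming the 8-column CSV is called events.csv and has no header row (both assumptions, not from the question):

import numpy as np
import pandas as pd

# Assumptions: the file is named events.csv and has no header row.
df = pd.read_csv("events.csv", header=None)

# Split wherever the first column starts with "AL" and write one CSV per event,
# named after the ID that opens each block.
idx = df.index[df[0].astype(str).str.startswith("AL")]
for chunk in np.split(df, idx)[1:]:
    chunk.to_csv(str(chunk.iloc[0, 0]) + ".csv", index=False, header=False)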

How to efficiently extract unique rows from massive CSV using Python or R

I have a massive CSV (1.4 GB, over 1 million rows) of stock market data that I will process using R.
The table looks roughly like this. For each ticker, there are thousands of rows of data.
+--------+------+-------+------+------+
| Ticker | Open | Close | High | Low |
+--------+------+-------+------+------+
| A | 121 | 121 | 212 | 2434 |
| A | 32 | 23 | 43 | 344 |
| A | 121 | 121 | 212 | 2434 |
| A | 32 | 23 | 43 | 344 |
| A | 121 | 121 | 212 | 2434 |
| B | 32 | 23 | 43 | 344 |
+--------+------+-------+------+------+
To make processing and testing easier, I'm breaking this colossus into smaller files using the script mentioned in this question: How do I slice a single CSV file into several smaller ones grouped by a field?
The script would output files such as data_a.csv, data_b.csv, etc.
But, I would also like to create index.csv which simply lists all the unique stock ticker names.
E.g.
+---------+
| Ticker |
+---------+
| A |
| B |
| C |
| D |
| ... |
+---------+
Can anybody recommend an efficient way of doing this in R or Python, when handling a huge filesize?
You could loop through each file, grabbing the index of each and creating a set union of all indices.
import glob
import pandas as pd

tickers = set()
for csvfile in glob.glob('*.csv'):
    data = pd.read_csv(csvfile, index_col=0, header=None)  # or header=0, however your data is set up
    tickers.update(data.index.tolist())

pd.Series(list(tickers)).to_csv('index.csv', index=False)
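If reading every split file back in is too slow, a hedged alternative (not part of the answer above) is to stream the original large CSV once with the csv module and collect the tickers as you go; this sketch assumes the big file is named data.csv, has a header row, and keeps the ticker in the first column:

import csv

tickers = set()
with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)                 # skip the header row
    for row in reader:
        tickers.add(row[0])      # ticker assumed to be the first column

with open("index.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Ticker"])
    writer.writerows([t] for t in sorted(tickers))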
You can retrieve the index from the file names:
(index <- data.frame(Ticker = toupper(gsub("^.*_(.*)\\.csv",
"\\1",
list.files()))))
## Ticker
## 1 A
## 2 B
write.csv(index, "index.csv")

Spark DataFrame operators (nunique, multiplication)

I'm using a Jupyter notebook with pandas, but when I use Spark I want to do the conversion and computation with Spark DataFrames instead of pandas. Please help me convert some computations to Spark DataFrames or RDDs.
DataFrame:
df =
+--------+-------+---------+--------+
| userId | item | price | value |
+--------+-------+---------+--------+
| 169 | I0111 | 5300 | 1 |
| 169 | I0973 | 70 | 1 |
| 336 | C0174 | 455 | 1 |
| 336 | I0025 | 126 | 1 |
| 336 | I0973 | 4 | 1 |
| 770963 | B0166 | 2 | 1 |
| 1294537| I0110 | 90 | 1 |
+--------+-------+---------+--------+
1. Using pandas:
(1) userItem = df.groupby(['userId'])['item'].nunique()
and the result is a Series object:
+--------+------+
| userId | |
+--------+------+
| 169 | 2 |
| 336 | 3 |
| 770963 | 1 |
| 1294537| 1 |
+--------+------+
2. Using multiplication
data_sum = df.groupby(['userId', 'item'])['value'].sum() --> the result is a Series object
average_played = np.mean(userItem) --> the result is a number
(2) weighted_games_played = data_sum * (average_played / userItem)
Please help me do (1) and (2) using Spark DataFrames and Spark operators.
You can achieve (1) using something like the following:
import pyspark.sql.functions as f
userItem=df.groupby('userId').agg(f.expr('count(distinct item)').alias('n_item'))
and for (2):
data_sum=df.groupby(['userId','item']).agg(f.sum('value').alias('sum_value'))
average_played=userItem.agg(f.mean('n_item').alias('avg_played'))
data_sum=data_sum.join(userItem, on='userId').crossJoin(average_played)
data_sum=data_sum.withColumn("weighted_games_played", f.expr("sum_value*avg_played/n_item"))
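As a hedged follow-up (equivalent to the expr above, if you prefer a built-in helper), pyspark.sql.functions.countDistinct can replace the count(distinct ...) expression; a self-contained sketch using a few rows from the question's table:

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(169, 'I0111', 5300, 1), (169, 'I0973', 70, 1), (336, 'C0174', 455, 1)],
    ['userId', 'item', 'price', 'value'])

# Same result as f.expr('count(distinct item)'), via the built-in helper.
userItem = df.groupby('userId').agg(f.countDistinct('item').alias('n_item'))
userItem.show()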
You can define a method like below:
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.{Vectors, Matrices, DenseVector}
import org.apache.spark.sql.types.{StructType, StructField, DoubleType}
import org.apache.spark.sql.functions.{array, col}
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.{SparkConf, SparkContext}

object retain {
  implicit class DataFrameTransforms(left: DataFrame) {
    val dftordd = left.rdd.map { case row =>
      Vectors.dense(row.toSeq.toArray.map { x => x.asInstanceOf[Double] })
    }
    val leftRM = new RowMatrix(dftordd)

    def multiply(right: DataFrame): DataFrame = {
      val matrixC = right.columns.map(col(_))
      val arr = right.select(array(matrixC: _*).as("arr")).as[Array[Double]].collect.flatten
      val rows = right.count().toInt
      val cols = matrixC.length
      val rightRM = Matrices.dense(cols, rows, arr).transpose
      val product = leftRM.multiply(rightRM).rows
      val x = product.map(r => r.toArray).collect.map(p => Row(p: _*))
      var schema = new StructType()
      var i = 0
      val t = cols
      while (i < t) {
        schema = schema.add(StructField(s"component${i}", DoubleType, true))
        i = i + 1
      }
      val err = spark.createDataFrame(sc.parallelize(x), schema)
      err
    }
  }
}
and before using it, just
import retain._
Say you have two DataFrames called df1 (m×n) and df2 (n×m):
df1.multiply(df2)
