How to read a csv with rows of NUL, ('\x00'), into pandas?

I have a set of csv files with Date and Time as the first two columns (no headers in the files). The files open up fine in Excel but when I try to read them into Python using Pandas read_csv, only the first Date is returned, whether or not I try a type conversion.
When I open a file in Notepad, it is not simply comma-separated: there is a lot of whitespace before each line after line 1. I have tried skipinitialspace = True, to no avail.
I have also tried various type conversions but none work. I am currently using parse_dates = [['Date','Time']], infer_datetime_format = True, dayfirst = True
Example output (no conversion):
0 1 2 3 4 ... 12 13 14 15 16
0 02/03/20 15:13:39 5.5 5.8 42.84 ... 30.0 79.0 0.0 0.0 0.0
1 NaN 15:13:49 5.5 5.8 42.84 ... 30.0 79.0 0.0 0.0 0.0
2 NaN 15:13:59 5.5 5.7 34.26 ... 30.0 79.0 0.0 0.0 0.0
3 NaN 15:14:09 5.5 5.7 34.26 ... 30.0 79.0 0.0 0.0 0.0
4 NaN 15:14:19 5.5 5.4 17.10 ... 30.0 79.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ...
39451 NaN 01:14:27 5.5 8.4 60.00 ... 30.0 68.0 0.0 0.0 0.0
39452 NaN 01:14:37 5.5 8.4 60.00 ... 30.0 68.0 0.0 0.0 0.0
39453 NaN 01:14:47 5.5 8.4 60.00 ... 30.0 68.0 0.0 0.0 0.0
39454 NaN 01:14:57 5.5 8.4 60.00 ... 30.0 68.0 0.0 0.0 0.0
39455 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
And with parse_dates etc:
Date_Time pH1 SP pH Ph1 PV pH ... 1 2 3
0 02/03/20 15:13:39 5.5 5.8 ... 0.0 0.0 0.0
1 nan 15:13:49 5.5 5.8 ... 0.0 0.0 0.0
2 nan 15:13:59 5.5 5.7 ... 0.0 0.0 0.0
3 nan 15:14:09 5.5 5.7 ... 0.0 0.0 0.0
4 nan 15:14:19 5.5 5.4 ... 0.0 0.0 0.0
... ... ... ... ... ... ... ...
39451 nan 01:14:27 5.5 8.4 ... 0.0 0.0 0.0
39452 nan 01:14:37 5.5 8.4 ... 0.0 0.0 0.0
39453 nan 01:14:47 5.5 8.4 ... 0.0 0.0 0.0
39454 nan 01:14:57 5.5 8.4 ... 0.0 0.0 0.0
39455 nan nan NaN NaN ... NaN NaN NaN
Data copied from Notepad (there is actually more whitespace in front of each line but it wouldn't work here):
Data from 67.csv
02/03/20,15:13:39,5.5,5.8,42.84,7.2,6.8,10.63,60.0,0.0,300,1,30,79,0.0,0.0, 0.0
02/03/20,15:13:49,5.5,5.8,42.84,7.2,6.8,10.63,60.0,0.0,300,1,30,79,0.0,0.0, 0.0
02/03/20,15:13:59,5.5,5.7,34.26,7.2,6.8,10.63,60.0,22.3,300,1,30,79,0.0,0.0, 0.0
02/03/20,15:14:09,5.5,5.7,34.26,7.2,6.8,10.63,60.0,15.3,300,45,30,79,0.0,0.0, 0.0
02/03/20,15:14:19,5.5,5.4,17.10,7.2,6.8,10.63,60.0,50.2,300,86,30,79,0.0,0.0, 0.0
And in Excel (so I know the information is there and readable):
Code
import sys
import numpy as np
import pandas as pd
from datetime import datetime
from tkinter import filedialog
from tkinter import *
def import_file(filename):
    print('\nOpening ' + filename + ":")
    ##Read the data in the file
    df = pd.read_csv(filename, header = None, low_memory = False)
    print(df)
    df['Date_Time'] = pd.to_datetime(df[0] + ' ' + df[1])
    df.drop(columns=[0, 1], inplace=True)
    print(df)
filenames=[]
print('Select files to read, Ctrl or Shift for Multiples')
TkWindow = Tk()
TkWindow.withdraw() # we don't want a full GUI, so keep the root window from appearing
## Show an "Open" dialog box and return the path to the selected file
filenames = filedialog.askopenfilename(title='Open data file', filetypes=(("Comma delimited", "*.csv"),), multiple=True)
TkWindow.destroy()
if len(filenames) == 0:
    print('No files selected - Exiting program.')
    sys.exit()
else:
    print('\n'.join(filenames))
##Read the data from the specified file/s
print('\nReading data file/s')
dfs=[]
for filename in filenames:
    dfs.append(import_file(filename))
if len(dfs) > 1:
    print('\nCombining data files.')

The file is filled with NUL, '\x00', which needs to be removed.
Use pandas.DataFrame to load the data from d, after the rows have been cleaned.
import pandas as pd
import string # to make column names
# the issue is that the file is filled with NUL, not whitespace
def import_file(filename):
    # open the file and clean it
    with open(filename) as f:
        d = list(f.readlines())
    # replace NUL, strip whitespace from the end of the strings, split each string into a list
    d = [v.replace('\x00', '').strip().split(',') for v in d]
    # remove some empty rows
    d = [v for v in d if len(v) > 2]
    # load the file with pandas
    df = pd.DataFrame(d)
    # convert column 0 and 1 to a datetime
    df['datetime'] = pd.to_datetime(df[0] + ' ' + df[1])
    # drop column 0 and 1
    df.drop(columns=[0, 1], inplace=True)
    # set datetime as the index
    df.set_index('datetime', inplace=True)
    # convert data in columns to floats
    df = df.astype('float')
    # give character column names
    df.columns = list(string.ascii_uppercase)[:len(df.columns)]
    # reset the index
    df.reset_index(inplace=True)
    return df.copy()
# call the function
dfs = list()
filenames = ['67.csv']
for filename in filenames:
    dfs.append(import_file(filename))
# df only exists inside import_file, so show the first loaded frame
# (display() is provided by IPython/Jupyter; use print() elsewhere)
display(dfs[0])
A B C D E F G H I J K L M N O
datetime
2020-02-03 15:13:39 5.5 5.8 42.84 7.2 6.8 10.63 60.0 0.0 300.0 1.0 30.0 79.0 0.0 0.0 0.0
2020-02-03 15:13:49 5.5 5.8 42.84 7.2 6.8 10.63 60.0 0.0 300.0 1.0 30.0 79.0 0.0 0.0 0.0
2020-02-03 15:13:59 5.5 5.7 34.26 7.2 6.8 10.63 60.0 22.3 300.0 1.0 30.0 79.0 0.0 0.0 0.0
2020-02-03 15:14:09 5.5 5.7 34.26 7.2 6.8 10.63 60.0 15.3 300.0 45.0 30.0 79.0 0.0 0.0 0.0
2020-02-03 15:14:19 5.5 5.4 17.10 7.2 6.8 10.63 60.0 50.2 300.0 86.0 30.0 79.0 0.0 0.0 0.0
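As an alternative to splitting the lines by hand, the NUL bytes can be stripped first and the cleaned text handed back to pandas.read_csv. The following is only a minimal sketch: the helper names frame_from_nul_bytes and read_nul_csv are hypothetical, and the latin-1 decoding is an assumption about the files, not something stated in the question.

```python
import io
import pandas as pd

def frame_from_nul_bytes(raw, **read_csv_kwargs):
    """Build a DataFrame from CSV bytes that are padded with NUL characters."""
    # Remove every NUL byte; lines that held only NULs become blank lines,
    # which read_csv skips by default.
    text = raw.replace(b"\x00", b"").decode("latin-1")  # encoding is an assumption
    return pd.read_csv(io.StringIO(text), header=None,
                       skipinitialspace=True, **read_csv_kwargs)

def read_nul_csv(path, **read_csv_kwargs):
    """Hypothetical helper: read the file as bytes, then clean and parse it."""
    with open(path, "rb") as f:
        return frame_from_nul_bytes(f.read(), **read_csv_kwargs)
```

With this approach read_csv infers the numeric columns itself, so only the first two columns remain strings. They could then be combined with pd.to_datetime(df[0] + ' ' + df[1], dayfirst=True) if the dates are day-first, as the question's dayfirst = True setting suggests.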

Related

Calculating sum of up to the current row in pandas while iterating on each row in a time series data

Suppose I have the following code that calculates how many products I can purchase given my budget-
import math
import pandas as pd
data = [['2021-01-02', 5.5], ['2021-02-02', 10.5], ['2021-03-02', 15.0], ['2021-04-02', 20.0]]
df = pd.DataFrame(data, columns=['Date', 'Current_Price'])
df.Date = pd.to_datetime(df.Date)
mn = df.Date.min()
mx = df.Date.max()
dr = pd.date_range(mn - pd.tseries.offsets.MonthBegin(), mx + pd.tseries.offsets.MonthEnd(), name="Date")
df = df.set_index("Date").reindex(dr).reset_index()
df['Current_Price'] = df.groupby(
    pd.Grouper(key='Date', freq='1M'))['Current_Price'].ffill().bfill()
# The dataframe below shows the current price of the product
# I'd like to buy at the specific date_range
print(df)
# Create 'Day' column to know which day of the month
df['Day'] = pd.to_datetime(df['Date']).dt.day
# Create 'Deposit' column to record how much money is
# deposited in, say, my bank account to buy the product.
# 'Withdrawal' column is to record how much I spent in
# buying product(s) at the current price on a specific date.
# 'Num_of_Products_Bought' shows how many items I bought
# on that specific date.
#
# Please note that the calculation below takes into account
# the left over money, which remains after I've purchased a
# product, for future purchase. For example, if you observe
# the resulting dataframe at the end of this code, you'll
# notice that I was able to purchase 7 products on March 1, 2021
# although my deposit on that day was $100. That is because
# on the days leading up to March 1, 2021, I have been saving
# the spare change from previous product purchases and that
# extra money allows me to buy an extra product on March 1, 2021
# despite my budget of $100 should only allow me to purchase
# 6 products.
df[['Deposit', 'Withdrawal', 'Num_of_Products_Bought']] = 0.0
# Suppose I save $100 at the beginning of every month in my bank account
df.loc[df['Day'] == 1, 'Deposit'] = 100.0
for index, row in df.iterrows():
    if df.loc[index, 'Day'] == 1:
        # num_prod_bought = (sum_of_deposit_so_far - sum_of_withdrawal)/current_price
        df.loc[index, 'Num_of_Products_Bought'] = math.floor(
            (sum(df.iloc[0:(index + 1)]['Deposit'])
             - sum(df.iloc[0:(index + 1)]['Withdrawal']))
            / df.loc[index, 'Current_Price'])
        # Record how much I spent buying the product on specific date
        df.loc[index, 'Withdrawal'] = df.loc[index, 'Num_of_Products_Bought'] * df.loc[index, 'Current_Price']
print(df)
# This code above is working as intended,
# but how can I make it more efficient/pandas-like?
# In particular, I don't like the idea of having to
# iterate the rows and having to recalculate
# the running (sum of) deposit amount and
# the running (sum of) the withdrawal.
As mentioned in the comment in the code, I would like to know how to accomplish the same without having to iterate the rows one by one and calculating the sum of the rows up to the current row in my iteration (I read around StackOverflow and saw cumsum() function, but I don't think cumsum has the notion of current row in the iteration).
Thank you very much in advance for your suggestions/answers!

A solution using .apply:
def fn():
    leftover = 0
    amount, deposit = yield
    while True:
        new_amount, new_deposit = yield (deposit + leftover) // amount
        leftover = (deposit + leftover) % amount
        amount, deposit = new_amount, new_deposit
df = df.set_index("Date")
s = fn()
next(s)
m = df.index.day == 1
df.loc[m, "Deposit"] = 100
df.loc[m, "Num_of_Products_Bought"] = df.loc[
    m, ["Current_Price", "Deposit"]
].apply(lambda x: s.send((x["Current_Price"], x["Deposit"])), axis=1)
df.loc[m, "Withdrawal"] = (
    df.loc[m, "Num_of_Products_Bought"] * df.loc[m, "Current_Price"]
)
print(df.fillna(0).reset_index())
Prints:
Date Current_Price Deposit Num_of_Products_Bought Withdrawal
0 2021-01-01 5.5 100.0 18.0 99.0
1 2021-01-02 5.5 0.0 0.0 0.0
2 2021-01-03 5.5 0.0 0.0 0.0
3 2021-01-04 5.5 0.0 0.0 0.0
4 2021-01-05 5.5 0.0 0.0 0.0
5 2021-01-06 5.5 0.0 0.0 0.0
6 2021-01-07 5.5 0.0 0.0 0.0
7 2021-01-08 5.5 0.0 0.0 0.0
8 2021-01-09 5.5 0.0 0.0 0.0
9 2021-01-10 5.5 0.0 0.0 0.0
10 2021-01-11 5.5 0.0 0.0 0.0
11 2021-01-12 5.5 0.0 0.0 0.0
12 2021-01-13 5.5 0.0 0.0 0.0
13 2021-01-14 5.5 0.0 0.0 0.0
14 2021-01-15 5.5 0.0 0.0 0.0
15 2021-01-16 5.5 0.0 0.0 0.0
16 2021-01-17 5.5 0.0 0.0 0.0
17 2021-01-18 5.5 0.0 0.0 0.0
18 2021-01-19 5.5 0.0 0.0 0.0
19 2021-01-20 5.5 0.0 0.0 0.0
20 2021-01-21 5.5 0.0 0.0 0.0
21 2021-01-22 5.5 0.0 0.0 0.0
22 2021-01-23 5.5 0.0 0.0 0.0
23 2021-01-24 5.5 0.0 0.0 0.0
24 2021-01-25 5.5 0.0 0.0 0.0
25 2021-01-26 5.5 0.0 0.0 0.0
26 2021-01-27 5.5 0.0 0.0 0.0
27 2021-01-28 5.5 0.0 0.0 0.0
28 2021-01-29 5.5 0.0 0.0 0.0
29 2021-01-30 5.5 0.0 0.0 0.0
30 2021-01-31 5.5 0.0 0.0 0.0
31 2021-02-01 10.5 100.0 9.0 94.5
32 2021-02-02 10.5 0.0 0.0 0.0
33 2021-02-03 10.5 0.0 0.0 0.0
34 2021-02-04 10.5 0.0 0.0 0.0
35 2021-02-05 10.5 0.0 0.0 0.0
36 2021-02-06 10.5 0.0 0.0 0.0
37 2021-02-07 10.5 0.0 0.0 0.0
38 2021-02-08 10.5 0.0 0.0 0.0
39 2021-02-09 10.5 0.0 0.0 0.0
40 2021-02-10 10.5 0.0 0.0 0.0
41 2021-02-11 10.5 0.0 0.0 0.0
42 2021-02-12 10.5 0.0 0.0 0.0
43 2021-02-13 10.5 0.0 0.0 0.0
44 2021-02-14 10.5 0.0 0.0 0.0
45 2021-02-15 10.5 0.0 0.0 0.0
46 2021-02-16 10.5 0.0 0.0 0.0
47 2021-02-17 10.5 0.0 0.0 0.0
48 2021-02-18 10.5 0.0 0.0 0.0
49 2021-02-19 10.5 0.0 0.0 0.0
50 2021-02-20 10.5 0.0 0.0 0.0
51 2021-02-21 10.5 0.0 0.0 0.0
52 2021-02-22 10.5 0.0 0.0 0.0
53 2021-02-23 10.5 0.0 0.0 0.0
54 2021-02-24 10.5 0.0 0.0 0.0
55 2021-02-25 10.5 0.0 0.0 0.0
56 2021-02-26 10.5 0.0 0.0 0.0
57 2021-02-27 10.5 0.0 0.0 0.0
58 2021-02-28 10.5 0.0 0.0 0.0
59 2021-03-01 15.0 100.0 7.0 105.0
60 2021-03-02 15.0 0.0 0.0 0.0
61 2021-03-03 15.0 0.0 0.0 0.0
62 2021-03-04 15.0 0.0 0.0 0.0
63 2021-03-05 15.0 0.0 0.0 0.0
64 2021-03-06 15.0 0.0 0.0 0.0
65 2021-03-07 15.0 0.0 0.0 0.0
66 2021-03-08 15.0 0.0 0.0 0.0
67 2021-03-09 15.0 0.0 0.0 0.0
68 2021-03-10 15.0 0.0 0.0 0.0
69 2021-03-11 15.0 0.0 0.0 0.0
70 2021-03-12 15.0 0.0 0.0 0.0
71 2021-03-13 15.0 0.0 0.0 0.0
72 2021-03-14 15.0 0.0 0.0 0.0
73 2021-03-15 15.0 0.0 0.0 0.0
74 2021-03-16 15.0 0.0 0.0 0.0
75 2021-03-17 15.0 0.0 0.0 0.0
76 2021-03-18 15.0 0.0 0.0 0.0
77 2021-03-19 15.0 0.0 0.0 0.0
78 2021-03-20 15.0 0.0 0.0 0.0
79 2021-03-21 15.0 0.0 0.0 0.0
80 2021-03-22 15.0 0.0 0.0 0.0
81 2021-03-23 15.0 0.0 0.0 0.0
82 2021-03-24 15.0 0.0 0.0 0.0
83 2021-03-25 15.0 0.0 0.0 0.0
84 2021-03-26 15.0 0.0 0.0 0.0
85 2021-03-27 15.0 0.0 0.0 0.0
86 2021-03-28 15.0 0.0 0.0 0.0
87 2021-03-29 15.0 0.0 0.0 0.0
88 2021-03-30 15.0 0.0 0.0 0.0
89 2021-03-31 15.0 0.0 0.0 0.0
90 2021-04-01 20.0 100.0 5.0 100.0
91 2021-04-02 20.0 0.0 0.0 0.0
92 2021-04-03 20.0 0.0 0.0 0.0
93 2021-04-04 20.0 0.0 0.0 0.0
94 2021-04-05 20.0 0.0 0.0 0.0
95 2021-04-06 20.0 0.0 0.0 0.0
96 2021-04-07 20.0 0.0 0.0 0.0
97 2021-04-08 20.0 0.0 0.0 0.0
98 2021-04-09 20.0 0.0 0.0 0.0
99 2021-04-10 20.0 0.0 0.0 0.0
100 2021-04-11 20.0 0.0 0.0 0.0
101 2021-04-12 20.0 0.0 0.0 0.0
102 2021-04-13 20.0 0.0 0.0 0.0
103 2021-04-14 20.0 0.0 0.0 0.0
104 2021-04-15 20.0 0.0 0.0 0.0
105 2021-04-16 20.0 0.0 0.0 0.0
106 2021-04-17 20.0 0.0 0.0 0.0
107 2021-04-18 20.0 0.0 0.0 0.0
108 2021-04-19 20.0 0.0 0.0 0.0
109 2021-04-20 20.0 0.0 0.0 0.0
110 2021-04-21 20.0 0.0 0.0 0.0
111 2021-04-22 20.0 0.0 0.0 0.0
112 2021-04-23 20.0 0.0 0.0 0.0
113 2021-04-24 20.0 0.0 0.0 0.0
114 2021-04-25 20.0 0.0 0.0 0.0
115 2021-04-26 20.0 0.0 0.0 0.0
116 2021-04-27 20.0 0.0 0.0 0.0
117 2021-04-28 20.0 0.0 0.0 0.0
118 2021-04-29 20.0 0.0 0.0 0.0
119 2021-04-30 20.0 0.0 0.0 0.0
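The same carry-over logic can also be written as a plain loop over the first-of-month prices only, which may be easier to follow than the generator. This is a sketch under stated assumptions: the function name monthly_purchases is hypothetical, and it takes the list of monthly prices directly instead of the full daily dataframe.

```python
import pandas as pd

def monthly_purchases(prices, deposit=100.0):
    """Carry the unspent remainder forward instead of re-summing earlier rows."""
    leftover = 0.0
    rows = []
    for price in prices:
        budget = deposit + leftover   # new deposit plus change saved so far
        bought = budget // price      # whole products affordable this month
        spent = bought * price
        leftover = budget - spent     # change carried into the next month
        rows.append((bought, spent))
    return pd.DataFrame(rows, columns=["Num_of_Products_Bought", "Withdrawal"])
```

Each month only needs the previous month's leftover, so the running sums of Deposit and Withdrawal never have to be recomputed. For the four prices in the question (5.5, 10.5, 15.0, 20.0) this gives the same purchases as the table above: 18, 9, 7 and 5 products.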

Iterate and transfer dataframe information to an image

I'm trying to "convert" the information of several blocks of dataframe rows 0 to 15 and columns col1 to col 16 into an image (16x16).
Between each 16x16 block, I have an empty line as you can see.
I'm reading the dataframe from a .txt file:
df = pd.read_csv('User1/Video1.txt', sep=r"\s+|\t+|\s+\t+|\t+\s+", header=None, names=headers, engine='python', parse_dates=parse_dates)
date arrow col1 col2 ... col13 col14 col15 col16
0 2020-11-09 09:53:39.552 -> 0.0 0.0 ... 0.0 0.0 0.0 0.0
1 2020-11-09 09:53:39.552 -> 0.0 2.0 ... 0.0 0.0 0.0 0.0
2 2020-11-09 09:53:39.552 -> 0.0 0.0 ... 0.0 0.0 6.0 6.0
3 2020-11-09 09:53:39.552 -> 0.0 0.0 ... 0.0 0.0 0.0 0.0
4 2020-11-09 09:53:39.586 -> 0.0 9.0 ... 0.0 7.0 0.0 0.0
...
15 2020-11-09 09:53:39.586 -> 0.0 9.0 ... 0.0 7.0 0.0 0.0
16 2020-11-09 09:53:39.586 ->
...
1648 2020-11-09 09:54:06.920 -> 4.0 0.0 ... 4.0 4.0 0.0 0.0
I can reshape the first 16x16 block with img=np.resize(df.iloc[:16, -16:].to_numpy(), (16, 16, 3)), but I want to iterate over the whole dataframe and sum all the pixel values of each 16x16 block.
Can you provide any advice?

If you can provide input of the first 32 rows and columns I can test, but try the following code (untested). Essentially, you can just loop through each 16x16 block, get the sum and append to a list:
i = 16
img_list = []
# integer division: range() does not accept a float
for _ in range(df.shape[0] // 17):
    img = np.resize(df.iloc[i-16:i, -16:].to_numpy(), (16, 16, 3))
    img_sum = img.sum()
    img_list.append(img_sum)
    i += 17
Edit: I just saw you had one empty line separating blocks. I have updated the necessary values to 17.
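A loop-free variant of the same idea is to drop the separator rows and reshape everything at once. This is only a sketch: the function name block_sums is hypothetical, and it assumes that the separator rows are entirely NaN in the last 16 columns and that every block has exactly 16 rows.

```python
import numpy as np
import pandas as pd

def block_sums(df, block=16):
    """Return the (n, block, block) image stack and the pixel sum of each image."""
    # Keep only the pixel columns and drop the all-NaN separator rows (assumption).
    pixels = df.iloc[:, -block:].dropna(how="all").to_numpy(dtype=float)
    n_blocks = len(pixels) // block          # ignore an incomplete trailing block
    images = pixels[: n_blocks * block].reshape(n_blocks, block, block)
    return images, images.sum(axis=(1, 2))
```

Unlike np.resize to (16, 16, 3), which repeats the 256 values three times, this keeps each block as a single 16x16 array, so each sum counts every pixel once.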

Add a date to a list of dataframes by extracting it from the filename

I have a list of multiple dataframes dfs.
The dataframes come from files that have dates in their names, e.g. FilenameYYYYMMDD.xlsx
files = [str(file) for file in Path('/dir').iterdir()]
dfs = [pd.read_excel(file, header=1) for file in files]
I can extract the date from the file names:
date_extract = re.search('[0-9]{8}',files[0...20])
date = datetime.datetime.strptime(date_extract[0...20], '%Y%m%d').date()
But how can I assign to each df its respective date (by adding a column called 'Date')?

If you're using pathlib, you can use a dictionary to hold your dataframes and a quick regex to extract the date. When the dataframes are concatenated, the date becomes the outer level of the index.
import re
from pathlib import Path
import pandas as pd
dfs = {
    re.search(r'(\d{4}.*)\.xlsx', f.name).group(1): pd.read_excel(f, header=1)
    for f in Path('/dir').glob("*.xlsx")
}
print(pd.concat(dfs))
Unnamed: 0 e f c d
20200610 0 0 0.0 0.0 NaN NaN
1 1 0.0 0.0 NaN NaN
2 2 0.0 0.0 NaN NaN
3 3 0.0 0.0 NaN NaN
4 4 1.0 0.0 NaN NaN
5 5 0.0 1.0 NaN NaN
6 6 0.0 0.0 NaN NaN
7 7 0.0 0.0 NaN NaN
8 8 0.0 0.0 NaN NaN
9 9 0.0 0.0 NaN NaN
10 10 0.0 0.0 NaN NaN
11 11 0.0 0.0 NaN NaN
12 12 0.0 0.0 NaN NaN
13 13 0.0 0.0 NaN NaN
14 14 0.0 0.0 NaN NaN
15 15 0.0 0.0 NaN NaN
16 16 0.0 0.0 NaN NaN
17 17 0.0 0.0 NaN NaN
18 18 0.0 0.0 NaN NaN
19 19 0.0 0.0 NaN NaN
20 20 0.0 0.0 NaN NaN
21 21 0.0 0.0 NaN NaN
22 22 0.0 0.0 NaN NaN
23 23 0.0 0.0 NaN NaN
24 24 0.0 0.0 NaN NaN
25 25 0.0 0.0 NaN NaN
20201012 0 0 NaN NaN 0.0 0.0
1 1 NaN NaN 0.0 0.0
2 2 NaN NaN 1.0 0.0
3 3 NaN NaN 0.0 1.0

Making columns from a range of lines in pandas

I have a csv file like the below:
A, B
1,2
3,4
5,6
C,D
7,8
9,10
11,12
E,F
13,14
15,16
As you can see, when I import this data using pd.read_csv, pandas reads the whole thing as two columns (A, B) and a long run of rows. That is correct given the shape of the file. However, I want to create separate columns (A, B, C, D, ...). Fortunately, there is a blank line at the end of each block of "columns", and I think this could be used to separate the blocks in some way. However, I don't know how to proceed.
The data:
https://raw.githubusercontent.com/AlessandroMDO/Dinamica_de_Voo/master/data.csv

It's normal behavior of pandas.read_csv, but usually data is not stored in csv files this way.
You can read the file as text, strip the surrounding whitespace, and split it into parts on the empty lines first. Then read each part using pandas.read_csv and StringIO, and concatenate the parts side by side using pandas.concat.
import pandas as pd
from io import StringIO
with open('test.csv', 'r') as f:
    parts = f.read().strip().split('\n\n')
df = pd.concat([pd.read_csv(StringIO(part)) for part in parts], axis=1)
I have tried this with your csv:
Alpha Cd Alpha CL Alpha ... Cnp Alpha Cnr Alpha Clr
0 -14.0 0.08941 -14.0 -0.19430 -14.0 ... 0.0 -14.0 0.0 -14.0 0.0
1 -12.0 0.07646 -12.0 -0.17150 -12.0 ... 0.0 -12.0 0.0 -12.0 0.0
2 -10.0 0.06509 -10.0 -0.14710 -10.0 ... 0.0 -10.0 0.0 -10.0 0.0
3 -8.0 0.05545 -8.0 -0.12150 -8.0 ... 0.0 -8.0 0.0 -8.0 0.0
4 -6.0 0.04766 -6.0 -0.09479 -6.0 ... 0.0 -6.0 0.0 -6.0 0.0
5 -4.0 0.04181 -4.0 -0.06722 -4.0 ... 0.0 -4.0 0.0 -4.0 0.0
6 -2.0 0.03797 -2.0 -0.03905 -2.0 ... 0.0 -2.0 0.0 -2.0 0.0
7 0.0 0.03620 0.0 -0.01054 0.0 ... 0.0 0.0 0.0 0.0 0.0
8 2.0 0.03651 2.0 0.01806 2.0 ... 0.0 2.0 0.0 2.0 0.0
9 4.0 0.03960 4.0 0.05879 4.0 ... 0.0 4.0 0.0 4.0 0.0
10 6.0 0.04814 6.0 0.12650 6.0 ... 0.0 6.0 0.0 6.0 0.0
11 8.0 0.06494 8.0 0.22050 8.0 ... 0.0 8.0 0.0 8.0 0.0
12 10.0 0.09268 10.0 0.33960 10.0 ... 0.0 10.0 0.0 10.0 0.0
13 12.0 0.13390 12.0 0.48240 12.0 ... 0.0 12.0 0.0 12.0 0.0
14 14.0 0.19110 14.0 0.64710 14.0 ... 0.0 14.0 0.0 14.0 0.0
[15 rows x 36 columns]
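The same split-and-concatenate steps can be tried on the small A/B, C/D, E/F sample without touching a file. This is a sketch only: the function name read_blocks is hypothetical, and splitting with a regular expression (so that separator lines containing stray spaces also count as empty) is an assumption that goes beyond the plain '\n\n' split above.

```python
import re
from io import StringIO
import pandas as pd

def read_blocks(text):
    """Parse blank-line-separated CSV blocks and place them side by side."""
    # Split wherever an empty (or whitespace-only) line separates two blocks.
    parts = [p for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]
    frames = [pd.read_csv(StringIO(p), skipinitialspace=True) for p in parts]
    return pd.concat(frames, axis=1)

sample = "A, B\n1,2\n3,4\n5,6\n\nC,D\n7,8\n9,10\n11,12\n\nE,F\n13,14\n15,16\n"
df = read_blocks(sample)
```

Blocks with fewer rows than the longest one are padded with NaN, so here the last row of E and F is missing.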

Counting consonants and vowels in a split string

I read in a .csv file. I have the following data frame that counts vowels and consonants in a string in the column Description. This works great, but my problem is I want to split Description into 8 columns and count the consonants and vowels for each column. The second part of my code allows for me to split Description into 8 columns. How can I count the vowels and consonants on all 8 columns the Description is split into?
import pandas as pd
import re
def anti_vowel(s):
    result = re.sub(r'[AEIOU]', '', s, flags=re.IGNORECASE)
    return result
data = pd.read_csv('http://core.secure.ehc.com/src/util/detail-price-list/TristarDivision_SummitMedicalCenter_CM.csv')
data.dropna(inplace = True)
data['Vowels'] = data['Description'].str.count(r'[aeiou]', flags=re.I)
data['Consonant'] = data['Description'].str.count(r'[bcdfghjklmnpqrstvwxzy]', flags=re.I)
print (data)
This is the code I'm using to split the column Description into 8 columns.
import pandas as pd
data = pd.read_csv('http://core.secure.ehc.com/src/util/detail-price-list/TristarDivision_SummitMedicalCenter_CM.csv')
data.dropna(inplace = True)
data = data["Description"].str.split(" ", n = 8, expand = True)
print (data)
Now how can I put it all together?
To count the consonants and vowels in each of the 8 columns, I know I can use the following, replacing the 0 with each of 0-7 in turn:
testconsonant = data[0].str.count(r'[bcdfghjklmnpqrstvwxzy]', flags=re.I)
testvowel = data[0].str.count(r'[aeiou]', flags=re.I)
Desired output would be:
Description [0] vowel count consonant count Description [1] vowel count consonant count Description [2] vowel count consonant count Description [3] vowel count consonant count Description [4] vowel count consonant count all the way to description [7]

Stack, then unstack:
stacked = data.stack()
pd.concat({
    'Vowels': stacked.str.count('[aeiou]', flags=re.I),
    'Consonant': stacked.str.count('[bcdfghjklmnpqrstvwxzy]', flags=re.I)
}, axis=1).unstack()
Consonant Vowels
0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
0 3.0 5.0 5.0 1.0 2.0 NaN NaN NaN NaN 1.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
1 8.0 5.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
2 8.0 5.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
3 8.0 5.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
4 3.0 5.0 3.0 1.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 NaN
5 3.0 5.0 3.0 1.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 NaN
6 3.0 4.0 0.0 1.0 0.0 0.0 0.0 NaN NaN 3.0 1.0 0.0 0.0 0.0 0.0 0.0 NaN NaN
7 3.0 3.0 0.0 1.0 0.0 0.0 0.0 NaN NaN 3.0 1.0 0.0 1.0 0.0 0.0 0.0 NaN NaN
8 3.0 3.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 3.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
9 3.0 3.0 0.0 1.0 0.0 0.0 0.0 NaN NaN 3.0 1.0 0.0 1.0 0.0 0.0 0.0 NaN NaN
10 3.0 3.0 0.0 1.0 0.0 0.0 0.0 0.0 NaN 3.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
11 3.0 3.0 0.0 2.0 2.0 NaN NaN NaN NaN 3.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
12 3.0 3.0 0.0 1.0 0.0 0.0 0.0 0.0 NaN 3.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN
13 3.0 3.0 0.0 2.0 2.0 NaN NaN NaN NaN 3.0 1.0 0.0 0.0 0.0 NaN NaN NaN NaN
14 3.0 5.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
15 3.0 3.0 0.0 3.0 1.0 NaN NaN NaN NaN 3.0 0.0 0.0 0.0 1.0 NaN NaN NaN NaN
If you want to combine this with the data dataframe, you can do:
stacked = data.stack()
pd.concat({
    # use the stacked words so all three pieces share the same index
    'Data': stacked,
    'Vowels': stacked.str.count('[aeiou]', flags=re.I),
    'Consonant': stacked.str.count('[bcdfghjklmnpqrstvwxzy]', flags=re.I)
}, axis=1).unstack()
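To get closer to the desired layout (each word followed by its own vowel and consonant counts), the unstacked columns can be reordered by word position. The following is a sketch under assumptions: the function name count_letters_per_word and the column labels are hypothetical. Note also that str.split with n = 8 can produce up to 9 columns, so the sketch uses n = 7 to cap the result at 8.

```python
import re
import pandas as pd

def count_letters_per_word(descriptions, n_words=8):
    """Split each description into words and count vowels/consonants per word."""
    # n splits give at most n + 1 pieces, hence n_words - 1.
    words = descriptions.str.split(" ", n=n_words - 1, expand=True)
    stacked = words.stack()
    counts = pd.concat({
        "word": stacked,
        "vowels": stacked.str.count(r"[aeiou]", flags=re.I),
        "consonants": stacked.str.count(r"[bcdfghjklmnpqrstvwxzy]", flags=re.I),
    }, axis=1)
    wide = counts.unstack()  # columns become (measure, word position)
    # Interleave the columns: word 0, its counts, word 1, its counts, ...
    order = [(measure, pos) for pos in words.columns
             for measure in ("word", "vowels", "consonants")]
    return wide.reindex(columns=pd.MultiIndex.from_tuples(order))
```

Calling count_letters_per_word(data['Description']) on the original, unsplit column would then return one row per description, with NaN wherever a description has fewer words.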
