Extract a string from a CSV cell containing special characters in Python

I'm writing a Python program to extract specific values from each cell in a .CSV file column and then make all the extracted values new columns.
Sample column cell (this is actually a small part; the real cell contains much more data):
AudioStreams":[{"JitterInterArrival":10,"JitterInterArrivalMax":24,"PacketLossRate":0.01353227,"PacketLossRateMax":0.09027778,"BurstDensity":null,"BurstDuration":null,"BurstGapDensity":null,"BurstGapDuration":null,"BandwidthEst":25245423,"RoundTrip":520,"RoundTripMax":11099,"PacketUtilization":2843,"RatioConcealedSamplesAvg":0.02746676,"ConcealedRatioMax":0.01598402,"PayloadDescription":"SIREN","AudioSampleRate":16000,"AudioFECUsed":true,"SendListenMOS":null,"OverallAvgNetworkMOS":3.487248,"DegradationAvg":0.2727518,"DegradationMax":0.2727518,"NetworkJitterAvg":253.0633,"NetworkJitterMax":1149.659,"JitterBufferSizeAvg":220,"JitterBufferSizeMax":1211,"PossibleDataMissing":false,"StreamDirection":"FROM-to-
One value I'm trying to extract is the number 10 between "JitterInterArrival": and ,"JitterInterArrivalMax". But since each cell contains relatively long strings with special characters around the target (such as ""), opener=re.escape(r"***") and closer=re.escape(r"***") wouldn't work.
Does anyone know a better solution? Thanks a lot!

IIUC, you have a json string and wish to get values from its attributes. So, given
s = '''
{"AudioStreams":[{"JitterInterArrival":10,"JitterInterArrivalMax":24,"PacketLossRate":0.01353227,"PacketLossRateMax":0.09027778,"BurstDensity":null,
"BurstDuration":null,"BurstGapDensity":null,"BurstGapDuration":null,"BandwidthEst":25245423,"RoundTrip":520,"RoundTripMax":11099,"PacketUtilization":2843,"RatioConcealedSamplesAvg":0.02746676,"ConcealedRatioMax":0.01598402,"PayloadDescription":"SIREN","AudioSampleRate":16000,"AudioFECUsed":true,"SendListenMOS":null,"OverallAvgNetworkMOS":3.487248,"DegradationAvg":0.2727518,
"DegradationMax":0.2727518,"NetworkJitterAvg":253.0633,
"NetworkJitterMax":1149.659,"JitterBufferSizeAvg":220,"JitterBufferSizeMax":1211,
"PossibleDataMissing":false}]}
'''
You can do
>>> import json
>>> data = json.loads(s)
>>> data['AudioStreams'][0]['JitterInterArrival']
10
In a data frame scenario, if you have a column col of strings such as the above, e.g.
df = pd.DataFrame({"col": [s]})
You can use transform, passing json.loads as the argument:
df.col.transform(json.loads)
to get a Series of dictionaries. Then, you can manipulate these dicts or just access the data as done above.
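Since the original goal was to turn the extracted values into new columns, here is a minimal sketch of one way to do that, assuming every cell parses to the same {"AudioStreams": [...]} shape and that your pandas version provides pd.json_normalize:
import json
import pandas as pd

df = pd.DataFrame({"col": [s]})                      # s is the JSON string shown above
parsed = df['col'].transform(json.loads)             # Series of dicts
# take the first audio stream from each row and expand its keys into columns
streams = pd.json_normalize(parsed.map(lambda d: d['AudioStreams'][0]).tolist())
df = df.join(streams)
print(df['JitterInterArrival'])   # 0    10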

Manipulate string in python (replace string with part of the string itself)

So I am trying to transform the data I have into a form I can work with. I have a column called "season/ teams" that looks something like "1989-90 Bos".
I would like to transform it into a string like "1990" in Python using a pandas dataframe. I read some tutorials about pd.replace() but can't seem to find a use for my scenario. How can I solve this? Thanks for the help.
FYI, I have 16k lines of data.
A snapshot of the data I am working with:
To change that field from "1989-90 BOS" to "1990" you could do the following:
df['Yr/Team'] = df['Yr/Team'].str[:2] + df['Yr/Team'].str[5:7]
If the structure of your data will always be the same, this is an easy way to do it.
If the data in your Yr/Team column has a standard format you can extract the values you need based on their position.
import pandas as pd
df = pd.DataFrame({'Yr/Team': ['1990-91 team'], 'data': [1]})
df['year'] = df['Yr/Team'].str[0:2] + df['Yr/Team'].str[5:7]
print(df)
        Yr/Team  data  year
0  1990-91 team     1  1991
You can use pd.Series.str.extract to extract a pattern from a column of strings. For example, if you want to extract the first year, second year and team into three different columns, you can use this:
df["Yr/Team"].str.extract(r"(?P<start_year>\d+)-(?P<end_year>\d+) (?P<team>\w+)")
Note the use of named groups to automatically name the columns.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html
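For instance, applied to a small frame mirroring the snapshot above (column name and sample values are assumptions), the extracted groups can be joined back on as new columns:
import pandas as pd

df = pd.DataFrame({'Yr/Team': ['1989-90 BOS', '1990-91 CHI']})
parts = df['Yr/Team'].str.extract(r'(?P<start_year>\d+)-(?P<end_year>\d+) (?P<team>\w+)')
df = df.join(parts)
print(df)
#        Yr/Team start_year end_year team
# 0  1989-90 BOS       1989       90  BOS
# 1  1990-91 CHI       1990       91  CHI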

How to extract multiple columns of data out of two different text files and format them correctly to be used in more code

I am trying to extract multiple columns of data from two different text files. I am going to loop through those columns of data with additional code. How do I extract the data and format it correctly so I can use it? There are probably 20 columns in one text file, and 15 columns in the other.
I have tried extracting the data using genfromtxt, but I get a weird format, and mapping it doesn't help. I also can't use the extracted data in any additional loops or functions.
This is the code I was trying to use:
data = np.genfromtxt("Basecol_Basic_New_1.txt", unpack=True)
J_i2 = data[0]
J_f2 = data[1]
kH2 = data[5:, :]
data = np.genfromtxt("Lamda_HeHCL.txt", unpack=True)
J_i1 = data[1]
J_f1 = data[2]
kHe = data[7:, :]
I also tried using this to format correctly, but it kept resulting in errors.
kHe = map(float, kHe)
kH2 = map(float, kH2)
kHe = np.array(kHe)
kH2 = np.array(kH2)
g = len(kH2)
However, once I have the columns of data, they are formatted differently than I am used to. They seem to be unusable.
I expect that the data will come out as multiple arrays [1,2,3], [4,5,6]. What I am currently getting is [[5.678e-8 ....] [7.893e-10 ...]].
It isn't in the right format and all my attempts to put it in the right format result in a size-1 error or similar.
From your code I'm assuming the data is separated by spaces. Then you can just read the file and convert the fields yourself instead of using np.genfromtxt.
Edited to map to float and take columns 5 to 10 inclusive (6 columns).
rows = []
with open("Basecol_Basic_New_1.txt", 'r') as data:
    for line in data:
        # columns 5-10 inclusive (0-based slice 4:10), converted to float
        rows.append(list(map(float, line.strip().split(' ')[4:10])))
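Alternatively, if you would rather stay with numpy, genfromtxt can be told which columns to keep via its usecols argument; a sketch, assuming whitespace-delimited numeric columns:
import numpy as np

# without unpack=True, genfromtxt returns one 2D array of shape (rows, cols)
data = np.genfromtxt("Basecol_Basic_New_1.txt")
J_i2 = data[:, 0]
J_f2 = data[:, 1]
kH2 = data[:, 4:10]   # columns 5-10 inclusive, already float64 and ready for loops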

Taking dictionary/list, and mapping to dataframe with x number of matching columns

I have a table that contains 2 columns.
Column 1 | Column 2
----------------------------
unique_number | '123 Main St. Suite 100 Chicago, IL'
I've been exploring address parsing using https://parserator.datamade.us/api-docs and ideally would like to parse the address, and put the results into new columns.
import usaddress
addr='123 Main St. Suite 100 Chicago, IL'
Two options for returning parsed results, and I plan on using whichever is easier to add to a dataframe:
usaddress.parse(addr): the parse method will split your address string into components and label each component (returns a list).
usaddress.tag(addr): the tag method will try to be a little smarter; it will merge consecutive components, strip commas, and return an address type (returns an ordered dict plus the address type).
There are 26 different tags available for an address using this parser.
However, not all addresses will contain all of these tags.
I need to grab the full address for each row, parse it, and map the parsed results to the matching columns in that same row.
What the tag data looks like using from_records (index isn't exactly ideal)
What the parse data looks like using from_records
I can't quite figure out the logic of the row-by-row calculations and mapping the results.
First, create a column of json responses from the parsing service:
df['json_response'] = df['address'].apply(usaddress.parse)
Next, combine all the jsons into a single json string
json_combined = json.dumps(list(df['json_response']))
Finally, parse the combined json string and normalize it into a dataframe:
df_parsed = pd.io.json.json_normalize(json.loads(json_combined))
Now you should have a structured dataframe with all required columns, which you can join (df.join) with your original dataframe to produce a single unified dataset.
Just a note: depending on the structure of the returned json, you may need to pass further arguments to the pandas.io.json.json_normalize function. The example on the linked page is a good starting point.
Super late in posting this solution, but I wanted to in case anyone else ran into the same problem.
Address csv file headings:
name, address
Imports:
import pandas as pd
import numpy as np
import json
import itertools
import usaddress
def address_func(address):
    try:
        return usaddress.tag(address)
    except Exception:
        return [{'AddressConversion': 'Error'}]
# import file
file = pd.read_excel('addresses.xlsx')
# apply function
file['tag_response'] = file['Full Address'].apply(address_func)
# copy values to new column
file['tags'] = file.apply(lambda row: row['tag_response'][0], axis=1)
# dump json
tags_combined = json.dumps(list(file['tags']))
# create dataframe of parsed info
df_parsed = pd.io.json.json_normalize(json.loads(tags_combined))
# merge dataframes on index
merged = file.join(df_parsed)
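As a quick sanity check, you can look at a few of the new columns next to the original address; the tag-label columns shown here (AddressNumber, StreetName, etc.) are only examples and will depend on what usaddress actually returned for your data:
# adjust the column list to whichever tag labels appear in df_parsed
print(merged[['Full Address', 'AddressNumber', 'StreetName', 'PlaceName', 'StateName']].head())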

Replacing multiple values within a pandas dataframe cell - python

My problem: I have a pandas dataframe, and one column in particular that I need to process contains values separated by ":". In some cases, some of those values between ":" can be value=value, and these can appear at the start, middle, or end of the string. The length of the string can differ in each cell as we iterate through the rows, e.g.
clickstream['events']
1:3:5:7=23
23=1:5:1:5:3
9:0:8:6=5:65:3:44:56
1:3:5:4
I have a file which contains the lookup values for these numbers, e.g.
event_no,description,event
1,xxxxxx,login
3,ffffff,logout
5,eeeeee,button_click
7,tttttt,interaction
23,ferfef,click1
output required:
clickstream['events']
login:logout:button_click:interaction=23
click1=1:button_click:login:button_click:logout
Is there a pythonic way of looking up these individual values and replacing them with the event column corresponding to the event_no row, as shown in the output? I have hundreds of events and am trying to work out a smart way of doing this. pd.merge would have done the trick if I had a single value, but I'm struggling to work out how to work across the values and ignore the "=value" part of the string.
Edited to ignore missing keys in the dict:
import pandas as pd
EventsDict = {1:'1:3:5:7',2:'23:45:1:5:3',39:'0:8:46:65:3:44:56',4:'1:3:5:4'}
clickstream = pd.Series(EventsDict)
#Keep this as a dictionary
EventsLookup = {1:'login',3:'logout',5:'button_click',7:'interaction'}
def EventLookup(x):
    list1 = [EventsLookup.get(int(item), 'Missing') for item in x.split(':')]
    return ":".join(list1)
clickstream.apply(EventLookup)
Since you are using a full DF and not just a series, use:
clickstream['events'].apply(EventLookup)
Output:
1 login:logout:button_click:interaction
2 Missing:Missing:login:button_click:logout
4 login:logout:button_click:Missing
39 Missing:Missing:Missing:Missing:logout:Missing...
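The original question also has tokens such as 7=23 whose =value suffix should be kept but ignored during lookup; a small sketch extending the same idea (hypothetical helper name, suffix left untouched):
import pandas as pd

EventsLookup = {1: 'login', 3: 'logout', 5: 'button_click', 7: 'interaction', 23: 'click1'}

def EventLookupKeepSuffix(x):
    out = []
    for item in x.split(':'):
        key, sep, rest = item.partition('=')      # '7=23' -> ('7', '=', '23')
        out.append(EventsLookup.get(int(key), 'Missing') + sep + rest)
    return ':'.join(out)

clickstream = pd.Series(['1:3:5:7=23', '23=1:5:1:5:3'])
print(clickstream.apply(EventLookupKeepSuffix))
# 0          login:logout:button_click:interaction=23
# 1    click1=1:button_click:login:button_click:logout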

Adding names and assigning data types to ASCII data

My professor uses IDL and sent me a file of ASCII data that I need to eventually be able to read and manipulate.
He used the following command to read the data:
readcol, 'sn-full.txt', format='A,X,X,X,X,X,F,A,F,A,X,X,X,X,X,X,X,X,X,A,X,X,X,X,A,X,X,X,X,F,X,I,X,F,F,X,X,F,X,F,F,F,F,F,F', $
sn, off1, dir1, off2, dir2, type, gal, dist, htype, d1, d2, pa, ai, b, berr, b0, k, kerr
Here's a picture of what the first two rows look like: http://i.imgur.com/hT7YIE3.png
Since I'm not going to be an astronomer, I am using Python but since I am new to it, I am having a hard time reading the data.
I know that his code assigns the data type A (string data) to column one, skips columns two through six by using an X, and then assigns the data type F (floating point) to column seven, etc. Then sn is assigned to the first column that isn't skipped, and so on.
I have been trying to replicate this by using either numpy.loadtxt("sn-full.txt") or ascii.read("sn-full.txt") but am not sure how to enter the dtype parameter. I know I could assign everything to be a certain data type, but how do I assign data types to individual columns?
Using astropy.io.ascii you should be able to read your file relatively easily:
from astropy.io import ascii
# Give names for ALL of the columns, as there is no easy way to skip columns
# for a table with no column header.
colnames = ('sn', 'gal_name1', 'gal_name2', 'year', 'month', 'day', ...)
table = ascii.read('sn_full.txt', Reader=ascii.NoHeader, names=colnames)
This gives you a table with all of the data columns. The fact that you have some columns you don't need is not a problem unless the table is mega-rows long. For the table you showed you don't need to specify the dtypes explicitly since io.ascii.read will figure them out correctly.
One slight catch here is that the table you've shown is really a fixed width table, meaning that all the columns line up vertically. Notice that the first row begins with 1998S NGC 3877. As long as every row has the same pattern with three space-delimited columns indicating the supernova name and the galaxy name as two words, then you're fine. But if any of the galaxy names are a single word then the parsing will fail. I suspect that if the IDL readcol is working then the corresponding io.ascii version should work out of the box. If not then io.ascii has a way of reading fixed width tables where you supply the column names and positions explicitly.
[EDIT]
Looks like in this case a fixed width reader is needed to inform the parser how to split the columns instead of just using space as delimiter. So basically you need to add two rows at the top of the table file, where the first one gives the column names and the second has dashes that indicate the span of each column:
  a        b        c
---- ------------ ------
 1.2 hello there       2
 2.4 worlds            3
It's also possible in astropy.io.ascii to just specify by code the start and stop position of each column if you don't have the option of modifying the input data file, e.g.:
>>> ascii.read(table, Reader=ascii.FixedWidthNoHeader,
...            names=('Name', 'Phone', 'TCP'),
...            col_starts=(0, 9, 18),
...            col_ends=(5, 17, 28))
http://casa.colorado.edu/~ginsbura/pyreadcol.htm looks like it does what you want. It emulates IDL's readcol function.
Another possibility is https://pypi.python.org/pypi/fortranformat. It looks like it might be more capable, and the data you're looking at is in fixed format; the format specifiers (X, A, etc.) are Fortran format specifiers.
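If you go the fortranformat route, the idea would be roughly as follows (a sketch only; the format string is made up and must be adapted to the real column widths and types):
import fortranformat as ff

# FortranRecordReader parses one fixed-format line per call
reader = ff.FortranRecordReader('(A10, F8.2, A12, F8.2)')
with open('sn-full.txt') as fh:
    rows = [reader.read(line) for line in fh]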
I would use Pandas for that particular purpose. The easiest way to do it is, assuming your columns are single-tab-separated:
import pandas as pd
import scipy as sp # Provides all functionality from numpy, too
mydata = pd.read_table(
    'filename.dat', sep='\t', header=None,
    names=['sn', 'gal_name1', 'gal_name2', 'year', 'month', ...],
    dtype={'sn': object, 'gal_name1': object, 'year': sp.int64, ...})
(Strings here fall into the general 'object' datatype).
Each column now has a name and can be accessed as mydata['colname'], and it can then be sliced like a regular numpy 1D array, e.g. mydata['colname'][20:50].
Pandas has built-in plotting calls to matplotlib, so you can quickly get an overview of a numerical type column by mydata['column'].plot(), or two different columns against each other as mydata.plot('col1', 'col2'). All normal plotting keywords can be passed.
If you want to plot the data in a normal matplotlib routine, you can just pass the columns to matplotlib, where they will be treated as ordinary Numpy vectors.
Each column can be accessed as an ordinary Numpy vector as mydata['colname'].values.
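For example (the column names are just the illustrative ones from the read_table call above):
import matplotlib.pyplot as plt

years = mydata['year'].values     # ordinary Numpy 1D array
print(years[20:50])               # slice it like any array

mydata['year'].plot()             # quick look via pandas' matplotlib wrapper
plt.show()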
EDIT
If your data are not uniformly separated, numpy's genfromtxt() function is better. You can then convert it to a Pandas DataFrame by
# DataFrame() only accepts a single dtype, so cast per column afterwards
mydf = pd.DataFrame(myarray, columns=['col1', 'col2', ...])
mydf = mydf.astype({'col1': sp.float64, 'col2': object, ...})
