Struggling to clean a column in pandas - Python

I need help cleaning a column of my data. Each cell in the column contains a mix of dates, times, letters, floating-point numbers and other kinds of data, and the column's dtype is 'object'.
What I want to do is remove all the dates, replacing them with empty cells, and keep only the times across the entire column. Then I want to insert the average time into the empty cells.
I'm using PyCharm, and pandas to clean the column.

I would imagine you can achieve this with something along the lines of the code below. For the time format, it seems that for your data column simply checking whether the string contains two colons is enough; you can also substitute something more robust:
def string_splitter(x):
    tokens = x.split()
    kept = []
    for token in tokens:
        if token.count(":") == 2:  # looks like a time such as HH:MM:SS; swap in a more robust pattern if needed
            kept.append(token)
        else:
            kept.append("")  # your string for indicating empty space
    return " ".join(kept)

df['column_name'] = df['column_name'].apply(string_splitter)

Note that .apply returns a new Series, so you need to assign it back to the column.
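For the second step the question asks about (filling the blanks with the average time), one hedged sketch, assuming the kept values are HH:MM:SS strings and the removed values are empty strings, is to convert to timedeltas, take the mean, and substitute it back:

```python
import numpy as np
import pandas as pd

s = pd.Series(["01:30:00", "", "02:30:00"])  # example column after cleaning

# empty strings can't be parsed as times, so mark them missing first
td = pd.to_timedelta(s.replace("", np.nan))
mean_time = str(td.mean()).split()[-1]  # e.g. "0 days 02:00:00" -> "02:00:00"

s = s.replace("", mean_time)
```

The mean may carry fractional seconds for real data, in which case you would round the Timedelta before formatting.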

Related

Delete categorical data from a column and leave only the numerical data? I want to delete the word "SUBM-" from "SUBM-1245" in Python

I have a column in Python whose data type is object, but I want to change it to integer.
The records in that column look like:
SUBM - 4562
SUBM - 4563
and all the information in that column is like that. I want to delete the "SUBM -" prefix from the records, similar to Excel's "replace with", and fill anything left empty with 0 so that only numerical data remains. Can anyone suggest a way to do that?
If you are working with a column in Python, I assume you are using pandas to parse your table. In that case, you can simply use
df["mycolumn"] = df["mycolumn"].str.replace(r"SUBM\s*-\s*", "", regex=True)  # regex allows for the spaces around the dash in your records
However, you then still have a column of type object. A safe way to convert it to numeric is the following, where you basically throw away everything that can't be converted to a number:
df["mycolumn"] = pd.to_numeric(df["mycolumn"], errors="coerce", downcast="integer")
If you specifically need integer values (a float not being acceptable to you in the case of NaN), you can afterwards fill empty cells with 0 and convert the column to integer:
df["mycolumn"] = df["mycolumn"].fillna(0).astype(int)  # if you specifically need integers
An alternative is to extract the numeric values using a regular expression. This automatically returns NaN when the expression does not match (i.e. also when "SUBM -" is not present in the cell):
df["mycolumn"] = df["mycolumn"].str.extract(r"SUBM\s*-\s*([0-9]+)")
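Putting those steps together on a hypothetical sample that mirrors the question's records (note the spaces around the dash):

```python
import pandas as pd

# made-up sample matching the question's "SUBM - 4562" style, plus one bad row
df = pd.DataFrame({"mycolumn": ["SUBM - 4562", "SUBM - 4563", "broken row"]})

df["mycolumn"] = df["mycolumn"].str.replace(r"SUBM\s*-\s*", "", regex=True)
df["mycolumn"] = pd.to_numeric(df["mycolumn"], errors="coerce")  # "broken row" -> NaN
df["mycolumn"] = df["mycolumn"].fillna(0).astype(int)
```

The bad row ends up as 0, as the question requested.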

Manipulate string in python (replace string with part of the string itself)

So I am trying to transform my data into a form I can work with. I have a column called "season/teams" that looks something like "1989-90 Bos".
I would like to transform it into a string like "1990" in Python using a pandas DataFrame. I read some tutorials about pd.replace() but can't seem to find a use for my scenario. How can I solve this? Thanks for the help.
FYI, I have 16k lines of data.
To change that field from "1989-90 BOS" to "1990" you could do the following:
df['Yr/Team'] = df['Yr/Team'].str[:2] + df['Yr/Team'].str[5:7]
If the structure of your data will always be the same, this is an easy way to do it.
If the data in your Yr/Team column has a standard format you can extract the values you need based on their position.
import pandas as pd
df = pd.DataFrame({'Yr/Team': ['1990-91 team'], 'data': [1]})
df['year'] = df['Yr/Team'].str[0:2] + df['Yr/Team'].str[5:7]
print(df)
        Yr/Team  data  year
0  1990-91 team     1  1991
You can use pd.Series.str.extract to extract a pattern from a column of strings. For example, if you want to extract the first year, second year and team into three different columns, you can use this:
df["Yr/Team"].str.extract(r"(?P<start_year>\d+)-(?P<end_year>\d+) (?P<team>\w+)")
Note the use of named groups, which automatically name the resulting columns.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html
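Building on str.extract, a small sketch (with made-up rows) that reconstructs the full ending year the question asks for, by borrowing the century from the starting year:

```python
import pandas as pd

df = pd.DataFrame({"Yr/Team": ["1989-90 BOS", "1990-91 CHI"]})
parts = df["Yr/Team"].str.extract(r"(?P<start_year>\d+)-(?P<end_year>\d+) (?P<team>\w+)")

# rebuild the full ending year from the century of the starting year
df["year"] = parts["start_year"].str[:2] + parts["end_year"]
```

Note this simple century-splice would mishandle a rollover season like "1999-00"; that case would need extra logic.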

Pandas series string manipulation using Python - 1st two chars flip and append to end of string

I have a column (Series) of values in which I'm trying to move characters around, and I'm going nowhere fast! I found some snippets of code to get me where I am, but I need a closer. I'm working with one column of dtype str. Each value is a string of digits, and some are duplicated. The duplicate numbers have an "n-" prefix in front of the number, where n changes based on how many duplicates are listed. Some may have two duplicates, some eight. Either way, the order should stay the same.
I need to go down through each cell, pluck the "n-" from the left of the string, swap the two characters around, and append them to the end of the string. No number sorting is needed. The column is 4-5k lines long and looks like the example below all the way down, with no other special characters or letters. Also, the duplicate rows will always be adjacent, wherever they appear in the column.
My problem: the code below actually works. It steps through each string, tests for a dash, then rearranges the digits the way I need. However, I have not learned how to get the changes back into my DataFrame from a Python for-loop. I was really hoping somebody had a nifty lambda fix or a pandas apply function to address the whole column at once, but I haven't found anything I can tweak to work. I know there is a better way than slowly traversing a Series, and I would like to learn it.
Two possible fixes needed:
Is there a way to have the code below replace the old df.column1 value with the newly created string? If so, please let me know.
I've been trying to read up on df.apply using the split feature so I can address the whole column at once; I understand it's the smarter play. Is there a couple of lines of code that would do what I need?
Please let me know what you think. I appreciate the help. Thank you for taking the time.
import re
import pandas as pd

df = pd.read_excel(r"E:\Book2.xlsx")  # raw string so the backslash isn't treated as an escape
df.column1 = df.column1.astype(str)
for r in df['column1']:  # walk the column
    if not re.search('-', r):  # skip strings without a '-'
        continue
    a = []  # holds the '-'
    b = []  # holds the digits
    for c in r:
        if c == '-':  # if '-' then hold in a
            a.append(c)
        else:  # if a digit then hold in b
            b.append(c)
    t = ''.join(b + a)  # puts '-' at the end of the string
    z = t[1:] + t[:1]  # moves the first character to the end of the string
    r = z  # only rebinds the loop variable; the DataFrame itself is never updated
print(df)
Starting File:    Ending File:
column1           column1
41887             41887
1-41845           41845-1
2-41845           41845-2
40905             40905
1-41323           41323-1
2-41323           41323-2
3-41323           41323-3
41778             41778
You can use Series.str.replace():
If we recreate your example with a file containing all your values and retain column1 as the column name:
import pandas as pd
df=pd.read_csv('file.txt')
df.columns=['column1']
df['column1'] = df['column1'].str.replace(r'^(\d+)-(\d+)$', r'\2-\1', regex=True)  # regex=True is required in recent pandas
print(df)
This gives the desired output: the old column is replaced with the new one, all in one step and without loops.
#in
41887
1-41845
2-41845
40905
1-41323
2-41323
3-41323
41778
#out
column1
0 41887
1 41845-1
2 41845-2
3 40905
4 41323-1
5 41323-2
6 41323-3
7 41778
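The same replacement can be run on an inline sample, without reading from a file, as a quick self-contained check:

```python
import pandas as pd

# inline stand-in for the question's column1
df = pd.DataFrame({"column1": ["41887", "1-41845", "2-41845", "41778"]})

# move a leading "n-" prefix to the end as "-n"; rows without a dash are untouched
df["column1"] = df["column1"].str.replace(r"^(\d+)-(\d+)$", r"\2-\1", regex=True)
```

The pattern also handles a multi-digit prefix (e.g. "12-41845"), should the duplicate count ever exceed nine.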

Search, identify duplicate data in an Excel file and reformat it using pandas

I have to process an Excel file using pandas. The Excel file has three columns, as shown in the sample. The 'LicNo' column is not unique.
My Task-1:
I have to group it by 'LicNo' to bring all 'Licensees' into the same row. The grouped table will contain 'LicNo' and a bunch of 'Licensee' columns only (all in one row), ignoring the middle column of the original Excel file.
My Task-2:
Now, to identify the duplicates, I have to check whether the 'Licensee' in the first column is repeated in the subsequent columns (i.e. along axis 1). The search is based on the first word of the text in the first column; if that word is repeated in the next columns, the row is declared a 'possible duplicate'.
Here is the code that I wrote:
import pandas as pd
import numpy as np
from itertools import chain

df = pd.read_excel("compare_usr.xlsx", dtype={'LicNo': int, 'ScheduleNo': str, 'Licensee': str})
# df.isna().any()
df = df.dropna(axis='index')
df['Licensee'] = df.Licensee.str.replace(r'\n', '', regex=True)
# make the strings uniform, as certain fields carry an "M/s " prefix
df["Licensee"] = df.Licensee.str.replace("M/s ", "")
# the character ';' is inserted to segregate duplicate LicNo entries while transposing into rows
df.loc[:, "Licensee"] = df["Licensee"].astype(str) + ";"

def func(x):
    ch = chain.from_iterable(y.split(';') for y in x.tolist())
    return '\n'.join(ch)

dfNew = df.groupby('LicNo')['Licensee'].apply(func).str.split("\n+", expand=True)  # .to_excel("test2.xls")
The following two lines, marked (1) and (2), do not produce the desired output, so they are commented out. The function fun(row) further below, however, produces the intended result.
#dfNew['Status']=np.where((dfNew[x+1].str.contains(dfNew[0].str,na=False) for x in col),"match","unmatch") # (1)
#dfNew['Status']=np.where((dfNew[x+1].str.apply(lambda y: dfNew[0].str in y) for x in range(6)),"match","unmatch") #(2)
# in the original Excel file there are about 4000 rows and it produced 18 columns of duplicates.
def fun(row):
    cols = [i + 1 for i in range(17)]
    for i in cols:
        if row[i] is None:
            continue
        # extract the first word from row[0] and look for it in the other columns
        if row[0].split()[0].lower() in row[i].lower():
            return True
    return False

dfNew['Status'] = np.where(dfNew.apply(fun, axis=1), dfNew.index, "May be duplicate or only in one file")
Now, my question: I would like to replace the function fun(row) (which works and produces the desired result) with either of the two lines marked (1) and (2), to produce dfNew['Status']. I cannot see what is going wrong in those two lines, as both produce the wrong result.
I am a beginner Python programmer and owe Stack Overflow for the code I copied from answers on other topics. Would you be able to help me?
Thanks.
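For what it's worth, one likely reason lines (1) and (2) misbehave is that each hands np.where a generator expression, which is always truthy, so the condition degenerates to a single scalar (and the .str accessor has no .apply method, so (2) would also error). A column-at-a-time sketch of the same first-word check as fun, on hypothetical two-column sample data, could look like this:

```python
import pandas as pd

# hypothetical layout mirroring dfNew: column 0 plus one comparison column
dfNew = pd.DataFrame({0: ["Acme Corp", "Beta Ltd"],
                      1: ["ACME Industries", "Gamma Inc"]})

first_word = dfNew[0].str.split().str[0].str.lower()
match = pd.Series(False, index=dfNew.index)
for c in dfNew.columns[1:]:
    cells = dfNew[c].fillna("").str.lower()
    # True where the first word of column 0 appears in this column's cell
    match |= pd.Series([w in cell for w, cell in zip(first_word, cells)],
                       index=dfNew.index)
```

This keeps the per-row Python work inside list comprehensions but loops over columns only once each, instead of applying fun row by row.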

Extract a string from a CSV cell containing special characters in Python

I'm writing a Python program to extract specific values from each cell in a .CSV file column and then make all the extracted values new columns.
Sample column cell (this is actually a small part; the real cell contains much more data):
AudioStreams":[{"JitterInterArrival":10,"JitterInterArrivalMax":24,"PacketLossRate":0.01353227,"PacketLossRateMax":0.09027778,"BurstDensity":null,"BurstDuration":null,"BurstGapDensity":null,"BurstGapDuration":null,"BandwidthEst":25245423,"RoundTrip":520,"RoundTripMax":11099,"PacketUtilization":2843,"RatioConcealedSamplesAvg":0.02746676,"ConcealedRatioMax":0.01598402,"PayloadDescription":"SIREN","AudioSampleRate":16000,"AudioFECUsed":true,"SendListenMOS":null,"OverallAvgNetworkMOS":3.487248,"DegradationAvg":0.2727518,"DegradationMax":0.2727518,"NetworkJitterAvg":253.0633,"NetworkJitterMax":1149.659,"JitterBufferSizeAvg":220,"JitterBufferSizeMax":1211,"PossibleDataMissing":false,"StreamDirection":"FROM-to-
One value I'm trying to extract is the number 10 between "JitterInterArrival": and ,"JitterInterArrivalMax". But since each cell contains relatively long strings with special characters around the value (such as double quotes), opener=re.escape(r"***") and closer=re.escape(r"***") wouldn't work.
Does anyone know a better solution? Thanks a lot!
IIUC, you have a JSON string and wish to get values from its attributes. So, given
s = '''
{"AudioStreams":[{"JitterInterArrival":10,"JitterInterArrivalMax":24,"PacketLossRate":0.01353227,"PacketLossRateMax":0.09027778,"BurstDensity":null,
"BurstDuration":null,"BurstGapDensity":null,"BurstGapDuration":null,"BandwidthEst":25245423,"RoundTrip":520,"RoundTripMax":11099,"PacketUtilization":2843,"RatioConcealedSamplesAvg":0.02746676,"ConcealedRatioMax":0.01598402,"PayloadDescription":"SIREN","AudioSampleRate":16000,"AudioFECUsed":true,"SendListenMOS":null,"OverallAvgNetworkMOS":3.487248,"DegradationAvg":0.2727518,
"DegradationMax":0.2727518,"NetworkJitterAvg":253.0633,
"NetworkJitterMax":1149.659,"JitterBufferSizeAvg":220,"JitterBufferSizeMax":1211,
"PossibleDataMissing":false}]}
'''
You can do

import json

data = json.loads(s)
ji = data['AudioStreams'][0]['JitterInterArrival']  # 10
In a DataFrame scenario, if you have a column col of strings such as the above, e.g.
df = pd.DataFrame({"col": [s]})
You can use transform, passing json.loads as the argument:
df.col.transform(json.loads)
to get a Series of dictionaries. Then, you can manipulate these dicts or just access the data as done above.
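If the goal is to turn all the extracted values into new columns at once, one hedged sketch uses pd.json_normalize on the parsed column (shown here with a trimmed-down stand-in for the question's much longer cell):

```python
import json
import pandas as pd

# trimmed-down stand-in for the question's JSON cell
s = '{"AudioStreams":[{"JitterInterArrival":10,"JitterInterArrivalMax":24}]}'
df = pd.DataFrame({"col": [s]})

parsed = df["col"].apply(json.loads)
# flatten every AudioStreams entry into its own row of columns
streams = pd.json_normalize(parsed.tolist(), record_path="AudioStreams")
```

streams then has one column per attribute (JitterInterArrival, JitterInterArrivalMax, ...), which can be joined back onto the original frame.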
