I need to polish a CSV dataset, but it seems the changes are not applied to the dataset itself.
The CSV is in this format:
ID, TRACK_LINK
761607, https://mylink.com//track/...
This is my script:
import pandas as pd
df = pd.read_csv('./file.csv').fillna('')
# remove double // from TRACK_LINK
def polish_track_link(track_link):
    return track_link.replace("//track", "/track")

df['TRACK_LINK'].apply(polish_track_link)
print(df)
This prints something like:
...
761607 https://mylink.com//track/...
Note the //track is still there.
If I do print(df['TRACK_LINK'].apply(polish_track_link)) I get:
...
761607 https://mylink.com/track/...
So the function polish_track_link works but it's not applied to the dataset. Any idea why?
You need to assign the result back:
df['TRACK_LINK'] = df['TRACK_LINK'].apply(polish_track_link)
Better still, use the pandas string methods str.replace, or replace with regex=True, to replace substrings:
df['TRACK_LINK'] = df['TRACK_LINK'].str.replace("//track", "/track")
Or:
df['TRACK_LINK'] = df['TRACK_LINK'].replace("//track", "/track", regex=True)
print(df)
ID TRACK_LINK
0 761607 https://mylink.com/track/
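For context: apply never modifies the frame in place; it returns a new Series that is discarded unless you assign it. A minimal sketch (with a made-up one-row frame) showing the difference:
import pandas as pd

df = pd.DataFrame({'ID': [761607], 'TRACK_LINK': ['https://mylink.com//track/x']})
# returns a fixed copy; df itself is untouched
df['TRACK_LINK'].apply(lambda s: s.replace('//track', '/track'))
# assigning the result back is what actually updates the frame
df['TRACK_LINK'] = df['TRACK_LINK'].str.replace('//track', '/track', regex=False)
print(df)  # 0  761607  https://mylink.com/track/x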
Related
Hi, I have a column with paths like this:
path_column = ['C:/Users/Desktop/sample\\1994-QTR1.tsv','C:/Users/Desktop/sample\\1995-QTR1.tsv']
I need to split and get just the file name.
Expected output:
[1994-QTR1,1995-QTR1]
Thanks
Use str.extract:
df['new'] = df['path'].str.extract(r'\\([^\\]*)\.\w+$', expand=False)
The equivalent with rsplit would be much less efficient:
df['new'] = df['path'].str.rsplit('\\', n=1).str[-1].str.rsplit('.', n=1).str[0]
Output:
path new
0 C:/Users/Desktop/sample\1994-QTR1.tsv 1994-QTR1
1 C:/Users/Desktop/sample\1995-QTR1.tsv 1995-QTR1
Similar to the above, but you don't need to declare the separator yourself. (Caveat: os.path.sep is the separator of the OS the code runs on, '/' on Linux and macOS, so splitting a Windows-style path this way is only reliable on Windows.)
import os
path = "C:/Users/Desktop/sample\\1994-QTR1.tsv"
name = path.split(os.path.sep)[-1]
print(name)
Alternatively, split on the backslash and strip the extension yourself (or use a regex to match exactly what you want):
path.split("\\")[-1].split(".")[0]
Output:
'1994-QTR1'
Edit
NOTE: If you need the results in a list, you can append them to a new list from a loop:
new_col = []
for i in path_column:
    new_col.append(i.split("\\")[-1].split(".")[0])
print(new_col)
Output:
['1994-QTR1', '1995-QTR1']
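The same thing fits in a list comprehension, if you prefer:
new_col = [p.split("\\")[-1].split(".")[0] for p in path_column]
print(new_col)  # ['1994-QTR1', '1995-QTR1']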
You might harness pathlib for this task in the following way:
import pathlib
import pandas as pd
def get_stem(path):
    return pathlib.PureWindowsPath(path).stem
df = pd.DataFrame({'paths':['C:/Users/Desktop/sample\\1994-QTR1.tsv','C:/Users/Desktop/sample\\1994-QTR2.tsv','C:/Users/Desktop/sample\\1994-QTR3.tsv']})
df['names'] = df.paths.apply(get_stem)
print(df)
which gives:
paths names
0 C:/Users/Desktop/sample\1994-QTR1.tsv 1994-QTR1
1 C:/Users/Desktop/sample\1994-QTR2.tsv 1994-QTR2
2 C:/Users/Desktop/sample\1994-QTR3.tsv 1994-QTR3
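Note that PureWindowsPath treats both forward and backward slashes as separators, which is why it copes with these mixed-separator paths. A quick check:
from pathlib import PureWindowsPath
print(PureWindowsPath('C:/Users/Desktop/sample\\1994-QTR1.tsv').stem)  # 1994-QTR1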
I am trying to split a CSV file and write one output file per date. The file name must be the date, without the time, so I need to split Order_Date, which is a timestamp holding both date and time. How can I group by part of a value in pandas?
Here is my code:
import os
import pandas as pd

df = pd.read_csv('test.csv', delimiter='|')
for i, x in df.groupby('Order_Date'):
    p = os.path.join(r'~/Desktop/', "data_{}.csv".format(i.lower()))
    x.to_csv(p, sep='|', index=False)
Now I can get this:
data_2019-07-23 00:06:00.csv
data_2019-07-23 00:06:50.csv
data_2019-07-23 00:06:55.csv
data_2019-07-28 12:31:00.csv
Example test.csv data:
Channel|Store_ID|Store_Code|Store_Type|Order_ID|Order_Date|Member_ID|Member_Tier|Coupon_ID|Order_Total|Material_No|Material_Name|Size|Quantity|Unit_Price|Line_Total|Discount_Amount
ECOM|ECOM|ECOM|ECOM|A190700|2019-07-23 00:06:00||||1064.00|7564|Full Zip|750|1.00|399.00|168.00|231.00
ECOM|ECOM|ECOM|ECOM|A190700|2019-07-23 00:06:00||||1064.00|1361|COOL TEE|200|1.00|199.00|84.00|115.00
ECOM|ECOM|ECOM|ECOM|A190700|2019-07-23 00:06:00||||1064.00|7699|PANT|690|1.00|499.00|210.00|289.00
ECOM|ECOM|ECOM|ECOM|A190700|2019-07-23 00:06:00||||1064.00|8700|AI DRESS|690|1.00|399.00|196.00|203.00
ECOM|ECOM|ECOM|ECOM|A190700|2019-07-23 00:06:50||||1064.00|8438|COPA|690|1.00|229.00|112.00|117.00
ECOM|ECOM|ECOM|ECOM|A190700|2019-07-23 00:06:55||||1064.00|8324|CLASS|350|1.00|599.00|294.00|305.00
ECOM|ECOM|ECOM|ECOM|A190701|2019-07-28 12:31:00||||798.00|3689|DRESS|500|1.00|699.00|294.00|405.00
Expect I get this:
data_2019-07-23.csv
data_2019-07-28.csv
Any help would be very much appreciated.
You need to convert Order_Date to dates - stripping the time information. One quick way to do this is:
df['Order_Date1'] = pd.to_datetime(df['Order_Date']).dt.strftime('%Y-%m-%d')
Then proceed with a groupby using Order_Date1.
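Put together, a sketch of the whole loop (writing to the working directory here; adjust the destination path as needed):
import pandas as pd

df = pd.read_csv('test.csv', delimiter='|')
df['Order_Date1'] = pd.to_datetime(df['Order_Date']).dt.strftime('%Y-%m-%d')
for date, group in df.groupby('Order_Date1'):
    # drop the helper column so the output keeps the original layout
    group.drop(columns='Order_Date1').to_csv('data_{}.csv'.format(date), sep='|', index=False)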
You can try converting i to a string, then splitting on whitespace and taking index 0:
str(i).split()[0]
So, substituted into your code:
for i, x in df.groupby('Order_Date'):
    p = os.path.join(r'~/Desktop/', "data_{}.csv".format(str(i).split()[0]))
    x.to_csv(p, sep='|', index=False)
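One caveat: this still groups by the full timestamp, so several groups can share one date and each write (and overwrite) the same file. Grouping by the date part itself avoids that, e.g.:
for i, x in df.groupby(df['Order_Date'].str.split().str[0]):
    p = os.path.join(r'~/Desktop/', "data_{}.csv".format(i))
    x.to_csv(p, sep='|', index=False)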
I have a DataFrame with 29 columns, and need to replace part of a string in some columns with a hashed part of the string.
Example of the column is as follows:
ABSX, PLAN=PLAN_A ;SFFBJD
ADSFJ, PLAN=PLAN_B ;AHJDG
...
...
Code that captures the part of the string:
Test[14] = Test[14].replace({'(?<=PLAN=)([^"]+ ;)': 'hello'}, regex=True)
I want to replace the 'hello' with a hash of the text matched by '(?<=PLAN=)([^"]+ ;)', but it doesn't work that way. I wanted to check whether anyone has done this before without looping over the DataFrame line by line?
Here is what I suggest:
import hashlib
import re
import pandas as pd
# First I reproduce a similar dataset
df = pd.DataFrame({"v1": ["ABSX", "ADSFJ"],
                   "v2": ["PLAN=PLAN_A", "PLAN=PLAN_B"],
                   "v3": ["SFFBJD", "AHJDG"]})
# I search for the regex and create a column matched_el with the hash
r = re.compile(r'=[a-zA-Z_]+')
df["matched_el"] = ["".join(r.findall(w)) for w in df.v2]
df["matched_el"] = df["matched_el"].str.replace("=","")
df["matched_el"] = [hashlib.md5(w.encode()).hexdigest() for w in df.matched_el]
# Then I replace in v2 using this hash
df["v2"] = df["v2"].str.replace("(=[a-zA-Z_]+)", "=")+df["matched_el"]
df = df.drop(columns="matched_el")
Here is the result:
v1 v2 v3
0 ABSX PLAN=8d846f78aa0b0debd89fc1faafc4c40f SFFBJD
1 ADSFJ PLAN=3b9a3c8184829ca5571cb08c0cf73c8d AHJDG
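An alternative that skips the helper column entirely: Series.str.replace accepts a callable replacement (it receives each re.Match, as with re.sub), so the hash can be computed per match in a single pass. A sketch, assuming the PLAN=... ; layout from the question:
import hashlib
import pandas as pd

df = pd.DataFrame({"v2": ["ABSX, PLAN=PLAN_A ;SFFBJD", "ADSFJ, PLAN=PLAN_B ;AHJDG"]})
# hash whatever follows PLAN= up to the next whitespace or semicolon
df["v2"] = df["v2"].str.replace(
    r"(?<=PLAN=)[^\s;]+",
    lambda m: hashlib.md5(m.group(0).encode()).hexdigest(),
    regex=True,
)
print(df)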
I'm trying to drop rows in pandas that contain "???". It works for every other value except "???", and I don't know what the problem is.
This is my code (I have tried both versions):
df = df[~df["text"].str.contains("?????", na=False)]
df = df[~df["text"].str.contains("?????")]
The error I'm getting:
re.error: nothing to repeat at position 0
It works for every other value except "????".
I have googled it and looked all over this website, but I couldn't find any solutions.
str.contains interprets its first parameter as a regular expression by default, hence the re.error.
You can either escape the ? inside the expression like this:
df = df[~df["text"].str.contains(r"\?\?\?\?\?")]
Or set regex=False, as Vorsprung suggested:
df = df[~df["text"].str.contains("?????",regex=False)]
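A third option, if the search string is built at runtime: re.escape escapes every regex metacharacter for you:
import re
df = df[~df["text"].str.contains(re.escape("?????"))]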
Let's convert this into running code:
import numpy as np
import pandas as pd
data = {'A': ['abc', 'cxx???xx', '???',], 'B': ['add', 'ddb', 'c', ]}
df = pd.DataFrame.from_dict(data)
df
output:
A B
0 abc add
1 cxx???xx ddb
2 ??? c
with this:
df[df['A'].str.contains('???',regex=False)]
output:
A B
1 cxx???xx ddb
2 ??? c
You need to tell contains() that your search string is not a regex.
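And to actually drop those rows, as the question asks, negate the mask:
df = df[~df['A'].str.contains('???', regex=False)]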
I'm working on a DataFrame taken from Adafruit IO, and sadly some of my data is from a time when my project malfunctioned, so some of the values are just NaN.
I tried to remove them with these lines of code:
onlyValidData=temp_data.mask(temp_data['value'] =='NaN')
onlyValidData
This is data retrieved from an Adafruit IO feed, analyzed with pandas. I tried using the 'where' function too, but it didn't work.
My entire code is:
import pandas as pd
temp_data = pd.read_json('https://io.adafruit.com/api/(...)')
light_data = pd.read_json('https://io.adafruit.com/api/(...)')
temp_data['created_at'] = pd.to_datetime(temp_data['created_at'], infer_datetime_format=True)
temp_data = temp_data.set_index('created_at')
light_data['created_at'] = pd.to_datetime(light_data['created_at'], infer_datetime_format=True)
light_data = light_data.set_index('created_at')
tempVals = pd.Series(temp_data['value'])
lightVals = pd.Series(light_data['value'])
onlyValidData=temp_data.mask(temp_data['value'] =='NaN')
onlyValidData
The output is all of my data for some reason, but it should be only the valid values.
Hey, I think the issue here is that you're looking for values equal to the string 'NaN', while actual NaN values aren't strings; more specifically, they aren't anything.
Try using:
onlyValidData = temp_data.mask(temp_data['value'].isnull())
Edit: to remove rows rather than marking all values in that row as NaN:
onlyValidData = temp_data.dropna()
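If other columns can legitimately hold NaN, it may be safer to restrict the check to the value column (a sketch using the column name from the question):
onlyValidData = temp_data.dropna(subset=['value'])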