How to open csv in Python?

I have a dataset in the following format.
row_num;locale;day_of_week;hour_of_day;agent_id;entry_page;path_id_set;traffic_type;session_durantion;hits
"988681;L6;Monday;17;1;2111;""31672;0"";6;7037;\N"
"988680;L2;Thursday;22;10;2113;""31965;0"";2;49;14"
"988679;L4;Saturday;21;2;2100;""0;78464"";1;1892;14"
"988678;L3;Saturday;19;8;2113;51462;6;0;1;\N"
I want it to be in the following format:
row_num locale day_of_week hour_of_day agent_id entry_page path_id_set traffic_type session_durantion hits
988681 L6 Monday 17 1 2111 31672 0 6 7037 N
988680 L2 Thursday 22 10 2113 31965 0 2 49 14
988679 L4 Saturday 21 2 2100 0 78464 1 1892 14
988678 L3 Saturday 19 8 2113 51462 6 0 1 N
I tried the following code:
import pandas as pd
df = pd.read_csv("C:\Users\Rahhy\Desktop\trivago.csv", delimiter = ";")
But I am getting an error:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
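The SyntaxError is unrelated to the CSV contents: the backslashes in the Windows path are being interpreted as escape sequences. A raw string (or forward slashes) avoids it; a minimal sketch:
import pandas as pd

# prefix r (raw) so the backslashes in the path are not treated as escapes
df = pd.read_csv(r"C:\Users\Rahhy\Desktop\trivago.csv", delimiter=";")
# forward slashes also work on Windows:
# df = pd.read_csv("C:/Users/Rahhy/Desktop/trivago.csv", delimiter=";")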

Using replace():
with open("data_test.csv", "r") as fileObj:
contents = fileObj.read().replace(';',' ').replace('\\', '').replace('"', '')
print(contents)
OUTPUT:
row_num locale day_of_week hour_of_day agent_id entry_page path_id_set traffic_type session_durantion hits
988681 L6 Monday 17 1 2111 31672 0 6 7037 N
988680 L2 Thursday 22 10 2113 31965 0 2 49 14
988679 L4 Saturday 21 2 2100 0 78464 1 1892 14
988678 L3 Saturday 19 8 2113 51462 6 0 1 N
EDIT:
You can open the file, read its contents, replace the unwanted characters, write the new contents back to the file, and then read it with pd.read_csv:
with open("data_test.csv", "r") as fileObj:
contents = fileObj.read().replace(';',' ').replace('\\', '').replace('"', '')
# print(contents)
with open("data_test.csv", "w+") as fileObj2:
fileObj2.write(contents)
import pandas as pd
df = pd.read_csv(r"data_test.csv", index_col=False)
print(df)
OUTPUT:
row_num locale day_of_week hour_of_day agent_id entry_page path_id_set traffic_type session_durantion hits
988681 L6 Monday 17 1 2111 31672 0 6 7037 N
988680 L2 Thursday 22 10 2113 31965 0 2 49 14
988679 L4 Saturday 21 2 2100 0 78464 1 1892 14
988678 L3 Saturday 19 8 2113 51462 6 0 1 N

import pandas as pd
from io import StringIO
# Load the file to a string (prefix r (raw) to not use \ for escaping)
filename = r'c:\temp\x.csv'
with open(filename, 'r') as file:
    raw_file_content = file.read()
# Remove the quotes which break the CSV file
file_content_without_quotes = raw_file_content.replace('"','')
# Simulate a file with the corrected CSV content
simulated_file = StringIO(file_content_without_quotes)
# Get the CSV as a table with pandas
# Since the first field in each data row shall not be used for indexing we need to set index_col=False
csv_data = pd.read_csv(simulated_file, delimiter = ';', index_col=False)
print(csv_data['hits']) # print some column
csv_data
Since there are 11 data fields but only 10 headers, only the first 10 fields are used. You'll have to figure out what to do with the last one (values: \N, 14).
Output:
0 7037
1 49
2 1892
3 1
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
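If you want to keep that 11th field instead of dropping it, one option (a sketch; the column name 'extra' is made up) is to skip the short header row and pass an explicit names= list with 11 entries:
import pandas as pd
from io import StringIO

with open(r'c:\temp\x.csv', 'r') as file:
    content = file.read().replace('"', '')

# 'extra' is a placeholder name for the unlabeled 11th field
columns = ['row_num', 'locale', 'day_of_week', 'hour_of_day', 'agent_id',
           'entry_page', 'path_id_set', 'traffic_type', 'session_durantion',
           'hits', 'extra']

csv_data = pd.read_csv(StringIO(content), delimiter=';', skiprows=1,
                       names=columns, index_col=False)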


Character specific conditional check in a string

I have to read and analyse some log files using Python; they usually contain strings in the following desired format:
date Don Dez 10 21:41:41.747 2020
base hex timestamps absolute
no internal events logged
// version 13.0.0
//28974.328957 previous log file: 21-41-41_Voltage.asc
// Measurement UUID: 9e0029d6-43a0-49e3-8708-3ec70363124c
28976.463987 LoggingString := "Log,5:45 AM, Friday, December 11, 2020,05:45:20.6,65.48,11.99,0.009843,12,0.01078,11.99,0.01114,11.99,0.01096,12,0.009984,4.595,0,1.035,0,0.1745,0,2,OM_2_1,0"
28978.600018 LoggingString := "Log,5:45 AM, Friday, December 11, 2020,05:45:22.7,65.47,11.99,0.009896,12,0.01079,11.99,0.01117,11.99,0.01097,12,0.009965,4.628,0,1.044,0,0.1698,0,2,OM_2_1,0"
However, sometimes files are created with undesired formats like the ones below:
date Die Jul 13 08:40:22.878 2021
base hex timestamps absolute
no internal events logged
// version 13.0.0
//1035.595166 previous log file: 08-40-22_Voltage.asc
// Measurement UUID: 2baf3f3f-300a-4f0a-bcbf-0ba5679d8be2
"1203.997816 LoggingString := ""Log" 9:01 am Tuesday July 13 2021 09:01:58.3 24.53 13.38 0.8948 13.37 0.8801 13.37 0.89 13.37 0.9099 13.47 0.8851 4.551 0.00115 0.8165 0 0.2207 0 5 OM_3_2 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 "0"""
"1206.086064 LoggingString := ""Log" 9:02 am Tuesday July 13 2021 09:02:00.4 24.53 13.37 0.8945 13.37 0.8801 13.37 0.8902 13.37 0.9086 13.46 0.8849 5.142 0.001185 1.033 0 0.1897 0 5 OM_3_2 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 "0"""
OR
date Mit Jun 16 10:11:43.493 2021
base hex timestamps absolute
no internal events logged
// version 13.0.0
// Measurement UUID: fe4a6a97-d907-4662-89f9-bd246aa54a33
10025.661597 LoggingString := """""""Log""" 12:59 PM Wednesday June 16 2021 12:59:01.1 66.14 0.00423 0 0.001206 0 0.001339 0 0.001229 0 0.001122 0 0.05017 0 0.01325 0 0.0643 0 0 OM_2_1_transition 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 """0"""""""
10030.592652 LoggingString := """""""Log""" 12:59 PM Wednesday June 16 2021 12:59:06.1 66.14 11.88 0.1447 11.88 0.1444 11.88 0.1442 11.87 0.005552 11.9 0.00404 2.55 0 0.4712 0 0.09924 0 0 OM_2_1_transition 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 """0"""""""
Since I am only concerned with the data below the "// Measurement UUID" line, I am using this code to extract data from strings that are in the desired format:
files = os.listdir(directory)
files = natsorted(files)
for file in files:
    base, ext = os.path.splitext(file)
    if file not in processed_files and ext == '.asc':
        print("File added:", file)
        file_path = os.path.join(directory, file)
        count = 0
        with open(file_path, 'r') as file_in:
            processed_files.append(file)
            Output_list = []  # each string from the file is read into this list
            Final = []  # the required specific data from each string is isolated & stored here
            for line in map(str.strip, file_in):
                if "LoggingString" in line:
                    # index of the first '"' in the line
                    first_quote = line.index('"')
                    # index of the next '"' after it (end of the quoted part)
                    last_quote = line.index('"', first_quote + 1)
                    Output_list.append(
                        line[:first_quote].split(maxsplit=1)
                        + line[first_quote + 1: last_quote].split(","),
                    )
                    Final.append(Output_list[count][7:27])
The undesired formats contain one or more whitespace characters between fields, as seen above. I guess the log file generator sometimes produces a non-comma-separated file, or a comma-separated file with errors; I am not sure.
I tried to put a condition after the LoggingString check:
if "LoggingString" in line :
if ',' in line:
first_quote = line.index('"')
last_quote = line.index('"', first_quote + 1)
Output_list.append(line[:first_quote].split(maxsplit=1)
+ line[first_quote + 1: last_quote].split(","),)
Final.append(Output_list[count][7:27])
else:
raise Exception("Error in File")
However, this didn't serve the purpose: if a line in an undesired format contains even one ',', the program considers it valid and processes it, which produces false results.
How do I ensure that only files containing strings in the desired format are processed, and that an error message is printed for the others? What kind of conditional check could be implemented here?
You can use pandas.read_csv with a regex separator:
import glob
import pandas as pd

l = []
for f in glob.glob("/tmp/Log*.txt"):
    df = (pd.read_csv(f, sep=r',|(?<=[\w"])\s+(?=[\w"])',
                      header=None, skiprows=6, engine="python").iloc[:, 2:28])
    df.insert(0, "filename", f.split("\\")[-1])
    l.append(df)
out = pd.concat(l)
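The separator regex splits on either a comma or a run of whitespace that sits between word characters or quotes, so both the comma-separated and the space-separated log variants parse into the same columns. A quick sanity check of what the pattern matches (a sketch, not from the original answer):
import re

sep = re.compile(r',|(?<=[\w"])\s+(?=[\w"])')
print(sep.split('a,b c  d'))  # ['a', 'b', 'c', 'd']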

Pandas how to preserve all values in dataframe into a csv?

I want to convert the html to csv using pandas functions.
This is part of what I read into the dataframe df:
0 1
0 sequence 2
1 trainNo K805
2 trainNumber K805
3 departStation 鹰潭
4 departStationPy yingtan
5 arriveStation 南昌
6 arriveStationPy nanchang
7 departDate 2020-05-24
8 departTime 03:55
9 arriveDate 2020-05-24
10 arriveTime 05:44
11 isStartStation False
12 isEndStation False
13 runTime 1小时49分钟
14 preSaleTime NaN
15 takeDays 0
16 isBookable True
17 seatList seatNamepriceorderPriceinventoryisBookablebutt...
18 curSeatIndex 0
seatName price orderPrice inventory isBookable buttonDisplayName buttonType
0 硬座 23.5 23.5 99 True NaN 0
1 硬卧 69.5 69.5 99 True NaN 0
2 软卧 104.5 104.5 4 True NaN 0
0 1
0 departDate 2020-05-23
1 departStationList NaN
2 endStationList NaN
3 departStationFilterMap NaN
4 endStationFilterMap NaN
5 departCityName 上海
6 arriveCityName 南昌
7 gtMinPrice NaN
My code is like this:
for i, df in enumerate(pd.read_html(html, encoding='utf-8')):
    df.to_csv(".\other.csv", index=True, encoding='utf-8-sig')
To preserve the characters in csv, I need to use utf-8-sig encoding. But I don't know how to write the format symbol %
,0,1
0,departDate,2020-05-23
1,departStationList,
2,endStationList,
3,departStationFilterMap,
4,endStationFilterMap,
5,departCityName,上海
6,arriveCityName,南昌
7,gtMinPrice,
This is what I got in the csv file; only the last part is preserved.
The dataframe is correct, while the csv needs correction. Can you show me how to produce the correct output?
You're saving each dataframe to the same file, so each one is overwritten until only the last remains.
Note the addition of the f-string to change the save file name, e.g. f".\other_{i}.csv".
Each dataframe is a different shape, so they won't all fit together properly.
To CSV
for i, df in enumerate(pd.read_html(html, encoding='utf-8')):
    df.to_csv(f".\other_{i}.csv", index=True, encoding='utf-8-sig')
To Excel
with pd.ExcelWriter('output.xlsx', mode='w') as writer:
    for i, df in enumerate(pd.read_html(html, encoding='utf-8')):
        df.to_excel(writer, sheet_name=f'Sheet{i}', encoding='utf-8-sig')
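If you'd rather keep everything in one CSV despite the differing shapes, appending each table also avoids the overwriting (a sketch; the filename all_tables.csv is arbitrary):
for i, df in enumerate(pd.read_html(html, encoding='utf-8')):
    df.to_csv("all_tables.csv", mode='w' if i == 0 else 'a',
              index=True, encoding='utf-8-sig')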

How to get the rows of the most recent day in ascending time order when reading a csv file?

I want to get the rows of the most recent day, sorted in ascending time order.
I get a dataframe as follows:
label uId adId operTime siteId slotId contentId netType
0 0 u147333631 3887 2019-03-30 15:01:55.617 10 30 2137 1
1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
2 0 u139816523 2084 2019-03-27 08:10:41.769 10 30 2336 1
3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
4 0 u106642861 2295 2019-03-27 22:58:03.679 3 32 2567 4
Because this csv file has about 100 million rows, it is impossible to load it all into my PC's memory.
So I want to get the rows of the most recent day, in ascending time order, while reading the csv file.
For example, if the most recent day is 2019-04-04, the output would be as follows:
# this is not real data, just an example
label uId adId operTime siteId slotId contentId netType
0 0 u147336431 3887 2019-04-04 00:08:42.315 1 54 2427 2
1 0 u146933269 1462 2019-04-04 01:06:16.417 30 36 1343 6
2 0 u139536523 2084 2019-04-04 02:08:58.079 15 23 1536 7
3 0 u106663472 1460 2019-04-04 03:21:13.050 32 45 1352 2
4 0 u121642861 2295 2019-04-04 04:36:08.653 3 33 3267 4
Could anyone help me?
Thanks in advance.
I'm assuming you can't read the entire file into memory, and the file is in a random order. You can read the file in chunks and iterate through the chunks.
# read 500,000 lines of the file at a time
reader = pd.read_csv(
    'csv_file.csv',
    parse_dates=['operTime'],  # parse_dates=True would only parse the index
    chunksize=500000,
    header=0
)

recent_day = pd.Timestamp(2019, 4, 4)
next_day = recent_day + pd.Timedelta(days=1)

df_list = []
for chunk in reader:
    # keep only the rows that fall inside the date range
    date_rows = chunk.loc[
        (chunk['operTime'] >= recent_day) &
        (chunk['operTime'] < next_day)
    ]
    # append the dataframe of matching rows (if any) to the list
    if not date_rows.empty:
        df_list.append(date_rows)

final_df = pd.concat(df_list)
final_df = final_df.sort_values('operTime')
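If the most recent day isn't known in advance, a first pass over the chunks can find it (a sketch, assuming operTime parses cleanly as datetimes):
# first pass: find the latest timestamp in the file
max_time = None
for chunk in pd.read_csv('csv_file.csv', parse_dates=['operTime'],
                         chunksize=500000, header=0):
    chunk_max = chunk['operTime'].max()
    if max_time is None or chunk_max > max_time:
        max_time = chunk_max

recent_day = max_time.normalize()  # midnight of the most recent day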
Seconding what anky_91 said, sort_values() will be helpful here.
import pandas as pd
df = pd.read_csv('file.csv')
# >>> df
# label uId adId operTime siteId slotId contentId netType
# 0 0 u147333631 3887 2019-03-30 15:01:55.617 10 30 2137 1
# 1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
# 2 0 u139816523 2084 2019-03-27 08:10:41.769 10 30 2336 1
# 3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
# 4 0 u106642861 2295 2019-03-27 22:58:03.679 3 32 2567 4
sub_df = df[(df['operTime']>'2019-03-31') & (df['operTime']<'2019-04-01')]
# >>> sub_df
# label uId adId operTime siteId slotId contentId netType
# 1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
# 3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
final_df = sub_df.sort_values(by=['operTime'])
# >>> final_df
# label uId adId operTime siteId slotId contentId netType
# 3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
# 1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
I think you could also use a datetimeindex here; that might be necessary if the file is sufficiently large.
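A sketch of that datetimeindex idea, using partial-string indexing to select a whole day:
df['operTime'] = pd.to_datetime(df['operTime'])
df = df.set_index('operTime').sort_index()
most_recent = df.loc['2019-03-31']  # selects every row on that day, in time order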
Like #anky_91 mentioned, you can use the sort_values function. Here is a short example of how it works:
df = pd.DataFrame({'Symbol': ['A', 'A', 'A'],
                   'Date': ['02/20/2015', '01/15/2016', '08/21/2015']})
df.sort_values(by='Date')
Out:
Date Symbol
1 01/15/2016 A
0 02/20/2015 A
2 08/21/2015 A

Changing a coded min to a datetime in python pandas

I have a data set which looks like this. I must mention that 263 means (0-15 min), 264 means (16-30 min), 265 means (31-45 min), and 266 means (46-60 min). I need to convert these columns into a single column in the format YYYY-MM-DD HH:MM:SS.
LOCAL_YEAR LOCAL_MONTH LOCAL_DAY LOCAL_HOUR VALUE FLAG STATUS MEAS_TYPE_ELEMENT_ALIAS
2006 4 11 0 0 R 263
2006 4 11 0 0 R 264
2006 4 11 0 0 R 265
2006 4 11 0 0 R 266
2006 4 11 1 0 R 263
2006 4 11 1 0 R 264
2006 4 11 1 0 R 265
2006 4 11 1 0 R 266
I was wondering if anyone could help me with this?
This is the code:
import pandas as pd
import numpy as np
import datetime

raw_data = pd.read_csv('Squamish_263_264_265_266.csv')

# reading rainfall and years
df = raw_data.iloc[:, [2, 3, 4, 5, 6, 9]]
# print(df)

dmap = {263: 0, 264: 16, 265: 31, 266: 46}
df['MEAS_TYPE_ELEMENT_ALIAS_map'] = df['MEAS_TYPE_ELEMENT_ALIAS'].map(dmap)
for row, v in df.iterrows():
    df.loc[row, 'date'] = datetime.datetime(v['LOCAL_YEAR'], v['LOCAL_MONTH'], v['LOCAL_DAY'],
                                            v['LOCAL_HOUR'], v['MEAS_TYPE_ELEMENT_ALIAS_map'])
but it gives this error:
TypeError: integer argument expected, got float
and
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Use a map to translate the alias into a minute offset and then iterate to build your dates:
dmap = {263: 0, 264: 16, 265: 31, 266: 46}
df['ALIAS_map'] = df['MEAS_TYPE_ELEMENT_ALIAS'].map(dmap)
df.reset_index(inplace=True)
for row in df.head(50).itertuples():
    df.loc[row[0], 'date'] = datetime.datetime(int(row[1]), row[2], row[3], row[4], row[-1])
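A vectorized alternative (a sketch) avoids the row loop, and with it the TypeError, since pd.to_datetime assembles dates from year/month/day/hour columns without needing explicit int conversion:
# rename to the unit names pd.to_datetime expects when assembling dates
parts = df[['LOCAL_YEAR', 'LOCAL_MONTH', 'LOCAL_DAY', 'LOCAL_HOUR']].rename(
    columns={'LOCAL_YEAR': 'year', 'LOCAL_MONTH': 'month',
             'LOCAL_DAY': 'day', 'LOCAL_HOUR': 'hour'})
# add the mapped minutes from the alias column as a timedelta
df['date'] = pd.to_datetime(parts) + pd.to_timedelta(df['ALIAS_map'], unit='m')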

Replace text strings in a file that start with certain characters

I would like to replace text in a file by searching for specific letters at the beginning of the string. For example, here is a section of the file:
6 HT 4.092000 4.750000 -0.502000 0 5 7
7 HT 5.367000 5.548000 -0.325000 0 5 6
8 OT -5.470000 5.461000 1.463000 0 9 10
9 HT -5.167000 4.571000 1.284000 0 8 10
10 HT -4.726000 6.018000 1.235000 0 8 9
11 OT -4.865000 -5.029000 -3.915000 0 12 13
12 HT -4.758000 -4.129000 -3.608000 0 11 13
I would like to use "HT" as the search and be able to replace the "space0space" with 2002. When I try I replace all 0 with 2002 and not the column that is just 0. After this I need to then search "OT" and replace the 0 column with 2001.
So basically I need to search a string that identify the line and replace a column specific string while the text that lies between is variable. The output needs to be printed to a new_file.xyz. Also I will be doing this repeatedly on lots of files so it would be great to be a script that can typed in front of the file that will be operated on. Thanks.
This should do it for you (for HT):
with open('file.txt') as f:
    lines = f.readlines()

new_lines = []
for line in lines:
    if "HT" in line:
        # relies on the standalone-zero column being surrounded by spaces,
        # so ' 0 ' does not match zeros inside the other numbers
        new_line = line.replace(' 0 ', '2002')
        new_lines.append(new_line)
    else:
        new_lines.append(line)

content = ''.join(new_lines)
print(content)
# 6 HT 4.092000 4.750000 -0.502000 2002 5 7
# 7 HT 5.367000 5.548000 -0.325000 2002 5 6
# 8 OT -5.470000 5.461000 1.463000 0 9 10
# 9 HT -5.167000 4.571000 1.284000 2002 8 10
# 10 HT -4.726000 6.018000 1.235000 2002 8 9
# 11 OT -4.865000 -5.029000 -3.915000 0 12 13
# 12 HT -4.758000 -4.129000 -3.608000 2002 11 13
Repeat the same logic (extending the if/else, or otherwise) for other line identifiers.
If you put this in a function, you can reuse it to replace values by identifier:
def _find_and_replace(current_lines, line_id, value):
    lines = []
    for l in current_lines:
        lines.append(l.replace(' 0 ', value) if line_id in l else l)
    return ''.join(lines)

with open('file.txt') as f:
    lines = f.readlines()

new_lines = _find_and_replace(lines, line_id='HT', value='2002')
print(new_lines)
Though, if you have many identifiers, I would implement a solution which doesn't go through the list of lines once per identifier, but rather looks up the identifier as it iterates the lines.
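A minimal single-pass sketch of that idea (the identifier-to-replacement mapping is taken from the question; adjust as needed):
def replace_by_id(current_lines, replacements):
    out = []
    for l in current_lines:
        # find which identifier (if any) this line contains
        for line_id, value in replacements.items():
            if line_id in l:
                l = l.replace(' 0 ', value)
                break
        out.append(l)
    return ''.join(out)

with open('file.txt') as f:
    content = replace_by_id(f.readlines(), {'HT': '2002', 'OT': '2001'})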
A solution using the fileinput module with the re.search() and re.sub() functions:
import fileinput, re

with fileinput.input(files="lines.txt", inplace=True) as f:
    for line in f:
        if re.search(r'\bHT\b', line):    # line contains an `HT` column
            print(re.sub(r' 0 ', '2002', line).strip())
        elif re.search(r'\bOT\b', line):  # line contains an `OT` column
            print(re.sub(r' 0 ', '2001', line).strip())
        else:
            print(line, end='')           # keep the line's own newline
The file contents after processing:
6 HT 4.092000 4.750000 -0.502000 2002 5 7
7 HT 5.367000 5.548000 -0.325000 2002 5 6
8 OT -5.470000 5.461000 1.463000 2001 9 10
9 HT -5.167000 4.571000 1.284000 2002 8 10
10 HT -4.726000 6.018000 1.235000 2002 8 9
11 OT -4.865000 -5.029000 -3.915000 2001 12 13
12 HT -4.758000 -4.129000 -3.608000 2002 11 13
Optional in-place filtering: if the keyword argument inplace=True is passed to fileinput.input() or to the FileInput constructor, the file is moved to a backup file and standard output is directed to the input file (if a file of the same name as the backup file already exists, it will be replaced silently). This makes it possible to write a filter that rewrites its input file in place.
