I have the following CSV file that I process as follows:
import pandas as pd
df = pd.read_csv('file.csv', sep=',',header=None)
id ocr raw_value
00037625-4706-4dfe-a7b3-de8c47e3a28d A 3
000a7b30-4c4f-4756-a757-f688ccc55d5d A /c
000b08e3-4129-4fd2-8ec0-23d00fe38a45 A yes
00196436-12bc-4024-b623-25bac586d314 A know
001b8c43-3e73-43c1-ba4f-df5edb10dfac A hi
002882ca-48bb-4161-a75a-cf0ec984d650 A fd
003b2890-3727-4c79-955a-f74ec6945ed7 A Sensible
004d9025-86f0-4f8c-9720-01e3385c5e77 A 2015
Now I want to add a new column:
df['val'] = None
for img in images:
    id, ext = img.rsplit('.', 1)
    idx = df[df[0] == id].index.values
    df.loc[df.index[idx], 'val'] = id
When I write df to a new file as follows:
df.to_csv('new_file.csv', sep=',',encoding='utf-8')
I noticed that the column is correctly added and filled, but it remains without a name, even though it's supposed to be named val:
id ocr raw_value
00037625-4706-4dfe-a7b3-de8c47e3a28d A 3 4
000a7b30-4c4f-4756-a757-f688ccc55d5d A /c 3
000b08e3-4129-4fd2-8ec0-23d00fe38a45 A yes 1
00196436-12bc-4024-b623-25bac586d314 A know 8
001b8c43-3e73-43c1-ba4f-df5edb10dfac A hi 9
002882ca-48bb-4161-a75a-cf0ec984d650 A fd 10
003b2890-3727-4c79-955a-f74ec6945ed7 A Sensible 14
How do I set a name for the last column added?
EDIT1:
print(df.head())
0 1 2 3
0 id ocr raw_value manual_raw_value
1 00037625-4706-4dfe-a7b3-de8c47e3a28d ABBYY 03 03
2 000a7b30-4c4f-4756-a757-f688ccc55d5d ABBYY y/c y/c
3 000b08e3-4129-4fd2-8ec0-23d00fe38a45 ABBYY armoire armoire
4 00196436-12bc-4024-b623-25bac586d314 ABBYY point point
val
0 None
1 93
2 yic
3 armoire
4 point
You only need read_csv; sep=',' is the default and can be omitted, and header=None is only used if the csv has no header:
df = pd.read_csv('file.csv')
The problem is that your first row was not parsed as column names, but as the first data row. Reading with
df = pd.read_csv('file.csv', sep=',', header=0, index_col=0)
should allow you to simplify the next portion to:
df['val']=None
for img in images:
    image_id, ext = img.rsplit('.', 1)
    df.loc[image_id, 'val'] = image_id
If you don't need the image_id as index afterwards, use df.reset_index(inplace=True)
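Putting the pieces above together, a minimal end-to-end sketch (the images list and its single entry here are hypothetical, and every id is assumed to actually occur in the CSV, since .loc would otherwise append new rows):
import pandas as pd

images = ['00037625-4706-4dfe-a7b3-de8c47e3a28d.png']  # hypothetical example list

# Read with a real header row and the id column as the index.
df = pd.read_csv('file.csv', sep=',', header=0, index_col=0)
df['val'] = None

for img in images:
    image_id, ext = img.rsplit('.', 1)
    df.loc[image_id, 'val'] = image_id

# Turn the id index back into a regular column before writing.
df.reset_index(inplace=True)
df.to_csv('new_file.csv', sep=',', encoding='utf-8', index=False)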
One easy way, before to_csv:
df.columns.values[3] = "val"
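Note that assigning into df.columns.values mutates the underlying array of an Index that pandas treats as immutable, which can leave the index in an inconsistent state. A safer spelling (a sketch, assuming the column you want to rename carries the positional label 3, as above):
# Rename the positional label 3 to 'val'; rename returns a new DataFrame.
df = df.rename(columns={3: 'val'})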
Related
I got a .csv file with lines like this:
result,table,_start,_stop,_time,_value,_field,_measurement,device
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:35Z,44.61,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:40Z,17.33,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:45Z,41.2,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:51Z,33.49,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:56Z,55.68,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:57Z,55.68,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:13:02Z,25.92,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:13:08Z,5.71,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
I need to make them look like this:
time value
0 2022-10-24T12:12:35Z 44.61
1 2022-10-24T12:12:40Z 17.33
2 2022-10-24T12:12:45Z 41.20
3 2022-10-24T12:12:51Z 33.49
4 2022-10-24T12:12:56Z 55.68
I will need that for my anomaly detection code so I don't have to manually delete columns and so on, at least not all of them. I can't do it with the program that works with the machine that collects the wattage info.
I tried this, but it doesn't work well enough:
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df['_time'] = pd.to_datetime(df['_time'], format='%Y-%m-%dT%H:%M:%SZ')
df = pd.pivot(df, index = '_time', columns = '_field', values = '_value')
df.interpolate(method='linear')  # not necessary
It gives this output:
0
9 83.908
10 80.342
11 79.178
12 75.621
13 72.826
... ...
73522 10.726
73523 5.241
Here is the canonical way to project down to a subset of columns in the pandas ecosystem.
df = df[['_time', '_value']]
You can simply use the keyword argument usecols of pandas.read_csv:
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv', usecols=["_time", "_value"])
NB: If you need to read the entire data of your .csv and only then select a subset of columns, the pandas core developers suggest using pandas.DataFrame.loc. Otherwise, with the df = df[subset_of_cols] syntax, the moment you start doing operations on the (new?) sub-dataframe you'll get a warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
So, in your case you can use :
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df = df.loc[:, ["_time", "_value"]]  # instead of df[["_time", "_value"]]
Another option is pandas.DataFrame.copy:
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df = df[["_time", "_value"]].copy()
.read_csv has a usecols parameter to specify which columns you want in the DataFrame.
df = pd.read_csv(f, header=0, usecols=['_time', '_value'])
print(df)
_time _value
0 2022-10-24T12:12:35Z 44.61
1 2022-10-24T12:12:40Z 17.33
2 2022-10-24T12:12:45Z 41.20
3 2022-10-24T12:12:51Z 33.49
4 2022-10-24T12:12:56Z 55.68
5 2022-10-24T12:12:57Z 55.68
6 2022-10-24T12:13:02Z 25.92
7 2022-10-24T12:13:08Z 5.71
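If you also want the exact headers from the desired output (time and value rather than _time and _value), a small rename after the read should do it, for example:
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv',
                 usecols=['_time', '_value'])
# Strip the leading underscores to match the desired headers.
df = df.rename(columns={'_time': 'time', '_value': 'value'})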
I have a .dat file which looks something like the one below...
#| step | Channel| Mode | Duration|Freq.| Amplitude | Phase|
0 1 AWG Pi/2 100 2 1
1 1 SIN^2 100 1 1
2 1 SIN^2 200 0.5 1
3 1 REC 50 100 1 1
100 0 REC Pi/2 150 1 1
I created a DataFrame and wanted to extract data from it, but I got an error:
TypeError: expected str, bytes or os.PathLike object, not DataFrame
My code is below:
import pandas as pd
import numpy as np
path = "updated.dat"
datContent = [i.strip().split() for i in open(path).readlines()]
#print(datContent)
column_names = datContent.pop(0)
print(column_names)
df = pd.DataFrame(datContent)
print(df)
extract_column = df.iloc[:,2]
with open(df, 'r') as openfile:
    for line in openfile:
        for column_search in line:
            column_search = df.iloc[:, 2]
            if "REC" in column_search:
                print("Rec found")
Any suggestion would be appreciated.
Since your post does not have a clear question, I have to guess based on your code. I am assuming that what you want is to find all rows in the DataFrame where the column Mode contains the value REC.
Based on that, I prepared a small, self contained example that works on your data.
In your situation, the only line that you really need is the last one. Assuming that your DataFrame is created and filled correctly, everything below print(df) in your code can be replaced by that single line.
I would really recommend reading the official documentation about indexing and selecting data in DataFrames: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
import pandas as pd
from io import StringIO
data = StringIO("""
no;step;Channel;Mode;Duration;Freq.;Amplitude;Phase
;0;1;AWG;Pi/2;100;2;1
;1;1;SIN^2;;100;1;1
;2;1;SIN^2;;200;0.5;1
;3;1;REC;50;100;1;1
;100;0;REC;Pi/2;150;1;1
""")
df = pd.read_csv(data, sep=";")
df.loc[df.loc[:, 'Mode'] == "REC", :]
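For completeness, the same filter can be spelled more compactly with plain boolean indexing or query(); both should give the same rows:
rec_rows = df[df['Mode'] == 'REC']     # plain boolean mask
rec_rows = df.query('Mode == "REC"')   # query() equivalent
print(rec_rows)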
I have a .csv file that contains 3 types of records, each with different quantity of columns.
I know the structure of each record type and that the rows are always of type1 first, then type2 and type 3 at the end, but I don't know how many rows of each record type there are.
The first 4 characters of each row define the record type of that row.
CSV Example:
typ1,John,Smith,40,M,Single
typ1,Harry,Potter,22,M,Married
typ1,Eva,Adams,35,F,Single
typ2,2020,08,16,A
typ2,2020,09,02,A
typ3,Chevrolet,FC101TT,2017
typ3,Toyota,CE972SY,2004
How can I read it with pandas? It doesn't matter if I have to read one record type at a time.
Thanks!!
Here is a pandas solution.
First we must read the csv file in a way that keeps each entire line in one cell. We do that by simply using a "wrong" separator, such as the hash symbol '#'. It can be whatever we want, as long as we can guarantee it won't ever appear in our data file.
wrong_sep = '#'
right_sep = ','
df = pd.read_csv('my_file.csv', sep=wrong_sep, header=None).iloc[:, 0]  # header=None so the first line stays in the data
The .iloc[:, 0] is used as a quick way to convert a DataFrame into a Series.
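An equivalent, slightly more explicit way to get that Series is DataFrame.squeeze, e.g.:
# A one-column DataFrame collapses to a Series along the columns axis.
df = pd.read_csv('my_file.csv', sep=wrong_sep, header=None).squeeze('columns')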
Then we use a loop to select the rows that belong to each data structure based on their starting characters. Now we use the "right separator" (probably a comma ',') to split the desired data into real DataFrames.
starters = ['typ1', 'typ2', 'typ3']
detected_dfs = dict()
for start in starters:
    _df = df[df.str.startswith(start)].str.split(right_sep, expand=True)
    detected_dfs[start] = _df
And here you go. If we print the resulting DataFrames, we get:
      0      1       2   3  4        5
0  typ1   John   Smith  40  M   Single
1  typ1  Harry  Potter  22  M  Married
2  typ1    Eva   Adams  35  F   Single

      0     1   2   3  4
3  typ2  2020  08  16  A
4  typ2  2020  09  02  A

      0          1        2     3
5  typ3  Chevrolet  FC101TT  2017
6  typ3     Toyota  CE972SY  2004
Let me know if it helped you!
Not Pandas:
from collections import defaultdict
filename2 = 'Types.txt'
with open(filename2) as dataLines:
    nL = dataLines.read().splitlines()
defDList = defaultdict(list)
subs = ['typ1','typ2','typ3']
dataReadLines = [defDList[i].append(j) for i in subs for j in nL if i in j]
# dataReadLines = [i for i in nL]
print(defDList)
Output:
defaultdict(<class 'list'>, {'typ1': ['typ1,John,Smith,40,M,Single', 'typ1,Harry,Potter,22,M,Married', 'typ1,Eva,Adams,35,F,Single'], 'typ2': ['typ2,2020,08,16,A', 'typ2,2020,09,02,A'], 'typ3': ['typ3,Chevrolet,FC101TT,2017', 'typ3,Toyota,CE972SY,2004']})
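If you end up wanting DataFrames after all, splitting the stored strings is straightforward; a sketch building on defDList above:
import pandas as pd

# One DataFrame per record type, splitting each stored line on commas.
dfs = {typ: pd.DataFrame([row.split(',') for row in lines])
       for typ, lines in defDList.items()}
print(dfs['typ1'])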
You can make use of the skiprows parameter of the pandas read_csv method to skip the rows you are not interested in for a particular record type. The following gives you a dictionary dfs of DataFrames, one for each type. An advantage is that records of the same type don't necessarily have to be adjacent to each other in the csv file.
For larger files you might want to adjust the code such that the file is only read once instead of twice.
import pandas as pd
from collections import defaultdict
indices = defaultdict(list)
types = ['typ1', 'typ2', 'typ3']
filename = 'test.csv'
with open(filename) as csv:
    for idx, line in enumerate(csv.readlines()):
        for typ in types:
            if line.startswith(typ):
                indices[typ].append(idx)

dfs = {typ: pd.read_csv(filename, header=None,
                        skiprows=lambda x: x not in indices[typ])
       for typ in types}
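For the single-read variant mentioned above, one option is to bucket the raw lines in memory and parse each bucket with StringIO; a sketch reusing filename and types from the block above:
import pandas as pd
from io import StringIO
from collections import defaultdict

# Single pass: collect the raw lines of each record type.
buckets = defaultdict(list)
with open(filename) as csv_file:
    for line in csv_file:
        for typ in types:
            if line.startswith(typ):
                buckets[typ].append(line)

# Parse each bucket from memory instead of re-reading the file.
dfs = {typ: pd.read_csv(StringIO(''.join(buckets[typ])), header=None)
       for typ in types}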
Read the file as a CSV file using the CSV reader. The reader fortunately does not care about line formats:
import csv
with open("yourfile.csv") as infile:
    data = list(csv.reader(infile))
Collect the rows with the same first element and build a dataframe of them:
import pandas as pd
from itertools import groupby
dfs = [pd.DataFrame(v) for _,v in groupby(data, lambda x: x[0])]
You've got a list of three dataframes (or as many as necessary).
dfs[1]
# 0 1 2 3 4
#0 typ2 2020 08 16 A
#1 typ2 2020 09 02 A
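If you would rather look the frames up by record type than by list position, the same groupby can feed a dict, for example:
# Key each DataFrame by its record type ('typ1', 'typ2', ...).
dfs_by_type = {key: pd.DataFrame(list(rows))
               for key, rows in groupby(data, lambda x: x[0])}
print(dfs_by_type['typ2'])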
My data looks like this:
pd.read_csv('/Users/admin/desktop/007538839.csv').head()
105586.18
0 105582.910
1 105585.230
2 105576.445
3 105580.016
4 105580.266
I want to move that 105586.18 to index 0, because right now it is the column name. And after that I want to name this column 'flux'. I've tried
pd.read_csv('/Users/admin/desktop/007538839.csv', sep='\t', names = ["flux"])
but it did not work, probably because the dataframe is not in the right format.
How can I achieve that?
For me your code works very nicely:
import pandas as pd
from io import StringIO

temp = u"""105586.18
105582.910
105585.230
105576.445
105580.016
105580.266"""
# after testing, replace 'StringIO(temp)' with '/Users/admin/desktop/007538839.csv'
df = pd.read_csv(StringIO(temp), sep='\t', names=["flux"])
print(df)
flux
0 105586.180
1 105582.910
2 105585.230
3 105576.445
4 105580.016
5 105580.266
To overwrite the original file with the same data and the new header flux:
df.to_csv('/Users/admin/desktop/007538839.csv', index=False)
Try this:
df=pd.read_csv('/Users/admin/desktop/007538839.csv',header=None)
df.columns=['flux']
header=None is your friend here.
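Both steps also collapse into the read itself; this one-liner should be equivalent:
df = pd.read_csv('/Users/admin/desktop/007538839.csv', header=None, names=['flux'])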
[Update my question]
I have a text file that looks like the one below,
#File_infoomation1
#File_information2
A B C D
1 2 3 4.2
5 6 7 8.5    # example.txt is separated by tab '\t'; column A dtype is object
I'd like to merge the text file with a csv database file based on column E. The column contains integers.
E,name,age
1,john,23
5,mary,24 # database.csv column E type is int64
So I tried to read the text file and remove the first 2 unneeded header lines.
example = pd.read_csv('example.txt', header = 2, sep = '\t')
database = pd.read_csv('database.csv')
request = example.rename(columns={'A': 'E'})
New_data = request.merge(database, on='E', how='left')
But the result is not what I want: it shows NaN in the name and age columns. I think the int64 vs. object dtype mismatch is where the mistake is; does anyone know how to work this out?
E,B,C,D,name,age
1,2,3,4.2,NaN,NaN
5,6,7,8.5,NaN,NaN
You just need to edit this in your code:
instead of
example = pd.read_csv('example.txt', header = 2, sep = '\t')
Use this:
example = pd.read_csv('example.txt', sep = ' ' ,index_col= False)
Actually I tried reading your files with:
example = pd.read_csv('example.txt', header = 2, sep = '\t')
# Renaming
example.columns = ['E','B','C','D']
database = pd.read_csv('database.csv')
New_data = example.merge(database, on='E', how='left')
And this returns:
E B C D name age
0 1 2 3 4.2 john 23
1 5 6 7 8.5 mary 24
EDIT: actually, the separator of the original example.txt file is not clear. If it is spaces, try sep='\s+' (any run of whitespace) instead of sep=' '.
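Regarding the int64-vs-object suspicion in the question: if column E really comes out of the text file as strings, casting it before the merge should make the keys comparable. A sketch, assuming every value in E is a clean integer literal:
# database['E'] is int64, so coerce the text-file key column to int as well.
request['E'] = request['E'].astype('int64')
New_data = request.merge(database, on='E', how='left')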