How to have multiple rows under the same row index using pandas - python

I'm writing a script to normalise data from RT-PCR. I am reading the data from a TSV file and I'm struggling to put it into a pandas data frame so that it's usable. The issue is that several rows share the same index name; is it possible to make the index a hierarchical structure?
I'm using Python 3.6. I've tried .groupby() and .pivot() but I can't seem to get it to do what I want.
import pandas as pd

def calculate_peaks(file_path):
    peaks_tsv = pd.read_csv(file_path, sep='\t', header=0, index_col=0)
My input file is this:
[input file image]
My expected output:
                 EMB.brep1.peak  EMB.brep1.length  EMB.brep2.peak  EMB.brep2.length  EMB.brep3.peak  EMB.brep3.length
primer name
Hv161       0             19276            218.41           20947            218.39           21803            218.26
            1             22906            221.35           26317            221.17           26787            221.21
Hv223       0              4100            305.24            5247            305.37            4885            305.25
            1              2593            435.25            3035            435.30            2819            435.32
            2              4864            597.40            5286            597.20            4965            596.60
Actual Output:
             EMB.brep1.peak  EMB.brep1.length  EMB.brep2.peak  EMB.brep2.length  EMB.brep3.peak  EMB.brep3.length
primer name
Hv161                 19276            218.41           20947            218.39           21803            218.26
Hv161                 22906            221.35           26317            221.17           26787            221.21
Hv223                  4100            305.24            5247            305.37            4885            305.25
Hv223                  2593            435.25            3035            435.30            2819            435.32
Hv223                  4864            597.40            5286            597.20            4965            596.60

You can do this:
peaks_tsv = pd.read_csv(file_path, sep='\t', header=0)
peaks_tsv['idx'] = peaks_tsv.groupby('primer name').cumcount()
peaks_tsv.set_index(['primer name', 'idx'], inplace=True)
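For context, a minimal sketch (with a made-up two-column frame) of what groupby().cumcount() does here: it numbers the rows within each 'primer name' group from 0, and that counter becomes the second level of the MultiIndex.

import pandas as pd

# Hypothetical data mirroring the question's layout.
df = pd.DataFrame({'primer name': ['Hv161', 'Hv161', 'Hv223'],
                   'EMB.brep1.peak': [19276, 22906, 4100]})
df['idx'] = df.groupby('primer name').cumcount()  # 0, 1 within Hv161; 0 within Hv223
df.set_index(['primer name', 'idx'], inplace=True)
print(df)
#                  EMB.brep1.peak
# primer name idx
# Hv161       0             19276
#             1             22906
# Hv223       0              4100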


Filter csv table to have just 2 columns. Python pandas

I got a .csv file with lines like this:
result,table,_start,_stop,_time,_value,_field,_measurement,device
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:35Z,44.61,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:40Z,17.33,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:45Z,41.2,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:51Z,33.49,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:56Z,55.68,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:57Z,55.68,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:13:02Z,25.92,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:13:08Z,5.71,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
I need to make them look like this:
time value
0 2022-10-24T12:12:35Z 44.61
1 2022-10-24T12:12:40Z 17.33
2 2022-10-24T12:12:45Z 41.20
3 2022-10-24T12:12:51Z 33.49
4 2022-10-24T12:12:56Z 55.68
I will need that for my anomaly detection code so I don't have to manually delete columns and so on, at least not all of them. I can't do it in the program that works with the machine that collects the wattage info.
I tried this, but it doesn't quite work:
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df['_time'] = pd.to_datetime(df['_time'], format='%Y-%m-%dT%H:%M:%SZ')
df = pd.pivot(df, index = '_time', columns = '_field', values = '_value')
df.interpolate(method='linear') # not necessary
It gives this output:
0
9 83.908
10 80.342
11 79.178
12 75.621
13 72.826
... ...
73522 10.726
73523 5.241
Here is the canonical way to project down to a subset of columns in the pandas ecosystem.
df = df[['_time', '_value']]
You can simply use the usecols keyword argument of pandas.read_csv:
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv', usecols=["_time", "_value"])
NB: If you need to read the entire .csv and only then select a subset of columns, pandas core developers suggest using pandas.DataFrame.loc. Otherwise, with the df = df[subset_of_cols] syntax, the moment you start doing operations on the sub-dataframe you'll get a warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
So, in your case you can use:
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df = df.loc[:, ["_time", "_value"]] #instead of df[["_time", "_value"]]
Another option is pandas.DataFrame.copy,
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df = df[["_time", "_value"]].copy()
pandas.read_csv has a usecols parameter to specify which columns you want in the DataFrame.
df = pd.read_csv(f, header=0, usecols=['_time', '_value'])
print(df)
_time _value
0 2022-10-24T12:12:35Z 44.61
1 2022-10-24T12:12:40Z 17.33
2 2022-10-24T12:12:45Z 41.20
3 2022-10-24T12:12:51Z 33.49
4 2022-10-24T12:12:56Z 55.68
5 2022-10-24T12:12:57Z 55.68
6 2022-10-24T12:13:02Z 25.92
7 2022-10-24T12:13:08Z 5.71

pandas: Use truncated filename as header for column in new dataframe from multiple csv files, read specific columns, set date as index

I read multiple questions similar to this one but not specifically addressing this use case.
I have multiple ticker.csv files in a folder such as:
ZZZ.TO.csv containing:
Date Open High Low Close Volume
0 2017-03-14 28.347332 28.347332 27.871055 28.267952 22400
1 2017-03-15 28.320875 28.400254 27.959257 28.188574 39200
2 2017-03-16 28.179758 28.797155 28.126837 28.708954 51600
3 2017-03-17 28.576658 28.691315 28.091559 28.550196 57400
I would like to create a dataframe containing all 'Date' and 'Close' data from each file. Set 'Date' as the index and have each ticker as the column header in the final dataframe.
So the final dataframe would look like this:
Date FOO.TO ZOMD.V ZEN.V TICKER.BAR
2017-03-14 28.347332 28.347332 27.871055 28.267952
2017-03-15 28.320875 28.400254 27.959257 28.188574
2017-03-16 28.179758 28.797155 28.126837 28.708954
2017-03-17 28.576658 28.691315 28.091559 28.550196
This is what I tried:
import pandas as pd
import glob
path = r'/path_where_files_are/'
all_files = glob.glob(path + "/*.csv")
all_files.sort()
fields = ['Date','Close']
list = []
for filename in all_files:
    df = pd.read_csv(filename, header=0, usecols=fields)
    df.set_index(['Date'], inplace=True)
    list.append(df)
frame = pd.concat(list, axis=0)
but it produces:
Date Close
2017-03-14 0.050000
2017-09-21 0.040000
2017-09-22 0.040000
2017-10-13 0.100000
2017-10-16 0.110000
Any help is welcome. Cheers.
You can try:
import pandas as pd
import pathlib
path = pathlib.Path(r'./data2')
data = {}
for filename in sorted(path.glob('*.csv')):
    data[filename.stem] = pd.read_csv(filename, index_col='Date',
                                      usecols=['Date', 'Close'],
                                      parse_dates=['Date']).squeeze()
df = pd.concat(data, axis=1)
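One detail worth spelling out: .squeeze() converts each single-column DataFrame into a Series, so pd.concat(data, axis=1) can use the dict keys (the file stems, e.g. ZZZ.TO) as the column names. A tiny sketch:

import pandas as pd

one_col = pd.DataFrame({'Close': [28.267952, 28.188574]})
s = one_col.squeeze()        # now a Series, not a one-column DataFrame
print(type(s).__name__)      # Series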
Output:
>>> df
ZEN.V ZZZ.TO
Date
2017-03-14 28.267952 28.267952
2017-03-15 28.188574 28.188574
2017-03-16 28.708954 28.708954
2017-03-17 28.550196 28.550196
A few things that can help you (see the sketch after this list):
You want to concatenate horizontally, so use pd.concat(..., axis=1) or pd.concat(..., axis='columns');
Don't forget to rename the Close column in each dataframe after you read it;
It is good practice not to overwrite names of Python built-ins, so instead of list, use something descriptive, e.g. dfs_to_merge.
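For instance, a minimal sketch applying those three tips to the original loop (the path and column names are taken from the question):

import glob
import os
import pandas as pd

all_files = sorted(glob.glob(r'/path_where_files_are/*.csv'))

dfs_to_merge = []  # descriptive name instead of shadowing `list`
for filename in all_files:
    ticker = os.path.basename(filename).rsplit('.csv', 1)[0]  # e.g. 'ZZZ.TO'
    df = pd.read_csv(filename, usecols=['Date', 'Close'], index_col='Date')
    df = df.rename(columns={'Close': ticker})  # rename Close to the ticker
    dfs_to_merge.append(df)

frame = pd.concat(dfs_to_merge, axis=1)  # concatenate horizontally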

Read multiple yaml files to pandas Dataframe

I do realize this has already been addressed (e.g., Reading csv zipped files in python, How can I parse a YAML file in Python, Retrieving data from a yaml file based on a Python list). Nevertheless, I hope this question is different.
I know how to load a single YAML file into a pandas dataframe:
import yaml
import pandas as pd
with open(r'1000851.yaml') as file:
    df = pd.io.json.json_normalize(yaml.load(file))
df.head()
I would like to read several yaml files from a directory into a pandas DataFrame and concatenate them into one big DataFrame, but I have not been able to figure it out.
import pandas as pd
import glob
path = r'../input/cricsheet-a-retrosheet-for-cricket/all' # use your path
all_files = glob.glob(path + "/*.yaml")
li = []
for filename in all_files:
    df = pd.json_normalize(yaml.load(filename, Loader=yaml.FullLoader))
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
Error
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<timed exec> in <module>
/opt/conda/lib/python3.7/site-packages/pandas/io/json/_normalize.py in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
268
269 if record_path is None:
--> 270 if any([isinstance(x, dict) for x in y.values()] for y in data):
271 # naive normalization, this is idempotent for flat records
272 # and potentially will inflate the data considerably for
/opt/conda/lib/python3.7/site-packages/pandas/io/json/_normalize.py in <genexpr>(.0)
268
269 if record_path is None:
--> 270 if any([isinstance(x, dict) for x in y.values()] for y in data):
271 # naive normalization, this is idempotent for flat records
272 # and potentially will inflate the data considerably for
AttributeError: 'str' object has no attribute 'values'
Sample Dataset Zipped
Sample Dataset
Is there a way to do this and read files efficiently?
It seems the first part of your code and the second one you added are different.
The first part correctly reads yaml files, but the second part is broken:
for filename in all_files:
    # `filename` here is just a string containing the name of the file.
    df = pd.json_normalize(yaml.load(filename, Loader=yaml.FullLoader))
    li.append(df)
The problem is that you need to read the files. Currently you're just passing the filename, not the file content. Do this instead:
li = []
# Only loading 3 files:
for filename in all_files[:3]:
    with open(filename, 'r') as fh:
        df = pd.json_normalize(yaml.safe_load(fh.read()))
    li.append(df)

>>> len(li)
3
>>> pd.concat(li)
output:
innings meta.data_version meta.created meta.revision info.city info.competition ... info.player_of_match info.teams info.toss.decision info.toss.winner info.umpires info.venue
0 [{'1st innings': {'team': 'Glamorgan', 'delive... 0.9 2020-09-01 1 Bristol Vitality Blast ... [AG Salter] [Glamorgan, Gloucestershire] field Gloucestershire [JH Evans, ID Blackwell] County Ground
0 [{'1st innings': {'team': 'Pune Warriors', 'de... 0.9 2013-05-19 1 Pune IPL ... [LJ Wright] [Pune Warriors, Delhi Daredevils] bat Pune Warriors [NJ Llong, SJA Taufel] Subrata Roy Sahara Stadium
0 [{'1st innings': {'team': 'Botswana', 'deliver... 0.9 2020-08-29 1 Gaborone NaN ... [A Rangaswamy] [Botswana, St Helena] bat Botswana [R D'Mello, C Thorburn] Botswana Cricket Association Oval 1
[3 rows x 18 columns]
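On the efficiency part of the question: parsing is usually the bottleneck, and PyYAML's C loader (available when PyYAML is built against libyaml) is much faster than the pure-Python one. A hedged sketch that falls back gracefully, reusing the question's path:

import glob
import yaml
import pandas as pd

try:
    from yaml import CSafeLoader as Loader  # C implementation, much faster
except ImportError:
    from yaml import SafeLoader as Loader   # pure-Python fallback

li = []
for filename in glob.glob('../input/cricsheet-a-retrosheet-for-cricket/all/*.yaml'):
    with open(filename) as fh:
        li.append(pd.json_normalize(yaml.load(fh, Loader=Loader)))
frame = pd.concat(li, axis=0, ignore_index=True)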

Pandas: creating a consolidated report from Excel

I have an Excel file with the detail below. I am trying to use pandas to get only the first 5 languages and the sum of their code lines into an Excel file.
files language blank comment code
61 Java 1031 533 3959
10 Maven 73 66 1213
12 JSON 0 0 800
32 XML 16 74 421
7 HTML 14 16 161
1 Markdown 23 0 39
1 CSS 0 0 1
Below is my code
import pandas as pd
from openpyxl import load_workbook
df = pd.read_csv("myfile_cloc.csv", nrows=20)
#df = df.iloc[1:]
top_five = df.head(5)
print(top_five)
print(top_five['language'])
print(top_five['code'].sum())
d = {'Languages (CLOC) (Top 5 Only)': "", 'LOC (CLOC)Only Code': 0}
newdf = pd.DataFrame(data=d)
newdf['Languages (CLOC) (Top 5 Only)'] = str(top_five['language'])
newdf['LOC (CLOC)Only Code'] = top_five['code'].sum()
#Load excel to append the consolidated info
writer = pd.ExcelWriter("myfile_cloc.xlsx", engine='openpyxl')
book = load_workbook('myfile_cloc.xlsx')
writer.book = book
newdf.to_excel(writer, sheet_name='top_five', index=False)
writer.save()
I need suggestions on these lines,
newdf['Languages (CLOC) (Top 5 Only)'] = str(top_five['language'])
newdf['LOC (CLOC)Only Code'] = top_five['code'].sum()
so that the expected output can be:
Languages (CLOC) (Top 5 Only) LOC (CLOC)Only Code
Java,Maven,JSON,XML,HTML 6554
Presently I am getting this error:
raise ValueError('If using all scalar values, you must pass'
ValueError: If using all scalar values, you must pass an index
Try this. One way to solve it is to pass an index when building the DataFrame:
a = df.head()
df = pd.DataFrame({"Languages (CLOC) (Top 5 Only)": ','.join(a['language'].unique()),
                   "LOC (CLOC)Only Code": a['code'].sum()},
                  index=range(1))
Another way to solve it: use from_records and pass a list of dicts to DataFrame.
df = pd.DataFrame.from_records([{"Languages (CLOC) (Top 5 Only)": ','.join(a['language'].unique()),
                                 "LOC (CLOC)Only Code": a['code'].sum()}])
Output:
Languages (CLOC) (Top 5 Only) LOC (CLOC)Only Code
0 Java,Maven,JSON,XML,HTML 6554
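As a side note on the writer section of the question: with pandas 0.24+ you can open the ExcelWriter in append mode instead of loading the workbook manually with load_workbook. A minimal sketch, assuming myfile_cloc.xlsx already exists and does not yet contain a top_five sheet:

import pandas as pd

# newdf: the one-row summary frame built above (values from the expected output).
newdf = pd.DataFrame.from_records([{'Languages (CLOC) (Top 5 Only)': 'Java,Maven,JSON,XML,HTML',
                                    'LOC (CLOC)Only Code': 6554}])

with pd.ExcelWriter('myfile_cloc.xlsx', engine='openpyxl', mode='a') as writer:
    newdf.to_excel(writer, sheet_name='top_five', index=False)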
import pandas as pd
sheet1 = pd.read_csv("/home/mycomputer/Desktop/practise/sorting_practise.csv")
sheet1.head()
sortby_blank = sheet1.sort_values('blank', ascending=False)
sortby_blank['blank'].head(5).sum()
values = sortby_blank['blank'].head(5).sum()
/home/nptel/Desktop/practise/sorting_practise.csv ---> file path
blank ---> the column you want to sort by
Use .tail() if you need the bottom values.
The "values" variable will have the answer you are looking for.

Name a column added to a pandas dataframe

I have the following csv file that I process as follows:
import pandas as pd
df = pd.read_csv('file.csv', sep=',',header=None)
id ocr raw_value
00037625-4706-4dfe-a7b3-de8c47e3a28d A 3
000a7b30-4c4f-4756-a757-f688ccc55d5d A /c
000b08e3-4129-4fd2-8ec0-23d00fe38a45 A yes
00196436-12bc-4024-b623-25bac586d314 A know
001b8c43-3e73-43c1-ba4f-df5edb10dfac A hi
002882ca-48bb-4161-a75a-cf0ec984d650 A fd
003b2890-3727-4c79-955a-f74ec6945ed7 A Sensible
004d9025-86f0-4f8c-9720-01e3385c5e77 A 2015
Now I want to add a new column:
df['val'] = None
for img in images:
    id, ext = img.rsplit('.', 1)
    idx = df[df[0] == id].index.values
    df.loc[df.index[idx], 'val'] = id
When I write df to a new file as follows:
df.to_csv('new_file.csv', sep=',',encoding='utf-8')
I noticed that the column is correctly added and filled, but it remains without a name, and it's supposed to be named val.
id ocr raw_value
00037625-4706-4dfe-a7b3-de8c47e3a28d A 3 4
000a7b30-4c4f-4756-a757-f688ccc55d5d A /c 3
000b08e3-4129-4fd2-8ec0-23d00fe38a45 A yes 1
00196436-12bc-4024-b623-25bac586d314 A know 8
001b8c43-3e73-43c1-ba4f-df5edb10dfac A hi 9
002882ca-48bb-4161-a75a-cf0ec984d650 A fd 10
003b2890-3727-4c79-955a-f74ec6945ed7 A Sensible 14
How do I set the name of the last column added?
EDIT1:
print(df.head())
0 1 2 3
0 id ocr raw_value manual_raw_value
1 00037625-4706-4dfe-a7b3-de8c47e3a28d ABBYY 03 03
2 000a7b30-4c4f-4756-a757-f688ccc55d5d ABBYY y/c y/c
3 000b08e3-4129-4fd2-8ec0-23d00fe38a45 ABBYY armoire armoire
4 00196436-12bc-4024-b623-25bac586d314 ABBYY point point
val
0 None
1 93
2 yic
3 armoire
4 point
You need only read_csv, because sep=',' is the default and can be omitted, and header=None should be used only if the csv has no header:
df = pd.read_csv('file.csv')
The problem is that your first row was not parsed as column names, but as the first data row.
df = pd.read_csv('file.csv', sep=',', header=0, index_col=0)
should allow you to simplify the next portion to
df['val'] = None
for img in images:
    image_id, ext = img.rsplit('.', 1)
    df.loc[image_id, 'val'] = image_id
If you don't need the image_id as index afterwards, use df.reset_index(inplace=True)
One easy way: before to_csv, do
df.columns.values[3] = "val"
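Note, though, that this relies on mutating the underlying array of an Index, which pandas treats as immutable, so it may not stick in every version. A safer alternative is rename:

df = df.rename(columns={df.columns[3]: 'val'})  # rename the 4th column, whatever its label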
