Below is a CSV snippet with some dummy header rows before the actual frame, which is anchored by beerId:
This work is an unpublished, copyrighted work and contains confidential information.
beer consumption
consumptiondate 7/24/2018
consumptionlab H1
numbeerssuccessful 40
numbeersfailed 0
totalnumbeers 40
consumptioncomplete TRUE
beerId Book
341027 Northern Light
This code works: df = pd.read_csv(path_csv, header=8). The problem is that the header row is not always at row 8; it moves from day to day. I cannot figure out how to use the callable form described in the help, as in
skiprows : list-like or integer or callable, default None
Line numbers to skip (0-indexed) or number of lines to skip (int) at
the start of the file.
If callable, the callable function will be evaluated against the row
indices, returning True if the row should be skipped and False
otherwise. An example of a valid callable argument would be lambda x:
x in [0, 2].
to find the row index of beerId.
I think some preprocessing is needed first (the skiprows callable only receives row indices, not the row contents, so it cannot search for beerId by itself):
path_csv = 'file.csv'
with open(path_csv) as f:
    lines = f.readlines()
# get the indices of all lines starting with beerId
num = [i for i, l in enumerate(lines) if l.startswith("beerId")]
# default to 0 if not found, otherwise take the first matching index
num = 0 if len(num) == 0 else num[0]
print (num)
8
df = pd.read_csv(path_csv, header=num)
print (df)
beerId Book
0 341027 Northern Light
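If the file is large, a variant of the same preprocessing can stop at the first match instead of reading every line into memory (just a sketch of the same idea, reusing path_csv from above):
num = 0
with open(path_csv) as f:
    for i, line in enumerate(f):
        if line.startswith("beerId"):
            num = i
            break
df = pd.read_csv(path_csv, header=num)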
I want to write a program which searches through a data frame and, if any of the items in it are longer than 50 characters, prints the row number and asks if you want to continue through the data frame.
threshold = 50
mask = (df.drop(columns=exclude, errors='ignore')
.apply(lambda s: s.str.len().ge(threshold))
)
out = df.loc[~mask.any(axis=1)]
I tried using this, but I don't want to drop the rows, just print the row numbers where the strings exceed 50
Input:
0 "Robert","20221019161921","London"
1 "Edward","20221019161921","London"
2 "Johnny","20221019161921","London"
3 "Insane string which is way too longggggggggggg","20221019161921","London"
Output:
Row 3 is above the 50-character limit.
I would also like the program to print the specific value or string which is too long.
You can use:
exclude = []
threshold = 30
mask = (df.drop(columns=exclude, errors='ignore')
.apply(lambda s: s.str.len().ge(threshold))
)
s = mask.any(axis=1)
for idx in s[s].index:
    print(f'row {idx} is above the {threshold}-character limit.')
    s2 = mask.loc[idx]
    for string in df.loc[idx, s2.reindex(df.columns, fill_value=False)]:
        print(string)
Output:
row 3 is above the 30-character limit.
"Insane string which is way too longggggggggggg","20221019161921","London"
Intermediate s:
0 False
1 False
2 False
3 True
dtype: bool
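If you also want to report which column holds the offending value, the same mask can be stacked into (row, column) pairs (a small sketch building on the code above, not part of the original answer):
offenders = mask.stack()                      # MultiIndex of (row, column) -> bool
for row, col in offenders[offenders].index:   # keep only the True entries
    print(f'row {row}, column {col} exceeds {threshold} characters: {df.loc[row, col]}')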
I am working on a python program that changes the format of an existing CSV
This is the Goal format for the CSV
This is the Original State
3 obstacles:
1. remove "-" from MODELCHASS (complete)
2. add "-" to Prod Date, and also a 'T' (complete)
3. change the PL Seq list to times, or possibly create a new column with times
Condition: start at 08:00:00
Condition: each line increases by 1 hour
Condition: restart for each date
Condition: restart for each H seq
Steps 1 and 2 I have figured out, but I am lost on step 3;
this is my code so far
import pandas as pd
df = pd.read_csv("AMS truck schedule.txt",delimiter=';')
df.to_csv('Demo1.csv')
import csv
with open('Demo1.csv', 'r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    order_numbers = []
    csvtimes = []
    sequence = []
    for line in csv_reader:
        order_numbers.append(line['MODELCHASS'])
        csvtimes.append(line['Prod Date'])
        sequence.append(line['PL Seq'])
# replace the dash in the order numbers
on = [sub.replace("-", "") for sub in order_numbers]
print(on[1225])
newtimes = [x[0] + x[1] + x[2] + x[3] + "-" + x[4] + x[5] + "-" + x[6] + x[7] + "T" for x in csvtimes]
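For reference, the same date reformatting can be written more compactly with slices (a sketch; it assumes each Prod Date value begins with YYYYMMDD, which is what the concatenation above implies):
newtimes = [f"{x[:4]}-{x[4:6]}-{x[6:8]}T" for x in csvtimes]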
I am not 100% sure when you want to restart.
From what I understand, you restart the hour if:
a) the last digit in the Prod Date column changes
b) the first letter in the PL Seq changes
What if we reach 24h and neither a nor b is True? Do we continue past 24h until a or b becomes True?
Anyway, you can add more conditions. I don't know if it is the most efficient way, but it works. Before doing it you have to create a column:
df['hours'] = 8, so it's a column with all rows equal to 8,
and read the file with df = pd.read_csv(filename).
prev_row = None
for index, row in df.iterrows():
    if prev_row is not None:
        # same first letter of PL Seq and same last two digits of the date -> keep counting
        if (row['pl'][0] == prev_row['pl'][0]) and (str(row['date'])[-2:] == str(prev_row['date'])[-2:]):
            row['hours'] = prev_row['hours'] + 1
            print(row)
        else:
            # either key changed -> restart at 8
            row['hours'] = 8
    df.iloc[index] = row
    prev_row = row
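A vectorized sketch of the same run logic (it reuses the 'date' and 'pl' column names from the loop above, which stand in for the real column names): give every consecutive run with an unchanged key its own id, then count up from 8 within each run.
key = df['date'].astype(str).str[-2:] + df['pl'].astype(str).str[0]
run_id = key.ne(key.shift()).cumsum()              # new id whenever the key changes
df['hours'] = df.groupby(run_id).cumcount() + 8    # 8, 9, 10, ... within each run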
Overview: I am working with pandas dataframes of census information; they have only two columns but are several hundred thousand rows long. One column is a census block ID number and the other is a 'place' value, which is unique to the city in which that census block ID resides.
Example Data:
BLOCKID PLACEFP
0 60014001001000 53000
1 60014001001001 53000
...
5844 60014099004021 53000
5845 60014100001000
5846 60014100001001
5847 60014100001002 53000
Problem: As shown above, several place values are blank even though their rows have a census block ID. What I found is that in several instances, a census block ID missing its place value is located within the same city as the surrounding blocks that do have a place value, especially if the bookending place values are the same. As shown above with indexes 5844 through 5847, the two blank blocks are located within the same general area as the surrounding blocks but just seem to be missing the place value.
Goal: I want to be able to go through this dataframe, find these instances and fill in the missing place value, based on the place value before the missing value and the place value that immediately follows.
Current State & Obstacle: I wrote a loop that goes through the dataframe to correct these issues, shown below.
current_state_blockid_df = pandas.DataFrame(
    {'BLOCKID': [60014099004021, 60014100001000, 60014100001001, 60014100001002,
                 60014301012019, 60014301013000, 60014301013001, 60014301013002,
                 60014301013003, 60014301013004, 60014301013005, 60014301013006],
     'PLACEFP': [53000, '', '', 53000, 11964, '', '', '', '', '', '', 11964]})
for i in current_state_blockid_df.index:
    if current_state_blockid_df.loc[i, 'PLACEFP'] == '':
        # Get the value before the blank
        prior_place_fp = current_state_blockid_df.loc[i - 1, 'PLACEFP']
        next_place_fp = ''
        _n = 1
        # Find the end of the blank section
        while next_place_fp == '':
            next_place_fp = current_state_blockid_df.loc[i + _n, 'PLACEFP']
            if next_place_fp == '':
                _n += 1
        # If the blanks are likely in the same city, assign them that city's place value
        if prior_place_fp == next_place_fp:
            for _i in range(_n):
                current_state_blockid_df.loc[i + _i, 'PLACEFP'] = prior_place_fp
However, as expected, it is very slow when dealing with hundreds of thousands of rows of data. I have considered using a ThreadPoolExecutor to split up the work, but I haven't quite figured out the logic I'd use to get that done. One way to speed it up slightly would be to eliminate the check for the end of the gap and just fill every blank with whatever place value preceded it. While that may end up being my go-to, there's still a chance it's too slow, and ideally I'd like it to fill in only when the before and after values match, eliminating the possibility of a block being mistakenly assigned. If someone has another suggestion as to how this could be achieved quickly, it would be very much appreciated.
You can use shift to help speed up the process. However, this doesn't solve for cases where there are multiple blanks in a row.
df['PLACEFP_PRIOR'] = df['PLACEFP'].shift(1)
df['PLACEFP_SUBS'] = df['PLACEFP'].shift(-1)
criteria1 = df['PLACEFP'].isnull()
criteria2 = df['PLACEFP_PRIOR'] == df['PLACEFP_SUBS']
df.loc[criteria1 & criteria2, 'PLACEFP'] = df.loc[criteria1 & criteria2, 'PLACEFP_PRIOR']
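For runs of more than one consecutive blank, a forward/backward fill comparison generalizes the same idea (a sketch, not part of the answer above; like the shift version, it assumes the blanks are NaN):
prior = df['PLACEFP'].ffill()    # last known place value before each row
subs = df['PLACEFP'].bfill()     # next known place value after each row
fill_mask = df['PLACEFP'].isnull() & (prior == subs)
df.loc[fill_mask, 'PLACEFP'] = prior[fill_mask]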
If you end up needing to iterate over the dataframe, use df.itertuples. You can access the column values in the row via dot notation (row.column_name).
for row in df.itertuples():
    # logic goes here; the index is available as row.Index
Using your dataframe as defined
def fix_df(current_state_blockid_df):
    df_with_blanks = current_state_blockid_df[current_state_blockid_df['PLACEFP'] == '']
    df_no_blanks = current_state_blockid_df[current_state_blockid_df['PLACEFP'] != '']

    # group consecutive blank indexes into sections keyed by their first index
    sections = {}
    last_i = 0
    grouping = []
    for i in df_with_blanks.index:
        if i - 1 == last_i:
            grouping.append(i)
            last_i = i
        else:
            last_i = i
            if len(grouping) > 0:
                sections[min(grouping)] = {'indexes': grouping}
                grouping = []
            grouping.append(i)
    if len(grouping) > 0:
        sections[min(grouping)] = {'indexes': grouping}

    # the place value just before each section is used to fill it
    for i in sections.keys():
        sections[i]['place'] = current_state_blockid_df.loc[i - 1, 'PLACEFP']

    l = []
    for i in sections:
        for x in sections[i]['indexes']:
            l.append(sections[i]['place'])

    df_with_blanks['PLACEFP'] = l
    final_df = pandas.concat([df_with_blanks, df_no_blanks]).sort_index(axis=0)
    return final_df

df = fix_df(current_state_blockid_df)
print(df)
Output:
BLOCKID PLACEFP
0 60014099004021 53000
1 60014100001000 53000
2 60014100001001 53000
3 60014100001002 53000
4 60014301012019 11964
5 60014301013000 11964
6 60014301013001 11964
7 60014301013002 11964
8 60014301013003 11964
9 60014301013004 11964
10 60014301013005 11964
11 60014301013006 11964
I am extracting a number value from a CSV column like:
column = [None, you earn 5%]
It would be great if it could store 'None' as 0 and simply '5%' for the second one.
I tried to extract the % with the following code, but it raises the error
"TypeError: expected string or bytes-like object"
data.loc[(data['column'] == re.findall(r'([\w]+)', data['column'])), 'disc'] = re.findall(r'([0-9]+\%)',data['column'])
I also tried a for loop, but it doesn't seem helpful:
def fs(a):
    for i in a:
        if i == 'None':
            a[i] = 0
        else:
            a[i] = re.search(r'(?<=\().+?(?=\))', a[i])
If you have a DataFrame that has a string column and you want to replace the string 'None' by 0 and also keep numbers and %, then do:
df.textColumn.str.replace("None", "0").str.replace("[^0-9.%]", "", regex=True)
Example:
import pandas as pd
df = pd.DataFrame({'n':[1,2,3,4], 'text':["None","you earn 5%", "this is 3.4%", "5.5"]})
df['text'] = df.text.str.replace("None", "0").str.replace("[^0-9.%]", "", regex=True)
df
n text
0 1 0
1 2 5%
2 3 3.4%
3 4 5.5
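If the goal is specifically to store 'None' as the number 0, as the question asks, str.extract can pull out just the numeric part instead (a sketch using the same example data; the column name 'text' is taken from the example above):
import pandas as pd
df = pd.DataFrame({'text': ["None", "you earn 5%", "this is 3.4%", "5.5"]})
num = (df.text
         .str.extract(r'(\d+(?:\.\d+)?)', expand=False)   # first number in each string
         .astype(float)
         .fillna(0))                                       # 'None' has no digits -> 0
print(num.tolist())                                        # [0.0, 5.0, 3.4, 5.5]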
I use Python and I have data of 35,000 rows. I need to change values with a loop, but it takes too much time.
PS: I have columns named succes_1, succes_2, succes_5, succes_7 ... succes_120, so I get the name of the column from the other loop; the values depend on the other column.
Example:
SK_1 Sk_2 Sk_5 .... SK_120 Succes_1 Succes_2 ... Succes_120
1 0 1 0 1 0 0
1 1 0 1 2 1 1
for i in range(len(data_jeux)):
    for d in range(len(succ_len)):
        ids = succ_len[d]
        if data_jeux['SK_%s' % ids][i] == 1:
            data_jeux.iloc[i]['Succes_%s' % ids] = 1 + i
I am asking if there is a faster way to solve this problem. I tried:
data_jeux.values[i, ('Succes_%s' % ids)] = 1+i
but it returns the following error; maybe it doesn't accept a string index.
You can define the columns and then use loc to increment. It's not clear whether your columns are naturally ordered; if they aren't, you can use sorted with a custom key function, since string-based sorting would put '100' before '20'.
def splitter(x):
    return int(x.rsplit('_', maxsplit=1)[-1])

cols = df.columns
sk_cols = sorted(cols[cols.str.startswith('SK')], key=splitter)
succ_cols = sorted(cols[cols.str.startswith('Succes')], key=splitter)
for sk, succ in zip(sk_cols, succ_cols):
    df.loc[df[sk] == 1, succ] += 1
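If the goal is the question's original rule (Succes_k = 1 + row position wherever SK_k equals 1) rather than a plain increment, a NumPy sketch of the same idea is (it assumes sk_cols and succ_cols line up pairwise, which the sorting above ensures):
import numpy as np
sk = (df[sk_cols] == 1).to_numpy()                              # True where SK_k == 1
rows = np.arange(len(df)).reshape(-1, 1) + 1                    # 1 + i for each row i
df[succ_cols] = np.where(sk, rows, df[succ_cols].to_numpy())    # keep the old value elsewhere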