Create DataFrame from raw input - python

I am getting data as follows:
$0011:0524-08-2021
$0021:0624-08-2021
&0011:0724-08-2021
&0021:0924-08-2021
$0031:3124-08-2021
&0031:3224-08-2021
$0041:3924-08-2021
&0041:3924-08-2021
$0012:3124-08-2021
&0012:3324-08-2021
In $0011:0524-08-2021, $ denotes the start of a string, 001 denotes the ID, 1:05 denotes the time, and 24-08-2021 denotes the date. Similarly, in &0011:0624-08-2021 everything is the same except that & denotes the end of a string.
Taking the above data, I want to create a data frame as follows:
1. $0011:0524-08-2021 &0011:0724-08-2021
2. $0021:0624-08-2021 &0021:0924-08-2021
3. $0031:3124-08-2021 &0031:3224-08-2021
4. $0041:3924-08-2021 &0041:3924-08-2021
5. $0012:3124-08-2021 &0012:3324-08-2021
Basically I want to sort the entries into a data frame as shown above. There are a few conditions that must be satisfied in doing so:
1.) Column1 should have only $ entries and Column2 should have only & entries.
2.) Both columns should be arranged in increasing order of time: Column1 with $ entries should be arranged in increasing order of time, and the same goes for Column2 with & entries.

If you're getting the lines as shown in your example, you can try:
import pandas as pd

def process_lines(lines):
    # remember the last-seen "$" line per ID; emit a pair when the matching "&" arrives
    buffer = {}
    for line in map(str.strip, lines):
        id_ = line[1:4]
        if line[0] == "$":
            buffer[id_] = line
        elif line[0] == "&" and buffer.get(id_):
            yield buffer[id_], line
            del buffer[id_]

txt = """$0011:0524-08-2021
$0021:0624-08-2021
&0011:0724-08-2021
&0021:0924-08-2021
$0031:3124-08-2021
&0031:3224-08-2021
$0041:3924-08-2021
&0041:3924-08-2021
$0012:3124-08-2021
&0012:3324-08-2021"""

df = pd.DataFrame(process_lines(txt.splitlines()), columns=["A", "B"])
print(df)
Prints:
                    A                   B
0  $0011:0524-08-2021  &0011:0724-08-2021
1  $0021:0624-08-2021  &0021:0924-08-2021
2  $0031:3124-08-2021  &0031:3224-08-2021
3  $0041:3924-08-2021  &0041:3924-08-2021
4  $0012:3124-08-2021  &0012:3324-08-2021
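If the input lines are not guaranteed to arrive already ordered by time, a hedged follow-up (assuming the fixed layout above, and pandas 1.1+ for the key argument of sort_values) is to sort the paired rows by the start entry's embedded time; both columns then end up in increasing time order whenever each & entry follows its $ entry:
from datetime import datetime

# Hypothetical addition, not part of the answer above.
# Layout assumption: 1-char marker + 3-char ID + H:MM time + 10-char date.
start_time = lambda s: datetime.strptime(s[4:-10], "%H:%M").time()
df = df.sort_values("A", key=lambda col: col.map(start_time)).reset_index(drop=True)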

Related

Use a value from one dataframe to lookup the value in another and return an adjacent cell value and update the first dataframe value

I have two datasets (dataframes), one called source and the other crossmap. I am trying to find rows where a specific column's value starts with "999"; if one is found, I need to look up the complete value of that column (e.g. "99912345") in the crossmap dataset (dataframe) and return the value from a column on that row of the crossmap.
# Source Dataframe
        0         1  2          3      4
0  303290    544981  2  408300622  85882
1  321833  99910722  1  408300902  85897
2  323241  99902978  3  408056001  95564

# Cross Map Dataframe
     ID  NDC ID  DIN(NDC)            GTIN                    NAME  PRDID
  44563  321833  99910722  99910722000000  SALBUTAMOL SULFATE (A)  90367
  69281  321833  99910722  99910722000000  SALBUTAMOL SULFATE (A)  90367
6002800  323241  99902978  75402850039706         EPINEPHRINE (A)  95564
8001116  323241  99902978  99902978000000         EPINEPHRINE (A)  95564
The 'straw dog' logic I am working with is this:
1. Search the source file and find '999' entries in column 1:
df_source[df_source['Column1'].str.contains('999')]
2. Iterate through the rows returned and search for each column-1 value in the crossmap dataframe's DIN(NDC) column, returning the corresponding PRDID.
3. Update the source dataframe with the PRDID, and write the updated file.
It is these last two pieces of logic I am struggling with. I appreciate any direction/guidance anyone can provide.
Is there maybe a better/easier means of doing this using Python but not pandas/dataframes?
So, as far as I understand: we are looking for values in column 1 of the 'Source Dataframe' that start with the digits 999. Next, we find these values in the 'Cross Map' column 'DIN(NDC)' and get the values of the 'PRDID' column on those rows.
If everything is correct so far, then I can't quite follow what your further actions should be.
import pandas as pd
import more_itertools as mit

Cross_Map = pd.DataFrame({'DIN(NDC)': [99910722, 99910722, 99902978, 99902978],
                          'PRDID': [90367, 90367, 95564, 95564]})
df = pd.DataFrame({0: [303290, 321833, 323241], 1: [544981, 99910722, 99902978],
                   2: [2, 1, 3], 3: [408300622, 408300902, 408056001],
                   4: [85882, 85897, 95564]})

m = [i for i in df[1] if str(i)[:3] == '999']  # find the values in column 1
index = list(mit.locate(list(Cross_Map['DIN(NDC)']), lambda x: x in m))  # indexes of the matched DIN(NDC) values
print(Cross_Map['PRDID'][index])
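To also cover the last two steps from the question (updating the source and writing it out), here is a hedged pandas-only sketch, assuming the frames above and that the PRDID should replace the matched value in column 1; drop_duplicates keeps one PRDID per DIN(NDC):
# Hedged sketch, not the answerer's code: map each 999* value in column 1
# to its PRDID via the crossmap, then write the updated frame to disk.
lookup = Cross_Map.drop_duplicates('DIN(NDC)').set_index('DIN(NDC)')['PRDID']
mask = df[1].astype(str).str.startswith('999')
df.loc[mask, 1] = df.loc[mask, 1].map(lookup)
df.to_csv('updated_source.csv', index=False)  # 'updated_source.csv' is a placeholder name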

How to save output in .csv after every loop without overwriting in Pandas?

I want to save my output to .csv. When I run my while loop and save the output, only the last iteration is saved; it is not saving the values from all iterations.
Also, I want to skip the zero-value rows when printing my output.
This is my code:
import pandas as pd  # pandas library
sample = pd.DataFrame(pd.read_csv("Sample.csv"))  # importing .csv as pandas DataFrame
i = 0
while i <= 23:
    print('Value for', i)  # i value
    sample2 = sample[sample['Hour'] == i]  # data for every hour
    sample3 = sample2[sample2['GHI'] == sample2['GHI'].max(0)]  # rows with the max GHI value for this hour
    sample3 = sample3.loc[sample3.ne(0).all(axis=1)]  # ignoring all rows having zero values
    print(sample3)  # print sample3
    sample3.to_csv('Output.csv')  # trying to save the output after every iteration
    i = i + 1
Another way of doing what you want is to get rid of your loop, like this:
sample_with_max_ghi = sample.assign(max_ghi=sample.groupby('Hour')['GHI'].transform('max'))
sample_filtered = sample_with_max_ghi[sample_with_max_ghi['GHI'] == sample_with_max_ghi['max_ghi']]
output_sample = sample_filtered.loc[sample_filtered.ne(0).all(axis=1)].drop('max_ghi', axis=1)
output_sample.to_csv('Output.csv')
Some explanations :
1.
sample_with_max_ghi = sample.assign(max_ghi=sample.groupby('Hour')['GHI'].transform('max'))
This line adds a new column to your dataframe containing the max of the GHI column for each Hour group.
2.
sample_filtered = sample_with_max_ghi[sample_with_max_ghi['GHI'] == sample_with_max_ghi['max_ghi']]
This line keeps only the rows where the GHI value is actually the max of its Hour group.
3.
output_sample = sample_filtered.loc[sample_filtered.ne(0).all(axis=1)].drop('max_ghi', axis=1)
And this applies the last filter, getting rid of the rows with 0 values.
Alternatively, while the loop is running, add the loop value to the csv file name at every iteration; this makes each file name unique and solves your problem. E.g.:
import pandas as pd  # pandas library
sample = pd.DataFrame(pd.read_csv("Sample.csv"))  # importing .csv as pandas DataFrame
i = 0
while i <= 23:
    print('Value for', i)  # i value
    sample2 = sample[sample['Hour'] == i]  # data for every hour
    sample3 = sample2[sample2['GHI'] == sample2['GHI'].max(0)]  # rows with the max GHI value for this hour
    sample3 = sample3.loc[sample3.ne(0).all(axis=1)]  # ignoring all rows having zero values
    print(sample3)  # print sample3
    sample3.to_csv(str(i) + 'Output.csv')  # save to a differently named file each iteration
    i = i + 1
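If a single cumulative Output.csv is wanted instead of 24 separate files, a hedged variant of the to_csv line above appends on every iteration, writing the header only when the file does not exist yet:
import os

# Hedged alternative to the to_csv line above: append each iteration's rows to
# one cumulative file, writing the header only if the file does not exist yet.
sample3.to_csv('Output.csv', mode='a', header=not os.path.exists('Output.csv'))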

How do I read the first 5 lines in this column and skip to the nth line and read the next 5 lines again until I reach the end of the column data?

I am reading from a csv file using Python and I pulled the data from one column. Every 15 lines is a set of data for one category, but only the first 5 lines from that set are relevant. How can I read the first 5 lines of every 15 lines from a total of 205 lines? It reads every other 15 up to a point and then it begins to misalign. The blue rectangles show where it starts to stray. Here is an image of my data format from the column:
inp = pd.read_csv(input_dir)
win = inp['File '][1:]
ny4 = inp['Unnamed: 24']
df = pd.DataFrame({"Ny/4": ny4})
ny4_len = len(ny4)
#
for i in win:
    itr.append(i.split('_'))
for e in itr:
    plh1.append(e[4:e.index("SFRP")])
plh1 = plh1[1::13]
win_df = pd.DataFrame({'Window': plh1})
for u in win_df['Window']:
    plh2.append(k.join(u))
chicken = len(df)/12
# kow = list(islice(ny4, 1, 17))
#
maxm = pd.concat(list(map(lambda x: x[1:6], np.array_split(ny4, 17))), ignore_index=True)
plh2_df = pd.DataFrame({'Window Name': plh2})
ny4_data = pd.DataFrame(np.reshape(maxm.values, (17, 5)), columns=['Center', 'UL', 'UR', 'LL', 'LR'])
conc = pd.concat([plh2_df, ny4_data], axis=1, sort=True)
[1]: https://i.stack.imgur.com/yMqw8.png
Use pd.concat with np.array_split:
print(pd.concat(list(map(lambda x: x[:5], np.array_split(df, len(df) // 15))), ignore_index=True))
It should work now.
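An equivalent hedged alternative that avoids splitting altogether, and stays aligned even when the row count is not a multiple of 15, is a boolean mask on the row position within each 15-row block:
import numpy as np

# keep rows whose position within each 15-row block is 0..4
out = df[np.arange(len(df)) % 15 < 5].reset_index(drop=True)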

Slicing my data frame is returning unexpected results

I have 13 CSV files that contain billing information in an unusual format. Multiple readings are recorded every 30 minutes of the day. Five days are recorded beside each other (columns). Then the next five days are recorded under it. To make things more complicated, the day of the week, date, and billing day are shown over the first recording of KVAR each day.
The image below shows a small example. However, imagine that KW, KVAR, and KVA repeat 3 more times before continuing some 50 rows later.
My goal was to create a simple Python script that would turn the data into a data frame with the columns: DATE, TIME, KW, KVAR, KVA, and DAY.
The problem is that my script returns NaN data for the KW, KVAR, and KVA columns after the first five days (which correlates with a new iteration of the outer for loop). What is weird to me is that when I try to print out the same ranges I get the data that I expect.
My code is below. I have included comments to help further explain things. I also have an example of sample output of my function.
def make_df(df):
    # starting values
    output = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
    time = df1.loc[3:50, 0]
    val_start = 3
    val_end = 51
    date_val = [0, 2]
    day_type = [1, 2]
    # There are 7 row movements that need to take place.
    for row_move in range(1, 8):
        day = [1, 2, 3]
        date_val[1] = 2
        day_type[1] = 2
        # There are 5 column movements that take place.
        # The basic idea is that I cycle through the five days, grab their data in a
        # temporary dataframe, and then append that dataframe onto the output dataframe.
        for col_move in range(1, 6):
            temp_df = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
            temp_df['TIME'] = time
            # These are the 3 values that stop working after the first column change.
            # I get the values that I expect for the first 5 days.
            temp_df['KW'] = df.iloc[val_start:val_end, day[0]]
            temp_df['KVAR'] = df.iloc[val_start:val_end, day[1]]
            temp_df['KVA'] = df.iloc[val_start:val_end, day[2]]
            # These 2 values work perfectly for the entire data set.
            temp_df['DAY'] = df.iloc[day_type[0], day_type[1]]
            temp_df["DATE"] = df.iloc[date_val[0], date_val[1]]
            # troubleshooting
            print(df.iloc[val_start:val_end, day[0]])
            print(temp_df)
            output = output.append(temp_df)
            # increase values for each iteration of the column loop;
            # seems to work perfectly when I print the data
            day = [x + 3 for x in day]
            date_val[1] = date_val[1] + 3
            day_type[1] = day_type[1] + 3
        # increase values for each iteration of the row loop;
        # seems to work perfectly when I print the data
        date_val[0] = date_val[0] + 55
        day_type[0] = day_type[0] + 55
        val_start = val_start + 55
        val_end = val_end + 55
    return output

test = make_df(df1)
Below is some sample output. It shows where the data starts to break down after the fifth day (or first instance of the column shift in the for loop). What am I doing wrong?
Could be pandas index alignment: pd.append and column assignment require matched row indices for numerical values.
import pandas as pd
import numpy as np

output = pd.DataFrame(np.random.rand(5, 2), columns=['a', 'b'])  # fake data
output['c'] = list('abcde')  # add a column of non-numerical entries
tmp = pd.DataFrame(columns=['a', 'b', 'c'])
tmp['a'] = output.iloc[0:2, 2]  # sets tmp's index to 0..1
tmp['b'] = output.iloc[3:5, 2]  # index 3..4 does not align with 0..1 -> generates NaN
tmp['c'] = output.iloc[0:2, 2]
output = output.append(tmp)
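A hedged one-line fix for that misalignment, which likely applies to the temp_df assignments in the question as well, is to strip the index before assigning so the values align by position rather than by label:
# Strip the index so the values are assigned positionally rather than by label.
tmp['b'] = output.iloc[3:5, 2].to_numpy()  # or .reset_index(drop=True)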
(initial response)
What does df1 look like? Does df.iloc[val_start:val_end, day[0]] have any issue past the fifth day? The code doesn't show how you read from the csv files, or df1 itself.
My guess: if val_start:val_end gives invalid indices on the sixth day, or df1 happens to be malformed past the fifth day, df.iloc[val_start:val_end, day[0]] will return an empty Series object and possibly make its way into temp_df. iloc does not report invalid row slices, though similar column indices would trigger an IndexError.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5,3), columns=['a','b','c'], index=np.arange(5)) # fake data
df.iloc[0:2, 1] # returns the subset
df.iloc[100:102, 1] # returns: Series([], Name: b, dtype: float64)
A little off topic, but I would recommend preprocessing the csv files rather than dealing with the indexing in a Pandas DataFrame, as the original format is kind of complex. Slice the data by date and later use pd.melt or pd.groupby to shape it into the format you like. Alternatively, try a multi-index if you stick with Pandas I/O.
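To illustrate the pd.melt suggestion, here is a minimal hedged sketch with made-up column names; wide per-day columns become long rows that are then easy to group:
import pandas as pd

# Hypothetical wide layout: one KW column per day, loosely mimicking the billing files.
wide = pd.DataFrame({'TIME': ['0:00', '0:30'],
                     'KW_day1': [1.0, 2.0],
                     'KW_day2': [3.0, 4.0]})
long_df = wide.melt(id_vars='TIME', var_name='DAY', value_name='KW')
print(long_df)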

Matching parts of two csv files to return certain elements

Hello, I am looking for some help to do something like an INDEX/MATCH in Excel. I am very new to Python, but my data sets are now far too large for Excel.
I will simplify my question as much as possible, because the data contains a lot of information irrelevant to this problem.
CSV A (has 3 Basic columns)
Name, Date, Value
CSV B (has 2 columns)
Value, Score
CSV C (I want to create this using python; 2 columns)
Name, Score
All I want to do is enter a date and have it look up all rows in CSV A which match that date, then look up the score associated with that row's value in CSV B, and return it in CSV C along with the name of the person. Rinse and repeat through every row.
Any help is much appreciated; I don't seem to be getting very far.
Here is a working script using Python's csv module:
It prompts the user to input a date (the format is m-d-yy), then reads csvA row by row to check whether the date in each row matches the input.
If yes, it checks whether the value that corresponds to that date in the current row of A matches any of the rows in csvB.
If there is a match, it writes the name from csvA and the score from csvB to csvC.
import csv

date = input('Enter date: ').strip()
A = csv.reader(open('csvA.csv', newline=''), delimiter=',')
matches = 0
# reads each row of csvA
for row_of_A in A:
    # removes whitespace before and after each string in the row of csvA
    row_of_A = [string.strip() for string in row_of_A]
    # if the date of the row in csvA equals the inputted date
    if row_of_A[1] == date:
        B = csv.reader(open('csvB.csv', newline=''), delimiter=',')
        # reads each row of csvB
        for row_of_B in B:
            # removes whitespace before and after each string in the row of csvB
            row_of_B = [string.strip() for string in row_of_B]
            # if the value of the row in csvA equals the value of the row in csvB
            if row_of_A[2] == row_of_B[0]:
                # writes the header first if csvC.csv does not exist yet
                try:
                    open('csvC.csv', 'r')
                except FileNotFoundError:
                    with open('csvC.csv', 'a') as C:
                        print('Name,', 'Score', file=C)
                # writes the name from csvA and the score from csvB to csvC
                with open('csvC.csv', 'a') as C:
                    print(row_of_A[0] + ', ' + row_of_B[1], file=C)
                matches += 1  # count this match
m = 'matches' if matches > 1 else 'match'
print('Found', matches, m)
Sample csv files:
csvA.csv
Name, Date, Value
John, 2-6-15, 10
Ray, 3-5-14, 25
Khay, 4-4-12, 30
Jake, 2-6-15, 100
csvB.csv
Value, Score
10, 500
25, 200
30, 300
100, 250
Sample Run:
>>> Enter date: 2-6-15
Found 2 matches
csvC.csv (generated by script)
Name, Score
John, 500
Jake, 250
If you are using Unix, you can do this with the shell script below.
I am also assuming that you are appending the search output to file_C and that there are no duplicates in either source file.
while true
do
    echo "enter date ..."
    read date
    value_one=$(grep "$date" file_A | cut -d',' -f1)
    tmp1=$(grep "$date" file_A | cut -d',' -f3)
    value_two=$(grep "$tmp1" file_B | cut -d',' -f2)
    echo "${value_one},${value_two}" >> file_C
    echo "want to search more dates ... press y|Y, press any other key to exit"
    read ch
    if [ "$ch" = "y" ] || [ "$ch" = "Y" ]
    then
        continue
    else
        exit
    fi
done
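For completeness, since the rest of this page is pandas-focused, here is a hedged merge-based sketch of the same lookup, assuming the sample csvA/csvB files shown above (skipinitialspace absorbs the spaces after the commas):
import pandas as pd

# Hedged sketch against the sample files above; '2-6-15' is the example date.
a = pd.read_csv('csvA.csv', skipinitialspace=True)
b = pd.read_csv('csvB.csv', skipinitialspace=True)
out = a[a['Date'] == '2-6-15'].merge(b, on='Value')[['Name', 'Score']]
out.to_csv('csvC.csv', index=False)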
